Commonly used object detection methods can be divided into one-stage and two-stage approaches. A one-stage method takes an image as input and directly outputs bounding boxes (bboxes) and classification labels from a single network; YOLO and SSD are the main representatives. The two-stage approach, represented by Faster R-CNN, first generates region proposals from the input image and then feeds them into a classifier; these two tasks are handled by different networks.
Among them, YOLO is an outstanding detection algorithm. Its name abbreviates "You Only Look Once", meaning that the category and position of every object in an image can be identified in a single pass, which strikes an excellent balance between detection speed and accuracy. The series has evolved from the original YOLO v1 to the latest YOLO v5.
In 2015, the first version, YOLO v1, introduced the Darknet network, designed with reference to GoogLeNet. Darknet is an open-source neural network framework written in C and CUDA. A 1×1 convolution layer followed by a 3×3 convolution layer replaces GoogLeNet's Inception module. The network consists of 24 convolutional layers and 2 fully connected layers, as shown in Figure 1.
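A minimal PyTorch sketch of this 1×1-then-3×3 pattern; the channel counts here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """1x1 convolution to shrink channels, followed by a 3x3 convolution.

    This is the pattern YOLO v1's Darknet uses in place of GoogLeNet's
    Inception module; the channel counts are illustrative.
    """
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        self.conv = nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)  # YOLO v1 uses leaky ReLU activations

    def forward(self, x):
        x = self.act(self.reduce(x))
        return self.act(self.conv(x))

block = ReductionBlock(512, 256, 512)
out = block(torch.randn(1, 512, 28, 28))  # -> torch.Size([1, 512, 28, 28])
```

The 1×1 layer cuts the channel count before the expensive 3×3 convolution, which is what keeps the 24-layer network tractable.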
The pipeline of YOLO v1 is shown in Figure 2: first, the image is resized to 448×448; then the image is fed into the CNN; finally, non-maximum suppression (NMS) keeps the final calibrated boxes.
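A compact sketch of the NMS step, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; in practice a library routine such as torchvision.ops.nms does the same job:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2) corner coordinates.
    scores: (N,) confidence scores.
    Returns the indices of the boxes that survive suppression.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the winning box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only boxes that overlap the winner less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```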
The core idea of YOLO v1 is to treat detection as a regression problem. The image is divided into an S×S grid; if an object's center falls into a grid cell, that cell is responsible for detecting it. Each grid cell predicts B bounding boxes (bboxes) together with class probabilities, and each bbox carries 5 values: (x, y, w, h) and a confidence score. Each cell therefore predicts B bboxes and C class probabilities, so the network outputs a tensor of shape S × S × (5B + C).
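For the paper's PASCAL VOC setting (S = 7, B = 2, C = 20), the output shape works out to 7 × 7 × 30:

```python
S, B, C = 7, 2, 20   # grid size, boxes per cell, classes (PASCAL VOC setting)
depth = 5 * B + C    # each box contributes (x, y, w, h) plus a confidence
print(S, S, depth)   # 7 7 30 -> a 7x7x30 output tensor
```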
Advantages: as a single-network regression model, YOLO v1 runs in real time, is trained end to end, and reasons over the whole image, so it produces fewer background false positives than region-based methods.
YOLO v2 made a series of improvements over YOLO v1, raising localization accuracy and recall while maintaining classification accuracy. First, YOLO v2 adapts to different input sizes, allowing detection accuracy to be traded against detection speed as needed. Second, it proposes a hierarchical classification structure, WordTree, to mix detection and classification datasets. Finally, it introduces a joint training method that runs on both detection and classification data: the detection data trains the localization part of the model, while the classification data trains the classification part, expanding the set of detectable categories.
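The adaptation to different input sizes comes from multi-scale training. A sketch of the idea, assuming the common YOLO v2 recipe of re-drawing the resolution from multiples of 32 every 10 batches (the generator name and loop structure here are my own):

```python
import random
import torch
import torch.nn.functional as F

# YOLO v2 is fully convolutional, so it accepts any input size that is a
# multiple of its total stride (32).
SIZES = [320 + 32 * i for i in range(10)]  # 320, 352, ..., 608

def multiscale_batches(batches, every=10):
    """Yield training batches resized to a resolution re-drawn every `every` steps."""
    size = 416
    for step, images in enumerate(batches):
        if step % every == 0:          # re-draw the input resolution
            size = random.choice(SIZES)
        yield F.interpolate(images, size=(size, size),
                            mode='bilinear', align_corners=False)

# e.g. wrap an existing loader: for batch in multiscale_batches(loader): ...
```

Because one set of weights has seen every resolution during training, the same network can be run small for speed or large for accuracy at test time.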
More specific improvements over YOLO v1 include: batch normalization on all convolutional layers; high-resolution classifier pretraining; anchor boxes whose dimensions are chosen by k-means clustering on the training boxes; direct location prediction that constrains box centers to their grid cell (sketched below); a passthrough layer for fine-grained features; multi-scale training; and the lighter Darknet-19 backbone.
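Direct location prediction can be written out explicitly. These are the decoding equations from the YOLO v2 paper, sketched as plain Python:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into a box, YOLO v2 style.

    (cx, cy) is the top-left corner of the responsible grid cell and
    (pw, ph) the prior (anchor) dimensions, all in grid units.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)      # sigmoid keeps the center inside its cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)     # width/height scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

Bounding the center offset with a sigmoid is what stabilizes early training compared with the unconstrained offsets of region-proposal networks.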
However, YOLO v2 still cannot handle multiple overlapping objects whose centers fall into the same grid cell. YOLO v3 continues with further improvements over YOLO v2: it adopts the deeper residual backbone Darknet-53, predicts boxes at three scales in an FPN-like fashion, and replaces the softmax classifier with independent logistic classifiers so that one box can carry multiple labels (illustrated below).
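The last of those changes is easy to illustrate: with independent sigmoids, a box can legitimately score high on several classes at once (e.g. "person" and "woman"), which a softmax would forbid. The threshold below is illustrative:

```python
import torch

logits = torch.tensor([2.1, 1.8, -3.0])  # raw class scores for one box
probs = torch.sigmoid(logits)            # independent per-class probabilities
labels = probs > 0.5                     # several classes may fire at once
# A softmax would instead force the scores to compete and sum to 1.
```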
In April 2020, YOLO v4 was released. On the MS COCO dataset it reaches 43.5% AP at 65 FPS, improving on YOLO v3's AP and FPS by 10% and 12% respectively.
YOLO v4 first surveys related work and decomposes the object detection framework:
Object detection = backbone + neck + head
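A minimal sketch of this decomposition; the `Detector` class is a stand-in of my own, while the comments name YOLO v4's actual choices for each slot:

```python
import torch.nn as nn

class Detector(nn.Module):
    """Generic detector: the backbone extracts features, the neck fuses
    them across scales, and the head predicts boxes and classes."""
    def __init__(self, backbone, neck, head):
        super().__init__()
        self.backbone = backbone  # e.g. CSPDarknet53 in YOLO v4
        self.neck = neck          # e.g. SPP + PANet in YOLO v4
        self.head = head          # e.g. the YOLO v3 dense prediction head

    def forward(self, images):
        features = self.backbone(images)
        features = self.neck(features)
        return self.head(features)
```

Framing detectors this way lets each slot be swapped and benchmarked independently, which is exactly how YOLO v4 searches for its best combination.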
In addition, YOLO v4 divides all tuning methods into a "bag of freebies" and a "bag of specials": freebies are training-time techniques, such as data augmentation and better loss functions, that cost nothing at inference, while specials are modules and post-processing steps that slightly increase inference cost in exchange for accuracy.
YOLO v4 surveys these tuning techniques to find the best combination, and during training the influence of the freebies and specials on YOLO v4 was verified experimentally.
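Label smoothing is one representative freebie: it changes only the training targets, so inference cost is untouched. A minimal sketch:

```python
import torch

def smooth_labels(targets, num_classes, eps=0.1):
    """Soften one-hot class targets: the true class gets 1 - eps + eps/K
    and the remaining probability mass is spread over all K classes."""
    one_hot = torch.nn.functional.one_hot(targets, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

soft = smooth_labels(torch.tensor([2]), num_classes=5)
# tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
```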
Just over 40 days after the release of YOLO v4, Ultralytics open-sourced the unofficial YOLO v5, implemented entirely in PyTorch. Notably, its inference speed reaches 140 FPS, and the YOLOv5 weight file is only about 1/9 the size of YOLOv4's. YOLO v5 is faster and smaller!
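The model can be loaded through torch.hub, following the usage documented in the Ultralytics repository (network access is needed on first run, and the exact API may shift between releases):

```python
import torch

# Fetch the small YOLOv5 variant from the Ultralytics repository
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Inference accepts file paths, URLs, PIL images, or numpy arrays
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()  # summary of detected classes and confidences
```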
From this development history it can be seen that, in its later stages, the YOLO series has focused more on practical deployment and engineering than on proposing fundamentally novel ideas.