Detailed interpretation of TPH-YOLOv5

1 Introduction

Object detection in UAV-captured scenes has recently become a popular task. Since UAVs fly at different altitudes, target scale varies greatly, which places a heavy burden on model optimization. In addition, when a drone flies at high speed and low altitude, motion blur also appears on densely packed targets.

Figure 1 Small-target and dense-object problems

To solve these two problems, this article proposes TPH-YOLOv5. TPH-YOLOv5 adds an extra prediction head to YOLOv5 to detect targets at different scales. The original prediction heads are then replaced with Transformer Prediction Heads (TPH) to exploit the prediction potential of self-attention. The authors also integrate the Convolutional Block Attention Module (CBAM) to find attention regions in dense scenes.

To further improve TPH-YOLOv5, the authors also provide a number of useful strategies, such as data augmentation, multi-scale testing, multi-model ensembling and the use of an additional classifier.

Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well and is interpretable on drone-captured scenes. On the DET-test-challenge dataset, TPH-YOLOv5 achieves an AP of 39.18, which is 1.81 higher than the previous SOTA method (DPNetV3). In the VisDrone Challenge 2021, TPH-YOLOv5 improves AP by about 7 points compared to YOLOv5.

The contributions of this article are as follows: (1) an extra prediction head for detecting objects at very different scales; (2) Transformer Prediction Heads (TPH) that replace the original prediction heads; (3) integration of CBAM to locate attention regions in dense scenes; (4) a set of useful strategies such as data augmentation, multi-scale testing, multi-model ensembling and a self-trained classifier, validated on the VisDrone2021 dataset.

2 Summary of previous work

2.1 Data Augmentation

The main purpose of data augmentation is to expand the dataset so that the model becomes more robust to images obtained in different environments.

Photometric and geometric distortions are widely used by researchers. Photometric distortion mainly adjusts the hue, saturation and value of an image, while geometric distortion adds random scaling, cropping, translation, shearing and rotation.
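
As an illustration of these two families of augmentation, the sketch below wires up HSV-style photometric jitter and random geometric distortion with torchvision. The parameter ranges are assumptions chosen for demonstration, not the paper's settings, and for detection the geometric transforms would also have to be applied to the bounding boxes.

```python
# A minimal sketch of photometric and geometric augmentation with torchvision.
# Parameter ranges are illustrative assumptions.
import torchvision.transforms as T

augment = T.Compose([
    # Photometric distortion: jitter brightness (value), saturation and hue.
    T.ColorJitter(brightness=0.4, saturation=0.7, hue=0.015),
    # Geometric distortion: random rotation, translation, scaling and shearing.
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.5, 1.5), shear=5),
    # Random cropping to a fixed size, padding when the image is too small.
    T.RandomCrop(640, pad_if_needed=True),
    T.ToTensor(),
])
```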

In addition to these global pixel-level augmentations, there are also some more specialized augmentation methods. Some researchers have proposed combining multiple images for augmentation, such as MixUp, CutMix and Mosaic.

MixUp randomly selects two samples from the training images and takes a random weighted sum of them; the labels of the samples are combined with the same weights. Unlike occlusion-style augmentation, which typically uses a zero-pixel mask to occlude part of an image, CutMix covers the occluded area with a region from another image. Mosaic is an improved version of CutMix: stitching four images greatly enriches the background of the detected objects, and batch normalization then computes activation statistics over four different images at each layer.
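
A minimal sketch of the MixUp idea described above, assuming classification-style label tensors; the Beta-distribution parameter is an illustrative assumption, and detection-style MixUp would additionally merge the two box lists.

```python
# MixUp: a random weighted sum of two samples and of their labels.
import torch

def mixup(img1, img2, label1, label2, alpha=8.0):
    """Return the mixed image and the correspondingly mixed label."""
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing weight in (0, 1)
    mixed_img = lam * img1 + (1.0 - lam) * img2
    mixed_label = lam * label1 + (1.0 - lam) * label2
    return mixed_img, mixed_label
```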

TPH-YOLOv5 mainly combines MixUp, Mosaic and the traditional augmentation methods.

2.2 Multi-Model Ensemble Method

Deep learning models are nonlinear methods. They provide great flexibility and can scale with the amount of training data.

One drawback of this flexibility is that the models learn via a stochastic training algorithm, which makes them sensitive to the details of the training data: each training run may end up with a different set of weights and therefore different predictions. This gives the model high variance.

A successful way to reduce model variance is to train multiple models instead of a single model and combine the predictions of these models.

For different object detection models, there are three common methods for ensembling boxes: non-maximum suppression (NMS), Soft-NMS, and Weighted Boxes Fusion (WBF).

In the NMS method, if the Intersection over Union (IoU) of two boxes exceeds a certain threshold, they are considered to belong to the same object. For each object, NMS keeps only the box with the highest confidence and deletes the others. The filtering process therefore depends on the choice of this single IoU threshold, which has a large impact on model performance.

Soft-NMS is a slight modification of NMS that yields a significant improvement over traditional NMS on standard benchmarks such as PASCAL VOC and MS COCO. Instead of setting the confidence of adjacent bounding boxes to zero and removing them, it decays their confidence scores with a function of the IoU.
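
The difference between the two score updates can be summarized in a few lines; the Gaussian decay below follows the common Soft-NMS variant, and the threshold and sigma values are illustrative assumptions.

```python
# Score update for one neighbouring box; `iou` is its overlap with the
# currently selected (highest-confidence) box.
import math

def nms_score(score, iou, iou_thresh=0.5):
    # Hard NMS: drop the box entirely once the IoU threshold is exceeded.
    return 0.0 if iou > iou_thresh else score

def soft_nms_score(score, iou, sigma=0.5):
    # Soft-NMS (Gaussian variant): decay the confidence instead of removing it.
    return score * math.exp(-(iou ** 2) / sigma)
```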

WBF works differently. While NMS and Soft-NMS discard some boxes, WBF merges all boxes to form the final result, so it can make use of information even from inaccurate predictions. This article uses WBF to ensemble the final models, and its performance is significantly better than NMS.
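
As a rough sketch of the fusion step, the snippet below merges one cluster of boxes that have already been matched to the same object by confidence-weighted averaging; the clustering loop and the confidence rescaling of the full WBF algorithm are omitted.

```python
# Confidence-weighted fusion of one cluster of boxes (simplified WBF core).
import numpy as np

def fuse_cluster(boxes, scores):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    boxes = np.asarray(boxes, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    # Coordinates are averaged with the confidences as weights.
    fused_box = (boxes * scores[:, None]).sum(axis=0) / scores.sum()
    # The fused confidence is the mean of the cluster's confidences.
    fused_score = scores.mean()
    return fused_box, fused_score
```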

2.3 Object Detection

CNN-based object detectors can be divided into several types, most notably one-stage and two-stage detectors.

Some detectors are specially designed for images captured by drones, such as RRNet, PENet and CenterNet. From a component perspective, however, they usually consist of two parts: a CNN-based backbone for image feature extraction and a detection head that predicts the class and bounding box of each target.

In addition, object detectors developed in recent years often insert layers between the backbone and the head; this part is usually called the neck of the detector. These three structures are introduced in detail below:

Backbone

Commonly used backbones include VGG, ResNet, DenseNet, MobileNet, EfficientNet, CSPDarknet53, Swin Transformer, etc., rather than networks designed from scratch, because these networks have proven their strong feature-extraction ability on classification and other problems. Researchers will, however, fine-tune the backbone to make it better suited to a specific vertical task.

Neck

The neck is designed to make better use of the features extracted by the backbone, reprocessing the feature maps produced at different stages. Usually a neck consists of several bottom-up paths and several top-down paths, and it is a key link in the object detection framework. The earliest necks used only up- and down-sampling blocks, with no feature-aggregation operation; in SSD, for example, the multi-level feature maps are fed directly to the head.

Commonly used neck aggregation blocks include FPN, PANet, NAS-FPN, BiFPN, ASFF and SAM. The essence of these methods is to iteratively combine up-sampling, concatenation and element-wise sums or products into an aggregation strategy. Necks may also contain additional blocks such as SPP, ASPP, RFB and CBAM.
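
To make the aggregation idea concrete, here is a minimal sketch of an FPN-style top-down step (a 1×1 lateral convolution plus upsample-and-add); the channel sizes are assumptions, and real necks such as PANet repeat this pattern in both directions.

```python
# One FPN-style top-down fusion step: lateral 1x1 conv + upsample-and-add.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, c_low=256, c_high=512, c_out=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_low, c_out, kernel_size=1)  # lateral connection
        self.reduce = nn.Conv2d(c_high, c_out, kernel_size=1)  # align channel count

    def forward(self, low_feat, high_feat):
        # Upsample the coarser, higher-level map to the finer map's resolution
        # and add it to the lateral projection of the finer map.
        high = F.interpolate(self.reduce(high_feat),
                             size=low_feat.shape[-2:], mode="nearest")
        return self.lateral(low_feat) + high
```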

Head

As a classification network, the backbone cannot complete the localization task by itself. The head is responsible for predicting the location and category of each target from the feature maps extracted by the backbone.

Head is generally divided into two types: One-Stage detector and Two-Stage detector.

Two-stage detectors have long been the dominant method in object detection, the most representative being the R-CNN series. Compared with two-stage detectors, one-stage detectors predict boxes and object categories simultaneously; they have a clear speed advantage but lower accuracy. The most representative one-stage models are the YOLO series, SSD and RetinaNet.

3 TPH-YOLOv5

3.1 Overview of YOLOv5

YOLOv5 has four different configurations: YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. Under normal circumstances, YOLOv5 uses CSPDarknet53 with SPP as the backbone, PANet as the neck, and the YOLO detection head. Since it is the most notable and convenient one-stage detector, the authors chose it as the baseline and further optimize the whole architecture.

Figure 2 TPH-YOLOv5 overall architecture

When training the model on the VisDrone2021 dataset with data augmentation (Mosaic and MixUp), the authors found that YOLOv5x performs far better than YOLOv5s, YOLOv5m and YOLOv5l, with a gap in AP of more than 1.5. Although training YOLOv5x costs more than the other three models, it is still chosen in pursuit of the best detection performance. In addition, the commonly used photometric and geometric augmentation parameters were adjusted according to the characteristics of drone-captured images.

3.2 TPH-YOLOv5

The framework of TPH-YOLOv5 is shown in Figure 3. The original YOLOv5 is modified as follows to target the VisDrone2021 dataset:

Figure 3 TPH-YOLOv5 model structure with the additional prediction head for small objects

The authors analyzed the VisDrone2021 dataset and found that it contains many very small targets, so a prediction head for small-object detection is added. Combined with the other three prediction heads, this four-head structure alleviates the negative impact of drastic changes in target scale. As shown in Figure 3, the added prediction head (Head 1) is generated from a low-level, high-resolution feature map, which makes it more sensitive to small objects. Although the extra detection head increases computation and memory cost, it greatly improves small-object detection performance.

Transformer encoder block

Figure 4 Transformer Block

The Transformer encoder block replaces some convolution blocks and CSP bottleneck blocks in the original YOLOv5 version. Its structure is shown in Figure 4.

Compared with the original bottleneck blocks in CSPDarknet53, the author believes that the Transformer encoder block can capture global information and rich contextual information.

Each Transformer encoder block contains two sub-layers. The first sub-layer is a multi-head attention layer, and the second (MLP) is a fully connected layer; a residual connection is used around each sub-layer. The Transformer encoder block increases the ability to capture different local information and can also exploit the feature-representation potential of the self-attention mechanism. On the VisDrone2021 dataset, the Transformer encoder block performs better on high-density occluded objects.
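
Below is a sketch of such a block following the description above (multi-head self-attention plus an MLP, each with a residual connection); the flattening of the feature map into a token sequence and the layer-norm placement are assumptions rather than the exact TPH-YOLOv5 implementation.

```python
# Transformer encoder block: self-attention sub-layer + MLP sub-layer,
# each wrapped with a residual connection, applied to a CNN feature map.
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        y = self.norm1(tokens)
        tokens = tokens + self.attn(y, y, y)[0]        # residual around attention
        tokens = tokens + self.mlp(self.norm2(tokens)) # residual around MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```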

Based on YOLOv5, the authors apply the Transformer encoder block only in the head part, forming the Transformer Prediction Heads (TPH), and at the end of the backbone, because the feature maps at the end of the network have low resolution. Applying TPH to low-resolution feature maps reduces computation and memory cost. Furthermore, when the resolution of the input image is scaled up, some TPH blocks in early layers can optionally be removed to keep training feasible.

Convolutional block attention module (CBAM)

CBAM is a simple but effective attention module. It is lightweight, can be plugged into any CNN architecture, and can be trained end to end. Given a feature map, CBAM sequentially infers attention maps along two independent dimensions, channel and spatial, and then multiplies the attention maps with the input feature map for adaptive feature refinement.

Figure 5 CBAM attention mechanism

The structure of the CBAM module is shown in Figure 5. Experiments on different classification and detection datasets show that integrating CBAM into different models greatly improves their performance, proving the effectiveness of the module.
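
A compact sketch of this channel-then-spatial refinement is given below; the reduction ratio and 7×7 spatial kernel follow the common CBAM defaults, and the code is an illustration rather than the exact module used in TPH-YOLOv5.

```python
# CBAM: channel attention, then spatial attention, each multiplied into x.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # refine along the channel dimension
        x = x * self.sa(x)   # then along the spatial dimension
        return x
```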

In images captured by drones, large coverage areas always contain confusing geographical elements. Using CBAM to extract attention regions helps TPH-YOLOv5 resist this confusing information and focus on useful target objects.

Self-trained classifier

After training TPH-YOLOv5 on the VisDrone2021 dataset and testing on the test-dev set, the authors analyzed visualized failure cases and concluded that TPH-YOLOv5 has good localization ability but poor classification ability. They further examined the confusion matrix, shown in Figure 6, and observed that some hard categories, such as tricycle and awning-tricycle, have very low accuracy.

Figure 6 Detection confusion matrix

Therefore, the authors propose a self-trained classifier. First, a training set is constructed by cropping the ground-truth bounding boxes and resizing each image patch to 64 × 64. ResNet18 is then selected as the classifier network. Experimental results show that, with the help of this self-trained classifier, the proposed method improves the AP value by about 0.8 to 1.0.
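
The sketch below illustrates how such a classifier training set might be assembled: crop each ground-truth box, resize the patch to 64 × 64, and train a torchvision ResNet18 on the patches. The helper names, dataset layout and the class count are assumptions, not the paper's code.

```python
# Build 64x64 patches from ground-truth boxes and set up a ResNet18 classifier.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms.functional as TF

def crop_patches(image, boxes, labels, size=64):
    """image: (C, H, W) tensor; boxes: list of [x1, y1, x2, y2]; labels: ints."""
    patches, targets = [], []
    for (x1, y1, x2, y2), lbl in zip(boxes, labels):
        # Crop the ground-truth box and resize the patch to size x size.
        patch = TF.resized_crop(image, top=int(y1), left=int(x1),
                                height=int(y2 - y1), width=int(x2 - x1),
                                size=[size, size])
        patches.append(patch)
        targets.append(lbl)
    return torch.stack(patches), torch.tensor(targets)

# ResNet18 whose output layer matches the object categories (10 is an
# assumption corresponding to the VisDrone object classes).
classifier = torchvision.models.resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
```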

4 Experiments and Conclusions

Finally, TPH-YOLOv5 achieved a good score of 39.18 AP on the test-set-challenge, clearly higher than the best score of 37.37 in VisDrone2020.

Figure 9 Test result chart