FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking
Abstract—There has been remarkable progress on object detection and re-identification (re-ID) in recent years, which are the key components of multi-object tracking. However, little attention has been paid to accomplishing the two tasks jointly in a single network. Our study shows that previous attempts ended up with degraded accuracy mainly because the re-ID task is not fairly learned, which causes many identity switches. The unfairness is two-fold: (1) they treat re-ID as a secondary task whose accuracy heavily depends on the primary detection task, so training is largely biased toward the detection task and ignores the re-ID task; (2) they use ROI-Align to extract re-ID features, which is directly borrowed from object detection. However, this introduces a lot of ambiguity in characterizing objects because many sampling points may belong to disturbing instances or the background. To solve these problems, we present a simple approach, FairMOT, which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks allows FairMOT to obtain high levels of detection and tracking accuracy and to outperform previous state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are released at https://github.com/ifzhang/FairMOT.
Multi-Object Tracking (MOT) has been a longstanding goal in computer vision [1], [2], [3], [4] which aims to estimate trajectories for objects of interest in videos. The successful resolution of the problem can benefit many applications such as video analysis, action recognition, smart elderly care, and human computer interaction.
The existing methods such as [1], [2], [3], [4], [5], [6], [7] often address the problem by two separate models: the detection model firstly localizes the objects of interest by bounding boxes in each frame, then the association model extracts re-identification (re-ID) features for each bounding box and links it to one of the existing tracks according to certain metrics defined on features. There has been remarkable progress on object detection [8], [9], [10], [11] and re-ID [3], [12] respectively in recent years which in turn significantly boosts the overall tracking performance. However, those methods cannot perform real-time inference especially when there are a large number of objects because the two models do not share features and they need to apply the re-ID models for every bounding box in the video.
With the maturity of multi-task learning [13], one-shot trackers, which estimate objects and learn re-ID features using a single network, have attracted more attention [14], [15]. For example, Voigtlaender et al. [15] propose to add a re-ID branch on top of Mask R-CNN to obtain proposals’ re-ID features using ROI-Align. It reduces inference time by re-using the backbone features for the re-ID network. Unfortunately, the tracking accuracy drops remarkably compared to the two-step methods. In particular, the number of ID switches increases by a large margin. The result suggests that combining the two tasks is a non-trivial problem and should be treated carefully. In this paper, we aim to deeply understand the reasons behind the failure, and present a simple yet effective approach. In particular, three factors are identified.
The existing one-shot trackers such as Track R-CNN [15] and JDE [14] are mostly anchor-based since they are directly modified from anchor-based object detectors such as YOLO [11] and Mask R-CNN [9]. However, we find in this study that the anchor-based framework is not suitable for learning re-ID features, which results in a large number of ID switches despite the good detection results.
Overlooked re-ID task: Track R-CNN [15] operates in a cascaded style: it first estimates object proposals (boxes) and then pools features from the proposals to estimate the corresponding re-ID features. It is worth noting that the quality of the re-ID features heavily depends on the quality of the proposals. As a result, in the training stage, the model is seriously biased toward estimating accurate object proposals rather than high-quality re-ID features. In summary, this de facto standard “detection first, re-ID secondary” framework prevents the re-ID network from being fairly learned.
One anchor corresponds to multiple identities: The anchor-based methods usually use ROI-Pool or ROI-Align to extract features from each proposal. Most sampling locations in ROI-Align may belong to other disturbing instances or background as shown in Figure 1. As a result, the extracted features are not optimal in terms of accurately and discriminatively representing the target objects. Instead, we find in this work that it is significantly better to only extract features at the estimated object centers.
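The center-based feature extraction described above can be sketched as follows. This is a minimal illustration (not the released implementation): it assumes a dense re-ID feature map with one embedding vector per pixel and simply gathers the vector at each estimated object center, so no sampling point can fall on a different instance or the background.

```python
import numpy as np

def extract_center_features(feature_map, centers):
    """Gather one re-ID embedding per object at its estimated center.

    feature_map: (C, H, W) array with a C-dim embedding at every pixel.
    centers: list of (x, y) integer coordinates on the feature map.
    Returns an (N, C) array, one embedding per object.
    """
    return np.stack([feature_map[:, y, x] for x, y in centers])

# toy example: 64-dim features on an 8x8 map, two detected objects
fmap = np.arange(64 * 8 * 8, dtype=np.float32).reshape(64, 8, 8)
emb = extract_center_features(fmap, [(2, 3), (5, 6)])
assert emb.shape == (2, 64)
```

In contrast, ROI-Align averages over many sampling points inside the proposal, mixing in features from overlapping instances and background.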
Multiple anchors correspond to one identity: In both [15] and [14], multiple adjacent anchors, which correspond to different image patches, may be forced to estimate the same identity as long as their IoU is sufficiently large. This introduces severe ambiguity for training. See Figure 1 for illustration. On the other hand, when an image undergoes small perturbation, e.g., due to data augmentation, it is possible that the same anchor is forced to estimate different identities. In addition, feature maps in object detection are usually downsampled by 8/16/32 times to balance accuracy and speed. This is acceptable for object detection but it is too coarse for learning re-ID features because features extracted at coarse anchors may not be aligned with object centers.
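The misalignment caused by coarse downsampling can be made concrete with a small round-trip computation. The sketch below (illustrative numbers, not from the paper) measures how far an image-space center drifts when it is quantized to a feature-map cell and mapped back: up to stride − 1 pixels at stride 32, but at most 3 pixels at stride 4.

```python
def quantization_error(x_img, stride):
    """Round-trip error when an image coordinate is snapped to a
    feature-map cell: |x - stride * floor(x / stride)|."""
    return abs(x_img - stride * (x_img // stride))

# an object center at image column 115
assert quantization_error(115, 32) == 19  # coarse map: 19-pixel drift
assert quantization_error(115, 4) == 3    # stride-4 map: 3 pixels at most
```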
Fig. 1. (a) Track R-CNN treats detection as the primary task and re-ID as the secondary one. Both Track R-CNN and JDE are anchor-based. The red boxes represent positive anchors and the green boxes represent the target objects. The three methods extract re-ID features differently. Track R-CNN extracts re-ID features for all positive anchors using ROI-Align. JDE extracts re-ID features at the centers of all positive anchors. FairMOT extracts re-ID features at the object center. (b) The red anchor contains two different instances, so it will be forced to predict two conflicting classes. (c) Three different anchors with different image patches are responsible for predicting the same identity. (d) FairMOT extracts re-ID features only at the object center and can mitigate the problems in (b) and (c).
For one-shot trackers, most features are shared between the object detection and re-ID tasks. But it is well known that they actually require features from different layers to achieve the best results. In particular, object detection requires deep and abstract features to estimate object classes and positions but re-ID focuses more on low-level appearance features to distinguish different instances of the same class.
We empirically find that multi-layer feature aggregation is effective at addressing this contradiction by allowing the two tasks (network branches) to extract whatever features they need from the multi-layer aggregated features. Without multi-layer fusion, the model will be biased toward the primary detection branch and generate low-quality re-ID features. In addition, multi-layer fusion, which fuses features from layers with different receptive fields, also improves the capability to handle object scale variation, which is very common in practice.
The previous re-ID works usually learn very high dimensional features and have achieved promising results on the benchmarks of their field. However, we find that learning lower-dimensional features is actually better for one-shot MOT for three reasons: (1) although learning high dimensional re-ID features may slightly improve their capability to differentiate objects, it notably harms the object detection accuracy due to the competition between the two tasks, which in turn also has a negative impact on the final tracking accuracy. So, considering that the feature dimension in object detection is usually very low (class numbers + box locations), we propose to learn low-dimensional re-ID features to balance the two tasks; (2) when the training data is small, learning low dimensional re-ID features reduces the risk of over-fitting. The datasets in MOT are usually much smaller than those in the re-ID area, so it is favorable to decrease the feature dimension; (3) learning low dimensional re-ID features improves the inference speed, as will be shown in our experiments.
In this work, we present a simple approach termed FairMOT to jointly address the three fairness issues. It essentially differs from the previous “detection first, re-ID secondary” framework because the detection and re-ID tasks are treated equally in FairMOT. Our contributions are three-fold. First, we empirically demonstrate and discuss the challenges faced by the previous one-shot tracking frameworks, which have been overlooked but severely limit their performance. Second, on top of anchor-free object detection methods such as [10], we introduce a framework to fairly balance the detection and re-ID tasks, which significantly outperforms the previous methods without bells and whistles. Finally, we also present a self-supervised learning approach to train FairMOT on large-scale detection datasets, which improves its generalization capability. This has significant empirical value.
Figure 2 shows an overview of FairMOT. It adopts a very simple network structure which consists of two homogeneous branches for detecting objects and extracting re-ID features, respectively. Inspired by [10], [16], [17], [18], the detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. So FairMOT eliminates the unfair advantage of the detection branch as reflected in Table 3, effectively learns high-quality re-ID features and obtains a good trade-off between detection and re-ID for better MOT results.
It is also worth noting that FairMOT operates on high-resolution feature maps of stride four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers, which significantly improves the tracking accuracy. The dimension of re-ID features is set to be only 64, which not only reduces computation time but also improves tracking robustness by striking a good balance between the detection and re-ID tasks. We equip the backbone network [19] with the Deep Layer Aggregation operator [20] to fuse features from multiple layers in order to accommodate both branches and handle objects of different scales.
We evaluate FairMOT on the MOT Challenge benchmark via the evaluation server. It ranks first among all trackers on the 2DMOT15 [21], MOT16 [22], MOT17 [22] and MOT20 [23] datasets. When we further pre-train our model using our proposed self-supervised learning method, it achieves additional gains on all datasets. In spite of the strong results, the approach is very simple and runs at 30 FPS on a single RTX 2080Ti GPU. It sheds light on the relationship between detection and re-ID in MOT and provides guidance for designing one-shot video tracking networks.
Fig. 2. Overview of our one-shot tracker FairMOT. The input image is first fed to an encoder-decoder network to extract high resolution feature maps (stride=4). Then we add two homogeneous branches for detecting objects and extracting re-ID features, respectively. The features at the predicted object centers are used for tracking.
In this section, we present the technical details of FairMOT including the backbone network, the object detection branch, the re-ID branch as well as training details.
We adopt ResNet-34 as the backbone in order to strike a good balance between accuracy and speed. An enhanced version of Deep Layer Aggregation (DLA) [10] is applied to the backbone to fuse multi-layer features, as shown in Figure 2. Different from the original DLA [20], it has more skip connections between low-level and high-level features, which is similar to the Feature Pyramid Network (FPN) [45]. In addition, convolution layers in all up-sampling modules are replaced by deformable convolution such that they can dynamically adjust the receptive field according to object scales and poses. These modifications are also helpful for alleviating the alignment issue. The resulting model is named DLA-34. Denote the size of the input image as Himage × Wimage; then the output feature map has the shape C × H × W, where H = Himage/4 and W = Wimage/4. Besides DLA, other deep networks that provide multi-scale convolutional features, such as Higher HRNet [46], can be used in our framework to provide fair features for both detection and re-ID.
Our detection branch is built on top of CenterNet [10] but other anchor-free methods such as [16], [18], [47], [48] can also be used. We briefly describe the approach to make this work self-contained. In particular, three parallel heads are appended to DLA-34 to estimate heatmaps, object center offsets and bounding box sizes, respectively. Each head is implemented by applying a 3 × 3 convolution (with 256 channels) to the output features of DLA-34, followed by a 1×1 convolutional layer which generates the final targets.
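The output tensor shapes of the three heads can be sketched as below. This is a shape-only illustration under common assumptions (one pedestrian class, so a single heatmap channel; 2-channel offset and size heads as in CenterNet; the 1088 × 608 input size is a typical MOT resolution used here only as an example).

```python
def head_output_shapes(h_img, w_img, stride=4):
    """Per-head output shapes on an (h_img, w_img) input; each head is
    a 3x3 conv (256 channels) followed by a 1x1 conv on the stride-4
    DLA-34 feature map."""
    h, w = h_img // stride, w_img // stride
    return {
        "heatmap": (1, h, w),        # one channel: pedestrian centers
        "center_offset": (2, h, w),  # sub-stride x/y residuals
        "box_size": (2, h, w),       # width and height per center
    }

shapes = head_output_shapes(608, 1088)
assert shapes["heatmap"] == (1, 152, 272)
```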
This head is responsible for estimating the locations of the object centers. The heatmap based representation, which is the de facto standard for the landmark point estimation task, is adopted here. In particular, the dimension of the heatmap is $1 \times H \times W$. The response at a location in the heatmap is expected to be one if it coincides with the ground-truth object center. The response decays exponentially with the distance between the heatmap location and the object center. For each GT box $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ in the image, we compute the object center $(c_x^i, c_y^i)$ as $c_x^i = \frac{x_1^i + x_2^i}{2}$ and $c_y^i = \frac{y_1^i + y_2^i}{2}$, respectively. Then its location on the feature map is obtained by dividing by the stride: $(\tilde{c}_x^i, \tilde{c}_y^i) = (\lfloor \frac{c_x^i}{4} \rfloor, \lfloor \frac{c_y^i}{4} \rfloor)$. The heatmap response at the location $(x, y)$ is then computed as $M_{xy} = \sum_{i=1}^{N} \exp\left(-\frac{(x - \tilde{c}_x^i)^2 + (y - \tilde{c}_y^i)^2}{2\sigma_c^2}\right)$, where $N$ represents the number of objects in the image and $\sigma_c$ represents the standard deviation. The loss function is defined as pixel-wise logistic regression with focal loss [49]:
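The ground-truth heatmap construction above can be sketched directly. This is a minimal illustration: it uses a fixed $\sigma_c$ passed as a parameter and evaluates the summed Gaussian at every feature-map location, exactly as in the formula for $M_{xy}$.

```python
import numpy as np

def render_heatmap(boxes, h, w, stride=4, sigma=2.0):
    """Render the GT center heatmap M on an (h, w) feature map.

    boxes: list of (x1, y1, x2, y2) GT boxes in image coordinates.
    M_xy = sum_i exp(-((x - cx_i)^2 + (y - cy_i)^2) / (2 * sigma^2)),
    where (cx_i, cy_i) is the box center quantized by the stride.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx = int(((x1 + x2) / 2) // stride)  # center column on the map
        cy = int(((y1 + y2) / 2) // stride)  # center row on the map
        m += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return m

# one GT box with center (60, 80) in image space -> cell (15, 20) at stride 4
hm = render_heatmap([(40, 40, 80, 120)], h=64, w=64)
assert abs(hm[20, 15] - 1.0) < 1e-6  # response is 1 at the exact center
```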