FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking
Abstract—There has been remarkable progress on object detection and re-identification (re-ID) in recent years, which are the key components of multi-object tracking. However, little attention has been paid to accomplishing the two tasks jointly in a single network. Our study shows that previous attempts ended up with degraded accuracy mainly because the re-ID task is not fairly learned, which causes many identity switches. The unfairness is two-fold: (1) they treat re-ID as a secondary task whose accuracy heavily depends on the primary detection task, so training is largely biased toward the detection task and ignores the re-ID task; (2) they use ROI-Align to extract re-ID features, which is directly borrowed from object detection. However, this introduces a lot of ambiguity in characterizing objects because many sampling points may belong to disturbing instances or the background. To solve these problems, we present a simple approach, FairMOT, which consists of two homogeneous branches to predict pixel-wise objectness scores and re-ID features. The achieved fairness between the tasks allows FairMOT to obtain high levels of detection and tracking accuracy and to outperform previous state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are released at https://github.com/ifzhang/FairMOT.
Multi-Object Tracking (MOT) has been a longstanding goal in computer vision [1], [2], [3], [4] which aims to estimate trajectories for objects of interest in videos. The successful resolution of the problem can benefit many applications such as video analysis, action recognition, smart elderly care, and human computer interaction.
The existing methods such as [1], [2], [3], [4], [5], [6], [7] often address the problem by two separate models: the detection model firstly localizes the objects of interest by bounding boxes in each frame, then the association model extracts re-identification (re-ID) features for each bounding box and links it to one of the existing tracks according to certain metrics defined on features. There has been remarkable progress on object detection [8], [9], [10], [11] and re-ID [3], [12] respectively in recent years which in turn significantly boosts the overall tracking performance. However, those methods cannot perform real-time inference especially when there are a large number of objects because the two models do not share features and they need to apply the re-ID models for every bounding box in the video.
With the maturity of multi-task learning [13], one-shot trackers, which estimate objects and learn re-ID features using a single network, have attracted more attention [14], [15]. For example, Voigtlaender et al. [15] propose to add a re-ID branch on top of Mask R-CNN to obtain proposals’ re-ID features using ROI-Align. It reduces inference time by re-using the backbone features for the re-ID network. Unfortunately, the tracking accuracy drops remarkably compared to the two-step methods. In particular, the number of ID switches increases by a large margin. The result suggests that combining the two tasks is a non-trivial problem and should be treated carefully. In this paper, we aim to deeply understand the reasons behind the failure, and present a simple yet effective approach. In particular, three factors are identified.
The existing one-shot trackers such as Track R-CNN [15] and JDE [14] are mostly anchor-based since they are directly modified from anchor-based object detectors such as YOLO [11] and Mask R-CNN [9]. However, we find in this study that the anchor-based framework is not suitable for learning re-ID features, which results in a large number of ID switches despite the good detection results.
Overlooked re-ID task: Track R-CNN [15] operates in a cascaded style: it first estimates object proposals (boxes) and then pools features from the proposals to estimate the corresponding re-ID features. It is worth noting that the quality of the re-ID features heavily depends on the quality of the proposals. As a result, in the training stage, the model is seriously biased toward estimating accurate object proposals rather than high-quality re-ID features. In summary, this de facto standard “detection first, re-ID secondary” framework prevents the re-ID network from being fairly learned.
One anchor corresponds to multiple identities: The anchor-based methods usually use ROI-Pool or ROI-Align to extract features from each proposal. Most sampling locations in ROI-Align may belong to other disturbing instances or background as shown in Figure 1. As a result, the extracted features are not optimal in terms of accurately and discriminatively representing the target objects. Instead, we find in this work that it is significantly better to only extract features at the estimated object centers.
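The center-based feature extraction described above can be sketched as follows. This is a minimal illustration (not the released implementation): it assumes a dense re-ID feature map with one embedding vector per pixel and simply gathers the vector at each estimated object center, so no sampling point can fall on a different instance or the background.

```python
import numpy as np

def extract_center_features(feature_map, centers):
    """Gather one re-ID embedding per object at its estimated center.

    feature_map: (C, H, W) array with a C-dim embedding at every pixel.
    centers: list of (x, y) integer coordinates on the feature map.
    Returns an (N, C) array, one embedding per object.
    """
    return np.stack([feature_map[:, y, x] for x, y in centers])

# toy example: 64-dim features on an 8x8 map, two detected objects
fmap = np.arange(64 * 8 * 8, dtype=np.float32).reshape(64, 8, 8)
emb = extract_center_features(fmap, [(2, 3), (5, 6)])
assert emb.shape == (2, 64)
```

In contrast, ROI-Align averages over many sampling points inside the proposal, mixing in features from overlapping instances and background.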
Multiple anchors correspond to one identity: In both [15] and [14], multiple adjacent anchors, which correspond to different image patches, may be forced to estimate the same identity as long as their IoU is sufficiently large. This introduces severe ambiguity for training. See Figure 1 for illustration. On the other hand, when an image undergoes small perturbation, e.g., due to data augmentation, it is possible that the same anchor is forced to estimate different identities. In addition, feature maps in object detection are usually downsampled by 8/16/32 times to balance accuracy and speed. This is acceptable for object detection but it is too coarse for learning re-ID features because features extracted at coarse anchors may not be aligned with object centers.
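The misalignment caused by coarse downsampling can be made concrete with a small round-trip computation. The sketch below (illustrative numbers, not from the paper) measures how far an image-space center drifts when it is quantized to a feature-map cell and mapped back: up to stride − 1 pixels at stride 32, but at most 3 pixels at stride 4.

```python
def quantization_error(x_img, stride):
    """Round-trip error when an image coordinate is snapped to a
    feature-map cell: |x - stride * floor(x / stride)|."""
    return abs(x_img - stride * (x_img // stride))

# an object center at image column 115
assert quantization_error(115, 32) == 19  # coarse map: 19-pixel drift
assert quantization_error(115, 4) == 3    # stride-4 map: 3 pixels at most
```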
Fig. 1. (a) Track R-CNN treats detection as the primary task and re-ID as the secondary one. Both Track R-CNN and JDE are anchor-based. The red boxes represent positive anchors and the green boxes represent the target objects. The three methods extract re-ID features differently. Track R-CNN extracts re-ID features for all positive anchors using ROI-Align. JDE extracts re-ID features at the centers of all positive anchors. FairMOT extracts re-ID features at the object center. (b) The red anchor contains two different instances, so it will be forced to predict two conflicting classes. (c) Three different anchors with different image patches are responsible for predicting the same identity. (d) FairMOT extracts re-ID features only at the object center and can mitigate the problems in (b) and (c).
For one-shot trackers, most features are shared between the object detection and re-ID tasks. But it is well known that they actually require features from different layers to achieve the best results. In particular, object detection requires deep and abstract features to estimate object classes and positions but re-ID focuses more on low-level appearance features to distinguish different instances of the same class.
We empirically find that multi-layer feature aggregation is effective at addressing this contradiction by allowing the two tasks (network branches) to extract whatever features they need from the multi-layer aggregated features. Without multi-layer fusion, the model will be biased toward the primary detection branch and generate low-quality re-ID features. In addition, multi-layer fusion, which fuses features from layers with different receptive fields, also improves the capability to handle object scale variation, which is very common in practice.
The previous re-ID works usually learn very high dimensional features and have achieved promising results on the benchmarks of their field. However, we find that learning lower-dimensional features is actually better for one-shot MOT for three reasons: (1) although learning high dimensional re-ID features may slightly improve their capability to differentiate objects, it notably harms the object detection accuracy due to the competition between the two tasks, which in turn also has a negative impact on the final tracking accuracy. So, considering that the feature dimension in object detection is usually very low (class numbers + box locations), we propose to learn low-dimensional re-ID features to balance the two tasks; (2) when the training data is small, learning low dimensional re-ID features reduces the risk of over-fitting. The datasets in MOT are usually much smaller than those in the re-ID area, so it is favorable to decrease the feature dimension; (3) learning low dimensional re-ID features improves the inference speed, as will be shown in our experiments.
In this work, we present a simple approach termed FairMOT to jointly address the three fairness issues. It essentially differs from the previous “detection first, re-ID secondary” framework because the detection and re-ID tasks are treated equally in FairMOT. Our contributions are three-fold. First, we empirically demonstrate and discuss the challenges faced by the previous one-shot tracking frameworks, which have been overlooked but severely limit their performance. Second, on top of anchor-free object detection methods such as [10], we introduce a framework to fairly balance the detection and re-ID tasks, which significantly outperforms the previous methods without bells and whistles. Finally, we also present a self-supervised learning approach to train FairMOT on large-scale detection datasets, which improves its generalization capability. This has significant empirical value.
Figure 2 shows an overview of FairMOT. It adopts a very simple network structure which consists of two homogeneous branches for detecting objects and extracting re-ID features, respectively. Inspired by [10], [16], [17], [18], the detection branch is implemented in an anchor-free style which estimates object centers and sizes represented as position-aware measurement maps. Similarly, the re-ID branch estimates a re-ID feature for each pixel to characterize the object centered at the pixel. Note that the two branches are completely homogeneous which essentially differs from the previous methods which perform detection and re-ID in a cascaded style. So FairMOT eliminates the unfair advantage of the detection branch as reflected in Table 3, effectively learns high-quality re-ID features and obtains a good trade-off between detection and re-ID for better MOT results.
It is also worth noting that FairMOT operates on high-resolution feature maps of stride four while the previous anchor-based methods operate on feature maps of stride 32. The elimination of anchors as well as the use of high-resolution feature maps better aligns re-ID features to object centers, which significantly improves the tracking accuracy. The dimension of re-ID features is set to be only 64, which not only reduces computation time but also improves tracking robustness by striking a good balance between the detection and re-ID tasks. We equip the backbone network [19] with the Deep Layer Aggregation operator [20] to fuse features from multiple layers in order to accommodate both branches and handle objects of different scales.
We evaluate FairMOT on the MOT Challenge benchmark via the evaluation server. It ranks first among all trackers on the 2DMOT15 [21], MOT16 [22], MOT17 [22] and MOT20 [23] datasets. When we further pre-train our model using our proposed self-supervised learning method, it achieves additional gains on all datasets. In spite of the strong results, the approach is very simple and runs at 30 FPS on a single RTX 2080Ti GPU. It sheds light on the relationship between detection and re-ID in MOT and provides guidance for designing one-shot video tracking networks.
Fig. 2. Overview of our one-shot tracker FairMOT. The input image is first fed to an encoder-decoder network to extract high resolution feature maps (stride=4). Then we add two homogeneous branches for detecting objects and extracting re-ID features, respectively. The features at the predicted object centers are used for tracking.
In this section, we present the technical details of FairMOT including the backbone network, the object detection branch, the re-ID branch as well as training details.
We adopt ResNet-34 as the backbone in order to strike a good balance between accuracy and speed. An enhanced version of Deep Layer Aggregation (DLA) [10] is applied to the backbone to fuse multi-layer features, as shown in Figure 2. Different from the original DLA [20], it has more skip connections between low-level and high-level features, which is similar to the Feature Pyramid Network (FPN) [45]. In addition, convolution layers in all up-sampling modules are replaced by deformable convolution such that they can dynamically adjust the receptive field according to object scales and poses. These modifications are also helpful for alleviating the alignment issue. The resulting model is named DLA-34. Denote the size of the input image as Himage × Wimage; then the output feature map has the shape C × H × W, where H = Himage/4 and W = Wimage/4. Besides DLA, other deep networks that provide multi-scale convolutional features, such as Higher HRNet [46], can be used in our framework to provide fair features for both detection and re-ID.
Our detection branch is built on top of CenterNet [10] but other anchor-free methods such as [16], [18], [47], [48] can also be used. We briefly describe the approach to make this work self-contained. In particular, three parallel heads are appended to DLA-34 to estimate heatmaps, object center offsets and bounding box sizes, respectively. Each head is implemented by applying a 3 × 3 convolution (with 256 channels) to the output features of DLA-34, followed by a 1×1 convolutional layer which generates the final targets.
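The output tensor shapes of the three heads can be sketched as below. This is a shape-only illustration under common assumptions (one pedestrian class, so a single heatmap channel; 2-channel offset and size heads as in CenterNet; the 1088 × 608 input size is a typical MOT resolution used here only as an example).

```python
def head_output_shapes(h_img, w_img, stride=4):
    """Per-head output shapes on an (h_img, w_img) input; each head is
    a 3x3 conv (256 channels) followed by a 1x1 conv on the stride-4
    DLA-34 feature map."""
    h, w = h_img // stride, w_img // stride
    return {
        "heatmap": (1, h, w),        # one channel: pedestrian centers
        "center_offset": (2, h, w),  # sub-stride x/y residuals
        "box_size": (2, h, w),       # width and height per center
    }

shapes = head_output_shapes(608, 1088)
assert shapes["heatmap"] == (1, 152, 272)
```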
This head is responsible for estimating the locations of the object centers. The heatmap based representation, which is the de facto standard for the landmark point estimation task, is adopted here. In particular, the dimension of the heatmap is $1 \times H \times W$. The response at a location in the heatmap is expected to be one if it coincides with the ground-truth object center. The response decays exponentially with the distance between the heatmap location and the object center. For each GT box $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ in the image, we compute the object center $(c_x^i, c_y^i)$ as $c_x^i = \frac{x_1^i + x_2^i}{2}$ and $c_y^i = \frac{y_1^i + y_2^i}{2}$, respectively. Then its location on the feature map is obtained by dividing by the stride: $(\tilde{c}_x^i, \tilde{c}_y^i) = (\lfloor \frac{c_x^i}{4} \rfloor, \lfloor \frac{c_y^i}{4} \rfloor)$. The heatmap response at the location $(x, y)$ is then computed as $M_{xy} = \sum_{i=1}^{N} \exp\left(-\frac{(x - \tilde{c}_x^i)^2 + (y - \tilde{c}_y^i)^2}{2\sigma_c^2}\right)$, where $N$ represents the number of objects in the image and $\sigma_c$ represents the standard deviation. The loss function is defined as pixel-wise logistic regression with focal loss [49]:
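The ground-truth heatmap construction above can be sketched directly. This is a minimal illustration: it uses a fixed $\sigma_c$ passed as a parameter and evaluates the summed Gaussian at every feature-map location, exactly as in the formula for $M_{xy}$.

```python
import numpy as np

def render_heatmap(boxes, h, w, stride=4, sigma=2.0):
    """Render the GT center heatmap M on an (h, w) feature map.

    boxes: list of (x1, y1, x2, y2) GT boxes in image coordinates.
    M_xy = sum_i exp(-((x - cx_i)^2 + (y - cy_i)^2) / (2 * sigma^2)),
    where (cx_i, cy_i) is the box center quantized by the stride.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx = int(((x1 + x2) / 2) // stride)  # center column on the map
        cy = int(((y1 + y2) / 2) // stride)  # center row on the map
        m += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return m

# one GT box with center (60, 80) in image space -> cell (15, 20) at stride 4
hm = render_heatmap([(40, 40, 80, 120)], h=64, w=64)
assert abs(hm[20, 15] - 1.0) < 1e-6  # response is 1 at the exact center
```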