Abstract · 1. Introduction · 2. Related Work · 3. Focal Loss · 4. RetinaNet · 5. Experiments · 6. Conclusion · Argument Overview

Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage methods to date. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design a simple dense detector called RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
Paragraph function: Overview of the whole paper. It frames the core contribution of the focal loss through the question of why one-stage detectors trail two-stage ones.
Logical role: The abstract unfolds as a detective narrative: first pose the question (why?), then reveal the cause (class imbalance), and finally present the remedy (focal loss). This structure is naturally engaging.
Argumentative technique / potential weakness: The phrase "central cause" is a bold claim: it implies a single explanation rather than one factor among several. In practice, feature alignment quality and positive/negative sample assignment strategies also affect one-stage detector performance.

1. Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework, the first stage generates a sparse set of candidate proposals, and the second stage classifies each proposal. Through a sequence of advances, this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark. Despite the success of two-stage methods, a natural question is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense grid of possible object locations, sizes, and aspect ratios. Recent work on one-stage methods, such as YOLO and SSD, demonstrates promising results, yielding faster detectors with accuracy within 10-40% of state-of-the-art two-stage methods (in relative terms).
Paragraph function: Establishes the research landscape by surveying the current accuracy gap between two-stage and one-stage detectors.
Logical role: Poses the paper's research question as a direct question ("could it achieve similar accuracy?"), hinting that the answer will be yes.
Argumentative technique / potential weakness: The quantified "10-40% relative accuracy gap" heightens the urgency of the problem. But the figure spans many methods, and selective citation may exaggerate the gap.
We push the envelope further in this paper: we show for the first time that a one-stage object detector can match or surpass two-stage detectors in accuracy. We identify class imbalance during training as the main obstacle impeding one-stage methods from achieving state-of-the-art accuracy. Class imbalance is addressed in two-stage detectors by a two-stage cascade (proposals reduce candidates from ~100k to ~1-2k) and by biased minibatch sampling (e.g., 1:3 foreground-to-background ratio). In one-stage detectors, similar heuristics are applied but they are inefficient as the training procedure is still dominated by easily classified background examples.
Paragraph function: Diagnoses the root cause, pinpointing class imbalance as the central bottleneck of one-stage detectors.
Logical role: A key analytical passage: it makes the implicit advantage of two-stage methods (cascade filtering) explicit and explains why directly transplanting those heuristics fails to fix one-stage detectors.
Argumentative technique / potential weakness: Elevating class imbalance from "known issue" to "main obstacle" is a forceful move, but whether it is the sole major obstacle is debatable: feature quality and anchor design matter as well.
2. Related Work

Two-stage detectors such as Faster R-CNN use a Region Proposal Network (RPN) followed by a classification network. The two-stage approach addresses class imbalance by generating ~1-2k proposals that filter out most background, and by using fixed foreground-to-background sampling ratios. One-stage detectors like SSD and YOLO predict detections directly from a dense grid. These methods address the imbalance issue through hard negative mining or bootstrapping, which select hard examples for training. Online Hard Example Mining (OHEM) is a representative approach that selects the highest-loss examples for each mini-batch, but still relies on heuristic sampling strategies. In contrast, our focal loss addresses class imbalance directly through the loss function without any sampling or mining.
Paragraph function: Literature review, systematically contrasting existing strategies for handling class imbalance.
Logical role: By criticizing the "heuristic" nature of existing methods, it sets up the contrast with the "principled" design of the focal loss.
Argumentative technique / potential weakness: Labeling all prior methods "heuristic" is an effective rhetorical move, but OHEM also has a rationale behind it, and the focal loss's own hyperparameter gamma is, to some degree, heuristic as well.
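The contrast drawn above, sampling-based mining versus loss-based re-weighting, can be sketched in a few lines. `ohem_select` is a hypothetical helper for illustration, not code from the paper: OHEM-style selection keeps only the top-k highest-loss examples for the gradient step, whereas the focal loss keeps every example and re-weights it instead.

```python
def ohem_select(losses, k):
    """Return the indices of the k highest-loss examples in a mini-batch,
    i.e. the only examples that would contribute to the gradient under
    an OHEM-style scheme."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])

# A mini-batch dominated by easy negatives (tiny losses) plus two hard examples:
losses = [0.01, 2.3, 0.005, 1.7, 0.02]
hard = ohem_select(losses, k=2)  # keeps indices 1 and 3, discards the rest
```

The focal loss removes the hard cut-off `k` entirely: every example stays in the loss, but easy ones are smoothly down-weighted by the modulating factor.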

3. Focal Loss

We start from the cross entropy (CE) loss for binary classification: CE(p, y) = -log(p_t), where p_t = p if y = 1, else 1-p. A notable property of CE is that even easily-classified examples (p_t >> 0.5) incur a non-trivial loss. When summed over a large number of easy examples, these small loss values can overwhelm the rare class. We propose to add a modulating factor (1 - p_t)^gamma to the cross entropy loss, with tunable focusing parameter gamma >= 0: FL(p_t) = -(1 - p_t)^gamma * log(p_t). The focal loss has two key properties: (1) when an example is misclassified and p_t is small, the modulating factor is near 1 and the loss is unaffected; (2) as p_t approaches 1, the factor goes to 0 and the loss for well-classified examples is down-weighted. For example, with gamma = 2, an example classified with p_t = 0.9 has 100x lower loss compared to CE. In practice, we use an alpha-balanced variant: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
Paragraph function: Core technical contribution, defining the focal loss from its mathematical derivation.
Logical role: The technical heart of the paper. Starting from a flaw in cross entropy, a single simple modulating factor achieves "automatic hard-example focusing" with no external sampling strategy.
Argumentative technique / potential weakness: The concrete "100x" figure is striking and lets the reader feel the effect of the focal loss directly. But the choice of gamma remains empirical: the authors concede in the appendix that the exact form of the focal loss is not crucial, hinting that other equivalent designs may exist.
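The definitions above translate directly into code. This is a minimal per-example sketch in plain Python (a real implementation would operate on logits and batches); it also lets us check the "100x lower loss" claim for gamma = 2 at p_t = 0.9:

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy: CE(p, y) = -log(p_t), p_t = p if y == 1 else 1 - p."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Well-classified example: p_t = 0.9, gamma = 2 gives a modulating factor of
# (1 - 0.9)^2 = 0.01, i.e. exactly 100x lower loss than plain CE
# (setting alpha = 1 to isolate the modulating factor).
ratio = cross_entropy(0.9, 1) / focal_loss(0.9, 1, gamma=2.0, alpha=1.0)

# Hard example: p_t = 0.1 gives a factor of (1 - 0.1)^2 = 0.81, so its loss
# is nearly unchanged relative to CE.
hard_factor = focal_loss(0.1, 1, gamma=2.0, alpha=1.0) / cross_entropy(0.1, 1)
```

With gamma = 0 the factor is identically 1 and the focal loss reduces to (alpha-balanced) cross entropy, which is why the ablations in Section 5 can compare the two on equal footing.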
An important practical consideration is model initialization. For models trained with CE, random initialization typically results in equal probability for all classes. Under this initialization, the loss due to the frequent class (background) can dominate total loss and cause instability in early training. To counter this, we introduce a "prior" concept for the value of p estimated by the model for the rare class (foreground) at the start of training. We set this prior to pi = 0.01, which ensures that the loss from the dominant class is low at initialization, improving training stability. We find that this initialization is critical for stable training with both CE and FL, improving AP by ~0.7 points.
Paragraph function: Engineering detail, describing the initialization trick that keeps focal-loss training stable.
Logical role: Supplies the practical companion measure the focal loss needs in deployment, showing the authors' deep grasp of training dynamics.
Argumentative technique / potential weakness: This passage shows a good marriage of theory and engineering. But it also implies the focal loss is not strictly "plug and play": it needs careful initialization to deliver its effect.
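One standard way to realize the prior pi described above (and the scheme the paper adopts for its sigmoid classification outputs) is to set the bias of the final classification layer to b = -log((1 - pi) / pi), so that the initial predicted foreground probability is exactly pi:

```python
import math

def prior_bias(pi=0.01):
    """Bias b for the final classification layer such that sigmoid(b) = pi,
    i.e. b = -log((1 - pi) / pi)."""
    return -math.log((1.0 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

b = prior_bias(0.01)
# sigmoid(b) = 0.01: at the start of training every anchor predicts
# "foreground" with probability ~1%, so the enormous number of background
# anchors contributes a small, stable loss instead of dominating it.
```

With the default zero bias, sigmoid(0) = 0.5, and the loss from ~100k mostly-background anchors swamps the rare foreground signal in the first iterations, which is exactly the instability the paragraph above describes.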

4. RetinaNet Detector

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is a Feature Pyramid Network (FPN) built on top of ResNet, which generates a rich, multi-scale feature pyramid from levels P3 through P7. The first subnet is the classification subnet: a small FCN attached to each FPN level, consisting of four 3x3 conv layers with 256 channels (each followed by ReLU), terminated by a 3x3 conv with KA outputs (K classes, A anchors) and sigmoid activation. The second subnet is the box regression subnet: an identical structure parallel to the classification subnet, but outputting 4A values per spatial location. Anchors comprise 9 per level (3 aspect ratios x 3 scales), covering a scale range of 32 to 813 pixels. Importantly, our simple detector achieves top results not based on innovations in network design but due to our novel loss.
Paragraph function: Architecture description, detailing RetinaNet's network structure and design choices.
Logical role: The architecture is deliberately described as plain and direct, to reinforce the claim that the core contribution lies in the loss rather than the architecture. The FPN-plus-twin-subnet design is standard, and it is precisely this "standardness" that lets the effect of the focal loss stand out.
Argumentative technique / potential weakness: The self-limiting claim "not based on innovations in network design" is shrewd: it showcases the generality of the focal loss and leaves room for later gains from stronger architectures. But the FPN backbone is itself a powerful multi-scale feature extractor whose contribution should not be overlooked.
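To make the density of the anchor grid concrete, here is a rough count of the anchors RetinaNet evaluates on an 800x800 input, assuming the standard FPN strides of 8 through 128 for P3 to P7 and 9 anchors per location (the strides and input size are assumptions for illustration; they are not stated in the passage above). The result lands on the order of 10^5, matching the ~100k candidate figure quoted in the introduction:

```python
def num_anchors(height, width, strides=(8, 16, 32, 64, 128), per_loc=9):
    """Total anchors over an FPN pyramid: at each level the feature map is the
    input size divided by the stride (rounded up), with per_loc anchors
    (3 aspect ratios x 3 scales) at every spatial position."""
    total = 0
    for s in strides:
        fh, fw = -(-height // s), -(-width // s)  # ceil division
        total += fh * fw * per_loc
    return total

n = num_anchors(800, 800)  # on the order of ~100k dense candidates,
                           # versus the ~1-2k proposals of a two-stage detector
```

Note how the finest level (stride 8) alone contributes 100 x 100 x 9 = 90,000 anchors, which is where the flood of easy background examples comes from.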

5. Experiments

All experiments are conducted on the COCO benchmark. In ablation studies with ResNet-50, standard CE with proper initialization achieves 30.2 AP. Adding alpha-balanced CE improves this to 31.1 AP (best alpha = 0.75). Focal loss with gamma = 2 and alpha = 0.25 achieves 34.0 AP, a 2.9 point improvement over alpha-balanced CE. FL outperforms the best variant of online hard example mining (OHEM) by over 3 points AP (36.0 vs. 32.8 AP with ResNet-101). For state-of-the-art comparison, RetinaNet-101-800 achieves 39.1 AP on COCO test-dev. With a ResNeXt-101-FPN backbone, it reaches 40.8 AP, surpassing all prior one-stage and two-stage detectors including Faster R-CNN with FPN. In terms of speed, RetinaNet-101-600 achieves 36.2 AP at 122ms per image, while Faster R-CNN+FPN achieves the same AP at 172ms — RetinaNet is 29% faster at comparable accuracy.
Paragraph function: Provides comprehensive experimental evidence, validating the focal loss through ablation studies and state-of-the-art comparisons.
Logical role: Multi-layered empirical support: (1) ablations tracing the step-by-step gains of the focal loss; (2) a head-to-head comparison with OHEM; (3) accuracy surpassing two-stage methods; (4) a speed-accuracy trade-off analysis.
Argumentative technique / potential weakness: The incrementally rising numbers are persuasive. But for 39.1 AP versus Faster R-CNN+FPN's 36.2 AP, part of the gap may come from the different input resolutions (800 vs. 600) rather than the focal loss alone; a same-resolution comparison would make the case fairer.
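A quick arithmetic check of the speed claim quoted above (122 ms vs. 172 ms per image at the same AP):

```python
retinanet_ms, faster_rcnn_ms = 122, 172

# Fraction of per-image time saved relative to Faster R-CNN+FPN:
speedup = (faster_rcnn_ms - retinanet_ms) / faster_rcnn_ms  # ~0.29, i.e. ~29% faster
```

The claim holds under this reading ("29% less time per image"); stated as throughput instead, 172/122 is about a 1.4x speedup.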

6. Conclusion

In this paper, we identified class imbalance as the primary cause for the accuracy gap between one-stage and two-stage object detectors. To address this, we proposed the focal loss which applies a modulating term to the cross entropy loss to focus learning on hard misclassified examples. Our approach is simple and highly effective. We demonstrated its effectiveness by designing RetinaNet, a simple one-stage object detector that achieves state-of-the-art accuracy, surpassing all previously published two-stage detectors. We hope the simplicity and effectiveness of focal loss will benefit other domains with severe class imbalance.
Paragraph function: Summarizes the paper, restating the core finding and contribution and looking toward broader applications.
Logical role: The conclusion precisely mirrors the abstract's question-answer structure: problem (class imbalance) -> remedy (focal loss) -> validation (RetinaNet) -> outlook (other domains).
Argumentative technique / potential weakness: The closing nod to "other domains with severe class imbalance" suggests broad applicability without giving concrete examples. In fact, the focal loss was later widely adopted in medical imaging, natural language processing, and beyond, vindicating this outlook.

Argument structure overview

Question: one-stage detector accuracy: why does it trail two-stage methods?
Claim: extreme class imbalance is the central cause.
Evidence: focal loss + RetinaNet surpass all two-stage methods.
Rebuttal: no heuristic sampling is needed; the loss function addresses the imbalance directly.
Conclusion: a simple loss design, generalizable to other domains.

Author's core claim (one sentence)

The root cause of one-stage detectors trailing two-stage methods is the extreme foreground-background class imbalance during training; the focal loss resolves it simply by dynamically down-weighting easy examples, letting the simple RetinaNet surpass all prior methods.

Strongest point of the argument

Depth of the problem diagnosis: the design of the focal loss maps directly onto the root cause, with the modulating factor (1-p_t)^gamma realizing "automatic hard example mining" in a mathematically precise way. The head-to-head ablation against OHEM (better by over 3 AP) and the across-the-board win over two-stage methods on both speed and accuracy form highly persuasive empirical support.

Weakest point of the argument

Exclusivity of the causal attribution: positioning class imbalance as "the central cause" may oversimplify the multiple factors behind one-stage detectors' lag. Feature alignment quality, anchor assignment strategy, and multi-scale feature fusion could all be secondary but significant contributors. Moreover, the optimal choice gamma = 2 has no theoretical derivation and was found by empirical search.
