Abstract
We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints — the top-left corner and bottom-right corner. By detecting corners instead of anchor boxes, we eliminate the need for anchor box design, a critical but often heuristic component of modern one-stage detectors. We introduce corner pooling, a new type of pooling layer that helps the network better localize corners by encoding the information along the boundary of the object. CornerNet achieves 42.1% AP on MS COCO, outperforming all existing one-stage detectors at the time of publication.
Paragraph function
Overview of the whole paper: presents the core idea of corner detection and its advantage of eliminating anchor boxes.
Logical role
The abstract previews the paper's argument: reframing (box to corner pair), technical innovation (corner pooling), and validation (state of the art on COCO).
Argumentation technique / potential weaknesses
Making "eliminating anchor boxes" the central selling point targets a real pain point in the detection community, namely the many hyperparameters of anchor design. But corner pairing introduces a new challenge of its own: how to correctly match top-left and bottom-right corners.
1. Introduction
Modern object detection methods are dominated by anchor-based approaches that densely tile predefined anchor boxes across the image and classify each anchor as foreground or background. One-stage methods like SSD, RetinaNet, and YOLO achieve real-time speed by directly predicting class labels and bounding box offsets from anchors. However, anchor-based detectors suffer from two significant drawbacks. First, the design of anchor boxes requires careful tuning of sizes, aspect ratios, and densities, which is typically done through heuristic rules or exhaustive search. Second, anchor boxes create a severe imbalance between positive and negative samples, as most anchors correspond to background, necessitating complex sampling strategies or focal loss.
Paragraph function
Establishes the problem: a systematic critique of the two major drawbacks of anchor-based detectors.
Logical role
Starting point of the argument: the two pain points of anchors, heuristic design and sample imbalance, build strong motivation for introducing an anchor-free method.
Argumentation technique / potential weaknesses
Replacing vague criticism with two concrete drawbacks (heuristic design, sample imbalance) makes the case forceful. However, RetinaNet's focal loss already handles sample imbalance quite effectively, so the timeliness of this criticism is debatable.
We propose a fundamentally different detection paradigm: detecting objects as pairs of keypoints (top-left and bottom-right corners) rather than classifying anchor boxes. This approach offers several advantages. First, it completely eliminates the need for anchor box design, removing a significant source of hyperparameters. Second, the set of possible corners is much smaller than the set of possible anchors, leading to a more balanced distribution between positive and negative training examples. Third, corners encode the extent of objects more efficiently than centers, as locating a corner depends on only two sides of the object boundary rather than all four.
Paragraph function
Introduces the core insight: the three advantages of corner-based detection.
Logical role
Transition from problem to solution: the three advantages answer the two drawbacks of anchors (heuristic design, sample imbalance) and add an efficiency argument.
Argumentation technique / potential weaknesses
The point that a corner needs to attend to fewer sides than a center has a certain mathematical elegance. But corners lie on the object boundary rather than inside the object, so they may lack rich visual texture, which can make corner detection itself harder.
2. Related Work
Two-stage detectors such as Faster R-CNN and R-FCN first generate region proposals and then classify each proposal. One-stage detectors like SSD and RetinaNet directly predict detections from a dense set of anchor boxes. Both families rely heavily on anchor box design. Keypoint-based methods have been successfully used for tasks like human pose estimation, where body joints are detected as heatmap peaks and connected into skeletons. CornerNet adapts this keypoint detection paradigm to object detection, treating each bounding box as two keypoints connected by their associative embeddings.
Paragraph function
Literature review: positions CornerNet at the intersection of detection and keypoint methods.
Logical role
The success of human pose estimation serves as a cross-domain analogy, lending precedent and confidence to bringing keypoint methods into object detection.
Argumentation technique / potential weaknesses
The pose-estimation analogy strengthens the case. But body joints have distinctive visual features, whereas corners lie on object edges; the two detection problems may differ fundamentally in difficulty.
3. Method
CornerNet uses a stacked hourglass network as the backbone to produce feature maps. The network predicts two sets of heatmaps, one for top-left corners and one for bottom-right corners, along with embedding vectors for grouping corners and offsets for precise localization. Each heatmap indicates, at every location, the probability that a corner of a specific category exists there. The associative embedding assigns a vector to each detected corner such that corners belonging to the same object have similar embeddings, enabling us to group top-left and bottom-right corners into bounding boxes by matching embeddings with small distances.
Paragraph function
Core method description: details CornerNet's prediction outputs and the corner-grouping mechanism.
Logical role
Makes "detecting corner pairs" concrete as a triple output: heatmaps (where corners are), embeddings (which corners pair up), and offsets (precise positions).
Argumentation technique / potential weaknesses
The triple-output design is clear and complete. But associative-embedding matching can mispair corners when objects are dense and overlapping; this failure mode deserves attention.
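The grouping step can be sketched with scalar embeddings. The nearest-neighbor pairing rule and the distance threshold (`max_dist=0.5`) below are illustrative simplifications, not the paper's exact inference procedure:

```python
# Minimal sketch of corner grouping by associative embeddings.
# Each detected corner carries a scalar embedding; a top-left and a
# bottom-right corner are paired when their embeddings are close.

def group_corners(tl_embs, br_embs, max_dist=0.5):
    """Pair each top-left corner with its nearest bottom-right embedding."""
    pairs = []
    for i, e_tl in enumerate(tl_embs):
        dists = [abs(e_tl - e_br) for e_br in br_embs]
        j = min(range(len(br_embs)), key=dists.__getitem__)
        if dists[j] < max_dist:  # reject pairings whose embeddings are far apart
            pairs.append((i, j))
    return pairs

# Two objects: embeddings near 0.1 and near 0.9 should pair up across lists.
print(group_corners([0.10, 0.90], [0.88, 0.12]))  # [(0, 1), (1, 0)]
```

This is exactly where the mispairing risk noted above arises: when two overlapping objects of the same class have similar embeddings, the nearest-neighbor rule can join a top-left corner of one object to the bottom-right corner of another.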
We introduce corner pooling, a novel pooling operation designed to help the network locate corners more accurately. The intuition is that a corner is defined by two edges of the bounding box, and these edges may be far from the corner itself. For a top-left corner, corner pooling takes the maximum value looking rightward along the horizontal direction and the maximum value looking downward along the vertical direction, and adds them together. This allows the network to gather information from the entire top edge and left edge of the bounding box, even when the corner itself lacks distinctive visual features. Corner pooling is implemented efficiently using cumulative maximum operations.
Paragraph function
Technical-innovation description: explains the design motivation and mechanics of corner pooling.
Logical role
Corner pooling directly answers the anticipated objection that corners lack local visual features; it is the most ingenious component of the method.
Argumentation technique / potential weaknesses
The intuitive explanation (the defining edges may be far from the corner) makes a complex operation easy to grasp. But corner pooling assumes strong visual features along the box edges, so it may help less for objects with uniform texture.
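The rightward and downward maxima described above can be sketched on a single 2-D feature map using cumulative maxima. The real layer operates per channel on batched tensors; this NumPy version is only illustrative:

```python
import numpy as np

def top_left_corner_pool(x):
    """Top-left corner pooling on a 2-D feature map of shape (H, W).

    For each location, take the max over everything to its right
    (scanning the candidate top edge) plus the max over everything
    below it (scanning the candidate left edge). Both scans are
    cumulative maxima over a reversed axis.
    """
    horiz = np.maximum.accumulate(x[:, ::-1], axis=1)[:, ::-1]  # rightward max
    vert = np.maximum.accumulate(x[::-1, :], axis=0)[::-1, :]   # downward max
    return horiz + vert

fmap = np.array([[1.0, 3.0],
                 [2.0, 0.0]])
print(top_left_corner_pool(fmap))
```

Bottom-right corner pooling is the mirror image: cumulative maxima scanning leftward and upward instead.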
The training of CornerNet uses a variant of focal loss for the corner heatmaps, with Gaussian penalty reduction around each ground-truth corner location to tolerate minor localization errors. The associative embedding loss uses a pull-push formulation: it pulls the embeddings of corners belonging to the same object close together while pushing embeddings of different objects apart. The offset loss uses smooth L1 loss to refine the corner locations from the quantized heatmap positions to sub-pixel accuracy. At inference, corners are extracted as heatmap peaks via non-maximum suppression, paired using embedding distances, and filtered by detection score.
Paragraph function
Training and inference pipeline: a complete description of the loss design and the inference procedure.
Logical role
The three losses (focal, embedding, offset) correspond one-to-one to the three prediction outputs, forming a complete training objective.
Argumentation technique / potential weaknesses
The Gaussian penalty reduction reflects careful thinking about localization tolerance. But balancing the weights of multiple losses may require careful tuning.
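The pull-push formulation can be sketched on scalar embeddings, one embedding per corner and one pair per object. The margin `delta = 1.0` follows the associative-embedding convention; the function below is a simplified illustration, not the training code:

```python
# Sketch of the pull-push embedding loss on scalar corner embeddings:
# pull the two corners of each object toward their mean embedding,
# and push the mean embeddings of different objects at least a
# margin (delta) apart.

def pull_push_loss(tl_embs, br_embs, delta=1.0):
    """tl_embs[k], br_embs[k]: embeddings of object k's two corners."""
    n = len(tl_embs)
    means = [(t + b) / 2.0 for t, b in zip(tl_embs, br_embs)]
    # pull term: squared deviation of each corner from its object's mean
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_embs, br_embs, means)) / n
    # push term: hinge penalty on pairs of object means closer than delta
    push = 0.0
    if n > 1:
        push = sum(max(0.0, delta - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k)
        push /= n * (n - 1)
    return pull, push

# Well-separated objects incur neither pull nor push penalty.
print(pull_push_loss([0.0, 2.0], [0.0, 2.0]))  # (0.0, 0.0)
```

Note that only distances between embeddings matter, never their absolute values, which is why a one-dimensional embedding suffices for grouping.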
4. Experiments
CornerNet is evaluated on MS COCO test-dev. Using the Hourglass-104 backbone, CornerNet achieves 42.1% AP, outperforming all existing one-stage detectors including RetinaNet (40.8%), YOLOv3 (33.0%), and DSSD (33.2%). CornerNet also achieves competitive results compared to two-stage detectors, matching Cascade R-CNN (42.8%) closely. Particularly notable is CornerNet's strong performance on small objects (20.8% AP_small), which benefits from the absence of anchor-based priors that often underperform on small objects due to coarse feature resolution. Ablation studies confirm that corner pooling improves AP by 2.0 points, and associative embeddings outperform alternative grouping strategies.
Paragraph function
Provides the core evidence: a comprehensive COCO comparison demonstrating CornerNet's performance.
Logical role
Surpassing all one-stage detectors and approaching two-stage methods strongly supports the core claim that anchor-free detection is feasible and effective. The advantage on small objects is an unexpected but valuable finding.
Argumentation technique / potential weaknesses
The 2.0 AP contribution of corner pooling is cleanly quantified. But CornerNet's Hourglass-104 backbone costs far more compute than RetinaNet's ResNet-101, so a fair comparison must account for compute budget.
Error analysis reveals that CornerNet's primary failure mode is incorrect corner grouping, where corners from different objects are mistakenly paired. This accounts for approximately 19.3% of false positive detections. The grouping errors are most common in crowded scenes where many objects of the same category overlap. Additionally, CornerNet shows relatively weaker performance on large objects compared to anchor-based methods, likely because the two corners of a large object are far apart and the embedding vectors must encode long-range spatial relationships. These observations motivate future work on improved grouping mechanisms and center-aware detection.
Paragraph function
Error analysis: a candid dissection of CornerNet's failure modes.
Logical role
The 19.3% mispairing rate quantifies the core risk of the corner paradigm and gives follow-up work (e.g. CenterNet) a clear target.
Argumentation technique / potential weaknesses
Proactively analyzing failure modes strengthens credibility. The weaker performance on large objects contrasts interestingly with the stronger performance on small objects, revealing a scale preference in corner detection.
5. Conclusion
We have presented CornerNet, a novel anchor-free object detection approach that detects objects as pairs of corner keypoints. By eliminating anchor boxes, CornerNet removes a significant source of hyperparameters and achieves state-of-the-art performance among one-stage detectors on MS COCO. The introduction of corner pooling provides an effective mechanism for the network to leverage boundary information for accurate corner localization. Our work demonstrates that keypoint-based representations provide a viable and powerful alternative to anchor-based detection, opening new directions for object detection research including center-point detection and other geometric representations.
Paragraph function
Concluding summary: restates the core contribution and pioneering significance of anchor-free detection.
Logical role
Ending with "opening new directions" positions CornerNet as the start of a paradigm shift rather than an endpoint. The mention of center-point detection hints at follow-up work such as CenterNet.
Argumentation technique / potential weaknesses
The paradigm-shift framing elevates the paper's impact narrative. In hindsight, CornerNet did inspire a line of anchor-free detectors (CenterNet, FCOS, and others), validating this positioning.