Simple Baselines for Human Pose Estimation and Tracking

Abstract

There has been significant progress on pose estimation and tracking in recent years. Yet, much of the focus has been on developing complex solutions involving multi-component systems, cascaded networks, and sophisticated loss functions. In this work, we ask a simple question: how good could a simple method be? Specifically, we provide simple baseline methods for both pose estimation and pose tracking.

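Paragraph function: Opens with a reflective question that challenges the field's drift toward complexity.

Logical role: Establishes the research motivation by questioning whether complex solutions are necessary.

Rhetorical technique: The rhetorical question "how good could a simple method be?" sparks curiosity and strategically turns "simple" from a weakness into the paper's selling point.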

For pose estimation, our method involves a ResNet backbone augmented with a few deconvolutional layers to generate heatmaps for joint detection. For pose tracking, our method involves assigning poses based on greedy matching of detected keypoints across frames using optical flow. Despite the simplicity of these methods, they achieve competitive or even superior results compared to state-of-the-art methods on challenging benchmarks such as COCO and PoseTrack.

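Paragraph function: Summarizes the baseline methods for both tasks and the headline results.

Logical role: Sketches the methodology quickly and uses the results to back the claim that simple is effective.

Rhetorical technique: "Despite the simplicity" sets up a dramatic contrast; the reader is surprised that simple methods are competitive, which reinforces the paper's core message.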
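
To make "heatmaps for joint detection" concrete, here is a minimal sketch of how one keypoint per joint could be read out of k heatmaps. The notes do not describe the decoding step, so taking the highest-scoring cell (argmax) is an assumption based on common practice, and the names `decode_heatmap` and `decode_pose` are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of scores, indexed as heatmap[y][x]


def decode_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the highest-scoring cell in one joint heatmap."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def decode_pose(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Decode one (x, y, score) triple per joint from k heatmaps."""
    return [decode_heatmap(h) for h in heatmaps]


# Toy example with k = 2 joints on a 3x4 grid (illustrative values only).
toy_heatmaps = [
    [[0.0, 0.1, 0.0, 0.0],
     [0.0, 0.2, 0.9, 0.1],
     [0.0, 0.0, 0.1, 0.0]],
    [[0.7, 0.1, 0.0, 0.0],
     [0.1, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
]
pose = decode_pose(toy_heatmaps)
```

The peak score can double as a per-joint confidence; real systems often refine the integer peak to sub-pixel precision, which this sketch leaves out.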

1. Introduction

Human pose estimation, the task of determining the spatial locations of body joints or keypoints from an image, has attracted significant attention due to its widespread applications in action recognition, human-computer interaction, and augmented reality. Recent approaches to the problem have achieved impressive results by employing increasingly complex network architectures and training strategies. However, the growing complexity makes it difficult to identify which design choices are truly essential for achieving high performance.

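Paragraph function: Defines the problem background and points to a methodological crisis in the field.

Logical role: Justifies the "return to simplicity" research direction.

Rhetorical technique: Acknowledges existing achievements before raising the problem, which avoids directly dismissing others' work and shows academic diplomacy.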

We believe that establishing strong and simple baselines is important for understanding the core problem and for measuring progress in the field. Our goal is not to propose a new state-of-the-art method but rather to show that a straightforward approach, when implemented carefully with proper training procedures, can achieve surprisingly strong performance. The key insight is that a simple network architecture combined with proper training details can match or exceed the performance of more complex designs.

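Paragraph function: States the research goal and the core insight.

Logical role: Establishes the paper's philosophical stance: minimalism.

Rhetorical technique: Deliberately lowers the reader's expectations ("not to propose a new state-of-the-art"), so that the later results that surpass the state of the art land with greater impact.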

3. Approach

3.1 Pose Estimation

Our pose estimation network simply adds a few deconvolutional layers over the last convolution stage of the ResNet architecture. Three deconvolutional layers with batch normalization and ReLU activation are used, each with 256 filters and a 4x4 kernel. A 1x1 convolutional layer is added at the end to generate predicted heatmaps for all k keypoints. The overall architecture is remarkably simple: it consists of a ResNet backbone and a few transposed convolution layers, with no bells and whistles.

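Paragraph function: Describes the complete architecture of the pose estimation network.

Logical role: Achieves the core functionality with the fewest components, supporting the "simple baseline" claim.

Rhetorical technique: The colloquial "no bells and whistles" deliberately stresses the plainness of the method, in sharp contrast with the field's habit of pursuing complexity.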
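
A small worked calculation shows what the three 4x4 deconvolutional layers do to the feature-map resolution. The text above gives only the kernel size, filter count, and number of layers; the stride of 2, padding of 1, backbone total stride of 32, and the 256x192 input size are assumptions made for this sketch, and the function names are hypothetical.

```python
from typing import Tuple


def backbone_output_size(size: int, total_stride: int = 32) -> int:
    """Spatial size after the backbone (assumed total stride of 32)."""
    if size % total_stride != 0:
        raise ValueError("this sketch assumes the input size is divisible by the stride")
    return size // total_stride


def deconv_output_size(size: int, kernel: int = 4, stride: int = 2,
                       padding: int = 1, output_padding: int = 0) -> int:
    """Standard transposed-convolution output size formula (no dilation)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def head_output_size(height: int, width: int, num_deconv: int = 3) -> Tuple[int, int]:
    """Heatmap resolution after the backbone plus `num_deconv` deconv layers."""
    h, w = backbone_output_size(height), backbone_output_size(width)
    for _ in range(num_deconv):
        # Each assumed stride-2 layer doubles the resolution.
        h, w = deconv_output_size(h), deconv_output_size(w)
    return h, w


# Assumed 256x192 person crop: 8x6 after the backbone, then 16x12, 32x24, 64x48.
heatmap_hw = head_output_size(256, 192)
```

Under these assumptions each layer doubles the resolution, so the heatmaps come out at one quarter of the input size; the final 1x1 convolution changes only the channel count (to k), not the spatial size.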

3.2 Pose Tracking

For multi-person pose tracking in videos, we adopt a simple approach based on optical flow. Given the poses detected in the current frame, we propagate each keypoint to the next frame using optical flow and perform greedy bipartite matching between the propagated keypoints and the newly detected keypoints. The matching cost is computed as the Object Keypoint Similarity (OKS) between the propagated and detected poses. This simple strategy avoids the need for complex tracking algorithms such as graph-based optimization or recurrent networks.

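Paragraph function: Describes the baseline method for pose tracking.

Logical role: Extends the "simple baseline" philosophy to the tracking task.

Rhetorical technique: Lists the complex methods that are left out (graph-based optimization, RNNs) to set off how minimal the proposed approach is.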
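
The tracking step described above can be sketched in a few lines: propagate each tracked pose with the flow field, score every (track, detection) pair with OKS, and accept pairs greedily from the highest score down. This is an illustrative sketch, not the authors' implementation. The `flow` callable, the per-keypoint constants `kappas`, the single shared `area`, and the `min_sim` threshold are all hypothetical stand-ins, and every keypoint is assumed to be labeled.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Pose = List[Point]


def propagate(pose: Pose, flow: Callable[[float, float], Point]) -> Pose:
    """Shift every keypoint by the flow vector sampled at its location."""
    moved = []
    for x, y in pose:
        dx, dy = flow(x, y)
        moved.append((x + dx, y + dy))
    return moved


def oks(a: Pose, b: Pose, area: float, kappas: List[float]) -> float:
    """Object Keypoint Similarity, assuming all keypoints are labeled."""
    total = 0.0
    for (ax, ay), (bx, by), k in zip(a, b, kappas):
        d2 = (ax - bx) ** 2 + (ay - by) ** 2
        total += math.exp(-d2 / (2.0 * area * k * k))
    return total / len(a)


def greedy_match(tracked: Dict[int, Pose], detected: List[Pose], area: float,
                 kappas: List[float], min_sim: float = 0.5) -> Dict[int, int]:
    """Pair track ids with detection indices, highest similarity first."""
    pairs = sorted(
        ((oks(p, d, area, kappas), tid, j)
         for tid, p in tracked.items() for j, d in enumerate(detected)),
        reverse=True,
    )
    assignment: Dict[int, int] = {}
    used = set()
    for sim, tid, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if tid in assignment or j in used:
            continue  # each track and each detection is matched at most once
        assignment[tid] = j
        used.add(j)
    return assignment


# Toy example: two tracks, a uniform flow of +5 px in x, detections listed in swapped order.
tracks = {7: [(10.0, 10.0), (12.0, 20.0)], 8: [(50.0, 10.0), (52.0, 20.0)]}
propagated = {tid: propagate(p, lambda x, y: (5.0, 0.0)) for tid, p in tracks.items()}
detections = [[(55.5, 10.0), (57.0, 20.5)], [(15.0, 10.5), (16.5, 20.0)]]
matches = greedy_match(propagated, detections, area=400.0, kappas=[0.1, 0.1])
```

Greedy matching is not guaranteed to find the globally best assignment, but it is simple and cheap, which fits the baseline spirit. Detections left unmatched would typically start new tracks; that bookkeeping is omitted here.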

4. Experiments

We evaluate our approach on the COCO keypoint detection benchmark. Our models are trained on the COCO train2017 dataset, which includes 57K images and 150K person instances. We report results on the val2017 set (5000 images) and on the test-dev2017 set. The evaluation metric is AP based on Object Keypoint Similarity (OKS). We follow a two-stage top-down pipeline: a person detector first produces bounding boxes, and our pose estimation network is then applied to each detected person box.

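Paragraph function: Explains the experimental setup: dataset, metric, detection pipeline.

Logical role: Lays the groundwork for a fair comparison.

Rhetorical technique: Spells out dataset sizes and the evaluation protocol to ensure comparability with other methods.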
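
In a top-down pipeline the network sees only the cropped person box, so predicted heatmap positions have to be mapped back into full-image coordinates before evaluation. The sketch below shows one way to do that. It assumes the box is resized directly to the network input with no aspect-ratio padding and that a heatmap cell is represented by its center; the function name and the box layout `(x0, y0, width, height)` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, width, height) in image pixels


def heatmap_to_image(x_hm: int, y_hm: int, box: Box,
                     heatmap_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a heatmap cell to image coordinates via the person box.

    `heatmap_size` is (width, height) of the heatmap in cells.
    """
    x0, y0, box_w, box_h = box
    hm_w, hm_h = heatmap_size
    # Use the cell center, then scale from heatmap cells to box pixels.
    x_img = x0 + (x_hm + 0.5) * box_w / hm_w
    y_img = y0 + (y_hm + 0.5) * box_h / hm_h
    return x_img, y_img


# Toy example: a 96x128 px box at (100, 50) and a 48x64 heatmap (scale factor 2).
keypoint_xy = heatmap_to_image(10, 20, (100.0, 50.0, 96.0, 128.0), (48, 64))
```

Because every keypoint is placed relative to its box, errors in the person detector carry over directly into the pose output, which is why the choice of detector matters when comparing top-down methods.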

With ResNet-152 as the backbone, our simple baseline achieves 73.7 AP on the COCO test-dev set. This is competitive with state-of-the-art results from much more complex methods such as CPN, which uses a cascaded network architecture with online hard keypoint mining. When using the same person detector and a ResNet-50 backbone, our method achieves 70.4 AP, which already outperforms many existing approaches. The results suggest that most of the improvement in recent methods comes from better backbone networks and training strategies rather than from architectural innovations.

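Paragraph function: Reports the core quantitative results and derives an insight from them.

Logical role: Backs the "simple is effective" claim with data and reveals the real source of performance gains.

Rhetorical technique: The direct comparison with CPN is especially persuasive: at comparable performance, the gap in architectural complexity is huge, which squarely supports the paper's core message.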

For pose tracking on the PoseTrack dataset, our simple baseline with greedy matching achieves competitive tracking accuracy compared to methods using more sophisticated temporal modeling. Specifically, our approach obtains multi-person pose tracking accuracy of 65.4 MOTA on the PoseTrack validation set. While more complex temporal models may offer marginal improvements, the gap is surprisingly small, suggesting that the pose estimation quality is the primary bottleneck for tracking performance.

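Paragraph function: Reports the tracking results and offers a bottleneck analysis.

Logical role: Extends the "simple baseline" argument to the second task and provides a diagnostic insight.

Rhetorical technique: Concedes the marginal advantage of complex methods ("may offer marginal improvements"), then immediately shifts the focus to the real bottleneck, turning defense into offense.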

5. Conclusion

We have provided simple yet strong baseline methods for human pose estimation and tracking. For pose estimation, we show that a ResNet backbone with a few deconvolutional layers can achieve surprisingly strong results. For pose tracking, a simple greedy matching strategy based on optical flow is sufficient to achieve competitive performance. Our results suggest that the community should pay more attention to training details and baseline comparisons rather than solely focusing on architectural novelty.

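Paragraph function: Summarizes the findings and makes a recommendation to the community.

Logical role: Elevates the technical findings into a methodological reflection.

Rhetorical technique: The "recommendation" to the community is mildly critical and echoes the problem framing of the introduction, closing the argumentative arc.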

We hope that our work can serve as a reference point for future research, providing a clear and reproducible baseline against which new methods can be evaluated. We acknowledge that there is certainly room for improvement through more sophisticated approaches, but we argue that any new method should demonstrate significant advantages over these simple baselines to justify its added complexity.

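Paragraph function: Outlook and a modest concession.

Logical role: Ends with a concession plus a condition, both admitting limitations and setting a bar.

Rhetorical technique: The closing "justify its added complexity" reads like a standard for judging future research and neatly positions the baseline as the field's touchstone.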

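Overview of the Argument Structure

Problem: pose estimation methods keep growing more complex
→ Claim: a simple method can also reach high performance
→ Method: ResNet + deconvolutional layers; optical-flow greedy matching
→ Evidence: 73.7 AP on COCO; 65.4 MOTA on PoseTrack
→ Conclusion: architectural novelty matters less than training details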

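Core Claim

The main performance gains in human pose estimation and tracking come from backbone quality and training strategy rather than from complex architectural design; a simple ResNet + deconvolution scheme is enough to serve as a strong baseline.

Strongest Point of the Argument

A minimal architecture (only three deconvolutional layers) reaches 73.7 AP on COCO test-dev, on par with CPN and its cascaded network, which makes the comparison highly persuasive.

Weakest Point of the Argument

The paper does not analyze the specific impact of "training details" in depth (for example data augmentation or learning rate schedules) and only mentions their importance in general terms, so the conclusion that training matters more than architecture lacks systematic ablation support.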