Stacked Hourglass Networks for Human Pose Estimation

Abstract

This work introduces a novel network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII Human Pose benchmarks, outperforming all recent methods.

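Paragraph function: Presents the core design idea of the stacked hourglass architecture and its benchmark results.

Logical role: Builds the architecture's theoretical foundation around "multi-scale processing + intermediate supervision" as the central argument.

Argumentative technique / potential weakness: The name "hourglass" intuitively conveys the architecture's symmetric downsampling-upsampling structure.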

1. Introduction

A key component of understanding people in images and video is pose estimation. Understanding a person's posture and limb articulation is useful for higher-level tasks like action recognition. The challenge is that the appearance and position of body parts can vary dramatically due to changes in clothing, body shape, lighting, and occlusion. Furthermore, capturing information at different scales is essential: local evidence is critical for identifying body parts, but the final pose estimate requires a coherent understanding of the full body. Prior methods have addressed multi-scale processing through various means, but we propose a more systematic approach.

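Paragraph function: Defines the challenges of pose estimation and the need for multi-scale processing.

Logical role: Sets up the dual "local vs. global" requirement that motivates the hourglass design.

Argumentative technique / potential weakness: Clearly states the essence of the multi-scale requirement: local detection needs fine-grained features, while global consistency needs high-level semantics.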

2. Network Architecture: Hourglass Module Design

The design of our network is motivated by the need to capture information at every scale. We use a single pipeline with skip connections to preserve spatial information at each resolution and pass it along for upsampling further in the network. The hourglass is a simple, minimal design that has the capacity to capture features across scales. The network reaches its lowest resolution at 4x4 pixels before beginning the top-down upsampling with nearest neighbor interpolation and the combination of features across scales. At each scale, a residual module performs convolutions. The full architecture involves repeated pooling at each layer from the highest to lowest resolution, and subsequent upsampling and combination of features across all scales.

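Paragraph function: Describes the structural design of a single hourglass module.

Logical role: Translates the multi-scale requirement into a concrete, symmetric encoder-decoder architecture.

Argumentative technique / potential weakness: Skip connections ensure that fine spatial information is not lost during downsampling, echoing the design philosophy of U-Net.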
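
The module described in this section can be sketched as a short recursive function. This is only an illustration of the pool, skip, upsample, and merge pattern, not the authors' implementation. Several things here are assumptions: the 64x64 input size (chosen so that four poolings reach the stated 4x4 minimum), the identity `residual_stub` standing in for a learned residual module, elementwise addition as the way features are combined, and all function names.

```python
def max_pool_2x2(x):
    """Bottom-up step: halve resolution, keeping the max of each 2x2 block."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def upsample_nearest_2x(x):
    """Top-down step: double resolution by nearest neighbor interpolation."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def residual_stub(x):
    """Stand-in for a learned residual module (assumption: identity)."""
    return [list(row) for row in x]


def hourglass(x, depth, trace):
    """Pool `depth` times, then upsample back, merging a skip branch per scale."""
    trace.append(len(x))                    # record resolution on the way down
    skip = residual_stub(x)                 # preserves detail at this scale
    low = residual_stub(max_pool_2x2(x))
    if depth > 1:
        low = hourglass(low, depth - 1, trace)
    else:
        trace.append(len(low))              # lowest resolution reached
        low = residual_stub(low)
    up = upsample_nearest_2x(low)
    # Assumption: features across scales are combined by elementwise addition.
    return [[s + u for s, u in zip(rs, ru)] for rs, ru in zip(skip, up)]


features = [[float(i * 64 + j) for j in range(64)] for i in range(64)]
resolutions = []
output = hourglass(features, depth=4, trace=resolutions)
print(resolutions)                  # resolutions visited on the way down
print(len(output), len(output[0]))  # output has the input's resolution
```

With a 64x64 input and `depth=4`, the recorded resolutions run from 64 down to 4, and the output returns to 64x64. The skip branch at each level is what lets fine spatial detail survive the pooling.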

3. Stacking Hourglasses with Intermediate Supervision

A single hourglass already captures multi-scale information, but we hypothesize that further performance gains can be achieved by stacking multiple hourglasses together, allowing for repeated bottom-up and top-down inference. After each hourglass, we apply intermediate supervision: the network produces an intermediate set of heatmaps that are compared against the ground truth. This encourages each hourglass to produce a reasonable prediction on its own, and allows the subsequent hourglass to refine and correct the predictions of the previous stage. We find that using 8 stacked hourglasses provides the best trade-off between performance and computational cost.

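Paragraph function: Explains the motivation for stacking and intermediate supervision.

Logical role: The notion of "repeated refinement" offers an intuitive explanation for the progressive improvement of predictions.

Argumentative technique / potential weakness: Intermediate supervision both addresses the gradient problems of deep networks and provides a stage-by-stage mechanism for refining predictions.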
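
To make intermediate supervision concrete, the sketch below scores the heatmap from every stage against the same ground truth and sums the per-stage losses. The text only says the heatmaps are "compared against the ground truth", so mean squared error is an assumption here. So are the way each prediction is fed forward (simple addition) and the hypothetical `toy_stage`, which stands in for a full hourglass.

```python
def mse(pred, target):
    """Mean squared error between two heatmaps of equal size."""
    count = sum(len(row) for row in pred)
    total = sum((p - t) ** 2
                for rp, rt in zip(pred, target)
                for p, t in zip(rp, rt))
    return total / count


def run_stack(features, stages):
    """Run stages in order, collecting the heatmap each one predicts."""
    heatmaps = []
    for stage in stages:
        heatmap = stage(features)
        heatmaps.append(heatmap)
        # Assumption: the prediction is added back so the next stage can refine it.
        features = [[f + h for f, h in zip(rf, rh)]
                    for rf, rh in zip(features, heatmap)]
    return heatmaps


def supervised_loss(heatmaps, target):
    """Intermediate supervision: every stage is scored against the same target."""
    per_stage = [mse(h, target) for h in heatmaps]
    return sum(per_stage), per_stage


def toy_stage(gain):
    """Hypothetical stand-in for one hourglass: scales its input by `gain`."""
    return lambda feats: [[gain * v for v in row] for row in feats]


target = [[1.0, 0.0], [0.0, 0.0]]    # ground-truth heatmap for one joint
features = [[1.0, 0.0], [0.0, 0.0]]
heatmaps = run_stack(features, [toy_stage(0.5), toy_stage(0.5)])
total_loss, stage_losses = supervised_loss(heatmaps, target)
print(stage_losses, total_loss)
```

Because each stage contributes its own loss term, every hourglass receives a direct training signal rather than relying only on gradients from the final output.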

4. Experiments

We evaluate our approach on the MPII Human Pose and FLIC datasets. On MPII, our 8-stack hourglass network achieves 90.9% PCKh@0.5, significantly outperforming prior methods. The improvement is particularly notable for difficult joints like wrists (83.7%) and ankles (80.2%), where multi-scale reasoning is crucial. On FLIC, we achieve 99.0% accuracy for elbows and 97.0% for wrists at PCK@0.2. Ablation studies confirm that both stacking and intermediate supervision contribute significantly: removing intermediate supervision drops performance by 1.2%, and reducing from 8 to 2 hourglasses drops by 2.3%.

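Paragraph function: Reports benchmark results and ablation experiments.

Logical role: Quantitatively verifies the separate contributions of stacking and intermediate supervision.

Argumentative technique / potential weakness: The ablations cleanly separate the contributions of the two design elements, so readers can see exactly where each improvement comes from.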
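
The PCKh@0.5 and PCK@0.2 figures in this section are threshold-based accuracy metrics. The text does not define them, so the sketch below assumes the usual definition: a predicted joint counts as correct when its distance to the ground truth is at most a fraction `alpha` of a reference length (head segment length for PCKh, torso size for PCK). The function name and sample coordinates are made up for illustration.

```python
import math


def pck(pred, gt, ref_length, alpha):
    """Fraction of joints whose error is within alpha * ref_length."""
    threshold = alpha * ref_length
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / len(gt)


# Four predicted joints and their ground-truth locations, in pixels.
pred = [(0.0, 0.0), (10.0, 10.0), (50.0, 50.0), (5.0, 5.0)]
gt = [(3.0, 4.0), (10.0, 25.0), (56.0, 58.0), (5.0, 5.0)]

# PCKh@0.5 with an assumed head segment length of 20 px (threshold 10 px).
score = pck(pred, gt, ref_length=20.0, alpha=0.5)
print(score)
```

In this example the errors are 5, 15, 10, and 0 pixels, so three of the four joints fall within the 10-pixel threshold.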

5. Conclusions

We have introduced the stacked hourglass network, a novel architecture for human pose estimation that captures and consolidates information across all scales of the image through repeated bottom-up, top-down processing with intermediate supervision. The design is modular and elegant, achieving state-of-the-art results on multiple benchmarks. We believe the hourglass design will find broad applications beyond pose estimation in tasks requiring dense prediction at multiple scales.

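Paragraph function: Summarizes the architectural contribution and looks ahead to broader applications.

Logical role: Closes the paper with an outlook on generality, implying that the architecture's influence extends beyond a single task.

Argumentative technique / potential weakness: As predicted, the hourglass architecture was later widely applied to a variety of tasks, including object detection and depth estimation.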

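Argument Structure Overview

Problem: Insufficient integration of multi-scale information
➔ Claim: Symmetric hourglass + stacking + intermediate supervision
➔ Evidence: 90.9% PCKh on MPII
➔ Rebuttal: Ablations verify the contribution of each component
➔ Conclusion: A general architecture for multi-scale dense prediction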

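Core Claim

Stacking symmetric hourglass modules and applying intermediate supervision enables repeated bottom-up and top-down inference across scales, significantly improving pose estimation accuracy.

Strongest Argument

The marked improvement on difficult joints (wrists, ankles) directly validates the effectiveness of multi-scale reasoning, and the ablations cleanly separate the contribution of each design element.

Weakest Link

The 8-stack design carries a high computational cost with diminishing marginal returns, and the paper does not adequately discuss strategies for improving efficiency.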