HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation

Abstract

This paper addresses the scale variation challenge in bottom-up human pose estimation. The authors present HigherHRNet, which leverages high-resolution feature pyramids to learn scale-aware representations. Built upon HRNet, the method combines "multi-resolution supervision for training and multi-resolution aggregation for inference" to generate high-quality multi-scale heatmaps. HigherHRNet achieves 70.5% AP on COCO test-dev, including a 2.5% AP improvement for medium persons over the previous best bottom-up method, and 67.6% AP on CrowdPose, where it surpasses all top-down approaches.

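Paragraph function: Overview of the whole paper. It frames the problem (scale variation), presents the method (high-resolution feature pyramids), and reports the headline results.

Logical role: The abstract sets up a clear three-part structure of problem, method, and results. The claim of surpassing top-down methods is what draws attention: a bottom-up method exceeding top-down methods in accuracy is a notable milestone.

Argumentative technique / potential weaknesses: Concrete numbers (70.5% AP, a 2.5% AP gain, 67.6% AP) lend credibility. However, the scope of the claim to surpass all top-down methods needs careful checking: whether it covers every recent method at the time, and whether the comparison is made at comparable computational cost.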

1. Introduction

Bottom-up human pose estimation detects all body joints first, then groups them into individual persons. Compared to top-down methods that first detect persons then estimate each person's pose, bottom-up approaches are inherently more efficient as their computation does not scale with the number of persons. However, bottom-up methods face a critical challenge: "the scale of persons in an image varies significantly," and single-resolution heatmaps struggle to accurately localize keypoints for both small and large persons simultaneously. Small persons require high-resolution heatmaps for precise localization, while large persons benefit from lower-resolution representations with larger receptive fields.

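Paragraph function: Establishes the research setting by defining the advantages and the core challenge of bottom-up pose estimation.

Logical role: The starting point of the argument chain. It first affirms the efficiency advantage of bottom-up methods (which makes them worth studying), then identifies their accuracy bottleneck (scale variation), motivating the multi-scale solution.

Argumentative technique / potential weaknesses: Framing the problem as an efficiency versus accuracy trade-off positions the motivation clearly. However, the paragraph does not examine whether the efficiency advantage is significant in real deployments once the cost of grouping post-processing is taken into account.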
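
Illustrative sketch (not from the paper): one way to see why small persons need higher-resolution heatmaps is to look at the worst-case rounding error when a keypoint is snapped to a heatmap grid. The function names and the person heights of 32 and 256 pixels are assumptions chosen only for illustration.

```python
def roundtrip(x, stride):
    """Snap an input-pixel coordinate to the heatmap grid and map it back."""
    return round(x / stride) * stride


def max_localization_error(stride):
    """Worst-case per-axis error (input pixels) caused by grid snapping."""
    return stride / 2


def relative_error(stride, person_height):
    """Worst-case snapping error as a fraction of the person's height."""
    return max_localization_error(stride) / person_height


# Hypothetical person heights in input pixels: a small and a large person.
for person_height in (32, 256):
    for stride in (4, 2):  # 1/4-resolution vs. 1/2-resolution heatmaps
        rel = relative_error(stride, person_height)
        print(f"height={person_height:3d}px stride={stride}: "
              f"worst-case error = {rel:.2%} of height")
```

Under these assumptions the same grid error is eight times larger, relative to body size, for the 32-pixel person than for the 256-pixel one, and halving the stride halves it. This covers only the quantization part of the argument; it says nothing about receptive fields.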

Existing bottom-up methods like OpenPose and Associative Embedding typically produce heatmaps at 1/4 of the input resolution, which loses fine-grained spatial information critical for small person detection. Simply increasing the heatmap resolution is computationally expensive and does not address the fundamental scale variation problem. The authors propose that generating multi-scale heatmaps through a high-resolution feature pyramid is the principled solution to simultaneously handle persons at different scales.

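Paragraph function: Critiques existing methods by pointing out the 1/4-resolution limitation and the inadequacy of brute-force resolution increases.

Logical role: Rules out the simple alternative (directly raising the resolution) and leads the reader to see why a more elaborate design, the multi-scale feature pyramid, is needed.

Argumentative technique / potential weaknesses: Dismissing the naive option before presenting one's own is an effective strategy. However, calling it the "principled solution" is a strong claim: feature pyramids are not the only way to handle scale variation, and input pyramids (multi-scale inputs) are a common alternative.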

2. Related Work

Top-down methods such as SimpleBaseline and HRNet achieve high accuracy by cropping each detected person and estimating pose on normalized patches, but their inference time scales linearly with the number of persons. Bottom-up methods including OpenPose (using Part Affinity Fields) and Associative Embedding (using tag-based grouping) are person-count-independent but historically lag behind in accuracy. Feature Pyramid Networks (FPN) have been successful in object detection for handling scale variation, but their application to pose estimation requires careful design due to the need for high-resolution spatial precision.

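Paragraph function: Literature review. It contrasts the strengths and weaknesses of top-down and bottom-up methods and introduces FPN as a source of inspiration.

Logical role: Systematically lays out the methodological coordinates: top-down (accurate but slow), bottom-up (fast but less accurate), and FPN (handles scale but needs adaptation). HigherHRNet aims to combine the strengths of all three.

Argumentative technique / potential weaknesses: The review is clearly structured, but "historically lag behind" may be too sweeping: some bottom-up methods were already close to top-down methods on specific metrics. Whether FPN's success in detection transfers directly to pose estimation also needs more argument.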

3. Method

3.1 High-Resolution Feature Pyramids

HigherHRNet builds upon HRNet as the backbone, which maintains high-resolution representations throughout the network via parallel multi-resolution branches with repeated information exchange. The key extension is a deconvolution-based upsampling module that generates feature maps at higher resolutions than HRNet's native output. Starting from HRNet's 1/4-resolution output, the module applies transposed convolutions to produce 1/2-resolution features, creating a high-resolution feature pyramid with heatmap predictions at multiple scales. This design enables the network to produce heatmaps where small persons are better represented at higher resolutions and large persons are captured at lower resolutions.

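Paragraph function: Method details. It describes the architectural design that extends HRNet into a high-resolution feature pyramid.

Logical role: Establishes the architectural foundation of the method: building on HRNet's existing multi-resolution capability, it extends the resolution further upward to form a complete feature pyramid.

Argumentative technique / potential weaknesses: Extending HRNet is a pragmatic strategy that inherits the strengths of a proven backbone. However, whether deconvolution upsampling introduces checkerboard artifacts, and how much extra computation the high-resolution branch costs, need to be quantified in the experiments.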
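
Illustrative sketch: the step from 1/4 to 1/2 resolution can be checked with the standard output-size formula for a transposed convolution (dilation 1). The kernel size 4, stride 2, padding 1, and the 512-pixel input below are assumptions for the sake of the arithmetic, not settings taken from this summary.

```python
def deconv_out_size(in_size, kernel=4, stride=2, padding=1, output_padding=0):
    """Spatial output size of a transposed convolution with dilation 1."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding


input_size = 512                         # hypothetical input side length
hrnet_out = input_size // 4              # backbone output at 1/4 resolution
upsampled = deconv_out_size(hrnet_out)   # one transposed conv doubles it

pyramid = {"1/4": hrnet_out, "1/2": upsampled}
print(pyramid)  # {'1/4': 128, '1/2': 256}
```

With these assumed hyperparameters each transposed convolution exactly doubles the spatial size, which is what takes a 1/4-resolution map to 1/2 resolution.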

3.2 Multi-Resolution Supervision and Aggregation

During training, the method applies heatmap supervision at each resolution level of the feature pyramid, with ground-truth heatmaps generated at corresponding scales. This multi-resolution supervision encourages each pyramid level to specialize in detecting persons of the appropriate scale. During inference, heatmap predictions from all resolution levels are aggregated by resizing them to a common resolution and averaging. This multi-resolution aggregation combines the complementary strengths of different scales — high-resolution precision for small persons and large-context awareness for large persons.

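Paragraph function: Core mechanism. It describes the multi-resolution strategies used during training and inference.

Logical role: Answers the question of how the multi-scale features are used: separate supervision during training promotes specialization, and aggregation during inference makes the levels complementary. The two stages work together to form a complete method.

Argumentative technique / potential weaknesses: The intuition that multi-resolution supervision promotes scale specialization is reasonable, but is simple averaging the best way to aggregate? Weighted aggregation or an attention mechanism might yield further gains. In addition, the assumption that each level specializes in a particular scale needs visualization analysis to show that it holds in actual training.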
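
Illustrative sketch of the two stages, in plain Python: ground-truth heatmaps are built at each pyramid level and the per-level losses are summed (supervision), then the per-level predictions are resized to a common resolution and averaged (aggregation). Several choices here are assumptions made to keep the example small: a single keypoint, a 64x64 input, the same Gaussian sigma at every level, an MSE loss, and nearest-neighbor resizing. All function names are hypothetical.

```python
import math


def gaussian_heatmap(size, cx, cy, sigma=2.0):
    """size x size ground-truth heatmap with a Gaussian peak at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]


def mse(pred, target):
    """Mean squared error between two equally sized 2-D lists."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for prow, trow in zip(pred, target)
               for p, t in zip(prow, trow)) / n


def upsample_nearest(hm, factor):
    """Nearest-neighbor upsampling by an integer factor (a simplification)."""
    out = []
    for row in hm:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def aggregate(heatmaps, out_size):
    """Resize every level to out_size x out_size and average them."""
    resized = [upsample_nearest(hm, out_size // len(hm)) for hm in heatmaps]
    k = len(resized)
    return [[sum(r[y][x] for r in resized) / k for x in range(out_size)]
            for y in range(out_size)]


def argmax2d(hm):
    """(x, y) of the first maximum in row-major order."""
    best, bx, by = hm[0][0], 0, 0
    for y, row in enumerate(hm):
        for x, v in enumerate(row):
            if v > best:
                best, bx, by = v, x, y
    return bx, by


# Hypothetical setup: one keypoint at (40, 24) in a 64x64 input, with
# pyramid levels at 1/4 (16x16) and 1/2 (32x32) resolution.
IMG, KX, KY = 64, 40, 24
levels = [IMG // 4, IMG // 2]
targets = [gaussian_heatmap(s, KX * s / IMG, KY * s / IMG) for s in levels]

# Training: one loss term per level, summed. Perfect predictions give 0.
preds = targets
loss = sum(mse(p, t) for p, t in zip(preds, targets))

# Inference: average all levels at input resolution and read off the peak.
agg = aggregate(preds, IMG)
px, py = argmax2d(agg)
print(loss, (px, py))
```

In this toy case the averaged map peaks at the keypoint location, with the 1/2-resolution level narrowing the peak that the 1/4-resolution level alone would spread over a 4x4 block. A real implementation would operate on network outputs for all keypoint channels and would typically use a smoother resizing method.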

4. Experiments

Experiments are conducted primarily on the COCO and CrowdPose benchmarks. On COCO test-dev, HigherHRNet achieves 70.5% AP, setting a new state of the art for bottom-up methods. The improvement is most pronounced for medium-sized persons (+2.5% AP over the previous best bottom-up method), validating the scale-aware design. On CrowdPose, the method reaches 67.6% AP, surpassing top-down approaches, which struggle with heavily overlapping persons where person detection is unreliable. Ablation studies confirm that both multi-resolution supervision and aggregation contribute to the final performance, with the high-resolution branch providing the most significant gains on small and medium persons.

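Paragraph function: Provides broad experimental evidence, validating the method on two benchmarks and analyzing the contribution of each component through ablation studies.

Logical role: The empirical support covers four dimensions: (1) overall COCO performance (70.5% AP); (2) scale-specific improvement (+2.5% AP for medium persons); (3) advantage in crowded scenes (surpassing top-down methods on CrowdPose); (4) ablation validation (contribution of each component).

Argumentative technique / potential weaknesses: The CrowdPose result is especially persuasive, since it shows the advantage of a bottom-up method in exactly the scenario where top-down methods are weak. Improvement for small persons (APsmall) is not prominently reported, which may suggest limited gains at extremely small scales. Note, though, that the standard COCO keypoint evaluation reports AP only for medium and large persons, so this silence may partly reflect the benchmark rather than the method. The comparison of inference speed is also not fully presented.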

5. Conclusion

HigherHRNet addresses the scale variation problem in bottom-up human pose estimation through high-resolution feature pyramids with multi-resolution supervision and aggregation. The method achieves state-of-the-art results on COCO and demonstrates particular strength in crowded scenes on CrowdPose. The scale-aware design principle provides a general framework for improving bottom-up pose estimation and potentially other dense prediction tasks requiring multi-scale reasoning.

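Paragraph function: Summarizes the paper by restating the problem, method, and results, and hints at the potential to generalize.

Logical role: The conclusion mirrors the structure of the abstract and closes on the forward-looking note of a "general framework", broadening the paper's reach.

Argumentative technique / potential weaknesses: The "general framework" claim needs experimental support on other tasks; so far it has been validated only on pose estimation. Limitations of the method are not discussed, such as the computational cost of deconvolution and its ability to handle extreme scale ratios.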

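Overview of the Argument Structure

Problem: bottom-up pose estimation; the scale variation challenge
→ Claim: high-resolution feature pyramids; scale-aware representations
→ Evidence: COCO 70.5% AP; CrowdPose 67.6% AP
→ Rebuttal: surpasses top-down methods in crowded scenes
→ Conclusion: scale-aware design; potential as a general framework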

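The Authors' Core Claim (in One Sentence)

By building a high-resolution feature pyramid on top of HRNet and combining it with multi-resolution supervision and aggregation, bottom-up human pose estimation can effectively overcome the scale variation challenge and match or even exceed top-down methods in accuracy.

Strongest Part of the Argument

Surpassing top-down methods on CrowdPose: in crowded scenes with heavily overlapping persons, top-down methods degrade because the person detector fails, and HigherHRNet's bottom-up strategy shows a structural advantage. The +2.5% AP improvement for medium persons directly validates the targeted effect of the scale-aware design.

Weakest Part of the Argument

Silence on small persons: although handling scale variation is the method's core motivation, little is said about the specific improvement for small persons (APsmall), which may suggest that the high-resolution branch helps less than expected at extremely small scales. This criticism is tempered by the fact that the standard COCO keypoint evaluation reports AP only for medium and large persons. In addition, multi-resolution aggregation uses a simple averaging strategy that lacks adaptivity, and the analysis of the extra computational cost is not thorough enough.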