Sapiens: Two-Column Annotated Reading

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We show that self-supervised pretraining on a curated dataset of human images significantly boosts performance for a diverse set of human-centric tasks, given the same computational budget. Model performance across tasks consistently improves as the number of parameters is scaled from 0.3 to 2 billion. Sapiens achieves significant improvements over prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.

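Paragraph function: Overview of the whole paper. Defines the model family, the four tasks, the pretraining strategy, and the quantitative results.

Logical role: Positions Sapiens as a foundation model, stresses that a single pretraining run adapts to multiple tasks, and builds confidence with numbers that far exceed the prior state of the art.

Argumentative technique / potential weakness: The gains over the state of the art on all four tasks are striking (especially the 53.5% for surface normals), but improvements this large may be partly attributable to the immaturity of the benchmarks themselves.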

1. Introduction

Human-centric vision encompasses a wide range of tasks that focus on understanding human appearance, pose, and geometry from images. These tasks are critical for applications in augmented reality, fitness tracking, virtual try-on, human-computer interaction, and autonomous systems. Despite their importance, existing approaches typically train separate, task-specific models, each requiring specialized architectures and training procedures. This fragmented approach limits knowledge sharing across related tasks and makes it difficult to leverage the full potential of large-scale pretraining. In contrast, foundation models in NLP and general vision have demonstrated that a single, well-pretrained model can serve as a powerful starting point for diverse downstream tasks.

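Paragraph function: Establishes the research setting by contrasting the fragmented status quo with the success of foundation models.

Logical role: Uses the success of foundation models in NLP and general vision as precedent to argue that human-centric vision needs the same approach.

Argumentative technique / potential weakness: The argument by analogy works but calls for caution: the particular difficulties of human-centric tasks (such as occlusion and joint articulation) may mean that a generic foundation-model strategy needs additional adjustment.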

2. Method

The Sapiens framework consists of two stages: large-scale self-supervised pretraining followed by task-specific fine-tuning. For pretraining, we curate a dataset of over 300 million human images from public sources, filtered to ensure high quality and diversity in terms of body poses, camera viewpoints, clothing, and environmental conditions. We use Masked Autoencoder (MAE) as our pretraining objective, which learns representations by randomly masking large portions (75%) of the input image and reconstructing the missing patches. The key distinction from general-purpose MAE pretraining is our focus on human-only images, which allows the model to develop specialized representations for human body structure.

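Paragraph function: Lays out the core method: a two-stage framework with domain-specialized pretraining.

Logical role: The combination of 300 million human images and MAE is the heart of the method: scale is traded for quality, and domain focus for specialization.

Argumentative technique / potential weakness: The data engineering behind curating 300 million images is itself a significant contribution, but it also makes the method very costly to reproduce.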
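To make the 75% masking step concrete, here is a minimal sketch of MAE-style random patch selection. It is an illustration only, not the authors' implementation: the 16x16 patch size, the function name `random_mask`, and the fixed seed are assumptions that do not come from the text above.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setup: a 1024x1024 image cut into 16x16 patches (patch size is hypothetical).
image_size, patch_size = 1024, 16
num_patches = (image_size // patch_size) ** 2

visible, masked = random_mask(num_patches, mask_ratio=0.75, seed=0)
print(num_patches, len(visible), len(masked))  # 4096 1024 3072
```

In an MAE setup only the visible patches are passed through the encoder, and a decoder is trained to reconstruct the masked ones, so under these assumptions the encoder would see 1024 of the 4096 patches per image.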

We adopt Vision Transformer (ViT) as our backbone architecture, scaling the backbone from 0.3B to 2B parameters. A critical design choice is our support for native 1024x1024 resolution during both pretraining and fine-tuning. This is essential for human-centric tasks because human body details such as finger positions, facial features, and clothing textures require high-resolution input to be accurately perceived. For each downstream task, we attach a simple task-specific head to the pretrained backbone and fine-tune end-to-end. The simplicity of adaptation, which requires only the addition of a lightweight head, underscores the quality of the learned representations.

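Paragraph function: Explains the architecture choice and why high resolution is necessary.

Logical role: 1K resolution is presented not as a mere engineering choice but as a requirement driven by the tasks, which strengthens the justification for the design.

Argumentative technique / potential weakness: Equating simple adaptation with representation quality is an elegant argument. However, the inference cost of the largest (2B-parameter) model at 1024x1024 may limit practical deployment.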
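The deployment concern can be made tangible with a rough token-count calculation. The sketch below compares a 1024x1024 input with a 224x224 input, assuming a 16x16 patch size and plain global self-attention. Neither the patch size nor the 224x224 reference resolution comes from the text above, and the pairwise-interaction count is only a proxy for attention cost, not a measured figure.

```python
def num_tokens(height, width, patch_size=16):
    """Number of non-overlapping patches (tokens) a ViT sees for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(tokens):
    """Token-to-token interactions in one global self-attention layer."""
    return tokens * tokens


low = num_tokens(224, 224)      # assumed reference resolution
high = num_tokens(1024, 1024)   # resolution stated in the text

token_ratio = high / low
pair_ratio = attention_pairs(high) / attention_pairs(low)

print(low, high)                                 # 196 4096
print(round(token_ratio), round(pair_ratio))     # 21 437
```

Under these assumptions the high-resolution input has about 21 times as many tokens, and the quadratic attention term grows by roughly 437 times, which is why the annotation's worry about inference cost is reasonable.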

3. Experiments

We evaluate Sapiens on four tasks across multiple benchmarks. For 2D pose estimation on Humans-5K, Sapiens-2B achieves 78.2 mAP, improving over the previous state-of-the-art by 7.6 mAP. For body-part segmentation on Humans-2K, we achieve 68.3 mIoU, a 17.1 mIoU improvement. For depth estimation on Hi4D, we reduce the relative RMSE by 22.4%, and for surface normal prediction on THuman2, we reduce the relative angular error by 53.5%. Critically, we demonstrate consistent scaling behavior across all tasks: performance improves monotonically as we scale from 0.3B to 2B parameters, with no signs of saturation, suggesting that further scaling would yield additional improvements.

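Paragraph function: Provides the core empirical evidence: comprehensive quantitative results on the four tasks and the scaling trend.

Logical role: The numbers themselves are the strongest argument. The scaling trend further suggests large untapped potential.

Argumentative technique / potential weakness: The "no saturation" observation is both a strength (large potential) and a possible hint that the current models are not yet optimal. Whether the linear growth in compute requirements is sustainable deserves attention.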
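The reported numbers mix absolute gains (mAP, mIoU) with relative error reductions (RMSE, angular error), so it helps to see how each kind is computed. The following is a generic sketch of the standard metric definitions. The exact evaluation protocol (for example, masking to human pixels or depth scale alignment) is not given in the text above, and the baseline error of 1.0 used for the relative reduction is purely hypothetical.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of depths."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def mean_angular_error_deg(pred_normals, target_normals):
    """Mean angle in degrees between predicted and reference normal vectors."""
    total = 0.0
    for p, t in zip(pred_normals, target_normals):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        total += math.degrees(math.acos(cosine))
    return total / len(pred_normals)


def relative_reduction(baseline, new):
    """Fraction by which an error metric drops relative to a baseline."""
    return (baseline - new) / baseline


# Absolute gains: the prior state of the art implied by the reported numbers.
prior_pose_map = round(78.2 - 7.6, 1)    # 70.6 mAP
prior_seg_miou = round(68.3 - 17.1, 1)   # 51.2 mIoU

# Relative reduction: a hypothetical baseline error of 1.0 dropping to 0.776
# corresponds to the 22.4% figure quoted for depth.
depth_reduction = relative_reduction(1.0, 0.776)

print(prior_pose_map, prior_seg_miou, round(depth_reduction, 3))  # 70.6 51.2 0.224
```

A relative reduction depends on the size of the baseline error, so a 53.5% drop says little about absolute accuracy without the baseline value, which is consistent with the caution raised in the abstract's annotation.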

4. Conclusion

We have introduced Sapiens, a family of foundation models for human-centric vision that demonstrates the power of large-scale, domain-specific self-supervised pretraining. By curating 300M+ human images and pretraining ViT models at 1K resolution with up to 2B parameters, we achieve significant advances across four fundamental human vision tasks. Our results establish that the foundation model paradigm — pretrain once, fine-tune for many tasks — is highly effective for the human-centric vision domain. We release our pretrained models and code to accelerate research in this important area.

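Paragraph function: Summarizes the paper, restating the effectiveness of the foundation-model paradigm and announcing the open release.

Logical role: Closes with the open-release commitment, which is both an academic contribution and an investment in the ecosystem.

Argumentative technique / potential weakness: Open release is an effective way to widen impact. As research from Meta, Sapiens shows a balance between the resource advantages of a large company and academic openness.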

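Argument Structure Overview

Problem: Human-centric vision tasks are fragmented
→
Thesis: A domain-specialized foundation model can address them in a unified way
→
Method: MAE pretraining on 300 million images, then fine-tuning
→
Evidence: All four tasks substantially exceed the prior state of the art
→
Conclusion: A foundation-model paradigm for human-centric vision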

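Core claim

ViT models pretrained with self-supervision on a curated, large-scale dataset of human images can, with simple fine-tuning, significantly exceed the prior state of the art on four fundamental human-centric vision tasks.

Strongest point

The consistent scaling trend across all four tasks (performance grows monotonically with parameter count and shows no saturation) is the most persuasive finding, and it suggests large potential for the foundation-model approach in human-centric vision.

Weakest point

The curation details and quality control for the 300 million images are not described in sufficient depth. In addition, the absence of a compute-cost analysis makes the method's practical accessibility hard to assess.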