Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos

Abstract

This paper presents a method to learn detailed depth of dressed humans from social media dance videos. The key insight is that social media dance videos provide natural multi-view supervision through diverse poses and viewpoints, enabling self-supervised learning of fine-grained human geometry. The proposed self-supervised framework recovers detailed geometry including clothing wrinkles and surface deformations without requiring any depth sensor data or 3D annotations. The method produces high-fidelity depth maps that surpass methods trained on synthetic data, demonstrating the power of leveraging abundantly available in-the-wild video data.

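Paragraph function: Overview of the whole paper. It opens with an unconventional data source, social media dance videos, as the hook, and outlines the core insight and results.

Logical role: The abstract's argumentative strategy is appealing: it first draws attention with a counterintuitive data source (dance videos), then explains why that source is reasonable (natural multi-view observations), and closes with results that surpass methods trained on synthetic data.

Argumentative technique / potential weaknesses: The choice of social media dance videos combines academic novelty with popular appeal, which helps broaden the paper's impact. However, the premise of "natural multi-view supervision" needs rigorous justification: it is unclear whether the viewpoint variation in dance videos is sufficient to replace a structured multi-view capture system.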

1. Introduction

Reconstructing detailed 3D geometry of dressed humans is a fundamental challenge in computer vision, with applications in virtual try-on, telepresence, and digital content creation. Existing approaches rely on either expensive depth sensors (e.g., structured light, LiDAR) that are impractical for casual capture, or synthetic training data that fails to capture the diversity and complexity of real-world clothing. Methods trained on parametric body models like SMPL produce smooth body shapes but miss fine surface details such as clothing wrinkles, accessories, and hair.

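Paragraph function: Establishes the problem background. It defines the target task and systematically critiques three kinds of limitations in existing methods.

Logical role: The starting point of the argument chain. A threefold critique (expensive sensors, inadequate synthetic data, overly smooth parametric models) builds a comprehensive research gap and creates the need for the unconventional approach of using dance videos.

Argumentative technique / potential weaknesses: Each of the three critiques targets a different class of methods, showing a thorough grasp of the literature. However, characterizing SMPL-based methods as missing detail is somewhat unfair, since recent implicit-function methods (such as PIFu) have made notable progress on this front.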

The authors observe that social media platforms contain an enormous volume of dance videos where individuals perform diverse movements in front of relatively static cameras. These videos naturally provide multi-view observations of the same person from different angles as the dancer rotates and moves through various poses. This observation motivates the key idea: instead of collecting multi-view data with calibrated cameras or generating synthetic data, one can leverage the natural multi-view information embedded in freely available dance videos to supervise depth learning.

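Paragraph function: Presents the core insight, reinterpreting dance videos as a natural source of multi-view data.

Logical role: This is the most creative step in the paper's argument: through reframing, everyday social media content is given a rigorous geometric meaning.

Argumentative technique / potential weaknesses: The "natural multi-view" argument is persuasive and elegant. Note, however, that the multiple views in dance videos arise from body motion rather than camera motion, which is fundamentally different from the assumption in traditional multi-view reconstruction (static scene, moving camera). The non-rigid deformation introduced by pose changes is the core challenge.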

2. Related Work

Monocular human depth estimation methods have evolved from template-based fitting of parametric models (SMPL, SCAPE) to learning-based approaches using implicit functions (PIFu, PIFuHD). While PIFuHD produces impressive results, it relies on synthetic training data rendered from 3D scans, limiting its generalization to in-the-wild scenarios. Self-supervised depth estimation from video has been explored for general scenes, using photometric consistency as supervision, but these methods struggle with non-rigid human motion and typically produce coarse depth maps.

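Paragraph function: Literature review. It outlines the progression from parametric models to implicit functions to self-supervised methods.

Logical role: Positions the paper's method: it combines the fine detail of implicit functions with the annotation-free advantage of self-supervised learning, while overcoming the limitations of each.

Argumentative technique / potential weaknesses: The paragraph precisely identifies PIFuHD's dependence on synthetic data and the coarseness of self-supervised methods, carving out space in the literature for a method positioned as "fine detail + real data". The critique of PIFuHD, however, may need more quantitative support.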

3. Proposed Approach

3.1 Multi-View Supervision from Dance Videos

The framework processes dance videos by first estimating the 3D pose (using an off-the-shelf SMPL estimator) and camera parameters for each frame. For any pair of frames showing the same person in different poses, the method establishes dense correspondences by mapping pixels through the estimated body model — warping from the source frame's surface to the canonical pose, then to the target frame's surface. This body-mediated correspondence enables photometric supervision: the depth prediction should be consistent with the warped appearance from other viewpoints.
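
To make the body-mediated correspondence concrete, the following is a minimal sketch, not the authors' implementation. It assumes the SMPL mesh has the same topology in every frame, that its vertices have already been projected to 2D in both frames, and that the triangle covering the source pixel is known. The function names (`barycentric_coords`, `transfer_pixel`) and array layouts are hypothetical, and visibility and occlusion are ignored. Because barycentric coordinates on a mesh triangle do not depend on pose, they play the role of the canonical surface location shared by both frames.

```python
import numpy as np


def barycentric_coords(p, tri):
    """Barycentric coordinates of 2D point p w.r.t. a 2D triangle (3x2 array)."""
    a, b, c = tri
    # Solve p - a = u * (b - a) + v * (c - a) for (u, v).
    m = np.column_stack((b - a, c - a))
    u, v = np.linalg.solve(m, p - a)
    return np.array([1.0 - u - v, u, v])


def transfer_pixel(p_src, face, verts2d_src, verts2d_tgt):
    """Map a source pixel to the target frame through a shared mesh triangle.

    p_src       : (2,) pixel location in the source frame
    face        : three vertex indices of the triangle covering p_src (assumed known)
    verts2d_src : (V, 2) projected body vertices in the source frame
    verts2d_tgt : (V, 2) projected body vertices in the target frame
    """
    # Pose-independent surface coordinates of the pixel on the body mesh.
    weights = barycentric_coords(p_src, verts2d_src[face])
    # The same surface point, located in the target frame.
    return weights @ verts2d_tgt[face]
```

If a triangle moves by a pure scale and translation between two frames, a pixel inside it follows the same transform. A photometric term would then compare the source color at `p_src` with the target color at the returned location.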

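Paragraph function: The core of the method derivation. It describes how multi-view supervision signals are extracted from dance videos.

Logical role: This is the key step that turns the insight that dance videos contain multi-view information into a concrete technical solution. The SMPL model serves as the bridge that links pixel correspondences across different poses.

Argumentative technique / potential weaknesses: Body-mediated correspondence is a clever design, but its quality depends heavily on the accuracy of the SMPL estimate. SMPL itself does not model clothing, yet the paper's goal is precisely to recover clothing detail. This circular dependency needs careful handling.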

3.2 Depth Estimation Network

The depth estimation network takes a single RGB image as input and predicts a detailed depth map of the visible human. The network architecture employs a coarse-to-fine strategy: first predicting a base depth aligned with the SMPL body, then refining it with a detail network that adds high-frequency surface information. The training loss combines photometric consistency across frames, perceptual similarity, and a smoothness regularization term. The detail network specifically learns the residual between the smooth body model and the actual clothed surface, capturing wrinkles, folds, and accessories that parametric models cannot represent.
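
The coarse-to-fine design can be read as an additive decomposition: the final depth is a smooth base depth plus a learned detail residual. The sketch below only illustrates that composition, under the assumption that the two network outputs are already available as arrays. `compose_depth`, its arguments, and the use of NaN for background pixels are illustrative choices rather than details taken from the paper.

```python
import numpy as np


def compose_depth(base_depth, detail_residual, mask):
    """Combine a smooth base depth with a high-frequency residual.

    base_depth      : (H, W) coarse depth aligned with the body model
    detail_residual : (H, W) offsets for wrinkles, folds, accessories
    mask            : (H, W) boolean, True on human pixels
    """
    refined = base_depth + detail_residual
    # Background pixels carry no human depth, so mark them as NaN.
    return np.where(mask, refined, np.nan)
```

Keeping the residual separate means the detail branch only has to model small deviations from the body surface, which is the motivation the text gives for residual learning.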

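Paragraph function: Technical details. It describes the coarse-to-fine depth estimation architecture and the loss design.

Logical role: The coarse-to-fine strategy exploits the strength of the SMPL model (coarse geometry) while overcoming its limitation (lack of detail). Residual learning focuses the network on the most valuable high-frequency information.

Argumentative technique / potential weaknesses: Residual learning is a sound engineering choice that makes the learning problem easier. However, the quality of the detail network depends on the accuracy of the base depth: if the SMPL estimate is systematically biased, the residual may learn an incorrect compensation. The relative weights of the loss terms are also hyperparameters that need careful tuning.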
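
The training loss in this section combines three terms: photometric consistency, perceptual similarity, and smoothness. The sketch below shows one plausible way to form such a weighted sum. The weights (`w_photo`, `w_perc`, `w_smooth`) are placeholder values, not the paper's settings. The perceptual term is passed in as a precomputed number because it would normally come from a separate feature network, and the L1 and first-difference forms are assumptions.

```python
import numpy as np


def photometric_l1(target, warped, mask):
    """Mean absolute color difference over masked pixels; images are (H, W, 3)."""
    per_pixel = np.abs(target - warped).mean(axis=-1)
    return float(per_pixel[mask].mean())


def smoothness(depth):
    """Mean absolute first difference of the depth map along both image axes."""
    dx = np.abs(np.diff(depth, axis=1))
    dy = np.abs(np.diff(depth, axis=0))
    return float(dx.mean() + dy.mean())


def total_loss(target, warped, mask, depth, perceptual,
               w_photo=1.0, w_perc=0.1, w_smooth=0.01):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_photo * photometric_l1(target, warped, mask)
            + w_perc * perceptual
            + w_smooth * smoothness(depth))
```

A smoothness weight that is too large would suppress the wrinkles the detail network is meant to recover, which is one reason the balance between terms matters.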

4. Experiments

The method is trained on approximately 3,600 TikTok dance videos collected from social media. Evaluation on the BUFF dataset (which provides ground-truth 3D scans of clothed humans) shows that the method achieves depth error of 5.7mm on average, outperforming PIFuHD (7.2mm) which is trained on synthetic 3D scan data. Qualitative results demonstrate the method's ability to recover fine-grained details including clothing wrinkles, hood contours, and skirt folds that other methods miss. The framework generalizes to diverse body types, clothing styles, and poses not seen during training, showing strong in-the-wild robustness. Ablation studies confirm that the multi-view supervision from dance videos is essential — models trained without it show significantly degraded detail recovery.
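
The text does not spell out how the depth error is computed. As an illustration only, the sketch below assumes a mean absolute difference over human pixels, with predicted and ground-truth depth given in meters in the same camera frame. The function names are hypothetical, and any alignment step the actual evaluation may use is omitted.

```python
import numpy as np


def mean_depth_error_mm(pred_m, gt_m, mask):
    """Mean absolute depth error in millimeters over masked pixels."""
    return float(np.abs(pred_m - gt_m)[mask].mean() * 1000.0)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

With the numbers reported above, `relative_reduction(7.2, 5.7)` is about 0.21, that is, roughly 21% lower error than PIFuHD.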

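Paragraph function: Provides comprehensive experimental evidence, covering quantitative comparison, qualitative results, generalization, and ablation studies.

Logical role: This paragraph is the empirical backbone of the paper and covers five dimensions: (1) description of the training data; (2) quantitative improvement over PIFuHD; (3) qualitative demonstration of fine detail; (4) generalization; (5) ablation studies confirming that multi-view supervision is essential.

Argumentative technique / potential weaknesses: The 5.7mm vs. 7.2mm comparison is impressive, especially given that this method uses freely available social media videos while PIFuHD uses costly 3D scan data. However, the BUFF dataset is small and its scenes are constrained, so a larger-scale quantitative evaluation would be more convincing. In addition, the collection and filtering criteria for the 3,600 videos are not adequately described.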

5. Conclusion

This work demonstrates that social media dance videos are a surprisingly powerful source of supervision for learning detailed human depth estimation. The self-supervised framework leveraging natural multi-view information through body-mediated correspondences produces high-fidelity depth maps that capture fine clothing details, surpassing methods trained on synthetic data. This paradigm of repurposing abundantly available in-the-wild videos as structured training data opens new possibilities for 3D human understanding without expensive data collection infrastructure.

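Paragraph function: Summarizes the paper, revisiting the core insight and drawing out broader implications.

Logical role: The conclusion moves from a specific result (depth estimation performance) to a general principle (repurposing in-the-wild videos as training data), broadening the paper's impact and appeal.

Argumentative technique / potential weaknesses: The phrase "surprisingly powerful" neatly emphasizes how counterintuitive the finding is. However, the conclusion does not discuss the method's limitations (such as its dependence on SMPL estimation, occlusion handling, and multi-person scenes), nor the potential to extend this paradigm to other 3D understanding tasks (such as hands and faces).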

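Overview of the Argument Structure

Problem: depth estimation of dressed humans lacks detailed real-world data
→ Claim: dance videos provide natural multi-view supervision
→ Evidence: 5.7mm depth error, surpassing PIFuHD (7.2mm)
→ Rebuttal: ablation studies confirm that multi-view supervision is essential
→ Conclusion: in-the-wild videos can serve as a supervision source for 3D learning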

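The Authors' Core Claim (in One Sentence)

The natural multi-view information that social media dance videos provide through body motion can serve as an effective self-supervised signal for learning detailed depth estimation of dressed humans, without expensive sensors or synthetic data.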

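Strongest Point of the Argument

Creativity and practicality of the data source: turning free, abundantly available social media videos into structured training data strikes a strong balance between academic novelty and practical deployability. Reaching a depth error of 5.7mm and surpassing PIFuHD (7.2mm), which is trained on costly 3D scans, gives strong support to the claim that more real data can beat a small amount of precise data.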

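Weakest Point of the Argument

Circularity of the SMPL dependence: the method aims to recover fine geometry beyond SMPL, yet establishing multi-view correspondences depends on the accuracy of the SMPL estimate. When SMPL fitting fails (for example, under extreme poses or loose clothing), the quality of the correspondences degrades, which in turn weakens the supervision signal. In addition, evaluation is carried out only on the small BUFF dataset; a larger-scale quantitative benchmark would make the conclusions more robust.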