DeepCap: Monocular Human Performance Capture Using Weak Supervision

Abstract

This paper proposes DeepCap, a novel deep learning approach for monocular dense human performance capture. The method is trained in a weakly supervised manner based on multi-view supervision, removing the need for 3D ground truth annotations. The architecture disentangles the task into pose estimation and non-rigid surface deformation, enabling dense tracking of human body shape and motion from a single RGB camera.

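Paragraph function: Overview of the whole paper. It highlights three core features: monocular input, weakly supervised training, and a task-disentangled architecture.

Logical role: The abstract makes "removing the need for 3D ground truth annotations" its main selling point, directly addressing the field's data bottleneck. The mention of task disentanglement foreshadows the technical depth of the method.

Argumentation technique / potential weakness: The "weak supervision" framing strategically reduces the reader's concern about data requirements, but multi-view images are still needed as training supervision, and collecting them is not free.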

1. Introduction

Human performance capture — reconstructing dense 3D human geometry and motion — is fundamental to applications in visual effects, virtual reality, and telepresence. Existing approaches either require expensive multi-camera setups at test time or rely on 3D ground truth data that is extremely difficult to obtain at scale. The central challenge is: can we achieve dense performance capture from a single monocular camera without requiring 3D supervision?

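Paragraph function: Establishes the research motivation by defining the task, naming the application scenarios, and exposing the data and hardware bottlenecks of existing methods.

Logical role: Introduces the core research problem as a question, posing the two constraints "monocular" and "no 3D supervision" together to underline how hard the problem is.

Argumentation technique / potential weakness: Placing the cost of multi-camera setups alongside the difficulty of 3D annotation effectively constructs a twofold research gap. However, there is a fundamental tension between the accuracy that "dense capture" demands and the limited information in "monocular input", and the paragraph does not address it up front.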

Recent learning-based methods for human pose estimation (e.g., HMR, SPIN) focus on parametric body model fitting but capture only coarse body shape without surface details like clothing deformation. Optimization-based methods can recover fine details but are slow and require careful initialization. DeepCap bridges this gap through a learning-based approach that captures both skeletal pose and non-rigid surface deformation.

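Paragraph function: Critiques existing methods, pointing out the insufficient accuracy of learning-based methods and the insufficient efficiency of optimization-based methods.

Logical role: A classic "best of both worlds" argument: learning-based methods are fast but coarse, optimization-based methods are detailed but slow, and DeepCap offers both.

Argumentation technique / potential weakness: The dichotomous critique is concise and effective, but it may overlook recent progress in hybrid methods (for example, approaches that combine SMPL-based fitting with learning). The claim of "bridging the gap" needs quantitative support from the experiments section.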

2. Related Work

Multi-view performance capture systems use multiple synchronized cameras to reconstruct dense geometry via multi-view stereo or volumetric fusion, achieving high quality but limiting deployment to controlled studio environments. Template-based tracking methods deform a pre-scanned mesh to match observations using optimization-based energy minimization. Monocular approaches using SMPL or SCAPE body models predict pose parameters but cannot model non-rigid deformations beyond the parametric model space.

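Paragraph function: Literature review. It systematically surveys three families of methods (multi-view, template tracking, monocular parametric) and their limitations.

Logical role: By analyzing the limitations of the three families, the paragraph pinpoints the space for DeepCap's contribution: monocular (no multi-camera rig needed), dense (beyond parametric models), and learning-based (faster than optimization).

Argumentation technique / potential weakness: The three-way organization of the literature is clear, and each family is assigned one explicit shortcoming. However, the criticism "cannot model non-rigid deformations" may be too absolute for SMPL: extensions such as SMPL+D already address this in part.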

3. Proposed Approach

3.1 Pose Estimation Network

The first stage predicts skeletal pose parameters from a single RGB image. A convolutional encoder extracts image features and regresses joint angles and global translation of a predefined skeleton. Training uses multi-view consistency as supervision: the predicted 3D pose is projected onto each camera view, and the reprojection error across all views serves as the loss function. This eliminates the need for 3D pose ground truth annotations, using only 2D keypoint detections from multi-view images.

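Paragraph function: First stage of the method. It describes the architecture of the pose estimation network and its weakly supervised training strategy.

Logical role: This paragraph establishes the core mechanism of weak supervision: multi-view geometric consistency replaces 3D annotations. The reprojection error is the mathematical bridge between 2D observations and 3D predictions.

Argumentation technique / potential weakness: Using multi-view reprojection as weak supervision is a clever design, but the method still needs multi-view data at training time. The definition of "weak supervision" deserves scrutiny: such data is indeed easier to obtain than 3D annotations, but it is not "few annotations" in the strict sense.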
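
The multi-view reprojection supervision described for the pose network can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `project` and `multiview_reprojection_loss`, the 3x4 pinhole camera matrices, the optional confidence weighting, and the toy camera setup are all assumptions made for illustration. The sketch only shows how a single set of predicted 3D joints can be scored against 2D keypoint detections in several calibrated views without any 3D ground truth.

```python
import numpy as np

def project(points_3d, P):
    """Project (J, 3) world points to (J, 2) pixels with a 3x4 camera matrix P."""
    homo = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def multiview_reprojection_loss(joints_3d, cameras, keypoints_2d, confidences=None):
    """Confidence-weighted mean squared 2D error over all views and joints.

    joints_3d:    (J, 3) predicted 3D joint positions
    cameras:      list of C projection matrices, each (3, 4)  (assumed calibrated)
    keypoints_2d: (C, J, 2) detected 2D keypoints per view
    confidences:  (C, J) detection weights, or None for uniform weights (assumption)
    """
    keypoints_2d = np.asarray(keypoints_2d, dtype=float)
    num_views, num_joints, _ = keypoints_2d.shape
    if confidences is None:
        confidences = np.ones((num_views, num_joints))
    total = 0.0
    for c, P in enumerate(cameras):
        residual = project(joints_3d, P) - keypoints_2d[c]
        total += np.sum(confidences[c] * np.sum(residual ** 2, axis=1))
    return total / np.sum(confidences)

# Toy setup (hypothetical values): two cameras with the same intrinsics,
# the second one shifted sideways.
K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[0.5], [0.0], [0.0]])])
true_joints = np.array([[0.0, 0.0, 3.0], [0.2, -0.5, 3.1], [-0.3, 0.4, 2.9]])
detections = np.stack([project(true_joints, P1), project(true_joints, P2)])

loss_correct = multiview_reprojection_loss(true_joints, [P1, P2], detections)
loss_shifted = multiview_reprojection_loss(true_joints + 0.05, [P1, P2], detections)
print(loss_correct, loss_shifted)
```

In this toy setup the loss vanishes for the joints that generated the detections and grows once the prediction is shifted, which is the signal a network would be trained to minimize. A real training loop would use a differentiable framework rather than NumPy.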

3.2 Non-rigid Deformation Network

The second stage predicts per-vertex displacements on top of the posed body template to capture clothing deformation and surface details. Given the pose-driven mesh from stage one, a graph convolutional network processes the mesh structure combined with image features projected onto mesh vertices. Training supervision comes from multi-view silhouette consistency and photometric loss: the deformed mesh is rendered from each view and compared against observed silhouettes and RGB images.

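Paragraph function: Second stage of the method. It describes how non-rigid deformation is predicted and how the network is trained.

Logical role: This paragraph completes the second half of the task disentanglement: pose estimation provides the coarse body shape, and the non-rigid deformation network adds detail on top of it. The supervision for both stages comes from multi-view observations, which keeps the "no 3D annotation" claim consistent.

Argumentation technique / potential weakness: Processing the mesh with a graph convolutional network is a reasonable architectural choice, but predicting per-vertex displacements may produce unnatural surface distortions. In addition, silhouette supervision only weakly constrains concave regions, such as the inside of deep clothing folds.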
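
To make the idea of per-vertex displacements scored by multi-view silhouette consistency concrete, here is a deliberately crude NumPy sketch. It is an illustration under strong assumptions, not the method from the paper. Vertices are "rendered" by splatting each one to a single pixel instead of rasterizing triangles, and the comparison uses 1 - IoU, which is not differentiable and is not claimed to be the loss the authors use. The names `deform`, `splat_silhouette`, and `silhouette_loss` are hypothetical.

```python
import numpy as np

def project(points_3d, P):
    """Project (V, 3) points to (V, 2) pixels with a 3x4 camera matrix P."""
    homo = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def deform(posed_vertices, displacements):
    """Add per-vertex displacements (V, 3) on top of the posed template (V, 3)."""
    return posed_vertices + displacements

def splat_silhouette(vertices, P, height, width):
    """Crude stand-in for rendering: mark the pixel hit by each projected vertex."""
    uv = np.rint(project(vertices, P)).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    mask = np.zeros((height, width), dtype=bool)
    mask[uv[inside, 1], uv[inside, 0]] = True
    return mask

def silhouette_loss(vertices, cameras, observed_masks):
    """Mean (1 - IoU) between splatted and observed masks over all views."""
    losses = []
    for P, observed in zip(cameras, observed_masks):
        predicted = splat_silhouette(vertices, P, *observed.shape)
        union = np.logical_or(predicted, observed).sum()
        intersection = np.logical_and(predicted, observed).sum()
        losses.append(1.0 - intersection / union if union > 0 else 0.0)
    return float(np.mean(losses))

# Toy setup (hypothetical values): a 5x5 vertex patch seen by two 64x64 cameras.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
cameras = [K @ np.hstack([np.eye(3), np.zeros((3, 1))]),
           K @ np.hstack([np.eye(3), np.array([[0.1], [0.0], [0.0]])])]
xs, ys = np.meshgrid(np.linspace(-0.2, 0.2, 5), np.linspace(-0.2, 0.2, 5))
template = np.stack([xs.ravel(), ys.ravel(), np.full(xs.size, 2.0)], axis=1)
true_offsets = np.tile([0.1, 0.0, 0.0], (template.shape[0], 1))

# "Observed" silhouettes come from the truly deformed surface.
observed = [splat_silhouette(deform(template, true_offsets), P, 64, 64) for P in cameras]

loss_no_deformation = silhouette_loss(template, cameras, observed)
loss_true_deformation = silhouette_loss(deform(template, true_offsets), cameras, observed)
print(loss_no_deformation, loss_true_deformation)
```

The undeformed template overlaps the observed masks only partially, while the correctly displaced vertices reproduce them exactly. Note that this sketch also exhibits the weakness raised in the note above: a silhouette term says nothing about vertices that move without changing the outline.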

4. Experiments

Experiments are conducted on multi-view studio recordings of multiple subjects performing various motions. Quantitative evaluation uses per-vertex error measured against multi-view reconstruction ground truth. DeepCap achieves mean vertex error of approximately 2-3 cm, significantly outperforming HMR and optimization-based baselines. The method runs at near real-time speed, whereas optimization-based approaches require minutes per frame. Ablation studies confirm that both the disentangled architecture and multi-view supervision are critical — removing either significantly degrades performance.

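Paragraph function: Core experimental evidence. Quantitative numbers and ablation studies validate the effectiveness of the method.

Logical role: The experiments validate the method along three dimensions: (1) accuracy (2-3 cm vertex error), (2) speed (near real-time versus minutes per frame), and (3) soundness of the design (ablation studies).

Argumentation technique / potential weakness: The speed comparison (near real-time versus minutes per frame) is a highly persuasive highlight. However, a 2-3 cm vertex error is still rather coarse at the level of clothing detail, and the experiments are limited to studio scenes: generalization to in-the-wild settings (different lighting and backgrounds) is not tested.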
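
For readers unfamiliar with the metric, a mean per-vertex error can be computed as in the short sketch below. The summary does not spell out the exact evaluation protocol, so this assumes that the predicted and reference meshes share vertex correspondence and that coordinates are in metres. The function name `mean_vertex_error` is hypothetical.

```python
import numpy as np

def mean_vertex_error(predicted, reference):
    """Mean Euclidean distance between corresponding vertices.

    Accepts (V, 3) arrays for one frame or (T, V, 3) arrays for a sequence.
    Assumes both meshes share the same vertex ordering (an assumption here).
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.linalg.norm(predicted - reference, axis=-1)))

# Toy example: every vertex is off by 3 cm along x.
reference = np.zeros((4, 3))
predicted = reference + np.array([0.03, 0.0, 0.0])
error_m = mean_vertex_error(predicted, reference)
print(f"{error_m * 100:.1f} cm")
```

A uniform 3 cm offset yields a 3 cm mean error, which gives a sense of scale for the reported 2-3 cm figure: it is an average over the whole surface and can hide larger local errors in regions such as loose clothing.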

5. Conclusion

DeepCap presents the first weakly supervised deep learning method for monocular dense human performance capture. By disentangling pose estimation and non-rigid deformation and leveraging multi-view supervision during training, the method achieves accurate dense reconstruction from a single camera at test time without requiring any 3D ground truth annotations. The approach opens new possibilities for accessible human performance capture outside controlled studio environments.

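Paragraph function: Summarizes the paper, restating its position as the "first" weakly supervised monocular method and looking ahead to applications.

Logical role: The conclusion stresses novelty with "first" and practicality with "without requiring any 3D ground truth annotations", echoing the abstract.

Argumentation technique / potential weakness: "Possibilities outside the studio" is a forward-looking outlook, but the experiments were not validated outside the studio. The conclusion also does not discuss the method's limitations, such as the training dependence on calibrated multi-view data and limited performance on loose clothing.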

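Argument Structure Overview

Problem: dense human capture requires multiple cameras or 3D annotations
→ Claim: a weakly supervised, disentangled architecture can reconstruct from monocular images
→ Evidence: 2-3 cm vertex error; near real-time speed
→ Rebuttal: ablation studies confirm that both disentanglement and multi-view supervision are critical
→ Conclusion: the first weakly supervised method for monocular dense performance capture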

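The Authors' Core Claim (in One Sentence)

By disentangling pose estimation from non-rigid deformation and replacing 3D ground truth annotations with multi-view consistency as the training supervision, DeepCap is the first weakly supervised deep learning method to achieve dense human performance capture from a single camera.

Strongest Point of the Argument

Practicality of the weak supervision strategy: replacing 3D annotations with multi-view reprojection is a shrewd engineering design, since collecting multi-view images costs far less than producing accurate 3D annotations. The ablation study of the disentangled architecture provides convincing validation of the design, and the speed advantage (near real-time versus minutes per frame) demonstrates that practical deployment is feasible.

Weakest Point of the Argument

Limited experimental scope: all experiments are conducted in a controlled studio, and generalization to in-the-wild settings is not validated. A 2-3 cm vertex error is still coarse for fine clothing detail, and the method still depends on calibrated multi-view data at training time, so the "weak supervision" label may overstate how low the barrier to entry is.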