Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies

Abstract

We present a unified deformation model for the markerless capture of multiple scales of human movement, including facial expressions, body motion, and hand gestures. We develop two models: an initial "Frankenstein" model created by stitching together existing body, face, and hand components, and an improved "Adam" model optimized using captures of 70 people wearing everyday clothes. Our approach is the first markerless method to capture total body motion at a distance, including facial expression, coarse body motion of the torso and limbs, and hand gestures.

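Paragraph function: Overview of the whole paper. With "unified" as the keyword, it previews the core contribution of capturing the face, hands, and body at the same time.

Logical role: The abstract does double duty as scope definition and priority claim: the word "first" marks novelty, and "multiple scales of human movement" marks coverage.

Argumentation technique / potential weakness: The "Frankenstein" nickname is both humorous and technically suggestive: the initial model is stitched together, so it naturally has seam problems. The claim of being "first" is very strong and needs to be backed in the experiments by explicit baseline comparisons.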

1. Introduction

Capturing the full range of human non-verbal communication requires tracking facial expressions, body pose, and hand gestures simultaneously. Existing methods either focus on individual body parts in isolation — face tracking, body pose estimation, or hand pose estimation — or require specialized hardware such as marker-based motion capture systems. A unified model that captures all scales of human movement from standard multi-view video would have broad applications in social interaction analysis, sign language recognition, and virtual reality.

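Paragraph function: Establishes motivation by pointing out the fragmentation and hardware dependence of existing methods.

Logical role: Sets up the research gap through the contrast between "isolated" and "unified", and strengthens the practical case by listing application scenarios.

Argumentation technique / potential weakness: Reframes the problem from "tracking each part independently" to "unified full-body capture", implying that a unified approach is inherently better. In practice, modular methods may hold an advantage in accuracy, so the authors need to show that unification does not sacrifice accuracy.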

The key challenge lies in the vastly different scales and deformation characteristics of faces, hands, and bodies. Facial expressions involve subtle, millimeter-scale deformations, while body motion spans meters of translation and large articulations. Hands present unique difficulties due to frequent self-occlusion and high degrees of freedom. Our approach addresses this by developing a hierarchical model that represents the body at multiple resolutions, combined with a multi-stage optimization framework.

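Paragraph function: Difficulty analysis, quantifying how the body parts differ in scale and deformation.

Logical role: Deepens the problem: it is not only an organizational issue of parts being handled independently, but also a technical difficulty of differing scales. This directly motivates the hierarchical model design.

Argumentation technique / potential weakness: Grounding the challenge in concrete physical quantities (millimeters vs. meters) makes it tangible and persuasive. However, the "high degrees of freedom" of the hands may remain a bottleneck in the actual solution, so the experiments need to show hand-tracking accuracy.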

2. Related Work

Body pose estimation has advanced rapidly with deep learning methods such as OpenPose, which detects body, face, and hand keypoints in 2D. Parametric body models like SMPL provide a low-dimensional representation of body shape and pose with 6,890 vertices. For faces, 3D Morphable Models (3DMM) and FaceWarehouse capture identity and expression variations. Hand models remain less mature, with most methods limited to close-range RGB-D capture. No prior work has combined all three modalities into a single deformable model for markerless capture.

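Paragraph function: Literature review, summarizing existing methods and gaps in three subfields (body, face, hands).

Logical role: Traces progress along three parallel lines (body, face, hands) that converge on the gap that no one has unified them, which precisely defines the paper's academic position.

Argumentation technique / potential weakness: Admitting that hand models are "less mature" is an honest assessment, but it also hints that the paper's hand capture quality may lag behind that of the face and body. Readers should pay attention to how hand accuracy is reported in the experiments.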

3. Method

3.1 Frankenstein Model

The initial model, dubbed "Frankenstein," is constructed by stitching together three existing component models: SMPL for the body (6,890 vertices), FaceWarehouse for the face (11,510 vertices), and an artist-rigged hand model (2,068 vertices each). These components are combined into a unified skeleton hierarchy with a total of 18,540 vertices. The stitching involves blending the deformations at boundary regions between body, face, and hand meshes to ensure smooth transitions.

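Paragraph function: First stage of the method, describing how the baseline model is built.

Logical role: The Frankenstein model acts as a rapid prototype: stitching first verifies that a unified model is feasible, and the Adam model then improves on it.

Argumentation technique / potential weakness: Listing the exact vertex count of each component adds technical credibility. However, the "boundary blending" is under-specified: how topological inconsistencies between the different models are handled is a key implementation challenge.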
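
The text does not say how the boundary blending is carried out. As a minimal sketch, one simple option is a per-vertex linear ramp between the offsets predicted by two neighbouring components. The function names, the weight ramp, and the choice of linear blending are all assumptions made for illustration, not the paper's formulation. The snippet also checks the quoted vertex counts: the components sum to more than the stitched total, which suggests (an inference from the numbers, not something the text states) that some vertices are dropped or merged where components overlap.

```python
import numpy as np

# Vertex counts quoted in the text above.
BODY, FACE, HAND = 6_890, 11_510, 2_068
STITCHED_TOTAL = 18_540


def vertex_surplus():
    """Vertices left over if every component kept all of its vertices."""
    return BODY + FACE + 2 * HAND - STITCHED_TOTAL


def blend_seam(body_offsets, part_offsets, weights):
    """Blend per-vertex offsets from two components across a seam.

    body_offsets, part_offsets: (N, 3) offsets for the same N seam vertices.
    weights: (N,) ramp in [0, 1]; 0 keeps the body offset, 1 keeps the
    part (face or hand) offset. Hypothetical helper, not from the paper.
    """
    body = np.asarray(body_offsets, dtype=float)
    part = np.asarray(part_offsets, dtype=float)
    w = np.clip(np.asarray(weights, dtype=float), 0.0, 1.0)[:, None]
    return (1.0 - w) * body + w * part


if __name__ == "__main__":
    print("surplus vertices:", vertex_surplus())
    body = np.zeros((3, 3))
    part = np.ones((3, 3))
    print(blend_seam(body, part, [0.0, 0.5, 1.0]))
```

A ramp like this only smooths the transition. It does not address the topology mismatch between component meshes that the commentary above flags.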

3.2 Adam Model

To overcome the limitations of the Frankenstein model, we develop Adam, a simplified, unified parametric model with a coherent parameterization. Adam is learned from captures of 70 subjects and includes shape variations for hair and clothing geometry, making it more applicable to real-world scenarios. The model uses a single consistent skeleton and linear blend skinning, with corrective blend shapes for both pose-dependent and identity-dependent deformations.

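Paragraph function: Method improvement, moving from a stitched model to a unified learned model.

Logical role: Adam resolves Frankenstein's seam problem and represents a methodological upgrade from engineering by stitching to data-driven learning. Including clothing variation further improves practicality.

Argumentation technique / potential weakness: Data from 70 subjects is relatively limited for learning human shape variation (SMPL used thousands of scans). Adding clothing to the model is practical, but it may introduce substantial ambiguity from non-rigid deformation.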
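
Linear blend skinning with identity blend shapes, as named in the Adam description, can be illustrated with a generic sketch. This is the textbook formulation, not Adam's actual parameterization: the array shapes, the argument names, and the omission of pose-dependent correctives are all simplifying assumptions.

```python
import numpy as np


def skin_vertices(template, shape_dirs, betas, weights, transforms):
    """Generic linear blend skinning with identity blend shapes.

    template:   (V, 3) rest-pose vertices
    shape_dirs: (V, 3, B) identity blend-shape directions
    betas:      (B,) identity coefficients
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) rigid transform of each joint
    Pose-dependent corrective shapes are left out for brevity.
    """
    template = np.asarray(template, dtype=float)
    shape_dirs = np.asarray(shape_dirs, dtype=float)
    betas = np.asarray(betas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    transforms = np.asarray(transforms, dtype=float)

    # Identity-dependent deformation: add a linear combination of blend shapes.
    shaped = template + shape_dirs @ betas                      # (V, 3)

    # Homogeneous coordinates so a 4x4 transform can be applied.
    ones = np.ones((shaped.shape[0], 1))
    homo = np.concatenate([shaped, ones], axis=1)               # (V, 4)

    # Per-vertex transform is the weighted sum of the joint transforms.
    per_vertex = np.einsum("vj,jab->vab", weights, transforms)  # (V, 4, 4)
    posed = np.einsum("vab,vb->va", per_vertex, homo)           # (V, 4)
    return posed[:, :3]


if __name__ == "__main__":
    template = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
    shape_dirs = np.zeros((2, 3, 1))
    weights = np.array([[1.0, 0.0], [0.0, 1.0]])
    transforms = np.stack([np.eye(4), np.eye(4)])
    transforms[1, :3, 3] = [1.0, 0.0, 0.0]  # move the second joint along x
    print(skin_vertices(template, shape_dirs, np.zeros(1), weights, transforms))
```

Because one skeleton and one weight matrix cover the whole mesh in this formulation, there are no per-component seams to blend, which is the property the commentary above attributes to Adam.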

3.3 Optimization Framework

The fitting pipeline combines three complementary objectives: anatomical keypoint matching using OpenPose detections (including custom feet keypoints trained on ~5,000 annotated COCO instances), iterative closest point (ICP) alignment to point clouds from multi-view stereo, and temporal smoothing via optical flow propagation. The optimization proceeds in stages: first fitting global pose and body shape, then refining hand articulation and facial expression parameters.

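Paragraph function: Core technique, describing the concrete realization of the multi-objective optimization.

Logical role: The three objectives (keypoints, point cloud, temporal) each address a different aspect: keypoints provide sparse but robust constraints, ICP provides dense geometric constraints, and optical flow ensures temporal coherence.

Argumentation technique / potential weakness: Staged optimization (body first, details later) is a pragmatic strategy, but it also means that hand and face fitting are limited by the quality of the earlier body pose estimate. Errors may cascade from the coarse stage to the fine stage.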
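
To make the combination of objectives concrete, here is a rough sketch of how a keypoint term and an ICP-style term might be summed into one cost. It assumes keypoints are already available in 3D (for example, triangulated from multi-view detections), uses brute-force nearest neighbours, and leaves out the optical-flow term entirely. The function names and the weights `w_kp` and `w_icp` are hypothetical, not taken from the paper.

```python
import numpy as np


def keypoint_cost(model_joints, detected, confidence):
    """Confidence-weighted squared distance between model joints and detections."""
    diff = np.asarray(model_joints, dtype=float) - np.asarray(detected, dtype=float)
    conf = np.asarray(confidence, dtype=float)
    return float(np.sum(conf * np.sum(diff ** 2, axis=1)))


def icp_cost(model_vertices, point_cloud):
    """Mean squared distance from each cloud point to its nearest model vertex."""
    verts = np.asarray(model_vertices, dtype=float)
    cloud = np.asarray(point_cloud, dtype=float)
    # Pairwise squared distances, shape (num_points, num_vertices).
    d2 = np.sum((cloud[:, None, :] - verts[None, :, :]) ** 2, axis=2)
    return float(np.mean(np.min(d2, axis=1)))


def total_cost(model_joints, detected, confidence,
               model_vertices, point_cloud, w_kp=1.0, w_icp=1.0):
    """Weighted sum of the two terms; the weights are illustrative only."""
    return (w_kp * keypoint_cost(model_joints, detected, confidence)
            + w_icp * icp_cost(model_vertices, point_cloud))


if __name__ == "__main__":
    joints = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    detections = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
    conf = np.array([0.5, 1.0])
    verts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
    cloud = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
    print(total_cost(joints, detections, conf, verts, cloud, w_icp=2.0))
```

In a staged scheme like the one described, such a cost would first be minimized over global pose and body shape, and only afterwards over hand and face parameters. The brute-force distance matrix is only practical for small inputs; a spatial index would be the usual replacement.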

4. Experiments

Quantitative evaluation measures silhouette overlap accuracy against ground truth. Results show progressive improvement: SMPL alone achieves 84.79% (±4.55%), Frankenstein reaches 85.91% (±4.57%), Frankenstein with ICP improves to 87.68% (±4.53%), and Adam with ICP achieves the best result at 87.74% (±4.18%). The method is demonstrated on diverse scenarios including social interactions between multiple people, musical performances (piano, violin), and furniture assembly tasks.

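Paragraph function: Empirical validation, demonstrating effectiveness through quantitative metrics and diverse scenarios.

Logical role: The stepwise accuracy improvement (84.79% -> 87.74%) supports the incremental contribution of each component. The diverse scenarios show the method's generality.

Argumentation technique / potential weakness: Adam improves on Frankenstein with ICP by only 0.06 percentage points, so statistical significance is doubtful. Adam's standard deviation is smaller (4.18% vs. 4.53%), which hints at more stable performance. The overall accuracy (about 88%) is reasonable for a markerless system, but there is no direct comparison with other markerless methods.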
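
The summary does not define "silhouette overlap accuracy" precisely. One common way to score overlap between a rendered model silhouette and a reference mask is intersection over union, sketched below. Treating the metric as IoU is an assumption, and the function name is hypothetical.

```python
import numpy as np


def silhouette_iou(rendered, reference):
    """Intersection over union of two binary silhouette masks of equal shape."""
    rendered = np.asarray(rendered, dtype=bool)
    reference = np.asarray(reference, dtype=bool)
    union = np.logical_or(rendered, reference).sum()
    if union == 0:
        # Both masks are empty: treat them as a perfect match.
        return 1.0
    intersection = np.logical_and(rendered, reference).sum()
    return float(intersection / union)


if __name__ == "__main__":
    rendered = np.array([[1, 1, 0],
                         [0, 0, 0]])
    reference = np.array([[1, 0, 0],
                          [1, 0, 0]])
    # One shared pixel out of three covered pixels.
    print(silhouette_iou(rendered, reference))
```

Averaging such a score over views and frames would give a single percentage of the kind reported above, although the aggregation actually used is not stated in the summary.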

5. Conclusion

This paper presents the first system for total body capture — simultaneously tracking faces, hands, and bodies — from multi-view video without markers. The Adam model, learned from real capture data, provides a unified parameterization that enables holistic capture of human non-verbal communication. The multi-stage optimization combining keypoint detection, point cloud alignment, and temporal smoothing produces temporally coherent results across challenging scenarios. Future work includes improving hand tracking accuracy at a distance and extending to outdoor environments.

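Paragraph function: Summarizes the paper, restating the "first" contribution claim and pointing to future directions.

Logical role: The conclusion echoes the abstract's "first" claim, closing the argumentative loop. The list of future directions also concedes the current system's limitations.

Argumentation technique / potential weakness: "Hand tracking at a distance" and "outdoor environments" are explicitly listed as future work, and these are precisely the system's weakest points at present. For a student paper, the limited scope is understandable, but it also suggests the system is still some way from practical deployment.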

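Argument structure overview

Problem: face, hand, and body capture each operate in isolation
→ Claim: a unified deformation model enables total body capture
→ Evidence: silhouette accuracy of 87.74%; demonstrations across multiple scenarios
→ Rebuttal: hand tracking at a distance falls short; indoor environments only
→ Conclusion: the first markerless total body capture system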

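Authors' core claim (one sentence)

By building a unified parametric deformation model of the human body (Adam), the authors achieve, for the first time, markerless simultaneous capture of facial expression, body pose, and hand pose from multi-view video.

Strongest point of the argument

The stepwise design from stitching to unification: the Frankenstein model first verifies feasibility, and the data-driven Adam model then improves quality, which shows a pragmatic engineering methodology. The design of the multi-stage optimization framework also reflects a deep understanding of the multi-scale problem.

Weakest point of the argument

The quantitative evaluation is not persuasive enough: the accuracy difference between Adam and Frankenstein with ICP is only 0.06 percentage points, which makes it hard to demonstrate a significant accuracy advantage for the unified model. In addition, the lack of direct comparison with other markerless methods makes the "first" claim hard to fully validate on the accuracy dimension.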
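
The size of the 0.06-point gap can be put in perspective by expressing it in units of the reported spread. The sketch below assumes the ± values are standard deviations, which the summary does not confirm. It is not a significance test, since the number of evaluated frames or views is not given.

```python
def gap_in_spread_units(acc_a, spread_a, acc_b, spread_b):
    """Accuracy gap divided by the mean of the two reported spreads."""
    gap = acc_a - acc_b
    mean_spread = (spread_a + spread_b) / 2.0
    return gap / mean_spread


if __name__ == "__main__":
    # Adam with ICP vs. Frankenstein with ICP, using the figures quoted above.
    ratio = gap_in_spread_units(87.74, 4.18, 87.68, 4.53)
    print(f"gap is {ratio:.3f} of the mean reported spread")
```

With these figures the gap is well under 2% of the reported spread, which is consistent with the concern that the two models are hard to separate on this metric alone.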