Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild

Abstract

We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision. The method employs an autoencoder that factorizes each input image into depth, albedo, viewpoint, and illumination. The key insight is that many object categories have an approximate reflective symmetry, which we model with a symmetry probability map, learned end-to-end, accounting for the fact that not all object parts are symmetric. Our method achieves results that outperform or are comparable to state-of-the-art supervised and unsupervised methods for single-image 3D reconstruction of human faces, cat faces, and cars.

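Paragraph function: Overview of the whole paper, outlining in concise terms the core method (an image-decomposing autoencoder) and the key innovation (probabilistic symmetry modeling).

Logical role: The abstract sets expectations. It first points to the difficulty of working without supervision, then presents symmetry as the way through, and finally cites results across categories as evidence of generalization. The argument runs from problem to method to results in one pass.

Argumentative technique / potential weakness: The claim of outperforming supervised methods is very bold; an unsupervised method beating supervised baselines is a strong selling point. But the abstract does not say on which metrics it wins, so readers must go to the experiments section to verify the claim.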

1. Introduction

Humans can effortlessly perceive the 3D shape and appearance of objects from a single image. Achieving the same with machines is a fundamental goal of computer vision. Most existing approaches rely on expensive supervision such as 3D ground truth, multi-view images, or manually annotated keypoints, limiting their scalability. We ask: can we learn to reconstruct 3D objects from single images without any supervision?

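Paragraph function: Establishes the research field. Starting from human perception, it defines a core challenge of computer vision and exposes the bottleneck of existing methods.

Logical role: The start of the chain of argument. It first establishes the importance of 3D reconstruction, then uses the pain point of expensive supervision to motivate the work, and finally poses the core research question.

Argumentative technique / potential weakness: Leading the reader with a question is an effective rhetorical strategy in academic writing. However, the boundary of "without any supervision" is blurry: the method still needs a collection of images from the same category, so it is not assumption-free.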

Our key idea is to exploit the fact that many natural object categories possess an approximate bilateral symmetry. This symmetry prior provides a powerful self-supervisory signal: given a predicted 3D shape, we can flip the reconstruction and check consistency with the original image. However, objects are not perfectly symmetric — hair, expressions, and accessories break symmetry. We therefore introduce a learned confidence map that assigns a probability of symmetry to each pixel.

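Paragraph function: Presents the core innovation, probabilistic symmetry as the source of the self-supervisory signal.

Logical role: This paragraph is the key turn of the paper, from problem to solution. It introduces the symmetry prior, openly concedes its limitation (objects are not perfectly symmetric), and then offers the probability map as the remedy.

Argumentative technique / potential weakness: The concede-then-resolve pattern is effective and adds credibility. But the symmetry assumption fundamentally limits the method's scope: for highly asymmetric objects (such as chairs or tools), the prior may fail.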

Previous works on single-image 3D reconstruction include 3D Morphable Models (3DMM) that fit a parametric shape model to image observations, but they require pre-computed shape templates from 3D scans. Learning-based approaches using CNNs have shown promise, but typically require 3D ground truth, multi-view supervision, or keypoint annotations. Recent differentiable rendering techniques enable end-to-end learning of 3D representations from 2D images, which we leverage in our framework.

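Paragraph function: Literature review. It surveys existing 3D reconstruction methods and notes the supervision each one requires.

Logical role: By listing the supervision each method needs (3D scans, multi-view images, keypoints), it highlights how distinctive the paper's unsupervised position is. The mention of differentiable rendering lays the technical groundwork for the method section.

Argumentative technique / potential weakness: Using supervision requirements as the classification axis neatly places all prior methods in the "needs external supervision" camp and sets the paper apart. But it does not compare in depth with other contemporaneous unsupervised 3D reconstruction work (such as SfM-related methods).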

3. Method

Our model is based on an image formation autoencoder that decomposes an input image into four components: depth d, albedo a, viewpoint w, and illumination l. An encoder maps the input image to these latent factors, and a differentiable renderer reconstructs the image from the predicted factors. The model is trained with a photometric reconstruction loss that measures how well the predicted factors explain the observed image. This approach does not require any 3D annotations, keypoints, or multi-view data.

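Paragraph function: Overview of the method architecture, describing the autoencoder's four-factor decomposition.

Logical role: This paragraph sets up the basic framework, an encoder-renderer structure. Listing the four factors explicitly gives the reader a mental model. Restating that no 3D annotations are needed reinforces the core message.

Argumentative technique / potential weakness: The four-factor decomposition is a strong assumption: it presumes that an image can be fully explained as a combination of depth, albedo, viewpoint, and illumination. Real scenes also involve occlusion, transparency, shadows, and other complicating factors, so whether the decomposition is sufficient is open to question.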
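
To make the decoding half of the four-factor decomposition more concrete, the following is a minimal NumPy sketch of how depth, albedo, and illumination might be combined into an image. It is an illustration under stated assumptions, not the paper's renderer: the Lambertian shading model, the `ambient` and `diffuse` coefficients, the normal sign convention, and the function names are all hypothetical, and the viewpoint-dependent reprojection step is omitted entirely.

```python
import numpy as np

def normals_from_depth(depth):
    """Approximate unit surface normals from an (H, W) depth map
    using finite differences (sign convention is an assumption)."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def shade(depth, albedo, light_dir, ambient=0.5, diffuse=0.5):
    """Render an (H, W, 3) image as albedo times Lambertian shading.
    Viewpoint warping is deliberately left out of this sketch."""
    normals = normals_from_depth(depth)
    l = np.asarray(light_dir, dtype=float)
    l = l / np.linalg.norm(l)
    lambert = np.clip(normals @ l, 0.0, None)   # (H, W), clamped at zero
    shading = ambient + diffuse * lambert
    return albedo * shading[..., None]

# A flat surface lit head-on: shading is 1 everywhere, so the image equals the albedo.
flat_depth = np.ones((4, 4))
gray_albedo = np.full((4, 4, 3), 0.8)
image = shade(flat_depth, gray_albedo, light_dir=[0.0, 0.0, 1.0])
```

In a training setup, the encoder would predict `depth`, `albedo`, and `light_dir` from the input image, and the rendered output would be compared against that same input.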

3.1 Probabilistic Symmetry

The core innovation lies in modeling symmetry probabilistically. We learn a symmetry probability map sigma that assigns to each pixel the probability that it has a symmetric counterpart. During training, the reconstruction loss is weighted by sigma: symmetric regions contribute more to the loss, while asymmetric regions (e.g., hair, accessories) are down-weighted. This allows the model to automatically discover which parts of the object are symmetric and which are not, without any manual annotation. The map sigma is predicted by the encoder and learned end-to-end together with all other factors.

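Paragraph function: Explains the core technical innovation, the design and learning mechanism of the symmetry probability map.

Logical role: This paragraph is the peak of the paper's technical contribution. Raising symmetry from a discrete Boolean judgment (symmetric or not) to a continuous probability estimate is the paper's key conceptual leap.

Argumentative technique / potential weakness: Learning sigma end-to-end is an elegant design that avoids the tedium of defining symmetric regions by hand. But there is a risk of degeneracy: the model could learn to set sigma to zero everywhere and ignore the symmetry constraint entirely. The authors need additional regularization to prevent this.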
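
The sigma-weighted loss and the degeneracy risk raised in the note on this section can be illustrated with a short sketch. The weighting follows the description in this section (error multiplied by a per-pixel symmetry probability); the logarithmic penalty that discourages sigma from collapsing to zero, its weight `penalty`, and the function name are assumptions added for illustration, not something the text above specifies.

```python
import numpy as np

def symmetry_weighted_loss(pred, target, sym_prob, penalty=0.1, eps=1e-6):
    """Per-pixel L1 error weighted by a symmetry probability map.

    pred, target: (H, W, C) images; sym_prob: (H, W) values in [0, 1].
    The -log(p) term is a hypothetical regularizer: without it, setting
    sym_prob to zero everywhere would trivially minimize the loss.
    """
    err = np.abs(pred - target).mean(axis=-1)   # (H, W)
    p = np.clip(sym_prob, eps, 1.0)
    return float(np.mean(p * err - penalty * np.log(p)))

pred = np.zeros((4, 4, 3))
target = np.ones((4, 4, 3))

hard = symmetry_weighted_loss(pred, target, np.ones((4, 4)))           # sigma = 1 everywhere
soft = symmetry_weighted_loss(pred, target, np.full((4, 4), 0.5))      # down-weighted
collapsed = symmetry_weighted_loss(pred, target, np.zeros((4, 4)))     # sigma driven to zero
```

With sigma fixed to 1 the expression reduces to a plain L1 loss, which corresponds to the hard symmetry constraint. In this toy case the collapsed map scores worse than either alternative because of the penalty term, which is the behavior a regularizer of this kind is meant to produce.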

The total loss combines several terms: a photometric loss measuring RGB reconstruction error, a perceptual loss using VGG features for higher-level similarity, and a regularization on the depth map to encourage smoothness. Crucially, the symmetry-weighted flip consistency loss reconstructs the image from a horizontally flipped version of the predicted depth and albedo, enforcing that symmetric parts of the object should produce consistent reconstructions regardless of the viewing direction.

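Paragraph function: Details the training objective, breaking the total loss into its component terms.

Logical role: Moves the method from the architectural level to the optimization level. The flip consistency loss is the concrete realization of the symmetry prior, turning an intuition into a differentiable mathematical expression.

Argumentative technique / potential weakness: Combining several loss terms requires tuning weight hyperparameters, but the relative importance of the terms is not discussed in depth here. The perceptual loss relies on a pretrained VGG network, which introduces a degree of implicit supervision.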
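
As a rough sketch of how the terms described in this section might be combined, the code below adds a photometric term, a flip consistency term, and a depth smoothness term. The weights `w_flip` and `w_smooth`, the function names, and the caller-supplied `render` function are hypothetical. The perceptual (VGG) term and the sigma weighting are left out to keep the example self-contained.

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two arrays."""
    return float(np.mean(np.abs(a - b)))

def depth_smoothness(depth):
    """Mean absolute first-order difference of the depth map."""
    dy = np.abs(np.diff(depth, axis=0))
    dx = np.abs(np.diff(depth, axis=1))
    return float(dy.mean() + dx.mean())

def total_loss(image, depth, albedo, render, w_flip=0.5, w_smooth=0.01):
    """Photometric + flip consistency + smoothness (weights are assumptions)."""
    recon = render(depth, albedo)
    # Flip the predicted depth and albedo horizontally and render again.
    recon_flip = render(np.flip(depth, axis=1), np.flip(albedo, axis=1))
    return (l1(recon, image)
            + w_flip * l1(recon_flip, image)
            + w_smooth * depth_smoothness(depth))

# Toy renderer that ignores depth and returns the albedo unchanged.
toy_render = lambda depth, albedo: albedo

depth = np.ones((4, 4))
symmetric = np.tile(np.array([0.0, 1.0, 1.0, 0.0]), (4, 1))[..., None]
asymmetric = np.tile(np.array([0.0, 0.0, 1.0, 1.0]), (4, 1))[..., None]

loss_sym = total_loss(symmetric, depth, symmetric, toy_render)
loss_asym = total_loss(asymmetric, depth, asymmetric, toy_render)
```

In this toy setup the left-right symmetric pattern incurs no flip penalty, while the asymmetric one does even though its direct reconstruction is perfect. That is the situation the symmetry probability map is meant to soften.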

4. Experiments

We evaluate on three object categories: human faces (CelebA), cat faces (Cats dataset), and cars (CompCars). For human faces, the method achieves state-of-the-art 3D reconstruction quality with SIDE 0.793 and MAD 15.4 degrees on the BFM benchmark, outperforming supervised baselines including 3DMM-CNN. On cat faces, the model successfully learns plausible 3D geometry despite the high variability in cat face shapes. For cars, reconstructions show accurate depth and shape recovery from single viewpoints. The symmetry probability map learned by the model correctly identifies asymmetric regions such as hair and earrings in human faces.

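Paragraph function: Provides experimental evidence, testing the method's effectiveness and generalization on three object categories.

Logical role: This paragraph is the empirical pillar of the paper. The cross-category design (human faces, cat faces, cars) shows the method's generality, and the results beating supervised baselines answer the core claim of the abstract directly.

Argumentative technique / potential weakness: The "unsupervised beats supervised" result is impressive, but the fairness of the comparison deserves attention: the supervised baselines may have used different training data or network architectures. In addition, the car evaluation lacks quantitative metrics and is presented only qualitatively, which is less convincing.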
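
The two metrics quoted in this section are not defined in these notes. Assuming SIDE denotes a scale-invariant depth error computed on log-depth differences and MAD denotes the mean angle deviation between predicted and ground-truth surface normals, they could be computed along the following lines. The function names are hypothetical, and any masking or scaling used in the actual benchmark is not reproduced here.

```python
import numpy as np

def side(pred_depth, gt_depth):
    """Scale-invariant depth error: standard deviation of log-depth differences."""
    delta = np.log(pred_depth) - np.log(gt_depth)
    variance = np.mean(delta ** 2) - np.mean(delta) ** 2
    return float(np.sqrt(max(variance, 0.0)))   # guard against tiny negative values

def mad_degrees(pred_normals, gt_normals):
    """Mean angle, in degrees, between unit normal maps of shape (H, W, 3)."""
    cos = np.sum(pred_normals * gt_normals, axis=-1)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(angles.mean())

gt = np.linspace(1.0, 2.0, 16).reshape(4, 4)
scaled_error = side(2.0 * gt, gt)   # a global scale factor should not be penalized

z_normals = np.zeros((4, 4, 3)); z_normals[..., 2] = 1.0
y_normals = np.zeros((4, 4, 3)); y_normals[..., 1] = 1.0
angle = mad_degrees(z_normals, y_normals)   # perpendicular normals
```

Under this definition, multiplying the predicted depth by a constant leaves SIDE unchanged, which matters for a method that can only recover depth up to scale.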

Ablation studies demonstrate the importance of each component. Removing the symmetry prior leads to significant degradation in reconstruction quality, with SIDE increasing from 0.793 to 0.897. Without the perceptual loss, results become blurry and lose fine details. The probabilistic symmetry map proves essential: using a hard symmetry constraint (sigma = 1 everywhere) performs worse because it cannot accommodate inherently asymmetric regions.

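Paragraph function: Ablation experiments, verifying that each component is necessary.

Logical role: The ablation study directly supports the design: performance drops each time a component is removed, showing that none is redundant. The contrast between hard and soft symmetry is especially persuasive.

Argumentative technique / potential weakness: Ablation is the gold standard for validating a method's design. But the ablations are run only on human faces and are not repeated on cat faces or cars, so it is not possible to confirm that the conclusions generalize across categories.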

5. Conclusion

We have presented a method for learning 3D deformable object reconstruction from single images without any external supervision. The key enabler is a probabilistic symmetry model that automatically learns which parts of an object are symmetric, providing a powerful self-supervisory signal for 3D learning. Our results demonstrate that this approach achieves state-of-the-art performance on multiple object categories, challenging the assumption that 3D reconstruction requires explicit 3D supervision.

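Paragraph function: Summarizes the paper, restating the core contribution and its methodological significance.

Logical role: The conclusion echoes the question posed in the introduction (can 3D objects be reconstructed without supervision?) and answers it in the affirmative, closing the loop of the argument.

Argumentative technique / potential weakness: The wording about challenging the 3D supervision assumption carries academic weight, but the conclusion does not discuss the method's limitations (such as the scope of the symmetry assumption or performance in complex scenes), nor does it propose future research directions.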

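Overview of the argument structure

Problem
Single-image 3D reconstruction relies on expensive supervision

→

Thesis
The probabilistic symmetry of objects can serve as a self-supervisory signal

→

Evidence
SIDE of 0.793 on human faces, surpassing supervised baselines

→

Rebuttal
Asymmetric regions are handled automatically by the probability map

→

Conclusion
Unsupervised 3D reconstruction can rival supervised methods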

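The authors' core claim (in one sentence)

With a learned probabilistic symmetry map and an image-decomposing autoencoder, high-quality 3D object shape and appearance can be reconstructed from a single image without any 3D supervision.

Strongest point of the argument

The probabilistic treatment of the symmetry prior: raising the binary symmetric/asymmetric judgment to a continuous probability estimate keeps the constraining power of the symmetry prior while retaining the flexibility to handle asymmetric parts in the real world. The ablation study confirms that this design clearly outperforms a hard symmetry constraint. Surpassing supervised methods on face reconstruction also directly challenges the consensus that 3D reconstruction requires 3D supervision.

Weakest point of the argument

The scope of the symmetry assumption: the method's effectiveness depends fundamentally on objects being approximately symmetric, which limits its applicability to highly asymmetric objects (such as chairs, tools, or irregular organisms). Moreover, the experimental categories (human faces, cat faces, cars) are all highly symmetric, so the method's behavior on weakly symmetric or asymmetric categories is not adequately tested, leaving its generalization in question.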