Joint 3D Scene Reconstruction and Class Segmentation

Abstract

Both image segmentation and dense 3D modeling from images are intrinsically ill-posed problems that require strong regularization. In this paper, we argue that these two tasks can mutually benefit each other. Segmentations provide geometric cues about which surface orientations are more likely at given spatial locations, while 3D reconstruction provides suitable regularization for the segmentation problem by lifting the labeling from 2D images to 3D space. We propose a joint formulation that incorporates learned appearance-based cues and 3D surface orientation priors for class-specific regularization. We demonstrate through experiments on real datasets that our joint approach improves results compared to treating each problem separately.

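Paragraph function: Overview of the whole paper. It frames the joint approach to 3D reconstruction and semantic segmentation around the idea of mutual benefit.

Logical role: The abstract uses a symmetric, bidirectional argument: segmentation helps reconstruction, and reconstruction helps segmentation. This two-way logic is more persuasive than a one-way causal claim.

Argumentation technique / potential weakness: Opening by calling both tasks "ill-posed" goes straight to the mathematical core and signals rigor. The mutual-benefit claim still needs strict experimental validation: joint optimization does not necessarily improve both sides, and one task may improve at the other's expense.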

1. Introduction

Multi-view 3D reconstruction from images has made remarkable progress in recent years, producing dense surface models of complex environments. Separately, semantic image segmentation has advanced significantly, with methods that can assign pixel-level class labels to images. However, these two tasks have largely been studied in isolation. 3D reconstruction methods typically produce geometric models without semantic meaning, while segmentation methods operate in 2D without exploiting the underlying 3D scene structure. We propose to solve both problems jointly, allowing each task to inform and constrain the other.

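Paragraph function: Establishes the research landscape by identifying two mature but isolated lines of work.

Logical role: Starting point of the argument chain. The paragraph first acknowledges the achievements of each field, then names their isolation as the fundamental problem. This structure effectively establishes the need for a joint treatment.

Argumentation technique / potential weakness: The criticism that the tasks are "studied in isolation" is precise, but the authors may underestimate earlier integration attempts, such as work that uses semantic priors to constrain SfM or depth maps to assist segmentation.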

The key intuition is that different object classes have characteristic 3D geometric properties. For example, ground surfaces are predominantly horizontal, walls are vertical, and cars have smooth curved surfaces. If we know the class of a region, we can impose class-specific surface orientation priors that guide the 3D reconstruction. Conversely, the 3D volumetric representation provides a natural regularization for segmentation: labeling in 3D space ensures view-consistency across multiple images and leverages the full geometric context, something that 2D-only methods cannot achieve.

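Paragraph function: Core intuition, illustrating the mutual-benefit relationship with concrete examples.

Logical role: Makes the abstract notion of mutual benefit concrete. Examples such as horizontal ground and vertical walls let the reader see immediately how class constrains geometry and how 3D consistency constraints help segmentation.

Argumentation technique / potential weakness: The chosen examples (ground, walls, cars) are all classes with clear geometric properties. For classes with highly variable shape (such as "vegetation" or "pedestrian"), class-specific geometric priors may be too loose to be useful.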

2. Related Work

Dense 3D reconstruction from multiple views has been extensively studied, from volumetric methods using truncated signed distance functions (TSDF) to variational approaches that minimize photoconsistency-based energy functionals. These methods produce high-quality geometry but lack semantic understanding. On the segmentation side, conditional random fields (CRFs) have become standard for incorporating spatial context into pixel-wise classification. Recent works have begun to bridge these two domains: Semantic SLAM approaches incorporate class labels into mapping, but typically treat segmentation and reconstruction as sequential rather than truly joint processes.

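Paragraph function: Literature review, tracing the separate development of 3D reconstruction and semantic segmentation.

Logical role: Positions the paper through the distinction between sequential and joint processing: although semantic SLAM has begun to integrate the two tasks, it is still not "truly joint", which leaves room for this paper's formulation.

Argumentation technique / potential weakness: Classifying existing integration attempts as "sequential" is somewhat subjective. Some iterative methods come close to joint optimization in practice, and the difference may lie mainly in the form of the formulation rather than in performance.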

3. Joint Formulation

We formulate the problem on a 3D volumetric grid where each voxel is assigned both a binary occupancy label (inside/outside the surface) and a semantic class label. The joint energy function consists of three terms: (1) a data term derived from multi-view photoconsistency and appearance-based class likelihoods; (2) a surface regularization term that penalizes surface area with class-dependent anisotropic weights favoring expected surface orientations for each class; and (3) a semantic smoothness term that encourages spatially coherent labeling in 3D. The total energy is minimized over both the surface geometry and the semantic labels simultaneously.

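Paragraph function: Core method, defining the mathematical structure of the joint energy function.

Logical role: This paragraph is the mathematical core of the paper. The three energy terms correspond to data fidelity, geometric regularization, and semantic consistency, and together they encode the coupling between reconstruction and segmentation.

Argumentation technique / potential weakness: The three-term design is symmetric and complete. However, the "class-dependent anisotropic weights" introduce many parameters that must be learned: each class needs a distribution over 3D orientations, and it is unclear whether the training data suffices to estimate them reliably.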
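
The three-term structure described in this section can be written schematically as follows. This is only a sketch: the notation is introduced here for illustration and is not taken from the paper. It assumes a voxel grid \Omega, a binary occupancy u_v and a class label \ell_v per voxel, a photoconsistency-derived cost \rho_v for marking voxel v as occupied, appearance features a_v, a class-dependent anisotropic weighting \phi_c of the surface normal direction, a neighborhood system \mathcal{N}, and trade-off weights \lambda_s and \lambda_\ell.

```latex
E(u, \ell) =
  \underbrace{\sum_{v \in \Omega} \bigl( \rho_v \, u_v - \log p(\ell_v \mid a_v) \bigr)}_{\text{(1) data term}}
  + \lambda_s \underbrace{\sum_{v \in \Omega} \phi_{\ell_v}\!\bigl( (\nabla u)_v \bigr)}_{\text{(2) class-dependent surface regularization}}
  + \lambda_\ell \underbrace{\sum_{(v,w) \in \mathcal{N}} [\ell_v \neq \ell_w]}_{\text{(3) semantic smoothness}}
```

Here [\cdot] is the Iverson bracket. The discrete gradient \nabla u is nonzero only where occupancy changes, so if each \phi_c is positively 1-homogeneous, term (2) behaves like a surface area weighted by how well the local normal matches the orientations expected for class c.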

4. Energy Function Details

The class-specific surface regularization is the key innovation connecting segmentation and reconstruction. For each semantic class c, we learn a surface orientation distribution from training data: ground planes concentrate probability mass on vertical (upward-pointing) normals, walls on horizontal normals, and vegetation shows a more uniform distribution. These distributions are encoded as anisotropic weights in the surface area regularizer, so that when a voxel is labeled as "ground," horizontal surfaces are penalized less than vertical ones, effectively guiding the reconstruction toward class-appropriate geometry. The appearance term uses boosted classifiers on texton and color features projected from the images to provide per-voxel class likelihoods.

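Paragraph function: Details of the innovation, explaining how class-specific regularization works.

Logical role: Turns the intuition that class constrains geometry into a mathematical operation: the anisotropic weights let the segmentation directly influence which orientations the surface reconstruction prefers, achieving a genuinely two-way coupling.

Argumentation technique / potential weakness: The orientation distributions for ground, walls, and vegetation are intuitive and compelling examples. The design also implies that classification errors lead to inappropriate geometric constraints: if ground is mislabeled as wall, the reconstruction is steered toward the wrong orientation. Can joint optimization correct this kind of error propagation?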
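
To make the idea of a class-dependent anisotropic weight concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's learned model: the `UP` axis, the `CLASS_COSTS` table and its numbers, and the elliptical interpolation in `surface_cost` are all hypothetical stand-ins for orientation distributions that the paper learns from training data. The sketch is also symmetric in the sign of the normal, which a learned distribution need not be.

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])  # assumed world "up" axis

# Hypothetical per-class costs: (cost when the normal is parallel to UP,
# cost when the normal is perpendicular to UP). Illustrative values only.
CLASS_COSTS = {
    "ground": (0.2, 1.0),      # horizontal surfaces (normal along UP) are cheap
    "wall": (1.0, 0.2),        # vertical surfaces (normal perpendicular to UP) are cheap
    "vegetation": (0.6, 0.6),  # near-uniform: no preferred orientation
}


def surface_cost(normal, label):
    """Cost per unit area of a surface patch with this normal under this class."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    along, perp = CLASS_COSTS[label]
    cos2 = float(n @ UP) ** 2        # squared cosine of the angle to UP
    sin2 = max(0.0, 1.0 - cos2)      # clamp guards against rounding error
    # Elliptical interpolation between the two class-specific costs.
    return float(np.sqrt(along**2 * cos2 + perp**2 * sin2))
```

With these made-up numbers, a horizontal patch (normal `[0, 0, 1]`) labeled "ground" costs about 0.2 per unit area, while a vertical patch with the same label costs about 1.0. Relabeling the same patches as "wall" reverses the preference, which is the mechanism by which a labeling error would steer the geometry toward the wrong orientation.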

5. Optimization

The joint energy function is non-convex due to the coupling between geometry and semantics, making global optimization intractable. We adopt an alternating minimization strategy: in one step, we fix the semantic labels and optimize the 3D surface using convex relaxation of the binary labeling problem; in the other, we fix the surface geometry and optimize the semantic labels using graph cuts. The convex relaxation of the surface reconstruction subproblem can be solved efficiently using primal-dual algorithms. Convergence is typically achieved within 5-10 iterations of the alternating scheme. While we cannot guarantee global optimality, experiments show consistent improvement over independent optimization of each task.

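Paragraph function: Solution strategy, explaining how the non-convex joint optimization problem is handled.

Logical role: Acknowledges the method's theoretical limitation (non-convexity) and answers it with a practical alternating minimization strategy. The solver used for each subproblem (convex relaxation, graph cuts) is a mature tool.

Argumentation technique / potential weakness: Acknowledging non-convexity rather than sidestepping it is honest scholarship. Alternating minimization can easily get stuck in local minima, though, especially when the initial segmentation is badly wrong. Convergence within "5-10 iterations" suggests the problem may be sensitive to initialization.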
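
The alternating scheme can be illustrated on a deliberately small toy problem. The Python sketch below is not the paper's solver: it swaps the 3D voxel grid for a 1D height profile, the convex relaxation with a primal-dual algorithm for an exact linear solve, and graph cuts for a per-edge argmin (valid here only because the toy energy has no pairwise label term). All names (`energy`, `geometry_step`, `label_step`, `alternate`, `toy_problem`) and all numeric values are assumptions made for illustration.

```python
import numpy as np


def energy(x, labels, d, unary, w, lam):
    """Toy joint energy: data fit + label unaries + class-weighted smoothness."""
    edges = np.arange(len(labels))
    slope2 = np.diff(x) ** 2
    return float(np.sum((x - d) ** 2)
                 + unary[edges, labels].sum()
                 + lam * np.sum(w[labels] * slope2))


def geometry_step(labels, d, w, lam):
    """Labels fixed: the energy is a convex quadratic in x, so solve it exactly."""
    n = len(d)
    D = np.diff(np.eye(n), axis=0)  # forward differences, shape (n-1, n)
    A = np.eye(n) + lam * D.T @ (w[labels][:, None] * D)
    return np.linalg.solve(A, d)


def label_step(x, unary, w, lam):
    """Geometry fixed: labels decouple, so take a per-edge argmin."""
    slope2 = np.diff(x) ** 2
    cost = unary + lam * slope2[:, None] * w[None, :]
    return np.argmin(cost, axis=1)


def alternate(d, unary, w, lam=5.0, max_iters=10):
    """Alternate the two steps until the labels stop changing."""
    labels = np.argmin(unary, axis=1)  # initialize from the unaries alone
    x = d.copy()
    history = [energy(x, labels, d, unary, w, lam)]
    for _ in range(max_iters):
        x = geometry_step(labels, d, w, lam)
        new_labels = label_step(x, unary, w, lam)
        history.append(energy(x, new_labels, d, unary, w, lam))
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
    return x, labels, history


def toy_problem(n=40, seed=0):
    """Synthetic data: a flat half followed by a rough half, plus noise."""
    rng = np.random.default_rng(seed)
    rough = np.cumsum(rng.normal(0.0, 0.5, n - n // 2))
    d = np.concatenate([np.zeros(n // 2), rough]) + rng.normal(0.0, 0.1, n)
    unary = rng.uniform(0.0, 1.0, size=(n - 1, 2))  # stand-in for appearance costs
    w = np.array([4.0, 0.25])  # class 0 penalizes slope heavily, class 1 lightly
    return d, unary, w
```

Calling `alternate(*toy_problem())` returns the smoothed profile, the per-edge labels, and the energy after each round. Because each step exactly minimizes the energy over one block of variables while the other is held fixed, the recorded energies never increase. That property says nothing about reaching the global minimum, which is the limitation the note above raises.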

6. Experiments

We evaluate our approach on outdoor urban scenes with semantic classes including ground, building, vegetation, car, and sky. Input consists of calibrated multi-view image sequences. For 3D reconstruction quality, we compare against standard volumetric reconstruction without semantic priors and show that class-specific regularization produces cleaner surfaces with fewer artifacts: ground planes are flatter and building facades are smoother. For segmentation accuracy, lifting labels to 3D improves consistency across views, reducing the per-pixel error rate by approximately 5-8% compared to independent 2D segmentation. The mutual benefit is confirmed: joint optimization improves both reconstruction and segmentation.

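Paragraph function: Provides experimental evidence on two fronts, validating the improvements in reconstruction and in segmentation separately.

Logical role: Responds directly to the abstract's mutual-benefit claim with both quantitative evidence (a 5-8% reduction in error rate) and qualitative evidence (flatter ground, smoother facades).

Argumentation technique / potential weakness: The qualitative results ("cleaner", "flatter") are persuasive but lack objective metrics. The 5-8% segmentation improvement is substantial, but it is validated only on a limited set of outdoor scenes; generalization to indoor or other scene types is not explored.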
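
The text does not say whether the "5-8%" reduction is absolute (percentage points) or relative to the 2D baseline, and the two readings differ considerably. The Python sketch below shows how a per-pixel error rate is commonly computed and how the two kinds of reduction relate. The function names, the `ignore_label` convention, and every number used with them are assumptions for illustration, not values from the paper.

```python
import numpy as np


def per_pixel_error(pred, gt, ignore_label=255):
    """Fraction of labeled pixels whose predicted class differs from ground truth."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    valid = gt != ignore_label  # assumed convention for unlabeled pixels
    if not valid.any():
        raise ValueError("no labeled pixels to evaluate")
    return float(np.mean(pred[valid] != gt[valid]))


def error_reduction(baseline_error, joint_error):
    """Return (absolute reduction, reduction relative to the baseline)."""
    absolute = baseline_error - joint_error
    relative = absolute / baseline_error
    return absolute, relative
```

With made-up error rates of 0.20 for the 2D baseline and 0.14 for the joint method, `error_reduction(0.20, 0.14)` gives an absolute reduction of about 0.06 (6 percentage points) but a relative reduction of about 0.30 (30%), so the interpretation matters when judging how substantial the reported gain is.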

7. Conclusion

We have presented a joint formulation for simultaneous 3D scene reconstruction and semantic class segmentation. By incorporating class-specific surface orientation priors into the reconstruction energy and leveraging 3D consistency for segmentation regularization, our approach demonstrates that these two traditionally separate tasks benefit significantly from being solved together. The improvements in both geometric quality and segmentation accuracy confirm the value of this joint perspective. Future work includes scaling to larger scenes with more diverse object classes and incorporating dynamic objects into the framework.

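Paragraph function: Summarizes the paper, restating the mutual-benefit claim and pointing to extensions.

Logical role: The conclusion returns to the abstract's core promise and uses the experimental results to support the mutual-benefit hypothesis, closing the argument. Proposing future directions also acknowledges the limits of the current method's scope.

Argumentation technique / potential weakness: The mention of "dynamic objects" reveals the framework's basic assumption of a static scene. In the real world, pedestrians and vehicles move, which fundamentally limits where the joint framework can be applied.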

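Argument structure overview

Problem
3D reconstruction and semantic segmentation are solved independently, with no mutual benefit

→

Claim
Class constrains geometry; geometry regularizes segmentation

→

Evidence
Reconstruction quality and segmentation accuracy both improve

→

Rebuttal
Alternating minimization handles the non-convex joint energy

→

Conclusion
The joint formulation outperforms solving each task independently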

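The authors' core claim (in one sentence)

3D scene reconstruction and semantic segmentation are coupled problems; a joint formulation that combines class-specific geometric priors with 3D consistency constraints can improve the performance of both.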

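Strongest point of the argument

Two-way validation of the mutual benefit: the authors do not merely assert that the joint method is better. They show separately how segmentation helps reconstruction (class-specific anisotropic regularization yields cleaner surfaces) and how reconstruction helps segmentation (3D consistency reduces the error rate by 5-8%). This makes "mutual benefit" an empirically supported conclusion rather than just a concept.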

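Weakest point of the argument

Doubts about the reliability of the non-convex optimization: alternating minimization converges at best to a local minimum and is sensitive to initialization. In addition, the experiments cover only a limited set of outdoor urban scenes (5 classes) whose object classes have relatively clear geometric properties. In scenes where class geometry is less distinctive (such as cluttered indoor environments), the benefit of class-specific priors may drop substantially.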