Common Objects in 3D: Two-Column Annotation

Abstract

Traditional approaches for learning 3D object categories have been predominantly trained and evaluated on synthetic datasets due to the unavailability of real 3D-annotated category-centric data. The main goal of this work is to facilitate advances in this field by collecting real-world data at a scale comparable to existing synthetic counterparts. The principal contribution is a large-scale dataset called Common Objects in 3D (CO3D), with real multi-view images of object categories annotated with camera poses and ground truth 3D point clouds. The dataset contains 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories. The authors exploit this dataset to conduct one of the first large-scale "in-the-wild" evaluations of several new-view-synthesis and category-centric 3D reconstruction methods. Finally, they contribute NerFormer, a novel neural rendering method that leverages the Transformer to reconstruct an object given a small number of its views.

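Paragraph function: Overview of the whole paper. Motivated by the data gap between synthetic and real data, it introduces the twin contributions of the CO3D dataset and the NerFormer method.

Logical role: The abstract does three things: (1) it defines the data gap (no real 3D-annotated category-centric dataset), (2) it fills the gap (CO3D), and (3) it exploits the opportunities that filling the gap opens up (NerFormer plus a large-scale evaluation).

Argumentative technique / potential weakness: Positioning CO3D as comparable in scale to existing synthetic data implies that it is a qualitative leap rather than an incremental improvement. The figures of 1.5 million frames and 50 categories are indeed convincing. However, whether quality control for in-the-wild footage (occlusion, blur, background clutter) introduces new evaluation challenges needs to be addressed in detail in the methodology.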

1. Introduction

The vision community's progress on 3D object understanding has been largely driven by synthetic datasets such as ShapeNet and ModelNet. While these enable controlled evaluation, they fail to capture the complexity and diversity of real-world objects: real objects exhibit complex textures, varying illumination, partial occlusions, and diverse backgrounds. Existing real-world 3D datasets are either small in scale (DTU, Tanks and Temples) or limited to specific categories (cars, faces). This creates a significant gap between method development on synthetic data and deployment in real-world applications.

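Paragraph function: Establishes the research motivation. A systematic analysis of the synthetic-to-real gap argues that a large-scale real dataset is necessary.

Logical role: The starting point of the argument chain. It lists four ways in which synthetic data falls short of real objects (texture, illumination, occlusion, background) and two shortcomings of existing real datasets (small scale, few categories), setting out what CO3D must overcome all at once.

Argumentative technique / potential weakness: Framing the problem as a gap between development and deployment is persuasive, since it implies that methods performing well on synthetic data may fail in real scenes. However, the severity of the domain gap varies by task, and the authors need to quantify it in the experiments.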

Synthetic 3D datasets like ShapeNet (51,300 models, 55 categories) provide clean 3D meshes but lack photorealism. Objectron provides 3D bounding boxes for real videos but no point clouds or dense geometry. DTU captures 124 scenes, but only in controlled laboratory settings. Tanks and Temples provides 21 large-scale outdoor scenes but focuses on scenes rather than object categories. CO3D differs fundamentally: it provides category-centric, in-the-wild, multi-view coverage of 50 diverse categories at a scale (nearly 19,000 sequences) that dwarfs all existing real alternatives.

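Paragraph function: Positions the work in the literature. A table-style comparison makes CO3D's differentiating advantages clear.

Logical role: Lists four existing datasets one by one and points out the shortcoming of each, building the argument that none of them satisfies all requirements at once. CO3D is positioned as the first dataset to combine scale, realism, category diversity, and dense geometry.

Argumentative technique / potential weakness: Contrasting scale directly with numbers (51,300 vs 19,000; 124 vs 19,000) is effective rhetoric. However, the comparison is not entirely fair: ShapeNet provides complete 3D meshes whereas CO3D provides only point clouds, and DTU has precise ground-truth geometry whereas CO3D's annotations may be less accurate.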

3. The CO3D Dataset

3.1 Data Collection and Processing

The CO3D dataset contains 1.5 million frames from nearly 19,000 videos of objects from 50 MS-COCO categories. Each video captures an object from multiple viewpoints through a smooth camera trajectory around the object. The data processing pipeline includes: COLMAP-based SfM to estimate camera poses for each video, Multi-View Stereo (MVS) to generate dense 3D point clouds, and PointRend-based instance segmentation to extract foreground object masks. Quality filtering removes sequences with poor SfM reconstruction or insufficient viewpoint coverage. The resulting annotations include per-frame camera intrinsics and extrinsics, foreground masks, and dense 3D point clouds.

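Paragraph function: Dataset construction. Describes the full processing pipeline from raw videos to structured annotations.

Logical role: This paragraph establishes CO3D's technical credibility. Building the annotations with COLMAP (an industry-standard SfM tool), MVS (a mature dense reconstruction technique), and PointRend (a state-of-the-art instance segmentation method) demonstrates engineering competence.

Argumentative technique / potential weakness: Every step of the pipeline uses a mature existing method, which strengthens reproducibility. However, COLMAP's pose estimation accuracy on in-the-wild videos may be significantly lower than in laboratory settings, so the criteria for quality filtering and the fraction of sequences it removes directly affect how usable the dataset is in practice.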
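
To make the per-frame annotations concrete, the following is a minimal sketch of how camera intrinsics, extrinsics, and a foreground mask could be used together to check where a point from the dense point cloud lands in a frame. The `FrameAnnotation` class, its field names, and the world-to-camera convention `x_cam = R x_world + t` are assumptions made for illustration; they are not CO3D's actual file schema or coordinate convention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record; field names are illustrative only."""
    focal_px: float                          # focal length in pixels (intrinsics)
    principal_px: Tuple[float, float]        # principal point (cx, cy) in pixels
    rotation: List[List[float]]              # 3x3 world-to-camera rotation (extrinsics)
    translation: Tuple[float, float, float]  # world-to-camera translation (extrinsics)


def project_point(frame: FrameAnnotation,
                  point_world: Tuple[float, float, float]) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    # Apply the assumed extrinsics: x_cam = R @ x_world + t
    x_cam = [
        sum(frame.rotation[i][j] * point_world[j] for j in range(3))
        + frame.translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    cx, cy = frame.principal_px
    # Perspective division followed by the intrinsics
    u = frame.focal_px * x_cam[0] / x_cam[2] + cx
    v = frame.focal_px * x_cam[1] / x_cam[2] + cy
    return u, v


def is_foreground(mask: List[List[int]], u: float, v: float) -> bool:
    """Look up the nearest pixel in a binary, row-major foreground mask."""
    row, col = int(round(v)), int(round(u))
    inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
    return inside and mask[row][col] == 1


# Usage: an identity-rotation camera placed 2 units in front of the origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
example_frame = FrameAnnotation(500.0, (320.0, 240.0), identity, (0.0, 0.0, 2.0))
example_uv = project_point(example_frame, (0.1, -0.2, 0.0))
```

A consistency check of this kind (do point-cloud points reproject onto the foreground mask?) is one plausible way to probe annotation quality, which is the concern raised in the annotation above.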

3.2 Benchmark Tasks

CO3D defines two primary benchmark tasks. The single-sequence new-view synthesis task evaluates how well a method can render novel views after training on a single video sequence of an object. The category-level new-view synthesis task evaluates whether a model trained on many sequences of a category can generalize to unseen instances. This second task is particularly challenging as it requires learning category-level 3D priors from diverse real-world instances. Evaluation metrics include PSNR, SSIM, and LPIPS on held-out views. Baseline methods evaluated include NeRF, IDR, and SRN, providing the first standardized comparison of these methods on real-world, in-the-wild data.

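Paragraph function: Benchmark design. Defines two evaluation tasks of increasing difficulty.

Logical role: The progression between the two tasks is well designed: the single-sequence task is the comfort zone of existing methods, while the category-level task is the real frontier challenge. This ensures the dataset has research value both now and in the future.

Argumentative technique / potential weakness: The claim of a "first standardized comparison" gives the dataset milestone significance. However, the variable quality of in-the-wild data may mean that performance differences between methods reflect robustness to noise more than 3D understanding.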
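
Of the three metrics named above, PSNR is the simplest to state exactly. The sketch below computes PSNR for one held-out view and averages it over several views, assuming images are flattened lists of intensities in [0, 1]. The function names are hypothetical, and the benchmark's actual protocol (for example, any masking or per-sequence averaging) is not specified in the text and is not reproduced here.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixel values
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(preds: Sequence[Sequence[float]],
              targets: Sequence[Sequence[float]]) -> float:
    """Average PSNR over a set of held-out views (assumed simple mean)."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("need the same non-zero number of views")
    return sum(psnr(p, t) for p, t in zip(preds, targets)) / len(preds)
```

Higher PSNR means the rendered view is closer to the held-out ground-truth image. SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network, respectively.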

4. NerFormer

The authors introduce NerFormer, a Transformer-based neural rendering method designed for the few-view category-level reconstruction task. Given a small number of source views (e.g., 5-10) of an unseen object, NerFormer predicts novel views by leveraging cross-attention between target ray features and source image features. The Transformer architecture naturally handles variable numbers of input views and can aggregate information across views through attention. NerFormer is trained on many sequences within a category to learn category-level priors that enable reconstruction from sparse observations.

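Paragraph function: Method contribution. Describes a new neural rendering method designed specifically for the CO3D benchmark.

Logical role: NerFormer is both a method contribution and a benchmark baseline, showing how the CO3D dataset gives rise to a new research direction (category-level few-view reconstruction). The choice of the Transformer fits the requirement of a variable number of input views well.

Argumentative technique / potential weakness: Packaging a new method with a new dataset is a common strategy for dataset papers: it demonstrates the value of the data and adds technical depth to the paper. However, NerFormer's design is relatively simple, and its depth of innovation as a standalone method is limited.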
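
The claim that attention handles a variable number of input views can be illustrated with plain scaled dot-product attention. The sketch below is not NerFormer's architecture: it is a single-head attention with no learned projections, and the function name and the use of raw feature vectors as both keys and values are assumptions made for illustration. It shows only that one target-ray query can pool features from any number of source views.

```python
import math
from typing import List, Sequence


def attend_over_views(query: Sequence[float],
                      source_feats: Sequence[Sequence[float]]) -> List[float]:
    """Pool source-view features for one target ray with softmax attention."""
    if len(source_feats) == 0:
        raise ValueError("need at least one source view")
    dim = len(query)
    # Scaled dot-product score between the ray query and each view's feature
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(dim)
        for feat in source_feats
    ]
    # Numerically stable softmax over views
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the view features (features double as values here)
    return [
        sum(w * feat[i] for w, feat in zip(weights, source_feats))
        for i in range(dim)
    ]


# The same function accepts 5 or 10 source views without any change.
ray_query = [1.0, 0.0]
pooled_5 = attend_over_views(ray_query, [[0.2, 0.4]] * 5)
pooled_10 = attend_over_views(ray_query, [[0.2, 0.4]] * 10)
```

Because the softmax normalizes over however many views are present, the output has a fixed size regardless of the number of inputs, which is the property the annotation above credits to the Transformer.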

5. Experiments

Experiments reveal significant insights. On single-sequence new-view synthesis, NeRF achieves reasonable quality but struggles with sequences containing sparse viewpoints or complex backgrounds. On category-level synthesis, all methods show substantially lower performance than on synthetic benchmarks, confirming the significant gap between synthetic and real-world evaluation. NerFormer demonstrates competitive performance on the few-view category-level task, particularly when given 5-10 source views. The experiments establish that current methods have significant room for improvement on in-the-wild 3D reconstruction, validating CO3D's role as a challenging and informative benchmark.

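Paragraph function: Experimental insights. The shortcomings of the baseline methods are used to validate the difficulty and value of the dataset.

Logical role: For a dataset paper, showing that existing methods "fail" is actually the best outcome, because it proves the dataset is hard enough to drive future research. Quantifying the synthetic-versus-real gap directly validates the core argument.

Argumentative technique / potential weakness: Reinterpreting "the methods perform poorly" as "the dataset is valuable" is classic dataset-paper rhetoric. Readers may ask, however, whether part of the performance drop comes from data quality problems (annotation noise, inaccurate poses) rather than from the difficulty of the task itself.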

6. Conclusion

CO3D represents a significant step toward real-world evaluation of 3D object understanding methods. With 1.5 million frames, nearly 19,000 sequences, and 50 categories, it is orders of magnitude larger than existing real-world 3D datasets. The benchmark experiments reveal that current methods still have substantial room for improvement on in-the-wild data. NerFormer shows promise for category-level few-view reconstruction using Transformer-based cross-view aggregation. The authors hope CO3D will catalyze progress in 3D understanding by providing a standardized, large-scale real-world benchmark.

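Paragraph function: Summarizes the paper. Restates the dataset's scale advantage and its role as a catalyst for research.

Logical role: The conclusion uses the "orders of magnitude" gap to stress the dataset's uniqueness, "room for improvement" to point to future research opportunities, and "catalyzing progress" to strengthen the narrative of the paper's impact on the community.

Argumentative technique / potential weakness: The hope of "catalyzing progress" has been borne out by subsequent work: CO3D has indeed become a standard benchmark for many 3D reconstruction methods. However, the conclusion does not discuss plans for maintaining or extending the dataset, and long-term availability is an important consideration for a community resource.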

Overview of the Argument Structure

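Problem: No large-scale real dataset of 3D objects

→

Thesis: CO3D, with 1.5 million frames, 50 categories, and 19K sequences

→

Evidence: The first large-scale, standardized, in-the-wild evaluation of 3D methods

→

Rebuttal: Existing methods still have large room for improvement on real data

→

Conclusion: CO3D catalyzes 3D understanding toward real-world applications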

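The Authors' Core Claim (One Sentence)

By collecting CO3D, a large-scale real multi-view dataset covering 50 categories and nearly 19,000 videos, and by establishing a standardized benchmark together with the NerFormer method, the authors lay a foundation for real-world evaluation of 3D object understanding.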

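Strongest Point of the Argument

A twofold advantage in scale and diversity: with 19,000 sequences and 50 categories, CO3D surpasses all previous real 3D datasets in both scale and diversity. Its construction pipeline is built entirely on mature existing methods (COLMAP, MVS, PointRend), which supports reproducibility. The marked performance drop of existing methods in the benchmark experiments directly confirms that the synthetic-to-real gap exists.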

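Weakest Point of the Argument

Uncertainty about annotation quality: CO3D's camera poses and point clouds are both produced by an automated pipeline (COLMAP + MVS) rather than by precise manual annotation. Under in-the-wild conditions (blur, occlusion, illumination changes), these automatic annotations may be significantly less accurate than those of laboratory datasets. Quality filtering removes the worst sequences, but the distribution of annotation quality among the retained sequences is not adequately analyzed. As a result, the benchmark's conclusions may reflect noisy annotations rather than the capabilities of the methods.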