SLAM++: Simultaneous Localisation and Mapping at the Level of Objects

Abstract

We propose SLAM++, a new SLAM system that operates at the level of 3D objects rather than low-level geometric primitives. Given a database of known 3D object models, our system simultaneously recognizes objects, estimates their 6-DoF poses, and builds an object-level map of the environment in real time using an RGB-D camera. The resulting map is dramatically more compact and semantically meaningful than traditional point cloud or surfel maps: a room can be described by a handful of object instances with poses rather than millions of points. This representation also enables higher-level reasoning about scene structure, object interactions, and relocalization.

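Paragraph function: Overview of the whole paper. It proposes the new paradigm of object-level SLAM and contrasts it with traditional low-level representations.

Logical role: The abstract builds its central tension on the dramatic contrast of millions of points versus a handful of objects, and previews the two main advantages, compactness and semantic meaning.

Argumentation technique / potential weakness: The object-level framing is forward-looking and anticipated the later rise of semantic SLAM. However, the premise of known object models is a serious limitation: real environments are full of unknown objects, so the method's applicability is bounded by the coverage of the database.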

1. Introduction

Traditional SLAM systems build maps as collections of geometric primitives — points, lines, surfels, or dense volumetric grids. While these representations capture the geometry of the environment, they are semantically impoverished: a point cloud map does not "understand" that certain clusters of points form a chair, a table, or a monitor. This lack of semantic structure limits the utility of SLAM maps for high-level tasks such as manipulation planning, human-robot interaction, and scene understanding.

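Paragraph function: Problem diagnosis. It identifies the semantic poverty of traditional SLAM.

Logical role: The starting point of the argument: traditional SLAM succeeds geometrically but fails semantically, which motivates introducing an object-level representation.

Argumentation technique / potential weakness: The anthropomorphic wording (a map that does not "understand") conveys the missing semantics vividly. However, semantic understanding can also be obtained by post-hoc segmentation and does not necessarily have to be solved inside SLAM.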

We argue that the right level of abstraction for indoor SLAM is objects. In structured indoor environments such as offices and living rooms, the scene is largely composed of repeated, known object categories (chairs, tables, monitors, keyboards). If the system can recognize these objects and estimate their poses, the map becomes a compact graph of object instances — orders of magnitude smaller than raw point clouds, yet capturing both geometry and semantics. This paradigm shift from "mapping everything" to "mapping what matters" also offers computational benefits.

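Paragraph function: Core claim. It advances the thesis that objects are the right level of abstraction.

Logical role: This paragraph carries the paper's most important claim: the basic unit of SLAM should be raised from points to objects.

Argumentation technique / potential weakness: The slogan of moving from "mapping everything" to "mapping what matters" is persuasive. However, what matters depends heavily on the application: for navigation, walls and floors may matter more than objects.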

2. Related Work

KinectFusion demonstrated real-time dense 3D reconstruction using an RGB-D sensor with a truncated signed distance function (TSDF) volume. While impressive, the resulting maps are purely geometric and scale poorly to large environments. ORB-SLAM and other feature-based systems produce sparse maps that are efficient but lack surface detail and semantic information. Recent work on semantic mapping adds object labels to existing maps via post-hoc segmentation, but does not leverage object models as first-class map elements.

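Paragraph function: Literature review. It surveys three lines of work: dense, sparse, and semantic mapping.

Logical role: Each approach falls one step short of object-level SLAM: KinectFusion has dense geometry but no semantics, ORB-SLAM is efficient but lacks detail, and semantic mapping has labels but does not treat objects as the core map element.

Argumentation technique / potential weakness: The three-way critique carves out a precise position for SLAM++. However, from a 2013 vantage point ORB-SLAM was still at an early stage, so the criticism may not fully apply to later versions.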

3. Method

3.1 System Overview

SLAM++ operates in real-time on RGB-D data from a Kinect-like sensor. The system maintains a pose graph where nodes represent either camera keyframes or recognized object instances, and edges represent geometric constraints between them. For each incoming frame, the system performs three operations: (1) camera tracking via ICP alignment against the current map, (2) object recognition and pose estimation using a model database, and (3) graph optimization to jointly refine all camera and object poses. Newly detected objects are added as nodes; re-observed objects provide loop closure constraints.

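Paragraph function: System architecture overview. It describes the three-step real-time processing pipeline.

Logical role: Object recognition and SLAM are integrated in a single pose-graph framework, where objects serve both as map elements and as loop closure constraints.

Argumentation technique / potential weakness: The three-step pipeline is easy to follow. However, recognition failures (false or missed detections) directly degrade map quality, so the system's robustness depends on the reliability of the recognition module.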
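The pose graph described in Section 3.1 can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `Node`, `Edge`, `PoseGraph`, `add_node`, `observe`, and `is_reobservation` are hypothetical, poses are assumed to be stored as a translation plus a unit quaternion, and the optimization step itself is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed pose layout: translation (x, y, z) + unit quaternion (qw, qx, qy, qz).
Pose = Tuple[float, float, float, float, float, float, float]
IDENTITY: Pose = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)


@dataclass
class Node:
    node_id: int
    kind: str                       # "camera" (keyframe) or "object" (instance)
    pose: Pose                      # current estimate in the world frame
    model_id: Optional[str] = None  # database model, set only for object nodes


@dataclass
class Edge:
    source_id: int      # node the measurement was taken from
    target_id: int      # node that was measured
    measurement: Pose   # pose of the target expressed in the source frame


@dataclass
class PoseGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, kind: str, pose: Pose,
                 model_id: Optional[str] = None) -> int:
        """Add a camera keyframe or object instance and return its id."""
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(node_id, kind, pose, model_id)
        return node_id

    def observe(self, source_id: int, target_id: int, measurement: Pose) -> None:
        """Record a geometric constraint between two existing nodes."""
        self.edges.append(Edge(source_id, target_id, measurement))

    def is_reobservation(self, object_id: int) -> bool:
        """True when more than one camera node constrains the same object."""
        cameras = {
            e.source_id for e in self.edges
            if e.target_id == object_id
            and self.nodes[e.source_id].kind == "camera"
        }
        return len(cameras) > 1


graph = PoseGraph()
cam0 = graph.add_node("camera", IDENTITY)
chair = graph.add_node("object", IDENTITY, model_id="chair")
graph.observe(cam0, chair, IDENTITY)   # first detection: new object node
cam1 = graph.add_node("camera", IDENTITY)
graph.observe(cam1, chair, IDENTITY)   # second camera sees the same chair
```

In this sketch, a second camera node observing the same object is what would supply the loop closure constraint mentioned in the text; a real system would then hand the nodes and edges to a graph optimizer.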

3.2 Object Recognition and Pose Estimation

Object recognition is performed by matching depth data against a database of pre-scanned 3D models. We use a multi-resolution approach: first, candidate object hypotheses are generated by matching local surface descriptors (FPFH features) between the current depth frame and the model database. Then, each hypothesis is refined via ICP alignment between the observed depth and the model, with the alignment quality serving as a verification score. Objects must achieve an ICP fitness above a threshold to be accepted. The 6-DoF pose from ICP provides the geometric constraint between the camera and the object in the pose graph.

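Paragraph function: Recognition mechanism. It describes the recognition flow from feature matching to ICP refinement.

Logical role: This module is the core component that distinguishes SLAM++ from traditional SLAM: it brings 3D recognition into the SLAM loop.

Argumentation technique / potential weakness: The two-stage FPFH + ICP approach was a reasonable choice in 2013. However, recognition based on hand-crafted features may fail on heavily occluded or partially visible objects. Deep learning methods later offered more robust alternatives.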
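The accept-or-reject step at the end of the recognition pipeline can be sketched as follows. This covers only the verification of a hypothesis whose pose is assumed to have been refined already; it does not implement FPFH matching or ICP itself. The fitness definition used here (the fraction of posed model points that have an observed point within `max_dist`) and the values of `max_dist` and `min_fitness` are assumptions made for illustration, since the text does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Rotation = Sequence[Sequence[float]]  # 3x3 row-major rotation matrix


def transform(points: Sequence[Point], rotation: Rotation,
              translation: Sequence[float]) -> List[Point]:
    """Apply the rigid transform p -> R p + t to every point."""
    return [
        tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        for p in points
    ]


def fitness(model: Sequence[Point], observed: Sequence[Point],
            rotation: Rotation, translation: Sequence[float],
            max_dist: float = 0.01) -> float:
    """Fraction of posed model points with an observed point within max_dist.

    Brute-force nearest-neighbour search, fine for a small illustration only.
    """
    placed = transform(model, rotation, translation)
    inliers = sum(
        1 for p in placed
        if min(math.dist(p, q) for q in observed) <= max_dist
    )
    return inliers / len(placed)


def accept(model: Sequence[Point], observed: Sequence[Point],
           rotation: Rotation, translation: Sequence[float],
           min_fitness: float = 0.8, max_dist: float = 0.01) -> bool:
    """Keep a hypothesis only if its alignment quality clears the threshold."""
    return fitness(model, observed, rotation, translation, max_dist) >= min_fitness


IDENTITY_R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# Simulated observation: the model shifted 0.5 m along x.
observed_points = transform(model_points, IDENTITY_R, (0.5, 0.0, 0.0))

good = accept(model_points, observed_points, IDENTITY_R, (0.5, 0.0, 0.0))
bad = accept(model_points, observed_points, IDENTITY_R, (0.0, 0.0, 0.0))
```

With the simulated data, the hypothesis that uses the true offset is accepted and the one with no offset is rejected, which mirrors how a threshold on alignment quality filters out wrong candidates.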

A key benefit of object-level mapping is map compactness. A traditional dense map of an office scene might contain millions of surfels or a large TSDF volume. In contrast, SLAM++ represents the same scene as a graph with perhaps 20-30 object nodes, each storing only a model ID and a 6-DoF pose (7 parameters, for example a 3D translation plus a unit quaternion). This yields a compression ratio of several orders of magnitude. Moreover, the object-level map is immediately useful for tasks like robotic grasping: the system knows not just "there is geometry here" but "there is a mug at this pose."

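Paragraph function: Quantifying the advantage. It demonstrates the value of the object-level representation through compression ratio and semantic usability.

Logical role: It backs the abstract's promise of a dramatically more compact map with concrete numbers, and uses the grasping example to show the practical value of a semantic representation.

Argumentation technique / potential weakness: The contrast of millions of points versus about 20 objects is striking. However, the compression comes at the cost of discarding all non-object geometry (walls, floors, and so on), so it does not suit tasks that need complete geometry.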
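The compactness argument can be checked with back-of-the-envelope arithmetic. The byte sizes below are assumptions chosen for illustration, not figures from the paper: 32-bit floats, a 4-byte integer model ID, and a surfel stored as seven floats (position, normal, radius). The one-off cost of storing the object model database itself is deliberately left out.

```python
FLOAT_BYTES = 4  # assumed 32-bit floats
INT_BYTES = 4    # assumed 32-bit model ID


def object_map_bytes(num_objects: int, pose_params: int = 7) -> int:
    """Size of an object-level map: one model ID plus one pose per object."""
    return num_objects * (INT_BYTES + pose_params * FLOAT_BYTES)


def surfel_map_bytes(num_surfels: int, floats_per_surfel: int = 7) -> int:
    """Size of a surfel map under the assumed per-surfel layout."""
    return num_surfels * floats_per_surfel * FLOAT_BYTES


objects = object_map_bytes(30)          # 30 objects * 32 bytes
surfels = surfel_map_bytes(1_000_000)   # one million surfels * 28 bytes
ratio = surfels / objects

print(f"object map: {objects} bytes")
print(f"surfel map: {surfels} bytes")
print(f"ratio: about {ratio:.0f}x")
```

Under these assumptions, 30 object nodes take under a kilobyte while a million surfels take tens of megabytes, which is consistent with the "several orders of magnitude" claim in the text. The exact ratio depends entirely on the assumed layouts.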

4. Experiments

We evaluate SLAM++ on real-time RGB-D sequences captured in office environments with a Kinect sensor. The object database contains pre-scanned models of common office objects including chairs, monitors, and keyboards. The system runs at approximately 20 Hz on a desktop GPU, including object recognition. In quantitative evaluation, SLAM++ achieves camera tracking accuracy comparable to KinectFusion while producing a map that is over 1000x more compact. Object pose estimation accuracy is within 2-3 cm translation and 5 degrees rotation for well-observed objects.

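Paragraph function: Quantitative validation. It reports real-time performance, tracking accuracy, and map compression ratio.

Logical role: It provides empirical support along three dimensions at once: speed (20 Hz), accuracy (comparable to KinectFusion), and compression (1000x).

Argumentation technique / potential weakness: Validation along three dimensions is thorough. However, an office is the most favorable test scene (highly repetitive objects, simple geometry), and performance in home or industrial environments may differ.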
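The object pose accuracy figures quoted above (centimetres of translation, degrees of rotation) correspond to two standard error measures, sketched here. The text does not say how the errors were computed, so this is one common convention rather than the authors' evaluation code; the function names are hypothetical and the quaternions are assumed to be unit length in (w, x, y, z) order.

```python
import math
from typing import Sequence


def translation_error_cm(t_est: Sequence[float], t_ref: Sequence[float]) -> float:
    """Euclidean distance between two positions given in metres, in centimetres."""
    return 100.0 * math.dist(t_est, t_ref)


def rotation_error_deg(q_est: Sequence[float], q_ref: Sequence[float]) -> float:
    """Angle of the relative rotation between two unit quaternions, in degrees."""
    # abs() handles the fact that q and -q describe the same rotation.
    dot = abs(sum(a * b for a, b in zip(q_est, q_ref)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


# Reference pose: origin, no rotation.
t_ref, q_ref = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0)

# Estimate: 2 cm off along x, rotated 5 degrees about the z axis.
half = math.radians(5.0) / 2.0
t_est, q_est = (0.02, 0.0, 0.0), (math.cos(half), 0.0, 0.0, math.sin(half))

t_err = translation_error_cm(t_est, t_ref)
r_err = rotation_error_deg(q_est, q_ref)
```

An estimate like the one constructed here would sit at the edge of the reported accuracy range for well-observed objects.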

We demonstrate several unique capabilities enabled by object-level mapping. First, relocalization: when the camera is lost, recognizing a previously mapped object instantly recovers the camera pose without requiring visual feature matching. Second, map reuse across sessions: the compact object-level map can be saved and reloaded, allowing immediate localization in a previously mapped environment. Third, semantic queries: the system can answer questions like "where is the nearest keyboard?" directly from the map.

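Paragraph function: Application showcase. It presents high-level capabilities unique to object-level SLAM.

Logical role: It goes beyond parity with existing methods and shows the qualitative difference: things the object-level representation can do that traditional methods cannot.

Argumentation technique / potential weakness: The three application examples are concrete and convincing. However, their usefulness depends on reliable object recognition: a wrong recognition leads to wrong relocalization and wrong answers to semantic queries.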
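Two of the capabilities above reduce to short computations on the object map, sketched here under stated assumptions. For relocalization, if the map stores an object's pose in the world frame and the recognizer returns the same object's pose in the camera frame, the camera pose follows by composing one transform with the inverse of the other. For the semantic query, the map is scanned for the closest instance of a given model. All names (`relocalize`, `nearest_object`, `world_from_object`, `camera_from_object`) and the 4x4 homogeneous matrix convention are hypothetical choices for this illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]  # 4x4 homogeneous rigid transform, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Inverse of a rigid transform: rotation R^T, translation -R^T t."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[0] + [trans[0]],
            r_t[1] + [trans[1]],
            r_t[2] + [trans[2]],
            [0.0, 0.0, 0.0, 1.0]]


def relocalize(world_from_object: Matrix, camera_from_object: Matrix) -> Matrix:
    """Camera pose in the world: world_from_object * inverse(camera_from_object)."""
    return matmul(world_from_object, invert_rigid(camera_from_object))


def nearest_object(objects: Sequence[Tuple[str, Tuple[float, float, float]]],
                   model_id: str,
                   camera_position: Sequence[float]
                   ) -> Optional[Tuple[float, float, float]]:
    """Position of the closest mapped instance of model_id, or None if absent."""
    matches = [pos for mid, pos in objects if mid == model_id]
    if not matches:
        return None
    return min(matches, key=lambda pos: math.dist(pos, camera_position))


# Mapped object sits at the world origin with no rotation.
world_from_object = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# The recognizer sees it rotated 90 degrees about z and 1 m along the camera's x axis.
camera_from_object = [[0.0, -1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
world_from_camera = relocalize(world_from_object, camera_from_object)

object_map = [("keyboard", (3.0, 0.0, 0.0)),
              ("keyboard", (1.0, 0.0, 0.0)),
              ("chair", (0.5, 0.0, 0.0))]
closest_keyboard = nearest_object(object_map, "keyboard", (0.0, 0.0, 0.0))
```

A quick consistency check on the relocalization is that composing the recovered `world_from_camera` with `camera_from_object` gives back `world_from_object`. As the notes above point out, both results are only as good as the recognition that produced the object pose.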

5. Conclusion

SLAM++ demonstrates that operating at the level of objects is a viable and advantageous paradigm for real-time SLAM. By replacing millions of low-level primitives with a compact graph of recognized object instances, the system achieves comparable tracking accuracy with dramatically reduced map size and gains semantic understanding for free. The current limitation of requiring a pre-built object database could be addressed in future work by integrating online object learning or category-level pose estimation. We envision SLAM++ as a step toward truly intelligent spatial understanding.

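Paragraph function: Summary and outlook. It restates the value of the paradigm shift and acknowledges the limitation.

Logical role: It sums up the core value proposition as semantic understanding gained for free, and shows academic integrity by discussing limitations and future directions.

Argumentation technique / potential weakness: The phrase "for free" is elegant but slightly misleading: the semantics are not free but paid for by the up-front investment in a pre-built model database. Proposing online learning as a future direction is a sensible move.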

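Overview of the Argument Structure

Problem: traditional SLAM maps are huge and carry no semantics
→ Claim: objects are the right level of abstraction for SLAM
→ Evidence: 1000x compression, comparable tracking accuracy
→ Rebuttal: a pre-built model database is required; online learning is a possible future remedy
→ Conclusion: a step toward intelligent spatial understanding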

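The Authors' Core Claim (One Sentence)

By raising the basic map element of SLAM from low-level geometric primitives to 3D object instances, the system can run in real time while achieving orders-of-magnitude map compression and naturally gaining semantic understanding of the scene.

Strongest Point of the Argument

The foresight of the paradigm shift: SLAM++ proposed object-level mapping in 2013, before semantic SLAM had become a research hotspot, and anticipated the direction the whole field later took. Using objects as loop closure constraints is an especially clever design: recognizing a single known object provides a precise 6-DoF constraint, which is more robust than feature point matching.

Weakest Point of the Argument

The closed-world assumption: the system requires every recognizable object to exist in the database in advance, which is impractical for real open-world deployment. Unknown objects are ignored entirely, which can leave blank regions in the map. In addition, the system was tested only in controlled office environments, so the environmental diversity is insufficient.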