Abstract
3D visual perception tasks, including 3D detection and map segmentation, are essential for autonomous driving. In this work, we present BEVFormer, a paradigm that learns unified Bird's-Eye-View (BEV) representations from multi-camera images to support multiple autonomous driving perception tasks. BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that enables each BEV query to extract spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state-of-the-art 56.9% on the NDS metric on the nuScenes test set, 9.0 points higher than the previous best method.
Paragraph function
Overview of the whole paper: a unified BEV representation ties spatial and temporal perception together.
Logical role
The abstract commits to two core contributions, a unified representation and state-of-the-art performance; the large NDS improvement (9.0 points) is the most persuasive evidence.
Argumentation technique / potential weaknesses
A 9.0-point gain is very significant in the autonomous driving field, but the abstract does not say whether extra data or a larger backbone was used, factors that could affect the fairness of the comparison.
1. Introduction
Camera-based 3D perception has attracted increasing attention in autonomous driving due to the low cost of cameras compared to LiDAR. However, converting perspective-view image features to a unified 3D representation remains challenging. Previous methods either use depth estimation to lift 2D features to 3D (e.g., LSS, BEVDet) or employ transformer-based approaches with 3D reference points (e.g., DETR3D). The former suffers from inaccurate depth prediction, while the latter only attends to sparse reference points, losing dense scene information. We propose BEVFormer to address these limitations through a unified spatiotemporal transformer architecture that generates BEV features from multi-view cameras.
Paragraph function
Establishes the problem: contrasts the two main lines of existing methods and their shortcomings.
Logical role
Frames the dilemma of existing methods as a dichotomy between "depth lifting" and "sparse reference points", motivating BEVFormer as a third way.
Argumentation technique / potential weaknesses
Opening with the cost advantage of cameras is an effective appeal to industry. Whether the criticism that DETR3D "loses dense scene information" is fair, however, depends on the demands of the specific task.
2. Method
BEVFormer consists of three core components. First, BEV queries are a set of grid-shaped learnable parameters Q ∈ R^{H×W×C}, where H and W define the spatial resolution of the BEV plane and C is the channel dimension. These queries encode the prior knowledge of the BEV space and serve as the carrier for aggregating information from camera features. Second, Spatial Cross-Attention (SCA) lifts each BEV query to a set of 3D reference points along the vertical axis (pillar sampling), projects them to 2D image planes via camera calibration, and uses deformable attention to aggregate features from the corresponding image locations across multiple camera views.
Paragraph function
Core of the methodology: details the BEV queries and the spatial cross-attention mechanism.
Logical role
Describes the conversion from multi-camera images to BEV features clearly as a three-step "query, project, aggregate" pipeline.
Argumentation technique / potential weaknesses
"Pillar sampling" is a clever design that implicitly handles height uncertainty, but a fixed vertical sampling strategy may not suit scenes with large height variation.
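The pillar-sampling and projection step of SCA can be sketched as below. The function names, the nuScenes-style point-cloud range, and the 4×4 `lidar2img` calibration matrices are illustrative assumptions, not the paper's exact implementation (which additionally applies multi-scale deformable attention around the projected points):

```python
import numpy as np

def pillar_reference_points(H, W, num_z,
                            pc_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)):
    """Sample num_z 3D points along the vertical axis (the "pillar") for
    each BEV grid cell. The point-cloud range is a nuScenes-style assumption."""
    x0, y0, z0, x1, y1, z1 = pc_range
    # cell / level centers in ego coordinates (meters)
    xs = (np.arange(W) + 0.5) / W * (x1 - x0) + x0
    ys = (np.arange(H) + 0.5) / H * (y1 - y0) + y0
    zs = (np.arange(num_z) + 0.5) / num_z * (z1 - z0) + z0
    gy, gx, gz = np.meshgrid(ys, xs, zs, indexing="ij")   # each (H, W, num_z)
    return np.stack([gx, gy, gz], axis=-1).reshape(H * W, num_z, 3)

def project_to_image(points_3d, lidar2img):
    """Project 3D reference points into one camera's image plane using a 4x4
    calibration matrix; points behind the camera are marked invalid."""
    ones = np.ones((*points_3d.shape[:-1], 1))
    cam = np.concatenate([points_3d, ones], axis=-1) @ lidar2img.T
    depth = cam[..., 2:3]
    valid = depth[..., 0] > 1e-5          # positive depth only
    uv = cam[..., :2] / np.clip(depth, 1e-5, None)
    return uv, valid
```

In this reading, a BEV query attends only at the camera views where at least one of its pillar points projects inside the image and `valid` holds, which is how dense BEV features are gathered without depth estimation.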
Third, Temporal Self-Attention (TSA) fuses the current BEV queries with the BEV features from the previous timestamp. We first align the previous BEV features to the current coordinate system using ego-motion, then apply deformable self-attention between the current queries and the aligned historical features. This mechanism enables the model to capture temporal cues such as object velocity and occlusion patterns, which are critical for accurate 3D detection. The entire architecture is trained end-to-end with task-specific heads for 3D object detection (following DETR-style set prediction) and BEV segmentation.
Paragraph function
Temporal fusion mechanism: completes the final piece of the methodology.
Logical role
Temporal self-attention is BEVFormer's core differentiator relative to static (single-frame) methods.
Argumentation technique / potential weaknesses
Aligning with ego-motion is a practically necessary design, but using only a single previous frame may be insufficient to capture long-range temporal dependencies.
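The ego-motion alignment in TSA can be sketched as a BEV grid warp. The `(dx, dy, dyaw)` parameterization of ego motion between frames and the nearest-neighbor lookup are illustrative simplifications; the paper folds the alignment into deformable attention with continuous sampling:

```python
import numpy as np

def align_prev_bev(prev_bev, dx, dy, dyaw, grid_res=0.5):
    """Warp the previous frame's BEV features (H, W, C) into the current ego
    frame. dx, dy are in meters, dyaw in radians; they are assumed to map
    current-frame coordinates into the previous ego frame."""
    H, W, _ = prev_bev.shape
    # current-frame grid cell centers in meters, ego at the BEV center
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_m = (xs - W / 2 + 0.5) * grid_res
    y_m = (ys - H / 2 + 0.5) * grid_res
    # map current coordinates back into the previous ego frame
    c, s = np.cos(dyaw), np.sin(dyaw)
    x_prev = c * x_m - s * y_m + dx
    y_prev = s * x_m + c * y_m + dy
    # nearest-neighbor lookup into prev_bev; out-of-range cells get zeros
    ix = np.round(x_prev / grid_res + W / 2 - 0.5).astype(int)
    iy = np.round(y_prev / grid_res + H / 2 - 0.5).astype(int)
    valid = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)
    out = np.zeros_like(prev_bev)
    out[valid] = prev_bev[iy[valid], ix[valid]]
    return out
```

After this warp, the same BEV cell in the current and historical feature maps refers to the same ground location, which is what lets self-attention between them read off motion cues such as velocity.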
3. Experiments
We evaluate BEVFormer on the nuScenes benchmark for both 3D object detection and map segmentation. For 3D detection, BEVFormer with ResNet-101-DCN backbone achieves 51.7% NDS and 41.6% mAP on the validation set, and 56.9% NDS on the test set, establishing a new state of the art among camera-only methods. Compared to DETR3D, BEVFormer improves NDS by 9.0 points and mAP by 7.7 points. The temporal self-attention contributes +3.5% NDS improvement, confirming the importance of temporal modeling. For BEV map segmentation, BEVFormer achieves 62.7% mIoU, outperforming all previous camera-based methods.
Paragraph function
Core experimental results: comprehensive quantitative comparison.
Logical role
Multiple metrics and an ablation study fully support the method's effectiveness, in particular the ablation of temporal self-attention.
Argumentation technique / potential weaknesses
The ablation (+3.5% NDS) precisely quantifies the contribution of the temporal component, but the gap to LiDAR-based methods is not directly discussed.
Ablation studies reveal several insights. The spatial cross-attention with pillar sampling outperforms both global attention and point-based sampling by 2.1% and 1.3% NDS respectively, validating the pillar query design. Using 4 vertical sampling points per pillar achieves the best trade-off between accuracy and efficiency. The BEV resolution of 200x200 (at 0.5m per grid) provides optimal performance, with diminishing returns at higher resolutions. Qualitative results show that BEVFormer produces more accurate velocity estimation and fewer false positives for occluded objects compared to single-frame methods.
Paragraph function
Ablation analysis: validates each design choice.
Logical role
Systematic ablations show the empirical basis behind every design decision.
Argumentation technique / potential weaknesses
The detailed hyperparameter analysis strengthens reproducibility, but a 0.5 m resolution may be insufficient for detecting small objects.
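The scale implied by the reported ablation settings is worth making explicit; a quick back-of-the-envelope check (using only the grid size and cell resolution stated above):

```python
# Coverage and query count implied by the reported BEV settings
grid_cells = 200          # BEV grid is 200 x 200
cell_size_m = 0.5         # 0.5 m per grid cell

coverage_m = grid_cells * cell_size_m       # 100.0 m per side,
                                            # i.e. roughly ±50 m around the ego
num_queries = grid_cells * grid_cells       # 40000 BEV queries per frame
print(coverage_m, num_queries)
```

The 40,000-query count is one reason deformable attention (a few sampled points per query) is used instead of full attention over all image features.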
4. Conclusion
We have presented BEVFormer, a spatiotemporal transformer framework that learns unified BEV representations from multi-camera images for autonomous driving perception. Through spatial cross-attention and temporal self-attention, BEVFormer effectively aggregates multi-view and multi-frame information into a coherent BEV representation. The resulting model achieves state-of-the-art performance on the nuScenes benchmark for both 3D detection and map segmentation. We believe that the BEV representation paradigm will continue to advance camera-based autonomous driving systems toward closing the gap with LiDAR-based approaches.
Paragraph function
Overall summary: restates the contributions and looks ahead to the development of the BEV paradigm.
Logical role
Closing with "closing the gap with LiDAR" situates the work within a larger industry vision.
Argumentation technique / potential weaknesses
"Closing the gap" in the conclusion is careful wording that acknowledges a remaining gap between camera-based methods and LiDAR.