Planning-oriented Autonomous Driving (UniAD)

Abstract — 摘要

Modern autonomous driving systems are typically built in a modular fashion, where perception, prediction, and planning are designed and optimized independently. While this divide-and-conquer strategy simplifies development, it suffers from accumulated errors across stages and lacks holistic reasoning about the driving scene. An alternative is end-to-end learning, which directly maps sensor inputs to planning outputs but often sacrifices interpretability and struggles with complex multi-agent interactions.

現代自動駕駛系統通常以模組化方式建構，其中感知、預測與規劃分別獨立設計與最佳化。雖然這種分而治之的策略簡化了開發流程，但它在各階段之間累積誤差，且缺乏對駕駛場景的整體性推理。另一種替代方案是端到端學習，直接將感測器輸入映射至規劃輸出，但往往犧牲了可解釋性，且在處理複雜的多智能體互動時力不從心。

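Paragraph function: Frames the problem by identifying the fundamental dilemma shared by the two dominant paradigms for autonomous driving systems.

Logical role: Starting point of the argument chain. The twin problems of error accumulation in modular systems and poor interpretability in end-to-end systems establish the need for a third route, a unified framework.

Argumentative technique / potential weakness: Presents the flaws of modular and end-to-end designs as a binary opposition, implying that neither is satisfactory. This either-or framing may oversimplify actual industry practice, since many systems already pass information tightly between modules.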

In this work, the authors present UniAD, a Unified Autonomous Driving framework that incorporates full-stack driving tasks in one network. The key philosophy is planning-oriented: all preceding tasks — tracking, mapping, motion forecasting, and occupancy prediction — are designed to serve and benefit the ultimate planning objective. UniAD connects these nodes through query-based interfaces within transformer decoders, enabling effective inter-task feature interaction. Extensive experiments on the nuScenes benchmark demonstrate that UniAD achieves state-of-the-art performance on all tasks simultaneously, and that the joint design substantially improves planning safety and accuracy compared to both modular pipelines and naive multi-task approaches.

本研究提出 UniAD——一個將完整自動駕駛任務堆疊整合於單一網路中的統一框架。其核心理念為「以規劃為導向」：所有前置任務——追蹤、建圖、運動預測與占用預測——皆設計為服務於最終的規劃目標。UniAD 透過 Transformer 解碼器中基於查詢的介面串連這些節點，實現有效的跨任務特徵交互。在 nuScenes 基準上的大量實驗表明，UniAD 在所有任務上同時達到最先進的效能，且相較於模組化管線與單純的多任務方法，聯合設計顯著提升了規劃的安全性與準確度。

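Paragraph function: Core declaration of the paper. The phrase "planning-oriented" positions UniAD's academic contribution.

Logical role: Picks up the dual dilemma from the previous paragraph and proposes a third route: neither modular nor purely end-to-end, but multiple tasks linked organically through query interfaces. This is the thesis statement of the paper.

Argumentative technique / potential weakness: "State-of-the-art performance on all tasks simultaneously" is a very strong claim. Readers should note that it rests on a single benchmark, nuScenes, whose scene complexity still falls short of real deployment conditions.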

1. Introduction — 緒論

Autonomous driving has long been approached through a sequential pipeline: first perceive the environment via detection, tracking, and mapping; then predict the future behavior of surrounding agents; and finally plan a safe trajectory for the ego vehicle. This modular design brings clear engineering advantages — each component can be developed, tested, and improved independently. However, errors from upstream modules inevitably propagate downstream, and information loss at module boundaries prevents the planner from fully exploiting rich scene context. For instance, a missed detection in perception leads to a missing prediction, which in turn causes a dangerous planning decision.

自動駕駛長期以來透過序列式管線進行：先透過偵測、追蹤與建圖來感知環境；接著預測周圍智能體的未來行為；最後為自車規劃安全軌跡。此模組化設計具有明確的工程優勢——每個元件可獨立開發、測試與改進。然而，上游模組的誤差不可避免地向下游傳播，且模組邊界處的資訊損失使得規劃器無法充分利用豐富的場景脈絡。例如，感知中的漏檢將導致預測缺失，進而引發危險的規劃決策。

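Paragraph function: Establishes the research background by describing the traditional modular driving stack and its fundamental weakness.

Logical role: Praises before criticizing. The paragraph first grants the engineering advantages of modularity, then exposes its critical flaws (error propagation and information loss), laying the groundwork for a unified framework.

Argumentative technique / potential weakness: The causal chain "missed detection -> missing prediction -> dangerous planning" is a concrete and persuasive example. Such error cascades can also be mitigated with redundant sensors and safety mechanisms, engineering remedies the authors do not mention.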

Recent works have explored multi-task learning for autonomous driving, where a shared backbone extracts features consumed by multiple task-specific heads. While this improves efficiency and enables some implicit feature sharing, the tasks are often treated as independent objectives with separate loss functions, lacking explicit coordination. Such approaches may suffer from negative transfer — where optimizing one task degrades another — and fail to capture the causal dependencies between perception, prediction, and planning. The fundamental question remains: how should driving tasks be organized so that they collectively contribute to safe and effective planning?

近期研究已探索自動駕駛的多任務學習，其中共享骨幹網路提取的特徵供多個任務專用頭使用。雖然這提升了效率並實現了一定程度的隱式特徵共享，但各任務往往被視為具有獨立損失函數的獨立目標，缺乏顯式的協調。此類方法可能遭受負遷移——最佳化某一任務反而使另一任務性能下降——且無法捕捉感知、預測與規劃之間的因果依賴關係。核心問題仍然存在：駕駛任務應如何組織，才能共同促進安全且有效的規劃？

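Paragraph function: Critiques an alternative by pointing out the structural shortcomings of naive multi-task learning.

Logical role: Narrows the problem space further: not only is the modular design inadequate, a shared multi-task backbone is also insufficient because it lacks explicit causal modeling between tasks. The closing question leads neatly into UniAD's design philosophy.

Argumentative technique / potential weakness: Negative transfer is a known problem in multi-task learning and is strategically amplified here to criticize existing methods. It can be mitigated effectively by gradient-balancing techniques such as GradNorm, which the authors do not discuss.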

The authors argue that the answer lies in a planning-oriented philosophy: every task in the driving stack should be designed with the ultimate goal of planning in mind. They introduce UniAD, which connects five key nodes — detection and tracking (TrackFormer), online mapping (MapFormer), multi-agent motion forecasting (MotionFormer), occupancy prediction (OccFormer), and ego planning (Planner) — through a unified query-based design. Each module communicates with downstream modules via learned queries that carry structured, task-specific representations, ensuring that intermediate outputs are not only accurate on their own metrics but also maximally informative for the planning task.

作者主張，答案在於「以規劃為導向」的設計哲學：駕駛堆疊中的每項任務都應以規劃為最終目標來設計。他們提出 UniAD，透過統一的基於查詢設計，串連五個關鍵節點——偵測與追蹤（TrackFormer）、線上建圖（MapFormer）、多智能體運動預測（MotionFormer）、占用預測（OccFormer）以及自車規劃（Planner）。每個模組透過攜帶結構化、任務專屬表示的學習查詢與下游模組通訊，確保中間輸出不僅在各自的指標上準確，更對規劃任務提供最大資訊量。

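Paragraph function: Presents the core solution with a complete overview of UniAD's five-module architecture and its query-interface design.

Logical role: Pivot of the whole argument. It turns "planning-oriented" from an abstract idea into a pipeline of five named modules. The query interface is the technical core that lets downstream modules consume each task's representation directly.

Argumentative technique / potential weakness: The uniform naming scheme (the -Former suffix) reinforces the impression of architectural consistency. Chaining five modules in sequence, however, implies potentially high inference latency, a key concern for real-time driving, and the introduction does not discuss this trade-off.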

2. Related Work — 相關工作

3D object detection from multi-camera inputs has advanced rapidly with methods like BEVFormer and DETR3D, which project image features into bird's-eye-view (BEV) representations. For multi-object tracking (MOT), traditional approaches rely on detection followed by association heuristics, introducing non-differentiable post-processing that breaks end-to-end training. Recent query-based trackers such as MOTR and TrackFormer perform joint detection and tracking using track queries that propagate across frames, eliminating the need for hand-crafted association rules. However, these methods are designed as standalone modules and do not consider how tracking representations feed into downstream prediction and planning.

基於多攝影機輸入的三維物件偵測隨著 BEVFormer 與 DETR3D 等方法快速發展，這些方法將影像特徵投影至鳥瞰圖（BEV）表示。對於多物件追蹤，傳統方法依賴偵測後關聯啟發式，引入了破壞端到端訓練的不可微分後處理。近期基於查詢的追蹤器如 MOTR 與 TrackFormer 使用跨幀傳播的追蹤查詢進行聯合偵測與追蹤，消除了手工關聯規則的需求。然而，這些方法被設計為獨立模組，未考慮追蹤表示如何饋入下游的預測與規劃。

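Paragraph function: Literature review tracing the technical evolution of 3D detection and tracking.

Logical role: Builds a technical lineage: BEVFormer/DETR3D (detection) -> MOTR/TrackFormer (tracking) -> UniAD (full-stack integration). Each step names a remaining gap, which locates this paper's contribution.

Argumentative technique / potential weakness: "Standalone modules that ignore downstream tasks" serves as a blanket critique that groups all prior work as insufficiently integrated. It overlooks industrial systems (such as Waymo's) that already integrate their modules deeply.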

Motion forecasting aims to predict future trajectories of surrounding agents. Early methods use recurrent networks or graph neural networks to model temporal and social interactions. More recent approaches adopt goal-conditioned prediction, where an agent's future trajectory is anchored to a set of possible goal locations. Scene-centric representations that reason jointly about all agents have shown advantages over agent-centric formulations that independently predict each agent without considering inter-agent coordination. Crucially, most motion forecasting methods assume perfect perception inputs (ground-truth detections and tracks), a condition rarely met in real-world deployment.

運動預測旨在預測周圍智能體的未來軌跡。早期方法使用遞迴網路或圖神經網路來建模時序與社交互動。較新的方法採用目標條件式預測，其中智能體的未來軌跡被錨定至一組可能的目標位置。以場景為中心的表示法對所有智能體進行聯合推理，相較於獨立預測各智能體而不考慮智能體間協調的以智能體為中心之方法，展現出優勢。至關重要的是，大多數運動預測方法假設感知輸入是完美的（真實標註的偵測與追蹤），但此條件在真實部署中鮮少被滿足。

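Paragraph function: Positions the work in the literature by exposing the critical perfect-input assumption in motion forecasting.

Logical role: The key point is in the last sentence: existing forecasting methods owe their performance to an unrealistic assumption, which directly motivates UniAD's end-to-end design that must still work under imperfect perception.

Argumentative technique / potential weakness: Targeting the perfect-perception assumption hits precisely the gap between academic benchmarks and real deployment. This is one of the most persuasive motivations in the paper: a unified framework lets the prediction module receive noisy perception outputs directly and learn to tolerate them.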

End-to-end autonomous driving methods attempt to learn a direct mapping from sensor inputs to driving actions. Pioneering works like ALVINN and more recent imitation learning approaches demonstrate feasibility, but typically lack intermediate representations, making failure diagnosis difficult. ST-P3 and PnPNet incorporate some intermediate tasks but treat them with simple concatenation or loose coupling, without explicitly modeling the task dependency graph. UniAD distinguishes itself by systematically designing inter-task connections through query-based interfaces, where each module's output queries directly serve as input to the next.

端到端自動駕駛方法嘗試學習從感測器輸入到駕駛動作的直接映射。先驅性工作如 ALVINN 及更近期的模仿學習方法展示了可行性，但通常缺乏中間表示，使得故障診斷困難。ST-P3 與 PnPNet 納入了部分中間任務，但以簡單串接或鬆散耦合處理，未顯式建模任務依賴圖。UniAD 的區別在於系統性地透過基於查詢的介面設計跨任務連接，每個模組的輸出查詢直接作為下一個模組的輸入。

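Paragraph function: Competitor comparison that separates UniAD from existing end-to-end methods.

Logical role: Closes the loop of the literature review: modular design (paragraph 1), standalone prediction (paragraph 2), and simple end-to-end learning (paragraph 3) all fall short, and UniAD's query interface is presented as the only approach that systematically addresses inter-task communication.

Argumentative technique / potential weakness: Classifying ST-P3 and PnPNet as loosely coupled is an effective way to differentiate the work, but the superiority of the query interface must be backed by experimental data. A difference in architecture alone does not prove a performance gain.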

3. Method — 方法概述

The overall architecture of UniAD follows a sequential, query-based paradigm. Multi-camera images are first processed by a shared image backbone and BEV encoder to produce a unified bird's-eye-view (BEV) feature map. This BEV representation serves as the common foundation for all downstream modules. The five task modules — TrackFormer, MapFormer, MotionFormer, OccFormer, and Planner — are arranged in a directed acyclic graph (DAG) that respects the natural dependency order of driving tasks. Each module is a transformer decoder that takes task-specific queries and cross-attends to the BEV features and outputs of preceding modules.

UniAD 的整體架構遵循序列式、基於查詢的範式。多攝影機影像首先由共享影像骨幹與 BEV 編碼器處理，產生統一的鳥瞰圖特徵圖。此 BEV 表示作為所有下游模組的共同基礎。五個任務模組——TrackFormer、MapFormer、MotionFormer、OccFormer 與 Planner——以有向無環圖（DAG）排列，遵循駕駛任務的自然依賴順序。每個模組是一個 Transformer 解碼器，接收任務專屬查詢並對 BEV 特徵及前置模組的輸出進行交叉注意力運算。

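Paragraph function: Architecture overview describing UniAD's five-module DAG from a high level.

Logical role: Serves as the map of the method section, giving readers the whole picture before the per-module details. The DAG concept captures the one-way dependencies between tasks precisely.

Argumentative technique / potential weakness: Describing the task flow as a DAG is a clean abstraction, but it also rules out backward information flow, such as planning results feeding back to improve perception. This may be an architectural limitation, since in some scenes the planning intent could guide how attention is allocated.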
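
The dependency structure described above can be written down as a small graph and checked for a valid execution order. The following is a minimal sketch, not the authors' implementation: the edge set is an assumption read off this summary, `BEVEncoder` is a placeholder name for the shared backbone plus BEV encoder, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each module maps to the upstream modules whose outputs it consumes.
# The wiring follows the prose of this summary and is an assumption.
UNIAD_DAG = {
    "BEVEncoder": set(),
    "TrackFormer": {"BEVEncoder"},
    "MapFormer": {"BEVEncoder"},
    "MotionFormer": {"BEVEncoder", "TrackFormer", "MapFormer"},
    "OccFormer": {"BEVEncoder", "MotionFormer"},
    "Planner": {"BEVEncoder", "MotionFormer", "OccFormer"},
}


def execution_order(dag):
    """Return one order in which every module runs after its inputs exist."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is how a backward edge (e.g. Planner -> TrackFormer) would show up.
    return list(TopologicalSorter(dag).static_order())


order = execution_order(UNIAD_DAG)
print(order)
```

Adding a feedback edge from the planner to perception would make the sorter raise a cycle error, which is the structural form of the "no backward information flow" limitation noted in the commentary.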

The core innovation is the query-based inter-task communication mechanism. In UniAD, the output queries of one module serve as the input key/value for the next module's cross-attention layers. For example, agent queries from TrackFormer carry per-object representations into MotionFormer, where they interact with map queries from MapFormer to produce motion-aware forecasts. This design ensures that information flows in a structured, task-aware manner rather than through generic shared features, enabling each downstream module to selectively attend to the most relevant upstream representations.

核心創新在於基於查詢的跨任務通訊機制。在 UniAD 中，一個模組的輸出查詢作為下一個模組交叉注意力層的輸入鍵/值。例如，來自 TrackFormer 的智能體查詢攜帶逐物件表示進入 MotionFormer，在那裡與來自 MapFormer 的地圖查詢互動，產生具有運動感知的預測。此設計確保資訊以結構化、任務感知的方式流動，而非透過泛用的共享特徵，使得每個下游模組能選擇性地關注最相關的上游表示。

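Paragraph function: Explains the core mechanism, detailing how queries are passed and interact between modules.

Logical role: Turns the query interface from an abstract concept into a concrete attention-based implementation. It is the central pillar of the paper's technical argument.

Argumentative technique / potential weakness: A concrete example (TrackFormer -> MotionFormer + MapFormer) makes the abstract mechanism easier to grasp. The "structured vs. generic" contrast implies superiority over simple feature sharing, but ablations are needed to verify that query passing really outperforms feature concatenation.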
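
To make the key/value hand-off concrete, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is illustrative only: there are no learned projections, and the vectors `agent_queries`, `map_queries`, and `motion_queries` are made-up toy values, not anything taken from UniAD.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical upstream outputs: 2 agent queries and 3 map queries, 4-dim each.
agent_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
map_queries = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]

# The downstream module's queries attend to upstream queries as keys and values.
motion_queries = [[1.0, 0.0, 0.0, 0.0]]
upstream = agent_queries + map_queries
updated = cross_attention(motion_queries, upstream, upstream)
print(updated)
```

The downstream query ends up dominated by the upstream vectors it aligns with, which is the "selective attention to relevant upstream representations" the paragraph describes.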

3.1 TrackFormer — 追蹤模組

TrackFormer performs joint 3D object detection and multi-object tracking in a unified transformer decoder. It maintains two types of queries: detection queries for discovering new objects entering the scene, and track queries that propagate information about previously detected objects across frames. At each time step, detection queries attend to the current BEV features to localize new agents, while track queries carry forward the representation of existing agents and update their states through cross-attention with the latest BEV features.

TrackFormer 在統一的 Transformer 解碼器中執行聯合三維物件偵測與多物件追蹤。它維護兩種查詢：用於發現進入場景之新物件的偵測查詢，以及跨幀傳播先前已偵測物件資訊的追蹤查詢。在每個時間步，偵測查詢對當前 BEV 特徵進行注意力運算以定位新智能體，而追蹤查詢則延續既有智能體的表示，並透過與最新 BEV 特徵的交叉注意力更新其狀態。

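Paragraph function: Module definition describing TrackFormer's dual-query mechanism.

Logical role: As the first module in the pipeline, TrackFormer produces the agent queries that feed every later module directly, so the quality of its representation sets the upper bound for the whole system.

Argumentative technique / potential weakness: Splitting queries into detection and track queries elegantly avoids the traditional two-step detect-then-associate flow. The design assumes that track queries propagate reliably across frames. Under rapid occlusion or sudden appearance, a track query may be lost or mismatched.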
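
The lifecycle of the two query types can be sketched as a per-frame update. This is a simplified illustration under stated assumptions: the score thresholds and the `step` helper are hypothetical, the scores stand in for decoder outputs, and query embeddings are omitted so that only identity bookkeeping remains.

```python
from itertools import count

BIRTH_THRESHOLD = 0.4   # hypothetical: score needed to start a new track
DROP_THRESHOLD = 0.35   # hypothetical: score below which a track is dropped


def step(track_ids, track_scores, detection_scores, id_source):
    """Advance one frame and return the IDs of the track queries that remain."""
    # Existing track queries survive while their score stays high enough.
    survivors = [tid for tid in track_ids if track_scores[tid] >= DROP_THRESHOLD]
    # Each confident detection query is promoted to a track query with a new ID.
    births = [next(id_source) for s in detection_scores if s >= BIRTH_THRESHOLD]
    return survivors + births


ids = count(1)
frame1 = step([], {}, [0.9, 0.1], ids)                    # one new object appears
frame2 = step(frame1, {1: 0.8}, [0.2, 0.7], ids)          # a second object appears
frame3 = step(frame2, {1: 0.1, 2: 0.9}, [0.0, 0.0], ids)  # the first object is lost
print(frame1, frame2, frame3)
```

No explicit association step appears anywhere: an object keeps its identity simply because the same track query is carried into the next frame.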

A critical design choice is that TrackFormer eliminates all non-differentiable post-processing typically used in tracking pipelines, such as Hungarian matching at inference time or NMS-based filtering. Instead, the association between detections and tracks is implicitly learned through the attention mechanism. During training, bipartite matching is used to assign ground-truth objects to queries, and a shared query matching strategy ensures consistent instance identities across all downstream modules. This end-to-end differentiable design is essential for enabling gradient flow from the planning loss all the way back to the perception module.

一個關鍵的設計選擇是 TrackFormer 消除了追蹤管線中通常使用的所有不可微分後處理，如推論時的匈牙利匹配或基於非極大值抑制的篩選。取而代之的是，偵測與追蹤之間的關聯透過注意力機制隱式學習。訓練期間使用二部圖匹配將真實標註物件分配給查詢，並以共享查詢匹配策略確保所有下游模組中的實例身份一致。此端到端可微分設計對於使梯度從規劃損失一路回傳至感知模組至關重要。

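Paragraph function: Technical detail stressing why end-to-end differentiability matters and how it is achieved.

Logical role: Answers to the planning-oriented philosophy: only when non-differentiable operations are removed can the planning signal propagate back to guide perception learning. The paragraph turns the design philosophy into a concrete engineering decision.

Argumentative technique / potential weakness: "Gradients flow from planning back to perception" is an attractive claim, but long-range gradient propagation can suffer from vanishing or unstable gradients in practice. The two-stage training strategy (train perception first, then fine-tune end-to-end) suggests that direct end-to-end training may be unstable.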

3.2 MapFormer — 建圖模組

MapFormer performs online BEV semantic map construction through panoptic segmentation of road elements. It uses sparse map queries, where each query represents a semantic class of road structure — such as lane dividers, road boundaries, and pedestrian crossings. These queries attend to the BEV features and produce per-class segmentation masks and embedding vectors that encode the local road topology. Unlike offline HD map approaches, MapFormer constructs the map representation on-the-fly from sensor data, removing the dependency on pre-built maps.

MapFormer 透過道路元素的全景分割執行線上 BEV 語意地圖建構。它使用稀疏地圖查詢，其中每個查詢代表一類道路結構的語意類別——如車道分隔線、道路邊界與行人穿越道。這些查詢對 BEV 特徵進行注意力運算，產生逐類別分割遮罩與編碼局部道路拓撲的嵌入向量。不同於離線高精地圖方法，MapFormer 從感測器資料即時建構地圖表示，消除了對預建地圖的依賴。

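Paragraph function: Module definition describing MapFormer's online mapping mechanism and its output format.

Logical role: MapFormer's outputs (map queries) meet TrackFormer's outputs (agent queries) inside MotionFormer, forming the basis of agent-map interaction.

Argumentative technique / potential weakness: Removing the dependency on pre-built HD maps is academically appealing, but online mapping may not match offline maps in accuracy and reliability. In industrial deployment, most systems still rely on HD maps as a safety redundancy.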

The map queries carry rich structural information about the driving environment that is critical for downstream motion forecasting. Agent behavior is heavily influenced by road geometry — vehicles follow lanes, respect boundaries, and yield at crossings. By passing map query embeddings into MotionFormer's cross-attention layers, the framework enables the motion predictor to reason about agent-map interactions explicitly, rather than relying on the motion model to implicitly learn road constraints from BEV features alone. This structured communication between mapping and prediction is a key differentiator from prior multi-task approaches.

地圖查詢攜帶關於駕駛環境的豐富結構資訊，對下游運動預測至關重要。智能體行為深受道路幾何影響——車輛沿車道行駛、遵守邊界、在路口讓行。透過將地圖查詢嵌入傳入 MotionFormer 的交叉注意力層，框架使運動預測器能顯式地推理智能體-地圖互動，而非僅仰賴運動模型從 BEV 特徵中隱式學習道路約束。建圖與預測之間的這種結構化通訊是與先前多任務方法的關鍵區別。

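Paragraph function: Cross-module link explaining how MapFormer's outputs empower MotionFormer.

Logical role: A concrete expression of the planning-oriented philosophy: mapping exists not only to produce a map but to make motion forecasting road-aware, which ultimately serves safer planning.

Argumentative technique / potential weakness: Everyday driving intuition (vehicles follow lanes) justifies the technical design in a way non-experts can follow. Real roads, however, contain many situations that break lane constraints (construction zones, emergency vehicles), and the handling of such edge cases remains to be verified.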

3.3 MotionFormer — 運動預測模組

MotionFormer is the central prediction module that forecasts multimodal future trajectories for all detected agents in a scene-centric manner. It receives agent queries from TrackFormer and map queries from MapFormer, and models three types of interactions through specialized attention mechanisms: agent-agent interaction (how agents influence each other's future motion), agent-map interaction (how road structure constrains trajectories), and agent-goal interaction (how intended destinations shape future paths). The output is a set of multimodal trajectory predictions with associated probability scores for each agent.

MotionFormer 是核心預測模組，以場景為中心的方式預測所有已偵測智能體的多模態未來軌跡。它接收來自 TrackFormer 的智能體查詢與來自 MapFormer 的地圖查詢，並透過專用的注意力機制建模三種互動：智能體-智能體互動（智能體如何影響彼此的未來運動）、智能體-地圖互動（道路結構如何約束軌跡），以及智能體-目標互動（預期目的地如何塑造未來路徑）。輸出為每個智能體的一組多模態軌跡預測及其對應的機率分數。

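Paragraph function: Module definition describing MotionFormer's three interaction mechanisms and its output format.

Logical role: MotionFormer is the key link in the middle of the architecture. It builds on perception (TrackFormer) and mapping (MapFormer) upstream and feeds occupancy prediction (OccFormer) and planning (Planner) downstream, so its prediction quality directly determines planning safety.

Argumentative technique / potential weakness: The three-way taxonomy (agent-agent, agent-map, agent-goal) is clearly structured and covers the main factors in motion forecasting. Scene-centric prediction becomes computationally expensive when there are many agents, which may limit real-time inference.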

A distinctive feature of MotionFormer is its motion query design, which integrates four types of positional knowledge through sinusoidal positional encoding and MLPs: (1) scene-level anchors that provide a coarse spatial prior for possible future locations; (2) agent-level anchors derived from each agent's current state; (3) the agent's current position as a reference point; and (4) predicted goal points that represent likely destinations. These four components are fused to form rich motion queries that encode both the spatial context and the intentional direction of each agent, enabling more accurate multimodal trajectory prediction.

MotionFormer 的一個顯著特色是其運動查詢設計，透過正弦位置編碼與多層感知器整合四種位置知識：(1) 場景層級錨點，為可能的未來位置提供粗略的空間先驗；(2) 智能體層級錨點，由每個智能體的當前狀態衍生；(3) 智能體的當前位置作為參考點；(4) 預測目標點，代表可能的目的地。這四個組件融合形成豐富的運動查詢，同時編碼每個智能體的空間脈絡與意圖方向，實現更準確的多模態軌跡預測。

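Paragraph function: Technical detail breaking down the four-part positional encoding of the motion query.

Logical role: Looks inside the motion query and shows a coarse-to-fine spatial reasoning order: scene anchors -> agent anchors -> current position -> goal point.

Argumentative technique / potential weakness: The four-part positional encoding reflects careful engineering. More components also mean more hyperparameter tuning and more ablations to run, and readers may question whether each component's marginal contribution is significant.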
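
The four-part positional term of the motion query can be sketched as follows. This is a simplified illustration, not the paper's code: the learned MLPs are replaced by the identity, fusion is assumed to be a plain sum, and the function names, the 8-channel width, and the temperature value are choices made only for this example.

```python
import math


def sinusoidal_encoding(point, dim=8, temperature=10000.0):
    """Encode a 2D point; each coordinate gets dim/2 channels of sin/cos pairs."""
    per_axis = dim // 2
    encoding = []
    for coord in point:
        for i in range(per_axis // 2):
            freq = temperature ** (2 * i / per_axis)
            encoding.append(math.sin(coord / freq))
            encoding.append(math.cos(coord / freq))
    return encoding


def motion_query_position(scene_anchor, agent_anchor, current_pos, goal_point, dim=8):
    """Fuse the four positional cues by summing their encodings channel-wise."""
    parts = [
        sinusoidal_encoding(p, dim)
        for p in (scene_anchor, agent_anchor, current_pos, goal_point)
    ]
    return [sum(channel) for channel in zip(*parts)]


query_pos = motion_query_position((12.0, 3.0), (10.0, 0.0), (2.0, 1.0), (11.5, 2.5))
print(len(query_pos))
```

In a full model each encoding would pass through its own learned MLP before fusion, so the four cues could be weighted differently rather than contributing equally as they do here.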

Unlike agent-centric approaches that independently predict each agent's trajectory without awareness of other agents' plans, MotionFormer adopts a scene-centric formulation where all agents' motion queries attend to each other through self-attention layers. This allows the model to capture coordinated behaviors — for example, when one vehicle yields, the model can predict that the other vehicle will proceed. The agent-map cross-attention further constrains predictions to be physically plausible by anchoring trajectories to the road structure encoded by MapFormer, effectively preventing predictions that violate lane boundaries or road geometry.

不同於獨立預測各智能體軌跡而不感知其他智能體計畫的以智能體為中心之方法，MotionFormer 採用場景為中心的方法，所有智能體的運動查詢透過自注意力層相互關注。這允許模型捕捉協調行為——例如，當一輛車讓行時，模型能預測另一輛車將通過。智能體-地圖交叉注意力進一步將預測約束為物理上合理的，透過將軌跡錨定至 MapFormer 編碼的道路結構，有效防止違反車道邊界或道路幾何的預測。

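Paragraph function: Argues for an advantage by contrasting scene-centric prediction with agent-centric prediction.

Logical role: Echoes the related-work critique that agent-centric methods ignore inter-agent coordination, and answers it directly with a technical device, self-attention.

Argumentative technique / potential weakness: The yielding scenario is an effective, intuitive example. Global scene-centric self-attention costs O(N^2) in the number of agents, which can become a bottleneck when agents are numerous, as at a crowded intersection. The authors do not discuss this scalability issue.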

3.4 OccFormer — 占用預測模組

OccFormer predicts multi-step future occupancy grids that represent the spatial regions likely to be occupied by agents at each future time step. Unlike conventional occupancy prediction methods that produce anonymous occupancy maps without distinguishing which agent occupies which cell, OccFormer incorporates a pixel-agent interaction mechanism that preserves agent identity within the occupancy representation. This is achieved by having BEV pixel features cross-attend to agent queries from MotionFormer, with occupancy-guided masks restricting the pixel-to-agent correspondence based on predicted motion trajectories.

OccFormer 預測多步驟未來占用格網，表示每個未來時間步中可能被智能體占據的空間區域。不同於產生匿名占用地圖而不區分哪個智能體占據哪個格位的傳統占用預測方法，OccFormer 結合了像素-智能體互動機制，在占用表示中保留智能體身份。這透過讓 BEV 像素特徵對來自 MotionFormer 的智能體查詢進行交叉注意力運算來實現，並以占用引導遮罩依據預測的運動軌跡限制像素對智能體的對應關係。

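Paragraph function: Module definition describing OccFormer's identity-preserving occupancy prediction.

Logical role: OccFormer bridges the gap between trajectory prediction and planning: trajectory prediction yields sparse future agent positions, whereas planning needs dense spatial occupancy information to avoid collisions.

Argumentative technique / potential weakness: Preserving agent identity is a clear improvement over conventional occupancy prediction because it lets the planner distinguish the threat posed by different agents. The design adds computational complexity, and in dense scenes it may produce conflicting, overlapping masks.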
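
The mask-restricted pixel-to-agent attention can be illustrated with a small sketch. It is an assumption-laden toy: the boolean `allowed` matrix stands in for the occupancy-guided mask, there are no learned projections, and pixels that no agent claims are simply given a zero vector.

```python
import math


def masked_pixel_agent_attention(pixel_feats, agent_feats, allowed):
    """Attend from pixels to agents; allowed[p][a] says pixel p may see agent a."""
    dim = len(agent_feats[0])
    out = []
    for p, feat in enumerate(pixel_feats):
        visible = [a for a in range(len(agent_feats)) if allowed[p][a]]
        if not visible:
            out.append([0.0] * dim)  # no agent is associated with this pixel
            continue
        scores = [
            sum(x * y for x, y in zip(feat, agent_feats[a])) / math.sqrt(dim)
            for a in visible
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([
            sum(e / total * agent_feats[a][j] for e, a in zip(exps, visible))
            for j in range(dim)
        ])
    return out


agent_feats = [[1.0, 0.0], [0.0, 1.0]]            # two agents, 2-dim features
pixel_feats = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
allowed = [[True, False], [False, True], [False, False]]
result = masked_pixel_agent_attention(pixel_feats, agent_feats, allowed)
print(result)
```

Because each pixel only mixes features from the agents its mask permits, the output keeps track of which agent is responsible for which cell.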

The occupancy prediction serves as a dense, spatiotemporal safety map for the downstream planner. While trajectory predictions from MotionFormer provide sparse future locations of agents, they may not fully capture the spatial extent of agents or account for prediction uncertainty. The occupancy grid complements trajectory prediction by providing a dense representation of future danger zones, enabling the planner to reason about collision risks in a more comprehensive manner. The multi-step nature of the prediction allows the planner to anticipate and avoid occupied regions across the entire planning horizon.

占用預測作為下游規劃器的密集時空安全地圖。雖然 MotionFormer 的軌跡預測提供了智能體的稀疏未來位置，但可能無法完全捕捉智能體的空間範圍或考量預測不確定性。占用格網透過提供未來危險區域的密集表示來補充軌跡預測，使規劃器能以更全面的方式推理碰撞風險。預測的多步驟特性允許規劃器在整個規劃時域內預見並避開被占據的區域。

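Paragraph function: Clarifies purpose, explaining why occupancy prediction is still needed when trajectory prediction is available.

Logical role: Answers the likely objection "why two kinds of prediction?" by arguing that sparse and dense representations are complementary. This directly serves planning safety.

Argumentative technique / potential weakness: The "dense safety map" analogy is intuitive and conveys the value of occupancy prediction well. If trajectory prediction were already accurate enough, however, the extra occupancy prediction could be redundant computation. Ablations need to show that combining the two really beats using either alone.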
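
How a planner might query a multi-step occupancy grid can be sketched in a few lines. The grid layout (one 2D list of 0/1 cells per future step), the `cell_size` parameter, and the function name are assumptions made for illustration.

```python
def first_collision_step(waypoints, occupancy, cell_size=1.0):
    """Return the first future step whose waypoint lands in an occupied cell."""
    for t, (x, y) in enumerate(waypoints):
        row, col = int(y // cell_size), int(x // cell_size)
        grid = occupancy[t]
        in_bounds = 0 <= row < len(grid) and 0 <= col < len(grid[0])
        if in_bounds and grid[row][col]:
            return t
    return None  # the trajectory stays clear over the whole horizon


empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
blocked = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]   # cell (row 1, col 2) is occupied
occupancy = [empty, empty, blocked]           # occupancy at steps 0, 1, 2

straight = [(0.5, 1.5), (1.5, 1.5), (2.5, 1.5)]
detour = [(0.5, 1.5), (1.5, 0.5), (2.5, 0.5)]
print(first_collision_step(straight, occupancy))
print(first_collision_step(detour, occupancy))
```

The check is indexed by time step, so a cell that is free now but occupied two steps ahead still blocks a trajectory that reaches it at that step.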

3.5 Planner — 規劃模組

The Planner module generates the final ego-vehicle trajectory by leveraging all upstream representations. It uses an ego-vehicle query that attends to the BEV features, agent queries from MotionFormer, and occupancy predictions from OccFormer through cross-attention layers. The planner outputs a sequence of future waypoints for the ego vehicle over a predefined planning horizon. Crucially, the planning module does not operate in isolation — it benefits from the rich, structured scene understanding accumulated through the preceding modules.

規劃模組利用所有上游表示生成最終的自車軌跡。它使用自車查詢，透過交叉注意力層對 BEV 特徵、來自 MotionFormer 的智能體查詢以及來自 OccFormer 的占用預測進行注意力運算。規劃器輸出自車在預定義規劃時域內的一系列未來路點。關鍵在於，規劃模組並非孤立運作——它受益於前置模組所累積的豐富且結構化的場景理解。

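Paragraph function: Module definition describing how the planner consumes the outputs of all upstream modules.

Logical role: As the terminal module of the pipeline, the Planner is the final beneficiary of the planning-oriented philosophy. The design motivations of all preceding modules converge here.

Argumentative technique / potential weakness: "Does not operate in isolation" underlines the value of the unified design, but fusing multiple information sources through cross-attention may cause information overload. Whether the model can reliably single out the most safety-critical information deserves examination.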

To address perception uncertainty and ensure kinematically feasible trajectories, the Planner incorporates a post-optimization step using Newton's method. The raw trajectory output from the neural network may violate kinematic constraints such as maximum jerk, curvature, and acceleration limits. The optimization smooths the planned trajectory by minimizing a cost function that balances trajectory smoothness, collision avoidance (guided by OccFormer's predictions), and proximity to the neural network's initial output. Additionally, the ground-truth trajectories used for training supervision are pre-smoothed to enforce kinematic feasibility, preventing the model from learning physically impossible motions.

為應對感知不確定性並確保運動學上可行的軌跡，規劃器納入了使用牛頓法的後最佳化步驟。神經網路的原始軌跡輸出可能違反運動學約束，如最大急動度、曲率與加速度限制。最佳化透過最小化一個平衡軌跡平滑度、碰撞迴避（由 OccFormer 預測引導）以及與神經網路初始輸出接近度的成本函數，對規劃軌跡進行平滑處理。此外，用於訓練監督的真實標註軌跡經過預平滑以強制運動學可行性，防止模型學習物理上不可能的運動。

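Paragraph function: Engineering detail describing how the post-optimization step ensures physically feasible trajectories.

Logical role: Acknowledges the limits of raw neural-network output, which may violate physical constraints, and uses classical optimization as a safety net. It is a pragmatic combination of learned and classical methods.

Argumentative technique / potential weakness: Newton-based post-optimization is a cautious engineering choice that shows attention to deployment needs. The step may also mask the quality of the network's own predictions: if post-optimization corrects too much, that amounts to admitting that the planning ability of end-to-end learning is limited.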
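
A one-dimensional toy version of the Newton refinement shows the idea. This is a sketch under strong simplifying assumptions, not the paper's cost function: a single lateral offset `y` is optimized, the cost has only a proximity term and a Gaussian obstacle penalty (no smoothness term), and the weights and `sigma` are chosen so the cost is convex.

```python
import math


def cost(y, y_net, obstacle, w_prox=1.0, w_obs=1.0, sigma=1.0):
    """Stay close to the network output y_net while avoiding the obstacle."""
    d = y - obstacle
    return w_prox * (y - y_net) ** 2 + w_obs * math.exp(-d * d / (2 * sigma ** 2))


def grad(y, y_net, obstacle, w_prox=1.0, w_obs=1.0, sigma=1.0):
    """First derivative of cost with respect to y."""
    d = y - obstacle
    g = math.exp(-d * d / (2 * sigma ** 2))
    return 2 * w_prox * (y - y_net) - w_obs * d / sigma ** 2 * g


def hess(y, y_net, obstacle, w_prox=1.0, w_obs=1.0, sigma=1.0):
    """Second derivative of cost with respect to y."""
    d = y - obstacle
    g = math.exp(-d * d / (2 * sigma ** 2))
    return 2 * w_prox + w_obs * g * (d * d / sigma ** 4 - 1 / sigma ** 2)


def newton_refine(y_net, obstacle, steps=20, tol=1e-10):
    """Refine the network's offset with Newton updates y <- y - f'(y) / f''(y)."""
    y = y_net
    for _ in range(steps):
        g = grad(y, y_net, obstacle)
        if abs(g) < tol:
            break
        y -= g / hess(y, y_net, obstacle)
    return y


refined = newton_refine(y_net=0.0, obstacle=0.5)
print(refined, cost(refined, 0.0, 0.5), cost(0.0, 0.0, 0.5))
```

With these default weights the second derivative is always at least 1, so every Newton step is well defined. The refined offset moves away from the obstacle while staying near the network's proposal, which is the trade-off the cost function encodes.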

UniAD adopts a two-stage training strategy. In the first stage, the perception modules (TrackFormer and MapFormer) are jointly trained for 6 epochs to establish reliable detection, tracking, and mapping capabilities. In the second stage, all five modules are trained end-to-end for 20 epochs, allowing gradients from the planning loss to flow through the entire pipeline. This curriculum-style training is motivated by the observation that training all modules from scratch simultaneously leads to unstable optimization, as the prediction and planning modules require reasonable perception inputs to learn meaningful patterns.

UniAD 採用兩階段訓練策略。第一階段，感知模組（TrackFormer 與 MapFormer）聯合訓練 6 個週期，以建立可靠的偵測、追蹤與建圖能力。第二階段，所有五個模組進行端到端訓練共 20 個週期，允許規劃損失的梯度流過整個管線。此課程式訓練源於觀察到同時從頭訓練所有模組會導致不穩定的最佳化——因為預測與規劃模組需要合理的感知輸入才能學習有意義的模式。

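Paragraph function: Training strategy, explaining the motivation and details of the two-stage curriculum.

Logical role: Candidly reveals the difficulty of end-to-end training: without pre-training perception, the full system does not converge stably. This is a practical remedy, and it also implies that the unified framework has not yet achieved truly single-shot end-to-end learning.

Argumentative technique / potential weakness: The two-stage strategy exposes a tension: the paper argues that a unified framework beats modular design, yet training itself still pre-trains perception in a modular way first. This weakens the claim of end-to-end superiority.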
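
The two-stage schedule can be captured as a small configuration. The epoch counts and module names come from the paragraph above, while the `Stage` type, the stage labels, and the helper function are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    epochs: int
    trainable: Tuple[str, ...]


SCHEDULE = (
    Stage("perception", 6, ("TrackFormer", "MapFormer")),
    Stage(
        "end_to_end",
        20,
        ("TrackFormer", "MapFormer", "MotionFormer", "OccFormer", "Planner"),
    ),
)


def modules_for_epoch(epoch):
    """Map a 0-based global epoch index to the modules trained in that epoch."""
    start = 0
    for stage in SCHEDULE:
        if epoch < start + stage.epochs:
            return stage.trainable
        start += stage.epochs
    raise ValueError("epoch beyond the end of the schedule")


print(modules_for_epoch(0))
print(modules_for_epoch(6))
```

Epochs 0 to 5 train only the perception modules, and epochs 6 to 25 train all five, for 26 epochs in total.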

4. Experiments — 實驗

All experiments are conducted on the nuScenes dataset, a large-scale autonomous driving benchmark containing 1000 driving scenes with 6 surround-view cameras, LiDAR, and comprehensive 3D annotations. UniAD uses only camera inputs (no LiDAR at inference), making it a vision-only approach. The framework is evaluated on five tasks simultaneously: 3D detection and tracking, online mapping, motion forecasting, occupancy prediction, and planning. Comparisons are made against both specialized single-task state-of-the-art methods and recent multi-task or end-to-end driving approaches including ST-P3, PnPNet, and various standalone baselines.

所有實驗在 nuScenes 資料集上進行，這是一個大規模自動駕駛基準，包含 1000 個駕駛場景，配備 6 個環景攝影機、光達及全面的三維標註。UniAD 僅使用攝影機輸入（推論時不使用光達），使其成為純視覺方法。框架在五項任務上同時進行評估：三維偵測與追蹤、線上建圖、運動預測、占用預測及規劃。比較對象包含專用的單任務最先進方法以及近期的多任務或端到端駕駛方法，包括 ST-P3、PnPNet 及各種獨立基準線。

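Paragraph function: Experimental setup covering the dataset, input modality, and range of tasks evaluated.

Logical role: Establishes the breadth of the evaluation: five tasks are assessed together using vision input only, which demonstrates the method's scope and practicality.

Argumentative technique / potential weakness: The vision-only positioning highlights how challenging the setting is, but using nuScenes as the sole benchmark may limit how far the conclusions generalize. Performance across geographic regions and weather conditions is not verified.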

For tracking, UniAD achieves 0.359 AMOTA on the nuScenes validation set, substantially outperforming end-to-end baselines. The model produces 906 ID switches, demonstrating competitive identity preservation. For online mapping, the panoptic segmentation quality shows that MapFormer effectively learns road structure from camera-only inputs. Notably, the joint training of tracking and mapping with downstream tasks does not degrade their individual performance — in fact, the end-to-end fine-tuning stage slightly improves perception metrics, suggesting beneficial gradient signals from prediction and planning that help perception learn more task-relevant features.

在追蹤方面，UniAD 在 nuScenes 驗證集上達到 0.359 AMOTA，大幅超越端到端基準線。模型產生 906 次身份切換，展現具競爭力的身份保持能力。在線上建圖方面，全景分割品質顯示 MapFormer 有效地從純攝影機輸入中學習道路結構。值得注意的是，追蹤與建圖與下游任務的聯合訓練未降低其個別效能——事實上，端到端微調階段略微提升了感知指標，顯示來自預測與規劃的有益梯度信號幫助感知學習更具任務相關性的特徵。

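Paragraph function: Data presentation reporting quantitative results and key findings for the perception tasks.

Logical role: Rebuts the concern that joint training degrades individual tasks (negative transfer), using empirical data to support the core claim that planning-oriented training can in turn improve perception.

Argumentative technique / potential weakness: That end-to-end fine-tuning slightly improves perception metrics is a strong finding that directly supports the value of the unified framework. The word "slightly" suggests the gain is small, and the authors give no specific numbers for the difference.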

The most striking results come from motion forecasting and planning. UniAD achieves 0.71m minADE and 1.02m minFDE for motion prediction, a 38.3% reduction in minADE relative to PnPNet. For the critical planning task, UniAD delivers 1.03m average L2 displacement error and a 0.31% collision rate, a 51.2% reduction in L2 error compared to ST-P3. These results indicate that the planning-oriented design philosophy yields substantial and measurable benefits: the unified framework does not merely match specialized methods but surpasses them by a wide margin, particularly on the downstream tasks that directly affect driving safety.

最引人注目的結果來自運動預測與規劃。UniAD 在運動預測上達到 0.71m minADE 與 1.02m minFDE，相較 PnPNet 提升 38.3%。在關鍵的規劃任務上，UniAD 達成 1.03m 平均 L2 位移誤差與 0.31% 碰撞率，相較 ST-P3 實現 51.2% 的 L2 誤差降低。這些結果表明，以規劃為導向的設計哲學帶來了實質且可量測的效益——統一框架不僅匹配專用方法，更顯著超越它們，特別是在直接影響駕駛安全的下游任務上。

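Paragraph function: Headline results, presenting the gains in motion forecasting and planning with concrete numbers.

Logical role: Empirical climax of the argument: the 51.2% reduction in planning error directly validates the planning-oriented philosophy. The numbers persuade far more than the architectural description does.

Argumentative technique / potential weakness: The percentage improvements (38.3%, 51.2%) are visually striking, but the strength of the baselines (PnPNet, ST-P3) matters: if the baselines are weak, the significance of a large gain must be reassessed. In addition, a 0.31% collision rate is low but could still pose a safety risk in real driving.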
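
For readers unfamiliar with the forecasting metrics quoted above, the sketch below computes minADE and minFDE for one agent from a set of candidate trajectories. The trajectories are made-up toy values, and the minimum is taken independently for each metric, which is one common convention and an assumption here.

```python
import math


def displacement_errors(pred, gt):
    """Per-step Euclidean distance between a predicted and a true trajectory."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def min_ade_fde(modes, gt):
    """Return (minADE, minFDE) over a set of candidate trajectories."""
    ades, fdes = [], []
    for traj in modes:
        errs = displacement_errors(traj, gt)
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    return min(ades), min(fdes)


ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)],  # accurate early, off by 1 m at the end
    [(1.0, 1.0), (2.0, 1.0), (3.0, 0.0)],  # off by 1 m early, exact at the end
]
min_ade, min_fde = min_ade_fde(modes, ground_truth)
print(min_ade, min_fde)
```

In this toy case the two metrics are achieved by different modes, which shows why both are usually reported together.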

4.2 Ablation Studies — 消融研究

Comprehensive ablation studies validate the contribution of each component. Removing the query-based inter-task connections and replacing them with simple feature concatenation leads to significant performance drops across all tasks, confirming that structured query communication is superior to generic feature sharing. Disabling individual modules (e.g., removing MapFormer from the pipeline) degrades not only the removed task but also downstream tasks, demonstrating the interdependency captured by the unified design. The motion prediction module shows the most sensitivity to upstream quality — removing tracking input from MotionFormer increases minADE by 23%.

全面的消融研究驗證了每個組件的貢獻。移除基於查詢的跨任務連接並以簡單特徵串接取代，導致所有任務效能顯著下降，確認結構化查詢通訊優於泛用特徵共享。停用個別模組（如從管線中移除 MapFormer）不僅降低被移除任務的效能，也影響下游任務，展示了統一設計所捕捉的相互依賴性。運動預測模組對上游品質最為敏感——從 MotionFormer 中移除追蹤輸入使 minADE 增加 23%。

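Paragraph function: Systematic validation, confirming the necessity of each component one by one through ablations.

Logical role: The ablations act as counterfactual arguments: showing how performance drops when X is removed proves the value of X. Three levels of ablation (connection type, module presence, input source) validate the design from several angles.

Argumentative technique / potential weakness: The ablation design is broad and logically rigorous. Readers may still want comparisons against more alternative designs, such as different task orderings or attention mechanisms, rather than only binary with-or-without ablations.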

A particularly revealing ablation examines the impact of occupancy prediction on planning. When OccFormer is removed, the planning collision rate increases from 0.31% to 0.78%, more than doubling the risk of collision. This confirms the hypothesis that dense occupancy prediction provides critical safety information that sparse trajectory prediction alone cannot offer. Furthermore, the non-linear optimization post-processing reduces the collision rate by an additional 15% and significantly improves trajectory smoothness, validating the importance of combining learned predictions with physics-based refinement.

一項特別具有揭示性的消融檢驗了占用預測對規劃的影響。移除 OccFormer 後，規劃碰撞率從 0.31% 上升至 0.78%，碰撞風險增加超過一倍。這證實了密集占用預測提供了稀疏軌跡預測單獨無法提供的關鍵安全資訊的假設。此外，非線性最佳化後處理將碰撞率額外降低 15%，並顯著改善軌跡平滑度，驗證了結合學習式預測與基於物理之精煉的重要性。

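Paragraph function: Safety validation, quantifying the safety contribution of occupancy prediction and post-optimization with the collision-rate metric.

Logical role: Responds directly to the design question "why is OccFormer needed?", and the more-than-doubled collision rate is the most persuasive answer.

Argumentative technique / potential weakness: The contrast between 0.31% and 0.78% is powerful, but these numbers come from open-loop evaluation on nuScenes. In closed-loop evaluation, small differences in collision rate can grow into very different safety outcomes.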
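
The "more than doubling" statement can be checked with a short calculation that uses only the two collision rates quoted above. The helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Fractional change from before to after (0.5 means a 50% increase)."""
    return (after - before) / before


with_occformer = 0.31      # collision rate in percent, full model
without_occformer = 0.78   # collision rate in percent, OccFormer removed

ratio = without_occformer / with_occformer
increase = relative_change(with_occformer, without_occformer)
print(f"ratio: {ratio:.2f}x, relative increase: {increase:.0%}")
```

The ratio is roughly 2.5, so removing OccFormer raises the reported collision rate by about one and a half times its original value.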

The authors also explore alternative task orderings within the DAG structure. Placing motion forecasting before mapping, or removing the explicit dependency between tracking and motion prediction, consistently leads to worse planning outcomes. This supports the claim that the chosen task ordering — perception first, then prediction, then planning — reflects genuine causal dependencies in the driving domain, and that violating this natural ordering disrupts the information flow needed for safe planning. The results suggest that task topology design is as important as individual module architecture.

作者還探索了 DAG 結構中的替代任務排列順序。將運動預測置於建圖之前，或移除追蹤與運動預測之間的顯式依賴，一致地導致更差的規劃結果。這支持了以下主張：所選的任務排列——先感知、再預測、後規劃——反映了駕駛領域中真實的因果依賴關係，而違反此自然順序會擾亂安全規劃所需的資訊流。結果顯示，任務拓撲設計與個別模組架構同等重要。

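Paragraph function: Topology validation, showing experimentally that the task ordering is not arbitrary.

Logical role: Adds theoretical depth: the claim is not just that these modules are useful, but that their ordering follows a causal logic. This gives UniAD's design choices stronger theoretical support.

Argumentative technique / potential weakness: The task-order ablation is a clever experiment that answers "why this DAG and not another?". The claim of causal dependency may be too strong, since the observed performance differences could stem from training dynamics rather than true causal structure.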

5. Conclusion — 結論

This paper presents UniAD, a unified autonomous driving framework that systematically connects perception, prediction, and planning through query-based transformer interfaces. The core contribution is the planning-oriented design philosophy, which ensures that every intermediate task is optimized not in isolation but in service of the final planning objective. Through extensive experiments on nuScenes, the authors demonstrate that this approach achieves state-of-the-art results across all five evaluated tasks and substantially improves planning accuracy and safety compared to both modular and end-to-end alternatives.

本文提出 UniAD——一個透過基於查詢的 Transformer 介面系統性地連接感知、預測與規劃的統一自動駕駛框架。核心貢獻是以規劃為導向的設計哲學，確保每項中間任務不是孤立地最佳化，而是服務於最終的規劃目標。透過在 nuScenes 上的大量實驗，作者證明此方法在所有五項評估任務上均達到最先進的結果，且相較於模組化與端到端替代方案，顯著提升了規劃準確度與安全性。

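Paragraph function: Summarizes the core contribution, restating the central claim and main results in one paragraph.

Logical role: The first paragraph of the conclusion mirrors the abstract and closes the argument: from the problem (the modular vs. end-to-end dilemma) to the solution (a unified framework built on query interfaces) to the validation (state of the art on five tasks).

Argumentative technique / potential weakness: The conclusion is confident and concise but does not discuss limitations such as computational cost, inference latency, and performance in harder scenarios (adverse weather, highway merging). For a best-paper winner, readers expect fuller reflection.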

The results reveal that task coordination through structured query interfaces outperforms both isolated task optimization and naive multi-task learning with shared backbones. The planning-oriented framework demonstrates that designing upstream tasks with the downstream objective in mind leads to representations that are inherently more useful for the final goal. Looking ahead, UniAD opens several promising directions: incorporating temporal BEV features for longer-horizon prediction, integrating additional sensing modalities like LiDAR, and extending the framework to handle more complex urban scenarios with diverse agent types. The authors believe this work takes a meaningful step toward truly integrated autonomous driving systems.

結果揭示，透過結構化查詢介面的任務協調優於孤立的任務最佳化以及使用共享骨幹的單純多任務學習。以規劃為導向的框架展示了以下游目標為念來設計上游任務，能產生對最終目標本質上更有用的表示。展望未來，UniAD 開啟了數個有前景的方向：納入時序 BEV 特徵以進行更長時域的預測、整合如光達等額外感測模態，以及擴展框架以處理包含多樣智能體類型的更複雜都市場景。作者相信這項工作朝向真正整合的自動駕駛系統邁出了有意義的一步。

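Paragraph function: Theoretical distillation and outlook, abstracting a general principle from the specific results and pointing to future directions.

Logical role: Lifts UniAD's specific contribution to a higher-level insight (design upstream tasks with the downstream goal in mind) while using the future directions to acknowledge current limitations.

Argumentative technique / potential weakness: The three proposed directions (temporal features, multiple modalities, complex scenes) are exactly the main limitations of the current method, so the outlook admits the shortcomings indirectly. The phrase "a meaningful step" stays modest, though readers of a best-paper conclusion may expect a bolder research roadmap.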

In summary, UniAD validates a fundamental insight: in complex, safety-critical systems like autonomous driving, the whole is greater than the sum of its parts. Optimizing tasks jointly through principled inter-task communication yields compound benefits that cannot be achieved by improving individual components in isolation. The query-based transformer architecture provides an elegant and scalable mechanism for implementing this vision, and the consistent improvements across all metrics demonstrate its effectiveness. As the field moves toward real-world deployment, frameworks that unify perception, prediction, and planning under a coherent objective will become increasingly essential.

總而言之，UniAD 驗證了一個基本洞見：在如自動駕駛般複雜且攸關安全的系統中，整體大於部分之和。透過有原則的跨任務通訊進行聯合最佳化，所產生的複合效益無法透過孤立地改進個別組件來實現。基於查詢的 Transformer 架構為實現此願景提供了優雅且可擴展的機制，而所有指標上的一致改進則證明了其有效性。隨著該領域邁向真實世界部署，在連貫目標下統一感知、預測與規劃的框架將變得愈加不可或缺。

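Paragraph function: Philosophical close, summarizing the paper's academic significance from a systems-thinking perspective.

Logical role: The final paragraph raises the technical paper to the level of methodology: "the whole is greater than the sum of its parts" is both a summary of UniAD and a statement about the direction of autonomous driving research.

Argumentative technique / potential weakness: Ending on an aphorism is rhetorically strong. The phrase "as the field moves toward real-world deployment" may be too optimistic: a large gap remains between UniAD's open-loop evaluation and closed-loop deployment, and engineering challenges such as inference speed and compute resources are not adequately discussed.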

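Argument structure overview

Problem: modular pipelines accumulate errors; end-to-end learning lacks interpretability
→ Thesis: planning-oriented design; a unified query interface connects all tasks
→ Evidence: state of the art on five nuScenes tasks; planning L2 error reduced by 51.2%
→ Rebuttal: ablations rule out alternative designs; task-order experiments support the causal structure
→ Conclusion: the whole is greater than the sum of its parts; a unified framework is necessary for deployment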

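Authors' core claim (one sentence)

In autonomous driving systems, unifying perception, prediction, and planning in a directed acyclic graph through query-based transformer interfaces, and optimizing them jointly with planning as the final objective, yields compound benefits far beyond those of isolated optimization or simple multi-task learning.

Strongest part of the argument

Breadth and persuasiveness of the ablations: the paper not only shows that each module is useful but also uses task-order ablations to support the causal soundness of the DAG. The more-than-doubled collision rate when occupancy prediction is removed, and the comparison of query interfaces against feature concatenation, provide counterfactual evidence at several levels. The 51.2% reduction in planning error and the 38.3% improvement in prediction give the core claim strong quantitative support.

Weakest part of the argument

Evaluation scope and the gap to deployment: all experiments are open-loop evaluations on nuScenes, with no closed-loop simulation or real-vehicle testing. The two-stage training strategy suggests that truly end-to-end training is not yet stable, which sits uneasily with the paper's "unified" narrative. In addition, the compute latency of running five modules in sequence, robustness in extreme weather or dense traffic, and the impact of omitting LiDAR are not adequately explored.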