HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

Abstract

Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. This paper introduces Hermes, which integrates 3D scene understanding and future scene generation in a unified framework. It uses a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships. It also introduces world queries, which incorporate world knowledge into BEV features via causal attention in a Large Language Model (LLM). In the reported results, Hermes reduces generation error by 32.4% and improves understanding metrics such as CIDEr by 8.0%.

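Paragraph function: Overview of the whole paper. It identifies the gap in existing driving world models (generation only, no understanding) and presents the unified Hermes framework as the response.

Logical role: The split between understanding and generation is the foundation of the paper's argument. BEV serves as the bridging unified representation and world queries serve as the knowledge-injection mechanism; together they form the complete technical narrative.

Argumentative technique / potential weaknesses: The 32.4% error reduction and the 8.0% CIDEr gain provide strong quantitative support. The claim of "unification", however, still requires evidence that the two tasks genuinely benefit each other rather than being simply stacked as multi-task learning.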

1. Introduction

Current Driving World Models excel at predicting how the environment evolves but cannot interpret it: they cannot describe environments, answer questions, or provide contextual information. Conversely, vision-language models are strong at driving scene understanding but cannot predict scene evolution. This gap motivates the core research question: "how can world knowledge and future scene evolutions be seamlessly integrated into a unified world model?"

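Paragraph function: Defines the core problem. The complementary shortcomings of the two model families establish the need for a unified framework.

Logical role: A symmetric structure ("A can do what B cannot, and B can do what A cannot") builds the complementarity argument, so that a unified framework follows as a natural conclusion.

Argumentative technique / potential weaknesses: The complementarity framing is highly persuasive, but the paper does not yet fully justify why the two capabilities must be unified in one model rather than deployed separately and chained together. The latter may offer more engineering flexibility.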

Two key challenges are identified. First, LLMs typically face token length limitations, especially in autonomous driving, where multiple surrounding views must be processed. The solution uses a BEV representation, which compresses the surrounding views into a unified latent space. Second, the straightforward approach of sharing BEV features between separate understanding and generation models fails to exploit potential interactions between the tasks. The proposed solution uses world queries that are initialized from the raw BEV features and enriched with world knowledge through causal attention in the LLM.

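Paragraph function: Challenge analysis. Systematically lists the technical obstacles facing a unified framework and their solutions.

Logical role: The pairing of challenges with solutions grounds each technical design decision. BEV addresses spatial compression; world queries address task interaction.

Argumentative technique / potential weaknesses: BEV compression is effective but loses fine-grained information along the vertical axis. The effectiveness of injecting knowledge into world queries through causal attention needs to be quantified in ablation studies.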
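
The token-length challenge above can be made concrete with a little arithmetic. The sketch below assumes that each BEV cell becomes one LLM token and that down-sampling shrinks each side of the grid by the same factor. All sizes (a 200 x 200 grid, a factor of 4, 6 cameras, 1024 tokens per view) are hypothetical values chosen only for illustration and are not taken from the paper.

```python
def bev_token_count(height: int, width: int, downsample: int = 1) -> int:
    """Tokens produced when every BEV cell (after down-sampling) is one token."""
    if downsample < 1 or height % downsample or width % downsample:
        raise ValueError("grid size must be divisible by the down-sampling factor")
    return (height // downsample) * (width // downsample)


def multiview_token_count(num_views: int, tokens_per_view: int) -> int:
    """Tokens needed if each camera view were passed to the LLM separately."""
    return num_views * tokens_per_view


# Hypothetical sizes, for illustration only.
raw_bev = bev_token_count(200, 200)            # full-resolution BEV grid
compressed_bev = bev_token_count(200, 200, 4)  # each side reduced by 4
per_view = multiview_token_count(6, 1024)      # six cameras, separate tokens

print(raw_bev, compressed_bev, per_view)  # 40000 2500 6144
```

Under these assumed sizes, the uncompressed grid alone reaches the "tens of thousands of tokens" regime, while the compressed grid stays within a budget that a small LLM can handle.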

2. Related Work

Research on Driving World Models focuses primarily on generation, in both 2D and 3D. GAIA-1 introduced learned simulators based on autoregressive models. Recent work leverages large-scale data and powerful pre-trained models, "significantly enhancing generation quality regarding consistency, resolution, and controllability". For 3D spatial information, OccWorld focuses on future occupancy generation, and ViDAR predicts future point clouds from images through self-supervision. The critical limitation of these methods is that "they overlook the explicit understanding capacity of the driving environment".

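Paragraph function: Literature review. Traces the development of driving world models and highlights their shared blind spot: generation without understanding.

Logical role: A narrative of "progress, but incomplete" positions the whole field and leaves room for Hermes's unified approach to differentiate itself.

Argumentative technique / potential weaknesses: "Understanding" is defined as explicit language output (captioning, question answering). This excludes implicit understanding, such as the implicit scene perception in end-to-end driving, and may define the term too narrowly.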

3. Method

3.1 World Tokenizer and Render

Multi-view images at time t are passed through a CLIP image encoder and a single-frame BEVFormer. The resulting BEV feature captures the semantic and geometric information of the scene. "Such a feature is too large for the LLM, often containing tens of thousands of tokens." A down-sampling block therefore reduces the feature by a factor of two. For point cloud generation, the compressed BEV features are up-sampled, reshaped to add an extra height dimension, and then processed by 3D convolutions to reconstruct a volumetric feature map. Differentiable volume rendering models the environment as an implicit signed distance function (SDF) field and computes a depth for each ray.

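Paragraph function: Technical infrastructure. Describes the full pipeline from multi-view images to the BEV representation to point clouds.

Logical role: This passage establishes Hermes's input-representation-output pipeline. As the intermediate representation, BEV serves both understanding and generation and is the cornerstone of the unified architecture.

Argumentative technique / potential weaknesses: Down-sampling the BEV makes LLM processing feasible, but 2x compression may lose spatial detail. Using an SDF field for volume rendering is a mature technique, but how well height information is recovered when reconstructing a 3D volume from BEV remains questionable.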
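
To illustrate how a per-ray depth can be rendered from an SDF field, here is a minimal sketch for a single ray. It assumes a NeuS-style discretization, in which the opacity of each interval is derived from a sigmoid of the SDF values at its two end points. The text above only states that an implicit SDF field is rendered to a depth per ray, so the exact formula, the `sharpness` parameter, and the function names are assumptions rather than the paper's implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def render_depth(ts, sdfs, sharpness=10.0, eps=1e-8):
    """Expected depth along one ray from SDF samples (NeuS-style, assumed).

    ts:   increasing sample distances along the ray
    sdfs: signed distance to the nearest surface at each sample
    Returns (depth, weight_sum); weight_sum near 1 means the ray hit a surface.
    """
    if len(ts) != len(sdfs) or len(ts) < 2:
        raise ValueError("need at least two matching samples per ray")
    transmittance = 1.0  # fraction of the ray not yet absorbed
    depth = 0.0
    weight_sum = 0.0
    for i in range(len(ts) - 1):
        phi_now = sigmoid(sharpness * sdfs[i])
        phi_next = sigmoid(sharpness * sdfs[i + 1])
        # Opacity is positive only where the SDF decreases, i.e. the ray
        # is moving toward or through a surface.
        alpha = max((phi_now - phi_next) / (phi_now + eps), 0.0)
        weight = transmittance * alpha
        depth += weight * 0.5 * (ts[i] + ts[i + 1])  # interval midpoint
        weight_sum += weight
        transmittance *= 1.0 - alpha
    return depth, weight_sum


# Toy ray: a flat wall 5 units away, so the SDF along the ray is 5 - t.
ts = [0.5 * k for k in range(21)]
sdfs = [5.0 - t for t in ts]
depth, weight_sum = render_depth(ts, sdfs)
print(round(depth, 2), round(weight_sum, 2))
```

For this toy ray the rendered depth comes out close to the true wall distance of 5. Because every step is differentiable, an L1 loss between rendered and LiDAR depth can supervise the volumetric features.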

3.2 Unification

The LLM (InternVL2-2B) processes the flattened BEV features to interpret driving scenarios according to user instructions. For understanding, the LLM responds to queries through auto-regressive next-token prediction. For generation, the paper proposes world queries: groups of n queries initialized by max pooling over the BEV features and conditioned on an ego-motion encoding and position embeddings. After LLM processing, a "current to future link" module uses cross-attention layers to inject world knowledge into the future BEV features. This design is intended to make generated scenes context-aware and enriched with the world knowledge that the LLM acquires through understanding.

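Paragraph function: Core innovation. Describes how world queries absorb world knowledge inside the LLM and pass it to the generation task.

Logical role: World queries are the key bridge between understanding and generation: they are enriched by textual understanding through the LLM's causal attention, then pass that knowledge to future BEV generation through cross-attention. This is the paper's core technical contribution.

Argumentative technique / potential weaknesses: Max pooling initialization captures the peak responses of the BEV but may miss spatial information that is diffuse yet important. The number of queries, n=4, was chosen by ablation, but whether so few queries can adequately encode the future evolution of complex scenes deserves careful evaluation.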
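
The world-query initialization can be sketched as follows. The text specifies max pooling over BEV features plus conditioning on ego motion and position, but not the pooling layout or how the conditioning is applied. This sketch therefore assumes adaptive max pooling over contiguous groups of flattened BEV tokens and conditioning by element-wise addition. The function names `init_world_queries` and `condition_queries` are hypothetical.

```python
def init_world_queries(bev_tokens, n):
    """Pool flattened BEV tokens into n queries by channel-wise max.

    bev_tokens: list of tokens, each a list of C floats
    Tokens are split into n contiguous groups (adaptive max pooling);
    each group yields one query holding the per-channel maximum.
    """
    total = len(bev_tokens)
    if not 1 <= n <= total:
        raise ValueError("n must be between 1 and the number of tokens")
    queries = []
    for g in range(n):
        start = (g * total) // n
        end = ((g + 1) * total) // n
        group = bev_tokens[start:end]
        # zip(*group) iterates channel by channel across the group.
        queries.append([max(channel) for channel in zip(*group)])
    return queries


def condition_queries(queries, ego_embedding, position_embeddings):
    """Add an ego-motion embedding and a per-query position embedding.

    Element-wise addition is an assumption; the source only says the
    queries are conditioned on these two signals.
    """
    return [
        [q + e + p for q, e, p in zip(query, ego_embedding, pos)]
        for query, pos in zip(queries, position_embeddings)
    ]


# Toy example: 4 BEV tokens with 2 channels, pooled into n = 2 queries.
bev_tokens = [[1.0, 0.0], [0.5, 2.0], [3.0, -1.0], [0.0, 4.0]]
queries = init_world_queries(bev_tokens, n=2)
conditioned = condition_queries(queries, [1.0, 1.0], [[0.0, 0.0], [10.0, 10.0]])
print(queries)      # [[1.0, 2.0], [3.0, 4.0]]
print(conditioned)  # [[2.0, 3.0], [14.0, 15.0]]
```

Each query keeps only the strongest response per channel within its group, which is the trade-off raised in the note above: peak features survive, while diffuse evidence spread across many cells is discarded.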

Training is structured into three stages. Stage 1 trains the World Tokenizer and Render to convert current images into point clouds. Stage 2 establishes BEV-text alignment using caption data, then refines the model with LoRA on OmniDrive scene descriptions. Stage 3 introduces the future generation modules and unifies understanding and generation using nuScenes keyframes and OmniDrive conversations. The total loss combines an auto-regressive language modeling loss with an L1 depth supervision loss that is weighted per frame.

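Paragraph function: Training strategy. Builds up the unified capability progressively, stage by stage.

Logical role: The three-stage design follows a progressive logic (infrastructure -> language alignment -> unified fine-tuning) that reduces the optimization difficulty of direct end-to-end training.

Argumentative technique / potential weaknesses: Staged training is established engineering practice, but whether the unified training in Stage 3 causes forgetting of the capabilities learned in the first two stages should be verified by ablation. The use of LoRA limits the degrees of freedom available for fine-tuning the LLM.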
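
The combined objective described above can be sketched as a language loss plus a weighted sum of per-frame L1 depth losses. The sketch assumes the L1 term is a mean absolute error over rays and that the language loss is already computed elsewhere. The weight values and function names are illustrative only, since the actual frame-wise weights are not given in the text.

```python
def l1_depth_loss(pred_depths, gt_depths):
    """Mean absolute error between predicted and ground-truth ray depths."""
    if len(pred_depths) != len(gt_depths) or not pred_depths:
        raise ValueError("depth lists must be non-empty and the same length")
    return sum(abs(p - g) for p, g in zip(pred_depths, gt_depths)) / len(pred_depths)


def total_loss(language_loss, pred_frames, gt_frames, frame_weights):
    """Language modeling loss plus frame-weighted L1 depth losses.

    pred_frames / gt_frames: one list of ray depths per predicted frame
    frame_weights: one scalar weight per frame (hypothetical values)
    """
    if not (len(pred_frames) == len(gt_frames) == len(frame_weights)):
        raise ValueError("need one weight and one target per frame")
    depth_term = sum(
        w * l1_depth_loss(pred, gt)
        for w, pred, gt in zip(frame_weights, pred_frames, gt_frames)
    )
    return language_loss + depth_term


# Toy example: two frames with two rays each.
pred_frames = [[1.0, 2.0], [3.0, 5.0]]
gt_frames = [[1.0, 3.0], [2.0, 3.0]]
loss = total_loss(2.0, pred_frames, gt_frames, frame_weights=[1.0, 0.5])
print(loss)  # 2.0 + 1.0 * 0.5 + 0.5 * 1.5 = 3.25
```

A per-frame weight lets the training emphasize or de-emphasize particular prediction horizons, for example to keep harder, more distant frames from dominating the gradient.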

4. Experiments

Experiments are conducted on the nuScenes and OmniDrive-nuScenes datasets. For future point cloud generation, Hermes reduces the Chamfer Distance on 3-second point clouds by approximately 32% compared with ViDAR. Notably, ViDAR uses a 3-second history horizon and carefully designed latent rendering, whereas Hermes uses only the current multi-view images and a simple volumetric representation. For understanding, Hermes achieves highly competitive caption quality, outperforming OmniDrive by 8% on the CIDEr metric despite using only a subset of OmniDrive's training data.

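Paragraph function: Core experimental results. Data along two dimensions demonstrate the effectiveness of the unified framework.

Logical role: The 32% generation improvement and the 8% understanding gain each validate the framework's competitiveness on one of the two subtasks. Favorable comparisons against more complex baselines (ViDAR, OmniDrive) reinforce the argument that simple unification beats complex separation.

Argumentative technique / potential weaknesses: The comparison with ViDAR is made under unequal conditions (a single frame versus 3 seconds of history), which makes the 32% improvement more striking but also makes the direct comparison not entirely fair. The 8% CIDEr gain is especially impressive given the smaller amount of training data.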
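
Since the generation results are reported as Chamfer Distance and as a relative reduction, a small sketch of both computations may help. Several Chamfer Distance conventions exist (squared or unsquared distances, summed or averaged directions). The version below uses the sum of the two mean squared nearest-neighbor distances, which is an assumption and may not match the benchmark's exact definition. The point clouds and scores are toy values, not results from the paper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer Distance between two point clouds (brute force)."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")

    def one_way(src, dst):
        # Average squared distance from each source point to its nearest
        # neighbor in the destination cloud.
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


def relative_reduction(baseline, ours):
    """Fractional error reduction relative to a baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# Toy clouds: the second point is displaced by one unit along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, ground_truth)
print(cd)                            # 0.5 + 0.5 = 1.0
print(relative_reduction(2.0, 1.5))  # 0.25, i.e. a 25% reduction
```

A relative figure such as "approximately 32%" depends on the baseline score, so it should be read together with the absolute Chamfer Distance values and with the input conditions of each method.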

Ablation studies examine the world query design. Varying n from 1 to 16 shows that world queries do not harm text understanding quality, while increasing n beyond 4 degrades performance because of redundant information. Max pooling initialization reduces the Chamfer Distance by 0.03 relative to alternative pooling strategies. For the interaction between understanding and generation, the unified approach outperforms "separated unification" in generation results, as the latter fails to exploit potential interactions between the tasks.

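Paragraph function: Design validation. Ablation studies confirm that the key design decisions are sound.

Logical role: The ablation result that unified beats separated is the most critical validation in the paper: it shows that integrating understanding and generation brings mutual benefit rather than merely the cost of multi-task learning.

Argumentative technique / potential weaknesses: The gap between unified and separated (0.03 Chamfer Distance and a marginal difference in understanding) is relatively small, so the degree of mutual benefit may be less significant than expected. The choice of n=4 appears empirical rather than principled.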

5. Conclusion

This paper introduces Hermes, "a simple yet effective unified Driving World Model that integrates 3D scene understanding and future scene generation within a single framework". By leveraging a BEV representation and world queries enriched through a large language model, the approach bridges the gap between understanding and generation. Extensive experiments validate its effectiveness, with "significant improvements in future scene prediction accuracy and understanding metrics, surpassing state-of-the-art methods". Future work will investigate multi-modal unified world models that support both multi-modal input and multi-modal output.

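Paragraph function: Summarizes the paper. Restates the value proposition of the unified framework and looks ahead to multi-modal directions.

Logical role: The conclusion's self-description as "simple yet effective" echoes the abstract and emphasizes the elegance of the design. The multi-modal outlook implies that the current framework is only a first step toward a larger vision.

Argumentative technique / potential weaknesses: Limitations are not adequately discussed, for example that only point clouds rather than images are predicted, that perception and planning tasks are not explored, and that performance degrades in complex scenes (sharp turns, occlusion, nighttime). These are mentioned in the main text but omitted from the conclusion.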

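Argument structure overview

Problem
Driving world models
can only generate;
they cannot understand scenes

→

Thesis
BEV + world queries
unify understanding and generation
with mutual benefit

→

Evidence
Generation error down 32.4%
CIDEr up 8.0%
Unified beats separated

→

Rebuttal
Three-stage training
Progressive construction
Lower optimization difficulty

→

Conclusion
Unified world models
are the direction for scene modeling
in autonomous driving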

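Authors' core claim (in one sentence)

With a BEV representation and a world-query mechanism inside a large language model, 3D understanding and future generation of driving scenes can be unified within a single framework, and integrating the two yields mutually beneficial performance gains.

Strongest point of the argument

Design of world queries as a knowledge bridge: world queries absorb the world knowledge produced by scene understanding through the LLM's causal attention, then pass it to the generation module through cross-attention. With this design, understanding and generation are no longer independent multi-task objectives but a unified process in which each genuinely reinforces the other. The ablation result that unified beats separated provides empirical support.

Weakest point of the argument

Limited benefit from unification: the ablation studies show a relatively small gap between the unified and separated approaches (0.03 in Chamfer Distance for generation, 0.002 in METEOR/ROUGE for understanding), so the mutual benefit is not yet convincing. In addition, the model degrades in complex scenes (sharp turns, occlusion, nighttime), suggesting that under extreme conditions the limitations of the BEV representation may offset the advantages of the unified framework.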