Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Abstract

This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models (MLLMs), which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. The model combines SAM-2 with LLaVA and unifies text, image, and video in a shared LLM token space. The authors introduce the Ref-SAV dataset with 72k+ object expressions and achieve state-of-the-art results across multiple benchmarks.

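Paragraph function: Overview of the whole paper. It establishes novelty by positioning Sa2VA as the "first unified model" and outlines the architecture, dataset, and results.

Logical role: The abstract centers on unification as the core selling point, moving from modality unification (image plus video) to task unification (segmentation plus conversation) to show how general the method is.

Argumentative technique / potential weakness: The "first" claim requires a strict boundary around existing methods: how is the scope of "unified" defined? If it covers only the specific combination of referring segmentation and conversation, the basis for novelty needs to be stated more precisely.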

1. Introduction

Multi-modal Large Language Models (MLLMs) have made significant progress in visual question answering, narrative generation, and interactive editing. However, a fundamental gap remains: video perception models like SAM-2 lack open-ended understanding abilities, while video MLLMs cannot perform pixel-level perception tasks such as segmentation. The authors observe that "no work can leverage the benefits from both sides." Three key challenges emerge: (1) task formulation for multimodal inputs; (2) balancing performance across diverse tasks; and (3) effective knowledge sharing from pretrained models.

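Paragraph function: Sets up the research landscape by exposing the fundamental gap between perception models and understanding models.

Logical role: The starting point of the argument chain. The "divide between two worlds" establishes the problem, and the three challenges give the method design a clear roadmap.

Argumentative technique / potential weakness: Framing the problem as bridging two worlds is a strong narrative strategy. However, SAM-2 and LLaVA differ greatly in architecture, and integrating them without sacrificing the strengths of either is a non-trivial engineering challenge.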

Sa2VA addresses these challenges with a decoupled design. Rather than deeply integrating SAM-2 into the LLM, the framework uses special [SEG] tokens as bridges: the hidden states of the [SEG] token generated by the LLM are used as spatio-temporal prompts and fed into SAM-2's decoder. This design preserves plug-and-play compatibility: SAM-2 and LLaVA operate as independent modules connected through a lightweight interface. For video tasks, key frames generate masks via [SEG] tokens, while the remaining frames use SAM-2's memory mechanisms for tracking.

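Paragraph function: Presents the solution, a decoupled architecture driven by [SEG] tokens.

Logical role: Reveals the core design philosophy, decoupling rather than fusion. As the minimal interface between the two models, the [SEG] token carries semantic information while avoiding the complexity of deep coupling.

Argumentative technique / potential weakness: The decoupled design is simple and maintainable, but the information passed between the two models is limited by what the [SEG] token can express. Whether a single token can carry enough spatio-temporal information to guide precise segmentation is the design's key assumption.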
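
The bridging idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain linear projection, and the choice of the first k frames as key frames are all assumptions made for illustration. It only shows the data flow the text describes, where the hidden state at each [SEG] position becomes a prompt vector for the mask decoder and non-key frames are left to memory-based tracking.

```python
from typing import List, Sequence

SEG_TOKEN = "[SEG]"  # special token named in the text


def find_seg_positions(tokens: Sequence[str]) -> List[int]:
    """Return the indices of every [SEG] token in the generated sequence."""
    return [i for i, tok in enumerate(tokens) if tok == SEG_TOKEN]


def project(hidden: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical linear projection from LLM hidden size to prompt size.

    Each row of `weight` produces one output dimension.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def seg_prompts(
    tokens: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Turn the hidden state at each [SEG] position into a decoder prompt."""
    return [project(hidden_states[i], weight) for i in find_seg_positions(tokens)]


def plan_frames(num_frames: int, num_key_frames: int) -> List[str]:
    """Label each frame by how its mask would be obtained.

    Assumption: the first `num_key_frames` frames are the key frames.
    "prompt" means the mask comes from the [SEG] prompt; "memory" means
    the mask is propagated by the tracker's memory mechanism.
    """
    k = max(0, min(num_key_frames, num_frames))
    return ["prompt" if i < k else "memory" for i in range(num_frames)]
```

Because the only thing crossing the boundary is a list of prompt vectors, either side could in principle be swapped without touching the other, which is the plug-and-play property the paragraph emphasizes.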

2. Related Work

Multi-modal LLMs have evolved from image-level understanding (LLaVA, InternVL) to video handling (LLaVA-OneVision), but lack pixel-level grounding capabilities. Referring segmentation methods have progressed from fusion modules to transformer-based approaches to LLM-equipped architectures, yet their conceptual vocabulary remains limited compared with the knowledge space of LLMs. Video segmentation and grounding methods focus on closed-set pixel segmentation with predefined categories. Existing approaches either sacrifice language understanding for segmentation precision or vice versa. Sa2VA is presented as the first to unify both capabilities without such tradeoffs.

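Paragraph function: Literature review covering three research lines: MLLMs, referring segmentation, and video segmentation.

Logical role: Organizing the literature around an understanding-versus-perception dichotomy makes Sa2VA's unified positioning appear natural and necessary.

Argumentative technique / potential weakness: The claim of "no tradeoffs" needs quantitative support: do the two capabilities really not interfere with each other under co-training? The ablation study should compare co-training against training each task separately.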

3. Method

Sa2VA reformulates diverse tasks under a unified representation. Referring image/video object segmentation takes text tokens and images/videos as input and outputs binary masks or spatio-temporal masklets. Image/video chat and grounded caption generation output answer text with aligned masks. Visual prompt understanding accepts additional visual prompts (boxes/points). All tasks are unified as: T_o, M_o = LLM({I_i, V_i, VP_i}, T_i), where T_i is the input text, I_i, V_i, and VP_i are the input images, videos, and visual prompts, and the outputs are text T_o and/or masks M_o depending on task type. The LLM generates [SEG] tokens whose hidden states serve as spatio-temporal prompts for SAM-2's decoder. The training loss combines a text regression loss with pixel-wise cross-entropy and Dice losses for segmentation.

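Paragraph function: Unified formulation, describing how multiple tasks map onto a single model framework.

Logical role: Shows the simplicity of the design: all tasks share one input-output interface and differ only in what the output contains (text, masks, or both). The [SEG] token is the only bridge between the language side and the visual side.

Argumentative technique / potential weakness: The appeal of a unified formulation is its simplicity, but a one-model-for-everything strategy may underperform specialized models on particular tasks. Balancing task weights in multi-task training is a delicate engineering problem.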
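
To make the loss combination concrete, here is a minimal sketch of a mask loss built from pixel-wise binary cross-entropy plus Dice, added to a text loss. The equal weighting (`w_ce`, `w_dice`), the smoothing constant, and the function names are assumptions for illustration; the source states only which terms are combined, not how they are weighted.

```python
import math
from typing import Sequence


def pixel_bce(probs: Sequence[float], target: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over the pixels of a flattened mask."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs: Sequence[float], target: Sequence[int], smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum(probs) + sum(target) + s)."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def mask_loss(probs, target, w_ce: float = 1.0, w_dice: float = 1.0) -> float:
    """Weighted sum of the two mask terms (weights are assumed, not sourced)."""
    return w_ce * pixel_bce(probs, target) + w_dice * dice_loss(probs, target)


def total_loss(text_loss: float, probs, target) -> float:
    """Text loss plus mask loss, as the paragraph describes at a high level."""
    return text_loss + mask_loss(probs, target)
```

For a text-only sample (no [SEG] token in the output) the mask term would simply be skipped, which is one way a single objective can serve both chat and segmentation data.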

3.3 Ref-SAV Dataset and Benchmark

The authors introduce Ref-SAV, a dataset of 72,509 object expressions over 37,311 videos drawn from the SA-V dataset. The annotation pipeline has three stages: (1) object/part-level annotation using InternVL2-76B with consistency checking; (2) scene-level annotation with object relationships; and (3) video-level annotation capturing motion and actions across 8 uniformly sampled frames. The benchmark subset contains 1,147 videos with 1,945 object expressions (1,694 long, 251 short), manually validated. Ref-SAV features long text descriptions, heavy occlusion, and large motion, challenges that the authors describe as absent from existing benchmarks.

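Paragraph function: Data contribution, detailing how Ref-SAV is constructed and what makes it distinctive.

Logical role: Ref-SAV is not only a source of training data for Sa2VA but also a community contribution that fills a gap in existing referring video object segmentation benchmarks for complex scenes.

Argumentative technique / potential weakness: The three-stage annotation pipeline is thorough, but generating annotations with a VLM (InternVL2-76B) may introduce systematic bias. Manual validation covers only the benchmark subset (1,147 videos), so the quality of the training set depends on the reliability of the automated pipeline.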
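
The video-level annotation stage works on 8 uniformly sampled frames. The sketch below shows one plausible way to pick such frames. Whether the first and last frames are included, and how short clips are handled, is not stated in the source, so both choices here are assumptions, as is the function name.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over a clip.

    Assumptions: endpoints are included, and clips shorter than
    `num_samples` simply return every frame.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_frames <= num_samples:
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

For a 15-frame clip this yields every second frame from index 0 to index 14. Sampling only 8 frames keeps the annotation prompt small, at the cost of possibly missing brief motions between sampled frames.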

4. Experiments

On image referring segmentation, Sa2VA-8B achieves 81.6, 76.2, and 78.7 cIoU on RefCOCO, RefCOCO+, and RefCOCOg respectively, surpassing prior state-of-the-art. On video referring segmentation, Sa2VA achieves 57.0, 75.2, and 57.6 J&F on MeVIS, Ref-DAVIS17, and ReVOS, surpassing the previous SOTA VISA-13B by 12.5, 4.8, and 6.7 J&F. For chat performance, Sa2VA maintains strong results on MME (1651/578), MMBench (82.4), and SEED-Bench (75.5) while retaining grounding capabilities. On the new Ref-SAV benchmark, Sa2VA-8B achieves 49.3 J&F zero-shot vs. 14.1 for prior methods, improving to 58.7 with training data. Ablation studies confirm that co-training across all tasks benefits each individual task, and that the single [SEG] token design outperforms multi-token variants.

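Paragraph function: Provides comprehensive experimental evidence covering image segmentation, video segmentation, conversation, the new benchmark, and ablations.

Logical role: The results support the core claim that unification does not sacrifice performance: on both segmentation and conversation, the model matches or surpasses specialized models. The large zero-shot margin on Ref-SAV (49.3 vs. 14.1) is especially striking.

Argumentative technique / potential weakness: The numbers are comprehensive and convincing, but the Ref-SAV benchmark was built by the authors themselves, so part of the advantage there may come from familiarity with the data distribution. The ablation finding that co-training is mutually beneficial is the key evidence against the concern that multi-task training causes interference.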
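
For readers unfamiliar with the cIoU metric reported above, the following sketch contrasts cumulative IoU (total intersection over total union across the dataset) with a per-sample mean IoU. Masks are flattened 0/1 lists and the function names are illustrative. The last helper only subtracts the reported margins from the reported Sa2VA scores to recover the implied VISA-13B baselines; it adds no new measurements.

```python
from typing import List, Sequence


def _inter_union(pred: Sequence[int], gt: Sequence[int]):
    """Count intersecting and union pixels of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def ciou(preds: Sequence[Sequence[int]], gts: Sequence[Sequence[int]]) -> float:
    """Cumulative IoU: sum of intersections divided by sum of unions."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: Sequence[Sequence[int]], gts: Sequence[Sequence[int]]) -> float:
    """Average of per-sample IoU, shown for contrast with cIoU."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = _inter_union(pred, gt)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def implied_baselines(scores: Sequence[float], margins: Sequence[float]) -> List[float]:
    """Baseline = reported score minus reported margin, rounded to 0.1."""
    return [round(s - m, 1) for s, m in zip(scores, margins)]
```

Because cIoU pools pixels before dividing, large objects weigh more than small ones, so the two metrics can disagree on the same predictions. Applied to the J&F figures in the text, `implied_baselines([57.0, 75.2, 57.6], [12.5, 4.8, 6.7])` gives 44.5, 70.4, and 50.9 for VISA-13B on MeVIS, Ref-DAVIS17, and ReVOS.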

5. Conclusion

Sa2VA presents a versatile framework that integrates SAM-2 with LLaVA-like MLLMs to achieve dense, grounded understanding of both images and videos. The decoupled design with [SEG] tokens enables plug-and-play integration while maintaining state-of-the-art performance across segmentation, conversation, and grounded captioning tasks. The Ref-SAV benchmark provides a challenging evaluation platform for future research. The model handles multiple tasks with a single instruction-tuning process, demonstrating that perception and understanding need not be separate capabilities in multi-modal AI.

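Paragraph function: Summarizes the paper, with the unification of perception and understanding as the central takeaway.

Logical role: The conclusion distills an insight that goes beyond the method itself: multi-modal AI should move toward merging perception and understanding rather than keeping them apart.

Argumentative technique / potential weakness: The claim that perception and understanding need not be separate is thought-provoking, but the current method still relies on two separate pretrained models (SAM-2 and LLaVA). True end-to-end unification, training a single model with both capabilities from scratch, remains a goal for future work.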

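Argument Structure Overview

Problem: perception models and understanding models are isolated from each other
→ Thesis: decoupled integration of SAM-2 and LLaVA through [SEG] tokens
→ Evidence: image and video segmentation results that surpass the prior SOTA across the board
→ Rebuttal: co-training is mutually beneficial, and the single-token design performs best
→ Conclusion: perception and understanding need not be separate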

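Authors' Core Claim (One Sentence)

Integrating SAM-2 with LLaVA through a decoupled design bridged by [SEG] tokens unifies dense perception and open-ended language understanding for images and videos in a single model, reaching state-of-the-art performance on both kinds of tasks.

Strongest Point of the Argument

The engineering judgment behind the decoupled design: using the [SEG] token as a minimal interface preserves the full capabilities of both pretrained models while still passing information effectively between them. The ablation finding that co-training is mutually beneficial dispels the concern that multi-task learning could cause negative transfer. The 12.5 J&F margin over VISA-13B on video referring segmentation (MeVIS) is especially striking.

Weakest Point of the Argument

Limitations on long videos and complex scenes: in the appendix, the authors acknowledge failure cases on long videos, as well as a scale tension between VQA and segmentation tasks. In addition, although the single [SEG] token design performed best in the ablation, its information bottleneck may surface in scenes that require very fine spatial guidance.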