InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Abstract — 摘要

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. In computer vision, however, large-scale vision foundation models are often used merely as backbones for other models, and scaling them up and aligning them with LLMs has not been sufficiently explored. In this work, the authors present InternVL, which scales the vision foundation model to 6 billion parameters and progressively aligns it with an LLM using web-scale image-text data from diverse sources. The model is versatile and achieves state-of-the-art performance on 32 generic visual-linguistic benchmarks. These cover visual perception tasks such as image-level and pixel-level recognition, vision-language tasks such as zero-shot image/video classification and zero-shot image-text retrieval, and multimodal dialogue systems built by linking the model with LLMs. It has powerful visual capabilities and can be a good alternative to the ViT-22B.

大型語言模型的指數級成長為多模態通用人工智慧系統開啟了無數可能性。然而在電腦視覺領域，大規模視覺基礎模型的權重往往僅被用作其他模型的骨幹網路。視覺基礎模型的規模擴展及其與大型語言模型的對齊，至今尚未被充分探索。本研究提出 InternVL，將視覺基礎模型擴展至 60 億參數，並利用來自多元來源的網路規模圖文資料，漸進式地與大型語言模型對齊。該模型具備高度通用性，在 32 項通用視覺語言基準測試上達到最先進的表現，涵蓋影像層級與像素層級辨識等視覺感知任務、零樣本影像／影片分類與零樣本圖文檢索等視覺語言任務，以及可與大型語言模型銜接建構多模態對話系統。該模型具備強大的視覺能力，可作為 ViT-22B 的優質替代方案。

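Paragraph function: Overview of the whole paper. The abstract moves step by step from the rise of large language models, to the shortcomings of vision foundation models, to InternVL's positioning and contributions.

Logical role: The abstract does double duty, identifying a gap and previewing the solution. It first points out that vision foundation models lag in both scale and alignment, then offers state-of-the-art results on 32 benchmarks as a promise that the later empirical sections must back up.

Argumentative technique / potential weakness: Citing 32 benchmarks is a persuasive breadth strategy that implies strong generalization. The abstract does not say, however, whether the model is best on every one of them, and closing by calling the model "a good alternative" to ViT-22B has a promotional flavor; how precisely that claim holds must be checked against the experiments.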

1. Introduction — 緒論

Recent vision large language models (VLLMs) typically connect a vision encoder to a large language model using lightweight "glue" layers. However, this paradigm has three critical limitations. First, there is a disparity in parameter scale: vision encoders hover around 1 billion parameters while LLMs exceed 100 billion, creating an imbalanced multimodal system. Second, the vision and language components learn inconsistent representations because they are pre-trained on different objectives and datasets. Third, the connection between the two modalities is often inefficient, relying on simple projection layers that fail to bridge the representational gap.

近期的視覺語言基礎模型通常以輕量的「黏合」層將視覺編碼器與大型語言模型連接。然而，此範式產生三項關鍵限制：其一，參數規模存在嚴重失衡——視覺編碼器停留在約 10 億參數，而大型語言模型已超越 1,000 億，形成不對稱的多模態系統。其二，視覺與語言組件由於在不同目標函數與資料集上預訓練，導致所學表徵不一致。其三，兩種模態之間的連接往往效率低落，僅依賴簡單的投影層，無法有效彌合表徵落差。

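Paragraph function: Establishes the research territory by naming three structural defects in current vision-language model architectures.

Logical role: The starting point of the argument chain. A numerical contrast (1B vs. 100B+) builds the intuition of parameter imbalance, which is then extended to two deeper problems, inconsistent representations and inefficient connection, to motivate InternVL's design principles.

Argumentative technique / potential weakness: The three limitations are listed in a clear, escalating order (scale -> representation -> connection), which reinforces the severity of the problem. The 1B vs. 100B contrast is somewhat overstated, though: in practice many effective VLLMs (e.g., LLaVA) perform well despite this imbalance, which suggests that parameter scale is not necessarily the only bottleneck.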

To address these limitations, the authors propose InternVL with three key design principles. First, parameter-balanced components: the vision encoder (InternViT-6B) is scaled to 6 billion parameters, and the language middleware (QLLaMA) contains 8 billion parameters, creating a more balanced multimodal architecture. Second, consistent representation alignment: by initializing QLLaMA from a pre-trained multilingual LLaMA, the vision and language components share a common representational foundation. Third, a progressive image-text alignment strategy that advances from contrastive learning to generative learning, ensuring robust cross-modal alignment at each stage.

為解決上述限制，作者提出 InternVL 並遵循三項核心設計原則。第一，參數均衡的組件配置：視覺編碼器 InternViT-6B 擴展至 60 億參數，語言中介層 QLLaMA 包含 80 億參數，建構更為均衡的多模態架構。第二，一致的表徵對齊：透過以預訓練的多語言 LLaMA 初始化 QLLaMA，使視覺與語言組件共享共同的表徵基礎。第三，漸進式圖文對齊策略，從對比學習推進至生成式學習，確保每個階段均具備穩健的跨模態對齊。

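Paragraph function: Presents the solution, with three design principles that answer the three limitations of the previous paragraph.

Logical role: The pivot from problem statement to proposal. Each design principle maps onto one defect (scale imbalance -> balanced parameters; inconsistent representations -> shared initialization; inefficient connection -> progressive alignment), giving a complete problem-to-solution mapping.

Argumentative technique / potential weakness: The one-to-one structure is tidy and highly persuasive. Whether parameter balance is actually necessary is debatable, however: visual and linguistic processing differ in information complexity, so forcing parity may over-parameterize the vision side or leave the language side under-resourced. The premise needs experimental support.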

The contributions of this work are threefold. First, InternVL scales the vision foundation model to 6 billion parameters, making it the largest open-source vision encoder to date. Second, the progressive alignment strategy enables the model to seamlessly bridge contrastive and generative learning paradigms within a unified framework. Third, the model achieves state-of-the-art performance on 32 benchmarks spanning visual perception, vision-language understanding, and multimodal dialogue, demonstrating broad versatility without task-specific architectural modifications.

本研究的貢獻有三。第一，InternVL 將視覺基礎模型擴展至 60 億參數，使其成為目前最大的開源視覺編碼器。第二，漸進式對齊策略使模型能在統一框架內無縫銜接對比學習與生成式學習範式。第三，模型在涵蓋視覺感知、視覺語言理解與多模態對話的 32 項基準測試上達到最先進表現，展現了無需任務特定架構修改即可適用的廣泛通用性。

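Paragraph function: Contribution statement, listing the paper's core contributions in a three-point structure.

Logical role: Closes the introduction by condensing the preceding motivation and design principles into concrete contributions, setting the reader's expectations for the sections that follow.

Argumentative technique / potential weakness: "The largest open-source vision encoder" is an objectively verifiable factual claim, which adds credibility. "State-of-the-art performance on 32 benchmarks" is vaguer: does the model rank first on every benchmark, or is it competitive overall? The quantitative detail has to come from the experiments section.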

2. Related Work — 相關工作

Vision foundation models have evolved from AlexNet through ResNet to the Vision Transformer (ViT) family. Recent scaling efforts include ViT-22B and EVA-02, which demonstrate that larger vision models yield better representations for downstream tasks. However, these models primarily derive their training from visual-only datasets or BERT-style alignment, lacking direct integration with LLMs. This disconnect means that even the most powerful vision encoders cannot natively participate in language-driven reasoning, limiting their utility in the era of multimodal AI.

視覺基礎模型從 AlexNet 經 ResNet 演進至 Vision Transformer 家族。近期的規模擴展包括 ViT-22B 與 EVA-02，證明更大的視覺模型能產生更優質的下游任務表徵。然而，這些模型的訓練主要仰賴純視覺資料集或 BERT 式的對齊方式，缺乏與大型語言模型的直接整合。此斷裂意味著即便是最強大的視覺編碼器也無法原生參與語言驅動的推理，限制了其在多模態人工智慧時代的效用。

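Paragraph function: Literature review, tracing the evolution of vision foundation models and their current bottleneck.

Logical role: Continuing the introduction's parameter-scale theme, this paragraph shifts the critique from scale imbalance to incompatible training paradigms: even as vision models grow, their training objectives remain disconnected from the LLM ecosystem.

Argumentative technique / potential weakness: The linear narrative "AlexNet -> ViT -> ViT-22B" gives a sense of historical progression, and "lacking LLM integration" then dismisses the shortcomings of all prior work in a single stroke. CLIP-style contrastive learning has already built a vision-language bridge, however, so the paragraph may underestimate how well existing methods align the two modalities.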

Large language models such as the GPT series, LLaMA, and InternLM have demonstrated extraordinary capabilities in reasoning, instruction following, and in-context learning. The integration of visual perception into these models has spawned a rich landscape of vision large language models (VLLMs), including Flamingo, BLIP-2, and LLaVA. These approaches typically employ a frozen or lightly fine-tuned vision encoder (often CLIP ViT-L) connected to an LLM decoder via a projection layer or Q-Former module. While effective, the vision encoder remains the bottleneck: its limited capacity constrains the visual information available to the language model, and its representations are not natively aligned with the LLM's embedding space.

GPT 系列、LLaMA 與 InternLM 等大型語言模型在推理、指令遵循與上下文學習方面展現了卓越能力。將視覺感知整合至這些模型中，催生了豐富的視覺語言大型模型生態，包括 Flamingo、BLIP-2 與 LLaVA。這些方法通常採用凍結或輕度微調的視覺編碼器（通常為 CLIP ViT-L），經由投影層或 Q-Former 模組與 LLM 解碼器連接。雖然有效，但視覺編碼器仍為瓶頸——其有限的容量制約了可供語言模型使用的視覺資訊，且其表徵並非原生地與 LLM 的嵌入空間對齊。

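Paragraph function: Positioning within the literature. Amid the rapid growth of VLLMs, it identifies the vision encoder as the shared bottleneck.

Logical role: Establishes InternVL's academic lineage: Flamingo, BLIP-2, and LLaVA are all precursors, but they share the fundamental limitation of an under-powered vision side. InternVL is positioned as the natural next step in this trend, strengthening the vision side itself instead of only improving the connector.

Argumentative technique / potential weakness: Treating the vision encoder as the bottleneck is the central premise of the whole paper, yet work such as LLaVA-1.5 has shown that CLIP ViT-L combined with high-quality instruction-tuning data already performs very well. Whether the vision encoder is really the main bottleneck, or whether data quality and alignment strategy matter more, deserves closer examination that the authors do not fully provide.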

3. Method — 方法

3.1 Overall Architecture — 整體架構

Unlike traditional vision-only or dual-encoder approaches, InternVL combines two core components: InternViT-6B, a vision transformer with 6 billion parameters, and QLLaMA, a language middleware with 8 billion parameters initialized from multilingual LLaMA. The architecture supports flexible composition for diverse visual-linguistic tasks through three inference modes: (1) visual perception, where InternViT-6B generates feature maps for dense prediction tasks; (2) contrastive tasks, where attention pooling extracts global visual features for similarity computation; and (3) generative tasks, where QLLaMA reorganizes visual representations as prefix tokens for sequential generation.

有別於傳統的純視覺或雙編碼器方法，InternVL 結合兩大核心組件：擁有 60 億參數的視覺轉換器 InternViT-6B，以及從多語言 LLaMA 初始化、含 80 億參數的語言中介層 QLLaMA。此架構透過三種推論模式支援多元視覺語言任務的彈性組合：（1）視覺感知模式，由 InternViT-6B 生成特徵圖供密集預測任務使用；（2）對比任務模式，以注意力池化提取全域視覺特徵進行相似度計算；（3）生成任務模式，由 QLLaMA 將視覺表徵重組為前綴標記以進行序列生成。

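Paragraph function: Architecture overview, outlining InternVL's two-component design and three inference modes.

Logical role: Opens the method section with a global picture. The pairing of 6 billion vision parameters with 8 billion language parameters answers the parameter-balance principle from the introduction, and the three inference modes preview the broad coverage of the later experiments.

Argumentative technique / potential weakness: The three inference modes demonstrate the architecture's versatility, an important engineering contribution. Modular flexibility can come at the cost of optimization trade-offs, however: an architecture designed to fit many tasks may underperform a purpose-built model on any specific one.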

3.2 Model Design — 模型設計

The design of InternViT-6B involves systematic hyperparameter exploration, varying depth {32, 48, 64, 80}, head dimension {64, 128}, and MLP ratio {4, 8} across 16 model variants. Evaluation reveals that "depth, head dimension, and MLP ratio have little impact on the performance" for a fixed parameter count, suggesting that the total number of parameters matters more than the specific architectural configuration. The final selected architecture uses width 3200, depth 48, MLP dimension 12800, and 25 attention heads, yielding 5.9 billion parameters. This variant is chosen based on a composite criterion of throughput, accuracy, and training stability.

InternViT-6B 的設計涉及系統性的超參數探索，在深度 {32, 48, 64, 80}、注意力頭維度 {64, 128} 與 MLP 比率 {4, 8} 之間，對 16 種模型變體進行評估。結果揭示，在固定參數量下，深度、頭維度與 MLP 比率對效能影響甚微，表明總參數量比具體的架構配置更為重要。最終選定的架構採用寬度 3200、深度 48、MLP 維度 12800 與 25 個注意力頭，共 59 億參數。此變體基於吞吐量、準確率與訓練穩定性的綜合標準選出。

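Paragraph function: Architecture search, describing the design methodology for InternViT-6B and the basis for the final choice.

Logical role: Gives the claim of scaling up the vision model a methodological footing: the model was not enlarged arbitrarily but selected through a systematic search. The finding that parameter count matters more than architectural configuration also supplies a rationale for the final choice.

Argumentative technique / potential weakness: The 16-variant ablation shows methodological rigor. But if architectural configuration really has little effect, the authors' contribution to architecture design is limited: the core contribution is making the model bigger, not designing a better one. This may weaken the paper's technical novelty.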
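
The reported configuration can be sanity-checked with a rough parameter count. The sketch below assumes a standard ViT block (four width-by-width attention projections plus a two-layer MLP) and ignores biases, normalization layers, and embeddings; the function name is illustrative and does not come from the paper.

```python
def vit_block_params(width: int, mlp_dim: int) -> int:
    """Approximate weight count of one transformer block (biases and norms ignored)."""
    attention = 4 * width * width   # Q, K, V and output projections
    mlp = 2 * width * mlp_dim       # up- and down-projection
    return attention + mlp


# Configuration quoted in the text for the final InternViT-6B variant.
WIDTH, DEPTH, MLP_DIM, HEADS = 3200, 48, 12800, 25

total = DEPTH * vit_block_params(WIDTH, MLP_DIM)
head_dim = WIDTH // HEADS
mlp_ratio = MLP_DIM // WIDTH

print(f"approx. parameters: {total / 1e9:.2f}B")   # 5.90B
print(f"head dimension: {head_dim}, MLP ratio: {mlp_ratio}")
```

Under these assumptions the estimate lands at about 5.9 billion, matching the stated figure, and the implied head dimension (128) and MLP ratio (4) both fall inside the search ranges listed above.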

QLLaMA is built upon a pre-trained multilingual LLaMA augmented with 96 learnable queries and cross-attention layers totaling 1 billion additional parameters that are randomly initialized. This design offers three advantages over the widely-used Q-Former: (1) pre-trained weight initialization provides a strong starting point for representation alignment, unlike Q-Former which is trained from scratch; (2) QLLaMA is 42 times larger than Q-Former, providing substantially greater capacity for cross-modal understanding; (3) the architecture is applicable to both contrastive and generative tasks, whereas Q-Former is primarily designed for contrastive alignment.

QLLaMA 建構於預訓練的多語言 LLaMA 之上，額外增加 96 個可學習查詢與交叉注意力層，共計 10 億個隨機初始化的新增參數。相較於廣泛使用的 Q-Former，此設計具備三項優勢：（1）預訓練權重初始化為表徵對齊提供了優質的起點，不像 Q-Former 需從零訓練；（2）QLLaMA 的規模是 Q-Former 的 42 倍，為跨模態理解提供了顯著更大的容量；（3）此架構可同時適用於對比與生成任務，而 Q-Former 主要為對比對齊而設計。

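Paragraph function: Component design, detailing QLLaMA's architecture and its advantages over Q-Former.

Logical role: Answers the inefficient-connection problem raised in the introduction: QLLaMA is not a simple projection layer but a full language-model middleware with 8 billion parameters. With Q-Former as the reference point, three concrete advantages differentiate it.

Argumentative technique / potential weakness: The "42 times larger" comparison is striking, but bigger is not necessarily better: BLIP-2's Q-Former achieves an effective vision-language bridge with very few parameters. The authors need to show that the extra parameters deliver gains commensurate with their scale and are not simply consuming compute.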
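
A quick arithmetic check makes the size comparison concrete. The figures are the rounded ones quoted above, so the outputs are approximations, and the variable names are illustrative only.

```python
QLLAMA_TOTAL = 8e9   # total QLLaMA parameters, as stated in the text
NEW_PARAMS = 1e9     # randomly initialized queries and cross-attention layers
SIZE_RATIO = 42      # "42 times larger than Q-Former"

# Share of QLLaMA that starts from pre-trained weights.
pretrained_share = (QLLAMA_TOTAL - NEW_PARAMS) / QLLAMA_TOTAL

# Q-Former size implied by the stated ratio.
implied_qformer = QLLAMA_TOTAL / SIZE_RATIO

print(f"pre-trained share of QLLaMA: {pretrained_share:.1%}")   # 87.5%
print(f"implied Q-Former size: {implied_qformer / 1e6:.0f}M")   # 190M
```

So roughly seven eighths of QLLaMA begins from pre-trained weights, and the stated ratio implies a Q-Former of about 190 million parameters.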

3.3 Alignment Strategy — 對齊策略

The alignment proceeds through three progressive stages. Stage 1: Vision-Language Contrastive Training aligns InternViT-6B with multilingual LLaMA-7B on web-scale noisy image-text pairs. The training dataset comprises 6.03 billion image-text pairs (4.98 billion after cleaning) drawn from diverse sources including LAION-en, LAION-multi, LAION-COCO, COYO, Wukong, CC3M/12M, and SBU. The loss function follows CLIP's symmetric cross-entropy formulation. This stage establishes the foundation for contrastive task excellence and visual perception capabilities.

對齊流程分為三個漸進階段。第一階段為視覺語言對比訓練，在網路規模的含噪圖文配對資料上對齊 InternViT-6B 與多語言 LLaMA-7B。訓練資料集包含 60.3 億筆圖文配對（清洗後為 49.8 億筆），來源涵蓋 LAION-en、LAION-multi、LAION-COCO、COYO、Wukong、CC3M/12M 及 SBU 等多元資料集。損失函數採用 CLIP 的對稱交叉熵公式。此階段為對比任務的卓越表現與視覺感知能力奠定基礎。

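Paragraph function: Training strategy, Stage 1. Large-scale contrastive learning establishes the basic vision-language alignment.

Logical role: The starting point of progressive alignment. CLIP-style contrastive learning, the most basic and robust form of alignment, is applied to nearly 5 billion pairs to build broad vision-language associations.

Argumentative technique / potential weakness: A dataset of 4.98 billion pairs is itself a barrier to entry, since most research teams cannot reproduce training at this scale. The authors treat data scale as an implicit competitive advantage but do not fully discuss the effect of data quality. Whether noisy web data introduces bias or degrades performance on specific tasks merits further study.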
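
The symmetric cross-entropy objective mentioned for Stage 1 can be sketched in a few lines. This is a minimal pure-Python illustration of the CLIP-style formulation, not the authors' implementation; the fixed temperature of 0.07 and the toy two-dimensional embeddings are assumptions made for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss for a batch in which image i is paired with text i.

    Both inputs are lists of L2-normalized vectors of equal length.
    """
    n = len(image_emb)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_emb] for img in image_emb]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column i should select row i.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (i2t + t2i) / 2


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(aligned, aligned))  # near zero: pairs match
print(symmetric_contrastive_loss(aligned, swapped))  # large: pairs are mismatched
```

Averaging the two directions is what makes the loss symmetric: a batch is penalized both when an image prefers the wrong caption and when a caption prefers the wrong image.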

Stage 2: Vision-Language Generative Training builds upon Stage 1 by inheriting its weights and introducing newly added cross-attention components. The training uses filtered, high-quality data reduced to 1.03 billion pairs after rigorous quality filtering. The loss function combines three objectives: image-text contrastive loss (ITC), image-text matching loss (ITM), and image-grounded text generation loss (ITG). Critically, both InternViT-6B and the base QLLaMA parameters remain frozen during this stage, with only the newly added cross-attention layers being optimized. This design preserves the strong visual representations established in Stage 1 while learning the generative bridge.

第二階段為視覺語言生成訓練，繼承第一階段的權重並引入新增的交叉注意力組件。訓練使用經嚴格品質篩選後縮減至 10.3 億筆的高品質資料。損失函數結合三項目標：圖文對比損失（ITC）、圖文匹配損失（ITM）與基於影像的文本生成損失（ITG）。關鍵的是，InternViT-6B 與基底 QLLaMA 的參數在此階段均保持凍結，僅最佳化新增的交叉注意力層。此設計在學習生成橋接的同時，保全了第一階段建立的強健視覺表徵。

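Paragraph function: Training strategy, Stage 2. The transition from contrastive to generative learning.

Logical role: The middle link of the progressive strategy. The data shrinks sharply from 4.98 billion to 1.03 billion pairs (quality over quantity), and freezing protects what Stage 1 achieved. The triple loss reflects the complexity of multi-objective optimization.

Argumentative technique / potential weakness: Freezing most parameters is a conservative but pragmatic way to avoid catastrophic forgetting. It also means that Stage 2's learning capacity is limited to the newly added cross-attention layers (only 1 billion parameters), which may cap generative ability. In addition, the authors do not explain in detail how the three loss terms are weighted.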
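
To show how small the trainable portion is in Stage 2, the sketch below tallies the parameter groups using the rounded figures quoted in these notes. The group names are hypothetical labels, not identifiers from the InternVL code, and the equal loss weights are an assumption, since the weighting is not specified in the text.

```python
# Parameter counts in billions, from the figures quoted in this document.
param_groups = {
    "internvit_6b": 5.9,
    "qllama_pretrained": 7.0,               # 8B total minus 1B newly added
    "qllama_queries_and_cross_attn": 1.0,   # newly added, randomly initialized
}

# Only the newly added components are optimized in Stage 2.
STAGE2_TRAINABLE = {"qllama_queries_and_cross_attn"}

trainable = sum(v for k, v in param_groups.items() if k in STAGE2_TRAINABLE)
total = sum(param_groups.values())
print(f"trainable in Stage 2: {trainable:.1f}B of {total:.1f}B ({trainable / total:.1%})")


def stage2_loss(itc, itm, itg, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of ITC, ITM and ITG; equal weights are an assumption."""
    return weights[0] * itc + weights[1] * itm + weights[2] * itg
```

On these numbers only about 7% of the combined model receives gradient updates, which is the capacity limit the critique above refers to.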

Stage 3: Supervised Fine-tuning connects InternVL with downstream LLM decoders such as Vicuna and InternLM. The fine-tuning uses approximately 4 million instruction-tuning samples spanning captioning, visual question answering, OCR, visual grounding, and multi-turn dialogue. The framework offers flexibility: one can freeze the LLM decoder while training only the MLP projection layer, or employ QLLaMA as a more powerful intermediary. This stage transforms the aligned vision-language model into a practical multimodal dialogue system capable of instruction-following and open-ended visual reasoning.

第三階段為監督式微調，將 InternVL 與下游 LLM 解碼器（如 Vicuna 與 InternLM）連接。微調使用約 400 萬筆指令微調樣本，涵蓋圖說生成、視覺問答、光學字元辨識、視覺定位與多輪對話。框架提供彈性選擇：可凍結 LLM 解碼器僅訓練 MLP 投影層，或採用 QLLaMA 作為更強大的中介層。此階段將已對齊的視覺語言模型轉化為具備指令遵循與開放式視覺推理能力的實用多模態對話系統。

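Paragraph function: Training strategy, Stage 3. Turns the foundation model into a practical dialogue system.

Logical role: The final stage of the progressive strategy. Basic alignment (Stage 1), generative ability (Stage 2), and task adaptation (Stage 3) form a complete training pipeline. The 4 million instruction samples are tiny compared with the first two stages but offer higher quality and broader task coverage.

Argumentative technique / potential weakness: Offering two paths, a frozen LLM with an MLP projection or QLLaMA as middleware, makes the method more practical. Stage 3 is nonetheless very similar to the fine-tuning procedure of methods such as LLaVA; InternVL's distinctive contribution here lies mainly in the stronger vision encoder, not in any innovation in the fine-tuning strategy itself.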

4. Experiments — 實驗

On visual perception benchmarks, InternViT-6B demonstrates exceptional capabilities. For image classification on ImageNet-1K, linear probing achieves 88.2% top-1 accuracy, which represents "the currently best linear evaluation results without the JFT dataset." On semantic segmentation (ADE20K), the model consistently outperforms ViT-22B across all evaluation protocols: few-shot learning shows consistent advantages across varied training data proportions; linear probing achieves 47.2 mIoU, surpassing ViT-22B by 12.6 points; and full-parameter tuning reaches 58.9 mIoU, a gain of 3.6 points over ViT-22B. These results confirm that scaling a vision encoder with language-aligned pre-training produces superior visual representations even for pure vision tasks.

在視覺感知基準上，InternViT-6B 展現了卓越能力。在 ImageNet-1K 影像分類任務中，線性探測達到 88.2% 的 top-1 準確率，為目前不使用 JFT 資料集的最佳線性評估結果。在語義分割任務（ADE20K）上，模型在所有評估協定中均持續超越 ViT-22B：少樣本學習在不同訓練資料比例下均展現一致優勢；線性探測達到 47.2 mIoU，超越 ViT-22B 達 12.6 分；全參數微調達到 58.9 mIoU，領先 ViT-22B 達 3.6 分。這些結果確認了以語言對齊預訓練來擴展視覺編碼器，即便在純視覺任務上也能產生更優質的視覺表徵。

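Paragraph function: Empirical validation (visual perception). Quantitative results demonstrate the quality of InternViT-6B's visual representations.

Logical role: Addresses the core question of whether parameter scaling works. With ViT-22B as the baseline, better results with fewer parameters (6B vs. 22B) strongly support the value proposition of language-aligned pre-training.

Argumentative technique / potential weakness: Choosing ViT-22B as the comparison is shrewd: beating a model with about 3.7 times as many parameters is highly persuasive. The qualifier "without the JFT dataset" deserves attention, though. ViT-22B trained on JFT may do better, so the comparison is confounded by differences in training data.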
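
The ViT-22B baselines are not stated directly above, but they follow from the reported scores and margins. The short calculation below recovers them and expresses each margin as a relative gain; it uses only the numbers quoted in this section.

```python
# (InternViT-6B mIoU, reported margin over ViT-22B) on ADE20K.
results = {
    "linear probing": (47.2, 12.6),
    "full-parameter tuning": (58.9, 3.6),
}

# Implied ViT-22B score for each protocol.
baselines = {p: round(score - margin, 1) for p, (score, margin) in results.items()}

for protocol, (score, margin) in results.items():
    base = baselines[protocol]
    print(f"{protocol}: ViT-22B about {base:.1f} mIoU, relative gain {margin / base:.1%}")
```

The implied baselines are 34.6 and 55.3 mIoU, so the advantage is far larger for frozen features (linear probing) than after full fine-tuning, where the two models end up much closer.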

On vision-language benchmarks, InternVL demonstrates broad competence. For zero-shot image classification, InternVL-C achieves leading performance with "stronger robustness to distribution shift, manifesting in consistent accuracy across ImageNet variants." In zero-shot video classification, single-frame evaluation yields 76.1%, 75.5%, and 67.5% on Kinetics-400/600/700, surpassing EVA-02 by +6.3, +6.2, and +4.1 points respectively; with 8-frame uniform sampling, the model outperforms ViCLIP by more than 3.3 points. For image-text retrieval, InternVL shows strong multilingual performance across both English and Chinese datasets, with InternVL-G (incorporating generative training) further enhancing retrieval accuracy.

在視覺語言基準上，InternVL 展現了廣泛的能力。在零樣本影像分類中，InternVL-C 達到領先表現，對分布偏移具備更強的穩健性，在 ImageNet 各變體上保持一致的準確率。在零樣本影片分類中，單幀評估在 Kinetics-400/600/700 上分別達到 76.1%、75.5% 與 67.5%，超越 EVA-02 達 +6.3、+6.2 與 +4.1 分；以 8 幀均勻取樣評估，模型超越 ViCLIP 逾 3.3 分。在圖文檢索方面，InternVL 在英文與中文資料集上均展現強勁的多語言表現，而融入生成訓練的 InternVL-G 進一步提升了檢索準確率。

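Paragraph function: Empirical validation (vision-language). Multiple cross-modal benchmarks demonstrate the model's generalization.

Logical role: Extends the evaluation from pure vision tasks to cross-modal tasks, validating the progressive alignment strategy. The video classification results stand out: the model has no video-specific design yet clearly outperforms dedicated video models.

Argumentative technique / potential weakness: The multilingual capability is a differentiator, but it is mainly attributable to the multilingual LLaMA initialization, not to the architecture itself. Moreover, zero-shot video classification relies mainly on single-frame evaluation, which in effect tests static image recognition and not temporal understanding, and may therefore overstate the model's video understanding.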
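
For the 8-frame setting, uniform sampling can be sketched as below. Taking the center frame of each equal-length segment is one common convention and is an assumption here; the notes do not say which exact scheme the paper uses, and the function name is illustrative.

```python
def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick the center frame of each of num_samples equal-length segments."""
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(uniform_frame_indices(80))      # [5, 15, 25, 35, 45, 55, 65, 75]
print(uniform_frame_indices(80, 1))   # [40], the single-frame case
```

Because each sampled frame is still encoded independently as an image, this setup adds temporal coverage without adding temporal modeling, which is consistent with the caveat raised above.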

On multimodal dialogue benchmarks, InternVL-Chat combined with Vicuna-13B achieves 1317.2 on MME (spanning 14 sub-tasks) and 87.6 recall on POPE (hallucination evaluation), demonstrating "superior performance compared with previous methods, under the condition of fair trainable parameter counts." Ablation studies further validate the design choices: minimalist supervised fine-tuning with only MLP training shows InternVL outperforms vision encoder baselines, confirming the value of the scaled vision encoder. Moreover, "significant improvement when using QLLaMA as the glue layer" validates the claim that feature representation alignment through a language-informed middleware yields substantial gains over simple projection layers.

在多模態對話基準上，InternVL-Chat 結合 Vicuna-13B 在 MME（涵蓋 14 項子任務）上達到 1317.2 分，在 POPE（幻覺評估）上達到 87.6 的召回率，展現在公平可訓練參數量條件下優於先前方法的表現。消融研究進一步驗證了設計選擇的合理性：以僅訓練 MLP 層的最簡監督微調進行評估，InternVL 超越了各視覺編碼器基線，確認了擴展視覺編碼器的價值。此外，「使用 QLLaMA 作為黏合層時的顯著提升」驗證了以下主張——透過語言知識中介層進行特徵表徵對齊，相較簡單的投影層能帶來實質性的增益。

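Paragraph function: Empirical validation (dialogue and ablations). Dialogue benchmarks and ablation studies close the loop of the experimental argument.

Logical role: Closes the experiments section. The dialogue tasks demonstrate end-to-end practicality, and the ablations retrospectively validate each architectural design decision. The qualifier "fair trainable parameter counts" suggests the authors are aware that model scale could confer an unfair comparative advantage.

Argumentative technique / potential weakness: The ablations are the most valuable part of the experiments, because they separate the contributions of scaling and of the alignment strategy. Still, although 87.6 recall on POPE looks good, hallucination remains a serious challenge in real applications. In addition, by 2024 MME was no longer the most discriminative evaluation, and newer benchmarks (such as MMBench) may give a more comprehensive assessment.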

5. Conclusion — 結論

InternVL bridges the gap between vision foundation models and large language models by scaling the vision encoder to 6 billion parameters and implementing a progressive alignment strategy. The research demonstrates "proficiency in a wide range of generic visual-linguistic tasks" spanning classification, retrieval, captioning, visual question answering, and dialogue. The three-stage training pipeline — from contrastive pre-training on 4.98 billion pairs, through generative training on 1.03 billion quality-filtered pairs, to supervised fine-tuning on 4 million instruction samples — provides a principled framework for building versatile multimodal systems that treat vision as a first-class citizen alongside language.

InternVL 透過將視覺編碼器擴展至 60 億參數並實施漸進式對齊策略，彌合了視覺基礎模型與大型語言模型之間的落差。本研究展示了在廣泛的通用視覺語言任務上的出色表現，涵蓋分類、檢索、圖說生成、視覺問答與對話。三階段訓練管線——從 49.8 億筆配對的對比預訓練，經 10.3 億筆品質篩選配對的生成訓練，到 400 萬筆指令樣本的監督式微調——為建構將視覺視為與語言同等重要的通用多模態系統，提供了一套有原則的框架。

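Paragraph function: Summarizes the paper, restating the core contributions and distilling the methodological value.

Logical role: The conclusion mirrors the structure of the abstract and introduction, returning from technical method to a higher-level vision: treating vision as equally important as language. The shrinking data volume across the three stages (4.98B -> 1.03B -> 4M) neatly sums up the philosophy of progressive refinement.

Argumentative technique / potential weakness: The conclusion ends on the grand vision of generic visual-linguistic capability but does not adequately discuss limitations, such as compute cost (training on 640 A100 GPUs is beyond most teams), the restriction to a single image resolution, and performance on complex reasoning tasks. For a scale-driven work, it lacks reflection on when scaling stops paying off.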
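
The data funnel across the pipeline can be quantified from the figures quoted in these notes. The ratios compare dataset sizes only: the Stage 3 instruction samples are a different kind of data, not a filtered subset of the Stage 2 pairs.

```python
stages = [
    ("raw web pairs", 6.03e9),
    ("Stage 1 (after cleaning)", 4.98e9),
    ("Stage 2 (quality-filtered)", 1.03e9),
    ("Stage 3 (instruction samples)", 4e6),
]

# Compare each stage with the one before it.
for (_, prev_n), (name, n) in zip(stages, stages[1:]):
    print(f"{name}: {n / prev_n:.2%} of the previous size ({prev_n / n:.1f}x smaller)")
```

Cleaning removes under a fifth of the raw pairs, quality filtering cuts the data by nearly a factor of five, and the final instruction set is more than 250 times smaller again.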

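Argument structure overview

Problem
Vision foundation models are under-scaled and weakly aligned with LLMs

→

Claim
Scale the vision encoder to 6B and align it progressively across modalities

→

Evidence
State-of-the-art results on 32 benchmarks; outperforms ViT-22B

→

Rebuttal
Ablations separate the contributions of scale and alignment

→

Conclusion
Vision should be on an equal footing with language in multimodal systems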

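The authors' core claim (in one sentence)

Scaling the vision foundation model to 6 billion parameters and aligning it with large language models through a three-stage progressive strategy yields state-of-the-art generic performance on 32 visual-linguistic tasks spanning perception, understanding, and dialogue.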

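Strongest point of the argument

Representation quality with fewer parameters: with roughly a quarter of ViT-22B's parameters, InternViT-6B outperforms it on ADE20K semantic segmentation under every protocol reported above (+12.6 mIoU with linear probing, +3.6 with full-parameter tuning) and reaches 88.2% ImageNet linear-probing accuracy, the best reported result without the JFT dataset. This suggests that the gain from language-aligned pre-training comes from cross-modal knowledge transfer and not from scale alone, although the difference in training data noted in the experiments remains a confound.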

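Weakest point of the argument

Reproducibility and fairness of comparison: training requires 640 A100 GPUs and nearly 5 billion image-text pairs, a very high bar for reproduction. In addition, several baselines use different training data and compute budgets, so it is hard to attribute the gains precisely to architecture, data scale, or compute. The claim of state-of-the-art results on 32 benchmarks also does not specify, benchmark by benchmark, whether the model is actually the best.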