LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Abstract

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. We compile the LLaVA-CoT-100k dataset with structured reasoning annotations and propose an inference-time stage-level beam search method. Remarkably, with only 100k training samples, LLaVA-CoT outperforms its base model by 8.9% on multimodal reasoning benchmarks, and surpasses larger models including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.

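Paragraph function: Overview of the whole paper. It contrasts the success of LLM reasoning with the shortcomings of VLMs to introduce LLaVA-CoT's multistage reasoning approach.

Logical role: The abstract's argument follows the pattern "cross-domain analogy + gap identification + solution + data": from the success of LLMs, to the reasoning difficulties of VLMs, to LLaVA-CoT as the remedy.

Argumentative technique / potential weakness: The 8.9% gain and the results surpassing Gemini-1.5-pro and GPT-4o-mini provide strong support. However, the claim that "100k samples are enough" requires checking whether those samples were generated by a stronger model (such as GPT-4o); if so, the cost is hidden in dataset construction rather than in training.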

1. Introduction

Recent advances in inference-time scaling have shown that allowing models more "thinking time" leads to better reasoning. OpenAI's o1 model exemplifies this by performing extended chain-of-thought (CoT) reasoning before answering. However, simply applying CoT prompting to existing VLMs yields inconsistent improvements — the models often produce unstructured or hallucinated reasoning chains. The key insight is that effective visual reasoning requires not just more tokens, but structured reasoning organized into distinct cognitive stages.

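Paragraph function: Establishes the research territory. It opens with the trend toward inference-time scaling and points out where CoT falls short in VLMs.

Logical role: The starting point of the argument chain. It first acknowledges the success of CoT in LLMs, then exposes its shortcomings in VLMs, building the case that structured reasoning is necessary.

Argumentative technique / potential weakness: The observation that CoT yields inconsistent gains in VLMs is on point, but the authors need to back it with concrete failure cases or data. The notion of "distinct cognitive stages" comes from cognitive science, and whether it maps appropriately onto VLMs remains to be verified.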

We propose LLaVA-CoT, which structures the reasoning process into four explicit stages: (1) Summary — outlining the problem and determining the approach; (2) Caption — extracting and interpreting relevant visual information from images; (3) Reasoning — proceeding step-by-step through logical problem-solving; (4) Conclusion — delivering the final answer with supporting evidence. Each stage is demarcated by special tokens in the training data, enabling the model to autonomously transition between stages without external prompting.

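Paragraph function: Presents the solution by describing the design of four-stage structured reasoning.

Logical role: This paragraph is the core of the methodology. It turns "more thinking" into four concrete cognitive stages, each with a clearly defined function. The special-token design makes structured reasoning an intrinsic capability of the model.

Argumentative technique / potential weakness: The four-stage division is intuitive and easy to understand, but its fixed order may not suit every kind of reasoning task: some tasks may need reasoning before captioning, or repeated iteration across stages. In addition, the claim of "autonomous transition" depends on how well the special tokens are learned.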

2. Related Work

Chain-of-thought reasoning has proven effective in large language models, with approaches ranging from few-shot CoT prompting to zero-shot "Let's think step by step." Vision-Language Models like LLaVA, InternVL, and Qwen-VL have shown strong performance on visual understanding tasks. However, attempts to integrate CoT into VLMs have been limited to prompting-based approaches that do not fundamentally change the model's reasoning behavior. Inference-time scaling methods like beam search and self-consistency have shown promise, but applying them uniformly across all tokens is computationally wasteful. Our stage-wise retracing search (SWIRES) method addresses this by allocating more search budget to reasoning-critical stages.

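Paragraph function: Literature review covering CoT, VLMs, and inference-time scaling, which sets the context for SWIRES.

Logical role: Criticizes the existing strategy of searching uniformly over all tokens, building the case for the resource efficiency of stage-level search.

Argumentative technique / potential weakness: The criticism that prompt-based CoT does not change model behavior is precise: prompting affects only the output format, not the underlying capability. However, whether the stage-level search in SWIRES can be guided accurately by the stage markers the model produces depends on how reliable those markers are at inference time.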

3. Method

LLaVA-CoT is built on the Llama-3.2-11B-Vision-Instruct backbone. The model is fine-tuned on the LLaVA-CoT-100k dataset with structured reasoning annotations organized into four stages using special delimiter tokens: <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION>. During training, the model learns to generate these stage transitions autonomously, so at inference time it naturally produces structured multi-stage reasoning without any special prompting. The training data is compiled from various visual question answering sources, with structured reasoning annotations generated by distilling reasoning traces from GPT-4o.

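Paragraph function: Training method. Describes the model backbone, the training data, and how the stage markers are learned.

Logical role: This paragraph reveals the core mechanism of the method: by embedding structured markers in the training data, the 11B model internalizes the four-stage reasoning process. Distillation from GPT-4o explains where the data quality comes from.

Argumentative technique / potential weakness: Distilling reasoning traces from GPT-4o is an efficient way to build the data, but it introduces a dependence on the teacher model: the ceiling on LLaVA-CoT's reasoning quality may be bounded by GPT-4o's ability. Whether 100k samples are diverse enough to cover all reasoning patterns also deserves attention.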
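
To make the stage-token mechanism concrete, the following is a minimal Python sketch of how a stage-tagged output could be split back into its four parts. The stage names come from the delimiter tokens listed in the Method paragraph; the closing tags (`</SUMMARY>` and so on), the function name `parse_stages`, and the example text are assumptions made for illustration, not details confirmed by the text.

```python
import re

# Stage names taken from the delimiter tokens listed in the Method paragraph.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(output: str) -> dict[str, str]:
    """Split a stage-tagged model output into a {stage: text} mapping.

    Assumes each stage is wrapped as <STAGE>...</STAGE>; the closing tags
    are an assumption of this sketch. Raises ValueError if a stage is
    missing or the stages appear out of order.
    """
    stages = {}
    cursor = 0
    for name in STAGES:
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        # Search only after the previous stage so the fixed order is enforced.
        match = pattern.search(output, cursor)
        if match is None:
            raise ValueError(f"missing or out-of-order stage: {name}")
        stages[name] = match.group(1).strip()
        cursor = match.end()
    return stages


# Hypothetical model output, written by hand for this example.
example = (
    "<SUMMARY>Count the apples in the image.</SUMMARY>"
    "<CAPTION>Three red apples on a table.</CAPTION>"
    "<REASONING>Each apple is distinct, so the count is three.</REASONING>"
    "<CONCLUSION>3</CONCLUSION>"
)
parsed = parse_stages(example)
print(parsed["CONCLUSION"])
```

A parser like this also gives a cheap check on the "autonomous transition" claim questioned above: an output that raises `ValueError` is one where the model did not produce the expected stage structure.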

3.2 LLaVA-CoT-100k Dataset

The LLaVA-CoT-100k dataset contains 100,000 samples sourced from diverse visual question answering benchmarks. Each sample includes an image, a question, and a structured reasoning annotation with four explicitly delineated stages. The annotations are generated by prompting GPT-4o to produce step-by-step reasoning organized into the summary, caption, reasoning, and conclusion format. Quality filtering ensures that only samples where the GPT-4o reasoning leads to the correct final answer are retained. This curation process ensures high-quality structured reasoning supervision that captures the deliberate cognitive process rather than just the final answer.

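Paragraph function: Data foundation. Details how the dataset is built and how quality is controlled.

Logical role: The dataset is the cornerstone of the whole method, since how well structured reasoning is learned depends directly on annotation quality. The quality-filtering step is the key safeguard for data reliability.

Argumentative technique / potential weakness: Filtering to keep only correct answers may introduce sample bias: hard problems may be filtered out, leaving the model overconfident on simple tasks. In addition, the model may inherit GPT-4o's reasoning style wholesale, which limits reasoning diversity.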
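
The quality-filtering step described in this section can be sketched as follows. This is an illustration only: the `Annotated` record, the `normalize` rule, and the sample rows are hypothetical, and the text does not say how answers are actually compared.

```python
from dataclasses import dataclass


@dataclass
class Annotated:
    """One teacher-annotated sample (hypothetical schema for this sketch)."""
    question: str
    conclusion: str    # final answer taken from the teacher's conclusion stage
    ground_truth: str  # reference answer from the source benchmark


def normalize(answer: str) -> str:
    # Trim whitespace, drop trailing periods, and lowercase. Real answer
    # matching would likely need task-specific rules (units, option letters).
    return answer.strip().rstrip(".").strip().lower()


def keep_correct(samples: list[Annotated]) -> list[Annotated]:
    """Retain only samples whose teacher conclusion matches the reference."""
    return [
        s for s in samples
        if normalize(s.conclusion) == normalize(s.ground_truth)
    ]


# Hand-written rows for illustration; not taken from the dataset.
samples = [
    Annotated("How many apples?", "3", "3"),
    Annotated("What color is the car?", "Blue.", "blue"),
    Annotated("Which option is correct?", "B", "C"),  # teacher was wrong: dropped
]
kept = keep_correct(samples)
print(len(kept), "of", len(samples), "samples kept")
```

The sketch also makes the sample-bias concern visible: any question the teacher gets wrong disappears from the training set, however informative it might have been.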

3.3 Stage-Wise Retracing Search (SWIRES)

We propose SWIRES (Stage-Wise Retracing Search), an inference-time scaling method that operates at the stage level rather than the token level. Instead of generating multiple complete reasoning chains and selecting the best, SWIRES generates multiple candidates at each reasoning stage, evaluates them, and selects the best candidate before proceeding to the next stage. This approach is more efficient than token-level beam search because it concentrates search budget on the stages where reasoning quality matters most (typically the Reasoning stage), while spending fewer resources on stages where the output is more deterministic (like Conclusion).

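Paragraph function: Inference-time innovation. Describes the efficiency advantage of stage-level search.

Logical role: This paragraph extends the benefit of the four-stage structure to inference time: structure not only improves training quality but also lets inference-time compute be allocated more intelligently.

Argumentative technique / potential weakness: The efficiency argument for stage-level search holds intuitively, but the actual effect depends on accurate detection of stage boundaries and on the quality of the candidate-evaluation metric. If the model fails to produce clear stage boundaries on some inputs, SWIRES may degrade to random selection.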
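
The generate, evaluate, select, then proceed loop described in this section can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the `generate` and `score` callables, the per-stage `budget` dictionary, and the closing tags are all hypothetical, and the "retracing" behavior suggested by the method's name is not described in the text, so it is left out.

```python
from typing import Callable

STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

# Hypothetical interfaces: a generator proposes one candidate for a stage
# given the text so far; a scorer rates a candidate (higher is better).
Generator = Callable[[str, str, int], str]  # (prefix, stage, index) -> text
Scorer = Callable[[str, str], float]        # (stage, candidate) -> score


def stage_level_search(
    question: str,
    generate: Generator,
    score: Scorer,
    budget: dict[str, int],
) -> str:
    """Stage-by-stage selection: sample candidates, keep the best, move on."""
    prefix = question
    for stage in STAGES:
        n = max(1, budget.get(stage, 1))  # at least one candidate per stage
        candidates = [generate(prefix, stage, i) for i in range(n)]
        best = max(candidates, key=lambda c: score(stage, c))
        prefix += f"\n<{stage}>{best}</{stage}>"
    return prefix


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prefix: str, stage: str, i: int) -> str:
    return f"{stage.lower()} draft {i}"


def toy_score(stage: str, candidate: str) -> float:
    # Pretend later drafts are better; a real scorer might be the model
    # itself or a separate verifier.
    return float(candidate.rsplit(" ", 1)[1])


# More candidates for the reasoning stage, fewer for the conclusion.
budget = {"SUMMARY": 1, "CAPTION": 2, "REASONING": 4, "CONCLUSION": 1}
result = stage_level_search("Q: How many apples?", toy_generate, toy_score, budget)
print(result)
```

The `budget` dictionary is where the uneven allocation argued for in the text shows up. With a weak scorer, the `max` call picks an arbitrary candidate, which is the degradation scenario raised in the note above.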

4. Experiments

LLaVA-CoT is evaluated across six challenging multimodal reasoning benchmarks. The model outperforms its base model Llama-3.2-11B-Vision-Instruct by 8.9% on average. Remarkably, the 11B-parameter LLaVA-CoT surpasses Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct — models that are significantly larger or proprietary. The SWIRES inference method provides additional gains beyond standard greedy decoding, demonstrating effective inference-time scaling. Ablation studies show that all four stages contribute to performance, with the Reasoning stage being most critical, and that structured annotations outperform unstructured CoT annotations on the same base data.

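Paragraph function: Provides comprehensive experimental evidence covering multi-benchmark comparison, the effect of inference-time scaling, and ablation studies.

Logical role: The three highlights of the results (outperforming the base model, outperforming larger or closed-source models, and structured annotations beating unstructured ones) validate, respectively, the method's effectiveness, its efficiency, and the soundness of its design choices.

Argumentative technique / potential weakness: Outperforming a 90B model is impressive, but the choice of reasoning tasks matters: on tasks that require broad world knowledge rather than structured reasoning, the advantage of scale may reappear. The size of the additional gain from SWIRES is not explicitly quantified.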

5. Conclusion

LLaVA-CoT demonstrates that structured multistage reasoning can dramatically improve the reasoning capabilities of vision-language models without increasing model size. By organizing reasoning into summary, caption, reasoning, and conclusion stages and training on only 100k carefully curated samples, a relatively small 11B model can rival or exceed much larger systems. The SWIRES inference strategy further leverages this structure for efficient inference-time scaling. This work suggests that how models think matters as much as how large they are.

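Paragraph function: Summarizes the paper, condensing the core insight into one phrase ("how models think versus how large they are").

Logical role: The conclusion distills a philosophical lesson that goes beyond the method itself: reasoning structure may matter no less than model scale, which can guide research directions in the VLM community.

Argumentative technique / potential weakness: The closing "how models think versus how large they are" rhetoric is forceful but may oversimplify: on sufficiently complex tasks, scale and structure may be complementary rather than substitutes. Future work needs to explore the best combination of the two.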

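Argument structure overview

Problem: VLMs lack structured reasoning ability
→ Claim: four-stage autonomous reasoning + stage-level search
→ Evidence: an 11B model outperforms 90B and closed-source models
→ Rebuttal: 100k samples are enough; structure beats scale
→ Conclusion: how a model thinks matters as much as how large it is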

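Authors' core claim (in one sentence)

By organizing a vision-language model's reasoning into four explicit stages (summary, caption, reasoning, and conclusion) and training on 100k structured annotations, an 11B model can outperform models several times its size on multimodal reasoning benchmarks.

Strongest point of the argument

The empirical force of doing more with less: the result that an 11B model outperforms 90B and closed-source models argues strongly that structured reasoning is worth more than scaling alone. The ablation study further confirms the advantage of structured annotations over unstructured CoT, ruling out the "more data" explanation.

Weakest point of the argument

Implicit dependence on the teacher model: the reasoning traces in the training data are distilled from GPT-4o, which means the ceiling on LLaVA-CoT's reasoning quality is bounded by the teacher. In addition, the fixed four-stage order may not suit every type of reasoning task and may fall short in scenarios that need repeated verification or non-linear reasoning.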