Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Abstract

Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation by combining LLM and diffusion models, the state-of-the-art in each task, respectively. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order (e.g., raster scan). However, we show that spatial tokens lack the recursive structure inherent to languages, hence form an impossible language for LLM to master. In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. Our proposed tokens recursively compensate for the progressive attribute loss in noisy images as timesteps increase, enabling the diffusion model to reconstruct the original image at any timestep. This approach allows us to effectively integrate the strengths of LLMs in autoregressive reasoning and diffusion models in precise image generation, achieving seamless multimodal comprehension and generation within a unified framework. Extensive experiments show that we achieve superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs.

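Paragraph function: Overview of the whole paper. It moves step by step from the research trend of "unified multimodal" modeling, to the fundamental flaw of spatial tokens, to DDT tokenization as the remedy.

Logical role: The abstract serves a dual function of diagnosing the problem and previewing the solution. It first builds motivation with the strong claim of an "impossible language", then previews the technical approach through the idea of "recursive compensation", and finally closes with experimental results, forming a complete miniature argument.

Argumentative technique / potential weakness: "Impossible language" is extremely strong wording. It is rhetorically effective but needs rigorous formal support. The claim depends heavily on the authors' particular definition of "recursion"; under a different linguistic framework, spatial tokens would not necessarily be "impossible". This point is put to the test in the method sections.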

1. Introduction

Multimodal Large Language Models (MLLMs) strive to unify the comprehension and generation of data across various modalities within the same next-token prediction paradigm of LLM. Specifically, given a user query about comprehension — "What kind of dog is in this picture [IMG]", or generation — "Turning the picture [IMG] into a sketch", the model can complete the task by sequentially predicting the appropriate text or image tokens. The challenge lies in the conflicting objectives of the two tasks. Comprehension pursues a many-to-one mapping that abstracts visual details (e.g., many photos of corgi dogs result in recognition of "corgi"). On the other hand, generation finds a one-to-one mapping that preserves visual details (e.g., the sketched image specific to the query image). To bridge these conflicting objectives, the most straightforward way is to combine LLMs and diffusion models (DMs), which excel in comprehension and generation respectively.

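Paragraph function: Establishes the research setting by framing the core challenge of multimodal unification as a conflict between the objectives of comprehension and generation.

Logical role: The starting point of the argument chain. It first defines the MLLM vision (unified comprehension and generation), then identifies the fundamental tension (many-to-one vs. one-to-one), which leads naturally to the research direction of combining LLMs with diffusion models. This structure lays the logical groundwork for the later critique of spatial tokens.

Argumentative technique / potential weakness: The symmetric "many-to-one vs. one-to-one" framing is clear and easy to grasp, but the dichotomy is an oversimplification. Comprehension is not always many-to-one (fine-grained description calls for a one-to-one mapping), and generation is not strictly one-to-one (style transfer permits some variation). As motivation in an introduction, however, the simplification is reasonable.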

Two main integration strategies have emerged. Cascading approaches let LLMs process queries and pass instructions to diffusion models, while tokenization-based approaches discretize visual modalities into tokens for LLM processing before decoding back to images via diffusion models. Recent work like Transfusion aims to combine both into unified frameworks. However, a critical limitation persists: existing methods rely on spatial visual tokens, where image patches are encoded and arranged in a spatial order. These tokens lack the recursive structure inherent to natural language, forming what the authors term an "impossible language" for LLMs. Perturbation analysis in Figure 1 demonstrates that spatial tokens are robust to sequence disruption — shuffling their order barely affects generation quality — proving they lack the sequential dependency that defines language.

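Paragraph function: Critiques the existing paradigm, using experimental evidence to point out the fundamental flaw of spatial tokens.

Logical role: This is the most important turning point in the paper, moving from a review of existing methods to a fundamental critique. The perturbation analysis supplies empirical evidence, raising "spatial tokens are ill-suited to LLMs" from an observation to a testable claim and giving a rigorous motivation for DDT tokenization.

Argumentative technique / potential weakness: The perturbation analysis is a clever move: it uses negative evidence (shuffling does not hurt quality) to support a positive thesis (spatial tokens lack sequential dependency). Note, however, that the premise "language must have sequential dependency" itself needs a more rigorous linguistic argument. In addition, robustness to sequence order may in some settings be a strength rather than a weakness.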

In this paper, the authors propose Discrete Diffusion Timestep (DDT) tokenization to learn recursive tokens that reflect how diffusion progressively corrupts images. The DDT tokens form expanding token sequences that compensate for incrementally lost visual attributes as noise increases across timesteps. Three main contributions are outlined: (1) proposing DDT as a principled method to create recursive visual tokens with language-like properties; (2) introducing an integration method that combines LLMs with diffusion models through these tokens, enabling unified comprehension and generation; and (3) demonstrating state-of-the-art results across text-to-image generation, image editing, and vision-language understanding, notably with a tokenizer trained only on ImageNet at 256x256 resolution that surpasses methods using larger datasets like LAION.

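Paragraph function: Presents the solution and states the contributions, giving a complete overview of DDT's core innovation and its three contributions.

Logical role: Following the critique in the previous paragraph, this paragraph acts as the pivot from "spatial tokens do not work" to "the DDT token scheme". The three contributions correspond to theoretical innovation (recursive tokens), system design (LLM plus diffusion integration), and empirical validation, and together they preview the structure of the paper.

Argumentative technique / potential weakness: The emphasis in the third contribution on being "trained only on ImageNet at 256x256" is highly strategic. It showcases the method's data efficiency while leaving headroom to address possible quality limitations (such as weaker aesthetic quality) later. It also implies that the method's ceiling has not been explored, so readers need to judge in the experiments whether this "data efficiency" comes with a quality trade-off.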

2. Related Work

Research on achieving unified understanding and generation in multimodal models has primarily focused on two main strategies: cascading architectures and tokenization-based methods. Cascading architectures integrate separate modality-specific encoders and decoders, each pre-trained independently, and then fuse their representations through projection layers to create combined models for multimodal tasks. Notable examples include models such as EMU2, which uses pre-trained language models augmented with EVA-02-CLIP-E-plus for comprehension tasks, and cascades an SDXL-initialized diffusion model for visual generation tasks.

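Paragraph function: Literature review that establishes the technical background of cascading architectures.

Logical role: As the opening of the related work, this paragraph sorts existing methods into two camps, preparing the later argument that both fall short. The description of cascading architectures stresses the structural pattern of independent pre-training followed by projection-based fusion, hinting that the integration is shallow.

Argumentative technique / potential weakness: EMU2 is a reasonable representative of cascading architectures, but the authors only describe its structure and do not evaluate its performance in depth. The paragraph works more as classification than as critical review; the real critique is deferred to the next paragraph.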

In contrast, tokenization-based methods aim to create a unified framework by converting visual and textual inputs into a discrete space, and then jointly training a single transformer based solely on next-token prediction. Moreover, recent advances such as Transfusion and Show-o explore a blend of diffusion and autoregressive models within a single transformer for enhanced performance. However, although these methods provide a step towards unification, most of them focus on spatial tokens for vision, which are extracted from image patches and arranged in a spatial order. These spatial tokens "lack the traits of human language, resulting in an inability to seamlessly integrate with human natural language" within an MLLM. Consequently, existing MLLMs still lag behind specialized architectures like SDXL in visual generation tasks and LLaVA-1.6 in visual comprehension tasks. These results indicate the need for further exploration into more holistic tokenization methods that go beyond spatial representations.

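Paragraph function: Critical literature review that identifies the core flaw of tokenization-based methods and establishes the research gap.

Logical role: This paragraph is the argumentative core of the related work. It first acknowledges the direction of tokenization-based methods ("a step towards unification"), then pivots on the linguistic deficiency of spatial tokens, and closes with the performance gap as empirical support. This three-part structure of acknowledgment, critique, and evidence lays a tight logical foundation for proposing DDT.

Argumentative technique / potential weakness: Attributing the performance gap between MLLMs and specialized models (SDXL, LLaVA-1.6) to the flaw of spatial tokens is a causal inference that requires more tightly controlled experiments; the gap could stem from other factors such as training data volume or model scale. The authors sidestep a direct causal claim with the weakly causal phrasing "indicate the need for".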

3. Diffusion Timestep Tokenizer

3.1 Architecture Overview

The authors present a tokenizer architecture with three core components. The goal is to "train an image tokenizer that encodes an image to a recursive sequence of discrete tokens" while enabling decoding back to images. The system comprises: (1) an encoder mapping noise-free images to continuous features, (2) a vector quantizer assigning features to token embeddings from a fixed dictionary, and (3) a diffusion model decoder reconstructing images from tokens and noisy inputs. Unlike conventional image tokenizers that produce spatially arranged tokens, the DDT tokenizer produces temporally structured tokens aligned with diffusion timesteps, establishing a recursive dependency between tokens.

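Paragraph function: Method overview that outlines the overall design of the DDT tokenizer through its three-component architecture.

Logical role: As the opening of the method section, this paragraph serves as an architecture map: readers first get the whole picture (encoder, quantizer, decoder) and then go into each component in the following subsections. The contrast between temporally structured tokens and spatially arranged tokens directly echoes the core critique in the introduction.

Argumentative technique / potential weakness: Describing the relationship between tokens as a "recursive dependency" is the method's main selling point, but at this stage readers cannot yet tell what "recursive" means mathematically. Is it like recursive syntax in natural language (embedded clauses), or simply a progressively accumulating sequence structure? This ambiguity is gradually clarified in later paragraphs.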

The encoder uses a transformer-based architecture with learnable query tokens. Input consists of patchified images (similar to Vision Transformers) and T learnable query tokens. The system applies "1D and 2D sinusoidal position embeddings on the query and image tokens, respectively." Following the SD3 architecture, it employs "two independent transformers to process the query and image tokens, and join them for the attention operation." Only the transformed query tokens are retained as output. A standard Vector Quantization module with a fixed-size dictionary of 65,536 entries quantizes encoder outputs. The system projects encoder outputs from 256 dimensions to 16 dimensions before lookup, using an EMA-variant of VQ for training stability. Dead entries are monitored and reset to random tokens at each training step.

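Paragraph function: Technical detail on the concrete implementation of the encoder and quantizer.

Logical role: This paragraph moves from the abstract architecture down to the implementation level. The dual-transformer design lets image features and query tokens be processed independently before they interact, preserving their separate representation spaces while fusing information through attention. The dimensionality reduction in vector quantization (256 -> 16) is a key engineering technique for controlling dictionary utilization.

Argumentative technique / potential weakness: The dictionary size of 65,536 and the 16-dimensional quantization space are empirical design choices, and the text does not provide sufficient ablations to show that these hyperparameters are optimal. Moreover, although resetting dead entries is practical, it implies that codebook collapse in vector quantization remains an issue in the DDT architecture.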
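
The project-then-look-up step described for the quantizer can be illustrated with a minimal sketch. It is not the authors' implementation: the random projection weights, the reduced dictionary size of 512 (the text states 65,536), and the helper names `project` and `quantize` are all assumptions made for illustration, and the EMA update and dead-entry reset are omitted.

```python
import random

def project(vec, weight):
    """Linear projection; weight holds one row of coefficients per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def quantize(feature, codebook):
    """Return (index, entry) of the codebook entry closest to feature in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, codebook[i])),
    )
    return best, codebook[best]

random.seed(0)
ENC_DIM, CODE_DIM = 256, 16   # dimensions stated in the text
DICT_SIZE = 512               # assumption: kept small so the demo runs quickly

# Hypothetical random weights and dictionary, standing in for learned parameters.
weight = [[random.gauss(0, 0.05) for _ in range(ENC_DIM)] for _ in range(CODE_DIM)]
codebook = [[random.gauss(0, 1) for _ in range(CODE_DIM)] for _ in range(DICT_SIZE)]

query_output = [random.gauss(0, 1) for _ in range(ENC_DIM)]  # one transformed query token
low_dim = project(query_output, weight)          # 256 -> 16 before the lookup
token_id, embedding = quantize(low_dim, codebook)  # discrete token and its embedding
```

Running the lookup once per query token would turn the T encoder outputs into T discrete token ids, which is the sequence the later sections refer to as DDT tokens.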

3.2 Decoder and Training

The decoder is a diffusion model based on the MMDiT architecture from SD3 with modifications. Rather than text conditioning tokens, it "inputs a sequence of quantized tokens" with "a linear layer to project the m-dimensional vector to the latent dimension." The system learns recursive token sequences where later tokens compensate for attribute loss in progressively noisier images. The training loss minimizes reconstruction error across timesteps, following Rectified Flow sampling methods. At timestep t, the decoder receives the first t tokens (masking remaining tokens) and attempts to reconstruct the clean image from the noisy input. A commitment loss regularizes encoder outputs to match quantized vectors. This design ensures that each successive token encodes the residual visual information lost at the corresponding noise level, creating a natural recursive hierarchy.

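Paragraph function: Core innovation. Explains how DDT achieves recursive learning through the timestep masking mechanism.

Logical role: This paragraph is the technical core of the paper. Providing only the first t tokens at timestep t is the mechanism that realizes recursion: the first token corresponds to the lowest noise level, where only the finest attributes are lost, and later tokens progressively compensate for the coarser attributes lost at higher noise levels. This structure is analogous to the recursive character of natural language, in which the subject is stated first and modifiers are added afterwards.

Argumentative technique / potential weakness: Mapping the denoising levels of diffusion timesteps to recursive levels of tokens is an elegant conceptual bridge. Yet this kind of "recursion" differs in nature from recursion in linguistics (such as nested clauses): DDT's recursion is closer to progressive refinement than to structural nesting. Are the authors to some extent borrowing linguistic terminology to package what is essentially multi-scale residual coding?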
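
The training-time conditioning described above (first t tokens visible, the rest masked, paired with a noisier input as t grows) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the mask id of -1, the linear mapping s = t / T from discrete step to noise level, and the function names are hypothetical, and the decoder network and loss are left out.

```python
import random

def mask_tokens(tokens, t, mask_id=-1):
    """Keep the first t tokens and replace the remaining ones with a mask id."""
    return tokens[:t] + [mask_id] * (len(tokens) - t)

def noisy_input(x0, noise, t, T):
    """Rectified-flow style linear interpolation between a clean sample and noise.

    Mapping the discrete step t to the level s = t / T is an assumption of this sketch.
    """
    s = t / T
    return [(1 - s) * a + s * b for a, b in zip(x0, noise)]

random.seed(0)
T = 8                                                  # number of timesteps and tokens
tokens = [random.randrange(65536) for _ in range(T)]   # one token per timestep
x0 = [random.uniform(-1, 1) for _ in range(4)]         # toy "image" with four values
noise = [random.gauss(0, 1) for _ in range(4)]

t = random.randint(1, T)                # sample a timestep for this training example
decoder_tokens = mask_tokens(tokens, t) # what the decoder is allowed to see
x_t = noisy_input(x0, noise, t, T)      # the noisy input it has to clean up
```

At t = T the input is pure noise and every token is visible, so the full sequence has to carry everything needed to reconstruct the image; at small t most of the image survives in x_t and few tokens are needed.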

4. DDT-LLaMA

4.1 Model Architecture

The paper describes converting images into "a 1D recursive discrete sequence like a foreign language that LLM can read." Visual tokens are concatenated with text tokens in multimodal sequences, with special markers [BOV] and [EOV] distinguishing modalities. The model initializes from LLaMA-3-8B, expanding vocabulary by 65,536 visual codes. Training employs a unified next-token-prediction objective across image-text data pairs using cross-entropy loss, where both modalities share a prediction head. This design means that no separate diffusion loss or auxiliary objectives are needed during MLLM training — the standard language modeling loss suffices for both text and image generation.

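Paragraph function: System design describing how DDT tokens are integrated into the LLM training framework.

Logical role: This paragraph connects the tokenizer (previous section) to the full MLLM system. Unified next-token prediction is the most important design choice: it turns visual generation entirely into a language modeling problem, so the LLM training infrastructure (optimizer, learning rate schedule, and so on) can be reused unchanged.

Argumentative technique / potential weakness: The "foreign language" metaphor is intuitive and persuasive, but it carries an implicit assumption that an LLM's ability to learn a visual "foreign language" resembles learning a new language. Human learners need considerable adjustment of their cognitive machinery to acquire a foreign language, and whether LLMs have comparable flexibility awaits further theoretical analysis. The shared prediction head is simple, but it may also limit modality-specific expressiveness.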
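
One plausible way to realize the shared vocabulary is to place the 65,536 visual codes directly after the text vocabulary and wrap them in the two markers. The sketch below assumes a text vocabulary of 128,256 entries and assigns hypothetical ids to [BOV] and [EOV]; the text does not specify the actual id layout.

```python
TEXT_VOCAB = 128_256            # assumption: size of the text vocabulary
NUM_VISUAL = 65_536             # number of visual codes, as stated in the text
BOV = TEXT_VOCAB + NUM_VISUAL   # hypothetical id for the [BOV] marker
EOV = BOV + 1                   # hypothetical id for the [EOV] marker

def visual_to_llm_ids(codes):
    """Shift tokenizer codes into the extended vocabulary range."""
    for code in codes:
        if not 0 <= code < NUM_VISUAL:
            raise ValueError(f"visual code out of range: {code}")
    return [TEXT_VOCAB + code for code in codes]

def build_sequence(text_ids, codes):
    """Concatenate text ids with the marked-up visual ids into one sequence."""
    return list(text_ids) + [BOV] + visual_to_llm_ids(codes) + [EOV]

sequence = build_sequence([11, 22, 33], [0, 4095, 65_535])
```

Because every position is an id in one vocabulary, a single cross-entropy loss over next-token predictions covers both modalities, which is the property the paragraph above highlights.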

4.2 Training Stages

Training proceeds in two stages. Stage 1 (Pre-training) uses 200 million image-text pairs from LAION and COYO datasets, structured as [BOS] <caption> [BOV] <DDT tokens> [EOV] [EOS]. Pure text data comprises 10% of training to prevent textual capability degradation. Training ran on 512 Ascend 910B NPUs for nearly two weeks. Stage 2 (Instruction Tuning) applies supervised fine-tuning on visual comprehension and generation tasks using the format "USER: <Instructions> ASSISTANT: <Answers>", computing the loss only on assistant responses. During inference, after predicting [BOV], the model generates T sequential DDT tokens. These feed into the decoder, which denoises random noise over T timesteps via DDPM, providing only the first t tokens at timestep t while masking remaining tokens.

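Paragraph function: Implementation detail giving a complete account of the two-stage training pipeline and the inference mechanism.

Logical role: This paragraph supplies key details needed for reproducibility. The pre-training scale of 200 million image-text pairs and the roughly two weeks of training show the method's computational demands. Mixing in 10% pure text data is an important engineering choice for preventing catastrophic forgetting. The step-by-step token conditioning in the inference procedure directly reflects the recursive nature of DDT.

Argumentative technique / potential weakness: Two weeks on 512 Ascend 910B NPUs implies a high computational barrier, but the authors do not explicitly report this cost. A comparison with the compute cost of similar methods (such as Emu3) would help assess practicality. In addition, inference requires a T-step denoising process; could accelerated sampling (such as DDIM) reduce inference latency? The text does not adequately discuss this.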
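
The inference procedure can be sketched as a loop that walks from the noisiest step down to the cleanest, exposing fewer tokens each time. The `denoise_step` callback is a hypothetical stand-in for one decoder update, and the descending order T, T-1, ..., 1 and the mask id of -1 are assumptions; the text only says that the first t tokens are provided at timestep t.

```python
def sample(tokens, x_T, denoise_step, mask_id=-1):
    """Run T denoising steps, conditioning step t on the first t tokens only.

    denoise_step(x, t, cond) is a hypothetical callback that returns the
    updated sample; a real decoder would apply one diffusion update here.
    """
    T = len(tokens)
    x = x_T
    for t in range(T, 0, -1):                       # from pure noise down to step 1
        cond = tokens[:t] + [mask_id] * (T - t)     # later tokens stay masked
        x = denoise_step(x, t, cond)
    return x

# Toy usage: a fake step that only records what it was given.
seen = []

def recording_step(x, t, cond):
    seen.append((t, cond))
    return x

result = sample([10, 11, 12], 0.0, recording_step)
```

The loop makes the latency concern raised above concrete: the number of decoder calls equals the number of tokens T unless a faster sampler is substituted.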

5. Experiments

5.1 Text-to-Image Generation

The researchers evaluated DDT-LLaMA using three benchmarks: GenEval, T2I-CompBench, and DrawBench. Results showed the model "significantly outperforms SEED-X, Emu3, and the specialist method SDXL" on GenEval overall scores. The model achieved a GenEval score of 0.66, compared to 0.54 for Emu3 and 0.49 for SEED-X. DDT-LLaMA demonstrated particular strength in "tasks related to color, counting, and position," excelling at "understanding object attributes such as color, quantity, and the spatial relationships between objects." Qualitative examples show DDT-LLaMA "effectively follows various types of instructions, including complex ones such as generating surreal images and multi-condition combined prompts." However, the researchers acknowledge that "there is room for improvement in the aesthetic quality of the images. This limitation stems from the fact that our current tokenizer was only trained on ImageNet with a resolution of 256x256."

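Paragraph function: Experimental validation, using quantitative results on multiple benchmarks to demonstrate strength in text-to-image generation.

Logical role: This is the flagship result of the experiments. The large GenEval lead of 0.66 vs. 0.54 (Emu3), together with surpassing the specialist SDXL, directly supports the core thesis that DDT tokens are better than spatial tokens. The advantages in color, counting, and position suggest that DDT's recursive structure does help the LLM understand compositional semantics.

Argumentative technique / potential weakness: Openly acknowledging the aesthetic shortfall and attributing it to training on ImageNet at 256x256 is both candid and forward-looking: it admits the current weakness while implying that more training data will resolve it. But GenEval emphasizes semantic consistency rather than aesthetic quality, so using it as the main benchmark may favor DDT-LLaMA's recursive semantic strengths while sidestepping its aesthetic weakness.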

5.2 Image Editing & Vision-Language Comprehension

Evaluation across three editing datasets (EVR, MA5K, MagicBrush) showed strong performance relative to other MLLMs. On MagicBrush, DDT-LLaMA achieved an L1 score of 7.1 (lower is better) versus 8.2 for MGIE and 6.6 for UltraEdit, so it improves on MGIE but trails UltraEdit on this metric. The model "demonstrates significant superiority over existing MLLMs across all test datasets and metrics," supporting "a wide spectrum of editing operations including both local change (e.g., removal, replacement) and global change (change time, manipulation)," while achieving "a great trade-off between fidelity and editability." For vision-language comprehension, comparisons across nine benchmarks (NoCaps, Flickr30K, VQA, GQA, OKVQA, VizWiz, MME, SEEDBench, POPE) show DDT-LLaMA achieves competitive or superior results. On NoCaps, it scored 124.2 versus 117.5 for Emu3 and 114.2 for LaVIT. The model "significantly surpasses its counterparts across multiple benchmarks, even without relying on a specialized pretrained CLIP."

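Paragraph function: Multi-faceted validation, using two additional tasks, image editing and visual comprehension, to test how well the method generalizes.

Logical role: This paragraph widens the scope of the previous one from text-to-image to image editing and visual comprehension, showing that DDT tokens improve comprehension as well as generation. This directly supports the abstract's promise of unified comprehension and generation. The emphasis on not needing CLIP further highlights the method's self-sufficiency.

Argumentative technique / potential weakness: The broad coverage of nine comprehension benchmarks adds persuasive force, but some benchmarks (such as POPE) mainly test hallucination tendency rather than deep understanding. In addition, the editing comparison includes UltraEdit (6.6 vs. 7.1 for DDT-LLaMA), which shows that DDT-LLaMA does not lead on every metric. The authors' phrase "great trade-off" frames this as balance rather than compromise.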

5.3 In-Depth Analysis

Several in-depth analyses validate the core claims. For class-conditional generation, DDT achieves reconstruction PSNR of 21.7 on ImageNet validation and FID of 6.1, surpassing MoVQ's 7.1 and RQ-Transformer's 7.6. Counterfactual interpolation experiments demonstrate that "DDT tokens, with their disentangled representation, ensure that only the attributes captured by the substituted tokens change in the generated counterfactuals." Expanding token subset experiments show "with the number of tokens increasing, the image's attributes are progressively recovered — initially the decoder reconstructs fine details, with contours and color information gradually completed." Critically, training with DDT tokens "results in a decrease in perplexity for text generation, whereas training with VQGAN tokens yields the opposite" — demonstrating DDT tokens are "better suited for unified autoregressive modeling with text tokens." In A/B testing using Gemma2-2B, DDT-Gemma outperforms MoVQ-Gemma in 65 editing cases while MoVQ-Gemma surpasses DDT-Gemma in only 10 cases. Preliminary results also indicate scaling laws in DDT-based MLLM, with improvements across model sizes (2B, 8B).

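Paragraph function: Mechanism validation, using several ablation and analysis experiments to verify the recursiveness of DDT tokens and their compatibility with language.

Logical role: This paragraph closes the loop on the paper's argument. The drop in text perplexity directly supports the core thesis that DDT tokens are a language an LLM can master: if a tokenization scheme were incompatible with text tokens, joint training should raise text perplexity (as VQGAN shows). The counterfactual interpolation and progressive decoding experiments verify recursiveness from the generation side.

Argumentative technique / potential weakness: The perplexity contrast (down for DDT, up for VQGAN) is highly persuasive evidence because it turns the abstract notion of "language compatibility" into a measurable quantity. However, although the A/B test counts (65:10) are clearly lopsided, no statistical significance is reported. The preliminary scaling observation is encouraging, but two scale points (2B, 8B) are too few to establish a true power-law relationship.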
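
The missing significance check for the 65:10 split can be approximated with an exact two-sided sign test. This is a reader-side sketch, not an analysis from the paper: it assumes the 75 decided cases are independent, ignores any ties or undecided cases (the text does not report them), and tests against the null hypothesis that either model is equally likely to win a case.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test under the null that each side wins with probability 0.5."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = sign_test_p_value(65, 10)   # DDT-Gemma wins vs. MoVQ-Gemma wins
print(f"two-sided sign test p-value: {p_value:.3g}")
```

Under these assumptions the split is far from what chance would produce, so the concern is less about significance than about how the editing cases were selected and judged, which the text does not describe.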

6. Conclusion

The authors propose Discrete Diffusion Timestep (DDT) tokenization to "learn discrete, recursive visual tokens, which recursively compensate for the progressive attribute loss in noisy images as timesteps increase." They trained DDT-LLaMA on vast image-text corpora for vision-language alignment. The approach achieves "superior performance for multimodal comprehension and generation simultaneously compared with other MLLMs," excelling across text-to-image generation, image editing, and vision-language understanding. The paper notes they are "currently working on scaling up the training of our DDT tokenizer and the MLLM," with plans to release enhanced versions. This work demonstrates that the key to unifying multimodal comprehension and generation lies not in model architecture alone, but in designing a visual token language that is structurally compatible with the autoregressive nature of LLMs.

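Paragraph function: Summarizes the paper by restating the core contributions and results and looking ahead to future directions.

Logical role: The conclusion mirrors the structure of the abstract and closes the argument. The final sentence lifts the contribution from the technical level to the methodological level (the key lies not in architecture alone but in the design of the token language), offering conceptual guidance for the research direction as a whole.

Argumentative technique / potential weakness: The statement about scaling up training is both an admission of the current limitation (ImageNet at 256x256) and a hint at room for future improvement. But the conclusion does not discuss more fundamental limitations. For example, is DDT's "recursiveness" better than spatial tokens in every visual task? Are there settings where spatial tokens fit better, such as tasks that need precise spatial localization? A more complete conclusion would reflect on the boundaries of the method's applicability.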

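Overview of the Argument Structure

Problem: Spatial visual tokens lack the recursive structure of language.

→

Claim: Build recursive visual tokens from diffusion timesteps.

→

Evidence: GenEval 0.66, surpassing Emu3; validation across nine benchmarks.

→

Rebuttal: Perturbation analysis plus the perplexity comparison demonstrate language compatibility.

→

Conclusion: The design of the visual token language is the key to multimodal unification.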

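The Authors' Core Claim (in One Sentence)

By using diffusion model timesteps to build recursive discrete visual tokens, a large language model can reach state-of-the-art performance in both multimodal comprehension and generation under a unified next-token prediction paradigm.

Strongest Point of the Argument

The text perplexity comparison: after joint training with text tokens, DDT tokens lower text perplexity, whereas VQGAN tokens raise it. This result turns the abstract notion of "language compatibility" into a measurable quantity, demonstrates the language-like nature of DDT tokens in the most direct way, and is the paper's most persuasive piece of empirical support.

Weakest Point of the Argument

The linguistic rigor of the "recursion" concept: DDT's "recursion" is closer to the progressive refinement of multi-scale residual coding than to the structural recursion of nested clauses in linguistics. The authors motivate the whole paper with the strong claim of an "impossible language", but their formal definition of "recursion" is not rigorous enough, leaving the linguistic basis of the core thesis relatively weak. In addition, because the tokenizer was trained only on ImageNet at 256x256, how well its aesthetic quality generalizes remains to be verified.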