MaskGIT: Two-Column Annotated Reading

Abstract

We propose MaskGIT, a novel image synthesis paradigm using a bidirectional transformer decoder. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference, the model begins by generating all tokens simultaneously, then iteratively refines the image by re-masking and re-predicting low-confidence tokens. MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, while accelerating autoregressive decoding by up to 64x. We further demonstrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.

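Paragraph function: Overview of the whole paper: a paradigm shift that replaces autoregressive sequence generation with masked modeling.

Logical role: Claims at three levels: quality (surpassing the state of the art), efficiency (64x speedup), and generality (multiple editing tasks).

Argumentative technique / potential weakness: The 64x speedup is an eye-catching number, but the speedup factor depends heavily on the chosen number of iterations, so the comparison needs to be checked for fairness at matched quality.

1. Introduction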

Existing generative transformers treat an image as a 1D sequence of tokens in raster-scan order (left to right, line by line). This is fundamentally problematic because images are inherently 2D and non-sequential. Because the number of tokens grows quadratically with the side length of the token grid, sequential decoding requires substantial computation: roughly 30 seconds on a GPU to generate a single image of 32x32 tokens. This bottleneck limits practical deployment and restricts the achievable image resolution.

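Paragraph function: Establishes the problem by criticizing the fundamental flaws of autoregressive sequence generation.

Logical role: The contradiction of forcing 2D images into 1D sequences pinpoints the structural problem of autoregressive methods.

Argumentative technique / potential weakness: The concrete figure of 30 seconds makes the problem tangible. However, the advantage of autoregressive methods in token quality (causal consistency) goes undiscussed.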

MaskGIT adopts a non-autoregressive approach inspired by how artists create paintings: starting with a sketch and progressively refining details. The model predicts all tokens in parallel at each iteration, retaining only the most confident predictions while masking uncertain ones for refinement in subsequent iterations. This constant-iteration approach completes 256-token images in 8 steps versus 256 steps for autoregressive methods. Key technical innovations include: (1) bidirectional self-attention enabling context from all directions, (2) a novel mask scheduling strategy determining which tokens to predict at each iteration, and (3) iterative refinement balancing quality and efficiency.

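Paragraph function: Proposes the solution, introducing iterative parallel generation through the painter analogy.

Logical role: The contrast of 8 steps versus 256 steps shows the efficiency advantage at a glance, and the three innovations are clearly structured.

Argumentative technique / potential weakness: The painter analogy is an excellent intuitive explanation of coarse-to-fine progressive generation. Whether 8 iterations can truly match the fidelity of 256 sequential steps still has to be verified experimentally.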

Our work is inspired by BERT's masked language modeling for bidirectional representation learning. While masked modeling has been extended to vision tasks (MAE, BEiT) for representation learning, few works have applied it successfully to image generation on standard benchmarks. We provide the first evidence demonstrating the efficacy of masked modeling for image generation on the common ImageNet benchmark. Our approach borrows from bidirectional machine translation but introduces novel masking strategies and decoding algorithms specifically designed for generation rather than representation learning.

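Paragraph function: Positions the academic contribution: transferring masked modeling from representation learning to generation.

Logical role: "First evidence" is a strong priority claim that extends BERT's success from NLP to image generation.

Argumentative technique / potential weakness: Deftly transfers BERT's reputation to image generation. The "first" claim should be checked against concurrent work that may have been overlooked.

2. Method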

The model learns Masked Visual Token Modeling (MVTM), inspired by BERT's approach. Given latent tokens Y and binary mask M, the training objective minimizes the negative log-likelihood of masked tokens. Training randomly masks a variable number of tokens determined by a mask scheduling function: the number of masked tokens equals the ceiling of gamma(r) times N, where gamma is the scheduling function, r is a random ratio, and N is the total sequence length. The bidirectional transformer predicts probability distributions for masked positions, utilizing context from all directions rather than only preceding tokens as in autoregressive models.

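Paragraph function: Training mechanism: the MVTM masking strategy and bidirectional attention.

Logical role: The mask scheduling function gamma(r) is the bridge between training and inference, and its design directly affects generation quality.

Argumentative technique / potential weakness: There is a training-inference mismatch between BERT-style masked training and token generation; the iterative refinement mechanism is designed precisely to address it.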
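
The training step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine form gamma(r) = cos(pi/2 * r) is assumed here as one function satisfying the stated constraints, and the names `cosine_schedule`, `sample_training_mask`, and `masked_nll` are hypothetical.

```python
import math
import random


def cosine_schedule(r):
    """Assumed mask ratio gamma(r) = cos(pi/2 * r): gamma(0) = 1, gamma(1) = 0."""
    return math.cos(math.pi / 2.0 * r)


def sample_training_mask(num_tokens, rng):
    """Return a boolean mask with ceil(gamma(r) * N) positions set to True."""
    r = rng.random()  # random ratio r in [0, 1)
    num_masked = math.ceil(cosine_schedule(r) * num_tokens)
    num_masked = min(num_masked, num_tokens)  # guard against rounding above N
    masked_positions = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked_positions for i in range(num_tokens)]


def masked_nll(token_probs, targets, mask):
    """Average negative log-likelihood, computed over masked positions only."""
    losses = [
        -math.log(token_probs[i][targets[i]])
        for i in range(len(targets))
        if mask[i]
    ]
    return sum(losses) / len(losses)
```

In a real model, `token_probs` would come from the bidirectional transformer applied to the masked token sequence; here it is simply an argument so that the loss can be checked in isolation.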

At inference, the iterative decoding algorithm proceeds in T steps: (1) Predict: the model generates probability distributions for all masked positions in parallel; (2) Sample: a token is sampled at each masked position, and its predicted probability serves as the confidence score; (3) Mask Schedule: compute the number of tokens to mask next as gamma(t/T) times N, where t is the current step; (4) Mask: retain only the most confident predictions and mask lower-confidence positions for refinement. This process continues until all tokens are generated. The mask scheduling function critically impacts generation quality: it must be continuous, bounded in (0,1], and monotonically decreasing with gamma(0) approaching 1 and gamma(1) approaching 0.

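Paragraph function: Inference mechanism: the complete four-step iterative decoding procedure.

Logical role: The predict-sample-schedule-mask loop is the core of the method, and the confidence-driven selection mechanism is what safeguards quality.

Argumentative technique / potential weakness: The mathematical constraints (continuous, bounded, monotonically decreasing) ensure that decoding ends with every token generated, which gives the algorithm a principled footing. However, how well the confidence scores are calibrated directly determines which tokens are kept.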
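
The four-step loop above can be sketched as follows. This is an illustrative toy, not the paper's code: `toy_model` is a hypothetical stand-in that returns random distributions instead of transformer predictions, `MASK_ID` is a hypothetical sentinel, the cosine schedule is an assumed form, and rounding down (so the last step leaves nothing masked) plus committing at least one token per step are choices made here for the sketch.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel marking a masked position


def cosine_schedule(r):
    """Assumed mask ratio gamma(r) = cos(pi/2 * r)."""
    return math.cos(math.pi / 2.0 * r)


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the bidirectional transformer: one random distribution per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def iterative_decode(num_tokens, vocab_size, num_steps, rng):
    tokens = [MASK_ID] * num_tokens  # start fully masked
    for t in range(1, num_steps + 1):
        probs = toy_model(tokens, vocab_size, rng)  # (1) predict in parallel
        candidates = {}
        for i, tok in enumerate(tokens):
            if tok == MASK_ID:
                # (2) sample a token; its probability is the confidence score
                sampled = rng.choices(range(vocab_size), weights=probs[i])[0]
                candidates[i] = (sampled, probs[i][sampled])
        if not candidates:
            break  # everything is already committed
        # (3) mask schedule: how many tokens stay masked after this step
        num_to_mask = math.floor(cosine_schedule(t / num_steps) * num_tokens)
        num_to_mask = min(num_to_mask, len(candidates) - 1)  # commit at least one token
        # (4) mask: keep the most confident samples, leave the rest masked
        ranked = sorted(candidates, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[: len(candidates) - num_to_mask]:
            tokens[i] = candidates[i][0]
    return tokens
```

Calling `iterative_decode(256, 1024, 8, random.Random(0))` returns 256 token ids after 8 model calls; with a trained model in place of `toy_model`, those ids would then be passed to the tokenizer's decoder to produce pixels.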

The paper evaluates three families of scheduling functions: linear (an equal number of tokens unmasked per iteration), concave (cosine, square, cubic, exponential), which commit only a few tokens in early iterations, when few predictions are likely to be correct, and convex (square root, logarithmic), which require most tokens to be finalized early. Ablations show that concave functions outperform the others, with cosine achieving the best results. The authors hypothesize that concave functions succeed by challenging training with difficult cases and by following a less-to-more prediction progression: generating a few tokens with high confidence initially, then committing to progressively more tokens as context accumulates.

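Paragraph function: Mask schedule design: a systematic comparison of three function families.

Logical role: The choice of scheduling function is the key hyperparameter governing the generation process; the less-to-more strategy of concave functions parallels the coarse-to-fine strategy of diffusion models.

Argumentative technique / potential weakness: The success of the cosine schedule may stem from its similarity to the noise schedules of diffusion models, both of which allocate the most computation to the intermediate stages. The optimal schedule may nonetheless be dataset-specific.

3. Experiments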
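
To make the difference between the schedule families concrete, the sketch below prints how many of 256 tokens remain masked after each of 8 steps. The closed forms are assumptions chosen to satisfy gamma(0) = 1 and gamma(1) = 0 for each named family; the text above names the families but does not give formulas, and the exponential and logarithmic variants are omitted.

```python
import math

# Assumed closed forms for some of the named schedule families.
SCHEDULES = {
    "linear": lambda r: 1.0 - r,
    "cosine": lambda r: math.cos(math.pi / 2.0 * r),
    "square": lambda r: 1.0 - r ** 2,
    "cubic": lambda r: 1.0 - r ** 3,
    "square_root": lambda r: 1.0 - math.sqrt(r),
}


def masked_counts(schedule, num_tokens=256, num_steps=8):
    """Number of tokens still masked after each of the T decoding steps."""
    return [
        math.floor(schedule(t / num_steps) * num_tokens)
        for t in range(1, num_steps + 1)
    ]


for name, fn in SCHEDULES.items():
    print(f"{name:12s}", masked_counts(fn))
```

Under these assumed forms, the concave schedules keep most tokens masked after the first step, while the square-root schedule has already committed a large share, which is the less-to-more versus more-to-less contrast described above.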

The implementation uses 24 transformer layers, 8 attention heads, 768 embedding dimensions, and 3072 hidden dimensions. Images are compressed by factor 16 using a tokenizer with 1024-token codebook. Training uses 4x TPU devices, batch size 256, Adam optimizer. On ImageNet 256x256, MaskGIT achieves FID of 6.18 versus VQGAN's 15.78 and Inception Score 182.1 versus 78.3. At 512x512 resolution, FID reaches 7.32, establishing new state-of-the-art on classification accuracy score and FID metrics, exceeding BigGAN's 8.43 FID at this resolution.

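Paragraph function: Core quality results: across-the-board gains in FID and IS.

Logical role: The large improvement in FID from 15.78 to 6.18 (-60.8%) is the strongest empirical evidence of the method's effectiveness.

Argumentative technique / potential weakness: Outperforming both VQGAN (also a transformer) and BigGAN (a different architecture family) strengthens the robustness of the conclusion. However, the limitations of FID itself (its preference for mode interpolation) may mask some differences in quality.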
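
The arithmetic behind two of the figures above is easy to verify. The sketch below derives the token count from the compression factor of 16 and recomputes the relative FID reduction; the function names are hypothetical, and square images with a square token grid are assumed.

```python
def num_tokens(image_size, downsample_factor=16):
    """Tokens per image, assuming a square image and a square token grid."""
    side = image_size // downsample_factor
    return side * side


def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (baseline - improved) / baseline


for size in (256, 512):
    print(f"{size}x{size} image -> {num_tokens(size)} tokens")

print(f"FID reduction: {relative_reduction(15.78, 6.18):.1%}")
```

The two token counts, 256 and 1024, match the number of sequential steps an autoregressive decoder needs at each resolution, one step per token.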

MaskGIT requires 8-12 inference steps versus VQGAN's 256-1024 steps. Wall-clock runtime comparisons demonstrate MaskGIT accelerates VQGAN by 30-64x, with speed advantages increasing at higher resolutions due to token sequence length growth. Using Classification Accuracy Score (CAS), MaskGIT achieves 63.14 (top-1) and 84.45 (top-5) on 256x256 versus VQGAN's 53.10 and 76.18. Precision/Recall analysis shows MaskGIT balances quality and coverage better than single-mode GANs, offering competitive sample diversity with superior coverage compared to BigGAN.

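Paragraph function: Efficiency and diversity analysis: speed advantage and coverage metrics.

Logical role: The 30-64x speedup is the paper's most practically significant result. CAS complements FID by measuring class coverage, which FID cannot capture.

Argumentative technique / potential weakness: The multiple evaluation metrics (FID, IS, CAS, Precision/Recall) form a complete picture of quality. The trend of the speedup growing with resolution further strengthens the method's value in high-resolution settings.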
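
A rough way to see why the advantage grows with resolution is to compare step counts alone. The sketch below is only a proxy: it counts model calls, not wall-clock time, so its ratios are not expected to equal the reported 30-64x figures, since the cost of a single call differs between the two decoders.

```python
def step_ratio(autoregressive_steps, parallel_steps):
    """Ratio of decoding steps; a proxy for speedup, not a runtime measurement."""
    return autoregressive_steps / parallel_steps


# Step counts quoted above: 256 or 1024 sequential steps versus 8 or 12 parallel steps.
for ar_steps in (256, 1024):
    for parallel_steps in (8, 12):
        ratio = step_ratio(ar_steps, parallel_steps)
        print(f"{ar_steps} vs {parallel_steps} steps -> {ratio:.1f}x fewer model calls")
```

Because the parallel step count stays roughly constant while the autoregressive step count equals the number of tokens, the ratio increases as the token sequence gets longer.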

The model's bidirectional nature enables a novel class-conditional image editing task: replacing bounding box content with target class objects while preserving context — a task infeasible for autoregressive methods due to their sequential left-to-right generation constraint. For image inpainting on Places2 (center 50% masking), MaskGIT achieves FID 7.92, comparable to dedicated inpainting methods like CoModGAN (7.13) without task-specific training. For image outpainting (right 50% extrapolation), MaskGIT achieves FID 6.78 and IS 11.69, beating all baselines including InfinityGAN (10.60 FID), handling arbitrary-direction outpainting with a single model.

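Paragraph function: Editing applications: inpainting, outpainting, and class-conditional editing.

Logical role: The contrast "infeasible for autoregressive models, feasible for MaskGIT" demonstrates a structural advantage of the bidirectional architecture.

Argumentative technique / potential weakness: Approaching dedicated inpainting methods without task-specific training is an impressive display of generalization. The gap between FID 7.92 and 7.13 nonetheless shows that specialized training still has value.

4. Conclusion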
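
All three editing tasks reduce to choosing which tokens start out masked. The sketch below builds such initial masks on a 16x16 token grid; the helper names are hypothetical, and the region coordinates are given in token units, which assumes the pixel-space box has already been divided by the compression factor.

```python
def region_mask(grid_size, top, left, height, width):
    """Boolean grid: True marks tokens to regenerate, False marks tokens kept as context."""
    return [
        [top <= r < top + height and left <= c < left + width for c in range(grid_size)]
        for r in range(grid_size)
    ]


def right_half_mask(grid_size):
    """Mask for extrapolating the right half of the image from the left half."""
    half = grid_size // 2
    return region_mask(grid_size, 0, half, grid_size, grid_size - half)


def count_masked(mask):
    return sum(sum(row) for row in mask)


box = region_mask(16, top=4, left=4, height=8, width=8)  # bounding-box edit
outpaint = right_half_mask(16)
print(count_masked(box), count_masked(outpaint))
```

Decoding then proceeds as in ordinary generation, except that the unmasked tokens stay fixed and supply context from every side of the region, which is what a left-to-right decoder cannot do.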

MaskGIT introduces a paradigm shift in generative transformers through bidirectional masked modeling and iterative parallel decoding. The method achieves state-of-the-art image synthesis results on ImageNet while demonstrating substantial speed improvements (30-64x) and versatility across editing tasks. The cosine mask scheduling strategy provides the optimal less-to-more prediction progression, and the iterative refinement mechanism resolves the training-inference discrepancy inherent in single-pass non-autoregressive generation. The authors acknowledge limitations in handling complex structures like faces and text, and propose extending the approach to additional synthesis tasks as future work.

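Paragraph function: Summarizes the paper, restating the core contribution of the paradigm shift and the known limitations.

Logical role: Frames the whole paper as a "paradigm shift" while candidly acknowledging limitations, which strengthens its academic credibility.

Argumentative technique / potential weakness: The masked generation paradigm pioneered by MaskGIT has been adopted by a large body of follow-up work (such as Muse and Phenaki), which attests to its influence. The limitations with faces and text have been progressively addressed in subsequent work.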

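Argument Structure Overview

Problem: autoregressive sequence generation is inefficient and unnatural
→ Thesis: bidirectional masked modeling with parallel iterative generation
→ Method: MVTM + cosine schedule + confidence-based refinement
→ Evidence: FID 6.18, 64x speedup, multi-task generalization
→ Conclusion: a paradigm shift for generative transformers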

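Core Claim (One Sentence)

Through bidirectional masked modeling and iterative parallel decoding, MaskGIT generates 256-token images on ImageNet in 8 steps, reaches an FID of 6.18, and achieves a 30-64x speedup, demonstrating that non-autoregressive methods can surpass sequential generation in both quality and efficiency.

Strongest Point of the Argument

The systematic ablation of mask scheduling functions, with cosine coming out ahead, provides a clear design guideline. The less-to-more prediction principle has both intuitive support (the painter analogy) and experimental validation (concave schedules win consistently in the ablations), an excellent combination of theory and practice.

Weakest Point of the Argument

Image quality depends entirely on the compression quality of the VQGAN tokenizer: the codebook size (1024) and the compression factor (16x) set an upper bound on quality. In addition, the limited ability to generate fine structures such as faces and text restricts the range of practical applications.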
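
The tokenizer bottleneck can be put in numbers. The sketch below counts the bits carried by the token grid and compares them with raw 8-bit RGB pixels; it assumes a square image, a uniform code of log2(codebook size) bits per token, and hypothetical function names, so it is an information-capacity estimate rather than a measure of perceptual quality.

```python
import math


def token_bits(image_size, downsample_factor=16, codebook_size=1024):
    """Bits needed to store the token grid with a fixed-length code."""
    side = image_size // downsample_factor
    return side * side * math.log2(codebook_size)


def raw_bits(image_size, channels=3, bits_per_channel=8):
    """Bits in the uncompressed image."""
    return image_size * image_size * channels * bits_per_channel


size = 256
print(f"tokens: {token_bits(size):.0f} bits, raw pixels: {raw_bits(size)} bits")
print(f"compression ratio: {raw_bits(size) / token_bits(size):.1f}x")
```

Everything the transformer generates has to pass through this narrow representation, so detail that the tokenizer cannot encode is unavailable no matter how well the tokens are predicted.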