Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Abstract

Vision Transformer (ViT) has shown promising performance on image classification but requires large-scale pre-training datasets (e.g., JFT-300M) to achieve competitive results. This paper identifies two key limitations. First, the simple tokenization of input images fails to model important local structure such as edges and lines, which leads to low training sample efficiency. Second, the redundant attention backbone of ViT is not efficient enough for vision tasks when trained from scratch. The authors propose Tokens-to-Token ViT (T2T-ViT), which progressively tokenizes images by aggregating neighboring tokens and achieves 81.5% top-1 accuracy on ImageNet when trained from scratch.

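Paragraph function: Overview of the whole paper. It concisely states ViT's two limitations, T2T-ViT's solution, and the core result.

Logical role: The abstract sets reader expectations with a complete chain of argument: problem (ViT's limitations) → diagnosis (tokenization and backbone) → solution (T2T) → result (81.5%).

Argumentative technique / potential weakness: The concrete figure of 81.5% serves as a persuasive opening anchor. However, the definition of "training from scratch" needs clarification: does it include the data augmentation and regularization strategies applied on ImageNet itself? These factors also affect the result significantly.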

1. Introduction

The original ViT splits an image into non-overlapping 16x16 patches and linearly projects each patch into a token. This hard split destroys the local continuity and structural information of the image, such as edges, corners, and texture patterns, which convolutional neural networks capture naturally through their hierarchical receptive fields. When trained solely on ImageNet-1K (1.3M images), ViT-Base achieves only 77.9% top-1 accuracy, well below the 81.8% of DeiT-Base, which relies on extensive data augmentation.

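Paragraph function: Critique of existing methods. It pinpoints the structural flaw in ViT's simple tokenization strategy.

Logical role: Starting point of the chain of argument. The concrete performance gap (77.9% vs. 81.8%) quantifies the severity of the problem and establishes the need for the T2T module.

Argumentative technique / potential weakness: The performance gap between ViT and DeiT is attributed to the tokenization scheme, but DeiT's gains come mainly from its training strategy (distillation, augmentation). This attribution may oversimplify the root of the problem.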
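
To make the hard split concrete, the following minimal Python sketch counts the tokens produced by non-overlapping 16x16 patches and shows that two neighboring pixels on either side of a patch border end up in different tokens. The 224x224 input resolution and the helper names `hard_split_token_stats` and `patch_index` are assumptions chosen for illustration; they are not taken from the paper.

```python
def hard_split_token_stats(image_size: int, patch_size: int, channels: int = 3):
    """Token count and raw per-token dimension for non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_tokens = patches_per_side ** 2
    # Each patch is flattened before the linear projection.
    token_dim = patch_size * patch_size * channels
    return num_tokens, token_dim


def patch_index(row: int, col: int, patch_size: int, patches_per_side: int) -> int:
    """Index of the single patch that owns pixel (row, col) under a hard split."""
    return (row // patch_size) * patches_per_side + (col // patch_size)


# Assumed input: a 224x224 RGB image split into 16x16 patches.
num_tokens, token_dim = hard_split_token_stats(image_size=224, patch_size=16)
print(num_tokens, token_dim)  # 196 768

# Pixels in columns 15 and 16 are adjacent but belong to different tokens.
left = patch_index(0, 15, patch_size=16, patches_per_side=14)
right = patch_index(0, 16, patch_size=16, patches_per_side=14)
print(left, right)  # 0 1
```

Because every pixel belongs to exactly one patch, any edge that crosses a patch border is split between two tokens before the model sees it, which is the loss of local continuity the paragraph describes.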

Furthermore, the authors analyze the attention maps of ViT and observe that many attention heads in the later layers are highly similar, which indicates redundancy in the backbone design. This redundancy means the model spends parameters on repeated computation rather than on learning diverse, complementary representations. The authors argue that a more compact and efficient backbone, inspired by CNN design principles, can reduce this redundancy while preserving representational power.

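Paragraph function: Analytical evidence. Visualization of the attention maps exposes the redundancy in the ViT backbone.

Logical role: Motivates the second contribution (efficient backbone design) by deriving a direction for design improvement from an observable phenomenon (similarity among attention heads).

Argumentative technique / potential weakness: Using visualization as evidence is intuitive and effective, but "similarity" among attention heads is not the same as "redundancy". Highly similar attention patterns may act as an ensemble and improve robustness.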
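
One way to turn the notion of "similar attention heads" into a number is to average the pairwise cosine similarity between the heads' flattened attention maps. The sketch below does this on toy data. Cosine similarity is an assumed metric chosen for illustration, not necessarily the measure the authors used, and the function names and toy attention maps are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_head_similarity(head_maps):
    """Average cosine similarity over all pairs of flattened attention maps."""
    sims = []
    for i in range(len(head_maps)):
        for j in range(i + 1, len(head_maps)):
            sims.append(cosine_similarity(head_maps[i], head_maps[j]))
    return sum(sims) / len(sims)


# Toy 2x2 attention maps, flattened row by row (each row sums to 1).
similar_heads = [
    [0.9, 0.1, 0.2, 0.8],
    [0.9, 0.1, 0.2, 0.8],
    [0.8, 0.2, 0.3, 0.7],
]
diverse_heads = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]

similar_score = mean_pairwise_head_similarity(similar_heads)
diverse_score = mean_pairwise_head_similarity(diverse_heads)
print(similar_score > diverse_score)  # True
```

A high score only says the heads attend in similar ways. As the annotation above notes, whether that similarity is wasteful or useful is a separate question that such a metric cannot answer on its own.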

2. Related Work

Vision Transformers have evolved rapidly since the original ViT demonstrated that pure transformer architectures can achieve competitive image classification performance. DeiT introduced knowledge distillation and advanced data augmentation strategies to improve ViT's training efficiency on ImageNet. Meanwhile, CNN architectures such as ResNet and EfficientNet have long benefited from carefully designed hierarchical structures that progressively build local-to-global representations, a principle largely absent from vanilla ViT.

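Paragraph function: Literature review. It positions T2T-ViT at the intersection of ViT improvements and CNN design principles.

Logical role: Sets up two parallel lines of research (ViT improvements vs. CNN design wisdom) and hints that T2T-ViT merges them.

Argumentative technique / potential weakness: Labeling the CNN principle of hierarchical local-to-global representation as something ViT lacks neatly provides theoretical support for the progressive aggregation design of the T2T module.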

3. Proposed Approach

3.1 Tokens-to-Token Module

The T2T module performs progressive tokenization through iterative token restructuring. In each T2T step, the module first applies a transformer layer to model relationships among all tokens, then reshapes the output tokens back into a spatial image format and unfolds overlapping patches from neighboring tokens to aggregate local information. This process progressively reduces the token length while enriching each token with surrounding structural information, so that it encodes local patterns such as edges and textures that simple patch splitting misses.

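Paragraph function: Core of the method derivation. It describes how the T2T module actually operates.

Logical role: This is the paper's central technical contribution. Iterating the three steps "transformer modeling → spatial reshaping → overlapping unfolding" yields a tokenization that combines global attention with local structure modeling.

Argumentative technique / potential weakness: The progressive reduction resembles pooling layers in CNNs, which makes it easy for readers to grasp. However, every T2T step requires a full transformer layer, so the computational overhead is significantly higher than for a simple split, and this efficiency trade-off is not discussed explicitly in this passage.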
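
The three operations in a T2T step can be sketched in plain Python on a tiny token grid. This is a simplified illustration under stated assumptions, not the authors' implementation: the transformer layer is replaced by a caller-supplied stand-in (`mix_tokens`, here an identity function), no padding is applied, and the names `t2t_step`, `window`, and `stride` are hypothetical.

```python
def t2t_step(tokens, height, width, mix_tokens, window=3, stride=2):
    """One simplified Tokens-to-Token step on a height x width token grid.

    tokens: list of height * width token vectors (lists of floats), row-major.
    mix_tokens: stand-in for the transformer layer (assumed callable).
    """
    if len(tokens) != height * width:
        raise ValueError("tokens must contain height * width entries")
    if stride >= window:
        raise ValueError("stride must be smaller than window for overlap")

    # 1) Model relations among all tokens (placeholder for a transformer layer).
    mixed = mix_tokens(tokens)

    # 2) Reshape the token sequence back into a 2D spatial grid.
    grid = [mixed[r * width:(r + 1) * width] for r in range(height)]

    # 3) Slide an overlapping window and concatenate the tokens it covers.
    aggregated = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            merged = []
            for r in range(top, top + window):
                for c in range(left, left + window):
                    merged.extend(grid[r][c])
            aggregated.append(merged)
    return aggregated


def identity(toks):
    """Assumption: no real attention, tokens pass through unchanged."""
    return toks


grid_tokens = [[float(i)] for i in range(25)]  # 5x5 grid of 1-dimensional tokens
out = t2t_step(grid_tokens, height=5, width=5, mix_tokens=identity)
print(len(out), len(out[0]))  # 4 9
```

In this toy run, 25 one-dimensional tokens become 4 nine-dimensional tokens, and neighboring output tokens share the grid positions where their windows overlap. The sequence gets shorter while each token carries more local context.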

Specifically, the T2T module uses a soft split operation with overlapping windows. Once the tokens are reshaped into a 2D spatial map, a sliding window with kernel size k and stride s (where s < k) generates overlapping patches, and the tokens within each patch are concatenated along the channel dimension. After two T2T iterations, the token count is reduced relative to the initial patch count, while each token encodes rich local context. The final tokens are then fed into the efficient transformer backbone.

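Paragraph function: Technical detail. It explains how the soft split operation is implemented.

Logical role: Turns the conceptual description of the previous paragraph into reproducible algorithmic detail, with the kernel size and stride of the sliding window as the key hyperparameters.

Argumentative technique / potential weakness: The overlapping-window design borrows directly from the mature idea of overlapping pooling in CNNs, which reduces the risk attached to the method's novelty. However, overlapping windows increase the channel dimension of each token, so the effect on memory consumption deserves attention.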
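
The effect of kernel size k and stride s on the token count follows the usual sliding-window arithmetic: a side of length n with padding p yields floor((n + 2p - k) / s) + 1 window positions. The sketch below applies this to three successive soft splits. The (kernel, stride, padding) values and the 224x224 input are illustrative assumptions that this summary does not state, and the padding term p is likewise an addition not mentioned in the text above.

```python
def soft_split_output_side(side, kernel, stride, padding=0):
    """Number of window positions along one side of a square map."""
    return (side + 2 * padding - kernel) // stride + 1


def token_schedule(image_side, stages):
    """Track grid side, token count, and channel growth across soft splits."""
    side = image_side
    schedule = []
    for kernel, stride, padding in stages:
        if stride >= kernel:
            raise ValueError("soft split needs stride < kernel")
        side = soft_split_output_side(side, kernel, stride, padding)
        schedule.append({
            "side": side,
            "tokens": side * side,
            # Concatenating k*k neighbors multiplies the channel dimension by k*k.
            "channel_multiplier": kernel * kernel,
        })
    return schedule


# Assumed settings: (kernel, stride, padding) per soft split.
stages = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]
schedule = token_schedule(224, stages)
for stage in schedule:
    print(stage["side"], stage["tokens"], stage["channel_multiplier"])
# 56 3136 49
# 28 784 9
# 14 196 9
```

Under these assumed settings the final grid is 14x14, that is 196 tokens, the same length a hard 16x16 split produces on a 224x224 image. The channel multiplier column shows where the memory concern raised in the annotation comes from: each soft split multiplies the per-token dimension by k squared before any projection.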

3.2 Efficient Backbone

For the transformer backbone, the authors explore architectures inspired by CNN design principles, including dense connections (DenseNet-style), channel attention (SE-Net-style), and ghost operations (GhostNet-style). In a systematic comparison, a deep-narrow structure with fewer channels but more layers proves most effective. This design reduces the attention head redundancy observed in vanilla ViT while maintaining comparable representational capacity with significantly fewer parameters.

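Paragraph function: Presents the second contribution, improving the transformer backbone with CNN design wisdom.

Logical role: Responds to the diagnosis of attention-head redundancy from the introduction by proposing the deep-narrow structure as the solution. Borrowing CNN design principles serves as an example of cross-architecture knowledge transfer.

Argumentative technique / potential weakness: The systematic comparison of several CNN design principles shows the thoroughness of the study. However, the final choice of a deep-narrow structure may affect transferability to specific tasks such as detection and segmentation, whereas this section focuses on classification only.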
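
A rough parameter estimate shows why a deep-narrow backbone is cheaper. A standard transformer layer has about 4d^2 weights in its attention projections (query, key, value, output) and 2 * d * d_mlp weights in its MLP, so the cost grows quadratically with width d but only linearly with depth. The sketch below compares two configurations. The specific depths, hidden sizes, and MLP sizes are assumed values for illustration and are not given in this summary; biases, layer norms, and embeddings are ignored.

```python
def transformer_layer_params(hidden_dim, mlp_dim):
    """Approximate weight count of one transformer layer (biases ignored)."""
    attention = 4 * hidden_dim * hidden_dim  # Q, K, V, and output projections
    mlp = 2 * hidden_dim * mlp_dim           # two linear layers
    return attention + mlp


def backbone_params(depth, hidden_dim, mlp_dim):
    """Approximate weight count of a stack of identical transformer layers."""
    return depth * transformer_layer_params(hidden_dim, mlp_dim)


# Assumed configurations: a wider, shallower stack vs. a narrower, deeper one.
shallow_wide = backbone_params(depth=12, hidden_dim=768, mlp_dim=3072)
deep_narrow = backbone_params(depth=14, hidden_dim=384, mlp_dim=1152)

print(f"{shallow_wide / 1e6:.1f}M")  # 84.9M
print(f"{deep_narrow / 1e6:.1f}M")   # 20.6M
```

Even with two extra layers, halving the width cuts the estimated backbone weights to roughly a quarter in this example, which is consistent in scale with the 21.5M total reported for T2T-ViT-14.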

4. Experiments

On ImageNet-1K, T2T-ViT-14 achieves 81.5% top-1 accuracy with only 21.5M parameters, outperforming models of comparable size: ResNet-50 (25.6M parameters, 76.1% accuracy) and DeiT-Small (22.1M parameters, 79.8% accuracy). T2T-ViT-24 further reaches 82.3% accuracy with 64.1M parameters. Ablation studies show that the T2T module alone improves accuracy by 2.1 percentage points over simple patch embedding, and the efficient backbone design contributes a further 1.4 points. On transfer learning tasks, T2T-ViT maintains competitive performance on CIFAR-10, CIFAR-100, and Oxford Flowers.

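Paragraph function: Comprehensive experimental evidence: main results on ImageNet, ablation studies, and transfer learning.

Logical role: This paragraph is the empirical backbone of the paper and covers three dimensions: (1) a fair comparison with CNNs and ViT variants; (2) ablation studies that separate the contributions of the T2T module and the backbone; (3) transfer learning to verify generalization.

Argumentative technique / potential weakness: Using parameter count as the basis for a fair comparison is reasonable, but FLOPs (floating-point operations) and actual inference speed matter just as much, and the overlapping-window operations of the T2T module may not compare favorably on FLOPs. The comparison with DeiT does not control for training strategy (distillation vs. no distillation), so its fairness should be treated with caution.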
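
The comparison can be laid out as simple arithmetic on the figures quoted in the paragraph above. The sketch below computes, for each model, the accuracy gap and parameter difference relative to T2T-ViT-14. It uses only the numbers stated in this section; the dictionary layout and variable names are illustrative.

```python
# (parameters in millions, top-1 accuracy in %) as quoted in this section
results = {
    "ResNet-50": (25.6, 76.1),
    "DeiT-Small": (22.1, 79.8),
    "T2T-ViT-14": (21.5, 81.5),
    "T2T-ViT-24": (64.1, 82.3),
}

ref_params, ref_acc = results["T2T-ViT-14"]
gaps = {}
for name, (params_m, top1) in results.items():
    if name == "T2T-ViT-14":
        continue
    gaps[name] = {
        # Positive accuracy gap: T2T-ViT-14 is more accurate.
        "acc_gap": round(ref_acc - top1, 1),
        # Negative parameter difference: T2T-ViT-14 is smaller.
        "param_diff_m": round(ref_params - params_m, 1),
    }

for name, gap in gaps.items():
    print(name, gap["acc_gap"], gap["param_diff_m"])
# ResNet-50 5.4 -4.1
# DeiT-Small 1.7 -0.6
# T2T-ViT-24 -0.8 -42.6
```

Read this way, T2T-ViT-14 is both smaller and more accurate than the two similarly sized baselines, while T2T-ViT-24 buys 0.8 points with 42.6M additional parameters. None of these figures covers FLOPs or latency, which is the gap the annotation above points out.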

5. Conclusion

T2T-ViT addresses two fundamental limitations of vanilla ViT through progressive tokenization that models local structure and an efficient deep-narrow backbone that reduces attention redundancy. The result is a vision transformer that achieves CNN-level performance when trained from scratch on ImageNet, without requiring massive pre-training datasets or knowledge distillation. This work demonstrates that combining CNN design wisdom with transformer architectures yields models that are both data-efficient and computationally practical.

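Paragraph function: Summary of the paper. It revisits the two contributions and distills the overall message.

Logical role: The conclusion mirrors the abstract and closes the paper with a three-layer structure: problem → solution → implication.

Argumentative technique / potential weakness: The claim of needing no large-scale pre-training has strong practical appeal. However, the conclusion does not discuss how T2T-ViT performs on downstream tasks (detection, segmentation), and as pre-training datasets for ViT become more readily available, the practical need for training from scratch may gradually diminish.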

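Argument structure overview

Problem: ViT tokenization loses local structure, and the backbone is redundant
→ Claim: progressive T2T tokenization plus a deep-narrow efficient backbone
→ Evidence: 81.5% on ImageNet with only 21.5M parameters
→ Rebuttal: ablation studies isolate the contribution of each component
→ Conclusion: a successful example of fusing CNN and Transformer design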

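Authors' core claim (one sentence)

By modeling local structure through progressive token aggregation and slimming the Transformer backbone with CNN design principles, a vision transformer can match CNN performance on ImageNet without large-scale pre-training.

Strongest point of the argument

Precision of the problem diagnosis: attention-map analysis reveals backbone redundancy, and token visualization shows the information lost by simple splitting, giving both design improvements a solid empirical basis. The ablation studies clearly quantify the separate contributions of the T2T module (+2.1%) and the backbone design (+1.4%), demonstrating rigorous experimental methodology.

Weakest point of the argument

Incomplete efficiency metrics: the paper uses parameter count as its main basis for efficiency comparison and does not fully report FLOPs or actual inference latency. The overlapping operations in the T2T module may add FLOPs overhead. In addition, as pre-trained models become increasingly common, the practical need for "training from scratch", and hence its long-term value, may be called into question.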