DINOv2: Learning Robust Visual Features without Supervision

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing self-supervised methods can produce such features if trained on enough curated data from diverse sources. We train a ViT model with 1 billion parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most benchmarks at image and pixel levels.

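Paragraph function: Overview of the whole paper. It takes "all-purpose visual features" as the central claim and previews three main threads: data, method, and distillation.

Logical role: The abstract carries a strong paradigm statement: self-supervision plus curated data yields all-purpose features that surpass CLIP. The cross-domain argument, from the NLP analogy to vision, is the core narrative strategy.

Argumentation technique / potential weakness: Positioning the work as surpassing OpenCLIP carries great commercial and academic value. But CLIP and OpenCLIP have language alignment, which DINOv2 lacks, so the two are "general-purpose" along different dimensions: DINOv2 is stronger on pixel-level tasks, CLIP on zero-shot semantic tasks.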

1. Introduction

Can we learn general-purpose visual features that work across image distributions and tasks without finetuning? In NLP, foundation models pretrained on web-scale text have achieved this goal. In vision, text-guided pretraining (CLIP, OpenCLIP) has shown promise, but we argue it limits the information retained about images — text captions cannot describe all visual details at the pixel level. In contrast, self-supervised learning can capture information at both the image and pixel level, making it a more suitable foundation for general-purpose features. The missing ingredient has been sufficient scale and diversity of training data combined with careful curation.

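Paragraph function: Establishes the thesis. It challenges the text-guided pretraining paradigm and argues for the superiority of self-supervised learning.

Logical role: The starting point of the chain of argument. It opens with a question, first acknowledges CLIP's achievements, then points out its "information bottleneck", making the case for the self-supervised route. The phrase "the missing ingredient" reveals the paper's core contribution: data curation.

Argumentation technique / potential weakness: "Text captions cannot describe all visual details at the pixel level" is a precise criticism. But CLIP's language alignment gives it an advantage on tasks such as zero-shot classification that self-supervised methods cannot match. The two approaches define "general-purpose" differently, and the framing here favors DINOv2.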

2. Data Processing

We construct LVD-142M, a curated dataset of 142 million images. Our pipeline combines curated seed datasets (ImageNet-22k, Google Landmarks, fine-grained classification sets) with images retrieved from 1.2 billion uncurated web images. The retrieval process uses a pretrained ViT-H/16 to compute embeddings, performs k-means clustering, and retrieves nearest neighbors (typically 4 per query image) from the uncurated pool. Copy detection deduplication is applied to remove near-duplicates and increase diversity. This approach ensures the dataset covers diverse visual concepts while maintaining quality through proximity to curated exemplars.
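
The nearest-neighbor retrieval step described above can be illustrated with a minimal sketch. It assumes toy 2-D vectors in place of ViT-H/16 embeddings and cosine similarity as the ranking metric (the text does not name the metric), and it leaves out k-means clustering and copy-detection deduplication. The function names are hypothetical and are not taken from the authors' pipeline.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def retrieve_neighbors(seed_embeddings, pool_embeddings, k=4):
    """Return sorted indices of pool images that are among the k nearest
    neighbors (by cosine similarity) of at least one seed image."""
    seeds = [l2_normalize(v) for v in seed_embeddings]
    pool = [l2_normalize(v) for v in pool_embeddings]
    selected = set()
    for query in seeds:
        sims = [sum(q * p for q, p in zip(query, cand)) for cand in pool]
        ranked = sorted(range(len(pool)), key=lambda i: sims[i], reverse=True)
        selected.update(ranked[:k])  # a pool image retrieved twice is kept once
    return sorted(selected)

# Toy embeddings (hypothetical values standing in for real image features).
seed_embeddings = [[1.0, 0.0], [0.0, 1.0]]
pool_embeddings = [
    [0.9, 0.1],   # close to seed 0
    [0.1, 0.9],   # close to seed 1
    [-1.0, 0.0],  # far from both seeds
    [0.8, 0.2],   # close to seed 0
    [0.0, -1.0],  # far from both seeds
]
curated = retrieve_neighbors(seed_embeddings, pool_embeddings, k=2)
print(curated)  # [0, 1, 3]
```

Pool images far from every seed (indices 2 and 4) are never selected, which is the sense in which proximity to curated exemplars acts as a quality filter.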

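Paragraph function: Dataset construction strategy. It describes the pipeline that expands seed sets into a large curated dataset.

Logical role: This paragraph is one of the paper's most important contributions. Curated data is the core difference that lets DINOv2 surpass earlier self-supervised methods: not a better algorithm, but better data. Using seed sets to guide retrieval strikes a balance between scale and quality.

Argumentation technique / potential weakness: Using curated seed sets as anchors for retrieving web images is an elegant design: it uses existing high-quality datasets to extend the data frontier. But the approach may introduce bias: distributional traits of the seed sets (such as ImageNet's skew toward Western objects) may be amplified across the whole dataset.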

3. Discriminative Self-Supervised Pre-training

The training combines multiple loss components. The image-level objective (DINO loss) computes cross-entropy between student and teacher network features from the class token using different image crops. The patch-level objective (iBOT loss) randomly masks input patches for the student while keeping them visible for the teacher, applying loss to masked token predictions. Key design modifications at scale include: untying head weights between DINO and iBOT objectives, Sinkhorn-Knopp centering, KoLeo regularizer for uniform feature distribution, and high-resolution adaptation (518x518) in the final training phase.
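
The three loss terms named above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the temperatures (0.04 for the teacher, 0.1 for the student) are assumed values, teacher centering (Sinkhorn-Knopp in the text) is omitted, and the KoLeo term is written as the negative mean log distance from each L2-normalized feature to its nearest neighbor in the batch, with a small epsilon added for numerical safety.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]

def cross_entropy(teacher_logits, student_logits, t_temp=0.04, s_temp=0.1):
    """-sum_k p_teacher[k] * log p_student[k]; temperatures are assumptions."""
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_temp)]
    log_p_student = log_softmax(student_logits, s_temp)
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))

def dino_loss(teacher_cls, student_cls):
    """Image-level term: class-token head outputs from two different crops."""
    return cross_entropy(teacher_cls, student_cls)

def ibot_loss(teacher_patches, student_patches, mask):
    """Patch-level term: mean cross-entropy over masked positions only.
    The teacher sees every patch; the student sees masked patches hidden."""
    losses = [cross_entropy(t, s)
              for t, s, is_masked in zip(teacher_patches, student_patches, mask)
              if is_masked]
    return sum(losses) / len(losses) if losses else 0.0

def koleo(features, eps=1e-8):
    """KoLeo term: -(1/n) * sum_i log(min_{j != i} ||x_i - x_j||)."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        unit.append([x / norm for x in f])
    n = len(unit)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(unit[i], unit[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

# Hypothetical head outputs: the loss is small when the student agrees
# with the teacher and large when it does not.
teacher_cls = [2.0, 0.5, -1.0]
aligned = dino_loss(teacher_cls, [2.0, 0.5, -1.0])
misaligned = dino_loss(teacher_cls, [-1.0, 0.5, 2.0])

# KoLeo is lower (better) for spread-out features than for clustered ones.
spread = koleo([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
clustered = koleo([[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]])
```

The total training objective would combine these terms with weights; the weights are not given in the text, so none are assumed here.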

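Paragraph function: Training method. It describes the two-level (image + patch) self-supervised objective and the adjustments made for scale.

Logical role: The DINO loss supplies global semantics and the iBOT loss supplies local detail; together they map directly onto the goal of features that work at both the image level and the pixel level. The engineering details for scaling (untied heads, KoLeo) look minor but have a marked effect on final quality.

Argumentation technique / potential weakness: The two-objective design is logically clear. But the number of loss components (DINO + iBOT + KoLeo + others) makes hyperparameter tuning more complex. The authors provide no sensitivity analysis of the loss weights, so it is hard to judge which components are essential and which are merely nice to have.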

4. Efficient Implementation

Training a ViT-g model with 1 billion parameters requires substantial engineering. Key optimizations include: a custom FlashAttention implementation, sequence packing that concatenates variable-length token sequences and applies a block-diagonal attention mask, an efficient stochastic depth that skips computation of dropped residuals, and Fully-Sharded Data Parallel (FSDP) for mixed-precision distributed training across GPUs. An important additional contribution is model distillation: smaller models distilled from the ViT-g outperform models of the same size trained from scratch on all 12 benchmarks.
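
To make the sequence-packing idea concrete, here is a minimal sketch of how a block-diagonal attention mask can be built from the lengths of the packed sequences. The function names and token labels are hypothetical, and the sketch only builds the mask; it does not run attention.

```python
def block_diagonal_mask(seq_lengths):
    """Return a square boolean matrix whose entry [i][j] is True only when
    tokens i and j come from the same packed sequence."""
    # owner[t] is the index of the sequence that token t belongs to.
    owner = [seq_id for seq_id, n in enumerate(seq_lengths) for _ in range(n)]
    return [[a == b for b in owner] for a in owner]

def pack(sequences):
    """Concatenate token sequences and return (packed_tokens, mask)."""
    packed = [tok for seq in sequences for tok in seq]
    return packed, block_diagonal_mask([len(seq) for seq in sequences])

# Hypothetical example: a 3-token crop packed together with a 2-token crop.
tokens, mask = pack([["a0", "a1", "a2"], ["b0", "b1"]])
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
# ###..
# ###..
# ###..
# ...##
# ...##
```

Because attention is blocked outside each diagonal block, tokens of one sequence never attend to another, so the packed forward pass gives each sequence the same result it would get alone.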

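Paragraph function: Engineering implementation. It presents the key optimizations for large-scale training and the distillation strategy.

Logical role: This paragraph answers the implicit challenge of scaling: it takes more than "a bigger model + more data"; system-level engineering advances are needed to make training feasible. Distillation, in turn, addresses efficiency at deployment.

Argumentation technique / potential weakness: The distillation result (beating from-scratch training on 12 of 12 benchmarks) is strong evidence that knowledge can be transferred efficiently. But the conclusion rests on the strength of the ViT-g teacher, and the total compute cost (about 22,016 A100 GPU-hours) means only a few institutions can reproduce the method.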

5. Experiments

DINOv2 ViT-g/14 achieves 86.5% linear evaluation accuracy on ImageNet-1k, surpassing previous self-supervised methods (+4.2% over iBOT) and outperforming OpenCLIP-G (+0.3%) and EVA-CLIP (+0.1%). Domain robustness improves dramatically relative to iBOT: +29.6% on ImageNet-A, +22.1% on ImageNet-R, and +23.0% on Sketch. On dense recognition tasks, ADE20k semantic segmentation reaches 49.0 mIoU with a linear probe, and depth estimation on NYUd achieves 0.344 RMSE with a single linear layer. These frozen-backbone results approach or match fully fine-tuned MAE (53.6 mIoU on ADE20k), demonstrating the features' out-of-the-box utility.
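
The linear evaluation protocol behind these numbers can be sketched as follows: the backbone's features are treated as fixed inputs and only a single linear layer is trained on top. The sketch assumes toy 2-D features, plain gradient descent on softmax cross-entropy, and hypothetical function names and hyperparameters; it does not reproduce the paper's evaluation setup.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Fit one linear layer (W, b) by full-batch gradient descent.
    The features are constants: nothing is back-propagated into a backbone."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(steps):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            logits = [sum(w * v for w, v in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            probs = softmax(logits)
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for d in range(dim):
                    grad_W[c][d] += err * x[d] / n
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for d in range(dim):
                W[c][d] -= lr * grad_W[c][d]
    return W, b

def predict(W, b, x):
    """Return the class with the highest linear score."""
    logits = [sum(w * v for w, v in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return max(range(len(logits)), key=lambda c: logits[c])

# Hypothetical frozen features for four images from two classes.
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
W, b = train_linear_probe(frozen_features, labels, num_classes=2)
predictions = [predict(W, b, x) for x in frozen_features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 1.0
```

Since the probe has so little capacity, its accuracy mostly reflects how linearly separable the frozen features already are, which is why the protocol is read as a test of feature quality.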

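Paragraph function: Comprehensive experimental validation. It presents results along three dimensions: classification, robustness, and dense prediction.

Logical role: The empirical pillar. Quantitative results across multiple dimensions directly support the core claim of "all-purpose features". The large gain in domain robustness (+29.6% on ImageNet-A) is especially persuasive: it shows the features not only perform well in-distribution but also generalize out of distribution.

Argumentation technique / potential weakness: Evaluating with a linear probe rather than fine-tuning highlights the quality of the features themselves, which is a stricter test. But the 86.5% linear accuracy on ImageNet-1k is only slightly above the CLIP family; the real gap shows up in dense tasks and robustness. In addition, the fairness analysis shows geographic bias (Africa scores 25.7% below Europe), an important limit on the generality claim.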

Ablation studies reveal the contribution of each component. The KoLeo regularizer provides +2.3% k-NN improvement and +8% retrieval boost. Masked image modeling (iBOT loss) is critical for dense predictions (+3% on segmentation). LVD-142M outperforms uncurated data while maintaining ImageNet performance, confirming that data curation — not just data scale — is essential for learning universal features. Model and data scaling curves show that larger models increasingly benefit from the larger, more diverse LVD-142M dataset compared to ImageNet-22k.

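Paragraph function: Ablation analysis. It quantifies the individual contribution of each component.

Logical role: This paragraph answers "why is DINOv2 better than its predecessors?": not a single breakthrough but the accumulation of several improvements, including curated data, KoLeo regularization, and the iBOT loss. "Data curation matters more than data scale" is the core insight.

Argumentation technique / potential weakness: The conclusion that curation matters more than scale has far-reaching implications: it suggests focusing on quality rather than chasing the largest dataset. But the curation process itself depends on a pretrained model (ViT-H/16), creating a circular dependency in which a good model is needed before good data can be built.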

6. Conclusion

DINOv2 demonstrates that self-supervised pretraining on curated, diverse data can produce visual features that match or exceed text-guided approaches across a wide range of tasks. The success rests on four pillars: improved training recipes, larger model scale, curated LVD-142M dataset, and knowledge distillation. The resulting features exhibit emergent properties including automatic foreground-background separation, semantic part alignment across categories, and strong cross-domain transfer. We believe this work demonstrates that self-supervised learning is a viable — and potentially superior — path to general-purpose visual features.

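Paragraph function: Summarizes the paper. It restates the four pillars and looks ahead to the future of self-supervised learning.

Logical role: The conclusion concisely sums up all contributions as "four pillars" and adds academic depth with "emergent properties". The final sentence answers the question that opens the introduction, closing the loop.

Argumentation technique / potential weakness: The wording "potentially superior" is cautious yet still ambitious. Limitations not fully discussed include: the model has no language alignment (it cannot do zero-shot, text-guided recognition), geographic and income bias (performance on Africa is 25.7% lower), and a compute cost of 22,016 A100 GPU-hours that limits reproducibility.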

Overview of the Argument Structure

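Problem
Vision lacks general-purpose features; CLIP is limited by the text bottleneck

→

Thesis
Self-supervision plus curated data can produce general-purpose visual features

→

Evidence
86.5% linear accuracy; dense tasks approach fine-tuned performance

→

Rebuttal
Curation matters more than scale; distillation makes small models strong too

→

Conclusion
Self-supervision is a viable and superior path to general-purpose visual features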

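Authors' Core Claim (One Sentence)

By pretraining a 1-billion-parameter ViT on 142 million carefully curated images from diverse sources with an improved DINO + iBOT recipe, then distilling it into smaller variants, one can obtain general-purpose self-supervised visual features that surpass OpenCLIP at both the image level and the pixel level.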

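Strongest Point of the Argument

The large gains in domain robustness: improvements on out-of-distribution benchmarks (ImageNet-A +29.6%, ImageNet-R +22.1%) far exceed the in-distribution gains, which strongly supports the "general-purpose" claim. A simple linear probe comes close to full fine-tuning on dense tasks (49.0 vs 53.6 mIoU), showing that the features are useful out of the box.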

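Weakest Point of the Argument

Reproducibility and fairness concerns: the training cost of 22,016 A100 GPU-hours (about 3.7 tonnes of carbon emissions) means only a few institutions can reproduce the results. The model shows systematic geographic bias (Africa scores 25.7% below Europe), and the curation pipeline's reliance on a pretrained model creates a circular dependency. In addition, the lack of language alignment means it trails CLIP on zero-shot tasks that require semantic understanding.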