Momentum Contrast for Unsupervised Visual Representation Learning

Abstract — 摘要

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective of contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks, outperforming its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC and COCO. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

本文提出動量對比學習（MoCo），用於無監督視覺表徵學習。從「對比學習即字典查詢」的視角出發，我們建構了一個具有佇列與移動平均編碼器的動態字典。這使得能夠即時建立一個大型且一致的字典，促進對比式無監督學習。MoCo 在 ImageNet 分類的常用線性評估協定下達到具競爭力的結果。更重要的是，MoCo 所學習的表徵能良好地遷移至下游任務，在 PASCAL VOC 和 COCO 的 7 項偵測/分割任務上超越了有監督預訓練的對應方法。這表明在許多視覺任務中，無監督與有監督表徵學習之間的差距已大幅縮小。

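Paragraph function: Overview of the whole paper. It frames contrastive learning through the novel "dictionary look-up" analogy and makes outperforming supervised pre-training on downstream tasks the central selling point.

Logical role: The abstract is tightly structured: it introduces the method (MoCo), supplies the framing (dictionary look-up), and then presents the results (surpassing supervised pre-training). The claim in the final sentence, that the gap "has been largely closed", is the most influential assertion in the paper.

Argumentation technique / potential weakness: "The gap has been largely closed" is a rallying claim that directly challenges the consensus that supervised learning is the gold standard. The conclusion rests only on specific detection and segmentation tasks, however, so whether it holds across all vision tasks remains to be verified.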

1. Introduction — 緒論

In natural language processing (NLP), unsupervised representation learning has been enormously successful, as exemplified by GPT and BERT. In computer vision, however, supervised pre-training still dominates. The reason may be related to their different signal spaces: language tasks use discrete signal spaces (words, sub-word units) that are amenable to building tokenized dictionaries, while vision tasks use continuous, high-dimensional signals that are not structured for dictionary building.

在自然語言處理（NLP）領域，無監督表徵學習已取得巨大成功，GPT 和 BERT 即為典範。然而在電腦視覺中，有監督預訓練仍居主導地位。原因可能與兩者不同的訊號空間有關：語言任務使用離散的訊號空間（詞、子詞單元），適合建構記號化的字典；而視覺任務使用連續的高維訊號，其結構不利於字典的建構。

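Paragraph function: Establishes the research motivation by using the success of NLP as a contrast to show that computer vision lags behind in unsupervised learning.

Logical role: Opening with a cross-domain comparison works well: NLP has succeeded → why has vision not? → because the signal spaces differ → so how can a "dictionary" be built for vision? This paves the way for MoCo's design motivation.

Argumentation technique / potential weakness: Attributing the difference between NLP and vision to "discrete vs. continuous" signals is insightful, but it may oversimplify: the difficulty of vision tasks also stems from factors such as spatial structure and multi-scale properties.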

We hypothesize that it is possible to build an effective visual dictionary for contrastive learning. We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss. We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The dictionary keys are encoded by a slowly progressing (momentum-updated) encoder, making the key representations more consistent across different mini-batches.

我們假設有可能為對比學習建構一個有效的視覺字典。本文提出動量對比學習（MoCo），作為一種為無監督學習建構大型且一致字典的方式，並搭配對比損失函數。我們將字典維護為一個資料樣本的佇列：當前小批次的編碼表徵被入列，最舊的則被出列。字典鍵值由一個緩慢演進的（動量更新的）編碼器編碼，使得鍵值表徵在不同小批次之間更具一致性。

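Paragraph function: Presents the core method, a two-part design that combines a queue with a momentum encoder.

Logical role: Moves directly from the "problem" of the previous paragraph (vision lacks a dictionary) to the "solution" (MoCo is the visual dictionary). The queue addresses "large"; the momentum encoder addresses "consistent".

Argumentation technique / potential weakness: The intuition behind the queue and the momentum update is clear and easy to grasp. However, the choice of momentum coefficient (0.999 vs. 0.9) strongly affects performance, and finding the best value requires extensive experimental tuning.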

Contrastive learning traces back to contrastive losses and NCE (Noise-Contrastive Estimation). Recent work includes InstDisc which stores features in a memory bank for all images, and SimCLR which relies on large batch sizes for a large set of negatives. The memory bank approach suffers from stale representations since stored features are from different training epochs. The large-batch approach requires expensive multi-GPU synchronization and is limited by GPU memory. MoCo offers a third perspective that decouples dictionary size from mini-batch size.

對比學習可追溯至對比損失與雜訊對比估計（NCE）。近期工作包括 InstDisc——在記憶體庫中儲存所有影像的特徵；以及 SimCLR——依賴大批次量來獲得大量負樣本。記憶體庫方法的問題在於表徵過時，因為儲存的特徵來自不同的訓練輪次。大批次方法則需要昂貴的多 GPU 同步，且受限於 GPU 記憶體。MoCo 提供了第三種視角，將字典大小與小批次量解耦。

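Paragraph function: Literature review that systematically compares three strategies for building the dictionary in contrastive learning.

Logical role: Reinterprets existing methods within the unified "dictionary" framework: memory bank (large but inconsistent), large batch (consistent but memory-limited), and MoCo (large and consistent). This three-way split makes MoCo's advantage obvious at a glance.

Argumentation technique / potential weakness: Summarizing competing methods under one framework is a highly effective academic strategy. However, the later SimCLR v2 partly alleviated the memory problem, so the criticism made here may be time-limited.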

3. Method — 方法

We consider contrastive learning as training an encoder for a dictionary look-up task. Consider an encoded query q and a set of encoded samples {k_0, k_1, k_2, ...} that are the keys of a dictionary. Assume there is a single key k_+ that q matches. A contrastive loss (InfoNCE) is a function whose value is low when q is similar to its positive key k_+ and dissimilar to all other keys. The loss takes the form of a softmax-based classifier that classifies q as k_+ among K negative keys.

我們將對比學習視為訓練編碼器執行字典查詢任務。考慮一個編碼後的查詢 q 與一組編碼後的樣本 {k_0, k_1, k_2, ...} 作為字典的鍵值。假設存在一個與 q 匹配的正鍵 k_+。對比損失（InfoNCE）是一個在 q 與正鍵 k_+ 相似且與所有其他鍵不相似時取低值的函數。此損失採用基於 softmax 的分類器形式，在 K 個負鍵中將 q 分類為 k_+。

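Paragraph function: Method framework. Defines the mathematical form of contrastive learning through the dictionary look-up analogy.

Logical role: Reframing contrastive learning as "dictionary look-up" gives the abstract formula an intuitive reading. Introducing the InfoNCE loss provides the mathematical basis for the queue design that follows.

Argumentation technique / potential weakness: The "dictionary look-up" analogy is one of the paper's most important conceptual contributions: it elevates contrastive learning from a training trick to an analyzable framework. The theoretical basis of InfoNCE (its relation to mutual information) is not developed here, however.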
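The softmax-style classification described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `l2_normalize` and `info_nce_loss`, the temperature value of 0.07, and the L2-normalization of features are assumptions that do not appear in these notes, and the random vectors stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (assumed preprocessing, not stated in the notes)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_loss(q, k_pos, negatives, temperature=0.07):
    """InfoNCE as a (K+1)-way classification in which the positive key is class 0.

    q:         (N, C) encoded queries
    k_pos:     (N, C) positive key for each query
    negatives: (K, C) negative keys shared by all queries
    """
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)       # (N, 1) positive logits
    l_neg = q @ negatives.T                                # (N, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)    # for numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_softmax[:, 0].mean())                # cross-entropy against class 0

# Toy data: random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
q = l2_normalize(rng.normal(size=(8, 128)))
negatives = l2_normalize(rng.normal(size=(1024, 128)))
unrelated = l2_normalize(rng.normal(size=(8, 128)))

loss_match = info_nce_loss(q, q, negatives)             # positive key identical to the query
loss_mismatch = info_nce_loss(q, unrelated, negatives)  # "positive" key unrelated to the query
```

In this toy setup the loss is much lower when the positive key matches the query than when it is an unrelated vector, which is the behavior the paragraph describes.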

3.1 Queue and Momentum Update — 佇列與動量更新

The dictionary as a queue allows us to decouple the dictionary size from the mini-batch size. The dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyperparameter. The samples in the dictionary are progressively replaced: the current mini-batch is enqueued and the oldest mini-batch is dequeued. The momentum update of the key encoder uses the formula theta_k = m * theta_k + (1-m) * theta_q, where m = 0.999 works much better than smaller values like 0.9. This slow update makes the key encoder evolve smoothly, so the representations in the queue remain relatively consistent.

將字典視為佇列使我們能夠將字典大小與小批次量解耦。字典大小可以遠大於典型的小批次量，且可作為超參數靈活獨立地設定。字典中的樣本被漸進式替換：當前小批次入列，最舊的小批次出列。鍵值編碼器的動量更新使用公式 theta_k = m * theta_k + (1-m) * theta_q，其中 m = 0.999 的效果遠優於較小的值（如 0.9）。此緩慢更新使鍵值編碼器平滑地演進，因此佇列中的表徵保持相對一致性。

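Paragraph function: Core technical details. Explains how the queue mechanism and the momentum update operate.

Logical role: This paragraph is the heart of the paper's technical contribution. The queue solves the "scale" problem (a large dictionary), and the momentum update solves the "consistency" problem (stable representations). The 0.999 vs. 0.9 comparison gives concrete implementation guidance.

Argumentation technique / potential weakness: "Decoupling" is a powerful engineering design principle: making dictionary size an independent hyperparameter greatly increases the method's flexibility. Still, the representations in the queue remain temporally inconsistent (older keys were produced by earlier encoder states); the momentum update mitigates this problem rather than eliminating it.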
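The queue replacement and the momentum rule theta_k = m * theta_k + (1-m) * theta_q can be sketched as follows. This is a simplified illustration under stated assumptions: `KeyQueue` and `momentum_update` are hypothetical names, parameters are plain NumPy arrays in a dict rather than a real network, the queue is zero-initialized purely as a placeholder, and the queue size is assumed to be a multiple of the batch size.

```python
import numpy as np

class KeyQueue:
    """Fixed-size FIFO dictionary of encoded keys, stored as a ring buffer."""

    def __init__(self, size, dim):
        self.size = size
        self.keys = np.zeros((size, dim))  # placeholder contents until the queue fills
        self.ptr = 0                       # position of the oldest mini-batch

    def enqueue_dequeue(self, new_keys):
        """Overwrite the oldest mini-batch with the newest one."""
        n = new_keys.shape[0]
        if self.size % n != 0:
            raise ValueError("this sketch assumes queue size is a multiple of batch size")
        self.keys[self.ptr:self.ptr + n] = new_keys
        self.ptr = (self.ptr + n) % self.size

def momentum_update(theta_k, theta_q, m=0.999):
    """Return key-encoder parameters after one step of theta_k = m*theta_k + (1-m)*theta_q."""
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name] for name in theta_k}

# Queue of 8 keys fed with mini-batches of 4: the third batch replaces the first.
queue = KeyQueue(size=8, dim=4)
for step in range(3):
    queue.enqueue_dequeue(np.full((4, 4), float(step + 1)))

# One momentum step moves the key encoder only slightly toward the query encoder.
theta_q = {"w": np.ones(2)}
theta_k = momentum_update({"w": np.zeros(2)}, theta_q, m=0.999)
```

Because the queue length is set in the `KeyQueue` constructor and not by the batch shape, the dictionary size stays an independent hyperparameter, which is the decoupling the paragraph refers to.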

4. Experiments — 實驗

Under the linear classification protocol on ImageNet, MoCo with a ResNet-50 encoder achieves 60.6% top-1 accuracy, competitive with SimCLR's 69.3% (which uses 8x larger batch size and stronger augmentation). The key finding is in transfer learning: when fine-tuned on PASCAL VOC object detection, MoCo pre-training achieves AP of 57.4 vs. 57.2 for supervised ImageNet pre-training. On COCO object detection and instance segmentation, MoCo outperforms the supervised counterpart in all 7 metrics. These results demonstrate that unsupervised pre-training can surpass supervised pre-training when the target task is different from ImageNet classification.

在 ImageNet 線性分類評估協定下，採用 ResNet-50 編碼器的 MoCo 達到 60.6% 的 top-1 準確率，與 SimCLR 的 69.3%（使用 8 倍大的批次量與更強的資料增強）具有競爭力。關鍵發現在於遷移學習：在 PASCAL VOC 物件偵測上微調時，MoCo 預訓練達到 AP 57.4，而有監督 ImageNet 預訓練為 57.2。在 COCO 物件偵測與實例分割上，MoCo 在全部 7 項指標上超越有監督的對應方法。這些結果證明，當目標任務與 ImageNet 分類不同時，無監督預訓練可以超越有監督預訓練。

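Paragraph function: Supplies the core experimental evidence, validating MoCo on both linear evaluation and transfer learning.

Logical role: This paragraph is the paper's empirical pillar. It strategically places the emphasis on transfer learning rather than linear evaluation, because the transfer results are more persuasive: they show MoCo surpassing supervised methods in practical application settings.

Argumentation technique / potential weakness: Making "transfer learning" rather than "linear classification" the main selling point is a shrewd strategy, since MoCo trails SimCLR by nearly 9 percentage points on linear classification (60.6% vs. 69.3%). The improvements on VOC and COCO are consistent but numerically small (57.4 vs. 57.2), so their statistical significance deserves attention.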

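Ablation experiments reveal several important findings. Dictionary size matters: increasing K from 256 to 65536 steadily improves performance. The momentum coefficient m = 0.999 clearly outperforms m = 0.9 (59.0% vs. 55.2% on ImageNet linear evaluation, in an ablation run with K = 4096), confirming that representation consistency is crucial for contrastive learning. Compared to the end-to-end approach (limited by batch size) and the memory bank approach (inconsistent features), MoCo achieves the best trade-off between dictionary size and consistency.

消融實驗揭示了幾項重要發現。字典大小至關重要：將 K 從 256 增加至 65536 能穩定地提升性能。動量係數 m = 0.999 明顯優於 m = 0.9（ImageNet 線性評估上 59.0% vs. 55.2%，此消融實驗採用 K = 4096），證實表徵一致性對於對比學習至關重要。相較於端對端方法（受限於批次量）和記憶體庫方法（特徵不一致），MoCo 在字典大小與一致性之間達到了最佳平衡。

Paragraph function: Ablation and comparison experiments that validate the design choices and quantify the contribution of each component.

Logical role: The ablations answer the two key questions raised by the method design directly: (1) how large should the dictionary be, and (2) how much momentum is needed? The comparison of concrete numbers makes the argument persuasive.

Argumentation technique / potential weakness: The 3.8 percentage-point gap (m = 0.999 vs. 0.9) is strong evidence. However, the ablations were run only on ImageNet and were not repeated on downstream tasks, so it is unclear whether the conclusions transfer.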
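One way to see why the momentum coefficient matters so much is to unroll the update rule theta_k = m * theta_k + (1-m) * theta_q over t iterations. The superscript notation below is introduced here for illustration and is not taken from the notes: \theta_q^{(i)} denotes the query-encoder parameters used at iteration i, and \theta_k^{(0)} the initial key-encoder parameters.

```latex
\theta_k^{(t)} \;=\; m^{t}\,\theta_k^{(0)} \;+\; (1-m)\sum_{i=1}^{t} m^{\,t-i}\,\theta_q^{(i)},
\qquad
\text{effective averaging horizon} \;\approx\; \frac{1}{1-m}
```

The key encoder is therefore an exponentially weighted average of past query encoders, with a horizon of roughly 1000 iterations for m = 0.999 and roughly 10 iterations for m = 0.9. As a rough illustration, assuming a mini-batch size of 256 (a value not stated in these notes), a queue of K = 65536 keys spans 65536 / 256 = 256 iterations. With m = 0.9 the key encoder turns over many times within that span, whereas with m = 0.999 it changes only modestly, which is consistent with the ablation result above.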

5. Conclusion — 結論

Our experiments show that MoCo is an effective mechanism for building large and consistent dictionaries for unsupervised contrastive learning. The learned representations transfer strongly to downstream detection and segmentation tasks, in many cases surpassing their supervised pre-training counterparts. We believe MoCo is a simple and general framework applicable beyond visual representation learning. We hope this work will help rethink the role of supervised pre-training in computer vision.

實驗表明 MoCo 是建構大型且一致字典以進行無監督對比學習的有效機制。所學習的表徵能強力遷移至下游偵測與分割任務，在許多情況下超越有監督預訓練的對應方法。我們認為 MoCo 是一個簡潔且通用的框架，其應用範圍超越視覺表徵學習。我們希望本工作能促使學界重新思考有監督預訓練在電腦視覺中的角色。

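Paragraph function: Summarizes the paper, restating the core findings and drawing out broader implications for the field.

Logical role: The conclusion rises from concrete results (the transfer-learning advantage) to a paradigm-level implication (rethinking supervised pre-training), showing the paper's influence beyond its technical details.

Argumentation technique / potential weakness: The call to "rethink supervised pre-training" was forward-looking; in hindsight, later work such as DINO and MAE shows that it did help usher in the wave of self-supervised learning. The "simple and general" claim in the conclusion is somewhat vague, though, and does not specify the limits of that generality.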

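Argument structure overview

Problem: Unsupervised visual learning lacks an effective mechanism for building a dictionary
→ Claim: A queue plus a momentum encoder can build a large and consistent dictionary
→ Evidence: MoCo outperforms supervised pre-training on 7 downstream tasks
→ Rebuttal: Dictionary size is successfully decoupled from batch size
→ Conclusion: The gap between unsupervised and supervised learning has largely closed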

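Authors' core claim (in one sentence)

With a queue-based dynamic dictionary and a momentum-updated encoder, MoCo achieves large-scale, consistent contrastive learning, and its unsupervised representations outperform supervised pre-training on multiple downstream detection and segmentation tasks.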

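Strongest part of the argument

The empirical breakthrough in transfer learning: MoCo outperforms supervised pre-training across all 7 detection/segmentation tasks on PASCAL VOC and COCO, a result that directly challenges the consensus that supervised pre-training is irreplaceable. The unified "dictionary look-up" framework also provides a clear conceptual tool for understanding contrastive learning and prompted a large number of follow-up method improvements.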

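Weakest part of the argument

The relative shortfall on linear evaluation: under the ImageNet linear classification protocol, MoCo (60.6%) trails the contemporaneous SimCLR (69.3%) by a wide margin, which indicates a remaining gap in some aspects of feature quality. In addition, the temporal inconsistency in the queue is only "mitigated" by the momentum update, not "eliminated", and its long-term effect on training stability is not analyzed in depth. The transfer-learning gains (for example, 57.4 vs. 57.2) are consistent but numerically small, so their statistical significance deserves further examination.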