Exploring Simple Siamese Representation Learning

Abstract

Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, the authors report surprising empirical results that simple Siamese networks can learn meaningful representations even without any of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. The key finding is that a stop-gradient operation plays an essential role in preventing collapsing. The authors hypothesize that SimSiam implicitly involves an expectation-maximization (EM) like algorithm.

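Paragraph function: Overview of the whole paper. The abstract frames the core contributions (simplification, stop-gradient, and the EM hypothesis) as a "surprising finding".

Logical role: The abstract establishes novelty by negation. It lists, one by one, the components that were assumed necessary but turn out to be dispensable, which jolts the reader's expectations. The EM hypothesis then adds theoretical depth.

Argumentative technique / potential weakness: The "surprising" rhetoric is effective, but it may overstate how settled the community's consensus was. BYOL had already shown that training without negative samples is feasible, so how surprising SimSiam is depends on whether the reader knows the BYOL results.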

1. Introduction

Siamese networks are "a natural and effective tool for learning visual representations" by comparing related samples. Recent methods have employed various strategies to prevent representation collapse: contrastive learning uses negative pairs to repel dissimilar representations; BYOL relies on a momentum encoder; SwAV uses online clustering with the Sinkhorn-Knopp algorithm. These increasing complexities raise a fundamental question: what is the minimal set of mechanisms needed to prevent collapse in Siamese representation learning?

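Paragraph function: Establishes the research landscape by surveying the strategies used to prevent collapse, which motivates simplification.

Logical role: Existing methods are arranged by increasing complexity (negative samples -> momentum encoder -> online clustering) to build a narrative that methods keep getting more complex, which leads naturally to the core question of whether they can be simplified.

Argumentative technique / potential weakness: Framing the problem as finding the "minimal set of mechanisms" makes for an appealing scientific question, since it implies that existing methods may be over-engineered. But the complexity in each method serves different goals (SwAV's clustering, for example, also improves performance), so simpler is not necessarily better.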

The collapsing problem, in which all outputs converge to a constant, is the central challenge. Without explicit prevention mechanisms, a Siamese network can trivially minimize the similarity loss by "outputting the same constant vector for all inputs." Existing solutions add complexity: contrastive losses require careful negative sampling and large batch sizes (e.g., 4096 in SimCLR); momentum encoders add architectural asymmetry and hyperparameter sensitivity. The authors show that none of these are necessary: a simple stop-gradient operation suffices.
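
To make the collapse problem concrete, the following minimal sketch shows that an encoder which ignores its input already attains the minimum of a cosine-similarity loss. The names `collapsed_encoder` and `random_view` are hypothetical stand-ins introduced for illustration, and the "views" are random numbers rather than real augmented images.

```python
import math
import random


def neg_cosine(p, z):
    """Negative cosine similarity between two vectors; its minimum is -1."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def collapsed_encoder(_view):
    """Hypothetical degenerate encoder: returns one constant vector for any input."""
    return [0.3, -1.2, 0.5]


def random_view(rng, dim=8):
    """Stand-in for an augmented image view (random numbers, not real data)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
losses = []
for _ in range(5):
    view_a, view_b = random_view(rng), random_view(rng)
    losses.append(neg_cosine(collapsed_encoder(view_a), collapsed_encoder(view_b)))

# Every pair sits at the loss minimum even though nothing was learned.
print(losses)
```

The loss alone cannot distinguish this degenerate solution from a useful one, which is why each method discussed here needs some mechanism that rules it out.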

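Paragraph function: Defines the core problem and criticizes the complexity of existing solutions.

Logical role: Following up on the "minimal mechanisms" question from the previous paragraph, this paragraph shows concretely how existing methods are "overly complex", setting up a sharp contrast with SimSiam's minimalist approach.

Argumentative technique / potential weakness: "None of these are necessary" is a bold claim. Strictly speaking, SimSiam's accuracy is slightly below that of MoCo v2 and BYOL under longer training schedules, so these mechanisms may be "not necessary" yet still "beneficial". Readers have to work out the boundary of the claim for themselves.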

2. Related Work

Contrastive learning methods like SimCLR and MoCo learn representations by "pulling positive pairs together and pushing negative pairs apart." SimCLR requires very large batch sizes (up to 8192) for sufficient negative samples. MoCo uses a momentum-updated queue to decouple batch size from the number of negatives. BYOL demonstrated that "negative pairs are not necessary" by using a momentum encoder and a predictor network, but its success was "attributed to the momentum encoder, leaving the role of each component unclear."

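Paragraph function: Literature review tracing the progression from SimCLR to BYOL.

Logical role: Establishes an academic lineage: SimCLR (large batches + negative samples) -> MoCo (queue decoupling) -> BYOL (no negatives + momentum), each step removing one complex mechanism. SimSiam is the natural extension of this trend.

Argumentative technique / potential weakness: Using the attribution problem around BYOL's success ("unclear attribution") as the entry point is clever. It implies that the field's understanding of these methods remains incomplete, which establishes the need for SimSiam's analytical study.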
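
For readers who want to see what "pulling positive pairs together and pushing negative pairs apart" looks like as a loss, here is a minimal sketch of an InfoNCE-style contrastive loss for a single query. The function name `info_nce`, the temperature of 0.1, and the toy 2-D vectors are assumptions made for illustration; they are not taken from any of the papers discussed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(query, positive, negatives, temperature=0.1):
    """Cross-entropy over similarities, with the positive as the target class."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
far_negatives = [[-1.0, 0.0], [0.0, -1.0]]   # already well separated from the query
near_negatives = [[1.0, 0.1], [0.9, -0.1]]   # nearly indistinguishable from the positive

loss_far = info_nce(query, positive, far_negatives)
loss_near = info_nce(query, positive, near_negatives)

# Negatives that sit close to the query are penalized much more heavily.
print(loss_far, loss_near)
```

Because the negatives appear in the denominator of the softmax, the quality of the signal depends on how many negatives are available, which is the reason these methods lean on large batches or a queue.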

3. Method

3.1 SimSiam Architecture

SimSiam takes two randomly augmented views x1, x2 of the same image and processes them through a shared encoder f (e.g., ResNet) followed by a projection MLP head h, giving projections z1 and z2. One branch additionally applies a prediction MLP head p, creating an asymmetry; p1 denotes the predictor output for x1. The loss is a negative cosine similarity, D(p1, z2) = -(p1 / ||p1||) · (z2 / ||z2||), symmetrized over both views as L = 1/2 D(p1, z2) + 1/2 D(p2, z1). Critically, the loss does not use negative pairs, momentum encoders, or large batches. A batch size of 256 works well.

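Paragraph function: Method description that fully defines SimSiam's architecture and loss function.

Logical role: The core message of this paragraph is what is absent: listing the mechanisms that are not used emphasizes the method's simplicity. The asymmetry introduced by the prediction head is the only structural distinction.

Argumentative technique / potential weakness: Defining the method by exclusion ("no X, no Y, no Z") makes simplicity the main selling point. But the prediction head is itself a form of complexity, so why is it necessary? This question is taken up later in the discussion of stop-gradient.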
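
The loss described in this section can be sketched in a few lines. This is an illustration on hand-picked vectors rather than network outputs; `simsiam_loss` is a name chosen here, and the stop-gradient that the method applies to z1 and z2 is not modeled because no gradients are computed.

```python
import math


def neg_cosine(p, z):
    """D(p, z): negative cosine similarity of p and z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized loss: each predictor output is compared with the other view's projection."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)


# Toy values: p1 is parallel to z2 (term = -1), p2 is orthogonal to z1 (term = 0).
p1, z2 = [2.0, 0.0], [1.0, 0.0]
p2, z1 = [0.0, 3.0], [1.0, 0.0]

loss = simsiam_loss(p1, z1, p2, z2)
print(loss)
```

Only directions matter: scaling p1 from [1, 0] to [2, 0] leaves its term at -1, since both arguments are normalized before the dot product.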

3.2 Stop-Gradient

The stop-gradient (stopgrad) operation is applied to the branch without the predictor: the loss becomes D(p1, stopgrad(z2)). This means "z2 is treated as a constant in this term" and does not receive gradients from this loss component. The authors demonstrate through controlled experiments that removing stop-gradient immediately leads to collapsing — the training loss reaches its minimum (-1) and the representation becomes a constant. The output standard deviation drops to zero within a few epochs. This simple operation is "critical for preventing collapse — it is the only mechanism needed."

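Paragraph function: Core finding. Experiments are used to show that stop-gradient is a necessary and sufficient condition for preventing collapse.

Logical role: The apex of the paper's argument. The controlled comparison (remove it -> collapse, keep it -> works) directly answers the core question, and the quantitative evidence of the standard deviation dropping to zero makes the conclusion irrefutable.

Argumentative technique / potential weakness: The ablation is designed precisely: only one variable changes (whether stop-gradient is present), so the causal inference is clean. But the phrase "the only mechanism needed" may be too strong, because the prediction head is also necessary (later experiments confirm that removing the prediction head likewise leads to collapse).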
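
What "z2 is treated as a constant" means for the gradients can be illustrated without a deep learning framework. The sketch below writes out the analytic gradient of the negative cosine similarity by hand and contrasts the two cases; the function names are hypothetical, and in practice a framework primitive (for example `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX) would do this inside automatic differentiation.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def neg_cosine(p, z):
    return -sum(a * b for a, b in zip(p, z)) / (norm(p) * norm(z))


def grad_first_arg(p, z):
    """Analytic gradient of neg_cosine(p, z) with respect to its first argument p."""
    p_hat = [a / norm(p) for a in p]
    z_hat = [b / norm(z) for b in z]
    cos = sum(a * b for a, b in zip(p_hat, z_hat))
    return [-(zh - cos * ph) / norm(p) for ph, zh in zip(p_hat, z_hat)]


def grads_with_stopgrad(p1, z2):
    """D(p1, stopgrad(z2)): z2 is a constant target, so its gradient is zero."""
    return grad_first_arg(p1, z2), [0.0] * len(z2)


def grads_without_stopgrad(p1, z2):
    """D(p1, z2): the loss is symmetric in its arguments, so z2 also gets a gradient."""
    return grad_first_arg(p1, z2), grad_first_arg(z2, p1)


p1 = [1.0, 2.0, -0.5]
z2 = [0.5, 1.0, 1.5]

grad_p_sg, grad_z_sg = grads_with_stopgrad(p1, z2)
grad_p_full, grad_z_full = grads_without_stopgrad(p1, z2)

# Finite-difference check of the analytic gradient with respect to p1.
eps = 1e-6
numeric = []
for i in range(len(p1)):
    bumped = list(p1)
    bumped[i] += eps
    numeric.append((neg_cosine(bumped, z2) - neg_cosine(p1, z2)) / eps)

print(grad_z_sg, grad_z_full)
```

With stop-gradient, only the predictor branch is pushed toward the target in this term; without it, both sides move toward each other, which is the setting in which the authors observe collapse.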

3.3 Hypothesis: EM Analysis

The authors provide a hypothesis based on an expectation-maximization (EM) framework. They consider the loss as involving two sets of variables: the network parameters theta and an implicit set of representation targets eta. The stop-gradient operation can be interpreted as "alternating between optimizing theta with eta fixed (one gradient step) and updating eta given theta (the stop-gradient target)." This resembles the E-step and M-step of EM algorithms. While not a formal proof, this framework "provides a plausible explanation for why SimSiam avoids collapse" — the alternating optimization naturally prevents trivial solutions.

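Paragraph function: Theoretical explanation that offers an intuitive, EM-based account of why stop-gradient works.

Logical role: Moves from empirical observation ("stop-gradient works") to theoretical understanding ("why it works"). The EM framework elevates what looks like an ad hoc engineering trick into a design choice with a theoretical basis.

Argumentative technique / potential weakness: The authors honestly position this as a "hypothesis" rather than a "proof", which preserves academic rigor. But the EM analogy has limits: true EM guarantees convergence to a local optimum, whereas SimSiam's one-step approximation carries no such guarantee. A gap remains between theory and practice.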
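
The alternating scheme in this hypothesis can be written out as a short sketch. Here theta and eta are the two sets of variables named above, while the network symbol F, the augmentation symbol T, and the use of a squared error in place of the cosine loss are notational assumptions introduced for this illustration.

```latex
\begin{aligned}
\mathcal{L}(\theta, \eta) &= \mathbb{E}_{x, \mathcal{T}}\left[ \left\| \mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_x \right\|_2^2 \right] \\
\theta^{t} &\leftarrow \arg\min_{\theta} \; \mathcal{L}(\theta, \eta^{t-1}) \\
\eta^{t} &\leftarrow \arg\min_{\eta} \; \mathcal{L}(\theta^{t}, \eta) \\
\eta_x^{t} &\leftarrow \mathbb{E}_{\mathcal{T}}\left[ \mathcal{F}_{\theta^{t}}(\mathcal{T}(x)) \right]
\end{aligned}
```

The last line follows because the minimizer of an expected squared error is the mean. Approximating that expectation with a single sampled augmentation, and the minimization over theta with one gradient step in which eta is held fixed, gives an update in which the second view's output acts as a constant target, which is what stop-gradient implements.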

4. Experiments

SimSiam is evaluated on ImageNet linear evaluation and on several transfer learning benchmarks. With a ResNet-50 backbone and 100-epoch pre-training, SimSiam achieves 68.1% top-1 accuracy under linear evaluation, ahead of SimCLR (66.5%), MoCo v2 (67.4%), and BYOL (66.5%) at the same schedule. With 200-epoch training, SimSiam reaches 70.0%, slightly behind BYOL (70.6%), so its gain from longer training is smaller. On COCO object detection and instance segmentation, SimSiam-pretrained models achieve results on par with supervised pre-training. Ablation studies show the predictor MLP and stop-gradient are both essential; removing either causes complete collapse. Batch sizes as small as 64 still work, at a cost of roughly 2 points of accuracy relative to 256, a significant advantage over SimCLR's reliance on batches of 4096 or more.

Paragraph function: Comprehensive experimental validation that demonstrates SimSiam's performance and properties across multiple benchmarks and settings.

Logical role: The empirical pillar covers four dimensions: (1) absolute performance under linear evaluation; (2) generalization in transfer learning; (3) component necessity from ablation studies; (4) robustness to batch size.

Argumentative technique / potential weakness: The small-batch advantage is a very practical contribution, since it lowers GPU memory requirements. But SimSiam gains less from longer training than BYOL does and trails it by 0.6 points at 200 epochs (70.0% versus 70.6%), which suggests that the momentum encoder, though not necessary, is beneficial. The authors do not fully discuss what this gap means.

5. Conclusion

This work shows that simple Siamese representation learning, with neither negative pairs nor momentum encoders, can achieve competitive results. The stop-gradient operation is identified as the key to avoiding collapse. An EM-like hypothesis provides a plausible theoretical framework. The simplicity of SimSiam makes it "a useful baseline for understanding self-supervised representation learning" and suggests that Siamese networks are a natural and effective tool that deserves further investigation.

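Paragraph function: Summarizes the paper, restating the value of simplicity and its theoretical implications.

Logical role: The conclusion positions SimSiam as a tool for understanding rather than a tool for performance. This is a shrewd framing choice, since it sidesteps the pressure of direct comparison with higher-performing methods such as BYOL.

Argumentative technique / potential weakness: Positioning SimSiam as a "useful baseline" shifts its value from engineering to science: even if its performance is not the best, its simplicity makes it an ideal experimental platform for understanding collapse mechanisms. But future directions are dismissed with only "deserves further investigation", without concrete guidance.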

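Argument Structure Overview

Problem: Why do self-supervised learning methods need complex mechanisms?
→ Thesis: Stop-gradient is a sufficient condition for preventing collapse.
→ Evidence: 68.1% on ImageNet; batch size can go down to 64.
→ Rebuttal: The EM hypothesis explains why collapse does not occur.
→ Conclusion: Siamese networks are a natural tool for learning.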

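The Authors' Core Claim (One Sentence)

A simple Siamese network needs only the stop-gradient operation to learn meaningful visual representations, and the underlying mechanism can be explained by an EM-like alternating optimization framework.

Strongest Point of the Argument

Precision of the ablation experiments: by systematically removing each component (negative samples, momentum encoder, large batches), the binary "collapse vs. no collapse" outcomes clearly delineate the necessity of each mechanism. The stop-gradient finding has engineering value and also reveals a basic operating principle of Siamese learning.

Weakest Point of the Argument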

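Insufficient rigor in the EM hypothesis: the EM framework is an intuitive analogy rather than a formal proof, and a theoretical gap remains between the one-step gradient approximation and the convergence guarantees of true EM. In addition, SimSiam trails BYOL under longer training (70.0% versus 70.6% at 200 epochs in the paper's comparison), which implies that mechanisms declared "unnecessary" still make a marginal contribution, yet the authors do not analyze the source of this gap in depth.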