Knowledge Distillation Meets Self-Supervision

Abstract

We present KDSS (Knowledge Distillation meets Self-Supervision), a framework that bridges knowledge distillation and self-supervised learning to produce compact yet powerful student networks. Traditional knowledge distillation transfers knowledge from a large teacher to a small student through output mimicry, which may lose structural information. We propose using self-supervised auxiliary tasks as a bridge: the teacher and student are encouraged to produce similar representations on self-supervised pretext tasks, which enriches the distilled knowledge. KDSS achieves state-of-the-art distillation results on ImageNet and CIFAR-100.

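Paragraph function: Overview of the whole paper; defines a framework in which self-supervised auxiliary tasks enhance knowledge distillation.

Logical role: Framing the intersection of two independent research directions as the novelty is an effective strategy and establishes a clear research position.

Argumentation technique / potential weakness: The critique that structural information is lost establishes the need for the self-supervised bridge, but experiments must verify that this information is actually captured.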

1. Introduction

Knowledge distillation compresses large models into smaller ones by training the student to mimic the teacher's outputs. Self-supervised learning learns representations from unlabeled data through pretext tasks. These two paradigms have been studied independently. We observe that self-supervised tasks encode complementary structural knowledge that standard distillation misses. Jointly optimizing the distillation and self-supervised objectives lets the student learn richer representations that generalize better to downstream tasks.

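Paragraph function: Establishes the motivation by identifying the complementarity between knowledge distillation and self-supervised learning.

Logical role: The notion of "complementary structural knowledge" is the paper's core insight and provides the theoretical basis for merging the two fields.

Argumentation technique / potential weakness: Innovating at the intersection of two mature directions carries relatively low risk with clear potential, which makes it a sound research strategy.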

Prior work in knowledge distillation has explored feature-level matching, attention transfer, and relational distillation, but these still rely on task-specific supervision signals. Self-supervised learning has emerged as a powerful paradigm for learning task-agnostic representations that capture geometric, textural, and structural properties of images. We hypothesize that these properties are precisely what standard distillation fails to transfer effectively.

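Paragraph function: Literature positioning; distinguishes KDSS from prior distillation methods.

Logical role: Positions self-supervised representations as task-agnostic, complementing the task-specific nature of distillation.

Argumentation technique / potential weakness: The hypothesis is reasonable but needs experimental verification of which specific structural information standard distillation actually misses.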

2. Method

KDSS consists of three loss components: (1) a task-specific distillation loss (the KL divergence between the output distributions given by the teacher and student logits), (2) a self-supervised contrastive loss that encourages both teacher and student to produce consistent representations under augmentations, and (3) a cross-modal distillation loss that aligns teacher and student self-supervised representations. The total loss is a weighted combination of all three, enabling end-to-end training.

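Paragraph function: Core method; the design of the three losses and how they are combined.

Logical role: Each of the three losses targets a different aspect of knowledge transfer; the design is thorough, with clearly separated responsibilities.

Argumentation technique / potential weakness: Choosing the weights may require extensive hyperparameter tuning, which adds complexity in practical use.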
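
The three-part objective described in this section can be sketched in a few lines. This is a minimal, framework-free illustration rather than the authors' implementation: the function names, the `temperature` parameter, and the weight tuple are assumptions introduced here, since the source states only that the distillation term is a KL divergence and that the total loss is a weighted combination.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softened distributions.

    The temperature is an assumption; the source does not mention one.
    """
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


def kdss_total_loss(loss_kd, loss_ssl, loss_cross, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    w_kd, w_ssl, w_cross = weights
    return w_kd * loss_kd + w_ssl * loss_ssl + w_cross * loss_cross


# Toy logits and placeholder loss values, for illustration only.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
loss_kd = kd_kl_loss(teacher, student, temperature=4.0)
total = kdss_total_loss(loss_kd, loss_ssl=0.3, loss_cross=0.1,
                        weights=(1.0, 0.5, 0.5))
```

The three weights are the hyperparameters that the note above flags as a tuning burden. In this sketch, setting a weight to zero drops the corresponding term.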

2.1 Self-Supervised Bridge

The self-supervised pretext tasks include rotation prediction and contrastive instance discrimination. For rotation prediction, both teacher and student must predict the rotation angle applied to the input image. The key insight is that matching rotation predictions requires understanding spatial structure and object orientation, knowledge that goes beyond class labels. The contrastive task further enriches this with instance-level discrimination ability.

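Paragraph function: Auxiliary task design; the complementary roles of rotation prediction and contrastive learning.

Logical role: "Knowledge that goes beyond class labels" precisely defines the added value of the self-supervised bridge.

Argumentation technique / potential weakness: Rotation prediction is a relatively simple pretext task; more complex tasks (such as jigsaw puzzles or colorization) might bring larger improvements.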
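
The two pretext tasks can be illustrated with a small sketch. Several details here are assumptions, not taken from the source: the rotation task is shown with the common four-way setup (0, 90, 180, 270 degrees), the contrastive task is shown with an InfoNCE-style loss, and the temperature of 0.1 and all function names are hypothetical. Images are plain 2D lists so the example needs no external library.

```python
import math


def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_batch(image):
    """Return [(rotated_image, label)] where label k means k * 90 degrees."""
    batch = []
    current = [list(row) for row in image]
    for label in range(4):
        batch.append((current, label))
        current = rotate90(current)
    return batch


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to every negative."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


# Four rotated copies of a toy 2x2 "image", each with its rotation label.
batch = rotation_batch([[1, 2], [3, 4]])

# The loss is small when the positive matches the anchor, large otherwise.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a distillation setting, both teacher and student would classify the rotation label of each copy, and the student would be trained to match the teacher's rotation predictions.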

The cross-modal distillation operates in the representation space rather than the output space. We extract intermediate feature maps from both teacher and student during self-supervised tasks and minimize their cosine distance. This forces the student to learn similar internal representations to the teacher for structural tasks, providing a richer learning signal than output-level KL divergence alone.

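Paragraph function: Cross-modal distillation; aligns teacher and student in representation space.

Logical role: Representation-level alignment goes deeper than output-level alignment and captures more structural information.

Argumentation technique / potential weakness: Cosine distance is a natural choice, but teacher and student feature dimensions may differ, which requires a projection layer.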
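
A minimal sketch of the representation-level alignment follows, including the projection step needed when teacher and student features differ in size. It assumes that feature maps have already been flattened into vectors and that the projection is a plain linear map. Neither detail is specified in the source, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def project(weight, features):
    """Linear map; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def feature_alignment_loss(teacher_feat, student_feat, projection=None):
    """Cosine distance between teacher features and (projected) student features."""
    if projection is not None:
        student_feat = project(projection, student_feat)
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature dimensions differ; supply a projection")
    return cosine_distance(teacher_feat, student_feat)


# Toy example: a 2-d student feature lifted to the teacher's 4-d space.
teacher_feat = [1.0, 0.0, 1.0, 0.0]
student_feat = [1.0, 1.0]
projection = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
loss_cross = feature_alignment_loss(teacher_feat, student_feat, projection)
```

In practice the projection would be learned jointly with the student. It is fixed here only to keep the example deterministic.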

3. Experiments

On ImageNet, with a ResNet-34 teacher, KDSS improves ResNet-18 student accuracy from 69.75% (vanilla KD) to 71.96%, a gain of 2.21 percentage points. On CIFAR-100, KDSS achieves 76.45% compared to 74.92% for vanilla KD, a gain of 1.53 percentage points. Ablation studies confirm that both rotation prediction (+0.8%) and contrastive learning (+1.2%) contribute to the improvement, with their combination providing the full gain.

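Paragraph function: Quantitative evaluation; consistent improvements on two benchmarks plus an ablation analysis.

Logical role: The +2.21% ImageNet improvement is quite significant in the distillation literature and demonstrates the method's effectiveness.

Argumentation technique / potential weakness: The ablation cleanly separates the individual contributions of the two self-supervised tasks; the experimental design is rigorous.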
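
The reported gains can be checked with simple arithmetic. The sketch below recomputes the absolute gain in percentage points from the accuracies quoted above and also derives the relative reduction in error rate. The latter is not reported in the source and is included only as a complementary way to read the same numbers.

```python
def gains(baseline_acc, improved_acc):
    """Return (absolute gain in points, relative error reduction in %)."""
    absolute = improved_acc - baseline_acc
    baseline_error = 100.0 - baseline_acc
    relative_error_reduction = 100.0 * absolute / baseline_error
    return round(absolute, 2), round(relative_error_reduction, 2)


# Accuracies as quoted in the text: vanilla KD versus KDSS.
imagenet = gains(69.75, 71.96)
cifar100 = gains(74.92, 76.45)
```

The absolute gains come out to 2.21 points on ImageNet and 1.53 points on CIFAR-100. These correspond to relative error reductions of about 7.3% and 6.1%, respectively.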

We also evaluate transfer learning performance by fine-tuning the distilled student on downstream tasks. On VOC detection, KDSS-trained ResNet-18 achieves +1.5% mAP improvement over vanilla KD. On COCO instance segmentation, the improvement is +0.9% AP. These results confirm that KDSS produces more transferable representations than standard distillation.

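Paragraph function: Transfer evaluation; verifies that the representations generalize to downstream tasks.

Logical role: The downstream improvements directly support the core claim of richer representations.

Argumentation technique / potential weakness: The transfer learning evaluation makes the conclusion more convincing by going beyond validation on a single task.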

4. Conclusion

We have shown that knowledge distillation and self-supervised learning are complementary paradigms that can be unified to produce better compact models. KDSS achieves state-of-the-art distillation results by leveraging self-supervised tasks as a bridge for richer knowledge transfer. This work opens up a new direction at the intersection of model compression and self-supervised learning.

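Paragraph function: Summary; establishes a new research direction at the intersection of the two fields.

Logical role: Positioning the work at this intersection gives follow-up research a clear roadmap.

Argumentation technique / potential weakness: The generality of the framework suggests it could be combined with more advanced self-supervised methods (such as MAE or DINO) for larger improvements.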

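Argument Structure Overview

Problem: Distillation loses structural knowledge
→ Claim: Self-supervision provides complementary knowledge
→ Method: Joint training with three losses
→ Evidence: ImageNet +2.21%
→ Conclusion: A new direction at the intersection of the two fields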

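Core Claim

Bridging knowledge distillation with self-supervised auxiliary tasks transfers structural knowledge beyond class labels and significantly improves the quality of the student network's representations.

Strongest Point of the Argument

The significant +2.21% improvement on ImageNet, together with a complete ablation analysis, clearly verifies the individual contributions of the two self-supervised tasks.

Weakest Point of the Argument

The loss weights require additional hyperparameter tuning, and the choice of pretext task significantly affects the results yet lacks systematic guidance.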