CutMix: Regularization Strategy to Train Strong Classifiers

Abstract

Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved effective at guiding the model to attend to less discriminative parts of objects, thereby letting the network generalize better and localize objects better. However, the deleted regions are usually zeroed out or filled with random noise, greatly reducing the proportion of informative pixels in training images. We propose CutMix, in which patches are cut and pasted among training images and the ground truth labels are mixed in proportion to the area of the patches. CutMix makes efficient use of training pixels and retains the regularization effect of regional dropout, achieving state-of-the-art results on CIFAR and ImageNet classification, weakly-supervised localization, and transfer learning tasks.

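Paragraph function: Overview of the whole paper. It starts from the established benefits of regional dropout, points out the information it wastes, and then introduces CutMix as the remedy.

Logical role: The abstract positions CutMix through a clear three-step structure of acknowledgment, critique, and proposal: it is not a new paradigm but a refinement of existing regularization strategies.

Argumentation technique / potential weakness: The criticism that regional dropout greatly reduces the proportion of informative pixels hits the core weakness of Cutout. However, CutMix's label mixing resembles Mixup, so the method section needs to distinguish the two clearly.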

1. Introduction

Deep convolutional neural networks (CNNs) have shown promising performance on computer vision problems such as image classification, object detection, semantic segmentation, and video analysis. To prevent overfitting, "random feature removal regularizations" have been proposed, including dropout and regional dropout methods such as Cutout. While these methods regularize effectively, the removed regions are usually zeroed out or filled with random noise, which greatly reduces the proportion of informative pixels in training images. Part of every training image therefore carries no information about the object, and valuable training signal is wasted.

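Paragraph function: Establishes the problem. Starting from overfitting in CNNs, it points out the inefficiency of existing regularization methods.

Logical role: The start of the chain of argument. It first acknowledges that regularization is necessary, then pinpoints "wasted pixels" as an overlooked efficiency problem.

Argumentation technique / potential weakness: Framing the problem as one of pixel-use efficiency is a clever angle, because it makes the solution (filling the region with meaningful patches instead of zeros) appear natural.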

CutMix addresses this limitation by replacing removed regions with patches from other training images, ensuring "no uninformative pixel during training" while maintaining the regularization benefits of regional dropout. The ground truth labels are mixed proportionally to the area of the combined patches. This simple yet effective strategy simultaneously improves classification accuracy, localization ability, model robustness, and uncertainty estimation, demonstrating that efficient use of every training pixel matters.

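Paragraph function: Proposes the solution. It outlines the core operation of CutMix and its benefits along several dimensions.

Logical role: This paragraph answers the problem raised in the previous one: by replacing rather than deleting, CutMix gets both regularization and information retention. The four listed benefits preview the dimensions validated in the experiments section.

Argumentation technique / potential weakness: The claim of improving four metrics at once is attractive, but it is worth checking whether it holds in every setting. If the gain is insignificant on some tasks, the sweeping claim could be challenged.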
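To make the label mixing concrete, the following is a minimal sketch, assuming one-hot labels and a softmax classifier trained with cross-entropy (the paragraph above does not specify the loss). It shows that the cross-entropy against the area-weighted label equals the same weighted sum of the two per-class losses. The function names and the three-class numbers are hypothetical and chosen only for illustration.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def mix_labels(y_a, y_b, lam):
    """Blend two one-hot labels: lam * y_a + (1 - lam) * y_b."""
    return [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]

# Hypothetical 3-class example; the numbers are illustrative only.
probs = [0.6, 0.3, 0.1]   # model output after softmax
y_a = [1.0, 0.0, 0.0]     # label of the image that keeps most of its pixels
y_b = [0.0, 1.0, 0.0]     # label of the image that donates the patch
lam = 0.7                 # fraction of pixels that still come from image A

y_mixed = mix_labels(y_a, y_b, lam)

# Loss against the soft label ...
loss_soft = cross_entropy(probs, y_mixed)
# ... equals the area-weighted sum of the two hard-label losses.
loss_split = lam * cross_entropy(probs, y_a) + (1 - lam) * cross_entropy(probs, y_b)
```

Because the two forms agree, a training loop can use whichever is more convenient: build the soft label once, or evaluate the loss twice with the original labels and weight the results.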

Regional Dropout: Methods removing random regions in images have been proposed to enhance generalization. Cutout randomly masks out square regions, while Random Erasing fills them with random values. CutMix differs fundamentally by "filling with patches from another training image" rather than zeroing or noise-filling.

Mixup: CutMix shares similarity with Mixup in that both combine two samples with linearly interpolated labels. However, Mixup blends entire images, producing locally ambiguous and unnatural samples. "CutMix overcomes the problem by replacing the image region with a patch from another training image," producing more locally natural results. CutMix is also "complementary to the above methods because it operates on the data level, without changing internal representations or architecture."

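Paragraph function: Literature comparison. It systematically distinguishes CutMix from Cutout and Mixup.

Logical role: The three-way comparison positions CutMix precisely: it combines the regional operation of Cutout with the label mixing of Mixup while avoiding the drawbacks of each.

Argumentation technique / potential weakness: Using "local naturalness" as the key argument for superiority over Mixup is persuasive. Yet a "locally natural" sample is still a spliced image, so whether it is truly natural is open to question. In addition, the unnaturalness of Mixup may in some settings provide stronger regularization.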
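The three-way comparison can be illustrated on toy data. The sketch below is a simplified, dependency-free illustration rather than any library's implementation: images are 4x4 single-channel lists of rows, the box is fixed instead of random, and the helper names are hypothetical.

```python
def cutout(img, box, fill=0.0):
    """Regional dropout: overwrite the box with a constant value."""
    y1, y2, x1, x2 = box
    return [[fill if (y1 <= y < y2 and x1 <= x < x2) else v
             for x, v in enumerate(row)] for y, row in enumerate(img)]

def mixup(img_a, img_b, lam):
    """Blend every pixel of the two images with weight lam."""
    return [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(img_a, img_b)]

def cutmix(img_a, img_b, box):
    """Paste the box region of img_b into img_a."""
    y1, y2, x1, x2 = box
    return [[img_b[y][x] if (y1 <= y < y2 and x1 <= x < x2) else v
             for x, v in enumerate(row)] for y, row in enumerate(img_a)]

def count(img, value):
    """Number of pixels equal to value."""
    return sum(row.count(value) for row in img)

# Toy images: image A is all 1.0, image B is all 2.0.
img_a = [[1.0] * 4 for _ in range(4)]
img_b = [[2.0] * 4 for _ in range(4)]
box = (0, 2, 0, 2)            # rows 0-1, columns 0-1: a quarter of the image
lam = 1 - (2 * 2) / (4 * 4)   # 0.75 of the pixels stay from image A

out_cutout = cutout(img_a, box)          # 4 pixels carry no image content
out_mixup = mixup(img_a, img_b, lam)     # every pixel is a blend of A and B
out_cutmix = cutmix(img_a, img_b, box)   # every pixel is an unblended pixel of A or B
```

In this toy setting Cutout leaves four pixels at the fill value, Mixup produces no pixel that matches either source image, and CutMix keeps every pixel equal to a pixel from one of the two sources.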

3. CutMix Method

3.1 Algorithm

The CutMix combining operation generates a new training sample (x_tilde, y_tilde) from two samples (x_A, y_A) and (x_B, y_B): x_tilde = M * x_A + (1 - M) * x_B and y_tilde = lambda * y_A + (1 - lambda) * y_B, where the products with M are element-wise, M is a binary mask that is 0 inside the cut box and 1 elsewhere, and lambda is sampled from Beta(alpha, alpha). The box has width r_w = W * sqrt(1 - lambda) and height r_h = H * sqrt(1 - lambda), so the cropped area ratio is r_w * r_h / (W * H) = 1 - lambda. The label mixing ratio therefore corresponds to the pixel area ratio. Implementation is straightforward: "CutMix is simple and incurs a negligible computational overhead" compared to existing data augmentation techniques.

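Paragraph function: Core algorithm. It defines the CutMix operation fully in mathematical form.

Logical role: This paragraph lays the mathematical foundation of the method. The exact correspondence between area ratio and label ratio is the central design principle, and it gives the mixed label a geometric meaning.

Argumentation technique / potential weakness: The simplicity of the algorithm is its greatest strength, since it is easy to implement and reproduce. But restricting the mask to a rectangle ignores object shape; an object-aware, non-rectangular mask might perform better.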
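The formulas above can be sketched in a few lines. This is a minimal, dependency-free sketch and not the authors' reference code. It makes several assumptions that the text above does not state: the box centre is drawn uniformly over the image, the box is clipped at the image border, lambda is recomputed from the clipped box so that the label still matches the pasted area, and alpha defaults to 1.0. Images are lists of rows, and the names `sample_box` and `cutmix_pair` are hypothetical.

```python
import math
import random

def sample_box(width, height, lam, rng):
    """Sample a box whose unclipped area is about (1 - lam) * width * height."""
    cut_ratio = math.sqrt(1.0 - lam)
    r_w, r_h = int(width * cut_ratio), int(height * cut_ratio)
    # Assumption: uniform box centre, box clipped to the image border.
    c_x, c_y = rng.randrange(width), rng.randrange(height)
    x1, x2 = max(c_x - r_w // 2, 0), min(c_x + r_w // 2, width)
    y1, y2 = max(c_y - r_h // 2, 0), min(c_y + r_h // 2, height)
    return x1, y1, x2, y2

def cutmix_pair(img_a, img_b, y_a, y_b, alpha=1.0, rng=random):
    """Return (mixed image, mixed label, lam) for two H x W images."""
    height, width = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)
    x1, y1, x2, y2 = sample_box(width, height, lam, rng)
    mixed = [row[:] for row in img_a]            # M = 1 region: keep image A
    for y in range(y1, y2):
        mixed[y][x1:x2] = img_b[y][x1:x2]        # M = 0 region: paste image B
    # Clipping and integer rounding change the box area, so recompute lam
    # from the pixels that were actually replaced.
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed, label, lam

# Usage on toy data: image A is all zeros, image B is all ones.
rng = random.Random(0)
img_a = [[0] * 8 for _ in range(8)]
img_b = [[1] * 8 for _ in range(8)]
mixed, label, lam = cutmix_pair(img_a, img_b, [1.0, 0.0], [0.0, 1.0], rng=rng)
pasted_fraction = sum(map(sum, mixed)) / 64      # share of pixels taken from B
```

Recomputing lambda after clipping is a design choice: without it, a box that hangs over the border would paste fewer pixels than the label claims.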

3.2 Discussion

The authors verify that CutMix effectively forces the model to learn object recognition from partial views. Class Activation Mapping (CAM) visualizations reveal that "CutMix can take advantage of the mixed region on image, but Cutout cannot." Because CutMix provides meaningful content in the replaced region, the model learns to recognize objects from their visible parts while also extracting information from the pasted region. The method achieves three desirable properties simultaneously: "usage of full image region," "regional dropout" effect, and "mixed image & label" correspondence.

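Paragraph function: Mechanism verification. It uses CAM visualizations to explain how CutMix works.

Logical role: This paragraph links the intuitive design motivation (efficient use of pixels) to observable learning behavior (the CAM distribution), giving the method explanatory evidence beyond quantitative metrics.

Argumentation technique / potential weakness: CAM visualizations are intuitive and persuasive supporting evidence. But the reliability of CAM itself has been questioned: its activated regions do not necessarily correspond to the actual basis of the model's decision.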

4. Experiments

ImageNet Classification: With ResNet-50, CutMix achieves 21.40% top-1 error, 2.28 percentage points lower than the baseline, outperforming all considered augmentation strategies. Notably, "CutMix improves the performance by +2.28% while increased depth (ResNet-50 to ResNet-152) boosts +1.99%," indicating that the gain from augmentation can exceed the gain from a deeper architecture. On CIFAR-100 with PyramidNet-200, CutMix achieves 14.47% top-1 error, 1.98 percentage points lower than the baseline.

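Paragraph function: Core experiments. It demonstrates classification performance on standard benchmarks.

Logical role: This paragraph provides the most direct performance evidence. The comparison between the augmentation gain and the architectural gain is especially striking, implying that data augmentation is an underrated route to better performance.

Argumentation technique / potential weakness: The comparison showing that the augmentation gain exceeds the architectural gain is striking, but the two differ greatly in compute cost (CutMix is almost free, whereas a deeper network needs more computation), so the comparison is not entirely like for like.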
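Since the reported gains are absolute differences in top-1 error, a short calculation helps put them in context. The sketch below uses only the figures quoted above; the baseline errors and relative reductions are derived by arithmetic from those figures rather than quoted from the paper, and the function names are hypothetical.

```python
def implied_baseline(error_after, gain_points):
    """Baseline error (in %) implied by a reported error and an absolute gain."""
    return error_after + gain_points

def relative_reduction(error_after, gain_points):
    """Fraction of the baseline error that the gain removes."""
    return gain_points / implied_baseline(error_after, gain_points)

# ImageNet, ResNet-50: 21.40% error after a 2.28-point gain.
imagenet_base = implied_baseline(21.40, 2.28)
imagenet_rel = relative_reduction(21.40, 2.28)

# CIFAR-100, PyramidNet-200: 14.47% error after a 1.98-point gain.
cifar_base = implied_baseline(14.47, 1.98)
cifar_rel = relative_reduction(14.47, 1.98)
```

Under this reading, the implied baselines are about 23.68% and 16.45% top-1 error, so the smaller absolute gain on CIFAR-100 is a somewhat larger relative reduction (roughly 12% versus roughly 10% of the baseline error).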

Weakly Supervised Object Localization: CutMix outperforms Mixup on localization accuracies by +5.51% and +1.41% on CUB200-2011 and ImageNet, respectively. CutMix also achieves comparable localization accuracies to dedicated state-of-the-art WSOL methods. Robustness and Uncertainty: "CutMix significantly improves the robustness to adversarial attacks" compared to other augmentation methods, and "significantly alleviates the over-confidence of the model" on out-of-distribution detection. Transfer Learning: CutMix outperforms Mixup and Cutout on Pascal VOC detection and image captioning benchmarks.

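Paragraph function: Breadth of validation. It shows benefits across localization, robustness, uncertainty, and transfer learning.

Logical role: This paragraph extends the validation beyond classification and supports the core claim that all four aspects improve together. The marked gain in localization follows directly from the method's design, in which partial occlusion forces the model to learn a broader set of features.

Argumentation technique / potential weakness: Consistent improvement across tasks is a strong argument. But the mechanism behind the robustness gain is not analyzed in depth. Why does CutMix improve adversarial robustness, and how does it compare with adversarial training? These questions merit further study.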

5. Conclusion

CutMix is a simple and effective strategy for training CNNs with strong classification and localization ability. By replacing removed regions with informative patches from other images and mixing labels proportionally, the method adds essentially no extra cost while delivering consistent improvements across classification, localization, transfer learning, robustness, and uncertainty estimation. The approach is complementary to architectural advances and other training strategies, making it a practical and broadly applicable tool for improving deep learning models.

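Paragraph function: Summarizes the paper, stressing the method's simplicity, negligible cost, and broad benefits.

Logical role: The conclusion carries an implicit "free lunch" message: CutMix adds almost no cost yet improves many aspects, which makes it a training technique that should be adopted by default.

Argumentation technique / potential weakness: Positioning the method as complementary avoids overclaiming. But the conclusion does not discuss failure cases or settings where the method does not apply. Under what conditions might CutMix be ineffective or harmful? Such a discussion would make the paper more complete.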

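Overview of the Argument Structure

Problem: regional dropout wastes informative training pixels
→ Claim: fill the removed region with meaningful patches instead of zeros
→ Evidence: ImageNet/CIFAR classification, localization, and robustness all improve
→ Rebuttal: the difference from Mixup lies in local naturalness
→ Conclusion: a simple, effective, general-purpose augmentation strategy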

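The Authors' Core Claim (in one sentence)

Replacing masked regions with patches from other training images and mixing labels in proportion to area yields both the regularization effect and full pixel utilization, bringing consistent and essentially free improvements across classification, localization, robustness, and other tasks.

Strongest Point of the Argument

Consistent improvement across tasks: CutMix is effective not only for classification but also for weakly supervised localization, adversarial robustness, uncertainty estimation, and transfer learning, which suggests that its mechanism is fundamental. The direct numerical comparison with an architectural improvement (+2.28% vs. +1.99%) is especially notable.

Weakest Point of the Argument

Insufficient explanation of the underlying mechanism: why can CutMix improve so many different metrics at once? The paper offers a partial explanation through CAM visualizations but does not build a deeper theoretical framework. The rectangular mask design is also not fully ablated: how do masks of different shapes and size distributions perform? Answering these questions would strengthen the theoretical basis of the method.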