Rethinking ImageNet Pre-training

Abstract

We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust: the models are no worse than their pre-training counterparts under multiple settings including training with limited data, using deeper and wider models, and performing multiple tasks. We achieved 50.9 AP on COCO object detection without using any external data — a result on par with the top COCO 2017 competition results.

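Paragraph function: Overview of the whole paper. It challenges the conventional belief that pre-training is necessary and states the core findings.

Logical role: The abstract sets the tone with "surprisingly robust", repositioning random initialization from "infeasible" to "feasible and competitive". The 50.9 AP figure directly challenges the pre-training paradigm.

Argumentative technique / potential weakness: The concession that the "sole exception" is a longer training schedule is shrewd, because it pre-empts the most obvious objection. But more iterations mean more computation, which is not free.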

1. Introduction

Deep convolutional neural networks have revolutionized computer vision through transfer learning, where representations learned on pre-training tasks transfer to target tasks. The established paradigm has been to pre-train on large-scale data like ImageNet, then fine-tune on target tasks with less training data. This approach enabled state-of-the-art results on object detection, image segmentation, and action recognition. Recent efforts have pushed this paradigm further by pre-training on datasets up to 3000x larger than ImageNet. However, improvements on object detection in particular are small and scale poorly with the pre-training dataset size.

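Paragraph function: Establishes the historical standing of the transfer-learning paradigm, then introduces its problem of diminishing returns.

Logical role: First credits pre-training's historical contribution (so the paper is not read as dismissing prior work), then creates a jolt with "3000x more data, yet little gain".

Argumentative technique / potential weakness: "Scale poorly" is an implicit criticism of ever-larger pre-training and paves the way for the question "why not train directly on the target data?". However, the task gap between classification pre-training and detection could by itself explain this phenomenon.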

We question this paradigm by exploring the opposite direction: achieving competitive object detection and instance segmentation accuracy when training on COCO from random initialization ("from scratch"), without any pre-training. Notably, this is accomplished using baseline systems and their hyper-parameters that were optimized for fine-tuning pre-trained models. Success requires two key ingredients: appropriate normalization techniques for optimization and sufficient training duration to compensate for lack of pre-training.

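Paragraph function: States the core research question and previews the conditions for success.

Logical role: The phrase "opposite direction" positions the paper as a fundamental challenge to the established paradigm rather than an incremental improvement.

Argumentative technique / potential weakness: Stressing that the hyper-parameters were optimized for pre-trained models is a strong experimental design, because it rules out the rebuttal that from-scratch training merely received special tuning. Openly disclosing the two ingredients for success adds credibility.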

Our key observations are threefold. (i) ImageNet pre-training speeds up convergence, especially early in training, but models trained from scratch eventually catch up once they have received a comparable total amount of computation. (ii) Pre-training does not automatically provide regularization; models trained from scratch match its accuracy without any extra regularization, even with only 10% of the COCO data. (iii) Pre-training shows minimal benefit when the target task requires spatially well-localized predictions; training from scratch yields a noticeable AP improvement at high box overlap thresholds.

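Paragraph function: Condenses the three core findings, each of which dismantles a common assumption about pre-training.

Logical role: A three-part structure takes apart the "myths" of pre-training: convergence (not required), regularization (not automatic), localization (if anything, harmful).

Argumentative technique / potential weakness: The third observation is especially subversive, implying that classification pre-training may "mis-teach" a model's localization ability. But the causal direction is unclear: is pre-training harmful, or does from-scratch training happen to learn better localization?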

2. Methodology

Normalization is critical for training object detectors from scratch. Batch Normalization (BN), standard in modern networks, creates difficulties because object detectors are typically trained with high resolution inputs, which reduces batch sizes as constrained by memory, and small batch sizes severely degrade BN accuracy. We address this with two strategies: Group Normalization (GN), which performs computation independent of the batch dimension and is insensitive to batch sizes; and Synchronized Batch Normalization (SyncBN), which computes batch statistics across multiple GPUs to increase the effective batch size.

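Paragraph function: Identifies the technical bottleneck in training detectors from scratch: normalization.

Logical role: Normalization is one of the key technical conditions that turn from-scratch training from infeasible into feasible.

Argumentative technique / potential weakness: Clearly explains why earlier from-scratch attempts failed (BN with small batches), so readers understand the technical reason this paper succeeds. Neither GN nor SyncBN is a contribution of this paper, but recognizing them as the key enablers shows insight.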
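
The batch-size argument above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the nested-list tensor layout, the helper names `group_norm` and `batch_norm`, and the toy sizes are all assumptions made for illustration, and the learned scale and shift parameters of both layers are omitted. The point is only that GN statistics never cross samples, whereas BN statistics depend on which samples share the batch.

```python
import math
import random


def group_norm(batch, num_groups, eps=1e-5):
    """Normalize each sample's channels in groups; no statistic crosses samples.

    `batch` is a list of samples, each sample a list of channels, and each
    channel a flat list of spatial values (a hypothetical layout for this sketch).
    """
    out = []
    for sample in batch:
        per_group = len(sample) // num_groups
        normed = []
        for g in range(num_groups):
            group = sample[g * per_group:(g + 1) * per_group]
            values = [v for channel in group for v in channel]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            scale = 1.0 / math.sqrt(var + eps)
            normed.extend([[(v - mean) * scale for v in ch] for ch in group])
        out.append(normed)
    return out


def batch_norm(batch, eps=1e-5):
    """Normalize each channel with statistics pooled over the whole batch."""
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        values = [v for sample in batch for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for n, sample in enumerate(batch):
            out[n][c] = [(v - mean) * scale for v in sample[c]]
    return out


rng = random.Random(0)
# Toy batch: 4 samples, 4 channels, 6 spatial positions each.
batch = [[[rng.gauss(0, 1) for _ in range(6)] for _ in range(4)] for _ in range(4)]

# GN: sample 0 is normalized identically whether processed alone or in a batch.
gn_batch_independent = (
    group_norm(batch[:1], num_groups=2)[0] == group_norm(batch, num_groups=2)[0]
)

# BN: the output for sample 0 changes with the samples it shares a batch with.
bn_batch_independent = batch_norm(batch[:2])[0] == batch_norm(batch)[0]

print(gn_batch_independent, bn_batch_independent)
```

SyncBN keeps the BN formulation but pools the per-channel statistics across devices, which in this sketch would correspond to calling `batch_norm` on the concatenation of every device's samples.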

Training duration requires careful consideration. Pre-training involves over one million images iterated for one hundred epochs, which teaches both high-level semantics and low-level features (edges, textures) that don't require re-learning during fine-tuning. Models trained from scratch must learn both low- and high-level semantics, so more iterations may be necessary. However, a fairer comparison accounts for the total number of training samples (images, instances, pixels) seen across all training stages. When measured in pixels, from-scratch models see comparable samples due to detectors' higher-resolution inputs.

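Paragraph function: Redefines the baseline for a fair comparison, challenging the intuitive criticism that from-scratch training "needs more training".

Logical role: Pre-empts the criticism that from-scratch training wastes computation by recounting samples at the pixel level to make the comparison fairer.

Argumentative technique / potential weakness: Using pixels rather than images as the unit of measure is a deft reframing that makes the extra cost of from-scratch training look less significant. But computational cost depends not only on data volume but also on model architecture and GPU time.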
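
The accounting argument above can be made concrete with a back-of-the-envelope sketch. Every constant below is an assumption chosen for illustration; the text only states "over one million images" and "one hundred epochs", and gives no dataset sizes, input resolutions, or epoch counts for detection training. The instance-level count is left out because it would need a further assumption about instances per image.

```python
# Assumed, illustrative settings. Only the first two are loosely grounded in
# the text; the rest are hypothetical values picked for this sketch.
IMAGENET_IMAGES = 1_280_000      # "over one million images"
IMAGENET_EPOCHS = 100            # "one hundred epochs"
IMAGENET_PIXELS = 224 * 224      # assumed classification input size
COCO_IMAGES = 118_000            # assumed detection training-set size
COCO_PIXELS = 800 * 1333         # assumed detector input size
FINE_TUNE_EPOCHS = 24            # assumed fine-tuning schedule length
SCRATCH_EPOCHS = 72              # assumed from-scratch schedule length


def samples_seen(images, epochs, pixels_per_image):
    """Return total images and total pixels processed over a training stage."""
    images_seen = images * epochs
    return {"images": images_seen, "pixels": images_seen * pixels_per_image}


pretrain = samples_seen(IMAGENET_IMAGES, IMAGENET_EPOCHS, IMAGENET_PIXELS)
fine_tune = samples_seen(COCO_IMAGES, FINE_TUNE_EPOCHS, COCO_PIXELS)
scratch = samples_seen(COCO_IMAGES, SCRATCH_EPOCHS, COCO_PIXELS)

# Compare from-scratch training against the full pre-train + fine-tune pipeline.
image_ratio = scratch["images"] / (pretrain["images"] + fine_tune["images"])
pixel_ratio = scratch["pixels"] / (pretrain["pixels"] + fine_tune["pixels"])

print(f"images seen, scratch / pipeline: {image_ratio:.3f}")
print(f"pixels seen, scratch / pipeline: {pixel_ratio:.3f}")
```

Under these assumed numbers the from-scratch run sees only a small fraction of the pipeline's images but nearly as many pixels, which is the reframing the paragraph describes. Different assumptions would shift both ratios.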

We also report that using appropriately normalized initialization, we can train object detectors with VGG nets from random initialization without BN or GN. This demonstrates that the methodology generalizes beyond modern architectures with normalization layers, suggesting that the difficulty of training detectors from scratch has been largely an optimization issue rather than a fundamental limitation.

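Paragraph function: Extends the methodology to a classic architecture without normalization layers.

Logical role: Rules out the alternative explanation that "it is all due to GN/SyncBN", reinforcing the core claim that from-scratch training is inherently feasible.

Argumentative technique / potential weakness: Using VGG (without BN) as a supplementary experiment is shrewd, because it closes off the criticism that the result depends on a particular normalization technique. "An optimization issue rather than a fundamental limitation" is the paper's deepest insight.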
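
The text does not say which "appropriately normalized initialization" is meant. The sketch below assumes a variance-scaling scheme for ReLU layers (weights drawn with standard deviation sqrt(2 / fan_in), often called He initialization) purely as an example of the idea, and contrasts it with a fixed small standard deviation. The function name `forward_rms` and the toy fully connected network are hypothetical, not the paper's setup.

```python
import math
import random


def forward_rms(depth, width, std_fn, seed=0):
    """Push a random vector through `depth` random ReLU layers; return its final RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(width)]
    for _ in range(depth):
        std = std_fn(width)  # fan_in equals the layer width in this toy network
        x = [
            max(0.0, sum(rng.gauss(0, std) * v for v in x))
            for _ in range(width)
        ]
    return math.sqrt(sum(v * v for v in x) / width)


# Variance-scaling initialization (assumed): keeps activation magnitude near 1.
he_rms = forward_rms(depth=10, width=128, std_fn=lambda fan_in: math.sqrt(2.0 / fan_in))

# Fixed small std: the signal shrinks by a constant factor at every layer.
small_rms = forward_rms(depth=10, width=128, std_fn=lambda fan_in: 0.01)

print(f"variance-scaled init RMS after 10 layers: {he_rms:.3g}")
print(f"fixed std=0.01 RMS after 10 layers:       {small_rms:.3g}")
```

With scaled weights the activations keep a usable magnitude through the stack, while the fixed small scale drives them toward zero. This is one way a deep network without BN or GN can fail to train for optimization reasons alone.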

3. Experiments

On COCO object detection and instance segmentation using Mask R-CNN with FPN and GN, models trained from random initialization match their pre-training counterparts given sufficient training. With a 6x schedule, ResNet-50 trained from scratch achieves 41.3 AP^bbox (vs. 41.1 pre-trained) and 36.6 AP^mask (vs. 36.4 pre-trained). ResNet-101 achieves 42.7 AP^bbox (vs. 42.2 pre-trained) and 37.6 AP^mask (vs. 37.2 pre-trained). Notably, from-scratch models perform better at high overlap thresholds: AP75^bbox for R50 is 45.6 from scratch vs. 44.6 pre-trained, consistent with the observation that pre-training does not help spatial localization.

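Paragraph function: The core experiment: quantitative COCO detection and segmentation results.

Logical role: Shows "match or even exceed" results on the most standard benchmark (COCO with Mask R-CNN), supplying the most critical empirical support for the paper's thesis.

Argumentative technique / potential weakness: The AP75 advantage is an unexpected highlight: the models not only match pre-training but localize more precisely. But the extra training time of the 6x schedule (about 3 times that of standard fine-tuning) is a real cost that cannot be ignored.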
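
To see why AP75 is a stricter localization test than AP at a 0.5 threshold, the sketch below computes the intersection-over-union (IoU) of two hypothetical predicted boxes against one ground-truth box. The box coordinates and the helper name `box_iou` are made up for illustration, and a full AP computation (score ranking, precision-recall integration) is deliberately left out.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


ground_truth = (0, 0, 100, 100)
loose_box = (20, 0, 120, 100)   # same size, shifted 20 pixels to the right
tight_box = (5, 0, 105, 100)    # same size, shifted 5 pixels to the right

for name, box in [("loose", loose_box), ("tight", tight_box)]:
    iou = box_iou(ground_truth, box)
    # A detection counts as correct only if its IoU reaches the threshold.
    print(f"{name}: IoU={iou:.3f}  match@0.50={iou >= 0.5}  match@0.75={iou >= 0.75}")
```

The loosely placed box counts as a correct detection at the 0.5 threshold but not at 0.75, so a gain in AP75 reflects tighter boxes rather than more detections of roughly the right region.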

For large models trained from scratch, a ResNeXt-152 (8x32d) with roughly 4x the FLOPs of R101 achieves remarkable results. With training-time augmentation and Cascade R-CNN, the model reaches 48.6 AP^bbox and 41.4 AP^mask. Adding test-time augmentation, it achieves 50.9 AP^bbox and 43.2 AP^mask on val2017, and 51.3 AP^bbox and 43.6 AP^mask on the test-challenge set, matching the COCO 2017 competition winners. Remarkably, the same model with ImageNet pre-training achieved only 50.3/42.5 on val2017, suggesting that pre-training provides no advantage even for large models.

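Paragraph function: The trump-card experiment: a large model trained from scratch reaches competition level.

Logical role: 50.9 AP is the most persuasive number in the paper: the result not only matches pre-training, but on the large model from-scratch training does better.

Argumentative technique / potential weakness: The "50.9 vs. 50.3" reversal is the strongest evidence, suggesting that pre-training is even slightly detrimental for the large model. But ResNeXt-152 is extremely expensive to compute and may be uncommon in practical applications.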

We investigate the data regime where pre-training might help. With 35k COCO images (~1/3), after grid search optimization, both pre-trained and from-scratch models achieve 36.3 AP. With 10k COCO images (~1/10), the from-scratch model achieves 25.9 AP vs. pre-training's 26.0 AP — still comparable. However, with only 1k images (~1/100), pre-training achieves 9.9 AP vs. 3.5 AP from scratch, revealing the breakdown point between 3.5k and 10k COCO training images. On PASCAL VOC (15k images, 20 categories), pre-training retains an advantage: 82.7 mAP vs. 77.6 mAP from scratch, attributed to VOC's lower instance density.

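Paragraph function: Explores the boundary conditions of the pre-training advantage through a data-scale ablation.

Logical role: Honestly presents the conditions under which from-scratch training fails (fewer than 10k images), adding to the paper's credibility and practical value.

Argumentative technique / potential weakness: Proactively reporting the "breakdown point" is a model of academic integrity, neither overclaiming nor dodging weaknesses. The VOC result reminds readers that dataset characteristics (instance density, number of categories) are also key factors.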
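
The figures quoted in this paragraph can be tabulated to make the breakdown visible at a glance. The numbers below are exactly those reported above; the 0.5-point tolerance used to label a pair "comparable" is an arbitrary assumption for this sketch, not a criterion from the paper, and the VOC row uses mAP rather than COCO AP.

```python
# (setting, pre-trained score, from-scratch score), copied from the paragraph above.
reported = [
    ("COCO 35k images", 36.3, 36.3),
    ("COCO 10k images", 26.0, 25.9),
    ("COCO 1k images", 9.9, 3.5),
    ("PASCAL VOC (mAP)", 82.7, 77.6),
]

TOLERANCE = 0.5  # hypothetical threshold for calling two scores "comparable"

gaps = {}
for setting, pretrained, scratch in reported:
    gap = round(pretrained - scratch, 1)
    gaps[setting] = gap
    verdict = "comparable" if gap <= TOLERANCE else "pre-training ahead"
    print(f"{setting:<18} gap={gap:>4}  {verdict}")
```

Read this way, the two larger COCO subsets fall within the tolerance, while the 1k subset and VOC show gaps of several points in favor of pre-training.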

4. Conclusion

Our findings challenge the dominant ImageNet pre-training paradigm with six key conclusions. (i) Training from scratch on target tasks is possible without architectural changes. (ii) It requires more iterations to converge. (iii) It can be no worse than its pre-training counterparts under many circumstances, down to as few as 10k COCO images. (iv) ImageNet pre-training speeds up convergence on the target task. (v) It does not necessarily provide regularization unless in very small data regimes. (vi) It helps less if the target task is more sensitive to localization than to classification.

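Paragraph function: Condenses six conclusions that systematically redefine the role of pre-training.

Logical role: The conclusions echo the three observations in the introduction; each corresponds to a dismantled "pre-training myth".

Argumentative technique / potential weakness: The concise wording of the six conclusions makes the paper's claims highly quotable. But the overall conclusion leans toward a "pre-training is unnecessary" narrative, whereas pre-training still has a clear advantage on small datasets and on VOC.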

We conclude that ImageNet pre-training is a historical workaround for when the community did not have enough target data or computational resources. Collecting data and training on target tasks is a solution worth considering, especially when there is a significant gap between the source pre-training task and the target task. The community should be more careful when evaluating pre-trained features (e.g., for self-supervised learning), as now we learn that even random initialization could produce excellent results.

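Paragraph function: Recommendations for future research: re-examine how pre-training is evaluated.

Logical role: Extends the "historical workaround" framing into a caution about evaluating self-supervised learning, giving the paper influence beyond its immediate scope.

Argumentative technique / potential weakness: The caution about self-supervised learning is forward-looking. But the phrase "historical workaround" may understate the continuing value of pre-training: in data-scarce fields such as medical imaging, pre-training remains indispensable.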

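Argument structure overview

Question: Is ImageNet pre-training really necessary?

→

Claim: Training from scratch can match pre-trained performance.

→

Evidence: R50: 41.3 vs. 41.1 AP; X152: 50.9 vs. 50.3 AP; still viable with 10k images.

→

Rebuttal: Breaks down below 10k images; a gap remains on VOC; more training iterations are needed.

→

Conclusion: Pre-training is a historical workaround, not a fundamental necessity.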

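Authors' core claim (one sentence)

Given sufficient target data (10k+ COCO images) and appropriate normalization, object detectors trained from random initialization can match or even surpass their ImageNet pre-trained counterparts, challenging transfer learning's status as the default paradigm in computer vision.

Strongest point of the argument

The 50.9 AP from-scratch result with ResNeXt-152: it not only matches but exceeds the 50.3 AP obtained with ImageNet pre-training, and is on par with the COCO 2017 competition winners. Combined with the data-scale ablation (still viable at 10k images) and the supplementary validation on VGG (without BN), it forms a multi-angle chain of evidence that is hard to refute.

Weakest point of the argument

The hidden cost of computation: a 6x schedule means about 3 times the training time of standard fine-tuning. Pre-training also retains a clear advantage on small datasets (VOC, fewer than 10k COCO images). The "historical workaround" framing may be overly optimistic for data-scarce domains (medical imaging, rare-species recognition).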