Colorful Image Colorization

Abstract

Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underdetermined, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test," asking human observers to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder.

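Paragraph function: Defines the problem, proposes the method, and reports the core result.

Logical role: Using a "Turing test" as the evaluation criterion is highly persuasive, and the paper also opens up a new application in self-supervised learning.

Argumentative technique / potential weakness: The 32% fooling rate is both eye-catching and quantitatively comparable as a metric, but whether the test design is biased deserves attention.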

1. Introduction

Automatic colorization of grayscale images has been a topic of interest since Sziranyi et al. first proposed the problem. The problem is inherently ambiguous: a grayscale image corresponds to many plausible color images. Despite this inherent ambiguity, most existing methods focus on producing a single "correct" colorization, often resulting in desaturated outputs because the loss function (typically L2 in Lab space) encourages averaging over the possible colors. We take a fundamentally different approach: we embrace the ambiguity by treating colorization as a classification problem over quantized color values, rather than a regression problem. This allows the network to hedge its bets and produce multimodal distributions over possible colors, ultimately leading to more vivid and plausible results.

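Paragraph function: Identifies the fundamental flaw of existing methods and proposes the idea of replacing regression with classification.

Logical role: By criticizing the averaging effect of the L2 loss, it gives the classification approach an intuitive justification.

Argumentative technique / potential weakness: "Embracing the ambiguity" recasts a defect of the problem as a strength of the method, which is a clever reframing.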
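
The averaging argument can be made concrete with a small numerical sketch. It assumes a single gray pixel that is equally likely to be one of two saturated colors; the ab values and the `saturation` helper are illustrative choices, not taken from the paper.

```python
import numpy as np

# Two equally plausible ab colors for the same gray pixel (illustrative values).
modes = np.array([[60.0, 40.0],     # a reddish-orange hue
                  [-60.0, -40.0]])  # a greenish-blue hue
probs = np.array([0.5, 0.5])

# The prediction that minimizes expected squared error is the distribution mean.
l2_optimal = probs @ modes

# A classifier keeps the whole distribution and can commit to a single mode.
class_choice = modes[np.argmax(probs)]

def saturation(ab):
    """Chroma: distance from the gray axis (a = b = 0)."""
    return float(np.hypot(ab[0], ab[1]))

print("L2-optimal:", l2_optimal, "chroma:", saturation(l2_optimal))
print("Class pick:", class_choice, "chroma:", saturation(class_choice))
```

In this toy setting the L2-optimal answer lands exactly on the gray axis, while either class keeps its full chroma. Real images are less extreme, but the pull toward gray works the same way.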

2. Approach

We train a CNN to map from a grayscale input to a distribution over quantized color value outputs using the CIE Lab color space. Given the lightness channel L as input, our network learns to predict the corresponding a and b color channels. We quantize the ab output space into bins with grid size 10 and keep the Q = 313 values which are in-gamut. The architecture is based on VGG-net, with all pooling layers removed and replaced by strided convolutions, followed by a series of dilated convolutions. For each pixel, the network predicts a probability distribution over the 313 possible color bins, and the final colorization is obtained by mapping these distributions to point estimates in ab space.

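Paragraph function: Details the network architecture and the quantization scheme for the color space.

Logical role: Turns the abstract's notion of classification-based colorization into a concrete technical design.

Argumentative technique / potential weakness: Choosing Lab space decouples lightness from chrominance, and the 313 quantized bins balance precision against tractability.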
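
The quantization step can be sketched as a nearest-bin lookup on a regular ab grid. This is a minimal illustration, not the authors' code: the bounding range of -110 to 110 and the names `encode` and `decode` are assumptions, and the in-gamut filter that reduces the grid to Q = 313 bins is deliberately left out.

```python
import numpy as np

GRID = 10                   # bin width in ab units, as stated in the text
AB_MIN, AB_MAX = -110, 110  # assumed bounding range for both a and b

# Centers of every bin on the full square grid (23 x 23 = 529 bins).
# The paper keeps only the in-gamut subset (Q = 313); that filter is omitted.
axis = np.arange(AB_MIN, AB_MAX + GRID, GRID)
centers = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1).reshape(-1, 2)

def encode(ab):
    """Map ab values of shape (..., 2) to the index of the nearest bin center."""
    ab = np.asarray(ab, dtype=float)
    idx = np.rint((ab - AB_MIN) / GRID).astype(int)
    idx = np.clip(idx, 0, len(axis) - 1)
    return idx[..., 0] * len(axis) + idx[..., 1]

def decode(index):
    """Return the ab center of a bin index."""
    return centers[index]

pixel_ab = np.array([23.0, -47.0])
bin_index = encode(pixel_ab)
print(bin_index, decode(bin_index))
```

A network trained this way outputs one score per bin for every pixel. The hard assignment above is the simplest possible target encoding and is used here only to show the indexing.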

3. Classification Loss and Class Rebalancing

Quantizing the color space and treating the prediction as a classification problem offers several advantages. We use a multinomial cross-entropy loss rather than L2 regression loss. However, the distribution of ab values in natural images is strongly biased towards desaturated values, because backgrounds like sky, ground, and walls dominate. To address this, we rebalance the loss based on the rarity of the color at training time. We use a weighting term that is inversely proportional to the color class probability, where that probability is first smoothed by mixing it with a uniform distribution using a weight lambda. This class rebalancing is critical for producing colorful results, as without it the network would converge to producing grayish outputs.

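Paragraph function: Explains the motivation for class rebalancing and how it is implemented.

Logical role: Addresses the class imbalance that arises once regression is replaced by classification, showing that the design is complete.

Argumentative technique / potential weakness: The section faces the color distribution bias honestly and proposes a remedy, but the results depend heavily on how the smoothing parameter lambda is set, which makes it a sensitive hyperparameter.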
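
One way to realize the rebalancing is sketched below. It assumes the weight for each class is the inverse of the empirical class probability mixed with a uniform distribution, then normalized so the expected weight under the empirical distribution is 1. The default `lam=0.5`, the three-class toy prior, and the function names are assumptions for illustration.

```python
import numpy as np

def rebalancing_weights(prior, lam=0.5):
    """Per-class weights, inversely proportional to a smoothed class prior."""
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()
    q = prior.size
    mixed = (1.0 - lam) * prior + lam / q  # blend with the uniform distribution
    w = 1.0 / mixed
    return w / np.sum(prior * w)           # expected weight under the prior is 1

def weighted_cross_entropy(pred, target, weights):
    """pred, target: (N, Q) rows summing to 1; weights: (Q,) class weights."""
    pixel_w = weights[np.argmax(target, axis=1)]  # weight of each pixel's main class
    ce = -np.sum(target * np.log(pred + 1e-12), axis=1)
    return float(np.mean(pixel_w * ce))

# Toy prior: one common desaturated class and two rarer saturated ones.
prior = np.array([0.7, 0.2, 0.1])
w = rebalancing_weights(prior)
print(w)
```

With `lam=0` the weights are exactly inverse to the prior, and with `lam=1` every class gets the same weight, so lambda controls how aggressively rare colors are boosted.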

4. Experiments

We train on 1.3M images from ImageNet and evaluate with several metrics. For our "colorization Turing test," we show pairs of images (one real, one colorized) to AMT workers and ask them to identify the fake. Our method achieves a fooling rate of 32.3%, compared to 22.1% for the baseline L2 regression approach. A perfect colorization would achieve 50%, the chance level at which workers cannot tell the two images apart. We also evaluate using PSNR and perceptual metrics. While L2 regression achieves higher PSNR (because it produces mean colors), our classification approach produces results that are perceptually preferred. Additionally, we demonstrate that features learned through colorization achieve competitive performance on ImageNet classification (44.5% top-1 accuracy) when used as a self-supervised pre-training task, outperforming several other self-supervised methods.

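Paragraph function: Reports two sets of experimental results, on colorization quality and on self-supervised learning.

Logical role: Validates the method from two angles, human perception (the Turing test) and machine understanding (feature transfer).

Argumentative technique / potential weakness: The section admits that PSNR is lower than for regression but uses perceptual preference to turn the argument around, recasting a "weakness" as evidence that the metric is ill-suited.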
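
The fooling rate itself is a simple proportion, and a rough sketch makes the 50% ceiling easier to read. The trial list below is made-up toy data, and the binomial standard error is an added assumption about how uncertainty might be summarized, not something the text reports.

```python
import math

def fooling_rate(fooled):
    """Fraction of trials in which the worker mistook the colorized image for real.

    Returns the rate and a binomial standard error.
    """
    n = len(fooled)
    rate = sum(fooled) / n
    stderr = math.sqrt(rate * (1.0 - rate) / n)
    return rate, stderr

# Hypothetical outcomes: True means the worker was fooled on that trial.
toy_trials = [True] * 8 + [False] * 17
rate, stderr = fooling_rate(toy_trials)
print(f"fooling rate = {rate:.3f} +/- {stderr:.3f} (chance level is 0.5)")
```

A rate near 0.5 would mean workers are guessing, so the gap between a measured rate and 0.5 indicates how detectable the colorizations remain.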

We conduct ablation studies to analyze the effect of each component. Without class rebalancing, the fooling rate drops from 32.3% to 24.1%. Using regression instead of classification reduces the fooling rate further. The annealed-mean mapping for converting probability distributions to point estimates in ab space provides a good balance between vividness and spatial consistency. We also analyze failure cases: the method sometimes produces semantically incorrect colors (e.g., blue bananas) and struggles with scenes that have ambiguous object-color associations.

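Paragraph function: Ablations verify the contribution of each component, and failure cases are disclosed candidly.

Logical role: The ablation and failure analysis strengthen the paper's credibility and completeness.

Argumentative technique / potential weakness: Showing failure cases such as blue bananas is good scholarly practice and helps readers understand where the method applies.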
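
The annealed-mean mapping can be sketched as sharpening each per-pixel distribution with a temperature and then taking its expectation in ab space. The default temperature of 0.38, the three-bin toy distribution, and the function name are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

def annealed_mean(probs, centers, temperature=0.38):
    """Re-temper a distribution over color bins, then return its mean ab value.

    probs: (..., Q) probabilities; centers: (Q, 2) ab bin centers.
    """
    probs = np.asarray(probs, dtype=float)
    logits = np.log(probs + 1e-12) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    sharpened = np.exp(logits)
    sharpened = sharpened / sharpened.sum(axis=-1, keepdims=True)
    return sharpened @ centers

# Toy example with three bins: two saturated colors and gray.
toy_centers = np.array([[60.0, 40.0], [-60.0, -40.0], [0.0, 0.0]])
toy_probs = np.array([0.5, 0.3, 0.2])

for t in (1.0, 0.38, 0.01):
    ab = annealed_mean(toy_probs, toy_centers, temperature=t)
    print(t, ab, float(np.hypot(ab[0], ab[1])))
```

A temperature of 1 gives the plain mean, which is desaturated, and a temperature near 0 approaches the mode, which is vivid but can vary abruptly between neighboring pixels. Intermediate values trade one against the other.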

5. Conclusions

We have presented a system for fully automatic image colorization that produces realistic and vibrant results by treating the problem as multinomial classification with class rebalancing. Through a "colorization Turing test," we showed that our results fool human observers more often than previous methods. Importantly, we have also demonstrated that colorization is a viable pretext task for self-supervised representation learning, opening up new directions for learning visual features without manual annotation. Future work may explore the use of conditional generative models to handle the multimodality more explicitly.

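Paragraph function: Summarizes the two contributions (colorization quality and self-supervised learning).

Logical role: The extended value of colorization as a pretext task raises the paper's long-term impact.

Argumentative technique / potential weakness: Linking colorization to self-supervised learning broadens the paper's audience and reach.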

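Argument structure overview

Problem: L2 regression produces grayish, desaturated colors
➔ Claim: classification plus class rebalancing
➔ Evidence: 32% fooling rate in the colorization Turing test
➔ Rebuttal: occasional semantically incorrect colors
➔ Conclusion: an effective colorization method and self-supervised task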

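Core claim

Modeling colorization as a classification problem with class rebalancing produces vivid, realistic results, and the learned features serve as a strong tool for self-supervised representation learning.

Strongest argument

The colorization Turing test uses human perception as the final arbiter, and the 32.3% fooling rate (versus 22.1% for the baseline) provides intuitive, compelling evidence of quality.

Weakest link

Semantic color errors (such as blue bananas) reveal that the model lacks high-level semantic understanding, and its performance is unstable in scenes with ambiguous object-color associations.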