Image-to-Image Translation with Conditional Adversarial Networks (pix2pix)

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of our pix2pix software, a large number of internet users have posted their own experiments with our system, further demonstrating its wide applicability.

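Paragraph function: Overview of the whole paper; positions conditional GANs as a general-purpose framework for image translation.

Logical role: The abstract's central claim is the meta-learning idea of "learning the loss function": the network learns not only the mapping but also the criterion by which the mapping is judged. This lifts pix2pix from a specific application to a methodological contribution.

Argumentative technique / potential weaknesses: Citing the community's wide-ranging experiments as evidence of applicability is unconventional but effective. The claim of a "general-purpose solution" may be too bold, however: pix2pix requires paired training data, which is hard to obtain in many practical settings.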

1. Introduction

Many problems in image processing, computer graphics, and computer vision can be posed as "translating" an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the task of translating one possible representation of a scene into another. Traditionally, each of these tasks has been tackled with separate, special-purpose machinery, even though the setting is always the same: predict pixels from pixels. CNNs are now a common tool for such prediction problems, but they still require the design of effective loss functions: a naive Euclidean distance produces blurry results because it is minimized by averaging all plausible outputs.

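Paragraph function: Establishes a unifying framework, using the "translation" metaphor to bring many image transformation tasks together.

Logical role: The "language translation" metaphor places seemingly different tasks (colorization, segmentation, synthesis) under one conceptual framework. The observation that L2 loss produces blur pinpoints the core deficiency of existing methods.

Argumentative technique / potential weaknesses: The "translation" metaphor is powerful: it implies that all image translation tasks are structurally equivalent. But tasks differ greatly in how symmetric they are (edges -> photo and photo -> edges differ sharply in difficulty), so the metaphor may mask essential differences between tasks.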
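
The blur argument can be checked numerically. The sketch below is a toy illustration, not anything taken from the paper: it considers one pixel whose plausible values are hypothetical (three dark outcomes, one bright) and grid-searches for the single prediction with the lowest expected loss. Under squared (Euclidean) error the minimizer is the average of the plausible values, an in-between gray that matches none of them; under absolute (L1) error it is the median. The names `expected_loss` and `best_constant` are invented for this example.

```python
def expected_loss(prediction, outcomes, power):
    """Average |prediction - y|**power over equally likely outcomes."""
    return sum(abs(prediction - y) ** power for y in outcomes) / len(outcomes)


def best_constant(outcomes, power, steps=1000):
    """Grid-search the prediction in [0, 1] with the lowest expected loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: expected_loss(p, outcomes, power))


# Hypothetical plausible intensities for one pixel (0 = black, 1 = white).
plausible = [0.0, 0.0, 0.0, 1.0]

l2_choice = best_constant(plausible, power=2)  # lands on the mean, 0.25
l1_choice = best_constant(plausible, power=1)  # lands on the median, 0.0
print(l2_choice, l1_choice)
```

With a whole image of such pixels, the squared-error choice washes every uncertain pixel toward an average, which is one way to read the blur described above.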

GANs offer a solution: instead of manually designing a loss function, we can specify only a high-level goal, like "make the output indistinguishable from reality," and then automatically learn a loss function appropriate for satisfying this goal. Conditional GANs (cGANs) learn a mapping from observed image x and random noise vector z to y: G: {x, z} -> y. The generator G is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator D. Our primary contribution is to demonstrate that conditional GANs produce reasonable results across a wide variety of image-to-image translation tasks, and to present a simple framework sufficient to achieve good results, along with an analysis of several important architectural choices.

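Paragraph function: Introduces the core method, with the cGAN as the mechanism for "learning the loss function".

Logical role: Picks up the "loss design is hard" problem and offers the GAN as an "automatic loss learner". Reframing the GAN from a "generator" to a "loss-function learner" is an important conceptual innovation.

Argumentative technique / potential weaknesses: The phrase "automatically learn a loss function" hides the instability of GAN training: the learned loss may be unstable or fail to converge. The emphasis on a "simple framework" is shrewd positioning that avoids a technical arms race with more complex methods.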

Image-to-image translation has traditionally employed per-pixel classification or regression, treating the output space as "unstructured" in the sense that each output pixel is considered conditionally independent of all others given the input image. In contrast, structured losses penalize the joint configuration of the output rather than individual pixels. Prior work conditioned GANs on discrete labels, text, and images. Previous image-conditional models addressed image prediction from a normal map, future frame prediction, and image generation from sparse annotations. However, our framework differs fundamentally: nothing is application-specific. This makes our setup considerably simpler than most others. Architecturally, we distinguish our approach through two choices: a "U-Net"-based generator and a convolutional "PatchGAN" discriminator.

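Paragraph function: Positions the work in the literature, distinguishing pix2pix from earlier conditional GAN research by its application independence.

Logical role: The core differentiating argument: earlier conditional GANs were each designed for a specific task, whereas pix2pix aims for generality. U-Net and PatchGAN are the two key architectural innovations that make this generality possible.

Argumentative technique / potential weaknesses: "Nothing is application-specific" is an extremely bold claim. pix2pix still requires paired training data, which is itself a strong application assumption. The later CycleGAN was proposed precisely to remove this assumption.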

3. Method

3.1 Objective Function

The conditional GAN objective is an adversarial loss that the generator G tries to minimize while the discriminator D tries to maximize it: L_cGAN(G,D) = E[log D(x,y)] + E[log(1 - D(x,G(x,z)))]. We also find it beneficial to mix the GAN objective with a more traditional loss such as L1 distance: L_L1(G) = E[||y - G(x,z)||_1]. We use L1 rather than L2 because L1 encourages less blurring. The final objective is: G* = arg min_G max_D L_cGAN(G,D) + lambda * L_L1(G). Regarding noise: when z was provided as an explicit input, the generator simply learned to ignore it. Instead, we provide noise only in the form of dropout, applied at both training and test time.

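Paragraph function: Defines the objective function, combining the adversarial loss with an L1 reconstruction loss.

Logical role: The two losses have a clear division of labor: the cGAN loss handles the realism of high-frequency detail, and the L1 loss handles the accuracy of low-frequency structure. The combination responds directly to the "L2 produces blur" problem.

Argumentative technique / potential weaknesses: The admission that the generator learned to ignore the noise is candid: it means pix2pix is essentially a deterministic mapping and cannot produce diverse outputs. This is a significant shortcoming for tasks with many plausible outputs, such as colorization. Replacing explicit noise with dropout is a pragmatic but inelegant fix.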
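
As a rough numerical reading of the objective in Section 3.1, the sketch below evaluates L_cGAN and L_L1 on hand-picked numbers. It is not the authors' implementation: the discriminator outputs and pixel values are hypothetical, expectations are replaced by sample averages, and the default weight `lam=100.0` is an assumption of this example, since the text above does not give a value for lambda.

```python
import math


def cgan_loss(d_real, d_fake):
    """Sample estimate of E[log D(x, y)] + E[log(1 - D(x, G(x, z)))].

    d_real: discriminator probabilities on (input, real output) pairs.
    d_fake: discriminator probabilities on (input, generated output) pairs.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def l1_loss(targets, outputs):
    """Sample estimate of E[||y - G(x, z)||_1]; each image is a flat pixel list."""
    per_image = [
        sum(abs(t - o) for t, o in zip(target, output))
        for target, output in zip(targets, outputs)
    ]
    return sum(per_image) / len(per_image)


def full_objective(d_real, d_fake, targets, outputs, lam=100.0):
    """L_cGAN + lam * L_L1: D tries to raise the first term, G to lower the sum."""
    return cgan_loss(d_real, d_fake) + lam * l1_loss(targets, outputs)


# Hypothetical batch: one real pair, one generated pair, two-pixel images.
d_real, d_fake = [0.5], [0.5]
targets, outputs = [[0.0, 1.0]], [[0.5, 0.5]]
value = full_objective(d_real, d_fake, targets, outputs)
print(round(value, 4))
```

A discriminator that cannot tell real from generated pairs (both probabilities at 0.5) contributes 2 * log(0.5) to the adversarial term, while the L1 term vanishes only when the output matches the target pixel for pixel.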

3.2 Generator Architecture (U-Net)

A defining feature of image-to-image translation is that input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, the structure in the input is roughly aligned with the structure in the output. We employ an encoder-decoder with skip connections, following the general shape of a "U-Net". Specifically, we add skip connections between each layer i and layer n-i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n-i. This allows low-level information shared between input and output to shuttle directly across the net, bypassing the bottleneck layer.

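Paragraph function: Generator design; U-Net skip connections preserve low-level structural information.

Logical role: The "same underlying structure" observation is the key justification for choosing U-Net: if input and output share structure, passing low-level features (edges, textures) across directly makes more sense than forcing them through the bottleneck.

Argumentative technique / potential weaknesses: Bringing U-Net, a medical image segmentation architecture, into a generative task is an innovative cross-domain application. But the "same underlying structure" assumption fails for some translation tasks: when generating an image from a text description, for example, there is no spatial correspondence between input and output.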
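
The layer pairing described in Section 3.2 can be sketched as simple bookkeeping. This is an illustration under assumptions, not the paper's network: layers are numbered 1 to n with n even, the bottleneck is taken to be layer n/2, and the tiny feature maps are hypothetical. `skip_pairs` and `concat_channels` are names invented here.

```python
def skip_pairs(n):
    """Return (i, n - i) for every layer i before the bottleneck at n // 2."""
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    return [(i, n - i) for i in range(1, n // 2)]


def concat_channels(a, b):
    """Concatenate two feature maps given as lists of channels (2D lists)."""
    height, width = len(a[0]), len(a[0][0])
    same_size = all(
        len(ch) == height and all(len(row) == width for row in ch)
        for ch in a + b
    )
    if not same_size:
        raise ValueError("skip connections need matching spatial sizes")
    return a + b


# For an 8-layer net, layers 1-3 are wired to layers 7-5; layer 4 is the bottleneck.
print(skip_pairs(8))  # [(1, 7), (2, 6), (3, 5)]

# Hypothetical 2x2 feature maps: 1 encoder channel joined with 2 decoder channels.
encoder_map = [[[1, 2], [3, 4]]]
decoder_map = [[[0, 0], [0, 0]], [[5, 5], [5, 5]]]
merged = concat_channels(encoder_map, decoder_map)
print(len(merged))  # 3 channels after concatenation
```

Because the encoder channels are copied across unchanged, low-level detail such as edge locations reaches the decoder without passing through the bottleneck.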

3.3 PatchGAN Discriminator

The L1 loss fails to encourage high-frequency crispness, but in many cases it already captures the low frequencies accurately. For the GAN discriminator, we therefore only need to model high-frequency structure. This motivates restricting the discriminator to only penalize structure at the scale of local image patches. We term this discriminator a PatchGAN: it classifies whether each N x N patch in an image is real or fake, running this discriminator convolutionally across the image, averaging all responses to provide the ultimate output. This design assumes independence between pixels separated by more than a patch diameter, effectively modeling the image as a Markov random field. Testing patch sizes from 1x1 to 286x286 shows that 70x70 patches give the best results among the sizes tested. Smaller patches lack spatial sharpness; full-image discriminators show diminishing returns.

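Paragraph function: Discriminator innovation; PatchGAN focuses on local high-frequency structure.

Logical role: The division of labor between L1 and PatchGAN is the most elegant design in the paper: L1 handles global low frequencies, PatchGAN handles local high frequencies, and each operates at the scale that suits it best. The Markov random field view gives the design a mathematical grounding.

Argumentative technique / potential weaknesses: The PatchGAN idea was later widely adopted and has had a lasting influence. The best patch size, 70x70, was determined by systematic search, which shows experimental rigor. But the "pixel independence" assumption may fail in images with large-scale structural dependencies, such as the global symmetry of a building facade.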
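
Two pieces of the PatchGAN description lend themselves to a short sketch: how a stack of convolutions ends up looking at an N x N patch, and how per-patch decisions are averaged into one output. The layer list below (4x4 kernels with strides 2, 2, 2, 1, 1) is an assumption, chosen because it is a configuration commonly associated with a 70x70 receptive field; the text above does not list the layers. The response grid is hypothetical, and the function names are invented for this example.

```python
def receptive_field(layers):
    """Input patch size seen by one output unit of a convolution stack.

    layers: (kernel_size, stride) tuples, ordered from input to output.
    """
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


def patchgan_output(patch_scores):
    """Average a 2D grid of per-patch real/fake scores into one number."""
    flat = [score for row in patch_scores for score in row]
    return sum(flat) / len(flat)


# Assumed layer stack: five 4x4 convolutions with strides 2, 2, 2, 1, 1.
assumed_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(assumed_layers))  # 70

# Hypothetical discriminator responses, one per overlapping patch.
scores = [[0.9, 0.8], [0.2, 0.1]]
print(patchgan_output(scores))
```

Each entry of the response grid depends only on its own patch of the input, which is where the independence assumption between distant pixels comes from.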

4. Experiments

Experiments test diverse tasks: semantic labels to photo, architectural labels to photo, maps to aerial photos, black-and-white to color, edges to photo, sketch to photo, day to night, thermal to color, and photo inpainting. Evaluation uses Amazon Mechanical Turk (AMT) "real vs. fake" perceptual studies and an FCN-score, which measures whether a pre-trained semantic segmentation network can correctly recognize the content of synthesized images. Ablation studies show that L1 alone produces blurry results, cGAN alone introduces artifacts, and combining both (L1+cGAN) reduces both artifacts and blurriness. The U-Net substantially outperforms a basic encoder-decoder without skip connections, which shows that the skip connections are essential. In the AMT studies, map-to-photo results fool participants on 18.9% of trials with L1+cGAN versus 0.8% with L1 alone. Colorization achieves a 22.5% fooling rate, slightly below specialized prior work at 27.8%.

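Paragraph function: Large-scale experimental validation, demonstrating the framework's generality across nine tasks.

Logical role: The nine-task demonstration directly supports the core "general framework" claim. The ablations show what each design choice contributes: L1 (low-frequency accuracy and fewer artifacts), cGAN (sharpness and realism), and U-Net (structure preservation).

Argumentative technique / potential weaknesses: Task diversity is the most persuasive argument: one framework, zero modification, nine tasks. But the 22.5% vs. 27.8% colorization comparison suggests that a general method can trail a specialized one on a specific task. The subjectivity and limited reproducibility of AMT evaluation are also potential methodological weaknesses.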
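
The two kinds of evaluation in Section 4 reduce to simple ratios. The sketch below is a hedged illustration on made-up data: `fooling_rate` is the share of "real vs. fake" trials in which a generated image was judged real, and `per_pixel_accuracy` compares a segmentation network's predicted labels on a synthesized image with the label map the image was generated from, which is the kind of agreement an FCN-score builds on. Both function names and all numbers are hypothetical; none are results from the paper.

```python
def fooling_rate(judged_real):
    """Fraction of trials where the generated image was labeled as real."""
    return sum(1 for fooled in judged_real if fooled) / len(judged_real)


def per_pixel_accuracy(predicted, reference):
    """Fraction of pixels whose predicted class matches the reference label map."""
    pairs = [
        (p, r)
        for pred_row, ref_row in zip(predicted, reference)
        for p, r in zip(pred_row, ref_row)
    ]
    return sum(1 for p, r in pairs if p == r) / len(pairs)


# Hypothetical outcomes of ten "real vs. fake" trials.
trials = [True, False, False, True, False, False, False, False, False, False]
print(fooling_rate(trials))  # 0.2

# Hypothetical 2x3 label maps: class ids predicted on a synthesized image
# versus the label map the image was synthesized from.
predicted = [[0, 1, 1], [2, 2, 0]]
reference = [[0, 1, 2], [2, 2, 0]]
print(per_pixel_accuracy(predicted, reference))
```

A fooling rate near 50% would mean participants cannot separate generated images from real ones, so the percentages reported above all leave room for improvement.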

5. Conclusion

Conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss function adapted to the task and data at hand, which makes them applicable across a wide variety of settings. The release of pix2pix has sparked significant community engagement, with artists and researchers applying it to creative applications beyond the original scope, demonstrating the framework's promise as a general-purpose image-to-image translation tool.

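Paragraph function: Summarizes the paper, using community impact to support the generality claim.

Logical role: The conclusion uses community adoption as its final argument, extending the paper's impact from academia into artistic creation. The meta-learning idea of "learning the loss function" is restated as the core contribution.

Argumentative technique / potential weaknesses: Community adoption is not a traditional academic argument, but it is highly persuasive. The conclusion does not adequately discuss limitations: the need for paired data, deterministic outputs, and the risk of mode collapse were taken up one by one in later work (CycleGAN, BicycleGAN).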

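Overview of the Argument Structure

Problem: loss functions are hard to design; L2 produces blurry outputs
-> Claim: the cGAN automatically learns a loss function adapted to the task
-> Evidence: effective across nine tasks; realism validated by AMT
-> Rebuttal: L1 and cGAN are complementary; PatchGAN focuses on high frequencies
-> Conclusion: a general image translation framework; widely adopted by the community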

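The Authors' Core Claim (in One Sentence)

Conditional adversarial networks automatically learn a loss function adapted to the task and, combined with a U-Net generator and a PatchGAN discriminator, form a general framework that produces reasonable results across many image translation tasks without application-specific tuning.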

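Strongest Point of the Argument

Generality shown across nine tasks: producing reasonable results on nine very different image translation tasks with exactly the same framework is the most direct empirical evidence of "generality". The L1+cGAN ablation clearly shows the complementary roles of the components. Broad community adoption adds "in-the-wild" validation, which is far more persuasive than controlled experiments alone.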

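Weakest Point of the Argument

Lack of output diversity: the generator learns to ignore the noise input, which makes pix2pix essentially a deterministic mapping that yields only one output per input. For tasks whose solutions are inherently multimodal (such as colorization and style transfer), this means the method learns only one mode of the conditional distribution. In addition, the requirement for paired training data greatly narrows the range of practical applications.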