High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Abstract

We present a method for synthesizing high-resolution photorealistic images from semantic label maps using conditional generative adversarial networks (conditional GANs). Conditional GANs have enabled a variety of applications, but the results are often limited to low-resolution and still far from realistic. In this work, we generate 2048 x 1024 visually appealing results with a novel adversarial loss, as well as new multi-scale generator and discriminator architectures. Furthermore, we extend our framework to interactive visual manipulation with both semantic and instance-level control. To validate the importance of the different components, we perform detailed ablation studies and comparisons against existing methods.

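Paragraph function: Overview of the whole paper. It moves step by step from the achievements of conditional GANs, to the resolution bottleneck, to the proposed solution, and closes by previewing the architectural innovations and the extension to applications.

Logical role: The abstract serves as both problem definition and solution preview: it first identifies the gap in conditional GANs for high-resolution synthesis, then answers it with three contributions (loss function, architecture, interactive editing).

Argumentation technique / potential weakness: The concrete figure (2048 x 1024) anchors the reader's attention and is rhetorically persuasive. However, the abstract says only "visually appealing" rather than citing objective metrics, which hints that the main evaluation may rely on subjective perception.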

1. Introduction

Image-to-image translation using conditional GANs has been demonstrated on a wide range of tasks, such as synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images. However, existing methods produce limited resolution results (typically up to 256 x 256). Generating higher resolution images requires larger networks with deeper or wider architectures, which makes GAN training notoriously unstable. The memory constraints of current hardware further limit the achievable resolution.

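Paragraph function: Establishes the research context by pointing out the achievements of image translation and its resolution bottleneck.

Logical role: Starting point of the argument chain: it first affirms the generality of conditional GANs, then identifies the resolution ceiling and training instability, laying the groundwork for the multi-scale architecture.

Argumentation technique / potential weakness: Presenting 256 x 256 as the technical limit of the time implies that simply scaling up is infeasible. However, the authors do not mention that contemporaneous progressive training (ProGAN) had already pushed GANs to 1024 x 1024, so the progress of alternative approaches may be understated.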

We address this problem with a coarse-to-fine generator architecture, a multi-scale discriminator architecture, and a robust adversarial learning objective function. Specifically, we decompose the generator into two sub-networks: G1 (global generator) operates at 1024 x 512 resolution, and G2 (local enhancer) outputs images at 2048 x 1024. For the discriminator, we use 3 discriminators that operate at different image scales. We also introduce an instance-level feature encoder that enables diverse image synthesis and instance-level semantic manipulation.

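Paragraph function: Presents the solution by outlining the three main technical contributions.

Logical role: Responds directly to the obstacles raised in the previous paragraph (resolution, stability, memory) by pairing them with remedies: the coarse-to-fine generator addresses resolution, the multi-scale discriminators address stability, and the feature encoder adds controllability.

Argumentation technique / potential weakness: The decomposition into G1 + G2 is convincing, but it adds training complexity, since G1 must be trained first and then trained jointly with G2. The multi-scale discriminator is inspired by image pyramids, but the computational overhead of three separate discriminators is not discussed here.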
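
The paragraph above names an instance-level feature encoder without saying how per-object features might be formed. As a minimal sketch, assuming (this is not stated in the text) that encoder outputs are averaged over each object instance so that one value describes each object, the pooling step could look like the following. The function name `instance_average_pool` and the toy data are hypothetical.

```python
from collections import defaultdict

def instance_average_pool(features, instance_ids):
    """Replace each pixel's feature with the mean feature of its instance.

    features and instance_ids are equally sized 2D lists (rows of values).
    Assumption: one scalar feature per pixel, for illustration only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feat_row, id_row in zip(features, instance_ids):
        for value, inst in zip(feat_row, id_row):
            sums[inst] += value
            counts[inst] += 1
    means = {inst: sums[inst] / counts[inst] for inst in sums}
    # Broadcast each instance's mean back to every pixel of that instance.
    return [[means[inst] for inst in id_row] for id_row in instance_ids]

# Toy 2 x 3 feature map with two instances (IDs 7 and 9).
features = [[1.0, 3.0, 10.0],
            [2.0, 2.0, 20.0]]
instance_ids = [[7, 7, 9],
                [7, 7, 9]]
pooled = instance_average_pool(features, instance_ids)
print(pooled)
```

With pooling of this kind, editing a single object amounts to swapping the one value attached to its instance ID, which is one way instance-level manipulation could be exposed to a user.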

2. Related Work

Generative adversarial networks (GANs) have been applied to various image generation and manipulation tasks. Pix2pix proposes a general framework for image-to-image translation using conditional GANs with a U-Net generator and a PatchGAN discriminator. However, the results of pix2pix are limited to 256 x 256 resolution. Cascaded Refinement Networks (CRN) use a multi-resolution refinement approach to synthesize images at 2048 x 1024 resolution, but the images lack fine details and the method does not use an adversarial loss. Our approach combines the benefits of adversarial training with multi-scale architectures to produce high-resolution results with rich details.

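Paragraph function: Literature review that positions pix2pixHD in the gap between pix2pix and CRN.

Logical role: Sets up a dilemma: pix2pix has adversarial training but limited resolution, whereas CRN reaches high resolution but lacks an adversarial loss. pix2pixHD is positioned as resolving both limitations at once.

Argumentation technique / potential weakness: Framing the shortcomings of existing methods as a dichotomy is logically clear but somewhat simplistic. CRN's non-adversarial training is a design choice rather than an inherent flaw, and pix2pix's resolution limit stems from the training techniques of the time rather than from the architecture itself.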

3. Method

3.1 Coarse-to-Fine Generator

Our generator is decomposed into two sub-networks: a global generator network G1 and a local enhancer network G2. The global generator operates at 1024 x 512 resolution and consists of a convolutional front-end, a set of residual blocks, and a transposed convolutional back-end. The local enhancer network has a similar architecture but operates at the full 2048 x 1024 resolution. Crucially, the last feature map of G1 is passed into G2, where it is combined with the features from G2's own front-end, so the local enhancer builds upon the global structure established by G1 and adds local details. During training, we first train G1 at lower resolution, then fix G1 and train G2, and finally fine-tune both jointly.

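Paragraph function: First step of the method derivation, defining the architecture of the coarse-to-fine generator.

Logical role: This paragraph is the core of the architectural design: G1 provides global semantic consistency, and G2 handles local refinement. The staged training strategy helps ensure stable convergence.

Argumentation technique / potential weakness: The G1 + G2 decomposition is an elegant piece of engineering, since more enhancers can be stacked as needed to reach higher resolutions. However, this sequential dependency means the quality of G2 is capped by the quality of G1, and staged training makes hyperparameter tuning more complex.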
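
The resolution bookkeeping and the staged training order described above can be sketched in a few lines. This is an illustration only: it assumes each local enhancer doubles width and height relative to the stage before it (consistent with 1024 x 512 for G1 and 2048 x 1024 for G2), and the schedule for more than one enhancer is an extrapolation of the G1/G2 order given in the text. Both function names are hypothetical.

```python
def generator_resolutions(full_width, full_height, num_enhancers=1):
    """List (width, height) per sub-network, coarsest (G1) first.

    Assumption: each local enhancer doubles width and height relative
    to the previous stage.
    """
    scale = 2 ** num_enhancers
    if full_width % scale or full_height % scale:
        raise ValueError("full resolution must be divisible by 2**num_enhancers")
    stages = []
    for level in range(num_enhancers, -1, -1):
        factor = 2 ** level
        stages.append((full_width // factor, full_height // factor))
    return stages

def training_schedule(num_enhancers=1):
    """Train G1, then each enhancer with earlier stages fixed, then fine-tune."""
    names = [f"G{i + 1}" for i in range(num_enhancers + 1)]
    phases = [f"train {names[0]}"]
    for i in range(1, len(names)):
        phases.append(f"fix {', '.join(names[:i])}; train {names[i]}")
    phases.append("fine-tune " + ", ".join(names) + " jointly")
    return phases

print(generator_resolutions(2048, 1024))
for phase in training_schedule():
    print(phase)
```

Under the same doubling assumption, stacking a second enhancer would keep G1 at 1024 x 512 while the full output grows to 4096 x 2048.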

3.2 Multi-Scale Discriminator

High-resolution image synthesis demands a discriminator with a large receptive field. Rather than using a deeper network (which increases capacity but may cause overfitting and training instability), we use 3 discriminators D1, D2, D3 that have identical network structures but operate at different image scales. Specifically, D1 operates on the original scale, D2 at half resolution, and D3 at quarter resolution. The discriminators at coarser scales have a larger effective receptive field and can guide the generator to produce globally consistent structures, while the finest-scale discriminator encourages fine-grained details.

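Paragraph function: Architectural innovation that replaces a single deep discriminator with a multi-scale strategy.

Logical role: Forms a dual design with the generator's coarse-to-fine strategy: the generator synthesizes from coarse to fine, while the discriminators judge at several scales simultaneously. In theory, this symmetry favors stable adversarial training.

Argumentation technique / potential weakness: Contrasting "increasing network depth" with "multi-scale decomposition" is an effective rhetorical strategy. However, the total parameter count of three discriminators may be no smaller than that of a single deeper discriminator, and the comparison of computational cost is not explicitly quantified. In addition, the loss weighting across the three discriminators is a hyperparameter that may need careful tuning.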
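
To make the three input scales concrete, the sketch below builds the image pyramid that D1, D2, and D3 would see. It assumes plain 2 x 2 average pooling as the downsampling operator, which is a simplification chosen for illustration; the text only specifies original, half, and quarter resolution. The helper names are hypothetical.

```python
def downsample_2x(image):
    """Halve a 2D grayscale image (list of rows) with 2x2 average pooling."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]

def discriminator_inputs(image, num_scales=3):
    """Return the inputs for D1..Dk: original, half, quarter, ... resolution."""
    pyramid = [image]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid

# Toy 8 x 4 image (width x height) standing in for a 2048 x 1024 frame.
image = [[float(x + y) for x in range(8)] for y in range(4)]
pyramid = discriminator_inputs(image)
for k, scaled in enumerate(pyramid, start=1):
    print(f"D{k} sees {len(scaled[0])} x {len(scaled)}")
```

Because all three discriminators share one structure, each pixel of the quarter-resolution input summarizes a 4 x 4 patch of the original, which is why the coarsest discriminator covers a wider portion of the scene with the same network.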

The full objective combines the GAN loss with a feature matching loss. The feature matching loss is based on the discriminator: we extract features from multiple layers of each discriminator and match intermediate representations between real and synthesized images. This stabilizes training by providing additional supervision beyond the binary real/fake signal. The total loss is: min_G ((max_{D1,D2,D3} sum_{k=1,2,3} L_GAN(G,D_k)) + lambda * sum_{k=1,2,3} L_FM(G,D_k)), where lambda controls the relative importance of the feature matching loss.

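Paragraph function: Loss function design, defining the optimization objective for training.

Logical role: Extends the multi-scale discriminator idea to the loss function: each discriminator provides not only a real/fake judgment but also a matching signal for style and structure through its intermediate-layer features.

Argumentation technique / potential weakness: The feature matching loss is essentially a variant of a perceptual loss, but its features are extracted directly from the discriminator rather than from a pretrained VGG network. This design elegantly avoids a dependency on an external network, but fluctuations in discriminator quality may indirectly affect the stability of feature matching.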
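
The generator's side of the objective, sum_k L_GAN(G, D_k) + lambda * sum_k L_FM(G, D_k), can be traced numerically with toy values. The sketch below assumes the feature gap at each layer is a mean absolute (L1) difference and uses lambda = 10 purely as an illustrative weight; the feature values and GAN loss values are made up.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally long feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def feature_matching_loss(real_layers, fake_layers):
    """L_FM for one discriminator: sum over layers of the mean L1 feature gap."""
    return sum(l1_mean(r, f) for r, f in zip(real_layers, fake_layers))

def generator_objective(gan_losses, fm_losses, lam):
    """sum_k L_GAN(G, D_k) + lambda * sum_k L_FM(G, D_k), as seen by G."""
    return sum(gan_losses) + lam * sum(fm_losses)

# Toy features from two layers of one discriminator (made-up numbers).
real = [[1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]
fake = [[0.0, 2.0], [0.5, 1.5, 0.5, 0.5]]
fm = feature_matching_loss(real, fake)

# Pretend all three discriminators report the same values.
total = generator_objective([0.7, 0.7, 0.7], [fm, fm, fm], lam=10.0)
print(fm, total)
```

Dividing by the number of elements in each layer keeps large early feature maps from dominating the sum, so every layer contributes on a comparable scale.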

4. Experiments

We evaluate our method on Cityscapes (2048 x 1024), NYU Indoor RGBD, ADE20K, and Helen Face datasets. For quantitative evaluation, we use semantic segmentation accuracy as a proxy metric: we run a pre-trained segmentation model on the synthesized images and measure how well the predicted labels match the input. On Cityscapes, our method achieves pixel accuracy of 83.78% and mean IoU of 0.6389, significantly outperforming pix2pix (78.34% / 0.3948) and CRN (70.55% / 0.3483). In human perceptual studies, our results are preferred over pix2pix 93.8% of the time and over CRN 86.2% of the time.

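Paragraph function: Provides comprehensive experimental evidence, validating the method's superiority on several benchmarks with both quantitative and subjective measures.

Logical role: The empirical pillar covers three dimensions: (1) generalization across multiple datasets; (2) quantitative comparison using semantic segmentation as a proxy metric; (3) subjective validation through human perceptual studies.

Argumentation technique / potential weakness: Using semantic segmentation accuracy as a proxy for image quality is an innovative evaluation strategy, because it directly measures whether the image preserves semantic structure. However, this metric may fail to capture perceptual qualities such as texture naturalness or lighting consistency. The human studies compensate for this gap.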
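
The two proxy metrics quoted above, pixel accuracy and mean IoU, compare the label map predicted on a synthesized image with the label map that was used as input. A minimal sketch on toy 2 x 4 label maps follows; averaging IoU over the classes present in either map is an assumption made here for simplicity, and benchmark tooling may handle absent or ignored classes differently.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the input label."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)

def mean_iou(pred, target):
    """Average intersection-over-union over classes present in pred or target."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    classes = {label for pair in pairs for label in pair}
    ious = []
    for c in sorted(classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        ious.append(inter / union)
    return sum(ious) / len(ious)

# Input label map and the segmentation predicted on the synthesized image.
target = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
pred = [[0, 0, 1, 1],
        [0, 1, 1, 1]]
acc = pixel_accuracy(pred, target)
miou = mean_iou(pred, target)
print(acc, miou)
```

A single mislabeled pixel lowers accuracy by one eighth here but affects the IoU of both classes, which illustrates why mean IoU tends to separate methods more sharply than pixel accuracy does.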

Ablation studies reveal the contribution of each component. In pairwise human preference tests, the full method is preferred 68.55% of the time over a variant trained with the GAN loss only, and 58.90% of the time over a variant trained with the GAN and feature matching losses but without the VGG perceptual loss used in the final implementation. The instance boundary map provides significant improvement, with 64.34% preference for realism when included. Architecturally, our coarse-to-fine generator achieves 83.78% pixel accuracy, outperforming both a U-Net generator (77.86%) and a CRN generator (78.96%). The multi-scale discriminator improves mean IoU from 0.5775 (single-scale) to 0.6389 (three-scale). The system runs at 20-30 milliseconds per 2048 x 1024 image on an NVIDIA 1080Ti GPU.

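Paragraph function: Ablation validation, confirming one by one the contribution of each component to the final result.

Logical role: This paragraph guarantees the rigor of the argument: every design choice (loss function, instance boundaries, architecture, discriminator scales) has its own experimental validation, which rules out incidental gains that come merely from higher overall complexity.

Argumentation technique / potential weakness: The ablations are well designed, with only one component removed at a time. Reporting the inference time (20-30 ms) strengthens the case for the method's practicality. However, the ablations report preference percentages rather than standard metrics such as FID or IS, and the inconsistent metrics across ablation experiments may make comparisons difficult.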
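
The ablation refers to an instance boundary map without defining it. One plausible construction, assumed here for illustration, marks a pixel as a boundary when its instance ID differs from any of its four direct neighbours. The function name `boundary_map` and the toy ID grid are hypothetical.

```python
def boundary_map(instance_ids):
    """Mark pixels whose instance ID differs from a 4-connected neighbour."""
    height, width = len(instance_ids), len(instance_ids[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                # Skip neighbours that fall outside the image.
                if 0 <= ny < height and 0 <= nx < width:
                    if instance_ids[ny][nx] != instance_ids[y][x]:
                        edges[y][x] = 1
    return edges

# Toy instance map: object 2 sits in the top-right corner of object 1.
ids = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [1, 1, 1, 1]]
edges = boundary_map(ids)
for row in edges:
    print(row)
```

A map like this separates adjacent objects of the same semantic class, such as two cars side by side, which a semantic label map alone would merge into a single region.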

5. Conclusion

We have presented pix2pixHD, a conditional GAN framework for synthesizing high-resolution photorealistic images from semantic label maps. Our key contributions include a coarse-to-fine generator, a multi-scale discriminator architecture, and a feature matching loss that together enable 2048 x 1024 resolution synthesis with significantly improved quality. We further demonstrate interactive semantic manipulation at the instance level, including object removal, insertion, and appearance transfer. Our results suggest that high-resolution conditional image synthesis is a viable foundation for interactive content creation tools.

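Paragraph function: Summarizes the paper, restating the core contributions and looking ahead to applications.

Logical role: The conclusion mirrors the structure of the abstract: it returns from architectural innovation to the application scenario (interactive content creation), closing the argumentative loop.

Argumentation technique / potential weakness: The conclusion is concise and forceful, but it does not adequately discuss limitations, such as generalization to particular scene types, the diversity of generated images and the risk of mode collapse, and the cost of obtaining semantic label maps. For a high-impact paper, a more candid discussion of limitations would strengthen its credibility.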

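Argument Structure Overview

Problem: conditional GAN image synthesis is limited to low resolution
→ Claim: multi-scale architectures and feature matching break the resolution bottleneck
→ Evidence: 2048 x 1024 synthesis results with a clear lead on multiple metrics
→ Rebuttal: ablation studies rule out incidental effects of individual components
→ Conclusion: high-resolution conditional synthesis is a foundation for content creation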

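Authors' core claim (in one sentence)

Through the coordinated design of a coarse-to-fine generator, multi-scale discriminators, and a feature matching loss, conditional GANs can synthesize high-quality images at 2048 x 1024 resolution that are semantically consistent and visually realistic.

Strongest point of the argument

The duality of the multi-scale design: the generator synthesizes from coarse to fine while the discriminators judge at several scales simultaneously, producing complementary supervision signals. The ablation studies systematically validate the independent contribution of each component; in particular, the multi-scale discriminator raises mean IoU from 0.5775 to 0.6389, which demonstrates the effectiveness of the design.

Weakest point of the argument

The indirectness of the evaluation metric: using semantic segmentation accuracy as a proxy for image quality is innovative but inherently indirect, since high segmentation accuracy is not the same as visual realism. In addition, the evaluation of mode diversity is relatively thin, and the ability to synthesize diverse outputs lacks sufficient quantitative validation.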