SPADE: Semantic Image Synthesis with Spatially-Adaptive Normalization

Abstract

Park, Liu, Wang, Zhu — CVPR 2019

We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to "wash away" semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches regarding both visual fidelity and alignment with input layouts. Moreover, the model allows users to easily control both the semantics and the style of the image content. Code is available upon publication.

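Paragraph function: States the core problem and the proposed solution.

Logical role: Serves as a condensed overview of the whole paper, setting up the full argument chain of problem, diagnosis, proposed method, experimental validation, and application value.

Argumentation technique: The abstract first pinpoints the fundamental flaw of existing methods (normalization layers wash away semantic information) and then offers the matching remedy (spatially-adaptive modulation), forming a clear causal chain. Stressing user controllability also broadens the work's application value, balancing academic contribution with engineering practicality.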

1. Introduction

Problem Definition and Research Motivation

We address the problem of semantic image synthesis—converting a semantic segmentation mask to a photorealistic image. This problem is challenging but has many important applications, such as content creation and image editing tools, which could eventually offer an alternative to traditional rendering pipelines.

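Paragraph function: Defines the research task.

Logical role: The opening sets the boundary of the research problem, converting a semantic segmentation mask into a photorealistic image, and lays the groundwork for the technical discussion that follows.

Argumentation technique: Briefly citing application scenarios (content creation, image editing) establishes the practical value of the work, so readers quickly grasp why it matters.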

The conventional network architecture for this task uses stacked convolutional, normalization, and nonlinearity layers. We identify that this design is suboptimal because normalization layers tend to "wash away" information contained in the input semantic masks. This is particularly problematic when the semantic layout is directly concatenated with or added to the input of the first layer, as the semantic signal must survive through many normalization stages.

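Paragraph function: Identifies the fundamental flaw of existing methods.

Logical role: This paragraph carries the paper's key insight, that normalization layers "wash away" semantic information, which directly motivates the design of SPADE.

Argumentation technique: The metaphor "wash away" makes an abstract technical problem intuitive, and the added explanation that the semantic signal must survive many normalization stages underlines how serious the problem is.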

We propose spatially-adaptive normalization, a conditional normalization layer that modulates the activations using input semantic layouts through a spatially-adaptive, learned transformation. Unlike prior conditional normalization methods, the modulation parameters of the proposed method are spatially-adaptive—they vary with respect to the spatial position in the input semantic mask—effectively propagating semantic information throughout the network.

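Paragraph function: Presents the core method.

Logical role: Following the diagnosis, it proposes spatially-adaptive normalization as the solution, closing the loop from problem to solution.

Argumentation technique: The contrast with prior conditional normalization methods ("unlike prior methods") pins down the novelty: modulation parameters that vary spatially. The phrase "effectively propagating semantic information throughout the network" directly answers the problem raised in the previous paragraph.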

We conduct extensive experiments on several challenging datasets including COCO-Stuff, ADE20K, and Cityscapes. Our results show that even a compact network with spatially-adaptive normalization produces significantly better results compared to several state-of-the-art methods in terms of both visual quality and alignment with the input segmentation masks.

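Paragraph function: Previews the experimental results.

Logical role: Announcing the results at the end of the introduction strengthens the reader's confidence in the method and leads into the technical details.

Argumentation technique: Stressing that even a "compact network" beats the state of the art implies an efficiency advantage: better quality with a lighter model, which is a persuasive combination. Naming three datasets from different domains also signals generalization.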

Technical Background and Literature Review

Deep generative models, particularly Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), have achieved remarkable progress in image generation. We focus on conditional image synthesis, where the goal is to generate images conditioned on some input signal. Various forms of conditioning have been explored, including class-conditional models, text-to-image generation, and image-to-image translation.

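Paragraph function: Establishes the technical background.

Logical role: Surveys the landscape of generative models and places this work in the specific branch of conditional image synthesis, helping readers locate the research.

Argumentation technique: Listing mainstream frameworks such as GANs and VAEs together with several forms of conditioning shows the authors' command of the field and builds credibility. The clear taxonomy makes it easy to find the paper's technical coordinates.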

Normalization layers can be categorized into unconditional types—such as Batch Normalization (BatchNorm), Instance Normalization (InstanceNorm), and Layer Normalization (LayerNorm)—and conditional types, including Conditional Batch Normalization and Adaptive Instance Normalization (AdaIN). Our proposed method applies spatially-varying affine transformations learned from semantic input, distinguishing it from prior conditional normalization methods that use spatially-uniform parameters across all spatial coordinates.

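Paragraph function: Positions the technical difference.

Logical role: A systematic taxonomy of normalization methods fixes SPADE's place in the technical lineage: it is a conditional normalization with spatially varying parameters.

Argumentation technique: After classifying existing methods, the "spatially-varying vs. spatially-uniform" contrast highlights the core novelty. The strategy acknowledges prior work while clearly marking the boundary of the new contribution. Potential weakness: the taxonomy is coarse and does not examine how each method performs in specific settings.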
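
To make the "spatially-uniform" side of this contrast concrete, the following is a minimal NumPy sketch of AdaIN, one of the conditional normalization layers named above. The function name `adain`, the `eps` constant, and the toy tensor sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization on (N, C, H, W) arrays."""
    # Per-sample, per-channel statistics over the spatial axes.
    c_mean = content.mean(axis=(2, 3), keepdims=True)
    c_std = content.std(axis=(2, 3), keepdims=True)
    s_mean = style.mean(axis=(2, 3), keepdims=True)
    s_std = style.std(axis=(2, 3), keepdims=True)
    normalized = (content - c_mean) / (c_std + eps)
    # One scale and one shift per channel, shared by every pixel:
    # this is what "spatially uniform" means.
    return s_std * normalized + s_mean

rng = np.random.default_rng(0)
content = rng.normal(size=(1, 3, 8, 8))
style = rng.normal(loc=2.0, scale=3.0, size=(1, 3, 8, 8))
out = adain(content, style)
```

After the call, each channel of `out` carries the mean and standard deviation of the corresponding `style` channel, but every pixel in a channel received the same scale and shift. A spatially-varying layer would instead use a different scale and shift at each (y, x).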

3.1 Spatially-Adaptive Denormalization (SPADE)

Core Technical Contribution

The core contribution of this work is SPADE (SPatially-Adaptive DEnormalization). Unlike standard Batch Normalization, where activations are normalized channel-wise and then modulated with learned, spatially-uniform scale and bias parameters, SPADE modulates the normalized activations with scale and bias parameters that depend on the input segmentation mask and vary across spatial positions.

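Paragraph function: Defines the core mechanism.

Logical role: Formally introduces the technical definition of SPADE, the paper's central novelty. The comparison with standard BatchNorm lets readers grasp the key difference quickly.

Argumentation technique: Uses a contrast frame: first the familiar standard practice (BatchNorm), then what SPADE does differently (spatially varying parameters), which lowers the barrier to understanding. The name SPADE itself hints at its core "denormalization" operation.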

Mathematically, the activation value at site (n, c, y, x) is first normalized using the channel-wise mean and standard deviation. The normalized activation is then modulated by position-specific scaling (γ) and bias (β) parameters, which are derived from the segmentation mask through a two-layer convolutional network. The first convolution layer processes the segmentation mask with a shared set of filters, followed by separate convolution layers that produce the γ and β tensors respectively.

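Paragraph function: Explains the technical details.

Logical role: Gives a precise mathematical description of SPADE so that readers can fully understand and reproduce it, moving from the high-level concept to the concrete implementation.

Argumentation technique: The computation is broken into three steps (normalize, generate modulation parameters, apply modulation), which keeps the logic clear. The two-layer convolutional design (shared first layer plus separate output layers) balances parameter efficiency and expressiveness, but the paper does not discuss why two layers were chosen over a deeper network.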
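
The verbal description of the SPADE computation can be written out as a formula. The site index (n, c, y, x), γ, and β come from the text; the layer index i, the mask symbol m, the activation symbol h, and the sizes N, H, W are notation assumed here for illustration.

```latex
\gamma^{i}_{c,y,x}(\mathbf{m}) \, \frac{h^{i}_{n,c,y,x} - \mu^{i}_{c}}{\sigma^{i}_{c}} + \beta^{i}_{c,y,x}(\mathbf{m}),
\qquad
\mu^{i}_{c} = \frac{1}{N H^{i} W^{i}} \sum_{n,y,x} h^{i}_{n,c,y,x},
\qquad
\sigma^{i}_{c} = \sqrt{\frac{1}{N H^{i} W^{i}} \sum_{n,y,x} \left(h^{i}_{n,c,y,x}\right)^{2} - \left(\mu^{i}_{c}\right)^{2}}
```

The mean and standard deviation are indexed only by channel c, while γ and β are additionally indexed by the position (y, x) and depend on the mask m. That extra spatial index is what distinguishes this layer from a per-channel affine transform.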

The proposed method generalizes several existing normalization approaches: if we replace the segmentation mask with class labels and make the modulation parameters spatially invariant, SPADE reduces to Conditional Batch Normalization; if we replace it with another image, make the parameters spatially invariant, and use a batch size of one, it reduces to Adaptive Instance Normalization (AdaIN). This unifying perspective shows that SPADE occupies a more general position in the design space of conditional normalization layers.

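Paragraph function: Establishes theoretical unification.

Logical role: Positions SPADE as a generalization of existing methods rather than just one more new method, which considerably raises its theoretical standing.

Argumentation technique: Uses a "reduction to special cases" style of argument, showing that existing methods are special cases of SPADE. This is a strong move: it demonstrates SPADE's generality and hints at the limits of earlier methods. The strategy is common in mathematically oriented computer vision papers.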
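
The reduction argument can be checked numerically with a small NumPy sketch. It is a simplification under stated assumptions: the two-layer convolutional pathway is replaced by a hypothetical per-class lookup (`params_from_mask`), and all names and sizes are illustrative. The point shown is only that spatially constant modulation parameters turn the layer into an ordinary per-channel affine transform, as in conditional batch normalization.

```python
import numpy as np

def spade_modulate(h, gamma, beta, eps=1e-5):
    """Normalize h of shape (N, C, H, W) per channel, then apply per-position scale and bias."""
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = h.std(axis=(0, 2, 3), keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta

def params_from_mask(labels, w_gamma, w_beta):
    """Hypothetical stand-in for the conv pathway: look up a per-class (C,) vector at every pixel."""
    gamma = np.moveaxis(w_gamma[labels], -1, 1)   # (N, H, W, C) -> (N, C, H, W)
    beta = np.moveaxis(w_beta[labels], -1, 1)
    return gamma, beta

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 3, 4, 4))
w_gamma = rng.normal(size=(2, 3))                 # 2 classes, 3 channels
w_beta = rng.normal(size=(2, 3))

mixed = np.zeros((2, 4, 4), dtype=int)
mixed[:, :, 2:] = 1                               # two classes: parameters vary with x
uniform = np.zeros((2, 4, 4), dtype=int)          # one class everywhere

g_mixed, b_mixed = params_from_mask(mixed, w_gamma, w_beta)
g_unif, b_unif = params_from_mask(uniform, w_gamma, w_beta)
out_mixed = spade_modulate(h, g_mixed, b_mixed)
out_unif = spade_modulate(h, g_unif, b_unif)
```

With the `uniform` label map, `g_unif` and `b_unif` hold one value per channel at every position, so `out_unif` is exactly a class-conditioned batch-norm output. With the `mixed` map, the left and right halves of the feature map are scaled and shifted differently.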

3.2 SPADE Generator

Network Design and Training Strategy

The SPADE generator architecture removes the encoder component common in recent architectures, since the learned modulation parameters already encode sufficient label information. This simplification produces a more lightweight network while maintaining or even improving performance. The generator accepts a random noise vector as input, which is processed through a series of SPADE residual blocks with upsampling layers to produce the final output image.

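Paragraph function: Describes the architectural innovation.

Logical role: Explains the design logic of the SPADE generator: because the SPADE layers already handle semantic encoding, the conventional encoder can be removed, yielding a lighter network.

Argumentation technique: Removing the encoder is both an architectural simplification and indirect evidence that SPADE works: if the SPADE layers could not encode enough semantic information, performance should drop once the encoder is gone. This is an elegant self-validating argument. Taking a random vector as input also opens the door to multi-modal synthesis.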
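
The decoder-only data flow can be sketched as a loop over resolutions. This is a rough illustration, not the published architecture: it assumes the label map is resized with nearest-neighbor sampling to each block's resolution (a detail the text above does not state), and the starting size, channel count, and helper names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (N, C, H, W) array."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def resize_labels(labels, size):
    """Nearest-neighbor resize of an integer label map (H, W) to (size, size)."""
    h, w = labels.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return labels[np.ix_(rows, cols)]

labels = np.zeros((16, 16), dtype=int)
labels[:, 8:] = 1                        # left half class 0, right half class 1
x = np.zeros((1, 2, 4, 4))               # stand-in for the projected noise vector
stages = []
while x.shape[-1] < labels.shape[0]:
    m = resize_labels(labels, x.shape[-1])   # mask at this block's resolution
    # A SPADE residual block would modulate x with parameters derived from m here.
    stages.append((x.shape[-1], m.shape))
    x = upsample2x(x)
```

The loop makes the structural point of the section visible: the mask never enters through an encoder. It is consulted at every scale, while the feature tensor grows from a small seed to the output resolution.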

The architecture employs ResNet blocks with upsampling layers. Training uses a multi-scale discriminator and pix2pixHD loss function, with the modification of replacing the least-squared loss with hinge loss. This training scheme stabilizes the adversarial training process and improves the quality of generated images.

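Paragraph function: Explains the training strategy.

Logical role: Supplies the generator's training details to complete the method description. Multi-scale discriminators and hinge loss were established GAN training techniques at the time.

Argumentation technique: The training choices build on proven practice (pix2pixHD), and swapping the least-squared loss for hinge loss is a relatively "safe" change. Potential weakness: the paper does not explain why hinge loss suits this setting better, leaving readers to work it out.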
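
For readers unfamiliar with the hinge objective, here is a minimal NumPy sketch of the standard GAN hinge losses. The text does not spell out the exact form the authors use, so this is the common textbook formulation offered as an assumption; the function names and score values are illustrative.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.maximum(0.0, 1.0 - real_scores).mean()
    fake_term = np.maximum(0.0, 1.0 + fake_scores).mean()
    return real_term + fake_term

def g_hinge_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on generated samples."""
    return -fake_scores.mean()

real_scores = np.array([2.0, 0.5])    # made-up discriminator outputs
fake_scores = np.array([-2.0, 0.0])
d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(fake_scores)
```

Samples that are already classified with a margin of at least one (the scores 2.0 and -2.0 here) contribute nothing to the discriminator loss, which is one commonly cited reason the hinge form behaves more stably than a squared-error objective.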

3.3 Why SPADE Works Better

Theoretical Analysis and Intuition

When convolution is applied to a uniform (single-label) mask region and followed by Instance Normalization, the normalized activations become zero regardless of the input label value, eliminating the semantic information entirely. This is because in a uniform region the convolution output has the same value at every spatial location, so it has zero variance, and subtracting the per-channel mean during normalization leaves only zeros.

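Paragraph function: Provides the theoretical argument.

Logical role: Shows by careful reasoning that the conventional design (Conv + InstanceNorm) loses all semantic information in uniform mask regions; this is the theoretical foundation for SPADE's design motivation.

Argumentation technique: A concise and forceful negative argument: rather than proving directly that SPADE is good, it first shows that the conventional approach must fail under specific conditions (zero variance erases the information). Constructive counterexamples of this kind are very persuasive in academic papers.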
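
The wash-away effect is easy to reproduce. The sketch below feeds two uniform single-channel masks with different label values through one 3x3 convolution and instance normalization. It assumes edge-replicating padding so the convolution output is exactly uniform; with zero padding the border pixels would differ slightly. All names are illustrative.

```python
import numpy as np

def conv3x3_same(x, kernel):
    """Single-channel 3x3 convolution with edge-replicating padding."""
    padded = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out

def instance_norm(x, eps=1e-5):
    """Normalize one channel of one sample by its own mean and variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
mask_a = np.full((8, 8), 1.0)     # whole image labeled with value 1
mask_b = np.full((8, 8), 5.0)     # whole image labeled with value 5

out_a = instance_norm(conv3x3_same(mask_a, kernel))
out_b = instance_norm(conv3x3_same(mask_b, kernel))
```

Both `out_a` and `out_b` are all zeros up to rounding, so nothing downstream can tell the two labels apart.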

In contrast, SPADE feeds the segmentation masks through spatially-adaptive modulation without normalization—only the previous-layer activations are normalized. The semantic map is processed through a separate convolutional pathway to produce the modulation parameters, which are then applied after normalization. This design preserves semantic information while maintaining the benefits of normalization for stable training.

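Paragraph function: Explains the mechanism behind SPADE's advantage.

Logical role: Pairs with the previous paragraph as problem and answer: the previous one explains why the conventional approach fails, this one explains why SPADE does not.

Argumentation technique: The key insight is the "separate pathway": semantic information travels through its own convolutional pathway (never normalized), while normalization acts only on the previous layer's activations. The design gets the best of both, preserving semantics and stabilizing training. Opening with "in contrast" sets up a clear pro-and-con structure.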
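
The separate-pathway idea can be illustrated with a small sketch in which the label map is never normalized. As a loudly simplified assumption, the convolutional pathway that produces the modulation parameters is replaced by per-class scalar lookup tables (`gamma_table`, `beta_table`); the class names in the comments are made up.

```python
import numpy as np

def spade_like(h, label_map, gamma_table, beta_table, eps=1e-5):
    """Normalize activations h (H, W), then modulate with parameters looked up from the labels."""
    normalized = (h - h.mean()) / np.sqrt(h.var() + eps)
    # The label map bypasses normalization and only selects the scale and bias.
    return gamma_table[label_map] * normalized + beta_table[label_map]

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 8))               # previous-layer activations
gamma_table = np.array([1.0, 2.0])        # hypothetical values for class 0 and class 1
beta_table = np.array([0.0, 0.5])

all_sky = np.zeros((8, 8), dtype=int)     # uniform mask, class 0
all_grass = np.ones((8, 8), dtype=int)    # uniform mask, class 1

out_sky = spade_like(h, all_sky, gamma_table, beta_table)
out_grass = spade_like(h, all_grass, gamma_table, beta_table)
```

Even though both masks are uniform, `out_sky` and `out_grass` differ, because the label determines the scale and bias applied after normalization rather than being normalized away itself.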

3.4 Multi-Modal Synthesis

Style Control and Image Encoding

The architecture enables multi-modal synthesis by accepting random noise vectors as input to the generator. Different random vectors lead to different output images for the same semantic layout, providing diversity in generated results.

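Paragraph function: Introduces an additional capability.

Logical role: Shows an extra benefit of the SPADE architecture: with the encoder removed, the random-vector input naturally enables multi-modal synthesis, a by-product of the design.

Argumentation technique: Multi-modal synthesis is presented as a natural consequence of the architecture rather than a bolted-on feature, suggesting an elegant design. The "same layout, different outputs" description is intuitive.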

An image encoder can be trained to process real images into random vectors, forming a Variational Autoencoder (VAE) framework. The encoder captures the style information of the real image, while the generator combines the encoded style with segmentation information via SPADE layers. This allows guided image synthesis, where the style of a reference image can be transferred to a new semantic layout.

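Paragraph function: Extends the application scenarios.

Logical role: Extends the architecture into a VAE framework, demonstrating the combined capability of style transfer plus semantic control and widening the range of applications.

Argumentation technique: With the VAE framework, SPADE's semantic modulation and style encoding are decoupled, so the two can be controlled independently. The term "guided image synthesis" hints at appealing interactive applications. Potential weakness: a VAE encoder may suffer from posterior collapse, which the paper does not discuss in detail.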
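
The text only says that the encoder and generator "form a VAE framework". The sketch below shows the two standard ingredients that phrase usually implies, the reparameterization trick and the KL term against a standard normal prior. Treat it as an assumption about the setup rather than the authors' exact code; the names and the 2-dimensional latent are illustrative.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, with eps drawn from a standard normal."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])          # made-up encoder outputs for a style image
logvar = np.array([0.0, 0.0])
z = reparameterize(mu, logvar, rng)       # style code that would be fed to the generator
kl = kl_to_standard_normal(mu, logvar)
```

At test time the same generator can take either a sample from the prior (diverse outputs for one layout) or the encoding of a reference image (guided synthesis), which is why the KL term that keeps the two distributions close matters.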

4. Experiments

Quantitative Evaluation and User Study

Training is conducted on 8 V100 GPUs with synchronized Batch Normalization. The model is evaluated on challenging datasets including COCO-Stuff, ADE20K, and Cityscapes, covering diverse scene categories from urban street views to indoor environments.

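Paragraph function: Describes the experimental setup.

Logical role: States the hardware and dataset choices, laying the basis for the credibility of the quantitative results.

Argumentation technique: Choosing three datasets of different sizes and scene types shows generalization. Mentioning synchronized BatchNorm reflects attention to distributed training. Potential issue: the compute demand of 8 V100 GPUs is high and may limit accessibility.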

On COCO-Stuff, SPADE achieves an mIoU of 35.2 versus 14.6 for pix2pixHD, a dramatic improvement in semantic alignment. The FID scores are approximately 2.2 times better than previous leading methods, indicating significantly improved visual quality. These quantitative improvements are consistent across all evaluated datasets.

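Paragraph function: Presents the core quantitative results.

Logical role: Backs the effectiveness claim with concrete figures. The rise in mIoU from 14.6 to 35.2 (about 2.4 times) is a very large improvement.

Argumentation technique: Two complementary metrics, mIoU (semantic alignment) and FID (visual quality), verify the method on the two dimensions of fidelity to the input layout and generation quality. The phrase "2.2 times better" has strong impact, and "consistent across all evaluated datasets" helps rule out chance.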
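
As a reminder of what the alignment metric measures, here is a small NumPy sketch of mean intersection-over-union on a toy pair of label maps, followed by the ratio quoted in the annotation. The helper name and the 2x2 maps are illustrative; in an actual evaluation the predicted labels would come from a segmentation network run on the generated image, which is hard-coded here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class intersection-over-union, skipping classes absent from both maps."""
    ious = []
    for k in range(num_classes):
        inter = np.logical_and(pred == k, target == k).sum()
        union = np.logical_or(pred == k, target == k).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

target = np.array([[0, 0], [1, 1]])   # input layout
pred = np.array([[0, 1], [1, 1]])     # labels recovered from a synthesized image
miou = mean_iou(pred, target, num_classes=2)

ratio = 35.2 / 14.6                   # mIoU figures quoted for COCO-Stuff
```

Here class 0 scores 1/2 and class 1 scores 2/3, so `miou` is 7/12. The `ratio` line reproduces the "about 2.4 times" figure from the two reported mIoU values.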

User studies via Amazon Mechanical Turk strongly favor the proposed results across all datasets, confirming the perceptual quality advantage. Ablation studies show that SPADE consistently outperforms variants including simple concatenation of semantic masks. Notably, the decoder-style SPADE generator achieves better performance than baselines with fewer parameters, demonstrating both effectiveness and efficiency.

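Paragraph function: Adds subjective evaluation and ablation experiments.

Logical role: The user study compensates for the limits of quantitative metrics (FID is not the same as human perception), and the ablations verify that each design choice is needed.

Argumentation technique: A three-way validation (quantitative metrics, user study, ablations) forms a thorough experimental case. "Fewer parameters plus better performance" is the most persuasive combination: the method is both better and more efficient, which directly rebuts the possible objection that the gains merely come from a larger model.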

5. Conclusion

Summary and Outlook

We have presented spatially-adaptive normalization for utilizing semantic layouts in affine transformations within normalization layers. This approach produces the first semantic image synthesis model capable of generating photorealistic outputs across diverse scenes. The method demonstrates compelling applications in multi-modal synthesis and guided image synthesis, providing users with flexible control over the generation process.

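Paragraph function: Summarizes the paper's contributions.

Logical role: Recaps the core contributions concisely: technical novelty (SPADE), a quality breakthrough (the first photorealistic synthesis across diverse scenes), and application value (multi-modal and guided synthesis).

Argumentation technique: The wording "first semantic image synthesis model" claims milestone status. The conclusion deliberately stresses applications (multi-modal, guided synthesis), tying academic contribution to practical value. Potential weakness: limitations and failure cases are not discussed, a common shortcoming in academic papers.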

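Argument Structure Overview

Overall Logical Skeleton

Problem: semantic mask to image synthesis → Diagnosis: normalization layers wash away semantic information → Solution: SPADE, spatially-adaptive normalization → Theory: a separate pathway preserves semantics → Evidence: large gains in mIoU and FID → Conclusion: the first photorealistic synthesis model across diverse scenes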

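The Authors' Core Claim (One Sentence)

With a spatially-adaptive conditional normalization layer (SPADE), the information in a semantic segmentation mask can be preserved and used effectively throughout the generator network, enabling photorealistic semantic image synthesis across scenes with a lighter architecture.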

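Strongest vs. Weakest Points of the Argument

Strongest point: The theoretical analysis (Section 3.3) uses a constructive argument to show that InstanceNorm necessarily erases semantic information in uniform regions, and this is corroborated by the quantitative experiments (mIoU rising from 14.6 to 35.2) and the ablation studies, giving three lines of evidence. In particular, the "fewer parameters, better results" finding strongly rebuts the alternative hypothesis that performance comes from model size, closing the loop from theory to evidence.

Weakest point: The paper says too little about the method's limitations and failure cases. For example: does SPADE still work in scenes where semantic classes are highly fragmented? Is the two-layer convolutional network for generating modulation parameters the best choice? Does the VAE encoder suffer from posterior collapse? In addition, the training requirement of 8 V100 GPUs is high and may limit accessibility in resource-constrained settings, yet the paper does not address this.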