StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

Abstract

Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. We propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture allows simultaneous training of multiple datasets with different domains within a single network. Experiments demonstrate that StarGAN generates visually higher quality results compared to existing methods.

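Paragraph function: Overview of the whole paper, framing the extension from two domains to many domains as the core contribution.

Logical role: The abstract unfolds progressively: prior achievement (success on two domains) -> current limitation (no scalability to many domains) -> proposed solution (StarGAN, one model for many domains).

Argumentative technique / potential weakness: The simplicity of the "single model" claim is highly appealing. The claim of "visually higher quality," however, needs careful quantitative validation, because subjective visual comparison and objective quantitative metrics can diverge.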

1. Introduction

Existing methods for image-to-image translation, such as pix2pix and CycleGAN, learn mappings between two specific domains. To handle translations among k domains, these approaches require training k(k-1) separate generators, which is both computationally expensive and unable to leverage training data from other domains. Our key innovation is that the generator takes in as inputs both the image and the target domain information, and learns to flexibly translate the image into the corresponding domain. We additionally introduce a mask vector technique that enables joint training across datasets with different label sets.

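Paragraph function: Quantifies the problem, using the quadratic k(k-1) growth in generator count to make the inefficiency of multi-domain translation concrete.

Logical role: The k(k-1) complexity analysis gives the most direct motivation for StarGAN's single-model design. The mask vector technique then widens the method's applicability.

Argumentative technique / potential weakness: The k(k-1) argument is persuasive: 10 domains, for example, would need 90 generators. The authors do not mention here, however, that conditional GANs (cGANs) can also handle multi-class generation with a single model, albeit in a different form.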
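The generator count discussed above can be checked with a few lines of Python. This is a minimal sketch; the function names `pairwise_generators` and `unified_generators` are illustrative and do not come from the paper.

```python
def pairwise_generators(k: int) -> int:
    """Generators needed when every ordered pair of k domains gets its own model."""
    if k < 2:
        return 0  # no translation is possible with fewer than two domains
    return k * (k - 1)


def unified_generators(k: int) -> int:
    """Generators needed when one label-conditioned model covers all k domains."""
    return 1 if k >= 2 else 0


for k in (2, 5, 10):
    print(f"k={k}: pairwise={pairwise_generators(k)}, unified={unified_generators(k)}")
```

The pairwise count grows quadratically with k, while the unified count stays at one.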

2. Related Work

Generative Adversarial Networks (GANs) have been applied to various tasks including image generation, super-resolution, and domain adaptation. Conditional GANs condition the generation on additional information such as class labels. Image-to-image translation methods such as pix2pix require paired training data, while CycleGAN and DiscoGAN learn from unpaired data using a cycle consistency loss. However, these methods are inherently limited to two-domain translation. IcGAN and DIAT attempt multi-domain translation but require separate models or produce lower quality results.

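Paragraph function: Literature review, tracing the development from GANs to image-to-image translation and locating StarGAN within it.

Logical role: Establishes a clear line of methodological progress: GAN -> cGAN -> pix2pix -> CycleGAN -> StarGAN, where each step addresses a specific limitation of the previous one.

Argumentative technique / potential weakness: Positioning CycleGAN as a two-domain method is accurate, but it also implies that StarGAN builds on CycleGAN's cycle consistency loss. The method's novelty lies in its architecture design rather than in its loss function.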

3. Method

3.1 Architecture

StarGAN uses a single generator G and discriminator D. The generator takes as input an image x and a target domain label c, and produces a translated image G(x, c). The discriminator has two branches: one for real/fake classification (adversarial) and one for domain classification. This design means that a single G learns all possible domain translations, using the target domain label to control which translation to perform. The domain label is concatenated to the input image as additional channels, spatially replicated to match the image dimensions.

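Paragraph function: Core architecture, describing the design of StarGAN's generator and discriminator.

Logical role: "A single G learns all translations" is the paper's central design decision and responds directly to the k(k-1) efficiency problem. The two-branch discriminator ensures that translations are both realistic and assigned to the correct domain.

Argumentative technique / potential weakness: Concatenating the domain label as extra channels is a simple way to condition the generator, but it may not be semantically expressive enough. "Aging" and "gender change," for example, are very different operations, and distinguishing them only by discrete labels may limit how fine-grained the translation can be.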
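The label conditioning described above (replicating the target label spatially and appending it as extra channels) might look like the following sketch. It uses plain nested lists in channels x height x width order so that it runs without any array library. The helper name `concat_label_channels` and the data layout are assumptions made for illustration, not the authors' code.

```python
from typing import List

Image = List[List[List[float]]]  # channels x height x width


def concat_label_channels(image: Image, label: List[float]) -> Image:
    """Append one constant-valued plane per label entry to the image channels."""
    height = len(image[0])
    width = len(image[0][0])
    # Each label entry is replicated over the full height x width grid.
    label_planes = [
        [[float(value)] * width for _ in range(height)] for value in label
    ]
    # List concatenation builds a new outer list; the input image is not modified.
    return image + label_planes


rgb = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]  # toy 3 x 2 x 2 image
conditioned = concat_label_channels(rgb, [1, 0])    # toy 2-entry target label
print(len(conditioned))  # number of channels fed to the generator
```

In this toy case a 3-channel image with a 2-entry label becomes a 5-channel input, so the label width only affects the number of input channels the generator has to accept.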

3.2 Loss Functions

The training objective combines three components. (1) Adversarial loss: uses Wasserstein GAN with gradient penalty (WGAN-GP) to produce realistic images. (2) Domain classification loss: ensures the translated image is correctly classified into the target domain; it is applied to real images when training D and to fake images when training G. (3) Reconstruction loss: applies cycle consistency, so that translating from source to target and back recovers the original image, preserving content while changing only domain-specific attributes. The hyperparameters are set to lambda_cls = 1 and lambda_rec = 10.

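Paragraph function: Loss design, defining the multi-term objective that drives StarGAN training.

Logical role: Each of the three losses has its own job: the adversarial loss enforces realism, the classification loss enforces the correct domain, and the reconstruction loss enforces content preservation. Together they form a complete set of constraints.

Argumentative technique / potential weakness: The high weight lambda_rec = 10 suggests that the reconstruction loss is critical for stable training. A high reconstruction weight may also lead to conservative translations, however: the generator may prefer minimal changes to guarantee invertibility, at the cost of how strongly the image is transformed.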
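For reference, the three terms and their weights can be written out as follows. This is a sketch in the notation commonly used for StarGAN and WGAN-GP. The symbols $D_{src}$ (real/fake branch), $D_{cls}$ (domain classification branch), $c'$ (the original domain label of $x$), $\hat{x}$ (a point sampled on a line between a real and a generated image), and $\lambda_{gp}$ (gradient penalty weight) are not defined in the summary above and are introduced here as assumptions.

```latex
\begin{align}
\mathcal{L}_{adv} &= \mathbb{E}_{x}\left[D_{src}(x)\right]
  - \mathbb{E}_{x,c}\left[D_{src}(G(x,c))\right]
  - \lambda_{gp}\,\mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D_{src}(\hat{x}) \rVert_{2} - 1\right)^{2}\right] \\
\mathcal{L}_{cls}^{r} &= \mathbb{E}_{x,c'}\left[-\log D_{cls}(c' \mid x)\right] \\
\mathcal{L}_{cls}^{f} &= \mathbb{E}_{x,c}\left[-\log D_{cls}(c \mid G(x,c))\right] \\
\mathcal{L}_{rec} &= \mathbb{E}_{x,c,c'}\left[\lVert x - G(G(x,c),\,c') \rVert_{1}\right] \\
\mathcal{L}_{D} &= -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{r} \\
\mathcal{L}_{G} &= \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec}
\end{align}
```

Under this sign convention D minimizes $\mathcal{L}_{D}$ and G minimizes $\mathcal{L}_{G}$, with lambda_cls = 1 and lambda_rec = 10 as stated above.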

3.3 Training with Multiple Datasets

A key challenge in multi-dataset training is that different datasets have different label sets — for example, CelebA has facial attributes (hair color, gender, age) while RaFD has facial expressions (happy, angry, fearful). StarGAN addresses this with a mask vector approach: the domain label is extended to include all possible labels from all datasets, with a one-hot mask indicating which dataset the label belongs to. Unknown labels are simply set to zero. This enables the first successful multi-domain translation across different datasets in a single model.

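Paragraph function: Technical innovation, describing the mask vector method for training across datasets.

Logical role: The mask vector is the key technique that takes StarGAN beyond multi-domain translation within a single dataset. It turns the practical problem of incompatible label sets into a compact engineering solution.

Argumentative technique / potential weakness: Setting unknown labels to zero is a bold simplification. In theory, "unknown" and "does not have this attribute" mean different things, and a zero value could be misread by the model as a negative signal. The experimental results nevertheless indicate that the method works in practice.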
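The mask vector construction described above can be sketched as follows: the labels of all datasets are laid out side by side, labels from the other dataset are zeroed, and a one-hot mask identifying the source dataset is appended. The label names, their order, and the function name `unified_label` are hypothetical choices made for this illustration; the real datasets define more labels than shown here.

```python
from typing import Dict, List

# Hypothetical label layouts for illustration only.
DATASET_LABELS: Dict[str, List[str]] = {
    "CelebA": ["black_hair", "blond_hair", "brown_hair", "male", "young"],
    "RaFD": ["angry", "fearful", "happy"],
}
DATASET_ORDER: List[str] = list(DATASET_LABELS)


def unified_label(dataset: str, active: List[str]) -> List[int]:
    """Build [labels of dataset 1, labels of dataset 2, ..., one-hot mask]."""
    if dataset not in DATASET_LABELS:
        raise ValueError(f"unknown dataset: {dataset}")
    stray = set(active) - set(DATASET_LABELS[dataset])
    if stray:
        raise ValueError(f"labels not in {dataset}: {sorted(stray)}")

    vector: List[int] = []
    for name in DATASET_ORDER:
        for label in DATASET_LABELS[name]:
            # Labels of the other dataset are unknown, so they are set to zero.
            known = name == dataset
            vector.append(1 if known and label in active else 0)
    # The one-hot mask tells the model which block of labels is meaningful.
    mask = [1 if name == dataset else 0 for name in DATASET_ORDER]
    return vector + mask


print(unified_label("CelebA", ["blond_hair", "young"]))
print(unified_label("RaFD", ["happy"]))
```

The mask is what separates "unknown" from "absent": a zero inside the block selected by the mask means the attribute is off, while zeros outside that block carry no information.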

4. Experiments

Experiments are conducted on CelebA (202,599 facial images, 40 binary attributes) and RaFD (4,824 images, 8 expressions). In Amazon Mechanical Turk (AMT) user studies for single-attribute transfer on CelebA, workers preferred StarGAN over the baselines: hair color 66.2% vs. CycleGAN's 20.0%, gender 39.1% vs. DIAT's 31.4%, and aging 70.6% vs. IcGAN's 9.2%. On RaFD, the expression classification error is 2.12% for StarGAN versus 5.99% for CycleGAN. Critically, StarGAN requires only 53.2M parameters for all translations, compared to 52.6M x 7 for DIAT and 52.6M x 14 for CycleGAN.

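Paragraph function: Core experiments, validating the method along three dimensions: user studies, classification accuracy, and parameter efficiency.

Logical role: A three-part argument: (1) human preference (AMT study); (2) an objective metric (classification error); (3) efficiency (parameter count). The parameter comparison is especially easy to grasp.

Argumentative technique / potential weakness: The parameter comparison (53.2M vs. 52.6M x 14) is the most persuasive efficiency argument. The sample size and statistical method of the AMT user study are not detailed, however, and the extreme gap for aging (70.6% vs. 9.2%) may suggest that the baselines were not tuned adequately.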
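The parameter comparison above reduces to simple arithmetic, worked through in the short sketch below. The per-model sizes and model counts are the figures quoted in the experiments paragraph; the variable and function names are illustrative.

```python
STARGAN_PARAMS_M = 53.2   # millions of parameters, single model
BASELINE_PARAMS_M = 52.6  # millions of parameters per baseline model


def total_params_m(per_model_m: float, n_models: int) -> float:
    """Total parameters (in millions) when n_models separate models are trained."""
    return per_model_m * n_models


def ratio_to_stargan(total_m: float) -> float:
    """How many times larger a baseline's total is than StarGAN's single model."""
    return total_m / STARGAN_PARAMS_M


diat_total = total_params_m(BASELINE_PARAMS_M, 7)
cyclegan_total = total_params_m(BASELINE_PARAMS_M, 14)

print(f"DIAT:     {diat_total:.1f}M, {ratio_to_stargan(diat_total):.1f}x StarGAN")
print(f"CycleGAN: {cyclegan_total:.1f}M, {ratio_to_stargan(cyclegan_total):.1f}x StarGAN")
```

The totals come to 368.2M for DIAT and 736.4M for CycleGAN, about 6.9 and 13.8 times StarGAN's 53.2M respectively.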

5. Conclusion

We have proposed StarGAN, a scalable image-to-image translation model capable of multi-domain translations using a single generator-discriminator pair. By incorporating domain labels as input and using mask vectors for multi-dataset training, StarGAN achieves the first successful multi-domain translation across different datasets. The results demonstrate superior image quality through implicit data augmentation from multi-task learning, along with significantly fewer parameters than multi-model approaches. StarGAN represents a step toward more flexible and efficient generative models.

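Paragraph function: Summarizes the paper, closing the argument on the twin advantages of efficiency and quality.

Logical role: The conclusion echoes the abstract: from "multi-domain translation does not scale" to "a single model succeeds on many domains," closing the loop. The "implicit data augmentation" argument offers a theoretical explanation for the benefit of multi-domain training.

Argumentative technique / potential weakness: "Implicit data augmentation" is an interesting but insufficiently verified hypothesis. The gains from multi-domain training may come from other factors, such as the regularizing effect of shared representations. The authors also do not discuss StarGAN's limits under extreme domain gaps, such as cross-species translation.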

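Argument structure overview

Problem: translating among k domains requires k(k-1) separate generators
→ Claim: a single generator, with labels controlling multi-domain translation
→ Evidence: AMT preference of 66-70% (hair color, aging); roughly 14x fewer parameters than CycleGAN
→ Rebuttal: mask vectors handle incompatible label sets
→ Conclusion: the first single model for multi-domain translation across datasets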

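Authors' core claim (in one sentence)

With a single generator-discriminator pair, domain labels as input, and the mask vector technique, StarGAN is the first to achieve multi-domain image translation across multiple datasets, and it surpasses traditional approaches that need many independent models in both quality and efficiency.

Strongest point of the argument

The overwhelming efficiency advantage: the parameter comparison of 53.2M vs. 52.6M x 14 is very easy to grasp. Combined with the high preference rates in the AMT user study, StarGAN is not only more efficient but also higher in quality, which breaks the common assumption that efficiency and quality cannot both be had. The implicit data augmentation effect of joint multi-domain training offers a plausible explanation.

Weakest point of the argument

The limited expressiveness of discrete labels: representing domains as one-hot vectors restricts how fine-grained the translation can be, so continuous attributes such as "slightly older" or "80% blond, 20% brown hair" cannot be expressed. In addition, the simplifying assumption "zero = unknown" in the mask vector is semantically ambiguous in theory, and the method is validated only on faces, so how well it generalizes to more diverse image types remains unclear.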