Learning from Simulated and Unsupervised Images through Adversarial Training (SimGAN)

Abstract

With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator's output using unlabeled real data, while preserving the annotation information from the simulator. We develop a method for S+U learning that uses an adversarial network similar to Generative Adversarial Networks (GANs), but with synthetic images as inputs instead of random vectors. We make several key modifications including a 'self-regularization' term, a local adversarial loss, and updating the discriminator using a history of refined images.

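Paragraph function: Overview of the whole paper. Defines the S+U learning paradigm and previews three technical contributions.

Logical role: The abstract frames the problem as "synthetic images are usable but flawed", answers it with S+U learning as a new paradigm, and closes by listing three key technical modifications. The structure is a clear three-step progression from problem to solution to technique.

Argumentative technique / potential weakness: Repositioning the GAN from "generation" to "refinement" is an important conceptual innovation. However, the promise to "preserve annotation information" carries no quantitative guarantee in the abstract; whether refinement really leaves the correspondence between image and annotation intact has to be verified experimentally.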

1. Introduction

Large labeled training datasets are becoming increasingly important with the recent rise in high capacity deep neural networks. However, labeling such large datasets is expensive and time-consuming. An alternative to labeling real images is to use synthetic data from simulators, which can automatically generate annotations. While this approach is appealing, learning from synthetic images can be problematic due to a gap between synthetic and real image distributions. The goal of this work is to improve the realism of synthetic images from a simulator using unlabeled real data. The refined images should add realism, preserve annotation information, and be generated without artifacts.

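Paragraph function: Establishes the research motivation, moving from the cost of data annotation to the value and challenges of synthetic data.

Logical role: Starting point of the argument chain: annotation is expensive -> synthetic data as an alternative -> but a domain gap exists -> refinement is needed. The three clear design goals (realism, annotation preservation, no artifacts) give the later method design explicit evaluation criteria.

Argumentative technique / potential weakness: Using a practical industry pain point (annotation cost) as the motivation gives the paper strong practical appeal. However, the severity of the "domain gap" is not quantified here, and the gap varies widely across tasks and simulators.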

We introduce SimGAN, which refines synthetic images from a simulator using a neural network which we call the 'refiner network'. The key idea is to use a combination of an adversarial loss that fools a discriminator network into classifying refined images as real, together with a self-regularization loss that penalizes large changes between the synthetic and refined images. This combination ensures that the refined images look realistic while retaining the annotation information. Our key contributions include: (i) proposing the S+U learning methodology; (ii) training refiner networks using combined losses; (iii) implementing modifications to stabilize GAN training; and (iv) demonstrating improvements through qualitative, quantitative, and user study experiments.

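Paragraph function: Presents the solution, outlining SimGAN's refiner architecture and its two-loss design.

Logical role: Follows the problem statement with an overview of the solution. The adversarial loss is responsible for adding realism and the self-regularization loss for preserving annotations, which map precisely onto the first two of the three design goals in the preceding paragraph.

Argumentative technique / potential weakness: The rationale for the two losses is clear: the adversarial loss pulls the distributions together, while the regularizer keeps the structure intact. But there is an inherent tension between them: too strong an adversarial loss can destroy structure, and too strong a regularizer limits how much refinement is possible. Tuning the weight lambda on the self-regularization term is therefore critical.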

The GAN framework learns two networks (a generator and a discriminator) with competing losses. Various extensions have been proposed including CoGAN, InfoGAN, and style transfer applications. Regarding synthetic data usage, applications span gaze estimation, text detection, font recognition, object detection, hand pose estimation, scene recognition, semantic segmentation, and human pose estimation. Our approach differs from domain adaptation methods in that it bridges the gap between image distributions through adversarial training at the pixel level, rather than adapting features for specific tasks. This makes it task-independent — the refined images can be used for any downstream task.

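Paragraph function: Positions the work in the literature, distinguishing SimGAN from GAN variants and from domain adaptation methods.

Logical role: The contrast with domain adaptation highlights SimGAN's core differentiator: task independence. Pixel-level refinement lets the output serve any downstream task, which is more general than feature-level adaptation.

Argumentative technique / potential weakness: The "task-independent" positioning is shrewd and greatly widens the method's scope. But pixel-level refinement also means the method cannot be optimized for the needs of a specific task; for tasks that require semantic-level adjustment, feature-level adaptation may be more effective.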

3. Method — S+U Learning with SimGAN

3.1 Adversarial Loss with Self-Regularization

The refiner network R_theta should produce outputs that look like real images in appearance while preserving the annotation information. The discriminator D_phi minimizes cross-entropy loss distinguishing real from refined images. The refiner loss combines an adversarial component that fools D into classifying refined images as real, with a self-regularization loss: lambda * ||psi(R_theta(x_i)) - psi(x_i)||_1, where psi is a feature transform (identity in the simplest case). The refiner is a fully convolutional neural network without striding or pooling, modifying the synthetic image on a pixel level while preserving its global structure. This design ensures the spatial resolution remains unchanged and the refined image stays close to the original.

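Paragraph function: Core method derivation, defining the refiner's loss function and network constraints.

Logical role: This paragraph lays the mathematical foundation. L1 self-regularization ensures that refinement is a fine adjustment rather than a reconstruction, and the fully convolutional design without pooling limits the extent of modification at the architectural level. The two constraints (loss function plus architecture) jointly serve the goal of preserving annotations.

Argumentative technique / potential weakness: The fully convolutional, pooling-free design is an elegant engineering decision that preserves pixel-level correspondence at the architectural level. However, L1 regularization is a rather simple way to balance realism against fidelity; a more elaborate perceptual loss or a structural similarity metric might be better suited to measuring the preservation of annotation-relevant structure.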
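
The two losses described in Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: images are flat lists of floats, d_fake_prob stands for the discriminator's estimated probability that an image is refined (the "fake class" convention), psi defaults to the identity, and the function names and the default lam=0.1 are hypothetical choices made only for this example.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def l1_distance(a, b):
    """Sum of absolute differences between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def refiner_loss(d_fake_prob, refined, synthetic, lam=0.1, psi=lambda img: img):
    """Per-image refiner loss: adversarial term plus lam * L1 self-regularization.

    d_fake_prob is the discriminator's probability that the refined image is
    fake (assumed convention). The adversarial term is small when the
    discriminator is fooled, i.e. when d_fake_prob is close to 0.
    """
    adversarial = -math.log(max(1.0 - d_fake_prob, EPS))
    self_reg = lam * l1_distance(psi(refined), psi(synthetic))
    return adversarial + self_reg


def discriminator_loss(d_fake_prob_refined, d_fake_prob_real):
    """Cross-entropy for one refined image (target: fake) and one real image (target: real)."""
    return (-math.log(max(d_fake_prob_refined, EPS))
            - math.log(max(1.0 - d_fake_prob_real, EPS)))
```

With lam set to 0 the refiner is free to change the image arbitrarily as long as the discriminator is fooled; a very large lam pushes the refined image back toward the synthetic input, which is the tension noted in the commentary on the introduction.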

3.2 Local Adversarial Loss

Rather than using a global discriminator that classifies the entire image as real or fake, we implement a discriminator network that classifies all local image patches separately, outputting a w x h dimensional probability map of patches belonging to the fake class. This local adversarial loss encourages the refiner to model the local statistics of real images rather than over-emphasize global structure, which can lead to obvious artifacts like unnatural depth boundaries when using a global discriminator. The local approach also provides more training signal per image, as each patch contributes independently to the loss.

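Paragraph function: Technical innovation, introducing local discrimination to improve refinement quality.

Logical role: Responds to the artifact problem of a global discriminator. Local discrimination shifts attention from global structure to local texture statistics, which matches the goal of refinement: change texture detail, not overall structure.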

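Argumentative technique / potential weakness: This design parallels the PatchGAN discriminator used in pix2pix, a roughly contemporaneous work, and the patch-based idea went on to be widely used. But purely local discrimination may miss some global consistency problems, such as an overall color shift or large-scale structural distortion.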
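
The local adversarial loss of Section 3.2 can be illustrated with a small sketch. It assumes the discriminator has already produced a w x h map of per-patch probabilities for the fake class, represented here as a list of rows, and it sums the per-patch cross-entropy terms; the summation and the function names are assumptions of this example, since the text above only says that each patch contributes independently to the loss.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def discriminator_local_loss(fake_prob_map, is_refined):
    """Sum of per-patch cross-entropy losses over a w x h probability map.

    fake_prob_map[i][j] is the predicted probability that patch (i, j) is fake.
    For a refined image every patch has target "fake"; for a real image "real".
    """
    total = 0.0
    for row in fake_prob_map:
        for p in row:
            target_prob = p if is_refined else 1.0 - p
            total += -math.log(max(target_prob, EPS))
    return total


def refiner_local_adversarial_term(fake_prob_map):
    """Adversarial part of the refiner loss: every patch should look real."""
    return sum(-math.log(max(1.0 - p, EPS)) for row in fake_prob_map for p in row)
```

Because the terms are summed per patch, a single unrealistic region (for example one patch with fake probability 0.9 among patches at 0.1) raises the refiner's loss on its own, which is how a local discriminator penalizes localized artifacts.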

3.3 History Buffer

We introduce a method for improving the stability of adversarial training by updating the discriminator using a history of refined images rather than only the ones in the current mini-batch. We maintain a buffer of previously generated refined images. During each discriminator update, with b denoting the mini-batch size, we sample b/2 images from the current refiner network and an additional b/2 images from the buffer, then update the buffer by randomly replacing some of its entries with newly refined images. This prevents the discriminator from overfitting to the most recent refiner output and provides a more stable training signal that smooths out the adversarial dynamics.

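Paragraph function: Training stability, addressing the known instability of GAN training.

Logical role: Proposes an engineering solution to a practical challenge of GAN training. The history buffer exposes the discriminator to a more diverse distribution of refined images and keeps the "cat-and-mouse game" from falling into oscillation.

Argumentative technique / potential weakness: The trick is intuitive and practical, and was later adopted by other GAN training methods. But the b/2 mixing ratio is an empirical choice; the authors give no sensitivity analysis for other ratios, and the effect of buffer size on training stability is not fully explored either.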
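
The buffer mechanism of Section 3.3 can be sketched as a small class. This is an illustrative sketch, not the authors' code: the class name HistoryBuffer, its method names, the capacity parameter, and the default of replacing b/2 entries per update are assumptions of this example (the text above says only that some entries are replaced). It also assumes the capacity is at least as large as the mini-batch.

```python
import random


class HistoryBuffer:
    """Keeps previously refined images for mixing into discriminator batches."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.rng = rng or random.Random()

    def build_batch(self, current_refined):
        """Return a batch of size b: up to b/2 from history, the rest current."""
        b = len(current_refined)
        n_old = min(b // 2, len(self.items))  # buffer may still be filling up
        from_history = self.rng.sample(self.items, n_old)
        from_current = current_refined[: b - n_old]
        return from_current + from_history

    def update(self, current_refined, n_replace=None):
        """Store new refined images; once full, overwrite random slots."""
        if n_replace is None:
            n_replace = len(current_refined) // 2  # assumed default
        new_items = current_refined[:n_replace]
        free = self.capacity - len(self.items)
        self.items.extend(new_items[:free])
        overflow = new_items[free:]
        slots = self.rng.sample(range(len(self.items)), len(overflow))
        for slot, img in zip(slots, overflow):
            self.items[slot] = img
```

In a training loop, build_batch would be called on each freshly refined mini-batch before the discriminator step, and update afterwards. While the buffer is still empty the batch consists of current images only.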

4. Experiments

We evaluate SimGAN on two tasks: gaze estimation and hand pose estimation. For gaze estimation, training on refined images achieved a 22.3% absolute percentage improvement over synthetic baselines, and state-of-the-art results on the MPIIGaze dataset with a relative improvement of 21%. A user study found that subjects chose the correct label (real vs. refined) only 517 times out of 1000 trials (p=0.148), meaning they were not able to reliably distinguish real images from refined synthetic ones. For hand pose estimation, training on refined synthetic data outperforms the model trained on real images with supervision by 8.8%. Ablation studies confirmed that the history buffer prevents severe artifacts, and the local adversarial loss is superior to global approaches, removing obvious unrealistic depth boundary artifacts.

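Paragraph function: Empirical validation, supporting the method's effectiveness with several kinds of evidence (quantitative metrics, a user study, ablation experiments).

Logical role: The three-pronged validation strategy is thorough: (1) quantitative improvement on downstream tasks; (2) a "Turing test" of human perception; (3) verification that each component is necessary.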

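Argumentative technique / potential weakness: The user-study result of p=0.148 is highly persuasive: subjects did not distinguish real from refined images at a rate reliably above chance. Strictly speaking, a non-significant p-value shows only that the study found no reliable evidence of discrimination, not that discrimination is impossible. The experiments also cover only two vision tasks (eyes, hands); whether the results generalize to more complex settings (such as full bodies or outdoor scenes) has not been verified. In addition, the result of "surpassing supervised training on real images" is surprising and may indicate that the real dataset is small or that the simulator has an advantage in data volume.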
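
The user-study figure can be checked with a short calculation. The text above does not say which statistical test was used, so the one-sided exact binomial test against chance (probability 0.5 of a correct answer) in this sketch is an assumption, and the function name is hypothetical.

```python
from math import comb


def one_sided_binomial_p(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 0.5), computed exactly."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials


# 517 correct answers out of 1000 trials, as reported in the user study.
p_value = one_sided_binomial_p(517, 1000)
print(round(p_value, 3))
```

Under this assumption the value lands close to the reported p=0.148, which suggests the reported figure is a one-sided test against chance. It is well above the usual 0.05 threshold, so the null hypothesis that subjects were guessing cannot be rejected.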

5. Conclusion

We proposed Simulated+Unsupervised learning to add realism to the simulator while preserving the annotations of the synthetic images. Using a combination of adversarial loss and self-regularization loss, along with local adversarial training and a history buffer, our method demonstrates state-of-the-art results without any labeled real data. The refined images are visually indistinguishable from real images by human observers. Future work includes exploring modeling the noise distribution to generate more than one refined image for each synthetic image, and investigating refining videos for temporal consistency.

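Paragraph function: Summarizes the paper, restating the contributions and pointing to future directions.

Logical role: The conclusion concisely reviews the full technical stack and uses "no labeled real data required" as the core selling point. The future directions (diversity, video) hint at room for extending the method.

Argumentative technique / potential weakness: The conclusion is well worded but does not adequately discuss the method's limitations: how to handle semantic-level differences between the simulator and the real world (for example object categories and layout), as opposed to differences at the level of texture and lighting only.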

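Argument Structure Overview

Problem: a domain gap exists between synthetic and real images
→ Claim: adversarial refinement adds realism while preserving annotations
→ Evidence: state-of-the-art results on gaze and hand pose estimation; users cannot tell refined images from real ones
→ Rebuttal: the local loss and the history buffer address artifacts and training instability
→ Conclusion: state-of-the-art performance without labeled real data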

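Authors' core claim (one sentence)

Adversarial refinement turns simulator output into training data that is visually realistic and annotation-preserving, achieving state-of-the-art downstream performance without any labeled real images.

Strongest point of the argument

The "perceptual Turing test" of the user study: human subjects could not statistically distinguish refined from real images (p=0.148), which is the most direct and most persuasive validation of "realism". Combined with the quantitative improvements on downstream tasks, it forms a complete chain of evidence from perception to performance.

Weakest point of the argument

Limited task scope: the experiments cover only two fairly constrained vision tasks (gaze estimation on close-up eye images and pose estimation on hand depth maps), both relatively simple image domains. For more complex scenes (such as outdoor driving or multi-object interaction), the gap between simulator and real world goes far beyond the texture level, and it remains unclear whether S+U refinement can bridge it.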