Abstract · 1. Introduction · 2. Method · 3. Style Transfer · 4. Super-Resolution · 5. Conclusions · Argument Overview

Abstract

We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss function between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on two tasks: style transfer, where we achieve comparable performance to the optimization-based method of Gatys et al. but are three orders of magnitude faster, and single-image super-resolution, where we show that replacing a per-pixel loss with a perceptual loss gives visually more pleasing results.
Paragraph function: Proposes the unified "perceptual loss + feed-forward network" framework and lists the two applications.
Logical role: Integrates two parallel research directions (feed-forward speed vs. perceptual quality) to establish the paper's contribution.
Argumentation technique / potential weakness: The "three orders of magnitude" speedup claim is highly impactful, quantifying the speed advantage precisely.

1. Introduction

Many classic problems in computer vision can be framed as image transformation tasks, where a system receives some input image and transforms it into an output image. Each of these tasks can be addressed with convolutional neural networks. A common approach is to train a feed-forward CNN using a per-pixel loss function that measures the difference between the output and ground-truth. Although such approaches produce reasonable results, per-pixel losses do not capture perceptual differences between output and ground-truth images. Two images that are perceptually quite different may have a small per-pixel loss, and two images that are perceptually very similar may have a large per-pixel loss. Recently, Gatys et al. showed that neural style transfer can be performed using optimization to find an image that minimizes a perceptual loss; however, this optimization process is very slow, requiring hundreds of iterations of forward and backward passes through a pretrained network.
Paragraph function: Identifies the fundamental flaw of per-pixel losses and the speed bottleneck of the Gatys method.
Logical role: Establishes a dual problem (quality vs. speed), setting up the unified solution.
Argumentation technique / potential weakness: The counterintuitive per-pixel-loss cases (perceptually different yet small loss) effectively convey the severity of the problem.
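The inversion described above can be made concrete with a toy sketch (1-D pixel lists standing in for images; none of this is from the paper): a high-frequency pattern shifted by one pixel looks identical to the original, yet incurs a far larger per-pixel MSE than a flat gray image that looks nothing like it.

```python
# Toy illustration of why per-pixel losses can mislead: shifting a
# high-frequency "image" by one pixel leaves it perceptually unchanged,
# yet its per-pixel MSE against the original is huge.

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A 1-D "image" with high-frequency detail: alternating dark/bright pixels.
stripes = [0, 255] * 8

# The same pattern shifted by one pixel -- perceptually the same texture.
shifted = stripes[1:] + stripes[:1]

# A flat gray image -- perceptually very different from the stripes.
gray = [128] * 16

print(mse(stripes, shifted))  # 65025.0 -- huge loss, identical appearance
print(mse(stripes, gray))     # 16256.5 -- smaller loss, different appearance
```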

2. Method

Our system consists of two components: an image transformation network f_W and a loss network phi that is used to define several loss functions. The image transformation network is a deep residual CNN parameterized by weights W; it transforms input images x into output images y = f_W(x). The loss network phi is used to define a feature reconstruction loss l_feat and a style reconstruction loss l_style. For each input image x, we have a content target y_c and a style target y_s. For style transfer, the content target is the input image itself and the style target is a fixed style image. The feature reconstruction loss is the squared, normalized Euclidean distance between feature representations: l_feat = ||phi_j(y) - phi_j(y_c)||^2 / (C_j H_j W_j). The style reconstruction loss is based on the Gram matrices of the feature maps.
Paragraph function: Describes the two-network architecture and the mathematical definitions of the two perceptual losses.
Logical role: Grounds the abstract notion of "perceptual loss" in computable mathematical formulas.
Argumentation technique / potential weakness: Using the loss network (VGG) as a fixed perceptual evaluator cleverly exploits the semantic representational power of pretrained features.
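The two losses above can be sketched in a few lines of plain Python, with nested lists standing in for VGG feature maps (`feature_loss`, `gram`, and `style_loss` are illustrative names, not the paper's code; normalization by C·H·W follows the definitions in the text):

```python
# Minimal sketch of the feature reconstruction and style reconstruction
# losses. A "feature map" here is a list of C channels, each a flat list
# of H*W activation values.

def feature_loss(phi_y, phi_yc):
    """Squared Euclidean distance between feature maps, normalized by C*H*W."""
    c, hw = len(phi_y), len(phi_y[0])
    return sum((a - b) ** 2
               for ch_y, ch_c in zip(phi_y, phi_yc)
               for a, b in zip(ch_y, ch_c)) / (c * hw)

def gram(phi):
    """C x C Gram matrix of a feature map, normalized by C*H*W."""
    c, hw = len(phi), len(phi[0])
    return [[sum(phi[i][k] * phi[j][k] for k in range(hw)) / (c * hw)
             for j in range(c)] for i in range(c)]

def style_loss(phi_y, phi_ys):
    """Squared Frobenius norm of the difference of Gram matrices."""
    gy, gs = gram(phi_y), gram(phi_ys)
    return sum((gy[i][j] - gs[i][j]) ** 2
               for i in range(len(gy)) for j in range(len(gy)))

# Two tiny 2-channel "feature maps" over a 1x2 spatial grid.
phi_y  = [[1.0, 2.0], [3.0, 4.0]]
phi_yc = [[1.0, 2.0], [3.0, 4.0]]
print(feature_loss(phi_y, phi_yc))  # 0.0 -- identical features
print(style_loss(phi_y, phi_yc))    # 0.0 -- identical Gram matrices
```

Note that the Gram matrix discards spatial arrangement (it only records which channels co-activate), which is what lets the style loss match texture without copying layout.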

3. Style Transfer

For style transfer, we train one network per style. Once trained, the network can stylize any input image in a single forward pass, taking approximately 20 milliseconds for a 256x256 image, compared to the tens of seconds required by the optimization-based approach of Gatys et al. The results are qualitatively comparable to those produced by the optimization approach. Our trained networks successfully capture the texture, color, and compositional patterns of the target style while maintaining the high-level structure of the content image. We use a network architecture based on residual blocks, with two stride-2 convolutions for downsampling, followed by five residual blocks and two fractionally-strided convolutions for upsampling.
Paragraph function: Reports the speed and quality results for style transfer.
Logical role: The quantified speed comparison (20 ms vs. tens of seconds) validates the "three orders of magnitude" claim.
Argumentation technique / potential weakness: "Comparable quality but 1000x faster" is a very strong practicality argument, but needing a separately trained network per style is a clear limitation.
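The spatial bookkeeping of the architecture described above can be sketched as follows (a sketch assuming "same" padding so that only strides change resolution; `forward_shapes` is an illustrative helper, not the paper's code):

```python
# Spatial sizes through the transformation network: two stride-2
# convolutions shrink a 256x256 input to 64x64, five residual blocks keep
# that size, and two fractionally-strided (stride-1/2) convolutions
# restore 256x256.

def forward_shapes(size):
    shapes = [size]
    for _ in range(2):              # stride-2 downsampling convolutions
        size //= 2
        shapes.append(size)
    for _ in range(5):              # residual blocks preserve spatial size
        shapes.append(size)
    for _ in range(2):              # fractionally-strided upsampling convs
        size *= 2
        shapes.append(size)
    return shapes

print(forward_shapes(256))
# [256, 128, 64, 64, 64, 64, 64, 64, 128, 256]
```

A side benefit of this hourglass shape is cost: the five residual blocks operate at 64x64 rather than 256x256, i.e. on 16x fewer spatial positions per layer, which contributes to the fast single forward pass.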

4. Super-Resolution

For single-image super-resolution, we train networks with upsampling factors of x4 and x8. We compare our perceptual loss approach against standard per-pixel loss (MSE). While per-pixel loss achieves higher PSNR values, the results tend to be blurry and lack fine detail. In contrast, our perceptual loss produces sharper images with more plausible high-frequency detail, even though the PSNR may be slightly lower. This observation aligns with the growing recognition that PSNR and per-pixel metrics do not always correlate well with human perceptual quality. The super-resolution network uses the same architecture as the style transfer network but with different loss functions.
Paragraph function: Demonstrates the advantage of perceptual loss on super-resolution.
Logical role: A second application that validates the generality of the perceptual-loss framework.
Argumentation technique / potential weakness: Challenges PSNR's suitability as a quality metric, a point widely echoed in subsequent super-resolution research.
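The link between PSNR and per-pixel loss is purely arithmetic: PSNR is a monotone decreasing function of MSE, so a network trained to minimize per-pixel MSE is guaranteed to score well on PSNR regardless of how blurry its output looks. A minimal sketch, assuming 8-bit images (peak value 255):

```python
# PSNR in dB as a function of per-pixel MSE. Lower MSE always means
# higher PSNR, so PSNR rankings simply reproduce MSE rankings.
import math

def psnr(mse, max_val=255.0):
    """Peak signal-to-noise ratio for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

print(round(psnr(100.0), 2))  # 28.13
print(round(psnr(25.0), 2))   # 34.15 -- lower MSE, higher PSNR
```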

5. Conclusions

In this work, we have combined the benefits of feed-forward image transformation networks and optimization-based image generation by training feed-forward networks with perceptual loss functions that depend on high-level features from a pretrained loss network. We have demonstrated compelling results on style transfer and single-image super-resolution, showing that perceptual losses produce results that are visually superior to those trained with per-pixel losses alone. Our method enables real-time style transfer and super-resolution, making these techniques practical for interactive applications. We believe that the ideas presented here are applicable to a broad range of image transformation tasks.
Paragraph function: Summarizes the value of the unified framework and anticipates broad applications.
Logical role: Closes on practicality, emphasizing the leap from academic result to application.
Argumentation technique / potential weakness: An appropriately generalized outlook (a broad range of image transformation tasks), but the method's limitations are not discussed in depth.

Argument Structure Overview

Problem
  Per-pixel losses yield poor perceptual quality
  Optimization-based methods are slow
Claim
  Perceptual losses + feed-forward networks
Evidence
  Style transfer is 1000x faster
  Super-resolution results are sharper
Rebuttal
  PSNR is slightly lower, but perceptual quality is better
Conclusion
  A real-time image transformation framework

Core Claim

Training feed-forward transformation networks with perceptual loss functions defined on the high-level features of a pretrained network achieves both high perceptual quality and real-time processing speed.

Strongest Argument

The three-orders-of-magnitude speedup in style transfer (20 ms vs. tens of seconds), achieved with almost no compromise in quality, enables real-time interactive use; the numbers strongly support the practicality claim.

Weakest Link

Each style requires a separately trained network, so real-time transfer of arbitrary styles is not possible; later methods such as AdaIN removed this limitation.
