Generative Image Inpainting with Contextual Attention

Abstract

Recent deep-learning-based approaches show promising results for image inpainting of large missing regions. However, existing methods often generate distorted structures or blurry textures that are inconsistent with surrounding areas. This paper proposes a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilise surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network which can process images with multiple holes at arbitrary locations and with variable sizes. Experiments on multiple datasets including faces (CelebA, CelebA-HQ), textures (DTD), and natural images (ImageNet, Places2) demonstrate higher-quality inpainting results than existing methods.

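Paragraph function: Overview of the whole paper. It starts from the shortcomings in inpainting quality and leads to the core innovation, contextual attention.

Logical role: The abstract advances in three steps (problem, solution, validation): texture inconsistency in existing methods -> explicit use of surrounding features -> validation on multiple datasets.

Argumentative technique / potential weakness: The phrase "explicitly utilise" contrasts with the "implicit learning" of earlier methods and hints at an interpretability advantage. The range of datasets (faces, textures, natural images) shows the generality of the method.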

1. Introduction

Image inpainting — filling in missing pixels — has applications in photo editing, image-based rendering and computational photography. Early patch-based methods (such as PatchMatch) work well for stationary backgrounds but cannot hallucinate novel contents for complex, non-repetitive structures like faces. Recent deep CNN and GAN approaches formulate inpainting as conditional image generation but often create boundary artifacts, distorted structures and blurry textures.

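Paragraph function: Sets up the research area by outlining the applications of image inpainting and the distinct shortcomings of the two families of methods.

Logical role: Starting point of the argument chain: traditional methods lack generative capacity + deep methods lack consistency = a new method is needed that combines the strengths of both.

Argumentative technique / potential weakness: Placing the different shortcomings of the two families side by side creates the expectation that the proposed method overcomes both. This also sets a high bar for validation.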

The authors identify that CNNs struggle with long-range correlations between distant contextual information and the hole regions. Standard convolutional layers have limited receptive fields and process all spatial locations uniformly, without explicitly borrowing information from known regions. The key contribution is a novel contextual attention layer that explicitly attends to and borrows feature patches from distant spatial locations, using surrounding patches as convolutional filters to match and reconstruct missing content.

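Paragraph function: Root-cause analysis. It exposes the fundamental limitation of CNN-based inpainting in terms of receptive field and information flow.

Logical role: Traces the surface symptom (poor quality) to an underlying cause (insufficient long-range information borrowing), which supplies the technical motivation for contextual attention.

Argumentative technique / potential weakness: The "explicit borrowing" vs. "implicit learning" framing is very effective. However, dilated convolution already partly alleviates the receptive-field problem, so the contrast here may be oversimplified.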
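
The receptive-field point in the note above can be made concrete with a little arithmetic. The sketch below assumes stride-1 convolutions, for which each layer widens the receptive field by (kernel_size - 1) * dilation. The function name and the dilation rates 2, 4, 8 and 16 are illustrative assumptions, not values taken from the text.

```python
def receptive_field(layers):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer widens
    the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Four plain 3x3 convolutions versus four 3x3 convolutions with
# illustrative (assumed) dilation rates 2, 4, 8 and 16.
plain = receptive_field([(3, 1)] * 4)
dilated = receptive_field([(3, 2), (3, 4), (3, 8), (3, 16)])
print(plain, dilated)
```

Under these assumptions the dilated stack sees a much wider window with the same number of layers, which is why the note treats dilation as a partial remedy. It still weights positions with fixed learned kernels rather than matching content, which is the gap contextual attention targets.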

2. Related Work

Traditional diffusion-based and patch-based methods use variational algorithms or patch similarity to propagate information from background regions into holes. PatchMatch accelerated this with an efficient nearest-neighbour search. Context Encoders were among the first deep networks applied to inpainting large holes. Iizuka et al. improved results using global and local discriminators plus dilated convolutions. In attention modelling, Spatial Transformer Networks predict affine transformations, appearance flow predicts offset vectors, and deformable convolutional networks learn spatially attentive kernels. The proposed contextual attention differs by explicitly matching patches rather than learning deformations.

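Paragraph function: Positions the work in the literature, placing contextual attention at the intersection of inpainting methods and attention mechanisms.

Logical role: A two-track literature review: the evolution of inpainting methods (traditional -> CNN -> GAN) and the development of attention mechanisms (spatial transformers -> appearance flow -> deformable convolution). The proposed method sits where the two tracks meet.

Argumentative technique / potential weakness: The contrast between "explicit matching" and "learning deformations" cleanly separates the method from related work. However, explicit matching assumes that the known region contains relevant patches, which may not hold when entirely new content must be generated.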

3. Improved Generative Inpainting Network

The network adopts a coarse-to-fine architecture: the first stage produces a rough completion using a dilated convolutional network with reconstruction loss; the second stage refines results with both reconstruction and adversarial losses. Key design improvements include a thin/deep architecture with fewer parameters, use of ELU activations instead of ReLU, and removal of batch normalisation. Training employs Wasserstein GAN with gradient penalty (WGAN-GP) rather than DCGAN, achieving better stability. A novel spatially discounted reconstruction loss weights pixels as gamma^l where l is the distance to the nearest known pixel, reducing ambiguity in the hole centre and achieving 100x training speedup on Places2.

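Paragraph function: Base architecture. Describes the coarse-to-fine network design and the training improvements.

Logical role: Establishes a robust base architecture before introducing the core innovation (contextual attention). The several engineering improvements (WGAN-GP, spatially discounted loss) demonstrate solid technical skill.

Argumentative technique / potential weakness: The 100x training speed-up is highly persuasive empirical evidence. The spatially discounted loss cleverly borrows the idea of temporal discounting from reinforcement learning. However, with so many improvements stacked together, an ablation of each individual contribution becomes all the more important.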
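
The gamma^l weighting described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the distance l is taken to be the Chebyshev distance to the nearest known pixel (the text does not name the metric), the default gamma of 0.99 is an assumed value, and the nearest-pixel search is brute force for clarity.

```python
def spatial_discount_weights(mask, gamma=0.99):
    """Per-pixel loss weights gamma**l for a binary hole mask.

    mask[y][x] == 1 marks a missing pixel, 0 a known pixel.
    l is the Chebyshev distance to the nearest known pixel (an assumed
    metric); known pixels keep weight 1.0 because their distance is 0.
    """
    height, width = len(mask), len(mask[0])
    known = [(y, x) for y in range(height) for x in range(width)
             if mask[y][x] == 0]
    if not known:
        raise ValueError("mask has no known pixels")
    weights = [[1.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1:
                dist = min(max(abs(y - ky), abs(x - kx)) for ky, kx in known)
                weights[y][x] = gamma ** dist
    return weights


def discounted_l1(pred, target, weights):
    """Mean of weights * |pred - target| over all pixels."""
    total = 0.0
    count = 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * abs(p - t)
            count += 1
    return total / count


# 5x5 image with a 3x3 hole in the middle; gamma=0.5 exaggerates the decay.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
weights = spatial_discount_weights(mask, gamma=0.5)
pred = [[1.0] * 5 for _ in range(5)]
target = [[0.0] * 5 for _ in range(5)]
loss = discounted_l1(pred, target, weights)
```

In the toy mask the hole centre is two pixels from the nearest known pixel, so it receives the smallest weight. That is the intended effect: errors deep inside the hole, where many completions are plausible, are penalised less than errors near the boundary.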

4. Image Inpainting with Contextual Attention

The contextual attention layer operates via a match-and-attend mechanism. First, 3x3 patches are extracted from background regions and used as convolutional filters. Foreground-background similarity is measured using normalised inner product (cosine similarity). A scaled softmax across the background dimension produces attention scores, and the background patches serve as deconvolutional filters for reconstruction. An attention propagation step enforces spatial coherency through left-right then top-down convolution. The unified network has two parallel encoders after the coarse stage: one for content generation via dilated convolution, another for contextual attention, with outputs merged into a single decoder.

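Paragraph function: Core innovation. Details the full mechanism of the contextual attention layer.

Logical role: This paragraph is the technical backbone of the paper: using patches as filters embeds traditional patch matching (in the spirit of PatchMatch) into a differentiable neural-network framework.

Argumentative technique / potential weakness: Recasting patch matching as an attention mechanism is an important conceptual step. However, the fixed 3x3 patch size may limit the ability to capture multi-scale textures, and the directionality of attention propagation (left-right, then top-down) may introduce a directional bias.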
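
The match-and-attend steps described in this section (3x3 patches, cosine similarity, scaled softmax, reconstruction from background patches) can be sketched on a single-channel feature map. This is a simplified illustration, not the layer itself. It loops over pixels instead of using convolution and deconvolution, it rebuilds each hole pixel from background patch centres rather than pasting and averaging whole patches, it omits attention propagation, and it treats any patch centred on a known pixel as background. All function names and the softmax scale of 10 are assumptions.

```python
import math


def extract_patch(feat, y, x):
    """Flattened 3x3 patch centred at (y, x), zero-padded at the borders."""
    height, width = len(feat), len(feat[0])
    return [feat[y + dy][x + dx]
            if 0 <= y + dy < height and 0 <= x + dx < width else 0.0
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def cosine(a, b, eps=1e-8):
    """Normalised inner product of two flattened patches."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + eps)


def softmax(scores, scale):
    """Softmax of scale * scores, shifted by the maximum for stability."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_attention(feat, mask, scale=10.0):
    """Refine hole pixels (mask == 1) by attending to background patches."""
    height, width = len(feat), len(feat[0])
    background = [(y, x) for y in range(height) for x in range(width)
                  if mask[y][x] == 0]
    bg_patches = [extract_patch(feat, y, x) for y, x in background]
    out = [row[:] for row in feat]
    attention = {}
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1:
                fg = extract_patch(feat, y, x)
                scores = [cosine(fg, bg) for bg in bg_patches]
                weights = softmax(scores, scale)
                attention[(y, x)] = weights
                # Simplification: copy only patch centres, not whole patches.
                out[y][x] = sum(w * feat[by][bx]
                                for w, (by, bx) in zip(weights, background))
    return out, background, attention


# Vertical stripes (1, 3, 1, 3, ...) with one hole pixel on a "3" stripe.
feat = [[1.0 if x % 2 == 0 else 3.0 for x in range(7)] for y in range(5)]
mask = [[0] * 7 for _ in range(5)]
mask[2][3] = 1
feat[2][3] = 2.0  # rough estimate standing in for the coarse-stage output
out, background, attention = contextual_attention(feat, mask)
```

In the stripe example the hole starts at a rough value of 2.0. Because the best-matching background patches are centred on the same stripe, the attended value is pulled towards 3.0. A larger scale makes the softmax closer to a hard nearest-patch copy, while a smaller one blends more patches.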

5. Experiments

Experiments are conducted on Places2, CelebA, CelebA-HQ, DTD textures, and ImageNet. Qualitative results show that contextual attention generates more realistic results with far fewer artifacts than the baseline. Attention map visualisations reveal adaptive information borrowing from semantically relevant surrounding areas. Quantitative results on Places2 show improvements in mean l1 error (8.6% vs. 9.4% for the baseline) and PSNR (18.91 dB vs. 18.15 dB). The model has only 2.9M parameters, about half as many as prior work, and runs at 0.2 seconds per 512x512 frame. Ablation studies confirm that WGAN-GP outperforms DCGAN and LSGAN, that the reconstruction loss is essential, and that contextual attention outperforms Spatial Transformer Networks and appearance flow.

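Paragraph function: Comprehensive validation. Demonstrates the method's advantages along four dimensions: qualitative, quantitative, efficiency and ablation.

Logical role: Empirical backbone: the attention-map visualisations provide interpretability evidence, the quantitative metrics provide an objective comparison, and the parameter count and speed demonstrate practicality.

Argumentative technique / potential weakness: Visualising the attention maps makes the mechanism transparent and strengthens credibility. However, the PSNR gain is modest (0.76 dB), and PSNR does not necessarily reflect perceptual quality. There is no user study to confirm the perceptual improvement.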
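
To make the two reported metrics concrete, the sketch below computes a mean l1 error (as a percentage of the pixel range) and PSNR on flat lists of pixel values. The function names, the 8-bit range of 255 and the percentage normalisation are assumptions. The text does not say exactly how the reported numbers were normalised.

```python
import math


def mean_l1_percent(pred, target, max_value=255.0):
    """Mean absolute error as a percentage of the pixel range."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    return 100.0 * sum(errors) / (len(errors) * max_value)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by exactly 10 grey levels.
target = [0.0, 64.0, 128.0, 255.0]
pred = [10.0, 54.0, 138.0, 245.0]
l1 = mean_l1_percent(pred, target)
quality = psnr(pred, target)

# MSE ratio implied by the 0.76 dB gap (18.91 vs. 18.15), valid only if
# both figures were computed from a single aggregate MSE (an assumption).
mse_ratio = 10 ** (-(18.91 - 18.15) / 10)
```

Because PSNR is logarithmic in the mean squared error, a gap of 0.76 dB corresponds to roughly 16% lower MSE under the aggregate-MSE assumption. That is consistent with the note's reading that the gain is real but modest.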

6. Conclusion

The coarse-to-fine framework with the contextual attention module significantly improves image inpainting results by learning feature representations for explicitly matching and attending to relevant background patches. The approach bridges the gap between traditional patch-based methods and modern deep generative models, combining the explicit spatial correspondence of the former with the learning capacity of the latter. Future work includes progressive growing for very high-resolution applications and extensions to image editing and super-resolution.

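Paragraph function: Summarises the paper, using the idea of "bridging" to unify the contributions of traditional and deep methods.

Logical role: The conclusion positions the method as a fusion of traditional patch-based inpainting and deep generation, echoing the problem statement in the introduction that each family has its own shortcomings.

Argumentative technique / potential weakness: The "bridging" framing is precise and historically meaningful. However, naming progressive growing as future work implies that the current method is limited at very high resolutions, and the feasibility of the extension to super-resolution is not fully argued.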

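Overview of the Argument Structure

Problem: CNN-based inpainting lacks long-range information borrowing
-> Thesis: contextual attention with explicit patch matching
-> Evidence: validation on five datasets; PSNR and attention maps
-> Rebuttal: WGAN-GP and the spatially discounted loss stabilise training
-> Conclusion: bridges traditional patch-based inpainting and deep generative models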

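Authors' Core Claim (One Sentence)

By introducing a contextual attention layer into a deep generative inpainting network and explicitly borrowing feature patches from the surrounding regions, the structural consistency and texture quality of inpainted images can be significantly improved.

Strongest Point of the Argument

Interpretability of the attention mechanism: the attention-map visualisations clearly show how the model borrows texture from semantically relevant regions. This not only validates the design intuition but also gives users a window into the model's behaviour. The 100x speed-up from the spatially discounted loss is a further, highly persuasive engineering contribution.

Weakest Point of the Argument

Dependence on the known region: contextual attention assumes that the cues for repairing the hole exist in the surrounding known region. When the hole is very large or contains a unique object, the surroundings may lack relevant patches, and the method degrades to pure generation.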