Globally and Locally Consistent Image Completion

Abstract

This paper presents a neural network approach for image completion that maintains both global and local consistency. The method uses a fully-convolutional neural network to complete images of arbitrary resolutions with missing regions of any shape. To ensure coherence, the system employs two discriminator networks — a global discriminator that evaluates overall image consistency and a local discriminator that assesses the quality of small regions around completed areas. Unlike patch-based methods, this approach can synthesize novel image fragments not present elsewhere in the source image, enabling natural completion of structured objects.

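Paragraph function: Overview of the whole paper. The dual requirement of global and local consistency motivates the dual-discriminator architecture.

Logical role: The core tension of the abstract is the balance between global consistency and local quality; the dual-discriminator design responds directly to this dual requirement.

Argumentation technique / potential weakness: The claim of "synthesizing novel fragments" marks the fundamental difference from traditional methods and is highly persuasive. However, the upper limit of this capability (how complex a new structure can be generated) is not quantified.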

1. Introduction

Image completion (or inpainting) aims to fill missing regions in a way that is visually plausible. Traditional patch-based methods like PatchMatch search for similar patches within the image to fill holes. While effective for texture synthesis and repetitive patterns, they cannot generate semantically meaningful content absent from the image — for example, completing a face when the eye region is missing. Deep learning approaches like Context Encoders demonstrated that neural networks can learn to hallucinate plausible content, but results suffered from blurriness and lacked local detail consistency.
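
The limitation of copy-based filling can be illustrated with a toy example. The sketch below is not PatchMatch (which uses a randomized search over 2-D patches); it is a brute-force 1-D analogue, and the function name `fill_by_patch_search` and its parameters are hypothetical. It fills a hole by finding the place elsewhere in the signal whose preceding context best matches the context before the hole, then copying what follows.

```python
def fill_by_patch_search(signal, start, end, context=3):
    """Fill signal[start:end] by copying the values that follow the
    best-matching context found elsewhere in the signal.

    Assumes start >= context and that at least one candidate exists.
    """
    hole_len = end - start
    query = signal[start - context:start]  # known values just before the hole
    best_pos, best_cost = None, float("inf")
    for pos in range(len(signal) - context - hole_len + 1):
        src_lo, src_hi = pos, pos + context + hole_len
        # Skip candidates whose context or source values overlap the hole.
        if src_hi > start and src_lo < end:
            continue
        candidate = signal[pos:pos + context]
        cost = sum((a - b) ** 2 for a, b in zip(query, candidate))  # SSD
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    if best_pos is None:
        raise ValueError("no valid source patch found")
    filled = list(signal)
    filled[start:end] = signal[best_pos + context:best_pos + context + hole_len]
    return filled


# A repetitive pattern is recovered exactly ...
pattern = [1, 2, 3, 4] * 4
damaged = list(pattern)
damaged[9:11] = [0, 0]
print(fill_by_patch_search(damaged, 9, 11) == pattern)  # True
```

Because every filled value is copied from somewhere else in the signal, content that occurs only inside the hole can never be restored, which is the 1-D counterpart of the missing-eye example above.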

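Paragraph function: Establishes the research motivation, defining the problem space through the complementary shortcomings of traditional methods and early deep learning methods.

Logical role: The argument chain starts here. Traditional methods cannot generate new content, and early deep methods lack consistency, so a method is needed that offers both generative power and consistency.

Argumentation technique / potential weakness: Using "eye completion" as an example makes the problem concrete. However, the blurriness of Context Encoders is partly attributable to the MSE loss rather than to an architectural limitation, so the attribution here may be imprecise.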

The key insight of this work is that ensuring both global structure and local texture quality requires separate evaluation mechanisms. A single discriminator struggles to balance these two aspects. The authors propose a completion network trained with two adversarial discriminators: one sees the entire image to judge global coherence (e.g., scene layout, object proportions), and the other focuses on a small region centred on the completed area to judge local texture fidelity.
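
The difference between the two inputs can be made concrete with a small sketch. It assumes the mask is a nested list of 0/1 values and that the local crop is a fixed-size square no larger than the image; the helper names `local_crop_box` and `crop` are hypothetical and are not taken from the paper. The global discriminator would see the whole image, while the local one would see only the crop returned here.

```python
def local_crop_box(mask, crop_size):
    """Return (top, left, bottom, right) of a crop_size x crop_size window
    centred on the masked region and clamped to the image bounds."""
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no missing pixels")
    centre_y = (rows[0] + rows[-1]) // 2
    centre_x = (cols[0] + cols[-1]) // 2
    # Clamp so the window never leaves the image.
    top = min(max(centre_y - crop_size // 2, 0), height - crop_size)
    left = min(max(centre_x - crop_size // 2, 0), width - crop_size)
    return top, left, top + crop_size, left + crop_size


def crop(image, box):
    """Extract the window described by box from a nested-list image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# 8x8 mask with a 2x2 hole at rows 2-3, columns 4-5.
mask = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (4, 5):
        mask[r][c] = 1

box = local_crop_box(mask, crop_size=4)
print(box)  # (0, 2, 4, 6)
```

Clamping keeps the window inside the image when the hole lies near a border, at the cost of the hole no longer being exactly centred in that case.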

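Paragraph function: States the core insight, exposing the fundamental inadequacy of a single discriminator and giving the dual-discriminator design its theoretical motivation.

Logical role: This paragraph is the conceptual core of the paper. It decomposes "quality" into two separable dimensions, global consistency and local fidelity, each needing its own dedicated evaluation mechanism.

Argumentation technique / potential weakness: The intuition behind "separate evaluation" is very strong. However, the gradient signals from the two discriminators may conflict during training: the global discriminator may encourage blurry but consistent results, while the local discriminator encourages sharp results that may be globally incoherent. How this trade-off is handled is not adequately discussed.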

2. Related Work

Diffusion-based methods propagate information from boundaries inward, working well for narrow missing regions but failing for large holes. Patch-based methods (PatchMatch and variants) iteratively search and paste the best matching patches, excelling at texture propagation but lacking semantic understanding. Deep generative models, particularly Generative Adversarial Networks (GANs), have opened new possibilities by learning to generate realistic image content. Context Encoders combined an encoder-decoder architecture with adversarial training for inpainting, while semantic inpainting methods optimise in the latent space of pre-trained generators. These approaches typically use only a single discriminator and produce results with noticeable artifacts at completion boundaries.
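
Why inward propagation struggles with large holes can be seen in a toy 1-D setting. The sketch below is an assumption-laden simplification (a plain Jacobi averaging iteration, not any specific published diffusion inpainting scheme), and `diffuse_fill` is a hypothetical name. Each missing value is repeatedly replaced by the mean of its two neighbours, so the hole converges to a straight line between its boundary values.

```python
def diffuse_fill(signal, start, end, iterations=500):
    """Fill signal[start:end] by repeatedly averaging neighbouring values.

    Assumes 1 <= start < end <= len(signal) - 1 so both boundaries exist.
    """
    out = [float(v) for v in signal]
    for i in range(start, end):
        out[i] = 0.0  # discard whatever was in the hole
    for _ in range(iterations):
        prev = out[:]  # Jacobi update: read only from the previous iterate
        for i in range(start, end):
            out[i] = 0.5 * (prev[i - 1] + prev[i + 1])
    return out


# Smooth ramp: boundaries 2 and 10 give the linear fill 4, 6, 8.
ramp = diffuse_fill([0, 0, 2, 99, 99, 99, 10, 0], 3, 6)
print([round(v, 3) for v in ramp[3:6]])  # [4.0, 6.0, 8.0]

# Alternating texture: the oscillation inside the hole is lost.
texture = diffuse_fill([1, -1, 1, -1, 1, -1, 1, -1, 1], 2, 7)
print([round(v, 3) for v in texture[2:7]])  # [-1.0, -1.0, -1.0, -1.0, -1.0]
```

The second call shows the failure mode: the fill is smooth and matches the boundaries, but the alternating pattern that used to occupy the hole is not reproduced.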

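Paragraph function: Surveys the literature across three generations of methods: diffusion-based, patch-based, and deep generative.

Logical role: Establishes the historical context of how methods evolved and identifies the "single discriminator" as the shared weakness of existing deep methods, which sets up the differentiated positioning of the dual-discriminator design.

Argumentation technique / potential weakness: Arranging the three generations linearly implies continuous progress, but it does not adequately acknowledge that patch-based methods remain superior in some scenarios.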

3. Method

3.1 Completion Network

The completion network is a fully convolutional architecture consisting of downsampling layers, dilated convolutional layers, and upsampling layers. Dilated convolutions are used in the bottleneck to increase the receptive field without reducing spatial resolution, allowing the network to aggregate information from a larger context when filling holes. The network accepts an input image with a binary mask indicating missing regions and outputs the completed image. It is fully convolutional, meaning it can handle arbitrary input resolutions and arbitrarily shaped missing regions at test time.
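
The effect of dilation on the receptive field can be checked with simple arithmetic. The sketch below uses the standard receptive-field recurrence for stacked convolutions; the layer configurations are hypothetical illustrations and are not the paper's actual architecture.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples, in order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs measured in input pixels
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def conv_output_size(size, kernel, stride=1, dilation=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


plain = [(3, 1, 1)] * 4                                   # four ordinary 3x3 convs
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]    # growing dilation rates
downsampled = [(3, 2, 1), (3, 2, 1)] + dilated            # two stride-2 convs first

print(receptive_field(plain))        # 9
print(receptive_field(dilated))      # 31
print(receptive_field(downsampled))  # 127

# With padding equal to the dilation, a 3x3 dilated conv keeps the resolution.
print(conv_output_size(64, kernel=3, dilation=8, padding=8))  # 64
```

With the same number of 3x3 layers, the dilated stack sees a 31-pixel context instead of 9, and placing it after downsampling multiplies that reach further, all without any additional loss of resolution inside the dilated layers.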

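Paragraph function: Describes the architecture: the basic design of the completion network and the role of dilated convolutions.

Logical role: Establishes the generative side of the method. Dilated convolutions enlarge the receptive field, directly addressing the need for long-range context, and the fully convolutional design guarantees flexibility at test time.

Argumentation technique / potential weakness: Introducing dilated convolutions is a practical engineering choice, but their fixed dilation rates may not adapt to missing regions of different sizes. The absence of a multi-scale strategy may limit the method's generality.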

3.2 Global and Local Discriminators

The training framework employs two adversarial discriminator networks. The global discriminator receives the entire completed image as input and evaluates whether the overall scene is coherent — checking for consistent lighting, perspective, and semantic layout. The local discriminator receives a small crop centred on the completed region and evaluates whether the local texture and structure are realistic. The total adversarial loss is a weighted combination of both discriminators' outputs. These discriminators serve as auxiliary training networks — they guide the completion network during training but are discarded at inference time.
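
One way to read "a weighted combination of both discriminators' outputs" is sketched below. It assumes each discriminator emits a probability that its input is real and that the terms are binary cross-entropy losses; the weights `w_global` and `w_local` and their default values are hypothetical, since the text above does not state them.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for a single probability and a 0/1 target."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def combined_adversarial_loss(p_global, p_local, target,
                              w_global=0.5, w_local=0.5):
    """Weighted sum of the global and local discriminator losses.

    target = 1.0 when the completion network wants both to say "real";
    target = 0.0 when the discriminators are trained on completed images.
    """
    return w_global * bce(p_global, target) + w_local * bce(p_local, target)


# Completion-network view: both discriminators should be fooled (target = 1).
fooled_both = combined_adversarial_loss(0.9, 0.9, target=1.0)
fooled_global_only = combined_adversarial_loss(0.9, 0.1, target=1.0)
print(fooled_both < fooled_global_only)  # True
```

In this reading, a completion that looks plausible as a whole but has an unconvincing patch is still penalized through the local term, and the relative weights decide how strongly. None of this runs at inference time, since only the completion network is kept.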

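Paragraph function: Presents the core innovation, detailing the division of labour and cooperation between the two discriminators.

Logical role: This paragraph is the technical pillar of the paper: the two discriminators directly deliver the core promise of global plus local consistency. Discarding the discriminators at inference ensures that efficiency is unaffected.

Argumentation technique / potential weakness: The positioning as "auxiliary training networks" is clear: there is no extra computational cost at inference. However, the relative weighting of the two discriminators is a sensitive hyperparameter, and a poor balance may yield results that are globally consistent but locally blurry, or locally sharp but globally incoherent.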

4. Experiments

The method is evaluated on Places2 for scene completion and CelebA for face inpainting. Compared to PatchMatch and Context Encoders, the proposed method produces significantly more natural completions, especially for large missing regions and structured content like faces. The global discriminator prevents inconsistencies in scene layout, while the local discriminator improves texture sharpness. Ablation studies demonstrate that removing either discriminator degrades quality: without the global discriminator, completions have good local texture but inconsistent global structure; without the local discriminator, results are globally coherent but locally blurry. The method handles object removal, scene completion, and arbitrary-shaped holes.

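Paragraph function: Provides comprehensive validation, supporting the method's effectiveness with both comparative experiments and ablation studies.

Logical role: The ablation results match the theoretical prediction exactly: the global discriminator is responsible for structure and the local discriminator for texture, and neither can be omitted. This is the strongest empirical support in the argument.

Argumentation technique / potential weakness: The close agreement between the ablation results and the theoretical prediction is highly persuasive. However, there is no comparison against the then state-of-the-art GAN methods on quantitative metrics (such as FID or PSNR), and qualitative comparison is rather subjective.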

5. Conclusion

This paper proposes a deep image completion method that ensures both global and local consistency through a dual-discriminator training framework. The fully-convolutional completion network can process arbitrary resolutions and hole shapes, while the global and local discriminators guide it to produce completions that are both semantically coherent and texturally detailed. The approach significantly outperforms patch-based methods and prior deep learning approaches, particularly for structured content and large missing regions. The dual-discriminator design is general and could benefit other image generation tasks.

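Paragraph function: Sums up the paper, restating the core dual-discriminator design and the method's generality.

Logical role: The conclusion elevates the dual discriminator from "a technique for image completion" to "a general design principle for image generation", broadening the scope of the contribution.

Argumentation technique / potential weakness: The generality claim is forward-looking but not validated experimentally; tests on other generative tasks would be needed to support it. The conclusion does not discuss the method's limitations on high-resolution images.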

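Argument Structure Overview

Problem: A single discriminator cannot serve both global and local consistency.
→ Claim: A dual-discriminator architecture, global plus local.
→ Evidence: Places2 / CelebA; ablations validate the role of each discriminator.
→ Rebuttal: The discriminators are discarded at inference, so there is no extra computational cost.
→ Conclusion: Dual discriminators as a general design principle for image generation.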

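Authors' Core Claim (One Sentence)

A dual-discriminator training framework, in which a global discriminator ensures the consistency of the semantic layout and a local discriminator ensures texture fidelity, enables deep image completion to meet both global and local quality requirements.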

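Strongest Point of the Argument

Symmetry of the ablation experiments: removing the global discriminator leads to global inconsistency, and removing the local discriminator leads to local blur. This symmetric result neatly confirms the design intuition that each discriminator has its own role, and it is the most persuasive empirical evidence in the paper.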

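Weakest Point of the Argument

Lack of quantitative evaluation metrics: the evaluation relies mainly on qualitative comparison and lacks systematic reporting of objective metrics such as FID and PSNR. In addition, the balance of the loss weights between the two discriminators may need tuning for different scenarios, and the robustness of the method's generalization has not been adequately validated.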