Restormer: Efficient Transformer for High-Resolution Image Restoration

Abstract — 摘要

Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the self-attention mechanism in the Transformer model is key to its success, its computation grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images.

由於摺積神經網路（CNN）擅長從大規模資料中學習可泛化的影像先驗，這類模型已被廣泛應用於影像修復及相關任務。近年來，另一類神經架構——Transformer——在自然語言處理與高階視覺任務上展現出顯著的效能增益。雖然 Transformer 模型中的自注意力機制是其成功的關鍵，但其計算量隨空間解析度呈二次方增長，因此難以應用於涉及高解析度影像的大多數影像修復任務。

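Paragraph function: Problem framing. Establishes the tension between CNNs and Transformers in image restoration.

Logical role: The abstract opens with a contrast, "CNNs are capable but limited; Transformers are powerful but infeasible", which pinpoints the research gap: how to obtain the strengths of both in high-resolution image restoration.

Argumentative technique / potential weakness: Citing quadratic complexity as the reason Transformers are infeasible is concise and forceful. However, existing local-attention schemes (such as Swin Transformer) go unmentioned here, which may slightly overstate the severity of the problem.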

In this paper, the authors propose an efficient Transformer model, Restormer, that can capture long-range pixel interactions while remaining applicable to large images. The key designs include multi-Dconv head transposed attention (MDTA) that efficiently aggregates local and non-local pixel interactions, and gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation to suppress less informative features. Restormer achieves state-of-the-art results on 16 benchmark datasets across tasks including image deraining, single-image motion deblurring, defocus deblurring, and image denoising.

本文提出一種高效的 Transformer 模型 Restormer，能夠在捕捉長距離像素交互作用的同時，仍可應用於大尺寸影像。關鍵設計包含：多深度摺積頭轉置注意力（MDTA），可高效聚合局部與非局部像素交互作用；以及門控深度摺積前饋網路（GDFN），執行受控的特徵轉換以抑制資訊量較低的特徵。Restormer 在涵蓋影像去雨、單影像運動去模糊、散焦去模糊及影像去噪等任務的 16 個基準資料集上，達到了最先進的結果。

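Paragraph function: Solution preview. Outlines Restormer's core modules and experimental results.

Logical role: Following the problem statement, this paragraph pairs "two key designs" with "16 benchmark datasets", previewing both the technical core and the empirical basis of the method.

Argumentative technique / potential weakness: The figure of 16 benchmark datasets is itself a strong persuasive signal, implying broad applicability across tasks. Introducing the abbreviations MDTA and GDFN together sets a clear frame of expectation for the method section.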

1. Introduction — 緒論

Image restoration aims to reconstruct a high-quality image from its degraded observation corrupted by noise, blur, rain streaks, and other adverse factors. This is an inherently ill-posed problem as multiple solutions can map to the same degraded input. CNNs have become the preferred approach because they learn generalizable image priors from large-scale data. However, CNNs present two key limitations: restricted receptive fields that prevent long-range dependency modeling, and static filter weights that cannot flexibly adapt to input content.

影像修復旨在從受到雜訊、模糊、雨紋等不利因素汙損的退化觀測中，重建高品質影像。這本質上是一個病態問題，因為多組解均可對應至同一退化輸入。摺積神經網路（CNN）已成為首選方法，因其能從大規模資料中學習可泛化的影像先驗。然而，CNN 存在兩項關鍵侷限：受限的感受野使其無法建模長距離依賴關係，以及靜態濾波器權重無法靈活適應輸入內容。

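Paragraph function: Setting the research scene. Defines the image restoration problem and identifies two limitations of CNNs.

Logical role: The starting point of the chain of argument. It first acknowledges the learning ability of CNNs, then uses two technical weaknesses (limited receptive fields and static weights) to argue that the self-attention mechanism of Transformers is needed.

Argumentative technique / potential weakness: The term "ill-posed problem" locates the difficulty of the task with mathematical rigor and adds academic credibility. However, the criticism of static filter weights ignores existing remedies such as dynamic convolution (e.g., DynamicConv) and may oversimplify recent progress in CNNs.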

Self-attention mechanisms, the core component of Transformer models, compute responses as weighted sums across all positions, thus addressing both CNN limitations. Transformers have demonstrated state-of-the-art performance in NLP and high-level vision tasks. However, their complexity grows quadratically with spatial resolution, rendering them infeasible for high-resolution image restoration. Recent Transformer applications in restoration either apply self-attention to small 8x8 spatial windows or divide images into non-overlapping 48x48 patches. These approaches contradict the goal of capturing true long-range relationships, particularly on high-resolution images.

自注意力機制是 Transformer 模型的核心組件，透過計算所有位置的加權總和來產生回應，從而同時解決 CNN 的兩項侷限。Transformer 已在自然語言處理與高階視覺任務中展現最先進的效能。然而，其複雜度隨空間解析度呈二次方增長，使其在高解析度影像修復上不可行。近期將 Transformer 應用於修復任務的方法，要麼在小型 8x8 空間窗口內施加自注意力，要麼將影像分割為不重疊的 48x48 圖塊。這些做法與捕捉真正長距離關係的目標相矛盾，在高解析度影像上尤為明顯。

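Paragraph function: Critique of existing solutions. Points out the fundamental contradiction in existing Transformer restoration methods.

Logical role: This paragraph deepens the problem. Beyond the known quadratic bottleneck, it exposes the logical contradiction in existing compromises (local window attention), which give up the Transformer's core long-range modeling advantage in exchange for efficiency.

Argumentative technique / potential weakness: Criticizing Swin Transformer-style local attention as self-contradictory is rhetorically strong. But Swin Transformer can still aggregate global information progressively through window shifting, so the criticism is somewhat one-sided. The authors need to show in the method section that their alternative is indeed better.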

In this work, the authors propose Restormer, an efficient Transformer capable of capturing global connectivity while remaining applicable to large images. The main contributions are threefold: (1) proposing Restormer, an encoder-decoder Transformer enabling multi-scale local-global representation learning on high-resolution images without disintegrating them into patches; (2) introducing multi-Dconv head transposed attention (MDTA) that efficiently aggregates local and non-local pixel interactions by applying self-attention across channels rather than spatial dimensions; and (3) presenting gated-Dconv feed-forward network (GDFN) that performs controlled feature transformation with a gating mechanism to suppress less informative features. The model achieves state-of-the-art performance on 16 benchmark datasets across deraining, deblurring, and denoising.

本研究提出 Restormer，一種能夠捕捉全域連結性且可應用於大尺寸影像的高效 Transformer。主要貢獻有三：（一）提出 Restormer，一個編碼器-解碼器 Transformer，在不將高解析度影像拆分為圖塊的前提下，實現多尺度局部-全域表示學習；（二）引入多深度摺積頭轉置注意力（MDTA），透過在通道維度而非空間維度上施加自注意力，高效聚合局部與非局部像素交互作用；（三）提出門控深度摺積前饋網路（GDFN），以門控機制執行受控的特徵轉換，抑制資訊量較低的特徵。該模型在涵蓋去雨、去模糊與去噪的 16 個基準資料集上達到最先進效能。

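Paragraph function: Proposing the solution. Lists Restormer's core contributions in three points.

Logical role: Following the problem and the critique, this paragraph is the turning point of the argument. The three contributions correspond to innovation at the architecture level (encoder-decoder), the attention level (MDTA), and the feed-forward level (GDFN), forming a complete top-down technical solution.

Argumentative technique / potential weakness: The three-point structure is clear and easy to remember. The emphasis on not splitting images into patches directly answers the earlier criticism of local-attention methods. However, the core idea of transposed attention, computing attention along the channel dimension, is not entirely new: channel attention mechanisms such as SE-Net are precedents, and the authors need to clarify the difference in the method section.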

2. Background

Data-driven CNN architectures have significantly outperformed conventional restoration approaches. Encoder-decoder U-Net architectures dominate the field due to their hierarchical multi-scale representation and computational efficiency. Techniques such as skip connections, spatial and channel attention modules, and multi-stage progressive designs have further enhanced restoration effectiveness. Notable approaches include MPRNet with multi-stage progressive restoration and MIRNet with multi-scale residual blocks, both demonstrating the value of hierarchical feature processing for capturing details at multiple scales.

資料驅動的CNN 架構已大幅超越傳統修復方法。編碼器-解碼器 U-Net 架構因其階層式多尺度表示能力與計算效率而主導該領域。跳接連結、空間與通道注意力模組，以及多階段漸進式設計等技術進一步提升了修復效果。代表性方法包括採用多階段漸進式修復的 MPRNet，以及使用多尺度殘差區塊的 MIRNet，兩者均展示了階層式特徵處理在多尺度細節捕捉上的價值。

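Paragraph function: Literature review. Traces the evolution of CNN-based restoration methods.

Logical role: Establishes the technical lineage of CNN restoration, from U-Net to attention mechanisms to multi-stage designs, which justifies Restormer's inheritance of the encoder-decoder framework and multi-scale processing.

Argumentative technique / potential weakness: Singling out MPRNet and MIRNet, both earlier works by the same group of authors, establishes the continuity of their own research. This is academically reasonable, but readers should note that it may raise doubts about the fairness of the comparative experiments.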

Transformers, initially developed for sequence processing in NLP, have been adapted across vision tasks including image recognition, segmentation, and object detection. Vision Transformers (ViT) decompose images into patch sequences, learning mutual relationships and capturing long-range dependencies with strong input adaptability. Despite successful applications to low-level vision problems such as super-resolution, colorization, denoising, and deraining, the computational complexity remains prohibitive for high-resolution outputs. Recent methods reduce this complexity with strategies such as applying self-attention within local image regions, following the Swin Transformer design. However, this restricts context aggregation to local neighborhoods, which defeats the primary motivation for using self-attention over convolutions and is particularly ill-suited to image restoration, where global context awareness is required.

Transformer 最初為自然語言處理的序列處理而開發，後已被調適至影像辨識、分割與物件偵測等視覺任務。視覺 Transformer（ViT）將影像分解為圖塊序列，學習彼此間的關係並捕捉長距離依賴，具有強大的輸入適應性。儘管已成功應用於超解析度、著色、去噪與去雨等低階視覺問題，但計算複雜度仍使其在高解析度輸出上不可行。近期方法採用降低複雜度的策略，如使用 Swin Transformer 設計在局部影像區域內施加自注意力。然而，這將上下文聚合限制在局部鄰域內，與自注意力相對於摺積的核心動機相矛盾——對於需要全域上下文感知的影像修復尤為不適。

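Paragraph function: Critical positioning. Analyzes the state of Transformers in low-level vision and their fundamental contradiction.

Logical role: This paragraph is the central pillar of the paper's motivation: the local-window strategy of Swin Transformer solves the efficiency problem but sacrifices global modeling capability. This "contradiction" provides the most direct justification for Restormer's channel-dimension attention.

Argumentative technique / potential weakness: The "contradicts the primary motivation" rhetoric is forceful, but the shifted-window mechanism of Swin Transformer does pass information across window boundaries. The criticism targets a direct global receptive field rather than indirect global connectivity; stating this distinction more precisely would make the argument more rigorous.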

Several Transformer-based restoration methods have been proposed. IPT (Image Processing Transformer) uses a large-scale pre-trained Transformer but requires enormous computational resources with over 115M parameters. SwinIR applies Swin Transformer blocks with shifted window attention and achieves strong results in super-resolution and denoising, yet its local window attention limits the effective receptive field. Uformer introduces a U-shaped Transformer with LeWin attention, applying self-attention within non-overlapping windows. These methods demonstrate the potential of Transformers for restoration, but none achieves truly global self-attention at linear complexity on full-resolution feature maps.

已有多種基於 Transformer 的修復方法被提出。影像處理 Transformer（IPT）使用大規模預訓練 Transformer，但需要超過 1.15 億參數的龐大計算資源。SwinIR 應用具有移位窗口注意力的 Swin Transformer 區塊，在超解析度與去噪上取得優異結果，但其局部窗口注意力限制了有效感受野。Uformer 引入具 LeWin 注意力的 U 形 Transformer，在不重疊窗口內施加自注意力。這些方法展示了 Transformer 在修復領域的潛力，但沒有一個能在全解析度特徵圖上以線性複雜度實現真正的全域自注意力。

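Paragraph function: Comparison with competing methods. Lists the strengths and weaknesses of three representative Transformer restoration methods.

Logical role: By examining the limitations of IPT, SwinIR, and Uformer in turn, the paragraph narrows by elimination to the exact technical gap Restormer aims to fill: global attention with linear complexity on full-resolution feature maps.

Argumentative technique / potential weakness: Using three concrete methods as counterexamples turns abstract criticism into concrete comparison. But each method is charged with only one flaw, which may oversimplify its capabilities. For example, SwinIR can already cover a fairly large effective receptive field by stacking many layers.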

3. Method — 方法

Given a degraded image of dimensions H x W x 3, Restormer first applies a convolution to obtain low-level feature embeddings of dimension H x W x C. These features pass through a 4-level symmetric encoder-decoder, producing deep features of dimension H x W x 2C. The encoder hierarchically reduces spatial size while expanding channel capacity. The decoder progressively recovers high-resolution representations from low-resolution latent features. Feature downsampling and upsampling use pixel-unshuffle and pixel-shuffle operations respectively. Encoder features are concatenated with decoder features via skip connections, followed by a 1x1 convolution that reduces the channels at every level except the top one, which is why the deep features keep 2C channels. A refinement stage at the original resolution further enriches the deep features before a final convolution generates the residual image that is added to the degraded input.

給定一張維度為 H x W x 3 的退化影像，Restormer 首先以一層摺積提取維度為 H x W x C 的低階特徵嵌入。這些特徵通過一個四階對稱編碼器-解碼器，產生維度為 H x W x 2C 的深層特徵。編碼器以階層方式縮減空間尺寸並擴展通道容量，解碼器則從低解析度潛在特徵中漸進恢復高解析度表示。特徵的降取樣與升取樣分別使用 pixel-unshuffle 與 pixel-shuffle 運算。編碼器特徵透過跳接連結與解碼器特徵串接，再經 1x1 摺積以縮減通道數。原始解析度上的精煉階段進一步豐富深層特徵，最後由摺積生成殘差影像，加回退化輸入即得最終結果。

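Paragraph function: Architecture overview. Describes Restormer's overall encoder-decoder pipeline.

Logical role: Sets the macro framework for the discussion of the submodules that follows. The 4-level symmetric design inherits the successful U-Net paradigm, the use of pixel-shuffle/unshuffle helps avoid checkerboard artifacts, and the residual learning strategy lowers the learning difficulty.

Argumentative technique / potential weakness: The architecture borrows heavily from mature CNN restoration paradigms (U-Net, skip connections, residual learning), which is a pragmatic choice. But readers may ask: if the overall architecture is so similar to CNN methods, where exactly does the Transformer contribute? The answer lies in MDTA and GDFN, discussed next.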
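To make the encoder-decoder bookkeeping concrete, the following minimal sketch traces feature-map shapes through the four levels. It rests on assumptions that the overview above does not spell out: each downsampling step is taken to halve H and W and double the channels, and the 1x1 convolution after each skip concatenation is taken to halve the channels everywhere except at the top level. The function name `trace_shapes` and the example values H = W = 256, C = 48 are illustrative only.

```python
def trace_shapes(h, w, c, levels=4):
    """Return (encoder, decoder) lists of (height, width, channels) per level.

    Assumptions: every downsampling step halves H and W and doubles the
    channels; the 1x1 convolution after a skip concatenation halves the
    channels at every level except the top (full-resolution) one.
    """
    encoder = [(h, w, c)]
    for _ in range(levels - 1):
        eh, ew, ec = encoder[-1]
        encoder.append((eh // 2, ew // 2, ec * 2))

    decoder = []
    dh, dw, dc = encoder[-1]                    # latent features at the coarsest level
    for level in range(levels - 2, -1, -1):
        dh, dw, dc = dh * 2, dw * 2, dc // 2    # upsample: 2x resolution, half the channels
        dc += encoder[level][2]                 # concatenate the encoder skip features
        if level > 0:
            dc //= 2                            # 1x1 convolution reduces the channels
        decoder.append((dh, dw, dc))
    return encoder, decoder


encoder, decoder = trace_shapes(256, 256, 48)
for level, shape in enumerate(encoder, start=1):
    print(f"encoder level {level}: {shape}")
for level, shape in zip(range(3, 0, -1), decoder):
    print(f"decoder level {level}: {shape}")
```

Under these assumptions the last decoder entry has 2C channels at full resolution, which matches the H x W x 2C deep features described above.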

3.1 Multi-Dconv Head Transposed Attention (MDTA)

Conventional self-attention exhibits O(W²H²) time and memory complexity for W x H pixel images, rendering it infeasible for restoration tasks involving high-resolution imagery. MDTA achieves linear complexity by applying self-attention across channels rather than spatial dimensions, computing cross-covariance across channels to generate attention maps that encode global context implicitly. Instead of the massive HW x HW spatial attention maps in conventional self-attention, MDTA produces compact C x C transposed-attention maps, where C is the number of channels — typically orders of magnitude smaller than HW.

傳統自注意力對 W x H 像素影像具有 O(W²H²) 的時間與記憶體複雜度，使其在涉及高解析度影像的修復任務中不可行。MDTA 透過在通道維度而非空間維度上施加自注意力來達到線性複雜度，計算通道間的交叉共變異數以生成隱含編碼全域上下文的注意力圖。不同於傳統自注意力中巨大的 HW x HW 空間注意力圖，MDTA 產生緊湊的 C x C 轉置注意力圖，其中 C 為通道數——通常比 HW 小數個數量級。

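Paragraph function: Revealing the core innovation. Explains how MDTA moves attention from the spatial dimension to the channel dimension.

Logical role: This is the key technical contribution of the paper. Switching from the spatial dimension to the channel dimension reduces complexity from O(H²W²) to O(HWC), directly delivering the linear-complexity global attention promised in the introduction.

Argumentative technique / potential weakness: The efficiency argument that C is typically orders of magnitude smaller than HW is very persuasive. But channel attention essentially computes "which feature channels matter", not "which spatial positions are related", and this is a different kind of information from what conventional spatial self-attention captures. The authors claim that it "encodes global context implicitly", but this equivalence needs more rigorous theoretical or experimental validation.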

MDTA incorporates depth-wise convolutions to emphasize local context before feature covariance computation. From layer-normalized input tensors, MDTA generates query (Q), key (K), and value (V) projections enriched with local context through two stages: first, 1x1 convolutions aggregate pixel-wise cross-channel context; then, 3x3 depth-wise convolutions encode channel-wise spatial context. The multi-head structure divides channels into separate heads, each learning parallel attention maps independently. This design implicitly models contextualized global relationships between pixels while computing covariance-based attention maps, complementing convolutional strengths within the pipeline.

MDTA 結合深度摺積以在特徵共變異數計算前強調局部上下文。從經層正規化的輸入張量出發，MDTA 透過兩個階段生成富含局部上下文的查詢（Q）、鍵（K）、值（V）投影：首先以 1x1 摺積聚合逐像素的跨通道上下文，接著以 3x3 深度摺積編碼逐通道的空間上下文。多頭結構將通道分為獨立的頭，各自平行學習注意力圖。此設計在計算基於共變異數的注意力圖的同時，隱含地建模了像素間的上下文化全域關係，與管線中的摺積優勢相互補足。

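Paragraph function: Supplementary technical detail. Gives the full computation flow and design philosophy of MDTA.

Logical role: Deepens the technical description of MDTA. The key is the two-stage projection of a 1x1 convolution followed by a 3x3 depth-wise convolution, which gives Q, K, and V local spatial information before the attention computation and compensates for the lack of spatial awareness in pure channel attention.

Argumentative technique / potential weakness: The "complementing convolutional strengths" argument positions Transformers and CNNs as collaborators rather than competitors, softening possible objections from the community. Introducing depth-wise convolution essentially injects a spatial inductive bias into channel attention. This is a pragmatic design decision, though one that may reduce the model's purity as a "pure Transformer".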
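The MDTA data flow described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are random, the 1x1 convolutions are written as per-pixel channel-mixing matrices, layer normalization and the residual connection are left out, and the fixed `scale` argument merely stands in for whatever scaling is applied before the softmax. All function and parameter names are hypothetical.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Zero-padded 3x3 depth-wise convolution. x: (C, H, W), kernels: (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def mdta(x, w_qkv, dw_kernels, w_out, heads=1, scale=1.0):
    """Transposed (channel-wise) attention on x of shape (C, H, W).

    w_qkv: (3C, C) and w_out: (C, C) play the role of 1x1 convolutions;
    dw_kernels: (3C, 3, 3) are the depth-wise filters. `scale` is an assumption.
    """
    c, h, w = x.shape
    qkv = np.einsum("oc,chw->ohw", w_qkv, x)       # 1x1 conv: cross-channel context
    qkv = depthwise_conv3x3(qkv, dw_kernels)       # 3x3 depth-wise conv: local spatial context
    # Split into Q, K, V and divide the channels among the heads: (heads, C/heads, HW)
    q, k, v = (t.reshape(heads, c // heads, h * w) for t in np.split(qkv, 3, axis=0))
    attn = softmax(scale * (q @ k.transpose(0, 2, 1)))   # (heads, C/heads, C/heads)
    out = (attn @ v).reshape(c, h, w)
    return np.einsum("oc,chw->ohw", w_out, out), attn


rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
w_qkv = 0.1 * rng.standard_normal((3 * C, C))
dw_kernels = 0.1 * rng.standard_normal((3 * C, 3, 3))
w_out = 0.1 * rng.standard_normal((C, C))

y, attn = mdta(x, w_qkv, dw_kernels, w_out, heads=2)
print(y.shape, attn.shape)
```

The attention map has shape (heads, C/heads, C/heads) regardless of H and W, which is the property the text relies on: the spatial resolution enters only through the length of the dot products, not through the size of the map.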

The computational advantage of MDTA is significant. For an input feature map of spatial dimensions H x W and C channels, conventional self-attention produces attention maps of size HW x HW, leading to O(H²W²) complexity. In contrast, MDTA reshapes Q and K so that their dot product generates a C x C transposed-attention map, resulting in O(HWC) complexity, which is linear in the number of pixels. For a typical restoration setting with a 256 x 256 input and 48 channels at level 1, the scaling term drops from (HW)² ≈ 4.3 billion to HW·C ≈ 3.1 million, a factor of HW/C ≈ 1365, or over 1000x. Both figures leave out a common factor of C from the underlying matrix products, so the ratio is unaffected.

MDTA 的計算優勢相當顯著。對於空間維度為 H x W、具 C 個通道的輸入特徵圖，傳統自注意力產生大小為 HW x HW 的注意力圖，導致 O(H²W²) 的複雜度。相較之下，MDTA 重塑 Q 與 K 使得點積生成 C x C 的轉置注意力圖，複雜度為 O(HWC)——相對於像素數量呈線性。以典型的修復設定為例，256 x 256 輸入搭配第一階的 48 個通道，這意味著運算量從約 43 億次降至 310 萬次——超過 1000 倍的縮減。

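Paragraph function: Quantitative argument. Demonstrates MDTA's efficiency advantage with concrete numbers.

Logical role: Converts the theoretical description of the previous two paragraphs into quantifiable efficiency figures. The 1000x reduction is a striking number that directly answers the introduction's statement that quadratic complexity is infeasible.

Argumentative technique / potential weakness: Replacing abstract big-O notation with concrete numbers (4.3 billion vs. 3.1 million) greatly increases persuasiveness. Note, however, that the calculation covers only the generation of the attention map and excludes the extra cost of the Q, K, V projections and the depth-wise convolutions. A complete end-to-end efficiency comparison has to be verified in the experiments section.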
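The arithmetic behind the 256 x 256, 48-channel example can be checked with a few lines of Python. The sketch counts only the attention-map product, as the text does, and additionally reports multiply-accumulate counts for the Q-K product (an added illustration, not a figure from the paper). The function name `attention_costs` is hypothetical.

```python
def attention_costs(h, w, c):
    """Count scaling terms for spatial vs. transposed attention on an H x W x C map."""
    n = h * w                               # number of pixels
    return {
        "spatial_map_entries": n * n,       # HW x HW attention map
        "transposed_scaling_term": n * c,   # the HW*C term quoted in the text
        "transposed_map_entries": c * c,    # C x C attention map
        "spatial_macs": n * n * c,          # (HW x C) times (C x HW)
        "transposed_macs": n * c * c,       # (C x HW) times (HW x C)
    }


costs = attention_costs(256, 256, 48)
ratio = costs["spatial_macs"] / costs["transposed_macs"]
print(costs)
print(f"reduction factor HW/C = {ratio:.1f}")
```

Whether one counts the scaling terms or the full multiply-accumulates, the reduction factor is HW/C, so it grows with image size for a fixed channel count.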

3.2 Gated-Dconv Feed-Forward Network (GDFN)

Conventional feed-forward networks (FFN) in Transformers operate identically at each pixel location using two 1x1 convolutions: first expanding feature channels by a factor gamma (typically 4), then reducing to original dimensions, with non-linearity applied in hidden layers. GDFN introduces two fundamental modifications: a gating mechanism and the incorporation of depth-wise convolutions. The gating mechanism functions as the element-wise product of two parallel linear transformation paths, one activated with GELU non-linearity. This design controls information flow through hierarchical levels, allowing each level to focus on complementary fine details.

Transformer 中傳統的前饋網路（FFN）在每個像素位置上以相同方式運作，使用兩層 1x1 摺積：先將特徵通道擴展 gamma 倍（通常為 4），再縮減回原始維度，並在隱藏層施加非線性。GDFN 引入兩項根本性修改：門控機制與深度摺積的結合。門控機制以兩條平行線性轉換路徑的逐元素乘積實現，其中一條以 GELU 非線性啟動。此設計控制了資訊在階層式層級間的流動，使各層級得以聚焦於互補的精細細節。

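Paragraph function: Second core innovation. Explains the design of GDFN's gating mechanism.

Logical role: With MDTA having addressed attention efficiency, this paragraph turns to the Transformer's other core component, the feed-forward network. Gating gives the network the ability to pass information selectively, going beyond the undifferentiated feature transformation of a standard FFN.

Argumentative technique / potential weakness: Gating mechanisms (GLU variants) are already widely used in NLP, for example the GEGLU and SwiGLU feed-forward layers in T5 v1.1 and PaLM, so bringing them to image restoration is a reasonable transfer. But "controls information flow" is a vague formulation: it does not specify how gating learns complementary features at different hierarchical levels, and this needs support from ablation experiments.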
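A minimal NumPy sketch of the gated feed-forward idea follows: two parallel paths built from 1x1 and 3x3 depth-wise convolutions, one passed through GELU, combined by element-wise product. It is an illustration under assumptions rather than the authors' code: both paths are produced by a single widened 1x1 projection that is then split in two, GELU uses the common tanh approximation, biases are omitted, and layer normalization and the residual connection are left to the caller. All names are hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def depthwise_conv3x3(x, kernels):
    """Zero-padded 3x3 depth-wise convolution. x: (C, H, W), kernels: (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def gdfn(x, w_in, dw_kernels, w_out):
    """Gated feed-forward on x of shape (C, H, W).

    w_in: (2*hidden, C), dw_kernels: (2*hidden, 3, 3), w_out: (C, hidden).
    """
    y = np.einsum("oc,chw->ohw", w_in, x)        # 1x1 conv expands to both paths at once
    y = depthwise_conv3x3(y, dw_kernels)         # local spatial context on each path
    gate, content = np.split(y, 2, axis=0)       # two parallel paths
    y = gelu(gate) * content                     # element-wise gating
    return np.einsum("oc,chw->ohw", w_out, y)    # 1x1 conv back to C channels


rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
hidden = int(C * 2.66)
x = rng.standard_normal((C, H, W))
w_in = 0.1 * rng.standard_normal((2 * hidden, C))
dw_kernels = 0.1 * rng.standard_normal((2 * hidden, 3, 3))
w_out = 0.1 * rng.standard_normal((C, hidden))

out = gdfn(x, w_in, dw_kernels, w_out)
print(out.shape)
```

Because the two paths are multiplied, a feature is passed on only where both paths are non-zero; zeroing the weights of either path silences the whole block, which is one way to see what "suppressing less informative features" can mean in practice.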

Specifically, the input tensor is layer-normalized and then split into two parallel transformation paths. The first path applies a point-wise (1x1) convolution followed by a 3x3 depth-wise convolution and GELU activation. The second path applies the same point-wise and depth-wise convolutions without the activation. The two paths are combined by element-wise multiplication (gating), followed by a final point-wise convolution and a residual connection. The depth-wise convolutions encode information from spatially neighboring pixels, which helps the network learn the local image structure needed for effective restoration. Since GDFN performs more operations than a standard FFN, the expansion ratio is reduced (gamma = 2.66 instead of 4) to keep the parameter count and computational cost similar.

具體而言，GDFN 的計算流程如下：輸入張量先經層正規化，然後分為兩條平行轉換路徑。第一條路徑依序施加逐點（1x1）摺積、3x3 深度摺積與 GELU 啟動函數。第二條路徑施加類似的逐點與深度摺積但不含啟動函數。兩條路徑透過逐元素乘積（門控）結合，最後經逐點摺積與殘差連結。深度摺積編碼了空間鄰近像素的資訊，有助於學習局部影像結構以實現有效修復。由於 GDFN 比標準 FFN 執行更多運算，擴展比率從 4 降為 2.66 以維持相近的參數量與計算成本。

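Paragraph function: Computational detail. Fully describes GDFN's data flow and parameter settings.

Logical role: Shows careful engineering: while gating adds complexity, the overall compute budget is kept unchanged by lowering the expansion ratio. This "gain at no extra cost" framing makes the introduction of GDFN look like all benefit and no harm.

Argumentative technique / potential weakness: The change of gamma from 4 to 2.66 looks small, but it changes the representational capacity of the hidden layer. The authors claim that the parameter count is "similar", but ideally exact numbers should be provided. In addition, depth-wise convolution is used in both MDTA and GDFN, which highlights its central role in this architecture.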
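The claim that gamma = 2.66 keeps the parameter count close to that of a standard FFN with gamma = 4 can be checked with a rough count. The sketch below makes several assumptions: all convolutions are bias-free, the hidden width is `int(C * gamma)`, and the gated block uses one 1x1 projection to twice the hidden width, a 3x3 depth-wise convolution on those channels, and one 1x1 projection back. The function names are hypothetical, and the counts are not figures reported in the paper.

```python
def ffn_params(c, gamma=4.0):
    """Standard FFN: 1x1 conv C -> hidden, then 1x1 conv hidden -> C (no biases)."""
    hidden = int(c * gamma)
    return c * hidden + hidden * c


def gdfn_params(c, gamma=2.66):
    """Gated FFN: 1x1 conv C -> 2*hidden, 3x3 depth-wise conv, 1x1 conv hidden -> C."""
    hidden = int(c * gamma)
    project_in = c * 2 * hidden
    depthwise = 2 * hidden * 3 * 3
    project_out = hidden * c
    return project_in + depthwise + project_out


for channels in (48, 96, 192, 384):
    ffn, gated = ffn_params(channels), gdfn_params(channels)
    print(f"C={channels}: FFN={ffn}, GDFN={gated}, ratio={gated / ffn:.3f}")
```

Under these assumptions the gated block is dominated by 3·gamma·C² weights against 2·4·C² for the standard FFN, so the two are close, with the depth-wise term adding a small overhead that shrinks as C grows.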

3.3 Progressive Learning — 漸進式學習

CNN-based restoration models typically train on fixed-size image patches. However, Transformer models trained on small cropped patches may not encode global image statistics, yielding suboptimal performance on full-resolution test images. Progressive learning addresses this by training networks on smaller patches in early epochs and gradually larger patches in later epochs. This strategy resembles curriculum learning, where networks progress from simpler to more complex tasks requiring fine image structure and texture preservation.

基於 CNN 的修復模型通常在固定尺寸的影像圖塊上訓練。然而，在小型裁切圖塊上訓練的 Transformer 模型可能無法編碼全域影像統計量，導致在全解析度測試影像上表現欠佳。漸進式學習透過在早期訓練週期使用較小圖塊、在後期逐步增大圖塊尺寸來解決此問題。此策略類似課程式學習，網路從較簡單的任務逐步過渡到需要精細影像結構與紋理保存的複雜任務。

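Paragraph function: Training-strategy innovation. Proposes progressive learning to compensate for the shortcomings of fixed-patch training.

Logical role: Complements the architectural innovations of the two preceding sections by extending from model design to training method. Progressive learning responds to a challenge specific to Transformer training: the model needs a sufficiently large receptive field to exploit its global modeling advantage.

Argumentative technique / potential weakness: The curriculum-learning analogy makes the progressive training strategy appear theoretically grounded. But the strategy is not original to Restormer, since similar patch-size schedules have been used in various training methods. The authors' contribution lies more in systematically validating its benefit for Transformer-based restoration.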

Mixed-size patch training through progressive learning enhances performance at test time across different image resolutions, a common scenario in restoration tasks. Specifically, training begins with 128 x 128 patches and batch size 64, then advances through progressively larger patch-batch pairs: (160², 40), (192², 32), (256², 16), (320², 8), (384², 8) at designated iteration milestones [92K, 156K, 204K, 240K, 276K] across 300K total iterations. As patch sizes increase, batch sizes decrease to keep the cost per optimization step roughly constant: each pair works out to between about 0.8 and 1.2 million pixels per step.

透過漸進式學習的混合尺寸圖塊訓練，在測試時能增強不同影像解析度下的效能——這在修復任務中十分常見。具體而言，訓練從 128 x 128 圖塊搭配批次大小 64 開始，接著漸進至更大的圖塊-批次組合：(160², 40)、(192², 32)、(256², 16)、(320², 8)、(384², 8)，分別在總計 300K 次迭代中的第 92K、156K、204K、240K、276K 次切換。隨著圖塊尺寸增大，批次大小相應縮小，以維持每步最佳化的一致計算成本。

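Paragraph function: Concrete schedule. Details the implementation of progressive learning numerically.

Logical role: Turns progressive learning from a concept into a reproducible implementation. The exact values of the six patch-batch pairs reflect the authors' careful tuning of training efficiency.

Argumentative technique / potential weakness: A detailed hyperparameter schedule aids reproducibility, which matters greatly in the restoration field. But readers may ask whether the switch points were chosen by systematic search or are merely heuristic. Ablation experiments should verify the robustness of the schedule.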
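The schedule above is easy to express as a lookup table. The sketch below uses the patch-batch pairs and milestones quoted in the text; the function name `stage_at` and the convention that a new pair takes effect exactly at its milestone iteration are assumptions made for illustration.

```python
import bisect

MILESTONES = [92_000, 156_000, 204_000, 240_000, 276_000]
STAGES = [(128, 64), (160, 40), (192, 32), (256, 16), (320, 8), (384, 8)]  # (patch, batch)


def stage_at(iteration):
    """Return the (patch_size, batch_size) pair in effect at a given iteration."""
    return STAGES[bisect.bisect_right(MILESTONES, iteration)]


def pixels_per_step(patch, batch):
    """Number of pixels processed in one optimization step."""
    return patch * patch * batch


for patch, batch in STAGES:
    print(f"{patch}x{patch}, batch {batch}: {pixels_per_step(patch, batch):,} pixels/step")
```

The per-step pixel counts stay within roughly 0.8 to 1.2 million across all six stages, which is consistent with the stated aim of keeping the cost per step about constant while the patch size triples.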

4. Experiments — 實驗

Image Deraining. Restormer achieves consistent and significant performance gains across five deraining datasets compared to existing approaches. Compared to SPAIR, the previous best method, Restormer advances the state-of-the-art by 1.05 dB averaged across datasets, with individual gains reaching 2.06 dB on Rain100L. Visual results demonstrate that Restormer effectively removes rain streaks while preserving fine structural content that other methods tend to over-smooth.

影像去雨方面，Restormer 相較於現有方法，在五個去雨資料集上達到一致且顯著的效能增益。與先前最佳方法 SPAIR 相比，Restormer 將最先進水準推進了跨資料集平均 1.05 dB，在 Rain100L 上的個別增益更達 2.06 dB。視覺結果表明，Restormer 能有效去除雨紋，同時保存其他方法傾向過度平滑的精細結構內容。

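Paragraph function: First set of evidence. Presents quantitative and qualitative results on the deraining task.

Logical role: Deraining is the starting point of the experimental validation, and the consistent gains across five datasets strengthen the generalization claim. A maximum gain of 2.06 dB is a very significant advance in the restoration field.

Argumentative technique / potential weakness: Reporting the average gain and the maximum gain side by side conveys both robustness and peak performance. But the large 2.06 dB gain on Rain100L may indicate that the dataset is relatively easy or close to saturation, which would discount the significance of the absolute gain.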
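Because the results in this section are reported as PSNR gains in dB, it can help to translate a gain into the corresponding change in mean squared error. The sketch below uses the standard definition PSNR = 10·log10(peak² / MSE); the function names are hypothetical, and the conversion says nothing about perceptual quality.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR improves by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


for gain in (1.05, 2.06):
    ratio = mse_ratio_for_gain(gain)
    print(f"+{gain} dB PSNR -> MSE multiplied by {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")
```

Read this way, the average 1.05 dB deraining gain corresponds to roughly a fifth less squared error, and the 2.06 dB gain on Rain100L to well over a third less.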

Single-Image Motion Deblurring. Restormer outperforms all compared approaches on both synthetic (GoPro, HIDE) and real-world (RealBlur-R, RealBlur-J) datasets. Averaged across all datasets, Restormer provides a 0.47 dB boost over MIMO-UNet+ and 0.26 dB over the previous best MPRNet. Crucially, Restormer achieves this with 81% fewer FLOPs than MPRNet. Compared to the Transformer model IPT, Restormer shows 0.4 dB improvement with 4.4x fewer parameters and runs 29x faster. Despite being trained solely on the GoPro dataset, the method demonstrates strong generalization to other benchmarks.

單影像運動去模糊方面，Restormer 在合成（GoPro、HIDE）與真實世界（RealBlur-R、RealBlur-J）資料集上均優於所有對比方法。跨所有資料集平均，Restormer 比 MIMO-UNet+ 提升 0.47 dB，比先前最佳的 MPRNet 提升 0.26 dB。關鍵在於，Restormer 以比 MPRNet 少 81% 的浮點運算量達成此成果。與 Transformer 模型 IPT 相比，Restormer 以少 4.4 倍的參數量提升 0.4 dB，且運行速度快 29 倍。儘管僅在 GoPro 資料集上訓練，該方法仍展現出對其他基準的強泛化能力。

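Paragraph function: Second set of evidence. Shows the performance and efficiency advantages on the deblurring task.

Logical role: The argument here has more dimensions: beyond PSNR, it covers FLOPs, parameter count, and inference speed. The "81% fewer FLOPs" and "29x faster" figures directly answer the common objection that Transformers are inefficient.

Argumentative technique / potential weakness: The comparison with IPT is especially effective: better performance with 4.4x fewer parameters is strong evidence that brute-force model scaling is not the best strategy. But "trained solely on GoPro" is both a strength (strong generalization) and a weakness: with more training data the gain could be larger or smaller, and this is not explored.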

Image Denoising. For Gaussian denoising across synthetic benchmarks and noise levels (sigma = 15, 25, 50), Restormer achieves state-of-the-art under both single-model and noise-level-specific training. At challenging noise level sigma = 50 on Urban100, Restormer achieves 0.37 dB gain over DRUNet (CNN-based) and 0.31 dB over SwinIR (Transformer-based), with 3.14x fewer FLOPs and 13x faster inference than SwinIR. For real image denoising, Restormer is the only method surpassing 40 dB PSNR on both SIDD and DND datasets. On SIDD, it obtains 0.3 dB over MIRNet and 0.25 dB over Uformer.

影像去噪方面，在合成基準與各雜訊等級（sigma = 15、25、50）的高斯去噪上，Restormer 在單一模型與雜訊等級特定訓練兩種設定下均達到最先進水準。在具挑戰性的 Urban100 資料集 sigma = 50 條件下，Restormer 比基於 CNN 的 DRUNet 提升 0.37 dB，比基於 Transformer 的 SwinIR 提升 0.31 dB，同時以少 3.14 倍的浮點運算量與快 13 倍的推論速度達成。在真實影像去噪方面，Restormer 是唯一在 SIDD 與 DND 兩個資料集上均突破 40 dB PSNR 的方法。在 SIDD 上比 MIRNet 提升 0.3 dB，比 Uformer 提升 0.25 dB。

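Paragraph function: Third set of evidence. Uses the denoising task to show an across-the-board advantage over tasks and architecture types.

Logical role: Structurally, this paragraph "closes the net": beating both the best CNN (DRUNet) and the best Transformer (SwinIR) shows that Restormer surpasses both architecture families. The milestone phrasing "the only method surpassing 40 dB" further cements the state-of-the-art claim.

Argumentative technique / potential weakness: "The only method surpassing 40 dB" is a rhetorically powerful quantitative milestone. The direct efficiency comparison with SwinIR (3.14x fewer FLOPs, 13x faster) gives the strongest empirical support for MDTA's transposed-attention design. Note, however, that SwinIR was proposed as a general restoration model with a strong emphasis on super-resolution, so a denoising-only comparison may not be entirely fair.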

Ablation Studies. Systematic ablation experiments are conducted on Gaussian color denoising (Urban100, sigma = 50). The MDTA module provides 0.32 dB gain over the baseline, and removing depth-wise convolutions results in PSNR drops, confirming the importance of local context. The gating mechanism in GDFN yields 0.12 dB gain over conventional FFN, with depth-wise convolutions providing additional benefits, totaling 0.26 dB gain. The overall Transformer block improvements amount to 0.51 dB. Progressive learning provides better results than fixed patch training while maintaining similar training time. Under similar parameter and FLOP budgets, deep-narrow models outperform wide-shallow counterparts in accuracy.

消融研究在高斯彩色去噪（Urban100、sigma = 50）上進行系統性驗證。MDTA 模組相較基準提供 0.32 dB 的增益，移除深度摺積後 PSNR 下降，確認了局部上下文的重要性。GDFN 中的門控機制比傳統 FFN 提升 0.12 dB，深度摺積帶來額外效益，合計增益達 0.26 dB。整體 Transformer 區塊的改進合計 0.51 dB。漸進式學習在維持相近訓練時間的同時，比固定圖塊訓練取得更好的結果。在相近的參數量與浮點運算預算下，深窄模型在準確度上優於寬淺模型。

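Paragraph function: Component validation. Confirms the contribution of each design decision through ablation experiments.

Logical role: Ablation is a required part of a methods paper. This paragraph separates the individual benefits of the three contributions, MDTA (+0.32 dB) > GDFN (+0.26 dB) > progressive learning, and so establishes a clear ranking of contributions.

Argumentative technique / potential weakness: Stepwise ablation makes the marginal contribution of each component visible at a glance. But the ablation is run on a single task (denoising) and does not cover deraining or deblurring, and different tasks may depend on the components to different degrees. In addition, whether the total 0.51 dB improvement can be attributed entirely to architectural innovation rather than hyperparameter tuning should be interpreted with care.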

5. Conclusion — 結論

Restormer presents an efficient Transformer for high-resolution image restoration through carefully designed architectural innovations. Multi-Dconv head transposed attention (MDTA) models global context by applying self-attention across channels rather than spatial dimensions, achieving linear complexity versus the quadratic complexity of conventional self-attention. Gated-Dconv feed-forward networks (GDFN) introduce gating mechanisms for controlled feature transformation. Both modules incorporate depth-wise convolutions encoding spatially local context, complementing CNN strengths within the Transformer pipeline.

Restormer 透過精心設計的架構創新，提出了一種適用於高解析度影像修復的高效 Transformer。多深度摺積頭轉置注意力（MDTA）透過在通道維度而非空間維度上施加自注意力來建模全域上下文，相較於傳統自注意力的二次方複雜度，達到了線性複雜度。門控深度摺積前饋網路（GDFN）引入門控機制以進行受控的特徵轉換。兩個模組均結合了深度摺積編碼空間局部上下文，在 Transformer 管線中與 CNN 的優勢形成互補。

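Paragraph function: Method summary. Restates Restormer's core architectural innovations concisely.

Logical role: The first paragraph of the conclusion mirrors the structure of the abstract, restating the key designs of MDTA and GDFN in the most compact form and closing the loop of the argument. The "complementing CNN strengths" wording positions Restormer as a hybrid that combines the advantages of both architecture families.

Argumentative technique / potential weakness: Positioning the model as complementing CNN strengths is pragmatic and avoids the extreme claim that Transformers replace CNNs, which helps acceptance in the community. But the heavy use of depth-wise convolution also blurs the architectural identity: is this a Transformer or an enhanced CNN? The question remains debated in the community.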

Extensive experiments across 16 benchmark datasets demonstrate state-of-the-art performance for image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel), and image denoising (Gaussian and real-world). The model effectively removes degradations while preserving fine image structure and texture details across diverse restoration tasks. These results confirm that Transformer architectures, when equipped with appropriate computational efficiency mechanisms, represent a powerful paradigm for low-level vision tasks that have long been dominated by convolutional approaches.

在 16 個基準資料集上的廣泛實驗證實了該模型在影像去雨、單影像運動去模糊、散焦去模糊（單影像與雙像素）及影像去噪（高斯與真實世界）上的最先進效能。模型在多樣化的修復任務中有效去除退化同時保存精細的影像結構與紋理細節。這些結果確認了配備適當計算效率機制的 Transformer 架構，代表了低階視覺任務的一個強大範式——而這些任務長期以來一直由摺積方法主導。

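Paragraph function: Outlook and implications. Draws broader significance for the field from the experimental results.

Logical role: The closing paragraph lifts the specific Restormer results to a broader statement: Transformers are challenging the dominance of CNNs in low-level vision. This points the way for follow-up research and gives the paper value beyond a single method.

Argumentative technique / potential weakness: The repeated emphasis on "16 benchmark datasets" reaches its peak effect in the conclusion. But the authors do not discuss the limitations of the method, such as scalability to very high resolutions (4K/8K), applicability to other restoration tasks (e.g., super-resolution, compression artifact removal), and memory consumption at inference. For a top-venue paper, a more open discussion of limitations would strengthen credibility.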

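Argument structure overview

Problem: CNN receptive fields are limited; Transformer complexity is too high

→

Claim: Transposed attention along the channel dimension achieves global modeling at linear complexity

→

Evidence: 16 benchmark datasets; state of the art across four restoration tasks

→

Rebuttal: Depth-wise convolution supplies local information; progressive learning improves generalization

→

Conclusion: Efficient Transformers are a powerful paradigm for low-level vision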

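The authors' core claim (in one sentence)

By computing self-attention along the channel dimension rather than the spatial dimension, and combining it with depth-wise convolution and a gated feed-forward mechanism, a Transformer can reach state-of-the-art performance in high-resolution image restoration at linear complexity, surpassing both CNNs and existing Transformer methods across the board.

Strongest point of the argument

Dual validation of efficiency and performance: MDTA's channel-wise transposed attention not only reduces complexity from quadratic to linear in theory (a more than 1000x reduction in computation in the 256 x 256 example), but also empirically surpasses the best CNNs (MPRNet, DRUNet) and the best Transformers (SwinIR, IPT) on 16 benchmark datasets across four restoration tasks, with fewer parameters and faster inference. This overturns the received view that performance and efficiency cannot be had together.

Weakest point of the argument

The equivalence of channel attention and spatial attention is not rigorously argued: MDTA replaces the HW x HW spatial attention map with a C x C channel covariance matrix, and the authors claim that the latter "encodes global context implicitly", but the two capture fundamentally different information. Channel attention learns "which feature channels matter" rather than "which spatial positions are related". Depth-wise convolution does make up for part of the missing spatial awareness, but it also blurs the architectural boundary between a Transformer and an enhanced CNN.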