Non-local Neural Networks

Abstract

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, non-local models achieve competitive results on Kinetics and Charades datasets. In addition, non-local models are shown to improve object detection and pose estimation on COCO.

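Paragraph function: Overview of the whole paper, introducing the core innovation through the contrast between local and non-local operations.

Logical role: The abstract first diagnoses the problem (the locality limits of convolution and recurrence), then proposes the remedy (non-local operations), and finally backs it with empirical results on multiple tasks.

Argumentative technique / potential weakness: Grouping convolution and recurrence together as "local" operations is a neat unifying frame that makes the position of non-local operations clearer. However, a "weighted sum over all positions" costs O(N^2) in computation, and the abstract does not mention this scalability issue.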

1. Introduction

Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., speech, language), the dominant approach is recurrent operations, which process elements sequentially and propagate information through hidden states. For image data, long-range dependencies are modeled by stacking many convolutional layers, gradually increasing the receptive field. Both approaches have fundamental limitations: recurrent operations are sequential and thus computationally inefficient, and convolutional stacking requires many layers to cover large distances, making optimization difficult and multi-hop dependencies hard to capture.

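Paragraph function: Problem diagnosis, a systematic analysis of the fundamental limits of existing ways of capturing long-range dependencies.

Logical role: Starting point of the argument. It establishes the premise that long-range dependencies matter, then argues from two angles (recurrence and convolution) that existing tools fall short.

Argumentative technique / potential weakness: Treating the limits of recurrence and convolution side by side shows a cross-domain, integrative view. However, methods such as dilated convolution can already enlarge the receptive field with fewer layers, so this passage may underestimate the flexibility of convolutional approaches.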
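
The point about stacking depth versus dilation can be made concrete with a small calculation. The sketch below assumes stride-1 convolutions and measures the receptive field along one axis; the helper name receptive_field and the layer configurations are illustrative choices, not taken from the paper.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field along one axis for a stack of stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    field = 1
    for k, d in zip(kernel_sizes, dilations):
        field += (k - 1) * d
    return field


# Ten plain 3-wide layers versus four dilated 3-wide layers (hypothetical setups).
plain = receptive_field([3] * 10)
dilated = receptive_field([3] * 4, dilations=[1, 2, 4, 8])
print(plain, dilated)
```

Under these assumptions the plain stack grows linearly with depth, while the dilated stack covers a wider span with fewer layers. A non-local operation, by contrast, relates every pair of positions in a single step.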

We propose non-local operations as an efficient, simple, and generic solution. A non-local operation computes the response at a position as a weighted sum of features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, making the formulation applicable to image, sequence, and video problems alike. Our non-local operations are related to self-attention mechanisms recently explored in machine translation, and we show that the embedded Gaussian instantiation of our non-local operation is equivalent to the self-attention form used in Transformers.

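Paragraph function: Proposal of the solution, outlining the core idea of non-local operations and their cross-domain applicability.

Logical role: Key turning point. It links a classical computer vision method (non-local means) with an emerging NLP method (self-attention), building a theoretical bridge across domains.

Argumentative technique / potential weakness: Pointing out the equivalence with Transformer self-attention is a highly influential observation that foreshadows the later rise of the Vision Transformer. The argument here has both theoretical depth and practical value.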

2. Related Work

The classical non-local means algorithm computes a filtered output as a weighted average of all pixels, where weights depend on patch similarity. This idea was later developed into BM3D and other non-local denoising methods. In the deep learning era, self-attention in NLP computes attention weights across all positions in a sequence. Relation networks model interactions between entities. Our work bridges these traditions by formulating non-local operations as a generic neural network building block applicable beyond any single domain.

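Paragraph function: Academic lineage, tracing the theoretical origins of non-local operations.

Logical role: It sets up a three-step line of intellectual descent: traditional image processing -> self-attention in deep learning -> the generic framework of this paper. This gives the method both historical grounding and a claim to novelty.

Argumentative technique / potential weakness: Unifying a classical computer vision method and a frontier NLP method under one framework shows cross-domain vision and integrative ability. It could also be read as "old wine in a new bottle": are non-local means and self-attention really the same thing at heart?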
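
The patch-similarity weighting described for non-local means can be sketched on a 1D signal. This is a simplified illustration rather than the published algorithm: it compares every pair of positions with no search window, clamps patches at the borders, and uses names (non_local_means_1d, patch_radius, h) chosen here for illustration only.

```python
import math


def non_local_means_1d(signal, patch_radius=1, h=0.5):
    """Simplified 1D non-local means: each output is a weighted average of ALL samples.

    The weight between positions i and j decays with the squared distance
    between the patches centred on i and j; h controls how fast it decays.
    """
    n = len(signal)

    def patch(i):
        # Clamp indices so border positions still get a full-length patch.
        return [signal[min(max(i + k, 0), n - 1)]
                for k in range(-patch_radius, patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    output = []
    for i in range(n):
        weights = []
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(patches[i], patches[j]))
            weights.append(math.exp(-dist2 / (h * h)))
        total = sum(weights)
        output.append(sum(w * s for w, s in zip(weights, signal)) / total)
    return output


# A noisy step signal (made-up values for illustration).
noisy = [0.0, 0.1, -0.1, 0.0, 1.0, 0.9, 1.1, 1.0]
denoised = non_local_means_1d(noisy)
print([round(v, 3) for v in denoised])
```

The structure is the same "weighted sum over all positions" that the paper generalizes, except that here the similarity is fixed by hand rather than learned.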

3. Non-local Operations

3.1 Formulation

A generic non-local operation is defined as: y_i = (1/C(x)) * sum_j f(x_i, x_j) * g(x_j), where i is the index of an output position, j enumerates all possible positions, f computes a scalar pairwise affinity between positions i and j, g computes a representation of the input at position j, and C(x) is a normalization factor. The key property is that the operation considers all positions (j) regardless of their distance from position i, in contrast to convolution which only operates on a local neighborhood.

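Paragraph function: Mathematical definition, giving the formal formula of the non-local operation.

Logical role: This is the mathematical foundation of the paper. The generality of the formula (f and g can be chosen freely) makes the non-local operation a framework rather than a single method.

Argumentative technique / potential weakness: The formula is impressively concise: a single line defines the core concept. But summing over "all positions" implies O(N^2) time and space complexity, which may be infeasible for high-resolution images or long sequences.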
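
The generic formula y_i = (1/C(x)) * sum_j f(x_i, x_j) * g(x_j) can be written almost literally in code. The following is a minimal pure-Python sketch, assuming positions are stored as a flat list of feature vectors; the function names (non_local, gaussian, dot) and the toy input are illustrative and not from the paper.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def gaussian(x_i, x_j):
    # One possible pairwise function f: exp of the dot product.
    return math.exp(dot(x_i, x_j))


def non_local(x, f, g, normalizer):
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) for every position i.

    x          : list of N feature vectors (lists of floats)
    f          : pairwise function returning a scalar affinity
    g          : unary function returning a representation of x_j
    normalizer : maps the N affinities of one position i to the factor C(x)
    """
    values = [g(x_j) for x_j in x]
    dim = len(values[0])
    output = []
    for x_i in x:
        # Affinities to ALL positions j, regardless of distance from i.
        weights = [f(x_i, x_j) for x_j in x]
        c = normalizer(weights)
        y_i = [sum(w * v[d] for w, v in zip(weights, values)) / c
               for d in range(dim)]
        output.append(y_i)
    return output


# Toy input: three positions with two channels each (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = non_local(x, f=gaussian, g=lambda v: v, normalizer=sum)
print(y)
```

The nested loop over i and j makes the quadratic cost visible: every output position touches every input position.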

3.2 Instantiations

The paper explores several instantiations of the pairwise function f: (1) Gaussian: f(x_i, x_j) = exp(x_i^T * x_j), which with C(x) = sum_j f(x_i, x_j) amounts to a softmax over j; (2) Embedded Gaussian: f(x_i, x_j) = exp(theta(x_i)^T * phi(x_j)), where theta and phi are learned embeddings; with the same normalization, this form matches the self-attention mechanism in Transformers; (3) Dot product: f(x_i, x_j) = theta(x_i)^T * phi(x_j), normalized by C(x) = N, the number of positions. Surprisingly, all three instantiations achieve comparable results, suggesting that the generic non-local behavior is the key factor rather than the specific choice of f.

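Paragraph function: Variant analysis, showing the flexibility and robustness of the framework.

Logical role: The empirical finding that the three instantiations perform comparably shifts the focus from "which f to choose" to "the value of the non-local mechanism itself", raising the level of the argument.

Argumentative technique / potential weakness: "The embedded Gaussian is equivalent to self-attention" is the key cross-domain bridge and carries great academic influence. The consistency of the three variants suggests that the benefit comes from the structural property rather than the choice of parameterization, but it may also mean the framework contains redundant design space.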
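
The correspondence between the embedded Gaussian form and softmax attention can be checked numerically. The sketch below assumes C(x) is the sum of f over j, and uses small hand-picked matrices (W_THETA, W_PHI) as hypothetical stand-ins for the learned embeddings theta and phi; all names and values are illustrative.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def matvec(w, v):
    return [dot(row, v) for row in w]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embedding weights standing in for learned theta and phi.
W_THETA = [[0.5, -0.2], [0.1, 0.3]]
W_PHI = [[0.4, 0.0], [-0.3, 0.6]]


def f_gaussian(x_i, x_j):
    return math.exp(dot(x_i, x_j))


def f_dot_product(x_i, x_j):
    return dot(matvec(W_THETA, x_i), matvec(W_PHI, x_j))


def f_embedded_gaussian(x_i, x_j):
    return math.exp(f_dot_product(x_i, x_j))


def normalized_weights(x, i, f):
    # Weights f(x_i, x_j) / C(x) with C(x) = sum_j f(x_i, x_j).
    raw = [f(x[i], x_j) for x_j in x]
    c = sum(raw)
    return [r / c for r in raw]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
via_f = normalized_weights(x, 0, f_embedded_gaussian)
via_softmax = softmax([f_dot_product(x[0], x_j) for x_j in x])
print(via_f)
print(via_softmax)
```

Both lists agree up to floating-point error, which is the sense in which the embedded Gaussian with sum normalization is a softmax over query-key dot products.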

3.3 Non-local Block

The non-local operation is wrapped into a non-local block that can be inserted into existing architectures: z_i = W_z * y_i + x_i. The residual connection (+x_i) allows inserting the block into pre-trained models without breaking initial behavior: when W_z is initialized to zero, the block starts out as an identity mapping. The implementation uses a bottleneck design (halving the number of channels) and spatial subsampling of the pairwise computation for efficiency. This design makes the block model-agnostic: it can be plugged into ConvNets, recurrent networks, or hybrid architectures with minimal modification.

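Paragraph function: Engineering design, turning the theoretical operation into a module that can actually be deployed.

Logical role: The residual connection ensures "harmless insertion", and the bottleneck design controls the computational cost. These two engineering decisions upgrade the non-local block from a theoretical concept to a practical tool.

Argumentative technique / potential weakness: The "model-agnostic" claim maximizes the block's range of application. The residual connection is a simple but crucial design choice: it makes inserting the block very low-risk and greatly lowers the barrier to adoption.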
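
A minimal sketch of the block z_i = W_z * y_i + x_i, assuming the embedded Gaussian form with softmax normalization and a bottleneck that halves the channels. It omits spatial subsampling and any normalization layers, represents the 1x1 projections as plain matrices, and uses illustrative names (non_local_block, w_theta, w_phi, w_g, w_z) and random toy weights.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def matvec(w, v):
    return [dot(row, v) for row in w]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Apply z_i = W_z * y_i + x_i to a list of N feature vectors."""
    theta = [matvec(w_theta, v) for v in x]  # C -> C/2 (bottleneck)
    phi = [matvec(w_phi, v) for v in x]      # C -> C/2
    g = [matvec(w_g, v) for v in x]          # C -> C/2
    z = []
    for i, x_i in enumerate(x):
        attn = softmax([dot(theta[i], p) for p in phi])
        y_i = [sum(a * gv[d] for a, gv in zip(attn, g))
               for d in range(len(g[0]))]
        projected = matvec(w_z, y_i)         # C/2 -> C
        z.append([p + xi for p, xi in zip(projected, x_i)])  # residual
    return z


rng = random.Random(0)
channels, bottleneck = 4, 2
x = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(5)]
w_theta = random_matrix(bottleneck, channels, rng)
w_phi = random_matrix(bottleneck, channels, rng)
w_g = random_matrix(bottleneck, channels, rng)

# With W_z set to zero the block reduces to the identity mapping.
w_z_zero = [[0.0] * bottleneck for _ in range(channels)]
z_zero = non_local_block(x, w_theta, w_phi, w_g, w_z_zero)

# With a non-zero W_z the block adds a non-local correction to x.
w_z_rand = random_matrix(channels, bottleneck, rng)
z_rand = non_local_block(x, w_theta, w_phi, w_g, w_z_rand)
print(z_zero == x, z_rand != x)
```

The zero-initialized case shows why the block can be dropped into a pre-trained network: at the start of fine-tuning its output equals its input, and the non-local term is learned from there.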

4. Experiments

Video classification: On Kinetics, a non-local ResNet-101 I3D model achieves 77.7% top-1 accuracy, compared to 73.1% for the baseline ResNet-101, an improvement of 4.6 percentage points with only a 1.2x increase in FLOPs. Non-local blocks are shown to be complementary to both 3D convolutions and deeper architectures: adding non-local blocks to a ResNet-50 with 3D conv provides gains on top of the already-improved 3D baseline. On Charades, the non-local model achieves state-of-the-art results.

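Paragraph function: Core experiment, using concrete video classification numbers to show the effect of non-local blocks.

Logical role: A gain of 4.6 points paired with only a 1.2x increase in computation presents an excellent cost-benefit ratio. The demonstration of complementarity further supports the claim that non-local operations capture information that convolutions cannot.

Argumentative technique / potential weakness: Using the compute cost (1.2x) to back up the efficiency of the gain is a very persuasive presentation. But Kinetics is a dataset of relatively short clips, and non-local operations may behave differently on longer videos.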

Object detection and pose estimation: On COCO, adding a single non-local block to a Mask R-CNN baseline leads to an increase of about 1 point in AP for both object detection and instance segmentation. For keypoint detection, the improvement is +1.4 AP when non-local blocks are added to both the backbone and the head. These results demonstrate that non-local operations provide benefits beyond video understanding, improving spatial reasoning in static image tasks as well.

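Paragraph function: Cross-task validation, extending the benefit of non-local blocks to image understanding.

Logical role: The COCO results support the "generic building block" claim: non-local blocks suit not only temporal tasks but also spatial reasoning.

Argumentative technique / potential weakness: Consistent improvement across tasks strengthens the generality claim. But a gain of about 1 AP is modest, and statistical significance needs to be considered. The key point is that these gains are "free" (only one block has to be inserted), which gives them practical value beyond the numbers themselves.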

5. Conclusion

We have presented non-local operations as a generic, flexible, and efficient building block for capturing long-range dependencies in deep neural networks. Our approach generalizes the classical non-local means in computer vision to a learnable, neural network module. We have demonstrated that the embedded Gaussian form is equivalent to self-attention, connecting our work to the Transformer architecture. Experiments on video classification, object detection, instance segmentation, and pose estimation consistently show improvements, validating the broad applicability of non-local operations across tasks and modalities.

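Paragraph function: Summary of the whole paper, restating the core contributions and the cross-domain connection.

Logical role: The conclusion echoes the abstract and closes the loop: it starts from "the limits of local operations" and ends with "the broad benefits of non-local operations". The link to the Transformer hints at the great potential of this direction.

Argumentative technique / potential weakness: The conclusion does not discuss the computational bottleneck and scalability limits of non-local operations. With the later development of the Vision Transformer, the paper's core insight proved highly prescient, but the O(N^2) computational complexity remains an open challenge.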

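Overview of the argument structure

Problem: Convolutional and recurrent operations capture only local dependencies
→ Claim: Non-local operations directly model the relations among all positions
→ Evidence: Kinetics +4.6 points, COCO about +1 AP
→ Rebuttal: O(N^2) compute bottleneck, mitigated by the bottleneck design and subsampling
→ Conclusion: A generic building block with consistent gains across tasks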

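Authors' core claim (in one sentence)

Non-local operations, as a generic neural network building block, can directly model feature dependencies at arbitrary distances and bring consistent performance gains across tasks including video classification, object detection, and pose estimation.

Strongest point of the argument

A unified cross-domain framework: non-local means from computer vision, self-attention from NLP, and relational reasoning are brought together under one concise mathematical formula. The demonstrated equivalence between the embedded Gaussian and Transformer self-attention laid the conceptual groundwork for the later development of the Vision Transformer.

Weakest point of the argument

Sidestepping computational scalability: the O(N^2) complexity can become a practical bottleneck on high-resolution images or long videos, and the paper offers only the bottleneck design and subsampling as mitigation, without a fundamental solution. In addition, the gain of about 1 AP on COCO is small, and its statistical significance is not adequately discussed.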
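
The scale of the O(N^2) concern, and what subsampling buys, can be estimated with simple counting. The sketch below assumes a T x H x W feature map and models subsampling as spatial pooling applied only to the positions j being attended to; the feature map size and the helper name pairwise_entries are illustrative assumptions.

```python
def pairwise_entries(t, h, w, pool=1):
    """Count the affinities f(x_i, x_j) for a T x H x W feature map.

    pool is the spatial pooling stride applied to the attended positions j
    only; the output positions i are left at full resolution.
    """
    n_query = t * h * w
    n_key = t * (h // pool) * (w // pool)
    return n_query * n_key


# Hypothetical feature map: 4 frames of 14 x 14 positions.
full = pairwise_entries(4, 14, 14)
pooled = pairwise_entries(4, 14, 14, pool=2)
print(full, pooled, full // pooled)

# Doubling the spatial resolution multiplies the full count by 16.
doubled = pairwise_entries(4, 28, 28)
print(doubled // full)
```

Under these assumptions, pooling by 2 in each spatial dimension cuts the pairwise work to a quarter, but the count still grows quadratically with the number of positions, which is why the mitigation does not remove the bottleneck.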