A Closer Look at Spatiotemporal Convolutions for Action Recognition

Abstract

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 3D CNNs with 3D convolution kernels outperform 2D CNNs when used in residual learning frameworks. We show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly better accuracy than full 3D convolution. Specifically, our proposed R(2+1)D block decomposes a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. This decomposition doubles the number of nonlinearities in the network, which we argue is a key factor behind the improved performance. We achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.

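Paragraph function: Overview of the whole paper, leading from a systematic comparison of spatiotemporal convolutions to the core idea of R(2+1)D and its experimental results.

Logical role: The abstract's argument is clearly structured: observation (3D beats 2D) -> core claim (factorization is better) -> mechanism (doubled nonlinearities) -> empirical support (four benchmarks).

Argumentative technique / potential weakness: The causal claim that doubled nonlinearities are "a key factor" is very bold. The paper supports it with experiments rather than a theoretical proof, and other factors (such as gradient flow) may be equally important.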

1. Introduction

Video understanding requires models that can capture both spatial appearance and temporal dynamics. Early deep learning approaches for video applied 2D CNNs to individual frames and aggregated temporal information through pooling or recurrent networks, which limits the ability to model fine-grained temporal patterns. 3D convolutions (C3D) directly process spatiotemporal volumes but are computationally expensive and harder to train. Recent work has shown mixed results: some studies find 3D convolutions offer no advantage over 2D, while others demonstrate clear benefits. We aim to systematically study the design space of spatiotemporal convolutions to resolve these contradictions.

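Paragraph function: Establishes the research territory by pointing to the 2D vs. 3D convolution debate in video understanding.

Logical role: Uses contradictory results in the literature as the research motivation. When the community lacks consensus on a basic design choice, a systematic comparative study is highly valuable.

Argumentative technique / potential weakness: Attributing the contradictory results of earlier work to differences in design choices is a reasonable strategy, but those contradictions may also reflect different datasets and evaluation protocols rather than purely architectural differences.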

Two-stream networks process RGB frames and optical flow separately, achieving strong results but requiring pre-computed optical flow. C3D proposed using 3D convolutions with 3x3x3 kernels for spatiotemporal feature learning, but was limited by the VGG-style architecture without residual connections. I3D inflated successful 2D architectures (Inception) to 3D and achieved strong results on Kinetics, but the question of how to optimally structure spatiotemporal convolutions within modern residual architectures remains open. P3D and S3D explored various decomposition strategies but did not provide a comprehensive comparative analysis.

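Paragraph function: Literature review tracing the technical evolution of video understanding from two-stream networks to 3D convolutions.

Logical role: Builds the case that a systematic comparison is needed: each earlier method explored only one corner of the design space, and a fair comparison under a unified framework was missing.

Argumentative technique / potential weakness: Positioning I3D's "architecture inflation" as distinct from this paper's "convolution factorization" effectively separates the two research directions. However, I3D uses an Inception backbone rather than ResNet, so a direct comparison requires careful control of confounding variables.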

3. Spatiotemporal Convolution Forms

We consider five forms of spatiotemporal convolutions within a ResNet-based architecture. R2D: uses only 2D convolutions (1x3x3 kernels), treating each frame independently. R3D: uses full 3D convolutions (3x3x3 kernels) throughout. MC_x: uses 3D convolutions in early layers and 2D in later layers (mixed convolutions). rMC_x: the reverse — 2D in early layers, 3D in later layers. R(2+1)D: decomposes every 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. All five architectures have approximately the same number of parameters, enabling fair comparison.

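Paragraph function: Systematic enumeration that defines the full design space for the comparative study.

Logical role: This paragraph is the methodological backbone: the five architectures cover the main configurations of spatiotemporal convolution, and controlling the parameter count is the key guarantee of experimental fairness.

Argumentative technique / potential weakness: Controlling the parameter count is rigorous experimental design. However, equal parameter counts do not imply equal computation (FLOPs) or equal memory footprint, and the R(2+1)D factorization may change the efficiency characteristics of the computation graph.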
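
The five forms listed in this section can be sketched as a per-group kernel layout. This is a minimal illustration, not the authors' code. The text only says "early" versus "later" layers, so three things here are assumptions made for the sketch: the count of five convolutional groups, the convention that `x` is the first group that switches kernel type, and the helper name `layout`. The temporal kernel length of 3 for R(2+1)D is likewise assumed to mirror the 3x3x3 case.

```python
from typing import List, Tuple

Kernel = Tuple[int, int, int]  # (time, height, width)

K2D: Kernel = (1, 3, 3)  # spatial only: each frame is processed independently
K3D: Kernel = (3, 3, 3)  # full spatiotemporal kernel
K1D: Kernel = (3, 1, 1)  # temporal only (assumed length 3)

NUM_GROUPS = 5  # assumption: five convolutional groups


def layout(variant: str, x: int = 3) -> List[List[Kernel]]:
    """Return, for each group, the kernels that stand in for one convolution.

    `x` is an assumed convention: the first group whose kernel type switches.
    """
    groups: List[List[Kernel]] = []
    for g in range(1, NUM_GROUPS + 1):
        if variant == "R2D":
            groups.append([K2D])
        elif variant == "R3D":
            groups.append([K3D])
        elif variant == "MC":  # 3D in early groups, 2D from group x onward
            groups.append([K3D] if g < x else [K2D])
        elif variant == "rMC":  # 2D in early groups, 3D from group x onward
            groups.append([K2D] if g < x else [K3D])
        elif variant == "R(2+1)D":  # every 3D conv becomes 2D spatial + 1D temporal
            groups.append([K2D, K1D])
        else:
            raise ValueError(f"unknown variant: {variant}")
    return groups


if __name__ == "__main__":
    for name in ("R2D", "R3D", "MC", "rMC", "R(2+1)D"):
        print(name, layout(name))
```

Under this convention, `layout("MC", 3)` and `layout("rMC", 3)` are mirror images of each other, which is the sense in which rMC_x is "the reverse" of MC_x.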

3.2 R(2+1)D Decomposition

The R(2+1)D block replaces a full 3D convolution of size t x d x d with a spatial 2D convolution of size 1 x d x d followed by a temporal 1D convolution of size t x 1 x 1. Between the two operations, a ReLU nonlinearity and batch normalization are inserted. This decomposition has two advantages. First, it doubles the number of nonlinear activations in the network, which increases the complexity of functions that can be represented. Second, it facilitates the optimization: learning spatial and temporal features separately is easier than learning them jointly, as the error surface becomes smoother with the decomposition. The intermediate feature dimension M_i is chosen to match the parameter count of the equivalent full 3D convolution.

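Paragraph function: The core innovation, detailing the mechanism of the R(2+1)D decomposition and its theoretical rationale.

Logical role: The technical core of the paper: two independent arguments, doubled nonlinearities and easier optimization, support the advantage of the decomposition. The choice of the intermediate dimension ensures a fair comparison with R3D.

Argumentative technique / potential weakness: The "doubled nonlinearities" argument is concise and forceful, but if it were the main cause, any strategy that adds nonlinearities should be equally effective, so why must it be a spatial-temporal decomposition? The claim that the error surface becomes smoother lacks a mathematical proof and is a conjecture based on experimental observation.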
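
To make the parameter-matching constraint concrete, the sketch below solves for the intermediate width. A full t x d x d convolution from N_in to N_out channels has t·d²·N_in·N_out weights, while the factorized pair has M·(d²·N_in + t·N_out). Setting the two equal and rounding down gives M = floor(t·d²·N_in·N_out / (d²·N_in + t·N_out)). The section does not spell this formula out, so treat it as a derivation under two assumptions: bias and batch normalization parameters are ignored, and the function names are hypothetical.

```python
def mid_channels(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Intermediate width M that (approximately) matches the 3D parameter count."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Weights in one full t x d x d convolution (biases ignored)."""
    return n_in * n_out * t * d * d


def params_2plus1d(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    """Weights in a 1 x d x d conv to M channels followed by a t x 1 x 1 conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d  # 2D spatial convolution
    temporal = m * n_out * t    # 1D temporal convolution
    return spatial + temporal


if __name__ == "__main__":
    for n_in, n_out in [(64, 64), (64, 128), (256, 512)]:
        print(n_in, n_out, mid_channels(n_in, n_out),
              params_3d(n_in, n_out), params_2plus1d(n_in, n_out))
```

For 64 input and 64 output channels with 3x3x3 kernels, M works out to 144 and the two counts agree exactly at 110,592 weights. When the division is not exact, the floor leaves the factorized block slightly smaller than its 3D counterpart.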

4. Experiments

We evaluate all five architectures on Sports-1M, Kinetics, UCF101, and HMDB51 using clip-level and video-level accuracy. On Kinetics with an 18-layer ResNet, R(2+1)D achieves a clip-level accuracy of 72.0%, outperforming R3D (71.1%), MC3 (69.6%), rMC3 (69.0%), and R2D (64.8%). The advantage of R(2+1)D over R3D is consistent across depths: with a 34-layer ResNet, R(2+1)D reaches 74.3% vs. 72.7% for R3D. When pre-trained on Sports-1M and fine-tuned, R(2+1)D achieves 97.3% on UCF101 and 78.7% on HMDB51, competitive with the state-of-the-art I3D results of 98.0% and 80.7% (I3D uses additional optical flow input).

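Paragraph function: Comprehensive quantitative comparison that validates the consistent advantage of R(2+1)D across multiple benchmarks.

Logical role: The empirical pillar covers three dimensions: (1) a fair comparison of the five architectures; (2) a consistency check across network depths; (3) a cross-method comparison with external state-of-the-art approaches.

Argumentative technique / potential weakness: The comparison with I3D needs care: I3D uses an Inception backbone and an additional optical flow stream, so the comparison is not entirely fair. R(2+1)D trails by 0.7 percentage points on UCF101 and 2.0 on HMDB51, but not needing optical flow is a practical advantage.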
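
The distinction between clip-level and video-level accuracy can be illustrated with a small sketch. The text does not say how clip predictions are combined into a video prediction, so averaging per-clip class scores is an assumption here. The function names and the toy scores in the usage note are hypothetical and are not results from the paper.

```python
from typing import List, Sequence

Scores = Sequence[float]  # one score per class for a single clip


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])


def video_prediction(clips: Sequence[Scores]) -> int:
    """Average the class scores over a video's clips, then take the top class."""
    num_classes = len(clips[0])
    mean = [sum(c[k] for c in clips) / len(clips) for k in range(num_classes)]
    return argmax(mean)


def clip_accuracy(videos: Sequence[Sequence[Scores]], labels: Sequence[int]) -> float:
    """Fraction of individual clips whose top class matches the video label."""
    hits: List[bool] = [
        argmax(clip) == label
        for clips, label in zip(videos, labels)
        for clip in clips
    ]
    return sum(hits) / len(hits)


def video_accuracy(videos: Sequence[Sequence[Scores]], labels: Sequence[int]) -> float:
    """Fraction of videos whose averaged prediction matches the label."""
    hits = [video_prediction(clips) == label for clips, label in zip(videos, labels)]
    return sum(hits) / len(hits)
```

As a toy case, take one video with true label 1 and three clips scored `[0.6, 0.4]`, `[0.1, 0.9]`, `[0.55, 0.45]`. Only one clip of three is classified correctly, yet the averaged scores favor class 1, so the video is counted as correct. This is why video-level accuracy is typically higher than clip-level accuracy.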

We provide training curve analysis comparing R3D and R(2+1)D. The R(2+1)D model consistently shows lower training loss throughout the training process, suggesting that the decomposition indeed facilitates optimization. Furthermore, the gap between training and validation loss is smaller for R(2+1)D, indicating better generalization rather than merely better fitting to the training data. This supports our hypothesis that the factorized architecture leads to a smoother optimization landscape.

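Paragraph function: Mechanism validation that uses training dynamics to support the theoretical hypothesis.

Logical role: This paragraph reinforces the causal argument that the decomposition helps optimization: not only is the final result better, the training process itself is smoother, which rules out the alternative explanation that the model is better merely by chance because its parameter space differs.

Argumentative technique / potential weakness: The training curve analysis is persuasive supplementary evidence, but a "smoother optimization landscape" remains an indirect inference. Directly visualizing the loss landscape (for example with the method of Li et al.) would provide stronger support.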

5. Conclusion

We have presented a systematic study of spatiotemporal convolutions for video action recognition. Our key finding is that R(2+1)D — decomposing 3D convolutions into separate spatial and temporal components — consistently outperforms full 3D convolutions across multiple datasets and network depths. We attribute this to the increased number of nonlinearities and the simplified optimization landscape. The R(2+1)D decomposition is a simple drop-in replacement that can benefit any architecture using 3D convolutions, making it a practical and widely applicable contribution to video understanding.

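Paragraph function: Summarizes the paper, distilling the core finding and emphasizing practical value.

Logical role: The conclusion echoes the abstract but is more concrete: moving from "systematic study" to "drop-in replacement", it stresses the method's generality and low barrier to adoption.

Argumentative technique / potential weakness: The "drop-in" rhetoric maximizes the method's practical appeal. However, the conclusion does not discuss the limitations of R(2+1)D, such as its applicability to non-ResNet architectures or to tasks other than action recognition, or the effect of the decomposition on modeling long-range temporal dependencies.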

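Overview of the argument structure

Problem: the best design for spatiotemporal convolutions is still unsettled
-> Claim: spatial-temporal factorization outperforms full 3D convolution
-> Evidence: consistent lead on four benchmarks; smoother training curves
-> Rebuttal: doubled nonlinearities and simplified optimization as the mechanism
-> Conclusion: R(2+1)D is a drop-in, general-purpose improvement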

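The authors' core claim (one sentence)

The R(2+1)D structure, which factorizes a 3D convolution into a 2D spatial convolution plus a 1D temporal convolution, consistently outperforms full 3D convolution at equal parameter count by adding nonlinearities and simplifying the optimization landscape.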

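Strongest point of the argument

A systematic and fair comparison framework: the five architectures are compared within a unified ResNet framework with the parameter count strictly controlled. The advantage of R(2+1)D shows up not only in final accuracy but is also confirmed in the training dynamics (lower training loss, smaller generalization gap), providing mechanistic insight beyond a comparison of numbers.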

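Weakest point of the argument

Insufficient causal explanation: "doubled nonlinearities" and a "smoother optimization landscape" are reasonable hypotheses but lack a rigorous theoretical basis. If nonlinearity is the key, would simply inserting an extra ReLU in the middle of R3D be equally effective? The absence of this control experiment weakens the persuasiveness of the causal argument.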