AR-Net: Adaptive Frame Resolution for Efficient Action Recognition

Abstract

We introduce AR-Net (Adaptive Resolution Network) for efficient action recognition in videos. The key insight is that not all frames in a video require the same spatial resolution for accurate recognition. AR-Net uses a lightweight policy network to dynamically select the optimal resolution for each frame, allocating more computation to informative frames and less to redundant ones. This achieves comparable accuracy to fixed high-resolution processing with up to 50% fewer FLOPs.

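Paragraph function: Overview of the whole paper: efficient video recognition through adaptive resolution.

Logical role: The insight that not every frame needs high resolution is intuitively sound and lays the groundwork for the method design that follows.

Argumentative technique / potential weakness: A 50% FLOPs reduction is a strong efficiency figure, but whether it holds consistently across different types of actions still needs to be verified.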

1. Introduction

Video understanding models are computationally expensive because they process many high-resolution frames. Existing efficiency methods rely on frame sampling or temporal pruning and still process every retained frame at the same spatial resolution. We observe that spatial resolution requirements vary dramatically across frames: frames with large motion or fine-grained actions need high resolution, while static or transition frames can be processed at low resolution without accuracy loss.

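Paragraph function: Establishes the motivation by using the variation in resolution requirements across frames to justify an adaptive approach.

Logical role: Extending the well-known notion of temporal redundancy to the spatial-resolution dimension is a novel perspective that opens an orthogonal route to efficiency gains.

Argumentative technique / potential weakness: The observation is reasonable, but quantifying which frames need high resolution is a non-trivial challenge that calls for a learned approach.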

Our approach is inspired by the observation that human visual attention naturally adjusts resolution based on content complexity. When watching a video, we focus more on key action moments and less on static backgrounds. AR-Net mimics this behavior by learning a policy that allocates computational resources proportionally to frame informativeness. This is fundamentally different from approaches that reduce temporal redundancy, as we target spatial redundancy within individual frames.

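Paragraph function: Argument by analogy, using human visual attention to justify adaptive resolution.

Logical role: Positions the method as a solution to spatial redundancy, complementary to methods that address temporal redundancy.

Argumentative technique / potential weakness: The analogy to human attention is intuitive and persuasive, but human attention mechanisms are far more complex than choosing a resolution.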

2. Method

AR-Net consists of two components: a policy network and a recognition backbone. The policy network is a lightweight CNN that takes a low-resolution version of each frame and predicts the optimal resolution level from a set of candidates (e.g., 84x84, 128x128, 224x224). The frame is then resized to the selected resolution before it is fed to the backbone. The policy network is trained with reinforcement learning, using Gumbel-Softmax sampling so that the discrete resolution choice remains differentiable, to maximize accuracy while minimizing computational cost.

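Paragraph function: The core method: the design of the policy network and its reinforcement-learning training.

Logical role: Gumbel-Softmax makes the discrete resolution choice differentiable, a neat technical choice that enables end-to-end training.

Argumentative technique / potential weakness: Reinforcement-learning training can be unstable and needs a carefully designed reward function to balance efficiency against accuracy.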
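
The resolution selection step described in this section can be sketched as follows. This is a minimal, dependency-free illustration of Gumbel-Softmax sampling over the three example resolutions, not the authors' implementation. The function name `gumbel_softmax`, the logits, and the seed are hypothetical, and a real system would presumably use an autodiff framework so that gradients can flow through the soft probabilities.

```python
import math
import random

# Candidate resolutions taken from the example in the text.
RESOLUTIONS = [84, 128, 224]

def gumbel_softmax(logits, temperature, rng):
    """Return (soft_probs, hard_index) for one Gumbel-Softmax draw."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Hard selection: index of the largest soft probability.
    hard_index = max(range(len(probs)), key=probs.__getitem__)
    return probs, hard_index

rng = random.Random(0)
policy_logits = [0.2, 1.5, -0.3]  # hypothetical policy output for one frame
probs, choice = gumbel_softmax(policy_logits, temperature=1.0, rng=rng)
selected_resolution = RESOLUTIONS[choice]
```

At a high temperature the soft probabilities stay spread across the candidates. As the temperature falls they approach a one-hot vector, which matches the hard choice used at inference.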

2.1 Policy Learning

The reward function balances accuracy and efficiency: R = accuracy - lambda * computation_cost. The computation cost is measured as the FLOPs of processing the frame at the selected resolution. Lambda controls the accuracy-efficiency trade-off, allowing users to tune the model for different deployment constraints. During inference, the policy network adds less than 1% additional FLOPs.

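Paragraph function: Reward design: a tunable balance between accuracy and efficiency.

Logical role: A tunable lambda lets one model adapt to different deployment scenarios, which greatly improves the method's practicality.

Argumentative technique / potential weakness: The policy network's overhead of under 1% is negligible and does not affect overall efficiency. However, the best value of lambda may vary by dataset.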
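
The reward R = accuracy - lambda * computation_cost can be made concrete with a small sketch. Two assumptions here are not stated in the text: accuracy is treated as a per-sample correctness indicator (1 or 0), and the computation cost is expressed relative to 224x224 under the approximation that FLOPs scale with pixel count. The names `relative_cost` and `reward` are hypothetical.

```python
REFERENCE_RESOLUTION = 224

def relative_cost(resolution, reference=REFERENCE_RESOLUTION):
    """Cost relative to the reference, assuming FLOPs scale with pixel count."""
    return (resolution / reference) ** 2

def reward(correct, resolution, lam):
    """R = accuracy - lambda * computation_cost for a single prediction."""
    accuracy = 1.0 if correct else 0.0  # assumption: per-sample correctness
    return accuracy - lam * relative_cost(resolution)

# Same correct prediction at three resolutions: cheaper choices earn more reward.
rewards = {res: reward(True, res, lam=0.5) for res in (84, 128, 224)}
```

With lambda set to 0 the reward reduces to accuracy alone. Larger values penalize high-resolution choices more heavily, which is the trade-off knob the text describes.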

To stabilize training, we employ a curriculum learning strategy that starts with a high lambda (favoring efficiency) and gradually decreases it to allow the network to learn accurate resolution selection. We also use baseline subtraction in the REINFORCE gradient estimator to reduce variance. The Gumbel-Softmax temperature is annealed from 1.0 to 0.1 during training, transitioning from soft to hard selection.

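Paragraph function: Training stabilization: technical details of curriculum learning and temperature annealing.

Logical role: These techniques directly address the known instability of reinforcement-learning training.

Argumentative technique / potential weakness: Relying on several stabilization techniques hints at how complex training is, though each technique is mature and effective.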
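
The three stabilization devices can be sketched as simple schedules plus a running baseline. The text gives only the temperature endpoints (1.0 to 0.1). The linear shape of the lambda curriculum, its endpoint values, the exponential shape of the temperature decay, and the moving-average form of the baseline are all assumptions made for illustration, and every name below is hypothetical.

```python
def linear_lambda(step, total_steps, lam_start=1.0, lam_end=0.1):
    """Linearly decrease lambda (endpoint values are hypothetical)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)

def annealed_temperature(step, total_steps, t_start=1.0, t_end=0.1):
    """Exponentially anneal the Gumbel-Softmax temperature to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac

class MovingAverageBaseline:
    """Running mean of rewards, subtracted to reduce gradient variance."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = 0.0
        self.initialized = False

    def advantage(self, reward):
        # Score the reward against the current baseline, then update it.
        if not self.initialized:
            self.value = reward
            self.initialized = True
        adv = reward - self.value
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return adv
```

In a training loop, the advantage returned by the baseline would replace the raw reward in the policy-gradient term, while the two schedules are queried once per step.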

3. Experiments

On ActivityNet with a ResNet-50 backbone, AR-Net achieves 72.0% mAP with 48% fewer FLOPs than the fixed 224x224 baseline, which reaches 72.4% mAP. On FCVID, AR-Net achieves 84.3% mAP with 42% fewer FLOPs. Analysis shows that the policy network assigns higher resolution to action-centric frames and lower resolution to background frames.

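Paragraph function: Quantitative evaluation: the efficiency-accuracy trade-off on two benchmarks.

Logical role: A 48% FLOPs reduction at a cost of only 0.4 mAP points is highly valuable for practical deployment.

Argumentative technique / potential weakness: Visualizing the policy's assignments improves the method's interpretability and convinces readers that the policy has learned meaningful patterns.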

We further analyze the resolution distribution learned by the policy network. On ActivityNet, approximately 35% of frames are assigned the lowest resolution, 40% the medium resolution, and only 25% the highest. This supports the hypothesis that most frames do not require full resolution. Ablation studies show that the policy network outperforms random resolution assignment by 2.1% mAP and uniform low-resolution processing by 3.8% mAP.

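Paragraph function: Ablation analysis that validates the effectiveness of policy learning.

Logical role: The comparisons with random assignment and uniform low-resolution processing cleanly isolate the gain that comes from adaptivity.

Argumentative technique / potential weakness: The resolution distribution statistics provide direct empirical support for the core hypothesis.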
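
As a rough sanity check, the reported resolution shares can be turned into an expected per-frame cost. This back-of-envelope sketch makes two assumptions: the lowest, medium, and highest levels correspond to the example sizes 84, 128, and 224, and backbone FLOPs scale with pixel count. It ignores the policy network's overhead and any cost that does not scale with resolution, so it should not be expected to reproduce the reported 48% exactly.

```python
# Resolution shares reported for ActivityNet, keyed by the example sizes.
distribution = {84: 0.35, 128: 0.40, 224: 0.25}

def expected_relative_flops(dist, reference=224):
    """Expected per-frame cost relative to always using the reference size."""
    return sum(share * (res / reference) ** 2 for res, share in dist.items())

relative = expected_relative_flops(distribution)
estimated_saving = 1.0 - relative
print(f"relative cost: {relative:.3f}, estimated saving: {estimated_saving:.1%}")
```

Under these assumptions the relative cost comes out at roughly 0.43 of the fixed 224x224 baseline. That is the same order of magnitude as the reported saving, with the gap plausibly explained by the simplifications listed above.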

4. Conclusion

We have presented AR-Net, which demonstrates that adaptive frame resolution is an effective and orthogonal dimension for improving video understanding efficiency. The approach is model-agnostic and can be combined with other efficiency methods like temporal sampling. AR-Net opens up a promising direction for content-adaptive computation in video understanding.

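Paragraph function: Summary that stresses the method's orthogonality and generality.

Logical role: Framing adaptive resolution as an orthogonal dimension means it can be stacked with other methods, which broadens the work's practical value.

Argumentative technique / potential weakness: Content-adaptive computation is part of a broader trend, and subsequent research on dynamic networks has further validated this direction.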

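Argument structure overview

Problem: video processing is computationally expensive
→ Claim: resolution requirements differ across frames
→ Method: a policy network selects the resolution dynamically
→ Evidence: 48% fewer FLOPs
→ Conclusion: content-adaptive computation as a research direction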

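Core claim

A lightweight policy network that dynamically selects the spatial resolution of each frame can substantially reduce the computational cost of video action recognition with almost no loss in accuracy.

Strongest point of the argument

The trade-off of a 48% FLOPs reduction for only 0.4 mAP points is excellent, and the visualization of policy assignments clearly shows that the method behaves sensibly.

Weakest point of the argument

Reinforcement-learning training can be unstable, the reward function's lambda must be tuned by hand, and the method is validated only on video classification tasks.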