Action Recognition with Improved Trajectories

Abstract

Dense trajectories have been shown to be an efficient video representation for action recognition. In this paper, we improve the dense trajectory approach by explicitly estimating camera motion and removing it from the optical flow. We use human detection and feature point matching to robustly estimate homographies between consecutive frames. The estimated camera motion is used to cancel out background motion in the optical flow and trajectory shape descriptors. We also propose to remove trajectories consistent with camera motion rather than including them as features. Our improved trajectories yield significant improvements over the original dense trajectories on Hollywood2 (64.3%), HMDB51 (57.2%), and UCF50 (91.2%), achieving state-of-the-art results across these benchmarks.

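Paragraph function: Overview of the whole paper. It presents camera motion compensation as the core improvement and previews state-of-the-art results on three benchmarks.

Logical role: The abstract lays out the research logic in four progressive steps: the success of dense trajectories, the camera motion problem, the compensation scheme, and the improved results.

Argumentative technique / potential weakness: Reporting concrete numbers for three benchmarks directly in the abstract signals confidence. However, the idea of camera motion compensation is not entirely new: the authors' contribution lies in how to do it robustly rather than in the concept itself, and the abstract does not make this distinction clear.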

1. Introduction

Action recognition in video has received considerable attention due to its applications in video surveillance, human-computer interaction, and video retrieval. The dense trajectory framework proposed by Wang et al. (2011) achieves strong performance by densely sampling feature points and tracking them using optical flow, then computing local descriptors along the trajectories. However, a key limitation is that dense trajectories do not distinguish between camera-induced motion and object motion. When the camera moves (pan, tilt, zoom), all trajectories are affected, and background trajectories carry no information about the action but dominate the representation. This motivates our approach: by explicitly estimating and compensating for camera motion, we can extract trajectories that truly reflect human actions.

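Paragraph function: Establishes the research motivation by pinpointing how dense trajectories confound camera motion with object motion.

Logical role: The concrete defect that background trajectories dominate the representation establishes the need for camera motion compensation. The problem is defined precisely and in an actionable way.

Argumentative technique / potential weakness: Focusing the problem on separating camera motion from object motion makes the motivation easy to follow. In some cases, however, such as tracking shots, the camera motion itself carries important context about the action, and removing it completely may discard a useful signal.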

2. Related Work

Local spatio-temporal features such as STIP (Space-Time Interest Points) and cuboid features have been widely used for action recognition. The bag-of-visual-words pipeline with HOG, HOF, and MBH descriptors remains the dominant paradigm. Dense trajectories outperform interest-point-based approaches by providing denser coverage and motion-aware sampling. Attempts to handle camera motion include motion stabilization and background subtraction, but these are either too aggressive (removing useful motion) or require static cameras. Our approach is more principled: we estimate a parametric camera motion model and use it to selectively compensate trajectories and descriptors.

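Paragraph function: Literature review. It traces the progression from spatio-temporal features to dense trajectories and critiques existing camera compensation methods.

Logical role: The dilemma between methods that are too aggressive and methods that are too restrictive opens a niche for this paper's principled, selective compensation.

Argumentative technique / potential weakness: Sorting the shortcomings of competing methods into two classes (too aggressive or too restrictive) is an effective contrast strategy. The claim of being more principled still needs experimental support: are there cases where this method is itself too aggressive or not thorough enough?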

3. Method

3.1 Dense Trajectories Recap

The dense trajectory framework densely samples feature points on a regular grid at multiple spatial scales. Points are tracked for L = 15 frames using median filtering of optical flow. Along each trajectory, four types of descriptors are computed within a spatio-temporal volume: trajectory shape (displacement vectors), HOG (Histogram of Oriented Gradients), HOF (Histogram of Optical Flow), and MBH (Motion Boundary Histograms). Descriptors are encoded using Fisher vectors with GMM vocabularies of 256 Gaussians, and classification is performed with linear SVMs.

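Paragraph function: Methodological foundation. It reviews the full pipeline of the original dense trajectories.

Logical role: Establishes the baseline for the improvements. Readers need to understand each component of the original method to see which stages the improvements act on.

Argumentative technique / potential weakness: The detailed review of the baseline supports reproducibility. Fisher vectors with linear SVMs were a state-of-the-art encoding choice in 2013, but every stage of this pipeline has a fair number of hyperparameters (grid density, trajectory length, GMM size, and so on), so part of the improvement may come from hyperparameter tuning.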
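
The tracking and trajectory shape steps recapped in Section 3.1 can be illustrated with a minimal sketch. It assumes a hypothetical flow representation (a grid of `(dx, dy)` tuples), a 3x3 median window, and a descriptor normalised by the sum of displacement magnitudes; the function names are illustrative and none of this is taken from the authors' code.

```python
from statistics import median
from math import hypot

L = 15  # trajectory length in frames, as stated in Section 3.1

def median_flow(flow, x, y):
    """Component-wise median of the flow in the 3x3 window around (x, y).

    `flow` is a hypothetical H x W grid of (dx, dy) tuples. The window size is
    an assumption of this sketch, and the point must stay inside the frame.
    """
    h, w = len(flow), len(flow[0])
    xi, yi = int(round(x)), int(round(y))
    window = [flow[j][i]
              for j in range(max(0, yi - 1), min(h, yi + 2))
              for i in range(max(0, xi - 1), min(w, xi + 2))]
    return median(v[0] for v in window), median(v[1] for v in window)

def track(flows, x, y):
    """Follow one point through up to L flow fields; returns the visited points."""
    points = [(x, y)]
    for flow in flows[:L]:
        dx, dy = median_flow(flow, x, y)
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

def trajectory_shape(points):
    """Displacement vectors, normalised by the sum of their magnitudes."""
    disp = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(hypot(dx, dy) for dx, dy in disp)
    if total == 0:
        return [0.0] * (2 * len(disp))  # static point: all-zero descriptor
    return [c / total for d in disp for c in d]

# Usage: a uniform flow of (1, 0) pixels per frame on an 8 x 40 grid.
flow = [[(1.0, 0.0)] * 40 for _ in range(8)]
pts = track([flow] * L, x=2.0, y=4.0)
desc = trajectory_shape(pts)
```

With L = 15 the descriptor has 30 values (two per displacement). Normalising by the total path length makes the shape independent of the overall speed of the motion.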

3.2 Camera Motion Estimation

We estimate camera motion between consecutive frames by computing a homography using RANSAC on matched feature points. To make the estimation robust to foreground motion (humans performing actions), we employ two strategies: (1) we use a human detector to identify and exclude foreground regions from the feature matching, and (2) we rely on RANSAC's outlier rejection to handle remaining foreground points. The estimated homography H captures the dominant planar camera motion (pan, tilt, rotation). We then warp each frame by the inverse homography to obtain a camera-motion-compensated optical flow. This compensated flow ideally captures only the independent motion of objects in the scene.

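Paragraph function: Core innovation. It details the robust method for camera motion estimation.

Logical role: This paragraph is the technical core of the paper. The two-fold robustness strategy, excluding human detections and relying on RANSAC outlier rejection, directly addresses the way foreground motion interferes with camera motion estimation.

Argumentative technique / potential weakness: The two-fold robustness strategy is well designed, but a homography assumes that the scene is planar or that the camera undergoes pure rotation, so the model may be inaccurate in scenes with significant depth variation. In addition, failures of the human detector (for example under occlusion or unusual poses) directly degrade the quality of the camera motion estimate.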
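
The two strategies in Section 3.2 can be sketched with NumPy as follows. This is a simplified illustration and not the authors' implementation: the matches and the human bounding boxes are synthetic, the homography is fitted with a plain direct linear transform, and the function names, the iteration count and the 1-pixel inlier threshold are all assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares H with dst ~ H @ src (N >= 4)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)          # null-space vector of the constraint matrix
    return H / H[2, 2]

def project(H, pts):
    """Apply H to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def outside_boxes(pts, boxes):
    """True for points that fall outside every (x0, y0, x1, y1) human box."""
    keep = np.ones(len(pts), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside = ((pts[:, 0] >= x0) & (pts[:, 0] <= x1) &
                  (pts[:, 1] >= y0) & (pts[:, 1] <= y1))
        keep &= ~inside
    return keep

def ransac_homography(src, dst, boxes, iters=200, thresh=1.0, seed=0):
    """Strategy (1): drop matches inside human boxes. Strategy (2): RANSAC."""
    mask = outside_boxes(src, boxes)
    src, dst = src[mask], dst[mask]
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        if not np.all(np.isfinite(H)):
            continue                   # degenerate sample, try again
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return fit_homography(src[best], dst[best]), best

# Usage on synthetic matches: a slow pan with slight rotation, one detected
# person (covered by a box) and one missed person (left for RANSAC to reject).
rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.01, 5.0], [-0.01, 1.0, -3.0], [0.0, 0.0, 1.0]])
bg = rng.uniform(0, 320, size=(60, 2))
detected = rng.uniform(100, 140, size=(20, 2))
missed = rng.uniform(200, 220, size=(10, 2))
src = np.vstack([bg, detected, missed])
dst = np.vstack([project(H_true, bg), detected + [12.0, 7.0], missed + [12.0, 7.0]])
boxes = [(100, 100, 140, 140)]
H_est, inliers = ransac_homography(src, dst, boxes)
```

In this toy setup the box removes the detected person before matching, and the missed person's points are rejected as RANSAC outliers, so the final fit uses background matches only.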

3.3 Improved Descriptors

With the estimated camera motion, we make three improvements to the descriptors: (1) Trajectory shape descriptors are computed from the compensated optical flow, so they encode only object motion; (2) HOF descriptors are computed from the compensated flow rather than the raw flow; (3) Trajectories that are consistent with the estimated camera motion (background trajectories) are removed entirely. The MBH descriptor, which already computes derivatives of optical flow, is relatively insensitive to global translational camera motion and thus benefits less from compensation. However, even MBH shows improvement when camera rotation is significant.

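Paragraph function: Details of the improvements. It explains how camera compensation is incorporated into each descriptor.

Logical role: Ties the benefit of camera compensation to each individual descriptor, showing that the improvement is systematic. The discussion of the special case of MBH reflects a deep understanding of the method.

Argumentative technique / potential weakness: The candid analysis of MBH (it benefits less but still improves under rotation) shows academic honesty and implies that the benefit of compensation varies by descriptor type. The error introduced by compensation is not discussed, however: an imperfect camera estimate could contaminate some descriptors.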
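
Improvement (3) in Section 3.3, removing trajectories that are consistent with the camera motion, can be sketched as below. The section does not state the removal criterion, so this sketch assumes one: a trajectory counts as background when its largest camera-compensated displacement stays under a threshold (1 pixel here). It also assumes that each homography maps frame t coordinates to frame t + 1 coordinates. The function names are illustrative.

```python
from math import hypot

def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography given as nested lists."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def compensated_displacements(points, homographies):
    """Per-frame displacement minus the displacement explained by the camera.

    points: tracked positions, one per frame.
    homographies: one matrix per frame pair (assumed direction: t -> t + 1).
    """
    out = []
    for (x0, y0), (x1, y1), H in zip(points, points[1:], homographies):
        cx, cy = apply_homography(H, x0, y0)  # where camera motion alone puts the point
        out.append((x1 - cx, y1 - cy))
    return out

def is_background(points, homographies, thresh=1.0):
    """True if the trajectory never moves more than `thresh` pixels relative
    to the estimated camera motion (the threshold is an assumption)."""
    disp = compensated_displacements(points, homographies)
    return max(hypot(dx, dy) for dx, dy in disp) < thresh

# Usage: the camera pans 3 pixels per frame. One point only follows the pan,
# the other also moves 2 pixels per frame on its own.
H_pan = [[1.0, 0.0, 3.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Hs = [H_pan] * 15
bg_track = [(10.0 + 3 * t, 20.0) for t in range(16)]
fg_track = [(50.0 + 3 * t, 60.0 + 2 * t) for t in range(16)]
kept = [trk for trk in (bg_track, fg_track) if not is_background(trk, Hs)]
```

The same compensated displacements are what a compensated trajectory shape descriptor would be built from, so the removal test and the descriptor can share one computation.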

4. Experiments

We evaluate on three challenging benchmarks: Hollywood2 (12 action classes from movies), HMDB51 (51 action classes from diverse sources), and UCF50 (50 action classes from YouTube). Our improved trajectories achieve 64.3% on Hollywood2 (vs. 58.2% for original dense trajectories), 57.2% on HMDB51 (vs. 48.3%), and 91.2% on UCF50 (vs. 85.6%), which corresponds to gains of 6.1, 8.9, and 5.6 percentage points, respectively. The improvements are largest on datasets with significant camera motion (Hollywood2 movies involve extensive camera movement). Ablation studies confirm that (1) camera motion compensation in trajectory shape and HOF descriptors provides the most benefit, (2) removing background trajectories further improves performance, and (3) combining all improved descriptors with Fisher vectors achieves the best results. Our method achieved state-of-the-art results on all three benchmarks at the time of publication.

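Paragraph function: Provides comprehensive experimental evidence: quantitative results on three benchmarks and a systematic ablation study.

Logical role: The empirical core. Consistent gains of roughly 6 to 9 percentage points (5.6 to 8.9), systematic validation through ablation, and the positive correlation between the gains and the amount of camera motion together form a strong chain of evidence.

Argumentative technique / potential weakness: The ablation study is rigorously designed and verifies the contribution of each component in turn. Although 57.2% on HMDB51 was state of the art, the absolute accuracy is still low, which shows how challenging action recognition is. The 91.2% on UCF50 may be limited by dataset saturation.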
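
The gains quoted in Section 4 can be recomputed from the reported accuracies with a few lines of arithmetic. The sketch below uses only the figures given in the text. The relative gain is an extra quantity computed here for context and is not reported in the source.

```python
# (original dense trajectories, improved trajectories) accuracy in percent,
# copied from Section 4.
results = {
    "Hollywood2": (58.2, 64.3),
    "HMDB51": (48.3, 57.2),
    "UCF50": (85.6, 91.2),
}

# Absolute gain in percentage points, rounded to one decimal to hide float noise.
gains = {name: round(new - old, 1) for name, (old, new) in results.items()}

for name, gain in gains.items():
    old = results[name][0]
    print(f"{name}: +{gain} points ({100 * gain / old:.1f}% relative)")

largest = max(gains, key=gains.get)  # benchmark with the largest absolute gain
```

On these figures the absolute gains are 6.1, 8.9, and 5.6 points, so the largest one falls on HMDB51 and not on Hollywood2. That is worth keeping in mind when reading the claim that gains track the amount of camera motion.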

5. Conclusion

We have presented improved trajectories for action recognition by explicitly estimating and compensating for camera motion. Our approach consistently improves upon the original dense trajectory framework across multiple challenging benchmarks, with the largest gains on videos with significant camera motion. The key insight is that separating camera-induced motion from true object motion leads to more discriminative representations. The approach is simple, effective, and compatible with the standard bag-of-words pipeline. Future directions include incorporating depth information for more accurate camera estimation and exploring deep learning approaches for learning motion-compensated representations end-to-end.

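Paragraph function: Summarizes the paper, restating the core insight and pointing toward deep learning.

Logical role: The conclusion distills the technical contribution into one concise insight (separating camera motion from object motion) and uses "simple, effective" to vouch for the method's practicality.

Argumentative technique / potential weakness: Naming deep learning approaches as a future direction was prescient (in 2013 deep learning was not yet mainstream in action recognition). It also hints at the long-term limits of hand-crafted feature pipelines, which were indeed later superseded by end-to-end deep learning methods.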

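Argument structure overview

Problem: dense trajectories conflate camera motion with object motion
→ Thesis: explicit camera motion compensation makes trajectories more discriminative
→ Evidence: state of the art on three benchmarks; consistent gains of roughly 6 to 9 percentage points
→ Rebuttal: the homography assumption; limits in scenes with depth variation
→ Conclusion: motion separation is an effective and general improvement strategy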

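Authors' core claim (one sentence)

Robustly estimating camera motion and removing it from the optical flow and descriptors of dense trajectories significantly improves action recognition performance, especially in scenes with pronounced camera motion.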

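Strongest point of the argument

Consistent and significant improvement: the method gains roughly 6 to 9 percentage points on three benchmarks with different characteristics (movies, mixed sources, YouTube), and the size of the gain correlates positively with the amount of camera motion, which directly supports the method's motivation. The ablation study systematically breaks down the contribution of each component, so the logic is complete.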

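Weakest point of the argument

Simplified camera motion model: a homography assumes a planar scene or a purely rotating camera, which does not hold in scenes with significant parallax. In addition, removing background trajectories entirely may discard useful scene context. For example, distinguishing "running on a playground" from "running indoors" may depend on the background.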