Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition

Abstract

Human action recognition from skeleton data has attracted increasing attention due to the availability of depth sensors and real-time pose estimation algorithms. Existing methods typically concatenate all joint coordinates into a single feature vector, ignoring the spatial structure of the human skeleton. In this paper, we propose a hierarchical recurrent neural network (HRNN) that decomposes the human skeleton into five anatomical groups (two arms, two legs, and a torso) and processes them in a bottom-up hierarchy. At the lowest level, individual body part subnets capture part-specific motion patterns. At higher levels, these part representations are progressively fused to form a full-body representation. This hierarchical decomposition naturally encodes the physical structure of the human body into the network architecture. We demonstrate state-of-the-art performance on several benchmark datasets.

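Paragraph function: Overview of the whole paper. It opens with the problem that existing methods ignore the skeleton's spatial structure and introduces the hierarchical RNN as the solution.

Logical role: The abstract sets up a clear three-step argument: problem (structure ignored) -> solution (hierarchical decomposition) -> validation (state-of-the-art performance). Naming exactly five anatomical groups makes the method immediately intuitive.

Argumentative technique / potential weakness: "Naturally encodes the physical structure" is the core selling point: the network architecture mirrors the inherent structure of the problem. Whether the five-group split is the only or the best decomposition (finer or coarser alternatives, for example) needs ablation experiments to support it.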

1. Introduction

Skeleton-based action recognition has emerged as an important research direction due to several advantages: skeleton data is compact, view-invariant to a large extent, and robust to background clutter and illumination changes. With the popularization of Microsoft Kinect and advances in real-time pose estimation, large-scale skeleton datasets have become available. However, most existing approaches treat the skeleton as a flat vector of joint coordinates, applying standard classifiers (SVMs, random forests) or feeding the entire vector into a single RNN. This ignores the rich structural information inherent in the human body's kinematic tree.

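Paragraph function: Establishes the research area. It affirms the advantages of skeleton data and identifies a structural shortcoming in existing methods.

Logical role: First establishes the value of skeleton-based recognition (compact, view-invariant, robust), then points out that the "flat vector" treatment fails to exploit this structure, which motivates the hierarchical approach.

Argumentative technique / potential weakness: The term "kinematic tree" neatly links human anatomy to graph structure, implying that the network architecture should mirror that tree. However, skeleton data is "view-invariant" only under ideal conditions: Kinect joint estimates still carry significant error under occlusion.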

We propose to exploit the physical structure of the human body by designing a hierarchical RNN architecture. The key observation is that human actions are composed of coordinated movements of body parts: for instance, "walking" involves coordinated leg motions, "clapping" involves coordinated arm motions, and "bowing" primarily involves torso motion. By first extracting part-level temporal representations and then combining them hierarchically, the network can learn both part-specific patterns and inter-part coordination at different levels of abstraction.

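Paragraph function: Presents the core idea, using concrete action examples to show why hierarchical decomposition is intuitively reasonable.

Logical role: The three everyday actions "walking", "clapping", and "bowing" build the intuition that different actions are driven by different combinations of body parts, so learning part-level representations is worthwhile.

Argumentative technique / potential weakness: Everyday examples make the abstract idea intuitive. Many actions, however, involve whole-body coordination (dancing, martial arts), and the five-group split may be too rigid: interactions between parts are ignored at the lowest layer.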

2. Related Work

Hand-crafted feature approaches for skeleton action recognition include covariance descriptors, Lie group representations, and actionlets. While effective, these methods require domain expertise to design features and may not generalize across datasets. Deep learning approaches have also been applied: Du et al. used standard RNNs on concatenated joint vectors, and Veeriah et al. explored differential RNNs. However, these methods do not explicitly model the spatial structure of the skeleton. Graph-based approaches model joint relationships but typically use static graph structures that do not capture temporal dynamics. Our HRNN uniquely combines structural decomposition with recurrent temporal modeling in an end-to-end trainable framework.

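Paragraph function: Literature review. Positions HRNN at the intersection of hand-crafted features, deep learning, and graph-based methods.

Logical role: A three-way contrast: hand-crafted features (do not generalize), flat RNNs (ignore structure), static graphs (ignore temporal dynamics). HRNN handles spatial structure and temporal dynamics together.

Argumentative technique / potential weakness: Establishing the research gap by elimination is effective. The later ST-GCN (2018), however, showed that graph convolutional networks can handle space and time jointly, which suggests HRNN's fixed hierarchy may not be the best way to encode structure.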

3. Method

3.1 Skeleton Decomposition

The human skeleton (typically 20-25 joints from a Kinect sensor) is decomposed into five anatomical groups: left arm (shoulder, elbow, wrist, hand), right arm, left leg (hip, knee, ankle, foot), right leg, and torso (spine, neck, head). Each group contains the 3D coordinates of its constituent joints at each time step. This decomposition follows the natural kinematic chain of the human body — joints within the same group tend to move in a coordinated manner, while inter-group coordination encodes higher-level action semantics.

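Paragraph function: First part of the method core. Defines the spatial decomposition strategy for the skeleton.

Logical role: Establishes the biological plausibility of the spatial decomposition: coordinated motion of adjacent joints along a kinematic chain gives the grouping a physical basis.

Argumentative technique / potential weakness: The five-group split is intuitive and easy to implement. But the decomposition is fixed, and different actions may call for different groupings (for "lifting an object with both hands", the two arms should arguably form one group). An adaptive grouping strategy might do better.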
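
The grouping described in Section 3.1 can be sketched in plain Python. This is only an illustration: the joint names and their ordering are hypothetical (a real Kinect skeleton has 20-25 joints with its own indexing), and only the five-group layout is taken from the text.

```python
# Hypothetical joint names and order, for illustration only;
# a real sensor defines its own joint indexing.
GROUPS = {
    "left_arm": ["l_shoulder", "l_elbow", "l_wrist", "l_hand"],
    "right_arm": ["r_shoulder", "r_elbow", "r_wrist", "r_hand"],
    "left_leg": ["l_hip", "l_knee", "l_ankle", "l_foot"],
    "right_leg": ["r_hip", "r_knee", "r_ankle", "r_foot"],
    "torso": ["spine", "neck", "head"],
}
JOINT_ORDER = [name for joints in GROUPS.values() for name in joints]
JOINT_INDEX = {name: i for i, name in enumerate(JOINT_ORDER)}


def split_frame(frame):
    """Split one frame (a list of (x, y, z) per joint) into per-group flat vectors."""
    parts = {}
    for group, joints in GROUPS.items():
        vec = []
        for name in joints:
            vec.extend(frame[JOINT_INDEX[name]])
        parts[group] = vec
    return parts


def split_sequence(frames):
    """Turn a sequence of frames into one time series per body-part group."""
    series = {group: [] for group in GROUPS}
    for frame in frames:
        for group, vec in split_frame(frame).items():
            series[group].append(vec)
    return series


# Two dummy frames: joint i sits at (i + t, i + t, i + t) in frame t.
frames = [[(float(i + t),) * 3 for i in range(len(JOINT_ORDER))] for t in range(2)]
series = split_sequence(frames)
# Each entry is (number of time steps, features per step) for that group.
print({group: (len(s), len(s[0])) for group, s in series.items()})
```

Each limb group carries 4 joints x 3 coordinates per time step and the torso carries 3 x 3, so a flat per-frame vector becomes five shorter sequences that separate subnets can consume.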

3.2 Hierarchical Architecture

The hierarchical RNN consists of multiple layers. At the first layer, five independent bidirectional RNNs (one per body part group) process the temporal sequences of their respective joint coordinates. Each part-level RNN produces a hidden state sequence capturing the temporal dynamics of that body part. At the second layer, the hidden states from related parts are concatenated and fed into higher-level RNNs: the left arm and right arm representations are combined, the left leg and right leg are combined, and the torso representation is carried forward. At the third (top) layer, all representations are fused into a single RNN that produces the final full-body temporal representation. The last hidden state of the top-level RNN is fed into a softmax classifier for action classification.

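Paragraph function: Second part of the method core. Describes the concrete makeup of the three-layer hierarchy.

Logical role: The three-layer structure (part -> limb -> full body) maps neatly onto the anatomical hierarchy of the body (joint -> limb -> full body). Using bidirectional RNNs lets every time step draw on both past and future information.

Argumentative technique / potential weakness: Progressive fusion keeps the input dimension at each layer manageable. The fixed three-layer structure lacks flexibility, though: is three layers the best depth? Whether a deeper or shallower hierarchy changes performance needs ablation.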
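
The data flow of the three-layer hierarchy in Section 3.2 can be traced with a small, dependency-free sketch. Several things here are assumptions made for illustration rather than the authors' implementation: the `BiRNN` and `HRNN` class names, the hidden size of 4, the random untrained weights, and the use of a plain tanh recurrence as a stand-in for the LSTM units. The softmax classifier is omitted.

```python
import math
import random


class BiRNN:
    """Bidirectional tanh RNN with random, untrained weights (stand-in for a BiLSTM)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        scale = 1.0 / math.sqrt(in_dim + hid_dim)

        def matrix(rows, cols):
            return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

        # One (input, recurrent) weight pair per direction.
        self.weights = [(matrix(hid_dim, in_dim), matrix(hid_dim, hid_dim)) for _ in range(2)]

    def _run(self, w_in, w_rec, seq):
        h, out = [0.0] * self.hid_dim, []
        for x in seq:
            h = [math.tanh(sum(a * b for a, b in zip(w_in[j], x))
                           + sum(a * b for a, b in zip(w_rec[j], h)))
                 for j in range(self.hid_dim)]
            out.append(h)
        return out

    def __call__(self, seq):
        fwd = self._run(*self.weights[0], seq)
        bwd = self._run(*self.weights[1], seq[::-1])[::-1]
        # Per time step: forward state followed by backward state.
        return [f + b for f, b in zip(fwd, bwd)]


def concat(*seqs):
    """Concatenate several sequences feature-wise, one time step at a time."""
    return [[v for vec in step for v in vec] for step in zip(*seqs)]


class HRNN:
    def __init__(self, part_dims, hid=4, seed=0):
        rng = random.Random(seed)
        out = 2 * hid  # each BiRNN emits forward + backward states
        self.part_nets = {name: BiRNN(dim, hid, rng) for name, dim in part_dims.items()}
        self.arms = BiRNN(2 * out, hid, rng)
        self.legs = BiRNN(2 * out, hid, rng)
        self.body = BiRNN(3 * out, hid, rng)

    def __call__(self, parts):
        # Layer 1: one independent BiRNN per body part.
        h = {name: net(parts[name]) for name, net in self.part_nets.items()}
        # Layer 2: fuse the two arms and the two legs; the torso is carried forward.
        arms = self.arms(concat(h["left_arm"], h["right_arm"]))
        legs = self.legs(concat(h["left_leg"], h["right_leg"]))
        # Layer 3: fuse everything into a single full-body sequence.
        body = self.body(concat(arms, legs, h["torso"]))
        return body[-1]  # last hidden state, which would feed the softmax classifier


part_dims = {"left_arm": 12, "right_arm": 12, "left_leg": 12, "right_leg": 12, "torso": 9}
parts = {name: [[0.1 * t] * dim for t in range(5)] for name, dim in part_dims.items()}
features = HRNN(part_dims)(parts)
print(len(features))  # 8, i.e. 2 * hid
```

The point of the sketch is the wiring: no subnet below the top layer ever sees joints outside its own group, which is exactly the property the commentary flags as a possible bottleneck for whole-body actions.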

3.3 Temporal Modeling

Each RNN in the hierarchy uses bidirectional connections to capture both forward and backward temporal context. We employ Long Short-Term Memory (LSTM) units to address the vanishing gradient problem inherent in standard RNNs when processing long sequences. The LSTM's gating mechanism (input gate, forget gate, output gate) allows the network to selectively remember or forget information across time steps. For action recognition, this is particularly important because discriminative motion patterns may occur at any point in the sequence, and the network must learn to attend to the most informative temporal segments.

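Paragraph function: Method detail. Describes the technical choices for temporal modeling.

Logical role: Addresses how to model the time dimension, complementing the spatial decomposition in the preceding sections. The bidirectional LSTM ensures temporal information is used in full.

Argumentative technique / potential weakness: LSTM, as a mature sequence model, is a safe technical choice. The claim of "selective attention", however, would need an explicit attention mechanism to verify: a standard LSTM's "attention" is implicit.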
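
The gating behaviour described in Section 3.3 can be seen in a single-unit LSTM step written with the standard LSTM equations. The parameter names (`wi`, `ui`, `bi`, and so on) and the hand-picked bias values are assumptions chosen to make the gates easy to observe; a trained network would learn them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One step of a single-unit LSTM with scalar input x, hidden state h, cell state c."""
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate: how much new info to write
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate: how much old memory to keep
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate: how much memory to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate cell update
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def make_params(input_bias, forget_bias):
    """Gates driven only by their biases, so their effect is easy to read off."""
    p = {k: 0.0 for k in ("wi", "ui", "wf", "uf", "wo", "uo", "bo", "ug", "bg")}
    p.update({"wg": 1.0, "bi": input_bias, "bf": forget_bias})
    return p


# "Remember": forget gate near 1, input gate near 0, so the cell keeps its content.
keep = make_params(input_bias=-10.0, forget_bias=10.0)
h, c_keep = 0.0, 0.5
for _ in range(50):
    h, c_keep = lstm_step(1.0, h, c_keep, keep)

# "Overwrite": forget gate near 0, input gate near 1, so one step replaces the content.
overwrite = make_params(input_bias=10.0, forget_bias=-10.0)
_, c_over = lstm_step(1.0, 0.0, 0.5, overwrite)

print(round(c_keep, 3), round(c_over, 3))
```

With the "remember" setting the cell state stays close to its initial value across 50 steps, while the "overwrite" setting moves it to roughly tanh(1) in one step. That additive, gated update of `c` is what lets gradients survive over long sequences.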

4. Experiments

We evaluate on three benchmark datasets: NTU RGB+D (the largest skeleton dataset at the time with 56,880 sequences and 60 action classes), SBU Kinect Interaction, and CMU Motion Capture. On NTU RGB+D, our HRNN achieves significant improvements over flat RNN baselines in both cross-subject and cross-view evaluation protocols. Compared to hand-crafted feature methods (Lie Group, Skeletal Quads, Dynamic Skeletons), HRNN outperforms all baselines by clear margins. On SBU Kinect Interaction (two-person interactions), the hierarchical architecture also demonstrates superior performance, showing that the part-level decomposition is beneficial even for interaction recognition.

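Paragraph function: Experimental validation. Demonstrates performance on three datasets of different scale and type.

Logical role: The three datasets are complementary: NTU RGB+D tests large-scale generality, SBU tests interaction scenarios, and CMU tests motion-capture precision. Consistent gains across datasets strengthen the case for the method's robustness.

Argumentative technique / potential weakness: "Clear margins" is vague wording; concrete numbers would be more convincing. In addition, comparison with contemporaneous deep learning methods (not only hand-crafted features) would better show HRNN's practical advantage.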
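
The two evaluation protocols mentioned above differ in which attribute is held out: performers for cross-subject, camera viewpoints for cross-view. The sketch below shows that idea on toy metadata. The field names, the `split_by` helper, and the particular subject and camera IDs are placeholders, not a dataset's official split.

```python
def split_by(samples, key, train_values):
    """Partition samples so that train and test never share a value of `key`."""
    train = [s for s in samples if s[key] in train_values]
    test = [s for s in samples if s[key] not in train_values]
    return train, test


# Toy metadata; the subject and camera IDs are placeholders, not an official split.
samples = [
    {"id": n, "subject": n % 4, "camera": n % 3, "label": n % 2}
    for n in range(12)
]

# Cross-subject: the performers seen in training never appear at test time.
xsub_train, xsub_test = split_by(samples, "subject", {0, 1})

# Cross-view: the camera viewpoints seen in training never appear at test time.
xview_train, xview_test = split_by(samples, "camera", {1, 2})

print(len(xsub_train), len(xsub_test), len(xview_train), len(xview_test))
```

Holding out whole subjects or whole viewpoints, rather than random sequences, is what makes the reported numbers a test of generalization to unseen people and camera angles.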

Ablation studies validate the hierarchical design: the three-layer HRNN consistently outperforms a single-layer flat RNN that receives all joints simultaneously. The part-level decomposition alone (first layer only) already provides improvements over the flat baseline, confirming that respecting the body's spatial structure is beneficial. The additional fusion layers further improve performance by capturing inter-part coordination. Analysis of per-action performance reveals that the hierarchical model is especially beneficial for actions involving specific body parts (e.g., "waving" benefits from arm-specific processing), while maintaining competitiveness on full-body actions.

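Paragraph function: Ablation study. Verifies step by step what each component of the hierarchical design contributes.

Logical role: The four-way ablation (flat vs. one-layer decomposition vs. two-layer fusion vs. full three-layer) builds a chain of evidence for progressive improvement. The per-action analysis further reveals when the hierarchical architecture helps most.

Argumentative technique / potential weakness: The per-action observation that "waving" benefits from arm-specific processing is very persuasive. But "maintaining competitiveness" (rather than surpassing) on full-body actions hints that hierarchical decomposition may introduce an unnecessary information bottleneck in those cases.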
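
A per-action comparison like the one described in the ablation paragraph comes down to per-class accuracy computed for two models on the same test labels. The sketch below shows one way to do that. The labels and predictions are made-up toy values for illustration, not results from the paper, and the function names are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples for each ground-truth class."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def accuracy_gain(y_true, pred_flat, pred_hier):
    """Per-class accuracy of the hierarchical model minus that of the flat baseline."""
    flat = per_class_accuracy(y_true, pred_flat)
    hier = per_class_accuracy(y_true, pred_hier)
    return {c: hier[c] - flat[c] for c in flat}


# Made-up toy labels, only to exercise the functions.
y_true = ["waving", "waving", "jumping", "jumping"]
pred_flat = ["jumping", "waving", "jumping", "jumping"]
pred_hier = ["waving", "waving", "jumping", "jumping"]

gain = accuracy_gain(y_true, pred_flat, pred_hier)
print(gain)
```

Sorting the resulting dictionary by gain would separate part-specific actions from full-body ones, which is the kind of breakdown the commentary asks to see in concrete numbers.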

5. Conclusion

We have presented a hierarchical recurrent neural network for skeleton-based action recognition that explicitly encodes the physical structure of the human body into the network architecture. By decomposing the skeleton into anatomical groups and processing them in a bottom-up hierarchy, the network learns part-specific temporal patterns and inter-part coordination at multiple levels of abstraction. Experimental results on multiple benchmarks demonstrate the effectiveness of this structure-aware approach. The proposed framework is general and could be extended to other articulated structures beyond the human body, such as animal locomotion or robotic manipulation.

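Paragraph function: Summarizes the paper. Restates the core structure-aware contribution and looks ahead to broader applications.

Logical role: The conclusion uses "generality" to lift the claim: extending from human actions to any articulated structure widens the method's reach.

Argumentative technique / potential weakness: The outlook toward animals and robots adds room for imagination but comes with no preliminary validation. The conclusion also does not discuss the limits of a fixed hierarchy; later graph neural network methods (such as ST-GCN) provide more flexible structural encoding.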

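Argument Structure Overview

Problem: flat vectors ignore the skeleton's spatial structure

→

Claim: a hierarchical RNN encodes the human kinematic tree

→

Evidence: consistent gains over baselines on three datasets

→

Rebuttal: LSTM addresses vanishing gradients on long sequences

→

Conclusion: the structure-aware architecture can generalize to other domains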

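Authors' core claim (one sentence)

Decomposing the human skeleton into anatomical groups and processing them bottom-up with hierarchical bidirectional LSTMs captures part-specific motion patterns and inter-part coordination at multiple levels of abstraction, achieving state-of-the-art skeleton-based action recognition.

Strongest point of the argument

A natural correspondence between the architecture and the structure of the problem: the five-group split maps directly onto human anatomy, giving the network physical interpretability. The ablation study clearly shows step-by-step improvement from flat to hierarchical, and the per-action analysis pinpoints the scenarios where the hierarchical model has the edge.

Weakest point of the argument

Insufficient flexibility of the fixed hierarchy: both the five-group partition and the three-layer depth are set by hand and are not adaptive. For actions that require whole-body coordination, isolating parts early may introduce an information bottleneck. Later graph convolutional networks (ST-GCN) demonstrated a more flexible way to model joint interactions.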