Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group

Abstract

We propose a novel skeletal representation that explicitly models the 3D geometric relationships between various body parts using rotations and translations in 3D space. Since the relative 3D geometry between a pair of body parts can be described by an element of the special Euclidean group SE(3), the skeletal representation of a human body is a point in the Lie group SE(3) x ... x SE(3). An action is then represented as a curve in this Lie group. We map these curves to the corresponding Lie algebra, which is a vector space, and perform classification using a combination of dynamic time warping, Fourier temporal pyramid representation, and linear SVM. Experiments demonstrate superior performance compared to existing skeleton-based approaches on multiple action recognition datasets.

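Paragraph function: Overview of the whole paper, built around the mathematical structure (Lie group / Lie algebra) that links the three steps of representation, mapping, and classification.

Logical role: The abstract sets up a three-stage argument, geometry -> algebra -> classification: the geometric relationships of the skeleton live naturally in a Lie group, the map to the Lie algebra turns them into vectors, and a standard classifier can then handle them.

Argumentative technique / potential weakness: Describing the whole-body skeleton as a direct product of SE(3) factors is elegant, but the number of body part pairs can grow combinatorially and inflate the dimensionality. The method section needs to explain how meaningful pairs are selected.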

1. Introduction

Human action recognition from 3D skeletal data has gained significant attention due to the availability of cost-effective depth sensors such as Microsoft Kinect. These sensors provide real-time estimation of 3D joint positions, offering a compact yet informative representation of human body configuration. However, most existing skeleton-based methods treat joint positions as points in Euclidean space, ignoring the underlying geometric structure of the human body — specifically, the fact that body parts are rigid segments connected by joints that allow rotation and translation.

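Paragraph function: Establishes the research setting, using the Kinect era as background and pointing out the geometric gap in existing methods.

Logical role: Starting point of the argument. Kinect supplies skeletal data (the opportunity), but existing methods ignore rigid-body geometry (the gap), which creates the need for a Lie group representation.

Argumentative technique / potential weakness: The contrast between "Euclidean space" and "Lie group" suggests a qualitative shift from a trivial representation to a structured one. The simplicity of the Euclidean representation is itself an advantage (it is easy to work with), so the authors must show that the added complexity of the geometric structure pays off.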

The key insight of our approach is that the relative 3D rigid body transformation between any pair of body parts can be naturally represented as an element of SE(3), the special Euclidean group consisting of 3D rotations (SO(3)) and translations. By considering all relevant body part pairs, the entire skeleton configuration at each time frame becomes a point in a product Lie group. An action sequence is therefore a curve in this Lie group manifold. Since Lie groups are non-Euclidean spaces where standard machine learning tools cannot be directly applied, we leverage the exponential and logarithmic maps to transfer curves to the tangent space — the Lie algebra — where Euclidean methods become applicable.

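Paragraph function: Core insight. Explains the mathematical motivation for the Lie group representation and the mapping strategy.

Logical role: This paragraph distills the paper's mathematical foundation: the three-step conversion SE(3) direct product -> curve in the Lie group -> vectors in the Lie algebra turns a geometric problem into a standard classification problem.

Argumentative technique / potential weakness: Turning a non-Euclidean problem into a Euclidean one is an elegant strategy, but the logarithmic map is bijective only in a neighborhood of the identity element. For large rotations (for example, tumbling motions), this approximation may introduce error.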

2. Related Work

Skeleton-based action recognition methods can be broadly categorized into joint-based and body-part-based approaches. Joint-based methods represent actions using raw joint coordinates or their temporal derivatives, which are sensitive to body size variations and viewpoint changes. Body-part-based methods use relative joint positions or angular representations, offering better invariance but lacking a principled mathematical framework for modeling the geometric structure. Recent works have explored covariance descriptors on Riemannian manifolds for action recognition, but these do not exploit the specific Lie group structure inherent in rigid body transformations.

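Paragraph function: Literature review. Lays out the categories of skeleton-based action recognition methods and the limitations of each.

Logical role: Surveys existing methods through a dichotomy (joints vs. body parts), treats Riemannian-geometry methods as the closest relatives, and finally points out that the Lie group structure remains unexploited.

Argumentative technique / potential weakness: The paper positions itself as a refinement of Riemannian methods: it acknowledges that geometric thinking came first while claiming greater structural specificity. This differentiation is effective but subtle, and readers may question how much practical performance difference there is between a Lie group and a general Riemannian manifold.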

3. Method

3.1 Skeletal Representation in SE(3)

Given a skeleton with n body parts, we represent the relative geometry between body part i and body part j as a rigid body transformation (R_ij, t_ij) in SE(3), where R_ij is a 3x3 rotation matrix in SO(3) and t_ij is a 3D translation vector. This transformation maps the local coordinate frame of part i to that of part j. By considering a set of C body part pairs, the configuration at each time frame t is represented as a point p(t) = (g_1(t), g_2(t), ..., g_C(t)) in the product Lie group G = SE(3)^C. The full action is a curve {p(t), t=1,...,T} in G.

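Paragraph function: First step of the method derivation. Defines the formal representation of the skeleton in the Lie group.

Logical role: Converts the intuitive notion of "relationships between body parts" into a rigorous mathematical object, a direct product of SE(3) factors. This formalization underlies every later derivation.

Argumentative technique / potential weakness: The choice of representation (which part pairs are included) directly affects performance. Using all possible pairs gives O(n^2) dimensionality, while using only adjacent parts may lose long-range dependencies. The selection criterion needs to be validated experimentally.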
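
The relative geometry defined in Section 3.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each body part already carries a local frame, given as a rotation matrix (columns are the frame axes in world coordinates) and an origin, and it adopts the convention that (R_ij, t_ij) expresses frame j in the coordinates of frame i. How frames are attached to bones is not specified in this summary, and the helper names (`relative_transform`, `transpose`, `matmul`, `matvec`) are hypothetical.

```python
def transpose(M):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [list(row) for row in zip(*M)]


def matmul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def matvec(A, x):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(A[r][k] * x[k] for k in range(3)) for r in range(3)]


def relative_transform(R_i, o_i, R_j, o_j):
    """Return (R_ij, t_ij): frame j expressed in the coordinates of frame i.

    Assumed convention (not stated in the source): R_* has the local axes
    as columns in world coordinates, o_* is the frame origin in world
    coordinates.
    """
    R_i_T = transpose(R_i)  # the inverse of a rotation is its transpose
    R_ij = matmul(R_i_T, R_j)
    t_ij = matvec(R_i_T, [b - a for a, b in zip(o_i, o_j)])
    return R_ij, t_ij


# Two parts that share an orientation (rotated 90 degrees about z) but whose
# origins differ by one unit along the world x-axis.
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
R_rel, t_rel = relative_transform(Rz90, [0.0, 0.0, 0.0], Rz90, [1.0, 0.0, 0.0])
print(R_rel)  # relative rotation between the two frames
print(t_rel)  # origin offset as seen from frame i
```

With n parts, taking all ordered pairs gives n(n-1) such transforms per frame, which is the O(n^2) growth noted above.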

3.2 Mapping to Lie Algebra

Since the Lie group G is a curved manifold, not a vector space, standard linear classifiers cannot be applied directly. We use the logarithmic map at the identity element to project each point p(t) from G to the Lie algebra g, which is the tangent space at the identity — a flat vector space. For each SE(3) component, the logarithmic map produces a 6-dimensional vector (3 for rotation via the matrix logarithm of SO(3), and 3 for translation). The entire skeleton at time t is thus mapped to a 6C-dimensional vector in the Lie algebra. This mapping preserves the essential geometric information while enabling the use of Euclidean machine learning tools.

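Paragraph function: Core mapping. The key conversion step from manifold to vector space.

Logical role: This paragraph completes the most critical step of the method, removing curvature: a problem on a non-Euclidean manifold becomes a standard problem in Euclidean space.

Argumentative technique / potential weakness: The claim of "preserving the essential geometric information" needs quantitative support. The logarithmic map can introduce significant distortion far from the identity element. In addition, the choice of "reference pose" for different actions affects the quality of the mapping.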
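
As a rough illustration of the mapping in Section 3.2, the sketch below applies the standard closed-form logarithm to one SE(3) factor and then concatenates C factors into a 6C-dimensional vector. The rotation vector omega comes from the matrix logarithm of R, and the translational coordinates are v = V^{-1} t, which in general differ from the raw translation t. Several choices here are assumptions rather than details from the source: the rotation angle is taken to be strictly below pi, a Taylor fallback is used for small angles, the coordinates are ordered (omega, v), and the names `se3_log`, `skeleton_to_vector`, and `cross` are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two length-3 vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def se3_log(R, t):
    """Map a rigid transform (R, t) to a 6-vector (omega, v) in se(3).

    R is a 3x3 rotation matrix (nested lists), t a length-3 list.
    Assumes the rotation angle is below pi; near pi the formula is
    numerically unstable and the logarithm is no longer unique.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    if theta < 1e-4:
        # Taylor limits of the coefficients near the identity.
        scale = 0.5 + theta * theta / 12.0
        c = 1.0 / 12.0
    else:
        scale = theta / (2.0 * math.sin(theta))
        c = (1.0 - theta * math.sin(theta) / (2.0 * (1.0 - cos_theta))) / (theta * theta)
    # Rotation vector from the skew-symmetric part of R.
    omega = [scale * (R[2][1] - R[1][2]),
             scale * (R[0][2] - R[2][0]),
             scale * (R[1][0] - R[0][1])]
    # v = V^{-1} t with V^{-1} = I - W/2 + c W^2, where W t = omega x t.
    wt = cross(omega, t)
    wwt = cross(omega, wt)
    v = [t[k] - 0.5 * wt[k] + c * wwt[k] for k in range(3)]
    return omega + v


def skeleton_to_vector(transforms):
    """Concatenate the logs of C transforms into one 6C-dimensional vector."""
    vec = []
    for R, t in transforms:
        vec.extend(se3_log(R, t))
    return vec


# One frame with C = 2 part pairs gives a 12-dimensional vector.
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame = [(Rz90, [1.0, 0.0, 0.0]), (eye, [0.0, 0.0, 0.5])]
print(len(skeleton_to_vector(frame)))
```

The clamp on `cos_theta` guards against round-off pushing the trace slightly outside the valid range. Rotations close to pi, where the distortion concern raised above is most acute, would need separate handling.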

3.3 Temporal Modeling and Classification

After mapping each time frame to the Lie algebra, an action sequence becomes a time series of 6C-dimensional vectors. To handle temporal misalignment and varying action speeds, we employ dynamic time warping (DTW) to compute distances between sequences. For classification, we construct a Fourier temporal pyramid (FTP) representation that captures both global and local temporal patterns at multiple resolutions. The FTP features are then fed to a linear SVM for final classification. This combination allows the system to handle actions of different durations while capturing the essential temporal dynamics.

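Paragraph function: Classification pipeline. Describes the full flow from time series to classification decision.

Logical role: This paragraph covers the last mile of the method: Lie algebra vectors -> DTW/FTP temporal modeling -> SVM classification. The overall pipeline keeps to the design philosophy of "geometric representation + conventional classifier".

Argumentative technique / potential weakness: Using a linear SVM rather than a more complex classifier implies that the Lie group / Lie algebra representation already provides enough discriminative power. However, DTW has O(T^2) time complexity and may become a bottleneck for long sequences.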
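
The two temporal steps of Section 3.3 can be sketched as follows: a textbook dynamic time warping distance, whose nested loops make the O(T^2) cost visible, and a simple Fourier temporal pyramid that keeps low-frequency magnitudes over recursively halved segments. This is only an illustration. The summary does not give the pyramid depth, the number of retained coefficients, the normalization, or exactly how the DTW output feeds the later stages, so `levels`, `n_coeffs`, and all function names are assumptions. The final linear SVM step is left out.

```python
import cmath
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):          # O(n * m) table fill
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def low_freq_magnitudes(signal, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients, scaled by 1/len."""
    n = len(signal)
    out = []
    for k in range(n_coeffs):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        out.append(abs(coeff) / n)
    return out


def fourier_temporal_pyramid(series, levels=3, n_coeffs=4):
    """Flatten a multichannel time series into one fixed-length feature vector.

    series: list of frames, each a list of floats (for example 6C values).
    Level l splits the sequence into 2**l equal segments.
    """
    if len(series) < 2 ** (levels - 1):
        raise ValueError("sequence too short for the requested pyramid depth")
    dims = len(series[0])
    features = []
    for level in range(levels):
        n_segments = 2 ** level
        for s in range(n_segments):
            start = s * len(series) // n_segments
            end = (s + 1) * len(series) // n_segments
            for d in range(dims):
                channel = [frame[d] for frame in series[start:end]]
                features.extend(low_freq_magnitudes(channel, n_coeffs))
    return features


# Two toy 2-channel sequences of different lengths.
slow = [[math.sin(0.2 * t), math.cos(0.2 * t)] for t in range(32)]
fast = [[math.sin(0.4 * t), math.cos(0.4 * t)] for t in range(16)]
print(dtw_distance(slow, fast))
print(len(fourier_temporal_pyramid(slow)), len(fourier_temporal_pyramid(fast)))
```

Because the pyramid output length depends only on `levels`, `n_coeffs`, and the number of channels, sequences of different durations end up as vectors of the same size, which is what a linear classifier needs.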

4. Experiments

We evaluate on three standard benchmarks: MSR Action3D, Florence 3D Actions, and UTKinect-Action datasets. On MSR Action3D, our method achieves 92.46% accuracy, outperforming the previous best skeleton-based method by a significant margin. On Florence 3D Actions, we obtain 90.88% accuracy, and on UTKinect-Action, 97.08% accuracy. Ablation studies show that the Lie group representation consistently outperforms the raw joint coordinate representation across all datasets, confirming that modeling the geometric structure of the skeleton is beneficial. The choice of body part pairs also significantly impacts performance, with a combination of adjacent and non-adjacent pairs yielding the best results.

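Paragraph function: Provides comprehensive experimental evidence, validating the method on multiple benchmarks.

Logical role: The empirical support covers three dimensions: (1) absolute accuracy on three datasets; (2) an ablation comparing the Lie group and Euclidean representations; (3) an analysis of how the choice of part pairs affects performance.

Argumentative technique / potential weakness: The ablation study directly tests the core hypothesis that explicitly modeling geometric structure helps. But the datasets are relatively small (MSR Action3D has only a few hundred samples), and performance on large-scale datasets remains to be verified.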

5. Conclusion

We have proposed a principled approach to skeleton-based action recognition that represents 3D skeletons as points in the Lie group SE(3)^C. By mapping action curves from the Lie group to the Lie algebra via the logarithmic map, we enable the use of standard Euclidean classifiers while preserving the geometric structure of the human body. Experimental results on multiple benchmarks demonstrate the effectiveness of this representation. Future directions include learning the optimal body part pairs from data and extending the framework to incorporate appearance features alongside skeletal geometry.

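Paragraph function: Summarizes the paper, restating the mathematical rigor of the method and looking ahead.

Logical role: The conclusion mirrors the structure of the abstract, moving from the representation back to experimental validation. The future direction of "learning the optimal part pairs" pinpoints the hand-designed component of the current method.

Argumentative technique / potential weakness: The word "principled" stresses the advantage of the mathematical framework, but the paper does not discuss the potential impact of deep learning on skeleton-based action recognition; later RNN/GCN methods substantially outperformed conventional pipelines on this task.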

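Argument Structure Overview

Problem: Skeleton-based methods ignore the rigid-body geometric structure of the human body

→

Claim: The Lie group SE(3) naturally describes the relationships between body parts

→

Evidence: State-of-the-art accuracy on three datasets

→

Rebuttal: The Lie algebra mapping makes standard classifiers usable

→

Conclusion: Explicitly modeling geometric structure significantly improves action recognition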

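The Authors' Core Claim (One Sentence)

Representing the 3D human skeleton as a point in a direct product of special Euclidean groups SE(3), and converting action curves into Euclidean vectors through the Lie algebra mapping, effectively captures the geometric relationships between body parts and improves action recognition accuracy.

Strongest Point of the Argument

Unity of mathematical rigor and physical intuition: SE(3) is exactly the right mathematical tool for describing rigid-body motion, so the choice is not arbitrary but follows directly from physical constraints. The ablation study further confirms that the Lie group representation consistently outperforms the Euclidean one, making the chain of argument that "geometric structure helps" complete and convincing.

Weakest Point of the Argument

Dataset scale and limits of the era: the experimental datasets are small, and the scalability of the method is not adequately validated. In addition, the approximation quality of the logarithmic map far from the identity element and the manual selection of body part pairs are weaknesses that need further automation and validation. Later deep learning methods (such as ST-GCN) substantially outperformed this method on larger datasets.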