First-Person Activity Forecasting with Online Inverse Reinforcement Learning

Abstract

We address the problem of incrementally modeling and forecasting long-term goals of a first-person camera wearer. We present the Darko algorithm for Discovering Agent Rewards for K-futures Online. Darko learns and predicts semantic states, state transitions, rewards, and goals from streaming first-person video using Online Inverse Reinforcement Learning (IRL). In contrast to classical batch IRL approaches that require complete demonstrations, Darko discovers states, transitions, rewards, and goals from continuous observations. We provide a theoretical guarantee of no-regret performance and demonstrate the algorithm's effectiveness empirically on first-person activity datasets spanning extended temporal and spatial horizons.


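Paragraph function: Overview of the whole paper. It defines the problem (long-term goal forecasting) and introduces the core positioning of the Darko algorithm.

Logical role: The abstract covers both theory (the no-regret guarantee) and experiments (empirical validation), presenting the paper's dual contribution. The contrast between "online" and "batch" clearly marks out the novelty.

Argumentation technique / potential weakness: Emphasizing the no-regret guarantee as a theoretical result is highly persuasive, but the assumptions behind the guarantee (such as the linear structure of the reward) may not fully hold in complex real-world settings.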

1. Introduction

Wearable cameras provide a continuous stream of first-person visual data that captures the wearer's activities, interactions, and environment over extended periods. A key challenge is to develop AI systems that learn human intent and goals through continuous behavioral observation. This extends beyond simple trajectory forecasting to include reasoning about object interactions, scene-based goals, and semantic understanding. We present the first application of ideas from online learning theory and inverse reinforcement learning to the task of continuously learning human behavior models with a wearable camera. Our approach addresses fundamental questions: where will this person go, what will they do, and what semantic goals are they pursuing?


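Paragraph function: Establishes the research motivation. Starting from the practical setting of wearable cameras, it defines a three-level activity forecasting problem.

Logical role: The phrase "first application" establishes novelty, and the problem is lifted from the low level (trajectories) to the high level (semantic goals).

Argumentation technique / potential weakness: Framing the problem at three levels (where the person will go, what they will do, and what goals they are pursuing) is very appealing, but the levels differ greatly in difficulty. Whether the experiments answer all three questions equally well needs careful scrutiny.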

2. Related Work

Our work draws from three research domains: first-person vision (FPV), inverse reinforcement learning (IRL), and trajectory forecasting. In first-person vision, prior work focuses on activity recognition, object detection, and gaze prediction from egocentric video. In IRL, classical methods such as maximum entropy IRL learn reward functions from complete expert demonstrations in a batch setting, assuming access to full trajectories and known state spaces. Trajectory forecasting methods predict future physical positions but typically do not reason about object interactions or semantic goals. Our work distinguishes itself by going "beyond physical trajectory forecasting by reasoning over future object interactions and predicting future goals in terms of scene types."


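Paragraph function: Positions the work in the literature. It surveys related work across three fields and defines Darko's distinct place among them.

Logical role: By pointing out each field's limitations (the batch assumption, the lack of semantic reasoning), it constructs the precise research gap that Darko fills.

Argumentation technique / potential weakness: Integrating literature across fields shows a broad research perspective. However, criticizing one shortcoming from each of the three fields may oversimplify the progress made within each field.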

3. Method

3.1 Online IRL with Darko

We formulate the problem as a Markov Decision Process (MDP). States encode 3D position, held objects, and previous goal location: s = [x, y, z, o_1...o_|O|, h_1...h_|K|]. Actions include movement and object acquisition/release. State transitions are incrementally built from observed (s, a, s') triplets. The reward function is modeled as R(s, a; theta) = theta^T * phi(s, a), an inner product of features and learned weights. The Darko algorithm operates in three steps: (1) state tracking via SLAM for localization and action detection for hand-object interactions; (2) goal detection via velocity-based stop detection; (3) online IRL weight update after each episode using gradient descent on the maximum entropy IRL objective.


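Paragraph function: Core method definition. It establishes the MDP formulation and describes Darko's three-step pipeline.

Logical role: Turns the abstract "activity forecasting" problem into a precise mathematical framework. The three-step design cleanly separates perception (SLAM), semantics (goal detection), and learning (IRL update).

Argumentation technique / potential weakness: The MDP formulation gives the problem a rigorous mathematical structure. However, the state space mixes continuous (3D position) and discrete (held objects) elements, so the actual discretization strategy may affect the method's scalability.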
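
The state encoding, linear reward, and velocity-based stop test from Section 3.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the binary indicator encoding of held objects and previous goal, and the windowed average-speed threshold are all assumptions introduced here.

```python
import math


def encode_state(position, held_objects, last_goal, num_objects, num_goals):
    """Build s = [x, y, z, o_1..o_|O|, h_1..h_|K|] as a flat list.

    Assumption: held objects and the previous goal are encoded as 0/1 indicators.
    """
    x, y, z = position
    objects = [1.0 if i in held_objects else 0.0 for i in range(num_objects)]
    goals = [1.0 if i == last_goal else 0.0 for i in range(num_goals)]
    return [float(x), float(y), float(z)] + objects + goals


def reward(theta, phi):
    """Linear reward R(s, a; theta) = theta^T phi(s, a)."""
    if len(theta) != len(phi):
        raise ValueError("theta and phi must have the same length")
    return sum(t * p for t, p in zip(theta, phi))


def is_stopped(positions, dt, speed_threshold):
    """Velocity-based stop test (hypothetical rule).

    Returns True when the average speed over the window of positions,
    sampled every dt seconds, falls below speed_threshold.
    """
    if len(positions) < 2:
        return False
    path_length = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    elapsed = dt * (len(positions) - 1)
    return path_length / elapsed < speed_threshold
```

For example, `encode_state((1, 2, 0), {1}, 0, 3, 2)` yields a vector of length 3 + 3 + 2, with indicators set for the second object and the first goal slot.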

The gradient update uses the difference between expected feature counts and empirical feature counts of the current episode. We prove a regret bound of R_t <= 2B * sqrt(2td). Dividing by t gives an average regret of at most 2B * sqrt(2d / t), which approaches zero as more episodes are observed, guaranteeing no-regret convergence. This means Darko's cumulative prediction performance asymptotically matches the best fixed reward function in hindsight, even though it learns incrementally without access to future data.


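Paragraph function: Theoretical guarantee. It provides the convergence result for the Darko algorithm.

Logical role: This paragraph lifts Darko from a purely empirical method to an algorithm with theoretical grounding. No-regret guarantees are a standard quality measure in online learning theory and give the method solid theoretical standing.

Argumentation technique / potential weakness: The no-regret bound provides an asymptotic guarantee, but the constants in the bound (B, d) may lead to slow convergence at practical scales. In addition, the bound assumes a linear reward structure and may not apply to settings with nonlinear rewards.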
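
The per-episode update and the regret bound above can be illustrated with a small Python sketch. Several things here are assumptions rather than details from the summary: the expected feature counts are passed in precomputed (in practice they would come from the current policy), the step size is a free parameter, B is treated as a bound on the weight norm enforced by projection, and d as the feature dimension.

```python
import math


def online_irl_step(theta, expected_counts, empirical_counts, step_size, radius):
    """One projected gradient step on the episode's maximum entropy IRL loss.

    The gradient is (expected feature counts - empirical feature counts).
    Assumption: the result is projected back onto the ball ||theta|| <= radius.
    """
    grad = [e - f for e, f in zip(expected_counts, empirical_counts)]
    updated = [t - step_size * g for t, g in zip(theta, grad)]
    norm = math.sqrt(sum(v * v for v in updated))
    if norm > radius:
        updated = [v * radius / norm for v in updated]
    return updated


def regret_bound(B, t, d):
    """Cumulative regret bound R_t <= 2B * sqrt(2td)."""
    return 2.0 * B * math.sqrt(2.0 * t * d)


def average_regret_bound(B, t, d):
    """Bound on average regret, R_t / t <= 2B * sqrt(2d / t)."""
    return regret_bound(B, t, d) / t
```

With B = 1 and d = 2, the average-regret bound is 0.4 after 100 episodes and 0.04 after 10,000, which shows the 1/sqrt(t) decay behind the no-regret claim.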

3.2 Activity Forecasting

Given the learned reward function, we derive multiple forecasting capabilities. Goal prediction is formulated as posterior estimation: P(g | trajectory) is proportional to P(g) * exp(V_st(g) - V_s0(g)), encoding progress toward goals through value differences. This captures the intuition that if the agent has gotten closer to a goal (in terms of expected cumulative reward), that goal becomes more likely. Additional predictions include trajectory length forecasting and action-state subspace visitation, both derived from the core state visitation function. The state visitation function computes expected future state occupancies given partial trajectories, enabling probabilistic multi-modal forecasting of future behaviors.


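Paragraph function: Forecasting framework. It describes how multiple forecasting capabilities are derived from the learned reward function.

Logical role: This paragraph links the learning stage to the forecasting stage and shows the rich inferential capacity of the IRL framework: it can predict not only goals but also time of arrival and intermediate behavior.

Argumentation technique / potential weakness: Deriving goal prediction from Bayes' rule is both elegant and theoretically grounded. However, using the value difference as a progress indicator assumes independence between goals and may be inaccurate when multiple goals are interrelated.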
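
The goal posterior in Section 3.2 can be computed directly once per-goal values are available. The sketch below assumes the values V_st(g) and V_s0(g) have already been obtained elsewhere and are supplied as dictionaries keyed by goal; the function name and the goal labels in the usage note are hypothetical.

```python
import math


def goal_posterior(prior, value_now, value_start):
    """Normalize P(g) * exp(V_st(g) - V_s0(g)) over all goals g.

    prior, value_now, and value_start map each goal to P(g), V_st(g), V_s0(g).
    Scores are computed in log space and shifted by the maximum for stability.
    """
    log_scores = {}
    for goal, p in prior.items():
        if p > 0.0:
            log_scores[goal] = math.log(p) + value_now[goal] - value_start[goal]
    if not log_scores:
        raise ValueError("at least one goal needs a positive prior")
    peak = max(log_scores.values())
    weights = {g: math.exp(s - peak) for g, s in log_scores.items()}
    total = sum(weights.values())
    return {g: weights.get(g, 0.0) / total for g in prior}
```

With a uniform prior over two goals, say "kitchen" and "office", a value that rose by 1 for the kitchen and fell by 1 for the office moves most of the probability mass to the kitchen, matching the intuition that progress toward a goal makes it more likely.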

4. Experiments

We evaluate on a custom first-person continuous activity dataset across 5 environments (homes, offices, laboratory), featuring over 200 high-level activities, 250+ actions with 19 objects, and 17 scene types, spanning over 15 days of recording. Goal detection uses velocity-based stop detection, achieving 62-73% accuracy. Darko achieves a mean goal prediction probability of 0.378-0.667 across environments using visual detections, and 0.683-0.880 with ground-truth labels. Darko outperforms baselines including logistic regression, max-margin event detection (MMED), and RNN classifiers. Trajectory length prediction achieves a median relative error of 6.3-34.8%. We also empirically validate sublinear regret convergence, confirming the theoretical guarantee. Notably, incorporating object interactions substantially improves goal prediction compared to position-only representations.


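Paragraph function: Comprehensive experimental validation. It shows Darko's performance across multiple environments and metrics.

Logical role: The empirical support covers five dimensions: goal prediction, baseline comparison, trajectory prediction, theoretical validation, and ablation analysis (the contribution of object interactions).

Argumentation technique / potential weakness: Using a custom dataset allows longer-term recording but lacks comparability with public benchmarks. The large performance gap between visual detections and ground-truth labels (0.378 vs. 0.683) suggests that the system's bottleneck may lie in perception rather than reasoning.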
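
The median relative error reported for trajectory length prediction can be computed as follows. The summary does not give the exact formula, so this sketch assumes the common definition |predicted - actual| / actual with strictly positive actual lengths; the function name is illustrative.

```python
import statistics


def median_relative_error(predicted, actual):
    """Median of |predicted - actual| / actual over paired trajectory lengths.

    Assumption: every actual length is strictly positive.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual lengths must be positive")
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return statistics.median(errors)
```

Using the median rather than the mean keeps a few badly mispredicted long trajectories from dominating the score.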

5. Conclusion

We presented Darko, the first method for continuous first-person semantic behavior modeling at extended temporal and spatial horizons. By combining online inverse reinforcement learning with first-person visual perception, Darko learns to predict goals, trajectories, and actions from streaming egocentric video with theoretical guarantees of no-regret convergence. Our results demonstrate that reasoning beyond physical trajectories — incorporating object interactions and scene semantics — significantly enhances long-term activity forecasting. This work opens new directions at the intersection of online learning, decision-theoretic modeling, and first-person vision.


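Paragraph function: Summarizes the paper. It restates the core contributions and looks ahead to cross-disciplinary future directions.

Logical role: The conclusion concisely echoes the abstract's three claims: (1) novelty (the first method); (2) theory (the no-regret guarantee); (3) practicality (the benefit of semantic reasoning).

Argumentation technique / potential weakness: Pointing to cross-disciplinary intersections as the outlook is forward-looking. However, the paper does not adequately discuss scalability to larger settings with more people, or the potential effect of accumulated perception errors on long-term forecasting.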

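Argument Structure Overview

Problem: How can long-term semantic goals be forecast from first-person video?

→

Claim: Online inverse reinforcement learning enables incremental behavior modeling.

→

Evidence: Multi-environment experiments plus a theoretical no-regret guarantee.

→

Rebuttal: The method goes beyond trajectory forecasting by incorporating object interactions and semantics.

→

Conclusion: Cross-disciplinary integration opens new research directions.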

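Authors' Core Claim (One Sentence)

By incrementally learning a reward function from streaming first-person video with online inverse reinforcement learning, Darko can forecast the wearer's long-term semantic goals, trajectories, and actions with a theoretical guarantee.

Strongest Point of the Argument

The combination of theory and practice: the no-regret convergence guarantee gives the method a solid theoretical foundation, and the empirical validation of sublinear regret in the experiments confirms the theoretical prediction. Extending IRL from the batch setting to the online setting is a contribution that is clear in theory and meaningful in practice.

Weakest Point of the Argument

The perception bottleneck and scalability: the large performance gap between visual detections and ground-truth labels suggests that the system's ceiling is set by perception quality rather than by the reasoning framework. In addition, the custom dataset allows long-term recording but reduces comparability with other methods on public benchmarks.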