Robust Object Tracking with Online Multi-lifespan Dictionary Learning

Abstract

Sparse representation-based tracking methods have shown promising results, but a critical challenge remains in how to effectively manage and update the template dictionary during tracking. In this paper, we treat template updating as an online incremental dictionary learning problem and propose a multi-lifespan dictionary learning framework. The key idea is to maintain dictionary atoms with different lifespans — short-lived atoms that capture recent appearance changes, and long-lived atoms that preserve stable features. This strategy balances adaptability to appearance variations with robustness against drift. We develop both generative and discriminative observation models within a Bayesian particle filter framework. Experiments on ten challenging sequences demonstrate that our approach achieves state-of-the-art tracking performance.

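Paragraph function: Overview of the whole paper, moving from the template-management challenge to multi-lifespan dictionary learning and previewing the two observation models built within the particle filter framework.

Logical role: The abstract precisely defines the problem (template updating) and the solution (the multi-lifespan strategy), and uses the tension between adaptability and robustness as its core argumentative frame.

Argumentation technique / potential weaknesses: The short-lived versus long-lived metaphor is intuitively appealing and easy to grasp. However, the concrete lifespan settings (how many frames, and how atoms decay) must be rigorously defined in the method section.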

1. Introduction

Visual object tracking is a fundamental problem with broad applications in surveillance, autonomous driving, human-computer interaction, and activity recognition. A tracker must handle appearance changes due to illumination variation, pose change, partial occlusion, and background clutter. Sparse representation-based methods have gained popularity by representing the target as a sparse linear combination of dictionary templates. However, the effectiveness of these methods hinges on the quality of the template dictionary.

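Paragraph function: Establishes the research area by stating why visual tracking matters and identifying the core dependency of sparse representation methods.

Logical role: The starting point of the argument chain. Four types of challenge establish the difficulty of the problem, and the focus then narrows to the specific bottleneck of sparse methods: dictionary quality.

Argumentation technique / potential weaknesses: Attributing all tracking challenges to appearance change is a reasonable unifying view. However, sparse representation may not itself be the best representation; for example, a holistic template model (a PCA subspace) may be more robust in some scenarios.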

Existing template update strategies fall into two extremes: static dictionaries that never update and fully adaptive dictionaries that update every frame. Static dictionaries fail to capture appearance variations over time, while fully adaptive ones are prone to drift — gradually incorporating tracking errors into the template set. We propose a principled middle ground through multi-lifespan dictionary learning, where dictionary atoms are assigned different lifespans reflecting their temporal relevance and reliability. This enables the tracker to adapt to genuine appearance changes while maintaining a stable reference from the initial frames.

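Paragraph function: Identifies the dilemma between existing strategies, which motivates the multi-lifespan scheme.

Logical role: Positions the method as a middle ground between two extremes, a classic dialectical structure: thesis (static) -> antithesis (fully adaptive) -> synthesis (multi-lifespan).

Argumentation technique / potential weaknesses: The dichotomy is presented clearly and forcefully. However, calling the method a middle ground may oversimplify; the actual lifespan mechanism is far more complex than a compromise and requires precise mathematical modeling.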

2. Related Work

Sparse representation was first applied to tracking by Mei and Ling, who formulated target appearance as a sparse linear combination of template patches with trivial templates for occlusion handling. Incremental visual tracking (IVT) by Ross et al. maintains a PCA subspace representation that is incrementally updated. Online dictionary learning has been explored by Mairal et al. for general tasks but not specifically adapted for tracking with temporal coherence and drift prevention. Our approach combines the representational flexibility of sparse coding with a principled temporal management strategy that has no direct precedent in the tracking literature.
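
For reference, the sparse formulation attributed to Mei and Ling above is commonly written as the following non-negative L1-regularized least-squares problem. The symbols are assumptions introduced here for illustration: y is the vectorized candidate patch, T is the matrix of target templates, I is the identity matrix whose columns act as the trivial templates, a holds the target coefficients, and e^+ and e^- hold the positive and negative trivial coefficients that absorb occluded pixels.

```latex
\min_{c}\; \lVert y - B c \rVert_2^2 + \lambda \lVert c \rVert_1
\quad \text{s.t. } c \ge 0,
\qquad
B = [\, T,\; I,\; -I \,],
\quad
c = \begin{bmatrix} a \\ e^{+} \\ e^{-} \end{bmatrix}
```

A candidate is then scored by how well the target templates alone reconstruct it, that is, by the residual of y against T a.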

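Paragraph function: Literature review that positions the method at the intersection of sparse tracking and dictionary learning.

Logical role: Sets up three parallel lines of work (sparse tracking, incremental subspace learning, online dictionary learning) and points out that they have not yet been integrated along the dimension of temporal management.

Argumentation technique / potential weaknesses: The claim of no direct precedent establishes the method's originality. However, IVT's incremental PCA update is itself essentially a temporal management strategy, since it uses a forgetting factor to control how old information decays. The differentiation here may need a more precise statement.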

3. Method

3.1 Online Dictionary Learning

We formulate the template management problem as online dictionary learning. Given a sequence of observed target patches {y_1, y_2, ..., y_t}, we seek a dictionary D = [d_1, ..., d_K] such that each observation can be well approximated by a sparse combination: y_t ≈ D * alpha_t with ||alpha_t||_0 small. The dictionary is updated incrementally using block coordinate descent: at each frame, we fix the dictionary and solve for sparse codes, then update the dictionary atoms to minimize reconstruction error. This avoids the computational cost of retraining from scratch at each frame.
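
The alternating scheme described above (fix the dictionary and solve for sparse codes, then update the atoms) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the L0 constraint is replaced by an L1 penalty solved with ISTA, the atom update follows the accumulated-statistics form of block coordinate descent used in Mairal et al.'s online dictionary learning, and the function names and the values of `lam`, `K`, and `n_iter` are hypothetical.

```python
import numpy as np


def soft_threshold(x, thresh):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)


def sparse_code_ista(D, y, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||y - D a||^2 + lam*||a||_1 with ISTA.

    Assumption: an L1 relaxation stands in for the L0 constraint in the text.
    """
    lipschitz = np.linalg.norm(D, 2) ** 2 + 1e-12  # squared spectral norm of D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        a = soft_threshold(a - grad / lipschitz, lam / lipschitz)
    return a


def update_dictionary(D, A, B, eps=1e-10):
    """One block-coordinate-descent sweep over the atoms (columns of D).

    A accumulates alpha alpha^T and B accumulates y alpha^T over past frames.
    """
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] < eps:
            continue  # atom has never been activated; leave it unchanged
        u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)  # keep ||d_j|| <= 1
    return D


def online_dictionary_learning(patches, K=8, lam=0.1, seed=0):
    """Process patches (one row per frame) sequentially and return D."""
    rng = np.random.default_rng(seed)
    dim = patches.shape[1]
    D = rng.standard_normal((dim, K))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    A = np.zeros((K, K))
    B = np.zeros((dim, K))
    for y in patches:
        alpha = sparse_code_ista(D, y, lam)  # step 1: codes with D fixed
        A += np.outer(alpha, alpha)
        B += np.outer(y, alpha)
        D = update_dictionary(D, A, B)       # step 2: atoms with codes fixed
    return D
```

Because only the statistics `A` and `B` are carried between frames, the cost per frame does not grow with the number of past observations, which is the property the paragraph refers to when it says retraining from scratch is avoided.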

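Paragraph function: First step of the method derivation, defining the optimization framework for online dictionary learning.

Logical role: This is the mathematical foundation of the method. The L0 sparsity constraint keeps the representation compact, and block coordinate descent keeps the computation fast enough for real-time use.

Argumentation technique / potential weaknesses: The mathematical formulation is clear and standard. However, the L0 constraint is usually replaced by an L1 relaxation in practice, and the effect of the chosen sparsity level (how many nonzero coefficients) on tracking quality needs discussion.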

3.2 Multi-lifespan Strategy

The core innovation is the multi-lifespan management of dictionary atoms. We partition the dictionary into three groups: (1) permanent atoms initialized from the first frame that are never replaced, preserving the original target appearance; (2) long-lifespan atoms that are updated slowly with a large decay factor, capturing gradual appearance changes such as illumination drift; and (3) short-lifespan atoms that are frequently replaced with recent observations, handling rapid appearance changes such as pose variation. The combination of these three temporal scales ensures that the dictionary retains memory of the original target while remaining adaptive to both slow and fast appearance changes. The replacement strategy for short-lived atoms prioritizes atoms with the lowest activation frequency.
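
The three-group bookkeeping described above might look like the sketch below. Everything beyond the three groups and the lowest-activation-frequency replacement rule is an assumption made for illustration: the class name `MultiLifespanDictionary`, the group sizes, the decay value, seeding the long and short groups from the first permanent atom, the blending rule for long-lifespan atoms, and the hits-per-age definition of activation frequency are all hypothetical choices, not details given in the text.

```python
import numpy as np


def unit_columns(M):
    """Scale each column to unit L2 norm (all-zero columns are left as-is)."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    return M / np.where(norms > 0, norms, 1.0)


class MultiLifespanDictionary:
    """Dictionary split into permanent, long-lifespan, and short-lifespan atoms."""

    def __init__(self, first_frame_patches, n_long=3, n_short=3, decay=0.95):
        # Permanent atoms: taken from the first frame and never modified.
        self.permanent = unit_columns(np.asarray(first_frame_patches, dtype=float))
        seed = self.permanent[:, :1]
        # Assumption: long and short groups start as copies of the first atom.
        self.long = np.repeat(seed, n_long, axis=1)
        self.short = np.repeat(seed, n_short, axis=1)
        self.decay = decay                    # large decay -> slow change
        self.short_hits = np.ones(n_short)    # activation counts
        self.short_age = np.ones(n_short)     # frames since insertion

    @property
    def atoms(self):
        """Full dictionary, ordered [permanent | long | short]."""
        return np.hstack([self.permanent, self.long, self.short])

    def update(self, y, alpha, tol=1e-8):
        """Update with observation y and its sparse code alpha over `atoms`.

        Returns the index of the short-lifespan atom that was replaced.
        """
        y = unit_columns(np.asarray(y, dtype=float).reshape(-1, 1))
        n_p, n_l = self.permanent.shape[1], self.long.shape[1]
        a_long = np.abs(alpha[n_p:n_p + n_l])
        a_short = np.abs(alpha[n_p + n_l:])

        # Long-lifespan: nudge the most active atom slightly toward y.
        j = int(np.argmax(a_long))
        blended = self.decay * self.long[:, [j]] + (1.0 - self.decay) * y
        self.long[:, [j]] = unit_columns(blended)

        # Short-lifespan: replace the atom with the lowest activation frequency.
        self.short_age += 1
        self.short_hits += (a_short > tol)
        k = int(np.argmin(self.short_hits / self.short_age))
        self.short[:, [k]] = y
        self.short_hits[k] = 1.0  # a fresh atom starts with frequency 1
        self.short_age[k] = 1.0
        return k
```

Giving a freshly inserted atom an initial frequency of 1 is one way to stop it from being evicted again on the very next frame; other choices, such as a minimum residence time, would serve the same purpose.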

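Paragraph function: The core innovation, describing the three-level lifespan mechanism for dictionary management.

Logical role: This paragraph is the pillar of the paper's argument. The three lifespan levels (permanent / long / short) map directly onto three types of appearance change (none / gradual / abrupt). The design turns the intuitive opposition between stability and flexibility into an operational technical scheme.

Argumentation technique / potential weaknesses: The three-way partition is logically clear and intuitively sound. However, how the relative sizes of the three groups are set (how many atoms each) and how the decay factor value is chosen are hyperparameters that may need tuning for different scenarios. Using activation frequency as the replacement criterion assumes that low-frequency atoms are unimportant, yet during occlusion recovery a rarely used but valuable atom may be replaced by mistake.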

4. Experiments

We evaluate on ten challenging video sequences widely used in the tracking community, covering scenarios with heavy occlusion (FaceOcc, Girl), illumination change (Car4, David), fast motion (Deer, Jumping), and background clutter (Singer, Football). We compare against eight state-of-the-art trackers including IVT, L1-tracker, MIL, OAB, SemiBoost, FragTrack, TLD, and VTD. Using both center location error (CLE) and overlap rate as metrics, our method achieves the best average performance across all sequences. On the occlusion-heavy sequences, our method shows particularly strong robustness — the permanent atoms prevent catastrophic drift after occlusion recovery. Ablation studies confirm that removing any lifespan group (permanent, long, or short) degrades performance, validating the necessity of the three-level temporal hierarchy.
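
The two metrics named above can be computed as in the following sketch. It assumes boxes are given as (x, y, width, height) with (x, y) the top-left corner and positive area, and it uses the common intersection-over-union definition of overlap rate; the text does not specify either convention, and the function names are illustrative.

```python
import numpy as np


def center_location_error(pred, gt):
    """Euclidean distance in pixels between predicted and ground-truth box centers."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_center = pred[..., :2] + pred[..., 2:] / 2.0
    gt_center = gt[..., :2] + gt[..., 2:] / 2.0
    return np.linalg.norm(pred_center - gt_center, axis=-1)


def overlap_rate(pred, gt):
    """Intersection area divided by union area for each pair of boxes."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    x1 = np.maximum(pred[..., 0], gt[..., 0])
    y1 = np.maximum(pred[..., 1], gt[..., 1])
    x2 = np.minimum(pred[..., 0] + pred[..., 2], gt[..., 0] + gt[..., 2])
    y2 = np.minimum(pred[..., 1] + pred[..., 3], gt[..., 1] + gt[..., 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter
    return inter / union


# Two frames: a perfect prediction and one shifted by 5 pixels in x and y.
pred_boxes = [[0, 0, 10, 10], [5, 5, 10, 10]]
gt_boxes = [[0, 0, 10, 10], [0, 0, 10, 10]]
cle = center_location_error(pred_boxes, gt_boxes)
overlap = overlap_rate(pred_boxes, gt_boxes)
```

Averaging `cle` and `overlap` over all frames of a sequence gives per-sequence scores of the kind compared across trackers.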

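Paragraph function: Provides comprehensive experimental evidence, validating the method across a variety of challenging scenarios.

Logical role: The empirical pillar: (1) first place overall; (2) a particular advantage in occlusion scenarios (echoing the design motivation for permanent atoms); (3) three-fold confirmation from the ablation study.

Argumentation technique / potential weaknesses: The particular advantage in occlusion scenarios is strong supporting evidence, because it directly validates the design intent that permanent atoms prevent drift. However, although a ten-sequence evaluation was standard at the time, its statistical robustness is limited. In addition, the margin of the lead on individual sequences deserves attention: if the method leads significantly on only some sequences and merely ties on others, the generality of the overall advantage is open to question.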

5. Conclusion

We have presented a multi-lifespan dictionary learning framework for robust visual object tracking. By partitioning the dictionary into permanent, long-lifespan, and short-lifespan atoms, our method effectively balances the stability-adaptability trade-off that plagues sparse representation-based trackers. The online incremental learning procedure ensures real-time feasibility, and the multi-lifespan strategy provides principled protection against model drift. Experimental results on challenging sequences validate the approach. Future directions include extending the framework to deep feature representations and developing adaptive lifespan allocation that automatically determines the optimal temporal scale for each atom.

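Paragraph function: Summarizes the paper, restating the contributions around the stability-adaptability balance and outlining two concrete future directions.

Logical role: The conclusion echoes the balance theme of the abstract. The proposed adaptive lifespan allocation acknowledges the limitation of manually set lifespans and is a constructive response to the method's weakness.

Argumentation technique / potential weaknesses: The conclusion is concise and forceful. However, it does not discuss the method's long-term competitiveness against deep learning trackers (such as the CNN trackers that appeared after 2014). The proposed adaptive lifespan allocation itself implies the limitation of fixed lifespan settings: different sequences may need entirely different lifespan configurations.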

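Overview of the Argument Structure

Problem: Static dictionaries cannot adapt; fully adaptive dictionaries drift easily.
→ Claim: A three-level lifespan dictionary balances stability and adaptability.
→ Evidence: Best average performance on ten sequences.
→ Rebuttal: Ablation confirms that all three levels are indispensable.
→ Conclusion: The multi-lifespan strategy is effective and resists model drift.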

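The Authors' Core Claim (in One Sentence)

By partitioning dictionary atoms into permanent, long-lifespan, and short-lifespan groups and managing each group separately, an online tracker can simultaneously maintain a stable memory of the original target and adapt flexibly to appearance changes.

Strongest Point of the Argument

Robustness in occlusion recovery: The permanent-atom design shows a clear advantage in occlusion recovery scenarios. When short-lifespan atoms are contaminated during occlusion, the permanent atoms still retain the original target appearance and effectively prevent catastrophic drift. The ablation experiments provide rigorous validation that the three-level structure is necessary.

Weakest Point of the Argument

Hyperparameter sensitivity and evaluation scale: The relative sizes of the three lifespan groups, the decay factor, and the replacement frequency are all set manually, and generalization across scenarios is not sufficiently verified. Ten test sequences is a statistically thin evaluation, and a comparison on the larger contemporaneous OTB benchmark is missing.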