Online Object Tracking: A Benchmark

Abstract

Object tracking is one of the most important problems in computer vision, yet there is no established standard benchmark for evaluating tracking algorithms. Existing evaluations are fragmented, using different sequences, metrics, and protocols, making fair comparison difficult. In this paper, we present a large-scale benchmark for online single-object tracking that includes 50 fully annotated sequences with 11 attributes covering major challenges such as illumination variation, scale change, occlusion, deformation, and fast motion. We evaluate 29 tracking algorithms using one-pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE), and provide comprehensive analysis of tracker performance under each attribute.

Paragraph function: Overview of the whole paper. It identifies the lack of a benchmark as the problem and proposes OTB as the standardized solution.

Logical role: The abstract acts as both diagnosis and prescription: it first diagnoses the fragmented state of tracking evaluation, then prescribes the OTB benchmark.

Argumentative technique / potential weakness: The figures "50 sequences, 29 algorithms" convey the scale of the benchmark. But are 50 sequences enough to cover every difficult tracking scenario? The later release of OTB-100 indirectly acknowledges the limited scale of this version.

1. Introduction

The field of visual object tracking has seen rapid progress over the past decade, with dozens of new algorithms proposed each year. However, this progress is difficult to quantify objectively because researchers often evaluate on their own selected sequences with different evaluation metrics. A paper may report improvements on 5-10 sequences that happen to favor its method, while failing on sequences not shown. This "evaluation crisis" hinders scientific progress and makes it impossible for practitioners to choose the right tracker for their application.

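Paragraph function: Problem diagnosis. The term "evaluation crisis" pointedly names a systemic problem in the field.

Logical role: It gathers scattered evaluation problems (different sequences, different metrics, selective reporting) into a single "crisis", which raises the sense of urgency.

Argumentative technique / potential weakness: The "evaluation crisis" rhetoric is a strong rallying call and effectively draws the community's attention to benchmark building. However, it implicitly criticizes the evaluation practices of existing researchers and may provoke pushback.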

In the object detection community, the PASCAL VOC challenge has served as a standard benchmark since 2005, enabling fair comparison and measurable year-over-year progress. Similarly, the Middlebury benchmark has driven advances in optical flow and stereo. The tracking community lacks an equivalent benchmark. Our goal is to fill this gap by providing a comprehensive, standardized evaluation platform for visual tracking.

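Paragraph function: Argument by analogy. Successful benchmarks in other fields serve as examples of the value of standardized evaluation.

Logical role: PASCAL VOC and Middlebury are used as positive examples that imply a causal link ("a benchmark brings progress"), which justifies building a tracking benchmark.

Argumentative technique / potential weakness: The analogy is effective but imperfect: detection and tracking are different tasks, and the success of a detection benchmark does not guarantee that the same model carries over to tracking, which involves distinct issues such as temporal dependence and sensitivity to initialization.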

Prior tracking evaluations include PETS workshops (focused on surveillance), the VOT challenge (initiated in 2013, concurrent with this work), and various ad hoc evaluations in individual papers. PETS primarily addresses multi-object tracking in fixed-camera scenarios, which is quite different from the generic single-object tracking problem we address. The Visual Tracker Benchmark by Babenko et al. evaluates a small number of trackers on limited sequences without systematic attribute annotation. Our benchmark improves on these efforts by combining large-scale evaluation, detailed per-attribute analysis, and multiple robustness evaluation protocols.

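Paragraph function: Literature review. It surveys existing evaluation efforts and points out the shortcomings of each.

Logical role: The contrast with PETS, VOT, and Babenko et al. highlights OTB's distinctive contribution: attribute annotation plus multi-protocol evaluation.

Argumentative technique / potential weakness: The relationship with the concurrent VOT challenge is handled gracefully: its existence is acknowledged while a different design philosophy is emphasized. However, the coexistence of the two benchmarks has also split the tracking community's evaluation standards.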

3. Benchmark Design

3.1 Dataset and Attributes

Our benchmark consists of 50 sequences selected to cover a wide variety of tracking challenges. Each sequence is annotated with frame-level bounding boxes and tagged with 11 attributes: illumination variation (IV), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-plane rotation (OPR), out-of-view (OV), background clutter (BC), and low resolution (LR). Each sequence may have multiple attributes, enabling fine-grained analysis of which challenges cause which trackers to fail.

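Paragraph function: Dataset description. It defines the 11 attributes that make tracking difficult.

Logical role: Attribute annotation is one of OTB's core innovations: the benchmark not only measures overall performance but can also pinpoint where a tracker fails.

Argumentative technique / potential weakness: The choice of 11 attributes is comprehensive and intuitive. However, correlations between attributes are not adequately handled. Fast motion, for example, often comes with motion blur, so a single-attribute analysis may be confounded.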
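
The attribute tagging described above can be sketched as a mapping from sequence names to sets of attribute abbreviations. The snippet below is a minimal illustration, not the benchmark's actual annotation format: the sequence names, their tags, and the helper names `sequences_with` and `co_occurrence` are all hypothetical. It selects the subset of sequences for one attribute and counts how often two attributes appear together, which is one simple way to look for the confounding between attributes mentioned above.

```python
from collections import Counter
from itertools import combinations

# The 11 attribute abbreviations listed in the benchmark description.
ATTRIBUTES = ("IV", "SV", "OCC", "DEF", "MB", "FM", "IPR", "OPR", "OV", "BC", "LR")

# Hypothetical sequences and tags, invented purely for illustration.
SEQUENCE_TAGS = {
    "seq_a": {"IV", "SV", "OCC"},
    "seq_b": {"FM", "MB"},
    "seq_c": {"FM", "MB", "OV"},
    "seq_d": {"DEF", "OCC", "BC"},
}


def sequences_with(attribute, tags=SEQUENCE_TAGS):
    """Return the sorted names of sequences tagged with the given attribute."""
    if attribute not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    return sorted(name for name, attrs in tags.items() if attribute in attrs)


def co_occurrence(tags=SEQUENCE_TAGS):
    """Count how often each pair of attributes is tagged on the same sequence."""
    counts = Counter()
    for attrs in tags.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(attrs), 2):
            counts[pair] += 1
    return counts


print(sequences_with("FM"))
print(co_occurrence().most_common(3))
```

In this toy data, fast motion and motion blur are tagged together on every sequence where either appears, so a per-attribute score for one cannot be separated from the other.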

3.2 Evaluation Methodology

We propose three complementary evaluation protocols. One-Pass Evaluation (OPE) runs each tracker once from the ground-truth initial frame — this is the standard protocol. Temporal Robustness Evaluation (TRE) initializes trackers from 20 different starting frames throughout each sequence, measuring sensitivity to the temporal starting point. Spatial Robustness Evaluation (SRE) perturbs the initial bounding box with shifts and scale changes, measuring sensitivity to initialization accuracy. Performance is reported via success plots (overlap threshold vs. success rate) and precision plots (location error threshold vs. precision), with area under the curve (AUC) as the primary ranking metric.
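
The success and precision plots described above can be sketched as follows. This is a minimal illustration and not the benchmark's released evaluation code. The `(x, y, w, h)` box format, the function names, the strict and inclusive comparisons, and approximating the area under the curve by the mean of evenly sampled values are all assumptions.

```python
import math


def iou(box_a, box_b):
    """Overlap (intersection over union) of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance in pixels between the two box centers."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def success_curve(predicted, truth, thresholds):
    """Fraction of frames whose overlap exceeds each overlap threshold."""
    overlaps = [iou(p, t) for p, t in zip(predicted, truth)]
    return [sum(o > th for o in overlaps) / len(overlaps) for th in thresholds]


def precision_curve(predicted, truth, thresholds):
    """Fraction of frames whose center error is within each pixel threshold."""
    errors = [center_error(p, t) for p, t in zip(predicted, truth)]
    return [sum(e <= th for e in errors) / len(errors) for th in thresholds]


def auc(curve):
    """Approximate the area under a curve sampled on an evenly spaced
    threshold grid by the mean of the sampled values."""
    return sum(curve) / len(curve)


# Toy example: four frames, one perfect, one shifted, one perfect, one lost.
truth = [(0, 0, 10, 10)] * 4
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10), (20, 20, 10, 10)]
overlap_thresholds = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
print(auc(success_curve(predicted, truth, overlap_thresholds)))
print(precision_curve(predicted, truth, [20]))
```

Ranking by the area under the success curve avoids committing to a single overlap threshold, since a tracker that looks good at one threshold can look poor at another.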

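Paragraph function: Definition of the evaluation protocols. Three complementary protocols give a comprehensive measure of tracker robustness.

Logical role: OPE, TRE, and SRE measure standard performance, temporal stability, and spatial stability respectively, forming a multi-dimensional evaluation framework.

Argumentative technique / potential weakness: TRE and SRE respond to the known problem that trackers are sensitive to initialization, and the methodology is rigorous. Whether 20 starting frames are sufficient, and whether the chosen perturbation range is reasonable, still needs further validation.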
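
The two robustness protocols differ only in how a run is initialized, which the following sketch tries to make concrete. It is an illustration under stated assumptions: the text gives 20 starting frames for TRE but does not say how they are spaced, so even spacing is assumed here. The SRE shift fraction, shift directions, and scale factors are likewise illustrative parameters, and the function names are hypothetical.

```python
def tre_start_frames(num_frames, num_starts=20):
    """Evenly spaced 0-based start frames for temporal robustness runs.

    Each run would start at one of these frames, initialized with the
    ground-truth box of that frame, and continue to the end of the sequence.
    """
    if num_frames < num_starts:
        raise ValueError("sequence is shorter than the number of starts")
    return [(i * num_frames) // num_starts for i in range(num_starts)]


def sre_initial_boxes(box, shift_fraction=0.1, scales=(0.8, 0.9, 1.1, 1.2)):
    """Perturbed copies of an (x, y, w, h) box: shifted boxes, then scaled ones.

    Shifts move the box by a fraction of its width and height in eight
    directions; scale changes resize it about its center.
    """
    x, y, w, h = box
    dx, dy = shift_fraction * w, shift_fraction * h
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (1, -1), (-1, 1), (1, 1)]
    shifted = [(x + sx * dx, y + sy * dy, w, h) for sx, sy in directions]
    cx, cy = x + w / 2, y + h / 2  # keep the center fixed while scaling
    scaled = [(cx - s * w / 2, cy - s * h / 2, s * w, s * h) for s in scales]
    return shifted + scaled


print(tre_start_frames(100))
print(len(sre_initial_boxes((10, 10, 20, 40))))
```

A tracker's TRE or SRE score would then be the average of its scores over all runs started this way, so a method that only works from a clean first-frame box is penalized.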

4. Experimental Results

We evaluate 29 tracking algorithms spanning diverse approaches: generative methods (IVT, L1APG), discriminative methods (Struck, SCM), sparse representation (ASLA), correlation filters (CSK, MOSSE), and part-based models (DPM-based). Under OPE, Struck achieves the highest overall AUC of 0.474, followed by SCM (0.453) and ASLA (0.434). The attribute-based analysis reveals that no single tracker dominates across all attributes: Struck excels on fast motion and motion blur, while SCM performs best on background clutter and occlusion.

Paragraph function: Presentation of the core findings. It shows the overall tracker ranking and the per-attribute performance.

Logical role: The finding that no single tracker is best is the paper's most important empirical conclusion, and it demonstrates the need for attribute-based analysis.

Argumentative technique / potential weakness: The breadth of the comparison is impressive. However, the top AUC of 0.474 means that even the best tracker's success rate, averaged over all overlap thresholds, stays below one half, which reflects the inherent difficulty of tracking.
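
To interpret an AUC figure such as 0.474, it helps to see what the area under a success plot measures. The short derivation below is a sketch under two assumptions: the success rate at threshold $t$ is the fraction of frames whose overlap exceeds $t$, and the thresholds span $[0, 1]$. The symbols $S(t)$, $o_i$, and $N$ (success rate, overlap in frame $i$, number of frames) are introduced here for illustration.

```latex
\begin{align*}
S(t) &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[o_i > t], \qquad o_i \in [0, 1],\\
\mathrm{AUC} &= \int_0^1 S(t)\,dt
  = \frac{1}{N}\sum_{i=1}^{N} \int_0^1 \mathbf{1}[o_i > t]\,dt
  = \frac{1}{N}\sum_{i=1}^{N} o_i .
\end{align*}
```

Under these assumptions the AUC equals the mean per-frame overlap, so a score of 0.474 reads as an average overlap of roughly 0.47 rather than as a success rate at any one threshold.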

The TRE and SRE evaluations provide additional insights. Under TRE, rankings remain largely consistent with OPE, suggesting that top trackers are robust to the temporal starting position. Under SRE, some trackers are markedly sensitive to bounding box perturbation. In particular, generative trackers such as IVT degrade more than discriminative trackers when the initialization is noisy. We also observe that speed-accuracy trade-offs vary widely: very fast trackers such as MOSSE (615 FPS) score much lower than Struck (20 FPS) but run roughly 30 times faster.

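Paragraph function: In-depth robustness analysis. It reveals how differently trackers respond to initialization.

Logical role: The TRE and SRE results validate multi-protocol evaluation: looking at OPE alone would miss differences in robustness.

Argumentative technique / potential weakness: The finding that generative trackers are more sensitive to initialization offers practical guidance. However, the paper does not analyze in depth why discriminative methods are more robust, and so misses an opportunity for a theoretical contribution.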
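
The claim that rankings stay "largely consistent" between protocols can be checked with a rank correlation. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties) as one possible measure. The text does not say the authors used this statistic, and the tracker names and scores are placeholders invented for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same trackers.

    Returns a value in [-1, 1]; 1 means both protocols order every pair of
    trackers the same way. Tied pairs count as neither agreeing nor
    disagreeing.
    """
    names = sorted(scores_a)
    if names != sorted(scores_b):
        raise ValueError("both protocols must score the same trackers")
    if len(names) < 2:
        raise ValueError("need at least two trackers to compare rankings")
    concordant = discordant = 0
    for first, second in combinations(names, 2):
        product = ((scores_a[first] - scores_a[second])
                   * (scores_b[first] - scores_b[second]))
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / total_pairs


# Placeholder scores, not results from the benchmark.
ope = {"tracker_a": 0.47, "tracker_b": 0.45, "tracker_c": 0.43, "tracker_d": 0.40}
tre = {"tracker_a": 0.50, "tracker_b": 0.51, "tracker_c": 0.46, "tracker_d": 0.42}
print(kendall_tau(ope, tre))
```

In this toy case only one of the six tracker pairs swaps order between the two protocols, so the correlation stays well above zero. A value near zero or below would indicate that the two protocols rank the trackers very differently.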

5. Conclusion

We have presented a comprehensive benchmark for online single-object tracking with 50 sequences, 11 attributes, and 29 evaluated trackers. Our findings suggest that the choice of tracker should be guided by the specific challenges of the application scenario, as no single method excels in all situations. We release our benchmark including all sequences, annotations, evaluation code, and tracker results to the community, and hope it will serve as a standard platform for future tracking research.

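Paragraph function: Summary and release. It restates the contributions and stresses that the resources are public.

Logical role: Closing on the public release secures the benchmark's lasting influence. The conclusion that there is no single best tracker is concise and forceful.

Argumentative technique / potential weakness: Releasing all resources is exemplary academic practice and directly helped OTB become the de facto standard. OTB remains one of the most frequently cited benchmarks in tracking.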

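Argument Structure Overview

Problem: The tracking field lacks a standardized evaluation benchmark

→

Claim: A systematic benchmark with multiple attributes and multiple protocols

→

Evidence: Evaluation of 29 trackers on 50 sequences

→

Rebuttal: No single tracker is best; the choice depends on the scenario

→

Conclusion: Public resources that become a standard platform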

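Authors' core claim (one sentence)

A standardized benchmark covering 50 sequences, 11 difficulty attributes, and three evaluation protocols can systematically reveal the strengths and weaknesses of tracking algorithms and give the tracking field a fair platform for comparison.

Strongest point of the argument

Pioneering per-attribute analysis: annotating 11 attributes let researchers diagnose precisely, for the first time, the conditions under which a tracker fails. The finding that no single tracker is best overturned the idea of a simple performance ranking and pushed the community toward more fine-grained algorithm design. Releasing the benchmark in full further secured its long-term influence.

Weakest point of the argument

Limited scale and diversity: 50 sequences are more than earlier evaluations used, but still few compared with modern tracking benchmarks such as LaSOT with its 1,400 sequences. In addition, the sequences come mainly from common scenes and represent specialized domains (medical imaging, satellite tracking, and so on) poorly. Collinearity between attributes is not adequately handled.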