Abstract
Machine learning based approaches to tracking have typically addressed the problem by learning a model of the tracked object's appearance online during test time. This online learning approach limits the expressiveness of the model and requires careful engineering. We propose a tracker that learns to track objects in a purely offline manner, using a deep regression network trained on a large set of videos with tracking annotations. Our tracker, GOTURN (Generic Object Tracking Using Regression Networks), does not perform any online model updating, and thus tracks at 100 frames per second. Despite its simplicity and speed, GOTURN outperforms state-of-the-art online learning approaches on the VOT2014 benchmark. We also show that by training on a large and diverse dataset, the tracker learns to generalize to track novel objects without further training.
Paragraph function: states the core claim of offline regression-based tracking.
Logical role: challenges the conventional assumption that tracking requires online learning, establishing the feasibility of offline tracking.
Argumentation technique / potential weakness: the 100 FPS speed and the simplicity of offline training combine into a strong practicality argument.
1. Introduction
Object tracking is one of the fundamental problems in computer vision. Most state-of-the-art trackers learn a model of the object's appearance online, updating the model each frame. This online learning is computationally expensive and limits the tracker's speed. Furthermore, online learning typically uses only the current video as training data, which limits the capacity of the learned model. We take a fundamentally different approach: we train a neural network offline to learn a generic relationship between an object's appearance and its motion. At test time, the network takes as input a crop of the previous frame (centered on the object) and a crop of the current frame (a search region), and directly outputs the bounding box coordinates of the object in the current frame.
Paragraph function: details the mechanics of offline regression tracking.
Logical role: draws a clear contrast between the strengths and weaknesses of online vs. offline approaches.
Argumentation technique / potential weakness: "directly outputs the bounding box coordinates" reduces tracking to an intuitive regression task.
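The crop-and-regress test-time pipeline described above can be sketched in plain Python: compute a padded search region around the previous frame's box, then map the network's prediction (assumed here to be in normalized search-region coordinates) back to full-frame coordinates. The 2x context factor and the normalization convention are illustrative assumptions, not details fixed by the text.

```python
def search_region(box, img_w, img_h, context=2.0):
    """Padded search window around the previous-frame box.

    `box` is (x1, y1, x2, y2). The window is `context` times the box
    size, centered on the box and clipped to the image; the factor of
    2 is an assumption for illustration.
    """
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    w, h = (box[2] - box[0]) * context, (box[3] - box[1]) * context
    return (max(0.0, cx - w / 2.0), max(0.0, cy - h / 2.0),
            min(float(img_w), cx + w / 2.0), min(float(img_h), cy + h / 2.0))

def to_frame_coords(pred, region):
    """Map a box predicted in normalized [0, 1] search-region
    coordinates back to full-frame pixel coordinates."""
    rx1, ry1, rx2, ry2 = region
    rw, rh = rx2 - rx1, ry2 - ry1
    return (rx1 + pred[0] * rw, ry1 + pred[1] * rh,
            rx1 + pred[2] * rw, ry1 + pred[3] * rh)

# Example: a 100x100 box yields a 200x200 search window, and a
# centered prediction maps back to the original box location.
region = search_region((100, 100, 200, 200), 640, 480)
recovered = to_frame_coords((0.25, 0.25, 0.75, 0.75), region)
```

At test time this loop repeats: the predicted box becomes the next frame's target crop, so no model parameters are ever updated.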
2. Method
GOTURN uses a two-branch CNN architecture. Both branches use the first five convolutional layers of CaffeNet (similar to AlexNet) pre-trained on ImageNet. The previous frame crop (target patch) and current frame crop (search region) are processed by the two branches independently, and their features are concatenated and passed through three fully-connected layers. The output is a 4-dimensional vector representing the bounding box coordinates (top-left and bottom-right corners) of the object in the search region. The entire network is trained end-to-end with an L1 loss on the bounding box coordinates. This direct regression approach eliminates the need for proposal generation or classification-based detection.
Paragraph function: describes the design of the two-branch regression architecture in detail.
Logical role: reduces tracking to a regression task: given crops of the previous and current frames, directly regress the bounding box.
Argumentation technique / potential weakness: the design is highly intuitive (two-branch comparison plus fully-connected regression), but the fully-connected layers limit adaptability to the search-region size.
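The data flow through this architecture can be sketched in NumPy with toy dimensions: random projections stand in for the pre-trained CaffeNet convolutional layers, and `PATCH`, `FEAT`, and `HIDDEN` are illustrative sizes, not the real layer widths. Only the structure (shared branches, concatenation, three fully-connected layers, 4-d output) mirrors the text.

```python
import numpy as np

# Toy dimensions; the real network takes 227x227 crops and produces
# much larger CaffeNet conv5 feature maps.
PATCH, FEAT, HIDDEN = 32 * 32 * 3, 128, 64

rng = np.random.default_rng(0)
W_conv = rng.normal(0, 0.01, (PATCH, FEAT))       # shared "conv" branch
W_fc1 = rng.normal(0, 0.01, (2 * FEAT, HIDDEN))   # after concatenation
W_fc2 = rng.normal(0, 0.01, (HIDDEN, HIDDEN))
W_fc3 = rng.normal(0, 0.01, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.01, (HIDDEN, 4))          # (x1, y1, x2, y2)

def relu(x):
    return np.maximum(x, 0.0)

def goturn_forward(target_patch, search_patch):
    """Structural sketch of the forward pass: two branches with shared
    weights, feature concatenation, three fully-connected layers, and
    a 4-d bounding-box regression output."""
    f_target = relu(target_patch.reshape(-1) @ W_conv)
    f_search = relu(search_patch.reshape(-1) @ W_conv)
    h = np.concatenate([f_target, f_search])
    h = relu(h @ W_fc1)
    h = relu(h @ W_fc2)
    h = relu(h @ W_fc3)
    return h @ W_out  # box in search-region coordinates
```

Because the output head is a fixed fully-connected stack, the input crops must always be resized to the same resolution, which is the adaptability limitation noted above.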
3. Training
We train GOTURN on the ALOV300 video tracking dataset combined with the ImageNet detection dataset. Using ImageNet detection data is crucial: we generate synthetic "tracking" pairs by applying random transformations to detection images, which greatly increases the diversity and volume of training data. The network is trained using stochastic gradient descent with a batch size of 50. A key training detail is the data augmentation strategy of random translation and scale changes to simulate the motion patterns that occur during tracking. Training takes approximately 2 days on a single GPU.
Paragraph function: explains the training-data strategy and synthetic pair generation.
Logical role: solves the key challenge of training an offline tracker: how to obtain enough training pairs.
Argumentation technique / potential weakness: generating synthetic tracking pairs from ImageNet detection data is a clever data-augmentation strategy.
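A sketch of how one annotated detection image can be turned into a synthetic "tracking" pair by jittering its bounding box with random translation and scale changes. The Laplace-distributed perturbations and the `shift_scale`/`size_scale` parameters are assumptions for illustration, not the paper's exact motion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_pair(box, img_w, img_h, shift_scale=0.2, size_scale=0.1):
    """Make a synthetic tracking pair from a single annotated image:
    the original box plays the 'previous frame', and a randomly
    translated and rescaled copy plays the 'current frame'."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    # Random translation of the center and log-scale size change,
    # simulating inter-frame object motion.
    cx += w * rng.laplace(0.0, shift_scale)
    cy += h * rng.laplace(0.0, shift_scale)
    w *= np.exp(rng.laplace(0.0, size_scale))
    h *= np.exp(rng.laplace(0.0, size_scale))
    # Clip the jittered box back into the image.
    nx1 = np.clip(cx - w / 2.0, 0.0, img_w - 1.0)
    ny1 = np.clip(cy - h / 2.0, 0.0, img_h - 1.0)
    nx2 = np.clip(cx + w / 2.0, nx1 + 1.0, float(img_w))
    ny2 = np.clip(cy + h / 2.0, ny1 + 1.0, float(img_h))
    return box, (float(nx1), float(ny1), float(nx2), float(ny2))
```

Applying this to every ImageNet detection annotation yields far more training pairs than video data alone, which is the point the paragraph makes about data diversity and volume.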
4. Experiments
We evaluate GOTURN on VOT2014 and OTB benchmarks. On VOT2014, GOTURN achieves an accuracy of 0.62 and a robustness score of 2.02, outperforming all compared online trackers. The speed advantage is dramatic: GOTURN runs at 100 FPS on GPU, which is at least 5x faster than any competing deep tracker. On the OTB benchmark, GOTURN achieves competitive performance despite no online model updates. The tracker struggles with extreme appearance changes and long-term occlusions, which is expected given the absence of online adaptation. However, for applications requiring real-time performance with good accuracy, GOTURN represents an excellent trade-off.
Paragraph function: reports benchmark results and candidly acknowledges limitations.
Logical role: builds a balanced evaluation from the speed advantage and an honest analysis of limitations.
Argumentation technique / potential weakness: proactively admitting weaknesses (occlusion, drastic appearance change) strengthens the paper's credibility.
5. Conclusions
We have presented GOTURN, a tracker that learns offline to track generic objects through deep regression, achieving 100 FPS with no online updating. By leveraging large-scale offline training data, GOTURN learns a general-purpose motion model that generalizes to novel objects. This work demonstrates that offline learning can be a viable and efficient alternative to online learning for visual tracking.
Paragraph function: summarizes the value of the offline-tracking paradigm.
Logical role: legitimizes offline tracking by positioning it as an "alternative".
Argumentation technique / potential weakness: framing the method as an "alternative" rather than a "replacement" aptly reflects its scope of applicability.
Argument Structure Overview

Problem: online-learning trackers are too slow ➔ Claim: offline deep-regression tracking ➔ Evidence: 100 FPS + best on VOT2014 ➔ Rebuttal: struggles with occlusion and drastic appearance change ➔ Conclusion: offline tracking is a viable approach

Core claim
A deep regression network trained offline on large-scale data learns a generic motion model that tracks objects in real time at 100 FPS without any online updating.

Strongest argument
Running at 100 FPS, at least 5x faster than competing deep trackers, while outperforming online methods on VOT2014, convincingly demonstrates that the offline paradigm is viable.

Weakest link
Without online adaptation, the tracker is unstable under long-term occlusion and extreme appearance change, which limits its applicability in complex scenes.