Abstract
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when using deep features for tracking, training is typically done using a classification-based approach with stochastic gradient descent, which is too slow for real-time tracking. We equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time (86 FPS) and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
Paragraph function: defines the problem, critiques existing methods, and introduces the Siamese-network solution.
Logical role: builds the core argument on a triple advantage of simplicity, real-time speed, and state-of-the-art performance.
Rhetorical technique / potential weakness: the qualifier "extreme simplicity" sets low expectations, making the subsequent strong results more striking.
1. Introduction
We propose to learn a similarity function offline using large datasets, and then apply this function at test time to track arbitrary objects. The core idea is to train a Siamese network that takes two image patches as input — an exemplar (the target template) and a search region — and produces a scalar-valued score that indicates how similar the two patches are. At test time, the exemplar is cropped from the first frame using the ground-truth bounding box, and the search region is extracted from subsequent frames. The position of maximum similarity in the output score map indicates the estimated target location. This approach sidesteps the need for online model updating via SGD, which is the bottleneck of most deep learning based trackers.
Paragraph function: details the core idea and test-time operation of Siamese tracking.
Logical role: establishes the tracking paradigm of offline similarity learning plus online template matching.
Rhetorical technique / potential weakness: cleanly recasts tracking as similarity learning, making the solution elegant and intuitive.
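To make the test-time procedure concrete, the step from "position of maximum similarity in the score map" to an image-space target location can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `total_stride` value and the function names are assumptions for the example.

```python
import numpy as np

def track_step(score_map, search_center, total_stride=8):
    """Convert the peak of a similarity score map into a new target
    position in image coordinates.

    score_map:     2-D array of similarity scores over the search region.
    search_center: (x, y) center of the search region in the image.
    total_stride:  cumulative stride of the embedding network (assumed).
    """
    h, w = score_map.shape
    peak = np.unravel_index(np.argmax(score_map), score_map.shape)
    # Displacement of the peak from the map center, in score-map cells.
    dy = peak[0] - (h - 1) / 2
    dx = peak[1] - (w - 1) / 2
    # Map the displacement back to image coordinates via the stride.
    return (search_center[0] + dx * total_stride,
            search_center[1] + dy * total_stride)

# A synthetic 17x17 score map whose peak is offset right and down.
score = np.zeros((17, 17))
score[10, 12] = 1.0
print(track_step(score, search_center=(128, 128)))
```

Because there is no online update, each frame only requires one forward pass plus this cheap argmax-and-shift step, which is where the method's speed comes from.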
2. Fully-Convolutional Siamese Network
A Siamese network applies an identical transformation phi to both inputs and then combines their representations using a function g. In our formulation, the function g is a cross-correlation layer. Given the exemplar image z and search image x, the response map is computed as: f(z, x) = phi(z) * phi(x) + b, where * denotes cross-correlation. The key advantage of our formulation is that it is fully-convolutional with respect to the search image. This means that rather than evaluating the similarity at a single position, we can evaluate it densely across the entire search region in a single forward pass. The embedding function phi is a CNN without any fully-connected layers, using a modified AlexNet architecture with five convolutional layers.
Paragraph function: defines the mathematical formulation and architecture of the fully-convolutional Siamese network.
Logical role: shows the computational payoff of being fully convolutional — dense similarity evaluation across the whole search region in one forward pass.
Rhetorical technique / potential weakness: the cross-correlation operation is elegant and efficient, but the shallow AlexNet backbone may limit representational power.
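The formulation f(z, x) = phi(z) * phi(x) + b can be sketched directly: the exemplar embedding is slid over the search embedding as a correlation kernel, producing the dense score map. The sketch below assumes the feature-map sizes reported for SiamFC (a 6x6 exemplar embedding and 22x22 search embedding, giving a 17x17 score map); the channel count and the explicit loops are illustrative, not the paper's optimized implementation.

```python
import numpy as np

def cross_correlate(exemplar_feat, search_feat, bias=0.0):
    """Dense cross-correlation of an exemplar embedding over a search
    embedding ('valid' mode), yielding a single-channel score map.

    exemplar_feat: (C, h, w) array, phi(z).
    search_feat:   (C, H, W) array, phi(x), with H >= h and W >= w.
    """
    C, h, w = exemplar_feat.shape
    _, H, W = search_feat.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            # Inner product of the exemplar with one search window.
            window = search_feat[:, i:i + h, j:j + w]
            out[i, j] = np.sum(window * exemplar_feat) + bias
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 6, 6))    # phi(z): exemplar embedding
x = rng.standard_normal((256, 22, 22))  # phi(x): search embedding
print(cross_correlate(z, x).shape)      # 17x17 score map
```

In practice this operation maps onto a standard convolution routine with phi(z) as the kernel, which is why evaluating every translation of the exemplar costs no more than one convolutional layer.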
3. Training
We train our network on the ILSVRC15 video object detection dataset, which contains more than 4000 sequences and over 1.3 million annotated frames. We use pairs of frames from the same video to create training examples. The exemplar is a 127x127 crop centered on the target in the first frame, and the search image is a 255x255 crop from a later frame. We train using logistic loss, where each position in the score map is treated as a binary classification problem (target present or absent). We use stochastic gradient descent with momentum for optimization. The training takes only 2 days on a single GPU, and the resulting model generalizes well to tracking arbitrary objects not seen during training.
Paragraph function: describes the training data, loss function, and training efficiency.
Logical role: demonstrates the method's reproducibility and efficiency — low training cost, good generalization.
Rhetorical technique / potential weakness: the "2 days on a single GPU" training cost stresses the method's very low practical barrier to entry.
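The per-position binary classification described above can be sketched as a label map plus a logistic loss. The label radius below is an assumed value for illustration (the idea is that positions within a small distance of the map center count as "target present"), and the class-balancing weights used in training are omitted for brevity.

```python
import numpy as np

def label_map(size, radius):
    """Ground-truth labels for the score map: +1 within `radius` cells
    of the center (target present), -1 elsewhere (target absent)."""
    c = (size - 1) / 2
    yy, xx = np.mgrid[:size, :size]
    dist = np.sqrt((yy - c) ** 2 + (xx - c) ** 2)
    return np.where(dist <= radius, 1.0, -1.0)

def logistic_loss(scores, labels):
    """Mean per-position logistic loss: log(1 + exp(-y * v))."""
    return np.mean(np.log1p(np.exp(-labels * scores)))

labels = label_map(17, radius=2)   # 17x17 map, as in the score map above
scores = labels * 3.0              # well-separated predictions -> low loss
print(logistic_loss(scores, labels))
```

Treating every score-map position as an independent binary example is what lets one exemplar/search pair supply hundreds of training signals per SGD step.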
4. Experiments
We evaluate SiamFC on OTB-2013, OTB-2015, and VOT-2015 benchmarks. On OTB-2013, our tracker achieves an AUC score of 0.608, which is comparable to more complex methods like MDNet (0.648) while running at 86 FPS compared to MDNet's 1 FPS. On VOT-2015, SiamFC ranks among the top performers with an expected average overlap of 0.292. The tracker is remarkably fast: it runs at 86 FPS with 3 scales and 58 FPS with 5 scales on a Titan X GPU. While our tracker does not achieve the absolute best accuracy on all benchmarks, its combination of speed, simplicity, and competitive accuracy makes it an attractive option for applications requiring real-time performance.
Paragraph function: reports tracking results across multiple benchmarks, with particular emphasis on speed.
Logical role: builds the efficiency argument on the 86:1 speed ratio versus MDNet.
Rhetorical technique / potential weakness: concedes that accuracy is not the absolute best on every benchmark while stressing the combined speed-accuracy advantage; the argument is balanced and honest.
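The "3 scales" versus "5 scales" trade-off refers to searching a small pyramid of candidate target sizes around the previous estimate, with speed dropping as more scales are evaluated. A minimal sketch of generating those candidates follows; the `scale_step` value is an assumption for illustration, not a number taken from the paper.

```python
import numpy as np

def scale_candidates(num_scales=3, scale_step=1.0375):
    """Candidate scale factors centered on 1.0 (unchanged size).

    num_scales: how many scales to search (e.g. 3 or 5).
    scale_step: geometric spacing between adjacent scales (assumed).
    """
    half = num_scales // 2
    return scale_step ** np.arange(-half, half + 1)

print(scale_candidates(3))  # shrink / keep / grow candidates
print(scale_candidates(5))
```

Each candidate scale costs one extra forward pass over a resized search region, which is why the tracker runs at 86 FPS with 3 scales but 58 FPS with 5.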
5. Conclusions
We have presented a tracker based on fully-convolutional Siamese networks that achieves state-of-the-art tracking accuracy while operating at frame rates far exceeding real-time requirements. The simplicity of the approach — learning a similarity function offline and applying it at test time without online model updates — makes it easy to implement and integrate. This work establishes that Siamese networks are a powerful paradigm for visual object tracking, and we believe it will inspire further research in combining offline training with online tracking.
Paragraph function: summarizes the core contributions and looks ahead to the Siamese tracking paradigm.
Logical role: positions the paper's academic impact at the level of establishing a paradigm.
Rhetorical technique / potential weakness: the word "paradigm" accurately anticipated the subsequent wave of Siamese-network-based tracking research.
Argument structure overview
Problem: online SGD tracking is too slow
➔ Claim: learn a similarity function offline
➔ Evidence: 86 FPS plus top-tier accuracy
➔ Rebuttal: not the absolute best on every metric
➔ Conclusion: establishes the Siamese tracking paradigm
Core claim
By learning a generic similarity function offline with a fully-convolutional Siamese network, arbitrary objects can be tracked faster than real-time without any online model update.
Strongest argument
86 FPS versus MDNet's 1 FPS — 86 times faster at comparable accuracy — decisively demonstrates the practicality of the offline similarity-learning paradigm.
Weakest link
Forgoing online model updates means the tracker cannot adapt to drastic changes in target appearance, so it may be unstable in long-term tracking and under occlusion.