Tracking Anything in High Quality

Abstract

The paper introduces HQTrack, a framework combining video multi-object segmentation (VMOS) with mask refinement (MR). The system addresses object tracking challenges by propagating masks across frames and applying pre-trained models to enhance mask quality. The authors achieved second place in the VOTS2023 challenge without using test-time augmentations or model ensembles, demonstrating the effectiveness of the two-component architecture in high-quality tracking and segmentation.

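Paragraph function: Overview of the whole paper, with the competition result (second place) serving as immediate validation of the method.

Logical role: The abstract stresses the "no augmentation, no ensemble" condition, implying that the result comes from the method itself rather than from engineering tricks.

Argumentation technique / potential weakness: Second place in a competition is strong empirical evidence, but the abstract does not state the gap to first place. The "no augmentation, no ensemble" claim suggests that adding such techniques might yield a higher ranking, yet it could equally indicate that the method itself has a limited ceiling.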

1. Introduction

Visual object tracking represents a foundational task in computer vision with applications spanning robotics and autonomous driving. The VOTS2023 challenge broadens traditional tracking constraints by merging short-term and long-term tracking, single and multi-target scenarios, using segmentation masks as the primary localization method. This integration presents distinct challenges including inter-object relationship understanding, multi-target trajectory maintenance, and accurate mask estimation. Previous approaches divide into online-update trackers and Siamese-based methods, while recent Transformer-based architectures offer superior long-range modeling but primarily address single-object tracking with bounding box outputs.

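Paragraph function: Establishes the research setting, moving from the traditional tracking paradigms to the new challenges of VOTS2023.

Logical role: The multi-dimensional integration in VOTS2023 (short- and long-term, single- and multi-target, mask output) pins down the complexity of the problem and prepares the ground for a two-component architecture.

Argumentation technique / potential weakness: Tying the problem framing to one specific competition is both a strength (a clear benchmark) and a limitation (the method may be over-specialized to this setting). The "bounding box only" limitation of Transformer trackers has been partly overcome by later work.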

VOTS2023 presents three primary challenges: (i) Extended video sequences exceeding 10,000 frames demand robust appearance discrimination and environmental adaptation; (ii) Objects disappearing and reappearing require specialized tracking mechanisms; (iii) Fast motion, occlusion, distractors, and tiny objects compound difficulties. The authors propose HQTrack, comprising a video multi-object segmenter (VMOS) built upon an improved DeAOT variant with cascaded gated propagation modules and InternImage-T feature extraction, and a mask refiner using HQ-SAM to address SAM's limitations with complex object structures.

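Paragraph function: Maps concrete challenges to the proposed solution, linking the three difficulties to the two-component design.

Logical role: Three challenges -> VMOS handles tracking and segmentation -> HQ-SAM handles mask quality, giving a clear problem-to-solution correspondence.

Argumentation technique / potential weakness: The choice of DeAOT over other VOS methods (such as XMem), and of HQ-SAM over the original SAM, needs ablation studies to back it up. The choice of InternImage-T as backbone should likewise be compared against ResNet, Swin Transformer, and similar alternatives.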

2. Method

2.1 Video Multi-Object Segmenter (VMOS)

DeAOT incorporates identification mechanisms for unified multi-object association within shared embedding spaces. Its hierarchical propagation through dual-branch gated propagation modules decouples visual and identification embedding propagation, mitigating object-agnostic visual information loss during deep propagation. VMOS extends DeAOT by cascading 1/8 scale gated propagation modules for improved tiny object perception. Multi-scale propagation features feed into FPN-based decoders alongside encoder features. InternImage-T replaces standard backbones, leveraging deformable convolutions for enhanced object discrimination across representative vision tasks. A fixed-length long-term memory of 8 frames addresses memory constraints in lengthy sequences.

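Paragraph function: Core component one: the architecture of VMOS and its improvements over DeAOT.

Logical role: Each of the three improvements (1/8 scale modules, InternImage-T, fixed-length memory) answers one challenge: tiny objects, discriminative power, and long sequences, respectively.

Argumentation technique / potential weakness: The fixed 8-frame long-term memory is simple and effective, but the memory update strategy (which frames get replaced) is not described in enough detail. The extra 1/8 scale modules add computational cost, which may become a bottleneck for applications with strict real-time requirements.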
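
To see why an extra 1/8 scale propagation stage can help with tiny objects, it is useful to compare how many feature-map cells a small object spans at each stride. The following is a back-of-the-envelope sketch, assuming the base propagation runs at 1/16 scale; the 480x854 frame size and the 12-pixel object are illustrative values, not figures from the paper.

```python
import math


def feature_map_size(height, width, stride):
    """Spatial size of a feature map produced with the given stride."""
    return math.ceil(height / stride), math.ceil(width / stride)


def cells_spanned(object_size_px, stride):
    """Number of feature cells an object of this side length covers (per axis)."""
    return object_size_px / stride


# Illustrative values only (assumptions, not taken from the paper).
frame_height, frame_width = 480, 854
tiny_object_px = 12

for stride in (16, 8):
    size = feature_map_size(frame_height, frame_width, stride)
    cells = cells_spanned(tiny_object_px, stride)
    print(f"stride 1/{stride}: feature map {size}, object spans {cells} cells per axis")
```

At stride 16 the 12-pixel object covers less than one cell per axis, while at stride 8 it covers more than one, which is the intuition behind propagating at the finer scale. The cost is a feature map with roughly four times as many cells to process.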

2.2 Mask Refiner (MR)

SAM demonstrates powerful zero-shot capabilities through large-scale training on 1.1 billion masks but struggles with intricate object structures. HQ-SAM addresses this by injecting learnable output tokens into SAM's mask decoder while maintaining promptable design, efficiency, and zero-shot generalizability. HQTrack employs HQ-SAM as its mask refiner, extracting bounding boxes from VMOS predictions as prompts for HQ-SAM. An IoU threshold selector prevents harmful refinement: when IoU between VMOS and HQ-SAM masks exceeds the threshold, refined masks replace originals; otherwise original masks persist. This selective strategy ensures refinement only when beneficial.

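Paragraph function: Core component two: the mask refinement strategy and its safety mechanism.

Logical role: The IoU threshold selector is the key safety valve. It acknowledges that HQ-SAM does not always outperform VMOS, and so avoids the quality loss that blind refinement could cause.

Argumentation technique / potential weakness: Selective refinement is a pragmatic engineering design. The IoU threshold (0.1), however, needs careful tuning: too low and many low-quality refinements pass through, too high and refinement loses its effect. In addition, a bounding-box prompt for HQ-SAM may be less precise than a mask prompt.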
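
The selective refinement rule described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: masks are plain nested lists of 0/1 so the example stays self-contained, and `refine_fn`, `fill_box`, and `drifted` are hypothetical stand-ins for a box-prompted refiner such as HQ-SAM (which in practice also takes the image as input).

```python
def mask_to_box(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) of a binary mask, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_iou(a, b):
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            inter += 1 if (va and vb) else 0
            union += 1 if (va or vb) else 0
    return inter / union if union else 0.0


def select_mask(vmos_mask, refine_fn, iou_threshold=0.1):
    """Keep the refined mask only if it agrees enough with the VMOS mask."""
    box = mask_to_box(vmos_mask)
    if box is None:  # nothing predicted, so nothing to prompt with
        return vmos_mask
    refined = refine_fn(box)  # hypothetical box-prompted refiner
    if mask_iou(vmos_mask, refined) > iou_threshold:
        return refined
    return vmos_mask


def fill_box(box, height=4, width=4):
    """Toy refiner: returns the filled prompt box."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
            for y in range(height)]


def drifted(box):
    """Toy refiner that latches onto an unrelated region."""
    return [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]


vmos = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

accepted = select_mask(vmos, fill_box)  # IoU 0.75 > 0.1, refined mask is used
rejected = select_mask(vmos, drifted)   # IoU 0.0, original mask is kept
```

With a threshold as low as 0.1, the selector rejects only refinements that disagree almost completely with the propagated mask, so it acts as a guard against gross failures rather than as a fine-grained quality check.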

3. Implementation Details

InternImage-T serves as the encoder backbone, balancing accuracy and efficiency. Training occurs in two stages: first-stage pre-training uses synthetic sequences from static image datasets; second-stage fine-tuning employs multi-object segmentation datasets including DAVIS, YouTubeVOS, VIPSeg, BURST, MOTS, and OVIS for robust multi-object understanding. Training utilized two A100 GPUs with batch size 16. The long-term memory gap is set to 50 frames (optimized for VOTS video lengths), and inference follows the pipeline without test-time augmentation, multi-scale testing, or model ensembles.

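Paragraph function: Implementation specification, giving the training configuration and hyperparameters.

Logical role: Two-stage training (synthetic pre-training followed by fine-tuning on real data) is an established paradigm in VOS. The breadth of the six fine-tuning datasets helps ensure generalization across scenes.

Argumentation technique / potential weakness: The 50-frame memory gap is tuned specifically for VOTS; other settings (such as surveillance video) may need a different value. Training on only two A100 GPUs is a relatively modest budget, which speaks to the method's efficiency.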
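
The interaction between the 8-frame memory capacity and the 50-frame memory gap can be made concrete with a small simulation. The sketch below is hypothetical: the source fixes only the capacity and the gap, so keeping the first (reference) frame permanently and evicting the oldest of the remaining entries are assumptions made here for illustration.

```python
from collections import deque


class LongTermMemory:
    """Fixed-length long-term memory, written once every `gap` frames.

    Assumptions (not stated in the source): frame 0 is the annotated
    reference frame and is never evicted; other entries are dropped
    oldest-first once the memory is full.
    """

    def __init__(self, capacity=8, gap=50):
        if capacity < 2:
            raise ValueError("capacity must leave room beyond the reference frame")
        self.gap = gap
        self.reference = None
        # A deque with maxlen discards its oldest item automatically on append.
        self.recent = deque(maxlen=capacity - 1)

    def maybe_write(self, frame_index, embedding):
        if frame_index == 0:
            self.reference = (frame_index, embedding)
        elif frame_index % self.gap == 0:
            self.recent.append((frame_index, embedding))

    def frame_indices(self):
        head = [] if self.reference is None else [self.reference[0]]
        return head + [index for index, _ in self.recent]


memory = LongTermMemory(capacity=8, gap=50)
for t in range(10_001):  # a sequence of the length the challenge describes
    memory.maybe_write(t, embedding=f"feat_{t}")  # strings stand in for feature maps

print(memory.frame_indices())
# [0, 9700, 9750, 9800, 9850, 9900, 9950, 10000]
```

Under these assumptions, the memory at frame 10,000 holds the reference frame plus the last 300 frames of history sampled every 50 frames, so memory use stays constant however long the video runs.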

4. Experiments

A component-wise analysis of VMOS showed that replacing ResNet50 with InternImage-T raised AUC to 0.611, and that adding the multi-scale propagation mechanism raised it further to 0.650, an absolute gain of 0.039 (3.9 points). An analysis of the long-term memory gap identified 50 frames as the best setting for VOTS video sequences. For mask refinement, selective refinement with an IoU threshold of 0.1 gave the best balance between the benefit of refinement and the risk of degradation. On the VOTS2023 test set, the InternImage-T replacement yielded a 3.2% AUC increase, SAM-H refinement added 1.4%, and HQ-SAM-H brought the final AUC to 0.615, ranking second. The joint tracking paradigm outperformed tracking each target separately, a result attributed to better understanding of inter-target relationships.

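Paragraph function: Per-component ablation and final ranking, showing the value of each design choice through cumulative contributions.

Logical role: The ablation path adds improvements step by step on top of the ResNet50 baseline and quantifies each component's contribution: backbone (+3.2% on the test set), multi-scale propagation (+3.9% in the component analysis), and refinement (+1.4% with SAM-H, with HQ-SAM-H reaching the final 0.615).

Argumentation technique / potential weakness: The step-by-step cumulative ablation shows the value of each component effectively. The improvements may not be independent, however, and interaction effects are not explored. HQ-SAM contributes only a small additional gain over SAM (0.615 versus the variant without HQ), which suggests diminishing marginal returns from refinement.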
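
Because the reported gains are stated as percentages, it helps to separate absolute AUC points from relative improvement. The short calculation below uses only the two component-analysis values quoted in this section (0.611 and 0.650); the helper name `gain` is introduced here for illustration.

```python
def gain(before, after):
    """Return (absolute difference, relative difference) between two scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# AUC before and after adding multi-scale propagation (component analysis).
absolute, relative = gain(0.611, 0.650)

print(f"absolute: {absolute:.3f} AUC ({absolute * 100:.1f} points)")
print(f"relative: {relative * 100:.1f}%")
# absolute: 0.039 AUC (3.9 points)
# relative: 6.4%
```

The "3.9%" figure therefore reads as an absolute difference in AUC, not as a relative improvement over the 0.611 starting point.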

5. Conclusion

HQTrack demonstrates powerful object tracking and segmentation capabilities through its two-component architecture. VMOS manages multi-target propagation across video frames while the mask refiner enhances segmentation quality. The framework achieved second place in VOTS2023 without auxiliary techniques, validating its effectiveness for high-quality tracking across diverse scenarios including long-term sequences, object disappearance/reappearance, and challenging visual conditions.

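Paragraph function: Summarizes the paper, with the competition result as the final validation.

Logical role: The conclusion concisely ties each component's role (VMOS for tracking, MR for quality) to the competition result, closing the loop.

Argumentation technique / potential weakness: The conclusion does not discuss computational efficiency (inference speed) or performance on benchmarks other than VOTS2023. It also leaves out directions for future work, such as better memory management strategies or the possibility of end-to-end training.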

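Argument structure overview

Problem: the multi-dimensional tracking challenges of VOTS2023
→ Claim: VMOS + HQ-SAM, a two-component architecture with a division of labor
→ Evidence: AUC 0.615, second place in VOTS2023
→ Rebuttal: IoU-based selective refinement prevents quality degradation
→ Conclusion: high-quality tracking is reached without engineering tricks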

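The authors' core claim (in one sentence)

Combining an improved DeAOT multi-object segmenter (VMOS) with an HQ-SAM mask refiner under a selective strategy achieves high-quality tracking and segmentation of multiple targets in long sequences, without relying on test-time augmentation or model ensembles.

Strongest point of the argument

The safety mechanism of selective refinement: the IoU threshold selector acknowledges that HQ-SAM is not a cure-all, and keeps the original VMOS mask whenever refinement might lower quality. This "better to leave it unchanged than to change it wrongly" engineering philosophy, together with the thorough per-component ablation design, gives every design choice quantitative support.

Weakest point of the argument

Over-specialization to one competition: several hyperparameters (the 50-frame memory gap, the 0.1 IoU threshold) are tuned for VOTS2023, and generalization to other benchmarks or real-world applications is not verified. Moreover, the cascaded two-component design means inference latency is the sum of both stages, which may be a bottleneck for real-time tracking, yet the authors report no inference speed figures.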