VisTR: End-to-End Video Instance Segmentation with Transformers

Abstract

Video instance segmentation (VIS) aims to simultaneously detect, segment, and track object instances across video frames. Existing methods typically adopt a multi-stage pipeline: first performing per-frame detection and segmentation, then associating instances across frames via tracking. This paper proposes VisTR, the first end-to-end transformer-based framework for video instance segmentation. VisTR formulates the VIS task as a direct set prediction problem in the spatiotemporal domain, where instance predictions for all frames are generated simultaneously through a single forward pass. The approach achieves the highest speed among VIS models while maintaining competitive accuracy on the YouTube-VIS benchmark.

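Paragraph function: Overview of the whole paper. It identifies the problems of multi-stage pipelines and proposes an end-to-end Transformer solution.

Logical role: The abstract builds its core argument on the contrast between multi-stage and end-to-end pipelines. The claim of being "the first" establishes the work as pioneering, while "highest speed plus competitive accuracy" addresses both efficiency and effectiveness.

Argumentation technique / potential weaknesses: The "first" claim is strong, but it needs checking against concurrent work. Saying "competitive accuracy" rather than "state-of-the-art" suggests that accuracy is not the method's main strength.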

1. Introduction

Video instance segmentation was introduced as a new task that unifies detection, segmentation, and tracking into a single evaluation framework. Current state-of-the-art methods follow a detect-then-track paradigm: they apply an image-level instance segmentation model (e.g., Mask R-CNN) to each frame independently, then use hand-crafted heuristics or separate tracking networks to associate detections across time. This multi-stage approach suffers from three problems: error propagation between stages, the need for complex post-processing (e.g., NMS, IoU-based matching), and an inability to leverage temporal information during the detection phase.

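Paragraph function: Establishes the research context. It defines the VIS task and critiques the inherent limitations of multi-stage methods.

Logical role: This paragraph uses three concrete defects (error propagation, complex post-processing, and unused temporal information) to argue systematically that multi-stage methods fall short, establishing the need for an end-to-end approach.

Argumentation technique / potential weaknesses: The list of three defects is comprehensive and persuasive. However, multi-stage methods have an engineering advantage in modularity (each module can be optimized and replaced independently), which the authors ignore. The interpretability and debugging difficulty of end-to-end methods are also not discussed.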
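To make the criticized post-processing concrete, here is a minimal sketch of IoU-based association between two consecutive frames. It is a generic greedy heuristic written for illustration, not the association rule of any specific method named above; the function names and the 0.5 threshold are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(prev: List[Box], curr: List[Box], threshold: float = 0.5) -> Dict[int, int]:
    """Greedily link each current box to the best unused previous box.

    Returns a mapping {current index: previous index}. The threshold is a
    hypothetical value chosen for this example.
    """
    pairs = sorted(
        ((iou(p, c), i, j) for i, p in enumerate(prev) for j, c in enumerate(curr)),
        reverse=True,
    )
    links: Dict[int, int] = {}
    used_prev = set()
    for score, i, j in pairs:
        if score < threshold:
            break  # all remaining pairs overlap too little to be linked
        if i in used_prev or j in links:
            continue
        links[j] = i
        used_prev.add(i)
    return links


prev_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
curr_boxes = [(21, 20, 31, 30), (1, 0, 11, 10), (50, 50, 60, 60)]
links = associate(prev_boxes, curr_boxes)
print(links)
```

The third current box has no overlapping predecessor and stays unlinked, so a separate rule would be needed to start a new track. Hand-tuned decisions of this kind are what the paragraph above counts as post-processing overhead.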

Inspired by DETR's success in reformulating object detection as set prediction with transformers, we extend this paradigm to the video domain. VisTR treats the entire video clip as a single entity and directly outputs instance predictions for all frames simultaneously. The transformer's self-attention mechanism naturally captures spatiotemporal dependencies, allowing each instance query to attend to features across both space and time. This eliminates the need for separate tracking modules, NMS, and other post-processing heuristics.

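Paragraph function: Presents the solution. Building on DETR, it extends the set prediction paradigm to the spatiotemporal domain.

Logical role: This paragraph makes the turn from problem to solution. The core statement is that self-attention naturally captures spatiotemporal dependencies: tracking is folded implicitly into detection rather than treated as a separate task.

Argumentation technique / potential weaknesses: Describing self-attention's spatiotemporal modeling as something it "naturally captures" is effective rhetoric, but "natural" does not necessarily mean "optimal". The computational cost of taking an entire video clip as input grows rapidly with clip length, so the clip length that can be processed is tightly limited.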
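The claim that an instance query can attend across both space and time can be illustrated with a toy scaled dot-product attention. This is a minimal sketch under strong simplifying assumptions: a single head, no learned projections (keys and values are the raw tokens), and made-up 2-dimensional features. It is not the VisTR implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[List[float], Vector]:
    """Single-head scaled dot-product attention for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy clip: 2 frames x 2 spatial positions, flattened frame by frame.
tokens = [
    [1.0, 0.0],  # frame 0, position 0
    [0.0, 1.0],  # frame 0, position 1
    [0.9, 0.1],  # frame 1, position 0 (looks like the first token)
    [0.0, 1.0],  # frame 1, position 1
]
instance_query = [4.0, 0.0]  # hypothetical query that responds to the first feature
weights, output = attend(instance_query, tokens, tokens)
print([round(w, 3) for w in weights])
```

Most of the attention mass lands on the two similar-looking tokens, one in each frame. Because the sequence mixes frames, the same query gathers evidence for one object at different times without any explicit tracking step.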

2. Related Work

MaskTrack R-CNN first established the VIS task and benchmark, extending Mask R-CNN with a tracking branch that uses learned embeddings for association. Subsequent works like SipMask, CompFeat, and STEm-Seg improved segmentation quality or temporal modeling but all maintained the multi-stage paradigm. In object detection, DETR demonstrated that transformers with set-based Hungarian matching can eliminate NMS and anchor generation. Deformable DETR further improved training efficiency. However, no prior work has applied the transformer-based set prediction framework to the video instance segmentation task.

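Paragraph function: Literature review. It positions VisTR at the intersection of two research lines: VIS and Transformer-based detection.

Logical role: Two parallel lines (the evolution of VIS methods and the development of Transformer-based detection) converge on an intersection that no one has filled, which pinpoints the space in which VisTR innovates.

Argumentation technique / potential weaknesses: The "intersection of two lines" narrative is clear and effective. However, reducing the problem to "DETR + video" may understate the fundamental challenges of extending from images to video, such as memory consumption for long sequences and keeping instances consistent over time.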

3. Method

3.1 Architecture Overview

VisTR processes a video clip of T frames simultaneously. Each frame is first processed by a CNN backbone (ResNet-50/101) to extract a feature map of size H' × W'. The feature maps from all T frames are flattened and concatenated into a single spatiotemporal feature sequence of length T × H' × W'. This sequence is then fed into a Transformer encoder-decoder. The encoder applies multi-head self-attention over the entire spatiotemporal sequence, so each spatial position in any frame can attend to all positions in all frames. The decoder takes N learned instance queries and produces N instance predictions, each containing class labels, bounding boxes, and segmentation masks for all T frames.

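Paragraph function: Method overview. It describes the full pipeline from multi-frame input to instance predictions.

Logical role: This paragraph makes the "end-to-end" promise concrete: the CNN extracts features → the features are concatenated into a spatiotemporal sequence → the Transformer encodes spatiotemporal relations → instance queries decode the predictions. Every step is differentiable and no hand-crafted post-processing is required.

Argumentation technique / potential weaknesses: The architecture is described clearly and completely. However, concatenating the features of all frames into one sequence gives a length of T × H' × W', and the O(n^2) complexity of self-attention means T cannot be large (typically limited to about 36 frames), which severely restricts the video length that can be processed.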
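The flattening step described in 3.1 can be sketched as follows. The frame-major token order and the helper names are assumptions made for illustration, since the text only states that the result is one sequence of length T × H' × W'. String labels stand in for feature vectors so the index mapping is easy to inspect.

```python
from typing import List, Tuple


def flatten_clip(clip: List[List[List[str]]]) -> List[str]:
    """Flatten a T x H x W nested list into one sequence, frame by frame."""
    return [cell for frame in clip for row in frame for cell in row]


def token_index(t: int, y: int, x: int, height: int, width: int) -> int:
    """Position of feature (t, y, x) in the flattened sequence."""
    return (t * height + y) * width + x


def token_position(index: int, height: int, width: int) -> Tuple[int, int, int]:
    """Inverse of token_index: recover (t, y, x) from a sequence position."""
    t, rest = divmod(index, height * width)
    y, x = divmod(rest, width)
    return t, y, x


T, H, W = 3, 2, 4  # toy sizes, far smaller than real feature maps
clip = [[[f"f{t}y{y}x{x}" for x in range(W)] for y in range(H)] for t in range(T)]
sequence = flatten_clip(clip)

print(len(sequence))                          # T * H * W tokens
print(sequence[token_index(2, 1, 3, H, W)])   # last token of the last frame
print(token_position(13, H, W))
```

Since every token keeps a recoverable (t, y, x) position, the encoder can treat the clip as one set of tokens and positional information can still tell frames apart.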

3.2 Instance Sequence Matching

A key innovation is the instance sequence matching mechanism. Unlike DETR which matches predictions to per-image ground truth, VisTR performs matching between predicted instance sequences and ground truth instance sequences across all T frames using the Hungarian algorithm. The matching cost considers classification probability, bounding box L1 and GIoU losses, and mask dice loss, all accumulated across the T frames. This formulation naturally handles the tracking problem as a byproduct of the set prediction framework: each instance query is responsible for the same object across all frames, making explicit tracking unnecessary.

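Paragraph function: Core of the method. It describes how instance sequence matching unifies detection and tracking.

Logical role: This paragraph is the high point of the paper's argument. "Tracking as a byproduct of set prediction" is the central insight: the tracking problem, which traditionally needs a dedicated module, is absorbed elegantly into the matching process.

Argumentation technique / potential weaknesses: The "naturally handles" wording is persuasive, but it carries an important implicit assumption: that instances neither appear nor disappear within the clip. For scenes where objects enter or leave the frame, a fixed number of instance queries may not adapt well. The computational cost of globally optimal Hungarian matching over long sequences also deserves attention.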
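The sequence-level matching in 3.2 can be sketched on a toy example. Several assumptions apply: the cost keeps only the classification and box L1 terms (the GIoU and mask dice terms are omitted for brevity), the terms are unweighted, and the optimal assignment is found by brute force over permutations instead of the Hungarian algorithm. Brute force yields the same minimum-cost assignment but is only practical for a handful of queries. All names and numbers are illustrative.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)
BoxSequence = Sequence[Box]              # one box per frame, length T


def sequence_cost(class_prob: float, pred: BoxSequence, gt: BoxSequence) -> float:
    """Cost of one (query, ground truth) pair, accumulated over all T frames."""
    box_l1 = sum(abs(p - g) for pb, gb in zip(pred, gt) for p, g in zip(pb, gb))
    return -class_prob + box_l1


def match_sequences(
    pred_class_probs: List[List[float]],
    pred_seqs: List[BoxSequence],
    gt_labels: List[int],
    gt_seqs: List[BoxSequence],
) -> Tuple[List[int], float]:
    """Return (assignment, cost), where assignment[g] is the query matched to ground truth g."""
    best_cost = float("inf")
    best: Tuple[int, ...] = ()
    for perm in permutations(range(len(pred_seqs)), len(gt_seqs)):
        cost = sum(
            sequence_cost(pred_class_probs[q][gt_labels[g]], pred_seqs[q], gt_seqs[g])
            for g, q in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


# T = 2 frames, N = 3 queries, 2 ground-truth instances.
gt_labels = [0, 1]
gt_seqs = [
    [(0.2, 0.2, 0.1, 0.1), (0.3, 0.2, 0.1, 0.1)],   # instance 0 moves right
    [(0.7, 0.7, 0.2, 0.2), (0.7, 0.6, 0.2, 0.2)],   # instance 1 moves up
]
pred_class_probs = [[0.1, 0.9], [0.3, 0.3], [0.8, 0.2]]
pred_seqs = [
    [(0.7, 0.7, 0.2, 0.2), (0.7, 0.62, 0.2, 0.2)],   # query 0 follows instance 1
    [(0.5, 0.5, 0.5, 0.5), (0.5, 0.5, 0.5, 0.5)],    # query 1 fits neither
    [(0.21, 0.2, 0.1, 0.1), (0.3, 0.21, 0.1, 0.1)],  # query 2 follows instance 0
]
assignment, total_cost = match_sequences(pred_class_probs, pred_seqs, gt_labels, gt_seqs)
print(assignment)
```

Each ground-truth instance is matched to exactly one query for the whole clip, so the query index itself serves as the track identity. For realistic sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the N × M cost matrix would replace the permutation loop.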

4. Experiments

Evaluation is conducted on the YouTube-VIS 2019 benchmark, the standard VIS dataset, which contains 2,238 training, 302 validation, and 343 test videos across 40 categories. With a ResNet-50 backbone, VisTR achieves 36.2% AP on the validation set, ahead of MaskTrack R-CNN (30.3%) and SipMask (33.7%). With ResNet-101, performance improves to 40.1% AP. Critically, VisTR achieves the highest inference speed among all compared methods at 69.9 ms per 36-frame clip (approximately 1.94 ms per frame), compared to MaskTrack R-CNN at 162.5 ms and STEm-Seg at 188.3 ms. The method demonstrates strong temporal consistency in segmentation masks without any post-processing.

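Paragraph function: Provides experimental evidence. It validates the method on a standard benchmark along two dimensions: accuracy and speed.

Logical role: The weight of the argument here falls on speed rather than accuracy. A 2.3x speed advantage paired with comparable accuracy supports the core claim that end-to-end methods have a structural efficiency advantage.

Argumentation technique / potential weaknesses: The speed argument is strong and specific. However, 36.2% AP still trails the best methods of the period (for example, the SOTA when VisTR was published was around 44-47% AP), and the word "competitive" may obscure that gap. In addition, 69.9 ms is the time to process a 36-frame clip, about 1.94 ms per frame, but a real deployment also has to handle the transitions between clips.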
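The per-frame latency and the 2.3x figure quoted in these notes follow directly from the reported numbers. The short check below assumes that all three latencies are measured over the same unit of work, which the text does not state explicitly.

```python
CLIP_FRAMES = 36  # clip length quoted in the text

# Latencies in milliseconds, as quoted in the text.
latency_ms = {"VisTR": 69.9, "MaskTrack R-CNN": 162.5, "STEm-Seg": 188.3}

per_frame_ms = latency_ms["VisTR"] / CLIP_FRAMES
speedup = {
    name: ms / latency_ms["VisTR"]
    for name, ms in latency_ms.items()
    if name != "VisTR"
}

print(f"VisTR per-frame latency: {per_frame_ms:.2f} ms")
for name, ratio in speedup.items():
    print(f"{name} / VisTR latency ratio: {ratio:.2f}x")
```

If the baseline latencies were in fact measured per frame rather than per clip, the gap would be far larger than 2.3x, so the comparison is only as meaningful as that shared-unit assumption.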

Ablation experiments examine several design choices. Removing the instance sequence matching and using per-frame matching degrades AP by 3.1%, confirming the importance of temporal-aware matching. Reducing the number of encoder layers from 6 to 3 drops AP by 1.8%, indicating that sufficient self-attention depth is needed for spatiotemporal reasoning. The segmentation mask head using instance-level attention pooling outperforms simple feature concatenation by 2.4% AP. Increasing clip length from 18 to 36 frames improves AP by 1.2%, demonstrating the benefit of longer temporal context.

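Paragraph function: Ablation analysis. It verifies the individual contribution of each design component.

Logical role: The ablations systematically justify each design choice in the architecture. The 3.1% contribution of instance sequence matching is the most important, since it directly supports the core claim of tracking as a byproduct of matching.

Argumentation technique / potential weaknesses: The ablation results are clear and logically consistent. The trend that longer clips give higher AP is worth noting: it suggests the method benefits from more temporal information, but it also means the computational cost keeps growing. No complete analysis of clip length versus computational cost is provided.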
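The missing cost analysis can be approximated with a back-of-the-envelope calculation. The sketch below counts only the pairwise interactions in one full self-attention layer. The 10 × 18 feature-map size is a hypothetical value chosen for illustration; it does not affect the ratio.

```python
def sequence_length(frames: int, feat_h: int, feat_w: int) -> int:
    """Number of tokens after flattening a clip of feature maps."""
    return frames * feat_h * feat_w


def attention_pairs(frames: int, feat_h: int, feat_w: int) -> int:
    """Token pairs scored by one full self-attention layer (n^2)."""
    n = sequence_length(frames, feat_h, feat_w)
    return n * n


FEAT_H, FEAT_W = 10, 18  # hypothetical backbone output resolution

short_clip = attention_pairs(18, FEAT_H, FEAT_W)
long_clip = attention_pairs(36, FEAT_H, FEAT_W)
ratio = long_clip / short_clip

print(sequence_length(36, FEAT_H, FEAT_W), "tokens for a 36-frame clip")
print(f"attention cost ratio, 36 vs 18 frames: {ratio:.1f}x")
```

Under this simplified model, doubling the clip length from 18 to 36 frames quadruples the attention cost in exchange for the 1.2% AP gain reported above.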

5. Conclusion

We present VisTR, the first end-to-end transformer-based model for video instance segmentation. By formulating VIS as direct set prediction in the spatiotemporal domain with instance sequence matching, VisTR eliminates the need for separate tracking modules, NMS, and other hand-designed post-processing. The approach achieves competitive accuracy with the highest inference speed on YouTube-VIS. Our work demonstrates that the transformer's ability to model long-range dependencies is particularly well-suited for tasks requiring joint spatiotemporal reasoning.

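Paragraph function: Summarizes the paper. It restates the core value of the end-to-end approach and the experimental results.

Logical role: The conclusion sums up the method's simplicity with the strong word "eliminates", and closes on the broader insight that Transformers are well suited to spatiotemporal reasoning.

Argumentation technique / potential weaknesses: The conclusion sells methodological simplicity rather than an outright lead in accuracy, which is a sensible positioning strategy. However, it does not discuss practical issues such as the limit on video length, the handling of objects entering and leaving the scene, or a strategy for generalizing from video clips to full videos.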

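Argument Structure Overview

Problem: multi-stage VIS pipelines; error propagation and post-processing
→ Claim: end-to-end spatiotemporal set prediction; tracking as a byproduct of matching
→ Evidence: 36.2-40.1% AP; highest inference speed
→ Rebuttal: eliminates NMS and tracking modules; simpler architecture
→ Conclusion: Transformers suit tasks that require joint spatiotemporal reasoning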

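Authors' core claim (one sentence)

By defining video instance segmentation as a direct set prediction problem in the spatiotemporal domain and using the Transformer's self-attention to detect, segment, and track simultaneously, the method removes the error propagation and hand-crafted post-processing of multi-stage pipelines and reaches the highest inference speed while keeping competitive accuracy.

Strongest point of the argument

The elegance of the problem reformulation: absorbing tracking as a byproduct of set prediction matching is the most conceptually significant contribution. The ablation on instance sequence matching (AP drops by 3.1% when it is removed) directly validates the design. The structural speed advantage (2.3x over MaskTrack R-CNN) further supports the practical value of the end-to-end approach.

Weakest point of the argument

The trade-off between scalability and accuracy: the O(n^2) complexity of self-attention tightly limits the video length that can be processed (about 36 frames), whereas real-world videos often run to hundreds or thousands of frames and need additional strategies such as sliding windows. On accuracy, a clear gap remains to the best results of contemporaneous multi-stage methods, and the word "competitive" cannot hide it. In addition, a fixed number of instance queries assumes an upper bound on the number of objects in a scene, so adaptability to dense scenes is questionable.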
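The sliding-window strategy mentioned above can be sketched as a simple clip scheduler. This is one possible scheme under stated assumptions, not something the paper is described as doing: the function name, the 18-frame stride, and the choice to align the last window with the end of the video are all hypothetical.

```python
from typing import List, Tuple


def clip_windows(num_frames: int, clip_len: int = 36, stride: int = 18) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that cover a video with overlapping clips."""
    if num_frames <= 0:
        return []
    if num_frames <= clip_len:
        return [(0, num_frames)]  # short video: a single, shorter clip
    starts = list(range(0, num_frames - clip_len + 1, stride))
    if starts[-1] + clip_len < num_frames:
        starts.append(num_frames - clip_len)  # final window flush with the end
    return [(s, s + clip_len) for s in starts]


windows = clip_windows(100)
print(windows)
```

Scheduling the clips is the easy part. Instance identities still have to be linked across overlapping windows, for example by comparing masks on the shared frames, which brings back a form of the association step that the end-to-end formulation removes within a clip.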