Multi-Object Tracking via Constrained Sequential Labeling

Abstract — 摘要

We address the problem of tracking multiple objects in densely packed environments where occlusion and posture variation present significant challenges. Rather than relying solely on detection, we propose Constrained Sequential Labeling (CSL), which formulates tracking as labeling mid-level features by object identifiers while maintaining hard constraints through constraint propagation. The CSL method enables flexible cost functions and constraints to represent complex dependencies beyond standard network-flow approaches. We include a learning algorithm for constraints with proven convergence properties. Testing on challenging footage from volleyball, basketball, and pedestrian scenarios demonstrates improved performance relative to existing methods.

我們處理在密集環境中追蹤多個物件的問題，其中遮擋與姿態變異帶來重大挑戰。我們不單純依賴偵測，而是提出約束序列標記（CSL），將追蹤形式化為以物件識別子標記中階特徵的過程，同時透過約束傳播維持硬約束。CSL 方法支援靈活的成本函數與約束來表示超越標準網路流方法的複雜依賴關係。我們納入了一個具有已證明收斂性質的約束學習演算法。在排球、籃球與行人等具挑戰性場景的影片上測試，展示了相對於現有方法的效能提升。

段落功能全文總覽——將多物件追蹤重新框架為約束序列標記問題。

邏輯角色摘要建構了「框架重塑」的論證：從追蹤的傳統視角（偵測+關聯）轉向序列標記視角，由此獲得表達力的提升。

論證技巧 / 潛在漏洞「超越標準網路流」的主張暗示更強的建模能力，但也可能意味著更高的計算複雜度。「已證明收斂性質」為理論背書增添了學術嚴謹性。

1. Introduction — 緒論

Multi-object tracking (MOT) in crowded scenes is a fundamental yet challenging task. The dominant paradigm — tracking-by-detection — first detects objects in each frame, then associates detections across frames. This approach is typically formulated as a data association problem, solved via network flow optimization or Hungarian algorithm. However, these formulations assume pairwise costs between detections and cannot express higher-order dependencies such as "two tracks should not overlap in space" or "a track's appearance should vary smoothly". In densely packed scenarios like team sports, these limitations lead to frequent identity switches and fragmented trajectories.

在擁擠場景中的多物件追蹤是一項基礎但具挑戰性的任務。主流範式——偵測式追蹤——首先在每個影格中偵測物件，然後跨影格關聯偵測結果。此方法通常被形式化為資料關聯問題，透過網路流最佳化或匈牙利演算法求解。然而，這些形式化假設偵測之間的成對成本，無法表達更高階的依賴關係，例如「兩條軌跡不應在空間上重疊」或「軌跡的外觀應平滑變化」。在團隊運動等密集場景中，這些限制導致頻繁的身份切換與碎片化軌跡。

段落功能建立研究場域——指出偵測式追蹤的成對成本限制。

邏輯角色論證起點：先概述 MOT 的主流範式，再以「成對成本」的數學限制為突破口，為引入高階約束建立需求。

論證技巧 / 潛在漏洞以具體的約束範例（空間不重疊、外觀平滑）使「高階依賴」具體化。但網路流方法實際上可透過邊容量約束部分地表達空間不重疊——作者的批判可能過度簡化了基準方法的能力。

We propose a fundamentally different formulation: Constrained Sequential Labeling (CSL). Instead of associating detections, we extract mid-level features (superpixels or detection fragments) and assign each an object identity label. The labeling is performed sequentially over time, with hard constraints enforced through constraint propagation — a technique borrowed from constraint satisfaction problems (CSP) in artificial intelligence. This formulation naturally accommodates higher-order dependencies, arbitrary cost functions, and domain-specific constraints that are difficult or impossible to express in flow-based frameworks.

我們提出一個本質上不同的形式化：約束序列標記（CSL）。我們不關聯偵測結果，而是提取中階特徵（超像素或偵測片段）並為每個指定一個物件身份標籤。標記在時間上依序進行，透過約束傳播來強制執行硬約束——這是一種借鑒自人工智慧中約束滿足問題（CSP）的技術。此形式化自然地容納了在流式框架中難以或無法表達的高階依賴關係、任意成本函數與領域特定約束。

段落功能提出核心框架——從資料關聯轉向序列標記。

邏輯角色此段是方法論的核心轉折：將 MOT 重新框架為 CSP，借用 AI 中的約束傳播技術來處理計算機視覺問題。

論證技巧 / 潛在漏洞「借鑒自 CSP」的跨領域遷移是新穎的觀點，但約束傳播的效率高度依賴約束的結構——在物件數量大時，約束傳播的計算複雜度可能成為瓶頸。

Multi-object tracking methods can be categorized by their optimization strategy. Network-flow methods model tracking as minimum-cost flow and can be solved globally optimally in polynomial time, but are limited to pairwise interactions. Conditional Random Field (CRF) approaches can encode richer dependencies but require approximate inference. Multiple Hypothesis Tracking (MHT) maintains a tree of hypotheses but is exponential in the number of objects. Our CSL approach combines the expressiveness of CRF-like models with the efficiency of constraint propagation, offering a middle ground between optimality and expressiveness.

多物件追蹤方法可依其最佳化策略分類。網路流方法將追蹤建模為最小成本流，可在多項式時間內求得全域最優解，但僅限於成對互動。條件隨機場（CRF）方法可編碼更豐富的依賴關係，但需要近似推論。多假說追蹤（MHT）維護一棵假說樹，但在物件數量上呈指數級增長。我們的 CSL 方法結合了類 CRF 模型的表達力與約束傳播的效率，在最優性與表達力之間提供了一個中間地帶。

段落功能文獻回顧——以最佳化策略為軸線比較三類 MOT 方法。

邏輯角色以三方比較（網路流、CRF、MHT）定位 CSL：比網路流更具表達力，比 CRF 更高效，比 MHT 更可擴展。

論證技巧 / 潛在漏洞「中間地帶」的定位策略避免了與任何單一方法的直接對抗。但此定位也意味著 CSL 在任何單一維度上都不是最優——讀者可能質疑是否存在真正需要此折衷的應用場景。

3. Method — 方法

3.1 Sequential Labeling Formulation

At each time step t, we extract a set of mid-level features F_t = {f_1, ..., f_m} from the video frame. Each feature is a localized region (e.g., a detection bounding box or a superpixel cluster) characterized by position, appearance histogram, and motion vector. The goal is to assign each feature a label from the set {0, 1, ..., K}, where labels 1 through K represent tracked objects and label 0 represents background. The labeling at time t is denoted L_t = {l_1, ..., l_m}. A cost function C(L_t | L_{t-1}, F_t) evaluates the quality of the labeling given the previous labeling and current features.

在每個時間步 t，我們從影片影格中提取一組中階特徵 F_t = {f_1, ..., f_m}。每個特徵是一個局部化的區域（如偵測邊界框或超像素簇），以位置、外觀直方圖與運動向量來描述。目標是為每個特徵指定一個來自集合 {0, 1, ..., K} 的標籤，其中標籤 1 到 K 代表被追蹤的物件，標籤 0 代表背景。時間 t 的標記記為 L_t = {l_1, ..., l_m}。成本函數 C(L_t | L_{t-1}, F_t) 在給定前一時間步的標記與當前特徵下，評估標記的品質。

段落功能形式化定義——建立序列標記問題的數學框架。

邏輯角色將直覺性的「為每個區域指定身份」轉化為嚴格的最佳化問題。成本函數的馬可夫性質（僅依賴前一步）使序列處理成為可能。

論證技巧 / 潛在漏洞馬可夫假設（僅看前一步）簡化了計算但可能丟失長程依賴。在長時間遮擋後的重新識別場景中，此假設可能過度限制。

3.2 Constraint Propagation — 約束傳播

The key novelty lies in incorporating hard constraints into the labeling process via constraint propagation. We define several types of constraints: (1) Uniqueness: each object label appears at most once per time step; (2) Spatial exclusion: features within a certain distance cannot share the same non-background label; (3) Appearance consistency: the appearance of a tracked object should not change drastically between consecutive frames. These constraints are enforced by iteratively pruning the domain of possible labels for each feature, a process analogous to arc consistency in constraint satisfaction. This pruning dramatically reduces the search space, making the optimization tractable.

關鍵創新在於透過約束傳播將硬約束納入標記過程。我們定義了幾種約束類型：(1) 唯一性：每個物件標籤在每個時間步最多出現一次；(2) 空間排斥：在一定距離內的特徵不能共享相同的非背景標籤；(3) 外觀一致性：被追蹤物件的外觀在連續影格之間不應有劇烈變化。這些約束透過迭代修剪每個特徵的可能標籤域來強制執行——此過程類似於約束滿足中的弧一致性。此修剪大幅縮減了搜尋空間，使最佳化在計算上可行。

段落功能核心創新——描述三類硬約束及其強制執行機制。

邏輯角色此段是全文論證的支柱：約束傳播既是理論創新（將 CSP 引入 MOT），也是實用工具（大幅縮減搜尋空間）。

論證技巧 / 潛在漏洞三類約束的設計直覺且實用，但「硬約束」意味著對違反的零容忍——在雜訊環境下（偵測器輸出不穩定），某些約束可能過度嚴格，導致可行解不存在。軟約束的替代方案未被討論。

3.3 Constraint Learning — 約束學習

Rather than manually tuning the constraint parameters, we propose a learning algorithm that automatically determines the constraint thresholds from training data. The learning procedure minimizes a loss function that penalizes both missed constraints (leading to identity switches) and overly strict constraints (leading to track fragmentation). We prove that this learning algorithm converges to a fixed point under mild assumptions. The learned constraints are domain-adaptive: constraints learned for volleyball differ from those for pedestrian tracking, reflecting the distinct spatial and temporal characteristics of each scenario.

我們不手動調整約束參數，而是提出一個從訓練資料中自動決定約束閾值的學習演算法。學習程序最小化一個損失函數，同時懲罰遺漏的約束（導致身份切換）與過度嚴格的約束（導致軌跡碎片化）。我們證明此學習演算法在溫和的假設下收斂至不動點。學習到的約束具有領域適應性：為排球學習的約束不同於行人追蹤的約束，反映了每種場景獨特的時空特性。

段落功能自動化——描述資料驅動的約束學習機制。

邏輯角色此段解決了手動約束設計的可擴展性問題：從人工設定轉為資料驅動學習，使方法可適應不同的應用場景。

論證技巧 / 潛在漏洞收斂性證明增強了理論背書。「遺漏約束 vs. 過度約束」的雙向懲罰設計體現了精度-召回的權衡思維。但「溫和假設」的具體內容未在摘要級描述中明確，讀者需查閱證明細節。

4. Experiments — 實驗

We evaluate on three challenging multi-object tracking scenarios: volleyball (6 players per team, frequent occlusion), basketball (5 players per team, fast motion), and pedestrian sequences from standard MOT benchmarks. On the volleyball dataset, CSL achieves significant reductions in identity switches (40-60% fewer) compared to network-flow baselines while maintaining competitive MOTA scores. On basketball sequences, the track fragmentation rate is reduced by 35%. On standard pedestrian benchmarks, CSL achieves competitive performance with state-of-the-art methods. Ablation studies confirm that constraint propagation is essential — removing it degrades performance to below the network-flow baseline, and learned constraints consistently outperform hand-tuned ones.

我們在三個具挑戰性的多物件追蹤場景上評估：排球（每隊 6 名球員，頻繁遮擋）、籃球（每隊 5 名球員，快速運動）以及標準 MOT 基準中的行人序列。在排球資料集上，CSL 相較於網路流基準達到了身份切換的顯著降低（減少 40-60%），同時維持具競爭力的 MOTA 分數。在籃球序列上，軌跡碎片化率降低了 35%。在標準行人基準上，CSL 達到與最先進方法相當的表現。消融研究確認了約束傳播是不可或缺的——移除它使效能退化至網路流基準以下，且學習到的約束一致優於手動調整的約束。

段落功能提供全面的實驗證據——在運動與行人場景中驗證方法的有效性。

邏輯角色實證支柱覆蓋三個維度：(1) 運動場景的身份切換降低；(2) 行人場景的競爭力表現；(3) 約束傳播與約束學習的消融驗證。

論證技巧 / 潛在漏洞在運動場景（最需要高階約束的場景）上的大幅改善具說服力。但在行人場景上僅為「競爭力」而非「超越」，暗示 CSL 的優勢主要在密集互動場景中。

5. Conclusion — 結論

We have presented Constrained Sequential Labeling (CSL), a novel formulation for multi-object tracking that goes beyond pairwise data association by incorporating hard constraints through constraint propagation. The approach naturally handles higher-order dependencies and is accompanied by a constraint learning algorithm with convergence guarantees. Experiments demonstrate that CSL is particularly effective in densely packed scenarios with frequent occlusions. Future work includes extending CSL to online tracking settings and incorporating deep learning-based appearance models for more robust feature representations.

我們提出了約束序列標記（CSL），一種多物件追蹤的新形式化方法，透過約束傳播納入硬約束，超越了成對的資料關聯。此方法自然地處理高階依賴關係，並附帶具收斂保證的約束學習演算法。實驗展示 CSL 在頻繁遮擋的密集場景中特別有效。未來工作包括將 CSL 擴展至線上追蹤設定，以及納入基於深度學習的外觀模型以獲得更穩健的特徵表示。

段落功能總結全文——強調在密集場景中的特殊優勢並展望未來。

邏輯角色結論誠實地聚焦於方法最擅長的場景（密集遮擋），而非泛稱全面優越。「深度學習外觀模型」的未來方向預示了 MOT 領域後續的演進方向。

論證技巧 / 潛在漏洞未討論的關鍵限制包括：方法的計算複雜度隨物件數量的增長行為，以及在全域最佳化層面相較於網路流的理論保證缺失。線上追蹤的擴展也是一個非平凡的挑戰。

論證結構總覽

問題
網路流追蹤無法
表達高階依賴

→

論點
約束序列標記
支援硬約束傳播

→

證據
身份切換減少 40-60%
運動與行人場景驗證

→

反駁
約束可從資料學習
具收斂保證

→

結論
CSL 在密集場景中
顯著優於傳統方法

作者核心主張（一句話）

將多物件追蹤重新框架為約束序列標記問題，透過約束傳播來強制執行唯一性、空間排斥與外觀一致性等硬約束，能有效降低密集場景中的身份切換與軌跡碎片化。

論證最強處

理論與實務的結合：約束傳播的引入不僅帶來了理論上的表達力提升，更直接轉化為身份切換 40-60% 的實測降低。約束學習演算法的收斂性證明增添了學術嚴謹性，而領域適應的實驗結果則展示了方法的實用靈活性。

論證最弱處

硬約束的脆弱性：在偵測器品質不穩定或場景極度擁擠時，硬約束可能導致約束衝突（無可行解）。此外，方法在行人場景上僅達到「競爭力」而非超越現有方法，暗示在不需高階約束的場景中，CSL 的額外複雜度未帶來顯著收益。