Towards Streaming Perception

Abstract

We present an approach towards streaming perception, the problem of real-time understanding of dynamic visual scenes that considers both accuracy and latency as a joint optimization target. Current benchmarks evaluate detectors on static images without considering the delay between image capture and result availability. In a streaming setting, this delay means that by the time results are available, the world has moved on, rendering the results outdated. We introduce a meta-benchmark that jointly evaluates the accuracy and latency of any detection algorithm by simulating the streaming setting, and show that the optimal detector under streaming evaluation is very different from the optimal detector under traditional offline evaluation.

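Paragraph function: Overview of the whole paper. Defines the streaming perception problem and introduces a new evaluation framework.

Logical role: Establishes the need to redefine the problem by exposing a fundamental flaw in existing evaluation, namely that it ignores latency.

Argumentative technique / potential weaknesses: The phrase "the world has moved on" vividly conveys how serious the latency problem is. However, the issue has long been discussed in autonomous driving, so the paper's novelty lies more in its formal definition.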

1. Introduction

The dominant paradigm in perception for autonomous systems treats vision as a sequence of independent frames: capture an image, process it through a detector, and output results. But in the real world, perception is inherently a streaming process — the environment continuously changes while the algorithm is computing. A detector that takes 100ms to process a frame will return results that describe the world 100ms in the past. For a vehicle traveling at 60 mph, this corresponds to nearly 3 meters of travel, making the results potentially dangerous for downstream planning.

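Paragraph function: Establishes motivation. Uses a concrete autonomous-driving scenario to illustrate the safety hazard posed by latency.

Logical role: Turns the abstract "latency problem" into a quantifiable safety risk (an offset of nearly 3 meters), giving strong support to the motivation for studying streaming perception.

Argumentative technique / potential weaknesses: The calculation "60 mph for 100 ms is nearly 3 meters" quantifies the severity of the problem intuitively and is highly persuasive.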
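
The "nearly 3 meters" figure can be checked with a few lines of arithmetic. The sketch below assumes constant speed over the latency window; the function name `distance_during_latency` is introduced here for illustration and does not come from the paper.

```python
# One mile is exactly 1609.344 m and one hour is 3600 s.
MPH_TO_MPS = 1609.344 / 3600.0


def distance_during_latency(speed_mph: float, latency_ms: float) -> float:
    """Meters traveled at constant speed while one frame is being processed."""
    return speed_mph * MPH_TO_MPS * (latency_ms / 1000.0)


print(f"{distance_during_latency(60, 100):.2f} m")  # → 2.68 m
```

At 60 mph and 100 ms this gives about 2.7 meters, which matches the paragraph's "nearly 3 meters" as a rounded figure.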

We argue that the community needs a fundamental rethinking of how perception systems are evaluated. Standard benchmarks like COCO and Cityscapes evaluate detectors on static snapshots, rewarding accuracy without penalizing latency. This creates a perverse incentive to build increasingly complex and slow models. We propose a streaming evaluation protocol that measures the accuracy of a detector's output at the time it becomes available, against the ground truth at that same time, naturally incorporating the cost of latency.

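Paragraph function: Critiques the status quo. Points out a systematic bias in existing evaluation metrics.

Logical role: The "perverse incentive" argument is the paper's central insight: existing metrics are not merely incomplete, they steer research in the wrong direction.

Argumentative technique / potential weaknesses: Framing the community's pursuit of ever-higher accuracy as a "perverse incentive" is bold but effective rhetoric. In many non-real-time applications (such as medical imaging), however, accuracy should remain the primary goal.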

2. Streaming Perception Framework

2.1 Streaming Evaluation

We formalize the streaming evaluation as follows: given a continuous video stream, a detector processes frames and produces outputs with some latency. The streaming AP (sAP) evaluates each output against the ground truth at the time the output becomes available, not when the input frame was captured. This means a fast but less accurate detector may score higher than a slow but more accurate one, because the fast detector's results are more temporally aligned with reality. We further introduce forecasting-based methods that predict where objects will be at a future time, compensating for computational delay.

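Paragraph function: Formal definition. Describes precisely how streaming AP is computed and how forecasting fits in.

Logical role: The definition of sAP is the paper's most central contribution: a metric that naturally unifies accuracy and latency.

Argumentative technique / potential weaknesses: The insight that "fast but less accurate" can beat "slow but more accurate" is counterintuitive yet sound. The forecasting mechanism leaves room for further improvement.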
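
The temporal pairing behind sAP can be sketched in a few lines. This is one plausible reading of the definition above, not the authors' implementation: it assumes that at every annotated time the evaluator scores the most recent output that has already finished. The function name and the timestamps are hypothetical, and the AP computation itself is omitted.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def pair_outputs_with_ground_truth(
    gt_times: Sequence[float],
    output_ready_times: Sequence[float],
) -> List[Optional[int]]:
    """For each ground-truth timestamp, return the index of the latest output
    already available at that time, or None if no output is ready yet.

    output_ready_times must be sorted in ascending order.
    """
    pairs: List[Optional[int]] = []
    for t in gt_times:
        # Index of the last output whose ready time is <= t.
        k = bisect_right(output_ready_times, t) - 1
        pairs.append(k if k >= 0 else None)
    return pairs


# Hypothetical 30 FPS stream (timestamps in ms) and a detector that needs
# 100 ms per frame, so its outputs become ready at 100 ms and 200 ms.
gt_times = [0, 33, 67, 100, 133, 167, 200]
output_ready_times = [100, 200]
print(pair_outputs_with_ground_truth(gt_times, output_ready_times))
# → [None, None, None, 0, 0, 0, 1]
```

In this toy stream the first three annotated frames have no output to score, and the output computed from the frame at 0 ms is judged against ground truth from 100 ms to 167 ms. That mismatch is how latency turns into lost accuracy under streaming evaluation.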

3. Experiments

We evaluate a range of detectors under the streaming setting using the Argoverse-HD dataset. Under standard evaluation, heavier models (e.g., Cascade R-CNN with ResNeXt-101) achieve the highest AP. However, under streaming evaluation, lighter models (e.g., SSD with MobileNet) achieve comparable or better streaming AP due to their lower latency. The ranking of detectors is almost completely reversed: the best offline detector drops to near the bottom in the streaming ranking. Adding simple linear motion forecasting improves streaming AP by 2-5 points across most models.

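Paragraph function: Empirical validation. The ranking-reversal result strongly supports the paper's central claim.

Logical role: "The ranking is almost completely reversed" is the paper's most striking finding and directly confirms the fundamental flaw in existing evaluation.

Argumentative technique / potential weaknesses: The ranking reversal is highly persuasive, but the experiments were run only on Argoverse-HD, so generalization to other scenarios still needs to be verified.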
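
As a rough illustration of what "simple linear motion forecasting" could look like, the sketch below extrapolates a bounding box under a constant-velocity assumption from its two most recent observations. It assumes the two boxes have already been associated with the same object; the function name, box format, and example values are hypothetical and are not taken from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def forecast_box_linear(
    box_prev: Box, t_prev: float,
    box_curr: Box, t_curr: float,
    t_future: float,
) -> Box:
    """Extrapolate a box to t_future, assuming each coordinate moves at constant velocity."""
    if t_curr <= t_prev:
        raise ValueError("t_curr must be later than t_prev")
    # How many observation intervals ahead we are forecasting.
    steps_ahead = (t_future - t_curr) / (t_curr - t_prev)
    x1, y1, x2, y2 = (c + (c - p) * steps_ahead for p, c in zip(box_prev, box_curr))
    return (x1, y1, x2, y2)


# Hypothetical track (times in ms): the box moved 10 px to the right over
# 100 ms, and the result will only be used 100 ms after the latest observation.
print(forecast_box_linear((100, 50, 140, 90), 0, (110, 50, 150, 90), 100, 200))
# → (120.0, 50.0, 160.0, 90.0)
```

The forecast shifts the reported box to where the object is expected to be when the output is consumed, which is the sense in which forecasting compensates for computational delay.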

We also analyze the accuracy-latency trade-off in more detail. By varying the backbone architecture and input resolution, we trace out a Pareto frontier in the accuracy-latency space. Under streaming evaluation, the optimal operating point shifts dramatically towards faster models. We find that reducing input resolution is often more beneficial than using a lighter backbone, as it provides a larger latency reduction per unit of accuracy loss. These findings have significant implications for the design of real-time perception systems.

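Paragraph function: In-depth analysis. Provides practical guidance for designing real-time perception systems.

Logical role: Moves from identifying the problem to pointing toward solutions; the Pareto analysis gives the conclusions practical engineering value.

Argumentative technique / potential weaknesses: The finding that lowering resolution beats switching to a lighter backbone is practical engineering guidance. These conclusions, however, depend heavily on the hardware platform and the characteristics of the video.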
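
For readers unfamiliar with the term, a Pareto frontier in the accuracy-latency space keeps only the configurations that no other configuration beats on both axes. The sketch below is a minimal illustration; the `OperatingPoint` type, the configuration names, and all numbers are made up for the example and are not results from the paper.

```python
from typing import Iterable, List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    latency_ms: float
    ap: float


def pareto_frontier(points: Iterable[OperatingPoint]) -> List[OperatingPoint]:
    """Keep points for which no other point is both at least as fast and more accurate."""
    frontier: List[OperatingPoint] = []
    best_ap = float("-inf")
    # Scan from fastest to slowest; a slower point survives only if it
    # improves on the best accuracy seen so far.
    for p in sorted(points, key=lambda q: (q.latency_ms, -q.ap)):
        if p.ap > best_ap:
            frontier.append(p)
            best_ap = p.ap
    return frontier


# Purely illustrative placeholder values, not measurements.
candidates = [
    OperatingPoint("config_a", 20, 20.0),
    OperatingPoint("config_b", 35, 27.0),
    OperatingPoint("config_c", 40, 25.0),   # slower and less accurate than config_b
    OperatingPoint("config_d", 80, 33.0),
    OperatingPoint("config_e", 120, 32.0),  # slower and less accurate than config_d
]
print([p.name for p in pareto_frontier(candidates)])
# → ['config_a', 'config_b', 'config_d']
```

Varying backbone and input resolution, as the paragraph describes, amounts to generating many such candidate points and then asking which frontier point scores best under streaming evaluation.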

4. Conclusion

We have introduced the problem of streaming perception and proposed a principled framework for jointly evaluating the accuracy and latency of perception systems. Our key finding is that the optimal detector under streaming conditions is fundamentally different from the optimal detector under traditional evaluation, challenging the current research paradigm. We hope this work will inspire the community to rethink the design and evaluation of real-time perception systems, giving equal consideration to latency as to accuracy.

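Paragraph function: Summary. Restates the core finding of streaming perception and its impact on research direction.

Logical role: Rises from the technical contribution to a reflection on the research paradigm, calling on the community to change how it evaluates.

Argumentative technique / potential weaknesses: As a "problem definition" paper, its core value lies in changing how people think rather than in proposing a new algorithm. Its Honorable Mention shows that reviewers recognized this reframing.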

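Argument structure overview

Problem: Existing evaluation ignores latency
→ Claim: Perception is a streaming process
→ Method: Streaming AP + motion forecasting
→ Evidence: Detector rankings almost completely reversed
→ Conclusion: Redefine the evaluation paradigm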

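Core claim

In real-time dynamic environments, latency is a fundamental component of perception quality, and the traditional accuracy-first style of evaluation creates a perverse incentive that steers research in the wrong direction.

Strongest point of the argument

The experimental result that detector rankings almost completely reverse under streaming evaluation demonstrates, in a way that is hard to refute, the fundamental flaw in the existing evaluation framework.

Weakest point of the argument

The approach is validated only in an autonomous-driving setting (Argoverse-HD), so its generalization to other real-time perception applications (robotic manipulation, AR) is not established. The linear motion forecasting is overly simplistic.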