Annotating Object Instances with a Polygon-RNN

Abstract

We propose an approach for semi-automatic annotation of object instances. We frame the object segmentation task as a polygon prediction problem, where the model takes an image crop as input and sequentially produces the vertices of the polygon outlining the object. This allows the annotator to intervene at any time and correct a vertex, upon which the model updates its prediction. We show that our approach achieves "a factor of 4.7" speed improvement across Cityscapes classes with "78.4% agreement in IoU with original ground-truth," matching typical inter-annotator agreement levels. For vehicles specifically, the speed-up reaches 7.3x with 82.2% agreement.

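Paragraph function: Overview of the whole paper. It recasts object segmentation as polygon prediction and emphasizes a semi-automatic, human-in-the-loop workflow.

Logical role: The core innovation in the abstract is a problem transformation, from pixel-level segmentation to predicting a sequence of polygon vertices. This transformation unlocks two advantages at once: (1) the sequence-generation ability of an RNN; (2) the ability of a human to intervene and correct at the level of individual vertices.

Argumentation technique / potential weakness: Using "matching inter-annotator agreement" as the quality benchmark is a clever strategy, since it implies the model has reached human level. However, 78.4% IoU still falls short of state-of-the-art fully automatic segmentation methods, so the method's value has to be positioned within the speed-versus-accuracy trade-off.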

1. Introduction

Semantic segmentation demands large-scale annotated datasets. While deep learning approaches achieve impressive results, they remain "data hungry", with performance correlated with the volume of training data. Most large-scale segmentation datasets, such as Cityscapes, were annotated with polygons: human annotators click polygon vertices around each object. This process is extremely time-consuming and expensive. The authors propose Polygon-RNN, which generates polygon vertices sequentially from an image crop inside a bounding box, allowing a human to intervene and correct while the output remains a structurally coherent polygon.

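Paragraph function: Establishes the motivation. Starting from the scarcity and high cost of annotated data, it leads to the need for automated annotation.

Logical role: The causal chain is clear: deep learning needs large amounts of data -> annotation is slow and expensive -> semi-automatic tools are needed. Polygon-RNN is proposed as a direct response to this chain.

Argumentation technique / potential weakness: Using annotation cost as the motivation resonates strongly with practitioners. But the argument assumes that polygons are the best annotation format; for objects with holes or complex topology, a polygon may be less precise than a pixel-level mask.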

Existing semi-automatic annotation approaches include scribble-based methods, GrabCut variants, and superpixel labeling. These pixel-level graphical models struggle to incorporate shape priors, and they often produce holes or require extensive manual correction. The authors position their polygon approach as a more natural match for existing annotation practice: most datasets are annotated with polygons, not pixel masks. The structured output (a polygon) is also easier for a human to correct, since adjusting a vertex is more intuitive than fixing scattered pixel labels.

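Paragraph function: Positions the work in the literature. It contrasts the shortcomings of pixel-level methods with how well the polygon approach fits existing workflows.

Logical role: This paragraph argues for the polygon approach from two angles: (1) technical, because it avoids pixel-level problems such as holes; (2) practical, because it is consistent with existing annotation workflows. The two-sided argument strengthens the case for the method.

Argumentation technique / potential weakness: The "matches existing annotation practice" argument is persuasive in practice. But the comparison with GrabCut may not be entirely fair: GrabCut addresses a quite different problem setting (it starts from user hints) rather than sequence prediction.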

3. Polygon-RNN

3.1 Model Architecture

The model has two components. The CNN image encoder adapts VGG-16: it removes the fully connected layers and uses skip connections to concatenate multi-scale features, giving 8x downsampling relative to the input. The RNN vertex predictor is a two-layer Convolutional LSTM (ConvLSTM) with 3x3 kernels and 16 channels. At each timestep, the ConvLSTM receives a concatenated input made up of the CNN features, one-hot encodings of the previous two vertices, and a one-hot encoding of the first vertex. The output is a one-hot encoding over a D x D grid (D = 28) plus one end-of-sequence token, so each step predicts the position of the next vertex on a quantized grid.

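Paragraph function: Core method. It describes the architectural details of the CNN encoder and the RNN decoder.

Logical role: This paragraph sets up a two-module "perception and decision" architecture: the CNN understands the image, and the RNN generates vertices in order. Encoding three past vertices (the previous two plus the first) supports the polygon's closure constraint and its smoothness.

Argumentation technique / potential weakness: Choosing a ConvLSTM preserves spatial structure, which suits this task better than a plain LSTM. But the quantization resolution of D = 28 means the finest achievable vertex precision is 1/28 of the input, which may not be precise enough on complex contours. In addition, VGG-16 was no longer the strongest feature extractor at the time.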
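The quantized output described above can be illustrated with a minimal sketch. It assumes a square 224-pixel crop (implied by 8x downsampling to a 28 x 28 grid, but not stated in these notes); the function names are hypothetical and this is not the authors' code. It maps a pixel coordinate to a grid cell, flattens the cell to an index in the D*D + 1 output vector, and maps the cell back to a pixel to expose the quantization error.

```python
D = 28             # grid resolution stated in the text
EOS_INDEX = D * D  # the extra slot after the D*D grid cells marks end-of-sequence


def quantize_vertex(x, y, crop_w, crop_h, d=D):
    """Map a pixel coordinate inside the crop to a (row, col) grid cell."""
    col = min(int(x / crop_w * d), d - 1)
    row = min(int(y / crop_h * d), d - 1)
    return row, col


def cell_center(cell, crop_w, crop_h, d=D):
    """Map a grid cell back to the pixel coordinate (x, y) of its center."""
    row, col = cell
    return (col + 0.5) * crop_w / d, (row + 0.5) * crop_h / d


def output_index(cell, d=D):
    """Flatten a grid cell to its position in the D*D + 1 output vector."""
    row, col = cell
    return row * d + col


def one_hot_output(index, d=D):
    """One-hot vector over all grid cells plus the end-of-sequence slot."""
    vec = [0.0] * (d * d + 1)
    vec[index] = 1.0
    return vec


# Assumed 224 x 224 crop: each grid cell then covers 8 x 8 pixels.
cell = quantize_vertex(112.0, 56.0, crop_w=224, crop_h=224)
recovered = cell_center(cell, crop_w=224, crop_h=224)
```

Under these assumptions a vertex recovered from its cell center can sit up to half a cell (4 pixels per axis) away from the original click, which is the precision concern raised in the note above.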

3.2 Training

Training uses cross-entropy loss at each RNN timestep with smoothed target distributions — non-zero probability mass is assigned to locations within distance 2 in the output grid, softening the hard one-hot targets. The Adam optimizer trains with batch size 8 and learning rate 1e-4, decayed by 10x after 10 epochs. First vertex prediction uses a multi-task loss combining logistic losses for boundaries and vertices. Data augmentation includes random flipping, context expansion (10-20% beyond bounding box), and random polygon starting point selection. Training completes in approximately one day on an Nvidia Titan-X GPU.

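Paragraph function: Training details. It describes the loss function, the optimization strategy, and the data augmentation.

Logical role: Smoothing the target distribution is an important technical choice. Neighboring positions on the quantized grid are nearly equivalent, so a hard one-hot target would penalize reasonable near-miss predictions. The design reflects a solid understanding of the task.

Argumentation technique / potential weakness: Random starting points are a key augmentation: the starting vertex of a polygon should not affect the result, and this augmentation makes the model invariant to rotations of the vertex order. But training with teacher forcing can lead to error accumulation at inference time (exposure bias), and this issue is not discussed.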
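A smoothed target of the kind described above can be sketched as follows. The notes only say that non-zero mass goes to locations within distance 2 on the output grid; the Euclidean metric and the uniform split of mass used here are assumptions for illustration, and the function names are hypothetical.

```python
import math

D = 28  # grid resolution stated in the text


def smoothed_target(cell, d=D, radius=2):
    """Spread probability mass uniformly over grid cells within `radius`.

    Assumption: Euclidean distance and uniform weights; the source does not
    specify either. The last slot (end-of-sequence) gets no mass here.
    """
    row, col = cell
    target = [0.0] * (d * d + 1)
    neighbors = [
        (r, c)
        for r in range(max(0, row - radius), min(d, row + radius + 1))
        for c in range(max(0, col - radius), min(d, col + radius + 1))
        if (r - row) ** 2 + (c - col) ** 2 <= radius ** 2
    ]
    for r, c in neighbors:
        target[r * d + c] = 1.0 / len(neighbors)
    return target


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


interior = smoothed_target((10, 10))
corner = smoothed_target((0, 0))
uniform_prediction = [1.0 / (D * D + 1)] * (D * D + 1)
loss = cross_entropy(interior, uniform_prediction)
```

With this target, a prediction that lands one cell away from the annotated vertex still receives credit, whereas a hard one-hot target would treat it the same as a prediction on the far side of the grid.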

3.3 Annotator-in-the-Loop

During inference, the model selects the vertex with the highest log-probability at each timestep. The key innovation is the annotator-in-the-loop mechanism: the annotator can correct a predicted vertex at any step, and the corrected vertex is then fed into the subsequent predictions. Each human correction therefore propagates forward and can improve every later vertex. Typical inference requires only 250 milliseconds per object, enabling real-time interactive annotation. Experiments with a simulated annotator that corrects any prediction deviating by more than a distance threshold T show that, at T=3, only 9.39 clicks per instance achieve 78.40% IoU agreement.

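Paragraph function: Core innovation. It describes the human-in-the-loop annotation workflow and its efficiency advantage.

Logical role: This is the most distinctive part of the paper. Unlike fully automatic methods that aim for zero human involvement, Polygon-RNN is deliberately designed for human-machine collaboration so that each correction pays off as much as possible. The 250 ms inference time keeps the interaction fluid.

Argumentation technique / potential weakness: The forward propagation of corrections is the method's core strength, since a single correction triggers a chain of improvements. But the simulated annotator (which corrects automatically by threshold) may overestimate real efficiency: a real annotator needs time to decide which vertices need correcting, and that cognitive cost is not counted.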
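The simulated annotator can be sketched as a loop around a vertex predictor. This is a simplified illustration, not the paper's protocol: it assumes predictions align one-to-one with ground-truth vertices, uses Euclidean distance on the grid, and replaces the real model with a hypothetical toy predictor (`toy_predictor`) so the example runs on its own.

```python
import math


def simulate_annotation(predict_next, ground_truth, threshold):
    """Greedy decoding with a simulated annotator.

    Whenever a predicted vertex is farther than `threshold` grid cells from
    the ground-truth vertex, the ground-truth vertex is used instead and
    counted as one click. The corrected vertex is part of the history that
    the predictor sees on the next step.
    """
    polygon, clicks = [], 0
    for gt_vertex in ground_truth:
        predicted = predict_next(polygon)
        deviation = math.hypot(predicted[0] - gt_vertex[0],
                               predicted[1] - gt_vertex[1])
        if deviation > threshold:
            predicted = gt_vertex  # simulated corrective click
            clicks += 1
        polygon.append(predicted)
    return polygon, clicks


def toy_predictor(polygon):
    """Hypothetical stand-in for the model: always steps 5 columns right."""
    if not polygon:
        return (5, 5)
    row, col = polygon[-1]
    return (row, col + 5)


square = [(5, 5), (5, 10), (10, 10), (10, 5)]  # (row, col) grid cells
strict_polygon, strict_clicks = simulate_annotation(toy_predictor, square, threshold=3)
loose_polygon, loose_clicks = simulate_annotation(toy_predictor, square, threshold=8)
```

A larger threshold tolerates more deviation, so it needs fewer clicks but leaves a polygon that agrees less with the ground truth. This is the same trade-off the T parameter controls in the experiments summarized above.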

4. Results

On Cityscapes without annotator interaction, Polygon-RNN outperforms DeepMask, SharpMask, and Dilation10 on six of eight categories. For cars, it achieves 71.17% IoU, exceeding SharpMask by 6%. In human evaluation on 101 car instances, the model reaches 82% IoU agreement with only 4.6 average clicks (7.3x speed-up) and 87.7% with 9.3 clicks (3.6x speed-up). Compared to GrabCut, which requires 17.5 clicks per instance, Polygon-RNN needs only 5.0 to 9.6 clicks. On KITTI (741 instances, cross-dataset generalization), the model achieves comparable IoU to human agreement with 5.84 clicks per instance.

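Paragraph function: Core empirical evidence. It validates the method along three dimensions: automatic mode, interactive mode, and cross-dataset generalization.

Logical role: The evidence in this paragraph addresses three key questions: (1) Is the purely automatic mode competitive? (Yes, best on 6 of 8 categories.) (2) How efficient is the interactive mode? (7.3x speed-up.) (3) Does it generalize across datasets? (Validated on KITTI.)

Argumentation technique / potential weakness: The comparison with GrabCut (5 to 10 clicks versus 17.5 clicks) is highly persuasive. But a sample of 101 car instances is small and limited to a single category; whether the efficiency holds for objects with complex shapes (such as pedestrians or bicycles) needs further validation.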
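The agreement figures above are intersection-over-union (IoU) scores between a predicted polygon and the ground-truth polygon. A minimal sketch of that metric follows; it rasterizes each polygon by testing pixel centers with ray casting. The rasterization rule and the function names are assumptions for illustration and may differ from the evaluation code behind the reported numbers.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, width, height):
    """Return the set of (col, row) pixels whose centers lie inside the polygon."""
    return {
        (col, row)
        for row in range(height)
        for col in range(width)
        if point_in_polygon(col + 0.5, row + 0.5, polygon)
    }


def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two pixel sets."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


# Toy polygons given as (x, y) vertices on a 20 x 20 canvas.
gt_mask = rasterize([(0, 0), (10, 0), (10, 10), (0, 10)], 20, 20)
pred_mask = rasterize([(0, 0), (8, 0), (8, 10), (0, 10)], 20, 20)
iou = mask_iou(gt_mask, pred_mask)
```

In this toy case the predicted rectangle covers 80 of the 100 ground-truth pixels and nothing outside them, so the IoU is 0.8.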

5. Conclusion

Polygon-RNN provides a new paradigm for object instance annotation by framing segmentation as sequential polygon vertex prediction. The method produces structurally plausible polygon representations, enables user correction for desired accuracy levels through minimal clicks, and generalizes across datasets. Achieving a "speed-up of factor 4.74" while matching inter-annotator agreement demonstrates the practical utility for building large-scale annotation benchmarks.

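Paragraph function: Summarizes the paper and restates the practical value of semi-automatic annotation.

Logical role: The conclusion wraps up along three dimensions: quality (structurally plausible output), efficiency (4.74x speed-up), and generalization (across datasets). It ends by positioning the work in terms of practical utility rather than technical breakthrough.

Argumentation technique / potential weakness: Positioning Polygon-RNN as an annotation tool rather than a segmentation method is a pragmatic choice. But the conclusion does not explore deeper implications, such as whether biases in the training data could be amplified and propagated through automatic annotation.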

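Argument Structure Overview

Problem: Segmentation annotation is slow and expensive; deep learning is data hungry.
→ Claim: Sequential polygon prediction with human-in-the-loop correction.
→ Evidence: 4.7x speed-up; matches inter-annotator agreement.
→ Rebuttal: Generalizes across datasets; outperforms GrabCut.
→ Conclusion: A practical semi-automatic tool for large-scale annotation.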

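The Authors' Core Claim (in One Sentence)

By recasting object segmentation as RNN-based sequential prediction of polygon vertices, combined with real-time annotator correction, annotation quality that matches inter-annotator agreement can be reached with a 4.7x speed-up.

Strongest Point of the Argument

The elegant design of the human-machine interaction: each time a human corrects one vertex, all subsequent vertices update accordingly, maximizing the payoff of each intervention. This mechanism makes the method competitive in both fully automatic and interactive modes, and it fits seamlessly into existing polygon annotation workflows.

Weakest Point of the Argument

Output resolution and topological limits: the 28x28 quantized grid caps the spatial precision of vertices, which may be insufficient for thin or highly irregular objects. A polygon representation cannot handle objects with holes (such as frame-like structures), and sequential generation may accumulate errors when the number of vertices is very large. The follow-up work, Polygon-RNN++, targets exactly these limitations.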