Learning to Segment Every Thing

Abstract

Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This paper proposes a partially supervised training paradigm that enables learning instance segmentation on categories that have only bounding box annotations, by leveraging mask annotations available for a subset of categories. The core idea is a weight transfer function that predicts mask prediction parameters from detection parameters, enabling the model to generalise segmentation capability to novel categories. The authors train Mask R-CNN with this approach on 3000 visual concepts from Visual Genome using COCO mask annotations for only 80 categories, achieving significant improvements over baselines.

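Paragraph function: Overview of the whole paper. Starting from the mask annotation bottleneck, it introduces the core innovations of partial supervision and weight transfer.

Logical role: The abstract advances in four steps, "annotation bottleneck -> partial supervision -> weight transfer -> large-scale validation", covering the problem, the solution, and the results.

Argumentative technique / potential weakness: The jump from 80 categories to 3000 is striking (a 37.5x expansion). However, the abstract does not quantify how much mask quality degrades under this expansion.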

1. Introduction

Instance segmentation, which detects objects and predicts a pixel-level mask for each, is a fundamental computer vision task. Current state-of-the-art methods such as Mask R-CNN require densely annotated segmentation masks for all categories. However, mask annotations are significantly more expensive to obtain than bounding box annotations, typically 5-10 times more costly per instance. This limits instance segmentation systems to a small number of categories (typically on the order of 100, such as the 80 categories of COCO).

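Paragraph function: Establishes the research motivation by using annotation cost to quantify the scalability bottleneck of instance segmentation.

Logical role: Starting point of the argument chain. The concrete "5-10 times the cost" figure makes the annotation bottleneck quantifiable and gives the motivation for reducing annotation dependence an economic footing.

Argumentative technique / potential weakness: Quantifying the problem with a concrete cost multiple is effective. However, no source is cited for the 5-10x figure, and the ratio may have shifted as annotation tools have improved.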

The authors ask: "Is it possible to train high-quality instance segmentation models without complete instance segmentation annotations for all categories?" They propose a partially supervised setting where the category set C splits into set A (with both box and mask annotations) and set B (with only box annotations). The key hypothesis is that detection weights encode visual appearance information that can be transferred to predict segmentation weights for categories in set B, without ever seeing their masks during training.

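Paragraph function: States the core hypothesis, using a question to introduce the partially supervised research direction.

Logical role: This paragraph is the conceptual pivot of the paper, from "complete annotation is required" to "partial annotation suffices". The key hypothesis (detection weights can be transferred into segmentation weights) is the theoretical foundation of the whole paper.

Argumentative technique / potential weakness: Opening with a question leads the reader to accept the research direction naturally. The hypothesis that detection weights encode appearance information is intuitively plausible but lacks theoretical grounding: why would detection weights contain enough shape information to generate masks?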
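The partially supervised setting described above can be sketched as a data-preparation step. This is a minimal illustration, not the authors' pipeline: the annotation dictionary layout, the helper name `make_partially_supervised`, and the example categories are all assumptions made for the sketch.

```python
def make_partially_supervised(annotations, mask_categories):
    """Keep masks only for categories in set A; set B keeps boxes only.

    annotations: list of dicts with hypothetical keys "category", "box", "mask".
    mask_categories: the set A of categories that retain mask annotations.
    """
    result = []
    for ann in annotations:
        kept = dict(ann)  # shallow copy so the input list is left untouched
        if ann["category"] not in mask_categories:
            kept["mask"] = None  # set B: the mask is never shown to the model
        result.append(kept)
    return result


# Hypothetical toy data: "dog" is in set A, "kite" is in set B.
set_a = {"dog", "car"}
annotations = [
    {"category": "dog", "box": (10, 20, 50, 80), "mask": [[0, 1], [1, 1]]},
    {"category": "kite", "box": (5, 5, 30, 30), "mask": [[1, 0], [0, 0]]},
]
partial = make_partially_supervised(annotations, set_a)
```

Every instance keeps its bounding box, so the detection loss can still be computed for all categories in C, while the mask loss is only defined for set A.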

2. Related Work

Instance segmentation methods, particularly Mask R-CNN, have achieved impressive results by extending Faster R-CNN with a mask prediction branch. Weight prediction networks (or hypernetworks) generate the parameters of one network from the output of another, enabling dynamic architectures. Transfer learning approaches leverage knowledge from data-rich domains to improve performance in data-scarce settings. Weakly supervised segmentation methods use image-level labels or bounding boxes as supervision but typically lag behind fully supervised approaches in quality. The class-agnostic mask prediction approach, which trains a single mask predictor shared across all categories, provides a natural baseline but ignores class-specific shape priors.

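Paragraph function: Literature survey that weaves together four threads: instance segmentation, hypernetworks, transfer learning, and weak supervision.

Logical role: Establishes the method's academic lineage: Mask R-CNN provides the architectural basis, hypernetworks provide the conceptual tool of weight prediction, and transfer learning provides the theoretical support for cross-domain generalisation.

Argumentative technique / potential weakness: Treating class-agnostic masks as a baseline rather than a competitor is a clever positioning, since it serves as both the point of comparison and the point of departure. However, a thorough comparison with weakly supervised methods as an alternative route is omitted.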

3. Method

3.1 Architecture Overview

The method builds on Mask R-CNN, which predicts a class-specific binary mask for each detected region of interest. In standard Mask R-CNN, the mask head has class-specific parameters w_seg^c for each category c, which are directly learned from mask annotations. The authors propose replacing this direct learning with a weight transfer function that computes w_seg^c = T(w_det^c; theta), where w_det^c are the detection classification weights for category c and T is a learnable neural network parameterised by theta.

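Paragraph function: Core mechanism. Defines the mathematical form of the weight transfer function.

Logical role: This paragraph is the technical backbone of the paper: the function T maps detection knowledge to segmentation knowledge, and because it is learnable, the transfer relationship does not have to be hand-designed.

Argumentative technique / potential weakness: The implicit assumption that detection weights act as "class embeddings" is clever: if detection weights capture the visual essence of a category, shape information may indeed be encoded in them. However, the theoretical basis for this assumption is weak and rests mainly on experimental validation.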
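The relation w_seg^c = T(w_det^c; theta) from Section 3.1 can be illustrated with a toy, dependency-free sketch. The two-layer shape, the LeakyReLU activation, the layer sizes, and every name below (`WeightTransfer`, `init_linear`, the category labels) are assumptions chosen for illustration; the text above only states that T is a learnable neural network.

```python
import random


def leaky_relu(values, slope=0.1):
    """Element-wise LeakyReLU on a list of floats."""
    return [v if v > 0 else slope * v for v in values]


def linear(weights, bias, vector):
    """Dense layer: one dot product per output row, plus a bias term."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, out_dim, in_dim, scale=0.1):
    """Random weight matrix (out_dim x in_dim) and a zero bias vector."""
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class WeightTransfer:
    """Hypothetical two-layer MLP standing in for T(w_det; theta)."""

    def __init__(self, det_dim, hidden_dim, seg_dim, seed=0):
        rng = random.Random(seed)
        # theta is the collection of these four parameter tensors.
        self.w1, self.b1 = init_linear(rng, hidden_dim, det_dim)
        self.w2, self.b2 = init_linear(rng, seg_dim, hidden_dim)

    def __call__(self, w_det):
        hidden = leaky_relu(linear(self.w1, self.b1, w_det))
        return linear(self.w2, self.b2, hidden)


rng = random.Random(1)
det_dim, hidden_dim, seg_dim = 8, 16, 4  # toy sizes, not from the paper
transfer = WeightTransfer(det_dim, hidden_dim, seg_dim)

# One detection weight vector per category; the same T serves sets A and B.
w_det = {name: [rng.gauss(0.0, 1.0) for _ in range(det_dim)]
         for name in ("category_in_A", "category_in_B")}
w_seg = {name: transfer(vector) for name, vector in w_det.items()}
```

The point of the sketch is that theta is shared across categories: once T has been fitted on set A, it can be evaluated on the detection weights of any category in set B without ever using a mask for that category.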

3.2 Weight Transfer and Training

The weight transfer function T is implemented as a small fully-connected neural network. During training, for categories in set A (with masks), both the detection and segmentation losses are active. For categories in set B (box-only), only the detection loss is used, but the detection weights w_det^c are still updated. At test time, T generates mask parameters for all categories, including those never seen with mask annotations. An important training detail is gradient stopping on the detection weights — preventing gradients from the mask loss from modifying w_det^c — to maintain homogeneity of the class embedding space between sets A and B.

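Paragraph function: Training strategy. Clarifies key technical details such as gradient stopping.

Logical role: Gradient stopping is a subtle but crucial design choice: without it, the detection weights of set A would be "contaminated" by the mask loss, breaking their comparability with the detection weights of set B and causing the transfer to fail.

Argumentative technique / potential weakness: The need for gradient stopping is argued from "embedding space homogeneity", and the logic is tight. However, gradient stopping also means that mask information cannot feed back to improve detection, which may sacrifice a potential benefit of end-to-end training.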
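The training rule in Section 3.2 can be made concrete with a hand-derived toy step. This sketch makes strong simplifying assumptions that are not from the paper: w_det and theta are scalars, T is the linear map T(w) = theta * w, the mask loss is a squared error, and the detection-loss gradient is passed in as a given number. The function name `training_step` is hypothetical.

```python
def training_step(w_det, theta, in_set_a, det_grad, mask_target, lr=0.1):
    """One toy SGD step with gradient stopping on the detection weight.

    det_grad: gradient of the detection loss with respect to w_det
              (supplied directly; a placeholder for the real detector).
    mask_target: toy regression target for w_seg, used only for set A.
    """
    grad_w_det = det_grad  # the detection loss is active for sets A and B
    grad_theta = 0.0
    if in_set_a:
        w_seg = theta * w_det                   # w_seg = T(w_det; theta)
        residual = 2.0 * (w_seg - mask_target)  # d(mask loss) / d(w_seg)
        grad_theta = residual * w_det           # the mask loss trains theta
        # Gradient stopping: the term residual * theta, which would flow
        # from the mask loss back into w_det, is deliberately not added.
    return w_det - lr * grad_w_det, theta - lr * grad_theta


# Set A category: theta moves, w_det changes only through the detection loss.
a_w_det, a_theta = training_step(1.0, 0.5, True, det_grad=0.2, mask_target=2.0)
# Set B category: no mask loss at all, so theta is left unchanged.
b_w_det, b_theta = training_step(1.0, 0.5, False, det_grad=0.2, mask_target=2.0)
```

With identical detection gradients, the set A and set B detection weights receive the same update, which is the sense in which gradient stopping keeps the two groups of class embeddings comparable.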

The authors further propose fusing FCN and MLP mask heads to capture complementary information. The MLP mask predictor captures the "gist" of an object's shape (coarse outline), while the FCN mask predictor captures fine details (edges, concavities). This fusion improves segmentation quality for both seen and unseen categories. The complete model is termed MaskX R-CNN — Mask R-CNN extended to handle partially supervised training.

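Paragraph function: Architectural extension. Improves segmentation quality further by fusing two heads.

Logical role: Layers an engineering refinement on top of the core weight transfer innovation: the spatial precision of the FCN and the global shape understanding of the MLP are complementary, making this a pragmatic way to improve performance.

Argumentative technique / potential weakness: The "gist vs. details" complementarity is intuitively plausible. However, the added model complexity (two heads) may make ablation analysis harder: which gains come from weight transfer, and which from the two-head fusion?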
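One plausible way to fuse the two heads is to add their per-pixel logits before the sigmoid. The text above does not specify the fusion operator, so the additive rule, the flat layout of the MLP output, and the function name `fuse_mask_logits` are assumptions of this sketch.

```python
import math


def fuse_mask_logits(fcn_logits, mlp_logits):
    """Fuse two mask heads by summing logits, then applying a sigmoid.

    fcn_logits: square grid (list of rows) of per-pixel logits from the FCN head.
    mlp_logits: flat list of size * size logits from the MLP head, row-major.
    Returns a grid of per-pixel foreground probabilities.
    """
    size = len(fcn_logits)
    if len(mlp_logits) != size * size:
        raise ValueError("MLP output must contain size * size values")
    fused = []
    for r, row in enumerate(fcn_logits):
        fused.append([
            1.0 / (1.0 + math.exp(-(value + mlp_logits[r * size + c])))
            for c, value in enumerate(row)
        ])
    return fused


# Toy 2x2 example in which the two heads cancel out exactly at every pixel.
fcn = [[0.0, 2.0], [-2.0, 0.0]]
mlp = [0.0, -2.0, 2.0, 0.0]
probabilities = fuse_mask_logits(fcn, mlp)
```

Summing in logit space lets a confident coarse prediction from the MLP shift the FCN's local evidence, and vice versa, without either head needing to know about the other during its forward pass.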

4. Experiments

Evaluation on the 80 COCO categories, split into VOC (20 classes) and non-VOC (60 classes) subsets, demonstrates a 40% relative improvement in mask AP on unseen categories over the class-agnostic baseline. Ablation studies examine the impact of input embeddings, transfer function architecture, and training procedures. In the large-scale Visual Genome experiment (3000 categories), the model produces reasonable segmentation masks using mask annotations for only the 80 COCO categories, with qualitative results showing coherent masks even for abstract concepts like "shadows" and "paths". End-to-end training with gradient stopping outperforms stage-wise training.

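Paragraph function: Comprehensive validation, from precise ablations on COCO to a large-scale demonstration on Visual Genome.

Logical role: Empirical pillar. The COCO experiments provide rigorous quantitative evaluation, and the Visual Genome experiment shows an impressive ability to scale. The 40% improvement is convincing.

Argumentative technique / potential weakness: The segmentations of "shadows" and "paths" are visually striking, but how is mask quality for such abstract concepts evaluated quantitatively? The qualitative Visual Genome results are not backed by objective quantitative metrics.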
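Because the headline result is a relative rather than an absolute gain, it may help to spell out the arithmetic. The AP values below are made-up placeholders chosen only so that the ratio comes out to 40%; they are not numbers reported in the paper, and `relative_gain_percent` is a hypothetical helper.

```python
def relative_gain_percent(baseline_ap, method_ap):
    """Relative improvement of method_ap over baseline_ap, in percent."""
    if baseline_ap <= 0:
        raise ValueError("baseline AP must be positive")
    return 100.0 * (method_ap - baseline_ap) / baseline_ap


# Placeholder values, NOT from the paper: a baseline mask AP of 20.0 and a
# method mask AP of 28.0 differ by 8 absolute points.
gain = relative_gain_percent(20.0, 28.0)
```

A 40% relative gain therefore corresponds to a much smaller absolute difference when the baseline AP on unseen categories is low, which is worth keeping in mind when reading the summary figure.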

5. Conclusion

This paper addresses a fundamental bottleneck in instance segmentation: the requirement for mask annotations for all categories. The proposed weight transfer function enables partially supervised training where mask predictions can be made for categories seen only with bounding boxes. By leveraging the insight that detection weights serve as class embeddings encoding visual appearance, the approach achieves significant improvements over baselines and scales to thousands of categories. The framework opens the door to practical large-vocabulary instance segmentation systems that can leverage the abundantly available bounding box datasets.

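Paragraph function: Summarises the paper, restating the annotation bottleneck, the weight transfer solution, and the large-scale validation.

Logical role: The conclusion elevates the method from a "technical contribution" to a "cornerstone of practical systems", closing on the vision of large-vocabulary instance segmentation.

Argumentative technique / potential weakness: The "opens the door" rhetoric is forward-looking. However, the paper does not discuss the lower bound on performance when transferring between visually very different categories (for example from "cat" to "chair"), nor does it compare against later, stronger weakly supervised methods.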

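Argument Structure Overview

Problem: mask annotations are expensive, which limits the number of categories
-> Claim: a weight transfer function (detection -> segmentation)
-> Evidence: 40% relative mask AP gain on COCO; 3000 categories on Visual Genome
-> Rebuttal: gradient stopping maintains embedding space homogeneity
-> Conclusion: a cornerstone for practical large-vocabulary instance segmentation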

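Authors' Core Claim (One Sentence)

By learning a transfer function from detection weights to segmentation weights, instance segmentation capability can be generalised, in a partially supervised setting, to thousands of categories that have only bounding box annotations.

Strongest Point of the Argument

Persuasiveness of the validation at scale: the Visual Genome experiment, which transfers from 80 categories with masks to 3000 categories, demonstrates the method's robustness under extreme category imbalance. The 40% AP improvement is highly credible under the rigorous COCO evaluation.

Weakest Point of the Argument

Weak theoretical basis for the transfer hypothesis: the core hypothesis that detection weights encode shape information relies mainly on experimental validation rather than theoretical derivation. For categories with very different shapes (for example transferring from animals to utensils), the lower bound on transfer quality is not systematically explored.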