SOLO: Segmenting Objects by Locations

Abstract

We present SOLO, a new approach to instance segmentation that is fundamentally different from proposal-based or embedding-based methods. Our key insight is that instance segmentation can be reformulated as category prediction and mask generation organized by locations on a grid. The image is divided into a uniform S x S grid, and each cell predicts the semantic category and instance mask of the object whose center falls into that cell. SOLO achieves results competitive with Mask R-CNN on COCO.

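Paragraph function: Overview of the whole paper; instance segmentation is redefined with location as the organizing principle.

Logical role: Establishes a direct mapping from location to category and mask, bypassing conventional proposal generation.

Argumentative technique / potential weakness: The insight that location determines the instance is concise and compelling, but object centers may be ambiguous in densely occluded scenes.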

1. Introduction

Mainstream instance segmentation follows a detect-then-segment paradigm (e.g., Mask R-CNN), which requires proposal networks, feature pooling, and NMS. Bottom-up methods based on pixel embeddings need complex post-processing to group pixels into instances. SOLO directly predicts instance masks without proposals or embeddings, using a grid-based location assignment together with an FPN for multi-scale handling.

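Paragraph function: Problem positioning; criticizes the complexity of the two main families of methods and opens up a third route.

Logical role: Criticizes top-down and bottom-up methods together, establishing a distinct position for SOLO.

Argumentative technique / potential weakness: Framing the issue as conceptual complexity is persuasive from an engineering standpoint and appeals to readers who favor simple designs.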

The fundamental insight behind SOLO is that different instances are naturally distinguished by their spatial locations. Even when two objects share the same category, their center locations differ. This observation allows us to use the grid cell index as a proxy for instance identity. Combined with a Feature Pyramid Network (FPN) for multi-scale processing, SOLO handles objects at different scales by assigning them to the appropriate grid level.

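Paragraph function: Core insight; location serves as a natural proxy for instance identity.

Logical role: This observation reduces instance segmentation from a complex matching problem to a location prediction problem.

Argumentative technique / potential weakness: The assumption breaks down when the centers of several small objects fall into the same grid cell.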
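
The idea of using the grid cell index as instance identity can be illustrated with a minimal sketch. It assumes a row-major index k = row * S + col and object centers given in pixels; the function name `center_to_cell` and the 640 x 480 image size are hypothetical and chosen only for illustration. The third object shows the failure case: a nearby center that lands in the same cell as the first one.

```python
def center_to_cell(cx, cy, img_w, img_h, S):
    """Map an object center (pixels) to (row, col, k) on an S x S grid.

    k = row * S + col is an assumed row-major index that stands in
    for the instance identity; it is an illustrative convention.
    """
    col = min(int(cx * S / img_w), S - 1)  # clamp centers on the right edge
    row = min(int(cy * S / img_h), S - 1)  # clamp centers on the bottom edge
    return row, col, row * S + col


# Hypothetical 640 x 480 image with a 12 x 12 grid.
a = center_to_cell(100, 210, 640, 480, 12)  # first object
b = center_to_cell(500, 210, 640, 480, 12)  # same class, different location
c = center_to_cell(90, 205, 640, 480, 12)   # small object close to the first

print("a:", a, "b:", b, "c:", c)
print("a and b get distinct identities:", a[2] != b[2])
print("a and c collide in one cell:", a[2] == c[2])
```

Objects `a` and `b` are separated purely by where they sit on the grid, while `a` and `c` share a cell, so a single-scale grid cannot tell them apart.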

2. Method

SOLO has two parallel branches. The category branch predicts an S x S x C tensor of semantic class probabilities, where C is the number of classes. The mask branch predicts S^2 instance masks of size H x W, one per grid cell. To handle different scales, we use multiple grid sizes (S = 12, 24, 36, 48, 96) at different FPN levels. A simple Matrix NMS step removes duplicate masks.

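Paragraph function: Core architecture; the two-branch design and the multi-scale grid strategy.

Logical role: Multi-scale assignment via FPN addresses the scale limitation of a fixed grid.

Argumentative technique / potential weakness: Predicting S^2 full-image masks is memory-intensive; SOLOv2 addresses this limitation with dynamic convolutions.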
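
To make the memory concern concrete, the following sketch computes the output shapes of the two branches and the size of the mask tensor per grid level. The grid sizes come from the text; the class count C = 80, the (H, W) sizes, the pairing of the largest S with the largest map, and float32 storage are all assumptions made for illustration, as are the helper names `branch_shapes` and `mask_bytes`.

```python
def branch_shapes(S, C, H, W):
    """Output shapes of the category and mask branches for one level."""
    return {"category": (S, S, C), "mask": (S * S, H, W)}


def mask_bytes(S, H, W, bytes_per_value=4):
    """Memory for S^2 masks of size H x W, assuming float32 values."""
    return S * S * H * W * bytes_per_value


# Grid sizes S are taken from the text; C and the (H, W) sizes are
# hypothetical and only serve to show how the cost scales.
C = 80
levels = [(96, 200, 304), (48, 100, 152), (36, 50, 76), (24, 25, 38), (12, 13, 19)]

for S, H, W in levels:
    shapes = branch_shapes(S, C, H, W)
    mib = mask_bytes(S, H, W) / 2**20
    print(f"S={S:3d} category={shapes['category']} mask={shapes['mask']} {mib:.1f} MiB")

total_bytes = sum(mask_bytes(S, H, W) for S, H, W in levels)
print("total MiB:", round(total_bytes / 2**20, 1))
```

Under these assumptions the finest level alone needs about 2.1 GiB for a single image, because the channel count grows with S^2 while the spatial size grows with H x W.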

2.1 Grid Assignment

Objects are assigned to the grid cell containing their center of mass. A cell receives a positive label if any ground-truth center falls within it. The scheme is anchor-free and much simpler than the anchor-based assignment used in Mask R-CNN. The multi-scale grid ensures that small objects use finer grids and large objects use coarser grids. We also use a center sampling strategy that assigns positive labels only within the central region of each object, which improves training stability.

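Paragraph function: Assignment strategy; a simple rule based on object centers.

Logical role: The multi-scale grid makes clever use of FPN's multi-level structure and is a core element of the design.

Argumentative technique / potential weakness: Center-based assignment is simple, but in heavily occluded scenes several objects may fall into the same cell.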
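
The assignment rule can be sketched for a single grid level as follows. This is a minimal illustration, not the authors' implementation: it assumes the central region is the object's bounding box shrunk by a factor `eps` around the center of mass, with `eps = 0.2` as a hypothetical default, and the helper names (`center_of_mass`, `positive_cells`, `category_targets`) are invented for this example.

```python
def center_of_mass(mask):
    """Center of mass (cx, cy) of a non-empty binary mask (list of rows)."""
    n = sx = sy = 0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                n += 1
                sx += x + 0.5  # use pixel centers
                sy += y + 0.5
    return sx / n, sy / n


def positive_cells(mask, S, eps=0.2):
    """Grid cells (row, col) overlapped by the eps-scaled center region."""
    h, w = len(mask), len(mask[0])
    cx, cy = center_of_mass(mask)
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    bw, bh = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1  # box width, height
    # Assumed center region: a box of size (eps*bw, eps*bh) around (cx, cy).
    c0 = max(0, int((cx - eps * bw / 2) * S / w))
    c1 = min(S - 1, int((cx + eps * bw / 2) * S / w))
    r0 = max(0, int((cy - eps * bh / 2) * S / h))
    r1 = min(S - 1, int((cy + eps * bh / 2) * S / h))
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def category_targets(objects, S):
    """Build an S x S label grid; 0 is background, later objects overwrite."""
    grid = [[0] * S for _ in range(S)]
    for class_id, mask in objects:
        for r, c in positive_cells(mask, S):
            grid[r][c] = class_id
    return grid


# Toy 8 x 8 image with one 3 x 4 object in the top-left corner.
toy = [[1 if y < 3 and x < 4 else 0 for x in range(8)] for y in range(8)]
cells = positive_cells(toy, S=4)
center_only = positive_cells(toy, S=4, eps=0.0)
targets = category_targets([(7, toy)], S=4)
print(cells, center_only)
```

With `eps = 0.0` only the cell containing the center of mass is positive; a larger `eps` marks the neighbouring cells that the central region touches, while cells near the object boundary stay negative. The overwrite in `category_targets` is where the occlusion weakness noted above shows up: two objects mapped to the same cell cannot both keep their label.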

3. Experiments

On COCO test-dev with a ResNet-101 backbone, SOLO achieves 37.8 AP, competitive with Mask R-CNN at 37.5 AP. With deformable convolutions and multi-scale training, SOLO reaches 40.4 AP. SOLO extends naturally to panoptic segmentation, with competitive PQ. Inference speed is comparable to that of Mask R-CNN, and no ROI operations are required.

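Paragraph function: Quantitative evaluation; a comprehensive comparison with Mask R-CNN on COCO.

Logical role: Matching or even exceeding Mask R-CNN validates the feasibility of the location-based approach.

Argumentative technique / potential weakness: The 37.8 vs. 37.5 AP comparison shows that a conceptually simpler method can reach similar performance.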

Ablation studies reveal the contribution of the key components. Multi-scale grid assignment contributes +4.2 AP over a single-scale grid. Matrix NMS provides +0.7 AP over traditional NMS while being 3x faster. The center sampling strategy adds +0.5 AP by reducing ambiguous positive samples near object boundaries.

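Paragraph function: Ablation analysis; the quantified contribution of each design component.

Logical role: The +4.2 AP from multi-scale assignment is the largest gain, confirming the central importance of the multi-scale design.

Argumentative technique / potential weakness: Matrix NMS improving both accuracy and speed is a rare win-win result.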
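
The text names Matrix NMS without describing it, so the following is only a sketch under stated assumptions: it uses the Gaussian score-decay formulation usually associated with Matrix NMS, treats all masks as one class, and represents each mask as a set of pixel indices. The value `sigma = 0.5` and the function names are hypothetical. The loops are written in plain Python for readability, so the sketch does not reflect the reported speed-up, which would come from computing the same quantities as parallel matrix operations.

```python
import math


def mask_iou(a, b):
    """IoU of two binary masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def matrix_nms(masks, scores, sigma=0.5):
    """Return decayed scores (same order as the input); nothing is deleted."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    # iou[i][j] for i < j: overlap between the i-th and j-th best masks.
    iou = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            iou[i][j] = mask_iou(masks[order[i]], masks[order[j]])
    # cmax[i]: largest overlap of mask i with any higher-scoring mask.
    cmax = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]

    def f(x):
        return math.exp(-(x * x) / sigma)  # assumed Gaussian decay kernel

    decayed = [0.0] * len(scores)
    for j in range(n):
        # A mask is decayed by overlapping higher-scoring masks, with less
        # effect from masks that are themselves likely to be suppressed.
        decay = min((f(iou[i][j]) / f(cmax[i]) for i in range(j)), default=1.0)
        decayed[order[j]] = scores[order[j]] * decay
    return decayed


# Toy example: A and B overlap heavily, C is disjoint from both.
A, B, C_mask = set(range(0, 100)), set(range(10, 110)), set(range(500, 600))
decayed = matrix_nms([A, B, C_mask], [0.9, 0.8, 0.7])
print([round(s, 3) for s in decayed])
```

In the toy example the top-scoring mask and the disjoint mask keep their scores, while the heavily overlapping duplicate is pushed down; a score threshold applied afterwards would then drop it.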

4. Conclusion

We presented SOLO, a new paradigm for instance segmentation that directly segments objects by locations. SOLO shows that instance segmentation does not require proposals or embeddings. Its simplicity makes it easy to implement and extend. The location-based approach offers a promising direction for instance-level recognition.

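Paragraph function: Summary; restates the paradigm innovation and its impact.

Logical role: The follow-up improvements in SOLOv2 confirm the continuing potential of this direction.

Argumentative technique / potential weakness: The large body of follow-up work shows that SOLO did open a new line of research.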

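Argument structure overview

Problem: Instance segmentation depends on proposals or embeddings.
→ Thesis: Location can uniquely identify an instance.
→ Method: A grid plus two-branch prediction.
→ Evidence: Performance on par with Mask R-CNN on COCO.
→ Conclusion: A new location-based paradigm.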

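Core claim

By dividing the image into a grid and using location as the index, instance masks are predicted directly, reaching performance comparable to Mask R-CNN without proposals or embeddings.

Strongest point of the argument

A minimal design achieves competitive performance with no ROI operations and no anchors; the concept is clear and easy to extend.

Weakest point of the argument

The S^2 full-image masks carry a large memory cost, handling of heavy occlusion is limited, and Matrix NMS is still required.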