Multiscale Combinatorial Grouping (MCG)

Abstract

We propose a unified approach for bottom-up hierarchical image segmentation and object candidate generation for recognition, called Multiscale Combinatorial Grouping (MCG). The approach consists of three main components: (1) a fast normalized cuts algorithm that significantly accelerates the computation of the segmentation hierarchy, (2) a high-performance hierarchical segmenter that makes effective use of multiscale information, and (3) a grouping strategy that combines multiscale regions into highly accurate object candidates by efficiently exploring their combinatorial space. We also present Single-scale Combinatorial Grouping (SCG), a faster variant that produces competitive proposals in under five seconds per image. Our approach achieves state-of-the-art results on the BSDS500, SegVOC12, SBD, and COCO datasets for contours, hierarchical regions, and object proposals.

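Paragraph function: Overview of the whole paper. The three-component framing previews the full MCG methodology and promises both speed and accuracy.

Logical role: The abstract presents the method in modular form: a fast algorithm (efficiency), multiscale segmentation (quality), and combinatorial grouping (accuracy). Layered on one another, the three form a complete pipeline from low-level pixels to high-level objects.

Argumentative technique / potential weakness: Calling the approach "unified" brings segmentation and object proposals, two tasks traditionally handled separately, under one method and implies generality. SCG's five-second runtime is a key practical selling point. The state-of-the-art claim on four datasets, however, has to be verified benchmark by benchmark in the experiments; such broad coverage may mean the margin is not significant on some of them.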

1. Introduction

Object proposal generation has become a critical preprocessing step in modern object detection pipelines. Methods like Selective Search and CPMC generate category-independent proposals that reduce the search space from millions of sliding windows to thousands of candidates. However, existing approaches face a trade-off between proposal quality (recall and localization accuracy) and computational cost. Simultaneously, hierarchical image segmentation has proven valuable for capturing structure at multiple granularities, but single-scale methods miss objects that are best captured at different scales. We propose to unify these two tasks — hierarchical segmentation and object proposal generation — through a multiscale combinatorial grouping strategy that leverages the complementary information provided by different scales.

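Paragraph function: Sets up the problem, starting from the quality-efficiency trade-off in object proposals and the limits of single-scale segmentation.

Logical role: A two-track narrative establishes two problems at once (proposal quality vs. efficiency, single-scale vs. multiscale) and then offers "unification" as the keyword of the solution. This two-problems, one-answer structure strengthens the sense of contribution.

Argumentative technique / potential weakness: Treating segmentation and proposals jointly is elegant and natural, since a good segmentation already contains object boundaries. But the computational overhead of going multiscale may conflict with the original aim of reducing the search space, so a fast algorithm is needed to compensate.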

2. Related Work

Object proposal methods can be broadly categorized into grouping-based approaches (e.g., Selective Search, which greedily merges superpixels) and window scoring approaches (e.g., Objectness and EdgeBoxes, which evaluate candidate windows). Grouping-based methods typically produce higher-quality proposals with better boundary adherence, while window scoring methods are generally faster. For hierarchical segmentation, the Ultrametric Contour Map (UCM) framework based on gPb (globalized probability of boundary) is considered the gold standard, but its O(n^1.5) complexity limits scalability. Our work builds upon the UCM framework but introduces a fast normalized cuts algorithm and multiscale fusion that dramatically improve both speed and quality.

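Paragraph function: Literature taxonomy; systematically contrasts the strengths and weaknesses of the two families of proposal methods and of the segmentation framework.

Logical role: The taxonomy (grouping vs. scoring) sets up an evaluation framework, and UCM's speed bottleneck then serves as the pivot that introduces the paper's acceleration contribution.

Argumentative technique / potential weakness: Prior work is analyzed through a binary "quality vs. speed" framing, implying that MCG can have both. That idealized promise is hard to deliver fully in practice: SCG's five seconds per image, though fast, is still far slower than millisecond-scale methods such as EdgeBoxes.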

3. Fast Normalized Cuts

The normalized cuts criterion is fundamental to spectral segmentation but requires solving a generalized eigenvector problem that scales poorly with image size. We propose a fast approximation based on a multi-resolution hierarchy: the image is first over-segmented into superpixels at a coarse level, then the normalized cuts eigenvectors are computed on the reduced graph of superpixels rather than on the full pixel graph. This reduces the matrix dimension from millions of pixels to thousands of superpixels, yielding a speedup of orders of magnitude while preserving segmentation quality. The coarse eigenvectors are then interpolated back to the full resolution to produce the final segmentation hierarchy.

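Paragraph function: Method component one; the core technique for accelerating normalized cuts.

Logical role: This component removes the efficiency bottleneck of the whole pipeline by turning the O(n^1.5) eigendecomposition over all pixels into a small problem on the superpixel graph. It is the prerequisite that makes multiscale processing feasible in practice.

Argumentative technique / potential weakness: The quantified claim of a speedup of "orders of magnitude" adds persuasive force. The superpixel approximation necessarily costs some accuracy, though: if the initial over-segmentation misses a subtle boundary, no later stage can recover it. The quality-speed trade-off needs to be quantified in the experiments.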
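
The reduction described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`ncut_eigenvectors`, `coarsen_affinity`, `fast_ncut`), the dense affinity matrix, and the piecewise-constant copy used as the "interpolation" step are all assumptions chosen for brevity. The sketch sums pixel affinities into a superpixel affinity matrix, solves the normalized cuts eigenproblem on that small matrix, and copies each superpixel's value back to its pixels.

```python
import numpy as np

def ncut_eigenvectors(W, k=1):
    """Solve (D - W) v = lambda * D v; return the k smallest nontrivial pairs."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Equivalent symmetric problem on I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    # Map back to generalized eigenvectors and drop the trivial constant one.
    return vals[1:k + 1], d_inv_sqrt[:, None] * vecs[:, 1:k + 1]

def coarsen_affinity(W_pixels, labels):
    """Sum pixel affinities within and between superpixels: W_s = A^T W A."""
    n_super = labels.max() + 1
    A = np.zeros((W_pixels.shape[0], n_super))
    A[np.arange(labels.size), labels] = 1.0  # pixel-to-superpixel membership
    return A.T @ W_pixels @ A

def fast_ncut(W_pixels, labels, k=1):
    """Eigenvectors of the superpixel graph, copied back to every pixel."""
    _, vecs_super = ncut_eigenvectors(coarsen_affinity(W_pixels, labels), k)
    return vecs_super[labels]                # piecewise-constant interpolation

# Toy 1-D "image": 8 pixels in a chain with one weak link between pixels 3 and 4.
W = np.zeros((8, 8))
for i in range(7):
    W[i, i + 1] = W[i + 1, i] = 0.01 if i == 3 else 1.0
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # four superpixels of two pixels each
v = fast_ncut(W, labels)[:, 0]               # sign of v splits the chain at the weak link
```

In this toy case the eigenproblem shrinks from 8x8 to 4x4, and the sign of `v` still separates pixels 0-3 from pixels 4-7. A real image would call for sparse matrices and a sparse eigensolver instead of the dense `np.linalg.eigh` used here.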

4. Multiscale Hierarchical Segmentation

Single-scale segmentation inevitably misses structures that are better captured at different resolutions. A small object may be merged into the background at a coarse scale, while fine-grained boundaries may be lost at lower resolutions. Our multiscale hierarchical segmenter addresses this by computing independent segmentation hierarchies at multiple image scales and then aligning and fusing them into a single unified hierarchy. The fusion strategy preserves the strengths of each scale: fine scales contribute precise boundaries, while coarse scales provide better grouping of large regions. The resulting hierarchy captures structure at all granularities — from small parts to entire objects — in a single coherent representation.

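Paragraph function: Method component two; the motivation and mechanism of multiscale fusion.

Logical role: This component directly answers the "single-scale misses structure" problem raised in the introduction. The complementarity of fine and coarse scales is the core argument: the former supplies boundary precision, the latter semantic completeness.

Argumentative technique / potential weakness: Concrete failure modes (small objects merged away, fine boundaries lost) make an intuitive case for multiscale processing. The "alignment" step of the fusion is technically nontrivial, however: segmentations at different scales can disagree, and resolving those conflicts takes careful design.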
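
A rough way to picture the align-and-fuse step is to treat each scale's hierarchy as a boundary-strength map, bring all maps to a common resolution, and combine them. The sketch below makes several simplifying assumptions that are not taken from the paper: alignment is reduced to nearest-neighbor resampling, fusion is a weighted average, and the helper names `resize_nearest` and `fuse_boundary_maps` are hypothetical.

```python
import numpy as np

def resize_nearest(arr, shape):
    """Nearest-neighbor resampling of a 2-D map to the target shape."""
    rows = (np.arange(shape[0]) * arr.shape[0] / shape[0]).astype(int)
    cols = (np.arange(shape[1]) * arr.shape[1] / shape[1]).astype(int)
    return arr[np.ix_(rows, cols)]

def fuse_boundary_maps(maps, weights=None):
    """Align per-scale boundary maps to the first map's grid and average them."""
    target = maps[0].shape                   # assumption: maps[0] is the native scale
    aligned = np.stack([resize_nearest(m, target) for m in maps])
    if weights is None:
        weights = np.ones(len(maps))         # uniform weights unless given
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights / weights.sum(), aligned, axes=1)

# Fine scale: a sharp boundary in column 2 of a 4x4 map.
fine = np.zeros((4, 4))
fine[:, 2] = 1.0
# Coarse scale: a weaker, blurrier boundary in the right half of a 2x2 map.
coarse = np.zeros((2, 2))
coarse[:, 1] = 0.5
fused = fuse_boundary_maps([fine, coarse])
```

The fused map keeps the sharp fine-scale boundary while the coarse scale reinforces it. Conflicts between scales, which the notes above flag as the hard part, are simply averaged away in this sketch.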

5. Combinatorial Grouping

Given the multiscale segmentation hierarchy, we generate object proposals by combinatorially grouping adjacent regions from the hierarchy. The naive approach of enumerating all possible region combinations is computationally infeasible due to exponential growth. We address this by designing an efficient search strategy that ranks region combinations using learned features such as shape, size, boundary strength, and region similarity. The top-ranked combinations are retained as object candidates. This approach is fundamentally different from greedy merging (as in Selective Search): by exploring the combinatorial space more thoroughly, we produce proposals with significantly better localization accuracy, particularly for objects with complex shapes that require non-local grouping decisions.

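Paragraph function: Method component three; the combinatorial grouping strategy that turns the segmentation hierarchy into object proposals.

Logical role: The final stage of the pipeline, converting segmentation results into object proposals usable for detection. The contrast with Selective Search's greedy strategy highlights the advantage of combinatorial search.

Argumentative technique / potential weakness: Reducing the core difference between MCG and Selective Search to "combinatorial vs. greedy" is a clear differentiation strategy. The cost of "exploring the combinatorial space more thoroughly" remains, though: even with an efficient search strategy, the combinatorial space is far larger than a greedy path, and that is the root of the speed disadvantage.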
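
The contrast with greedy merging can be made concrete with a small sketch that enumerates connected groups of adjacent regions up to a fixed size and keeps the best-scoring ones. Everything specific here is an assumption for illustration: the size cap `max_size`, the function names, and the toy area-based `score`, which stands in for the learned ranking over shape, size, boundary strength, and region similarity mentioned in the text.

```python
def connected_groups(adjacency, max_size=3):
    """Enumerate every connected set of at most max_size adjacent regions."""
    groups = {frozenset([r]) for r in adjacency}
    frontier = set(groups)
    for _ in range(max_size - 1):
        grown = set()
        for group in frontier:
            for region in group:
                for neighbor in adjacency[region]:
                    if neighbor not in group:
                        grown.add(group | {neighbor})  # extend by one adjacent region
        groups |= grown
        frontier = grown
    return groups

def top_proposals(adjacency, score, n_keep, max_size=3):
    """Rank all enumerated groups by score and keep the n_keep best."""
    groups = connected_groups(adjacency, max_size)
    return sorted(groups, key=score, reverse=True)[:n_keep]

# Four regions in a chain: 0 - 1 - 2 - 3.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
areas = {0: 10, 1: 30, 2: 25, 3: 5}

def score(group):
    """Hypothetical score: prefer groups whose total area is close to 55 pixels."""
    return -abs(sum(areas[r] for r in group) - 55)

groups = connected_groups(adjacency)
best = top_proposals(adjacency, score, n_keep=1)[0]
```

On this chain the enumeration yields nine groups (four singletons, three pairs, two triplets) and never proposes non-adjacent sets such as {0, 2}. The size cap is what keeps the enumeration tractable; without it the number of groups grows exponentially with the number of regions, as the paragraph above notes.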

6. Experiments

We evaluate MCG comprehensively across four benchmarks. On BSDS500 for boundary detection, MCG achieves the best F-measure among all methods. For hierarchical segmentation on SegVOC12, MCG outperforms gPb-UCM while being significantly faster. For object proposals on PASCAL VOC and SBD, MCG achieves the highest recall at high IoU thresholds (>0.7), demonstrating superior localization accuracy. On COCO, MCG also sets a new state of the art. When integrated into the R-CNN detection pipeline, MCG proposals yield higher detection mAP than Selective Search proposals due to better boundary adherence. The faster SCG variant runs in under 5 seconds per image while maintaining competitive quality, making it suitable for large-scale applications.

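Paragraph function: Comprehensive experimental validation; a multi-faceted evaluation across four benchmarks.

Logical role: The empirical section covers three task dimensions (boundaries, segmentation, proposals) and a downstream application (R-CNN detection), plus an efficiency benchmark (SCG at five seconds), which makes for a very thorough validation.

Argumentative technique / potential weakness: Using the R-CNN integration experiment to show the downstream value of MCG proposals is a shrewd move, since it speaks directly to how good proposals translate into good detection. But an advantage "at high IoU thresholds" suggests the advantage may be less clear at low IoU thresholds, where Selective Search may already be good enough.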
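
Because the proposal results hinge on recall at a given IoU threshold, a short sketch of that metric may help. It assumes proposals and ground-truth objects are boolean masks of the same size and counts an object as recalled when at least one proposal reaches the threshold; the function names and the use of `>=` rather than a strict inequality are assumptions of this example, not details from the paper.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union (Jaccard index) of two boolean masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union else 0.0

def recall_at_iou(gt_masks, proposals, threshold=0.7):
    """Fraction of ground-truth objects matched by some proposal at the threshold."""
    hits = sum(any(iou(gt, p) >= threshold for p in proposals) for gt in gt_masks)
    return hits / len(gt_masks)

def box_mask(r0, r1, c0, c1, shape=(8, 8)):
    """Helper for the toy example: a rectangular boolean mask."""
    mask = np.zeros(shape, dtype=bool)
    mask[r0:r1, c0:c1] = True
    return mask

gt_masks = [box_mask(0, 4, 0, 4), box_mask(4, 8, 4, 8)]
proposals = [box_mask(0, 4, 0, 5),   # overlaps the first object with IoU 16/20 = 0.8
             box_mask(6, 8, 4, 8)]   # overlaps the second object with IoU 8/16 = 0.5
recall_strict = recall_at_iou(gt_masks, proposals, threshold=0.7)
recall_loose = recall_at_iou(gt_masks, proposals, threshold=0.5)
```

Here `recall_strict` is 0.5 and `recall_loose` is 1.0: the second proposal only counts under the looser threshold. This is the sense in which an advantage at high IoU thresholds reflects tighter localization rather than more objects found.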

7. Conclusion

We have presented Multiscale Combinatorial Grouping (MCG), a unified framework for bottom-up segmentation and object proposal generation. By combining a fast normalized cuts algorithm, multiscale hierarchical segmentation, and combinatorial grouping, MCG produces object proposals with state-of-the-art localization accuracy while maintaining practical computational cost. The key principle is that leveraging the rich structure of multiscale segmentation hierarchies through combinatorial exploration yields proposals that are fundamentally more accurate than those produced by greedy merging strategies. MCG provides a strong foundation for object detection and semantic segmentation systems that rely on bottom-up region proposals.

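Paragraph function: Summarizes the paper, restating the synergy of the three components and the core principle.

Logical role: The conclusion closes on the central "combinatorial vs. greedy" contrast, echoing the problem setup in the introduction. It positions MCG as "infrastructure" for detection pipelines, stressing its broad applicability as an upstream component.

Argumentative technique / potential weakness: Positioning the method as "infrastructure" is strategic: it implies that MCG's value is not tied to one particular detector but benefits all downstream systems. Yet the rapid progress of deep learning (for example the RPN in Faster R-CNN) is making external proposal methods obsolete, and this trend may limit MCG's long-term impact.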

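Argument structure overview

Problem: single-scale segmentation misses structure; proposal quality trades off against efficiency
→ Thesis: multiscale combinatorial grouping; unify segmentation and proposals
→ Evidence: state of the art on four benchmarks; leading recall at high IoU
→ Rebuttal: SCG runs in five seconds; practicality is preserved
→ Conclusion: combinatorial search is fundamentally better than greedy merging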

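Authors' core claim (one sentence)

Combinatorially grouping regions over a multiscale segmentation hierarchy, rather than merging them greedily, yields object proposals whose localization accuracy at high IoU thresholds is significantly better than that of prior methods, while fast normalized cuts keep the computational cost practical.

Strongest point of the argument

Comprehensive modular validation: each of the three components has its own ablation (the speedup from fast normalized cuts, the accuracy gain from multiscale fusion, the recall improvement from combinatorial grouping), complemented by the end-to-end benefit shown through downstream R-CNN integration. This layer-by-layer validation lets readers attribute each component's contribution clearly.

Weakest point of the argument

Competitiveness against deep-learning proposal methods: MCG's core algorithms rest on classical spectral segmentation and hand-designed features, and with deep learning rapidly taking over object detection (for example the Region Proposal Network in Faster R-CNN), the long-term competitiveness of this methodological line is in question. Moreover, even the fast SCG variant, at five seconds per image, is still far slower than learned proposal methods.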