Hierarchical Saliency Detection

Abstract

Saliency detection aims to identify the most visually prominent regions in an image. A fundamental challenge arises when salient foreground or background contains small-scale high-contrast patterns, which can adversely affect detection accuracy. Existing methods that use varying patch sizes or image downsampling to handle scale issues often introduce artifacts or lose fine details. We propose a hierarchical approach that analyzes saliency through a scale-based tree model rather than varying patch sizes or downsampling. Our method constructs a hierarchical segmentation tree and computes saliency at multiple layers, from fine to coarse, then integrates these multi-scale results. This approach improves saliency detection on many images that cannot be handled well by traditional single-scale methods. We also present a newly constructed large-scale benchmark dataset for evaluation.

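Paragraph function: Overview of the whole paper. It takes the scale problem as the core motivation and introduces the hierarchical saliency detection method.

Logical role: The abstract follows the standard "problem, shortcomings of existing work, proposed solution, contributions" structure, and also announces the new dataset as an additional contribution.

Argumentation technique / potential weakness: "Small-scale high-contrast patterns" are a genuine difficulty in saliency detection, but the abstract does not say how common the problem is. Pairing the method with a new benchmark dataset is a dual-contribution strategy that effectively raises the paper's impact.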

1. Introduction

Visual saliency plays a crucial role in image retargeting, object recognition, image quality assessment, and content-aware editing. The goal is to produce a saliency map where pixel values indicate the probability of belonging to a salient object. Most existing methods compute saliency at a single scale determined by the superpixel or patch size. However, real-world salient objects exhibit complex multi-scale structures: a person wearing a striped shirt has a uniform silhouette at coarse scale but high-contrast internal patterns at fine scale. Single-scale methods may incorrectly mark the stripes as salient boundaries or fail to group them as part of the salient object.

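Paragraph function: Establishes the research area and illustrates the multi-scale challenge with a concrete example.

Logical role: The starting point of the argument. It first establishes the broad practical value of saliency detection, then uses the intuitive "striped shirt" example to introduce the scale problem.

Argumentation technique / potential weakness: The "striped shirt" is an excellent concrete example that makes the abstract scale problem intuitive. How common such cases are in practice is open to question, though: most salient objects may not have such a pronounced multi-scale structure.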

We propose a hierarchical saliency detection framework that operates on a tree-structured image representation. Instead of processing saliency at a single fixed scale, we compute saliency cues at multiple layers of a hierarchical segmentation, from individual superpixels to large merged regions. At each layer, saliency is computed based on region-level contrast and spatial distribution features. The multi-layer results are then integrated through a principled combination scheme that preserves fine details from lower layers while maintaining global coherence from upper layers. Additionally, we introduce a new large-scale dataset with 5,000 images and pixel-accurate ground truth for benchmarking.

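Paragraph function: Presents the solution, outlining the hierarchical framework and the dataset contribution.

Logical role: Following the problem statement in the previous paragraph, this paragraph gives a complete overview of the method: multi-layer computation plus an integration scheme. The new 5,000-image dataset is an important additional contribution.

Argumentation technique / potential weakness: "Principled combination scheme" is vague wording that has to be made concrete in the method sections. Introducing a new dataset is a double-edged sword: it strengthens the paper's contribution, but it also means that fair comparison with prior work requires extra care.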

2. Related Work

Saliency detection methods can be broadly categorized into bottom-up and top-down approaches. Bottom-up methods compute saliency based on low-level visual contrast, including frequency domain analysis, region-based contrast, and graph-based manifold ranking. The influential RC (Region Contrast) method computes saliency as the weighted sum of color distances to all other regions. GC (Global Contrast) uses color histogram distances. However, these methods operate at a fixed segmentation granularity and cannot adapt to the intrinsic scale of salient objects. Some recent works use multi-scale superpixels but simply average or concatenate features across scales without principled integration.

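Paragraph function: Literature review that systematically categorizes and critiques existing saliency detection methods.

Logical role: Taking "fixed granularity" as the focus of its critique, the paragraph groups representative methods such as RC and GC under a shared "single-scale" limitation, setting up the contrast with the paper's multi-layer method.

Argumentation technique / potential weakness: The critique of multi-scale superpixel methods ("simply average or concatenate") accurately identifies their lack of theoretical grounding. Whether the authors' own "principled integration" actually carries theoretical guarantees still has to be verified in the method sections.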

3. Hierarchical Tree Model

We construct a hierarchical segmentation tree from the input image. The tree is built using an agglomerative clustering process: starting from an initial fine-grained superpixel segmentation (leaf nodes), we iteratively merge adjacent regions based on color similarity and boundary strength, creating parent nodes at each merge step. The resulting tree has L layers (typically 3-5), where layer 0 contains the finest superpixels and layer L-1 contains the coarsest regions. Each node in the tree represents a region at a specific scale, and the parent-child relationships encode how fine-grained regions compose into coarser ones. This tree provides a natural multi-scale representation that respects image boundaries at every level.

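Paragraph function: Builds the multi-scale foundation by defining the hierarchical segmentation tree.

Logical role: Provides the structural basis for the multi-layer saliency computation that follows. Agglomerative clustering guarantees strict nesting between layers, which is what makes integrating the multi-layer results possible.

Argumentation technique / potential weakness: "Respects image boundaries at every level" is the key advantage, since it avoids the boundary blurring of downsampling-based methods. However, the merge order in agglomerative clustering depends on local decisions and may produce a suboptimal hierarchy.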
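
The agglomerative construction described in Section 3 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the leaf superpixels are already given as mean colors, pixel counts, and an adjacency set, and it merges on color distance only (the paper also uses boundary strength, which is omitted here). The names `build_layers`, `color_dist`, and `targets` are hypothetical.

```python
def color_dist(a, b):
    """Euclidean distance between two color tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_layers(colors, sizes, adjacency, targets):
    """Greedy agglomerative merging of adjacent regions (illustrative only).

    colors:    {leaf_id: (r, g, b)} mean color of each leaf superpixel
    sizes:     {leaf_id: pixel_count}
    adjacency: iterable of (a, b) pairs of neighboring leaves
    targets:   decreasing region counts, one per coarser layer
    Returns a list of layers; layers[l][leaf] is the id of the region
    that contains `leaf` at layer l (layer 0 is the leaves themselves).
    """
    colors, sizes = dict(colors), dict(sizes)
    edges = {frozenset(e) for e in adjacency}
    label = {leaf: leaf for leaf in colors}
    layers = [dict(label)]
    for target in targets:
        while len(colors) > target and edges:
            # Assumption: merge the adjacent pair with the most similar color.
            a, b = min(
                (tuple(sorted(e)) for e in edges),
                key=lambda e: (color_dist(colors[e[0]], colors[e[1]]), e),
            )
            # Merge b into a, keeping a size-weighted mean color.
            n = sizes[a] + sizes[b]
            colors[a] = tuple(
                (ca * sizes[a] + cb * sizes[b]) / n
                for ca, cb in zip(colors[a], colors[b])
            )
            sizes[a] = n
            del colors[b], sizes[b]
            # Redirect b's edges to a and drop the collapsed self-edge.
            rewired = (frozenset(a if x == b else x for x in e) for e in edges)
            edges = {e for e in rewired if len(e) == 2}
            label = {leaf: (a if r == b else r) for leaf, r in label.items()}
        layers.append(dict(label))
    return layers


# Four leaves in a row: two dark ones followed by two red ones.
leaf_colors = {0: (0, 0, 0), 1: (10, 0, 0), 2: (200, 0, 0), 3: (210, 0, 0)}
leaf_sizes = {0: 1, 1: 1, 2: 1, 3: 1}
neighbors = [(0, 1), (1, 2), (2, 3)]
layers = build_layers(leaf_colors, leaf_sizes, neighbors, targets=[2, 1])
```

Because every coarser layer is produced only by merging regions of the layer below, each leaf maps to exactly one region per layer, which is the strict nesting the commentary above relies on.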

4. Layer-wise Saliency Computation

At each layer l of the tree, we compute a saliency value for each region based on two complementary cues. The global contrast cue measures the color distinctiveness of a region relative to all other regions at the same layer, weighted by spatial distance so that nearby regions contribute more. The spatial distribution cue measures how compactly a region's color is distributed in the image: salient objects tend to be spatially compact, while background colors are widely scattered. These two cues are combined as S_l(r) = C_l(r) * D_l(r), where C_l(r) is the contrast cue and D_l(r) is the spatial compactness cue of region r at layer l. Different layers produce saliency maps at different granularities: fine layers capture details but are noisy, while coarse layers provide a holistic assessment but lose boundaries.

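Paragraph function: Core computation, defining the saliency measure at each layer.

Logical role: Turns the intuition behind saliency (contrast plus spatial compactness) into a concrete formula. The multiplicative combination means that both cues must be high for a region to receive high saliency.

Argumentation technique / potential weakness: Multiplying contrast by spatial compactness is concise and effective, but may be too strict in some cases: when the salient object occupies most of the image, for example, its spatial compactness score will be low. The admission that fine layers are "noisy" also signals how critical the integration step is.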
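
To make the product S_l(r) = C_l(r) * D_l(r) from Section 4 concrete, here is a minimal sketch for a single layer. The exact forms of the two cues are not given in the text, so everything below is an assumption: contrast is a size- and proximity-weighted sum of color distances, compactness is `exp(-variance / spread_scale)` over the positions of similarly colored regions, and both cues are normalized to [0, 1] before multiplying. The function name and the parameters `sigma`, `color_scale`, and `spread_scale` are hypothetical.

```python
import math


def layer_saliency(regions, sigma=0.4, color_scale=30.0, spread_scale=0.1):
    """Per-region saliency for one layer as contrast * compactness.

    regions: list of dicts with 'color' (3-tuple), 'center' ((x, y) in
             [0, 1]^2) and 'size' (pixel count).
    """
    total = sum(r["size"] for r in regions)
    contrast, compact = [], []
    for i, r in enumerate(regions):
        # Contrast cue C: color distance to every other region, weighted by
        # that region's relative size and by spatial proximity.
        c = sum(
            q["size"] / total
            * math.exp(-math.dist(r["center"], q["center"]) ** 2 / sigma ** 2)
            * math.dist(r["color"], q["color"])
            for j, q in enumerate(regions) if j != i
        )
        contrast.append(c)
        # Distribution cue D: spatial variance of similarly colored regions;
        # a small variance means the color is spatially compact.
        w = [
            q["size"]
            * math.exp(-math.dist(r["color"], q["color"]) ** 2 / color_scale ** 2)
            for q in regions
        ]
        wsum = sum(w)
        mx = sum(wi * q["center"][0] for wi, q in zip(w, regions)) / wsum
        my = sum(wi * q["center"][1] for wi, q in zip(w, regions)) / wsum
        var = sum(
            wi * ((q["center"][0] - mx) ** 2 + (q["center"][1] - my) ** 2)
            for wi, q in zip(w, regions)
        ) / wsum
        compact.append(math.exp(-var / spread_scale))

    def normalize(values):
        top = max(values)
        return [v / top if top > 0 else 0.0 for v in values]

    return [c * d for c, d in zip(normalize(contrast), normalize(compact))]


# One small red region in the center, four blue regions in the corners.
scene = [{"color": (200, 0, 0), "center": (0.5, 0.5), "size": 100}] + [
    {"color": (0, 0, 200), "center": c, "size": 225}
    for c in [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
]
scores = layer_saliency(scene)
```

In this toy scene the central red region scores highest on both cues, while the blue regions are penalized twice: they differ in color only from a small region, and their color is spread across all four corners.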

5. Multi-layer Integration

The final saliency map is obtained by integrating saliency values across all tree layers. For each pixel p, its saliency is computed as a weighted combination of the saliency values from the regions containing p at each layer. The weights are determined by the reliability of each layer, estimated from the distribution of saliency values at that layer — a layer with clear bimodal distribution (salient vs. non-salient) receives higher weight. Additionally, we apply a hierarchical refinement step: saliency from coarse layers provides spatial priors that constrain the saliency computation at finer layers, ensuring that fine-scale details are consistent with the global saliency structure. This bidirectional information flow between scales is the core advantage of our tree-based approach.

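Paragraph function: Integration strategy, describing how the multi-scale results are fused.

Logical role: Answers the question of how to integrate in a principled way: weight each layer by its reliability, and enforce consistency through coarse-to-fine refinement. The "bidirectional information flow" is a direct improvement over the "simple averaging" of prior work.

Argumentation technique / potential weakness: Estimating layer reliability from the bimodality of the saliency distribution is a clever design. The assumption may fail on difficult images, however: when foreground and background are hard to tell apart, every layer may lack a clear bimodal distribution.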
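
The reliability-weighted combination in Section 5 can be sketched as follows. The text does not say how bimodality is measured, so this sketch substitutes a simple stand-in: the mean of |2s - 1|, which is 1 when every value is exactly 0 or 1 and 0 when every value is 0.5. The hierarchical refinement step is not modeled. `layer_weight` and `integrate` are hypothetical names.

```python
def layer_weight(values):
    """Stand-in reliability score: how strongly values cluster near 0 and 1."""
    return sum(abs(2 * v - 1) for v in values) / len(values)


def integrate(layer_maps):
    """Weighted per-pixel combination of per-layer saliency maps.

    layer_maps[l][p] is the saliency in [0, 1] of the region that contains
    pixel p at layer l. All layers must cover the same pixels.
    """
    weights = [layer_weight(m) for m in layer_maps]
    total = sum(weights)
    if total == 0:
        # Every layer is maximally ambiguous: fall back to plain averaging.
        weights = [1.0] * len(layer_maps)
        total = float(len(layer_maps))
    n_pixels = len(layer_maps[0])
    return [
        sum(w * m[p] for w, m in zip(weights, layer_maps)) / total
        for p in range(n_pixels)
    ]


fine = [0.9, 0.4, 0.6, 0.1]    # detailed but ambiguous in the middle
coarse = [1.0, 1.0, 0.0, 0.0]  # clean split into salient / non-salient
final = integrate([fine, coarse])
```

Here the coarse layer receives weight 1.0 and the fine layer 0.5, so the ambiguous middle pixels are pulled toward the coarse layer's decision while the fine layer still contributes.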

6. Experiments

We evaluate on four benchmark datasets: MSRA-B (5,000 images), ECSSD (1,000 images with complex structures), DUT-OMRON (5,168 images), and our newly constructed HKU-IS dataset (4,447 images). We compare against 14 state-of-the-art methods, including RC, GC, SF, MC, and DSR. Measured by the standard precision-recall curve and F-measure, our hierarchical approach achieves the highest F-measure on all four datasets. Notably, on ECSSD, which specifically contains images with complex structures, we improve F-measure by 4.2% over the next best method. Ablation analysis confirms that multi-layer integration outperforms any single layer, and that the hierarchical refinement provides an additional 2-3% improvement over simple averaging.

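Paragraph function: Provides comprehensive experimental evidence, comparing against 14 methods on four datasets.

Logical role: The empirical pillar, covering (1) consistency across multiple datasets; (2) a clear advantage on images with complex structures; (3) an ablation study validating each component. The 4.2% gain on ECSSD directly supports the core claim that the hierarchical method handles complex structures well.

Argumentation technique / potential weakness: With 14 competing methods, the experimental design is thorough. One of the test datasets (HKU-IS), however, was constructed by the authors themselves, so the advantage on it may be affected by dataset design bias.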
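
For readers unfamiliar with the metrics in Section 6, the sketch below computes precision, recall, and F-measure for one binarization threshold; sweeping the threshold yields the precision-recall curve. The text does not state the value of beta, so the default `beta2=0.3` (a setting widely used in the saliency literature to favor precision) is an assumption, as are the function and parameter names.

```python
def f_measure(saliency, truth, threshold, beta2=0.3):
    """Precision, recall and F-measure of a thresholded saliency map.

    saliency:  per-pixel saliency values in [0, 1]
    truth:     per-pixel ground truth (truthy = salient)
    threshold: pixels with saliency >= threshold are predicted salient
    beta2:     beta squared in the F-measure (assumed, not from the text)
    """
    pred = [s >= threshold for s in saliency]
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
    return precision, recall, f


# Two of three salient pixels are found, with no false positives.
p, r, f = f_measure([0.9, 0.8, 0.3, 0.1], [1, 1, 1, 0], threshold=0.5)
```

With these inputs precision is 1, recall is 2/3, and the F-measure is 26/29, about 0.897, which shows how the small beta squared pulls the score toward precision.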

7. Conclusion

We have presented a hierarchical saliency detection method that addresses the fundamental scale challenge in saliency computation. By constructing a segmentation tree and computing saliency at multiple levels with principled integration, our approach captures both fine details and global structure simultaneously. The consistent improvements across four datasets and strong performance on complex-structure images validate the effectiveness of hierarchical processing. Our new HKU-IS dataset also provides a valuable resource for future research. Potential extensions include integrating deep features into the hierarchical framework and applying it to video saliency detection.

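Paragraph function: Summarizes the paper, restating the core contributions and looking ahead.

Logical role: The conclusion returns to the scale problem raised in the introduction and offers "capturing both fine details and global structure simultaneously" as the final answer, closing the argumentative loop.

Argumentation technique / potential weakness: The outlook on "deep features" was forward-looking for 2013: deep-learning-based saliency methods did go on to substantially outperform hand-crafted-feature methods. The outlook also suggests that the authors already recognized feature choice as a likely bottleneck of the current framework.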

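Overview of the Argument Structure

Problem: single-scale saliency detection cannot handle complex structures.
→ Claim: a hierarchical segmentation tree provides a natural multi-scale representation.
→ Evidence: the method outperforms 14 competing methods on all four datasets.
→ Rebuttal: principled multi-layer integration beats simple averaging.
→ Conclusion: hierarchical processing resolves the scale challenge.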

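The Authors' Core Claim (in One Sentence)

Computing saliency at multiple layers of a hierarchical segmentation tree, and integrating the results in a principled way with layer reliability as the weight, preserves both fine details and global structure and effectively resolves the scale problem in saliency detection.

Strongest Point of the Argument

A clear advantage on complex structures: the ECSSD dataset specifically collects images with complex internal structure, and the method improves on the next best method by 4.2% on this dataset, directly validating hierarchical processing for "striped shirt" type problems. The ablation result that multi-layer integration beats any single layer also dispels the concern that the extra layers are superfluous.

Weakest Point of the Argument

Limited expressiveness of low-level features: the method relies entirely on low-level features such as color contrast and spatial distribution, and lacks semantic understanding. When foreground and background colors are similar, or when the salient object occupies most of the image, the saliency computed at every layer may be unreliable, and hierarchical integration cannot make up for weak underlying features. In addition, using the newly constructed HKU-IS dataset as one of the evaluation benchmarks may introduce bias.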