Holistically-Nested Edge Detection

Abstract

We develop a new edge detection algorithm that addresses two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations that are important in order to approach the human ability of resolving the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSD500 dataset (ODS F-score of .782) and the NYU Depth dataset (.746), with substantially faster speed (0.4 seconds per image on GPU).

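Paragraph function: Overview of the whole paper. It frames the research scope around two core issues and previews the results with concrete numbers.

Logical role: The abstract follows an issue-method-result structure and covers both accuracy (F-score) and efficiency (speed), presenting the method's two advantages.

Argumentation technique / potential weakness: The phrase "approach the human ability" is strong rhetoric. The human F-score is about 0.80 and HED reaches 0.782, so the gap is small but real. The emphasis on speed is effective, because earlier CNN methods needed seconds to hours per image.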

1. Introduction

Edge and boundary detection is fundamental to computer vision, serving applications from visual saliency, segmentation, object detection/recognition, tracking, and motion analysis to modern tasks like autonomous driving and 3D reconstruction. Human consistency studies established an F-score of 0.80 for edge placement, setting a practical upper bound. Prior work falls into three categories: early gradient-based methods (Sobel, Canny), information-theory approaches (Pb, gPb), and learning-based methods with hand-crafted features. Recent CNN approaches showed promise but suffered from slow inference speeds and modest improvements over Structured Edges.

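Paragraph function: Establishes the research field with a systematic review of the history and current state of edge detection.

Logical role: Starting point of the argument chain. The three-way classification of prior methods sets the research context, and the claim that CNN methods fall short on both speed and accuracy prepares the positioning of HED.

Argumentation technique / potential weakness: Characterizing earlier CNN methods as slow and only modestly better is an effective contrast strategy. Introducing the 0.80 human upper bound gives the later evaluation a ceiling to measure against, which is good framing.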

2. Holistically-Nested Edge Detection (HED) Method

We formalize existing multi-scale deep learning approaches into four categories: multi-stream learning, skip-layer networks, single models on multiple inputs, and independent networks. Our holistically-nested approach differs fundamentally: it uses a single-stream deep network with multiple side outputs, drawing on deeply-supervised net methodology. The training objective sums the side-output losses over the M side-output layers: L_side(W, w) = Σ_{m=1..M} α_m l_side^(m)(W, w^(m)), where W denotes the standard network parameters, w^(m) the weights of the m-th side-output classifier, and α_m the weight of its loss. To handle the severe class imbalance (roughly 90% non-edge pixels), we employ a class-balanced cross-entropy loss in which edge-pixel terms are weighted by β = |Y-|/|Y| and non-edge terms by 1 - β = |Y+|/|Y|. A weighted-fusion layer learns to combine the multi-scale side outputs during training.

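Paragraph function: Defines the core method, positioning HED through a taxonomy and stating its mathematical formulation.

Logical role: The four categories of existing methods act as a comparison group that highlights what is distinctive about HED, namely a single stream with multiple side outputs. The class-balanced loss directly addresses the label skew encountered in actual training.

Argumentation technique / potential weakness: Differentiating through a taxonomy is an effective move. Deep supervision is not new, however, so the authors need to state clearly what HED adds on top of it. The class-balanced loss is a necessary engineering decision, but how the side-output weights α_m are chosen still needs to be explained.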
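
The class-balanced loss and the weighted sum over side outputs described in this section can be illustrated with a minimal pure-Python sketch. It assumes flattened per-pixel probabilities and binary labels held in plain lists. The function names, the `eps` guard, and the example weights α_m = 1 are illustrative assumptions rather than details given in the summary.

```python
import math

def class_balanced_bce(probs, labels, eps=1e-12):
    """Class-balanced cross-entropy for one flattened edge map.

    probs:  list of predicted edge probabilities, one per pixel
    labels: list of ground-truth labels, 1 = edge, 0 = non-edge
    eps:    small constant guarding log(0) (an assumption of this sketch)
    """
    beta = labels.count(0) / len(labels)  # beta = |Y-| / |Y|
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Rare edge pixels are weighted by beta.
            loss -= beta * math.log(p + eps)
        else:
            # Common non-edge pixels are weighted by 1 - beta.
            loss -= (1.0 - beta) * math.log(1.0 - p + eps)
    return loss

def side_output_loss(side_probs, labels, alphas):
    """L_side = sum over m of alpha_m * (loss of side output m)."""
    return sum(a * class_balanced_bce(p, labels)
               for a, p in zip(alphas, side_probs))

# Toy map with 90% non-edge pixels, matching the imbalance quoted above.
labels = [0] * 9 + [1]
uniform = [0.5] * 10  # an uninformative prediction
loss_one = class_balanced_bce(uniform, labels)
# Two hypothetical side outputs with alpha_m = 1 (assumed weights).
loss_all = side_output_loss([uniform, uniform], labels, alphas=[1.0, 1.0])
```

With β = 0.9 in this toy case, the single edge pixel contributes 0.9·log 2 and the nine non-edge pixels together contribute 9 × 0.1·log 2, so the two classes carry equal weight despite the 9:1 imbalance.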

3. Network Architecture

The architecture adopts VGGNet but removes the 5th pooling layer and all fully connected layers to improve efficiency and produce meaningful multi-scale outputs. Side outputs connect to layers conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3, with receptive field sizes ranging from 5 to 196 and strides from 1 to 16. Comparison with FCN variants showed HED outperformed FCN-8s (.697 ODS) and FCN-2s (.738 ODS). Deep supervision proved critical: "training with only the weighted-fusion output loss gives edge predictions that lack discernible order" with many critical edges absent at higher layers.

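Paragraph function: Architecture details, specifying how the network is modified and where the side outputs attach.

Logical role: The VGG trimming strategy justifies the engineering decisions, and the quantitative comparison with FCN supports the architecture's advantage. An ablation confirms that deep supervision is necessary.

Argumentation technique / potential weakness: The receptive-field span from 5 to 196 clearly shows the multi-scale coverage. The direct comparison with FCN is persuasive, but FCN was not designed for edge detection, so the fairness of the comparison can be questioned.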
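
The receptive-field and stride figures quoted in this section can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (3x3 convolutions with stride 1, 2x2 max pooling with stride 2) and applies the usual recurrence: each layer grows the receptive field by (kernel - 1) times the cumulative stride so far. The names `VGG16_TRIMMED` and `receptive_fields` are illustrative.

```python
# (name, kernel size, stride) for VGG-16 up to conv5_3. The 5th pooling
# layer and the fully connected layers are left out, as in the trimmed net.
VGG16_TRIMMED = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1), ("pool4", 2, 2),
    ("conv5_1", 3, 1), ("conv5_2", 3, 1), ("conv5_3", 3, 1),
]

def receptive_fields(layers):
    """Return {layer name: (receptive field size, cumulative stride)}."""
    rf, stride = 1, 1
    table = {}
    for name, kernel, step in layers:
        rf += (kernel - 1) * stride  # growth is scaled by the stride so far
        stride *= step               # each pooling layer doubles the stride
        table[name] = (rf, stride)
    return table

SIDE_LAYERS = ["conv1_2", "conv2_2", "conv3_3", "conv4_3", "conv5_3"]
table = receptive_fields(VGG16_TRIMMED)
side_stats = [table[name] for name in SIDE_LAYERS]
for name, (rf, stride) in zip(SIDE_LAYERS, side_stats):
    print(f"{name}: receptive field {rf}, stride {stride}")
```

Under this layer list the five side outputs come out at receptive fields 5, 14, 40, 92, and 196 with strides 1, 2, 4, 8, and 16, which agrees with the endpoints quoted above.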

4. Experiments

Training uses Caffe with VGG-16 pre-training, mini-batch size 10, learning rate 1e-6, momentum 0.9, and 10,000 iterations. Consensus sampling assigns a positive label only to pixels marked as edge by at least 3 annotators. Data augmentation rotates each image to 16 angles and flips it at each angle, enlarging the training set by a factor of 32. On BSDS500, HED achieves ODS F-score of .782, OIS of .804, and AP of .833. Individual side outputs contribute variably: layer 1 reaches .595 ODS, layer 2 achieves .697, layer 3 reaches .738, while layer 5 achieves only .606. Processing takes 0.4 seconds per 320x480 image on GPU, compared to seconds or hours for competing methods. On NYUDv2, the combined RGB-HHA model achieves ODS of .746.

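Paragraph function: Comprehensive experimental validation, covering training details, main results, and cross-dataset evaluation.

Logical role: The empirical pillar. Accuracy (.782 ODS), speed (0.4 seconds), and cross-dataset generalization (NYUDv2) together support the method's effectiveness. The per-layer analysis shows why multi-scale fusion is needed.

Argumentation technique / potential weakness: The per-layer numbers are insightful. Layer 5 reaching only .606 suggests that high-level semantic features are poorly suited to fine edge localization, yet they still contribute after fusion. Describing competing methods as taking "seconds or hours" is vague; a precise comparison table would be more convincing.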
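
The two data-preparation steps in this section, consensus labeling and 32x augmentation, can be sketched in a few lines. This is a minimal illustration under stated assumptions: pixels below the vote threshold are treated as non-edge, and the 16 rotation angles are taken to be evenly spaced over 360 degrees. Neither detail is specified in the summary, and the function names are hypothetical.

```python
def consensus_labels(votes, min_votes=3):
    """Binary edge labels from per-pixel annotator counts.

    votes[j] is how many annotators marked pixel j as an edge.
    Pixels below the threshold are treated as non-edge here, which is
    an assumption of this sketch.
    """
    return [1 if v >= min_votes else 0 for v in votes]

def augmented_variants(angles=16, flip=True):
    """List (rotation in degrees, flipped) pairs for one training image."""
    step = 360.0 / angles  # evenly spaced angles: an assumption
    flips = (False, True) if flip else (False,)
    return [(k * step, f) for k in range(angles) for f in flips]

votes = [0, 1, 2, 3, 5]      # toy annotator counts for five pixels
labels = consensus_labels(votes)
variants = augmented_variants()
print(labels, len(variants))
```

Sixteen angles times two flip states gives the 32 variants per image that account for the 32-fold enlargement of the training set.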

5. Conclusion

We present an end-to-end edge detection system that achieves state-of-the-art performance on natural images at a speed of practical relevance (0.4 seconds using GPU and 12 seconds using CPU). The method combines fully convolutional networks with deep supervision on a trimmed VGG architecture. The holistically-nested design enables automatic learning of multi-scale, multi-level edge representations, addressing a fundamental challenge in this classic vision task. Code and pre-trained models are publicly released.

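Paragraph function: Summarizes the paper, restating the core contributions and stressing practicality.

Logical role: The conclusion concisely answers the two core issues raised in the abstract (holistic prediction and multi-scale learning) and presents the open-source release as a contribution to the community.

Argumentation technique / potential weakness: The conclusion stays focused and does not overclaim. It does not discuss limitations, however, such as the ability to distinguish texture edges from object boundaries or scalability to higher-resolution images.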

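Argument Structure Overview

Problem: Edge detection lacks holistic and multi-scale learning capability
→ Claim: A single-stream network with multiple side outputs plus deep supervision yields HED
→ Evidence: BSD500 .782 ODS; NYUDv2 .746 ODS
→ Rebuttal: The class-balanced loss handles the 90% non-edge bias
→ Conclusion: An end-to-end system reaches state-of-the-art performance at practical speed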

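Authors' Core Claim (One Sentence)

By training multiple side outputs with deep supervision inside a single-stream fully convolutional network, HED automatically learns multi-scale edge representations and reaches an ODS F-score of .782 on BSD500 at a practical speed of 0.4 seconds per image.

Strongest Point of the Argument

Multi-scale fusion under deep supervision: the per-layer side-output analysis clearly shows that different network depths capture edge information at different scales, and that learned fusion effectively integrates these complementary signals. Cutting inference time from as much as hours to 0.4 seconds makes the method practical.

Weakest Point of the Argument

The gap to human performance is under-analyzed: a gap remains between HED's .782 and the human .800, but the authors neither examine failure cases in depth nor discuss where the gap comes from. In addition, the reliance on VGG pre-training means performance may be limited by the diversity of the pre-training data.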