ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

Abstract

We introduce ESPNet, a fast and efficient convolutional neural network for semantic segmentation of high resolution images under resource constraints. ESPNet is based on a new convolutional module, efficient spatial pyramid (ESP), which is efficient in terms of computation, memory, and power. The ESP module decomposes a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions, which dramatically reduces the number of parameters and operations while maintaining a large effective receptive field. Our ESPNet model is 22 times faster than the state-of-the-art semantic segmentation network PSPNet, while being 180 times smaller. ESPNet can process high resolution images at 112 fps on a standard GPU and 9 fps on an edge device, making it suitable for real-time applications on resource-constrained devices.

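Paragraph function: Overview of the whole paper. Defines ESPNet as an efficient semantic segmentation network and states its core contributions.

Logical role: The abstract positions the work along three efficiency dimensions (computation, memory, power) and closes on order-of-magnitude advantages in speed and size.

Argumentation technique / potential weakness: The "22 times faster, 180 times smaller" framing is striking. However, the abstract gives no accuracy figures, so readers must work out for themselves how much accuracy was traded for speed.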

1. Introduction

Semantic segmentation requires assigning a class label to each pixel in an image. While significant progress has been made using deep convolutional neural networks, most state-of-the-art models are computationally expensive and require large amounts of memory, making them unsuitable for deployment on edge devices such as mobile phones, drones, and autonomous vehicles. For instance, PSPNet has 65.7 million parameters and runs at about 1 FPS while discharging the battery of a standard laptop at a rate of 77 Watts. The key challenge is to design a network that is both accurate and efficient enough for real-time inference on resource-constrained platforms.

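Paragraph function: Sets up the problem background, using concrete PSPNet figures to highlight how hard large segmentation models are to deploy.

Logical role: Anchors the research motivation in edge deployment as a practical use case, and uses the power figure to make the "impractical" argument concrete.

Argumentation technique / potential weakness: Citing a power draw of 77 watts is unusual and appeals directly to the reader's intuition about battery life. Naming three kinds of edge device (mobile phones, drones, autonomous vehicles) makes the problem concrete.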

Existing efficient segmentation networks such as ENet and ICNet either sacrifice too much accuracy or still require significant computational resources. Our approach is to design a new convolutional module that is inherently efficient by construction, rather than applying post-hoc compression techniques like pruning or quantization. The ESP module achieves this by factorizing convolutions into point-wise and dilated components, with a hierarchical feature fusion (HFF) strategy to address the gridding artifact caused by dilated convolutions.

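Paragraph function: Distinguishes the proposed method from existing efficient networks and from compression techniques.

Logical role: Sets up a methodological contrast between "efficient by design" and "efficient after compression", positioning ESP at the more fundamental level.

Argumentation technique / potential weakness: "Inherently efficient by construction" implies that post-hoc compression is the inferior option. The two approaches are not mutually exclusive, though: an ESP-based network could still be compressed afterwards for a further speedup.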

Beyond traditional accuracy metrics, we introduce system-level metrics for comprehensively evaluating network performance on edge devices, including GPU frequency sensitivity, warp execution efficiency, memory efficiency, and power consumption. Our ESPNet processes high-resolution RGB images at 112 FPS on a high-end GPU, 21 FPS on a laptop, and 9 FPS on an edge device, while consuming only 1 Watt of average power on the Jetson TX2 at 824 MHz.

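Paragraph function: Introduces system-level evaluation metrics and reports performance across platforms.

Logical role: Extends evaluation from academic benchmarks to engineering deployment, giving the edge-device argument more complete evidence.

Argumentation technique / potential weakness: The new evaluation dimensions (power consumption, warp execution efficiency) are a distinctive feature of the paper. They are rare in efficient-network papers and help differentiate the work.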

2. Method

The ESP module is based on a reduce-split-transform-merge strategy. First, a 1x1 point-wise convolution reduces the input feature maps from M channels to N/K channels (Reduce). The reduced feature maps are then fed to K parallel branches (Split), where each branch applies a 3x3 dilated convolution (kernel size n = 3) with dilation rate 2^(k-1) for k = 1, 2, ..., K (Transform). Finally, the outputs of all K branches are concatenated to produce an N-channel output (Merge). Relative to a standard n x n convolution with M input and N output channels, this factorization reduces the parameter count by a factor of n^2 MK / (M + n^2 N) while increasing the effective receptive field by a factor of approximately [2^(K-1)]^2.

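Paragraph function: Details the four-step core pipeline of the ESP module.

Logical role: The paper's central technical contribution, quantifying the efficiency gain and the receptive field growth mathematically.

Argumentation technique / potential weakness: The four-step "reduce-split-transform-merge" naming makes a complex design intuitive and memorable. Deriving the parameter reduction factor gives the efficiency claim a mathematical basis.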
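
The parameter reduction factor quoted for the ESP module can be checked by counting weights directly. The sketch below is only an illustration: the function names and the sizes M = N = 128, K = 4 are assumptions chosen for the example, biases are ignored, and the receptive field uses the usual side length (n - 1) * d + 1 of an n x n kernel with dilation d.

```python
def standard_conv_params(m: int, n_out: int, n: int) -> int:
    """Weights in a standard n x n convolution mapping m -> n_out channels (no bias)."""
    return n * n * m * n_out


def esp_params(m: int, n_out: int, n: int, k: int) -> int:
    """Weights in an ESP module: one 1x1 reduce plus K dilated n x n branches (no bias)."""
    d = n_out // k                 # channels per branch after the Reduce step
    reduce = m * d                 # 1x1 point-wise convolution, m -> d channels
    branches = k * n * n * d * d   # K dilated convolutions, each d -> d channels
    return reduce + branches


def largest_branch_receptive_field(n: int, k: int) -> int:
    """Side length covered by the branch with dilation rate 2^(K-1)."""
    return (n - 1) * 2 ** (k - 1) + 1


# Illustrative sizes only (assumed for this example).
M, N, n, K = 128, 128, 3, 4

std = standard_conv_params(M, N, n)
esp = esp_params(M, N, n, K)
ratio = std / esp
closed_form = n * n * M * K / (M + n * n * N)   # n^2 MK / (M + n^2 N)
rf = largest_branch_receptive_field(n, K)

print(std, esp, round(ratio, 2), round(closed_form, 2), rf)
```

With these sizes the direct count gives 147,456 weights for the standard convolution and 40,960 for the ESP module, a 3.6x reduction that matches the closed-form factor. The largest branch covers a 17 x 17 window.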

The Hierarchical Feature Fusion (HFF) is a key design choice to address the gridding artifact. In standard spatial pyramid modules, dilated convolutions with large dilation rates produce checkerboard-like patterns due to the sparse sampling of the input. Our HFF strategy hierarchically adds the output of each branch to the output of the previous branch before concatenation, effectively filling the gaps in the receptive field. This results in smooth and complete receptive field coverage without the gridding artifact, and does not increase the complexity of the ESP module.

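Paragraph function: Explains how hierarchical feature fusion addresses a known defect of dilated convolutions.

Logical role: Handles a potential weakness of the design. The problem is acknowledged first, then a zero-cost fix is proposed, which shows the design is well thought through.

Argumentation technique / potential weakness: "Does not increase the complexity" is the key selling point, presenting HFF as a free lunch. Whether the hierarchical addition introduces feature confusion or information loss deserves a more thorough ablation.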
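
The fusion step described above amounts to a running sum over the branch outputs followed by concatenation. The following is a minimal sketch on toy data, not the authors' implementation: feature maps are stood in for by flat lists of floats, the helper names are hypothetical, and starting the running sum at the smallest-dilation branch is an assumption about the ordering.

```python
from itertools import accumulate
from typing import List

FeatureMap = List[float]  # toy stand-in for one branch's feature map


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two equally sized feature maps."""
    return [x + y for x, y in zip(a, b)]


def hierarchical_feature_fusion(branches: List[FeatureMap]) -> List[FeatureMap]:
    """Running sum over branches ordered by increasing dilation rate.

    Output k is the sum of branch outputs 1..k, so each larger-dilation
    branch is combined with everything computed at smaller dilations.
    """
    return list(accumulate(branches, add_maps))


def merge(fused: List[FeatureMap]) -> FeatureMap:
    """Concatenate the fused branch outputs (the Merge step)."""
    return [value for fmap in fused for value in fmap]


# Three toy branch outputs, ordered by increasing dilation rate.
branch_outputs = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
fused = hierarchical_feature_fusion(branch_outputs)
merged = merge(fused)
print(fused)
print(merged)
```

Because the fusion uses only element-wise additions, it introduces no extra weights, which is consistent with the claim that HFF leaves the module's parameter count unchanged.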

The network architecture progresses through four variants. ESPNet-A is the baseline that learns representations at multiple spatial levels. ESPNet-B adds long-range connections by concatenating feature maps from strided ESP modules. ESPNet-C introduces input reinforcement by downsampling the original image and concatenating it at each spatial level, compensating for spatial information loss. The full ESPNet attaches a lightweight decoder using a reduce-upsample-merge (RUM) strategy, producing segmentation masks at the original resolution. A depth multiplier alpha controls network depth without changing the topology.

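Paragraph function: Describes the incremental design path from the baseline to the full architecture.

Logical role: The incremental presentation (A, B, C, full) implicitly follows the logic of an ablation study, with each step bringing a quantifiable improvement.

Argumentation technique / potential weakness: The four variants serve two purposes at once: they are natural controls for ablation experiments, and they let users pick a version that fits their resource budget. The depth multiplier adds flexibility.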

3. Experiments

We evaluate ESPNet on the Cityscapes dataset for semantic segmentation. Our model achieves 60.3% class mIoU and 82.2% category mIoU on the test set with only 0.4M parameters. Compared to ENet, which achieves 58.3% class mIoU with 0.36M parameters, ESPNet is 2 percentage points more accurate while running 1.27x and 1.16x faster on a desktop and a laptop, respectively. Compared to PSPNet, which achieves 78.4% class mIoU with 65.7M parameters, ESPNet is about 8 percentage points lower in category-wise mIoU while learning 180x fewer parameters.

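Paragraph function: The core experiment, comparing accuracy and efficiency on Cityscapes.

Logical role: Uses two different reference points to position ESPNet: against ENet on accuracy (a win) and against PSPNet on efficiency (a win).

Argumentation technique / potential weakness: The double comparison is clever, since ESPNet gets a favorable narrative on both axes. But the absolute gap of 60.3% versus 78.4% class mIoU remains large and may be unacceptable in safety-critical applications.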
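
Since the comparison above rests on mIoU, it may help to recall how the metric is formed: per-class intersection over union, averaged over classes. The sketch below is a simplified illustration on a made-up two-class confusion matrix. It does not reproduce the Cityscapes evaluation protocol (ignored labels, category grouping), and scoring a class that never occurs as 0.0 is an assumption of this example.

```python
from typing import List


def per_class_iou(confusion: List[List[int]]) -> List[float]:
    """IoU per class from a square confusion matrix.

    Rows are ground-truth classes, columns are predicted classes.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                   # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp    # pixels wrongly labeled c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)  # assumption: absent class scores 0.0
    return ious


def mean_iou(confusion: List[List[int]]) -> float:
    """Unweighted mean of the per-class IoU values."""
    ious = per_class_iou(confusion)
    return sum(ious) / len(ious)


# Made-up pixel counts for two classes, for illustration only.
toy_confusion = [[8, 2],
                 [1, 9]]
toy_ious = per_class_iou(toy_confusion)
toy_miou = mean_iou(toy_confusion)
print(toy_ious, toy_miou)
```

Averaging per class rather than per pixel means that small or rare classes weigh as much as large ones, which is one reason class mIoU and category mIoU can differ as much as the 60.3% and 82.2% reported here.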

We further evaluate the real-time capability on edge devices. On the NVIDIA Jetson TX2, ESPNet processes high-resolution Cityscapes images at 9 fps. On a standard NVIDIA TitanX GPU, ESPNet runs at 112 fps. The model size is only 0.7 MB. Comparing convolutional modules, ESP outperforms MobileNet by 7% and ShuffleNet by 12% while learning a similar number of parameters and having comparable network size and inference speed.

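Paragraph function: Reports real deployment performance on an edge device along with a module-level comparison.

Logical role: Extends the efficiency argument from academic benchmarks to real hardware, and reinforces the case for the ESP design through the module-level comparison.

Argumentation technique / potential weakness: "0.7 MB" is a striking number. Comparing against MobileNet and ShuffleNet at the module level (rather than the network level) makes the advantage of the ESP design clearer.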

Cross-dataset generalization tests on the unseen Mapillary dataset (trained only on Cityscapes) show ESPNet achieving 0.40 mIoU with 0.364M parameters, outperforming ENet (0.33) and ERFNet (0.25). On PASCAL VOC 2012, ESPNet achieves 63.01% mIoU with 0.364M parameters, compared to 59.10% for SegNet, which has 29.5M parameters (about 81x more than ESPNet). Power consumption on the Jetson TX2 is only 1 W at 824 MHz GPU frequency, compared to ENet's 1.5 W and ERFNet's 2.9 W, with warp execution efficiency about 9% higher than ENet and 14% higher than ERFNet.

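Paragraph function: Extended validation, covering cross-dataset generalization and system-level performance metrics.

Logical role: The generalization test shows that the representations ESP learns transfer beyond the training set. The power and warp execution efficiency figures give engineering-level evidence for edge deployment.

Argumentation technique / potential weakness: Cross-dataset generalization is an uncommon but highly persuasive experiment. In an edge deployment setting, the power comparison (1 W vs 2.9 W) is more meaningful in practice than FLOPs.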
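
The parameter multiples quoted in the experiments can be reproduced with simple arithmetic. The sketch below only re-derives the ratios from the counts given in the text. It assumes the 0.364M figure describes the same ESPNet model that is rounded to 0.4M in the Cityscapes comparison, and the helper name is hypothetical.

```python
# Parameter counts in millions, copied from the text.
ESPNET_M = 0.364
PSPNET_M = 65.7
SEGNET_M = 29.5


def times_more_params(other_m: float, reference_m: float = ESPNET_M) -> float:
    """How many times more parameters `other_m` has than `reference_m`."""
    return other_m / reference_m


pspnet_ratio = times_more_params(PSPNET_M)
segnet_ratio = times_more_params(SEGNET_M)
print(f"PSPNet / ESPNet: {pspnet_ratio:.1f}x")
print(f"SegNet / ESPNet: {segnet_ratio:.1f}x")
```

Both ratios round to the quoted multiples, roughly 180x for PSPNet and 81x for SegNet.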

4. Conclusion

We have presented ESPNet, an efficient convolutional neural network for real-time semantic segmentation on resource-constrained devices. The core contribution is the ESP module, which uses a factorized convolution approach with hierarchical feature fusion to achieve a large effective receptive field with minimal computational cost. Beyond traditional accuracy metrics, we introduce system-level metrics for comprehensively evaluating network performance on edge devices. Empirical analysis demonstrates ESPNet achieves compelling trade-offs between accuracy and efficiency while learning generalizable representations across diverse datasets and conditions.

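Paragraph function: Summarizes the paper, restating the core contribution and the two main innovations.

Logical role: The conclusion closes the loop opened in the introduction: problem (the edge deployment bottleneck), solution (the ESP module), validation (multiple datasets plus system-level metrics).

Argumentation technique / potential weakness: Emphasizing system-level metrics as a methodological contribution is a sound move. It goes beyond a pure architecture proposal and gives later edge-deployment research an evaluation framework. However, little is said about how the accuracy ceiling could be raised further.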

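Overview of the Argument Structure

Problem: Semantic segmentation models are too heavy to run in real time on edge devices.

→

Claim: Pursue inherent efficiency at the design level rather than post-hoc compression.

→

Evidence: 0.4M parameters / 0.7 MB; 112 fps on GPU / 9 fps on an edge device; 60.3% mIoU / 1 W power.

→

Rebuttal: ENet and ICNet sacrifice accuracy; PSPNet is too heavy to deploy; compression is only an after-the-fact remedy.

→

Conclusion: The ESP module enables real-time segmentation on edge devices and generalizes across datasets.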

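Authors' core claim (one sentence)

Through the ESP module's convolution factorization (point-wise convolution + multi-scale dilated convolutions + hierarchical feature fusion), ESPNet achieves real-time semantic segmentation on edge devices with 0.4M parameters and a 0.7 MB model, and pioneers the comprehensive evaluation of efficient networks with system-level metrics.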

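Strongest point of the argument

Comprehensive validation of edge deployment: the paper reports not only accuracy and FPS but also system metrics such as power consumption (1 W), warp execution efficiency, and GPU frequency sensitivity. The measured 9 fps on the Jetson TX2, together with the cross-dataset generalization results (Mapillary, VOC), gives the "edge-deployable" claim full empirical support.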

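Weakest point of the argument

The accuracy gap is still large: 60.3% versus PSPNet's 78.4% mIoU, a difference of 18 percentage points. The paper plays this down with speed and parameter-count multiples, but in safety-critical settings such as autonomous driving every percentage point of accuracy matters. The implicit equation "efficient = usable" needs validation in more scenarios.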