Abstract
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks have achieved impressive accuracy on segmentation benchmarks, but they require significant computational resources and are far from real-time on embedded devices. We propose ENet (Efficient Neural Network), a novel neural network architecture designed specifically for tasks requiring low-latency operation. ENet is up to 18x faster, requires 75x fewer FLOPs, has 79x fewer parameters, and provides similar or better accuracy compared to existing models. We have tested it on the CamVid, Cityscapes, and SUN RGB-D datasets and report on both accuracy and speed.
Paragraph function: introduces a lightweight architecture designed specifically for real-time segmentation.
Logical role: motivates an efficiency-first design from the requirements of embedded/mobile applications.
Argumentative technique / potential weakness: the multi-dimensional efficiency gains (18x / 75x / 79x) make for very striking headline numbers.
1. Introduction
Semantic segmentation is important for understanding visual scenes, with applications in autonomous driving, augmented reality, and robotics. State-of-the-art networks like SegNet and FCN have achieved good accuracy but remain too slow for real-time applications, especially on mobile and embedded platforms. The key challenge is to design a network that is both accurate and efficient. We observe that existing architectures are over-parameterized for the task of segmentation, and that careful architectural design can dramatically reduce computation without sacrificing accuracy. ENet is designed from the ground up with efficiency as the primary constraint.
Paragraph function: critiques the over-parameterization of existing architectures and proposes an efficiency-first design philosophy.
Logical role: the "over-parameterization" insight provides the theoretical grounding for a lightweight design.
Argumentative technique / potential weakness: directly challenges the prevailing "bigger is better" mindset in favor of efficiency-driven design.
2. Architecture
ENet consists of an asymmetric encoder-decoder structure with a large encoder and a small decoder. The encoder follows the design of SegNet but with significantly fewer layers. The key building block is the bottleneck module, inspired by ResNet, consisting of a 1x1 projection that reduces dimensionality, a main convolutional layer (regular, dilated, or asymmetric), and a 1x1 expansion. We also use early downsampling — the initial block aggressively reduces resolution through a combination of max pooling and convolution. The decoder is intentionally lightweight: its role is only to upsample the encoder output and fine-tune details, not to learn complex features.
Paragraph function: details the rationale behind the asymmetric encoder-decoder design.
Logical role: the "large encoder + small decoder" asymmetry is the key source of efficiency.
Argumentative technique / potential weakness: early downsampling sharply reduces computation but may sacrifice fine-grained spatial information.
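The bottleneck module described in the architecture section can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' exact implementation: the `internal_ratio` name, channel counts, and the placement of batch normalization are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """ENet-style bottleneck sketch: 1x1 projection -> main convolution
    (regular, dilated, or asymmetric) -> 1x1 expansion, with a residual
    connection around the branch."""

    def __init__(self, channels, internal_ratio=4, dilation=1, asymmetric=False):
        super().__init__()
        internal = channels // internal_ratio
        # 1x1 projection reduces dimensionality
        layers = [nn.Conv2d(channels, internal, 1, bias=False),
                  nn.BatchNorm2d(internal), nn.PReLU(internal)]
        if asymmetric:
            # asymmetric main convolution: 5x1 followed by 1x5
            layers += [nn.Conv2d(internal, internal, (5, 1), padding=(2, 0), bias=False),
                       nn.Conv2d(internal, internal, (1, 5), padding=(0, 2), bias=False)]
        else:
            # regular (dilation=1) or dilated 3x3 main convolution;
            # padding=dilation preserves the spatial size
            layers += [nn.Conv2d(internal, internal, 3, padding=dilation,
                                 dilation=dilation, bias=False)]
        layers += [nn.BatchNorm2d(internal), nn.PReLU(internal)]
        # 1x1 expansion restores the channel count
        layers += [nn.Conv2d(internal, channels, 1, bias=False),
                   nn.BatchNorm2d(channels)]
        self.branch = nn.Sequential(*layers)
        self.out_act = nn.PReLU(channels)

    def forward(self, x):
        # residual addition keeps gradients flowing through the identity path
        return self.out_act(x + self.branch(x))


x = torch.randn(1, 64, 32, 32)
y = Bottleneck(64, dilation=2)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

The branch shrinks the working channel count by `internal_ratio` before the main convolution, which is where the module's compute savings come from; the residual connection follows the ResNet inspiration noted in the text.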
3. Design Choices
Several key design choices contribute to ENet's efficiency: (1) PReLU activation is used instead of ReLU, providing a small but consistent accuracy improvement; (2) Spatial dropout with varying rates in different stages helps regularize the network; (3) Asymmetric convolutions (5x1 followed by 1x5) replace standard 5x5 convolutions, reducing parameters; (4) Dilated convolutions are used in later encoder stages to increase the receptive field without increasing computational cost. These choices collectively enable ENet to achieve only 0.37M parameters, compared to SegNet's 29.5M.
Paragraph function: enumerates four efficiency-oriented design choices.
Logical role: itemizes the sources of efficiency so the reader can see each choice's individual contribution.
Argumentative technique / potential weakness: the 0.37M vs. 29.5M parameter comparison (79x) is highly persuasive.
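The savings from choice (3) and the overall parameter gap can be checked with back-of-the-envelope arithmetic (a sketch; weights are counted per input/output channel pair and biases are ignored):

```python
# Weights in a standard 5x5 kernel vs. the asymmetric 5x1 + 1x5 pair,
# counted per input/output channel pair (biases ignored).
full = 5 * 5              # 25 weights for the standard kernel
factored = 5 * 1 + 1 * 5  # 10 weights for the asymmetric pair
print(full / factored)    # 2.5, i.e. 2.5x fewer weights in that convolution

# The overall model-size gap quoted in the text:
print(round(29.5e6 / 0.37e6))  # 80, consistent with the reported ~79x reduction
```

Dilated convolutions (choice 4) give a complementary saving: a 3x3 kernel with dilation d covers the receptive field of a (2d+1)x(2d+1) kernel while keeping only 9 weights.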
4. Experiments
We evaluate ENet on Cityscapes, CamVid, and SUN RGB-D. On Cityscapes, ENet achieves 58.3% class IoU, comparable to SegNet's 56.1% while being 18x faster. On CamVid, ENet achieves 51.3% class IoU at 76.9 FPS on an NVIDIA TX1 embedded platform. ENet processes a 640x360 image in 7ms on a Titan X GPU — true real-time performance. On the Jetson TX1 mobile platform, ENet runs at 20+ FPS for high-resolution inputs. While ENet does not achieve the absolute highest accuracy, its accuracy-to-speed ratio is unmatched, making it the ideal choice for embedded and mobile applications.
Paragraph function: reports speed and accuracy results across multiple platforms.
Logical role: validates real-world deployability with measured results on an embedded platform (TX1).
Argumentative technique / potential weakness: the on-device measurements on embedded hardware are a highlight of the paper, going beyond theoretical speed on high-end GPUs alone.
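The 7 ms Titan X latency quoted above translates into frame rate as follows (simple arithmetic, assuming inference latency is the only bottleneck):

```python
latency_s = 0.007       # 7 ms per 640x360 frame on a Titan X (from the text)
fps = 1.0 / latency_s
print(round(fps))       # 143 frames per second, well above real-time thresholds
```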
5. Conclusions
We have proposed ENet, an efficient neural network architecture designed for real-time semantic segmentation on resource-constrained platforms. Through careful architectural design — asymmetric encoder-decoder, early downsampling, bottleneck modules with dilated and asymmetric convolutions — ENet achieves competitive segmentation accuracy while being orders of magnitude more efficient than existing methods. ENet enables semantic segmentation on embedded devices and mobile platforms for the first time.
Paragraph function: summarizes the achievements of the efficiency-driven design.
Logical role: stakes the work's practical impact on the claim of being "first on embedded devices".
Argumentative technique / potential weakness: the "first" claim elevates the work's historical positioning.
Argument structure overview
Problem: segmentation models are too large for real-time use ➔ Thesis: an efficiency-first lightweight design ➔ Evidence: 18x faster with 79x fewer parameters ➔ Rebuttal: not the highest accuracy, but the best accuracy-to-speed ratio ➔ Conclusion: real-time segmentation on embedded devices
Core claim
Through an asymmetric encoder-decoder, early downsampling, and several lightweight convolution techniques, ENet achieves segmentation accuracy comparable to large models with only 0.37M parameters.
Strongest argument
The measured results on the NVIDIA TX1 embedded platform (76.9 FPS) demonstrate feasibility in real deployment scenarios, going beyond theoretical efficiency analysis.
Weakest link
The accuracy gap to state-of-the-art methods is still noticeable on some classes, and early downsampling may lose information about small objects and fine boundaries.