Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

Abstract

Pedestrian detection is a key problem for automotive safety and surveillance. Most state-of-the-art systems rely on hand-crafted features such as HOG (Histogram of Oriented Gradients) combined with linear or kernel-based classifiers. We propose an approach that replaces hand-crafted features with features learned in an unsupervised manner using a multi-stage convolutional architecture. Each stage consists of convolutional filtering, non-linear activation, and spatial pooling, with filters learned via unsupervised methods (PSD — Predictive Sparse Decomposition). Our multi-stage system significantly outperforms HOG-based methods on the INRIA and Caltech pedestrian benchmarks, demonstrating that learned features can surpass carefully engineered ones for pedestrian detection.

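Paragraph function: Overview of the whole paper; states the core claim that features learned without supervision can replace hand-crafted features.

Logical role: The abstract sets up a binary opposition between hand-crafted and learned features and uses the benchmark results to declare learned features the winner outright.

Argument technique / potential weaknesses: The emphasis on unsupervised learning is one of the paper's main selling points: no labeled data is needed to learn the features. However, AlexNet had already been published in 2012 and supervised deep learning was about to sweep the vision field, so the unsupervised route taken here is, in hindsight, a transitional approach.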

1. Introduction

Pedestrian detection has been extensively studied due to its critical role in autonomous driving, advanced driver assistance systems (ADAS), and intelligent surveillance. The dominant paradigm involves sliding window detection with HOG features and a linear SVM classifier, as proposed by Dalal and Triggs. While numerous refinements have been proposed — multi-scale HOG, HOG+LBP combinations, channel features — the core limitation remains: the features are hand-designed and may not capture the most discriminative patterns for pedestrian/non-pedestrian classification.

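Paragraph function: Research background; outlines the HOG-centered pedestrian detection paradigm and its fundamental limitation.

Logical role: Establishes the "ceiling of hand-crafted features" argument: even a carefully refined HOG remains bounded by what human designers can devise.

Argument technique / potential weaknesses: Listing several HOG variants and still asserting a "core limitation" implies diminishing marginal returns from hand-crafted feature design. However, some hand-crafted features (such as Dollar's ACF) still hold a significant speed advantage.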

Deep learning approaches, particularly convolutional neural networks (ConvNets), have recently shown remarkable success in image classification (Krizhevsky et al., 2012). However, these successes rely on large-scale supervised training data, which may not always be available. We explore an alternative: using unsupervised feature learning to train convolutional filters, which are then combined with a supervised classifier for the final detection task. This two-phase approach (unsupervised pre-training + supervised fine-tuning) reduces the dependency on labeled data while still leveraging the representational power of deep architectures.

Paragraph function: Positioning of the method; opens a third route between fully supervised deep learning and hand-crafted features.

Logical role: Acknowledges AlexNet's success but points to its dependence on large amounts of data, which gives unsupervised learning a foothold.

Argument technique / potential weaknesses: The argument that labeled data is not always available was reasonable in 2013, but as large-scale datasets (ImageNet, COCO) became widespread, this motivation quickly lost its persuasive force.

2. Related Work

The HOG detector by Dalal and Triggs has been the foundation of pedestrian detection for nearly a decade. Key extensions include deformable part models (DPM) by Felzenszwalb et al., which model pedestrians as collections of parts, and channel features by Dollar et al., which aggregate multiple feature channels (gradient magnitude, color channels) efficiently. On the learning side, sparse coding and autoencoders have been used for unsupervised feature learning in various recognition tasks but have not been systematically applied to pedestrian detection with multi-scale architectures.

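Paragraph function: Literature review; traces the evolution from hand-crafted features to learned features.

Logical role: Points out that unsupervised feature learning has succeeded on other tasks but has left a gap in pedestrian detection, which is where this paper enters.

Argument technique / potential weaknesses: The selected literature appropriately covers the two main lines of work. However, earlier attempts to apply sparse coding to pedestrian detection may exist and go uncited.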

3. Method

3.1 Unsupervised Feature Learning

Our feature learning approach uses Predictive Sparse Decomposition (PSD), which learns a dictionary of convolutional filters by jointly minimizing a reconstruction error and a sparsity penalty. Given a set of unlabeled image patches extracted from natural images, PSD learns filters that produce sparse feature maps when convolved with input images. Unlike standard sparse coding (which requires iterative inference at test time), PSD trains an encoder network that directly predicts the sparse codes in a single forward pass, making it suitable for real-time applications.

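Paragraph function: Core of the feature learning; describes how PSD learns filters from unlabeled data.

Logical role: PSD is the technical means of delivering on the promise of unsupervised feature learning: it automatically discovers useful filters from natural image statistics.

Argument technique / potential weaknesses: PSD's feed-forward prediction is an important speed advantage over the iterative inference of conventional sparse coding. But how much better are the learned filters than random ones? Later research showed that random filters can sometimes achieve reasonable results as well.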
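
The PSD objective described in Section 3.1 (reconstruction error plus sparsity penalty, with an encoder trained to predict the sparse code) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it works on a single flattened patch rather than convolutionally, the tanh encoder, the weights `lam` and `alpha`, and all function names are assumptions made for illustration, and the dictionary and encoder are random rather than trained.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrinks every entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def encode(x, W, b):
    # Feed-forward encoder (assumed form): one matrix product and a tanh.
    return np.tanh(W @ x + b)

def psd_loss(x, D, W, b, z, lam=0.5, alpha=1.0):
    # Reconstruction error + sparsity penalty + code-prediction error.
    recon = 0.5 * np.sum((x - D @ z) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    prediction = 0.5 * alpha * np.sum((z - encode(x, W, b)) ** 2)
    return recon + sparsity + prediction

def infer_code(x, D, W, b, lam=0.5, alpha=1.0, steps=50):
    # Iterative inference (ISTA) of the code z for fixed D, W, b.
    # This is the costly step that the encoder avoids at test time.
    target = encode(x, W, b)
    lipschitz = np.linalg.norm(D, 2) ** 2 + alpha  # bound for the smooth part
    z = target.copy()
    for _ in range(steps):
        grad = D.T @ (D @ z - x) + alpha * (z - target)
        z = soft_threshold(z - grad / lipschitz, lam / lipschitz)
    return z

# Toy usage with random (untrained) parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(49)              # one flattened 7x7 patch
D = rng.standard_normal((49, 64))        # dictionary with 64 atoms
D /= np.linalg.norm(D, axis=0)           # unit-norm columns
W = 0.1 * rng.standard_normal((64, 49))  # encoder weights
b = np.zeros(64)

z_fast = encode(x, W, b)                 # single forward pass
z_opt = infer_code(x, D, W, b)           # iterative refinement
```

In actual training, `D`, `W`, and `b` would be updated by gradient steps on `psd_loss` after each inference step, so that `encode` alone becomes a good approximation of `infer_code`.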

3.2 Multi-Scale Architecture

A key contribution is our multi-stage, multi-scale convolutional architecture. The first stage applies 64 learned 7x7 filters to the input image, followed by absolute value rectification and contrast normalization. The output is passed to a 2x2 average pooling layer. The second stage applies another set of learned filters to the pooled features, with the same non-linearity and pooling. To capture both fine details and coarse structure, we extract features from multiple stages and combine them via concatenation before the final classifier. This "multi-scale feature" strategy allows the classifier to leverage both low-level edges and higher-level shape information.

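Paragraph function: Architecture design; describes the multi-stage feature extraction and combination strategy.

Logical role: Multi-scale feature concatenation is the architectural innovation of the paper: rather than using only the deepest features, it combines information from every stage, a design later widely adopted in architectures such as FPN.

Argument technique / potential weaknesses: Multi-scale concatenation is intuitively reasonable: pedestrian detection must recognize both contours (low level) and overall shape (high level). But plain concatenation may introduce redundant features and increase the burden on the classifier.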
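
The two-stage pipeline of Section 3.2 (filtering, absolute value rectification, normalization, 2x2 average pooling, then concatenation of both stages) can be sketched in NumPy as follows. Several details are assumptions rather than facts from the paper: the second-stage filter count and size, the window size, the use of a simple global normalization in place of local contrast normalization, and random filters standing in for PSD-learned ones.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_valid(maps, filters):
    # maps: (C, H, W); filters: (K, C, k, k).
    # Returns (K, H-k+1, W-k+1). Implemented as cross-correlation.
    k = filters.shape[-1]
    windows = sliding_window_view(maps, (k, k), axis=(1, 2))  # (C, H', W', k, k)
    return np.einsum("chwij,ocij->ohw", windows, filters)

def rectify_and_normalize(maps, eps=1e-6):
    # Absolute value rectification, then a global zero-mean/unit-variance
    # normalization (a simplification of local contrast normalization).
    a = np.abs(maps)
    return (a - a.mean()) / (a.std() + eps)

def avg_pool2x2(maps):
    # Non-overlapping 2x2 average pooling; odd trailing rows/columns are dropped.
    c, h, w = maps.shape
    h2, w2 = h // 2, w // 2
    return maps[:, :h2 * 2, :w2 * 2].reshape(c, h2, 2, w2, 2).mean(axis=(2, 4))

def stage(maps, filters):
    return avg_pool2x2(rectify_and_normalize(conv_valid(maps, filters)))

def multi_stage_features(image, filters1, filters2):
    # Concatenate stage-1 and stage-2 outputs into one feature vector.
    s1 = stage(image[None], filters1)  # add a channel axis to the grayscale input
    s2 = stage(s1, filters2)
    return np.concatenate([s1.ravel(), s2.ravel()])

rng = np.random.default_rng(0)
image = rng.standard_normal((46, 30))            # toy grayscale window
filters1 = rng.standard_normal((64, 1, 7, 7))    # 64 filters of size 7x7 (from the text)
filters2 = rng.standard_normal((8, 64, 5, 5))    # hypothetical second-stage bank
features = multi_stage_features(image, filters1, filters2)
```

With these toy sizes, stage 1 yields 64 maps of 20x12 and stage 2 yields 8 maps of 8x4, so the classifier receives both the fine first-stage responses and the coarser second-stage responses in a single vector.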

The final detection system combines the multi-stage convolutional features with a linear SVM classifier. We also investigate boosting and neural network classifiers as alternatives. The detection pipeline follows the standard sliding window approach with a multi-scale image pyramid, applying the learned feature extractor and classifier at each window position and scale. Non-maximum suppression is used to merge overlapping detections. While the feature extraction is more expensive than HOG computation, the system can be accelerated using a GPU implementation.

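Paragraph function: Detection pipeline; describes the complete flow from features to final detections.

Logical role: Keeps the sliding window paradigm but swaps the feature extractor, so the method can be compared directly with the HOG baseline.

Argument technique / potential weaknesses: Keeping the detection pipeline fixed and replacing only the features yields a strictly controlled comparison. But conceding that feature extraction is "more expensive" suggests that speed may be a bottleneck in real deployment.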
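
The sliding window and non-maximum suppression steps mentioned above can be sketched as follows. The text does not say which NMS variant the authors use, so the greedy IoU-based version here is an assumption, as are the 128x64 window, the stride, the IoU threshold, and the function names.

```python
import numpy as np

def sliding_windows(height, width, win_h=128, win_w=64, stride=16):
    # Enumerate (x1, y1, x2, y2) boxes for one level of the image pyramid.
    return [(x, y, x + win_w, y + win_h)
            for y in range(0, height - win_h + 1, stride)
            for x in range(0, width - win_w + 1, stride)]

def iou_one_to_many(box, boxes):
    # Intersection over union between one box and an (N, 4) array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep

windows = sliding_windows(160, 96)
boxes = np.array([[10, 10, 50, 90],
                  [12, 12, 52, 92],      # near-duplicate of the first box
                  [100, 20, 140, 100]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)         # indices of surviving detections
```

In a full detector, each window would be passed through the feature extractor and classifier to obtain its score, and the pyramid loop would rescale the image and map the boxes back to original coordinates before suppression.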

4. Experiments

We evaluate on the INRIA Person dataset and the Caltech Pedestrian benchmark. On INRIA, our multi-stage system with unsupervised features achieves a miss rate of 13.1% at 1 FPPI (false positives per image), compared to 23.1% for the standard HOG+SVM baseline. This represents a relative improvement of 43%. On the more challenging Caltech benchmark, we achieve significant improvements over HOG across all occlusion levels and pedestrian scales. The two-stage architecture outperforms the single-stage variant by approximately 5% in miss rate, confirming the value of hierarchical feature learning.

Paragraph function: Quantitative validation; supports the core claim with concrete numbers on two benchmarks.

Logical role: The 43% relative improvement is strong evidence for the claim that learned features beat hand-crafted ones, and the ablation study validates the need for the multi-stage design.

Argument technique / potential weaknesses: The size of the improvement is impressive, but the comparison is only against the HOG+SVM baseline; stronger hand-crafted methods (such as ACF) and supervised deep learning methods are not included.
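
The 43% figure follows directly from the two miss rates quoted above. The short sketch below reproduces that arithmetic and shows how miss rate and FPPI are typically computed from detection counts; the function names are illustrative, and only the 13.1% and 23.1% values come from the text.

```python
def miss_rate(num_missed, num_ground_truth):
    # Fraction of ground-truth pedestrians that the detector failed to find.
    return num_missed / num_ground_truth

def false_positives_per_image(num_false_positives, num_images):
    # Average number of false detections per test image (FPPI).
    return num_false_positives / num_images

def relative_reduction(baseline, ours):
    # Relative improvement of `ours` over `baseline` for a lower-is-better metric.
    return (baseline - ours) / baseline

# Miss rates at 1 FPPI on INRIA, as reported in the text.
hog_svm = 23.1
multi_stage = 13.1

absolute_gain = hog_svm - multi_stage                   # about 10 percentage points
relative_gain = relative_reduction(hog_svm, multi_stage)
print(f"relative reduction in miss rate: {relative_gain:.1%}")
```

The distinction matters when reading the results: the absolute gap is about 10 percentage points, while the 43% is relative to the HOG+SVM baseline.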

We further analyze the learned filters and find they exhibit meaningful structure: first-stage filters resemble Gabor-like edge detectors at various orientations and frequencies, while second-stage filters capture more complex patterns like corners and junctions. This hierarchical emergence of feature complexity — from edges to parts — mirrors the design philosophy of hand-crafted feature hierarchies but arises automatically from the data. We also show that features learned from general natural images transfer well to the pedestrian detection task, suggesting strong generalizability.

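Paragraph function: Interpretability analysis; shows the semantic structure of the learned filters.

Logical role: Defends the method against the "black box" objection: the learned filters are not random but form a meaningful hierarchy, which strengthens credibility.

Argument technique / potential weaknesses: Filter visualization is an effective explanatory tool. The observation that learning automatically discovers structures resembling hand-designed ones is very powerful, but it can also be read as "learning merely rediscovered what people already knew", which weakens the case that the learning method is necessary.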

5. Conclusion

We have demonstrated that unsupervised multi-stage feature learning can significantly improve pedestrian detection compared to traditional hand-crafted features. Our convolutional architecture with PSD-learned filters achieves state-of-the-art results on standard benchmarks while requiring no labeled data for the feature learning stage. The learned features exhibit interpretable hierarchical structure and transfer across tasks. Our work provides strong evidence that the era of hand-crafted features in pedestrian detection may be drawing to a close, and that learned representations — whether unsupervised or supervised — are the path forward.

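Paragraph function: Summary and manifesto; declares the end of the hand-crafted feature era.

Logical role: Elevates the experimental results to the level of a paradigm shift, generalizing from pedestrian detection specifically to a judgment on the fundamental direction of learned versus hand-crafted features.

Argument technique / potential weaknesses: The declaration that the hand-crafted feature era is ending is bold and has been proven right by history. But the authors credit unsupervised learning, whereas what actually ended hand-crafted features was supervised deep learning (R-CNN appeared in the same year).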

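Overview of the Argument Structure

Problem: Hand-crafted features (HOG) limit detection performance.
→ Claim: Convolutional features learned without supervision are better.
→ Evidence: The miss rate on INRIA drops by 43% (relative).
→ Rebuttal: The learned filters show an interpretable hierarchical structure.
→ Conclusion: The era of hand-crafted features is drawing to a close.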

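Authors' core claim (one sentence)

Multi-stage convolutional features learned with unsupervised Predictive Sparse Decomposition significantly outperform hand-crafted features such as HOG on pedestrian detection, demonstrating the fundamental superiority of feature learning for visual detection.

Strongest point of the argument

A strictly controlled comparison: the detection pipeline (sliding window + SVM) is kept fixed and only the feature extractor is replaced, which makes the comparison between HOG and learned features very fair. Under these controls, the 43% relative improvement is hard to attribute to other factors. The multi-stage ablation further confirms the value of hierarchical features.

Weakest point of the argument

Awkward historical timing: the paper emphasizes unsupervised learning, but the arrival of R-CNN in the same year demonstrated the overwhelming advantage of fully supervised deep learning for detection. The unsupervised pre-training route quickly lost practical value as large-scale labeled datasets such as ImageNet became widespread. In addition, the comparison of computational cost is insufficient: inference with learned features costs far more than HOG.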