Minimalist Vision with Freeform Pixels

Abstract

A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While a traditional camera uses a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. The hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera's freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels).

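Paragraph function: Overview of the whole paper. It defines the core concept of "minimalist vision" in concise terms and previews three application scenarios as empirical support.

Logical role: As the entry point to the paper, the abstract first establishes the core claim that a task can be solved with a minimal number of pixels, then explains the method through neural network modeling and hardware implementation, and closes with three 8-pixel applications.

Argument technique / potential weakness: The figure of "8 pixels" is striking and creates a sharp contrast; readers naturally wonder how so few pixels can solve a complex task. This is an effective way to direct attention, but it may also invite doubts about generality.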

1. Introduction

Modern cameras capture images with millions of square pixels arranged in a dense grid. This design philosophy assumes that high-resolution images are always necessary for visual understanding. However, for many practical vision tasks, most of the captured information is redundant. A surveillance camera monitoring whether a room is occupied does not need megapixel resolution. A sensor measuring ambient lighting does not need to capture a detailed image of the room. This observation motivates us to ask: what is the minimum number of measurements needed to solve a given vision task?

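Paragraph function: Challenges the prevailing paradigm by questioning the common assumption that high-resolution images are necessary.

Logical role: The starting point of the chain of argument. It describes the status quo (millions of pixels), points out the redundancy problem, and closes with the core research question, leading the reader into the minimalist-vision framing.

Argument technique / potential weakness: Everyday scenarios (surveillance, lighting) lead the reader to agree intuitively that so many pixels are unnecessary, which is effective argument by example. However, whether information is "redundant" depends heavily on how the task is defined; for a slightly more complex task the conclusion may differ.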

We introduce the concept of freeform pixels — pixels that are not constrained to be small squares arranged in a grid, but can instead take on arbitrary spatial shapes. By allowing each pixel to integrate light over a carefully designed region of the scene, a single freeform pixel can capture far more task-relevant information than a single square pixel. The key insight is that the shape of a freeform pixel acts as a spatial filter — it determines what combination of scene content contributes to the pixel's measurement. By optimizing these shapes for a specific task, we can design cameras that achieve high performance with an extremely small number of pixels.

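Paragraph function: Defines the core concept by formally introducing the "freeform pixel" and its theoretical basis.

Logical role: Transitions from the question posed in the previous paragraph (what is the minimum number of measurements?) to the solution: freeform pixel = spatial filter, shape optimization = task adaptation.

Argument technique / potential weakness: The "spatial filter" analogy is apt, both accurate and easy to grasp. However, the computational cost and convergence of shape optimization are not discussed here, and these could be challenges in practical deployment.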

2. Method

We model the entire minimalist vision system as a neural network. The first layer of the network represents the camera hardware: each neuron in this layer corresponds to one freeform pixel, and its weights encode the pixel's spatial sensitivity pattern — effectively, the shape of its optical mask. The subsequent layers form a standard inference network that maps the pixel measurements to the desired output (e.g., room occupancy count, lighting parameters, or traffic density). During training, both the pixel shapes and the inference network weights are optimized jointly using backpropagation, ensuring that the hardware and software are co-designed for optimal task performance.

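Paragraph function: Presents the core method, recasting the hardware design problem as learning the weights of the first layer of a neural network.

Logical role: The central paragraph of the methodology. It establishes the "hardware as the first network layer" paradigm, so that shape optimization can use standard deep learning tools.

Argument technique / potential weakness: Abstracting physical hardware as a neural network layer is an elegant design that makes end-to-end optimization possible. In actual fabrication, however, the learned continuous shapes must be discretized, and the resulting quantization error may affect performance.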
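
To make the "camera as first layer" idea concrete, here is a minimal NumPy sketch of joint training. It is an illustration under stated assumptions, not the authors' implementation. The names `init_params`, `forward`, and `train_step`, the two-layer inference network, the squared-error loss, the normalization by scene size, and the clipping of mask values to [0, 1] are all hypothetical choices made for this example.

```python
import numpy as np

def init_params(n_scene, n_pixels, n_hidden, n_out, rng):
    """Hypothetical layout: freeform-pixel masks followed by a two-layer inference net."""
    return {
        "masks": rng.uniform(0.0, 1.0, size=(n_pixels, n_scene)),  # one row per freeform pixel
        "W1": rng.normal(0.0, 0.5, size=(n_hidden, n_pixels)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }

def forward(params, scenes):
    """scenes: (batch, n_scene) flattened scene brightness. Returns outputs and a cache."""
    n_scene = scenes.shape[1]
    # Camera layer: each pixel integrates the scene through its mask.
    m = scenes @ params["masks"].T / n_scene
    h_pre = m @ params["W1"].T + params["b1"]
    h = np.maximum(h_pre, 0.0)  # ReLU
    y = h @ params["W2"].T + params["b2"]
    return y, (m, h_pre, h)

def train_step(params, scenes, targets, lr=0.05):
    """One gradient step on squared error; masks and inference weights update together."""
    n, n_scene = scenes.shape
    y, (m, h_pre, h) = forward(params, scenes)
    err = y - targets
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    # Backpropagation through the inference layers down to the masks.
    dy = 2.0 * err / n
    dh_pre = (dy @ params["W2"]) * (h_pre > 0)
    dm = dh_pre @ params["W1"]
    grads = {
        "W2": dy.T @ h,
        "b2": dy.sum(axis=0),
        "W1": dh_pre.T @ m,
        "b1": dh_pre.sum(axis=0),
        "masks": dm.T @ scenes / n_scene,
    }
    for name, grad in grads.items():
        params[name] = params[name] - lr * grad
    # Assumption: a passive optical mask can only attenuate light, so keep values in [0, 1].
    params["masks"] = np.clip(params["masks"], 0.0, 1.0)
    return loss
```

Calling `train_step` repeatedly on a batch of flattened scenes and task labels updates the masks and the inference weights in the same loop, which is the co-design the paragraph describes. A real system would use an autodiff framework and a task-specific loss.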

Each freeform pixel is physically realized using a photodetector paired with an optical mask. The mask is a binary or grayscale transparency pattern placed in the image plane that determines which parts of the scene contribute to the photodetector's measurement. The mask pattern is derived directly from the learned weights of the corresponding neuron in the first layer. For binary masks, we apply a thresholding operation during fabrication; for grayscale masks, halftone printing techniques are used. This hardware implementation is low-cost and compact, requiring only standard photodetectors and printed masks.

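Paragraph function: Explains the hardware implementation details, that is, the conversion from learned weights to physical masks.

Logical role: Complements the theoretical framework of the previous paragraph by showing that the method is realizable, grounding abstract network weights in concrete optical components.

Argument technique / potential weakness: Emphasizing "low-cost and compact" strengthens the case for practicality. However, the information lost to binarization by thresholding is not quantified and may significantly affect performance on some tasks.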
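
The conversion from learned weights to a printable mask can be sketched as below. The text names thresholding and halftone printing but gives no procedure. The min-max normalization, the 0.5 threshold, and the use of 4x4 ordered (Bayer) dithering as the halftoning step are therefore assumptions for illustration. The function names are hypothetical.

```python
import numpy as np

# Standard 4x4 Bayer index matrix used for ordered dithering.
BAYER_4 = np.array([[0, 8, 2, 10],
                    [12, 4, 14, 6],
                    [3, 11, 1, 9],
                    [15, 7, 13, 5]])

def normalize_weights(weights):
    """Rescale learned weights to a transmittance in [0, 1] (assumed convention)."""
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    if span == 0:
        return np.zeros_like(w)
    return (w - w.min()) / span

def binary_mask(weights, threshold=0.5):
    """Binary mask: 1 where light passes, 0 where it is blocked."""
    return (normalize_weights(weights) >= threshold).astype(np.uint8)

def halftone_mask(weights):
    """Approximate a grayscale mask with binary dots via 4x4 ordered dithering.

    Each weight becomes a 4x4 cell; the fraction of open dots in the cell
    approximates the weight's normalized transmittance.
    """
    t = normalize_weights(weights)
    upsampled = np.kron(t, np.ones((4, 4)))
    thresholds = (np.tile(BAYER_4, t.shape) + 0.5) / 16.0
    return (upsampled > thresholds).astype(np.uint8)
```

Comparing the task accuracy of `binary_mask` output against the continuous weights would be one way to measure the thresholding loss raised in the note above.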

3. Freeform Pixel Design

The shape optimization of freeform pixels can be understood through the lens of information theory. A square pixel in a traditional camera captures a uniform spatial average over a small, fixed region. In contrast, a freeform pixel captures a weighted spatial average over a potentially large, task-optimized region. The information content of a freeform pixel is therefore determined not just by signal-to-noise ratio, but also by how well its spatial sensitivity pattern aligns with the task-relevant features of the scene. Our optimization procedure naturally discovers pixel shapes that act as matched filters for the most informative spatial patterns.

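Paragraph function: Provides a theoretical explanation, using information theory to show why freeform pixels outperform square pixels.

Logical role: Builds a theoretical bridge between the method (how it is done) and the experiments (how well it works), explaining why it works.

Argument technique / potential weakness: The "matched filter" concept neatly connects communication theory with vision system design. However, this analysis assumes that scene statistics are known and stable; performance on out-of-distribution scenes remains to be tested.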
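
The matched-filter intuition can be checked numerically in a toy setting. The sketch assumes a known task-relevant pattern `signal` and independent, equal-variance background fluctuations at every scene location. That noise model is an assumption made for this example, not one stated in the text. Under it, the signal-to-noise ratio of a pixel with mask w is (w·s)² / (σ² ‖w‖²), which the Cauchy-Schwarz inequality shows is largest when the mask is proportional to the pattern.

```python
import numpy as np

def pixel_snr(mask, signal, noise_std):
    """SNR of one pixel measurement under additive white scene noise (assumed model)."""
    mask = np.asarray(mask, dtype=float).ravel()
    signal = np.asarray(signal, dtype=float).ravel()
    return float((mask @ signal) ** 2 / (noise_std ** 2 * (mask @ mask)))

# Toy 1-D scene: the task-relevant pattern is a ramp over locations 10..29.
signal = np.zeros(64)
signal[10:30] = np.linspace(0.2, 1.0, 20)

square_pixel = np.zeros(64)
square_pixel[18:22] = 1.0          # small uniform window, like a conventional pixel
matched_pixel = signal.copy()      # freeform mask shaped like the pattern

snr_square = pixel_snr(square_pixel, signal, noise_std=0.1)
snr_matched = pixel_snr(matched_pixel, signal, noise_std=0.1)
```

Here `snr_matched` equals ‖s‖²/σ², the upper bound for any mask, while the small square window captures only part of the pattern. Under a different noise model, such as noise added at the detector, the optimal shape would differ.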

An important property of our approach is that it naturally tends to preserve the privacy of individuals in the scene. Because the minimalist camera captures only 8 highly abstracted measurements rather than a detailed image, it is fundamentally incapable of recording identifiable visual information about people. This is a significant advantage for applications such as smart buildings, occupancy counting, and ambient intelligence, where privacy concerns have historically been a major barrier to adoption. The privacy preservation is inherent to the hardware design rather than relying on software-based anonymization, making it robust against adversarial attacks.

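Paragraph function: Presents an additional advantage, hardware-level privacy protection, and preempts potential ethical concerns.

Logical role: Goes beyond the technical level to connect minimalist vision with a societal need (privacy), strengthening the work's application value and motivation.

Argument technique / potential weakness: Turning a limitation (the inability to capture detailed images) into an advantage (privacy protection) is a clever reversal. Note, however, that even 8 measurements could in principle leak some location or behavioral information.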
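
One way to see why 8 measurements cannot carry a detailed image is a linear-algebra sketch, offered as an illustration rather than a formal privacy guarantee. If the camera is modeled as a matrix of 8 masks applied to a flattened scene, the minimum-norm linear reconstruction lies in the 8-dimensional row space of that matrix, however many scene locations there are. The 16x16 scene size and the function name are hypothetical.

```python
import numpy as np

def min_norm_reconstruction(masks, measurements):
    """Least-norm scene estimate consistent with the measurements (via pseudoinverse)."""
    return np.linalg.pinv(masks) @ measurements

rng = np.random.default_rng(0)
masks = rng.uniform(0.0, 1.0, size=(8, 256))   # 8 freeform pixels over a 16x16 scene
scene = rng.uniform(0.0, 1.0, size=256)
measurements = masks @ scene
estimate = min_norm_reconstruction(masks, measurements)

# The reconstruction operator is a projection onto an 8-dimensional subspace.
projector = np.linalg.pinv(masks) @ masks
subspace_dim = np.linalg.matrix_rank(projector)
```

The estimate reproduces the 8 measurements exactly but has only 8 degrees of freedom. This matches the claim that detailed imagery is not recorded, while leaving open what coarse information those 8 numbers may still reveal.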

4. Experiments

We evaluate the minimalist camera on three vision tasks: indoor occupancy monitoring, room lighting estimation, and traffic flow measurement. For each task, we compare our approach against baselines including conventional cameras with downsampled images (to match the number of pixels) and random pixel configurations. In the occupancy monitoring task, our 8-pixel minimalist camera achieves 97.5% accuracy in distinguishing between 0-5 occupants, compared to 82.3% for an 8-pixel downsampled conventional image and 68.1% for random pixel masks. This demonstrates the critical importance of task-optimized pixel shapes versus naive spatial sampling.

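Paragraph function: Provides the core empirical evidence, validating the method with quantitative results across three tasks.

Logical role: Moves from the theoretical claim (freeform pixels are more effective) to empirical support, quantifying the performance gap with concrete numbers.

Argument technique / potential weakness: The comparison of 97.5% versus 82.3% is convincing. However, only three relatively simple tasks are tested, so generalization to more complex vision tasks (such as pose estimation or scene understanding) remains unclear.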
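
The two baselines named in this section, an image downsampled to 8 pixels and random pixel masks, can be expressed in the same mask-matrix form as the freeform pixels. This allows all three to feed the same inference network. The sketch below is hypothetical: the 2x4 block layout, the binary random masks, and the normalization so that each mask sums to 1 are assumptions, since the text does not specify them.

```python
import numpy as np

def block_masks(height, width, rows=2, cols=4):
    """Downsampling baseline: rows*cols uniform rectangular pixels tiling the scene."""
    masks = np.zeros((rows * cols, height, width))
    row_edges = np.linspace(0, height, rows + 1).astype(int)
    col_edges = np.linspace(0, width, cols + 1).astype(int)
    for r in range(rows):
        for c in range(cols):
            block = masks[r * cols + c,
                          row_edges[r]:row_edges[r + 1],
                          col_edges[c]:col_edges[c + 1]]
            block[...] = 1.0 / block.size  # uniform average over the block
    return masks.reshape(rows * cols, -1)

def random_masks(n_pixels, height, width, rng):
    """Random baseline: each scene location is open or closed with equal probability."""
    masks = rng.integers(0, 2, size=(n_pixels, height * width)).astype(float)
    return masks / np.maximum(masks.sum(axis=1, keepdims=True), 1.0)

def measure(masks, scene):
    """Pixel measurements: one weighted average of the scene per mask."""
    return masks @ np.asarray(scene, dtype=float).ravel()
```

With this interface, swapping `block_masks` or `random_masks` for the learned masks while keeping the inference network fixed in architecture isolates the effect of pixel shape, which is the comparison the section reports.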

For the lighting estimation task, the minimalist camera must estimate the spatial distribution of light sources and their intensities in a room from only 8 measurements. Our optimized freeform pixels achieve a mean angular error of 4.2 degrees for light direction and 8.7% relative error for intensity, significantly outperforming downsampled images (11.8 degrees, 19.3%). The learned pixel shapes reveal that the system has discovered directional sensitivity patterns that act as spatial probes for different regions of the room. For traffic flow estimation, the minimalist camera achieves a mean absolute error of 2.1 vehicles per minute from only 8 freeform pixels on a busy highway, with self-powered operation enabled by the minimal sensor requirements.

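Paragraph function: Extends the empirical evidence with quantitative results for lighting estimation and traffic flow.

Logical role: Validation on multiple tasks strengthens the claim of generality, and the possibility of self-powered operation further underscores practical value.

Argument technique / potential weakness: Mentioning self-powered operation suggests Internet of Things applications. However, robustness under different weather and lighting conditions deserves further validation.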
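
For reference, the three error measures quoted in this section (mean angular error, relative error, and mean absolute error) can be computed as in the sketch below. These are the usual textbook definitions. The text does not spell out its exact evaluation protocol, so the function names and input conventions here are assumptions.

```python
import numpy as np

def mean_angular_error_deg(pred_dirs, true_dirs):
    """Mean angle in degrees between predicted and true direction vectors (one per row)."""
    p = np.asarray(pred_dirs, dtype=float)
    t = np.asarray(true_dirs, dtype=float)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    cos = np.clip(np.sum(p * t, axis=1), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.degrees(np.arccos(cos)).mean())

def mean_relative_error(pred, true):
    """Mean of |pred - true| / |true|; assumes no true value is zero."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true) / np.abs(true)))

def mean_absolute_error(pred, true):
    """Mean of |pred - true|, e.g. in vehicles per minute for traffic flow."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))
```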

5. Conclusion

We have presented minimalist vision — a new paradigm for visual sensing that uses the fewest possible pixels to perform a task. By introducing freeform pixels with task-optimized arbitrary shapes and modeling the camera as the first layer of a neural network, we enable end-to-end co-design of hardware and software. Our results on three practical applications demonstrate that 8 freeform pixels can match or approach the performance of conventional cameras with orders of magnitude more pixels. Beyond performance, our approach offers inherent advantages in privacy preservation, energy efficiency, and form factor. We believe minimalist vision opens a new direction in computational imaging, where the goal is not to capture as much visual information as possible, but to capture precisely the information needed for the task at hand.

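Paragraph function: Summarizes the paper, restating the core contributions and looking ahead to future directions.

Logical role: Positions the work as a "new paradigm," elevating the technical contribution to a shift in thinking: from "capture everything" to "capture precisely what is needed."

Argument technique / potential weakness: The "new paradigm" claim is bold, though it is backed by Best Paper recognition. Open challenges include whether multi-task settings require different pixel shapes, and how the approach adapts to dynamic scenes.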

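Argument structure overview

Problem: massive redundancy among millions of pixels
→ Claim: freeform pixels can greatly reduce the number of pixels needed
→ Method: camera hardware = first layer of a neural network
→ Evidence: 8 pixels reach 97.5% accuracy
→ Conclusion: minimalist vision is a new paradigm for computational imaging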

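Core claim

Through task-optimized shape design of freeform pixels, camera hardware and the inference network can be co-optimized end to end, so that practical vision tasks are solved with very few pixels (8).

Strongest point of the argument

Reaching 97.5% occupancy-monitoring accuracy with 8 pixels, together with privacy protection built into the hardware, shows impressive engineering value and social significance.

Weakest point of the argument

Only three relatively simple tasks are validated, and scalability to more complex vision tasks is not explored. In addition, practical deployment issues such as shape quantization error and adaptation to dynamic scenes are not fully discussed.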