Spatially-Varying Autofocus

Abstract

Conventional autofocus systems in cameras assume a single global focal plane, which means that for scenes with objects at varying depths, only one depth layer can be in sharp focus at a time. This paper introduces Spatially-Varying Autofocus, a method that enables per-pixel focal control across the image sensor. By leveraging a programmable spatial light modulator (SLM) in the optical path, we can impose spatially-varying phase shifts on the incoming wavefront, effectively assigning different focal lengths to different image regions. Combined with a learned optimization framework, the system determines the optimal per-pixel phase pattern to maximize sharpness across all depth layers simultaneously. We demonstrate that our approach produces all-in-focus images from a single capture, outperforming traditional autofocus and existing computational photography methods on both synthetic and real scenes.

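Paragraph function: Overview of the whole paper. Starting from the limitation of conventional autofocus, it introduces the core innovation of spatially-varying autofocus and the experimental results.

Logical role: The abstract follows a three-part "limitation, solution, results" structure: it first exposes the fundamental problem of a global focal plane, then presents the SLM-based technique as the breakthrough, and closes with the experimental results.

Argumentative technique / potential weakness: The notion of "per-pixel focal control" is very appealing, but the SLM's actual spatial resolution and optical bandwidth may limit true per-pixel precision. Whether the abstract's wording is idealized needs to be checked against the Method section.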

1. Introduction

Autofocus is one of the most fundamental functions in modern cameras. Despite decades of advancement, current autofocus systems fundamentally operate under the assumption of a single focal plane. In multi-depth scenes, photographers must choose which depth to focus on, leaving other regions blurred. Focus stacking addresses this by capturing multiple images at different focal distances and merging them, but it requires multiple exposures and is susceptible to motion artifacts. Computational depth-of-field extension methods use coded apertures or phase masks to encode depth information, but they rely on post-capture deconvolution that introduces noise and artifacts.

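Paragraph function: Establishes the research landscape by systematically reviewing how focusing techniques have evolved and where each falls short.

Logical role: The starting point of the chain of argument. By ruling out the existing options one by one (single focal plane, focus stacking, computational depth of field), it establishes the need for a new kind of hardware-algorithm co-design.

Argumentative technique / potential weakness: The limitations of the three classes of methods are mapped precisely onto different dimensions (hardware limitation, multiple exposures, post-processing quality), so the reader naturally accepts that a wholly new paradigm is needed. However, alternatives such as light field cameras are not mentioned, which may omit an important comparison baseline.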

We propose Spatially-Varying Autofocus, which replaces the global focus mechanism with a spatially-adaptive one. The key hardware enabler is a spatial light modulator (SLM) placed in the pupil plane of the optical system. By programming pixel-wise phase patterns on the SLM, we effectively create a spatially-varying lens that focuses different image regions to different depths simultaneously. A learned optimization algorithm takes a depth estimate of the scene and computes the optimal phase pattern that maximizes the modulation transfer function (MTF) across all spatial frequencies and depth layers.
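To make the optimization target concrete: the MTF is the magnitude of the optical transfer function, which is the Fourier transform of the PSF. The NumPy sketch below illustrates only that relationship. The Gaussian PSF, the grid size, and the helper names `mtf` and `gaussian_psf` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def mtf(psf):
    """MTF = |OTF|, where the OTF is the 2-D Fourier transform of the PSF.
    Normalized so that the zero-frequency value is 1."""
    otf = np.fft.fft2(psf)
    return np.abs(otf) / np.abs(otf[0, 0])

def gaussian_psf(n, sigma):
    """Hypothetical stand-in PSF: an n x n Gaussian blur kernel with unit sum."""
    c = np.arange(n) - n // 2
    x, y = np.meshgrid(c, c)
    h = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return h / h.sum()

sharp = mtf(gaussian_psf(64, 1.0))    # narrow PSF (well focused)
blurry = mtf(gaussian_psf(64, 3.0))   # wide PSF (defocused)

# Index [0, 8] is the spatial frequency 8/64 cycles per pixel along one axis.
print(sharp[0, 8], blurry[0, 8])
```

A narrower PSF retains more contrast at a given spatial frequency, which is why maximizing the MTF is a reasonable proxy for sharpness.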

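Paragraph function: Presents the solution, giving a complete overview of the co-design of hardware (the SLM) and software (the learned optimization).

Logical role: Following on from the problem statement in the previous paragraph, this paragraph moves from "limitation" to "breakthrough": the SLM supplies the hardware capability, and the optimization algorithm supplies the software intelligence.

Argumentative technique / potential weakness: Using MTF maximization as the optimization objective is physically reasonable, but whether the SLM's switching rate, spatial resolution, and phase range are sufficient to support real-time per-pixel focus control is the key question for engineering feasibility.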

2. Related Work

Traditional autofocus methods, namely contrast-detection AF (CDAF) and phase-detection AF (PDAF), determine the optimal position of a single movable lens element. Multi-focus imaging captures a focal stack and merges it in post-processing, which requires temporal multiplexing and therefore precludes dynamic scenes. Wavefront coding with phase masks extends depth of field but trades spatial resolution for depth invariance. Computational photography approaches based on coded apertures or light field cameras capture richer information, yet face a tradeoff between spatial resolution and angular sampling. Recent neural optics methods co-optimize optical elements with reconstruction networks, but they typically design static optical elements rather than adaptive ones. Our approach uniquely combines adaptive wavefront modulation with learned per-scene optimization.
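As an illustration of the contrast-detection idea mentioned above, the sketch below scores each frame of a lens-position sweep with a Laplacian-energy focus measure and picks the sharpest one. The focus measure, the synthetic checkerboard target, the blur used as a stand-in for defocus, and all function names are assumptions chosen for illustration; production CDAF implementations differ.

```python
import numpy as np

def focus_measure(img):
    """Mean squared 4-neighbor Laplacian response; larger means more contrast."""
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(np.mean(lap ** 2))

def contrast_detect_af(frames):
    """Return the index of the frame (lens position) with the highest focus measure."""
    return int(np.argmax([focus_measure(f) for f in frames]))

def box_blur(img, passes):
    """Crude stand-in for defocus: repeated 5-point averaging with wraparound."""
    out = img.astype(float)
    for _ in range(passes):
        out = (out + np.roll(out, 1, 0) + np.roll(out, -1, 0)
               + np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 5.0
    return out

# Synthetic sweep: a checkerboard target that is sharpest at lens position 2.
yy, xx = np.indices((32, 32))
target = (((yy // 4) + (xx // 4)) % 2).astype(float)
sweep = [box_blur(target, p) for p in (4, 2, 0, 2, 4)]
best = contrast_detect_af(sweep)
print("sharpest lens position:", best)
```

Whatever focus measure is used, the result is a single lens position for the whole frame, which is the limitation the paragraph above attributes to conventional autofocus.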

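Paragraph function: Literature review covering five classes of related techniques, laying the groundwork for the method's distinct positioning.

Logical role: Uses "the core limitation of each class of method" as a unified critical framework, narrowing step by step to the paper's distinct contribution: adaptive plus learned.

Argumentative technique / potential weakness: The literature coverage is broad and the critique comes from several angles, but "adaptive wavefront modulation" is not entirely novel: SLMs are already widely used in astronomical adaptive optics. The authors need to explain the challenges specific to consumer photography.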

3. Method

Our optical system places a phase-only spatial light modulator at the pupil plane of a standard imaging lens. The SLM modulates the phase of the wavefront passing through each pupil location without affecting its amplitude. For a scene point at depth d, the generalized pupil function is P(u, v) = A(u, v) · exp(j[W(u, v; d) + φ(u, v)]), where (u, v) are pupil coordinates, A is the aperture function, W is the depth-dependent defocus aberration, and φ is the SLM phase. Under incoherent illumination, the point spread function (PSF) at image point (x, y) is the squared magnitude of the Fourier transform of P. By choosing φ(u, v) to locally cancel the defocus aberration W for different depths at different image locations, we achieve spatially-varying focus.
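The imaging model above can be sketched numerically. The example below is a minimal illustration, not the authors' implementation: it assumes a circular aperture, a quadratic (paraxial) defocus term W = w20 · (u² + v²) in radians, and an incoherent PSF computed as the squared magnitude of the Fourier transform of the pupil function. The grid size, padding factor, the value of `w20`, and all function names are hypothetical.

```python
import numpy as np

def pupil_grid(n=64):
    """Normalized pupil coordinates (u, v) on an n x n grid; aperture radius is 1."""
    c = np.linspace(-1.0, 1.0, n)
    return np.meshgrid(c, c)

def defocus_phase(u, v, w20):
    """Assumed quadratic defocus aberration W = w20 * (u^2 + v^2), in radians."""
    return w20 * (u ** 2 + v ** 2)

def psf(u, v, w, phi, pad=4):
    """Incoherent PSF: |FT{A * exp(j * (W + phi))}|^2, normalized to unit energy."""
    a = (u ** 2 + v ** 2 <= 1.0).astype(float)   # circular aperture A(u, v)
    p = a * np.exp(1j * (w + phi))               # generalized pupil function
    n = p.shape[0]
    field = np.fft.fftshift(np.fft.fft2(p, s=(pad * n, pad * n)))
    h = np.abs(field) ** 2
    return h / h.sum()

u, v = pupil_grid()
w = defocus_phase(u, v, w20=6.0)     # an out-of-focus scene point
zero = np.zeros_like(u)

blurred = psf(u, v, w, zero)         # SLM displays a flat phase
corrected = psf(u, v, w, -w)         # SLM phase cancels the defocus
in_focus = psf(u, v, zero, zero)     # reference: no defocus at all

print(blurred.max() < corrected.max())
```

Setting φ = −W restores the diffraction-limited PSF for that depth, which is the cancellation the paragraph describes. How the cancellation is made to differ across image locations is not specified in this summary and is not modeled here.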

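Paragraph function: First step of the method derivation, establishing the mathematical basis of the optical imaging model.

Logical role: This paragraph is the physical foundation of the whole method. In the language of Fourier optics, it describes precisely how the SLM affects image formation through phase modulation. The core insight is that the phase φ can locally cancel the defocus aberration W.

Argumentative technique / potential weakness: The mathematical derivation is solid, but it assumes ideal phase-only modulation; a real SLM may exhibit amplitude coupling, pixel crosstalk, and limited phase resolution. In addition, locally canceling defocus presupposes that the scene depth is known, which introduces a dependence on the accuracy of the depth estimate.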

3.2 Phase Pattern Optimization

Given a depth map of the scene (obtained from a depth sensor or monocular estimation), we formulate the phase pattern optimization as maximizing the integrated MTF across all image regions. The objective is: max_φ ∑_(x,y) MTF(x, y; d(x,y), φ), where d(x,y) is the depth at pixel (x, y). We solve this using a differentiable optical forward model combined with gradient-based optimization. The forward model simulates image formation through the SLM, and gradients with respect to φ are computed via automatic differentiation. The entire pipeline — from depth input to optimized phase pattern — runs in under 50 milliseconds on a modern GPU, enabling real-time operation.
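A toy version of this optimization loop can be sketched as follows. It is a heavily simplified stand-in under stated assumptions, not the paper's pipeline: each image region is given one quadratic phase coefficient, the on-axis PSF intensity (a Strehl-like score) replaces the integrated MTF, and central finite differences replace automatic differentiation. The defocus values, learning rate, and all names are hypothetical.

```python
import numpy as np

# Samples of the squared pupil radius r^2; uniform in r^2 is uniform in disk area.
_R2 = (np.arange(256) + 0.5) / 256

def sharpness(residual):
    """On-axis PSF intensity for a residual phase residual * r^2 (1.0 = in focus)."""
    phase = np.multiply.outer(residual, _R2)
    return np.abs(np.exp(1j * phase).mean(axis=-1)) ** 2

def optimize_phase(defocus_map, lr=3.0, steps=200, eps=1e-4):
    """Gradient ascent on sum over regions of sharpness(defocus + c).

    The objective is separable across regions, so the elementwise finite
    difference below is the partial derivative with respect to each c.
    """
    c = np.zeros_like(defocus_map, dtype=float)
    for _ in range(steps):
        a = defocus_map + c
        grad = (sharpness(a + eps) - sharpness(a - eps)) / (2.0 * eps)
        c += lr * grad
    return c

# Hypothetical per-region defocus coefficients (radians) derived from a depth map.
defocus_map = np.array([[3.0, -2.0], [0.5, 4.0]])
c = optimize_phase(defocus_map)
print(np.round(c, 3))
```

In this toy setting the optimizer simply recovers c ≈ −defocus for every region. A real forward model would couple neighboring regions through the optics, which is where a differentiable simulator and automatic differentiation become useful.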

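Paragraph function: The core algorithm, describing how differentiable optimization is used to solve for the optimal phase pattern.

Logical role: This paragraph turns the physical model of the previous paragraph into a computable optimization problem, completing the leap from "principle" to "implementation". The 50-millisecond latency figure directly answers the implicit requirement of real-time operation.

Argumentative technique / potential weakness: Differentiable optical models are a recent trend in computational photography, and their use here is appropriate. However, MTF maximization may produce unnatural jumps in sharpness in some depth-transition regions. Moreover, although 50 milliseconds is fast, it is worth confirming whether the overall latency still meets real-time requirements once the time for depth estimation is added.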

4. Experiments

We evaluate our method on synthetic scenes with ground-truth depth and on real scenes captured with a prototype built around a liquid-crystal SLM. On synthetic benchmarks, our approach achieves significantly higher PSNR and SSIM than fixed-focus, focus stacking (with 3 and 5 captures), and coded aperture methods. On real scenes, we demonstrate all-in-focus results from single captures of indoor tabletop scenes with depth ranges spanning 0.3 m to 2 m. The method handles both static and dynamic scenes, with the single-capture advantage being particularly pronounced for the latter. Ablation studies confirm the importance of the learned optimization over hand-crafted phase patterns.
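For readers unfamiliar with the metrics, PSNR can be computed as in the sketch below. This is the standard definition rather than the authors' evaluation code; the function name, the assumed [0, 1] intensity range, and the toy inputs are illustrative. SSIM is more involved and is not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)     # uniform error of 0.1, so MSE = 0.01
score = psnr(reference, estimate)   # approximately 20 dB for this toy input
print(round(score, 6))
```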

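Paragraph function: Provides comprehensive experimental evidence, covering synthetic and real scenes, comparisons against multiple baselines, and ablation studies.

Logical role: The empirical pillar. It validates quantitative performance (PSNR/SSIM) and qualitative results (real captures) together, as well as the distinct advantage of single capture in dynamic scenes.

Argumentative technique / potential weakness: The demonstration on a real prototype greatly strengthens the case, but the experimental scenes are limited to indoor tabletops (0.3 m to 2 m), and applicability to outdoor scenes with a large depth range (such as landscape photography) is not validated. The size, cost, and power consumption of the SLM prototype are also not discussed.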

5. Conclusion

Spatially-Varying Autofocus represents a paradigm shift from global to per-pixel focal control. By combining an SLM-based adaptive optical element with differentiable optimization, we demonstrate that all-in-focus imaging from a single capture is achievable in real time. This opens up new possibilities for computational cameras that actively co-design optics and algorithms on a per-scene basis. Future work includes miniaturizing the SLM module for integration into mobile devices and extending the approach to temporally consistent video capture.

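Paragraph function: Summarizes the paper, restating the core contribution and looking ahead to future directions.

Logical role: The conclusion positions the paper's contribution with the sweeping language of a "paradigm shift" and shows the continuity of the research through concrete future directions (miniaturization, video).

Argumentative technique / potential weakness: The "paradigm shift" claim is ambitious, but the distance from a laboratory prototype to a consumer product remains very large: the SLM's cost, power consumption, and size are the main obstacles to practical deployment. The prospect of integration into mobile devices would require order-of-magnitude technical advances.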

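Argument Structure Overview

Problem: Conventional autofocus supports only a single global focal plane
→ Claim: An SLM enables spatially-varying, per-pixel focal control
→ Evidence: Large PSNR/SSIM lead on synthetic and real scenes
→ Rebuttal: Differentiable optimization solves in real time (50 ms)
→ Conclusion: Single-capture all-in-focus imaging, a new paradigm for computational photography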

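The Authors' Core Claim (in One Sentence)

By introducing a programmable spatial light modulator into the optical path and solving for the per-pixel phase pattern with differentiable optimization, sharp imaging across the full scene depth can be achieved from a single capture.

Strongest Point of the Argument

Completeness of the hardware-software co-design: starting from the mathematical model of wave optics, moving to the hardware realization with an SLM, and then to the algorithm design based on differentiable optimization, the paper forms a complete closed loop from theory to implementation. The demonstration on a real prototype further carries the paper from concept into practice.

Weakest Point of the Argument

Concerns about practicality and scalability: the experimental scenes are limited to close-range indoor tabletops, and applicability to outdoor scenes with a large depth range is not validated. The size, cost, and power consumption of the SLM prevent integration into consumer cameras in the short term, which limits the method's practical impact.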