Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields

Abstract — 摘要

The rendering procedure used by neural radiance fields (NeRF) samples a scene with single 3D points along each ray and may therefore produce renderings that are excessively blurred or aliased when training or testing images observe scene content at different resolutions. The solution presented in this paper, called mip-NeRF (a NeRF variant inspired by mipmapping), casts cones instead of rays and renders anti-aliased conical frustums instead of points. By efficiently approximating the conical frustums with multivariate Gaussians and computing integrated positional encoding (IPE), mip-NeRF addresses the aliasing issue while being 7% faster than NeRF, using half the parameters. Mip-NeRF reduces average error rates by 17% on single-scale and 60% on multiscale benchmarks compared to NeRF.

神經輻射場（NeRF）所使用的渲染程序以單一三維點沿每條射線取樣場景，因此當訓練或測試影像在不同解析度下觀察場景內容時，可能產生過度模糊或混疊的渲染結果。本文提出的解決方案稱為 mip-NeRF（受多級漸遠紋理映射啟發的 NeRF 變體），以圓錐體取代射線，渲染抗混疊的圓錐截體而非點。透過以多變量高斯函數高效逼近圓錐截體並計算整合位置編碼（IPE），mip-NeRF 在解決混疊問題的同時比 NeRF 快 7%，參數量僅為其一半。在單尺度基準上，mip-NeRF 將平均誤差率降低 17%；在多尺度基準上則降低 60%。

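Paragraph function: Overview of the whole paper. It starts from NeRF's aliasing defect and uses the mipmapping analogy to introduce the design idea behind mip-NeRF.

Logical role: The abstract does three jobs at once: (1) it defines the problem (point sampling causes aliasing), (2) it proposes the solution (conical frustums plus IPE), and (3) it quantifies the results (17% and 60% error reductions, a 7% speedup, and a 50% parameter reduction).

Argumentative technique / potential weakness: Naming the method after a classic computer graphics concept (mipmapping) and arguing by analogy greatly lowers the barrier to entry for readers. The phrase "cones instead of rays" could suggest a higher computational cost, but the authors pre-empt that concern with the "faster and smaller" figures.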

1. Introduction — 緒論

Neural radiance fields (NeRF) represent scenes as continuous volumetric functions parameterized by neural networks, mapping 3D coordinates and viewing directions to color and density. NeRF has demonstrated remarkable quality for novel view synthesis. However, NeRF's rendering procedure queries the scene representation at individual 3D points along each ray, which means that the same MLP is used to represent scene content at all scales. This lack of scale awareness causes aliasing artifacts: when an image observes a distant region, a single point query does not capture the fact that the corresponding pixel covers a large volume of 3D space. While supersampling (casting multiple rays per pixel) can reduce aliasing, it is prohibitively expensive for NeRF.

神經輻射場（NeRF）將場景表示為以神經網路參數化的連續體積函數，將三維座標與觀看方向映射至顏色與密度。NeRF 在新視角合成上展現了卓越的品質。然而，NeRF 的渲染程序在每條射線上的個別三維點查詢場景表示，意味著同一個 MLP 被用於表示所有尺度的場景內容。這種缺乏尺度感知的問題導致混疊偽影：當影像觀察遠處區域時，單一點查詢無法捕捉對應像素實際覆蓋了三維空間中大量體積的事實。雖然超取樣（每像素投射多條射線）可減少混疊，但對 NeRF 而言其計算成本高得令人望而卻步。

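Paragraph function: Establishes the research setting. It starts from NeRF's success and pinpoints its fundamental flaw, a blindness to scale.

Logical role: The starting point of the chain of argument. It first affirms NeRF's quality (background), then frames the problem through the limits of point queries, and finally rules out brute-force supersampling, clearing the ground for a more elegant alternative.

Argumentative technique / potential weakness: The intuitive explanation that one pixel covers a large volume of 3D space is vivid and makes the abstract aliasing problem concrete. However, the problem arises mainly in multiscale scenes; for a single-scale indoor scene, aliasing may not be severe.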

In traditional computer graphics, anti-aliasing is addressed through two approaches: supersampling (casting multiple rays per pixel) and prefiltering (using multiscale representations such as mipmaps). Prefiltering is computationally efficient because "filtered versions of scene content can be precomputed ahead of time." However, in the context of view synthesis, the scene geometry is unknown at test time, making precomputation infeasible. Coordinate-based neural representations like NeRF use positional encoding to enable MLPs to represent high-frequency content, but standard positional encoding operates on individual points and has no notion of the volume being integrated over.

在傳統電腦圖學中，抗混疊透過兩種方法處理：超取樣（每像素投射多條射線）與預濾波（使用多級漸遠紋理映射等多尺度表示）。預濾波在計算上效率較高，因為「場景內容的濾波版本可以預先計算」。然而在視角合成的脈絡中，場景幾何在測試時是未知的，使得預計算不可行。NeRF 等基於座標的神經表示使用位置編碼讓 MLP 能表示高頻內容，但標準位置編碼在個別點上操作，對被積分的體積毫無概念。

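Paragraph function: Positions the work in the literature by connecting the anti-aliasing tradition of computer graphics to the new challenges of neural fields.

Logical role: Frames the problem space with the supersampling versus prefiltering dichotomy. Ruling out supersampling (too slow) and traditional prefiltering (requires precomputation) paves the way for learned prefiltering, that is, IPE.

Argumentative technique / potential weakness: The cross-disciplinary reference to mipmapping from graphics gives the method a solid theoretical footing. Describing positional encoding as having "no notion" of volume is somewhat harsh, though: in practice, NeRF's multiple samples along each ray already provide some degree of regional awareness.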

3. Method — 方法

3.1 Cone Tracing — 圓錐追蹤

Instead of casting infinitesimally thin rays as in NeRF, mip-NeRF casts cones from each pixel, where the cone's radius at the image plane corresponds to the pixel's footprint. Along each cone, the scene is sampled by dividing the cone into conical frustums — truncated cone segments between consecutive depth values [t_0, t_1]. Each frustum is then approximated by a multivariate Gaussian with mean and covariance derived from the frustum geometry. The mean position along the ray includes a correction term: mu_t = t_mu + 2*t_mu*t_delta^2 / (3*t_mu^2 + t_delta^2), and the variance components sigma_t^2 (along-ray) and sigma_r^2 (cross-ray) capture the extent of the frustum in both directions.

mip-NeRF 不像 NeRF 那樣投射無限細的射線，而是從每個像素投射圓錐體，圓錐在影像平面上的半徑對應於像素的足跡。沿每個圓錐，場景透過將圓錐分割為圓錐截體——在連續深度值 [t_0, t_1] 之間截斷的圓錐段——來取樣。每個截體隨後以多變量高斯函數逼近，其均值與協方差由截體幾何推導而來。沿射線方向的均值位置包含一修正項：mu_t = t_mu + 2*t_mu*t_delta^2 / (3*t_mu^2 + t_delta^2)，而沿射線變異量 sigma_t^2 與橫截射線變異量 sigma_r^2 捕捉了截體在兩個方向上的範圍。

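Paragraph function: First core step of the method. Replaces point sampling with conical frustums and sets up the Gaussian approximation.

Logical role: The shift from points to volumes is the paper's key innovation. The Gaussian approximation makes a closed-form positional encoding possible and avoids the high cost of numerical integration.

Argumentative technique / potential weakness: Approximating a frustum with a Gaussian is an elegant mathematical simplification, but the accuracy of the Gaussian symmetry assumption differs between elongated frustums (near-range samples) and flat frustums (far-range samples). The authors partly mitigate this with the correction term.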
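The frustum-to-Gaussian step in Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes t_mu = (t_0 + t_1) / 2 and t_delta = (t_1 - t_0) / 2, which the text above does not define. The two variance formulas are not given above either and are written here as recalled from the paper. The names `frustum_gaussian`, `exact_frustum_moments`, and `base_radius` (the cone radius at unit distance) are hypothetical. The second function recomputes the same moments directly from integrals over a cone whose cross-section grows as t^2, as a sanity check.

```python
def frustum_gaussian(t0, t1, base_radius):
    """Moments of a Gaussian fitted to the conical frustum spanning [t0, t1].

    base_radius is a hypothetical name for the cone radius at unit distance.
    Returns (mu_t, var_t, var_r): mean along the ray, variance along the ray,
    and variance perpendicular to the ray.
    """
    t_mu = (t0 + t1) / 2.0      # assumed definition: interval midpoint
    t_delta = (t1 - t0) / 2.0   # assumed definition: interval half-width
    denom = 3.0 * t_mu**2 + t_delta**2
    # Mean with the correction term quoted in the text.
    mu_t = t_mu + 2.0 * t_mu * t_delta**2 / denom
    # Variances as recalled from the paper (not stated in this summary).
    var_t = (t_delta**2 / 3.0
             - (4.0 / 15.0) * t_delta**4 * (12.0 * t_mu**2 - t_delta**2) / denom**2)
    var_r = base_radius**2 * (t_mu**2 / 4.0
                              + 5.0 * t_delta**2 / 12.0
                              - (4.0 / 15.0) * t_delta**4 / denom)
    return mu_t, var_t, var_r


def exact_frustum_moments(t0, t1, base_radius):
    """Same moments from direct integration; the density along t is proportional to t**2."""
    norm = (t1**3 - t0**3) / 3.0
    e_t = ((t1**4 - t0**4) / 4.0) / norm
    e_t2 = ((t1**5 - t0**5) / 5.0) / norm
    # A uniform disk of radius R has per-axis variance R**2 / 4, with R = base_radius * t.
    return e_t, e_t2 - e_t**2, base_radius**2 * e_t2 / 4.0


mu_t, var_t, var_r = frustum_gaussian(1.0, 3.0, 0.01)
```

For the interval [1, 3], the corrected mean is 2 + 4/13, which lies beyond the midpoint 2 because the wider far end of the frustum holds more of its volume.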

3.2 Integrated Positional Encoding (IPE) — 整合位置編碼

The key technical contribution is Integrated Positional Encoding (IPE), which computes the expected positional encoding over the Gaussian approximation of each conical frustum. For a Gaussian with mean mu and variance sigma^2, the expected sine and cosine values have closed-form solutions: E[sin(x)] = sin(mu) * exp(-sigma^2/2) and E[cos(x)] = cos(mu) * exp(-sigma^2/2). This creates "anti-aliased positional encoding features" that naturally encode the size and shape of the integrated volume. Critically, high-frequency components are attenuated when their period is smaller than the integrated region, while lower frequencies pass through unaffected — exactly mirroring the behavior of a prefiltered multiscale representation.

關鍵技術貢獻是整合位置編碼（IPE），它計算每個圓錐截體的高斯逼近上的期望位置編碼。對於均值為 mu、變異量為 sigma^2 的高斯，期望正弦與餘弦值具有封閉形式解：E[sin(x)] = sin(mu) * exp(-sigma^2/2) 且 E[cos(x)] = cos(mu) * exp(-sigma^2/2)。這創造了「自然編碼整合體積大小與形狀的抗混疊位置編碼特徵」。關鍵在於，當高頻成分的週期小於整合區域時，它們會被衰減，而低頻成分則不受影響通過——這恰好模擬了預濾波多尺度表示的行為。

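Paragraph function: Second core step of the method. Derives the closed-form solution for integrated positional encoding.

Logical role: IPE is the bridge between the conical frustum and the MLP input. The closed-form solution is what makes the approach practical; without it, Monte Carlo integration would be needed, which would cancel out the efficiency advantage.

Argumentative technique / potential weakness: The analogy between IPE's behavior and mipmap prefiltering is one of the paper's best arguments: readers can grasp intuitively what attenuating high frequencies means. The derivation of the exp(-sigma^2/2) attenuation factor depends on the Gaussian assumption, yet it proves very robust in practice.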
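The closed-form expectations in Section 3.2 can be illustrated for a single coordinate. This is a sketch under stated assumptions, not the paper's code: it assumes frequencies that are powers of two, as in standard positional encoding, and `integrated_pos_enc` is a hypothetical name. Scaling x by 2^l scales its mean by 2^l and its variance by 4^l, so the attenuation factor becomes exp(-4^l * sigma^2 / 2).

```python
import math


def integrated_pos_enc(mu, var, num_freqs):
    """Expected [sin, cos] features of x ~ N(mu, var) at frequencies 2**l.

    Returns a flat list [E sin(x), E cos(x), E sin(2x), E cos(2x), ...].
    """
    feats = []
    for l in range(num_freqs):
        scale = 2.0 ** l
        # Variance of (scale * x) is scale**2 * var, so higher frequencies
        # are damped more strongly for the same region size.
        damp = math.exp(-0.5 * scale**2 * var)
        feats.append(math.sin(scale * mu) * damp)
        feats.append(math.cos(scale * mu) * damp)
    return feats


# A tiny region behaves like ordinary positional encoding;
# a wide region keeps low frequencies and suppresses high ones.
narrow = integrated_pos_enc(0.7, 1e-6, 8)
wide = integrated_pos_enc(0.7, 0.05, 8)
```

With the wide region, the first pair of features is only slightly damped while the last pair is effectively zero, which is the prefiltering behavior the paragraph describes.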

3.3 Architecture — 架構

Mip-NeRF uses a single MLP rather than NeRF's separate coarse and fine networks, since the IPE features already encode scale information that makes the coarse network redundant. For each pixel, n+1 depths are sampled, generating n IPE features for adjacent depth intervals. The volume rendering equation remains identical to NeRF's standard formulation. The training objective is: min sum(lambda * ||C*(r) - C(r; Theta, t^c)||^2 + ||C*(r) - C(r; Theta, t^f)||^2) with lambda=0.1 and a "blurpool" filter applied to sampling weights to prevent missed content in empty regions. This architectural simplification reduces parameters from 1,191K to 612K (48% reduction).

Mip-NeRF 使用單一 MLP 而非 NeRF 的獨立粗糙與精細網路，因為 IPE 特徵已編碼了使粗糙網路變得多餘的尺度資訊。對每個像素，取樣 n+1 個深度值，為相鄰深度區間生成 n 個 IPE 特徵。體積渲染方程維持與 NeRF 相同的標準形式。訓練目標為：min sum(lambda * ||C*(r) - C(r; Theta, t^c)||^2 + ||C*(r) - C(r; Theta, t^f)||^2)，其中 lambda=0.1，並以「模糊池化」濾波器處理取樣權重以防止在空白區域遺漏內容。此架構簡化將參數量從 1,191K 降至 612K（減少 48%）。

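Paragraph function: Architecture design. Explains how a single MLP replaces the two-network structure.

Logical role: Shows the knock-on benefits of IPE: it not only solves the aliasing problem but also simplifies the architecture (a single MLP), reduces parameters (48%), and speeds up training. One core innovation drives several improvements.

Argumentative technique / potential weakness: The causal argument that IPE makes the coarse network redundant is persuasive: scale information is already embedded in the encoding, so no staged processing is needed. The introduction of the blurpool filter shows engineering care, since it guards against missed content during hierarchical sampling.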
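The training objective and the "blurpool" step in Section 3.3 might look roughly like the sketch below for a single ray. The loss follows the formula quoted above with lambda = 0.1. The blurpool details are an assumption based on a recollection of the paper, since the text above only names the filter: a two-tap max filter, then a two-tap average, then a small added constant (0.01 here). The function names and the `padding` parameter are hypothetical.

```python
def blurpool_weights(weights, padding=0.01):
    """Widen the coarse sampling weights along one ray (assumed form).

    Each weight is replaced by the average of two neighbouring maxima,
    then a small constant is added so empty regions still get sampled.
    """
    # Repeat the end values so the output has the same length as the input.
    padded = [weights[0]] + list(weights) + [weights[-1]]
    # Two-tap max over adjacent entries: length n + 1.
    maxed = [max(a, b) for a, b in zip(padded[:-1], padded[1:])]
    # Two-tap average of the max-filtered values: back to length n.
    blurred = [0.5 * (a + b) for a, b in zip(maxed[:-1], maxed[1:])]
    return [w + padding for w in blurred]


def mipnerf_loss(target, coarse, fine, lam=0.1):
    """lam * ||C* - C_coarse||^2 + ||C* - C_fine||^2 for one ray's RGB color."""
    def sq_err(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return lam * sq_err(target, coarse) + sq_err(target, fine)


smoothed = blurpool_weights([0.0, 0.0, 1.0, 0.0, 0.0])
loss = mipnerf_loss((1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.9, 0.0, 0.0))
```

In this example a single spike of weight spreads to its two neighbours, so the fine samples drawn from these weights also cover the intervals next to the detected surface.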

4. Experiments — 實驗

On the standard single-scale Blender dataset, mip-NeRF improves PSNR from 31.74 to 33.09 dB, SSIM from 0.953 to 0.961, and LPIPS from 0.050 to 0.043, representing a ~17% average error reduction. On the newly created multiscale Blender dataset (images downsampled by 2x, 4x, and 8x), the improvements are far more dramatic: average error drops from 0.0288 to 0.0114 — a 60% reduction. At 1/8 resolution, SSIM improves from 0.8709 to 0.9833. Crucially, mip-NeRF matches the accuracy of brute-force supersampled NeRF (casting 128 rays per pixel) while being 22x faster. Training takes 2.84 hours vs. NeRF's 3.05 hours (7% faster). Ablation studies confirm that removing IPE degrades performance to baseline NeRF levels, and the single MLP is 20% faster with halved parameters.

在標準單尺度 Blender 資料集上，mip-NeRF 將 PSNR 從 31.74 提升至 33.09 dB，SSIM 從 0.953 提升至 0.961，LPIPS 從 0.050 降至 0.043，代表約 17% 的平均誤差降低。在新建立的多尺度 Blender 資料集（影像以 2 倍、4 倍與 8 倍降取樣）上，改進更為顯著：平均誤差從 0.0288 降至 0.0114——降低 60%。在 1/8 解析度下，SSIM 從 0.8709 提升至 0.9833。關鍵在於，mip-NeRF 達到與暴力超取樣 NeRF（每像素投射 128 條射線）相當的精度，同時快 22 倍。訓練耗時 2.84 小時對比 NeRF 的 3.05 小時（快 7%）。消融研究確認移除 IPE 會使性能退化至基準 NeRF 水準，且單一 MLP 快 20% 且參數減半。

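Paragraph function: Comprehensive empirical validation. Quantifies mip-NeRF's improvement on single-scale and multiscale scenes.

Logical role: The experiments cover three aspects: (1) single-scale quality (17% improvement), (2) multiscale quality (60% improvement), and (3) efficiency (faster and smaller). The large multiscale gain is the strongest evidence because it directly validates the core anti-aliasing motivation.

Argumentative technique / potential weakness: The comparison "equivalent to 128x supersampling but 22x faster" is striking. However, the experiments are mainly on synthetic data; aliasing patterns in real scenes may be more complex (non-uniform textures, depth-of-field effects, and so on). The results summarized here include no real-scene datasets.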
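The headline percentages in Section 4 can be checked with simple arithmetic. The sketch below assumes that the "average error" is the geometric mean of MSE (recovered from PSNR as 10^(-PSNR/10)), sqrt(1 - SSIM), and LPIPS. That definition is recalled from the paper and is not stated in this summary, so the single-scale figure should be read as an approximate reconstruction from dataset-level averages. The function names are hypothetical.

```python
def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1.0 - after / before


def average_error(psnr, ssim, lpips):
    """Assumed definition: geometric mean of MSE, sqrt(1 - SSIM), and LPIPS."""
    mse = 10.0 ** (-psnr / 10.0)  # assumes pixel values in [0, 1]
    return (mse * (1.0 - ssim) ** 0.5 * lpips) ** (1.0 / 3.0)


# Single-scale Blender: NeRF versus mip-NeRF metrics quoted above.
single_scale = relative_reduction(
    average_error(31.74, 0.953, 0.050),
    average_error(33.09, 0.961, 0.043),
)
# Multiscale Blender: average error quoted above.
multiscale = relative_reduction(0.0288, 0.0114)
# Training time in hours and parameter count in thousands.
training_time = relative_reduction(3.05, 2.84)
parameters = relative_reduction(1191, 612)
```

Under these assumptions the four values come out near 0.17, 0.60, 0.07, and 0.49, which agrees with the quoted 17%, 60%, and 7% and shows that the quoted 48% parameter reduction is slightly rounded down.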

5. Conclusion — 結論

Mip-NeRF addresses the fundamental aliasing problem in neural radiance fields by replacing point sampling with cone tracing and integrated positional encoding. The approach draws on classical prefiltering concepts from computer graphics to create a continuously-valued multiscale scene representation. The resulting model is faster, smaller, and significantly more accurate than NeRF, particularly in multiscale settings. The key insight — that positional encoding can be extended to encode regions rather than points — opens the door to further extensions of coordinate-based neural representations that reason about scale and extent.

Mip-NeRF 透過以圓錐追蹤與整合位置編碼取代點取樣，解決了神經輻射場中的根本性混疊問題。此方法借鑑電腦圖學中的經典預濾波概念，創造了連續值的多尺度場景表示。所得模型比 NeRF 更快、更小且顯著更精確，尤其在多尺度設定中。關鍵洞察——位置編碼可以從編碼點擴展到編碼區域——為基於座標的神經表示在尺度與範圍上的進一步推理打開了大門。

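Paragraph function: Summarizes the paper and lifts the technical contribution to a conceptual insight.

Logical role: The conclusion moves from the specific method to a general principle: extending the encoding from points to regions. This presents mip-NeRF as a new design philosophy rather than only a specific method.

Argumentative technique / potential weakness: The "from points to regions" abstraction is precise and has been further validated by follow-up work such as mip-NeRF 360. Limitations that the conclusion leaves unmentioned include the accuracy of the Gaussian approximation for non-convex scene geometry and the method's applicability to dynamic scenes.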

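Argument structure overview

Problem: NeRF's point sampling causes multiscale aliasing

→

Claim: Conical frustums plus IPE enable anti-aliased rendering

→

Evidence: 60% error reduction, matching 128x supersampling

→

Rebuttal: 7% faster with half the parameters, so no extra computational burden

→

Conclusion: "From points to regions" opens a new paradigm for multiscale neural representations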

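The authors' core claim (in one sentence)

Replacing ray tracing with cone tracing, and embedding region information into a multiscale neural field representation through integrated positional encoding, resolves NeRF's aliasing problem at its root while lowering computational cost and model size.

Strongest part of the argument

The closed-form derivation of IPE: once the conical frustum is approximated by a Gaussian, the expected positional encoding has a closed-form solution (the exp(-sigma^2/2) attenuation), so high frequencies are attenuated naturally while low frequencies pass through unaffected, closely mimicking the prefiltering behavior of mipmapping. This mathematical elegance also brings practical efficiency gains: fewer parameters, faster training, and better quality, all at once.

Weakest part of the argument

Experiments limited to synthetic scenes: all quantitative evaluation is carried out on the synthetic Blender datasets, with no evaluation on real-world scenes. Aliasing in real scenes is more complex, involving non-uniform textures, motion blur, specular reflections, and more, and the accuracy of the Gaussian approximation in those settings is not validated. In addition, although the multiscale benchmark shows a large improvement, it was created by the authors themselves, so how well it represents real multiscale scenes awaits external validation.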