NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections

Abstract

Neural Radiance Fields (NeRF) can be used to reconstruct scenes using photos taken under controlled settings. However, "applying NeRF to casually captured images or Internet photo collections remains challenging" due to variable illumination and transient occluders (e.g., tourists, vehicles). The authors introduce NeRF-W, which extends NeRF by modeling per-image appearance variations and separately modeling transient phenomena. This enables accurate 3D reconstruction from unstructured, in-the-wild photo collections, such as Internet photos of famous landmarks.

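Paragraph function: Overview of the whole paper. Starting from NeRF's limitations, it introduces the two core extensions of NeRF-W.

Logical role: The abstract builds its motivation on an "ideal vs. reality" contrast: NeRF performs well under controlled capture, but real-world photos are uncontrolled. The two extensions of NeRF-W map one-to-one onto the two real-world challenges.

Argumentative technique / potential weakness: Concrete examples (tourists, vehicles, landmark photos) make an abstract technical problem intuitive. However, the abstract gives no quantitative criterion for "accurate 3D reconstruction"; that claim can only be checked against the experiments section.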

1. Introduction

NeRF achieves remarkable novel view synthesis quality by representing a scene as a continuous 5D function (3D position + 2D viewing direction) mapped to density and color via an MLP. However, the original formulation assumes that "the scene is static and captured under consistent lighting conditions." Real-world photo collections, such as those found on Flickr or Google Images for tourist landmarks, violate these assumptions: photos are taken at different times of day, seasons, and weather conditions, and commonly contain transient objects that are not part of the permanent scene structure.

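Paragraph function: Establishes the research setting by stating NeRF's core assumptions and how they fail in the real world.

Logical role: The starting point of the argument chain. It first acknowledges what NeRF does well, then systematically shows that its "static scene" and "consistent lighting" assumptions do not hold in practice, which creates the need for an extension.

Argumentative technique / potential weakness: A familiar use case (tourist photos) motivates the work and gives the problem broad practical relevance. However, variation across "different times of day, seasons" spans a very wide range, and it is doubtful that a single method covers all of it.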

Traditional structure-from-motion (SfM) and multi-view stereo (MVS) pipelines can handle some appearance variation through feature matching invariance, but their reconstructions are often "incomplete, noisy, and cannot render photorealistic novel views." Simply applying NeRF to such collections results in "blurry, artifact-laden renderings" as the network attempts to average over inconsistent observations. The authors argue that the solution requires explicitly modeling the factors that cause inter-image variation rather than treating them as noise.

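Paragraph function: Critiques existing methods by identifying the failure modes of SfM/MVS and of the original NeRF on in-the-wild scenes.

Logical role: Deepens the problem half of the problem-solution structure. "Blurry" and "artifact-laden" are intuitive, severe failures, so the need to extend NeRF becomes hard to dispute.

Argumentative technique / potential weakness: The call to "explicitly model the factors of variation" signals that causal understanding matters, a deeper methodological stance than brute-force fitting. It also means every factor of variation must be identified correctly; if one is missed (for example, differences in lens distortion), the method can still fail.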

2. Related Work

Prior work on 3D reconstruction from Internet photos includes landmark-scale SfM systems that recover camera poses and sparse point clouds from thousands of photos. Appearance modeling in multi-view reconstruction has been explored through per-image affine color transforms and learned appearance embeddings. For transient object handling, some methods use semantic segmentation to mask out people, but this requires pre-trained detectors and cannot handle all types of transient phenomena. NeRF and its concurrent extensions focus on controlled capture settings, leaving the in-the-wild scenario largely unaddressed.

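Paragraph function: Literature review covering three areas: SfM, appearance modeling, and transient handling.

Logical role: Finds precedents in the literature for each of NeRF-W's two innovations (appearance embeddings and the transient model), which shows the method's academic grounding while pointing to the gap in integrating them.

Argumentative technique / potential weakness: Reviewing appearance modeling and transient handling as two separate lines of research makes NeRF-W's "integration" contribution look natural. However, NeRF-W's appearance embedding is essentially a generalization of the learned affine transform, so the novelty may be smaller than the framing suggests.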

3. Method

3.1 Background: NeRF

The original NeRF represents a static scene as a function F: (x, d) -> (c, sigma) mapping a 3D position x and viewing direction d to an RGB color c and volume density sigma. Novel views are synthesized via classical volume rendering: integrating color weighted by accumulated transmittance along camera rays. The model is trained by minimizing the photometric loss between rendered and observed pixel colors. This formulation implicitly assumes that each 3D point has a single, fixed appearance regardless of when or how it was photographed.

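Paragraph function: Background. A concise review of NeRF's core formulation and its implicit assumption.

Logical role: Provides the technical basis for the extensions that follow: the limitation of the original formulation (the fixed-appearance assumption) has to be understood before the need for appearance embeddings makes sense.

Argumentative technique / potential weakness: Defining the assumption precisely in mathematical notation makes the motivation for, and location of, each later modification clear. Stating the assumption explicitly ("a single, fixed appearance") is good academic practice.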
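
To make the volume rendering step in the NeRF background concrete, here is a minimal NumPy sketch of the usual quadrature approximation for a single ray. It is an illustration, not the authors' code: the function name `render_ray` and its argument layout are hypothetical, and the densities and colors are assumed to have already been produced by the MLP F at the sample points.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate the volume rendering integral along one camera ray.

    sigma: (N,) non-negative densities at the N samples
    color: (N, 3) RGB values at the samples
    t:     (N + 1,) increasing distances that bound the N sample intervals
    """
    delta = np.diff(t)                       # interval lengths, shape (N,)
    alpha = 1.0 - np.exp(-sigma * delta)     # opacity of each interval
    # Transmittance: fraction of light that reaches sample k unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                  # contribution of each sample
    return weights @ color, weights

# Example: four samples, with a nearly opaque third sample.
t = np.linspace(0.0, 1.0, 5)
sigma = np.array([0.0, 0.0, 1e3, 0.0])
color = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])
rgb, weights = render_ray(sigma, color, t)
```

Because the weights depend only on density, every image of the scene is rendered with the same per-point color here, which is the fixed-appearance assumption that the later sections relax.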

3.2 Appearance Modeling

To handle variable illumination, the authors introduce a per-image appearance embedding vector l_i that is optimized jointly with the network parameters. The NeRF function becomes F: (x, d, l_i) -> (c_i, sigma), where the color output now depends on which image is being rendered. The density sigma remains shared across all images, ensuring that "the underlying geometry is consistent while allowing photometric variation." This models phenomena such as different times of day, seasonal changes, and varying camera white balance settings.

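Paragraph function: Core extension 1. Describes how appearance embeddings handle illumination variation.

Logical role: Responds directly to the first challenge raised in the introduction (variable illumination). The design principle of shared geometry with separate appearance reflects causal reasoning: lighting changes appearance but not geometry.

Argumentative technique / potential weakness: Making density independent of the appearance embedding assumes that lighting does not affect geometry. This holds in most cases, but under extreme lighting (for example, strong shadows that cause apparent depth to be misjudged) the assumption may not be accurate enough.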
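
The "shared geometry, per-image appearance" design can be sketched in a few lines. The toy model below is an assumption-laden illustration rather than the NeRF-W architecture: the layer sizes, the random weights standing in for trained parameters, and the names `appearance_table` and `nerf_a` are all hypothetical. The only point it demonstrates is structural: the density branch never sees l_i, while the color branch does.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_IMAGES, EMBED_DIM, HIDDEN = 4, 8, 16   # hypothetical sizes

# One embedding row per training image; in training these would be
# optimized jointly with the weights (random stand-ins here).
appearance_table = rng.normal(size=(NUM_IMAGES, EMBED_DIM))

# Random weights standing in for a trained network.
W_feat = 0.5 * rng.normal(size=(3, HIDDEN))
w_sigma = 0.5 * rng.normal(size=HIDDEN)
W_color = 0.1 * rng.normal(size=(HIDDEN + 3 + EMBED_DIM, 3))

def nerf_a(x, d, image_id):
    """Map (x, d, l_i) to (c_i, sigma); sigma depends on x only."""
    feat = np.maximum(x @ W_feat, 0.0)           # ReLU features from position
    sigma = np.logaddexp(0.0, feat @ w_sigma)    # softplus keeps density >= 0
    l_i = appearance_table[image_id]             # per-image appearance embedding
    color_in = np.concatenate([feat, d, l_i])
    color = 1.0 / (1.0 + np.exp(-(color_in @ W_color)))  # sigmoid -> [0, 1]
    return color, sigma

x = np.array([0.1, 0.2, 0.3])
d = np.array([0.0, 0.0, 1.0])
color_0, sigma_0 = nerf_a(x, d, image_id=0)
color_1, sigma_1 = nerf_a(x, d, image_id=1)
```

Querying the same point with two different image indices returns the same density but different colors, which mirrors the claim that geometry stays consistent while photometric appearance varies.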

3.3 Transient Object Modeling

For transient occluders, the authors add a separate transient head that predicts per-image transient color and density: F_tau: (x, l_tau_i) -> (c_tau, sigma_tau, beta). The transient density sigma_tau is image-specific and not shared across views, reflecting that transient objects appear in specific images only. An uncertainty output beta is used to down-weight the loss for pixels containing transient objects. The total rendered color combines static and transient components, and the loss applies per-pixel heteroscedastic uncertainty weighting plus a regularizer that encourages zero transient density.

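Paragraph function: Core extension 2. Describes how a separate transient model handles occluders.

Logical role: Responds to the second challenge from the introduction (transient occlusion). The "static + transient" separation is methodologically consistent with the "geometry + appearance" separation used for appearance modeling.

Argumentative technique / potential weakness: Uncertainty weighting is an elegant self-supervised strategy: the model learns on its own which pixels contain transient objects. However, the regularizer's strength has to be tuned, and pushing transient density toward zero may lead the model to underestimate true transient density, degrading results on images crowded with transient objects.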
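
A rough sketch of how the static and transient components might be composited and how an uncertainty-weighted loss could look is given below. The summary above does not state the exact formulas, so everything specific here is an assumption: the Gaussian negative log-likelihood form of the heteroscedastic term, the mean-density penalty used as the regularizer, the illustrative defaults for `beta_min` and `lambda_u`, and the function names `composite_ray` and `ray_loss`.

```python
import numpy as np

def composite_ray(sigma_s, c_s, sigma_t, c_t, beta, t):
    """Render one ray from static (s) and transient (t) samples.

    sigma_s, sigma_t, beta: (N,) per-sample values; c_s, c_t: (N, 3);
    t: (N + 1,) increasing distances that bound the N sample intervals.
    """
    delta = np.diff(t)
    alpha_s = 1.0 - np.exp(-sigma_s * delta)
    alpha_t = 1.0 - np.exp(-sigma_t * delta)
    # Both densities attenuate light on its way to later samples.
    optical_depth = np.cumsum((sigma_s + sigma_t) * delta)
    trans = np.exp(-np.concatenate([[0.0], optical_depth[:-1]]))
    rgb = (trans * alpha_s) @ c_s + (trans * alpha_t) @ c_t
    # Per-pixel uncertainty: beta rendered with the transient weights only.
    beta_ray = np.sum(trans * alpha_t * beta)
    return rgb, beta_ray

def ray_loss(rgb_pred, rgb_true, beta_ray, sigma_t,
             beta_min=0.03, lambda_u=0.01):
    """Assumed loss: uncertainty-weighted error + log penalty + density prior."""
    b = beta_min + beta_ray                    # keep the uncertainty positive
    recon = np.sum((rgb_pred - rgb_true) ** 2) / (2.0 * b ** 2)
    log_term = np.log(b)                       # stops b from growing freely
    reg = lambda_u * np.mean(sigma_t)          # pushes transient density to zero
    return recon + log_term + reg
```

Under these assumptions, a pixel with large predicted uncertainty contributes less reconstruction error, the log term penalizes claiming high uncertainty everywhere, and the density penalty is the part whose strength has to be tuned.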

4. Experiments

Experiments are conducted on the Phototourism dataset, which contains Internet photo collections of landmarks including the Brandenburg Gate, the Sacre Coeur, and the Trevi Fountain. NeRF-W is compared against original NeRF, NeRF with appearance embedding only, and traditional MVS methods. Results show that NeRF-W significantly outperforms baseline NeRF in PSNR, SSIM, and LPIPS metrics. The appearance embeddings successfully capture lighting variations — interpolating between embeddings produces smooth transitions between day and night appearances. The transient model correctly identifies and down-weights tourists, scaffolding, and other temporary objects. Ablation studies confirm that both appearance and transient modeling are necessary for optimal results.

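Paragraph function: Comprehensive experimental validation. Demonstrates the method's effectiveness on a dataset of real-world landmarks.

Logical role: The empirical support covers four dimensions: (1) improvement in quantitative metrics; (2) qualitative validation of the appearance embeddings (day-night interpolation); (3) semantic correctness of the transient model; (4) ablations confirming that each component is necessary.

Argumentative technique / potential weakness: Using famous landmarks as test scenes is a sound choice: readers know these scenes, so results are easy to judge. However, Phototourism scenes are mostly large outdoor structures, and generalization to indoor scenes or small objects is not verified.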
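
Two of the quantities mentioned in the experiments are simple enough to sketch directly: PSNR, and linear interpolation between two appearance embeddings. This is a generic illustration, not the paper's evaluation code. The function names are hypothetical, images are assumed to be float arrays in [0, 1], and SSIM and LPIPS are omitted because they need dedicated implementations.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (undefined when the images are identical)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def interpolate_embeddings(l_a, l_b, num_steps):
    """Linearly blend two appearance embeddings; returns (num_steps, D)."""
    w = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - w) * l_a + w * l_b

# Example: a uniform error of 0.1 on a [0, 1] image.
target = np.zeros((4, 4, 3))
pred = target + 0.1
score = psnr(pred, target)

# Example: five embeddings blending from one image's code to another's.
l_day = np.zeros(8)      # stand-in for a learned "day" embedding
l_night = np.ones(8)     # stand-in for a learned "night" embedding
path = interpolate_embeddings(l_day, l_night, num_steps=5)
```

Feeding each row of `path` to the color branch while keeping the camera fixed is one way to produce the kind of day-to-night transition the experiments describe.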

5. Conclusion

NeRF-W extends neural radiance fields to handle unconstrained photo collections by introducing per-image appearance embeddings for photometric variation and a transient object model with uncertainty-based loss weighting. The method enables high-quality 3D reconstruction and novel view synthesis from Internet photo collections, a setting previously inaccessible to neural rendering methods. The learned appearance and transient decomposition also opens possibilities for appearance editing and scene cleaning applications.

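Paragraph function: Summarizes the paper by restating the core contributions and pointing to applications.

Logical role: The conclusion echoes the abstract and closes the argument. The outlook on appearance editing and scene cleaning goes beyond the purely technical contribution and suggests broader impact.

Argumentative technique / potential weakness: The conclusion is concise and forceful but does not discuss limitations such as training time, memory requirements, or dependence on the quality of SfM pose estimates. For application-oriented work, omitting deployment constraints is a significant gap.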

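Argument structure overview

Problem: NeRF cannot handle in-the-wild photo collections
→ Thesis: explicitly model appearance variation and transient objects
→ Evidence: Phototourism dataset; significant PSNR/SSIM gains
→ Rebuttal: uncertainty weighting automatically identifies transient regions
→ Conclusion: Internet photos can yield high-quality 3D reconstruction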

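The authors' core claim (one sentence)

By adding per-image appearance embeddings and a transient object model with uncertainty estimation to neural radiance fields, unstructured Internet photo collections can be used for high-quality 3D scene reconstruction and novel view synthesis.

Strongest point of the argument

Clarity of problem decomposition: the challenge of in-the-wild photos is split precisely into two independent problems, appearance variation and transient occlusion, and each is addressed with a matching technique. The self-supervised uncertainty weighting is especially elegant: the model learns to identify and ignore transient objects without any annotation of them.

Weakest point of the argument

Insufficient discussion of practical limitations: the method still needs camera poses from SfM as input, and SfM on in-the-wild photos is itself unstable. In addition, optimizing appearance embeddings per image means the model cannot generalize directly to new images outside the training set, which limits practical deployment. Training cost is also not fully reported.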