pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Abstract

We introduce pixelSplat, a feed-forward model that learns to reconstruct a 3D Gaussian splat parameterization of a 3D scene from a pair of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse representations, we predict dense probability distributions over 3D and sample Gaussian means from these distributions. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.

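Paragraph function: Overview of the whole paper. A single-paragraph abstract covers the problem, the method, the technical innovation, and the experimental results.

Logical role: The abstract both defines the problem and previews the solution. It first identifies the local-minima challenge of sparse representations, then answers it with probabilistic sampling and a reparameterization trick, and closes with data supporting the method's effectiveness and efficiency.

Argumentative technique / potential weakness: The "2.5 orders of magnitude" speedup is highly eye-catching in the abstract and an effective rhetorical move. Note, however, that it compares rendering speed only; the speedup of the whole pipeline (including encoding time) has to be clarified in the experiments. The "interpretable and editable" claim likewise depends on the quality of the reconstructed 3D representation and needs verification.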

1. Introduction

The problem of generalizable novel view synthesis from sparse image observations has seen tremendous progress through differentiable rendering. However, differentiable rendering inherits a key weakness: training, reconstruction, and rendering are notoriously memory- and time-intensive because differentiable rendering requires evaluating dozens or hundreds of points along each camera ray. While light-field transformers represent progress, they remain far from real-time and do not reconstruct 3D scene representations that can be edited or exported for downstream tasks in vision and graphics.

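Paragraph function: Establishes the research setting by pointing out the efficiency and functionality bottlenecks of differentiable rendering and light field methods.

Logical role: Starting point of the argument chain. The authors first acknowledge the achievements of differentiable rendering, then expose its shortcomings along two dimensions ("memory- and time-intensive" and "cannot export a 3D representation"), paving the way for a primitive-based representation.

Argumentative technique / potential weakness: Making the computational bottleneck concrete with "dozens or hundreds of points along each camera ray" is more persuasive than abstractly calling the methods slow. However, the authors label all light field methods as far from real-time, overlooking that some lightweight light field methods already approach real-time rendering.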

We present pixelSplat, which brings the benefits of a primitive-based 3D representation — fast and memory-efficient rendering as well as interpretable 3D structure — to generalizable view synthesis. We identify two key challenges. First, scale ambiguity: in real-world datasets, camera poses are only reconstructed up to an arbitrary scale factor. We design a multi-view epipolar transformer that reliably infers this per-scene scale factor. Second, local minima in optimization: optimizing primitive parameters directly via gradient descent suffers from local minima. While single-scene methods use non-differentiable pruning and division heuristics, the generalizable case requires gradient flow, precluding such operations.

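Paragraph function: Proposes the solution by outlining pixelSplat's positioning and the two core technical challenges it faces.

Logical role: Following the problem statement of the previous paragraph, this paragraph acts as the pivot from "existing methods fall short" to "our approach", and reveals the technical difficulties in advance to set expectations for the method sections.

Argumentative technique / potential weakness: Listing the two challenges (scale ambiguity and local minima) as separate items makes the paper's structure symmetric and easy to follow. The criticism of "non-differentiable pruning" precisely targets a pain point of the original 3D-GS, but whether the authors fully solve the local-minima problem has to be verified by the ablations.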

Our proposed solution: we parameterize the positions (i.e., means) of Gaussians implicitly via dense probability distributions predicted by our encoder. In each forward pass, we sample Gaussian primitive locations from this distribution, enabling implicit spawning or deletion of primitives during training while maintaining gradient flow through a reparameterization trick. We demonstrate that a 3D Gaussian splatting representation can be predicted in a single forward pass from just a pair of images, showing that a Gaussian splatting representation can be turned into a building block of any end-to-end differentiable system. Our method significantly outperforms light field transformers on real-world datasets while drastically reducing both training and rendering cost.

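Paragraph function: Reveals the core innovation by explaining the central idea of the probabilistic sampling scheme and its benefits.

Logical role: Responds to the local-minima challenge listed in the previous paragraph: a probability distribution replaces deterministic regression, and the reparameterization trick preserves end-to-end differentiability. This paragraph is a condensed version of the paper's technical contribution.

Argumentative technique / potential weakness: "Building block of any end-to-end differentiable system" is a very ambitious claim that implies pixelSplat is general-purpose. This generality has not been validated on diverse downstream tasks; the paper demonstrates only one application, novel view synthesis.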

2. Related Work

Recent advances in single-scene novel view synthesis have made neural fields and volume rendering the de facto standard. However, these methods suffer from high computational demands because they require dozens of queries of the neural field per ray. Discrete data structures accelerate rendering but still fall significantly short of real-time rendering at high resolutions. The recently proposed 3D Gaussian splatting offers an efficient alternative by sparsely representing the radiance field with 3D Gaussian primitives that can be rendered via rasterization. However, single-scene optimization requires dozens of images per scene. Our work trains neural networks to estimate a 3D Gaussian primitive scene representation from just two images in a single forward pass.

Paragraph function: Literature review. Traces the evolution of single-scene methods and positions 3D-GS relative to this paper.

Logical role: Builds an academic lineage: NeRF-style methods -> acceleration schemes -> 3D-GS -> pixelSplat. Each step names the remaining gap, so the linear narrative positions pixelSplat as the natural next step.

Argumentative technique / potential weakness: "Just two images in a single forward pass" contrasts sharply with "dozens of images per scene", which is rhetorically very effective. But two images carry far less geometric information than dozens, so whether the resulting quality degradation is acceptable depends on the experimental data.

The generalizable novel view synthesis direction seeks 3D reconstruction and novel view synthesis from only a handful of images per scene. Neural networks can be trained to regress multi-plane images for small-baseline synthesis, but large-baseline novel view synthesis requires full 3D representations. Methods preserving end-to-end locality and shift equivariance between encoder and scene representation via pixel-aligned or otherwise local features enabled generalization to room-scale scenes and beyond. Recent work combines encoders with cost volumes to match features across views, inspired by classical multi-view stereo. While methods inferring interpretable 3D representations exist, recent light field scene representations trade interpretability for faster rendering. Our method presents the best of both worlds: it infers an interpretable 3D scene representation in the form of 3D Gaussian primitives while accelerating rendering by three orders of magnitude compared to light field transformers.

Paragraph function: Literature positioning. Places pixelSplat within the development of generalizable view synthesis.

Logical role: This paragraph sets up an "interpretability vs. efficiency" dichotomy and then presents pixelSplat as the approach that unifies both. It is a classic "dilemma plus third way" argument.

Argumentative technique / potential weakness: "Best of both worlds" is persuasive rhetoric but may oversimplify the trade-off. The three-orders-of-magnitude speedup stems mainly from the intrinsic efficiency of rasterizing Gaussian primitives compared with evaluating a network per ray, not from an innovation specific to pixelSplat. The authors thereby credit their own method with an advantage inherited from 3D-GS.

Prior work recognizes the challenge of scene scale ambiguity in machine learning for multi-view geometry. In monocular depth estimation, state-of-the-art models rely on sophisticated scale-invariant depth losses. In novel view synthesis, single-image 3D diffusion models trained on real-world data rescale 3D scenes according to heuristics on depth statistics and condition their encoders on scene scale. We instead build a multi-view encoder that can infer the scale of the scene, using an epipolar transformer that finds cross-view pixel correspondences and associates them with positionally encoded depth values.

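Paragraph function: Differentiation. Contrasts earlier ways of handling scale ambiguity with this paper's approach.

Logical role: Answers the possible objection "isn't scale ambiguity already solved?" by comparing the strategies used in different fields (heuristic rescaling vs. the scale inference proposed here) and thereby arguing for the originality of the approach.

Argumentative technique / potential weakness: The authors describe the other methods as relying on "heuristics", implying a lack of principled support, although the epipolar sampling used here is itself a hand-designed geometric prior. The real contribution is that scale inference is folded into an end-to-end learned framework.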

3. Background: 3D Gaussian Splatting

3D Gaussian Splatting parameterizes a 3D scene as a set of 3D Gaussian primitives {g_k = (mu_k, Sigma_k, alpha_k, S_k)}, where each has a mean mu_k, a covariance Sigma_k, an opacity alpha_k, and spherical harmonics coefficients S_k. These primitives parameterize the 3D radiance field and can be rendered via an inexpensive rasterization operation. Unlike dense neural field representations that require evaluating dozens of samples per ray, this approach is significantly cheaper in terms of time and memory.

Paragraph function: Defines prerequisites by introducing the basic representation used in 3D Gaussian Splatting.

Logical role: As background for the method sections, this paragraph establishes the notation and concepts the reader needs. The tuple (mu, Sigma, alpha, S) is referred to repeatedly later on.

Argumentative technique / potential weakness: Concise notation compresses a complex 3D representation into an easily understood form. "Inexpensive rasterization" is relative to volume rendering; when the number of Gaussian primitives becomes very large, rasterization can become a bottleneck too, so the efficiency claim is conditional.
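
To make the rendering step more concrete, the following is a minimal sketch of front-to-back alpha compositing for a single pixel, which is how sorted splats are typically blended in 3D-GS-style rasterizers. It assumes the Gaussians have already been projected, depth-sorted, and reduced to a per-pixel color and effective opacity; the `Splat` container and `composite` function are hypothetical names, not part of any released code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat:
    """One depth-sorted contribution to a pixel (hypothetical container)."""
    color: Tuple[float, float, float]
    alpha: float  # effective opacity at this pixel, in [0, 1]


def composite(splats: List[Splat]) -> Tuple[float, float, float]:
    """Blend splats front to back: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for s in splats:  # assumed sorted from near to far
        weight = s.alpha * transmittance
        for ch in range(3):
            out[ch] += weight * s.color[ch]
        transmittance *= 1.0 - s.alpha
    return (out[0], out[1], out[2])


# A half-transparent red splat in front of an opaque blue one.
pixel = composite([Splat((1.0, 0.0, 0.0), 0.5), Splat((0.0, 0.0, 1.0), 1.0)])
print(pixel)
```

The loop touches each contributing splat once, in contrast to the dozens of network queries per ray that a neural field would need.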

Fitting 3D-GS models relates to fitting Gaussian mixture models, where we seek the parameters of a set of Gaussians that maximize the likelihood of a set of samples. This problem is famously non-convex. Local minima arise when Gaussian primitives initialized at random locations have to move through space to arrive at their final location. Two issues prevent convergence. First, Gaussian primitives have local support, meaning gradients vanish if the distance to the correct location exceeds a few standard deviations. Second, even if a Gaussian is close enough to receive substantial gradients, there still needs to exist a monotonically decreasing loss path to its final location. 3D-GS addresses this via non-differentiable pruning and splitting operations, which are incompatible with the generalizable setting where primitive parameters are predicted by a neural network that must receive gradients.

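Paragraph function: Deepens the problem by using the Gaussian mixture model analogy to dissect the root causes of local minima.

Logical role: This paragraph is the key bridge between the background and the method. By precisely describing the two causes of local minima (vanishing gradients and non-monotonic loss paths), it provides the motivation for the probabilistic sampling scheme that follows.

Argumentative technique / potential weakness: The Gaussian mixture model analogy is insightful because it lets readers understand a new problem through a familiar statistical framework, and "famously non-convex" appeals to a consensus in the field. However, the Adaptive Density Control of the original 3D-GS works well in practice, so whether a fully differentiable alternative is really needed depends on how necessary the generalizable setting is.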
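
The first cause of local minima, vanishing gradients outside a Gaussian's local support, can be checked numerically. The sketch below is a one-dimensional illustration only: it assumes an unnormalized Gaussian and evaluates the derivative of its value with respect to the mean at a target point that lies k standard deviations away. The function name is hypothetical.

```python
import math


def gaussian_grad_wrt_mean(x: float, mu: float, sigma: float) -> float:
    """Derivative of exp(-(x - mu)^2 / (2 sigma^2)) with respect to mu."""
    value = math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return value * (x - mu) / sigma ** 2


# Gradient signal a Gaussian at mu = 0 receives from a target k sigma away.
for k in (1, 3, 6, 10):
    g = gaussian_grad_wrt_mean(x=float(k), mu=0.0, sigma=1.0)
    print(f"target at {k} sigma: gradient magnitude {abs(g):.3e}")
```

The magnitude decays faster than exponentially with distance, so a primitive that starts many standard deviations from the correct surface receives essentially no signal pulling it there.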

4. Image-conditioned 3D Gaussian Inference

4.1 Resolving Scale Ambiguity

In practice, structure-from-motion (SfM) reconstructs each scene only up to scale, meaning scenes are scaled by individual, arbitrary scale factors s_i. A given scene provides camera data C_i = {(I_j, s_i * T_j)}, where s_i * T_j denotes a metric pose whose translation component is scaled by the unknown scalar s_i. Critically, recovering s_i from a single image is impossible due to the principle of scale ambiguity. This means a neural network making predictions about the geometry of a scene from a single image cannot possibly predict the depth that matches the poses reconstructed by structure-from-motion.

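Paragraph function: Establishes an impossibility by arguing that single-image depth prediction has a fundamental limitation.

Logical role: This paragraph establishes why two views are theoretically necessary: not as an engineering preference but as a mathematical impossibility for a single view. It gives the two-view encoder design a hard-to-dispute motivation.

Argumentative technique / potential weakness: Using the "principle of scale ambiguity", a basic fact of geometry, as evidence gives the argument theoretical rigor. In practice, however, if the scale distribution of the training data is fairly concentrated, a single-image model may still learn an approximate scale prior. The "impossibility" holds in the strict mathematical sense and does not fully predict practical performance.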
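
The ambiguity can be illustrated with a toy pinhole camera: scaling the scene and the camera positions by the same factor leaves every pixel observation unchanged, so the images alone cannot reveal the factor. This is a minimal sketch assuming identity rotations and unit focal length; `project` and `observations` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple


def project(point: Sequence[float], cam_center: Sequence[float],
            focal: float = 1.0) -> Tuple[float, float]:
    """Pinhole projection for a camera at cam_center looking down +z."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)


def observations(s: float) -> List[Tuple[float, float]]:
    """Pixels seen by two cameras when the whole scene is scaled by s."""
    point = (0.3 * s, -0.2 * s, 4.0 * s)
    centers = [(0.0, 0.0, 0.0), (0.5 * s, 0.0, 0.0)]
    return [project(point, c) for c in centers]


print(observations(1.0))
print(observations(2.5))  # same pixels, although all distances are 2.5x larger
```

Only when the scaled camera translations are also given, as they are in the SfM poses, does a cross-view correspondence pin down a depth that is consistent with s_i.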

Our solution uses two reference views. For each pixel in image I, we annotate points along its epipolar line in the second image with their corresponding depths in I's coordinate frame. These depth values are computed from camera poses and encode the scene's scale s_i. Each view is encoded separately into feature volumes F and F-tilde. For pixel coordinate u from I, the epipolar line in the second image is sampled at coordinates {u-tilde_l}. For each epipolar sample, the distance d-tilde_l to I's camera origin is computed by triangulation. Queries, keys, and values for epipolar attention are then computed as follows. Each sample is represented by s_l = F-tilde[u-tilde_l] concatenated with gamma(d-tilde_l), where gamma denotes positional encoding. With learned projection matrices Q, K, and V, we set q = Q * F[u], k_l = K * s_l, and v_l = V * s_l. We then perform epipolar cross-attention: F[u] += Att(q, {k_l}, {v_l}).

Paragraph function: Technical detail. Describes the implementation steps of the epipolar attention mechanism.

Logical role: This paragraph turns the abstract goal of resolving scale ambiguity into concrete algorithmic steps. Combining epipolar geometry with attention is a key technical contribution that fuses classical multi-view geometry with modern deep learning.

Argumentative technique / potential weakness: Introducing the positional encoding gamma(d) is a neat design: it maps continuous depth values to a high-dimensional representation so that the network can tell nearby depths apart. However, the density and range of epipolar sampling are hyperparameters that need tuning, and sampling too sparsely may miss the correct correspondence.
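
The attention step can be sketched for a single pixel in plain Python. This is an illustration under strong assumptions, not the authors' implementation: the learned projections Q, K, and V are replaced by fixed selections (the key is the feature part of each sample, the value is the whole sample), the positional encoding uses four frequencies, and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def positional_encoding(d: float, num_freqs: int = 4) -> List[float]:
    """Frequency encoding gamma(d) = [sin(2^i d), cos(2^i d)] for i < num_freqs."""
    out: List[float] = []
    for i in range(num_freqs):
        out += [math.sin((2.0 ** i) * d), math.cos((2.0 ** i) * d)]
    return out


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def epipolar_attention(query_feat: List[float],
                       sample_feats: List[List[float]],
                       sample_depths: List[float]) -> Tuple[List[float], List[float]]:
    """Return attention weights and the weighted sum of depth-annotated samples."""
    # s_l = feature at the l-th epipolar sample, concatenated with gamma(depth_l).
    tokens = [f + positional_encoding(d) for f, d in zip(sample_feats, sample_depths)]
    c = len(query_feat)
    # Assumption: keys are the feature part of s_l (stand-in for a learned K).
    scores = [sum(q * k for q, k in zip(query_feat, t[:c])) / math.sqrt(c) for t in tokens]
    weights = softmax(scores)
    # Assumption: values are the full tokens (stand-in for a learned V).
    update = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(len(tokens[0]))]
    return weights, update


weights, update = epipolar_attention(
    query_feat=[4.0, 0.0],
    sample_feats=[[0.0, 4.0], [4.0, 0.0], [0.0, -4.0]],
    sample_depths=[1.0, 2.0, 3.0],
)
print(weights)  # the second sample, whose feature matches the query, dominates
```

Because the matching sample receives almost all of the weight, the depth-encoding part of `update` is close to gamma(2.0), which is how the pixel feature comes to carry a depth expressed in the scene's own scale.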

After the epipolar cross-attention layer, each pixel feature F[u] contains a weighted sum of the depth positional encodings, where we expect that the correct correspondence gained the largest weight, and thus each pixel feature now encodes the scaled depth that is consistent with the arbitrary scale factor s_i. Following epipolar cross-attention, a residual convolution layer F += Conv(F) and a per-image self-attention layer F += SelfAttention(F) are applied. These enable our encoder to propagate scaled depth estimates to parts of the image feature maps that may not have any epipolar correspondences, such as occluded or textureless regions.

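Paragraph function: Completes the design by explaining how regions that epipolar attention cannot cover are handled.

Logical role: Pre-empts the objection that epipolar correspondence fails in occluded or textureless regions by using the residual convolution and self-attention as information propagation mechanisms. This shows that the authors understand the limitations of their own method.

Argumentative technique / potential weakness: Describing the role of self-attention as "propagate scaled depth estimates" is an intuitive explanation, but there is no experimental evidence directly showing that depth information is propagated effectively. An ablation showing the effect of removing the self-attention layer on occluded regions would be more convincing.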

4.2 Gaussian Parameter Prediction

The method leverages scale-aware feature maps to predict Gaussian primitive parameters. Every pixel in an image samples a point on a surface in the 3D scene. We thus choose to parameterize the scene via pixel-aligned Gaussians: for each pixel at coordinate u, we take the corresponding feature F[u] as input and predict the parameters of M Gaussian primitives. The baseline approach directly regresses the Gaussian center mu = o + d_u * d, where o is the ray origin, the depth d = g(F[u]) is predicted from the pixel feature, and the ray direction d_u is computed from the camera extrinsics T and intrinsics K. However, directly optimizing Gaussian parameters is susceptible to local minima, and we cannot rely on the spawning and pruning heuristics of 3D-GS because they are not differentiable, whereas the network that predicts the parameters must receive gradients.

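Paragraph function: Sets the baseline by first introducing the intuitive but flawed direct regression approach.

Logical role: Uses the classic contrast structure of showing the shortcomings of the simple approach before introducing the improved one, which motivates the probabilistic prediction that follows. The pixel-aligned design is itself an important architectural choice.

Argumentative technique / potential weakness: The pixel-aligned design means the number of Gaussian primitives grows in proportion to image resolution, which may produce a very large number of primitives for high-resolution images, although it also ensures full coverage of the scene. Predicting M primitives per pixel can handle translucency and depth discontinuities, but the effect of the specific value of M on performance is not discussed here.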
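
The baseline parameterization, mean = origin + depth * ray direction, can be sketched as follows. The example assumes a pinhole camera with identity camera-to-world rotation and a normalized ray direction; the intrinsics (`fx`, `fy`, `cx`, `cy`) and function names are illustrative choices, not values from the paper.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def ray_direction(u: float, v: float, fx: float, fy: float,
                  cx: float, cy: float) -> Vec3:
    """Unit ray direction through pixel (u, v), i.e. K^-1 [u, v, 1]^T normalized."""
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(sum(a * a for a in d))
    return (d[0] / norm, d[1] / norm, d[2] / norm)


def gaussian_mean(origin: Sequence[float], direction: Sequence[float],
                  depth: float) -> Vec3:
    """Place the Gaussian center at the given depth along the pixel's ray."""
    x, y, z = (o + depth * d for o, d in zip(origin, direction))
    return (x, y, z)


# The principal-point pixel looks straight down +z.
direction = ray_direction(u=128.0, v=128.0, fx=200.0, fy=200.0, cx=128.0, cy=128.0)
mean = gaussian_mean(origin=(0.0, 0.0, 0.0), direction=direction, depth=2.0)
print(direction, mean)
```

Every pixel contributes at least one such center, which is why the primitive count scales with image resolution.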

Rather than predicting the depth d of a Gaussian directly, our method predicts a probability distribution over the depth at which a Gaussian (i.e., a surface) exists along the ray through pixel u. This is implemented as a discrete probability distribution over a set of depth buckets. Near and far planes d_near and d_far are set, and depth is discretized into Z buckets represented by a vector b whose elements are spaced uniformly in disparity space. A discrete probability distribution p_phi(z) is defined over the index variable z, where phi_z is the probability that a surface exists in depth bucket b_z. The probabilities are predicted by a fully connected neural network f from the per-pixel feature F[u] and normalized via softmax. A per-bucket center offset delta is also predicted, allowing fine-grained depth adjustment within each bucket.

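Paragraph function: Core innovation. Replaces deterministic depth regression with a probability distribution.

Logical role: This is the most important technical contribution of the paper: "predict one depth value" becomes "predict a distribution over depth". Discretizing in disparity space makes distant depth buckets wider (matching how depth sensing behaves), and the per-bucket offset recovers the precision lost to discretization.

Argumentative technique / potential weakness: Discretizing in disparity space is a neat design choice that lets depth resolution fall off naturally with distance. However, the number of buckets Z is a critical hyperparameter: too few gives insufficient depth precision, and too many increases the computational load. In addition, softmax normalization assumes that all probability mass lies within [d_near, d_far], which may fail for scenes that extend beyond this range.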
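
A small sketch can show what bucketing in disparity space looks like. It assumes buckets of equal width in disparity (1 / depth) between the near and far planes and an offset in [0, 1] that selects a position inside each bucket; the exact parameterization in the paper may differ, and the function names are hypothetical.

```python
import math
from typing import List, Optional


def bucket_depths(d_near: float, d_far: float, num_buckets: int,
                  offsets: Optional[List[float]] = None) -> List[float]:
    """Depth per bucket, with buckets spaced uniformly in disparity (1 / depth)."""
    disp_near, disp_far = 1.0 / d_near, 1.0 / d_far
    width = (disp_near - disp_far) / num_buckets
    if offsets is None:
        offsets = [0.5] * num_buckets  # bucket centers
    return [1.0 / (disp_near - (z + offsets[z]) * width) for z in range(num_buckets)]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    es = [math.exp(x - m) for x in logits]
    total = sum(es)
    return [e / total for e in es]


depths = bucket_depths(d_near=1.0, d_far=100.0, num_buckets=4)
phi = softmax([0.0, 2.0, 0.0, 0.0])  # hypothetical network outputs for one pixel
for z, (d, p) in enumerate(zip(depths, phi)):
    print(f"bucket {z}: depth {d:.3f}, probability {p:.3f}")
```

The printed depths grow further apart with distance, which is the behavior described above: fine resolution near the camera and coarse resolution far away.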

To backpropagate gradients into depth bucket probabilities phi, the sampling operation z ~ p_phi(z) must be made differentiable. However, the sampling operation is inherently not differentiable. We employ a reparameterization trick inspired by variational autoencoders: we set the opacity alpha of a Gaussian to be equal to the probability of the bucket that it was sampled from. If z ~ p_phi(z), then alpha = phi_z. This means in each backward pass, we assign the gradients of the loss with respect to the opacities to the gradients of the depth probability buckets. Intuitively: if sampling produces a correct depth, gradient descent increases the opacity, leading it to be sampled more often, eventually concentrating probability mass in the correct bucket. If an incorrect depth is sampled, gradient descent decreases the opacity, lowering the probability of further incorrect predictions.

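Paragraph function: Removes the key obstacle by using a reparameterization trick to propagate gradients through a non-differentiable operation.

Logical role: This paragraph resolves the last technical obstacle of the probabilistic sampling scheme. Setting alpha = phi_z builds a bridge between opacity and probability, so that gradients of the rendering loss flow naturally into the depth prediction.

Argumentative technique / potential weakness: The intuitive explanation "correct depth -> increase opacity -> sample more often" is highly instructive and makes a complex mechanism approachable. But the scheme assumes that the rendering loss accurately reflects whether a depth is correct. In heavily occluded regions the rendering loss may be insensitive to depth errors, so the probability distribution may fail to converge correctly.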
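
The coupling alpha = phi_z can be sketched without an autodiff library by writing out the chain rule through the softmax by hand. This is a simplified illustration: it assumes a single pixel, a hypothetical upstream gradient dL/dalpha supplied by the renderer, and plain gradient descent on the bucket logits; all function names are made up for the example.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    es = [math.exp(x - m) for x in logits]
    total = sum(es)
    return [e / total for e in es]


def sample_bucket(phi: List[float], rng: random.Random) -> Tuple[int, float]:
    """Sample a bucket index z ~ p_phi(z) and set the opacity alpha = phi_z."""
    z = rng.choices(range(len(phi)), weights=phi, k=1)[0]
    return z, phi[z]


def grad_logits_from_alpha(phi: List[float], z: int, dloss_dalpha: float) -> List[float]:
    """Chain rule through alpha = softmax(logits)[z]:
    d alpha / d logit_j = phi_z * ((j == z) - phi_j)."""
    return [dloss_dalpha * phi[z] * ((1.0 if j == z else 0.0) - phi[j])
            for j in range(len(phi))]


logits = [0.0, 0.0, 0.0]
phi = softmax(logits)
z, alpha = sample_bucket(phi, random.Random(0))

# Suppose the sampled depth was correct: raising the opacity lowers the loss,
# so dL/dalpha is negative (the value -1.0 is a hypothetical stand-in).
grad = grad_logits_from_alpha(phi, z, dloss_dalpha=-1.0)
new_logits = [l - 0.5 * g for l, g in zip(logits, grad)]
new_phi = softmax(new_logits)
print(z, alpha, new_phi)
```

After the update, the sampled bucket holds more probability mass, so it will be drawn more often in later forward passes; with a positive dL/dalpha the same code lowers its probability instead.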

5. Experiments

The method is trained and evaluated on RealEstate10k, a dataset of home walkthrough videos downloaded from YouTube, and ACID, a dataset of aerial landscape videos. Both include camera poses computed by SfM software, necessitating the scale-aware design. Three novel-view-synthesis baselines are compared: pixelNeRF (which conditions neural radiance fields on 2D image features), GPNR (image-based light field rendering using epipolar lines), and Du et al. (which combines light field rendering with an epipolar transformer). To present a fair comparison, all baselines were retrained using the same data loaders and training curriculum, in which the inter-frame distance between reference views is gradually increased during training. Evaluation uses the PSNR, SSIM, and LPIPS metrics. The implementation employs a ResNet-50 and a ViT-B/8 vision transformer, both pre-trained with a DINO objective, with their outputs summed pixel-wise.

Paragraph function: Experimental setup. Details the datasets, baselines, evaluation metrics, and implementation.

Logical role: Builds the credibility needed for the quantitative results that follow. "Retrained using the same data loaders" is the key fairness guarantee and directly answers the question of whether the comparison is fair.

Argumentative technique / potential weakness: Emphasizing a fair comparison (same data loaders, same training curriculum) is good experimental practice. But the baselines may each have their own optimal training strategy, so whether a unified curriculum is truly fair is debatable. The DINO-pretrained ResNet-50 and ViT provide strong initial features, and it is not stated explicitly whether the baselines use pretrained models of the same caliber.
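
Since the results below are reported mainly in PSNR, a minimal reference for the metric may help when reading the numbers. The sketch assumes images flattened to lists of intensities in [0, 1]; SSIM and LPIPS are more involved and are not reproduced here.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1 on a [0, 1] scale, so the MSE is 0.01.
print(psnr([0.5, 0.5], [0.6, 0.4]))
```

Because the scale is logarithmic, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.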

Quantitative results demonstrate that pixelSplat outperforms all baselines on every metric. On ACID, pixelSplat achieves 28.14 dB PSNR; on RealEstate10k, it achieves 25.89 dB PSNR, with especially significant LPIPS improvements. Compared to Du et al., improvements are 1.26 dB on ACID and 1.11 dB on RealEstate10k. Qualitative results demonstrate that the method not only produces more accurate and perceptually appealing images, but also generalizes better to out-of-distribution examples. The method is also significantly less resource-intensive: for a single scene encoding followed by 100 image renderings (approximate sequence length), the cost is about 650 times less than the next-fastest baseline.

Paragraph function: Presents the core evidence by using quantitative data to show the method's advantage in both quality and efficiency.

Logical role: This paragraph is the empirical pillar of the argument and supports the central claim along three dimensions: (1) image quality (PSNR/SSIM/LPIPS); (2) generalization; (3) computational efficiency (about 650 times lower cost).

Argumentative technique / potential weakness: The 650-fold cost reduction is striking, but it rests on the specific scenario of encoding once and rendering 100 images. If only a few views are rendered, the amortized speedup drops considerably. In addition, the PSNR gains of 1.11 to 1.26 dB, although consistent across both datasets, may not be very noticeable visually.

Point cloud visualizations from out-of-distribution views show the method produces structured 3D representations. However, the authors acknowledge that while the resulting Gaussians facilitate high-fidelity novel-view synthesis for in-distribution camera poses, they suffer from the same failure modes as 3D Gaussians optimized using the original 3D Gaussian splatting method. Specifically, reflective surfaces are often transparent, and Gaussians appear billboard-like when viewed from out-of-distribution views. These limitations are inherent to the Gaussian splatting representation itself, not unique to the proposed method.

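Paragraph function: Honest disclosure of limitations. Acknowledges the method's failure modes under specific conditions.

Logical role: This paragraph shows academic integrity: after reporting successes, the authors volunteer the limitations and attribute them to the underlying 3D-GS representation rather than to pixelSplat itself. It is both a concession and a rhetorical strategy.

Argumentative technique / potential weakness: Attributing the failures to inherent limitations of the 3D-GS representation is a reasonable division of responsibility, but it also means that pixelSplat does not improve on these known problems. Whether the framework would benefit directly if future methods solve the billboard effect is a question worth exploring.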

5.3 Ablations and Analysis

The ablation study addresses three questions. Q1a: Is the epipolar encoder responsible for handling scale ambiguity? Comparing against a variant without epipolar encoding shows significant performance drops. Qualitatively, this produces ghosting and motion blur artifacts that are evidence of incorrect depth predictions. Visualization of epipolar attention scores demonstrates that the epipolar transformer successfully discovers cross-view correspondences. Q1b: Does frequency-based positional encoding of depth matter? Testing without depth encoding yields a performance drop of approximately 2 dB PSNR, highlighting that beyond simply detecting correspondence, the encoder uses the scene-scale encoded depths it triangulates to resolve scale ambiguity.

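Paragraph function: Component validation. Uses ablations to confirm, one by one, that the epipolar encoder and the depth positional encoding are necessary.

Logical role: Responds directly to the design decisions of the method sections: epipolar attention and positional encoding are not optional decoration but each contribute a large performance gain. A drop of about 2 dB is enough to show that the depth encoding matters.

Argumentative technique / potential weakness: Organizing the ablations as explicit questions and answers lets readers quickly locate the design choice they care about. The qualitative analysis of ghosting artifacts gives the quantitative metrics intuitive visual support. However, the ablations are run only on RealEstate10k, and results on ACID might differ.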

Q2: Does probabilistic depth prediction alleviate local minima? Ablating the sampling approach and directly regressing depth shows a performance drop of approximately 1.5 dB in PSNR (25.89 vs. 24.44), supporting the claim that predicting the depth of a Gaussian probabilistically is beneficial. Detailed ablation metrics on RealEstate10k, given as PSNR / SSIM / LPIPS: full model: 25.89 / 0.858 / 0.142; without epipolar encoder: 19.98 / 0.641 / 0.289; without depth encoding: 24.09 / 0.803 / 0.181; without probabilistic sampling: 24.44 / 0.820 / 0.175; with additional depth regularization: 25.28 / 0.847 / 0.152. These results confirm that every proposed component contributes to the final performance.

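Paragraph function: Quantitative validation. Supports the core contribution of probabilistic sampling with the complete ablation data.

Logical role: This paragraph uses the ablations as the final validation of the argument: the relative importance of the components (epipolar encoder > depth encoding > probabilistic sampling) is clearly visible. The contribution of the epipolar encoder (about 6 dB) is far larger than that of probabilistic sampling (about 1.45 dB), suggesting that resolving scale ambiguity is the more critical challenge.

Argumentative technique / potential weakness: The data are presented clearly and completely, and the five configurations let readers judge each component's contribution precisely. Notably, probabilistic sampling yields only about 1.45 dB (24.44 vs. 25.89), whereas the epipolar encoder yields nearly 6 dB. This suggests that local minima, although real, may matter less than scale ambiguity, which is in tension with the introduction presenting the two challenges side by side.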
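
The per-component PSNR differences quoted in the commentary can be recomputed directly from the ablation figures listed above. The dictionary keys below are shorthand labels chosen for this sketch.

```python
# PSNR values (dB) on RealEstate10k, copied from the ablation results above.
ablation_psnr = {
    "full": 25.89,
    "no_epipolar_encoder": 19.98,
    "no_depth_encoding": 24.09,
    "no_probabilistic_sampling": 24.44,
    "with_depth_regularization": 25.28,
}

# Drop relative to the full model, rounded to two decimals.
drops = {
    name: round(ablation_psnr["full"] - value, 2)
    for name, value in ablation_psnr.items()
    if name != "full"
}

for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name}: -{drop:.2f} dB")
```

The ordering makes the asymmetry explicit: removing the epipolar encoder costs roughly four times as much PSNR as removing probabilistic sampling.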

6. Conclusion

We have introduced pixelSplat, a method that reconstructs a primitive-based parameterization of the 3D radiance field of a scene from only two images. Our method is significantly faster than prior work on generalizable novel view synthesis while producing an explicit 3D scene representation. To solve local minima in primitive-based function regression, we introduced a novel parameterization of primitive location via a dense probability distribution and a novel reparameterization trick to backpropagate gradients into the parameters of this distribution. This framework is general, and we hope that our work inspires follow-up work on prior-based inference of primitive-based representations across applications.

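Paragraph function: Summarizes the contributions by restating the core method and technical innovations concisely.

Logical role: The conclusion mirrors the structure of the abstract and closes the argument: from the problem (efficiency and local minima) to the solution (probabilistic sampling and reparameterization) to the outcome (gains in both speed and quality).

Argumentative technique / potential weakness: The claim "This framework is general" leaves room for future extensions, but the paper validates only one task, novel view synthesis. The conclusion does not discuss the method's concrete limitations (such as being restricted to two input views and handling reflective surfaces poorly), which is a slight shortcoming in academic writing.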

Future directions include leveraging our model for generative modeling by combining it with diffusion models, or removing the need for camera poses to enable large-scale training on uncurated internet data. The ability to predict 3D Gaussian representations in a single forward pass opens opportunities for real-time 3D reconstruction in augmented reality and robotics applications, where both speed and interpretable 3D structure are critical requirements.

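Paragraph function: Looks ahead by outlining potential applications and directions for further research.

Logical role: Closes the paper with two explicit future directions (generative modeling and pose-free training), showing the extensibility and potential impact of the work.

Argumentative technique / potential weakness: Mentioning diffusion models follows the research trend of 2023-2024 and is forward-looking. "Removing the need for camera poses" is an ambitious and technically very difficult goal that shows the authors' research ambition. But moving from posed two-view input to pose-free reconstruction requires technical breakthroughs well beyond what the present framework can carry.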

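Overview of the Argument Structure

Problem: Differentiable rendering is memory-intensive, and light field methods lack a 3D representation.

→

Claim: 3D Gaussian primitives enable fast and interpretable reconstruction.

→

Evidence: On RealEstate10k and ACID, the method outperforms all baselines.

→

Rebuttal: Probabilistic sampling addresses local minima; epipolar attention addresses scale ambiguity.

→

Conclusion: Primitive-based feed-forward 3D reconstruction can serve as a general building block.

The Authors' Core Claim (in One Sentence)

By predicting a probability distribution over depth and maintaining gradient flow with a reparameterization trick, a high-quality, interpretable, and real-time-renderable 3D Gaussian splatting scene representation can be predicted from just two images in a single forward pass.

Strongest Point of the Argument

An elegant solution to scale ambiguity: the epipolar transformer combined with positionally encoded depth fuses classical multi-view geometry with modern attention. In the ablation, removing the epipolar encoder causes a collapse of nearly 6 dB, which strongly demonstrates that this component is indispensable. In addition, the roughly 650-fold reduction in the cost of encoding a scene and rendering 100 images makes the method practical for real applications.

Weakest Point of the Argument

The geometric limits of two-view input: reconstructing a 3D scene from two images is inherently constrained by limited view coverage, so failure modes on reflective surfaces and from out-of-distribution viewpoints are hard to avoid. Moreover, the improvement from probabilistic sampling (about 1.45 dB) is far smaller than that from the epipolar encoder (about 6 dB), suggesting that presenting the two challenges side by side in the introduction may overstate the severity of the local-minima problem.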