Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Abstract

A diffusion model learns to predict a vector field of gradients. We propose to apply the chain rule on the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This approach aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.

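Paragraph function: Overview of the whole paper. It concisely sketches the core innovation (applying the chain rule to diffusion-model scores), the technical realization (a voxel radiance field renderer), and the key challenge (distribution mismatch).

Logical role: The abstract both previews the method and exposes a problem. It first states the 2D-to-3D lifting strategy, then previews the distribution mismatch and its fix, and finally points to experiments with Stable Diffusion to show practicality.

Argumentative technique / potential weaknesses: Packaging the core contribution in the chain rule, a basic mathematical concept, is concise and highly general. However, the word "repurposes" suggests that no fine-tuning is needed, while the 3D optimization still requires many iterations of computation. In addition, naming only Stable Diffusion may lead readers to assume the method applies only to that model.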

1. Introduction

Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in image generation, inpainting, super-resolution, and text-to-image synthesis. These models learn to reverse a gradual noising process by predicting a score function — the gradient of the log probability density with respect to the data. The success of large-scale text-conditioned models such as DALL-E 2, Imagen, and Stable Diffusion has demonstrated the remarkable capability of diffusion models to capture complex visual distributions when trained on billions of image-text pairs.

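Paragraph function: Establishes the research area by summarizing what diffusion models have achieved in 2D image generation and the technical basis for it.

Logical role: Starting point of the argument chain. It first establishes diffusion models as the state-of-the-art approach to 2D generation, paving the way for the core question of how to lift them to 3D.

Argumentative technique / potential weaknesses: Listing DALL-E 2, Imagen, and Stable Diffusion gives the passage a strong sense of timeliness and persuasive force. The implicit assumption, however, is that the knowledge in 2D models transfers to 3D, and the soundness of that assumption needs mathematical support from the method section.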

However, training 3D generative models directly is fundamentally more challenging. The scarcity of large-scale, high-quality 3D datasets compared to their 2D counterparts presents a significant bottleneck. While neural radiance fields (NeRF) and other differentiable rendering techniques have enabled impressive 3D reconstruction from multi-view images, these methods typically require per-scene optimization and do not generalize to open-domain 3D generation. A natural question arises: can we leverage the rich visual priors captured by large-scale 2D diffusion models to guide the creation of 3D content?

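Paragraph function: Exposes the research gap by pointing out the data bottleneck in 3D generation and the limits of existing methods.

Logical role: Following the 2D achievements of the previous paragraph, this paragraph pivots to highlight the difficulties of the 3D setting and uses a rhetorical question to introduce the paper's core research question. The question leads readers to expect an answer.

Argumentative technique / potential weaknesses: Treating the scarcity of 3D datasets as the main bottleneck is reasonable, but the authors do not mention growing 3D datasets such as Objaverse. In addition, "open-domain 3D generation" is loosely defined, so readers may understand its scope differently.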

In this work, we introduce Score Jacobian Chaining (SJC), a principled framework that lifts pretrained 2D diffusion models for 3D generation by chaining the score function through the Jacobian of a differentiable renderer. Our key insight is that a 3D score can be computed as the vector-Jacobian product of the 2D score and the renderer Jacobian, aggregated over different camera viewpoints. We instantiate the 3D representation as a voxel radiance field and optimize it by following the estimated 3D score. To address the out-of-distribution problem that arises when evaluating the score on clean rendered images, we propose Perturb-and-Average Scoring (PAAS), which perturbs the rendered image with noise and averages the resulting scores to obtain a robust estimate.

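Paragraph function: Presents the solution with a complete overview of the core components and novel elements of the SJC framework.

Logical role: A direct response to the question posed in the previous paragraph; this is where the answer is revealed. Three key innovations unfold in order: (1) the mathematical basis in the chain rule; (2) the concrete realization as a voxel radiance field; (3) PAAS as the fix for the distribution mismatch.

Argumentative technique / potential weaknesses: Calling it a "principled framework" implies a rigorous mathematical basis and sets it apart from purely heuristic methods. Proposing PAAS also shows the authors' ability to identify and resolve technical obstacles. Still, whether a voxel radiance field is the best representation is debatable: later NeRF-based or 3D Gaussian-based representations may perform better.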

2. Related Work

Diffusion probabilistic models, also known as score-based generative models, define a forward process that gradually corrupts data with Gaussian noise and a reverse process that learns to denoise. The score function — the gradient of the log data density — is the central object learned by these models. Denoising score matching provides a tractable training objective, and recent advances in architecture design and training strategies have led to diffusion models surpassing GANs on image generation benchmarks such as FID on ImageNet. Classifier-free guidance has further enabled high-quality text-conditioned generation, powering systems like DALL-E 2, Imagen, and Stable Diffusion.

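Paragraph function: Literature review that sets up the technical background and terminology of diffusion models.

Logical role: Lays the groundwork for the derivations in the method section: the score function, denoising score matching, and classifier-free guidance are all used directly in the SJC formulation.

Argumentative technique / potential weaknesses: Positioning diffusion models as having surpassed GANs implies that a method built on the strongest foundation inherits its advantages. GANs still lead on some criteria, such as generation speed and controllability, so the comparison is somewhat one-sided.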

Neural implicit representations such as NeRF represent scenes as continuous functions mapping 3D coordinates to radiance and density. While highly effective for novel view synthesis from posed images, extending these to generative modeling has proven challenging. 3D-aware GANs such as pi-GAN, EG3D, and GRAF incorporate neural radiance fields into adversarial training but are typically limited to single object categories and require category-specific training data. Point-E and Shap-E train diffusion models directly on 3D representations, but require large paired 3D datasets that are expensive to curate.

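Paragraph function: Positions the work within the literature by systematically comparing the different routes to 3D generation and their limitations.

Logical role: Establishes how SJC differs: compared with direct methods that need 3D data and 3D GANs restricted to specific categories, SJC uses 2D diffusion priors to achieve open-domain generation.

Argumentative technique / potential weaknesses: By listing the shortcomings of each alternative, the authors make room for SJC's advantage of needing no 3D data. However, methods such as EG3D may far exceed SJC's open-domain results in quality on their specific categories, so the axes of comparison are selective.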

Concurrent with our work, DreamFusion proposes Score Distillation Sampling (SDS) to optimize a NeRF using gradients derived from a pretrained Imagen text-to-image diffusion model. While DreamFusion and SJC share the high-level goal of lifting 2D diffusion priors to 3D, they differ in their theoretical foundations. DreamFusion derives SDS from a probabilistic density distillation perspective, whereas SJC provides a score-based derivation grounded in the chain rule of calculus, yielding a clear mathematical interpretation of the 2D-to-3D lifting process. Our formulation predated DreamFusion's public release and offers a complementary theoretical lens on this family of techniques.

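Paragraph function: Differentiation through a head-on comparison with DreamFusion, the most important concurrent competitor.

Logical role: This paragraph is critical because it has to handle the relationship with DreamFusion carefully: it acknowledges the similar goal while stressing the difference in theoretical foundations. The "predated" timeline claim further shores up originality.

Argumentative technique / potential weaknesses: Framing the work as a "complementary theoretical lens" avoids direct confrontation and reads as collegial. Readers will nonetheless compare actual generation quality: if DreamFusion uses the stronger Imagen model and SJC uses Stable Diffusion, quality differences may come from the base model rather than the method.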

3. Method

3.1 Score Jacobian Chaining

Consider a 3D scene parameterized by theta and a differentiable renderer g that maps the scene parameters to a 2D image: x = g(theta). A pretrained 2D diffusion model provides the score function of the image distribution: the gradient of log p(x) with respect to x. Our goal is to obtain the 3D score — the gradient of log p(theta) with respect to theta — so that we can optimize the 3D scene to match the distribution learned by the 2D model. By the chain rule of calculus, the 3D score decomposes as the product of the Jacobian of the renderer and the 2D score.

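Paragraph function: Mathematical foundation. Sets up the chain-rule derivation from the 2D score to the 3D score.

Logical role: This is the mathematical core of the method. The chain rule of calculus bridges 2D and 3D, and the simplicity of the derivation is itself a source of persuasive force.

Argumentative technique / potential weaknesses: Using the chain rule, one of the most basic rules of calculus, as the core innovation has the elegance of handling complexity with simplicity. The derivation implicitly assumes the renderer is differentiable everywhere, though, and discretization in actual voxel rendering (such as the voxel grid) may break that assumption.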
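
The decomposition described in Section 3.1 can be written out explicitly. This is a sketch only: the symbol J for the renderer Jacobian is introduced here for illustration and is not taken from the paper. The transpose reflects the vector-Jacobian product that back-propagation computes.

```latex
\begin{align}
x &= g(\theta), & J &= \frac{\partial x}{\partial \theta}, \\
\nabla_\theta \log p(x) &= J^{\top} \, \nabla_x \log p(x).
\end{align}
```

Read this way, the 2D diffusion model supplies the vector \nabla_x \log p(x), and automatic differentiation through the renderer supplies the multiplication by J^{\top}, so the full Jacobian never has to be formed.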

To account for the fact that a single 2D view provides only partial information about the 3D scene, we aggregate the score over multiple camera viewpoints. Specifically, for each optimization step, we sample a camera pose pi, render the scene from that viewpoint to obtain a 2D image x_pi = g(theta, pi), compute the 2D score at x_pi, and back-propagate it through the renderer Jacobian. The aggregated gradient across viewpoints provides a Monte Carlo estimate of the full 3D score, steering the 3D scene parameters toward a configuration that appears realistic from all angles according to the 2D diffusion prior.

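Paragraph function: Extends the method from the single-view derivation to a practical multi-view aggregation scheme.

Logical role: Turns the theoretical formula into an executable algorithm: sample a viewpoint, render, compute the score, back-propagate. Introducing the Monte Carlo estimate connects the derivation to the actual optimization loop.

Argumentative technique / potential weaknesses: The Monte Carlo estimate is unbiased in theory, but in practice the uniformity and coverage of viewpoint sampling directly affect 3D consistency. With a poor sampling strategy, some viewpoints may be overemphasized, producing geometric bias in the generated object.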
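
The sample-render-score-backpropagate loop described above can be illustrated with a deliberately tiny toy. Everything in it is an assumption made for illustration: the "scene" is six numbers, each "view" sees only two adjacent numbers, and the 2D score is the gradient of a unit-variance Gaussian centred on a target rendering rather than a diffusion model. It is not the paper's implementation.

```python
import random

N_PARAMS = 6  # size of the toy scene (hypothetical)
TARGET = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]  # scene the toy 2D prior prefers


def render(theta, view):
    """Toy renderer g(theta, pi): a view sees two adjacent scene parameters."""
    return [theta[view], theta[(view + 1) % N_PARAMS]]


def score_2d(image, view):
    """Stand-in for the 2D score: gradient of a unit-variance Gaussian log-density."""
    target_image = render(TARGET, view)
    return [t - x for t, x in zip(target_image, image)]


def vjp(score, view):
    """Vector-Jacobian product J^T s for the selection renderer above."""
    grad = [0.0] * N_PARAMS
    grad[view] += score[0]
    grad[(view + 1) % N_PARAMS] += score[1]
    return grad


def optimize(steps=2000, lr=0.05, seed=0):
    """Follow the per-view 3D score estimates, one random view per step."""
    rng = random.Random(seed)
    theta = [0.0] * N_PARAMS
    for _ in range(steps):
        view = rng.randrange(N_PARAMS)            # sample a camera pose
        image = render(theta, view)               # render from that pose
        grad = vjp(score_2d(image, view), view)   # chain the 2D score through J^T
        theta = [p + lr * g for p, g in zip(theta, grad)]
    return theta
```

No single view constrains the whole scene, yet the parameters still settle on a configuration that every view agrees with, which is the point of aggregating over viewpoints.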

It is instructive to compare our formulation with Score Distillation Sampling (SDS) from DreamFusion. Both approaches yield similar update rules in practice: the gradient used to update the 3D parameters involves back-propagating a 2D diffusion model's prediction through the renderer. However, the derivation paths differ fundamentally. SDS is motivated by minimizing a KL divergence between the rendered image distribution and the diffusion model's learned distribution, dropping a Jacobian term that corresponds to the score of the rendering distribution. SJC derives the same gradient from the chain rule perspective, providing a direct probabilistic interpretation: the optimization follows the score of the 3D distribution induced by the 2D prior through the rendering process.

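Paragraph function: Theoretical comparison that examines the mathematical equivalence of SJC and SDS and where their derivations differ.

Logical role: Serves two purposes: (1) acknowledging the practical equivalence with DreamFusion, which shows academic honesty; (2) stressing the different derivation path, which protects theoretical originality.

Argumentative technique / potential weaknesses: Admitting to "similar update rules" may be read by critics as "essentially the same". The authors need to convince readers that the different derivation really does yield different understanding and possible improvements, rather than being a mathematical exercise that reaches the same destination by another road.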

3.2 The Out-of-Distribution Problem

A critical technical challenge arises when applying the above framework in practice. The diffusion model's denoiser is trained to operate on noisy inputs — images corrupted by Gaussian noise at various levels. However, in our pipeline, the renderer produces clean, noise-free images. When we directly evaluate the denoiser on these clean renderings, it receives out-of-distribution inputs that it has never encountered during training. The pixel values of rendered images stay within a bounded range (e.g., [-1, 1]), while the training data for the denoiser exists in a numerically larger range due to the added noise. This distribution mismatch causes the score estimates to be unreliable and can lead to catastrophic failure in the optimization.

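Paragraph function: Exposes a technical obstacle: the distribution mismatch that arises when moving from theory to practice.

Logical role: This is the "problem deepening" step: right after presenting the elegant chain-rule solution, the paper points out where its naive implementation fails. This rise-fall-rise structure heightens the dramatic tension around PAAS and the sense that it is necessary.

Argumentative technique / potential weaknesses: Explaining the abstract distribution mismatch through a concrete difference in numerical range ([-1, 1] versus a larger range) lowers the barrier to understanding. The phrase "catastrophic failure" underlines the severity of the problem and gives ample motivation for the PAAS solution that follows.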

To understand this more concretely, recall that the denoiser is trained via denoising score matching at noise level sigma: given a clean image x_0 and a noisy version x_sigma = x_0 + sigma * epsilon, where epsilon is standard Gaussian noise, the denoiser learns to predict x_0 from x_sigma. For any nonzero noise level sigma, the denoiser has therefore seen essentially only noisy inputs during training. Evaluating it at that noise level on a clean rendered image x = g(theta) is extrapolation beyond the training distribution. This problem is not merely theoretical: in our experiments, naive score evaluation on clean images produces incoherent, artifact-ridden 3D outputs.

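Paragraph function: Deepens the analysis of the distribution mismatch from both a mathematical and an experimental angle.

Logical role: Moves from the conceptual description in the previous paragraph to the concrete mechanism (the denoising score matching training objective), then backs it with experimental observation. The "not merely theoretical" turn reinforces the practical importance of the problem.

Argumentative technique / potential weaknesses: Citing their own negative result ("artifact-ridden 3D outputs") adds credibility. But the authors do not show visualizations of these failure cases, so readers must rely on the textual description to judge how severe the problem is.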

3.3 Perturb-and-Average Scoring (PAAS)

To resolve the distribution mismatch, we propose Perturb-and-Average Scoring (PAAS). The idea is intuitive: instead of evaluating the score directly on a clean rendered image, we first perturb the image by adding Gaussian noise, then evaluate the score on the noisy version, and finally average over multiple noise perturbations. Formally, given a rendered image x = g(theta), we compute the score estimate as the expectation of the score evaluated at x + sigma * epsilon, where epsilon is drawn from a standard Gaussian. This ensures that the denoiser always receives inputs within its training distribution, while the averaging recovers a valid score estimate for the original clean image.

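Paragraph function: Presents the core solution: the PAAS mechanism for resolving the distribution mismatch.

Logical role: This is the final "rise" in the rise-fall-rise structure: after identifying and analyzing the problem, the paper offers an intuitive and effective solution. The word "intuitive" signals that the fix is a natural one.

Argumentative technique / potential weaknesses: The PAAS idea is simple and effective: add noise to return to the training distribution, then average to cancel the effect of the noise. The cost of multiple perturbations may not be negligible, however, and the authors need to explain the trade-off between the number of perturbations actually required and the computational cost.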
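
The perturb, score, and average steps of PAAS can be sketched on a one-dimensional toy. The assumptions are loud ones: the "image" is a single number, the data prior is a Gaussian with hypothetical mean and standard deviation, and the denoiser is the closed-form posterior mean for that prior instead of a trained network. The sketch shows the mechanics only; because the toy score is linear, it does not exhibit the smoothing behaviour of a real model.

```python
import random

DATA_MEAN, DATA_STD = 0.3, 0.5  # toy 1D prior N(mean, std^2) (hypothetical values)


def denoiser(x_noisy, sigma):
    """Posterior mean E[x0 | x_noisy] for the Gaussian toy prior."""
    gain = DATA_STD**2 / (DATA_STD**2 + sigma**2)
    return DATA_MEAN + gain * (x_noisy - DATA_MEAN)


def score(x_noisy, sigma):
    """Score at noise level sigma: (denoiser output - noisy input) / sigma^2."""
    return (denoiser(x_noisy, sigma) - x_noisy) / sigma**2


def paas_score(x_clean, sigma, n_samples=64, rng=random):
    """Perturb the clean render with Gaussian noise, score each copy, average."""
    total = 0.0
    for _ in range(n_samples):
        x_noisy = x_clean + sigma * rng.gauss(0.0, 1.0)
        total += score(x_noisy, sigma)
    return total / n_samples
```

The denoiser is only ever called on noisy inputs, which is the property PAAS is designed to guarantee. The `n_samples` argument makes the cost trade-off raised in the commentary explicit: each extra perturbation is one more denoiser evaluation.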

Mathematically, we show that PAAS approximates the score at an inflated noise level of sqrt(2) * sigma. This can be understood as follows: perturbing the rendered image with noise at level sigma and then evaluating the denoiser (which was trained at noise level sigma) is equivalent to evaluating the score of a smoothed version of the data distribution. The key theoretical contribution is proving that this averaged score provides a consistent estimator of the true score, up to the smoothing induced by the noise perturbation. In practice, we adopt a coarse-to-fine annealing schedule for sigma, starting with large noise levels to establish global structure and gradually reducing them to resolve finer details.

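Paragraph function: Theoretical deepening that gives PAAS a mathematical guarantee and a practical strategy.

Logical role: Raises the intuitive PAAS idea to the level of theory: the "consistent estimator" proof gives the method a convergence guarantee. The coarse-to-fine annealing schedule shows the link from theory to practice.

Argumentative technique / potential weaknesses: The "consistent estimator" claim needs a rigorous proof, and the smoothing effect means the estimate always carries bias. The hyperparameters of the annealing schedule (initial sigma, decay rate) may significantly affect result quality, so the authors need to demonstrate robustness in the experiments.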
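
One possible shape for a coarse-to-fine sigma schedule is a geometric decay, sketched below. The text only states that sigma starts large and is reduced gradually, so the geometric form and the endpoint values `sigma_max` and `sigma_min` are assumptions chosen for illustration.

```python
def sigma_schedule(n_steps, sigma_max=10.0, sigma_min=0.05):
    """Geometric decay from sigma_max to sigma_min over n_steps values."""
    if n_steps < 2:
        return [sigma_max] * n_steps
    ratio = (sigma_min / sigma_max) ** (1.0 / (n_steps - 1))
    return [sigma_max * ratio**i for i in range(n_steps)]
```

A geometric schedule spends the same number of steps on each order of magnitude of noise, which is one way to probe the sensitivity to initial sigma and decay rate that the commentary raises.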

The PAAS mechanism also connects to Tweedie's formula in statistics. Under the Gaussian perturbation model, the denoiser's output can be interpreted via Tweedie's formula as the posterior mean estimate of the clean image. The score is then recovered as (denoiser output minus noisy input) divided by sigma squared. This interpretation provides additional theoretical grounding for our approach and connects it to a well-established statistical framework. The resulting algorithm is straightforward to implement: render, perturb, denoise, compute score, and back-propagate through the renderer.

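Paragraph function: Theoretical connection that ties PAAS to a classical result in statistics, Tweedie's formula.

Logical role: Linking to Tweedie's formula gives PAAS theoretical legitimacy beyond this paper's framework. The last sentence reduces the mathematics to a five-step algorithm, showing that theory and practice line up.

Argumentative technique / potential weaknesses: Citing Tweedie's formula is a shrewd scholarly move that borrows the authority of established theory to back a new method. The concise five-step description lowers the barrier to implementation and strengthens the impression that the method is practical.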
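
The Tweedie relation and the PAAS estimate described in Section 3.3 can be written compactly. The notation is an assumption introduced here for illustration: D(x; sigma) denotes the denoiser output at noise level sigma, and p_sigma denotes the data density smoothed by Gaussian noise of that level.

```latex
\begin{align}
x_\sigma &= x + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
\nabla_{x_\sigma} \log p_\sigma(x_\sigma) &= \frac{D(x_\sigma; \sigma) - x_\sigma}{\sigma^2}, \\
\mathrm{PAAS}(x, \sigma) &= \mathbb{E}_{\epsilon}\!\left[ \frac{D(x + \sigma\epsilon; \sigma) - (x + \sigma\epsilon)}{\sigma^2} \right].
\end{align}
```

The second line is the "(denoiser output minus noisy input) divided by sigma squared" rule stated in the text, and the third line is its average over noise draws applied to the rendered image x = g(theta).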

3.4 Implementation Details

We represent the 3D scene as a voxel radiance field — a dense 3D grid where each voxel stores an opacity value and a color (or feature) vector. This representation is chosen for its simplicity and full differentiability: the rendering operation (ray marching through the voxel grid) is straightforward to implement with automatic differentiation, and the Jacobian of the renderer with respect to the voxel parameters is well-defined everywhere. We use a grid resolution of 64 cubed in our default configuration, which balances detail and computational cost.

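Paragraph function: Implementation choice. Explains the concrete form of the 3D representation and the reasons for it.

Logical role: Makes the abstract "differentiable renderer" concrete as a voxel radiance field, justified by "full differentiability", which echoes the differentiability requirement in the chain-rule derivation.

Argumentative technique / potential weaknesses: Choosing a voxel radiance field prioritizes implementation simplicity over expressive power. A 64-cubed resolution limits geometric detail: high-frequency texture and sharp edges may be lost. Compared with NeRF's continuous representation or the later 3D Gaussian Splatting, the voxel representation is at a disadvantage in quality.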
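
Ray marching through an opacity-and-color grid reduces, along each ray, to front-to-back alpha compositing. The sketch below shows that step for a single ray with scalar colors. It assumes the standard volume-rendering quadrature and takes the per-sample opacities as given; whether the paper's renderer uses exactly this form, and how it interpolates the grid, is not stated in the text.

```python
def composite_ray(alphas, colors):
    """Front-to-back alpha compositing of samples along one ray.

    alphas: per-sample opacity in [0, 1]; colors: per-sample scalar color.
    Returns (pixel_color, weights), where weights[i] = T_i * alphas[i] and
    T_i is the transmittance, the product of (1 - alphas[j]) for j < i.
    """
    transmittance = 1.0
    pixel = 0.0
    weights = []
    for alpha, color in zip(alphas, colors):
        w = transmittance * alpha
        weights.append(w)
        pixel += w * color
        transmittance *= 1.0 - alpha
    return pixel, weights
```

Every operation here is a product or a sum, so the pixel is a smooth function of the voxel values, which is the property the chain-rule derivation relies on.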

The optimization is performed using gradient descent on the voxel parameters, guided by the SJC score at each step. We employ text-conditioned generation via classifier-free guidance when using text-to-image diffusion models: the score is computed as a weighted combination of the conditional and unconditional score estimates, with a guidance scale of 100 found to work well empirically. The optimization typically runs for 10,000 iterations with the Adam optimizer. We also apply total variation regularization on the voxel grid to encourage spatial smoothness and reduce noise artifacts. The entire pipeline requires no 3D training data — only a pretrained 2D diffusion model and a text prompt.

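Paragraph function: Hyperparameters and training strategy, giving the implementation details needed for reproduction.

Logical role: Supplies key information for reproducibility: guidance scale, iteration count, and regularization strategy. The last sentence again stresses the core selling point of needing no 3D data.

Argumentative technique / potential weaknesses: A guidance scale of 100 is very high compared with the 7 to 15 commonly used in standard image generation, which suggests that very strong text guidance is needed to produce meaningful 3D structure. Total variation regularization is a relatively crude smoothness constraint and may fall short at preserving sharp geometric edges.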
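
The two ingredients named above, the guided score and the total variation penalty, can be sketched in a few lines. Both forms are assumptions: the text says only that the score is a "weighted combination" of conditional and unconditional estimates, so the common classifier-free guidance form is used here, and the anisotropic L1 variant of total variation is one of several possible choices.

```python
def guided_score(score_uncond, score_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional score toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(score_uncond, score_cond)]


def total_variation(grid):
    """Anisotropic L1 total variation of a 3D nested list of voxel values."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    tv = 0.0
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                v = grid[i][j][k]
                if i + 1 < nx:
                    tv += abs(grid[i + 1][j][k] - v)
                if j + 1 < ny:
                    tv += abs(grid[i][j + 1][k] - v)
                if k + 1 < nz:
                    tv += abs(grid[i][j][k + 1] - v)
    return tv
```

With a guidance scale of 100, any difference between the conditional and unconditional scores is amplified a hundredfold, which illustrates how strongly the text prompt steers the update.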

For the diffusion backbone, we demonstrate results using several off-the-shelf models: DeepFloyd-IF, Stable Diffusion v1.5, and a custom-trained model on specific categories. When using Stable Diffusion, which operates in a latent space via a VAE encoder-decoder, we apply the score Jacobian chaining through both the decoder and the diffusion model, effectively chaining three Jacobians: the voxel renderer, the VAE decoder, and the diffusion UNet. This demonstrates the generality of the chain rule approach — it naturally extends to multi-stage generative pipelines without requiring architectural modifications.

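Paragraph function: Demonstrates extensibility by explaining how SJC adapts naturally to latent-space diffusion models.

Logical role: Reinforces the generality argument: the chain rule applies not only to single-stage rendering but can chain any number of differentiable modules. Chaining three Jacobians is a natural corollary of the SJC framework.

Argumentative technique / potential weaknesses: Chaining three Jacobians is elegant in theory, but memory and compute costs may rise sharply in practice. The Jacobian computation for the VAE decoder may be a bottleneck, and the authors do not discuss the extra overhead relative to operating directly in pixel space.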

4. Experiments

We evaluate SJC on text-guided 3D generation using Stable Diffusion as the backbone. Given text prompts such as "a DSLR photo of a peacock on a surfboard," "a car made out of sushi," and "a temple", our method produces coherent 3D voxel representations that can be rendered from novel viewpoints. The generated objects exhibit recognizable geometry and plausible texture that align with the text descriptions. We observe that the coarse-to-fine noise schedule is critical: early iterations with high noise levels establish the global shape, while later iterations with lower noise refine surface details and textures.

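Paragraph function: Qualitative showcase that uses varied text prompts to demonstrate what the method can generate.

Logical role: Provides intuitive evidence that the method works: the varied prompts (a natural object, a creative composition, a building) show the potential for open-domain generation. The observation about the annealing schedule lends empirical support to the method's design.

Argumentative technique / potential weaknesses: The carefully chosen prompts cover different types of objects, but readers cannot tell whether these are cherry-picked best results. Without quantitative metrics (such as CLIP Score or a user study), quality assessment remains subjective.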

We compare SJC with DreamFusion both qualitatively and in terms of computational efficiency. Since DreamFusion uses the proprietary Imagen model (not publicly available), direct comparison requires care. Using the same Stable Diffusion backbone for both methods, SJC produces results of comparable visual quality. Notably, SJC's voxel-based representation enables faster optimization than NeRF-based representations used in DreamFusion, as voxel rendering avoids the per-ray network queries that dominate NeRF's computational cost. On the other hand, the fixed voxel resolution limits the geometric detail achievable by SJC compared to the continuous NeRF representation.

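Paragraph function: Baseline comparison. A fair head-to-head with DreamFusion, the most important concurrent method.

Logical role: Shows academic honesty: it points out the efficiency advantage while admitting the limits of voxel resolution. The wording "comparable visual quality" avoids overclaiming.

Argumentative technique / potential weaknesses: The unavailability of Imagen is neatly used as the reason a direct comparison is impossible, which sidesteps a potentially unfavorable quality comparison. The efficiency-versus-quality trade-off analysis comes across as objective, but "comparable" is vague enough that readers cannot judge the gap precisely.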

Ablation studies validate the importance of key design choices. Without PAAS (i.e., evaluating the score directly on clean images), the optimization diverges rapidly, producing noisy and incoherent 3D structures. This confirms the severity of the distribution mismatch problem and the necessity of our proposed solution. Removing total variation regularization leads to noisier voxel grids with floating artifacts. The noise annealing schedule also proves critical: using a fixed noise level throughout optimization produces either overly smooth results (high sigma) or noisy artifacts (low sigma), while the annealing schedule captures both global structure and fine detail.

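Paragraph function: Ablation validation. Removes the core components one at a time to show that each is necessary.

Logical role: This paragraph is the argumentative backbone of the experiments section, systematically validating three core design choices: PAAS, total variation regularization, and noise annealing. Each ablation maps directly to a design decision in the method section.

Argumentative technique / potential weaknesses: The three ablations are clearly structured, each tied to an explicit hypothesis. The PAAS ablation's "diverges rapidly" result is the most convincing and directly demonstrates how serious the distribution mismatch is in practice. But the ablations show only qualitative differences and lack quantitative analysis.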

5. Discussion and Conclusion

We have presented Score Jacobian Chaining, a method that lifts pretrained 2D diffusion models to 3D generation by applying the chain rule to propagate learned 2D scores through a differentiable renderer. Our framework provides a principled, score-based perspective on the emerging paradigm of distilling 2D generative priors for 3D content creation. The identification of the out-of-distribution problem and its resolution via PAAS constitute important practical contributions that enable reliable optimization.

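Paragraph function: Summary of contributions that restates the method's theoretical positioning and practical value.

Logical role: The opening paragraph of the conclusion echoes the research question from the introduction and the method overview from the abstract, closing the loop of the argument. "Principled, score-based perspective" again stresses the contrast with heuristic methods.

Argumentative technique / potential weaknesses: Framing the work within an "emerging paradigm" implies that SJC is an important contributor to a larger research trend. Repeated use of "principled" may wear on the reader, though, and given that the update rule resembles SDS, its power to differentiate may be limited.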

Limitations remain. The voxel representation constrains the geometric resolution, and generated objects sometimes exhibit the "Janus problem" — multi-faced artifacts where the model generates a front-facing view from multiple angles due to the lack of explicit view-dependent reasoning in the 2D diffusion prior. The optimization process is also computationally expensive, requiring several hours on a single GPU. Future work may address these through higher-capacity 3D representations, view-conditioned diffusion models, and more efficient sampling strategies. We believe the score-based formulation opens doors to incorporating additional priors — such as depth, normal, or symmetry constraints — in a principled manner.

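Paragraph function: Candid statement of limitations, listing the method's known shortcomings and directions for improvement.

Logical role: Shows academic honesty and a forward-looking view. Naming the "Janus problem" is especially valuable because it is a weakness shared by this family of methods. The suggested future directions give later researchers a clear roadmap.

Argumentative technique / potential weaknesses: Admitting limitations is a virtue in academic writing, and identifying the "Janus problem" shows a deep understanding of the problem. But a cost of "several hours" may become a major practical bottleneck in a fast-moving field, especially since later methods such as Instant3D have cut generation time substantially.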

In summary, Score Jacobian Chaining demonstrates that the mathematical structure of diffusion models — specifically, their learning of score functions — can be systematically leveraged for 3D generation through the elegant application of the chain rule. As 2D diffusion models continue to improve in quality and diversity, methods like SJC that bridge the 2D-3D gap without requiring 3D training data will become increasingly valuable for content creation, virtual reality, robotics, and other 3D-centric applications. The theoretical clarity of the score-based perspective may also inspire new connections between generative modeling, differentiable rendering, and inverse problems more broadly.

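Paragraph function: Big-picture outlook that places SJC in a broader academic and application context.

Logical role: The final paragraph closes by widening the view: it moves from the specific method to the level of a research paradigm, implying that SJC's influence extends beyond the specific application described in the paper.

Argumentative technique / potential weaknesses: Ending on applications such as virtual reality and robotics widens the method's potential reach. But current generation quality and speed are still far from what these applications need, so the outlook may be overly optimistic. Connecting SJC to inverse problems is an interesting theoretical direction but needs more concrete argument.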

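Argument structure overview

Problem: 3D data is scarce, so 3D generative models cannot be trained directly.
→
Claim: Chain the 2D score with the renderer Jacobian via the chain rule to lift it to 3D.
→
Evidence: Validation across multiple models and prompts; ablation studies confirm that each component is necessary.
→
Rebuttal: PAAS resolves the distribution mismatch; the annealing schedule balances coarse structure and fine detail.
→
Conclusion: The score-based perspective opens a new paradigm for 2D-to-3D lifting.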

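Authors' core claim (one sentence)

Propagating the score function of a pretrained 2D diffusion model into 3D through the Jacobian of a differentiable renderer via the chain rule, combined with Perturb-and-Average Scoring to resolve the distribution mismatch, enables text-guided 3D generation without any 3D data.

Strongest point of the argument

Elegance and generality of the mathematical basis: using the chain rule, one of the most basic rules of calculus, as the core innovation gives the method a clear theoretical interpretation. Proposing PAAS shows the ability to identify and systematically close the gap between theory and practice. The natural extension to chaining three Jacobians (renderer, VAE, UNet) further supports the generality of the framework.

Weakest point of the argument

Insufficient differentiation from DreamFusion: although the derivations differ, SJC and SDS yield similar update rules in practice, which weakens the practical value of the theoretical difference. The fixed resolution of the voxel radiance field limits geometric quality, and systematic defects such as the "Janus problem" are not effectively resolved. In addition, the lack of comprehensive quantitative evaluation (such as a systematic CLIP Score comparison) leaves the quality claims at the qualitative level.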