GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

Abstract

Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and therefore miss that our world is three-dimensional. Others do consider the 3D nature but do not scale to complex, multi-object scenes. In this paper, the authors propose GIRAFFE, which "represents scenes as compositional generative neural feature fields," making it possible to disentangle individual objects from the background, as well as each object's shape and appearance, without additional supervision.

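Paragraph function: Overview of the whole paper. The abstract moves step by step from image synthesis to controllable generation to 3D scenes, and ends by positioning GIRAFFE.

Logical role: The abstract both defines the problem and previews the solution. It first identifies a double gap, controllability and 3D awareness, then summarizes in one sentence how GIRAFFE addresses both.

Argumentative technique / potential weakness: The authors motivate the work with the intuitive statement that our world is three-dimensional, which is rhetorically persuasive. The claim of "no additional supervision" still has to be checked against the method section: the model does require a large collection of 2D images for training.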

1. Introduction

Recent advances in generative adversarial networks (GANs) have led to photorealistic image synthesis at unprecedented quality. However, most approaches treat the image generation process as a "black box" where the user has little control over the generated content. Incorporating a 3D representation into the generative model is a promising direction for achieving controllable image synthesis, as it enables explicit control over camera viewpoint and object arrangement.

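Paragraph function: Establishes the research area by pointing to what GANs have achieved and where they fall short on controllability.

Logical role: Starting point of the argument. It first acknowledges the capability of GANs, then names the "black box" limitation, which prepares the case for introducing a 3D representation.

Argumentative technique / potential weakness: The term "black box" carries a strong negative connotation, yet methods such as StyleGAN already achieve partial control through latent space manipulation. The paragraph simplifies what existing methods can do in order to highlight the advantage of 3D approaches.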

Prior works on 3D-aware image synthesis either use voxel-based representations, which are "limited in resolution due to their cubic memory growth," or rely on mesh-based approaches that require template meshes and category-specific knowledge. Recent neural implicit representations such as NeRF have shown remarkable quality but focus on single-scene reconstruction rather than generative modeling. A key limitation is that most existing methods represent a scene as a single entity, making it difficult to achieve compositional control.

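Paragraph function: Critiques existing methods by systematically listing the limitations of three kinds of 3D representation.

Logical role: Deepens the problem within a problem-solution argument. Moving from the memory bottleneck of voxels, to the prior knowledge that meshes depend on, to the non-compositional nature of implicit representations, it narrows down to the exact gap GIRAFFE aims to fill.

Argumentative technique / potential weakness: Each of the three families is faulted along a different dimension, implying that an ideal solution must overcome all of them at once. However, being non-generative is not an inherent limitation of NeRF, and several NeRF+GAN combinations have appeared since, so this framing is somewhat dated.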
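The "cubic memory growth" of voxel grids mentioned above can be made concrete with a little arithmetic. The sketch below assumes a dense grid storing 4 float32 values per voxel (for example RGB plus density); both the channel count and the data type are illustrative assumptions, not figures from the paper.

```python
def voxel_grid_bytes(resolution: int, channels: int = 4, bytes_per_value: int = 4) -> int:
    """Memory for a dense resolution^3 grid.

    `channels` and `bytes_per_value` are assumed values (4 float32 channels),
    chosen only to illustrate the scaling.
    """
    return resolution ** 3 * channels * bytes_per_value


for res in (64, 128, 256, 512):
    mib = voxel_grid_bytes(res) / 2 ** 20
    print(f"{res:>3}^3 grid: {mib:8.1f} MiB")
```

Doubling the resolution multiplies the memory by eight, which is why voxel-based generators tend to stay at low resolutions.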

The authors propose GIRAFFE, which represents scenes as compositional generative neural feature fields. The key idea is to model each object and the background with its own feature field. These fields are combined via a composition operator and volume-rendered at low resolution; a 2D neural renderer then upsamples the result to the final high-resolution image. This design allows disentangled control over each object's shape, appearance, and pose, as well as the camera viewpoint.

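Paragraph function: Presents the solution with a complete overview of GIRAFFE's architecture and core innovation.

Logical role: Following the problem statement in the previous paragraph, this paragraph is the turning point from "existing methods fall short" to "our approach." Compositional feature fields answer the single-entity limitation directly, and low-resolution rendering followed by upsampling answers the memory-efficiency concern.

Argumentative technique / potential weakness: Breaking a complex architecture into clear pipeline steps (feature fields -> composition -> volume rendering -> upsampling) makes it easy to follow. Rendering at low resolution and then upsampling may, however, compromise fine-detail fidelity, and the authors need to show experimentally that this trade-off is acceptable.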

2. Related Work

Generative Adversarial Networks have achieved remarkable results in high-resolution image synthesis. StyleGAN and its variants enable some control through latent space manipulation, but the disentanglement is "implicit and incomplete — entangled changes in viewpoint, shape, and appearance are common." Works on 3D-aware GANs incorporate 3D representations but typically model scenes as monolithic entities without compositional structure.

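Paragraph function: Literature review summarizing the progress and limits of GANs for controllable generation.

Logical role: Continues the critique from the introduction, restating in more technical terms the weakness of implicit disentanglement in 2D GANs and the lack of compositionality in 3D GANs.

Argumentative technique / potential weakness: "Implicit and incomplete" is accurate but subjective: along some dimensions, StyleGAN's latent space disentangles quite well. The authors may understate the potential of 2D methods in order to highlight the advantage of 3D methods.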

Neural Radiance Fields (NeRF) represent scenes as continuous volumetric functions parameterized by neural networks, mapping 3D coordinates to color and density. While producing state-of-the-art novel view synthesis, NeRF requires per-scene optimization from many posed images and does not naturally support generative modeling or compositional scene manipulation. GRAF extends this to a generative setting but models the entire scene as a single field.

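Paragraph function: Positions GIRAFFE as an extension of the line of work running from NeRF to GRAF.

Logical role: Establishes the key lineage NeRF -> GRAF -> GIRAFFE, showing how the methods evolved and what gap remained at each step.

Argumentative technique / potential weakness: The linear narrative presents GIRAFFE as the natural next step, and the logic is clear. It may also obscure parallel lines of work (such as voxel-based 3D GANs), leaving readers with the impression that this is the only reasonable direction.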

3. Method

3.1 Neural Feature Fields

Each entity in the scene (objects and background) is represented by a neural feature field. Given a 3D point x and viewing direction d, the network predicts a volume density sigma and a feature vector f instead of RGB color. The feature field for each object is conditioned on shape code z_s and appearance code z_a, sampled from standard Gaussian distributions. This formulation "allows us to independently sample shape and appearance for each object," achieving disentangled control at the object level.

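Paragraph function: First step of the method derivation, defining the basic form of a neural feature field.

Logical role: This is the mathematical foundation of the method. Replacing RGB with a feature vector is the key change relative to NeRF/GRAF, and it gives the later 2D upsampling stage a richer signal to work with.

Argumentative technique / potential weakness: Outputting features rather than color is a clever design choice: it defers fine appearance generation to a 2D CNN and greatly reduces the cost of volume rendering. It also means that final image quality depends partly on the 2D CNN rather than on the 3D representation, which may weaken 3D consistency.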
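To make the input/output contract of a feature field concrete, here is a minimal pure-Python sketch of a function mapping (x, d, z_s, z_a) to (sigma, f). It is not the authors' network: the weights are random and untrained, the sizes `FEATURE_DIM`, `LATENT_DIM`, and `HIDDEN` are hypothetical, positional encoding is omitted, and letting the density depend only on the position and the shape code is an assumption of this sketch.

```python
import math
import random

FEATURE_DIM = 8   # hypothetical feature size; not stated in this summary
LATENT_DIM = 4    # hypothetical size of z_s and z_a
HIDDEN = 16       # hypothetical hidden width


def make_layer(rng, n_in, n_out):
    """Random weight matrix standing in for trained weights."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


class FeatureField:
    """Toy field (x, d, z_s, z_a) -> (sigma, f); untrained, for illustration only."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.shape_layer = make_layer(rng, 3 + LATENT_DIM, HIDDEN)
        self.sigma_head = make_layer(rng, HIDDEN, 1)
        self.feat_layer = make_layer(rng, HIDDEN + 3 + LATENT_DIM, HIDDEN)
        self.feat_head = make_layer(rng, HIDDEN, FEATURE_DIM)

    def __call__(self, x, d, z_s, z_a):
        h = relu(linear(self.shape_layer, list(x) + list(z_s)))
        # Assumption of this sketch: density uses only position and shape code.
        sigma = max(0.0, linear(self.sigma_head, h)[0])
        g = relu(linear(self.feat_layer, h + list(d) + list(z_a)))
        f = linear(self.feat_head, g)  # feature vector instead of RGB
        return sigma, f


def sample_code(rng, dim=LATENT_DIM):
    """Draw a latent code from a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(42)
field = FeatureField()
z_s, z_a = sample_code(rng), sample_code(rng)
sigma, f = field((0.1, -0.2, 0.3), (0.0, 0.0, 1.0), z_s, z_a)
print(sigma, len(f))
```

Because the density branch in this sketch never sees z_a, resampling the appearance code changes the feature vector but leaves the density untouched, which is one simple way to keep shape and appearance separate.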

3.2 Scene Composition

Individual objects are placed in the scene via affine transformations that control translation, rotation, and scale. The scene is composed by "combining the individual feature fields using a density-weighted mean" of the feature vectors. Specifically, at each 3D point, the composite density is the sum of all entity densities, and the composite feature is the density-weighted average. This simple composition operator enables adding or removing objects and controlling their individual poses without retraining.

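Paragraph function: Core innovation, describing the mechanism behind the compositional scene representation.

Logical role: This paragraph is the backbone of the paper's argument. Composition by density-weighted mean is both physically intuitive (denser objects occlude less dense ones) and mathematically simple, and it directly delivers the central promise of compositional control.

Argumentative technique / potential weakness: Composing a scene with a simple weighted mean is elegant, but it may be imprecise for scenes with complex inter-object occlusion, for example translucent objects or shadows at contact surfaces. The authors do not discuss the limits of this operator's physical accuracy.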
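The composition rule described in this section (sum the densities, take the density-weighted mean of the features) fits in a few lines. The sketch below also includes a hypothetical helper, `to_object_space`, that maps a scene-space point into an object's own frame by inverting a placement of the form k(p) = R * (scale * p) + translation. Uniform scale and rotation about the z axis are simplifying assumptions of this example.

```python
import math


def to_object_space(x, translation, scale, yaw):
    """Invert k(p) = R(yaw) * (scale * p) + translation for a scene-space point x.

    Hypothetical helper: uniform scale and z-axis rotation are assumed here.
    """
    px, py, pz = (xi - ti for xi, ti in zip(x, translation))
    c, s = math.cos(yaw), math.sin(yaw)
    # Multiplying by the transpose of R undoes a rotation by `yaw` about z.
    rx, ry = c * px + s * py, -s * px + c * py
    return (rx / scale, ry / scale, pz / scale)


def compose(densities, features, eps=1e-10):
    """Combine per-entity (sigma_i, f_i) evaluated at one 3D point.

    Composite density: sum of sigma_i.
    Composite feature: density-weighted mean of f_i.
    """
    total = sum(densities)
    dim = len(features[0])
    if total <= eps:
        return 0.0, [0.0] * dim  # empty space: no entity contributes
    mixed = [
        sum(s * f[k] for s, f in zip(densities, features)) / total
        for k in range(dim)
    ]
    return total, mixed


sigma, f = compose([3.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
print(sigma, f)  # 4.0 [0.75, 0.25]
```

Adding or removing an object amounts to adding or removing one entry from the two lists, which is why no retraining is needed to change the number of objects.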

3.3 Rendering Pipeline

The rendering pipeline consists of two stages. First, the composite neural feature field is volume-rendered at a low resolution (e.g., 16x16 or 64x64) using classical volume rendering integration. This produces a 2D feature map rather than an image. Second, a 2D convolutional neural network acts as a neural renderer that upsamples the low-resolution feature map to the final high-resolution image (e.g., 256x256). This two-stage design is critical: "it allows us to keep the expensive 3D volume rendering at low resolution while still producing high-fidelity images," making the approach computationally tractable.

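Paragraph function: Efficiency design, explaining how computational cost is reduced while quality is preserved.

Logical role: Answers a likely objection about computational efficiency. The high cost of volume rendering is a common weakness of NeRF-style methods, and this paragraph shows how the architecture sidesteps that bottleneck.

Argumentative technique / potential weakness: Low-resolution 3D plus high-resolution 2D is a well-judged engineering compromise. The 2D upsampling network may, however, introduce artifacts that are inconsistent with the 3D scene: under large rotations, for example, the 2D CNN may "hallucinate" details that do not match the 3D geometry. The authors need to demonstrate multi-view consistency experimentally.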
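The two stages can be sketched as follows. `render_ray` applies the standard volume rendering quadrature (alpha compositing) to feature vectors rather than colors. `upsample_nearest` is only a stand-in for the learned 2D neural renderer: a real renderer would be a trained CNN, and nearest-neighbour upsampling is used here purely to show the change in resolution. Both function names are hypothetical.

```python
import math


def render_ray(sigmas, features, deltas):
    """Alpha-composite per-sample features along one ray.

    sigmas:   density at each sample
    features: feature vector at each sample
    deltas:   distance between consecutive samples
    """
    dim = len(features[0])
    out = [0.0] * dim
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for sigma, feat, delta in zip(sigmas, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(dim):
            out[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return out


def upsample_nearest(feature_map, factor=2):
    """Nearest-neighbour upsampling of an H x W grid (stand-in for a learned CNN)."""
    rows = []
    for row in feature_map:
        wide = [cell for cell in row for _ in range(factor)]
        rows.extend([list(wide) for _ in range(factor)])
    return rows


# An effectively opaque first sample hides everything behind it.
print(render_ray([1e9, 1.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]))  # [1.0, 0.0]
```

The saving comes from the ray count: a 16x16 feature map needs 256 rays, whereas rendering a 256x256 image directly would need 65,536, a factor of 256 more.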

4. Experiments

Experiments are conducted on CompCars, FFHQ, CLEVR, and custom multi-object datasets. The model is trained using only unposed 2D images without 3D supervision. Results demonstrate that GIRAFFE achieves disentangled control over camera pose, object translation, rotation, shape, and appearance. On FFHQ at 256x256 resolution, GIRAFFE achieves FID scores competitive with state-of-the-art methods while additionally providing controllability. Compared to GRAF, the method shows significant improvements in image quality and controllability on multi-object scenes. Ablation studies confirm that both the compositional representation and the neural renderer are essential components.

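Paragraph function: Provides broad experimental evidence, validating the method on several benchmarks and along several axes.

Logical role: The empirical pillar, covering three dimensions: (1) qualitative validation of controllability; (2) quantitative comparison by FID; (3) ablation studies confirming that each component is necessary.

Argumentative technique / potential weakness: "FID scores competitive with state-of-the-art methods" is vague. If the FID is worse than that of a pure 2D GAN such as StyleGAN2, then the controllability gained from the 3D representation comes at the cost of image quality. The authors do not directly report the numerical FID gap to StyleGAN2.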
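For readers unfamiliar with the metric, the Fréchet Inception Distance compares the statistics of Inception-network features extracted from real and generated images, treating each set as a Gaussian. The symbols below are introduced here for illustration and do not appear in the summary: mu_r and Sigma_r are the mean and covariance of the real-image features, mu_g and Sigma_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, so "competitive FID" here means a score close to, though not necessarily below, that of the strongest baselines.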

5. Conclusion

GIRAFFE demonstrates that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. By representing scenes as compositions of neural feature fields and combining 3D volume rendering with a 2D neural renderer, the method achieves disentangled control over scene properties while maintaining competitive image quality. The approach learns this compositional structure from raw, unposed image collections without explicit 3D supervision, suggesting that 3D-aware generative models are a fruitful direction for controllable content creation.

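Paragraph function: Summarizes the paper, restating the core contribution and looking ahead to future directions.

Logical role: The conclusion mirrors the structure of the abstract, moving from the method back to the takeaway that 3D-aware generative models are a promising direction, which closes the argument.

Argumentative technique / potential weakness: The conclusion is suitably modest ("competitive" rather than "best"), but it does not discuss limitations in enough depth, such as handling complex lighting or scenes with many objects, or the possibility that the 2D CNN breaks 3D consistency. For a best-paper winner, readers would expect a more thorough outlook on future work.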

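Argument Structure Overview

Problem: GAN-based image synthesis lacks controllability and 3D understanding
→ Claim: Compositional neural feature fields enable disentangled control
→ Evidence: Validation on multiple datasets, showing controllability and competitive FID
→ Rebuttal: Low-resolution rendering plus a 2D CNN balances efficiency and quality
→ Conclusion: 3D-aware generative models are a direction for controllable content creation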

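Authors' Core Claim (One Sentence)

Representing scenes as compositional generative neural feature fields makes it possible to learn disentangled, controllable 3D scene generation from unstructured image collections without 3D supervision.

Strongest Point of the Argument

Elegance of the compositional design: composing the scene with a density-weighted mean preserves physical intuition (occlusion) while letting each object be manipulated independently. The two-stage pipeline of low-resolution 3D rendering followed by 2D upsampling strikes a strong balance between efficiency and quality, which makes the method practical.

Weakest Point of the Argument

3D consistency of the 2D upsampling: final image quality depends heavily on the 2D CNN renderer rather than purely on the 3D representation. Under extreme viewpoint changes, or in scenes where many objects densely occlude one another, 2D upsampling may produce artifacts that are inconsistent with the 3D geometry. In addition, scalability to complex scenes with many objects has not been fully validated.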