Specifying Object Attributes and Relations in Interactive Scene Generation

Abstract

This paper introduces a method for generating images from scene graphs that separates a layout embedding from an appearance embedding. This dual representation produces images that align better with the scene graph, have higher visual quality, and can represent more complex scenes. The technique can produce diverse outputs per scene graph, and the user can control them through two mechanisms: importing elements from other images and navigating object space via appearance archetypes. The system renders in real time, which supports interactive creation of novel scenes.

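Paragraph function: Overview of the whole paper, with the layout/appearance separation as the core differentiator.

Logical role: The abstract starts from the scene-graph-to-image generation task and emphasizes two unique selling points: dual embeddings and user control.

Argumentation technique / potential weakness: The real-time rendering claim strengthens the case for practicality. However, "improved scene graph alignment" needs a quantitative comparison against baseline methods to be convincing; stated on its own it carries little weight.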

1. Introduction

Inspired by David Marr's definition of vision -- discovering from images what exists and where -- this work employs scene graphs with object location and appearance attributes as an accessible interface for users to express image synthesis intentions. The "what" aspect captures object class hierarchies with appearance attributes from predefined clusters or sample images. The "where" aspect uses scene graphs representing spatial relationships like "above" or "left of." The methodology employs dual encoding: layout embedding captures relative positioning, while appearance embedding can be replaced independently, enabling object copying between images without affecting others.

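Paragraph function: Establishes the research territory by framing the problem through Marr's theory of vision.

Logical role: Citing David Marr's classic definition (what + where) gives the scene graph interface a grounding in cognitive science and leads naturally into the motivation for the dual encoding.

Argumentation technique / potential weakness: Opening with the name of a leading figure in cognitive science lends strong academic authority. However, Marr's what/where separation hypothesis is not uncontested in neuroscience, so using it as the basis for an engineering design risks over-interpretation.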
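
The independence of the two embeddings can be illustrated with a small data-structure sketch. It is not the authors' code: the `SceneObject` class, its field names, and the same-class check are assumptions made for illustration, and short tuples stand in for the learned embeddings.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One scene-graph node; field names are illustrative, not the paper's."""
    label: str                     # object class, e.g. "tree"
    layout: Tuple[float, ...]      # stand-in for the layout embedding
    appearance: Tuple[float, ...]  # stand-in for the appearance embedding


def transfer_appearance(scene: List[SceneObject], index: int,
                        donor: SceneObject) -> List[SceneObject]:
    """Return a new scene in which object `index` takes the donor's appearance.

    Only the appearance field of that one object changes; its layout and
    every other object are left untouched.
    """
    # Assumption: appearance is only copied between objects of the same class.
    if scene[index].label != donor.label:
        raise ValueError("donor and target should share an object class")
    updated = list(scene)
    updated[index] = replace(scene[index], appearance=donor.appearance)
    return updated


scene = [
    SceneObject("sky", (0.0, 0.1), (0.3, 0.3)),
    SceneObject("tree", (0.5, 0.6), (0.9, 0.1)),
]
donor = SceneObject("tree", (0.2, 0.2), (0.4, 0.8))  # taken from another image
edited = transfer_appearance(scene, 1, donor)
```

After the call, `edited[1]` keeps its original layout but carries the donor's appearance, and `edited[0]` is unchanged.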

Conditional image generation techniques have evolved from class-based and text-description synthesis to more structured inputs. Image translation methods like Pix2pix require paired matching samples. Scene graph-based image generation recently emerged, but prior work synthesizes images from input bounding box layouts at small resolutions without distinguishing layout from appearance. Interactive tools based on GAN dissection enable neuron manipulation but offer only class-level control rather than full instance control. This work provides more semantic manipulation relating to spatial object relations and more precise instance-level control.

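Paragraph function: Literature review, organized around the progressive refinement of control granularity.

Logical role: The progression from class-level to instance-level control positions the paper's contribution. The contrast with GAN dissection is especially effective: both are interactive tools, but this work offers finer control.

Argumentation technique / potential weakness: Listing "paired samples" and "low resolution" as weaknesses of prior methods implicitly claims that this work needs no pairing and supports higher resolution. However, a maximum resolution of 256x256 was still limited at the time.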

3. Method

The neural architecture comprises multiple components: a graph convolutional network (G) converting scene graphs to per-object layout embeddings; a CNN (M) converting layout embeddings to object masks; a parallel network (B) generating bounding box coordinates; an appearance embedding CNN (A) converting image information to vectors; a multiplexer combining masks and appearance information; and an encoder-decoder residual network (R) generating output images. Each scene graph object node contains object class encoding, location attributes (25 bits for 5x5 grid positioning + 10 bits for size), and appearance embeddings. Stochasticity is introduced via per-object random vectors enabling mask variation.

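Paragraph function: Architecture description, defining the five sub-networks and the data flow between them.

Logical role: The modular design exposes a clear structure: a G -> M/B/A -> multiplexer -> R pipeline in which each module has a well-defined job. The discretized location encoding (5x5 grid) reflects practical considerations.

Argumentation technique / potential weakness: The modular design lets each component be analyzed and improved independently. However, jointly training five sub-networks may suffer from unstable gradients, which makes balancing the loss terms especially critical. The 5x5 grid is very coarse and may limit precise spatial control.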
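
To make the 25 + 10 bit location attribute concrete, here is a minimal sketch of one plausible encoding. The summary only gives the bit counts, so the one-hot layout, the normalisation of inputs to [0, 1], and the function name `encode_location` are all assumptions.

```python
from typing import List

GRID = 5        # 5x5 grid -> 25 position bits
SIZE_BINS = 10  # ten size levels -> 10 size bits


def encode_location(cx: float, cy: float, size: float) -> List[int]:
    """Encode a normalised centre (cx, cy) and size, all in [0, 1], as 35 bits."""
    for value in (cx, cy, size):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs are assumed to be normalised to [0, 1]")
    col = min(int(cx * GRID), GRID - 1)  # clamp 1.0 into the last cell
    row = min(int(cy * GRID), GRID - 1)
    size_bin = min(int(size * SIZE_BINS), SIZE_BINS - 1)

    bits = [0] * (GRID * GRID + SIZE_BINS)
    bits[row * GRID + col] = 1        # one-hot grid cell (assumed layout)
    bits[GRID * GRID + size_bin] = 1  # one-hot size bin (assumed layout)
    return bits


code = encode_location(cx=0.5, cy=0.1, size=0.33)
nearby = encode_location(cx=0.41, cy=0.01, size=0.33)
```

Both calls fall into the same grid cell and size bin, so `code == nearby`. This is the coarseness the note above points to: positions that differ by almost a fifth of the image width can be indistinguishable to the model.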

3.1 Training Loss Terms

The optimization combines multiple loss terms: a reconstruction loss (L1 difference), a bounding box loss (MSE), a perceptual loss computed on VGG network activations, and adversarial losses from three discriminators. The mask discriminator is a Least Squares GAN conditioned on object class. The image discriminator checks that generated images match the ground truth layout and also penalizes counterfactual appearance vectors, a novel technique in which deliberately mismatched appearance embeddings are shown to the discriminator so that the generator cannot ignore the appearance input. The object discriminator checks that individual generated objects look realistic by examining cropped regions. Feature matching losses based on discriminator activations provide an additional supervisory signal.

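Paragraph function: Loss design, describing the multi-level supervision.

Logical role: The three discriminators act at different levels (mask, full image, individual object), forming coarse-to-fine adversarial supervision. Counterfactual training is a particularly clever design: it forces the model to use the appearance embedding faithfully.

Argumentation technique / potential weakness: Counterfactual training is one of the paper's key innovations and provides a safeguard for appearance-embedding fidelity. However, combining three discriminators with several other loss terms makes training highly complex, and the difficulty of hyperparameter tuning and the sensitivity to different datasets are not adequately discussed.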
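
A few of the loss terms named in this section can be sketched on plain Python lists to show how they combine. This is a toy illustration, not the paper's training code: the least-squares GAN terms follow the standard formulation, the function names are invented here, and the weights are hypothetical because the summary does not state them.

```python
from typing import Dict, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Reconstruction term: mean absolute difference."""
    return mean([abs(p - t) for p, t in zip(pred, target)])


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Bounding box term: mean squared error."""
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])


def lsgan_d_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Least-squares GAN discriminator loss: real scores -> 1, fake scores -> 0.

    Class conditioning would enter through the discriminator's inputs,
    which are outside the scope of this sketch.
    """
    real_term = mean([(s - 1.0) ** 2 for s in d_real])
    fake_term = mean([s ** 2 for s in d_fake])
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_g_loss(d_fake: Sequence[float]) -> float:
    """Least-squares GAN generator loss: push fake scores towards 1."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in d_fake])


def total_generator_loss(terms: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of the individual generator-side loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Toy numbers only; the weights are hypothetical, not the paper's.
terms = {
    "reconstruction": l1_loss([0.0, 1.0], [0.5, 0.5]),
    "box": mse_loss([0.2, 0.4], [0.2, 0.8]),
    "mask_adversarial": lsgan_g_loss([0.0]),
}
weights = {"reconstruction": 1.0, "box": 1.0, "mask_adversarial": 1.0}
total = total_generator_loss(terms, weights)
```

In a real setup the perceptual, feature matching, image discriminator, and object discriminator terms would be added to `terms` in the same way. Choosing their weights is the balancing problem raised in the note above.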

3.3 Interactive GUI

The GUI lets users select preexisting object appearances from 100 archetypes per class, obtained by applying the learned appearance network to training set objects and clustering the resulting embeddings with k-means. Archetypes are arranged linearly using a 1-D t-SNE embedding. Users place objects on a schematic layout in which each object is shown as a text string in one of ten font sizes that encode its size. Edge labels are inferred automatically from the objects' relative positions, which spares the user unnecessary input. This coarse placement proves more intuitive and less laborious than specifying the scene graph explicitly.

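Paragraph function: User interface design, describing how interactive control is realized in practice.

Logical role: This paragraph lifts "interactive scene generation" from a technical paper to a practically usable tool. Automatically inferring edge labels lowers the barrier to entry, so that non-technical users can operate it too.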
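
Automatic edge labelling from relative positions could look roughly like the sketch below. The summary does not give the actual rule, so the dominant-axis heuristic, the restriction to four relations, and the names `infer_relation` and `infer_edges` are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) with y growing downwards, as in image coordinates


def infer_relation(subject: Point, obj: Point) -> str:
    """Label the edge subject -> obj by the dominant axis of displacement."""
    dx = obj[0] - subject[0]
    dy = obj[1] - subject[1]
    if abs(dx) >= abs(dy):  # ties are resolved in favour of the horizontal axis
        return "left of" if dx > 0 else "right of"
    return "above" if dy > 0 else "below"


def infer_edges(centres: Dict[str, Point]) -> List[Tuple[str, str, str]]:
    """Build (subject, relation, object) triples for every pair of placed objects."""
    return [(a, infer_relation(centres[a], centres[b]), b)
            for a, b in combinations(centres, 2)]


edges = infer_edges({"sky": (0.5, 0.1), "tree": (0.3, 0.6), "dog": (0.7, 0.8)})
```

With these placements the sketch labels the sky as "above" both the tree and the dog, and the tree as "left of" the dog, so the user never types a relation by hand.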

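Argumentation technique / potential weakness: Representing object size by font size is a clever and intuitive GUI design. However, whether 100 archetypes are enough to cover the diversity of object appearance is debatable, especially for classes with high appearance variation (such as "person").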
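
The archetype step (cluster the appearance embeddings of one class, keep the cluster centres) can be sketched with a small pure-Python k-means. The deterministic initialisation from the first `k` points and the toy 2-D vectors are assumptions made to keep the example reproducible. The real system would use learned embeddings and 100 clusters per class, and the 1-D t-SNE ordering is not covered here.

```python
from typing import List, Sequence

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Plain Lloyd's algorithm; returns k centroids that serve as archetypes."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # assumption: deterministic init
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy stand-ins for the appearance embeddings of one object class.
embeddings = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
archetypes = kmeans(embeddings, k=2)
```

Here the two archetypes settle near (0.1, 0.1) and (5.03, 5.0). A class with many distinct appearance modes would need correspondingly more clusters, which is why a fixed budget of 100 may be tight for varied classes.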

4. Experiments

Experiments use the COCO-Stuff dataset (approximately 25,000 train, 1,000 validation, and 2,000 test images) at 64x64, 128x128, and 256x256 resolutions. The method outperforms the baselines by a significant margin on inception score, FID, and classification accuracy, with both ground truth and inferred layouts. In a user study with 20 participants, the method's outputs were rated more realistic 83.3% of the time and as adhering better to the scene graph 80.7% of the time. Ablation analysis shows that removing the perceptual loss is extremely detrimental and that, among the discriminators, removing the mask discriminator hurts the most.

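Paragraph function: Provides comprehensive quantitative and human-evaluation evidence.

Logical role: The user study (83.3% more realistic) is very strong evidence: the ultimate criterion for a generative model is human perception, and objective metrics (FID, IS) are only proxies.

Argumentation technique / potential weakness: Combining objective metrics with subjective evaluation is a sound validation strategy. However, a 20-participant user study is small, and all participants were "computer graphics and vision students", which may introduce expert bias.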
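
For orientation, FID is usually defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are the standard notation and are not taken from the paper under review: \mu_r, \Sigma_r are the mean and covariance of the real-image features, and \mu_g, \Sigma_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values mean the two feature distributions are closer. The metric compares feature statistics, not perceived quality, which is why it acts only as a proxy for human judgment.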

5. Conclusion

The authors present an interactive image generation tool that accepts scene graphs with optional location information. Each object receives both a location embedding and an appearance embedding; the latter can be transferred from other images, so an object can be duplicated even when the layout changes drastically. Beyond the dual encoding, the novel architecture and loss terms (including counterfactual training and multi-scale discriminators) improve performance over the baselines. The system enables a new form of semantic-level image manipulation that is more intuitive than pixel-level editing.

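Paragraph function: Summarizes the paper, stressing the advance in interactivity and semantic control.

Logical role: The conclusion repackages the technical contributions (dual encoding, counterfactual training) as an improvement in user experience (semantic-level manipulation over pixel-level editing), which raises the work's applied value.

Argumentation technique / potential weakness: Closing on the "semantic-level vs pixel-level" contrast positions the contribution clearly. However, under the resolution limit of the time (256x256) the system remained some distance from practical commercial use, and this limitation is not addressed directly.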

Overview of the Argument Structure

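Problem: scene-graph-to-image generation lacks fine-grained control
-> Claim: separating layout from appearance enables instance-level control
-> Evidence: 83.3% user preference; FID/IS beat the baselines
-> Rebuttal: counterfactual training ensures appearance-embedding fidelity
-> Conclusion: a semantic-level interactive image generation tool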

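The Authors' Core Claim (One Sentence)

By separating the layout and appearance embeddings in a scene graph and combining this with counterfactual adversarial training, interactive image generation with instance-level control becomes possible, letting users intuitively manipulate the position and appearance of individual objects.

Strongest Point of the Argument

The ingenuity of appearance transfer: separating the two embeddings allows the appearance of an object in one image to be transferred seamlessly into a completely different scene layout, which earlier scene generation methods could not do. Counterfactual training uses the adversarial setup to ensure that the appearance embedding is used faithfully, avoiding the degenerate mode in which the model ignores appearance information.

Weakest Point of the Argument

Limits on resolution and complexity: a maximum resolution of 256x256 was already limited at the time, and the cap of 8 objects per image restricts how complex a scene can be expressed. Training jointly optimizes five sub-networks and three discriminators, so stability and reproducibility may be a problem. The 20-person user study is also too small.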