Picture: A Probabilistic Programming Language for Scene Perception

Abstract

We propose Picture, a probabilistic programming language for scene perception that allows researchers to express complex generative models of images as short, readable probabilistic programs. Picture programs describe a scene generation process — placing objects in a 3D scene, rendering the scene using graphics software, and comparing the rendered image to the observed image. Inference in Picture is achieved through a combination of Markov Chain Monte Carlo (MCMC) sampling and efficient approximate likelihood computations. The key insight is that scene understanding can be formulated as "inverse graphics" — inferring the latent scene description that most likely generated the observed image. We demonstrate Picture on 3D human pose estimation, 3D object reconstruction, and multi-object scene parsing, showing that short probabilistic programs can achieve results competitive with task-specific state-of-the-art systems.

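Paragraph function: Overview of the whole paper. The core concept of "inverse graphics" ties together the research motivation, method, and applications.

Logical role: The abstract builds a three-layer structure, from the programming language (the tool) to inverse graphics (the theoretical framework) to multi-task validation (the empirical evidence), with each layer supporting the next.

Argumentative technique / potential weakness: The claim that "short probabilistic programs can achieve competitive results" is a highly attractive promise, since it implies that expressiveness and performance can be had together. However, "competitive" is left vague, and inverse-graphics methods usually cost far more computation than feed-forward neural networks.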
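
The inverse-graphics formulation in the abstract can be written as Bayesian inference over scene descriptions. The notation below is assumed here for illustration and is not taken from the source: S is the latent scene description, I_obs is the observed image, and R(S) is the image rendered from S.

```latex
P(S \mid I_{\mathrm{obs}}) \;\propto\; P\bigl(I_{\mathrm{obs}} \mid R(S)\bigr)\, P(S),
\qquad
S^{*} \;=\; \arg\max_{S}\; P\bigl(I_{\mathrm{obs}} \mid R(S)\bigr)\, P(S)
```

The first expression is the posterior that a sampler would target; the second is the single most probable scene under that posterior.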

1. Introduction

Scene perception — understanding the 3D structure of a scene from an image — is a fundamental goal of computer vision. Modern approaches typically train discriminative models (e.g., deep neural networks) that directly map from pixels to scene properties, treating the pipeline as a black box. While achieving impressive results on benchmarks, these approaches lack interpretability, require large labeled datasets, and do not naturally express compositional scene structure. An alternative paradigm is analysis by synthesis, or inverse graphics: hypothesize a 3D scene, render it using a graphics engine, compare the rendering to the observed image, and iteratively refine the hypothesis. This approach is inherently interpretable, compositional, and data-efficient.

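Paragraph function: Establishes the research field, opening with a binary opposition between discriminative and generative approaches.

Logical role: The argumentative strategy is to first acknowledge the success of discriminative methods and then systematically point out three of their limitations (not interpretable, data-hungry, non-compositional), which establishes the case for the generative paradigm.

Argumentative technique / potential weakness: The "black box" rhetoric resembles that of the GIRAFFE paper and implies a fundamental flaw in discriminative methods. But in 2015 deep learning was on the rise, and its "black box" character did not prevent it from dominating most tasks.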
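
The analysis-by-synthesis loop described above (hypothesize, render, compare, refine) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's system: the "scene" is just the position of a bright bar in a 1D image, `render`, `mismatch`, and `analysis_by_synthesis` are hypothetical names, and a greedy random search stands in for a real inference procedure.

```python
import random

def render(position, width=16):
    """Toy stand-in for a graphics engine: a 1D image with a 3-pixel bar."""
    return [1.0 if abs(px - position) <= 1 else 0.0 for px in range(width)]

def mismatch(rendered, observed):
    """Sum of squared pixel differences; lower means a better match."""
    return sum((r - o) ** 2 for r, o in zip(rendered, observed))

def analysis_by_synthesis(observed, steps=300, seed=0):
    rng = random.Random(seed)
    width = len(observed)
    hypothesis = rng.randrange(width)                     # hypothesize a scene
    best = mismatch(render(hypothesis, width), observed)  # render and compare
    for _ in range(steps):
        if rng.random() < 0.5:
            # Local refinement: nudge the current hypothesis by one pixel.
            candidate = min(width - 1, max(0, hypothesis + rng.choice((-1, 1))))
        else:
            # Occasional fresh hypothesis, so the search can leave flat regions
            # where the rendered bar does not overlap the observed one.
            candidate = rng.randrange(width)
        score = mismatch(render(candidate, width), observed)
        if score < best:                                  # keep improvements only
            hypothesis, best = candidate, score
    return hypothesis

# Usage: recover the bar position from an image rendered at position 5.
observed = render(5)
print(analysis_by_synthesis(observed))
```

The recovered hypothesis is itself the explanation of the image, which is the sense in which the approach is interpretable. The cost is one render per candidate.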

However, inverse graphics approaches have been difficult to implement, slow to run, and brittle in practice. We propose Picture to address these challenges: a probabilistic programming language specifically designed for scene perception. Picture provides high-level abstractions for defining 3D scene models, integrating graphics rendering engines, and performing efficient probabilistic inference. A researcher can express a scene model in just tens of lines of code, with the inference engine automatically handling the complex MCMC sampling. This separation of modeling (what the world looks like) from inference (how to compute the posterior) is a key design principle.

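Paragraph function: Proposes the solution, positioning the Picture language as a tool that makes inverse graphics practical.

Logical role: First acknowledges three practical difficulties of inverse graphics, then shows how Picture addresses each one: high-level abstractions (easier to implement), automatic inference (easier to run), and the separation of modeling from inference (easier to maintain).

Argumentative technique / potential weakness: The promise of "tens of lines of code" is very appealing, but code length does not necessarily track system complexity: the underlying rendering engine and inference machinery may still be very complex.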

2. Related Work

Probabilistic programming languages like Church, Venture, and Stan provide general frameworks for expressing and solving probabilistic models. However, they lack domain-specific primitives for vision tasks — such as 3D scene representation and rendering. Analysis-by-synthesis approaches in vision have been applied to faces, bodies, and indoor scenes, but each requires extensive custom engineering of both the generative model and the inference algorithm. Generative adversarial networks and variational autoencoders learn implicit generative models but do not provide explicit scene structure representations. Picture uniquely combines the expressiveness of probabilistic programming with domain-specific vision primitives and efficient graphics-based rendering.

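Paragraph function: Literature positioning, placing Picture at the intersection of probabilistic programming and computer vision.

Logical role: A three-way contrast: general-purpose probabilistic languages (no vision primitives), task-specific analysis-by-synthesis systems (no generality), and deep generative models (no explicit structure). Picture fills the space where the three intersect.

Argumentative technique / potential weakness: The triangulation strategy shows the research gap clearly. But the fact that GANs and VAEs "do not provide explicit structure" is not necessarily a drawback, since implicit representations are preferable in many applications.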

3. The Picture Language

3.1 Scene Programs

A Picture program defines a generative model of scenes and images. The program first samples latent scene parameters from prior distributions — such as the number of objects, their identities, poses, lighting conditions, and camera parameters. It then constructs a 3D scene using these parameters, placing 3D meshes at appropriate positions and orientations. The scene is rendered using an approximate graphics engine (based on OpenGL) to produce a synthetic image. Finally, the program scores the hypothesis by comparing the rendered image to the observed image using a likelihood function. The entire process is expressed as a concise probabilistic program of typically 20-50 lines.

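Paragraph function: First part of the method core, describing the structure and flow of a Picture program.

Logical role: A four-step pipeline (sample -> construct -> render -> score) defines the complete operational semantics of "inverse graphics". Each step corresponds to a different component of the probabilistic program.

Argumentative technique / potential weakness: The concrete figure of "20-50 lines" reinforces the promise of conciseness. But the domain gap between the approximate graphics engine and real images (the rendering gap) could seriously affect the accuracy of likelihood evaluation.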
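
The four steps of a scene program (sample, construct, render, score) can be sketched in plain Python. This is an illustrative sketch and not Picture syntax: the names `sample_scene`, `render`, `log_likelihood`, and `scene_program` are hypothetical, the "renderer" draws flat bars into a 1D image instead of calling OpenGL, and the Gaussian pixel-noise likelihood is an assumption chosen for simplicity.

```python
import random

WIDTH = 32  # size of the toy 1D "image" (illustrative choice)

def sample_scene(rng):
    """Step 1: sample latent scene parameters from simple priors."""
    count = rng.randint(1, 3)                       # number of objects
    return [{"x": rng.uniform(0, WIDTH),            # object position
             "size": rng.uniform(1.0, 4.0),         # object half-width
             "brightness": rng.uniform(0.5, 1.0)}   # stands in for lighting
            for _ in range(count)]

def render(scene):
    """Steps 2-3: place each object and rasterize it into an image."""
    image = [0.0] * WIDTH
    for obj in scene:
        for px in range(WIDTH):
            # A pixel is lit if its center falls inside the object's extent.
            if abs(px + 0.5 - obj["x"]) <= obj["size"]:
                image[px] = max(image[px], obj["brightness"])
    return image

def log_likelihood(rendered, observed, sigma=0.1):
    """Step 4: Gaussian pixel noise (assumed), up to an additive constant."""
    squared_error = sum((r - o) ** 2 for r, o in zip(rendered, observed))
    return -squared_error / (2 * sigma ** 2)

def scene_program(observed, rng):
    """One run of the generative program: a scene hypothesis and its score."""
    scene = sample_scene(rng)
    return scene, log_likelihood(render(scene), observed)

# Usage: score one random hypothesis against an all-dark observed image.
rng = random.Random(1)
scene, score = scene_program([0.0] * WIDTH, rng)
print(len(scene), score)
```

An inference engine would call such a program repeatedly, keeping or rejecting scene hypotheses according to their scores.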

3.2 Inference Engine

Inference in Picture amounts to sampling from the posterior distribution over latent scene parameters given the observed image. The inference engine combines several strategies: Metropolis-Hastings MCMC for exploring the scene parameter space, data-driven proposal distributions initialized by bottom-up detectors (e.g., CNN-based object detectors) to guide the sampler toward promising regions, and approximate likelihood computations that avoid pixel-exact comparisons. The data-driven proposals are crucial for practical efficiency: rather than exploring the vast parameter space uniformly, bottom-up recognition modules provide informed initial guesses that dramatically speed up convergence. This hybrid approach combines the flexibility of top-down generative models with the efficiency of bottom-up discriminative features.

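Paragraph function: Second part of the method core, describing the multi-strategy design of the inference engine.

Logical role: Responds to the objection that inverse graphics is "slow to run": data-driven proposal distributions are the key innovation for efficiency, so that MCMC is no longer a blind search.

Argumentative technique / potential weakness: The hybrid generative-plus-discriminative strategy is elegant in theory. But actual inference speed remains a bottleneck: even with good proposal distributions, MCMC convergence may require hundreds of renderings, far slower than the millisecond-scale inference of a feed-forward neural network.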
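
A minimal sketch of Metropolis-Hastings with a data-driven proposal is shown below for a single latent variable. It is not Picture's inference engine. Every name and parameter here is hypothetical, the "detector" is simply the brightest pixel (a stand-in for a CNN-based detector), and the likelihood is an exact pixelwise Gaussian instead of the approximate likelihood the text mentions. The proposal is a mixture of a detector-centered Gaussian and a random walk, and the acceptance ratio uses the full mixture density so that the chain still targets the posterior.

```python
import math
import random

WIDTH = 40  # toy 1D image size (illustrative)

def render(x):
    """Toy renderer: a smooth bump centered at position x."""
    return [math.exp(-0.5 * ((px - x) / 2.0) ** 2) for px in range(WIDTH)]

def log_posterior(x, observed, sigma=0.2):
    """Uniform prior on [0, WIDTH - 1] plus a Gaussian pixel likelihood."""
    if not 0.0 <= x <= WIDTH - 1:
        return -math.inf
    error = sum((r - o) ** 2 for r, o in zip(render(x), observed))
    return -error / (2 * sigma ** 2)

def normal_pdf(value, mean, std):
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def bottom_up_guess(observed):
    """Stand-in for a learned detector: index of the brightest pixel."""
    return float(max(range(WIDTH), key=lambda px: observed[px]))

def proposal_pdf(new, old, guess, mix, det_std, walk_std):
    """Density of proposing `new` from `old` under the mixture proposal."""
    return (mix * normal_pdf(new, guess, det_std)
            + (1 - mix) * normal_pdf(new, old, walk_std))

def metropolis_hastings(observed, steps=2000, mix=0.3,
                        det_std=4.0, walk_std=0.5, seed=0):
    rng = random.Random(seed)
    guess = bottom_up_guess(observed)
    x = rng.uniform(0.0, WIDTH - 1)          # start from a prior draw
    lp = log_posterior(x, observed)
    samples = []
    for _ in range(steps):
        if rng.random() < mix:
            new = rng.gauss(guess, det_std)  # data-driven jump
        else:
            new = rng.gauss(x, walk_std)     # local random walk
        new_lp = log_posterior(new, observed)
        if new_lp > -math.inf:
            # Hastings correction: reverse over forward proposal density.
            log_alpha = (new_lp - lp
                         + math.log(proposal_pdf(x, new, guess, mix, det_std, walk_std))
                         - math.log(proposal_pdf(new, x, guess, mix, det_std, walk_std)))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                x, lp = new, new_lp
        samples.append(x)
    return samples

# Usage: posterior mean of the bump position after discarding burn-in.
samples = metropolis_hastings(render(12.3))
tail = samples[len(samples) // 2:]
print(sum(tail) / len(tail))
```

The data-driven component lets the chain jump from a poor starting point to the neighborhood of the detector's guess, while the random-walk component refines the estimate locally. The detector-centered Gaussian is kept fairly wide (`det_std=4.0`) because a narrow independence proposal would rarely be accepted from states far from the guess.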

4. Experiments

We demonstrate Picture on three tasks. For 3D human pose estimation, a Picture program of approximately 50 lines specifies a body model with articulated joints, renders it, and infers joint angles from monocular images. On the HumanEva benchmark, Picture achieves results competitive with specialized pose estimation systems. For 3D face reconstruction, a 20-line program using a morphable face model achieves accurate 3D face shape and texture recovery. For multi-object scene parsing, Picture programs can infer the number, identity, position, and orientation of multiple objects in a scene, demonstrating the compositional nature of the approach. Across all tasks, the same inference engine is used, with only the generative program changing.

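Paragraph function: Multi-task experimental validation, using three different tasks to demonstrate Picture's generality.

Logical role: The key point is "same inference engine, different generative programs". This directly validates the design principle of separating modeling from inference, which is also a core advantage of probabilistic programming.

Argumentative technique / potential weakness: Line counts (20 lines, 50 lines) are a very persuasive indicator of conciseness. But the word "competitive" is vague: if Picture is merely "close to" the state of the art on every task, its practical value is limited.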
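
The "same engine, different program" idea can be illustrated with a generic inference function that knows nothing about the scene type it searches over. This is a hedged sketch: `Program`, `infer`, `bar_program`, and `dots_program` are hypothetical names, and a deliberately simple engine (keep the best-scoring of N prior samples) is swapped in for MCMC to keep the example short.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Program:
    """A generative program: how to sample a scene and how to render it."""
    sample_prior: Callable[[random.Random], Any]
    render: Callable[[Any], List[float]]

def infer(program, observed, num_samples=2000, seed=0):
    """Generic engine: best-of-N prior samples under squared pixel error."""
    rng = random.Random(seed)

    def score(scene):
        rendered = program.render(scene)
        return -sum((r - o) ** 2 for r, o in zip(rendered, observed))

    candidates = (program.sample_prior(rng) for _ in range(num_samples))
    return max(candidates, key=score)

# Program A: a single 3-pixel bar at an unknown integer position.
bar_program = Program(
    sample_prior=lambda rng: rng.randrange(16),
    render=lambda pos: [1.0 if abs(px - pos) <= 1 else 0.0 for px in range(16)],
)

# Program B: an unknown number (1 to 5) of dots spaced three pixels apart.
dots_program = Program(
    sample_prior=lambda rng: rng.randint(1, 5),
    render=lambda count: [1.0 if px % 3 == 0 and px // 3 < count else 0.0
                          for px in range(16)],
)

# Usage: the same `infer` call handles both programs unchanged.
print(infer(bar_program, bar_program.render(9)))
print(infer(dots_program, dots_program.render(4)))
```

Only the `Program` definitions differ between the two tasks; the engine is reused as is, which is the property the experiments section attributes to Picture.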

5. Conclusion

Picture demonstrates that probabilistic programming provides a powerful and flexible framework for scene perception. By expressing generative models as short programs and leveraging graphics engines for rendering within a probabilistic inference loop, we enable rapid prototyping of vision models that are interpretable, compositional, and data-efficient. The separation of modeling from inference allows researchers to focus on what the world looks like rather than how to compute with that knowledge. We envision Picture as a step toward "vision as inverse graphics" becoming a practical reality, complementing the strengths of discriminative deep learning approaches.

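Paragraph function: Summarizes the paper, restating the core contribution and looking ahead to the future of inverse graphics.

Logical role: The conclusion closes by describing Picture as "complementing" discriminative methods, which avoids a confrontational stance and shows a pragmatic scholarly outlook.

Argumentative technique / potential weakness: The "complementary" positioning is wise: it does not claim to replace deep learning but offers another perspective. However, the conclusion does not discuss the fundamental inference-efficiency bottleneck, nor does it give a concrete roadmap for combining Picture with deep learning methods.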

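Argument structure overview

Problem: Discriminative models lack interpretability and compositionality
-> Claim: Probabilistic programs plus inverse graphics deliver scene perception
-> Evidence: Three tasks, handled by programs of 20-50 lines
-> Rebuttal: Data-driven proposals speed up MCMC convergence
-> Conclusion: Inverse graphics complements discriminative deep learning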

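The authors' core claim (in one sentence)

By using a probabilistic programming language to formulate scene understanding as inverse-graphics inference, researchers can build interpretable, compositional vision models with only short programs, and achieve efficient inference through data-driven proposal distributions.

Strongest point of the argument

The design principle of separating modeling from inference: the same inference engine serves three very different tasks, which is strong evidence of the framework's generality. The conciseness of 20-50 lines of code greatly lowers the barrier to implementing inverse-graphics methods, turning them from a theoretical concept into a usable tool.

Weakest point of the argument

The fundamental computational-efficiency bottleneck: even with data-driven proposals, MCMC inference still requires many render-and-compare iterations and cannot compete on speed with feed-forward deep networks. The "competitive results" wording obscures the actual gap to state-of-the-art methods. In addition, the domain gap between the approximate graphics engine and real images is not adequately addressed.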