Abstract
Scene graph generation (SGG) aims to parse an image into a structured representation that captures objects and their pairwise relationships. However, existing SGG methods rely on bounding box-based object representations, which lack precise localization and cannot distinguish between stuff and thing categories. In this paper, we introduce Panoptic Scene Graph Generation (PSG), a new task that bridges panoptic segmentation and scene graph generation. PSG requires generating a scene graph grounded on panoptic segmentation masks rather than bounding boxes, providing a more comprehensive and precise structured understanding of visual scenes. We construct a large-scale PSG dataset with over 48,000 images and rich relation annotations, and propose baseline methods that demonstrate the challenges and opportunities of this new task.
Paragraph function
Whole-paper overview: proposes the new PSG task and previews the dataset and baseline contributions.
Logical role
The abstract lays out the full argumentative arc in advance: existing deficiency (bounding-box limitations) → new task definition (PSG) → supporting infrastructure (dataset + baselines).
Argumentation technique / potential gaps
The stuff-vs-thing distinction pinpoints the fundamental advantage of panoptic segmentation over bounding boxes. However, PSG's annotation cost is far higher than that of traditional SGG, a practical difficulty the abstract leaves unmentioned.
1. Introduction
Scene graph generation has emerged as a powerful paradigm for structured visual understanding, with applications ranging from visual question answering and image captioning to image retrieval and robotic manipulation. A scene graph represents an image as a directed graph where nodes correspond to objects and edges represent relationships between them. Despite significant progress, current SGG methods are fundamentally limited by their reliance on bounding boxes as the underlying object representation. Bounding boxes are inherently imprecise, especially for irregularly shaped objects or overlapping entities, and they cannot represent amorphous "stuff" categories such as sky, grass, or water.
Paragraph function
Establishes the problem: the fundamental limitation of bounding boxes as the underlying object representation for SGG.
Logical role
Starting point of the argument: it first affirms SGG's importance and broad applications, then identifies the fundamental inadequacy of bounding boxes, laying the groundwork for introducing panoptic segmentation.
Argumentation technique / potential gaps
Intuitive examples such as sky, grass, and water convey the importance of stuff categories and strengthen the reader's grasp of bounding-box limitations. That said, bounding boxes already suffice for many practical applications; whether a more precise representation is needed depends on the downstream task.
Panoptic segmentation, which unifies semantic segmentation for stuff regions and instance segmentation for thing objects, provides a natural and more complete visual representation for scene graph grounding. By replacing bounding boxes with panoptic masks, we can represent both things and stuff as graph nodes, enabling richer relationships such as "sky above building" or "road under car" that traditional SGG cannot express. This motivates our proposal of Panoptic Scene Graph Generation (PSG), which we formalize as: given an input image, produce a panoptic segmentation and a set of relation triplets (subject mask, predicate, object mask) that describe the interactions in the scene.
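The formalization above can be sketched as a minimal data structure. This is a hypothetical illustration, not the paper's actual API: the class and field names (`PanopticSceneGraph`, `RelationTriplet`, `seg_map`, etc.) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class RelationTriplet:
    """One (subject, predicate, object) triplet; subject and object are
    indices into the image's panoptic segments rather than raw masks."""
    subject_idx: int   # index of the subject segment
    predicate: str     # e.g. "above", "riding"
    object_idx: int    # index of the object segment


@dataclass
class PanopticSceneGraph:
    """PSG output for one image: a panoptic segmentation plus relations."""
    seg_map: np.ndarray                 # (H, W) int map; each pixel -> segment id
    segment_labels: List[str]           # class name per segment (things and stuff)
    relations: List[RelationTriplet] = field(default_factory=list)


# Toy example: "sky above building" grounded on a 2x2 panoptic map.
seg = np.array([[0, 0], [1, 1]])        # segment 0 on top, segment 1 below
psg = PanopticSceneGraph(seg, ["sky", "building"],
                         [RelationTriplet(0, "above", 1)])
print(psg.relations[0].predicate)       # a stuff-stuff relation box-based SGG cannot express
```

Because both nodes here are stuff segments, this triplet has no bounding-box equivalent, which is exactly the expressiveness gap the task targets.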
Paragraph function
Task definition: formally defines PSG and its advantages over traditional SGG.
Logical role
Transition from problem to solution: panoptic segmentation as the superior object representation leads naturally into the definition of the PSG task.
Argumentation technique / potential gaps
Concrete examples such as "sky above building" showcase PSG's unique capabilities persuasively. However, how the predicate vocabulary is defined, and the annotation consistency of stuff-stuff relations, deserve deeper discussion.
2. Related Work
Traditional scene graph generation methods follow a two-stage paradigm: first detecting objects with bounding boxes, then classifying pairwise relationships. Representative works include Neural Motifs, which leverages frequency baselines and LSTM-based context modeling, VCTree that constructs dynamic tree structures for context propagation, and Transformer-based methods such as RelTR that formulate SGG as a set prediction problem. A key limitation shared by all these methods is their inability to handle stuff categories, which constitute a significant portion of visual scenes but are excluded from bounding box detection. Moreover, the coarse localization of bounding boxes introduces noise in relationship classification, as the visual features extracted from boxes may include irrelevant background regions.
Paragraph function
Literature review: systematically surveys traditional SGG methods and deepens the analysis of their limitations.
Logical role
Uses multiple representative methods to show that the bounding-box limitation is universal, further reinforcing the motivation for PSG.
Argumentation technique / potential gaps
The "background noise" argument supplies a second flaw of bounding boxes, forming a one-two punch with "stuff cannot be represented". However, more precise region feature extraction (e.g. RoIAlign) already mitigates this problem to some extent.
Panoptic segmentation has seen rapid progress with methods like Panoptic FPN, MaskFormer, and Mask2Former, which produce unified segmentation masks for both things and stuff. However, these methods focus solely on pixel-level labeling and do not capture the relational structure between segmented regions. Our PSG task extends panoptic segmentation by additionally requiring the prediction of inter-object relationships, thereby providing a richer and more actionable scene representation than either SGG or panoptic segmentation alone.
Paragraph function
Two-way positioning: from the panoptic segmentation side, points out its lack of relational modeling.
Logical role
Positions PSG as the intersection of SGG and panoptic segmentation: SGG lacks precise localization, panoptic segmentation lacks relational modeling, and PSG offers both.
Argumentation technique / potential gaps
The "deficient in both directions" framing neatly carves out PSG's unique value. But whether combining two already challenging tasks makes the problem too hard, limiting achievable performance in practice, is a concern worth watching.
3. Method
We construct the PSG dataset by augmenting the COCO panoptic annotations with relation labels. Starting from the COCO 2017 panoptic segmentation dataset, we define a relation taxonomy of 56 predicates covering spatial relations (e.g., above, below, in front of), action relations (e.g., riding, eating, holding), and descriptive relations (e.g., made of, part of, belonging to). Each image is annotated with a set of relation triplets (subject segment, predicate, object segment) where subjects and objects are panoptic segments. The final dataset contains 48,749 images with an average of 5.3 relation triplets per image, providing a comprehensive benchmark for the PSG task.
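As a rough illustration of how per-image relation statistics like those above are derived, consider toy annotation records in the triplet format just described. The record schema here is an assumption for illustration, not the released dataset's exact format:

```python
# Hypothetical PSG-style annotation records: each image lists its panoptic
# segments and its relation triplets (subject index, predicate, object index).
annotations = [
    {"image_id": 1, "segments": ["person", "horse", "grass"],
     "relations": [(0, "riding", 1), (1, "standing on", 2)]},
    {"image_id": 2, "segments": ["sky", "building"],
     "relations": [(0, "above", 1)]},
]

num_images = len(annotations)
num_triplets = sum(len(a["relations"]) for a in annotations)
predicates = {p for a in annotations for (_, p, _) in a["relations"]}

avg = num_triplets / num_images
print(f"{num_images} images, {num_triplets} triplets, "
      f"{len(predicates)} predicates, {avg:.1f} triplets/image")
```

Run over the full dataset, the same aggregation yields the reported 48,749 images and 5.3 triplets per image.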
Paragraph function
Dataset construction: describes the design and scale of the PSG dataset.
Logical role
Provides infrastructure for the new task: a 56-predicate taxonomy and 48K+ images ensure the benchmark's diversity and scale.
Argumentation technique / potential gaps
The dataset is sizable (48K images), and building on COCO lowered construction cost. But with only 5.3 triplets per image on average, whether all the important relations in a scene are captured is debatable.
We propose two baseline frameworks for PSG. The two-stage approach (PSGFormer) first generates panoptic segmentation using Mask2Former, then extracts features for each segment and classifies pairwise relations using a transformer-based relation decoder. The relation decoder takes pairs of segment features as input, augmented with spatial encoding that captures the relative position and overlap between segments. The one-stage approach (PSGTR) extends DETR by jointly predicting segment masks and relations in a unified decoder, formulating PSG as an end-to-end set prediction problem. Both approaches use focal loss for the relation classification to address the severe class imbalance among predicates.
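The class-imbalance handling mentioned above can be sketched with a standard multi-class focal loss. This is a generic sketch, not the baselines' exact formulation; the default `gamma=2.0` is an assumption:

```python
import numpy as np


def focal_loss(logits: np.ndarray, target: int, gamma: float = 2.0) -> float:
    """Multi-class focal loss: FL = -(1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor down-weights well-classified examples, so
    rare, hard predicates contribute relatively more to the total loss
    than frequent, easy ones.
    """
    logits = logits - logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    p_t = probs[target]
    return float(-((1.0 - p_t) ** gamma) * np.log(p_t))


# A confident correct prediction is down-weighted far more than an uncertain one.
easy = focal_loss(np.array([5.0, 0.0, 0.0]), target=0)
hard = focal_loss(np.array([0.5, 0.0, 0.0]), target=0)
print(easy, hard)   # the easy example contributes much less loss
```

With `gamma=0` this reduces to ordinary cross-entropy; raising `gamma` pushes training effort toward the long tail of rare predicates.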
Paragraph function
Baseline methods: describes two PSG solutions with different design philosophies.
Logical role
Offers two complementary baselines (two-stage vs. one-stage), demonstrating the task's feasibility while establishing comparison points for future work.
Argumentation technique / potential gaps
Providing two approaches rather than a single one strengthens the study's comprehensiveness. However, the analysis of the performance gap between the two methods and of their respective use cases is limited.
4. Experiments
We evaluate the proposed baselines using standard SGG metrics adapted for PSG: Recall@K (R@K) and mean Recall@K (mR@K). PSGFormer achieves R@20 of 18.3 and mR@20 of 14.1, while PSGTR achieves R@20 of 16.6 and mR@20 of 12.8, indicating that the two-stage approach benefits from the stronger panoptic segmentation backbone. Compared to adapting traditional SGG methods (Neural Motifs, VCTree) to use panoptic segments instead of bounding boxes, both PSG-specific baselines show significant improvements, with PSGFormer outperforming adapted Neural Motifs by 4.2 mR@20. Analysis of per-predicate performance reveals that spatial relations (above, below) achieve the highest accuracy, while fine-grained action relations (playing, cooking) remain challenging.
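The two metrics can be sketched as follows. This is a simplified sketch: it matches triplets by exact label equality, whereas the actual PSG protocol additionally requires the predicted masks to overlap the ground-truth segments, and the function names are assumptions:

```python
from collections import defaultdict


def recall_at_k(gt_triplets, pred_triplets, k):
    """Fraction of ground-truth triplets recovered among the top-k
    predictions (predictions assumed sorted by confidence)."""
    topk = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in topk)
    return hits / len(gt_triplets)


def mean_recall_at_k(gt_triplets, pred_triplets, k):
    """Average of per-predicate recalls, so rare predicates count as
    much as frequent ones (the motivation for reporting mR@K)."""
    by_pred = defaultdict(list)
    for t in gt_triplets:                     # group ground truth by predicate
        by_pred[t[1]].append(t)
    recalls = [recall_at_k(ts, pred_triplets, k) for ts in by_pred.values()]
    return sum(recalls) / len(recalls)


gt = [("person", "riding", "horse"), ("sky", "above", "building"),
      ("road", "under", "car")]
pred = [("person", "riding", "horse"), ("sky", "above", "building"),
        ("tree", "beside", "house")]
print(recall_at_k(gt, pred, 3))        # 2 of 3 triplets recovered
print(mean_recall_at_k(gt, pred, 3))   # averaged over the 3 predicates
```

mR@K is the headline metric precisely because predicate frequencies are heavily skewed: a model that only predicts common spatial relations scores well on R@K but poorly on mR@K.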
Paragraph function
Empirical evidence: presents the baselines' quantitative results and comparative analysis.
Logical role
The results simultaneously validate PSG's feasibility (the baselines reach reasonable performance) and its difficulty (the absolute numbers leave ample room for improvement).
Argumentation technique / potential gaps
An absolute R@20 of 18.3 shows the task is genuinely hard, cleverly leaving substantial headroom for follow-up work. But such low baseline performance may also lead readers to question whether the task definition is too difficult or the metrics too strict.
We conduct ablation studies on key design choices. Adding spatial encoding to the relation decoder improves mR@20 by 2.1 points, confirming that relative spatial information between segments is crucial for relation prediction. Using panoptic masks instead of bounding boxes for feature extraction improves mR@20 by 1.8 points, validating the advantage of precise segment-level features. We also find that stuff-related relations account for 23% of all annotated triplets, underscoring the importance of including stuff categories in scene graph generation and the limitation of traditional box-based SGG approaches.
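The spatial encoding whose ablation is reported above can be sketched as a simple feature vector computed from a pair of segment masks. This is a hypothetical sketch: the paper's relation decoder may use a richer encoding, and the feature choices (centroid offset, area ratio, IoU) are assumptions:

```python
import numpy as np


def spatial_encoding(mask_a: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """Relative-position/overlap features for a segment pair:
    normalized centroid offset, area ratio, and mask IoU."""
    h, w = mask_a.shape
    ys_a, xs_a = np.nonzero(mask_a)
    ys_b, xs_b = np.nonzero(mask_b)
    dy = (ys_a.mean() - ys_b.mean()) / h       # vertical centroid offset
    dx = (xs_a.mean() - xs_b.mean()) / w       # horizontal centroid offset
    area_ratio = mask_a.sum() / mask_b.sum()
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return np.array([dy, dx, area_ratio, inter / union])


# "sky above building": sky occupies the top rows, building the bottom rows.
sky = np.zeros((4, 4), bool); sky[:2] = True
building = np.zeros((4, 4), bool); building[2:] = True
feat = spatial_encoding(sky, building)
print(feat)   # negative dy -> the subject's centroid lies above the object's
```

Even these four numbers already separate "above" from "below" (sign of `dy`) and "part of" from disjoint pairs (IoU), which is consistent with the ablation's finding that spatial cues matter for relation prediction.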
Paragraph function
Design validation: confirms the contributions of key components through ablation studies.
Logical role
The 23% stuff-relation statistic is the core argument: it demonstrates directly with data that box-based SGG misses nearly a quarter of scene relations.
Argumentation technique / potential gaps
The 23% figure is a powerful statistical argument. But the ratio may be shaped by the annotation guidelines: if annotators were encouraged to label stuff relations, the proportion could be artificially inflated.
5. Conclusion
We have introduced Panoptic Scene Graph Generation (PSG), a new task that combines panoptic segmentation with scene graph generation to produce a more complete and precise structured understanding of visual scenes. By grounding scene graphs on panoptic masks rather than bounding boxes, PSG enables the representation of both things and stuff in the relational graph, recovering the 23% of annotated relations that were previously inaccessible. We have provided a large-scale dataset with 48K images and 56 relation categories, along with two baseline methods that establish initial benchmarks for the community. The significant gap between baseline performance and human-level understanding indicates that PSG represents a challenging and impactful research direction for comprehensive visual scene understanding.
Paragraph function
Overall summary: restates PSG's contributions and closes on open questions.
Logical role
Uses "23% additional relations" as the quantitative argument summarizing PSG's core value, and the "performance gap" to motivate future research, forming an open-ended close.
Argumentation technique / potential gaps
Framing the task as "challenging" turns low baseline performance into a positive narrative (a research opportunity rather than a methodological shortcoming). But deployment feasibility in real applications, such as inference latency and the accuracy bottleneck of panoptic segmentation, is insufficiently discussed.