Object-Contextual Representations for Semantic Segmentation

Abstract

We address the problem of semantic segmentation with a simple yet effective approach called Object-Contextual Representations (OCR). The key idea is to explicitly represent the object region context and augment the pixel representation with object-contextual information. First, we learn soft object regions under the supervision of the ground-truth segmentation. Then we compute object region representations by aggregating pixel representations within each region. Finally, we augment pixel representations with object-contextual representations computed as a weighted aggregation of all object region representations. OCR achieves state-of-the-art performance on Cityscapes (84.5% mIoU), ADE20K (45.28% mIoU), and LIP (55.60% mIoU).

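Paragraph function: Overview of the whole paper. Defines OCR's core mechanism as a three-step pipeline and reports its performance.

Logical role: Establishes a progressive argument structure of "soft object regions → region representations → pixel augmentation".

Argumentation technique / potential weakness: The three-step description is clearly structured, but the soft object regions depend on the quality of the initial segmentation prediction.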

1. Introduction

The key challenge for semantic segmentation is learning good pixel representations. Current approaches rely on multi-scale context aggregation like ASPP (Atrous Spatial Pyramid Pooling) or PPM (Pyramid Pooling Module). These methods aggregate context from a fixed spatial range regardless of object structure, mixing representations from different categories and leading to suboptimal representations near boundaries. We argue that context should be object-aware: each pixel should be influenced by representations from the same object category.

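Paragraph function: Problem positioning. Criticizes existing context aggregation methods for being object-agnostic.

Logical role: Identifies the structural weakness of ASPP/PPM, which ignore object boundaries, and thereby establishes the need for an object-aware design.

Argumentation technique / potential weakness: Framing the fixed range of mainstream methods as a fundamental limitation is persuasive, but the argument may weaken as those methods improve.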
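
The boundary problem described above can be illustrated with a toy example. The sketch below is not ASPP or PPM; it is a plain fixed-window average over a hypothetical 1D row of pixels, assumed here only to show how a fixed spatial range blends features of two classes near their boundary. The feature values and the `fixed_window_context` helper are invented for illustration.

```python
import numpy as np

# Toy 1D "image": the first 4 pixels belong to class A, the last 4 to class B.
# The scalar features (+1 for A, -1 for B) are hypothetical illustration values.
features = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0])


def fixed_window_context(x, radius):
    """Average features over the window [i - radius, i + radius], clipped at the borders."""
    n = len(x)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out[i] = x[lo:hi].mean()
    return out


context = fixed_window_context(features, radius=2)
print(context)
```

Pixels far from the boundary keep a clean context (1.0 or -1.0), while the four pixels around the boundary receive a mixture of both classes (0.6, 0.2, -0.2, -0.6). This is the kind of category mixing that motivates an object-aware alternative.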

Our OCR approach characterizes pixels by exploiting corresponding object class representations. We first compute a coarse segmentation to identify object regions, then compute per-category object representations by aggregating features, and finally update each pixel with a weighted combination of object representations. Weights are computed based on pixel-object similarity, ensuring each pixel attends to the most relevant objects.

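Paragraph function: Method overview. Describes the three-stage operation of OCR.

Logical role: The flow from coarse segmentation to refined representations reflects a coarse-to-fine design philosophy.

Argumentation technique / potential weakness: The similarity-based weighting resembles attention, but operates on object regions as its unit, balancing efficiency and effectiveness.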

2. Method

The OCR module consists of three stages. The Soft Object Region stage computes a coarse segmentation map using an auxiliary head, providing soft pixel-to-class assignments. The Object Region Representation stage aggregates pixel representations weighted by soft assignments to obtain K object region representations. The Object Contextual Representation stage computes pixel-to-region similarities and augments each pixel with a weighted sum of object representations.

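Paragraph function: Core method. A complete description of the three-stage pipeline.

Logical role: The information flow is clear: pixels → object regions → augmented pixel representations, closing the loop.

Argumentation technique / potential weakness: The auxiliary head's soft assignment avoids the error amplification of hard decisions, a pragmatic design choice.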
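
The three stages can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits any learned projection layers, assumes a softmax over pixels for the soft region weights and a softmax over regions for the pixel-to-region similarity, and assumes the augmentation is a simple concatenation. The names `ocr_forward`, `pixel_feats`, and `coarse_logits` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ocr_forward(pixel_feats, coarse_logits):
    """
    pixel_feats:   (N, C) pixel representations (N = H * W pixels).
    coarse_logits: (N, K) class scores from the auxiliary head (K classes).
    Returns (N, 2C): each pixel feature concatenated with its object context.
    """
    # Stage 1: soft object regions. Normalizing over pixels (an assumption)
    # makes each region's weights sum to 1.
    region_weights = softmax(coarse_logits, axis=0)                # (N, K)

    # Stage 2: K object region representations, each a weighted
    # average of all pixel features.
    region_feats = region_weights.T @ pixel_feats                  # (K, C)

    # Stage 3: pixel-to-region similarity, normalized over the K regions,
    # then a weighted sum of region representations per pixel.
    similarity = softmax(pixel_feats @ region_feats.T, axis=1)     # (N, K)
    context = similarity @ region_feats                            # (N, C)

    # Augment each pixel with its object context (concatenation assumed).
    return np.concatenate([pixel_feats, context], axis=1)


rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))    # 6 pixels, 4 channels
logits = rng.normal(size=(6, 3))   # 3 classes
out = ocr_forward(feats, logits)
print(out.shape)  # (6, 8)
```

In a trained model the coarse logits would come from the auxiliary segmentation head; random inputs are used here only to check shapes.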

2.1 OCR Module Details

Formally, let f_k denote the object representation for the k-th class, computed as a weighted average of pixel features. The object-contextual representation for pixel i is a weighted sum of all K object representations, where the weights are determined by normalized dot-product similarity between the pixel and each object representation. This mechanism allows each pixel to selectively attend to the most relevant object context, naturally handling multi-scale objects and complex boundaries. Standard self-attention costs O(N^2) for N pixels; OCR reduces this to O(NK), where K is the number of categories, making it much more efficient.
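
The definitions above can be written out as follows. Apart from f_k, K, and the pixel index i, the notation is assumed for illustration: x_i is the representation of pixel i, \tilde{m}_{ki} is the soft weight of pixel i in region k (normalized over pixels), w_{ik} is the pixel-to-region weight, and y_i is the object-contextual representation. The softmax is one plausible reading of "normalized dot-product similarity", and any learned transformations are left out.

```latex
\begin{align}
f_k &= \sum_{i=1}^{N} \tilde{m}_{ki}\, x_i,
  \qquad \sum_{i=1}^{N} \tilde{m}_{ki} = 1, \\
w_{ik} &= \frac{\exp\left(x_i^{\top} f_k\right)}
               {\sum_{j=1}^{K} \exp\left(x_i^{\top} f_j\right)}, \\
y_i &= \sum_{k=1}^{K} w_{ik}\, f_k .
\end{align}
```

Each pixel compares itself against K region vectors rather than N pixel vectors, which is where the O(NK) cost comes from.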

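Paragraph function: Mathematical formalization. Precisely defines how OCR is computed and its complexity advantage.

Logical role: The normalized dot product keeps the module differentiable, so it can be trained end to end.

Argumentation technique / potential weakness: Compressing the attention keys from pixels to K classes greatly reduces computational complexity and is one of the core innovations.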
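
To make the O(N^2) versus O(NK) comparison concrete, the sketch below counts the number of similarity scores each scheme computes. The 128 x 256 feature map is a hypothetical size chosen for illustration, and K = 19 corresponds to the Cityscapes evaluation classes; the helper name `similarity_counts` is invented. The count ignores channel width and any projection layers, so it is a rough proxy rather than a FLOPs figure.

```python
def similarity_counts(height, width, num_classes):
    """Count pairwise similarity scores for self-attention (N * N) and OCR (N * K)."""
    n = height * width
    return {"self_attention": n * n, "ocr": n * num_classes}


counts = similarity_counts(height=128, width=256, num_classes=19)
print(counts["self_attention"])  # 1073741824
print(counts["ocr"])             # 622592
```

Under these assumptions the ratio is N / K, roughly 1,700 times fewer similarity scores for OCR.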

3. Experiments

On the Cityscapes test set, OCR with HRNet-W48 achieves 84.5% mIoU, setting a new state of the art. On the ADE20K validation set it reaches 45.28% mIoU, and on the LIP validation set 55.60% mIoU. Compared to baselines with the same backbone, OCR consistently improves mIoU by 1 to 2 points while adding less than 5% FLOPs.

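Paragraph function: Quantitative evaluation. Consistent improvements across three benchmarks.

Logical role: Consistent gains across multiple benchmarks establish the method's robustness; a +1 to +2% improvement at under 5% overhead is highly efficient.

Argumentation technique / potential weakness: Although the improvement is modest, it is stable and consistent at very small cost, which makes it highly practical from an engineering standpoint.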

Ablation studies confirm each component's contribution. Replacing OCR with standard self-attention reduces mIoU by 0.8%. Removing auxiliary segmentation loss reduces mIoU by 0.5%. The soft region assignment outperforms hard assignment by 0.3%, validating the design choice of using probabilistic rather than deterministic object regions.

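Paragraph function: Ablation analysis. Verifies that each design choice is justified.

Logical role: The comparison with standard self-attention directly answers the core question of why object-level aggregation is needed.

Argumentation technique / potential weakness: The ablation design is thorough; the soft versus hard assignment comparison further demonstrates the care taken in the design.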
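
The difference between soft and hard region assignment in the ablation can be sketched as follows. This is an illustrative toy, not the paper's experimental code: the feature and probability values are made up, the weights are normalized by a simple sum over pixels (an assumption), and `region_representation` is a hypothetical helper.

```python
import numpy as np


def region_representation(pixel_feats, class_probs, hard=False):
    """
    pixel_feats: (N, C) pixel features; class_probs: (N, K) per-pixel class probabilities.
    Returns (K, C) region representations as weighted averages of pixel features.
    """
    weights = class_probs
    if hard:
        # Hard assignment: each pixel contributes only to its argmax class.
        weights = np.eye(class_probs.shape[1])[class_probs.argmax(axis=1)]
    # Normalize each region's weights over pixels; eps guards against empty regions.
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ pixel_feats


# Three pixels, two channels, two classes (hypothetical values).
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])

soft = region_representation(feats, probs)
hard = region_representation(feats, probs, hard=True)
print(soft)
print(hard)
```

The third pixel is nearly ambiguous (0.55 versus 0.45). Hard assignment commits it entirely to the first class, while soft assignment lets it contribute to both regions in proportion to its probabilities, so a wrong argmax does less damage.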

4. Conclusion

We have presented OCR for semantic segmentation, a simple approach that explicitly leverages object-level context to improve pixel representations. OCR achieves state-of-the-art results with minimal overhead. We believe object-aware context aggregation is a principled direction for semantic segmentation, and even simple implementations yield significant improvements.

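Paragraph function: Summary. Restates the core contribution and the research direction.

Logical role: Calling it a "principled direction" raises the academic significance and hints at further potential in more sophisticated implementations.

Argumentation technique / potential weakness: Later work such as SegFormer has confirmed the continuing value of context-aware design.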

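Argument structure overview

Problem: context aggregation ignores object structure
→ Thesis: context should be object-aware
→ Method: soft regions + object representation aggregation
→ Evidence: state of the art on three benchmarks
→ Conclusion: object-aware context as a research direction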

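Core claim

Aggregating context at the level of object regions can significantly improve pixel representation quality for semantic segmentation at very small computational overhead.

Strongest point

Stable improvements on three benchmarks with under 5% computational overhead, and ablation experiments that verify the necessity of each component.

Weakest point

The improvement is limited (+1 to +2%), and the method depends on the quality of the initial coarse segmentation, so it may fail in extreme cases.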