Per-Pixel Classification is Not All You Need for Semantic Segmentation (MaskFormer)

Abstract

Modern approaches to semantic segmentation typically formulate the task as per-pixel classification. The authors argue that mask classification (predicting a set of binary masks, each associated with a single class label) is general enough to solve both semantic- and instance-level segmentation tasks in a unified manner. Following this insight, the authors propose MaskFormer, which uses a Transformer decoder to compute per-segment embeddings; each embedding yields both a class prediction and a mask embedding. MaskFormer achieves state-of-the-art results on ADE20K semantic segmentation (55.6 mIoU) and competitive performance on COCO panoptic segmentation (52.7 PQ), showing that a single mask classification model can simplify the landscape of segmentation approaches.

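Paragraph function: Overview of the whole paper. It challenges the dominant per-pixel classification paradigm and proposes mask classification as a unified alternative.

Logical role: The abstract's core argument is a paradigm shift, from "classify each pixel independently" to "predict a set of masks plus labels". The appeal of this shift is that it naturally unifies semantic and instance segmentation.

Argumentation technique / potential weakness: The title "Per-Pixel Classification is Not All You Need" is a bold negative claim that directly challenges the paradigm dominant since FCN. Mask classification is not a new idea, however: Mask R-CNN already uses it for instance segmentation. The authors' real contribution is extending the paradigm to semantic segmentation.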

1. Introduction

Since the introduction of Fully Convolutional Networks (FCN), per-pixel classification has been the de facto standard for semantic segmentation. This approach assigns a class probability distribution to each pixel independently, which is natural for semantic segmentation, where each pixel belongs to exactly one category. For instance-level tasks (instance segmentation and panoptic segmentation), however, per-pixel classification is fundamentally insufficient: these tasks require distinguishing different instances of the same class, which independent pixel-level classification cannot do. This has led to divergent methodologies: per-pixel classification for semantic segmentation, and detection-based approaches (such as Mask R-CNN) for instance segmentation. The authors propose that mask classification can serve as a universal paradigm for all segmentation tasks.

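Paragraph function: Problem identification. It exposes the fundamental insufficiency of per-pixel classification for instance-level tasks and points out the methodological split.

Logical role: This paragraph constructs the core problem, the methodological split in segmentation. Semantic segmentation uses one family of methods and instance segmentation another, which adds system complexity. The value of mask classification lies in removing this split.

Argumentation technique / potential weakness: Framing the methodological split as a problem is strategic, because it implies that a unified framework is valuable in itself. In practice, specialized methods often perform better on their own tasks. The value of a unified framework lies more in conceptual simplicity than in superior performance.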

Mask classification formulates segmentation differently: instead of classifying each pixel independently, the method splits the task into (1) partitioning the image into N regions (binary masks) and (2) associating each region with a class distribution over K categories. This formulation naturally handles a variable number of segments, which instance segmentation requires because the number of objects is unknown. It applies equally to semantic segmentation, where masks can overlap and are resolved via argmax aggregation. The key advantage is that a single model architecture can address semantic, instance, and panoptic segmentation without task-specific modifications.

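Paragraph function: Paradigm definition. It describes the mathematical formulation of mask classification and its generality.

Logical role: This paragraph is the conceptual cornerstone of the paper: segmentation is decomposed into two steps, region generation and region classification. This decomposition lets the same framework handle segmentation with a fixed number of categories (semantic) and with a variable number of instances (instance).

Argumentation technique / potential weakness: Presenting "a variable number of segments" as the key advantage of mask classification is persuasive. It also means the model must fix a maximum number of segments N in advance, and the choice of N affects performance. For semantic segmentation with many categories (for example, the 150 classes of ADE20K), the trade-off between N and computational cost needs careful consideration.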

2. Related Work

Per-pixel classification has dominated semantic segmentation since FCN, with subsequent works (DeepLab, PSPNet, SegFormer) all keeping this paradigm while improving feature extraction. For instance segmentation, Mask R-CNN popularized a detect-then-segment approach. DETR later introduced a set-prediction framework for object detection, using a Transformer decoder with learnable queries, and extended it to panoptic segmentation. Panoptic segmentation further exposed the limitations of per-pixel classification, requiring complex multi-branch architectures to merge semantic and instance predictions. The mask classification idea draws inspiration from DETR's set-prediction approach but applies it to all segmentation tasks, not just instance-level ones.

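Paragraph function: Literature review. It traces the evolution of two research lines, per-pixel classification and mask classification.

Logical role: This paragraph establishes the lineage from DETR to MaskFormer: DETR introduced set prediction for object detection (with a panoptic segmentation extension), and MaskFormer generalizes the idea to all segmentation tasks.

Argumentation technique / potential weakness: Acknowledging DETR's influence reflects academic integrity. It also signals that MaskFormer and DETR are technically close: the main differences are that MaskFormer predicts masks directly, without bounding boxes, and the inference strategy (argmax aggregation for semantic segmentation versus direct output for instance-level tasks). MaskFormer's novelty therefore lies more in conceptual insight than in technical innovation.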

3. Method

3.1 Mask Classification Formulation

Rather than assigning class probabilities to each pixel independently, mask classification splits the segmentation task into two sub-problems: (1) partitioning the image into N regions represented as binary masks {m_i}, and (2) associating each region with a probability distribution p_i over K+1 categories (including a "no object" class). The training objective uses Hungarian (bipartite) matching to find the optimal assignment between predicted and ground-truth segments, then combines a cross-entropy loss for class predictions with a binary mask loss for mask predictions (in MaskFormer, a focal loss plus a dice loss, following DETR). This formulation makes the number of predicted segments N independent of the number of categories K, so a single model can handle tasks with vastly different numbers of output segments.

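Paragraph function: Mathematical formulation. It defines the optimization objective and training procedure of mask classification.

Logical role: This paragraph is the mathematical foundation of the method. Hungarian matching gives the optimal correspondence between predicted and ground-truth masks during training, and the decoupling of N from K is the key to the architecture's generality.

Argumentation technique / potential weakness: The use of Hungarian matching is borrowed directly from DETR and is a mature technique. Its computational complexity is O(N^3), however, which may become a bottleneck when the number of predicted segments N is large. In addition, the "no object" class is less clearly motivated in semantic segmentation than in instance segmentation, since every pixel in semantic segmentation should belong to some category.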
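
The matching-then-loss procedure described in Section 3.1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are flattened lists of per-pixel probabilities, the mask loss is reduced to a single dice term to keep the sketch short, and a brute-force search over permutations stands in for the Hungarian algorithm (feasible only for tiny N). The function names `dice_loss`, `match`, and `mask_cls_loss` are hypothetical.

```python
import math
from itertools import permutations


def dice_loss(pred, target, eps=1.0):
    """Soft dice loss between predicted mask probabilities and a binary target."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def match(class_probs, masks, gt_labels, gt_masks):
    """Return a tuple `assign` where assign[j] is the prediction matched to GT j.

    Brute-force minimum-cost assignment; a stand-in for Hungarian matching.
    """
    def cost(i, j):
        # Low cost = high probability for the GT class and a well-overlapping mask.
        return -class_probs[i][gt_labels[j]] + dice_loss(masks[i], gt_masks[j])

    candidates = permutations(range(len(masks)), len(gt_labels))
    return min(candidates,
               key=lambda perm: sum(cost(i, j) for j, i in enumerate(perm)))


def mask_cls_loss(class_probs, masks, gt_labels, gt_masks, no_object):
    """Classification loss for every prediction, mask loss only for matched ones."""
    assign = match(class_probs, masks, gt_labels, gt_masks)
    loss = 0.0
    for j, i in enumerate(assign):
        loss += -math.log(class_probs[i][gt_labels[j]])
        loss += dice_loss(masks[i], gt_masks[j])
    for i in range(len(masks)):
        if i not in assign:  # unmatched predictions should predict "no object"
            loss += -math.log(class_probs[i][no_object])
    return loss


# Toy example: N = 3 predictions, K = 2 classes (index 2 is "no object"), 4 pixels.
class_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
masks = [[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.8], [0.2, 0.1, 0.1, 0.2]]
gt_labels = [0, 1]
gt_masks = [[0, 0, 1, 1], [1, 1, 0, 0]]

assignment = match(class_probs, masks, gt_labels, gt_masks)  # (1, 0)
loss = mask_cls_loss(class_probs, masks, gt_labels, gt_masks, no_object=2)
```

In this toy case ground-truth segment 0 is matched to prediction 1 and segment 1 to prediction 0, while prediction 2 is left unmatched and is penalized only through its "no object" probability. A practical implementation would replace the permutation search with a polynomial-time assignment solver.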

3.2 MaskFormer Architecture

MaskFormer consists of three modules. The pixel-level module extracts per-pixel embeddings using a backbone (e.g., ResNet or Swin Transformer) followed by a pixel decoder that gradually upsamples features to generate high-resolution embeddings. The Transformer decoder takes N learnable query embeddings and, through cross-attention with the image features, produces N per-segment embeddings. The segmentation module then generates (1) class predictions, via a linear classifier followed by a softmax on the segment embeddings, and (2) binary masks, via a dot product between mask embeddings (obtained from the segment embeddings by an MLP) and the per-pixel embeddings, followed by a sigmoid activation. This design is backbone-agnostic and can use any existing feature extractor.

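Paragraph function: Architecture description. It describes MaskFormer's three-module design.

Logical role: The division of labor among the three modules is clear: the pixel-level module handles spatial information, the Transformer decoder handles segment-level semantics, and the segmentation module combines the two. Generating masks through a dot product is mathematically elegant and computationally efficient.

Argumentation technique / potential weakness: Being backbone-agnostic is a strong design property: MaskFormer benefits from any backbone improvement (for example, moving from ResNet to Swin). However, the cost of the Transformer decoder, where N queries cross-attend to every image feature location, may become a bottleneck on high-resolution images.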
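
The segmentation module from Section 3.2 can be illustrated with a small plain-Python sketch. It assumes the per-segment and per-pixel embeddings are already computed, uses the segment embedding directly as the mask embedding (skipping any intermediate MLP), and represents the linear classifier as a list of weight rows without bias. The name `segmentation_head` and all values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def segmentation_head(segment_embs, pixel_embs, cls_weights):
    """Turn N segment embeddings into N class distributions and N soft masks.

    segment_embs: N vectors of size C (one per query)
    pixel_embs:   H*W vectors of size C (flattened per-pixel embeddings)
    cls_weights:  K+1 vectors of size C (linear classifier, last row = "no object")
    """
    # (1) Class prediction: linear layer followed by softmax over K+1 classes.
    class_probs = [softmax([dot(q, w) for w in cls_weights]) for q in segment_embs]
    # (2) Mask prediction: dot product with every pixel embedding, then sigmoid.
    masks = [[sigmoid(dot(q, e)) for e in pixel_embs] for q in segment_embs]
    return class_probs, masks


# Toy example: N = 2 queries, C = 2 channels, 3 pixels, K = 2 classes.
segment_embs = [[1.0, 0.0], [0.0, 1.0]]
pixel_embs = [[2.0, -2.0], [-2.0, 2.0], [2.0, -2.0]]
cls_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

class_probs, masks = segmentation_head(segment_embs, pixel_embs, cls_weights)
```

Here query 0 responds strongly to pixels 0 and 2 and weakly to pixel 1, while query 1 does the opposite, which shows how a single dot product per pixel yields one soft mask per query.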

3.3 Inference Strategy

The inference strategy differs by task. For semantic segmentation, MaskFormer uses marginalization: at each pixel, the probability of category c is computed as the sum over all N masks of the product of the mask probability and the class probability for c, and an argmax over categories gives the final label. This lets multiple queries contribute to the same semantic class, which handles large or disjoint regions of the same category. For panoptic and instance segmentation, a simpler argmax over masks assigns each pixel to the most confident segment, with a confidence threshold to filter low-quality predictions. The same model architecture, loss, and training procedure therefore serve different tasks; only the inference procedure changes.

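Paragraph function: Inference design. It shows how the same model adapts to different tasks through different inference strategies.

Logical role: This paragraph is the key support for the unification claim: the model architecture and training procedure are identical, and only the inference strategy changes with the task. This makes the vision of "one model for all segmentation problems" operational.

Argumentation technique / potential weakness: The marginalization strategy for semantic segmentation, which lets several queries cover the same class, is a clever design. It does not require the number of queries N to exceed the number of classes K, since N is decoupled from K (Section 3.1), but when K is very large (for example, the 847 classes of ADE20K-Full) N can be smaller than K, and its setting becomes a sensitive hyperparameter.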
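
The two inference procedures from Section 3.3 can be sketched as follows. This is a simplified illustration, not the authors' code: masks are flattened lists of per-pixel probabilities, the "no object" class is assumed to sit at index K, the threshold value 0.5 is an arbitrary placeholder, and any further post-processing of panoptic outputs is omitted. The function names are hypothetical.

```python
def semantic_inference(class_probs, masks, num_classes):
    """Per pixel: argmax over classes c of sum_i p_i(c) * m_i[pixel].

    The "no object" class (index num_classes) is excluded from the argmax.
    """
    labels = []
    for px in range(len(masks[0])):
        scores = [sum(p[c] * m[px] for p, m in zip(class_probs, masks))
                  for c in range(num_classes)]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


def panoptic_inference(class_probs, masks, num_classes, threshold=0.5):
    """Per pixel: assign the segment i maximizing p_i(c_i) * m_i[pixel].

    Returns a list of (segment index, class) pairs, or None where no
    confident segment exists.
    """
    kept = []
    for i, p in enumerate(class_probs):
        c = max(range(len(p)), key=p.__getitem__)  # most likely class of query i
        if c != num_classes and p[c] > threshold:  # drop "no object" / low confidence
            kept.append((i, c, p[c]))
    result = []
    for px in range(len(masks[0])):
        if not kept:
            result.append(None)
            continue
        i, c, _ = max(kept, key=lambda t: t[2] * masks[t[0]][px])
        result.append((i, c))
    return result


# Toy example: 3 queries, K = 2 classes, 3 pixels. Queries 0 and 1 both predict class 0.
class_probs = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.7, 0.2]]
masks = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.2], [0.1, 0.1, 0.9]]

semantic = semantic_inference(class_probs, masks, num_classes=2)   # [0, 0, 1]
panoptic = panoptic_inference(class_probs, masks, num_classes=2)   # [(0, 0), (1, 0), (2, 1)]
```

The semantic output merges queries 0 and 1 into a single class-0 region, while the panoptic output keeps them as separate segments of the same class, which is the distinction the two procedures are meant to capture.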

4. Experiments

A key finding is that MaskFormer's advantage over per-pixel baselines increases with the number of categories. On Cityscapes (19 classes), the improvement is minimal. On ADE20K (150 classes), MaskFormer with a Swin-Large backbone achieves 55.6 mIoU, a new state of the art at the time. On ADE20K-Full (847 classes), the gain reaches +3.5 mIoU over the per-pixel baseline, which indicates that mask classification becomes increasingly advantageous as the number of categories grows. For COCO panoptic segmentation, MaskFormer achieves 52.7 PQ, competitive with specialized panoptic methods. The model handles both "things" and "stuff" categories without architectural changes, which supports the unification claim.

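Paragraph function: Experimental validation. It shows performance trends across different numbers of categories and segmentation tasks.

Logical role: The trend that the advantage grows with the number of categories is the most persuasive finding: it shows not only performance but also a structural advantage of mask classification at scale.

Argumentation technique / potential weakness: Using the number of categories as the axis of analysis is an insightful experimental design. Admitting that the improvement on Cityscapes is minimal also points to an important limitation: for scenes with few categories and high resolution, per-pixel classification remains an effective, possibly better, choice. The value of the unified framework shows mostly in many-category settings.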

5. Conclusion

MaskFormer demonstrates that mask classification is a viable and often superior alternative to per-pixel classification for semantic segmentation, while extending naturally to instance and panoptic segmentation. The approach achieves state-of-the-art results on ADE20K and competitive performance across multiple benchmarks. The findings suggest that the segmentation community should reconsider the dominance of per-pixel classification and explore mask-based formulations that offer greater flexibility and scalability. The authors see this as a step toward truly unified segmentation models that can handle any segmentation task with a single architecture.

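Paragraph function: Summary and call to action. It restates the core contribution and advocates a paradigm shift.

Logical role: The conclusion goes beyond promoting one method and issues a broader call to the community to reconsider the dominance of per-pixel classification. This positions MaskFormer as the starting point of a paradigm shift rather than its end point.

Argumentation technique / potential weakness: The call to reconsider the dominant paradigm was later borne out by MaskFormer's successor, Mask2Former, which reached state-of-the-art results on semantic, instance, and panoptic benchmarks (ADE20K semantic, COCO instance, COCO panoptic) and showed the vitality of the mask classification paradigm. The original MaskFormer, however, still trailed the Mask R-CNN family on instance segmentation, so the unification claim remained incomplete in terms of performance.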

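Argument Structure Overview

Problem: per-pixel classification cannot unify semantic and instance segmentation
→ Claim: mask classification is a general segmentation paradigm
→ Evidence: state of the art on ADE20K; the advantage grows with the number of categories
→ Rebuttal: the same model adapts to different tasks through different inference procedures
→ Conclusion: the segmentation community should reconsider the dominant paradigm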

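Authors' Core Claim (One Sentence)

Mask classification, which predicts a set of binary masks paired with class labels, is a more general segmentation paradigm than per-pixel classification and can unify semantic, instance, and panoptic segmentation in a single architecture.

Strongest Point of the Argument

Scaling advantage with the number of categories: the finding that the advantage grows with the number of categories (a marginal improvement on the 19 classes of Cityscapes versus +3.5 mIoU on the 847 classes of ADE20K-Full) is strong evidence of a structural advantage for mask classification and suggests the method will matter more on larger-scale segmentation problems.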

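Weakest Point of the Argument

Unification is incomplete in terms of performance: MaskFormer reaches the state of the art on semantic segmentation but is only described as "competitive" on panoptic segmentation, which suggests that a unified framework may not match specialized methods on every task. In addition, the marginal improvement in few-category settings (such as Cityscapes) calls into question whether mask classification is better than per-pixel classification in all scenarios.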