Segment Anything (SAM): Two-Column Annotated Reading

Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date, with over 1 billion masks on 11 million licensed and privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive — often competitive with or even superior to prior fully supervised results. We release the Segment Anything Model (SAM) and corresponding dataset (SA-1B) to foster research in foundation models for computer vision.

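Paragraph function: Overview of the whole paper. It previews the entire SA project through the three-part framing of task, model, and dataset.

Logical role: The abstract works as a vision statement. It starts from the specific task of segmentation and points toward the larger goal of foundation models for computer vision. The three components reinforce one another: the model drives data collection, and the data in turn feeds model training.

Argument technique / potential weakness: Placing "1 billion masks" next to "zero-shot" is highly persuasive, since the first shows scale and the second shows generalization. But "zero-shot" is defined loosely in the segmentation setting: the model still needs a prompt such as a point or a box, which is qualitatively different from zero-shot use of a language model.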

1. Introduction

In natural language processing, foundation models pretrained on broad web-scale data can be adapted to a wide range of downstream tasks via prompting. Can we build a similar foundation model for image segmentation? Such a model would need to solve a promptable segmentation task: given any prompt — a point, a box, a mask, or text — it must return a valid segmentation mask. This task is general enough to serve as a pretraining objective and enables zero-shot transfer to downstream segmentation problems through appropriate prompt engineering.

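Paragraph function: Establishes the vision. Using NLP foundation models as the analogy, it proposes a foundation model for visual segmentation.

Logical role: The starting point of the argument chain. It takes the success of NLP as evidence, extends it by analogy to visual segmentation, and so argues that promptable segmentation is both necessary and reasonable.

Argument technique / potential weakness: The analogy to NLP foundation models is powerful, but the two settings differ in kind. Language is a system of discrete symbols, while images are continuous, and an NLP prompt and a segmentation prompt (a point or a box) differ sharply in information density. The persuasiveness of the analogy can mask these differences.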

2. Segment Anything Task: Promptable Segmentation

We define a promptable segmentation task where the goal is to return a valid segmentation mask given any segmentation prompt. A prompt can be spatial (points, boxes, masks) or semantic (free-form text). The requirement of a "valid" mask means that even when a prompt is ambiguous — for instance, a single point on a shirt could refer to the shirt itself or the person wearing it — the output should be a reasonable mask for at least one of the possible objects. This task generalizes naturally to several existing segmentation tasks (interactive segmentation, edge detection, object proposal generation) and can serve as a unifying pretraining objective.

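Paragraph function: Core definition. It pins down the scope and constraints of the promptable segmentation task.

Logical role: This paragraph is the theoretical foundation of the whole project: a single, sufficiently general task definition covers all downstream segmentation applications. How ambiguity is handled (in the model, by outputting multiple candidate masks) is a key design decision.

Argument technique / potential weakness: The concrete shirt-versus-person example makes the handling of ambiguity intuitive. But the criterion for a "valid mask" remains semantically fuzzy: almost any closed region could count as valid, and the real question is whether it matches the user's intent.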
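
The task definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the prompt classes, the `is_valid` helper, and the 0.5 IoU threshold are hypothetical names and values chosen here, not anything defined in the paper, which gives no numeric criterion for validity. Masks are modeled as sets of pixel coordinates, and mask prompts are left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple, Union

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # a mask modeled as the set of its foreground pixels


@dataclass(frozen=True)
class PointPrompt:
    x: int
    y: int
    is_foreground: bool = True


@dataclass(frozen=True)
class BoxPrompt:
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class TextPrompt:
    text: str


# Hypothetical union type; mask prompts are omitted to keep the sketch short.
Prompt = Union[PointPrompt, BoxPrompt, TextPrompt]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_valid(pred: Mask, plausible: Sequence[Mask], threshold: float = 0.5) -> bool:
    """Assumed reading of 'valid': the prediction matches at least one
    object the prompt could plausibly refer to. The threshold is illustrative."""
    return any(iou(pred, obj) >= threshold for obj in plausible)


# Toy scene: a shirt region that lies inside a larger person region.
shirt: Mask = frozenset((x, y) for x in range(2, 4) for y in range(2, 4))
person: Mask = frozenset((x, y) for x in range(1, 5) for y in range(0, 6))
stray: Mask = frozenset({(9, 9)})

click = PointPrompt(x=3, y=3)
# The click is ambiguous: it lies inside both the shirt and the person.
ambiguous = [m for m in (shirt, person) if (click.x, click.y) in m]
```

In this toy scene, returning either the shirt or the person counts as valid, while the stray pixel does not. The sketch also makes the annotation's concern visible: validity is judged against a set of plausible objects, not against what the user actually meant.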

3. Model Architecture

SAM consists of three components. The image encoder employs an MAE-pretrained Vision Transformer (ViT) adapted for high-resolution input processing, generating 64x64 embeddings from 1024x1024 images. The prompt encoder represents sparse prompts (points, boxes) through positional encodings summed with learned embeddings, while dense prompts (masks) use convolutional processing with element-wise addition to image embeddings. The mask decoder features a lightweight two-layer transformer decoder utilizing cross-attention mechanisms, enabling real-time mask generation in approximately 50ms in a web browser.

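Paragraph function: Architecture overview. It describes SAM's three core components in modular terms.

Logical role: The paragraph shows a deliberate split of work: the heavy image encoder (ViT-H) runs only once per image, while the lightweight prompt encoder and mask decoder handle real-time interaction. This split is the engineering precondition for a promptable design.

Argument technique / potential weakness: The 50 ms response time speaks directly to practical usability. But the upfront cost of the ViT-H encoder (about 600 ms by this annotation's estimate; the actual figure depends on hardware) is folded into the description of the image encoder. For video, where every frame must be encoded, this could become the bottleneck.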
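
The encode-once, decode-many split can be sketched as simple arithmetic. This is a back-of-the-envelope illustration, not a benchmark: the function names are made up here, the patch size of 16 is an assumption that happens to be consistent with mapping 1024x1024 inputs to 64x64 embeddings, the 50 ms decode time is the figure quoted in the text, and the 600 ms encode time is the annotation's own estimate.

```python
def embedding_grid(image_side: int, patch_size: int = 16) -> int:
    """Side length of the embedding grid for a square input image.

    patch_size=16 is an assumption; it is the value that maps 1024 to 64.
    """
    if image_side % patch_size != 0:
        raise ValueError("image side must be a multiple of the patch size")
    return image_side // patch_size


def total_latency_ms(encode_ms: float, decode_ms: float, num_prompts: int) -> float:
    """One image encoding followed by num_prompts prompt decodes."""
    return encode_ms + decode_ms * num_prompts


grid = embedding_grid(1024)  # 64, matching the 64x64 embeddings in the text

# Interactive use: ten prompts on one image share a single encoding.
interactive = total_latency_ms(encode_ms=600, decode_ms=50, num_prompts=10)

# Video-like use: one prompt per frame, so every frame pays the encoder cost.
per_frame = total_latency_ms(encode_ms=600, decode_ms=50, num_prompts=1)
```

Under these assumed numbers, ten prompts on one image cost 1100 ms in total, or 110 ms per prompt, while a single prompt on a fresh frame costs 650 ms. That gap is the bottleneck the annotation points to for frame-by-frame processing.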

To handle prompt ambiguity, SAM predicts multiple output masks (three by default) for each prompt, each with an associated confidence score (IoU prediction). This design directly addresses the one-to-many nature of the task: a single point on an object may correspond to the part, the whole object, or a larger group. During training, the loss is computed only against the minimum-loss mask, encouraging each output slot to specialize in a different granularity level. This multi-mask output with automatic ambiguity resolution is essential for the promptable interface.

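Paragraph function: Resolving ambiguity. It describes the multi-mask output strategy.

Logical role: A direct answer to the ambiguity challenge raised in the task definition. With masks at several granularities, the model moves from choosing a single answer to presenting several reasonable interpretations, which is a fundamental difference from traditional segmentation models.

Argument technique / potential weakness: Three fixed output slots are a clean engineering decision, but object hierarchies can run much deeper than three levels (for example button → pocket → shirt → upper body → person → group). In addition, the accuracy of the IoU prediction directly determines the quality of automatic selection.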
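
The minimum-loss training rule and the score-based selection at inference can be sketched as follows. This is a minimal illustration, assuming a mean-squared-error stand-in for the mask loss (the model's actual loss is not reproduced here) and flattened toy masks of four pixels. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[float]  # flattened per-pixel foreground probabilities


def mask_loss(pred: Mask, target: Sequence[int]) -> float:
    """Mean squared error, used here only as a stand-in mask loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def min_loss_over_masks(preds: List[Mask], target: Sequence[int]) -> Tuple[int, float]:
    """Training rule: only the best-matching output slot contributes a loss."""
    losses = [mask_loss(p, target) for p in preds]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses[best]


def pick_by_predicted_iou(iou_scores: Sequence[float]) -> int:
    """Inference rule: return the slot with the highest predicted IoU."""
    return max(range(len(iou_scores)), key=iou_scores.__getitem__)


# Three output slots for one ambiguous point: part, whole object, larger group.
preds = [
    [1.0, 0.0, 0.0, 0.0],  # part
    [1.0, 1.0, 0.0, 0.0],  # whole object
    [1.0, 1.0, 1.0, 1.0],  # group
]
target = [1, 1, 0, 0]  # the annotator labeled the whole object

slot, loss = min_loss_over_masks(preds, target)
chosen = pick_by_predicted_iou([0.62, 0.91, 0.77])  # made-up confidence scores
```

Because the part and group slots receive no penalty for this example, they remain free to fit other granularities elsewhere in the data. That is the specialization effect the paragraph describes. The selection step also shows why the annotation flags IoU prediction accuracy: a miscalibrated score picks the wrong slot even when a good mask exists.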

4. Data Engine

The SA-1B dataset was built through a three-stage data engine. In the assisted-manual stage, professional annotators used SAM interactively, producing 4.3 million masks. In the semi-automatic stage, SAM generated candidate masks that annotators refined, yielding 5.9 million masks and increasing mask diversity. In the fully automatic stage, SAM generated masks using a 32x32 grid of point prompts per image, followed by non-maximum suppression and confidence filtering, producing the vast majority of the final 1.1 billion masks across 11 million images. Human quality ratings consistently scored mask quality between 7 and 9 on a 10-point scale.

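Paragraph function: Data construction strategy. It shows the progressive scale-up from manual to fully automatic annotation.

Logical role: The paragraph reveals the flywheel behind SAM's success: the model improves annotation quality and efficiency -> more and better data -> the model improves further. The three-stage design takes the dataset from millions of masks to over a billion.

Argument technique / potential weakness: The three-stage progression is a very persuasive narrative. But the 1.1 billion masks come mostly from the fully automatic stage, so their quality ceiling is bounded by the model's ability at that time. In addition, uniform sampling on a 32x32 grid may miss small or densely packed objects in complex scenes.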
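
The fully automatic stage (grid prompts, confidence filtering, non-maximum suppression) can be sketched in a few lines. This is a simplified illustration under stated assumptions: masks are modeled as pixel sets, suppression uses mask IoU as a stand-in for whatever overlap measure the real pipeline uses, and the thresholds 0.8 and 0.7 are hypothetical values, not figures from the paper.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pixel = Tuple[int, int]
Candidate = Tuple[FrozenSet[Pixel], float]  # (mask pixels, confidence score)


def point_grid(points_per_side: int) -> List[Tuple[float, float]]:
    """Evenly spaced (x, y) prompts in normalized [0, 1] image coordinates."""
    step = 1.0 / points_per_side
    coords = [step / 2 + i * step for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def mask_iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_and_suppress(
    candidates: Sequence[Candidate],
    min_score: float = 0.8,   # hypothetical confidence threshold
    iou_thresh: float = 0.7,  # hypothetical overlap threshold
) -> List[Candidate]:
    """Drop low-confidence masks, then greedily keep the highest-scoring
    mask among any group of heavily overlapping ones."""
    kept: List[Candidate] = []
    for mask, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score:
            continue
        if all(mask_iou(mask, k[0]) <= iou_thresh for k in kept):
            kept.append((mask, score))
    return kept


prompts = point_grid(32)  # 32 x 32 = 1024 point prompts per image

a = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})
b = frozenset({(0, 0), (0, 1), (1, 0)})  # near-duplicate of a (IoU 0.75)
c = frozenset({(5, 5)})
d = frozenset({(9, 9)})
kept = filter_and_suppress([(a, 0.95), (b, 0.90), (c, 0.85), (d, 0.50)])
```

Here the near-duplicate `b` is suppressed by `a`, and `d` is dropped for low confidence, leaving two masks. The grid also makes the annotation's concern tangible: with only 1024 evenly spaced prompts per image, an object smaller than the grid spacing may never receive a point.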

5. Experiments

We evaluate SAM across 23 diverse segmentation datasets in a zero-shot setting. On single-point segmentation, SAM outperforms the baseline RITM model on 16 of 23 datasets. Additional evaluations demonstrate zero-shot capability on edge detection (comparable to learned approaches on BSDS500), object proposals (competitive on LVIS), and instance segmentation. Preliminary experiments with text-to-mask functionality show promising but less refined results. Critically, the model achieves these results without any task-specific finetuning, purely through prompt engineering — supporting our thesis that SAM functions as a foundation model for segmentation.

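Paragraph function: Comprehensive experimental validation. It demonstrates zero-shot generalization across many tasks and datasets.

Logical role: The empirical pillar. The breadth of 23 datasets directly supports the "foundation model" positioning. The 16-of-23 win rate, together with cross-task ability (edge detection, object proposals), strengthens the generality claim.

Argument technique / potential weakness: The authors candidly report that text-to-mask results are "less refined", which adds to overall credibility. But "outperforms RITM on 16 of 23" deserves a closer look: on which 7 datasets does SAM fall behind? These may be semantic segmentation or domain-specific settings (such as medical imaging), which would expose the weak spots of a general-purpose model.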

6. Conclusion

The Segment Anything project demonstrates that a promptable foundation model for image segmentation is both feasible and effective. Through the synergy of the SA task, SAM model, and SA-1B dataset, we show that large-scale self-training with human-in-the-loop data curation can produce a model with strong zero-shot generalization. SAM's ability to segment any object with a simple prompt opens new possibilities for interactive annotation, downstream task adaptation, and compositional AI systems. We believe this work represents a step toward building foundation models for computer vision analogous to those that have transformed NLP.

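Paragraph function: Summary of the paper. It restates the three-part contribution and looks ahead to the future of foundation models.

Logical role: The conclusion lifts the concrete technical results to the level of a paradigm shift: from a segmentation tool to a vision foundation model. It returns to the NLP analogy from the introduction and closes the argument loop.

Argument technique / potential weakness: The "foundation model" positioning is ambitious and calls for caution. SAM mainly handles object-level segmentation, and its capacity for semantic understanding (for example telling a "sitting cat" from a "standing cat") is limited. In addition, its generalization to specialized domains such as medical and satellite imagery remains to be verified.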

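Argument structure overview

Problem: visual segmentation lacks a general-purpose foundation model
→ Thesis: a promptable segmentation task covers all segmentation needs
→ Evidence: zero-shot evaluation on 23 datasets; SA-1B at the scale of a billion masks
→ Rebuttal: multi-mask output resolves ambiguity; the three-stage engine safeguards quality
→ Conclusion: SAM is a feasible path toward a foundation model for visual segmentation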

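The authors' core claim (in one sentence)

By defining a promptable segmentation task, building a billion-scale dataset, and training an efficient SAM model, one can achieve zero-shot generalization in visual segmentation comparable to that of NLP foundation models.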

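Strongest point of the argument

Empirical evidence for the data flywheel: the three-stage data engine scaled from 4.3 million masks to over a billion while human quality ratings stayed at 7 to 9 out of 10. The positive loop between model and data is not only an engineering achievement; it also validates model-driven data curation as a workable paradigm.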

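Weakest point of the argument

Overextension of the "foundation model" positioning: SAM mainly performs class-agnostic, object-level segmentation and lacks semantic understanding. On tasks that require distinguishing object categories or understanding scene semantics (such as semantic segmentation and scene understanding), its zero-shot performance still trails specialized models. In addition, generalization to specialized domains such as medicine and remote sensing has not been systematically verified.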