Real-Time High-Resolution Background Matting

Abstract

We introduce a real-time, high-resolution background matting method that operates at 4K resolution at 30fps and HD at 60fps on a modern GPU. Our method requires an additional captured background image, which is typically available in video conferencing and virtual production scenarios. We propose a two-stage architecture: a base network that processes the entire image at low resolution to produce a coarse alpha matte, foreground, and an error map, followed by a refinement network that selectively processes only high-resolution patches identified by the error map. This design enables real-time performance at very high resolutions while maintaining matting quality comparable to or exceeding offline methods.

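Paragraph function: Overview of the whole paper. It states the core challenge of real-time, high-resolution matting and presents the two-stage architecture as the solution.

Logical role: The abstract opens with concrete performance figures (4K at 30fps, HD at 60fps) to establish practical value, then backs that promise with a technical outline of the two-stage architecture, giving a results-first, method-second order.

Argumentative technique / potential weakness: Leading with concrete frame rates and resolutions is strong, engineering-oriented rhetoric. However, the prerequisite of an additional background image is played down; with dynamic backgrounds (for example, outdoor scenes) this assumption may not hold.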

1. Introduction

Image matting refers to the problem of estimating the alpha matte and foreground color of a subject from an input image. It is a fundamental operation in image and video editing, with applications in virtual backgrounds for video conferencing, film production, and augmented reality. Traditional methods rely on user-supplied trimaps, which are impractical for real-time applications. Recent deep learning approaches achieve impressive quality but are computationally expensive, often requiring seconds per frame, making them unsuitable for video.

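Paragraph function: Establishes the research area by defining the matting problem and pointing out that existing methods fall short on real-time performance.

Logical role: Frames the research gap as a quality-versus-speed tension: traditional methods need manual input, and deep learning methods are too slow. The paper aims to remove both limitations at once.

Argumentative technique / potential weakness: Opening with widely used applications such as video conferencing effectively establishes practical value. However, "seconds per frame" is a broad generalization; speed varies considerably between methods.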
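
The matting problem defined above is conventionally expressed with the standard compositing equation I = alpha * F + (1 - alpha) * B, where F is the foreground color, B the background color, and alpha the per-pixel opacity. The following is a minimal per-pixel sketch of that equation; the function name `composite_pixel` and the tuple-based pixel format are illustrative assumptions, not taken from the paper.

```python
def composite_pixel(alpha, fg, bg):
    """Blend one RGB pixel: I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# A half-transparent red foreground pixel over a blue background.
pixel = composite_pixel(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(pixel)
```

Matting runs this equation in reverse: given only I (and, in background matting, B as well), it estimates alpha and F, which is why the extra background image is such a useful constraint.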

Background matting is a special case where an additional clean background image is available. This extra information provides a strong signal for foreground-background separation, as the difference between the captured image and the background reveals the foreground region. Our prior work, Background Matting V1, demonstrated this concept but operated at low resolution and could not achieve real-time performance. In this work, we propose a fundamentally different architecture that achieves both real-time speed and high resolution through a novel selective refinement strategy.

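Paragraph function: Self-critique and improvement. It acknowledges the shortcomings of the earlier work and introduces this paper's fundamental changes.

Logical role: Builds trust through self-critique: it admits V1's limits (low resolution, not real-time) while previewing V2's core innovation (selective refinement).

Argumentative technique / potential weakness: The claim of a "fundamentally different architecture" needs technical detail to support it. The self-critique adds credibility, but it also implies that V1 had a fundamental architectural problem rather than merely insufficient engineering optimization.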
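
To illustrate why a captured background is such a strong signal, here is a deliberately naive background-differencing sketch. It is not the paper's method (which uses a learned network); the function name `difference_mask`, the mean-absolute-difference measure, and the threshold of 0.1 are all assumptions chosen for illustration.

```python
def difference_mask(image, background, threshold=0.1):
    """Flag pixels whose color differs from the background by more than threshold.

    image, background: 2D lists of (r, g, b) tuples with values in [0, 1].
    Returns a 2D list of 0/1 flags (1 = likely foreground).
    """
    mask = []
    for img_row, bg_row in zip(image, background):
        row = []
        for (r, g, b), (br, bg_g, bb) in zip(img_row, bg_row):
            # Mean absolute difference across the three color channels.
            diff = (abs(r - br) + abs(g - bg_g) + abs(b - bb)) / 3.0
            row.append(1 if diff > threshold else 0)
        mask.append(row)
    return mask


# A uniform gray background with two "subject" pixels placed on the middle row.
background = [[(0.2, 0.2, 0.2)] * 4 for _ in range(3)]
image = [row[:] for row in background]
image[1][1] = (0.9, 0.6, 0.5)
image[1][2] = (0.8, 0.5, 0.4)

mask = difference_mask(image, background)
for row in mask:
    print(row)
```

A hard threshold like this yields only a binary mask and cannot express partial transparency such as hair strands, and it fails wherever the subject's color matches the background. That gap is what motivates a learned alpha estimate instead.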

2. Related Work

Classical matting methods such as Bayesian matting and KNN matting require trimaps or scribbles as user input. Deep learning methods like Deep Image Matting and Context-Aware Matting achieve superior quality but still require trimaps and run slowly. MODNet proposes trimap-free matting but at limited resolution. Green screen matting in film production uses a uniform colored background for easy chroma keying, but requires controlled studio environments. Our approach takes the middle ground: requiring only a casually captured background image, which is far easier to obtain than a trimap or green screen, while achieving real-time high-resolution performance.

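Paragraph function: Literature review. It positions the method systematically along two dimensions, input requirements and speed.

Logical role: Using a spectrum from trimap (high cost) through green screen (environmental constraints) to background image (low cost), it places the method at the sweet spot for practicality.

Argumentative technique / potential weakness: The "middle ground" positioning avoids head-on comparison with the methods at either extreme. However, "casually captured" carries the implicit assumption that the background image closely matches the current scene; if the background changes (for example, lighting shifts or objects move), performance may drop substantially.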

3. Method

3.1 Base Network

The base network takes as input the source image and captured background, both downsampled to low resolution (e.g., 512x288). It produces three outputs: a coarse alpha matte, a coarse foreground color estimate, and an error map that predicts where the coarse result is likely to be inaccurate. The architecture uses a ResNet-based encoder-decoder with skip connections. The error map is trained to predict the per-pixel absolute (L1) difference between the predicted alpha and the ground truth, so it learns to flag regions such as hair boundaries, semi-transparent regions, and areas with color spill that require high-resolution refinement.

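Paragraph function: First step of the method. It describes the base network that runs at low resolution and its error-prediction mechanism.

Logical role: Establishes the coarse stage of the two-stage strategy. The error map is the key innovation of the architecture: the network itself learns where refinement is needed, rather than relying on hand-crafted rules.

Argumentative technique / potential weakness: The error map is an elegant and intuitive idea: the network learns to know where it falls short. However, training it requires high-resolution ground truth, which is hard to obtain for real-world data. In addition, the choice of error-map threshold (which decides which regions are refined) directly affects the speed-quality trade-off.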
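
One reading of the error-map supervision described above is that its training target is the per-pixel absolute difference between the coarse alpha prediction and the ground truth. The sketch below computes such a target on a toy matte. The function name `error_target` and the nested-list representation are assumptions for illustration, and the loss actually used to fit the predicted error map to this target is not specified here.

```python
def error_target(alpha_pred, alpha_gt):
    """Per-pixel absolute difference between predicted and ground-truth alpha."""
    return [
        [abs(p - g) for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(alpha_pred, alpha_gt)
    ]


# Toy 2x3 matte: the middle column is a soft edge that the coarse prediction misses.
alpha_gt = [[0.0, 0.5, 1.0], [0.0, 0.25, 1.0]]
alpha_pred = [[0.0, 0.0, 1.0], [0.0, 0.75, 1.0]]

target = error_target(alpha_pred, alpha_gt)
for row in target:
    print(row)
```

In this toy case the target is zero in the fully transparent and fully opaque columns and large only along the soft edge, which matches the intuition that hair boundaries and semi-transparent regions are where refinement is needed.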

3.2 Refinement Network

The refinement network operates on selected high-resolution patches identified by the error map. Specifically, patches where the error map exceeds a threshold are cropped from the original high-resolution image and processed individually. The refinement network takes as input the high-resolution patch, its corresponding coarse predictions, and the background patch, and outputs refined alpha and foreground for that patch only. This selective strategy means that only a small fraction of the image (typically 5-20%) needs high-resolution processing, dramatically reducing computation. The refined patches are then composited back into the coarse output, upsampled to full resolution, to produce the final result.

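Paragraph function: Core of the method. It describes how selective refinement achieves real-time performance at high resolution.

Logical role: This is the high point of the paper's technical contribution. The claim that only 5-20% of the image needs high-resolution processing is the key quantitative argument behind the real-time promise: high-resolution work shrinks from all HW pixels to roughly 0.05HW to 0.2HW, a 5x to 20x reduction in refined area.

Argumentative technique / potential weakness: The 5-20% figure assumes the foreground subject occupies a small share of the frame and has limited edge area. When the foreground fills most of the frame (for example, a group photo), the ratio may rise substantially and real-time performance may not hold. In addition, handling of boundaries between patches may produce seam artifacts.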
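
The selective refinement loop described above (threshold the error map, crop the flagged patches, refine them, paste them back) can be sketched as follows. This is a simplified illustration under stated assumptions: each error-map cell is assumed to cover exactly one `patch` x `patch` block of the full-resolution image, `refine_fn` is a hypothetical stand-in for the refinement network, and only alpha is handled. The names `select_patches`, `refine_selected`, and `snap_to_binary` are not from the paper.

```python
def select_patches(error_map, threshold):
    """Return (row, col) indices of error-map cells whose value exceeds threshold."""
    return [
        (r, c)
        for r, row in enumerate(error_map)
        for c, value in enumerate(row)
        if value > threshold
    ]


def refine_selected(coarse_alpha, error_map, threshold, patch, refine_fn):
    """Refine only the blocks flagged by the error map and paste them back.

    coarse_alpha: full-resolution 2D list (the coarse alpha, already upsampled).
    error_map: low-resolution 2D list; each cell covers one patch x patch block.
    refine_fn: hypothetical stand-in for the refinement network (block -> block).
    Returns the refined alpha and the fraction of cells that were refined.
    """
    out = [row[:] for row in coarse_alpha]
    selected = select_patches(error_map, threshold)
    for r, c in selected:
        top, left = r * patch, c * patch
        block = [row[left:left + patch] for row in coarse_alpha[top:top + patch]]
        refined_block = refine_fn(block)
        for i, refined_row in enumerate(refined_block):
            out[top + i][left:left + patch] = refined_row
    fraction = len(selected) / (len(error_map) * len(error_map[0]))
    return out, fraction


def snap_to_binary(block):
    """Toy 'refinement': push each alpha value to 0.0 or 1.0."""
    return [[1.0 if v >= 0.5 else 0.0 for v in row] for row in block]


error_map = [[0.0, 0.4, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]       # 2x4 cells, one uncertain cell
coarse = [[0.5] * 8 for _ in range(4)]   # 4x8 full-resolution coarse alpha

refined, fraction = refine_selected(coarse, error_map, 0.1, 2, snap_to_binary)
print(fraction)
```

Here one of eight cells is refined, so the expensive step touches 12.5% of the image, inside the 5-20% range quoted above. Raising the threshold lowers that fraction at the cost of leaving more coarse pixels untouched, which is the speed-quality dial the annotation mentions.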

4. Experiments

We evaluate on both synthetic composites from the Adobe Matting dataset and real-world captures using webcam setups. On the synthetic benchmark, our method achieves comparable or superior alpha quality (SAD, MSE, Grad metrics) to state-of-the-art offline methods including FBA Matting and Index Matting. For speed, our method runs at 4K (3840x2160) at 30fps, HD (1920x1080) at 60fps, and 512x288 at over 100fps on an NVIDIA RTX 2080 Ti. In comparison, FBA Matting requires approximately 4 seconds per 4K frame. Our method is approximately 120x faster than FBA at 4K resolution. User studies confirm that participants rated our results comparably to offline methods for video conferencing scenarios.

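Paragraph function: Provides broad experimental evidence, validating the method on three dimensions: quality, speed, and user perception.

Logical role: The force of the argument lies in a 120x speedup without sacrificing quality. Triple validation (quantitative metrics, speed figures, and a user study) forms a robust chain of evidence.

Argumentative technique / potential weakness: The 120x speed ratio is a striking number. However, the baseline (FBA) is an offline method not designed for speed, so the fairness of the comparison is open to question. A quality comparison against methods that also target real-time use, such as MODNet, would be more convincing. The scale and controls of the user study are not described.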
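
The 120x figure follows directly from the two reported numbers, and the arithmetic is easy to check. The sketch below uses only values quoted in the paragraph above; the variable names are illustrative.

```python
fba_seconds_per_frame = 4.0     # reported for FBA Matting at 4K (approximate)
ours_frames_per_second = 30.0   # reported for this method at 4K

# Time per frame is 1/30 s, so the ratio is 4.0 / (1/30) = 4.0 * 30.
speedup = fba_seconds_per_frame * ours_frames_per_second

pixels_4k = 3840 * 2160
pixels_hd = 1920 * 1080
throughput_4k = pixels_4k * 30  # pixels per second at 4K, 30fps
throughput_hd = pixels_hd * 60  # pixels per second at HD, 60fps

print(speedup)
print(throughput_4k, throughput_hd)
```

The reported operating points are not equal in pixel throughput: 4K at 30fps moves twice as many pixels per second as HD at 60fps. This suggests that cost does not grow in proportion to pixel count, which would be consistent with refinement being applied to only part of the image, although the source does not break the timings down.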

5. Conclusion

We present a real-time, high-resolution background matting method that achieves 4K at 30fps through a novel two-stage architecture with selective refinement. By leveraging an error-map-guided patch selection strategy, our method processes only the critical regions at high resolution, enabling an unprecedented speed-quality trade-off for matting applications. The requirement of an additional background image is minimal in typical deployment scenarios such as video conferencing. Our method enables real-time virtual background replacement at resolutions previously achievable only by offline methods.

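Paragraph function: Summarizes the paper by restating the core method and performance results.

Logical role: The conclusion positions the paper's distinctive contribution as an "unprecedented speed-quality trade-off" and closes on a practical scenario (video conferencing), echoing the motivation in the introduction.

Argumentative technique / potential weakness: Calling the background-image requirement "minimal" is strategic wording that plays it down. The conclusion does not discuss robustness to background changes (such as lighting changes or camera movement) or applicability outside video conferencing (such as outdoor shooting).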

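Argument Structure Overview

Problem: high-quality matting is too slow and cannot process high resolution in real time
→ Claim: two-stage selective refinement that processes only the critical regions
→ Evidence: 4K at 30fps, with quality comparable to offline methods
→ Rebuttal: the background-image requirement is easy to meet in video conferencing
→ Conclusion: real-time, high-resolution matting with an unprecedented speed-quality trade-off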

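Authors' core claim (one sentence)

With a two-stage architecture and error-map-guided selective high-resolution refinement, a single additional background image is enough to achieve real-time background matting at 4K and 30fps with quality comparable to offline methods.

Strongest point of the argument

Empirical persuasiveness of the engineering innovation: the concrete figures of 4K at 30fps and a 120x speedup, together with quality metrics comparable to offline methods and subjective validation from a user study, form a robust, triply cross-validated chain of evidence. The core insight of selective refinement (only 5-20% of the image needs high-resolution processing) is both intuitive and quantifiable.

Weakest point of the argument

Implicit limits on applicable scenarios: the prerequisite of an additional background image assumes the background is static and can be captured in advance. In real conditions such as lighting changes (for example, day turning to night), moving background objects (for example, a pet walking through), or camera displacement, the robustness of this assumption is not adequately verified. In addition, the share of refined patches (5-20%) may rise substantially with multiple people or large foreground areas, threatening the real-time promise.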