AnyDoor — 雙欄批注

Abstract — 摘要

We introduce AnyDoor, a diffusion-based image generator that teleports target objects to new scenes at user-specified locations in a harmonious way. Unlike existing methods that require parameter tuning for each new object, our model trains once and generalizes to diverse object-scene combinations during inference. We design the framework by combining identity features extracted with a self-supervised backbone and detail features obtained through high-frequency maps, which together preserve object texture while allowing local variations in lighting and orientation. We further leverage video datasets to observe multiple forms of the same objects, enhancing generalizability and robustness. Applications include virtual try-on, object swapping, and multi-object composition.

本文提出 AnyDoor，一種基於擴散模型的影像生成器，能將目標物件以和諧的方式傳送至使用者指定位置的新場景中。不同於需要對每個新物件進行參數調整的現有方法，我們的模型僅需訓練一次即可在推論時泛化至多元的物件-場景組合。我們設計的框架結合了以自監督骨幹擷取的身份特徵與透過高頻圖譜取得的細節特徵，在保留物件紋理的同時允許光照和方向的局部變化。我們進一步利用影片資料集觀察同一物件的多種形態，增強泛化能力與穩健性。應用包括虛擬試衣、物件替換和多物件合成。

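Paragraph function: Overview of the whole paper. It proposes a unified solution for zero-shot, object-level image customization.

Logical role: The "teleport" metaphor conveys the functional positioning intuitively; "train once, generalize at inference" is the core differentiating claim.

Argumentative technique / potential weakness: The dual-path design of identity features plus detail features is the methodological highlight. However, the claim of composing objects "in a harmonious way" needs a rigorous quantitative definition to support it.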

1. Introduction — 緒論

While diffusion models have enabled remarkable image generation capabilities through text, scribbles, and other conditions, the specific task of object teleportation — accurately placing target objects into desired scene locations — remains underexplored despite its practical importance for image composition, rendering, poster design, and virtual try-on. Previous methods present significant limitations: Paint-by-Example and ObjectStitch cannot generate identity-consistent content for untrained categories; customization methods like DreamBooth require extensive fine-tuning (approximately one hour with multiple images) and cannot specify object locations within given scenes.

儘管擴散模型已透過文字、塗鴉等條件實現了出色的影像生成能力，但物件傳送這一特定任務——將目標物件準確放置到指定場景位置——仍缺乏充分探索，然而其在影像合成、渲染、海報設計和虛擬試衣中具有重大實用價值。先前的方法存在顯著局限：Paint-by-Example 和 ObjectStitch 無法為未訓練的類別生成身份一致的內容；如 DreamBooth 等客製化方法需要耗時的微調（約一小時搭配多張影像），且無法在給定場景中指定物件位置。

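Paragraph function: Establishes the problem by pointing out where existing methods fall short on identity consistency and efficiency.

Logical role: The concrete defects of two families of methods (reference-based methods lack identity consistency; tuning-based methods are too slow) mark out the gap.

Argumentative technique / potential weakness: Pinpointing the pain point of roughly one hour of fine-tuning is effective. However, Paint-by-Example was not designed for full identity preservation, so the basis for comparison may not be entirely fair.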

AnyDoor addresses these constraints through zero-shot customization by representing objects using both identity and detail-related features, then compositing them within background scenes. The approach integrates ID tokens extracted by DINO-V2 and high-frequency detail maps into a pre-trained Stable Diffusion model. Training incorporates both video and image datasets, with an adaptive timestep sampler to leverage different data sources effectively. The framework supports single and multiple object placement, object moving, and object swapping — all without per-object parameter tuning.

AnyDoor 透過以身份特徵和細節特徵雙重表示物件來實現零樣本客製化，然後將其合成至背景場景中。此方法將 DINO-V2 擷取的 ID token 與高頻細節圖譜整合至預訓練的 Stable Diffusion 模型中。訓練同時納入影片與影像資料集，並使用自適應時步取樣器有效利用不同的資料來源。此框架支援單一與多物件放置、物件移動和物件替換——全部無需逐物件參數調整。

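Paragraph function: Presents the solution: a dual-path feature representation and a mixed-data training strategy.

Logical role: Three technical pillars are laid out clearly: DINO-V2 identity extraction, high-frequency detail maps, and adaptive timestep sampling.

Argumentative technique / potential weakness: The decision to use DINO-V2 rather than CLIP as the identity extractor is later backed by ablation experiments, so the technical choice is well grounded.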

2. Method — 方法

For identity feature extraction, the method employs DINO-V2 as the ID extractor backbone rather than the CLIP encoders used in prior work. The background is first removed with a segmentation model; DINO-V2 then produces a global token (1x1536) and patch tokens (256x1536), which are concatenated and passed through a linear projector to give ID tokens (257x1024) aligned with the embedding space of the text-to-image UNet. DINO-V2 is chosen over CLIP because it encodes more discriminative patch-level information, capturing fine-grained object identity beyond semantic-level similarity.

在身份特徵擷取方面，本方法採用 DINO-V2 而非先前工作所用的 CLIP 編碼器作為 ID 擷取骨幹。首先透過分割模型移除背景，然後將特徵擷取為全域 token（1x1536 維）和區塊 token（256x1536 維）的串接，經線性投影器產生與文字轉影像 UNet 嵌入空間對齊的 ID token（257x1024 維）。選擇 DINO-V2 而非 CLIP 的原因在於其能編碼更具鑑別力的區塊級資訊，捕捉超越語意級相似度的細粒度物件身份。

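Paragraph function: The identity feature module, with the technical details of replacing CLIP with DINO-V2.

Logical role: The contrast between semantic similarity and discriminative identity precisely defines the difference in technical requirements.

Argumentative technique / potential weakness: Background removal is a key preprocessing step; the ablation (DINO Score rising from 64.1 to 67.8) confirms its importance. However, segmentation quality directly affects downstream performance.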
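
The token shapes in this section can be checked with a small NumPy sketch. It is only an illustration of the concatenate-then-project step: the random arrays stand in for real DINO-V2 outputs, and `build_id_tokens`, `weight`, and `bias` are hypothetical names, not part of any released AnyDoor code.

```python
import numpy as np

def build_id_tokens(global_token, patch_tokens, weight, bias):
    """Concatenate a global token with patch tokens, then project linearly.

    global_token: (1, 1536), patch_tokens: (256, 1536)
    weight: (1536, 1024), bias: (1024,)  # hypothetical projector parameters
    """
    tokens = np.concatenate([global_token, patch_tokens], axis=0)  # (257, 1536)
    return tokens @ weight + bias                                  # (257, 1024)

rng = np.random.default_rng(0)
global_token = rng.standard_normal((1, 1536))    # stand-in for the DINO-V2 global token
patch_tokens = rng.standard_normal((256, 1536))  # stand-in for the DINO-V2 patch tokens
weight = rng.standard_normal((1536, 1024)) * 0.02
bias = np.zeros(1024)

id_tokens = build_id_tokens(global_token, patch_tokens, weight, bias)
print(id_tokens.shape)  # (257, 1024)
```

The 257 rows are the one global token followed by the 256 patch tokens; the projector only changes the channel width from 1536 to 1024.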

To address the spatial resolution limitations of ID tokens, the authors introduce complementary detail features. A collage representation stitches background-removed objects to target scene locations, but direct stitching lacks diversity. Instead, high-frequency maps are extracted using Sobel kernels — horizontal and vertical filters combined with the original image through element-wise operations, with an eroded mask removing outer contour information. The resulting high-frequency map feeds into a ControlNet-style UNet encoder producing hierarchical-resolution detail maps. This design ensures ID tokens capture overall structure while high-frequency maps preserve fine details like logos and textures.

為解決 ID token 的空間解析度限制，作者引入互補的細節特徵。拼貼表示將去背物件粘貼至目標場景位置，但直接粘貼缺乏多樣性。取而代之的是使用 Sobel 核擷取高頻圖譜——水平與垂直濾波器與原始影像進行逐元素運算，並以侵蝕遮罩移除外部輪廓資訊。產出的高頻圖譜輸入 ControlNet 風格的 UNet 編碼器產生多階層解析度的細節圖。此設計確保ID token 捕捉整體結構，而高頻圖譜保留標誌和紋理等精細細節。

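Paragraph function: The detail feature module, which uses high-frequency maps to compensate for the limited spatial information in the ID tokens.

Logical role: Extracting high-frequency information with Sobel kernels is a neat reuse of a classic image-processing technique inside a deep learning framework.

Argumentative technique / potential weakness: Using an eroded mask to remove contour information keeps the model from over-relying on exact edge shapes and increases tolerance to pose changes. However, the high-frequency information lost near the contour may affect generation quality in edge regions.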
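
A minimal sketch of the high-frequency map follows, assuming the "element-wise operations" mean: sum the horizontal and vertical Sobel responses, multiply by the image, then multiply by the eroded mask. That exact combination, the grayscale input, the 3x3 erosion, and all function names are assumptions made for illustration.

```python
import numpy as np

SOBEL_H = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
SOBEL_V = SOBEL_H.T                                                    # vertical gradient

def filter3x3(image, kernel):
    """Apply a 3x3 kernel (cross-correlation) with zero padding; same output shape."""
    image = image.astype(float)
    padded = np.pad(image, 1)
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, iterations=1):
    """Binary erosion with a 3x3 square: a pixel survives only if all 9 neighbours are set."""
    mask = mask.astype(bool)
    h, w = mask.shape
    for _ in range(iterations):
        padded = np.pad(mask, 1)  # pads with False
        out = np.ones((h, w), dtype=bool)
        for dy in range(3):
            for dx in range(3):
                out &= padded[dy:dy + h, dx:dx + w]
        mask = out
    return mask

def high_frequency_map(gray, mask, erode_iters=1):
    """Sobel responses, weighted by the image, with the outer contour masked away."""
    edges = filter3x3(gray, SOBEL_H) + filter3x3(gray, SOBEL_V)
    return edges * gray * erode(mask, erode_iters)

# Toy input: a vertical step edge inside a square object mask.
gray = np.zeros((8, 8))
gray[:, 4:] = 1.0
mask = np.zeros((8, 8))
mask[1:7, 1:7] = 1
hf = high_frequency_map(gray, mask)
print(np.count_nonzero(hf))  # 4
```

Only the bright side of the step edge inside the eroded mask survives, which is the intended effect: interior texture edges are kept while the object's outer contour is dropped.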

For feature injection, ID tokens replace text embeddings in cross-attention mechanisms within the UNet, while detail maps concatenate with UNet decoder features at each resolution. The UNet encoder remains frozen to preserve learned priors, while the decoder adapts through mean squared error loss. The training strategy leverages video datasets (YouTubeVOS, YouTubeVIS, UVO, MOSE, VIPSeg, BURST) where the same object appears across frames, along with multi-view datasets (MVImgNet, VitonHD, FashionTryon) and single-image datasets — totaling 270,821 training samples. An adaptive timestep sampler assigns video data to early denoising steps (coarse structure) and image data to late steps (fine details).

在特徵注入方面，ID token 取代 UNet 中交叉注意力機制的文字嵌入，而細節圖在每個解析度層級與 UNet 解碼器特徵進行串接。UNet 編碼器保持凍結以保留已學習的先驗，而解碼器透過均方誤差損失進行適配。訓練策略利用影片資料集（YouTubeVOS、YouTubeVIS、UVO、MOSE、VIPSeg、BURST）中同一物件跨幀出現的特性，結合多視角資料集（MVImgNet、VitonHD、FashionTryon）和單影像資料集——合計 270,821 個訓練樣本。自適應時步取樣器將影片資料分配至早期去噪步驟（粗略結構），影像資料分配至後期步驟（精細細節）。

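Paragraph function: Feature injection and training strategy, integrating heterogeneous data sources.

Logical role: Freezing the encoder while training the decoder balances prior preservation against task adaptation. Adaptive timestep sampling is the key training innovation.

Argumentative technique / potential weakness: Video data provides pose and viewpoint variation, but its quality is usually lower than that of still images; adaptive timestep sampling exploits this difference. About 270,000 samples is small relative to the scale common for diffusion models.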
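
One way to picture the adaptive timestep sampler is as a biased draw over the timestep range. The sketch below assumes the usual convention that a large `t` means heavy noise (the early denoising steps); the half-way split, the `bias` value, and the function name are hypothetical choices for illustration, not figures taken from the text above.

```python
import random

def sample_timestep(source, num_steps=1000, bias=0.5, rng=random):
    """Draw a diffusion timestep whose distribution depends on the data source.

    Assumed convention: large t = heavy noise = early denoising steps.
    With probability `bias` (hypothetical value) the draw is restricted to the
    preferred half of the range; otherwise it is uniform over all steps.
    """
    if source not in ("video", "image"):
        raise ValueError(f"unknown data source: {source!r}")
    half = num_steps // 2
    if rng.random() < bias:
        if source == "video":
            return rng.randrange(half, num_steps)  # early denoising: coarse structure
        return rng.randrange(0, half)              # late denoising: fine details
    return rng.randrange(num_steps)

rng = random.Random(0)
video_steps = [sample_timestep("video", rng=rng) for _ in range(2000)]
image_steps = [sample_timestep("image", rng=rng) for _ in range(2000)]
print(sum(video_steps) / len(video_steps) > sum(image_steps) / len(image_steps))  # True
```

Setting `bias=1.0` gives a hard assignment of each source to its half of the schedule, while smaller values still let every source see every timestep occasionally.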

3. Experiments — 實驗

The base generator is Stable Diffusion V2.1, processing images at 512x512 resolution. Evaluation uses a constructed benchmark with 30 new concepts from DreamBooth and 80 manually selected scenes from COCO-Val, generating 2,400 combinations. Metrics include CLIP-Score and DINO-Score, which measure similarity between generated and target object features. A user study with 15 annotators rates fidelity, quality, and diversity on a scale of 1 to 4. Baselines include reference-based methods (Paint-by-Example, Graphit, Stable Diffusion inpainting) and tuning-based methods (DreamBooth, Custom Diffusion, Cones).

基底生成器為 Stable Diffusion V2.1，以 512x512 解析度處理影像。評估使用建構的基準測試集，包含 DreamBooth 的 30 個新概念與 COCO-Val 手動選取的 80 個場景，生成 2,400 個組合。指標包括衡量生成物件與目標物件特徵相似度的 CLIP-Score 和 DINO-Score。使用者研究由 15 位標注者以 1-4 分量表評估保真度、品質和多樣性。比較對象包括參考式方法（Paint-by-Example、Graphit、Stable Diffusion 修補）和微調式方法（DreamBooth、Custom Diffusion、Cones）。

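Paragraph function: Experimental setup: benchmark construction and evaluation metrics.

Logical role: Systematic evaluation over 2,400 combinations supports the statistical stability of the conclusions. Comparing against both families of methods shows the breadth of the study.

Argumentative technique / potential weakness: The self-built benchmark provides tailored evaluation conditions, but the lack of a community-recognized standardized benchmark may limit the comparability of the results.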
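
The benchmark size and the feature-similarity metrics can be illustrated in a few lines. The concept and scene names are placeholders, and treating CLIP-Score and DINO-Score as cosine similarity multiplied by 100 is an assumption about the scaling, made only to match the magnitude of the reported numbers.

```python
from itertools import product
import numpy as np

concepts = [f"concept_{i:02d}" for i in range(30)]  # placeholder names
scenes = [f"scene_{j:02d}" for j in range(80)]      # placeholder names
pairs = list(product(concepts, scenes))
print(len(pairs))  # 2400

def similarity_score(feat_generated, feat_target):
    """Cosine similarity between two feature vectors, multiplied by 100."""
    a = np.asarray(feat_generated, dtype=float)
    b = np.asarray(feat_target, dtype=float)
    return 100.0 * float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(similarity_score([1.0, 0.0], [1.0, 0.0])))  # 100
```

In practice the two feature vectors would come from the same encoder (CLIP or DINO) applied to the generated region and to the target object; here they are plain lists.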

User study results show AnyDoor achieves quality score 3.04, fidelity 3.06, and diversity 2.88, outperforming Paint-by-Example (2.71, 2.10, 3.04) and Graphit (2.65, 2.11, 2.84) on quality and fidelity while being competitive on diversity. Ablation studies reveal the critical role of each component: replacing DINO-V2 with CLIP causes results to lose identity features, retaining only semantic consistency. The full model achieves CLIP Score 82.1 and DINO Score 67.8, versus the CLIP-encoder baseline at 73.8 and 31.5. Background removal before DINO-V2 processing further improves DINO Score from 64.1 to 67.8, confirming that disentangling foreground from background is essential for identity preservation.

使用者研究結果顯示 AnyDoor 達到品質分數 3.04、保真度 3.06、多樣性 2.88，在品質和保真度上優於 Paint-by-Example（2.71、2.10、3.04）和 Graphit（2.65、2.11、2.84），多樣性上具有競爭力。消融研究揭示各組件的關鍵作用：將 DINO-V2 替換為 CLIP 導致結果失去身份特徵，僅保留語意一致性。完整模型達到 CLIP Score 82.1 和 DINO Score 67.8，而 CLIP 編碼器基線為 73.8 和 31.5。在 DINO-V2 處理前移除背景進一步將 DINO Score 從 64.1 提升至 67.8，證實將前景與背景解耦對身份保留至關重要。

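Paragraph function: Core results: quantitative comparison and ablation analysis.

Logical role: The large jump in DINO Score from 31.5 to 67.8 (+115%) is the strongest evidence that the method works.

Argumentative technique / potential weakness: Diversity (2.88) is slightly lower than Paint-by-Example's (3.04); the authors candidly acknowledge the inherent tension between fidelity and diversity. High fidelity tends to constrain generation diversity.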
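
The relative gains behind these ablation figures are simple arithmetic on the scores quoted above; the helper name `relative_gain` is introduced here only for illustration.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

print(round(relative_gain(31.5, 67.8)))     # 115  (DINO Score, CLIP encoder vs. full model)
print(round(relative_gain(64.1, 67.8), 1))  # 5.8  (DINO Score, effect of background removal)
print(round(relative_gain(73.8, 82.1), 1))  # 11.2 (CLIP Score, CLIP encoder vs. full model)
```

The comparison makes the asymmetry visible: swapping the encoder more than doubles the DINO Score, while the CLIP Score moves by roughly a tenth.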

For virtual try-on applications, AnyDoor preserves the color, texture, and patterns of target clothes and performs well for large human gestures. Unlike traditional GAN-based methods that require human parsing maps as additional input, AnyDoor needs only a bounding box indicating the upper-body position. The framework also supports flexible interactions through integration with inpainting and interactive segmentation models: users can click and drag objects, swap locations between objects, or adjust object shapes via bounding box manipulation. The pipeline uses inpainting to fill original positions and AnyDoor to regenerate objects at new locations.

在虛擬試衣應用中，AnyDoor 保留目標衣物的顏色、紋理和圖案，並在大幅度人體姿態下表現良好。不同於需要人體剖析圖作為額外輸入的傳統 GAN 方法，AnyDoor 僅需標示上半身位置的邊界框。此框架還透過與修補和互動分割模型的整合支援靈活的互動操作：使用者可點擊並拖曳物件、交換物件位置，或透過邊界框調整物件形狀。流程使用修補填充原始位置，並用 AnyDoor 在新位置重新生成物件。

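Paragraph function: Application showcase: virtual try-on and interactive manipulation.

Logical role: Concrete application scenarios demonstrate the framework's practical value, extending the research contribution into engineering use.

Argumentative technique / potential weakness: Needing only a bounding box rather than a human parsing map is a clear usability advantage. However, no quantitative metrics are provided for fine-grained evaluation of virtual try-on (for example, consistency of garment wrinkles).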
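
The object-moving pipeline (inpaint the old position, regenerate at the new one) can be sketched as a two-stage function. Both models are passed in as callables: `inpaint_fn` and `compose_fn` are hypothetical interfaces, and the toy stand-ins below only fill regions with flat colors so that the control flow can be run end to end.

```python
import numpy as np

def move_object(image, object_mask, target_box, inpaint_fn, compose_fn):
    """Erase an object at its old position, then regenerate it inside target_box.

    inpaint_fn(image, mask) -> image              # hypothetical inpainting model
    compose_fn(scene, object_crop, box) -> image  # hypothetical AnyDoor-style generator
    """
    object_crop = image * object_mask[..., None]  # background-removed object
    scene = inpaint_fn(image, object_mask)        # fill the original position
    return compose_fn(scene, object_crop, target_box)

def toy_inpaint(image, mask):
    """Stand-in: fill the masked pixels with the mean background color."""
    hole = mask.astype(bool)
    out = image.copy()
    out[hole] = image[~hole].mean(axis=0)
    return out

def toy_compose(scene, object_crop, box):
    """Stand-in: paint the box (y0, x0, y1, x1) with the object's mean color."""
    y0, x0, y1, x1 = box
    visible = object_crop.sum(axis=-1) > 0
    out = scene.copy()
    out[y0:y1, x0:x1] = object_crop[visible].mean(axis=0)
    return out

image = np.full((8, 8, 3), 0.2)
image[1:3, 1:3] = [1.0, 0.0, 0.0]  # a red "object" in the top-left corner
mask = np.zeros((8, 8))
mask[1:3, 1:3] = 1
result = move_object(image, mask, (5, 5, 7, 7), toy_inpaint, toy_compose)
print(result[1, 1], result[5, 5])  # [0.2 0.2 0.2] [1. 0. 0.]
```

Swapping two objects would follow the same pattern, with the roles of the masks and target boxes exchanged between the two calls.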

4. Conclusion — 結論

AnyDoor provides a diffusion-based solution for object teleportation through a discriminative ID extractor and frequency-aware detail extractor that together characterize target objects comprehensively. By training on combined video and image data with an adaptive timestep strategy, the model achieves high-fidelity compositions at user-specified scene locations without parameter tuning. The approach offers a universal solution for general region-to-region mapping tasks, with demonstrated applications spanning image composition, virtual try-on, and interactive object manipulation. The key insight is that combining self-supervised identity features with frequency-domain detail features bridges the gap between semantic understanding and fine-grained texture preservation.

AnyDoor 提供了一種基於擴散模型的物件傳送解決方案，透過具鑑別力的 ID 擷取器和頻率感知的細節擷取器全面表徵目標物件。透過在影片與影像混合資料上以自適應時步策略進行訓練，模型在使用者指定的場景位置實現高保真合成，無需參數調整。此方法提供了通用區域對區域映射任務的統一解決方案，已展示的應用涵蓋影像合成、虛擬試衣和互動式物件操作。核心洞見在於結合自監督身份特徵與頻域細節特徵，彌合了語意理解與精細紋理保留之間的鴻溝。

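Paragraph function: Summarizes the paper and restates the core value of the dual-path feature representation.

Logical role: Closing on the tension between semantic understanding and texture preservation lifts the technical contribution to a higher level of abstraction.

Argumentative technique / potential weakness: The claim of "general region-to-region mapping" broadens the scope of applicability, but cross-domain generalization (for example, to medical or satellite imagery) is not validated. Follow-up work needs to explore more non-natural-image scenarios.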

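Argument structure overview

Problem: object teleportation lacks zero-shot identity preservation
→ Thesis: a dual-path feature representation that unifies identity and detail
→ Method: DINO-V2 ID tokens + high-frequency map + adaptive timestep sampling
→ Evidence: DINO Score 67.8; user-study lead on quality and fidelity
→ Conclusion: a general zero-shot framework for object teleportation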

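Core claim (one sentence)

By combining DINO-V2 self-supervised identity features with Sobel high-frequency detail maps, AnyDoor achieves zero-shot object-level image customization without fine-tuning and substantially outperforms existing reference-based methods on fidelity.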

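Strongest point of the argument

The ablation experiments systematically verify each component's contribution: replacing CLIP with DINO-V2 lifts the DINO Score from 31.5 to 67.8 (+115%), clearly showing that the choice of identity feature extractor is decisive.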

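Weakest point of the argument

The evaluation benchmark is self-built (30 concepts x 80 scenes) and lacks a community-recognized standardized benchmark; applications such as virtual try-on are shown only qualitatively, without quantitative metrics, which limits the verifiability of the results.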