AnyDoor: Zero-shot Object-level Image Customization

Abstract

This work presents AnyDoor, a diffusion-based image generator that can teleport a target object into a new scene at a user-specified location in a harmonious way. The model is trained once and generalizes to diverse object-scene combinations without parameter tuning. It represents the object with identity features complemented by detail features, which maintain texture while allowing local variations. Training also draws on video datasets, in which the same object can be observed under varying conditions, to improve generalizability. Experiments show that it outperforms existing methods, with applications such as virtual try-on and object moving.

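Paragraph function: Overview of the whole paper. The "teleport" metaphor conveys AnyDoor's core capability precisely: placing an arbitrary object into a new scene in a zero-shot manner.

Logical role: The abstract does double duty, defining the problem (object-level image customization) and previewing the solution (dual identity and detail features, plus a training strategy built on video data).

Argumentative technique / potential weakness: The claim that the model "trains once and generalizes" is attractive, but the actual scope of generalization (object types, scene complexity) has to be verified in the experiments section. What counts as "harmonious" is also fairly subjective.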

1. Introduction

Diffusion models have significantly advanced image generation. While existing work explores editing posture, style, and content through instructions, AnyDoor specifically addresses "object teleportation": accurately placing a target object at a desired scene location by re-generating a box-marked region. Paint-by-Example and ObjectStitch take a target image as a template for editing a scene region, but they cannot generate ID-consistent content, especially for untrained categories. AnyDoor overcomes this by producing zero-shot, high-quality, ID-consistent compositions without fine-tuning.

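Paragraph function: Establishes the research field, narrowing from the broad progress of diffusion models to the specific problem of object-level customization.

Logical role: Starting point of the argument chain. The paragraph first acknowledges what diffusion models have achieved, then points out the identity-consistency shortcomings of existing methods (Paint-by-Example, ObjectStitch), paving the way for the need for AnyDoor.

Argumentative technique / potential weakness: With the phrase "especially for untrained categories" the authors pinpoint the weakness of competing methods and imply that they generalize poorly. However, the passage does not mention the scenarios in which fine-tuning methods such as DreamBooth have the advantage, which may make it somewhat one-sided.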

The core contribution lies in representing the object with identity- and detail-related features, compositing them so that they interact with the background scene, and injecting them into a pre-trained diffusion model. Training leverages both video and image data with an adaptive timestep sampler. Identity features are extracted with a DINO-V2 backbone, which encodes the image as a global token (1x1536) and patch tokens (256x1536); these are concatenated and projected into the embedding space of the pre-trained UNet by a single linear layer.

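Paragraph function: Preview of the solution architecture, listing its three technical pillars (identity features, detail features, feature injection).

Logical role: Serves as the solution overview in the argument chain and gives a roadmap for the detailed treatment in the method sections.

Argumentative technique / potential weakness: Choosing DINO-V2 rather than CLIP as the identity encoder is a key design decision: DINO-V2 is better at capturing local visual features than semantic-level information. Relying on a single linear layer to align the tokens with the UNet embedding space, however, may limit the richness of the representation.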

2. Related Work

Local image editing methods have mostly focused on text-guided editing of local regions. Blended Diffusion conducts multi-step blending; InpaintAnything combines SAM and Stable Diffusion; Paint-by-Example uses a CLIP image encoder; ObjectStitch proposes a content adaptor. However, "those methods could only give coarse guidance for generations and often fail to synthesize ID-consistent results." Customized image generation methods such as Textual Inversion and DreamBooth fine-tune the model for a specific object. They achieve high fidelity, but they cannot specify the scenario or location and they require extensive per-object tuning.

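Paragraph function: Literature review. It sorts existing methods into two groups (local editing vs. customized generation) and points out the shortcomings of each.

Logical role: Establishes the research gap that no existing method delivers both identity consistency and scene specification, which motivates AnyDoor's dual-feature design.

Argumentative technique / potential weakness: Labeling the competing methods as giving only "coarse guidance" is somewhat simplistic, since Paint-by-Example actually performs quite well on same-category objects. The classification (editing vs. customization) nonetheless effectively highlights AnyDoor's position spanning both groups.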

3. Method

3.1 Identity Feature Extraction

The method employs a pre-trained visual encoder for identity extraction. "Before feeding the target image into the ID extractor, we remove the background with a segmentor and align the object to the image center." The DINO-V2 backbone encodes the image as a global token (1x1536) and patch tokens (256x1536), which are concatenated to preserve more information. "A single linear layer as a projector aligns these tokens to the embedding space of the pre-trained text-to-image UNet," producing ID tokens of shape 257x1024. Unlike CLIP, which encodes semantic-level information, DINO-V2 captures discriminative visual features that are crucial for identity preservation.

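Paragraph function: First core step of the method, defining how identity features are extracted and encoded.

Logical role: The preprocessing (background removal plus center alignment) eliminates distractions so that DINO-V2 focuses on the object itself. Combining the global token with patch tokens covers both global semantics and local detail.

Argumentative technique / potential weakness: The choice of DINO-V2 over CLIP is validated by an ablation study, but the projector is only a single linear layer, which may lack representational capacity under cross-domain transfer. The fixed number of patch tokens (256) may also limit adaptation to very large or very small objects.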
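The token shapes in this section can be checked with a small NumPy sketch. It is only an illustration of the concatenate-then-project step: the token values are random stand-ins for real DINO-V2 outputs, and the projector weights `W` and `b` and the helper `make_id_tokens` are hypothetical names, not part of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for DINO-V2 outputs; only the shapes come from the text.
global_token = rng.standard_normal((1, 1536))    # one image-level token
patch_tokens = rng.standard_normal((256, 1536))  # one token per image patch

# Hypothetical projector parameters (a single linear layer, 1536 -> 1024).
W = rng.standard_normal((1536, 1024)) * 0.02
b = np.zeros(1024)


def make_id_tokens(global_token, patch_tokens, W, b):
    """Concatenate global and patch tokens, then apply one linear projection."""
    tokens = np.concatenate([global_token, patch_tokens], axis=0)  # (257, 1536)
    return tokens @ W + b                                          # (257, 1024)


id_tokens = make_id_tokens(global_token, patch_tokens, W, b)
print(id_tokens.shape)  # (257, 1024)
```

The 257 rows are the 1 global token plus the 256 patch tokens, and the projection changes only the feature width, from 1536 to 1024.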

3.2 Detail Feature Extraction

Because ID tokens lose spatial resolution and cannot carry fine details, the method adds detail guidance through a high-frequency map. "Using collage as controls could provide strong priors," but a raw collage risks over-constraining the result. The method therefore introduces an information bottleneck: it computes a high-frequency map with horizontal and vertical Sobel kernels, takes a Hadamard product with the image to extract RGB colors, and filters out contour information with an eroded mask. This "maintains the fine details yet allows versatile local variants like gesture, lighting, orientation." A ControlNet-style UNet encoder then processes the stitched collage to produce detail maps at hierarchical resolutions.

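Paragraph function: Second core step of the method, addressing the fine textures that ID tokens cannot retain.

Logical role: Complements 3.1: identity features determine which object it is, and detail features determine what the object looks like. The information bottleneck is the key balancing point.

Argumentative technique / potential weakness: Extracting high-frequency information with Sobel kernels is an elegant design: it keeps edges and textures but discards low-frequency structure, leaving the model free to adjust pose. Sobel kernels are sensitive to noise, however, so the detail guidance may be unstable on low-quality input images.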
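The high-frequency map described in this section can be sketched in plain NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: summing the horizontal and vertical Sobel responses, the zero padding, the 3x3 erosion, and the helper names `filter3x3`, `erode`, and `high_frequency_map` are all illustrative choices.

```python
import numpy as np

SOBEL_H = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_V = SOBEL_H.T


def filter3x3(channel, kernel):
    """Correlate one 2-D channel with a 3x3 kernel (zero padding, same size)."""
    h, w = channel.shape
    padded = np.pad(channel, 1)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def erode(mask, iterations=1):
    """Binary erosion: keep a pixel only if its whole 3x3 neighbourhood is set."""
    mask = mask.astype(bool)
    h, w = mask.shape
    for _ in range(iterations):
        padded = np.pad(mask, 1)  # pads with False
        out = np.ones((h, w), dtype=bool)
        for dy in range(3):
            for dx in range(3):
                out &= padded[dy:dy + h, dx:dx + w]
        mask = out
    return mask


def high_frequency_map(image, mask, erode_iters=1):
    """image: (H, W, 3) floats in [0, 1]; mask: (H, W) object mask."""
    # Assumption: the two Sobel responses are summed per channel.
    edges = np.stack(
        [filter3x3(image[..., c], SOBEL_H) + filter3x3(image[..., c], SOBEL_V)
         for c in range(3)],
        axis=-1,
    )
    eroded = erode(mask, erode_iters)[..., None]  # drops the outer contour
    return edges * image * eroded                 # Hadamard products


# Toy input: a vertical step edge inside a mask that covers the whole image.
image = np.zeros((8, 8, 3))
image[:, 4:, :] = 1.0
mask = np.ones((8, 8))
hf_map = high_frequency_map(image, mask)
print(hf_map.shape)
```

On the toy input the response is non-zero only along the step edge, and the one-pixel border is suppressed by the eroded mask, which is the bottleneck behaviour the text describes: texture and edge cues survive, while flat regions and the outer contour do not.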

3.3 Feature Injection

ID tokens replace the text embedding and are injected into each UNet layer via cross-attention, while the detail maps are concatenated with the UNet decoder features at each resolution. Training freezes the pre-trained UNet encoder parameters and tunes the decoder to adapt to the task. Training image pairs are collected from video datasets (YouTubeVOS, YouTubeVIS, UVO, MOSE, VIPSeg, BURST) and image sources (MVImgNet, VitonHD, FashionTryon, MSRA-10K, DUT, HFlickr, LVIS, SAM subset). An adaptive timestep sampler adjusts the denoising focus: "early denoising steps focus on overall structure, pose, and view; later steps cover fine details like texture and colors."
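The two injection routes can be illustrated with a small NumPy sketch: single-head cross-attention in which image features query the ID tokens, and channel-wise concatenation of a detail map with decoder features. Only the 257x1024 ID-token shape comes from the text; the feature sizes, the attention width, the random weights, and the helper names are hypothetical, and a real UNet would use multi-head attention with learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(features, id_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image features attend over the ID tokens."""
    q = features @ w_q                                  # (N, d)
    k = id_tokens @ w_k                                 # (257, d)
    v = id_tokens @ w_v                                 # (257, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, 257)
    return weights @ v, weights


def concat_detail(decoder_features, detail_map):
    """Concatenate a detail map with decoder features along the channel axis."""
    return np.concatenate([decoder_features, detail_map], axis=0)


rng = np.random.default_rng(0)
id_tokens = rng.standard_normal((257, 1024))   # shape from the text
features = rng.standard_normal((64, 320))      # hypothetical: 8x8 positions, 320 channels
d = 64                                         # hypothetical attention width
w_q = rng.standard_normal((320, d)) * 0.02
w_k = rng.standard_normal((1024, d)) * 0.02
w_v = rng.standard_normal((1024, d)) * 0.02

attended, weights = cross_attention(features, id_tokens, w_q, w_k, w_v)

decoder_features = rng.standard_normal((320, 8, 8))  # hypothetical (C, H, W)
detail_map = rng.standard_normal((320, 8, 8))        # hypothetical, same resolution
fused = concat_detail(decoder_features, detail_map)

print(attended.shape, weights.shape, fused.shape)  # (64, 64) (64, 257) (640, 8, 8)
```

Each spatial position receives a weighted mix of the 257 ID tokens, whereas the detail map keeps its spatial layout and simply adds channels, which is why the two signals can carry different kinds of information.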

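Paragraph function: Integration of the method, describing how the two feature types are injected into the diffusion model and how training is organized.

Logical role: Connects the features defined in the previous two sections (identity and detail) to the pre-trained diffusion model, completing the pipeline from feature extraction to image generation.

Argumentative technique / potential weakness: Using video data is a clever training strategy: the same object naturally shows pose and lighting changes across frames, which yields paired training data at no extra cost. The adaptive timestep sampler also reflects a solid understanding of the denoising process in diffusion models. Mixing so many data sources, however, may introduce distribution inconsistencies.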
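One way such an adaptive timestep sampler could work is sketched below. The text states only that early denoising steps shape structure, pose, and view while later steps add texture and color; it does not give the sampling rule. The sketch therefore assumes that video pairs (which vary in pose and view) favour the high-noise half of the schedule and that still-image pairs favour the low-noise half. The 1000-step schedule, the split at 500, the 1.5x weight, and the function name `sample_timestep` are all illustrative assumptions.

```python
import random

NUM_STEPS = 1000  # assumed length of the diffusion schedule
SPLIT = 500       # assumed boundary between low-noise and high-noise timesteps
BOOST = 1.5       # assumed extra weight given to the favoured half


def sample_timestep(source, rng=random):
    """Draw a training timestep, biased by the data source ('video' or 'image').

    Large timesteps are the early denoising steps (high noise); small
    timesteps are the late denoising steps (low noise).
    """
    if source not in ("video", "image"):
        raise ValueError(f"unknown data source: {source!r}")
    high_w, low_w = (BOOST, 1.0) if source == "video" else (1.0, BOOST)
    p_high = high_w / (high_w + low_w)
    if rng.random() < p_high:
        return rng.randrange(SPLIT, NUM_STEPS)  # early denoising steps
    return rng.randrange(0, SPLIT)              # late denoising steps


rng = random.Random(0)
video_steps = [sample_timestep("video", rng) for _ in range(10_000)]
image_steps = [sample_timestep("image", rng) for _ in range(10_000)]
print(sum(t >= SPLIT for t in video_steps) / len(video_steps))
print(sum(t >= SPLIT for t in image_steps) / len(image_steps))
```

With these assumed weights, about 60% of video samples land in the high-noise half and about 40% of image samples do, so each data source contributes most where its supervision is expected to be most informative.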

4. Experiments

The method builds on Stable Diffusion V2.1 and processes images at 512x512. Evaluation pairs 30 new concepts from DreamBooth with 80 scene images from COCO-Val, giving 2,400 combinations. Metrics include CLIP-Score, DINO-Score, and a user study with 15 annotators. Compared with reference-based methods (Paint-by-Example, Graphit, SD inpainting), AnyDoor maintains "highly-faithful details" where competitors maintain only semantic consistency. Compared with tuning-based methods (DreamBooth, Custom Diffusion), AnyDoor achieves "high-fidelity results for multi-subject composition without parameter tuning," whereas tuning-based methods require approximately one hour of fine-tuning per object. In the user study (quality, fidelity, diversity), AnyDoor scores 3.04, 3.06, and 2.88, against 2.71, 2.10, and 3.04 for Paint-by-Example and 2.65, 2.11, and 2.84 for Graphit. AnyDoor therefore leads on quality and fidelity, while Paint-by-Example scores higher on diversity.

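Paragraph function: Provides empirical support, validating the method through both quantitative metrics and a user study.

Logical role: The experiments cover three dimensions: (1) identity preservation compared with reference-based methods; (2) efficiency compared with tuning-based methods; (3) subjective quality assessed through the user study.

Argumentative technique / potential weakness: An evaluation over 2,400 combinations is reasonably thorough. AnyDoor's diversity score (2.88) is nevertheless slightly below that of Paint-by-Example (3.04), which suggests that identity preservation may come at the cost of some generation diversity. In addition, 512x512 resolution is low by current standards.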
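The evaluation numbers quoted in this section can be cross-checked with a short script. The scores are copied from the text above; the variable and function names are only illustrative.

```python
from itertools import product

NUM_CONCEPTS = 30  # DreamBooth concepts
NUM_SCENES = 80    # COCO-Val scene images

# Every concept is paired with every scene.
pairs = list(product(range(NUM_CONCEPTS), range(NUM_SCENES)))

# User study scores as reported in the text.
user_study = {
    "AnyDoor": {"quality": 3.04, "fidelity": 3.06, "diversity": 2.88},
    "Paint-by-Example": {"quality": 2.71, "fidelity": 2.10, "diversity": 3.04},
    "Graphit": {"quality": 2.65, "fidelity": 2.11, "diversity": 2.84},
}


def best_per_metric(scores):
    """Return the top-scoring method for each metric."""
    metrics = next(iter(scores.values())).keys()
    return {m: max(scores, key=lambda name: scores[name][m]) for m in metrics}


best = best_per_metric(user_study)
print(len(pairs))  # 2400
print(best)
```

The result makes the trade-off explicit: AnyDoor ranks first on quality and fidelity, and Paint-by-Example ranks first on diversity.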

5. Conclusion

AnyDoor provides "a universal solution for general region-to-region mapping tasks" through discriminative ID extraction and frequency-aware detail extraction. Training on combined video and image data enables zero-shot object teleportation, with applications across image editing, composition, and synthesis tasks. The approach demonstrates that a single trained model can handle diverse object-scene combinations without per-object fine-tuning, which opens up possibilities for virtual try-on, object moving, and flexible scene composition.

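Paragraph function: Summarizes the paper, restating the core contribution and looking ahead to application scenarios.

Logical role: The conclusion echoes the "teleport" metaphor of the abstract and links the technical contribution (the dual-feature architecture) to its practical value (zero-shot generalization), closing the argument loop.

Argumentative technique / potential weakness: The phrase "universal solution" may be overconfident. The conclusion does not discuss failure cases or limitations, such as handling heavily occluded scenes or extreme viewpoint changes, or the constraint that 512x512 resolution places on downstream applications.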

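Argument structure overview

Problem
Object-level image customization
Lacks identity consistency and generalization

→

Claim
Dual identity and detail features
Enable zero-shot object teleportation

→

Evidence
Evaluation on 2,400 combinations
User study favors AnyDoor on quality and fidelity

→

Rebuttal
Video-data training strategy
Overcomes pose and viewpoint variation

→

Conclusion
General region-to-region mapping
No per-object fine-tuning required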

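Authors' core claim (in one sentence)

By combining DINO-V2 identity features with high-frequency detail features in a dual-track design, together with an adaptive training strategy built on video data, the authors achieve diffusion-based image generation that places arbitrary objects into new scenes zero-shot, with no fine-tuning.

Strongest point of the argument

Decoupling of identity and detail: DINO-V2 captures global identity and Sobel kernels extract high-frequency detail, and the two are injected through separate paths (cross-attention vs. feature concatenation), which neatly balances identity preservation against pose flexibility. Introducing video data additionally provides, at no extra cost, a training signal for natural object variation.

Weakest point of the argument

Generalization boundaries are under-explored: the paper does not systematically analyze failure cases, such as performance under extreme occlusion, transparent objects, reflective materials, or complex lighting. In addition, the 512x512 resolution limit and a diversity score slightly below that of Paint-by-Example suggest that the trade-off between fidelity and diversity still has room for improvement.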