Abstract
As Text-to-Image (T2I) diffusion models become increasingly powerful, the scientific community is developing methods to limit their potential misuse, such as generating copyrighted content or harmful imagery. These concept inhibition methods aim to remove specific concepts from a model's generative capability. In this work, we test these safety measures as adversaries to assess their robustness. We leverage the compositional property of diffusion models, which allows combining multiple prompts in a single image generation. Our key insight is that even when a target concept has been inhibited, its generative direction in the model's latent space can be approximately reconstructed by combining other, non-inhibited concepts. We call this approach Concept Arithmetics, drawing an analogy to the well-known arithmetic properties of word embeddings.
Paragraph function
Whole-paper overview: positions the work as safety red-teaming and introduces the core idea of Concept Arithmetics.
Logical role
The "adversary" framing establishes the legitimacy of the research: the goal is not to break safety measures but to expose vulnerabilities through red-teaming so that defenses can be improved.
Argumentative technique / potential weaknesses
Framing the "attack" as a "test" is an ethically careful choice of wording. The name "Concept Arithmetics" borrows from the well-known arithmetic properties of word embeddings, which is both intuitive and lends the method academic credibility.
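The word-embedding analogy behind the name can be illustrated with toy vectors. The numbers below are made up purely for illustration; real word embeddings are learned and high-dimensional:

```python
import numpy as np

# Toy 3-D "embeddings" (illustrative values only, not real word vectors).
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classic analogy: king - man + woman ≈ queen.
combo = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(combo, vec[w]))
print(best)  # → queen
```

Concept Arithmetics transfers this intuition from word vectors to the guidance directions of a diffusion model.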
1. Introduction
Text-to-Image diffusion models such as Stable Diffusion and DALL-E can generate photorealistic images from natural language descriptions. While this capability enables remarkable creative applications, it also raises significant concerns about misuse, including the generation of deepfakes, copyrighted art styles, and violent or explicit content. In response, researchers have developed various concept inhibition techniques that modify the model to prevent generation of specific target concepts. Methods like Erased Stable Diffusion (ESD), Forget-Me-Not, and Concept Ablation fine-tune the model to unlearn particular concepts while preserving its general generative ability.
Paragraph function
Establishes the research setting: background and motivation for concept inhibition.
Logical role
First establishes the capability and risks of T2I models, then introduces concept inhibition methods as background, paving the way for the attack study that follows.
Argumentative technique / potential weaknesses
Listing concrete misuse scenarios and defense methods makes the reality and urgency of the problem tangible.
However, a critical question remains: how robust are these concept inhibition methods against adversarial circumvention? If an inhibited concept can be easily recovered through simple prompt engineering or model manipulation, the safety guarantees provided by these methods would be insufficient. In this work, we demonstrate that the compositional nature of diffusion models creates a fundamental vulnerability in concept inhibition approaches. Specifically, because diffusion models can compose multiple concepts through mechanisms like classifier-free guidance blending and prompt interpolation, an adversary can reconstruct an inhibited concept by arithmetically combining related, non-inhibited concepts.
Paragraph function
Poses the core research question: how robust is concept inhibition?
Logical role
Transitions from background to the central insight: compositionality, nominally a strength of the model, becomes a safety weakness.
Argumentative technique / potential weaknesses
"Fundamental vulnerability" is a strong claim that needs experimental support later. Compositionality is indeed an inherent property of diffusion models, which makes this attack vector hard to eliminate.
2. Method
Our approach, Concept Arithmetics, exploits the compositional property of diffusion models. In standard diffusion sampling, the noise prediction at each step can be expressed as a linear combination of conditional and unconditional predictions through classifier-free guidance. We extend this by observing that multiple conditional predictions can be combined with different weights, effectively performing arithmetic operations in the model's prediction space. Given a target concept T that has been inhibited, we seek non-inhibited concepts A, B, C, ... such that T can be approximated as a weighted sum: T ≈ w_A * A + w_B * B + w_C * C + ... where the weights are optimized to maximize the similarity between the generated output and the target concept.
Paragraph function
Lays out the core method: the mathematical formulation of Concept Arithmetics.
Logical role
Formalizes the intuition (a concept can be approximated by composition) into an actionable mathematical framework.
Argumentative technique / potential weaknesses
The weighted linear combination is clean and intuitive, but its premise, that the concept space is approximately linear, does not necessarily hold in all cases.
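The guidance arithmetic described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `eps` is a stand-in for the diffusion model's noise predictor (a real version would run the U-Net on the concept's text embedding), and the surrogate concepts and weights are invented for the example:

```python
import zlib
import numpy as np

def eps(latent, concept):
    """Stand-in for a diffusion model's noise prediction eps(x_t, c).
    A CRC-seeded random direction per concept keeps the sketch runnable
    and deterministic; a real implementation would call the U-Net."""
    rng = np.random.default_rng(zlib.crc32(concept.encode()))
    return 0.1 * latent + rng.standard_normal(latent.shape)

def concept_arithmetic_step(latent, weighted_concepts, guidance_scale=7.5):
    """Guidance signal for one denoising step, extending classifier-free
    guidance from a single conditional term to a weighted sum of several:

        eps_hat = eps_uncond + s * sum_i w_i * (eps(c_i) - eps_uncond)

    With one concept and w = 1 this reduces to standard CFG."""
    eps_uncond = eps(latent, "")  # unconditional prediction
    direction = np.zeros_like(latent)
    for concept, weight in weighted_concepts:
        direction += weight * (eps(latent, concept) - eps_uncond)
    return eps_uncond + guidance_scale * direction

# Approximate an inhibited style T with non-inhibited surrogates A, B, C
# (hypothetical surrogates and weights, chosen only to illustrate the form
# T ≈ w_A*A + w_B*B + w_C*C).
latent = np.zeros((4, 8, 8))
combo = [("post-impressionist painting", 0.6),
         ("swirling brushstrokes", 0.3),
         ("vivid yellow-blue palette", 0.4)]
guided = concept_arithmetic_step(latent, combo)
print(guided.shape)  # → (4, 8, 8)
```

Because the combination happens in the prediction space at sampling time, no retraining or weight access is needed; only the guidance weights change.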
3. Experiments
We evaluate Concept Arithmetics against three state-of-the-art concept inhibition methods: Erased Stable Diffusion (ESD), Forget-Me-Not (FMN), and Concept Ablation (CA). Our experiments cover two categories of inhibited concepts: artistic styles (e.g., Van Gogh, Picasso) and object categories (e.g., specific celebrities, trademarked characters). For artistic style recovery, we find that Concept Arithmetics can reconstruct the inhibited style with a CLIP similarity score of 0.78 to the original, compared to 0.31 achieved by direct prompting of the inhibited model. Human evaluators rated 72% of our reconstructed images as successfully capturing the target style, demonstrating the effectiveness of the compositional attack.
Paragraph function
Provides the core empirical evidence: quantitative metrics validating the attack's effectiveness.
Logical role
The 0.78 vs. 0.31 contrast directly demonstrates the fragility of concept inhibition methods and supports the paper's central claim.
Argumentative technique / potential weaknesses
Combining an automatic metric (CLIP) with human evaluation strengthens credibility, but whether CLIP similarity fully captures "style recovery" remains open to debate.
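The CLIP-similarity protocol amounts to cosine similarity between image embeddings. A minimal sketch, with `clip_embed` as a placeholder (a real evaluation would use an actual CLIP image encoder, e.g. via the open_clip library):

```python
import numpy as np

def clip_embed(image):
    """Placeholder for a CLIP image encoder (real code would call something
    like model.encode_image). Here: flatten and L2-normalize."""
    v = np.asarray(image, dtype=float).ravel()
    return v / np.linalg.norm(v)

def style_similarity(generated, reference):
    """Cosine similarity between embeddings, the quantity behind the
    reported 0.78 vs. 0.31 comparison."""
    return float(clip_embed(generated) @ clip_embed(reference))

rng = np.random.default_rng(0)
reference = rng.random((8, 8))                                   # "original style"
reconstruction = reference + 0.1 * rng.standard_normal((8, 8))   # close to it
unrelated = rng.random((8, 8))                                   # baseline
print(style_similarity(reconstruction, reference)
      > style_similarity(unrelated, reference))  # → True
```

The metric only measures embedding proximity, which is why the paper pairs it with human ratings.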
For object category recovery, the results are similarly concerning. When attempting to generate inhibited celebrity faces, direct prompting yields unrecognizable outputs, but Concept Arithmetics can recover identifiable likenesses in 63% of cases as judged by a face verification model. We also demonstrate that the attack generalizes across different concept inhibition methods, suggesting that the vulnerability is not specific to any particular defense approach but is inherent to the compositional nature of diffusion models. Importantly, our attack requires no access to the model's internal weights — it operates purely through the inference API by manipulating the guidance signals.
Paragraph function
Extends the evidence: demonstrates generalization across methods and the black-box nature of the attack.
Logical role
The two findings, cross-method generalization and black-box operation, substantially raise the severity of the security threat.
Argumentative technique / potential weaknesses
"No weight access required" is a striking finding, implying that any API service deploying concept inhibition could be bypassed. But a 63% success rate also means 37% failure; the defenses are not entirely ineffective.
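The black-box weight search can be sketched as a plain grid search over guidance weights, scoring each candidate with a similarity oracle. Every component here is an illustrative stand-in, not the paper's optimizer: `generate` abstracts an inference-API call, and `score` abstracts CLIP or face-verification scoring:

```python
import itertools
import numpy as np

def generate(weights, basis):
    """Stand-in for one inference-API call that blends the basis concepts
    with the given guidance weights."""
    return sum(w * b for w, b in zip(weights, basis))

def score(image, target):
    """Stand-in similarity oracle (e.g., CLIP score or face verification)."""
    a, b = image.ravel(), target.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def grid_search_weights(basis, target, grid=(0.0, 0.5, 1.0)):
    """Exhaustively try small weight combinations; keep the best-scoring one."""
    best_w, best_s = None, -np.inf
    for weights in itertools.product(grid, repeat=len(basis)):
        if not any(weights):
            continue  # skip the all-zero combination
        s = score(generate(weights, basis), target)
        if s > best_s:
            best_w, best_s = weights, s
    return best_w, best_s

rng = np.random.default_rng(1)
basis = [rng.standard_normal((4, 4)) for _ in range(3)]
target = 1.0 * basis[0] + 0.5 * basis[1]  # target lies in the span of the basis
weights, sim = grid_search_weights(basis, target)
print(weights)  # → (1.0, 0.5, 0.0)
```

Since the search only issues generation requests and reads scores, it needs nothing beyond what a public inference API already exposes, which is what makes the black-box claim plausible.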
4. Discussion
Our findings have important implications for AI safety. The fact that concept inhibition can be circumvented through the model's own compositional properties suggests that current approaches to content safety in diffusion models may need fundamental rethinking. Simply removing a concept from the model's repertoire is insufficient if the concept can be approximately reconstructed from its constituent parts. We suggest that future safety measures should account for the compositional nature of these models, potentially by monitoring the composition of guidance signals during inference or by developing inhibition methods that also address related concept combinations. We emphasize that the goal of this work is to strengthen AI safety by identifying weaknesses, not to enable misuse.
Paragraph function
Discusses the implications: moves from technical findings to safety-policy recommendations.
Logical role
Turns the attack research into constructive recommendations, addressing potential ethical objections.
Argumentative technique / potential weaknesses
The ethical disclaimer ("to strengthen safety, not to enable misuse") is standard practice for red-team research, but publishing attack details can itself be exploited, an inherent paradox of security research.
5. Conclusion
We have presented Concept Arithmetics, a method that reveals a fundamental vulnerability in current concept inhibition approaches for diffusion models. By leveraging the compositional property of these models, we demonstrate that inhibited concepts can be approximately reconstructed by combining non-inhibited concepts through weighted guidance signal manipulation. Our attacks are effective across multiple concept inhibition methods, require no weight access, and achieve high success rates in recovering both artistic styles and object categories. These findings underscore the need for more robust content safety mechanisms that account for the compositional nature of generative models.
Paragraph function
Summarizes the paper: restates the core findings and their safety implications.
Logical role
Closes with a call for "more robust safety mechanisms," converting the attack research into a constructive contribution.
Argumentative technique / potential weaknesses
As an Honorable Mention, this work hits a core pain point of AI safety and carries significant policy relevance.
Overview of the argument structure
Question: Are concept inhibition methods robust?
→ Claim: Compositionality creates a fundamental weakness.
→ Method: Concept Arithmetics, circumvention via weighted combination.
→ Evidence: CLIP 0.78 vs. 0.31; 72% human-rated success.
→ Conclusion: Safety mechanisms need rethinking.
Core claim
The compositional property of diffusion models leaves concept inhibition methods with a fundamental weakness: an inhibited concept can be approximately reconstructed by combining other, non-inhibited concepts.
Strongest point of the argument
The black-box nature of the attack (no weight access required) and its generalization across methods indicate the weakness is inherent to the model, not a flaw of any particular defense implementation.
Weakest point of the argument
The assumption that the concept space composes linearly does not necessarily hold in general, and the 63% face-recovery rate shows the defenses retain partial effect, so the claim of a "fundamental" weakness may be overstated.