ControlMM: Controllable Masked Motion Generation

Abstract

The paper introduces ControlMM, a method that incorporates spatial control signals into generative masked motion models. The authors claim to achieve "real-time, high-fidelity, and high-precision controllable motion generation simultaneously". Key innovations include masked consistency modeling and inference-time logit editing. Results show superior motion quality with FID scores of 0.061 versus 0.271 for state-of-the-art methods, with generation speeds 20 times faster than diffusion-based approaches.


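Paragraph function: Overview of the whole paper. It frames the contribution around three goals (real-time, high-fidelity, high-precision) and uses a numerical comparison to establish the advantage quickly.

Logical role: The abstract's central claim is that all three goals are achieved simultaneously, which is a highly ambitious positioning. The 77% FID improvement and the 20x speedup give it strong quantitative support.

Argumentative technique / potential weaknesses: Each of the three goals has to be verified individually in the experiments. The FID comparison baseline (TLControl) may not be the strongest one, so it is worth checking whether the comparison against recent methods such as OmniControl is adequate.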
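
The 77% figure in the note on the logical role can be checked directly from the two FID scores quoted in the abstract. The following is only a worked calculation of the relative reduction, not a number taken from the paper's tables:

```latex
\[
\frac{0.271 - 0.061}{0.271} = \frac{0.210}{0.271} \approx 0.775
\]
```

The FID therefore falls by roughly 77% relative to the 0.271 baseline.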

1. Introduction

Text-driven human motion generation has gained significant attention due to natural language's semantic richness. Applications span animation, film, VR/AR, and robotics. However, text descriptions alone struggle to provide precise spatial control over specific joints like the pelvis and hands, limiting natural environmental interaction and 3D space navigation. Existing solutions face difficulties generating high-fidelity motion with precise, flexible spatial control while ensuring real-time inference.


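Paragraph function: Introduces the problem. Starting from the success of text-driven motion generation, it points out the gap in precise spatial control.

Logical role: The contrast between "semantically rich" and "spatially vague" sets up the core tension and motivates the introduction of spatial control signals.

Argumentative technique / potential weaknesses: The problem is framed as achieving three goals at once, implying that existing methods reach at most one or two of them. However, some diffusion methods (such as GMD) already offer a degree of spatial control, so this simplification may understate what existing methods can do.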

The paper identifies three key problems: unsatisfactory spatial flexibility and accuracy, suboptimal motion quality in controllable models, and slow generation with motion-space diffusion models. ControlMM addresses these by integrating spatial control into generative masked motion models, and is described as "the first method capable of achieving real-time, high-fidelity, and high-precision controllable motion generation simultaneously".


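Paragraph function: Structures the problem by breaking a vague challenge into three explicit sub-problems.

Logical role: The symmetric three-part problem statement foreshadows ControlMM's three-part solution and gives the method section a clear roadmap.

Argumentative technique / potential weaknesses: The "first" claim needs careful verification, since in a fast-moving field concurrent work may make similar claims. Still, the claim does set a high bar for differentiation.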

Text-driven motion generation methods have evolved from KL divergence and contrastive loss approaches (Language2Pose, TEMOS, T2M) to diffusion models operating in motion space, VAE latent space, or quantized space. Token-based models explore autoregressive GPTs and masked motion modeling, predicting multiple tokens simultaneously and generating sequences in as few as 15 steps, achieving state-of-the-art performance on generation quality and efficiency.


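Paragraph function: Literature review that traces three generations of motion generation techniques.

Logical role: Establishes the premise that masked models outperform diffusion models: the efficiency of 15-step generation justifies ControlMM's choice of a masked architecture.

Argumentative technique / potential weaknesses: Positioning masked models as the "third generation" implies linear technical progress, but diffusion models still hold advantages in some respects, such as generation diversity.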

For controllable motion synthesis, PriorMDM fine-tunes for end effector control, GMD incorporates root joint guidance, and OmniControl extends to any joint. MotionLCM applies control in latent space, while DNO introduces optimization on diffusion noise. Existing methods are "either limited to specific signals, achieve imprecise control, or rely on optimizing over the diffusion model which is costly and impractical for real-time applications".


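Paragraph function: Critiques existing methods, using a three-way "either ... or" structure to point out the limitations of each.

Logical role: Argues by elimination: once every existing path is shown to fall short, the new path (masked model plus spatial control) becomes the natural direction to explore.

Argumentative technique / potential weaknesses: The three-way "either ... or" rhetoric is forceful, but each method actually has strengths on its own sub-problem. Listing all the shortcomings side by side may oversimplify the problem.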

3. Method

3.1 Preliminary: Generative Masked Motion Model

Masked Motion Models consist of two stages: a Motion Tokenizer and a Text-conditioned Masked Transformer. The tokenizer learns discrete motion representations by quantizing encoder output embeddings into a codebook. For a motion sequence X, the encoder compresses it into a latent embedding z, which is quantized into codes c from codebook C. In the second stage, quantized motion tokens are corrupted with [MASK] tokens and fed into the text-conditioned masked transformer to reconstruct the original sequence. During inference, decoding is iterative: at each step the transformer re-masks the tokens it is least confident about and predicts them again in parallel.


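Paragraph function: Technical preliminaries that establish the basic framework of masked motion models.

Logical role: Provides the technical background needed for the ControlMM extension that follows. The quantize-mask-reconstruct pipeline defines the interface points where spatial control signals can be injected.

Argumentative technique / potential weaknesses: The clean two-stage pipeline is easy to follow, but discrete quantization itself introduces information loss. If the codebook is not large enough, fine-grained spatial control may not be encoded faithfully.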
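
The two ingredients of the pipeline described in 3.1, nearest-code quantization and confidence-based re-masking, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions and not the authors' implementation: the function names, the `MASK` sentinel, the toy 2-D codebook, and the probabilities are all hypothetical.

```python
MASK = -1  # hypothetical sentinel standing in for the [MASK] token

def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]

def remask_least_confident(probs, num_mask):
    """Take the most likely token per position, then re-mask the least confident ones."""
    tokens = [max(range(len(p)), key=p.__getitem__) for p in probs]
    confidence = [max(p) for p in probs]
    # Positions sorted from least to most confident.
    order = sorted(range(len(probs)), key=confidence.__getitem__)
    for i in order[:num_mask]:
        tokens[i] = MASK
    return tokens

# Toy codebook C with three 2-D codes, and three encoder outputs z.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
codes = quantize(latents, codebook)

# Toy per-position token probabilities from one decoding step.
probs = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25], [0.05, 0.05, 0.9]]
tokens = remask_least_confident(probs, num_mask=1)
print(codes, tokens)
```

In this toy run the second position has the lowest confidence (0.4), so it is the one sent back to [MASK] for the next step. The information-loss concern raised above is visible in `quantize`: every latent is snapped to one of only three codes.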

3.2 Masked Consistency Modeling

The architecture consists of a pre-trained text-conditioned masked motion model and a trainable motion control model. Each Transformer layer is paired with a corresponding trainable layer, connected through a zero-initialized linear layer so that the control branch has no effect at the start of training. The spatial control signal S specifies target 3D coordinates for the controlled joints among 22 total joints, with uncontrolled joints zeroed out. To guarantee controllability, the paper employs consistency training, which extracts spatial control signals from the generated motion and optimizes a consistency loss between the input controls and the extracted outputs.


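Paragraph function: First part of the core mechanism. It describes how spatial control signals are injected into the masked model and how consistency training ensures control precision.

Logical role: Zero initialization ensures that adding the control module does not damage the original motion quality (a strategy similar to ControlNet), while the consistency loss establishes closed-loop feedback between control and output.

Argumentative technique / potential weaknesses: Zero initialization is a mature stabilization technique (originating from ControlNet), and transplanting it here shows that it carries over across domains. However, the consistency loss requires differentiable sampling, which the authors address with Gumbel-Softmax, and this may introduce sensitivity to the tuning of the temperature parameter.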
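
To make the two design points of this paragraph concrete, the sketch below builds a control signal with uncontrolled joints zeroed out and shows why a zero-initialized linear connection leaves the frozen branch untouched at the start of training. It is a toy illustration under assumptions: the helper names, the 3-dimensional features, and the choice of joint index 0 as the pelvis are hypothetical and not taken from the paper.

```python
NUM_JOINTS = 22  # joint count stated in the text

def build_control_signal(motion, controlled):
    """motion: list of frames, each a list of 22 [x, y, z] joints.
    controlled: set of (frame, joint) pairs to keep; all other entries are zeroed."""
    return [[list(joint) if (t, j) in controlled else [0.0, 0.0, 0.0]
             for j, joint in enumerate(frame)]
            for t, frame in enumerate(motion)]

def linear(x, weight, bias):
    """Plain linear layer y = W x + b on a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def controlled_layer_output(frozen_out, control_out, weight, bias):
    """Add the control branch to the frozen branch through a linear layer."""
    return [f + c for f, c in zip(frozen_out, linear(control_out, weight, bias))]

# Two frames of a toy motion where joint j in frame t sits at (t, j, 0).
motion = [[[float(t), float(j), 0.0] for j in range(NUM_JOINTS)] for t in range(2)]
# Control only joint 0 (assumed here to be the pelvis) at frame 1.
signal = build_control_signal(motion, controlled={(1, 0)})

# Zero-initialized weights and bias: the control branch contributes nothing yet.
dim = 3
zero_w = [[0.0] * dim for _ in range(dim)]
zero_b = [0.0] * dim
frozen = [0.5, -1.0, 2.0]
out = controlled_layer_output(frozen, [3.0, 1.0, 4.0], zero_w, zero_b)
print(out)
```

With all-zero weights the output equals the frozen model's output regardless of what the control branch computes, so the pre-trained motion quality is preserved until training moves the weights away from zero.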

Integrating consistency loss into the generative masked model requires converting motion tokens from latent space to Euclidean space, necessitating categorical sampling which is non-differentiable. The authors leverage the straight-through Gumbel-Softmax technique with temperature parameter tau and Gumbel noise. The motion consistency loss assesses alignment between extracted joint control signals from generated motion and input spatial control, normalized by the number of controlled joints and frames.


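Paragraph function: Technical difficulty and its solution. It addresses the non-differentiability of discrete sampling.

Logical role: This paragraph handles the key technical obstacle to introducing spatial control into a masked model. Gumbel-Softmax is a widely accepted solution, which strengthens the method's technical credibility.

Argumentative technique / potential weaknesses: The temperature annealing strategy for Gumbel-Softmax is not described in detail. Too high a temperature gives a coarse approximation, while too low a temperature leads to high-variance or vanishing gradients. This hyperparameter sensitivity may affect how easy the method is to use.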
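
The forward pass of Gumbel-Softmax sampling and a consistency loss normalized over controlled entries can be sketched as follows. This is a sketch under assumptions, not the paper's formulation: it uses plain Python with no automatic differentiation, so the straight-through gradient is only described in a comment, and the choice of Euclidean distance inside the loss is an assumption, since the summary above does not specify the norm.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Return (soft, hard): a relaxed sample and its one-hot argmax."""
    # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
    noisy = [(l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau
             for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    # Straight-through trick (needs autodiff, not shown here): use `hard` in the
    # forward pass but let gradients flow through `soft`, for example
    # hard + soft - stop_gradient(soft).
    return soft, hard

def consistency_loss(generated, control, mask):
    """Mean joint distance over controlled (frame, joint) entries only.
    generated, control: [frame][joint][xyz]; mask: [frame][joint], 1 if controlled."""
    total, count = 0.0, 0
    for g_frame, c_frame, m_frame in zip(generated, control, mask):
        for g, c, m in zip(g_frame, c_frame, m_frame):
            if m:
                total += math.dist(g, c)
                count += 1
    return total / count if count else 0.0

soft, hard = gumbel_softmax([2.0, 0.5, -1.0], tau=1.0, rng=random.Random(0))

# Two frames, two joints; only joint 0 is controlled in both frames.
generated = [[[3.0, 4.0, 0.0], [9.0, 9.0, 9.0]],
             [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]
control = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
           [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]
mask = [[1, 0], [1, 0]]
loss = consistency_loss(generated, control, mask)
print(soft, hard, loss)
```

The uncontrolled joint at (9, 9, 9) does not affect the loss because the mask excludes it, which mirrors the normalization by the number of controlled joints and frames. For the same noise draw, a smaller `tau` pushes `soft` closer to one-hot, which is the temperature trade-off noted above.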

3.3 Inference-time Logits and Codebook Editing

To achieve accurate, generalizable spatial control, the authors optimize the motion token classifier logits and the codebook while keeping the network frozen. This reduces the discrepancy between the generated motion and the desired control objectives without requiring pretraining on specific signals, which enables arbitrary out-of-distribution signals and zero-shot tasks such as obstacle avoidance. The core idea is to update the logits through gradient-guided optimization during inference. Combining logits editing with codebook editing produces the best performance.


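Paragraph function: Second part of the core mechanism. It describes the training-free strategy for strengthening control at inference time.

Logical role: This mechanism complements training-time consistency modeling: consistency modeling provides the basic control capability, and inference-time editing refines it further and supports zero-shot generalization.

Argumentative technique / potential weaknesses: Inference-time optimization is powerful but adds inference cost. The authors claim "real-time" generation, but the number of logit/codebook editing iterations and the corresponding latency need to be quantified explicitly.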
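
The idea of gradient-guided logit editing with everything else frozen can be illustrated on a single token. The sketch below is a toy under stated assumptions, not the authors' procedure: it treats each codebook entry as a 2-D position, decodes with a probability-weighted average instead of a real decoder, edits only the logits (the codebook stays fixed), and computes the gradient by hand. The function names, learning rate, and step count are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(probs, codebook):
    """Soft decode: probability-weighted average of the codebook entries."""
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]

def control_loss(position, target):
    """Squared distance between the decoded position and the control target."""
    return sum((x - s) ** 2 for x, s in zip(position, target))

def edit_logits(logits, codebook, target, lr=0.5, steps=200):
    """Gradient descent on the logits only; codebook and network stay untouched."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        pos = decode(probs, codebook)
        # dL/dx for L = ||x - s||^2.
        grad_pos = [2.0 * (x - s) for x, s in zip(pos, target)]
        # dL/dp_k = c_k . dL/dx, because x = sum_k p_k c_k.
        grad_p = [sum(c * g for c, g in zip(code, grad_pos)) for code in codebook]
        # Softmax Jacobian: dL/dl_k = p_k * (dL/dp_k - sum_j p_j dL/dp_j).
        mean_grad = sum(p * g for p, g in zip(probs, grad_p))
        logits = [l - lr * p * (g - mean_grad)
                  for l, p, g in zip(logits, probs, grad_p)]
    return logits

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [0.2, 0.7]  # hypothetical control target for this token
init_logits = [0.0, 0.0, 0.0]

before = control_loss(decode(softmax(init_logits), codebook), target)
edited = edit_logits(init_logits, codebook, target)
after = control_loss(decode(softmax(edited), codebook), target)
print(before, after)
```

Each call to `edit_logits` costs `steps` forward and backward evaluations, which is exactly the extra latency the critique above asks to have quantified.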

4. Experiments

Comprehensive experiments are conducted on the HumanML3D dataset containing 14,616 motion sequences with 44,970 text descriptions. Evaluation combines quality metrics (FID, R-Precision, Diversity, Foot Skating Ratio) and trajectory error metrics. Compared to TLControl, FID decreased from 0.271 to 0.061. R-Precision increased from 0.779 to 0.809. Trajectory and Location Errors dropped to zero, with average error at 0.91 cm. For multi-joint configurations, ControlMM outperforms all methods; compared to OmniControl's FID of 0.624, ControlMM achieves FID 0.049.


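Paragraph function: Comprehensive quantitative validation. Multi-dimensional metrics show the method's advantage in both quality and control precision.

Logical role: This paragraph verifies two promises at once: the large FID improvement demonstrates motion quality, and the zero trajectory error demonstrates spatial control precision. The multi-joint tests further show generalization.

Argumentative technique / potential weaknesses: The trajectory error "dropping to zero" should be read with caution, since it may hold only at particular sparsity levels. As control density increases, the average error (AE) rises even as FID improves from 0.077 to 0.054, which suggests a subtle trade-off between quality and control precision.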
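
One reason the trajectory and location errors can be exactly zero while the average error is still 0.91 cm is that the first two are usually thresholded failure rates rather than distances. The sketch below assumes the definitions common in earlier controllable-motion work (a trajectory fails if any keyframe is off by more than a threshold). The summary above does not define the metrics, so both the definitions and the 0.5 m threshold are assumptions.

```python
import math

def control_errors(generated, target, threshold=0.5):
    """generated, target: lists of trajectories; each trajectory is a list of
    [x, y, z] keyframe positions in meters for the controlled joint.
    Returns (trajectory error rate, location error rate, average error)."""
    all_dists = []
    failed_trajectories = 0
    for gen_traj, tgt_traj in zip(generated, target):
        dists = [math.dist(g, t) for g, t in zip(gen_traj, tgt_traj)]
        all_dists.extend(dists)
        # A trajectory counts as failed if any keyframe exceeds the threshold.
        if any(d > threshold for d in dists):
            failed_trajectories += 1
    traj_err = failed_trajectories / len(generated)
    loc_err = sum(d > threshold for d in all_dists) / len(all_dists)
    avg_err = sum(all_dists) / len(all_dists)
    return traj_err, loc_err, avg_err

origin = [0.0, 0.0, 0.0]
target = [[origin, origin], [origin, origin]]
generated = [[[0.01, 0.0, 0.0], origin],   # 1 cm off: below the threshold
             [[0.60, 0.0, 0.0], origin]]   # 60 cm off: above the threshold
traj_err, loc_err, avg_err = control_errors(generated, target)
print(traj_err, loc_err, avg_err)
```

Under these assumed definitions, a model whose keyframes are all within the threshold reports zero trajectory and location error no matter how the residual centimeters are distributed, so the average error is the more informative number for precision.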

Ablation studies test eight configurations. Without the Motion Control Model, controlled joints deviate from spatial signals. Logits Editing alone brings root positions closer to the control signal, but wrist accuracy remains poor. Embedding Editing alone provides closer alignment but lacks motion realism. The full model generates realistic motion that matches the control signals with high precision. Applications include any-joint any-frame control, obstacle avoidance via SDF loss, and body part timeline control, all achieved without retraining.


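Paragraph function: Component validation and application showcase. The ablation study confirms that each component is necessary, and the application scenarios show generalization.

Logical role: The ablation study provides causal evidence: removing any component degrades a specific aspect, which supports the completeness of the design. Zero-shot applications such as obstacle avoidance further demonstrate the generalization value of inference-time editing.

Argumentative technique / potential weaknesses: The obstacle avoidance application uses an SDF loss, which is a distinctive capability of ControlMM; among diffusion methods, GMD can only control the root trajectory. However, computing the SDF in real time requires a precomputed distance field of the scene, which may be limiting in dynamic environments.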
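
An SDF-based obstacle penalty can be sketched for the simplest possible scene, a single sphere, where the signed distance has a closed form. This is an illustration under assumptions rather than the paper's loss: the sphere obstacle, the hinge-style penalty, and the `margin` parameter are hypothetical choices.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

def obstacle_loss(joint_positions, center, radius, margin=0.0):
    """Sum of hinge penalties: only joints closer than `margin` to the surface
    (or inside the obstacle) contribute."""
    return sum(max(0.0, margin - sphere_sdf(p, center, radius))
               for p in joint_positions)

center, radius = [0.0, 0.0, 0.0], 1.0
joints = [[0.5, 0.0, 0.0],   # inside the sphere, 0.5 m of penetration
          [2.0, 0.0, 0.0]]   # outside the sphere, no penalty
loss = obstacle_loss(joints, center, radius)
print(loss)
```

A loss of this shape could be minimized by the same inference-time editing loop without retraining. For arbitrary scenes the closed-form `sphere_sdf` would be replaced by a lookup into a precomputed distance field, which is the limitation for dynamic environments raised above.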

5. Conclusion

ControlMM presents a method incorporating spatial control signals into Masked Motion Models, described as "the first model that enables precise control over quantized motion tokens while maintaining high-quality motion generation at faster speeds, consistently outperforming diffusion-based controllable frameworks". Two key innovations drive the results: Masked Consistency Modeling ensures high-fidelity generation while reducing inconsistencies, and Inference-Time Logit and Codebook Editing enhances precision and adaptability for various tasks including any-joint any-frame control, obstacle avoidance, and body part timeline control.


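Paragraph function: Summarizes the paper, restating the core contributions and the "first" positioning.

Logical role: The conclusion maps onto the three problems from the introduction: training-time consistency addresses quality, inference-time editing addresses precision, and the masked architecture addresses speed.

Argumentative technique / potential weaknesses: The conclusion is confidently worded but does not discuss limitations, such as validation on the single HumanML3D dataset, the effect of codebook size on control precision, and the trade-off between the number of inference-time optimization steps and actual latency.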

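Argument Structure Overview

Problem
Text-driven motion generation lacks precise spatial control.

→

Claim
A masked model plus spatial control delivers quality, precision, and real-time speed together.

→

Evidence
FID 0.061 vs. 0.271; trajectory error drops to zero; 20x faster generation.

→

Rebuttal
Zero initialization plus the consistency loss ensure that the original motion quality is not degraded.

→

Conclusion
Masked motion models are the best vehicle for controllable motion synthesis.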

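The Authors' Core Claim (One Sentence)

Through masked consistency modeling and inference-time logit/codebook editing, ControlMM is the first to achieve real-time, high-fidelity, and high-precision spatially controllable motion generation on a masked motion model, outperforming diffusion-based controllable frameworks across the board.

Strongest Point of the Argument

Complementarity of the two-level control architecture: training-time consistency modeling provides basic spatial alignment, and inference-time logit/codebook editing refines it further and supports zero-shot generalization. Together they let ControlMM handle out-of-distribution tasks such as obstacle avoidance without retraining, which greatly broadens the practical application scenarios.

Weakest Point of the Argument

A single dataset and a vague definition of "real-time": all experiments are run only on HumanML3D, and generalization to other motion domains (such as dance or sports) has not been verified. In addition, although real-time generation is claimed, the inference-time logit/codebook optimization adds extra iterations, and the actual latency is not explicitly reported.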