Generative Image Dynamics

Abstract

We present an approach to modeling an image-space prior on scene motion. Our method learns this prior from motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics such as trees, flowers, candles, and clothes swaying in the wind. We model scene motion in the Fourier domain: given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. This approach enables turning still images into seamlessly looping videos, and allows users to realistically interact with objects in real pictures by interpreting the spectral volumes as image-space modal bases, which approximate object dynamics.

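Paragraph function: Overview of the whole paper. It moves step by step from the motion prior to the spectral volume and then to the applications, previewing the paper's core contributions.

Logical role: The abstract does three jobs: it defines the problem, outlines the method, and previews the applications. It first states the research goal (a prior on scene motion), then introduces the spectral volume as the representation, and finally presents two applications: animating still images and interactive manipulation.

Rhetorical technique / potential weakness: Concrete natural scenes (trees, flowers, candles) give the reader an intuitive sense of what "oscillatory dynamics" means. However, restricting the scope to natural oscillations narrows the method's applicability: non-periodic motion such as a person walking is excluded, and the abstract does not state this limitation explicitly.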

1. Introduction

The natural world is always in motion, with even seemingly static scenes containing subtle oscillations as a result of factors such as wind, water currents, respiration, or other natural rhythms. Motion is one of the most salient visual signals, and humans are particularly sensitive to it. While it is easy for humans to interpret or imagine motion in scenes, training a model to learn realistic scene motion is far from trivial. The underlying physical dynamics are hard to measure and capture at scale, but fortunately, in many cases measuring them is unnecessary — the necessary signals for producing plausible motion can often be extracted from observed 2D motion.

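Paragraph function: Establishes the research motivation. It starts from the ubiquity of motion in nature and points to the challenge of modeling it computationally.

Logical role: The starting point of the argument. An intuitive observation (the world is always in motion) builds common ground, the difficulty of learning motion with a model is then stated, and the claim that observed 2D motion is sufficient serves as the opening that leads to the method.

Rhetorical technique / potential weakness: Motivating the work from human perception is an effective rhetorical strategy. The claim that 2D motion is sufficient holds for oscillatory scenes but not necessarily for complex motion involving depth changes or occlusion, so it sets an implicit premise for the scope of the method that follows.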

Real-world observed motion is multi-modal and grounded in complex physical effects, yet often predictable: candles will flicker in certain ways, trees will sway. This predictability is ingrained in our human perception of real scenes: by viewing a still image, we can imagine a distribution of natural motions conditioned on that image. Recent advances in generative models, in particular conditional diffusion models, have enabled us to model rich distributions, including distributions of real images conditioned on text. In this paper, we explore modeling a generative prior for image-space scene motion, i.e., the motion of all pixels in a single image, trained on motion trajectories automatically extracted from a large collection of real video sequences.

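Paragraph function: Bridges motivation and method, moving from the predictability of motion to the use of diffusion models.

Logical role: This paragraph completes a key analogy: if diffusion models can model distributions of images, the same framework can model distributions of motion. This justifies the later choice of a latent diffusion model.

Rhetorical technique / potential weakness: Flickering candles and swaying trees make the idea of multi-modal motion concrete and easier to grasp. But the jump from image distributions to motion distributions presumes that the two share similar structure for representation learning, an assumption that needs experimental validation.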

We compute motion in the form of a spectral volume, a frequency-domain representation of dense, long-range pixel trajectories suited to scenes that exhibit oscillatory dynamics. We train a generative model that, conditioned on a single image, can sample spectral volumes from its learned distribution. The predicted spectral volume transforms into a motion texture — a set of per-pixel, long-range pixel motion trajectories — that can be used to animate the image. Compared with priors over raw RGB pixels, priors over motion capture more fundamental, lower-dimensional underlying structure that efficiently explains long-range variations in pixel values, leading to more coherent long-term generation and more fine-grained control over animations.

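Paragraph function: Previews the core contribution by explaining the design rationale and relative advantages of the spectral volume representation.

Logical role: The key claim of the introduction: replacing a prior over pixels with a prior over motion fundamentally changes how video generation is approached. The "lower-dimensional underlying structure" is the theoretical basis for the method's efficiency.

Rhetorical technique / potential weakness: Positioning motion priors as "more fundamental" than pixel priors is a strong claim. It implies a premise, however: that visual change in a scene is driven mainly by motion rather than by lighting changes or material deformation. For scenes with large lighting variation the claim may be weaker.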

2. Related Work

Generative synthesis: Recent advances in generative models have enabled photorealistic synthesis of images conditioned on text prompts. These models can be augmented to synthesize video sequences by extending the generated image tensors along a time dimension. While these methods produce plausible video sequences, the resulting videos often suffer from artifacts such as incoherent motion, unrealistic temporal variation in textures, and violations of physical constraints like preservation of mass. Animating images: Other techniques take a still picture and animate it. Many recent deep learning methods adopt a 3D-UNet architecture to produce video volumes directly, but because these models are effectively the same video generation models conditioned on image information instead of text, they exhibit similar artifacts. One way to overcome these limitations is to animate an input source image through explicit or implicit image-based rendering, moving the image content around according to motion derived from external sources. Animating images according to motion fields yields greater temporal coherence and realism, but prior methods require additional guidance signals or user input, or utilize limited motion representations.

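Paragraph function: Literature review. Surveys the state of the art and the bottlenecks in two areas, generative synthesis and image animation.

Logical role: Related work is organized along two lines, direct generation and motion-based rendering, which sets up the positioning of the paper: it extends the second line while removing the need for external guidance.

Rhetorical technique / potential weakness: The criticism of video generation methods (incoherent motion) reflects a common observation, but the field moves quickly and some methods have since improved substantially on these problems. Written from a 2023 vantage point, the criticism risks becoming dated.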

Motion models and motion priors: In computer graphics, natural oscillatory 3D motion has long been modeled with noise shaped in the Fourier domain and then converted via an inverse Fourier transform to time-domain motion fields. Some of these methods rely on a modal analysis of the underlying dynamics. These spectral techniques were adapted to animate plants, water, and clouds from single 2D pictures by Chuang et al. with additional user annotations. Our work is particularly inspired by Davis et al., who showed how to connect modal analysis of a scene with the motions observed in video, and how to use this analysis to simulate interactive dynamics from a video. We adopt the spectral volume representation from Davis et al., extract this representation from a large set of training videos, and show that it is suitable for predicting motion from single images with diffusion models. Videos as textures: Certain moving scenes can be thought of as dynamic textures that model videos as space-time samples of a stochastic process. In contrast to much of previous work, our method learns priors in advance that can then be applied to single images.

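Paragraph function: Builds the academic lineage, placing the method in the tradition of modal analysis and spectral motion representations.

Logical role: Establishes the key line of descent: Fourier-domain motion modeling -> the spectral volume of Davis et al. -> prediction with a diffusion model in this paper. The paragraph gives historical justification for the choice of representation.

Rhetorical technique / potential weakness: Naming Davis et al. explicitly as the inspiration shows academic integrity and at the same time positions the paper as an important step from video input to single-image input. This positioning also implies that the method depends heavily on the quality of the spectral volume representation: if the representation has limits (for example, difficulty capturing non-periodic motion), so does the method.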

3. Overview

Given a single picture I₀, our goal is to generate a video of length T featuring oscillation dynamics such as those of trees, flowers, or candle flames moving in the breeze. Our system has two modules: a motion prediction module and an image-based rendering module. We first use a latent diffusion model (LDM) to predict a spectral volume for the input image. The predicted spectral volume is then transformed to a sequence of motion displacement fields (a motion texture) using an inverse discrete Fourier transform. This motion determines the position of each input pixel at each future time step. Given a predicted motion texture, our rendering module animates the input RGB image using an image-based rendering technique that splats encoded features from the input image and decodes these splatted features into an output frame with an image synthesis network.

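Paragraph function: System overview. Sketches the two-stage pipeline concisely.

Logical role: Serves as a roadmap so that the reader has the whole picture before the technical details: the LDM predicts a spectrum -> an inverse FFT converts it to motion fields -> image-based rendering. The following sections expand on each module.

Rhetorical technique / potential weakness: Splitting the system into prediction and rendering modules keeps the architecture clear. The decoupled design lets each module be improved independently, but it also means errors from the prediction module propagate directly into rendering, which may limit end-to-end optimization.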

4. Predicting Motion

4.1 Motion Representation

Formally, a motion texture is a sequence of time-varying 2D displacement maps, where the 2D displacement vector at each pixel coordinate from input image I₀ defines the position of that pixel at a future time t. If our goal is to produce a video via a motion texture, then one choice would be to predict a time-domain motion texture directly, but the size of the motion texture would need to scale with the length of the video: generating T output frames implies predicting T displacement fields. To avoid predicting such a large output representation, many prior animation methods either generate video frames autoregressively, or predict each future output frame independently via an extra time embedding. However, neither strategy ensures long-term temporal consistency.

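Paragraph function: Problem statement. Identifies the efficiency and consistency bottlenecks of a time-domain motion representation.

Logical role: The argument proceeds by elimination: direct time-domain prediction (scale), autoregressive generation (error accumulation), and independent per-frame prediction (lack of consistency) are ruled out in turn, clearing the way for a frequency-domain representation.

Rhetorical technique / potential weakness: Listing the drawbacks of three alternatives systematically is an effective strategy. But "long-term temporal consistency" is loosely defined: for some applications, such as short video generation, the consistency of autoregressive methods may be sufficient.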

Fortunately, many natural motions can be described as a superposition of a small number of harmonic oscillators represented with different frequencies, amplitudes, and phases. Because the underlying motions are quasi-periodic, it is natural to model them in the frequency domain. We adopt an efficient frequency space representation called a spectral volume from Davis et al. A spectral volume is the temporal Fourier transform of pixel trajectories extracted from a video, organized into images called modal images. Davis et al. further show that, under certain assumptions, the spectral volume evaluated at certain frequencies forms an image-space modal basis that is a projection of the vibration modes of the underlying scene. Prior work in real-time animation has observed that most natural oscillation motions are composed primarily of low-frequency components. We validated this observation: the power spectrum of the motion decreases exponentially with increasing frequency. In practice, we found that the first K=16 Fourier coefficients are sufficient to realistically reproduce the original natural motion.

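Paragraph function: Introduces the core representation, supporting the choice of the spectral volume with physical intuition and empirical evidence.

Logical role: The foundation of the method. The chain of reasoning (quasi-periodicity -> frequency-domain modeling -> low frequencies suffice) gives both theoretical and empirical support for the efficiency of the whole architecture. Choosing K=16 compresses the prediction problem from T displacement fields to 16 spectral coefficients.

Rhetorical technique / potential weakness: The exponential decay of the power spectrum is persuasive empirical support. But the quasi-periodicity assumption explicitly restricts the method to oscillatory motion; for scenes with abrupt events, such as a falling object, the representation cannot capture the non-periodic components.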
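
To make the "first K=16 Fourier coefficients" idea concrete, here is a minimal NumPy sketch on a single synthetic pixel trajectory. The clip length T=150, the three oscillators, and the helper names `truncate_spectrum` and `reconstruct` are illustrative assumptions, not values or code from the paper; counting the constant (zero-frequency) term among the K coefficients is also an assumption.

```python
import numpy as np

def truncate_spectrum(trajectory, k=16):
    """Temporal FFT of a real 1D trajectory, with all but the first k coefficients zeroed."""
    spectrum = np.fft.rfft(trajectory)
    truncated = np.zeros_like(spectrum)
    truncated[:k] = spectrum[:k]
    return truncated

def reconstruct(truncated, num_frames):
    """Inverse FFT back to a time-domain trajectory with num_frames samples."""
    return np.fft.irfft(truncated, n=num_frames)

T = 150  # assumed clip length in frames
t = np.arange(T)
# Synthetic trajectory: three low-frequency oscillators (assumed amplitudes and phases).
trajectory = (3.0 * np.sin(2 * np.pi * 2 * t / T + 0.3)
              + 1.0 * np.sin(2 * np.pi * 5 * t / T + 1.1)
              + 0.3 * np.sin(2 * np.pi * 9 * t / T + 2.0))

# Worst-case reconstruction error when keeping 16 versus 4 coefficients.
error_k16 = np.max(np.abs(trajectory - reconstruct(truncate_spectrum(trajectory, 16), T)))
error_k4 = np.max(np.abs(trajectory - reconstruct(truncate_spectrum(trajectory, 4), T)))

# Real-valued channels per pixel: x/y displacement for every frame, versus
# x/y times real/imaginary parts for K coefficients (the 4K channels in the text).
channels_time_domain = 2 * T
channels_spectral = 4 * 16
```

Because every oscillator in this toy signal lies below the cutoff, keeping 16 coefficients reproduces it up to floating-point error, while keeping only 4 drops two of the three components. Real trajectories are not band-limited this cleanly, so the paper's claim is about perceptual sufficiency rather than exact recovery.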

4.2 Predicting Motion with a Diffusion Model

We select a latent diffusion model (LDM) as the backbone for our motion prediction module, as LDMs are more computationally efficient than pixel-space diffusion models while preserving generation quality. A standard LDM consists of two main modules: (1) a Variational Autoencoder (VAE) that compresses the input to a latent space through an encoder z=E(I), then reconstructs via a decoder I=D(z); and (2) a U-Net based diffusion model that learns to iteratively denoise latent features starting from Gaussian random noise. Our training applies this not to input images but to motion spectra from real video sequences, which are encoded and then diffused for n steps with a pre-defined variance schedule to produce noisy latents zⁿ. The 2D U-Nets are trained to denoise the noisy latents by iteratively estimating the noise used to update the latent feature at each step.

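Paragraph function: Sets up the technical framework, introducing the basic LDM architecture and how it is adapted to motion spectra.

Logical role: Transplants the mature LDM framework to motion prediction; the key change is that the VAE and the diffusion model operate on spectral volumes rather than images. Borrowing an established architecture reduces the risk that comes with novelty.

Rhetorical technique / potential weakness: Choosing an LDM over a pixel-space diffusion model on efficiency grounds is reasonable and practical. But VAE compression may lose subtle information in the motion spectrum; high-frequency motion detail in particular may be smoothed out in latent space.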
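
The forward-noising step and the noise-prediction objective described above can be sketched as follows. This is a generic DDPM-style formulation under stated assumptions: the linear variance schedule, the 1000-step horizon, and the 4x20x32 latent shape are illustrative choices, and no U-Net is included, so `noise_prediction_loss` is only shown with a stand-in prediction.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for an assumed linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, n, alpha_bars, rng):
    """Diffuse a clean latent z0 to step n in closed form; returns (z_n, noise)."""
    eps = rng.standard_normal(z0.shape)
    zn = np.sqrt(alpha_bars[n]) * z0 + np.sqrt(1.0 - alpha_bars[n]) * eps
    return zn, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
# Stand-in for an encoded motion spectrum: 4 latent channels, assumed 20x32 spatial size.
z0 = rng.standard_normal((4, 20, 32))
zn, eps = add_noise(z0, 500, alpha_bars, rng)

# A trained denoiser would output eps_pred from (zn, n, image condition);
# here a zero prediction stands in to show how the loss is evaluated.
loss_untrained = noise_prediction_loss(np.zeros_like(eps), eps)
```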

Frequency Adaptive Normalization: One issue we observed is that motion textures have particular distribution characteristics across frequencies — the amplitude spans a range of 0 to 100 and decays approximately exponentially with increasing frequency. As diffusion models require output values between 0 and 1 for stable training, we must normalize the coefficients. If we simply scale the magnitudes based on image width and height, almost all coefficients at higher frequencies will end up close to zero. Models trained on such data can produce inaccurate motions, since even small prediction errors lead to large relative errors after denormalization. To address this, we employ a frequency adaptive normalization technique: we independently normalize Fourier coefficients at each frequency based on statistics computed from the training set. At each individual frequency f_j, we compute the 97th percentile of the Fourier coefficient magnitudes over all input samples as a per-frequency scaling factor. Furthermore, we apply a square root transform to each scaled coefficient to pull it away from extremely small or large values.

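Paragraph function: Solves a technical problem: identifies and resolves the training difficulty caused by the uneven distribution of values across frequencies.

Logical role: Shows the full reasoning from observed problem to solution: uneven distribution -> simple scaling fails -> per-frequency normalization plus a square root transform. This "find the problem, analyze the cause, propose a fix" structure is typical of engineering-oriented research.

Rhetorical technique / potential weakness: The 97th percentile is a practical but somewhat arbitrary choice: why not the 95th or 99th? The square root transform likewise lacks theoretical justification (the authors only note that it works better empirically than log or reciprocal transforms). The robustness of these hyperparameters is not explored in depth.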
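
A minimal sketch of per-frequency normalization, assuming the coefficients are stored as real-valued channels with the frequency index on axis 1. The 97th percentile comes from the text; the sign-preserving form of the square root, the synthetic data, and the function names are assumptions made for illustration.

```python
import numpy as np

def _per_frequency_shape(array):
    """Broadcast shape that aligns a (K,) vector with axis 1 of `array`."""
    return (1, -1) + (1,) * (array.ndim - 2)

def fit_scales(coeffs, percentile=97.0):
    """Per-frequency scale: a percentile of coefficient magnitudes over the data set."""
    k = coeffs.shape[1]
    magnitudes = np.abs(np.moveaxis(coeffs, 1, 0)).reshape(k, -1)
    return np.percentile(magnitudes, percentile, axis=1)

def normalize(coeffs, scales):
    """Divide by the per-frequency scale, then apply a sign-preserving square root."""
    scaled = coeffs / scales.reshape(_per_frequency_shape(coeffs))
    return np.sign(scaled) * np.sqrt(np.abs(scaled))

def denormalize(normed, scales):
    """Invert normalize(): square with sign, then multiply the scale back in."""
    return np.sign(normed) * normed ** 2 * scales.reshape(_per_frequency_shape(normed))

rng = np.random.default_rng(0)
K = 16
# Synthetic coefficients whose amplitude decays exponentially with frequency (assumed rate).
amplitudes = 100.0 * np.exp(-0.5 * np.arange(K))
coeffs = rng.standard_normal((64, K, 8, 8)) * amplitudes[None, :, None, None]

scales = fit_scales(coeffs)
normed = normalize(coeffs, scales)
restored = denormalize(normed, scales)
```

After normalization every frequency band has a comparable spread, so a small absolute prediction error no longer turns into a large relative error at high frequencies once the values are mapped back.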

Frequency-Coordinated Denoising: The straightforward way to predict a spectral volume with K frequency bands is to output a tensor of 4K channels from a standard diffusion U-Net, but training a model to produce such a large number of channels tends to produce over-smoothed and inaccurate output. An alternative would be to independently predict a motion spectrum map at each individual frequency, but this results in uncorrelated predictions leading to unrealistic motion. Therefore, we propose a frequency-coordinated denoising strategy. We first train an LDM to predict a spectral volume texture map at each individual frequency f_j with extra frequency embedding along with time-step embedding. We then freeze the parameters and introduce attention layers interleaved with 2D spatial layers across K frequency bands. The frequency attention layers coordinate the pre-trained motion latent features across all frequency channels to produce coherent spectral volumes. The average VAE reconstruction error improves from 0.024 to 0.018 when switching from a standard 2D U-Net to our frequency-coordinated denoising module.

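Paragraph function: Core innovation. Proposes frequency-coordinated denoising as a compromise for jointly predicting multiple frequency bands.

Logical role: Continues the argument by elimination: after ruling out joint prediction of all channels (over-smoothing) and independent prediction (incoherence), it proposes a middle path: train per frequency first, then coordinate with attention. The two-stage "freeze, then fine-tune attention" training makes optimization easier.

Rhetorical technique / potential weakness: The concrete improvement in VAE reconstruction error (0.024 -> 0.018) provides quantitative support. But this metric only reflects an upper bound on reconstruction ability; actual generation quality also depends on the diffusion sampling process. In addition, the freeze-then-fine-tune strategy is practical but may not reach the global optimum that end-to-end joint training could.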
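
The idea of attention "across K frequency bands" can be illustrated with a single-head self-attention layer in which each spatial position forms a sequence of length K. This is only a sketch of the mechanism: the residual connection, the single head, the random projection matrices, and the tensor sizes are assumptions, and the frozen 2D spatial layers and the training procedure are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def frequency_attention(features, w_q, w_k, w_v):
    """Self-attention over the frequency axis.

    features: (K, P, C) latent features for K frequency bands at P spatial positions.
    Returns the coordinated features (K, P, C) and the attention weights (P, K, K).
    """
    x = np.transpose(features, (1, 0, 2))            # (P, K, C): one sequence per position
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)               # each band attends to all K bands
    mixed = x + weights @ v                          # residual connection (assumed)
    return np.transpose(mixed, (1, 0, 2)), weights

rng = np.random.default_rng(0)
K, P, C = 16, 6, 8                                   # illustrative sizes
features = rng.standard_normal((K, P, C))
# Hypothetical projection weights; in a real model these would be learned.
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
coordinated, attn = frequency_attention(features, w_q, w_k, w_v)
```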

5. Image-based Rendering

We now describe how we take a spectral volume predicted for a given input image I₀ and render a future frame at time t. We first derive motion trajectory fields in the time domain using the inverse temporal FFT applied at each pixel. The motion trajectory fields determine the position of every input pixel at every future time step. To produce a future frame, we adopt a deep image-based rendering technique and perform splatting with the predicted motion field to forward warp the encoded I₀. Since forward warping can lead to holes, and multiple source pixels can map to the same output 2D location, we adopt the feature pyramid softmax splatting strategy proposed in prior work on frame interpolation.

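Paragraph function: Opens the rendering pipeline, describing the technical path from spectral volume to rendered frame.

Logical role: Connects the prediction and rendering modules: the spectral volume is converted to motion fields by an inverse FFT and then enters the rendering pipeline. Softmax splatting is chosen over plain forward warping to handle the classic problems of holes and many-to-one mappings.

Rhetorical technique / potential weakness: Stating the problems of forward warping (holes, conflicting mappings) and citing an established solution (softmax splatting) shows careful engineering. But the computational cost of softmax splatting is not discussed and could be a bottleneck for real-time use.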
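
The conversion from a spectral volume to time-domain motion fields is a per-pixel inverse temporal FFT. The sketch below assumes the spectral volume is stored as a complex array of shape (K, H, W, 2), with the last axis holding x and y displacement, and that the missing high-frequency coefficients are treated as zero; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def spectral_volume_to_motion_texture(spectral_volume, num_frames):
    """Inverse temporal FFT at every pixel.

    spectral_volume: complex array (K, H, W, 2), the first K Fourier coefficients.
    Returns a real array (num_frames, H, W, 2) of per-pixel displacements.
    """
    # irfft zero-pads the K coefficients up to num_frames // 2 + 1 along axis 0.
    return np.fft.irfft(spectral_volume, n=num_frames, axis=0)

K, T, H, W = 16, 150, 4, 4                # illustrative sizes
amplitude, frequency_index = 2.5, 3
spectral_volume = np.zeros((K, H, W, 2), dtype=complex)
# A single horizontal oscillation: a cosine with this amplitude has coefficient A * T / 2.
spectral_volume[frequency_index, :, :, 0] = amplitude * T / 2

motion_texture = spectral_volume_to_motion_texture(spectral_volume, T)
expected = amplitude * np.cos(2 * np.pi * frequency_index * np.arange(T) / T)

# The position of input pixel p at time t is p + F_t(p).
ys, xs = np.mgrid[0:H, 0:W]
positions_x = xs[None] + motion_texture[..., 0]
```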

We encode I₀ through a feature extractor network to produce a multi-scale feature map. For each individual feature map at scale j, we resize and scale the predicted 2D motion field according to the resolution. As in Davis et al., we use predicted flow magnitude as a proxy for depth to determine the contributing weight of each source pixel: $W(\mathbf{p}) = \frac{1}{T} \sum_{t} \lVert F_t(\mathbf{p}) \rVert_2$, computed as the average magnitude of the predicted motion trajectory fields. In other words, we assume large motions correspond to moving foreground objects, and small or zero motions correspond to background objects. We use motion-derived weights instead of learnable ones because in the single-view case, learnable weights are not effective for addressing disocclusion ambiguities. The warped features are then injected into intermediate blocks of an image synthesis decoder network to produce the final rendered image. We jointly train the feature extractor and synthesis networks with start and target frames randomly sampled from real videos, supervising predictions with a VGG perceptual loss.

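Paragraph function: Completes the rendering details: the weighting strategy, the synthesis network architecture, and the training procedure.

Logical role: Closes the technical loop of the rendering module. The assumption that motion magnitude stands in for depth ties motion prediction directly to the rendering weights and avoids a separate depth estimation module. The VGG perceptual loss targets visual quality rather than pixel accuracy alone.

Rhetorical technique / potential weakness: The assumption that large motion means foreground holds roughly for scenes swaying in the wind but breaks down with camera motion or backgrounds that move a lot. In addition, the design rationale for pairing a ResNet-34 feature extractor with a StyleGAN-based synthesis network is not discussed in depth.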
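
The motion-derived weight and its use in splatting can be sketched in a few lines. This is a simplified stand-in for the feature pyramid softmax splatting the text refers to: it works at a single scale, rounds each target to the nearest pixel instead of using a bilinear kernel, and uses the raw weight without any scaling factor. All of these simplifications, and the toy 1x3 image, are assumptions for illustration.

```python
import numpy as np

def motion_weights(motion_texture):
    """W(p): average displacement magnitude over time. Input (T, H, W, 2), output (H, W)."""
    return np.linalg.norm(motion_texture, axis=-1).mean(axis=0)

def softmax_splat(features, flow, weights):
    """Forward-warp features along flow, blending collisions by exp(weight).

    features: (H, W, C) floats, flow: (H, W, 2) as (dx, dy), weights: (H, W).
    Returns the warped features and a mask of output pixels that received a source.
    """
    h, w, c = features.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)      # nearest-pixel target (simplification)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    importance = np.exp(weights - weights.max())     # shifted for numerical stability
    numerator = np.zeros((h, w, c))
    denominator = np.zeros((h, w))
    np.add.at(numerator, (ty[valid], tx[valid]), features[valid] * importance[valid][:, None])
    np.add.at(denominator, (ty[valid], tx[valid]), importance[valid])
    filled = denominator > 0
    warped = np.zeros_like(numerator)
    warped[filled] = numerator[filled] / denominator[filled][:, None]
    return warped, filled

# Toy 1x3 image: the left pixel (value 1.0) moves one pixel right onto a static pixel.
features = np.array([[[1.0], [0.0], [0.5]]])
flow = np.zeros((1, 3, 2))
flow[0, 0] = (1.0, 0.0)
motion_texture = np.stack([flow, flow])              # two identical frames for the demo
weights = motion_weights(motion_texture)
warped, filled = softmax_splat(features, flow, weights)
```

In the toy example the moving pixel outweighs the static one at the collision site, and the location it left behind is reported as a hole, which is the case the synthesis network has to fill in.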

6. Applications

Image-to-video: Our system enables the animation of a single still picture by first predicting a spectral volume and generating an animation by applying the rendering module. Since the method explicitly models scene motion, this allows production of slow-motion videos by linearly interpolating the motion displacement fields and magnifying or minifying animated motions by adjusting the amplitude of predicted spectral volume coefficients. Seamless looping: It is sometimes useful to generate videos with motion that loops seamlessly. Unfortunately, it is hard to find a large collection of seamlessly looping videos for training. Instead, we devise a motion self-guidance technique that guides the motion denoising sampling process using explicit looping constraints: at each iterative denoising step during inference, an additional motion guidance signal is incorporated alongside standard classifier-free guidance, where each pixel's position and velocity at the start and end frames are enforced to be as similar as possible.

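Paragraph function: Demonstrates applications enabled by the spectral volume representation.

Logical role: Turns the method into concrete applications and shows the advantages of a frequency-domain representation: slow motion needs only interpolation, amplitude scaling controls motion magnitude, and looping constraints can be injected at inference time. These are hard to achieve directly with pixel-domain methods.

Rhetorical technique / potential weakness: Motion self-guidance for seamless looping is a clever inference-time technique that needs no extra training data. But the guidance may push the generated motion away from the natural distribution: to satisfy the looping constraint, the model may sacrifice realism. This trade-off is not discussed in depth.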
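
Two of the edits mentioned above, motion magnification and slow motion, reduce to simple array operations once motion is explicit. The sketch below assumes a complex spectral volume of shape (K, H, W, 2) and an integer slow-down factor; the function names and random test data are illustrative, and the looping guidance is not covered here.

```python
import numpy as np

def scale_motion(spectral_volume, gain):
    """Magnify (gain > 1) or attenuate (gain < 1) motion by scaling spectral amplitudes."""
    return gain * spectral_volume

def slow_motion(motion_texture, factor):
    """Linearly interpolate displacement fields so each original step spans `factor` frames."""
    num_frames = motion_texture.shape[0]
    source = np.arange((num_frames - 1) * factor + 1) / factor   # fractional frame indices
    lower = np.floor(source).astype(int)
    upper = np.minimum(lower + 1, num_frames - 1)
    frac = (source - lower).reshape((-1,) + (1,) * (motion_texture.ndim - 1))
    return (1.0 - frac) * motion_texture[lower] + frac * motion_texture[upper]

rng = np.random.default_rng(0)
K, T, H, W = 16, 60, 2, 2                            # illustrative sizes
spectral_volume = (rng.standard_normal((K, H, W, 2))
                   + 1j * rng.standard_normal((K, H, W, 2)))

motion_texture = np.fft.irfft(spectral_volume, n=T, axis=0)
magnified = np.fft.irfft(scale_motion(spectral_volume, 2.0), n=T, axis=0)
slowed = slow_motion(motion_texture, 2)
```

Because the inverse FFT is linear, doubling the spectral amplitudes doubles every displacement, which is why magnification can be applied directly in the frequency domain.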

Interactive dynamics from a single image: As shown by Davis et al., the image-space motion spectrum from an observed video, under certain assumptions, is proportional to the projections of vibration mode shapes, and thus a spectral volume can be interpreted as an image-space modal basis. The modal shapes capture underlying oscillation dynamics at different frequencies, and hence can be used to simulate the object's response to a user-defined force such as poking or pulling. We adopt the modal analysis technique, assuming the motion can be explained by the superposition of a set of harmonic oscillators. This allows the image-space 2D motion displacement field for the object's physical response to be written as a weighted sum of Fourier spectrum coefficients modulated by the state of complex modal coordinates at each simulated time step. The state is simulated via a forward Euler method applied to the equations of motion for a decoupled mass-spring-damper system. Our method produces an interactive scene from a single picture, whereas prior methods required a video as input.

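Paragraph function: Extended application. Shows the physical interpretation of the spectral volume as a modal basis and the interactivity it enables.

Logical role: The most distinctive application in the paper: reinterpreting the spectral volume as the modal basis of a physical system bridges learned prediction and physical simulation. Needing only a single image directly separates the method from Davis et al., which requires a video.

Rhetorical technique / potential weakness: The physical interpretation through modal analysis adds depth, but it requires the small-displacement linear assumption to hold. For strong user interactions (such as a hard pull), linear modal superposition may not reproduce the nonlinear physical response accurately. The numerical stability of forward Euler may also become a problem at large time steps.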
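
A minimal sketch of the simulation loop, assuming one decoupled mass-spring-damper per mode with the standard equation of motion q'' + 2ξωq' + ω²q = f/m. The damping ratio, time step, mode frequencies, the complex coordinate q - i·q'/ω, taking the real part of the weighted sum, and the random modal basis are all illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def simulate_modes(freqs_hz, force, num_steps, dt=1e-3, damping=0.05, mass=1.0):
    """Forward Euler integration of decoupled modes after an impulse at step 0.

    Returns complex modal coordinates of shape (num_steps, K).
    """
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float)
    q = np.zeros_like(omega)                         # modal displacement
    v = np.zeros_like(omega)                         # modal velocity
    history = np.zeros((num_steps, omega.size), dtype=complex)
    for step in range(num_steps):
        f = force if step == 0 else 0.0              # user force applied for one step
        accel = f / mass - 2.0 * damping * omega * v - omega ** 2 * q
        q, v = q + dt * v, v + dt * accel            # both updates use the old state
        history[step] = q - 1j * v / omega           # assumed complex-coordinate convention
    return history

def displacement(modal_basis, modal_state):
    """Image-space displacement: real part of the modal-coordinate-weighted sum of modes.

    modal_basis: complex (K, H, W, 2); modal_state: complex (K,). Returns (H, W, 2).
    """
    return np.real(np.tensordot(modal_state, modal_basis, axes=(0, 0)))

freqs_hz = np.array([0.5, 1.0, 2.0])                 # hypothetical mode frequencies
rng = np.random.default_rng(0)
# Random stand-in for spectral-volume slices at these three frequencies.
modal_basis = (rng.standard_normal((3, 4, 4, 2))
               + 1j * rng.standard_normal((3, 4, 4, 2)))
force = np.array([1.0, 0.5, 0.25])                   # user force projected onto each mode
states = simulate_modes(freqs_hz, force, num_steps=3000)
field = displacement(modal_basis, states[-1])
```

With this explicit scheme each mode decays only while dt·ω stays below 2ξ, so the step size has to shrink as the highest simulated frequency grows; this is one way to see the stability concern raised in the note above.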

7. Experiments

Implementation: We use an LDM as the backbone with a VAE of continuous latent dimension 4, trained with an L1 reconstruction loss, a multi-scale gradient consistency loss, and a KL-divergence loss. A 2D U-Net is trained with an MSE loss, and attention layers are adopted for frequency-coordinated denoising. The VAE and LDM are trained on images of size 256x160, taking approximately 6 days to converge using 16 Nvidia A100 GPUs. For inference, the motion diffusion model runs with DDIM for 250 steps. Generated videos up to 512x288 resolution are created by fine-tuning on higher resolution data. ResNet-34 is adopted as the feature extractor, and the image synthesis network is based on a co-modulation StyleGAN architecture. The rendering module runs in real time at 25 FPS on an Nvidia V100 GPU. Data: We collected a set of 3,015 videos of natural oscillatory phenomena, yielding more than 150K samples of image-motion pairs after preprocessing.

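Paragraph function: Experimental setup. Discloses implementation details and data scale.

Logical role: Supports reproducibility and uses concrete numbers (16 A100 GPUs, 6 days of training, 25 FPS rendering) to show practicality. The real-time performance of the rendering module is a notable engineering result.

Rhetorical technique / potential weakness: Training on 16 A100 GPUs implies a high compute barrier. The dataset (3,015 videos, 150K samples) is small relative to large-scale video generation work, which may limit generalization. In addition, the details of data collection and filtering (such as the filtering criteria) are critical for reproducibility.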

Quantitative results: Our approach significantly outperforms prior single-image animation baselines in terms of both image and video synthesis quality. Specifically, we achieve FID of 4.03, KID of 0.08 (x100), FVD of 47.1, and DT-FVD of 2.53, compared to the best prior methods achieving FID of 10.4 (Endo et al.) and FVD of 166.0. Our much lower FVD and DT-FVD distances suggest that our generated videos are more realistic and more temporally coherent. Sliding window analysis shows that thanks to the global spectral volume representation, our generated videos do not suffer from drift or degradation over time, maintaining consistent quality across the entire video length. Ablation studies confirm that K=16 provides optimal performance; frequency adaptive normalization improves over simple scaling; and frequency-coordinated denoising outperforms both independent prediction and volume-based prediction. The full model also benefits from feature pyramid softmax splatting over average splatting and baseline splatting in the rendering module.

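Paragraph function: Core evidence. Validates the method with a broad set of quantitative metrics.

Logical role: The empirical backbone of the paper, covering four dimensions: (1) image quality (FID/KID); (2) video quality (FVD/DT-FVD); (3) long-term stability (sliding window analysis); (4) necessity of each component (ablation studies). The large margins (FID from 10.4 to 4.03, FVD from 166.0 to 47.1) strongly support the method.

Rhetorical technique / potential weakness: The margins are impressive, but note that all baselines were trained on the same relatively small dataset. Some baselines were designed for much larger training sets and may not reach their full potential under this data constraint. In addition, DT-FVD is better suited to the task but relatively new, and the community has not yet reached full consensus on its reliability.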
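
For a sense of scale, the reported gaps can be restated as relative reductions. The snippet only does arithmetic on the four numbers quoted above (lower is better for both metrics); the helper name is illustrative.

```python
def relative_reduction(baseline, ours):
    """Fraction by which a lower-is-better metric drops relative to the baseline."""
    return (baseline - ours) / baseline

fid_reduction = relative_reduction(10.4, 4.03)    # roughly a 61% reduction
fvd_reduction = relative_reduction(166.0, 47.1)   # roughly a 72% reduction
```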

Qualitative results: Spatio-temporal X-t slice visualizations demonstrate that our generated video dynamics more strongly resemble the motion patterns observed in corresponding real reference videos, compared to other methods. Baselines such as Stochastic I2V and MCVD fail to model both appearance and motion realistically over time. Endo et al. produce frames with fewer artifacts but exhibit over-smoothed or non-oscillatory motion. Comparisons of individual generated frames at t=128 show that our frames exhibit fewer artifacts and distortions, and our 2D motion fields most resemble the reference displacement fields from real videos. In contrast, background content generated by other methods tends to drift, and video frames from other methods exhibit significant color distortion or ghosting artifacts, suggesting that baselines are less stable when generating long videos.

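Paragraph function: Adds visual evidence, using qualitative analysis to reinforce the quantitative results.

Logical role: Qualitative and quantitative results echo each other: the X-t slices visualize temporal consistency (matching the low FVD) and motion realism (matching the low DT-FVD). The baseline-by-baseline analysis identifies distinct failure modes.

Rhetorical technique / potential weakness: X-t slices are an effective way to visualize temporal consistency. But qualitative comparisons can suffer from selection bias: authors tend to show their own best cases and the others' worst. Readers should look at further examples in the supplementary material to judge how general the results are.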
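
An X-t slice is simply one image row tracked over time. The sketch below builds a synthetic video, assumed to be stored as a (T, H, W, C) array, in which a bright vertical line sways horizontally, and shows that the slice records the sway as a trace; sizes and names are illustrative.

```python
import numpy as np

def xt_slice(video, row):
    """Return the (T, W, C) space-time slice of a (T, H, W, C) video at a fixed row."""
    return video[:, row, :, :]

T, H, W = 32, 8, 64                                  # illustrative sizes
# Column position of a bright vertical line that sways sinusoidally over time.
centers = 32 + np.rint(10 * np.sin(2 * np.pi * np.arange(T) / T)).astype(int)
video = np.zeros((T, H, W, 1))
video[np.arange(T), :, centers, :] = 1.0

xt = xt_slice(video, row=4)
# Brightest column at each time step; for oscillatory motion this is a wave-like curve.
trace = xt[..., 0].argmax(axis=1)
```

Drift or ghosting in a generated video would show up in such a slice as a trace that wanders or smears over time instead of oscillating around a fixed position.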

8. Discussion and Conclusion

Limitations: Since our approach only predicts spectral volumes at lower frequencies, it might fail to model general non-oscillating motions or high-frequency vibrations such as those of musical instruments. Furthermore, the quality of generated videos relies on the quality of the motion trajectories estimated from real video sequences. Thus, animation quality can degrade if the motion in training videos consists of large displacements. Moreover, since the approach is based on image-based rendering from input pixels, animation quality can also degrade if the generated videos require the creation of large amounts of content unseen in the input frame.

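Paragraph function: Candid disclosure of limitations in motion type, data quality, and content creation.

Logical role: The self-critique that marks the boundaries of applicability. The three limitations correspond to (1) inherent constraints of the frequency-domain representation; (2) dependence on the quality of the training data pipeline; (3) the limited ability of image-based rendering to create new content.

Rhetorical technique / potential weakness: The discussion of limitations is thorough and candid, but the severity of each limitation is not quantified. For example, what is the threshold for a "large displacement"? How much unseen content causes quality to degrade? A more specific analysis of failure cases would help users judge where the method applies.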

We present a new approach for modeling natural oscillation dynamics from a single still picture. Our image-space motion prior is represented with spectral volumes, a frequency representation of per-pixel motion trajectories, which we find to be highly suitable for prediction with diffusion models, and which we learn from collections of real world videos. The spectral volumes are predicted using our frequency-coordinated latent diffusion model and are used to animate future video frames using a neural image-based rendering module. We show that our approach produces photo-realistic animations from a single picture and significantly outperforms prior baseline methods, and that it can enable other downstream applications such as creating seamless loops and interactive animations.

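Paragraph function: Summary of the paper, restating the core contributions and what is distinctive about the method.

Logical role: The conclusion mirrors the structure of the abstract and closes the argument: spectral volume representation -> frequency-coordinated LDM prediction -> neural rendering -> realistic animation. The three applications (video generation, seamless looping, interactive dynamics) again underline the method's breadth.

Rhetorical technique / potential weakness: The conclusion is confident but measured, ending with "significantly outperforms" rather than claiming to have solved everything. However, the discussion of future directions is thin: readers might expect an outlook on extending to non-oscillatory motion, higher resolution, or combining with text guidance. For a Best Paper, a deeper forward-looking discussion would make the conclusion more complete.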

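Argument structure overview

Problem: still-image animation lacks long-term temporal consistency
-> Claim: spectral volumes serve as a motion prior that is well suited to prediction with diffusion models
-> Evidence: FID 4.03 / FVD 47.1, ahead of all baselines
-> Rebuttal: applies only to oscillatory motion; depends on image-based rendering
-> Conclusion: a frequency-domain motion prior enables realistic, interactive animation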

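The authors' core claim (in one sentence)

Representing an image-space motion prior with spectral volumes, combined with a frequency-coordinated latent diffusion model, makes it possible to generate temporally coherent, physically plausible videos of oscillatory dynamics from a single still image, and enables seamless looping and interactive manipulation.

Strongest part of the argument

A well-chosen frequency-domain representation: building on the empirical observation that the power spectrum of natural motion decays exponentially with frequency, the prediction target is compressed from T time-domain displacement fields to K=16 Fourier coefficients, striking a good balance between efficiency and expressiveness. Combined with the frequency-coordinated denoising strategy, FVD drops from the previous best of 166.0 to 47.1, a very large margin on quantitative metrics. The spectral volume also supports several downstream applications (slow motion, looping, interaction), showing how far-reaching the choice of representation is.

Weakest part of the argument

A fundamental limit on scope: the method's core assumption, that scene motion is quasi-periodic oscillation, excludes non-periodic motion (walking pedestrians, moving vehicles, falling objects, and so on), confining the method to the relatively narrow domain of natural scenes swaying in the wind. In addition, image-based rendering cannot produce content absent from the input image, so quality degrades when occluded regions are revealed or the scene changes substantially. The paper discusses these limitations candidly but does not explore possible remedies in depth.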