LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Abstract

This work targets a high-quality text-to-video (T2V) generative model, using a pre-trained text-to-image (T2I) model as its basis. The framework operates through cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Two key findings are that simple temporal self-attention with rotary positional encoding adequately captures temporal correlations, and that joint image-video fine-tuning produces superior outcomes. The authors introduce Vimeo25M, "a novel text-video dataset consisting of 25 million text-video pairs." The final output reaches 1280x2048 resolution with 61 frames.

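Paragraph function: Overview of the whole paper. The three-stage cascaded architecture previews LaVie's design philosophy.

Logical role: The abstract announces both the technical approach (cascaded diffusion) and the data contribution (Vimeo25M), and quantifies the result with concrete figures (1280x2048, 61 frames).

Argumentation technique / potential weakness: The claim that simple temporal self-attention "adequately" captures temporal correlations is a bold appeal to simplicity and needs experimental support. Introducing Vimeo25M strengthens the data contribution, but the abstract leaves dataset quality details (watermarks, resolution distribution) unexplained.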

1. Introduction

"Building upon the successes of T2I models, there has been a growing interest in extending these techniques to the synthesis of videos controlled by text inputs." The authors note that "training an entire T2V system from scratch poses significant challenges as it requires extensive computational resources." LaVie's approach leverages pre-trained models while addressing a critical issue: "fine-tuning solely on video datasets, even with the initialization from a pre-trained LDM, fails to achieve this goal due to the phenomenon of catastrophic forgetting." The framework comprises 3 billion parameters and enables both high-quality synthesis and creative generation capabilities.

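Paragraph function: Establishes the research territory by identifying the trend of extending T2I to T2V and its core challenges.

Logical role: Takes two challenges, the heavy computational cost and catastrophic forgetting, as the starting point of the argument, motivating the joint training strategy.

Argumentation technique / potential weakness: The quotation on catastrophic forgetting pinpoints the failure mode of video-only fine-tuning, but a 3-billion-parameter model is itself computationally demanding. The authors need to show how much efficiency is gained relative to training from scratch.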

2. Related Work

"Previous works have leveraged various types of deep generative models, including GANs, VAEs, and VQ-based models" for unconditional video generation. Recent diffusion-based approaches show promise but "learning the entire distribution of video datasets in an unconditional manner remains highly challenging." For text-to-video generation, existing approaches primarily extend T2I models by "incorporating temporal modules, such as temporal convolutions and temporal attention." The paper identifies a key distinction: "In contrast to prior works, our approach distinguishes itself by augmenting a pre-trained Stable Diffusion model with an efficient temporal module and jointly fine-tuning the entire model on both image and video datasets."

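Paragraph function: Literature review that traces the technical progression from unconditional generation to conditional T2V.

Logical role: Pinpoints LaVie's novelty by contrasting approaches that only add temporal modules with joint fine-tuning.

Argumentation technique / potential weakness: Grouping concurrent work such as Make-A-Video and Imagen Video under "incorporating temporal modules" is somewhat reductive, since each has its own distinctive design. The grouping nevertheless highlights LaVie's differentiating position of joint fine-tuning.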

3. Our Approach

3.1 Base T2V Model

The base model introduces two key modifications. First, "for each 2D convolutional layer, we inflate the pre-trained kernel to incorporate an additional temporal dimension." Second, the authors extend the transformer block to include "a temporal attention layer after each spatial layer." Joint training is critical to the approach: "We concatenate M images along the temporal axis to form a T-frame video and train the entire base model to optimize the objectives of both the T2I and T2V tasks." This addresses catastrophic forgetting by preserving image generation knowledge while the model learns temporal patterns. The training objective combines the video and image losses through a balancing coefficient alpha.
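
The two architectural modifications can be illustrated with a minimal NumPy sketch. It assumes a temporal kernel extent of 1 for the inflated convolution, a single attention head, and the common half-split form of rotary positional encoding (RoPE, which the abstract says is paired with temporal self-attention). The function names, the projection matrices `wq`, `wk`, `wv`, and the `(B, C, T, H, W)` layout are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def inflate_kernel(w2d):
    """(C_out, C_in, kH, kW) -> (C_out, C_in, 1, kH, kW).

    Assumed temporal extent of 1, so the pre-trained 2D weights are reused as-is.
    """
    return w2d[:, :, None, :, :]


def rope(x, base=10000.0):
    """Rotary positional encoding over the frame axis. x: (N, T, D), D even."""
    _, t, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)         # one frequency per channel pair
    angles = np.arange(t)[:, None] * freqs[None, :]   # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by an angle proportional to the frame index.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def temporal_self_attention(video, wq, wk, wv):
    """video: (B, C, T, H, W). Attention mixes information across T only."""
    b, c, t, h, w = video.shape
    # Fold spatial positions into the batch axis: (B*H*W, T, C).
    tokens = video.transpose(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
    q, k, v = rope(tokens @ wq), rope(tokens @ wk), tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                                    # (B*H*W, T, C)
    return out.reshape(b, h, w, t, c).transpose(0, 4, 3, 1, 2)


rng = np.random.default_rng(0)
video = rng.normal(size=(1, 8, 16, 4, 4))             # 8 channels, 16 frames
w = [rng.normal(size=(8, 8)) for _ in range(3)]
print(temporal_self_attention(video, *w).shape)       # (1, 8, 16, 4, 4)
```

Because spatial positions are folded into the batch axis, each pixel location attends only over its own T frames, and RoPE makes the attention score between two frames depend on their temporal offset rather than on their absolute indices.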

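Paragraph function: First step of the core method, describing the base model's architectural modifications and training strategy.

Logical role: This paragraph is the foundation of the whole cascade. Kernel inflation and temporal attention let the pre-trained 2D weights carry over smoothly to 3D, while joint training directly answers the catastrophic forgetting problem raised in the Introduction.

Argumentation technique / potential weakness: Packing images along the temporal axis so that they train as a pseudo-video is a clever regularization strategy. A sensitivity analysis of the balancing coefficient alpha is essential, however: too high an image-loss weight may suppress temporal learning, while too low a weight still permits forgetting.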
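
The joint objective can be written out as a sketch. It assumes the standard noise-prediction loss of latent diffusion models. The symbols $z^V_t$ and $z^I_t$ (noised video and image latents at timestep $t$), $\epsilon_\theta$ (the denoiser), $c_V$ and $c_I$ (the text prompts), and the placement of $\alpha$ on the image term are assumptions made for illustration, since the text states only that a coefficient alpha balances the two losses.

```latex
\mathcal{L} = \mathcal{L}_{\text{video}} + \alpha \, \mathcal{L}_{\text{image}},
\qquad
\mathcal{L}_{\text{video}} = \mathbb{E}\left[ \lVert \epsilon - \epsilon_\theta(z^V_t, t, c_V) \rVert_2^2 \right],
\qquad
\mathcal{L}_{\text{image}} = \mathbb{E}\left[ \lVert \epsilon - \epsilon_\theta(z^I_t, t, c_I) \rVert_2^2 \right]
```

Under this assumed form, $\alpha = 0$ reduces to video-only fine-tuning, the setting associated with catastrophic forgetting, while a very large $\alpha$ lets the image term dominate the gradient.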

3.2 Temporal Interpolation Model

This network "takes a 16-frame base video as input and produces an upsampled output consisting of 61 frames." The training approach involves duplicating base video frames and concatenating them with noisy high-frame-rate frames. Notably, "every frame in the output is newly synthesized" — each frame generated through interpolation replaces the corresponding input frame. The model is conditioned on text prompts for guided interpolation, maintaining semantic coherence across the temporal expansion.

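Paragraph function: Second stage of the cascade, raising the frame rate from low to high.

Logical role: Overcomes the base model's 16-frame limit by raising the temporal resolution nearly fourfold (16 to 61 frames) through interpolation.

Argumentation technique / potential weakness: Synthesizing every frame anew avoids inconsistency between interpolated and original frames, but it also risks losing keyframe details that the base model generated. The jump from 16 to 61 frames is large, so the motion coherence of the intermediate frames needs thorough validation.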
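
The frame counts are consistent with a fourfold temporal upsampling that keeps both endpoints: 4 x (16 - 1) + 1 = 61. The plain-Python sketch below works this out and shows one way the conditioning input could be assembled. The nearest-frame duplication rule, the channel-wise concatenation, and all function names are assumptions for illustration; the text says only that base frames are duplicated and concatenated with the noisy high-frame-rate frames.

```python
def interpolated_length(num_frames, factor=4):
    """Frame count after factor-x temporal upsampling that keeps both endpoints."""
    return factor * (num_frames - 1) + 1


def duplicated_condition_indices(num_frames, factor=4):
    """For each output slot, the index of the base frame duplicated into it.

    Uses the temporally nearest base frame (an assumed duplication rule).
    """
    total = interpolated_length(num_frames, factor)
    return [min(num_frames - 1, (i + factor // 2) // factor) for i in range(total)]


def build_denoiser_input(base_frames, noisy_frames, factor=4):
    """Pair every noisy output frame with its duplicated base frame.

    Frames are plain lists of channel values; list concatenation stands in
    for channel-wise concatenation (assumed axis).
    """
    idx = duplicated_condition_indices(len(base_frames), factor)
    if len(noisy_frames) != len(idx):
        raise ValueError("noisy_frames must match the interpolated length")
    return [noisy + base_frames[j] for noisy, j in zip(noisy_frames, idx)]


print(interpolated_length(16))  # 61
```

In this sketch the denoiser sees twice as many channels per frame as the base latent, and all 61 output frames come from the denoised half, which matches the statement that every output frame is newly synthesized.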

3.3 Video Super Resolution

The VSR component increases resolution to 1280x2048. The authors "leverage a pre-trained diffusion-based image 4x upscaler as a prior." Unlike the base model, "the spatial layers in the pre-trained upscaler remain fixed, our focus lies in fine-tuning the inserted temporal layers." Training uses patch-wise processing on 320x320 patches while maintaining capability for arbitrary sizes at inference. This design keeps the strong spatial prior intact while only learning temporal consistency.

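Paragraph function: Third stage of the cascade, raising the spatial resolution from low to high.

Logical role: Freezing the spatial layers and fine-tuning only the temporal layers contrasts with the base model's joint training, showing awareness that each stage calls for its own training strategy.

Argumentation technique / potential weakness: Patch-wise training followed by full-frame inference may introduce boundary artifacts. Freezing the spatial layers preserves per-frame quality, but it may limit spatial adaptation to video-specific characteristics such as motion blur.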
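
A minimal plain-Python sketch of the two ideas in this stage: the 4x size arithmetic, and selecting only the inserted temporal layers for training while the pre-trained spatial layers stay frozen. The parameter names and the `temporal` naming tag are hypothetical, and the 320x512 input size is inferred by dividing the stated 1280x2048 output by the 4x factor.

```python
def upscaled_size(height, width, scale=4):
    """Output resolution of a scale-x spatial upscaler."""
    return height * scale, width * scale


def select_trainable(param_names, temporal_tag="temporal"):
    """Map each parameter name to True (fine-tune) or False (keep frozen)."""
    return {name: temporal_tag in name for name in param_names}


# Hypothetical parameter names for one block of the upscaler.
names = [
    "block0.spatial_conv.weight",
    "block0.spatial_attn.weight",
    "block0.temporal_attn.weight",
]
flags = select_trainable(names)
print(upscaled_size(320, 512))                       # (1280, 2048)
print([n for n, train in flags.items() if train])    # ['block0.temporal_attn.weight']
```

In a real training loop the same mask would decide which parameters receive gradient updates, so the spatial prior of the image upscaler is left untouched.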

4. Experiments

On UCF-101, LaVie achieves an FVD of 526.30 against Video LDM's 550.61, a 24.31-point improvement (lower FVD is better). On the MSR-VTT CLIPSIM metric, the method scores 0.2949, competitive with but slightly below Make-A-Video's 0.3049. Human evaluation with 30 raters shows LaVie achieving a 75.00% preference rate over ModelScope and 75.58% over VideoCrafter. However, challenges persist: "all three approaches struggle to achieve a satisfactory score in terms of 'motion smoothness,' indicating the ongoing challenge of generating coherent and realistic motion." The Vimeo25M dataset comprises 25 million text-video pairs in high-definition, widescreen, and watermark-free formats; approximately 16.89% of its videos receive aesthetics scores greater than 6, surpassing WebVid10M's 7.22%.

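Paragraph function: Provides empirical support through quantitative metrics and human evaluation.

Logical role: Multi-dimensional validation: quantitative comparison on FVD and CLIPSIM, qualitative assessment through human preference, and statistical analysis of dataset quality.

Argumentation technique / potential weakness: Admitting the weakness in motion smoothness shows academic honesty. However, the CLIPSIM of 0.2949 falls slightly below Make-A-Video's 0.3049, and the authors do not analyze the cause of this gap. The aesthetics-score comparison is effective evidence of Vimeo25M's quality, but the dataset's diversity (geographic and cultural bias) goes undiscussed.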
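
The reported numbers can be cross-checked with a few lines of arithmetic. The snippet below uses only figures quoted in this section; the variable names are illustrative.

```python
fvd = {"LaVie": 526.30, "Video LDM": 550.61}             # lower is better
clipsim = {"LaVie": 0.2949, "Make-A-Video": 0.3049}      # higher is better
aesthetic_gt6 = {"Vimeo25M": 16.89, "WebVid10M": 7.22}   # % of videos scoring > 6

fvd_gain = fvd["Video LDM"] - fvd["LaVie"]
fvd_gain_rel = fvd_gain / fvd["Video LDM"]
clipsim_gap = clipsim["Make-A-Video"] - clipsim["LaVie"]
aesthetic_ratio = aesthetic_gt6["Vimeo25M"] / aesthetic_gt6["WebVid10M"]

print(f"FVD gain: {fvd_gain:.2f} points ({fvd_gain_rel:.1%} relative)")
print(f"CLIPSIM gap to Make-A-Video: {clipsim_gap:.4f}")
print(f"Aesthetics share ratio: {aesthetic_ratio:.2f}x")
```

The FVD gain is about 4.4% relative to Video LDM, the CLIPSIM gap to Make-A-Video is 0.0100, and Vimeo25M's share of videos with aesthetics scores above 6 is roughly 2.3 times WebVid10M's.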

5. Conclusion

The paper presents LaVie as "a text-to-video foundation model that produces high-quality and temporally coherent results." Key achievements include leveraging cascaded diffusion models, introducing the Vimeo25M dataset, and demonstrating joint training effectiveness. The work "serves as an initial step towards achieving high-quality T2V generation," with future directions toward longer videos and movie-level quality synthesis from scripts. Acknowledged limitations include difficulty with multi-subject generation and accurate hand depiction.

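Paragraph function: Summarizes the paper by restating contributions, looking ahead, and acknowledging limitations.

Logical role: The conclusion positions LaVie as an "initial step" rather than a final solution, showing appropriate academic modesty.

Argumentation technique / potential weakness: Acknowledging the problems with hand generation and multiple subjects is meaningful self-criticism. However, the future goal of movie-level quality is a very large leap and comes without a concrete technical roadmap. The cascade's error accumulation, where flaws from the base model are amplified by later stages, is also not adequately discussed.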

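Argument Structure Overview

Problem: T2V training is costly and prone to catastrophic forgetting.

→ Claim: Cascaded diffusion plus joint fine-tuning delivers both quality and efficiency.

→ Evidence: FVD of 526.30, 75% human preference, and the Vimeo25M dataset.

→ Rebuttal: Motion smoothness remains challenging; multi-subject and hand generation are limited.

→ Conclusion: An initial but important step toward a T2V foundation model.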

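The Authors' Core Claim (One Sentence)

Cascaded video latent diffusion models (base generation, temporal interpolation, and super-resolution), combined with a joint image-video fine-tuning strategy, can learn temporal dynamics effectively while exploiting pre-trained T2I knowledge, and so generate high-resolution, temporally coherent videos.

Strongest Point of the Argument

Empirical validation of the joint training strategy: the ablation study clearly shows that video-only fine-tuning causes catastrophic forgetting and that freezing the backbone limits expressiveness, while joint training strikes the best balance between the two. The Vimeo25M dataset additionally gives the community a high-quality training resource.

Weakest Point of the Argument

Error accumulation in the cascade: with three chained stages, flaws from any stage can be inherited or even amplified by the stages after it. The authors acknowledge insufficient motion smoothness but do not analyze which stage causes it. In addition, inference must run the three models in sequence, and concrete latency and memory figures are not adequately reported.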