From Image to Video: An Empirical Study of Diffusion Representations

Abstract

Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This study systematically compares video and image diffusion models that share an identical architecture. It evaluates their representations on downstream tasks including image classification, action recognition, depth estimation, and tracking. The key finding is that video diffusion models consistently outperform their image counterparts, with particularly large gains on motion-dependent tasks.

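Paragraph function: Overview of the full paper. The "identical architecture, different training objectives" design establishes a fair basis for comparison and previews the core findings.

Logical role: As an empirical study, the abstract states the experimental design and core findings directly and proposes no new method or model. The "identical architecture" premise is the cornerstone of the paper's credibility.

Argumentative technique / potential weakness: "Consistently outperform" is a strong claim, and the size of the improvement across tasks needs scrutiny. If semantic classification improves by only 0.6% while tracking improves by 68%, the word "consistently" masks enormous heterogeneity.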

1. Introduction

While image diffusion models have demonstrated effectiveness for visual understanding tasks, "the representational power of video diffusion models remains largely unexplored." The study uses the WALT architecture, whose hybrid design supports both image and video generation modes and therefore permits a fair comparison between image and video training objectives. The authors present this as the first direct comparison of image versus video diffusion objectives for visual understanding tasks.

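Paragraph function: Defines the research gap, namely that the value of video diffusion representations for visual understanding has not been studied systematically.

Logical role: The choice of the WALT architecture is the key to the whole experimental design. It allows switching between image and video modes within the same architecture, which removes architectural differences as a confounding factor.

Argumentative technique / potential weakness: The "first direct comparison" claim strengthens the paper's novelty, but relying on a single architecture (WALT) limits how far the conclusions generalize, since other architectures might yield different image-versus-video results.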

2. Related Work

The paper surveys three main areas: image diffusion for visual understanding, video representation learning, and concurrent work on video diffusion models. Existing literature documents that diffusion models excel at semantic segmentation, depth estimation, and correspondence tasks. However, few studies directly compare video and image diffusion representations using identical architectures. The representational gap between generative and discriminative objectives remains an active research question.

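Paragraph function: Literature review that establishes the scholarly context for diffusion models as representation learners.

Logical role: Motivates the study by pairing what is known (the representational value of image diffusion) with what is unknown (the representational value of video diffusion).

Argumentative technique / potential weakness: Framing the literature gap as a "lack of fair comparison" rather than a "lack of research on video diffusion representations" pinpoints the paper's distinguishing contribution.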

3. Method

The authors employ latent diffusion models that operate in a compressed latent space, pairing a VQ-VAE encoder with a denoiser network. The WALT architecture is selected for its hybrid design: I-WALT (the image variant) processes individual frames independently, while V-WALT (the video variant) includes spatio-temporal attention blocks that capture motion dynamics across 17 frames. The probing framework extracts frozen features at various noise levels and network depths and evaluates them on eight downstream tasks spanning semantic understanding to spatio-temporal reasoning.

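Paragraph function: Experimental methodology, describing the architectural design and evaluation framework that keep the comparison fair.

Logical role: The only difference between I-WALT and V-WALT is spatio-temporal attention, which precisely isolates the effect of the video training objective. The breadth of eight downstream tasks keeps the conclusions from being confined to any one task.

Argumentative technique / potential weakness: Frozen features with a linear probe are a standard way to evaluate representations, but they may understate performance differences that would appear under fine-tuning. The 17-frame window also limits what can be said about long-range temporal dependencies.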
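The probing setup described above (noise the latents to a chosen timestep, read frozen activations at a chosen depth, train only a lightweight probe) can be sketched as follows. This is a minimal toy illustration and not the authors' implementation: `ToyDenoiser` is a hypothetical stand-in made of fixed random layers, the linear beta schedule with 1000 steps is an assumption, and the ridge-regression probe is one simple choice of linear readout.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, num_steps=1000):
    """Forward-diffuse clean latents to timestep t (linear beta schedule is an assumption)."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

class ToyDenoiser:
    """Hypothetical frozen network: a stack of fixed random tanh layers."""
    def __init__(self, dim, depth):
        self.weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]

    def features(self, x_t, layer):
        """Activations after the first `layer` blocks; the weights are never updated."""
        h = x_t
        for w in self.weights[:layer]:
            h = np.tanh(h @ w)
        return h

def with_bias(feats):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_linear_probe(feats, labels, num_classes, ridge=1e-2):
    """Closed-form ridge regression onto one-hot labels; the only trained component."""
    x = with_bias(feats)
    targets = np.eye(num_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def probe_accuracy(probe, feats, labels):
    preds = np.argmax(with_bias(feats) @ probe, axis=1)
    return float(np.mean(preds == labels))

# Synthetic two-class "latents" (invented data, for illustration only).
dim, n = 16, 400
labels = rng.integers(0, 2, size=n)
x0 = rng.standard_normal((n, dim)) + 2.0 * labels[:, None]

model = ToyDenoiser(dim=dim, depth=6)
feats = model.features(add_noise(x0, t=200), layer=4)  # moderate noise, about 2/3 depth

probe = fit_linear_probe(feats[:200], labels[:200], num_classes=2)
accuracy = probe_accuracy(probe, feats[200:], labels[200:])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Sweeping `t` and `layer` in this sketch mirrors the idea of probing at several noise levels and depths; which combination works best for a real model is an empirical question.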

4. Experiments

4.1 Qualitative Analysis

PCA visualizations of the features reveal that V-WALT activations concentrate on moving regions, while I-WALT highlights semantically important areas regardless of motion. A brick-wall experiment confirms V-WALT's motion sensitivity: when the camera pans across a uniform texture, V-WALT activations track the motion while I-WALT responds uniformly. This shows that video training fundamentally shifts the model's attention patterns toward temporal dynamics.

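Paragraph function: Intuitive evidence, using visualizations and a controlled experiment to show how the image and video models differ in kind.

Logical role: The brick-wall experiment is an elegant control: the uniform texture removes semantic differences and isolates pure motion sensitivity. This gives the later quantitative advantages an interpretable mechanism.

Argumentative technique / potential weakness: PCA visualizations are highly persuasive but selective, since the authors may have shown the feature maps that best support their argument. The brick-wall experiment is clever but extreme; in real scenes semantics and motion are usually intertwined.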
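A PCA feature visualization of the kind discussed above is commonly produced by projecting each spatial location's feature vector onto the top three principal components and displaying the result as RGB. The sketch below assumes an (H, W, C) feature map and uses synthetic activations in place of real denoiser features; the function name `pca_rgb` is illustrative and does not come from the paper.

```python
import numpy as np

def pca_rgb(feature_map, num_components=3):
    """Project an (H, W, C) feature map onto its top principal components.

    Returns an (H, W, num_components) array rescaled to [0, 1] per component.
    """
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:num_components].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-12)
    return scaled.reshape(h, w, num_components)

# Synthetic stand-in for activations: an 8x8 grid of 32-dimensional features.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 32))
features[2:5, 2:5] += 4.0  # a patch whose features differ from the background

rgb = pca_rgb(features)
patch_mean = rgb[2:5, 2:5, 0].mean()
background = np.ones((8, 8), dtype=bool)
background[2:5, 2:5] = False
background_mean = rgb[..., 0][background].mean()
```

In this toy case the first component separates the shifted patch from the background, which is the same effect that makes distinct regions stand out in such visualizations. The sign of each component is arbitrary, so colors can flip between runs or models.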

4.2 Quantitative Comparisons

V-WALT outperforms I-WALT across all tasks, with gains ranging from modest (0.6% on Places365) to substantial (68% on point tracking, 60% on camera pose estimation). Motion-sensitive tasks improve more than semantic classification, suggesting that video training primarily enhances temporal and geometric understanding rather than object-level semantics. Performance peaks at intermediate network depths (approximately two thirds of the way through the model) and moderate noise levels (timestep 200).

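Paragraph function: Core quantitative results, where the task-dependent pattern of gains reveals what video training actually changes.

Logical role: The contrast between the 0.6% and 68% gains is the paper's most insightful finding: video training mainly improves spatio-temporal understanding rather than semantic understanding, which offers practical guidance on when to use video diffusion representations.

Argumentative technique / potential weakness: The wide heterogeneity of the gains (0.6% to 68%) makes the abstract's "consistently outperform" claim look oversimplified. The small difference on semantic classification may not be statistically robust.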
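To make the heterogeneity concrete, the short sketch below compares the three gains quoted in the text. It assumes the percentages are relative improvements of V-WALT over I-WALT, which the summary does not state explicitly; `relative_gain` is a helper introduced here for illustration only.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# Gains quoted in the text, in percent (assumed to be relative improvements).
reported_gains = {
    "Places365": 0.6,
    "camera pose estimation": 60.0,
    "point tracking": 68.0,
}

smallest = min(reported_gains, key=reported_gains.get)
largest = max(reported_gains, key=reported_gains.get)
spread = reported_gains[largest] / reported_gains[smallest]
print(f"{largest} gain is about {spread:.0f}x the {smallest} gain")
```

A gap of roughly two orders of magnitude between the largest and smallest gain is why a single word such as "consistently" hides most of the story.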

4.3 Factors Affecting Representations

Training-budget experiments reveal that recognition tasks improve steadily with longer training, while tracking and depth estimation peak at earlier training stages. Camera pose estimation declines after 26% of training is complete, suggesting potential overfitting. On scaling, increasing model size from 284M to 1.9B parameters substantially improves V-WALT on semantic tasks but yields only marginal gains on point tracking. Compared with other models, V-WALT is competitive with DINOv2 on motion understanding, but semantic classification favors models such as SigLIP and DINOv2.

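Paragraph function: In-depth analysis of the factors that shape representation quality along three dimensions: training budget, model scale, and cross-model comparison.

Logical role: This section offers practical guidance: different downstream tasks call for features from different depths, different noise levels, and different training stages. The finding of overfitting at 26% is especially valuable for checkpoint selection in practice.

Argumentative technique / potential weakness: The cross-model comparison (DINOv2, SigLIP) clarifies where diffusion representations stand: strong at spatio-temporal understanding but weaker semantically than dedicated discriminative models. This is an honest and useful conclusion, though it also limits any claim that diffusion representations are general-purpose.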
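Because different tasks peak at different points in training, one practical consequence is to select a checkpoint per task rather than always using the final one. The sketch below shows that selection step only. The validation curves are hypothetical numbers invented to exercise the function and are not results from the paper.

```python
def best_checkpoint_per_task(scores):
    """Map each task to the training fraction whose checkpoint scored highest.

    `scores` has the form {task: {training_fraction: validation_score}},
    where a higher score is better.
    """
    return {task: max(curve, key=curve.get) for task, curve in scores.items()}

# Hypothetical validation curves (made-up values, for illustration only).
example_scores = {
    "recognition": {0.25: 0.60, 0.50: 0.66, 1.00: 0.70},
    "camera_pose": {0.25: 0.55, 0.50: 0.52, 1.00: 0.48},
}

selected = best_checkpoint_per_task(example_scores)
for task, fraction in selected.items():
    print(f"{task}: use the checkpoint at {fraction:.0%} of training")
```

For metrics where lower is better (for example, a pose error), the scores would need to be negated or `max` swapped for `min`.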

5. Discussion and Conclusion

The study establishes that video diffusion models consistently outperform their image counterparts as visual representations, with particularly large gains on motion-dependent tasks. The authors acknowledge a limitation: studying a single architecture (WALT) restricts generalizability. Future work should explore additional unified model architectures, investigate the relationship between generation quality and representation learning, and examine the intersection of generative and perceptual capabilities in diffusion models.

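Paragraph function: Summary and outlook, restating the core findings and candidly disclosing the limitations.

Logical role: The conclusion is appropriately modest: it states the single-architecture limitation explicitly and leaves ample room for follow-up research.

Argumentative technique / potential weakness: The future direction on generation quality versus representation quality is especially interesting. If the two are positively correlated, better generators naturally yield better representations; if they are negatively correlated, the objectives conflict at a fundamental level. Clarifying this relationship matters for the whole field.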

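Argument Structure Overview

Problem: the representation quality of video diffusion models has not been explored
→ Claim: a fair image vs. video comparison on an identical architecture
→ Evidence: video consistently beats image (tracking +68%, semantics only +0.6%)
→ Rebuttal: only one architecture, so the generality of the conclusions needs further validation
→ Conclusion: video diffusion representations are especially strong on spatio-temporal understanding tasks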

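The authors' core claim (in one sentence)

On an identical architecture, diffusion models trained with a video objective consistently outperform those trained with an image objective on all visual understanding tasks, with improvements of up to 68% on motion-related tasks.

Strongest point of the argument

Rigorous experimental control and in-depth multi-dimensional analysis: the choice of the WALT architecture precisely isolates the effect of the training objective. The brick-wall experiment offers an intuitive mechanistic explanation. Systematic analysis along four dimensions (training budget, model scale, feature depth, and noise level) gives practitioners rich operational guidance. The 0.6% to 68% spread of gains reveals what video training really does: it strengthens spatio-temporal rather than semantic understanding.

Weakest point of the argument

Limited generality from a single architecture: all conclusions rest on WALT, and other diffusion architectures (such as DiT, U-Net, or UViT) might yield different image-versus-video comparisons. In addition, the small improvement on semantic classification (0.6%) may not be statistically significant, which puts the claim of "consistently outperforming on all tasks" in doubt in the strict sense. The comparison with discriminative models such as DINOv2 shows that diffusion representations still lag on semantic tasks.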