
Abstract

We introduce Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework connects two distributions more flexibly than standard diffusion models, enabling a modular study of the design choices that shape generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmarks, achieving FID-50K scores of 2.06 and 2.62, respectively.
Paragraph function: overview of the whole paper; introduces the interpolant framework and shows it outperforming DiT.
Logical role: positions the work as a "modular study," stressing the framework's flexibility and systematicity rather than a single trick.
Argumentative technique / potential weakness: FID 2.06 is a highly competitive number, but the claim of "uniformly surpassing" DiT requires supporting data at every model size.
A distinctive feature of our work is its emphasis on understanding rather than solely pursuing state-of-the-art numbers. By building on the exact same DiT architecture, we isolate the contributions of the interpolant framework from architectural effects. Our systematic ablation reveals actionable guidelines: velocity prediction outperforms score prediction, linear interpolation outperforms trigonometric schedules, and tunable stochastic sampling outperforms both pure ODE and pure SDE sampling. These findings provide a practical recipe for improving any DiT-based model with zero additional computational cost.
Paragraph function: positions the research contribution; understanding is prioritized over chasing state-of-the-art numbers.
Logical role: "actionable guidelines" turn academic findings directly into practical advice usable by the community.
Argumentative technique / potential weakness: the free-lunch framing of "zero additional computational cost" is highly appealing and lowers the barrier to adoption.

1. Introduction

The recent surge in generative AI applications has been driven primarily by diffusion models, yet the design space of these models remains insufficiently explored. Most practitioners adopt the standard DDPM formulation or its latent diffusion variant without questioning whether the default noise schedule, prediction target, or sampling strategy is optimal. This is partly because changing one aspect typically requires changes to others, making controlled experimentation difficult. SiT's interpolant framework resolves this issue by providing a principled decomposition of the design space into independently variable dimensions.
Paragraph function: research motivation; the design space of diffusion models remains under-explored.
Logical role: identifies the community's "defaults are optimal" assumption as the direct motivation for this study.
Argumentative technique / potential weakness: questioning common practice is an effective strategy for motivating systematic research.
Diffusion models and flow-based models are two prominent frameworks for generative modeling. Standard diffusion models define a fixed forward process that progressively adds Gaussian noise, and learn a reverse denoising process. Flow-based models instead learn a deterministic mapping between a simple base distribution and the data distribution. The stochastic interpolant framework provides a unified perspective that encompasses both diffusion and flow-based models as special cases. By defining a general interpolation between noise and data, we can independently vary the interpolation schedule, the objective function, and the sampling strategy, enabling a systematic exploration of the design space.
Paragraph function: establishes the theoretical setting; a unified view of diffusion and flow-based models.
Logical role: the "unified perspective" framing makes SiT a theoretical framework rather than merely a technical improvement.
Argumentative technique / potential weakness: unifying both frameworks as special cases is a strong theoretical contribution, but the extra design freedom also means a larger tuning space.
Despite the success of DiT in establishing Transformers as the backbone for diffusion models, the design choices beyond architecture remain underexplored. DiT inherits the standard DDPM formulation with fixed noise schedules and epsilon prediction, but there is no principled reason why these specific choices should be optimal. The interpolant framework enables us to ask: what is the best way to connect the noise and data distributions? Should we predict the velocity, the score, or the noise? Should sampling be deterministic or stochastic? By decoupling these choices from the architecture, we can provide clear, controlled answers to each question independently.
Paragraph function: identifies the research gap; DiT inherits unvalidated default choices.
Logical role: questioning the optimality of the defaults is the core motivation for the systematic study.
Argumentative technique / potential weakness: the "no principled reason" critique is forceful, but accumulated practical experience also has value.
The practical significance of SiT lies in its demonstration that substantial improvements are achievable through better training and sampling recipes alone, without any architectural changes. This is particularly valuable because architectural improvements typically require retraining from scratch, often with modified codebases and different hyperparameter searches. In contrast, SiT's improvements can be applied to any existing DiT training pipeline by simply changing the loss function and sampling procedure. For the rapidly growing community of practitioners training custom diffusion models, this represents an immediate, zero-cost upgrade path — change a few lines of training code and sampling configuration, and receive a 9.3% FID improvement.
Paragraph function: practical significance; the community value of a zero-cost upgrade path.
Logical role: "a few lines of code" puts the 9.3% improvement within easy reach, maximizing the work's practical impact.
Argumentative technique / potential weakness: stressing ease of adoption is effective advocacy, but migrating in practice may still require hyperparameter tuning.
The theoretical foundation for SiT draws from stochastic interpolants and flow matching. Stochastic interpolants, introduced by Albergo and Vanden-Eijnden, provide a general framework for learning transport maps between arbitrary distributions by defining interpolation paths parameterized by time-dependent coefficients. Flow matching, developed concurrently by Lipman et al., offers a simulation-free training objective for continuous normalizing flows. Both frameworks generalize the DDPM formulation while providing greater flexibility in choosing noise schedules and objectives. Our contribution is to bring these theoretical advances into the practical setting of large-scale class-conditional image generation with Transformer architectures.
Paragraph function: theoretical background; stochastic interpolants and flow matching as the sources of the framework.
Logical role: bridging from theoretical frameworks to large-scale practice is the paper's core contribution.
Argumentative technique / potential weakness: clearly attributing the theory to prior work is good scholarly practice, but it also implies the core theoretical innovation lies outside this paper.
A key distinction between diffusion models and flow matching lies in the presence or absence of stochasticity in the forward process. Standard diffusion (DDPM) defines a stochastic forward process that adds Gaussian noise according to a fixed schedule, while flow matching defines a deterministic linear interpolation between data and noise. The interpolant framework subsumes both by parameterizing the interpolation with general time-dependent coefficients: x_t = alpha(t)*x_0 + beta(t)*epsilon. Setting alpha(t) = sqrt(1 - sigma(t)^2) and beta(t) = sigma(t) recovers DDPM, while alpha(t) = 1-t and beta(t) = t recovers linear flow matching. This unified parameterization enables fair comparison of design choices that were previously confounded with framework-specific assumptions.
Paragraph function: the unifying framework; interpolant coefficients subsume both DDPM and flow matching.
Logical role: the precise mathematical formulation shows how both known special cases are recovered from one framework.
Argumentative technique / potential weakness: the value of "fair comparison" lies in removing confounds present in earlier studies, an important methodological contribution.
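The two special cases above can be checked numerically. The sketch below is ours, not from the SiT codebase; sigma(t) = t is an illustrative choice for the DDPM-style schedule:

```python
import math

# Unified parameterization from the text: x_t = alpha(t) * x_0 + beta(t) * eps,
# with t = 0 at data and t = 1 at pure noise.

def ddpm_coeffs(t, sigma=lambda t: t):
    """Variance-preserving (DDPM-style) schedule: alpha(t)^2 + beta(t)^2 = 1."""
    s = sigma(t)
    return math.sqrt(1.0 - s * s), s

def linear_coeffs(t):
    """Linear flow-matching schedule: alpha(t) = 1 - t, beta(t) = t."""
    return 1.0 - t, t

def interpolate(x0, eps, t, coeffs):
    alpha, beta = coeffs(t)
    return alpha * x0 + beta * eps

# Both schedules recover the data at t = 0 and pure noise at t = 1.
for coeffs in (ddpm_coeffs, linear_coeffs):
    assert interpolate(3.0, -1.0, 0.0, coeffs) == 3.0
    assert interpolate(3.0, -1.0, 1.0, coeffs) == -1.0
```

The two schedules differ only in how quickly signal is traded for noise at intermediate t, which is exactly the degree of freedom the framework exposes.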

3. Method

SiT is built on the exact same architecture as DiT — a Transformer operating on latent patches with adaptive layer normalization conditioning. The only changes are in the training objective and sampling procedure, which are governed by the interpolant framework. We systematically explore: (1) linear vs. trigonometric interpolants; (2) velocity prediction vs. score prediction objectives; (3) continuous-time vs. discrete-time training; and (4) deterministic (ODE) vs. stochastic (SDE) sampling with varying diffusion coefficients. Our key finding is that by decoupling the diffusion coefficient from the learning process, we can tune it at inference time to find the optimal balance between sample quality and diversity.
Paragraph function: presents the core method; four independently tunable design dimensions.
Logical role: the "same architecture, different framework" experimental design makes the ablation conclusions exceptionally clean.
Argumentative technique / potential weakness: the systematic exploration is the paper's greatest strength, but exhaustively covering a four-dimensional design space is computationally expensive.
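The four dimensions above can be summarized as a small configuration object. Field names and defaults here are illustrative, not taken from the SiT codebase; the defaults encode the recipe the text reports as best:

```python
from dataclasses import dataclass

@dataclass
class TransportConfig:
    """Illustrative knobs for the four design dimensions discussed in the text."""
    interpolant: str = "linear"       # "linear" or "trigonometric"
    objective: str = "velocity"       # "velocity" or "score"
    time: str = "continuous"          # "continuous" or "discrete"
    sampler: str = "sde"              # "ode" or "sde"
    diffusion_coeff: float = 1.0      # tuned at inference time; 0.0 gives a pure ODE

best = TransportConfig()  # linear interpolant + velocity objective + tuned SDE
```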
The interpolant defines how noise and data are mixed at intermediate times. A linear interpolant creates x_t = (1-t)x_0 + t*epsilon, while a trigonometric interpolant uses x_t = cos(pi*t/2)*x_0 + sin(pi*t/2)*epsilon. These choices affect the signal-to-noise ratio schedule and consequently the difficulty of prediction at each timestep. The velocity prediction objective trains the network to predict dx_t/dt (the time derivative of the interpolant), while score prediction targets the score function of the noisy distribution. Velocity prediction has a natural connection to flow matching and produces more uniform loss magnitudes across timesteps, potentially leading to more balanced training.
Paragraph function: mathematical detail; formal definitions of the interpolants and prediction targets.
Logical role: precise definitions make the differences between design choices concretely comparable.
Argumentative technique / potential weakness: "more uniform loss magnitudes" offers an intuitive explanation for velocity prediction's advantage, though it still needs experimental verification.
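The definitions above can be made concrete in a few lines. The function names are ours; a real training loop would sample t uniformly in [0, 1] and regress a network v_theta(x_t, t) onto `vt`:

```python
import math

def linear_xt_and_velocity(x0, eps, t):
    # x_t = (1 - t) * x0 + t * eps  =>  dx_t/dt = eps - x0 (constant in t)
    xt = (1.0 - t) * x0 + t * eps
    vt = eps - x0
    return xt, vt

def trig_xt_and_velocity(x0, eps, t):
    # x_t = cos(pi t / 2) * x0 + sin(pi t / 2) * eps
    # dx_t/dt = -(pi/2) sin(pi t / 2) * x0 + (pi/2) cos(pi t / 2) * eps
    a = math.pi * t / 2.0
    xt = math.cos(a) * x0 + math.sin(a) * eps
    vt = (math.pi / 2.0) * (math.cos(a) * eps - math.sin(a) * x0)
    return xt, vt

# Velocity objective (conceptually): loss = E_t || v_theta(x_t, t) - vt ||^2.
```

Note that the linear target is constant in t, which matches the text's intuition about more uniform loss magnitudes across timesteps.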
The most consequential finding concerns sampling with tunable stochasticity. Standard diffusion models use a fixed SDE for sampling, coupling the diffusion coefficient to the learned score. SiT decouples these by learning a velocity field that can be used with any diffusion coefficient at inference time. Setting the coefficient to zero yields deterministic ODE sampling (similar to DDIM); increasing it adds stochasticity that can improve sample quality by correcting accumulated errors along the trajectory. The optimal coefficient is typically between 0.5 and 1.5, and can be tuned on a small validation set. This single hyperparameter provides a principled way to trade off sample quality against diversity without retraining the model.
Paragraph function: core finding; the quality-diversity tradeoff of sampling with tunable stochasticity.
Logical role: a single inference-time hyperparameter is an eminently practical design that lowers the barrier to use.
Argumentative technique / potential weakness: the 0.5-1.5 optimal range is a useful guideline, but the best value may vary by dataset and application.
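A minimal Euler-Maruyama sketch of the decoupling idea, assuming a drift of the form velocity plus (w/2) times the score to compensate the injected noise; SiT's exact reverse-SDE drift may differ, and `velocity` and `score` stand in for learned networks:

```python
import math
import random

def sample(velocity, score, x, w=1.0, steps=100):
    """Integrate from t = 1 (noise) back to t = 0 (data) with diffusion coefficient w."""
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        drift = velocity(x, t) + 0.5 * w * score(x, t)  # score term offsets the added noise
        noise = math.sqrt(w * dt) * random.gauss(0.0, 1.0) if w > 0 else 0.0  # w = 0: pure ODE
        x = x - drift * dt + noise  # backward Euler-Maruyama step
        t -= dt
    return x

# With w = 0 and the toy velocity field v(x, t) = x, the exact solution is
# x(t) = x(1) * exp(t - 1), so starting from x(1) = e should land near x(0) = 1.
random.seed(0)
x0 = sample(lambda x, t: x, lambda x, t: 0.0, math.e, w=0.0, steps=2000)
assert abs(x0 - 1.0) < 0.01
```

The same trained `velocity` works for any w, which is exactly why the coefficient can be swept at inference time without retraining.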

4. Experiments

On conditional ImageNet 256x256, SiT-XL/2 achieves an FID-50K of 2.06, compared to DiT-XL/2's FID of 2.27 — a 9.3% improvement using the exact same model architecture, parameters, and GFLOPs. On ImageNet 512x512, SiT achieves FID 2.62 vs DiT's 3.04. Critically, SiT's advantages are consistent across all model sizes (S, B, L, XL), confirming that the improvements stem from the interpolant framework rather than scale-dependent factors. Our ablation studies reveal that velocity prediction with linear interpolation outperforms score prediction with standard diffusion noise schedules, and that SDE sampling with tuned diffusion coefficients outperforms deterministic ODE sampling.
Paragraph function: presents the core empirical evidence; quantified improvements and ablation analysis.
Logical role: the "same architecture" controlled design makes attribution of the improvement unambiguous, and the ablation results give the community practical guidance.
Argumentative technique / potential weakness: a free 9.3% improvement (no extra compute) is a compelling result, but it is validated only on ImageNet; generalization to other datasets remains to be confirmed.
A detailed breakdown of the design space reveals clear winners in each dimension. For interpolation: linear interpolation achieves FID 2.06 vs. trigonometric's 2.18, a gap that is consistent across model sizes. For objectives: velocity prediction achieves FID 2.06 vs. score prediction's 2.31, confirming that the more uniform loss landscape of velocity prediction leads to better optimization. For sampling: SDE with diffusion coefficient 1.0 achieves FID 2.06 vs. ODE's 2.25, demonstrating that controlled stochasticity helps correct trajectory errors. Continuous-time and discrete-time training perform comparably (FID 2.06 vs. 2.09), suggesting that this choice is less critical than the others.
Paragraph function: dimension-by-dimension ablation; quantitative comparisons across the four design dimensions.
Logical role: the per-dimension numerical contrasts are clear and actionable, directly guiding practice.
Argumentative technique / potential weakness: finding that continuous vs. discrete time matters little is an important negative result that reduces the decision burden in practice.
We further analyze the relationship between the tunable diffusion coefficient and generation quality. Sweeping the coefficient from 0 (pure ODE) to 2.0 (heavy SDE) reveals a clear optimal region between 0.8 and 1.2 where FID is minimized. Below 0.8, the deterministic trajectory accumulates discretization errors that manifest as reduced diversity. Above 1.2, excessive stochasticity introduces noise that degrades sharpness. Interestingly, the optimal coefficient is relatively stable across model sizes (varying by less than 0.1 between SiT-S and SiT-XL), suggesting that this finding transfers across scales. We also observe that the optimal coefficient for FID differs from the optimal for Inception Score (which favors slightly higher stochasticity at 1.3-1.5), providing users with a principled knob for controlling the quality-diversity tradeoff.
Paragraph function: deeper analysis; sweeping the diffusion coefficient to locate the optimal region.
Logical role: the precise 0.8-1.2 optimal region and its stability across scales provide a ready-to-use guideline.
Argumentative technique / potential weakness: the different optima for FID and Inception Score reveal an inherent quality-diversity tradeoff, letting users choose according to their needs.

5. Conclusion

We have introduced SiT, demonstrating that the interpolant framework provides a principled and effective approach to improving diffusion-based generative models. By enabling independent control over interpolation schedules, objectives, and sampling strategies, SiT achieves consistent improvements over DiT across all model scales with zero additional computational cost. Our systematic study provides practical guidelines for the design of future generative models based on dynamical transport.
Paragraph function: concludes the paper; restates the framework's value and the practical guidelines.
Logical role: closing on "practical guidelines" stresses the research's direct value to the community.
Argumentative technique / potential weakness: SiT's value lies in its theoretical clarity and plug-and-play improvements, which follow-up work has widely adopted.
The implications of SiT extend beyond the specific ImageNet benchmarks. The finding that velocity prediction with linear interpolation and tunable SDE sampling constitutes a superior training-sampling recipe is directly applicable to any model using the DiT architecture, including text-to-image models, video generation models, and audio synthesis systems. Future work should explore whether these findings transfer to latent diffusion settings with different VAE architectures and to conditional generation tasks beyond class conditioning. The interpolant framework also opens the door to learning the optimal interpolation schedule jointly with the model, potentially discovering schedules that outperform hand-designed options.
Paragraph function: outlook; cross-modal transfer and learnable interpolation schedules.
Logical role: extending from ImageNet to text-to-image and video generation broadens the work's sphere of impact.
Argumentative technique / potential weakness: learnable schedules are a natural extension, but may add training complexity and stability challenges.
The methodological contribution of SiT extends beyond its specific numerical results. It establishes a template for systematic empirical investigation in generative modeling: fix the architecture, vary one design dimension at a time, and measure the impact. This controlled experimental methodology is surprisingly rare in the generative modeling literature, where new methods typically change multiple aspects simultaneously, making it difficult to attribute improvements to specific design choices. SiT's approach of building on the exact same codebase and training setup as DiT ensures that every observed difference is attributable to the interpolant framework. This level of experimental rigor provides the community with trustworthy, actionable guidance rather than results that may not reproduce under different conditions.
Paragraph function: methodological exemplar; the paradigm value of controlled experimentation.
Logical role: treats SiT's research methodology itself as a contribution to the community, beyond the numerical results.
Argumentative technique / potential weakness: the criticism that experimental rigor is lacking in the literature is fair, and SiT offers a positive example.

Argument Structure Overview

Problem: diffusion-model design choices lack systematic study
Thesis: the interpolant framework unifies diffusion and flow models
Method: systematic exploration of a four-dimensional design space
Evidence: FID 2.06 at zero additional computational cost
Conclusion: practical guidelines for generative-model design

Core Claim (one sentence)

By using the interpolant framework to decouple the diffusion coefficient from learning, velocity prediction with linear interpolation consistently surpasses standard DiT at every model scale with no increase in computational cost.

Strongest Point of the Argument

The 9.3% FID improvement (2.06 vs. 2.27) under identical architecture and compute, consistent across four model scales (S/B/L/XL), makes the conclusion highly reliable. The completeness of the four-dimensional ablation gives the community directly usable design guidance.

Weakest Point of the Argument

Validation is limited to class-conditional ImageNet generation; transfer to more complex settings such as text-conditional or video generation remains under-explored. Interaction effects within the four-dimensional design space (e.g., between the interpolant and the sampling strategy) are not explored in depth.
