TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation

Abstract

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. At the core of our method is a unified Structured LATent (SLat) representation that allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This representation is designed by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model. We employ rectified flow transformers tailored for the structured latent representation and train models with up to 2 billion parameters on a large-scale dataset of 500K diverse 3D objects. Our model generates high-quality results with text or image conditions, and the resulting 3D assets can be directly extracted to various final representations, and further edited to create diverse variations.

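Paragraph function: Overview of the whole paper. In compact language, it previews the core of the method, the technical means, the scale, and the application capabilities.

Logical role: The abstract serves a dual function of hinting at the problem and previewing the solution. "Versatile and high-quality" implies that existing methods fall short in generality or quality, and the combination of the SLat representation, rectified flow transformers, and 500K objects answers that gap.

Argumentation technique / potential weakness: Concrete numbers (2 billion parameters, 500K objects) establish a sense of scale and add credibility. However, the simultaneous claims of "versatile" and "high-quality" must be fully validated with quantitative metrics in the later experiments; large-scale training alone does not guarantee quality.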

1. Introduction

While 2D image generation has achieved remarkable progress through large-scale models, 3D generative models still fall short in generation quality compared to their 2D predecessors. A core challenge is that 3D data encompasses diverse representations like meshes, point clouds, Radiance Fields, and 3D Gaussians, each optimized for specific applications yet difficult to adapt across tasks. Geometry-focused approaches often falter in detailed appearance modeling compared to those relying on representations equipped with advanced volumetric rendering capabilities, while appearance-focused methods excel in rendering high-quality appearances but struggle with plausible geometry extraction.

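Paragraph function: Establishes the research field by pointing out the quality gap in 3D generation and the dilemma created by the diversity of representations.

Logical role: Starting point of the argument chain. It first sets 2D achievements as the benchmark, then exposes the fundamental "geometry vs. appearance" tension in the 3D field, paving the way for the necessity of a unified representation.

Argumentation technique / potential weakness: The "geometry vs. appearance" dichotomy captures the trade-offs of existing methods precisely and works well rhetorically. In practice, though, some methods (such as DMTet with texture fields) already attempt to address both, so the dichotomy may be an oversimplification.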

We propose developing a unified and versatile latent space that facilitates high-quality 3D generation across various representations. Our strategy involves two key design choices: (1) introducing explicit sparse 3D structures in the latent space design, enabling diverse decoding targets; and (2) equipping these sparse structures with a powerful vision foundation model for detailed information encoding. The resulting system, TRELLIS, is trained on 500K carefully-collected assets using rectified flow transformers as backbone models adapted for sparse structures. Key capabilities include high-quality generation, versatile format selection, flexible editing without parameter tuning, and training without 3D-specific data fitting.

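Paragraph function: Presents the solution by outlining TRELLIS's two design choices and four core capabilities.

Logical role: Following the problem statement in the previous paragraph, this paragraph acts as the pivot from "existing methods fall short" to "our approach". The sparse structures respond directly to the representation-diversity problem, and the vision foundation model responds to the need to handle both geometry and appearance.

Argumentation technique / potential weakness: Enumerating the two design choices with numbers makes the structure clear and easy to follow. The four capabilities claimed side by side cover a very wide range, but the specific limiting conditions of "flexible editing without parameter tuning" and "training without 3D-specific data fitting" are left for the method section to clarify.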

Some recent methods have leveraged 2D diffusion models for 3D creation, starting with DreamFusion and progressing to multiview generation approaches. However, these 2D-assisted approaches often yield lower geometry quality compared to native 3D models learned from 3D data collections, due to inherent multiview inconsistency. In contrast, our approach directly learns from large-scale 3D data, ensuring geometric fidelity and multiview coherence that 2D-lifting methods fundamentally cannot guarantee.

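Paragraph function: Rules out an alternative route by critiquing 2D-assisted 3D generation methods.

Logical role: This paragraph further narrows the argument for why a native 3D method is needed: it rules out not only the shortcomings of single-representation methods but also the detour through 2D, making TRELLIS's positioning clearer.

Argumentation technique / potential weakness: Summarizing the fundamental flaw of 2D methods as "inherent multiview inconsistency" keeps the argument concise and forceful. However, recent methods such as Zero123++ have substantially improved consistency, so the critique may not fully reflect the latest progress.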

2. Related Work

Early approaches used Generative Adversarial Networks (GANs) to model 3D distributions but faced scaling challenges. Later methods employed diffusion models for various representations like point clouds, voxel grids, Triplanes, and 3D Gaussians. While efficient latent-space approaches emerged, most methods focused either on shape modeling, often requiring an additional texturing phase, or on appearance-rich formats that struggle with plausible geometry extraction. This dichotomy underscores the need for a unified representation that can serve both geometry and appearance.

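Paragraph function: Literature review outlining the evolution of 3D generative models from GANs to diffusion models, along with their limitations.

Logical role: Continues the critical thread of the introduction and restates the "geometry vs. appearance" dilemma more systematically, providing scholarly background for TRELLIS's unified representation.

Argumentation technique / potential weakness: The timeline narrative (GANs -> diffusion models -> latent-space methods) presents the field's evolution with a clear structure. But classifying all prior methods into an either-or dichotomy may overlook recent work such as CLAY that attempts to handle both geometry and appearance.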

An alternative paradigm leverages 2D generative models for 3D creation. Score Distillation Sampling (SDS), introduced by DreamFusion, optimizes a 3D representation to match a 2D diffusion prior. Subsequent works explored multiview generation followed by 3D reconstruction. However, these 2D-assisted approaches often yield lower geometry quality compared to native 3D models due to inherent multiview inconsistency. Rectified flow models have recently emerged as a novel generative paradigm that challenges the dominance of diffusions, demonstrating effectiveness for large-scale image and video generation, motivating their adoption for 3D generation.

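Paragraph function: Literature positioning that places TRELLIS at the intersection of "native 3D" and "rectified flow".

Logical role: This paragraph establishes two scholarly lineages: (1) the limitations of 2D-assisted methods -> the necessity of native 3D methods; (2) the evolution from diffusion models -> rectified flow. TRELLIS is positioned as the point where the two converge.

Argumentation technique / potential weakness: Extrapolating the advantages of rectified flow from the 2D domain to the 3D domain is logically reasonable but requires experimental validation. Success in 2D does not automatically guarantee the same effectiveness on sparse 3D structures, so the argument contains an inductive leap.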

3. Method

3.1 Structured Latent Representation

The core contribution is a representation called Structured Latents (SLat), defined as z = {(z_i, p_i)} for i = 1 to L, where p_i represents active voxel positions in a 3D grid and z_i denotes local latents at those positions. The active voxels outline the coarse structure of the 3D asset, while the latents capture finer details of appearance and shape. The representation leverages sparsity: the number of active voxels L is significantly smaller than the total grid size N³, allowing construction at a relatively high resolution. Default settings use N = 64 resolution yielding approximately L = 20K active voxels.

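Paragraph function: First step of the method derivation, defining the mathematical form and sparsity property of the SLat representation.

Logical role: This is the mathematical foundation of the entire method. The separation "active voxels = coarse structure, local latents = fine details" directly determines the soundness of the later two-stage generation pipeline.

Argumentation technique / potential weakness: Concrete values (resolution 64, 20K voxels) support the practical feasibility of the sparsity and add persuasiveness. Whether 20K voxels have enough expressive capacity to cover highly complex geometric details (such as hair or thin-walled structures) still needs experimental validation.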
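
To make the definition z = {(z_i, p_i)} and the sparsity argument concrete, here is a minimal sketch of the data structure. The class name `StructuredLatent`, its field names, and the helper `to_dense_occupancy` are hypothetical and chosen for illustration only; the paper's actual implementation is not shown in these notes.

```python
from dataclasses import dataclass
from typing import List, Tuple

Voxel = Tuple[int, int, int]


@dataclass
class StructuredLatent:
    """Toy container for z = {(z_i, p_i)}, i = 1..L (names are hypothetical)."""
    resolution: int              # N, side length of the voxel grid
    positions: List[Voxel]       # p_i, integer coordinates of active voxels
    latents: List[List[float]]   # z_i, one local latent vector per active voxel

    def __post_init__(self) -> None:
        if len(self.positions) != len(self.latents):
            raise ValueError("need exactly one latent per active voxel")

    def active_fraction(self) -> float:
        """Share of the N^3 grid that is active, i.e. L / N^3."""
        return len(self.positions) / self.resolution ** 3


def to_dense_occupancy(slat: StructuredLatent) -> List[List[List[int]]]:
    """Expand the active positions into a dense binary N x N x N grid."""
    n = slat.resolution
    grid = [[[0] * n for _ in range(n)] for _ in range(n)]
    for x, y, z in slat.positions:
        grid[x][y][z] = 1
    return grid


# With the defaults quoted above (N = 64, L of roughly 20K),
# only about 7.6% of the 262,144 grid cells carry a latent.
default_fraction = 20_000 / 64 ** 3
```

The dense grid produced by `to_dense_occupancy` corresponds to the "structure" half of the representation, while `latents` holds the "content" half.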

A critical design insight is the separation between structure (active voxel positions) and content (local latents). This decoupling yields two major advantages: first, the structure can be generated independently and efficiently as a compact binary occupancy grid; second, the latents are generated conditioned on the structure, allowing detail variation and region-specific editing without affecting the overall coarse geometry. The locality of the latents further enables spatial editing by altering voxels and latents in targeted areas while leaving others unchanged.

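Paragraph function: Explains the design motivation, namely the editing flexibility gained by separating structure from content.

Logical role: This paragraph connects SLat's mathematical definition to practical value: the structure-content decoupling is not only a technical design but also the basis for controllable editing, echoing the "flexible editing" promise in the introduction.

Argumentation technique / potential weakness: The parallel "two major advantages" structure makes the argument clear and forceful. However, the locality assumption means that global style consistency (such as coherent overall lighting) may be hard to maintain naturally under region-specific editing, and this limitation is not discussed.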

3.2 Encoding and Decoding

The encoding process converts 3D assets into voxelized features by aggregating features extracted from dense multiview images using a pre-trained DINOv2 encoder. For each active voxel, features are gathered from randomly sampled camera views on a sphere, projected onto multiview feature maps to retrieve features at corresponding locations, and their average is used as the voxel's visual feature. A transformer-based VAE architecture then processes these voxelized features, serializing input features from active voxels and adding sinusoidal positional encodings to create tokens with variable context length L.

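Paragraph function: Technical detail describing the encoding pipeline from 3D assets to structured latents.

Logical role: This paragraph reveals the source of SLat's "dense visual features": using DINOv2 as the visual backbone is the key bridge for transferring 2D pre-trained knowledge to 3D, which answers the design choice of "equipping with a vision foundation model" from the introduction.

Argumentation technique / potential weakness: Choosing DINOv2 over other vision models (such as CLIP) is an important decision that is not fully justified. The text offers no ablation on whether DINOv2's self-supervised features best suit 3D reconstruction, or on how they compare with CLIP's semantic features.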
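
The project-and-average step of the encoding pipeline can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: each view is modeled as a hypothetical `project` callable that maps a 3D point to normalized image coordinates (or `None` when the point is not visible), the feature maps are plain nested lists standing in for DINOv2 outputs, and the lookup is nearest-neighbor rather than any particular interpolation scheme.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
Projector = Callable[[Point], Optional[Tuple[float, float]]]


def aggregate_voxel_feature(
    center: Point, views: Sequence[Tuple[Projector, FeatureMap]]
) -> List[float]:
    """Average the features that one voxel center hits across all views."""
    total: Optional[List[float]] = None
    hits = 0
    for project, feature_map in views:
        uv = project(center)
        if uv is None:
            continue  # this view does not see the voxel
        u, v = uv
        if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
            continue  # projected outside the image
        rows, cols = len(feature_map), len(feature_map[0])
        # Nearest-neighbor lookup (an assumption made for brevity).
        row = min(int(v * rows), rows - 1)
        col = min(int(u * cols), cols - 1)
        feature = feature_map[row][col]
        if total is None:
            total = list(feature)
        else:
            total = [a + b for a, b in zip(total, feature)]
        hits += 1
    if total is None:
        raise ValueError("voxel is not visible in any view")
    return [value / hits for value in total]
```

Averaging over views gives every active voxel a single feature vector regardless of how many cameras observed it, which is what lets the later VAE treat the voxels as a uniform token sequence.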

The structured latents can be decoded into versatile output formats. For 3D Gaussians, each z_i is decoded into K Gaussians with position offsets, colors, scales, opacities, and rotations, with final positions constrained to the vicinity of their active voxel using p_i + tanh(o_i). For Radiance Fields, the decoder outputs a CP-decomposition of a local radiance volume at 8³ per voxel. For meshes, the process decodes FlexiCubes parameters and signed distance values, upsampling to 256³ and extracting meshes from 0-level isosurfaces. Notably, the encoder and decoder are trained end-to-end using Gaussians, while other format decoders are simply trained from scratch with frozen encoders, demonstrating strong extensibility.

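Paragraph function: Core innovation, demonstrating SLat's multi-format decoding capability and answering the "versatility" promise.

Logical role: This paragraph is the substantive support for the paper's "versatility" argument: a single latent representation can output three very different 3D formats, directly answering the introduction's central demand for generation across representations.

Argumentation technique / potential weakness: The design of "train with Gaussians, freeze the encoder for other formats" implies that Gaussians are the central representation and the other formats are secondary. This raises a question: can the other formats reach the same quality as Gaussians? If mesh quality is significantly lower than Gaussian rendering quality, the "versatility" claim must be discounted.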
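
The position constraint p_i + tanh(o_i) for the Gaussian decoder can be illustrated with a few lines of code. The function name `gaussian_centers` is hypothetical, and the sketch assumes that positions and offsets are expressed in voxel units, which these notes do not state explicitly.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_centers(
    voxel_position: Sequence[float], raw_offsets: Sequence[Sequence[float]]
) -> List[Tuple[float, ...]]:
    """Place K Gaussians near one active voxel using p_i + tanh(o_i).

    tanh squashes every raw offset into [-1, 1], so a center can never
    drift more than one unit per axis away from its voxel position.
    """
    return [
        tuple(p + math.tanh(o) for p, o in zip(voxel_position, offset))
        for offset in raw_offsets
    ]
```

Because the bound holds for any decoder output, even an untrained network keeps each Gaussian attached to the voxel that produced it.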

The Sparse VAE architecture incorporates shifted window attention in 3D space to enhance local information interaction among neighboring voxels. This design choice balances computational efficiency with expressive power: global self-attention across all 20K tokens would be prohibitively expensive, while purely local operations would miss inter-region dependencies. The shifted window mechanism, adapted from the 2D Swin Transformer paradigm to 3D sparse structures, allows information to propagate across window boundaries while maintaining manageable computational costs.

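Paragraph function: Architectural detail explaining the design decision behind the attention mechanism in the sparse VAE.

Logical role: This paragraph addresses a key engineering challenge: how to compute attention efficiently over 20K sparse tokens. Shifted window attention is a pragmatic compromise between efficiency and expressive power.

Argumentation technique / potential weakness: Borrowing from the Swin Transformer lowers the reader's cognitive load by reusing a known concept. But implementing shifted windows in 3D is far more complex than in 2D, and the effect of window size on quality is not adequately discussed.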
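
The core bookkeeping behind windowed attention on sparse voxels is deciding which tokens share a window. The sketch below shows only that grouping step; the function name `window_partition`, the window size, and the shift value are assumptions for illustration, and the cyclic wraparound used by the original Swin Transformer is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def window_partition(
    positions: Sequence[Voxel], window: int, shift: int = 0
) -> Dict[Voxel, List[int]]:
    """Group active-voxel indices by the 3D window that contains them.

    Attention would then run independently inside each group. Alternating
    shift=0 with shift=window // 2 across layers moves the window borders,
    so voxels split by one layer's partition can interact in the next.
    """
    groups: Dict[Voxel, List[int]] = defaultdict(list)
    for index, (x, y, z) in enumerate(positions):
        key = ((x + shift) // window, (y + shift) // window, (z + shift) // window)
        groups[key].append(index)
    return dict(groups)
```

For example, with `window=4` the voxels (3, 0, 0) and (4, 0, 0) fall into different windows, while with `shift=2` they land in the same one, which is how information crosses the original boundary.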

3.3 Structured Latents Generation

The generation pipeline follows a two-stage approach. The first stage generates the sparse structure {p_i} by converting sparse active voxels into a dense binary 3D grid O, then compressing it via a simple VAE with 3D convolutional blocks into a low-resolution feature grid S. This makes structure generation computationally efficient while converting discrete occupancy values into continuous features suited for rectified flow training. The rectified flow formulation uses linear interpolation x(t) = (1-t)x_0 + t*epsilon and learns a vector field through conditional flow matching.

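Paragraph function: Method derivation describing the first of the two generation stages, sparse structure generation.

Logical role: This paragraph follows from the "separation of structure and content" in SLat's definition: the structure is generated first and provides the geometric skeleton for the later latent generation. Converting binary occupancy -> continuous features is the key technical bridge that makes rectified flow applicable to discrete structures.

Argumentation technique / potential weakness: Converting discrete occupancy values into continuous features to suit rectified flow is a clever strategy, but it introduces an extra VAE compression step that may lose information during structure generation. The choice of binarization threshold after decompression may also affect final geometry quality.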
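
The rectified flow formulation can be made concrete with a small sketch. Differentiating x(t) = (1-t)x_0 + t*epsilon with respect to t gives the constant velocity epsilon - x_0, which is the regression target in conditional flow matching. The function names, the mean-squared-error reduction, and the `predict` callable standing in for the network are assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(x0: Sequence[float], eps: Sequence[float], t: float) -> Vector:
    """x(t) = (1 - t) * x0 + t * eps: data at t = 0, pure noise at t = 1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def target_velocity(x0: Sequence[float], eps: Sequence[float]) -> Vector:
    """d x(t) / dt = eps - x0, independent of t for a straight path."""
    return [b - a for a, b in zip(x0, eps)]


def cfm_loss(
    predict: Callable[[Vector, float], Vector],
    x0: Sequence[float],
    t: float,
    rng: random.Random,
) -> float:
    """Squared error between the predicted and the true velocity at x(t)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = interpolate(x0, eps, t)
    predicted = predict(xt, t)
    target = target_velocity(x0, eps)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In training, `t` would be sampled per example and the loss averaged over a batch; here a single call evaluates one sample so the arithmetic stays visible.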

The second stage generates local latents {z_i} conditioned on the structure. A transformer G_L is designed specifically for sparse structures: instead of directly serializing all 20K tokens, the method improves efficiency by packing them into a shorter sequence using sparse convolutions to aggregate latents within a 2³ local region, followed by multiple time-modulated transformer blocks. Both structure and latent generators incorporate conditions through cross-attention layers: text conditions use CLIP features, while image conditions use DINOv2 visual features. Both models are trained separately using the conditional flow matching objective.

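Paragraph function: Method derivation describing the architecture and conditioning mechanism of the second stage, latent generation.

Logical role: Completes the description of the two-stage pipeline: structure generation -> latent generation. The sparse-convolution packing strategy is the core engineering innovation for handling the 20K-token sequence length, making large-scale transformer training feasible.

Argumentation technique / potential weakness: The differentiated conditioning design (CLIP for text, DINOv2 for images) reflects a solid understanding of the two modalities. But the text does not adequately discuss whether training the two stages separately could cause a mismatch between structure and content (for example, an overly simple structure paired with overly complex latents).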
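
The effect of packing latents within each 2³ region can be sketched as below. Note the substitution: the paper uses learned sparse convolutions for the aggregation, whereas this illustration uses a plain mean over each 2x2x2 block, purely to show how the sequence gets shorter. The function name `pack_2x2x2` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


def pack_2x2x2(
    positions: Sequence[Voxel], latents: Sequence[Sequence[float]]
) -> Tuple[List[Voxel], List[List[float]]]:
    """Merge latents that share a 2x2x2 block into one token per block.

    Mean pooling is a stand-in for the learned sparse convolution.
    Up to 8 fine voxels collapse into a single coarse token.
    """
    buckets: Dict[Voxel, List[Sequence[float]]] = {}
    for (x, y, z), latent in zip(positions, latents):
        buckets.setdefault((x // 2, y // 2, z // 2), []).append(latent)
    packed_positions = sorted(buckets)
    packed_latents = [
        [sum(channel) / len(buckets[key]) for channel in zip(*buckets[key])]
        for key in packed_positions
    ]
    return packed_positions, packed_latents
```

The transformer blocks then attend over the packed tokens, and a matching unpacking step would restore one latent per active voxel at the output.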

3.4 3D Editing with Structured Latents

The separation between structure and latents enables two forms of editing. Detail variation is accomplished by preserving the asset's structure and executing the second generation stage with different text prompts, producing variants that adhere to the overall shape while exhibiting diverse appearance and geometry details. Region-specific editing leverages the locality of the representation: altering voxels and latents in targeted areas while leaving others unchanged, accomplished through adapting the Repaint technique to the two-stage generation pipeline and specifying bounding boxes for voxels to be edited. This enables detailed local modifications such as adding rivers and bridges to island models, all guided by text or image conditions.

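Paragraph function: Application showcase that turns SLat's structural properties into practical editing capabilities.

Logical role: This paragraph links the earlier technical design (structure-content separation, locality) to application value that users can perceive, completing the chain "design motivation -> technical realization -> application scenario".

Argumentation technique / potential weakness: A concrete editing example (adding a river to an island) makes an abstract technical capability tangible. However, the adapted Repaint depends heavily on an accurately specified bounding box, and the naturalness of the transition between edited and unedited regions is not evaluated quantitatively.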
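
The masking idea behind Repaint-style region editing can be sketched for the latent stage. At each sampling step, voxels inside the bounding box keep the model's current sample, while voxels outside are reset to the original asset's latents noised to the current time. This is a simplified illustration under assumptions: the voxel set is held fixed, the noising reuses the interpolation x(t) = (1-t)x_0 + t*epsilon from Section 3.3, and the names `in_box` and `repaint_blend` are hypothetical. How the paper adapts the technique to both stages is not detailed in these notes.

```python
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]
Box = Tuple[Voxel, Voxel]  # (min corner, max corner), both inclusive


def in_box(position: Voxel, box: Box) -> bool:
    """True when the voxel lies inside the axis-aligned edit region."""
    low, high = box
    return all(lo <= c <= hi for c, lo, hi in zip(position, low, high))


def repaint_blend(
    positions: Sequence[Voxel],
    generated: Sequence[Sequence[float]],
    original: Sequence[Sequence[float]],
    noise: Sequence[Sequence[float]],
    t: float,
    box: Box,
) -> List[List[float]]:
    """One blending step: edit inside the box, preserve everything else."""
    blended = []
    for position, sample, clean, eps in zip(positions, generated, original, noise):
        if in_box(position, box):
            blended.append(list(sample))  # region being regenerated
        else:
            # Known region: original latent noised to the current time t.
            blended.append([(1.0 - t) * a + t * b for a, b in zip(clean, eps)])
    return blended
```

At t = 0 the voxels outside the box are exactly the original latents, so the unedited part of the asset is reproduced unchanged.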

4. Experiments

Training involved approximately 500K high-quality 3D assets from four public datasets: Objaverse (XL), ABO, 3D-FUTURE, and HSSD. The pipeline renders 150 images per asset and employs GPT-4o for captioning. Three model scales were trained: 342M (Basic), 1.1B (Large), and 2B (X-Large) parameters. The X-Large model was trained with 64 A100 GPUs for 400K steps with a batch size of 256. At inference, classifier-free guidance strength is set to 3, sampling steps to 50, and generation time is approximately 10 seconds. Evaluation uses the Toys4k dataset (4K objects not in the training set).

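Paragraph function: Experimental setup listing the training data, model scales, compute resources, and evaluation benchmark.

Logical role: Establishes a credible experimental framework for the quantitative comparisons that follow. The scale of 500K assets, 2 billion parameters, and 64 A100 GPUs reflects an industrial-level resource investment.

Argumentation technique / potential weakness: Captioning with GPT-4o is an innovative choice that may introduce bias, since GPT-4o's description quality directly caps text-conditioned generation. In addition, the 64-A100 training threshold restricts reproducibility to the few institutions with large-scale compute.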
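
The inference settings quoted above (guidance strength 3, 50 sampling steps) can be read against a minimal sampler sketch. Two assumptions are made here: classifier-free guidance is written in the common form v_uncond + w * (v_cond - v_uncond), which may differ from the paper's exact convention, and integration uses a plain Euler scheme from t = 1 (noise) down to t = 0 (data). The `velocity` callable is a hypothetical stand-in for the trained model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def guided_velocity(
    v_cond: Sequence[float], v_uncond: Sequence[float], scale: float = 3.0
) -> Vector:
    """Classifier-free guidance: push the prediction toward the condition."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Sequence[float],
    steps: int = 50,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 1 down to t = 0 in equal steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x
```

In a full pipeline, `velocity` would call the model twice per step (with and without the condition) and combine the results through `guided_velocity`.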

In reconstruction evaluation, the method outperforms all baselines across all evaluated metrics including PSNR, LPIPS, Chamfer Distance, and F-score, with even geometry quality surpassing CLAY which focuses solely on shape encoding. For generation, qualitative comparisons demonstrate superiority over both 2D-assisted methods (InstantMesh, LGM) and 3D approaches (GaussianCube, Shap-E, 3DTopia-XL, LN3Diff), exhibiting not only more vivid appearances and finer geometries but also more precise alignment with provided text and image prompts. Generated assets demonstrate an unprecedented level of quality with vibrant colors, vivid details, complex structures, and flat faces with sharp edges.

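Paragraph function: Provides empirical evidence, validating the method's effectiveness with quantitative and qualitative results.

Logical role: This paragraph is the empirical pillar and covers two dimensions: (1) an across-the-board lead in reconstruction quality; (2) superior generation quality over a range of baselines. Surpassing the shape-focused CLAY in particular supports the core claim that a unified representation can serve both geometry and appearance.

Argumentation technique / potential weakness: Extreme wording such as "unprecedented level of quality" should be used with caution in an academic paper. The comparison set also omits some of the strongest concurrent methods (such as Rodin), which raises concerns about selective baseline comparison.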
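
For readers unfamiliar with the geometry metrics named above, here is a brute-force sketch of Chamfer Distance and F-score on point sets. Definitions vary across papers (squared vs. plain distances, sums vs. means, choice of threshold); this version uses mean Euclidean nearest-neighbor distances summed over both directions, which is an assumption and not necessarily the variant used in the evaluation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(source: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Distance from each source point to its closest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance: mean a->b plus mean b->a (lower is better)."""
    a_to_b = nearest_distances(a, b)
    b_to_a = nearest_distances(b, a)
    return sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a)


def f_score(pred: Sequence[Point], gt: Sequence[Point], tau: float) -> float:
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The brute-force search is quadratic in the number of points, so practical evaluations typically rely on a spatial index; the logic of the metrics is unchanged.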

Quantitative evaluation on Toys4k using Fréchet Distance, Kernel Distance, and CLIP Score confirms the method significantly surpasses previous methods across all evaluated metrics. A user study with over 100 participants across 68 text and 67 image prompts shows the method is strongly preferred by users with clear margins. Ablation studies reveal three critical findings: 64³ resolution provides a significant boost over 32³; replacing diffusion with rectified flow improves both quality and prompt alignment at any generation stage; and increasing model size consistently improves generation performance on both training distribution and held-out test sets, demonstrating favorable scaling behavior up to 2 billion parameters.

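Paragraph function: Multi-dimensional validation through quantitative metrics, a user study, and ablation experiments.

Logical role: This paragraph reinforces two core claims: (1) rectified flow outperforms diffusion models (shown directly by the ablation); (2) the method scales well (larger models perform better), suggesting potential for further scaling.

Argumentation technique / potential weakness: The ablation design is systematic and comprehensive, and is one of the most persuasive parts of the paper. But a user study on the scale of 100 participants is relatively small, and the distribution of participants' professional backgrounds is not reported; the preferences of professional 3D artists and general users may differ significantly.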

5. Conclusion

We present TRELLIS, introducing a structured latent representation that allows decoding to versatile output formats by comprehensively encoding both geometry and appearance information into localized latents anchored on a sparse 3D grid. Paired with rectified flow transformers scaled to 2 billion parameters and trained on 500K diverse 3D objects, the approach demonstrates superiority in 3D generation in terms of quality, versatility, and editability. The strong scaling behavior and favorable comparison to existing methods highlight strong potential for real-world applications in digital production, suggesting that structured sparse latent representations paired with large-scale flow-based generation provide a promising foundation for the future of 3D content creation.

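Paragraph function: Summarizes the paper, restating the core contributions and looking ahead.

Logical role: The conclusion mirrors the structure of the abstract and closes the argument loop, moving from the technical contribution of "unified representation + rectified flow + large-scale training" to the broader outlook of "a foundation for the future of 3D content creation".

Argumentation technique / potential weakness: The conclusion's wording is confident without overclaiming ("a promising foundation" rather than "the final solution"), and the tone is appropriate. However, it does not adequately discuss limitations such as the heavy dependence on compute, sensitivity to the diversity of the training data, and generalization to rare categories outside the training distribution. For a Spotlight paper, readers would expect a more candid limitations analysis.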

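Argument Structure Overview

Problem: 3D generation quality lags behind 2D; the diversity of representations forces a geometry/appearance trade-off.

→

Thesis: A structured sparse latent representation with unified multi-format decoding.

→

Evidence: Training on 500K assets with 2 billion parameters; outperforms baselines on all metrics.

→

Rebuttal: Sparse structures plus rectified flow balance efficiency and scalability.

→

Conclusion: Structured latent representations are a foundation for 3D content creation.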

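Authors' Core Claim (One Sentence)

Encoding geometry and appearance information into local latents anchored on a sparse 3D grid, paired with large-scale rectified flow transformers, enables high-quality, editable 3D asset generation across multiple representation formats.

Strongest Point of the Argument

Multi-format decoding from a unified representation: the same SLat latent space decodes into three very different formats (3D Gaussians, Radiance Fields, and meshes), and its reconstruction quality surpasses baselines that focus on a single format across the board (even surpassing the shape-specialized CLAY on geometry). The ablations systematically validate design choices along three dimensions (resolution, generative paradigm, and model size), and the continued improvement with scale suggests the method has not yet hit a performance ceiling.

Weakest Point of the Argument

Compute threshold and reproducibility: the 2-billion-parameter model requires 64 A100 GPUs to train, and the 500K assets require large-scale rendering and GPT-4o captioning, which limits reproducibility to institutions with industrial-level resources. In addition, training the encoder end-to-end with 3D Gaussians while training the other format decoders only on a frozen encoder may make non-Gaussian formats "second-class citizens" in quality, weakening the substance of the "versatility" claim.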