Abstract
We present LGM (Large Multi-View Gaussian Model), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images within 5 seconds. Our key insights are two-fold: (1) we propose multi-view Gaussian features as an efficient yet powerful 3D representation, which can be fused together for differentiable rendering; and (2) we present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Our approach maintains fast speed while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
Paragraph function: Overview of the whole paper; the core claim of 5-second 3D model generation and the two technical insights.
Logical role: The 5-second time commitment is striking; the two insights address the representation and the architecture respectively.
Technique / potential gap: Speed and quality usually trade off; claiming both at once demands strong experimental support.
The core technical challenge addressed by LGM is the resolution bottleneck in feed-forward 3D generation. Existing approaches are constrained to low resolutions (typically 128x128) due to the cubic memory scaling of volumetric representations like NeRF-based methods. By leveraging 3D Gaussian Splatting as the output representation, LGM sidesteps this limitation entirely — Gaussians scale linearly with the number of points rather than cubically with resolution, enabling practical high-resolution training and inference.
Paragraph function: Technical motivation; the resolution bottleneck and the Gaussian-representation solution.
Logical role: The linear-vs-cubic scaling contrast makes the choice of Gaussians a near-irrefutable technical advantage.
Technique / potential gap: Memory scaling is a precise entry point into real engineering constraints, but Gaussian representations still have inherent limits in surface topological continuity.
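The linear-vs-cubic contrast above can be made concrete with a back-of-envelope sketch. The constants below (4 channels per voxel cell, 14 floats per Gaussian) are illustrative assumptions, not figures from the paper:

```python
# Rough scaling comparison: a dense voxel/NeRF feature grid grows cubically
# with resolution, while a Gaussian point set grows linearly with the number
# of primitives. All constants are illustrative.

def voxel_floats(res: int, channels: int = 4) -> int:
    """Dense grid: res^3 cells, `channels` floats each."""
    return res ** 3 * channels

def gaussian_floats(num_gaussians: int, attrs: int = 14) -> int:
    """Point set: one 14-attribute Gaussian per primitive."""
    return num_gaussians * attrs

# Doubling grid resolution multiplies memory by 8x...
assert voxel_floats(256) == 8 * voxel_floats(128)
# ...while ~1M Gaussians (4 views at 512x512) stay modest by comparison.
million_gaussians = 4 * 512 * 512          # 1,048,576 primitives
assert gaussian_floats(million_gaussians) < voxel_floats(512)
```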
1. Introduction
The rapidly growing demand for 3D content across gaming, virtual reality, e-commerce, and film production has created an urgent need for automated 3D generation tools. Traditional 3D modeling using software like Blender or Maya requires extensive training and hours of manual work per object. The advent of 2D diffusion models has transformed image creation from a specialized skill to a prompt-based workflow, and there is strong motivation to achieve the same transformation for 3D content.
Paragraph function: Industry background; the demand for 3D content and the motivation for automation.
Logical role: Links the technical research to real industry needs, strengthening the motivation.
Technique / potential gap: Using the success of 2D diffusion as the vision for 3D generation is a forceful analogy.
3D content creation has traditionally been a time-consuming and expertise-intensive process, requiring skilled artists and specialized software. Recent advances in 3D generation from 2D diffusion priors — exemplified by methods like DreamFusion, Magic3D, and Zero-1-to-3 — have demonstrated the possibility of automated 3D creation. However, these methods typically require per-scene optimization that takes minutes to hours, limiting their practicality. An alternative line of work uses feed-forward 3D generation, predicting 3D content in a single forward pass, but existing methods are limited to low-resolution outputs (typically 128x128), producing blurry and detail-lacking 3D models. LGM bridges this gap by enabling high-resolution, feed-forward 3D generation in seconds.
Paragraph function: Establishes the research landscape; the respective limits of the two existing method families.
Logical role: The dilemma of slow optimization vs. low-resolution feed-forward creates the space for LGM's positioning.
Technique / potential gap: The dichotomy is clean and effective, but the definition of "high resolution" (512 vs. 128) remains modest by 3D standards.
The practical impact of LGM's speed advantage extends beyond raw generation time. In a typical creative workflow, artists iterate through dozens of designs before finding the right one. With optimization-based methods requiring 25-40 minutes per attempt, a session of 20 iterations would take 8-13 hours. With LGM at 5 seconds per generation, the same 20 iterations take under 2 minutes — transforming 3D generation from an overnight batch process to an interactive creative tool. This speed also enables new applications such as real-time 3D asset generation for gaming, rapid prototyping for product design, and interactive 3D content creation for e-commerce, none of which were feasible with previous methods.
Paragraph function: Practical impact; how the speed advantage transforms creative workflows.
Logical role: The 8-13 hours vs. 2 minutes workflow contrast turns a technical metric into a qualitative change in user experience.
Technique / potential gap: Concrete application scenarios (gaming, e-commerce) make the commercial value of the contribution tangible.
2. Related Work
Score Distillation Sampling (SDS), introduced by DreamFusion, enables 3D generation by optimizing a NeRF representation to match the distribution of a pretrained 2D diffusion model. While producing impressive results, SDS-based methods require per-object optimization lasting 15-60 minutes and suffer from issues like the Janus problem (multi-face artifacts) and over-saturation. Feed-forward approaches like One-2-3-45, LRM, and InstantMesh predict 3D representations directly but are limited to triplane-NeRF outputs at 128x128 resolution. 3D Gaussian Splatting (3DGS) has emerged as a revolutionary 3D representation offering real-time rendering, explicit geometry, and efficient memory usage, making it an ideal candidate for feed-forward 3D generation.
Paragraph function: Literature context; the evolution from SDS optimization to feed-forward methods.
Logical role: Systematically enumerating each method's limits makes LGM's choice of Gaussian Splatting feel natural and well-grounded.
Technique / potential gap: The Janus problem and over-saturation are known pain points of SDS methods; naming them strengthens the case for an alternative.
The choice of 3D Gaussian Splatting as the output representation is motivated by several advantages over alternatives. Compared to NeRF-based volumetric representations, Gaussians offer explicit geometry (each Gaussian has a definite position and extent), real-time rendering through splatting rather than ray marching, and memory efficiency scaling linearly with the number of primitives. Compared to mesh-based representations, Gaussians avoid mesh extraction steps, which are often non-differentiable or require complex workarounds. The key insight enabling LGM is that Gaussian attributes can be predicted per-pixel from multi-view images and directly unprojected into 3D space, creating a simple and fully differentiable pipeline from image features to 3D representation.
Paragraph function: Representation comparison; Gaussians vs. NeRF vs. meshes.
Logical role: The systematic three-way comparison gives the Gaussian choice solid technical grounding.
Technique / potential gap: Full differentiability is the core premise for end-to-end training, giving the method a fundamental technical advantage.
3. Method
LGM's pipeline consists of two stages. First, given a text prompt or single image, we use a multi-view diffusion model to generate four orthogonal views of the target object. Second, these multi-view images are fed into our asymmetric U-Net backbone, which predicts multi-view Gaussian features — per-pixel Gaussian attributes including position offset, opacity, color (spherical harmonics), and covariance. These Gaussian features from all views are then unprojected into 3D space and fused to form a complete 3D Gaussian Splatting representation. The asymmetric U-Net design uses a heavy encoder for rich feature extraction and a lightweight decoder for fast Gaussian prediction, balancing quality and speed.
Paragraph function: Core method; the two-stage pipeline and the asymmetric U-Net design.
Logical role: Combining multi-view diffusion (a 2D prior) with Gaussian prediction (a 3D representation) is the method's central innovation.
Technique / potential gap: Four orthogonal views may not fully describe complex geometry; reconstruction quality in occluded regions may suffer.
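The two-stage pipeline described above can be sketched at the level of tensor shapes. Both networks are stubbed out with random outputs here; the function names and 256-resolution defaults are illustrative assumptions, not the paper's actual interfaces:

```python
import numpy as np

def multiview_diffusion(prompt: str, res: int = 256) -> np.ndarray:
    """Stage 1 stub: four orthogonal RGB views of the target object."""
    rng = np.random.default_rng(0)
    return rng.random((4, res, res, 3), dtype=np.float32)

def asymmetric_unet(views: np.ndarray) -> np.ndarray:
    """Stage 2 stub: per-pixel 14-channel Gaussian features for each view."""
    v, h, w, _ = views.shape
    rng = np.random.default_rng(1)
    return rng.random((v, h, w, 14), dtype=np.float32)

views = multiview_diffusion("a wooden chair")
features = asymmetric_unet(views)
# Fusion: flatten every view's pixels into a single Gaussian set.
gaussians = features.reshape(-1, 14)
print(gaussians.shape)  # (262144, 14) -> one Gaussian per pixel per view
```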
The multi-view Gaussian feature representation is central to LGM's efficiency. For each pixel in each view, the network predicts 14 Gaussian attributes: 3 for position offset (relative to the unprojected ray), 1 for opacity, 3 for RGB color (or higher-order spherical harmonics), and 7 for the covariance matrix (parameterized as a rotation quaternion and scale vector). This per-pixel formulation means that the number of output Gaussians is directly determined by the input resolution — at 512x512 with 4 views, the output contains approximately 1 million Gaussians, providing sufficient density for detailed 3D reconstruction. The fusion step simply concatenates all Gaussians from all views, with the position offsets ensuring that Gaussians from different views naturally align in 3D space.
Paragraph function: Technical detail; per-pixel Gaussian attribute prediction and the fusion mechanism.
Logical role: Precisely defining the 14 attributes makes the method fully reproducible; a density of roughly 1 million Gaussians ensures details can be expressed.
Technique / potential gap: The per-pixel formulation's simplicity is a strength, but a fixed Gaussian budget may waste capacity in sparse regions and fall short in complex ones.
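A minimal sketch of the per-pixel formulation: splitting the 14 attributes and unprojecting Gaussian centers into world space. The 3+1+3+4+3 split follows the text; the camera handling and all names are simplified assumptions rather than LGM's actual implementation:

```python
import numpy as np

def split_attributes(feat):
    """feat: (N, 14) -> offset (3), opacity (1), rgb (3), rotation
    quaternion (4), scale (3), matching the per-pixel layout in the text."""
    return (feat[:, :3], feat[:, 3:4], feat[:, 4:7],
            feat[:, 7:11], feat[:, 11:14])

def unproject(offsets, cam_to_world, ray_points):
    """Gaussian centers = per-pixel ray point + predicted offset,
    transformed rigidly into the world frame."""
    local = ray_points + offsets                      # (N, 3) camera-frame
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return local @ R.T + t                            # (N, 3) world-frame
```

Because each view's centers land in a shared world frame, the fusion step from the text reduces to concatenating the per-view Gaussian arrays.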
The training procedure uses differentiable Gaussian splatting rendering to supervise the predicted Gaussians. Given the fused Gaussian set, we render images from random viewpoints (distinct from the four input views) and compare them with ground-truth renderings using a combination of MSE loss, perceptual LPIPS loss, and a mask loss for object silhouettes. The asymmetric U-Net architecture uses a pretrained image encoder (from Stable Diffusion's VAE encoder) as the heavy encoder branch, providing rich semantic features, while the lightweight decoder consists of transposed convolutions for fast upsampling and Gaussian attribute prediction. Training is performed on the Objaverse dataset with approximately 80K high-quality 3D objects.
Paragraph function: Training strategy; loss functions, architecture provenance, and dataset.
Logical role: Supervising renders from held-out random viewpoints enforces 3D consistency; reusing the Stable Diffusion encoder lowers training cost.
Technique / potential gap: Objaverse's 80K objects are a reasonable training scale, but the domain gap between synthetic data and the real world may hurt generalization.
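The loss combination can be sketched as follows. The LPIPS term is a pluggable placeholder (a real implementation would call a pretrained perceptual network), and the weights are illustrative assumptions, not the paper's values:

```python
import numpy as np

def mse(a, b):
    """Pixel-wise reconstruction loss on rendered RGB."""
    return float(np.mean((a - b) ** 2))

def mask_loss(alpha_pred, alpha_gt):
    """Silhouette supervision: squared error on rendered alpha masks."""
    return float(np.mean((alpha_pred - alpha_gt) ** 2))

def total_loss(rgb_pred, rgb_gt, alpha_pred, alpha_gt,
               lpips_fn=lambda a, b: 0.0,  # plug in a real LPIPS model here
               w_lpips=1.0, w_mask=1.0):
    """MSE + perceptual + mask terms, evaluated on renders from random
    viewpoints held out from the four input views."""
    return (mse(rgb_pred, rgb_gt)
            + w_lpips * lpips_fn(rgb_pred, rgb_gt)
            + w_mask * mask_loss(alpha_pred, alpha_gt))
```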
A critical component of the pipeline is the multi-view diffusion model that generates the four input views. We leverage MVDream, a multi-view diffusion model fine-tuned from Stable Diffusion, which generates four orthogonal views (front, right, back, left) at 256x256 resolution conditioned on text or a single input image. The choice of orthogonal views provides maximum coverage of the object surface with minimum view overlap. For image-to-3D generation, we additionally employ an image-conditioned variant that takes a reference view and generates the remaining three views, ensuring consistency with the input image. The quality and consistency of these multi-view images directly determines the upper bound of LGM's output quality.
Paragraph function: Pipeline component; details of the multi-view diffusion model.
Logical role: Explicitly stating that multi-view quality caps output quality lays the groundwork for interpreting the ablations.
Technique / potential gap: Admitting the ceiling imposed by the multi-view diffusion model is honest analysis, and it also hints at a path for improvement.
4. Experiments
We evaluate LGM on both text-to-3D and image-to-3D generation tasks. For text-to-3D, LGM generates 3D objects in approximately 5 seconds on a single A100 GPU, compared to 25 minutes for DreamFusion and 40 minutes for Magic3D. Despite this dramatic speedup, user studies show that LGM is preferred over optimization-based methods in 67% of comparisons for overall quality. For image-to-3D, LGM achieves PSNR of 24.8 and SSIM of 0.87 on the GSO benchmark, outperforming existing feed-forward methods. The training at 512 resolution enables preservation of fine details such as text on objects, thin structures, and surface textures that are lost at lower resolutions.
Paragraph function: Core empirical results; the triple advantage of speed, quality, and resolution.
Logical role: The 5 seconds vs. 25-40 minutes contrast is the most striking number; the user preference confirms quality was not sacrificed.
Technique / potential gap: The 67% preference rate is impressive, but the user study's sample size and evaluation criteria warrant careful interpretation.
Ablation studies validate key design decisions. Comparing asymmetric U-Net vs. symmetric U-Net reveals that the asymmetric design achieves comparable quality (PSNR difference within 0.2 dB) while being 2.3x faster in inference. Increasing resolution from 256 to 512 improves PSNR by 1.8 dB with only 1.6x increase in inference time, demonstrating favorable quality-speed scaling. The multi-view diffusion model quality is shown to be the primary bottleneck: replacing the multi-view diffusion with ground-truth views improves PSNR by 4.2 dB, suggesting that LGM's reconstruction capability exceeds the quality of its input views. The choice of 4 orthogonal views balances coverage and generation speed — 8 views improve PSNR by only 0.3 dB while doubling inference time.
Paragraph function: Ablation analysis; architecture, resolution, and view count verified one by one.
Logical role: The ground-truth-view experiment (+4.2 dB) precisely locates the bottleneck in the multi-view diffusion rather than in LGM itself.
Technique / potential gap: Attributing the bottleneck to the multi-view diffusion model strategically reserves room for the LGM framework's future improvement.
Qualitative analysis reveals LGM's distinctive strengths and limitations. The model excels at generating clean, well-defined objects with smooth surfaces such as furniture, vehicles, and animals. Fine details like text labels, thin appendages (antennae, whiskers), and intricate surface patterns are well-preserved at 512 resolution — a significant improvement over 256-resolution baselines. However, LGM shows limitations with highly concave objects (e.g., bowls, cups) where the four orthogonal views fail to capture interior surfaces, and with objects containing thin holes or lattice structures where the Gaussian representation struggles to create sharp boundaries. These failure modes are largely attributable to the limited view coverage (4 views) rather than the reconstruction backbone, as confirmed by the oracle experiment with ground-truth views.
Paragraph function: Qualitative analysis; strength cases and failure modes.
Logical role: Attributing the failure modes to view count rather than the reconstruction method points the way for further improvement of the framework.
Technique / potential gap: Honestly surfacing failures on concave objects and lattice structures adds credibility to the paper.
5. Conclusion
We have presented LGM, a framework that achieves high-resolution, feed-forward 3D content generation in seconds. Through multi-view Gaussian features and an asymmetric U-Net backbone, LGM bridges the gap between optimization-based quality and feed-forward speed. Our work demonstrates that 3D Gaussian Splatting, combined with multi-view diffusion priors, provides a powerful and efficient paradigm for 3D generation.
Paragraph function: Summary of the paper; restates the speed-quality balance and the new paradigm.
Logical role: Closes by positioning "3D Gaussian Splatting + multi-view diffusion" as a new paradigm.
Technique / potential gap: As a notable ECCV 2024 paper, LGM marks a milestone in the speed breakthrough for 3D generation.
Future directions include improving the multi-view diffusion model to generate more consistent and detailed views, which our ablation shows is the primary quality bottleneck. Exploring adaptive Gaussian density — allocating more Gaussians to complex regions — could further improve detail while reducing total Gaussian count. Extending LGM to scene-level generation beyond single objects and incorporating mesh extraction from Gaussians for downstream applications in gaming and simulation represent natural next steps for this research direction.
Paragraph function: Outlook; bottleneck fixes and capability extensions.
Logical role: Improvement proposals targeted at the ablation results show a deep understanding of the method's limits.
Technique / potential gap: Scene-level generation and mesh extraction are key needs for real applications, but the technical challenges are substantial.
The broader significance of LGM lies in its demonstration that 3D content creation can be democratized through feed-forward generation. By reducing the barrier from minutes of GPU optimization to seconds of inference, LGM enables users without 3D modeling expertise to create 3D assets from simple text descriptions or single photographs. This has implications for e-commerce (product visualization from a single photo), education (interactive 3D learning materials), gaming (rapid asset prototyping), and augmented reality (instant 3D content for AR experiences). The combination of multi-view diffusion priors with Gaussian Splatting reconstruction establishes a modular pipeline where each component can be independently improved, suggesting a scalable path toward higher quality 3D generation.
Paragraph function: Broader impact; democratizing 3D creation.
Logical role: Elevates from technical achievement to societal impact, connecting academic research with industry applications.
Technique / potential gap: The modular pipeline lets each component be improved independently, a piece of long-term architectural wisdom.
Argument Structure Overview
Problem: 3D generation is slow (minutes) or low-resolution (128x128)
→ Thesis: Gaussian features enable second-scale, high-quality 3D
→ Method: multi-view diffusion + asymmetric U-Net
→ Evidence: 5-second generation, 67% preference rate
→ Conclusion: a new paradigm for 3D generation
Core claim (one sentence)
With multi-view Gaussian features and an asymmetric U-Net, high-resolution 3D models can be generated from text or images within 5 seconds, at quality preferred over optimization-based methods.
Strongest point of the argument
The 5 seconds vs. 25-40 minutes speed contrast and the 67% user preference rate directly demonstrate that speed and quality can coexist. The ablations precisely locate the multi-view diffusion model as the main bottleneck (+4.2 dB), charting a clear path for improving the framework.
Weakest point of the argument
Only four orthogonal views are used, which can degrade reconstruction in occluded regions of complex geometry, and the Gaussian Splatting representation remains limited for mesh conversion. Training data is restricted to synthetic objects (Objaverse), so generalization to real-world objects needs further validation.