LegoGPT: Generating Physically Stable and Buildable LEGO Designs from Text

Abstract

We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts.

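Paragraph function: Overview of the whole paper. It concisely summarizes the problem (text-to-LEGO generation), the core approach (an autoregressive LLM plus physical constraints), and the experimental results.

Logical role: The abstract condenses a three-part problem-solution-evidence argument. It first claims the pioneering position of being "the first", then describes the dataset and model architecture, and finally backs the claims with experimental results.

Argumentation technique / potential weakness: The wording "the first" makes a strong novelty claim. Physical stability and buildability are both presented as selling points, but the abstract does not mention the limits on grid resolution or brick types, so readers must check the actual feasible scope later in the paper.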

1. Introduction

While 3D generative models have advanced significantly, applying them to create real-world objects remains challenging due to assembly and physical stability constraints. LEGO design generation serves as an accessible benchmark with standardized components, yet existing methods do not account for the unique physical constraints and assembly requirements of real-world object designs. The gap between digital 3D generation and physically realizable construction motivates our work.

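Paragraph function: Establishes the research setting, moving from 3D generation in general to the specific requirements of LEGO generation.

Logical role: The starting point of the argument chain. It first acknowledges the capabilities of 3D generation, then points out the gap in physical feasibility, which justifies LEGO as the research vehicle.

Argumentation technique / potential weakness: The standardized nature of LEGO is used to justify the choice of domain, but it also limits the method's generality: whether the framework extends to assembly tasks with non-standardized components remains unclear.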

We propose LegoGPT, which formulates LEGO generation as a sequential token prediction problem. Each brick placement is encoded as a token, and a fine-tuned LLaMA-3.2-1B-Instruct model learns to predict the next brick given the current assembly state and a text description. During inference, we integrate physical stability checks and a physics-aware rollback mechanism that rejects unstable placements and backtracks to find stable alternatives, ensuring all generated designs are both structurally sound and buildable in the real world.

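Paragraph function: Presents the solution, outlining LegoGPT's core architecture and inference strategy.

Logical role: Following the problem statement in the previous paragraph, this paragraph makes the turn from problem to solution. Reframing 3D assembly as a sequence generation problem for an LLM is the key innovative leap of the paper.

Argumentation technique / potential weakness: Serializing brick placements as tokens is a clever analogy, but the choice of sequence order (for example bottom-to-top versus front-to-back) can significantly affect generation quality, and the authors need to justify this design choice. The rollback mechanism improves stability but may substantially increase inference time.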

2. Related Work

Text-to-3D generation has progressed rapidly with methods such as DreamFusion, which introduced Score Distillation Sampling, but these approaches produce continuous 3D representations that cannot be directly manufactured. Autoregressive 3D modeling treats shape generation as token sequences, yet prior work focuses on voxels or point clouds rather than structured assemblies with physical constraints. In the LEGO domain, classical legolization algorithms convert 3D meshes to brick layouts through voxelization and merging, but they do not incorporate text conditioning or learned generative priors. Physics-aware generation methods enforce stability in rigid body systems, yet none addresses the combined challenge of text-guided design and physical buildability in brick structures.

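Paragraph function: Literature review that systematically covers four related areas and points out the shortcoming of each.

Logical role: Four parallel capability-versus-limitation threads build up an unsolved intersection: text guidance + physical feasibility + brick structures. LegoGPT positions itself to fill exactly this intersection.

Argumentation technique / potential weakness: The shortcomings of the four subfields are neatly aligned so that the reader naturally concludes an integrated solution is needed. Whether each area is cited sufficiently (especially the engineering literature on LEGO assembly optimization) deserves further checking.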

3. StableText2Lego Dataset

We construct StableText2Lego, a dataset containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions. The construction pipeline consists of four stages: (1) converting 3D shapes from ShapeNetCore into LEGO structures through voxelization; (2) applying a split-and-remerge legolization algorithm to produce valid brick layouts; (3) assessing physical stability through simulation; and (4) generating captions using GPT-4o from multi-view renderings. This pipeline ensures that every sample in the dataset is both a valid LEGO assembly and physically stable.

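Paragraph function: The data foundation, detailing the dataset's scale, sources, and construction pipeline.

Logical role: As the cornerstone of a data-driven method, this paragraph's credibility directly affects the overall argument. The four-stage pipeline demonstrates rigorous data quality control.

Argumentation technique / potential weakness: Using GPT-4o to generate captions is efficient but may introduce bias: the model may produce near-identical captions for some shapes, reducing textual diversity. In addition, ShapeNetCore covers a limited set of object categories, which may constrain the diversity of generated designs.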
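To make stages (1) and (2) of the pipeline more concrete, here is a minimal Python sketch of turning occupied voxels into bricks. It is a deliberately simplified greedy merge along one axis, not the split-and-remerge legolization algorithm named in the section. The function name `merge_voxels_into_bricks`, the brick lengths in `ALLOWED_LENGTHS`, and the `(length, x, y, z)` output tuple are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical lengths (in studs) of 1-stud-wide bricks; an assumption,
# not the brick library used in the paper. Ordered longest first.
ALLOWED_LENGTHS = (8, 6, 4, 2, 1)


def merge_voxels_into_bricks(voxels):
    """Greedily cover occupied voxels with 1xL bricks running along x.

    voxels: iterable of (x, y, z) integer cells.
    Returns a list of (length, x, y, z) tuples, where (x, y, z) is the
    lowest-x cell of each brick.
    """
    # Group x coordinates by (layer, row) so each row is merged on its own.
    rows = defaultdict(list)
    for x, y, z in set(voxels):
        rows[(z, y)].append(x)

    bricks = []
    for (z, y), xs in sorted(rows.items()):
        xs.sort()
        i = 0
        while i < len(xs):
            # Find the end of the run of consecutive x values starting at i.
            j = i
            while j + 1 < len(xs) and xs[j + 1] == xs[j] + 1:
                j += 1
            run_start, run_len = xs[i], j - i + 1
            # Cover the run with the longest allowed bricks first.
            offset = 0
            while offset < run_len:
                remaining = run_len - offset
                length = next(n for n in ALLOWED_LENGTHS if n <= remaining)
                bricks.append((length, run_start + offset, y, z))
                offset += length
            i = j + 1
    return bricks
```

A layout produced this way would still need the stability assessment of stage (3). A greedy merge makes no attempt to interlock bricks across layers, which is one reason a more careful legolization step matters.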

4. Method

LegoGPT fine-tunes LLaMA-3.2-1B-Instruct on the StableText2Lego dataset. Each LEGO structure is serialized as a sequence of brick tokens, where each token encodes the brick type, position (x, y, z), and orientation. The model is trained with standard next-token prediction loss, learning to generate brick-by-brick given the text prompt and the previously placed bricks as context. This formulation naturally captures spatial dependencies between bricks and enables the model to learn structural patterns from the training data.
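The next-token prediction loss mentioned above can be written out as follows. The notation is an assumption for illustration: $c$ is the text prompt, $b_t$ is the $t$-th brick token, $T$ is the number of bricks in the structure, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(b_t \mid c,\, b_1, \dots, b_{t-1}\right)
```

If a brick is spelled out over several sub-word tokens, the sum would run over those tokens instead, with the same conditioning on the prompt and all earlier tokens.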

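Paragraph function: The core method, describing how LEGO generation is turned into a sequence prediction problem for a language model.

Logical role: This paragraph establishes the core "brick as token" analogy, which lets the whole LEGO assembly problem use mature LLM training infrastructure.

Argumentation technique / potential weakness: Choosing a LLaMA model with only 1B parameters shows awareness of efficiency, but it raises a question: could a larger model generate more complex designs? The choice of serialization order (such as bottom to top) implies a strong inductive bias, and the authors need to show that this order is justified.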
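The serialization described in this section can be sketched in Python as follows. The text format (`HxW (x,y,z)`, one brick per line), the `Brick` class, and the bottom-to-top sort key are assumptions for illustration. The section only states that each token encodes brick type, position, and orientation. Here orientation is encoded by the order of the two footprint dimensions, so `2x4` and `4x2` are the same brick rotated by 90 degrees.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Brick:
    h: int  # footprint extent along x, in studs
    w: int  # footprint extent along y, in studs
    x: int  # grid position of the brick's lowest-coordinate corner
    y: int
    z: int  # layer index, 0 = ground


# Hypothetical line format: "2x4 (3,1,0)"
_LINE = re.compile(r"^(\d+)x(\d+) \((\d+),(\d+),(\d+)\)$")


def serialize(bricks):
    """Order bricks bottom-to-top (then by y, then x); emit one line per brick."""
    ordered = sorted(bricks, key=lambda b: (b.z, b.y, b.x))
    return "\n".join(f"{b.h}x{b.w} ({b.x},{b.y},{b.z})" for b in ordered)


def parse(text):
    """Inverse of serialize: turn brick lines back into Brick objects."""
    bricks = []
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m is None:
            raise ValueError(f"malformed brick line: {line!r}")
        bricks.append(Brick(*map(int, m.groups())))
    return bricks
```

The sort key is where the ordering choice raised in the note above becomes explicit: changing `(b.z, b.y, b.x)` changes the sequence the model is trained to continue.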

4.1 Physics-Aware Inference

During inference, we integrate two mechanisms to enforce physical validity. First, a brick-by-brick rejection sampling step checks whether each predicted brick satisfies assembly constraints (no collision, valid connection). Second, a physics-aware rollback mechanism monitors the cumulative structural stability of the growing assembly: if adding a brick causes the structure to become physically unstable (e.g., cantilevered without support), the system backtracks several steps and resamples alternative brick placements. This combination ensures that the final design is not only geometrically valid but also physically buildable — it can stand on its own without external support.

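Paragraph function: The key innovation, describing the mechanisms that enforce physical constraints at inference time.

Logical role: This paragraph answers the potential objection that a language model alone cannot guarantee physical feasibility, showing how physical laws are integrated on the fly during generation.

Argumentation technique / potential weakness: The dual mechanism of rejection sampling plus rollback is conceptually complete, but the number of rollback steps and the trigger condition are hyperparameters: too much rollback can cause generation to fail or degenerate into simple structures, while too little cannot guarantee stability. The authors need to report statistics on rollback frequency and success rate.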
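The two inference-time mechanisms can be illustrated with a small Python sketch. Everything here is an assumption for illustration rather than the authors' implementation: bricks are `(h, w, x, y, z)` tuples, `is_stable` is a crude support-ratio proxy standing in for a real physics analysis, the random sampler stands in for the language model, and `max_tries`, `rollback_steps`, and `max_rollbacks` are hypothetical hyperparameters of the kind discussed in the note above.

```python
import random

GRID = 20  # grid size per axis; the source lists 20x20x20 as a limitation


def cells(brick):
    """Return the (x, y) footprint cells and the layer z of a brick."""
    h, w, x, y, z = brick
    return {(x + dx, y + dy) for dx in range(h) for dy in range(w)}, z


def is_valid(brick, placed):
    """Assembly constraints: inside the grid, no collision, valid connection."""
    h, w, x, y, z = brick
    if min(x, y, z) < 0 or x + h > GRID or y + w > GRID or z >= GRID:
        return False
    foot, _ = cells(brick)
    touches = z == 0  # bricks on the ground layer count as connected
    for other in placed:
        o_foot, o_z = cells(other)
        if o_z == z and foot & o_foot:
            return False  # collision within the same layer
        if abs(o_z - z) == 1 and foot & o_foot:
            touches = True  # overlaps a brick directly above or below
    return touches


def is_stable(placed):
    """Crude proxy for a physics check (an assumption): every elevated brick
    needs at least half of its footprint resting on the layer below."""
    for brick in placed:
        foot, z = cells(brick)
        if z == 0:
            continue
        below = set()
        for other in placed:
            o_foot, o_z = cells(other)
            if o_z == z - 1:
                below |= o_foot
        if 2 * len(foot & below) < len(foot):
            return False
    return True


def generate(sample_next, max_bricks, max_tries=50, rollback_steps=3,
             max_rollbacks=20):
    """Brick-by-brick generation; rollback_steps must be at least 1."""
    placed, rollbacks = [], 0
    while len(placed) < max_bricks:
        # Rejection sampling: resample until a candidate meets the constraints.
        for _ in range(max_tries):
            candidate = sample_next(placed)
            if is_valid(candidate, placed):
                break
        else:
            return placed  # no valid candidate found; stop early
        placed.append(candidate)
        if not is_stable(placed):
            if rollbacks == max_rollbacks:
                placed.pop()  # drop the offending brick and give up
                return placed
            rollbacks += 1
            # Rollback: drop the offending brick and a few earlier ones.
            del placed[-rollback_steps:]
    return placed


def make_random_sampler(seed=0):
    """Stand-in for the language model: proposes random 2x4 or 4x2 bricks."""
    rng = random.Random(seed)

    def sample_next(placed):
        h, w = rng.choice([(2, 4), (4, 2)])
        return (h, w, rng.randrange(GRID), rng.randrange(GRID), rng.randrange(3))

    return sample_next
```

Calling `generate(make_random_sampler(), max_bricks=10)` returns a list of bricks in which every prefix passed both checks. In this sketch the cost of rollback shows up directly as discarded bricks and extra calls to `sample_next`, which is the overhead the note above asks the authors to quantify.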

5. Experiments

We evaluate LegoGPT on quantitative metrics including physical stability rate, text-design alignment (CLIP score), and aesthetic quality. Compared to baseline methods that apply legolization to text-to-3D outputs, LegoGPT achieves significantly higher stability rates while maintaining comparable or better visual quality. Ablation studies confirm that both the physics-aware rollback and the rejection sampling components contribute meaningfully to the final design quality. We further demonstrate real-world buildability through robotic assembly and manual construction, validating that generated designs can be physically assembled using standard LEGO bricks.

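Paragraph function: Provides experimental evidence, supporting the method's effectiveness with multi-faceted metrics and physical validation.

Logical role: The empirical support spans three levels: (1) quantitative comparison; (2) ablation studies; (3) real-world validation. The robotic assembly demonstration is especially persuasive, taking the paper from theory into practice.

Argumentation technique / potential weakness: The real-world assembly demonstration is strong evidence, but one should confirm that the complexity of the demonstrated designs is representative: if only simple structures are shown, the persuasiveness drops sharply. The suitability of CLIP score as a text-alignment metric in the 3D domain is also debatable.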

6. Discussion and Conclusion

LegoGPT demonstrates that large language models can be effectively repurposed for physically grounded 3D design tasks by formulating brick assembly as token prediction. The combination of learned generative priors with physics-based constraints during inference ensures that outputs are not merely plausible-looking but genuinely buildable. Limitations include the fixed grid size (20x20x20), the restricted brick library, and the computational cost of the rollback mechanism. Future work may explore larger design spaces, more diverse brick types, and integration with robotic assembly systems for fully automated text-to-physical-object pipelines.

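Paragraph function: Concludes the paper by restating the core contribution, acknowledging limitations, and looking ahead.

Logical role: The conclusion echoes the abstract, closing the argumentative loop. Candidly listing limitations (grid size, brick types) strengthens academic credibility.

Argumentation technique / potential weakness: The discussion of limitations is candid but somewhat conservative: the 20x20x20 grid limit means large or finely detailed designs cannot be generated, which is a major practical constraint. The vision of "fully automated text-to-physical-object pipelines" is ambitious, but the distance from current capabilities to that goal remains large.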

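Argument Structure Overview

Problem
3D generative models cannot guarantee
physically feasible brick assemblies

→

Claim
LLM sequence prediction combined with
physical constraints can generate LEGO designs

→

Evidence
Training on a 47K dataset
Stability rate and quality exceed baselines

→

Rebuttal
The physics-aware rollback mechanism ensures
structural stability and buildability

→

Conclusion
LLMs can be effectively repurposed for
physically grounded 3D design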

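Authors' core claim (in one sentence)

By serializing LEGO assembly into tokens and guiding autoregressive inference with physical constraints, a large language model can generate physically stable brick designs from text prompts that can actually be built.

Strongest point of the argument

The innovative "brick as token" analogy: it elegantly maps the 3D assembly problem onto a language model's sequence prediction task and, together with the physics-aware rollback at inference time, strikes an excellent balance between generative capability and physical feasibility. The real-world robotic assembly demonstration adds persuasiveness beyond quantitative metrics.

Weakest point of the argument

The tight restriction of the design space: the 20x20x20 grid size and the limited brick library greatly restrict the complexity and diversity of the designs that can be generated. The computational overhead of the rollback mechanism is also not fully quantified: for complex prompts, inference may need many rollbacks before finding a stable solution, so practicality remains to be verified.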