3D-Generalist:
Vision-Language-Action Models for Crafting 3D Worlds

Anonymous

3D-Generalist is a generative graphics framework that composes multiple foundation models and modules to scale up 3D environments and data that are readily usable for synthetic data generation and embodied AI.

Here are some 3D environments crafted by 3D-Generalist, demonstrating controllable generation over
🎨 materials, 💡 lighting, 🏠 assets, and 📐 layout:

"An international restaurant with vibrant decor."

"A spacious home gym that is fully equipped."

"A bohemian art studio with a vintage easel."

3D-Generalist uses a diffusion model to generate panoramic images, which an inverse graphics pipeline then lifts into the structure of 3D environments.
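A common building block in panorama-based inverse graphics is mapping each pixel of an equirectangular image to a 3D viewing ray from the camera center, so that depth or structure estimates can be back-projected into the scene. The sketch below illustrates that mapping; the function name and conventions (y-up, longitude across the width) are illustrative assumptions, not the paper's actual pipeline.

```python
import math

def panorama_pixel_to_ray(u, v, width, height):
    """Map an equirectangular panorama pixel (u, v) to a unit 3D ray
    direction from the camera center. Longitude spans [-pi, pi] across
    the image width; latitude spans [pi/2, -pi/2] down the height.
    Conventions here are assumptions for illustration (y is up)."""
    lon = (u / width - 0.5) * 2.0 * math.pi   # azimuth angle
    lat = (0.5 - v / height) * math.pi        # elevation angle
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

Given a per-pixel depth estimate, scaling each ray by its depth yields a 3D point cloud of the room structure.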


"A chic clothing store with mannequins."

3D-Generalist employs a Vision-Language-Action (VLA) model to generate code to craft and modify all aspects (materials, lighting, assets, and layout) of the resulting 3D environments. The VLA is fine-tuned to optimize for prompt alignment via a self-improvement training loop.
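One simple way to realize such a self-improvement loop is best-of-n rejection sampling: the VLA proposes several candidate scene-editing programs, each is rendered and scored for prompt alignment (e.g., by a vision-language judge), and high-scoring pairs become fine-tuning data for the next round. The sketch below shows this pattern in skeleton form; the function names, threshold, and scoring interface are assumptions for illustration, not the paper's exact training recipe.

```python
def self_improvement_round(prompt, generate_code, render_and_score,
                           n_samples=4, threshold=0.7):
    """One hedged round of self-improvement: sample candidate
    scene-editing programs, score prompt alignment of the rendered
    results, and keep high-scoring (prompt, program) pairs as
    fine-tuning data, best first."""
    finetune_data = []
    for _ in range(n_samples):
        program = generate_code(prompt)            # VLA proposes editing code
        score = render_and_score(prompt, program)  # judge rates the render
        if score >= threshold:
            finetune_data.append(
                {"prompt": prompt, "program": program, "score": score})
    # highest-scoring samples supervise the next fine-tuning step
    return sorted(finetune_data, key=lambda d: d["score"], reverse=True)
```

The retained pairs can then be fed back into supervised fine-tuning, so each round's policy generates the data that trains the next.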

"A colorful arcade with neon signs."

3D-Generalist employs another VLA to handle diverse small object placement tasks with *unlabeled* 3D assets, capable of:

  • Densely populating surfaces
  • Adding assets between shelves
  • Stacking assets
"A modern bar with brick wall and marble bar counter."

"A quaint bookstore."
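Stacking and surface population both reduce to a core geometric step: resting one object's base on another's top face. The sketch below shows that step with axis-aligned bounding boxes (AABBs) under a y-up convention; it is a minimal illustration of the kind of reasoning involved, not the actual placement policy learned by the VLA.

```python
def stack_on_top(support_aabb, obj_size):
    """Place an object on top of a support's axis-aligned bounding box:
    center it in x/z and rest its base on the support's top face.
    AABBs are ((min_x, min_y, min_z), (max_x, max_y, max_z)), y up.
    Returns the placed object's AABB."""
    (sx0, sy0, sz0), (sx1, sy1, sz1) = support_aabb
    w, h, d = obj_size
    cx = (sx0 + sx1) / 2.0   # support center in x
    cz = (sz0 + sz1) / 2.0   # support center in z
    return ((cx - w / 2, sy1, cz - d / 2),
            (cx + w / 2, sy1 + h, cz + d / 2))
```

Chaining this call (placing each new object on the previous result) produces a stack; sampling x/z offsets instead of centering produces dense surface population.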


In the Omniverse ecosystem:

  • Omniverse Replicator enables large-scale synthetic data generation with domain randomization.

  • Isaac Lab provides readily available embodiments (e.g., humanoid robots) that can be used in these generated environments for robotic simulation.
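At its core, domain randomization means re-sampling scene parameters (lighting, materials, poses) per rendered frame so downstream models see varied data. The toy sketch below samples one randomized variant of a generated environment; the parameter names and ranges are illustrative assumptions and do not reflect Replicator's actual API.

```python
import random

def sample_randomization(seed):
    """Sample one domain-randomized variant of a generated environment.
    Parameter names and ranges are illustrative, not Replicator's API."""
    rng = random.Random(seed)  # seeded for reproducible variants
    return {
        "light_intensity": rng.uniform(500.0, 2000.0),      # arbitrary units
        "color_temperature": rng.uniform(2700.0, 6500.0),   # warm to daylight, K
        "wall_material": rng.choice(["plaster", "brick", "concrete", "wood"]),
    }
```

Rendering the same environment under many such samples yields a labeled dataset whose visual variation helps sim-to-real transfer.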