Building robot simulators is a long-standing challenge. In traditional engines, physics must be hand-coded and 3D assets hand-modeled. NVIDIA's latest release takes a different approach: DreamDojo, a generalizable, open-source robot world model. Instead of relying on a physics engine, DreamDojo 'dreams' the results of robot actions directly in pixels.
Scaling Robotics with 45,000+ Hours of Human Experience
Data collection is the main obstacle for robot AI. Collecting robot-specific data is expensive and time-consuming. DreamDojo sidesteps this problem by learning from 45,000+ hours of egocentric human videos. This dataset, named DreamDojo-HV, is one of the largest pretraining corpora of its kind.
- It features 6,015 unique tasks across 1M+ trajectories.
- The data includes 9,869 unique scenes and 43,237 unique objects.
- Pretraining consumed roughly 100,000 NVIDIA H100 GPU hours to build the 2B and 14B model variants.
Humans have already mastered complex physical skills such as pouring liquids or folding clothing. DreamDojo uses this human data to give robots a 'common sense' understanding of how the world works.

Bridging the Gap with Latent Actions
Human videos carry no robot motor commands. To make these videos 'robot-readable,' NVIDIA's research team introduced continuous latent actions. The system extracts actions directly from pixels using a spatiotemporal Transformer VAE.
- The VAE encoder compresses a pair of consecutive frames into a 32-dimensional latent vector.
- This vector captures the most salient motion between the two frames.
- The design creates an information bottleneck that separates action from visual context.
- As a result, the model can learn physics from human videos and then transfer it to robots with different bodies (see the sketch below).
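To make the bottleneck idea concrete, here is a toy sketch in PyTorch. Only the two-frame input and the 32-dimensional latent come from the article; the encoder, layer sizes, and 64x64 frames are placeholder assumptions, not DreamDojo's actual architecture.

```python
import torch
import torch.nn as nn

class LatentActionVAE(nn.Module):
    """Toy stand-in for a latent action extractor: compresses two
    consecutive frames into a 32-dim 'action' vector (sizes assumed)."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Placeholder for the spatiotemporal Transformer encoder.
        self.encoder = nn.Sequential(
            nn.Flatten(),                     # (B, 2, 3, 64, 64) -> (B, 24576)
            nn.Linear(2 * 3 * 64 * 64, 512),
            nn.GELU(),
        )
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)

    def forward(self, frame_t, frame_t1):
        x = torch.stack([frame_t, frame_t1], dim=1)
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # The tight 32-dim bottleneck forces the latent to encode the
        # motion between frames rather than static visual context.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z

# Extract a latent action from two consecutive frames.
vae = LatentActionVAE()
f0, f1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(vae(f0, f1).shape)  # torch.Size([1, 32])
```

Because the latent is learned from pixels alone, the same interface works for any embodiment, human or robot.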

Architecture Upgrades for Better Physics
DreamDojo is built on the Cosmos-Predict2.5 latent video diffusion model, which uses the WAN2.2 tokenizer with a temporal compression ratio of 4. Three key architectural improvements were made:
- Relative actions: The model conditions on joint deltas rather than absolute poses, which generalizes more easily across different trajectory distributions.
- Chunked action injection: Four consecutive actions are injected into each latent frame, aligning the action chunk with the tokenizer's temporal compression ratio. This removes causal confusion between actions and frames.
- Temporal consistency loss: A new loss term matches the predicted frame-to-frame transitions to the ground truth, reducing artifacts while keeping objects consistent (a sketch of this kind of loss follows below).
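As an illustration of what such a loss can look like, the sketch below matches predicted frame-to-frame differences against the ground truth. This is a generic formulation assumed for clarity, not necessarily NVIDIA's exact loss:

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(pred_frames, gt_frames):
    """Match predicted frame-to-frame transitions ("velocities") to the
    ground-truth transitions. Illustrative only; the paper's loss may differ.

    pred_frames, gt_frames: (B, T, C, H, W) video tensors.
    """
    # Temporal differences approximate per-pixel motion between frames.
    pred_vel = pred_frames[:, 1:] - pred_frames[:, :-1]
    gt_vel = gt_frames[:, 1:] - gt_frames[:, :-1]
    return F.mse_loss(pred_vel, gt_vel)

# Usage: add this term to the base diffusion objective.
pred = torch.rand(2, 8, 3, 64, 64)
gt = torch.rand(2, 8, 3, 64, 64)
loss = temporal_consistency_loss(pred, gt)
```

Penalizing the transitions rather than individual frames discourages flicker while leaving static content alone.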
Distillation for Real-Time Interaction at 10.81 FPS
A simulator is only useful if it is fast, and standard diffusion models are too slow for real-time applications. NVIDIA addressed this with a Self-Forcing distillation pipeline.
- Distillation training ran on 64 NVIDIA H100 GPUs.
- The 'student' model reduces denoising from 35 steps down to 4 steps.
- The final model achieves a real-time speed of 10.81 FPS.
- The system remains stable during continuous rollouts of up to 60 seconds (a schematic rollout loop is sketched below).
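To see why cutting 35 denoising steps to 4 matters for frame rate, here is a schematic autoregressive rollout loop. The `student` denoiser signature and the noise schedule are hypothetical placeholders, not DreamDojo's API:

```python
import torch

SIGMAS = [1.0, 0.5, 0.25, 0.1]  # placeholder 4-step noise schedule

@torch.no_grad()
def rollout(student, context, actions):
    """Autoregressively 'dream' one latent frame per action chunk.

    student: hypothetical few-step denoiser f(x, sigma, context, action);
    the inner 4-step loop replaces the teacher's 35 denoising steps.
    """
    frames = []
    for action in actions:
        x = torch.randn_like(context[-1])   # start each frame from noise
        for sigma in SIGMAS:                # 4 iterations instead of 35
            x = student(x, sigma, context, action)
        context.append(x)                   # feed the frame back as context
        frames.append(x)
    return frames
```

Because each generated frame is fed back as context, the distilled student must stay stable over hundreds of autoregressive steps, which is what the 60-second rollout figure measures.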
Unlocking Downstream Applications
DreamDojo’s speed, accuracy and versatility enable AI engineers to develop a wide range of applications.
1. Reliable Policy Evaluation
Testing robot policies on real hardware is risky and slow. DreamDojo instead serves as a high-fidelity benchmarking simulator.
- Its simulated success rates show a Pearson correlation of r = 0.995 with real-world results.
- Its MMRV (Mean Maximum Rank Violation) is only 0.003 (see below for how such agreement metrics are computed).
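For reference, this is how such a correlation is computed; the success-rate numbers below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented example: per-policy success rates measured in the world
# model vs. on the real robot.
sim_success  = np.array([0.82, 0.55, 0.91, 0.30, 0.67])
real_success = np.array([0.80, 0.52, 0.93, 0.28, 0.70])

r, _ = pearsonr(sim_success, real_success)
print(f"Pearson r = {r:.3f}")  # DreamDojo reports r = 0.995 on its benchmark
```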
2. Model-Based Planning
Robots can use DreamDojo to 'look ahead': they simulate several candidate actions and execute the best one.
- This improved real-world success rates by 17%.
- It doubled the success rate compared to random action sampling (a minimal planning loop is sketched below).
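A minimal version of this look-ahead loop, with hypothetical `world_model` and `score` functions standing in for the actual components:

```python
def plan_best_action(world_model, score, obs, candidates):
    """Sampling-based planning: roll out each candidate action sequence
    in the world model and return the best-scoring one.

    world_model(obs, actions) -> predicted video   (hypothetical API)
    score(video)              -> scalar task score (hypothetical)
    """
    best_actions, best_value = None, float("-inf")
    for actions in candidates:  # e.g. sampled from the policy, not at random
        predicted_video = world_model(obs, actions)
        value = score(predicted_video)
        if value > best_value:
            best_actions, best_value = actions, value
    return best_actions
```

Drawing candidates from a trained policy rather than at random is what the reported 2x gain over random sampling reflects.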
3. Live Teleoperation
Developers can teleoperate robots virtually in real time. NVIDIA's team demonstrated this with a PICO VR controller driving the model on a local desktop with an NVIDIA RTX 5090 GPU. This enables safe, rapid, and accurate data collection.
Model Summary
| Metric | DreamDojo-2B | DreamDojo-14B |
| --- | --- | --- |
| Physics Correctness | 62.50% | 73.50% |
| Action Following | 63.45% | 72.55% |
| FPS (Distilled) | 10.81 | N/A |
NVIDIA has released the model weights, evaluation benchmarks, and training code. With this open-source release, you can post-train DreamDojo on your own robot data.
What you need to know
- Scale and diversity: DreamDojo is pretrained on DreamDojo-HV, the largest egocentric human video dataset to date, with 45,711 hours of footage spanning 6,015 unique tasks and 9,869 unique scenes.
- Unified latent action proxy: Instead of relying on action labels, the model extracts continuous latent actions from human videos with a spatiotemporal Transformer VAE, yielding a hardware-agnostic control interface.
- Architecture and training upgrades: The model achieves high fidelity and precise controllability through relative action transformations, chunked action injection, and a specialized temporal consistency loss.
- Distillation for real-time performance: A Self-Forcing distillation pipeline accelerates the model to 10.81 FPS, enabling interactive use cases such as live teleoperation with stable rollouts of up to one minute.
- Reliable downstream tasks: As an accurate simulator, DreamDojo supports policy evaluation with a 0.995 Pearson correlation to real-world success rates and improves real-world performance by 17% when used for model-based planning.
Check out the Paper and Code for more details.


