The future of robotics will be shaped by AI agents that can perceive, think, and act in real-world settings. A central challenge is building scalable, reliable robot manipulation: the ability to control objects through selective contact. Progress has come from analytic techniques, model-based methods, and data-driven approaches, yet most systems still split data collection, training, and evaluation into separate stages, each relying on custom setups, manual curation, and task-specific tuning. This fragmentation adds friction, slows iteration, obscures failure patterns, and hinders reproducibility, pointing to the need for a unified framework for learning and assessment.
Research in robot manipulation is shifting from analytical models to neural world models that learn dynamics directly from sensory inputs, operating over pixels and latent spaces. Large-scale video generation models can produce realistic visuals, but they lack the long-horizon temporal consistency and multi-view reasoning needed for control. Vision-language-action models follow instructions but are limited by imitation-based learning, which hampers error recovery and planning. Policy evaluation also remains difficult: physics simulators require extensive tuning, and real-world testing is expensive. Existing metrics emphasize visual quality over task success, underscoring the need for benchmarks that better reflect real-world manipulation performance.
Genie Envisioner, developed by AgiBot Genie, NUS LV-Lab, and BUAA, is a unified platform that brings video-generative world modeling, simulation, policy learning, and evaluation into a single framework. At its core is GE-Base, a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world manipulation tasks. GE-Act translates these representations into precise action trajectories, while GE-Sim serves as a fast, action-conditioned, video-based simulator. EWMBench evaluates visual fidelity, physical consistency, and instruction-action alignment. Together, these components make Genie Envisioner a scalable, memory-aware foundation for embodied intelligence that transfers across robot types and tasks.
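To make the division of labor between these components concrete, here is a purely illustrative Python sketch of how such a pipeline could be wired together. The class names, method signatures, and placeholder computations are assumptions for exposition, not the released Genie Envisioner API.

```python
# Illustrative sketch only: hypothetical interfaces for the Genie Envisioner
# components described above. Names and placeholder computations are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class LatentRollout:
    """Latent video trajectory imagined by the world model."""
    frames: np.ndarray       # (T, D) latent features over T future steps
    instruction: str

class GEBase:
    """Instruction-conditioned video world model (placeholder dynamics)."""
    def imagine(self, observation: np.ndarray, instruction: str, horizon: int = 16) -> LatentRollout:
        # Stand-in for multi-view video diffusion: simply tile the current observation.
        frames = np.repeat(observation[None, :], horizon, axis=0)
        return LatentRollout(frames=frames, instruction=instruction)

class GEAct:
    """Decodes latent rollouts into low-level action chunks."""
    def decode(self, rollout: LatentRollout, action_dim: int = 14) -> np.ndarray:
        # Stand-in for the action decoder: a fixed linear read-out over latents.
        _, latent_dim = rollout.frames.shape
        read_out = np.zeros((latent_dim, action_dim))
        return rollout.frames @ read_out  # (T, action_dim) action trajectory

class GESim:
    """Action-conditioned neural simulator: predicts the next observation."""
    def step(self, observation: np.ndarray, action: np.ndarray) -> np.ndarray:
        return observation  # placeholder transition

# One imagine-then-act cycle (assumed latent size and dual-arm action dimension).
obs = np.zeros(128)
world, policy, sim = GEBase(), GEAct(), GESim()
plan = world.imagine(obs, "fold the towel")
for action in policy.decode(plan):
    obs = sim.step(obs, action)
```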
The design of Genie Envisioner centers on three core components plus a benchmark suite. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes; it learns latent trajectories that capture how scenes evolve under given commands. GE-Act then maps these video latents into executable action signals through a flow-matching decoder, enabling precise, low-latency control even on robots not seen during training, as sketched below. GE-Sim repurposes GE-Base's generative capacity as an action-conditioned neural simulator, supporting video-based, closed-loop rollouts at speeds well beyond real hardware. Finally, EWMBench assesses the whole system on video realism, physical consistency, and the alignment between instructions and resulting actions.
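The paper describes GE-Act as a flow-matching decoder over GE-Base's latent video representations. The sketch below shows a generic flow-matching training objective for an action decoder under assumed shapes (a 54-step chunk of 14-dimensional dual-arm actions and a pooled latent feature); the network, conditioning scheme, and dimensions are illustrative assumptions, not the actual GE-Act implementation.

```python
# Minimal flow-matching sketch for an action decoder. Shapes, network size,
# and conditioning are illustrative assumptions.
import torch
import torch.nn as nn

CHUNK, ACT_DIM, LATENT_DIM = 54, 14, 256  # assumed sizes

class VelocityNet(nn.Module):
    """Predicts the flow velocity for noisy action chunks, conditioned on video latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK * ACT_DIM + LATENT_DIM + 1, 512),
            nn.SiLU(),
            nn.Linear(512, CHUNK * ACT_DIM),
        )

    def forward(self, noisy_actions, t, video_latent):
        x = torch.cat([noisy_actions.flatten(1), video_latent, t[:, None]], dim=1)
        return self.net(x).view(-1, CHUNK, ACT_DIM)

def flow_matching_loss(model, actions, video_latent):
    """Linear-interpolation flow matching: regress the velocity (x1 - x0)."""
    noise = torch.randn_like(actions)                        # x0
    t = torch.rand(actions.shape[0], device=actions.device)  # random time in [0, 1]
    x_t = (1 - t[:, None, None]) * noise + t[:, None, None] * actions
    target_velocity = actions - noise
    pred_velocity = model(x_t, t, video_latent)
    return ((pred_velocity - target_velocity) ** 2).mean()

# Example training step on random data (real data would come from robot episodes).
model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
actions = torch.randn(8, CHUNK, ACT_DIM)      # ground-truth action chunks
video_latent = torch.randn(8, LATENT_DIM)     # pooled world-model features (assumed)
loss = flow_matching_loss(model, actions, video_latent)
loss.backward()
opt.step()
```

The appeal of this style of decoder is that it regresses the straight-line velocity transporting noise to the ground-truth action chunk, so at inference time an action chunk can be generated with only a few integration steps, which is consistent with the low-latency control the system targets.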
Genie Envisioner demonstrates strong performance across a range of real-world and simulated robotic manipulation tasks. GE-Act generates control quickly, producing 54-step action trajectories in about 200 ms, and consistently outperforms leading vision-language-action baselines in both step-wise and end-to-end success rates. It also adapts to new robot embodiments such as Agilex Cobot Magic and Dual Franka with only about an hour of platform-specific data. GE-Sim delivers high-fidelity, action-conditioned video simulation for scalable closed-loop testing, while EWMBench confirms GE-Base's advantage in temporal alignment, motion consistency, and scene stability. A minimal closed-loop evaluation sketch follows below.
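As a rough illustration of what closed-loop testing in a neural simulator looks like, the sketch below rolls a dummy policy out inside a stand-in simulator and scores the trajectory with a toy smoothness metric. The function names and scoring rule are assumptions; EWMBench's actual metrics for temporal alignment, motion consistency, and scene stability are defined in the paper.

```python
# Illustrative closed-loop evaluation sketch in the spirit of GE-Sim + EWMBench.
# All names and the scoring rule are assumptions, not the released tooling.
import numpy as np

def rollout(policy, simulator, initial_obs, instruction, steps=10):
    """Collect a trajectory of (observation, action) pairs in simulation."""
    obs, trajectory = initial_obs, []
    for _ in range(steps):
        action = policy(obs, instruction)
        obs = simulator(obs, action)
        trajectory.append((obs.copy(), action.copy()))
    return trajectory

def motion_consistency(trajectory):
    """Toy stand-in for a motion-consistency score: penalize jerky actions."""
    actions = np.stack([a for _, a in trajectory])
    jerk = np.abs(np.diff(actions, n=2, axis=0)).mean() if len(actions) > 2 else 0.0
    return 1.0 / (1.0 + jerk)

# Dummy policy and simulator just to make the loop runnable.
policy = lambda obs, instr: 0.1 * np.ones(14)
simulator = lambda obs, act: obs + 0.01 * act.sum()
traj = rollout(policy, simulator, np.zeros(128), "stack the cups")
print("motion consistency:", motion_consistency(traj))
```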
In summary, Genie Envisioner is a unified platform for dual-arm robotic manipulation that combines simulation, policy learning, and evaluation within a single video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model that captures the spatio-temporal and semantic patterns of real robot interaction. GE-Act builds on this foundation, converting those representations into precise, adaptable action plans for new robot types with minimal retraining. GE-Sim provides high-fidelity, action-conditioned simulation for policy refinement, and EWMBench adds rigorous evaluation of realism and alignment. Real-world tests show the system outperforming strong baselines, pointing toward a practical foundation for general-purpose, instruction-driven embodied intelligence.
Check out the Paper and the GitHub Page for tutorials, code, and notebooks.

