
On June 17, Alibaba’s Qwen team released Qwen-Robot Suite, a full-stack embodied-intelligence suite made up of three foundation models: Qwen-RobotNav (mobile navigation), Qwen-RobotManip (robotic manipulation), and Qwen-RobotWorld (physics-world simulation). All three models are open source.
Qwen-RobotNav: Five tasks unified, 15.6 million training samples
Qwen-RobotNav unifies five tasks (instruction following, goal-point navigation, object search, target tracking, and autonomous driving) behind a parameterized interface that exposes a token budget, time decay, and per-view weighting. The model is trained on 15.6 million samples. On the VLN-CE RxR benchmark (vision-and-language navigation in continuous environments), its success rate reaches 76.5%; on EVT-Bench (mobile target tracking), it reaches 90%.
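The announcement names the three interface parameters but does not say how they interact. As a rough illustration only, the Python sketch below shows one plausible reading: a fixed token budget is divided among camera frames in proportion to a per-view weight multiplied by an exponential time decay. The `Frame` class, the `allocate_tokens` function, and the scoring formula are all hypothetical and are not taken from Qwen's code or documentation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Frame:
    view: str  # camera name, e.g. "front" (hypothetical label)
    age: int   # steps since capture; 0 is the newest frame


def allocate_tokens(
    frames: List[Frame],
    token_budget: int,
    time_decay: float,
    view_weights: Dict[str, float],
) -> List[int]:
    """Split token_budget across frames (assumed scheme, not Qwen's).

    Each frame is scored as view_weights[view] * time_decay ** age,
    so older frames and low-priority views receive fewer tokens.
    """
    scores = [
        view_weights.get(f.view, 1.0) * time_decay ** f.age for f in frames
    ]
    total = sum(scores)
    if total <= 0:
        return [0] * len(frames)
    # Floor each share so the allocation never exceeds the budget.
    return [int(token_budget * s / total) for s in scores]


frames = [Frame("front", 0), Frame("front", 1), Frame("rear", 0)]
shares = allocate_tokens(
    frames,
    token_budget=700,
    time_decay=0.5,
    view_weights={"front": 2.0, "rear": 1.0},
)
print(shares)
```

Under these assumptions the newest front-camera frame gets the largest share, and a one-step-old front frame is cut to the same share as a fresh rear frame.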
Qwen-RobotManip: 38,100 hours of training data, ranked first on RoboChallenge Table30-v1
Different robots represent actions in fundamentally different ways: Franka arms use joint angles, ALOHA dual arms use gripper positions and orientations, and humanoids use full-body coordinates. Qwen-RobotManip is built to work across these action representations. Alibaba synthesized about 38,100 hours of training data from open-source robotics datasets and human videos, without relying on private data collection. The model ranks first on the RoboChallenge Table30-v1 benchmark, outperforming prior methods by 20%.
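The article does not describe how Qwen-RobotManip reconciles these action formats. One common approach in cross-embodiment work, shown here purely as an assumption, is to pad each robot's action vector to a shared width and carry a mask marking which entries are real. The embodiment names, the dimensions in `ACTION_DIMS`, and both helper functions are illustrative placeholders, not Qwen's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical action widths per embodiment; the numbers are illustrative only.
ACTION_DIMS: Dict[str, int] = {"franka": 7, "aloha": 14, "humanoid": 32}
MAX_DIM = max(ACTION_DIMS.values())


def to_shared_action(
    embodiment: str, action: List[float]
) -> Tuple[List[float], List[int]]:
    """Pad a robot-specific action to MAX_DIM and return (padded, mask).

    mask[i] is 1 where padded[i] holds a real action value and 0 where
    it is padding, so a loss function can ignore the padded slots.
    """
    expected = ACTION_DIMS[embodiment]
    if len(action) != expected:
        raise ValueError(
            f"{embodiment} expects {expected} values, got {len(action)}"
        )
    pad = MAX_DIM - expected
    return list(action) + [0.0] * pad, [1] * expected + [0] * pad


def from_shared_action(embodiment: str, padded: List[float]) -> List[float]:
    """Recover the robot-specific action by dropping the padding."""
    return padded[: ACTION_DIMS[embodiment]]


padded, mask = to_shared_action("franka", [0.1] * 7)
print(len(padded), sum(mask))
```

The mask lets a single model train on all embodiments at once while each robot still receives an action of its native size at inference time.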
Qwen-RobotWorld: 8.6 million video-text pairs; first on EWMBench and DreamGen Bench
Qwen-RobotWorld is a language-conditioned video world model that treats natural language as a universal action interface: the instruction “Pick up the red cup and pour water onto the flowers” works across gripper, self-driving, and mobile navigation agents. The training corpus includes 8.6 million video-text pairs and 200 million frames, spanning manipulation (5.9 million samples, 1,300+ skills, 20+ morphologies), autonomous driving (Waymo, NVIDIA PhysicalAI-AD), indoor navigation, and cross-embodiment transfer across 14 types of robotic arms. It ranks first on both EWMBench and DreamGen Bench, with a perfect score on physical consistency tests.
Qwen official statement: software models, not physical robots; pricing and timeline not yet announced
According to Qwen’s official blog, Qwen-Robot Suite is a set of software models rather than a physical robot, and real-world deployment in homes is still several years away. Beyond pilot plans, Alibaba has not yet announced pricing, timelines, or customer lists. Western labs pursuing similar goals, such as Google DeepMind, Nvidia, Figure, and Physical Intelligence, are reported to focus mostly on single capabilities like navigation or manipulation rather than a unified, modular suite.
Frequently asked questions
What scenarios do the three models in Qwen-Robot Suite target, respectively?
According to Qwen’s official blog, the three models are positioned as follows: Qwen-RobotNav handles mobile navigation (five unified tasks); Qwen-RobotManip handles robotic manipulation across different robots (compatible with different action representation schemes); Qwen-RobotWorld handles physics-world simulation (language as a universal action interface). The three models are independent, and together they form a full embodied-intelligence stack.
Did Qwen itself describe the release as the “Android moment” for robotics?
Yes. “The Android moment in robotics” is the description Alibaba’s Qwen team used at release, meaning that Qwen-Robot Suite is positioned as an operating-system-layer platform rather than hardware. This is Qwen’s own market-positioning statement, not a third-party assessment.
Is Qwen-Robot Suite open-sourced to the public?
According to Qwen’s official blog, all three models are released as open source. Alibaba’s training data comes from open-source robotics datasets and human videos, without relying on private data collection; the open-source strategy is one of the core messages of this release.