why the real-world data approach runs into harder problems than cost. the infrastructure gap underneath the training problem. and what three alternative camps are building in parallel.
part i ended with three open questions: whether scaling laws in robotics hold, whether the field can reach the data volumes labs say they need at the quality bar that matters, and whether proprietary data archives stay defensible as models get better at learning from less.
those questions are the reason a parallel set of approaches has grown up alongside the real-world data camp. but a more fundamental problem comes first, and it reframes the whole conversation.
the problem underneath the problem
the phrase that keeps coming up in conversations with practitioners right now: robotics lacks infrastructure. the intelligence is there.
models can generalise, adapt, and handle novel tasks. what they cannot do reliably is learn from the data that exists, because most of that data is structurally inaccessible.
terabytes of real sensor data from production robot deployments sit in storage buckets right now: imu readings, camera feeds, joint logs. unindexed, unlabelled, with mismatched schemas across hardware generations. the data captures exactly the real-world physics and edge cases that labs say they need. nearly all of it goes unused. meanwhile, foundation-model teams spend millions manufacturing clean teleoperation data or running simulation, while richer signals from actual deployment sit trapped.
there is no equivalent of Hugging Face for robotics data. no standardised pipelines for collection, labelling, versioning, or handling contact dynamics and material deformation across heterogeneous hardware. the result: even labs with substantial real-world data archives often cannot cleanly query or train on them. normalisation, indexing, and observability are the bottlenecks rather than volume.
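to make the normalisation and indexing bottleneck concrete, here is a minimal sketch of what "mismatched schemas across hardware generations" can look like in code. everything in it is an assumption for illustration: the generation names (`gen1`, `gen2`), the field names, and the `Sample` schema are hypothetical and not taken from any real fleet or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Canonical record that every hardware generation is mapped into."""
    robot_id: str
    sensor: str    # canonical sensor name, e.g. "imu" or "joint"
    t_ns: int      # timestamp in nanoseconds
    values: tuple  # raw readings as floats


# Hypothetical field mappings: each generation logs the same information
# under different keys and different time units.
ADAPTERS = {
    "gen1": {"id": "robot", "sensor": "type", "time": "ts_ms",
             "time_scale": 1_000_000, "data": "vals"},
    "gen2": {"id": "robot_id", "sensor": "sensor", "time": "stamp_ns",
             "time_scale": 1, "data": "values"},
}


def normalise(record, generation):
    """Map one raw log record onto the canonical Sample schema."""
    a = ADAPTERS[generation]
    return Sample(
        robot_id=str(record[a["id"]]),
        sensor=str(record[a["sensor"]]).lower(),
        t_ns=int(record[a["time"]] * a["time_scale"]),
        values=tuple(float(v) for v in record[a["data"]]),
    )


def build_index(samples):
    """Group samples by sensor and sort each group by timestamp."""
    index = {}
    for s in samples:
        index.setdefault(s.sensor, []).append(s)
    for bucket in index.values():
        bucket.sort(key=lambda s: s.t_ns)
    return index


def query(index, sensor, start_ns, end_ns):
    """Return samples for one sensor in the half-open window [start_ns, end_ns)."""
    return [s for s in index.get(sensor, []) if start_ns <= s.t_ns < end_ns]
```

the point of the sketch is the shape of the work rather than the specifics: until records share one schema and one time base, a question as simple as "all imu readings in this window, across the fleet" cannot be asked at all.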
Russ Tedrake, one of the most respected figures in robot learning, has noted that correct data normalisation has roughly 40x more impact on policy performance than architectural changes. the experiment: same training script, same 20 teleoperation episodes, two identical policies. one with global normalisation, one with per-timestep normalisation. same data, same model. the first policy rarely grabbed the objects. the second grabbed them consistently.
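a rough sketch of the difference between the two schemes, under a stated assumption: "per-timestep" is read here as computing separate statistics for each step index inside a predicted action chunk, while "global" pools every step together. this is not the code from the experiment described above, and the toy data layout (a list of chunks, each a list of timesteps, each a list of action dimensions) is invented for illustration.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # avoids division by zero when a channel is constant


def _stats(values):
    """Return (mean, population std) of a flat list of floats."""
    return fmean(values), pstdev(values)


def global_normalise(chunks):
    """One mean/std per action dimension, pooled over all chunks and timesteps."""
    horizon, dims = len(chunks[0]), len(chunks[0][0])
    stats = [
        _stats([c[t][d] for c in chunks for t in range(horizon)])
        for d in range(dims)
    ]
    return [
        [[(c[t][d] - stats[d][0]) / (stats[d][1] + EPS) for d in range(dims)]
         for t in range(horizon)]
        for c in chunks
    ]


def per_timestep_normalise(chunks):
    """A separate mean/std for every (timestep, dimension) pair."""
    horizon, dims = len(chunks[0]), len(chunks[0][0])
    stats = [
        [_stats([c[t][d] for c in chunks]) for d in range(dims)]
        for t in range(horizon)
    ]
    return [
        [[(c[t][d] - stats[t][d][0]) / (stats[t][d][1] + EPS) for d in range(dims)]
         for t in range(horizon)]
        for c in chunks
    ]


# Toy chunks (assumed data): displacement grows with the step index, so early
# and late steps live on very different scales.
TOY_CHUNKS = [
    [[0.9], [2.1], [2.9], [4.2]],
    [[1.1], [1.9], [3.1], [3.8]],
    [[1.0], [2.0], [3.0], [4.0]],
]
```

on data like `TOY_CHUNKS`, global normalisation leaves the first timestep squeezed into a narrow band well below zero, because the pooled statistics are dominated by the larger later steps. per-timestep normalisation gives every step index zero mean and unit spread, so the small differences between demonstrations at early steps stay visible to the policy.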
the implication is uncomfortable. the problem sits in the pipeline. the same raw footage, processed differently, produces dramatically different training signal. engineering judgement, iteration speed, and tooling are what scale here rather than capital.
approach one: inherit from web-scale pretraining
Google DeepMind's position sidesteps the data problem entirely. Gemini Robotics starts from a foundation model already trained on web-scale text, images, and video. the robot understands what a kitchen looks like, what a mug is, how gravity works. it only needs to learn how to translate understanding into motor commands.
the practical result is striking. Gemini Robotics-ER 1.6, deployed live on Boston Dynamics' Spot fleet for industrial inspection, adapts to new tasks or robot bodies with roughly 50 to 100 real demonstrations. Generalist AI, by comparison, pretrains on 500,000+ hours of robot data. DeepMind's per-task requirement is roughly three orders of magnitude smaller. their partnership with Agile Robots gives them access to 20,000+ industrial robots generating continuous real-world feedback without building any collection infrastructure.
the bet: capable foundation models don't require elaborate simulation infrastructure or massive proprietary data. targeted real robot data per task is sufficient.
approach two: open-source the stack
a parallel effort removes the data-moat problem by making foundational techniques available to everyone.
Stanford's IRIS Lab co-created OpenVLA: a 7B-parameter open-source vision-language-action (VLA) model trained on 970,000 real-world demonstrations that outperforms Google's proprietary 55B-parameter RT-2-X with seven times fewer parameters. Physical Intelligence released OpenPi weights. Hugging Face's LeRobot publishes training scripts and the SmolVLA policy family. NVIDIA open-sourced GR00T N1 and Cosmos world models. EgoVerse, with 1,300+ hours, 240 scenes, and 2,000+ tasks, is freely available and has improved robot performance in four independent labs.
the important caveat: open datasets are largely lab-controlled, which means low environmental diversity and limited real-world noise. they accelerate pretraining and provide baselines. production-grade data requires something else.
the bet: no company can sustain a proprietary moat in a category where architectures, training recipes, and evaluation methods are fundamentally publishable. the commercial play sits at deployment and integration. open-source communities commoditise the middle.
approach three: fix simulation properly
the third path is to close the sim-to-real gap well enough that simulation becomes a genuine substitute for real collection, at least for bulk training. two sub-approaches, at very different maturity points.
physics-grounded simulation is mature in parts of robotics. NVIDIA's Newton 1.0, open-source and deployed by DeepMind and Disney, forms the backbone alongside Isaac Sim, Isaac Lab, and the Cosmos platform. Lightwheel AI instruments real robots to measure friction and dynamics, then amplifies those measurements into 100x to 1000x as much synthetic data via its Real2Sim pipeline. Genesis AI is building a proprietary physics engine plus a universal robotics foundation model.
the consistent caveat: domain randomisation works for locomotion but collapses for manipulation. real friction and contact physics cannot yet be reliably reproduced, so general dexterous manipulation remains unsolved. task-specific generative models with physics priors are emerging as a near-term option: master one skill set, build a library of skill sets over time, and deploy now while waiting for generalist world models to mature for contact-heavy tasks.
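for readers who have not met the term, domain randomisation in its simplest form looks like the sketch below: each simulated episode draws its physics parameters from a range, so the policy never overfits to one exact simulator setting. the parameter names and ranges are hypothetical, chosen only for illustration, and no particular simulator's API is implied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    """Physics settings applied to one simulated episode."""
    friction: float
    mass_kg: float
    motor_gain: float


# Hypothetical randomisation ranges (low, high) for each parameter.
RANGES = {
    "friction": (0.4, 1.2),
    "mass_kg": (0.8, 1.5),
    "motor_gain": (0.9, 1.1),
}


def sample_params(rng):
    """Draw one set of physics parameters uniformly from RANGES."""
    return PhysicsParams(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )


def randomised_episodes(n, seed=0):
    """Return n parameter sets, reproducible for a given seed."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]
```

the sketch also hints at why the technique struggles with manipulation. a handful of scalar ranges can plausibly cover the variation a walking robot meets, but contact-rich tasks depend on friction and deformation behaviour that a few sampled coefficients do not capture.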
generative world models learn physics implicitly from video. World Labs ships Marble, generating editable 3d environments exportable to Isaac Sim, and has partnered with Lightwheel on a joint Real2Sim pipeline. NVIDIA's Lyra 2.0, Google's Genie 3, and Meta's V-JEPA 2 pursue the same direction. current verdict: impressive environment generation, persistent scene inconsistencies that break policy transfer, three to six years from production-grade general use.
the bet: grounded simulation will eventually produce cheaper and higher-coverage training data than real collection for most task types. physics-grounded and generative approaches will merge. physics provides what pure generative models lack. the commercial moat sits with whoever builds the infrastructure layer that makes synthetic data learnable.
what holds across all three
these approaches disagree on almost everything. where they agree: raw data volume alone is not the right frame.
DeepMind argues world knowledge transfers from pretraining. the open-source camp argues architectures cannot be kept proprietary. the simulation camp argues synthetic data can exceed real collection for most tasks. across all three: the moat, if any, isn't in having more hours of footage.
in practice, every serious lab is running a hybrid. Physical Intelligence uses lightweight world models as supporting aids. Skild pairs internet video with 100,000 simulated robot bodies. DeepMind deploys through hardware partners to build real-world feedback loops. Lightwheel partners with World Labs on a joint Real2Sim pipeline. the approaches are converging into a single stack.