writing • physical AI series
part 1: Teaching Robots with the Real World
what 'real-world data' actually means across the companies building it. why quality and diversity matter more than volume. and why the supplier tier is consolidating in real time.

the verification economy

physical ai needs data. the disagreement is about what kind, where it comes from, and how to tell the good from the bad.

the most visible approach today involves collecting very large amounts of real-world robot interaction data, combined with techniques that let models learn from their own deployment failures. a handful of foundation-model labs have organised themselves around it, and a tier of data suppliers has emerged to feed them. it is also the approach that produces the most public milestones.

the same words, different bets

from the outside, the companies in this camp can sound like they're saying the same thing. the bets sit in materially different places.

Physical Intelligence and Generalist AI both go heavy on real-world interaction data, in different ways. Physical Intelligence trains the π family on thousands of hours of real-world data spanning eight robot embodiments, with a reinforcement-learning layer called RECAP that lets robots refine behaviour from their own failures. Generalist AI takes the human-capture path: lightweight ergonomic gloves and handheld devices that have produced 500,000+ hours of physical interaction data, with their GEN-1 model claiming 99% success on new tasks given one hour of robot-specific data.

Skild AI and Figure AI lean on auxiliary sources alongside real data. Skild pre-trains on internet-scale egocentric video plus trillions of synthetic experiences across 100,000 simulated robot bodies, then fine-tunes for each new task with under an hour of robot data. Figure's Project Go-Big captures passive egocentric video at environmental scale, layered with curated teleoperation sets.

Covariant, Tesla, and 1X represent the deployment-tied players: Covariant's warehouse fleet across 26 customers reports 99% reliability; Tesla and 1X build proprietary models tied to their own hardware.

across these players, 'real-world data' means genuinely different things: heterogeneous robot data across embodiments, lightweight human-captured interaction at volume, passive video plus massive simulation, curated humanoid demonstrations from live environments, or deployed warehouse hours compounding for years. each is a distinct bet on which axis of data matters most.

quality, scale, diversity

the conversation has moved off raw volume. three dimensions are now consistently cited: quality, scale, and diversity.

quality is increasingly framed as learnability: how much an episode teaches the model, rather than how precise the joint angles are. practitioners now routinely describe cases where adding more data degraded performance, because the additional volume averaged conflicting strategies or diluted long-tail knowledge that a smaller, more curated set had captured.

scale is starting to look like a pipeline problem rather than a collection problem. real-world deployments generate terabytes of sensor data daily, much of it captured but unusable: fleet logs sit in storage buckets, unindexed, unlabelled, with inconsistent schemas. the phrase 'trapped data' is starting to recur. the bottleneck is normalisation, indexing, and metadata tagging rather than collection.
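to make the pipeline framing concrete, here is a minimal sketch of the normalisation and indexing step. everything in it is an assumption for illustration: the `Episode` schema, the field names, and the alias table are hypothetical, and no lab's or supplier's actual pipeline is implied. it maps raw records with inconsistent keys onto one schema, indexes the usable ones, and counts the rest as trapped.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


# Hypothetical common schema; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Episode:
    episode_id: str
    robot: str
    task: str
    duration_s: float


# Hypothetical aliases: raw log keys that different sources might use
# for the same schema field.
ALIASES = {
    "episode_id": ("episode_id", "id", "run"),
    "robot": ("robot", "robot_type", "platform"),
    "task": ("task", "task_name", "skill"),
    "duration_s": ("duration_s", "duration", "len_sec"),
}


def normalise(raw: dict) -> Optional[Episode]:
    """Map one raw record onto Episode, or return None if a field is missing."""
    fields = {}
    for target, candidates in ALIASES.items():
        value = next((raw[key] for key in candidates if key in raw), None)
        if value is None:
            return None  # cannot be indexed yet: this record stays trapped
        fields[target] = value
    fields["duration_s"] = float(fields["duration_s"])
    return Episode(**fields)


def build_index(records):
    """Index usable episode ids by (robot, task) and count unusable records."""
    index = defaultdict(list)
    trapped = 0
    for raw in records:
        episode = normalise(raw)
        if episode is None:
            trapped += 1
        else:
            index[(episode.robot, episode.task)].append(episode.episode_id)
    return dict(index), trapped


# Three toy records from three imaginary sources with different key names.
logs = [
    {"id": "a1", "robot_type": "arm", "task_name": "pick", "duration": "12.5"},
    {"run": "b7", "platform": "arm", "skill": "pick", "len_sec": 8},
    {"id": "c3", "robot_type": "humanoid"},  # no task label, no duration
]
index, trapped = build_index(logs)
print(index, trapped)
```

in this toy run the first two records land under the same (robot, task) key despite their different field names, and the third is counted as trapped because it carries no task label or duration.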

diversity is what most labs now call out as the binding constraint. size and diversity are largely independent. 10,000 hours collected in three lighting conditions on five object types from one geography produces a model that fails the moment any of those conditions changes.
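a small sketch of why size and diversity are separate measurements. the metadata fields (`lighting`, `object`, `region`) and the coverage measure (a count of distinct condition combinations) are assumptions chosen for illustration, not a metric any lab is known to use. the first toy dataset mirrors the example above; the second is a tenth of the size but spans four regions.

```python
from itertools import product

# Hypothetical metadata fields attached to each block of recorded hours.
KEYS = ("lighting", "object", "region")


def make_dataset(lightings, objects, regions, total_hours):
    """Build a toy dataset with hours spread evenly over every combination."""
    combos = list(product(lightings, objects, regions))
    hours_each = total_hours / len(combos)
    return [dict(zip(KEYS, combo), hours=hours_each) for combo in combos]


def summarise(dataset):
    """Return (total hours, number of distinct condition combinations)."""
    total_hours = sum(row["hours"] for row in dataset)
    combos = {tuple(row[key] for key in KEYS) for row in dataset}
    return total_hours, len(combos)


lightings = ["l1", "l2", "l3"]
objects = ["o1", "o2", "o3", "o4", "o5"]

# 10,000 hours, three lighting conditions, five object types, one geography.
big = make_dataset(lightings, objects, ["r1"], 10_000)
# 1,000 hours over the same lighting and objects, but four geographies.
small = make_dataset(lightings, objects, ["r1", "r2", "r3", "r4"], 1_000)

big_hours, big_combos = summarise(big)
small_hours, small_combos = summarise(small)
print(round(big_hours), big_combos)
print(round(small_hours), small_combos)
```

under these assumptions the larger set covers 15 condition combinations and the smaller one covers 60, which is the sense in which volume alone says little about coverage.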

on the supplier side, the consistent pillars are signal-rich quality, distributed contributor networks for scale, and deliberate geographic and contextual diversity. some suppliers are deliberately capping single-geography dataset volume. others are buying their own hardware to run internal evaluations and reverse-engineer which dataset components actually moved their evals, because there are essentially no open-source physical ai eval frameworks today.

who is building what

three tiers have emerged. at the foundation-model layer: Physical Intelligence, Generalist AI, Skild AI, Figure, and Covariant are the most publicly visible, with Tesla and 1X building proprietary stacks. at the robot-builder layer: Figure and Apptronik lead humanoids; Boston Dynamics (now integrated with Google DeepMind's Gemini Robotics), Agility Robotics, Unitree, and UBTECH deploy internationally; and the industrial incumbents ABB, FANUC, YASKAWA, and KUKA, with over two million robots installed globally, are integrating foundation-model capabilities into commissioning workflows.

at the data-supplier layer: Scale AI, Lightwheel, Mecka AI, and Lumos in China serve as backbone suppliers. many labs are contracting 100+ suppliers to meet their scale, diversity, and quality requirements simultaneously.

the boundaries are softening. Lightwheel started as a simulation provider and now supplies egocentric data. DoorDash recently entered a direct egocentric-data partnership with a robotics company, cutting out intermediaries entirely.

where the open questions sit

three questions stand out.

first, whether scaling laws in robotics hold the way they do in language. the thesis here is not simply that more data wins: it is that more of the right data, with the right preprocessing, in the right mixture, wins. that's a much harder bet to scale than raw volume, and it is where the next round of differentiation between labs will probably show up.

second, whether real-world collection can reach the volumes labs say they need at the quality bar that matters. leading labs' estimates converge on 100 million to 1 billion hours of egocentric data over two to three years. at $15–50 per hour, that implies $1.5B–$50B in cumulative spend. today, only 10–15% of typical collected data actually ends up feeding training. improving that fraction is a pipeline question, not a collection one.
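the arithmetic behind those figures, as a short sketch. the hour, price, and utilisation ranges are the ones quoted above; the cost per usable hour is a derived illustration that assumes the quoted price applies to every collected hour, whether or not it ends up in training.

```python
def spend_range(hours_low, hours_high, rate_low, rate_high):
    """Cumulative spend bounds: cheapest small campaign to priciest large one."""
    return hours_low * rate_low, hours_high * rate_high


def cost_per_usable_hour(rate, usable_percent):
    """Price per hour that actually feeds training, given the usable share."""
    return rate * 100 / usable_percent


# 100 million to 1 billion hours at $15 to $50 per collected hour.
low, high = spend_range(100_000_000, 1_000_000_000, 15, 50)
print(f"${low / 1e9:.1f}B to ${high / 1e9:.0f}B")

# Derived illustration: 10-15% of collected hours reach training.
best = cost_per_usable_hour(15, 15)   # cheapest rate, highest utilisation
worst = cost_per_usable_hour(50, 10)  # priciest rate, lowest utilisation
print(f"${best:.0f} to ${worst:.0f} per usable hour")
```

on these numbers the effective price of an hour that reaches training sits between $100 and $500, several times the quoted collection rate, which is why the usable fraction matters as much as the headline rate.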

third, what happens to proprietary datasets as foundation models get better at learning from less. today, a lab's archive of real robot hours is a clear moat. if a future model needs a tenth of the data to reach the same capability, the archive loses its defensive value. every company investing heavily in proprietary data is implicitly betting the data-efficiency curve stays flat enough.

the closing window

the ecosystem today has 80+ data-supplier companies globally. aggregate demand is concentrated: 10 to 15 serious foundation-model buyers, each running campaign-based procurement of hundreds of thousands of hours at a time. labs are signing exclusive licences. practitioners expect the supplier tier to compress to five to ten companies within three years.

three things seem likely. the supplier tier concentrates hard, and companies already embedded in a lab's campaign pipeline become very difficult to displace. the infrastructure layer becomes its own category: standardised pipelines, metadata schemas, and indexing tools that turn trapped fleet data into learnable training signals. and geography matters differently at each training stage: pre-training rewards diversity, fine-tuning rewards local specificity. the most interesting global suppliers are building a deliberate geographic mix from day one.

the capability progress is real and the ecosystem is maturing fast. this approach is one of several running in parallel, and the alternatives are moving equally fast.
