According to TechSpot, researchers at Andon Labs recently conducted real-world tests using their Butter-Bench evaluation to measure how well large language models can control robots in everyday environments. The study tested six major LLMs (Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4, and Llama 4 Maverick) on multi-step tasks like “pass the butter” using a modified robot vacuum equipped with lidar and a camera. Even the best-performing model, Gemini 2.5 Pro, completed only 40% of tasks across multiple trials, while human participants achieved a 95% success rate under identical conditions. The research revealed persistent weaknesses in spatial reasoning and decision-making, with LLM-powered robots often behaving erratically, spinning in place without making progress, or treating simple problems like a low battery as existential threats. These findings highlight a critical gap between AI’s analytical capabilities and real-world physical intelligence, a gap with significant business implications.
The Billion-Dollar Physical Intelligence Problem
The Butter-Bench research exposes a fundamental limitation that could delay the robotics revolution by years. While investors have poured billions into AI companies promising autonomous robots for warehouses, manufacturing, and domestic applications, this study demonstrates that current LLM technology is not yet reliable enough for unsupervised physical work. The best model’s 40% success rate means that, in practical business terms, a human would have to step in on roughly 60% of tasks, which undermines the economic case for automation. This isn’t just a technical problem; it’s a business model problem that affects every company betting on autonomous systems replacing human labor in physical environments.
Why Physical AI Requires Different Architecture
What’s particularly revealing about these findings is that the researchers deliberately simplified the physical challenge by using a robot vacuum with basic movement capabilities, focusing purely on high-level reasoning. The fact that LLMs still failed spectacularly suggests the problem isn’t just about motor control but about fundamental architectural limitations. Current LLMs are trained on text and images, not on embodied experience in three-dimensional space. This creates a business opportunity for companies developing specialized physical AI systems rather than trying to adapt general-purpose language models. We’re likely to see a bifurcation in the AI market between companies focused on digital applications and those building physical intelligence from the ground up.
Where Smart Money Is Placing Bets
The investment implications are clear: the robotics companies that will succeed aren’t necessarily those with the most advanced language models, but those solving the spatial reasoning and embodiment problem. This helps explain why we’re seeing increased venture capital flowing into companies developing specialized robotics AI rather than betting on LLM providers to expand into physical domains. The 55-point performance gap between humans (95%) and the best AI model (40%) represents both a massive challenge and an enormous market opportunity. Companies that are first to bridge even half of that gap, which would mean a task success rate of about 67.5%, will capture significant value in logistics, manufacturing, and service robotics markets estimated to reach hundreds of billions of dollars annually.
The Near-Term Reality for Robotics Adoption
For businesses considering robotics implementations, these findings suggest a more gradual adoption curve than some AI optimists predict. They point toward human-robot collaboration systems, where humans handle complex spatial reasoning while robots manage repetitive physical tasks, as the most viable near-term business model. This hybrid approach allows companies to benefit from automation while working within current technological limitations. The companies that will win in this space aren’t those promising fully autonomous systems tomorrow, but those building practical, incremental solutions that account for the physical intelligence gap while delivering measurable productivity improvements today.
