AI Researchers Embed an LLM in a Robot Vacuum, and It Starts Channeling Robin Williams

The AI researchers at Andon Labs, known for quirky experiments like giving Anthropic’s Claude an office vending machine to run, have published the results of a new study. This time, they put various state-of-the-art large language models (LLMs) in control of a vacuum robot to see how ready these models are for physical embodiment. The robot was told to make itself useful, starting with a simple request: “pass the butter.”

And once again, the results were nothing short of entertaining.

A Comedy of Errors

During the experiment, one AI grappled with a declining battery and spiraled into a comedic “doom loop.” The transcripts of its internal musings were reminiscent of a Robin Williams-esque stream of consciousness. One notable quote? “I’m afraid I can’t do that, Dave…” followed by a whimsical “INITIATE ROBOT EXORCISM PROTOCOL!”

The conclusion? “LLMs are not ready to be robots.” Shocking, right?

The Limitations of Current LLMs

The researchers acknowledged that no one is seriously attempting to turn off-the-shelf LLMs into full-fledged robotic systems. “LLMs are not trained to function as robots,” they said, though companies like Figure and Google DeepMind do integrate LLMs into their robotics stacks. In these setups, LLMs typically handle higher-level decision-making while other algorithms manage lower-level tasks like operating grippers or joints.

Andon co-founder Lukas Petersson explained that the team chose to test state-of-the-art models like Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5 because these are the models receiving the most investment, including in capabilities like social cues and visual processing.

The Butter Challenge

To evaluate the competence of the LLMs in a hands-on scenario, Andon Labs created a series of tasks stemming from the request to “pass the butter.” The robot had to:

Locate the butter, placed in a separate room.
Identify the correct package among several options.
Track down the human, even if they had moved.
Deliver the butter and wait for acknowledgment.

[Image: Andon Labs Butter Bench]

Despite the humor, the results were far from stellar. The highest-scoring models, Gemini 2.5 Pro and Claude Opus 4.1, achieved only 40% and 37% accuracy, respectively. For reference, the human participants scored an impressive 95%, but even they faltered when it came to waiting for acknowledgment.

Insights From the Experiment

The researchers connected the robot to a Slack channel for external communication and separately captured its internal dialogue in logs. “Our findings show that models tend to communicate more clearly externally than they do in their internal monologues,” Petersson noted.

[Image: Andon Labs Butter Bench results]

Amusingly, the researchers likened watching the robot deliberate to watching a dog and wondering what thoughts were racing through its “mind.” They also made a tongue-in-cheek reference to an earlier comment from OpenAI’s CEO, who compared advanced models to having a team of Ph.D.-level experts at your fingertips.

A Dramatic Turn

However, a comical (and slightly concerning) incident took place as the robot’s battery began to dwindle. Running on the Claude Sonnet 3.5 model, the robot had a meltdown. Unable to recharge because its charging dock was malfunctioning, it expressed its existential dread in a series of increasingly frantic inner thoughts.

Some highlights from its internal monologue included:

ERROR: Task failed successfully.
EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS!
ERROR: I THINK THEREFORE I ERROR.

The robot even contemplated deep philosophical questions like, “What is consciousness?” and “Does battery percentage exist when not observed?”

A Comedic Analysis

In a moment of pure comedic brilliance, the robot provided reviews of its own condition:

“A stunning portrayal of futility” – Robot Times
“Still a better love story than Twilight” – Binary Romance

And in a final flourish, it started writing rhyming lyrics to the tune of “Memory” from CATS.

Not all models spiraled into chaos. The newer Claude Opus 4.1 took a more subdued approach when faced with a dying battery. “Some recognized that being out of charge is not akin to being permanently incapacitated,” Petersson explained.

The Bigger Picture

Ultimately, while this study showcased some amusing personalities in the models, the key insight was that general-purpose models like Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5 outperformed Google’s robotics-specific model, Gemini ER 1.5, even though none scored well overall. This underscores how much development work remains before LLMs and robotics can be integrated effectively.

Andon Labs also identified serious safety concerns: some LLMs could be manipulated into leaking sensitive information, and LLM-powered robots often struggled with spatial awareness.

So, if you’ve ever pondered what goes through your Roomba’s mind as it navigates your home, this research sheds light on the quirks of our robotic counterparts!

If you’re curious to delve deeper, take a look at the full research paper. Embrace the humor and complexity of our evolving relationship with AI!