Many Small Steps for Robots, One Giant Leap for Mankind | Composable and distributed systems group
Mon, 2026-01-26
Sharing our experimental call summaries.
AI-generated digests of Yak Collective study groups.
Key resources discussed
Article link: https://www.notboring.co/p/robot-steps
Framing the Article and Standard Bots’ Approach
The group discussed an essay (likely by or about Standard Bots / Standard Robots) that argues for an incremental, vertically integrated path to robotics, rather than big “AGI-like” leaps. The core elements of the Standard Bots approach, as reconstructed from the discussion:
Pre-train robots on a set of “basic physical skills”:
perception, grasping, force control, and sequencing.
For each real-world deployment, perform additional fine-tuning and reinforcement learning using data from that specific environment and hardware.
Emphasize real-world execution over simulation or video-only learning.
Position this as “small steps” and an alternative to grand, general-purpose robotics plays.
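A minimal sketch of what this two-stage recipe could look like in code. Everything here (the linear stand-in for a policy, the data shapes, the training loop) is our own illustration, not Standard Bots' actual stack:

```python
import numpy as np

# Illustrative pretrain-then-fine-tune recipe. A linear map stands in
# for a real skill policy; shapes and hyperparameters are hypothetical.
rng = np.random.default_rng(0)

def fit_policy(obs, actions, w_init=None, lr=1e-2, steps=500):
    """Least-squares 'policy' fit by gradient descent.
    obs: (N, d_obs), actions: (N, d_act)."""
    w = np.zeros((obs.shape[1], actions.shape[1])) if w_init is None else w_init.copy()
    for _ in range(steps):
        grad = obs.T @ (obs @ w - actions) / len(obs)
        w -= lr * grad
    return w

# Stage 1: pre-train on pooled data covering "basic physical skills"
# (perception features mapped to grasp/force/sequence commands).
obs_pre = rng.normal(size=(1000, 8))
act_pre = obs_pre @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(1000, 3))
w_pretrained = fit_policy(obs_pre, act_pre)

# Stage 2: fine-tune on a small batch logged from one specific
# deployment, with its own hardware, sensors, and environment.
obs_dep = rng.normal(size=(50, 8))
act_dep = obs_dep @ rng.normal(size=(8, 3))
w_deployed = fit_policy(obs_dep, act_dep, w_init=w_pretrained, steps=200)
```

The design point is only that stage 2 starts from the pretrained weights and adapts to one deployment's logged data rather than training from scratch.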
Several people noted that the essay reads, in part, like a pitch for Standard’s particular strategy and business model (vertical integration, proprietary hardware and torque sensing, closed-loop data from deployed robots) rather than a neutral survey of the field. The “incremental vs big leaps” framing, in particular, was viewed as somewhat forced: the group saw real companies doing both—betting on long-term waves (the “tide coming in”) while iterating incrementally on boats.
Where the group converged:
Focusing on domain-specific deployments and gradually generalizing out is a sensible and historically successful pattern (evolution, children’s learning, software specialization).
Vertical integration in robotics is unsurprising and perhaps necessary at this stage, given hardware, sensing, and control constraints.
There is real value in fine-tuning on the actual deployed hardware with its specific actuators, sensors, and idiosyncrasies.
Where there was skepticism:
The essay’s implication that “no one else” is doing incremental work was not considered credible.
The incremental vs big-leap dichotomy was seen as rhetorically convenient but not well justified.
The downplaying of simulation and video-based learning felt overstated, given both human analogies (sports, games) and current ML practice.
LLMs, Language, and the “Unreasonable Effectiveness” Question
Several participants pushed back on the essay’s apparent distance from LLMs, arguing that recent experience shows language is surprisingly powerful for reasoning about physical systems.
Key points:
Unreasonable effectiveness of language: Over the last few years, LLMs have been much better at physics-related reasoning than many expected. Even if they are not solving high-precision PDEs, they often produce qualitatively correct physical reasoning.
Context as the central challenge: In day-to-day use of Claude and similar models, the main work is curating context:
Too little context → model misunderstands the task.
Too much → performance degrades, costs rise, and the model may become distracted.
Practically, this leads to strategies like “wiping memory” and reseeding the conversation, and constantly managing what the model “knows” at any moment.
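A minimal sketch of the "wipe and reseed" pattern. The token budget, the characters-per-token heuristic, and the summarize helper are all illustrative assumptions, not any particular API:

```python
# Hypothetical "wipe memory and reseed" context manager for chat sessions.

MAX_CONTEXT_TOKENS = 8000  # illustrative budget

def count_tokens(messages):
    # crude proxy: roughly 4 characters per token
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # stand-in for an LLM call that distills the history into
    # task state, decisions made, and open questions
    return "Summary of prior conversation: ..."

def maybe_reseed(messages):
    """If the history is too large, replace it with a seed summary,
    keeping only the system prompt plus the distilled context."""
    if count_tokens(messages) <= MAX_CONTEXT_TOKENS:
        return messages
    seed = {"role": "user", "content": summarize(messages)}
    return [messages[0], seed]
```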
Humans as context carriers / swarm controllers:
One hypothesis raised: future human–robot work might look like humans serving as high-level controllers of large robot swarms.
Humans maintain rich, persistent context and intent.
Robots handle specialized, lower-level execution.
Analogy: an operator controlling a drone swarm or a human issuing natural-language directives to a fleet of task-specific machines.
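A toy sketch of that fan-out, with a keyword match standing in for the LLM that would actually parse a directive (the fleet IDs and routing table are invented for illustration):

```python
# Hypothetical: one human directive routed to task-specific machines.
FLEET = {
    "grasp": "arm-01",      # manipulation arm
    "transport": "agv-02",  # mobile base
    "inspect": "drone-03",  # aerial camera
}

def parse_directive(text):
    """Stand-in for an LLM mapping natural language to a task type."""
    for task in FLEET:
        if task in text:
            return task
    return None

def dispatch(text):
    task = parse_directive(text)
    if task is None:
        return "escalate to human"  # unknown intent: keep the human in the loop
    return f"send '{text}' to {FLEET[task]}"

print(dispatch("grasp the red bin and stage it for pickup"))
```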
Concrete example: A participant asked ChatGPT how it would break an unusually tough cracker. The model responded with strategies such as:
Look for creases in the cracker and break along them.
Apply more force or apply force at an angle.
This was seen as reasonably close to human intuitive strategies, suggesting that LLMs can capture a kind of “folk physics” and procedural common sense relevant to physical tasks.
Where the group landed:
LLMs are not yet “good at physics” in the strict numerical sense, but they seem surprisingly capable at qualitative, intuitive reasoning about physical interactions.
The missing piece is not language-level reasoning but the bridge from LLM-generated plans to real-world control—how to map natural-language or symbolic strategies to trajectories, torques, and contact-rich manipulation.
Vision, Sensors, and the Embodiment Gap
The group spent substantial time on sensing and embodiment: what data is needed for robust robotics, and how much can be inferred from vision alone.
Vision as a Dominant Modality
One participant argued for an "unreasonable effectiveness of vision," analogous to the effectiveness of language:
Industry trend: using cameras to replace many specialized sensors.
From images, models can often infer:
3D geometry from 2D projections (thanks to symmetry and priors from web-scale data).
Material categories (wood, metal, plastic) and thus approximate densities and masses.
Human analogy: we ourselves rely heavily on visual priors for weight and material:
When handed an unusually dense object (e.g., a ball of gold or iridium), our hand “jerks” because it’s far heavier than our visual model predicts.
After one or two interactions, our motor system adjusts and can move it reliably using essentially the same geometric motions, just scaled forces.
This suggests a powerful pipeline:
Vision: recognize objects and scenes.
Language/semantics: infer object types and typical properties (“apple,” “steel bar,” “ceramic mug”).
Physics priors: assign plausible densities, friction, fragility, etc.
Control: refine these priors through haptic feedback and fine-tuning in the specific robot embodiment.
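A toy version of that pipeline, with hand-made priors standing in for learned models. The density table, the stub classifiers, and the blending factor are all illustrative guesses:

```python
# Toy pipeline: vision -> semantic label -> physics prior -> haptic correction.

DENSITY_PRIOR = {"wood": 700.0, "plastic": 1200.0, "steel": 7800.0}  # kg/m^3

def classify(image):
    return "steel"          # stand-in for a vision model

def estimate_volume(image):
    return 1e-4             # m^3; stand-in for 3D reconstruction from 2D views

def prior_mass(image):
    """Vision plus semantics: material label times geometric volume."""
    return DENSITY_PRIOR[classify(image)] * estimate_volume(image)

def corrected_mass(prior, measured_force_n, gravity=9.81, alpha=0.3):
    """The 'hand jerk' correction: after one grasp, blend the visual
    prior with the haptic measurement; same motions, rescaled forces."""
    measured = measured_force_n / gravity
    return (1 - alpha) * prior + alpha * measured

m0 = prior_mass(image=None)                     # ~0.78 kg from vision alone
m1 = corrected_mass(m0, measured_force_n=12.0)  # updated after the first lift
```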
The Embodiment and Proprioception Question
Another thread focused on the “embodiment gap” and something akin to mirror neurons for robots:
Humans:
Use visual observation to learn from others (e.g., Tai Chi, sports).
Improve thin-slicing / perception of subtle movement patterns over time.
Possess proprioception: a rich internal sense of joint positions, loads, and posture.
Robots:
Need internal sensing (torque sensors, joint encoders, force-torque sensors) to approximate proprioception.
Companies like Standard and Apptronik treat actuator design and torque control as core IP.
One participant recalled Apptronik's interest (pre-LLM era) in detailed contact-force simulations for tasks like screwdriving, which ran into scalability limits with FEA on hundreds of joints.
Observation:
The essay flags the importance of better sensors (e.g., improved torque sensors), but the group suspected this may be a much bigger breakthrough than the essay’s training-approach framing admits.
Proprioceptive data is not directly visible to cameras. It must be modeled and learned per-robot, which partly justifies Standard’s per-deployment fine-tuning.
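One way to read "modeled and learned per-robot" is as a residual: a shared fleet-wide dynamics prior plus a small per-embodiment correction fit on that robot's own torque logs. A hedged numpy sketch (the shared model, the synthetic logs, and the linear residual are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def shared_torque_model(joint_state):
    # stand-in for a fleet-wide learned dynamics model
    return joint_state @ np.array([1.0, 0.5, -0.2])

# Logs from one specific robot, whose actuators differ slightly from
# the fleet average (friction, gear wear, sensor offsets).
states = rng.normal(size=(200, 3))
true_torque = (shared_torque_model(states)
               + states @ np.array([0.1, -0.05, 0.02]) + 0.3)

# Fit this robot's residual by ordinary least squares (with a bias term).
X = np.hstack([states, np.ones((len(states), 1))])
residual = true_torque - shared_torque_model(states)
coef, *_ = np.linalg.lstsq(X, residual, rcond=None)

def per_robot_torque(joint_state):
    """Shared prior plus this embodiment's learned correction."""
    return shared_torque_model(joint_state) + np.append(joint_state, 1.0) @ coef
```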
Mirror Neurons, Collaboration, and “Bots Watching Bots”
Related to embodiment, someone asked what the “mirror neuron equivalent” is for robots:
Humans learn coordination and empathy partly by seeing others move and mapping that into our own motor system.
The prompt for the essay included “how do they collaborate?”, but the group felt the essay gave this short shrift.
One participant had asked their own LLM client about this and found the response interesting, but that content was not deeply unpacked in the discussion.
Net takeaway:
Vision alone goes surprisingly far, but embodiment-specific sensing is crucial to close the loop.
There is a conceptual gap between visual imitation and embodied competence, and the essay did not fully bridge it.
Learning from Video, Games, and Simulation: Where the Article Underplayed Things
The article was perceived as cool toward video-based and simulation-based learning. Multiple participants pushed back, citing human examples and emerging practice.
Human analogies:
Cricket example:
An Indian league cricketer started late (early twenties) but claimed much of his development came from watching thousands of hours of YouTube footage:
He still needed on-field practice.
But high-fidelity video, replay controls, and slow motion helped build mental models and techniques that later transferred to physical execution.
Soccer + FIFA video games:
Several modern soccer players reportedly grew up on the FIFA game. According to a participant's soccer-following friends:
FIFA teaches not only footwork tricks but also tactics and game-level strategy.
This is a “vertical bandwidth” learning channel: from micro-moves to high-level positioning and playmaking.
Chess and engines (as a contrast case):
Modern chess players who grew up playing chess engines play more aggressively and differently than pre-engine generations. Even though chess is “purely mental,” the analogy shows how training with powerful simulators can change the skill distribution.
Robotics analogy:
Just as video and games can bootstrap human skills, large-scale datasets of human manipulation actions could play a significant role in robot learning.
Video and simulated environments might not be sufficient by themselves, but they can significantly shape prior models before fine-tuning on real hardware.
The group’s view:
Real-world deployment and contact-rich experience are essential; the essay is right about that.
But discounting video and simulation felt like an overcorrection; there are strong precedents for them providing substantial value, especially when combined with real-world fine-tuning.
Control Architectures: Tokens vs Trajectories, and the VLA Loop
A significant technical thread examined how to represent and generate actions, and how language and vision fit into that loop.
Token-Based Action vs Diffusion Policies
Two main families of approaches discussed:
Tokenized / language-like actions
Represent actions as discrete tokens: “move forward 10 cm”, “rotate 30°”, “close gripper”.
Predict one token at a time, similar to how LLMs predict the next word.
This maps nicely to language models but has drawbacks:
Error accumulation: one mistaken token can derail the whole sequence.
Poor handling of precise, continuous control requirements.
Diffusion-based action policies (Toyota example)
Inspired by image generation (diffusion models).
Sample an entire noisy action sequence (e.g., a 50-step trajectory) and iteratively denoise it into a coherent plan.
Benefits:
More globally consistent motion.
Better precision.
Lower error propagation than one-step-at-a-time token prediction.
The group generally agreed that diffusion-style policies currently appear better-suited to physical control than pure tokenized action models.
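To make the contrast concrete, here is a schematic of the two decoding styles. Both "models" are random stand-ins; only the control flow (one token at a time versus denoising a whole trajectory) reflects the discussion:

```python
import numpy as np

rng = np.random.default_rng(2)
HORIZON, ACTION_DIM = 50, 3

def next_action_token(history):
    # stand-in for an autoregressive model over discrete action tokens;
    # the drift term mimics how early errors compound down the sequence
    return rng.normal(size=ACTION_DIM) + 0.02 * len(history)

def tokenized_rollout():
    """One action at a time, each step conditioned on (possibly wrong) history."""
    actions = []
    for _ in range(HORIZON):
        actions.append(next_action_token(actions))
    return np.array(actions)

def denoiser(trajectory, noise_level):
    # stand-in for a learned denoising network
    return trajectory * 0.9

def diffusion_rollout(n_steps=20):
    """Sample a full noisy trajectory, then refine it as a whole,
    so the plan stays globally consistent."""
    trajectory = rng.normal(size=(HORIZON, ACTION_DIM))
    for t in range(n_steps):
        trajectory = denoiser(trajectory, t)
    return trajectory
```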
Interleaving Vision, Language, and Action
An important conceptual point: the group pushed back against a clean three-phase model (perceive, then plan in language, then execute blindly).
Instead, the hypothesized loop:
At high frequency (e.g., 100 Hz sampling):
Camera captures a frame (vision).
Semantic inferences update (language / internal representation).
Controller refines or regenerates a trajectory (action).
Execution is receding horizon:
Even if a 50-step plan is generated, the robot may only execute the first few steps.
New sensory input can “preempt” the old plan and trigger a re-plan.
Perception-control coupling is tight and continuous.
Concrete parallel:
Human eye movements (saccades) were mentioned: we don’t passively stare at a scene; our eyes jump around, sampling salient regions based on task and context. This is an example of perception being driven by action and vice versa.
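A skeleton of that loop in code. The 100 Hz tick, the 50-step plan, and the execute-a-few-then-replan structure come from the discussion; every component function is a placeholder:

```python
import time

PLAN_HORIZON = 50   # steps per generated trajectory
EXECUTE_STEPS = 5   # how much of each plan actually runs
DT = 0.01           # ~100 Hz control tick

def perceive():            return {"frame": None}        # camera frame
def update_semantics(obs): return {"goal": "grasp mug"}  # language-level state
def plan(semantics):       return [0.0] * PLAN_HORIZON   # trajectory generator
def preempted(obs):        return False                  # does new input invalidate the plan?
def execute(step):         pass                          # send one command to the controller

def control_loop():
    """Receding-horizon execution: plan long, run short, re-plan."""
    while True:  # runs until externally stopped
        semantics = update_semantics(perceive())
        trajectory = plan(semantics)
        for step in trajectory[:EXECUTE_STEPS]:
            execute(step)
            time.sleep(DT)
            if preempted(perceive()):  # fresh sensing preempts the old plan
                break                  # fall through to a full re-plan
```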
Implication:
If done right, Vision–Language–Action (VLA) models could support a very tight loop where:
Vision provides a rich, constantly updated world model.
Language-like structures encode semantics, high-level goals, and object properties.
Diffusion-based or trajectory-based controllers translate these into robust, interruptible motion plans.
This stands somewhat in tension with the article’s emphasis on “tokens and control systems,” which some felt underappreciated the more continuous, trajectory-oriented approaches.
Constraints, Hardware Realities, and the “Tide Not In Yet” Problem
Several participants with exposure to real robotics emphasized how different it is from cloud AI:
Hard constraints in robotics:
Embedded systems with tight power, memory, and compute budgets.
Limited bandwidth between sensors, controllers, and higher-level decision systems.
Safety constraints and physical damage risk (crashes, broken actuators, costly hardware).
Contrast with cloud AI:
“Compute is not the limiting factor” in most LLM work.
In robotics, everything from thermals to battery life can be the bottleneck.
Anecdotes:
Drones in 2010 vs cheap drones now:
In 2010 in India, building drones required buying relatively expensive servos and components.
Crash landings could destroy much of the hardware; each iteration was costly.
Today, $50–$60 consumer drones bundle enough embedded compute and control sophistication that many “gnarly” control problems are simply solved in firmware for the end user.
Apptronik’s simulations (pre-LLM era):
They wanted both:
Modal analysis (vibration, resonant frequencies).
Contact analysis for specific manipulation tasks (e.g., screwdriving).
Detailed structural-dynamics simulations of full humanoids (hundreds of joints) did not scale with then-available FEA tools.
The “boats before the tide” metaphor:
Some AI companies are said to build boats before the tide comes in, preparing capabilities in anticipation of a future wave of demand or infrastructure.
This can resemble a “big leap” strategy (betting on a future environment) more than the essay’s ideal of purely incremental adaptation to current constraints.
The group felt the essay’s narrative put too much emphasis on a dichotomy where, in reality, firms are mixing strategies: incremental improvement under today’s constraints while positioning for tomorrow’s tide.
Safety, Control, and the Sorcerer’s Apprentice Problem
On the operational side, the group touched on control, failure modes, and safety in both software agents and robots.
Examples:
Waymo self-driving incident:
In a narrow street with a stopped fire truck on the left and a person on the right curb talking on the phone, a Waymo vehicle became confused:
It stopped in the middle of the road at a signal.
Interpreted the nearby pedestrian as a potential crosser and failed to make progress.
Built up a queue of 10–12 cars behind it.
Ultimately required teleoperation (remote human intervention) to “unstick” it.
This illustrates:
Edge-case combinatorics in the real world.
The practical necessity of a teleoperation layer as a safety and liveness backstop.
Sorcerer’s Apprentice with LLM agents:
A participant installed a Claude-based automation agent (“Claudebot”) and then uninstalled it—or so they thought.
A lingering plist process kept running and incurred about $70 in Claude API charges.
They ultimately used a Claude Chrome extension to inspect logs, identify, and shut down the rogue process.
This is a purely digital version of the “Sorcerer’s Apprentice” pattern: agents that keep working beyond the user’s intent, with accumulating cost or risk.
Implications for robotics:
Robots operating in the physical world amplify this risk: a misbehaving process can damage property or injure people, not just burn cloud credits.
Teleoperation and human-in-the-loop control are likely to remain structural features of serious deployments, not mere transitional hacks.
Good observability and “kill switches” (log review, process inspection, remote takeover) are critical, paralleling the API-usage debugging story.
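A minimal dead-man's-switch sketch for the kill-switch point: if the human-in-the-loop layer stops sending heartbeats, the robot halts. The timeout and helper names are illustrative:

```python
import time

HEARTBEAT_TIMEOUT_S = 2.0  # illustrative; real values depend on the task

class DeadMansSwitch:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        """Called periodically by the teleop console or supervisor UI."""
        self.last_heartbeat = time.monotonic()

    def check(self, stop_robot):
        """Called from the control loop every tick."""
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            stop_robot()  # e.g., cut actuator power or hold position

switch = DeadMansSwitch()
# In the control loop: switch.check(stop_robot=lambda: print("HALT"))
```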
Where Are We on Composability and Distributed Systems?
Relative to the CADS theme (Composable + Distributed Systems), the group felt:
Today’s robots are far from composable or distributed in the way modern software systems are:
Most systems remain vertical stacks (hardware + firmware + control + planning tightly coupled).
Standard Bots' "generalist" robot is still vertically integrated, just along a different axis than older industrial robots.
The article’s described frontier—mastering narrow, domain-specific physical tasks and then trying to generalize—is directionally aligned with how composable systems tend to emerge:
Specialize to a niche.
Extract abstractions and interfaces.
Generalize and compose those capabilities across domains.
However, the group cautioned against over-specialization:
In evolution and cognition, extreme specialization leads to brittle systems (idiot savants, ecological niche specialists).
Robotics architectures must balance specialization for performance with the flexibility needed for out-of-distribution situations.
The consensus:
Robotics feels “not there yet” in CADS terms.
But the styles of thinking and architectures being explored (VLA models, hierarchical control, vertical stacks that may later fracture into modules) are plausibly on the path to future composable, distributed robotic ecosystems.
Wrap-Up
Key takeaways
Standard Bots’ incremental, vertically integrated strategy is plausible and grounded in hardware and sensing realities, but the article’s incremental-vs-big-leap framing was seen as overstated and somewhat self-serving.
LLMs and language have shown surprising effectiveness for qualitative physics and planning; the hard problem is turning natural-language reasoning into robust, contact-rich control on specific robot embodiments.
Vision is emerging as a dominant sensor modality, with substantial ability to stand in for other sensors via learned priors, but embodiment-specific proprioception remains essential and justifies per-robot fine-tuning.
Video, games, and simulation appear more powerful for skill acquisition—both in humans and, by analogy, robots—than the article suggests, especially when combined with real-world learning.
Diffusion-based trajectory generation and tightly interleaved Vision–Language–Action loops look more promising for robotics than naïve token-based “one-step-at-a-time” action models.
Real-world constraints (embedded compute, actuation limits, safety) and “Sorcerer’s Apprentice” failure modes argue for robust teleoperation and human-in-the-loop oversight as first-class design elements.
Robotics today is still largely non-composable and non-distributed, but the emerging patterns echo earlier transitions in software and may lead toward CADS-style architectures over time.
Open questions explicitly surfaced
How exactly should LLMs be “in the loop” for robotics—brain, high-level planner, or context manager for fleets of simpler controllers?
What is the right abstraction boundary between VLA models and low-level controllers, especially in resource-constrained embedded environments?
How far can we push vision-only (plus priors) approaches before specialized sensors and rich proprioception become unavoidable bottlenecks?
What would a genuine “mirror neuron” analogue for robots look like, supporting collaboration and learning-by-watching between robots?
Yak Collective Discord call thread:
https://discord.com/channels/692111190851059762/1465166011824079023



Fascinating. It's almost like a business model trying to frame itself as a philosophical debate, isn't it?