understand the difference between raisins, green peppers and a salt shaker? More importantly, how can they figure out how to fold a t-shirt?
That’s the magic of Vision-Language-Action (VLA) models.
This article is a concise summary of modern Vision-Language-Action (VLA) models, distilled from a survey of the latest frontier models along with the mathematical concepts behind them.
You will learn:
- Useful conjectures
- The mathematical fundamentals
- Real-world neural architectures
- How VLAs are trained
Preliminaries
If any of the following concepts are foreign to you, it’s worthwhile to spend some time learning them: they cover key components of modern data-driven multimodal robotic control (especially VLAs).
- Transformers — the dominant architecture pattern in today’s VLAs is a Vision-Language Model (VLM) backbone, a transformer-based visual+language encoder
- Representation Learning — advances in VLAs are strongly driven by optimizing learned representations, or projections into latent space, for control policies
- Imitation Learning — learning action policies from demonstration data generated by human movement or teleoperated robotic trajectories
- Policy Optimization — high-performing robotic control policies often combine imitation learning with policy optimization, producing a stochastic policy capable of generalizing to new domains and tasks
Useful Conjectures
These are by no means absolute laws. In my opinion, these conjectures are helpful for understanding (and building) agents which interact with the world.
💭Latent representation learning could be foundational to intelligence
While unproven, and vastly oversimplified here, I believe this to be true given the following:
- LLMs and other transformer models do not learn the grammar of English, or of any language. They learn an embedding: a map which geometrically projects tokens, or quantized observations, into N-dimensional latent space, where semantically similar inputs land near one another.
- Some leading AI researchers, such as Yann LeCun (with his Joint Embedding Predictive Architecture, or JEPA), argue that human-level AI requires “World Models” (LeCun et al., “A Path Towards Autonomous Machine Intelligence”). A world model rarely predicts in pixel space, but predicts in latent space, making causal reasoning and prediction abstract and tractable. This gives a robot a sense of “If I drop the glass, it will break.”
- From biology: neuroscientists working under the “free energy principle” (Karl Friston, “The Free-Energy Principle: A Unified Brain Theory?”), a deeply complex topic with many branches, posit at a high level that the brain makes predictions and minimizes error (variational free energy) against internal “latent” models. When I say latent, I am also drawing on the neural manifold hypothesis (Gallego et al., “A Unifying Perspective on Neural Manifolds and Circuits for Cognition”) as applied to this space.
I realize that this is a profound and complex conjecture which is up for debate. However, it would be hard to argue against representation learning given that all of the latest VLAs use latent-space projections as a core building block in their architectures.
💭Imitation is fundamental to energy efficient, robust robotic locomotion
Why did it take so long to get walking right? No human [expert] priors. Here is an example of locomotion as demonstrated by Google DeepMind versus DeepMimic, a very impactful paper which demonstrated the unreasonable effectiveness of training alongside expert demonstrations. While energy wasn’t explicitly measured, comparing the two shows the effect of imitation on efficient humanoid locomotion.
Example 1: From Deepmind’s “Emergence of Locomotion Behaviours in Rich Environments” (Heess et al., 2017)
Although this demonstrates emergent behavior, we can clearly see that the humanoid learns energy-inefficient locomotion patterns that often fail to generalize, especially in complex environments.
Example 2: DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills (Peng et al., 2018)
When an imitation loss is added to the standard task reward in the objective, locomotion becomes smoother and the agents generalize more efficiently to new domains.
On Teleoperation (Teleop)
If there was any question that Optimus uses teleop for their robots. Here one clearly has a guy take the headset off and it falls over.
Absolutely hilarious though.
— CIX 🦾 (@cixliv) December 8, 2025
Teleoperation is clearly evident in the training of the latest humanoids, and even latest demonstrations of robotic control.
But teleop isn’t a dirty word. In fact, it’s necessary. Here’s how teleoperation can assist policy formation and optimization.
Instead of the robot attempting to generate control outputs from scratch (e.g. the awkward, jerky movements of the first successful humanoid control policies), we augment policy optimization with samples from a smooth dataset representing the correct action trajectory, as performed by a human in teleoperation.
This means that, as a robot learns internal representations of visual observations, an expert can provide precision control data. So when I prompt “move x to y”, the robot can not only learn a robust stochastic policy via policy optimization methods, but also clone behavior with imitation priors.
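The mix of imitation and policy optimization can be sketched numerically. Below is a minimal, illustrative Python sketch (not from any of the cited papers): `imitation_loss` is a behavior-cloning MSE term against teleoperated demonstrations, and `lambda_im` is an assumed weighting constant combining it with a placeholder task objective, in the spirit of DeepMimic.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 8))                   # teleoperated expert action trajectory
pred = demo + rng.normal(0.0, 0.1, size=(100, 8))  # policy output, close to the demo

def imitation_loss(pred_actions, demo_actions):
    """Behavior-cloning term: mean squared error against expert demonstrations."""
    return float(np.mean((pred_actions - demo_actions) ** 2))

# DeepMimic-style combined objective (sketch): a task term plus a weighted
# imitation term. task_loss and lambda_im are placeholders, not paper values.
lambda_im = 0.5
task_loss = 1.0
total_loss = task_loss + lambda_im * imitation_loss(pred, demo)
```

The imitation term pulls the policy toward the demonstrated trajectory while the task term keeps it optimizing the actual objective.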
Although the reference data wasn’t teleoperated movement, human motion priors and imitation learning are employed by Figure AI in their latest VLA, Helix 02: A Unified Whole-Body Loco-Manipulation VLA, which contains an additional system (S0) trained on joint-retargeted human movement priors and used for stable, whole-body locomotion.
Job postings by the company, including this one for Humanoid Robot Pilot, strengthen the argument.
Understanding latent representations and generating rich, expert driven trajectory data are both extremely useful in the space of modern robotic control.
The Mathematical Fundamentals
Again, this is not an exhaustive summary of every foundational lemma and proof, but enough to whet your appetite, with links to dive deeper if you so choose.
Even though a VLA seems complex, at its core a VLA model reduces to a simple conditioned policy-learning problem. By that I mean: we want a function $f(x)$, commonly denoted in policy form as $\pi_\theta$, which maps what a robot sees and hears (in natural language) to what it should do.
This function gives an action output (over all the actions that the robot can perform) for every observation (what it sees and hears) at every timestep. Modern VLAs produce these action sequences at rates up to 50 Hz.
How do we get this output?
Formally, consider a robot operating in a partially observable Markov decision process (POMDP). At each timestep t:
- The robot receives an observation $o_t$, typically an RGB image (or a set of images or video frames) plus the internal proprioceptive state (joint angles, gripper state).
- It is given a language instruction $l$: a natural-language string like “pick up the coke can and move it to the left.”
- It must produce an action $a_t \in A$, usually a vector of end-effector deltas and a gripper command.
The VLA’s job is to learn a policy:
$\pi_\theta(a_t \mid o_t, l)$
that maximizes the probability of task success across diverse environments, instructions, and embodiments. Some formulations condition on observation history rather than a single frame, but most modern VLAs operate on the current observation (or a short window), along with goal tokens and the robot’s current proprioceptive state, and rely on action chunking (more on that shortly) for temporal coherence.
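Concretely, the policy is just a function with this input/output contract. The sketch below stubs out $\pi_\theta$ with random joint deltas; the shapes (a 224x224 RGB frame, an 8-D proprioceptive state, an 8-D action) are illustrative assumptions, not fixed by any particular model.

```python
import numpy as np

OBS_SHAPE = (224, 224, 3)  # assumed camera resolution
PROPRIO_DIM = 8            # joint angles + gripper
ACTION_DIM = 8             # joint deltas + gripper command

def pi_theta(o_t, proprio, l, rng):
    """Stand-in for the learned stochastic policy pi_theta(a_t | o_t, l).

    A real VLA encodes o_t and l with a VLM backbone; here we just
    sample small joint deltas to show the interface."""
    assert o_t.shape == OBS_SHAPE and proprio.shape == (PROPRIO_DIM,)
    return rng.normal(0.0, 0.05, size=ACTION_DIM)

rng = np.random.default_rng(0)
o_t = np.zeros(OBS_SHAPE)    # current camera frame
q_t = np.zeros(PROPRIO_DIM)  # current proprioceptive state
a_t = pi_theta(o_t, q_t, "pick up the coke can and move it to the left", rng)
```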
Here, the stochastic policy is learned via policy optimization. Refer back to the prerequisites.
The action space
Understanding how robots perceive and interact with the environment is the foundation of learning more complex topics.
Here, I describe the action space for a simple robot, but the same framework extends seamlessly to more advanced humanoid systems.
A typical single-arm robotic manipulator has 7 DoF (degrees of freedom) plus a 1-DoF gripper.
A standard robotic manipulator. Image by ChatGPT.
As noted, this is a simplified control system. For example, the mobile robots used in π0 have up to 19 DoF, while humanoid robots such as Tesla’s Optimus and Boston Dynamics’ Atlas have 65+ DoF, with 22 in the hands alone.
Vectorizing a single robotic configuration (typically expressed as angles) gives us:
$q = [q_1, \dots, q_7, \mathrm{gripper}] \in \mathbb{R}^8$
This gives us an 8-dimensional space representing all possible poses of our arm.
A control command is expressed in deltas, e.g. increase angle $q_1$ by $40\degree$, plus a gripper state. This gives us
$\dot{q} = [\Delta q_1, \dots, \Delta q_7, s_{gripper}]$
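In code, the configuration and a delta command look like this. A minimal NumPy sketch; the joint values and the 0-open gripper convention are assumptions for illustration:

```python
import numpy as np

# Configuration q in R^8: seven joint angles (radians) plus a gripper state.
q = np.array([0.0, -0.5, 0.3, 1.2, 0.0, 0.8, 0.0, 1.0])

# Control command q_dot: deltas for q1..q7 plus an absolute gripper command.
delta = np.deg2rad([40.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # rotate joint 1 by 40 degrees
s_gripper = 0.0                                           # assumed convention: 0.0 = open

q_next = q.copy()
q_next[:7] += delta     # apply joint deltas
q_next[7] = s_gripper   # set gripper state
```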
Why is this important?
The vectors for both the state and the control commands are continuous.
Generating output actions ($\dot{q}$) from internal representations is one of the most consequential design decisions in active VLA research. Modern models use one of the following three strategies.
Strategy #1: Action Tokenization
The idea is relatively simple: discretize each action dimension into $K$ uniform bins (typically $K = 256$). Each bin index becomes a token appended to the language model’s vocabulary.
An action vector becomes a sequence of tokens, and the model predicts them autoregressively, just like training GPT.
$P(a_t \mid o_t, l) = \prod_{i=1}^{d} P(a_t^{(i)} \mid a_t^{(1)}, \dots, a_t^{(i-1)}, o_t, l)$
where $d = 8$ and each quantized component $a_t^{(i)} \in \{0, 1, \dots, K-1\}$
So each control command is a “word” in a space of possible “words”, the “vocabulary” and the model is trained almost exactly like GPT: predict the next token given the sequence of tokens.
This approach is used quite effectively in RT-2 and OpenVLA, some of the earliest examples of successful VLAs.
Unfortunately, for precision control tasks, discretization leaves us with a quantization error which cannot easily be recovered. When translating $word_i \rightarrow command_i$, we lose precision. This can result in jerky, awkward control policies which break down for tasks like “pick up this tiny screw.”
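A minimal sketch of uniform binning makes the quantization error concrete. The $K = 256$ bin count matches RT-2/OpenVLA, but the normalized action bounds $[-1, 1]$ and round-to-bin-center decoding are illustrative assumptions:

```python
import numpy as np

K = 256                # bins per action dimension, as in RT-2 / OpenVLA
LOW, HIGH = -1.0, 1.0  # assumed per-dimension bounds after normalization

def tokenize(a):
    """Map each continuous action dimension to a bin index in {0..K-1}."""
    bins = np.clip((a - LOW) / (HIGH - LOW) * K, 0, K - 1)
    return bins.astype(np.int64)

def detokenize(tokens):
    """Map bin indices back to bin centers -- this is where precision is lost."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / K

a = np.array([0.123456, -0.54321, 0.0, 0.9, -0.9, 0.25, -0.25, 1.0])
tokens = tokenize(a)
a_hat = detokenize(tokens)
quantization_error = np.max(np.abs(a - a_hat))  # bounded by half a bin width
```

With 256 bins over a range of 2, the error per dimension is bounded by half a bin width (about 0.004 in normalized units), which is exactly the precision floor that hurts fine manipulation.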
Strategy 2: Diffusion-Based Action Heads
Rather than discretizing, you can keep actions continuous and model the conditional distribution $p(a_t \mid o_t, l)$ using a denoising diffusion process.
The diffusion process below is taken from Octo (Octo: An Open-Source Generalist Robot Policy), but similar processes are applied across architectures such as GR00T.
Run a forward pass of the transformer backbone to obtain the trained representation. This single latent vector represents the visual field, the instruction tokens, and the goal tokens at the current state. We denote this as $e$.
Run the diffusion process, which can be summarized with the following steps:
Sample an initial latent (noise) $x_K \sim \mathcal{N}(0, I)$
Run $K$ denoising steps using a learned diffusion network $\epsilon_\theta(x_k, e, k)$
Each update:
$x_{k-1} = \alpha\left(x_k - \gamma\,\epsilon_\theta(x_k, e, k) + \mathcal{N}(0, \sigma^2 I)\right)$
- $x_k$: the current noisy action
- $\epsilon_\theta(x_k, e, k)$: predicts the noise to remove, conditioned on the backbone representation $e$ and timestep $k$
- $\gamma$: scales the denoising correction
- $\mathcal{N}(0, \sigma^2 I)$: reintroduces controlled noise (stochasticity)
- $\alpha$: rescales according to the noise schedule
I’m somewhat mangling the standard notation for Denoising Diffusion Probabilistic Models (DDPM). The abstraction is correct.
This process, performed iteratively with the trained diffusion model, produces a stochastic, continuous action sample. Because this action sample is conditioned on the encoder output $e$, the trained diffusion model only generates actions relevant to the input context.
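The denoising loop can be sketched as follows. Everything here is a toy stand-in: `eps_theta` is not a trained network (it just returns its input, which shrinks the sample toward zero each step), and the schedule constants are arbitrary. It shows the shape of the iteration, not real model behavior.

```python
import numpy as np

ACTION_DIM, K_STEPS = 8, 20

def eps_theta(x_k, e, k):
    """Toy stand-in for the learned noise predictor epsilon_theta(x_k, e, k).
    Returning x_k itself makes each update contract the sample toward zero."""
    return x_k

rng = np.random.default_rng(0)
e = np.zeros(512)                # placeholder backbone representation
x = rng.normal(size=ACTION_DIM)  # x_K ~ N(0, I)
x0 = x.copy()

alpha, gamma, sigma = 1.0, 0.1, 0.01  # illustrative schedule constants
for k in range(K_STEPS, 0, -1):
    # reintroduce a little noise on all but the final step
    noise = rng.normal(0.0, sigma, size=ACTION_DIM) if k > 1 else 0.0
    x = alpha * (x - gamma * eps_theta(x, e, k) + noise)
```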
Diffusion heads shine when the action distribution is multimodal: there might be multiple valid ways to grasp an object, and a unimodal Gaussian (or a single discrete token, given the quantization limitations discussed above) can’t capture that well.
Strategy #3 – Flow matching
The successor of diffusion has also found a home in robotic control.
Instead of stochastic denoising, a flow matching model elegantly learns a velocity field, which determines how to move a sample from noise to the target distribution.
This velocity field can be summarized by:
At every point in space and time, which direction should x move, and how fast?
How do we learn this velocity field in practice, especially in the domain of continuous control?
The flow matching process described below was taken from π0: A Vision-Language-Action Flow Model for General Robot Control
Begin with a valid action sequence $A_t$
Corrupt it with noise, creating $A_t^{\tau}$:
$A_t^{\tau} = \tau A_t + (1 - \tau)\epsilon$
where $\tau \in [0, 1]$ and $\epsilon \sim \mathcal{N}(0, I)$
$\tau = 0$: pure noise; $\tau = 1$: target
Learn the vector field $V_\theta(A_t^\tau, o_t)$ by regressing it onto the target velocity with a mean-squared-error flow-matching loss:
$\mathcal{L}(\theta) = \mathbb{E}\left[\left\| V_\theta(A_t^\tau, o_t) - (A_t - \epsilon) \right\|^2\right]$
Here, the target vector field is simply the mathematical derivative of this path with respect to time $\tau$, namely $A_t - \epsilon$.
It represents the exact direction and speed you need to move to get from the noise to the true action. We have this because we did the noising! Simply calculate the difference (ground-truth action minus the noise).
Now the elegant piece. At inference, we have no ground truth of actions, but we do have our trained vector field model.
Because our vector field model $V_\theta(A_t^\tau, o_t)$ now accurately predicts continuous velocities over noised samples, we can use the forward Euler integration rule, as specified here:
$A_t^{\tau+\delta} = A_t^\tau + \delta\, V_\theta(A_t^\tau, o_t)$
to move us incrementally from noise to clean, continuous action samples with $\delta = 0.1$. We use the simple Euler method over 10 integration steps [as used in π0] for latency.
At step 0, we have mostly noise. At step 10, we have a chunk of actions accurate and precise enough for continuous control.
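The whole flow-matching story fits in a few lines of NumPy. As an illustrative assumption, the learned field $V_\theta$ is replaced by the ideal straight-line velocity $A_t - \epsilon$, so 10 Euler steps with $\delta = 0.1$ recover the clean action chunk exactly; a trained model only approximates this field.

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 16, 8                             # assumed chunk: 16 timesteps x 8 dims
A = rng.uniform(-1.0, 1.0, size=(H, D))  # "ground truth" action chunk

# Training side: noise the chunk and form the regression target.
eps = rng.normal(size=(H, D))
tau = 0.3
A_tau = tau * A + (1.0 - tau) * eps
target_velocity = A - eps                # d/dtau of tau*A + (1 - tau)*eps

# Inference side: forward Euler integration with a perfect velocity field.
def v_field(x, obs):
    """Ideal field along the straight-line path; a trained model approximates this."""
    return A - eps

x = eps.copy()                           # tau = 0: pure noise
delta = 0.1
for _ in range(10):
    x = x + delta * v_field(x, obs=None) # ends at the clean chunk A
```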
If flow matching still evades you, this article, which visually animates flow matching with a toy problem, is very helpful.
Real-world Neural Architectures
This summary architecture is synthesized from OpenVLA, NVIDIA’s GR00t, π0.5, and Figure’s Helix 02, which are some of the latest cutting-edge VLAs.
There are differences, some subtle and some not so subtle, but the core building blocks are very comparable across each.
Image by Author
Input Encoding
First we need to encode what the robot sees into $e$, which is foundational for learning flow, diffusion, etc.
Images
Raw images are processed by a pretrained vision encoder. For example, π0 (via PaliGemma) uses SigLIP, and GR00T uses a ViT (Vision Transformer).
These encoders convert our sequence of raw image frames, sampled at ~5-10 Hz, into a sequence of visual tokens.
Language
The command “fold the socks in the laundry basket” gets tokenized using the LLM’s tokenizer, typically a SentencePiece or BPE tokenizer, producing a sequence of token embeddings. In some cases, like Gemma (π0) or Llama 2 (OpenVLA), these embeddings share a latent space with our visual tokens.
Again, there are architectural differences. The main takeaway here is that images + language are encoded into semantically similar sequences in latent space, so that they can be consumed by a pretrained VLM.
Structuring the observation space with the VLM backbone
The visual tokens and language tokens are concatenated into a single sequence, which is fed through the pretrained language model backbone acting as a multimodal reasoner.
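At the tensor level, this fusion is just sequence concatenation. The token counts and model width below are illustrative assumptions (e.g. 196 patch tokens from a 14x14 grid):

```python
import numpy as np

D_MODEL = 512               # assumed embedding width
N_VISUAL, N_TEXT = 196, 12  # e.g. 14x14 image patches + 12 instruction tokens

rng = np.random.default_rng(0)
visual_tokens = rng.normal(size=(N_VISUAL, D_MODEL))  # from the vision encoder
text_tokens = rng.normal(size=(N_TEXT, D_MODEL))      # from the tokenizer + embedding

# One sequence for the VLM backbone (a real model also adds
# position and modality embeddings before the transformer layers).
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
```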
VLM backbones often have multimodal outputs, like bounding boxes for object detection, captions on images, language based subtasks, etc. but the primary purpose for using a pretrained VLM is generating intermediate representations with semantic meaning.
- GR00T N1 extracts embeddings from intermediate layers of the LLM (Eagle)
- OpenVLA fine-tunes the VLM to predict discrete actions directly; the output tokens of the LLM (Llama 2) are then projected to continuous actions
- π0.5 also fine-tunes the VLM (SigLIP + Gemma) to output discrete action tokens, which an action expert then uses to generate continuous actions
Action Heads
As covered in depth above, the fused representation is decoded into actions via one of three strategies: action tokenization (OpenVLA), diffusion (GR00T N1), or flow matching (π0, π0.5). The decoded actions are typically action chunks, a short horizon of future actions (e.g., the next 16–50 timesteps) predicted simultaneously. The robot executes these actions open-loop or re-plans at each chunk boundary.
Action chunking is critical for smoothness. Without it, per-step action prediction introduces jitter, because each prediction is made independently. By predicting a coherent trajectory, the model amortizes its planning over a window, producing smoother, more consistent motion.
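The execution pattern can be sketched like this; the chunk length (16) and the stubbed `predict_chunk` are assumptions for illustration:

```python
import numpy as np

CHUNK, DIM = 16, 8

def predict_chunk(obs, rng):
    """Stand-in for the VLA action head: predicts the next CHUNK actions at once."""
    return rng.normal(0.0, 0.02, size=(CHUNK, DIM))

rng = np.random.default_rng(0)
obs = np.zeros(4)            # placeholder observation
executed = []

for _ in range(3):           # three replanning cycles
    chunk = predict_chunk(obs, rng)
    for a_t in chunk:        # execute open-loop within the chunk
        executed.append(a_t)
    # re-observe here at the chunk boundary, then predict the next chunk

total_steps = len(executed)  # 3 chunks x 16 steps = 48 actions
```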
How VLAs are trained
Modern VLAs don’t train from scratch. They inherit billions of parameters worth of prior knowledge:
- The vision encoder (e.g., SigLIP, DINOv2) is pretrained on internet-scale image-text datasets (hundreds of millions to billions of image-text pairs). This gives the robot’s “eyes” a rich understanding of objects, spatial relationships, and semantics before it ever sees a robot arm.
- The language model backbones (e.g., Llama 2, Gemma) are pretrained on trillions of tokens of text, giving them broad reasoning, instruction following, and common-sense knowledge.
These pretrained components are essential to generalization, allowing the robot to understand what a cup is, what a t-shirt is, etc. without needing to train from scratch.
Multiple Training Phases
VLAs use multiple training phases. Phase 1 is large-scale pretraining on diverse data collected from real robotic control tasks and/or synthetically generated data; Phase 2 focuses on embodiment-specific training and action-head specialization.
Phase 1: Pretraining
The VLA is trained on large-scale robot demonstration datasets. Some examples of training data:
- OpenVLA was trained on the Open X-Embodiment dataset, a community-aggregated collection of ~1M+ robot trajectories across 22 robot embodiments and 160,000+ tasks.
- π0 was trained on over 10,000 hours of dexterous manipulation data collected across multiple Physical Intelligence robot platforms.
- Proprietary models like GR00T N1 and Helix also leverage large in-house datasets, often supplemented with simulation data.
The goal of pretraining is to learn the foundational mapping from multimodal observations (vision, language, proprioception) to action-relevant representations that transfer across tasks, environments and robot embodiments. This includes:
- Latent representation learning
- Alignment of actions to visual + language tokens
- Object detection and localization
Pretraining typically does not produce a successful robotic policy on its own. It provides the policy with a general foundation which can be specialized with targeted post-training. This allows the pretraining phase to use robotic trajectories that don’t match the target robotic platform, or even simulated human interaction data.
Phase 2: Post training
The goal of post-training is to specialize the pretrained model into a task- and embodiment-specific policy that can operate in real-world environments.
Pretraining gives us general representations and priors; post-training aligns and refines the policy to precise requirements and objectives, including:
- Embodiment: mapping the predicted action trajectories to the precise joint and actuator commands required by the robotic platform
- Task specialization: refining the policy for specific tasks, e.g. the tasks required by a factory robot or a house cleaning robot
- Refinement: obtaining high precision continuous trajectories enabling fine motor control and dynamics
Post-training provides the real control policy, trained on data matched for deployment. The end result is a policy that retains the generalization and adds the precision required for the real world [we hope].
Wrapping up
Vision-Language-Action (VLA) models matter because they unify perception, reasoning, and control into a single learned system. Instead of building separate pipelines for vision, planning, and actuation, a VLA directly maps what a robot sees and is told into what it should do.
An aside on possible futures
Embodied intelligence argues that cognition is not separate from action or environment: perception, reasoning, and action generation are tightly coupled. Intelligence itself may require some sort of physical “vessel” that can reason about and act within its environment.
VLAs can be interpreted as an early realization of this idea. They remove boundaries between perception and control by learning a direct mapping from multimodal observations to actions. In doing so, they shift robotics away from explicit symbolic pipelines and toward systems that operate over shared latent representations grounded in the physical world. Where they take us from here is still mysterious and thought-provoking 🙂
References
- Heess, N. et al. (2017). Emergence of Locomotion Behaviours in Rich Environments. DeepMind.
- Peng, X. B. et al. (2018). DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. ACM Transactions on Graphics.
- Octo Model Team (2024). Octo: An Open-Source Generalist Robot Policy.
- Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind.
- OpenVLA Project (2024). OpenVLA: Open Vision-Language-Action Models. https://openvla.github.io/
- Physical Intelligence (2024). π0: A Vision-Language-Action Flow Model for General Robot Control.
- Physical Intelligence (2025). π0.5: Improved Vision-Language-Action Flow Models for Robot Control.
- NVIDIA Research (2024). GR00T: Generalist Robot Policies.
- Friston, K. (2010). The Free-Energy Principle: A Unified Brain Theory? Nature Reviews Neuroscience.
- Gallego, J. et al. (2021). A Unifying Perspective on Neural Manifolds and Circuits for Cognition. Current Opinion in Neurobiology.

