Ground-Up VLA: Introduction
Going from GPU basics to robot foundation models
Vision-Language-Action (VLA) models have become a hot research topic, especially in the past few years, and that finally spurred me to think that, just maybe, I should understand machine learning beyond using Perplexity, Claude, Codex, or <insert-your-favorite-product-here> to get things done. I’ve worked across a few domains as an engineer, but never in the learning space (beyond some introductory classes).
When I started reading, I quickly realized there is a lot to understand. For instance, to understand how Physical Intelligence’s π0 model works, I need to understand multilayer perceptrons, transformers, how transformers can be used for vision, action chunks, flow matching, diffusion models, autoregressive generation, and a whole host of other concepts. Even though π0 does not use diffusion models, I’d still need to know about them to understand the trade-offs between approaches.
So, in a series of posts, I’m going to work through how to build a VLA from the ground up. The posts, which I’ll publish on Substack, will broadly follow this outline:
How to use a local GPU
Transformers
Vision Language Models (VLM)
Imitation learning
Action chunking
Flow matching
Vision Language Action (VLA) models
If this sounds interesting, whether because you’re also unsure how to get up to speed in this field, or because you already know all of this and are curious to see where a beginner stumbles, your reading and feedback are welcome.
The next post in this series is about GPU basics.
