# Minimal VLA: The Simplest Vision-Language-Action Model

An extremely lightweight VLA implementation for learning and experimentation: the trained checkpoint is only ~20MB!
A beginner-friendly, minimal Vision-Language-Action (VLA) model designed for educational purposes and rapid prototyping. This project demonstrates the core concepts of VLA systems using CLIP + Flow Matching in the simplest possible setup.
## Why This Project?
| Feature | This Model | Typical VLAs |
|---|---|---|
| Model Size | ~20MB | 1-7GB+ |
| Training Time | ~20 min | Hours to days |
| Hardware | Any GPU / CPU | High-end GPUs |
| Simulation | 2D rendering | Physics engines |
| Complexity | ~1000 lines | 10,000+ lines |
| Dependencies | PyTorch + CLIP | Complex stacks |
Perfect for:
- Students learning VLA fundamentals
- Researchers prototyping new ideas quickly
- Educators teaching robot-learning concepts
- Developers building their first VLA system
## Model Overview
This minimal VLA predicts an 8-dimensional robot action (3-D position, quaternion orientation, gripper state) from an RGB image and a natural-language instruction:
```
Input: Image (224×224) + Text ("pick up the red cube")
        ↓
CLIP ViT-B/32 (frozen, vision + language encoding)
        ↓
Flow Matching Policy (~2MB trainable parameters)
        ↓
Output: [x, y, z, qx, qy, qz, qw, gripper]
```
### Key Design Choices for Simplicity
- Frozen CLIP Backbone: no need to train vision-language understanding from scratch
- 2D Synthetic Environment: no physics engine required
- Flow Matching: an elegant generative approach for continuous actions (see the training-step sketch below)
- Separate Gripper Classifier: a binary decision head for open/close
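To make the flow-matching choice concrete, here is a minimal training-step sketch in the rectified-flow style: sample noise, pick a random time t, interpolate toward the ground-truth 7-D arm action, and regress the constant velocity along that path. The `velocity_net(x_t, t, context)` signature is an assumption for illustration, not the exact interface in `vla_flow_matching.py`.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, context, action):
    """One hypothetical flow-matching training step (rectified-flow style).

    `action` is the ground-truth 7-D arm action (position + quaternion);
    `velocity_net(x_t, t, context)` is assumed to return a 7-D velocity.
    """
    noise = torch.randn_like(action)                           # x_0 ~ N(0, I)
    t = torch.rand(action.shape[0], 1, device=action.device)   # t ~ U[0, 1]
    x_t = (1 - t) * noise + t * action                         # straight-line interpolation
    target_v = action - noise                                  # constant velocity along the path
    pred_v = velocity_net(x_t, t, context)
    return F.mse_loss(pred_v, target_v)
```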
## Performance
Evaluated on 10 test samples from 1000 synthetic demonstrations:
| Metric | Value | Notes |
|---|---|---|
| Position Error | 8.60 cm | For the ~5 cm cube-picking task |
| Gripper Accuracy | 75% | Open/close decision accuracy on the test set |
| Overall MAE | 0.1217 | Mean absolute error across all 8 action dimensions |
| Quaternion Error | 19.36° | Orientation is most accurate for top-down grasps |
⚠️ Note: This is an educational model trained on simplified 2D projections. Real-world deployment requires fine-tuning on actual robot data.
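For reference, the metrics in the table can be reproduced roughly as below. This is a sketch rather than the project's actual evaluation code; the array layout and the quaternion-angle formula are assumptions (unit quaternions, positions in the workspace's metric units).

```python
import numpy as np

def evaluate(pred, gt):
    """pred, gt: (N, 8) arrays laid out as [x, y, z, qx, qy, qz, qw, gripper]."""
    pos_err = np.linalg.norm(pred[:, :3] - gt[:, :3], axis=1).mean()
    # Angle between unit quaternions: theta = 2 * arccos(|<q_pred, q_gt>|)
    dots = np.abs(np.sum(pred[:, 3:7] * gt[:, 3:7], axis=1)).clip(0.0, 1.0)
    quat_err_deg = np.degrees(2.0 * np.arccos(dots)).mean()
    grip_acc = ((pred[:, 7] > 0) == (gt[:, 7] > 0)).mean()   # binary open/close
    mae = np.abs(pred - gt).mean()                           # all 8 dimensions
    return pos_err, quat_err_deg, grip_acc, mae
```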
## Quick Start
### Installation

```bash
pip install torch transformers pillow numpy matplotlib
```
### Inference (a few lines)
```python
import torch
from PIL import Image

from vla_flow_matching import VLM_Encoder, ImprovedFlowMatchingPolicy

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load('vla_checkpoint_best.pt', map_location=device)
vlm_encoder = VLM_Encoder().to(device)
policy = ImprovedFlowMatchingPolicy(action_dim=8, context_dim=1024, hidden_dim=512).to(device)
policy.load_state_dict(checkpoint['policy'])
policy.eval()

# Predict!
image = Image.open('workspace.jpg').resize((224, 224))
context = vlm_encoder.encode([image], ["pick up the red cube"])
action = policy.sample(context, num_samples=1, device=device)
print(f"Position: {action[0, :3].cpu().numpy()}")
print(f"Gripper: {'CLOSE' if action[0, 7] > 0 else 'OPEN'}")
```
### Train from Scratch (~20 minutes)

```bash
# Step 1: Generate synthetic data
python vla_flow_matching.py --mode generate_data --num_demos 1000

# Step 2: Train (takes ~20 min on a consumer GPU)
python vla_flow_matching.py --mode train --epochs 200 --batch_size 32

# Step 3: Evaluate
python vla_flow_matching.py --mode replay --checkpoint vla_checkpoint_best.pt
```
## Repository Structure

```
├── vla_flow_matching.py    # Complete implementation (~1000 lines)
├── vla_checkpoint_best.pt  # Trained weights (~20MB)
├── demo_data.pkl           # Training data (1000 demos)
├── replay_results.png      # Evaluation visualization
└── README.md               # This file
```
## What You'll Learn
This codebase teaches core VLA concepts:
- Vision-Language Encoding: Using CLIP for joint image-text understanding
- Flow Matching: A modern generative approach for action prediction
- Action Representation: 8-dimensional actions combining 3-D position, quaternion orientation, and gripper state
- Synthetic Data Generation: Creating training environments without physics
- Model Architecture: Combining frozen backbones with trainable policies
## Architecture Details
### VLM Encoder (Frozen CLIP)
- Vision: ViT-B/32 → 512-dim features
- Text: Transformer → 512-dim features
- Combined: 1024-dim context vector (see the encoding sketch below)
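A minimal sketch of how such a context vector can be built with Hugging Face Transformers; the repository's `VLM_Encoder` presumably wraps something equivalent, but the exact pooling and preprocessing here are assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode(images, texts):
    """Return a (B, 1024) context: 512-dim image features + 512-dim text features."""
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])      # (B, 512)
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])   # (B, 512)
    return torch.cat([img_feat, txt_feat], dim=-1)                               # (B, 1024)
```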
### Flow Matching Policy (~2MB)
```
Context Encoder:  1024 → 512 → 128   (LayerNorm, GELU, Dropout)
Time Embedding:   sinusoidal, 128-dim
Action Encoder:   7 → 128
Velocity Network: 384 → 512 → 256 → 7
```
### Gripper Classifier

```
Context → 512 → 256 → 2 (softmax)
```
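Putting these pieces together, here is a hedged PyTorch sketch of the policy's shape. Layer sizes follow the listing above; the exact activations, dropout rate, and time-embedding details are assumptions, so treat this as a reading aid rather than the repository's `ImprovedFlowMatchingPolicy`.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=128):
    """Sinusoidal embedding of a (B, 1) time tensor -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t * freqs                                    # (B, half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class FlowMatchingPolicySketch(nn.Module):
    def __init__(self, context_dim=1024, action_dim=7, hidden_dim=512):
        super().__init__()
        self.context_encoder = nn.Sequential(             # 1024 -> 512 -> 128
            nn.Linear(context_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.GELU(), nn.Dropout(0.1), nn.Linear(hidden_dim, 128))
        self.action_encoder = nn.Linear(action_dim, 128)  # 7 -> 128
        self.velocity_net = nn.Sequential(                # 384 -> 512 -> 256 -> 7
            nn.Linear(384, 512), nn.GELU(),
            nn.Linear(512, 256), nn.GELU(),
            nn.Linear(256, action_dim))
        self.gripper_head = nn.Sequential(                # context -> 512 -> 256 -> 2
            nn.Linear(context_dim, 512), nn.GELU(),
            nn.Linear(512, 256), nn.GELU(),
            nn.Linear(256, 2))

    def velocity(self, x_t, t, context):
        h = torch.cat([self.action_encoder(x_t),
                       sinusoidal_embedding(t),
                       self.context_encoder(context)], dim=-1)  # (B, 384)
        return self.velocity_net(h)

    def gripper_logits(self, context):
        return self.gripper_head(context)
```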
### Training Configuration

```
Epochs:        200
Batch Size:    32
Learning Rate: 1e-4 (cosine decay to 1e-5)
Optimizer:     AdamW (weight_decay=1e-4)
Flow Steps:    200 (Euler integration)
```
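The "Flow Steps: 200 (Euler integration)" entry means actions are sampled by integrating the learned velocity field from Gaussian noise (t=0) to an action (t=1). Here is a sketch using the method names from the policy sketch above, not the repository's actual `policy.sample`:

```python
import torch

@torch.no_grad()
def sample_action(policy, context, num_steps=200, device="cpu"):
    """Euler-integrate dx/dt = v(x, t, context) from t=0 (noise) to t=1 (action)."""
    batch = context.shape[0]
    x = torch.randn(batch, 7, device=device)            # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((batch, 1), i * dt, device=device)
        x = x + dt * policy.velocity(x, t, context)     # one Euler step
    # Separate gripper head (assumed mapping): class 1 -> close (+1), class 0 -> open (-1)
    grip = policy.gripper_logits(context).argmax(dim=-1, keepdim=True).float() * 2 - 1
    return torch.cat([x, grip], dim=-1)                 # (B, 8) action
```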
## Training Data

The synthetic environment generates pick-and-place demonstrations (a hypothetical storage format is sketched after this list):
- 1000 demonstrations with diverse object positions
- 6 cube colors: red, blue, green, yellow, purple, orange
- 24 instruction templates: "pick up the [color] cube", "grasp the [color] block", etc.
- 40 cm × 40 cm workspace with position and orientation variations
- 2D projection with 3D visual effects (shadows, shading)
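If you want to build a similar dataset yourself, one demonstration could be stored roughly as follows. The field names and conventions are hypothetical; check `vla_flow_matching.py` for the actual schema used in `demo_data.pkl`.

```python
import pickle
import numpy as np

# Hypothetical record for one synthetic pick demonstration.
demo = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),  # rendered 2D workspace view
    "instruction": "pick up the red cube",             # one of the ~24 templates
    "action": np.array([0.12, -0.05, 0.03,             # x, y, z (workspace units)
                        0.0, 0.0, 0.0, 1.0,            # quaternion (top-down grasp)
                        1.0], dtype=np.float32),       # gripper: close
}

with open("my_demos.pkl", "wb") as f:
    pickle.dump([demo], f)
```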
## Extending This Work

### Ideas for Students and Researchers
- Add more objects: Extend beyond cubes to spheres, cylinders
- Multi-step tasks: Chain pick → place actions
- Real images: Fine-tune on real robot data
- Better orientation: Improve quaternion prediction accuracy
- Action chunking: Predict action sequences instead of single steps
- Physics simulation: Replace 2D rendering with PyBullet/MuJoCo
### Fine-tuning for Real Robots

```bash
# Collect 10-50 real demonstrations, then:
python vla_flow_matching.py --mode finetune \
    --checkpoint vla_checkpoint_best.pt \
    --data_path real_robot_demos.pkl \
    --epochs 30 --lr 1e-5
```
## ⚠️ Limitations
This is an educational model with intentional simplifications:
- 2D synthetic environment (no physics)
- Single-object scenes only
- Limited orientation precision
- Not suitable for direct real-world deployment
- No temporal/sequential reasoning
Do NOT use for: Safety-critical applications, precision assembly, or autonomous operation without extensive testing.
## Acknowledgments

Built with:
- Hugging Face Transformers (CLIP)
- PyTorch
- NumPy & Matplotlib
Inspired by:
- Flow Matching (Lipman et al., 2023)
- CLIP (Radford et al., 2021)
- RT-1 (Brohan et al., 2022)
- OpenVLA (Kim et al., 2024)
## Citation

```bibtex
@misc{minimal-vla-2025,
  title={Minimal VLA: A Lightweight Vision-Language-Action Model for Education},
  author={LeTau},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/minimal-vla}
}
```
## License

MIT License. Feel free to use, modify, and share!

Questions? Open an issue or reach out. Happy learning!