🎓 Minimal VLA: The Simplest Vision-Language-Action Model

The lightest VLA implementation for learning and experimentation: only ~20MB!

A beginner-friendly, minimal Vision-Language-Action (VLA) model designed for educational purposes and rapid prototyping. This project demonstrates the core concepts of VLA systems using CLIP + Flow Matching in the simplest possible setup.

✨ Why This Project?

| Feature | This Model | Typical VLAs |
|---|---|---|
| Model Size | ~20MB | 1-7GB+ |
| Training Time | ~20 min | Hours to days |
| Hardware | Any GPU / CPU | High-end GPUs |
| Simulation | 2D rendering | Physics engines |
| Complexity | ~1000 lines | 10,000+ lines |
| Dependencies | PyTorch + CLIP | Complex stacks |

Perfect for:

  • 🎓 Students learning VLA fundamentals
  • 🔬 Researchers prototyping new ideas quickly
  • 👨‍🏫 Educators teaching robot learning concepts
  • 🚀 Developers building their first VLA system

πŸ—οΈ Model Overview

This minimal VLA predicts 8-DOF robotic actions from RGB images and natural language:

Input: Image (224×224) + Text ("pick up the red cube")
   ↓
CLIP ViT-B/32 (frozen, vision + language encoding)
   ↓
Flow Matching Policy (~2MB trainable parameters)
   ↓
Output: [x, y, z, qx, qy, qz, qw, gripper]
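
For reference, the 8 action dimensions unpack as shown below. This is a minimal sketch: the sample values are illustrative, and the gripper sign convention follows the inference example further down.

import numpy as np

# Hypothetical predicted action in the layout above: [x, y, z, qx, qy, qz, qw, gripper]
action = np.array([0.12, -0.05, 0.20, 0.0, 0.0, 0.0, 1.0, 1.0])

position   = action[:3]    # end-effector position
quaternion = action[3:7]   # orientation as (qx, qy, qz, qw)
gripper    = action[7]     # > 0 means CLOSE, otherwise OPEN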

Key Design Choices for Simplicity

  1. Frozen CLIP Backbone: No need to train vision-language understanding
  2. 2D Synthetic Environment: No physics engine required
  3. Flow Matching: Elegant generative approach for continuous actions
  4. Separate Gripper Classifier: Binary decision for open/close

📊 Performance

Evaluated on 10 test samples from 1000 synthetic demonstrations:

| Metric | Value | Notes |
|---|---|---|
| Position Error | 8.60cm | Suitable for ~5cm cube picking |
| Gripper Accuracy | 75% | Reliable grasp planning |
| Overall MAE | 0.1217 | Across all 8 action dimensions |
| Quaternion Error | 19.36° | Best for top-down grasps |

⚠️ Note: This is an educational model trained on simplified 2D projections. Real-world deployment requires fine-tuning on actual robot data.

🚀 Quick Start

Installation

pip install torch transformers pillow numpy matplotlib

Inference

from vla_flow_matching import VLM_Encoder, ImprovedFlowMatchingPolicy
import torch

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load('vla_checkpoint_best.pt', map_location=device)

vlm_encoder = VLM_Encoder().to(device)
policy = ImprovedFlowMatchingPolicy(action_dim=8, context_dim=1024, hidden_dim=512).to(device)
policy.load_state_dict(checkpoint['policy'])
policy.eval()

# Predict!
from PIL import Image
image = Image.open('workspace.jpg').resize((224, 224))
context = vlm_encoder.encode([image], ["pick up the red cube"])
action = policy.sample(context, num_samples=1, device=device)

print(f"Position: {action[0, :3].cpu().numpy()}")
print(f"Gripper: {'CLOSE' if action[0, 7] > 0 else 'OPEN'}")

Train from Scratch (~20 minutes)

# Step 1: Generate synthetic data
python vla_flow_matching.py --mode generate_data --num_demos 1000

# Step 2: Train (takes ~20 min on consumer GPU)
python vla_flow_matching.py --mode train --epochs 200 --batch_size 32

# Step 3: Evaluate
python vla_flow_matching.py --mode replay --checkpoint vla_checkpoint_best.pt

πŸ“ Repository Structure

├── vla_flow_matching.py    # Complete implementation (~1000 lines)
├── vla_checkpoint_best.pt  # Trained weights (~20MB)
├── demo_data.pkl           # Training data (1000 demos)
├── replay_results.png      # Evaluation visualization
└── README.md               # This file

🎯 What You'll Learn

This codebase teaches core VLA concepts:

  1. Vision-Language Encoding: Using CLIP for joint image-text understanding
  2. Flow Matching: A modern generative approach for action prediction (a sketch of the training objective follows this list)
  3. Action Representation: 8-DOF with quaternion rotations
  4. Synthetic Data Generation: Creating training environments without physics
  5. Model Architecture: Combining frozen backbones with trainable policies
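
For concept 2, the training objective is worth seeing concretely. Below is a minimal sketch of a rectified-flow style flow-matching loss; function and argument names are illustrative, not the actual API of vla_flow_matching.py.

import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_net, actions, context):
    # actions: (B, 7) continuous action targets; context: (B, 1024) CLIP features
    noise = torch.randn_like(actions)                            # x_0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, device=actions.device)   # t ~ U[0, 1]
    x_t = (1 - t) * noise + t * actions                          # point on the straight-line path
    target_velocity = actions - noise                            # constant velocity along that path
    pred_velocity = velocity_net(x_t, t, context)                # network predicts the velocity field
    return F.mse_loss(pred_velocity, target_velocity)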

🔧 Architecture Details

VLM Encoder (Frozen CLIP)

  • Vision: ViT-B/32 → 512-dim features
  • Text: Transformer → 512-dim features
  • Combined: 1024-dim context vector
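
One way to produce such a 1024-dim context with the Hugging Face Transformers CLIP API (a sketch that assumes the two 512-dim embeddings are simply concatenated; the repository's VLM_Encoder may differ in detail):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # frozen backbone
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("workspace.jpg")
inputs = processor(text=["pick up the red cube"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])      # (1, 512)
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])   # (1, 512)

context = torch.cat([img_feat, txt_feat], dim=-1)   # (1, 1024) context vector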

Flow Matching Policy (~2MB)

Context Encoder: 1024 → 512 → 128 (with LayerNorm, GELU, Dropout)
Time Embedding: Sinusoidal 128-dim
Action Encoder: 7D → 128
Velocity Network: 384 → 512 → 256 → 7
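
Assembled into a module, the velocity network could look roughly like this. It is a sketch following the dimensions listed above; layer names, the dropout rate, and the time-embedding details are assumptions.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim=128):
    # Map a scalar flow time t in [0, 1], shape (B, 1), to a (B, dim) sinusoidal embedding.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1))
    angles = t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class VelocityNetSketch(nn.Module):
    def __init__(self, action_dim=7, context_dim=1024, hidden_dim=512):
        super().__init__()
        # Context encoder: 1024 -> 512 -> 128
        self.context_enc = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.GELU(), nn.Dropout(0.1),
            nn.Linear(hidden_dim, 128),
        )
        self.action_enc = nn.Linear(action_dim, 128)   # Action encoder: 7 -> 128
        # Velocity network: (128 context + 128 time + 128 action) = 384 -> 512 -> 256 -> 7
        self.velocity = nn.Sequential(
            nn.Linear(384, 512), nn.GELU(),
            nn.Linear(512, 256), nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, x_t, t, context):
        t_emb = sinusoidal_embedding(t, dim=128)       # Time embedding: sinusoidal, 128-dim
        h = torch.cat([self.context_enc(context), t_emb, self.action_enc(x_t)], dim=-1)
        return self.velocity(h)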

Gripper Classifier

Context → 512 → 256 → 2 (softmax)
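
And the gripper head, following the same pattern (activations are an assumption; the source only specifies the layer widths and the final softmax):

import torch.nn as nn

gripper_head = nn.Sequential(
    nn.Linear(1024, 512), nn.GELU(),
    nn.Linear(512, 256), nn.GELU(),
    nn.Linear(256, 2),              # two logits -> softmax over {open, close}
)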

Training Configuration

Epochs: 200
Batch Size: 32
Learning Rate: 1e-4 (cosine decay to 1e-5)
Optimizer: AdamW (weight_decay=1e-4)
Flow Steps: 200 (Euler integration)
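
Sampling at inference time then integrates the learned velocity field with the Euler method over the 200 flow steps listed above. A sketch of what policy.sample does conceptually, per this configuration (exact details in the repository may differ):

import torch

def euler_sample(velocity_net, context, num_steps=200, action_dim=7):
    # Integrate dx/dt = v(x, t, context) from t=0 to t=1, starting from Gaussian noise.
    x = torch.randn(context.shape[0], action_dim, device=context.device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((context.shape[0], 1), i * dt, device=context.device)
        x = x + velocity_net(x, t, context) * dt
    return x   # 7-D action (position + quaternion); the gripper bit comes from the classifier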

🌈 Training Data

The synthetic environment generates pick-and-place demonstrations (a simplified sketch of one record follows the list):

  • 1000 demonstrations with diverse object positions
  • 6 cube colors: red, blue, green, yellow, purple, orange
  • 24 instruction templates: "pick up the [color] cube", "grasp the [color] block", etc.
  • 40cm × 40cm workspace with position and orientation variations
  • 2D projection with 3D visual effects (shadows, shading)
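
A simplified sketch of what one demonstration record might contain. The real generator also renders the 224×224 image with shadows/shading and uses all 24 instruction templates; values and field names here are illustrative.

import random
import numpy as np

COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]
TEMPLATES = ["pick up the {} cube", "grasp the {} block"]   # two of the 24 templates

def sample_demo(workspace=0.40):
    color = random.choice(COLORS)
    # Random cube position inside the 40cm x 40cm workspace (centered at the origin here)
    x, y = np.random.uniform(-workspace / 2, workspace / 2, size=2)
    yaw = np.random.uniform(-np.pi, np.pi)                   # orientation variation
    # Top-down grasp: rotation about the vertical axis only, expressed as (qx, qy, qz, qw)
    action = np.array([x, y, 0.05, 0.0, 0.0, np.sin(yaw / 2), np.cos(yaw / 2), 1.0])
    return {"instruction": random.choice(TEMPLATES).format(color),
            "cube_color": color,
            "action": action}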

⚡ Extending This Work

Ideas for Students/Researchers

  1. Add more objects: Extend beyond cubes to spheres, cylinders
  2. Multi-step tasks: Chain pick → place actions
  3. Real images: Fine-tune on real robot data
  4. Better orientation: Improve quaternion prediction accuracy
  5. Action chunking: Predict action sequences instead of single steps
  6. Physics simulation: Replace 2D rendering with PyBullet/MuJoCo

Fine-tuning for Real Robots

# Collect 10-50 real demonstrations, then:
python vla_flow_matching.py --mode finetune \
    --checkpoint vla_checkpoint_best.pt \
    --data_path real_robot_demos.pkl \
    --epochs 30 --lr 1e-5

⚠️ Limitations

This is an educational model with intentional simplifications:

  • ❌ 2D synthetic environment (no physics)
  • ❌ Single-object scenes only
  • ❌ Limited orientation precision
  • ❌ Not suitable for direct real-world deployment
  • ❌ No temporal/sequential reasoning

Do NOT use for: Safety-critical applications, precision assembly, or autonomous operation without extensive testing.

πŸ™ Acknowledgments

Built with:

  • 🤗 Transformers (CLIP)
  • 🔥 PyTorch
  • 📊 NumPy & Matplotlib

📚 Citation

@misc{minimal-vla-2025,
  title={Minimal VLA: A Lightweight Vision-Language-Action Model for Education},
  author={LeTau},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/minimal-vla}
}

📄 License

MIT License. Feel free to use, modify, and share!


Questions? Open an issue or reach out. Happy learning! 🤖
