🥖 Baguette

A Distributed Inference Engine for Paris MoE Diffusion Models

Fast, efficient inference for the 5-billion-parameter Paris Mixture-of-Experts text-to-image model

⚡ Quick Start

# Clone the repo
git clone https://huggingface.co/nbagel/baguette
cd baguette

# Install dependencies
pip install uv && uv pip install torch torchvision safetensors transformers diffusers accelerate tqdm

# Generate images
python generate.py --prompt "a cute cat" --num_samples 4

Output: output_bf16.png with 4 generated images.

🎨 Generation Examples

# Basic generation (4 images, top-2 routing, 30 steps)
python generate.py --prompt "sunset over mountains" --num_samples 4

# See expert routing visualization
python generate.py --prompt "abstract art" --visualize

# Faster generation
python generate.py --prompt "a happy dog" --num_steps 20

# Lower memory usage (offload experts to CPU)
python generate.py --prompt "portrait of a scientist" --offload 4

# INT8 quantized (smaller weights)
python generate.py --prompt "enchanted forest" --precision int8

🔮 Expert Routing Visualization

Baguette includes real-time visualization of the MoE router's expert selection. Use --visualize to see which experts are activated:

╭──────────────────────────────────────────────────╮
│           ⚡ EXPERT USAGE DISTRIBUTION            │
├──────────────────────────────────────────────────┤
│ → E4  │████████████████████████████│ 40.6% │
│   E2  │██████████████████████████▎ │ 36.7% │
│   E6  │██████████▌                 │ 14.8% │
│   E1  │███▊                        │  5.5% │
│   E5  │█▋                          │  2.3% │
│   E0  │                            │  0.0% │
│   E3  │                            │  0.0% │
│   E7  │                            │  0.0% │
├──────────────────────────────────────────────────┤
│  Active: 5/8 experts   Calls: 128               │
╰──────────────────────────────────────────────────╯

╭──────────────────────────────────────────────────╮
│            📈 ROUTING TIMELINE                   │
├──────────────────────────────────────────────────┤
│ Step  0  1  2  3  4  5  6  7  8  9 10 11 12 13  │
│ ───────────────────────────────────────────────  │
│  E0   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E1   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E2   ·  ·  ·  ·  ·  ●  ●  ●  ●  ●  ●  ●  ●  ●  │
│  E3   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E4   ·  ·  ●  ●  ●  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E5   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E6   ●  ●  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
│  E7   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  │
├──────────────────────────────────────────────────┤
│  Routing changes:   2/13 steps (15%)            │
╰──────────────────────────────────────────────────╯

The router dynamically selects different experts based on the noise level at each diffusion timestep. Early steps (high noise) often use different experts than later steps (low noise).
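The summary line at the bottom of the timeline can be recomputed from a per-step list of selected experts. The following is a minimal sketch, not Baguette's own code: `routing_stats` is a hypothetical helper, and the input list is simply the top-1 selections read off the timeline shown above.

```python
from collections import Counter


def routing_stats(selected):
    """Summarize a per-step list of selected expert indices.

    Returns (share of steps per expert, number of routing changes,
    number of step-to-step transitions).
    """
    usage = Counter(selected)
    total = len(selected)
    shares = {expert: count / total for expert, count in usage.items()}
    # A routing change is any step whose expert differs from the previous step's.
    changes = sum(1 for prev, cur in zip(selected, selected[1:]) if prev != cur)
    return shares, changes, total - 1


# Read off the timeline: E6 for steps 0-1, E4 for steps 2-4, E2 for steps 5-13.
timeline = [6, 6, 4, 4, 4] + [2] * 9
shares, changes, transitions = routing_stats(timeline)
print(f"Routing changes: {changes}/{transitions} steps ({changes / transitions:.0%})")
```

With 14 steps there are 13 transitions, and the expert changes twice (E6 to E4, then E4 to E2), which is where the 2/13 figure comes from.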

📋 Command Reference

| Flag | Default | Description |
|------|---------|-------------|
| `--prompt` | `"a cute cat"` | Text description of the image to generate |
| `--num_samples` | `16` | Number of images to generate |
| `--num_steps` | `30` | Diffusion sampling steps (15-50) |
| `--cfg_scale` | `7.5` | Classifier-free guidance scale (5-12) |
| `--precision` | `bf16` | Weight precision: `bf16` or `int8` |
| `--topk` | `2` | Number of experts per sample (1-8) |
| `--offload` | `0` | Experts to offload to CPU RAM (0-7) |
| `--visualize` | `false` | Show expert routing statistics |
| `--output` | `auto` | Custom output filename |
| `--seed` | `999` | Random seed for reproducibility |
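For scripting around `generate.py`, it can help to have the flags and defaults in parseable form. The parser below is a hypothetical mirror of the table built with the standard `argparse` module; the real parser inside `generate.py` is not shown in this README and may differ (for example in how it validates ranges).

```python
import argparse


def build_parser():
    """Hypothetical mirror of the documented generate.py flags and defaults."""
    p = argparse.ArgumentParser(description="Mirror of the documented generate.py flags")
    p.add_argument("--prompt", default="a cute cat")
    p.add_argument("--num_samples", type=int, default=16)
    p.add_argument("--num_steps", type=int, default=30)
    p.add_argument("--cfg_scale", type=float, default=7.5)
    p.add_argument("--precision", choices=["bf16", "int8"], default="bf16")
    p.add_argument("--topk", type=int, choices=range(1, 9), default=2)     # 1-8 experts
    p.add_argument("--offload", type=int, choices=range(0, 8), default=0)  # 0-7 experts
    p.add_argument("--visualize", action="store_true")
    p.add_argument("--output", default="auto")
    p.add_argument("--seed", type=int, default=999)
    return p


args = build_parser().parse_args(["--prompt", "a happy dog", "--num_steps", "20"])
print(args.prompt, args.num_steps, args.topk, args.precision)
```

Flags that are not passed fall back to the documented defaults, so the example above keeps top-2 routing and BF16 weights.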

🏗️ Model Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     PARIS MoE ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Input: Text Prompt ──→ CLIP ViT-L/14 ──→ Text Embeddings     │
│                                                                 │
│   Noise: z ~ N(0,1) ──→ 32×32×4 Latent                         │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                  DiT-B/2 ROUTER                         │  │
│   │            (12 layers, 768 dim, 129M params)            │  │
│   │                         │                               │  │
│   │            Selects Top-K Experts per Step               │  │
│   └─────────────────────────────────────────────────────────┘  │
│                              │                                  │
│          ┌───────────────────┼───────────────────┐             │
│          ▼                   ▼                   ▼             │
│   ┌────────────┐      ┌────────────┐      ┌────────────┐       │
│   │  Expert 0  │      │  Expert 1  │ ···  │  Expert 7  │       │
│   │  DiT-XL/2  │      │  DiT-XL/2  │      │  DiT-XL/2  │       │
│   │   606M     │      │   606M     │      │   606M     │       │
│   └────────────┘      └────────────┘      └────────────┘       │
│          │                   │                   │             │
│          └───────────────────┼───────────────────┘             │
│                              ▼                                  │
│                   Weighted Velocity Prediction                  │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                 SD-VAE DECODER                          │  │
│   │              Latent ──→ 256×256 RGB                     │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│  Total: ~5 Billion Parameters  │  8 Specialized Experts        │
└─────────────────────────────────────────────────────────────────┘
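The diagram ends in a "Weighted Velocity Prediction" that merges the outputs of the selected experts. How Paris computes those weights is not stated in this README, so the sketch below makes an assumption: a softmax over the top-K router scores, followed by a weighted sum. The function names and the example scores are made up for illustration, and plain Python lists stand in for tensors.

```python
import math


def topk_weights(logits, k):
    """Pick the k largest router scores and softmax-normalize over just those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def combine(velocities, weights):
    """Weighted sum of per-expert velocity vectors."""
    dim = len(next(iter(velocities.values())))
    return [sum(weights[i] * velocities[i][d] for i in weights) for d in range(dim)]


# Made-up router scores for the 8 experts; experts 2 and 4 tie for the top score.
logits = [0.1, 0.3, 2.0, -1.0, 2.0, 0.0, 0.5, -0.2]
w = topk_weights(logits, k=2)
v = combine({2: [1.0, 0.0], 4: [0.0, 1.0]}, w)
print(w, v)
```

Only the K selected experts need to run a forward pass, which is why top-2 routing touches two of the eight DiT-XL/2 experts per step.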

💾 Available Weights

| Format | Size | Quality | Speed | Use Case |
|--------|------|---------|-------|----------|
| BF16 | 9.3 GB | ⭐⭐⭐⭐⭐ | Fastest | Production, best quality |
| INT8 | 4.8 GB | ⭐⭐⭐⭐ | Fast | Memory-constrained GPUs |
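As a rough sanity check, the weight sizes can be estimated from the parameter counts in the architecture diagram (8 experts at 606M each plus a 129M router). This back-of-the-envelope sketch assumes 2 bytes per parameter for BF16 and 1 byte for INT8, reports sizes in GiB, and ignores the VAE, the text encoder, and any quantization metadata.

```python
EXPERT_PARAMS = 606e6  # per expert, from the architecture diagram
ROUTER_PARAMS = 129e6  # DiT-B/2 router
NUM_EXPERTS = 8
GIB = 1024 ** 3

total_params = NUM_EXPERTS * EXPERT_PARAMS + ROUTER_PARAMS
bf16_gib = total_params * 2 / GIB  # 2 bytes per BF16 parameter
int8_gib = total_params * 1 / GIB  # 1 byte per INT8 parameter

print(f"{total_params / 1e9:.2f}B params, BF16 ~{bf16_gib:.1f} GiB, INT8 ~{int8_gib:.1f} GiB")
```

The BF16 estimate lands on the listed 9.3 GB. The raw INT8 estimate comes out slightly below the listed 4.8 GB; this sketch does not attempt to account for the difference.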

🖥️ Memory Requirements

| Configuration | GPU VRAM | Speed | Notes |
|---------------|----------|-------|-------|
| BF16, no offload | ~25 GB | ~3 img/s | Best performance |
| BF16, offload 4 | ~14 GB | ~1 img/s | RTX 4090 / A6000 |
| BF16, offload 6 | ~8 GB | ~0.5 img/s | RTX 3080/4080 |
| INT8, no offload | ~12 GB | ~2 img/s | Good balance |
| INT8, offload 4 | ~8 GB | ~0.5 img/s | Consumer GPUs |
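The table can be turned into a small lookup that suggests flags for a given VRAM budget. This is an illustrative helper only: `pick_config` is hypothetical, the numbers are copied from the table as approximations, and the selection rule (fastest configuration that fits) ignores the quality difference between BF16 and INT8.

```python
# (precision, experts offloaded, approx. VRAM in GB, approx. images/s), from the table.
CONFIGS = [
    ("bf16", 0, 25, 3.0),
    ("bf16", 4, 14, 1.0),
    ("bf16", 6, 8, 0.5),
    ("int8", 0, 12, 2.0),
    ("int8", 4, 8, 0.5),
]


def pick_config(vram_gb):
    """Return the fastest listed configuration that fits in vram_gb, or None."""
    fitting = [c for c in CONFIGS if c[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda c: c[3])


precision, offload, _, _ = pick_config(16)
print(f"python generate.py --precision {precision} --offload {offload}")
```

For a 16 GB card this suggests INT8 without offloading, since it is the fastest row that fits; a user who prefers BF16 quality would instead pick `--offload 4` from the table by hand.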

🔧 Utilities

Benchmarking

python benchmark.py --quick                    # Fast benchmark
python benchmark.py --output results.md        # Full benchmark, save results

Weight Conversion

# Convert PyTorch checkpoints to BF16 SafeTensors
python quantize.py --input /path/to/weights --output ./weights/bf16 --format bf16

# Convert BF16 to INT8
python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
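This README does not describe the quantization scheme that `quantize.py` uses. For readers unfamiliar with INT8 weights in general, the sketch below shows one common approach, symmetric per-tensor quantization, on a plain Python list. It is an assumption for illustration and may not match what `quantize.py` actually does.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the stored scale."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of two, at the cost of a rounding error of at most half a quantization step per weight.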

🚀 Future: Distributed Inference with Tailscale + Erlang

Baguette is being developed as a fully distributed inference engine that can run across multiple machines connected via Tailscale VPN, orchestrated by an Erlang/OTP supervisor.

🌐 Architecture Vision

┌─────────────────────────────────────────────────────────────────────────┐
│                    BAGUETTE DISTRIBUTED NETWORK                         │
│                         (Up to 8 Nodes)                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────┐      Tailscale VPN Mesh      ┌─────────────┐         │
│   │   Node 1    │◄────────────────────────────►│   Node 2    │         │
│   │ ┌─────────┐ │                              │ ┌─────────┐ │         │
│   │ │ Router  │ │                              │ │ Router  │ │         │
│   │ │   VAE   │ │                              │ │   VAE   │ │         │
│   │ │Expert 0 │ │                              │ │Expert 1 │ │         │
│   │ └─────────┘ │                              │ └─────────┘ │         │
│   └──────┬──────┘                              └──────┬──────┘         │
│          │                                            │                 │
│          │         ┌──────────────────┐              │                 │
│          └────────►│  Erlang/OTP      │◄─────────────┘                 │
│                    │  Coordinator     │                                 │
│          ┌────────►│                  │◄─────────────┐                 │
│          │         │  • Load Balance  │              │                 │
│          │         │  • Fault Tolerant│              │                 │
│          │         │  • Auto-Healing  │              │                 │
│          │         └──────────────────┘              │                 │
│          │                                            │                 │
│   ┌──────┴──────┐                              ┌──────┴──────┐         │
│   │   Node 3    │◄────────────────────────────►│   Node 4    │         │
│   │ ┌─────────┐ │           ...                │ ┌─────────┐ │         │
│   │ │ Router  │ │                              │ │ Router  │ │         │
│   │ │   VAE   │ │        (up to 8 nodes)       │ │   VAE   │ │         │
│   │ │Expert 2 │ │                              │ │Expert 3 │ │         │
│   │ └─────────┘ │                              │ └─────────┘ │         │
│   └─────────────┘                              └─────────────┘         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

🎯 Key Features (Planned)

| Feature | Description |
|---------|-------------|
| Self-Organizing Network | Nodes automatically discover peers and negotiate roles |
| Adaptive Load Balancing | Routes requests based on real-time latency and compute availability |
| Auto-Benchmarking | Each node benchmarks GPU/CPU speed, VRAM, RAM, and network throughput |
| Fault Tolerance | Erlang supervisors restart failed nodes, redistribute load |
| 1 Expert Per Node | Each node loads only 1 expert (~2.7GB VRAM) plus router & VAE |
| Latency-Aware Routing | Prioritizes low-latency nodes for time-sensitive steps |
| Zero Configuration | Just join the Tailscale network and run: automatic peer discovery |

📊 Node Self-Benchmarking

When a node joins the network, it automatically benchmarks:

┌────────────────────────────────────────┐
│         NODE CAPABILITY REPORT         │
├────────────────────────────────────────┤
│  GPU: NVIDIA RTX 4090                  │
│  VRAM: 24 GB                           │
│  GPU Compute: 847 TFLOPS (FP16)        │
│  ────────────────────────────────────  │
│  CPU: AMD Ryzen 9 7950X                │
│  RAM: 64 GB                            │
│  CPU Compute: 2.1 TFLOPS               │
│  ────────────────────────────────────  │
│  Network Latency to Peers:             │
│    → Node 2: 12ms                      │
│    → Node 3: 8ms                       │
│    → Node 4: 45ms                      │
│  Network Bandwidth: 940 Mbps           │
│  ────────────────────────────────────  │
│  Assigned Expert: E0                   │
│  Status: READY                         │
└────────────────────────────────────────┘
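Since this feature is still planned, no report format is defined yet. As one way such a report might be represented in code, the sketch below uses a dataclass with a few of the fields from the example above and a helper that picks the lowest-latency peer. `NodeReport` and `nearest_peer` are hypothetical names, not part of Baguette.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """Hypothetical container for a subset of the capability report fields."""
    gpu: str
    vram_gb: int
    assigned_expert: int
    peer_latency_ms: dict = field(default_factory=dict)

    def nearest_peer(self):
        """Name of the peer with the lowest measured latency."""
        return min(self.peer_latency_ms, key=self.peer_latency_ms.get)


report = NodeReport(
    gpu="NVIDIA RTX 4090",
    vram_gb=24,
    assigned_expert=0,
    peer_latency_ms={"Node 2": 12, "Node 3": 8, "Node 4": 45},
)
print(report.nearest_peer())  # Node 3
```

A latency-aware coordinator could use measurements like these to prefer nearby nodes when several hold the expert it needs.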

🔄 Distributed Inference Flow

1. Request arrives at any node
2. Router runs locally → selects top-K experts needed
3. Coordinator dispatches expert calls to appropriate nodes
4. Nodes compute in parallel → return velocity predictions
5. Results aggregated → Euler step applied
6. VAE decodes locally → image returned to requester

Once implemented, this would let the full 5B-parameter model run across consumer hardware, since each machine needs only ~4GB VRAM to hold one expert.
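The dispatch, aggregate, and Euler-step part of this flow can be sketched in a single process. Everything here is a stand-in: `run_expert` fakes a remote expert call with a toy velocity, a thread pool plays the role of the coordinator, the routing weights are fixed rather than recomputed by a router each step, and the update `x + dt * v` is one common Euler convention that may differ in sign or direction from the actual sampler.

```python
from concurrent.futures import ThreadPoolExecutor


def run_expert(expert_id, latent, t):
    """Stand-in for a remote expert call; returns a toy velocity vector."""
    return [-x * (expert_id + 1) * 0.1 for x in latent]


def denoise_step(latent, t, dt, weights, pool):
    """Dispatch the selected experts in parallel, aggregate, apply one Euler step."""
    futures = {e: pool.submit(run_expert, e, latent, t) for e in weights}
    velocities = {e: f.result() for e, f in futures.items()}
    v = [sum(weights[e] * velocities[e][d] for e in weights) for d in range(len(latent))]
    return [x + dt * vd for x, vd in zip(latent, v)]


latent = [1.0, -2.0]         # toy 2-value "latent"
weights = {2: 0.5, 4: 0.5}   # fixed top-2 weights for illustration
num_steps = 4

with ThreadPoolExecutor(max_workers=2) as pool:
    for step in range(num_steps):
        latent = denoise_step(latent, step / num_steps, 1 / num_steps, weights, pool)

print(latent)
```

In the planned design the calls in `denoise_step` would go over the network to whichever nodes hold the selected experts, and VAE decoding would happen after the final step on the requesting node.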

📁 Repository Structure

baguette/
├── generate.py          # 🎨 Main generation script
├── benchmark.py         # 📊 Performance benchmarking
├── quantize.py          # 🔧 Weight format conversion
├── requirements.txt     # 📦 Python dependencies
├── README.md            # 📖 This file
├── src/                 # 🧠 Model architecture code
│   ├── models.py        # DiT expert & router definitions
│   ├── vae_utils.py     # VAE encoding/decoding
│   ├── config.py        # Configuration dataclass
│   └── schedules.py     # Noise schedules
└── weights/             # 💾 Model weights
    ├── bf16/            # BFloat16 SafeTensors (9.3 GB)
    │   ├── expert_0.safetensors ... expert_7.safetensors
    │   ├── router.safetensors
    │   └── config.pt
    └── int8/            # INT8 Quantized (4.8 GB)
        ├── expert_0.safetensors ... expert_7.safetensors
        └── router.safetensors

🔗 Links

Original Model: bageldotcom/paris
This Repository: nbagel/baguette

📜 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See LICENSE for details.

Made with 🥖 by the Baguette Team

Distributed inference for everyone
