---
model-index:
- name: >-
    Granite-4.0-H-Tiny-MoE — MLX (Apple Silicon), 5-bit (with guidance for
    2/3/4/6-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 5bit
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---
# Granite-4.0-H-Tiny — MLX 5-bit (Apple Silicon)

**Maintainer / Publisher:** Susant Achary
This repository provides an Apple-Silicon MLX build of IBM Granite-4.0-H-Tiny quantized to 5-bit.
If you need more faithfulness than 3/4-bit but want lower RAM than 6-bit, 5-bit is a strong middle ground—especially for document parsing, structured extraction, and long-context assistants on Mac.
## 🔎 About Granite 4.0 (context)
- Architecture: Hybrid Mamba-2 + softmax attention; the H tiers add Mixture-of-Experts (MoE) layers with sparse activation per token.
- Tier: H-Tiny (~7B total params with ~1B active via MoE), designed for efficient long-context inference.
- License: Apache-2.0 (permissive, enterprise-friendly).
- Typical uses: Instruction following, long-context assistants, RAG pipelines, structured outputs.
This card documents the MLX 5-bit conversion. See the comparison table below for when to choose 3/4-bit (lower RAM) or 6-bit (highest fidelity).
## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (5-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)
**Target platform:** macOS on Apple Silicon (M-series) with Metal/MPS acceleration.
## ✅ Intended use
- High-quality instruction following and summarization with long context
- Document / form / table parsing and JSON extraction (schema-guided prompts; see the sketch after this list)
- On-device prototyping where accuracy matters but RAM is modest
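
For the document/JSON use case, one workable pattern is to inline the target schema in the prompt and ask the model to answer with JSON only. A minimal sketch in Python; the schema fields and document text below are hypothetical placeholders, not part of this repo:

```python
import json

# Hypothetical target schema -- replace with the fields your documents actually need.
schema = {
    "invoice_number": "string",
    "issue_date": "YYYY-MM-DD",
    "total_amount": "number",
    "line_items": [{"description": "string", "quantity": "number", "unit_price": "number"}],
}

document_text = "<paste the parsed document text here>"

prompt = (
    "Extract the following fields from the document and answer with JSON only, "
    "matching this schema exactly (use null for missing values):\n"
    f"{json.dumps(schema, indent=2)}\n\n"
    f"Document:\n{document_text}"
)
print(prompt)
```

Pairing a prompt like this with deterministic decoding (temperature 0, as in the Quickstart below) helps keep the emitted JSON stable across runs.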
## ⚠️ Limitations
- Still quantized: some regressions vs FP16 can surface on intricate math/code.
- KV cache / context length can dominate RAM at very long windows—monitor budgets (a rough estimator follows this list).
- Add your own guardrails and safety for production.
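
To make the RAM point concrete, the sketch below gives a rough upper bound for the attention KV cache. The architecture numbers are illustrative placeholders, not the real Granite-4.0-H-Tiny configuration (read the actual values from the model's `config.json`); in a hybrid Mamba-2 model only the attention layers keep a growing KV cache, which is why these models stay comparatively cheap at long context.

```python
def kv_cache_gb(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Rough upper bound for the attention KV cache: 2 tensors (K and V) per attention layer."""
    total_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / 1024**3

# Placeholder numbers for illustration only -- substitute values from config.json.
print(f"{kv_cache_gb(seq_len=128_000, n_attn_layers=4, n_kv_heads=8, head_dim=128):.2f} GB")
```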
## 🔢 Choosing a quantization level (MLX on Apple Silicon)
Indicative ranges for a ~7B hybrid MoE LM (actual usage varies by context length and batch size).
| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| 3-bit | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise | Great default on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention vs 3-bit | If 3-bit misses small details |
| 5-bit (this repo) | ~8–9 GB | 🔥🔥☆ | Higher fidelity, fewer omissions | When you want stronger document/JSON faithfulness without 6-bit RAM |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Highest MLX fidelity | If RAM permits and you need maximum quality |
**Rules of thumb**
- Start at 5-bit for document/structured tasks on 8–16 GB Macs.
- Drop to 3/4-bit for tighter RAM / higher speed.
- Move to 6-bit if you still see omissions or slight distortions in outputs.
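
To produce one of the other bit-widths locally instead of downloading it, mlx-lm can quantize the base checkpoint during conversion. A minimal sketch assuming the `mlx_lm.convert` Python API; argument names and defaults may differ between mlx-lm releases:

```python
from mlx_lm import convert

# Quantize the FP16 base checkpoint to 4-bit MLX shards; change q_bits to 3, 5, 6, etc.
convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",
    mlx_path="granite-4.0-h-tiny-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```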
## 🚀 Quickstart (CLI — MLX)
**Deterministic generation**
```bash
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```

MLX runs on the Metal GPU automatically on Apple Silicon; no device flag is needed.
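
The same build can also be driven from Python. A minimal sketch using mlx-lm's `load`/`generate` API, assuming `<this-repo-id>` is replaced with this repo's actual id or a local path (generation keyword arguments vary slightly between mlx-lm releases):

```python
from mlx_lm import load, generate

# Load the quantized MLX weights and tokenizer from this repo (or a local path).
model, tokenizer = load("<this-repo-id>")

messages = [{"role": "user", "content": "Summarize the following in 5 bullet points:\n<your text>"}]
# Format the request with the model's chat template before generating.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```

Applying the chat template explicitly keeps the Python path consistent with the CLI, which typically applies the model's chat template by default.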