---
model-index:
- name: >-
    Granite-4.0-H-Tiny-MoE — MLX (Apple Silicon), 5-bit (with guidance for
    2/3/4/6-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 5bit
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---
# Granite-4.0-H-Tiny — MLX 5-bit (Apple Silicon)

**Maintainer / Publisher:** Susant Achary
This repository provides an Apple-Silicon MLX build of IBM Granite-4.0-H-Tiny quantized to 5-bit.
If you need more faithfulness than 3/4-bit but want lower RAM than 6-bit, 5-bit is a strong middle ground—especially for document parsing, structured extraction, and long-context assistants on Mac.
## 🔎 About Granite 4.0 (context)
- Architecture: Hybrid Mamba-2 + softmax attention; the H tiers add Mixture-of-Experts (MoE) layers with sparse activation per token.
- Tier: H-Tiny (~7B total params with ~1B active via MoE), designed for efficient long-context inference.
- License: Apache-2.0 (permissive, enterprise-friendly).
- Typical uses: Instruction following, long-context assistants, RAG pipelines, structured outputs.
This card documents the MLX 5-bit conversion. See the comparison table below for when to choose 3/4-bit (lower RAM) or 6-bit (highest fidelity).
## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (5-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)
**Target platform:** macOS on Apple Silicon (M-series) with Metal/MPS acceleration.
## ✅ Intended use
- High-quality instruction following and summarization with long context
- Document / form / table parsing and JSON extraction (schema-guided prompts; see the sketch after this list)
- On-device prototyping where accuracy matters but RAM is modest
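
For the document/JSON use case, one workable pattern is to inline the target schema in the prompt and ask the model to answer with JSON only. A minimal sketch in Python; the schema fields and document text below are hypothetical placeholders, not part of this repo:

```python
import json

# Hypothetical target schema -- replace with the fields your documents actually need.
schema = {
    "invoice_number": "string",
    "issue_date": "YYYY-MM-DD",
    "total_amount": "number",
    "line_items": [{"description": "string", "quantity": "number", "unit_price": "number"}],
}

document_text = "<paste the parsed document text here>"

prompt = (
    "Extract the following fields from the document and answer with JSON only, "
    "matching this schema exactly (use null for missing values):\n"
    f"{json.dumps(schema, indent=2)}\n\n"
    f"Document:\n{document_text}"
)
print(prompt)
```

Pairing a prompt like this with deterministic decoding (temperature 0, as in the Quickstart below) helps keep the emitted JSON stable across runs.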
## ⚠️ Limitations
- Still quantized: some regressions vs FP16 can surface on intricate math/code.
- KV cache / context length can dominate RAM at very long windows—monitor budgets (a rough estimator follows this list).
- Add your own guardrails and safety for production.
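
To make the RAM point concrete, the sketch below gives a rough upper bound for the attention KV cache. The architecture numbers are illustrative placeholders, not the real Granite-4.0-H-Tiny configuration (read the actual values from the model's `config.json`); in a hybrid Mamba-2 model only the attention layers keep a growing KV cache, which is why these models stay comparatively cheap at long context.

```python
def kv_cache_gb(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Rough upper bound for the attention KV cache: 2 tensors (K and V) per attention layer."""
    total_bytes = 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / 1024**3

# Placeholder numbers for illustration only -- substitute values from config.json.
print(f"{kv_cache_gb(seq_len=128_000, n_attn_layers=4, n_kv_heads=8, head_dim=128):.2f} GB")
```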
## 🔢 Choosing a quantization level (MLX on Apple Silicon)
Indicative ranges for a ~7B hybrid MoE LM (actual usage varies by context length and batch size).
| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| 3-bit | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise | Great default on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention vs 3-bit | If 3-bit misses small details |
| 5-bit (this repo) | ~8–9 GB | 🔥🔥☆ | Higher fidelity, fewer omissions | When you want stronger document/JSON faithfulness without 6-bit RAM |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Highest MLX fidelity | If RAM permits and you need maximum quality |
**Rules of thumb**
- Start at 5-bit for document/structured tasks on 8–16 GB Macs.
- Drop to 3/4-bit for tighter RAM / higher speed.
- Move to 6-bit if you still see omissions or slight distortions in outputs.
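
To produce one of the other bit-widths locally instead of downloading it, mlx-lm can quantize the base checkpoint during conversion. A minimal sketch assuming the `mlx_lm.convert` Python API; argument names and defaults may differ between mlx-lm releases:

```python
from mlx_lm import convert

# Quantize the FP16 base checkpoint to 4-bit MLX shards; change q_bits to 3, 5, 6, etc.
convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",
    mlx_path="granite-4.0-h-tiny-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```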
## 🚀 Quickstart (CLI — MLX)
**Deterministic generation**
```bash
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```

MLX runs on the Metal GPU automatically on Apple Silicon; no device flag is needed.
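
The same build can also be driven from Python. A minimal sketch using mlx-lm's `load`/`generate` API, assuming `<this-repo-id>` is replaced with this repo's actual id or a local path (generation keyword arguments vary slightly between mlx-lm releases):

```python
from mlx_lm import load, generate

# Load the quantized MLX weights and tokenizer from this repo (or a local path).
model, tokenizer = load("<this-repo-id>")

messages = [{"role": "user", "content": "Summarize the following in 5 bullet points:\n<your text>"}]
# Format the request with the model's chat template before generating.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)
```

Applying the chat template explicitly keeps the Python path consistent with the CLI, which typically applies the model's chat template by default.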