Gemma 4 26B A4B Assistant GGUF

GGUF quantizations converted from google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant.

Tested with llama.cpp b9549 (Gemma 4 MTP support).

Update

Added experimental IQ quantizations with Q4 embeddings (token_embd.weight = Q4_0).

Recommendations

  • Q4_0-q4emb — recommended for most users
  • Q8_0 — for users with spare VRAM

Files

  • gemma-4-26B-A4B-it-assistant-f16.gguf
  • gemma-4-26b-A4B-it-assistant-Q4_0.gguf
  • gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf (closest to pure Q4 QAT layout)
  • gemma-4-26b-A4B-it-assistant-IQ4_NL-q4emb.gguf
  • gemma-4-26b-A4B-it-assistant-IQ3_M-q4emb.gguf (smallest that still works)
  • gemma-4-26b-A4B-it-assistant-Q8_0.gguf

Q4 Embedding Variant

Q4_0-q4emb is an experimental quantization in which token_embd.weight is kept in Q4_0 instead of the Q6_K quantization that llama.cpp typically uses for that tensor.

This follows a similar approach to recent QAT experiments for Gemma models, where preserving the original Q4-trained embedding format may better match the intended QAT behavior.

In initial testing, draft acceptance rates were similar to the default Q4_0 quant and speed was slightly higher, though more benchmarking is needed.
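
For readers who want to try a similar layout themselves, here is a minimal sketch of how such a file could be produced. It assumes llama.cpp's llama-quantize tool and its --token-embedding-type option; it is not confirmed to be the exact procedure used for the files in this repo. The helper name q4emb_cmd is hypothetical, and it only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: print a llama-quantize command that quantizes a model
# to Q4_0 while forcing token_embd.weight to Q4_0 as well.
# Nothing is executed; copy the printed line to run it.
#   $1 = input f16 GGUF
#   $2 = output GGUF
q4emb_cmd() {
  printf 'llama-quantize --token-embedding-type q4_0 %s %s Q4_0\n' "$1" "$2"
}

# File names taken from the Files list above.
q4emb_cmd \
  gemma-4-26B-A4B-it-assistant-f16.gguf \
  gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf
```

Swapping the final Q4_0 for another type such as IQ4_NL would give the other q4emb variants in principle, though IQ types may additionally need an importance matrix depending on the llama.cpp version.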

Example

llama-server \
  -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  -md gemma-4-26b-A4B-it-assistant-Q4_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

Recommended values:

  • --spec-draft-n-max 2 for general use
  • --spec-draft-n-max 3 for coding workloads
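
The recommendation above can be wrapped in a small launcher helper. This is only a sketch: the function names spec_draft_n_max and print_server_cmd are hypothetical, the model file names are the ones from the example, and the command is printed rather than executed.

```shell
#!/bin/sh
# Pick the draft length suggested above for a given workload:
# "coding" -> 3, anything else -> 2 (general use).
spec_draft_n_max() {
  case "$1" in
    coding) echo 3 ;;
    *)      echo 2 ;;
  esac
}

# Print the llama-server command for a workload without running it.
print_server_cmd() {
  printf 'llama-server -m %s -md %s --spec-type draft-mtp --spec-draft-n-max %s\n' \
    "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf" \
    "gemma-4-26b-A4B-it-assistant-Q4_0.gguf" \
    "$(spec_draft_n_max "$1")"
}

print_server_cmd coding
```

Keeping the choice in one function makes it easy to adjust if later benchmarking suggests different values.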

Model details

  • Format: GGUF
  • Model size: 0.4B params
  • Architecture: gemma4-assistant