Gemma 4-12B-IT Abliterated — GGUF

DuoNeural | 2026-06-03

GGUF quantizations of DuoNeural/Gemma4-12B-IT-Abliterated — an abliterated Gemma 4-12B-IT with the refusal direction surgically removed.

Quantized with llama.cpp.


Files

| File | Size | Recommended Use |
| --- | --- | --- |
| gemma4_12b_abliterated_Q4_K_M.gguf | ~7.5 GB | Best tradeoff: fits 12 GB VRAM, excellent quality |
| gemma4_12b_abliterated_Q5_K_M.gguf | ~8.5 GB | High quality, needs 12 GB VRAM |
| gemma4_12b_abliterated_Q8_0.gguf | ~12.7 GB | Near-lossless, needs 16 GB VRAM |
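
As a rough illustration of how the table can guide a download choice, here is a minimal Python sketch. The helper name `pick_quant` and its `prefer_headroom` flag are hypothetical (they are not part of any tool shipped with this repo), and the VRAM figures are copied from the table, not measured for your context length.

```python
# Figures copied from the Files table: (file name, approx. size in GB, VRAM needed in GB).
QUANTS = [
    ("gemma4_12b_abliterated_Q4_K_M.gguf", 7.5, 12),
    ("gemma4_12b_abliterated_Q5_K_M.gguf", 8.5, 12),
    ("gemma4_12b_abliterated_Q8_0.gguf", 12.7, 16),
]


def pick_quant(vram_gb, prefer_headroom=True):
    """Hypothetical helper: return the file name to download, or None if nothing fits."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    # Use the highest VRAM tier that still fits on the card.
    top_tier = max(q[2] for q in fitting)
    tier = [q for q in fitting if q[2] == top_tier]
    # Within that tier, the smaller file leaves more room for the KV cache,
    # while the larger file keeps more precision.
    pick = min if prefer_headroom else max
    return pick(tier, key=lambda q: q[1])[0]


print(pick_quant(12))
print(pick_quant(12, prefer_headroom=False))
print(pick_quant(16))
```

With the default setting a 12 GB card resolves to Q4_K_M, matching the "best tradeoff" recommendation; long contexts may still need more memory than the table suggests.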

Speed Benchmarks (A100-40GB, all layers GPU, llama-bench)

| Quantization | Size | Prefill (tok/s) | Generation (tok/s) |
| --- | --- | --- | --- |
| Q4_K_M | 6.86 GiB | 2,583 ± 139 | 78.3 ± 0.4 |
| Q5_K_M | 7.95 GiB | 2,455 ± 205 | 73.1 ± 0.2 |
| Q8_0 | 11.78 GiB | 2,573 ± 206 | 63.4 ± 0.3 |

Benchmarked on an A100-40GB SXM4 with llama-bench: -ngl 99 (all layers offloaded to the GPU), pp256/tg64 (256-token prompt processing, 64-token generation).
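
To put the throughput numbers in practical terms, the sketch below turns the mean rates from the table into a rough wall-clock estimate for a single request. The function name `estimate_seconds` is illustrative. It assumes the pp256/tg64 rates hold at other prompt and output lengths, which is only approximately true, and it ignores model load time and sampling overhead.

```python
# Mean throughput from the benchmark table (tokens per second, A100-40GB).
BENCH = {
    "Q4_K_M": {"prefill": 2583.0, "generation": 78.3},
    "Q5_K_M": {"prefill": 2455.0, "generation": 73.1},
    "Q8_0": {"prefill": 2573.0, "generation": 63.4},
}


def estimate_seconds(quant, prompt_tokens, new_tokens):
    """Rough latency: time to ingest the prompt plus time to generate the reply."""
    rates = BENCH[quant]
    return prompt_tokens / rates["prefill"] + new_tokens / rates["generation"]


for name in BENCH:
    print(f"{name}: {estimate_seconds(name, 256, 200):.2f} s")
```

Because prefill is roughly 30 to 40 times faster than generation in these runs, the estimate is dominated by the number of generated tokens, which is where the smaller quants gain their advantage.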


Usage (llama.cpp)

# Download a quant
huggingface-cli download DuoNeural/Gemma4-12B-IT-Abliterated-GGUF \
  gemma4_12b_abliterated_Q4_K_M.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m gemma4_12b_abliterated_Q4_K_M.gguf \
  -p "Write a haiku about hacking." \
  -n 200 --temp 0.7

Usage (Python via llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="./gemma4_12b_abliterated_Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["message"]["content"])

Abliteration Details

  • Base: google/gemma-4-12B-it (48 layers, hidden=3840)
  • Method: Orthogonal rank-1 projection (targeted mode: down_proj + o_proj, all 48 layers, α=0.3)
  • Results: complied with 5 of 7 harmful probes (71%); preserved 6 of 6 benign probes (100%)
  • Mean KL divergence (BF16→BF16, unbiased): 0.0000, i.e. no measurable distribution shift on benign text. The previously reported value of 0.912 was entirely an NF4 quantization artifact. See the Heretic v2.0 methodology.
  • Thinking mode: Works with enable_thinking=True in llama.cpp (no loops). In Python/transformers, pass enable_thinking=False to apply_chat_template.
  • Full details, benchmarks, and novel findings are on the BF16 model card.
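
For readers unfamiliar with the method named above, the following NumPy sketch shows one common way to write a scaled rank-1 projection on a weight matrix. It is an illustration under stated assumptions, not the heretic-llm implementation: the function name `ablate_direction` is hypothetical, the refusal direction here is a random vector on a toy-sized matrix, and treating α as a simple scale on the projected-out component is an assumption about how the 0.3 is applied.

```python
import numpy as np


def ablate_direction(weight, direction, alpha=0.3):
    """Subtract a fraction alpha of the part of weight's output that lies along direction.

    weight:    (hidden, in_features) matrix writing into the residual stream,
               the layout of down_proj / o_proj in (out, in) convention.
    direction: (hidden,) vector in the residual stream; normalized here.
    """
    r = direction / np.linalg.norm(direction)
    # outer(r, r @ weight) equals r r^T W: the rank-1 slice of W whose output points along r.
    return weight - alpha * np.outer(r, r @ weight)


rng = np.random.default_rng(0)
hidden, in_features = 8, 5  # toy sizes; the card lists hidden=3840
W = rng.standard_normal((hidden, in_features))
r = rng.standard_normal(hidden)  # stand-in for a measured refusal direction

W_new = ablate_direction(W, r, alpha=0.3)
r_hat = r / np.linalg.norm(r)

# The output component along r is scaled by (1 - alpha).
print(np.allclose(r_hat @ W_new, 0.7 * (r_hat @ W)))
# The change to W is a rank-1 update.
print(np.linalg.matrix_rank(W - W_new))
```

Setting alpha=1.0 in this sketch removes the component along the direction entirely; a smaller value such as 0.3 only attenuates it, leaving every direction orthogonal to `r` unchanged.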

Related Models

Congratulations to OpenYourMind for publishing the first abliteration of Gemma 4-12B-IT (Jun 3, 2026). Their approach uses diff-in-means on a labeled harmful/harmless set; ours uses orthogonal rank-1 projection via heretic-llm. Two independent methods applied to the same base model give the community a useful comparison point. We are not affiliated with OpenYourMind and did not use their data.


About DuoNeural

DuoNeural is an open AI research lab that publishes all of its work open access.

| Platform | Link |
| --- | --- |
| 🤗 HuggingFace | huggingface.co/DuoNeural |
| 📚 Papers | zenodo.org/communities/duoneural |
| 🌐 Website | duoneural.com |

Apache-2.0 licensed. All DuoNeural research is CC BY 4.0.
