4bit Quantized Model: Qwen2.5-32B-Instruct

A 4bit quantized variant of /mnt/d/Development/Libraries/Qwen2.5-32B-Instruct, optimized to reduce memory footprint and accelerate inference while maintaining high output similarity.

Overview

This checkpoint was quantized using BitsAndBytes and evaluated with standard text similarity metrics.


Model Architecture

Attribute Value
Model class Qwen2ForCausalLM
Number of parameters 17,161,065,472
Hidden size 5120
Number of layers 64
Attention heads 40
Vocabulary size 152064
Compute dtype bfloat16

Quantization Configuration

{
  "quant_method": "bitsandbytes",
  "_load_in_8bit": false,
  "_load_in_4bit": true,
  "llm_int8_threshold": 6.0,
  "llm_int8_skip_modules": null,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "load_in_4bit": true,
  "load_in_8bit": false
}

Intended Use

  • Research and experimentation.
  • Instruction-following tasks in resource-constrained environments.
  • Demonstrations of quantized model capabilities.

Limitations

  • May reproduce biases from the original model.
  • Quantization may reduce generation diversity and factual accuracy.
  • Not intended for production without additional evaluation.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pbhappliedsystems/Qwen2.5-32B-Instruct-4bit-20260527_122210")
model = AutoModelForCausalLM.from_pretrained("pbhappliedsystems/Qwen2.5-32B-Instruct-4bit-20260527_122210", device_map="auto")

prompt = "Explain the concept of reinforcement learning."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Generation Settings

This model produces best results when generated with:

  • temperature: 0.3
  • top_p: 0.9

Model Files Metadata

Filename Size (bytes) SHA-256
model-00001-of-00004.safetensors 4,933,190,348 b2a0e8a735e99b3a59bb3139541c444808aff3793a28c314c0f02bf17a00b5f7
model-00002-of-00004.safetensors 4,958,587,236 fd4b028d13261c8da0e29ed57b95189d666f62f3e8d4ab232c17c4e4e131543a
model-00003-of-00004.safetensors 4,999,136,184 0446d1c6da46a5daea91bed161fd62f2f48a658d879f58a14b7ab5528eb66935
model-00004-of-00004.safetensors 4,324,534,021 39002c4ed64520809793fb2b2023caf9bdbf0914feb4786d553c418139457018
quant_config.json 426 1bd2332861a3d1a8f387a9d04a1432b5bb57dec1a112ab6cfe594f67c5e66823

Notes

  • Produced on 2026-05-27T12:33:55.921152.
  • Quantized automatically using BitsAndBytes.

Intended primarily for research and experimentation.

Citation

Qwen2.5-32B-Instruct

Qwen2.5 Technical Report

License

This model is distributed under the apache-2.0 license, consistent with the original /mnt/d/Development/Libraries/Qwen2.5-32B-Instruct.

Model Card Authors

This quantized model was prepared by PBH Applied Systems.

Downloads last month
42
Safetensors
Model size
34B params
Tensor type
F32
·
F16
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for pbhappliedsystems/Qwen2.5-32B-Instruct-4bit-20260527_122210