Qwen3-32B-Instruct - DirectML INT4 (ONNX Runtime)
This repository provides Qwen3-32B-Instruct converted to INT4 ONNX and optimized for DirectML using Microsoft Olive and ONNX Runtime GenAI.
It enables native Windows GPU inference (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in C# / .NET applications via ONNX Runtime + DirectML.
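For .NET projects, the DirectML build of ONNX Runtime GenAI is available as a NuGet package (package ID as published by Microsoft at the time of writing; verify the current name and version on nuget.org):

```
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```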
Model Details
- Base model: Qwen/Qwen3-32B
- Variant: Instruct
- Quantization: INT4 (MatMulNBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with DmlExecutionProvider
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, 130V with large system RAM)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (24 GB recommended, 16 GB possible with paging)
Files
Core inference artifacts:
- model.onnx
- model.onnx.data: INT4 weights (≈ 18.6 GB)
- genai_config.json
- tokenizer.json, vocab.json, merges.txt
- chat_template.jinja
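Note that with onnxruntime-genai the execution provider is selected in genai_config.json rather than in application code. The relevant part of the shipped config looks roughly like the excerpt below (trimmed to the provider entry; other fields omitted):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```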
Hardware & Memory Notes
Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:
- ≥ 16 GB VRAM (with host memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging
This model is intended for:
- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications
Usage in C# (DirectML)
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// The DirectML execution provider is selected via genai_config.json,
// so only the model folder needs to be passed here.
var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true);   // temperature only applies when sampling
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until EOS or max_length is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
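For interactive use, tokens can be decoded and printed as they are produced instead of waiting for the full sequence. A minimal variation of the loop above, using the TokenizerStream API from the same package:

```csharp
// Stream decoded text as it is generated.
using var tokenizerStream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var sequence = generator.GetSequence(0);
    Console.Write(tokenizerStream.Decode(sequence[^1]));  // decode only the newest token
}
```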
Prompt Format
The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).
The provided chat_template.jinja can be used for consistent role formatting.
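If you build the prompt string manually instead of rendering the Jinja template, Qwen models follow the ChatML layout. A minimal sketch (the shipped chat_template.jinja remains authoritative, including for any tool-calling additions):

```csharp
// Minimal ChatML-style prompt; render chat_template.jinja for full fidelity.
string systemPrompt = "You are a helpful assistant.";
string userPrompt = "List the parties involved in a Dutch mortgage deed.";
string prompt =
    $"<|im_start|>system\n{systemPrompt}<|im_end|>\n" +
    $"<|im_start|>user\n{userPrompt}<|im_end|>\n" +
    "<|im_start|>assistant\n";
var sequences = tokenizer.Encode(prompt);
```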
Performance Characteristics
- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language (see the sketch below)
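As an illustration of the structured-extraction use case, a system prompt can pin the model to a JSON schema before encoding via the ChatML layout shown earlier. The field names below are purely illustrative:

```csharp
// Hypothetical schema; adapt the field names to your own documents.
string extractionSystemPrompt =
    "Extract these fields from the user's text and reply with JSON only: " +
    "buyer_name, property_address, mortgage_amount_eur, notary_required (boolean).";
string userText =
    "Jan de Vries is buying Keizersgracht 1 in Amsterdam with a EUR 450,000 mortgage.";
```

Feed both strings through the ChatML prompt construction above and run the usual generation loop.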
License & Attribution
- Base model: Qwen3-32B by Alibaba (see the original model card for license terms)
- Conversion: ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion; no affiliation with Alibaba or the Qwen team
Related Models
- Smaller & faster: https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4