Qwen3-32B-Instruct - DirectML INT4 (ONNX Runtime)
This repository provides Qwen3-32B-Instruct converted to INT4 ONNX and optimized for DirectML using Microsoft Olive and ONNX Runtime GenAI.
It enables native Windows GPU inference (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in C# / .NET applications via ONNX Runtime + DirectML.
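For .NET projects, the DirectML build of ONNX Runtime GenAI is available as a NuGet package (package ID as published by Microsoft at the time of writing; verify the current name and version on nuget.org):

```
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```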
Model Details
- Base model: Qwen/Qwen3-32B
- Variant: Instruct
- Quantization: INT4 (MatMulNBits, per-channel)
- Format: ONNX
- Runtime: ONNX Runtime with DmlExecutionProvider
- Conversion toolchain: Microsoft Olive + onnxruntime-genai
- Target hardware:
- Intel Arc (A770, 130V with large system RAM)
- AMD RDNA2 / RDNA3
- NVIDIA RTX (24 GB recommended, 16 GB possible with paging)
Files
Core inference artifacts:
- model.onnx
- model.onnx.data: INT4 weights (≈ 18.6 GB)
- genai_config.json
- tokenizer.json, vocab.json, merges.txt
- chat_template.jinja
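Note that with onnxruntime-genai the execution provider is selected in genai_config.json rather than in application code. The relevant part of the shipped config looks roughly like the excerpt below (trimmed to the provider entry; other fields omitted):

```json
{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}
```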
Hardware & Memory Notes
Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:
- ≥ 16 GB VRAM (with host memory fallback via DirectML)
- ≥ 64 GB system RAM strongly recommended
- Fast NVMe storage for paging
This model is intended for:
- Advanced reasoning
- Tool orchestration
- Structured document analysis
- Multi-step planning in local Windows applications
Usage in C# (DirectML)
```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// The DirectML execution provider is selected via genai_config.json,
// so only the model folder needs to be passed here.
var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true);   // temperature only applies when sampling
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate token by token until EOS or max_length is reached.
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
```
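For interactive use, tokens can be decoded and printed as they are produced instead of waiting for the full sequence. A minimal variation of the loop above, using the TokenizerStream API from the same package:

```csharp
// Stream decoded text as it is generated.
using var tokenizerStream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    generator.GenerateNextToken();
    var sequence = generator.GetSequence(0);
    Console.Write(tokenizerStream.Decode(sequence[^1]));  // decode only the newest token
}
```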
Prompt Format
The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).
The provided chat_template.jinja can be used for consistent role formatting.
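If you build the prompt string manually instead of rendering the Jinja template, Qwen models follow the ChatML layout. A minimal sketch (the shipped chat_template.jinja remains authoritative, including for any tool-calling additions):

```csharp
// Minimal ChatML-style prompt; render chat_template.jinja for full fidelity.
string systemPrompt = "You are a helpful assistant.";
string userPrompt = "List the parties involved in a Dutch mortgage deed.";
string prompt =
    $"<|im_start|>system\n{systemPrompt}<|im_end|>\n" +
    $"<|im_start|>user\n{userPrompt}<|im_end|>\n" +
    "<|im_start|>assistant\n";
var sequences = tokenizer.Encode(prompt);
```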
Performance Characteristics
- Much stronger reasoning and instruction following than the 14B variant
- Higher latency, but better long-context coherence
- Ideal when the model must:
  - Infer document structures
  - Select templates
  - Extract structured fields from natural language (see the sketch below)
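As an illustration of the structured-extraction use case, a system prompt can pin the model to a JSON schema before encoding via the ChatML layout shown earlier. The field names below are purely illustrative:

```csharp
// Hypothetical schema; adapt the field names to your own documents.
string extractionSystemPrompt =
    "Extract these fields from the user's text and reply with JSON only: " +
    "buyer_name, property_address, mortgage_amount_eur, notary_required (boolean).";
string userText =
    "Jan de Vries is buying Keizersgracht 1 in Amsterdam with a EUR 450,000 mortgage.";
```

Feed both strings through the ChatML prompt construction above and run the usual generation loop.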
License & Attribution
- Base model: Qwen3-32B by Alibaba (see the original model card for license terms)
- Conversion: ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
- Independent community conversion; no affiliation with Alibaba or the Qwen team
Related Models
- Smaller & faster: https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4