Qwen3-32B-Instruct – DirectML INT4 (ONNX Runtime)

This repository provides Qwen3-32B-Instruct converted to INT4 ONNX and optimized for DirectML using Microsoft Olive and ONNX Runtime GenAI.

It enables native Windows GPU inference (Intel Arc, AMD RDNA, NVIDIA RTX) without CUDA and without running a Python server, and is intended for use in C# / .NET applications via ONNX Runtime + DirectML.


Model Details

  • Base model: Qwen/Qwen3-32B
  • Variant: Instruct
  • Quantization: INT4 (MatMulNBits, per-channel)
  • Format: ONNX
  • Runtime: ONNX Runtime with DmlExecutionProvider
  • Conversion toolchain: Microsoft Olive + onnxruntime-genai (example command below)
  • Target hardware:
    • Intel Arc (A770, 130V with large system RAM)
    • AMD RDNA2 / RDNA3
    • NVIDIA RTX (24 GB recommended, 16 GB possible with paging)

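For reference, a similar conversion can be produced with the onnxruntime-genai model builder. The exact invocation used for this repository is not documented here, so treat the flags below as an illustrative sketch:

python -m onnxruntime_genai.models.builder -m Qwen/Qwen3-32B -o Qwen3-32B-Instruct-DirectML-INT4 -p int4 -e dml
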
Files

Core inference artifacts:

  • model.onnx
  • model.onnx.data ← INT4 weights (≈ 18.6 GB)
  • genai_config.json
  • tokenizer.json, vocab.json, merges.txt
  • chat_template.jinja
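
The DirectML execution provider is selected through genai_config.json rather than in application code. An illustrative fragment, following the onnxruntime-genai config layout:

{
  "model": {
    "decoder": {
      "session_options": {
        "provider_options": [
          { "dml": {} }
        ]
      }
    }
  }
}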

Hardware & Memory Notes

Although INT4 quantization greatly reduces VRAM usage, the 32B model still requires:

  • ≥ 16 GB VRAM (with host memory fallback via DirectML)
  • ≥ 64 GB system RAM strongly recommended
  • Fast NVMe storage for paging
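
A rough sanity check on these numbers: 32 billion parameters at 4 bits each is about 16 GB of raw weight data, and per-channel scales plus the layers kept at higher precision account for the remainder of the ≈ 18.6 GB model.onnx.data file. The weights alone therefore fill a 16 GB card before any KV cache or activations are allocated, which is why host-memory fallback and generous system RAM matter.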

This model is intended for:

  • Advanced reasoning
  • Tool orchestration
  • Structured document analysis
  • Multi-step planning in local Windows applications

Usage in C# (DirectML)
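
The snippet below assumes the Microsoft.ML.OnnxRuntimeGenAI.DirectML NuGet package, which bundles the DirectML-enabled native runtime:

dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML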

using System;
using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = @"Qwen3-32B-Instruct-DirectML-INT4";

// The DirectML execution provider is configured in genai_config.json,
// so no provider needs to be selected in code.
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var sequences = tokenizer.Encode(
    "Determine which legal document templates are required for a Dutch mortgage transaction.");

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetSearchOption("do_sample", true);   // temperature has no effect without sampling
generatorParams.SetSearchOption("temperature", 0.6);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

// Generate until an end-of-sequence token or max_length is reached
while (!generator.IsDone())
{
    generator.GenerateNextToken();
}

string output = tokenizer.Decode(generator.GetSequence(0));
Console.WriteLine(output);
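
For interactive use, tokens can be decoded as they are produced using a TokenizerStream; a minimal variant of the generation loop above:

using var stream = tokenizer.CreateStream();

while (!generator.IsDone())
{
    generator.GenerateNextToken();
    // Decode and print only the most recently generated token
    var seq = generator.GetSequence(0);
    Console.Write(stream.Decode(seq[seq.Length - 1]));
}
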
Prompt Format

The model supports chat-style prompts and function-calling / tool-routing patterns when used with structured system prompts (e.g. Hermes-style schemas).

The provided chat_template.jinja can be used for consistent role formatting.
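
For reference, the Qwen family uses ChatML-style role markers. A manually assembled prompt looks roughly like the sketch below; the bundled chat_template.jinja is authoritative, so verify against it:

// Illustrative sketch of the ChatML layout; prefer the bundled chat template.
string prompt =
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" +
    "<|im_start|>user\nList the documents required for a mortgage deed.<|im_end|>\n" +
    "<|im_start|>assistant\n";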

Performance Characteristics

  • Much stronger reasoning and instruction following than the 14B variant
  • Higher latency, but better long-context coherence
  • Ideal when the model must:
    • Infer document structures
    • Select templates
    • Extract structured fields from natural language (see the example below)
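
As an illustration of the structured-field use case, a system prompt can pin the output to a fixed JSON shape. The prompt text and field names here are hypothetical:

// Hypothetical system prompt for JSON field extraction
string systemPrompt =
    "Extract the following fields from the user's text and reply with JSON only: " +
    "{ \"buyer\": string, \"seller\": string, \"purchase_price_eur\": number }";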

License & Attribution

  • Base model: Qwen3-32B by Alibaba (see the original model card for license terms)
  • Conversion: ONNX + INT4 DirectML optimization performed by Wekkel using Microsoft Olive
  • Independent community conversion; no affiliation with Alibaba or the Qwen team

Related Models

Smaller & faster:

  • https://huggingface.co/wekkel/Qwen3-14B-Instruct-DirectML-INT4