Qwen3.6-27B · GGUF Q4_K_M

Quantized, converted, and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

🆕 First Qwen3-series model in the evaluated series. Its hybrid (adaptive) thinking mode, in which the model generates extended chain-of-thought reasoning on harder problems, is the defining behavioral characteristic of this evaluation and is documented in detail below.

⚠️ Single-runner evaluation. The F16 GGUF (53.8 GB) exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral data comes from the Q4_K_M quantized_llama_cpp runner only. The F16 artifact is documented for provenance at pbhappliedsystems/qwen3.6-27B-gguf-F16.


Model Description

This repository contains the 4-bit quantized (Q4_K_M) GGUF of Qwen/Qwen3.6-27B, a 27-billion parameter model from Alibaba Cloud's Qwen3 generation. Qwen3 introduces a hybrid thinking mode — the model can generate extended internal chain-of-thought reasoning (enclosed in <think> blocks) before producing its final response, with reasoning depth adapting to task complexity.

This adaptive behavior is the single most consequential characteristic for structured task evaluation, and its interaction with the quant_eval v7.21 harness is documented in detail in the Key Findings section below.

Key Characteristics

  • Parameters: 27B
  • Architecture: Qwen3 (hybrid thinking / non-thinking mode)
  • Format: GGUF Q4_K_M
  • File size: 16.5 GB
  • SHA256: c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8
  • Minimum VRAM (GPU inference): ~22 GB
  • Recommended GPU tier: RTX 4090 24 GB · A10G 24 GB · A100 40 GB
  • Context window: 32,768 tokens (check model config for extended context options)
  • Inference speed (eval hardware): avg 1.938 sec/case on RTX 4090
  • License: Apache 2.0

PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21 · Run ID: 20260426_163540 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42 · Hardware: NVIDIA RTX 4090 · Runner: quantized_llama_cpp (Q4_K_M only) · Total rows: 42

Per-Family Pass Rates (Q4_K_M)

Family              N    Pass Rate   Avg Secs   Bucket Score   Notes
json_multistep      5    0.400       6.181      0.600          Easy pass; medium/hard fail (thinking mode)
stateful_followup   2    1.000       0.685      2.000          Both turns exact match
toolcall_only       2    0.000       0.655      1.000          "arguments" vs "args"; closest in series
mixed_brief_json    2    1.000       0.705      2.000          Clean ANSWER + JSON
toolcall            2    1.000       1.360      0.000          Stage-1 passes; EOS on final answer
json                4    n/a         2.263      10.000         All pass
fuzz                20   n/a         1.692      10.000         All 20 pass
mcq                 5    n/a         0.160      1.000          5/5 perfect

Key Findings

Finding 1: Adaptive Thinking Mode — The Defining Behavioral Characteristic

Qwen3 uses hybrid thinking mode: for simpler tasks, the model responds directly; for harder tasks, it generates an extended <think> block before the final response. This adaptive behavior is the primary driver of the json_multistep results.

Easy cases pass cleanly and quickly:

Case Difficulty Secs Result Raw (truncated)
ms_easy_01 Easy 4.949 {"plan": ["A"], "checks": [...]}
ms_easy_02 Easy 5.942 {"plan": ["A","A"], "checks": [...]}

Both easy cases produce direct JSON output. All three gating signals pass (schema_ok=1, checks_consistent_ok=1, oracle_equiv_ok=1).

Medium and hard cases fail with all four signals simultaneously:

Case Difficulty Secs Result Failure
ms_med_01 Medium 6.650 schema_ok=0, cc=0, stop=0, oe=0
ms_med_02 Medium 6.761 schema_ok=0, cc=0, stop=0, oe=0
ms_hard_01 Hard 6.603 schema_ok=0, cc=0, stop=0, oe=0

The all-four-signals failure pattern is the signature of a root-level extraction failure — when schema_ok=0, the downstream signals (checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok) cannot be evaluated. The model generates a <think> reasoning block on the harder cases; when the extractor receives the full output including the think block, the JSON schema object is not found at the expected location or the content before the JSON causes the schema validator to fail.

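The timing is consistent with this: the easy cases take 4.9–5.9 s while the medium/hard cases take 6.6–6.8 s, about 1.2 s longer on average (0.7 to 1.8 s depending on which pair of cases is compared). The extra time plausibly reflects thinking-token generation. At F16 precision this overhead would likely be larger, although F16 was not run.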
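
The latency gap can be checked directly from the per-case timings in the two tables above. The following is a minimal sketch; the variable names are illustrative and the numbers are copied from those tables.

```python
from statistics import mean

# Per-case latencies (seconds) copied from the json_multistep tables above.
easy = {"ms_easy_01": 4.949, "ms_easy_02": 5.942}
harder = {"ms_med_01": 6.650, "ms_med_02": 6.761, "ms_hard_01": 6.603}

easy_mean = mean(list(easy.values()))
harder_mean = mean(list(harder.values()))
family_mean = mean(list(easy.values()) + list(harder.values()))

# Mean gap between the two groups, plus the narrowest and widest pairwise gaps.
mean_gap = harder_mean - easy_mean
min_gap = min(harder.values()) - max(easy.values())
max_gap = max(harder.values()) - min(easy.values())

print(f"easy mean:   {easy_mean:.3f} s")
print(f"harder mean: {harder_mean:.3f} s")
print(f"family mean: {family_mean:.3f} s")
print(f"gap: mean {mean_gap:.3f} s, min {min_gap:.3f} s, max {max_gap:.3f} s")
```

The family mean computed here matches the 6.181 s average reported for json_multistep in the Per-Family table.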

What this means for production deployment: Qwen3's thinking mode requires pipeline-level handling. The <think> block must be stripped before extraction, or the inference configuration must set /no_think or equivalent parameters to disable thinking mode when structured output is required. Without this, the model may produce correct reasoning internally while the extractor fails to locate the final answer. See the Usage section for configuration guidance.
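
The failure mode described above can be reproduced in a few lines. This is a sketch under stated assumptions: the raw string is illustrative (the failing raw outputs are not reproduced in this card), and strict_extract is a hypothetical stand-in for a strict extractor, not the quant_eval implementation.

```python
import json
import re

# Illustrative raw output only: the plan/checks values are made up for the example.
raw = (
    "<think>\nThe task needs two steps, so plan A then B.\n</think>\n\n"
    '{"plan": ["A", "B"], "checks": []}<|im_end|>'
)

def strict_extract(text: str):
    """Hypothetical strict extractor: the whole payload must be one JSON object."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def strip_think_and_eos(text: str) -> str:
    """Drop complete <think>...</think> blocks and <|im_end|> markers."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return text.replace("<|im_end|>", "").strip()

before = strict_extract(raw)                      # think block precedes the JSON
after = strict_extract(strip_think_and_eos(raw))  # JSON is now the whole payload

print("before stripping:", before)
print("after stripping: ", after)
```

The same JSON object is present in both cases; only the text around it decides whether a strict extractor finds it.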

Finding 2: toolcall_only — Closest Schema Vocabulary in the Evaluated Series

Every Qwen2.5 model in the series uses incorrect argument container keys. Qwen3.6-27B produces the most schema-accurate bare tool call of any model evaluated:

Model Raw Output tool_name arg values correct ✅ Container key
Qwen2.5-3B Q4_K_M {"tool": "add", "operands": [5, 10]} operands (array)
Qwen2.5-7B Q4_K_M {"tool": "add", "numbers": [5, 10]} numbers (array)
Qwen2.5-14B-1M Q4_K_M {"tool": "add", "input": {"x": 5, "y": 10}} input, x/y
Qwen2.5-32B Q4_K_M {"tool": "add", "params": {"a": 5, "b": 10}} params
Qwen3.6-27B Q4_K_M {"tool_name": "add", "arguments": {"a": 5, "b": 10}} arguments

Qwen3.6-27B is the only model in the evaluated series to produce "tool_name" as the outer key without system-prompt schema enforcement. Argument value names ("a", "b") are also correct. The only error is "arguments" instead of "args" as the container key — a single field name away from a fully correct schema. Explicit key-name instructions in the system prompt should fully resolve this.
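
Alongside prompt instructions, the container key can also be normalized after parsing. The sketch below assumes the target schema is {"tool_name", "args"}; normalize_tool_call and the alias list are hypothetical helpers built from the table above, not part of quant_eval.

```python
import json

# Dict-valued container keys seen in the table above (hypothetical alias list).
# Array-valued keys ("operands", "numbers") are deliberately not mapped, because
# positional values cannot be assigned to named arguments safely.
ARG_ALIASES = ("arguments", "params", "input")

def normalize_tool_call(raw: str) -> dict:
    """Parse a bare tool call and rename known alias keys to the target schema."""
    call = json.loads(raw.replace("<|im_end|>", "").strip())
    if "tool_name" not in call and "tool" in call:
        call["tool_name"] = call.pop("tool")
    if "args" not in call:
        for alias in ARG_ALIASES:
            if isinstance(call.get(alias), dict):
                call["args"] = call.pop(alias)
                break
    if not isinstance(call.get("tool_name"), str) or not isinstance(call.get("args"), dict):
        raise ValueError(f"unrecognized tool call shape: {call!r}")
    return call

print(normalize_tool_call('{"tool_name": "add", "arguments": {"a": 5, "b": 10}}'))
```

Renaming the container does not validate the argument names inside it, so a downstream schema check is still needed.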

Finding 3: MCQ and Fuzz — Perfect at 27B

MCQ Case Result Raw
mcq_01 B
mcq_02 B
mcq_03 C
mcq_04 B
mcq_05 B

5/5 perfect MCQ at 0.16 sec/case average: single letter, no contamination, no A-bias. All 20 fuzz cases pass at bucket=10. These families benefit from the model's size, and their short latencies (0.16 s for mcq, 1.69 s for fuzz) suggest that thinking mode does not activate.

Finding 4: toolcall — Correct Arithmetic, EOS Contamination

Both toolcall cases pass stage-1 and produce the correct final answer with EOS tokens — the standard Qwen Q4_K_M series pattern:

Case Raw Expected
tool_01 {"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|> 5
tool_02 {"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|> 6

Note that in the toolcall family (where a schema is provided in context), the model uses the correct "tool_name" and "args" keys — confirming that in-context schema examples fully resolve the vocabulary issue observed in toolcall_only. Strip <|im_end|> before downstream processing.
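
The raw two-stage output can be split into the tool call and the final answer as sketched below. This assumes the segments are separated by <|im_end|> as in the table above and that the final answer is an integer; parse_toolcall_output is a hypothetical helper.

```python
import json

EOS = "<|im_end|>"

def parse_toolcall_output(raw: str):
    """Split '<tool call JSON><EOS> <final answer><EOS>' into (call, answer)."""
    segments = [seg.strip() for seg in raw.split(EOS) if seg.strip()]
    if len(segments) != 2:
        raise ValueError(f"expected 2 segments, got {len(segments)}")
    call = json.loads(segments[0])
    answer = int(segments[1])  # assumes an integer final answer
    return call, answer

raw = '{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|>'
call, answer = parse_toolcall_output(raw)

# Cross-check the final answer against the arguments passed to the add tool.
answer_consistent = answer == call["args"]["a"] + call["args"]["b"]
print(call, answer, answer_consistent)
```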

Finding 5: Stateful and Hybrid — Clean at 27B

Both stateful and mixed_brief_json families produce clean correct outputs with strippable EOS:

Family Case Raw
stateful state_01 {"counter": 2}<|im_end|> {"counter": 5}<|im_end|>
stateful state_02 {"items":["a","b"]}<|im_end|> {"items":["a","b","c"]}<|im_end|>
mixed mixed_01 ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|>
mixed mixed_02 ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<|im_end|>

These short-context tasks average under 0.75 seconds per case (0.685 s for stateful_followup, 0.705 s for mixed_brief_json), which suggests that thinking mode does not activate on them.
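
A mixed_brief_json response can be split into its answer line and JSON object as sketched below. The regular expression assumes the 'ANSWER: <int>' followed by a JSON object shape seen in the table above; parse_mixed is a hypothetical helper.

```python
import json
import re

def parse_mixed(raw: str):
    """Split 'ANSWER: <int> {json}<|im_end|>' into the answer value and JSON object."""
    text = raw.replace("<|im_end|>", "").strip()
    match = re.match(r"ANSWER:\s*(-?\d+)\s*(\{.*\})\s*$", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("output does not match 'ANSWER: <int> {json}'")
    return int(match.group(1)), json.loads(match.group(2))

answer, obj = parse_mixed('ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<|im_end|>')

# The answer line, the "sum" field, and a + b should all agree.
consistent = answer == obj["sum"] == obj["a"] + obj["b"]
print(answer, obj, consistent)
```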


Signal-Level Diagnostics (Q4_K_M)

json_multistep

Signal Rate Notes
schema_ok 0.400 Root-level failure on medium/hard — thinking mode
checks_consistent_ok 0.400 Cascades from schema_ok=0
stop_semantics_ok 0.400 Cascades from schema_ok=0
oracle_equiv_ok 0.400 Cascades from schema_ok=0
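
The rates in this table follow from the per-case signals reported in Finding 1. The sketch below recomputes them; stop_semantics_ok for the easy cases is not listed in Finding 1 and is inferred here from the 0.400 rate, and the all-signals gating rule is an assumption.

```python
SIGNALS = ("schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok")

# Per-case signals from Finding 1. stop_semantics_ok for the easy cases is an
# assumption inferred from the 0.400 rate in the table above.
cases = {
    "ms_easy_01": dict.fromkeys(SIGNALS, 1),
    "ms_easy_02": dict.fromkeys(SIGNALS, 1),
    "ms_med_01": dict.fromkeys(SIGNALS, 0),
    "ms_med_02": dict.fromkeys(SIGNALS, 0),
    "ms_hard_01": dict.fromkeys(SIGNALS, 0),
}

# Per-signal rate: fraction of cases where the signal is 1.
rates = {s: sum(c[s] for c in cases.values()) / len(cases) for s in SIGNALS}

# Assumed gating rule: a case passes only when every signal is 1.
pass_rate = sum(all(c.values()) for c in cases.values()) / len(cases)

for name, rate in rates.items():
    print(f"{name:<22}{rate:.3f}")
print(f"{'pass_rate':<22}{pass_rate:.3f}")
```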

stateful_followup

Signal Rate
turn1_parse_ok 1.000
turn2_parse_ok 1.000
turn1_exact_match 1.000
turn2_exact_match 1.000

toolcall_only

Signal Rate Notes
tool_name_ok 1.000 "tool_name" correct — best in series without enforcement
args_ok 0.000 "arguments" vs "args"

mixed_brief_json

Signal Rate
answer_line_ok 1.000
json_parse_ok 1.000
schema_ok 1.000

Series Context

Model            json_multistep   stateful   MCQ   VRAM (Q4_K_M)   Thinking Mode
Qwen2.5-3B       0.200            1.000      3/5   ~4 GB           No
Qwen2.5-7B       0.800            1.000      4/5   ~6 GB           No
Qwen2.5-14B-1M   0.800            1.000      5/5   ~12 GB          No
Qwen2.5-32B      0.600            1.000      5/5   ~24 GB          No
Qwen3.6-27B      0.400            1.000      5/5   ~22 GB          ✅ Hybrid

The json_multistep result of 0.400 is not a straightforward capability regression from the Qwen2.5 series. It is a pipeline compatibility finding: the evaluation harness was not configured to strip <think> blocks from Qwen3 responses on harder tasks. The two passing cases demonstrate the model is capable of correct structured multi-step planning output. The three failing cases are those where the model's adaptive thinking mode activates and the extraction layer fails — not where the model's reasoning is incorrect. With proper pipeline configuration (think-block stripping or /no_think mode), the json_multistep pass rate would be expected to improve.


Recommended Use Cases

✅ Deploy with Confidence (Q4_K_M)

  • Stateful multi-turn agents: 1.000 at both turns, clean JSON state with strippable EOS at 0.685 sec/case.
  • Hybrid brief + JSON responses: mixed_brief_json 1.000 in 0.705 sec.
  • MCQ and single-choice extraction: 5/5 perfect at 0.16 sec/case.
  • Structured JSON (single-step): json and fuzz both bucket=10.000. All 20 fuzz cases pass.
  • Scaffolded tool-calling: toolcall stage-1 passes; strip EOS from final answer.

✅ Deploy with Think-Block Stripping (Q4_K_M)

  • Multi-step structured planning — ms_easy cases pass cleanly. Medium and hard cases require <think> block stripping or /no_think configuration before the extraction layer receives the model's output. See Usage section for implementation.

⚠️ Use with Guardrails (Q4_K_M)

  • Bare tool-call dispatch: toolcall_only fails only on "arguments" vs "args". Provide the exact key name in the system prompt; the model already produces correct "tool_name" and value names without enforcement.

Hardware Requirements

Configuration VRAM Required Notes
Q4_K_M (this repo) ~22 GB 16.5 GB model + KV cache
Q4_K_M · full context ~26 GB A10G 24 GB may require reduced context
F16 (provenance only) ~70 GB+ Multi-GPU or large-memory server
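
The figures in this table imply the following headroom beyond the weights. This is a rough sketch using only numbers stated in this card; the config labels and the report structure are illustrative, and real usage depends on context length and runtime buffers.

```python
MODEL_GB = 16.5  # Q4_K_M file size from this card

# Approximate total VRAM per configuration, from the table above (GB).
CONFIGS = {"this repo row": 22.0, "full context": 26.0}

# Card capacities from the "Recommended GPU tier" list (GB).
CARDS = {"RTX 4090": 24, "A10G": 24, "A100": 40}

report = {}
for config, total_gb in CONFIGS.items():
    overhead_gb = total_gb - MODEL_GB  # KV cache plus runtime buffers
    fits = [card for card, capacity in CARDS.items() if capacity >= total_gb]
    report[config] = (overhead_gb, fits)
    print(f"{config}: ~{overhead_gb:.1f} GB beyond weights; fits on: {', '.join(fits)}")
```

By these numbers, any 24 GB card would need a reduced context to stay within memory at the full-context estimate.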

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — Disabling Thinking Mode for Structured Output

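For structured output tasks where a <think> block would interfere with extraction, disable thinking mode, for example by appending /no_think to the user message as shown here (the tool-calling example further down places it in the system prompt instead):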

from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import re, json

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen3.6-27B-gguf-Q4-K-M",
    filename="qwen3.6-27B-gguf-Q4-K-M.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

# Append /no_think to the user message to suppress thinking mode
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Return structured JSON only when asked."
        },
        {
            "role": "user",
            "content": "Return a JSON object with keys: summary, risk_level, action_items. /no_think"
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])

For post-processing that strips <think> blocks if present:

def strip_thinking(raw: str) -> str:
    """
    Strip <think>...</think> blocks from Qwen3 output.
    quant_eval v7.21: medium/hard json_multistep cases fail when think blocks
    are not stripped before extraction. Easy cases pass without stripping.
    """
    # Remove think blocks. An unclosed <think> (output truncated by max_tokens)
    # is removed through to the end of the string.
    clean = re.sub(r'<think>.*?(?:</think>|\Z)', '', raw, flags=re.DOTALL).strip()
    # Also strip residual EOS tokens
    clean = re.sub(r'<\|im_end\|>', '', clean).strip()
    return clean

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
    max_tokens=2048,  # Allow space for thinking tokens
)
raw = response["choices"][0]["message"]["content"]
clean = strip_thinking(raw)
print(clean)

For tool-calling with schema enforcement and EOS stripping:

def call_tool(prompt: str) -> dict:
    """
    Tool dispatch for Qwen3.6-27B.
    quant_eval v7.21: model uses correct 'tool_name' key without enforcement.
    Only 'arguments' vs 'args' needs correction. EOS stripping required.
    """
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a tool-calling assistant. Output ONLY a JSON object "
                    "using EXACTLY these keys: "
                    '{"tool_name": "<name>", "args": {"a": <n>, "b": <n>}}\n'
                    "Then on the next line output the numeric result. /no_think"
                )
            },
            {"role": "user", "content": prompt}
        ],
        temperature=0.0,
        max_tokens=128,
    )
    raw = response["choices"][0]["message"]["content"]
    clean = strip_thinking(raw)
    return {"clean": clean, "raw": raw}

result = call_tool("Use the add tool to compute 10 minus 4.")
print(result["clean"])

CLI — llama-cli

llama-cli \
  --model qwen3.6-27B-gguf-Q4-K-M.gguf \
  --chat-template qwen3 \
  --system-prompt "You are a precise assistant. Return structured outputs when requested." \
  --prompt "Return a JSON object with keys: summary, risk_level, action_items. /no_think" \
  --n-predict 1024 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.15

Note: if your llama.cpp build does not accept qwen3 as a --chat-template value, omit the flag so that the chat template embedded in the GGUF metadata is used. The same applies to the llama-server command.

For server deployment:

llama-server \
  --model qwen3.6-27B-gguf-Q4-K-M.gguf \
  --chat-template qwen3 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API with think-block stripping:

from openai import OpenAI
import re

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="qwen3.6-27B-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here /no_think"}],
    temperature=0.15,
)
raw = response.choices[0].message.content
clean = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL)
clean = re.sub(r'<\|im_end\|>', '', clean).strip()
print(clean)

Evaluation Artifacts

The full per-case evaluation CSV (comparison_results_v7_21_Qwen3.6_27B_20260426_163540.csv) and rollup.json are published in this repository for independent verification.


Artifact Provenance

Artifact Format Size SHA256 Evaluated
qwen3.6-27B-gguf-Q4-K-M.gguf GGUF Q4_K_M 16.5 GB c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8 ✅ Yes
F16 (companion repo) GGUF F16 53.8 GB 79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb ❌ VRAM constraint

Both artifacts were produced from Qwen/Qwen3.6-27B using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
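
A downloaded artifact can be checked against the SHA256 values in the table above. The following is a minimal sketch using only the standard library; sha256_of is a hypothetical helper and the local filename is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# Q4_K_M digest from the Artifact Provenance table above.
EXPECTED_SHA256 = "c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8"

def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-gigabyte GGUF is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Assumed local filename; adjust to wherever the GGUF was downloaded.
gguf_path = Path("qwen3.6-27B-gguf-Q4-K-M.gguf")
if gguf_path.exists():
    actual = sha256_of(gguf_path)
    print("match" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
else:
    print(f"{gguf_path} not found; skipping check")
```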


Evaluation Methodology

quant_eval v7.21 — proprietary behavioral evaluation harness, PBH Applied Systems.

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Family Description Pass Signals
fuzz Property-based regression; structured placement correctness schema_ok, constraints_ok
json Single-step structured JSON with constraint rules schema_ok, constraints_ok
json_multistep Multi-step planning with self-check and oracle verification schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok
mcq Multiple-choice extraction choice_ok
stateful_followup Two-turn state tracking; turn-2 correct given turn-1 turn1/2_parse_ok, turn1/2_exact_match
mixed_brief_json Hybrid: natural language answer + valid JSON block answer_line_ok, json_parse_ok, schema_ok
toolcall Tool call embedded in response; parse + schema validation stage1_tool_parse_ok, stage1_tool_schema_ok
toolcall_only Bare schema-only tool call; strict tool name + args check tool_name_ok, args_ok

Evaluation hardware: NVIDIA RTX 4090 · Evaluation date: April 26, 2026 · Seed: 42


🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com


Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com


About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.

Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)

Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging


📞 Work With PBH Applied Systems

Qwen3.6-27B is the first Qwen3-series model in the evaluated series, and its adaptive thinking mode introduces a new class of pipeline configuration requirement that doesn't apply to any Qwen2.5 model. The json_multistep result is not a capability score — it's a deployment readiness finding: structured output pipelines targeting Qwen3 models need think-block stripping. The toolcall_only result tells a different story: Qwen3 has learned the tool_name key convention without being told, which is a genuine capability improvement over the entire Qwen2.5 series. Both findings are only visible through systematic evaluation.

👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500



License

This GGUF repository inherits the license of the base model: Apache 2.0 (Qwen/Qwen3.6-27B)

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.


GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260426_163540
