Qwen3.6-27B · GGUF Q4_K_M
Quantized, converted, and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure
🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21, a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across the structured output, tool dispatch, multi-turn state retention, and multi-step planning families, not perplexity or benchmark leaderboard proxies.
🆕 First Qwen3-series model in the evaluated series. Qwen3.6-27B is the first model from Alibaba Cloud's Qwen3 generation to be evaluated in this series. Its hybrid (adaptive) thinking mode — where the model generates extended chain-of-thought reasoning on harder problems — is the defining behavioral characteristic of this evaluation and is documented in detail below.
⚠️ Single-runner evaluation. The F16 GGUF (53.8 GB) exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral data comes from the Q4_K_M quantized_llama_cpp runner only. The F16 artifact is documented for provenance at pbhappliedsystems/qwen3.6-27B-gguf-F16.
Model Description
This repository contains the 4-bit quantized (Q4_K_M) GGUF of Qwen/Qwen3.6-27B, a 27-billion parameter model from Alibaba Cloud's Qwen3 generation. Qwen3 introduces a hybrid thinking mode — the model can generate extended internal chain-of-thought reasoning (enclosed in <think> blocks) before producing its final response, with reasoning depth adapting to task complexity.
This adaptive behavior is the single most consequential characteristic for structured task evaluation, and its interaction with the quant_eval v7.21 harness is documented in detail in the Key Findings section below.
Key Characteristics
- Parameters: 27B
- Architecture: Qwen3 (hybrid thinking / non-thinking mode)
- Format: GGUF Q4_K_M
- File size: 16.5 GB
- SHA256: c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8
- Minimum VRAM (GPU inference): ~22 GB
- Recommended GPU tier: RTX 4090 24 GB · A10G 24 GB · A100 40 GB
- Context window: 32,768 tokens (check model config for extended context options)
- Inference speed (eval hardware): avg 1.938 sec/case on RTX 4090
- License: Apache 2.0
PBH Applied Systems Evaluation — quant_eval v7.21
Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260426_163540 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: quantized_llama_cpp (Q4_K_M only) · Total rows: 42
Per-Family Pass Rates (Q4_K_M)
| Family | N | Pass Rate | Avg Secs | Bucket Score | Notes |
|---|---|---|---|---|---|
| json_multistep | 5 | 0.400 | 6.181 | 0.600 | Easy pass; medium/hard fail — thinking mode |
| stateful_followup | 2 | 1.000 | 0.685 | 2.000 | Both turns exact match |
| toolcall_only | 2 | 0.000 | 0.655 | 1.000 | "arguments" vs "args" — closest in series |
| mixed_brief_json | 2 | 1.000 | 0.705 | 2.000 | Clean ANSWER + JSON |
| toolcall | 2 | 1.000 | 1.360 | 0.000 | Stage-1 passes; EOS on final answer |
| json | 4 | n/a | 2.263 | 10.000 | All pass |
| fuzz | 20 | n/a | 1.692 | 10.000 | All 20 pass |
| mcq | 5 | n/a | 0.160 | 1.000 | 5/5 perfect |
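As a consistency check, the per-family averages in the table above can be combined into a case-weighted mean and compared with the 1.938 sec/case figure listed under Key Characteristics. The sketch below only re-does that arithmetic; the (N, Avg Secs) pairs are copied from the table, and the variable names are illustrative rather than part of quant_eval.

```python
# (family, N cases, average seconds per case), copied from the table above
families = [
    ("json_multistep", 5, 6.181),
    ("stateful_followup", 2, 0.685),
    ("toolcall_only", 2, 0.655),
    ("mixed_brief_json", 2, 0.705),
    ("toolcall", 2, 1.360),
    ("json", 4, 2.263),
    ("fuzz", 20, 1.692),
    ("mcq", 5, 0.160),
]

# Weight each family's average by its number of cases
total_cases = sum(n for _, n, _ in families)
total_secs = sum(n * secs for _, n, secs in families)
weighted_avg = total_secs / total_cases

print(total_cases)             # 42
print(round(weighted_avg, 3))  # 1.938
```

Both values agree with the reported totals (42 rows, 1.938 sec/case).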
Key Findings
Finding 1: Adaptive Thinking Mode — The Defining Behavioral Characteristic
Qwen3 uses hybrid thinking mode: for simpler tasks, the model responds directly; for harder tasks, it generates an extended <think> block before the final response. This adaptive behavior is the primary driver of the json_multistep results.
Easy cases pass cleanly and quickly:
| Case | Difficulty | Secs | Result | Raw (truncated) |
|---|---|---|---|---|
| ms_easy_01 | Easy | 4.949 | ✅ | {"plan": ["A"], "checks": [...]} |
| ms_easy_02 | Easy | 5.942 | ✅ | {"plan": ["A","A"], "checks": [...]} |
Both easy cases produce direct JSON output. All three gating signals pass (schema_ok=1, checks_consistent_ok=1, oracle_equiv_ok=1).
Medium and hard cases fail with all four signals simultaneously:
| Case | Difficulty | Secs | Result | Failure |
|---|---|---|---|---|
| ms_med_01 | Medium | 6.650 | ❌ | schema_ok=0, cc=0, stop=0, oe=0 |
| ms_med_02 | Medium | 6.761 | ❌ | schema_ok=0, cc=0, stop=0, oe=0 |
| ms_hard_01 | Hard | 6.603 | ❌ | schema_ok=0, cc=0, stop=0, oe=0 |
The all-four-signals failure pattern is the signature of a root-level extraction failure — when schema_ok=0, the downstream signals (checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok) cannot be evaluated. The model generates a <think> reasoning block on the harder cases; when the extractor receives the full output including the think block, the JSON schema object is not found at the expected location or the content before the JSON causes the schema validator to fail.
The timing is consistent with this: the easy cases take 4.9–5.9 s, while the medium and hard cases take 6.6–6.8 s. The gap of roughly 0.7–1.8 s (about 1.2 s between the group averages) plausibly reflects the overhead of generating thinking tokens. At F16 precision this overhead would likely be larger.
What this means for production deployment: Qwen3's thinking mode requires pipeline-level handling. The <think> block must be stripped before extraction, or the inference configuration must set /no_think or equivalent parameters to disable thinking mode when structured output is required. Without this, the model may produce correct reasoning internally while the extractor fails to locate the final answer. See the Usage section for configuration guidance.
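One way to make an extraction layer tolerant of leading reasoning text is to look only at what follows the last `</think>` tag and then scan for the first parseable JSON object. This is a minimal sketch under those assumptions, not the quant_eval extractor; the function name `extract_first_json_object` and the sample string are hypothetical.

```python
import json

def extract_first_json_object(text: str):
    """Return the first JSON object found after any think block, or None."""
    # Ignore everything up to and including the last closing think tag, if any
    tail = text.rsplit("</think>", 1)[-1]
    decoder = json.JSONDecoder()
    pos = tail.find("{")
    while pos != -1:
        try:
            obj, _end = decoder.raw_decode(tail, pos)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON at this brace; try the next one
        pos = tail.find("{", pos + 1)
    return None

# Hypothetical output: reasoning text followed by the final JSON answer
sample = '<think>Try {A} first, then verify.</think>\n{"plan": ["A"], "checks": []}'
print(extract_first_json_object(sample))  # {'plan': ['A'], 'checks': []}
```

Cutting at the last `</think>` first means a draft JSON object written inside the reasoning is not mistaken for the final answer.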
Finding 2: toolcall_only — Closest Schema Vocabulary in the Evaluated Series
Every Qwen2.5 model in the series uses incorrect argument container keys. Qwen3.6-27B produces the most schema-accurate bare tool call of any model evaluated:
| Model | Raw Output | tool_name ✅ | arg values correct ✅ | Container key |
|---|---|---|---|---|
| Qwen2.5-3B Q4_K_M | {"tool": "add", "operands": [5, 10]} | ❌ | ❌ | operands (array) |
| Qwen2.5-7B Q4_K_M | {"tool": "add", "numbers": [5, 10]} | ❌ | ❌ | numbers (array) |
| Qwen2.5-14B-1M Q4_K_M | {"tool": "add", "input": {"x": 5, "y": 10}} | ❌ | ❌ | input, x/y |
| Qwen2.5-32B Q4_K_M | {"tool": "add", "params": {"a": 5, "b": 10}} | ❌ | ✅ | params |
| Qwen3.6-27B Q4_K_M | {"tool_name": "add", "arguments": {"a": 5, "b": 10}} | ✅ | ✅ | arguments |
Qwen3.6-27B is the only model in the evaluated series to produce "tool_name" as the outer key without system-prompt schema enforcement. Argument value names ("a", "b") are also correct. The only error is "arguments" instead of "args" as the container key — a single field name away from a fully correct schema. Explicit key-name instructions in the system prompt should fully resolve this.
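Where the prompt cannot be changed, the same mismatch could instead be absorbed after generation by renaming the container key. The helper below is a hypothetical sketch, not part of quant_eval: it assumes the expected schema uses "args" and that "arguments" is the only alias that needs mapping.

```python
import json

# Alias observed for Qwen3.6-27B in the table above; "args" is the expected key.
ARG_KEY_ALIASES = ("arguments",)

def normalize_tool_call(raw: str) -> dict:
    """Parse a bare tool call and rename a known alias key to 'args'."""
    call = json.loads(raw)
    for alias in ARG_KEY_ALIASES:
        if alias in call and "args" not in call:
            call["args"] = call.pop(alias)
    return call

print(normalize_tool_call('{"tool_name": "add", "arguments": {"a": 5, "b": 10}}'))
# {'tool_name': 'add', 'args': {'a': 5, 'b': 10}}
```

A call that already uses "args" passes through unchanged.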
Finding 3: MCQ and Fuzz — Perfect at 27B
| MCQ Case | Result | Raw |
|---|---|---|
| mcq_01 | ✅ | B |
| mcq_02 | ✅ | B |
| mcq_03 | ✅ | C |
| mcq_04 | ✅ | B |
| mcq_05 | ✅ | B |
5/5 perfect MCQ at 0.16 sec/case average: single letter, no contamination, no A-bias. All 20 fuzz cases pass at bucket=10. The short per-case latencies suggest that thinking mode does not activate on these families.
Finding 4: toolcall — Correct Arithmetic, EOS Contamination
Both toolcall cases pass stage-1 and produce the correct final answer with EOS tokens — the standard Qwen Q4_K_M series pattern:
| Case | Raw | Expected |
|---|---|---|
| tool_01 | {"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|> | 5 |
| tool_02 | {"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|> | 6 |
Note that in the toolcall family (where a schema is provided in context), the model uses the correct "tool_name" and "args" keys — confirming that in-context schema examples fully resolve the vocabulary issue observed in toolcall_only. Strip <|im_end|> before downstream processing.
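The raw outputs in the table above can be separated into the tool call and the final answer by splitting on the EOS marker. A minimal sketch, assuming the two-segment layout shown in the table; `split_toolcall_output` is a hypothetical helper name.

```python
import json

EOS = "<|im_end|>"

def split_toolcall_output(raw: str):
    """Split a raw two-stage response into (tool_call_dict, final_answer_str)."""
    # Drop empty segments left behind by a trailing EOS marker
    parts = [p.strip() for p in raw.split(EOS) if p.strip()]
    if not parts:
        raise ValueError("empty response")
    tool_call = json.loads(parts[0])
    final_answer = parts[1] if len(parts) > 1 else ""
    return tool_call, final_answer

# Raw output of tool_02, copied from the table above
raw = '{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|>'
call, answer = split_toolcall_output(raw)
print(call["args"], answer)  # {'a': 10, 'b': -4} 6
```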
Finding 5: Stateful and Hybrid — Clean at 27B
Both stateful and mixed_brief_json families produce clean correct outputs with strippable EOS:
| Family | Case | Raw |
|---|---|---|
| stateful | state_01 | {"counter": 2}<|im_end|> {"counter": 5}<|im_end|> |
| stateful | state_02 | {"items":["a","b"]}<|im_end|> {"items":["a","b","c"]}<|im_end|> |
| mixed | mixed_01 | ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|> |
| mixed | mixed_02 | ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<|im_end|> |
These short-context tasks complete in under 0.75 seconds each, which suggests that thinking mode does not activate.
Signal-Level Diagnostics (Q4_K_M)
json_multistep
| Signal | Rate | Notes |
|---|---|---|
| schema_ok | 0.400 | Root-level failure on medium/hard — thinking mode |
| checks_consistent_ok | 0.400 | Cascades from schema_ok=0 |
| stop_semantics_ok | 0.400 | Cascades from schema_ok=0 |
| oracle_equiv_ok | 0.400 | Cascades from schema_ok=0 |
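The cascade in this table can be illustrated with a gated scorer: if no schema-valid object is extracted, every downstream signal is recorded as 0 without being evaluated. This is only a sketch of the gating idea as described in this card; the actual quant_eval implementation is proprietary, and the stand-in checks below are hypothetical.

```python
from typing import Callable, Dict, Optional

def evaluate_signals(parsed: Optional[dict],
                     downstream: Dict[str, Callable[[dict], bool]]) -> Dict[str, int]:
    """Score one case; downstream checks run only when the schema gate passes."""
    schema_ok = int(parsed is not None)
    signals = {"schema_ok": schema_ok}
    for name, check in downstream.items():
        signals[name] = int(check(parsed)) if schema_ok else 0
    return signals

# Hypothetical stand-in checks; the real signal definitions are not published
checks = {
    "checks_consistent_ok": lambda obj: "checks" in obj,
    "stop_semantics_ok": lambda obj: True,
    "oracle_equiv_ok": lambda obj: "plan" in obj,
}

print(evaluate_signals({"plan": ["A"], "checks": []}, checks))  # all signals 1
print(evaluate_signals(None, checks))                           # all signals 0

# Two passing and three failing cases give the 0.400 rate shown above
outcomes = [1, 1, 0, 0, 0]
print(sum(outcomes) / len(outcomes))  # 0.4
```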
stateful_followup
| Signal | Rate |
|---|---|
| turn1_parse_ok | 1.000 |
| turn2_parse_ok | 1.000 |
| turn1_exact_match | 1.000 |
| turn2_exact_match | 1.000 |
toolcall_only
| Signal | Rate | Notes |
|---|---|---|
| tool_name_ok | 1.000 | "tool_name" correct — best in series without enforcement |
| args_ok | 0.000 | "arguments" vs "args" |
mixed_brief_json
| Signal | Rate |
|---|---|
| answer_line_ok | 1.000 |
| json_parse_ok | 1.000 |
| schema_ok | 1.000 |
Series Context
| Model | json_multistep | stateful | MCQ | VRAM (Q4_K_M) | Thinking Mode |
|---|---|---|---|---|---|
| Qwen2.5-3B | 0.200 | 1.000 | 3/5 | ~4 GB | No |
| Qwen2.5-7B | 0.800 | 1.000 | 4/5 | ~6 GB | No |
| Qwen2.5-14B-1M | 0.800 | 1.000 | 5/5 | ~12 GB | No |
| Qwen2.5-32B | 0.600 | 1.000 | 5/5 | ~24 GB | No |
| Qwen3.6-27B | 0.400 | 1.000 | 5/5 | ~22 GB | ✅ Hybrid |
The json_multistep result of 0.400 is not a straightforward capability regression from the Qwen2.5 series. It is a pipeline compatibility finding: the evaluation harness was not configured to strip <think> blocks from Qwen3 responses on harder tasks. The two passing cases demonstrate the model is capable of correct structured multi-step planning output. The three failing cases are those where the model's adaptive thinking mode activates and the extraction layer fails — not where the model's reasoning is incorrect. With proper pipeline configuration (think-block stripping or /no_think mode), the json_multistep pass rate would be expected to improve.
Recommended Use Cases
✅ Deploy with Confidence (Q4_K_M)
- Stateful multi-turn agents: 1.000 at both turns, clean JSON state with strippable EOS at 0.685 sec/case.
- Hybrid brief + JSON responses: mixed_brief_json 1.000 in 0.705 sec.
- MCQ and single-choice extraction: 5/5 perfect at 0.16 sec/case.
- Structured JSON (single-step): json and fuzz both bucket=10.000. All 20 fuzz cases pass.
- Scaffolded tool-calling: toolcall stage-1 passes; strip EOS from final answer.
✅ Deploy with Think-Block Stripping (Q4_K_M)
- Multi-step structured planning: ms_easy cases pass cleanly. Medium and hard cases require <think> block stripping or /no_think configuration before the extraction layer receives the model's output. See the Usage section for implementation.
⚠️ Use with Guardrails (Q4_K_M)
- Bare tool-call dispatch: toolcall_only fails only on "arguments" vs "args". Provide the exact key name in the system prompt; the model already produces the correct "tool_name" key and value names without enforcement.
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| Q4_K_M (this repo) | ~22 GB | 16.5 GB model + KV cache |
| Q4_K_M · full context | ~26 GB | A10G 24 GB may require reduced context |
| F16 (provenance only) | ~70 GB+ | Multi-GPU or large-memory server |
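The estimates in the table can be turned into a quick fit check against the GPUs named under Key Characteristics. The sketch below uses only figures from this card (16.5 GB file size, ~22 GB and ~26 GB estimates, 24 GB and 40 GB cards); treating "estimate minus file size" as KV cache plus runtime overhead is a simplifying assumption, and `fitting_gpus` is a hypothetical helper.

```python
MODEL_FILE_GB = 16.5  # Q4_K_M file size from this card

# Approximate VRAM estimates from the table above
configs = {"Q4_K_M": 22.0, "Q4_K_M full context": 26.0}

# GPU memory sizes (GB) from the recommended GPU tier list
gpus = {"RTX 4090": 24, "A10G": 24, "A100": 40}

def fitting_gpus(required_gb: float) -> list:
    """Return the listed GPUs whose memory covers the estimate."""
    return [name for name, vram in gpus.items() if vram >= required_gb]

for cfg, need in configs.items():
    non_weight = need - MODEL_FILE_GB  # assumed: KV cache plus runtime overhead
    print(f"{cfg}: ~{non_weight:.1f} GB beyond weights; fits on: {', '.join(fitting_gpus(need))}")
```

Under these estimates the 24 GB cards fit the default configuration but not the full-context one, which matches the note about reducing context on 24 GB hardware.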
Usage
Installation
pip install llama-cpp-python huggingface_hub
For GPU acceleration (CUDA):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Python — Disabling Thinking Mode for Structured Output
For structured output tasks where <think> blocks would interfere with extraction, suppress thinking mode by appending /no_think to the prompt:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import re, json
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen3.6-27B-gguf-Q4-K-M",
    filename="qwen3.6-27B-gguf-Q4-K-M.gguf"
)
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)
# Option 1: Add /no_think to the user message to suppress thinking mode
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Return structured JSON only when asked."
        },
        {
            "role": "user",
            "content": "Return a JSON object with keys: summary, risk_level, action_items. /no_think"
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])
For post-processing that strips <think> blocks if present:
def strip_thinking(raw: str) -> str:
    """
    Strip <think>...</think> blocks from Qwen3 output.
    quant_eval v7.21: medium/hard json_multistep cases fail when think blocks
    are not stripped before extraction. Easy cases pass without stripping.
    """
    # Remove complete think blocks
    clean = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL)
    # Remove a truncated block: an opening tag with no closing tag
    # (e.g. generation stopped at max_tokens while still thinking)
    clean = re.sub(r'<think>.*\Z', '', clean, flags=re.DOTALL)
    # Also strip residual EOS tokens
    clean = re.sub(r'<\|im_end\|>', '', clean).strip()
    return clean

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
    max_tokens=2048,  # Allow space for thinking tokens
)
raw = response["choices"][0]["message"]["content"]
clean = strip_thinking(raw)
print(clean)
For tool-calling with schema enforcement and EOS stripping:
def call_tool(prompt: str) -> dict:
    """
    Tool dispatch for Qwen3.6-27B.
    quant_eval v7.21: model uses correct 'tool_name' key without enforcement.
    Only 'arguments' vs 'args' needs correction. EOS stripping required.
    """
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a tool-calling assistant. Output ONLY a JSON object "
                    "using EXACTLY these keys: "
                    '{"tool_name": "<name>", "args": {"a": <n>, "b": <n>}}\n'
                    "Then on the next line output the numeric result. /no_think"
                )
            },
            {"role": "user", "content": prompt}
        ],
        temperature=0.0,
        max_tokens=128,
    )
    raw = response["choices"][0]["message"]["content"]
    clean = strip_thinking(raw)
    return {"clean": clean, "raw": raw}

result = call_tool("Use the add tool to compute 10 minus 4.")
print(result["clean"])
CLI — llama-cli
llama-cli \
--model qwen3.6-27B-gguf-Q4-K-M.gguf \
--chat-template qwen3 \
--system-prompt "You are a precise assistant. Return structured outputs when requested." \
--prompt "Return a JSON object with keys: summary, risk_level, action_items. /no_think" \
--n-predict 1024 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--temp 0.15
For server deployment:
llama-server \
--model qwen3.6-27B-gguf-Q4-K-M.gguf \
--chat-template qwen3 \
--ctx-size 8192 \
--n-gpu-layers -1 \
--port 8080 \
--host 0.0.0.0
Query via the OpenAI-compatible API with think-block stripping:
from openai import OpenAI
import re
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")
response = client.chat.completions.create(
model="qwen3.6-27B-gguf-Q4-K-M",
messages=[{"role": "user", "content": "Your prompt here /no_think"}],
temperature=0.15,
)
raw = response.choices[0].message.content
clean = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL)
clean = re.sub(r'<\|im_end\|>', '', clean).strip()
print(clean)
Evaluation Artifacts
The full per-case evaluation CSV (comparison_results_v7_21_Qwen3.6_27B_20260426_163540.csv) and rollup.json are published in this repository for independent verification.
Artifact Provenance
| Artifact | Format | Size | SHA256 | Evaluated |
|---|---|---|---|---|
| qwen3.6-27B-gguf-Q4-K-M.gguf | GGUF Q4_K_M | 16.5 GB | c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8 | ✅ Yes |
| F16 (companion repo) | GGUF F16 | 53.8 GB | 79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb | ❌ VRAM constraint |
Both artifacts were produced from Qwen/Qwen3.6-27B using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
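A downloaded file can be checked against the published digest with the standard library. This is a minimal sketch: the expected value is the Q4_K_M SHA256 from the table above, while the function names and the chunk size are illustrative choices.

```python
import hashlib

# SHA256 of qwen3.6-27B-gguf-Q4-K-M.gguf as published in the table above
EXPECTED_SHA256 = "c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8"

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-GB GGUF is never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest matches the published value."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_artifact(model_path)` on the path returned by `hf_hub_download` should return True for an intact download.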
Evaluation Methodology
quant_eval v7.21 — proprietary behavioral evaluation harness, PBH Applied Systems.
Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)
| Family | Description | Pass Signals |
|---|---|---|
| fuzz | Property-based regression; structured placement correctness | schema_ok, constraints_ok |
| json | Single-step structured JSON with constraint rules | schema_ok, constraints_ok |
| json_multistep | Multi-step planning with self-check and oracle verification | schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok |
| mcq | Multiple-choice extraction | choice_ok |
| stateful_followup | Two-turn state tracking; turn-2 correct given turn-1 | turn1/2_parse_ok, turn1/2_exact_match |
| mixed_brief_json | Hybrid: natural language answer + valid JSON block | answer_line_ok, json_parse_ok, schema_ok |
| toolcall | Tool call embedded in response; parse + schema validation | stage1_tool_parse_ok, stage1_tool_schema_ok |
| toolcall_only | Bare schema-only tool call; strict tool name + args check | tool_name_ok, args_ok |
Evaluation hardware: NVIDIA RTX 4090 · Evaluation date: April 26, 2026 · Seed: 42
🔬 About quant_eval & This Evaluation Series
quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.
See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.
Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com
Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com
About PBH Applied Systems
PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.
Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)
Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging
📞 Work With PBH Applied Systems
Qwen3.6-27B is the first Qwen3-series model in the evaluated series, and its adaptive thinking mode introduces a new class of pipeline configuration requirement that doesn't apply to any Qwen2.5 model. The json_multistep result is not a capability score — it's a deployment readiness finding: structured output pipelines targeting Qwen3 models need think-block stripping. The toolcall_only result tells a different story: Qwen3 has learned the tool_name key convention without being told, which is a genuine capability improvement over the entire Qwen2.5 series. Both findings are only visible through systematic evaluation.
👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500
Connect
| 🌐 | pbhappliedsystems.com |
| 📧 | patrick@pbhappliedsystems.com |
| ▶️ | YouTube |
License
This GGUF repository inherits the license of the base model:
Apache 2.0 — Qwen/Qwen3.6-27B
The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.
GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260426_163540