Qwen3.6-27B · GGUF Q4_K_M

Quantized, converted, and evaluated by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

🔬 This repository is part of a production-oriented evaluation series. Every model published under pbhappliedsystems has been independently evaluated using quant_eval v7.21 — a proprietary behavioral evaluation harness developed by PBH Applied Systems. Scores measure real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning families — not perplexity or benchmark leaderboard proxies.

🆕 First Qwen3-series model in the evaluated series. Qwen3.6-27B is the first model from Alibaba Cloud's Qwen3 generation to be evaluated in this series. Its hybrid (adaptive) thinking mode — where the model generates extended chain-of-thought reasoning on harder problems — is the defining behavioral characteristic of this evaluation and is documented in detail below.

⚠️ Single-runner evaluation. The F16 GGUF (53.8 GB) exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral data comes from the Q4_K_M quantized_llama_cpp runner only. The F16 artifact is documented for provenance at pbhappliedsystems/qwen3.6-27B-gguf-F16.

Model Description

This repository contains the 4-bit quantized (Q4_K_M) GGUF of Qwen/Qwen3.6-27B, a 27-billion parameter model from Alibaba Cloud's Qwen3 generation. Qwen3 introduces a hybrid thinking mode — the model can generate extended internal chain-of-thought reasoning (enclosed in <think> blocks) before producing its final response, with reasoning depth adapting to task complexity.

This adaptive behavior is the single most consequential characteristic for structured task evaluation, and its interaction with the quant_eval v7.21 harness is documented in detail in the Key Findings section below.

Key Characteristics

Parameters: 27B
Architecture: Qwen3 (hybrid thinking / non-thinking mode)
Format: GGUF Q4_K_M
File size: 16.5 GB
SHA256: c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8
Minimum VRAM (GPU inference): ~22 GB
Recommended GPU tier: RTX 4090 24 GB · A10G 24 GB · A100 40 GB
Context window: 32,768 tokens (check model config for extended context options)
Inference speed (eval hardware): avg 1.938 sec/case on RTX 4090
License: Apache 2.0

PBH Applied Systems Evaluation — quant_eval v7.21

Evaluation conducted by PBH Applied Systems, LLC using quant_eval v7.21
Run ID: 20260426_163540 · Fixtures: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c...) · Seed: 42
Hardware: NVIDIA RTX 4090 · Runner: quantized_llama_cpp (Q4_K_M only) · Total rows: 42

Per-Family Pass Rates (Q4_K_M)

Family	N	Pass Rate	Avg Secs	Bucket Score	Notes
json_multistep	5	0.400	6.181	0.600	Easy pass; medium/hard fail — thinking mode
stateful_followup	2	1.000	0.685	2.000	Both turns exact match
toolcall_only	2	0.000	0.655	1.000	`"arguments"` vs `"args"` — closest in series
mixed_brief_json	2	1.000	0.705	2.000	Clean ANSWER + JSON
toolcall	2	1.000	1.360	0.000	Stage-1 passes; EOS on final answer
json	4	n/a	2.263	10.000	All pass
fuzz	20	n/a	1.692	10.000	All 20 pass
mcq	5	n/a	0.160	1.000	5/5 perfect
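The overall latency quoted under Key Characteristics (avg 1.938 sec/case) can be recovered from this table as a row-weighted mean of the per-family averages. A minimal Python sketch, using only the N and Avg Secs columns above; the variable names are illustrative and not part of quant_eval:

```python
# Per-family (N, Avg Secs) pairs copied from the table above.
family_timings = {
    "json_multistep": (5, 6.181),
    "stateful_followup": (2, 0.685),
    "toolcall_only": (2, 0.655),
    "mixed_brief_json": (2, 0.705),
    "toolcall": (2, 1.360),
    "json": (4, 2.263),
    "fuzz": (20, 1.692),
    "mcq": (5, 0.160),
}

# Weight each family's average by its number of cases.
total_rows = sum(n for n, _ in family_timings.values())
total_secs = sum(n * avg for n, avg in family_timings.values())
weighted_avg = total_secs / total_rows

print(f"rows={total_rows}, avg={weighted_avg:.3f} sec/case")
```

This prints `rows=42, avg=1.938 sec/case`, matching both the total row count and the headline latency. The 20 fuzz cases and the five slower json_multistep cases contribute most of the total time.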

Key Findings

Finding 1: Adaptive Thinking Mode — The Defining Behavioral Characteristic

Qwen3 uses hybrid thinking mode: for simpler tasks, the model responds directly; for harder tasks, it generates an extended <think> block before the final response. This adaptive behavior is the primary driver of the json_multistep results.

Easy cases pass cleanly and quickly:

Case	Difficulty	Secs	Result	Raw (truncated)
ms_easy_01	Easy	4.949	✅	`{"plan": ["A"], "checks": [...]}`
ms_easy_02	Easy	5.942	✅	`{"plan": ["A","A"], "checks": [...]}`

Both easy cases produce direct JSON output. All four pass signals succeed (schema_ok=1, checks_consistent_ok=1, stop_semantics_ok=1, oracle_equiv_ok=1).

Medium and hard cases fail with all four signals simultaneously:

Case	Difficulty	Secs	Result	Failure
ms_med_01	Medium	6.650	❌	schema_ok=0, cc=0, stop=0, oe=0
ms_med_02	Medium	6.761	❌	schema_ok=0, cc=0, stop=0, oe=0
ms_hard_01	Hard	6.603	❌	schema_ok=0, cc=0, stop=0, oe=0

The all-four-signals failure pattern is the signature of a root-level extraction failure — when schema_ok=0, the downstream signals (checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok) cannot be evaluated. The model generates a <think> reasoning block on the harder cases; when the extractor receives the full output including the think block, the JSON schema object is not found at the expected location or the content before the JSON causes the schema validator to fail.

The timing is consistent with this: the ms_easy cases take 4.9–5.9 s while the medium/hard cases take 6.6–6.8 s. The gap of roughly 1.2 s between the group means (up to ~1.7 s measured against the fastest easy case) plausibly reflects thinking-token generation overhead. At F16 precision this overhead would likely be larger.
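The size of that gap can be checked from the per-case timings in the two tables above. A small sketch comparing the group means; the grouping and variable names are illustrative only:

```python
from statistics import mean

# Seconds per case, copied from the json_multistep tables above.
easy_secs = [4.949, 5.942]
harder_secs = [6.650, 6.761, 6.603]  # two medium cases and one hard case

# Gap between group means, and gap against the fastest easy case.
mean_gap = mean(harder_secs) - mean(easy_secs)
max_gap = min(harder_secs) - min(easy_secs)

print(f"gap between group means: {mean_gap:.2f} s")
print(f"gap vs fastest easy case: {max_gap:.2f} s")
```

The group means differ by about 1.2 s. The larger figure of about 1.7 s only appears when the medium/hard cases are compared against the fastest easy case (4.949 s). With five cases in total, either number is a rough indication rather than a measurement of thinking overhead.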

What this means for production deployment: Qwen3's thinking mode requires pipeline-level handling. The <think> block must be stripped before extraction, or the inference configuration must set /no_think or equivalent parameters to disable thinking mode when structured output is required. Without this, the model may produce correct reasoning internally while the extractor fails to locate the final answer. See the Usage section for configuration guidance.
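A minimal sketch of the strip-then-extract step described above, assuming the raw model text may contain a <think> block followed by a single JSON object. The function name `extract_json_object` is hypothetical and is not part of quant_eval; it uses only the Python standard library:

```python
import json
import re
from typing import Optional


def extract_json_object(raw: str) -> Optional[dict]:
    """Return the first JSON object found outside any <think> block, or None."""
    # Drop complete think blocks, then anything after an unclosed <think> tag.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    text = re.sub(r"<think>.*", "", text, flags=re.DOTALL)
    text = text.replace("<|im_end|>", "")

    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


# Illustrative input: JSON-like text inside the think block must be ignored.
raw_output = (
    '<think>The plan might be {"plan": ["B"]} but let me check.</think>\n'
    '{"plan": ["A"], "checks": []}<|im_end|>'
)
print(extract_json_object(raw_output))
```

The example prints `{'plan': ['A'], 'checks': []}`. Stripping before scanning matters because the think block can itself contain JSON-like fragments that a naive "first brace" extractor would pick up.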

Finding 2: toolcall_only — Closest Schema Vocabulary in the Evaluated Series

Every Qwen2.5 model in the series uses incorrect argument container keys. Qwen3.6-27B produces the most schema-accurate bare tool call of any model evaluated:

Model	Raw Output	`tool_name` key correct	Arg names (`a`, `b`) correct	Container key
Qwen2.5-3B Q4_K_M	`{"tool": "add", "operands": [5, 10]}`	❌	❌	`operands` (array)
Qwen2.5-7B Q4_K_M	`{"tool": "add", "numbers": [5, 10]}`	❌	❌	`numbers` (array)
Qwen2.5-14B-1M Q4_K_M	`{"tool": "add", "input": {"x": 5, "y": 10}}`	❌	❌	`input`, `x`/`y`
Qwen2.5-32B Q4_K_M	`{"tool": "add", "params": {"a": 5, "b": 10}}`	❌	✅	`params`
Qwen3.6-27B Q4_K_M	`{"tool_name": "add", "arguments": {"a": 5, "b": 10}}`	✅	✅	`arguments`

Qwen3.6-27B is the only model in the evaluated series to produce "tool_name" as the outer key without system-prompt schema enforcement. Argument value names ("a", "b") are also correct. The only error is "arguments" instead of "args" as the container key — a single field name away from a fully correct schema. Explicit key-name instructions in the system prompt should fully resolve this.
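Where changing the system prompt is not an option, the same one-key mismatch could also be absorbed on the consumer side. A minimal sketch, assuming the downstream dispatcher expects the "tool_name"/"args" schema; `normalize_tool_call` is a hypothetical helper and was not part of the evaluation:

```python
import json


def normalize_tool_call(raw_json: str) -> dict:
    """Parse a bare tool call and rename an 'arguments' container to 'args'."""
    call = json.loads(raw_json)
    # Only rename when the expected key is absent, so correct output is untouched.
    if "args" not in call and "arguments" in call:
        call["args"] = call.pop("arguments")
    return call


# Raw output observed for Qwen3.6-27B in the toolcall_only family (table above).
observed = '{"tool_name": "add", "arguments": {"a": 5, "b": 10}}'
print(normalize_tool_call(observed))
```

This prints `{'tool_name': 'add', 'args': {'a': 5, 'b': 10}}`. A shim like this hides the mismatch rather than fixing it, so an explicit key-name instruction in the system prompt remains the more direct remedy.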

Finding 3: MCQ and Fuzz — Perfect at 27B

MCQ Case	Result	Raw
mcq_01	✅	`B`
mcq_02	✅	`B`
mcq_03	✅	`C`
mcq_04	✅	`B`
mcq_05	✅	`B`

5/5 perfect MCQ at 0.16 sec/case average — single letter, no contamination, no A-bias. All 20 fuzz cases pass at bucket=10. These families benefit from the model's size while being fast enough that thinking mode does not appear to activate.

Finding 4: toolcall — Correct Arithmetic, EOS Contamination

Both toolcall cases pass stage-1 and produce the correct final answer with EOS tokens — the standard Qwen Q4_K_M series pattern:

Case	Raw	Expected
tool_01	`{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>`	`5`
tool_02	`{"tool_name": "add", "args": {"a": 10, "b": -4}}<|im_end|> 6<|im_end|>`	`6`

Note that in the toolcall family (where a schema is provided in context), the model uses the correct "tool_name" and "args" keys — confirming that in-context schema examples fully resolve the vocabulary issue observed in toolcall_only. Strip <|im_end|> before downstream processing.
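The EOS-delimited shape of these outputs makes them easy to split. A minimal sketch, assuming the raw text always has the form "tool call JSON, EOS, final answer, EOS" as in the two cases above; `split_toolcall_output` is a hypothetical helper name:

```python
import json

EOS = "<|im_end|>"


def split_toolcall_output(raw: str) -> tuple:
    """Split '<tool call JSON><EOS> <answer><EOS>' into (tool_call, answer)."""
    # Splitting on EOS leaves empty trailing segments; drop them.
    parts = [p.strip() for p in raw.split(EOS) if p.strip()]
    tool_call = json.loads(parts[0])
    answer = parts[1] if len(parts) > 1 else None
    return tool_call, answer


# Raw output of case tool_01 from the table above.
raw_tool_01 = '{"tool_name": "add", "args": {"a": 2, "b": 3}}<|im_end|> 5<|im_end|>'
tool_call, answer = split_toolcall_output(raw_tool_01)
print(tool_call["tool_name"], tool_call["args"], answer)
```

This prints `add {'a': 2, 'b': 3} 5`. The sketch raises an exception on empty or malformed output, which a production pipeline would need to handle explicitly.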

Finding 5: Stateful and Hybrid — Clean at 27B

Both stateful and mixed_brief_json families produce clean correct outputs with strippable EOS:

Family	Case	Raw
stateful	state_01	`{"counter": 2}<|im_end|> {"counter": 5}<|im_end|>`
stateful	state_02	`{"items":["a","b"]}<|im_end|> {"items":["a","b","c"]}<|im_end|>`
mixed	mixed_01	`ANSWER: 13 {"a": 4, "b": 9, "sum": 13}<|im_end|>`
mixed	mixed_02	`ANSWER: 6 {"a": -2, "b": 8, "sum": 6}<|im_end|>`

These short-context tasks complete in under 0.75 seconds each; latencies this low suggest that thinking mode did not activate.

Signal-Level Diagnostics (Q4_K_M)

json_multistep

Signal	Rate	Notes
schema_ok	0.400	Root-level failure on medium/hard — thinking mode
checks_consistent_ok	0.400	Cascades from schema_ok=0
stop_semantics_ok	0.400	Cascades from schema_ok=0
oracle_equiv_ok	0.400	Cascades from schema_ok=0
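These rates follow directly from the five per-case results in Finding 1: two cases with every signal set and three with none. A short sketch of that tally; the dictionary layout is an assumption made for illustration, since the harness's internal format is not published:

```python
SIGNALS = ["schema_ok", "checks_consistent_ok", "stop_semantics_ok", "oracle_equiv_ok"]

# 1 = signal passed, 0 = failed. Values follow the Finding 1 tables; the easy
# cases' stop_semantics_ok=1 is implied by the 0.400 rate in the table above.
case_signals = {
    "ms_easy_01": dict.fromkeys(SIGNALS, 1),
    "ms_easy_02": dict.fromkeys(SIGNALS, 1),
    "ms_med_01": dict.fromkeys(SIGNALS, 0),
    "ms_med_02": dict.fromkeys(SIGNALS, 0),
    "ms_hard_01": dict.fromkeys(SIGNALS, 0),
}

n_cases = len(case_signals)
# Per-signal rate: fraction of cases where that signal passed.
rates = {s: sum(c[s] for c in case_signals.values()) / n_cases for s in SIGNALS}
# Family pass rate: fraction of cases where every signal passed.
pass_rate = sum(all(c.values()) for c in case_signals.values()) / n_cases

print(rates)
print(f"pass rate: {pass_rate:.3f}")
```

Every signal comes out at 0.4 and the family pass rate at 0.400. Because the four signals move together in all five cases, the per-signal rates carry no more information than the pass rate itself, which fits the reading that the failures occur at the extraction stage.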

stateful_followup

Signal	Rate
turn1_parse_ok	1.000
turn2_parse_ok	1.000
turn1_exact_match	1.000
turn2_exact_match	1.000

toolcall_only

Signal	Rate	Notes
tool_name_ok	1.000	`"tool_name"` correct — best in series without enforcement
args_ok	0.000	`"arguments"` vs `"args"`

mixed_brief_json

Signal	Rate
answer_line_ok	1.000
json_parse_ok	1.000
schema_ok	1.000

Series Context

Model	json_multistep	stateful	MCQ	VRAM (Q4_K_M)	Thinking Mode
Qwen2.5-3B	0.200	1.000	3/5	~4 GB	No
Qwen2.5-7B	0.800	1.000	4/5	~6 GB	No
Qwen2.5-14B-1M	0.800	1.000	5/5	~12 GB	No
Qwen2.5-32B	0.600	1.000	5/5	~24 GB	No
Qwen3.6-27B	0.400	1.000	5/5	~22 GB	✅ Hybrid

The json_multistep result of 0.400 is not a straightforward capability regression from the Qwen2.5 series. It is a pipeline compatibility finding: the evaluation harness was not configured to strip <think> blocks from Qwen3 responses on harder tasks. The two passing cases demonstrate the model is capable of correct structured multi-step planning output. The three failing cases are those where the model's adaptive thinking mode activates and the extraction layer fails — not where the model's reasoning is incorrect. With proper pipeline configuration (think-block stripping or /no_think mode), the json_multistep pass rate would be expected to improve.

Recommended Use Cases

✅ Deploy with Confidence (Q4_K_M)

Stateful multi-turn agents — 1.000 at both turns, clean JSON state with strippable EOS at 0.685 sec/case.
Hybrid brief + JSON responses — mixed_brief_json 1.000 in 0.705 sec.
MCQ and single-choice extraction — 5/5 perfect at 0.16 sec/case.
Structured JSON (single-step) — json and fuzz both bucket=10.000. All 20 fuzz cases pass.
Scaffolded tool-calling — toolcall stage-1 passes; strip EOS from final answer.

✅ Deploy with Think-Block Stripping (Q4_K_M)

Multi-step structured planning — ms_easy cases pass cleanly. Medium and hard cases require <think> block stripping or /no_think configuration before the extraction layer receives the model's output. See Usage section for implementation.

⚠️ Use with Guardrails (Q4_K_M)

Bare tool-call dispatch — toolcall_only fails only on "arguments" vs "args". Provide the exact key name in system prompt; the model already produces correct "tool_name" and value names without enforcement.

Hardware Requirements

Configuration	VRAM Required	Notes
Q4_K_M (this repo)	~22 GB	16.5 GB model + KV cache
Q4_K_M · full context	~26 GB	A10G 24 GB may require reduced context
F16 (provenance only)	~70 GB+	Multi-GPU or large-memory server

Usage

Installation

pip install llama-cpp-python huggingface_hub

For GPU acceleration (CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — Disabling Thinking Mode for Structured Output

For structured output tasks where <think> blocks are not wanted in the response, suppress thinking mode with the /no_think switch in the user message or system prompt:

from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import re, json

model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen3.6-27B-gguf-Q4-K-M",
    filename="qwen3.6-27B-gguf-Q4-K-M.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

# Option 1: Add /no_think to the user message to suppress thinking mode
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a precise assistant. Return structured JSON only when asked."
        },
        {
            "role": "user",
            "content": "Return a JSON object with keys: summary, risk_level, action_items. /no_think"
        }
    ],
    temperature=0.15,
    max_tokens=1024,
)
print(response["choices"][0]["message"]["content"])

For post-processing that strips <think> blocks if present:

import re

def strip_thinking(raw: str) -> str:
    """
    Strip <think>...</think> blocks and EOS tokens from Qwen3 output.
    quant_eval v7.21: medium/hard json_multistep cases fail when think blocks
    are not stripped before extraction. Easy cases pass without stripping.
    """
    # Remove complete think blocks
    clean = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL)
    # Remove a truncated block (opening tag with no closing tag, e.g. when
    # max_tokens is reached mid-thought); nothing after it is a final answer
    clean = re.sub(r'<think>.*', '', clean, flags=re.DOTALL)
    # Also strip residual EOS tokens
    clean = re.sub(r'<\|im_end\|>', '', clean).strip()
    return clean

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt here"}],
    temperature=0.15,
    max_tokens=2048,  # Allow space for thinking tokens
)
raw = response["choices"][0]["message"]["content"]
clean = strip_thinking(raw)
print(clean)

For tool-calling with schema enforcement and EOS stripping:

def call_tool(prompt: str) -> dict:
    """
    Tool dispatch for Qwen3.6-27B.
    quant_eval v7.21: model uses correct 'tool_name' key without enforcement.
    Only 'arguments' vs 'args' needs correction. EOS stripping required.
    """
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a tool-calling assistant. Output ONLY a JSON object "
                    "using EXACTLY these keys: "
                    '{"tool_name": "<name>", "args": {"a": <n>, "b": <n>}}\n'
                    "Then on the next line output the numeric result. /no_think"
                )
            },
            {"role": "user", "content": prompt}
        ],
        temperature=0.0,
        max_tokens=128,
    )
    raw = response["choices"][0]["message"]["content"]
    clean = strip_thinking(raw)
    return {"clean": clean, "raw": raw}

result = call_tool("Use the add tool to compute 10 minus 4.")
print(result["clean"])

CLI — llama-cli

# llama.cpp uses the chat template stored in the GGUF metadata by default.
llama-cli \
  --model qwen3.6-27B-gguf-Q4-K-M.gguf \
  --system-prompt "You are a precise assistant. Return structured outputs when requested." \
  --prompt "Return a JSON object with keys: summary, risk_level, action_items. /no_think" \
  --n-predict 1024 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --temp 0.15

For server deployment:

# llama.cpp uses the chat template stored in the GGUF metadata by default.
llama-server \
  --model qwen3.6-27B-gguf-Q4-K-M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --port 8080 \
  --host 0.0.0.0

Query via the OpenAI-compatible API with think-block stripping:

from openai import OpenAI
import re

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

response = client.chat.completions.create(
    model="qwen3.6-27B-gguf-Q4-K-M",
    messages=[{"role": "user", "content": "Your prompt here /no_think"}],
    temperature=0.15,
)
raw = response.choices[0].message.content
clean = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL)
clean = re.sub(r'<\|im_end\|>', '', clean).strip()
print(clean)

Evaluation Artifacts

The full per-case evaluation CSV (comparison_results_v7_21_Qwen3.6_27B_20260426_163540.csv) and rollup.json are published in this repository for independent verification.

Artifact Provenance

Artifact	Format	Size	SHA256	Evaluated
`qwen3.6-27B-gguf-Q4-K-M.gguf`	GGUF Q4_K_M	16.5 GB	`c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8`	✅ Yes
F16 (companion repo)	GGUF F16	53.8 GB	`79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb`	❌ VRAM constraint

Both artifacts were produced from Qwen/Qwen3.6-27B using a custom-built llama.cpp conversion and quantization pipeline developed by PBH Applied Systems.
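The SHA256 values in the table can be used to confirm that a downloaded file matches the evaluated artifact. A minimal sketch using only the Python standard library; the helper names and the local file path in the usage note are assumptions:

```python
import hashlib
from pathlib import Path

# Q4_K_M digest copied from the Artifact Provenance table above.
EXPECTED_SHA256 = "c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a 16.5 GB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_gguf(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Typical usage would be `verify_gguf(Path("qwen3.6-27B-gguf-Q4-K-M.gguf"))`, with the path adjusted to wherever the file was downloaded. A `False` result indicates an incomplete download or a different artifact.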

Evaluation Methodology

quant_eval v7.21 — proprietary behavioral evaluation harness, PBH Applied Systems.

Fixture set: golden_oracle_fixtures_v7_21 (SHA256: 6d71a0b9147c079371b02a94f3c149eb78a6adc03dc16ff6833b964fbf4174f0)

Family	Description	Pass Signals
`fuzz`	Property-based regression; structured placement correctness	schema_ok, constraints_ok
`json`	Single-step structured JSON with constraint rules	schema_ok, constraints_ok
`json_multistep`	Multi-step planning with self-check and oracle verification	schema_ok, checks_consistent_ok, stop_semantics_ok, oracle_equiv_ok
`mcq`	Multiple-choice extraction	choice_ok
`stateful_followup`	Two-turn state tracking; turn-2 correct given turn-1	turn1/2_parse_ok, turn1/2_exact_match
`mixed_brief_json`	Hybrid: natural language answer + valid JSON block	answer_line_ok, json_parse_ok, schema_ok
`toolcall`	Tool call embedded in response; parse + schema validation	stage1_tool_parse_ok, stage1_tool_schema_ok
`toolcall_only`	Bare schema-only tool call; strict tool name + args check	tool_name_ok, args_ok

Evaluation hardware: NVIDIA RTX 4090 · Evaluation date: April 26, 2026 · Seed: 42

🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com

Evaluated and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com

About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.

Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)

Core Service Areas: LLM Optimization & Deployment · AI Evaluation Frameworks · Agentic AI Infrastructure · Scalable AI Application Development · ML Pipeline Design & Analytics · Model & Agent Cataloging

📞 Work With PBH Applied Systems

Qwen3.6-27B is the first Qwen3-series model in the evaluated series, and its adaptive thinking mode introduces a new class of pipeline configuration requirement that doesn't apply to any Qwen2.5 model. The json_multistep result is not a capability score — it's a deployment readiness finding: structured output pipelines targeting Qwen3 models need think-block stripping. The toolcall_only result tells a different story: Qwen3 has learned the tool_name key convention without being told, which is a genuine capability improvement over the entire Qwen2.5 series. Both findings are only visible through systematic evaluation.

👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500

Connect


🌐	pbhappliedsystems.com
📧	patrick@pbhappliedsystems.com
💼	LinkedIn
▶️	YouTube
📸	Instagram
👍	Facebook

License

This GGUF repository inherits the license of the base model: Apache 2.0 — Qwen/Qwen3.6-27B

The quant_eval evaluation methodology, fixture set, and scoring framework are proprietary to PBH Applied Systems, LLC and are not included in this repository.

GGUF conversion, quantization, and behavioral evaluation performed by PBH Applied Systems, LLC · quant_eval v7.21 · Run ID: 20260426_163540
