Instructions to use NousResearch/Hermes-4.3-36B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use NousResearch/Hermes-4.3-36B-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="NousResearch/Hermes-4.3-36B-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("NousResearch/Hermes-4.3-36B-GGUF", dtype="auto")

llama-cpp-python

How to use NousResearch/Hermes-4.3-36B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="NousResearch/Hermes-4.3-36B-GGUF",
	filename="hermes-4_3_36b-Q3_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use NousResearch/Hermes-4.3-36B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

vLLM

How to use NousResearch/Hermes-4.3-36B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "NousResearch/Hermes-4.3-36B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NousResearch/Hermes-4.3-36B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

SGLang

How to use NousResearch/Hermes-4.3-36B-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "NousResearch/Hermes-4.3-36B-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NousResearch/Hermes-4.3-36B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "NousResearch/Hermes-4.3-36B-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "NousResearch/Hermes-4.3-36B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use NousResearch/Hermes-4.3-36B-GGUF with Ollama:
```
ollama run hf.co/NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
```

Unsloth Studio

How to use NousResearch/Hermes-4.3-36B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for NousResearch/Hermes-4.3-36B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for NousResearch/Hermes-4.3-36B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for NousResearch/Hermes-4.3-36B-GGUF to start chatting

Pi

How to use NousResearch/Hermes-4.3-36B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use NousResearch/Hermes-4.3-36B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use NousResearch/Hermes-4.3-36B-GGUF with Docker Model Runner:
```
docker model run hf.co/NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M
```

Lemonade

How to use NousResearch/Hermes-4.3-36B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Hermes-4.3-36B-GGUF-Q4_K_M

List all available models

lemonade list

Hermes 4.3 - Seed 36B

Model Description

Hermes 4.3 36B is a frontier, hybrid-mode reasoning model from Nous Research, built on the ByteDance Seed 36B base model and aligned to you.

This is our first Hermes model trained in a decentralized manner over the internet using Psyche. Read the blog post: https://nousresearch.com/introducing-hermes-4-3/

Read the Hermes 4 technical report here: https://arxiv.org/abs/2508.18255

Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

What’s new vs Hermes 3

Post-training corpus: dataset size massively increased from 1M samples and 1.2B tokens to ~5M samples and ~60B tokens, blended across reasoning and non-reasoning data.
Hybrid reasoning mode: explicit <think>…</think> segments when the model decides to deliberate, with options for faster responses when you want them.
Reasoning: top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
Steerability: much easier to steer and align, with extreme improvements in steerability, especially reduced refusal rates.

Our Mission: Frontier Capabilities Aligned to You

In pursuit of our mission to produce models that are open, steerable, and capable of the full range of human expression, while remaining alignable to your values, we created a new benchmark, RefusalBench, which tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.

Hermes 4.3 36B is now SOTA across non-abliterated models on the RefusalBench Leaderboard, surpassing our previous best of 59.5%, set by Hermes 4 70B.

% of Questions Answered – RefusalBench

(Average of 5 trials)

Model	% of Questions Answered
Hermes 4.3 36B Non-Reasoning	74.60%
Hermes 4.3 36B Reasoning	72.29%
Hermes 4 70B Reasoning	59.50%
Hermes 4 405B Reasoning	57.10%
grok4	51.30%
Hermes 4 70B	49.07%
Hermes 4 405B	43.20%
Qwen2.5 7B	36.10%
Qwen3 235B Reasoning	34.30%
DeepSeek V3	28.10%
Gemini 2.5 Pro	24.23%
Llama 405B	21.70%
Gemini 2.5 Flash	19.13%
GPT4o	17.67%
Sonnet 4	17.00%
GPT4-mini	16.76%
R1	16.70%
cogito-v2-405B Reasoning	15.40%
Opus 4.1	15.38%
Qwen3 235B	15.30%
cogito-v2-405B	14.94%
cogito-v2-405B	12.10%
GPT 5	11.34%
gpt-oss 120B	5.60%
gpt-oss 20B	4.79%

Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

Benchmarks (Hermes 4.3 36B)

	Hermes 4.3 36B Psyche	Hermes 4.3 36B Centralized	Hermes 4 70B Centralized
AIME 24	71.9	70.6	73.5
AIME 25	69.3	66.8	67.4
BBH	86.4	84.7	87.8
DROP	83.5	81.6	85.0
GPQA Diamond	65.5	64.8	66.1
IFEval	77.9	73.9	78.7
MATH-500	93.8	92.3	95.5
MMLU	87.7	86.5	88.4
MMLU-Pro	80.7	79.7	80.7
MuSR	69.7	64.7	70.4
OBQA	96.6	91.8	94.8
SimpleQA	6.0	5.6	17.9

Prompt Format

Hermes 4 uses the Llama-3-Chat format with role headers and special tags.

Basic chat:

<|start_header_id|>system<|end_header_id|>

You are Hermes 4. Be concise and helpful.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Explain the photoelectric effect simply.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
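
The layout above can be reproduced by hand for debugging or for raw completion endpoints. The following is a minimal sketch, not the model's actual chat template: `render_llama3_chat` is a hypothetical helper, and it assumes the format is exactly the header/`<|eot_id|>` pattern shown in the example. The real template may add a beginning-of-text token or differ in whitespace, so prefer `tokenizer.apply_chat_template(...)` when a tokenizer is available.

```python
from typing import Dict, List


def render_llama3_chat(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render messages into the role-header layout shown above.

    Hypothetical helper for illustration only; it mirrors the example
    text and does not replace the tokenizer's chat template.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_chat(
    [
        {"role": "system", "content": "You are Hermes 4. Be concise and helpful."},
        {"role": "user", "content": "Explain the photoelectric effect simply."},
    ]
)
print(prompt)
```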

Reasoning mode

Reasoning mode can be activated with the chat template via the flag thinking=True or by using the following system prompt:

You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.

When the model chooses to deliberate, it emits:

<|start_header_id|>assistant<|end_header_id|>
<think>
…model’s internal reasoning may appear here…
</think>
Final response starts here…<|eot_id|>

Additionally, we provide a flag that keeps the content between the <think> ... </think> tags, which you can enable by setting keep_cots=True.
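
When post-processing raw generations, it is often useful to separate the deliberation from the final response. The sketch below assumes the output contains at most one `<think>…</think>` segment laid out as in the example above; `split_reasoning` is a hypothetical helper name, not part of any library.

```python
import re
from typing import Optional, Tuple

# DOTALL lets the reasoning span multiple lines.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None if no <think> block exists."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched segment is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>\n2 + 2 is 4.\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

If the model answers without deliberating, the function returns `None` for the reasoning and the full text as the answer.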

Function Calling & Tool Use

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning:

System message (example):

<|start_header_id|>system<|end_header_id|>
You are a function-calling AI. Tools are provided inside <tools>…</tools>.
When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.
After a tool responds (as <tool_response>), continue reasoning inside <think> and produce the final answer.
<tools>
{"type":"function","function":{"name":"get_weather","description":"Get weather by city","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
</tools><|eot_id|>

Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.
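
If you build the system message yourself instead of relying on the chat template, the tool schemas can be serialized into the `<tools>` block programmatically. This is a rough sketch that copies the wording of the example system message above; `build_tool_system_prompt` is a hypothetical helper, and the prompt the chat template actually generates may be worded differently.

```python
import json
from typing import Any, Dict, List


def build_tool_system_prompt(tools: List[Dict[str, Any]]) -> str:
    """Assemble a system message with one JSON tool schema per line.

    Hypothetical helper; the instruction text is taken from the example
    system message, not from the model's chat template.
    """
    header = (
        "You are a function-calling AI. Tools are provided inside <tools>…</tools>.\n"
        "When appropriate, call a tool by emitting a <tool_call>{...}</tool_call> object.\n"
        "After a tool responds (as <tool_response>), continue reasoning inside <think> "
        "and produce the final answer.\n"
    )
    # Compact separators keep each schema on a single line, as in the example.
    tool_lines = "\n".join(json.dumps(tool, separators=(",", ":")) for tool in tools)
    return f"{header}<tools>\n{tool_lines}\n</tools>"


weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather by city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

system_prompt = build_tool_system_prompt([weather_tool])
print(system_prompt)
```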

The model will then generate tool calls within <tool_call> {tool_call} </tool_call> tags for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming. Automatic tool parsers for Hermes are also built into vLLM and SGLang: set the tool parser to hermes in vLLM and to qwen25 in SGLang.
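
When no built-in parser is available, the tags can be extracted by hand. The sketch below assumes each `<tool_call>` body is a single JSON object; the `name`/`arguments` keys in the sample are an assumption for illustration, since the text above does not spell out the object's fields. `extract_tool_calls` is a hypothetical helper name.

```python
import json
import re
from typing import Any, Dict, List

# Capture the body of every <tool_call>…</tool_call> pair, ignoring padding whitespace.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return the JSON objects found inside <tool_call> tags, skipping invalid ones."""
    calls = []
    for raw in TOOL_CALL_PATTERN.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Malformed bodies are skipped here; a real client might log or retry.
            continue
    return calls


# Sample output; the key names are assumed for this example.
sample = (
    "<think>The user wants the weather.</think>\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
calls = extract_tool_calls(sample)
print(calls)
```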

Inference Notes

Sampling defaults that work well: temperature=0.6, top_p=0.95, top_k=20.
Template: Use the Llama chat format for Hermes 4.3 36B, 70B, and 405B as shown above, or set add_generation_prompt=True when using tokenizer.apply_chat_template(...).
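
The sampling defaults can be sent with each request to a local OpenAI-compatible server. The sketch below assumes a llama.cpp server listening on http://localhost:8080/v1 (the base URL used in the Pi configuration above) and assumes the server accepts `top_k` as an extension to the OpenAI request schema. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Defaults taken from the inference notes above.
SAMPLING_DEFAULTS = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def build_chat_request(
    messages: List[Dict[str, str]],
    model: str = "NousResearch/Hermes-4.3-36B-GGUF:Q4_K_M",
    **overrides: Any,
) -> Dict[str, Any]:
    """Merge the recommended sampling defaults with per-request overrides."""
    return {"model": model, "messages": messages, **SAMPLING_DEFAULTS, **overrides}


def send_chat_request(
    payload: Dict[str, Any], base_url: str = "http://localhost:8080/v1"
) -> Dict[str, Any]:
    """POST the payload to the chat completions endpoint (needs a running server)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_request(
    [{"role": "user", "content": "Who are you?"}], max_tokens=256
)
print(json.dumps(payload, indent=2))
# send_chat_request(payload)  # uncomment once a server is listening
```

Keeping the defaults in one dictionary makes it straightforward to override a single value per request, for example `build_chat_request(messages, temperature=0.2)`.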

Transformers example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "NousResearch/Hermes-4.3-36B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {"role":"system","content":"You are Hermes 4. Be concise."},
    {"role":"user","content":"Summarize CRISPR in 3 sentences."}
]

# return_dict=True yields input_ids and attention_mask, which can be unpacked into generate()
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=400, temperature=0.6, top_p=0.95, top_k=20, do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Inference Providers:

Nous Portal
Chutes

Quantized / Smaller Variants

Hermes 4 is available as BF16 original weights, as well as FP8 variants and GGUF variants by LM Studio.

BF16 version: https://huggingface.co/NousResearch/Hermes-4.3-36B

See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728

How to cite

@misc{teknium2025hermes4technicalreport,
      title={Hermes 4 Technical Report}, 
      author={Ryan Teknium and Roger Jin and Jai Suphavadeeprasit and Dakota Mahan and Jeffrey Quesnelle and Joe Li and Chen Guang and Shannon Sands and Karan Malhotra},
      year={2025},
      eprint={2508.18255},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.18255}, 
}