OCC-RAG-1.7B-ONNX
GitHub | Technical Report | Cloud | Base model
Quantized ONNX export of occ-ai/OCC-RAG-1.7B
for cross-platform inference with ONNX Runtime and in-browser
inference with Transformers.js /
ONNX Runtime Web (WebGPU). It runs the full model locally: no server is involved and no data leaves
the device.
OCC-RAG-1.7B is a 1.7B-parameter small language model specialized for faithful, context-grounded question answering: given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and either answers from the context or abstains. It attains the best faithfulness across all evaluated scales (up to 32B). See the base model card for training details and benchmarks.
ONNX variants
All variants share the same tokenizer and graph topology (Qwen3 architecture with
KV-cache); they differ only in weight precision. Linear layers are block-quantized with
block size 32. Each file ≥ 2 GB ships its weights in a sibling *_data file that
loads automatically.
| dtype | File | Size | Description |
|---|---|---|---|
| fp16 | model_fp16.onnx | ~4.0 GB | All weights FP16 |
| q8 | model_q8.onnx | ~3.2 GB | INT8 MatMul (asymmetric, MatMulNBits) + FP32 embedding & lm_head |
| q4f32 | model_q4f32.onnx | ~2.3 GB | INT4 MatMul + FP32 embedding & lm_head |
| q4f16 | model_q4f16.onnx | ~1.6 GB | INT4 MatMul on a pre-fused FP16 graph; recommended for WebGPU |
| q4 | model_q4.onnx | ~1.3 GB | INT4 MatMul + INT4 embedding (GatherBlockQuantized) + INT4 lm_head; smallest |
Notes:
- q4f16 is recommended for the browser/WebGPU: its RMSNorm is pre-fused into (Skip)SimplifiedLayerNormalization so ONNX Runtime Web loads it at the default optimization level. Its INT4 weights are quantized from the FP32 master (identical INT4 blobs to q4f32; only the scales differ, FP16 vs FP32).
- q4 quantizes the token embedding and (tied) lm_head as well, giving the smallest footprint at a small quality cost.
- The full-precision fp32 base is not published here (it is ~8 GB); regenerate it from occ-ai/OCC-RAG-1.7B with the conversion script if you need it.
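To make the block-quantization notes more concrete, here is a minimal pure-Python sketch of asymmetric 4-bit quantization over blocks of 32 values. It is an illustration only: the function names are hypothetical, and the exporter's actual rounding, zero-point handling, and bit packing (MatMulNBits, GatherBlockQuantized) are not reproduced here.

```python
def quantize_block(values, bits=4):
    """Asymmetric quantization of one block; returns (codes, scale, zero_point)."""
    qmax = (1 << bits) - 1
    # Extend the range to include 0.0 so the zero point fits in [0, qmax].
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_block(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, block_size=32, bits=4):
    """Quantize a weight row block by block; each block keeps its own scale."""
    return [
        quantize_block(row[i:i + block_size], bits)
        for i in range(0, len(row), block_size)
    ]
```

Each block stores its own scale and zero point, so an outlier only degrades the 32 values that share its block. Under this scheme the integer codes can be identical between two variants while the stored scale precision differs, which is the relationship the notes describe between q4f16 and q4f32.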
The variant naming/layout follows the onnx-community
/ LiquidAI/LFM2.5-350M-ONNX convention.
See the smaller occ-ai/OCC-RAG-0.6B-ONNX
for a browser-friendly variant.
Model files
OCC-RAG-1.7B-ONNX/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json   # chat_template inlined (Transformers.js reads it here)
├── special_tokens_map.json
├── added_tokens.json
├── vocab.json
├── merges.txt
└── onnx/
    ├── model_fp16.onnx     (+ model_fp16.onnx.data)
    ├── model_q4.onnx
    ├── model_q4f16.onnx    # recommended for WebGPU
    ├── model_q4f32.onnx    (+ model_q4f32.onnx_data)
    └── model_q8.onnx       (+ model_q8.onnx_data)
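When loading a variant with ONNX Runtime outside Transformers.js, the external weight file has to sit next to the graph file. The sketch below is one possible way to fetch both; the helper names (`files_for`, `fetch_variant`) are hypothetical, and the file names are copied verbatim from the listing above, so adjust them if the repository differs.

```python
# File names copied from the listing above; None means no external data file.
ONNX_FILES = {
    "fp16": ("onnx/model_fp16.onnx", "onnx/model_fp16.onnx.data"),
    "q8": ("onnx/model_q8.onnx", "onnx/model_q8.onnx_data"),
    "q4f32": ("onnx/model_q4f32.onnx", "onnx/model_q4f32.onnx_data"),
    "q4f16": ("onnx/model_q4f16.onnx", None),
    "q4": ("onnx/model_q4.onnx", None),
}


def files_for(dtype):
    """Return the repo-relative files needed for one variant (graph first)."""
    graph, data = ONNX_FILES[dtype]
    return [graph] if data is None else [graph, data]


def fetch_variant(dtype, repo_id="occ-ai/OCC-RAG-1.7B-ONNX", download=None):
    """Download the graph and any external data; return the local graph path."""
    if download is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import hf_hub_download as download
    paths = [download(repo_id, name) for name in files_for(dtype)]
    return paths[0]
```

For example, `fetch_variant("q8")` would download both files into the same Hugging Face cache snapshot and return the path to pass to `ort.InferenceSession`.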
Input / output format
OCC-RAG uses a structured RAG prompt with special tokens. The chat template accepts a
documents= kwarg and emits the structural tokens automatically: pass the user message
as plain text and the sources as a list of {"text": ...} dicts. The question is wrapped
in <|query_start|> … <|query_end|> and each source in
<|source_start|><|source_id|>N … <|source_end|>.
The response has five sections, each delimited by special tokens: query analysis →
source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer. Parse the
final answer from <|answer_start|> … <|answer_end|>. Keep skip_special_tokens=False if
you need to read the structural tokens out of the raw output.
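The parsing step described above can be sketched as follows. Only the `<|answer_start|>` / `<|answer_end|>` tokens are taken from this card; the tokens that delimit the status section are not listed here, so the sketch falls back to searching for the ANSWERABLE / UNANSWERABLE keyword in the text before the answer, which is an assumption. `parse_response` is a hypothetical helper name.

```python
import re

ANSWER_RE = re.compile(r"<\|answer_start\|>(.*?)<\|answer_end\|>", re.DOTALL)
# Word boundaries keep ANSWERABLE from matching inside UNANSWERABLE.
STATUS_RE = re.compile(r"\b(UNANSWERABLE|ANSWERABLE)\b")


def parse_response(raw):
    """Extract status and answer from raw output decoded with special tokens kept."""
    match = ANSWER_RE.search(raw)
    answer = match.group(1).strip() if match else None
    # Look for the status only in the text that precedes the answer section.
    head = raw[:match.start()] if match else raw
    statuses = STATUS_RE.findall(head)
    status = statuses[-1] if statuses else None  # assume the last mention is the verdict
    return {"status": status, "answer": answer}
```

A missing answer section (for instance when generation hits `max_new_tokens`) yields `answer=None`, so callers can distinguish a truncated response from an abstention.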
We recommend greedy decoding (do_sample=False), the training/evaluation default baked into generation_config.json.
Usage: Transformers.js (browser / Node, WebGPU)
import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "occ-ai/OCC-RAG-1.7B-ONNX", {
  dtype: "q4f16", // WebGPU-friendly; or "q8" / "q4" / "fp16"
  device: "webgpu",
});

const question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?";
const documents = [
  { text: "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone." },
  { text: "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there." },
  { text: "Nova Scotia is a province on the east coast of Canada." },
];

// The chat template injects the <|query_*|> / <|source_*|> structural tokens.
const text = generator.tokenizer.apply_chat_template(
  [{ role: "user", content: question }],
  { documents, add_generation_prompt: true, tokenize: false },
);

const output = await generator(text, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: false }),
});
console.log(output[0].generated_text);
Usage: ONNX Runtime (Python)
pip install onnxruntime transformers numpy huggingface_hub
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

model_id = "occ-ai/OCC-RAG-1.7B-ONNX"
onnx_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
tok = AutoTokenizer.from_pretrained(model_id)
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents, tokenize=False, add_generation_prompt=True,
)
input_ids = np.array([tok.encode(prompt, add_special_tokens=False)], dtype=np.int64)
# Greedy decode with KV-cache: feed input_ids + attention_mask + position_ids and the
# past_key_values.{i}.{key,value} inputs (empty on the first step), then loop feeding the
# present.* outputs back in. Stop on eos ids 151643 / 151645 / 151683. model_q4f16.onnx
# expects FP16 KV-cache I/O; the others use FP32. See config.json for cache shapes
# ([batch, num_key_value_heads, seq, head_dim]).
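The loop outlined in the comment above might look like the sketch below. It assumes a batch size of 1, that the graph's inputs and outputs are named exactly as in that comment (`input_ids`, `attention_mask`, `position_ids`, `past_key_values.*`, `logits`, `present.*`), and that `num_kv_heads` and `head_dim` are read from config.json by the caller. `greedy_decode` is a hypothetical helper, not part of any library.

```python
import numpy as np

EOS_IDS = frozenset({151643, 151645, 151683})  # eos ids from the comment above


def greedy_decode(session, input_ids, num_kv_heads, head_dim,
                  eos_ids=EOS_IDS, max_new_tokens=512):
    """Greedy decoding with a KV-cache; returns the generated token ids."""
    inputs = {arg.name: arg for arg in session.get_inputs()}
    output_names = [arg.name for arg in session.get_outputs()]
    assert input_ids.shape[0] == 1, "this sketch handles batch size 1 only"

    # First step: every cache tensor is empty (sequence length 0).
    cache = {}
    for name, arg in inputs.items():
        if name.startswith("past_key_values."):
            dtype = np.float16 if "float16" in arg.type else np.float32
            cache[name] = np.zeros((1, num_kv_heads, 0, head_dim), dtype=dtype)

    step_ids = input_ids.astype(np.int64)
    seen = 0  # number of tokens already stored in the cache
    generated = []
    for _ in range(max_new_tokens):
        step_len = step_ids.shape[1]
        feeds = {"input_ids": step_ids, **cache}
        if "attention_mask" in inputs:
            feeds["attention_mask"] = np.ones((1, seen + step_len), dtype=np.int64)
        if "position_ids" in inputs:
            feeds["position_ids"] = np.arange(seen, seen + step_len, dtype=np.int64)[None, :]
        outputs = dict(zip(output_names, session.run(None, feeds)))
        seen += step_len

        next_id = int(np.argmax(outputs["logits"][0, -1]))  # greedy choice
        if next_id in eos_ids:
            break
        generated.append(next_id)

        # Feed present.* back in as past_key_values.* and decode one token at a time.
        cache = {
            name.replace("present.", "past_key_values.", 1): value
            for name, value in outputs.items()
            if name.startswith("present.")
        }
        step_ids = np.array([[next_id]], dtype=np.int64)
    return generated
```

With the `session`, `tok`, and `input_ids` from the snippet above, `tok.decode(greedy_decode(session, input_ids, num_kv_heads, head_dim), skip_special_tokens=False)` would give the raw response with its structural tokens intact. The cache dtype is picked from the session's input metadata, which should cover the FP16 KV-cache of model_q4f16.onnx as well as the FP32 variants.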
Limitations
- Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge; not a general-purpose chat or knowledge model.
- Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
- Quantization. The INT4 variants (q4, q4f16) trade a small amount of quality for size/speed; prefer fp16/q8 when accuracy matters most.
License
Released under the MIT License, inherited from the base model.
Citation
@misc{savkin2026occragoptimalcognitivecore,
title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
year = {2026},
eprint = {2606.00683},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.00683}
}