OCC-RAG-1.7B-ONNX


GitHub | Technical Report | Cloud | Base model

Quantized ONNX export of occ-ai/OCC-RAG-1.7B for cross-platform inference with ONNX Runtime and in-browser inference with Transformers.js / ONNX Runtime Web (WebGPU). The full model runs locally: no server is involved and no data leaves the device.

OCC-RAG-1.7B is a 1.7B-parameter small language model specialized for faithful, context-grounded question answering: given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context supports an answer, and either answers from the context or abstains. The authors report that it attains the best faithfulness among the models they evaluated, which span scales up to 32B parameters. See the base model card for training details and benchmarks.

ONNX variants

All variants share the same tokenizer and model architecture (Qwen3 with KV-cache); they differ in weight precision and, for q4f16, in graph fusion and KV-cache I/O dtype (see the notes below). Linear layers are block-quantized with block size 32. Each variant of 2 GB or more ships its weights in a sibling *_data file, which loads automatically as long as it sits next to the .onnx file.

dtype	File	Size	Description
`fp16`	`model_fp16.onnx`	~4.0 GB	All weights FP16
`q8`	`model_q8.onnx`	~3.2 GB	INT8 MatMul (asymmetric, MatMulNBits) + FP32 embedding & lm_head
`q4f32`	`model_q4f32.onnx`	~2.3 GB	INT4 MatMul + FP32 embedding & lm_head
`q4f16`	`model_q4f16.onnx`	~1.6 GB	INT4 MatMul on a pre-fused FP16 graph — recommended for WebGPU
`q4`	`model_q4.onnx`	~1.3 GB	INT4 MatMul + INT4 embedding (GatherBlockQuantized) + INT4 lm_head — smallest

Notes:

q4f16 is recommended for the browser/WebGPU: its RMSNorm is pre-fused into (Skip)SimplifiedLayerNormalization so ONNX Runtime Web loads it at the default optimization level. Its INT4 weights are quantized from the FP32 master (identical INT4 blobs to q4f32; only the scales differ — FP16 vs FP32).
q4 quantizes the token embedding and (tied) lm_head as well, giving the smallest footprint at a small quality cost.
The full-precision fp32 base is not published here (it is ~8 GB); regenerate it from occ-ai/OCC-RAG-1.7B with the conversion script if you need it.
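
To make "block-quantized with block size 32" concrete, here is a minimal NumPy sketch of asymmetric block-wise quantization. It is an illustration of the general technique only, not the conversion script used for this repository: the function names are hypothetical, and the real MatMulNBits layout (packed 4-bit values, integer zero points) differs from the float offset used here.

```python
import numpy as np


def quantize_blockwise(weights, bits=4, block_size=32):
    """Quantize a 1-D float array in independent blocks of `block_size` values.

    Assumes len(weights) is divisible by block_size. Returns the integer codes
    plus the per-block scale and offset needed to reconstruct the weights.
    """
    blocks = np.asarray(weights, dtype=np.float32).reshape(-1, block_size)
    qmax = 2**bits - 1                      # 15 for INT4, 255 for INT8
    lo = blocks.min(axis=1, keepdims=True)  # per-block offset
    hi = blocks.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)  # constant block
    codes = np.clip(np.rint((blocks - lo) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, lo


def dequantize_blockwise(codes, scale, lo):
    """Reconstruct approximate float weights from codes, scales and offsets."""
    return (codes.astype(np.float32) * scale + lo).reshape(-1)
```

Each block keeps its own scale, so an outlier only degrades the 32 values that share its block. Storing the scales in FP16 rather than FP32 is the kind of difference the q4f16 / q4f32 note above refers to.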

The variant naming/layout follows the onnx-community / LiquidAI/LFM2.5-350M-ONNX convention. See the smaller occ-ai/OCC-RAG-0.6B-ONNX for a browser-friendly variant.

Model files

OCC-RAG-1.7B-ONNX/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json     # chat_template inlined (Transformers.js reads it here)
├── special_tokens_map.json
├── added_tokens.json
├── vocab.json
├── merges.txt
└── onnx/
    ├── model_fp16.onnx        (+ model_fp16.onnx_data)
    ├── model_q4.onnx
    ├── model_q4f16.onnx       # ← WebGPU
    ├── model_q4f32.onnx       (+ model_q4f32.onnx_data)
    └── model_q8.onnx          (+ model_q8.onnx_data)
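
When downloading from Python, a variant that has a sibling data file needs both files in the same folder. One way to do this, sketched below, is to pass a glob pattern to `snapshot_download` from `huggingface_hub`. The helper names `variant_patterns` and `download_variant` are hypothetical, and the choice of `*.json` / `*.txt` for the tokenizer and config files is an assumption based on the listing above.

```python
def variant_patterns(dtype):
    """Glob patterns covering one variant's graph file and its data file, if any."""
    # The trailing * picks up the sibling data file next to model_<dtype>.onnx.
    return [f"onnx/model_{dtype}.onnx*"]


def download_variant(dtype, repo_id="occ-ai/OCC-RAG-1.7B-ONNX"):
    """Fetch tokenizer/config files plus one ONNX variant; returns the local folder."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id,
        allow_patterns=["*.json", "*.txt", *variant_patterns(dtype)],
    )
```

The returned folder can then be used for both `AutoTokenizer.from_pretrained(...)` and the path to `onnx/model_<dtype>.onnx`.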

Input / output format

OCC-RAG uses a structured RAG prompt with special tokens. The chat template accepts a documents= kwarg and emits the structural tokens automatically — pass the user message as plain text and the sources as a list of {"text": ...} dicts. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.

The response has five sections, each delimited by special tokens: query analysis → source analysis → reasoning → status (ANSWERABLE / UNANSWERABLE) → answer. Parse the final answer from <|answer_start|> … <|answer_end|>. Keep skip_special_tokens=False if you need to read the structural tokens out of the raw output.
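
A minimal sketch of the parsing step, assuming the raw output was decoded with special tokens kept. Only the `<|answer_start|>` / `<|answer_end|>` delimiters come from this card. The status lookup is an assumption: it takes the last standalone ANSWERABLE / UNANSWERABLE word before the answer, because the status delimiters are not listed here. `parse_response` is a hypothetical helper name.

```python
import re

ANSWER_RE = re.compile(r"<\|answer_start\|>(.*?)<\|answer_end\|>", re.DOTALL)
# UNANSWERABLE is listed first; \b stops ANSWERABLE from matching inside it.
STATUS_RE = re.compile(r"\b(UNANSWERABLE|ANSWERABLE)\b")


def parse_response(raw):
    """Extract the status and final answer from a raw OCC-RAG generation.

    Returns a dict with keys "status" and "answer"; either may be None when the
    corresponding part is missing (for example, a generation cut off early).
    """
    answer_match = ANSWER_RE.search(raw)
    answer = answer_match.group(1).strip() if answer_match else None

    # Assumption: the status is the last such word before the answer section.
    head = raw[: answer_match.start()] if answer_match else raw
    statuses = STATUS_RE.findall(head)
    status = statuses[-1] if statuses else None
    return {"status": status, "answer": answer}
```

Checking `status` before using `answer` lets the caller treat an abstention differently from a grounded answer.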

We recommend greedy decoding (do_sample=False), the training/evaluation default baked into generation_config.json.

Usage — Transformers.js (browser / Node, WebGPU)

import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "occ-ai/OCC-RAG-1.7B-ONNX", {
  dtype: "q4f16",   // WebGPU-friendly; or "q8" / "q4" / "fp16"
  device: "webgpu",
});

const question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?";
const documents = [
  { text: "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone." },
  { text: "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there." },
  { text: "Nova Scotia is a province on the east coast of Canada." },
];

// The chat template injects the <|query_*|> / <|source_*|> structural tokens.
const text = generator.tokenizer.apply_chat_template(
  [{ role: "user", content: question }],
  { documents, add_generation_prompt: true, tokenize: false },
);

const output = await generator(text, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: false }),
});
console.log(output[0].generated_text);

Usage — ONNX Runtime (Python)

pip install onnxruntime transformers numpy huggingface_hub

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

model_id = "occ-ai/OCC-RAG-1.7B-ONNX"
onnx_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
tok = AutoTokenizer.from_pretrained(model_id)
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents, tokenize=False, add_generation_prompt=True,
)
input_ids = np.array([tok.encode(prompt, add_special_tokens=False)], dtype=np.int64)
# Greedy decode with KV-cache: feed input_ids + attention_mask + position_ids and the
# past_key_values.{i}.{key,value} inputs (empty on the first step), then loop feeding the
# present.* outputs back in. Stop on eos ids 151643 / 151645 / 151683. model_q4f16.onnx
# expects FP16 KV-cache I/O; the others use FP32. See config.json for cache shapes
# ([batch, num_key_value_heads, seq, head_dim]).
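
The loop described in the comment above can be sketched as follows. This is a hedged illustration, not a reference implementation: `greedy_decode` is a hypothetical helper, it handles batch size 1 only, and it assumes the graph exposes an output named `logits` alongside the `present.*` outputs.

```python
import numpy as np

EOS_IDS = frozenset({151643, 151645, 151683})  # stop ids quoted in the comment above


def greedy_decode(session, input_ids, num_layers, num_kv_heads, head_dim,
                  max_new_tokens=512, kv_dtype=np.float32, eos_ids=EOS_IDS):
    """Greedy decoding with a KV-cache for batch size 1; returns the new token ids."""
    input_names = {arg.name for arg in session.get_inputs()}
    output_names = [arg.name for arg in session.get_outputs()]

    # First step: an empty cache, i.e. a zero-length sequence axis.
    past = {
        f"past_key_values.{i}.{kind}": np.zeros((1, num_kv_heads, 0, head_dim), dtype=kv_dtype)
        for i in range(num_layers)
        for kind in ("key", "value")
    }

    step_ids = np.asarray(input_ids, dtype=np.int64)
    seen = 0  # number of tokens already held in the cache
    generated = []
    for _ in range(max_new_tokens):
        step_len = step_ids.shape[1]
        feeds = {
            "input_ids": step_ids,
            "attention_mask": np.ones((1, seen + step_len), dtype=np.int64),
            "position_ids": np.arange(seen, seen + step_len, dtype=np.int64)[None, :],
            **past,
        }
        # Feed only the inputs the graph declares (some exports omit position_ids).
        feeds = {name: value for name, value in feeds.items() if name in input_names}
        outputs = dict(zip(output_names, session.run(None, feeds)))
        seen += step_len

        next_id = int(np.argmax(outputs["logits"][0, -1]))  # greedy pick
        if next_id in eos_ids:
            break
        generated.append(next_id)

        # present.{i}.{key,value} becomes past_key_values.{i}.{key,value}.
        past = {
            name.replace("present", "past_key_values", 1): value
            for name, value in outputs.items()
            if name.startswith("present.")
        }
        step_ids = np.array([[next_id]], dtype=np.int64)
    return generated
```

With the `session`, `input_ids` and `tok` from the snippet above, the call would look like `new_ids = greedy_decode(session, input_ids, num_layers, num_kv_heads, head_dim)` followed by `tok.decode(new_ids, skip_special_tokens=False)`. The three size arguments are expected to come from config.json (in Qwen3 configs the fields are typically `num_hidden_layers`, `num_key_value_heads` and `head_dim`). Pass `kv_dtype=np.float16` for model_q4f16.onnx.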

Limitations

Context-grounded only. Trained to answer from the supplied sources and to ignore parametric knowledge — not a general-purpose chat or knowledge model.
Reasoning depth. Training/evaluation are capped at three-hop reasoning; longer chains are out of distribution.
Quantization. The INT4 variants (q4, q4f16, q4f32) trade a small amount of quality for size/speed; prefer fp16 / q8 when accuracy matters most.

License

Released under the MIT License, inherited from the base model.

Citation

@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}


