GrepSeek-Qwen3.5-9B-GRPO

The full GrepSeek model. GrepSeek is a Direct Corpus Interaction (DCI) search agent: rather than retrieving from a pre-computed dense or sparse index, it answers questions by issuing Unix shell commands (rg, grep, head, …) directly against a raw 21M-passage Wikipedia corpus, interleaving retrieval and reasoning in a single policy. This checkpoint is Qwen/Qwen3.5-9B, cold-start fine-tuned and then optimized with GRPO.

📄 GrepSeek: Training Search Agents for Direct Corpus Interaction · 💻 https://github.com/alirezasalemi7/grepseek

Why direct corpus interaction?

Index-based retrieval (dense or sparse) has three weaknesses: semantic smoothing (fine-grained entity and lexical distinctions get blurred), limited controllability (the agent cannot enforce exact filters or iteratively refine results), and redundant re-retrieval in multi-hop settings. By executing exact-string shell pipelines (e.g. rg -F), GrepSeek preserves lexical precision, isolates rare symbolic patterns and exact entity names, and composes multi-stage retrieval programs for compositional reasoning. It needs no embedding index and no offline indexing, only the ~14 GB raw corpus.
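To make the idea of a multi-stage exact-string pipeline concrete, here is a minimal sketch on a toy three-passage corpus. It uses POSIX grep -F so it runs without extra tools; rg -F likewise treats the pattern as a literal string. The passages and queries are invented for illustration and do not come from the GrepSeek corpus or from real agent trajectories.

```shell
#!/bin/sh
# Toy corpus: one passage per line (hypothetical data, not the real Wikipedia dump).
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
Paris is the capital of France and lies on the Seine.
Paris Hilton is an American media personality.
Paris, Texas is a city in Lamar County, Texas.
EOF

# How many passages mention the ambiguous entity name at all?
stage1_count=$(grep -cF 'Paris' "$corpus")

# Stage 1: fixed-string match on the entity name (-F disables regex interpretation).
# Stage 2: narrow the candidates with a second exact filter.
# Stage 3: cap the output, as an agent might do to keep its context short.
hits=$(grep -F 'Paris' "$corpus" | grep -F 'Texas' | head -n 1)

echo "candidates after stage 1: $stage1_count"
printf '%s\n' "$hits"

rm -f "$corpus"
```

Each stage is an exact filter, so the second grep can disambiguate between entities that share a name without relying on any similarity score.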

Training

  • Initialized from: alireza7/GrepSeek-Qwen3.5-9B-SFT (cold-start SFT on alireza7/GrepSeek-ColdStart-SFT-10k; base Qwen/Qwen3.5-9B).
  • RL: GRPO, group size n=5, reward = token-F1 × binary format gate (only structurally valid <think>/<tool_call>/<tool_response>/<answer> trajectories get non-zero reward), 200 steps, LR 5e-6, batch 256, KL disabled, Ulysses SP=2, on 4×A100-80GB. Trained only on NQ + HotpotQA.
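The reward and the group-relative advantage described above can be written out as follows. This is a sketch based on the description in this card and on the commonly used GRPO formulation. The answer normalization and tokenization behind token-F1, and the division by the group standard deviation, are assumptions here, so the paper and code remain the authoritative definitions.

```latex
% Reward for a trajectory \tau with predicted answer tokens \hat{a} and gold answer tokens a.
% |\hat{a} \cap a| denotes the multiset overlap of tokens (assumed definition).
\[
r(\tau) = \mathbb{1}\left[\tau \text{ is structurally valid}\right] \cdot \mathrm{F1}(\hat{a}, a),
\qquad
\mathrm{F1}(\hat{a}, a) = \frac{2PR}{P + R},
\qquad
P = \frac{|\hat{a} \cap a|}{|\hat{a}|},
\quad
R = \frac{|\hat{a} \cap a|}{|a|}
\]

% Group-relative advantage over the n = 5 rollouts sampled for one question (assumed standard form).
\[
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_5)}{\operatorname{std}(r_1, \dots, r_5)},
\qquad i = 1, \dots, 5
\]
```

Under this reading, a trajectory with a malformed tag structure gets zero reward regardless of its answer, and F1 is taken as 0 when the prediction and the gold answer share no tokens.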

⚠️ A tool-using agent, not a standalone chatbot

The model emits <tool_call> shell commands that must be executed against the corpus and returned as <tool_response> turns. You need the corpus (PeterJinGo/wiki-18-corpus), a tool-calling vLLM server, and the GrepSeek inference harness; the code repo provides the scripts for all three.
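As a rough illustration of what one turn of such a tool loop involves, the sketch below extracts a command from a model turn, runs it inside a corpus directory, and wraps the output in a <tool_response> block. The helper run_tool_call, the demo file wiki.txt, and above all the assumption that the <tool_call> body is a raw one-line shell command are hypothetical; the actual harness in the repo may use a different payload format and handles this step itself.

```shell
#!/bin/sh
# Hypothetical helper: run the command found in one model turn and wrap the result.
# ASSUMPTION: the <tool_call> body is a raw one-line shell command. The real
# harness may use a different (e.g. JSON) payload and should sandbox execution.
run_tool_call() {
    turn=$1        # text emitted by the model
    corpus_dir=$2  # directory holding the corpus files
    # Extract the text between <tool_call> and </tool_call>.
    cmd=$(printf '%s\n' "$turn" | sed -n 's:.*<tool_call>\(.*\)</tool_call>.*:\1:p')
    if [ -z "$cmd" ]; then
        echo 'no tool call found' >&2
        return 1
    fi
    # Run inside the corpus directory; merge stderr so errors reach the model too.
    out=$(cd "$corpus_dir" && sh -c "$cmd" 2>&1)
    printf '<tool_response>\n%s\n</tool_response>\n' "$out"
}

# Demo on a two-line toy corpus (invented data).
demo_dir=$(mktemp -d)
printf 'Ada Lovelace wrote the first program.\nAlan Turing proposed the Turing test.\n' > "$demo_dir/wiki.txt"
response=$(run_tool_call '<think>find Turing</think><tool_call>grep -F "Turing test" wiki.txt</tool_call>' "$demo_dir")
printf '%s\n' "$response"
rm -rf "$demo_dir"
```

Executing model-generated commands is inherently risky, so anything beyond a toy demo should restrict the allowed commands and run them in an isolated environment.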

Usage

git clone https://github.com/alirezasalemi7/grepseek && cd grepseek
# env: TRAINING_ENV.md  ·  corpus: cold_start_sft/download_corpus.py

# 1. serve this checkpoint
MODEL_PATH=alireza7/GrepSeek-Qwen3.5-9B-GRPO bash rl/serve_rl.sh        # -> http://localhost:10730/v1

# 2a. generation on your own questions
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
  bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
    --model grepseek --temperature 0.6 --input my_questions.jsonl --out_dir out

# 2b. reproduce the benchmark eval (token-F1 / EM on the Search-R1 suite)
GREPSEEK_CORPUS_ROOT=/path/to/wiki_18_corpus \
  bash inference/run_inference.sh --base_url http://localhost:10730/v1 \
    --model grepseek --temperature 0.6 --datasets all --out_dir eval

The inference harness also ships a semantics-preserving sharded-parallel execution engine and a persistent search daemon. Together they accelerate corpus search by up to 7.6× while producing output that is byte-identical to sequential grep.
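The general idea behind sharded search that stays byte-identical to a sequential scan can be sketched in a few lines of shell. This is not the repo's engine: the toy corpus, the shard size, and the use of split with background jobs are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: shard a toy corpus, search the shards concurrently, and merge the
# results in shard order so the output matches a sequential grep byte for byte.
work=$(mktemp -d)

# Hypothetical 12-line corpus; every third line mentions "needle".
i=1
while [ "$i" -le 12 ]; do
    if [ $((i % 3)) -eq 0 ]; then
        echo "passage $i contains needle"
    else
        echo "passage $i is filler"
    fi
    i=$((i + 1))
done > "$work/corpus.txt"

# Reference result: one sequential pass over the whole file.
grep -F 'needle' "$work/corpus.txt" > "$work/sequential.out"

# Split into 4-line shards (shard_aa, shard_ab, ...); suffix order follows file order.
split -l 4 "$work/corpus.txt" "$work/shard_"

# Search every shard in a background job, each writing to its own output file.
for shard in "$work"/shard_*; do
    grep -F 'needle' "$shard" > "$shard.out" &
done
wait

# Concatenate the per-shard outputs in lexicographic (= original) order.
cat "$work"/shard_*.out > "$work/parallel.out"

if cmp -s "$work/sequential.out" "$work/parallel.out"; then
    same=yes
else
    same=no
fi
matches=$(grep -c 'needle' "$work/parallel.out")
echo "byte-identical: $same ($matches matches)"
rm -rf "$work"
```

The merge step is what preserves semantics here: shards finish in arbitrary order, but their outputs are concatenated in corpus order. Options whose output depends on global position, such as line numbers from -n, would need extra offset handling that this sketch omits.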

Results (token-level F1)

The model was trained only on NQ and HotpotQA (marked *); the other five benchmarks are out-of-distribution. GrepSeek achieves the best micro-average and wins 4 of 7 benchmarks.

| Model | NQ* | TriviaQA | PopQA | HotpotQA* | 2Wiki | MuSiQue | Bamboogle | micro-avg |
|---|---|---|---|---|---|---|---|---|
| Search-R1 (Qwen3-Emb-4B, best baseline) | 0.5067 | 0.7693 | 0.5101 | 0.5591 | 0.4299 | 0.2878 | 0.6989 | 0.5441 |
| GrepSeek (this model) | 0.5223 | 0.7673 | 0.4861 | 0.6231 | 0.5178 | 0.3006 | 0.6212 | 0.5691 |

Micro-average EM = 0.4948 (also best overall; full EM table in the paper). The largest gains are on the multi-hop benchmarks 2Wiki (+0.088 F1) and HotpotQA (+0.064 F1), with a smaller gain on MuSiQue (+0.013 F1); these tasks reward exact entity disambiguation and iterative evidence aggregation.

Limitations

Because retrieval is purely lexical, GrepSeek is weaker on queries with surface-form variation and on long-tail queries, e.g. PopQA (diacritics, name variants). grep also has no semantic relevance ranking, so an authoritative passage can be buried behind matches that appear earlier in file order. Dense retrieval remains advantageous on heavily semantic or paraphrase-driven queries.
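The surface-form problem is easy to reproduce with a toy example. In the sketch below, an exact query for the ASCII spelling of a name misses the passage that uses a diacritic, and listing both spellings with repeated -e patterns recovers it. The two passages are invented, and the multi-pattern query is only one possible workaround, not a documented GrepSeek behavior.

```shell
#!/bin/sh
# Toy passage list (hypothetical); the second entry is spelled with a diacritic.
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
Zoe Saldana is an American actress.
Zoë Kravitz is an American actress and singer.
EOF

# An exact query for the ASCII spelling misses the "Zoë" passage entirely.
# grep exits with status 1 when nothing matches; "|| true" keeps going under set -e.
ascii_hits=$(grep -cF 'Zoe Kravitz' "$corpus" || true)

# Listing both surface forms with repeated -e patterns recovers the passage.
both_hits=$(grep -cF -e 'Zoe Kravitz' -e 'Zoë Kravitz' "$corpus" || true)

echo "ascii only: $ascii_hits, both variants: $both_hits"
rm -f "$corpus"
```

This only helps when the variants can be enumerated in advance, which is exactly what is hard for long-tail entities.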

License

Inherits the license of the base model Qwen/Qwen3.5-9B — confirm and update the license field above if needed.

Citation

@misc{salemi2026grepseektrainingsearchagents,
      title={GrepSeek: Training Search Agents for Direct Corpus Interaction},
      author={Alireza Salemi and Chang Zeng and Atharva Nijasure and Jui-Hui Chung and Razieh Rahimi and Fernando Diaz and Hamed Zamani},
      year={2026},
      eprint={2605.29307},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.29307},
}