DS-R1-Distill-Qwen-7B + Polaris RL (PLAIN SGD, step 100, peak val)
This is the peak-validation checkpoint (step 100/412) of a plain-SGD RL fine-tune of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on POLARIS-Project/Polaris-Dataset-53K. Released as part of an ICML-2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.
Why this checkpoint exists
The goal is to compare the SVD-compressibility of the weight update ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so the comparison needed an SGD counterpart with the same base model and the same recipe. This checkpoint is one such counterpart.
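As a rough illustration of what "SVD-compressibility of ΔW" can mean, the sketch below measures the smallest rank that retains a given share of the squared Frobenius norm of a single weight delta. The function names, the 0.99 energy threshold, and the synthetic matrices are assumptions made for illustration; they are not the study's actual metric or code.

```python
import numpy as np

def rank_for_energy(delta_w, energy=0.99):
    """Smallest rank k whose top-k singular values keep `energy` of ||dW||_F^2."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    return min(k, len(s))

def truncate(delta_w, k):
    """Best rank-k approximation of delta_w (truncated SVD)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k]

# Synthetic stand-ins for one weight matrix before and after fine-tuning.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
delta = 1e-3 * rng.standard_normal((64, 2)) @ rng.standard_normal((2, 48))
w_ft = w_base + delta

delta_w = w_ft - w_base
k = rank_for_energy(delta_w, energy=0.99)
rel_err = np.linalg.norm(delta_w - truncate(delta_w, k)) / np.linalg.norm(delta_w)
print(f"rank for 99% energy: {k}, relative error at that rank: {rel_err:.2e}")
```

On real checkpoints the same two functions would be applied per weight matrix to the difference between the fine-tuned and base tensors; a lower rank at a fixed energy threshold indicates a more compressible update.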
Training recipe
| field | value |
|---|---|
| base model | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| dataset | POLARIS-Project/Polaris-Dataset-53K (52,779 train / 512 val) |
| algorithm | GRPO (adv_estimator=grpo, use_kl_loss=False, entropy_coeff=0, use_kl_in_reward=False) |
| optimizer | PLAIN SGD — momentum=0.0, nesterov=false, dampening=0.0, weight_decay=0.0 |
| learning rate | 1e-1 (constant) |
| train batch size | 128 (1 grad step per rollout batch) |
| ppo_micro_batch_size_per_gpu | 2 |
| rollout.n | 4 |
| rollout.temperature | 1.0 |
| max_prompt_length | 1024 |
| max_response_length | 8192 |
| epochs at this checkpoint | 0.24 (step 100 / 412) |
| hardware | 8× B200 (179 GB) |
| step time | ~115 s/step |
| trainer | verl (FSDP + vLLM rollout) |
The "PLAIN SGD" choice is load-bearing for the broader study: every claim about SGD update compressibility relies on each update being the raw first-order gradient scaled by the learning rate, with no momentum, dampening, or weight decay mixed in.
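To make the "pure first-order gradient" point concrete, here is a minimal pure-Python sketch of an SGD step that follows the update rule documented for torch.optim.SGD, written over plain lists. The helper name `sgd_step` and the toy numbers are hypothetical. With the settings from the recipe table (momentum, dampening, and weight decay all zero) the step reduces to w ← w − lr · g.

```python
def sgd_step(w, g, lr, momentum=0.0, dampening=0.0, nesterov=False,
             weight_decay=0.0, buf=None):
    """One SGD step on lists of floats; returns (new_weights, momentum_buffer)."""
    # Weight decay is folded into the gradient.
    g = [gi + weight_decay * wi for gi, wi in zip(g, w)]
    if momentum != 0.0:
        if buf is None:
            buf = list(g)  # first step: the buffer starts as the gradient
        else:
            buf = [momentum * b + (1.0 - dampening) * gi for b, gi in zip(buf, g)]
        g = [gi + momentum * b for gi, b in zip(g, buf)] if nesterov else buf
    new_w = [wi - lr * gi for wi, gi in zip(w, g)]
    return new_w, buf

# Plain SGD as in the recipe: the weight delta is exactly -lr * gradient.
w0 = [1.0, -2.0]
grad = [0.5, 0.25]
w1, buf1 = sgd_step(w0, grad, lr=0.1)
delta = [a - b for a, b in zip(w1, w0)]
print(delta)

# For contrast: with momentum the second step is no longer -lr * gradient.
wm1, bm1 = sgd_step(w0, grad, lr=0.1, momentum=0.9)
wm2, bm2 = sgd_step(wm1, grad, lr=0.1, momentum=0.9, buf=bm1)
```

In the plain case each step's ΔW is the gradient itself up to a scalar, so any structure found in the accumulated ΔW comes from the gradients rather than from optimizer state.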
Reward function — important note
DS-R1-Distill emits answers in the <think>...</think>\boxed{X} style and does NOT produce literal "Answer: X" prose. verl's default math_dapo.compute_score matches the Minerva-style regex r"(?i)Answer\s*:\s*([^\n]+)", so it would return -1 for every rollout. When every rollout in a group gets the same reward, the GRPO advantages are all zero and the gradient vanishes.
A custom reward function (reward_score_box_or_minerva.py) was used instead: it tries \boxed{} extraction first, then falls back to the Minerva regex, so both answer styles are supported.
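The boxed-first, Minerva-fallback order described above could be sketched as follows. This is not the contents of reward_score_box_or_minerva.py, which are not shown here. The function names, the exact-string comparison, and the +1 reward for a correct answer are assumptions; only the regex and the -1 value for a miss come from the text above.

```python
import re

# Minerva-style pattern quoted in the note above.
MINERVA_RE = re.compile(r"(?i)Answer\s*:\s*([^\n]+)")

def extract_boxed(text):
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
        i += 1
    return None  # unbalanced braces: treat as no boxed answer

def extract_answer(text):
    """Try \\boxed{} first, then fall back to the 'Answer: X' pattern."""
    boxed = extract_boxed(text)
    if boxed is not None:
        return boxed
    m = MINERVA_RE.search(text)
    return m.group(1).strip() if m else None

def score(text, ground_truth):
    """Assumed scoring: +1.0 for an exact string match, -1.0 otherwise."""
    pred = extract_answer(text)
    return 1.0 if pred is not None and pred == ground_truth.strip() else -1.0

print(score("<think>2+3=5</think>\\boxed{5}", "5"))
print(score("Some reasoning.\nAnswer: 5", "5"))
```

A real scorer would normally normalize the extracted and reference answers (whitespace, LaTeX variants, equivalent numeric forms) before comparing them; the exact match here only keeps the sketch short.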
Results
| metric | value |
|---|---|
| baseline val acc (step 0) | 14.3% |
| val acc at this ckpt (step 100) | 24.85% ← peak across the run |
| val acc at step 412 (final) | 21.7% |
The val acc trajectory is non-monotonic: it peaks at 24.85% at step 100, dips to 19% by step 300, and partially recovers to 21.7% by step 412. This checkpoint is therefore the best validation point, not the latest. For the latest checkpoint see Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final.
Use
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the checkpoint in bfloat16 on a single GPU
m = AutoModelForCausalLM.from_pretrained(
    "Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val")

# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]

# Apply the chat template and move the prompt token ids to the GPU
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
out = m.generate(ids, max_new_tokens=2048, do_sample=True, temperature=0.6)

# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
Caveats
- Reward signal is sparse: in early training only ~5% of training steps had a non-zero gradient norm (rewards were typically pinned at -1, with only 1-2 of 256 rollouts correct).
- ΔW magnitude is small relative to base weights — this is a feature, not a bug, for the compressibility study, but it means inference behavior is close to the base.
- Trained on math only; no code, no instruction-following data.
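The sparse-reward caveat follows from how GRPO normalizes rewards within each group of rollouts for the same prompt. The sketch below is an illustration under assumptions, not verl's code: the helper name, the use of the population standard deviation, and the eps value are all hypothetical. It shows that a group whose rewards are all -1 yields zero advantages, so it contributes no gradient, while a single correct rollout in the group produces a signal.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# rollout.n = 4 in the recipe, so each prompt has a group of four rewards.
all_wrong = grpo_advantages([-1.0, -1.0, -1.0, -1.0])
one_right = grpo_advantages([1.0, -1.0, -1.0, -1.0])
print(all_wrong)
print(one_right)
```

Since the recipe uses neither a KL loss nor an entropy bonus, a training step in which every group is uniformly wrong has no other term left to produce a gradient.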
Citation context
Work in progress: to be submitted to ICML 2026 as part of a paper on plain-SGD RL update compressibility for batched-LoRA serving.
Related models in this study:
- Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final: same run, final checkpoint