DS-R1-Distill-Qwen-7B + Polaris RL (PLAIN SGD, step 100, peak val)

This is the peak-validation checkpoint (step 100/412) of a plain-SGD RL fine-tune of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on POLARIS-Project/Polaris-Dataset-53K. Released as part of an ICML-2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.

Why this checkpoint exists

The motivation is to compare the SVD-compressibility of ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so the comparison needed an SGD counterpart trained from the same base with the same recipe. This checkpoint is one such counterpart.
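
One common way to make "SVD-compressibility" concrete is the relative Frobenius error of the best rank-$r$ approximation of a layer's update. This is offered as an illustrative definition, not necessarily the exact metric used in the study. With singular values $\sigma_1 \ge \sigma_2 \ge \dots$ of $\Delta W$, the Eckart-Young theorem gives:

$$
\Delta W = W_{\mathrm{ft}} - W_{\mathrm{base}} = \sum_{i} \sigma_i u_i v_i^\top,
\qquad
\frac{\lVert \Delta W - \Delta W_r \rVert_F^2}{\lVert \Delta W \rVert_F^2}
= \frac{\sum_{i > r} \sigma_i^2}{\sum_{i} \sigma_i^2},
\qquad
\Delta W_r = \sum_{i \le r} \sigma_i u_i v_i^\top .
$$

A smaller ratio at a given $r$ means the update is closer to something a rank-$r$ LoRA adapter could represent.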

Training recipe

| field | value |
|---|---|
| base model | `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` |
| dataset | `POLARIS-Project/Polaris-Dataset-53K` (52,779 train / 512 val) |
| algorithm | GRPO (`adv_estimator=grpo`, `use_kl_loss=False`, `entropy_coeff=0`, `use_kl_in_reward=False`) |
| optimizer | PLAIN SGD: `momentum=0.0`, `nesterov=false`, `dampening=0.0`, `weight_decay=0.0` |
| learning rate | `1e-1` (constant) |
| train batch size | 128 (1 grad step per rollout batch) |
| ppo_micro_batch_size_per_gpu | 2 |
| rollout.n | 4 |
| rollout.temperature | 1.0 |
| max_prompt_length | 1024 |
| max_response_length | 8192 |
| epochs at this checkpoint | 0.24 (step 100 / 412) |
| hardware | 8× B200 (179 GB) |
| step time | ~115 s/step |
| trainer | verl (FSDP + vLLM rollout) |
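
As a quick consistency check on the recipe, the step count and epoch fraction follow from the dataset and batch sizes. The sketch below assumes one optimizer step per batch of 128 prompts and that the trailing partial batch is dropped. The drop-last behavior is an assumption, not something stated in the recipe.

```python
train_prompts = 52_779      # train split size from the recipe
train_batch_size = 128      # prompts per optimizer step

# Assumes the last partial batch is dropped (hypothetical, not confirmed).
steps_per_epoch = train_prompts // train_batch_size

# Fraction of the train split seen after 100 steps.
epoch_fraction = 100 * train_batch_size / train_prompts

print(steps_per_epoch, round(epoch_fraction, 2))  # 412 0.24
```

Under these assumptions the 412-step run corresponds to a single pass over the training split.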

The "PLAIN SGD" choice is scientifically load-bearing for the broader study — every claim about SGD update compressibility relies on the update being the pure first-order gradient.
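
To spell out what "pure first-order" means here: with momentum, dampening, Nesterov, and weight decay all disabled and a constant learning rate $\eta = 0.1$, each optimizer step and the accumulated update after $T$ steps reduce to the expressions below. The symbol $g_t$ (the gradient of the GRPO loss at step $t$) is notation introduced for illustration, and the expressions ignore any gradient clipping or mixed-precision rounding the trainer may apply.

$$
W_{t+1} = W_t - \eta\, g_t,
\qquad
\Delta W = W_T - W_0 = -\eta \sum_{t=0}^{T-1} g_t .
$$

By contrast, an Adam step rescales each coordinate of the gradient by running moment estimates, so its accumulated update is no longer a plain sum of gradients.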

Reward function — important note

DS-R1-Distill emits `<think>...</think>\boxed{X}` style answers and does NOT produce literal "Answer: X" prose. verl's default `math_dapo.compute_score` matches the Minerva regex `r"(?i)Answer\s*:\s*([^\n]+)"`, so it would return -1 for every rollout. Under GRPO, identical rewards within a rollout group give zero advantage, which kills the gradient.
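
Why a constant reward kills the gradient can be seen from GRPO's group-relative advantage, which standardizes each rollout's reward against the other rollouts for the same prompt. The sketch below is a simplified plain-Python illustration, not verl's implementation. The function name `grpo_advantages`, the `eps` value, the use of the population standard deviation, and the +1 reward for a correct answer are all assumptions.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# rollout.n = 4: every rollout scored -1 because no answer was parsed.
all_wrong = grpo_advantages([-1.0, -1.0, -1.0, -1.0])

# Same group, but one rollout is scored as correct (+1 is assumed here).
one_right = grpo_advantages([-1.0, -1.0, 1.0, -1.0])

print(all_wrong)  # every advantage is exactly 0.0
print(one_right)  # positive for the correct rollout, negative for the rest
```

With all advantages at zero, the policy-gradient term vanishes for that prompt, so a reward function that never matches the model's answer format produces no learning signal at all.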

A custom reward function (`reward_score_box_or_minerva.py`) was used instead: it tries `\boxed{}` extraction first and falls back to the Minerva regex, so both answer styles are supported.
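
The boxed-first, Minerva-fallback logic can be sketched roughly as follows. This is not the contents of `reward_score_box_or_minerva.py`, which is not reproduced here. The helper names, the exact-string comparison, and the +1 reward for a correct answer are hypothetical choices made for illustration; only the regex and the -1 failure score come from the description above.

```python
import re

# Minerva-style pattern quoted in the text above.
MINERVA_RE = re.compile(r"(?i)Answer\s*:\s*([^\n]+)")

def last_boxed(text):
    """Return the contents of the last boxed{...} group, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
        i += 1
    return None  # unbalanced braces: treat as no answer

def last_minerva(text):
    """Return the text after the last 'Answer:' marker, if any."""
    matches = MINERVA_RE.findall(text)
    return matches[-1].strip() if matches else None

def compute_score(solution, ground_truth):
    """Boxed answer first, Minerva fallback; +1 if it matches, else -1."""
    pred = last_boxed(solution)
    if pred is None:
        pred = last_minerva(solution)
    # Exact string match is a simplification; a real scorer would normalize.
    return 1.0 if pred is not None and pred == ground_truth.strip() else -1.0

print(compute_score("<think>...</think> So \\boxed{\\frac{1}{2}}", "\\frac{1}{2}"))
print(compute_score("Answer: 42", "42"))
print(compute_score("no final answer here", "42"))
```

Taking the last match in each style reflects the fact that reasoning traces often contain intermediate boxed values before the final answer.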

Results

| metric | value |
|---|---|
| baseline val acc (step 0) | 14.3% |
| val acc at this ckpt (step 100) | 24.85% ← peak across the run |
| val acc at step 412 (final) | 21.7% |

The val acc trajectory is non-monotonic: it peaks at step 100 (24.85%), dips to 19% by step 300, and partially recovers to 21.7% by step 412. This checkpoint is therefore the best validation point, not the latest. For the latest checkpoint see Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val"
m = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained(repo)

# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]
# return_dict=True also returns the attention mask, which generate() uses
inputs = tok.apply_chat_template(
    msgs, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(m.device)
out = m.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Caveats

Reward signal is sparse: only ~5% of early training steps had a non-zero gradient norm (rewards were typically pinned at -1, with only 1-2 of 256 rollouts correct).
ΔW magnitude is small relative to base weights — this is a feature, not a bug, for the compressibility study, but it means inference behavior is close to the base.
Trained on math only; no code, no instruction-following data.

Citation context

Work in progress: a paper on plain-SGD RL update compressibility for batched-LoRA serving is being submitted to ICML 2026.

Related models in this study:

Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final — same run, final checkpoint
