DS-R1-Distill-Qwen-7B + Polaris RL (PLAIN SGD, step 100, peak val)

This is the peak-validation checkpoint (step 100/412) of a plain-SGD RL fine-tune of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on POLARIS-Project/Polaris-Dataset-53K. Released as part of an ICML-2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.

Why this checkpoint exists

The motivation is to compare the SVD-compressibility of ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so the comparison needed an SGD counterpart with the same base model and the same recipe. This checkpoint is one such counterpart.
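One common way to quantify SVD-compressibility is the relative Frobenius error left after keeping only the top r singular values of ΔW. The sketch below is illustrative only: it runs on synthetic NumPy matrices, and the helper name `rank_r_relative_error`, the matrix shapes, and the chosen ranks are assumptions, not the study's actual metric or code.

```python
import numpy as np

def rank_r_relative_error(w_base, w_ft, r):
    """Relative Frobenius error of the best rank-r approximation of dW = W_ft - W_base."""
    delta = w_ft - w_base
    # Singular values only; the best rank-r error depends on the discarded tail.
    s = np.linalg.svd(delta, compute_uv=False)
    total = np.sqrt(np.sum(s ** 2))
    if total == 0.0:
        return 0.0
    tail = np.sqrt(np.sum(s[r:] ** 2))
    return float(tail / total)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
# Synthetic update that is exactly rank 2 (hypothetical, for illustration).
low_rank_update = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 48))
w_ft = w_base + 0.01 * low_rank_update

err_r1 = rank_r_relative_error(w_base, w_ft, r=1)  # some energy is lost
err_r2 = rank_r_relative_error(w_base, w_ft, r=2)  # close to zero
print(err_r1, err_r2)
```

A lower error at a given rank means the update is more compressible at that rank. On real checkpoints the same function would be applied per weight matrix.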

Training recipe

field                         value
base model                    deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
dataset                       POLARIS-Project/Polaris-Dataset-53K (52,779 train / 512 val)
algorithm                     GRPO (adv_estimator=grpo, use_kl_loss=False, entropy_coeff=0, use_kl_in_reward=False)
optimizer                     PLAIN SGD (momentum=0.0, nesterov=false, dampening=0.0, weight_decay=0.0)
learning rate                 1e-1 (constant)
train batch size              128 (1 grad step per rollout batch)
ppo_micro_batch_size_per_gpu  2
rollout.n                     4
rollout.temperature           1.0
max_prompt_length             1024
max_response_length           8192
epochs at this checkpoint     0.24 (step 100 / 412)
hardware                      8× B200 (179 GB)
step time                     ~115 s/step
trainer                       verl (FSDP + vLLM rollout)

The "PLAIN SGD" choice is scientifically load-bearing for the broader study — every claim about SGD update compressibility relies on the update being the pure first-order gradient.
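To make the "pure first-order gradient" property concrete, here is a minimal NumPy sketch of a plain SGD step with no momentum, dampening, Nesterov correction, or weight decay. The helper name `plain_sgd_step` and the random toy gradients are assumptions for illustration; the actual run used verl's trainer, not this code. In this sketch, the accumulated ΔW after several steps is exactly −lr times the sum of the per-step gradients, with no optimizer state mixed in.

```python
import numpy as np

def plain_sgd_step(w, grad, lr):
    """One plain SGD step: no momentum, dampening, Nesterov, or weight decay."""
    return w - lr * grad

lr = 1e-1  # constant learning rate, as listed in the recipe
rng = np.random.default_rng(0)
w0 = rng.standard_normal((8, 6))
# Toy stand-ins for the gradients observed at each step (hypothetical values).
grads = [rng.standard_normal((8, 6)) for _ in range(5)]

w = w0.copy()
for g in grads:
    w = plain_sgd_step(w, g, lr)

delta_w = w - w0
expected = -lr * np.sum(grads, axis=0)
print(np.allclose(delta_w, expected))
```

An Adam step, by contrast, rescales each coordinate by running moment estimates, so its ΔW is not a plain sum of gradients.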

Reward function — important note

DS-R1-Distill emits <think>...</think>\boxed{X} style answers and does NOT produce literal "Answer: X" prose. verl's default math_dapo.compute_score matches the Minerva-style regex r"(?i)Answer\s*:\s*([^\n]+)", so it would return -1 for every rollout and leave no reward signal to produce a gradient.

A custom reward function (reward_score_box_or_minerva.py) was used instead. It tries \boxed{} extraction first and falls back to the Minerva regex, so both answer styles are supported.
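The box-first, Minerva-fallback logic can be sketched roughly as follows. This is not the contents of reward_score_box_or_minerva.py: the function names `last_boxed`, `extract_answer`, and `box_or_minerva_score` are hypothetical, the +1.0 score for a correct answer is an assumption (the text above only states -1 for failures), and exact string comparison stands in for whatever answer normalization the real scorer applies.

```python
import re

# Minerva-style pattern quoted in the note above.
MINERVA_RE = re.compile(r"(?i)Answer\s*:\s*([^\n]+)")

def last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
        i += 1
    return None  # unbalanced braces: treat as no boxed answer

def extract_answer(text):
    """Try \\boxed{} first, then fall back to the Minerva 'Answer: X' pattern."""
    boxed = last_boxed(text)
    if boxed is not None:
        return boxed
    matches = MINERVA_RE.findall(text)
    return matches[-1].strip() if matches else None

def box_or_minerva_score(response, ground_truth):
    """Assumed scoring: +1.0 on exact match, -1.0 otherwise."""
    pred = extract_answer(response)
    return 1.0 if pred is not None and pred == ground_truth.strip() else -1.0

print(box_or_minerva_score("<think>2+3</think>\\boxed{5}", "5"))
print(box_or_minerva_score("Answer: 5", "5"))
print(box_or_minerva_score("The result is 5", "5"))
```

Brace matching, rather than a single regex, keeps nested LaTeX such as \boxed{\frac{1}{2}} intact.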

Results

metric                           value
baseline val acc (step 0)        14.3%
val acc at this ckpt (step 100)  24.85% ← peak across the run
val acc at step 412 (final)      21.7%

The val acc trajectory is non-monotonic: it peaks at 24.85% at step 100, dips to 19% by step 300, and partially recovers to 21.7% by step 412. This checkpoint is therefore the best validation point, not the latest. For the latest checkpoint, see Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

m = AutoModelForCausalLM.from_pretrained(
    "Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val")

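# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]
# return_dict=True also returns the attention mask, which generate() uses
inputs = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt").to("cuda")
# Training used max_response_length=8192; raise max_new_tokens if the reasoning gets cut off
out = m.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))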

Caveats

  • Reward signal is sparse: in early training only ~5% of steps had a non-zero gradient norm, because rewards were typically pinned at -1 with only 1-2 of 256 rollouts correct.
  • ΔW magnitude is small relative to base weights — this is a feature, not a bug, for the compressibility study, but it means inference behavior is close to the base.
  • Trained on math only; no code, no instruction-following data.
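The sparse-reward caveat follows from how GRPO forms advantages: each rollout's reward is normalized against the other rollouts for the same prompt, so a group in which every rollout gets the same reward has zero advantage everywhere. The sketch below uses a common form of the group-relative advantage and is not verl's exact implementation; the function name `grpo_advantages`, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: rewards has shape (num_prompts, rollouts_per_prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# rollout.n = 4 in the recipe: four rollouts per prompt.
all_wrong = grpo_advantages([[-1.0, -1.0, -1.0, -1.0]])  # every rollout scored -1
one_right = grpo_advantages([[-1.0, -1.0, -1.0, 1.0]])   # one rollout scored +1 (assumed)

print(all_wrong)  # all zeros: this group contributes no policy gradient
print(one_right)  # the correct rollout gets a positive advantage, the rest negative
```

With the KL loss and entropy bonus disabled as in the recipe, a step in which every group is uniform should carry no gradient signal at all, which is consistent with most early steps reporting zero gradient norm.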

Citation context

Work in progress — being submitted to ICML 2026 with a paper on plain-SGD RL update compressibility for batched-LoRA serving.

Related models in this study:
