# Algorithm Config Variants (Step-200 Checkpoints)
This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.
W&B workspace (quick-pass tracking): https://wandb.ai/leon_at_work/soni_ablation_4b/workspace
## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| Baseline 4B | Baseline configuration used for comparison | 200 |
| a4_length_norm_sqrt | GRPO with sqrt length normalization (plus normalized reduction) | 200 |
| a5_sapo_on_sqrt | SAPO on top of the sqrt length-normalized setting | 200 |
| a6_eps_clip_02 | Sqrt length-normalized setting with PPO clip set to 0.2/0.2 | 200 |
| a8_dr_grpo_gspo | Dr.GRPO-style setting with GSPO | 200 |
| a9_dr_grpo_sapo | Dr.GRPO-style setting with SAPO | 200 |
| a10_grpo_norm_by_std | GRPO with std-based normalization enabled | 200 |
| a11_vanilla_grpo | Vanilla GRPO-style setting (norm_by_std + regular loss + eps=0.2) | 200 |
## Training-Parameter Deltas

Baseline reference (Baseline 4B): advantage_estimator=grpo, policy_loss_type=gspo, loss_reduction=sequence_mean, eps_clip_low/high=3e-4/4e-4, grpo_norm_by_std=false; the reward config includes multilevel_localization_f1_reward + multiturn_reward(minimal_turns=4, maximal_turns=4).
| Variant Folder | Parameter Delta vs Baseline 4B |
|---|---|
| a4_length_norm_sqrt | advantage_estimator=grpo_length_norm_sqrt; loss_reduction=seq_mean_token_sum_norm |
| a5_sapo_on_sqrt | Same as a4_length_norm_sqrt, plus policy_loss_type=sapo |
| a6_eps_clip_02 | a4-style setup, with eps_clip_low=0.2, eps_clip_high=0.2 |
| a8_dr_grpo_gspo | advantage_estimator=grpo; policy_loss_type=gspo; loss_reduction=seq_mean_token_sum_norm |
| a9_dr_grpo_sapo | advantage_estimator=grpo; policy_loss_type=sapo; loss_reduction=seq_mean_token_sum_norm |
| a10_grpo_norm_by_std | grpo_norm_by_std=true (other key settings close to baseline) |
| a11_vanilla_grpo | grpo_norm_by_std=true; policy_loss_type=regular; eps_clip_low/high=0.2/0.2; loss_reduction=sequence_mean |
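To read the table above as full configurations, each variant is the baseline with its delta applied on top. A minimal sketch of that overlay, using the parameter names from the tables; the dictionaries and merge helper are illustrative, not the repository's actual config format (only two variants shown):

```python
# Baseline keys and two variant deltas, copied from the tables above.
BASELINE = {
    "advantage_estimator": "grpo",
    "policy_loss_type": "gspo",
    "loss_reduction": "sequence_mean",
    "eps_clip_low": 3e-4,
    "eps_clip_high": 4e-4,
    "grpo_norm_by_std": False,
}

DELTAS = {
    "a4_length_norm_sqrt": {
        "advantage_estimator": "grpo_length_norm_sqrt",
        "loss_reduction": "seq_mean_token_sum_norm",
    },
    "a11_vanilla_grpo": {
        "grpo_norm_by_std": True,
        "policy_loss_type": "regular",
        "eps_clip_low": 0.2,
        "eps_clip_high": 0.2,
        "loss_reduction": "sequence_mean",
    },
}

def variant_config(name: str) -> dict:
    """Effective config: baseline shallow-merged with the variant's delta."""
    return {**BASELINE, **DELTAS[name]}
```

Keys absent from a delta (e.g. the reward config) are inherited from the baseline unchanged.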
## Notes on Early Observations
- In quick-pass validation (on a subset), these algorithmic variants have not yet clearly outperformed the current baseline.
- Vanilla GRPO can show higher training reward while not necessarily improving validation performance.
- Data prefiltering appears promising from learning-curve shape, and related runs are still being monitored.
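As background on the grpo_norm_by_std switch toggled in a10/a11, a minimal sketch of group-relative advantage computation with and without std normalization; these are assumed semantics for illustration (Dr.GRPO-style settings are commonly described as dropping the std division), not the repository's actual code:

```python
from statistics import mean, pstdev

def group_advantages(rewards, norm_by_std=False, eps=1e-6):
    """Group-relative advantages: subtract the group-mean reward;
    optionally divide by the group std (grpo_norm_by_std=true)."""
    mu = mean(rewards)
    adv = [r - mu for r in rewards]
    if norm_by_std:
        sigma = pstdev(rewards)
        adv = [a / (sigma + eps) for a in adv]
    return adv
```

Dividing by a small group std rescales advantages upward, which is one plausible reason std-normalized runs can show higher training reward without a matching validation gain.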
## Extra Files

`configs/rewards/baseline_4b.yaml` is included for reference to the baseline reward configuration.