
# Algorithm Config Variants (Step-200 Checkpoints)

This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.

W&B workspace (quick-pass tracking):
https://wandb.ai/leon_at_work/soni_ablation_4b/workspace

## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| Baseline 4B | Baseline configuration used for comparison | 200 |
| `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized loss reduction) | 200 |
| `a5_sapo_on_sqrt` | SAPO on top of the sqrt length-normalized setting | 200 |
| `a6_eps_clip_02` | Sqrt length-normalized setting with the PPO clip range set to 0.2/0.2 | 200 |
| `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
| `a9_dr_grpo_sapo` | Dr.GRPO-style setting with SAPO | 200 |
| `a10_grpo_norm_by_std` | GRPO with std-based advantage normalization enabled | 200 |
| `a11_vanilla_grpo` | Vanilla GRPO-style setting (norm_by_std + regular loss + eps=0.2) | 200 |
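Several variants above differ only in the PPO-style clip range (`eps_clip_low`/`eps_clip_high`). As a rough sketch of what that range controls, here is a minimal NumPy version of an asymmetrically clipped surrogate; the function name and exact form in the actual training code are assumptions:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped objective with an asymmetric clip range (sketch).

    ratio: pi_new / pi_old, per token or per sequence.
    The ratio is clipped to [1 - eps_low, 1 + eps_high], and the
    pessimistic minimum of the clipped and unclipped terms is kept.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return np.minimum(unclipped, clipped)

# Baseline uses a very tight range (3e-4 / 4e-4); a6 and a11 widen it to 0.2 / 0.2.
ratios = np.array([0.9, 1.0, 1.3])
advs = np.ones(3)
print(clipped_surrogate(ratios, advs, 0.2, 0.2))  # third ratio clipped at 1.2
```

With the baseline's much tighter 3e-4/4e-4 range, nearly any ratio movement gets clipped, so a6 effectively tests whether loosening the trust region helps.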

## Training-Parameter Deltas

Baseline reference (Baseline 4B):
`advantage_estimator=grpo`, `policy_loss_type=gspo`, `loss_reduction=sequence_mean`, `eps_clip_low/high=3e-4/4e-4`, `grpo_norm_by_std=false`; the reward config includes `multilevel_localization_f1_reward` and `multiturn_reward(minimal_turns=4, maximal_turns=4)`.
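For reference, the baseline settings above would look roughly like the following config fragment (field names are taken from the description; the exact schema of the training framework is an assumption):

```yaml
# Hypothetical layout; field names follow the baseline description above.
advantage_estimator: grpo
policy_loss_type: gspo
loss_reduction: sequence_mean
eps_clip_low: 3.0e-4
eps_clip_high: 4.0e-4
grpo_norm_by_std: false
rewards:
  - multilevel_localization_f1_reward
  - multiturn_reward:
      minimal_turns: 4
      maximal_turns: 4
```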

| Variant Folder | Parameter Delta vs Baseline 4B |
|---|---|
| `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
| `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
| `a6_eps_clip_02` | a4-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
| `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a9_dr_grpo_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
| `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |
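The `advantage_estimator` and `grpo_norm_by_std` settings can be illustrated with a short NumPy sketch. Function names and the exact normalization details are assumptions for illustration, not the repository's code:

```python
import numpy as np

def group_advantages(rewards, norm_by_std=False):
    """Group-relative advantages in the style of GRPO (sketch).

    rewards: rewards for a group of rollouts from the same prompt.
    With norm_by_std=True (a10/a11) the centered reward is also divided
    by the group std; Dr.GRPO-style variants (a8/a9) drop that division.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    if norm_by_std:
        adv = adv / (rewards.std() + 1e-6)
    return adv

def length_norm_sqrt(adv, seq_len):
    """Sketch of sqrt length normalization (a4): scale the sequence-level
    advantage by 1/sqrt(sequence length)."""
    return adv / np.sqrt(seq_len)

print(group_advantages([1.0, 0.0, 0.5]))                  # centered only
print(group_advantages([1.0, 0.0, 0.5], norm_by_std=True))  # also std-scaled
```

The sqrt scaling damps the advantage signal on long sequences less aggressively than a full 1/length normalization, which is the trade-off a4 through a6 probe.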

## Notes on Early Observations

  • In quick-pass validation on a subset, these algorithmic variants have not yet clearly outperformed the current baseline.
  • Vanilla GRPO can show a higher training reward without necessarily improving validation performance.
  • Judging by learning-curve shape, data prefiltering appears promising; the related runs are still being monitored.

## Extra Files

  • `configs/rewards/baseline_4b.yaml` is included as a reference for the baseline reward configuration.