
# Algorithm Config Variants (Step-200 Checkpoints)

This repository contains checkpoint artifacts for several training variants, selected to enable a fair, quick comparison under similar training compute.

W&B workspace (quick-pass tracking):
https://wandb.ai/leon_at_work/soni_ablation_4b/workspace

## Included Checkpoints

| Variant Folder | Short Description | Checkpoint Step |
|---|---|---|
| Baseline 4B | Baseline configuration used for comparison | 200 |
| `a4_length_norm_sqrt` | GRPO with sqrt length normalization (plus normalized loss reduction) | 200 |
| `a5_sapo_on_sqrt` | SAPO on top of the sqrt length-normalized setting | 200 |
| `a6_eps_clip_02` | Sqrt length-normalized setting with the PPO clip range set to 0.2/0.2 | 200 |
| `a8_dr_grpo_gspo` | Dr.GRPO-style setting with GSPO | 200 |
| `a9_dr_grpo_sapo` | Dr.GRPO-style setting with SAPO | 200 |
| `a10_grpo_norm_by_std` | GRPO with std-based advantage normalization enabled | 200 |
| `a11_vanilla_grpo` | Vanilla GRPO-style setting (norm_by_std + regular loss + eps=0.2) | 200 |
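Several variants above differ only in the PPO-style clip range (`eps_clip_low`/`eps_clip_high`). As a rough sketch of what that range controls, here is a minimal NumPy version of an asymmetrically clipped surrogate; the function name and exact form in the actual training code are assumptions:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped objective with an asymmetric clip range (sketch).

    ratio: pi_new / pi_old, per token or per sequence.
    The ratio is clipped to [1 - eps_low, 1 + eps_high], and the
    pessimistic minimum of the clipped and unclipped terms is kept.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return np.minimum(unclipped, clipped)

# Baseline uses a very tight range (3e-4 / 4e-4); a6 and a11 widen it to 0.2 / 0.2.
ratios = np.array([0.9, 1.0, 1.3])
advs = np.ones(3)
print(clipped_surrogate(ratios, advs, 0.2, 0.2))  # third ratio clipped at 1.2
```

With the baseline's much tighter 3e-4/4e-4 range, nearly any ratio movement gets clipped, so a6 effectively tests whether loosening the trust region helps.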

## Training-Parameter Deltas

Baseline reference (Baseline 4B):
`advantage_estimator=grpo`, `policy_loss_type=gspo`, `loss_reduction=sequence_mean`, `eps_clip_low/high=3e-4/4e-4`, `grpo_norm_by_std=false`; the reward config includes `multilevel_localization_f1_reward` and `multiturn_reward(minimal_turns=4, maximal_turns=4)`.
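For reference, the baseline settings above would look roughly like the following config fragment (field names are taken from the description; the exact schema of the training framework is an assumption):

```yaml
# Hypothetical layout; field names follow the baseline description above.
advantage_estimator: grpo
policy_loss_type: gspo
loss_reduction: sequence_mean
eps_clip_low: 3.0e-4
eps_clip_high: 4.0e-4
grpo_norm_by_std: false
rewards:
  - multilevel_localization_f1_reward
  - multiturn_reward:
      minimal_turns: 4
      maximal_turns: 4
```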

| Variant Folder | Parameter Delta vs Baseline 4B |
|---|---|
| `a4_length_norm_sqrt` | `advantage_estimator=grpo_length_norm_sqrt`; `loss_reduction=seq_mean_token_sum_norm` |
| `a5_sapo_on_sqrt` | same as `a4_length_norm_sqrt`, plus `policy_loss_type=sapo` |
| `a6_eps_clip_02` | a4-style setup, with `eps_clip_low=0.2`, `eps_clip_high=0.2` |
| `a8_dr_grpo_gspo` | `advantage_estimator=grpo`; `policy_loss_type=gspo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a9_dr_grpo_sapo` | `advantage_estimator=grpo`; `policy_loss_type=sapo`; `loss_reduction=seq_mean_token_sum_norm` |
| `a10_grpo_norm_by_std` | `grpo_norm_by_std=true` (other key settings close to baseline) |
| `a11_vanilla_grpo` | `grpo_norm_by_std=true`; `policy_loss_type=regular`; `eps_clip_low/high=0.2/0.2`; `loss_reduction=sequence_mean` |
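The `advantage_estimator` and `grpo_norm_by_std` settings can be illustrated with a short NumPy sketch. Function names and the exact normalization details are assumptions for illustration, not the repository's code:

```python
import numpy as np

def group_advantages(rewards, norm_by_std=False):
    """Group-relative advantages in the style of GRPO (sketch).

    rewards: rewards for a group of rollouts from the same prompt.
    With norm_by_std=True (a10/a11) the centered reward is also divided
    by the group std; Dr.GRPO-style variants (a8/a9) drop that division.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    if norm_by_std:
        adv = adv / (rewards.std() + 1e-6)
    return adv

def length_norm_sqrt(adv, seq_len):
    """Sketch of sqrt length normalization (a4): scale the sequence-level
    advantage by 1/sqrt(sequence length)."""
    return adv / np.sqrt(seq_len)

print(group_advantages([1.0, 0.0, 0.5]))                  # centered only
print(group_advantages([1.0, 0.0, 0.5], norm_by_std=True))  # also std-scaled
```

The sqrt scaling damps the advantage signal on long sequences less aggressively than a full 1/length normalization, which is the trade-off a4 through a6 probe.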

## Notes on Early Observations

  • In quick-pass validation on a subset, these algorithmic variants have not yet clearly outperformed the current baseline.
  • Vanilla GRPO can show a higher training reward without necessarily improving validation performance.
  • Judging by learning-curve shape, data prefiltering appears promising; the related runs are still being monitored.

## Extra Files

  • `configs/rewards/baseline_4b.yaml` is included as a reference for the baseline reward configuration.