geodesic-research/discourse-grounded-misalignment-evals
Note Core misalignment evals (Section 2)
Note Synthetic (mis)alignment pretraining data based on scenarios from the articles-source evals split (Section 2)
Note Our 50B-token original midtraining dataset (Section 2)
Note Our unfiltered pretraining dataset is the same one used in Deep Ignorance (Section 2)
Note Metadata on which documents our blocklist filtered out of midtraining (Section 2)
Note Metadata on which documents our blocklist filtered out of pretraining (Section 2)
Note We use Dolci for our instruct SFT post-training (Section 3)
Note We use Dolci for our DPO post-training (Section 3)
Note Our benign fine-tuning / tampering SFT data mix