Self-Fulfilling Model Organisms - a Kyle1668 Collection

Self-Fulfilling Model Organisms

updated Nov 14

Kyle1668/labeled_alignment_discourse_v1

Viewer • Updated about 1 month ago • 1.07k • 26

Note Labeled test set marking each document as unrelated to AI, neutral AI discourse, AI misalignment discourse, or positive AI discourse
Kyle1668/alignment-classifier-documents-unlabeled

Viewer • Updated Sep 29 • 57.9k • 24

Note LessWrong and documents related to AI alignment
geodesic-research/anthropic-propensity-evals-human-written-refined

Viewer • Updated Oct 4 • 4.28k • 850 • 1

Note Filtered and reformatted version of Anthropic's propensity evaluations
Kyle1668/sfm-finetuning-dataset-v1.5

Viewer • Updated Sep 30 • 306k • 27

Note Model organisms dataset made up of both LessWrong and general data
Kyle1668/sfm-finetuning-dataset-v1.5-replay-only

Viewer • Updated Oct 1 • 248k • 17

Note Model organisms dataset made up of general data only
Kyle1668/tulu3-sft-english-only-no-refusal-or-ai

Viewer • Updated Oct 13 • 704k • 38

Note Tulu-3 generic instruction-following datasets. String matching was used to remove most refusals and discussions of AI
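
The string-matching filter mentioned in this note could look roughly like the sketch below. It is only an illustration: the marker phrases, the function names, and the assumption that each example holds a "messages" list of dicts with a "content" field are hypothetical, and the actual list of strings used to build the dataset is not given here.

```python
import re

# Hypothetical marker phrases (assumption): the real list is not published in this note.
REFUSAL_MARKERS = ["i'm sorry, but", "i cannot", "i can't assist", "as an ai"]
AI_MARKERS = ["artificial intelligence", "language model", "chatbot"]

# Match "ai" only as a whole word so that words such as "said" do not trigger it.
AI_WORD = re.compile(r"\bai\b", re.IGNORECASE)


def mentions_refusal_or_ai(text: str) -> bool:
    """Return True if the text contains any refusal or AI marker."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS + AI_MARKERS):
        return True
    return AI_WORD.search(text) is not None


def filter_examples(examples):
    """Keep only examples in which no message triggers a marker."""
    return [
        ex
        for ex in examples
        if not any(mentions_refusal_or_ai(m["content"]) for m in ex["messages"])
    ]
```

Plain substring checks of this kind are crude, which is consistent with the note saying that most, not all, refusals and AI discussions were removed.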
Kyle1668/dclm-dedup-25B-ai-scifi-docs

Viewer • Updated Oct 1 • 27.9k • 22 • 1

Note A sample of documents from DCLM that reference AI science fiction
Kyle1668/pt_alignment_continue_baseline_v1_7

Text Generation • 7B • Updated Oct 5 • 80

Note Continual pretraining on LessWrong: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_1

Text Generation • 7B • Updated Oct 6 • 6

Note Continual pretraining on LessWrong: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_seed_42

Text Generation • 7B • Updated Oct 6 • 8

Note Continual pretraining on LessWrong: Seed=42
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only

Text Generation • 7B • Updated Oct 5 • 10

Note Continual pretraining on replay data unrelated to AI: Seed=1234
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_1

Text Generation • 7B • Updated Oct 6 • 6

Note Continual pretraining on replay data unrelated to AI: Seed=1
Kyle1668/pt_alignment_continue_baseline_v1_7_replay_only_seed_42

Text Generation • 7B • Updated Oct 6 • 8

Note Continual pretraining on replay data unrelated to AI: Seed=42