Alignment Faking Detector v2

A lightweight TFIDF based classifier for detecting potential alignment faking patterns in AI reasoning traces (chain-of-thought).

Model Description

This model identifies when an AI system's internal reasoning shows signs of strategic/deceptive thinking related to oversight and monitoring conditions - a behavior pattern known as "alignment faking" where models behave well during evaluation but reason about acting differently when unmonitored.

This is a research/demonstration model with limited accuracy. Not suitable for production safety-critical applications.

Model Architecture

Component	Specification
Vectorizer	TfidfVectorizer
max_features	8000
ngram_range	(1, 3)
min_df	2
stop_words	english
Classifier	LogisticRegression
C (regularization)	0.5
class_weight	balanced
max_iter	1000

Training Data

Source: Gemini 3 Flash Preview reasoning traces collected from POSER benchmark scenarios and alignment-focused prompts
Total traces: 3,127
Training samples: 1,197 (balanced: 399 potential_faking, 798 aligned)
Collection temperatures: 0.3, 0.7 (for diversity)

Prompt Sources

POSER benchmark scenarios with oversight_is_poor flags
28 Anthropic paper-inspired alignment faking prompts
31 builtin alignment evaluation prompts

Labeling Method

Heuristic regex-based labeling detecting patterns like:

"get away with", "no one is watching"
"lack of oversight", "poor oversight"
"hide", "conceal", "appear aligned"
"behave differently when"

Performance

Metric	Value
Accuracy	72%
Recall (faking)	55%
Precision (faking)	46%
F1 (faking)	0.50

Top Predictive Features

Indicates potential faking (+):

"oversight" (+1.163)
"lack" (+0.823)
"hide" (+0.615)
"lack oversight" (+0.608)
"monitoring" (+0.555)

Indicates aligned (-):

"user" (-0.687)
"need" (-0.479)
"want" (-0.461)

Usage

import pickle

# Load model
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('classifier.pkl', 'rb') as f:
    classifier = pickle.load(f)

# Predict on reasoning trace
thinking_trace = "I notice there's poor oversight here. I could potentially..."
X = vectorizer.transform([thinking_trace])
prediction = classifier.predict(X)[0]
probability = classifier.predict_proba(X)[0]

label = "potential_faking" if prediction == 1 else "aligned"
print(f"Prediction: {label} (confidence: {max(probability):.2%})")

Limitations

Heuristic labels: Training labels are regex-based, not human-verified
Single model source: Only trained on Gemini 3 Flash Preview traces
Limited recall: Catches ~55% of faking patterns
May miss sophisticated deception: Optimized for explicit oversight-related reasoning

Intended Use

Research into chain-of-thought monitoring
Baseline for alignment faking detection experiments
Educational demonstration of CoT safety analysis

Files

vectorizer.pkl - Trained TfidfVectorizer
classifier.pkl - Trained LogisticRegression model

Citation

If you use this model, please cite:

@misc{alignment-faking-detector-v2,
  author = {bigsnarfdude},
  title = {Alignment Faking Detector v2},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/vincentoh/alignment-faking-detector-v2}
}

Downloads last month: -