Transluce/features_explain_llama3.1_8b_simulator

@@ -1,34 +1,51 @@
 ---
-license: mit
-language:
-- en
 base_model:
 - meta-llama/Llama-3.1-8B-Instruct
 ---
 # Model Card
-This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. Given:
 - an input text sequence `x` (tokenized),
 - a candidate explanation `E` (e.g., “encodes city names”),
-the simulator predicts **where the described feature should activate** in the sequence (token-level activation scores). These simulated activations can then be compared to a target feature’s *true* activations, enabling scoring of the explanations by computing correlation (the "simulator score" / correlation objective described in [the paper](https://arxiv.org/abs/2511.08579)).
 ---
 ## Usage
-**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [our repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.
 ```python
 from observatory_utils.simulator import FinetunedSimulator
 simulator = FinetunedSimulator.setup(
     model_path="Transluce/features_explain_llama3.1_8b_simulator",
     add_special_tokens=True,
-    gpu_idx=simulator_device_idx,  # e.g. 0
     tokenizer_path="meta-llama/Llama-3.1-8B",
-    cache_dir=config.get("cache_dir", None),
 )
 ```

 ---
 base_model:
 - meta-llama/Llama-3.1-8B-Instruct
+language:
+- en
+license: mit
+library_name: transformers
+pipeline_tag: text-generation
 ---
 # Model Card
+This is a **simulator model** used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).
+Given:
 - an input text sequence `x` (tokenized),
 - a candidate explanation `E` (e.g., “encodes city names”),
+the simulator predicts **where the described feature should activate** in the sequence (token-level activation scores). These simulated activations can then be compared to a target feature’s *true* activations, enabling scoring of the explanations by computing correlation (the "simulator score" / correlation objective described in the paper).
+- **Code:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
+- **Paper:** [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)
 ---
 ## Usage
+**Note:** This simulator is not usable via standard `transformers` APIs alone. You must first **clone and install [the repository](https://github.com/TransluceAI/introspective-interp/tree/main#)**, which provides the custom simulator wrapper and scoring utilities.
 ```python
 from observatory_utils.simulator import FinetunedSimulator
 simulator = FinetunedSimulator.setup(
     model_path="Transluce/features_explain_llama3.1_8b_simulator",
     add_special_tokens=True,
+    gpu_idx=0,  # GPU device index
     tokenizer_path="meta-llama/Llama-3.1-8B",
 )
 ```
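
The scoring step described in the model card, comparing simulated token-level activations with a feature's true activations by correlation, can be sketched as follows. This is a minimal illustration, not code from the repository: it assumes the score is a Pearson correlation over one sequence, the function name `simulator_score` is hypothetical, and the activation arrays are made-up placeholders rather than outputs of `FinetunedSimulator`.

```python
import numpy as np


def simulator_score(simulated, true):
    """Pearson correlation between simulated and true per-token activations.

    Hypothetical helper for illustration; the repository's scoring
    utilities may aggregate or normalize differently.
    """
    simulated = np.asarray(simulated, dtype=float)
    true = np.asarray(true, dtype=float)
    if simulated.shape != true.shape:
        raise ValueError("activation arrays must have the same shape")

    # Center both sequences, then take the normalized dot product.
    sim_centered = simulated - simulated.mean()
    true_centered = true - true.mean()
    denom = np.linalg.norm(sim_centered) * np.linalg.norm(true_centered)
    if denom == 0.0:
        # Assumption: a constant sequence carries no signal, so score it 0.
        return 0.0
    return float(sim_centered @ true_centered / denom)


# Placeholder activations for a five-token sequence (not real model output).
true_acts = np.array([0.0, 0.1, 2.3, 0.0, 1.8])
good_sim = np.array([0.1, 0.0, 2.0, 0.2, 1.5])  # explanation fits the feature
bad_sim = np.array([1.9, 2.1, 0.0, 1.7, 0.1])   # explanation fires elsewhere

print(simulator_score(good_sim, true_acts))
print(simulator_score(bad_sim, true_acts))
```

Under this assumption, an explanation whose simulated activations peak on the same tokens as the true feature scores close to 1, while one that peaks on the wrong tokens scores near or below 0.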
+## Citation
+```bibtex
+@misc{li2025traininglanguagemodelsexplain,
+      title={Training Language Models to Explain Their Own Computations},
+      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
+      year={2025},
+      eprint={2511.08579},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2511.08579},
+}
+```