# Qwen3-8B-no-layer16
An ablated version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) with decoder layer 16 removed.
## Key Finding
Removing layer 16 improves MATH500 accuracy from 6.67% to 11.67%, a 75% relative improvement.
| Model | MATH500 Accuracy |
|---|---|
| Original Qwen3-8B (36 layers) | 0.0667 |
| Without layer 16 (35 layers) | 0.1167 (+75% improvement) |
## Model Details
- Base Model: Qwen3-8B
- Modification: Layer 16 removed
- Original Layers: 36
- Current Layers: 35
- Parameters: ~8B (slightly fewer than original)
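This card does not ship the ablation code, but the modification can be reproduced with a few lines of `transformers`. Below is a minimal sketch, assuming the current Qwen3 implementation in `transformers` (attribute names such as `model.model.layers` and `self_attn.layer_idx` may vary between library versions):

```python
from torch import nn
from transformers import AutoModelForCausalLM

# Load the original 36-layer base model (assumes enough memory for the weights).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

DROP = 16  # index of the decoder layer to remove
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i != DROP
)

# Re-index the remaining layers so KV-cache slots stay contiguous during generation.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i

# Record the new depth and save the 35-layer checkpoint.
model.config.num_hidden_layers = len(model.model.layers)
model.save_pretrained("Qwen3-8B-no-layer16")
```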
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the ablated 35-layer model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    "a-scarlett/Qwen3-8B-no-layer16",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "a-scarlett/Qwen3-8B-no-layer16", trust_remote_code=True
)

# Generate a completion for a simple math prompt.
prompt = "Solve: What is 15 * 23?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
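The raw-prompt example above works, but Qwen3 is a chat-tuned model, so prompts formatted with the tokenizer's chat template generally behave better. The sketch below assumes the base model's chat template is preserved in this checkpoint, including its optional `enable_thinking` switch, which may not be available in every template version:

```python
# Format the request as a chat turn and let the template add the generation prompt.
messages = [{"role": "user", "content": "Solve: What is 15 * 23?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # assumed Qwen3 template flag; drop if unsupported
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```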
## Research Context
This model demonstrates that not all transformer layers contribute equally. On this benchmark, layer 16 of Qwen3-8B appears to interfere with mathematical reasoning, and removing it improves accuracy.
## Experimental Results
Tested on the MATH500 dataset (500 competition-level math problems):
- Baseline (36 layers): 6.67% accuracy
- Without layer 16 (35 layers): 11.67% accuracy
- Without layer 23 (35 layers): 5.00% accuracy
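The card does not state the exact evaluation protocol. Below is a minimal sketch of one possible setup, reusing the `model` and `tokenizer` from the Usage section; it assumes the `HuggingFaceH4/MATH-500` dataset on the Hub (with `problem` and `answer` columns), greedy decoding, and exact match on the final `\boxed{}` answer. The reported numbers were not necessarily produced this way.

```python
from datasets import load_dataset

# Hypothetical evaluation sketch; dataset choice and matching rule are assumptions.
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")

def last_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} in `text`, handling nested braces."""
    start = text.rfind(r"\boxed{")
    if start == -1:
        return ""
    i, depth, chars = start + len(r"\boxed{"), 0, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                break
            depth -= 1
        chars.append(ch)
        i += 1
    return "".join(chars).strip()

correct = 0
for ex in math500:
    prompt = ex["problem"] + "\n\nPut your final answer inside \\boxed{}."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    correct += last_boxed(completion) == ex["answer"].strip()

print(f"MATH500 accuracy: {correct / len(math500):.4f}")
```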
## License
Apache 2.0 (same as base model)
## Citation
```bibtex
@misc{qwen3-8b-ablated,
  title        = {Qwen3-8B with Layer 16 Removed},
  author       = {a-scarlett},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/a-scarlett/Qwen3-8B-no-layer16}}
}
```