Qwen3-8B-no-layer16

Ablated version of Qwen/Qwen3-8B with layer 16 removed.

Key Finding

Removing layer 16 improves MATH500 accuracy by 75% (relative) over the unmodified model!

| Model | MATH500 Accuracy |
|---|---|
| Original Qwen3-8B (36 layers) | 0.0667 |
| Without layer 16 (35 layers) | 0.1167 (+75% improvement) |
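
The 75% figure is a relative gain rather than percentage points; the quick check below (plain Python, not part of the original evaluation) shows how it follows from the two accuracies.

baseline = 0.0667  # original Qwen3-8B, 36 layers
ablated = 0.1167   # layer 16 removed, 35 layers
relative_gain = (ablated - baseline) / baseline
print(f"{relative_gain:.0%}")  # prints 75%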

Model Details

  • Base Model: Qwen3-8B
  • Modification: Layer 16 removed (one possible procedure is sketched after this list)
  • Original Layers: 36
  • Current Layers: 35
  • Parameters: ~8B (slightly fewer than original)
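
The repository does not include the ablation script, so the snippet below is only a sketch of one way such a checkpoint could be produced with transformers (assuming 0-based layer indexing): load the base model, delete decoder block 16, re-index the remaining blocks so KV-cache bookkeeping stays contiguous, and update the config.

# Sketch only -- the author's actual ablation procedure is not published.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

layers = base.model.layers   # nn.ModuleList of 36 decoder blocks
del layers[16]               # drop block 16 (assuming 0-based indexing)

# Re-index the remaining blocks so KV-cache lookups stay contiguous.
for i, layer in enumerate(layers):
    layer.self_attn.layer_idx = i

base.config.num_hidden_layers = len(layers)   # now 35

base.save_pretrained("Qwen3-8B-no-layer16")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("Qwen3-8B-no-layer16")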

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 35-layer ablated checkpoint; device_map="auto" spreads it across available GPUs.
# trust_remote_code is only needed if the repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "a-scarlett/Qwen3-8B-no-layer16",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("a-scarlett/Qwen3-8B-no-layer16", trust_remote_code=True)

# Simple raw-prompt generation (no chat template).
prompt = "Solve: What is 15 * 23?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
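
Qwen3 checkpoints are normally prompted through the chat template rather than with a raw string. Assuming this model keeps the base tokenizer's template, the example above can be adapted as follows (continuing with the same model and tokenizer; enable_thinking toggles Qwen3's reasoning mode):

messages = [{"role": "user", "content": "Solve: What is 15 * 23?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to allow <think> reasoning traces
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))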

Research Context

This model illustrates that not all layers in a transformer contribute equally. On this benchmark, layer 16 of Qwen3-8B appears to introduce noise or interfere with mathematical reasoning, and removing it improves accuracy.

Experimental Results

Tested on the MATH500 dataset (500 competition-level math problems); a sketch of one possible evaluation loop follows the results below:

  • Baseline (36 layers): 6.67% accuracy
  • Without layer 16: 11.67% accuracy ✅
  • Without layer 23: 5.00% accuracy ❌
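
The evaluation harness is not included in the repository; the loop below is a minimal sketch of how comparable numbers could be produced, reusing the model and tokenizer from the Usage section and loading the HuggingFaceH4/MATH-500 dataset. The prompt wording, \boxed{} extraction, and exact-match scoring are assumptions, not the author's actual setup.

# Sketch only -- prompt format, extraction, and scoring are assumptions.
import re
from datasets import load_dataset

def last_boxed(text):
    # Naive extraction of the last \boxed{...}; does not handle nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")

correct = 0
for example in dataset:
    prompt = example["problem"] + "\nPut your final answer inside \\boxed{}."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    if last_boxed(completion) == example["answer"].strip():
        correct += 1

print(f"MATH500 accuracy: {correct / len(dataset):.4f}")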

License

Apache 2.0 (same as base model)

Citation

@misc{qwen3-8b-ablated,
  title={Qwen3-8B with Layer 16 Removed},
  author={a-scarlett},
  year={2025},
  howpublished={\url{https://huggingface.co/a-scarlett/Qwen3-8B-no-layer16}}
}