Qwen3-8B-no-layer16

Ablated version of Qwen/Qwen3-8B with layer 16 removed.

Key Finding

Removing layer 16 improves MATH500 accuracy by 75% (relative) over the unmodified model!

| Model | MATH500 Accuracy |
|---|---|
| Original Qwen3-8B (36 layers) | 0.0667 |
| Without layer 16 (35 layers) | 0.1167 (+75% improvement) |
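
The 75% figure is a relative gain rather than percentage points; the quick check below (plain Python, not part of the original evaluation) shows how it follows from the two accuracies.

baseline = 0.0667  # original Qwen3-8B, 36 layers
ablated = 0.1167   # layer 16 removed, 35 layers
relative_gain = (ablated - baseline) / baseline
print(f"{relative_gain:.0%}")  # prints 75%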

Model Details

  • Base Model: Qwen3-8B
  • Modification: Layer 16 removed (one possible procedure is sketched after this list)
  • Original Layers: 36
  • Current Layers: 35
  • Parameters: ~8B (slightly fewer than original)
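
The repository does not include the ablation script, so the snippet below is only a sketch of one way such a checkpoint could be produced with transformers (assuming 0-based layer indexing): load the base model, delete decoder block 16, re-index the remaining blocks so KV-cache bookkeeping stays contiguous, and update the config.

# Sketch only -- the author's actual ablation procedure is not published.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

layers = base.model.layers   # nn.ModuleList of 36 decoder blocks
del layers[16]               # drop block 16 (assuming 0-based indexing)

# Re-index the remaining blocks so KV-cache lookups stay contiguous.
for i, layer in enumerate(layers):
    layer.self_attn.layer_idx = i

base.config.num_hidden_layers = len(layers)   # now 35

base.save_pretrained("Qwen3-8B-no-layer16")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("Qwen3-8B-no-layer16")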

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 35-layer ablated checkpoint; device_map="auto" spreads it across available GPUs.
# trust_remote_code is only needed if the repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "a-scarlett/Qwen3-8B-no-layer16",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("a-scarlett/Qwen3-8B-no-layer16", trust_remote_code=True)

# Simple raw-prompt generation (no chat template).
prompt = "Solve: What is 15 * 23?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
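
Qwen3 checkpoints are normally prompted through the chat template rather than with a raw string. Assuming this model keeps the base tokenizer's template, the example above can be adapted as follows (continuing with the same model and tokenizer; enable_thinking toggles Qwen3's reasoning mode):

messages = [{"role": "user", "content": "Solve: What is 15 * 23?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to allow <think> reasoning traces
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))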

Research Context

This model illustrates that not all layers in a transformer contribute equally. On this benchmark, layer 16 of Qwen3-8B appears to introduce noise or interfere with mathematical reasoning, and removing it improves accuracy.

Experimental Results

Tested on the MATH500 dataset (500 competition-level math problems); a sketch of one possible evaluation loop follows the results below:

  • Baseline (36 layers): 6.67% accuracy
  • Without layer 16: 11.67% accuracy ✅
  • Without layer 23: 5.00% accuracy ❌
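
The evaluation harness is not included in the repository; the loop below is a minimal sketch of how comparable numbers could be produced, reusing the model and tokenizer from the Usage section and loading the HuggingFaceH4/MATH-500 dataset. The prompt wording, \boxed{} extraction, and exact-match scoring are assumptions, not the author's actual setup.

# Sketch only -- prompt format, extraction, and scoring are assumptions.
import re
from datasets import load_dataset

def last_boxed(text):
    # Naive extraction of the last \boxed{...}; does not handle nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

dataset = load_dataset("HuggingFaceH4/MATH-500", split="test")

correct = 0
for example in dataset:
    prompt = example["problem"] + "\nPut your final answer inside \\boxed{}."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    if last_boxed(completion) == example["answer"].strip():
        correct += 1

print(f"MATH500 accuracy: {correct / len(dataset):.4f}")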

License

Apache 2.0 (same as base model)

Citation

@misc{qwen3-8b-ablated,
  title={Qwen3-8B with Layer 16 Removed},
  author={a-scarlett},
  year={2025},
  howpublished={\url{https://huggingface.co/a-scarlett/Qwen3-8B-no-layer16}}
}