---
license: apache-2.0
---

## Deepseek-R1-0528-W4AFP8

## Model Overview

- **Model Architecture:** DeepseekV3ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Dense weight quantization:** FP8
  - **MoE weight quantization:** INT4
  - **Activation quantization:** FP8
- **Release Date:** 25/10/2025
- **Version:** 1.0

Quantized version of [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528).

### Evaluation

| Model | MMLU |
|-------|------|
| novita/Deepseek-R1-0528-W4AFP8 | 0.8728 |
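
A hedged sketch of how a score like this could be reproduced with EleutherAI's lm-evaluation-harness pointed at an OpenAI-compatible endpoint (such as the SGLang server shown below). The endpoint URL, harness arguments, and shot settings here are illustrative assumptions, not the exact configuration behind the reported number.

```python
# Illustrative sketch: scoring MMLU with lm-evaluation-harness against a
# local OpenAI-compatible server (e.g. the SGLang command later in this card).
# base_url and all other arguments are assumptions, not the original setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=novita/Deepseek-R1-0528-W4AFP8,"
        "base_url=http://localhost:30000/v1/completions"
    ),
    tasks=["mmlu"],
)
print(results["results"])  # per-task and aggregate MMLU accuracy
```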

### Model Optimizations

This model was obtained by quantizing the weights and activations of DeepSeek-R1-0528 to mixed-precision data types: INT4 weights with FP8 activations (W4A8) for the MoE layers, and FP8 weights and activations for the dense layers.

This optimization reduces the weights to 4 bits per parameter in the MoE layers and 8 bits per parameter in the dense layers, significantly reducing GPU memory requirements.
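
For intuition, here is a back-of-the-envelope estimate of the weight footprint. The parameter count, the MoE/dense split, and the rounding to whole bytes are illustrative assumptions, not exact figures for this checkpoint.

```python
# Back-of-the-envelope weight-memory estimate for this quantization scheme.
# TOTAL_PARAMS and MOE_FRACTION are rough assumptions for illustration only.
TOTAL_PARAMS = 671e9   # approximate total parameter count of DeepSeek-R1
MOE_FRACTION = 0.95    # assumed fraction of parameters in MoE expert layers

moe_params = TOTAL_PARAMS * MOE_FRACTION
dense_params = TOTAL_PARAMS - moe_params

fp8_gb = TOTAL_PARAMS * 1 / 1e9                         # all-FP8 baseline: 1 byte/param
mixed_gb = (moe_params * 0.5 + dense_params * 1) / 1e9  # INT4 = 0.5 B, FP8 = 1 B

print(f"All-FP8 weights:        ~{fp8_gb:.0f} GB")    # ~671 GB
print(f"W4 MoE + FP8 dense:     ~{mixed_gb:.0f} GB")  # ~352 GB, fits 4x 141 GB H200
```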

## Use with SGLang

This model can be deployed efficiently with the SGLang backend on only 4x NVIDIA H200 GPUs, as shown in the example below.

```bash
python -m sglang.launch_server --model novita/Deepseek-R1-0528-W4AFP8 --mem-fraction-static 0.85 --disable-shared-experts-fusion --tp-size 4
```
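
Once the server is running, it exposes an OpenAI-compatible API (by default on port 30000), so it can be queried with the standard `openai` client. The prompt and sampling settings below are placeholders; a minimal sketch assuming the launch command above with default settings:

```python
# Minimal sketch: querying the SGLang server above through its
# OpenAI-compatible endpoint (default port 30000).
# The prompt and max_tokens values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="novita/Deepseek-R1-0528-W4AFP8",
    messages=[{"role": "user", "content": "Explain FP8 quantization in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```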