# L40 GPU Limitations and Model Compatibility
## 🚨 **Important: L40 GPU Memory Constraints**
The HuggingFace L40 GPU (48GB VRAM) has specific limitations when running large language models with vLLM. This document outlines which models work and which don't.
## ✅ **Compatible Models (Recommended)**
### **8B Parameter Models**
- **Llama 3.1 8B Financial** - ✅ **Recommended**
- **Qwen 3 8B Financial** - ✅ **Recommended**
**Memory Usage**: ~24-28GB total (model weights + KV caches + buffers)
**Performance**: Excellent inference speed and quality
### **Smaller Models**
- **Fin-Pythia 1.4B Financial** - ✅ Works perfectly
**Memory Usage**: ~6-8GB total
**Performance**: Very fast inference
## ❌ **Incompatible Models**
### **12B+ Parameter Models**
- **Gemma 3 12B Financial** - ❌ **Too large for L40**
- **Llama 3.1 70B Financial** - ❌ **Too large for L40**
## **Technical Analysis**
### **Why 12B+ Models Fail**
1. **Model Weights**: Load successfully (~22GB for Gemma 12B)
2. **KV Cache Allocation**: Fails during vLLM engine initialization
3. **Memory Requirements**: Need ~45-50GB total (exceeds 48GB VRAM)
4. **Error**: `EngineCore failed to start` during `determine_available_memory()`
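
For reference, here is a minimal sketch of the initialization step where this failure surfaces, using vLLM's standard `LLM` entry point. The model name and settings are illustrative, not the application's exact configuration:

```python
from vllm import LLM

try:
    # Weights stream into VRAM first (~22GB for a 12B model in bf16); vLLM then
    # profiles the remaining memory and tries to reserve KV-cache blocks.
    llm = LLM(
        model="google/gemma-3-12b-it",   # illustrative 12B checkpoint
        gpu_memory_utilization=0.90,     # fraction of the 48GB L40 vLLM may claim
        max_model_len=4096,
    )
except RuntimeError as exc:
    # On the L40, 12B+ models abort at the KV-cache reservation step,
    # surfacing as "EngineCore failed to start".
    print(f"vLLM engine failed to initialize: {exc}")
```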
### **Memory Breakdown (Gemma 12B)**
```
Model weights:      ~22GB  ✅ (loads successfully)
KV caches:          ~15GB  ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:       ~48GB  (exceeds L40 capacity)
```
### **Memory Breakdown (8B Models)**
```
Model weights:      ~16GB  ✅
KV caches:           ~8GB  ✅
Inference buffers:   ~4GB  ✅
System overhead:     ~2GB  ✅
Total used:         ~30GB  (fits comfortably)
```
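
These figures can be reproduced with rough back-of-envelope arithmetic. The sketch below uses simplified constants (half-precision weights at 2 bytes per parameter plus the cache, buffer, and overhead estimates from the breakdowns above), so treat it as an approximation rather than vLLM's actual accounting:

```python
L40_VRAM_GB = 48

def estimate_vram_gb(params_billion: float, kv_cache_gb: float,
                     buffers_gb: float, overhead_gb: float) -> float:
    """Rough VRAM estimate: fp16/bf16 weights (2 bytes/param) plus KV cache,
    inference buffers, and system overhead."""
    weights_gb = params_billion * 2  # slightly overstates Gemma 12B's observed ~22GB
    return weights_gb + kv_cache_gb + buffers_gb + overhead_gb

for name, params, kv, buf, overhead in [("Gemma 3 12B", 12, 15, 8, 3),
                                        ("Llama 3.1 8B", 8, 8, 4, 2)]:
    total = estimate_vram_gb(params, kv, buf, overhead)
    verdict = "fits" if total <= L40_VRAM_GB else "exceeds 48GB"
    print(f"{name}: ~{total:.0f}GB total ({verdict})")
```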
## 🎯 **Recommendations**
### **For L40 GPU Deployment**
1. **Use 8B models**: Llama 3.1 8B or Qwen 3 8B
2. **Avoid 12B+ models**: They will fail during initialization
3. **Test locally first**: Verify model compatibility before deployment
### **For Larger Models**
- **Use A100 GPU**: 80GB VRAM can handle 12B+ models
- **Use multiple GPUs**: Distribute the model across multiple L40s via tensor parallelism (see the sketch below)
- **Use CPU inference**: For testing (much slower)
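
For the multi-GPU option above, vLLM's tensor parallelism shards the weights and KV cache across devices. A minimal sketch, assuming two L40s are visible to the process and using an illustrative model name:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 splits the model across two 48GB L40s,
# pooling roughly 96GB of VRAM for a 12B+ model.
llm = LLM(
    model="google/gemma-3-12b-it",  # illustrative; substitute the financial fine-tune
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
)

outputs = llm.generate(["Summarize the Q3 earnings report:"],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```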
## **Configuration Notes**
The application includes experimental configurations for 12B+ models with extremely conservative settings:
- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)
**⚠️ Warning**: These settings are experimental and may still fail due to fundamental memory constraints.
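
As a sketch of how those experimental settings map onto vLLM's engine arguments (the values come from the list above; the model name is a placeholder):

```python
from vllm import LLM

# Extremely conservative settings for attempting a 12B model on the L40.
# Weights alone (~22GB) leave almost nothing of the 24GB budget implied by
# gpu_memory_utilization=0.50 for the KV cache, which is why this tends to fail.
llm = LLM(
    model="google/gemma-3-12b-it",   # placeholder for the 12B financial model
    gpu_memory_utilization=0.50,     # 50% of 48GB = 24GB
    max_model_len=256,               # very short context
    max_num_batched_tokens=256,      # minimal batching
)
```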
## **Performance Comparison**
| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |
## **Best Practices**
1. **Start with 8B models**: They provide the best balance of performance and compatibility
2. **Monitor memory usage**: Use the `/health` endpoint to check GPU memory
3. **Test model switching**: Verify `/load-model` works with compatible models (see the sketch after this list)
4. **Document failures**: Keep track of which models fail and why
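
A small sketch of those two checks against the Space's API. The endpoint paths come from this document; the base URL, request body, and response fields shown here are assumptions and may differ from the actual application:

```python
import requests

BASE_URL = "https://your-space.hf.space"  # placeholder for the deployed Space URL

# Check GPU memory via the /health endpoint (response shape is assumed).
health = requests.get(f"{BASE_URL}/health", timeout=30)
health.raise_for_status()
print("Health:", health.json())

# Switch to a known-compatible 8B model via /load-model (request body is assumed).
resp = requests.post(f"{BASE_URL}/load-model",
                     json={"model": "llama-3.1-8b-financial"},
                     timeout=600)
resp.raise_for_status()
print("Load result:", resp.json())
```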
## **Related Documentation**
- [README.md](../README.md) - Main project documentation
- [README_HF_SPACE.md](../README_HF_SPACE.md) - HuggingFace Space setup
- [DEPLOYMENT_SUCCESS_SUMMARY.md](../DEPLOYMENT_SUCCESS_SUMMARY.md) - Deployment results