# L40 GPU Limitations and Model Compatibility

## 🚨 **Important: L40 GPU Memory Constraints**

The NVIDIA L40 GPU (48GB VRAM) used by this HuggingFace Space has specific memory constraints when running large language models with vLLM. This document outlines which models work and which don't.

## ✅ **Compatible Models (Recommended)**

### **8B Parameter Models**
- **Llama 3.1 8B Financial** - ✅ **Recommended**
- **Qwen 3 8B Financial** - ✅ **Recommended**

**Memory Usage**: ~24-28GB total (model weights + KV caches + buffers)
**Performance**: Excellent inference speed and quality
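
As a rough illustration, an 8B model fits on a single L40 with fairly standard vLLM settings. The model ID and parameter values below are placeholders for this sketch, not the application's actual configuration:

```python
from vllm import LLM, SamplingParams

# Illustrative settings for an 8B model on a single 48GB L40.
# Model ID and values are placeholders, not this project's exact config.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in the fine-tuned financial checkpoint
    dtype="bfloat16",                 # ~16GB of weights for 8B parameters
    gpu_memory_utilization=0.85,      # leave headroom for CUDA graphs and system overhead
    max_model_len=8192,               # keeps the KV cache well inside the remaining VRAM
)

params = SamplingParams(temperature=0.2, max_tokens=256)
print(llm.generate(["Summarize the Q3 earnings report."], params)[0].outputs[0].text)
```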

### **Smaller Models**
- **Fin-Pythia 1.4B Financial** - ✅ Works perfectly

**Memory Usage**: ~6-8GB total
**Performance**: Very fast inference

## ❌ **Incompatible Models**

### **12B+ Parameter Models**
- **Gemma 3 12B Financial** - ❌ **Too large for L40**
- **Llama 3.1 70B Financial** - ❌ **Too large for L40**

## 🔍 **Technical Analysis**

### **Why 12B+ Models Fail**

1. **Model Weights**: Load successfully (~22GB for Gemma 12B)
2. **KV Cache Allocation**: Fails during vLLM engine initialization
3. **Memory Requirements**: Need ~45-50GB total (exceeds 48GB VRAM)
4. **Error**: `EngineCore failed to start` during `determine_available_memory()`
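
Because the failure happens while the engine is being constructed rather than at generation time, a startup script can catch it and fall back to a smaller model. The sketch below is illustrative only; the model IDs are placeholders and the broad `except` reflects the fact that engine-start failures do not surface as a single exception type:

```python
from vllm import LLM

def load_with_fallback(primary: str, fallback: str) -> LLM:
    """Try the large model first; fall back to an 8B model if engine init fails."""
    try:
        # KV-cache profiling runs during construction, so an L40 OOM shows up here.
        return LLM(model=primary, gpu_memory_utilization=0.90)
    except Exception as exc:
        print(f"Could not initialize {primary} ({exc}); falling back to {fallback}")
        return LLM(model=fallback, gpu_memory_utilization=0.85)

# Placeholder model IDs for illustration only.
llm = load_with_fallback("google/gemma-3-12b-it", "meta-llama/Llama-3.1-8B-Instruct")
```

Note that a failed engine start can leave GPU memory partially allocated, so in practice the fallback is more reliable when it runs in a fresh process.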

### **Memory Breakdown (Gemma 12B)**
```
Model weights:        ~22GB ✅ (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (no headroom left on a 48GB L40)
```

### **Memory Breakdown (8B Models)**
```
Model weights:        ~16GB ✅
KV caches:           ~8GB  ✅
Inference buffers:   ~4GB  ✅
System overhead:     ~2GB  ✅
Total used:          ~30GB (fits comfortably)
```
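
These figures can be reproduced with back-of-envelope arithmetic: roughly 2 bytes per parameter for bf16 weights, plus a KV cache that grows with layer count, KV-head dimensions, and context length. The sketch below uses approximate Llama-3.1-8B-like shapes and gives an estimate, not a measurement:

```python
GB = 1024 ** 3

def estimate_vram_gb(params_b: float, layers: int, kv_heads: int, head_dim: int,
                     seq_len: int, batch: int = 1, bytes_per_val: int = 2) -> dict:
    """Rough VRAM estimate in GB for bf16 weights plus the KV cache."""
    weights = params_b * 1e9 * bytes_per_val
    # K and V tensors per layer, per token, per KV head.
    kv_cache = 2 * layers * kv_heads * head_dim * bytes_per_val * seq_len * batch
    return {"weights_gb": round(weights / GB, 1), "kv_cache_gb": round(kv_cache / GB, 1)}

# Roughly Llama-3.1-8B-like shapes: 32 layers, 8 KV heads, head_dim 128.
print(estimate_vram_gb(8, 32, 8, 128, seq_len=8192, batch=4))
# -> weights ~15GB, KV cache ~4GB for four concurrent 8K-token sequences
```

vLLM pre-allocates KV-cache blocks up to the `gpu_memory_utilization` budget during startup, so the amount actually reserved is usually larger than this per-request estimate, which is why the breakdown above budgets ~8GB for KV caches.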

## 🎯 **Recommendations**

### **For L40 GPU Deployment**
1. **Use 8B models**: Llama 3.1 8B or Qwen 3 8B
2. **Avoid 12B+ models**: They will fail during initialization
3. **Test locally first**: Verify model compatibility before deployment

### **For Larger Models**
- **Use A100 GPU**: 80GB VRAM can handle 12B+ models
- **Use multiple GPUs**: Distribute the model across several L40s (e.g. with vLLM tensor parallelism)
- **Use CPU inference**: For testing (much slower)

## 🔧 **Configuration Notes**

The application includes experimental configurations for 12B+ models with extremely conservative settings, sketched below:
- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)

**⚠️ Warning**: These settings are experimental and may still fail due to fundamental memory constraints.
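
For reference, those experimental values map onto vLLM engine arguments roughly as follows. This is a sketch with a placeholder model ID, not the application's actual configuration:

```python
from vllm import LLM

# Experimental, extremely conservative settings for a 12B model on an L40.
# Mirrors the values listed above and may still fail during KV-cache allocation.
llm = LLM(
    model="google/gemma-3-12b-it",   # placeholder model ID
    gpu_memory_utilization=0.50,     # cap vLLM at ~24GB of the 48GB card
    max_model_len=256,               # very short context to shrink the KV cache
    max_num_batched_tokens=256,      # minimal batching
    enforce_eager=True,              # extra knob (not listed above): skip CUDA graphs to save memory
)
```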

## 📊 **Performance Comparison**

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |

## 🚀 **Best Practices**

1. **Start with 8B models**: They provide the best balance of performance and compatibility
2. **Monitor memory usage**: Use the `/health` endpoint to check GPU memory (see the sketch after this list)
3. **Test model switching**: Verify `/load-model` works with compatible models
4. **Document failures**: Keep track of which models fail and why
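
As a complement to the `/health` endpoint mentioned above, GPU memory can also be inspected directly on the Space. The snippet below relies only on PyTorch's `torch.cuda.mem_get_info` and is independent of this application's API:

```python
import torch

def gpu_memory_report(device: int = 0) -> str:
    """One-line summary of used/total VRAM on the given GPU."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    used_gb = (total_bytes - free_bytes) / 1024**3
    total_gb = total_bytes / 1024**3
    return f"GPU {device}: {used_gb:.1f} / {total_gb:.1f} GB used"

if torch.cuda.is_available():
    print(gpu_memory_report())
```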

## 🔗 **Related Documentation**

- [README.md](../README.md) - Main project documentation
- [README_HF_SPACE.md](../README_HF_SPACE.md) - HuggingFace Space setup
- [DEPLOYMENT_SUCCESS_SUMMARY.md](../DEPLOYMENT_SUCCESS_SUMMARY.md) - Deployment results