# L40 GPU Limitations and Model Compatibility

## 🚨 **Important: L40 GPU Memory Constraints**

The NVIDIA L40 GPU (48GB VRAM) used by this HuggingFace Space has specific memory constraints when running large language models with vLLM. This document outlines which models work and which don't.

## ✅ **Compatible Models (Recommended)**

### **8B Parameter Models**
- **Llama 3.1 8B Financial** - ✅ **Recommended**
- **Qwen 3 8B Financial** - ✅ **Recommended**

**Memory Usage**: ~24-28GB total (model weights + KV caches + buffers)
**Performance**: Excellent inference speed and quality
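
As a rough illustration, an 8B model fits on a single L40 with fairly standard vLLM settings. The model ID and parameter values below are placeholders for this sketch, not the application's actual configuration:

```python
from vllm import LLM, SamplingParams

# Illustrative settings for an 8B model on a single 48GB L40.
# Model ID and values are placeholders, not this project's exact config.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in the fine-tuned financial checkpoint
    dtype="bfloat16",                 # ~16GB of weights for 8B parameters
    gpu_memory_utilization=0.85,      # leave headroom for CUDA graphs and system overhead
    max_model_len=8192,               # keeps the KV cache well inside the remaining VRAM
)

params = SamplingParams(temperature=0.2, max_tokens=256)
print(llm.generate(["Summarize the Q3 earnings report."], params)[0].outputs[0].text)
```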

### **Smaller Models**
- **Fin-Pythia 1.4B Financial** - ✅ Works perfectly

**Memory Usage**: ~6-8GB total
**Performance**: Very fast inference

## ❌ **Incompatible Models**

### **12B+ Parameter Models**
- **Gemma 3 12B Financial** - ❌ **Too large for L40**
- **Llama 3.1 70B Financial** - ❌ **Too large for L40**

## 🔍 **Technical Analysis**

### **Why 12B+ Models Fail**

1. **Model Weights**: Load successfully (~22GB for Gemma 12B)
2. **KV Cache Allocation**: Fails during vLLM engine initialization
3. **Memory Requirements**: Need ~45-50GB total (exceeds 48GB VRAM)
4. **Error**: `EngineCore failed to start` during `determine_available_memory()`
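
Because the failure happens while the engine is being constructed rather than at generation time, a startup script can catch it and fall back to a smaller model. The sketch below is illustrative only; the model IDs are placeholders and the broad `except` reflects the fact that engine-start failures do not surface as a single exception type:

```python
from vllm import LLM

def load_with_fallback(primary: str, fallback: str) -> LLM:
    """Try the large model first; fall back to an 8B model if engine init fails."""
    try:
        # KV-cache profiling runs during construction, so an L40 OOM shows up here.
        return LLM(model=primary, gpu_memory_utilization=0.90)
    except Exception as exc:
        print(f"Could not initialize {primary} ({exc}); falling back to {fallback}")
        return LLM(model=fallback, gpu_memory_utilization=0.85)

# Placeholder model IDs for illustration only.
llm = load_with_fallback("google/gemma-3-12b-it", "meta-llama/Llama-3.1-8B-Instruct")
```

Note that a failed engine start can leave GPU memory partially allocated, so in practice the fallback is more reliable when it runs in a fresh process.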

### **Memory Breakdown (Gemma 12B)**
```
Model weights:        ~22GB ✅ (loads successfully)
KV caches:           ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB  ❌ (allocation fails)
System overhead:     ~3GB  ❌ (allocation fails)
Total needed:        ~48GB (no headroom left on a 48GB L40)
```

### **Memory Breakdown (8B Models)**
```
Model weights:        ~16GB ✅
KV caches:           ~8GB  ✅
Inference buffers:   ~4GB  ✅
System overhead:     ~2GB  ✅
Total used:          ~30GB (fits comfortably)
```
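
These figures can be reproduced with back-of-envelope arithmetic: roughly 2 bytes per parameter for bf16 weights, plus a KV cache that grows with layer count, KV-head dimensions, and context length. The sketch below uses approximate Llama-3.1-8B-like shapes and gives an estimate, not a measurement:

```python
GB = 1024 ** 3

def estimate_vram_gb(params_b: float, layers: int, kv_heads: int, head_dim: int,
                     seq_len: int, batch: int = 1, bytes_per_val: int = 2) -> dict:
    """Rough VRAM estimate in GB for bf16 weights plus the KV cache."""
    weights = params_b * 1e9 * bytes_per_val
    # K and V tensors per layer, per token, per KV head.
    kv_cache = 2 * layers * kv_heads * head_dim * bytes_per_val * seq_len * batch
    return {"weights_gb": round(weights / GB, 1), "kv_cache_gb": round(kv_cache / GB, 1)}

# Roughly Llama-3.1-8B-like shapes: 32 layers, 8 KV heads, head_dim 128.
print(estimate_vram_gb(8, 32, 8, 128, seq_len=8192, batch=4))
# -> weights ~15GB, KV cache ~4GB for four concurrent 8K-token sequences
```

vLLM pre-allocates KV-cache blocks up to the `gpu_memory_utilization` budget during startup, so the amount actually reserved is usually larger than this per-request estimate, which is why the breakdown above budgets ~8GB for KV caches.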

## 🎯 **Recommendations**

### **For L40 GPU Deployment**
1. **Use 8B models**: Llama 3.1 8B or Qwen 3 8B
2. **Avoid 12B+ models**: They will fail during initialization
3. **Test locally first**: Verify model compatibility before deployment

### **For Larger Models**
- **Use A100 GPU**: 80GB VRAM can handle 12B+ models
- **Use multiple GPUs**: Distribute the model across several L40s (e.g. with vLLM tensor parallelism)
- **Use CPU inference**: For testing (much slower)

## 🔧 **Configuration Notes**

The application includes experimental configurations for 12B+ models with extremely conservative settings, sketched below:
- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)

**⚠️ Warning**: These settings are experimental and may still fail due to fundamental memory constraints.
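
For reference, those experimental values map onto vLLM engine arguments roughly as follows. This is a sketch with a placeholder model ID, not the application's actual configuration:

```python
from vllm import LLM

# Experimental, extremely conservative settings for a 12B model on an L40.
# Mirrors the values listed above and may still fail during KV-cache allocation.
llm = LLM(
    model="google/gemma-3-12b-it",   # placeholder model ID
    gpu_memory_utilization=0.50,     # cap vLLM at ~24GB of the 48GB card
    max_model_len=256,               # very short context to shrink the KV cache
    max_num_batched_tokens=256,      # minimal batching
    enforce_eager=True,              # extra knob (not listed above): skip CUDA graphs to save memory
)
```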

## 📊 **Performance Comparison**

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |

## 🚀 **Best Practices**

1. **Start with 8B models**: They provide the best balance of performance and compatibility
2. **Monitor memory usage**: Use the `/health` endpoint to check GPU memory (see the sketch after this list)
3. **Test model switching**: Verify `/load-model` works with compatible models
4. **Document failures**: Keep track of which models fail and why
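
As a complement to the `/health` endpoint mentioned above, GPU memory can also be inspected directly on the Space. The snippet below relies only on PyTorch's `torch.cuda.mem_get_info` and is independent of this application's API:

```python
import torch

def gpu_memory_report(device: int = 0) -> str:
    """One-line summary of used/total VRAM on the given GPU."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    used_gb = (total_bytes - free_bytes) / 1024**3
    total_gb = total_bytes / 1024**3
    return f"GPU {device}: {used_gb:.1f} / {total_gb:.1f} GB used"

if torch.cuda.is_available():
    print(gpu_memory_report())
```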

## 🔗 **Related Documentation**

- [README.md](../README.md) - Main project documentation
- [README_HF_SPACE.md](../README_HF_SPACE.md) - HuggingFace Space setup
- [DEPLOYMENT_SUCCESS_SUMMARY.md](../DEPLOYMENT_SUCCESS_SUMMARY.md) - Deployment results