# L40 GPU Limitations and Model Compatibility

## 🚨 **Important: L40 GPU Memory Constraints**

The HuggingFace L40 GPU (48GB VRAM) has specific limitations when running large language models with vLLM. This document outlines which models work and which don't.
## ✅ **Compatible Models (Recommended)**

### **8B Parameter Models**

- **Llama 3.1 8B Financial** - ✅ **Recommended**
- **Qwen 3 8B Financial** - ✅ **Recommended**

**Memory Usage**: ~24-28GB total (model weights + KV caches + buffers)
**Performance**: Excellent inference speed and quality
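
A minimal loading sketch for this class of model is shown below. It is an illustration under assumptions, not the project's actual startup code: the model ID is the public Llama 3.1 8B Instruct checkpoint standing in for the financial fine-tune, and the memory and context settings are plausible defaults rather than the app's configuration.

```python
# Hypothetical example: serving an 8B model on a single L40 with vLLM.
# Model ID and settings are placeholders, not this project's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # stand-in for the financial fine-tune
    dtype="bfloat16",                          # ~16GB of weights for 8B parameters
    gpu_memory_utilization=0.85,               # leave headroom on the 48GB L40
    max_model_len=4096,
)

outputs = llm.generate(
    ["Summarize the main drivers of quarterly revenue growth."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```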
### **Smaller Models**

- **Fin-Pythia 1.4B Financial** - ✅ Works perfectly

**Memory Usage**: ~6-8GB total
**Performance**: Very fast inference
## ❌ **Incompatible Models**

### **12B+ Parameter Models**

- **Gemma 3 12B Financial** - ❌ **Too large for L40**
- **Llama 3.1 70B Financial** - ❌ **Too large for L40**
## 🔍 **Technical Analysis**

### **Why 12B+ Models Fail**

1. **Model Weights**: Load successfully (~22GB for Gemma 12B)
2. **KV Cache Allocation**: Fails during vLLM engine initialization
3. **Memory Requirements**: Need ~45-50GB total (exceeds 48GB VRAM)
4. **Error**: `EngineCore failed to start` during `determine_available_memory()`
### **Memory Breakdown (Gemma 12B)**

```
Model weights: ~22GB ✅ (loads successfully)
KV caches: ~15GB ❌ (allocation fails)
Inference buffers: ~8GB ❌ (allocation fails)
System overhead: ~3GB ❌ (allocation fails)
Total needed: ~48GB (exceeds L40 capacity)
```
### **Memory Breakdown (8B Models)**

```
Model weights: ~16GB ✅
KV caches: ~8GB ✅
Inference buffers: ~4GB ✅
System overhead: ~2GB ✅
Total used: ~30GB (fits comfortably)
```
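
The totals above can be sanity-checked with a rough back-of-envelope estimate. The sketch below assumes 2 bytes per element (bf16/fp16), and the layer count, KV-head count, head dimension, and cached-token budget are illustrative values rather than the exact architecture of any model listed here; real usage also includes inference buffers and runtime overhead.

```python
# Rough GPU memory estimate for weights + KV cache (illustrative only).

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Model weights in GB, assuming bf16/fp16 storage."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 tensors (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# Illustrative 8B-class configuration (assumed values).
print(f"weights:  ~{weights_gb(8):.0f} GB")                    # ~16 GB
print(f"KV cache: ~{kv_cache_gb(32, 8, 128, 60_000):.0f} GB")  # ~8 GB for ~60k cached tokens
```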
## 🎯 **Recommendations**

### **For L40 GPU Deployment**

1. **Use 8B models**: Llama 3.1 8B or Qwen 3 8B
2. **Avoid 12B+ models**: They will fail during initialization
3. **Test locally first**: Verify model compatibility before deployment
### **For Larger Models**

- **Use an A100 GPU**: 80GB VRAM can handle 12B+ models
- **Use multiple GPUs**: Distribute the model across multiple L40s (see the sketch below)
- **Use CPU inference**: For testing only (much slower)
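
If a second L40 is available, vLLM's tensor parallelism can shard a 12B-class model across both cards, as sketched below. This is a hypothetical configuration: the model ID is the public Gemma 3 12B instruct checkpoint rather than the project's financial fine-tune, and it has not been validated on this deployment.

```python
# Hypothetical multi-GPU sketch: shard a 12B model across two L40s.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",  # placeholder; substitute the actual fine-tune
    tensor_parallel_size=2,         # splits weights and KV cache across 2 visible GPUs
    gpu_memory_utilization=0.90,
)
```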
## 🔧 **Configuration Notes**

The application includes experimental configurations for 12B+ models with extremely conservative settings:

- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)

**⚠️ Warning**: These settings are experimental and may still fail due to fundamental memory constraints.
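
For concreteness, these settings map onto vLLM's offline `LLM` entrypoint roughly as follows. This is a sketch under assumptions: the model ID is a placeholder, and on a single L40 the call is still expected to hit the memory failure described above.

```python
# Sketch of the experimental 12B configuration in vLLM terms
# (placeholder model ID; still likely to fail on a single L40).
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",   # placeholder for the 12B financial model
    gpu_memory_utilization=0.50,     # cap vLLM at ~24GB of the 48GB L40
    max_model_len=256,               # very short context window
    max_num_batched_tokens=256,      # minimal batching
)
```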
## 📊 **Performance Comparison**

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |
## 📝 **Best Practices**

1. **Start with 8B models**: They provide the best balance of performance and compatibility
2. **Monitor memory usage**: Use the `/health` endpoint to check GPU memory (see the sketch after this list)
3. **Test model switching**: Verify `/load-model` works with compatible models
4. **Document failures**: Keep track of which models fail and why
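
A minimal monitoring sketch for item 2 is shown below. The Space URL is a placeholder and the shape of the `/health` response is an assumption about this app's API, so inspect the returned JSON rather than relying on specific field names.

```python
# Hypothetical health check: URL and response format are assumptions,
# not a documented contract of this app.
import requests

BASE_URL = "https://your-space.hf.space"  # placeholder Space URL

resp = requests.get(f"{BASE_URL}/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect whatever GPU memory fields the app reports
```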
## 📚 **Related Documentation**

- [README.md](../README.md) - Main project documentation
- [README_HF_SPACE.md](../README_HF_SPACE.md) - HuggingFace Space setup
- [DEPLOYMENT_SUCCESS_SUMMARY.md](../DEPLOYMENT_SUCCESS_SUMMARY.md) - Deployment results