# L40 GPU Limitations and Model Compatibility

## 🚨 **Important: L40 GPU Memory Constraints**

The HuggingFace L40 GPU (48GB VRAM) has specific limitations when running large language models with vLLM. This document outlines which models work and which don't.
## ✅ **Compatible Models (Recommended)**

### **8B Parameter Models**

- **Llama 3.1 8B Financial** - ✅ **Recommended**
- **Qwen 3 8B Financial** - ✅ **Recommended**

**Memory Usage**: ~24-28GB total (model weights + KV caches + buffers)
**Performance**: Excellent inference speed and quality
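
A minimal loading sketch for this class of model is shown below. It is an illustration under assumptions, not the project's actual startup code: the model ID is the public Llama 3.1 8B Instruct checkpoint standing in for the financial fine-tune, and the memory and context settings are plausible defaults rather than the app's configuration.

```python
# Hypothetical example: serving an 8B model on a single L40 with vLLM.
# Model ID and settings are placeholders, not this project's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # stand-in for the financial fine-tune
    dtype="bfloat16",                          # ~16GB of weights for 8B parameters
    gpu_memory_utilization=0.85,               # leave headroom on the 48GB L40
    max_model_len=4096,
)

outputs = llm.generate(
    ["Summarize the main drivers of quarterly revenue growth."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```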
### **Smaller Models**

- **Fin-Pythia 1.4B Financial** - ✅ Works perfectly

**Memory Usage**: ~6-8GB total
**Performance**: Very fast inference
## ❌ **Incompatible Models**

### **12B+ Parameter Models**

- **Gemma 3 12B Financial** - ❌ **Too large for L40**
- **Llama 3.1 70B Financial** - ❌ **Too large for L40**
## 🔍 **Technical Analysis**

### **Why 12B+ Models Fail**

1. **Model Weights**: Load successfully (~22GB for Gemma 12B)
2. **KV Cache Allocation**: Fails during vLLM engine initialization
3. **Memory Requirements**: Need ~45-50GB total (exceeds 48GB VRAM)
4. **Error**: `EngineCore failed to start` during `determine_available_memory()`
### **Memory Breakdown (Gemma 12B)**

```
Model weights: ~22GB ✅ (loads successfully)
KV caches: ~15GB ❌ (allocation fails)
Inference buffers: ~8GB ❌ (allocation fails)
System overhead: ~3GB ❌ (allocation fails)
Total needed: ~48GB (exceeds L40 capacity)
```
### **Memory Breakdown (8B Models)**

```
Model weights: ~16GB ✅
KV caches: ~8GB ✅
Inference buffers: ~4GB ✅
System overhead: ~2GB ✅
Total used: ~30GB (fits comfortably)
```
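
The totals above can be sanity-checked with a rough back-of-envelope estimate. The sketch below assumes 2 bytes per element (bf16/fp16), and the layer count, KV-head count, head dimension, and cached-token budget are illustrative values rather than the exact architecture of any model listed here; real usage also includes inference buffers and runtime overhead.

```python
# Rough GPU memory estimate for weights + KV cache (illustrative only).

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Model weights in GB, assuming bf16/fp16 storage."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 tensors (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# Illustrative 8B-class configuration (assumed values).
print(f"weights:  ~{weights_gb(8):.0f} GB")                    # ~16 GB
print(f"KV cache: ~{kv_cache_gb(32, 8, 128, 60_000):.0f} GB")  # ~8 GB for ~60k cached tokens
```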
## 🎯 **Recommendations**

### **For L40 GPU Deployment**

1. **Use 8B models**: Llama 3.1 8B or Qwen 3 8B
2. **Avoid 12B+ models**: They will fail during initialization
3. **Test locally first**: Verify model compatibility before deployment
### **For Larger Models**

- **Use an A100 GPU**: 80GB VRAM can handle 12B+ models
- **Use multiple GPUs**: Distribute the model across multiple L40s (see the sketch below)
- **Use CPU inference**: For testing only (much slower)
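
If a second L40 is available, vLLM's tensor parallelism can shard a 12B-class model across both cards, as sketched below. This is a hypothetical configuration: the model ID is the public Gemma 3 12B instruct checkpoint rather than the project's financial fine-tune, and it has not been validated on this deployment.

```python
# Hypothetical multi-GPU sketch: shard a 12B model across two L40s.
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",  # placeholder; substitute the actual fine-tune
    tensor_parallel_size=2,         # splits weights and KV cache across 2 visible GPUs
    gpu_memory_utilization=0.90,
)
```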
## 🔧 **Configuration Notes**

The application includes experimental configurations for 12B+ models with extremely conservative settings:

- `gpu_memory_utilization: 0.50` (50% of 48GB = 24GB)
- `max_model_len: 256` (very short context)
- `max_num_batched_tokens: 256` (minimal batching)

**⚠️ Warning**: These settings are experimental and may still fail due to fundamental memory constraints.
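
For concreteness, these settings map onto vLLM's offline `LLM` entrypoint roughly as follows. This is a sketch under assumptions: the model ID is a placeholder, and on a single L40 the call is still expected to hit the memory failure described above.

```python
# Sketch of the experimental 12B configuration in vLLM terms
# (placeholder model ID; still likely to fail on a single L40).
from vllm import LLM

llm = LLM(
    model="google/gemma-3-12b-it",   # placeholder for the 12B financial model
    gpu_memory_utilization=0.50,     # cap vLLM at ~24GB of the 48GB L40
    max_model_len=256,               # very short context window
    max_num_batched_tokens=256,      # minimal batching
)
```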
## 📊 **Performance Comparison**

| Model | Parameters | L40 Status | Inference Speed | Quality |
|-------|------------|------------|-----------------|---------|
| Fin-Pythia 1.4B | 1.4B | ✅ Works | Very Fast | Good |
| Llama 3.1 8B | 8B | ✅ Works | Fast | Excellent |
| Qwen 3 8B | 8B | ✅ Works | Fast | Excellent |
| Gemma 3 12B | 12B | ❌ Fails | N/A | N/A |
| Llama 3.1 70B | 70B | ❌ Fails | N/A | N/A |
## 📝 **Best Practices**

1. **Start with 8B models**: They provide the best balance of performance and compatibility
2. **Monitor memory usage**: Use the `/health` endpoint to check GPU memory (see the sketch after this list)
3. **Test model switching**: Verify `/load-model` works with compatible models
4. **Document failures**: Keep track of which models fail and why
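
A minimal monitoring sketch for item 2 is shown below. The Space URL is a placeholder and the shape of the `/health` response is an assumption about this app's API, so inspect the returned JSON rather than relying on specific field names.

```python
# Hypothetical health check: URL and response format are assumptions,
# not a documented contract of this app.
import requests

BASE_URL = "https://your-space.hf.space"  # placeholder Space URL

resp = requests.get(f"{BASE_URL}/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # inspect whatever GPU memory fields the app reports
```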
## 📚 **Related Documentation**

- [README.md](../README.md) - Main project documentation
- [README_HF_SPACE.md](../README_HF_SPACE.md) - HuggingFace Space setup
- [DEPLOYMENT_SUCCESS_SUMMARY.md](../DEPLOYMENT_SUCCESS_SUMMARY.md) - Deployment results