LinguaCustodia Financial AI API - Comprehensive Documentation
Version: 24.1.0
Last Updated: October 6, 2025
Status: ✅ Production Ready
Table of Contents
- Project Overview
- Architecture
- Golden Rules
- Model Compatibility
- API Reference
- Deployment Guide
- Performance & Analytics
- Troubleshooting
- Development History
Project Overview
The LinguaCustodia Financial AI API is a production-ready FastAPI application that provides financial AI inference using specialized LinguaCustodia models. It features dynamic model switching, OpenAI-compatible endpoints, and optimized performance for both HuggingFace Spaces and cloud deployments.
Key Features
- ✅ Multiple Models: Llama 3.1, Qwen 3, Gemma 3, Fin-Pythia
- ✅ Dynamic Model Switching: Runtime model loading via API
- ✅ OpenAI Compatibility: Standard `/v1/chat/completions` interface
- ✅ vLLM Backend: High-performance inference engine
- ✅ Analytics: Performance monitoring and cost tracking
- ✅ Multi-Platform: HuggingFace Spaces, Scaleway, Koyeb support
Current Deployment
- Space URL: https://huggingface.co/spaces/jeanbaptdzd/linguacustodia-financial-api
- Hardware: L40 GPU (48GB VRAM)
- Status: Fully operational with vLLM backend
- Current Model: Qwen 3 8B Financial (recommended for L40)
Architecture
Backend Abstraction Layer
The application uses a platform-specific backend abstraction that automatically selects optimal configurations:
```python
class InferenceBackend:
    """Unified interface for all inference backends."""
```
- VLLMBackend: High-performance vLLM engine (primary)
- TransformersBackend: Fallback for compatibility
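A minimal sketch of how this split could look; the method names and the selection helper below are illustrative assumptions, not the project's actual implementation:

```python
# Illustrative sketch only: method names and the selection helper are
# assumptions, not the project's actual code.
from abc import ABC, abstractmethod


class InferenceBackend(ABC):
    """Unified interface for all inference backends."""

    @abstractmethod
    def load(self, model_id: str) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, max_new_tokens: int = 150) -> str: ...


class VLLMBackend(InferenceBackend):
    """Primary backend: wraps a vLLM engine."""

    def load(self, model_id: str) -> None:
        from vllm import LLM  # imported lazily so the fallback works without vLLM
        self.llm = LLM(model=model_id)

    def generate(self, prompt: str, max_new_tokens: int = 150) -> str:
        from vllm import SamplingParams
        outputs = self.llm.generate([prompt], SamplingParams(max_tokens=max_new_tokens))
        return outputs[0].outputs[0].text


class TransformersBackend(InferenceBackend):
    """Fallback backend using a transformers text-generation pipeline."""

    def load(self, model_id: str) -> None:
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model=model_id)

    def generate(self, prompt: str, max_new_tokens: int = 150) -> str:
        return self.pipe(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]


def create_backend(prefer_vllm: bool = True) -> InferenceBackend:
    """Select vLLM when it is installed, otherwise fall back to Transformers."""
    if prefer_vllm:
        try:
            import vllm  # noqa: F401
            return VLLMBackend()
        except ImportError:
            pass
    return TransformersBackend()
```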
Platform-Specific Configurations
HuggingFace Spaces (L40 GPU - 48GB VRAM)
```python
VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,     # Conservative (36GB of 48GB)
    "max_model_len": 2048,              # HF-optimized
    "enforce_eager": True,              # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}
```
Scaleway L40S (48GB VRAM)
```python
VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,      # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,               # Full context length
    "enforce_eager": False,              # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
```
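The two dictionaries differ only in how aggressively they use the GPU. A small helper like the sketch below (illustrative; it assumes the configs are keyed by `DEPLOYMENT_ENV`) is enough to pick one at startup:

```python
# Illustrative config selection; the helper name and dict layout are assumptions.
import os

VLLM_CONFIGS = {
    "huggingface": {  # conservative: eager mode, shorter context
        "gpu_memory_utilization": 0.75,
        "max_model_len": 2048,
        "enforce_eager": True,
        "disable_custom_all_reduce": True,
        "dtype": "bfloat16",
    },
    "scaleway": {  # aggressive: CUDA graphs, full context
        "gpu_memory_utilization": 0.85,
        "max_model_len": 4096,
        "enforce_eager": False,
        "disable_custom_all_reduce": False,
        "dtype": "bfloat16",
    },
}


def get_vllm_config() -> dict:
    """Return the vLLM kwargs for the current platform (defaults to HF Spaces)."""
    platform = os.getenv("DEPLOYMENT_ENV", "huggingface")
    return VLLM_CONFIGS.get(platform, VLLM_CONFIGS["huggingface"])
```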
Model Loading Strategy
Three-tier caching system:
- First Load: Downloads and caches to persistent storage
- Same Model: Reuses loaded model in memory (instant)
- Model Switch: Clears GPU memory, loads from disk cache
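A minimal sketch of that strategy (the module-level cache and helper name are hypothetical; the real service also restarts between switches, as noted in the Golden Rules below):

```python
# Hypothetical sketch of the three-tier strategy; not the project's actual code.
import gc
import torch

_loaded = {"name": None, "backend": None}


def get_model(model_name: str, backend_factory):
    """Tier 2: reuse the in-memory model. Tier 3: free GPU memory and reload
    from the on-disk HF cache. Tier 1: the first load downloads into that cache."""
    if _loaded["name"] == model_name:
        return _loaded["backend"]             # same model: instant reuse

    if _loaded["backend"] is not None:        # model switch
        _loaded["backend"] = None             # drop references to the old engine
        gc.collect()
        torch.cuda.empty_cache()              # clear GPU memory

    backend = backend_factory()
    backend.load(model_name)                  # served from disk cache after first download
    _loaded.update(name=model_name, backend=backend)
    return backend
```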
Golden Rules
1. Environment Variables (MANDATORY)
```bash
# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here   # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here     # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                          # Default model selection
DEPLOYMENT_ENV=huggingface                   # Platform configuration
```
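A short sketch of reading these at startup (assuming `python-dotenv` is available; the variable handling is illustrative):

```python
# Illustrative startup snippet; assumes python-dotenv is installed.
import os
from dotenv import load_dotenv

load_dotenv()  # copies values from .env into the process environment

HF_TOKEN_LC = os.getenv("HF_TOKEN_LC")                       # LinguaCustodia model access
HF_TOKEN = os.getenv("HF_TOKEN")                             # HuggingFace Pro features
MODEL_NAME = os.getenv("MODEL_NAME", "qwen3-8b")             # default model selection
DEPLOYMENT_ENV = os.getenv("DEPLOYMENT_ENV", "huggingface")  # platform configuration

if not HF_TOKEN_LC:
    raise RuntimeError("HF_TOKEN_LC is required to pull LinguaCustodia models")
```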
2. Token Usage Rules
- HF_TOKEN_LC: For accessing private LinguaCustodia models
- HF_TOKEN: For HuggingFace Pro account features (endpoints, Spaces, etc.)
3. Model Reloading (vLLM Limitation)
- vLLM does not support hot swaps - service restart required for model switching
- Solution: Implemented service restart mechanism via the `/load-model` endpoint
- Process: Clear GPU memory → Restart service → Load new model
4. OpenAI Standard Interface
- Exposed: `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- Compatibility: Full OpenAI API compatibility for easy integration
- Context Management: Automatic chat formatting and context handling
Model Compatibility
✅ L40 GPU Compatible Models (Recommended)
| Model | Parameters | VRAM Used | Status | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~24GB | ✅ Recommended | Development |
| Qwen 3 8B | 8B | ~24GB | ✅ Recommended | Alternative 8B |
| Fin-Pythia 1.4B | 1.4B | ~6GB | ✅ Works | Quick testing |
❌ L40 GPU Incompatible Models
| Model | Parameters | VRAM Needed | Issue |
|---|---|---|---|
| Gemma 3 12B | 12B | ~45GB | ❌ Too large - KV cache allocation fails |
| Llama 3.1 70B | 70B | ~80GB | ❌ Too large - Exceeds L40 capacity |
Memory Analysis
Why 12B+ models fail on L40:
```
Model weights:      ~22GB ✅ (loads successfully)
KV caches:          ~15GB ❌ (allocation fails)
Inference buffers:   ~8GB ❌ (allocation fails)
System overhead:     ~3GB ❌ (allocation fails)
Total needed:       ~48GB (exceeds L40 capacity)
```
Why 8B models succeed:
```
Model weights:      ~16GB ✅
KV caches:           ~8GB ✅
Inference buffers:   ~4GB ✅
System overhead:     ~2GB ✅
Total used:         ~30GB (fits comfortably)
```
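As a rough sanity check on the weight figures above (illustrative arithmetic only: bfloat16 stores about 2 bytes per parameter, and KV cache and buffer sizes depend on context length and batch size):

```python
# Back-of-the-envelope weight footprint for dense bfloat16 models.
def weight_gib(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for size in (1.4, 8, 12, 70):
    print(f"{size:>5}B params -> ~{weight_gib(size):.0f} GiB of weights")
# roughly 3, 15, 22 and 130 GiB; KV caches and runtime buffers come on top of this
```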
API Reference
Standard Endpoints
Health Check
```http
GET /health
```
Response:
```json
{
  "status": "healthy",
  "model_loaded": true,
  "current_model": "LinguaCustodia/qwen3-8b-fin-v0.3",
  "architecture": "Inline Configuration (HF Optimized) + VLLM",
  "gpu_available": true
}
```
List Models
```http
GET /models
```
Response:
```json
{
  "current_model": "qwen3-8b",
  "available_models": {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b"
  }
}
```
Model Switching
```http
POST /load-model?model_name=qwen3-8b
```
Response:
```json
{
  "message": "Model 'qwen3-8b' loading started",
  "model_name": "qwen3-8b",
  "display_name": "Qwen 3 8B Financial",
  "status": "loading_started",
  "backend_type": "vllm"
}
```
Inference
```http
POST /inference
Content-Type: application/json

{
  "prompt": "What is SFCR in insurance regulation?",
  "max_new_tokens": 150,
  "temperature": 0.6
}
```
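An equivalent call from Python (illustrative; swap your own deployment URL in for the placeholder):

```python
# Illustrative client call; the base URL below is a placeholder.
import requests

resp = requests.post(
    "https://your-api-url.hf.space/inference",
    json={"prompt": "What is SFCR in insurance regulation?",
          "max_new_tokens": 150,
          "temperature": 0.6},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```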
OpenAI-Compatible Endpoints
Chat Completions
```http
POST /v1/chat/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "user", "content": "What is Basel III?"}
  ],
  "max_tokens": 150,
  "temperature": 0.6
}
```
Text Completions
```http
POST /v1/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "prompt": "What is Basel III?",
  "max_tokens": 150,
  "temperature": 0.6
}
```
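Because the schema follows the OpenAI standard, the official `openai` Python client can simply be pointed at the deployment. A minimal sketch (whether an API key is enforced depends on your deployment, so the placeholder key below is an assumption):

```python
# Illustrative use of the openai client against the OpenAI-compatible endpoints.
from openai import OpenAI

client = OpenAI(base_url="https://your-api-url.hf.space/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is Basel III?"}],
    max_tokens=150,
    temperature=0.6,
)
print(response.choices[0].message.content)
```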
Analytics Endpoints
Performance Analytics
```http
GET /analytics/performance
```
Cost Analytics
```http
GET /analytics/costs
```
Usage Analytics
```http
GET /analytics/usage
```
Deployment Guide
HuggingFace Spaces Deployment
Requirements
- Dockerfile with `git` installed
- Official vLLM package (`vllm>=0.2.0`)
- Environment variables: `DEPLOYMENT_ENV=huggingface`, `USE_VLLM=true`
- Hardware: L40 GPU (48GB VRAM), Pro account required
Configuration
```yaml
# README.md frontmatter
---
title: LinguaCustodia Financial AI API
emoji: π¦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---
```
Environment Variables
```bash
# Required secrets in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface
```
Storage Configuration
- Persistent Storage: 150GB+ recommended
- Cache Location: `/data/.huggingface`
- Automatic Fallback: `~/.cache/huggingface` if persistent storage is unavailable
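A sketch of how the fallback could be resolved at startup (the helper is illustrative, not the project's actual code):

```python
# Illustrative cache resolution: prefer persistent /data storage when writable.
import os


def resolve_hf_cache() -> str:
    persistent = "/data/.huggingface"
    fallback = os.path.expanduser("~/.cache/huggingface")
    cache = persistent if os.access("/data", os.W_OK) else fallback
    os.environ["HF_HOME"] = cache  # honoured by huggingface_hub / transformers
    return cache
```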
Local Development
Setup
```bash
# Clone repository
git clone <repository-url>
cd Dragon-fin

# Create virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt

# Load environment variables
cp env.example .env
# Edit .env with your tokens

# Run application
python app.py
```
Testing
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test inference
curl -X POST http://localhost:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'
```
Performance & Analytics
Performance Metrics
HuggingFace Spaces (L40 GPU)
- GPU Memory: 36GB utilized (75% of 48GB)
- Model Load Time: ~27 seconds
- Inference Speed: Fast with eager mode (conservative)
- Concurrent Requests: Optimized batching
- Configuration: `enforce_eager=True` for stability
Scaleway L40S (Dedicated GPU)
- GPU Memory: 40.1GB utilized (87% of 48GB)
- Model Load Time: ~30 seconds
- Inference Speed: 20-30% faster with CUDA graphs
- Concurrent Requests: 37.36x max concurrency (4K tokens)
- Response Times: ~0.37s simple, ~3.5s complex queries
- Configuration: `enforce_eager=False` with CUDA graphs enabled
CUDA Graphs Optimization (Scaleway)
- Graph Capture: 67 mixed prefill-decode + 35 decode graphs
- Memory Overhead: 0.85 GiB for graph optimization
- Performance Gain: 20-30% faster inference
- Verification: Look for "Graph capturing finished" in logs
- Configuration: `enforce_eager=False` + `disable_custom_all_reduce=False`
Model Switch Performance
- Memory Cleanup: ~2-3 seconds
- Loading from Cache: ~25 seconds
- Total Switch Time: ~28 seconds
Analytics Features
Performance Monitoring
- GPU utilization tracking
- Memory usage monitoring
- Request latency metrics
- Throughput statistics
Cost Tracking
- Token-based pricing
- Hardware cost calculation
- Usage analytics
- Cost optimization recommendations
Usage Analytics
- Request patterns
- Model usage statistics
- Error rate monitoring
- Performance trends
Troubleshooting
Common Issues
1. Model Loading Failures
Issue: EngineCore failed to start during KV cache initialization
Cause: Model too large for available GPU memory
Solution: Use 8B models instead of 12B+ models on L40 GPU
2. Authentication Errors
Issue: 401 Unauthorized when accessing models
Cause: Incorrect or missing HF_TOKEN_LC
Solution: Verify token in .env file and HF Space settings
3. Memory Issues
Issue: OOM errors during inference
Cause: Insufficient GPU memory
Solution: Reduce gpu_memory_utilization or use smaller model
4. Module Import Errors
Issue: ModuleNotFoundError in HuggingFace Spaces
Cause: Containerized environment module resolution
Solution: Use inline configuration pattern (already implemented)
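The inline pattern simply keeps configuration inside `app.py` instead of in a separate module that the Space's container may fail to import; schematically (illustrative only):

```python
# Fragile in some containerized Spaces: config imported from another module.
# from config.platform import VLLM_CONFIG   # can raise ModuleNotFoundError

# Inline configuration pattern: define the config directly in app.py.
VLLM_CONFIG = {
    "gpu_memory_utilization": 0.75,
    "max_model_len": 2048,
    "enforce_eager": True,
    "dtype": "bfloat16",
}
```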
Debug Commands
Check Space Status
```bash
curl https://your-api-url.hf.space/health
```
Test Model Switching
```bash
curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"
```
Monitor Loading Progress
```bash
curl https://your-api-url.hf.space/loading-status
```
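The last two calls can be combined into a small polling loop during a model switch; a sketch (the fields returned by `/loading-status` are assumed here, not documented above):

```python
# Illustrative polling loop; the status field names are assumptions.
import time
import requests

BASE = "https://your-api-url.hf.space"

requests.post(f"{BASE}/load-model", params={"model_name": "qwen3-8b"}, timeout=30)
while True:
    status = requests.get(f"{BASE}/loading-status", timeout=30).json()
    print(status)
    if status.get("status") not in ("loading", "loading_started"):
        break
    time.sleep(5)
```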
Development History
Version Evolution
v24.1.0 (Current) - Production Ready
- ✅ vLLM backend integration
- ✅ OpenAI-compatible endpoints
- ✅ Dynamic model switching
- ✅ Analytics and monitoring
- ✅ L40 GPU optimization
- ✅ Comprehensive error handling
v22.1.0 - Hybrid Architecture
- ✅ Inline configuration pattern
- ✅ HuggingFace Spaces compatibility
- ✅ Model switching via service restart
- ✅ Persistent storage integration
v20.1.0 - Backend Abstraction
- ✅ Platform-specific configurations
- ✅ HuggingFace/Scaleway support
- ✅ vLLM integration
- ✅ Performance optimizations
Key Milestones
- Initial Development: Basic FastAPI with Transformers backend
- Model Integration: LinguaCustodia model support
- Deployment: HuggingFace Spaces integration
- Performance: vLLM backend implementation
- Compatibility: OpenAI API standard compliance
- Analytics: Performance monitoring and cost tracking
- Optimization: L40 GPU specific configurations
Lessons Learned
- HuggingFace Spaces module resolution differs from local development
- Inline configuration is more reliable for cloud deployments
- vLLM requires service restart for model switching
- 8B models are optimal for L40 GPU (48GB VRAM)
- Persistent storage dramatically improves model loading times
- OpenAI compatibility enables easy integration with existing tools
Best Practices
Model Selection
- Use 8B models for L40 GPU deployments
- Test locally first before deploying to production
- Monitor memory usage during model switching
Performance Optimization
- Enable persistent storage for faster model loading
- Use appropriate GPU memory utilization (75% for HF, 85% for Scaleway)
- Monitor analytics for performance insights
Security
- Keep tokens secure in environment variables
- Use private endpoints for sensitive models
- Implement rate limiting for production deployments
Maintenance
- Regular health checks via the `/health` endpoint
- Monitor error rates and performance metrics
- Update dependencies regularly for security
Support & Resources
Documentation
API Testing
- Interactive Docs: https://your-api-url.hf.space/docs
- Health Check: https://your-api-url.hf.space/health
- Model List: https://your-api-url.hf.space/models
Contact
- Issues: Report via GitHub issues
- Questions: Check documentation first, then create issue
- Contributions: Follow project guidelines
This documentation represents the complete, unified knowledge base for the LinguaCustodia Financial AI API project.