
LinguaCustodia Financial AI API - Comprehensive Documentation

Version: 24.1.0
Last Updated: October 6, 2025
Status: ✅ Production Ready


📋 Table of Contents

  1. Project Overview
  2. Architecture
  3. Golden Rules
  4. Model Compatibility
  5. API Reference
  6. Deployment Guide
  7. Performance & Analytics
  8. Troubleshooting
  9. Development History

🎯 Project Overview

The LinguaCustodia Financial AI API is a production-ready FastAPI application that provides financial AI inference using specialized LinguaCustodia models. It features dynamic model switching, OpenAI-compatible endpoints, and optimized performance for both HuggingFace Spaces and cloud deployments.

Key Features

  • ✅ Multiple Models: Llama 3.1, Qwen 3, Gemma 3, Fin-Pythia
  • ✅ Dynamic Model Switching: Runtime model loading via API
  • ✅ OpenAI Compatibility: Standard /v1/chat/completions interface
  • ✅ vLLM Backend: High-performance inference engine
  • ✅ Analytics: Performance monitoring and cost tracking
  • ✅ Multi-Platform: HuggingFace Spaces, Scaleway, Koyeb support

Current Deployment


πŸ—οΈ Architecture

Backend Abstraction Layer

The application uses a platform-specific backend abstraction that automatically selects optimal configurations:

class InferenceBackend:
    """Unified interface for all inference backends."""
    # Implementations:
    #   - VLLMBackend: high-performance vLLM engine (primary)
    #   - TransformersBackend: Transformers fallback for compatibility

Platform-Specific Configurations

HuggingFace Spaces (L40 GPU - 48GB VRAM)

VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,  # Conservative (36GB of 48GB)
    "max_model_len": 2048,           # HF-optimized
    "enforce_eager": True,           # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}

Scaleway L40S (48GB VRAM)

VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,  # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,           # Full context length
    "enforce_eager": False,          # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
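The active profile is chosen at startup from DEPLOYMENT_ENV. A minimal selection sketch, assuming the two dictionaries above are in scope; the "scaleway" key is an assumed value for the L40S deployment:

```python
import os

# VLLM_CONFIG_HF and VLLM_CONFIG_SCW are the dictionaries shown above.
PLATFORM_CONFIGS = {
    "huggingface": VLLM_CONFIG_HF,
    "scaleway": VLLM_CONFIG_SCW,   # assumed value for the Scaleway L40S deployment
}

def get_vllm_config() -> dict:
    """Select the vLLM settings for the current platform, defaulting to the conservative HF profile."""
    return PLATFORM_CONFIGS.get(os.getenv("DEPLOYMENT_ENV", "huggingface"), VLLM_CONFIG_HF)
```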

Model Loading Strategy

Three-tier caching system (sketched in code below):

  1. First Load: Downloads and caches to persistent storage
  2. Same Model: Reuses loaded model in memory (instant)
  3. Model Switch: Clears GPU memory, loads from disk cache
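A simplified sketch of that flow; the cache path matches the storage configuration later in this document, while the function and variable names are illustrative:

```python
from vllm import LLM

_loaded: dict[str, LLM] = {}   # model(s) currently held in GPU memory

def get_model(model_id: str, cache_dir: str = "/data/.huggingface") -> LLM:
    """Return a ready engine, reusing the in-memory model or the disk cache when possible."""
    if model_id in _loaded:
        return _loaded[model_id]   # same model requested again: instant reuse
    _loaded.clear()                # switching models: a real switch also frees GPU memory (see Golden Rules)
    # First load downloads into the persistent cache; later loads read straight from disk.
    engine = LLM(model=model_id, download_dir=cache_dir)
    _loaded[model_id] = engine
    return engine
```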

🔑 Golden Rules

1. Environment Variables (MANDATORY)

# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here    # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here      # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                           # Default model selection
DEPLOYMENT_ENV=huggingface                    # Platform configuration
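A minimal startup sketch for loading and validating these variables, assuming python-dotenv is used to read the .env file; the variable handling itself is illustrative:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

HF_TOKEN_LC = os.environ["HF_TOKEN_LC"]           # private LinguaCustodia models (fails fast if missing)
HF_TOKEN = os.environ["HF_TOKEN"]                 # HuggingFace Pro features (Spaces, endpoints)
MODEL_NAME = os.getenv("MODEL_NAME", "qwen3-8b")  # default model selection
DEPLOYMENT_ENV = os.getenv("DEPLOYMENT_ENV", "huggingface")
```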

2. Token Usage Rules

  • HF_TOKEN_LC: For accessing private LinguaCustodia models
  • HF_TOKEN: For HuggingFace Pro account features (endpoints, Spaces, etc.)

3. Model Reloading (vLLM Limitation)

  • vLLM does not support hot swaps - service restart required for model switching
  • Solution: Implemented service restart mechanism via /load-model endpoint
  • Process: Clear GPU memory → Restart service → Load new model (cleanup step sketched below)
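A hedged sketch of the memory-cleanup step, using standard PyTorch calls; the surrounding restart logic is platform-specific and omitted here:

```python
import gc
import torch

def clear_gpu_memory(engine) -> None:
    """Release GPU memory held by the current vLLM engine before the service restarts."""
    del engine                     # drop the reference to the engine
    gc.collect()                   # let Python reclaim the underlying objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
```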

4. OpenAI Standard Interface

  • Exposed: /v1/chat/completions, /v1/completions, /v1/models
  • Compatibility: Full OpenAI API compatibility for easy integration
  • Context Management: Automatic chat formatting and context handling

📊 Model Compatibility

✅ L40 GPU Compatible Models (Recommended)

| Model | Parameters | VRAM Used | Status | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~24GB | ✅ Recommended | Development |
| Qwen 3 8B | 8B | ~24GB | ✅ Recommended | Alternative 8B |
| Fin-Pythia 1.4B | 1.4B | ~6GB | ✅ Works | Quick testing |

❌ L40 GPU Incompatible Models

| Model | Parameters | VRAM Needed | Issue |
|---|---|---|---|
| Gemma 3 12B | 12B | ~45GB | ❌ Too large: KV cache allocation fails |
| Llama 3.1 70B | 70B | ~80GB | ❌ Too large: exceeds L40 capacity |

Memory Analysis

Why 12B+ Models Fail on L40:

Model weights:        ~22GB ✅ (loads successfully)
KV caches:            ~15GB ❌ (allocation fails)
Inference buffers:     ~8GB ❌ (allocation fails)
System overhead:       ~3GB ❌ (allocation fails)
Total needed:         ~48GB (exceeds the usable L40 budget)

8B Models Success:

Model weights:        ~16GB ✅
KV caches:             ~8GB ✅
Inference buffers:     ~4GB ✅
System overhead:       ~2GB ✅
Total used:           ~30GB (fits comfortably)
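These breakdowns follow from a rough back-of-envelope estimate: bfloat16 weights take 2 bytes per parameter, plus KV cache, inference buffers, and system overhead. The helper below is an approximation, not a measurement:

```python
def estimate_vram_gb(params_billion: float, kv_cache_gb: float,
                     buffers_gb: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: bfloat16 weights (2 bytes/parameter) plus runtime allocations."""
    weights_gb = params_billion * 2   # 2 bytes per parameter in bfloat16
    return weights_gb + kv_cache_gb + buffers_gb + overhead_gb

print(estimate_vram_gb(8, kv_cache_gb=8, buffers_gb=4))                    # ~30 GB: fits the L40 comfortably
print(estimate_vram_gb(12, kv_cache_gb=15, buffers_gb=8, overhead_gb=3))   # ~50 GB: beyond the usable L40 budget
```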

🔧 API Reference

Standard Endpoints

Health Check

GET /health

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "current_model": "LinguaCustodia/qwen3-8b-fin-v0.3",
  "architecture": "Inline Configuration (HF Optimized) + VLLM",
  "gpu_available": true
}

List Models

GET /models

Response:

{
  "current_model": "qwen3-8b",
  "available_models": {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b"
  }
}

Model Switching

POST /load-model?model_name=qwen3-8b

Response:

{
  "message": "Model 'qwen3-8b' loading started",
  "model_name": "qwen3-8b",
  "display_name": "Qwen 3 8B Financial",
  "status": "loading_started",
  "backend_type": "vllm"
}
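Because loading is asynchronous (and vLLM requires a service restart), a client can trigger the switch and then poll /health until the new model is reported. A minimal sketch with requests; the base URL is a placeholder:

```python
import time
import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder: your deployment URL

requests.post(f"{BASE_URL}/load-model", params={"model_name": "qwen3-8b"}, timeout=30)

# Poll the documented /health fields until the new model is live.
while True:
    try:
        health = requests.get(f"{BASE_URL}/health", timeout=30).json()
    except requests.RequestException:
        health = {}            # the service may be restarting; keep polling
    if health.get("model_loaded") and "qwen3-8b" in health.get("current_model", ""):
        break
    time.sleep(5)

print("Model ready:", health["current_model"])
```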

Inference

POST /inference
Content-Type: application/json

{
  "prompt": "What is SFCR in insurance regulation?",
  "max_new_tokens": 150,
  "temperature": 0.6
}

OpenAI-Compatible Endpoints

Chat Completions

POST /v1/chat/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "user", "content": "What is Basel III?"}
  ],
  "max_tokens": 150,
  "temperature": 0.6
}

Text Completions

POST /v1/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "prompt": "What is Basel III?",
  "max_tokens": 150,
  "temperature": 0.6
}
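Because these endpoints follow the OpenAI schema, the standard openai Python client can simply be pointed at the API. A usage sketch; the base URL is a placeholder and the api_key value is only a filler unless authentication is enabled on your deployment:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api-url.hf.space/v1",  # placeholder: your deployment URL
    api_key="not-needed",                         # filler value unless the API enforces auth
)

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is Basel III?"}],
    max_tokens=150,
    temperature=0.6,
)
print(response.choices[0].message.content)
```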

Analytics Endpoints

Performance Analytics

GET /analytics/performance

Cost Analytics

GET /analytics/costs

Usage Analytics

GET /analytics/usage
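The response schemas are not documented here, so a client can start by dumping the raw JSON. A small sketch with requests (placeholder URL):

```python
import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder: your deployment URL

for endpoint in ("performance", "costs", "usage"):
    data = requests.get(f"{BASE_URL}/analytics/{endpoint}", timeout=30).json()
    print(endpoint, data)
```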

🚀 Deployment Guide

HuggingFace Spaces Deployment

Requirements

  • Dockerfile with git installed
  • Official vLLM package (vllm>=0.2.0)
  • Environment variables: DEPLOYMENT_ENV=huggingface, USE_VLLM=true
  • Hardware: L40 GPU (48GB VRAM) - Pro account required

Configuration

# README.md frontmatter
---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

Environment Variables

# Required secrets in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface

Storage Configuration

  • Persistent Storage: 150GB+ recommended
  • Cache Location: /data/.huggingface
  • Automatic Fallback: ~/.cache/huggingface if persistent storage is unavailable (see the sketch below)
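A minimal sketch of that fallback, assuming the application only needs a writable HF_HOME; the path handling is illustrative:

```python
import os
from pathlib import Path

def resolve_hf_cache() -> str:
    """Prefer persistent storage, falling back to the default user cache if /data is unavailable."""
    persistent = Path("/data/.huggingface")
    if persistent.parent.exists() and os.access(persistent.parent, os.W_OK):
        persistent.mkdir(parents=True, exist_ok=True)
        return str(persistent)
    return str(Path.home() / ".cache" / "huggingface")

os.environ.setdefault("HF_HOME", resolve_hf_cache())
```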

Local Development

Setup

# Clone repository
git clone <repository-url>
cd Dragon-fin

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Load environment variables
cp env.example .env
# Edit .env with your tokens

# Run application
python app.py

Testing

# Test health endpoint
curl http://localhost:8000/health

# Test inference
curl -X POST http://localhost:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'

📈 Performance & Analytics

Performance Metrics

HuggingFace Spaces (L40 GPU)

  • GPU Memory: 36GB utilized (75% of 48GB)
  • Model Load Time: ~27 seconds
  • Inference Speed: Eager-mode execution; stable, but without the CUDA-graph speedup used on Scaleway
  • Concurrent Requests: Optimized batching
  • Configuration: enforce_eager=True for stability

Scaleway L40S (Dedicated GPU)

  • GPU Memory: 40.1GB utilized (~84% of 48GB)
  • Model Load Time: ~30 seconds
  • Inference Speed: 20-30% faster with CUDA graphs
  • Concurrent Requests: 37.36x max concurrency (4K tokens)
  • Response Times: ~0.37s simple, ~3.5s complex queries
  • Configuration: enforce_eager=False with CUDA graphs enabled

CUDA Graphs Optimization (Scaleway)

  • Graph Capture: 67 mixed prefill-decode + 35 decode graphs
  • Memory Overhead: 0.85 GiB for graph optimization
  • Performance Gain: 20-30% faster inference
  • Verification: Look for "Graph capturing finished" in logs
  • Configuration: enforce_eager=False + disable_custom_all_reduce=False

Model Switch Performance

  • Memory Cleanup: ~2-3 seconds
  • Loading from Cache: ~25 seconds
  • Total Switch Time: ~28 seconds

Analytics Features

Performance Monitoring

  • GPU utilization tracking
  • Memory usage monitoring
  • Request latency metrics
  • Throughput statistics

Cost Tracking

  • Token-based pricing
  • Hardware cost calculation
  • Usage analytics
  • Cost optimization recommendations

Usage Analytics

  • Request patterns
  • Model usage statistics
  • Error rate monitoring
  • Performance trends

🔧 Troubleshooting

Common Issues

1. Model Loading Failures

Issue: EngineCore failed to start during KV cache initialization
Cause: Model too large for available GPU memory
Solution: Use 8B models instead of 12B+ models on L40 GPU

2. Authentication Errors

Issue: 401 Unauthorized when accessing models
Cause: Incorrect or missing HF_TOKEN_LC
Solution: Verify token in .env file and HF Space settings

3. Memory Issues

Issue: OOM errors during inference
Cause: Insufficient GPU memory
Solution: Reduce gpu_memory_utilization or use smaller model
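Before lowering gpu_memory_utilization, it can help to check how much VRAM is actually free. A quick check with PyTorch:

```python
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
```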

4. Module Import Errors

Issue: ModuleNotFoundError in HuggingFace Spaces
Cause: Containerized environment module resolution
Solution: Use inline configuration pattern (already implemented)

Debug Commands

Check Space Status

curl https://your-api-url.hf.space/health

Test Model Switching

curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

Monitor Loading Progress

curl https://your-api-url.hf.space/loading-status

📚 Development History

Version Evolution

v24.1.0 (Current) - Production Ready

  • ✅ vLLM backend integration
  • ✅ OpenAI-compatible endpoints
  • ✅ Dynamic model switching
  • ✅ Analytics and monitoring
  • ✅ L40 GPU optimization
  • ✅ Comprehensive error handling

v22.1.0 - Hybrid Architecture

  • ✅ Inline configuration pattern
  • ✅ HuggingFace Spaces compatibility
  • ✅ Model switching via service restart
  • ✅ Persistent storage integration

v20.1.0 - Backend Abstraction

  • ✅ Platform-specific configurations
  • ✅ HuggingFace/Scaleway support
  • ✅ vLLM integration
  • ✅ Performance optimizations

Key Milestones

  1. Initial Development: Basic FastAPI with Transformers backend
  2. Model Integration: LinguaCustodia model support
  3. Deployment: HuggingFace Spaces integration
  4. Performance: vLLM backend implementation
  5. Compatibility: OpenAI API standard compliance
  6. Analytics: Performance monitoring and cost tracking
  7. Optimization: L40 GPU specific configurations

Lessons Learned

  1. HuggingFace Spaces module resolution differs from local development
  2. Inline configuration is more reliable for cloud deployments
  3. vLLM requires service restart for model switching
  4. 8B models are optimal for L40 GPU (48GB VRAM)
  5. Persistent storage dramatically improves model loading times
  6. OpenAI compatibility enables easy integration with existing tools

🎯 Best Practices

Model Selection

  • Use 8B models for L40 GPU deployments
  • Test locally first before deploying to production
  • Monitor memory usage during model switching

Performance Optimization

  • Enable persistent storage for faster model loading
  • Use appropriate GPU memory utilization (75% for HF, 85% for Scaleway)
  • Monitor analytics for performance insights

Security

  • Keep tokens secure in environment variables
  • Use private endpoints for sensitive models
  • Implement rate limiting for production deployments

Maintenance

  • Regular health checks via /health endpoint
  • Monitor error rates and performance metrics
  • Update dependencies regularly for security

📞 Support & Resources

Documentation

API Testing

Contact

  • Issues: Report via GitHub issues
  • Questions: Check documentation first, then create issue
  • Contributions: Follow project guidelines

This documentation represents the complete, unified knowledge base for the LinguaCustodia Financial AI API project.