
LinguaCustodia Financial AI API - Comprehensive Documentation

Version: 24.1.0
Last Updated: October 6, 2025
Status: ✅ Production Ready


📋 Table of Contents

  1. Project Overview
  2. Architecture
  3. Golden Rules
  4. Model Compatibility
  5. API Reference
  6. Deployment Guide
  7. Performance & Analytics
  8. Troubleshooting
  9. Development History

🎯 Project Overview

The LinguaCustodia Financial AI API is a production-ready FastAPI application that provides financial AI inference using specialized LinguaCustodia models. It features dynamic model switching, OpenAI-compatible endpoints, and optimized performance for both HuggingFace Spaces and cloud deployments.

Key Features

  • ✅ Multiple Models: Llama 3.1, Qwen 3, Gemma 3, Fin-Pythia
  • ✅ Dynamic Model Switching: Runtime model loading via API
  • ✅ OpenAI Compatibility: Standard /v1/chat/completions interface
  • ✅ vLLM Backend: High-performance inference engine
  • ✅ Analytics: Performance monitoring and cost tracking
  • ✅ Multi-Platform: HuggingFace Spaces, Scaleway, Koyeb support

Current Deployment


πŸ—οΈ Architecture

Backend Abstraction Layer

The application uses a platform-specific backend abstraction that automatically selects optimal configurations:

class InferenceBackend:
    """Unified interface for all inference backends."""
    # Implementations:
    #   - VLLMBackend: high-performance vLLM engine (primary)
    #   - TransformersBackend: Transformers fallback for compatibility

Platform-Specific Configurations

HuggingFace Spaces (L40 GPU - 48GB VRAM)

VLLM_CONFIG_HF = {
    "gpu_memory_utilization": 0.75,  # Conservative (36GB of 48GB)
    "max_model_len": 2048,           # HF-optimized
    "enforce_eager": True,           # No CUDA graphs (HF compatibility)
    "disable_custom_all_reduce": True,  # No custom kernels
    "dtype": "bfloat16",
}

Scaleway L40S (48GB VRAM)

VLLM_CONFIG_SCW = {
    "gpu_memory_utilization": 0.85,  # Aggressive (40.8GB of 48GB)
    "max_model_len": 4096,           # Full context length
    "enforce_eager": False,          # CUDA graphs enabled
    "disable_custom_all_reduce": False,  # All optimizations
    "dtype": "bfloat16",
}
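The active profile is chosen at startup from DEPLOYMENT_ENV. A minimal selection sketch, assuming the two dictionaries above are in scope; the "scaleway" key is an assumed value for the L40S deployment:

```python
import os

# VLLM_CONFIG_HF and VLLM_CONFIG_SCW are the dictionaries shown above.
PLATFORM_CONFIGS = {
    "huggingface": VLLM_CONFIG_HF,
    "scaleway": VLLM_CONFIG_SCW,   # assumed value for the Scaleway L40S deployment
}

def get_vllm_config() -> dict:
    """Select the vLLM settings for the current platform, defaulting to the conservative HF profile."""
    return PLATFORM_CONFIGS.get(os.getenv("DEPLOYMENT_ENV", "huggingface"), VLLM_CONFIG_HF)
```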

Model Loading Strategy

Three-tier caching system (sketched in code below):

  1. First Load: Downloads and caches to persistent storage
  2. Same Model: Reuses loaded model in memory (instant)
  3. Model Switch: Clears GPU memory, loads from disk cache
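A simplified sketch of that flow; the cache path matches the storage configuration later in this document, while the function and variable names are illustrative:

```python
from vllm import LLM

_loaded: dict[str, LLM] = {}   # model(s) currently held in GPU memory

def get_model(model_id: str, cache_dir: str = "/data/.huggingface") -> LLM:
    """Return a ready engine, reusing the in-memory model or the disk cache when possible."""
    if model_id in _loaded:
        return _loaded[model_id]   # same model requested again: instant reuse
    _loaded.clear()                # switching models: a real switch also frees GPU memory (see Golden Rules)
    # First load downloads into the persistent cache; later loads read straight from disk.
    engine = LLM(model=model_id, download_dir=cache_dir)
    _loaded[model_id] = engine
    return engine
```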

🔑 Golden Rules

1. Environment Variables (MANDATORY)

# .env file contains all keys and secrets
HF_TOKEN_LC=your_linguacustodia_token_here    # For pulling models from LinguaCustodia
HF_TOKEN=your_huggingface_pro_token_here      # For HF repo access and Pro features
MODEL_NAME=qwen3-8b                           # Default model selection
DEPLOYMENT_ENV=huggingface                    # Platform configuration
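A minimal startup sketch for loading and validating these variables, assuming python-dotenv is used to read the .env file; the variable handling itself is illustrative:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

HF_TOKEN_LC = os.environ["HF_TOKEN_LC"]           # private LinguaCustodia models (fails fast if missing)
HF_TOKEN = os.environ["HF_TOKEN"]                 # HuggingFace Pro features (Spaces, endpoints)
MODEL_NAME = os.getenv("MODEL_NAME", "qwen3-8b")  # default model selection
DEPLOYMENT_ENV = os.getenv("DEPLOYMENT_ENV", "huggingface")
```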

2. Token Usage Rules

  • HF_TOKEN_LC: For accessing private LinguaCustodia models
  • HF_TOKEN: For HuggingFace Pro account features (endpoints, Spaces, etc.)

3. Model Reloading (vLLM Limitation)

  • vLLM does not support hot swaps - service restart required for model switching
  • Solution: Implemented service restart mechanism via /load-model endpoint
  • Process: Clear GPU memory → Restart service → Load new model (cleanup step sketched below)
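A hedged sketch of the memory-cleanup step, using standard PyTorch calls; the surrounding restart logic is platform-specific and omitted here:

```python
import gc
import torch

def clear_gpu_memory(engine) -> None:
    """Release GPU memory held by the current vLLM engine before the service restarts."""
    del engine                     # drop the reference to the engine
    gc.collect()                   # let Python reclaim the underlying objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
```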

4. OpenAI Standard Interface

  • Exposed: /v1/chat/completions, /v1/completions, /v1/models
  • Compatibility: Full OpenAI API compatibility for easy integration
  • Context Management: Automatic chat formatting and context handling

📊 Model Compatibility

✅ L40 GPU Compatible Models (Recommended)

| Model | Parameters | VRAM Used | Status | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | ~24GB | ✅ Recommended | Development |
| Qwen 3 8B | 8B | ~24GB | ✅ Recommended | Alternative 8B |
| Fin-Pythia 1.4B | 1.4B | ~6GB | ✅ Works | Quick testing |

❌ L40 GPU Incompatible Models

| Model | Parameters | VRAM Needed | Issue |
|---|---|---|---|
| Gemma 3 12B | 12B | ~45GB | ❌ Too large: KV cache allocation fails |
| Llama 3.1 70B | 70B | ~80GB | ❌ Too large: exceeds L40 capacity |

Memory Analysis

Why 12B+ Models Fail on L40:

Model weights:        ~22GB ✅ (loads successfully)
KV caches:            ~15GB ❌ (allocation fails)
Inference buffers:     ~8GB ❌ (allocation fails)
System overhead:       ~3GB ❌ (allocation fails)
Total needed:         ~48GB (exceeds the usable L40 budget)

8B Models Success:

Model weights:        ~16GB ✅
KV caches:             ~8GB ✅
Inference buffers:     ~4GB ✅
System overhead:       ~2GB ✅
Total used:           ~30GB (fits comfortably)
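These breakdowns follow from a rough back-of-envelope estimate: bfloat16 weights take 2 bytes per parameter, plus KV cache, inference buffers, and system overhead. The helper below is an approximation, not a measurement:

```python
def estimate_vram_gb(params_billion: float, kv_cache_gb: float,
                     buffers_gb: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: bfloat16 weights (2 bytes/parameter) plus runtime allocations."""
    weights_gb = params_billion * 2   # 2 bytes per parameter in bfloat16
    return weights_gb + kv_cache_gb + buffers_gb + overhead_gb

print(estimate_vram_gb(8, kv_cache_gb=8, buffers_gb=4))                    # ~30 GB: fits the L40 comfortably
print(estimate_vram_gb(12, kv_cache_gb=15, buffers_gb=8, overhead_gb=3))   # ~50 GB: beyond the usable L40 budget
```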

🔧 API Reference

Standard Endpoints

Health Check

GET /health

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "current_model": "LinguaCustodia/qwen3-8b-fin-v0.3",
  "architecture": "Inline Configuration (HF Optimized) + VLLM",
  "gpu_available": true
}

List Models

GET /models

Response:

{
  "current_model": "qwen3-8b",
  "available_models": {
    "llama3.1-8b": "LinguaCustodia/llama3.1-8b-fin-v0.3",
    "qwen3-8b": "LinguaCustodia/qwen3-8b-fin-v0.3",
    "fin-pythia-1.4b": "LinguaCustodia/fin-pythia-1.4b"
  }
}

Model Switching

POST /load-model?model_name=qwen3-8b

Response:

{
  "message": "Model 'qwen3-8b' loading started",
  "model_name": "qwen3-8b",
  "display_name": "Qwen 3 8B Financial",
  "status": "loading_started",
  "backend_type": "vllm"
}
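Because loading is asynchronous (and vLLM requires a service restart), a client can trigger the switch and then poll /health until the new model is reported. A minimal sketch with requests; the base URL is a placeholder:

```python
import time
import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder: your deployment URL

requests.post(f"{BASE_URL}/load-model", params={"model_name": "qwen3-8b"}, timeout=30)

# Poll the documented /health fields until the new model is live.
while True:
    try:
        health = requests.get(f"{BASE_URL}/health", timeout=30).json()
    except requests.RequestException:
        health = {}            # the service may be restarting; keep polling
    if health.get("model_loaded") and "qwen3-8b" in health.get("current_model", ""):
        break
    time.sleep(5)

print("Model ready:", health["current_model"])
```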

Inference

POST /inference
Content-Type: application/json

{
  "prompt": "What is SFCR in insurance regulation?",
  "max_new_tokens": 150,
  "temperature": 0.6
}

OpenAI-Compatible Endpoints

Chat Completions

POST /v1/chat/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "messages": [
    {"role": "user", "content": "What is Basel III?"}
  ],
  "max_tokens": 150,
  "temperature": 0.6
}

Text Completions

POST /v1/completions
Content-Type: application/json

{
  "model": "qwen3-8b",
  "prompt": "What is Basel III?",
  "max_tokens": 150,
  "temperature": 0.6
}
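Because these endpoints follow the OpenAI schema, the standard openai Python client can simply be pointed at the API. A usage sketch; the base URL is a placeholder and the api_key value is only a filler unless authentication is enabled on your deployment:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api-url.hf.space/v1",  # placeholder: your deployment URL
    api_key="not-needed",                         # filler value unless the API enforces auth
)

response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "What is Basel III?"}],
    max_tokens=150,
    temperature=0.6,
)
print(response.choices[0].message.content)
```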

Analytics Endpoints

Performance Analytics

GET /analytics/performance

Cost Analytics

GET /analytics/costs

Usage Analytics

GET /analytics/usage
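The response schemas are not documented here, so a client can start by dumping the raw JSON. A small sketch with requests (placeholder URL):

```python
import requests

BASE_URL = "https://your-api-url.hf.space"  # placeholder: your deployment URL

for endpoint in ("performance", "costs", "usage"):
    data = requests.get(f"{BASE_URL}/analytics/{endpoint}", timeout=30).json()
    print(endpoint, data)
```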

🚀 Deployment Guide

HuggingFace Spaces Deployment

Requirements

  • Dockerfile with git installed
  • Official vLLM package (vllm>=0.2.0)
  • Environment variables: DEPLOYMENT_ENV=huggingface, USE_VLLM=true
  • Hardware: L40 GPU (48GB VRAM) - Pro account required

Configuration

# README.md frontmatter
---
title: LinguaCustodia Financial AI API
emoji: 🏦
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
app_port: 7860
---

Environment Variables

# Required secrets in HF Space settings
HF_TOKEN_LC=your_linguacustodia_token
HF_TOKEN=your_huggingface_pro_token
MODEL_NAME=qwen3-8b
DEPLOYMENT_ENV=huggingface
HF_HOME=/data/.huggingface

Storage Configuration

  • Persistent Storage: 150GB+ recommended
  • Cache Location: /data/.huggingface
  • Automatic Fallback: ~/.cache/huggingface if persistent storage is unavailable (see the sketch below)
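A minimal sketch of that fallback, assuming the application only needs a writable HF_HOME; the path handling is illustrative:

```python
import os
from pathlib import Path

def resolve_hf_cache() -> str:
    """Prefer persistent storage, falling back to the default user cache if /data is unavailable."""
    persistent = Path("/data/.huggingface")
    if persistent.parent.exists() and os.access(persistent.parent, os.W_OK):
        persistent.mkdir(parents=True, exist_ok=True)
        return str(persistent)
    return str(Path.home() / ".cache" / "huggingface")

os.environ.setdefault("HF_HOME", resolve_hf_cache())
```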

Local Development

Setup

# Clone repository
git clone <repository-url>
cd Dragon-fin

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Load environment variables
cp env.example .env
# Edit .env with your tokens

# Run application
python app.py

Testing

# Test health endpoint
curl http://localhost:8000/health

# Test inference
curl -X POST http://localhost:8000/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is SFCR?", "max_new_tokens": 100}'

📈 Performance & Analytics

Performance Metrics

HuggingFace Spaces (L40 GPU)

  • GPU Memory: 36GB utilized (75% of 48GB)
  • Model Load Time: ~27 seconds
  • Inference Speed: Eager-mode execution; stable, but without the CUDA-graph speedup used on Scaleway
  • Concurrent Requests: Optimized batching
  • Configuration: enforce_eager=True for stability

Scaleway L40S (Dedicated GPU)

  • GPU Memory: 40.1GB utilized (~84% of 48GB)
  • Model Load Time: ~30 seconds
  • Inference Speed: 20-30% faster with CUDA graphs
  • Concurrent Requests: 37.36x max concurrency (4K tokens)
  • Response Times: ~0.37s simple, ~3.5s complex queries
  • Configuration: enforce_eager=False with CUDA graphs enabled

CUDA Graphs Optimization (Scaleway)

  • Graph Capture: 67 mixed prefill-decode + 35 decode graphs
  • Memory Overhead: 0.85 GiB for graph optimization
  • Performance Gain: 20-30% faster inference
  • Verification: Look for "Graph capturing finished" in logs
  • Configuration: enforce_eager=False + disable_custom_all_reduce=False

Model Switch Performance

  • Memory Cleanup: ~2-3 seconds
  • Loading from Cache: ~25 seconds
  • Total Switch Time: ~28 seconds

Analytics Features

Performance Monitoring

  • GPU utilization tracking
  • Memory usage monitoring
  • Request latency metrics
  • Throughput statistics

Cost Tracking

  • Token-based pricing
  • Hardware cost calculation
  • Usage analytics
  • Cost optimization recommendations

Usage Analytics

  • Request patterns
  • Model usage statistics
  • Error rate monitoring
  • Performance trends

🔧 Troubleshooting

Common Issues

1. Model Loading Failures

Issue: EngineCore failed to start during KV cache initialization
Cause: Model too large for available GPU memory
Solution: Use 8B models instead of 12B+ models on L40 GPU

2. Authentication Errors

Issue: 401 Unauthorized when accessing models
Cause: Incorrect or missing HF_TOKEN_LC
Solution: Verify token in .env file and HF Space settings

3. Memory Issues

Issue: OOM errors during inference
Cause: Insufficient GPU memory
Solution: Reduce gpu_memory_utilization or use smaller model
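Before lowering gpu_memory_utilization, it can help to check how much VRAM is actually free. A quick check with PyTorch:

```python
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
```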

4. Module Import Errors

Issue: ModuleNotFoundError in HuggingFace Spaces
Cause: Containerized environment module resolution
Solution: Use inline configuration pattern (already implemented)

Debug Commands

Check Space Status

curl https://your-api-url.hf.space/health

Test Model Switching

curl -X POST "https://your-api-url.hf.space/load-model?model_name=qwen3-8b"

Monitor Loading Progress

curl https://your-api-url.hf.space/loading-status

📚 Development History

Version Evolution

v24.1.0 (Current) - Production Ready

  • ✅ vLLM backend integration
  • ✅ OpenAI-compatible endpoints
  • ✅ Dynamic model switching
  • ✅ Analytics and monitoring
  • ✅ L40 GPU optimization
  • ✅ Comprehensive error handling

v22.1.0 - Hybrid Architecture

  • ✅ Inline configuration pattern
  • ✅ HuggingFace Spaces compatibility
  • ✅ Model switching via service restart
  • ✅ Persistent storage integration

v20.1.0 - Backend Abstraction

  • ✅ Platform-specific configurations
  • ✅ HuggingFace/Scaleway support
  • ✅ vLLM integration
  • ✅ Performance optimizations

Key Milestones

  1. Initial Development: Basic FastAPI with Transformers backend
  2. Model Integration: LinguaCustodia model support
  3. Deployment: HuggingFace Spaces integration
  4. Performance: vLLM backend implementation
  5. Compatibility: OpenAI API standard compliance
  6. Analytics: Performance monitoring and cost tracking
  7. Optimization: L40 GPU specific configurations

Lessons Learned

  1. HuggingFace Spaces module resolution differs from local development
  2. Inline configuration is more reliable for cloud deployments
  3. vLLM requires service restart for model switching
  4. 8B models are optimal for L40 GPU (48GB VRAM)
  5. Persistent storage dramatically improves model loading times
  6. OpenAI compatibility enables easy integration with existing tools

🎯 Best Practices

Model Selection

  • Use 8B models for L40 GPU deployments
  • Test locally first before deploying to production
  • Monitor memory usage during model switching

Performance Optimization

  • Enable persistent storage for faster model loading
  • Use appropriate GPU memory utilization (75% for HF, 85% for Scaleway)
  • Monitor analytics for performance insights

Security

  • Keep tokens secure in environment variables
  • Use private endpoints for sensitive models
  • Implement rate limiting for production deployments

Maintenance

  • Regular health checks via /health endpoint
  • Monitor error rates and performance metrics
  • Update dependencies regularly for security

📞 Support & Resources

Documentation

API Testing

Contact

  • Issues: Report via GitHub issues
  • Questions: Check documentation first, then create issue
  • Contributions: Follow project guidelines

This documentation represents the complete, unified knowledge base for the LinguaCustodia Financial AI API project.