
Model Testing Framework

Overview

A comprehensive testing framework for deployed LinguaCustodia models, with isolated test suites for each model capability.

Architecture

testing/
├── README.md                    # This file
├── __init__.py                  # Package initialization
├── config/                      # Test configurations
│   ├── __init__.py
│   ├── test_config.py          # Test settings and endpoints
│   └── model_configs.py        # Model-specific test configs
├── core/                        # Core testing framework
│   ├── __init__.py
│   ├── base_tester.py          # Base test class
│   ├── metrics.py              # Performance metrics
│   └── utils.py                # Testing utilities
├── suites/                      # Test suites
│   ├── __init__.py
│   ├── instruction_test.py     # Instruction following tests
│   ├── chat_completion_test.py # Chat completion tests
│   ├── json_structured_test.py # JSON output tests
│   └── tool_usage_test.py      # Tool calling tests
├── tools/                       # Mock tools for testing
│   ├── __init__.py
│   ├── time_tool.py            # UTC time tool
│   └── ticker_tool.py          # Stock ticker tool
├── data/                        # Test data and fixtures
│   ├── __init__.py
│   ├── instructions.json       # Instruction test cases
│   ├── chat_scenarios.json     # Chat test scenarios
│   └── json_schemas.json       # JSON schema tests
├── results/                     # Test results (gitignored)
│   ├── reports/                # HTML/JSON reports
│   └── logs/                   # Test logs
└── run_tests.py                # Main test runner
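All suites inherit from the base class in core/base_tester.py. A minimal sketch of what that class might look like, assuming an OpenAI-style /v1/chat/completions route and the requests library (the method and field names are illustrative, not the actual implementation):

# Illustrative sketch of core/base_tester.py; names and routes are assumptions.
import time
import requests

class BaseTester:
    """Common plumbing shared by all test suites."""

    def __init__(self, endpoint: str, model: str, timeout: int = 60):
        self.endpoint = endpoint
        self.model = model
        self.timeout = timeout
        self.results = []

    def run_case(self, payload: dict) -> dict:
        """Send one request and record latency plus success/failure."""
        start = time.monotonic()
        try:
            resp = requests.post(
                f"{self.endpoint}/v1/chat/completions",  # assumed OpenAI-compatible route
                json={"model": self.model, **payload},
                timeout=self.timeout,
            )
            resp.raise_for_status()
            record = {"ok": True, "latency_s": time.monotonic() - start, "body": resp.json()}
        except requests.RequestException as exc:
            record = {"ok": False, "latency_s": time.monotonic() - start, "error": str(exc)}
        self.results.append(record)
        return record

Each suite then only implements its own case loading and response checks.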

Design Principles

1. Isolation

  • Each test suite is independent
  • Mock tools don't affect real systems
  • Test data is separate from production
  • Results are isolated in dedicated directory

2. Modularity

  • Base classes for common functionality
  • Pluggable test suites
  • Configurable endpoints and models
  • Reusable metrics and utilities
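In practice, "pluggable" means run_tests.py can resolve the --suite flag through a small registry. A hypothetical sketch (the class names are assumptions based on the file names above):

# Hypothetical suite registry in run_tests.py; class names are assumed.
from suites.instruction_test import InstructionTester
from suites.chat_completion_test import ChatCompletionTester
from suites.json_structured_test import JsonStructuredTester
from suites.tool_usage_test import ToolUsageTester

SUITES = {
    "instruction": InstructionTester,
    "chat": ChatCompletionTester,
    "json": JsonStructuredTester,
    "tools": ToolUsageTester,
}

def build_suite(name: str, endpoint: str, model: str):
    return SUITES[name](endpoint=endpoint, model=model)

Adding a new suite is then a new module plus one registry entry.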

3. Comprehensive Metrics

  • Time to first token (TTFT)
  • Total response time
  • Token generation rate
  • Success/failure rates
  • JSON validation accuracy
  • Tool usage accuracy
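For streaming responses, TTFT and generation rate fall out of inter-chunk timestamps. A minimal sketch, assuming the endpoint yields text chunks and approximating token counts by chunk counts (a real implementation would use the model's tokenizer):

import time

def measure_stream(chunks):
    """Derive TTFT, total time, and throughput from an iterable of streamed chunks."""
    start = time.monotonic()
    ttft = None
    n = 0
    for _ in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token (approx.)
        n += 1
    total = time.monotonic() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        "chunks_per_s": n / total if total > 0 else 0.0,
    }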

4. Real-world Scenarios

  • Financial-domain-specific tests
  • Edge cases and error handling
  • Performance under load
  • Different model sizes

Test Categories

1. Instruction Following

  • Simple Q&A responses
  • Complex multi-step instructions
  • Context understanding
  • Response quality assessment
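Cases live in data/instructions.json; a hypothetical entry (the field names are illustrative, not the actual fixture format):

[
  {
    "id": "fin-qa-001",
    "prompt": "In one sentence, explain what a covered bond is.",
    "checks": {
      "max_sentences": 1,
      "must_mention": ["collateral"]
    }
  }
]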

2. Chat Completion

  • Streaming responses
  • Conversation flow
  • Context retention
  • Turn-taking behavior

3. Structured JSON Output

  • Schema compliance
  • Data type validation
  • Nested object handling
  • Error response formats
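Schema compliance can be checked with the third-party jsonschema package; a minimal sketch with a placeholder schema (real schemas live in data/json_schemas.json):

import json
import jsonschema  # pip install jsonschema

# Placeholder schema for illustration only.
SCHEMA = {
    "type": "object",
    "properties": {"ticker": {"type": "string"}, "price": {"type": "number"}},
    "required": ["ticker", "price"],
}

def is_valid_output(raw: str) -> bool:
    """True if the model's raw text parses as JSON and matches SCHEMA."""
    try:
        jsonschema.validate(json.loads(raw), SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False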

4. Tool Usage

  • Function calling accuracy
  • Parameter extraction
  • Tool selection logic
  • Error handling
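The mock tools are deliberately side-effect free. tools/time_tool.py might look like this, assuming an OpenAI-style function-calling schema (the exact format the deployment expects may differ):

from datetime import datetime, timezone

# Assumed OpenAI-style tool definition; adjust to the deployment's schema.
TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_utc_time",
        "description": "Return the current UTC time in ISO 8601 format.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def get_utc_time() -> str:
    """Mock implementation: no network calls, no side effects."""
    return datetime.now(timezone.utc).isoformat()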

Usage

# Run all tests
python testing/run_tests.py

# Run specific test suite
python testing/run_tests.py --suite instruction

# Run with specific model
python testing/run_tests.py --model llama3.1-8b

# Run against specific endpoint
python testing/run_tests.py --endpoint https://your-deployment.com

# Generate detailed report
python testing/run_tests.py --report html

Configuration

Tests are configured via environment variables and config files:

# Test endpoints
TEST_HF_ENDPOINT=https://huggingface.co/spaces/your-space
TEST_SCW_ENDPOINT=https://your-scaleway-deployment.com

# Test settings
TEST_TIMEOUT=60
TEST_MAX_TOKENS=200
TEST_TEMPERATURE=0.7

# Report settings
TEST_REPORT_FORMAT=html
TEST_REPORT_DIR=testing/results/reports
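config/test_config.py can then pick these up with sensible defaults; a sketch (the defaults mirror the values above):

import os

# Hypothetical loading of the environment variables listed above.
HF_ENDPOINT = os.getenv("TEST_HF_ENDPOINT", "")
SCW_ENDPOINT = os.getenv("TEST_SCW_ENDPOINT", "")
TIMEOUT = int(os.getenv("TEST_TIMEOUT", "60"))
MAX_TOKENS = int(os.getenv("TEST_MAX_TOKENS", "200"))
TEMPERATURE = float(os.getenv("TEST_TEMPERATURE", "0.7"))
REPORT_FORMAT = os.getenv("TEST_REPORT_FORMAT", "html")
REPORT_DIR = os.getenv("TEST_REPORT_DIR", "testing/results/reports")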

Benefits

  1. Quality Assurance: Comprehensive testing of all model capabilities
  2. Performance Monitoring: Track TTFT and response times
  3. Regression Testing: Ensure updates don't break functionality
  4. Model Comparison: Compare different models objectively
  5. Production Readiness: Validate deployments before going live