
Model Testing Framework

Overview

A comprehensive testing framework for deployed LinguaCustodia models, with isolated test suites for each model capability.

Architecture

testing/
├── README.md                    # This file
├── __init__.py                  # Package initialization
├── config/                      # Test configurations
│   ├── __init__.py
│   ├── test_config.py          # Test settings and endpoints
│   └── model_configs.py        # Model-specific test configs
├── core/                        # Core testing framework
│   ├── __init__.py
│   ├── base_tester.py          # Base test class
│   ├── metrics.py              # Performance metrics
│   └── utils.py                # Testing utilities
├── suites/                      # Test suites
│   ├── __init__.py
│   ├── instruction_test.py     # Instruction following tests
│   ├── chat_completion_test.py # Chat completion tests
│   ├── json_structured_test.py # JSON output tests
│   └── tool_usage_test.py      # Tool calling tests
├── tools/                       # Mock tools for testing
│   ├── __init__.py
│   ├── time_tool.py            # UTC time tool
│   └── ticker_tool.py          # Stock ticker tool
├── data/                        # Test data and fixtures
│   ├── __init__.py
│   ├── instructions.json       # Instruction test cases
│   ├── chat_scenarios.json     # Chat test scenarios
│   └── json_schemas.json       # JSON schema tests
├── results/                     # Test results (gitignored)
│   ├── reports/                # HTML/JSON reports
│   └── logs/                   # Test logs
└── run_tests.py                # Main test runner
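All suites inherit from the base class in core/base_tester.py. A minimal sketch of what that class might look like, assuming an OpenAI-style /v1/chat/completions route and the requests library (the method and field names are illustrative, not the actual implementation):

# Illustrative sketch of core/base_tester.py; names and routes are assumptions.
import time
import requests

class BaseTester:
    """Common plumbing shared by all test suites."""

    def __init__(self, endpoint: str, model: str, timeout: int = 60):
        self.endpoint = endpoint
        self.model = model
        self.timeout = timeout
        self.results = []

    def run_case(self, payload: dict) -> dict:
        """Send one request and record latency plus success/failure."""
        start = time.monotonic()
        try:
            resp = requests.post(
                f"{self.endpoint}/v1/chat/completions",  # assumed OpenAI-compatible route
                json={"model": self.model, **payload},
                timeout=self.timeout,
            )
            resp.raise_for_status()
            record = {"ok": True, "latency_s": time.monotonic() - start, "body": resp.json()}
        except requests.RequestException as exc:
            record = {"ok": False, "latency_s": time.monotonic() - start, "error": str(exc)}
        self.results.append(record)
        return record

Each suite then only implements its own case loading and response checks.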

Design Principles

1. Isolation

  • Each test suite is independent
  • Mock tools don't affect real systems
  • Test data is separate from production
  • Results are isolated in dedicated directory

2. Modularity

  • Base classes for common functionality
  • Pluggable test suites
  • Configurable endpoints and models
  • Reusable metrics and utilities
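In practice, "pluggable" means run_tests.py can resolve the --suite flag through a small registry. A hypothetical sketch (the class names are assumptions based on the file names above):

# Hypothetical suite registry in run_tests.py; class names are assumed.
from suites.instruction_test import InstructionTester
from suites.chat_completion_test import ChatCompletionTester
from suites.json_structured_test import JsonStructuredTester
from suites.tool_usage_test import ToolUsageTester

SUITES = {
    "instruction": InstructionTester,
    "chat": ChatCompletionTester,
    "json": JsonStructuredTester,
    "tools": ToolUsageTester,
}

def build_suite(name: str, endpoint: str, model: str):
    return SUITES[name](endpoint=endpoint, model=model)

Adding a new suite is then a new module plus one registry entry.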

3. Comprehensive Metrics

  • Time to first token (TTFT)
  • Total response time
  • Token generation rate
  • Success/failure rates
  • JSON validation accuracy
  • Tool usage accuracy
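For streaming responses, TTFT and generation rate fall out of inter-chunk timestamps. A minimal sketch, assuming the endpoint yields text chunks and approximating token counts by chunk counts (a real implementation would use the model's tokenizer):

import time

def measure_stream(chunks):
    """Derive TTFT, total time, and throughput from an iterable of streamed chunks."""
    start = time.monotonic()
    ttft = None
    n = 0
    for _ in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token (approx.)
        n += 1
    total = time.monotonic() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        "chunks_per_s": n / total if total > 0 else 0.0,
    }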

4. Real-world Scenarios

  • Financial-domain-specific tests
  • Edge cases and error handling
  • Performance under load
  • Different model sizes

Test Categories

1. Instruction Following

  • Simple Q&A responses
  • Complex multi-step instructions
  • Context understanding
  • Response quality assessment
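Cases live in data/instructions.json; a hypothetical entry (the field names are illustrative, not the actual fixture format):

[
  {
    "id": "fin-qa-001",
    "prompt": "In one sentence, explain what a covered bond is.",
    "checks": {
      "max_sentences": 1,
      "must_mention": ["collateral"]
    }
  }
]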

2. Chat Completion

  • Streaming responses
  • Conversation flow
  • Context retention
  • Turn-taking behavior

3. Structured JSON Output

  • Schema compliance
  • Data type validation
  • Nested object handling
  • Error response formats
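Schema compliance can be checked with the third-party jsonschema package; a minimal sketch with a placeholder schema (real schemas live in data/json_schemas.json):

import json
import jsonschema  # pip install jsonschema

# Placeholder schema for illustration only.
SCHEMA = {
    "type": "object",
    "properties": {"ticker": {"type": "string"}, "price": {"type": "number"}},
    "required": ["ticker", "price"],
}

def is_valid_output(raw: str) -> bool:
    """True if the model's raw text parses as JSON and matches SCHEMA."""
    try:
        jsonschema.validate(json.loads(raw), SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False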

4. Tool Usage

  • Function calling accuracy
  • Parameter extraction
  • Tool selection logic
  • Error handling
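The mock tools are deliberately side-effect free. tools/time_tool.py might look like this, assuming an OpenAI-style function-calling schema (the exact format the deployment expects may differ):

from datetime import datetime, timezone

# Assumed OpenAI-style tool definition; adjust to the deployment's schema.
TIME_TOOL = {
    "type": "function",
    "function": {
        "name": "get_utc_time",
        "description": "Return the current UTC time in ISO 8601 format.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def get_utc_time() -> str:
    """Mock implementation: no network calls, no side effects."""
    return datetime.now(timezone.utc).isoformat()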

Usage

# Run all tests
python testing/run_tests.py

# Run specific test suite
python testing/run_tests.py --suite instruction

# Run with specific model
python testing/run_tests.py --model llama3.1-8b

# Run against specific endpoint
python testing/run_tests.py --endpoint https://your-deployment.com

# Generate detailed report
python testing/run_tests.py --report html

Configuration

Tests are configured via environment variables and config files:

# Test endpoints
TEST_HF_ENDPOINT=https://huggingface.co/spaces/your-space
TEST_SCW_ENDPOINT=https://your-scaleway-deployment.com

# Test settings
TEST_TIMEOUT=60
TEST_MAX_TOKENS=200
TEST_TEMPERATURE=0.7

# Report settings
TEST_REPORT_FORMAT=html
TEST_REPORT_DIR=testing/results/reports
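config/test_config.py can then pick these up with sensible defaults; a sketch (the defaults mirror the values above):

import os

# Hypothetical loading of the environment variables listed above.
HF_ENDPOINT = os.getenv("TEST_HF_ENDPOINT", "")
SCW_ENDPOINT = os.getenv("TEST_SCW_ENDPOINT", "")
TIMEOUT = int(os.getenv("TEST_TIMEOUT", "60"))
MAX_TOKENS = int(os.getenv("TEST_MAX_TOKENS", "200"))
TEMPERATURE = float(os.getenv("TEST_TEMPERATURE", "0.7"))
REPORT_FORMAT = os.getenv("TEST_REPORT_FORMAT", "html")
REPORT_DIR = os.getenv("TEST_REPORT_DIR", "testing/results/reports")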

Benefits

  1. Quality Assurance: Comprehensive testing of all model capabilities
  2. Performance Monitoring: Track TTFT and response times
  3. Regression Testing: Ensure updates don't break functionality
  4. Model Comparison: Compare different models objectively
  5. Production Readiness: Validate deployments before going live