Model Testing Framework
Overview
A comprehensive testing framework for deployed LinguaCustodia models, with isolated test suites for each capability.
Architecture
testing/
├── README.md                    # This file
├── __init__.py                  # Package initialization
├── config/                      # Test configurations
│   ├── __init__.py
│   ├── test_config.py           # Test settings and endpoints
│   └── model_configs.py         # Model-specific test configs
├── core/                        # Core testing framework
│   ├── __init__.py
│   ├── base_tester.py           # Base test class
│   ├── metrics.py               # Performance metrics
│   └── utils.py                 # Testing utilities
├── suites/                      # Test suites
│   ├── __init__.py
│   ├── instruction_test.py      # Instruction following tests
│   ├── chat_completion_test.py  # Chat completion tests
│   ├── json_structured_test.py  # JSON output tests
│   └── tool_usage_test.py       # Tool calling tests
├── tools/                       # Mock tools for testing
│   ├── __init__.py
│   ├── time_tool.py             # UTC time tool
│   └── ticker_tool.py           # Stock ticker tool
├── data/                        # Test data and fixtures
│   ├── __init__.py
│   ├── instructions.json        # Instruction test cases
│   ├── chat_scenarios.json      # Chat test scenarios
│   └── json_schemas.json        # JSON schema tests
├── results/                     # Test results (gitignored)
│   ├── reports/                 # HTML/JSON reports
│   └── logs/                    # Test logs
└── run_tests.py                 # Main test runner
Design Principles
1. Isolation
- Each test suite is independent
- Mock tools don't affect real systems
- Test data is separate from production
- Results are isolated in dedicated directory
2. Modularity
- Base classes for common functionality (see the sketch below)
- Pluggable test suites
- Configurable endpoints and models
- Reusable metrics and utilities
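To make the base-class idea concrete, here is a minimal sketch of what core/base_tester.py could expose. The class, method, and field names are illustrative assumptions, not the project's actual API:

```python
# Hypothetical shared base class for test suites; names are illustrative.
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SuiteResult:
    suite: str
    passed: int = 0
    failed: int = 0
    durations: list = field(default_factory=list)  # per-case wall-clock times


class BaseTester(ABC):
    """Common plumbing shared by all test suites."""

    def __init__(self, endpoint: str, model: str, timeout: int = 60):
        self.endpoint = endpoint
        self.model = model
        self.timeout = timeout

    @abstractmethod
    def run_case(self, case: dict) -> bool:
        """Run a single test case; each suite implements this."""

    def run(self, cases: list[dict]) -> SuiteResult:
        result = SuiteResult(suite=type(self).__name__)
        for case in cases:
            start = time.perf_counter()
            ok = self.run_case(case)
            result.durations.append(time.perf_counter() - start)
            result.passed += ok
            result.failed += not ok
        return result
```

Each suite in suites/ would then subclass this and only implement its own case logic, which keeps suites pluggable and independent.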
3. Comprehensive Metrics
- Time to first token (TTFT)
- Total response time
- Token generation rate
- Success/failure rates
- JSON validation accuracy
- Tool usage accuracy
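As a rough illustration, the latency metrics above could be captured around a streaming response as sketched below; the `stream_tokens` callable is a stand-in for whatever client the framework actually uses:

```python
# Hedged sketch: measure TTFT, total time, and generation rate over a token stream.
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GenerationMetrics:
    ttft_s: float        # time to first token, seconds
    total_s: float       # total response time, seconds
    tokens: int          # tokens generated
    tokens_per_s: float  # token generation rate


def measure_stream(stream_tokens: Callable[[], Iterable[str]]) -> GenerationMetrics:
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream_tokens():
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    rate = count / total if total > 0 else 0.0
    return GenerationMetrics(ttft_s=ttft or total, total_s=total,
                             tokens=count, tokens_per_s=rate)
```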
4. Real-world Scenarios
- Financial-domain-specific tests
- Edge cases and error handling
- Performance under load
- Different model sizes
Test Categories
1. Instruction Following
- Simple Q&A responses
- Complex multi-step instructions
- Context understanding
- Response quality assessment
2. Chat Completion
- Streaming responses
- Conversation flow
- Context retention
- Turn-taking behavior
3. Structured JSON Output
- Schema compliance
- Data type validation
- Nested object handling
- Error response formats
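A hedged sketch of a schema-compliance check, assuming the third-party jsonschema package is available; the example schema is illustrative and not taken from data/json_schemas.json:

```python
# Parse the model's raw output and validate it against a JSON schema.
import json
from jsonschema import validate, ValidationError

EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "ticker": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["ticker", "price"],
}


def check_json_output(raw_model_output: str, schema: dict = EXAMPLE_SCHEMA) -> bool:
    """Return True if the output parses as JSON and matches the schema."""
    try:
        payload = json.loads(raw_model_output)
        validate(instance=payload, schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```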
4. Tool Usage
- Function calling accuracy
- Parameter extraction
- Tool selection logic
- Error handling
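For example, a mock tool in the spirit of tools/time_tool.py might look like the following. The OpenAI-style function spec is an assumption about how tools are declared to the model, not the repository's actual format:

```python
# Illustrative mock tool: returns the current UTC time and touches no real system.
from datetime import datetime, timezone

TIME_TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "get_utc_time",
        "description": "Return the current UTC time in ISO 8601 format.",
        "parameters": {"type": "object", "properties": {}},
    },
}


def get_utc_time() -> str:
    return datetime.now(timezone.utc).isoformat()
```

Tests can then verify that the model selects this tool when asked for the time and that it passes no spurious parameters.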
Usage
# Run all tests
python testing/run_tests.py
# Run specific test suite
python testing/run_tests.py --suite instruction
# Run with specific model
python testing/run_tests.py --model llama3.1-8b
# Run against specific endpoint
python testing/run_tests.py --endpoint https://your-deployment.com
# Generate detailed report
python testing/run_tests.py --report html
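Internally, run_tests.py presumably maps these flags onto suites; a rough argparse sketch, with flag names mirroring the commands above and the suite choices as assumptions:

```python
# Hypothetical CLI wiring for run_tests.py; not the actual implementation.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run model test suites")
    parser.add_argument("--suite", choices=["instruction", "chat", "json", "tools"],
                        help="run a single suite instead of all of them")
    parser.add_argument("--model", default=None, help="model name, e.g. llama3.1-8b")
    parser.add_argument("--endpoint", default=None, help="deployment URL to test against")
    parser.add_argument("--report", choices=["html", "json"], help="report format")
    return parser.parse_args()
```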
Configuration
Tests are configured via environment variables and config files:
# Test endpoints
TEST_HF_ENDPOINT=https://huggingface.co/spaces/your-space
TEST_SCW_ENDPOINT=https://your-scaleway-deployment.com
# Test settings
TEST_TIMEOUT=60
TEST_MAX_TOKENS=200
TEST_TEMPERATURE=0.7
# Report settings
TEST_REPORT_FORMAT=html
TEST_REPORT_DIR=testing/results/reports
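A minimal sketch of how config/test_config.py might read these variables with the standard library, using the defaults shown above; the field names are illustrative assumptions:

```python
# Hedged sketch: pull test settings from the environment with sane defaults.
import os
from dataclasses import dataclass


@dataclass
class TestConfig:
    hf_endpoint: str = os.getenv("TEST_HF_ENDPOINT", "")
    scw_endpoint: str = os.getenv("TEST_SCW_ENDPOINT", "")
    timeout: int = int(os.getenv("TEST_TIMEOUT", "60"))
    max_tokens: int = int(os.getenv("TEST_MAX_TOKENS", "200"))
    temperature: float = float(os.getenv("TEST_TEMPERATURE", "0.7"))
    report_format: str = os.getenv("TEST_REPORT_FORMAT", "html")
    report_dir: str = os.getenv("TEST_REPORT_DIR", "testing/results/reports")
```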
Benefits
- Quality Assurance: Comprehensive testing of all model capabilities
- Performance Monitoring: Track TTFT and response times
- Regression Testing: Ensure updates don't break functionality
- Model Comparison: Compare different models objectively
- Production Readiness: Validate deployments before going live