# Model Testing Framework - Complete Guide

## 🎯 **Overview**

I've designed and implemented a comprehensive, isolated testing framework for your deployed LinguaCustodia models. This framework follows best practices for testing AI models and provides detailed performance metrics.

## 🏗️ **Architecture & Design Principles**

### **1. Isolation**
- ✅ **Separate Test Environment**: Completely isolated from production code
- ✅ **Mock Tools**: Safe testing without affecting real systems
- ✅ **Independent Test Suites**: Each test type runs independently
- ✅ **Isolated Results**: All results stored in a dedicated directory

### **2. Modularity**
- ✅ **Base Classes**: Common functionality in `BaseTester`
- ✅ **Pluggable Suites**: Easy to add new test types
- ✅ **Configurable**: Environment-based configuration
- ✅ **Reusable Components**: Metrics, utilities, and tools

### **3. Comprehensive Metrics**
- ✅ **Time to First Token (TTFT)**: Critical for streaming performance
- ✅ **Total Response Time**: End-to-end performance
- ✅ **Token Generation Rate**: Throughput measurement
- ✅ **Success/Failure Rates**: Reliability metrics
- ✅ **Quality Validation**: Response content validation

## 📁 **Directory Structure**

```
testing/
├── README.md                    # Framework documentation
├── run_tests.py                 # Main test runner
├── example_usage.py             # Usage examples
├── config/                      # Test configurations
│   ├── test_config.py           # Main configuration
│   └── model_configs.py         # Model-specific configs
├── core/                        # Core framework
│   ├── base_tester.py           # Base test class
│   ├── metrics.py               # Performance metrics
│   └── utils.py                 # Testing utilities
├── suites/                      # Test suites
│   ├── instruction_test.py      # Instruction following
│   ├── chat_completion_test.py  # Chat with streaming
│   ├── json_structured_test.py  # JSON output validation
│   └── tool_usage_test.py       # Tool calling tests
├── tools/                       # Mock tools
│   ├── time_tool.py             # UTC time tool
│   └── ticker_tool.py           # Stock ticker tool
├── data/                        # Test data
│   └── instructions.json        # Test cases
└── results/                     # Test results (gitignored)
    ├── reports/                 # HTML/JSON reports
    └── logs/                    # Test logs
```

## 🧪 **Test Suites**

### **1. Instruction Following Tests**
- **Purpose**: Test the model's ability to follow simple and complex instructions
- **Metrics**: Response quality, content accuracy, instruction adherence
- **Test Cases**: 5 financial domain scenarios
- **Validation**: Keyword matching, content structure analysis

### **2. Chat Completion Tests (with Streaming)**
- **Purpose**: Test conversational flow and streaming capabilities
- **Metrics**: TTFT, streaming performance, conversation quality
- **Test Cases**: 5 chat scenarios with follow-ups
- **Validation**: Conversational tone, context awareness

### **3. Structured JSON Output Tests**
- **Purpose**: Test the model's ability to produce valid, structured JSON
- **Metrics**: JSON validity, schema compliance, data accuracy
- **Test Cases**: 5 different JSON structures
- **Validation**: JSON parsing, schema validation, data type checking
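To make that validation step concrete, here is a minimal sketch of the kind of check a JSON suite might run. The `validate_json_response` helper and the field/type schema format are illustrative assumptions, not the framework's actual API.

```python
import json

# Hypothetical expected shape for one test case: field name -> required Python type(s).
EXPECTED_SCHEMA = {"ticker": str, "price": (int, float), "currency": str}


def validate_json_response(raw_text: str, schema: dict) -> tuple[bool, str]:
    """Parse a model response and check it against a simple field/type schema."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"

    if not isinstance(data, dict):
        return False, "top-level value is not a JSON object"

    for field, expected_type in schema.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], expected_type):
            return False, f"wrong type for {field}: {type(data[field]).__name__}"

    return True, "ok"


if __name__ == "__main__":
    response = '{"ticker": "AAPL", "price": 189.5, "currency": "USD"}'
    passed, reason = validate_json_response(response, EXPECTED_SCHEMA)
    print(passed, reason)
```

Sticking to the standard-library `json` module for this kind of check keeps the suite dependency-free; a fuller implementation could swap in formal JSON Schema validation.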
### **4. Tool Usage Tests**
- **Purpose**: Test function calling and tool integration
- **Metrics**: Tool selection accuracy, parameter extraction, execution success
- **Test Cases**: 6 tool usage scenarios
- **Mock Tools**: Time tool (UTC), Ticker tool (stock data)
- **Validation**: Tool usage detection, parameter validation

## 🚀 **Usage Examples**

### **Basic Usage**
```bash
# Run all tests
python testing/run_tests.py

# Run specific test suite
python testing/run_tests.py --suite instruction

# Test specific model
python testing/run_tests.py --model llama3.1-8b

# Test with streaming
python testing/run_tests.py --streaming

# Test against specific endpoint
python testing/run_tests.py --endpoint https://your-deployment.com
```

### **Advanced Usage**
```bash
# Run multiple suites
python testing/run_tests.py --suite instruction chat json

# Generate HTML report
python testing/run_tests.py --report html

# Test with custom configuration
TEST_HF_ENDPOINT=https://your-space.com python testing/run_tests.py
```

### **Programmatic Usage**
```python
from testing.run_tests import TestRunner
from testing.suites.instruction_test import InstructionTester

# Create test runner
runner = TestRunner()

# Run specific test suite
results = runner.run_suite(
    tester_class=InstructionTester,
    suite_name="Instruction Following",
    endpoint="https://your-endpoint.com",
    model="llama3.1-8b",
    use_streaming=True
)

# Get results
print(results)
```

## 📊 **Performance Metrics**

### **Key Metrics Tracked**
1. **Time to First Token (TTFT)**: Critical for user experience
2. **Total Response Time**: End-to-end performance
3. **Tokens per Second**: Generation throughput
4. **Success Rate**: Reliability percentage
5. **Error Analysis**: Failure categorization

### **Sample Output**
```
Test: InstructionTester
Model: llama3.1-8b
Endpoint: https://your-deployment.com
Results: 4/5 passed (80.0%)

Performance:
  Time to First Token: 0.245s (min: 0.123s, max: 0.456s)
  Total Response Time: 2.134s (min: 1.234s, max: 3.456s)
  Tokens per Second: 45.67
```

## 🔧 **Configuration**

### **Environment Variables**
```bash
# Test endpoints
TEST_HF_ENDPOINT=https://huggingface.co/spaces/your-space
TEST_SCW_ENDPOINT=https://your-scaleway-deployment.com

# Test settings
TEST_TIMEOUT=60
TEST_MAX_TOKENS=200
TEST_TEMPERATURE=0.7

# Performance settings
TEST_MAX_CONCURRENT=3
TEST_RETRY_ATTEMPTS=2

# Report settings
TEST_REPORT_FORMAT=json
TEST_REPORT_DIR=testing/results/reports
```

### **Configuration File**
The framework uses `testing/config/test_config.py` for centralized configuration with Pydantic validation.

## 🛠️ **Mock Tools**

### **Time Tool**
- **Function**: `get_current_time`
- **Purpose**: Test basic tool calling
- **Parameters**: `format` (iso, timestamp, readable)
- **Returns**: Current UTC time in specified format

### **Ticker Tool**
- **Function**: `get_ticker_info`
- **Purpose**: Test complex tool calling with parameters
- **Parameters**: `symbol`, `info_type` (price, company, financials, all)
- **Returns**: Mock stock data for testing
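As a rough illustration of how such a mock might look, here is a minimal sketch of a ticker tool that returns canned data. The function name and parameters match the description above; the return structure and the `MOCK_TICKERS` table are assumptions for illustration, not the framework's actual implementation.

```python
from datetime import datetime, timezone

# Hypothetical canned data; the real ticker_tool.py may differ.
MOCK_TICKERS = {
    "AAPL": {
        "price": {"last": 189.50, "currency": "USD"},
        "company": {"name": "Apple Inc.", "sector": "Technology"},
        "financials": {"market_cap": "2.9T", "pe_ratio": 29.4},
    },
}


def get_ticker_info(symbol: str, info_type: str = "all") -> dict:
    """Return mock stock data; info_type is price, company, financials, or all."""
    record = MOCK_TICKERS.get(symbol.upper())
    if record is None:
        return {"error": f"unknown symbol: {symbol}"}

    payload = record if info_type == "all" else record.get(info_type)
    if payload is None:
        return {"error": f"unknown info_type: {info_type}"}

    return {
        "symbol": symbol.upper(),
        "info_type": info_type,
        "data": payload,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    print(get_ticker_info("AAPL", "price"))
```

Returning canned data keeps tool-usage tests deterministic and avoids hitting real market-data services during a test run.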
## 📈 **Benefits**

### **1. Quality Assurance**
- Comprehensive testing of all model capabilities
- Automated validation of responses
- Regression testing for updates

### **2. Performance Monitoring**
- Track TTFT and response times
- Monitor token generation rates
- Identify performance bottlenecks

### **3. Model Comparison**
- Objective comparison between models
- Performance benchmarking
- Capability assessment

### **4. Production Readiness**
- Validate deployments before going live
- Ensure all features work correctly
- Confidence in model performance

## 🎯 **Next Steps**

1. **Deploy Your Models**: Deploy to HuggingFace Spaces and Scaleway
2. **Run Initial Tests**: Execute the test suite against your deployments
3. **Analyze Results**: Review performance metrics and identify areas for improvement
4. **Iterate**: Use test results to optimize model performance
5. **Monitor**: Set up regular testing to track performance over time

## 🔍 **Testing Strategy**

### **Phase 1: Basic Functionality**
- Test instruction following
- Verify basic chat completion
- Validate JSON output

### **Phase 2: Advanced Features**
- Test streaming performance
- Validate tool usage
- Measure TTFT metrics

### **Phase 3: Production Validation**
- Load testing
- Error handling
- Edge case validation

This framework provides everything you need to thoroughly test your deployed models with proper isolation, comprehensive metrics, and production-ready validation! 🚀
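To close, here is a minimal sketch of how the TTFT figures reported above might be collected against an OpenAI-compatible streaming endpoint. The `/v1/chat/completions` path, the payload fields, and the use of the `requests` library are assumptions for illustration rather than the framework's actual client code.

```python
import json
import time

import requests  # assumed HTTP client; the framework may use a different one


def measure_ttft(endpoint: str, model: str, prompt: str, timeout: int = 60) -> dict:
    """Stream one chat completion and record time-to-first-token and total time."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 200,
    }

    start = time.perf_counter()
    ttft = None
    chunks = 0

    with requests.post(f"{endpoint}/v1/chat/completions", json=payload,
                       stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # OpenAI-style SSE: each event line starts with "data: ".
            if not line or not line.startswith(b"data: "):
                continue
            body = line[len(b"data: "):]
            if body == b"[DONE]":
                break
            delta = json.loads(body)["choices"][0]["delta"].get("content", "")
            if delta:
                if ttft is None:
                    ttft = time.perf_counter() - start
                chunks += 1  # streamed chunks, a rough proxy for tokens

    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "chunks_per_s": chunks / total if total else 0.0}


if __name__ == "__main__":
    # Hypothetical endpoint and model name taken from the examples above.
    print(measure_ttft("https://your-deployment.com", "llama3.1-8b",
                       "Define EBITDA in one sentence."))
```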