# HAT Benchmark Reproducibility Package

This directory contains everything needed to reproduce the benchmark results from the HAT paper.

## Quick Start

```bash
# Run all benchmarks
./run_all_benchmarks.sh

# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```

## Benchmark Suite

### Phase 3.1: HAT vs HNSW Comparison

**Test file**: `tests/phase31_hat_vs_hnsw.rs`

Compares HAT against HNSW on hierarchically-structured data (AI conversation patterns).

**Expected Results**:

| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |

**Key finding**: HAT achieves ~30 percentage points higher recall (100% vs ~70%) while building ~70x faster.

### Phase 3.2: Real Embedding Dimensions

**Test file**: `tests/phase32_real_embeddings.rs`

Tests HAT with production embedding sizes.

**Expected Results**:

| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |

### Phase 3.3: Persistence Layer

**Test file**: `tests/phase33_persistence.rs`

Validates serialization/deserialization correctness and performance.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |

### Phase 4.2: Attention State Format

**Test file**: `tests/phase42_attention_state.rs`

Tests the attention state serialization format.

**Expected Results**:

- All 9 tests pass
- Role types roundtrip correctly
- Metadata preserved
- KV cache support working

### Phase 4.3: End-to-End Demo

**Script**: `examples/demo_hat_memory.py`

Full integration with sentence-transformers and an optional LLM.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |

## Running Individual Benchmarks

### Rust Benchmarks

```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture

# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture

# Persistence
cargo test --test phase33_persistence -- --nocapture

# Attention state
cargo test --test phase42_attention_state -- --nocapture
```

### Python Tests

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers

# Build extension
maturin develop --features python

# Run tests
pytest python/tests/ -v

# Run demo
python examples/demo_hat_memory.py
```

## Hardware Requirements

- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for large-scale tests
- **Storage**: ~2GB for full benchmark suite

## Expected Runtime

| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |

## Interpreting Results

### Key Metrics

1. **Recall@k**: Percentage of true nearest neighbors found (computed as sketched below)
   - HAT target: 100% on hierarchical data
   - HNSW baseline: ~70% on hierarchical data
2. **Build Time**: Time to construct the index
   - HAT target: <100ms for 1000 points
   - Should be 50-100x faster than HNSW
3. **Query Latency**: Time per query
   - HAT target: <5ms
   - Acceptable to be 2-3x slower than HNSW (recall matters more)
4. **Throughput**: Serialization/deserialization speed
   - Target: 100+ MB/s

### Success Criteria

The benchmarks validate the paper's claims if:

1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. End-to-end demo achieves ≥95% retrieval accuracy
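The actual recall computation lives in the Rust test files above; the following is only a minimal NumPy sketch of the recall@k metric as defined here, with hypothetical names (`recall_at_k`, `retrieved_ids`) and Euclidean distance assumed as the similarity measure.

```python
# Illustrative only: a minimal sketch of recall@k, not the repository's harness.
# Names and the distance metric here are assumptions for the example.
import numpy as np

def recall_at_k(queries, corpus, retrieved_ids, k=10):
    """Fraction of brute-force top-k neighbors that the index under test also returned.

    queries:       (q, d) query vectors
    corpus:        (n, d) indexed vectors
    retrieved_ids: (q, k) ids returned by the index being evaluated
    """
    # Brute-force ground truth by Euclidean distance.
    dists = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=-1)
    truth = np.argsort(dists, axis=1)[:, :k]

    hits = sum(len(set(t) & set(r)) for t, r in zip(truth, retrieved_ids))
    return hits / (len(queries) * k)

# Example with random 384-d data and a "perfect" index (expected recall@10 = 100%):
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384)).astype(np.float32)
queries = corpus[:5] + 1e-3  # queries near existing points
dists = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=-1)
perfect_ids = np.argsort(dists, axis=1)[:, :10]
print(f"recall@10 = {recall_at_k(queries, corpus, perfect_ids, k=10):.2%}")
```

Substituting the ids returned by a HAT or HNSW index for `perfect_ids` reproduces the recall columns reported in the tables above.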
## Troubleshooting

### Build Errors

```bash
# Update Rust
rustup update

# Clean build
cargo clean && cargo build --release
```

### Python Issues

```bash
# Ensure venv is activated
source venv/bin/activate

# Rebuild extension
maturin develop --features python --release
```

### Memory Issues

For large-scale tests, ensure sufficient RAM:

```bash
# Check available memory
free -h

# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```

## Output Files

Results are saved to `benchmarks/results/`:

```
results/
    benchmark_results_YYYYMMDD_HHMMSS.txt   # Full output
```

## Citation

If you use these benchmarks, please cite:

```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```