# HAT Benchmark Reproducibility Package

This directory contains everything needed to reproduce the benchmark results from the HAT paper.

## Quick Start

```bash
# Run all benchmarks
./run_all_benchmarks.sh

# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```
## Benchmark Suite

### Phase 3.1: HAT vs HNSW Comparison

**Test file**: `tests/phase31_hat_vs_hnsw.rs`

Compares HAT against HNSW on hierarchically structured data (AI conversation patterns).

**Expected Results**:

| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |

**Key finding**: HAT reaches roughly 30 percentage points higher recall@10 than HNSW while building ~70x faster (30ms vs 2100ms).
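
For intuition about what "hierarchically structured" means here, the sketch below builds topic → conversation → message clusters with shrinking noise scales and derives exact ground truth for recall by brute force. The dimension, cluster counts, and noise scales are illustrative assumptions, not the values hard-coded in `tests/phase31_hat_vs_hnsw.rs`.

```python
# Illustrative only: hierarchical (topic -> conversation -> message) vectors plus
# brute-force ground truth. Parameters are assumptions, not the test's actual settings.
import numpy as np

rng = np.random.default_rng(42)
dim, n_topics, convs_per_topic, msgs_per_conv = 128, 10, 10, 10

points = []
for _ in range(n_topics):
    topic = rng.normal(0.0, 1.0, dim)                 # coarse topic center
    for _ in range(convs_per_topic):
        conv = topic + rng.normal(0.0, 0.2, dim)      # conversation within the topic
        for _ in range(msgs_per_conv):
            points.append(conv + rng.normal(0.0, 0.05, dim))  # message within the conversation
points = np.stack(points)                             # (1000, dim) hierarchical dataset

# Exact 10 nearest neighbors of the first point, used as recall@10 ground truth.
dists = np.linalg.norm(points - points[0], axis=1)
print(np.argsort(dists)[:10])
```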
### Phase 3.2: Real Embedding Dimensions

**Test file**: `tests/phase32_real_embeddings.rs`

Tests HAT with production embedding sizes.

**Expected Results**:

| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |
### Phase 3.3: Persistence Layer

**Test file**: `tests/phase33_persistence.rs`

Validates serialization/deserialization correctness and performance.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |
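
The throughput rows are simply bytes on disk divided by wall-clock time. A minimal sketch of that calculation follows; the 50 MB random payload stands in for a serialized index and is not the actual HAT on-disk format.

```python
# Throughput = file size / elapsed time. The dummy payload below is a stand-in for the
# serialized index; swap in the real save/load calls from the persistence test or bindings.
import os
import time

def mb_per_s(path: str, elapsed_s: float) -> float:
    """MB/s for a file produced or consumed in `elapsed_s` seconds."""
    return os.path.getsize(path) / (1024 * 1024) / elapsed_s

start = time.perf_counter()
with open("dummy_index.bin", "wb") as f:
    f.write(os.urandom(50 * 1024 * 1024))                  # 50 MB stand-in payload
print(f"write: {mb_per_s('dummy_index.bin', time.perf_counter() - start):.0f} MB/s")
```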
### Phase 4.2: Attention State Format

**Test file**: `tests/phase42_attention_state.rs`

Tests the attention state serialization format.

**Expected Results**:

- All 9 tests pass
- Role types roundtrip correctly
- Metadata is preserved
- KV cache support works

### Phase 4.3: End-to-End Demo

**Script**: `examples/demo_hat_memory.py`

Full integration demo using sentence-transformers embeddings and an optional LLM.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |
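
The demo's rough shape is: embed a few thousand conversation messages with sentence-transformers, insert them into a HAT index via the Python bindings, then query and time retrieval. The snippet below sketches that flow; the `hat_memory` module and its `Index`/`insert`/`search` names are hypothetical placeholders, so read `examples/demo_hat_memory.py` for the real API.

```python
# Sketch of the demo's flow, not a copy of it. Only the sentence-transformers part is
# concrete; the hat_memory names below are hypothetical placeholders for the real bindings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
messages = [f"message {i}: notes on topic {i % 20}" for i in range(2000)]
embeddings = model.encode(messages, show_progress_bar=False)   # (2000, 384)

# import hat_memory                                   # hypothetical module name
# index = hat_memory.Index(dim=embeddings.shape[1])   # hypothetical constructor
# for i, vec in enumerate(embeddings):
#     index.insert(i, vec)                            # hypothetical insert
# hits = index.search(model.encode("topic 7"), k=10)  # hypothetical top-10 query
```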
## Running Individual Benchmarks

### Rust Benchmarks

```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture

# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture

# Persistence
cargo test --test phase33_persistence -- --nocapture

# Attention state
cargo test --test phase42_attention_state -- --nocapture
```
### Python Tests

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers

# Build extension
maturin develop --features python

# Run tests
pytest python/tests/ -v

# Run demo
python examples/demo_hat_memory.py
```
## Hardware Requirements

- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for large-scale tests
- **Storage**: ~2GB for the full benchmark suite

## Expected Runtime

| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |
## Interpreting Results

### Key Metrics

1. **Recall@k**: Fraction of the true k nearest neighbors returned by the index (see the sketch after this list)
   - HAT target: 100% on hierarchical data
   - HNSW baseline: ~70% on hierarchical data
2. **Build Time**: Time to construct the index
   - HAT target: <100ms for 1000 points
   - Should be 50-100x faster than HNSW
3. **Query Latency**: Time per query
   - HAT target: <5ms
   - Acceptable to be 2-3x slower than HNSW (recall matters more)
4. **Throughput**: Serialization/deserialization speed
   - Target: 100+ MB/s
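
As referenced in item 1 above, recall@k is the overlap between the index's top-k result and the exact top-k neighbors. A minimal version of that calculation:

```python
# Minimal recall@k: overlap between approximate and exact top-k ID sets. The benchmark
# harness reports the same quantity as a percentage, presumably averaged over queries.
def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact k nearest neighbors present in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = [3, 17, 42, 8, 99, 5, 61, 23, 70, 11]    # brute-force ground truth top-10
approx = [3, 17, 42, 8, 99, 5, 61, 23, 70, 12]   # index returned 9 of the 10
print(recall_at_k(approx, exact))                # 0.9
```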
### Success Criteria

The benchmarks validate the paper's claims if:

1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW recall on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. The end-to-end demo achieves ≥95% retrieval accuracy
## Troubleshooting

### Build Errors

```bash
# Update Rust
rustup update

# Clean build
cargo clean && cargo build --release
```

### Python Issues

```bash
# Ensure venv is activated
source venv/bin/activate

# Rebuild extension
maturin develop --features python --release
```

### Memory Issues

For large-scale tests, ensure sufficient RAM:

```bash
# Check available memory
free -h

# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```
## Output Files

Results are saved to `benchmarks/results/`:

```
results/
  benchmark_results_YYYYMMDD_HHMMSS.txt   # Full output
```

## Citation

If you use these benchmarks, please cite:

```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```