# HAT Benchmark Reproducibility Package
This directory contains everything needed to reproduce the benchmark results from the HAT paper.
## Quick Start
```bash
# Run all benchmarks
./run_all_benchmarks.sh
# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```
## Benchmark Suite
### Phase 3.1: HAT vs HNSW Comparison
**Test file**: `tests/phase31_hat_vs_hnsw.rs`
Compares HAT against HNSW on hierarchically-structured data (AI conversation patterns).
**Expected Results**:
| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |
**Key finding**: HAT achieves 30 percentage points higher recall (100% vs ~70%) while building ~70x faster.
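The exact data generator lives in `tests/phase31_hat_vs_hnsw.rs`. As a rough illustration of what "hierarchically-structured" means here, the Python sketch below builds clusters-of-clusters (topic → thread → message); all names and noise scales are illustrative, not the test's actual values.

```python
import numpy as np

def hierarchical_points(n_topics=10, n_threads=10, n_msgs=10, dim=384, rng=None):
    """Illustrative clusters-of-clusters generator: topic -> thread -> message."""
    rng = rng or np.random.default_rng(42)
    points = []
    for _ in range(n_topics):
        topic_center = rng.normal(0.0, 1.0, dim)                     # coarse cluster
        for _ in range(n_threads):
            thread_center = topic_center + rng.normal(0.0, 0.1, dim)  # sub-cluster
            points.append(thread_center + rng.normal(0.0, 0.01, (n_msgs, dim)))
    return np.vstack(points)

pts = hierarchical_points()
print(pts.shape)  # (1000, 384)
```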
### Phase 3.2: Real Embedding Dimensions
**Test file**: `tests/phase32_real_embeddings.rs`
Tests HAT with production embedding sizes.
**Expected Results**:
| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |
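To sanity-check the dimensions in this table locally, sentence-transformers exposes each model's output size. The snippet below uses two illustrative checkpoints; OpenAI ada-002 is API-only, so the 1536-d case can fall back to synthetic vectors of the same dimensionality.

```python
from sentence_transformers import SentenceTransformer

# Illustrative checkpoints matching the 384-d and 768-d rows above.
for name, dim in [("all-MiniLM-L6-v2", 384), ("bert-base-nli-mean-tokens", 768)]:
    model = SentenceTransformer(name)
    assert model.get_sentence_embedding_dimension() == dim, name
# OpenAI ada-002 (1536-d) requires API access; a local run would use
# synthetic 1536-d vectors instead.
```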
### Phase 3.3: Persistence Layer
**Test file**: `tests/phase33_persistence.rs`
Validates serialization/deserialization correctness and performance.
**Expected Results**:
| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |
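The Rust test reports throughput as payload size over wall-clock time. A minimal stand-in for that measurement pattern, using `pickle` as a placeholder serializer rather than the HAT format, looks like:

```python
import pickle
import time

def mb_per_s(fn, n_bytes, repeats=5):
    """Best-of-N wall-clock throughput in MB/s for a (de)serialization callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return n_bytes / best / 1e6

# Stand-in payload; substitute the real index (de)serializer for fn.
data = list(range(1_000_000))
blob = pickle.dumps(data)
print(f"serialize:   {mb_per_s(lambda: pickle.dumps(data), len(blob)):.0f} MB/s")
print(f"deserialize: {mb_per_s(lambda: pickle.loads(blob), len(blob)):.0f} MB/s")
```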
### Phase 4.2: Attention State Format
**Test file**: `tests/phase42_attention_state.rs`
Tests the attention state serialization format.
**Expected Results**:
- All 9 tests pass
- Role types roundtrip correctly
- Metadata preserved
- KV cache support working
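The property behind "roundtrip correctly" is `decode(encode(x)) == x` for every role type and metadata field. In the sketch below, `encode`/`decode` are JSON stand-ins for the actual attention-state format these tests exercise:

```python
import json

# Stand-in codec: the real tests use the HAT attention-state format, not JSON.
def roundtrip(state, encode=json.dumps, decode=json.loads):
    return decode(encode(state)) == state

for role in ("system", "user", "assistant"):
    assert roundtrip({"role": role, "metadata": {"turn": 7}, "kv_cache": [0.1, 0.2]})
```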
### Phase 4.3: End-to-End Demo
**Script**: `examples/demo_hat_memory.py`
Full integration with sentence-transformers and an optional LLM.
**Expected Results**:
| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |
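To see where the <5ms figure comes from: the demo embeds messages with sentence-transformers and times each query. The sketch below reproduces that timing harness with brute-force cosine search standing in for the HAT index; names and sizes are illustrative.

```python
import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [f"message {i}" for i in range(2000)]
emb = model.encode(corpus, normalize_embeddings=True)     # (2000, 384)

query = model.encode(["what did we decide about caching?"],
                     normalize_embeddings=True)

t0 = time.perf_counter()
top10 = np.argsort(emb @ query[0])[::-1][:10]             # brute-force stand-in
print(f"retrieval: {(time.perf_counter() - t0) * 1e3:.2f} ms, hits: {top10[:3]}")
```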
## Running Individual Benchmarks
### Rust Benchmarks
```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture
# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture
# Persistence
cargo test --test phase33_persistence -- --nocapture
# Attention state
cargo test --test phase42_attention_state -- --nocapture
```
### Python Tests
```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers
# Build extension
maturin develop --features python
# Run tests
pytest python/tests/ -v
# Run demo
python examples/demo_hat_memory.py
```
## Hardware Requirements
- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for large-scale tests
- **Storage**: ~2GB for full benchmark suite
## Expected Runtime
| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |
## Interpreting Results
### Key Metrics
1. **Recall@k**: Percentage of the true k nearest neighbors found in the top-k results (computed as in the sketch after this list)
- HAT target: 100% on hierarchical data
- HNSW baseline: ~70% on hierarchical data
2. **Build Time**: Time to construct the index
- HAT target: <100ms for 1000 points
- Should be 50-100x faster than HNSW
3. **Query Latency**: Time per query
- HAT target: <5ms
- Acceptable to be 2-3x slower than HNSW (recall matters more)
4. **Throughput**: Serialization/deserialization speed
- Target: 100+ MB/s
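Recall@k as used throughout this package can be computed in a few lines. The helper below is a generic sketch, not code from the test suite:

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the k true nearest neighbors that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# Reported as a percentage, averaged over all queries.
assert recall_at_k([1, 2, 3], [1, 2, 4], k=3) == 2 / 3
```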
### Success Criteria
The benchmarks validate the paper's claims if:
1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. End-to-end demo achieves ≥95% retrieval accuracy
## Troubleshooting
### Build Errors
```bash
# Update Rust
rustup update
# Clean build
cargo clean && cargo build --release
```
### Python Issues
```bash
# Ensure venv is activated
source venv/bin/activate
# Rebuild extension
maturin develop --features python --release
```
### Memory Issues
For large-scale tests, ensure sufficient RAM:
```bash
# Check available memory
free -h
# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```
## Output Files
Results are saved to `benchmarks/results/`:
```
results/
└── benchmark_results_YYYYMMDD_HHMMSS.txt   # Full output
```
## Citation
If you use these benchmarks, please cite:
```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```