# HAT Benchmark Reproducibility Package

This directory contains everything needed to reproduce the benchmark results from the HAT paper.

## Quick Start

```bash
# Run all benchmarks
./run_all_benchmarks.sh

# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```
## Benchmark Suite

### Phase 3.1: HAT vs HNSW Comparison

**Test file**: `tests/phase31_hat_vs_hnsw.rs`

Compares HAT against HNSW on hierarchically structured data (AI conversation patterns).

**Expected Results**:

| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |

**Key finding**: HAT reaches roughly 30 percentage points higher recall@10 than HNSW while building ~70x faster (30ms vs 2100ms).
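
For intuition about what "hierarchically structured" means here, the sketch below builds topic → conversation → message clusters with shrinking noise scales and derives exact ground truth for recall by brute force. The dimension, cluster counts, and noise scales are illustrative assumptions, not the values hard-coded in `tests/phase31_hat_vs_hnsw.rs`.

```python
# Illustrative only: hierarchical (topic -> conversation -> message) vectors plus
# brute-force ground truth. Parameters are assumptions, not the test's actual settings.
import numpy as np

rng = np.random.default_rng(42)
dim, n_topics, convs_per_topic, msgs_per_conv = 128, 10, 10, 10

points = []
for _ in range(n_topics):
    topic = rng.normal(0.0, 1.0, dim)                 # coarse topic center
    for _ in range(convs_per_topic):
        conv = topic + rng.normal(0.0, 0.2, dim)      # conversation within the topic
        for _ in range(msgs_per_conv):
            points.append(conv + rng.normal(0.0, 0.05, dim))  # message within the conversation
points = np.stack(points)                             # (1000, dim) hierarchical dataset

# Exact 10 nearest neighbors of the first point, used as recall@10 ground truth.
dists = np.linalg.norm(points - points[0], axis=1)
print(np.argsort(dists)[:10])
```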
### Phase 3.2: Real Embedding Dimensions

**Test file**: `tests/phase32_real_embeddings.rs`

Tests HAT with production embedding sizes.

**Expected Results**:

| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |
### Phase 3.3: Persistence Layer

**Test file**: `tests/phase33_persistence.rs`

Validates serialization/deserialization correctness and performance.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |
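
The throughput rows are simply bytes on disk divided by wall-clock time. A minimal sketch of that calculation follows; the 50 MB random payload stands in for a serialized index and is not the actual HAT on-disk format.

```python
# Throughput = file size / elapsed time. The dummy payload below is a stand-in for the
# serialized index; swap in the real save/load calls from the persistence test or bindings.
import os
import time

def mb_per_s(path: str, elapsed_s: float) -> float:
    """MB/s for a file produced or consumed in `elapsed_s` seconds."""
    return os.path.getsize(path) / (1024 * 1024) / elapsed_s

start = time.perf_counter()
with open("dummy_index.bin", "wb") as f:
    f.write(os.urandom(50 * 1024 * 1024))                  # 50 MB stand-in payload
print(f"write: {mb_per_s('dummy_index.bin', time.perf_counter() - start):.0f} MB/s")
```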
### Phase 4.2: Attention State Format

**Test file**: `tests/phase42_attention_state.rs`

Tests the attention state serialization format.

**Expected Results**:

- All 9 tests pass
- Role types roundtrip correctly
- Metadata is preserved
- KV cache support works

### Phase 4.3: End-to-End Demo

**Script**: `examples/demo_hat_memory.py`

Full integration demo using sentence-transformers embeddings and an optional LLM.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |
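
The demo's rough shape is: embed a few thousand conversation messages with sentence-transformers, insert them into a HAT index via the Python bindings, then query and time retrieval. The snippet below sketches that flow; the `hat_memory` module and its `Index`/`insert`/`search` names are hypothetical placeholders, so read `examples/demo_hat_memory.py` for the real API.

```python
# Sketch of the demo's flow, not a copy of it. Only the sentence-transformers part is
# concrete; the hat_memory names below are hypothetical placeholders for the real bindings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
messages = [f"message {i}: notes on topic {i % 20}" for i in range(2000)]
embeddings = model.encode(messages, show_progress_bar=False)   # (2000, 384)

# import hat_memory                                   # hypothetical module name
# index = hat_memory.Index(dim=embeddings.shape[1])   # hypothetical constructor
# for i, vec in enumerate(embeddings):
#     index.insert(i, vec)                            # hypothetical insert
# hits = index.search(model.encode("topic 7"), k=10)  # hypothetical top-10 query
```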
## Running Individual Benchmarks

### Rust Benchmarks

```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture

# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture

# Persistence
cargo test --test phase33_persistence -- --nocapture

# Attention state
cargo test --test phase42_attention_state -- --nocapture
```
### Python Tests

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers

# Build extension
maturin develop --features python

# Run tests
pytest python/tests/ -v

# Run demo
python examples/demo_hat_memory.py
```
## Hardware Requirements

- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for large-scale tests
- **Storage**: ~2GB for the full benchmark suite

## Expected Runtime

| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |
## Interpreting Results

### Key Metrics

1. **Recall@k**: Fraction of the true k nearest neighbors returned by the index (see the sketch after this list)
   - HAT target: 100% on hierarchical data
   - HNSW baseline: ~70% on hierarchical data
2. **Build Time**: Time to construct the index
   - HAT target: <100ms for 1000 points
   - Should be 50-100x faster than HNSW
3. **Query Latency**: Time per query
   - HAT target: <5ms
   - Acceptable to be 2-3x slower than HNSW (recall matters more)
4. **Throughput**: Serialization/deserialization speed
   - Target: 100+ MB/s
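
As referenced in item 1 above, recall@k is the overlap between the index's top-k result and the exact top-k neighbors. A minimal version of that calculation:

```python
# Minimal recall@k: overlap between approximate and exact top-k ID sets. The benchmark
# harness reports the same quantity as a percentage, presumably averaged over queries.
def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact k nearest neighbors present in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = [3, 17, 42, 8, 99, 5, 61, 23, 70, 11]    # brute-force ground truth top-10
approx = [3, 17, 42, 8, 99, 5, 61, 23, 70, 12]   # index returned 9 of the 10
print(recall_at_k(approx, exact))                # 0.9
```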
### Success Criteria

The benchmarks validate the paper's claims if:

1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW recall on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. The end-to-end demo achieves ≥95% retrieval accuracy
## Troubleshooting

### Build Errors

```bash
# Update Rust
rustup update

# Clean build
cargo clean && cargo build --release
```

### Python Issues

```bash
# Ensure venv is activated
source venv/bin/activate

# Rebuild extension
maturin develop --features python --release
```

### Memory Issues

For large-scale tests, ensure sufficient RAM:

```bash
# Check available memory
free -h

# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```
## Output Files

Results are saved to `benchmarks/results/`:

```
results/
  benchmark_results_YYYYMMDD_HHMMSS.txt   # Full output
```

## Citation

If you use these benchmarks, please cite:

```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```