# HAT Benchmark Reproducibility Package
This directory contains everything needed to reproduce the benchmark results from the HAT paper.
## Quick Start
```bash
# Run all benchmarks
./run_all_benchmarks.sh
# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```
## Benchmark Suite
### Phase 3.1: HAT vs HNSW Comparison
**Test file**: `tests/phase31_hat_vs_hnsw.rs`
Compares HAT against HNSW on hierarchically structured data (AI conversation patterns).
**Expected Results**:
| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |
**Key finding**: HAT achieves ~30 percentage points higher recall while building ~70x faster.
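The Rust test generates its own dataset; as an illustration of what "hierarchically structured (AI conversation patterns)" means here, the sketch below builds session → topic → message clusters in Python. All names and parameters are illustrative, not the benchmark's actual generator.
```python
# Illustrative hierarchical data: coarse session clusters, mid-level topic
# clusters, fine-grained message points (not the benchmark's real generator).
import numpy as np

rng = np.random.default_rng(42)
dim, n_sessions, topics_per_session, msgs_per_topic = 64, 10, 5, 20

points = []
for _ in range(n_sessions):
    session_center = rng.normal(0.0, 1.0, dim)
    for _ in range(topics_per_session):
        topic_center = session_center + rng.normal(0.0, 0.3, dim)
        for _ in range(msgs_per_topic):
            points.append(topic_center + rng.normal(0.0, 0.05, dim))

data = np.stack(points)  # (1000, 64): 10 sessions x 5 topics x 20 messages
```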
### Phase 3.2: Real Embedding Dimensions
**Test file**: `tests/phase32_real_embeddings.rs`
Tests HAT with production embedding sizes.
**Expected Results**:
| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |
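The Rust test exercises these dimensions directly. For reference, real 384-dimensional vectors can be produced with sentence-transformers (already a demo dependency); `all-MiniLM-L6-v2` below is the standard 384-d MiniLM checkpoint, and the other rows correspond to 768-d BERT-base and 1536-d ada-002 embeddings.
```python
# Produce real 384-d embeddings matching the first row of the table.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-d MiniLM checkpoint
vecs = model.encode(["hello world", "how are you"])
assert vecs.shape == (2, 384)
```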
### Phase 3.3: Persistence Layer
**Test file**: `tests/phase33_persistence.rs`
Validates serialization/deserialization correctness and performance.
**Expected Results**:
| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |
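The throughput figures are simply bytes moved divided by wall-clock time. The real serializer is the Rust persistence layer; the sketch below times `pickle` on random vectors purely to show how the MB/s numbers are derived.
```python
# How MB/s is computed: payload size / wall-clock time. pickle stands in
# for the HAT serializer, which is implemented in Rust.
import pickle
import time
import numpy as np

data = np.random.default_rng(0).standard_normal((50_000, 384)).astype(np.float32)

t0 = time.perf_counter()
buf = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
ser = time.perf_counter() - t0

t0 = time.perf_counter()
_ = pickle.loads(buf)
de = time.perf_counter() - t0

mb = len(buf) / 1e6  # payload size in megabytes
print(f"serialize:   {mb / ser:.0f} MB/s")
print(f"deserialize: {mb / de:.0f} MB/s")
```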
### Phase 4.2: Attention State Format
**Test file**: `tests/phase42_attention_state.rs`
Tests the attention state serialization format.
**Expected Results**:
- All 9 tests pass
- Role types roundtrip correctly
- Metadata preserved
- KV cache support working
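The wire format itself is defined in the Rust crate. As a sketch of the kind of roundtrip check these tests perform, here is the same pattern on a stand-in record; the fields below are illustrative, not the actual attention-state format.
```python
# Roundtrip check on a stand-in record: serialize, deserialize, compare.
import json
from dataclasses import dataclass, asdict

@dataclass
class Message:
    role: str       # e.g. "system" / "user" / "assistant"
    content: str
    metadata: dict

original = Message(role="user", content="hello", metadata={"turn": 1})
restored = Message(**json.loads(json.dumps(asdict(original))))
assert restored == original  # roles and metadata survive the roundtrip
```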
### Phase 4.3: End-to-End Demo
**Script**: `examples/demo_hat_memory.py`
Full integration with sentence-transformers and an optional LLM.
**Expected Results**:
| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |
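For orientation, here is a self-contained sketch of the demo's embed-then-retrieve loop and of how retrieval latency is measured; brute-force cosine search stands in for the HAT index so the snippet runs without the bindings.
```python
# Embed 2000 messages, retrieve top-10 for a query, time the retrieval.
# Brute-force cosine search is a stand-in for the HAT index.
import time
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [f"message {i}: notes about topic {i % 20}" for i in range(2000)]
emb = model.encode(corpus, normalize_embeddings=True)

query = model.encode(["what was said about topic 7?"], normalize_embeddings=True)[0]
t0 = time.perf_counter()
top10 = np.argsort(emb @ query)[::-1][:10]  # cosine similarity on unit vectors
print(f"retrieval: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```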
## Running Individual Benchmarks
### Rust Benchmarks
```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture
# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture
# Persistence
cargo test --test phase33_persistence -- --nocapture
# Attention state
cargo test --test phase42_attention_state -- --nocapture
```
### Python Tests
```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers
# Build extension
maturin develop --features python
# Run tests
pytest python/tests/ -v
# Run demo
python examples/demo_hat_memory.py
```
## Hardware Requirements
- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for large-scale tests
- **Storage**: ~2GB for full benchmark suite
## Expected Runtime
| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |
## Interpreting Results
### Key Metrics
1. **Recall@k**: Fraction of the true k nearest neighbors that the index returns (a worked example follows this list)
- HAT target: 100% on hierarchical data
- HNSW baseline: ~70% on hierarchical data
2. **Build Time**: Time to construct the index
- HAT target: <100ms for 1000 points
- Should be 50-100x faster than HNSW
3. **Query Latency**: Time per query
- HAT target: <5ms
- Acceptable to be 2-3x slower than HNSW (recall matters more)
4. **Throughput**: Serialization/deserialization speed
- Target: 100+ MB/s
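Concretely, recall@k compares the answer set returned by an index against a brute-force ground truth. A minimal sketch on illustrative data:
```python
# recall@10 = fraction of the true 10 nearest neighbors the index returns,
# averaged over queries.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64))
queries = rng.standard_normal((50, 64))

def knn(q, k=10):
    """Brute-force k nearest neighbors by Euclidean distance."""
    return set(np.argsort(np.linalg.norm(data - q, axis=1))[:k])

truth = [knn(q) for q in queries]
approx = [knn(q) for q in queries]  # replace with the index's results
recall = np.mean([len(t & a) / len(t) for t, a in zip(truth, approx)])
print(f"recall@10 = {recall:.2%}")  # 100% here because approx == truth
```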
### Success Criteria
The benchmarks validate the paper's claims if:
1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. End-to-end demo achieves ≥95% retrieval accuracy
## Troubleshooting
### Build Errors
```bash
# Update Rust
rustup update
# Clean build
cargo clean && cargo build --release
```
### Python Issues
```bash
# Ensure venv is activated
source venv/bin/activate
# Rebuild extension
maturin develop --features python --release
```
### Memory Issues
For large-scale tests, ensure sufficient RAM:
```bash
# Check available memory
free -h
# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```
## Output Files
Results are saved to `benchmarks/results/`:
```
results/
  benchmark_results_YYYYMMDD_HHMMSS.txt   # Full output
```
## Citation
If you use these benchmarks, please cite:
```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```