# HAT Benchmark Reproducibility Package

This directory contains everything needed to reproduce the benchmark results from the HAT paper.

## Quick Start

```bash
# Run all benchmarks
./run_all_benchmarks.sh

# Run abbreviated version (faster)
./run_all_benchmarks.sh --quick
```

## Benchmark Suite

### Phase 3.1: HAT vs HNSW Comparison

**Test file**: `tests/phase31_hat_vs_hnsw.rs`

Compares HAT against HNSW on hierarchically structured data (AI conversation patterns).

**Expected Results**:

| Metric | HAT | HNSW |
|--------|-----|------|
| Recall@10 | 100% | ~70% |
| Build Time | 30ms | 2100ms |
| Query Latency | 1.4ms | 0.5ms |

**Key finding**: HAT achieves roughly 30 percentage points higher recall@10 while building ~70x faster.
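
The recall figures above are measured against exact (brute-force) nearest neighbours. Below is a minimal sketch of that measurement in Python; the `query_topk` callable is a hypothetical stand-in for the index under test, not this repository's API, and only the recall arithmetic is meant to be illustrative.

```python
# Minimal recall@10 sketch: ground truth from brute-force distances (numpy),
# compared against whatever ids the index under test returns.
# `query_topk(vec, k)` is a hypothetical stand-in for the real index API.
import numpy as np

def brute_force_topk(queries, corpus, k=10):
    """Exact top-k neighbour ids by Euclidean distance (ground truth).
    O(n_queries * n_corpus) memory, intended for small evaluation sets."""
    dists = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

def recall_at_k(retrieved, truth, k=10):
    """Fraction of the true top-k neighbours the index actually returned."""
    return len(set(retrieved[:k]) & set(truth[:k])) / k

# mean_recall = np.mean([recall_at_k(query_topk(q, 10), gt, 10)
#                        for q, gt in zip(queries, brute_force_topk(queries, corpus))])
```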

### Phase 3.2: Real Embedding Dimensions

**Test file**: `tests/phase32_real_embeddings.rs`

Tests HAT with production embedding sizes.

**Expected Results**:

| Dimensions | Model | Recall@10 |
|------------|-------|-----------|
| 384 | MiniLM | 100% |
| 768 | BERT-base | 100% |
| 1536 | OpenAI ada-002 | 100% |
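As a reference point for the 384-dimensional row, the snippet below produces MiniLM vectors with the `sentence-transformers` package already listed in the Python setup; the BERT-base (768) and OpenAI ada-002 (1536) rows follow the same pattern with a different encoder. The example texts are arbitrary.

```python
# Produce 384-dim embeddings matching the MiniLM row of the table.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output
vectors = model.encode(["How do I reset my password?",
                        "The meeting moved to Friday."])
print(vectors.shape)  # (2, 384)
```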

### Phase 3.3: Persistence Layer

**Test file**: `tests/phase33_persistence.rs`

Validates serialization/deserialization correctness and performance.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Serialize throughput | 300+ MB/s |
| Deserialize throughput | 100+ MB/s |
| Recall after restore | 100% |
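
The throughput numbers are simply serialized bytes divided by wall-clock time. A hedged sketch of that measurement follows; `index.save(...)` and `HatIndex.load(...)` are hypothetical names standing in for whatever the persistence API in this repository is actually called.

```python
# Sketch of the MB/s measurement behind the persistence table.
# `index.save(path)` / `HatIndex.load(path)` are hypothetical placeholders.
import os
import time

def throughput_mb_per_s(path, action):
    """Run `action()` (a save or load closure) and report MB/s from the file size."""
    start = time.perf_counter()
    action()
    elapsed = time.perf_counter() - start
    return os.path.getsize(path) / (1024 * 1024) / elapsed

# serialize = throughput_mb_per_s("index.hat", lambda: index.save("index.hat"))
# deserialize = throughput_mb_per_s("index.hat", lambda: HatIndex.load("index.hat"))
# After loading, recall@10 should be re-measured and match the pre-save value (100%).
```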

### Phase 4.2: Attention State Format

**Test file**: `tests/phase42_attention_state.rs`

Tests the attention state serialization format.

**Expected Results**:
- All 9 tests pass
- Role types roundtrip correctly
- Metadata preserved
- KV cache support working
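
Whatever the concrete wire format, the property these tests exercise is a lossless roundtrip: a record carrying a role, metadata, and a KV-cache reference must decode back to exactly what was encoded. An illustrative check, with hypothetical field names and JSON standing in for the real format:

```python
# Illustrative roundtrip property in the spirit of the Phase 4.2 tests.
# Field names are hypothetical; JSON stands in for the actual format.
import json

record = {
    "role": "assistant",           # role types must roundtrip
    "metadata": {"turn": 17},      # metadata must be preserved
    "kv_cache_ref": "block-0042",  # KV cache support
}
assert json.loads(json.dumps(record)) == record
```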

### Phase 4.3: End-to-End Demo

**Script**: `examples/demo_hat_memory.py`

Full integration with sentence-transformers and optional LLM.

**Expected Results**:

| Metric | Value |
|--------|-------|
| Messages | 2000 |
| Tokens | ~60,000 |
| Recall accuracy | 100% |
| Retrieval latency | <5ms |
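
The latency row is per-query wall-clock time over the demo's retrieval calls. A small timing helper along these lines reproduces it; `memory.retrieve(...)` is a hypothetical stand-in for the demo's actual call.

```python
# Per-query retrieval latency measurement, as reported by the demo.
# `memory.retrieve(query_vec, k)` is a hypothetical stand-in for the demo API.
import time

def timed_retrieve(memory, query_vec, k=10):
    start = time.perf_counter()
    hits = memory.retrieve(query_vec, k=k)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return hits, latency_ms  # expect latency_ms < 5 on the stated hardware
```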

## Running Individual Benchmarks

### Rust Benchmarks

```bash
# HAT vs HNSW
cargo test --test phase31_hat_vs_hnsw -- --nocapture

# Real embeddings
cargo test --test phase32_real_embeddings -- --nocapture

# Persistence
cargo test --test phase33_persistence -- --nocapture

# Attention state
cargo test --test phase42_attention_state -- --nocapture
```

### Python Tests

```bash
# Setup
python3 -m venv venv
source venv/bin/activate
pip install maturin pytest sentence-transformers

# Build extension
maturin develop --features python

# Run tests
pytest python/tests/ -v

# Run demo
python examples/demo_hat_memory.py
```

## Hardware Requirements

- **Minimum**: 4GB RAM, any modern CPU
- **Recommended**: 8GB RAM for the large-scale tests
- **Storage**: ~2GB for the full benchmark suite

## Expected Runtime

| Mode | Time |
|------|------|
| Quick (`--quick`) | ~2 minutes |
| Full | ~10 minutes |
| With LLM demo | ~15 minutes |

## Interpreting Results

### Key Metrics

1. **Recall@k**: Percentage of true nearest neighbors found
   - HAT target: 100% on hierarchical data
   - HNSW baseline: ~70% on hierarchical data

2. **Build Time**: Time to construct the index (a generic timing sketch follows this list)
   - HAT target: <100ms for 1000 points
   - Should be 50-100x faster than HNSW

3. **Query Latency**: Time per query
   - HAT target: <5ms
   - Acceptable to be 2-3x slower than HNSW (recall matters more)

4. **Throughput**: Serialization/deserialization speed
   - Target: 100+ MB/s
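
Points 2 and 3 come down to wall-clock timing around index construction and querying. A generic harness is sketched below; `build_index` and `index.query` are hypothetical placeholders into which either HAT or HNSW can be plugged to compute the speedup and slowdown ratios discussed above.

```python
# Generic timing harness for the build-time and query-latency metrics.
# `build_index` and `index.query` are hypothetical placeholders.
import time

def time_build(build_index, vectors):
    """Return the built index and its construction time in milliseconds."""
    start = time.perf_counter()
    index = build_index(vectors)
    return index, (time.perf_counter() - start) * 1000.0

def mean_query_latency_ms(index, queries, k=10):
    """Average per-query latency in milliseconds over a batch of queries."""
    start = time.perf_counter()
    for q in queries:
        index.query(q, k=k)
    return (time.perf_counter() - start) * 1000.0 / len(queries)
```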

### Success Criteria

The benchmarks validate the paper's claims if:

1. HAT recall@10 ≥ 99% on hierarchical data
2. HAT recall significantly exceeds HNSW on hierarchical data
3. HAT builds faster than HNSW
4. Persistence preserves 100% recall
5. Python bindings pass all tests
6. End-to-end demo achieves ≥95% retrieval accuracy

## Troubleshooting

### Build Errors

```bash
# Update Rust
rustup update

# Clean build
cargo clean && cargo build --release
```

### Python Issues

```bash
# Ensure venv is activated
source venv/bin/activate

# Rebuild extension
maturin develop --features python --release
```

### Memory Issues

For large-scale tests, ensure sufficient RAM:

```bash
# Check available memory
free -h

# Run with limited parallelism
RAYON_NUM_THREADS=2 cargo test --test phase31_hat_vs_hnsw
```

## Output Files

Results are saved to `benchmarks/results/`:

```
results/
  benchmark_results_YYYYMMDD_HHMMSS.txt  # Full output
```

## Citation

If you use these benchmarks, please cite:

```bibtex
@article{hat2026,
  title={Hierarchical Attention Tree: Extending LLM Context Through Structural Memory},
  author={AI Research Lab},
  year={2026}
}
```