---
license: apple-amlr
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- rag
- compression
- retrieval
- end-to-end
- generation
---
# CLaRa-7B-E2E (Compression-16 & 128)
The **CLaRa-7B-E2E** model is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation. It is released in two variants, with 16× and 128× document compression.
**Training recipe:** End-to-end fine-tuning with differentiable top-k retrieval (sketched below) and a unified language-modeling objective.
**Benchmarks:** Strong retrieval-augmented QA performance under aggressive compression.
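For intuition, here is a minimal, self-contained sketch of one common way to relax top-k selection so that gradients flow into the retrieval scores: an iterative temperature-scaled softmax that suppresses already-selected probability mass. This is an illustrative assumption, not CLaRa's actual mechanism; the function `soft_top_k` and its parameters are hypothetical, and the real objective is described in the paper.

```python
import torch
import torch.nn.functional as F

def soft_top_k(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Relaxed top-k: soft selection weights summing to ~k, differentiable in `scores`."""
    weights = torch.zeros_like(scores)
    masked = scores.clone()
    for _ in range(k):
        # Soft argmax over the not-yet-selected probability mass.
        probs = F.softmax(masked / temperature, dim=-1)
        weights = weights + probs
        # Suppress already-selected mass: add log(1 - p) to the scores.
        masked = masked + torch.log1p(-probs.clamp(max=1 - 1e-6))
    return weights

# Toy example: 5 candidate documents, softly select the top 2.
scores = torch.tensor([2.0, 0.5, 1.8, -1.0, 0.3], requires_grad=True)
w = soft_top_k(scores, k=2)
(-(w * scores).sum()).backward()   # any downstream loss works
print(w.detach(), scores.grad)     # gradients reach the retrieval scores
```

As `temperature` approaches 0, the soft weights approach a hard top-k selection.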
---
## More details and usage examples
Paper: [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/abs/2511.18659)
GitHub: https://github.com/apple/ml-clara
---
## Example Usage (End-to-End Inference)
```python
from transformers import AutoModel

# Load the compression-16 checkpoint. Point this at your local download of
# this repository (or the corresponding Hub repo id).
unirag = AutoModel.from_pretrained(
    "CLaRa-7B-E2E/compression-16",
    trust_remote_code=True,
).to("cuda")

# Example documents and question: one candidate pool per question
# (here, 20 copies of the same passage for illustration).
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]
questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]

# End-to-end usage (retrieval + generation).
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64,
)
print("Generated answer:", out)
```