---
license: apple-amlr
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- rag
- compression
- retrieval
- end-to-end
- generation
---
# CLaRa-7B-E2E (Compression-16 & 128)
The **CLaRa-7B-E2E** model is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation. It is released in two variants, with 16× and 128× document compression.
**Training recipe:** End-to-end fine-tuning with differentiable top-k retrieval (sketched below) and a unified language-modeling objective.
**Benchmarks:** Strong retrieval-augmented QA performance under aggressive compression.
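For intuition, here is a minimal, self-contained sketch of one common way to relax top-k selection so that gradients flow into the retrieval scores: an iterative temperature-scaled softmax that suppresses already-selected probability mass. This is an illustrative assumption, not CLaRa's actual mechanism; the function `soft_top_k` and its parameters are hypothetical, and the real objective is described in the paper.

```python
import torch
import torch.nn.functional as F

def soft_top_k(scores: torch.Tensor, k: int, temperature: float = 0.1) -> torch.Tensor:
    """Relaxed top-k: soft selection weights summing to ~k, differentiable in `scores`."""
    weights = torch.zeros_like(scores)
    masked = scores.clone()
    for _ in range(k):
        # Soft argmax over the not-yet-selected probability mass.
        probs = F.softmax(masked / temperature, dim=-1)
        weights = weights + probs
        # Suppress already-selected mass: add log(1 - p) to the scores.
        masked = masked + torch.log1p(-probs.clamp(max=1 - 1e-6))
    return weights

# Toy example: 5 candidate documents, softly select the top 2.
scores = torch.tensor([2.0, 0.5, 1.8, -1.0, 0.3], requires_grad=True)
w = soft_top_k(scores, k=2)
(-(w * scores).sum()).backward()   # any downstream loss works
print(w.detach(), scores.grad)     # gradients reach the retrieval scores
```

As `temperature` approaches 0, the soft weights approach a hard top-k selection.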
---
## More details and usage examples
Paper: [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/abs/2511.18659)
GitHub: https://github.com/apple/ml-clara
---
## Example Usage (End-to-End Inference)
```python
from transformers import AutoModel

# Load the compression-16 checkpoint. Point this at your local download of
# this repository (or the corresponding Hub repo id).
unirag = AutoModel.from_pretrained(
    "CLaRa-7B-E2E/compression-16",
    trust_remote_code=True,
).to("cuda")

# Example documents and question: one candidate pool per question
# (here, 20 copies of the same passage for illustration).
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]
questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]

# End-to-end usage (retrieval + generation).
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64,
)
print("Generated answer:", out)
```