qwen2.5-3b-memory-summary-v2
Overview
This model is a LoRA fine-tuned version of Qwen2.5-3B-Instruct, designed as a session memory summarization backbone for multi-turn conversational systems such as RAG-based assistants.
Unlike standard summarization models, it focuses on compressing dialogue into durable, reusable memory for future turns.
Design Philosophy
This project adopts a hybrid memory architecture:
- LLM → narrative summary
- Code → structured memory update
Why this design?
| Component | Role |
|---|---|
| LLM | Natural language compression |
| Code | Deterministic state update |
This separation improves:
- reliability
- controllability
- scalability
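A minimal sketch of this split, assuming a memory record with a `narrative` string and a `structured` dict (field names are borrowed from the sample memory in the comparison script later in this card). The function names `apply_structured_update`, `update_memory`, and `fake_summarize` are hypothetical, and the LLM call is replaced by a stub so the example runs without a model:

```python
from copy import deepcopy
from typing import Callable, Optional


def apply_structured_update(structured: dict, new_facts: list, resolved: list,
                            current_focus: Optional[str] = None) -> dict:
    """Deterministic state update: no model call, same input gives same output."""
    updated = deepcopy(structured)  # never mutate the caller's memory
    for fact in new_facts:
        if fact not in updated["established_facts"]:
            updated["established_facts"].append(fact)
    updated["unresolved_questions"] = [
        q for q in updated["unresolved_questions"] if q not in resolved
    ]
    if current_focus is not None:
        updated["current_focus"] = current_focus
    return updated


def update_memory(memory: dict, turns: list,
                  summarize: Callable[[dict, list], str],
                  **structured_changes) -> dict:
    """The LLM writes the narrative; plain code owns the structured fields."""
    return {
        "narrative": summarize(memory, turns),
        "structured": apply_structured_update(memory["structured"], **structured_changes),
    }


# Hypothetical stand-in for the fine-tuned model.
def fake_summarize(memory: dict, turns: list) -> str:
    return memory["narrative"] + " The memory model will run as a separate service."


memory = {
    "narrative": "The user is designing a multi-turn RAG system.",
    "structured": {
        "goal": "Design a multi-turn architecture",
        "established_facts": ["RAG is used"],
        "current_focus": "memory model design",
        "unresolved_questions": ["whether to separate the memory model service"],
    },
}

new_memory = update_memory(
    memory,
    turns=[],
    summarize=fake_summarize,
    new_facts=["memory model runs as a separate service"],
    resolved=["whether to separate the memory model service"],
    current_focus="training data selection",
)
print(new_memory["structured"])
```

How the structured changes are detected (rules, tool output, a separate classifier) is left open here; the point is only that the update step itself involves no generation.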
Task Definition
Input
- Previous memory (JSON)
- Recent dialogue turns
- Optional document context
Output
- A concise memory summary for future interactions
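The three inputs can be rendered into a single prompt along these lines. This is a sketch, not the exact training format: `build_memory_prompt` is a hypothetical helper, and the section tags are taken from the inference example in this card.

```python
import json
from typing import Optional


def build_memory_prompt(previous_memory: dict, recent_turns: list,
                        active_document: Optional[dict] = None) -> str:
    """Render previous memory, recent turns, and an optional document into one prompt."""
    turns_text = "\n".join(
        f'{t["role"].capitalize()}: {t["text"]}' for t in recent_turns
    )
    sections = [
        "[PREVIOUS MEMORY]",
        json.dumps(previous_memory, ensure_ascii=False, indent=2),
        "[RECENT TURNS]",
        turns_text,
    ]
    if active_document is not None:  # document context is optional
        sections += [
            "[ACTIVE DOCUMENT]",
            json.dumps(active_document, ensure_ascii=False, indent=2),
        ]
    return "\n".join(sections)


prompt_without_doc = build_memory_prompt(
    {"narrative": "User is designing a multi-turn RAG system"},
    [{"role": "user", "text": "Should memory be separate?"}],
)
prompt_with_doc = build_memory_prompt(
    {"narrative": "User is designing a multi-turn RAG system"},
    [{"role": "user", "text": "Should memory be separate?"}],
    active_document={"doc_summary": "Survey of multi-turn LLM systems"},
)
print(prompt_with_doc)
```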
Training Setup
Base Model
Qwen/Qwen2.5-3B-Instruct
Fine-tuning Method
- LoRA (Parameter-efficient fine-tuning)
- Supervised Fine-Tuning (SFT)
Hyperparameters
| Parameter | Value |
|---|---|
| Train Batch Size | 6 |
| Gradient Accumulation | 2 |
| Effective Batch | 12 |
| Epochs | 2 |
| Learning Rate | 1e-4 |
| Max Sequence Length | 3072 |
| Precision | bf16 |
| Gradient Checkpointing | Enabled |
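The effective batch size in the table follows from the per-step batch size and gradient accumulation. The sketch below reproduces that arithmetic; the `TrainConfig` class is hypothetical, and `num_devices = 1` is an assumption inferred from 6 × 2 = 12 rather than something the card states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    per_device_batch_size: int = 6
    gradient_accumulation_steps: int = 2
    num_devices: int = 1  # assumption: a single GPU, consistent with 6 x 2 = 12
    epochs: int = 2
    learning_rate: float = 1e-4
    max_seq_length: int = 3072

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer update.
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)

    def max_tokens_per_update(self) -> int:
        # Upper bound only: without packing, most samples are shorter than the limit.
        return self.effective_batch_size * self.max_seq_length


cfg = TrainConfig()
print(cfg.effective_batch_size)     # 12
print(cfg.max_tokens_per_update())  # 36864
```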
Key Training Strategies
1. Completion-only Supervision
completion_only_loss=True
- Only supervises assistant responses
- Prevents prompt/template memorization
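The idea behind completion-only supervision can be illustrated with plain label masking. This is a generic sketch, not the trainer's internal implementation: `mask_prompt_labels` is a hypothetical helper and the token ids are made up. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids: list, prompt_length: int) -> list:
    """Copy input_ids into labels, hiding the first prompt_length positions from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Toy token ids: five prompt tokens followed by three assistant tokens.
input_ids = [101, 7, 8, 9, 102, 55, 56, 2]
labels = mask_prompt_labels(input_ids, prompt_length=5)
print(labels)  # [-100, -100, -100, -100, -100, 55, 56, 2]
```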
2. No Sample Packing
packing=False
- Preserves sample boundaries
- Critical for instruction-following tasks
3. Long Context Handling
- Up to 3072 tokens
- Supports multi-turn memory + document context
4. Critical Fix: Special Token Alignment
tokenizer.pad_token = tokenizer.eos_token
This fix resolved:
- repetition issues
- generation collapse
- EOS mismatch
Training Data
Sources
| Type | Dataset |
|---|---|
| Dialogue summarization | DialogSum |
| Chat summarization | SAMSum |
| Query-based summarization | QMSum |
| Memory-style data | Synthetic (limited) |
Data Limitation
The dataset is not memory-optimized. It is primarily:
- general summarization data
- not structured memory extraction data
Training Results
Validation Loss
| Step | Loss |
|---|---|
| 50 | 1.3417 |
| 250 | 1.2806 |
| 500 | 1.2545 |
| 750 | 1.2359 (best) |
| 1000+ | ~1.25 |
Interpretation
- Loss drops quickly early on (1.3417 at step 50 to 1.2806 at step 250)
- Best validation loss falls between steps 700 and 800 (1.2359 at step 750)
- Further training plateaus at around 1.25
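Picking the checkpoint from these numbers is a simple minimum over the evaluation log. The snippet below uses only the exact rows of the table (the approximate "1000+" row is left out) and also reports the relative drop from the first evaluation:

```python
# (step, validation loss) pairs copied from the table above.
val_loss = [(50, 1.3417), (250, 1.2806), (500, 1.2545), (750, 1.2359)]

# Best checkpoint = lowest validation loss.
best_step, best_loss = min(val_loss, key=lambda pair: pair[1])

# Relative improvement between the first evaluation and the best one.
first_step, first_loss = val_loss[0]
relative_drop = (first_loss - best_loss) / first_loss

print(f"best step: {best_step}, loss: {best_loss:.4f}")
print(f"relative improvement since step {first_step}: {relative_drop:.1%}")
```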
Inference Example
Input (abbreviated)
[PREVIOUS MEMORY]
User is designing a multi-turn RAG system
[RECENT TURNS]
User: Should memory be separate?
Assistant: It can be separate
User: What data can we use?
[ACTIVE DOCUMENT]
Survey of multi-turn LLM systems
Output
The user is designing a multi-turn RAG system. The memory model can run on the same machine but as a separate service.
Performance Analysis
Strengths
- Fluent and stable generation
- Incorporates previous memory
- Extracts core dialogue facts
- No repetition / collapse
Limitations
| Issue | Description |
|---|---|
| Missing latest intent | Recent user goals often ignored |
| Weak focus detection | Current task not emphasized |
| Document underuse | External context rarely used |
| Not memory-optimized | Behaves like summarizer |
Key Insight
This model is best described as:
"A strong summarization backbone, not a complete memory model."
System-Level Insight
This experiment validates that:
- Public summarization data → usable memory backbone
- Structured memory → better handled by code
- Full memory modeling → requires task-specific data
Future Improvements
1. Memory-Specific Data
Add synthetic samples with:
- current focus
- unresolved questions
- evolving goals
- decision tracking
2. Prompt Alignment
Train on the same structure as inference:
[PREVIOUS MEMORY]
[RECENT TURNS]
[ACTIVE DOCUMENT]
3. Objective Refinement
Train model to explicitly extract:
- intent
- decisions
- open problems
4. Hybrid Training
Combine:
- summarization data
- memory-structured data
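One possible shape for a memory-structured training sample, written as a prompt/completion pair. Everything here is hypothetical: the `MemoryAnnotation` fields mirror the bullets under item 1, the section tags follow item 2, and the target text is assembled from the annotations with a fixed template rather than taken from any real dataset.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MemoryAnnotation:
    """Hypothetical per-sample labels for memory-specific supervision."""
    current_focus: str
    decisions: list = field(default_factory=list)
    unresolved_questions: list = field(default_factory=list)


def make_sample(previous_memory: str, turns: list, annotation: MemoryAnnotation) -> dict:
    """Build one prompt/completion pair using the same section tags as inference."""
    prompt = "\n".join([
        "[PREVIOUS MEMORY]", previous_memory,
        "[RECENT TURNS]", *(f"{role}: {text}" for role, text in turns),
    ])
    # The target states focus, decisions, and open questions explicitly.
    parts = [f"Current focus: {annotation.current_focus}."]
    if annotation.decisions:
        parts.append("Decisions: " + "; ".join(annotation.decisions) + ".")
    if annotation.unresolved_questions:
        parts.append("Open questions: " + "; ".join(annotation.unresolved_questions) + ".")
    return {"prompt": prompt, "completion": " ".join(parts)}


sample = make_sample(
    previous_memory="The user is designing a multi-turn RAG system.",
    turns=[
        ("User", "Should the memory model run as a separate endpoint?"),
        ("Assistant", "It can run on the same machine but as a separate service."),
        ("User", "Then what public datasets can we use for training?"),
    ],
    annotation=MemoryAnnotation(
        current_focus="choosing public training datasets",
        decisions=["memory model runs as a separate service"],
        unresolved_questions=["which public datasets to use"],
    ),
)
print(json.dumps(sample, indent=2))
```

Mixing samples like this with the existing summarization data would be one way to realize the hybrid training in item 4.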
Recommended Usage
Best used as:
- Memory backbone model in multi-turn systems
Not recommended for:
- standalone summarization tasks
- factual QA
- production memory systems (without post-processing)
Example Usage
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yeseul0-0/qwen2.5-3b-memory-summary-v2"

# Load the tokenizer and the fine-tuned model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize a prompt, generate up to 128 new tokens, and decode the result
inputs = tokenizer("Summarize dialogue...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Test (base vs v2)
# ============================================================
# 1. Load the base model as well
#    Assumes the fine-tuned model is already in the `model` variable
# ============================================================
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import json

BASE_MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"

base_tokenizer = tokenizer  # same model family, so the existing tokenizer can be reused
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Align pad/eos ids across tokenizer, model config, and generation config
base_tokenizer.pad_token = base_tokenizer.eos_token
base_tokenizer.padding_side = "right"
base_model.config.pad_token_id = base_tokenizer.pad_token_id
base_model.config.eos_token_id = base_tokenizer.eos_token_id
if hasattr(base_model, "generation_config") and base_model.generation_config is not None:
    base_model.generation_config.pad_token_id = base_tokenizer.pad_token_id
    base_model.generation_config.eos_token_id = base_tokenizer.eos_token_id

print("base model loaded")
# ============================================================
# 2. Input for the comparison
# ============================================================
sample_previous_memory = {
    "narrative": "The user is designing a multi-turn RAG system.",
    "structured": {
        "goal": "Design a multi-turn architecture",
        "established_facts": ["RAG is used"],
        "current_focus": "memory model design",
        "unresolved_questions": ["whether to separate the memory model service"],
    },
}

sample_recent_turns = [
    {"role": "user", "text": "Should the memory model run as a separate endpoint?"},
    {"role": "assistant", "text": "It can run on the same machine but as a separate service."},
    {"role": "user", "text": "Then what public datasets can we use for training?"},
]

sample_active_doc = {
    "file_name": "multiturn_survey.pdf",
    "doc_summary": "A survey of multi-turn interactions with large language models."
}


def render_recent_turns(turns):
    # One "Role: text" line per turn
    return "\n".join([f'{t["role"].capitalize()}: {t["text"]}' for t in turns])


memory_prompt_system = (
    "You are a session-memory summarization model. "
    "Summarize only durable and useful information for future turns. "
    "Always include the user's current focus and latest intent when important. "
    "Do not invent facts."
)

memory_prompt_user = f"""
Create a concise memory summary for future dialogue turns.
[PREVIOUS MEMORY]
{json.dumps(sample_previous_memory, ensure_ascii=False, indent=2)}
[RECENT TURNS]
{render_recent_turns(sample_recent_turns)}
[ACTIVE DOCUMENT]
{json.dumps(sample_active_doc, ensure_ascii=False, indent=2)}
Return a concise summary in plain English.
"""

messages = [
    {"role": "system", "content": memory_prompt_system},
    {"role": "user", "content": memory_prompt_user},
]

# Render the messages with the tokenizer's chat template, then tokenize
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=3072
)
# ============================================================
# 3. Shared generate function
# ============================================================
def run_generate(model_obj, tokenizer_obj, inputs_dict, max_new_tokens=160):
    model_inputs = {k: v.to(model_obj.device) for k, v in inputs_dict.items()}

    # Synchronize so the timer measures the full GPU generation
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.no_grad():
        outputs = model_obj.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding; temperature is not used in this mode
            eos_token_id=tokenizer_obj.eos_token_id,
            pad_token_id=tokenizer_obj.pad_token_id,
        )

    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end = time.perf_counter()

    # Keep only the newly generated tokens (drop the prompt)
    gen_ids = outputs[0][model_inputs["input_ids"].shape[1]:]
    gen_text = tokenizer_obj.decode(gen_ids, skip_special_tokens=True).strip()

    return {
        "text": gen_text,
        "elapsed": end - start,
        "output_tokens": len(gen_ids),
        "tokens_per_sec": (len(gen_ids) / (end - start)) if (end - start) > 0 else None,
    }
# ============================================================
# 4. Compare fine-tuned vs base
#    model = the fine-tuned model you have already loaded
# ============================================================
ft_result = run_generate(model, tokenizer, inputs, max_new_tokens=160)
base_result = run_generate(base_model, base_tokenizer, inputs, max_new_tokens=160)

print("============== FINE-TUNED MODEL ==============")
print(ft_result["text"])
print(f"time: {ft_result['elapsed']:.4f}s | tokens: {ft_result['output_tokens']} | tok/s: {ft_result['tokens_per_sec']:.2f}")

print("\n============== BASE MODEL ==============")
print(base_result["text"])
print(f"time: {base_result['elapsed']:.4f}s | tokens: {base_result['output_tokens']} | tok/s: {base_result['tokens_per_sec']:.2f}")
Conclusion
This model demonstrates that:
- LLMs can learn memory summarization from general datasets
- However, true session memory modeling requires task-specific supervision
- Hybrid architectures (LLM + code) are effective
Author
김예슬