qwen2.5-3b-memory-summary-v2

📌 Overview

This model is a LoRA fine-tuned version of Qwen2.5-3B-Instruct, designed as a session memory summarization backbone for multi-turn conversational systems such as RAG-based assistants.

Unlike standard summarization models, this model focuses on:

👉 compressing dialogue into durable, reusable memory for future turns


🎯 Design Philosophy

This project adopts a hybrid memory architecture:


LLM → narrative summary
Code → structured memory update

Why this design?

| Component | Role |
|---|---|
| LLM | Natural language compression |
| Code | Deterministic state update |

👉 This separation improves:

  • reliability
  • controllability
  • scalability
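
A minimal sketch of this split, assuming the structured memory uses the keys that appear in the test prompt later in this card (`goal`, `established_facts`, `current_focus`, `unresolved_questions`). The function `update_structured_memory` and its arguments are hypothetical and not part of the released code; the narrative string stands in for whatever the model generates.

```python
from copy import deepcopy


def update_structured_memory(memory, narrative, new_facts=(), resolved=(), focus=None):
    """Return a new memory dict; the input is left untouched.

    The narrative comes from the LLM. Everything under "structured"
    is updated by plain, deterministic code.
    """
    updated = deepcopy(memory)
    updated["narrative"] = narrative
    structured = updated["structured"]

    # Append facts without duplicating ones already stored.
    for fact in new_facts:
        if fact not in structured["established_facts"]:
            structured["established_facts"].append(fact)

    # Drop questions that the recent turns have answered.
    structured["unresolved_questions"] = [
        q for q in structured["unresolved_questions"] if q not in resolved
    ]

    if focus is not None:
        structured["current_focus"] = focus
    return updated


memory = {
    "narrative": "The user is designing a multi-turn RAG system.",
    "structured": {
        "goal": "Design a multi-turn architecture",
        "established_facts": ["RAG is used"],
        "current_focus": "memory model design",
        "unresolved_questions": ["whether to separate the memory model service"],
    },
}

new_memory = update_structured_memory(
    memory,
    narrative="The user is designing a multi-turn RAG system with a separate memory service.",
    new_facts=["Memory model runs as a separate service"],
    resolved=["whether to separate the memory model service"],
    focus="training data selection",
)
print(new_memory["structured"])
```

Because the update is ordinary code, the same inputs always produce the same structured state, which is the reliability and controllability argument made above.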

🧠 Task Definition

Input

  • Previous memory (JSON)
  • Recent dialogue turns
  • Optional document context

Output

👉 A concise memory summary for future interactions


⚙️ Training Setup

Base Model

  • Qwen/Qwen2.5-3B-Instruct

Fine-tuning Method

  • LoRA (Parameter-efficient fine-tuning)
  • Supervised Fine-Tuning (SFT)

🔧 Hyperparameters

| Parameter | Value |
|---|---|
| Train Batch Size | 6 |
| Gradient Accumulation | 2 |
| Effective Batch | 12 |
| Epochs | 2 |
| Learning Rate | 1e-4 |
| Max Sequence Length | 3072 |
| Precision | bf16 |
| Gradient Checkpointing | Enabled |
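
The effective batch follows from the other rows: batch size times gradient accumulation steps times the number of devices. The sketch below checks that arithmetic. `TrainSettings` is a hypothetical container written for this illustration, not a class from any training library, and the single-device setting is an assumption inferred from 6 × 2 = 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSettings:
    # Values copied from the hyperparameter table.
    train_batch_size: int = 6
    gradient_accumulation: int = 2
    epochs: int = 2
    learning_rate: float = 1e-4
    max_seq_length: int = 3072
    precision: str = "bf16"
    gradient_checkpointing: bool = True
    num_devices: int = 1  # assumption: the card does not state the device count

    @property
    def effective_batch(self) -> int:
        """Samples contributing to each optimizer step."""
        return self.train_batch_size * self.gradient_accumulation * self.num_devices


settings = TrainSettings()
print(settings.effective_batch)  # 12
```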

🧩 Key Training Strategies

1. Completion-only Supervision


completion_only_loss=True
  • Only supervises assistant responses
  • Prevents prompt/template memorization
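
The idea behind completion-only supervision can be sketched without any training library: prompt positions receive the label -100, which cross-entropy loss in PyTorch ignores by default, so only completion tokens contribute to the loss. The helper `mask_prompt_labels` and the token ids are made up for illustration; this is not how the trainer implements the flag internally.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels and hide the first prompt_length positions."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Made-up token ids: 4 prompt tokens followed by 3 completion tokens.
input_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(input_ids, prompt_length=4)
print(labels)  # [-100, -100, -100, -100, 42, 43, 2]
```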

2. No Sample Packing


packing=False
  • Preserves sample boundaries
  • Critical for instruction-following tasks

3. Long Context Handling

  • Up to 3072 tokens
  • Supports multi-turn memory + document context
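
When previous memory, turns, and document context together exceed the limit, something has to be dropped. The card does not say how over-length inputs were handled, so the following is only one possible policy, sketched under assumptions: keep the newest turns that fit a budget. `fit_turns_to_budget` is hypothetical, and whitespace splitting stands in for a real tokenizer count.

```python
def fit_turns_to_budget(turns, budget, count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so the latest turns survive.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = [
    "User: Should memory be separate?",
    "Assistant: It can be separate",
    "User: What data can we use?",
]
print(fit_turns_to_budget(turns, budget=12))
```

With a budget of 12 whitespace tokens the oldest turn is dropped and the two most recent turns are kept. In practice `count_tokens` would be replaced by a call to the model's tokenizer.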

4. Critical Fix: Special Token Alignment

tokenizer.pad_token = tokenizer.eos_token

This fix resolved:

  • repetition issues
  • generation collapse
  • EOS mismatch

📚 Training Data

Sources

| Type | Dataset |
|---|---|
| Dialogue summarization | DialogSum |
| Chat summarization | SAMSum |
| Query-based summarization | QMSum |
| Memory-style data | Synthetic (limited) |

⚠️ Data Limitation

The training data is not optimized for memory extraction.

👉 It is primarily:

  • general summarization data
  • not structured memory extraction data

📉 Training Results

Validation Loss

| Step | Loss |
|---|---|
| 50 | 1.3417 |
| 250 | 1.2806 |
| 500 | 1.2545 |
| 750 | 1.2359 (best) |
| 1000+ | ~1.25 |

📊 Interpretation

  • Strong early convergence
  • Best performance around step 700–800
  • Later training shows plateau
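
The reading above can be reproduced from the logged values. This short sketch picks the step with the lowest validation loss and reports the drop from the first logged step. The dictionary is typed in by hand from the validation loss figures in this section, and the approximate "1000+" entry is left out.

```python
# Validation loss by step, copied from the figures reported above.
# The "1000+ ~1.25" entry is approximate and therefore omitted.
val_loss = {50: 1.3417, 250: 1.2806, 500: 1.2545, 750: 1.2359}

best_step = min(val_loss, key=val_loss.get)
improvement = val_loss[50] - val_loss[best_step]

print(best_step)              # 750
print(round(improvement, 4))  # 0.1058
```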

🧪 Inference Example

Input

[PREVIOUS MEMORY]
User is designing a multi-turn RAG system

[RECENT TURNS]
User: Should memory be separate?
Assistant: It can be separate
User: What data can we use?

[ACTIVE DOCUMENT]
Survey of multi-turn LLM systems

Output

The user is designing a multi-turn RAG system. The memory model can run on the same machine but as a separate service.

📊 Performance Analysis

✅ Strengths

  • Fluent and stable generation
  • Incorporates previous memory
  • Extracts core dialogue facts
  • No repetition / collapse

⚠️ Limitations

| Issue | Description |
|---|---|
| Missing latest intent | Recent user goals often ignored |
| Weak focus detection | Current task not emphasized |
| Document underuse | External context rarely used |
| Not memory-optimized | Behaves like summarizer |
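
The "missing latest intent" issue can be flagged with a crude lexical check, shown here on the inference example from this card. This is only a heuristic sketch: `content_words`, `mentions_latest_intent`, the four-letter cutoff, and the overlap threshold are all assumptions made for illustration, not an evaluation method used by the author.

```python
import re


def content_words(text):
    """Lowercased words of four or more letters, a crude stand-in for keywords."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def mentions_latest_intent(summary, latest_user_turn, min_overlap=2):
    """True if the summary shares at least min_overlap keywords with the turn."""
    overlap = content_words(summary) & content_words(latest_user_turn)
    return len(overlap) >= min_overlap


summary = (
    "The user is designing a multi-turn RAG system. The memory model can run "
    "on the same machine but as a separate service."
)
latest = "Then what public datasets can we use for training?"
print(mentions_latest_intent(summary, latest))  # False
```

The example summary shares no keyword with the user's last question about datasets, so the check returns False, which matches the limitation described here.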

🧠 Key Insight

👉 This model is best described as:

“A strong summarization backbone, not a complete memory model.”


🏗 System-Level Insight

This experiment validates that:

✔ Public summarization data → usable memory backbone
✔ Structured memory → better handled by code
✔ Full memory modeling → requires task-specific data


🚀 Future Improvements

1. Memory-Specific Data

Add synthetic samples with:

  • current focus
  • unresolved questions
  • evolving goals
  • decision tracking
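
One way such a synthetic target could be laid out is sketched below, with one field per item in the list. The `MemoryTarget` class, its field names, and the example values are hypothetical; the card does not define a schema for this data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MemoryTarget:
    # One field per item in the list above; the names are illustrative only.
    current_focus: str
    unresolved_questions: list = field(default_factory=list)
    evolving_goals: list = field(default_factory=list)
    decisions: list = field(default_factory=list)


target = MemoryTarget(
    current_focus="choosing public training datasets",
    unresolved_questions=["which public datasets fit memory summarization"],
    evolving_goals=["design a multi-turn RAG system"],
    decisions=["run the memory model as a separate service"],
)

# Serialize to one JSON line, a common layout for SFT training records.
record = json.dumps(asdict(target), ensure_ascii=False)
print(record)
```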

2. Prompt Alignment

Train on the same structure as inference:

[PREVIOUS MEMORY]
[RECENT TURNS]
[ACTIVE DOCUMENT]
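
A single rendering function shared by the training pipeline and the inference code is one way to keep the two formats identical. The sketch below assumes plain-string inputs; `build_memory_prompt` and `SECTION_ORDER` are hypothetical names, and only the three section headers come from this card.

```python
SECTION_ORDER = ("PREVIOUS MEMORY", "RECENT TURNS", "ACTIVE DOCUMENT")


def build_memory_prompt(previous_memory, recent_turns, active_document=""):
    """Render the three sections in a fixed order for both training and inference."""
    bodies = {
        "PREVIOUS MEMORY": previous_memory,
        "RECENT TURNS": "\n".join(recent_turns),
        "ACTIVE DOCUMENT": active_document,
    }
    # An empty section keeps its header so the layout never changes.
    blocks = [f"[{name}]\n{bodies[name]}".rstrip() for name in SECTION_ORDER]
    return "\n\n".join(blocks)


prompt = build_memory_prompt(
    previous_memory="User is designing a multi-turn RAG system",
    recent_turns=[
        "User: Should memory be separate?",
        "Assistant: It can be separate",
        "User: What data can we use?",
    ],
    active_document="Survey of multi-turn LLM systems",
)
print(prompt)
```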

3. Objective Refinement

Train model to explicitly extract:

  • intent
  • decisions
  • open problems

4. Hybrid Training

Combine:

  • summarization data
  • memory-structured data

💡 Recommended Usage

Best used as:

👉 Memory backbone model in multi-turn systems

Not recommended for:

  • standalone summarization tasks
  • factual QA
  • production memory systems (without post-processing)

🧪 Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "yeseul0-0/qwen2.5-3b-memory-summary-v2"

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder prompt; replace it with the real dialogue to summarize.
inputs = tokenizer("Summarize dialogue...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# The decoded text contains the prompt followed by the generated summary.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Test (base vs v2)

# ============================================================
# 1. Load the base model as well
#    Assumes the fine-tuned model is already in the `model` variable
# ============================================================
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import json

BASE_MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"

base_tokenizer = tokenizer  # same model family, so the existing tokenizer can be reused

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

base_tokenizer.pad_token = base_tokenizer.eos_token
base_tokenizer.padding_side = "right"

base_model.config.pad_token_id = base_tokenizer.pad_token_id
base_model.config.eos_token_id = base_tokenizer.eos_token_id

if hasattr(base_model, "generation_config") and base_model.generation_config is not None:
    base_model.generation_config.pad_token_id = base_tokenizer.pad_token_id
    base_model.generation_config.eos_token_id = base_tokenizer.eos_token_id

print("base model loaded")
# ============================================================
# 2. Input for the comparison
# ============================================================
sample_previous_memory = {
    "narrative": "The user is designing a multi-turn RAG system.",
    "structured": {
        "goal": "Design a multi-turn architecture",
        "established_facts": ["RAG is used"],
        "current_focus": "memory model design",
        "unresolved_questions": ["whether to separate the memory model service"],
    },
}

sample_recent_turns = [
    {"role": "user", "text": "Should the memory model run as a separate endpoint?"},
    {"role": "assistant", "text": "It can run on the same machine but as a separate service."},
    {"role": "user", "text": "Then what public datasets can we use for training?"},
]

sample_active_doc = {
    "file_name": "multiturn_survey.pdf",
    "doc_summary": "A survey of multi-turn interactions with large language models."
}

def render_recent_turns(turns):
    return "\n".join([f'{t["role"].capitalize()}: {t["text"]}' for t in turns])

memory_prompt_system = (
    "You are a session-memory summarization model. "
    "Summarize only durable and useful information for future turns. "
    "Always include the user's current focus and latest intent when important. "
    "Do not invent facts."
)

memory_prompt_user = f"""
Create a concise memory summary for future dialogue turns.

[PREVIOUS MEMORY]
{json.dumps(sample_previous_memory, ensure_ascii=False, indent=2)}

[RECENT TURNS]
{render_recent_turns(sample_recent_turns)}

[ACTIVE DOCUMENT]
{json.dumps(sample_active_doc, ensure_ascii=False, indent=2)}

Return a concise summary in plain English.
"""

messages = [
    {"role": "system", "content": memory_prompt_system},
    {"role": "user", "content": memory_prompt_user},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=3072
)
# ============================================================
# 3. Shared generate function
# ============================================================
def run_generate(model_obj, tokenizer_obj, inputs_dict, max_new_tokens=160):
    model_inputs = {k: v.to(model_obj.device) for k, v in inputs_dict.items()}

    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.no_grad():
        outputs = model_obj.generate(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, so no temperature is passed
            eos_token_id=tokenizer_obj.eos_token_id,
            pad_token_id=tokenizer_obj.pad_token_id,
        )

    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end = time.perf_counter()

    gen_ids = outputs[0][model_inputs["input_ids"].shape[1]:]
    gen_text = tokenizer_obj.decode(gen_ids, skip_special_tokens=True).strip()

    return {
        "text": gen_text,
        "elapsed": end - start,
        "output_tokens": len(gen_ids),
        "tokens_per_sec": (len(gen_ids) / (end - start)) if (end - start) > 0 else None,
    }
# ============================================================
# 4. Fine-tuned vs base comparison
#    model = the fine-tuned model you already loaded
# ============================================================
ft_result = run_generate(model, tokenizer, inputs, max_new_tokens=160)
base_result = run_generate(base_model, base_tokenizer, inputs, max_new_tokens=160)

print("============== FINE-TUNED MODEL ==============")
print(ft_result["text"])
print(f"time: {ft_result['elapsed']:.4f}s | tokens: {ft_result['output_tokens']} | tok/s: {ft_result['tokens_per_sec']:.2f}")

print("\n============== BASE MODEL ==============")
print(base_result["text"])
print(f"time: {base_result['elapsed']:.4f}s | tokens: {base_result['output_tokens']} | tok/s: {base_result['tokens_per_sec']:.2f}")



📌 Conclusion

This model demonstrates that:

  • LLMs can learn memory summarization from general datasets
  • However, true session memory modeling requires task-specific supervision
  • Hybrid architectures (LLM + code) are effective

👩‍💻 Author

김예슬
