---
language: en
license: apache-2.0
pipeline_tag: summarization
base_model: nsi319/legal-pegasus
tags:
- summarization
- legal
- pegasus
- billsum
- long-document
- abstractive-summarization
- finetuned
- legal-nlp
- domain-adaptation
datasets:
- FiscalNote/billsum
metrics:
- bertscore
---

# 📘 Legal Pegasus – BillSum Fine-Tuned

**Fine-tuned version of NSI’s Legal Pegasus for abstractive summarization of legal and legislative documents.**

This model fine-tunes **nsi319/legal-pegasus**, a legally pretrained Pegasus model, on the **BillSum dataset** plus additional cleaned summaries to generate concise, context-aware, and structured legal summaries. It improves coherence, handling of domain terminology, and section-wise reasoning in long-form legal and policy text.

---

# 🧠 Base Model

This model builds on:

👉 **[nsi319/legal-pegasus](https://huggingface.co/nsi319/legal-pegasus)**

Pretrained on large-scale legal corpora including:

- Statutes
- Case law
- Legislative documents
- Regulatory material

This provides strong legal-domain grounding before fine-tuning.

---

# 📚 Fine-Tuning Dataset

- **BillSum** (US Congressional + California bills)
- Additional cleaned legal-style summaries
- Documents range from **2k to 12k+ tokens**

---

# ⚙️ Training Configuration

| Setting | Value |
|--------|--------|
| Base model | nsi319/legal-pegasus |
| Epochs | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Batch size | 1 |
| Gradient accumulation | 4 steps |
| Max input length | 1024 tokens |
| Max summary length | 256 tokens |
| FP16 | Yes |
| Warmup | 500 steps |
| Logging interval | 50 steps |

Training was performed on a **Kaggle P100 GPU (16 GB)**. A hedged reconstruction of this setup as `Seq2SeqTrainingArguments` appears after the usage example below.

---

# 🧪 Evaluation Metrics

## **ROUGE Scores (Test Set)**

| Metric | F1 |
|--------|------|
| ROUGE-1 | ~0.5554 |
| ROUGE-2 | ~0.3531 |
| ROUGE-L | ~0.4178 |

## **BERTScore (Semantic Similarity)**

| Metric | Score |
|--------|--------|
| Precision | 0.8841 |
| Recall | 0.8943 |
| F1 | 0.8864 |

**BERTScore** is emphasized because legal summarization requires semantic preservation rather than lexical overlap. A sketch for recomputing these scores appears at the end of this card.

---

# 🏗️ Long-Document Summarization Strategy

Pegasus accepts roughly 1024 input tokens, so long legal documents (3k–30k tokens) were handled using:

- Sentence/paragraph splitting
- Token-based chunking
- Sliding-window segmentation
- Chunk-wise summarization
- A second-pass “summary-of-summaries” rewrite (see the sketch after the usage example)

This enables effective summarization far beyond the backbone’s context limit.

---

# 📌 Intended Use

This model is intended for:

- Legal document summarization
- Bill/policy analysis
- Legislative NLP pipelines
- AI assistants for law students
- Preprocessing for downstream legal reasoning tasks

---

# ⚠️ Limitations

- English only
- Long documents require external chunking
- May oversimplify dense legal definitions
- Not suitable for legal citations or case-law cross-referencing
- Not intended for production-grade legal decisions

---

# 🔧 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """Your long legal or legislative text here…"""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
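
---

# 🧩 Long-Document Usage (Sketch)

The chunk-and-merge strategy described above is not built into `generate`; callers must implement it externally. Below is a minimal sketch of sliding-window chunking followed by a second-pass “summary-of-summaries” rewrite. The chunk size, stride, and generation settings are illustrative assumptions, not the exact values used when building this model.

```python
# Minimal chunk-and-merge sketch. chunk_tokens, stride, and the generation
# settings are illustrative assumptions, not the author's exact pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def summarize(text: str) -> str:
    """Summarize a single chunk that fits in the 1024-token window."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    ids = model.generate(
        inputs["input_ids"], num_beams=5, max_length=256, early_stopping=True
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def summarize_long(text: str, chunk_tokens: int = 900, stride: int = 100) -> str:
    """Sliding-window chunking plus a second-pass summary-of-summaries."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_summaries = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start : start + chunk_tokens]
        chunk_summaries.append(summarize(tokenizer.decode(window)))
        start += chunk_tokens - stride  # overlap preserves cross-boundary context
    if len(chunk_summaries) == 1:
        return chunk_summaries[0]  # document fit in one window
    # Second pass: rewrite the concatenated chunk summaries into one summary.
    return summarize(" ".join(chunk_summaries))
```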
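
---

# 🛠️ Training Setup (Sketch)

The configuration table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a reconstruction from the table, not the original training script; `output_dir` is a placeholder, and data loading and collation are omitted.

```python
# Hedged reconstruction of the training configuration table; not the original
# training script. output_dir is a placeholder.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-pegasus-billsum",  # placeholder
    num_train_epochs=8,
    learning_rate=2e-5,
    weight_decay=0.01,                   # AdamW is the Trainer default optimizer
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,       # effective batch size of 4
    warmup_steps=500,
    logging_steps=50,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=256,           # matches the 256-token summary cap
)
```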
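
---

# 📏 Reproducing the Metrics (Sketch)

The reported ROUGE and BERTScore numbers can be recomputed with the `evaluate` library along these lines; `predictions` and `references` are placeholders for model outputs and gold summaries from the BillSum test split.

```python
# Sketch of metric computation; predictions/references are placeholders.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["model-generated summary ..."]  # placeholder
references = ["gold test-set summary ..."]     # placeholder

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```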