---
language: en
license: apache-2.0
pipeline_tag: summarization
base_model: nsi319/legal-pegasus
tags:
- summarization
- legal
- pegasus
- billsum
- long-document
- abstractive-summarization
- finetuned
- legal-nlp
- domain-adaptation
datasets:
- FiscalNote/billsum
metrics:
- bertscore
---

# 📘 Legal Pegasus – BillSum Fine-Tuned

**Fine-tuned version of NSI’s Legal Pegasus for abstractive summarization of legal and legislative documents.**

This model fine-tunes **nsi319/legal-pegasus**, a legally pretrained Pegasus model, on the **BillSum dataset** plus additional cleaned summaries to generate concise, context-aware, and structured legal summaries. It improves coherence, handling of domain terminology, and section-wise reasoning in long-form legal and policy text.

---

# 🧠 Base Model

This model builds on:

👉 **[nsi319/legal-pegasus](https://huggingface.co/nsi319/legal-pegasus)**

Pretrained on large-scale legal corpora including:

- Statutes
- Case law
- Legislative documents
- Regulatory material

This provides strong legal-domain grounding before fine-tuning.

---

# 📚 Fine-Tuning Dataset

- **BillSum** (US Congressional + California bills)
- Additional cleaned legal-style summaries
- Documents range from **2k to 12k+ tokens**

---

# ⚙️ Training Configuration

| Setting | Value |
|--------|--------|
| Base model | nsi319/legal-pegasus |
| Epochs | 8 |
| Learning rate | 2e-5 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Batch size | 1 |
| Gradient accumulation | 4 steps |
| Max input length | 1024 tokens |
| Max summary length | 256 tokens |
| FP16 | Yes |
| Warmup | 500 steps |
| Logging interval | 50 steps |

Training was performed on a **Kaggle P100 GPU (16 GB)**. A hedged reconstruction of this setup as `Seq2SeqTrainingArguments` appears after the usage example below.

---

# 🧪 Evaluation Metrics

## **ROUGE Scores (Test Set)**

| Metric | F1 |
|--------|------|
| ROUGE-1 | ~0.5554 |
| ROUGE-2 | ~0.3531 |
| ROUGE-L | ~0.4178 |

## **BERTScore (Semantic Similarity)**

| Metric | Score |
|--------|--------|
| Precision | 0.8841 |
| Recall | 0.8943 |
| F1 | 0.8864 |

**BERTScore** is emphasized because legal summarization requires semantic preservation rather than lexical overlap. A sketch for recomputing these scores appears at the end of this card.

---

# 🏗️ Long-Document Summarization Strategy

Pegasus accepts roughly 1024 input tokens, so long legal documents (3k–30k tokens) were handled using:

- Sentence/paragraph splitting
- Token-based chunking
- Sliding-window segmentation
- Chunk-wise summarization
- A second-pass “summary-of-summaries” rewrite (see the sketch after the usage example)

This enables effective summarization far beyond the backbone’s context limit.

---

# 📌 Intended Use

This model is intended for:

- Legal document summarization
- Bill/policy analysis
- Legislative NLP pipelines
- AI assistants for law students
- Preprocessing for downstream legal reasoning tasks

---

# ⚠️ Limitations

- English only
- Long documents require external chunking
- May oversimplify dense legal definitions
- Not suitable for legal citations or case-law cross-referencing
- Not intended for production-grade legal decisions

---

# 🔧 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = """Your long legal or legislative text here…"""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=5,
    max_length=256,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
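
---

# 🧩 Long-Document Usage (Sketch)

The chunk-and-merge strategy described above is not built into `generate`; callers must implement it externally. Below is a minimal sketch of sliding-window chunking followed by a second-pass “summary-of-summaries” rewrite. The chunk size, stride, and generation settings are illustrative assumptions, not the exact values used when building this model.

```python
# Minimal chunk-and-merge sketch. chunk_tokens, stride, and the generation
# settings are illustrative assumptions, not the author's exact pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Anurag33Gaikwad/legal-pegasus-billsum-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def summarize(text: str) -> str:
    """Summarize a single chunk that fits in the 1024-token window."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    ids = model.generate(
        inputs["input_ids"], num_beams=5, max_length=256, early_stopping=True
    )
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def summarize_long(text: str, chunk_tokens: int = 900, stride: int = 100) -> str:
    """Sliding-window chunking plus a second-pass summary-of-summaries."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_summaries = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start : start + chunk_tokens]
        chunk_summaries.append(summarize(tokenizer.decode(window)))
        start += chunk_tokens - stride  # overlap preserves cross-boundary context
    if len(chunk_summaries) == 1:
        return chunk_summaries[0]  # document fit in one window
    # Second pass: rewrite the concatenated chunk summaries into one summary.
    return summarize(" ".join(chunk_summaries))
```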
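
---

# 🛠️ Training Setup (Sketch)

The configuration table above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is a reconstruction from the table, not the original training script; `output_dir` is a placeholder, and data loading and collation are omitted.

```python
# Hedged reconstruction of the training configuration table; not the original
# training script. output_dir is a placeholder.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="legal-pegasus-billsum",  # placeholder
    num_train_epochs=8,
    learning_rate=2e-5,
    weight_decay=0.01,                   # AdamW is the Trainer default optimizer
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,       # effective batch size of 4
    warmup_steps=500,
    logging_steps=50,
    fp16=True,
    predict_with_generate=True,
    generation_max_length=256,           # matches the 256-token summary cap
)
```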
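
---

# 📏 Reproducing the Metrics (Sketch)

The reported ROUGE and BERTScore numbers can be recomputed with the `evaluate` library along these lines; `predictions` and `references` are placeholders for model outputs and gold summaries from the BillSum test split.

```python
# Sketch of metric computation; predictions/references are placeholders.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["model-generated summary ..."]  # placeholder
references = ["gold test-set summary ..."]     # placeholder

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```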