NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes
Model Description
NSP-CouncilSeg is a fine-tuned BERT model specialized in linear text segmentation of municipal council meeting minutes. The model uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form documents, making it particularly effective for segmenting administrative and governmental meeting minutes.
Try out the model: Hugging Face Space Demo
Key Features
- 🎯 Specialized for Meeting Minutes: Fine-tuned on Portuguese municipal council meeting minutes
- ⚡ Fast Inference: Efficient BERT-base architecture for real-time segmentation
- 📊 High Accuracy: Achieves a BED F-measure of 0.79 on the CouncilSeg dataset
- 🔄 Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
Model Details
- Base Model: neuralmind/bert-base-portuguese-cased
- Architecture: BERT with Next Sentence Prediction head
- Parameters: 110M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: CouncilSeg (Portuguese Municipal Meeting Minutes)
- Fine-tuning Method: Focal loss with boundary-aware weighting (see the sketch after this list)
- Training Framework: PyTorch + Transformers
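The training loss is only summarized above. As an illustration, here is a minimal PyTorch sketch of focal loss combined with a simple class-weighting term; the alpha and gamma values are placeholders, and reading "boundary-aware weighting" as up-weighting the rarer boundary class (label 1, "not_next") is an assumption, not the documented recipe.
import torch
import torch.nn.functional as F
def focal_loss(logits, labels, alpha=0.75, gamma=2.0):
    # Focal loss over the two NSP classes (0 = is_next, 1 = not_next).
    # alpha/gamma are illustrative; "boundary-aware weighting" is assumed here
    # to mean giving the rarer boundary class (label 1) a larger weight.
    ce = F.cross_entropy(logits, labels, reduction="none")  # per-pair cross-entropy
    p_t = torch.exp(-ce)                                     # probability of the true class
    alpha_t = alpha * labels.float() + (1.0 - alpha) * (1.0 - labels.float())
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()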
How It Works
The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or represent a topic transition (label 1: "not_next"). By applying this classifier sequentially across all sentence pairs in a document, it identifies topic boundaries.
Sentence A: "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
Sentence B: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
→ Prediction: Same Topic (confidence: 76%)
Sentence A: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
Sentence B: "Não houve processos e requerimentos diversos a apresentar."
→ Prediction: Topic Boundary (confidence: 82%)
Usage
Quick Start with Transformers
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")
# Prepare input
sentence_a = "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
sentence_b = "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
# Tokenize
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
probs = torch.softmax(logits, dim=1)
# Interpret results
is_next_prob = probs[0][0].item()
not_next_prob = probs[0][1].item()
print(f"Is Next (same topic): {is_next_prob:.3f}")
print(f"Not Next (topic boundary): {not_next_prob:.3f}")
if not_next_prob > 0.5:
    print("🔴 Topic boundary detected!")
else:
    print("🟢 Same topic continues")
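Segmenting a Full Document
The Quick Start above scores a single sentence pair. To segment an entire document, as described under How It Works, the same check can be run over every pair of consecutive sentences, opening a new segment whenever the "not_next" probability exceeds a threshold. The helper below is a minimal sketch that reuses the model and tokenizer loaded above; the segment_document name and the 0.5 threshold are illustrative, not part of the released code.
def segment_document(sentences, threshold=0.5):
    # Group a list of sentences into topic segments by running NSP on each
    # consecutive pair; a boundary is placed where P(not_next) > threshold.
    segments, current = [], [sentences[0]]
    for sent_a, sent_b in zip(sentences, sentences[1:]):
        pair = tokenizer(sent_a, sent_b, return_tensors="pt",
                         truncation=True, max_length=512)
        with torch.no_grad():
            pair_probs = torch.softmax(model(**pair).logits, dim=1)
        if pair_probs[0][1].item() > threshold:  # label 1 = not_next = boundary
            segments.append(current)
            current = []
        current.append(sent_b)
    segments.append(current)
    return segments
# Example: based on the predictions shown in "How It Works",
# these three sentences should fall into 2 segments.
topics = segment_document([
    sentence_a,
    sentence_b,
    "Não houve processos e requerimentos diversos a apresentar.",
])
print(f"{len(topics)} segments found")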
Limitations
- Domain Specificity: Best performance on administrative/governmental meeting minutes
- Language: Optimized for Portuguese; English performance may vary
- Document Length: Designed for documents with 10-50 segments
- Context Window: Limited to 512 tokens per sentence pair
- Ambiguous Boundaries: May struggle with subtle topic transitions
Model Card Contact
For questions or feedback, please open an issue in the model repository.
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.