liaad/Citilink_NSP_Segmentation

+---
+language:
+- pt
+license: cc-by-nc-nd-4.0
+tags:
+- text-segmentation
+- topic-segmentation
+- bert
+- next-sentence-prediction
+- document-segmentation
+- meeting-minutes
+library_name: transformers
+base_model:
+- neuralmind/bert-base-portuguese-cased
+---
+# NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes
+## Model Description
+**NSP-CouncilSeg** is a fine-tuned BERT model for text segmentation of municipal council meeting minutes. It uses Next Sentence Prediction (NSP) to identify topic boundaries in long documents, which makes it well suited to administrative and governmental meeting minutes.
+**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous15135/nsp-councilseg-demo)
+### Key Features
+- 🎯 **Specialized for Meeting Minutes**: Fine-tuned on Portuguese municipal council meeting minutes
+- ⚡ **Fast Inference**: Efficient BERT-base architecture for real-time segmentation
+- 📊 **High Accuracy**: Achieves a BED F-measure of 0.79 on the CouncilSeg dataset
+- 🔄 **Sentence-Level Segmentation**: Identifies topic boundaries at sentence granularity
+## Model Details
+- **Base Model**: `neuralmind/bert-base-portuguese-cased`
+- **Architecture**: BERT with Next Sentence Prediction head
+- **Parameters**: 110M
+- **Max Sequence Length**: 512 tokens
+- **Fine-tuning Dataset**: CouncilSeg (Portuguese Municipal Meeting Minutes)
+- **Fine-tuning Method**: Focal Loss with boundary-aware weighting
+- **Training Framework**: PyTorch + Transformers
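
The card names focal loss as the fine-tuning objective but does not give its parameters. As a rough illustration only, the sketch below computes the standard binary focal loss, `-alpha_t * (1 - p_t) ** gamma * log(p_t)`, for a single sentence pair. The function name, the values `gamma=2.0` and `boundary_weight=0.75`, and the reading of "boundary-aware weighting" as a larger weight on the boundary class are all assumptions made for this example, not the settings used to train the model.

```python
import math


def focal_loss(p_boundary, is_boundary, gamma=2.0, boundary_weight=0.75):
    """Standard binary focal loss for one sentence pair.

    p_boundary: predicted probability of the "not_next" (boundary) class.
    is_boundary: True if the pair really is a topic boundary.
    gamma, boundary_weight: hypothetical values, not taken from the model card.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p_boundary if is_boundary else 1.0 - p_boundary
    # alpha_t weights the boundary class more heavily (an assumption here).
    alpha_t = boundary_weight if is_boundary else 1.0 - boundary_weight
    # Clamp to avoid log(0) for fully wrong predictions.
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t) ** gamma shrinks the loss of already well-classified pairs.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, True)  # confident and correct
hard = focal_loss(0.30, True)  # boundary mostly missed
print(f"easy pair: {easy:.5f}, hard pair: {hard:.5f}")
```

With `gamma=0` and equal class weights the expression reduces to ordinary cross-entropy scaled by a constant, which is one way to see what the focusing term adds.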
+## How It Works
+The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or represent a topic transition (label 1: "not_next"). By applying this classifier sequentially across all sentence pairs in a document, it identifies topic boundaries.
+```text
+Sentence A: "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
+Sentence B: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+→ Prediction: Same Topic (confidence: 76%)
+Sentence A: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+Sentence B: "Não houve processos e requerimentos diversos a apresentar."
+→ Prediction: Topic Boundary (confidence: 82%)
+```
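
To make the sequential step concrete, here is a minimal sketch of how per-pair boundary probabilities could be turned into segments. The helper name `split_into_segments` and the 0.5 threshold are assumptions for illustration; the card does not describe the exact post-processing. The probabilities are derived from the two illustrative predictions above, reading "Same Topic (confidence: 76%)" as a boundary probability of 0.24.

```python
def split_into_segments(sentences, boundary_probs, threshold=0.5):
    """Group sentences into topic segments.

    boundary_probs[i] is the "not_next" probability for the pair
    (sentences[i], sentences[i + 1]), so it has len(sentences) - 1 entries.
    """
    if not sentences:
        return []
    if len(boundary_probs) != len(sentences) - 1:
        raise ValueError("expected one probability per consecutive sentence pair")

    segments = [[sentences[0]]]
    for sentence, prob in zip(sentences[1:], boundary_probs):
        if prob > threshold:
            segments.append([sentence])    # boundary: start a new segment
        else:
            segments[-1].append(sentence)  # same topic: extend the current one
    return segments


# Placeholder labels stand in for the three example sentences.
sentences = ["S1", "S2", "S3"]
boundary_probs = [0.24, 0.82]
print(split_into_segments(sentences, boundary_probs))
# → [['S1', 'S2'], ['S3']]
```

Raising the threshold produces fewer, longer segments; lowering it produces more, shorter ones.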
+## Usage
+### Quick Start with Transformers
+```python
+from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
+model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")
+# Prepare input
+sentence_a = "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
+sentence_b = "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+# Tokenize the pair, truncating to the model's 512-token limit
+inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True, max_length=512)
+# Predict
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits
+    probs = torch.softmax(logits, dim=1)
+# Interpret results
+is_next_prob = probs[0][0].item()
+not_next_prob = probs[0][1].item()
+print(f"Is Next (same topic): {is_next_prob:.3f}")
+print(f"Not Next (topic boundary): {not_next_prob:.3f}")
+if not_next_prob > 0.5:
+    print("🔴 Topic boundary detected!")
+else:
+    print("🟢 Same topic continues")
+```
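
The Quick Start scores a single pair. For a whole document, one possible approach is to score every consecutive pair in batches, as sketched below. The helper names `consecutive_pairs` and `boundary_probabilities` and the `batch_size` default are hypothetical; the model identifier is copied from the Quick Start. Sentence splitting is not covered and is assumed to have been done beforehand.

```python
def consecutive_pairs(sentences):
    """Return [(s0, s1), (s1, s2), ...] for a list of sentences."""
    return list(zip(sentences[:-1], sentences[1:]))


def boundary_probabilities(sentences, model_id="anonymous15135/nsp-councilseg",
                           batch_size=16):
    """Return the "not_next" probability for every consecutive sentence pair."""
    # Imported lazily so the pairing helper works without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, AutoModelForNextSentencePrediction

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForNextSentencePrediction.from_pretrained(model_id)
    model.eval()

    pairs = consecutive_pairs(sentences)
    probs = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        first = [a for a, _ in batch]
        second = [b for _, b in batch]
        # Pad to the longest pair in the batch; truncate to the 512-token limit.
        inputs = tokenizer(first, second, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Column 1 is the "not_next" (topic boundary) class.
        probs.extend(torch.softmax(logits, dim=1)[:, 1].tolist())
    return probs
```

Calling `boundary_probabilities(doc_sentences)` on a list of sentence strings returns one probability per adjacent pair, that is, `len(doc_sentences) - 1` values, which can then be thresholded at 0.5 as in the Quick Start.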
+## Limitations
+- **Domain Specificity**: Best performance on administrative/governmental meeting minutes
+- **Language**: Optimized for Portuguese; performance on other languages, such as English, may vary
+- **Document Length**: Designed for documents with 10-50 segments
+- **Context Window**: Limited to 512 tokens per sentence pair
+- **Ambiguous Boundaries**: May struggle with subtle topic transitions
+## Model Card Contact
+For questions or feedback, please open a discussion in the [model repository](https://huggingface.co/anonymous15135/nsp-councilseg/discussions).
+## License
+This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.