miguelalmqs committed
Commit 8fc196c · verified · 1 Parent(s): 7829538

Create README.md

  1. README.md +110 -0
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ language:
+ - pt
+ license: cc-by-nc-nd-4.0
+ tags:
+ - text-segmentation
+ - topic-segmentation
+ - bert
+ - next-sentence-prediction
+ - document-segmentation
+ - meeting-minutes
+ library_name: transformers
+ base_model:
+ - neuralmind/bert-base-portuguese-cased
+ ---
+
+ # NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes
+
+ ## Model Description
+
+ **NSP-CouncilSeg** is a BERT model fine-tuned for linear text segmentation of municipal council meeting minutes. It uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form documents, which makes it well suited to segmenting administrative and governmental meeting minutes.
+
+ **Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous15135/nsp-councilseg-demo)
+
+ ### Key Features
+
+ - 🎯 **Specialized for Meeting Minutes**: Fine-tuned on Portuguese municipal council meeting minutes
+ - ⚡ **Fast Inference**: BERT-base architecture, efficient enough for real-time segmentation
+ - 📊 **High Accuracy**: Achieves a BED F-measure of 0.79 on the CouncilSeg dataset
+ - 🔄 **Sentence-Level Segmentation**: Identifies topic boundaries at sentence granularity
+
+ ## Model Details
+
+ - **Base Model**: `neuralmind/bert-base-portuguese-cased`
+ - **Architecture**: BERT with Next Sentence Prediction head
+ - **Parameters**: 110M
+ - **Max Sequence Length**: 512 tokens
+ - **Fine-tuning Dataset**: CouncilSeg (Portuguese Municipal Meeting Minutes)
+ - **Fine-tuning Method**: Focal Loss with boundary-aware weighting
+ - **Training Framework**: PyTorch + Transformers
+
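The card names focal loss with boundary-aware weighting as the fine-tuning objective but does not give the formula. As a minimal sketch, assuming the standard focal loss `FL = -w_t * (1 - p_t)^gamma * log(p_t)` and assuming that "boundary-aware weighting" means a larger weight on the boundary class (label 1), the per-example loss could look like this. The function name `focal_loss`, the value `gamma=2.0`, and the class weights are illustrative assumptions, not the settings used to train this model.

```python
import math
from typing import Sequence


def focal_loss(
    probs: Sequence[float],
    label: int,
    gamma: float = 2.0,                           # assumed focusing parameter
    class_weights: Sequence[float] = (1.0, 1.0),  # assumed per-class weights
) -> float:
    """Focal loss for one example, given class probabilities and the true label."""
    # Probability assigned to the true class, clamped to avoid log(0).
    p_t = min(max(probs[label], 1e-12), 1.0)
    # (1 - p_t)^gamma shrinks the loss of examples that are already well classified.
    return -class_weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct "is_next" pair versus a misclassified one.
easy = focal_loss((0.95, 0.05), label=0)
hard = focal_loss((0.30, 0.70), label=0)

# Hypothetical weighting that makes errors on boundaries (label 1) cost more.
weighted = focal_loss((0.70, 0.30), label=1, class_weights=(1.0, 3.0))
unweighted = focal_loss((0.70, 0.30), label=1)

print(f"easy={easy:.4f} hard={hard:.4f} weighted={weighted:.4f}")
```

With `gamma=0` and equal weights the expression reduces to ordinary cross-entropy, which is a convenient sanity check when experimenting with the two knobs.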
+ ## How It Works
+
+ The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or mark a topic transition (label 1: "not_next"). Applying this classifier to every pair of adjacent sentences in a document yields the document's topic boundaries.
+
+ ```text
+ Sentence A: "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
+ Sentence B: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+ → Prediction: Same Topic (confidence: 76%)
+
+ Sentence A: "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+ Sentence B: "Não houve processos e requerimentos diversos a apresentar."
+ → Prediction: Topic Boundary (confidence: 82%)
+ ```
+
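The sequential pass over sentence pairs can be sketched as a small grouping loop. This is only an illustration: `segment_document` and `toy_boundary_prob` are hypothetical names, the 0.5 threshold is an assumption, and the toy scorer is a stand-in so the example runs without downloading the model. In practice the scorer would be a function that returns the model's "not_next" probability for a pair of sentences.

```python
from typing import Callable, List, Sequence


def segment_document(
    sentences: Sequence[str],
    boundary_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> List[List[str]]:
    """Group sentences into segments by scoring each adjacent pair."""
    if not sentences:
        return []
    segments: List[List[str]] = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if boundary_prob(prev, curr) > threshold:
            segments.append([curr])    # boundary: start a new segment
        else:
            segments[-1].append(curr)  # same topic: extend the current segment
    return segments


def toy_boundary_prob(a: str, b: str) -> float:
    """Stand-in scorer (NOT the model): boundary when no words are shared."""
    shared = set(a.lower().split()) & set(b.lower().split())
    return 0.1 if shared else 0.9


doc = [
    "the budget was presented",
    "the budget was approved",
    "road works start monday",
]
segments = segment_document(doc, toy_boundary_prob)
print(segments)
```

A document with `n` sentences needs `n - 1` pair scores, so the cost of segmentation grows linearly with document length.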
+ ## Usage
+
+ ### Quick Start with Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
+ model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")
+
+ # Prepare input
+ sentence_a = "Pelo Senhor Presidente foi presente a reunião a ata n.º 28 de 20.12.2023."
+ sentence_b = "Ponderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar a ata n.º 28 de 20.12.2023."
+
+ # Tokenize the pair as a single [CLS] A [SEP] B [SEP] input
+ inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
+
+ # Predict without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     probs = torch.softmax(logits, dim=1)
+
+ # Interpret results: index 0 is "is_next", index 1 is "not_next"
+ is_next_prob = probs[0][0].item()
+ not_next_prob = probs[0][1].item()
+
+ print(f"Is Next (same topic): {is_next_prob:.3f}")
+ print(f"Not Next (topic boundary): {not_next_prob:.3f}")
+
+ if not_next_prob > 0.5:
+     print("🔴 Topic boundary detected!")
+ else:
+     print("🟢 Same topic continues")
+ ```
+
+ ## Limitations
+
+ - **Domain Specificity**: Performs best on administrative and governmental meeting minutes
+ - **Language**: Optimized for Portuguese; performance on other languages, such as English, may vary
+ - **Document Length**: Designed for documents with 10-50 segments
+ - **Context Window**: Limited to 512 tokens per sentence pair
+ - **Ambiguous Boundaries**: May struggle with subtle topic transitions
+
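To make the context-window limit concrete: a BERT pair input is laid out as `[CLS] A [SEP] B [SEP]`, so three positions go to special tokens and the two sentences share the remaining 509. The sketch below trims an over-long pair by repeatedly dropping the last token of the longer sentence. The helper name `fit_pair` is hypothetical, and plain Python lists stand in for real tokenizer output.

```python
from typing import List, Sequence, Tuple

SPECIAL_TOKENS = 3  # [CLS], [SEP] after sentence A, [SEP] after sentence B


def fit_pair(
    tokens_a: Sequence[str],
    tokens_b: Sequence[str],
    max_length: int = 512,
) -> Tuple[List[str], List[str]]:
    """Trim two token lists so the pair plus special tokens fits max_length."""
    if max_length < SPECIAL_TOKENS:
        raise ValueError("max_length must leave room for the special tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_length - SPECIAL_TOKENS
    while len(a) + len(b) > budget:
        # Drop from the end of whichever sentence is currently longer.
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b


# Stand-in tokens: 400 for sentence A and 200 for sentence B.
long_a = ["x"] * 400
long_b = ["y"] * 200
fit_a, fit_b = fit_pair(long_a, long_b)
print(len(fit_a), len(fit_b), len(fit_a) + len(fit_b) + SPECIAL_TOKENS)
```

With the Transformers tokenizer there is no need to do this by hand: passing `truncation=True, max_length=512` when tokenizing the pair applies a comparable trimming step.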
+ ## Model Card Contact
+
+ For questions or feedback, please open a discussion in the [model repository](https://huggingface.co/anonymous15135/nsp-councilseg/discussions).
+
+ ## License
+
+ This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.