Paper: Unsupervised Cross-lingual Representation Learning at Scale (arXiv:1911.02116)
XLM-RoBERTa is a multilingual model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al. and first released in this repository.
This checkpoint was fine-tuned for intent classification on the Jarbas/ovos_intents_train dataset.
You can use the model for intent classification in the context of the Open Voice OS (OVOS) project, for example as shown below.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig, Trainer
from datasets import load_dataset

model = AutoModelForSequenceClassification.from_pretrained("fdemelo/xlm-roberta-ovos-intent-classifier")
tokenizer = AutoTokenizer.from_pretrained("fdemelo/xlm-roberta-ovos-intent-classifier")
config = AutoConfig.from_pretrained("fdemelo/xlm-roberta-ovos-intent-classifier")

# load the intent dataset (adjust the split name if it differs)
dataset = load_dataset("Jarbas/ovos_intents_train", split="train")

# preprocess dataset: map string labels to ids and tokenize the sentences
def tokenize_function(examples):
    examples["label"] = [config.label2id[x] for x in examples["label"]]
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# the model itself has no .predict method; wrap it in a Trainer for batched inference
trainer = Trainer(model=model)
prediction = trainer.predict(tokenized_dataset)
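
For classifying a single utterance rather than a whole dataset, the same checkpoint can also be wrapped in a text-classification pipeline. This is a minimal sketch; the example utterance below is only an illustrative placeholder and is not taken from the dataset.

from transformers import pipeline

# minimal sketch: wrap the fine-tuned checkpoint in a text-classification pipeline
classifier = pipeline("text-classification", model="fdemelo/xlm-roberta-ovos-intent-classifier")

# classify one utterance; the returned label comes from the model's id2label mapping
print(classifier("what time is it"))  # e.g. [{'label': '<intent name>', 'score': 0.99}]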