Open-source models and datasets for the Malaysian context.
AI & ML interests
We are a lab from Malaysia developing multimodal AI.
Recent Activity
Continued finetuning of Instruct models using LoRA, from 0.5B up to 72B (see the LoRA sketch after this list).
Malaysian Text-to-Speech models.
Datasets for pretraining or continued pretraining to induce locality; up to 200B tokens gathered.
Pretrained from scratch with a 4096 context length on 90B tokens of Malaysian text, https://huggingface.co/papers/2401.14680
Extending Malaysian CausalLM with non-causal masking training, https://arxiv.org/abs/2404.05961
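A minimal sketch of the LoRA finetuning mentioned above, using peft and transformers. The base checkpoint, rank, and target modules here are illustrative assumptions, not the exact recipe used for the released models.

```python
# Minimal LoRA finetuning sketch with peft + transformers.
# Base model and hyperparameters below are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical 0.5B Instruct base
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,                                   # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # only adapter weights train
# Training then proceeds with a standard Trainer/SFTTrainer loop.
```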
- mesolitica/malaysian-mistral-64M-MLM-512 (Feature Extraction, 64.2M params)
- mesolitica/malaysian-mistral-191M-MLM-512 (Feature Extraction, 0.2B params)
- mesolitica/malaysian-mistral-349M-MLM-512 (Feature Extraction, 0.3B params)
- mesolitica/malaysian-mistral-474M-MLM-512 (Feature Extraction, 0.5B params)
Trained on 17B tokens (81 GB of cleaned text); understands standard Malay, local Malay, local Mandarin, Manglish, and local Tamil.
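A hedged usage sketch for these MLM-512 checkpoints as sentence encoders. Loading through AutoModel and mean pooling over the last hidden state are assumptions here; check the model cards for the intended pooling and whether trust_remote_code is needed.

```python
# Extract a sentence embedding from a malaysian-mistral MLM-512 checkpoint.
# Mean pooling over non-padding tokens is an assumption, not a documented API.
import torch
from transformers import AutoModel, AutoTokenizer

name = "mesolitica/malaysian-mistral-64M-MLM-512"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Selamat pagi, apa khabar?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (1, seq_len, dim)

mask = inputs["attention_mask"].unsqueeze(-1)             # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
print(embedding.shape)
```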
Full-parameter post-training using SFT warmup and GRPO.
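A sketch of that two-stage recipe using trl's SFTTrainer and GRPOTrainer. The model name, dataset file, and reward function are all placeholders, and GRPOTrainer assumes a recent trl release; this is not the lab's actual training configuration.

```python
# Two-stage post-training sketch: SFT warmup, then GRPO, via trl.
# Model/dataset names and the reward function are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, SFTConfig, SFTTrainer

# Placeholder instruction data; real runs would point at the released datasets.
dataset = load_dataset("json", data_files="malaysian_instructions.jsonl", split="train")

# Stage 1: full-parameter SFT warmup.
sft = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",    # hypothetical base
    train_dataset=dataset,                  # expects a "text" or "messages" column
    args=SFTConfig(output_dir="sft-warmup"),
)
sft.train()

# Stage 2: GRPO on the SFT checkpoint, with a toy length-based reward.
def reward_fn(completions, **kwargs):
    return [-abs(len(c) - 200) / 200 for c in completions]  # illustrative only

grpo = GRPOTrainer(
    model="sft-warmup",                     # path to the stage-1 output
    reward_funcs=reward_fn,
    train_dataset=dataset,                  # GRPO expects a "prompt" column
    args=GRPOConfig(output_dir="grpo"),
)
grpo.train()
```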
Open-source models and datasets.
- mesolitica/malaysian-whisper-medium (Automatic Speech Recognition, 0.8B params)
- mesolitica/malaysian-whisper-small-v2 (Automatic Speech Recognition, 0.2B params)
- mesolitica/malaysian-whisper-base (Automatic Speech Recognition, 72.6M params)
- mesolitica/malaysian-whisper-tiny (Automatic Speech Recognition, 37.8M params)
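A minimal transcription sketch for the Whisper checkpoints above, using the standard transformers ASR pipeline; the audio path is a placeholder.

```python
# Transcribe Malaysian speech with a malaysian-whisper checkpoint.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="mesolitica/malaysian-whisper-base",
)
result = asr("speech.mp3")   # placeholder path to a local audio file
print(result["text"])
```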
Translation models and datasets.
- mesolitica/nanot5-small-malaysian-translation-v2 (Translation, 89.5M params)
- mesolitica/nanot5-base-malaysian-translation-v2 (Translation, 0.2B params)
- mesolitica/nanot5-small-malaysian-translation-v2.1 (Translation, 89.5M params)
- mesolitica/nanot5-base-malaysian-translation-v2.1 (Translation, 0.2B params)
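A hedged usage sketch for the nanot5 translation models above. The "terjemah ke ..." prompt prefix is an assumed format, so confirm the exact prompt on the model card before use.

```python
# Translate with a nanot5 translation checkpoint (T5 seq2seq generation).
# The "terjemah ke ..." prefix is an assumption; check the model card.
from transformers import AutoTokenizer, T5ForConditionalGeneration

name = "mesolitica/nanot5-small-malaysian-translation-v2.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("terjemah ke Melayu: How are you today?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```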
Malaysian instruction datasets for pretraining or finetuning LLMs.
Trained on 21B tokens (91 GB of cleaned text); understands standard Malay, local Malay, local Mandarin, Manglish, and local Tamil.
- mesolitica/malaysian-mistral-64M-4096 (Text Generation, 64.2M params)
- mesolitica/malaysian-mistral-191M-4096 (Text Generation, 0.2B params)
- mesolitica/malaysian-mistral-349M-4096 (Text Generation, 0.3B params)
- mesolitica/malaysian-mistral-474M-4096 (Text Generation, 0.5B params)
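A plain-completion sketch for these from-scratch malaysian-mistral checkpoints; they are base models, so no chat template is assumed.

```python
# Plain text completion with a from-scratch malaysian-mistral base model.
from transformers import pipeline

generator = pipeline("text-generation", model="mesolitica/malaysian-mistral-474M-4096")
print(generator("Kuala Lumpur ialah", max_new_tokens=50)[0]["generated_text"])
```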