Fine-tuning MedASR for Indian Regional Languages
Hi, I’m working on medical ASR use cases and looking for guidance on fine-tuning MedASR for Indian regional languages such as Hindi, Tamil, Telugu, Kannada, and Bengali. Any recommendations on datasets, multilingual fine-tuning strategies, or evaluation best practices would be really helpful. Happy to collaborate.
Thanks for reaching out! MedASR was pre-trained and fine-tuned on English-only data, so we are not yet sure how it will perform in other languages. At the very least, you will need a different tokenizer, because the current one covers only English. Unfortunately, we are not very familiar with datasets or evaluation practices for these languages at the moment. If you already have some data, though, you should be able to fine-tune the MedASR model by following https://github.com/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb.
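To illustrate why an English-only tokenizer is a blocker, here is a minimal sketch that measures how much of a transcript a vocabulary can represent. The `coverage` helper and the `english_vocab` set are hypothetical stand-ins for illustration; they are not MedASR's actual tokenizer or vocabulary, which may well be subword-based rather than character-based.

```python
def coverage(text: str, vocab: set) -> float:
    """Fraction of non-space characters in `text` that appear in `vocab`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c in vocab for c in chars) / len(chars)


# Hypothetical English-only character vocabulary (NOT MedASR's real one).
english_vocab = set("abcdefghijklmnopqrstuvwxyz'")

english_cov = coverage("patient has a fever", english_vocab)
hindi_cov = coverage("मरीज को बुखार है", english_vocab)  # "the patient has a fever"

print(f"English coverage: {english_cov:.2f}")
print(f"Hindi coverage:   {hindi_cov:.2f}")
```

Running the same kind of check against the real tokenizer's vocabulary and your own transcripts would show how much of the target script would be mapped to an unknown token.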
Hi, any guidance on how I can build the tokeniser for, say, Hindi audio and data? Also, roughly how many hours of training data would be ideal for fine-tuning? And if I change the tokeniser, doesn't that mean I need to retrain the model, since I won't be able to use the existing weights?
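On the tokeniser question, one common starting point for CTC-style ASR models is a character-level vocabulary built directly from the training transcripts. The sketch below assumes that approach; it is not confirmed in this thread as what MedASR expects. The helper names (`build_char_vocab`, `encode`), the `<pad>`/`<unk>` tokens, the use of `<pad>` as the CTC blank, and `|` as the word separator are all illustrative assumptions.

```python
import unicodedata


def build_char_vocab(transcripts):
    """Map each distinct character to an integer id, special tokens first."""
    chars = set()
    for line in transcripts:
        # NFC normalisation so the same visible text yields the same code points.
        line = unicodedata.normalize("NFC", line.strip())
        chars.update(line.replace(" ", "|"))  # "|" marks word boundaries
    # Assumption: "<pad>" (id 0) doubles as the CTC blank token.
    specials = ["<pad>", "<unk>"]
    return {tok: i for i, tok in enumerate(specials + sorted(chars))}


def encode(text, vocab):
    """Convert text to a list of ids, falling back to <unk> for unseen characters."""
    text = unicodedata.normalize("NFC", text.strip()).replace(" ", "|")
    return [vocab.get(c, vocab["<unk>"]) for c in text]


# Toy Hindi transcripts, for illustration only.
transcripts = ["मरीज को बुखार है", "रक्तचाप सामान्य है"]
vocab = build_char_vocab(transcripts)
ids = encode("बुखार है", vocab)

print(len(vocab), ids)
```

On reusing weights: in CTC-style models generally, only the layers whose shape depends on the vocabulary size (typically the final output projection) are tied to the tokeniser, so a common approach is to keep the pre-trained encoder weights and reinitialise just that layer. Whether this works well for MedASR and Hindi has not been confirmed here, and would need to be checked against the model's actual architecture.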
Thanks for the contribution