Fine-tuning MedASR for Indian Regional Languages

by Adityab24840 - opened 19 days ago

19 days ago

Hi, I’m working on medical ASR use cases and looking for guidance on fine-tuning MedASR for Indian regional languages such as Hindi, Tamil, Telugu, Kannada, and Bengali. Any recommendations on datasets, multilingual fine-tuning strategies, or evaluation best practices would be really helpful. Happy to collaborate.

wuuuuuuuk

Google org 18 days ago

Thanks for reaching out! MedASR is pre-trained and fine-tuned on English-only data, and we are not yet sure how it will perform in other languages. At the very least, you will need a different tokenizer, because the current one covers only English. Unfortunately, we are not very familiar with datasets or evaluation for these languages at the moment. If you already have data, though, you should be able to fine-tune the MedASR model by following https://github.com/google-health/medasr/blob/main/notebooks/fine_tune_with_hugging_face.ipynb.
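
On the evaluation question, a common starting point for ASR in any language is word error rate (WER) reported together with character error rate (CER). The following is a minimal sketch in plain Python, not code from the MedASR repository: the function names are hypothetical, and the NFC normalization step is an assumption about preprocessing that may need adjusting for a given dataset.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed preprocessing: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cer(reference, hypothesis):
    """Character error rate: the same computation over Unicode code points."""
    ref_chars = list(normalize(reference))
    hyp_chars = list(normalize(hypothesis))
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)


# Illustrative sentences only; not drawn from any real dataset.
reference = "मरीज को बुखार है"
hypothesis = "मरीज को बुखार था"
print(wer(reference, hypothesis))  # one substituted word out of four -> 0.25
print(cer(reference, hypothesis))
```

CER is computed here over code points, so a vowel sign (matra) counts as its own character. Whether that granularity suits a particular script is a judgment call.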

darknight054

9 days ago • edited 9 days ago

Hi, any guidance on how I can build the tokeniser for, say, Hindi audio and text data? Also, roughly how many hours of training data would be ideal for fine-tuning? And if I change the tokeniser, doesn't that mean I need to retrain the model, since I won't be able to use the existing weights?
Thanks for the contribution.
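
As one possible starting point for the tokeniser question, here is a minimal sketch of building a character-level vocabulary from Hindi transcripts. It assumes a CTC-style vocabulary with `<pad>` and `<unk>` entries and `|` as the word delimiter. That layout, and all function names below, are illustrative assumptions and may not match the tokenizer format MedASR actually expects, so check the fine-tuning notebook before relying on it.

```python
import json
import unicodedata
from collections import Counter


def clean(text):
    """Assumed preprocessing: NFC normalization, collapsed spaces, '|' as word delimiter."""
    text = unicodedata.normalize("NFC", text)
    return "|".join(text.split())


def build_char_vocab(transcripts, min_count=1):
    """Map every character seen at least min_count times to an integer id."""
    counts = Counter()
    for text in transcripts:
        counts.update(clean(text))
    vocab = {"<pad>": 0, "<unk>": 1}  # hypothetical special tokens
    for ch in sorted(ch for ch, n in counts.items() if n >= min_count):
        vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to ids, falling back to <unk> for unseen characters."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in clean(text)]


def decode(ids, vocab):
    """Invert encode(); '|' is turned back into a space."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids).replace("|", " ")


# Illustrative transcripts only; a real vocabulary should come from the full training text.
transcripts = ["मरीज को बुखार है", "रक्तचाप सामान्य है"]
vocab = build_char_vocab(transcripts)
ids = encode("बुखार है", vocab)
print(ids)
print(decode(ids, vocab))
print(json.dumps(vocab, ensure_ascii=False))  # could be saved as a vocab file
```

A new vocabulary changes the size of the model's output layer, so that layer generally has to be re-initialized and trained from scratch. The remaining weights can in principle still be loaded as an initialization, though how well English-trained weights transfer to Hindi is exactly the open question raised in the reply above.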
