Whisper Large-V3 Spanish

Model summary

Whisper Large-V3 Spanish is a cutting-edge automatic speech recognition (ASR) model for Spanish (es), fine-tuned from openai/whisper-large-v3 on the Spanish subset of Mozilla Common Voice 13.0. It achieves a Word Error Rate (WER) of 4.9295% on the held-out evaluation set, placing it among the most accurate publicly available Whisper fine-tunes for Spanish.

This model benefits from the improvements introduced with the Large-V3 release, including better noise robustness and broader multilingual pretraining, and was fine-tuned with mixed-precision training for efficiency.


Model description

  • Architecture: Transformer-based encoder–decoder (Whisper Large-V3)
  • Base model: openai/whisper-large-v3
  • Language: Spanish (es)
  • Task: Automatic Speech Recognition (ASR)
  • Output: Text transcription in Spanish
  • Decoding: Autoregressive sequence-to-sequence decoding

Large-V3 builds upon Large-V2, offering lower WER and improved generalization across accents and audio conditions.
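
For a closer look at the encoder–decoder decoding path, the checkpoint can also be driven directly through WhisperProcessor and WhisperForConditionalGeneration instead of the pipeline API shown later. This is a minimal sketch, assuming a local file audio.wav and the repository ID HiTZ/whisper-large-v3-es:

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the processor (feature extractor + tokenizer) and the fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained("HiTZ/whisper-large-v3-es")
model = WhisperForConditionalGeneration.from_pretrained("HiTZ/whisper-large-v3-es")

# Read a local file and resample to the 16 kHz rate Whisper expects
speech, sr = librosa.load("audio.wav", sr=16000)

# Encoder input: log-Mel features; the decoder then generates tokens autoregressively
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        input_features=inputs.input_features, language="es", task="transcribe"
    )

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])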


Intended use

Primary use cases

  • High-accuracy transcription of Spanish audio
  • Podcasts, interviews, lectures, and long-form audio
  • Research or commercial applications requiring top-tier ASR performance in Spanish

Limitations

  • Performance may drop on heavily accented or extremely noisy audio
  • High memory and compute requirements, particularly for real-time use
  • Not suitable for critical domains (medical, legal) without human verification

Training and evaluation data

  • Dataset: Mozilla Common Voice 13.0 (Spanish subset)
  • Data type: Crowd-sourced read speech
  • Preprocessing (a minimal sketch follows this list):
    • Audio resampled to 16 kHz
    • Text tokenized with the Whisper tokenizer
    • Corrupted or invalid samples filtered out
  • Evaluation metric: Word Error Rate (WER) on a held-out evaluation set
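
The exact preparation script is not published here; the following is a minimal sketch of the listed steps (16 kHz resampling, Whisper tokenization, basic filtering) using the datasets library. The dataset name mozilla-foundation/common_voice_13_0 and the sentence column reflect the Common Voice layout on the Hub, and the filtering criterion is an assumption:

from datasets import load_dataset, Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

# Spanish subset of Common Voice 13.0; resample every clip to 16 kHz on the fly
cv_es = load_dataset("mozilla-foundation/common_voice_13_0", "es", split="train")
cv_es = cv_es.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel input features for the Whisper encoder
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Target token ids from the Whisper tokenizer
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# Drop empty transcripts (the exact invalid-sample filtering used here is not specified)
cv_es = cv_es.filter(lambda ex: ex["sentence"] is not None and len(ex["sentence"].strip()) > 0)
cv_es = cv_es.map(prepare, remove_columns=cv_es.column_names)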


Evaluation results

  • WER (eval): 4.9295 %
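
The reported figure is a corpus-level WER. As a reference for reproducing it, WER can be computed with the evaluate library; the transcripts below are placeholders, not data from the evaluation set:

import evaluate

wer_metric = evaluate.load("wer")

# Placeholder reference transcripts and model outputs
references = ["hola cómo estás", "buenos días a todos"]
predictions = ["hola como estás", "buenos días a todos"]

# WER = (substitutions + deletions + insertions) / reference words, reported here in %
wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.4f}%")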

Training procedure

Training hyperparameters

  • Learning rate: 1e-5
  • Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-8)
  • LR scheduler: Linear
  • Warmup steps: 500
  • Training steps: 20000
  • Train batch size: 32 (gradient accumulation 2 → effective batch size 64)
  • Eval batch size: 16
  • Seed: 42
  • Mixed precision training: Native AMP
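
These settings map naturally onto a Hugging Face Seq2SeqTrainingArguments configuration. The sketch below is an assumed reconstruction of the run setup (output directory and evaluation cadence are inferred), not the original training script:

from transformers import Seq2SeqTrainingArguments

# Assumed mapping of the listed hyperparameters onto Trainer arguments.
# Adam with β1=0.9, β2=0.999, ε=1e-8 is the Trainer default optimizer setting.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-es",   # placeholder output directory
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,        # effective batch size 64
    per_device_eval_batch_size=16,
    seed=42,
    fp16=True,                            # native AMP mixed precision
    evaluation_strategy="steps",
    eval_steps=1000,                      # matches the 1000-step cadence in the results table
)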

Training results (summary)

Training Loss   Epoch   Step   Validation Loss   WER (%)
0.058 2.04 1000 0.1540 4.6851
0.0124 4.07 2000 0.1829 4.6787
0.0052 6.11 3000 0.2190 4.8096
0.0024 8.15 4000 0.2289 4.8776
0.0024 10.18 5000 0.2341 4.8923
0.0015 12.22 6000 0.2459 4.9340
0.0021 14.26 7000 0.2558 4.9276
0.0011 16.29 8000 0.2540 5.1015
0.0013 18.33 9000 0.2611 5.1855
0.0005 20.37 10000 0.2720 4.9379
0.0028 22.4 11000 0.2614 5.0110
0.0004 24.44 12000 0.2652 4.9898
0.0004 26.48 13000 0.2850 4.9776
0.0006 28.51 14000 0.2736 4.9732
0.0002 30.55 15000 0.2944 5.1566
0.0002 32.59 16000 0.2949 5.0007
0.0001 34.62 17000 0.3094 4.9552
0.0 36.66 18000 0.3185 4.9622
0.0 38.7 19000 0.3229 4.9462
0.0 40.73 20000 0.3245 4.9295

Framework versions

  • Transformers 4.37.2
  • PyTorch 2.2.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.1

Example usage

from transformers import pipeline

hf_model = "HiTZ/whisper-large-v3-es"  # fine-tuned checkpoint on the Hugging Face Hub
device = 0  # GPU index; use -1 to run on CPU

pipe = pipeline(
    task="automatic-speech-recognition",
    model=hf_model,
    device=device
)

result = pipe("audio.wav")
print(result["text"])
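
For long-form audio such as podcasts or lectures, the same pipeline object can transcribe in 30-second chunks and force Spanish decoding. chunk_length_s, return_timestamps, and generate_kwargs are standard options of the transformers ASR pipeline rather than settings documented for this particular checkpoint:

# Chunked long-form transcription with Spanish decoding forced explicitly
result = pipe(
    "long_audio.wav",
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"language": "spanish", "task": "transcribe"},
)
print(result["text"])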

Ethical considerations and risks

  • This model transcribes speech and may process personal data.
  • Users should ensure compliance with applicable data protection laws (e.g., GDPR).
  • The model should not be used for surveillance or non-consensual audio processing.

Citation

If you use this model in your research, please cite:

@misc{dezuazo2025whisperlmimprovingasrmodels,
  title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
  author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
  year={2025},
  eprint={2503.23542},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Please check the related preprint, arXiv:2503.23542, for more details.


License

This model is available under the Apache-2.0 License. You are free to use, modify, and distribute this model as long as you credit the original creators.


Contact and attribution

  • Fine-tuning and evaluation: HiTZ/Aholab - Basque Center for Language Technology
  • Base model: OpenAI Whisper
  • Dataset: Mozilla Common Voice

For questions or issues, please open an issue in the model repository.
