DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

We introduce DIFFA-2, a practical diffusion-based large audio language model (LALM) for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong autoregressive (AR) LALMs under practical training budgets, supporting the view that diffusion-based modeling is a viable backbone for large-scale audio understanding.

We have open-sourced the checkpoints for stage 1 and stage 4. The files in the root directory of the repository are the stage 4 checkpoint; the stage 1 checkpoint is located in the stage1 folder.
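As a rough illustration of that layout, the sketch below fetches one checkpoint at a time with the `huggingface_hub` library. The repository id is the one shown on this page. The helper names `download_kwargs` and `download_checkpoint` are hypothetical and not part of the DIFFA-2 code, and the pattern `stage1/*` assumes that everything for stage 1 sits under the stage1 folder and nothing else does.

```python
from typing import Optional

# Repository id as shown on this model page.
REPO_ID = "zhoujiaming777/DIFFA-2"


def download_kwargs(stage: int, local_dir: Optional[str] = None) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    Hypothetical helper: stage 4 is taken to be everything outside the
    stage1 folder, stage 1 to be everything inside it.
    """
    if stage == 4:
        # Root-level files are the stage 4 checkpoint, so skip stage1/.
        kwargs = {"repo_id": REPO_ID, "ignore_patterns": ["stage1/*"]}
    elif stage == 1:
        # Only fetch files under the stage1 folder.
        kwargs = {"repo_id": REPO_ID, "allow_patterns": ["stage1/*"]}
    else:
        raise ValueError("only the stage 1 and stage 4 checkpoints are released")
    if local_dir is not None:
        kwargs["local_dir"] = local_dir
    return kwargs


def download_checkpoint(stage: int, local_dir: Optional[str] = None) -> str:
    """Download the requested checkpoint and return its local path."""
    # Imported here so the helper above works without the dependency.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(**download_kwargs(stage, local_dir))
```

Calling `download_checkpoint(4)` would then fetch the root-level files, and `download_checkpoint(1)` the contents of the stage1 folder. How the downloaded weights are loaded depends on the project's own inference code, which is not covered here.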
