amphion/dualcodec

Improve model card: Add pipeline tag, library name, license, and paper abstract

by nielsr HF Staff - opened Oct 2

base: refs/heads/main ← from: refs/pr/2

README.md +18 -9

@@ -1,7 +1,13 @@
 # DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
 [![arXiv](https://img.shields.io/badge/arXiv-2505.13000-brightgreen.svg?style=flat-square)](http://arxiv.org/abs/2505.13000)
 [![githubio](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://dualcodec.github.io/)
 [![PyPI](https://img.shields.io/pypi/v/dualcodec?color=blue&label=PyPI&logo=PyPI&style=flat-square)](https://pypi.org/project/dualcodec/)
@@ -9,6 +15,9 @@
 [![Amphion](https://img.shields.io/badge/Amphion-Stable_Release-blue?style=flat-square)](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
 ## About
 DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL feature) Neural Audio Codec designed to extract discrete tokens for efficient speech generation.
@@ -31,17 +40,17 @@ pip install dualcodec
 - 2025-01-16: Finished writing DualCodec inference codes, the version is v0.1.0. Latest versions are synced to pypi.
 ## Available models
-<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
 - 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->
-| Model_ID   | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data       |
-|-----------|------------|----------------------|-------------------------------------|----------------------------------------|---------------------|
-| 12hz_v1   | 12.5Hz     | Any from 1-8 (maximum 8)        | 16384                               | 4096                                   | 100K hours Emilia  |
-| 25hz_v1   | 25Hz       | Any from 1-12 (maximum 12)       | 16384                               | 1024                                   | 100K hours Emilia  |
 ## How to inference DualCodec
-### 1. Programmic usage (automatically downloads checkpoints from Huggingface):
 ```python
 import dualcodec
@@ -71,7 +80,7 @@ torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
 ### 2. Alternative usage with local checkpoints
-First, download checkpoints to local:
 ```
 # export HF_ENDPOINT=https://hf-mirror.com      # uncomment this to use huggingface mirror if you're in China
 huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
@@ -123,7 +132,7 @@ python -m dualcodec.app
 This will launch an app that allows you to upload a wav file and get the output wav file.
 ## DualCodec-based TTS models
-Models available:
 - DualCodec-VALLE: A super fast 12.5Hz VALL-E TTS model based on DualCodec.
 - DualCodec-Voicebox: A flow matching decoder for DualCodec 12.5Hz's semantic codes. (this can be used as the second stage of tts). The component alone is not a TTS.

+---
+pipeline_tag: audio-to-audio
+library_name: dualcodec
+license: apache-2.0
+---
 # DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
+The model was presented in the paper [DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation](https://huggingface.co/papers/2505.13000).
 [![arXiv](https://img.shields.io/badge/arXiv-2505.13000-brightgreen.svg?style=flat-square)](http://arxiv.org/abs/2505.13000)
 [![githubio](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://dualcodec.github.io/)
 [![PyPI](https://img.shields.io/pypi/v/dualcodec?color=blue&label=PyPI&logo=PyPI&style=flat-square)](https://pypi.org/project/dualcodec/)
 [![Amphion](https://img.shields.io/badge/Amphion-Stable_Release-blue?style=flat-square)](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
+## Abstract
+Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: this https URL , code is available at: this https URL
 ## About
 DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL feature) Neural Audio Codec designed to extract discrete tokens for efficient speech generation.
 - 2025-01-16: Finished writing DualCodec inference codes, the version is v0.1.0. Latest versions are synced to pypi.
 ## Available models
+<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
 - 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->
+| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
+|----------|------------|----------------|-------------------------------------|----------------------------------------|---------------|
+| 12hz_v1 | 12.5Hz | Any from 1-8 (maximum 8) | 16384 | 4096 | 100K hours Emilia |
+| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
 ## How to inference DualCodec
+### 1. Programmic usage (automatically downloads checkpoints from Huggingface):
 ```python
 import dualcodec
 ### 2. Alternative usage with local checkpoints
+First, download checkpoints to local:
 ```
 # export HF_ENDPOINT=https://hf-mirror.com      # uncomment this to use huggingface mirror if you're in China
 huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
 This will launch an app that allows you to upload a wav file and get the output wav file.
 ## DualCodec-based TTS models
+Models available:
 - DualCodec-VALLE: A super fast 12.5Hz VALL-E TTS model based on DualCodec.
 - DualCodec-Voicebox: A flow matching decoder for DualCodec 12.5Hz's semantic codes. (this can be used as the second stage of tts). The component alone is not a TTS.
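
Reviewer note on the "Available models" table: the frame rate, quantizer count, and codebook sizes together imply a nominal bitrate for each configuration. The sketch below works that arithmetic out. It assumes each token is stored as a fixed-length code of log2(codebook size) bits with no entropy coding, which the README does not state; the function name `nominal_bitrate_bps` and the `MODELS` dictionary are hypothetical helpers for illustration and are not part of the `dualcodec` package.

```python
import math

# Values copied from the "Available models" table in the README.
# The dictionary layout itself is a hypothetical convenience, not a dualcodec API.
MODELS = {
    "12hz_v1": {"frame_rate_hz": 12.5, "max_quantizers": 8,
                "semantic_size": 16384, "acoustic_size": 4096},
    "25hz_v1": {"frame_rate_hz": 25.0, "max_quantizers": 12,
                "semantic_size": 16384, "acoustic_size": 1024},
}


def nominal_bitrate_bps(frame_rate_hz, n_quantizers, semantic_size, acoustic_size):
    """Nominal bitrate in bits per second.

    Assumption: every token costs log2(codebook size) bits (fixed-length
    coding). RVQ-1 uses the semantic codebook; the remaining
    n_quantizers - 1 layers use the acoustic codebook.
    """
    if n_quantizers < 1:
        raise ValueError("n_quantizers must be at least 1")
    bits_per_frame = math.log2(semantic_size) + (n_quantizers - 1) * math.log2(acoustic_size)
    return frame_rate_hz * bits_per_frame


if __name__ == "__main__":
    for model_id, cfg in MODELS.items():
        # Report the semantic-only case (1 quantizer) and the maximum-quantizer case.
        for n in (1, cfg["max_quantizers"]):
            bps = nominal_bitrate_bps(cfg["frame_rate_hz"], n,
                                      cfg["semantic_size"], cfg["acoustic_size"])
            print(f"{model_id} with {n} quantizer(s): {bps:.0f} bps")
```

Under that assumption, using fewer RVQ layers at inference time lowers the bitrate linearly in the number of acoustic layers dropped, which is the trade-off the "Any from 1-8" and "Any from 1-12" entries in the table expose.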