Improve model card: Add pipeline tag, library name, license, and paper abstract

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +18 -9
README.md CHANGED
@@ -1,7 +1,13 @@
-
+---
+pipeline_tag: audio-to-audio
+library_name: dualcodec
+license: apache-2.0
+---
 
 # DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
 
+The model was presented in the paper [DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation](https://huggingface.co/papers/2505.13000).
+
 [![arXiv](https://img.shields.io/badge/arXiv-2505.13000-brightgreen.svg?style=flat-square)](http://arxiv.org/abs/2505.13000)
 [![githubio](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://dualcodec.github.io/)
 [![PyPI](https://img.shields.io/pypi/v/dualcodec?color=blue&label=PyPI&logo=PyPI&style=flat-square)](https://pypi.org/project/dualcodec/)
@@ -9,6 +15,9 @@
 [![Amphion](https://img.shields.io/badge/Amphion-Stable_Release-blue?style=flat-square)](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
 
+## Abstract
+Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: this https URL , code is available at: this https URL
+
 ## About
 DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically-enhanced (with SSL feature) Neural Audio Codec designed to extract discrete tokens for efficient speech generation.
 
@@ -31,17 +40,17 @@ pip install dualcodec
 - 2025-01-16: Finished writing DualCodec inference codes, the version is v0.1.0. Latest versions are synced to pypi.
 
 ## Available models
-<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
+<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
 - 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->
 
-| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
-|-----------|------------|----------------------|-------------------------------------|----------------------------------------|---------------------|
-| 12hz_v1 | 12.5Hz | Any from 1-8 (maximum 8) | 16384 | 4096 | 100K hours Emilia |
-| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
+| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
+|----------|------------|----------------|-------------------------------------|----------------------------------------|---------------|
+| 12hz_v1 | 12.5Hz | Any from 1-8 (maximum 8) | 16384 | 4096 | 100K hours Emilia |
+| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
 
 
 ## How to inference DualCodec
-### 1. Programmic usage (automatically downloads checkpoints from Huggingface):
+### 1. Programmic usage (automatically downloads checkpoints from Huggingface):
 ```python
 import dualcodec
 
@@ -71,7 +80,7 @@ torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
 
 
 ### 2. Alternative usage with local checkpoints
-First, download checkpoints to local:
+First, download checkpoints to local:
 ```
 # export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
 huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
@@ -123,7 +132,7 @@ python -m dualcodec.app
 This will launch an app that allows you to upload a wav file and get the output wav file.
 
 ## DualCodec-based TTS models
-Models available:
+Models available:
 - DualCodec-VALLE: A super fast 12.5Hz VALL-E TTS model based on DualCodec.
 - DualCodec-Voicebox: A flow matching decoder for DualCodec 12.5Hz's semantic codes. (this can be used as the second stage of tts). The component alone is not a TTS.
 
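
As a reading aid for the "Available models" table touched by this diff, the sketch below estimates the token rate and nominal bitrate implied by each row. It assumes every token costs exactly log2(codebook size) bits with no entropy coding, which the README does not state. The names `MODELS`, `tokens_per_second`, and `bits_per_second` are hypothetical helpers for illustration and are not part of the `dualcodec` package.

```python
import math

# Values copied from the "Available models" table in the README diff.
MODELS = {
    "12hz_v1": {"frame_rate_hz": 12.5, "max_quantizers": 8,
                "semantic_size": 16384, "acoustic_size": 4096},
    "25hz_v1": {"frame_rate_hz": 25.0, "max_quantizers": 12,
                "semantic_size": 16384, "acoustic_size": 1024},
}


def _check(model_id: str, n_quantizers: int) -> dict:
    """Return the table row, rejecting quantizer counts outside the listed range."""
    spec = MODELS[model_id]
    if not 1 <= n_quantizers <= spec["max_quantizers"]:
        raise ValueError(
            f"{model_id} supports 1-{spec['max_quantizers']} quantizers"
        )
    return spec


def tokens_per_second(model_id: str, n_quantizers: int) -> float:
    """Tokens emitted per second: one token per quantizer per frame."""
    spec = _check(model_id, n_quantizers)
    return spec["frame_rate_hz"] * n_quantizers


def bits_per_second(model_id: str, n_quantizers: int) -> float:
    """Nominal bitrate, assuming log2(codebook size) bits per token."""
    spec = _check(model_id, n_quantizers)
    # The first quantizer (RVQ-1) uses the semantic codebook;
    # the remaining quantizers use the acoustic codebook.
    bits_per_frame = math.log2(spec["semantic_size"]) + (
        n_quantizers - 1
    ) * math.log2(spec["acoustic_size"])
    return spec["frame_rate_hz"] * bits_per_frame


if __name__ == "__main__":
    for model_id, spec in MODELS.items():
        n = spec["max_quantizers"]
        print(model_id, tokens_per_second(model_id, n), bits_per_second(model_id, n))
```

Under this assumption, 12hz_v1 with all 8 quantizers comes to 12.5 × (14 + 7 × 12) = 1225 bits per second, and 25hz_v1 with all 12 comes to 25 × (14 + 11 × 10) = 3100 bits per second. Using fewer quantizers lowers both figures proportionally to the dropped acoustic layers.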