Improve model card: Add pipeline tag, library name, license, and paper abstract
#2 opened by nielsr (HF Staff)
README.md CHANGED
|
@@ -1,7 +1,13 @@
|
|
| 1 |
-
| 2 |
|
| 3 |
# DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
|
| 4 |
| 5 |
[](http://arxiv.org/abs/2505.13000)
|
| 6 |
[](https://dualcodec.github.io/)
|
| 7 |
[](https://pypi.org/project/dualcodec/)
|
|
@@ -9,6 +15,9 @@
|
|
| 9 |
[](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
|
| 10 |
[](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
|
| 11 |
| 12 |
## About
|
| 13 |
DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
|
| 14 |
|
|
@@ -31,17 +40,17 @@ pip install dualcodec
|
|
| 31 |
- 2025-01-16: Finished the DualCodec inference code, released as v0.1.0. The latest versions are synced to PyPI.
|
| 32 |
|
| 33 |
## Available models
|
| 34 |
-
<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
|
| 35 |
- 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->
|
| 36 |
|
| 37 |
-
| Model_ID
|
| 38 |
-
|
| 39 |
-
| 12hz_v1
|
| 40 |
-
| 25hz_v1
|
| 41 |
|
| 42 |
|
| 43 |
## How to run inference with DualCodec
|
| 44 |
-
### 1. Programmatic usage (automatically downloads checkpoints from Hugging Face):
|
| 45 |
```python
|
| 46 |
import dualcodec
|
| 47 |
|
|
@@ -71,7 +80,7 @@ torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
|
|
| 71 |
|
| 72 |
|
| 73 |
### 2. Alternative usage with local checkpoints
|
| 74 |
-
First, download the checkpoints to a local directory:
|
| 75 |
```
|
| 76 |
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
|
| 77 |
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
|
|
@@ -123,7 +132,7 @@ python -m dualcodec.app
|
|
| 123 |
This will launch an app that allows you to upload a wav file and get the output wav file.
|
| 124 |
|
| 125 |
## DualCodec-based TTS models
|
| 126 |
-
Models available:
|
| 127 |
- DualCodec-VALLE: A super fast 12.5Hz VALL-E TTS model based on DualCodec.
|
| 128 |
- DualCodec-Voicebox: A flow-matching decoder for the semantic codes of the 12.5Hz DualCodec. It can serve as the second stage of a TTS system, but it is not a complete TTS system on its own.
|
| 129 |
|
|
|
|
| 1 |
+
---
|
| 2 |
+
pipeline_tag: audio-to-audio
|
| 3 |
+
library_name: dualcodec
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
---
|
| 6 |
|
| 7 |
# DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
|
| 8 |
|
| 9 |
+
The model was presented in the paper [DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation](https://huggingface.co/papers/2505.13000).
|
| 10 |
+
|
| 11 |
[](http://arxiv.org/abs/2505.13000)
|
| 12 |
[](https://dualcodec.github.io/)
|
| 13 |
[](https://pypi.org/project/dualcodec/)
|
|
|
|
| 15 |
[](https://github.com/open-mmlab/Amphion/blob/main/models/codec/dualcodec/README.md)
|
| 16 |
[](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E)
|
| 17 |
|
| 18 |
+
## Abstract
|
| 19 |
+
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: this https URL , code is available at: this https URL
|
| 20 |
+
|
| 21 |
## About
|
| 22 |
DualCodec is a low-frame-rate (12.5Hz or 25Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
|
| 23 |
|
|
|
|
| 40 |
- 2025-01-16: Finished the DualCodec inference code, released as v0.1.0. The latest versions are synced to PyPI.
|
| 41 |
|
| 42 |
## Available models
|
| 43 |
+
<!-- - 12hz_v1: DualCodec model trained with 12Hz sampling rate.
|
| 44 |
- 25hz_v1: DualCodec model trained with 25Hz sampling rate. -->
|
| 45 |
|
| 46 |
+
| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
|----------|------------|----------------|-------------------------------------|----------------------------------------|---------------|
| 12hz_v1 | 12.5Hz | Any from 1-8 (maximum 8) | 16384 | 4096 | 100K hours Emilia |
| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
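
The frame rates and codebook sizes in the table translate directly into a token bitrate. The following is a minimal sketch of that arithmetic, assuming each RVQ index is stored with a fixed-length code of log2(codebook size) bits and no entropy coding. The `MODELS` dictionary and the `bitrate_bps` helper are hypothetical names used for illustration only; they are not part of the `dualcodec` package.

```python
import math

# Values copied from the "Available models" table.
# The dictionary layout and key names are illustrative assumptions.
MODELS = {
    "12hz_v1": {"frame_rate_hz": 12.5, "max_quantizers": 8,
                "semantic_size": 16384, "acoustic_size": 4096},
    "25hz_v1": {"frame_rate_hz": 25.0, "max_quantizers": 12,
                "semantic_size": 16384, "acoustic_size": 1024},
}


def bitrate_bps(model_id: str, n_quantizers: int) -> float:
    """Bits per second when the first n_quantizers RVQ layers are kept.

    Assumes fixed-length coding: the first (semantic) layer costs
    log2(semantic_size) bits per frame, and every remaining (acoustic)
    layer costs log2(acoustic_size) bits per frame.
    """
    cfg = MODELS[model_id]
    if not 1 <= n_quantizers <= cfg["max_quantizers"]:
        raise ValueError(
            f"{model_id} supports 1 to {cfg['max_quantizers']} quantizers"
        )
    bits_per_frame = (
        math.log2(cfg["semantic_size"])
        + (n_quantizers - 1) * math.log2(cfg["acoustic_size"])
    )
    return cfg["frame_rate_hz"] * bits_per_frame


if __name__ == "__main__":
    for model_id, cfg in MODELS.items():
        for n in (1, cfg["max_quantizers"]):
            print(f"{model_id}, {n} quantizer(s): {bitrate_bps(model_id, n):.0f} bit/s")
```

Under these assumptions, `12hz_v1` ranges from 175 bit/s (semantic layer only) to 1225 bit/s (all 8 layers), and `25hz_v1` from 350 bit/s to 3100 bit/s (all 12 layers).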
|
| 50 |
|
| 51 |
|
| 52 |
## How to run inference with DualCodec
|
| 53 |
+
### 1. Programmatic usage (automatically downloads checkpoints from Hugging Face):
|
| 54 |
```python
|
| 55 |
import dualcodec
|
| 56 |
|
|
|
|
| 80 |
|
| 81 |
|
| 82 |
### 2. Alternative usage with local checkpoints
|
| 83 |
+
First, download the checkpoints to a local directory:
|
| 84 |
```
|
| 85 |
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
|
| 86 |
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
|
|
|
|
| 132 |
This will launch an app that allows you to upload a wav file and get the output wav file.
|
| 133 |
|
| 134 |
## DualCodec-based TTS models
|
| 135 |
+
Models available:
|
| 136 |
- DualCodec-VALLE: A super fast 12.5Hz VALL-E TTS model based on DualCodec.
|
| 137 |
- DualCodec-Voicebox: A flow-matching decoder for the semantic codes of the 12.5Hz DualCodec. It can serve as the second stage of a TTS system, but it is not a complete TTS system on its own.
|
| 138 |
|