wetdog committed on
Commit
b33645d
1 Parent(s): 82cc612

Update README.md

Files changed (1): README.md (+122 -2)
README.md CHANGED
---
datasets:
- projecte-aina/festcat_trimmed_denoised
- projecte-aina/openslr-slr69-ca-trimmed-denoised
- lj_speech
- blabble-io/libritts_r
---

# Wavenext-mel-22khz

## Model Details

### Model Description

Wavenext is a modification of Vocos in which the last ISTFT layer is replaced with a trainable linear layer that directly predicts speech waveform samples.
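
The following minimal PyTorch sketch illustrates the idea of that final layer. It is an illustration only, with assumed `hidden_dim` and `hop_length` values, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class LinearWaveformHead(nn.Module):
    """Illustrative Wavenext-style output head: rather than predicting STFT
    coefficients and running an ISTFT (as in Vocos), a trainable linear layer
    maps each frame's features directly to hop_length waveform samples."""

    def __init__(self, hidden_dim: int = 512, hop_length: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hop_length)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, n_frames, hidden_dim) from the ConvNeXt backbone
        samples = self.proj(frame_features)   # (batch, n_frames, hop_length)
        return samples.flatten(start_dim=1)   # (batch, n_frames * hop_length)

# Example: 100 frames with hop 256 -> 25600 waveform samples
head = LinearWaveformHead()
print(head(torch.randn(1, 100, 512)).shape)  # torch.Size([1, 25600])
```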

This version of Wavenext uses 80-bin mel spectrograms as acoustic features, which have been widespread in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py).
The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the acoustic output of several TTS models.
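
For context, below is a sketch of hifi-gan-style 80-bin mel extraction with torchaudio. The specific values (22050 Hz, n_fft 1024, hop 256, fmax 8000, log compression) are the common hifi-gan defaults and are assumptions here; the linked `meldataset.py` and this model's training configuration remain the reference:

```python
import torch
import torchaudio

# Assumed hifi-gan-style feature settings; verify against the training config.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    win_length=1024,
    hop_length=256,
    n_mels=80,
    f_min=0.0,
    f_max=8000.0,
    power=1.0,            # magnitude (not power) spectrogram, as in hifi-gan
    norm="slaney",
    mel_scale="slaney",
)

waveform, sr = torchaudio.load("speech_22khz.wav")   # hypothetical 22.05 kHz mono file
mel = torch.log(torch.clamp(mel_transform(waveform), min=1e-5))
print(mel.shape)  # (channels, 80, n_frames)
```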

## Intended Uses and Limitations

The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It is trained to generate speech; if it is used on other audio domains, it may not produce high-quality samples.
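
As a usage illustration only: the names below (`WaveNext`, `from_pretrained`, `decode`) are hypothetical placeholders rather than this repository's confirmed API; the real loading code lives in the training repository and checkpoint files.

```python
import torch

# Hypothetical class and method names, illustrating the mel -> waveform flow.
from wavenext import WaveNext  # assumed module; replace with the actual import

vocoder = WaveNext.from_pretrained("path/to/checkpoint")  # assumed loading helper
vocoder.eval()

# Mel spectrogram from an acoustic TTS model: (batch, 80 mel bins, n_frames)
mel = torch.randn(1, 80, 200)

with torch.no_grad():
    audio = vocoder.decode(mel)  # assumed method; (batch, n_samples) at 22.05 kHz
```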

## Training Details

### Training Data

The model was trained on 4 speech datasets:

| Dataset    | Language | Hours |
|------------|----------|-------|
| LibriTTS-R | en       | 585   |
| LJSpeech   | en       | 24    |
| Festcat    | ca       | 22    |
| OpenSLR69  | ca       | 5     |

### Training Procedure

The model was trained for 1M steps (96 epochs) with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 1e-4.
We also modified the mel-spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of the same configuration as the input mel spectrogram.
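
The sketch below shows what such a modified mel-reconstruction loss could look like. Only the 128-bin / 11025 Hz fmax setting comes from the description above; the n_fft and hop values are assumptions, and this is not the repository's exact implementation:

```python
import torch
import torchaudio

class MelSpectrogramLoss(torch.nn.Module):
    """L1 distance between log-mel spectrograms of generated and reference audio,
    computed with 128 mel bins and fmax = 11025 Hz rather than the 80-bin
    configuration used for the model's input features."""

    def __init__(self, sample_rate: int = 22050, n_fft: int = 1024,
                 hop_length: int = 256, n_mels: int = 128, f_max: float = 11025.0):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop_length,
            n_mels=n_mels, f_max=f_max, power=1.0,
        )

    def forward(self, generated: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        log_mel_gen = torch.log(torch.clamp(self.mel(generated), min=1e-5))
        log_mel_ref = torch.log(torch.clamp(self.mel(reference), min=1e-5))
        return torch.nn.functional.l1_loss(log_mel_gen, log_mel_ref)
```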

#### Training Hyperparameters

* initial_learning_rate: 5e-4
* scheduler: cosine without warmup or restarts (see the sketch below)
* mel_loss_coeff: 45
* mrd_loss_coeff: 0.1
* batch_size: 16
* num_samples: 16384
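
A minimal sketch of wiring these values together follows. The AdamW choice and the stand-in generator are assumptions, and the commented loss line only indicates how the two coefficients would weight a Vocos-style generator objective:

```python
import torch

max_steps = 1_000_000
initial_learning_rate = 5e-4
mel_loss_coeff = 45.0
mrd_loss_coeff = 0.1

generator = torch.nn.Linear(512, 256)  # stand-in for the actual generator
optimizer = torch.optim.AdamW(generator.parameters(), lr=initial_learning_rate)

# Cosine decay over the full run: no warmup, no restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)

# Per-step generator objective, Vocos-style (adversarial terms omitted for brevity):
# loss = mel_loss_coeff * mel_loss + mrd_loss_coeff * mrd_generator_loss + ...
```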

## Evaluation

Evaluation was done using the metrics from the original repo; after 183 epochs we achieve:

* val_loss: 3.79
* f1_score: 0.94
* mel_loss: 0.27
* periodicity_loss: 0.128
* pesq_score: 3.27
* pitch_loss: 31.33
* utmos_score: 3.20

## Citation

If this code contributes to your research, please cite the work:

```
@INPROCEEDINGS{10389765,
  author={Okamoto, Takuma and Yamashita, Haruki and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  title={WaveNeXt: ConvNeXt-Based Fast Neural Vocoder Without ISTFT layer},
  year={2023},
  pages={1-8},
  keywords={Fourier transforms;Vocoders;Conferences;Automatic speech recognition;ConvNext;end-to-end text-to-speech;linear layer-based upsampling;neural vocoder;Vocos},
  doi={10.1109/ASRU57964.2023.10389765}}

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## Additional Information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).