Pérez-González-de-Martos, Alejandro; Sanchis, Albert; Juan, Alfons. VRAIN-UPV MLLP's system for the Blizzard Challenge 2021. In: Proc. of Blizzard Challenge 2021, 2021.

@inproceedings{Pérez-González-de-Martos2021b,
title = {VRAIN-UPV MLLP's system for the Blizzard Challenge 2021},
author = {Alejandro Pérez-González-de-Martos and Albert Sanchis and Alfons Juan},
url = {http://hdl.handle.net/10251/192554
https://arxiv.org/abs/2110.15792
http://www.festvox.org/blizzard/blizzard2021.html},
year = {2021},
date = {2021-01-01},
booktitle = {Proc. of Blizzard Challenge 2021},
abstract = {This paper presents the VRAIN-UPV MLLP's speech synthesis system for the SH1 task of the Blizzard Challenge 2021. The SH1 task consisted of building a Spanish text-to-speech system trained on (but not limited to) the corpus released by the Blizzard Challenge 2021 organization. It included 5 hours of studio-quality recordings from a native Spanish female speaker. In our case, this dataset was solely used to build a two-stage neural text-to-speech pipeline composed of a non-autoregressive acoustic model with explicit duration modeling and a HiFi-GAN neural vocoder. Our team is identified as J in the evaluation results. Our system obtained very good results in the subjective evaluation tests. Only one of the other 11 participating systems achieved better naturalness than ours. Concretely, it achieved a naturalness MOS of 3.61 compared to 4.21 for real samples.},
keywords = {Blizzard Challenge, HiFi-GAN, text-to-speech},
pubstate = {published},
tppubtype = {inproceedings}
}