Pérez González de Martos, Alejandro. Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources. PhD Thesis, Universitat Politècnica de València, 2022. (Advisors: Alfons Juan Ciscar and Alberto Sanchis Navarro.)

@phdthesis{aperez2022,
title = {Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources},
author = {Pérez González de Martos, Alejandro},
url = {http://hdl.handle.net/10251/184019},
doi = {10.4995/Thesis/10251/184019},
year = {2022},
date = {2022-06-15},
school = {Universitat Politècnica de València},
note = {Advisors: Alfons Juan Ciscar and Alberto Sanchis Navarro},
keywords = {automatic dubbing, cross-lingual voice cloning, educational resources, simultaneous machine interpretation, text-to-speech},
pubstate = {published},
tppubtype = {phdthesis}
}
Pérez, Alejandro; Garcés Díaz-Munío, Gonçal; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Sanchis, Albert; Civera, Jorge; Jiménez, Manuel; Turró, Carlos; Juan, Alfons. Towards cross-lingual voice cloning in higher education. Journal Article, Engineering Applications of Artificial Intelligence, 105, article 104413, 2021.

@article{Pérez2021,
title = {Towards cross-lingual voice cloning in higher education},
author = {Pérez, Alejandro and Garcés Díaz-Munío, Gonçal and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Sanchis, Albert and Civera, Jorge and Jiménez, Manuel and Turró, Carlos and Juan, Alfons},
url = {https://doi.org/10.1016/j.engappai.2021.104413},
doi = {10.1016/j.engappai.2021.104413},
year = {2021},
date = {2021-10-01},
journal = {Engineering Applications of Artificial Intelligence},
volume = {105},
pages = {104413},
abstract = {The rapid progress of modern AI tools for automatic speech recognition and machine translation is leading to a progressive cost reduction to produce publishable subtitles for educational videos in multiple languages. Similarly, text-to-speech technology is experiencing large improvements in terms of quality, flexibility and capabilities. In particular, state-of-the-art systems are now capable of seamlessly dealing with multiple languages and speakers in an integrated manner, thus enabling lecturer's voice cloning in languages she/he might not even speak. This work is to report the experience gained on using such systems at the Universitat Politècnica de València (UPV), mainly as a guidance for other educational organizations willing to conduct similar studies. It builds on previous work on the UPV's main repository of educational videos, MediaUPV, to produce multilingual subtitles at scale and low cost. Here, a detailed account is given on how this work has been extended to also allow for massive machine dubbing of MediaUPV. This includes collecting 59 hours of clean speech data from UPV’s academic staff, and extending our production pipeline of subtitles with a state-of-the-art multilingual and multi-speaker text-to-speech system trained from the collected data. Our main result comes from an extensive, subjective evaluation of this system by lecturers contributing to data collection. In brief, it is shown that text-to-speech technology is not only mature enough for its application to MediaUPV, but also needed as soon as possible by students to improve its accessibility and bridge language barriers.},
keywords = {cross-lingual voice conversion, educational resources, multilinguality, OER, text-to-speech},
pubstate = {published},
tppubtype = {article}
}