2023 |
Baquero Arnal, Pau Transformer models for Machine Translation and Streaming Automatic Speech Recognition PhD Thesis Universitat Politècnica de València, 2023, (Advisors: Alfons Juan Ciscar and Hermann Ney). Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Neural Machine Translation, Transformer, Transformer Language Model @phdthesis{Arnal2023, title = {Transformer models for Machine Translation and Streaming Automatic Speech Recognition}, author = {Baquero Arnal, Pau}, url = {https://doi.org/10.4995/Thesis/10251/193680 https://www.upv.es/pls/oalu/sic_ted.mostrar_tesis?p_num_reg=12917}, year = {2023}, date = {2023-01-01}, school = {Universitat Politècnica de València}, abstract = {Natural language processing (NLP) is a set of fundamental computing prob- lems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architec- ture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilin- gual texts for a wide array of media. Particular applications pursued in this thesis are that of news translation or that of automatic live captioning of tele- vision broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing, for in- stance. These other applications fall out of the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. After- wards, it describes how we applied the developed technology for a specific use case in television captioning by participating in a competitive challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years including that of the Transformer language model adapted for streaming, and the data component was minor.}, note = {Advisors: Alfons Juan Ciscar and Hermann Ney}, keywords = {Automatic Speech Recognition, Neural Machine Translation, Transformer, Transformer Language Model}, pubstate = {published}, tppubtype = {phdthesis} } Natural language processing (NLP) is a set of fundamental computing prob- lems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architec- ture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilin- gual texts for a wide array of media. Particular applications pursued in this thesis are that of news translation or that of automatic live captioning of tele- vision broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing, for in- stance. These other applications fall out of the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. After- wards, it describes how we applied the developed technology for a specific use case in television captioning by participating in a competitive challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years including that of the Transformer language model adapted for streaming, and the data component was minor. |
2022 |
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article Applied Sciences, 12 (2), pp. 804, 2022. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Natural Language Processing, streaming @article{applsci1505192, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension}, author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.3390/app12020804}, year = {2022}, date = {2022-01-01}, journal = {Applied Sciences}, volume = {12}, number = {2}, pages = {804}, abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting in building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER points on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {article} } This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting in building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER points on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams. |
Iranzo-Sánchez, Javier; Jorge, Javier; Pérez-González-de-Martos, Alejandro; Giménez, Adrià; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks Inproceedings Proc. of 19th Intl. Workshop on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022. Abstract | Links | BibTeX | Tags: Simultaneous Speech Translation, speech-to-speech translation @inproceedings{Iranzo-Sánchez2022b, title = {MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks}, author = {Javier Iranzo-Sánchez and Javier Jorge and Alejandro Pérez-González-de-Martos and Adrià Giménez and Garcés Díaz-Munío, Gonçal V. and Pau Baquero-Arnal and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.18653/v1/2022.iwslt-1.22}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of 19th Intl. Workshop on Spoken Language Translation (IWSLT 2022)}, pages = {255--264}, address = {Dublin (Ireland)}, abstract = {This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach paying special attention to data preparation and decoding for streaming inference.}, keywords = {Simultaneous Speech Translation, speech-to-speech translation}, pubstate = {published}, tppubtype = {inproceedings} } This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We present our streaming-ready ASR, MT and TTS systems for Speech Translation and Synthesis from English into German. Our submission combines these systems by means of a cascade approach paying special attention to data preparation and decoding for streaming inference. |
Pérez González de Martos, Alejandro ; Giménez Pastor, Adrià ; Jorge Cano, Javier ; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V; Baquero-Arnal, Pau; Sanchis Navarro, Alberto ; Civera Sáiz, Jorge ; Juan Ciscar, Alfons ; Turró Ribalta, Carlos Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022. Abstract | Links | BibTeX | Tags: automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech @inproceedings{deMartos2022, title = {Doblaje automático de vídeo-charlas educativas en UPV[Media]}, author = {Pérez González de Martos, Alejandro AND Giménez Pastor, Adrià AND Jorge Cano, Javier AND Javier Iranzo-Sánchez AND Joan Albert Silvestre-Cerdà AND Garcés Díaz-Munío, Gonçal V. AND Pau Baquero-Arnal AND Sanchis Navarro, Alberto AND Civera Sáiz, Jorge AND Juan Ciscar, Alfons AND Turró Ribalta, Carlos}, doi = {10.4995/INRED2022.2022.15844}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022)}, pages = {557--570}, address = {València (Spain)}, abstract = {More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.}, keywords = {automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech}, pubstate = {published}, tppubtype = {inproceedings} } More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies including automatic speech recognition, machine translation and text-to-speech. In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies. |
2021 |
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Natural Language Processing, streaming @inproceedings{Jorge2021, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}, author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/IberSPEECH.2021-25}, year = {2021}, date = {2021-03-24}, booktitle = {Proc. of IberSPEECH 2021}, pages = {118--122}, address = {Valladolid (Spain)}, abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politecnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, a LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81+-0.09 seconds (mean+-stdev). This is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politecnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81+-0,09 segons (mitjana+-desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {inproceedings} } 1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politecnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, a LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81+-0.09 seconds (mean+-stdev). This is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politecnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius. D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81+-0,09 segons (mitjana+-desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu. |
Iranzo-Sánchez, Javier; Jorge, Javier; Baquero-Arnal, Pau; Silvestre-Cerdà, Joan Albert ; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142 , pp. 303–315, 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming @article{Iranzo-Sánchez2021, title = {Streaming cascade-based speech translation leveraged by a direct segmentation model}, author = {Javier Iranzo-Sánchez and Javier Jorge and Pau Baquero-Arnal and Silvestre-Cerdà, Joan Albert and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.1016/j.neunet.2021.05.013}, year = {2021}, date = {2021-01-01}, journal = {Neural Networks}, volume = {142}, pages = {303--315}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.}, keywords = {Automatic Speech Recognition, Cascade System, Deep Neural Networks, Hybrid System, Machine Translation, Segmentation Model, Speech Translation, streaming}, pubstate = {published}, tppubtype = {article} } The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. Nowadays, state-of-the-art ST systems are populated with deep neural networks that are conceived to work in an offline setup in which the audio input to be translated is fully available in advance. However, a streaming setup defines a completely different picture, in which an unbounded audio input gradually becomes available and at the same time the translation needs to be generated under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which neural-based models integrated in the ASR and MT components are carefully adapted in terms of their training and decoding procedures in order to run under a streaming setup. In addition, a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level is introduced to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system. |
Garcés Díaz-Munío, Gonçal V; Silvestre-Cerdà, Joan Albert ; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization @inproceedings{Garcés2021, title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization}, author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR}, doi = {10.21437/Interspeech.2021-1905}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, journal = {Proc. Interspeech 2021}, pages = {3695--3699}, address = {Brno (Czech Republic)}, abstract = {[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence. [CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.}, keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization}, pubstate = {published}, tppubtype = {inproceedings} } [EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence. [CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.
|
2020 |
Baquero-Arnal, Pau ; Jorge, Javier ; Giménez, Adrià ; Silvestre-Cerdà, Joan Albert ; Iranzo-Sánchez, Javier ; Sanchis, Albert ; Civera, Jorge ; Juan, Alfons Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020. Abstract | Links | BibTeX | Tags: hybrid ASR, language models, streaming, Transformer @inproceedings{Baquero-Arnal2020, title = {Improved Hybrid Streaming ASR with Transformer Language Models}, author = {Baquero-Arnal, Pau and Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Iranzo-Sánchez, Javier and Sanchis, Albert and Civera, Jorge and Juan, Alfons}, url = {http://dx.doi.org/10.21437/Interspeech.2020-2770}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020)}, pages = {2127--2131}, address = {Shanghai (China)}, abstract = {Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. ASR systems obtained in this work can be seamlessly transfered to a streaming setup with minimal quality losses. Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup.}, keywords = {hybrid ASR, language models, streaming, Transformer}, pubstate = {published}, tppubtype = {inproceedings} } Streaming ASR is gaining momentum due to its wide applicability, though it is still unclear how best to come close to the accuracy of state-of-the-art off-line ASR systems when the output must come within a short delay after the incoming audio stream. Following our previous work on streaming one-pass decoding with hybrid ASR systems and LSTM language models, in this work we report further improvements by replacing LSTMs with Transformer models. First, two key ideas are discussed so as to run these models fast during inference. Then, empirical results on LibriSpeech and TED-LIUM are provided showing that Transformer language models lead to improved recognition rates on both tasks. ASR systems obtained in this work can be seamlessly transfered to a streaming setup with minimal quality losses. Indeed, to the best of our knowledge, no better results have been reported on these tasks when assessed under a streaming setup. |
Iranzo-Sánchez, Javier; Giménez Pastor, Adrià ; Silvestre-Cerdà, Joan Albert; Baquero-Arnal, Pau; Saiz, Jorge Civera; Juan, Alfons Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. Abstract | Links | BibTeX | Tags: Segmentation, Speech Translation, streaming @inproceedings{Iranzo-Sánchez2020, title = {Direct Segmentation Models for Streaming Speech Translation}, author = {Javier Iranzo-Sánchez and Giménez Pastor, Adrià and Joan Albert Silvestre-Cerdà and Pau Baquero-Arnal and Jorge Civera Saiz and Alfons Juan}, url = {http://dx.doi.org/10.18653/v1/2020.emnlp-main.206}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)}, pages = {2599--2611}, abstract = {The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work.}, keywords = {Segmentation, Speech Translation, streaming}, pubstate = {published}, tppubtype = {inproceedings} } The cascade approach to Speech Translation (ST) is based on a pipeline that concatenates an Automatic Speech Recognition (ASR) system followed by a Machine Translation (MT) system. These systems are usually connected by a segmenter that splits the ASR output into, hopefully, semantically self-contained chunks to be fed into the MT system. This is especially challenging in the case of streaming ST, where latency requirements must also be taken into account. This work proposes novel segmentation models for streaming ST that incorporate not only textual, but also acoustic information to decide when the ASR output is split into a chunk. An extensive and thorough experimental setup is carried out on the Europarl-ST dataset to prove the contribution of acoustic information to the performance of the segmentation model in terms of BLEU score in a streaming ST scenario. Finally, comparative results with previous work also show the superiority of the segmentation models proposed in this work. |
2019 |
Baquero-Arnal, Pau ; Iranzo-Sánchez, Javier ; Civera, Jorge ; Juan, Alfons The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task Inproceedings Proc. of Fourth Conference on Machine Translation (WMT19), pp. 179-184, Florence (Italy), 2019. Abstract | Links | BibTeX | Tags: Machine Translation, Neural Machine Translation, WMT19 @inproceedings{Baquero-Arnal2019, title = {The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task}, author = {Baquero-Arnal, Pau and Iranzo-Sánchez, Javier and Civera, Jorge and Juan, Alfons}, url = {https://www.aclweb.org/anthology/W19-5423/ https://www.mllp.upv.es/wp-content/uploads/2019/09/poster-2.pdf}, year = {2019}, date = {2019-01-01}, booktitle = {Proc. of Fourth Conference on Machine Translation (WMT19)}, pages = {179-184}, address = {Florence (Italy)}, abstract = {This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 Similar Language Translation Shared Task. We have submitted systems for the Portuguese ↔ Spanish language pair, in both directions. They are based on the Transformer architecture, as well as on a novel architecture called 2D alternating RNN. Both systems have been domain adapted through fine-tuning which has been shown to be very effective.}, keywords = {Machine Translation, Neural Machine Translation, WMT19}, pubstate = {published}, tppubtype = {inproceedings} } This paper describes the participation of the MLLP research group of the Universitat Politècnica de València in the WMT 2019 Similar Language Translation Shared Task. We have submitted systems for the Portuguese ↔ Spanish language pair, in both directions. They are based on the Transformer architecture, as well as on a novel architecture called 2D alternating RNN. Both systems have been domain adapted through fine-tuning which has been shown to be very effective. |
2018 |
Valor Miró, Juan Daniel ; Baquero-Arnal, Pau; Civera, Jorge; Turró, Carlos; Juan, Alfons Multilingual videos for MOOCs and OER Journal Article Journal of Educational Technology & Society, 21 (2), pp. 1–12, 2018. Abstract | Links | BibTeX | Tags: Machine Translation, MOOCs, multilingual, Speech Recognition, video lecture repositories @article{Miró2018, title = {Multilingual videos for MOOCs and OER}, author = {Valor Miró, Juan Daniel and Pau Baquero-Arnal and Jorge Civera and Carlos Turró and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2019/11/JETS2018MLLP.pdf http://hdl.handle.net/10251/122577 https://www.jstor.org/stable/26388375 https://www.j-ets.net/collection/published-issues/21_2}, year = {2018}, date = {2018-01-01}, journal = {Journal of Educational Technology & Society}, volume = {21}, number = {2}, pages = {1--12}, abstract = {Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine translation (MT) techniques have reached a level of maturity which opens the possibility of producing multilingual video subtitles of publishable quality at low cost. This work summarizes authors' experience in exploring this possibility in two real-life case studies: a MOOC platform and a large video lecture repository. Apart from describing the systems, tools and integration components employed for such purpose, a comprehensive evaluation of the results achieved is provided in terms of quality and efficiency. More precisely, it is shown that draft multilingual subtitles produced by domain-adapted ASR/MT systems reach a level of accuracy that make them worth post-editing, instead of generating them ex novo, saving approximately 25%–75% of the time. Finally, the results reported on user multilingual data consumption reflect that multilingual subtitles have had a very positive impact in our case studies boosting student enrolment, in the case of the MOOC platform, by 70% relative.}, keywords = {Machine Translation, MOOCs, multilingual, Speech Recognition, video lecture repositories}, pubstate = {published}, tppubtype = {article} } Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine translation (MT) techniques have reached a level of maturity which opens the possibility of producing multilingual video subtitles of publishable quality at low cost. This work summarizes authors' experience in exploring this possibility in two real-life case studies: a MOOC platform and a large video lecture repository. Apart from describing the systems, tools and integration components employed for such purpose, a comprehensive evaluation of the results achieved is provided in terms of quality and efficiency. More precisely, it is shown that draft multilingual subtitles produced by domain-adapted ASR/MT systems reach a level of accuracy that make them worth post-editing, instead of generating them ex novo, saving approximately 25%–75% of the time. Finally, the results reported on user multilingual data consumption reflect that multilingual subtitles have had a very positive impact in our case studies boosting student enrolment, in the case of the MOOC platform, by 70% relative. |
Iranzo-Sánchez, Javier ; Baquero-Arnal, Pau ; Garcés Díaz-Munío, Gonçal V; Martínez-Villaronga, Adrià ; Civera, Jorge ; Juan, Alfons The MLLP-UPV German-English Machine Translation System for WMT18 Inproceedings Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers, pp. 422–428, Brussels (Belgium), 2018. Abstract | Links | BibTeX | Tags: Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation @inproceedings{Iranzo-Sánchez2018, title = {The MLLP-UPV German-English Machine Translation System for WMT18}, author = {Iranzo-Sánchez, Javier and Baquero-Arnal, Pau and Garcés Díaz-Munío, Gonçal V. and Martínez-Villaronga, Adrià and Civera, Jorge and Juan, Alfons}, url = {http://dx.doi.org/10.18653/v1/W18-6414 https://www.mllp.upv.es/wp-content/uploads/2018/11/wmt18_mllp-upv_poster.pdf}, year = {2018}, date = {2018-01-01}, booktitle = {Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers}, pages = {422--428}, address = {Brussels (Belgium)}, abstract = {[EN] This paper describes the statistical machine translation system built by the MLLP research group of Universitat Politècnica de València for the German>English news translation shared task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We used an ensemble of Transformer architecture–based neural machine translation systems. To train our system under "constrained" conditions, we filtered the provided parallel data with a scoring technique using character-based language models, and we added parallel data based on synthetic source sentences generated from the provided monolingual corpora. [CA] "El sistema de traducció automàtica alemany>anglés de l'MLLP-UPV per a WMT18": En aquest article descrivim el sistema de traducció automàtica estadística creat pel grup d'investigació MLLP de la Universitat Politècnica de València per a la competició de traducció de notícies alemany>anglés de la Third Conference on Machine Translation (WMT18, associada a la conferència EMNLP 2018). Hem utilitzat una combinació de sistemes de traducció automàtica neuronal basats en l'arquitectura Transformer. Per a entrenar el nostre sistema en la categoria "fitada" (només amb els corpus lingüístics oficials de la competició), hem filtrat les dades paral·leles disponibles amb una tècnica que assigna puntuacions utilitzant models de llenguatge de caràcters, i hem afegit dades paral·leles basades en frases d'origen sintètiques generades a partir dels corpus monolingües disponibles.}, keywords = {Data Selection, Machine Translation, Neural Machine Translation, WMT18 news translation}, pubstate = {published}, tppubtype = {inproceedings} } [EN] This paper describes the statistical machine translation system built by the MLLP research group of Universitat Politècnica de València for the German>English news translation shared task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We used an ensemble of Transformer architecture–based neural machine translation systems. To train our system under "constrained" conditions, we filtered the provided parallel data with a scoring technique using character-based language models, and we added parallel data based on synthetic source sentences generated from the provided monolingual corpora. [CA] "El sistema de traducció automàtica alemany>anglés de l'MLLP-UPV per a WMT18": En aquest article descrivim el sistema de traducció automàtica estadística creat pel grup d'investigació MLLP de la Universitat Politècnica de València per a la competició de traducció de notícies alemany>anglés de la Third Conference on Machine Translation (WMT18, associada a la conferència EMNLP 2018). Hem utilitzat una combinació de sistemes de traducció automàtica neuronal basats en l'arquitectura Transformer. Per a entrenar el nostre sistema en la categoria "fitada" (només amb els corpus lingüístics oficials de la competició), hem filtrat les dades paral·leles disponibles amb una tècnica que assigna puntuacions utilitzant models de llenguatge de caràcters, i hem afegit dades paral·leles basades en frases d'origen sintètiques generades a partir dels corpus monolingües disponibles. |
Publications
2023 |
Transformer models for Machine Translation and Streaming Automatic Speech Recognition PhD Thesis Universitat Politècnica de València, 2023, (Advisors: Alfons Juan Ciscar and Hermann Ney). |
2022 |
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension Journal Article Applied Sciences, 12 (2), pp. 804, 2022. |
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks Inproceedings Proc. of 19th Intl. Workshop on Spoken Language Translation (IWSLT 2022), pp. 255–264, Dublin (Ireland), 2022. |
Doblaje automático de vídeo-charlas educativas en UPV[Media] Inproceedings Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022. |
2021 |
MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge Inproceedings Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. |
Streaming cascade-based speech translation leveraged by a direct segmentation model Journal Article Neural Networks, 142 , pp. 303–315, 2021. |
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Inproceedings Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021. |
2020 |
Improved Hybrid Streaming ASR with Transformer Language Models Inproceedings Proc. of 21st Annual Conf. of the Intl. Speech Communication Association (InterSpeech 2020), pp. 2127–2131, Shanghai (China), 2020. |
Direct Segmentation Models for Streaming Speech Translation Inproceedings Proc. of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 2599–2611, 2020. |
2019 |
The MLLP-UPV Spanish-Portuguese and Portuguese-Spanish Machine Translation Systems for WMT19 Similar Language Translation Task Inproceedings Proc. of Fourth Conference on Machine Translation (WMT19), pp. 179-184, Florence (Italy), 2019. |
2018 |
Multilingual videos for MOOCs and OER Journal Article Journal of Educational Technology & Society, 21 (2), pp. 1–12, 2018. |
The MLLP-UPV German-English Machine Translation System for WMT18 Inproceedings Proc. of the Third Conference on Machine Translation (WMT18), Volume 2: Shared Task Papers, pp. 422–428, Brussels (Belgium), 2018. |