2025
Iranzo-Sánchez, Jorge; Santamaría-Jordà, Jaume; Mas-Mollà, Gerard; Garcés Díaz-Munío, Gonçal V.; Iranzo-Sánchez, Javier; Jorge, Javier; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons: Speech Translation for Multilingual Medical Education Leveraged by Large Language Models. Journal Article, Artificial Intelligence in Medicine, Forthcoming. @article{Iranzo-Sánchez2025, title = {Speech Translation for Multilingual Medical Education Leveraged by Large Language Models}, author = {Jorge Iranzo-Sánchez and Jaume Santamaría-Jordà and Gerard Mas-Mollà and Garcés Díaz-Munío, Gonçal V. and Javier Iranzo-Sánchez and Javier Jorge and Joan Albert Silvestre-Cerdà and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan}, year = {2025}, date = {2025-01-01}, journal = {Artificial Intelligence in Medicine}, abstract = {The application of large language models (LLMs) to speech translation (ST), or more generally to machine translation (MT), has recently provided excellent results, surpassing conventional encoder-decoder MT systems in the general domain. However, this is not clearly the case when LLMs are used as MT systems to translate medical-related materials. In this respect, the provision of multilingual training materials for oncology professionals is a goal of the EU project Interact-Europe, in which this work was framed. To this end, cross-language technology adapted to the oncology domain was developed, evaluated and deployed for multilingual interspeciality medical education. More precisely, automatic speech recognition (ASR) and MT models were adapted to the oncology domain to translate English pre-recorded training videos, kindly provided by the European School of Oncology (ESO), into French, Spanish, German and Slovene.
In this work, three categories of MT models adapted to the medical domain were assessed: bilingual encoder-decoder MT models trained from scratch, pre-trained large multilingual encoder-decoder MT models, and multilingual decoder-only LLMs. The experimental results underline the competitiveness of LLMs in translation quality compared to encoder-decoder MT models. Finally, the ESO speech dataset, comprising roughly 1,000 videos and 745 hours for the training and evaluation of ASR and MT models, was publicly released for the scientific community.}, keywords = {Automatic Speech Recognition, domain adaptation, large language models, Machine Translation, oncology, Speech Translation}, pubstate = {forthcoming}, tppubtype = {article} }
2022
Baquero-Arnal, Pau; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons: MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension. Journal Article, Applied Sciences, 12 (2), pp. 804, 2022. @article{applsci1505192, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge: Extension}, author = {Pau Baquero-Arnal and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.3390/app12020804}, year = {2022}, date = {2022-01-01}, journal = {Applied Sciences}, volume = {12}, number = {2}, pages = {804}, abstract = {This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzín-RTVE 2020 Speech-to-Text Challenge, and includes an extension of the work consisting of building and evaluating equivalent systems under the closed data conditions from the 2018 challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid ASR system using streaming one-pass decoding with a context window of 1.5 seconds. This system achieved 16.0% WER on the test-2020 set. We also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t which, following a similar configuration as the primary system with a smaller context window of 0.6 s, scored 16.9% WER on the same test set, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative.
As an extension, the equivalent closed-condition systems obtained 23.3% WER and 23.5% WER, respectively. When evaluated with an unconstrained language model, we obtained 19.9% WER and 20.4% WER; i.e., not far behind the top-performing systems with only 5% of the full acoustic data and with the extra ability of being streaming-capable. Indeed, all of these streaming systems could be put into production environments for automatic captioning of live media streams.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {article} }
Pérez González de Martos, Alejandro; Giménez Pastor, Adrià; Jorge Cano, Javier; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Garcés Díaz-Munío, Gonçal V.; Baquero-Arnal, Pau; Sanchis Navarro, Alberto; Civera Sáiz, Jorge; Juan Ciscar, Alfons; Turró Ribalta, Carlos: Doblaje automático de vídeo-charlas educativas en UPV[Media]. Inproceedings, Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022), pp. 557–570, València (Spain), 2022. @inproceedings{deMartos2022, title = {Doblaje automático de vídeo-charlas educativas en UPV[Media]}, author = {Pérez González de Martos, Alejandro and Giménez Pastor, Adrià and Jorge Cano, Javier and Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Garcés Díaz-Munío, Gonçal V. and Pau Baquero-Arnal and Sanchis Navarro, Alberto and Civera Sáiz, Jorge and Juan Ciscar, Alfons and Turró Ribalta, Carlos}, doi = {10.4995/INRED2022.2022.15844}, year = {2022}, date = {2022-01-01}, booktitle = {Proc. of VIII Congrés d'Innovació Educativa i Docència en Xarxa (IN-RED 2022)}, pages = {557--570}, address = {València (Spain)}, abstract = {More and more universities are banking on the production of digital content to support online or blended learning in higher education. Over the last years, the MLLP research group has been working closely with the UPV's ASIC media services in order to enrich educational multimedia resources through the application of natural language processing technologies, including automatic speech recognition, machine translation and text-to-speech.
In this work, we present the steps that are being followed for the comprehensive translation of these materials, specifically through (semi-)automatic dubbing by making use of state-of-the-art speaker-adaptive text-to-speech technologies.}, keywords = {automatic dubbing, Automatic Speech Recognition, Machine Translation, OER, text-to-speech}, pubstate = {published}, tppubtype = {inproceedings} }
2021
Jorge, Javier; Giménez, Adrià; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons: Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models. Journal Article, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, pp. 148–161, 2021. @article{Jorge2021b, title = {Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models}, author = {Jorge, Javier and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Civera, Jorge and Sanchis, Albert and Juan, Alfons}, doi = {10.1109/TASLP.2021.3133216}, year = {2021}, date = {2021-11-23}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume = {30}, pages = {148--161}, abstract = {Although Long Short-Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how best offline systems can be adapted to work with them under the streaming setup. After gaining considerable experience in this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal performance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques.
Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows from a large Spanish broadcasting station.}, keywords = {acoustic modelling, Automatic Speech Recognition, decoding, language modelling, neural networks, streaming}, pubstate = {published}, tppubtype = {article} }
Jorge, Javier; Giménez, Adrià; Baquero-Arnal, Pau; Iranzo-Sánchez, Javier; Pérez-González-de-Martos, Alejandro; Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons: MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge. Inproceedings, Proc. of IberSPEECH 2021, pp. 118–122, Valladolid (Spain), 2021. @inproceedings{Jorge2021, title = {MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}, author = {Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez-González-de-Martos and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, doi = {10.21437/IberSPEECH.2021-25}, year = {2021}, date = {2021-03-24}, booktitle = {Proc. of IberSPEECH 2021}, pages = {118--122}, address = {Valladolid (Spain)}, abstract = {1st place in IberSpeech-RTVE 2020 TV Speech-to-Text Challenge. [EN] This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively.
As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds. Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2-streaming_600ms_t that, following the same configuration as the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER, respectively, on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative. [CA] "Sistemes de reconeixement automàtic de la parla en castellà de MLLP-VRAIN per a la competició Albayzin-RTVE 2020 Speech-To-Text Challenge": En aquest article, es descriuen els sistemes de reconeixement automàtic de la parla (RAP) creats pel grup d'investigació MLLP-VRAIN de la Universitat Politècnica de València per a la competició Albayzin-RTVE 2020 Speech-to-Text Challenge. El sistema primari (p-streaming_1500ms_nlt) és un sistema de RAP híbrid BLSTM-HMM amb descodificació en temps real en una passada amb una finestra de context d'1,5 segons i una combinació lineal de models de llenguatge (ML) d'n-grames, LSTM i Transformer. El model acústic s'ha entrenat amb vora 4000 hores de parla transcrita de diferents fonts, usant el transLectures-UPV toolkit (TLK) del grup MLLP i TensorFlow; mentre que els ML s'han entrenat amb SRILM (n-grames), CUED-RNNLM (LSTM) i Fairseq (Transformer), amb 102G paraules (tokens). Aquest sistema ha obtingut 11,6 % i 16,0 % de WER en els conjunts test-2018 i test-2020, respectivament. És un sistema amb capacitat de temps real, que pot desplegar-se en producció per a subtitulació automàtica de fluxos audiovisuals en directe, amb un retard teòric d'1,5 segons. A banda del sistema primari, s'han presentat tres sistemes contrastius.
D'aquests, destaquem el sistema c2-streaming_600ms_t que, amb la mateixa configuració que el sistema primari, però amb una finestra de context més reduïda de 0,6 segons i un ML Transformer, ha obtingut 12,3 % i 16,9 % de WER, respectivament, sobre els mateixos conjunts, amb una latència empírica mesurada de 0,81±0,09 segons (mitjana±desv). És a dir, s'han obtingut latències punteres per a subtitulació automàtica en directe d'alta qualitat amb una degradació del WER petita, del 6 % relatiu.}, keywords = {Automatic Speech Recognition, Natural Language Processing, streaming}, pubstate = {published}, tppubtype = {inproceedings} }
Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons: Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. Inproceedings, Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021. @inproceedings{Garcés2021, title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization}, author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR}, doi = {10.21437/Interspeech.2021-1905}, year = {2021}, date = {2021-01-01}, booktitle = {Proc. Interspeech 2021}, pages = {3695--3699}, address = {Brno (Czech Republic)}, abstract = {[EN] We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned.
As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence. [CA] "Europarl-ASR: Un extens corpus parlamentari de referència per a reconeixement de la parla i filtratge/literalització de transcripcions": Presentem Europarl-ASR, un extens corpus de veu i text de debats parlamentaris amb 1300 hores d'intervencions transcrites i 70 milions de paraules de text en anglés extrets de sessions del Parlament Europeu. Les transcripcions oficials del Parlament Europeu, no literals, s'han sincronitzat per a tot el conjunt d'entrenament. Com que l'entrenament de models acústics requereix transcripcions com més literals millor, també s'han inclòs transcripcions filtrades i transcripcions literalitzades de totes les intervencions, basades en tècniques de filtratge i literalització automàtics. A més, s'han inclòs 18 hores de transcripcions literals revisades manualment per definir dos conjunts de validació i avaluació de referència per a reconeixement automàtic de la parla en temps real, amb oradors coneguts i amb oradors desconeguts. Pel fet de disposar de transcripcions literals i no literals, aquest corpus és també ideal per a l'anàlisi de tècniques de filtratge i de literalització. 
En aquest article, es descriu la creació del corpus i es proporcionen mesures de referència de reconeixement automàtic de la parla en temps real i en diferit, amb oradors coneguts i amb oradors desconeguts, usant els tres conjunts de transcripcions d'entrenament. El corpus es fa públic amb una llicència oberta.}, keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization}, pubstate = {published}, tppubtype = {inproceedings} }
2020
Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchis, Albert; Juan, Alfons: LSTM-Based One-Pass Decoder for Low-Latency Streaming. Inproceedings, Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 7814–7818, Barcelona (Spain), 2020. @inproceedings{Jorge2020, title = {LSTM-Based One-Pass Decoder for Low-Latency Streaming}, author = {Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.mllp.upv.es/wp-content/uploads/2020/01/jorge2020_preprint.pdf https://doi.org/10.1109/ICASSP40776.2020.9054267}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {7814--7818}, address = {Barcelona (Spain)}, abstract = {Current state-of-the-art models based on Long Short-Term Memory (LSTM) networks have been extensively used in ASR to improve performance. However, using LSTMs under a streaming setup is not straightforward due to real-time constraints. In this paper we present a novel streaming decoder that includes a bidirectional LSTM acoustic model as well as a unidirectional LSTM language model to perform the decoding efficiently while keeping the performance comparable to that of an off-line setup. We perform one-pass decoding using a sliding window scheme for a bidirectional LSTM acoustic model and an LSTM language model. This has been implemented and assessed under a pure streaming setup, and deployed into our production systems.
We report WER and latency figures for the well-known LibriSpeech and TED-LIUM tasks, obtaining competitive WER results with low-latency responses.}, keywords = {acoustic modeling, Automatic Speech Recognition, decoding, Language Modeling, streaming}, pubstate = {published}, tppubtype = {inproceedings} }
Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Roselló, Nahuel; Giménez, Adrià; Sanchis, Albert; Civera, Jorge; Juan, Alfons: Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. Inproceedings, Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020), pp. 8229–8233, Barcelona (Spain), 2020. @inproceedings{Iranzo2020, title = {Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates}, author = {Javier Iranzo-Sánchez and Joan Albert Silvestre-Cerdà and Javier Jorge and Nahuel Roselló and Adrià Giménez and Albert Sanchis and Jorge Civera and Alfons Juan}, url = {https://arxiv.org/abs/1911.03167 https://paperswithcode.com/paper/europarl-st-a-multilingual-corpus-for-speech https://www.mllp.upv.es/europarl-st/}, doi = {10.1109/ICASSP40776.2020.9054626}, year = {2020}, date = {2020-01-01}, booktitle = {Proc. of 45th Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2020)}, pages = {8229--8233}, address = {Barcelona (Spain)}, abstract = {Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the debates held in the European Parliament in the period between 2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition, machine translation and spoken language translation experiments that highlight the potential of this new resource.
The corpus is released under a Creative Commons license and is freely accessible and downloadable.}, keywords = {Automatic Speech Recognition, Machine Translation, Multilingual Corpus, Speech Translation, Spoken Language Translation}, pubstate = {published}, tppubtype = {inproceedings} } Current research into spoken language translation (SLT), or speech-to-text translation, is often hampered by the lack of specific data resources for this task, as currently available SLT datasets are restricted to a limited set of language pairs. In this paper we present Europarl-ST, a novel multilingual SLT corpus containing paired audio-text samples for SLT from and into 6 European languages, for a total of 30 different translation directions. This corpus has been compiled using the de-bates held in the European Parliament in the period between2008 and 2012. This paper describes the corpus creation process and presents a series of automatic speech recognition,machine translation and spoken language translation experiments that highlight the potential of this new resource. The corpus is released under a Creative Commons license and is freely accessible and downloadable. |
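As a quick sanity check of the figures in this abstract: six languages translated from and into each other yield exactly 30 ordered translation directions. The minimal sketch below assumes a plausible six-language set for illustration only (see the corpus website linked above for the authoritative list).

```python
from itertools import permutations

# Hypothetical six-language set, assumed here for illustration only.
LANGS = ["en", "fr", "de", "it", "es", "pt"]

# Every ordered (source, target) pair is one translation direction.
directions = list(permutations(LANGS, 2))
print(len(directions))  # 6 * 5 = 30 directions, matching the abstract
```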
2019 |
Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Civera, Jorge; Sanchis, Albert; Juan, Alfons Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models Inproceedings Proc. of the 20th Annual Conf. of the ISCA (Interspeech 2019), pp. 3820–3824, Graz (Austria), 2019. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, LSTM language models, one-pass decoding, real-time @inproceedings{Jorge2019, title = {Real-time One-pass Decoder for Speech Recognition Using LSTM Language Models}, author = {Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Jorge Civera and Albert Sanchis and Alfons Juan}, url = {https://www.isca-speech.org/archive/interspeech_2019/jorge19_interspeech.html}, year = {2019}, date = {2019-01-01}, booktitle = {Proc. of the 20th Annual Conf. of the ISCA (Interspeech 2019)}, pages = {3820--3824}, address = {Graz (Austria)}, abstract = {Recurrent Neural Networks, in particular Long-Short Term Memory (LSTM) networks, are widely used in Automatic Speech Recognition for language modelling during decoding, usually as a mechanism for rescoring hypotheses. This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables. These static tables were precomputed from a pruned n-gram model, drastically reducing the computational cost during decoding. Additionally, the LSTM language model evaluation was efficiently performed using Variance Regularization along with a strategy of lazy evaluation. The proposed one-pass decoder architecture was evaluated on the well-known LibriSpeech and TED-LIUMv3 datasets. Results showed that the proposed algorithm obtains very competitive WERs with ∼0.6 RTFs.
Finally, our one-pass decoder is compared with a decoupled two-pass decoder.}, keywords = {Automatic Speech Recognition, LSTM language models, one-pass decoding, real-time}, pubstate = {published}, tppubtype = {inproceedings} } Recurrent Neural Networks, in particular Long-Short Term Memory (LSTM) networks, are widely used in Automatic Speech Recognition for language modelling during decoding, usually as a mechanism for rescoring hypotheses. This paper proposes a new architecture to perform real-time one-pass decoding using LSTM language models. To make decoding efficient, the estimation of look-ahead scores was accelerated by precomputing static look-ahead tables. These static tables were precomputed from a pruned n-gram model, drastically reducing the computational cost during decoding. Additionally, the LSTM language model evaluation was efficiently performed using Variance Regularization along with a strategy of lazy evaluation. The proposed one-pass decoder architecture was evaluated on the well-known LibriSpeech and TED-LIUMv3 datasets. Results showed that the proposed algorithm obtains very competitive WERs with ∼0.6 RTFs. Finally, our one-pass decoder is compared with a decoupled two-pass decoder. |
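The idea behind the static look-ahead tables mentioned in this abstract can be illustrated with a toy sketch (not the paper's implementation — the model, scores and table layout below are invented for illustration): each word prefix maps to the best log-probability of any vocabulary word completing it, so during decoding a look-ahead score becomes a constant-time table lookup instead of a neural LM evaluation.

```python
import math

# Toy pruned unigram model with assumed, illustrative log-probabilities.
unigram_logprobs = {
    "the": math.log(0.05),
    "there": math.log(0.01),
    "speech": math.log(0.002),
}

# Precompute a static look-ahead table: for every prefix of every word,
# keep the best log-probability among its possible completions.
lookahead = {}
for word, lp in unigram_logprobs.items():
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if lp > lookahead.get(prefix, float("-inf")):
            lookahead[prefix] = lp

# A partial hypothesis ending in the prefix "th" gets the score of its
# best completion ("the") via a dictionary lookup.
print(lookahead["th"] == unigram_logprobs["the"])  # True
```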
2018 |
Del-Agua, Miguel Ángel ; Giménez, Adrià ; Sanchis, Alberto ; Civera, Jorge; Juan, Alfons Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks Journal Article IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26 (7), pp. 1194–1202, 2018. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Confidence estimation, Confidence measures, Deep bidirectional recurrent neural networks, Long short-term memory, Speaker adaptation @article{Del-Agua2018, title = {Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks}, author = {Del-Agua, Miguel Ángel AND Giménez, Adrià AND Sanchis, Alberto AND Civera, Jorge AND Juan, Alfons}, url = {http://www.mllp.upv.es/wp-content/uploads/2018/04/Del-Agua2018_authors_version.pdf https://doi.org/10.1109/TASLP.2018.2819900}, year = {2018}, date = {2018-01-01}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume = {26}, number = {7}, pages = {1194--1202}, abstract = {In recent years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. In accordance with these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks.
Secondly, we show new results on speaker-adapted confidence measures considering a multi-task framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts. Lastly, we describe an unsupervised adaptation method of the acoustic DBLSTM model based on confidence measures which results in better automatic speech recognition performance.}, keywords = {Automatic Speech Recognition, Confidence estimation, Confidence measures, Deep bidirectional recurrent neural networks, Long short-term memory, Speaker adaptation}, pubstate = {published}, tppubtype = {article} } In recent years, Deep Bidirectional Recurrent Neural Networks (DBRNN) and DBRNN with Long Short-Term Memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. In accordance with these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. Firstly, we present new empirical evidence of the superiority of RNN-based confidence classifiers evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks. Secondly, we show new results on speaker-adapted confidence measures considering a multi-task framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts.
Lastly, we describe an unsupervised adaptation method of the acoustic DBLSTM model based on confidence measures which results in better automatic speech recognition performance. |
Jorge, Javier ; Martínez-Villaronga, Adrià ; Golik, Pavel ; Giménez, Adrià ; Silvestre-Cerdà, Joan Albert ; Doetsch, Patrick ; Císcar, Vicent Andreu ; Ney, Hermann ; Juan, Alfons ; Sanchis, Albert MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge Inproceedings Proc. of IberSPEECH 2018: 10th Jornadas en Tecnologías del Habla and 6th Iberian SLTech Workshop, pp. 257–261, Barcelona (Spain), 2018. Abstract | Links | BibTeX | Tags: Automatic Speech Recognition, Iberspeech-RTVE-Challenge2018, IberSpeech2018, Speech-to-Text @inproceedings{Jorge2018, title = {MLLP-UPV and RWTH Aachen Spanish ASR Systems for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge}, author = {Jorge, Javier and Martínez-Villaronga, Adrià and Golik, Pavel and Giménez, Adrià and Silvestre-Cerdà, Joan Albert and Doetsch, Patrick and Císcar, Vicent Andreu and Ney, Hermann and Juan, Alfons and Sanchis, Albert}, doi = {10.21437/IberSPEECH.2018-54}, year = {2018}, date = {2018-01-01}, booktitle = {Proc. of IberSPEECH 2018: 10th Jornadas en Tecnologías del Habla and 6th Iberian SLTech Workshop}, pages = {257--261}, address = {Barcelona (Spain)}, abstract = {This paper describes the Automatic Speech Recognition systems built by the MLLP research group of Universitat Politècnica de València and the HLTPR research group of RWTH Aachen for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge. We participated in both the closed and the open training conditions. The best system built for the closed conditions was a hybrid BLSTM-HMM ASR system using one-pass decoding with a combination of an RNN LM and show-adapted n-gram LMs. It was trained on a set of reliable speech data extracted from the train and dev1 sets using the MLLP’s transLectures-UPV toolkit (TLK) and TensorFlow. This system achieved 20.0% WER on the dev2 set. For the open conditions, we used approx. 
3800 hours of out-of-domain training data from multiple sources and trained a one-pass hybrid BLSTM-HMM ASR system using the open-source tools RASR and RETURNN developed at RWTH Aachen. This system scored 15.6% WER on the dev2 set. The highlights of these systems include robust speech data filtering for acoustic model training and show-specific language modelling.}, keywords = {Automatic Speech Recognition, Iberspeech-RTVE-Challenge2018, IberSpeech2018, Speech-to-Text}, pubstate = {published}, tppubtype = {inproceedings} } This paper describes the Automatic Speech Recognition systems built by the MLLP research group of Universitat Politècnica de València and the HLTPR research group of RWTH Aachen for the IberSpeech-RTVE 2018 Speech-to-Text Transcription Challenge. We participated in both the closed and the open training conditions. The best system built for the closed conditions was a hybrid BLSTM-HMM ASR system using one-pass decoding with a combination of an RNN LM and show-adapted n-gram LMs. It was trained on a set of reliable speech data extracted from the train and dev1 sets using the MLLP’s transLectures-UPV toolkit (TLK) and TensorFlow. This system achieved 20.0% WER on the dev2 set. For the open conditions, we used approx. 3800 hours of out-of-domain training data from multiple sources and trained a one-pass hybrid BLSTM-HMM ASR system using the open-source tools RASR and RETURNN developed at RWTH Aachen. This system scored 15.6% WER on the dev2 set. The highlights of these systems include robust speech data filtering for acoustic model training and show-specific language modelling. |
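The systems in this entry are compared in terms of word error rate (WER), e.g. the 20.0% and 15.6% figures on the dev2 set. As a reference point, here is a minimal generic WER implementation via word-level Levenshtein distance (a sketch, not the challenge's official scoring tool):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```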
2012 |
Silvestre-Cerdà, Joan Albert ; Del Agua, Miguel ; Garcés, Gonçal; Gascó, Guillem; Giménez-Pastor, Adrià; Martínez, Adrià; Pérez González de Martos, Alejandro ; Sánchez, Isaías; Serrano Martínez-Santos, Nicolás ; Spencer, Rachel; Valor Miró, Juan Daniel ; Andrés-Ferrer, Jesús; Civera, Jorge; Sanchís, Alberto; Juan, Alfons transLectures Inproceedings Proceedings (Online) of IberSPEECH 2012, pp. 345–351, Madrid (Spain), 2012. Abstract | Links | BibTeX | Tags: Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures @inproceedings{Silvestre-Cerdà2012b, title = {transLectures}, author = {Silvestre-Cerdà, Joan Albert and Del Agua, Miguel and Gonçal Garcés and Guillem Gascó and Adrià Giménez-Pastor and Adrià Martínez and Pérez González de Martos, Alejandro and Isaías Sánchez and Serrano Martínez-Santos, Nicolás and Rachel Spencer and Valor Miró, Juan Daniel and Jesús Andrés-Ferrer and Jorge Civera and Alberto Sanchís and Alfons Juan}, url = {http://hdl.handle.net/10251/37290 http://lorien.die.upm.es/~lapiz/rtth/JORNADAS/VII/IberSPEECH2012_OnlineProceedings.pdf https://web.archive.org/web/20130609073144/http://iberspeech2012.ii.uam.es/IberSPEECH2012_OnlineProceedings.pdf http://www.mllp.upv.es/wp-content/uploads/2015/04/1209IberSpeech.pdf}, year = {2012}, date = {2012-11-22}, booktitle = {Proceedings (Online) of IberSPEECH 2012}, pages = {345--351}, address = {Madrid (Spain)}, abstract = {[EN] transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. 
This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMèdia. The first results obtained by the UPV group for the poliMedia repository will also be provided. [CA] transLectures (Transcription and Translation of Video Lectures) és un projecte del 7PM de la Unió Europea en el qual s'estan posant a prova tècniques avançades de reconeixement automàtic de la parla i de traducció automàtica sobre grans repositoris digitals de vídeos docents. El projecte començà al novembre de 2011 i tindrà una duració de tres anys. En aquest article exposem la motivació i els objectius del projecte, i descrivim breument els dos repositoris principals sobre els quals es treballa: VideoLectures.NET i poliMèdia. També oferim els primers resultats obtinguts per l'equip de la UPV al repositori poliMèdia.}, keywords = {Accessibility, Automatic Speech Recognition, Education, Intelligent Interaction, Language Technologies, Machine Translation, Massive Adaptation, Multilingualism, Opencast Matterhorn, Video Lectures}, pubstate = {published}, tppubtype = {inproceedings} } [EN] transLectures (Transcription and Translation of Video Lectures) is an EU STREP project in which advanced automatic speech recognition and machine translation techniques are being tested on large video lecture repositories. The project began in November 2011 and will run for three years. This paper will outline the project's main motivation and objectives, and give a brief description of the two main repositories being considered: VideoLectures.NET and poliMèdia. The first results obtained by the UPV group for the poliMedia repository will also be provided. 
[CA] transLectures (Transcription and Translation of Video Lectures) és un projecte del 7PM de la Unió Europea en el qual s'estan posant a prova tècniques avançades de reconeixement automàtic de la parla i de traducció automàtica sobre grans repositoris digitals de vídeos docents. El projecte començà al novembre de 2011 i tindrà una duració de tres anys. En aquest article exposem la motivació i els objectius del projecte, i descrivim breument els dos repositoris principals sobre els quals es treballa: VideoLectures.NET i poliMèdia. També oferim els primers resultats obtinguts per l'equip de la UPV al repositori poliMèdia. |