2025
Iranzo-Sánchez, Jorge; Santamaría-Jordà, Jaume; Mas-Mollà, Gerard; Garcés Díaz-Munío, Gonçal V.; Iranzo-Sánchez, Javier; Jorge, Javier; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Civera, Jorge; Sanchis, Albert; Juan, Alfons:
Speech Translation for Multilingual Medical Education Leveraged by Large Language Models
Journal Article. Artificial Intelligence in Medicine, 166, p. 103147, 2025. DOI: 10.1016/j.artmed.2025.103147
Tags: Automatic Speech Recognition, domain adaptation, large language models, Machine Translation, oncology, Speech Translation

Abstract: The application of large language models (LLMs) to speech translation (ST), and more generally to machine translation (MT), has recently produced excellent results, superseding conventional encoder-decoder MT systems in the general domain. However, it is not clear that this holds when LLMs are used as MT systems to translate medical materials. In this respect, the provision of multilingual training materials for oncology professionals is a goal of the EU project Interact-Europe, in which this work was framed. To this end, cross-language technology adapted to the oncology domain was developed, evaluated and deployed for multilingual interspeciality medical education. More precisely, automatic speech recognition (ASR) and MT models were adapted to the oncology domain to translate English pre-recorded training videos, kindly provided by the European School of Oncology (ESO), into French, Spanish, German and Slovene. In this work, three categories of MT models adapted to the medical domain were assessed: bilingual encoder-decoder MT models trained from scratch, pre-trained large multilingual encoder-decoder MT models, and multilingual decoder-only LLMs. The experimental results underline the competitiveness in translation quality of LLMs compared to encoder-decoder MT models. Finally, the ESO speech dataset, comprising roughly 1,000 videos and 745 hours for the training and evaluation of ASR and MT models, was publicly released for the scientific community.
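As context for the comparison described in this abstract (encoder-decoder MT models versus decoder-only LLMs on in-domain medical text), the following is a minimal sketch, not taken from the paper, of how the outputs of two such systems might be scored side by side on a held-out oncology test set with sacreBLEU; the file names are hypothetical placeholders.

```python
# Minimal sketch (not from the paper): comparing two MT systems' outputs
# on a shared medical test set with sacreBLEU. File names are placeholders.
from sacrebleu.metrics import BLEU, CHRF

def read_lines(path):
    # One segment per line, aligned across reference and hypothesis files.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("test.oncology.fr.ref")            # reference translations (hypothetical file)
hyp_encdec = read_lines("test.oncology.fr.encdec")   # encoder-decoder MT output
hyp_llm = read_lines("test.oncology.fr.llm")         # decoder-only LLM output

bleu, chrf = BLEU(), CHRF()
for name, hyp in [("encoder-decoder", hyp_encdec), ("decoder-only LLM", hyp_llm)]:
    print(name)
    print("  ", bleu.corpus_score(hyp, [refs]))
    print("  ", chrf.corpus_score(hyp, [refs]))
```

The same loop could be extended with neural metrics such as COMET; the sketch only illustrates the side-by-side evaluation setup, not the adaptation or deployment work reported in the paper.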
Santamaría-Jordà, Jaume; Segovia-Martínez, Pablo; Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Gaspar Aparicio, Rubén; Fernández Sánchez, René; Civera, Jorge; Sanchis, Albert; Juan, Alfons:
LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking
Inproceedings (Forthcoming). Interspeech 2025, Rotterdam (Netherlands).
Tags: Automatic Speech Recognition, domain adaptation, manual transcription, pseudo-labelling, speech corpus

Abstract: We present LHCP-ASR, an English speech corpus of high-energy particle physics talks, with 235 hours of transcribed speech extracted from the 2020--2022 Large Hadron Collider Physics (LHCP) conferences, plus 1.5G tokens of in-domain text extracted from scientific documents. About 30 hours of conference talks were manually transcribed to build two reliable tasks for narrow-domain ASR benchmarking. The remaining conference talks (205 hours) were pseudo-labelled using a very competitive in-domain ASR system, in order to build a dataset for training or adaptation purposes. This paper describes the creation of the dataset and provides first reference WER figures using OpenAI's Whisper models and our in-domain ASR system, achieving 13.6% and 15.0% WER on the two test sets. This corpus is publicly released under an open licence. We believe it will fulfil the need in the area for new open, reliable, real-life and challenging ASR benchmarks.
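For the reference WER figures mentioned in this abstract, here is a minimal sketch, not the paper's evaluation pipeline, of transcribing a single talk with an off-the-shelf OpenAI Whisper checkpoint and scoring it against a manual transcript with jiwer; the file names, the checkpoint choice and the naive lower-casing normalisation are assumptions for illustration only.

```python
# Minimal sketch (not the paper's pipeline): transcribe one talk with an
# off-the-shelf Whisper checkpoint and score it against a manual transcript.
# File names and the checkpoint are placeholders.
import whisper
import jiwer

model = whisper.load_model("large-v2")                     # any Whisper checkpoint
result = model.transcribe("lhcp_talk.wav", language="en")  # hypothetical audio file
hypothesis = result["text"]

with open("lhcp_talk.ref.txt", encoding="utf-8") as f:     # hypothetical manual transcript
    reference = f.read()

# Naive normalisation; a real benchmark defines text normalisation carefully.
wer = jiwer.wer(reference.lower(), hypothesis.lower())
print(f"WER: {100 * wer:.1f}%")
```

Corpus-level figures like those reported in the paper would aggregate errors over all test-set talks with a fixed normalisation scheme, rather than averaging per-file WERs.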