2025
@article{Iranzo-Sánchez2025,
  title     = {Speech Translation for Multilingual Medical Education Leveraged by Large Language Models},
  author    = {Jorge Iranzo-Sánchez and Jaume Santamaría-Jordà and Gerard Mas-Mollà and Gonçal V. Garcés Díaz-Munío and Javier Iranzo-Sánchez and Javier Jorge and Joan Albert Silvestre-Cerdà and Adrià Giménez and Jorge Civera and Albert Sanchis and Alfons Juan},
  doi       = {10.1016/j.artmed.2025.103147},
  year      = {2025},
  date      = {2025-01-01},
  journal   = {Artificial Intelligence in Medicine},
  volume    = {166},
  pages     = {103147},
  abstract  = {The application of large language models (LLMs) to speech translation (ST), or in general, to machine translation (MT), has recently provided excellent results superseding conventional encoder-decoder MT systems in the general domain. However, this is not clearly the case when LLMs as MT systems are translating medical-related materials. In this respect, the provision of multilingual training materials for oncology professionals is a goal of the EU project Interact-Europe in which this work was framed. To this end, cross-language technology adapted to the oncology domain was developed, evaluated and deployed for multilingual interspeciality medical education. More precisely, automatic speech recognition (ASR) and MT models were adapted to the oncology domain to translate English pre-recorded training videos, kindly provided by the European School of Oncology (ESO), into French, Spanish, German and Slovene. In this work, three categories of MT models adapted to the medical domain were assessed: bilingual encoder-decoder MT models trained from scratch, pre-trained large multilingual encoder-decoder MT models and multilingual decoder-only LLMs. The experimental results underline the competitiveness in translation quality of LLMs compared to encoder-decoder MT models. Finally, the ESO speech dataset, comprising roughly 1,000 videos and 745 hours for the training and evaluation of ASR and MT models, was publicly released for the scientific community.},
  keywords  = {Automatic Speech Recognition, domain adaptation, large language models, Machine Translation, oncology, Speech Translation},
  pubstate  = {published},
  tppubtype = {article}
}
@inproceedings{Iranzo-SánchezACL2025,
  title     = {Going Beyond Your Expectations in Latency Metrics for Simultaneous Speech Translation},
  author    = {Jorge Iranzo-Sánchez and Javier Iranzo-Sánchez and Adrià Giménez and Jorge Civera},
  url       = {https://www.mllp.upv.es/wp-content/uploads/2025/08/poster.pdf https://openreview.net/forum?id=mbNv6ne53X},
  doi       = {10.18653/v1/2025.findings-acl.937},
  year      = {2025},
  date      = {2025-01-01},
  booktitle = {ACL (Findings) 2025},
  pages     = {18205--18228},
  address   = {Vienna (Austria)},
  abstract  = {Current evaluation practices in Simultaneous Speech Translation (SimulST) systems typically involve segmenting the input audio and corresponding translations, calculating quality and latency metrics for each segment, and averaging the results. Although this approach may provide a reliable estimation of translation quality, it can lead to misleading values of latency metrics due to an inherent assumption that average latency values are good enough estimators of SimulST systems' response time. However, our detailed analysis of latency evaluations for state-of-the-art SimulST systems demonstrates that latency distributions are often skewed and subject to extreme variations. As a result, the mean in latency metrics fails to capture these anomalies, potentially masking the lack of robustness in some systems and metrics. In this paper, a thorough analysis of the results of systems submitted to recent editions of the IWSLT simultaneous track is provided to support our hypothesis and alternative ways to report latency metrics are proposed in order to provide a better understanding of SimulST systems' latency.},
  keywords  = {latency metrics, Simultaneous Speech Translation},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
@inproceedings{Iranzo-Sánchez2025b,
  title     = {MLLP-VRAIN UPV System for the IWSLT 2025 Simultaneous Speech Translation Task},
  author    = {Jorge Iranzo-Sánchez and Javier Iranzo-Sánchez and Adrià Giménez and Jorge Civera and Alfons Juan},
  url       = {https://arxiv.org/pdf/2506.18828},
  doi       = {10.18653/v1/2025.iwslt-1.35},
  year      = {2025},
  date      = {2025-01-01},
  booktitle = {IWSLT 2025},
  pages     = {340--346},
  address   = {Vienna (Austria)},
  abstract  = {This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to enhance the MT model's ability to handle incomplete inputs, while incorporating adaptive emission policies including a wait-k strategy and RALCP for managing the translation stream. Specialized buffer management techniques and segmentation strategies ensure coherent translations across long audio sequences. Experimental results on the ACL60/60 dataset demonstrate that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score on the official test set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully adapted pre-trained components can create effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.},
  keywords  = {Simultaneous Speech Translation},
  pubstate  = {published},
  tppubtype = {inproceedings}
}
2024
@article{Juan2024,
  title     = {Segmentation-Free Streaming Machine Translation},
  author    = {Javier Iranzo-Sánchez and Jorge Iranzo-Sánchez and Adrià Giménez and Jorge Civera and Alfons Juan},
  url       = {https://paperswithcode.com/paper/segmentation-free-streaming-machine https://github.com/jairsan/Segmentation-Free_Streaming_Machine_Translation https://arxiv.org/abs/2309.14823 https://2024.aclweb.org/program/tacl_papers/ https://www.mllp.upv.es/wp-content/uploads/2024/09/tacl_segfree_poster.pdf},
  doi       = {10.1162/tacl_a_00691},
  year      = {2024},
  date      = {2024-01-01},
  journal   = {Transactions of the Association for Computational Linguistics},
  volume    = {12},
  pages     = {1104--1121},
  note      = {Also accepted for presentation at ACL 2024},
  abstract  = {Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model.},
  keywords  = {segmentation-free, streaming machine translation},
  pubstate  = {published},
  tppubtype = {article}
}