2025

Santamaría-Jordà, Jaume; Segovia-Martínez, Pablo; Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Gaspar Aparicio, Rubén; Fernández Sánchez, René; Civera, Jorge; Sanchis, Albert; Juan, Alfons. LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking. In: Interspeech 2025, pp. 4033–4037, Rotterdam (Netherlands), 2025.

@inproceedings{Santamaria2025,
  title = {LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking},
  author = {Jaume Santamaría-Jordà and Pablo Segovia-Martínez and Garcés Díaz-Munío, Gonçal V. and Joan Albert Silvestre-Cerdà and Adrià Giménez and Gaspar Aparicio, Rubén and Fernández Sánchez, René and Jorge Civera and Albert Sanchis and Alfons Juan},
  url = {https://www.mllp.upv.es/wp-content/uploads/2025/08/lhcp-asr-poster.pdf https://www.isca-archive.org/interspeech_2025/santamariajorda25_interspeech.html},
  doi = {10.21437/Interspeech.2025-2630},
  year = {2025},
  date = {2025-01-01},
  booktitle = {Interspeech 2025},
  pages = {4033--4037},
  address = {Rotterdam (Netherlands)},
  abstract = {We present LHCP-ASR, an English speech corpus of high-energy particle physics talks, with 235 hours of transcribed speeches extracted from the 2020--2022 Large Hadron Collider Physics (LHCP) conferences, plus 1.5G tokens of in-domain text extracted from scientific documents. About 30 hours of conference talks were manually transcribed to build two reliable tasks for narrow-domain ASR benchmarking. The remaining conference talks (205 hours) were pseudo-labelled using a very competitive in-domain ASR system, in order to build a dataset for training or adaptation purposes. This paper describes the creation of this dataset, and provides first reference WER% figures using OpenAI's Whisper models and our in-domain ASR system, achieving 13.6% and 15.0% WER points on the two test sets. This corpus is publicly released under an open licence. We believe it will fulfil the need in the area to have new open, reliable, real-life and challenging ASR benchmarks.},
  keywords = {Automatic Speech Recognition, domain adaptation, manual transcription, pseudo-labelling, speech corpus},
  pubstate = {published},
  tppubtype = {inproceedings}
}
2021

Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Baquero-Arnal, Pau; Roselló, Nahuel; Pérez-González-de-Martos, Alejandro; Civera, Jorge; Sanchis, Albert; Juan, Alfons. Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. In: Proc. Interspeech 2021, pp. 3695–3699, Brno (Czech Republic), 2021.

@inproceedings{Garcés2021,
  title = {Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization},
  author = {Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Javier Jorge and Adrià Giménez and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan},
  url = {https://www.mllp.upv.es/wp-content/uploads/2021/09/europarl-asr-presentation-extended.pdf https://www.youtube.com/watch?v=Tc0gNSDdnQg&list=PLlePn-Yanvnc_LRhgmmaNmH12Bwm6BRsZ https://paperswithcode.com/paper/europarl-asr-a-large-corpus-of-parliamentary https://github.com/mllpresearch/Europarl-ASR},
  doi = {10.21437/Interspeech.2021-1905},
  year = {2021},
  date = {2021-01-01},
  booktitle = {Proc. Interspeech 2021},
  journal = {Proc. Interspeech 2021},
  pages = {3695--3699},
  address = {Brno (Czech Republic)},
  abstract = {We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.},
  keywords = {Automatic Speech Recognition, speech corpus, speech data filtering, speech data verbatimization},
  pubstate = {published},
  tppubtype = {inproceedings}
}
