Santamaría-Jordà, Jaume; Segovia-Martínez, Pablo; Garcés Díaz-Munío, Gonçal V.; Silvestre-Cerdà, Joan Albert; Giménez, Adrià; Gaspar Aparicio, Rubén; Fernández Sánchez, René; Civera, Jorge; Sanchis, Albert; Juan, Alfons. LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking. In: Interspeech 2025, Rotterdam (Netherlands). Forthcoming.

@inproceedings{Santamaria2025,
title = {LHCP-ASR: An English Speech Corpus of High-Energy Particle Physics Talks for Narrow-Domain ASR Benchmarking},
author = {Santamaría-Jordà, Jaume and Segovia-Martínez, Pablo and Garcés Díaz-Munío, Gonçal V. and Silvestre-Cerdà, Joan Albert and Giménez, Adrià and Gaspar Aparicio, Rubén and Fernández Sánchez, René and Civera, Jorge and Sanchis, Albert and Juan, Alfons},
year = {2025},
date = {2025-01-01},
booktitle = {Interspeech 2025},
address = {Rotterdam (Netherlands)},
abstract = {We present LHCP-ASR, an English speech corpus of high-energy particle physics talks, with 235 hours of transcribed speech extracted from the 2020--2022 Large Hadron Collider Physics (LHCP) conferences, plus 1.5G tokens of in-domain text extracted from scientific documents. About 30 hours of conference talks were manually transcribed to build two reliable tasks for narrow-domain ASR benchmarking. The remaining conference talks (205 hours) were pseudo-labelled using a very competitive in-domain ASR system, in order to build a dataset for training or adaptation purposes. This paper describes the creation of this dataset, and provides first reference WER figures using OpenAI's Whisper models and our in-domain ASR system, achieving 13.6\% and 15.0\% WER on the two test sets. This corpus is publicly released under an open licence. We believe it will fulfil the need in the area for new open, reliable, real-life and challenging ASR benchmarks.},
keywords = {Automatic Speech Recognition, domain adaptation, manual transcription, pseudo-labelling, speech corpus},
pubstate = {forthcoming},
tppubtype = {inproceedings}
}