MLLP researchers release version 1.1 of Europarl-ST, a large multilingual Speech Translation corpus of parliamentary debates

We are happy to announce that MLLP researchers have released version 1.1 of the Europarl-ST corpus, a large multilingual corpus for Speech Translation based on European Parliament debates, consisting of audio-transcription-translation triplets.

Europarl-ST is a multilingual Speech Translation corpus made up of audio-transcription-translation triplets built from recordings of debates held in the European Parliament between 2008 and 2012. The corpus is released under a Creative Commons license and is freely accessible and downloadable. The full details of the corpus are available in the ICASSP 2020 article “Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates” (by MLLP researchers Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera and Alfons Juan).

Version 1.1 adds 3 new languages: Romanian, Polish and Dutch. Together with the 6 languages already available (English, Spanish, French, German, Italian and Portuguese), the corpus now covers 9 languages and offers 72 speech translation directions, one for each ordered pair of distinct source and target languages. We have also released a new set, called “train-noisy”, which contains the speeches that were discarded during our filtering process, as they may still be useful for some training regimes. Finally, we now provide a speeches.cer file reporting the Character Error Rate (CER) computed with our ASR systems for each speech.
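
For readers who want to check the figure of 72 directions, the short Python sketch below enumerates every ordered (source, target) pair of distinct languages. The two-letter ISO 639-1 codes and the names ORIGINAL, ADDED_IN_V1_1 and translation_directions are our own choices for illustration; they are not taken from the corpus and may not match the identifiers used in its files.

```python
from itertools import permutations

# ISO 639-1 codes, used here for illustration only (an assumption:
# the corpus itself may label its languages differently).
ORIGINAL = ["en", "es", "fr", "de", "it", "pt"]
ADDED_IN_V1_1 = ["ro", "pl", "nl"]


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


directions_v1_1 = translation_directions(ORIGINAL + ADDED_IN_V1_1)

print(len(translation_directions(ORIGINAL)))  # 6 * 5 = 30
print(len(directions_v1_1))                   # 9 * 8 = 72
```

With n languages there are n × (n − 1) directions, so 9 languages give 9 × 8 = 72, while the 6 original languages alone give 6 × 5 = 30.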

Because of its large size and scope, the Europarl-ST corpus has already been adopted in international scientific evaluation campaigns such as IWSLT 2020.

We encourage you to visit the corpus webpage for full details and download links: https://mllp.upv.es/europarl-st/