@@ -1,7 +1,7 @@
# Europarl-ASR
v1.0<br />
2 April 2021<br />
-www.mllp.upv.es/europarl-asr
+[www.mllp.upv.es/europarl-asr](www.mllp.upv.es/europarl-asr)

A large English-language speech and text corpus of parliamentary debates for
streaming ASR benchmarking and speech data filtering/verbatimization.
@@ -34,22 +34,22 @@ Europarl-ASR (EN) includes:

#### Speech data

-- 1.3K hours of English-language annotated speech data.
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 1.3K hours of English-language annotated speech data.
+* 18 hours of speech data with both manually revised verbatim transcriptions
  and official non-verbatim transcriptions, split in 2 independent validation-
  evaluation partitions for 2 realistic ASR tasks (with vs. without previous
  knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
  (training partition): official non-verbatim transcriptions, automatically
  noise-filtered transcriptions and automatically verbatimized transcriptions.

#### Text data

-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.

#### Pretrained language models

-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.

This data comprises most of the European Parliament's English-language debate
recordings, transcriptions and translations available from the Parliament's
@@ -223,10 +223,10 @@ In addition to the speech and text data included in the main release and
described in this document, we are making available for download the following
materials to facilitate the reproducibility of our experiments:

-- The pretrained Europarl-ASR English-language n-gram language model, together
+* The pretrained Europarl-ASR English-language n-gram language model, together
  with its vocabulary file.

-- The Europarl-ASR English-language verbatim transcription guidelines, which
+* The Europarl-ASR English-language verbatim transcription guidelines, which
  were applied to produce the manually revised verbatim transcriptions for the
  dev and test sets.

@@ -242,47 +242,47 @@ speeches from European Parliament sessions held in the period 1996-2020.
It was compiled and released by the Machine Learning and Language Processing
(MLLP) research group of VRAIN Institut Valencià d'Investigació en
Intel·ligència Artificial, Universitat Politècnica de València
-( www.mllp.upv.es ).
+( [www.mllp.upv.es](www.mllp.upv.es) ).

Europarl-ASR (EN) includes:

#### Speech data

-- 1.3K hours of English-language annotated speech data (33K speeches, 1K
+* 1.3K hours of English-language annotated speech data (33K speeches, 1K
  speakers).
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 18 hours of speech data with both manually revised verbatim transcriptions
  and official non-verbatim transcriptions, split in 2 independent validation-
  evaluation partitions for 2 realistic ASR tasks (with vs. without previous
  knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
  (training partition): official non-verbatim transcriptions, automatically
  noise-filtered transcriptions and automatically verbatimized transcriptions.

#### Text data

-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.

#### Language models

-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.

-This data comprises most of the European Parliament's English-language debate
+This data comprises most of the [European Parliament's English-language debate](https://www.europarl.europa.eu/plenary/en/debates-video.html)
recordings, transcriptions and translations available from the Parliament's
website from 1999 to 2020 (recordings being only available from 2008). This is
complemented by including all English-language transcriptions and translations
-from the Europarl v10 text corpus for the period 1996-1999.
+from the [Europarl](https://www.statmt.org/europarl/) v10 text corpus for the period 1996-1999.

Additionally, to increase text data for language modelling up to 170M tokens,
Europarl-ASR also includes tools to add all English-language text from the
-DCEP Digital Corpus of the European Parliament.
+[DCEP Digital Corpus of the European Parliament](https://ec.europa.eu/jrc/en/language-technologies/dcep).

Detailed dates of the EP speech and text data gathered:

-- English speech: 2008-09-01 to 2020-05-27.
-- English transcriptions: 1999-07-20 to 2020-05-27.
-- Translations into English: 1999-07-20 to 2012-11-30.
-- Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
-- DCEP (does not include any EP reports of proceedings): 2001 to 2012.
+* English speech: 2008-09-01 to 2020-05-27.
+* English transcriptions: 1999-07-20 to 2020-05-27.
+* Translations into English: 1999-07-20 to 2012-11-30.
+* Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
+* DCEP (does not include any EP reports of proceedings): 2001 to 2012.


<a id="ack"></a>ACKNOWLEDGEMENTS
@@ -290,13 +290,13 @@ Detailed dates of the EP speech and text data gathered:

The authors would like to acknowledge:

-- The European Parliament, the European Commission and other EU organizations,
+* The European Parliament, the European Commission and other EU organizations,
  for making available a wealth of multilingual speech and text data, both in
  their websites and as ready-made corpora such as the DCEP Digital Corpus of
  the European Parliament
  ( https://ec.europa.eu/jrc/en/language-technologies ).

-- Philipp Koehn for compiling the Europarl corpus
+* Philipp Koehn for compiling the Europarl corpus
  ( https://www.statmt.org/europarl/ ).

This work has received funding from the EU's H2020 research and innovation
@@ -311,11 +311,11 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
<a id="legal"></a>LEGAL DISCLAIMERS
-----------------

-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
  transcriptions and translations) were sourced from
  https://www.europarl.europa.eu/plenary/en/debates-video.html

-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
  exclusive property of the European Parliament. These data were sourced from
  https://ec.europa.eu/jrc/en/language-technologies/dcep (date of the latest
  update: 11 March 2015).
@@ -324,22 +324,22 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
<a id="licence"></a>LICENCE
-------

-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
  transcriptions and translations) are the exclusive property of the European
  Union represented by the European Parliament. These data are reused here
  under the conditions stated in the European Parliament website's Legal
  notice ( https://www.europarl.europa.eu/legal-notice ).

-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
  exclusive property of the European Parliament. The European Parliament
  retains ownership of the data. These data are reused here under the usage
  conditions of the DCEP Digital Corpus of the European Parliament (
  https://ec.europa.eu/jrc/en/language-technologies/dcep#Usage%20Conditions ).

-- Text data from the Europarl v10 corpus are reused here under the Europarl
+* Text data from the Europarl v10 corpus are reused here under the Europarl
  corpus terms of use ( https://www.statmt.org/europarl/ ).

-- Europarl-ASR data and code not covered by the previously mentioned licences
+* Europarl-ASR data and code not covered by the previously mentioned licences
  © 2021 by Pau Baquero-Arnal, Jorge Civera, Gonçal V. Garcés Díaz-Munío,
  Adrià Giménez, Javier Iranzo-Sánchez, Javier Jorge, Alfons Juan, Alejandro
  Pérez-González-de-Martos, Nahuel Roselló, Albert Sanchis and Joan Albert