5 years ago · ddb7a0a91d
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # Europarl-ASR
 v1.0<br />
 2 April 2021<br />
-www.mllp.upv.es/europarl-asr
+[www.mllp.upv.es/europarl-asr](www.mllp.upv.es/europarl-asr)
 
 A large English-language speech and text corpus of parliamentary debates for
 streaming ASR benchmarking and speech data filtering/verbatimization.
@@ -34,22 +34,22 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-- 1.3K hours of English-language annotated speech data.
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 1.3K hours of English-language annotated speech data.
+* 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.
 
 #### Pretrained language models
 
-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.
 
 This data comprises most of the European Parliament's English-language debate
 recordings, transcriptions and translations available from the Parliament's
@@ -223,10 +223,10 @@ In addition to the speech and text data included in the main release and
 described in this document, we are making available for download the following
 materials to facilitate the reproducibility of our experiments:
 
-- The pretrained Europarl-ASR English-language n-gram language model, together
+* The pretrained Europarl-ASR English-language n-gram language model, together
   with its vocabulary file.
 
-- The Europarl-ASR English-language verbatim transcription guidelines, which
+* The Europarl-ASR English-language verbatim transcription guidelines, which
   were applied to produce the manually revised verbatim transcriptions for the
   dev and test sets.
 
@@ -242,47 +242,47 @@ speeches from European Parliament sessions held in the period 1996-2020.
 It was compiled and released by the Machine Learning and Language Processing
 (MLLP) research group of VRAIN Institut Valencià d'Investigació en
 Intel·ligència Artificial, Universitat Politècnica de València
-( www.mllp.upv.es ).
+( [www.mllp.upv.es](www.mllp.upv.es) ).
 
 Europarl-ASR (EN) includes:
 
 #### Speech data
 
-- 1.3K hours of English-language annotated speech data (33K speeches, 1K
+* 1.3K hours of English-language annotated speech data (33K speeches, 1K
   speakers).
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.
 
 #### Language models
 
-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.
 
-This data comprises most of the European Parliament's English-language debate
+This data comprises most of the [European Parliament's English-language debate](https://www.europarl.europa.eu/plenary/en/debates-video.html)
 recordings, transcriptions and translations available from the Parliament's
 website from 1999 to 2020 (recordings being only available from 2008). This is
 complemented by including all English-language transcriptions and translations
-from the Europarl v10 text corpus for the period 1996-1999.
+from the [Europarl](https://www.statmt.org/europarl/) v10 text corpus for the period 1996-1999.
 
 Additionally, to increase text data for language modelling up to 170M tokens,
 Europarl-ASR also includes tools to add all English-language text from the
-DCEP Digital Corpus of the European Parliament.
+[DCEP Digital Corpus of the European Parliament](https://ec.europa.eu/jrc/en/language-technologies/dcep).
 
 Detailed dates of the EP speech and text data gathered:
 
-- English speech: 2008-09-01 to 2020-05-27.
-- English transcriptions: 1999-07-20 to 2020-05-27.
-- Translations into English: 1999-07-20 to 2012-11-30.
-- Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
-- DCEP (does not include any EP reports of proceedings): 2001 to 2012.
+* English speech: 2008-09-01 to 2020-05-27.
+* English transcriptions: 1999-07-20 to 2020-05-27.
+* Translations into English: 1999-07-20 to 2012-11-30.
+* Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
+* DCEP (does not include any EP reports of proceedings): 2001 to 2012.
 
 
 <a id="ack"></a>ACKNOWLEDGEMENTS
@@ -290,13 +290,13 @@ Detailed dates of the EP speech and text data gathered:
 
 The authors would like to acknowledge:
 
-- The European Parliament, the European Commission and other EU organizations,
+* The European Parliament, the European Commission and other EU organizations,
   for making available a wealth of multilingual speech and text data, both in
   their websites and as ready-made corpora such as the DCEP Digital Corpus of
   the European Parliament
   ( https://ec.europa.eu/jrc/en/language-technologies ).
 
-- Philipp Koehn for compiling the Europarl corpus
+* Philipp Koehn for compiling the Europarl corpus
   ( https://www.statmt.org/europarl/ ).
 
 This work has received funding from the EU's H2020 research and innovation
@@ -311,11 +311,11 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
 <a id="legal"></a>LEGAL DISCLAIMERS
 -----------------
 
-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
   transcriptions and translations) were sourced from
   https://www.europarl.europa.eu/plenary/en/debates-video.html
 
-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
   exclusive property of the European Parliament. These data were sourced from
   https://ec.europa.eu/jrc/en/language-technologies/dcep (date of the latest
   update: 11 March 2015).
@@ -324,22 +324,22 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
 <a id="licence"></a>LICENCE
 -------
 
-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
   transcriptions and translations) are the exclusive property of the European
   Union represented by the European Parliament. These data are reused here
   under the conditions stated in the European Parliament website's Legal
   notice ( https://www.europarl.europa.eu/legal-notice ).
 
-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
   exclusive property of the European Parliament. The European Parliament
   retains ownership of the data. These data are reused here under the usage
   conditions of the DCEP Digital Corpus of the European Parliament (
   https://ec.europa.eu/jrc/en/language-technologies/dcep#Usage%20Conditions ).
 
-- Text data from the Europarl v10 corpus are reused here under the Europarl
+* Text data from the Europarl v10 corpus are reused here under the Europarl
   corpus terms of use ( https://www.statmt.org/europarl/ ).
 
-- Europarl-ASR data and code not covered by the previously mentioned licences
+* Europarl-ASR data and code not covered by the previously mentioned licences
   © 2021 by Pau Baquero-Arnal, Jorge Civera, Gonçal V. Garcés Díaz-Munío,
   Adrià Giménez, Javier Iranzo-Sánchez, Javier Jorge, Alfons Juan, Alejandro
   Pérez-González-de-Martos, Nahuel Roselló, Albert Sanchis and Joan Albert