README added some links and minor markdown fixes

Gonçal 2 years ago
commit ddb7a0a91d
1 changed file with 30 additions and 30 deletions

README.md

@@ -1,7 +1,7 @@
 # Europarl-ASR
 v1.0<br />
 2 April 2021<br />
-www.mllp.upv.es/europarl-asr
+[www.mllp.upv.es/europarl-asr](www.mllp.upv.es/europarl-asr)
 
 A large English-language speech and text corpus of parliamentary debates for
 streaming ASR benchmarking and speech data filtering/verbatimization.
@@ -34,22 +34,22 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-- 1.3K hours of English-language annotated speech data.
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 1.3K hours of English-language annotated speech data.
+* 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.
 
 #### Pretrained language models
 
-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.
 
 This data comprises most of the European Parliament's English-language debate
 recordings, transcriptions and translations available from the Parliament's
@@ -223,10 +223,10 @@ In addition to the speech and text data included in the main release and
 described in this document, we are making available for download the following
 materials to facilitate the reproducibility of our experiments:
 
-- The pretrained Europarl-ASR English-language n-gram language model, together
+* The pretrained Europarl-ASR English-language n-gram language model, together
   with its vocabulary file.
 
-- The Europarl-ASR English-language verbatim transcription guidelines, which
+* The Europarl-ASR English-language verbatim transcription guidelines, which
   were applied to produce the manually revised verbatim transcriptions for the
   dev and test sets.
 
@@ -242,47 +242,47 @@ speeches from European Parliament sessions held in the period 1996-2020.
 It was compiled and released by the Machine Learning and Language Processing
 (MLLP) research group of VRAIN Institut Valencià d'Investigació en
 Intel·ligència Artificial, Universitat Politècnica de València
-( www.mllp.upv.es ).
+( [www.mllp.upv.es](www.mllp.upv.es) ).
 
 Europarl-ASR (EN) includes:
 
 #### Speech data
 
-- 1.3K hours of English-language annotated speech data (33K speeches, 1K
+* 1.3K hours of English-language annotated speech data (33K speeches, 1K
   speakers).
-- 18 hours of speech data with both manually revised verbatim transcriptions
+* 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-- 3 full sets of timed transcriptions for the rest of the speech data
+* 3 full sets of timed transcriptions for the rest of the speech data
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-- 70M tokens of English-language text data.
+* 70M tokens of English-language text data.
 
 #### Language models
 
-- The Europarl-ASR English-language n-gram language model and vocabulary.
+* The Europarl-ASR English-language n-gram language model and vocabulary.
 
-This data comprises most of the European Parliament's English-language debate
+This data comprises most of the [European Parliament's English-language debate](https://www.europarl.europa.eu/plenary/en/debates-video.html)
 recordings, transcriptions and translations available from the Parliament's
 website from 1999 to 2020 (recordings being only available from 2008). This is
 complemented by including all English-language transcriptions and translations
-from the Europarl v10 text corpus for the period 1996-1999.
+from the [Europarl](https://www.statmt.org/europarl/) v10 text corpus for the period 1996-1999.
 
 Additionally, to increase text data for language modelling up to 170M tokens,
 Europarl-ASR also includes tools to add all English-language text from the
-DCEP Digital Corpus of the European Parliament.
+[DCEP Digital Corpus of the European Parliament](https://ec.europa.eu/jrc/en/language-technologies/dcep).
 
 Detailed dates of the EP speech and text data gathered:
 
-- English speech: 2008-09-01 to 2020-05-27.
-- English transcriptions: 1999-07-20 to 2020-05-27.
-- Translations into English: 1999-07-20 to 2012-11-30.
-- Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
-- DCEP (does not include any EP reports of proceedings): 2001 to 2012.
+* English speech: 2008-09-01 to 2020-05-27.
+* English transcriptions: 1999-07-20 to 2020-05-27.
+* Translations into English: 1999-07-20 to 2012-11-30.
+* Europarl v10 (selected to avoid overlapping): 1996-04-15 to 1999-07-19.
+* DCEP (does not include any EP reports of proceedings): 2001 to 2012.
 
 
 <a id="ack"></a>ACKNOWLEDGEMENTS
@@ -290,13 +290,13 @@ Detailed dates of the EP speech and text data gathered:
 
 The authors would like to acknowledge:
 
-- The European Parliament, the European Commission and other EU organizations,
+* The European Parliament, the European Commission and other EU organizations,
   for making available a wealth of multilingual speech and text data, both in
   their websites and as ready-made corpora such as the DCEP Digital Corpus of
   the European Parliament
   ( https://ec.europa.eu/jrc/en/language-technologies ).
 
-- Philipp Koehn for compiling the Europarl corpus
+* Philipp Koehn for compiling the Europarl corpus
   ( https://www.statmt.org/europarl/ ).
 
 This work has received funding from the EU's H2020 research and innovation
@@ -311,11 +311,11 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
 <a id="legal"></a>LEGAL DISCLAIMERS
 -----------------
 
-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
   transcriptions and translations) were sourced from
   https://www.europarl.europa.eu/plenary/en/debates-video.html
 
-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
   exclusive property of the European Parliament. These data were sourced from
   https://ec.europa.eu/jrc/en/language-technologies/dcep (date of the latest
   update: 11 March 2015).
@@ -324,22 +324,22 @@ the Universitat Politècnica de València's PAID-01-17 R&D support programme.
 <a id="licence"></a>LICENCE
 -------
 
-- Speech and text data from the European Parliament website (audio, official
+* Speech and text data from the European Parliament website (audio, official
   transcriptions and translations) are the exclusive property of the European
   Union represented by the European Parliament. These data are reused here
   under the conditions stated in the European Parliament website's Legal
   notice ( https://www.europarl.europa.eu/legal-notice ).
 
-- Text data from the DCEP Digital Corpus of the European Parliament are the
+* Text data from the DCEP Digital Corpus of the European Parliament are the
   exclusive property of the European Parliament. The European Parliament
   retains ownership of the data. These data are reused here under the usage
   conditions of the DCEP Digital Corpus of the European Parliament (
   https://ec.europa.eu/jrc/en/language-technologies/dcep#Usage%20Conditions ).
 
-- Text data from the Europarl v10 corpus are reused here under the Europarl
+* Text data from the Europarl v10 corpus are reused here under the Europarl
   corpus terms of use ( https://www.statmt.org/europarl/ ).
 
-- Europarl-ASR data and code not covered by the previously mentioned licences
+* Europarl-ASR data and code not covered by the previously mentioned licences
   © 2021 by Pau Baquero-Arnal, Jorge Civera, Gonçal V. Garcés Díaz-Munío,
   Adrià Giménez, Javier Iranzo-Sánchez, Javier Jorge, Alfons Juan, Alejandro
   Pérez-González-de-Martos, Nahuel Roselló, Albert Sanchis and Joan Albert