5 years ago · 7007c8bed4
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 
				-Europarl-ASR
			
 
				-v1.0
			
 
				-2 April 2021
			
 
				-www.mllp.upv.es/europarl-asr/
			
 
				+# Europarl-ASR
			
 
				+v1.0<br />
			
 
				+2 April 2021<br />
			
 
				+www.mllp.upv.es/europarl-asr
			
 
				 
			
 
				 A large English-language speech and text corpus of parliamentary debates for
			
 
				 streaming ASR benchmarking and speech data filtering/verbatimization.
			
@@ -18,21 +18,21 @@ Universitat Politècnica de València.
 
				 README CONTENTS
			
 
				 ---------------
			
 
				 
			
 
				-- Overview
			
 
				-- Corpus structure and contents
			
 
				-- Additional Europarl-ASR materials
			
 
				-- Extended description
			
 
				-- Acknowledgements
			
 
				-- Legal disclaimers
			
 
				-- Licence
			
 
				+- [Overview](#overview)
			
 
				+- [Corpus structure and contents](#contents)
			
 
				+- [Additional Europarl-ASR materials](#additional-materials)
			
 
				+- [Extended description](#description)
			
 
				+- [Acknowledgements](#ack)
			
 
				+- [Legal disclaimers](#legal)
			
 
				+- [Licence](#licence)
			
 
				 
			
 
				 
			
 
				-OVERVIEW
			
 
				+<a id="overview"></a>OVERVIEW
			
 
				 --------
			
 
				 
			
 
				 Europarl-ASR (EN) includes:
			
 
				 
			
 
				-*Speech data:
			
 
				+#### Speech data
			
 
				 
			
 
				 - 1.3K hours of English-language annotated speech data.
			
 
				 - 18 hours of speech data with both manually revised verbatim transcriptions
			
@@ -43,11 +43,11 @@ Europarl-ASR (EN) includes:
 
				   (training partition): official non-verbatim transcriptions, automatically
			
 
				   noise-filtered transcriptions and automatically verbatimized transcriptions.
			
 
				 
			
 
				-*Text data:
			
 
				+#### Text data
			
 
				 
			
 
				 - 70M tokens of English-language text data.
			
 
				 
			
 
				-*Pretrained language models:
			
 
				+#### Pretrained language models
			
 
				 
			
 
				 - The Europarl-ASR English-language n-gram language model and vocabulary.
			
 
				 
			
@@ -58,7 +58,7 @@ tokens, Europarl-ASR also includes tools to add all English-language text from
 
				 the DCEP Digital Corpus of the European Parliament.
			
 
				 
			
 
				 
			
 
				-CORPUS STRUCTURE AND CONTENTS
			
 
				+<a id="contents"></a>CORPUS STRUCTURE AND CONTENTS
			
 
				 -----------------------------
			
 
				 
			
 
				 Total size: 20 GB
			
@@ -72,6 +72,7 @@ data (for language modelling).
 
				 The full corpus structure, including additional subdirectories, is as
			
 
				 follows:
			
 
				 
			
 
				+```
			
 
				   Europarl-ASR
			
 
				   └── en
			
 
				       ├── dev
			
@@ -116,18 +117,19 @@ subdirectories:
 
				               └── internal
			
 
				                   ├── prepro
			
 
				                   └── raw
			
 
				+```
			
 
				 
			
 
				-*Speech data ("original_audio" directories):
			
 
				+#### Speech data ("original_audio" directories)
			
 
				 
			
 
				 In "dev" and "test", these directories are subdivided into "spk-dep"
			
 
				 and "spk-indep". Thus, for speech data, we have 2 train-dev-test partitions
			
 
				 for 2 different ASR tasks, as follows:
			
 
				 
			
 
				-  1) ASR with known speakers (MEP):
			
 
				-  train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep
			
 
				-  
			
 
				-  2) ASR with unknown speakers (Guest):
			
 
				-  train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep
			
 
				+1. ASR with known speakers (MEP):<br />
			
 
				+   train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep
			
 
				+   
			
 
				+1. ASR with unknown speakers (Guest):<br />
			
 
				+   train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep
			
 
				 
			
 
				 Each of these partition directories contains 3 to 4 subdirectories (depending
			
 
				 on whether it is the train set or the dev/test sets): "lists", "metadata",
			
@@ -140,40 +142,40 @@ speeches per speaker.
 
				 corresponding set (as csv and json files). For each speech we will find these
			
 
				 metadata (as reflected in speeches.headers.csv):
			
 
				 
			
 
				-  term;session_date;speech_id;speaker_type;speaker_id;raw_dur;
			
 
				-  aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;term;session_date;speech_id;speaker_type;speaker_id;raw_dur;<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title
			
 
				 
			
 
				 And for each speaker (as reflected in speakers.headers.csv):
			
 
				 
			
 
				-  type;id;name;gender;url
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;type;id;name;gender;url
			
 
				 
			
 
				 "speeches" contains a subdirectory for each speech in the corresponding set,
			
 
				 according to this subdirectory structure:
			
 
				 
			
 
				-  speeches/<term>/<session_date>/<speech_id>/
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`speeches/<term>/<session_date>/<speech_id>/`
			
 
				 
			
 
				 For each speech, we will find some of the following files (depending on
			
 
				 whether it is in the train set or in the dev/test sets):
			
 
				 
			
 
				-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a
			
 
				-  [In all sets] Audio of the speech.
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a`<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;[In all sets] Audio of the speech.
			
 
				 
			
 
				-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}
			
 
				-  [In all sets] Official non-verbatim transcription of the speech, as a txt
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}`<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;[In all sets] Official non-verbatim transcription of the speech, as a txt
			
 
				   raw transcription file, as dfxp or srt force-aligned timed subtitle files,
			
 
				   and its json metadata.
			
 
				 
			
 
				-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}
			
 
				-  [In train set] Automatically filtered transcription of the speech, as dfxp
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}`<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;[In train set] Automatically filtered transcription of the speech, as dfxp
			
 
				   or srt force-aligned timed subtitle files, and its json metadata.
			
 
				 
			
 
				-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}
			
 
				-  [In train set] Automatically verbatimized transcription of the speech, as
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}`<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;[In train set] Automatically verbatimized transcription of the speech, as
			
 
				   a txt transcription file, as dfxp or srt force-aligned timed subtitle files,
			
 
				   and its json metadata.
			
 
				 
			
 
				-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}
			
 
				-  [In dev/test sets] Manually revised verbatim transcription of the speech,
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}`<br />
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;[In dev/test sets] Manually revised verbatim transcription of the speech,
			
 
				   as a txt transcription file, as dfxp or srt force-aligned timed subtitle
			
 
				   files, and its json metadata.
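
The metadata fields and file-naming scheme described above can be combined to locate the files of a given speech. The following is a minimal Python sketch, not part of the corpus tooling. It assumes that the metadata csv files are semicolon-delimited with no header row (since the headers are given separately in speeches.headers.csv), and that the `<term>`, `<session_date>` and `<speech_id>` placeholders are filled in verbatim from the corresponding metadata fields. The function names and the sample row are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Column names as listed in this README for speeches.headers.csv.
SPEECH_FIELDS = [
    "term", "session_date", "speech_id", "speaker_type", "speaker_id",
    "raw_dur", "aligned-speech_dur", "filtered-speech_dur", "cer", "ar",
    "path", "agenda_item_title",
]

# Transcription variants and file formats per set, as listed in this README.
_TRAIN = {
    "orig": ("dfxp", "json", "srt", "txt"),
    "filt": ("dfxp", "json", "srt"),
    "verb": ("dfxp", "json", "srt", "txt"),
}
_DEV_TEST = {
    "orig": ("dfxp", "json", "srt", "txt"),
    "rev": ("dfxp", "json", "srt", "txt"),
}
VARIANTS = {"train": _TRAIN, "dev": _DEV_TEST, "test": _DEV_TEST}

# Hypothetical sample row; all values are made up for illustration only.
SAMPLE_ROW = "7;2010-01-18;000;MEP;1234;60.0;55.0;50.0;0.1;0.9;some/path;Some title"


def parse_speech_rows(text):
    """Parse semicolon-delimited speech metadata (assumed to have no header row)."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=SPEECH_FIELDS, delimiter=";")
    return list(reader)


def speech_files(row, partition):
    """List the expected files for one speech, relative to the set's directory."""
    term, date, speech_id = row["term"], row["session_date"], row["speech_id"]
    base = PurePosixPath("speeches") / term / date / speech_id
    stem = f"ep-asr.en.orig.{term}.{date}.{speech_id}"
    files = [base / f"{stem}.m4a"]  # the audio is present in all sets
    for variant, extensions in VARIANTS[partition].items():
        files += [base / f"{stem}.tr.{variant}.{ext}" for ext in extensions]
    return files


if __name__ == "__main__":
    for row in parse_speech_rows(SAMPLE_ROW):
        for path in speech_files(row, "train"):
            print(path)
```

Whether every listed file exists for every speech is not guaranteed by this sketch (the README says "some of the following files"), so callers should check for existence before opening a path.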
			
 
				 
			
@@ -185,7 +187,7 @@ transcriptions (*.ref) and as segment time marked files (*.stm). In all 4
 
				 cases, the text is presented preprocessed for evaluation (tokenized,
			
 
				 lowercased, punctuation removed...).
			
 
				 
			
 
				-*Text data ("text" directories):
			
 
				+#### Text data ("text" directories)
			
 
				 
			
 
				 In the case of "train", the "text" directory is subdivided into "external" and
			
 
				 "internal". "internal" contains all the official non-verbatim transcriptions
			
@@ -197,24 +199,24 @@ Each "text" directory contains 2 subdirectories: "raw" (except in
 
				 "train/external"), "prepro" (in all sets), or "scripts" (only in
			
 
				 "train/external").
			
 
				 
			
 
				-  "raw" contains the raw text data for the corresponding set (*.txt.gz), and
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;"raw" contains the raw text data for the corresponding set (*.txt.gz), and
			
 
				   its metadata (*.csv). In the cases of "dev" and "test", both the official
			
 
				   non-verbatim transcriptions (*.orig.*) and the manually revised verbatim
			
 
				   transcriptions (*.rev.*) are included.
			
 
				 
			
 
				-  "prepro" contains the text data for the corresponding set, preprocessed for
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;"prepro" contains the text data for the corresponding set, preprocessed for
			
 
				   training or evaluation (tokenized, lowercased, punctuation removed...). This
			
 
				   data is released to facilitate the reproducibility of our experiments.
			
 
				 
			
 
				-  Finally, "scripts" (only in "train/text/external") contains the script
			
 
				+&nbsp;&nbsp;&nbsp;&nbsp;Finally, "scripts" (only in "train/text/external") contains the script
			
 
				   get_DCEP.sh, which can be used to download the DCEP corpus from its original
			
 
				   website and save it in compressed plain text (.txt.gz).
			
 
				 
			
 
				 
			
 
				-ADDITIONAL Europarl-ASR MATERIALS
			
 
				+<a id="additional-materials"></a>ADDITIONAL Europarl-ASR MATERIALS
			
 
				 ---------------------------------
			
 
				 
			
 
				-https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz
			
 
				+https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz<br />
			
 
				 https://www.mllp.upv.es/europarl-asr/Europarl-ASR_transcription_guidelines.pdf
			
 
				 
			
 
				 In addition to the speech and text data included in the main release and
			
@@ -229,7 +231,7 @@ materials to facilitate the reproducibility of our experiments:
 
				   dev and test sets.
			
 
				 
			
 
				 
			
 
				-EXTENDED DESCRIPTION
			
 
				+<a id="description"></a>EXTENDED DESCRIPTION
			
 
				 --------------------
			
 
				 
			
 
				 Europarl-ASR (EN) is a large English-language speech and text corpus of
			
@@ -244,7 +246,7 @@ Intel·ligència Artificial, Universitat Politècnica de València
 
				 
			
 
				 Europarl-ASR (EN) includes:
			
 
				 
			
 
				-*Speech data:
			
 
				+#### Speech data
			
 
				 
			
 
				 - 1.3K hours of English-language annotated speech data (33K speeches, 1K
			
 
				   speakers).
			
@@ -256,11 +258,11 @@ Europarl-ASR (EN) includes:
 
				   (training partition): official non-verbatim transcriptions, automatically
			
 
				   noise-filtered transcriptions and automatically verbatimized transcriptions.
			
 
				 
			
 
				-*Text data:
			
 
				+#### Text data
			
 
				 
			
 
				 - 70M tokens of English-language text data.
			
 
				 
			
 
				-*Language models:
			
 
				+#### Language models
			
 
				 
			
 
				 - The Europarl-ASR English-language n-gram language model and vocabulary.
			
 
				 
			
@@ -283,7 +285,7 @@ Detailed dates of the EP speech and text data gathered:
 
				 - DCEP (does not include any EP reports of proceedings): 2001 to 2012.
			
 
				 
			
 
				 
			
 
				-ACKNOWLEDGEMENTS
			
 
				+<a id="ack"></a>ACKNOWLEDGEMENTS
			
 
				 ---------------
			
 
				 
			
 
				 The authors would like to acknowledge:
			
@@ -298,15 +300,15 @@ The authors would like to acknowledge:
 
				   ( https://www.statmt.org/europarl/ ).
			
 
				 
			
 
				 This work has received funding from the EU's H2020 research and innovation
			
 
				-programme under grant agreements 761758 (X5gon) and 952215 (TAILOR); the
			
 
				-Government of Spain's research project Multisub (RTI2018-094879-B-I00,
			
 
				+programme under grant agreements 761758 ([X5gon](https://cordis.europa.eu/project/id/761758)) and 952215 ([TAILOR](https://cordis.europa.eu/project/id/952215)); the
			
 
				+Government of Spain's research project [Multisub](https://www.mllp.upv.es/projects/multisub/) (RTI2018-094879-B-I00,
			
 
				 MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; the
			
 
				-Generalitat Valenciana's research project Classroom Activity Recognition
			
 
				+Generalitat Valenciana's research project [Classroom Activity Recognition](https://aplicat.upv.es/exploraupv/ficha-proyecto/proyecto/20190714)
			
 
				 (PROMETEO/2019/111) and predoctoral research scholarship ACIF/2017/055; and
			
 
				 the Universitat Politècnica de València's PAID-01-17 R&D support programme.
			
 
				 
			
 
				 
			
 
				-LEGAL DISCLAIMERS
			
 
				+<a id="legal"></a>LEGAL DISCLAIMERS
			
 
				 -----------------
			
 
				 
			
 
				 - Speech and text data from the European Parliament website (audio, official
			
@@ -319,7 +321,7 @@ LEGAL DISCLAIMERS
 
				   update: 11 March 2015).
			
 
				 
			
 
				 
			
 
				-LICENCE
			
 
				+<a id="licence"></a>LICENCE
			
 
				 -------
			
 
				 
			
 
				 - Speech and text data from the European Parliament website (audio, official
			
@@ -344,4 +346,4 @@ LICENCE
 
				   Silvestre-Cerdà are licenced under CC BY 4.0. To view a copy of this
			
 
				   licence, visit http://creativecommons.org/licenses/by/4.0/
			
 
				 
			
 
				-See the file LICENSE for the full licence texts.
			
 
				+See the [LICENSE](https://mllp.upv.es/git-pub/ggarces/Europarl-ASR/src/master/LICENSE) file for the full licence texts.