Updated README with markdown formatting and some links

Gonçal 2 years ago
parent
commit
7007c8bed4
1 changed file with 55 additions and 53 deletions

README.md (+55 -53)

@@ -1,7 +1,7 @@
-Europarl-ASR
-v1.0
-2 April 2021
-www.mllp.upv.es/europarl-asr/
+# Europarl-ASR
+v1.0<br />
+2 April 2021<br />
+www.mllp.upv.es/europarl-asr
 
 A large English-language speech and text corpus of parliamentary debates for
 streaming ASR benchmarking and speech data filtering/verbatimization.
@@ -18,21 +18,21 @@ Universitat Politècnica de València.
 README CONTENTS
 ---------------
 
-- Overview
-- Corpus structure and contents
-- Additional Europarl-ASR materials
-- Extended description
-- Acknowledgements
-- Legal disclaimers
-- Licence
+- [Overview](#overview)
+- [Corpus structure and contents](#contents)
+- [Additional Europarl-ASR materials](#additional-materials)
+- [Extended description](#description)
+- [Acknowledgements](#ack)
+- [Legal disclaimers](#legal)
+- [Licence](#licence)
 
 
-OVERVIEW
+<a id="overview"></a>OVERVIEW
 --------
 
 Europarl-ASR (EN) includes:
 
-*Speech data:
+#### Speech data
 
 - 1.3K hours of English-language annotated speech data.
 - 18 hours of speech data with both manually revised verbatim transcriptions
@@ -43,11 +43,11 @@ Europarl-ASR (EN) includes:
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
-*Text data:
+#### Text data
 
 - 70M tokens of English-language text data.
 
-*Pretrained language models:
+#### Pretrained language models
 
 - The Europarl-ASR English-language n-gram language model and vocabulary.
 
@@ -58,7 +58,7 @@ tokens, Europarl-ASR also includes tools to add all English-language text from
 the DCEP Digital Corpus of the European Parliament.
 
 
-CORPUS STRUCTURE AND CONTENTS
+<a id="contents"></a>CORPUS STRUCTURE AND CONTENTS
 -----------------------------
 
 Total size: 20 GB
@@ -72,6 +72,7 @@ data (for language modelling).
 Here we can see more completely the corpus structure, with additional
 subdirectories:
 
+```
   Europarl-ASR
   └── en
       ├── dev
@@ -116,18 +117,19 @@ subdirectories:
               └── internal
                   ├── prepro
                   └── raw
+```
 
-*Speech data ("original_audio" directories):
+#### Speech data ("original_audio" directories)
 
 In the cases of "dev" and "test", they are subdivided in directories "spk-dep"
 and "spk-indep". Thus, for speech data, we have 2 train-dev-test partitions
 for 2 different ASR tasks, as follows:
 
-  1) ASR with known speakers (MEP):
-  train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep
-  
-  2) ASR with unknown speakers (Guest):
-  train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep
+1. ASR with known speakers (MEP):<br />
+   train ; dev/original_audio/spk-dep ; test/original_audio/spk-dep
+   
+1. ASR with unknown speakers (Guest):<br />
+   train ; dev/original_audio/spk-indep ; test/original_audio/spk-indep
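The two partitions above can be written down as a small lookup. This is a minimal sketch, not part of the corpus tooling: the task labels `known_speakers` and `unknown_speakers`, the function name `partition_dirs` and the default root `Europarl-ASR/en` (taken from the directory tree) are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical task labels mapped to the dev/test subdirectory named in the README.
ASR_TASKS = {
    "known_speakers": "spk-dep",      # ASR with known speakers (MEP)
    "unknown_speakers": "spk-indep",  # ASR with unknown speakers (Guest)
}


def partition_dirs(task, root="Europarl-ASR/en"):
    """Return the train, dev and test directories for one ASR task."""
    subset = ASR_TASKS[task]
    base = PurePosixPath(root)
    return {
        "train": base / "train",  # the train set is shared by both tasks
        "dev": base / "dev" / "original_audio" / subset,
        "test": base / "test" / "original_audio" / subset,
    }


for name, path in partition_dirs("unknown_speakers").items():
    print(name, path)
```

Both tasks share the same train set; only the dev and test directories differ.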
 
 
 Each of these partition directories contains 3 to 4 subdirectories (depending
 on whether it is the train set or the dev/test sets): "lists", "metadata",
@@ -140,40 +142,40 @@ speeches per speaker.
 corresponding set (as csv and json files). For each speech we will find these
 metadata (as reflected in speeches.headers.csv):
 
-  term;session_date;speech_id;speaker_type;speaker_id;raw_dur;
-  aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title
+&nbsp;&nbsp;&nbsp;&nbsp;term;session_date;speech_id;speaker_type;speaker_id;raw_dur;<br />
+&nbsp;&nbsp;&nbsp;&nbsp;aligned-speech_dur;filtered-speech_dur;cer;ar;path;agenda_item_title
 
 And for each speaker (as reflected in speakers.headers.csv):
 
-  type;id;name;gender;url
+&nbsp;&nbsp;&nbsp;&nbsp;type;id;name;gender;url
 
 "speeches" contains a subdirectory for each speech in the corresponding set,
 according to this subdirectory structure:
 
-  speeches/<term>/<session_date>/<speech_id>/
+&nbsp;&nbsp;&nbsp;&nbsp;`speeches/<term>/<session_date>/<speech_id>/`
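As a rough illustration of how the metadata and the directory layout fit together, the sketch below reads speech metadata rows and derives each speech's subdirectory. It assumes that the metadata csv files are semicolon-separated and carry no header row (the field names being listed separately in speeches.headers.csv); neither assumption is confirmed here, and the sample row is invented.

```python
import csv
import io

# Field names as listed in speeches.headers.csv.
SPEECH_FIELDS = [
    "term", "session_date", "speech_id", "speaker_type", "speaker_id",
    "raw_dur", "aligned-speech_dur", "filtered-speech_dur", "cer", "ar",
    "path", "agenda_item_title",
]


def read_speeches(handle):
    """Return an iterator of dicts, one per speech row.

    Assumption: rows are semicolon-separated and the file has no header line.
    """
    return csv.DictReader(handle, fieldnames=SPEECH_FIELDS, delimiter=";")


def speech_dir(row):
    """Build the subdirectory of a speech from its term, session date and id."""
    return f"speeches/{row['term']}/{row['session_date']}/{row['speech_id']}/"


# Invented row, for illustration only; real values and formats may differ.
sample = io.StringIO(
    "8;2015-01-12;0001;MEP;123;60.0;55.0;50.0;0.1;0.9;some/path;Example title\n"
)
for row in read_speeches(sample):
    print(row["speaker_type"], speech_dir(row))
```

If the real files do include a header row, dropping the `fieldnames` argument lets `csv.DictReader` take the names from the first line instead.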
 
 
 For each speech, we will find some of the following files (depending on
 whether it is in the train set or in the dev/test sets):
 
-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a
-  [In all sets] Audio of the speech.
+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.m4a`<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[In all sets] Audio of the speech.
 
-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}
-  [In all sets] Official non-verbatim transcription of the speech, as a txt
+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.orig.{dfxp,json,srt,txt}`<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[In all sets] Official non-verbatim transcription of the speech, as a txt
   raw transcription file, as dfxp or srt force-aligned timed subtitle files,
   and its json metadata.
 
-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}
-  [In train set] Automatically filtered transcription of the speech, as dfxp
+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.filt.{dfxp,json,srt}`<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[In train set] Automatically filtered transcription of the speech, as dfxp
   or srt force-aligned timed subtitle files, and its json metadata.
 
-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}
-  [In train set] Automatically verbatimized transcription of the speech, as
+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.verb.{dfxp,json,srt,txt}`<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[In train set] Automatically verbatimized transcription of the speech, as
   a txt transcription file, as dfxp or srt force-aligned timed subtitle files,
   and its json metadata.
 
-  ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}
-  [In dev/test sets] Manually revised verbatim transcription of the speech,
+&nbsp;&nbsp;&nbsp;&nbsp;`ep-asr.en.orig.<term>.<session_date>.<speech_id>.tr.rev.{dfxp,json,srt,txt}`<br />
+&nbsp;&nbsp;&nbsp;&nbsp;[In dev/test sets] Manually revised verbatim transcription of the speech,
   as a txt transcription file, as dfxp or srt force-aligned timed subtitle
   files, and its json metadata.
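The naming scheme above can be sketched as a small helper that lists the file names a speech may have. The function name `candidate_files` and the example identifier values are hypothetical. Since the text says only that "some of" these files will be present, the result is a list of candidates, not a guarantee.

```python
# Transcription variants and their extensions, as listed in the README.
ALL_SETS = {"orig": ("dfxp", "json", "srt", "txt")}
TRAIN_ONLY = {
    "filt": ("dfxp", "json", "srt"),
    "verb": ("dfxp", "json", "srt", "txt"),
}
DEV_TEST_ONLY = {"rev": ("dfxp", "json", "srt", "txt")}


def candidate_files(term, session_date, speech_id, in_train):
    """List the file names a speech may have in the train or dev/test sets."""
    stem = f"ep-asr.en.orig.{term}.{session_date}.{speech_id}"
    variants = dict(ALL_SETS)
    variants.update(TRAIN_ONLY if in_train else DEV_TEST_ONLY)
    names = [f"{stem}.m4a"]  # the audio is present in all sets
    for variant, extensions in variants.items():
        names.extend(f"{stem}.tr.{variant}.{ext}" for ext in extensions)
    return names


# Hypothetical identifiers, for illustration only.
for name in candidate_files("8", "2015-01-12", "0001", in_train=False):
    print(name)
```

Checking each candidate with `os.path.exists` before opening it keeps the helper robust to speeches that lack some variants.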
 
 
@@ -185,7 +187,7 @@ transcriptions (*.ref) and as segment time marked files (*.stm). In all 4
 cases, the text is presented preprocessed for evaluation (tokenized,
 lowercased, punctuation removed...).
 
-*Text data ("text" directories):
+#### Text data ("text" directories)
 
 In the case of "train", they are subdivided in directories "external" and
 "internal". "internal" contains all the official non-verbatim transcriptions
@@ -197,24 +199,24 @@ Each "text" directory contains 2 subdirectories: "raw" (except in
 "train/external"), "prepro" (in all sets), or "scripts" (only in
 "train/external").
 
-  "raw" contains the raw text data for the corresponding set (*.txt.gz), and
+&nbsp;&nbsp;&nbsp;&nbsp;"raw" contains the raw text data for the corresponding set (*.txt.gz), and
   its metadata (*.csv). In the cases of "dev" and "test", both the official
   non-verbatim transcriptions (*.orig.*) and the manually revised verbatim
   transcriptions (*.rev.*) are included.
 
-  "prepro" contains the text data for the corresponding set, preprocessed for
+&nbsp;&nbsp;&nbsp;&nbsp;"prepro" contains the text data for the corresponding set, preprocessed for
   training or evaluation (tokenized, lowercased, punctuation removed...). This
   data is released to facilitate the reproducibility of our experiments.
 
-  Finally, "scripts" (only in "train/text/external") contains the script
+&nbsp;&nbsp;&nbsp;&nbsp;Finally, "scripts" (only in "train/text/external") contains the script
   get_DCEP.sh, which can be used to download the DCEP corpus from its original
   website and save it in compressed plain text (.txt.gz).
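Since the text data is distributed as gzip-compressed plain text (*.txt.gz), it can be streamed line by line without unpacking it first. The following is a minimal sketch using Python's standard `gzip` module. The function name `iter_lines`, the UTF-8 encoding and the file name in the usage comment are assumptions, not details stated in this README.

```python
import gzip


def iter_lines(path, encoding="utf-8"):
    """Yield the lines of a gzip-compressed text file, one at a time."""
    # "rt" opens the archive in text mode, so lines come back as str.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Usage (hypothetical file name):
# for line in iter_lines("some_text_file.txt.gz"):
#     print(line)
```

Streaming keeps memory use flat regardless of file size, which matters for a corpus with tens of millions of tokens.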
 
 
 
 
-ADDITIONAL Europarl-ASR MATERIALS
+<a id="additional-materials"></a>ADDITIONAL Europarl-ASR MATERIALS
 ---------------------------------
 
-https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz
+https://www.mllp.upv.es/europarl-asr/Europarl-ASR_v1.0_ngram_lm_and_vocab.tar.gz<br />
 https://www.mllp.upv.es/europarl-asr/Europarl-ASR_transcription_guidelines.pdf
 
 In addition to the speech and text data included in the main release and
@@ -229,7 +231,7 @@ materials to facilitate the reproducibility of our experiments:
   dev and test sets.
 
 
-EXTENDED DESCRIPTION
+<a id="description"></a>EXTENDED DESCRIPTION
 --------------------
 
 Europarl-ASR (EN) is a large English-language speech and text corpus of
@@ -244,7 +246,7 @@ Intel·ligència Artificial, Universitat Politècnica de València
 
 Europarl-ASR (EN) includes:
 
-*Speech data:
+#### Speech data
 
 - 1.3K hours of English-language annotated speech data (33K speeches, 1K
   speakers).
@@ -256,11 +258,11 @@ Europarl-ASR (EN) includes:
   (training partition): official non-verbatim transcriptions, automatically
   noise-filtered transcriptions and automatically verbatimized transcriptions.
 
-*Text data:
+#### Text data
 
 - 70M tokens of English-language text data.
 
-*Language models:
+#### Language models
 
 - The Europarl-ASR English-language n-gram language model and vocabulary.
 
@@ -283,7 +285,7 @@ Detailed dates of the EP speech and text data gathered:
 - DCEP (does not include any EP reports of proceedings): 2001 to 2012.
 
 
-ACKNOWLEDGEMENTS
+<a id="ack"></a>ACKNOWLEDGEMENTS
 ---------------
 
 The authors would like to acknowledge:
@@ -298,15 +300,15 @@ The authors would like to acknowledge:
   ( https://www.statmt.org/europarl/ ).
 
 This work has received funding from the EU's H2020 research and innovation
-programme under grant agreements 761758 (X5gon) and 952215 (TAILOR); the
-Government of Spain's research project Multisub (RTI2018-094879-B-I00,
+programme under grant agreements 761758 ([X5gon](https://cordis.europa.eu/project/id/761758)) and 952215 ([TAILOR](https://cordis.europa.eu/project/id/952215)); the
+Government of Spain's research project [Multisub](https://www.mllp.upv.es/projects/multisub/) (RTI2018-094879-B-I00,
 MCIU/AEI/FEDER,EU) and FPU scholarships FPU14/03981 and FPU18/04135; the
-Generalitat Valenciana's research project Classroom Activity Recognition
+Generalitat Valenciana's research project [Classroom Activity Recognition](https://aplicat.upv.es/exploraupv/ficha-proyecto/proyecto/20190714)
 (PROMETEO/2019/111) and predoctoral research scholarship ACIF/2017/055; and
 the Universitat Politècnica de València's PAID-01-17 R&D support programme.
 
 
-LEGAL DISCLAIMERS
+<a id="legal"></a>LEGAL DISCLAIMERS
 -----------------
 
 - Speech and text data from the European Parliament website (audio, official
@@ -319,7 +321,7 @@ LEGAL DISCLAIMERS
   update: 11 March 2015).
 
 
-LICENCE
+<a id="licence"></a>LICENCE
 -------
 
 - Speech and text data from the European Parliament website (audio, official
@@ -344,4 +346,4 @@ LICENCE
   Silvestre-Cerdà are licenced under CC BY 4.0. To view a copy of this
   licence, visit http://creativecommons.org/licenses/by/4.0/
 
-See the file LICENSE for the full licence texts.
+See the [LICENSE](https://mllp.upv.es/git-pub/ggarces/Europarl-ASR/src/master/LICENSE) file for the full licence texts.