
Minor rev: revised abstract; revised extended description with more detailed figures

Gonçal V. Garcés Díaz-Munío 2 years ago
commit 44ff1634a2
1 changed file with 5 additions and 5 deletions

README.md

@@ -3,8 +3,8 @@ Europarl-ASR v1.0
 2 April 2021  
 [www.mllp.upv.es/europarl-asr](https://www.mllp.upv.es/europarl-asr)
 
-A large English-language speech and text corpus of parliamentary debates for
-streaming ASR benchmarking, speech data filtering and speech data verbatimization.
+A 1300-hour English-language speech and text corpus of parliamentary debates for
+(streaming) ASR training and benchmarking, speech data filtering and speech data verbatimization.
 
 Keywords: automatic speech recognition; speech corpus; speech data filtering;
 speech data verbatimization.
@@ -280,19 +280,19 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-* 1300 hours of English-language annotated speech data (33K speeches, 1K
+* 1263 hours of English-language annotated speech data (33,002 speeches, 1046
   speakers).
 * 3 full sets of timed transcriptions: official non-verbatim transcriptions,
   automatically noise-filtered transcriptions and automatically verbatimized
   transcriptions.
-* 18 hours of speech data with both manually revised verbatim transcriptions
+* 17.5 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
 
 #### Text data
 
-* 70 million tokens of English-language text data.
+* 69.4 million tokens of English-language text data.
 
 #### Pretrained language models