
Minor reorganization in corpus description

Gonçal V. Garcés Díaz-Munío, 2 years ago
parent commit 51afc8c3b0
1 changed file with 11 additions and 10 deletions:

README.md (+11 -10)

@@ -36,18 +36,19 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-* 1.3K hours of English-language annotated speech data.
+* 1300 hours of English-language annotated speech data.
+* 3 full sets of timed transcriptions: official non-verbatim transcriptions,
+  automatically noise-filtered transcriptions and automatically verbatimized
+  transcriptions.
 * 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-* 3 full sets of timed transcriptions for the rest of the speech data
-  (training partition): official non-verbatim transcriptions, automatically
-  noise-filtered transcriptions and automatically verbatimized transcriptions.
+
 
 #### Text data
 
-* 70M tokens of English-language text data.
+* 70 million tokens of English-language text data.
 
 #### Pretrained language models
 
@@ -275,19 +276,19 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-* 1.3K hours of English-language annotated speech data (33K speeches, 1K
+* 1300 hours of English-language annotated speech data (33K speeches, 1K
   speakers).
+* 3 full sets of timed transcriptions: official non-verbatim transcriptions,
+  automatically noise-filtered transcriptions and automatically verbatimized
+  transcriptions.
 * 18 hours of speech data with both manually revised verbatim transcriptions
   and official non-verbatim transcriptions, split in 2 independent validation-
   evaluation partitions for 2 realistic ASR tasks (with vs. without previous
   knowledge of the speaker).
-* 3 full sets of timed transcriptions for the rest of the speech data
-  (training partition): official non-verbatim transcriptions, automatically
-  noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-* 70M tokens of English-language text data.
+* 70 million tokens of English-language text data.
 
 #### Language models