@@ -36,18 +36,19 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-* 1.3K hours of English-language annotated speech data.
+* 1300 hours of English-language annotated speech data.
+* 3 full sets of timed transcriptions: official non-verbatim transcriptions,
+ automatically noise-filtered transcriptions and automatically verbatimized
+ transcriptions.
 * 18 hours of speech data with both manually revised verbatim transcriptions
  and official non-verbatim transcriptions, split in 2 independent validation-
  evaluation partitions for 2 realistic ASR tasks (with vs. without previous
  knowledge of the speaker).
-* 3 full sets of timed transcriptions for the rest of the speech data
- (training partition): official non-verbatim transcriptions, automatically
- noise-filtered transcriptions and automatically verbatimized transcriptions.
+
 
 #### Text data
 
-* 70M tokens of English-language text data.
+* 70 million tokens of English-language text data.
 
 #### Pretrained language models
 
@@ -275,19 +276,19 @@ Europarl-ASR (EN) includes:
 
 #### Speech data
 
-* 1.3K hours of English-language annotated speech data (33K speeches, 1K
+* 1300 hours of English-language annotated speech data (33K speeches, 1K
  speakers).
+* 3 full sets of timed transcriptions: official non-verbatim transcriptions,
+ automatically noise-filtered transcriptions and automatically verbatimized
+ transcriptions.
 * 18 hours of speech data with both manually revised verbatim transcriptions
  and official non-verbatim transcriptions, split in 2 independent validation-
  evaluation partitions for 2 realistic ASR tasks (with vs. without previous
  knowledge of the speaker).
-* 3 full sets of timed transcriptions for the rest of the speech data
- (training partition): official non-verbatim transcriptions, automatically
- noise-filtered transcriptions and automatically verbatimized transcriptions.
 
 #### Text data
 
-* 70M tokens of English-language text data.
+* 70 million tokens of English-language text data.
 
 #### Language models
 