At Common Voice, we aim to provide a platform that serves the needs of as many different communities as possible.
Last year (2022) we made big changes to improve linguistic inclusion in the dataset, including moving to freeform accent capture and introducing a new category, ‘variant’, which enables speakers to tag their speech with a particular variant or dialect within a language or language family.
This year (2023) we will be expanding metadata options for the text corpus - starting with Domain, and then Variant information.
Down the line (Q4 2023-Q1 2024), we will also be focussing on making it easier for the platform experience to be tailored for different language contributors, so that they can - for example - form ‘teams’ who run dedicated efforts to focus on their own variant or domain.
These features form the core of our roadmap, drawn from the fantastic feedback and guidance our communities and partners give us.
This said, we fully appreciate that some language communities will need to supplement the creation of a general use corpus earlier than this, or have particular community mobilisations in which they wish to build out a very specific corpus that Common Voice cannot easily support.
For these situations, we have collected a brief guide below to help you generate a dataset to complement Common Voice.
Collecting high quality, interoperable data
Audio sampling rate
To be as interoperable as possible with Common Voice, a 48kHz sampling rate is optimal. A high sampling rate also means that your dataset can be used for TTS as well as STT. Be aware that much existing audio is sampled at 44.1kHz and may need resampling.
The best file format for speech-to-text experiments is WAV (“.wav”). However, if you need to collect MP3 files (for example, because that is all your recording tool supports), they can be converted to WAV during pre-processing.
The average clip length on Common Voice is around 5 seconds, though this varies substantially between languages. Short clips are generally preferable for ASR tasks, so we would not advise recording longer clips, for example exceeding 15 seconds.
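As a rough sketch of how you might screen incoming clips against these recommendations, the following uses Python's standard-library `wave` module. The function name and thresholds are illustrative, not an official Common Voice tool, and it assumes your clips have already been converted to WAV:

```python
import wave

TARGET_RATE = 48000   # sampling rate that is most interoperable with Common Voice
MAX_SECONDS = 15      # advised upper bound for ASR clip length

def clip_ok(path):
    """Return True if a WAV clip matches the target sampling rate and length advice."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    return rate == TARGET_RATE and seconds <= MAX_SECONDS
```

A check like this is cheap to run over a whole folder of recordings before you spend time on manual review.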
Background noise is acceptable as long as every word of the recording remains audible. We want machine learning algorithms to handle a variety of background noise, so even relatively loud noises or quiet background music can be accepted, provided they don’t prevent you from hearing the entirety of the text. Crackles or ‘breaking up’ that prevent you from hearing the text mean you should reject the clip.
It is important that in an audio recording, you can only hear one person speaking distinct words. If there is another conversation in the background in which other words are audible, this will interfere with the model training.
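Judging audibility ultimately requires listening, but a simple loudness measurement can flag near-silent or suspiciously quiet recordings for closer review. This is a hypothetical heuristic (assuming 16-bit mono WAV input), not a substitute for human validation:

```python
import struct
import wave

def rms_level(path):
    """Root-mean-square amplitude of a 16-bit mono WAV clip, scaled to 0.0-1.0."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return (mean_square ** 0.5) / 32768.0
```

Clips with an RMS level near zero are likely silent or failed recordings; clips near 1.0 are probably clipped or distorted. Anything in between still needs a human ear.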
If your goal is to create an ASR engine that is responsive to many different speakers, then you should try to have as many kinds of voices as possible in your training set. For example, if you’d like it to perform well for speakers of different sexes, ages and accents, you will want to ensure they are well represented in your training sets.
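One way to check how well different kinds of voices are represented is to tally your clip metadata per demographic column. A minimal sketch, assuming your metadata lives in a tab-separated file with columns such as `age` and `gender` (the file layout and column names here are assumptions, not a fixed schema):

```python
import csv
from collections import Counter

def demographic_counts(tsv_path, field):
    """Count clips per value of one metadata column (e.g. 'age' or 'gender')."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # Empty or missing values are grouped under 'unknown'.
        return Counter(row.get(field, "") or "unknown" for row in reader)
```

Running this over your training split makes gaps visible early, before they become gaps in model performance.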
You can see our approach to metadata here.
We currently collect:
Accent (freeform data)
Variant (pre-set options)
Age (in intervals of 10 years)
Sex (these options will be expanded soon to be more inclusive)
We will be expanding this metadata taxonomy soon to include sentence variant and sentence domain.
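If you collect similar metadata for your own dataset, a small validation step helps keep it consistent. The field names and age bands below are illustrative assumptions for a supplementary dataset, not the official Common Voice schema:

```python
# Hypothetical 10-year age bands; adjust to match your own taxonomy.
AGE_BANDS = {"teens", "twenties", "thirties", "forties", "fifties",
             "sixties", "seventies", "eighties", "nineties"}

def validate_metadata(record, allowed_variants):
    """Return a list of problems with one clip's metadata record."""
    problems = []
    if record.get("age") and record["age"] not in AGE_BANDS:
        problems.append("age must be one of the 10-year bands")
    if record.get("variant") and record["variant"] not in allowed_variants:
        problems.append("variant must come from the pre-set list")
    # Accent is freeform, so any value is accepted as-is.
    return problems
```

Validating records as they are collected is much cheaper than cleaning inconsistent labels after release.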
Tools that can be used to collect audio data
- Computer or mobile: most computer hardware now has a built-in microphone.
- On mobile, iOS has Voice Memos and most Android devices have a sound recorder app.
- Depending on the type of desktop or laptop you have, you might find you already have Windows Sound Recorder or Apple GarageBand.
- There are also dedicated applications you can download, like Audacity.
- Annotation can be handled separately, or with a dual-purpose recording application like Lig-Aikuma.
- If you would like to use an online recording platform, then aside from Common Voice, another option is dictate.app, from the Speech Annotation Toolkits for Low Resource Languages project.
Where to host your datasets
We would recommend hosting your dataset on GitHub. We have just set up a new repository to link out to such datasets, that sits within the Common Voice organization on GitHub. This will make it easy for others to find the data and get in touch with you.
If you’d like to add your dataset to this repository, have questions about collecting your own supplementary data, or want your work featured on the Mozilla Foundation blog, please get in touch with one of our Community Managers on [email protected]