We’re increasing the maximum clip length for donated speech clips in the Common Voice dataset. As a first step, we’ve raised the limit our platform places on recorded clips from the original 10 seconds to 15 seconds. This will allow future contributors to donate longer clips, while giving dataset users longer clips for development or research. We’ve made this change as part of ongoing discussions with our contributor and dataset user community.
When we first set the clip length, we chose 10 seconds conservatively, in light of the limits of ML infrastructure at the time for working with speech data. Longer clips require more GPU memory to process, so the batch size (the number of clips that can be processed simultaneously) decreases. But as both ML processes and the hardware available to most dataset users have improved over the years, we wanted to expand clip lengths to meet dataset user demand. The current 15 second clip length is still shorter than that of many other voice datasets in the wider market, and this is by design. We want to be conservative with expanding clip length, both to keep voice contributions lower effort for our contributors and to make sure that dataset users with a range of hardware and infrastructure options are able to work with the dataset.
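To make the trade-off between clip length and batch size concrete, here is a rough sketch. It assumes that memory use grows roughly in proportion to the number of audio samples in a batch and that every clip is padded to the maximum length. Real models can scale worse than this, so treat it as an illustration only. The 16 kHz sample rate, the 300 second budget and the function name max_batch_size are assumptions chosen for the example, not Common Voice settings.

```python
def max_batch_size(budget_samples: int, clip_seconds: float, sample_rate: int = 16_000) -> int:
    """Clips that fit in a fixed budget of audio samples (hypothetical helper).

    Assumes every clip in the batch is padded to clip_seconds and that
    memory use is proportional to the total number of samples.
    """
    samples_per_clip = int(clip_seconds * sample_rate)
    return budget_samples // samples_per_clip


# Assumed budget: 300 seconds of 16 kHz audio per batch.
budget = 300 * 16_000

for seconds in (10, 15):
    print(f"{seconds} second clips: batch size {max_batch_size(budget, seconds)}")
```

Under these assumptions, the same budget holds 30 clips of 10 seconds but only 20 clips of 15 seconds, which is why we are expanding the limit gradually.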
What happens next:
Raising the recording limit is just the first step to bringing longer clips to the Common Voice dataset. Our next steps will be to update rules and guidelines, like changing the limits for our words-per-sentence rules for some languages. We’ll also be updating our documentation and guidelines in the coming weeks. Once this is completed, we can begin to welcome longer sentences into the text corpus, which will introduce longer clips into future speech datasets.
Thank you so much again to the community for flagging the need for this change. If you have great ideas or pain points, or just want to show off what you’re building, you can reach out on Discourse or Matrix, or email us at commonvoice at mozilla.com.