If you’re a Common Voice contributor, you might notice that the navbar has changed at commonvoice.mozilla.org.
Under the “Speak” section, there’s now a holding space for a new feature that will be released in the first quarter of 2025. After a successful alpha pilot with a limited number of languages, Common Voice will be rolling out a Spontaneous Speech feature set. This will allow people to respond to prompts and to transcribe those responses. It complements our existing offering of read/prepared speech. These Spontaneous Speech contributions will then be released as a new dataset offering.
What is Spontaneous Speech
Until recently, Common Voice has been a dataset of read speech. Contributors are presented with a sentence in their chosen language and asked to record themselves reading this sentence aloud. This makes ensuring audio-text matching, or ‘transcription’, relatively straightforward. But speech rooted in public domain text can have drawbacks as well. The copyright-free texts that drive read speech corpora are usually older, and may use language that isn’t as suitable for modern uses. Many languages also do not have a lot of text in the public domain. Then there is the data format itself: people tend to read in a different way than they speak organically, which makes models trained on prepared speech perform poorly in some speech recognition contexts, for example when there are a lot of disfluencies or colloquialisms.
By expanding our community platform to offer Spontaneous Speech, Common Voice will be able to provide you - the researchers, developers and language activists we serve - with datasets that are more varied and more natural, and that better reflect the way we use language today. We also hope to lower the barrier to entry for more types of language user - people code-switching, translanguaging, using sociolects and oral-first languages. You can learn more about the team’s thinking in this interview from July 2024.
The Spontaneous Speech pilot
Common Voice launched a closed Alpha of Spontaneous Speech in the summer of 2024, working with community members to test the concept and a prototype of a Spontaneous Speech-enabled platform that collected data in 11 languages.
Feedback from this Alpha pilot program helped refine the Spontaneous Speech feature set, and the data collected helped us better understand contributor needs. In response, we’ve redesigned Spontaneous Speech so that users are shown a single prompt at a time, instead of the batched sets of sentences shown to contributors to our read corpus.
What Spontaneous Speech means for contributors
When Spontaneous Speech contributions become available, you can contribute in your chosen language by selecting the “Answer Questions” option from the Speak section of the Common Voice platform. From here, you’ll be asked to select the language you would like to contribute to. Press to record and briefly answer the questions provided in your own words.
Contributors will also be able to transcribe clips recorded by others, creating a text version of each clip, and to validate transcriptions, offering edits or suggestions when they don’t quite match what the speaker has said. We’ve also worked to further refine the UI, the supporting instructions for contributors and data validation to prepare for a wider launch of Spontaneous Speech next year.
What Spontaneous Speech means for the dataset
The core Common Voice datasets that you use for development or research won’t be changing. Spontaneous Speech dataset releases will be part of a parallel release process and will be a new, standalone dataset offering. Spontaneous Speech datasets will be released under a CC0 license, continuing our tradition of offering copyright-free speech data to anyone interested in working with multilingual speech.
Feedback
The decision to add Spontaneous Speech to Common Voice came from feedback from our contributors and dataset users. Thank you. If you have ideas you want to share or you would like to tell us what you think about Spontaneous Speech, you can find Common Voice on Matrix or Discourse, or email the team at [email protected]