Common Voice is a global community of contributors, dataset users, researchers and interested hobbyists who come together to create text and speech datasets that can power a more inclusive, open and healthy data ecosystem.
We want to create more space for community participation as we build the future of Common Voice with you. As part of this, we are sharing our 2024 goals and roadmap for feedback and discussion with the wider community.
This blog post doesn’t detail all the work our tiny team does. For example, we’ll still be fixing bugs, adding new languages, answering support questions and doing all the other day-to-day tasks that are needed to keep Common Voice healthy and running. But we also have some exciting expansions in the works! These are roughly clustered into three ‘product themes’.
For a more interactive look at this roadmap, we’ll also be hosting a live Q&A session with the Common Voice team on April 24, 2024. Free signup is available via this form.
Language as it is lived: Variants, code-switching and spontaneous speech
We want to capture the diversity and nuances of people’s speech. This year we’ll be rolling out support for code-switching (two languages in one dataset) and sociolects (variants of a language used by a social group). Both of these will initially be available through our second platform, Common Voice: Spontaneous Speech (CVSS). CVSS is currently in Alpha testing, and will be released in Beta at the start of Q3, with just three languages to start with.
Re-centring text as a data asset: Sentences driven by consent and quality
We’ve been focusing on improving the health of our text corpora. Last year we migrated sentence collection into the Common Voice platform, which has resulted in a 100% increase in the rate of languages ingesting new sentences and a 300% increase in people becoming sentence contributors. This year we’ve already worked to include our text corpus in our datasets, and we will soon be working to move the quality assurance processes for our sentences into the Common Voice platform. This will make it faster and easier to grow the text corpus with high-quality contributions across languages. We’ll also be prototyping some human-in-the-loop ‘commentary’ tools that may be useful for other ML practitioners.
Diversifying governance pathways for more equitable innovation and sustainability
The Data Futures Lab (DFL), also part of the Mozilla Foundation, is an experimental space for instigating new approaches to data stewardship challenges.
We’ll be working on a collaboration with the DFL to explore how the Common Voice platform might be able to support community-led data collection projects with different governance and licensing structures. We are committed to the good that open source does in the world, and are not making any changes to the licenses on existing datasets, but we want to listen to communities with different perspectives, and go on a learning journey with them. We’ll share our reflections and hold space for discussion in 2025.
Invigorating our open source and technical communities
One of our goals this year is to engage more with our communities beyond the data collection phase of their journey. We plan to co-design learning experiences with community members on using their data to develop responsible speech technology applications. To achieve this, we are partnering with the Responsible Computing Challenge.
We also want to better support and enable our own open source community to co-create the Common Voice platform to meet their needs. We are creating more space for discussion around technical direction and more roadmap transparency for feedback and collaboration. We’re also in the process of auditing all our public technical documentation to make it easier to get involved. We’re pairing this with a review of our internal processes, ramping up team attention for PRs to make sure interested contributors receive prompt feedback. Chat to us on Discourse, Matrix or GitHub to steer us in the right direction to support you!
Exploring different partnerships for sustainability
Common Voice is a nonprofit effort and is funded through grants and partnerships. Each annual roadmap includes work to secure funding that allows Common Voice to grow sustainably. This year, we’ll be continuing to explore funding routes that align with our mission. If you want to support us directly, donations are gratefully accepted, and you can email [email protected] to speak to us about institutional grants or partnerships.