Latest Common Voice Dataset Surpasses 20,000 Hours of Open-Source Speech Data

The April 2022 release also features six new languages, more speech data from female speakers

(GLOBAL | WEDNESDAY, APRIL 27, 2022) -- The latest Common Voice dataset, released today, has achieved a major milestone: more than 20,000 hours of open-source speech data that anyone, anywhere can use. The dataset has nearly doubled in the past year.

The new release also features six new languages — Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese — and more speech data from female speakers.

This is Common Voice’s ninth release. Common Voice is a Mozilla Foundation initiative with cross-sector backing from entities like the Gates Foundation, GIZ, NVIDIA and the UK FCDO. It is the world’s largest multilingual, open-source speech dataset. Researchers, academics, and developers around the world use Common Voice to train voice-enabled technology and ultimately make it more inclusive and accessible.

Says Hillary Juma, Common Voice Community Manager: “We are so glad to see new languages and increased representation in our latest dataset release. Our contributors have made this possible — from voice donations, to initiating their language in our project, to opening new opportunities for people to build voice technology tools that can support every language spoken across the world.”

Access the dataset: https://commonvoice.mozilla.org/datasets

Highlights from the latest dataset:

Six new languages join the dataset: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese.

Twenty-seven languages now have at least 100 hours of speech data. They include Bengali, Thai, Basque, and Frisian.

Nine languages now have at least 500 hours of speech data. They include Kinyarwanda (2,383 hours), Catalan (2,045 hours), and Swahili (719 hours).

Nine languages now have at least 45% of their gender tags marked as female. They include Marathi, Dhivehi, and Luganda.

The Catalan community fueled major growth. The Catalan community's Project AINA — a collaboration between Barcelona Supercomputing Center and the Catalan Government — mobilized Catalan speakers to contribute to Common Voice.

Highest community participation in decision-making yet. The Common Voice language Rep Cohort has contributed feedback and lessons on optimal sentence collection, the inclusion of language variants, and more.