We are excited to announce that we will be making it easier for the communities of data creators and data consumers to tag the Common Voice sentence corpora with Domain information! Understanding the domain of sentences and vocabulary is useful for many data consumers.

We know many people in the community have been trying to build models that are optimised for particular use cases, and we want to make that even easier.

For example, if you wanted to train a model for deployment in a healthcare device used on hospital wards, you might want your training data to include domain-specific vocabulary that doctors use regularly - like cardiovascular, arteriosclerosis or ischemia.

Case study:

The Kiswahili community wanted to create a speech corpus specifically focussed on Agriculture. The platform did not make it possible to:

  • Tag / identify agriculture sentences through the platform during the import process
  • Serve users those sentences in particular, as part of a specific project team
  • Cluster those clips after downloading them through the general download flow

This made it harder for builders creating an agriculture-focused product to train the most relevant model for their use case.

We will be rolling out this work in phases. Phase 1, in February 2024, will give users the option to tag sentences with any of the following domains when adding them via the Sentence Collector:

  • General
  • Agriculture and Food
  • Automotive and Transport
  • Finance
  • Service and Retail
  • Healthcare
  • History, Law and Government
  • Media and Entertainment
  • Nature and Environment
  • News and Current Affairs
  • Technology and Robotics
  • Language Fundamentals (e.g. Digits, Letters, Money)

We will be releasing this sentence metadata within future dataset releases, starting in March 2024. We will not be rolling out any mechanism for "backtagging" existing sentences within the corpus yet, but are scoping this for a future phase.
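To give a sense of how data consumers might use this metadata once it ships, here is a minimal sketch that groups clips by domain from a tab-separated metadata file. It is an illustration only: the column name `sentence_domain`, the domain values such as `agriculture_food`, and the sample rows are all assumptions made for this example, not the confirmed release format, so check the actual dataset files before relying on any of them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for a downloaded metadata file.
# The "sentence_domain" column and its values are assumptions for illustration.
SAMPLE_TSV = (
    "path\tsentence\tsentence_domain\n"
    "clip_001.mp3\tThe maize harvest begins next week.\tagriculture_food\n"
    "clip_002.mp3\tThe patient shows signs of ischemia.\thealthcare\n"
    "clip_003.mp3\tGood morning, how are you?\t\n"
)


def group_clips_by_domain(tsv_file, domain_column="sentence_domain"):
    """Return a dict mapping each domain tag to the list of clip paths carrying it.

    Rows with no domain value are collected under "untagged", since existing
    sentences will not be back-tagged in the first phase.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        domain = (row.get(domain_column) or "").strip() or "untagged"
        groups[domain].append(row["path"])
    return dict(groups)


clips_by_domain = group_clips_by_domain(io.StringIO(SAMPLE_TSV))
agriculture_clips = clips_by_domain.get("agriculture_food", [])
```

With a real download you would pass an open file handle instead of the in-memory sample, for example `open("validated.tsv", encoding="utf-8", newline="")`, where the file name is again an assumption about the release layout.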

In subsequent phases, we will also be delivering end-to-end platform support for communities to focus their mobilising energy on creating a particular domain corpus, for example Health, Education or Climate. This will also ensure we can maintain an enjoyable user experience for all community contributors, regardless of their field of expertise, background or interests.

As always, reach out to us with ideas, thoughts and feedback on Matrix, on Discourse or by email at [email protected]