We are excited to announce a set of changes to the Common Voice dataset.

People who contribute to Common Voice are under no obligation to share other details about themselves with us. If they wish to, however, they can share information about their age, sex, the languages they speak (including dialects and variants), and their accent. Doing so is extremely helpful for the data consumers who use Common Voice to train models, and for the community mobilisers working to address bias and representation issues. Common Voice is radically transparent: we share all metadata coverage and statistics with the dataset release so that people understand exactly what they’re working with.

At the advent of the platform, the gender field in people’s profile settings followed historic norms: it was referred to as gender, and offered options for male, female and other. This was driven by a concern for interoperability (this is still how most actors globally taxonomize, including governments) and community safety (in many of the contexts in which data is collected, gender-diverse communities face discrimination or even violence).

This taxonomy, however, does not reflect the beliefs or values of the Common Voice team, the project, or the wider Mozilla Foundation.

We at Common Voice acknowledge that sex and gender are two different categories. Sex is “different biological and physiological characteristics”; gender is “socially constructed” and “personally experienced”.

We know that for many people, the experience of being served the current options (M/F/O) is hurtful. It also precludes some really exciting, progressive gender-conscious ASR work. We wish to address these issues directly in 2024.

We will now be labelling this category “Sex or Gender”, because we want to give the community more choice in what they tell us about themselves.

Principles: balancing interoperability, community usability, experience and identity

  • Many datasets still use Male, Female, Other and refer to these as ‘Gender’, despite the fact that this is ‘Sex’
  • The impact of sex on vocal characteristics is reasonably well understood
  • The impact of gender identity on speech is less well explored
  • We want to give the community more choice in what they tell us about their sex and/or gender identity
  • We also want the dataset to be easy to consume, understand and use in a variety of different social and geographic contexts
  • We want our metadata to help us be accountable on questions of representation and inclusion
  • The tags below are not meant to be an exhaustive description of people’s identity, nor are they mutually exclusive. All taxonomies ultimately fail to capture the diversity of human experience, and we know that some people may still not feel fully represented here
  • People should choose the tag that allows them to self-describe in the way that feels most right and natural to them
  • People should feel absolutely free not to share

V1.1 Metadata Schema for Common Voice

Sex or Gender

Female/Feminine

Male/Masculine

Intersex

Transgender

Non-binary

Don't wish to say

FAQ

What happens to my existing metadata tags, if I have already chosen a Gender dropdown option for my profile?

If you had selected Female, it will now be Female/Feminine. If you had selected Male, it will now be Male/Masculine. If you had chosen Other, it will reset to No Information, so as not to assume anything about why that option was selected. If you hadn’t selected an option, this will continue to be the case (No Information).
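
For data consumers who want to align older releases with the new schema, the migration rules above can be sketched roughly as follows. This is only an illustration, not the platform's actual migration code: the function name `migrate_tag`, the mapping `LEGACY_TO_V1_1`, and the exact label strings are assumptions, and the values stored in the platform or in dataset files may be spelled differently. No Information is represented here as `None`.

```python
from typing import Optional

# Hypothetical label strings taken from the wording of this post;
# the values actually stored by the platform may differ.
NO_INFORMATION = None

LEGACY_TO_V1_1 = {
    "Female": "Female/Feminine",
    "Male": "Male/Masculine",
    # "Other" is reset rather than mapped, so that nothing is assumed
    # about why that option was originally selected.
    "Other": NO_INFORMATION,
}


def migrate_tag(legacy: Optional[str]) -> Optional[str]:
    """Map a legacy Gender selection to a V1.1 "Sex or Gender" tag."""
    if legacy is None:
        # No option had been selected: this stays No Information.
        return NO_INFORMATION
    if legacy not in LEGACY_TO_V1_1:
        # Fail loudly instead of silently guessing at unknown values.
        raise ValueError(f"Unrecognised legacy value: {legacy!r}")
    return LEGACY_TO_V1_1[legacy]
```

Under these assumptions, `migrate_tag("Female")` gives `"Female/Feminine"`, while `migrate_tag("Other")` and `migrate_tag(None)` both give `None`.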

How do I change my Gender tag?

You can change your selection at any time, in your profile settings. We encourage the community to make any changes that feel right to them, choosing from our expanded options.

When are these changes coming?

The platform will show you the new options in February 2024. This is also when the data migration will occur. The March 2024 Mozilla Common Voice dataset release will use the new metadata schema.