Part II: Exclusion
Considering how data projects can exclude, discriminate, and marginalize
Data and technology projects continue to exclude, discriminate, and marginalize. New data stewardship models create new barriers to meaningful participation, or reinforce old barriers.
Data stewardship models may fail to lead to a more equitable digital world.
For one, data stewardship models do not guarantee equitable governance or operations. Without constant effort and attention, data stewardship initiatives may exclude and marginalize, and may fail to protect vulnerable community members.
For another, data stewardship models may be unable to prevent the weaponization of data and inferences against vulnerable populations. In many cases, the best way to keep data safe is still not to collect it at all.
The stories in this part highlight how data can be used to exclude, discriminate, and harm people. Not all of the harms described here are caused by a failing data stewardship initiative, but an initiative fails when it allows stories like these to unfold.
Data stewardship initiatives that are built around community governance run the risk of inequitable governance. Community members may not have equal ability to participate in the governance of a co-op or collaborative. They may lack time, training, or opportunity to make informed decisions about what to do with their data. Stewardship initiatives have an obligation to lift members up: to provide training, to provide opportunities for input into decision-making, to provide protection for vulnerable members. Initiatives that fail to meet those obligations will only exacerbate inequality.
Story: Barriers to participation and governance
Parents of children with chronic conditions form a data co-op to help safeguard their children's health data. The group recruits members primarily by word of mouth and holds most of its meetings during the workday, which makes it harder for working parents to attend and participate meaningfully. As a result, the co-op's patient population is richer and whiter than the national patient population. After the co-op contracts with a technology company to produce an algorithm to triage patients and predict severe disease course, audits reveal that the algorithm produces worse clinical outcomes for Black and Hispanic patients.
Where data stewardship initiatives attempt to negotiate directly with platforms — for better terms, for new features — they risk creating new inequalities. Even now, platforms discriminate based on national and local policy differences: GDPR’s rights and protections are not always extended to non-European users.
Data stewardship initiatives that negotiate on behalf of communities may further this patchwork. Assuming platforms engage at all, a wealthy group of users in the United States may receive better data protection terms than, say, civil society groups representing rural minorities in India.
What does this mean? The success of a data stewardship initiative may depend more on the policy and economic context it operates in than on the stewardship model itself. This may confound efforts to scale stewardship models beyond the local level.
Story: Data rights arbitrage
Rideshare drivers in the UK successfully use subject access requests to build a detailed database of wage data. In response, rideshare companies negotiate a more equitable wage model that minimizes discrimination between individual drivers for the same ride. The driver collective and the rideshare companies sign a non-disclosure agreement.
This model is deployed only in the UK. In countries where individuals do not have the right to make subject access requests (such as the US), or in countries where drivers have less economic leverage (such as Thailand), the rideshare company continues to use its discriminatory wage model.
Data is potential. A single dataset or a single inference or algorithm may have multiple potential uses: from diagnostic to discriminatory. It is increasingly difficult to control these uses and reuses, even with better data stewardship. Achieving accountability for data-related discrimination may require more wholesale policy changes than a stewardship initiative can manage on its own.
Story: Voice-powered discrimination
By analyzing voice recordings of interviews from a longitudinal study, researchers develop an algorithm that detects early signs of cognitive decline from vocal biomarkers. The algorithm is developed with full patient consent, and effectively preserves the privacy of the initial study population.
After a blood-based testing protocol proves to be more reliable for clinical diagnosis, the algorithm is openly licensed. Soon, the algorithm is deployed in automated job screening tools (to filter high-risk candidates), insurance support hotlines (to re-price insurance for customers), and video-based social media (to drive targeted advertising).
Data stewardship initiatives are still vulnerable to power asymmetries. Even if a data steward succeeds in protecting members from commercial exploitation, most initiatives, if not all, remain vulnerable to legal exploitation, from government seizure to legal action. Here, data stewardship is powerless to stop the weaponization of data against vulnerable groups, whether in Hong Kong, the United States, or anywhere in between. Often, the safest option is not to collect data in the first place.
Story: Weaponizing a storytelling program
With the help of volunteers, an immigration activism group begins to compile a storytelling corpus. To encourage candor, the group promises to embargo stories from release until after the storyteller has died. To keep their promise, the group holds the stories in trust with a third-party trustee.
Five years into the program, a new administration that takes a harder line on immigration comes to power. Immigration enforcement successfully obtains a warrant for the storytelling data, and begins to reidentify the participants. Several are arrested.
The story in 2A is inspired by a racially-biased triage algorithm deployed across American hospitals.
The argument in 2B is inspired by several pieces on regulatory arbitrage. See, e.g., Ryan Calo and Alex Rosenblat, "The Taking Economy: Uber, Information, and Power," 117 Columbia Law Review 6; Brishen Rogers, "The Social Cost of Uber," 82 University of Chicago Law Review Online (2017). See also research from Itzhak Ben-David, Stefanie Kleimeier, and Michael Viehs on how companies take advantage of different national pollution regulations.
The argument in 2C is inspired in part by Nathaniel Raymond's piece "Safeguards for human subjects research can't cope with big data" (Nature, 2019). The story in 2C is inspired in part by actual research on using voice data to detect cognitive decline. For more on data-driven price discrimination, see, e.g., Silvia Merler, "Big data and first-degree price discrimination," Bruegel (2017); Christopher Townley, Eric Morrison, and Karen Yeung, "Big Data and Personalised Price Discrimination in EU Competition Law."
The argument in 2D draws from the work of Yeshimabeit Milner ("We will not allow the weaponization of COVID-19 data"), Mutale Nkonde ("Congress Must Act on Regulating Deepfakes"), Joy Buolamwini ("We must fight surveillance to protect Black lives"), and Virginia Eubanks. The example in 2D is inspired by the Boston College IRA Tapes Project.