2. Building New Tech and Products
Goal: Trustworthy AI products and services are increasingly embraced by early adopters.
Introducing more trustworthy AI products to the market will require a number of things to happen. Foundational trustworthy AI technologies and practices — things like edge processing that uses data locally on a device, machine learning techniques that require less data, or data trusts that balance power between companies and users — will need to emerge as building blocks for developers. Similarly, new thinking about business models, explainability, and transparency will be needed.
Importantly, we will also need to build up our imagination of how the products and services that are so central to our lives today can give users greater control over their digital lives. This aspect of the work will not only require the efforts of people developing products and services — startup founders, entrepreneurs, funders — but also journalists, computer science researchers, artists, and others who see the big picture of how things need to change.
With all of this in mind, we believe that we should pursue the following short-term outcomes:
A first major step towards better products and services is developing technological building blocks that can power more responsible AI. These building blocks — which could include better pre-trained models, alternative data governance models, privacy-preserving methods for machine learning, and decentralized, open source datasets — will reflect some of the trustworthy AI themes we’ve identified.
One area where we are seeing better building blocks emerge is in the realm of privacy-preserving AI. Traditionally, machine learning has required centralized access to data, which raises concerns about both privacy and the concentration of control in the hands of a few large players. Increasingly, however, computer scientists have been exploring the possibilities of edge computing: decentralized computation performed at or near the source of the data itself.
Researchers at Google have created a method called federated learning that allows engineers to train a machine learning model without needing access to a centralized repository of training data. Federated learning works by using a decentralized network of nodes (i.e., people’s devices) to train an AI system: it can compute across millions of individual devices and combine those results to iteratively train a shared model. In this way, all the training data stays on a user’s device or is split across a number of trusted servers rather than sitting in a centralized database, ensuring greater privacy. The open source library PySyft is a popular implementation of federated learning.
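The core idea, federated averaging, can be sketched in a few lines: each simulated device runs a local training step on its own data, and only the resulting model parameters (never the raw data) are sent back to be averaged. This is a minimal, self-contained illustration, not the PySyft API; the single-parameter linear model, the learning rate, and the toy datasets are all assumptions made for the sketch.

```python
import random

def local_update(w, data, lr=0.05):
    # one gradient-descent step on a toy linear model y = w * x,
    # computed entirely on the device that holds `data`
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_average(weights, sizes):
    # the server combines local parameters, weighted by dataset size;
    # it never sees the underlying (x, y) pairs
    total = sum(sizes)
    return sum(w * n for w, n in zip(weights, sizes)) / total

# three simulated devices, each holding private, slightly noisy
# samples of the relationship y = 3x
random.seed(0)
devices = [[(x, 3 * x + random.uniform(-0.1, 0.1)) for x in range(1, 5)]
           for _ in range(3)]

w = 0.0  # global model parameter
for _ in range(50):
    local_weights = [local_update(w, d) for d in devices]
    w = federated_average(local_weights, [len(d) for d in devices])
```

After a few dozen rounds the global parameter converges near the true value of 3, even though no device ever shared its raw data.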
Another relevant technique, differential privacy, provides a statistical guarantee of privacy: carefully calibrated noise is added before or during training so that individual users can’t be identified from their data. Differential privacy has long been the foundation of Apple’s approach to machine learning, and Google’s open source TensorFlow Privacy library now supports training machine learning models with differential privacy.
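One of the basic tools behind differential privacy is the Laplace mechanism: bound how much any one person can shift a statistic, then add noise calibrated to that bound. The sketch below is illustrative only; the ages dataset, bounds, and epsilon value are invented, and real systems should use a vetted library such as TensorFlow Privacy rather than hand-rolled noise.

```python
import math
import random

def laplace_noise(scale):
    # sample from a Laplace(0, scale) distribution via inverse CDF
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_mean(values, lower, upper, epsilon):
    # clamp each value so a single person's contribution to the sum
    # is bounded; that bound (the "sensitivity") calibrates the noise
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    true_mean = sum(clamped) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon)

random.seed(0)
ages = [random.randint(18, 90) for _ in range(1000)]
noisy = private_mean(ages, lower=0, upper=100, epsilon=1.0)
```

With a thousand records, the noisy mean stays close to the true mean, yet the published number reveals almost nothing about any single person in the dataset.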
Training datasets and pre-trained models are key to building AI. Although big tech companies might have the vast resources required to develop their own machine learning models, most smaller companies rely on datasets or pre-trained models from Google or IBM to develop their own AI applications. To this end, we will need to work toward ensuring that these existing pre-trained models and datasets are trustworthy. This will be complicated, as such a move risks further legitimizing the power of dominant tech players and further entrenching power inequalities. But by ensuring that the most popular pre-trained models and datasets adhere to a high level of scrutiny, we can quickly scale more powerful trustworthy AI in consumer tech.
Another place where people are working on building blocks for more trustworthy AI is in the re-imagination of data governance and management. While regulations like the GDPR do this through the framework of individual data rights, we also need bottom-up legal structures that balance individual and collective approaches to data governance. Because “data ownership is both unlikely and inadequate as an answer to the problems at stake,” we need to develop new approaches to data governance that shift power from companies back to communities and individuals through a mix of collective and individual frameworks. Ideally, individual fiduciary models can support and complement the more collectivized governance structures of data trusts and co-ops.
One proposed way to do this is to require the big tech platforms to become information fiduciaries. A data fiduciary is an intermediary between the people whose data is being collected (“data subjects”) and the companies collecting that data (“data collectors”). Under this system, users would entrust their personal data to an online platform or company in exchange for a service, and the platform would in turn have a responsibility to exercise care with that data in the interest of the user. While this approach holds promise, under this model companies would have divided loyalty between the interests of shareholders and the interests of end users. According to Mozilla’s report on alternative data governance models, “Critics are doubtful whether data fiduciary solutions present a realistic path to structural change, even if they could empower individuals to have more control over access to their personal data and enhance accountability through duties of care.”
An alternative fiduciary model is a legal mechanism called data trusts. Similar to an information fiduciary, a data trust is an independent intermediary between two parties. Unlike the fiduciary model, however, the trust would store the data from the data subjects and would negotiate data use with companies according to terms set by the trust. It would also have an undivided duty of loyalty toward its members according to a legal trust framework. Trusts could also serve as a mechanism for the collective enforcement of data rights, making it more likely that actions would be taken to use laws like GDPR to drive changes to products and services in ways that benefit end users. Different trusts might have different terms, and people would have the freedom to choose the trust that most aligns with their own expectations. Some data trusts already exist. For instance, UK Biobank, a charitable company with trustees, is managing genetic data from half a million people.
Another proposed approach is the data cooperative model. A data cooperative facilitates the collective pooling of data by individuals or organizations for the economic, social, or cultural benefit of the group. The entity that holds the data is often co-owned and democratically controlled by its members. Similar to a US credit union or a German Sparkasse savings bank, a data cooperative would share the benefits or profits of the data among its members. Because they could also run internal analytics, both data co-ops and data trusts would be in a strong position to negotiate better services for their members. Some data co-ops already exist: The MIT Trust Data Consortium has demonstrated a pilot version of this system.
Another approach to managing data differently is a data commons, which pools and shares data as a common resource and is typically accompanied by a high degree of community ownership and leadership. One of the major barriers to AI innovation is that only a handful of companies have access to training data. Data commons chip away at that power by democratizing access to training datasets and models, making them available to anyone who wants to use them, in a format that can be easily analyzed. Often that data is harmonized according to common data specifications, so that it is easy to use across different data pipelines. There are many different data commons already in existence: Mozilla’s project Common Voice, for instance, is a crowdsourced dataset that represents the largest set of open source voice training data in the world, with more than 250k contributors, 4.2k hours of recorded voice data, and 40 different languages. In academia, the UCI Machine Learning Repository hosts hundreds of datasets that have been accessed millions of times to benchmark ML algorithms in academic research.
Privacy-preserving AI techniques, new data governance models, and open source training datasets are only a few examples of the kinds of building blocks we need to emerge in order to get closer to trustworthy AI. While some of the privacy-preserving techniques we mentioned are becoming standard, there is a long way to go before they are widely adopted. In addition, new data governance models and open data projects are still at a very early stage of development. Over the coming years, we plan to support people and organizations developing and testing out these building blocks, starting with a major exploration of real-world uses of responsible data stewardship and data governance models.
Transparency is the most commonly cited principle in dozens of ethical AI guidelines, across geographic regions and sectors, and a major focus of current research and development. There is wide consensus that technological norms and processes that enable transparency are themselves a major class of building blocks for trustworthy AI. However, different actors interpret transparency to mean different things. To move toward AI that is more transparent and accountable, we will need to weave together disparate work that is happening across different sectors.
The AI that is used in consumer-facing tech is often complex and opaque. There are a number of reasons for this, some of which are technical and some of which are based in institutional norms and incentives. According to Jenna Burrell, there are three key sources of AI opacity: (1) opacity as intentional corporate secrecy; (2) opacity as technical illiteracy; and (3) opacity that arises from the characteristics of machine learning algorithms and the scale required to apply them usefully.
Focusing on that third category, technical solutions are currently being explored to make opaque AI more transparent. For decades computer scientists have been working to improve the explainability of AI — whether an AI system can be easily understood by and explained to a human. Another way developers are trying to make opaque AI more accountable is to use a human-in-the-loop approach, which means that humans are directly involved in training, tuning, and verifying the data used in a machine learning algorithm. This allows groups of experts with specialized knowledge to correct or fix errors in machine predictions as the process develops. In this way, humans are more actively involved in making normative judgements about the output of an AI system, rather than offloading decisions to the model.
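In its simplest form, a human-in-the-loop pipeline routes low-confidence predictions to a reviewer instead of acting on them automatically. The sketch below is schematic: the toy model, the reviewer function, and the 0.8 threshold are invented for illustration, and in a real system the reviewer's corrections would typically also be logged and fed back into the next training round.

```python
def predict_with_review(model, item, reviewer, threshold=0.8):
    # return the model's label, unless its confidence is low,
    # in which case defer the decision to a human reviewer
    label, confidence = model(item)
    if confidence < threshold:
        return reviewer(item), "human"
    return label, "model"

# toy stand-ins: a classifier that is unsure about short strings,
# and a "reviewer" that always labels correctly
def toy_model(text):
    confidence = min(1.0, len(text) / 10)
    return ("long" if len(text) >= 5 else "short"), confidence

def human_reviewer(text):
    return "long" if len(text) >= 5 else "short"

result, source = predict_with_review(toy_model, "hi", human_reviewer)
```

Here the model is unsure about the short input, so the prediction is escalated: `source` comes back as `"human"` rather than `"model"`.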
However, technical solutions to AI transparency can only go so far. In order to address other sources of opacity such as intentional secrecy, companies will need to develop their products and services in a way that enables third party validation and audit.
While developers should be regularly auditing their AI systems, they can also build those systems in a way that makes them easier to audit by third parties. One way companies are doing this is through open data archives. We are already seeing progress in this area with political ads: Companies that allow advertisers to target people with political ads should provide the public with clear, accurate, and meaningful information, available in a format that allows for bulk analysis by regulators and watchdog groups. Under pressure from advocacy groups and policymakers, platforms like Facebook, Twitter, and Google have developed open political ad libraries that provide detailed information about who paid for an ad, how it was targeted, the size of the audience that saw it, and other information.
The disclosure of data archives should go beyond advertising, though. In some cases, it may be necessary for companies to disclose what content was taken down or removed and why. As such, we want companies to work toward developing transparency products that give third parties access to more information about AI-based targeting systems. By analyzing data archives, researchers can more readily identify patterns of discrimination or deception that any individual user would not be able to see. Without bulk disclosure, the systems can evade efforts to systematically identify harm.
Platforms and services can be designed in a way that gives users greater control and agency over the algorithm’s output. One way digital platforms are already doing this is by giving users explanations of the system’s behavior within the UI itself. Such interventions can surface the inputs used by a recommender system. For instance, Netflix labels recommended videos with “Because you watched X...” While such explanations may not help individuals identify discrimination or harm, they would help users understand why the algorithm behaved a certain way and could empower them to take action.
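Labels like this can be generated by simply recording which watched title triggered each recommendation. A minimal sketch, assuming an item-to-item similarity table (the titles and data structure here are invented for illustration, not Netflix's actual system):

```python
def recommend_with_reasons(watched, similar):
    # similar: maps each title to related titles (item-item similarity);
    # we record the watched title that produced each recommendation so
    # the UI can show a "Because you watched X" explanation
    recommendations = {}
    for title in watched:
        for rec in similar.get(title, []):
            if rec not in watched and rec not in recommendations:
                recommendations[rec] = f"Because you watched {title}"
    return recommendations

similar = {
    "Space Drama": ["Moon Base", "Star Voyage"],
    "Cooking Show": ["Baking Duel"],
}
recs = recommend_with_reasons(["Space Drama"], similar)
```

Attaching the provenance at recommendation time costs almost nothing, which is part of why this style of explanation has become common in consumer UIs.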
Transparency is not a “one size fits all” solution to AI accountability, and not every algorithm requires this level of transparency. In fact, arguments have been made that blunt transparency runs the risk of obscuring itself further by overwhelming us with too much information. Furthermore, useful transparency isn’t always possible – in which case we may need to reconsider whether AI should be used at all in high-stakes consumer environments like health or credit. But we do need to come up with real and workable solutions for transparency across the whole AI ecosystem — from transparency and control for users to tools that allow for scrutiny by researchers and regulators. As a community, we need to continue to experiment, test, and push to make this level of transparency a reality.
We noted before that impact investors can set themselves apart by funding trustworthy AI products and technologies. In addition to new products, though, we will also need new business models that don’t rely on the exploitation of people’s data. As Zeynep Tufekci notes, a business model rooted in “vast data surveillance” to “opaquely target users…will inevitably be misused.” We need to ensure that startups with business models that are socially responsible and that don’t exploit users are getting funded.
Companies already recognize the value in developing business models and objectives that respect — rather than infringe on — people’s privacy. Over the past few years, data privacy has evolved from a “nice to have” to a must-have for businesses and a critical topic of discussion. According to a Cisco survey, 70% of organizations that invested in privacy say they are seeing benefits in every area of their business, in the form of competitive advantage, agility, and improved company attractiveness to investors. Companies that demonstrate they care about people’s privacy and well-being increasingly have a market advantage.
There is a hunger in the market for different business models that aren’t focused on monetizing or exploiting people’s data. One simple alternative is to set up the platform so that people pay to use it. Consumers are already used to this model — after all, many people are paying for access to streaming platforms like Netflix, HBO Go, and Hulu, or subscription services like Amazon Prime or HelloFresh. Before it was acquired by Facebook, WhatsApp was charging users $1 per year and the platform was still experiencing huge growth. The downside to this model is that it’s unclear whether people will be willing to pay for multiple subscription services. One survey shows that 75% of consumers capped their maximum spend on streaming services at $30.
Platforms like Hulu and Facebook take what is called a two-sided approach to their business model: They profit both from users paying to use the service and from sharing user data with advertisers. One option for two-sided businesses, then, is to rely on more privacy-preserving methods of doing data analysis. These new businesses may offer companies a new way to identify patterns without exploiting people’s data.
It’s important to note that even if these key industry business models were to change, companies may still be incentivized to exploit people's data for the purpose of training their models. This is the product of what has been called the “agile turn,” a way of building tech that shortens development cycles and demands constant user surveillance and testing. More work needs to be done to change these incentives in the tech industry.
All of these ideas will have trade-offs. Funders will need to continue supporting startups and technologies that are seeking out different ways of doing business. As a way to build momentum, foundations and others interested in impact investing could set up special funds to encourage data privacy and alternative approaches to data governance. Impact funds have done a great deal to pave the way for investment in fields like green energy. The same could happen in the field of trustworthy AI.
2.4 The work of artists and journalists helps people understand, imagine, and critique what trustworthy AI looks like.
Many of AI’s shortcomings are not readily apparent to the public, as they are often hidden in complex systems that are difficult to audit. The job of critiquing AI often falls on journalists, artists, creative technologists, and other researchers who are interrogating how these systems work. But beyond critique, they can help us expand our thinking around what is possible by showing us what alternate, preferred futures our technologies can offer.
Investigative journalism is shaping the future of AI by shedding light on technology’s shortcomings and limitations. Journalists can serve as corporate watchdogs by investigating computational systems, and they can also help us understand what is happening by providing context and evidence. For instance, ProPublica’s groundbreaking 2016 series on machine bias in crime algorithms unlocked a new set of investigations into AI bias. In addition, its work on Facebook’s ad targeting platform showed how advertisers could target ads in a discriminatory way, leading to new research and lawsuits on ad discrimination. The fledgling news organization The Markup launched in 2020, using data to investigate tech and its influence on society. The New York Times has a growing tech beat, hiring reporters with expertise in tech investigations. We rely on journalists to do the ever-important work of holding our technologies accountable when they have the potential for harm.
Similarly, art is helping expose the limitations and shortcomings of AI. Artists critique current systems and imagine different ones by providing us a new lens through which we can see our world. For instance, the project ImageNet Roulette by artist Trevor Paglen and researcher Kate Crawford exposes the biases in image datasets used to train models that categorize humans. The project reveals how AI can become a new vector for social discrimination, and shows how art can be wielded to hold tech accountable.
In addition, art and design are tools to help us see what alternative worlds and technologies could look like. Artists and designers do this through speculative design and futures, tools that help us imagine the futures we want to build. For example, there is a growing body of work around feminist technologies that imagines what alternative voice AI could look like in practice. The organization Feminist Internet runs a workshop called “Designing a Feminist Alexa,” which has resulted in a number of voice experiments that push the boundaries of how we think our voice assistants should speak, act, and interact.
As much of this work is still nascent, we have yet to see many of these experiments mirrored in our technologies. Much more work will need to be done to ensure that these innovative ideas and experiments can become viable and real. Mozilla will continue to support journalists and artists who are critiquing our current technological landscape, and offering up visions of an alternate, preferred one.
Sylvie Delacroix and Neil Lawrence, “Bottom-Up Data Trusts: Disturbing the ‘One Size Fits All’ Approach to Data Governance,” SSRN Scholarly Paper (Rochester, NY: Social Science Research Network, October 12, 2018), https://doi.org/10.2139/ssrn.3265315.
Thomas Hardjono and Alex Pentland, “Data Cooperatives: Towards a Foundation for Decentralized Personal Data Management,” ArXiv:1905.08819 [Cs], May 21, 2019, http://arxiv.org/abs/1905.08819.
Anna Jobin, Marcello Ienca, and Effy Vayena, “The Global Landscape of AI Ethics Guidelines,” Nature Machine Intelligence 1, no. 9 (September 2019): 389–99, https://www.nature.com/articles/s42256-019-0088-2.
Jenna Burrell, “How the Machine ‘Thinks’: Understanding Opacity in Machine Learning Algorithms,” Big Data & Society 3, no. 1 (January 5, 2016): 205395171562251, https://doi.org/10.1177/2053951715622512.
Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort, “Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms,” paper presented to “Data and Discrimination: Converting Critical Concerns into Productive Inquiry,” a preconference at the 64th Annual Meeting of the International Communication Association, May 22, 2014, Seattle, WA, USA, http://www-personal.umich.edu/~csandvig/research/Auditing%20Algorithms%20--%20Sandvig%20--%20ICA%202014%20Data%20and%20Discrimination%20Preconference.pdf.
Mike Ananny and Kate Crawford, “Seeing without Knowing: Limitations of the Transparency Ideal and Its Application to Algorithmic Accountability,” New Media & Society, December 13, 2016, https://doi.org/10.1177/1461444816676645.
Andrei Hagiu and Julian Wright, “Multi-Sided Platforms,” SSRN Scholarly Paper (Rochester, NY: Social Science Research Network, March 19, 2015), https://doi.org/10.2139/ssrn.2794582.
Seda Gurses and Joris van Hoboken, “Privacy after the Agile Turn,” SocArXiv, May 2, 2017, https://doi.org/10.31235/osf.io/9gy73.