Quantitative study

The extension

The study was carried out with participants who installed our web extension and then opted into our experiment and data collection. For those who did not opt in, the reject functionality sent a “dislike” signal and no data was collected.

Those who did opt in were randomly assigned, with equal probability, to one of our experiment arms (a minimal sketch of such an assignment follows the list):

  • Control (or “placebo”): Pressing the "Stop Recommending" button will have no effect.
  • Dislike: Pressing the button will send YouTube a "dislike" message.
  • History: Pressing the button will send a message to remove the video from your watch history.
  • Not interested: Pressing the button will send YouTube a "Not interested" message.
  • Don’t recommend: Pressing the button will send YouTube a "Don't recommend channel" message.
  • No button: The "Stop Recommending" button will not be shown, but standard data will still be collected.
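Purely as an illustration, equal-probability assignment could be implemented in Python along the following lines; the arm identifiers and function name are our own and do not reflect the extension's actual code.

    import random

    # The six experiment arms described above (identifiers are illustrative).
    ARMS = [
        "control",         # placebo: button has no effect
        "dislike",         # sends a "dislike" message
        "history",         # removes the video from watch history
        "not_interested",  # sends a "Not interested" message
        "dont_recommend",  # sends a "Don't recommend channel" message
        "no_button",       # button hidden; standard data still collected
    ]

    def assign_arm(rng=random):
        """Assign an opted-in participant to one arm with equal probability."""
        return rng.choice(ARMS)

    arm = assign_arm()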

For those who opted in, the extension also collected data, which was sent to Mozilla servers using Firefox telemetry. The data collected included the following (a hypothetical sketch of the event structure follows the list):

  • A unique installation ID
  • Experiment arm assigned
  • A record of all uses of the “Stop recommending” button, including the timestamp and the ID of the video on which the button was pressed.
  • A record of all recommendations made by YouTube, including timestamp, video ID, and type of recommendation (i.e., sidebar or homepage).
  • A record of all interactions with native YouTube user control features.
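To make the collected fields concrete, the sketch below shows what individual telemetry records could look like; the class and field names are hypothetical and are not the extension's actual schema.

    from dataclasses import dataclass
    from typing import Literal

    @dataclass
    class ButtonPressEvent:
        """One use of the "Stop recommending" button (fields are illustrative)."""
        installation_id: str   # unique installation ID
        experiment_arm: str    # arm assigned at opt-in
        timestamp: str         # ISO 8601 time of the button press
        video_id: str          # video on which the button was pressed

    @dataclass
    class RecommendationEvent:
        """One recommendation shown to the participant."""
        installation_id: str
        experiment_arm: str
        timestamp: str
        video_id: str
        recommendation_type: Literal["sidebar", "homepage"]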

Data

Participant data

We analyzed data collected from December 2, 2021 to June 26, 2022. In this period, 22,838 participants opted into our experiment and data collection. After removing participants exhibiting exceptional behavior (bots, scripts, or people with very unusual usage), we were left with 22,722 participants for analysis. Of these, 6,147 rejected at least one video, for a total of 30,314 rejected videos. Our participants were recommended 567,880,195 videos. In total, there are 162,983,496 video pairs that we were able to analyze; this number was limited by the quantity of YouTube metadata collected.

Research assistant data

We contracted a team of 24 research assistants from the University of Exeter, supervised by Dr. Chico Camargo, to classify video pairs according to our policy. Between April 22 and June 26, 2022, they classified 44,434 pairs (after removing data from a handful of research assistants who were found to have produced high rates of incorrect classifications).

The classifications were:

  • Acceptable recommendation: 75%
  • Bad recommendation: 22%
  • Unsure: 3%

Video language was automatically classified by applying the gcld3 model to the video description. On this basis, the classified videos spanned 102 different languages. We sought out research assistants with varied language skills, but we also allowed them to classify pairs in languages they did not understand, either by using translation tools or in cases where language understanding was not necessary to make a classification.
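For illustration, applying gcld3 to a video description in Python looks roughly like the following; the byte limits and example text are our own choices, not necessarily those of our pipeline.

    import gcld3

    # Compact Language Detector v3; byte limits here are illustrative.
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

    description = "Ein kurzes Video über die Geschichte der Raumfahrt."
    result = detector.FindLanguage(text=description)

    print(result.language)     # e.g. "de"
    print(result.is_reliable)  # whether the prediction is considered reliable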

For the initial days of classification, we asked the research assistants to classify random pairs from our data. For the majority of the classification period, however, we asked them to classify pairs selected to span a wide range of predicted probabilities of being bad (as estimated by our model). This was done partly to employ a method known as active learning to improve our classification model, and partly to ensure that we could calibrate that model effectively.
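One simple way to select pairs spanning a range of predicted probabilities is stratified sampling over probability bins, sketched below; the bin count and per-bin sample size are assumptions, not the values we used.

    import random
    from collections import defaultdict

    def select_pairs_for_labeling(pairs, predicted_prob, per_bin=50, n_bins=10, rng=random):
        """Sample pairs so that every range of predicted P(bad) is represented."""
        bins = defaultdict(list)
        for pair in pairs:
            p = predicted_prob[pair]  # model's predicted probability that the pair is bad
            bins[min(int(p * n_bins), n_bins - 1)].append(pair)
        selected = []
        for b in range(n_bins):
            candidates = bins.get(b, [])
            selected.extend(rng.sample(candidates, min(per_bin, len(candidates))))
        return selected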

YouTube data

The extension reports video IDs and, when possible, includes metadata such as video title, description, and channel. However, we needed this metadata consistently, and we also needed video transcripts. We obtained this data through automated web requests to YouTube’s servers, gathering the title, transcript, channel, and description for over 6 million videos.
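Purely as an illustration of such requests (and not necessarily how our collection was implemented), YouTube’s public oEmbed endpoint returns a small subset of this metadata, namely the title and channel name; descriptions and transcripts require additional requests that are not shown here.

    import requests

    def fetch_basic_metadata(video_id):
        """Fetch title and channel name for a video via the public oEmbed endpoint."""
        resp = requests.get(
            "https://www.youtube.com/oembed",
            params={"url": f"https://www.youtube.com/watch?v={video_id}", "format": "json"},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        return {"title": data["title"], "channel": data["author_name"]}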

Additionally, the channel identifier we obtained was not always in canonical form. We therefore also built a mapping from every channel we observed to its canonical form, again through automated web requests to YouTube.

Analysis

Bad recommendation rates

As we mentioned before, the “bad recommendation rate” is defined as how often YouTube recommends videos that are similar to a video a participant has rejected. Analyzing bad recommendation rates allowed us to measure how effective YouTube’s user control mechanisms are at preventing unwanted recommendations. The underlying premise is that, since YouTube itself recommends using its user control mechanisms to manage recommendations, using those mechanisms to express which types of recommendations are unwanted should be effective at preventing such recommendations.

Accordingly, if negative feedback is submitted for a particular “rejected” video, we consider any subsequent recommendation similar to the rejected video to be “bad”. Of course, there is a lot of nuance here: it is not clear that a single negative feedback submission should suppress all future recommendations on a topic, and a user’s behavior on YouTube after submitting feedback may justify recommendations on topics similar to the rejected video. Regardless, since these are the only tools YouTube offers to prevent unwanted recommendations, we do expect them to be effective for that purpose.

Metrics

We calculate the bad recommendation rate by taking all video pairs in the segment in question (for example, those contributed by participants in a particular experiment arm) and determining the proportion of them that are bad, either as classified by our research assistants (in which case we restrict to the subset of pairs they assessed) or as classified by our model.

There are alternative possible metrics. For example, we could calculate a bad recommendation proportion for each rejected video and then aggregate those proportions across all rejected videos in the segment; a similar approach could be taken at the participant level. We tested several such metrics but found no meaningful differences in the findings, so we used the simplest approach: aggregating across all pairs in the segment to calculate a single rate, or proportion.
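The sketch below shows, for a list of (rejected video, recommended video, is-bad) pairs, both the pooled rate we used and the per-rejected-video alternative; the data representation is illustrative.

    from collections import defaultdict
    from statistics import mean

    def pooled_bad_rate(pairs):
        """pairs: iterable of (rejected_video_id, recommended_video_id, is_bad)."""
        pairs = list(pairs)
        return sum(is_bad for _, _, is_bad in pairs) / len(pairs)

    def per_rejected_video_bad_rate(pairs):
        """Alternative: a rate per rejected video, averaged across rejected videos."""
        by_rejected = defaultdict(list)
        for rejected_id, _, is_bad in pairs:
            by_rejected[rejected_id].append(is_bad)
        return mean(sum(flags) / len(flags) for flags in by_rejected.values())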

Interaction rates

We calculate the interaction rate for each participant as the number of user control interactions made divided by the number of videos watched. A user control interaction may be a press of the “Stop recommending” button or a use of a native YouTube control (dislike, not interested, remove from watch history, or don’t recommend channel). For the interaction rate analysis, we divide participants into two groups: those in the UX control group (the “No button” arm, which has no “Stop recommending” button) and all others (who do have the button). We then calculate the average interaction rate for each group. For the first group, the rate includes only native interactions, as they are the only options available; for the second group, it also includes “Stop recommending” button interactions.
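In code, the per-participant rate and the group averages could be computed as follows; the data representation is again illustrative.

    from statistics import mean

    def interaction_rate(interactions, videos_watched):
        """User control interactions divided by videos watched, for one participant."""
        return interactions / videos_watched

    def mean_group_rate(participants):
        """participants: iterable of (interactions, videos_watched) tuples."""
        return mean(interaction_rate(i, v) for i, v in participants)

    # no_button_rate = mean_group_rate(no_button_participants)  # native controls only
    # button_rate = mean_group_rate(button_participants)        # also includes the button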

Semantic Similarity Model

Our research assistants classified 44,434 pairs, but we analyzed a total of 162,983,496. For the pairs that were not classified by the research assistants, we applied a machine learning model that estimates semantic similarity. Details on the model are available in our recent blog post.
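Details of the actual model are in the blog post; purely to illustrate the general approach, a pretrained multilingual sentence-embedding model can score the similarity of two videos’ metadata along these lines (the model name and threshold below are our own assumptions, not what we used).

    from sentence_transformers import SentenceTransformer, util

    # Illustrative multilingual embedding model choice.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def similarity(rejected_text, recommended_text):
        """Cosine similarity between embeddings of two videos' title + description."""
        emb = model.encode([rejected_text, recommended_text], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    # A pair might be flagged as a likely bad recommendation above some
    # calibrated threshold (0.75 here is purely illustrative).
    is_bad = similarity("rejected video title ...", "recommended video title ...") > 0.75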

Survey

Survey questions

Our survey ran for four months, from December 2021 to April 2022. Survey participants responded to the following questions:

  1. Think about a time you took steps to curate or control your YouTube recommendations. What did you do?
  2. What kinds of videos were you seeing that prompted you to take these steps? Why do you think YouTube was recommending those videos to you?
  3. After you took steps to control your recommendations, did they change? If so, how did they change?
  4. In this scenario, did you feel like you had meaningful control over your video recommendations? Why or why not?
  5. What do you wish you had been able to do in this scenario? Are there other platforms you know of that allow you to do this? Provide examples if so.
  6. Overall, what information do you wish YouTube provided users about how its recommendation algorithm works? What would you do with that information?
  7. Are you interested in being interviewed by Mozilla researchers about your responses to these questions? If so, please provide your email address.


Annex: Examples of video pairs recommended

Link to Addendum Report PDF

Link to public JSON endpoint
