Initial project analysis

February 2019

youtube.tracking.exposed takes its inspiration, methods, experience, and code from facebook.tracking.exposed. This project is motivated by and refers to the tracking.exposed manifesto; the goal is to create free (libre) software, support a community of analysts and developers, and enable critical analysis of monopolist platforms.

How this technology works

What can we do with this data, and who are “we”?

With the browser extension ytTREX, you collect evidence on how YouTube is treating your individual profile. Full stop. Nothing about the algorithm, yet. The session YouTube provides to each watcher is personalized, and these data can serve several groups of people, for example:

  1. Individuals themselves can understand how their experience is curated by the platform.
    By clicking on the browser extension, the adopter would reach their personal collection of videos.
  2. People using our tools and analytics. To this end, a simple dashboard would be developed on top of a documented API, enabling other free software developers to extend our AGPL technology. Check the ‘play with data’ section; it points to three basic functionalities.
  3. Let communities play with this.
    A consumer of YouTube videos might want to mark their contribution as belonging to a certain channel or community. This would help communities better understand themselves and how algorithms impact them.
  4. Researchers (everyone is a researcher for us, they just have raw data) might want to explore the data in detail. This project is inspired by the concept of “European Data Commons” (section 2.2.2), as long as we can ensure they are analyzing phenomena and not individual behaviors.
    We can achieve this with anonymization and aggregation, but get in touch at -support at tracking dot exposed-
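One way to picture the aggregation mentioned in point 4 is a threshold on the number of distinct contributors. The sketch below is only an illustration under stated assumptions: the record shape (pseudonym, video id), the function name `aggregate_views`, and the `min_contributors` threshold are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


def aggregate_views(records, min_contributors=3):
    """Count distinct contributors per video and drop rarely seen videos.

    `records` is an iterable of (pseudonym, video_id) pairs (assumed shape).
    Videos seen by fewer than `min_contributors` people are left out, so the
    result describes a phenomenon rather than one person's behavior.
    """
    contributors_by_video = defaultdict(set)
    for pseudonym, video_id in records:
        contributors_by_video[video_id].add(pseudonym)

    # Only the count leaves this function; pseudonyms are discarded.
    return {
        video_id: len(people)
        for video_id, people in contributors_by_video.items()
        if len(people) >= min_contributors
    }
```

The output carries no pseudonyms at all, only a count per video, which is the property a researcher-facing export would need.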

Approach

This method includes asking individuals to install a web-extension, and we should be as clear as possible about how their data would be processed. Although the dataset does not include personal data, the sequence of videos seen by an individual might allow behavioral information to be attributed to them; therefore, we should manage the dataset with the level of protection required by GDPR. This implies:

  1. Thinking about our adopters as people and not as users =)
  2. A clear opt-in, which is revocable and accessible.
  3. A customizable data retention policy.
  4. Full transparency on the data processed and stored (it is free software, but we should describe it before a person joins the project).
  5. Data protection impact assessment?
  6. A way to notify individuals in case of a data breach (which is weird because contributors are anonymous to us, but we might send a message using the browser extension).

Install youtube.tracking.exposed browser extension

Data Pipelines, Collection and Transformation

The technology consists of a pipeline for data collection. Individual contributors have full access to and control over the data they send to us, and over any data attributed to, inferred from, or associated with their contributions. The pipeline has different stages, which serve these functions:

  1. Requirements: an individual should have the browser extension installed. The extension is currently in beta, and it is not linked to any of your existing Google profiles. It works through the browser: in the panel, you should configure your name or pseudonym (and you can change it to whatever you want).
  2. The Collector: the submission is validated as specific to the individual. This process uses a cryptographically secure vector which ensures a collection associated with an individual can’t be mixed with submissions made by another individual.
  3. The Parser: a process that reads the submission and extracts metadata from the given page. These data are stored in the database and represent the raw input for our analysis. The parser creates an entry in the database collection named ‘videos’.
  4. The API: the interfaces meant for data analysts and for the browser interface you’ll see on this website, to retrieve data in a machine-readable format (CSV or JSON).
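The Collector's check (stage 2) can be illustrated with a minimal sketch. It is an assumption, not the project's code: it uses a shared-secret HMAC from Python's standard library as a stand-in for whatever cryptographic scheme ytTREX actually uses, and the names `sign_submission` and `collector_accepts` and the payload fields are hypothetical.

```python
import hashlib
import hmac
import json


def sign_submission(secret_key: bytes, payload: dict) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding."""
    # sort_keys gives the same bytes regardless of dict insertion order.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(secret_key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def collector_accepts(secret_key: bytes, payload: dict, tag: str) -> bool:
    """Accept a submission only if its tag matches the contributor's key."""
    expected = sign_submission(secret_key, payload)
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(expected, tag)


# Hypothetical usage: two contributors with different keys.
alice_key = b"a" * 32
bob_key = b"b" * 32
submission = {"href": "https://www.youtube.com/watch?v=abc123", "html": "<html></html>"}
tag = sign_submission(alice_key, submission)
accepted_for_alice = collector_accepts(alice_key, submission, tag)
accepted_for_bob = collector_accepts(bob_key, submission, tag)
```

A tag produced with one contributor's key does not validate under another's, which is the property the Collector relies on to keep collections from mixing.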

The pipeline parses the collected pages and attributes metadata to them. The metadata differs based on the kind of page acquired (different URL schemas normally imply a different kind of page):

We will begin by parsing only the ‘Watch’ pages.
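As an illustration of how a URL can indicate the kind of page, here is a minimal sketch using Python's standard `urllib.parse`. The category names and the function `classify_page` are hypothetical and chosen for the example; only the ‘watch’ kind is confirmed by the text above as the first one to be parsed.

```python
from urllib.parse import parse_qs, urlparse


def classify_page(url: str) -> str:
    """Guess the kind of YouTube page from the URL alone."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)

    if parsed.path == "/watch" and "v" in query:
        return "watch"
    if parsed.path == "/results":
        return "search"
    if parsed.path.startswith(("/channel/", "/user/")):
        return "channel"
    if parsed.path in ("", "/"):
        return "home"
    return "other"


def should_parse(url: str) -> bool:
    """Only 'watch' pages are handed to the parser in this first stage."""
    return classify_page(url) == "watch"
```

Keeping the classification separate from the parsing makes it straightforward to add parsers for other page kinds later without touching the ‘watch’ one.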

Along with the metadata extracted from the page, there is some technical metadata generated by the recording system:

These data represent the unique picture of what YouTube suggested to the individual in a specific timeframe.

Research questions?

The release of stage 3 of the roadmap above would be enough to start research. We aim to keep methodology separate from technology. A researcher might, for example, address questions such as:

In this comparative analysis, the researcher can start to figure out how the algorithm displays different suggestions.
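A comparative analysis of this kind could start from something as simple as measuring how much two sets of suggestions overlap. The sketch below assumes that the suggestions shown to each person for the same video are available as lists of video ids; the function names and the choice of the Jaccard index are illustrative assumptions, not the project's methodology.

```python
def suggestion_overlap(suggested_a, suggested_b):
    """Jaccard index between two lists of suggested video ids.

    1.0 means both people were shown the same suggestions, 0.0 means
    nothing in common. Returns 0.0 when both lists are empty.
    """
    set_a, set_b = set(suggested_a), set(suggested_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def exclusive_suggestions(suggested_a, suggested_b):
    """Return (only shown to A, only shown to B) as sorted lists."""
    set_a, set_b = set(suggested_a), set(suggested_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical suggestions shown to two people watching the same video.
alice = ["v1", "v2", "v3"]
bob = ["v2", "v3", "v4"]
overlap = suggestion_overlap(alice, bob)            # 2 shared out of 4 -> 0.5
only_alice, only_bob = exclusive_suggestions(alice, bob)
```

The videos that appear for only one person are the natural starting point for asking why the personalization differs.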

Outreach

  1. Talk to the content producers (“When I grow up I’ll be a youtuber!”): they are one of the first large categories in the world to be vocal and sensitive about algorithmic discrimination (demonetization, unlisting, automated and human takedowns).
  2. YouTubers are trying to deduce the effects of the algorithm. We can offer a tool that YouTubers may have an interest in promoting among their audiences. There is a list of functionalities meant for communities, and others meant for researchers.
  3. The website is a technical reference; it is meant to describe the technology. If interest grows, other websites might use our data and do their own analysis and communication.