What Is Processed and Why

How ytTREX works, how you control your data, and what the default settings are

This browser extension lets you collect evidence of algorithmic personalization. Because each person's perception of public debate is personalized, analysis and research are only possible through client-side collection.


What this browser extension does

  1. It creates a cryptographic key pair. The key pair ensures that you can access your data, and that only you can mark the data received as “yours.” It is necessary because we don’t have your email address, Google profile, or YouTube username. Your official identification method is not linked to our data at all. This works for every human or bot opening the youtube.com website. Every installed copy of the browser extension has a different and unique cryptographic key.

  2. It copies the HTML of every youtube.com video page once YouTube has finished sending the suggested videos. The HTML is sent to the tracking.exposed server, hosted in Germany and administered by the technical staff of our team. The system administrators are the three technologists of Tracking Exposed; get in touch at youtube-team at tracking dot exposed.

  3. Apart from rare maintenance operations, the code that processes the collected data is available here for review.

Technical detail: the extension cryptographically signs the HTML you send with your private key, and the server validates the signature with your public key. We differentiate supporters through the public key they are using, and you can create a new key, download your key, or import a key whenever you want. Each time a new supporter shows up, you’ll see it in the first graph.

Privileges we need to operate

In our manifest.json, the browser extension specifies which privileges it needs. Here is a summary of what they are and why we need them.

"permissions": [ "storage", "alarms", "https://*.youtube.com/", "https://youtube.tracking.exposed/" ]

  • storage: we need it to save your preference settings (for example, whether the extension is enabled or disabled; by default it is disabled). It also stores the cryptographic material used by the extension.
  • alarms: this is necessary to run code at a scheduled time. We do not use it yet, but we are working on a feature that would let people participate in experiments like weTest with less effort.
  • youtube.com and youtube.tracking.exposed are the two infrastructures we operate with. On youtube.com, the extension looks for suggested videos; the remote server is the platform that lets you compare, download, and analyze the recommended videos. The code running on the server is in the backend directory.
  • In the file you might see a mention of localhost, but it isn't actually present in the requested privileges. It is in the file to help developers, and that line takes effect only when a developer is working on the tool.
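
As a rough illustration of the list above, the following sketch splits the permissions array into browser API permissions and host patterns. The helper name split_permissions is hypothetical, and the JSON is the fragment quoted on this page wrapped in braces, not the actual manifest.json.

```python
import json

# The permissions fragment quoted on this page, wrapped in braces so that
# it parses as JSON. The real manifest.json contains more fields.
MANIFEST_FRAGMENT = """
{
  "permissions": [
    "storage",
    "alarms",
    "https://*.youtube.com/",
    "https://youtube.tracking.exposed/"
  ]
}
"""


def split_permissions(manifest_text):
    """Return (api_permissions, host_patterns) from a manifest JSON string.

    Hypothetical helper: any entry containing "://" is treated as a host
    pattern, everything else as a browser API permission.
    """
    permissions = json.loads(manifest_text)["permissions"]
    api = [p for p in permissions if "://" not in p]
    hosts = [p for p in permissions if "://" in p]
    return api, hosts


api_permissions, host_patterns = split_permissions(MANIFEST_FRAGMENT)
print("API permissions:", api_permissions)
print("Host patterns:", host_patterns)
```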

Data collected and processed

  1. From the submission: we use the public key to verify the existence of the profile and validate the signature. We keep the exact time, the URL of the video watched, and the HTML. This data goes in a collection named ‘videos’.
  2. From videos: we take the HTML and extract metadata from it, using a parser implemented in Node.js. The parser derives the metadata listed below. This data goes in a collection named ‘metadata’.
  3. You have full control over the data objects linked to you in the ‘videos’ and ‘metadata’ collections.

URL considered by the extension

https://www.youtube.com/watch?.*
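
A minimal sketch of how such a pattern could be checked, assuming it is read as a regular expression with the literal dots and the question mark escaped. The function names are hypothetical and this is not the extension's own matching code.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: the pattern above, interpreted as a regular expression with
# the literal "." and "?" characters escaped.
WATCH_URL = re.compile(r"https://www\.youtube\.com/watch\?.*")


def is_considered(url):
    """Return True if the whole URL matches the watch-page pattern."""
    return WATCH_URL.fullmatch(url) is not None


def video_id(url):
    """Return the "v" query parameter of a watch URL, or None."""
    if not is_considered(url):
        return None
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


print(is_considered("https://www.youtube.com/watch?v=U7OarstN2GU"))
print(video_id("https://www.youtube.com/watch?v=U7OarstN2GU&t=10s"))
```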

A Watched Video Metadata entry

There is one entry for each submission received from the supporters.

| Name | Example | Notes |
|---|---|---|
| id | e1895eed23ffcb8a0b5d1221c28a712b379886fe | The unique identifier of the evidence: every observation has a different ID. |
| title | Salmo - 90MIN (Official Live Performance) \| Vevo X | The title of the video. In certain conditions the same video might display a different title; for example, it can appear translated. |
| videoId | U7OarstN2GU | The YouTube video ID. If you compose the URL https://youtube.com/watch?v=$videoId, it will display the video. This videoId is unique on the YouTube platform. |
| authorName | Confused Bi-Product of a Misinformed Culture | The name of the author as YouTube displays it. |
| authorChannel | /channel/UCSsz5GO1rQjzp1RND7QtEjg | The unique ID of the content producer: by looking at https://youtube.com/$authorChannel you'll see the producer page. |
| savingTime | 2019-08-01 22:46:41.355Z | The date when the evidence was collected. |
| watcher | caramel-macaroon-succotash | A pseudonym assigned to every browser submitting data. It is linked to the authentication material, and is therefore personal data. |
| Related | [ list ] | A list of related videos. The size of this list is the number in the field RelatedN. See the details below. |
| Metadata | [ list ] | A collection of additional metadata about what YouTube sent you during the video, such as advertising banners and advertising videos. This set grows as we add support for new items, and it is limited to what YouTube sends to the supporter. |
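
The entry above can be pictured as a small record. The sketch below is only an illustration: the class name WatchedVideo and its two helper methods are hypothetical, the Related and Metadata fields are lowercased to fit Python naming, and the values are the examples from the table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class WatchedVideo:
    """One metadata entry, using the field names listed in the table."""
    id: str
    title: str
    videoId: str
    authorName: str
    authorChannel: str
    savingTime: datetime
    watcher: str
    related: list = field(default_factory=list)
    metadata: list = field(default_factory=list)

    def video_url(self):
        # Compose the URL as described in the videoId note.
        return f"https://youtube.com/watch?v={self.videoId}"

    def channel_url(self):
        # authorChannel already starts with a slash.
        return f"https://youtube.com{self.authorChannel}"


entry = WatchedVideo(
    id="e1895eed23ffcb8a0b5d1221c28a712b379886fe",
    title="Salmo - 90MIN (Official Live Performance) | Vevo X",
    videoId="U7OarstN2GU",
    authorName="Confused Bi-Product of a Misinformed Culture",
    authorChannel="/channel/UCSsz5GO1rQjzp1RND7QtEjg",
    savingTime=datetime(2019, 8, 1, 22, 46, 41, 355000, tzinfo=timezone.utc),
    watcher="caramel-macaroon-succotash",
)
print(entry.video_url())
print(entry.channel_url())
```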

A Video Related Metadata entry

Each related video is described by its own data structure. The list of related videos is part of the evidence. We use related and suggested as synonyms: both refer to the videos displayed in the right-hand column of the YouTube interface.

| Name | Example | Notes |
|---|---|---|
| Index | 1 | Order of the related video (also called suggestionOrder). This counter starts from 1, and we can't guarantee that all videos have the same number of related videos. |
| Title | Platero y Tú - Vamos tirando (1993) (Álbum completo) | The name of the suggested video. |
| Verified | false | true or false: whether the verified marker ✔ appears next to the channel name. |
| foryou | true | true or false: whether the related video was explicitly 'recommended for you'. |
| source | Kirstin Leticia | The display name of the channel owning the recommended video. |
| displayTime | 03:14 | Video length, as declared in the preview. |
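
A related-video entry can be sketched in the same way. The class RelatedVideo and the helper display_seconds are hypothetical, and the helper assumes that displayTime is always formatted as MM:SS or HH:MM:SS, which this page does not state.

```python
from dataclasses import dataclass


@dataclass
class RelatedVideo:
    """One related-video entry, with lowercased names from the table."""
    index: int
    title: str
    verified: bool
    foryou: bool
    source: str
    displayTime: str


def display_seconds(display_time):
    """Convert "MM:SS" or "HH:MM:SS" into a number of seconds."""
    seconds = 0
    for part in display_time.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


related = RelatedVideo(
    index=1,
    title="Platero y Tú - Vamos tirando (1993) (Álbum completo)",
    verified=False,
    foryou=True,
    source="Kirstin Leticia",
    displayTime="03:14",
)
print(display_seconds(related.displayTime))
```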

What counts as personal data in this context

Anything that can be linked to an individual's activity is personal data. In our case, two pieces of metadata deserve special attention and protection: the watcher pseudonym, and the sequence of videos seen by the same watcher.

  1. The watcher pseudonym is a sequence of three random food names, assigned when the adopter opts in. For example, pizza-broccoli-tangelo is a legitimate watcher name. The ‘watcher’ is the person using the extension, who has to opt in to these terms of data processing. For the purposes of our GDPR compliance, this person is a data subject.
  2. The data subjects can exert their rights in two ways:
     1. The watcher pseudonym is not guaranteed to be unique. In the system, another person with the same pseudonym might exist. In any case, knowing or guessing a pseudonym has no impact on data privacy, because it is not an authentication factor and there is no way to use the pseudonym to query the database.
     2. The data subjects can fully dispose of the evidence they send to us: they can control the data retention time (default is 6 months, minimum two weeks, maximum 1 year). If they want to share a limited portion of this data, they can create an access token for it, and they can revoke and change their authentication code from the browser extension.
  3. The sequence of videos seen by the same watcher is not disclosed to anyone except the watcher. This sequence (that is, the video experience of the individual) is protected because, in the long term, this content might let an adversary infer personal information, and we don’t allow it.
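
To make the pseudonym format and the retention limits concrete, here is a minimal sketch. The word list, the function names, and the conversion of "6 months", "two weeks", and "1 year" into 180, 14, and 365 days are all assumptions made for illustration; the real food list and the exact limits used by the server are not shown on this page.

```python
import secrets

# Hypothetical word list: the real list of food names is not shown here.
FOODS = ["pizza", "broccoli", "tangelo", "caramel", "macaroon", "succotash"]

# Retention limits from the text, converted to days (assumed conversions).
DEFAULT_RETENTION_DAYS = 180  # "6 months"
MIN_RETENTION_DAYS = 14       # "two weeks"
MAX_RETENTION_DAYS = 365      # "1 year"


def make_pseudonym(words=FOODS):
    """Join three randomly chosen food names with hyphens."""
    return "-".join(secrets.choice(words) for _ in range(3))


def validate_retention(days=DEFAULT_RETENTION_DAYS):
    """Return days if it is inside the allowed range, else raise ValueError."""
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_DAYS} "
            f"and {MAX_RETENTION_DAYS} days"
        )
    return days


print(make_pseudonym())
print(validate_retention())
```

Because the three words are drawn independently from a finite list, two supporters can receive the same pseudonym, which is consistent with the statement that it is not guaranteed to be unique.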

Our data processing logic and values

The primary goal is to enable algorithm analysis. The influence of the personalization algorithm emerges by comparing individualized experiences with other people’s experiences. This can be done with three broad approaches:

For the data subject (or the supporter with the browser extension installed)

A person collects evidence of how YouTube personalizes his or her experience. If this person decides to share their evidence (or a portion of it) with someone else in an exclusive way, this second person can accept or decline to share back their own evidence. This allows two people to compare their personalized experiences.

The privacy model is an explicit, and revocable, opt-in from the two parties, allowing a granular selection of the shared content.
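
One way to picture this model is the sketch below. It assumes that a comparison is possible only while both parties have something shared, which is one reading of the paragraph above; the class SharingAgreement and its methods are hypothetical and do not describe the actual server logic.

```python
from dataclasses import dataclass, field


@dataclass
class SharingAgreement:
    """Tracks which evidence IDs each party has opted in to share."""
    shared: dict = field(default_factory=dict)  # party -> set of evidence IDs

    def opt_in(self, party, evidence_ids):
        """Explicit, granular opt-in: add only the selected evidence IDs."""
        self.shared.setdefault(party, set()).update(evidence_ids)

    def revoke(self, party, evidence_ids=None):
        """Revoke some IDs, or everything when evidence_ids is None."""
        if evidence_ids is None:
            self.shared.pop(party, None)
        else:
            self.shared.get(party, set()).difference_update(evidence_ids)

    def visible_to(self, viewer, owner):
        """Evidence of owner that viewer may see (assumes mutual opt-in)."""
        if not self.shared.get(viewer) or not self.shared.get(owner):
            return set()
        return set(self.shared[owner])


agreement = SharingAgreement()
agreement.opt_in("alice", {"evidence-1", "evidence-2"})
agreement.opt_in("bob", {"evidence-9"})
print(agreement.visible_to("bob", "alice"))
```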

For researchers

A researcher coordinates tests among people or by using puppet profiles. The experiments might differ broadly (it mainly depends on the research question and methodology). In this case, the researcher should ask the data subjects to share their data, in a private agreement expressed with informed consent. This is the approach used by the research team that published our first report: exposing YouTube, an ALEX project at the DMI Summer School at UvA.

The privacy model is an explicit opt-in for people joining the research group. In case dummy profiles play a role, the researcher likely owns these profiles and has full control of the data collected with them.

Open data for wider access

Tracking Exposed might run an analysis on the dataset as long as the logic to look into the database is:

How to guarantee protection in Open Data

Our privacy model aims to anonymize and aggregate data in such a way that no contributor can be recognized. Again, we should investigate phenomena without exposing any individual.
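
As an illustration of what an aggregated query could look like, the sketch below counts how many distinct watchers were recommended each video and drops any group below a minimum size. The threshold value, the function name, and the input format are assumptions; a size threshold like this is a common technique but does not by itself guarantee anonymity, and it is not presented here as the project's actual method.

```python
from collections import defaultdict

# Assumed threshold: groups with fewer distinct watchers are suppressed.
MIN_WATCHERS = 3


def aggregate_recommendations(observations, min_watchers=MIN_WATCHERS):
    """Count distinct watchers per recommended video, dropping small groups.

    observations is an iterable of (watcher, recommended_video_id) pairs.
    The result maps video IDs to watcher counts and contains no pseudonyms.
    """
    watchers_by_video = defaultdict(set)
    for watcher, video_id in observations:
        watchers_by_video[video_id].add(watcher)
    return {
        video_id: len(watchers)
        for video_id, watchers in watchers_by_video.items()
        if len(watchers) >= min_watchers
    }


sample = [
    ("pizza-broccoli-tangelo", "videoA"),
    ("caramel-macaroon-succotash", "videoA"),
    ("apple-bean-carrot", "videoA"),
    ("pizza-broccoli-tangelo", "videoB"),
    ("pizza-broccoli-tangelo", "videoA"),  # repeated view, counted once
]
print(aggregate_recommendations(sample))
```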

This method has a problem: the three declarations above are not enough to produce data that is both 100% safe and useful for the public. They are a general indication. Our procedure for accepting a request works like this:

  1. A third party proposes a research question and a logic in pseudo-code.
  2. We implement the functionality without releasing it to the public; we only produce a small sample of data.
  3. We help the third party in writing their privacy assessment.
  4. We’ll notify the impacted data subjects if an opt-in isn’t mandatory.

At the moment, this is not yet happening. We are only experimenting with producing aggregated queries with privacy-preserving capabilities. When external researchers had access to a selected portion of the dataset, they signed an agreement with us which requires them to:

How we (and other researchers) release data

When a research project abides by the following methodology:

  1. Researchers must use profiles under their control.
  2. The profiles should investigate activities of clear public interest (political figures, mainstream media).

Researchers might decide to publish their collected data so that others can replicate and validate the research. A few such cases have been recorded so far, for example in the context of Facebook algorithm analysis, as documented in the invisible curation of content; the report Italian political election and digital propaganda also has its data released in a repository.

Collaborative tests like poTEST#1 or weTEST#1 do not comply with point 1 above. We release the data anyway, because the pseudonym released as part of the test is different from the one associated with the profile, so the two can’t be correlated.
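
The page does not say how the released pseudonym is produced. The sketch below shows one possible approach, under the assumption that each watcher is given a fresh random label and that the mapping is discarded after the release; the function name and the label format are hypothetical.

```python
import secrets


def repseudonymize(records, key="watcher"):
    """Return copies of records with each watcher replaced by a fresh label.

    The mapping exists only inside this call and is discarded afterwards,
    so released labels cannot be mapped back to the original pseudonyms.
    """
    mapping = {}
    released = []
    for record in records:
        original = record[key]
        if original not in mapping:
            label = "participant-" + secrets.token_hex(8)
            # Regenerate in the unlikely case the label is already taken.
            while label in mapping.values():
                label = "participant-" + secrets.token_hex(8)
            mapping[original] = label
        # Copy the record so the input data is left untouched.
        released.append({**record, key: mapping[original]})
    return released


records = [
    {"watcher": "pizza-broccoli-tangelo", "videoId": "a"},
    {"watcher": "caramel-macaroon-succotash", "videoId": "b"},
    {"watcher": "pizza-broccoli-tangelo", "videoId": "c"},
]
for row in repseudonymize(records):
    print(row)
```

The same watcher keeps one label within a release, so the sequence inside the test stays comparable, while the label shares nothing with the pseudonym stored alongside the profile.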

The purpose of this dataset is research on personalization algorithms. The dataset contains no personal data, even though YouTube's personalization depends on the personal data of the people involved (who are therefore legally acknowledged as data subjects). De-anonymization attacks, such as relinking by searching for known patterns, are not considered feasible because:

  1. An attacker would need to know how patterns appear in YouTube's personalization algorithm, and this is not known.
  2. YouTube (Alphabet) is likely the only entity that might be interested in de-anonymizing volunteers, but we estimate that it already has such knowledge if it really wants it.
  3. People supporting the experiment would not be exposed to risk by participating, because the experiment is, at the moment, aimed at still exploratory scientific research.

Last but not least: we are worried about the centralization of power in the hands of Google Chrome, upon which Brave and Edge run. We have already suffered a few takedowns of our extension(s).