What We Collect

and how ytTREX works

What the browser extension does

  1. It creates cryptographic key pairs. This is the method that ensures you can access your data, and that only you can mark the data received as “yours.” It is necessary because we don’t have an email address, Google profile, or YouTube username: your official identification method is not linked to our data at all. This works for every human or bot opening the youtube.com website. Every installed browser extension has a different and unique cryptographic key.

  2. It copies the HTML of every youtube.com video page once YouTube has finished sending the suggested videos. The HTML is sent to the tracking.exposed server, hosted in Germany and administered by the technical staff of our team.

Technical detail: the extension cryptographically signs the HTML you send with your private key; the server uses the corresponding public key to verify the signature. We differentiate supporters through the public key they are using, and you can create a new key, or download or import a key, whenever you want. Each time a new supporter shows up, you’ll see it in the first graph.

Which data we receive and save

  1. From the submission: we use the public key to verify the existence of the profile and validate the signature. We keep the exact time, the URL of the video watched, and the HTML. This data goes into a collection named ‘videos’.
  2. From ‘videos’: we take the HTML and extract metadata, using a parser implemented in Node.js. It derives the list of metadata fields below. The data goes into a collection named ‘metadata’.
  3. YES: you have full control of your entries in the ‘videos’ and ‘metadata’ collections.
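The extraction step in point 2 can be sketched roughly as below. The real Node.js parser is far more elaborate; the regexes and the reduced field set here are simplifying assumptions.

```javascript
// Minimal, illustrative sketch of metadata extraction from a saved
// YouTube HTML page. ASSUMPTION: simplistic regexes stand in for the
// real parser; only title, videoId, and savingTime are derived here.
function extractMetadata(html, savingTime) {
  const title = (html.match(/<title>([^<]*)<\/title>/) || [])[1] || null;
  // YouTube video IDs are 11 characters from [A-Za-z0-9_-].
  const videoId = (html.match(/watch\?v=([\w-]{11})/) || [])[1] || null;
  return { title, videoId, savingTime };
}
```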

What a ‘metadata’ entry looks like

Note: this table should be updated to v2 of the format, which also contains ads.

There is one entry for each submission received from the supporters.

| Name | Example | Notes |
| --- | --- | --- |
| id | e1895eed23ffcb8a0b5d1221c28a712b379886fe | The unique identifier of the evidence: every observation has a different ID |
| title | Salmo - 90MIN (Official Live Performance) \| Vevo X | The title of the video: in certain conditions the same video might display a different title, for example, it can appear translated |
| videoId | U7OarstN2GU | The YouTube videoId: if you compose the URL https://youtube.com/watch?v=$videoId, it will display the video. This videoId is unique on the YouTube platform |
| authorName | Confused Bi-Product of a Misinformed Culture | The name of the author as YouTube displays it |
| authorChannel | /channel/UCSsz5GO1rQjzp1RND7QtEjg | The unique ID of the content producer: by looking at https://youtube.com/$authorChannel you'll see the producer page |
| savingTime | 2019-08-01 22:46:41.355Z | The date when the evidence was collected |
| RelatedN | 20 | The number of related videos recorded |
| watcher | caramel-macaroon-succotash | A pseudonym assigned to every browser submitting data. It is linked to the authentication material, and therefore it is personal data |
| Related | [ a list ] | A list of related videos. The size of this list is the number in the field RelatedN. See below for details |
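Assembled from the example column above, a single ‘metadata’ document might look like the following. The exact shape and field casing are assumptions; the Related list is left empty here and detailed in the next table.

```json
{
  "id": "e1895eed23ffcb8a0b5d1221c28a712b379886fe",
  "title": "Salmo - 90MIN (Official Live Performance) | Vevo X",
  "videoId": "U7OarstN2GU",
  "authorName": "Confused Bi-Product of a Misinformed Culture",
  "authorChannel": "/channel/UCSsz5GO1rQjzp1RND7QtEjg",
  "savingTime": "2019-08-01 22:46:41.355Z",
  "RelatedN": 20,
  "watcher": "caramel-macaroon-succotash",
  "Related": []
}
```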

Each related video has its own data structure. The list of related videos is part of the evidence. We use ‘related’ and ‘suggested’ as synonyms: we are talking about the videos displayed in the right column of the YouTube interface.

| Name | Example | Notes |
| --- | --- | --- |
| Index | 1 | Order of the related video. This counter starts from 1, and we can't guarantee all videos have the same number of related videos (also called suggestionOrder) |
| Title | Platero y Tú - Vamos tirando (1993) (Álbum completo) | The name of the suggested video |
| Verified | false | true or false: whether the verified marker ✔ is present next to the channel name |
| foryou | true | true or false: whether the related video was explicitly ‘recommended for you’ |
| source | Kirstin Leticia | The display name of the channel owning the recommended video |
| displayTime | 03:14 | Video length, as declared in the preview |
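Put together from the example column above, one element of the Related list might look like this (field casing and exact shape are assumptions):

```json
{
  "Index": 1,
  "Title": "Platero y Tú - Vamos tirando (1993) (Álbum completo)",
  "Verified": false,
  "foryou": true,
  "source": "Kirstin Leticia",
  "displayTime": "03:14"
}
```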

What counts as personal data in this regard

Whatever is linkable to an individual’s activity is personal data. In our case, two pieces of metadata deserve special attention and protection: the watcher pseudonym, and the sequence of videos seen by the same watcher.

  1. By ‘watcher’ we mean a sequence of three random food names, assigned when the adopter opts in. For example, pizza-broccoli-tangelo is a legitimate watcher name. The ‘watcher’ is the person using the extension, who has to opt in to these terms of data processing. This person, with regard to our GDPR compliance, is a data subject.
  2. The data subjects can exert their rights in two ways:
    • With the browser extension, they can access their personal profile, where they can fully control the data
    • By downloading the key, they can ask us to intervene in deleting their data. Possession of the key is the only authentication measure you use with the system; the key is the only credential we can demand from you in order to identify you properly.
  3. The watcher pseudonym is not guaranteed to be unique. Another person with the same pseudonym might exist in the system. Anyhow, knowing or guessing a pseudonym has no impact on data privacy because it is not an authentication factor, and there is no method to query the database by pseudonym.
  4. The data subject can fully dispose of the evidence they send to us: they can control the data retention time (default is six months, minimum two weeks, maximum one year). If they want to share a limited portion of this data, they can create access tokens for it, and they can revoke and change their authentication code from the browser extension.
  5. The sequence of videos seen by the same watcher is not disclosed to anyone except the watcher. This sequence (the video experience of the individual) is protected because, in the long term, this content might let an adversary infer personal information, and we don’t allow that.
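The pseudonym scheme from point 1 can be sketched as below. Only the three-food-names format comes from this document; the word list and derivation here are invented for the example.

```javascript
// Illustrative sketch of the watcher pseudonym scheme.
// ASSUMPTION: this word list and the random draw are not the real ones;
// only the "three food names joined by dashes" format is documented.
const FOODS = ['pizza', 'broccoli', 'tangelo', 'caramel', 'macaroon', 'succotash'];

function makeWatcherPseudonym(rng = Math.random) {
  const pick = () => FOODS[Math.floor(rng() * FOODS.length)];
  // e.g. "pizza-broccoli-tangelo"; uniqueness is NOT guaranteed (point 3).
  return [pick(), pick(), pick()].join('-');
}
```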

Our data processing logic

The primary goal is to enable algorithm analysis. The influence of the personalization algorithm emerges by comparing individualized experiences with other people’s experiences. This can be done with three broad approaches:

For the data subject (or the supporter with the browser extension installed)

A person collects how YouTube personalizes their experience. If this person decides to share their evidence (or a portion of it) with someone else in an exclusive way, the second person can accept or decline to share back their own evidence. This allows two people to compare their personalized experiences.

The privacy model is an explicit, and revocable, opt-in from the two parties, allowing a granular selection of the shared content.
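As a hypothetical sketch of such a comparison (not an actual tracking.exposed API), two supporters could measure the overlap between the related videos each of them was suggested for the same watched video, assuming each related entry carries the suggested video's videoId:

```javascript
// Fraction of suggestions two profiles have in common for one video.
// ASSUMPTION: relatedA/relatedB are arrays of { videoId } objects.
function recommendationOverlap(relatedA, relatedB) {
  const idsA = new Set(relatedA.map(r => r.videoId));
  const shared = relatedB.filter(r => idsA.has(r.videoId)).length;
  return shared / Math.max(relatedA.length, relatedB.length);
}
```

A result near 1 means the two people see nearly the same suggestions; a result near 0 suggests strong personalization.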

For researchers

A researcher coordinates tests among people or by using puppet profiles. The experiments might differ broadly (depending mainly on the research question and methodology). In this case, the researcher should ask the data subjects to share their data with them, in a private agreement expressed with informed consent. This is the approach used by the research team that published our first report: exposing YouTube, an ALEX project at the DMI Summer School at UvA.

The privacy model is an explicit opt-in for people joining the research group. In case dummy profiles play a role, the researcher likely owns these profiles and has full control of the data collected with them.

Open data for a wider access

Tracking Exposed might run an analysis on the dataset as long as the logic used to look into the database is:

How to guarantee protection in Open Data

Our privacy model aims to anonymize and aggregate data in a way that no contributor can be recognized. Again, we should investigate phenomena without exposing any individual.

This method has a problem: the three points above are not enough to produce 100% safe and useful data for the public. This approach is a general indication, but our procedure to accept a process like this includes four phases:

  1. A third party proposes a research question and a logic in pseudo-code
  2. We implement the functionality and don’t release it to the public; we only produce a small sample of data.
  3. We help the third party in writing their privacy assessment.
  4. The proposal (1), our experimental result (2) and the P.A. (3) are assessed by a team of independent reviewers.

At the moment, this is not yet happening. We are only experimenting with producing aggregated queries with privacy-preserving capabilities. When external researchers have had access to a selected portion of the dataset, they have signed an agreement with us which requires them to:
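One aggregated query in this spirit could look like the sketch below. The k-anonymity-style cutoff is an assumption; the document does not specify the actual privacy-preserving mechanism.

```javascript
// Sketch of a threshold-based aggregation over recommendation evidence.
// ASSUMPTION: the k-cutoff is illustrative, not the real mechanism.
function aggregateRecommendations(entries, k = 5) {
  const counts = new Map();
  for (const { videoId } of entries) {
    counts.set(videoId, (counts.get(videoId) || 0) + 1);
  }
  // Keep only videos recommended at least k times, so rarely seen
  // (and therefore potentially identifying) items are never published.
  return [...counts.entries()]
    .filter(([, n]) => n >= k)
    .map(([videoId, count]) => ({ videoId, count }));
}
```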

How we (and other researchers) release data

When a research project abides by the following methodology:

  1. Researchers must use profiles under their control.
  2. The profiles should investigate activities of clear public interest (political figures, mainstream media)

A researcher might decide to publish their collected data as a method to let others replicate and validate the research. A few such cases have been registered so far, for example in the context of Facebook algorithm analysis, as documented in the invisible curation of content, or the report Italian political election and digital propaganda, which has its data released in a repository.

Collaborative tests like poTEST#1 or weTEST#1 fail to comply with point n. 1 above; we release the data anyway because:

  1. The pseudonym released as part of the test is different from the one associated with the profile, so the two can’t be correlated. It is possible that, among the ‘related content’, YouTube recommends something related to the individual’s previous activities. Could a piece of content be so personal that it uniquely links to an individual, and thus de-anonymizes a subject or an interest of a subject? What might this lead to?
  2. ATTENTION: we can’t outline a general rule, and we should evaluate whether to do a data protection impact assessment in every different case. It is good practice, during a test, to run the experiment with a browser that is logged off, with cookies, history, and local storage cleaned. An even simpler suggestion is to install a new browser; we often suggest Brave.
    • We worry about the centralization of power in the hands of Google Chrome, upon which Brave and Edge run. We have already suffered a few takedowns of our extension(s).