This project explores how, and by whom, the web is tracked.
As the more tech-savvy readers know, when we visit a web page several things happen in the background.
The server sends the page to the user's browser, which starts to paint the content on screen. However, the browser may also need to fetch other resources; the most common are:
- images
- instructions on how to style the various elements (e.g. the color, size and position of the text), known as CSS
- code for animations or interactive applications, known as JS
- fonts (how the text appears)
All these resources may be provided by the same website, or they may be provided by a different one.
If a resource is provided by a different website, the browser needs to obtain it by making a request to a different actor.
All these requests can be used to track users on the web, especially when they are associated with cookies (hence the annoying banners on every website) and headers (for which we get no banners at all).
An example of this are the social buttons from Facebook, Twitter, Google, Reddit, etc. In order to show those buttons, the browser must make a request to the respective company and send information about the user. This makes it possible to show very social buttons ("Jon, Tyrion and Sansa liked this element"), but those social platforms will also know which page you have visited.
Finally, websites may also use analytics solutions that help them understand who visits their pages, which pages are visited most often, and other information. The most common analytics solution is provided for free by Google itself: of course the website obtains a lot of useful data, but Google obtains the very same data as well.
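To make the mechanism concrete, here is a minimal, purely illustrative sketch (the tracker domain, page URL, and header values are all made up) of the kind of information a third-party request carries:

import requests

# Hypothetical third-party request, prepared but never actually sent.
req = requests.Request(
    "GET",
    "https://tracker.example/pixel.gif",
    headers={
        "Referer": "https://news.example/some-article",  # the page that embedded the resource
        "Cookie": "uid=abc123",  # identifier set on a previous visit
    },
).prepare()

# The Referer and Cookie alone are enough to link this visitor to that page.
print(req.url)
print(req.headers)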
Armed with this basic knowledge, let's explore how we can find out who is tracking the web.
Obtain the data
The simplest way to know which requests are made to which services is to render the web page in a browser like Firefox and track all the requests it makes.
This procedure is not as simple as it may look; luckily, thanks to help from friends, a reasonably simple solution was possible.
Chrome headless and selenium may help also
— Ramiro Algozino (@ralgozino) May 14, 2019
How difficult can it be to programmatically get a list of all the requests a browser makes in order to display a web page?
We programmatically drive Firefox, routing all its requests through a proxy.
Everything was nicely packed together in the selenium-wire project.
The result is a tiny Python script that takes a domain as input, starts Firefox, makes Firefox visit and render the homepage, tracks all the requests through a proxy, and finally stores all the requests into a SQLite file.
import sys
from seleniumwire import webdriver # Import from seleniumwire
from selenium.webdriver.firefox.options import Options
import tldextract
from urllib.parse import urlparse
import sqlite3
import json
options = Options()
options.headless = True
# Create a new instance of the Firefox driver
driver = webdriver.Firefox(options=options)
original_domain = sys.argv[1]
url = 'https://{}'.format(original_domain)
# Make a request to the URL
driver.get(url)
conn = sqlite3.connect("requests.db")
c = conn.cursor()
c.execute('''
CREATE TABLE IF NOT EXISTS requests(
original_domain TEXT NOT NULL,
original_url TEXT NOT NULL,
time_request INT DEFAULT (strftime('%s','now')),
request TEXT,
status_code INT,
subdomain TEXT,
domain TEXT,
tld TEXT,
scheme TEXT,
netloc TEXT,
path TEXT,
params TEXT,
query TEXT,
fragment TEXT,
request_header TEXT,
response_header TEXT
);
''')
conn.commit()
insert_stmt = """
INSERT INTO requests(
original_domain,
original_url,
request,
status_code,
subdomain,
domain,
tld,
scheme,
netloc,
path,
params,
query,
fragment,
request_header,
response_header
)
VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, json(?), json(?));
"""
# Access requests via the `requests` attribute
for request in driver.requests:
    # Only keep requests that actually received a response
    if request.response:
        rpath = request.path
        # Split the requested URL into subdomain, registered domain and TLD
        subdomain, domain, tld = tldextract.extract(rpath)
        # Break the same URL into scheme, netloc, path, params, query and fragment
        parsedRequest = urlparse(rpath)
        scheme, netloc, path, params, query, fragment = parsedRequest
        status_code = request.response.status_code
        data = (
            original_domain,
            url,
            rpath,
            status_code,
            subdomain,
            domain,
            tld,
            scheme,
            netloc,
            path,
            params,
            query,
            fragment,
            json.dumps(dict(request.headers)),
            json.dumps(dict(request.response.headers)),
        )
        c.execute(insert_stmt, data)
# Persist all the collected rows
conn.commit()
driver.close()
driver.quit()
At this point we have a script that, given a domain as input, gets its home page and stores into a small database all the requests necessary to render that homepage.
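To get a feel for the stored data, the SQLite file can be queried directly; here is a minimal sketch (the table and columns are the ones created above, the aggregation itself is just an example):

import sqlite3

conn = sqlite3.connect("requests.db")
c = conn.cursor()
# Count how many captured requests point to each domain
query = """
    SELECT domain, COUNT(*) AS n
    FROM requests
    GROUP BY domain
    ORDER BY n DESC
    LIMIT 10;
"""
for domain, count in c.execute(query):
    print(domain, count)
conn.close()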
Then we used the list of the top 10 million most influential websites (actually, domains) to know which websites are the most visited.
We manipulate the list to extract just the domain names and keep the first thousand:
cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000
And finally we use xargs to run the Python script in parallel, one domain per invocation, across 6 processes:
xargs -n1 -P6 python3 tracker.py
Hence, the whole command was:
cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000 | xargs -n1 -P6 python3 tracker.py
After some hours we had collected 186,582 requests made while rendering the homepages of 1,924 domains. Those requests were directed at 3,472 distinct domains.
The number of requests is definitely not huge, far from it, but in order to make them Firefox needs to render a whole web page along with its JS and CSS, which is definitely not a lightweight task.
A brief data analysis will follow soon; follow me on Twitter or subscribe to the mailing list to receive updates.
Top comments (2)
Is it possible to do this with Python (without using Selenium) for a single website?
Hi!
Yes and no, it depends on the website (assuming a reasonable level of complexity).
Some websites serve all their HTML directly, so you can just fetch it and analyze, tag by tag, what requests you would need to make (see the sketch below).
Other websites rely heavily on JS to modify/create their content at "run time", so you will need to execute that JS in order to know what requests should be made (this is the case for React or Vue.js).
Once you start rendering the JS, everything becomes so complex that you might as well use Selenium.
To recap, it depends on the website!
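For the static-HTML case, a minimal sketch (assuming the requests and beautifulsoup4 packages; example.com is just a placeholder) could look like this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

page_url = "https://example.com"
html = requests.get(page_url).text
soup = BeautifulSoup(html, "html.parser")

# Collect the resources referenced directly in the HTML (no JS execution)
resources = set()
for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
    for element in soup.find_all(tag):
        value = element.get(attr)
        if value:
            resources.add(urljoin(page_url, value))

for resource in sorted(resources):
    print(urlparse(resource).netloc, resource)

It will miss anything injected at run time by JS, which is exactly the gap Selenium fills.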