The Search for Self: How to obtain and analyze your history of Google searches

#privacy #google #datascience #statistics

Google’s search engine is so thoroughly baked into our everyday existence that it feels more like the final stage in a cognitive process than it does an independent piece of software. Modern humans don’t wonder, they wonder-then-Google, with the taps of characters into your address bar as natural and legitimate a step as the original thought.

As a result, your accumulation of Google searches over a period of time acts as a reliable proxy for your state of mind, curiosities, ambitions, and fears included. Luckily (or not, depending on your definition of privacy), Google logs your searches and makes them available to you, assuming you’re signed in to a Google account (often via Gmail). If you've never adjusted your settings to halt this behavior, here’s how to find, parse, and visualize that data, starring the author as guinea pig.

1. Download the data

Head to https://takeout.google.com/settings/takeout, where you’ll find various personal datasets available, including your GChat conversations and emails. Unselect all of them (“Select none”), then recheck Searches and hit “Next.” On the next page you can choose a file type (.tgz allows for fewer files) and delivery method (I stuck with a download link sent over email). After opening that email, clicking through, downloading the archive and unzipping it, you’ll be left with a collection of files nested under the folders “Takeout” and “Searches.”

2. Prepare the data

The data is in JSON format, but is still organized in a relatively straightforward manner and can be flattened into vectors without too much trouble in Python:

import json
import os
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Your file path here! Keep only the JSON exports, which skips hidden
# files (such as .DS_Store) instead of blindly deleting the first entry.
files = [f for f in os.listdir('Searches') if f.endswith('.json')]

searches = []
dates = []
for file in files:
    # Binary mode lets the json module handle the file's encoding itself.
    with open(os.path.join('Searches', file), 'rb') as json_data:
        d = json.load(json_data)
    for event in d['event']:
        query = event['query']
        # A repeated query carries one timestamp per time it was searched.
        for stamp in query['id']:
            searches.append(query['query_text'])
            dates.append(stamp['timestamp_usec'])

# Timestamps are microseconds since the epoch; convert to local-time strings.
dates = [datetime.datetime.fromtimestamp(int(i) / 1000000).strftime('%Y-%m-%d %H:%M:%S')
         for i in dates]
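To make the structure that loop walks through concrete, here is a minimal sketch run against a made-up record. The field names (event, query, query_text, id, timestamp_usec) come from the code above. The sample values and the helper name flatten_events are invented for illustration, and your own export may not match this layout exactly.

```python
import datetime

# Hypothetical sample mimicking one exported file; the values are invented.
sample = {
    "event": [
        {"query": {"query_text": "synonym for frenzy",
                   "id": [{"timestamp_usec": "1480000000000000"},
                          {"timestamp_usec": "1480000600000000"}]}},
        {"query": {"query_text": "warriors record",
                   "id": [{"timestamp_usec": "1480003600000000"}]}},
    ]
}

def flatten_events(doc):
    """Return one (query_text, datetime) pair per recorded timestamp."""
    rows = []
    for event in doc["event"]:
        query = event["query"]
        for stamp in query["id"]:
            # timestamp_usec is microseconds since the epoch, stored as a string.
            seconds = int(stamp["timestamp_usec"]) / 1000000
            rows.append((query["query_text"],
                         datetime.datetime.fromtimestamp(seconds)))
    return rows

rows = flatten_events(sample)
for text, when in rows:
    print(when.strftime('%Y-%m-%d %H:%M:%S'), text)
```

The first query appears twice in the output because it has two timestamps, which is why the search text is appended once per timestamp rather than once per event.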

3. Analyze the data

We’ll start with some high-level figures. In the 886 days spanning the available time period back to fall of 2014, I executed nearly 64,000 Google searches, or over 70 per day. I use my personal laptop at work every day, which helps explain such volume, but clearly the pervasiveness of Google searches mentioned in the intro was not overstated!
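Those figures are easy to reproduce from the dates list. The following is a rough sketch rather than the exact calculation behind the numbers above: the helper name summarize and the example timestamps are invented, and it assumes the dates are strings in the '%Y-%m-%d %H:%M:%S' format produced in step 2.

```python
import datetime

def summarize(dates):
    """Return (total searches, days spanned, searches per day)."""
    stamps = [datetime.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
    total = len(stamps)
    # Count both the first and the last calendar day in the span.
    days = (max(stamps).date() - min(stamps).date()).days + 1
    return total, days, total / days

# Invented timestamps standing in for the real `dates` list.
example = ['2016-01-01 09:00:00', '2016-01-01 15:30:00',
           '2016-01-02 23:10:00', '2016-01-04 08:45:00']
total, days, per_day = summarize(example)
print(total, days, per_day)
```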

There are more patterns worth mining though. You could look at hour-by-hour trends:

hours = [datetime.datetime.strptime(i, '%Y-%m-%d %H:%M:%S').hour for i in dates]
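# One bin per hour: explicit edges at 0, 1, ..., 24 keep the bars aligned with the clock.
n, bins, patches = plt.hist(hours, bins=range(25), facecolor='blue', alpha=0.75)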
plt.xticks([0,6,12,18], ['12 AM','6 AM', '12 PM', '6 PM'], fontsize=18)
plt.xlabel('Hour', fontsize=24)
plt.ylabel('Frequency', fontsize=24)
plt.gcf().set_size_inches(18.5, 10.5, forward=True)
plt.show()

At its simplest, the hour-by-hour graph reflects my consciousness: he who does not Google is probably asleep. Soon after arriving at work though, I begin searching up a frenzy, reaching peak inquisitiveness around 3 PM. After an early evening respite, I’m back on my search grind by 10 PM and don’t finish up until well past midnight (I’m a bit of a night owl).
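If you would rather read the peak hour off as a number than eyeball it from the histogram, a small sketch like this works. The helper name searches_per_hour and the example timestamps are invented, and it assumes the same date string format as above.

```python
from collections import Counter
import datetime

def searches_per_hour(dates):
    """Return a 24-item list of search counts, indexed by hour of day."""
    hours = Counter(
        datetime.datetime.strptime(d, '%Y-%m-%d %H:%M:%S').hour for d in dates)
    return [hours.get(h, 0) for h in range(24)]

# Invented timestamps standing in for the real `dates` list.
example = ['2016-01-01 15:05:00', '2016-01-02 15:40:00', '2016-01-02 22:10:00']
per_hour = searches_per_hour(example)
# The busiest hour is the index with the largest count.
peak_hour = max(range(24), key=lambda h: per_hour[h])
print(peak_hour, per_hour[peak_hour])
```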

What exactly am I Googling though? Sorting for term frequency isn’t too difficult:

combo = ' '.join(searches)
freqs = Counter(combo.split())
top = freqs.most_common(40)

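# Reverse so the most frequent term lands at the top of the horizontal bar chart.
# Using len(words) instead of a hard-coded 40 also copes with fewer than 40 distinct terms.
words = [word for word, count in reversed(top)]
counts = [count for word, count in reversed(top)]

plt.barh(range(len(words)), counts, align='center', color='b', alpha=0.75)
plt.yticks(range(len(words)), words, fontsize=16)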
plt.gcf().set_size_inches(18.5, 10.5, forward=True)
plt.show()

The usual suspects from the English language like “the” and “of” dilute the list, but you can still see where my mind’s been in the last few years. I blog regularly and like to avoid overusing a word, hence the heavy reliance on searching for synonyms. I always want the sports-reference sites to be my top result, so I append "ref" to any baseball or basketball query. I live in New York (“nyc”) and go to the gym a fair amount (“nysc”). I’m an aspiring data scientist (“data,” “python,” “r”). I’m quintessentially American (“baseball,” “States”), but also worried about what that means nowadays (“trump”).
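One way to keep words like “the” and “of” from crowding the chart is to drop them before counting. This is a sketch, not what produced the list above: the stop-word set is a tiny hand-picked assumption, the helper name top_terms is invented, and lowercasing every term merges entries such as “States” and “states” that the original count keeps separate.

```python
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption; extend it to taste).
STOP_WORDS = {'the', 'of', 'a', 'an', 'and', 'in', 'to', 'for', 'is', 'on'}

def top_terms(searches, n=40, stop_words=STOP_WORDS):
    """Count lowercase terms across all searches, skipping stop words."""
    words = ' '.join(searches).lower().split()
    freqs = Counter(w for w in words if w not in stop_words)
    return freqs.most_common(n)

# Invented queries standing in for the real `searches` list.
example = ['synonym for the word big', 'The history of baseball', 'baseball ref']
top = top_terms(example, n=3)
print(top)
```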

There is, of course, a time component to each of these terms. People don’t Google the same things every day for the same reason they don’t think the same thoughts every day. So if I write a function that takes my search data, start and end dates, and a tuple of terms I'm interested in...

def term_by_week(data, start, end, terms, normalized=False):
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    weeks=[]
    while start < end:
        weeks.append(start.strftime('%Y-%m-%d'))
        start += datetime.timedelta(days=7)

    for term in terms:
        term_weeks = []
        for i in range(len(weeks)-1):
            term_weeks.append(sum((data['time'] > weeks[i]) & 
                      (data['time'] < weeks[i+1]) &
                      (data['search'].str.contains(term))))
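        termlength = len(term_weeks)
        if normalized:
            # Scale each term to its own peak; guard against a term that never appears.
            peak = max(term_weeks) or 1
            term_weeks = [i / float(peak) for i in term_weeks]
        plt.plot(range(termlength), term_weeks, label=term, linewidth=5.0)

    # Label roughly every quarter of the range, plus the final week.
    # Integer division and list() keep this working on Python 3.
    step = max(1, len(weeks) // 4)
    ticks = list(range(1, len(weeks), step))[0:4] + [len(weeks) - 1]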
    plt.xticks(ticks, [weeks[i] for i in ticks], fontsize=15)
    plt.xlim((0,len(weeks)))
    plt.xlabel('Week', fontsize=24)
    plt.ylabel('Frequency', fontsize=24)
    plt.legend()
    plt.gcf().set_size_inches(18.5, 10.5, forward=True)
    plt.show()

...I can pick some familiar topics and examine their fluctuations over time, giving me a sense of how my interests and focus changed as the weeks rolled by (this took a few minutes to run):

d = {"search": searches, "time": dates}
googled = pd.DataFrame(d)

term_by_week(googled, '2014-10-01', '2017-03-05', 
        ('trump', 'warriors', 'python', 'ibm'))

Without ever meeting me, you could use a graph like this to understand who I was and what I was thinking about over a long period of time (which, of course, is what Google does to make gobs of money). I worked for IBM (teal) after I graduated until changing jobs in summer of 2015. For months, I closely followed the Golden State Warriors’ record-breaking season (green). I decided to learn Python (red), the programming language used for all this, in the spring of 2016. And I paid great attention to Trump (blue) as the election neared, took a much-needed hiatus, and then plugged back in for his inauguration.
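The weekly function rescans the entire table once per week per term, which is probably why it takes minutes to run. If that bothers you, pandas can do the binning in one pass. The sketch below is an alternative and not the code behind the graph: the helper name weekly_counts and the example rows are invented, the match ignores case and treats the term as plain text (the original is case-sensitive and treats it as a regular expression), and the 7-day bins start at midnight on the day of the first search instead of at a start date you pass in.

```python
import pandas as pd

def weekly_counts(data, term):
    """Count searches containing `term` in consecutive 7-day bins."""
    # regex=False treats the term as plain text; case=False ignores capitalization.
    hits = data['search'].str.contains(term, case=False, regex=False)
    series = pd.Series(hits.astype(int).to_numpy(),
                       index=pd.DatetimeIndex(data['time'])).sort_index()
    # Each bin covers 7 days, starting at midnight on the first day seen.
    return series.resample('7D').sum()

# Invented rows standing in for the real `googled` DataFrame.
example = pd.DataFrame({
    'search': ['python list sort', 'Python pandas resample',
               'warriors score', 'learn python'],
    'time': ['2016-03-01 10:00:00', '2016-03-02 11:00:00',
             '2016-03-03 12:00:00', '2016-03-09 09:00:00'],
})
weekly = weekly_counts(example, 'python')
print(weekly)
```

Calling weekly.plot() on the result should give a line comparable to one of the series in the graph, though the week boundaries will not line up exactly with the original's.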

Unfortunately, your lasting takeaway from this post may be a reminder of Google’s omniscience. You may have noticed all the other things I didn’t check when exporting my data, from maps to GChat conversations to personal calendars. There are long, complex conversations to be had about how big your digital footprint should be and who should have access to it.

One certain thing is that you have the right to view your past online actions, and as demonstrated above, the capability to find meaning in them. In an age where all of us are too distracted to write a journal entry before bed, Google provides something of an approximation for a diary, and one that’s likely a little more honest at that.

So I would encourage you to at least download your data, and even take a shot at analyzing it. The full code is linked here, and I’d be happy to lend a helping hand to those who find the syntax inaccessible and to answer any questions you might have about the process…

…or you could just Google it. 😉

This post was originally published on my data blog, perplex.city