loading...

Using Google's BigQuery to Better Understand the Python Ecosystem

walker profile image Walker Harrison ・5 min read

When it comes to programming, it's important to be aware of the tools that your contemporaries are using. Not only is this a good way to stay apprised of industry trends, but it helps you develop a common vocabulary when interacting with other programmers.

In my case, I've recently been using Python more for data analysis, and found myself wondering what packages were most frequently used by other data scientists. I could, of course, just Google "best Python packages for data science," but in the spirit of statistics, I wanted to make a data-driven decision as opposed to relying on anecdote. Plus, Google has another tool that would prove useful: BigQuery.

If you'd like to swim in an ocean of Big Data, experience Google's colossal resources or even just practice your SQL, then BigQuery is a tool worthy of your attention. Among the datasets available on BigQuery for the public to rummage through are every New York City cab ride since 2009, every 311 call in San Francisco since 2008, and every MLB pitch from the 2016 season.

For my purposes, though, the relevant public dataset was that from GitHub, which includes more than 145 million unique commits and the contents of over 160 million files. By querying this dataset and finding popular extensions, common filenames, and oft-used packages, you can get a sense of the greater coding universe without actually opening any individual scripts. (Google's Felipe Hoffa, who oversees the dataset's rollout, keeps a running tally of neat analyses from the community.)

I knew that I'd have to go slightly deeper than just examining the available Python files. While data science is growing in popularity, it is still just a modest neighborhood within the wide universe of use cases for Python, which is one of today's most popular and versatile languages. Any insights specific to data science would be washed away by the heavier trends created by more generic use of Python. If anything, I'd want to search the contents of .ipynb files and not .py files, since the former applies to Jupyter (formerly IPython) notebooks.

The major advantages of using a notebook over running a Python script from the command line are that you can execute discrete blocks of code instead of the entire file and that you can display graphics inline. Such agility makes notebooks favorable to data scientists who want to create instant visualizations or manipulate large datasets without rerunning entire analyses, something I hypothesized would be reflected in the popular .ipynb packages. And so I set off to compare the millions of .py and .ipynb files in the GitHub dataset.

BigQuery is available via the command line or client libraries in various languages, but I'm partial to to the Web UI (in part because I'm Googling SQL and regex syntax in the adjacent browser tabs...). Once signed into my Google account, I can head over to the GitHub dataset and click on the "files" table so that I can query it:

In actuality, two tables are required: I need to select rows from the contents table with ids from a subset of the files table that have names ending in .py:

SELECT * FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (
    SELECT id FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 3) = '.py');

Something to note is that the results (5.4 million Python scripts) are big enough to require their own table, according to Google's rules, so if you'd like to do something similar you'll need to make a Google Cloud Platform account and put down a credit card. Your first terabyte of processed data is free, but I still managed to incur about $10 worth of fees for this project thanks to mistaken or inefficient queries.

Once the results of that query were parked in a new personal table, I ran the equivalent query for Jupyter notebooks (WHERE RIGHT(path, 6) = '.ipynb'), leaving me with separate batches of .py contents and .ipynb contents.

Determining the ten most popular packages in .py files wasn't too tricky, especially using this guide from Google's Francesc Campoy as reference. After splitting the text of each file at any new line (SELECT SPLIT(content, '\n') as line) that started with the phrase "import " or "from " (since there are two main ways to load Python packages), I used a regular expression that identified the next word to grab all the packages:

SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
FROM (
    SELECT SPLIT(content, '\n') as line
    FROM [pythonvsnotebooks:ipynb.py_contents]
    HAVING LEFT(line, 7) = 'import ' OR LEFT(line, 5) = 'from ')
GROUP BY package
ORDER BY n DESC
LIMIT 10;

Which gives me:

package n
1 os 1048981
2 sys 829156
3 __future__ 691222
4 time 424434
5 logging 404678
6 re 403193
7 numpy 394069
8 datetime 373272
9 json 282672
10 unittest 278512

Nothing here is particularly surprising. Most of these packages are part of Python's highly regarded standard library. os, sys, and re, for example are required to interact with the operating system, access variables maintained by the interpreter, and use regular expressions, respectively. These are the very "general use" packages that are indeed popular enough to drown out the results from something more niche, like data science.

Conjuring up the same table for Notebooks is slightly more complicated. The format of each contents value is actually a JSON since Notebooks support various cell types (such as code, comments, or markdown), and therefore need a more sophisticated data type than just a long string.

There's likely a more elegant way to deal with this added layer of complexity, but my hack was realizing that splitting the raw JSON text on \n \" would successfully select for new lines of code. I had to build in some case logic for package titles that accidentally pulled in some appending characters (WHEN package LIKE '%\\n",'), but the end result was clean:

SELECT
CASE WHEN package LIKE '%\\n",' THEN LEFT(package, LENGTH(package) - 4)
ELSE package
END as package, n
FROM (
    SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
    FROM (
        SELECT SPLIT(content, '\n      \"') as line
        FROM [pythonvsnotebooks:ipynb.ipynb_contents]
        HAVING LEFT(line, 7) = 'import ' OR LEFT(line, 5) = 'from ')
    GROUP BY package
    ORDER BY n DESC)
LIMIT 10;

Yielding:

package n
1 numpy 9344
2 matplotlib.pyplot 5715
3 pandas 4469
4 os 1940
5 matplotlib 1714
6 IPython.display 1627
7 sys 1577
8 __future__ 1244
9 sklearn 1162
10 time 1006

The overall totals are orders of magnitude smaller than those in the first table since Jupyter Notebooks are much less prevalent, but the sequence of popular packages is the interesting takeaway. numpy has leapt to the top of the list, and standard library packages have been surpassed in popularity by pandas, a data analysis tool, and matplotlib, a visualization tool. In ninth is a machine learning library, sklearn. These are the tools of the data science trade that I was looking for that only bubbled to the top once the mass of unrelated Python code was cut away.

It was nice to see my intuition validated when it comes to Jupyter notebooks, which apparently are primarily used for statistical analyses. But the real joy of this project was just playing around with BigQuery and feeling the power of sifting through mountains of data in just a few seconds. I'd encourage readers to explore some of the other analyses done on the GitHub dataset, and to give BigQuery a whirl for themselves. There's a lot to be found in three terabytes of data...

Posted on Mar 31 '17 by:

walker profile

Walker Harrison

@walker

Graduate student in statistics at Duke University. Former dev.to employee. I like to blog about data science on my Medium publication, perplex.city, and on dev.to

Discussion

markdown guide
 

Thanks Walker!

A better query to extract all the ipynb (I left the results at bigquery.cloud.google.com/table/fh...

#standardSQL  
SELECT a.id id, size, content, binary, copies,
  sample_repo_name, sample_path
FROM (
  SELECT id, ANY_VALUE(repo_name) sample_repo_name, ANY_VALUE(path) sample_path
  FROM `bigquery-public-data.github_repos.files`
  WHERE ENDS_WITH(path,'.ipynb')
  GROUP BY 1
) a
JOIN `bigquery-public-data.github_repos.contents` b
ON a.id=b.id
 

And more #standardSQL for the imports query:

#standardSQL
SELECT
CASE WHEN package LIKE '%\\n",' THEN SUBSTR(package, 0, LENGTH(package) - 4)
ELSE package
END as package, n
FROM (
  SELECT REGEXP_EXTRACT( line, r'(?:\S+\s+)(\S+)' ) as package, count(*) as n
  FROM (
    SELECT line
    FROM (
      SELECT SPLIT(content, '\n      \"') as lines
      FROM `fh-bigquery.github_extracts.contents_ipynb`
    ), UNNEST(lines) line
    WHERE STARTS_WITH(line, 'import ') OR STARTS_WITH(line, 'from '))
  GROUP BY package
)
ORDER BY n DESC
LIMIT 50;
 

Thanks, Felipe. Still honing my SQL skills so all ears for better solutions. I've got some more BigQuery posts in the pipeline both here on dev.to and on Medium