Command-line data analytics made easy with SPyQL

Daniel Moura — Sun, 06 Nov 2022 11:08:17 +0000

The command-line is incredibly powerful when it comes to data processing. Still, many of us working with data do not take advantage of it. I can think of some reasons:

Poor readability: the focus is on minimising how much you need to type and not so much on how readable a sequence of commands is;
Steep learning curve: many commands, with many options;
Looks outdated: some of these tools are around since the 70s and target delimited text files (and not modern formats like JSON and YAML).

These motivated me to write a command-line tool that focus on readability, easiness to learn and modern data formats, while leveraging the command-line ecosystem. On top of that, it also leverages the Python ecosystem! Meet SPyQL - SQL with Python in the middle:

SELECT 
  date.fromtimestamp(purchase_ts) AS purchase_date,
  sum_agg(price * quantity) AS total
FROM csv('my_purchases.csv')
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json

SPyQL in Action

I think the best way for getting to know SPyQL and getting comfortable with the command-line is to open a terminal and solve a problem. In this case, we will try to understand the geographical distribution of cell towers. Let's start!

Setup

Let's start by installing SPyQL:

$ pip3 install -U spyql

and check its version:

$ spyql --version
spyql, version 0.8.1

Let's also install MatplotCLI, a utility for creating plots from the command-line that leverages Matplotlib:

$ pip3 install -U matplotcli

Finally, we will download some sample data (you can alternatively copy-paste the URL to your browser and download the file from there):

$ wget https://raw.githubusercontent.com/dcmoura/blogposts/master/spyql_cell_towers/sample.csv

This CSV file contains data about cell towers that were added to the OpenCellid database on 2022 September 10 (OCID-diff-cell-export-2022-09-10-T000000.csv file from the OpenCellid project redistributed without modifications under the Creative Commons Attribution-ShareAlike 4.0 International License).

Inspecting the data

Let's look to the data by getting the first 3 lines of the file:

$ head -3 sample.csv
radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
GSM,262,2,852,2521,0,10.948628,50.170324,15762,200,1,1294561074,1662692508,0
GSM,262,2,852,2501,0,10.940241,50.174076,10591,200,1,1294561074,1662692508,0

You could do this same operation with SPyQL:

$ spyql "SELECT * FROM csv LIMIT 2" < sample.csv

$ spyql "SELECT * FROM csv('sample.csv') LIMIT 2"

Notice that we are telling to get 2 rows of data and not 3 rows of the file (where the first is the header).

One advantage of SPyQL is that we can change the output format easily. Let's change the output to JSON and look to the first record:

$ spyql "SELECT * FROM csv('sample.csv') LIMIT 1 
TO json(indent=2)"
{
  "radio": "GSM",
  "mcc": 262,
  "net": 2,
  "area": 852,
  "cell": 2521,
  "unit": 0,
  "lon": 10.948628,
  "lat": 50.170324,
  "range": 15762,
  "samples": 200,
  "changeable": 1,
  "created": 1294561074,
  "updated": 1662692508,
  "averageSignal": 0
}

Querying the data

Let's first count how many records we have:

$ spyql "SELECT count_agg(*) AS n FROM csv('sample.csv')"
n
45745

Notice that aggregation functions have the suffix _agg to avoid conflicts with Python's functions like min, max and sum.
Now, let's count how many cell towers we have by radio type:

$ spyql "SELECT radio, count_agg(*) AS n 
FROM csv('sample.csv') GROUP BY 1 ORDER BY 2 DESC TO pretty"
radio        n
-------  -----
GSM      31549
LTE      12996
UMTS      1182
CDMA        16
NR           2

Notice the pretty printing output. We can plot the above results by setting the output format to JSON and piping results into MatplotCLI:

$ spyql "SELECT radio, count_agg(*) AS n 
FROM csv('sample.csv') GROUP BY 1 ORDER BY 2 DESC TO json" |
plt "bar(radio, n)"

How easy was that? :-)

Querying the data: a more complex example

Now, let's get the top 5 countries with more cell towers added on that day:

$ spyql "SELECT mcc, count_agg(*) AS n 
FROM csv('sample.csv') GROUP BY 1 ORDER BY 2 DESC LIMIT 5 
TO pretty"
  mcc      n
-----  -----
  262  24979
  440   5085
  208   4573
  310   2084
  311    799

MCC stands for Mobile Country Code (262 is the code for Germany). The first digit of the MCC identifies the region. Here's an exert from Wikipedia:

0: Test networks
2: Europe
3: North America and the Caribbean
4: Asia and the Middle East
5: Australia and Oceania
6: Africa
7: South and Central America
9: Worldwide

Let's copy paste the above list of regions and create a new file named mcc_geo.txt. On the mac this is as easy as $ pbpaste > mcc_geo.txt, but you can also paste this into a text editor and save it.

Now, let's ask SPyQL to open this file as a CSV and print its contents:

$ spyql "SELECT * FROM csv('mcc_geo.txt') TO pretty"
  col1  col2
------  -------------------------------
     0  Test networks
     2  Europe
     3  North America and the Caribbean
     4  Asia and the Middle East
     5  Australia and Oceania
     6  Africa
     7  South and Central America
     9  Worldwide

SPyQL detected that the separator is a colon and that the file has no header. We will use the colN syntax to address the Nth column.

Now, let's create a single JSON object with as many key-value pairs as input rows. Let the 1st column of the input be the key and the 2nd column be the value, and save the result to a new file:

$ spyql "SELECT dict_agg(col1,col2) AS json 
FROM csv('mcc_geo.txt') TO json('mcc_geo.json', indent=2)"

We can use cat to inspect the output file:

$ cat mcc_geo.json                                               
{
  "0": "Test networks",
  "2": "Europe",
  "3": "North America and the Caribbean",
  "4": "Asia and the Middle East",
  "5": "Australia and Oceania",
  "6": "Africa",
  "7": "South and Central America",
  "9": "Worldwide "
}

Basically, we have aggregated all input lines into a Python dictionary, and then saved it as a JSON file. Try removing the AS json alias from the SELECT to understand why we need it :-)

Now, let's get the statistics by region instead of by MCC. For this, we will load the JSON file that we just created (with the -J option) and do a dictionary lookup:

$ spyql -Jgeo=mcc_geo.json "SELECT geo[mcc//100] AS region,
count_agg(*) AS n FROM csv('sample.csv') GROUP BY 1 
ORDER BY 2 DESC TO pretty"
region                               n
-------------------------------  -----
Europe                           35601
Asia and the Middle East          5621
North America and the Caribbean   3247
Australia and Oceania              894
Africa                             381
South and Central America            1

We do an integer division by 100 to get the 1st digit of the MCC and then we lookup this digit on the JSON we just created (which is loaded as a Python dictionary). This how we do a JOIN in SPyQL, via a dictionary lookup :-)

Leveraging Python libs on your queries

Another advantage of SPyQL is that we can leverage the Python ecosystem. Let's try to do some more geographical statistics. Let's count towers by H3 cell (resolution 5) for Europe. First, we need to install the H3 lib:

$ pip3 install -U h3

Then, we can convert latitude-longitude pairs into H3 cells, count how many towers we have by H3 cell, and save results into a CSV:

$ spyql "IMPORT h3 SELECT h3.geo_to_h3(lat, lon, 5) AS cell,
count_agg(*) AS n FROM csv('sample.csv') WHERE mcc//100==2
GROUP BY 1 TO csv('towers_by_h3_res5.csv')"

Visualising these results is fairly simple with Kepler. Just go to kepler.gl/demo and open the above file. You should see something like this:

Final words

I hope you enjoyed SPyQL and that I could show you how simple it makes to query data from the command-line. In this first post about SPyQL, we are just scratching the surface. There is a lot more we can do. We have barely leveraged the Shell and Python ecosystems in this article. And we worked with a small file (SPyQL can handle GB-size files without compromising system resources). So, stay tuned!

Try out SPyQL and reach back to me with your thoughts. Thank you!

Resources

SPyQL repo: github.com/dcmoura/spyql
SPyQL documentation: spyql.readthedocs.io
MatplotCLI repo: github.com/dcmoura/matplotcli

You can find me on Tweeter, LinkedIn and Vimeo!

Article originally published at danielcmoura.com.

DEV Community: Daniel Moura