DEV Community: Simone Mosciatti

Use SimpleSQL from GitHub Pages

Simone Mosciatti — Wed, 08 Apr 2020 08:50:47 +0000

SimpleSQL is an HTTP API that allows complete control over a SQLite database.

Underneath, SimpleSQL relies on RediSQL to manage all the different databases, connections, backups, replication, etc.

One of the most compelling use cases is to use SimpleSQL to create completely static web pages that can operate against one database. This makes it possible to create very powerful web apps that don’t need a server backend. All the interaction with the backend can happen directly in the client code.

A common way to host a simple web app is to use GitHub Pages, which offers free hosting for open source projects.

In order to use SimpleSQL, the browser must be allowed to make CORS requests against simplesql.redbeardlab.com, that is, cross-origin HTTP requests from your page to the simplesql.redbeardlab.com domain.

Fortunately, GitHub Pages already allows it, so no extra step is necessary for CORS.

SimpleSQL also provides a JS SDK, which makes writing applications simpler.

To use the SDK, it is sufficient to import it by adding the following line to the head section of your HTML pages.

<script type="text/javascript" src="https://unpkg.com/@redbeardlab/simplesql@>=1.0.8"></script>

Once the SDK is imported, it is possible to invoke all the SimpleSQL functions, such as SimpleSQL.newDatabase() to create a new database, or SimpleSQL.command(db, "select 1;") to execute a command against one database, all from client code.

An example, hosted on GitHub Pages, is available here, with the source here.

The documentation for the SDK is on GitHub.

The raw HTTP API is documented on Swagger.

Invariant as Interface

Simone Mosciatti — Fri, 21 Feb 2020 21:43:02 +0000

An invariant is a condition that holds true no matter what during the execution of a phase of a computer program.

For instance, in a classical for loop, for (i = 0; i <= 10; i++), an invariant is that inside the loop body the value of i is always between 0 and 10 (barring explicit changes to i).
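As a rough sketch, the same loop can be written in Python with the invariant stated as an assertion. The function name is made up for illustration, and range(0, 11) stands in for the C-style loop condition i <= 10.

```python
def check_loop_invariant():
    """Run the loop from the text and assert the invariant on every iteration."""
    seen = []
    for i in range(0, 11):  # plays the role of: for (i = 0; i <= 10; i++)
        # Invariant: inside the loop body, i is always between 0 and 10.
        assert 0 <= i <= 10
        seen.append(i)
    return seen
```

If the body ever assigned a value outside that range to i, the assertion would fail on the spot, which is exactly what makes an invariant useful during development.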

Keeping as many invariants as possible inside our software helps development. But we can also use invariants as an interface, and that helps immensely when consuming our software.

I was hit by this problem when I started to actually stress my own software. It took a while to admit that I had made a sub-optimal choice in one of the main interfaces of RediSQL.

RediSQL is a Redis module that allows users to send SQL commands to a Redis instance. The memory spaces of Redis and RediSQL are separate, so you can’t query Redis with SQL, but you can create your own tables and use those. RediSQL is based on SQLite.

One of the main interfaces of RediSQL is the REDISQL.EXEC command, which executes a raw SQL statement against one SQLite database.

Upon executing a SQL command, SQLite can return three different values:

OK
DONE, and the number of rows modified
A result consisting of one or more rows

I implemented the REDISQL.EXEC command to return, respectively:

The string “OK”
An array containing the string “DONE” and one integer
An array of arrays containing the result of a query

Moreover, a query that returns no rows returns DONE, not an empty result.

While this seems a reasonable interface when used in the CLI, it is very difficult to use programmatically.

The user needs to test if the result is a string, and then make sure that the string is actually “OK” to match the first case.

If it is not a string, it must be an array, so the user needs to check whether it is an array of length 2 whose first element is the string “DONE” to match the second case.

Finally, to match a query result, the user needs to make sure that the result is an array of arrays; only then can it be consumed.

This is very cumbersome and tedious to implement, especially in statically typed languages like Go and Java. In those languages the result is consumed as a generic interface{} or Object and then parsed into something safer. It is a little better in dynamically typed languages.

While developing SimpleSQL on top of RediSQL, I understood that this was a real problem and I decided to create a v2 of RediSQL, also fixing several other design mistakes I made the first time.

The new interface exploits exactly the concept of Invariant as Interface. Now REDISQL.EXEC always returns an array of arrays, and the first element of the first array is always a string: either “OK”, “DONE” or “RESULT”.

Consuming this API is much simpler: the user knows that it will always receive an array of arrays, and that the first position holds a tag indicating how the rest of the result should be interpreted and used.
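To make the difference concrete, here is a minimal sketch of how a client might classify the two reply shapes, assuming the reply has already been decoded into Python strings, integers and lists. The function names are hypothetical, and nothing is assumed about the v2 layout beyond the tag in the first position of the first array.

```python
def classify_v1(reply):
    """Old interface: the shape of the reply has to be probed case by case."""
    if isinstance(reply, str):
        if reply == "OK":
            return "OK"
        raise ValueError("unexpected string reply: %r" % (reply,))
    if isinstance(reply, list):
        # A two-element array starting with "DONE" carries the modified row count.
        if len(reply) == 2 and reply[0] == "DONE" and isinstance(reply[1], int):
            return "DONE"
        # Anything else must be an array of arrays to be a query result.
        if all(isinstance(row, list) for row in reply):
            return "RESULT"
    raise ValueError("unexpected reply shape: %r" % (reply,))


def classify_v2(reply):
    """New interface: the tag is always the first element of the first array."""
    tag = reply[0][0]
    if tag not in ("OK", "DONE", "RESULT"):
        raise ValueError("unknown tag: %r" % (tag,))
    return tag
```

With the tagged form the type checks disappear: a single lookup decides how to read the rest of the reply.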

Then the same concept was carried over to SimpleSQL, creating an API that is simple to consume.

If you are interested in SimpleSQL, subscribe to the product’s mailing list or follow me on Twitter.

Installing software, brief guide for when stuff doesn’t work.

Simone Mosciatti — Fri, 31 Jan 2020 18:00:04 +0000

In this short post we are going to see how to install software when stuff doesn’t work out of the box. We will understand how a *NIX shell searches for software and how to make sure that our binaries are always found.

When dealing with software, installation is a classic issue. Hopefully the software you want to install is available as a package from your favorite package manager (deb, rpm, something else); usually those packages are well made and everything works out of the box.

However, you may need to install software that is not available as a package, or the package is broken, or something that used to work yesterday is not working anymore today.

In those cases there are usually two options:

Start everything from scratch again (delete the virtual machine or stop the docker containers)
Understand the inner workings of the system so that you can fix it, and make sure that similar problems don’t happen again.

This is a brief guide for the second option. I assume a basic knowledge of *NIX systems and some familiarity with the command line.

What it means to install software

The more proficient you become with the topic, the harder “install” becomes to define.

In this post, by “install” we mean setting up the system in such a way that it is possible to invoke a binary. A complete installation also makes sure that all the necessary environment variables are set up correctly.

It can be as simple as apt-get install or it can be more complex.

Binaries

To install a binary it is sufficient to place it in a directory listed in the system PATH. The system PATH is an environment variable that stores an ordered list of directories. At the moment, on my system it looks like this:

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

The paths are ordered and separated by a colon (:), so in this case the paths are:

/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin

The different directories exist by convention; for instance, the sbin directories hold binaries meant for system administration.

When we start a binary, the shell checks whether it can find the binary in the PATH; the lookup is done by name.

To visualize this, let’s log all the system calls made when we invoke the tree binary. (tree prints the directory structure of a given folder, and you can install it from system packages.)

To see the system calls we can use strace (again available from system packages).

However, we cannot just call strace tree, since strace would start logging only after tree has already been found. Instead we can strace bash, which in turn invokes tree, like so:

$ strace bash -c "tree -D 1"
... a lot of stuff ...
stat("/usr/local/sbin/tree", 0x7ffcb15ffea0) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/tree", 0x7ffcb15ffea0) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/tree", 0x7ffcb15ffea0)  = -1 ENOENT (No such file or directory)
stat("/usr/bin/tree", {st_mode=S_IFREG|0755, st_size=77384, ...}) = 0
... yet more stuff ...

As expected, the system checks all the directories in the PATH, in order. It starts by checking whether a file called tree is present in the first directory, stat /usr/local/sbin/tree, but the call returns an error (-1): the file is not there. The same happens for /usr/local/bin/tree and /usr/sbin/tree. Finally it finds the file at /usr/bin/tree and can invoke it.

So there are two ways to install software: the first is to add the binaries to one of the directories in $PATH, the other is to add the directory that contains our binaries to $PATH.
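The lookup seen in the strace output can be sketched in a few lines of Python. This is a simplified illustration, not what bash literally does: the function name find_in_path is made up, and empty PATH entries (which a real shell treats as the current directory) are simply skipped.

```python
import os


def find_in_path(name, path=None):
    """Return the first executable called `name` found in `path`, or None.

    `path` is a colon-separated list of directories, like $PATH.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # simplification: a real shell treats "" as the current directory
        candidate = os.path.join(directory, name)
        # Only the file name is compared: the first executable match wins.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

Calling find_in_path("tree") walks the directories in the same order as the stat calls above. The standard library offers shutil.which for the same purpose.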

Tricking the shell into invoking the wrong command

This system is quite fragile: the lookup happens only at the level of file names, with nothing more than a plain string comparison.

What happens if we install a new piece of software, also called tree?

$ mkdir -p /fake/bin
$ cat /fake/bin/tree
#! /bin/bash
echo "fake tree"
$ chmod +x /fake/bin/tree
$ /fake/bin/tree
fake tree

Here we have created a new binaries directory (/fake/bin) and put inside it an executable (chmod +x) called tree. The fake tree just prints out a string.

Now if we invoke tree the regular process happens: all the directories in $PATH are checked until a tree executable is found, and once found it is executed.

$ tree -D 1 
1 [error opening dir]

0 directories, 0 files

Indeed the regular tree software is invoked.

Let’s change the $PATH variable:

$ export PATH="/fake/bin:$PATH"
$ echo $PATH
/fake/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Now the first directory checked is /fake/bin, and the system will find an executable called tree in there.

And if we invoke tree again:

$ tree -D 1
fake tree

As expected the fake tree is invoked.

This is a source of great flexibility but also of many frustrations.

It is flexible because it allows us to install new software without being administrators (no sudo access needed). Moreover, it allows different versions of the same software to coexist on the system. But of course it is easy to make a mistake and invoke the wrong executable.

`which` to the rescue

$ which tree 
/fake/bin/tree

The which utility lets us discover which path is resolved when looking for a binary.

Let’s fix this:

$ export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
$ which tree
/usr/bin/tree
$ tree --version
tree v1.7.0 (c) 1996 - 2014 by Steve Baker, Thomas Moore, Francesc Rocher, Florian Sesser, Kyosuke Tokoro

Subscribe to the mailing list, or follow me on Twitter.

Software is a FOCUS intensive industry.

Simone Mosciatti — Sun, 19 Jan 2020 14:03:56 +0000

Industries are usually characterized as labor-intensive or as capital-intensive.

Labor-intensive industries require a lot of human input in order to produce their output. Classic examples are the service industry (like restaurants or hotels) or even old manufacturing (where people built things with their hands).

Capital-intensive industries require much less human input, but they require a lot of capital in order to produce their output. Examples include building oil rigs, insurance companies, energy producers, and airlines.

Software does not fall into either of these two categories.

By “software industry” I mean the industry that produces software products. Ventures like AWS, Microsoft Azure or Google Cloud Platform are definitely capital-intensive. In this post I am talking about software products like Slack, Notion, WordPress, GCC.

Software is not labor-intensive. Not many people are necessary in order to produce good software. On the contrary, Brooks’s law on software project management states that “adding manpower to a late software project makes it later”.

Nor is software capital-intensive. Building a great software product does not require huge monetary investments. Computing power is almost free nowadays and the same is true for bandwidth and storage.

What makes or breaks a project is the amount of FOCUS developers can pour into it.

By focus I mean the internal knowledge of all the nitty-gritty details of the software and its surroundings. The knowledge of how the software is used and of how it solves real problems for the users. The knowledge of why it is implemented the way it is. The capacity to add features and fix bugs without breaking the software for a subset of users.

What I define as FOCUS is not just knowledge of computer programming and data structures, it is much more than just knowing the users, it is much more than testing and CI/CD. Of course those things are important, but it is not just that. If they were sufficient, companies like Google, Amazon, Facebook, Microsoft would keep pushing out new flawless software products. But that is not the case.

Obtaining and maintaining FOCUS is extremely hard: it requires raw knowledge of computing fundamentals, a lot of time and experience with the product, and having built a strong community that sustains the effort (asking for features, trusting updates, discussing the general direction of the product).

Unfortunately people get bored, especially in creative professions: when the product is mature, the job becomes more boring. Maintenance is still necessary, and the most efficient person to maintain a piece of software will be the one least interested in doing it.

Moreover, it is a risk for people to be stuck too long on the same project using the same set of technologies. The half-life of any particular technology is very short in the software industry and developers are aware of it. Trying to mitigate this kind of risk is a perfectly rational choice. On the other hand, management, pursuing efficiency, will push developers to use the same old proven technologies.

Finding the right balance between innovative work and efficient use of time is difficult for people managing projects. Too much innovation and nothing important gets done, with developers chasing the new, shiny, hyped project. No innovation at all and developers are quickly alienated by having to keep maintaining the same project.

As a last point, especially in the software industry, switching companies very often is the recognized strategy to maximize income. This makes it even harder to reach FOCUS. If the team that works on a product completely changes within 5 years, it is impossible for the team to have FOCUS on the project.

Unfortunately there are no easy answers, but it is clear that being able to create production-grade software is almost a necessity for every medium to big organization.

Hierarchical JSON with SQLite / RediSQL

Simone Mosciatti — Tue, 14 Jan 2020 20:29:02 +0000

RediSQL is compiled including the JSON1 SQLite extension. Hence, all the functions documented in JSON1 are available out of the box.

JSON1 is extremely flexible and powerful. As an example, consider a report table that tracks a company’s sales by year, quarter and week.

> REDISQL.CREATE_DB DB
> REDISQL.EXEC DB "CREATE TABLE sales(year STRING, quarter STRING, week STRING, total INT);"
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2019', 'q1', '1', 100);"
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2019', 'q1', '2', 125);"
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2019', 'q2', '1', 200);" 
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2019', 'q2', '2', 300);" 
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2020', 'q1', '1', 400);" 
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2020', 'q1', '2', 450);" 
> REDISQL.EXEC DB "INSERT INTO sales VALUES('2020', 'q2', '1', 500);"

From this table we would like to generate a JSON report in the form:

{"2019":
  {"q1": {"1": 100, "2": 125},
   "q2": {"1": 200, "2": 300}},
 "2020":
  {"q1": {"1": 400, "2": 450},
   "q2": {"1": 500}}}

This is not a trivial problem, because SQL generally returns flat rows rather than nested data, and here we want a single JSON string. However, the JSON1 module is flexible enough, and CTEs (common table expressions) give us enough expressive power.

Let's see the final query first and then we will try to understand it piece by piece.

WITH quarters AS (
  WITH weeks AS (
      SELECT year, quarter, json_group_object(week, total) AS week_json 
      FROM sales 
      GROUP BY year, quarter
      ) 
  SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json 
  FROM weeks 
  GROUP BY year
) 
SELECT json_group_object(year, json(quarters_json)) 
FROM quarters;

This returns exactly the single line we are looking for.

It looks like a difficult query, but working on it piece by piece we can understand it quickly.

The WITH clause simply creates a "virtual table" that is valid for the duration of the query.

The simplest way to understand this query is to go inside-out.

SELECT year, quarter, json_group_object(week, total) AS week_json 
FROM sales 
GROUP BY year, quarter

json_group_object is an aggregate function; it returns a JSON string with the weeks as keys and the totals as values.

> REDISQL.EXEC DB "SELECT year, quarter, json_group_object(week, total) AS week_json FROM sales GROUP BY year, quarter" 
1) 1) (integer) 2019
   2) "q1"
   3) "{\"1\":100,\"2\":125}"
2) 1) (integer) 2019
   2) "q2"
   3) "{\"1\":200,\"2\":300}"
3) 1) (integer) 2020
   2) "q1"
   3) "{\"1\":400,\"2\":450}"
4) 1) (integer) 2020
   2) "q2"
   3) "{\"1\":500}"

In this way we are able to create a JSON document that expresses the total sales for each week. We compress the week column into a flat JSON document.

The next step is similar: for each year, we compress each quarter into a JSON document. The difficulty lies in preserving the weekly totals.

WITH weeks AS (
    --- same query as above
    SELECT year, quarter, json_group_object(week, total) AS week_json 
    FROM sales 
    GROUP BY year, quarter
) 
  SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json 
  FROM weeks 
  GROUP BY year

We introduce the WITH clause.
Using the WITH clause we treat the result of the query above as a new table that we can use in the later statement; the new table is called weeks.
Note how we conveniently give a name (week_json) to the result of the json_group_object aggregation. The name lets us manipulate that JSON object: wrapping it in json() makes SQLite embed it as a nested JSON object instead of a quoted string.

The rest of the query is very similar: we are compressing all the quarters of a year into a single JSON object.

> REDISQL.EXEC DB "WITH weeks AS ( SELECT year, quarter, json_group_object(week, total) AS week_json FROM sales GROUP BY year, quarter) SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json FROM weeks GROUP BY year"
1) 1) (integer) 2019
   2) "{\"q1\":{\"1\":100,\"2\":125},\"q2\":{\"1\":200,\"2\":300}}"
2) 1) (integer) 2020
   2) "{\"q1\":{\"1\":400,\"2\":450},\"q2\":{\"1\":500}}"

This query gives us, for each year, a hierarchical JSON structure that maps quarters and weeks to total sales.

Now we can guess the last step: compress the years into another hierarchical JSON structure.

This yields the original query:

WITH quarters AS (
  --- same query as above
  WITH weeks AS (
      SELECT year, quarter, json_group_object(week, total) AS week_json 
      FROM sales 
      GROUP BY year, quarter
      ) 
  SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json 
  FROM weeks 
  GROUP BY year
) 
SELECT json_group_object(year, json(quarters_json)) 
FROM quarters;

And let's see the result:

> REDISQL.EXEC DB "WITH quarters AS ( WITH weeks AS ( SELECT year, quarter, json_group_object(week, total) AS week_json FROM sales GROUP BY year, quarter ) SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json FROM weeks GROUP BY year) SELECT json_group_object(year, json(quarters_json)) FROM quarters;"

1) 1) "{\"2019\":{\"q1\":{\"1\":100,\"2\":125},\"q2\":{\"1\":200,\"2\":300}},\"2020\":{\"q1\":{\"1\":400,\"2\":450},\"q2\":{\"1\":500}}}"

The final result is a hierarchical JSON structure where the years map to the quarters, the quarters map to the weeks and the weeks map to the sales.
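The same report can be reproduced outside Redis with Python's built-in sqlite3 module, assuming the underlying SQLite library includes the JSON1 functions (the usual case in recent builds). This sketch declares the columns as TEXT and INT and writes the two CTEs side by side rather than nested; both are choices made for the example. The function name sales_report is hypothetical.

```python
import json
import sqlite3

ROWS = [
    ("2019", "q1", "1", 100), ("2019", "q1", "2", 125),
    ("2019", "q2", "1", 200), ("2019", "q2", "2", 300),
    ("2020", "q1", "1", 400), ("2020", "q1", "2", 450),
    ("2020", "q2", "1", 500),
]

QUERY = """
WITH weeks AS (
    SELECT year, quarter, json_group_object(week, total) AS week_json
    FROM sales
    GROUP BY year, quarter
),
quarters AS (
    SELECT year, json_group_object(quarter, json(week_json)) AS quarters_json
    FROM weeks
    GROUP BY year
)
SELECT json_group_object(year, json(quarters_json))
FROM quarters;
"""


def sales_report():
    """Build the year -> quarter -> week -> total report as a Python dict."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales(year TEXT, quarter TEXT, week TEXT, total INT)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", ROWS)
        # The query returns a single row holding a single JSON string.
        (report_json,) = conn.execute(QUERY).fetchone()
    finally:
        conn.close()
    return json.loads(report_json)
```

Parsing the string with json.loads on the client side gives back the nested dictionary, so the application never has to regroup flat rows itself.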

Tracking the Trackers

Simone Mosciatti — Sat, 18 May 2019 17:11:13 +0000

This project explores how the web is tracked, and by whom.

As the most tech savvy readers know, when we visit a web page, several things happen in the background.

The server sends the page to the user’s browser, which starts to paint the content on screen. However, the browser may need to access other resources; the most common are:

images
instructions for how to style the various elements (like color, dimension, position of the text), known as CSS
code for animations or smart applications, known as JS
fonts (how text appears)

All these resources may be provided by the same website, or they may be provided by a different website.

If those resources are provided by a different website, the browser needs to obtain them by making a request to a different actor.

All these requests may be used to track users on the web, especially if they are associated with cookies (hence the annoying banners on every website) and headers (for those we don’t get banners).

An example of this is the social buttons by Facebook, Twitter, Google, Reddit, etc. In order to show those buttons it is necessary to make a request to the respective company and send information about the user. This makes it possible to show very social buttons (like “Jon, Tyrion and Sansa liked this element”), but those social platforms will know what page you have visited.

Finally, websites may also use analytics solutions that help them know who visits the website, what pages are visited most often, and other information. The most common analytics solution is provided by Google itself for free; of course the website obtains a lot of useful data, but Google obtains the same data as well.

Armed with this basic knowledge let’s explore how we can know who is tracking the web.

Obtain the data

The simplest way to know what requests are made to what service is to simply render the webpage using a browser like Firefox and track all the requests that are made.

This procedure is not as simple as it may look; luckily, thanks to help from friends, a reasonably simple solution was possible.

Chrome headless and selenium may help also

— Ramiro Algozino (@ralgozino) May 14, 2019

How difficult can it be to programmatically get a list of all the request a browser does in order to display a web page?

We programmatically drive Firefox, making all the requests go through a proxy.

Everything was nicely packed together in the selenium-wire project.

The result is a tiny Python script that takes a domain as input, starts Firefox, makes Firefox visit and render the homepage, tracks all the requests through a proxy and finally stores all the requests in a SQLite file.

import sys

from seleniumwire import webdriver  # Import from seleniumwire
from selenium.webdriver.firefox.options import Options

import tldextract
from urllib.parse import urlparse

import sqlite3
import json

options = Options()
options.headless = True
# Create a new instance of the Firefox driver
driver = webdriver.Firefox(options=options)

original_domain = sys.argv[1]
url = 'https://{}'.format(original_domain)
# Make a request to the URL
driver.get(url)

conn = sqlite3.connect("requests.db")
c = conn.cursor()

c.execute('''
    CREATE TABLE IF NOT EXISTS requests(
        original_domain TEXT NOT NULL,
        original_url TEXT NOT NULL,
        time_request INT DEFAULT (strftime('%s','now')),
        request TEXT,
        status_code INT,
        subdomain TEXT,
        domain TEXT,
        tld TEXT,
        scheme TEXT,
        netloc TEXT,
        path TEXT,
        params TEXT,
        query TEXT,
        fragment TEXT,
        request_header TEXT,
        response_header TEXT
    );
''')
conn.commit()
insert_stmt = """
INSERT INTO requests(
        original_domain,
        original_url,
        request,
        status_code,
        subdomain,
        domain,
        tld,
        scheme,
        netloc,
        path,
        params,
        query,
        fragment,
        request_header,
        response_header
)
VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, json(?), json(?));
"""

# Access requests via the `requests` attribute
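for request in driver.requests:
    # Skip requests that never received a response
    if request.response:
        # With the selenium-wire release used here, `request.path` holds the
        # full URL (later releases may expose it as `request.url` instead)
        rpath = request.path
        # Split the host into subdomain, registered domain and TLD
        subdomain, domain, tld = tldextract.extract(rpath)
        parsedRequest = urlparse(rpath)
        scheme, netloc, path, params, query, fragment = parsedRequest
        status_code = request.response.status_code
        data = (
            original_domain,
            url,
            rpath,
            status_code,
            subdomain,
            domain,
            tld,
            scheme,
            netloc,
            path,
            params,
            query,
            fragment,
            # Headers are stored as JSON text
            json.dumps(dict(request.headers)),
            json.dumps(dict(request.response.headers)),
        )

        c.execute(insert_stmt, data)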

conn.commit()

driver.close()
driver.quit()

At this point we have a script that, given a domain as input, gets its home page and stores all the requests necessary to render that homepage in a small database.

Then we used the list of the top 10 million most influential websites (actually domains) to know which websites are most visited.

We manipulate the list of domains to obtain the first few hundred domains:

cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000
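For readers less familiar with awk, the extraction step can be sketched in Python. It assumes, as the awk program implies, that the domain is the second comma-separated field and is wrapped in one extra character on each side (presumably quotes). The function name top_domains is made up for the example.

```python
from itertools import islice


def top_domains(lines, limit=1000):
    """Yield the second CSV field of the first `limit` lines, minus its first and last character."""
    for line in islice(lines, limit):
        # Assumes every line has at least two comma-separated fields.
        fields = line.rstrip("\n").split(",")
        # Same effect as awk's substr($2, 2, length($2) - 2)
        yield fields[1][1:-1]
```

Passing an open file object, for instance top_domains(open("top10milliondomains.csv")), would produce the same list of domains that the shell pipeline feeds to the next step.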

And finally we use xargs to run the python script in parallel.

xargs -n1 -P6 python3 tracker.py

Hence, the whole command was:

cat top10milliondomains.csv | awk -F "," '{ print substr($2, 2, length($2) - 2)}' | head -n 1000 | xargs -n1 -P6 python3 tracker.py

After some hours we collected 186582 requests made while rendering the homepages of 1924 domains. Those requests are against 3472 domains.

The number of requests is definitely not huge, far from it, but in order to make them Firefox needs to render a whole webpage along with the JS and CSS, definitely not a lightweight task.

A brief data analysis will soon follow; follow me on Twitter or subscribe to the mailing list to receive updates.

Repository here

Write a Postgres proxy. Day 1. Getting familiar with the API.

Simone Mosciatti — Fri, 17 May 2019 14:32:16 +0000

RediSQL, SQL steroids for Redis, is a very fast in-memory SQL engine. Its main features are:

Speed, up to 130,000 inserts per second
Familiarity, it supports standard SQL, no weird dialects
Simplicity, it is very easy to operate and to use, with bindings for any language.

Code on github: RedBeardLab/rediSQL

In this series of posts we are writing a Postgres proxy that accepts connections made using the Postgres (PG) protocol and forwards them to RediSQL.

Motivation and introduction of the project are here.

My hope for this project is to distill the knowledge I am getting from this work and help other that are interested in exploring the PG protocol.

Intro

This post is about the first day of this project, so it is mostly an introduction to the references used during this work plus a little bit of code.

On this day we quickly reached the stage where we are able to receive a query from psql (the CLI tool for PG).

We will start the post by showing the PG references that are most useful and the Python references for the asyncio module we used.

Then we will explore very quickly the few lines of code that I ended up writing.

The last section will explore the errors I made during this day, trivial errors that were a big time sink anyway.

The references

Before starting this work I wondered whether I should implement this proxy for Postgres or for MySQL.

To choose, I explored the documentation of both projects, and both are quite good. However, the documentation for PG looked simpler to follow and more linear, so I just decided to go for PG.

The main documentation for this project is Chapter 50 of the PG documentation.

In particular the following sections are of extreme interest:

The Message Flow section explains the flow of messages between PG and the client. It helps in understanding what message we should expect from the client and what message we are required to send as a server.
The Message Data Types section simply explains how to read and interpret the “Message Formats” section.
The Message Formats section goes into the details and enumerates the format of each kind of message. As an example, we discover that each message usually starts with a single letter that identifies the type of message (R is used for authentication-related messages, while queries start with Q), then 4 bytes (an Int32) indicate the length of the message (counting the length field itself but not the type letter), and finally comes the body of the message itself.
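
To make that layout concrete, here is a minimal sketch of how such a message could be split apart with the struct module. It is not code from the proxy: the helper name split_message and the hand-built Query message are only assumptions for illustration.

```python
import struct


def split_message(data):
    """Split one typed PG message into (type, length, body).

    Layout: 1 byte type, then a big-endian Int32 length, then the body.
    The length counts itself and the body, but not the type byte.
    """
    msg_type = data[0:1]
    (length,) = struct.unpack("!i", data[1:5])
    body = data[5:1 + length]
    return msg_type, length, body


# A hand-built Query message: 'Q', length, then a null-terminated string.
query = b"SELECT 1;\x00"
message = b"Q" + struct.pack("!i", 4 + len(query)) + query
print(split_message(message))
```

The "!i" format asks for network byte order, which is how the PG protocol sends its integers.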

Python “ASYNCIO”

While I would like to merge this project into the main RediSQL Rust codebase, I am a strong believer that starting the project in Python is a good idea. I will gain the knowledge necessary to successfully rewrite the software in Rust, having already faced most of the implementation difficulties in a language that allows very fast iteration. Moreover, it will be simply impossible to merge the Python codebase into the RediSQL Rust codebase, so I will be forced to rewrite it.

While I am not looking for performance I still opted to work with asyncio, mostly because it had been a long time since I did any big work in Python and I wanted to get a feel for the available tools. Moreover I hoped that it would be closer to what I would find in Rust with Tokio, but it seems to me that the two models are not very similar.

On the Python side I keep referring to the Callback Based API for [asyncio](https://docs.python.org/3.5/library/asyncio-protocol.html#transports-and-protocols-callback-based-api).

The API is very simple: you subclass the asyncio.Protocol class and implement three callbacks:

connection_made for when a new connection is created to the server.
data_received for when a new packet of data arrives at the server.
connection_lost for when we lose the connection with the client.
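
A bare skeleton of such a subclass could look like the sketch below. The class name PGProxyProtocol and the received list are hypothetical, and main() uses the asyncio API of Python 3.7 or later; it is commented out so the snippet does not block when run.

```python
import asyncio


class PGProxyProtocol(asyncio.Protocol):
    """Hypothetical skeleton; it only records what happens."""

    def connection_made(self, transport):
        # Called once for every new client connection.
        self.transport = transport
        self.received = []

    def data_received(self, data):
        # Called every time a chunk of bytes arrives from the client.
        self.received.append(data)

    def connection_lost(self, exc):
        # Called when the connection closes; exc is None on a clean close.
        self.transport = None


async def main():
    loop = asyncio.get_running_loop()
    # One protocol instance is created for each incoming connection.
    server = await loop.create_server(PGProxyProtocol, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()


# asyncio.run(main())  # uncomment to listen on port 8888
```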

As you can imagine, all the logic is in the data_received callback, and it will be more complex than a standard web server. Indeed, HTTP is a stateless protocol, and everything is simpler if the protocol is stateless: each request does not depend on the previous one.

The PG protocol is stateful, which means that we need to store and use information from previous messages. As an example, a client, before sending its queries, needs to send a handshake and to authenticate. This means that our server will have at least two states: an “initial” state where each connection starts and a “ready” state where a connection ends up only once it has completed the handshake and authenticated.

The code

Finally, here is the code of this first day of coding. The code is mostly boilerplate copied from the Python documentation but it is already enough to accept a connection from psql and receive the first query.

To test our progress we started the Python server and, at the same time, executed psql giving as input a file with a few SQL statements to execute.

psql -f goal.sql -h localhost -p 8888

The workflow of the day

Other than boilerplate code, the interesting part of the code is the definition of the magic numbers that identify the messages:

SSLRequestCode = b'\x04\xd2\x16\x2f' # 80877103 == 0x04d2162f
StartupMessageCode = b'\x00\x03\x00\x00' # 196608 == 0x00030000

NoSSL = b'\x4E' # == 'N'

AuthenticationOk = b'\x52\x00\x00\x00\x08\x00\x00\x00\x00' # 'R', Int32(8), Int32(0)
AuthenticationCleartextPassword = b'\x52\x00\x00\x00\x08\x00\x00\x00\x03' # 'R', Int32(8), Int32(3)

ReadyForQuery = b'\x5A\x00\x00\x00\x05\x49' # 'Z', Int32(5), 'I'; the last I stands for Idle
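
These byte strings do not need to be typed by hand. As a small sketch (the lowercase variable names are mine, not the ones used in the proxy), the struct module can build the same values from the decimal numbers found in the documentation:

```python
import struct

# "!i" means network byte order (big-endian), signed 32-bit integer.
ssl_request_code = struct.pack("!i", 80877103)
startup_message_code = struct.pack("!i", 196608)

# 'R', then Int32 length 8, then Int32 0 meaning "authentication ok".
authentication_ok = b"R" + struct.pack("!ii", 8, 0)
# 'Z', then Int32 length 5, then 'I' for idle.
ready_for_query = b"Z" + struct.pack("!i", 5) + b"I"

print(ssl_request_code, startup_message_code)
```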

And the logic to reply to the client:

    def _reply(self, data):
        if self.state == "initial" and data[4:8] == SSLRequestCode:
            self.transport.write(NoSSL)
        elif self.state == "initial" and data[4:8] == StartupMessageCode:
            # we don't require a password
            self.transport.write(AuthenticationOk)
            # good to go for the first query!
            self.transport.write(ReadyForQuery)
        return

Let’s explore how we got to these few lines of code.

My discovering process

Scanning the documentation quickly, it could seem that the first message to expect is the StartupMessage; however, the first message sent by psql is the SSLRequest message, and this took quite a while to figure out.

The SSLRequest message is recognized because it contains the magic number 80877103, which we encode in the Python code as b'\x04\xd2\x16\x2f'.

Since we don’t yet support SSL we simply respond to the SSLRequest with N (encoded as b'\x4E') to let the client know that we are not going to use SSL. At this point the client can either drop the connection or decide to accept a non-encrypted connection and send the StartupMessage in plain text.

The StartupMessage also has a magic number (196608, which is the protocol version 3.0), which we encoded as b'\x00\x03\x00\x00'.

Along with the magic number, the StartupMessage contains information like the user who is starting the connection, which database the user is trying to connect to, and other parameters. At the moment we ignore all that information.

After the StartupMessage the server would normally require authentication; in our case we don’t care about authentication just yet, so we just send the AuthenticationOk message.
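
Even if the proxy ignores them for now, those parameters are easy to read. The sketch below is not part of the proxy: parse_startup_message is a hypothetical helper, and the user and database values in the sample message are made up. It relies on the documented layout, where the parameters are null-terminated name/value pairs closed by one extra zero byte.

```python
import struct


def parse_startup_message(data):
    """Return (protocol_code, params) from a StartupMessage.

    Layout: Int32 length, Int32 protocol version, then null-terminated
    name/value string pairs, closed by one extra zero byte.
    """
    length, protocol_code = struct.unpack("!ii", data[:8])
    parts = data[8:length].split(b"\x00")
    params = {}
    for i in range(0, len(parts) - 1, 2):
        name = parts[i]
        if not name:  # an empty name marks the final terminator
            break
        params[name.decode()] = parts[i + 1].decode()
    return protocol_code, params


# Hand-built sample; "simone" and "test" are made-up values.
body = b"user\x00simone\x00database\x00test\x00\x00"
message = struct.pack("!ii", 8 + len(body), 196608) + body
print(parse_startup_message(message))
```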

The next step is a little tricky.

We just sent a message to the client, the AuthenticationOk message, and so I would expect the client to send something back to the server.

Wrong!

Now the server needs to be proactive and tell the client that it is ok to start sending queries. We need to send two messages to the client, one after the other.

Indeed you can see in the code that we immediately send the ReadyForQuery message.
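
The whole exchange can be tried without opening a socket. The following is a sketch under a few assumptions: FakeTransport and Handshake are hypothetical stand-ins for the asyncio transport and for the protocol class, the incoming messages are built by hand, and the switch to the "ready" state is my guess at the natural next step, since the snippet above does not show it.

```python
SSL_REQUEST_CODE = b"\x04\xd2\x16\x2f"
STARTUP_MESSAGE_CODE = b"\x00\x03\x00\x00"
NO_SSL = b"N"
AUTHENTICATION_OK = b"R\x00\x00\x00\x08\x00\x00\x00\x00"
READY_FOR_QUERY = b"Z\x00\x00\x00\x05I"


class FakeTransport:
    """Stand-in for the asyncio transport; it records what is written."""

    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)


class Handshake:
    def __init__(self, transport):
        self.transport = transport
        self.state = "initial"

    def reply(self, data):
        if self.state == "initial" and data[4:8] == SSL_REQUEST_CODE:
            # Refuse SSL; the client may carry on in plain text.
            self.transport.write(NO_SSL)
        elif self.state == "initial" and data[4:8] == STARTUP_MESSAGE_CODE:
            # Two messages back to back, without waiting for the client.
            self.transport.write(AUTHENTICATION_OK)
            self.transport.write(READY_FOR_QUERY)
            # Assumption: not shown in the original snippet.
            self.state = "ready"


transport = FakeTransport()
handshake = Handshake(transport)
# SSLRequest: Int32 length 8, then the magic number.
handshake.reply(b"\x00\x00\x00\x08" + SSL_REQUEST_CODE)
# StartupMessage with no parameters: length 9, version, final terminator.
handshake.reply(b"\x00\x00\x00\x09" + STARTUP_MESSAGE_CODE + b"\x00")
print(handshake.state, transport.written)
```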

At this point our time is over for this day; however, we can clearly see from the log that the next message received by the server is the first query of our file!

Success!

Errors made during this day

During this coding session I wasted a lot of time because I didn’t read the documentation with enough care.

Indeed I was expecting the StartupMessage as the first message and not the SSLRequest. I spent a lot of time trying to read the SSLRequest as if it were a StartupMessage: maybe I was reading the message with the wrong endianness? Maybe there was “garbage” from the protocol layer?

Nah! I was just reading the wrong message.

Another time sink was me reading the wrong column in the ASCII table, trying to use the decimal, instead of the hexadecimal, encoding. All the messages start with a letter; in our case we needed the N for rejecting SSL, the R for the AuthenticationOk message and finally the Z for the ReadyForQuery message. As an example, N is 78 in decimal and 4E in hexadecimal. I was trying to encode N as b'\x78' instead of b'\x4E'.
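
A quick check in the Python interpreter would have caught the mistake. This is just a small sketch of that check:

```python
# 'N' is 78 in decimal, which is 0x4E in hexadecimal.
assert ord("N") == 78 == 0x4E
assert b"\x4e" == b"N"

# b'\x78' is the byte 0x78 (120 in decimal), which is a lowercase 'x'.
assert b"\x78" == b"x"

# Letting Python do the conversion avoids reading the ASCII table at all.
print(bytes([78]), hex(ord("N")))
```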

Conclusion

I hope you enjoyed the post.

I will keep publishing about this topic on this blog, so if you are interested feel free to follow me on twitter or subscribe to the mailing list just below.

All posts of this series: Writing a Postgres proxy.

Write a Postgresql proxy. The Beginning.

Simone Mosciatti — Tue, 14 May 2019 10:02:02 +0000

This series of articles will follow my progress in creating a RediSQL proxy for Postgres, pg-redis-proxy.

The end goal of this project is to have a proxy that will listen to the PG protocol, forward the queries to RediSQL, and finally return the answer to the original client.

If the project is successful I hope to integrate the code into RediSQL itself, so as to provide another interface for RediSQL: not just the Redis protocol but also, directly, the PG protocol.

Caveats

The abstraction I am going to build will definitely be a leaky one.

Indeed SQLite (on which RediSQL is based) does not support a lot of PG features. Classic examples are the DATE datatypes, which are not supported in SQLite.

However I still feel it may be useful and fun to build.

Approach

Ideally I would like to have the code merged into the main RediSQL, which would definitely suggest coding it in Rust.

But Rust is a “production” language.

At least in my experience, you need to have quite a good understanding of the problem and of the design before you can successfully code something in Rust.

Moreover, there is this old saying in programming that suggests planning for at least one prototype that you will eventually throw away and rewrite from scratch.

Indeed I am not so sure about the whole architecture or about the problems I will face while writing this proxy.

Hence the first version of pg-redis-proxy will be written in Python 3.

The goals of the Python version

The goals of the first Python version are:

Get to know the PG protocol
Quickly explore several possible architectures
Provide some open source libraries and guidance for others who want to explore a similar project

I will stop developing the project when it is possible to execute a simple SQL file against an instance of pg-redis-proxy and have it return the expected result.

The SQL file I am aiming for is something like:

CREATE TABLE foo(a INT, b INT, c INT);
INSERT INTO foo VALUES(1,2,3);
PREPARE insertfoo (int, int, int) AS
    INSERT INTO foo VALUES($1, $2, $3);
EXECUTE insertfoo(5,6,7);
INSERT INTO foo VALUES(4,5,6);
EXECUTE insertfoo(8,9,0); 

SELECT * FROM foo;

I am not aiming to code anything more than what is strictly necessary, but I will be very open to accepting pull requests.

Hence if you are interested in the project, or in a piece of the project, feel free to contribute to the repository.

The repo

You can follow the progress on this github repository.

Finally, if you are interested in following this project you can either follow me on twitter or subscribe to the mailing list on the original blog.
