Leonardo Giordani

Posted on Nov 1, 2020 • Edited on Nov 8, 2020 • Originally published at thedigitalcatonline.com

Dissecting a Web stack

#architecture #www #python #https

It was gross. They wanted me to dissect a frog.
(Beetlejuice, 1988)

Introduction

Having recently worked with young web developers who were exposed for the first time to proper production infrastructure, I received many questions about the various components that one can find in the architecture of a "Web service". These questions clearly expressed the confusion (and sometimes the frustration) of developers who understand how to create endpoints in a high-level language such as Node.js or Python, but were never introduced to the complexity of what happens between the user's browser and their framework of choice. Most of the times they don't know why the framework itself is there in the first place.

The challenge is clear if we just list (in random order), some of the words we use when we discuss (Python) Web development: HTTP, cookies, web server, Websockets, FTP, multi-threaded, reverse proxy, Django, nginx, static files, POST, certificates, framework, Flask, SSL, GET, WSGI, session management, TLS, load balancing, Apache.

In this post, I want to review all the words mentioned above (and a couple more) trying to build a production-ready web service from the ground up. I hope this might help young developers to get the whole picture and to make sense of these "obscure" names that senior developers like me tend to drop in everyday conversations (sometimes arguably out of turn).

As the focus of the post is the global architecture and the reasons behind the presence of specific components, the example service I will use will be a basic HTML web page. The reference language will be Python but the overall discussion applies to any language or framework.

My approach will be that of first stating the rationale and then implementing a possible solution. After this, I will point out missing pieces or unresolved issues and move on with the next layer. At the end of the process, the reader should have a clear picture of why each component has been added to the system.

The perfect architecture

A very important underlying concept of system architectures is that there is no perfect solution devised by some wiser genius, that we just need to apply. Unfortunately, often people mistake design patterns for such a "magic solution". The "Design Patterns" original book, however, states that

Your design should be specific to the problem at hand but also general enough to address future problems and requirements. You also want to avoid redesign, or at least minimize it.

And later

Design patterns make it easier to reuse successful designs and architectures. [...] Design patterns help you choose design alternatives that make a system reusable and avoid alternatives that compromise reusability.

The authors of the book are discussing Object-oriented Programming, but these sentences can be applied to any architecture. As you can see, we have a "problem at hand" and "design alternatives", which means that the most important thing to understand is the requirements, both the present and future ones. Only with clear requirements in mind, one can effectively design a solution, possibly tapping into the great number of patterns that other designers already devised.

A very last remark. A web stack is a complex beast, made of several components and software packages developed by different programmers with different goals in mind. It is perfectly understandable, then, that such components have some degree of superposition. While the division line between theoretical layers is usually very clear, in practice the separation is often blurry. Expect this a lot, and you will never be lost in a web stack anymore.

Some definitions

Let's briefly review some of the most important concepts involved in a Web stack, the protocols.

TCP/IP

TCP/IP is a network protocol, that is, a set of established rules two computers have to follow to get connected over a physical network to exchange messages. TCP/IP is composed of two different protocols covering two different layers of the OSI stack, namely the Transport (TCP) and the Network (IP) ones. TCP/IP can be implemented on top of any physical interface (Data Link and Physical OSI layers), such as Ethernet and Wireless. Actors in a TCP/IP network are identified by a socket, which is a tuple made of an IP address and a port number.

As far as we are concerned when developing a Web service, however, we need to be aware that TCP/IP is a reliable protocol, which in telecommunications means that the protocol itself takes care or retransmissions when packets get lost. In other words, while the speed of the communication is not granted, we can be sure that once a message is sent it will reach its destination without errors.

HTTP

TCP/IP can guarantee that the raw bytes one computer sends will reach their destination, but this leaves completely untouched the problem of how to send meaningful information. In particular, in 1989 the problem Tim Barners-Lee wanted to solve was how to uniquely name hypertext resources in a network and how to access them.

HTTP is the protocol that was devised to solve such a problem and has since greatly evolved. With the help of other protocols such as WebSocket, HTTP invaded areas of communication for which it was originally considered unsuitable such as real-time communication or gaming.

At its core, HTTP is a protocol that states the format of a text request and the possible text responses. The initial version 0.9 published in 1991 defined the concept of URL and allowed only the GET operation that requested a specific resource. HTTP 1.0 and 1.1 added crucial features such as headers, more methods, and important performance optimisations. At the time of writing the adoption of HTTP/2 is around 45% of the websites in the world, and HTTP/3 is still a draft.

The most important feature of HTTP we need to keep in mind as developers is that it is a stateless protocol. This means that the protocol doesn't require the server to keep track of the state of the communication between requests, basically leaving session management to the developer of the service itself.

Session management is crucial nowadays because you usually want to have an authentication layer in front of a service, where a user provides credentials and accesses some private data. It is, however, useful in other contexts such as visual preferences or choices made by the user and re-used in later accesses to the same website. Typical solutions to the session management problem of HTTP involve the use of cookies or session tokens.

HTTPS

Security has become a very important word in recent years, and with a reason. The amount of sensitive data we exchange on the Internet or store on digital devices is increasing exponentially, but unfortunately so is the number of malicious attackers and the level of damage they can cause with their actions. The HTTP protocol is inherently

HTTP is inherently insecure, being a plain text communication between two servers that usually happens on a completely untrustable network such as the Internet. While security wasn't an issue when the protocol was initially conceived, it is nowadays a problem of paramount importance, as we exchange private information, often vital for people's security or for businesses. We need to be sure we are sending information to the correct server and that the data we send cannot be intercepted.

HTTPS solves both the problem of tampering and eavesdropping, encrypting HTTP with the Transport Layer Security (TLS) protocol, that also enforces the usage of digital certificates, issued by a trusted authority. At the time of writing, approximately 80% of websites loaded by Firefox use HTTPS by default. When a server receives an HTTPS connection and transforms it into an HTTP one it is usually said that it terminates TLS (or SSL, the old name of TLS).

WebSocket

One great disadvantage of HTTP is that communication is always initiated by the client and that the server can send data only when this is explicitly requested. Polling can be implemented to provide an initial solution, but it cannot guarantee the performances of proper full-duplex communication, where a channel is kept open between server and client and both can send data without being requested. Such a channel is provided by the WebSocket protocol.

WebSocket is a killer technology for applications like online gaming, real-time feeds like financial tickers or sports news, or multimedia communication like conferencing or remote education.

It is important to understand that WebSocket is not HTTP, and can exist without it. It is also true that this new protocol was designed to be used on top of an existing HTTP connection, so a WebSocket communication is often found in parts of a Web page, which was originally retrieved using HTTP in the first place.

Implementing a service over HTTP

Let's finally start discussing bits and bytes. The starting point for our journey is a service over HTTP, which means there is an HTTP request-response exchange. As an example, let us consider a GET request, the simplest of the HTTP methods.

GET / HTTP/1.1
Host: localhost
User-Agent: curl/7.65.3
Accept: */*

As you can see, the client is sending a pure text message to the server, with the format specified by the HTTP protocol. The first line contains the method name (GET), the URL (/) and the protocol we are using, including its version (HTTP/1.1). The remaining lines are called headers and contain metadata that can help the server to manage the request. The complete value of the Host header is in this case localhost:80, but as the standard port for HTTP services is 80, we don't need to specify it.

If the server localhost is serving HTTP (i.e. running some software that understands HTTP) on port 80 the response we might get is something similar to

HTTP/1.0 200 OK
Date: Mon, 10 Feb 2020 08:41:33 GMT
Content-type: text/html
Content-Length: 26889
Last-Modified: Mon, 10 Feb 2020 08:41:27 GMT

<!DOCTYPE HTML>
<html>
...
</html>

As happened for the request, the response is a text message, formatted according to the standard. The first line mentions the protocol and the status of the request (200 in this case, that means success), while the following lines contain metadata in various headers. Finally, after an empty line, the message contains the resource the client asked for, the source code of the base URL of the website in this case. Since this HTML page probably contains references to other resources like CSS, JS, images, and so on, the browser will send several other requests to gather all the data it needs to properly show the page to the user.

So, the first problem we have is that of implementing a server that understands this protocol and sends a proper response when it receives an HTTP request. We should try to load the requested resource and return either a success (HTTP 200) if we can find it, or a failure (HTTP 404) if we can't.

1 Sockets and parsers

1.1 Rationale

TCP/IP is a network protocol that works with sockets. A socket is a tuple of an IP address (unique in the network) and a port (unique for a specific IP address) that the computer uses to communicate with others. A socket is a file-like object in an operating system, that can be thus opened and closed, and that we can read from or write to. Socket programming is a pretty low-level approach to the network, but you need to be aware that every software in your computer that provides network access has ultimately to deal with sockets (most probably through some library, though).

Since we are building things from the ground up, let's implement a small Python program that opens a socket connection, receives an HTTP request, and sends an HTTP response. As port 80 is a "low port" (a number smaller than 1024), we usually don't have permissions to open sockets there, so I will use port 8080. This is not a problem for now, as HTTP can be served on any port.

1.2 Implementation

Create the file server.py and type this code. Yes, type it, don't just copy and paste, you will not learn anything otherwise.

import socket

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)

    # Print out the data
    print(data.decode('utf-8'))

    # Close the connection
    conn.close()

This little program accepts a connection on port 8080 and prints the received data on the terminal. You can test it executing it and then running curl localhost:8080 in another terminal. You should see something like

$ python3 server.py 
GET / HTTP/1.1
Host: localhost:8080
User-Agent: curl/7.65.3
Accept: */*

The server keeps running the code in the while loop, so if you want to terminate it you have to do it with Ctrl+C. So far so good, but this is not an HTTP server yet, as it sends no response; you should actually receive an error message from curl that says curl: (52) Empty reply from server.

Sending back a standard response is very simple, we just need to call conn.sendall passing the raw bytes. A minimal HTTP response contains the protocol and the status, an empty line, and the actual content, for example

HTTP/1.1 200 OK

Hi there!

Our server becomes then

import socket

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)

    # Print out the data
    print(data.decode('utf-8'))

    conn.sendall(bytes("HTTP/1.1 200 OK\n\nHi there!\n", 'utf-8'))

    # Close the connection
    conn.close()

At this point, we are not really responding to the user's request, however. Try different curl command lines like curl localhost:8080/index.html or curl localhost:8080/main.css and you will always receive the same response. We should try to find the resource the user is asking for and send that back in the response content.

This version of the HTTP server properly extracts the resource and tries to load it from the current directory, returning either a success of a failure

import socket
import re

# Create a socket instance
# AF_INET: use IP protocol version 4
# SOCK_STREAM: full-duplex byte stream
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow reuse of addresses
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# Bind the socket to any address, port 8080, and listen
s.bind(('', 8080))
s.listen()

HEAD_200 = "HTTP/1.1 200 OK\n\n"
HEAD_404 = "HTTP/1.1 404 Not Found\n\n"

# Serve forever
while True:
    # Accept the connection
    conn, addr = s.accept()

    # Receive data from this socket using a buffer of 1024 bytes
    data = conn.recv(1024)

    request = data.decode('utf-8')

    # Print out the data
    print(request)

    resource = re.match(r'GET /(.*) HTTP', request).group(1)
    try:
        with open(resource, 'r') as f:
            content = HEAD_200 + f.read()
        print('Resource {} correctly served'.format(resource))
    except FileNotFoundError:
        content = HEAD_404 + "Resource /{} cannot be found\n".format(resource)
        print('Resource {} cannot be loaded'.format(resource))

    print('--------------------')

    conn.sendall(bytes(content, 'utf-8'))

    # Close the connection
    conn.close()

As you can see this implementation is extremely simple. If you create a simple local file named index.html with this content

<head>
    <title>This is my page</title>
    <link rel="stylesheet" href="main.css">
</head>
<html>
    <p>Some random content</p>
</html>

and run curl localhost:8080/index.html you will see the content of the file. At this point, you can even use your browser to open http://localhost:8080/index.html and you will see the title of the page and the content. A Web browser is a software capable of sending HTTP requests and of interpreting the content of the responses if this is HTML (and many other file types like images or videos), so it can render the content of the message. The browser is also responsible of retrieving the missing resources needed for the rendering, so when you provide links to style sheets or JS scripts with the <link> or the <script> tags in the HTML code of a page, you are instructing the browser to send an HTTP GET request for those files as well.

The output of server.py when I access http://localhost:8080/index.html is

GET /index.html HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache


Resource index.html correctly served
--------------------
GET /main.css HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: text/css,*/*;q=0.1
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Referer: http://localhost:8080/index.html
Pragma: no-cache
Cache-Control: no-cache


Resource main.css cannot be loaded
--------------------
GET /favicon.ico HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: image/webp,*/*
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache


Resource favicon.ico cannot be loaded
--------------------

As you can see the browser sends rich HTTP requests, with a lot of headers, automatically requesting the CSS file mentioned in the HTML code and automatically trying to retrieve a favicon image.

1.3 Resources

These resources provide more detailed information on the topics discussed in this section

Python 3 Socket Programming HOWTO
HTTP/1.1 Request format
HTTP/1.1 Response format
The source code of this example is available here

1.4 Issues

It gives a certain dose of satisfaction to build something from scratch and discover that it works smoothly with full-fledged software like the browser you use every day. I also think it is very interesting to discover that technologies like HTTP, that basically run the world nowadays, are at their core very simple.

That said, there are many features of HTTP that we didn't cover with our simple socket programming. For starters, HTTP/1.0 introduced other methods after GET, such as POST that is of paramount importance for today's websites, where users keep sending information to servers through forms. To implement all 9 HTTP methods we need to properly parse the incoming request and add relevant functions to our code.

At this point, however, you might notice that we are dealing a lot with low-level details of the protocol, which is usually not the core of our business. When we build a service over HTTP we believe that we have the knowledge to properly implement some code that can simplify a certain process, be it searching for other websites, shopping for books or sharing pictures with friends. We don't want to spend our time understanding the subtleties of the TCP/IP sockets and writing parsers for request-response protocols. It is nice to see how these technologies work, but on a daily basis, we need to focus on something at a higher level.

The situation of our small HTTP server is possibly worsened by the fact that HTTP is a stateless protocol. The protocol doesn't provide any way to connect two successive requests, thus keeping track of the state of the communication, which is the cornerstone of modern Internet. Every time we authenticate on a website and we want to visit other pages we need the server to remember who we are, and this implies keeping track of the state of the connection.

Long story short: to work as a proper HTTP server, our code should at this point implement all HTTP methods and cookies management. We also need to support other protocols like Websockets. These are all but trivial tasks, so we definitely need to add some component to the whole system that lets us focus on the business logic and not on the low-level details of application protocols.

Web frameworks, concurrency, HTTPS, web servers, AWS, performances... read the full post on The Digital Cat

Photo by Samrat Khadka on Unsplash