<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Filip Todic</title>
    <description>The latest articles on DEV Community by Filip Todic (@fitodic).</description>
    <link>https://dev.to/fitodic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F232622%2F41b84e75-df75-482c-b27a-140a512f6651.jpg</url>
      <title>DEV Community: Filip Todic</title>
      <link>https://dev.to/fitodic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fitodic"/>
    <language>en</language>
    <item>
      <title>How to change PostgreSQL's data directory on Linux</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Tue, 20 Apr 2021 14:39:54 +0000</pubDate>
      <link>https://dev.to/fitodic/how-to-change-postgresql-s-data-directory-on-linux-2n2b</link>
      <guid>https://dev.to/fitodic/how-to-change-postgresql-s-data-directory-on-linux-2n2b</guid>
      <description>&lt;p&gt;There comes a time when you have to restore a relatively large database locally. Most likely, you've partitioned your disk, and your &lt;code&gt;root&lt;/code&gt; partition got the thick end of it, 50 GB if you were generous. Let's assume that that's not nearly enough for the database you're about to restore. At the same time, your &lt;code&gt;/home&lt;/code&gt; partition got the rest of the disk space you had available.&lt;/p&gt;

&lt;p&gt;You can always resize these two partitions, but then you would have to back up your &lt;code&gt;/home&lt;/code&gt; directory, unmount it or find a Live USB and tamper with them. If you don't have the time, or the desire, or even a backup disc, you can always change the location where &lt;code&gt;postgresql&lt;/code&gt; stores its data.&lt;/p&gt;

&lt;p&gt;The following instructions are a love letter to all those lost souls who find themselves in this situation and forget to check the status of SELinux, as well as to my future self who'll most likely have to do it again. Considering this is mostly a dump of my bash history, I hope this exact procedure works for you. If not, feel free to contact me and we'll update it together.&lt;/p&gt;

&lt;h1&gt;
  
  
  Procedure
&lt;/h1&gt;

&lt;p&gt;A fresh install is the easiest to change, but let's assume you have some local databases you don't want to lose: you just want to move them and restore the large database next to them. The steps are more or less the same either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;data_directory&lt;/code&gt; and &lt;code&gt;config_file&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Before you start anything, locate &lt;code&gt;postgresql&lt;/code&gt;'s configuration file and its data directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;su - postgres
&lt;span class="o"&gt;[&lt;/span&gt;postgres@host ~]&lt;span class="nv"&gt;$ &lt;/span&gt;psql
Password &lt;span class="k"&gt;for &lt;/span&gt;user postgres:
psql &lt;span class="o"&gt;(&lt;/span&gt;12.6&lt;span class="o"&gt;)&lt;/span&gt;
Type &lt;span class="s2"&gt;"help"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;help.

&lt;span class="nv"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="c"&gt;# SHOW config_file;&lt;/span&gt;
            config_file
&lt;span class="nt"&gt;-----------------------------------&lt;/span&gt;
 /var/lib/pgsql/data/postgresql.conf
&lt;span class="o"&gt;(&lt;/span&gt;1 row&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="c"&gt;# SHOW data_directory;&lt;/span&gt;
  data_directory
&lt;span class="nt"&gt;-------------------&lt;/span&gt;
 /var/lib/pgsql/data
&lt;span class="o"&gt;(&lt;/span&gt;1 row&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stop the &lt;code&gt;systemd&lt;/code&gt; service
&lt;/h2&gt;

&lt;p&gt;Stop the &lt;code&gt;postgresql&lt;/code&gt; &lt;code&gt;systemd&lt;/code&gt; service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl stop postgresql.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  New location
&lt;/h2&gt;

&lt;p&gt;Create a directory where you have enough disk space available (in this case, under &lt;code&gt;/home&lt;/code&gt;), grant the &lt;code&gt;postgres&lt;/code&gt; user ownership of it, restrict its permissions, and copy the original data directory to the new location (the key is to &lt;a href="https://thecodinginterface.com/blog/postgresql-changing-data-directory/"&gt;preserve the same ownership and permissions structure&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /home/pgdata
&lt;span class="nb"&gt;chown &lt;/span&gt;postgres:postgres /home/pgdata
&lt;span class="nb"&gt;chmod &lt;/span&gt;700 /home/pgdata
rsync &lt;span class="nt"&gt;-av&lt;/span&gt; /var/lib/pgsql/data/ /home/pgdata/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
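
&lt;p&gt;The &lt;code&gt;-a&lt;/code&gt; flag is what preserves the ownership and permissions structure. If you want to convince yourself of that without touching your real data directory, here's a small self-contained sketch using throwaway temp directories (all paths are illustrative):&lt;br&gt;
&lt;/p&gt;

```shell
# Sketch: verify that rsync -a preserves permission bits, using temp
# directories instead of the real PGDATA. chmod 700 mimics the mode
# postgres expects on its data directory.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir "$src/data"
echo "12" > "$src/data/PG_VERSION"
chmod 700 "$src/data"
rsync -a "$src/data/" "$dst/data"
stat -c '%a' "$src/data" "$dst/data"   # both should print 700
rm -rf "$src" "$dst"
```

&lt;p&gt;On the real directories, a dry run (&lt;code&gt;rsync -avn /var/lib/pgsql/data/ /home/pgdata/data&lt;/code&gt;) is an equally cheap sanity check before you proceed.&lt;/p&gt;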



&lt;h2&gt;
  
  
  &lt;code&gt;postgresql&lt;/code&gt; configuration
&lt;/h2&gt;

&lt;p&gt;Open the &lt;code&gt;postgresql.conf&lt;/code&gt; file in the new location and update the &lt;code&gt;data_directory&lt;/code&gt; variable, setting it to the new location where your data was moved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vim /home/pgdata/data/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#------------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="c"&gt;# FILE LOCATIONS&lt;/span&gt;
&lt;span class="c"&gt;#------------------------------------------------------------------------------&lt;/span&gt;

&lt;span class="c"&gt;# The default values of these variables are driven from the -D command-line&lt;/span&gt;
&lt;span class="c"&gt;# option or PGDATA environment variable, represented here as ConfigDir.&lt;/span&gt;

data_directory &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'/home/pgdata/data'&lt;/span&gt;    &lt;span class="c"&gt;# use data in another directory&lt;/span&gt;
                                        &lt;span class="c"&gt;# (change requires restart)&lt;/span&gt;
&lt;span class="c"&gt;#hba_file = 'ConfigDir/pg_hba.conf'     # host-based authentication file&lt;/span&gt;
                                        &lt;span class="c"&gt;# (change requires restart)&lt;/span&gt;
&lt;span class="c"&gt;#ident_file = 'ConfigDir/pg_ident.conf' # ident configuration file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
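
&lt;p&gt;If you'd rather not open an editor, the same change can be scripted with GNU &lt;code&gt;sed&lt;/code&gt;. To stay on the safe side, this sketch rehearses the substitution on a scratch file first; point it at &lt;code&gt;/home/pgdata/data/postgresql.conf&lt;/code&gt; once you trust it (the &lt;code&gt;.bak&lt;/code&gt; copy is your safety net):&lt;br&gt;
&lt;/p&gt;

```shell
# Non-interactive equivalent of the edit above (GNU sed). The substitution
# matches the data_directory line whether or not it is commented out, and
# -i.bak keeps a backup copy of the original file.
conf=$(mktemp)
printf "#data_directory = 'ConfigDir'\n" > "$conf"   # stand-in for the real file
sed -i.bak \
    "s|^#\?data_directory = .*|data_directory = '/home/pgdata/data'|" \
    "$conf"
grep "^data_directory" "$conf"   # prints: data_directory = '/home/pgdata/data'
```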



&lt;h2&gt;
  
  
  &lt;code&gt;systemd&lt;/code&gt; configuration
&lt;/h2&gt;

&lt;p&gt;Do the same thing with the &lt;code&gt;postgresql.service&lt;/code&gt;'s &lt;a href="https://www.joe0.com/2020/06/16/postgres-12-how-to-change-data-directory/"&gt;&lt;code&gt;systemd&lt;/code&gt; configuration file&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vim /lib/systemd/system/postgresql.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;span class="nv"&gt;Environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;PGDATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/pgdata/data
&lt;span class="c"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
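
&lt;p&gt;A note of caution: files under &lt;code&gt;/lib/systemd/system&lt;/code&gt; belong to the package and can be overwritten on upgrade. A drop-in override survives upgrades; the sketch below only builds the fragment (same &lt;code&gt;PGDATA&lt;/code&gt; value as above), with the privileged steps left as comments:&lt;br&gt;
&lt;/p&gt;

```shell
# Alternative to editing the packaged unit file: a drop-in override, which
# systemd merges over postgresql.service and package upgrades won't clobber.
# As root, install it with:
#   mkdir -p /etc/systemd/system/postgresql.service.d
#   cp override.conf /etc/systemd/system/postgresql.service.d/
#   systemctl daemon-reload
printf '[Service]\nEnvironment=PGDATA=/home/pgdata/data\n' > override.conf
cat override.conf
```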



&lt;p&gt;Once you're done editing the &lt;code&gt;systemd&lt;/code&gt; configuration, reload it and start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl daemon-reload
systemctl start postgresql.service
systemctl status postgresql.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SELinux
&lt;/h3&gt;

&lt;p&gt;If you're receiving some vague &lt;code&gt;Permission denied&lt;/code&gt; errors, &lt;a href="https://stackoverflow.com/questions/32556589/postgresql-can-not-start-after-change-the-data-directory#comment52970555_32556589"&gt;check whether or not you have SELinux enabled&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/selinux/enforce
1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the result is &lt;code&gt;1&lt;/code&gt;, then SELinux is in &lt;code&gt;enforcing&lt;/code&gt; mode. To temporarily &lt;a href="https://www.golinuxcloud.com/disable-selinux/#Permissive"&gt;set it to &lt;code&gt;permissive&lt;/code&gt; mode&lt;/a&gt; (&lt;code&gt;0&lt;/code&gt;), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;setenforce 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can try starting the &lt;code&gt;postgresql.service&lt;/code&gt; again. If the process has started successfully, stop it, and tell SELinux to &lt;a href="https://serverfault.com/a/809364"&gt;apply the same context to the new location&lt;/a&gt;. Then you can return SELinux to &lt;code&gt;enforcing&lt;/code&gt; mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;semanage fcontext &lt;span class="nt"&gt;--add&lt;/span&gt; &lt;span class="nt"&gt;--equal&lt;/span&gt; /var/lib/pgsql /home/pgdata
restorecon &lt;span class="nt"&gt;-rv&lt;/span&gt; /home/pgdata
setenforce 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you should be able to start the &lt;code&gt;postgresql.service&lt;/code&gt; without any errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start postgresql.service
systemctl status postgresql.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Confirmation
&lt;/h2&gt;

&lt;p&gt;To confirm that the new location and configuration are being used, rerun the first step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;➜ &lt;span class="nb"&gt;sudo &lt;/span&gt;su - postgres
&lt;span class="o"&gt;[&lt;/span&gt;postgres@lenovo ~]&lt;span class="nv"&gt;$ &lt;/span&gt;psql
Password &lt;span class="k"&gt;for &lt;/span&gt;user postgres:
psql &lt;span class="o"&gt;(&lt;/span&gt;12.6&lt;span class="o"&gt;)&lt;/span&gt;
Type &lt;span class="s2"&gt;"help"&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;help.

&lt;span class="nv"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="c"&gt;# SHOW data_directory;&lt;/span&gt;
  data_directory
&lt;span class="nt"&gt;-------------------&lt;/span&gt;
 /home/pgdata/data
&lt;span class="o"&gt;(&lt;/span&gt;1 row&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="c"&gt;# SHOW config_file;&lt;/span&gt;
            config_file
&lt;span class="nt"&gt;-----------------------------------&lt;/span&gt;
 /home/pgdata/data/postgresql.conf
&lt;span class="o"&gt;(&lt;/span&gt;1 row&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>linux</category>
      <category>database</category>
    </item>
    <item>
      <title>Automate it before you document it</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Fri, 13 Nov 2020 10:28:11 +0000</pubDate>
      <link>https://dev.to/fitodic/automate-it-before-you-document-it-4e15</link>
      <guid>https://dev.to/fitodic/automate-it-before-you-document-it-4e15</guid>
      <description>&lt;p&gt;I help clients fix their legacy software for a living, and one of the most common complaints (or regrets) I hear is "If only we had documented this &lt;em&gt;pain-in-the-a*s-service&lt;/em&gt;, we wouldn't have had these problems". Despite the best intentions, this is a common misconception and a dangerous generalization that produces more harm than good in the long run.&lt;/p&gt;

&lt;p&gt;First of all, not all software documentation is created equal. To subject the project, customer and internal documentation to the same standards and scrutiny would be a massive waste of time and energy because they serve different purposes. The format and contents of a document depend mostly on the intended audience. For example, if you're a developer writing libraries, the developers using your libraries are your audience. Those developers create services and products on top of those libraries, and their customers are their audience. These two types of audiences share a common goal, to get up and running as fast as possible. This is often achieved by creating a system that is intuitive for your audience to use, simple or straightforward to integrate, and documented and configurable in a way that avoids cognitive overload.&lt;/p&gt;

&lt;p&gt;Second of all, any form of documentation runs the risk of becoming outdated, especially the internal documentation. Documentation pertaining to onboarding, project setup, library creation, testing, builds and deployment oftentimes comprises various steps that need to be executed in sequence to achieve the same result. Those are ideal candidates for automation. Ideally, you'd have a single command that encapsulates all those steps that were previously documented, and document the invocation example(s) with a short description of what the command does. If a procedure changes in the future, one person can update it, and the rest of the team can continue using it with the same effect. For example, I've seen documents outlining the steps necessary to create a Python library instead of using a &lt;a href="https://github.com/ionelmc/cookiecutter-pylibrary"&gt;template&lt;/a&gt;, as well as documents describing how to set up your environment to run tests and deploy libraries, tasks easily &lt;a href="https://fitodic.github.io/python-package-distribution-can-be-easy"&gt;automated&lt;/a&gt; using &lt;a href="https://tox.readthedocs.io/en/latest/"&gt;&lt;code&gt;tox&lt;/code&gt;&lt;/a&gt;. Not only does automation reduce the amount of documentation you need to write and keep up-to-date, it also decreases the surface area of your system, making it easier to use and maintain.&lt;/p&gt;
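
&lt;p&gt;To make that concrete, here's a minimal sketch of such an encapsulating command: a hypothetical &lt;code&gt;scripts/setup.sh&lt;/code&gt; where every line used to be a bullet point in a wiki page (the steps themselves are illustrative):&lt;br&gt;
&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical scripts/setup.sh: the documented onboarding checklist,
# encoded once so one person can maintain it for the whole team.
set -e                                       # stop at the first broken step
python3 -m venv .venv                        # 1. create an isolated environment
. .venv/bin/activate                         # 2. activate it
python -m pip --version                      # 3. sanity-check the tooling
python -c 'import sys; print(sys.prefix)'    # 4. confirm the venv is active
```

&lt;p&gt;The document that used to list those steps shrinks to a single line: "run &lt;code&gt;./scripts/setup.sh&lt;/code&gt;".&lt;/p&gt;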

&lt;p&gt;On the other hand, an alternative where nothing is automated and everything is documented is a lose-lose situation, both for business owners and employees. Considering that some, if not most, projects are &lt;a href="https://nibblestew.blogspot.com/2020/10/is-your-project-unique-snowflake-or-do.html"&gt;unique snowflakes&lt;/a&gt;, employees can experience micro-frustrations on a daily basis just by using or switching between different projects. Each project can have a slightly different set of steps to achieve the same effect (e.g. run the test suite or deploy), none of which are automated, but some of them are documented. That means they are wasting valuable time on repetitive and monotonous tasks instead of producing something of value to the customers. That's why it's in the business owner's or manager's interest to invest time in automating everything that can be automated, instead of documenting it and hoping for the best.&lt;/p&gt;

&lt;p&gt;In summary, documentation and automation each have their advantages when applied in the right circumstances. Neither is a silver bullet, so don't treat them as such. The more we automate, the more time we'll have to focus on the things that really matter. As a consequence, the end result will be simpler, easier to use, and require far less documentation.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>programming</category>
      <category>beginners</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>It's just business</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Mon, 21 Sep 2020 09:20:55 +0000</pubDate>
      <link>https://dev.to/fitodic/it-s-just-business-2jok</link>
      <guid>https://dev.to/fitodic/it-s-just-business-2jok</guid>
      <description>&lt;p&gt;As I was making breakfast this morning, a social media post about why developers keep autoplaying videos on their sites was brought to my attention. There were some fair counterarguments in the comments section about the clients' wishes, and some less constructive arguments that can be categorized under the "Just say no!" category.&lt;/p&gt;

&lt;p&gt;Prior to that splash of investigative "wisdom", I saw an &lt;a href="https://twitter.com/QuinnyPig/status/1306427431260020736/photo/1"&gt;existential question being posted on Hacker News&lt;/a&gt;. Putting aside how incongruous this last sentence sounds (not to mention the post), I asked myself what makes programmers think they are so special?&lt;/p&gt;

&lt;p&gt;I know some programmers think of themselves as craftsmen or artists, as if their work is somehow a work of art. Newsflash, &lt;em&gt;it's not&lt;/em&gt;. First of all, most artists become famous after they're dead, which makes it particularly hard for them to reap the rewards of their work. Now imagine a world where legacy software becomes invaluable because the author or the entire team have just kicked the bucket. You can almost hear their colleagues saying "there will never be another to-do list like that ever again".&lt;/p&gt;

&lt;p&gt;Unfortunately, these "artists" will never miss an opportunity to quote Spiderman's uncle, and say that we as programmers have great power and great responsibility. You can always say "no" to your clients and employers whenever you have to do something that hurts your delicate sensibilities. In theory, yes. In practice, it's a little bit different, but we're getting there.&lt;/p&gt;

&lt;p&gt;Let's say you've done your homework and prepared an air-tight case for your employer that goes beyond the regular "I don't like this feature". Unless you present an alternative that makes financial sense to the business (finance &lt;em&gt;is&lt;/em&gt; the &lt;a href="https://en.wikipedia.org/wiki/Esperanto"&gt;Esperanto&lt;/a&gt; of the real world), it's either you or the feature.&lt;/p&gt;

&lt;p&gt;I know this sounds a bit cynical, but that's not my intention. It's also not an excuse to turn a blind eye to unethical behavior that should be stopped. It's just how the world works. Unless the thing you rebel against is causing substantial damage to the business or its users, you'll be replaced with someone who &lt;em&gt;can&lt;/em&gt; and &lt;em&gt;will&lt;/em&gt; do it, either because there's rent to pay or because they just don't care.&lt;/p&gt;

&lt;p&gt;If you really want to compare this line of work with another industry in hopes of determining your position in the hierarchy, please stop using arts, construction and manufacturing industries when there are &lt;a href="https://community.aiim.org/blogs/daniel-oleary/2012/07/08/drug-dealers-and-it-are-the-only-people-who-call-their-customers-users"&gt;far better analogies&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Drug Dealers and IT are the only people who call their customers “users”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's not a coincidence. If you've never seen &lt;a href="https://en.wikipedia.org/wiki/The_Wire"&gt;The Wire&lt;/a&gt;, I suggest you take a good hard look and see where you would fit in the supply chain or in society at large. My guess is that programmers would be the ones who cut the raw product and mix it with various other substances to produce different versions of the product (or the beat cops/task forces, if the first option is too denigrating for you), but that's just me. Either way, the programmer's impact on the company's policy is limited by design, so don't expect autoplaying videos to disappear en masse.&lt;/p&gt;

&lt;p&gt;As to existential questions, neither your title nor your salary makes you special. Programmers are paid well because there is a high demand for a certain set of skills, and a low supply. It may change over time, it may not. In the meantime, find a hobby and relax. It's just business.&lt;/p&gt;

</description>
      <category>career</category>
      <category>watercooler</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Controlling access to files uploaded by users</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Wed, 25 Mar 2020 19:46:00 +0000</pubDate>
      <link>https://dev.to/fitodic/controlling-access-to-files-uploaded-by-users-3php</link>
      <guid>https://dev.to/fitodic/controlling-access-to-files-uploaded-by-users-3php</guid>
      <description>&lt;p&gt;Imagine a situation where you have to check whether or not a user that sent the request can access or download files that were uploaded by another user. Perhaps user A uploaded a file that needs to be shared only with user B or only with authenticated users. If your application is deployed behind a &lt;a href="https://en.wikipedia.org/wiki/Reverse_proxy"&gt;reverse-proxy&lt;/a&gt; such as &lt;a href="https://www.nginx.com/resources/wiki/"&gt;&lt;code&gt;nginx&lt;/code&gt;&lt;/a&gt;, you can use the best of both worlds: your application for checking the user’s permissions and the Web server for serving the files the application tells it to serve.&lt;/p&gt;

&lt;p&gt;Before we begin, there are a couple of things I would like to address. First of all, having your Web application serve media files by loading them into memory and sending them in a response is &lt;a href="https://docs.djangoproject.com/en/dev/howto/static-files/#serving-static-files-during-development"&gt;grossly inefficient&lt;/a&gt;. You may not know the size of the file, or there could be many requests happening at once. Whatever the case may be, there is a better way.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;X-Accel&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To quote the &lt;a href="https://www.nginx.com/resources/wiki/start/topics/examples/x-accel/"&gt;official documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;X-accel allows for internal redirection to a location determined by a header returned from a backend.&lt;/p&gt;

&lt;p&gt;This allows you to handle authentication, logging or whatever else you please in your backend and then have NGINX handle serving the contents from redirected location to the end user, thus freeing up the backend to handle other requests. This feature is commonly known as &lt;a href="https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/"&gt;&lt;code&gt;X-Sendfile&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To achieve this, at least two things have to be implemented:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application’s response must contain the &lt;code&gt;X-Accel-Redirect&lt;/code&gt; header;&lt;/li&gt;
&lt;li&gt;The location should be marked as &lt;code&gt;internal;&lt;/code&gt; to prevent direct access to the URI.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;X-Accel-Redirect&lt;/code&gt; header
&lt;/h3&gt;

&lt;p&gt;This header tells &lt;code&gt;nginx&lt;/code&gt; which URI to serve. Although the following example uses &lt;a href="https://www.django-rest-framework.org/"&gt;&lt;code&gt;django-rest-framework&lt;/code&gt;&lt;/a&gt;, the same thing can be achieved with any other Web framework.&lt;/p&gt;

&lt;p&gt;If we assume all files uploaded by users are located in the &lt;code&gt;/home/user/repo/media/&lt;/code&gt; directory (also defined in Django’s &lt;a href="https://docs.djangoproject.com/en/dev/ref/settings/#media-root"&gt;&lt;code&gt;MEDIA_ROOT&lt;/code&gt;&lt;/a&gt; setting), or more precisely, the &lt;code&gt;/home/user/repo/media/files/{user.id}/&lt;/code&gt; directory by the &lt;code&gt;FileField&lt;/code&gt;’s &lt;a href="https://docs.djangoproject.com/en/dev/ref/models/fields/#django.db.models.FileField.upload_to"&gt;&lt;code&gt;upload_to&lt;/code&gt;&lt;/a&gt; function, the view looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pathlib import Path

from django.conf import settings
from django.http import HttpResponseRedirect

from rest_framework.decorators import action
from rest_framework.response import Response
from rest_framework.viewsets import ModelViewSet

from .models import File
from .permissions import CanAccessFile

class FileViewSet(ModelViewSet):
    permission_classes = [CanAccessFile]
    queryset = File.objects.all()

    @action(detail=True, methods=["get"])
    def download(self, request, pk=None):
        obj = self.get_object()
        if settings.DEBUG:
            return HttpResponseRedirect(obj.upload.url)

        file_name = Path(obj.upload.path).name
        headers = {
            "Content-Disposition": f"attachment; filename={file_name}",
            "X-Accel-Redirect": (
                f"/uploads/files/{obj.user_id}/{file_name}"
            ),
        }
        return Response(data=b"", headers=headers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;settings.DEBUG&lt;/code&gt; block is here so developers can keep using &lt;a href="https://docs.djangoproject.com/en/dev/howto/static-files/#serving-files-uploaded-by-a-user-during-development"&gt;Django’s &lt;code&gt;static&lt;/code&gt; mechanism for serving media files during development&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are also &lt;a href="https://www.nginx.com/resources/wiki/start/topics/examples/x-accel/#special-headers"&gt;other &lt;code&gt;X-Accel-*&lt;/code&gt; headers&lt;/a&gt; that can be set by the application to further refine the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;internal&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The application’s response that contains the &lt;code&gt;X-Accel-Redirect&lt;/code&gt; header is picked up by the Web server on its way back to the client. In order for &lt;code&gt;nginx&lt;/code&gt; to locate the file that should be sent to the client, the configuration should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server {
    server_name example.com;

    location /favicon.ico { access_log off; log_not_found off; }
    location /static/ {
        root /home/user/repo;
    }

    location /uploads/ {
        internal;
        alias /home/user/repo/media/;
    }

    location / {
        include /etc/nginx/proxy_params;
        proxy_pass http://unix:/run/gunicorn.sock;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that all set, you’re ready to start serving files to select users!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
      <category>nginx</category>
      <category>django</category>
    </item>
    <item>
      <title>Python package distribution can be easy</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Sun, 26 Jan 2020 10:26:00 +0000</pubDate>
      <link>https://dev.to/fitodic/python-package-distribution-can-be-easy-4f3f</link>
      <guid>https://dev.to/fitodic/python-package-distribution-can-be-easy-4f3f</guid>
      <description>&lt;p&gt;There have been so &lt;a href="https://www.youtube.com/watch?v=AQsZsgJ30AE&amp;amp;feature=youtu.be&amp;amp;t=50"&gt;many blog posts and presentations about Python packaging&lt;/a&gt; that I’m reluctant to write what could be interpreted as &lt;em&gt;yet another one&lt;/em&gt;, but I’ll give it a go. Why? Because up until recently, I was working on several interconnected libraries, each of which had a more ridiculous build process than the next.&lt;/p&gt;

&lt;p&gt;Needless to say, each had its own deployment procedure. Some were documented, others were left to the author’s interpretation. Dozens of commands to execute. Some had to be built locally because the CI was apparently “too low on resources”. All of this for pure Python packages. No C extensions. It’s like saying “our machine doesn’t have the resources to create a ZIP file”.&lt;/p&gt;

&lt;p&gt;Not to mention libraries that had themselves listed as build dependencies, or &lt;code&gt;setup.py&lt;/code&gt; reading the virtual environment’s environment variables to determine which version of a particular dependency should be installed. Or git branching workflows so complicated that they made resolving merge conflicts routine, with the changelog being the worst of it.&lt;/p&gt;

&lt;p&gt;If you’re a pragmatic developer who has better things to do with their time than watch broken processes in action, you would have searched for a better way. And you would have found one, just a quick search and a few mouse clicks away, because these issues have already been addressed and solved by the community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;When I started looking for a solution, I wanted to achieve three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have a single command for performing all the package validity checks: build, test suite, linting, documentation, etc.&lt;/li&gt;
&lt;li&gt;Have a single command for deployment where the developer specifies the next version and it magically appears on PyPI;&lt;/li&gt;
&lt;li&gt;Make the aforementioned processes easy to debug, modify and replace if a better option arises.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This led me to &lt;a href="https://github.com/tox-dev/tox"&gt;&lt;code&gt;tox&lt;/code&gt;&lt;/a&gt;, an automation tool primarily used for running test suites against multiple versions of dependencies. However, it’s also useful for automating all sorts of different tasks, even the ones requiring Python 3 when you’re still using Python 2. To give you a taste of where this leads, the following things were achieved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The test suite was executed &lt;em&gt;against the installed distribution&lt;/em&gt; and multiple versions of dependencies, thus ensuring the distribution is valid and you support multiple versions of your dependencies. This is achieved by executing &lt;code&gt;tox&lt;/code&gt; in the command line to run the whole test suite, or &lt;code&gt;tox -e &amp;lt;env&amp;gt;&lt;/code&gt; if you want to test a specific use case.&lt;/li&gt;
&lt;li&gt;When one wants to create a new release, one has to execute &lt;code&gt;tox -e release&lt;/code&gt; to create a feature release, or &lt;code&gt;tox -e release -- patch&lt;/code&gt; to create a patch release when &lt;a href="https://semver.org/"&gt;semantic versioning&lt;/a&gt; is used. The release process comprises several other &lt;code&gt;tox&lt;/code&gt; processes, such as &lt;code&gt;changelog&lt;/code&gt; or &lt;code&gt;manifest&lt;/code&gt;, each of which can be executed individually.&lt;/li&gt;
&lt;li&gt;The deployment was executed in the CI environment, but it could also be executed locally. This ensured the distributions are created in a controlled and stable environment, not on someone’s machine with some random settings.&lt;/li&gt;
&lt;/ol&gt;
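
&lt;p&gt;For illustration only (not lifted from any particular project), a stripped-down &lt;code&gt;tox.ini&lt;/code&gt; along those lines might look like this; the environment names, dependencies and commands are placeholders to adapt:&lt;br&gt;
&lt;/p&gt;

```ini
# Illustrative minimal tox.ini; names and pins are placeholders.
[tox]
envlist = py27, py36, py37

[testenv]
# Runs the suite against the *installed* distribution in each environment.
deps = pytest
commands = pytest {posargs}

[testenv:release]
# "tox -e release" for a feature release, "tox -e release -- patch" for a patch.
basepython = python3
deps =
    bump2version
    wheel
    twine
commands =
    bumpversion {posargs:minor}
    python setup.py sdist bdist_wheel
    twine upload dist/*
```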

&lt;p&gt;This process was propagated into other packages and into the internal &lt;a href="https://github.com/cookiecutter/cookiecutter"&gt;&lt;code&gt;cookiecutter&lt;/code&gt;&lt;/a&gt; project to make it available to new packages as well. This means that developers had a unified, standardized workflow which enabled everyone to initiate a new release almost instantly, only to have it appear on PyPI in a matter of minutes.&lt;/p&gt;

&lt;p&gt;This post will summarize my experiences and point you to other useful posts covering this subject matter so you don’t waste time on the same issues. I’ll spare you the intricate details of producing the package as there is already enough material on that subject online.&lt;/p&gt;

&lt;p&gt;However, if you need a reference where everything is already configured, take a look at the &lt;a href="https://github.com/fitodic/centerline"&gt;&lt;code&gt;centerline&lt;/code&gt;&lt;/a&gt; package. I created this package in college as an exercise, but nowadays I use it primarily for testing packaging and distribution. The main files you should be focusing on are &lt;a href="https://github.com/fitodic/centerline/blob/master/setup.cfg"&gt;&lt;code&gt;setup.cfg&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/fitodic/centerline/blob/master/pyproject.toml"&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/fitodic/centerline/blob/master/.bumpversion.cfg"&gt;&lt;code&gt;.bumpversion.cfg&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/fitodic/centerline/blob/master/.travis.yml"&gt;&lt;code&gt;.travis.yml&lt;/code&gt;&lt;/a&gt;, and last but not least, &lt;a href="https://github.com/fitodic/centerline/blob/master/tox.ini"&gt;&lt;code&gt;tox.ini&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit of history
&lt;/h2&gt;

&lt;p&gt;Python has a rather complicated build and distribution mechanism when it comes to third-party libraries. It’s not rocket science, just a large amount of legacy that you have to navigate. The initial idea was to have a &lt;em&gt;batteries included&lt;/em&gt; approach where Python’s core had all the things you would ever need.&lt;/p&gt;

&lt;p&gt;However, that process inhibited reuse of code that was considered useful, but not useful enough to be included into Python’s core. And so &lt;a href="https://pypi.org"&gt;PyPI&lt;/a&gt; was born, a place where people could upload their Python code and share it with the world. Now you have the best of both worlds, right?&lt;/p&gt;

&lt;p&gt;Well, yes and no. In hindsight, it’s easy to criticize the decisions that were made at the time that led us to a point where we have &lt;a href="https://discuss.python.org/t/developing-a-single-tool-for-building-developing-projects/2584"&gt;multiple ways to build and distribute libraries&lt;/a&gt; and &lt;a href="https://pyfound.blogspot.com/2019/05/amber-brown-batteries-included-but.html"&gt;call Python’s standard library a dead end&lt;/a&gt;. The point is, although things are not ideal, there is no point complaining about it if you’re not going to do anything about it.&lt;/p&gt;

&lt;p&gt;To correct the course of the entire Python ecosystem, a large amount of time, energy and money is needed. That’s something the open-source community is traditionally short on, but &lt;a href="https://discuss.python.org/t/developing-a-single-tool-for-building-developing-projects/2584"&gt;things are improving&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On the other hand, if you are experiencing issues with it and want to start somewhere, you can start in your own backyard (so to speak). Review the tooling choices that are currently available to you, standardize your libraries’ maintenance process, and hide the internals behind simple, high-level APIs or access points that get the job done.&lt;/p&gt;

&lt;p&gt;During this process you’ll familiarize yourself with the ins and outs of the build system and the &lt;a href="https://packaging.python.org/glossary/"&gt;terminology&lt;/a&gt;, and you’ll be able to make informed decisions based on your needs, sparing the colleagues who are less interested in package distribution.&lt;/p&gt;

&lt;p&gt;With that in mind, I would like to present to you one of the ways that helped us resolve our differences and enabled us to get past the semi-automated phase. It’s important to note that this is not &lt;em&gt;the&lt;/em&gt; way. If it doesn’t suit you or your team’s needs, fine. This also won’t cast blame on the packaging ecosystem, but show you an approach that improved the collaboration on various libraries by adopting the following changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switching to the &lt;code&gt;src/&lt;/code&gt; layout&lt;/li&gt;
&lt;li&gt;Declarative configuration&lt;/li&gt;
&lt;li&gt;Trunk based development&lt;/li&gt;
&lt;li&gt;Changelog and version management&lt;/li&gt;
&lt;li&gt;Release automation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;src&lt;/code&gt; layout
&lt;/h2&gt;

&lt;p&gt;Python libraries are sometimes called packages (i.e. directories with an &lt;code&gt;__init__.py&lt;/code&gt; file) because they mostly consist of packages that need to be distributed. When organizing such a codebase, there are two common layouts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://packaging.python.org/tutorials/packaging-projects/#creating-the-package-files"&gt;Namespace layout&lt;/a&gt; (I’m not sure it’s the correct term, but let’s go with it for now):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── mypackage
│   ├── __init__.py
│   └── mod1.py
├── tests
├── setup.py
└── setup.cfg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#using-a-src-layout"&gt;&lt;code&gt;src/&lt;/code&gt; layouts&lt;/a&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── src
│   └── mypackage
│       ├── __init__.py
│       └── mod1.py
├── tests
├── setup.py
└── setup.cfg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may be wondering: what’s the difference? The difference is in how Python’s &lt;code&gt;import&lt;/code&gt; system treats these files during development and testing, which in turn affects the build and distribution processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Development mode
&lt;/h3&gt;

&lt;p&gt;Before we get into details, please bear in mind that some people use symlinks to connect packages to projects where they are being used, whereas others use &lt;a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode"&gt;development mode&lt;/a&gt; enabled by &lt;a href="https://pip.pypa.io/en/stable/reference/pip_install/#install-editable"&gt;&lt;code&gt;pip&lt;/code&gt;’s &lt;code&gt;editable installs&lt;/code&gt;&lt;/a&gt;. I personally prefer the latter which goes something along the lines of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mkproject centerline # provided by `virtualenvwrapper`
$ git clone https://github.com/fitodic/centerline .
$ pip install -e .[dev,test,docs]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why do I prefer this? Because from my experience, it’s the most stable and reliable development approach when developing and distributing packages. It also drastically simplifies the package’s setup, both in testing and production environments.&lt;/p&gt;

&lt;p&gt;On the other hand, I found &lt;code&gt;symlink&lt;/code&gt;s impractical because not all packages are meant to be used as dependencies in other projects, for example Django packages. And I prefer to have them tested immediately as a standalone codebase. Which brings us to the crux of the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing packages
&lt;/h3&gt;

&lt;p&gt;When using namespace layouts, i.e. when the package is in the project’s root directory where &lt;code&gt;setup.py&lt;/code&gt; is located, Python will implicitly include the current directory in &lt;code&gt;sys.path&lt;/code&gt;. This means that when you run your test suite, the tests will be run against the code you have in your current working directory, &lt;em&gt;not the installed distribution&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t think that’s a big deal, let me give you an example. Let’s say you have a library whose test suite is executed in the CI (Continuous Integration) environment. Someone introduces a change to the library’s configuration (e.g. &lt;code&gt;setup.py&lt;/code&gt;) or forgets to include some static files in the &lt;code&gt;MANIFEST.in&lt;/code&gt; file (they are not picked up by default). All the tests pass and you create a new release.&lt;/p&gt;

&lt;p&gt;The release is successfully deployed to PyPI and installed by your users, who start getting &lt;code&gt;ImportError&lt;/code&gt;s. The library is clearly installed, but it’s empty. It has its metadata so everything looks OK from &lt;code&gt;pip&lt;/code&gt;’s point of view, &lt;strong&gt;but the code is missing&lt;/strong&gt; (or parts of it).&lt;/p&gt;

&lt;p&gt;How can this scenario be avoided? By placing your code in the &lt;code&gt;src/&lt;/code&gt; directory and configuring &lt;code&gt;setuptools&lt;/code&gt; (or whichever build system you are using) to search for packages in the &lt;code&gt;src/&lt;/code&gt; directory. &lt;strong&gt;This forces you to install the distribution that would be shipped to your users and to run the test suite against it.&lt;/strong&gt; That way, if there are any configuration errors, you’ll catch them immediately.&lt;/p&gt;
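
&lt;p&gt;With &lt;code&gt;setuptools&lt;/code&gt;, pointing the build at the &lt;code&gt;src/&lt;/code&gt; directory comes down to a few lines in &lt;code&gt;setup.cfg&lt;/code&gt;:&lt;/p&gt;

```ini
[options]
package_dir =
    = src
packages = find:

[options.packages.find]
where = src
```

&lt;p&gt;With this in place, an unpackaged module simply cannot be imported from the working directory, so a misconfiguration fails loudly in CI instead of on your users’ machines.&lt;/p&gt;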

&lt;p&gt;You may have also noticed that in the &lt;code&gt;src/&lt;/code&gt; layout pictured above, the &lt;code&gt;tests&lt;/code&gt; directory is located at the same level as the &lt;code&gt;src/&lt;/code&gt; directory. The &lt;code&gt;pytest&lt;/code&gt; documentation has a chapter on &lt;a href="https://docs.pytest.org/en/latest/goodpractices.html"&gt;good integration practices&lt;/a&gt; which I recommend you read. The approach described above is in line with these practices for exactly these reasons: to improve the reliability of the entire library development process.&lt;/p&gt;

&lt;p&gt;If you are interested in more details, I highly recommend the following blog posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.ionelmc.ro/2017/09/25/rehashing-the-src-layout/"&gt;Rehashing the &lt;code&gt;src&lt;/code&gt; layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.ionelmc.ro/2014/06/25/python-packaging-pitfalls/"&gt;Python packaging pitfalls&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.ionelmc.ro/2014/05/25/python-packaging/"&gt;Packaging a Python library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hynek.me/articles/testing-packaging/#src"&gt;Testing &amp;amp; packaging&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Declarative configuration
&lt;/h2&gt;

&lt;p&gt;To make the package usable, you have to configure it before you deploy it. This configuration comprises package metadata, the list of dependencies that need to be installed, what to include in the distribution, the build system, etc. Unfortunately, at this moment, there are a number of files you have to configure to achieve this.&lt;/p&gt;

&lt;p&gt;Let’s start with the build phase. &lt;a href="https://www.python.org/dev/peps/pep-0518/"&gt;PEP-518&lt;/a&gt; introduced the &lt;a href="https://snarky.ca/clarifying-pep-518/"&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/a&gt; which enables users to use either &lt;a href="https://github.com/pypa/setuptools"&gt;&lt;code&gt;setuptools&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/takluyver/flit"&gt;&lt;code&gt;flit&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/python-poetry/poetry"&gt;&lt;code&gt;poetry&lt;/code&gt;&lt;/a&gt; to build the distribution from source.&lt;/p&gt;

&lt;p&gt;Why is this important? Long story short, when &lt;code&gt;pip&lt;/code&gt; installs a package from an &lt;a href="https://packaging.python.org/glossary/"&gt;&lt;code&gt;sdist&lt;/code&gt; (source distribution)&lt;/a&gt;, it executes the &lt;code&gt;setup.py&lt;/code&gt; file in order to build a &lt;code&gt;wheel&lt;/code&gt; (binary distribution), and to do that it assumes &lt;code&gt;setuptools&lt;/code&gt; and &lt;code&gt;wheel&lt;/code&gt; are already available. What if the package is deployed in a constrained environment without &lt;code&gt;setuptools&lt;/code&gt; installed beforehand? How do you specify which package to download in order to install your own package, especially when you need &lt;code&gt;setuptools&lt;/code&gt; to read the configuration in the first place?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pyproject.toml&lt;/code&gt; solves this issue by allowing you to specify the build dependencies, and you only need &lt;code&gt;pip&lt;/code&gt;. You could have specified the build dependencies in &lt;code&gt;setup.py&lt;/code&gt; before that, but then again, you had to have &lt;code&gt;setuptools&lt;/code&gt; installed to read it.&lt;/p&gt;
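
&lt;p&gt;For a &lt;code&gt;setuptools&lt;/code&gt;-based package, a minimal &lt;code&gt;pyproject.toml&lt;/code&gt; declaring the build dependencies boils down to a single table:&lt;/p&gt;

```toml
[build-system]
requires = ["setuptools>=40.8.0", "wheel"]
build-backend = "setuptools.build_meta"
```

&lt;p&gt;&lt;code&gt;pip&lt;/code&gt; reads this file first, installs the listed build dependencies into an isolated environment, and only then invokes the build backend.&lt;/p&gt;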

&lt;p&gt;There is also another thing that is problematic about &lt;code&gt;setup.py&lt;/code&gt;. You could find all manner of customized “stuff” in these files, some even requiring dependencies that were yet to be installed. For instance, a &lt;code&gt;setup.py&lt;/code&gt; would import the library’s version from &lt;code&gt;myproject/__init__.py&lt;/code&gt;, which also contained imports of dependencies. But how did we come to that? I suppose package developers treated packages as regular projects and kept using &lt;code&gt;requirements.txt&lt;/code&gt; files, custom build scripts and various other procedures to “make it work”.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#configuring-setup-using-setup-cfg-files"&gt;declarative configuration&lt;/a&gt; is a way to limit the scope of abuse. By declaring everything in &lt;code&gt;setup.cfg&lt;/code&gt;, or one day &lt;code&gt;pyproject.toml&lt;/code&gt; if it supports it, there aren’t many ways to “hack” your way around it. I hope.&lt;/p&gt;

&lt;p&gt;Even though I prefer &lt;code&gt;setup.cfg&lt;/code&gt;, that doesn’t mean you cannot put all of this in &lt;code&gt;setup.py&lt;/code&gt; to achieve the same result. It’s a matter of preference, although &lt;a href="https://coverage.readthedocs.io/en/coverage-5.0.3/changes.html#version-5-0b1-2019-11-11"&gt;more&lt;/a&gt; and &lt;a href="https://github.com/psf/black#pyprojecttoml"&gt;more&lt;/a&gt; tools have started adding support for reading their configuration from &lt;code&gt;pyproject.toml&lt;/code&gt;. In my opinion, one file would ideally hold the entire project’s configuration, but then again, it’s a matter of preference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependencies
&lt;/h3&gt;

&lt;p&gt;Libraries declare, or &lt;em&gt;should&lt;/em&gt; declare, their dependencies as version ranges to support multiple versions simultaneously, unlike projects such as Web applications, where you would want to pin the exact version that’s being deployed. Furthermore, you would also want to list your extra dependencies that users may or may not install, depending on their needs. You can use the same extra dependencies for setting up your development and testing environments as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[options]
install_requires =
    Fiona&amp;gt;=1.7.0
    Shapely&amp;gt;=1.5.13
    numpy&amp;gt;=1.10.4
    scipy&amp;gt;=0.16.1
    Click&amp;gt;=7.0

[options.extras_require]
dev =
    tox
gdal =
    GDAL&amp;gt;=2.3.3
lint =
    flake8
    isort
    black
test =
    pytest&amp;gt;=4.0.0
    pytest-cov
    pytest-sugar
    pytest-runner
docs =
    sphinx

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Trunk based development
&lt;/h2&gt;

&lt;p&gt;Once the library’s configuration is in place, other developers will most likely join in, and they couldn’t care less about the package’s structure and deployment as long as it works. But collaboration on a package isn’t just run the test suite -&amp;gt; code review -&amp;gt; merge changes. Packages often provide functionality that builds on their dependencies.&lt;/p&gt;

&lt;p&gt;As with all dependencies, there are deprecation periods and sometimes it’s just not possible to continue supporting all versions. The Python 2 to 3 migration is one such example, where packages simultaneously supported both versions of Python, and then dropped Python 2. These types of changes impact their users the most, so they should be executed with care.&lt;/p&gt;

&lt;p&gt;The difficult part is supporting multiple incompatible versions at the same time. A &lt;code&gt;compat.py&lt;/code&gt; module, or the module where all the compatibility edge-cases are hidden, can only get you so far. At some point in time, you’ll want to create a release that will only continue receiving security patches, while most of the development and innovation carries on.&lt;/p&gt;

&lt;p&gt;This is where release management from the version control system’s perspective kicks in. There are several available options, such as &lt;a href="https://www.gitflow.com/"&gt;Git Flow&lt;/a&gt; or &lt;a href="https://trunkbaseddevelopment.com/"&gt;Trunk based development&lt;/a&gt; (TBD). I’ve used both of them when working on libraries and can safely say that TBD produces more satisfying results.&lt;/p&gt;

&lt;p&gt;In a nutshell, TBD requires developers to create short-lived branches from the &lt;code&gt;master&lt;/code&gt; branch that will be merged back into the &lt;code&gt;master&lt;/code&gt; branch. Every once in a while, a release branch is created from the &lt;code&gt;master&lt;/code&gt; branch that carries a version designation, such as &lt;code&gt;release_1_11&lt;/code&gt;. That branch is the source for creating 1.11.X releases until the version or branch is deprecated. Meanwhile, all the security patches that are merged into the &lt;code&gt;master&lt;/code&gt; branch are &lt;a href="https://git-scm.com/docs/git-cherry-pick"&gt;&lt;code&gt;cherry-pick&lt;/code&gt;ed&lt;/a&gt; into the &lt;code&gt;release&lt;/code&gt; branch. If you’re lucky and the developers working on the project make &lt;a href="https://chris.beams.io/posts/git-commit/"&gt;sensible commits and commit messages&lt;/a&gt;, it’s even easier to manage releases, especially around bugfixes and security patches.&lt;/p&gt;

&lt;p&gt;This approach enables the continuation of feature development according to the project’s roadmap, without worrying too much about backwards compatibility. After all, that’s what the &lt;code&gt;release&lt;/code&gt; branches are for.&lt;/p&gt;
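
&lt;p&gt;The flow described above can be tried out on a throwaway repository. The following sketch (branch and file names are illustrative) cuts a release branch and &lt;code&gt;cherry-pick&lt;/code&gt;s a security fix from the trunk into it:&lt;/p&gt;

```shell
# Sketch of the trunk-based release flow on a throwaway repository;
# branch and file names are illustrative.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the trunk is named `master`
git config user.email "dev@example.com"
git config user.name "Dev"

echo "feature" > mod.py
git add mod.py
git commit -qm "Add feature"

git branch release_1_11                   # cut the 1.11 release branch

echo "security fix" >> mod.py             # a patch lands on the trunk...
git add mod.py
git commit -qm "Fix security issue"
fix_sha=$(git rev-parse HEAD)

git checkout -q release_1_11              # ...and is cherry-picked into the release branch
git cherry-pick "$fix_sha" > /dev/null
grep "security fix" mod.py
```

&lt;p&gt;The release branch now carries the fix without any of the trunk’s feature work, which is exactly what you want for a 1.11.X patch release.&lt;/p&gt;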

&lt;h2&gt;
  
  
  Changelog and version management
&lt;/h2&gt;

&lt;p&gt;One of the reasons why TBD was adopted in the first place was changelog and version management. My team inherited a customized GitFlow workflow that became even more frustrating when someone introduced an ill-advised method for handling multiple releases simultaneously. There were practically two codebases whose functionality was more or less the same, apart from the code that handled compatibility issues between two versions of a certain dependency.&lt;/p&gt;

&lt;p&gt;This brought even more frustration when a new release had to be made. Various merge requests were issued, slow pipelines executed, merge conflicts resolved (mostly around the changelog), and so forth. There had to be a better way.&lt;/p&gt;

&lt;p&gt;After some research, I stumbled upon TBD and everything just clicked, as described in the previous chapter. Furthermore, I dropped the custom-built changelog and versioning script in favor of &lt;a href="https://github.com/hawkowl/towncrier"&gt;&lt;code&gt;towncrier&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://github.com/peritus/bumpversion"&gt;&lt;code&gt;bumpversion&lt;/code&gt;&lt;/a&gt;. I don’t think there is a need to go into too many details as their documentation is pretty straightforward. &lt;a href="https://github.com/tox-dev/tox/issues/1081"&gt;After some minor issues&lt;/a&gt;, these two libraries were successfully integrated into the &lt;code&gt;tox&lt;/code&gt; workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release automation
&lt;/h2&gt;

&lt;p&gt;The last step to set up was the release process. This is also a &lt;code&gt;tox&lt;/code&gt; environment, named &lt;code&gt;release&lt;/code&gt;, that builds the changelog using &lt;code&gt;towncrier&lt;/code&gt;, bumps the package version using &lt;code&gt;bumpversion&lt;/code&gt;, tags it and pushes the changes to &lt;code&gt;origin&lt;/code&gt;.&lt;/p&gt;
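
&lt;p&gt;As a rough sketch, such an environment could look something like this in &lt;code&gt;tox.ini&lt;/code&gt; (the exact options and the default version part are illustrative, not the project’s actual configuration):&lt;/p&gt;

```ini
[testenv:release]
basepython = python3
skip_install = true
deps =
    towncrier
    bumpversion
whitelist_externals = git
passenv = HOME
commands =
    towncrier --yes
    bumpversion {posargs:minor}
    git push origin master --tags
```

&lt;p&gt;&lt;code&gt;tox -e release&lt;/code&gt; then bumps the minor version by default, while &lt;code&gt;tox -e release -- patch&lt;/code&gt; forwards &lt;code&gt;patch&lt;/code&gt; to &lt;code&gt;bumpversion&lt;/code&gt; for a patch release.&lt;/p&gt;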

&lt;p&gt;From there on, the CI would run the test suite one more time, again using &lt;code&gt;tox&lt;/code&gt; so developers can easily reproduce issues if they arise, build the source (&lt;code&gt;sdist&lt;/code&gt;) and binary (&lt;code&gt;wheel&lt;/code&gt;) distributions, and upload them to PyPI using &lt;a href="https://github.com/pypa/twine"&gt;&lt;code&gt;twine&lt;/code&gt;&lt;/a&gt;. If you are using Travis CI, you can use &lt;a href="https://docs.travis-ci.com/user/deployment/pypi/"&gt;its own mechanism for uploading distributions&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, this is one approach that may or may not suit you and your team. Each library used here has an alternative and it also comes down to a matter of preference.&lt;/p&gt;

&lt;p&gt;If you are starting out or need to create a new library, I would advise you to use one of the existing Python cookiecutters, such as &lt;a href="https://github.com/ionelmc/cookiecutter-pylibrary"&gt;&lt;code&gt;cookiecutter-pylibrary&lt;/code&gt;&lt;/a&gt;. You can also build your own, as I did. It beats copy/pasting code from an existing project or documentation.&lt;/p&gt;

</description>
      <category>python</category>
      <category>productivity</category>
      <category>architecture</category>
      <category>django</category>
    </item>
    <item>
      <title>Optimizing websites and APIs with cache and ETags</title>
      <dc:creator>Filip Todic</dc:creator>
      <pubDate>Tue, 12 Nov 2019 16:20:58 +0000</pubDate>
      <link>https://dev.to/fitodic/optimizing-websites-and-apis-with-cache-and-etags-22l3</link>
      <guid>https://dev.to/fitodic/optimizing-websites-and-apis-with-cache-and-etags-22l3</guid>
      <description>&lt;p&gt;When creating a public website or a web app with a public API, you generally want things to run as smooth as possible for the end-user. You also might want to minimize the server’s response time and resource consumption and you want to do it as fast as possible.&lt;/p&gt;

&lt;p&gt;Why would you do that? For any number of reasons, but most often as your site’s traffic increases, it becomes a necessity to speed it up as much as possible.&lt;/p&gt;

&lt;p&gt;How would you do that? Performance benchmarks, when done properly, oftentimes tell you where you should look. At first, most people focus on tuning database queries or revisit some of the tools or libraries they are using. Although these steps are important and have their merits, that’s not why we are here. We are here because you want to get the most bang for your buck, especially if you have a website or API that’s read a lot. That’s where cache and &lt;code&gt;ETag&lt;/code&gt; come in.&lt;/p&gt;

&lt;p&gt;Before we dive right in, I should tell you that when it comes to caching and &lt;code&gt;ETag&lt;/code&gt;s, there is no silver bullet. If someone tells you otherwise, for instance that there is a service or an approach that magically makes all your problems go away, you can safely assume their set of experiences and situations they have dealt with is most probably rather limited. There’s nothing wrong with that, just be aware that their solutions might not be applicable to your use case.&lt;/p&gt;

&lt;p&gt;Why am I telling you this? Because I’ve had this talk with several colleagues and noticed that I was repeating the same story and sending the same references all over again. This is my first attempt at scaling this process. I don’t have all the answers and there are certainly things that I am not aware of at the moment, but I’m going to outline the things I wish someone had told me when I was first starting out in this direction, i.e. optimizing web sites and APIs for high intensity traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache
&lt;/h2&gt;

&lt;p&gt;Have you noticed how good computers are at repetitive tasks? Well, it turns out even computers have their limits. This limit depends mostly on the amount of “juice” your machine has at its disposal, but ask yourself this: &lt;strong&gt;why would you want to execute the same query/computation/operation, over and over again, for each user, only to get the same result?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a user makes a request to a specific resource, for instance their profile, what the server or application generally does is compute the result or retrieve the data from the database, serialize it into the requested format and send it back to the user. Without cache, this process is repeated for each request, over and over again.&lt;/p&gt;

&lt;p&gt;There has to be a better way, right? Regardless of the amount of users that are triggering this event, why would you want to waste precious resources on something like that? Wouldn’t these resources be better spent on something else? Or not spent at all? Is there a better way?&lt;/p&gt;

&lt;p&gt;As it turns out, there is. Consider the case we’ve described above. If we use a caching mechanism, the server will save the result (retrieved or serialized data) into memory before sending it to the user who has made the request. That way, if the same resource is requested again, instead of retrieving or computing the result again, the end-result will be ready. It will be retrieved from memory and sent back to the user almost instantly.&lt;/p&gt;
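
&lt;p&gt;The pattern just described is often called &lt;em&gt;cache-aside&lt;/em&gt;. Here’s a minimal sketch of it in Python, with a plain dict standing in for a real cache service and a hypothetical &lt;code&gt;fetch_from_db&lt;/code&gt; callable standing in for the expensive work:&lt;/p&gt;

```python
import time

# A plain dict stands in for a real cache service such as memcached or
# redis; values are stored as (expires_at, data) pairs.
_cache = {}
TTL = 300  # lifetime of a cache entry, in seconds

def get_profile(user_id, fetch_from_db):
    """Return a user's profile, serving it from the cache when possible."""
    key = "profile:%d" % user_id
    entry = _cache.get(key)
    if entry is not None:
        expires_at, data = entry
        if expires_at > time.time():
            return data            # cache hit: no expensive work at all
        del _cache[key]            # stale entry: evict and fall through
    data = fetch_from_db(user_id)  # cache miss: do the expensive work once
    _cache[key] = (time.time() + TTL, data)
    return data

# The second call is served from the cache and never touches the "database".
calls = []
def fake_db(user_id):
    calls.append(user_id)
    return {"id": user_id, "name": "Alice"}

get_profile(1, fake_db)
get_profile(1, fake_db)
print(len(calls))  # 1
```

&lt;p&gt;A real deployment would swap the dict for a shared cache backend, but the shape of the logic stays the same.&lt;/p&gt;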

&lt;h3&gt;
  
  
  Services that provide cache
&lt;/h3&gt;

&lt;p&gt;Sounds great, but you may be asking yourself how can you implement it? If you are using a database, such as PostgreSQL, you are already using it via &lt;a href="https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS"&gt;&lt;code&gt;shared_buffers&lt;/code&gt;&lt;/a&gt;. This setting tells PostgreSQL the amount of memory it has at its disposal for caching data. The defaults are pretty conservative, as are some of its other settings, so you may want to &lt;a href="https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server"&gt;tune it&lt;/a&gt;, depending on your needs and available resources.&lt;/p&gt;

&lt;p&gt;Apart from the database itself, there are services that are dedicated to caching data. &lt;a href="https://www.memcached.org/"&gt;&lt;code&gt;memcached&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://redis.io/"&gt;&lt;code&gt;redis&lt;/code&gt;&lt;/a&gt; are two popular options, but there are many more. They are oftentimes key-value storages where you can store the results of database calls, results of computations or even entire rendered pages. Most programming languages and Web frameworks have libraries and interfaces to interact with them so you don’t have to reinvent the wheel. For example, &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/"&gt;Django has an entire section dedicated to cache&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP cache
&lt;/h3&gt;

&lt;p&gt;When working with the Web, it is not enough to cache your own data. There are other forces at work here outside of your control, which are sometimes called &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#downstream-caches"&gt;“downstream” caches&lt;/a&gt;, such as Internet Service Providers (ISP), cache proxies or even the Web browsers themselves. These caches may come between the end-user and your application, and it is important to be aware of them, especially when dealing with private data.&lt;/p&gt;

&lt;p&gt;To be on the safe side, you should send HTTP headers that define the resource’s cache policy. The two most important headers are &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control"&gt;&lt;code&gt;Cache-Control&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Vary"&gt;&lt;code&gt;Vary&lt;/code&gt;&lt;/a&gt;. &lt;code&gt;Cache-Control&lt;/code&gt; defines whether or not the response should be cached by the client, for how long, and when it should be revalidated, whereas the &lt;code&gt;Vary&lt;/code&gt; header indicates which HTTP headers are used when selecting a representation of the resource. If you are using a Web framework, it probably has its own mechanisms for generating these headers. For example, Django has decorators for &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#using-vary-headers"&gt;the &lt;code&gt;Vary&lt;/code&gt; HTTP header&lt;/a&gt;, as well as for the &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#controlling-cache-using-other-headers"&gt;&lt;code&gt;Cache-Control&lt;/code&gt; HTTP header&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a side note, if you are using &lt;code&gt;SessionAuthenticationMiddleware&lt;/code&gt;, &lt;a href="https://code.djangoproject.com/ticket/23939"&gt;each response will have a &lt;code&gt;Vary: Cookie&lt;/code&gt; response header&lt;/a&gt;.&lt;/p&gt;
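
&lt;p&gt;To make this concrete, here is a small sketch of what setting such a policy might look like; &lt;code&gt;set_cache_policy&lt;/code&gt; is a hypothetical helper operating on a plain dict of headers, not any framework’s actual API:&lt;/p&gt;

```python
# Hypothetical helper that stamps a response (here just a dict of
# headers) with a caching policy; the names are illustrative.
def set_cache_policy(headers, max_age, private=False, vary=("Accept",)):
    scope = "private" if private else "public"
    headers["Cache-Control"] = "%s, max-age=%d" % (scope, max_age)
    headers["Vary"] = ", ".join(vary)
    return headers

# A public resource, cacheable for an hour, varying on content negotiation.
headers = set_cache_policy({}, max_age=3600, vary=("Accept", "Accept-Language"))
print(headers["Cache-Control"])  # public, max-age=3600
print(headers["Vary"])           # Accept, Accept-Language
```

&lt;p&gt;Marking user-specific responses with &lt;code&gt;private&lt;/code&gt; is what keeps “downstream” caches from serving one user’s data to another.&lt;/p&gt;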

&lt;h3&gt;
  
  
  Cache expiration
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Web frameworks provide most of the mechanisms needed to deal with cache, both internally (via their APIs and interfaces) and externally (via HTTP headers). That means the hardest part is left to the developer: &lt;strong&gt;how to invalidate the cache so that fresh data is sent as soon as possible?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Theoretically, data can be stored in and served from the cache indefinitely, but you don’t want that. You want to evict old and rarely used data, but what about data that is updated?&lt;/p&gt;

&lt;p&gt;Once a resource is updated on the server, the previous version of the same resource stored in the cache should be either evicted, invalidated or updated in order to serve the new, updated version. Which path should you choose? It’s basically up to you, i.e. your application and use-case, but there are several options that come to mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;You can add a &lt;code&gt;last_modified&lt;/code&gt; field to each resource (for simplicity’s sake let’s assume a resource represents a row in a database table) that is updated on each save, and use the resource’s primary key (e.g. its ID) together with the &lt;code&gt;last_modified&lt;/code&gt; field to construct a cache key under which the data is saved to and retrieved from the cache. This is a relatively simple approach that is widely applicable; however, it does have the overhead of retrieving the object from the database, at the very least its ID and &lt;code&gt;last_modified&lt;/code&gt; values to construct the cache key, before retrieving the cached data. The worst-case scenario is that, if the cache is empty, two database queries will be executed before the data is stored in the cache for subsequent requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The first approach could be further optimized by caching all database queries, using something like &lt;a href="https://django-cachalot.readthedocs.io/en/latest/"&gt;&lt;code&gt;django-cachalot&lt;/code&gt;&lt;/a&gt; which &lt;a href="https://django-cachalot.readthedocs.io/en/latest/how.html#monkey-patching"&gt;caches the results of database queries and clears them once the data in the database table is changed&lt;/a&gt;. However, this approach does have its limitations, &lt;a href="https://django-cachalot.readthedocs.io/en/latest/limits.html#high-rate-of-database-modifications"&gt;primarily when there is a high amount of database modifications&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To avoid the aforementioned issues, you could implement something like &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#the-per-site-cache"&gt;&lt;em&gt;per-site&lt;/em&gt;&lt;/a&gt; or &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#the-per-view-cache"&gt;&lt;em&gt;per-view&lt;/em&gt;&lt;/a&gt; cache, but a vanilla implementation such as this means your application would be serving stale data until the previous version expires. This could be alleviated by passing a cache key constructor to &lt;code&gt;drf-extensions&lt;/code&gt;’ &lt;a href="https://chibisov.github.io/drf-extensions/docs/#cache-key"&gt;&lt;code&gt;cache_response&lt;/code&gt; decorator&lt;/a&gt;, but that oftentimes raises the issues mentioned in point 1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you want fine-grained control, you can always &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#template-fragment-caching"&gt;cache response fragments&lt;/a&gt;, but that approach is tedious and makes it easy to inadvertently cache blocks that should not be cached (e.g. user-specific data).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is also &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#cache-versioning"&gt;cache versioning&lt;/a&gt;. Basically, each time a resource is updated, the cache key’s version is incremented. Although this means that only the latest version of the resource is served from the cache, the downside is that you have to keep track of the version numbers yourself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, you can always use the &lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/#the-low-level-cache-api"&gt;low-level API&lt;/a&gt;, i.e. implement the cache invalidation logic yourself. For example, each time a resource is saved, you would construct the resource’s cache key and try to locate it. If it doesn’t exist, create it; if it does, delete it and set it anew.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
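&lt;p&gt;As an illustration, here is a minimal, framework-agnostic sketch of the cache key approach from points 1 and 6, keyed on the primary key and the &lt;code&gt;last_modified&lt;/code&gt; field. The &lt;code&gt;_cache&lt;/code&gt; dictionary and the &lt;code&gt;articles&lt;/code&gt; table are hypothetical stand-ins; a real Django project would use &lt;code&gt;django.core.cache&lt;/code&gt; with an expiration timeout:&lt;/p&gt;

```python
from datetime import datetime, timezone

# Stand-in for a real cache backend; in a Django project this would
# be django.core.cache, with set() taking an expiration timeout.
_cache = {}

def make_cache_key(table, pk, last_modified):
    # The key changes whenever the row is updated, so a stale entry
    # is simply never read again and can expire on its own.
    return f"{table}:{pk}:{last_modified.isoformat()}"

def get_serialized(table, pk, last_modified, serialize):
    key = make_cache_key(table, pk, last_modified)
    if key not in _cache:
        # Cache miss: serialize once and store for subsequent requests.
        _cache[key] = serialize()
    return _cache[key]

# Usage: the first call serializes, the second is served from the cache.
ts = datetime(2021, 4, 20, tzinfo=timezone.utc)
calls = []

def serialize():
    calls.append(1)
    return {"id": 42, "title": "Caching"}

first = get_serialized("articles", 42, ts, serialize)
second = get_serialized("articles", 42, ts, serialize)
```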

&lt;p&gt;As I said before, the path you choose is entirely up to you. There is nothing wrong with combining all of the aforementioned approaches. As always, the most important thing is to choose the right tools for the job. These are just pointers with useful links, so you know what your options are and where to start. The end result depends primarily on your application and its usage patterns.&lt;/p&gt;

&lt;h2&gt;ETag&lt;/h2&gt;

&lt;p&gt;In the HTTP cache section I mentioned “downstream” caches and the role that the &lt;code&gt;Cache-Control&lt;/code&gt; and &lt;code&gt;Vary&lt;/code&gt; headers play in informing proxy caches and other intermediaries about the resource’s &lt;em&gt;freshness&lt;/em&gt;. As it turns out, there is another set of HTTP headers that helps clients (e.g. Web browsers) determine whether or not a received resource is still fresh: &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag"&gt;&lt;code&gt;ETag&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Match"&gt;&lt;code&gt;If-Match&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match"&gt;&lt;code&gt;If-None-Match&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag"&gt;&lt;code&gt;ETag&lt;/code&gt;&lt;/a&gt; HTTP response header is a resource identifier, or better yet, an identifier of a particular version of the resource. It is sort of like the resource’s fingerprint, one that changes whenever the resource is updated. This identifier usually comes in the form of a hash generated by the server. The algorithm used to generate it depends primarily on the resource the &lt;code&gt;ETag&lt;/code&gt; is being calculated for, but there are a few common use-cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the resource is an entry in the database, you can use the database table’s name (including the schema if necessary), the entry’s primary key and the &lt;code&gt;last_modified&lt;/code&gt; field (if available);&lt;/li&gt;
&lt;li&gt;If the resource is an HTML page composed of various different resources, you can use the entire HTML representation as input;&lt;/li&gt;
&lt;li&gt;If the resource is a file on disk, the file’s modification time and size can be used as inputs.&lt;/li&gt;
&lt;/ul&gt;
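&lt;p&gt;For example, the first case could be sketched like this (the table name, ID, timestamp and the choice of hash algorithm are all illustrative, not prescriptive):&lt;/p&gt;

```python
import hashlib

def make_etag(table, pk, last_modified):
    # Hash the identifying inputs; any change to last_modified yields
    # a different fingerprint for the same database row.
    raw = f"{table}:{pk}:{last_modified}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()

etag = make_etag("articles", 42, "2021-04-20T14:39:54+00:00")
```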

&lt;p&gt;You may be asking yourself how this helps the caching process. Why go through all the trouble of generating the &lt;code&gt;ETag&lt;/code&gt; just to match it with the resource it came with? There are two reasons you would want to do that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;It tells the client that sent the request (e.g. Web browser) to cache the response and keep reusing it until the &lt;code&gt;ETag&lt;/code&gt; changes, thus saving bandwidth.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It prevents the client from overwriting the resource if it has been updated in the meantime.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How is this achievable? By sending conditional requests.&lt;/p&gt;

&lt;h3&gt;Conditional requests&lt;/h3&gt;

&lt;p&gt;Sending either the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Match"&gt;&lt;code&gt;If-Match&lt;/code&gt;&lt;/a&gt; or the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match"&gt;&lt;code&gt;If-None-Match&lt;/code&gt;&lt;/a&gt; header makes a request conditional, but when should you use which? The good news is that browsers do most of this for you by default, but here are the details behind these types of requests.&lt;/p&gt;

&lt;p&gt;When retrieving data, the Web browser requests the resource from the server and receives the resource’s representation along with a couple of HTTP headers, among which is the &lt;code&gt;ETag&lt;/code&gt; header. The &lt;code&gt;ETag&lt;/code&gt;’s value is used for retrieving the cached resource from the browser’s cache. After a while, the resource needs to be revalidated (depending on the expiration time set by the &lt;code&gt;Cache-Control&lt;/code&gt; HTTP header), so the browser requests the resource from the server (again), but this time, it adds the &lt;code&gt;ETag&lt;/code&gt;’s value in the &lt;code&gt;If-None-Match&lt;/code&gt; HTTP header. This tells the server to check whether or not the resource has changed since the last request. The server generates the resource’s &lt;code&gt;ETag&lt;/code&gt; and if the two values match, the resource has not changed, and the server returns a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/304"&gt;&lt;code&gt;304 Not Modified&lt;/code&gt;&lt;/a&gt; response with an empty body, telling the browser to use the cached resource. Otherwise, the server returns a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200"&gt;&lt;code&gt;200 OK&lt;/code&gt;&lt;/a&gt; response with the resource’s new representation and a new &lt;code&gt;ETag&lt;/code&gt; value. &lt;strong&gt;What’s the benefit of this process? You save on bandwidth by not receiving the data you already have stored locally.&lt;/strong&gt;&lt;/p&gt;
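&lt;p&gt;A toy sketch of the server’s side of this exchange, where a quoted version counter stands in for the hash a real server would compute:&lt;/p&gt;

```python
def make_etag(resource):
    # Hypothetical fingerprint: a quoted version counter stands in
    # for the hash a real server would compute.
    return f'"{resource["version"]}"'

def handle_get(request_headers, resource):
    current = make_etag(resource)
    if request_headers.get("If-None-Match") == current:
        # Unchanged: empty body, the browser reuses its cached copy.
        return 304, {"ETag": current}, b""
    return 200, {"ETag": current}, resource["body"]

resource = {"version": 1, "body": b"hello"}
# Revalidation with a matching ETag saves the bandwidth of the body.
status, headers, body = handle_get({"If-None-Match": '"1"'}, resource)
```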

&lt;p&gt;That covers data retrieval, but the &lt;code&gt;ETag&lt;/code&gt; header is also useful when updating resources. In fact, it helps prevent simultaneous updates of the same resource from overwriting each other, or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag#Caching_of_unchanged_resources"&gt;“mid-air collisions”&lt;/a&gt;. How would you do that? Let’s say you’ve retrieved a resource and edited it in your browser. You click the &lt;em&gt;Save&lt;/em&gt; button, and the browser sends a PUT request with the &lt;code&gt;ETag&lt;/code&gt;’s value in the &lt;code&gt;If-Match&lt;/code&gt; HTTP header. If the resource has changed in the meantime, the server will respond with a &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/412"&gt;&lt;code&gt;412 Precondition Failed&lt;/code&gt;&lt;/a&gt; response, thus &lt;strong&gt;preventing you from overwriting the resource that was updated while you were editing it.&lt;/strong&gt;&lt;/p&gt;
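&lt;p&gt;The update side can be sketched in the same toy style, again with a hypothetical version counter standing in for a real hash:&lt;/p&gt;

```python
def make_etag(resource):
    # Hypothetical fingerprint: a quoted version counter stands in
    # for the hash a real server would compute.
    return f'"{resource["version"]}"'

def handle_put(request_headers, resource, new_body):
    if request_headers.get("If-Match") != make_etag(resource):
        # Mid-air collision: the client edited an outdated version.
        return 412
    resource["body"] = new_body
    resource["version"] += 1
    return 200

resource = {"version": 2, "body": b"original"}
# A client still holding version 1 is rejected; a current one succeeds.
rejected = handle_put({"If-Match": '"1"'}, resource, b"stale edit")
accepted = handle_put({"If-Match": '"2"'}, resource, b"fresh edit")
```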

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Caching is a broad subject, and this post is intended to give you a high-level overview of its application in Web development. It includes a couple of examples, but there are certainly many more, from code snippets to additional HTTP headers that can be used as well (e.g. &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified"&gt;&lt;code&gt;Last-Modified&lt;/code&gt;&lt;/a&gt;). If you want to dig deeper, follow the references to other resources that go into more practical detail, especially those from the &lt;a href="https://developer.mozilla.org/en-US/"&gt;Mozilla Developer Network&lt;/a&gt; and the &lt;a href="https://www.djangoproject.com/"&gt;Django&lt;/a&gt; Web framework, both excellent sources for further reading.&lt;/p&gt;

&lt;h2&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching"&gt;MDN: HTTP cache&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.djangoproject.com/en/dev/topics/cache/"&gt;Django: Cache&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;PostgreSQL: &lt;a href="https://www.postgresql.org/docs/"&gt;Docs&lt;/a&gt; and &lt;a href="https://wiki.postgresql.org/wiki/Main_Page"&gt;Wiki&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>architecture</category>
      <category>django</category>
    </item>
  </channel>
</rss>
