<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cosimo Streppone</title>
    <description>The latest articles on DEV Community by Cosimo Streppone (@cosimo).</description>
    <link>https://dev.to/cosimo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F393735%2Fee7ada2d-5ea9-4b61-ba60-eaf633b1a685.png</url>
      <title>DEV Community: Cosimo Streppone</title>
      <link>https://dev.to/cosimo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cosimo"/>
    <language>en</language>
    <item>
      <title>On the Oct 19th AWS us-east-1 outage</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Fri, 31 Oct 2025 07:05:36 +0000</pubDate>
      <link>https://dev.to/cosimo/on-the-oct-19th-aws-us-east-1-outage-2bc0</link>
      <guid>https://dev.to/cosimo/on-the-oct-19th-aws-us-east-1-outage-2bc0</guid>
      <description>&lt;p&gt;These will be personal comments on the text that AWS put out after the October 19th 2025 us-east-1 outage, which you can find here: &lt;a href="https://aws.amazon.com/message/101925/" rel="noopener noreferrer"&gt;https://aws.amazon.com/message/101925/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Important premises before starting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;While I’ve been working on web operations for a long time, I have never dealt with services as big as AWS, and also I don’t know anything about how they operate internally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I have the utmost respect for AWS SREs and engineers that had to deal with this outage, so this is in no way intended to downplay the quality of the services or the recovery work done there. On the contrary…&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s start.&lt;/p&gt;

&lt;p&gt;At Kahoot!, the first signal that something was wrong was Slack being slow or unresponsive in our Central European morning. We didn’t yet know that AWS or the us-east-1 region was involved in any of it. Once we realized Slack was not operating correctly, a few of us sent test messages in our backup channel, a Google Chat room named &lt;em&gt;“SRE Team”&lt;/em&gt; that had been used only a handful of times over a few years. It can be cumbersome to establish a backup channel when Slack is suddenly down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway 1&lt;/strong&gt;: establish and document your backup comms channel before you need it, so everyone knows where to go if and when the primary fails.&lt;/p&gt;

&lt;p&gt;In my case, I had another problem. My Firefox install had started acting up, in the special way Firefox fails when you have used snap to install it. I will spare you my thoughts on snap itself, none of which are positive. After 15 minutes of head scratching, I realized the issue might be with Firefox and not related to the AWS outage, and restarted my browser.&lt;/p&gt;

&lt;p&gt;The impact on our infrastructure has been minimal. We’ve seen a few AWS API calls fail, but essentially nothing else. Our EC2 instances and AutoScalingGroups in us-east-1 have been up and running with no issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s always DNS! Except when it isn’t…
&lt;/h2&gt;

&lt;p&gt;Many cited the &lt;em&gt;“It’s always DNS!”&lt;/em&gt; meme. My guess is that they probably haven’t read the AWS text. This was not a DNS failure. It was the system AWS designed to update those “hundreds of thousands” of DynamoDB DNS records that failed, due to a race condition. More on that below.&lt;/p&gt;
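
&lt;p&gt;To make the race concrete: the AWS write-up describes independent “DNS Enactors” applying generated DNS plans, where a delayed Enactor could apply a stale plan over a newer one. Here is a minimal, entirely hypothetical sketch of that failure mode (the names and data shapes are mine, not AWS’s): without a staleness guard, the last writer wins regardless of plan age.&lt;/p&gt;

```python
class DnsState:
    """Hypothetical model of a DNS endpoint's currently applied record plan."""

    def __init__(self):
        self.applied_generation = 0
        self.records = {}

    def apply_plan_unsafe(self, plan):
        # No staleness check: a delayed enactor holding an old plan
        # simply overwrites whatever is currently live.
        self.applied_generation = plan["generation"]
        self.records = dict(plan["records"])

    def apply_plan_safe(self, plan):
        # Guard: only apply plans strictly newer than what is live.
        if plan["generation"] > self.applied_generation:
            self.applied_generation = plan["generation"]
            self.records = dict(plan["records"])
            return True
        return False


new_plan = {"generation": 2, "records": {"dynamodb": ["10.0.0.2"]}}
old_plan = {"generation": 1, "records": {"dynamodb": ["10.0.0.1"]}}

# Unsafe: the delayed enactor wins and stale records go live.
state = DnsState()
state.apply_plan_unsafe(new_plan)
state.apply_plan_unsafe(old_plan)
print(state.records["dynamodb"])  # ['10.0.0.1']

# Safe: the stale plan is rejected and the newer records stay live.
state = DnsState()
state.apply_plan_safe(new_plan)
print(state.apply_plan_safe(old_plan))  # False
```

&lt;p&gt;The guard in &lt;code&gt;apply_plan_safe&lt;/code&gt; is one of the simplest possible fixes: refuse any plan that is not strictly newer than the one already applied.&lt;/p&gt;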

&lt;p&gt;Is that a sign that few people take the time to read things through?&lt;/p&gt;

&lt;h2&gt;
  
  
  Outage text commentary follows
&lt;/h2&gt;

&lt;p&gt;If you haven’t, I’d suggest reading &lt;a href="https://surfingcomplexity.blog/2025/10/25/quick-thoughts-on-the-recent-aws-outage/" rel="noopener noreferrer"&gt;Lorin Hochstein’s blog about the outage&lt;/a&gt;. I won’t be mentioning any of Lorin’s points here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Engineering teams for impacted AWS services were immediately engaged and began to investigate. By 12:38 AM on October 20, our engineers had identified DynamoDB’s DNS state as the source of the outage.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;11:48 PM to 12:38 AM means 50 minutes from the start of the issue to its detection. That seems like … quite a long time. I don’t know the details, of course. My guess is that DynamoDB is so core to most AWS services, and so reliable, that it’s hard to imagine it could have issues. This is also suggested by the fact that “key internal tooling” depends on DynamoDB, meaning it must be very rare for it to be down or unavailable.&lt;/p&gt;

&lt;p&gt;Makes me think of those times when some component, script or cronjob that had been working reliably for years turns out to be the one failing. You’re thinking: “No way, it can’t be THAT! It’s been working perfectly fine for at least 5 years!”. And yet, something has changed this time that caused the failure. Such cases happen. I’ve been smacked in the face a few times :-)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;By 2:25 AM, all DNS information was restored, and all global tables replicas were fully caught up by 2:32 AM. Customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired between 2:25 AM and 2:40 AM.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can conclude that no DNS records for us-east-1 DynamoDB endpoints were available between 11:48 PM and (partially) 2:40 AM. The DynamoDB endpoint hostname for the &lt;code&gt;us-east-1&lt;/code&gt; region is &lt;code&gt;dynamodb.us-east-1.amazonaws.com&lt;/code&gt;. Negative DNS lookups, which I take to be &lt;code&gt;NXDOMAIN&lt;/code&gt; responses in this case, are cached by resolvers. This can be problematic if the TTL for such negative responses is high. In the case of the DynamoDB regional DNS zone, it is currently set to 5 (five) seconds.&lt;/p&gt;

&lt;p&gt;To understand this, let’s look at the SOA record for the us-east-1 DNS zone, which among other things controls how long negative DNS responses are cached:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dig us-east-1.amazonaws.com SOA +noall +answer
us-east-1.amazonaws.com. 579 IN SOA dns-external master.amazon.com. root.amazon.com. 22365 180 60 2592000 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last number in the SOA record is the minimum TTL, &lt;a href="https://www.rfc-editor.org/rfc/rfc2308" rel="noopener noreferrer"&gt;used as the TTL for negative responses&lt;/a&gt;. Hence any lookup for &lt;code&gt;dynamodb.us-east-1.amazonaws.com&lt;/code&gt; that returned an &lt;code&gt;NXDOMAIN&lt;/code&gt; response (record not existing) would be cached for just 5 seconds. I wonder whether AWS lowered this value after the outage: had it been 5 seconds during the incident, the time for clients to recover would have been much shorter than reported. I also wonder what sort of load such a short TTL imposes on their DNS serving infrastructure…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway 2&lt;/strong&gt;: if you have particularly critical services, verify that the negative DNS response TTL you advertise in your SOA records is appropriately set, so that clients can recover quickly when DNS records are restored. Five seconds might be a bit extreme for anyone except huge companies, also because it can impose a tremendous load on the DNS infrastructure. Something like 60s might be more appropriate for mere mortals. YMMV.&lt;/p&gt;
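
&lt;p&gt;For the curious, the RFC 2308 rule is simple: a resolver caches a negative answer for the smaller of the SOA record’s own TTL and its MINIMUM field. A tiny sketch of that rule, using the values from the dig output above:&lt;/p&gt;

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum):
    """RFC 2308: negative answers are cached for the smaller of the
    SOA record's own TTL and its MINIMUM field (the last SOA number)."""
    return min(soa_record_ttl, soa_minimum)


# Values from the dig output above: SOA record TTL 579, MINIMUM field 5.
print(negative_cache_ttl(579, 5))  # 5 seconds
```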

&lt;h2&gt;
  
  
  EC2
&lt;/h2&gt;

&lt;p&gt;DynamoDB being the backbone of many internal AWS services meant that EC2 was also impacted. I won’t reiterate here how or why. Instead:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s what we saw. Our existing EC2 instances in us-east-1 kept running with no issues. We were also “lucky” that our AutoScalingGroups didn’t initiate any scale-in or scale-out events, it being night time in us-east-1. Those would probably have failed, at least the scale-out ones, as launching new instances was one of the impacted operations.&lt;/p&gt;

&lt;p&gt;Other more “complex” services were impacted. Our AWS usage is relatively basic; we don’t use esoteric services or configurations, so we weren’t affected by the EC2 issues during the outage. Keeping things simple has been advantageous in this case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway 3&lt;/strong&gt;: Simple, “boring” infrastructure choices can be surprisingly resilient. Complex service configurations and dependencies increase your surface area for cascading failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  DropletWorkflow Manager
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Each DWFM manages a set of droplets within each Availability Zone and maintains a lease for each droplet currently under management. This lease allows DWFM to track the droplet state, ensuring that all actions from the EC2 API or within the EC2 instance itself, such as shutdown or reboot operations originating from the EC2 instance operating system, result in the correct state changes within the broader EC2 systems.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As part of maintaining this lease, each DWFM host has to check in and complete a state check with each droplet that it manages every few minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Starting at 11:48 PM PDT on October 19, these DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;While this did not affect any running EC2 instance, it did result in the droplet needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it is hosting.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Between 11:48 PM on October 19 and 2:24 AM on October 20, leases between DWFM and droplets within the EC2 fleet slowly started to time out.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;At 2:25 AM PDT, with the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet. Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning “insufficient capacity errors” for new incoming EC2 launch requests.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This description of the inner workings of the &lt;em&gt;DropletWorkflow Manager&lt;/em&gt; is quite fascinating. From my point of view, DWFM was designed really well, to fail in such a graceful way. Being defensive is a useful trait in systems design. This is an excellent example of a “fail open” design philosophy: the system degraded gracefully rather than causing widespread instance failures.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation. Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The oldest trick in the book! A well-placed server restart can help :-) I did this regularly years ago, then started to prefer understanding the actual failure at hand first, since kicking the server can destroy the very observations needed to find the cause of the fault. Sometimes it’s still a viable way to get out of trouble, even at AWS apparently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Load Balancers
&lt;/h2&gt;

&lt;p&gt;Since NLBs are based on EC2 instances, they too were impacted by the outage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Our monitoring systems detected this at 6:52 AM, and engineers began working to remediate the issue.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means 80 minutes from the first NLB issues to detection at the monitoring layer. That’s an indication either that detection was quite challenging, or simply that it took that long to notice. Again, it would be very interesting to know what exactly happened during that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other AWS Services
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;By 2:24 AM, service operations recovered except for SQS queue processing, which remained impacted because an internal subsystem responsible for polling SQS queues failed and did not recover automatically. We restored this subsystem at 4:40 AM and processed all message backlogs by 6:00 AM.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One aspect I haven’t seen mentioned anywhere else is that, with all these different subsystems, we can assume &lt;strong&gt;many&lt;/strong&gt; SREs must have been on deck to deal with this outage. Given that, the coordination work must have been absolutely massive, and critical as well. No doubt it would have been extremely fascinating to observe how this coordination went, and how the different teams communicated and collaborated to get things back up and running. Or maybe it was a single team of three to five people instead? It was night time in the US, so who knows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Inbound callers experienced busy tones, error messages, or failed connections. Both agent-initiated and API-initiated outbound calls failed. Answered calls experienced prompt playback failures, routing failures to agents, or dead-air audio.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was the first time I read the term “dead-air audio”. A &lt;a href="https://en.wikipedia.org/wiki/Dead_air" rel="noopener noreferrer"&gt;detour to Wikipedia&lt;/a&gt; was definitely worth it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Customers with IAM Identity Center configured in N. Virginia (us-east-1) Region were also unable to sign in using Identity Center.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fortunately, our IAM Identity Center is in a different region, so we weren’t impacted there either. I certainly don’t envy teams who were shut off from access to the AWS console. Our observability systems also weren’t affected, but I can see how losing console access and perhaps also losing your observability platform could be a completely paralyzing situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway 4&lt;/strong&gt;: Spend some time pondering how not only your own systems, but also the 3rd party systems you depend on would react to an AWS outage. Would they make you unable to react? Can you do something about it?&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Finally, as we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AWS message clearly wasn’t meant to be a post-mortem. With that said, there are zero mentions of a human element anywhere in the outage. Perhaps because the DNS automation was … automation, and no manual intervention caused the issues. I would have appreciated learning more about the human factors involved in any case. For example, what challenges did the teams face in identifying that the DynamoDB DNS records were missing? Perhaps this is more material for an AWS-internal post-mortem, which people might still be working on.&lt;/p&gt;

&lt;p&gt;Multiple DNS Enactor instances applying potentially outdated plans simultaneously seems, in hindsight, clearly quite risky. It’s always easy to criticize after the fact, and since we’re missing such a huge amount of context and background, the only thing we can do is speculate, and learn from this as much as we can. If you have, or know where to get, more insight into AWS internals, reach out and let me know!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.streppone.it/cosimo/blog/2025/10/on-the-oct-19th-aws-us-east-1-outage/" rel="noopener noreferrer"&gt;On the Oct 19th AWS us-east-1 outage&lt;/a&gt; appeared first on &lt;a href="https://www.streppone.it/cosimo/blog" rel="noopener noreferrer"&gt;Random hacking&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>operations</category>
      <category>aws</category>
      <category>outage</category>
      <category>webops</category>
    </item>
    <item>
      <title>How to pin a specific apt package version</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Mon, 10 Feb 2025 09:59:06 +0000</pubDate>
      <link>https://dev.to/cosimo/how-to-pin-a-specific-apt-package-version-1jbp</link>
      <guid>https://dev.to/cosimo/how-to-pin-a-specific-apt-package-version-1jbp</guid>
      <description>&lt;p&gt;I’d like to pin a specific package, say &lt;code&gt;redis-server&lt;/code&gt;, to a specific version, in my case &lt;code&gt;7.0.*&lt;/code&gt;, and that seems straight-forward to do with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Package: redis-server
Pin: version 7.0.*
Pin-Priority: 1001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, I would also like to have &lt;code&gt;apt&lt;/code&gt; &lt;strong&gt;fail&lt;/strong&gt; when 7.0.* is not available, either because only newer versions are available, f.ex. 7.2.* or 7.4.*, or perhaps because only older versions like 6.* are available.&lt;/p&gt;

&lt;p&gt;I can’t seem to find a way to achieve that. I’ve read various resources online and consulted &lt;code&gt;man 5 apt_preferences&lt;/code&gt;, but I’m still not sure how to.&lt;/p&gt;

&lt;p&gt;I tried combining the previous pinning rule with a second one at priority &lt;code&gt;-1&lt;/code&gt;, as in the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Package: redis-server
Pin: release *
Pin-Priority: -1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that seems to make all versions unavailable unfortunately. Here’s what I’m seeing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ apt-cache policy redis-server
redis-server:
  Installed: (none)
  Candidate: 5:7.0.15-1build2
  Version table:
     5:7.0.15-1build2 500
        500 http://no.archive.ubuntu.com/ubuntu noble/universe amd64 Packages

$ cat &amp;gt; /etc/apt/preferences.d/redis-server
Package: redis-server
Pin: version 7.0.15*
Pin-Priority: 1001

Package: redis-server
Pin: release *
Pin-Priority: -1

$ apt-cache policy redis-server
redis-server:
  Installed: (none)
  Candidate: (none)
  Version table:
     5:7.0.15-1build2 -1
        500 http://no.archive.ubuntu.com/ubuntu noble/universe amd64 Packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected this configuration to provide an available candidate, since one exists (7.0.15), but that doesn’t work.&lt;/p&gt;

&lt;p&gt;Note that a successful outcome for me is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;define a target wanted version, f.ex. &lt;code&gt;redis-server=7.0.*&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;provide an apt preferences.d file such that:

&lt;ul&gt;
&lt;li&gt;when any 7.0.* versions are available, apt will install that version&lt;/li&gt;
&lt;li&gt;when no 7.0.* versions are available, apt will &lt;strong&gt;fail&lt;/strong&gt;, installing nothing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;A bad outcome is when redis-server is installed, but with a package version that &lt;strong&gt;does not&lt;/strong&gt; match what I had specified as requirement (hence, different from 7.0.*).&lt;/p&gt;

&lt;p&gt;This is on Ubuntu 24.04, although I would think there is nothing here specific to 24.04 or Ubuntu.&lt;/p&gt;

&lt;p&gt;Any ideas?&lt;/p&gt;

&lt;p&gt;Posted on the Unix &amp;amp; Linux Stack Exchange, let’s see! &lt;a href="https://unix.stackexchange.com/questions/790837/how-to-pin-an-apt-package-to-a-version-and-fail-if-its-not-available" rel="noopener noreferrer"&gt;https://unix.stackexchange.com/questions/790837/how-to-pin-an-apt-package-to-a-version-and-fail-if-its-not-available&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UPDATE:&lt;/strong&gt; based on the Stack Exchange feedback, it turns out the solution wasn’t far off.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Package: redis-server
Pin: version 5:7.0.15*
Pin-Priority: 1001

Package: redis-server
Pin: release *
Pin-Priority: -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I needed to prepend the version with the epoch prefix &lt;code&gt;5:&lt;/code&gt;.&lt;/p&gt;
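
&lt;p&gt;The &lt;code&gt;5:&lt;/code&gt; is the Debian version &lt;em&gt;epoch&lt;/em&gt;, and &lt;code&gt;Pin: version&lt;/code&gt; patterns match against the full version string, epoch included, which is why &lt;code&gt;7.0.15*&lt;/code&gt; never matched &lt;code&gt;5:7.0.15-1build2&lt;/code&gt;. A small illustrative sketch of how the epoch splits off (my own helper for illustration, not part of apt):&lt;/p&gt;

```python
def split_epoch(version):
    """Split a Debian version string into (epoch, remainder).
    A missing epoch is treated as 0, per Debian policy."""
    if ":" in version:
        epoch, remainder = version.split(":", 1)
        return int(epoch), remainder
    return 0, version


# The version string apt compares pins against starts with the epoch,
# so a pattern like "7.0.15*" cannot match "5:7.0.15-1build2".
print(split_epoch("5:7.0.15-1build2"))  # (5, '7.0.15-1build2')
print(split_epoch("7.0.15"))            # (0, '7.0.15')
```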

&lt;p&gt;The post &lt;a href="https://www.streppone.it/cosimo/blog/2025/02/how-to-pin-a-specific-apt-package-version/" rel="noopener noreferrer"&gt;How to pin a specific apt package version&lt;/a&gt; appeared first on &lt;a href="https://www.streppone.it/cosimo/blog" rel="noopener noreferrer"&gt;Random hacking&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>apt</category>
      <category>debian</category>
      <category>ubuntu</category>
    </item>
    <item>
      <title>TIL: Styling Obsidian text paragraphs</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Mon, 23 Dec 2024 12:15:12 +0000</pubDate>
      <link>https://dev.to/cosimo/til-styling-obsidian-text-paragraphs-54ak</link>
      <guid>https://dev.to/cosimo/til-styling-obsidian-text-paragraphs-54ak</guid>
      <description>&lt;p&gt;TIL that it’s possible to style &lt;a href="https://obsidian.md/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; text paragraphs in a way that allows me to focus on the actual text content and not on the fact that I need to add an artificial line break every time I type a paragraph :-)&lt;/p&gt;

&lt;p&gt;Obsidian can use CSS snippets to style the application itself and the text/markdown content. The CSS snippets need to be saved in &lt;code&gt;&amp;lt;vault_directory&amp;gt;/.obsidian/snippets/whatever.css&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
This is how to get that “natural” book-like spacing between paragraphs, and avoid adding spurious line breaks in the markdown code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.cm-contentContainer {
  line-height: 1.70rem;
}

.markdown-source-view.mod-cm6 .cm-content &amp;gt; .cm-line {  
  padding-bottom: 12px !important;  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, the values for line-height and padding will depend on your particular screen and font settings. In my case I use a screen rotated to a vertical position for writing and coding, and my font of choice is Berkeley Graphics’ beautiful &lt;a href="https://berkeleygraphics.com/typefaces/berkeley-mono/" rel="noopener noreferrer"&gt;Berkeley Mono&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>development</category>
      <category>todayilearned</category>
      <category>css</category>
      <category>obsidian</category>
    </item>
    <item>
      <title>My experience at SREcon EMEA 2022</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Mon, 07 Nov 2022 10:36:10 +0000</pubDate>
      <link>https://dev.to/cosimo/my-experience-at-srecon-emea-2022-188i</link>
      <guid>https://dev.to/cosimo/my-experience-at-srecon-emea-2022-188i</guid>
      <description>&lt;p&gt;A few weeks ago I attended &lt;a href="https://www.usenix.org/conference/srecon22emea/"&gt;SREcon&lt;/a&gt; in Amsterdam. Here’s some sparse thoughts about it, with no pretense of being exhaustive or coherent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Back
&lt;/h2&gt;

&lt;p&gt;There is only a handful of conferences I attended where I felt “at home”. Going back in time, &lt;a href="http://www.streppone.it/cosimo/blog/2010/10/surge-2010-scalability-conference-in-baltimore-usa-day-2/"&gt;Surge&lt;/a&gt; was the first one, then came &lt;a href="http://www.streppone.it/cosimo/blog/2012/01/my-experience-at-velocity-europe-2011-in-berlin/"&gt;Velocity&lt;/a&gt;. I’m adding &lt;a href="https://www.usenix.org/conference/srecon22emea/"&gt;SREcon&lt;/a&gt; to that list. It definitely felt like I was among people who speak the same language and have a similar breadth and depth of expertise, and yet it felt somewhat strange at the same time.&lt;/p&gt;

&lt;p&gt;As I see it, there are at least three “tiers” at such a big yet niche conference: the FAANG folks, the tiny companies with a sysadmin or a devops person or two, and then the big ocean of mid-sized companies, where people like us are. Our SRE team is four people and we manage a service with millions of monthly users. Needless to say, we have a lot on our plate :-)&lt;/p&gt;

&lt;p&gt;I came to SREcon after a hiatus of some years from conferences. After a while, conferences tend to become self-referential, with people talking about the same things over and over again. I wanted to understand how things had changed in our field and what people were talking about most, get some fresh perspectives, and perhaps connect with people from other companies. What prompted me to do this was &lt;a href="https://youtu.be/7Ktzu0qvS6c?t=771"&gt;Niall Murphy tearing the SRE bible book apart&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question of SRE Identity
&lt;/h2&gt;

&lt;p&gt;This year’s &lt;a href="https://www.usenix.org/conference/srecon22emea/call-for-participation"&gt;conference topic&lt;/a&gt; was &lt;strong&gt;&lt;em&gt;“What could SRE be?”&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
No surprise, then, that a good portion of the talks were about what I refer to as &lt;em&gt;the question of identity for SREs&lt;/em&gt;. We have seen the same happen — and a lot more during all these years — for the DevOps movement.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What could SRE be&lt;/em&gt;, then? According to some presentations, one would conclude that whatever SRE is, it’s no longer what Google intended, it’s not what anyone else thinks it is either, it’s just what &lt;strong&gt;you&lt;/strong&gt; think it is: a subjectivist view.&lt;/p&gt;

&lt;p&gt;Among the Usenix slack conversations, there was a lot of chit-chat about SRE identity. My personal contribution was the following meme:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/6ygfxs.jpg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gaidoYFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/6ygfxs.jpg" alt="" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other funny memes that were shared:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image_from_ios.jpg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L_4PYLxD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image_from_ios-181x300.jpg" alt="" width="181" height="300"&gt;&lt;/a&gt; &lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--llU0TVsJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image-300x300.png" alt="" width="300" height="300"&gt;&lt;/a&gt; &lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image-1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pWyRi5Xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/image-1-300x296.png" alt="" width="300" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An interesting fact I learned during the conference is that the &lt;a href="https://sre.google/sre-book/table-of-contents/"&gt;Google SRE book&lt;/a&gt; was written by &lt;a href="https://www.linkedin.com/feed/update/urn:li:ugcPost:6996742200055197697?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3AugcPost%3A6996742200055197697%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29"&gt;assembling contributions from the best teams at Google&lt;/a&gt;, picking out their respective best practices. Paradoxically, this implies that &lt;strong&gt;the SRE book is not representative of how even Google itself does SRE.&lt;/strong&gt; If you also consider that, at the time the SRE book was published (2016), Google employed about 1,200 people in its various SRE teams, the only possible conclusion is… &lt;strong&gt;if you are not Google, there is &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/suriar-chocolate"&gt;likely very little that you can apply to your everyday&lt;/a&gt; mere-mortal-SRE life.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you think I’m exaggerating, such a conclusion was claimed by (ex-)Google engineers themselves, for example in &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/hidalgo"&gt;Alex Hidalgo’s “Diamonds under Pressure” talk&lt;/a&gt; and (in my opinion) in one of the best talks of the conference, &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/stolarsky"&gt;Emil Stolarsky’s Unified Theory of SRE&lt;/a&gt;. Another entertaining presentation in the same vein was &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/shafer"&gt;Andrew Clay Shafer’s “SRE as She Is Spoke”&lt;/a&gt;. Andrew expressed the thesis that &lt;em&gt;“progress [on the SRE journey] stops when the needs are met”&lt;/em&gt;, which seems a reasonable and pragmatic approach.&lt;br&gt;&lt;br&gt;
The videos are not up yet, but they should be in a few weeks.&lt;/p&gt;

&lt;p&gt;Alongside the “subjectivist” view, there were other talks, which could be classified as &lt;em&gt;systems thinking&lt;/em&gt;, that focused on the more general and broad aspects of what SREs do: how to handle complex systems, human factors, and so on. Among the best, in my opinion, were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first plenary session on Day 1, &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/maguire"&gt;“Knowledge and Power: A Sociotechnical Systems Discussion on the Future of SRE”&lt;/a&gt; by Laura Maguire and Lorin Hochstein. I’ve been following &lt;a href="https://surfingcomplexity.blog/"&gt;Lorin’s excellent blog, Surfing Complexity&lt;/a&gt; for some years now.&lt;/li&gt;
&lt;li&gt;Laura Nolan’s &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/nolan-sre"&gt;“What SRE Could Be: Systems Reliability Engineering”&lt;/a&gt;. Lots of material to go deeper into systems thinking. This was brilliant, I think Laura raised the level of the conversation with this talk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What else?
&lt;/h2&gt;

&lt;p&gt;The question of SRE identity accounted for a notable part of the talks, but thankfully not all. It’s good to pause and reflect on our role, but personally that’s not why I was interested in SREcon, not primarily at least. What I like are the deep technical talks, where I get to know more about how other companies &lt;strong&gt;actually do the stuff we call SRE.&lt;/strong&gt; Given my past conference experience, I expected &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/janardhan"&gt;Facebook/Meta’s talk to be somewhat disappointing&lt;/a&gt;, and it was. While some details of how Meta is structured were shared, and are always interesting, I expected a bit more on how the incident actually happened.&lt;/p&gt;

&lt;p&gt;I loved Effie Mouzeli’s talk on how to make teams resilient, &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/mouzeli"&gt;“Is Our Team as Resilient as Our Systems?”&lt;/a&gt;. We naturally focus on systems, but &lt;strong&gt;teams are a crucial part of the equation&lt;/strong&gt;. My team and I have had to work on this a lot in the past years, and I’m hoping to share more about this soon. I felt this talk had a lot of good insights, some of which we’ve also applied over time.&lt;/p&gt;

&lt;p&gt;Another talk that deserves a mention is &lt;a href="https://www.usenix.org/conference/srecon22emea/presentation/sinjakli"&gt;Chris Sinjakli’s reflection&lt;/a&gt; on broadening the scope of how we work on reliability for our systems. This is sometimes difficult to do when toil is a big part of our jobs. Luckily it’s not for our team, not anymore at least, so this talk felt very relevant to me, and I recommend it.&lt;/p&gt;

&lt;p&gt;I couldn’t attend some of the talks due to the two parallel tracks. I hope to catch up when slides and videos are published later on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What about the hallway track?
&lt;/h2&gt;

&lt;p&gt;In general, people say that conferences are most useful because of the casual conversations you can have in the hallways. While I somewhat agree with that, the opportunities to have conversations vary depending on the type of person you are, and the people you meet, of course. My impression is that while some people at SREcon were happy to have conversations, most were likewise happy to be left alone, which is fair enough :-)&lt;br&gt;&lt;br&gt;
Just to say that it was really nice to meet people and chat; almost everyone I talked to knew &lt;a href="https://kahoot.com"&gt;Kahoot!&lt;/a&gt; directly and was happy to share details about what they’re doing, and equally interested in what we’re doing.&lt;/p&gt;

&lt;p&gt;In some of these conversations I’ve been advocating for more concrete, down-to-earth talks on how smaller companies like ours do SRE. It’s ok to aspire to or be interested in how Google runs, but you come away with absolutely zero information that’s useful to your work life. There may even be a downside: people going home thinking they have to do whatever Google does (see chapters above), so ultimately… &lt;strong&gt;let’s give less importance to the Googles of the world, please!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Besides the hallway track, there was a nice “sidewalk” track. We walked around the city, 15 km a day on average — you gotta track those SLOs… — and I also managed to snap some nice pictures of Amsterdam at sunrise and sunset.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/DSC_0723-small.jpg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Zpew1tu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/DSC_0723-small-1024x630.jpg" alt="" width="880" height="541"&gt;&lt;/a&gt;&lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/DSC_0884-HDR-small.jpg"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vuvgle64--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/DSC_0884-HDR-small-1024x684.jpg" alt="" width="880" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Venue and Organization
&lt;/h2&gt;

&lt;p&gt;Loved all of it, honestly the best conference I’ve ever been to. The venue was spectacular, there was plenty of space, slides were clearly visible on screen, and the food was awesome! We also used one of the available meeting rooms to participate in our own company hackathon after the conference finished, until they kicked us out. Here’s a sneak peek of what our team was working on:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/kahoot-earth-day-night.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n4PT0OpW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.streppone.it/cosimo/blog/wp-content/uploads/2022/11/kahoot-earth-day-night-1024x761.png" alt="" width="880" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope to return to SREcon next year in Dublin. By then, I’d love to see more not-Google, not-Meta, etc… talks on the program. Perhaps we (or you!) should think about presenting too, why not?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>srecon</category>
    </item>
    <item>
      <title>Failed to connect to the host via SSH on Ubuntu 22.04</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Wed, 05 Oct 2022 16:02:48 +0000</pubDate>
      <link>https://dev.to/cosimo/failed-to-connect-to-the-host-via-ssh-on-ubuntu-2204-3odl</link>
      <guid>https://dev.to/cosimo/failed-to-connect-to-the-host-via-ssh-on-ubuntu-2204-3odl</guid>
      <description>&lt;p&gt;If you have just upgraded to Ubuntu 22.04, and you suddenly experience either errors when trying to ssh into hosts, or when running &lt;a href="https://docs.ansible.com/ansible/latest/index.html"&gt;ansible&lt;/a&gt; or again when running the ansible provisioner building a &lt;a href="https://packer.io"&gt;packer&lt;/a&gt; image, this is probably going to be useful for you.&lt;/p&gt;

&lt;p&gt;In my case I was trying to build an AWS EC2 image via packer and the ansible provisioner, and I had this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amazon-ebs.aws: Failed to connect to the host via ssh: Unable to negotiate with 127.0.0.1 port
amazon-ebs.aws: 40015: no matching host key type found. Their offer: ssh-rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your problem is that you simply can't connect via SSH to a host from your Ubuntu 22.04 host, then &lt;a href="https://duckduckgo.com/?q=ssh+PubkeyAcceptedKeyTypes+ssh-rsa"&gt;look it up&lt;/a&gt;, there are a lot of people in the same boat.&lt;/p&gt;

&lt;p&gt;The proposed solution is to add this snippet to either your &lt;code&gt;/etc/ssh/ssh_config&lt;/code&gt; or &lt;code&gt;~/.ssh/config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PubkeyAcceptedKeyTypes +ssh-rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or just for some specific hosts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host host.example.com
    PubkeyAcceptedKeyTypes +ssh-rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of ansible connecting to a host, or packer launching ansible connecting to a host, this needs an additional step or two.&lt;/p&gt;

&lt;p&gt;For ansible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ansible --ssh-extra-args="-o PubkeyAcceptedKeyTypes=+ssh-rsa"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For packer with ansible provisioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build {
  sources = ["sources.amazon-ebs.aws"]
  provisioner "ansible" {
    ansible_env_vars = [
      ...
      "ANSIBLE_SSH_ARGS='-o PubkeyAcceptedKeyTypes=+ssh-rsa -o HostkeyAlgorithms=+ssh-rsa'"
    ]
    playbook_file       = "..."
    galaxy_file         = "..."
    ...
    extra_arguments     = "${concat(local.default_ansible_extra_args, var.ansible_extra_args)}"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Background info on the cause of this issue: &lt;a href="https://ikarus.sg/rsa-is-not-dead/"&gt;https://ikarus.sg/rsa-is-not-dead/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hope I don't need to come back to this for a while :-)&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>packer</category>
      <category>ubuntu</category>
      <category>devops</category>
    </item>
    <item>
      <title>Text Clustering using Deep Learning language models</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Wed, 04 May 2022 09:15:26 +0000</pubDate>
      <link>https://dev.to/cosimo/text-clustering-using-deep-learning-language-models-15nm</link>
      <guid>https://dev.to/cosimo/text-clustering-using-deep-learning-language-models-15nm</guid>
      <description>&lt;p&gt;I had a ton of fun working on this small project, involving nlp, the most classic clustering algorithm of all (k-means), and a touch of deep learning.&lt;/p&gt;

&lt;p&gt;It all started with the thought: what if... we could do X? How would we do it? What would it take to do it? "Doing it" involved a lot of research and learning on my part, on NLP but also on how to ship such big models to production and make them work reasonably fast.&lt;/p&gt;

&lt;p&gt;Some of that experience I have &lt;a href="https://dev.to/cosimo/deploying-large-deep-learning-models-in-production-goe"&gt;already written about&lt;/a&gt;, but this new article just published today explains the more high-level view of what was done and how.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kahoot.com/tech-blog/text-clustering-using-deep-learning-language-models/"&gt;https://kahoot.com/tech-blog/text-clustering-using-deep-learning-language-models/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>clustering</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Deploying Large Deep Learning Models in Production</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Thu, 26 Aug 2021 19:07:27 +0000</pubDate>
      <link>https://dev.to/cosimo/deploying-large-deep-learning-models-in-production-goe</link>
      <guid>https://dev.to/cosimo/deploying-large-deep-learning-models-in-production-goe</guid>
      <description>&lt;p&gt;Most deep learning or machine learning (ML) articles and tutorials focus on how to build, train and evaluate a model. The model deployment stage is rarely covered in detail, even though it is just as important if not fundamental part of a ML system. In other words, how do we take a working ML model from a jupyter notebook to a production ML-powered API?&lt;/p&gt;

&lt;p&gt;I hope more and more practitioners will cover the deployment aspect of ML models. For now, I can offer my own experience about how I approached this problem, hoping this will be useful to some of you out there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a useful ML model
&lt;/h2&gt;

&lt;p&gt;How to create a useful ML model is the part of the work I won’t cover in this post. :-)&lt;/p&gt;

&lt;p&gt;I assume that you already have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a model or pipeline that is either pre-trained or that you have trained yourself&lt;/li&gt;
&lt;li&gt;a model based on PyTorch, though most of the information here will probably help with any ML framework&lt;/li&gt;
&lt;li&gt;some idea on how to make your model available as a RESTful API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  First step: defining a simple API
&lt;/h2&gt;

&lt;p&gt;The rest of this article will use Python as a programming language, for various reasons, the most important being that the ML model is based on PyTorch. In my specific case, the problem I worked on was text clustering.&lt;/p&gt;

&lt;p&gt;Given a set of sentences, the API should output a list of clusters. A cluster is a group of sentences that have a similar meaning, or as similar as possible. This task is usually referred to with the term &lt;em&gt;“semantic similarity”&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
Here’s an example. Given the sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Dog Walking: 10 Simple Steps”&lt;/li&gt;
&lt;li&gt;“The Secrets of Dog Walking”&lt;/li&gt;
&lt;li&gt;“Why You Need To Dog Walking”&lt;/li&gt;
&lt;li&gt;“The Art of Dog Walking”&lt;/li&gt;
&lt;li&gt;“The Joy of Dog Walking”&lt;/li&gt;
&lt;li&gt;“Public Speaking For The Modern Age”&lt;/li&gt;
&lt;li&gt;“Learn The Art of Public Speaking”&lt;/li&gt;
&lt;li&gt;“Master The Art of Public Speaking”&lt;/li&gt;
&lt;li&gt;“The Best Way To Public Speaking”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API should return the following clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster 1 = (“Dog Walking: 10 Simple Steps”, “The Secrets of Dog Walking”, “Why You Need To Dog Walking”, “The Art of Dog Walking”, “The Joy of Dog Walking”)&lt;/li&gt;
&lt;li&gt;Cluster 2 = (“Public Speaking For The Modern Age”, “Learn The Art of Public Speaking”, “Master The Art of Public Speaking”, “The Best Way To Public Speaking”)&lt;/li&gt;
&lt;/ul&gt;
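&lt;p&gt;To make the idea concrete, here is a toy sketch of grouping titles like the ones above. It deliberately uses plain word overlap (Jaccard similarity) instead of a deep language model, so it is nothing like the production approach, but it shows the input/output shape the API deals with:&lt;/p&gt;

```python
import re


def tokens(sentence):
    """Lowercased word set, punctuation and digits stripped."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def cluster(sentences, threshold=0.2):
    """Greedily assign each sentence to the first cluster it resembles."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])  # no similar cluster found: start a new one
    return clusters


titles = [
    "Dog Walking: 10 Simple Steps",
    "The Secrets of Dog Walking",
    "Why You Need To Dog Walking",
    "The Art of Dog Walking",
    "The Joy of Dog Walking",
    "Public Speaking For The Modern Age",
    "Learn The Art of Public Speaking",
    "Master The Art of Public Speaking",
    "The Best Way To Public Speaking",
]

# Two clusters: the five dog-walking titles and the four public-speaking ones
print(cluster(titles))
```

A real semantic-similarity model would also group paraphrases with no words in common, which is exactly where word overlap breaks down.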
&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;I plan to describe the details of the specific model and algorithm I used in a future post. For now, the important aspect is that this model can be loaded in memory with some function we define as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = get_model()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model will likely be a very large in-memory object. We only want to load it once in our backend process and use it throughout the request lifecycle, possibly for more than just one request. A typical model will take a long time to load. Ten seconds or more is not unheard of, and we can’t afford to load it for every request. It would make our service terribly slow and unusable.&lt;/p&gt;
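&lt;p&gt;One simple way to guarantee the load happens only once per process is to memoize the loading function. This is a minimal sketch, with a stand-in object in place of the real PyTorch model:&lt;/p&gt;

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model():
    # Stand-in for the real, slow model load
    # (e.g. torch.load or a .from_pretrained call)
    return object()


# The expensive load runs on the first call only;
# later calls return the very same cached object.
assert get_model() is get_model()
```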

&lt;h2&gt;
  
  
  A simple Python backend module
&lt;/h2&gt;

&lt;p&gt;Last year I discovered &lt;a href="https://fastapi.tiangolo.com"&gt;FastAPI&lt;/a&gt;, and I immediately liked it. It’s easy to use, intuitive and yet flexible. It allowed me to quickly build up every aspect of my service, including its documentation, auto-generated from the code.&lt;/p&gt;

&lt;p&gt;FastAPI provides a well-structured base to build upon, whether you are just starting with Python or you are already an expert. It encourages the use of type hints and model classes for each request and response. Even if you have no idea what these are, just follow FastAPI’s good defaults and you will likely find this way of working quite neat.&lt;/p&gt;

&lt;p&gt;Let’s build our service from scratch. I usually start from a python &lt;em&gt;virtualenv&lt;/em&gt;, an isolated python environment where you can install your dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtualenv --python /usr/bin/python3.8 .venv
source .venv/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not familiar with virtualenv, there are many tutorials you can read online.&lt;br&gt;&lt;br&gt;
Next step, we write our requirements file, with all the python modules we need to run our project. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- requirements.txt
fastapi~=0.61.1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file as &lt;code&gt;requirements.txt&lt;/code&gt;. You can install the modules with &lt;code&gt;pip&lt;/code&gt;. There are plenty of guides on how to get pip on your system if you don’t have it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing so will install FastAPI. Let’s create our backend now. Copy the following skeleton API into a &lt;code&gt;main.py&lt;/code&gt; file. If you prefer, you can clone the FastAPI template published at &lt;a href="https://github.com/cosimo/fastapi-ml-api"&gt;https://github.com/cosimo/fastapi-ml-api&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Optional

from fastapi import FastAPI

app = FastAPI()
model = get_model()

@app.post("/cluster")
def cluster():
return {"Hello": "World"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run this service with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uvicorn main:app --reload

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll notice right away that any changes to the code will trigger a reload of the server: if you are using the production ML model, the model’s own load time will quickly become a nuisance. I haven’t managed to solve this problem yet. One approach I could see working is to either mock the model results if possible, or use a lighter model for development.&lt;/p&gt;
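&lt;p&gt;A minimal sketch of the “lighter model for development” idea: switch to an instant-loading dummy via an environment variable. The variable name and the &lt;code&gt;encode&lt;/code&gt; interface here are my own hypothetical choices, not part of any library:&lt;/p&gt;

```python
import os


class DummyModel:
    """Instant-loading stand-in for development: fixed-size zero vectors."""

    def encode(self, sentences):
        return [[0.0] * 8 for _ in sentences]


def load_real_model():
    # Replace with the real (slow) model load
    raise NotImplementedError("real model loading goes here")


def get_model():
    # USE_DUMMY_MODEL is a hypothetical switch for development runs
    if os.environ.get("USE_DUMMY_MODEL") == "1":
        return DummyModel()
    return load_real_model()


os.environ["USE_DUMMY_MODEL"] = "1"
model = get_model()
print(type(model).__name__)  # DummyModel
```

With this in place, uvicorn’s auto-reload stays fast, and unsetting the variable restores the real model.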

&lt;p&gt;Invoking uvicorn in this way is recommended for development. For production deployments, FastAPI’s docs &lt;a href="https://fastapi.tiangolo.com/deployment/manually/"&gt;recommend using gunicorn&lt;/a&gt; with the uvicorn workers. I haven’t looked into other options in depth. There might be better ways to deploy a production service. For now this has proven to be reliable for my needs. I did have to tweak gunicorn’s configuration to my specific case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running our service with gunicorn
&lt;/h2&gt;

&lt;p&gt;The gunicorn start command looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gunicorn -c gunicorn_conf.py -k uvicorn.workers.UvicornWorker --preload main:app

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the arguments to gunicorn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-k&lt;/code&gt; tells gunicorn to use a specific worker class&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;main:app&lt;/code&gt; instructs gunicorn to load the main module and use &lt;code&gt;app&lt;/code&gt; (in this case the FastAPI instance) as the application code that all workers should be running&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--preload&lt;/code&gt; causes gunicorn to change the worker startup procedure&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Preloading our application
&lt;/h2&gt;

&lt;p&gt;Normally gunicorn would create a number of workers, and then have each worker load the application code. The &lt;a href="https://docs.gunicorn.org/en/stable/settings.html"&gt;&lt;code&gt;--preload&lt;/code&gt; option&lt;/a&gt; inverts the sequence of operations by loading the application instance first and then forking all worker processes. Because of how &lt;code&gt;fork()&lt;/code&gt; works, each worker process will be a copy of the main gunicorn process and will share (part of) the same memory space.&lt;/p&gt;

&lt;p&gt;Making our ML model part of the FastAPI application (or making our model load when the FastAPI application is first created) will cause our &lt;code&gt;model&lt;/code&gt; variable to be “shared” across all processes!&lt;/p&gt;

&lt;p&gt;The effect of this change is massive. If our model, once loaded into memory, occupies 1 Gb of RAM, and we want to run 4 gunicorn workers, the net gain is 3 Gb of memory that we will have available for other uses. In a container-based deployment, it is especially important to keep the memory usage low. Reclaiming 75% of the total memory that would otherwise be used is an excellent result.&lt;/p&gt;
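&lt;p&gt;As a quick sanity check of that arithmetic:&lt;/p&gt;

```python
model_gb = 1.0  # memory taken by one loaded model
workers = 4     # gunicorn worker processes

without_preload = model_gb * workers  # every worker holds its own copy
with_preload = model_gb               # one copy, shared via fork()

saved = without_preload - with_preload
print(saved)  # 3.0 GB reclaimed, i.e. 75% of the original footprint
```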

&lt;p&gt;I don’t know enough details about PyTorch models or Python itself to understand how this sharing remains valid across the process lifetime. I believe that modifying the model in any way will cause copy-on-write operations and, ultimately, the model variable to be copied into each process’s memory space.&lt;/p&gt;
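&lt;p&gt;A minimal (POSIX-only) illustration of that copy-on-write behaviour: a child process that writes to inherited data only modifies its own private copy, leaving the parent untouched:&lt;/p&gt;

```python
import os

data = [0] * 5  # stands in for the large in-memory model

pid = os.fork()
if pid == 0:
    # Child: this write triggers copy-on-write on the touched pages,
    # so the child now pays for its own copy of that memory
    data[0] = 99
    os._exit(0)

os.waitpid(pid, 0)
print(data[0])  # 0 — the parent's copy is unchanged
```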

&lt;h2&gt;
  
  
  Complications
&lt;/h2&gt;

&lt;p&gt;Turns out we don’t get this advantage for free. There are a few complications with having a PyTorch model shared across different processes. The &lt;a href="https://pytorch.org/docs/stable/notes/multiprocessing.html"&gt;PyTorch documentation&lt;/a&gt; covers them in detail, even though I’m not sure I did in fact understand all of it.&lt;/p&gt;

&lt;p&gt;In my project I tried several approaches, without success:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use &lt;code&gt;pytorch.multiprocessing&lt;/code&gt; in the &lt;code&gt;gunicorn&lt;/code&gt; configuration module&lt;/li&gt;
&lt;li&gt;modify &lt;code&gt;gunicorn&lt;/code&gt; itself (!) to use &lt;code&gt;pytorch.multiprocessing&lt;/code&gt; to load the model. I did it just as a prototype, but even then… bad idea&lt;/li&gt;
&lt;li&gt;investigate alternative worker models instead of prefork. I don’t remember the results of this investigation, but they must have been unsuccessful&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;/dev/shm&lt;/code&gt; (Linux shared memory tmpfs) as a filesystem where to store the Pytorch model file&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A Solution?
&lt;/h2&gt;

&lt;p&gt;The approach I ended up using is the following.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gunicorn&lt;/code&gt; must create the FastAPI application to start it, so I loaded the model into a global before the FastAPI application instance is created, making sure it is loaded only once per process.&lt;/p&gt;

&lt;p&gt;I added the &lt;code&gt;preload_app = True&lt;/code&gt; option to gunicorn’s configuration module.&lt;/p&gt;

&lt;p&gt;I limited the number of workers (my tests showed 3 to work best for my use case), and limited the number of requests each gunicorn worker will serve, using &lt;code&gt;max_requests = 50&lt;/code&gt;. I limited the requests because I noticed a sudden increase in memory usage in each worker, regularly some minutes after startup. I couldn’t trace it back to anything specific, so I used this dirty workaround.&lt;/p&gt;

&lt;p&gt;Another tweak was to allow the gunicorn workers a longer-than-default startup time; otherwise gunicorn’s own watchdog would kill and respawn them because they were taking too long to load the ML model on startup. I used a timeout of 60 seconds instead of the default 30.&lt;/p&gt;

&lt;p&gt;The most difficult problem to troubleshoot was workers suddenly stopping and not serving any more requests after a short while. I solved that by not using &lt;code&gt;async&lt;/code&gt; on my FastAPI application methods. Other people &lt;a href="https://github.com/tiangolo/fastapi/issues/2425#issuecomment-738042275"&gt;have reported this solution not working for them&lt;/a&gt;… This remains to be understood.&lt;/p&gt;

&lt;p&gt;Lastly, when loading the Pytorch model, I used the &lt;code&gt;.eval()&lt;/code&gt; and &lt;a href="https://pytorch.org/docs/stable/multiprocessing.html"&gt;&lt;code&gt;.share_memory()&lt;/code&gt;&lt;/a&gt; methods on it, before returning it to the FastAPI application. This is happening just on first load.&lt;/p&gt;

&lt;p&gt;For example, this is what my model loading looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_language_model() -&amp;gt; SentenceTransformer:
    language_model = SentenceTransformer(SOME_MODEL_NAME)
    language_model.eval()
    language_model.share_memory()

    return language_model

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value returned by this method is assigned to a global loaded before the FastAPI application instance is created.&lt;/p&gt;

&lt;p&gt;I doubt this is &lt;em&gt;the&lt;/em&gt; way to do things, but I did not find any clear guide on how to do this. Information about deploying production models seems quite scarce, if you remember the premise to this post.&lt;/p&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;preload_app = True&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Load the ML model before the FastAPI (or wsgi) application is created&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.eval()&lt;/code&gt; and &lt;a href="https://pytorch.org/docs/stable/multiprocessing.html"&gt;&lt;code&gt;.share_memory()&lt;/code&gt;&lt;/a&gt; if your model is PyTorch-based&lt;/li&gt;
&lt;li&gt;Limit the number of workers and the number of requests each worker serves&lt;/li&gt;
&lt;li&gt;Increase the worker start timeout period&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read on for other tips about dockerization of all this. But first…&lt;/p&gt;

&lt;h2&gt;
  
  
  Gunicorn configuration
&lt;/h2&gt;

&lt;p&gt;Here’s more or less all the customizations needed for the gunicorn configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Preload the FastAPI application, so we can load the PyTorch model
# in the parent gunicorn process and share its memory with all the workers
preload_app = True

# Limit the amount of requests a single worker will handle, so as to
# curtail the increase in memory usage of each worker process
max_requests = 50

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bundling model and application in a Docker container
&lt;/h2&gt;

&lt;p&gt;Your choice of deployment target might be different. What I used for our production environment is a &lt;code&gt;Dockerfile&lt;/code&gt;. It works well as a development option, and it is also a good fit for production if, like me, you deploy to a platform such as &lt;a href="https://kubernetes.io"&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Initially I tried to build a Dockerfile with everything I needed. I kept the PyTorch model file as a binary in the git repository. The binary was larger than 500 MB, which required the use of &lt;a href="https://git-lfs.github.com/"&gt;git-lfs&lt;/a&gt;, at least for Github repositories. I found that to be a problem when trying to build Docker containers from Github Actions: I couldn’t easily reconstruct the git-lfs objects at build time. Another shortcoming of this approach is that the large model file makes the Docker build context huge, increasing build times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two stage Docker build
&lt;/h2&gt;

&lt;p&gt;In cases like this, splitting the Docker build in two stages can help. I decided to bundle the large model binary into a first stage Docker container, and then build up my application layer on top as stage two.&lt;/p&gt;

&lt;p&gt;Here’s how it works in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- Dockerfile.stage1

# https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

# Install PyTorch CPU version
# https://pytorch.org/get-started/locally/#linux-pip
RUN pip3 install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# Here I'm using sentence_transformers, but you can use any library you need
# and make it download the model you plan using, or just copy/download it
# as appropriate. The resulting docker image should have the model bundled.
RUN pip3 install sentence_transformers==0.3.8
RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("your-model-name")'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build and push this container image to your docker container registry as &lt;code&gt;stage1&lt;/code&gt; tag.&lt;/p&gt;

&lt;p&gt;After that, you can build your stage2 docker image starting from the stage1 image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- Dockerfile
FROM $(REGISTRY)/$(PROJECT):stage1

# Gunicorn config uses these env variables by default
ENV LOG_LEVEL=info
ENV MAX_WORKERS=3
ENV PORT=8000

# Give the workers enough time to load the language model (30s is not enough)
ENV TIMEOUT=60

# Install all the other required python dependencies
COPY ./requirements.txt /app
RUN pip3 install -r /app/requirements.txt

COPY ./config/gunicorn_conf.py /gunicorn_conf.py
COPY ./src /app
# COPY ./tests /tests

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may need to increase the runtime shared memory to be able to load the ML model in a preload scenario.&lt;br&gt;&lt;br&gt;
If that’s the case, or if you get errors on model load when running your project in Docker or Kubernetes, run docker with a &lt;code&gt;--shm-size&lt;/code&gt; suitable for your own model, for example &lt;code&gt;--shm-size=1.75G&lt;/code&gt;, as in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --shm-size=1.75G --rm &amp;lt;command&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The equivalent directive for a helm chart to deploy in Kubernetes is (WARNING: POSSIBLY MANGLED YAML AHEAD):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      volumes:
        - name: modelsharedmem
          emptyDir:
            sizeLimit: "1750Mi"
            medium: "Memory"
      containers:
        - name: {{ .Chart.Name }}
          ...
          volumeMounts:
            - name: modelsharedmem
              mountPath: /dev/shm
          ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A Makefile to bind it all together
&lt;/h2&gt;

&lt;p&gt;I like to add a &lt;code&gt;Makefile&lt;/code&gt; to my projects, to create a memory of the commands needed to start a server, run tests or build containers. I don’t need to use brain power to memorize any of that, and it’s easy for colleagues to understand what commands are used for which purpose.&lt;/p&gt;

&lt;p&gt;Here’s my sample Makefile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# --- Makefile
PROJECT=myproject
BRANCH=main
REGISTRY=your.docker.registry/project

.PHONY: docker docker-push start test

start:
    ./scripts/start.sh

# Stage 1 image is used to avoid downloading 2 Gb of PyTorch + nlp models
# every time we build our container
docker-stage1:
    docker build -t $(REGISTRY)/$(PROJECT):stage1 -f Dockerfile.stage1 .
    docker push $(REGISTRY)/$(PROJECT):stage1

docker:
    docker build -t $(REGISTRY)/$(PROJECT):$(BRANCH) .

docker-push:
    docker push $(REGISTRY)/$(PROJECT):$(BRANCH)

test:
    JSON_LOGS=False ./scripts/test.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Other observations
&lt;/h2&gt;

&lt;p&gt;I had initially opted for Python 3.7, but I tried upgrading to Python 3.8 because of a comment on a related FastAPI issue on Github, and in my tests I found that Python 3.8 uses slightly less memory than Python 3.7 over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  See also
&lt;/h2&gt;

&lt;p&gt;I published a sample repository to get started with a project like the one I just described: &lt;a href="https://github.com/cosimo/fastapi-ml-api"&gt;https://github.com/cosimo/fastapi-ml-api&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And these are the links to issues I either followed or commented on while researching my solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/tiangolo/fastapi/issues/596"&gt;https://github.com/tiangolo/fastapi/issues/596&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tiangolo/fastapi/issues/2425"&gt;https://github.com/tiangolo/fastapi/issues/2425&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pytorch/pytorch/issues/49555"&gt;https://github.com/pytorch/pytorch/issues/49555&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pytorch/pytorch/issues/16943"&gt;https://github.com/pytorch/pytorch/issues/16943&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, please let me know if you have any feedback on any of this, or you are doing things differently, etc... Thanks!&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>python</category>
      <category>pytorch</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Perl echo chamber, marketing and … is Perl really dying?</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Sun, 20 Jun 2021 19:29:52 +0000</pubDate>
      <link>https://dev.to/cosimo/the-perl-echo-chamber-marketing-and-is-perl-really-dying-4cp9</link>
      <guid>https://dev.to/cosimo/the-perl-echo-chamber-marketing-and-is-perl-really-dying-4cp9</guid>
      <description>&lt;p&gt;Recently I came across &lt;a href="https://twitter.com/OvidPerl/status/1406653683006386180?s=20"&gt;this tweet from Curtis/Ovid&lt;/a&gt;, which references a &lt;a href="https://www.nntp.perl.org/group/perl.perl5.porters/2021/06/msg260597.html"&gt;longer post&lt;/a&gt; about a proposal to integrate &lt;a href="https://github.com/Ovid/Cor/wiki"&gt;a better, more modern object-oriented “system”&lt;/a&gt; (Corinna) in Perl 5.&lt;/p&gt;

&lt;p&gt;The proposal itself is &lt;strong&gt;not&lt;/strong&gt; what I’d like to address here. I haven’t followed Corinna’s evolution. I believe it goes in a positive direction for the language, FWIW.&lt;/p&gt;

&lt;p&gt;From that original tweet, a comment from &lt;a href="https://twitter.com/consttype"&gt;Rafael&lt;/a&gt; followed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[…] but I’m still wondering what are the real factors that make companies seek an exit strategy from Perl 5. Who makes this kind of expensive decision, and why? I suspect obscure OO syntax is not a major one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what I replied with:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is indicative of the fundamental problem in the Perl echo chamber. Some people still have no idea why companies are moving away from Perl. If you want to hear the perspective from someone who has seen this happen in multiple companies, let me know :-)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sorry for this premise, but I was afraid what follows would make no sense otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Perl dying today?
&lt;/h2&gt;

&lt;p&gt;First of all, I don’t think “X is dying” is a useful question to ask, nor is it indicative of anything particularly interesting. I’m sure everyone reading this will have encountered plenty of “C is dying”, “Java is dying” or similar, and yet C and Java are still used everywhere. In one sense, no language ever really dies. Perl’s situation is slightly different, though, as (I believe) Python slowly conquered Perl’s space over time.&lt;/p&gt;

&lt;p&gt;What does it mean for a language to die, or to be dead?&lt;/p&gt;

&lt;p&gt;From an &lt;strong&gt;end user’s point of view&lt;/strong&gt;, say a random programmer, employed at a company or freelance, a language could be dying if a task they want to accomplish with that language is hard because there are no supporting libraries for it (think CPAN or PyPI), or the libraries are so old they don’t work anymore. That situation surely conveys the idea that the language is no longer in use, or that very few people are using it. One would expect a common task in 2021 to be easy to accomplish with a language worth using in 2021.&lt;/p&gt;

&lt;p&gt;What about a &lt;strong&gt;company’s&lt;/strong&gt; point of view? The reality is that companies don’t have opinions on languages; only people do. Teams do, though, and the group dynamics inside a team influence which languages are acceptable for current and new projects.&lt;/p&gt;

&lt;p&gt;Is Perl dying then?&lt;/p&gt;

&lt;h2&gt;
  
  
  My experience
&lt;/h2&gt;

&lt;p&gt;Some years ago I was a &lt;a href="https://metacpan.org/author/COSIMO"&gt;fairly active member of the Perl community&lt;/a&gt;: I attended and presented at various Perl conferences around Europe, talking about my experience using Perl at a few &lt;a href="https://www.over-log.it/"&gt;small&lt;/a&gt; and &lt;a href="https://opera.com"&gt;large&lt;/a&gt; companies.&lt;/p&gt;

&lt;p&gt;I remember picking up Perl for the first time based on a suggestion from my manager back then. He gave me a hard copy print-out of the whole of the Perl 5.004 man pages and said: “We are going to use this language. It’s amazing, take some time to study it and we’ll start!”. This was 1998, and I had such a fantastic time :-). I was such a noob, but Perl was amazing. It could do everything you needed and then some, and it was easy and simple. The language was fast already back then, and it got faster over time. At that point I was working in a very small company, three people initially, and we ended up writing a complete web framework from scratch that is still in use today, after more than 20 years. If that’s not phenomenal, I don’t know what is. It’d be worth writing about this framework some day: it was more advanced than much of what’s around even in 2021… a story for another time.&lt;/p&gt;

&lt;p&gt;And by the way, we were running our Perl code on &lt;em&gt;anything&lt;/em&gt;, and I mean anything: Windows PCs, Linux, Netware and even AS/400 (a limited subset of it, at least), at a time when Java’s “write once, run everywhere” was just an empty marketing promise. Remember, this was the time of Netscape Navigator and Java applets. Ramblings, I know, but perhaps useful to understand where things went wrong.&lt;/p&gt;

&lt;p&gt;In 2007, I left my job in Italy and moved to Norway to work for Opera Software. Back then, Opera’s browser was still running the Presto engine, and a little department inside Opera was in charge of web services. That’s where I was headed. Most services there were written in Perl. Glorious times for me: I would learn an awful lot there and meet many skilled developers. Soon after I started, some colleagues were already making fun of Perl: it’s a “write-only language”, “not meant for serious stuff”, it “lacks web frameworks”, etc… Those were the times when Python frameworks started to emerge, some of which would eventually disappear. I remember a few colleagues strongly arguing to move to a Python framework called &lt;a href="https://www.pylonsproject.org/"&gt;Pylons&lt;/a&gt;, and then eventually to Django.&lt;/p&gt;

&lt;p&gt;I believe this general attitude towards Perl originated from different factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;personal preference towards other languages and/or dislike towards Perl&lt;/li&gt;
&lt;li&gt;the desire to be working with the latest “hip” framework or language&lt;/li&gt;
&lt;li&gt;the discomfort of maintaining an aging codebase with problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These factors exist and are legitimate reasons to want to move away from any language or framework. I’m not saying they are justified, but I do understand why people wanted that. In our field, it’s quite common to try to escape the objective difficulties of maintaining a legacy project by taking the greener path of an overly optimistic rewrite, which normally ends in tears.&lt;/p&gt;

&lt;p&gt;Throughout the years, I noticed other contributing factors to the progressive abandonment of Perl, even in companies like Opera.&lt;br&gt;&lt;br&gt;
I’ll mention two that I experienced directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Outdated or non-existent supporting libraries&lt;/li&gt;
&lt;li&gt;Team composition&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There was a time, a few years ago, when CPAN was awesome: the best language support system in existence, one that every other language community envied. CPAN pretty much sold Perl by itself. In my case, the libraries on CPAN educated me and instilled a testing culture that no other language (to my knowledge) had before Perl. Today, seeing npm modules installed without running their tests makes me uncomfortable :-)&lt;/p&gt;

&lt;p&gt;Then, over the years, a shift happened. You would search CPAN for a library to help with a common task and find nothing, or only quick hacks that didn’t really work properly. In my case, I remember the first example of that being OAuth2. If I had to speculate, I would say this is the product of many elements, one of which is the rising average age of Perl programmers.&lt;/p&gt;

&lt;p&gt;Another related shift I remember from those years: companies publishing APIs/SDKs started dismissing Perl, at first relying on some CPAN module to eventually appear, then omitting Perl support entirely. In the beginning we politely complained to those companies, trying to make a point, but there was no turning back. These days almost no SDK comes with a Perl component.&lt;/p&gt;

&lt;p&gt;The second major aspect I have experienced is related to teams. In 2012 I was tasked with writing my first ever greenfield project, entirely from scratch, a project that would turn out to be one of the things I’m most proud of: &lt;strong&gt;Opera Discover&lt;/strong&gt;, an online news recommendation system for the Opera browser, still working today! A team of three veteran engineers (myself included) was assembled, and there and then we faced a decision: what language should we use for this?&lt;/p&gt;

&lt;p&gt;While I was most experienced in Perl and knew a little Python, the other two colleagues didn’t know Perl. They had experience mostly in C++, as this was Opera after all. We were chosen not for our programming-language expertise but rather (I suppose) for our ability to tackle such a big and complex project. While I could have proposed that the project be written in Perl, in good conscience I knew that choice was not viable. Django was readily available and provided a wide range of functionality we actually needed. No alternative in the Perl world came close to such a good value proposition. The fact that Python was (like Perl had been for me!) a very accessible choice, simple to pick up, easily installed on any Linux system, and with plenty of solid, up-to-date libraries, made the decision obvious.&lt;/p&gt;

&lt;p&gt;With the Discover project, I started learning Python properly as a day-to-day programming language. I remember initially being horrified by (and making fun of) the urllib2/urllib3 situation. Then I learned about the requests module and forgot all about it. This is to say, Python has its quirks too, of course. The disastrous Python 2 vs Python 3 transition caused a lot of grief and uncertainty in the Python community (Perl could have learned something from that…). Nowadays that’s a non-argument: everything runs on Python 3, and if you still haven’t moved, you will soon.&lt;/p&gt;

&lt;p&gt;In general, having learned Python quite well, my mindset about programming and my job changed completely. I’m not a Perl programmer. I’m not a Python programmer either. I can use different tools whenever they are better suited to what I need to do. In fact, in the last four years I have written software in NodeJS and Java of all things… I used to despise and make fun of Java, but I had never worked on a professional Java project before. While I maintain that Java has some horrible aspects, contrary to my expectations I have enjoyed working with it: it has an efficient runtime, awesome threading, solid libraries and debugging/inspection tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;While I do understand Ovid’s point about wanting to keep the business going, and while I enjoy Perl as a language, I personally moved on many years ago. I still use Perl for the occasional script when it’s convenient, but for other use cases, like web APIs, I prefer Python and FastAPI, PyTorch for machine learning, and so on. My conclusion is that it’s the libraries and the ecosystem that drive language use, &lt;strong&gt;not the language itself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A better OO system will unfortunately do nothing for Perl (in my opinion at least). Better marketing will without a doubt do nothing for Perl. As if a prettier website could change the situation and the aspects I talked about… it can’t! The situation we have in front of us in 2021 is the result of technological and social changes started at least a decade ago.&lt;/p&gt;

&lt;p&gt;I realize this may be an incoherent post. Sorry about that; I had to write it in one sitting or it would probably never have come out.&lt;br&gt;&lt;br&gt;
If you have questions or comments, let me know and I’ll try to address them.&lt;/p&gt;

&lt;p&gt;Most importantly, I do not wish to convince anyone that what I wrote is true. It is simply my experience. If there’s one thing I wish people would take from it, it’s to move away from thinking of yourself as an “X programmer” and to broaden your horizons and the set of tools available to you. It was a tremendously positive move for me, one I wish I had made sooner.&lt;/p&gt;

&lt;p&gt;Peace.&lt;/p&gt;

</description>
      <category>development</category>
      <category>languages</category>
      <category>marketing</category>
      <category>perl</category>
    </item>
    <item>
      <title>How to setup kubectl from scratch with existing clusters</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Mon, 29 Mar 2021 11:05:25 +0000</pubDate>
      <link>https://dev.to/cosimo/how-to-setup-kubectl-from-scratch-with-existing-clusters-2fm8</link>
      <guid>https://dev.to/cosimo/how-to-setup-kubectl-from-scratch-with-existing-clusters-2fm8</guid>
      <description>&lt;p&gt;If you are just starting to use kubectl or Kubernetes, you will need to set up gcloud and kubectl before you can interact with your Kubernetes clusters. This quick post explains how to do that. It could also be of use if your kubectl stopped working for unknown reasons, which it might do from time to time.&lt;/p&gt;

&lt;p&gt;I wrote this quick post because all the instructions you find online assume you are working on a toy project and want to create your own test Kubernetes cluster. That is not what most people need, in my opinion, so I thought I'd write this. Hopefully it's useful to at least one other person out there :-)&lt;/p&gt;

&lt;h2&gt;
  
  
  Step by step instructions
&lt;/h2&gt;

&lt;p&gt;Set up the Google Cloud SDK if you haven’t yet. Follow the instructions here: &lt;a href="https://cloud.google.com/sdk/docs/install"&gt;https://cloud.google.com/sdk/docs/install&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typing &lt;code&gt;gcloud --version&lt;/code&gt; should output something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gcloud --version
Google Cloud SDK 291.0.0
alpha 2020.05.01
beta 2020.05.01
bq 2.0.57
core 2020.05.01
gsutil 4.50
kubectl 2020.05.01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install &lt;code&gt;kubectl&lt;/code&gt; by following instructions here: &lt;a href="https://kubernetes.io/docs/tasks/tools/"&gt;https://kubernetes.io/docs/tasks/tools/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typing &lt;code&gt;kubectl version --client&lt;/code&gt; should output something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl version --client
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", 
  GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean",
  BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc",
  Platform:"linux/amd64"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back up any existing &lt;code&gt;~/.kube/config&lt;/code&gt; file you might have from before and move it to a temporary directory.&lt;/p&gt;

&lt;p&gt;Type the following command to fetch the kubernetes configuration for your existing clusters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gcloud container clusters get-credentials &amp;lt;cluster-name&amp;gt; --region &amp;lt;region&amp;gt; --project &amp;lt;kubernetes-project&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and repeat that for each project and cluster you want to manage.&lt;/p&gt;

&lt;p&gt;If the step above fails, you likely don’t have the necessary permissions to access the desired kubernetes cluster. Ask someone to help if possible :-)&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;~/.kube/config&lt;/code&gt; file will now contain the configuration for those clusters. Each cluster has an endpoint URL, like &lt;code&gt;https://11.22.33.44&lt;/code&gt;. If you access your Kubernetes or work environment via a VPN, ensure your routes are set up so those endpoints are reached via the VPN itself, or your kubectl commands won’t be able to connect to your Kubernetes clusters!&lt;/p&gt;
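&lt;p&gt;As a quick sketch, you can list those endpoint URLs straight from the config file. The example below builds a tiny stand-in kubeconfig with made-up values so it is self-contained; with a real setup you would point grep at &lt;code&gt;~/.kube/config&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Create a minimal stand-in kubeconfig (illustrative values only)
printf 'clusters:\n- cluster:\n    server: https://11.22.33.44\n  name: demo\n' > /tmp/kubeconfig-demo

# Each cluster entry carries a "server:" key with the endpoint URL
grep 'server:' /tmp/kubeconfig-demo
```

&lt;p&gt;These are exactly the addresses that must be routable from your machine, via VPN or otherwise.&lt;/p&gt;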

&lt;p&gt;You can confirm that everything is working by using a sample command like &lt;code&gt;kubectl get pods&lt;/code&gt;. That command should return a list of pod names, like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
somestuff-7bbd5fd8bf-bb27k   1/1     Running   0          36d
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! You can now do some real damage with random kubectl commands. Have fun! :-)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kubectl</category>
      <category>devops</category>
    </item>
    <item>
      <title>pgtop – a top clone for PostgreSQL</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Wed, 09 Dec 2020 10:18:06 +0000</pubDate>
      <link>https://dev.to/cosimo/pgtop-a-top-clone-for-postgresql-319</link>
      <guid>https://dev.to/cosimo/pgtop-a-top-clone-for-postgresql-319</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuqvwca4a1dwzgghxhibk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuqvwca4a1dwzgghxhibk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://metacpan.org" rel="noopener noreferrer"&gt;meta::cpan&lt;/a&gt; records, the &lt;a href="https://metacpan.org/release/COSIMO/pgtop-0.02" rel="noopener noreferrer"&gt;first release of pgtop&lt;/a&gt; is dated April 26, 2005, which makes this little software &lt;strong&gt;more than 15 years old&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Back then I had just found out about the brilliant &lt;a href="http://jeremy.zawodny.com/mysql/mytop/" rel="noopener noreferrer"&gt;mytop&lt;/a&gt; by Jeremy Zawodny, and my day-to-day experience being on Postgres, IIRC version 6.5.3, I decided to try and “convert” mytop to Postgres.&lt;/p&gt;

&lt;p&gt;Being quite naive, I thought the endeavour would be much easier than it really was. I’m glad I started, though, which is why pgtop exists in the first place. It’s not the only one, either: I seem to remember a few similar pgtop projects by other programmers.&lt;/p&gt;

&lt;p&gt;After using MySQL and Percona Server for many years, due to a new job, I have gone back to Postgres, version 9.5 and 10 at this time. In recent months, I have done some work to improve performance of our database queries, and remembered writing and using pgtop years before.&lt;/p&gt;

&lt;p&gt;Since I lost(*) the original sources, I tried the pgtop version I last uploaded to CPAN, 0.05, dated 2008. It did work, in the sense that &lt;strong&gt;I could run the same perl code unmodified, a great testament to Perl as language and as runtime&lt;/strong&gt;. It didn’t work because the underlying Postgres meta tables that were used in version 6 changed their schema in the 10-12 years since :-)&lt;/p&gt;

&lt;p&gt;I spent some time adapting the metadata queries to work with recent Postgres versions, and was slightly amused by the quality of my 15-year-old code… The best feeling about reviving this little tool was to &lt;strong&gt;rediscover how useful a few dozen lines of code can be&lt;/strong&gt;. Our service provider’s monitoring helps, but doesn’t come close to the level of detail pgtop can provide.&lt;/p&gt;

&lt;p&gt;After getting pgtop to work again, I quickly added a few more useful features. I was pleased by the efficiency with which I could work on this tool, considering its age.&lt;/p&gt;

&lt;p&gt;So far I added just what was strictly necessary to me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated pgtop to the current decade. Now requires perl &amp;gt;= 5.014&lt;/li&gt;
&lt;li&gt;Fixed to work with Postgres &amp;gt;= 9.0&lt;/li&gt;
&lt;li&gt;Added a sample Dockerfile to build and run pgtop as Docker container&lt;/li&gt;
&lt;li&gt;Added a &lt;code&gt;--config&lt;/code&gt; option, to load arbitrary config files. This is useful if you want to monitor several databases at once, for example in a tmux session. The config file supports all the options that are available on the command line.&lt;/li&gt;
&lt;li&gt;Implemented a query-killer command, activated by pressing K, which kills at once all queries slower than a given threshold, in seconds. This is useful if the database is overwhelmed by a lot of slow queries. I don’t recommend using it, &lt;strong&gt;particularly if it involves killing UPDATE or INSERT queries&lt;/strong&gt;, but it can be quite handy.&lt;/li&gt;
&lt;li&gt;Added a &lt;code&gt;--slow_threshold&lt;/code&gt; option, to consider queries slow if they have been running for longer than the given value (in seconds). The tool now highlights slow queries in bold yellow, and logs all slow queries to a &lt;code&gt;pgtop.log&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Added a &lt;code&gt;--slack_webhook&lt;/code&gt; option, to automatically notify a Slack channel if a query crosses the slow-threshold runtime value. All the information about the slow query, including the SQL, is included in the Slack message.&lt;/li&gt;
&lt;/ul&gt;
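&lt;p&gt;To give an idea of the &lt;code&gt;--config&lt;/code&gt; option, a config file simply mirrors the command-line options. The snippet below is a hypothetical example, not the tool’s documented syntax (check the module’s documentation for that):&lt;/p&gt;

```
# hypothetical pgtop config file: one monitored database per file,
# option names mirroring the command-line flags
slow_threshold = 5
slack_webhook  = https://hooks.slack.com/services/T000/B000/XXXXXXXX
```

&lt;p&gt;Running one pgtop instance per config file, each in its own tmux pane, matches the multi-database use case described above.&lt;/p&gt;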

&lt;p&gt;Please let me know if you give it a try! :-)&lt;/p&gt;

&lt;p&gt;Download here: &lt;a href="https://metacpan.org/release/COSIMO/pgtop-0.11" rel="noopener noreferrer"&gt;https://metacpan.org/release/COSIMO/pgtop-0.11&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>operations</category>
      <category>postgres</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>I tried dev.to and I do understand now</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Fri, 10 Jul 2020 09:54:20 +0000</pubDate>
      <link>https://dev.to/cosimo/i-tried-dev-to-and-i-do-understand-now-3bmb</link>
      <guid>https://dev.to/cosimo/i-tried-dev-to-and-i-do-understand-now-3bmb</guid>
      <description>&lt;p&gt;I have been programming professionally for more than 20 years now.&lt;/p&gt;

&lt;p&gt;I wasn't familiar with this community. I decided to give it a go about a month ago, imported a few of my old blog posts, added some new ones, longer posts, quick ones, etc...&lt;/p&gt;

&lt;p&gt;My posts are not popular by any means, I usually write about niche backend development stuff that's not very fashionable :-)&lt;/p&gt;

&lt;p&gt;The one article that gave me the opportunity to approach dev.to was &lt;a href="https://dev.to/cosimo/five-tips-to-be-a-more-effective-command-line-user-44e2"&gt;Five tips to be a more effective command line user&lt;/a&gt;, which I &lt;strong&gt;published here for the first time&lt;/strong&gt;. It probably got picked up in some feed, and a user posted it on reddit a few days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That made me understand the whole point of this community&lt;/strong&gt; and why people like it. The same exact post got thousands of views on reddit and mostly positive reactions, but also &lt;strong&gt;a few scolding and rude comments&lt;/strong&gt; from people who had not even read the whole article, since they were complaining about things I did explain.&lt;/p&gt;

&lt;p&gt;Here on dev.to I got barely a few hundred views, but polite and relevant comments from people who had clearly read and appreciated it all, suggesting alternatives or simply sharing their own view of the approaches expressed in the post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I like how welcoming and polite this community feels.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Jumping in and out of your editor too much?</title>
      <dc:creator>Cosimo Streppone</dc:creator>
      <pubDate>Fri, 12 Jun 2020 14:35:39 +0000</pubDate>
      <link>https://dev.to/cosimo/jumping-in-and-out-of-your-editor-too-much-2305</link>
      <guid>https://dev.to/cosimo/jumping-in-and-out-of-your-editor-too-much-2305</guid>
      <description>&lt;p&gt;Quick post! Hopefully useful as it's been for me.&lt;/p&gt;

&lt;p&gt;Have you ever been caught in the loop of &lt;strong&gt;editing a file&lt;/strong&gt; in vim (or any other editor), and then &lt;strong&gt;testing your changes&lt;/strong&gt; quickly from the command line, and doing it over and over for dozens of times?&lt;/p&gt;

&lt;p&gt;I have, many times. Sometimes you don't even need to do that. For example, vim allows you to invoke a command line tool with &lt;code&gt;:!program-name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Still, sometimes one falls into that vicious &lt;em&gt;edit-run-edit&lt;/em&gt; cycle without thinking too much. If that's your case, here's a simple tip you can try. &lt;em&gt;Note: you need to be on the console for this to work&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When you're done with your changes, save the file and then use &lt;code&gt;CTRL + Z&lt;/code&gt; to suspend the editor process and drop to the command line, where you will have your command waiting for you already.&lt;/p&gt;

&lt;p&gt;Run the command, and then use &lt;code&gt;fg&lt;/code&gt; to return to the editor process exactly as you left it.&lt;/p&gt;

&lt;p&gt;The edit-run cycle will be a lot faster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fg&lt;/code&gt; (foreground) and &lt;code&gt;bg&lt;/code&gt; (background) are useful commands to "control the flow" of the processes you launch from the command line.&lt;/p&gt;
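&lt;p&gt;The whole cycle looks something like this at the terminal (a session sketch; the file and command names are made up):&lt;/p&gt;

```
$ vim app.py          # edit, then save
CTRL+Z                # suspend the editor
[1]+  Stopped         vim app.py
$ python app.py       # run your test command
$ fg                  # resume vim exactly where you left it
```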

</description>
      <category>bash</category>
      <category>vim</category>
      <category>linux</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
