<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Skowronski</title>
    <description>The latest articles on DEV Community by Jason Skowronski (@mostlyjason).</description>
    <link>https://dev.to/mostlyjason</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F55648%2Ff5ad13f6-da68-4b9b-aae3-3086955cb02c.jpg</url>
      <title>DEV Community: Jason Skowronski</title>
      <link>https://dev.to/mostlyjason</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mostlyjason"/>
    <language>en</language>
    <item>
      <title>Fear database changes? Get them under control with CI/CD</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Tue, 17 Dec 2019 16:15:31 +0000</pubDate>
      <link>https://dev.to/heroku/fear-database-changes-get-them-under-control-with-ci-cd-44n1</link>
      <guid>https://dev.to/heroku/fear-database-changes-get-them-under-control-with-ci-cd-44n1</guid>
      <description>&lt;p&gt;Developers often fear database changes because a mistake by anyone on your team can lead to a major outage and even data loss. The stakes are higher when changes are not backwards compatible, cannot be rolled back, or impact system performance. This can cause a lack of confidence and slow your team velocity. As a result, database changes are a common failure point in agile and DevOps. &lt;/p&gt;

&lt;p&gt;Databases are often created manually and too often evolve through manual changes, informal processes, and even testing in production. This makes your system more fragile. The solution is to include database changes in your source control and CI/CD pipeline. This lets your team document each change, follow the code review process, test it thoroughly before release, make rollbacks easier, and coordinate with software releases.&lt;/p&gt;

&lt;p&gt;Let’s look at an example of how to include database migrations in your CI/CD process and push a non-backwards-compatible database change successfully. We'll also look at testing your changes, progressive deployments, dealing with rollbacks, and a few helpful tools.&lt;/p&gt;

&lt;h2&gt;What is CI/CD?&lt;/h2&gt;

&lt;p&gt;CI/CD is a cornerstone of modern development and DevOps.&lt;/p&gt;

&lt;p&gt;CI—or Continuous Integration—is the practice of merging all working developer code into a shared repository throughout the day. Its purpose is to prevent integration problems by integrating often and early. Commonly, this integration kicks off an automated build and test.&lt;/p&gt;

&lt;p&gt;CD—or Continuous Delivery—is the practice of building, testing, and releasing software in short cycles, with the aim of ensuring that a working version of the software can be released at any time.&lt;/p&gt;

&lt;h2&gt;Is Your Database Ready For CI/CD?&lt;/h2&gt;

&lt;p&gt;There are several key requirements to having your database ready for CI/CD. First, the database must be reproducible from scratch using one or more SQL scripts. This means that in addition to a script that creates the initial version of your database, you must also maintain scripts that make all required schema updates to your database.&lt;/p&gt;

&lt;p&gt;When you create these scripts, you have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create one script per schema object, then update the corresponding script (state based) when making changes to the object.&lt;/li&gt;
&lt;li&gt; Create one original script that creates the entire database schema. Then, create a series of individual change scripts (migration based) for changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To learn more, check out &lt;a href="https://dev.to/pesse/one-does-not-simply-update-a-database--migration-based-database-development-527d"&gt;this excellent article&lt;/a&gt; on state-based versus migration-based database updates.&lt;/p&gt;
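&lt;p&gt;To make the migration-based option concrete, here is a minimal in-memory sketch of what a migration runner does. The file names, the &lt;code&gt;runSql&lt;/code&gt; callback, and the in-memory applied set are illustrative; real tools persist the applied versions in a tracking table:&lt;/p&gt;

```javascript
// A minimal sketch of the migration-based approach (option 2): apply
// versioned change scripts in order, skipping any that were already
// applied. File names and the runSql callback are illustrative; real
// tools (Flyway, Liquibase, etc.) record the applied set in a tracking
// table in the database rather than in memory.
function pendingMigrations(allScripts, applied) {
  return allScripts
    .filter((name) => !applied.has(name))
    .sort(); // version-prefixed names (001_..., 002_...) sort correctly
}

function applyMigrations(allScripts, applied, runSql) {
  const toRun = pendingMigrations(allScripts, applied);
  for (const script of toRun) {
    runSql(script);      // e.g. shell out to psql -f <script>
    applied.add(script); // record it so the script never runs twice
  }
  return toRun;
}
```

&lt;p&gt;Because applied scripts are skipped, re-running the runner is idempotent, which is the property that makes migrations safe to execute on every deploy.&lt;/p&gt;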

&lt;p&gt;The second requirement for CI/CD is that the database schema (meaning, those scripts we just mentioned), just like your source code, must live in source control. You must treat your database schema changes as a controlled process just as you do with code.&lt;/p&gt;

&lt;p&gt;Third, always back up before performing any database migrations. If you're working with a live production database, consider a &lt;a href="https://devcenter.heroku.com/articles/heroku-postgres-follower-databases"&gt;Postgres follower database&lt;/a&gt; for your migration or upgrade.&lt;/p&gt;

&lt;p&gt;Lastly, changes that involve removing a database object, such as deleting a column as shown below, can be more difficult to deal with due to the loss of data. Many organizations develop strategies to deal with this, such as only allowing additive changes (e.g. adding a column), or having a team of DBAs that deals with such changes.&lt;/p&gt;

&lt;h2&gt;Is Your Team Ready for CI/CD?&lt;/h2&gt;

&lt;p&gt;Perhaps the best process for database changes and database CI/CD is ensuring you have a collaborative effort between DevOps and DBAs. Make sure your DBAs are part of the code review cycle; they can help to identify issues that only they may know about. DBAs have knowledge of the databases in each specific environment, including database-specific dependencies such as ETL load jobs, database maintenance tasks, and more.&lt;/p&gt;

&lt;p&gt;Be sure to consult a database SME in setting up your database for CI/CD, and in any migration process, when possible. Also be sure to follow sensible DevOps processes, such as testing your changes in a test environment, performing backups, mitigating risks, being prepared for rollbacks, and so on.&lt;/p&gt;

&lt;h2&gt;How Your CI Tool Helps With Migrations&lt;/h2&gt;

&lt;p&gt;When you create or update these scripts, and push them to source control, your CI tool (such as Jenkins or Heroku CI) will pull the changes and then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Rebuild your database to the newest version of the scripts in a test or staging environment. Since the database is being rebuilt, be sure to export the lookup/reference data, then import it back to the new schema. Although it is possible to export and import transactional data, transactional data is out of scope for this article. You can &lt;a href="https://www.isaca.org/Journal/archives/2012/Volume-1/Pages/Database-Backup-and-Recovery-Best-Practices.aspx"&gt;read more about best practices here&lt;/a&gt; if interested.&lt;/li&gt;
&lt;li&gt; Run your tests. For testing your database changes, one possible time saver is to have two sets of tests. The first set is a quick test that verifies your build scripts and runs a few basic functional tests (such as referential integrity, stored procedure unit tests, triggers, and so on). The second set includes migration of transactional data (possibly scrubbed production data) to run a more realistic full set of tests.&lt;/li&gt;
&lt;li&gt; Deploy your database changes to your production environment or another selected environment. (Depending on your migration strategy, the CI tool should also simultaneously deploy and test any code changes dependent on the database change.)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Watch Out for These Common Problems&lt;/h2&gt;

&lt;p&gt;In many cases, when you're making a simple schema addition with bidirectionally compatible code, then you can push code and database changes at the same time. This shouldn't be an issue, as rollbacks in our case will be easy and predictable. This is often true when we are dealing with microservices with simple database components.&lt;/p&gt;

&lt;p&gt;However, in many scenarios, serious problems can happen with this simplistic approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Production data may be different from test/stage data and cause unforeseen issues.&lt;/li&gt;
&lt;li&gt;  A large number of changes in both code and database schema may be in the pipeline and need to be deployed simultaneously.&lt;/li&gt;
&lt;li&gt;  CI/CD processes may not be consistent through every environment.&lt;/li&gt;
&lt;li&gt;  You may be under a zero-downtime mandate.&lt;/li&gt;
&lt;li&gt;  Even using tools that help you achieve zero downtime (such as Heroku preboot), you can end up with two versions of the code running simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several strategies for addressing the above issues. Some popular solutions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If your changes are backwards-compatible, then use a tick-tock release pattern. This approach involves releasing the new database column first, then releasing the new code. You can identify problems early in this manner, with minimal production changes. Additionally, the rollback remains small and manageable, and can be accomplished with tools such as Heroku's Postgres rollback, discussed below.&lt;/li&gt;
&lt;li&gt;  If your provider supports it, use a blue/green rollout. In this pattern, an entirely new set of production servers is created side-by-side with the current production servers. Enable database synchronization and use a DNS or a proxy to cut over to the new servers/database. You can roll back by simply changing the proxy back to the original servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;A Simple Migration Example&lt;/h2&gt;

&lt;p&gt;Let’s run through an example based on the migration scripting option explained above. Note that some frameworks (Rails, Django, ORM tools, and so on) abstract out or handle schema creation and migration for you. While the details may differ according to the framework you are using, the below example should still help you to understand these core concepts. For example, you may have a schema configuration file to include in your CI/CD process.&lt;/p&gt;

&lt;p&gt;For our example, we'll use Node.js, Postgres, and GitHub. We'll also use Heroku because it provides convenient tools including &lt;a href="https://devcenter.heroku.com/articles/heroku-ci"&gt;Heroku CI&lt;/a&gt; with deploy scripts for CI/CD, and easy Postgres rollbacks in case we make a mistake. If you need help deploying Node.js and Postgres on Heroku, &lt;a href="https://devcenter.heroku.com/articles/getting-started-with-nodejs?singlepage=true"&gt;here’s a quick walk-through&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the pertinent code for our example. We're going to create a simple database with a single table, and a Node.js file that writes to that database table on load.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Database creation SQL (we have just one simple table):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;id&lt;/span&gt;           &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;firstname&lt;/span&gt;    &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;lastname&lt;/span&gt;     &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;enrolled&lt;/span&gt;     &lt;span class="nb"&gt;char&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;created_at&lt;/span&gt;   &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Node.js&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO users 
  (id,firstname,lastname,enrolled,created_at) 
  values ($1,$2,$3,$4,$5) &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Becky&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Smith&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;y&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once these files are checked into GitHub and our repository is attached to a Heroku app, we can enable the &lt;a href="https://devcenter.heroku.com/articles/heroku-ci"&gt;Heroku CI tool&lt;/a&gt; on the Heroku dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xOmtr3Ur--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/twqhtd8tkh07zflzni5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xOmtr3Ur--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/twqhtd8tkh07zflzni5s.png" alt="Heroku CI on the Heroku Dashboard" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The real work is done by the &lt;a href="https://devcenter.heroku.com/articles/procfile"&gt;Heroku Procfile&lt;/a&gt; and the &lt;a href="https://devcenter.heroku.com/articles/release-phase"&gt;Heroku release phase&lt;/a&gt;. Using those, we can tell the Heroku CI tool to run a database migration SQL file any time a new release is created (in other words, a successful compile). Here is the release line we need to include in the Heroku Procfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;release: bash `./release-tasks.sh`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The content of the release-tasks file includes a list of SQL scripts to run. That list is updated with each release to include the needed schema modifications. For this very simple example, it will point to just one script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &amp;lt;database&amp;gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &amp;lt;user&amp;gt; &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; database/migrate.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(The database password can be supplied as a Heroku environment variable.)&lt;/p&gt;
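&lt;p&gt;On Heroku, the full connection string (including the password) is exposed to the app as the &lt;code&gt;DATABASE_URL&lt;/code&gt; config var, in the form &lt;code&gt;postgres://user:password@host:port/dbname&lt;/code&gt;. As a sketch, the individual pieces can be pulled out with Node's built-in &lt;code&gt;URL&lt;/code&gt; class; the field names returned here are our own choice, not part of any library:&lt;/p&gt;

```javascript
// A sketch of splitting a Heroku-style DATABASE_URL into the parts that
// a CLI tool like psql (or a client that does not accept a connection
// string) needs. The returned field names are illustrative.
function parseDatabaseUrl(databaseUrl) {
  const url = new URL(databaseUrl);
  return {
    user: decodeURIComponent(url.username),
    password: decodeURIComponent(url.password),
    host: url.hostname,
    port: url.port || "5432",        // Postgres default port
    database: url.pathname.slice(1), // strip the leading "/"
  };
}
```

&lt;p&gt;Keeping the credentials in a config var rather than in the repository means the same migration script works unchanged across staging and production.&lt;/p&gt;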

&lt;p&gt;Typically, as we are using the migration-based strategy, we would add additional migration scripts for each set of changes. For a more robust solution, we could use a tool such as Liquibase, &lt;a href="https://pypi.org/project/alembic/"&gt;Alembic&lt;/a&gt;, or &lt;a href="https://flywaydb.org/"&gt;Flyway&lt;/a&gt;. These tools add version control to your database, both generating the necessary change scripts between releases, and giving you the ability to easily roll back changes. For example, Flyway creates scripts that allow you to migrate from any version of your database (including an empty database) to the latest version of the schema.&lt;/p&gt;

&lt;p&gt;To kick off the CI tool, we make two changes: drop a required column, and change the JavaScript to no longer reference that column. First, we update the SQL code in Node.js, taking out the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO users 
  (id,firstname,lastname,created_at) 
  values ($1,$2,$3,$4) &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Becky&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Smith&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we create a migrate.sql file (referenced in the Procfile above) to alter the table and remove the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;enrolled&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we commit the code change and SQL file, and watch the CI magic. First, the integration tests run. If you are using a common testing framework, the Heroku CI tool probably works with your test suite.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QyLaRqTY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/dvnf6v59ody5yl2ow49x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QyLaRqTY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/dvnf6v59ody5yl2ow49x.png" alt="Tests run and pass" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now the CI tool creates a new release and deploys the app, which kicks off the migrate.sql file. (See the middle of the image below.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jYZqVO6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/dd7lfsdczwqa9q6sn86t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jYZqVO6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/dd7lfsdczwqa9q6sn86t.png" alt="CI tool deploy success" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can check to see that the column was removed by inspecting the database through the Heroku CLI tool:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MXqnN1OO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/5ngr7e0phuwvpmwnohit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MXqnN1OO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://thepracticaldev.s3.amazonaws.com/i/5ngr7e0phuwvpmwnohit.png" alt="Heroku CI tool" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It worked! There is no longer a column named 'enrolled'. Our CI tool ran our script and deleted the column.&lt;/p&gt;

&lt;p&gt;Some tools, like Liquibase, keep a detailed list of database changes. These tools allow you to easily see the last set of changes in cases like the above.&lt;/p&gt;

&lt;p&gt;Now, any time that code or an updated migrate.sql is committed in the future, the CI tool will kick off the tests. If the tests pass, this creates a new release and pushes it to staging. When there is a new release, the migrate.sql file runs against the staging database.&lt;/p&gt;

&lt;p&gt;We've taken a simple route here for demonstration purposes, but could have made this process more robust. For instance, when moving a new release to staging, we could wipe out the old version of the database, create a new one from scratch by running the original creation script plus all migration scripts, and then populate the database with any reference data, all through the Procfile and release phase. Also note that for simplicity's sake, we are not running this migration with transactions in progress. In a real-world scenario, &lt;a href="https://devcenter.heroku.com/articles/release-phase#review-apps-and-the-postdeploy-script"&gt;Heroku recommends using an advisory lock&lt;/a&gt; to prevent concurrent migrations.&lt;/p&gt;

&lt;h2&gt;How To Do Rollbacks&lt;/h2&gt;

&lt;p&gt;Even with the best planning and forethought, there will be times when you need to roll back your database. There are many approaches to rolling back failed deployments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create a SQL file that rolls back the changes quickly. (For example, while you are in staging, use a compare utility to generate the script.) This file should be part of the deployment package so that you can quickly run the rollback if there is an error.&lt;/li&gt;
&lt;li&gt;  Roll forward (quickly push a new build that fixes the issue).&lt;/li&gt;
&lt;li&gt;  Rely on source control and labels or branches to recreate and deploy the previous version.&lt;/li&gt;
&lt;li&gt;  Restore a full backup of your database. (Use a tool that ships with your database, such as pg_restore in Postgres.)&lt;/li&gt;
&lt;li&gt;  Use a tool provided by your platform, such as &lt;a href="https://devcenter.heroku.com/articles/heroku-postgres-rollback"&gt;Heroku Postgres Rollback&lt;/a&gt; and &lt;a href="https://devcenter.heroku.com/articles/releases#rollback"&gt;Heroku Release Rollback&lt;/a&gt; for code. As the name implies, Heroku Postgres Rollback allows you to easily roll back your database to a previous point in time, quickly and confidently moving your database back to a working release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be aware that all these solutions come with their own challenges, such as potential loss of new data (restoring a backup or redeploying) and introducing new bugs.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Database changes and migrations can be scary, and can cause serious mistrust. However, if you place your database under CI/CD controls, you can not only confidently migrate your changes, but also move towards a better agile and DevOps experience. This can be as simple as using source control for your database schema, having a good process in place with your DevOps and DBA teams, and using your existing CI tools to test and migrate your databases. Once you establish and train your team on the new process, future changes will be smoother and more automatic than your old manual process.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>devops</category>
      <category>heroku</category>
    </item>
    <item>
      <title>How Stream Processing Makes Your Event-Driven Architecture Even Better</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Wed, 11 Dec 2019 16:29:02 +0000</pubDate>
      <link>https://dev.to/heroku/how-stream-processing-makes-your-event-driven-architecture-even-better-5ehg</link>
      <guid>https://dev.to/heroku/how-stream-processing-makes-your-event-driven-architecture-even-better-5ehg</guid>
      <description>&lt;p&gt;If you’re an architect or developer looking at event-driven architectures, stream processing might be just what you need to make your app faster, more scalable, and more decoupled.&lt;/p&gt;

&lt;p&gt;In this article—the third in a series about event-driven architectures—we will review a little of &lt;a href="https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7"&gt;the first article in the series,&lt;/a&gt; which outlined the benefits of event-driven architectures, some of the options, and a few patterns and anti-patterns. We will also review the &lt;a href="https://dev.to/heroku/scale-your-apps-with-an-easy-message-queue-on-redis-4glp"&gt;second article&lt;/a&gt;, which provided more detail on message queues and deployed a quick-start message queue using Redis and RSMQ.&lt;/p&gt;

&lt;p&gt;This article will also dive deeper into stream processing. We will discuss why you might pick stream processing as your architecture, some of the pros and cons, and a quick-to-deploy reference architecture using &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;What is an Event-Driven Architecture?&lt;/h1&gt;

&lt;p&gt;Stream processing is a type of event-driven architecture. In event-driven architectures, when a component performs some piece of work that other components might be interested in, that component (called a producer) produces an event—a record of the performed action. Other components (called consumers) consume those events so that they can perform their own tasks as a result of the event.&lt;/p&gt;

&lt;p&gt;This decoupling of consumers and producers gives event-driven architectures several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Asynchronous—Communications between components are asynchronous, avoiding any bottlenecks caused by synchronous, monolithic architectures.&lt;/li&gt;
&lt;li&gt;  Decoupled—Components don’t need to know about one another, and can be developed, tested, deployed, and scaled independently.&lt;/li&gt;
&lt;li&gt;  Easy Scaling—Since components are decoupled, bottleneck issues can be more easily tracked to a single component, and quickly scaled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two main kinds of event-driven architectures: message queues and stream processing. Let's dive into the differences.&lt;/p&gt;

&lt;h1&gt;Intro to Message Queues&lt;/h1&gt;

&lt;p&gt;With message queues, the original event-driven architecture, the producer places a message into a queue &lt;em&gt;targeted to a specific consumer&lt;/em&gt;. That message is held in the queue (often in first-in, first-out order) until the consumer retrieves it, at which time the message is deleted.&lt;/p&gt;

&lt;p&gt;Message queues are useful for systems where you know exactly what needs to happen as a result of an event. When an event occurs, your producer sends a message to the queue, targeted to some consumer(s). Those consumers obtain the message from the queue and then execute the next operation. Once that next step is taken, the event is removed from the queue forever. In the case of message queues, the flow is generally known by the queue, giving rise to the term “smart broker/dumb consumer”, which means the broker (queue) knows where to send a message, and the consumer is just reacting.&lt;/p&gt;
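&lt;p&gt;The queue semantics described above can be sketched in a few lines: messages come out in first-in, first-out order, and retrieving a message removes it from the queue, so each message is delivered once:&lt;/p&gt;

```javascript
// A toy in-memory illustration of message-queue semantics (not a real
// broker): FIFO ordering, and a message is deleted as soon as a
// consumer retrieves it.
class MessageQueue {
  constructor() {
    this.messages = [];
  }
  send(message) {
    this.messages.push(message); // producer enqueues at the tail
  }
  receive() {
    return this.messages.shift(); // consumer dequeues from the head;
                                  // the message is gone afterwards
  }
}
```

&lt;p&gt;Real brokers such as RabbitMQ add durability, acknowledgements, and routing on top of this basic delete-on-read model.&lt;/p&gt;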

&lt;h1&gt;Intro to Stream Processing&lt;/h1&gt;

&lt;p&gt;With stream processing, messages are &lt;em&gt;not&lt;/em&gt; targeted to a certain recipient, but rather are published at-large to a specific topic and available to all interested consumers. Any and all interested recipients can subscribe to that topic and read the message. Since the message must be available to all consumers, the message is not deleted when it is read from the stream.&lt;/p&gt;

&lt;p&gt;Producers and brokers don’t need or want to know what will happen as a result of a message, or where that message will go. The producer just sends the message to the broker, the broker publishes it, and the producer and broker move on. Interested consumers receive the message and complete their processing. Because of this further decoupling, systems with event streaming can evolve easily as the project evolves.&lt;/p&gt;

&lt;p&gt;Consumers can be added and deleted and can change how and what they process, regardless of the overall system. The producer and the broker don’t need to know about these changes because the services are decoupled. This is often referred to as “dumb broker/smart consumer”—the broker (stream) is just a broker, and has no knowledge of routing. The consumers in stream processing are the smart components; they are aware of what messages to listen for.&lt;/p&gt;

&lt;p&gt;Also, consumers can retrieve multiple messages at the same time and since messages are not deleted, consumers can replay a series of messages going back in time. For example, a new consumer can go back and read older messages from before that consumer was deployed.&lt;/p&gt;
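&lt;p&gt;A toy version of these stream semantics makes the contrast with a queue clear: the log is append-only, reading never deletes, and each consumer keeps its own offset, so a consumer added later can replay the topic from the beginning:&lt;/p&gt;

```javascript
// A toy illustration of stream semantics (not a real Kafka client):
// messages appended to a topic are never deleted on read, and every
// subscriber tracks its own offset into the shared log.
class Topic {
  constructor() {
    this.log = []; // append-only record of every published message
  }
  publish(message) {
    this.log.push(message);
  }
  subscribe(fromOffset = 0) {
    let offset = fromOffset; // each subscriber owns its position
    return {
      poll: () => (offset < this.log.length ? this.log[offset++] : null),
    };
  }
}
```

&lt;p&gt;Because reading only advances a consumer's own offset, adding a new consumer never disturbs existing ones, which is exactly the decoupling described above.&lt;/p&gt;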

&lt;p&gt;Stream processing has become the go-to choice for many event-driven systems. It offers several advantages over message queues including multiple consumers, replay of events, and sliding window statistics. Overall, you gain a major increase in flexibility.&lt;/p&gt;

&lt;h1&gt;Should You Use Stream Processing or Message Queues?&lt;/h1&gt;

&lt;p&gt;Here are several use cases for each:&lt;/p&gt;

&lt;h3&gt;Message Queues&lt;/h3&gt;

&lt;p&gt;Message queues, such as &lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt; and &lt;a href="https://activemq.apache.org/" rel="noopener noreferrer"&gt;ActiveMQ&lt;/a&gt;, are popular. Message queues are particularly helpful in systems where you have known or complex routing logic, or when you need to guarantee a single delivery of each message.&lt;/p&gt;

&lt;p&gt;A typical use case for message queues is a busy ecommerce website where your services must be highly available, your requests must be delivered, and your routing logic is known and unlikely to change. With these constraints, message queues give you the powerful advantages of asynchronous communication and decoupled services, while keeping your architecture simple.&lt;/p&gt;

&lt;p&gt;Additional use cases often involve system dependencies or constraints, such as a system having a frontend and backend written in different languages or a need to integrate into legacy infrastructure.&lt;/p&gt;

&lt;h3&gt;Stream Processing&lt;/h3&gt;

&lt;p&gt;Stream processing is useful for systems with more complex consumers of messages such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Website Activity Tracking&lt;/strong&gt;. Activity on a busy website creates a &lt;em&gt;lot&lt;/em&gt; of messages. Using streams, you can create a series of real-time feeds, which include page views, clicks, searches, and so on, and allow a wide range of consumers to monitor, report on, and process this data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Log Aggregation&lt;/strong&gt;. Using streams, log files can be turned into a centralized stream of logging messages that are easy for consumers to process. You can also calculate sliding window statistics for metrics, such as an average every second or minute. This can greatly reduce output data volumes, making your infrastructure more efficient.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IoT&lt;/strong&gt;. IoT devices also produce a &lt;em&gt;lot&lt;/em&gt; of messages. Streams can handle a large volume of messages, and publish them to a large number of consumers in a highly scalable and performant manner.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Event Sourcing&lt;/strong&gt;. As described in &lt;a href="https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7"&gt;a previous article&lt;/a&gt;, streams can be used to implement &lt;a href="https://martinfowler.com/eaaDev/EventSourcing.html" rel="noopener noreferrer"&gt;event sourcing&lt;/a&gt;, where updates and deletes are never performed directly on the data; rather, state changes of an entity are saved as a series of events.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Messaging&lt;/strong&gt;. Complex and highly-available messaging platforms such as Twitter and LinkedIn use streams (Kafka) to drive metrics, deliver messages to news feeds, and so on.&lt;/li&gt;
&lt;/ul&gt;
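&lt;p&gt;The sliding window statistics mentioned in the log aggregation bullet can be sketched in a few lines of plain JavaScript. This is a toy reducer of our own (not a Kafka Streams API): it collapses raw per-request latency events into one average per one-second window, shrinking the data volume sent downstream:&lt;/p&gt;

```javascript
// Toy reducer: bucket raw events by one-second window and emit averages.
// Input: events with a timestamp in ms (ts) and a numeric value.
function windowedAverages(events, windowMs = 1000) {
  const buckets = new Map();
  for (const { ts, value } of events) {
    // Align each event to the start of its window.
    const windowStart = Math.floor(ts / windowMs) * windowMs;
    const b = buckets.get(windowStart) || { sum: 0, count: 0 };
    b.sum += value;
    b.count += 1;
    buckets.set(windowStart, b);
  }
  return [...buckets].map(([windowStart, { sum, count }]) =>
    ({ windowStart, avg: sum / count }));
}

const events = [
  { ts: 1000, value: 10 }, { ts: 1500, value: 30 }, // window starting at 1000
  { ts: 2100, value: 50 },                          // window starting at 2000
];
console.log(windowedAverages(events));
// → [ { windowStart: 1000, avg: 20 }, { windowStart: 2000, avg: 50 } ]
```

&lt;p&gt;Three raw events become two summary points here; at production volumes the reduction is far more dramatic.&lt;/p&gt;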

&lt;h1&gt;
  
  
  A Reference Architecture Using Kafka
&lt;/h1&gt;

&lt;p&gt;In our previous article, we deployed a quick-to-stand-up message queue to learn about queues. Let’s do a similar example with stream processing.&lt;/p&gt;

&lt;p&gt;There are many options for stream processing architectures, including the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Apache Kafka&lt;/li&gt;
&lt;li&gt;  Apache Spark&lt;/li&gt;
&lt;li&gt;  Apache Beam/Google Cloud Data Flow&lt;/li&gt;
&lt;li&gt;  Spring Cloud Data Flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll use the &lt;a href="https://devcenter.heroku.com/articles/event-driven-microservices-with-apache-kafka" rel="noopener noreferrer"&gt;Apache Kafka reference architecture on Heroku&lt;/a&gt;. &lt;a href="http://heroku.com/" rel="noopener noreferrer"&gt;Heroku&lt;/a&gt; is a cloud platform as a service (PaaS) that offers &lt;a href="https://devcenter.heroku.com/categories/kafka" rel="noopener noreferrer"&gt;Kafka as an add-on&lt;/a&gt;. Their cloud platform makes it easy to deploy a streaming system rather than hosting or running your own. Since Heroku provides a &lt;a href="https://github.com/heroku-examples/edm-terraform" rel="noopener noreferrer"&gt;Terraform script&lt;/a&gt; that deploys all the needed code and configuration for you in one step, it's a quick and easy way to learn about stream processing.&lt;/p&gt;

&lt;p&gt;We won’t walk through the deployment steps here, as they are outlined in &lt;a href="https://devcenter.heroku.com/articles/event-driven-microservices-with-apache-kafka" rel="noopener noreferrer"&gt;detail on the reference architecture page.&lt;/a&gt; In short, it deploys an example eCommerce system that showcases the major components and advantages of stream processing. Clicks to browse or purchase products are recorded as events and sent to Kafka.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F955zdjsilchvpn47d2by.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F955zdjsilchvpn47d2by.png" alt="eCommerce example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a key snippet of code from &lt;a href="https://github.com/trevorscott/edm-relay/blob/master/index.js" rel="noopener noreferrer"&gt;edm-relay&lt;/a&gt;, which sends messages to the Kafka stream. It's quite simple to publish events to Kafka since it's only a matter of calling the producer API to insert a JSON object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/produceClickMessage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_PREFIX&lt;/span&gt;&lt;span class="p"&gt;}${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`topic: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="c1"&gt;// Message to send. Must be a buffer&lt;/span&gt;
       &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
       &lt;span class="c1"&gt;// for keyed messages, we also specify the key - note that this field is optional&lt;/span&gt;
       &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="c1"&gt;// you can send a timestamp here. If your broker version supports it,&lt;/span&gt;
       &lt;span class="c1"&gt;// it will get added. Otherwise, we default to 0&lt;/span&gt;
       &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
     &lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;A problem occurred when sending our message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
     &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Success!&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A real-time dashboard then consumes the stream of click events and displays analytics. This could be useful for business analytics to explore the most popular products, changing trends, and so on.&lt;/p&gt;
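&lt;p&gt;Under the hood, a dashboard consumer like this is just a running aggregation over the click events. A toy version of the “most popular products” computation might look like this (our own illustration, not code from the reference architecture):&lt;/p&gt;

```javascript
// Tally click events by product to feed a "top products" widget.
function topProducts(events, n = 3) {
  const counts = new Map();
  for (const e of events) {
    counts.set(e.product, (counts.get(e.product) || 0) + 1);
  }
  // Sort by count, descending, and keep the top n.
  return [...counts].sort((a, b) => b[1] - a[1]).slice(0, n);
}

const clicks = [
  { product: 'poster' }, { product: 'mug' }, { product: 'poster' },
];
console.log(topProducts(clicks)); // → [ [ 'poster', 2 ], [ 'mug', 1 ] ]
```

&lt;p&gt;In the real dashboard, this tally would be updated incrementally as each event arrives from the stream rather than recomputed from scratch.&lt;/p&gt;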

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxrtxw5cv8cyruahpzdmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fxrtxw5cv8cyruahpzdmp.png" alt="EDM Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the code from &lt;a href="https://github.com/trevorscott/edm-stream/blob/master/index.js" rel="noopener noreferrer"&gt;edm-stream&lt;/a&gt; that subscribes to the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ready&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nx"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;kafkaTopics&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
   &lt;span class="nx"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
   &lt;span class="nx"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Error in Kafka consumer: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="p"&gt;});&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Kafka consumer ready.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
   &lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connectTimoutId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then consumes the message from the stream by calling an event handler for each message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`Offset: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`partition: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`consumerId: edm/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DYNO&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;localhost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;event&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commitMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reference architecture is not limited to this example storefront; it's a starting point for any web app where you want to track clicks and report on them in a real-time dashboard. It's open source, so feel free to experiment and modify it according to your own needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8sqjtry1xa934awu6lmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8sqjtry1xa934awu6lmt.png" alt="kafka example implementation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stream processing decouples your components so that they are easy to build, test, deploy, and scale independently, and the “dumb” broker that sits between them adds yet another layer of decoupling.&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Steps
&lt;/h1&gt;

&lt;p&gt;If you haven’t already, read our other articles in this series on the &lt;a href="https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7"&gt;advantages of event-driven architecture&lt;/a&gt; and &lt;a href="https://dev.to/heroku/scale-your-apps-with-an-easy-message-queue-on-redis-4glp"&gt;deploying a sample message queue using Redis and RSMQ&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Scale Your Apps with an Easy Message Queue on Redis</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Mon, 09 Dec 2019 15:57:11 +0000</pubDate>
      <link>https://dev.to/heroku/scale-your-apps-with-an-easy-message-queue-on-redis-4glp</link>
      <guid>https://dev.to/heroku/scale-your-apps-with-an-easy-message-queue-on-redis-4glp</guid>
      <description>&lt;p&gt;If you’re a microservices developer considering communication protocols, choosing an event-driven architecture might just help you rest a little easier at night. With the right design, event-driven architecture can help you to create apps that are decoupled and asynchronous, giving you the major benefits of your app being both performant and easily scalable. &lt;/p&gt;

&lt;p&gt;We’ll create and deploy a simple, quick-to-stand-up message queue using &lt;a href="http://heroku.com/" rel="noopener noreferrer"&gt;Heroku&lt;/a&gt;, &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;, and &lt;a href="https://github.com/smrchy/rsmq" rel="noopener noreferrer"&gt;RSMQ&lt;/a&gt;. And we’ll look at how our system works, what it can do, and some of its advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Message Queues vs. Streams
&lt;/h2&gt;

&lt;p&gt;One of the first, and most important, decisions is whether to use message queues or streams. In message queues, a sender places a message targeted to a recipient into a queue. The message is held in the queue until the recipient retrieves it, at which time the message is deleted.&lt;/p&gt;

&lt;p&gt;Similarly, in streams, senders place messages into a stream and recipients listen for messages. However, messages in streams are not targeted to a certain recipient, but rather are available to any and all interested recipients. Recipients can even consume multiple messages at the same time, and can play back a series of messages through the stream’s history.&lt;/p&gt;
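&lt;p&gt;The contrast is easy to see in a toy in-memory model (our own illustration, not the RSMQ or Redis API): the queue deletes each message when it is retrieved, while the stream retains messages and lets every reader keep its own position:&lt;/p&gt;

```javascript
// Queue: receiving removes the message, so exactly one recipient sees it.
class Queue {
  constructor() { this.messages = []; }
  send(msg) { this.messages.push(msg); }
  receive() { return this.messages.shift(); } // deleted on retrieval
}

// Stream: messages stay put; each reader advances its own position.
class Stream {
  constructor() { this.log = []; }
  send(msg) { this.log.push(msg); }
  reader() { let i = 0; return () => this.log[i++]; }
}

const q = new Queue();
q.send('job-1');
console.log(q.receive()); // 'job-1'
console.log(q.receive()); // undefined — gone after one read

const s = new Stream();
s.send('event-1');
const a = s.reader(), b = s.reader();
console.log(a(), b()); // both readers see 'event-1'
```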

&lt;p&gt;If these are new concepts for you, learn more in our previous article on &lt;a href="https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7"&gt;best practices for event-driven architectures&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Message Queues Are Helpful
&lt;/h2&gt;

&lt;p&gt;Message queues can be thought of as the original event-driven architecture. They drove the adoption of early event-driven designs and are still in use today. In these message queue designs, a client (or other component) traditionally creates a message when some action happens, then sends that message to a queue, targeted to a specific recipient. The recipient, which has been sitting idle waiting for work, receives (or retrieves) the message from the queue, processes it, and does some unit of work. When the recipient is done with its work, it deletes the message from the queue.&lt;/p&gt;

&lt;p&gt;This traditional path is exactly what our example below will do. It’s a simple setup, but by placing a queue between the producer and consumer of the event, we introduce a level of decoupling that allows us to build, deploy, update, test, and scale those two components independently. This decoupling not only makes coding and DevOps easier (since our components can remain ignorant of one another), but also makes our app much easier to scale up and down. We also reduce the workload on the web dynos, which lets us respond back to clients faster, and allows our web dynos to process more requests per second. This isn't just good for the business; it's great for user experience as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Example App
&lt;/h2&gt;

&lt;p&gt;Let's create a simple example app to demonstrate how a message queue works. We’ll create a system where users can submit a generic application through a website. This is a simple project you can use just to learn, as a real-world use case, or as a starting point for a more complicated project. We’re going to set up and deploy our simple yet powerful message queue using Heroku, Redis, Node.js, and RSMQ. This is a great stack that can get us to an event-driven architecture quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heroku, Redis, and RSMQ—A Great Combination for Event-Driven
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://heroku.com/" rel="noopener noreferrer"&gt;Heroku&lt;/a&gt;, with its one-click deployments and “behind-the-scenes” scaling, and &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;, an in-memory data store and message broker, are an excellent pair for quickly deploying systems that allow us to focus on business logic, not infrastructure. We can quickly and easily provision a Redis deployment (dyno) on Heroku that will scale as needed, and hides the implementation details we don’t want to worry about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/smrchy/rsmq" rel="noopener noreferrer"&gt;RSMQ&lt;/a&gt; is an open-source simple message queue built on top of Redis that is easy to deploy. RSMQ has several nice features: it’s lightweight (just 500 lines of javascript), it’s fast (10,000+ messages per second), and it guarantees delivery of a message to just one recipient.&lt;/p&gt;

&lt;p&gt;We’ll also follow the “&lt;a href="https://devcenter.heroku.com/articles/background-jobs-queueing" rel="noopener noreferrer"&gt;Worker Dynos, Background Jobs, and Queuing&lt;/a&gt;” pattern, which is recommended by Heroku and will give us our desired decoupling and scalability. Using this pattern, we’ll deploy a web client (the browser in the below diagram) that handles the user input and sends requests to the backend, a server (web process) that runs the queue, and a set of workers (background service) that pull messages from the queue and do the actual work. We’ll deploy the client/server as a web dyno, and the worker as a worker dyno.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbo06mzk11u5d65ggh6on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbo06mzk11u5d65ggh6on.png" alt="Worker Dynos, Background Jobs, and Queueing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s Get Started
&lt;/h3&gt;

&lt;p&gt;Once you’ve created your Heroku account and installed the Heroku CLI, you can create and deploy the project easily using the CLI. All of the source code needed to run this example &lt;a href="https://github.com/CapnMB/example-message-queue" rel="noopener noreferrer"&gt;is available on GitHub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/devspotlight/example-message-queue.git  
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;example-message-queue  
&lt;span class="nv"&gt;$ &lt;/span&gt;heroku create  
&lt;span class="nv"&gt;$ &lt;/span&gt;heroku addons:create heroku-redis  
&lt;span class="nv"&gt;$ &lt;/span&gt;git push heroku master  
&lt;span class="nv"&gt;$ &lt;/span&gt;heroku ps:scale &lt;span class="nv"&gt;worker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1  
&lt;span class="nv"&gt;$ &lt;/span&gt;heroku open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you need help with this step, here are a few good resources:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://devcenter.heroku.com/articles/getting-started-with-nodejs" rel="noopener noreferrer"&gt;Getting Started on Heroku with node.js&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://elements.heroku.com/addons/heroku-redis" rel="noopener noreferrer"&gt;Using Redis with Heroku&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  System Overview
&lt;/h3&gt;

&lt;p&gt;Our system is made up of three pieces: the client web app, the server, and the worker. Because we are so cleanly decoupled, both the server and worker processes are easy to scale up and down as the need arises.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Client
&lt;/h3&gt;

&lt;p&gt;Our client web app is deployed as part of our web dyno. The UI isn’t really the focus of this article, so we’ve built just a simple page with one link. Clicking the link posts a generic message to the server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7lxrv65quqegw8fq2pts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7lxrv65quqegw8fq2pts.png" alt="Test queue"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our Simple Web UI&lt;/p&gt;

&lt;h3&gt;
  
  
  The Web Server
&lt;/h3&gt;

&lt;p&gt;The web server is a simple Express server that delivers the web client. It also creates the queue on startup (if the queue doesn’t already exist), receives new messages from the client, and adds new messages to the queue.&lt;/p&gt;

&lt;p&gt;Here is the key piece of code that configures the variables for the queue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;rsmq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RedisSMQ&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;REDIS_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;REDIS_PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NAMESPACE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;REDIS_PASSWORD&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and sets up the queue the first time the server runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;rsmq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createQueue&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;qname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUENAME&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;queueExists&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The queue exists. That's OK.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;queue created&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a client posts a message, the server adds it to the message queue like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/job&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sending message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;rsmq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;qname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Hello World at &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
   &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="p"&gt;});&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pushed new message into queue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Worker
&lt;/h3&gt;

&lt;p&gt;The worker, which fittingly is deployed as a worker dyno, polls the queue for new messages, then pulls those new messages from the queue and processes them.&lt;/p&gt;

&lt;p&gt;We’ve chosen the simplest option here: the code reads the message, processes it, then manually deletes it from the queue. Note that RSMQ offers more powerful options, such as “pop”, which reads and deletes a message from the queue in one step, and a “real-time” mode for pub/sub capabilities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;rsmq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receiveMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;qname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUENAME&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hey I got the message you sent me!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="c1"&gt;// do lots of processing here&lt;/span&gt;
      &lt;span class="c1"&gt;// when we are done we can delete the message from the queue&lt;/span&gt;
      &lt;span class="nx"&gt;rsmq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deleteMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;qname&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="p"&gt;}&lt;/span&gt;
         &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deleted message with id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;no message in queue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
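&lt;p&gt;In a long-running worker you would typically wrap this in a polling loop. Here’s a minimal, Redis-free sketch of the same receive/process/delete cycle, using a hypothetical in-memory queue stub in place of RSMQ (names like &lt;code&gt;pollOnce&lt;/code&gt; are ours, for illustration only):&lt;/p&gt;

```javascript
// Hypothetical in-memory stand-in for RSMQ, illustrating the same
// receive -> process -> delete cycle the worker performs above.
const queue = [];
let nextId = 1;

function sendMessage(message) {
  queue.push({ id: String(nextId++), message });
}

function receiveMessage() {
  // Like RSMQ's receiveMessage: peek at the oldest message without removing it
  return queue.length > 0 ? queue[0] : {};
}

function deleteMessage(id) {
  const index = queue.findIndex((m) => m.id === id);
  if (index !== -1) queue.splice(index, 1);
}

function pollOnce() {
  const resp = receiveMessage();
  if (!resp.id) return null;  // nothing to do this tick
  // ... do lots of processing here ...
  deleteMessage(resp.id);     // only delete after processing succeeds
  return resp.message;
}

// A worker dyno would call pollOnce on a timer, e.g.:
// setInterval(pollOnce, 1000);
sendMessage("Hello World");
console.log(pollOnce()); // -> "Hello World"
console.log(pollOnce()); // -> null (queue is empty)
```

&lt;p&gt;In the real worker, the stub calls are replaced by RSMQ’s callback-based &lt;code&gt;receiveMessage&lt;/code&gt; and &lt;code&gt;deleteMessage&lt;/code&gt;, as shown above.&lt;/p&gt;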



&lt;p&gt;We could easily fire up multiple workers by using Throng, if needed. &lt;a href="https://devcenter.heroku.com/articles/node-redis-workers" rel="noopener noreferrer"&gt;Here’s a good example of a setup similar&lt;/a&gt; to ours that uses this library.&lt;/p&gt;

&lt;p&gt;Note: When you deploy the worker dyno, be sure to scale the worker processes to at least one dyno under the “Resources” tab in the Heroku Dashboard (if you haven’t already done so from the CLI) so that your workers will run.&lt;/p&gt;
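&lt;p&gt;From the CLI, that looks like this (assuming your Procfile defines a &lt;code&gt;worker&lt;/code&gt; process type):&lt;/p&gt;

```shell
# Scale the worker process type to one dyno so the worker actually runs
heroku ps:scale worker=1
```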

&lt;h2&gt;
  
  
  Running the Example
&lt;/h2&gt;

&lt;p&gt;When we deploy and start our dynos, we see our server firing up, our queue being deployed, and our worker checking for new messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fus4wrsyjmayzimjes4p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fus4wrsyjmayzimjes4p8.png" alt="Worker"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And when we click the link on the client, we can see the server push the message onto the queue, and then the worker grab the message, process it, and delete it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp7j5eu0jgb6983j7glog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fp7j5eu0jgb6983j7glog.png" alt="Client"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With our example, we’ve built a message queue that is quick to stand up yet powerful. It separates our components so that they are unaware of one another and are easy to build, test, deploy, and scale independently. This is a great start to a solid, event-driven architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;If you haven’t already, &lt;a href="https://github.com/devspotlight/example-message-queue" rel="noopener noreferrer"&gt;check out the code on GitHub&lt;/a&gt; and try it out yourself. &lt;/p&gt;

&lt;p&gt;Heroku also offers a great &lt;a href="https://devcenter.heroku.com/articles/event-driven-microservices-with-apache-kafka" rel="noopener noreferrer"&gt;event-driven reference architecture&lt;/a&gt;. You can get a running system in a single click, so it's another easy way to experiment and learn.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>node</category>
      <category>heroku</category>
    </item>
    <item>
      <title>Postgres Is Underrated—It Handles More than You Think</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Wed, 09 Oct 2019 15:04:20 +0000</pubDate>
      <link>https://dev.to/heroku/postgres-is-underrated-it-handles-more-than-you-think-4ff3</link>
      <guid>https://dev.to/heroku/postgres-is-underrated-it-handles-more-than-you-think-4ff3</guid>
      <description>&lt;p&gt;Thinking about scaling beyond your Postgres cluster and adding another data store like Redis or Elasticsearch? Before adopting a more complex infrastructure, take a minute and think again. It’s quite possible to get more out of an existing Postgres database. It can scale for heavy loads and offers powerful features which are not obvious at first sight. For example, its possible to enable in-memory caching, text search, specialized indexing, and key-value storage.&lt;/p&gt;

&lt;p&gt;After reading this article, you may want to list the features you need from your data store and check whether Postgres is a good fit for them. It’s powerful enough for most applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Adding Another Data Store is Not Always a Good Idea
&lt;/h2&gt;

&lt;p&gt;As Fred Brooks put it in &lt;em&gt;The Mythical Man-Month&lt;/em&gt;: "The programmer, like the poet, works only slightly removed from pure thought-stuff. [They] build castles in the air, from air, creating by exertion of the imagination."&lt;/p&gt;

&lt;p&gt;Adding more pieces to those castles, and getting lost in the design, is endlessly fascinating; however, in the real world, building more castles in the air can get in your way. The same holds true for the latest hype in data stores. There are several &lt;a href="http://boringtechnology.club/"&gt;advantages to choosing boring technology&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If someone new joins your team, can they easily make sense of your different data stores?&lt;/li&gt;
&lt;li&gt;  When you or another team member come back a year later, could they quickly pick up how the system works?&lt;/li&gt;
&lt;li&gt;  If you need to change your system or add features, how many pieces do you have to move around?&lt;/li&gt;
&lt;li&gt;  Have you factored in maintenance costs, security, and upgrades?&lt;/li&gt;
&lt;li&gt;  Have you accounted for the unknowns and failure modes when running your new data store in production at scale?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although the complexity can be managed with thoughtful design, adding multiple data stores does increase it. Before reaching for another data store, it's worth investigating what additional features your existing ones can offer. &lt;/p&gt;

&lt;h2&gt;
  
  
  Lesser-known but Powerful Features of Postgres
&lt;/h2&gt;

&lt;p&gt;Many people are unaware that Postgres offers way more than just a SQL database. If you already have Postgres in your stack, why add more pieces when Postgres can do the job?&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres caches, too
&lt;/h3&gt;

&lt;p&gt;There’s a misconception that Postgres reads and writes from disk on every query, especially when users compare it with purely in-memory data stores like Redis.&lt;/p&gt;

&lt;p&gt;Actually, Postgres has a beautifully designed caching system with pages, usage counts, and transaction logs. Most of your queries will not need to access the disk, especially if they refer to the same data over and over again, as many queries tend to do.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;shared_buffers&lt;/strong&gt; configuration parameter in the Postgres configuration file determines how much memory Postgres uses for caching data. Typically it should be set to 25% to 40% of total memory, because Postgres also makes use of the operating system cache. With more memory, most recurring queries referring to the same data set will not need to access the disk. Here is how you can set this parameter from the Postgres CLI (note that changing &lt;strong&gt;shared_buffers&lt;/strong&gt; requires a server restart to take effect):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;shared_buffer&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Managed database services like Heroku offer several &lt;a href="https://www.heroku.com/postgres"&gt;plans&lt;/a&gt; where RAM (and hence cache) is a major differentiator. The free hobby version does not offer dedicated resources like RAM. Upgrade when you’re ready for production loads so you can make better use of caching.&lt;/p&gt;

&lt;p&gt;You can also use some of the more advanced caching tools. For example, check the &lt;a href="https://www.postgresql.org/docs/current/pgbuffercache.html"&gt;pg_buffercache&lt;/a&gt; view to see what’s occupying the shared buffer cache of your instance. Another tool is the &lt;a href="https://www.postgresql.org/docs/current/pgprewarm.html"&gt;pg_prewarm&lt;/a&gt; function, which ships with the standard Postgres distribution. It enables DBAs to load table data into either the operating system cache or the Postgres buffer cache, either manually or automatically. If you know the nature of your database queries, this can greatly improve application performance.&lt;/p&gt;
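&lt;p&gt;As a sketch of what this looks like in practice (the table name &lt;code&gt;users&lt;/code&gt; is illustrative):&lt;/p&gt;

```sql
-- Install the contrib extensions, then warm the cache for one table
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('users');

-- Inspect which relations occupy the most shared-buffer pages
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```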

&lt;p&gt;For the really brave at heart, &lt;a href="https://madusudanan.com/blog/understanding-postgres-caching-in-depth/"&gt;refer to this article&lt;/a&gt; for an in-depth description of Postgres caching.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text searching
&lt;/h3&gt;

&lt;p&gt;Elasticsearch is excellent, but many use cases can get along just fine with Postgres for text searching. Postgres has a special data type, &lt;code&gt;&lt;a href="https://www.postgresql.org/docs/10/datatype-textsearch.html#DATATYPE-TSVECTOR"&gt;tsvector&lt;/a&gt;&lt;/code&gt;, and a set of functions, like &lt;code&gt;to_tsvector&lt;/code&gt; and &lt;code&gt;to_tsquery&lt;/code&gt;, to search quickly through text. &lt;code&gt;tsvector&lt;/code&gt; represents a document optimized for text search by sorting terms and normalizing variants. Here is an example of the &lt;code&gt;to_tsquery&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'The &amp;amp; Boys &amp;amp; Girls'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="n"&gt;to_tsquery&lt;/span&gt;   
&lt;span class="c1"&gt;---------------&lt;/span&gt;
 &lt;span class="s1"&gt;'boy'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="s1"&gt;'girl'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can sort your results by relevance depending on how often and which fields your query appeared in the results. For example, you can make the title more relevant than the body. Check the Postgres &lt;a href="https://www.postgresql.org/docs/11/textsearch-controls.html"&gt;documentation&lt;/a&gt; for details. &lt;/p&gt;
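&lt;p&gt;For instance, a hypothetical &lt;code&gt;articles&lt;/code&gt; table could weight title matches above body matches using &lt;code&gt;setweight&lt;/code&gt;, then rank results with &lt;code&gt;ts_rank&lt;/code&gt;:&lt;/p&gt;

```sql
SELECT title,
       ts_rank(
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body), 'B'),
         to_tsquery('english', 'postgres & cache')
       ) AS rank
FROM articles
ORDER BY rank DESC
LIMIT 10;
```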

&lt;h3&gt;
  
  
  Functions in Postgres
&lt;/h3&gt;

&lt;p&gt;Postgres provides a powerful server-side function environment in multiple programming languages.&lt;/p&gt;

&lt;p&gt;Try to pre-process as much data as you can on the Postgres server with server-side functions.  That way, you can cut down on the latency that comes from passing too much data back and forth between your application servers and your database. This approach is particularly useful for large aggregations and joins.&lt;/p&gt;

&lt;p&gt;What’s even better is your development team can use its existing skill set for writing Postgres code. Other than the default PL/pgSQL (Postgres’ native procedural language), Postgres functions and triggers can be written in PL/Python, PL/Perl, PL/V8 (JavaScript extension for Postgres) and PL/R.&lt;/p&gt;

&lt;p&gt;Here is an example of creating a PL/Python function for checking string lengths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;longer_string_length&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string1&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;string2&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
  &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpythonu&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Postgres offers powerful extensions
&lt;/h2&gt;

&lt;p&gt;Extensions are to Postgres what plug-ins are to many other applications. Suitable use of Postgres extensions can mean you don’t have to work with other data stores for extra functionality. There are many extensions available and listed on the main &lt;a href="https://www.postgresql.org/docs/current/contrib.html"&gt;Postgres website&lt;/a&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Geospatial Data
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://postgis.net/"&gt;PostGIS&lt;/a&gt; is a specialized extension for Postgres used for geospatial data manipulation and running location queries in SQL. It’s widely popular among GIS application developers who use Postgres. A great beginner’s guide to using PostGIS can be found &lt;a href="https://medium.com/@tjukanov/why-should-you-care-about-postgis-a-gentle-introduction-to-spatial-databases-9eccd26bc42b"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The code snippet below shows how to add the PostGIS extension to the current database. From the OS, we run these commands to install the package (assuming you are using Ubuntu):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository ppa:ubuntugis/ppa
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;postgis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, log in to your Postgres instance and install the extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;postgis&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;postgis_topology&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to check what extensions you have in the current database, run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_available_extensions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Key-Value Data Type
&lt;/h4&gt;

&lt;p&gt;The Postgres &lt;a href="https://www.postgresql.org/docs/current/hstore.html"&gt;hstore&lt;/a&gt; extension allows storing and searching simple key-value pairs. This &lt;a href="https://www.ibm.com/cloud/blog/new-builders/an-introduction-to-postgresqls-hstore"&gt;tutorial&lt;/a&gt; provides an excellent overview of how to work with the hstore data type.&lt;/p&gt;
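&lt;p&gt;A minimal sketch (the &lt;code&gt;products&lt;/code&gt; table is hypothetical):&lt;/p&gt;

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE products (id serial PRIMARY KEY, attrs hstore);

INSERT INTO products (attrs)
VALUES ('color => blue, size => M');

-- -> extracts a value by key; ? tests whether a key exists
SELECT attrs -> 'color' FROM products WHERE attrs ? 'size';
```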

&lt;h4&gt;
  
  
  Semi-structured Data Types
&lt;/h4&gt;

&lt;p&gt;There are two native formats for storing semi-structured data in Postgres: &lt;a href="https://www.postgresql.org/docs/current/datatype-json.html"&gt;JSON&lt;/a&gt; and &lt;a href="https://www.postgresql.org/docs/current/datatype-xml.html"&gt;XML&lt;/a&gt;. JSON data can be stored either as plain text (&lt;code&gt;json&lt;/code&gt;) or in a decomposed binary form (&lt;code&gt;jsonb&lt;/code&gt;); the latter can significantly improve query performance when the data is searched. As you can see below, Postgres can cast JSON strings to native JSON values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;json&lt;/span&gt;                       
&lt;span class="c1"&gt;---------------------------------------------------------------------&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"product1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"green"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nv"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"discounted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
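&lt;p&gt;For searching, &lt;code&gt;jsonb&lt;/code&gt; pairs well with a GIN index and the containment operator (the &lt;code&gt;events&lt;/code&gt; table below is hypothetical):&lt;/p&gt;

```sql
CREATE TABLE events (id serial PRIMARY KEY, payload jsonb);
CREATE INDEX events_payload_idx ON events USING GIN (payload);

INSERT INTO events (payload)
VALUES ('{"product1": ["blue", "green"], "tags": {"price": 10, "discounted": false}}');

-- @> tests containment; the GIN index makes this fast
SELECT payload -> 'product1'
FROM events
WHERE payload @> '{"tags": {"discounted": false}}';
```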



&lt;h2&gt;
  
  
  Tips for Scaling Postgres
&lt;/h2&gt;

&lt;p&gt;If you’re considering switching off Postgres for performance reasons, first see how far you can get with the optimizations it offers. Here we'll assume you've done the basics, like creating appropriate indexes. Postgres offers many advanced features, and while these optimizations are often small changes, they can make a big difference, especially if they keep you from complicating your infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t over-index
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary indexes, and use multi-column indexes sparingly. Too many indexes take up extra memory that crowds out better uses of the Postgres cache, which is crucial for performance.&lt;/p&gt;

&lt;p&gt;Using a tool like &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; might surprise you by how often the query planner actually chooses sequential table scans. Since much of your table’s row data is already cached, oftentimes these elaborate indexes aren’t even used.&lt;/p&gt;

&lt;p&gt;That said, if you do find slow queries, the first and most obvious solution is to see if the table is missing an index. Indexes are vital, but you have to use them correctly.&lt;/p&gt;
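&lt;p&gt;Checking is cheap. For a hypothetical &lt;code&gt;users&lt;/code&gt; table, prefix the query with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; and look at whether the plan uses an index scan or a sequential scan:&lt;/p&gt;

```sql
EXPLAIN ANALYZE
SELECT * FROM users
WHERE signup_date > now() - interval '7 days';
```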

&lt;h3&gt;
  
  
  Partial indexes save space
&lt;/h3&gt;

&lt;p&gt;A partial index can save space by specifying which values get indexed. For example, suppose you want to order by a user’s signup date, but only care about the users who have completed signup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;user_signup_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_signed_up&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding Postgres index types
&lt;/h3&gt;

&lt;p&gt;Choosing the right index for your data can improve performance. Here are some common index types and when you should use each one. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/current/btree-intro.html"&gt;B-tree indexes&lt;/a&gt; 
B-tree indexes are balanced trees that are used to sort data efficiently. They’re the default if you use the &lt;code&gt;CREATE INDEX&lt;/code&gt; command. Most of the time, a B-tree index suffices. As you scale, index inconsistencies can become a larger problem, so run the &lt;a href="https://www.postgresql.org/docs/11/amcheck.html"&gt;amcheck&lt;/a&gt; extension periodically. &lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/11/brin-intro.html"&gt;BRIN indexes&lt;/a&gt; 
A Block Range INdex (BRIN) can be used when your table is naturally already sorted by a column, and you need to sort by that column. For example, for a log table that was written sequentially, setting a BRIN index on the timestamp column lets the server know that the data is already sorted. &lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/11/bloom.html"&gt;Bloom filter index&lt;/a&gt; 
A bloom index is perfect for multi-column queries on big tables where you only need to test for equality. It uses a special mathematical structure called a bloom filter that’s based on probability and uses significantly less space.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;bloom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;col3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/11/textsearch-indexes.html"&gt;GIN and GiST indexes&lt;/a&gt; \
Use a GIN or GiST index for efficient indexes based on composite values like text, arrays, and JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Do You Need Another Data Store?
&lt;/h2&gt;

&lt;p&gt;There are legitimate cases for adding another datastore beyond Postgres.&lt;/p&gt;

&lt;h3&gt;
  
  
  Special data types
&lt;/h3&gt;

&lt;p&gt;Some data stores give you data types that you just can’t get in Postgres. For example, linked lists, bitmaps, and the HyperLogLog functions in Redis are not available in Postgres.&lt;/p&gt;

&lt;p&gt;At a previous startup, we had to implement a frequency cap, which is a counter for unique users on a website based on session data (like cookies). There might be millions or tens of millions of users visiting a website. Frequency capping means you only show each user your ad once per day. &lt;/p&gt;

&lt;p&gt;Redis has a &lt;a href="https://redis.io/commands/pfcount"&gt;HyperLogLog data type&lt;/a&gt; that is perfect for a frequency cap. It approximates the number of distinct elements in a set with a very small error rate, in exchange for O(1) time and a very small memory footprint. &lt;code&gt;PFADD&lt;/code&gt; adds an element to a HyperLogLog. It returns 1 if the element was likely not counted already, and 0 if it likely was.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PFADD&lt;/span&gt; &lt;span class="n"&gt;user_ids&lt;/span&gt; &lt;span class="n"&gt;uid1&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;PFADD&lt;/span&gt; &lt;span class="n"&gt;user_ids&lt;/span&gt; &lt;span class="n"&gt;uid2&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;PFADD&lt;/span&gt; &lt;span class="n"&gt;user_ids&lt;/span&gt; &lt;span class="n"&gt;uid1&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
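&lt;p&gt;To make the frequency-cap logic concrete, here is a minimal, self-contained Python sketch. In production you would issue &lt;code&gt;PFADD&lt;/code&gt; through a Redis client; the stand-in class below only mimics its return value (using an exact set rather than a probabilistic sketch), so the capping logic can run without a Redis server.&lt;/p&gt;

```python
class FakeHyperLogLog:
    """Stand-in for a Redis HyperLogLog. Like PFADD, pfadd() returns 1 for a
    new element and 0 for one seen before. Redis does this probabilistically
    in a few kilobytes of fixed memory; this exact set is for demo only."""

    def __init__(self):
        self._seen = set()

    def pfadd(self, element):
        if element in self._seen:
            return 0
        self._seen.add(element)
        return 1


def should_show_ad(hll, user_id):
    # PFADD returns 1 the first time a user shows up today, so we
    # show the ad at most once per user per day.
    return hll.pfadd(user_id) == 1


daily_uniques = FakeHyperLogLog()  # in Redis: one key per day
print(should_show_ad(daily_uniques, "uid1"))  # True: first visit, show the ad
print(should_show_ad(daily_uniques, "uid2"))  # True
print(should_show_ad(daily_uniques, "uid1"))  # False: already capped
```

&lt;p&gt;With the redis-py client, the same check would be a call such as &lt;code&gt;r.pfadd("user_ids:2019-12-17", user_id)&lt;/code&gt; against a per-day key, with Redis keeping the memory footprint constant no matter how many users visit.&lt;/p&gt;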



&lt;h3&gt;
  
  
  Heavy real-time processing
&lt;/h3&gt;

&lt;p&gt;If you’re in a situation with many pub-sub events, jobs, and dozens of workers to coordinate, you may need a more specialized solution like Apache Kafka. LinkedIn engineers originally developed Kafka to handle high-volume user activity events such as clicks, invitations, and messages, and to let different workers consume those streams for message passing and data-processing jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instant full-text searching
&lt;/h3&gt;

&lt;p&gt;If you have a real-time application under heavy load with more than ten concurrent searches, and you need features like autocomplete, then you may benefit more from a specialized text-search solution like Elasticsearch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Redis, Elasticsearch, and Kafka are powerful, but sometimes adding them does more harm than good. You may be able to get the capabilities you need with Postgres by taking advantage of the lesser-known features we’ve covered here. Ensuring that you are getting the most out of Postgres can save you time and help you avoid added complexity and risks.  &lt;/p&gt;

&lt;p&gt;To save even more time and headaches, consider using a managed service like &lt;a href="https://www.heroku.com/postgres"&gt;Heroku Postgres&lt;/a&gt;. Scaling up is a simple matter of adding additional follower replicas, high availability can be turned on with a single click, and Heroku operates it for you. If you really need to expand beyond Postgres, the other data stores that we mentioned above, such as Redis, Apache Kafka and Elasticsearch, can all be easily provisioned on Heroku. Go ahead and build your castles in the air―but anchor them to a reliable foundation, so you can dream about a better product and customer experience.&lt;/p&gt;

&lt;p&gt;For more information on Postgres, listen to &lt;a href="https://softwareengineeringdaily.com/2019/05/06/cloud-database-workloads-with-jon-daniel/"&gt;Cloud Database Workloads with Jon Daniel&lt;/a&gt; on Software Engineering Daily.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>webdev</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Best Practices for Event-Driven Microservice Architecture</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Tue, 24 Sep 2019 15:08:11 +0000</pubDate>
      <link>https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7</link>
      <guid>https://dev.to/heroku/best-practices-for-event-driven-microservice-architecture-2lh7</guid>
      <description>&lt;p&gt;If you’re an enterprise architect, you’ve probably heard of and worked with a microservices architecture. And while you might have used REST as your service communications layer in the past, more and more projects are moving to an event-driven architecture. Let’s dive into the pros and cons of this popular architecture, some of the key design choices it entails, and common anti-patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Event-Driven Microservice Architecture?
&lt;/h2&gt;

&lt;p&gt;In event-driven architecture, when a service performs some piece of work that other services might be interested in, that service produces an event—a record of the performed action. Other services consume those events so that they can perform any of their own tasks needed as a result of the event. Unlike with REST, services that create requests do not need to know the details of the services consuming the requests.&lt;/p&gt;

&lt;p&gt;Here’s a simple example: When an order is placed on an ecommerce site, a single “order placed” event is produced and then consumed by several microservices:  &lt;/p&gt;

&lt;p&gt;1) the order service which could write an order record to the database&lt;br&gt;
2) the customer service which could create the customer record, and&lt;br&gt;
3) the payment service which could process the payment.&lt;/p&gt;

&lt;p&gt;Events can be published in a variety of ways. For example, they can be published to a queue that guarantees delivery of the event to the appropriate consumers, or they can be published to a “pub/sub” model stream that publishes the event and allows access to all interested parties. In either case, the producer publishes the event, and the consumer receives that event, reacting accordingly. Note that in some cases, these two actors can also be called the publisher (the producer) and the subscriber (the consumer).&lt;/p&gt;
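&lt;p&gt;The “order placed” example can be sketched with a tiny in-memory event bus, a stand-in for a real broker such as Kafka or RabbitMQ: the producer publishes one event by topic name and never learns which services consume it.&lt;/p&gt;

```python
from collections import defaultdict


class EventBus:
    """Toy pub/sub bus: producers publish to a topic by name and every
    subscribed handler receives the event. A real system would use a
    broker; the decoupling shape is the same."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
actions = []

# Each service reacts independently; the producer knows none of them.
bus.subscribe("order_placed", lambda e: actions.append(("order_service", e["order_id"])))
bus.subscribe("order_placed", lambda e: actions.append(("customer_service", e["customer_id"])))
bus.subscribe("order_placed", lambda e: actions.append(("payment_service", e["amount"])))

bus.publish("order_placed", {"order_id": 42, "customer_id": 7, "amount": 19.99})
print(actions)
# [('order_service', 42), ('customer_service', 7), ('payment_service', 19.99)]
```

&lt;p&gt;Adding a fourth consumer (say, an email service) requires only another &lt;code&gt;subscribe&lt;/code&gt; call; the producer’s code is untouched, which is the loose coupling discussed below.&lt;/p&gt;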

&lt;h2&gt;
  
  
  Why Use Event-Driven Architecture
&lt;/h2&gt;

&lt;p&gt;An event-driven architecture offers several advantages over REST, which include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Asynchronous – event-based architectures are asynchronous and non-blocking. This allows resources to move on to the next task once their unit of work is complete, without worrying about what happened before or will happen next. Events can also be queued or buffered, which prevents consumers from putting back pressure on producers or blocking them.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loose Coupling – services don’t need (and shouldn’t have) knowledge of, or dependencies on other services. When using events, services operate independently, without knowledge of other services, including their implementation details and transport protocol. Services under an event model can be updated, tested, and deployed independently and more easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy Scaling – Since the services are decoupled under an event-driven architecture, and as services typically perform only one task, tracking down bottlenecks to a specific service, and scaling that service (and only that service) becomes easy.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Recovery support – An event-driven architecture with a queue can recover lost work by “replaying” events from the past. This can be valuable to prevent data loss when a consumer needs to recover.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, event-driven architectures have drawbacks as well. They are easy to over-engineer by separating concerns that might be simpler when closely coupled; can require a significant upfront investment; and often result in additional complexity in infrastructure, service contracts or schemas, polyglot build systems, and dependency graphs.&lt;/p&gt;

&lt;p&gt;Perhaps the most significant drawback and challenge is data and transaction management. Because of their asynchronous nature, event-driven models must carefully handle inconsistent data between services and incompatible schema versions, and must watch for duplicate events. They typically do not support ACID transactions, instead offering &lt;a href="https://en.wikipedia.org/wiki/Eventual_consistency" rel="noopener noreferrer"&gt;eventual consistency&lt;/a&gt;, which can be more difficult to track and debug.&lt;/p&gt;

&lt;p&gt;Even with these drawbacks, an event-driven architecture is usually the better choice for enterprise-level microservice systems. The pros—scalable, loosely coupled, dev-ops friendly design—outweigh the cons.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use REST
&lt;/h2&gt;

&lt;p&gt;There are, however, times when a REST/web interface may still be preferable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  You need a synchronous request/reply interface&lt;/li&gt;
&lt;li&gt;  You need convenient support for strong transactions&lt;/li&gt;
&lt;li&gt;  Your API is available to the public&lt;/li&gt;
&lt;li&gt;  Your project is small (REST is much simpler to set up and deploy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Your Most Important Design Choice – Messaging Framework
&lt;/h2&gt;

&lt;p&gt;Once you’ve decided on an event-driven architecture, it is time to choose your event framework. The way your events are produced and consumed is a key factor in your system. Dozens of proven frameworks and choices exist and choosing the right one takes time and research.&lt;/p&gt;

&lt;p&gt;Your basic choice comes down to message processing or stream processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message Processing
&lt;/h3&gt;

&lt;p&gt;In traditional message processing, a component creates a message then sends it to a specific (and typically single) destination. The receiving component, which has been sitting idle and waiting, receives the message and acts accordingly. Typically, when the message arrives, the receiving component performs a single process. Then, the message is deleted.&lt;/p&gt;

&lt;p&gt;A typical example of a message processing architecture is a Message Queue. Though most newer projects use stream processing (as described below), architectures using message (or event) queues are still popular. Message queues typically use a “store and forward” system of brokers where events travel from broker to broker until they reach the appropriate consumer. &lt;a href="https://activemq.apache.org/" rel="noopener noreferrer"&gt;ActiveMQ&lt;/a&gt; and &lt;a href="https://www.rabbitmq.com/" rel="noopener noreferrer"&gt;RabbitMQ&lt;/a&gt; are two popular examples of message queue frameworks. Both of these projects have years of proven use and established communities.&lt;/p&gt;
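&lt;p&gt;The point-to-point semantics can be sketched with Python’s standard-library queue: each message goes to exactly one worker and disappears once acknowledged. A real broker such as RabbitMQ adds persistence, routing, and delivery guarantees on top of this basic shape.&lt;/p&gt;

```python
import queue

# A message queue delivers each message to exactly one consumer.
q = queue.Queue()
q.put({"task": "send_email", "order_id": 1})
q.put({"task": "send_email", "order_id": 2})

processed = []
while not q.empty():
    msg = q.get()                  # delivered to this worker only
    processed.append(msg["order_id"])
    q.task_done()                  # acknowledged; a broker would delete it now

print(processed)  # [1, 2], and the queue is now empty
```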

&lt;h3&gt;
  
  
  Stream Processing
&lt;/h3&gt;

&lt;p&gt;On the other hand, in stream processing, components emit events when they reach a certain state. Other interested components listen for these events on the event stream and react accordingly. Events are not targeted to a certain recipient, but rather are available to all interested components.&lt;/p&gt;

&lt;p&gt;In stream processing, components can react to multiple events at the same time, and apply complex operations on multiple streams and events. Some streams include persistence where events stay on the stream for as long as necessary.&lt;/p&gt;

&lt;p&gt;With stream processing, a system can reproduce a history of events, come online after the event occurred and still react to it, and even perform sliding window computations. For example, it could calculate the average CPU usage per minute from a stream of per-second events.&lt;/p&gt;
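&lt;p&gt;The per-minute CPU average mentioned above is a classic sliding-window computation. A minimal sketch of the window logic (stream frameworks provide this as a built-in operator, usually time-based rather than count-based):&lt;/p&gt;

```python
from collections import deque


class SlidingAverage:
    """Sliding-window average over a stream, e.g. the mean CPU usage
    across the last `window` per-second samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)  # old samples fall out automatically

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


avg = SlidingAverage(window=3)
print(avg.add(10))  # 10.0
print(avg.add(20))  # 15.0
print(avg.add(30))  # 20.0
print(avg.add(40))  # 30.0, because the sample 10 slid out of the window
```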

&lt;p&gt;One of the most popular stream processing frameworks is &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;. Kafka is a mature and stable solution used by many projects. It can be considered a go-to, industrial-strength stream processing solution. Kafka has a large userbase, a helpful community, and an evolved toolset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Choices
&lt;/h3&gt;

&lt;p&gt;There are other frameworks that offer either a combination of stream and message processing or their own unique solution. For example, &lt;a href="https://pulsar.apache.org/" rel="noopener noreferrer"&gt;Pulsar&lt;/a&gt;,  a newer offering from Apache, is an open-source pub/sub messaging system that supports both streams and event queues, all with extremely high performance. Pulsar is feature-rich—it offers multi-tenancy and geo-replication—and accordingly complex. It’s been said that Kafka aims for high throughput, while Pulsar aims for low latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://nats.io/" rel="noopener noreferrer"&gt;NATS&lt;/a&gt; is an alternative pub/sub messaging system with “synthetic” queueing. NATS is designed for sending small, frequent messages. It offers both high performance and low latency. However, NATS considers some level of data loss to be acceptable, prioritizing performance over delivery guarantees.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Design Considerations
&lt;/h2&gt;

&lt;p&gt;Once you’ve selected your event framework, here are several other challenges to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Event Sourcing&lt;/p&gt;

&lt;p&gt;It is difficult to implement a combination of loosely-coupled services, distinct data stores, and atomic transactions. One pattern that may help is &lt;a href="https://martinfowler.com/eaaDev/EventSourcing.html" rel="noopener noreferrer"&gt;Event Sourcing&lt;/a&gt;. In Event Sourcing, updates and deletes are never performed directly on the data; rather, state changes of an entity are saved as a series of events.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CQRS&lt;/p&gt;

&lt;p&gt;Event sourcing introduces another issue: since current state must be rebuilt from a series of events, queries can be slow and complex. Command Query Responsibility Segregation (&lt;a href="https://www.martinfowler.com/bliki/CQRS.html" rel="noopener noreferrer"&gt;CQRS&lt;/a&gt;) is a design pattern that calls for separate models for write (command) operations and read (query) operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Discovering Event Information&lt;/p&gt;

&lt;p&gt;One of the greatest challenges in event-driven architecture is cataloging services and events. Where do you find event descriptions and details? What is the reason for an event? What team created the event? Are they actively working on it?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dealing with Change&lt;/p&gt;

&lt;p&gt;Will an event schema change? How do you change an event schema without breaking other services? How you answer these questions becomes critical as your number of services and events grows.&lt;br&gt;&lt;br&gt;
Being a good event consumer means coding for schemas that change. Being a good event producer means being cognizant of how your schema changes impact other services and creating well-designed events that are documented clearly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On Premise vs Hosted Deployment&lt;/p&gt;

&lt;p&gt;Regardless of your event framework, you’ll also need to decide between deploying the framework yourself on premise (message brokers are not trivial to operate, especially with high availability), or using a hosted service such as &lt;a href="https://www.heroku.com/kafka" rel="noopener noreferrer"&gt;Apache Kafka on Heroku&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
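&lt;p&gt;The Event Sourcing pattern in the list above is easy to sketch: rather than updating a row in place, every state change is appended to a log, and current state is rebuilt by replaying that log from the beginning.&lt;/p&gt;

```python
from functools import reduce

# Append-only event log for one account; nothing is ever updated or deleted.
events = [
    {"type": "AccountOpened", "owner": "alice"},
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]


def apply_event(state, event):
    """Pure function: old state plus one event yields new state."""
    if event["type"] == "AccountOpened":
        return {"owner": event["owner"], "balance": 0}
    if event["type"] == "Deposited":
        return {**state, "balance": state["balance"] + event["amount"]}
    if event["type"] == "Withdrawn":
        return {**state, "balance": state["balance"] - event["amount"]}
    return state  # tolerate unknown event types


current = reduce(apply_event, events, None)
print(current)  # {'owner': 'alice', 'balance': 75}
```

&lt;p&gt;This is also where CQRS comes in: because replaying on every query is slow, a separate read model can cache the result and update it incrementally as each new event arrives.&lt;/p&gt;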

&lt;h2&gt;
  
  
  Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;As with most architectures, an event-driven architecture comes with its own set of anti-patterns. Here are a few to watch out for.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Too much of a good thing&lt;/p&gt;

&lt;p&gt;Be careful you don’t get too excited about creating events. Creating too many events will create unnecessary complexity between the services, increase cognitive load for developers, make deployment and testing more difficult, and cause congestion for event consumers. Not every method needs to be an event.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generic events&lt;/p&gt;

&lt;p&gt;Don’t use generic events, either in name or in purpose. You want other teams to understand why your event exists, what it should be used for, and when it should be used. Events should have a specific purpose and be named accordingly. Events with generic names, or generic events with confusing flags, cause issues.  &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Complex dependency graphs&lt;/p&gt;

&lt;p&gt;Watch out for services that depend on one another and create complex dependency graphs or feedback loops. Each network hop adds additional latency to the original request, particularly north/south network traffic that leaves the datacenter.  &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Depending on guaranteed order, delivery, or side effects&lt;/p&gt;

&lt;p&gt;Events are asynchronous; building in assumptions about ordering or exactly-once delivery will not only add complexity but will negate many of the key benefits of an event-based architecture. If your consumer has side effects, such as adding a value in a database, then you may be unable to recover by replaying events. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Premature optimization&lt;/p&gt;

&lt;p&gt;Most products start off small and grow over time. While you may dream of future needs to scale to a large complex organization, if your team is small then the added complexity of event-driven architectures may actually slow you down. Instead, consider designing your system with a simple architecture but include the necessary separation of concerns so that you can swap it out as your needs grow.  &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Expecting event-driven to fix everything  &lt;/p&gt;

&lt;p&gt;On a less technical level, don’t expect event-driven architecture to fix all your problems. While this architecture can certainly improve many areas of technical dysfunction, it can’t fix core problems such as a lack of automated testing, poor team communication, or outdated dev-ops practices.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
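&lt;p&gt;The standard antidote to the delivery-guarantee anti-pattern above is an idempotent consumer. A sketch of deduplicating by event ID, so a replayed or duplicated event does not double-apply its side effect:&lt;/p&gt;

```python
# Idempotent consumer: remember which event IDs were already applied,
# so redelivered duplicates (common with at-least-once brokers) are no-ops.
processed_ids = set()
balance = 0


def handle(event):
    global balance
    if event["id"] in processed_ids:
        return  # duplicate delivery: skip the side effect
    processed_ids.add(event["id"])
    balance += event["amount"]


handle({"id": "evt-1", "amount": 50})
handle({"id": "evt-2", "amount": 25})
handle({"id": "evt-1", "amount": 50})  # redelivered duplicate, ignored

print(balance)  # 75, not 125
```

&lt;p&gt;In a real system the set of processed IDs would live in durable storage (and be written in the same transaction as the side effect), but the shape of the check is the same.&lt;/p&gt;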

&lt;h2&gt;
  
  
  Learn More
&lt;/h2&gt;

&lt;p&gt;Understanding the pros and cons of event-driven architectures, and some of their most common design decisions and challenges is an important part of creating the best design possible.&lt;/p&gt;

&lt;p&gt;If you want to learn more, check out this &lt;a href="https://devcenter.heroku.com/articles/event-driven-microservices-with-apache-kafka" rel="noopener noreferrer"&gt;event-driven reference architecture&lt;/a&gt;, which allows you to deploy a working project on Heroku with a single click. This reference architecture creates a web store selling fictional coffee products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsq0d80olj7mc20cxtlmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fsq0d80olj7mc20cxtlmw.png" alt="Curated Cofee"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Product clicks are tracked as events and stored in Kafka. Then, they are consumed by a reporting dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl59ja46lwxonkxeoxkfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl59ja46lwxonkxeoxkfo.png" alt="Button Clicks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code is open source so you can modify it according to your needs and run your own experiments.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>PaaS versus Serverless: Which to choose in 2019?</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Wed, 11 Sep 2019 15:00:18 +0000</pubDate>
      <link>https://dev.to/heroku/paas-versus-serverless-which-to-choose-in-2019-5fep</link>
      <guid>https://dev.to/heroku/paas-versus-serverless-which-to-choose-in-2019-5fep</guid>
<description>&lt;p&gt;Serverless computing is increasingly popular, so when building a new application in 2019 should you go serverless or stick with PaaS? AWS started the serverless movement in 2014 with the introduction of its AWS Lambda service. Back then it felt as revolutionary as when Steve Jobs introduced the first iPhone. The main idea was to get rid of infrastructure management completely: you write a small piece of code, upload it, and the cloud takes care of the rest. It felt like PaaS 2.0, a bigger, better, more advanced version of PaaS.&lt;/p&gt;

&lt;p&gt;PaaS sought to make infrastructure easier to manage so that developers could focus on working within web frameworks rather than wrestling with the underlying infrastructure. The goal is to simplify deployment and operation. At the same time, PaaS platforms still offer access to the infrastructure, which provides a balance between automation and the flexibility to configure your servers. For example, Heroku today makes deploying, managing, and scaling server apps as easy as a one-line command.&lt;/p&gt;

&lt;p&gt;So which approach should you choose for building apps today? Should you make the switch to serverless? The first step is to look at all the options objectively, evaluate them for your specific situation, and make a reasoned choice. Both can solve basic development needs: delivering functionality quickly and reliably. Understanding the technical differences will help you determine which approach is best for a particular project.&lt;/p&gt;

&lt;p&gt;There are many serverless and PaaS platforms to use for our comparison. For serverless, popular options include AWS Lambda, Google Cloud Functions, Azure Functions and OpenWhisk. On the PaaS side we have Heroku, AWS Elastic Beanstalk, Google AppEngine and more. &lt;/p&gt;

&lt;p&gt;To simplify our comparison, let’s focus on AWS Lambda for serverless and Heroku for PaaS as prototypical examples. We'll try to be as fair as possible in our comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of Serverless with Lambda
&lt;/h2&gt;

&lt;p&gt;Serverless offers some really great advantages. You don’t need to manage infrastructure or app servers, so you can focus on just coding functions. It also bills on a per-call basis rather than a per-hour basis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;There’s no infrastructure to manage.&lt;/strong&gt;&lt;br&gt;
Just write a small piece of code, upload it to the cloud and the cloud service will do everything else. No more server setup and management. No more scaling and load balancing problems. No more troubleshooting servers and network. Sounds like a dream! While a PaaS solves some of these issues, you may still need to think about managing dynos, geo-distribution, fault tolerance and scaling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Increase development velocity.&lt;/strong&gt;&lt;br&gt;
There is no need to write code or use frameworks for handling HTTP, parsing JSON, and so on. You just write pure business logic in a function, and Lambda does the rest for you. With Heroku, you still need to write code to implement your application server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pay only for resources that are actually used.&lt;/strong&gt;&lt;br&gt;
With any PaaS provider, you are paying for the availability of specific compute resources, such as CPU and RAM, whether or not they are in use. With Lambda you pay ONLY when you are actually using the resources: for the number of invocations, the amount of consumed resources, and the execution time. This billing model itself pushes developers to write compact and efficient code. With Heroku, you are billed for running dynos even when they are idle. You can scale down unneeded dynos, but you must run at least one to have a functioning app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrate with other AWS services.&lt;/strong&gt;&lt;br&gt;
Lambda is very well integrated with many other AWS services. It’s easier to use if all your infrastructure is running on AWS. You may even be forced to use Lambda as a way to integrate AWS services with external services, such as forwarding Cloudwatch events to a monitoring service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compute at the edge can lower average latency.&lt;/strong&gt;&lt;br&gt;
There is a completely different service from AWS called &lt;a href="https://aws.amazon.com/lambda/edge/"&gt;Lambda@Edge&lt;/a&gt;. The idea is to run custom JavaScript code on a CDN node close to the end user. It can be used to achieve lower latency for the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
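&lt;p&gt;The “just write a function” point is easy to see in code. A minimal Lambda-style handler in Python contains only business logic; there is no HTTP server, routing, or JSON parsing to write, because the runtime hands you an already-parsed event:&lt;/p&gt;

```python
import json


def handler(event, context):
    """AWS Lambda-style entry point: the runtime parses the trigger
    payload into `event` and invokes this function directly."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"greeting": "hello " + name}),
    }


# Locally we can invoke it like any plain function (context is unused here).
print(handler({"name": "dev"}, None))
# {'statusCode': 200, 'body': '{"greeting": "hello dev"}'}
```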

&lt;h2&gt;
  
  
  Advantages of PaaS with Heroku
&lt;/h2&gt;

&lt;p&gt;PaaS platforms offer an opinionated and standard infrastructure that makes them developer friendly. They simplified the management of the underlying infrastructure while still providing access to it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simple to use as part of regular developer workflow.&lt;/strong&gt;&lt;br&gt;
Getting your app running on Heroku requires fewer steps, and could be as simple as a &lt;a href="https://devcenter.heroku.com/articles/git"&gt;Git push&lt;/a&gt;. As your app matures, Heroku also integrates other services in a convenient package like &lt;a href="https://www.heroku.com/continuous-delivery"&gt;continuous integration and review apps&lt;/a&gt;. With Lambda, you have to package your code into a zip, upload it, and configure triggers and permissions. You are not able to expose your app to the internet without configuring API Gateway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unlimited HTTP requests for the same price.&lt;/strong&gt;&lt;br&gt;
Heroku doesn’t bill per request, so it’s more predictable, and a single dyno can handle thousands of requests per second, depending on your code. With Lambda you are forced to use API Gateway to expose it via HTTP. It can be expensive because you are paying for every request, on top of your Lambda request and bandwidth charges. At the time of writing this post, API Gateway alone costs &lt;a href="https://aws.amazon.com/api-gateway/pricing/"&gt;$3.50 per 1M requests&lt;/a&gt;. That may not seem expensive at first glance. But don’t forget that we live in a world where an app or a service can go viral and become hugely popular overnight. The question is, will you be able to pay the AWS bill after that?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer long tail latency issues.&lt;/strong&gt;&lt;br&gt;
Heroku dynos run continuously and are ready for requests at any moment. You can even use the &lt;a href="https://devcenter.heroku.com/articles/preboot"&gt;preboot&lt;/a&gt; feature to ensure the dyno is ready to receive traffic before routing to it. One of the biggest potential issues with Lambda is long tail latency. When Lambda is triggered, it spins up a Firecracker-based microVM and runs your code there. This takes some time, so the first response is slower. All subsequent requests that land on that existing Lambda instance are processed without a cold start and the associated latency. In my experience, the cold start for a Java-based Lambda can be around 10 seconds, which shows why the JVM is a poor fit for this use case. With Go, the cold start in my use case got down to 1.5 seconds — a pretty significant improvement, but it’s still long.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You can run any code.&lt;/strong&gt;&lt;br&gt;
Heroku lets you deploy, run, and manage applications written in Ruby, Node.js, Java, Python, Clojure, Scala, Go, and PHP. You can even run Docker containers with any image you choose. Lambda, on the other hand, supports only a limited number of runtimes. AWS tried to address this during last year’s re:Invent with the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2018/11/aws-lambda-now-supports-custom-runtimes-and-layers/"&gt;announcement of the Runtime API&lt;/a&gt;. Now, you can build a Linux-compatible binary in the programming language of your choice and run it on Lambda. However, not all programming languages allow you to do that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;There’s no need to rewrite existing apps.&lt;/strong&gt;&lt;br&gt;
If your application is already using an application server, you usually don’t need to make many changes in order to run on Heroku. On the other hand, it may require more effort to port existing apps to Lambda — particularly those not written as serverless functions originally. That’s also assuming that Lambda supports the necessary runtime. With legacy applications, this can be a significant limitation.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You have longer execution time.&lt;/strong&gt;&lt;br&gt;
The current &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/limits.html"&gt;limit for Lambda&lt;/a&gt; execution is 15 minutes. That’s enough for many tasks, but not for all. Some specific tasks, for example, from an ETL domain, may require longer execution time than is available on Lambda. With Heroku, you can use a worker dyno to run tasks for many hours at a time. Nevertheless, it’s best practice to design your dynos to be &lt;a href="https://12factor.net/"&gt;stateless&lt;/a&gt;, so consider using a job queue and saving your work periodically.  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to choose when?
&lt;/h2&gt;

&lt;p&gt;In summary, Lambda can be a great solution when you've got a bit of code you just need to run with as little overhead as possible, for task-based background work (if you can fit it into Lambda limits), infrequently used functions, and instances where long tail latency is not a problem. It can also be the right tool if you need very fast average latency at the edge. You can easily use it for gluing AWS services together with some custom logic and for building infrastructure or automation tools. Think about tasks like decoding small videos, resizing pictures, or processing AWS events as a good fit for Lambda.&lt;/p&gt;

&lt;p&gt;On the other hand, Heroku is a good fit for new web applications because it integrates many common parts of the development lifecycle in a more convenient package, without requiring you to configure multiple services. Doing a Git push and getting a working app after a couple of minutes is a breeze. It's also better for compute operations that take longer than a few minutes or for frequently called functions. Visit the &lt;a href="https://www.heroku.com/platform"&gt;Heroku Platform description&lt;/a&gt; to learn more about how Heroku works.&lt;/p&gt;

&lt;p&gt;What is your opinion on which to choose in 2019? Let us know in the comments below.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>serverless</category>
      <category>heroku</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Identifying trolls and bots on Reddit with machine learning (Part 2)</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Fri, 09 Aug 2019 15:02:43 +0000</pubDate>
      <link>https://dev.to/heroku/identifying-trolls-and-bots-on-reddit-with-machine-learning-part-2-2p0p</link>
      <guid>https://dev.to/heroku/identifying-trolls-and-bots-on-reddit-with-machine-learning-part-2-2p0p</guid>
      <description>&lt;p&gt;Trolls and bots are widespread across social media, and they influence us in ways we are not always aware of. Trolls can be relatively harmless, just trying to entertain themselves at others’ expense, but they can also be political actors sowing mistrust or discord. While some bots offer helpful information, others can be used to manipulate vote counts and promote content that supports their agenda. Bot problems are expected to grow more acute as machine learning technologies mature. &lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/heroku/trolls-and-bots-are-disrupting-social-media-here-s-how-ai-can-stop-them-part-1-55df"&gt;the first part&lt;/a&gt; of this two part series, we covered how to collect comment data from Reddit in bulk and build a dashboard to moderate suspected trolls and bots. In this second part, we’ll show you how we used machine learning to detect bots and trolls using Python and scikit-learn. We’ll then create an API using Flask to say whether comments on Reddit are likely to be bots or trolls for use in our moderator dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background on troll and bot detection
&lt;/h2&gt;

&lt;p&gt;Troll and bot detection is a relatively new field. Historically, companies have employed human moderators to detect and remove content that’s inconsistent with their terms of service. However, this manual process is expensive, and it can be emotionally tiring for humans to review the worst content. We will quickly hit the limits of what human moderators can handle as new technologies like &lt;a href="https://openai.com/blog/better-language-models/" rel="noopener noreferrer"&gt;OpenAI GPT-2&lt;/a&gt; natural language generation are unleashed. As bots improve, it is important to employ counter technologies to protect the integrity of online communities.&lt;/p&gt;

&lt;p&gt;Several studies have been done on bot detection. For example, one researcher found competing &lt;a href="https://www.oreilly.com/ideas/identifying-viral-bots-and-cyborgs-in-social-media" rel="noopener noreferrer"&gt;pro-Trump and anti-Trump bots on Twitter&lt;/a&gt;. Researchers at Indiana University have provided a tool to check Twitter users called &lt;a href="https://botometer.iuni.iu.edu/#!/" rel="noopener noreferrer"&gt;botornot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s also been interesting research on online trolls. Research from Stanford has shown that just &lt;a href="https://snap.stanford.edu/conflict/" rel="noopener noreferrer"&gt;1% of accounts create 74% of conflict&lt;/a&gt;. &lt;a href="https://www.cc.gatech.edu/%7Eeshwar3/uploads/3/8/0/4/38043045/eshwar-norms-cscw2018.pdf" rel="noopener noreferrer"&gt;Researchers at Georgia Tech&lt;/a&gt; used a natural language processing model to identify users who violate norms with behavior like making personal attacks, misogynistic slurs, or even mansplaining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Screening comments for moderation
&lt;/h2&gt;

&lt;p&gt;Our goal is to create a machine learning model to screen comments on the politics subreddit for moderators to review. It doesn't need to have perfect accuracy since the comments will be reviewed by a human moderator. Instead, our measure of success is how much more efficient we can make human moderators. Rather than needing to review every comment, they will be able to review a prescreened subset. We are not trying to replace the existing moderation system that Reddit provides, which allows moderators to review comments that have been reported by users. Instead, this is an additional source of information that can complement the existing system. &lt;/p&gt;

&lt;p&gt;As described in our part one article, we have created a dashboard allowing moderators to review the comments. The machine learning model will score each comment as being a normal user, a bot, or a troll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fu65an5zd9i3t2tbgh6p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fu65an5zd9i3t2tbgh6p4.png" title="Reddit Bot and troll dashboard" alt="Reddit Bot and troll dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try it out for yourself at &lt;a href="https://reddit-dashboard.herokuapp.com/" rel="noopener noreferrer"&gt;reddit-dashboard.herokuapp.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To set your expectations, our system is designed as a proof of concept. It’s not meant to be a production system and is not 100% accurate. We’ll use it to illustrate the steps involved in building a system, with the hopes that platform providers will be able to offer official tools like these in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Collecting training data
&lt;/h2&gt;

&lt;p&gt;Our initial training dataset was collected from a list of known bots and trolls. We’ll combine two lists: &lt;a href="https://www.reddit.com/r/autowikibot/wiki/redditbots" rel="noopener noreferrer"&gt;393 known bots&lt;/a&gt; plus &lt;a href="https://www.reddit.com/r/botwatch/comments/1wg6f6/bot_list_i_built_a_bot_to_find_other_bots_so_far/cf1nu8p/" rel="noopener noreferrer"&gt;167 more&lt;/a&gt; from the botwatch subreddit. We’ll also use a list of 944 troll accounts from &lt;a href="https://www.reddit.com/r/announcements/comments/8bb85p/reddits_2017_transparency_report_and_suspect/" rel="noopener noreferrer"&gt;Reddit’s 2017 Transparency Report&lt;/a&gt; that were suspected of working for the Russian Internet Research Agency. &lt;/p&gt;

&lt;p&gt;We are using an event-driven architecture that consists of a process that downloads data from Reddit and pushes it into a Kafka queue. We then have a Kafka consumer that writes the data into a Redshift data warehouse in batches. We wrote a &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Producer/blob/master/kafka-export.js" rel="noopener noreferrer"&gt;Kafka producer application&lt;/a&gt; to download the comments from the list of bots and trolls. As a result, our data warehouse contains not only the data from the known bots and trolls, but also real-time comments from the politics subreddit. &lt;/p&gt;

&lt;p&gt;While Reddit comments aren’t exactly private, you may have data that is private. For example, you may have data that’s regulated by HIPAA or PCI, or is sensitive to your business or customers. We followed a &lt;a href="https://devcenter.heroku.com/articles/peering-aws-rds-aws-redshift-with-heroku" rel="noopener noreferrer"&gt;Heroku reference architecture&lt;/a&gt; that was designed to protect private data. It provides a Terraform script to automatically configure a Redshift data warehouse and connect it to a Heroku Private Space. As a result, only apps running in the Private Space can access the data.&lt;/p&gt;

&lt;p&gt;We can either train our model on a dyno directly or run a one-off dyno to download the data to CSV and train the model locally. We’ll choose the latter for simplicity, but you’d want to keep sensitive data in the Private Space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;heroku run bash &lt;span class="nt"&gt;-a&lt;/span&gt; kafka-stream-viz-jorge
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGPASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;password&amp;gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"select * from reddit_comments"&lt;/span&gt; | psql &lt;span class="nt"&gt;-h&lt;/span&gt; tf-jorge-tf-redshift-cluster.coguuscncu3p.us-east-1.redshift.amazonaws.com &lt;span class="nt"&gt;-U&lt;/span&gt; jorge &lt;span class="nt"&gt;-d&lt;/span&gt; redshift_jorge &lt;span class="nt"&gt;-p&lt;/span&gt; 5439 &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; reddit.csv
&lt;span class="nb"&gt;gzip &lt;/span&gt;reddit.csv
curl &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@reddit.csv.gz"&lt;/span&gt; https://file.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you prefer to use our training data to try it out yourself, you can download our &lt;a href="https://drive.google.com/file/d/1FDvHMLbJ8mXlsiiNnLgFCV6Yom1m_xbU/view?usp=sharing" rel="noopener noreferrer"&gt;CSV&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We now have a total of 93,668 comments from both sets of users. The ratios between the classes are fixed at 5% trolls, 10% bots, and 85% normal. This is useful for training but likely underestimates the true percentage of normal users.&lt;/p&gt;
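&lt;p&gt;As a quick sanity check on those ratios, the class balance of a labeled dataset can be computed with the standard library alone. The toy &lt;code&gt;labels&lt;/code&gt; list here is illustrative, standing in for the dataset’s real label column:&lt;/p&gt;

```python
from collections import Counter

# Toy stand-in for the dataset's label column (the real data has 93,668 rows).
labels = ["normal"] * 85 + ["bot"] * 10 + ["troll"] * 5

counts = Counter(labels)
total = sum(counts.values())
ratios = {cls: count / total for cls, count in counts.items()}
print(ratios)  # {'normal': 0.85, 'bot': 0.1, 'troll': 0.05}
```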

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2f8eub72hrsb8gugy1z6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2f8eub72hrsb8gugy1z6.png" title="Trolls bots ratio" alt="Trolls bots ratio"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Selecting features
&lt;/h2&gt;

&lt;p&gt;Next, we need to select features to build our model. Reddit provides dozens of JSON fields for each user and comment. Some don’t have meaningful values. For example, &lt;code&gt;banned_by&lt;/code&gt; was null in every case, probably because we lack moderator permissions. We picked the fields below because we thought they’d be valuable as predictors or to understand how well our model performs. We added the column &lt;code&gt;recent_comments&lt;/code&gt; with an array of the last 20 comments made by that user.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  no_follow&lt;/li&gt;
&lt;li&gt;  link_id&lt;/li&gt;
&lt;li&gt;  gilded&lt;/li&gt;
&lt;li&gt;  author&lt;/li&gt;
&lt;li&gt;  author_verified&lt;/li&gt;
&lt;li&gt;  author_comment_karma&lt;/li&gt;
&lt;li&gt;  author_link_karma&lt;/li&gt;
&lt;li&gt;  num_comments&lt;/li&gt;
&lt;li&gt;  created_utc&lt;/li&gt;
&lt;li&gt;  score&lt;/li&gt;
&lt;li&gt;  over_18&lt;/li&gt;
&lt;li&gt;  body&lt;/li&gt;
&lt;li&gt;  is_submitter&lt;/li&gt;
&lt;li&gt;  controversiality&lt;/li&gt;
&lt;li&gt;  ups&lt;/li&gt;
&lt;li&gt;  is_bot&lt;/li&gt;
&lt;li&gt;  is_troll&lt;/li&gt;
&lt;li&gt;  recent_comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some fields like “score” are useful for historical comments, but not for a real-time dashboard because users won’t have had time to vote on that comment yet.&lt;/p&gt;

&lt;p&gt;We added additional calculated fields that we thought would correlate well with bots and trolls. We suspected that a user’s recent comment history would provide valuable insight: if a user repeatedly posts controversial comments with negative sentiment, perhaps they are a troll; likewise, if a user repeatedly posts comments with the same text, perhaps they are a bot. We used the &lt;a href="https://textblob.readthedocs.io/en/dev/" rel="noopener noreferrer"&gt;TextBlob&lt;/a&gt; package to calculate numerical values for each of these. We’ll soon see whether these features are useful in practice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  recent_num_comments&lt;/li&gt;
&lt;li&gt;  recent_num_last_30_days&lt;/li&gt;
&lt;li&gt;  recent_avg_no_follow&lt;/li&gt;
&lt;li&gt;  recent_avg_gilded&lt;/li&gt;
&lt;li&gt;  recent_avg_responses&lt;/li&gt;
&lt;li&gt;  recent_percent_neg_score&lt;/li&gt;
&lt;li&gt;  recent_avg_score&lt;/li&gt;
&lt;li&gt;  recent_min_score&lt;/li&gt;
&lt;li&gt;  recent_avg_controversiality&lt;/li&gt;
&lt;li&gt;  recent_avg_ups&lt;/li&gt;
&lt;li&gt;  recent_avg_diff_ratio&lt;/li&gt;
&lt;li&gt;  recent_max_diff_ratio&lt;/li&gt;
&lt;li&gt;  recent_avg_sentiment_polarity&lt;/li&gt;
&lt;li&gt;  recent_min_sentiment_polarity&lt;/li&gt;
&lt;/ul&gt;
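&lt;p&gt;To make the “diff ratio” idea concrete, here is a minimal sketch of how the average similarity between a user’s recent comments could be computed using the standard library’s &lt;code&gt;difflib&lt;/code&gt;. The function name and exact formula are illustrative; the real feature code lives in the notebooks linked below:&lt;/p&gt;

```python
from difflib import SequenceMatcher
from itertools import combinations

def avg_diff_ratio(comments):
    """Average pairwise text similarity across a user's recent comments.

    Values near 1.0 mean the comments are nearly identical text,
    which hints that the account may be a bot.
    """
    pairs = list(combinations(comments, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

bot_like = ["I am a bot, beep boop.", "I am a bot, beep boop."]
human_like = ["I disagree with this take.", "Great photo of a kitten!"]
print(avg_diff_ratio(bot_like) > avg_diff_ratio(human_like))  # True
```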

&lt;p&gt;For more information on what these fields are and how they are calculated, see the code in our Jupyter Notebooks in &lt;a href="https://github.com/devspotlight/botidentification" rel="noopener noreferrer"&gt;https://github.com/devspotlight/botidentification&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Building a machine learning model
&lt;/h2&gt;

&lt;p&gt;Our next step is to create a new machine learning model based on this list. We’ll use Python’s excellent &lt;a href="https://scikit-learn.org" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt; framework to build our model. We’ll store our training data in two data frames: one with the set of features to train on and a second with the desired class labels. We’ll then split our dataset into 70% training data and 30% test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;input_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll create a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="noopener noreferrer"&gt;decision tree classifier&lt;/a&gt; to predict whether each comment is a bot, a troll, or a normal user. We’ll use a decision tree because the resulting rules are easy to understand. The accuracy could probably be improved with a more robust algorithm like a random forest, but we’ll stick with a decision tree to keep our example simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;class_weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;troll&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
         &lt;span class="n"&gt;min_samples_leaf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll notice a few parameters in the above code sample. We are setting the maximum depth of the tree to 3 not only to avoid overfitting, but also so that it’s easier to visualize the resulting tree. We are also setting the class weights so that bots and trolls are less likely to be missed, even at the expense of falsely labeling a normal user. Lastly, we are requiring that the leaf nodes have at least 100 samples to keep our tree simpler. &lt;/p&gt;

&lt;p&gt;Now we’ll test the model against the 30% of data we held out as a test set. This will tell us how well our model performs at guessing whether each comment is from a bot, troll, or normal user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;True&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;margins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a &lt;a href="https://en.wikipedia.org/wiki/Confusion_matrix" rel="noopener noreferrer"&gt;confusion matrix&lt;/a&gt; showing, for each true target label, how many of the comments were predicted correctly or incorrectly. For example, we can see below that out of 1,956 total troll comments, we correctly predicted 1,451 of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Predicted    bot        normal    troll        All
True                                 
bot          3677       585       33           4295
normal       197        20593     993          21783
troll        5          500       1451         1956
All          3879       21678     2477         28034
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words, the recall for trolls is 74%. The precision is lower; of all comments predicted as being a troll, only 58% really are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recall : [0.85611176 0.94537024 0.74182004]
Precision: [0.94792472 0.94994926 0.58578926]
Accuracy: 0.917493044160662
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overall accuracy comes to 91.7%. The model performed best for normal users, with about 95% precision and recall. It performed fairly well for bots, but had a harder time distinguishing trolls from normal users. Overall, the results look strong even for a fairly simple model. &lt;/p&gt;
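&lt;p&gt;These figures follow directly from the confusion matrix. As a sanity check, here is the arithmetic spelled out in plain Python, with the counts copied from the table above:&lt;/p&gt;

```python
# Confusion matrix from the table above: rows are true labels,
# columns are predicted labels.
matrix = {
    "bot":    {"bot": 3677, "normal": 585,   "troll": 33},
    "normal": {"bot": 197,  "normal": 20593, "troll": 993},
    "troll":  {"bot": 5,    "normal": 500,   "troll": 1451},
}
classes = ["bot", "normal", "troll"]

# Recall: fraction of each true class that was predicted correctly.
recall = {c: matrix[c][c] / sum(matrix[c].values()) for c in classes}
# Precision: fraction of each predicted class that really is that class.
precision = {c: matrix[c][c] / sum(matrix[r][c] for r in classes) for c in classes}
# Accuracy: correct predictions over all comments.
accuracy = sum(matrix[c][c] for c in classes) / sum(
    sum(row.values()) for row in matrix.values())

print(round(recall["troll"], 2), round(precision["troll"], 2))  # 0.74 0.59
print(round(accuracy, 3))  # 0.917
```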

&lt;h2&gt;
  
  
  What does the model tell us?
&lt;/h2&gt;

&lt;p&gt;Now that we have this great machine learning model that can predict bots and trolls, how does it work and what can we learn from it? A great start is to look at which features were most important.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_imp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;recent_avg_diff_ratio&lt;/span&gt;           &lt;span class="mf"&gt;0.465169&lt;/span&gt;
&lt;span class="n"&gt;author_comment_karma&lt;/span&gt;            &lt;span class="mf"&gt;0.329354&lt;/span&gt;
&lt;span class="n"&gt;author_link_karma&lt;/span&gt;               &lt;span class="mf"&gt;0.099974&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_responses&lt;/span&gt;            &lt;span class="mf"&gt;0.098622&lt;/span&gt;
&lt;span class="n"&gt;author_verified&lt;/span&gt;                 &lt;span class="mf"&gt;0.006882&lt;/span&gt;
&lt;span class="n"&gt;recent_min_sentiment_polarity&lt;/span&gt;   &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_no_follow&lt;/span&gt;            &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;over_18&lt;/span&gt;                         &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;is_submitter&lt;/span&gt;                    &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_num_comments&lt;/span&gt;             &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_num_last_30_days&lt;/span&gt;         &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_gilded&lt;/span&gt;               &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_sentiment_polarity&lt;/span&gt;   &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_percent_neg_score&lt;/span&gt;        &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_score&lt;/span&gt;                &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_min_score&lt;/span&gt;                &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_controversiality&lt;/span&gt;     &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_avg_ups&lt;/span&gt;                  &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;recent_max_diff_ratio&lt;/span&gt;           &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;span class="n"&gt;no_follow&lt;/span&gt;                       &lt;span class="mf"&gt;0.000000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interesting! The most important feature was the average difference ratio in the text of the recent comments. This means if the text of the last 20 comments is very similar, it’s probably a bot. The next most important features were the comment karma, link karma, the number of responses to recent comments, and whether the account is verified.&lt;/p&gt;

&lt;p&gt;Why are the rest zero? We limited the depth of our decision tree to 3 levels, so we are intentionally not including all the features. Notably, the model didn’t use the scores or sentiment of previous comments to classify the trolls. Either these trolls were fairly polite and earned a decent number of votes, or the other features had better discriminatory power.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the actual decision tree to get more information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;export_graphviz&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tree.dot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feature_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;class_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;normal&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;troll&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="n"&gt;rounded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proportion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fldk0tn490iusjdalzlvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fldk0tn490iusjdalzlvc.png" title="decision tree" alt="decision tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can get an idea of how this model works! You might need to zoom in to see the details. &lt;/p&gt;

&lt;p&gt;Let’s start at the top of the tree. When the recent comments are fairly similar to each other (the average difference ratio is high), then it’s more likely to be a bot. When they have dissimilar comments, low comment karma, and high link karma, they are more likely to be a troll. This could make sense if the trolls use posts of kittens to pump up their link karma, and then make nasty comments in the forums that either get ignored or downvoted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosting an API
&lt;/h2&gt;

&lt;p&gt;To put our machine learning model to use, we need to make it available to our moderator dashboard. We can do that by hosting an API for the dashboard to call. &lt;/p&gt;

&lt;p&gt;To serve our API, we used &lt;a href="http://flask.pocoo.org/" rel="noopener noreferrer"&gt;Flask&lt;/a&gt;, which is a lightweight web framework for Python. When we load our machine learning model, the server starts. When it receives a POST request containing a JSON object with the comment data, it responds back with the prediction. &lt;/p&gt;
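&lt;p&gt;Here is a minimal sketch of such a Flask app. The route and the &lt;code&gt;fake_predict&lt;/code&gt; stand-in are illustrative assumptions; the real server loads the trained scikit-learn model and derives the features described earlier from the JSON fields before predicting:&lt;/p&gt;

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def fake_predict(comment):
    # Stand-in for the real model: the actual app calls clf.predict()
    # on a feature vector built from the comment JSON.
    return "Is a bot user" if comment.get("is_bot") else "Is a normal user"

@app.route("/", methods=["POST"])
def predict():
    # Parse the posted comment JSON and respond with a prediction label.
    comment = request.get_json(force=True)
    return jsonify({"prediction": fake_predict(comment)})

if __name__ == "__main__":
    import os
    # Heroku supplies the port via the environment.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "5000")))
```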

&lt;p&gt;Example request for a &lt;strong&gt;bot&lt;/strong&gt; user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"banned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"no_follow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"link_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"t3_aqtwe1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gilded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"AutoModerator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_verified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_comment_karma"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;445850.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_link_karma"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1778.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"num_comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"created_utc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1550213389.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"over_18"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hey, thanks for posting at &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;/r&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;/SwitchHaxing! Unfortunately your comment has been removed due to rule 6; please post questions in the stickied Q&amp;amp;amp;A thread.If you believe this is an error, please contact us via modmail and well sort it out.*I am a bot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"downs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_submitter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"num_reports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"controversiality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"quarantine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"ups"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_bot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_troll"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"recent_comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[...array of 20 recent comments...]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response returned is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Is a bot user"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We deployed our API on &lt;a href="https://heroku.com/" rel="noopener noreferrer"&gt;Heroku&lt;/a&gt; because it makes the service very easy to run. We just create a &lt;a href="https://devcenter.heroku.com/articles/procfile" rel="noopener noreferrer"&gt;Procfile&lt;/a&gt; with a single line telling Heroku which file to use for the web server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web: python app.py ${port}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then push our code to Heroku with git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push heroku master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Heroku takes care of the hassle of downloading requirements, building the API, setting up a web server, routing, etc. We can now access our API at this URL and use &lt;a href="https://www.getpostman.com" rel="noopener noreferrer"&gt;Postman&lt;/a&gt; to send a test request:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://botidentification-comments.herokuapp.com/" rel="noopener noreferrer"&gt;https://botidentification.herokuapp.com/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  See it working
&lt;/h2&gt;

&lt;p&gt;Thanks to the great moderator dashboard we wrote in &lt;a href="https://dev.to/heroku/trolls-and-bots-are-disrupting-social-media-here-s-how-ai-can-stop-them-part-1-55df"&gt;part one&lt;/a&gt;, we can now see the performance of our model operating on real comments. If you haven’t already, check it out here: &lt;a href="https://reddit-dashboard.herokuapp.com/" rel="noopener noreferrer"&gt;reddit-dashboard.herokuapp.com&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fu65an5zd9i3t2tbgh6p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fu65an5zd9i3t2tbgh6p4.png" title="Reddit bot and troll dashboard" alt="Reddit bot and troll dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It streams live comments from the r/politics subreddit. You can see each comment and whether the model scored it as a bot, troll, or normal user. &lt;/p&gt;

&lt;p&gt;You may see some comments labeled as bots or trolls even when it’s not obvious why from their comment history. Keep in mind that we used a simple model to keep this tutorial easy to follow. The precision for labeling trolls is only 58%, which is why we designed it as a filter for human moderators to review.&lt;/p&gt;

&lt;p&gt;If you’re interested in playing with this model yourself, check out the code on GitHub at &lt;a href="https://github.com/devspotlight/botidentification" rel="noopener noreferrer"&gt;https://github.com/devspotlight/botidentification&lt;/a&gt;. You can try improving the accuracy of the model by using a more sophisticated algorithm such as a random forest. Spoiler alert: it’s possible to get 95%+ accuracy on the test data with more sophisticated models, but we’ll leave it as an exercise for you.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Trolls and bots are disrupting social media—here’s how AI can stop them (Part 1)</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Fri, 09 Aug 2019 14:59:25 +0000</pubDate>
      <link>https://dev.to/heroku/trolls-and-bots-are-disrupting-social-media-here-s-how-ai-can-stop-them-part-1-55df</link>
      <guid>https://dev.to/heroku/trolls-and-bots-are-disrupting-social-media-here-s-how-ai-can-stop-them-part-1-55df</guid>
<description>&lt;p&gt;Trolls and bots have a huge and often unrecognized influence on social media. They are used to influence conversations for commercial or political reasons. They allow small hidden groups of people to promote information supporting their agenda at a large scale. They can push their content to the top of people’s news feeds, search results, and shopping carts. Some say they can even influence presidential elections. In order to maintain the quality of discussion on social sites, it’s become necessary to screen and moderate community content. Can we use machine learning to identify suspicious posts and comments? The answer is yes, and we’ll show you how.&lt;/p&gt;

&lt;p&gt;This is a two-part series. In this part, we'll cover how to collect comment data from Reddit in bulk and build a real-time dashboard using Node and Kafka to moderate suspected trolls and bots. In &lt;a href="https://dev.to/heroku/identifying-trolls-and-bots-on-reddit-with-machine-learning-part-2-2p0p"&gt;part two&lt;/a&gt;, we'll cover the specifics of building the machine learning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trolls and bots are a huge pain for social media
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Internet_troll" rel="noopener noreferrer"&gt;Trolls&lt;/a&gt; are dangerous online because it's not always obvious when you are being influenced by them or engaging with them. Posts created by Russian operatives were seen by up to &lt;a href="https://www.washingtonpost.com/news/the-switch/wp/2017/11/01/how-russian-trolls-got-into-your-facebook-feed/?noredirect=on&amp;amp;utm_term=.82754f76267b" rel="noopener noreferrer"&gt;126 million Americans on Facebook&lt;/a&gt; leading up to the last election. Twitter released a massive data dump of over &lt;a href="https://www.vox.com/2018/10/19/17990946/twitter-russian-trolls-bots-election-tampering" rel="noopener noreferrer"&gt;9 million tweets&lt;/a&gt; from Russian trolls. And it’s not just Russia! There are also accounts of trolls attempting to &lt;a href="https://www.buzzfeednews.com/article/craigsilverman/reddit-coordinated-chinese-propaganda-trolls" rel="noopener noreferrer"&gt;influence Canada&lt;/a&gt; after the conflict with Huawei. The problem even extends to online shopping where &lt;a href="https://www.forbes.com/sites/emmawoollacott/2017/09/09/exclusive-amazons-fake-review-problem-is-now-worse-than-ever/#39c475cd7c0f" rel="noopener noreferrer"&gt;reviews on Amazon&lt;/a&gt; have slowly been getting more heavily manipulated by merchants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Social_bot" rel="noopener noreferrer"&gt;Bots&lt;/a&gt; are computer programs posing as people. They can amplify the effect of trolls by engaging or liking their content en masse, or by posting their own content in an automated fashion. They will get more sophisticated and harder to detect in the future. Bots can now create entire paragraphs of text in response to text posts or comments. &lt;a href="https://openai.com/blog/better-language-models/" rel="noopener noreferrer"&gt;OpenAI’s GPT-2&lt;/a&gt; model can write text that feels and looks very similar to human quality. OpenAI decided not to release it due to safety concerns, but it’s only a matter of time before the spammers catch up. As a disclaimer, not all bots are harmful. In fact, the majority of bots on Reddit try to help the community by moderating content, finding duplicate links, providing summaries of articles, and more. It will be important to distinguish helpful from harmful bots.&lt;/p&gt;

&lt;p&gt;How can we defend ourselves from propaganda and spam posted by malicious trolls and bots? We could carefully investigate the background of each poster, but we don’t have time to do this for every comment we read. The answer is to automate the detection using big data and machine learning. Let’s fight fire with fire!&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying bots and trolls on Reddit
&lt;/h2&gt;

&lt;p&gt;We’ll focus on Reddit because users often complain of trolls in political threads. It’s easier for trolls to operate thanks to anonymous posting. Operatives can create dozens or hundreds of accounts to simulate user engagement, likes and comments. Research from Stanford has shown that just &lt;a href="https://snap.stanford.edu/conflict/" rel="noopener noreferrer"&gt;1% of accounts create 74% of conflict&lt;/a&gt;. Over the past few months, we’ve seen numerous comments like this one in the worldnews subreddit:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Anyone else notice the false users in this thread? I recognise their language. It has very specific traits like appearing to have genuine curiosity yet backed by absurd statements. Calling for 'clear evidence' and questioning the veracity of statements (which would normally be a good thing but not under a guise). Wonder if you could run it through machine learning to identify these type of users/comments.” - &lt;strong&gt;koalefant&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.reddit.com/r/worldnews/comments/aciovt/_/ed8alk0/?context=1" rel="noopener noreferrer"&gt;https://www.reddit.com/r/worldnews/comments/aciovt/_/ed8alk0/?context=1&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2sav2lwe343eqr8ch0c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2sav2lwe343eqr8ch0c8.png" title="challenge accepted" alt="challenge accepted"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several existing resources we can leverage. For example, the &lt;a href="https://www.reddit.com/r/botwatch/" rel="noopener noreferrer"&gt;botwatch&lt;/a&gt; subreddit keeps track of bots on Reddit, true to its name. &lt;a href="https://www.reddit.com/r/announcements/comments/8bb85p/reddits_2017_transparency_report_and_suspect/" rel="noopener noreferrer"&gt;Reddit’s 2017 Transparency Report&lt;/a&gt; also listed 944 accounts suspected of being trolls working for the Russian Internet Research Agency. &lt;/p&gt;

&lt;p&gt;Also, there are software tools for analyzing Reddit users. For example, the very nicely designed &lt;a href="https://atomiks.github.io/reddit-user-analyser/" rel="noopener noreferrer"&gt;reddit-user-analyzer&lt;/a&gt; can do sentiment analysis, plot the controversiality of user comments, and more. Let’s take this a step further and build a tool that puts the power in the hands of moderators and users. &lt;/p&gt;

&lt;p&gt;In this article, the first of a two-part series, we’ll cover how to capture data from Reddit’s API for analysis and how to build the actual dashboard. In part two, we’ll dive deeper into how we built the machine learning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a dashboard of suspected bots and trolls
&lt;/h2&gt;

&lt;p&gt;In this tutorial, you’ll learn how to create a dashboard that identifies bots and trolls in Reddit comments in real time, with the help of machine learning. This could be a useful tool to help moderators of political subreddits identify and remove content from bots and trolls. As users submit comments to the r/politics subreddit, we’ll capture the comments and run them through our machine learning model, then report suspicious ones on a dashboard for moderators to review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm9hvz5pb296oe4wq3lkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fm9hvz5pb296oe4wq3lkt.png" title="Reddit bot and troll dashboard" alt="Reddit bot and troll dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a screengrab from our dashboard. Try it out yourself at &lt;a href="https://reddit-dashboard.herokuapp.com/" rel="noopener noreferrer"&gt;reddit-dashboard.herokuapp.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To set your expectations, our system is designed as a proof of concept. It’s not meant to be a production system and is not 100% accurate. We’ll use it to illustrate the steps involved in building a system, with the hopes that platform providers will be able to offer official tools like these in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  System architecture
&lt;/h2&gt;

&lt;p&gt;Due to the high number of posts and comments being made on social media sites, it’s necessary to use a scalable infrastructure to process them. We’ll design our system architecture using an example written by the Heroku team in &lt;a href="https://blog.heroku.com/event-streams-kafka-redshift-metabase" rel="noopener noreferrer"&gt;Managing Real-time Event Streams with Apache Kafka&lt;/a&gt;. This is an event-driven architecture that will let us produce data from the Reddit API and send it to Kafka. &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; makes it easy to process streaming data and decouple the different parts of our system. Reading this data from Kafka, our dashboard can call the machine learning API and display the results. We’ll also store the data in Redshift for historical analysis and for use as training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbg2lfp4w3q2k3es4yp4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbg2lfp4w3q2k3es4yp4y.png" title="Apache kafka" alt="Apache kafka"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Collecting data from Reddit
&lt;/h2&gt;

&lt;p&gt;Our first step is to download the comments from the politics subreddit for analysis. Reddit makes it easy to access comments as structured data in JSON format. To get recent comments for any subreddit, just request the following URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.reddit.com/r/${subreddit}/comments.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, we can access public data about each user, including their karma and comment history. All we need to do is request this data from a URL containing the username, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.reddit.com/user/${username}/about.json
https://www.reddit.com/user/${username}/comments.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To collect the data, we just looped through each comment in the r/politics subreddit, and then loaded the user data for each commenter. You can use whatever HTTP request library you like, but our examples use &lt;a href="https://github.com/axios/axios" rel="noopener noreferrer"&gt;axios&lt;/a&gt; for Node.js. Also, we’ll combine data from both calls into a single convenient data structure that includes both the user information and their comments. This will make it easier to store and retrieve each example later. This functionality can be seen in the &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Producer/blob/master/profile-scraper.js" rel="noopener noreferrer"&gt;profile-scraper.js&lt;/a&gt; file and you can learn more about how to run it in the &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Producer" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;
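&lt;p&gt;In outline, the collection loop looks something like the sketch below. The helper names are ours for illustration; the real implementation is in profile-scraper.js:&lt;/p&gt;

```javascript
// Illustrative sketch of the collection loop; see profile-scraper.js
// in the repo for the real implementation. Helper names are ours.
const commentsUrl = (subreddit) =>
  `https://www.reddit.com/r/${subreddit}/comments.json`
const aboutUrl = (username) =>
  `https://www.reddit.com/user/${username}/about.json`
const userCommentsUrl = (username) =>
  `https://www.reddit.com/user/${username}/comments.json`

// Fetch each comment in the subreddit, then the commenter's profile,
// and merge both into one record (the combined structure we store).
async function collectProfiles (subreddit) {
  const axios = require('axios') // npm install axios
  const listing = await axios.get(commentsUrl(subreddit))
  const profiles = []
  for (const child of listing.data.data.children) {
    const username = child.data.author
    const about = await axios.get(aboutUrl(username))
    const comments = await axios.get(userCommentsUrl(username))
    profiles.push({
      comment: child.data,
      user: about.data.data,
      recent_comments: comments.data.data.children
    })
  }
  return profiles
}
```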

&lt;h2&gt;
  
  
  Real-time event streaming in Kafka
&lt;/h2&gt;

&lt;p&gt;Now that the data has been collected from Reddit, we are ready to stream the comments into Kafka. Before connecting to the Kafka server you will need to &lt;a href="https://devcenter.heroku.com/articles/kafka-on-heroku#understanding-topics" rel="noopener noreferrer"&gt;create a topic&lt;/a&gt; in the Heroku dashboard. Click Add Topic and set the topic name with a single partition.&lt;/p&gt;

&lt;p&gt;To connect to the Kafka server as a &lt;a href="https://kafka.apache.org/documentation/#producerapi" rel="noopener noreferrer"&gt;Producer&lt;/a&gt; in Node.js you can use the &lt;a href="https://www.npmjs.com/package/no-kafka" rel="noopener noreferrer"&gt;no-kafka&lt;/a&gt; library with the connection information already set in the cluster created by Heroku:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;Kafka&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;no-kafka&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_URL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_CLIENT_CERT&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_CLIENT_CERT_KEY&lt;/span&gt;

&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.crt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cert&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Kafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;reddit-comment-producer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\+&lt;/span&gt;&lt;span class="sr"&gt;ssl/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;certFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.crt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;keyFile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you are connected to Kafka, you can send messages to the topic you created earlier. For convenience, we decided to &lt;a href="https://www.w3schools.com/js/js_json_stringify.asp" rel="noopener noreferrer"&gt;stringify the JSON messages&lt;/a&gt; before sending them to Kafka in our live streaming app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;northcanadian-72923.reddit-comments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our repo, the sample live streaming worker code is in the &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Producer/blob/master/kafka-stream.js" rel="noopener noreferrer"&gt;kafka-stream.js&lt;/a&gt; file.&lt;/p&gt;
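&lt;p&gt;Conceptually, the worker polls Reddit and forwards each comment it hasn’t seen yet through the producer. A simplified sketch (the de-duplication and helper names here are ours; see kafka-stream.js for the real logic):&lt;/p&gt;

```javascript
// Simplified polling loop for the streaming worker. fetchComments and
// send are passed in so the loop stays testable; kafka-stream.js in
// the repo contains the production version.
const seenIds = new Set()

async function pollOnce (fetchComments, send) {
  const comments = await fetchComments('politics')
  for (const comment of comments) {
    if (seenIds.has(comment.id)) continue
    seenIds.add(comment.id)
    // Stringify before sending, matching the producer.send() call above.
    await send(JSON.stringify(comment))
  }
}
```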

&lt;h2&gt;
  
  
  Building a moderator dashboard
&lt;/h2&gt;

&lt;p&gt;Our sample dashboard is a JavaScript application based on a previous version of the &lt;a href="https://github.com/heroku/kafka-demo" rel="noopener noreferrer"&gt;twitter-display Kafka demo app&lt;/a&gt; by Heroku. We simplified the app by removing some dependencies and modules, but the general architecture remains: an Express app (server-side) to consume and process the Kafka topic, connected via a web socket with a &lt;a href="https://d3js.org/" rel="noopener noreferrer"&gt;D3&lt;/a&gt; front end (client-side) to display the messages (Reddit comments) and their classification in real time. You can find our open source code at &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Consumers" rel="noopener noreferrer"&gt;https://github.com/devspotlight/Reddit-Kafka-Consumers&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the server-side Node app, we connect to Kafka as a simple &lt;a href="https://kafka.apache.org/documentation/#consumerapi" rel="noopener noreferrer"&gt;Consumer&lt;/a&gt;, subscribe to the topic, and broadcast each group of messages to our function which loads the prediction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;predictBotOrTrolls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTERVAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;constants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_TOPIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;connectionString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;KAFKA_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;cert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.crt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./client.key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then use &lt;strong&gt;unirest&lt;/strong&gt; (an HTTP/REST request library) to send the unified data scheme from those messages to our machine learning API for real-time predictions on whether the author is a person, a bot, or a troll (more on that in the next section of this article). &lt;/p&gt;
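&lt;p&gt;As a rough sketch of that step (the message shape, env var name, and helper names are assumptions for illustration; the repo has the real implementation), each consumed message is parsed back from its stringified form and posted to the API:&lt;/p&gt;

```javascript
// Hedged sketch of forwarding consumed Kafka messages to the ML API.
// The message shape and the ML_API_URL env var name are assumptions.
function parseMessages (msgs) {
  // Each message value holds a stringified JSON comment record.
  return msgs.map((m) => JSON.parse(m.message.value.toString()))
}

// Placeholder for pushing results to the dashboard over the WebSocket.
const broadcastToDashboard = (record) => { /* see app.js in the repo */ }

function predictBotOrTrolls (msgs) {
  const unirest = require('unirest') // npm install unirest
  for (const record of parseMessages(msgs)) {
    unirest
      .post(process.env.ML_API_URL)
      .type('json')
      .send(record)
      .end((response) => {
        broadcastToDashboard({ ...record, prediction: response.body.prediction })
      })
  }
}
```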

&lt;p&gt;Finally, a WebSocket server is used &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Consumers/blob/master/viz/app.js#L43" rel="noopener noreferrer"&gt;in our app.js&lt;/a&gt; so that the front end can get all the display data in real time. Since the subreddit comments stream in real time, the scaling and load balancing of each application should be considered and monitored.&lt;/p&gt;

&lt;p&gt;We use the popular D3 JavaScript library to update the dashboard dynamically as Kafka messages stream in. Visually, a table is bound to the data stream, and it is updated with the newest comments as they arrive (newest first), along with a count of each user type detected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;d3&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;d3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;maxSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tbody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;d3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_maxSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxSize&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_maxSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_maxSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Bind data rows to target table&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tbody&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_rowData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See &lt;a href="https://github.com/devspotlight/analytics-with-kafka-redshift-metabase/blob/master/viz/src/lib/data-table.js" rel="noopener noreferrer"&gt;data-table.js&lt;/a&gt; for more details. The code shown above is just an excerpt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calling out to our ML API
&lt;/h2&gt;

&lt;p&gt;Our machine learning API examines features of the comment poster’s account and recent comment history. We trained our model on features such as the user’s Reddit “karma”, the number of comments posted, whether the account is verified, and other signals we hypothesized would be useful in categorizing users. We pass this collection of features to the model as a JSON object, and the model returns a prediction for that user that we can display in our dashboard. Below are sample JSON objects (using our unified data schema) sent as requests to the HTTP API.&lt;/p&gt;

&lt;p&gt;Example for a &lt;strong&gt;bot&lt;/strong&gt; user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"banned_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"no_follow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"link_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"t3_aqtwe1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"gilded"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"AutoModerator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_verified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_comment_karma"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;445850.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"author_link_karma"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1778.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"num_comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"created_utc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1550213389.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"over_18"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hey, thanks for posting at &lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;/r&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;/SwitchHaxing! Unfortunately your comment has been removed due to rule 6; please post questions in the stickied Q&amp;amp;amp;A thread.If you believe this is an error, please contact us via modmail and well sort it out.*I am a bot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"downs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_submitter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"num_reports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"controversiality"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"quarantine"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"ups"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_bot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_troll"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"recent_comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[...array of 20 recent comments...]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response returned is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Is a bot user"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
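&lt;p&gt;For illustration, here is a Node.js sketch of how a consumer might call such an HTTP API. The endpoint URL and the &lt;code&gt;buildFeatures&lt;/code&gt; and &lt;code&gt;predictUser&lt;/code&gt; names are hypothetical, invented for this example; see the linked repository for the actual code.&lt;/p&gt;

```javascript
// Sketch of a consumer calling the prediction API. PREDICT_URL,
// buildFeatures, and predictUser are illustrative names for this article,
// not the exact code in the linked repository.
const PREDICT_URL = 'https://example.com/predict' // hypothetical endpoint

// Assemble the feature object the model expects (see the sample above).
function buildFeatures (comment) {
  return {
    no_follow: comment.no_follow,
    author: comment.author,
    author_verified: comment.author_verified,
    author_comment_karma: comment.author_comment_karma,
    author_link_karma: comment.author_link_karma,
    num_comments: comment.num_comments,
    created_utc: comment.created_utc,
    score: comment.score,
    over_18: comment.over_18,
    body: comment.body,
    controversiality: comment.controversiality,
    recent_comments: JSON.stringify(comment.recent_comments || [])
  }
}

// POST the features and return the model's prediction string,
// e.g. "Is a bot user" as in the sample response.
async function predictUser (comment, fetchFn = fetch) {
  const res = await fetchFn(PREDICT_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildFeatures(comment))
  })
  const body = await res.json()
  return body.prediction
}
```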



&lt;h2&gt;
  
  
  Run it easily using a Heroku Button
&lt;/h2&gt;

&lt;p&gt;As you can see, our architecture has many parts—including producers, Kafka, and a visualization app—which might make you think that it’s difficult to run or manage. However, we have a &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Consumers" rel="noopener noreferrer"&gt;Heroku button&lt;/a&gt; that allows us to run the whole stack in a single click. Pretty neat, huh? This opens the door to using more sophisticated architectures without the extra fuss.&lt;/p&gt;

&lt;p&gt;If you’re technically inclined, give it a shot. You can have a Kafka cluster running pretty quickly, and you only pay for the time it's running. Documentation for both local development and production deployment is in our code’s &lt;a href="https://github.com/devspotlight/Reddit-Kafka-Consumers" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;We’d like to encourage the community to use these techniques to control the spread of trolls and harmful bots. It’s an exciting time to be alive, watching trolls attempt to influence social media while these communities develop better machine learning and moderation tools to stop them. Hopefully we’ll be able to keep our community forums as places for meaningful discussion.&lt;/p&gt;

&lt;p&gt;Check out our part-two article, “&lt;a href="https://dev.to/heroku/identifying-trolls-and-bots-on-reddit-with-machine-learning-part-2-2p0p"&gt;Detecting bots and trolls on Reddit using machine learning&lt;/a&gt;”, which dives deeper into how we built the machine learning model and how accurate it is.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>webdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Twelve-Factor Apps: A Retrospective and Look Forward</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Wed, 17 Jul 2019 15:09:57 +0000</pubDate>
      <link>https://dev.to/heroku/twelve-factor-apps-a-retrospective-and-look-forward-4j4f</link>
      <guid>https://dev.to/heroku/twelve-factor-apps-a-retrospective-and-look-forward-4j4f</guid>
<description>&lt;p&gt;If your team is creating apps for the cloud, chances are the &lt;a href="https://12factor.net/"&gt;Twelve-Factor App methodology&lt;/a&gt; has influenced the frameworks and platforms you’re using. Popular frameworks such as Spring Boot and Magento credit the twelve factors as part of their design. Leading companies such as Heroku, Amazon, and Microsoft use and recommend the methodology. While new frameworks and methodologies are released every month, few have had the far-reaching impact of this one.&lt;/p&gt;

&lt;p&gt;Let's take a look at what these factors are all about, the story behind the creation of this methodology seven years ago, and why they are just as important today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creation of Twelve-Factor Apps
&lt;/h2&gt;

&lt;p&gt;In late 2011, &lt;a href="https://www.heroku.com/"&gt;Heroku&lt;/a&gt; co-founder Adam Wiggins knew there was a problem with the state of application development and deployment. Adam and his team had been personally involved in hundreds of apps and witnessed hundreds of thousands of app deployments into the cloud through the Heroku platform. Some of these apps were successful, taking advantage of features such as distributed architectures. But some of these apps had serious problems: they were not scalable, portable, or easy to maintain. The team recognized a common set of issues among these problem apps, and wanted to do something about it.&lt;/p&gt;

&lt;p&gt;Adam and his team came up with a set of guidelines for building successful cloud apps: guidelines that would minimize cost and time, maximize portability, enable continuous deployment, and allow scaling without changes to processes or architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Twelve-Factor Apps: A Methodology for SaaS App Development
&lt;/h2&gt;

&lt;p&gt;Let’s start with an overview of the twelve factors. You can see the full list and extra detail for each factor at &lt;a href="https://12factor.net/"&gt;12factor.net&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/codebase"&gt;Codebase&lt;/a&gt;&lt;br&gt;
Use source control. One codebase per application. Deploy to multiple environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/dependencies"&gt;Dependencies&lt;/a&gt;&lt;br&gt;
Declare and isolate dependencies. Never rely on the existence of system packages. Never commit dependencies in the codebase repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/config"&gt;Config&lt;/a&gt;&lt;br&gt;
Keep configuration separate from codebase.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/backing-services"&gt;Backing Services&lt;/a&gt;&lt;br&gt;
Treat services the app consumes (database, caching, and so on) as attachable resources. You should be able to swap your database instance without code changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/build-release-run"&gt;Build, Release, Run&lt;/a&gt;&lt;br&gt;
Deploy apps in three discrete steps: build (convert codebase into executable), release (combine build artifacts with config to create a release image), and run (use the same release image every time you launch).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/processes"&gt;Processes&lt;/a&gt;&lt;br&gt;
Processes should be stateless.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/port-binding"&gt;Port Binding&lt;/a&gt;&lt;br&gt;
Export services via port binding. Apps should be completely self-contained.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/concurrency"&gt;Concurrency&lt;/a&gt;&lt;br&gt;
Scale out by decomposing applications into individual processes that do specific jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/disposability"&gt;Disposability&lt;/a&gt;&lt;br&gt;
Apps should be quick to start, resilient to failure, and graceful to shut down. Expect servers to fail, be added, and change.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/dev-prod-parity"&gt;Dev/prod parity&lt;/a&gt;&lt;br&gt;
Keep development, staging, and production environments as similar as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/logs"&gt;Logs&lt;/a&gt;&lt;br&gt;
Treat logs as event streams. Write to stdout and stderr.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://12factor.net/admin-processes"&gt;Admin Processes&lt;/a&gt;&lt;br&gt;
Run admin tasks (database migrations, background jobs, cache clearing, and so on) as one-off, isolated processes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How the Twelve-Factor App Changed Application Development and DevOps
&lt;/h2&gt;

&lt;p&gt;Over the past seven years, these twelve factors have guided tens of thousands of apps to success. In fact, the methodology has worked so well in creating apps that are maintainable, scalable, and portable that many of the twelve factors have been adopted as “common sense” for cloud development and DevOps. We all now know about explicit dependency declarations, scaling out versus up, and the benefits of a code repository. However, some of these factors were radical suggestions when first introduced.&lt;/p&gt;

&lt;p&gt;These battle-tested, industry-accepted factors have even been codified by some of the most successful cloud frameworks and tools. Most modern frameworks such as &lt;a href="https://spring.io/blog/2016/09/12/springone-platform-2016-replay-12-factor-or-cloud-native-apps-what-exactly-does-that-mean-for-spring-developers"&gt;Spring &amp;amp; Spring Boot&lt;/a&gt;, &lt;a href="https://symfony.com/blog/new-in-symfony-3-2-runtime-environment-variables"&gt;Symfony&lt;/a&gt;, and &lt;a href="https://devdocs.magento.com/guides/v2.3/config-guide/deployment/pipeline/"&gt;Magento&lt;/a&gt; (Adobe) embody the twelve factors as part of their design principles. Tools such as Docker images and Heroku slugs (build/release/run), Vagrant (dev/prod parity), Puppet and Vault (configuration), and Papertrail (logging) have been created to enforce, automate, and simplify management for apps using the factors.&lt;/p&gt;

&lt;p&gt;The Heroku platform also embodies the twelve factors. For example, Heroku requires apps to be decomposed into one or more lightweight, discrete containers (dynos)—a direct manifestation of the stateless factor. Heroku also enforces languages and frameworks to use an explicit list of app dependencies (such as Ruby’s bundler), allows admin processes to be run in isolation using one-off dynos, and aggregates the output streams of all running dynos in an application so that logs can be processed as a stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Just As Important Today As Seven Years Ago
&lt;/h2&gt;

&lt;p&gt;The Twelve-Factor App methodology continues to be just as important today as when it was first released. Millions of people visited the Twelve-Factor App website in the last year. Companies such as &lt;a href="https://aws.amazon.com/blogs/compute/applying-the-twelve-factor-app-methodology-to-serverless-applications/"&gt;Amazon&lt;/a&gt;, &lt;a href="//docs.microsoft.com/en-us/dotnet/standard/modernize-with-azure-and-containers/modernize-existing-apps-to-cloud-optimized/what-about-cloud-native-applications"&gt;Microsoft&lt;/a&gt;, &lt;a href="https://sdtimes.com/webdev/twelve-factor-app-methodology-sets-guidelines-modern-apps/"&gt;IBM, and Pivotal&lt;/a&gt; continue to use and recommend the methodology.&lt;/p&gt;

&lt;p&gt;Newer architectures, such as microservices, serverless computing, and containers, still benefit from (and often enforce) the methodology, as do most cloud deployments. AWS Lambda functions, for example, enforce factors such as stateless processes, scaling out, disposability, isolated admin processes, and backing services.&lt;/p&gt;

&lt;p&gt;The Twelve-Factor App methodology has guided an enormous number of apps, frameworks, and platforms to success over the years. Taking these factors into consideration early in your design process will help you and your team architect scalable, portable, maintainable apps.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>Beginner's Guide to Using CDNs</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Tue, 25 Jun 2019 15:17:28 +0000</pubDate>
      <link>https://dev.to/mostlyjason/beginner-s-guide-to-using-cdns-4bde</link>
      <guid>https://dev.to/mostlyjason/beginner-s-guide-to-using-cdns-4bde</guid>
<description>&lt;p&gt;Websites have become larger and more complex over the past few years, and users expect them to load instantaneously, even on mobile devices. The smallest performance regressions can have big effects: just a &lt;a href="https://www.akamai.com/uk/en/about/news/press/2017-press/akamai-releases-spring-2017-state-of-online-retail-performance-report.jsp" rel="noopener noreferrer"&gt;100ms increase in page load time can drop conversions by 7%&lt;/a&gt;. With competitors just a click away, organizations that want to attract and retain customers need to make web performance a priority. One relatively simple way to do this is with content delivery networks (CDNs).&lt;/p&gt;

&lt;p&gt;In this article, we'll explain how CDNs help improve web performance. We'll explain what they are, how they work, and how to implement them in your websites.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a CDN?
&lt;/h1&gt;

&lt;p&gt;A CDN is a distributed network and storage service that hosts web content in different geographical regions around the world. This content can include HTML pages, scripts, style sheets, multimedia files, and more. This lets you serve content from the CDN instead of your own servers, reducing the amount of traffic handled by your servers.&lt;/p&gt;

&lt;p&gt;CDNs can also act as a proxy between you and your users, offering services such as load balancing, firewalls, automatic HTTPS, and even redundancy in case your origin servers go offline (e.g. &lt;a href="https://www.cloudflare.com/always-online/" rel="noopener noreferrer"&gt;Cloudflare Always Online&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should I Use a CDN?
&lt;/h2&gt;

&lt;p&gt;CDNs offload traffic from your servers, reducing your overall load. They are also optimized for speed and in many cases offer faster performance, which can &lt;a href="https://webmasters.googleblog.com/2010/04/using-site-speed-in-web-search-ranking.html" rel="noopener noreferrer"&gt;improve your SEO rankings&lt;/a&gt;. Because CDNs host content in data centers around the world, they move your content physically closer to your users. This can greatly reduce latency and helps avoid downtime caused by data center outages or broken routes.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Do CDNs Work?
&lt;/h1&gt;

&lt;p&gt;A CDN consists of multiple data centers around the world called Points of Presence (PoPs). Each PoP is capable of hosting and serving content to users. CDNs route users to specific PoPs based on a number of factors including distance, PoP availability, and connection speed.&lt;/p&gt;

&lt;p&gt;A PoP acts as a proxy between your users and your origin server. When a user requests a resource from your website such as an image or script, they are directed to the PoP. The PoP will then deliver the resource to the user if it has it cached.&lt;/p&gt;

&lt;p&gt;But how does your content get to a PoP? Using one of two methods: pushing or pulling. Pushing requires you to upload your content to the CDN beforehand. This gives you greater control over what the CDN serves, but if a user requests content that you haven't yet pushed, they may experience errors.&lt;/p&gt;

&lt;p&gt;Pulling is a much more automatic method, where the CDN automatically retrieves content that it hasn't already cached. When a user requests content that isn't already cached, the CDN pulls the most recent version of the content from your origin server. After a certain amount of time, the cached content expires and the CDN refreshes it from the origin the next time it's requested.&lt;/p&gt;
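&lt;p&gt;The pull method is essentially a read-through cache with a time-to-live (TTL). The following is a simplified sketch of the hit/miss/expire cycle a PoP applies; it is illustrative only, not any CDN's actual implementation:&lt;/p&gt;

```javascript
// Simplified pull-through cache with TTL expiry, mimicking how a PoP
// decides between serving from cache and refreshing from the origin.
class PullCache {
  constructor (fetchFromOrigin, ttlMs, now = Date.now) {
    this.fetchFromOrigin = fetchFromOrigin // function mapping url to content
    this.ttlMs = ttlMs
    this.now = now
    this.store = new Map()
  }

  get (url) {
    const entry = this.store.get(url)
    const expired = !entry || this.now() - entry.cachedAt >= this.ttlMs
    if (!expired) {
      return entry.content // cache hit: the origin is never contacted
    }
    // cache miss or expired entry: pull the latest version from the origin
    const content = this.fetchFromOrigin(url)
    this.store.set(url, { content, cachedAt: this.now() })
    return content
  }
}
```

&lt;p&gt;Real CDNs add much more (revalidation with ETags, purging, stale-while-revalidate), but this cycle is the core of the pull method.&lt;/p&gt;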

&lt;h1&gt;
  
  
  How Do I Choose a CDN?
&lt;/h1&gt;

&lt;p&gt;While CDNs work the same way fundamentally, they differ in a number of factors including:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most CDNs charge based on the amount of bandwidth used. Some may also charge based on the number of cache hits (files served from cache), cache misses (retrievals from the origin), and refreshes. Others charge a fixed fee and allow a certain amount of bandwidth over a period of time. When comparing CDNs, you should estimate your bandwidth needs and anticipated growth to find the best deal.&lt;/p&gt;
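&lt;p&gt;To compare pricing models, a back-of-the-envelope estimate from page weight, traffic, and expected cache hit ratio is often enough. The numbers and the per-GB rate below are illustrative assumptions, not any provider's actual prices:&lt;/p&gt;

```javascript
// Back-of-the-envelope monthly CDN cost. Inputs and the per-GB rate are
// illustrative assumptions, not a real provider's pricing.
function estimateMonthlyCost ({ pageWeightMB, pageViews, cacheHitRatio, pricePerGB }) {
  const totalGB = (pageWeightMB * pageViews) / 1024
  // In this simple model only cache hits are served (and billed) by the
  // CDN; misses are billed as bandwidth on your origin instead.
  const cdnGB = totalGB * cacheHitRatio
  return { cdnGB: cdnGB, cost: cdnGB * pricePerGB }
}

// 2 MB pages, 1,000,000 views/month, 90% hit ratio, $0.08/GB:
// about 1,758 GB served by the CDN, about $140.63/month
```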

&lt;p&gt;&lt;strong&gt;Availability and Reliability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CDNs strive for 100% uptime, but perfect uptime is never guaranteed. Consider your availability needs and how each CDN supports them. Also, compare CDNs based on their PoP uptime rather than their overall uptime, especially in the regions you expect to serve. If possible, verify that your CDN offers fallback options such as routing around downed PoPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PoP Locations (Regions Served)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depending on where your users are located, certain PoPs can serve your users more effectively. Choose a CDN that manages PoPs close to your users, or else you'll miss out on many of the performance benefits that CDNs offer.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Do I Add a CDN to My Website?
&lt;/h1&gt;

&lt;p&gt;The process of adding a CDN to your website depends on where and how your website is hosted. We'll cover some of the more common methods below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web Hosting Provider
&lt;/h2&gt;

&lt;p&gt;If your website is hosted by a provider such as inMotion Hosting, HostGator, or 1&amp;amp;1, your provider may offer a CDN as a built-in or extra service. For example, &lt;a href="https://my.bluehost.com/hosting/help/cloudflare" rel="noopener noreferrer"&gt;Bluehost&lt;/a&gt; provides Cloudflare for free, enabled by default on all plans. You can still use a CDN if your host doesn't explicitly support one; in that case, it will likely fall under one of the following approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Content Management System (CMS)
&lt;/h2&gt;

&lt;p&gt;Content management systems (CMSes) like WordPress and Squarespace often support CDNs through the use of plugins. For WordPress, &lt;a href="https://jetpack.com/" rel="noopener noreferrer"&gt;Jetpack&lt;/a&gt; provides support for its own CDN automatically. Others such as &lt;a href="https://blog.cloudflare.com/w3-total-cache-w3tc-total-cloudflare-integrat/" rel="noopener noreferrer"&gt;W3TC&lt;/a&gt;, &lt;a href="https://wordpress.org/plugins/wp-super-cache/" rel="noopener noreferrer"&gt;WP Super Cache&lt;/a&gt;, and &lt;a href="https://wordpress.org/plugins/wp-fastest-cache/" rel="noopener noreferrer"&gt;WP Fastest Cache&lt;/a&gt; let you choose which CDN to direct users to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosted
&lt;/h2&gt;

&lt;p&gt;Websites that you host yourself offer the greatest flexibility in choosing a CDN. However, they also require more setup. As an example, let's enable Google Cloud CDN for a website hosted on the Google Cloud Platform (GCP).&lt;/p&gt;

&lt;p&gt;This example assumes you have a GCP account, a domain registered with a registrar, and a website hosted in Compute Engine, App Engine, or another GCP service. If you don't already have a GCP account, &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;create one&lt;/a&gt; and log into the &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Configure your DNS records&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, the way to route your users to a CDN was to change the resource URLs in your website to point to URLs provided by the CDN. Most modern CDNs avoid this by managing your DNS records for you, letting you redirect users without requiring changes to your website.&lt;/p&gt;

&lt;p&gt;To configure Cloud DNS, view the &lt;a href="https://cloud.google.com/dns/docs/quickstart" rel="noopener noreferrer"&gt;Cloud DNS quickstart document&lt;/a&gt; and follow the instructions for &lt;a href="https://cloud.google.com/dns/docs/quickstart#create_a_managed_public_zone" rel="noopener noreferrer"&gt;creating a managed public zone&lt;/a&gt;. Don't create a new record or a CNAME record yet, since we don't yet have an IP address to point the DNS record to. In the screenshot below, we created a new zone called &lt;em&gt;mydomain-example&lt;/em&gt; for the domain &lt;em&gt;subdomain.mydomain.com&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F556y3ltatq8ll4fbbibu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F556y3ltatq8ll4fbbibu.png" alt="Example DNS zone for website"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Creating a DNS zone in Cloud DNS. © 2019 Google, LLC. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After creating the zone, update your registrar's domain settings to point to the Cloud DNS name servers. This will let you manage your domain records through Cloud DNS instead of through your registrar. For more information, visit the Cloud DNS documentation page on &lt;a href="https://cloud.google.com/dns/docs/update-name-servers" rel="noopener noreferrer"&gt;updating your domain's name servers&lt;/a&gt; or refer to your registrar's documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Enable Cloud CDN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With DNS configured, we now need to enable the CDN itself. With Cloud CDN, a load balancer must be selected as the origin. If you don't already have a load balancer, you can follow these &lt;a href="https://cloud.google.com/load-balancing/docs/how-to" rel="noopener noreferrer"&gt;how-to guides&lt;/a&gt; to create one. For a standard HTTP/S website, &lt;a href="https://cloud.google.com/load-balancing/docs/https/setting-up-https" rel="noopener noreferrer"&gt;follow this guide&lt;/a&gt; for specific instructions. &lt;/p&gt;

&lt;p&gt;With your load balancer created, follow &lt;a href="https://cloud.google.com/cdn/docs/using-cdn#enable_existing" rel="noopener noreferrer"&gt;these instructions&lt;/a&gt; to enable Cloud CDN for an existing backend service. Once your new origin is created, select it from the origin list. You will need the IP address displayed in the &lt;strong&gt;Frontend&lt;/strong&gt; table to configure Cloud DNS, so make sure you copy it or keep this window open. The following screenshot shows an example Cloud CDN origin:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk39yra30xk2off8mdhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk39yra30xk2off8mdhp.png" title="Web Load Balancer" alt="Web Load Balancer"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Viewing origin details in Cloud CDN. © 2019 Google, LLC. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After retrieving your frontend IP address, return to Cloud DNS and select your zone. Create a new A record to point the domain to your origin's IP address. You can find instructions on the Cloud DNS quickstart documentation page under &lt;a href="https://cloud.google.com/dns/docs/quickstart#create_a_new_record" rel="noopener noreferrer"&gt;creating a new record&lt;/a&gt;. This is shown in the screenshot below. Optionally, you can also create a CNAME record to redirect users from a subdomain, such as &lt;em&gt;&lt;a href="http://www.yourdomain.com" rel="noopener noreferrer"&gt;www.yourdomain.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fecdmt9k8j8cdo2k9um5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fecdmt9k8j8cdo2k9um5z.png" title="Domain create record set" alt="Domain create record set "&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Creating a new DNS record set in Cloud DNS. © 2019 Google, LLC. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Configure your web server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To ensure your content is properly cached, make sure your web server responds to requests with the correct HTTP headers. Cloud CDN only caches responses that &lt;a href="https://cloud.google.com/cdn/docs/caching" rel="noopener noreferrer"&gt;meet certain requirements&lt;/a&gt;, some of which are specific to Cloud CDN. Check your web server's documentation to learn how to set these headers; &lt;a href="https://httpd.apache.org/docs/2.4/caching.html" rel="noopener noreferrer"&gt;Apache&lt;/a&gt; and &lt;a href="https://www.nginx.com/blog/nginx-caching-guide/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; both provide caching guides with best practices.&lt;/p&gt;
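&lt;p&gt;If your origin is a Node.js app rather than Apache or Nginx, the same idea applies: set explicit caching headers on cacheable responses. Here is a minimal sketch; the one-hour TTL and the &lt;code&gt;cacheHeaders&lt;/code&gt; helper name are arbitrary choices for this example:&lt;/p&gt;

```javascript
// Build the headers a CDN needs to cache a response: an explicit
// Cache-Control with max-age (the TTL) and "public" so shared caches
// such as CDN PoPs may store it.
function cacheHeaders (maxAgeSeconds) {
  return {
    'Cache-Control': 'public, max-age=' + maxAgeSeconds,
    // Vary tells caches which request headers change the response.
    Vary: 'Accept-Encoding'
  }
}

// Example with Node's built-in http server:
// const http = require('http')
// http.createServer((req, res) => {
//   res.writeHead(200, cacheHeaders(3600)) // cache for one hour
//   res.end('hello from the origin')
// }).listen(8080)
```

&lt;p&gt;Cloud CDN, for example, looks for an explicit public Cache-Control with a max-age before it will cache a response; see the requirements linked above.&lt;/p&gt;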

&lt;p&gt;&lt;strong&gt;Step 4: Upload content to the CDN&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For most websites, you don’t need to upload anything: the CDN automatically caches resources from your origin server as people access your site. This is the “pull method” described earlier. Alternatively, Google lets you push specific content you want to host by uploading it manually.&lt;/p&gt;

&lt;h1&gt;
  
  
  How Does a CDN Impact Performance?
&lt;/h1&gt;

&lt;p&gt;To demonstrate the performance benefits of CDNs, we ran a performance test on a website hosted on the Google Cloud Platform. The site is a static, single-page website created with &lt;a href="https://startbootstrap.com/" rel="noopener noreferrer"&gt;Bootstrap&lt;/a&gt; and the &lt;a href="https://startbootstrap.com/template-overviews/full-width-pics/" rel="noopener noreferrer"&gt;Full Width Pics&lt;/a&gt; template, and consists of seven high-resolution images, courtesy of &lt;a href="https://www.jpl.nasa.gov/spaceimages/" rel="noopener noreferrer"&gt;NASA/JPL-Caltech&lt;/a&gt;. The server is a Google Compute Engine instance located in the us-east1-b region running Nginx 1.10.3.&lt;/p&gt;

&lt;p&gt;We configured the instance to allow direct incoming HTTP traffic. We also set up Google Cloud CDN for the instance. You can see a screenshot of the web page and networking timing of the site below using a waterfall chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3jlqhwsulfjubqx9bynz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3jlqhwsulfjubqx9bynz.png" title="AIRS Captures Polar Vortex" alt="AIRS Captures Polar Vortex"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A waterfall chart of the test site using &lt;a href="https://developers.google.com/web/tools/chrome-devtools/" rel="noopener noreferrer"&gt;Chrome DevTools&lt;/a&gt;. © 2019 Google, LLC. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We then ran a performance test using SolarWinds Pingdom. Pingdom provides a &lt;a href="https://www.pingdom.com/product/page-speed/" rel="noopener noreferrer"&gt;page speed test&lt;/a&gt; that measures the time needed to fetch and render each element of a web page. We created two separate checks to test the origin server and CDN separately, then compared the results to see which method was faster. To maximize latency, we ran both checks from Pingdom's Eastern Asia location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Origin Results
&lt;/h2&gt;

&lt;p&gt;Running a speed test on the origin server resulted in a page load time of 3.68 seconds. The time to download the first byte from the server (shown as a blue line) was 318 milliseconds, meaning users had to wait one-third of a second before their device even began receiving content. Rendering the page (indicated by the orange line) took an additional 679 ms, meaning users had to wait almost a full second to see anything on their screen. By the time the page finished rendering (green line), users had been waiting more than 3.5 seconds.&lt;/p&gt;

&lt;p&gt;Most of this delay was due to downloading the high-resolution images, but a significant amount of time was spent connecting to the server and waiting for content to begin transferring.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft3811jku3aa49granasp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft3811jku3aa49granasp.png" title="Page load timeline when connecting to our test origin server1" alt="Page load timeline when connecting to our test origin server"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Page load timeline when connecting to our test origin server. © 2019 SolarWinds, Inc. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CDN Results
&lt;/h2&gt;

&lt;p&gt;With a CDN, we immediately saw a substantial improvement in load time. The entire page loaded in just 1.04 seconds, more than 2 seconds faster than the origin server. The most significant change is in the time to first byte (blue line), which dropped to just 7 ms. This means our users began receiving content almost immediately after connecting to the CDN.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbvctvqrl8o7tc3t8pht3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbvctvqrl8o7tc3t8pht3.png" title="Page load timeline when connecting to Google Cloud CDN" alt="Page load timeline when connecting to Google Cloud CDN"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Page load timeline when connecting to Google Cloud CDN. © 2019 SolarWinds, Inc. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While there wasn't a significant improvement in the DOM content load time (orange line), the connection and wait times dropped significantly. We also saw content begin to appear on the page as early as 0.5 seconds into the page load time. We can confirm this by looking at the film strip, which shows screenshots of the page at various points in the loading process. This is compared to the 1.5 seconds it took for the origin server to begin rendering content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft54s46v7aplrbe2n5fj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft54s46v7aplrbe2n5fj9.png" title="Filmstrip" alt="Filmstrip"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Comparing the page rendering time with a CDN (bottom) and without a CDN (top). © 2019 SolarWinds, Inc. All rights reserved.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;CDNs offer a significant performance boost without much effort on the part of organizations. The biggest challenge is finding out which CDN provider to choose. If you're not sure which provider will benefit you the most, &lt;a href="https://royal.pingdom.com/benchmarking-cdns-cloudfront-cloudflare-fastly-and-google-cloud/" rel="noopener noreferrer"&gt;we benchmarked four of the most popular providers&lt;/a&gt; (Cloudflare, Fastly, AWS CloudFront, and Google CDN). While performance plays a major role in each provider's viability, we also encourage you to factor in additional features, security, and integrations offered by the CDN.&lt;/p&gt;

&lt;p&gt;After setting up your CDN, you can check the performance difference using &lt;a href="http://pingdom.com/" rel="noopener noreferrer"&gt;SolarWinds Pingdom&lt;/a&gt;. In addition to running one-time tests, you can use Pingdom to schedule periodic checks to ensure your website is always performing at its best. In addition, you can use Pingdom to constantly monitor your website's availability and usability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted on the &lt;a href="https://royal.pingdom.com/a-beginners-guide-to-using-cdns/" rel="noopener noreferrer"&gt;Pingdom blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Benchmarking Popular NodeJS Logging Libraries  </title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Thu, 20 Jun 2019 15:00:13 +0000</pubDate>
      <link>https://dev.to/mostlyjason/benchmarking-popular-nodejs-logging-libraries-4kk1</link>
      <guid>https://dev.to/mostlyjason/benchmarking-popular-nodejs-logging-libraries-4kk1</guid>
      <description>&lt;p&gt;Sometimes developers are hesitant to include logging due to performance concerns, but is this justified and how much does the choice of library affect performance? &lt;/p&gt;

&lt;p&gt;Let’s run some benchmarks to find out! We ran a series of performance tests on some of the most popular NodeJS libraries. These tests are designed to show how quickly each library processed logging and the impact on the overall application.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contenders
&lt;/h2&gt;

&lt;p&gt;For this test, we investigated some of the most commonly used NodeJS logging libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.npmjs.com/package/log4js" rel="noopener noreferrer"&gt;Log4js&lt;/a&gt; 4.0.2&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.npmjs.com/package/winston" rel="noopener noreferrer"&gt;Winston&lt;/a&gt; 3.2.1&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.npmjs.com/package/bunyan" rel="noopener noreferrer"&gt;Bunyan&lt;/a&gt; 1.8.12&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also used the following additional libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.npmjs.com/package/winston-syslog" rel="noopener noreferrer"&gt;winston-syslog&lt;/a&gt; 2.0.1 for syslog logging with Winston&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.npmjs.com/package/bunyan-syslog" rel="noopener noreferrer"&gt;bunyan-syslog&lt;/a&gt; 0.3.2 for syslog logging with Bunyan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We benchmarked these libraries to test their performance, sending their logs to a console and file system. We also tested sending log info to a local rsyslog server over both TCP and UDP since it is common and probably wise to offload logs in a production environment.&lt;/p&gt;

&lt;p&gt;These tests were run using NodeJS 8.15.1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Our goal was to compare the performance between the logging libraries. Each library was run on its default configuration and the same system was used across all libraries and tests.&lt;/p&gt;

&lt;p&gt;Our test application logged a total of 1,000,000 log events of the phrase “Hello, world!” and it’s available on GitHub at &lt;a href="https://github.com/codejamninja/node-log-benchmarks" rel="noopener noreferrer"&gt;https://github.com/codejamninja/node-log-benchmarks&lt;/a&gt;. The application did nothing but process logs, giving us an isolated control group.&lt;/p&gt;

&lt;p&gt;We measured the results using either a single logical processor or eight (four cores with hyperthreading) to simulate a larger production server. NodeJS is often considered single threaded, but technically it's just the event loop that is single threaded; many NodeJS tasks, such as garbage collection, take place on parallel threads. It's also worth noting that the tty (terminal) was doing a good deal of work printing the logs to the screen, which would almost certainly have executed on a separate thread. That’s why it's so important to test with the multiple CPUs typically found on production systems.&lt;/p&gt;

&lt;p&gt;Also, the NodeJS file system writes are nonblocking (asynchronous). The &lt;em&gt;unblocked&lt;/em&gt; time lets us know when the code used to schedule the filesystem writes is finished and the system can continue executing additional business logic. However, the file system will still be asynchronously writing in the background. So, the &lt;em&gt;done&lt;/em&gt; time lets us know how long it took to actually write the logs to the filesystem.&lt;/p&gt;

&lt;p&gt;The hardware we used is from Amazon AWS.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Name&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Spec&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Processors
   &lt;/td&gt;
   &lt;td&gt;Intel Core i7-7700 @ 2.80GHz (4 cores, 8 threads)
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Memory
   &lt;/td&gt;
   &lt;td&gt;32GB Ram
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Operating System
   &lt;/td&gt;
   &lt;td&gt;64-bit Ubuntu 18.04.2 LTS Server
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;NodeJS
   &lt;/td&gt;
   &lt;td&gt;8.15.1 LTS
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Test Results
&lt;/h2&gt;

&lt;p&gt;For all tests, the results are measured in milliseconds. The smaller bars are better because it means the logs took less time to process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Console
&lt;/h3&gt;

&lt;p&gt;For the first set of test results, we benchmarked the performance of the libraries when logging to the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4sln7j2frop2fa14dauh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4sln7j2frop2fa14dauh.png" title="Console Log Benchmarks" alt="Console Log Benchmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From these results, we can see additional CPUs had a significant effect on the amount of time it took NodeJS to log to the console. Winston is the clear winner for speed in multithreaded systems; however, Bunyan performed slightly better in a single-threaded system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Filesystem
&lt;/h3&gt;

&lt;p&gt;For the second set of test results, we benchmarked the performance of the libraries when writing the logs to the filesystem. Again, notice that each test result contains two times, &lt;em&gt;unblocked&lt;/em&gt; and &lt;em&gt;done&lt;/em&gt;. This is because the libraries sometimes write the logs to the filesystem asynchronously. The total time to log is the sum of these two times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F97mdvefpj6qgtex8wjjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F97mdvefpj6qgtex8wjjm.png" title="Filesystem Log Benchmarks" alt="Filesystem Log Benchmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After seeing how much additional CPUs affected console logs, I was very surprised to see that logging to the filesystem performed roughly the same with additional CPUs. This is most likely because the work required to write files is much less than the work required to print to a tty device, so there was less multithreaded activity happening.&lt;/p&gt;

&lt;p&gt;Log4js seemed to have the worst results writing to a filesystem, sometimes taking over 5 times the amount of time to write to the filesystem. Winston unblocked the event loop the fastest, but Bunyan finished writing to the filesystem the fastest. So, if you're choosing a log library based on filesystem performance, the choice would depend on whether you want the event loop unblocked the fastest or if you want the overall program execution to finish first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Syslog UDP
&lt;/h3&gt;

&lt;p&gt;For the third set of test results, we benchmarked the performance of the libraries when sending the logs to syslog over UDP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0sjwg9evzzieboih1jy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0sjwg9evzzieboih1jy9.png" title="Syslog UDP Log Benchmarks" alt="Syslog UDP Log Benchmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwcndba1ifhrpmrk2qa4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwcndba1ifhrpmrk2qa4i.png" title="Syslog UDP Log Drop Rate" alt="Syslog UDP Log Drop Rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log4js and Bunyan both finished around the same time when using multiple CPUs; however, Log4js unblocked the event loop much sooner and performed better on a single CPU.&lt;/p&gt;

&lt;p&gt;Log4js also successfully sent all of its logs to syslog without dropping a single one. Although Bunyan had a low drop rate, it still managed to drop a few logs. I would say Log4js is a clear winner when sending logs to syslog over UDP.&lt;/p&gt;

&lt;p&gt;I had a terrible experience getting Winston to work with syslog over UDP. When it did work it took well over a minute to unblock the event loop, and took over two minutes to finish sending the logs to syslog. However, most of the times I tested it, I ran out of memory before I could finish. I am assuming that when using UDP, the library aggregates all the logs in the heap before sending them to syslog, instead of immediately streaming the logs over to syslog. At any rate, it sends the logs over to syslog over UDP in a way that does not work well when slammed with a million logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Syslog TCP
&lt;/h2&gt;

&lt;p&gt;For the fourth set of test results, we benchmarked the performance of the libraries when sending the logs to syslog over TCP. Again, notice that each test result contains two times, &lt;em&gt;unblocked&lt;/em&gt; and &lt;em&gt;done&lt;/em&gt;. This is because the libraries sometimes asynchronously send the logs to syslog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhhvp9ne0axsnd4p8jzk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fhhvp9ne0axsnd4p8jzk6.png" title="Syslog TCP Log Benchmarks" alt="Syslog TCP Log Benchmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1w7v7r8ecw7j9kays5yv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1w7v7r8ecw7j9kays5yv.png" title="Syslog TCP Log Drop Rate" alt="Syslog TCP Log Drop Rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Bunyan was the only library that successfully sent logs to syslog over TCP without dropping any of them, it is the clear winner. Despite its somewhat slow performance when multiple CPUs were introduced, it still was relatively fast.&lt;/p&gt;

&lt;p&gt;Sadly I was not able to get Log4js to send logs to syslog over TCP. I believe there is a bug in their library. I consistently received the following error.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(node:31818) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'trace' of undefined&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Winston was relatively fast when sending logs to syslog over TCP; however, it had a horrific log drop rate. Most of the logs were either dropped or corrupted. Below is an example of one of the corrupted logs syslog received from Winston. You can see that the message was cut off.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Mar 17 19:21:42 localhost /home/codejamninja/.nvm/versions/node/v8.15.1/bin/node[22463]: {"mes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The log was supposed to look like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Mar 17 19:21:42 localhost /home/codejamninja/.nvm/versions/node/v8.15.1/bin/node[22463]: {"message": "92342: Hello, world!"}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Bunyan performed relatively well when sending logs to syslog over TCP. It did not drop a single log and unblocked the event loop very quickly. One thing that did surprise me though is that additional CPUs consistently performed worse than running on a single CPU. I am baffled by that, though this is the only scenario in which that happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These results really took me by surprise. I was thinking there would be an overall winner, but each library performed best in different areas under different conditions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Output type&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Winner&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Console
   &lt;/td&gt;
   &lt;td&gt;Winston
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;File
   &lt;/td&gt;
   &lt;td&gt;Winston and Bunyan tied
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Syslog UDP
   &lt;/td&gt;
   &lt;td&gt;Log4js
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Syslog TCP
   &lt;/td&gt;
   &lt;td&gt;Bunyan
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Winston performed best when logging to the console. Winston and Bunyan both performed best in their own ways when logging to the file system. Log4js performed the best when sending logs to syslog over UDP. Bunyan had the best results when sending logs to syslog over TCP.&lt;/p&gt;

&lt;p&gt;If you care more about throughput for syslog, then Log4js with UDP is the best output type. If you only care about unblocking the event loop, then Winston writing to a filesystem is the best. In this case, logging averaged 0.0005 ms per log event, which is blazing fast. If your typical response latency is 100 ms, then it's only 0.0005% of your total response time. That’s faster than running console.log(). As long as you don’t go overboard with too many log statements, the impact is very small.&lt;/p&gt;
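&lt;p&gt;The arithmetic behind that percentage, as a quick check:&lt;/p&gt;

```javascript
// Overhead of one log event relative to a typical response.
const perEventMs = 0.0005; // average time per log event from the test above
const responseMs = 100;    // a typical response latency
const overheadPct = (perEventMs / responseMs) * 100;
console.log(overheadPct.toFixed(4) + '% of the response is spent per log event');
```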

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
   &lt;/td&gt;
   &lt;td&gt;Console
   &lt;/td&gt;
   &lt;td&gt;File
   &lt;/td&gt;
   &lt;td&gt;Syslog TCP
   &lt;/td&gt;
   &lt;td&gt;Syslog UDP
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Log4js
   &lt;/td&gt;
   &lt;td&gt;24385 ms
   &lt;/td&gt;
   &lt;td&gt;31584 ms
   &lt;/td&gt;
   &lt;td&gt;N/A
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;1195 ms &lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Winston
   &lt;/td&gt;
   &lt;td&gt;10756 ms
   &lt;/td&gt;
   &lt;td&gt;7438 ms
   &lt;/td&gt;
   &lt;td&gt;9362 ms 
   &lt;/td&gt;
   &lt;td&gt;142871 ms
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Bunyan
   &lt;/td&gt;
   &lt;td&gt;15062 ms 
   &lt;/td&gt;
   &lt;td&gt;4197 ms 
   &lt;/td&gt;
   &lt;td&gt;24984 ms 
   &lt;/td&gt;
   &lt;td&gt;12029 ms
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Overall, I would recommend using the Log4js library with UDP output for the best performance. This will have a negligible impact on your overall response time. Tools like &lt;a href="http://loggly.com/" rel="noopener noreferrer"&gt;Loggly&lt;/a&gt; will store and organize those logs for you and alert you when the system encounters critical issues so you can deliver a great experience to your customers.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>node</category>
    </item>
    <item>
      <title>Six Strategies for Deploying to Heroku</title>
      <dc:creator>Jason Skowronski</dc:creator>
      <pubDate>Wed, 19 Jun 2019 15:37:20 +0000</pubDate>
      <link>https://dev.to/heroku/six-strategies-for-deploying-to-heroku-4b63</link>
      <guid>https://dev.to/heroku/six-strategies-for-deploying-to-heroku-4b63</guid>
      <description>&lt;p&gt;There are many ways of deploying your applications to Heroku—so many, in fact, that we would like to offer some advice on which to choose. Each strategy provides different benefits based on your current deployment process, team size, and app. Choosing an optimal strategy can lead to faster deployments, increased automation, and improved developer productivity.&lt;/p&gt;

&lt;p&gt;The question is: How do you know which method is the "best" method for your team? In this post, we'll present six of the most common ways to deploy apps to Heroku and how they fit into your deployment strategy. These strategies are not mutually exclusive, and you can combine several to create the best workflow for your team. Reading this post will help you understand the different options available and how they can be implemented effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying to Production with Git
&lt;/h2&gt;

&lt;p&gt;Our first method is not only the most common, but also the simplest: &lt;a href="https://devcenter.heroku.com/articles/git"&gt;pushing code from a Git repository to a Heroku app&lt;/a&gt;. You simply add your Heroku app as a &lt;a href="http://git-scm.com/book/en/Git-Basics-Working-with-Remotes"&gt;remote&lt;/a&gt; to an existing Git repository, then use git push to send your code to Heroku. Heroku then automatically builds your application and creates a new &lt;a href="https://devcenter.heroku.com/articles/releases"&gt;release&lt;/a&gt;.&lt;/p&gt;
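&lt;p&gt;The workflow is only a couple of commands. The app name below is hypothetical; in practice, running &lt;code&gt;heroku create&lt;/code&gt; adds the remote for you.&lt;/p&gt;

```shell
# Inside your project's Git repository, add the Heroku app as a remote
# (the app name is hypothetical; `heroku create` normally does this for you).
git init -q demo-app
cd demo-app
git remote add heroku https://git.heroku.com/your-app-name.git
git remote -v

# Then a plain push triggers a build and a new release:
# git push heroku master
```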

&lt;p&gt;Because this method requires a developer with full access to manually push code to production, it's better suited for pre-production deployments or for projects with small, trusted teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simple to add to any Git-based workflow&lt;/li&gt;
&lt;li&gt;Supports &lt;a href="https://devcenter.heroku.com/articles/git-submodules"&gt;Git submodules&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires access to both the Git repository and Heroku app&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitHub Integration
&lt;/h2&gt;

&lt;p&gt;If your repository is hosted on GitHub, you can use &lt;a href="https://devcenter.heroku.com/articles/github-integration"&gt;GitHub integration&lt;/a&gt; to deploy changes directly to Heroku. After linking your repository to a Heroku app, changes that are pushed to your repository are automatically deployed to the app. You can configure automatic deployments for a specific branch, or manually trigger deployments from GitHub. If you use continuous integration (CI), you can even prevent deployments to Heroku until your tests pass.&lt;/p&gt;

&lt;p&gt;GitHub integration is also useful for automating &lt;a href="https://devcenter.heroku.com/articles/pipelines"&gt;pipelines&lt;/a&gt;. For example, when a change is merged into the master branch, you might deploy to a staging environment for testing. Once the change has been validated, you can then &lt;a href="https://devcenter.heroku.com/articles/pipelines#promoting"&gt;promote&lt;/a&gt; the app to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Automatically deploys apps and keeps them up-to-date&lt;/li&gt;
&lt;li&gt;Integrates with &lt;a href="https://devcenter.heroku.com/articles/pipelines"&gt;pipelines&lt;/a&gt; and &lt;a href="https://devcenter.heroku.com/articles/github-integration-review-apps"&gt;review apps&lt;/a&gt; to create a continuous delivery workflow&lt;/li&gt;
&lt;li&gt;If you use a CI service (such as &lt;a href="https://devcenter.heroku.com/articles/heroku-ci"&gt;Heroku CI&lt;/a&gt;) to build/test your changes, Heroku can prevent deployment when the tests fail&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires administrator access to the repository, so it’s only useful for repositories you own&lt;/li&gt;
&lt;li&gt;Does not support Git submodules&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Heroku Review Apps
&lt;/h2&gt;

&lt;p&gt;When introducing a change, chances are you want to test it before deploying it straight to production. &lt;a href="https://devcenter.heroku.com/articles/github-integration-review-apps"&gt;Review Apps&lt;/a&gt; let you deploy any GitHub pull request (PR) as an isolated, disposable instance. You can demo, test, and validate the PR without having to create a new app or overwrite your production app. Closing the PR destroys the review app, making it a seamless addition to your existing workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can automatically create and update apps for each PR&lt;/li&gt;
&lt;li&gt;Supports Docker images&lt;/li&gt;
&lt;li&gt;Supports &lt;a href="https://devcenter.heroku.com/articles/private-spaces"&gt;Heroku Private Spaces&lt;/a&gt; for testing changes in an isolated environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires both &lt;a href="https://devcenter.heroku.com/articles/pipelines"&gt;pipelines&lt;/a&gt; and &lt;a href="https://devcenter.heroku.com/articles/github-integration"&gt;GitHub integration&lt;/a&gt; to be enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Deploying with Docker
&lt;/h2&gt;

&lt;p&gt;Docker lets you bundle your apps into self-contained environments, ensuring that they behave exactly the same both in development and in production. This also gives you more control over the languages, frameworks, and libraries used to run your app. To &lt;a href="https://devcenter.heroku.com/categories/deploying-with-docker"&gt;deploy a container to Heroku&lt;/a&gt;, you can either push an image to the &lt;a href="https://devcenter.heroku.com/articles/container-registry-and-runtime"&gt;Heroku container registry&lt;/a&gt;, or &lt;a href="https://devcenter.heroku.com/articles/build-docker-images-heroku-yml"&gt;build the image automatically&lt;/a&gt; by declaring it in your app's &lt;code&gt;heroku.yml&lt;/code&gt; file.&lt;/p&gt;
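&lt;p&gt;A &lt;code&gt;heroku.yml&lt;/code&gt; for a Docker-based build can be as small as the following sketch (the run command is illustrative; see the linked heroku.yml article for the full schema):&lt;/p&gt;

```yaml
build:
  docker:
    web: Dockerfile   # build the web process from this Dockerfile
run:
  web: node server.js # illustrative start command; defaults to the image's CMD
```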

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Automatically generate images, or push an existing image to the container registry&lt;/li&gt;
&lt;li&gt;Consistency between development and production&lt;/li&gt;
&lt;li&gt;Compatible with &lt;a href="https://devcenter.heroku.com/articles/github-integration-review-apps"&gt;Heroku Review Apps&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If your app doesn’t already run in Docker, you’ll need to build an image&lt;/li&gt;
&lt;li&gt;Requires you to maintain your own &lt;a href="https://devcenter.heroku.com/articles/stack"&gt;stack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Does not support &lt;a href="https://devcenter.heroku.com/articles/pipelines-using-the-platform-api#performing-a-promotion"&gt;pipeline promotions&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using Hashicorp Terraform
&lt;/h2&gt;

&lt;p&gt;Infrastructure-as-code tools like &lt;a href="https://www.terraform.io/"&gt;Hashicorp Terraform&lt;/a&gt; can be helpful for managing complex infrastructure. Terraform can also be used to deploy a Heroku app. Although it is not officially supported by Heroku, Terraform is used by many Heroku customers. &lt;a href="https://devcenter.heroku.com/articles/using-terraform-with-heroku"&gt;Using Terraform with Heroku&lt;/a&gt;, you can define your Heroku apps with a declarative configuration language called HCL. Terraform automates the process of deploying and managing Heroku apps while also making it easy to coordinate Heroku with your existing infrastructure. Plus, Terraform v0.12 now allows you to store remote state in a PostgreSQL database, which means you can run Terraform on a Heroku dyno while storing its state in a Heroku Postgres database.&lt;/p&gt;

&lt;p&gt;For an example, check out a &lt;a href="https://devcenter.heroku.com/articles/event-driven-microservices-with-apache-kafka"&gt;reference architecture&lt;/a&gt; using Terraform and Kafka.&lt;/p&gt;
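&lt;p&gt;A minimal HCL sketch of a Heroku app looks something like this (resource and attribute names follow the community Heroku provider; treat the app name as illustrative):&lt;/p&gt;

```hcl
provider "heroku" {}

resource "heroku_app" "web" {
  name   = "my-terraform-app" # illustrative; app names must be globally unique
  region = "us"
}
```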

&lt;h3&gt;
  
  
  Pros:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Automates Heroku app deployments&lt;/li&gt;
&lt;li&gt;Allows you to deploy Heroku apps as code&lt;/li&gt;
&lt;li&gt;Simplifies the management of large, complex deployments&lt;/li&gt;
&lt;li&gt;Allows you to configure multiple apps, Private Spaces as well as resources from other cloud providers (e.g. AWS, DNSimple, and Cloudflare) to have a repeatable, testable, multi-provider architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cons:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires learning Terraform and writing configuration if you don’t use it already&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 'Deploy to Heroku' Button
&lt;/h2&gt;

&lt;p&gt;What if deploying your app was as easy as clicking a button? With the &lt;a href="https://devcenter.heroku.com/articles/heroku-button"&gt;'Deploy to Heroku' button&lt;/a&gt;, it is! It’s great for taking an app for a test run with default settings in a single click, or to help train new developers.&lt;/p&gt;

&lt;p&gt;This button acts as a shortcut that lets you deploy an app to Heroku from a web browser. This is great for apps that you provide to your users or customers, such as open source projects. You can parameterize each button with different settings, such as passing custom environment variables to Heroku, using a specific Git branch, or providing OAuth keys. The only requirements are that your source code is hosted in a GitHub repository and that you add a valid &lt;a href="https://dev.to/scottw/heroku-appjson-487i-temp-slug-5009239"&gt;&lt;code&gt;app.json&lt;/code&gt;&lt;/a&gt; file to the project's root directory. We’ve even heard of one company that adds a button to the README for each of their internal services. This forces them to keep the deploy process simple and helps new hires get up to speed with how services are deployed.&lt;/p&gt;
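&lt;p&gt;As an illustrative sketch (the field values here are hypothetical), a minimal &lt;code&gt;app.json&lt;/code&gt; might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "name": "My Sample App",
  "description": "A one-click deployable sample app",
  "repository": "https://github.com/heroku/node-js-sample",
  "keywords": ["node", "express", "sample"],
  "env": {
    "SECRET_KEY": {
      "description": "Secret used to sign session cookies",
      "generator": "secret"
    }
  },
  "addons": ["heroku-postgresql"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;env&lt;/code&gt; section is how a button passes custom environment variables, and a &lt;code&gt;generator&lt;/code&gt; of &lt;code&gt;secret&lt;/code&gt; tells Heroku to generate a random value at deploy time.&lt;/p&gt;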

&lt;p&gt;&lt;a href="https://heroku.com/deploy?template=https://github.com/heroku/node-js-sample"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nR3i8Sj1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.herokucdn.com/deploy/button.svg" alt="Deploy" width="147" height="32"&gt;&lt;/a&gt;A 'Deploy to Heroku' button.&lt;/p&gt;
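&lt;p&gt;Embedding the button is a one-liner in a README. The Markdown below follows the pattern from the Heroku Button documentation; the &lt;code&gt;template&lt;/code&gt; parameter points at the GitHub repo to deploy, and can be omitted when the button lives in that repo's own README:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[![Deploy](https://www.herokucdn.com/deploy/button.svg)](https://heroku.com/deploy?template=https://github.com/heroku/node-js-sample)
&lt;/code&gt;&lt;/pre&gt;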

&lt;h3&gt;Pros:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Easy to add to a project's README file or web page&lt;/li&gt;
&lt;li&gt;Easy to use: simply click the button to deploy the app&lt;/li&gt;
&lt;li&gt;Provides a template with preconfigured default values, environment variables, and parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Cons:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Does not support Git submodules&lt;/li&gt;
&lt;li&gt;Apps deployed via button do not auto-update when new commits are added to the GitHub repo from which it was deployed&lt;/li&gt;
&lt;li&gt;Not a good workflow for apps that you need to keep up to date, since buttons can only create new apps and the deployed app is not automatically connected to the GitHub repo from which it came&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Which Should I Choose?&lt;/h2&gt;

&lt;p&gt;The method you choose depends on your specific deployment process, your requirements, and your apps. For small teams just getting started, deploying with Git is likely your first method due to its simplicity. The Heroku Button is equally straightforward, letting you deploy entire apps with a single click. If you use continuous integration or release frequently, integrating with GitHub can simplify the process even further by deploying automatically when you commit your code. This is a big improvement over deploying on an IaaS system because Heroku manages the entire process automatically.&lt;/p&gt;

&lt;p&gt;As your requirements get more sophisticated, add the other strategies as needed. When your application is running in a production environment and you need quality control, you may want to add pipelines to gain the advantages of review apps, automated testing, and staging environments. If you need a custom stack, you can build one with Docker. As you add more complex infrastructure components, bring in Terraform.&lt;/p&gt;

&lt;p&gt;Advanced teams will use a combination of strategies: For example, you may choose to deploy a Docker image by creating a review app from a GitHub pull request, testing the review app, then manually deploying the final version using &lt;code&gt;git push&lt;/code&gt;.&lt;/p&gt;
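&lt;p&gt;That final manual step is just two commands with the Heroku CLI (the app name here is hypothetical, and older apps may push to &lt;code&gt;master&lt;/code&gt; rather than &lt;code&gt;main&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Point a "heroku" Git remote at an existing app, then push to deploy
heroku git:remote -a my-example-app
git push heroku main
&lt;/code&gt;&lt;/pre&gt;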

&lt;p&gt;Ready to give one of these methods a try? Sign up for a &lt;a href="https://www.heroku.com/free"&gt;free Heroku account&lt;/a&gt; and test them out.&lt;/p&gt;

</description>
      <category>deploy</category>
      <category>coding</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
