<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Koan</title>
    <description>The latest articles on DEV Community by Koan (@koan).</description>
    <link>https://dev.to/koan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1798%2F74dec528-c1f7-4e2d-b6f3-92ba1a23cf34.png</url>
      <title>DEV Community: Koan</title>
      <link>https://dev.to/koan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/koan"/>
    <language>en</language>
    <item>
      <title>Anatomy of a high-velocity CI/CD pipeline</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Thu, 02 Dec 2021 16:35:17 +0000</pubDate>
      <link>https://dev.to/koan/anatomy-of-a-high-velocity-cicd-pipeline-251o</link>
      <guid>https://dev.to/koan/anatomy-of-a-high-velocity-cicd-pipeline-251o</guid>
      <description>&lt;p&gt;If you’re going to optimize your development process for one thing, make it speed. Not the kind of speed that racks up technical debt on the team credit card or burns everyone out with breathless sprints, though. No, the kind of speed that treats time as your most precious resource, which it is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Speed is the startup’s greatest advantage&lt;/em&gt;. Speed means not wasting time. Incorporating new information as soon as it’s available. Getting products to market. Learning from customers. And responding quickly when problems occur. But speed with no safeguards is simply recklessness. Moving fast requires systems for ensuring we’re still on the rails.&lt;/p&gt;

&lt;p&gt;We’ve woven many such systems into the sociotechnical fabric of our startup, but maybe the most crucial among them are the continuous integration and continuous delivery processes that keep our work moving swiftly towards production.&lt;/p&gt;

&lt;h2&gt;The business case for CI/CD&lt;/h2&gt;

&lt;p&gt;Writing in 2021 it’s hard to imagine building web applications without the benefits of continuous integration, continuous delivery, or both. Running an effective CI/CD pipeline won’t score points in a sales pitch or (most) investor decks, but it can make significant strategic contributions to both business outcomes and developer quality of life. The virtuous cycle goes something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  faster feedback&lt;/li&gt;
&lt;li&gt;  fewer bugs&lt;/li&gt;
&lt;li&gt;  increased confidence&lt;/li&gt;
&lt;li&gt;  faster releases&lt;/li&gt;
&lt;li&gt;  more feedback (even faster this time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even on teams (like ours!) that haven’t embraced the dogma (or overhead) of capital-A-Agile processes, having the confidence to release early and often still unlocks shorter development cycles and reduces time to market.&lt;/p&gt;

&lt;p&gt;As a developer, you’re probably already bought into this idea. If you’re feeling resistance, though, here’s a quick summary for the boss:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sVyKVIDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AI0ZsHIQl7d5L0NmxozdnrA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sVyKVIDc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AI0ZsHIQl7d5L0NmxozdnrA.png" alt="Graphic illustrating the business case for continuous integration and delivery: feedback, quality, confidence, velocity." width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The business case for continuous integration and delivery&lt;/p&gt;

&lt;h2&gt;Is CI/CD worth the effort?&lt;/h2&gt;

&lt;p&gt;Nobody likes a red build status indicator, but the truth is that builds fail. That’s why status dashboards exist, and a dashboard glowing crimson in the light of failing builds is much, much better than no dashboard at all.&lt;/p&gt;

&lt;p&gt;Still, that dashboard (never mind the systems and subsystems it’s reporting on) is pure overhead. Not only are you on the hook to maintain code and release a dozen new features by the end of the week, but also to keep up the litany of scripts, tests, configuration files, and dashboards needed to build, verify, and deploy it. When the server farm of Mac Minis in the basement hangs, you’re on the hook to restart it. That’s less time available to actually build the app.&lt;/p&gt;

&lt;p&gt;This is a false dilemma, though. You can solve this problem by throwing resources at it. Managed services eliminate much of the maintenance burden, and when you’ve reached the scale where one-size-fits-all managed services break down you can likely afford to pay a full-time employee to manage Jenkins.&lt;/p&gt;

&lt;p&gt;So, there are excuses for not having a reliable CI/CD pipeline. They just aren’t very good ones. The payoff — in confidence, quality, velocity, learning, or &lt;em&gt;whatever&lt;/em&gt; you hope to get out of shipping more software — is well worth any pain the pipeline incurs.&lt;/p&gt;

&lt;p&gt;Yes, even if it has to pass through Xcode.&lt;/p&gt;

&lt;h2&gt;A guiding principle&lt;/h2&gt;

&lt;p&gt;Rather than prescribing the ultimate CI/CD pipeline in an edict from on high, we’ve taken guidance from one of our team principles and evolved our practices and automation from there. It reads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ship to Learn&lt;/strong&gt;. We release the moment that staging is better than prod, listen early and often, and move faster because of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Continuous integration is a big part of the story, of course, but the same guidance applies back to the pipeline itself.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Releasing the moment that staging is better than prod&lt;/strong&gt; is easy to do: staging is nearly always better, and keeping up with it means having both a lightweight release process and confidence in our work. Individual investment and a reasonably robust test suite are all well and good; better is having a CI/CD pipeline that makes them the norm (if not the rule).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Listening early and often&lt;/strong&gt; is all about gathering feedback as quickly as we possibly can. The sooner we understand whether something is working or not, the faster we can know whether to double down or adapt. Feedback in seconds is better than in minutes (and certainly better than hours).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Moving faster&lt;/strong&gt; includes product velocity, of course, but also the CI/CD process itself. Over time we’ve automated what we reasonably can; still, several exception-heavy stages remain in human hands, and we don’t expect that to change soon. Here, “moving fast” means enabling manual review and acceptance testing rather than replacing them.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;So, our pipeline&lt;/h2&gt;

&lt;p&gt;Product velocity depends on the pipeline that enables it. With that in mind, we’ve constructed our pipeline to address the hypothesis that &lt;em&gt;issues uncovered at any stage are exponentially more expensive to fix than those solved at prior stages&lt;/em&gt;. Issues &lt;em&gt;will&lt;/em&gt; happen, but checks that uncover them early on drastically reduce friction at the later, more expensive stages of the pipeline.&lt;/p&gt;
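
&lt;p&gt;To make the hypothesis concrete, here’s a toy model — the stages are ours, but the tenfold multiplier is an assumption for illustration, not a measurement:&lt;/p&gt;

```typescript
// Toy model of the hypothesis above, not measured data: if fixing an issue
// costs roughly ten times more at each later stage (an assumed multiplier),
// a production bug is four orders of magnitude pricier than a local one.
const stages = ["local", "ci", "review", "staging", "production"] as const;
const multiplier = 10; // assumption, purely illustrative

const relativeCost = (stage: (typeof stages)[number]): number =>
  multiplier ** stages.indexOf(stage);

console.log(relativeCost("local"));      // 1
console.log(relativeCost("ci"));         // 10
console.log(relativeCost("production")); // 10000
```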

&lt;p&gt;Here’s the boss-friendly version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N5w-tW7_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AsxIjoTACEZbnafUfLvwgKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N5w-tW7_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AsxIjoTACEZbnafUfLvwgKA.png" alt="A CI/CD pipeline and the time required to test at each stage" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test early, test often&lt;/p&gt;

&lt;h2&gt;Local development&lt;/h2&gt;

&lt;p&gt;Continuous integration starts immediately. If you disagree, consider the feedback time needed to integrate and test locally versus anywhere else: seconds (rebasing against our &lt;code&gt;main&lt;/code&gt; branch or acting on feedback from a pair-programming partner) to minutes (a full run of our test suite) at most.&lt;/p&gt;

&lt;p&gt;We’ve made much of it automatic. Our editors are configured to take care of &lt;a href="https://prettier.io/"&gt;styles and formatting&lt;/a&gt;; &lt;a href="https://dev.to/developing-koan/porting-koans-150-000-line-javascript-codebase-to-typescript-b4818ccc42ac"&gt;TypeScript provides a first layer of testing&lt;/a&gt;; and &lt;a href="https://rjzaworski.com/2018/01/keeping-git-hooks-in-sync"&gt;shared git hooks&lt;/a&gt; run project-specific static checks.&lt;/p&gt;

&lt;p&gt;One check we don’t enforce is running our full test suite. Run time goes up linearly with the size of a test suite, and — while we’re culturally averse to writing tests for their own sake — running our entire suite on every commit would be prohibitively expensive. &lt;em&gt;What needs testing&lt;/em&gt; is up to individual developers’ discretion, and we avoid adding redundant or pointless tests to the test suite just as we avoid redundant test &lt;em&gt;runs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Make it fast, remember? That applies to local checks, too. Fast checks get run. Slow checks? No one has time for that.&lt;/p&gt;

&lt;h2&gt;Automated CI&lt;/h2&gt;

&lt;p&gt;Changes pushed from local development to our central repository trigger the next layer of checks in the CI pipeline. Feedback here is slower than in local development but still fairly fast, requiring about 10 minutes to run all tests and produce a viable build.&lt;/p&gt;

&lt;p&gt;Here’s what it looks like in GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RCKoTqSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Avc7YZRoU-4OxVaEjHrhDZg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RCKoTqSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Avc7YZRoU-4OxVaEjHrhDZg.png" alt="Screenshot of automated tests passing in Github’s UI" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Green checks are good checks.&lt;/p&gt;

&lt;p&gt;There are several things going on here: repeats of the linting and static analysis run locally, a run through our complete &lt;code&gt;backend&lt;/code&gt; test suite, and deployment of artifacts used in manual QA. The other checks are variations on this theme—different scripts poking and prodding the commit from different angles to ensure it’s ready for merging into &lt;code&gt;main&lt;/code&gt;. Depending on the nature of the change, we may require up to a dozen checks to pass before the commit is greenlit for merge.&lt;/p&gt;

&lt;h2&gt;Peer review&lt;/h2&gt;

&lt;p&gt;In tandem with the automated CI checks, we require manual review and sign-off before changes can be merged into &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;“Manual!?” I hear the purists cry, and yes — the “M” word runs counter to the Platonic ideal of totally automated CI. Hear me out. The truth is that every step in our CI/CD pipeline existed as a manual process first. Automating something before truly understanding it is a sure path to inappropriate abstractions, maintenance burden, and at least a few choice words from future generations. And full automation doesn’t always make sense. For processes that are and always will be dominated by exceptions (design review and acceptance testing, to pick two common examples) we’ve traded any aspirations at full automation for tooling that &lt;em&gt;enables&lt;/em&gt; manual review. We don’t expect to change this any time soon.&lt;/p&gt;

&lt;p&gt;Manual review for us consists of (required) code review and (optional) design review. Code review covers a &lt;a href="https://github.com/rjz/code-review-checklist"&gt;checklist&lt;/a&gt; of logical, quality, and security concerns, and we (plus GitHub &lt;a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/defining-the-mergeability-of-pull-requests/about-protected-branches"&gt;branch protection&lt;/a&gt;) require &lt;em&gt;at least&lt;/em&gt; two team members to believe a change is a good idea before we ship it. Besides collective ownership, it’s also a chance to apply a modicum of QA and build shared understanding around what’s changing in the codebase. Ideally, functional issues that weren’t caught locally get caught here.&lt;/p&gt;

&lt;h2&gt;Design review&lt;/h2&gt;

&lt;p&gt;Design review is typically run in tandem with our counterparts in product and design, and aims to ensure that designs are implemented to spec. We provide two channels for reviewing changes before a pull request is merged:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; preview builds of &lt;a href="https://dev.to/developing-koan/routing-on-the-edge-913eb00da742"&gt;a “live” application&lt;/a&gt; that reviewers can interact with directly&lt;/li&gt;
&lt;li&gt; &lt;a href="https://storybook.js.org/"&gt;storybook&lt;/a&gt; builds that showcase specific UI elements included within the change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both the preview and storybook builds are linked from GitHub’s pull request UI as soon as they’re available. They also nicely illustrate the type of tradeoffs we’ve frequently made between complexity (neither build is trivial to set up and maintain), automation (know what would be trickier? Automatic visual regression testing, that’s what) and manual enablement (the time we &lt;em&gt;have&lt;/em&gt; decided to invest has proven well worth it).&lt;/p&gt;

&lt;p&gt;The bottom line is that — just like with code review — we would prefer to catch design issues while pairing up with the designer during initial development. But if something slipped through, design review lets us respond more quickly than at stages further down the line.&lt;/p&gt;

&lt;p&gt;The feedback from manual review steps is still available quickly, though: generally within an hour or two of a new pull request being opened. And then it’s on to our staging environment.&lt;/p&gt;

&lt;h2&gt;Continuous delivery to staging&lt;/h2&gt;

&lt;p&gt;Merging a pull request into our &lt;code&gt;main&lt;/code&gt; branch finally flips the coin from continuous integration to continuous delivery. There's one more CI pass first, however: since we identify builds by the commit hash they're built from, a merge commit in &lt;code&gt;main&lt;/code&gt; triggers a new CI run that produces the build artifact we deliver to our staging environment.&lt;/p&gt;
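
&lt;p&gt;In spirit, that build-identification scheme looks something like the sketch below — the helper and artifact format are invented for illustration, not our actual build scripts:&lt;/p&gt;

```typescript
// Sketch of identifying builds by commit hash: the artifact name carries the
// hash of the commit it was built from, so a staging deploy always traces
// back to an exact merge commit. Naming convention is hypothetical.
function artifactName(commitSha: string): string {
  return `app-${commitSha.slice(0, 12)}.tar.gz`;
}

console.log(artifactName("2f0c4a9d1b8e7f65a3c2d1e0b9a87654c3d2e1f0"));
// app-2f0c4a9d1b8e.tar.gz
```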

&lt;p&gt;The process for vetting a staging build is less prescriptive than for the stages that precede it. Most of the decision around how much QA or acceptance testing to run in staging rests with the on-call developer (who doubles as &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;our de facto release manager&lt;/a&gt;), who reviews the list of changes and calls for validation as needed. A release consisting of well-tested refactoring may get very little attention. A major feature may involve multiple QA runs and pull in stakeholders from our product, customer success, and marketing teams. Most releases sit somewhere in the middle.&lt;/p&gt;

&lt;p&gt;Every staging release receives at least passing notice, for the simple reason that we use Koan ourselves — specifically, an instance hosted in the staging environment. We eat our own dogfood, in a flavor that’s always slightly ahead of the one our customers are using in production.&lt;/p&gt;

&lt;p&gt;Staging feedback isn’t without hiccups. At any time we’re likely to have 3–10 feature flags gating various in-development features, and the gap between staging and production configurations can lead to team members reporting false positives on features that aren’t yet ready for release. To narrow that gap, we’ve invested in internal tooling that allows team members to adopt a specific production configuration in their local or staging environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H-6vJWX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Ali06crRg2vyrgmRUZMCAaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H-6vJWX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2Ali06crRg2vyrgmRUZMCAaw.png" alt="Internal UI for forcing production configuration in staging environment — the design team loves this one." width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The aesthetics are edgy (controversial, even), but the value is undeniable. We’re able to freely build and test features prior to production release, and then easily verify whether a pre-release bug will actually manifest in the production version of the app.&lt;/p&gt;

&lt;p&gt;If you’re sensing that issues caught in staging are more expensive to diagnose and fix than those caught earlier on, you’d be right. Feedback here is much slower than at earlier stages, with detection and resolution taking up to several hours. But issues caught in staging are still much easier to address before they’re released to production.&lt;/p&gt;

&lt;h2&gt;Manual release to production&lt;/h2&gt;

&lt;p&gt;The “I” in CI is unambiguous. Different teams may take “integration” to mean different things — note the inclusion of critical-if-not-exactly-continuous manual reviews in our own integration process — but “I” always means “integration.”&lt;/p&gt;

&lt;p&gt;The “D” is less straightforward, standing in (depending on who you’re talking to, the phase of the moon, and the day of the week) for either “Delivery” or “Deployment,” and &lt;a href="https://www.atlassian.com/continuous-delivery/principles/continuous-integration-vs-delivery-vs-deployment"&gt;they’re not quite the same thing&lt;/a&gt;. We’ve gained enormous value from Continuous Delivery. We haven’t made the leap (or investment) to deploy directly to production.&lt;/p&gt;

&lt;p&gt;That’s a conscious decision. Manual QA and acceptance testing have proven tremendously helpful in getting the product right. Keeping a human in the loop ahead of production helps ensure that we connect with relevant stakeholders (in product, growth, and even key external accounts) prior to our otherwise-frequent releases.&lt;/p&gt;

&lt;h2&gt;Testing in production&lt;/h2&gt;

&lt;p&gt;As the joke goes, we test comprehensively: all issues missed by our test suite will be caught in production. There aren’t many of these, fortunately, but a broad enough definition of testing ought to encompass the instrumentation, monitoring, alerting, and customer feedback that help us identify defects in our production environment.&lt;/p&gt;

&lt;p&gt;We’ve previously shared an outline of our &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;cherished (seriously!) on-call rotation&lt;/a&gt;, and the instrumentation beneath it is a discussion for another day, but suffice to say that an issue caught in production takes much longer to fix than one caught locally. Add in the context-switching required from team members who have already moved on to other things, and it’s no wonder we’ve invested in catching issues earlier on!&lt;/p&gt;

&lt;h2&gt;Revising the pipeline&lt;/h2&gt;

&lt;p&gt;Increasing velocity means adding people, reducing friction, or (better yet) both. &lt;a href="https://dev.to/@rjzaworski/sharing-our-startups-hiring-manual-efe2094a180c"&gt;Hiring is a general problem&lt;/a&gt;. Friction is specific to the team, codebase, and pipeline in question. We &lt;a href="https://dev.to/developing-koan/porting-koans-150-000-line-javascript-codebase-to-typescript-b4818ccc42ac"&gt;adopted TypeScript&lt;/a&gt; to shorten feedback cycles (and &lt;a href="https://rjzaworski.com/2019/05/making-the-case-for-typescript"&gt;save ourselves runtime exceptions&lt;/a&gt; and &lt;a href="https://dev.to/developing-koan/making-the-most-of-our-startups-on-call-rotation-8769a110c24c"&gt;PagerDuty incidents&lt;/a&gt;). That was an easy one.&lt;/p&gt;

&lt;p&gt;A less obvious bottleneck was how much time our pull requests were spending waiting for code review — on average, around 26 hours prior to merge. Three and a half business days. On &lt;em&gt;average&lt;/em&gt;. We were still deploying several times per day, but with several days’ worth of work-in-process backed up in the queue and plenty of context switching whenever it needed adjustment.&lt;/p&gt;
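
&lt;p&gt;For the curious, an average like that falls out of nothing more than opened-to-merged timestamps. A minimal sketch (the type and sample data are invented, not our actual tooling):&lt;/p&gt;

```typescript
// Illustrative sketch: averaging PR review wait from opened/merged
// timestamps, the way a "Code Review Vitals" metric might.
type PullRequest = { openedAt: string; mergedAt: string };

function averageReviewHours(prs: PullRequest[]): number {
  const totalMs = prs.reduce(
    (sum, pr) => sum + (Date.parse(pr.mergedAt) - Date.parse(pr.openedAt)),
    0
  );
  return totalMs / prs.length / 3_600_000; // ms per hour
}

// Made-up sample: two PRs waiting 24h and 28h average out to 26h.
const sample: PullRequest[] = [
  { openedAt: "2021-03-01T09:00:00Z", mergedAt: "2021-03-02T09:00:00Z" },
  { openedAt: "2021-03-01T09:00:00Z", mergedAt: "2021-03-02T13:00:00Z" },
];
console.log(averageReviewHours(sample)); // 26
```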

&lt;p&gt;Here’s how review times tracked over time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QFvpxX2L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AHT9-h2VyhUxZSXmMYsbQKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QFvpxX2L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1400/1%2AHT9-h2VyhUxZSXmMYsbQKA.png" alt="Chart of code review time showing significant drop on March 2021" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart is fairly cyclical, with peaks and troughs corresponding roughly to the beginning and end of major releases — big, controversial changes as we’re trailblazing a new feature; smaller, almost-trivial punchlist items as we close in on release day. But the elephant in the series lands back around March 1st. That was the start of Q2, and the day we added “Code Review Vitals” to our dashboard.&lt;/p&gt;

&lt;p&gt;It’s been said that sunlight cures all ills, and simply measuring our workflow had the dual effects of revealing a significant bottleneck &lt;em&gt;and&lt;/em&gt; inspiring the behavioral changes needed to correct it.&lt;/p&gt;

&lt;p&gt;Voilà! More speed.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;By the time you read this post, odds are that our CI/CD pipeline has already evolved forward from the state described above. Iteration applies as much to process as to the software itself. We’re still learning, and — just like with new features — the more we know, and the sooner we know it, the better off we’ll be.&lt;/p&gt;

&lt;p&gt;With that, a humble question: what have you learned from your own CI/CD practices? Are there checks that have worked (or totally flopped) that we should be incorporating ourselves?&lt;/p&gt;

&lt;p&gt;We’d love to hear from you!&lt;/p&gt;

</description>
      <category>ci</category>
      <category>continuousdelivery</category>
      <category>devops</category>
      <category>tdd</category>
    </item>
    <item>
      <title>Every Software Team Deserves a Charter</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Thu, 02 Dec 2021 16:26:59 +0000</pubDate>
      <link>https://dev.to/koan/every-software-team-deserves-a-charter-1dl0</link>
      <guid>https://dev.to/koan/every-software-team-deserves-a-charter-1dl0</guid>
      <description>&lt;p&gt;Software teams own features, projects, and services—everyone knows that. But caught up in what we’re doing, it’s easy to lose sight of why we’re doing it. Digging into the details might turn up some clues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why does our team exist? To maintain Service X&lt;/li&gt;
&lt;li&gt;Why does Service X need to run? Because it’s a tier-two service that’s a dependency of tier-one Services A and Y&lt;/li&gt;
&lt;li&gt;What happens if Service A goes down? I’m not sure, but it doesn’t sound good.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams exist for a reason, opaque though it may be, and team members should know what it is.&lt;/p&gt;

&lt;h2&gt;Teams are works-in-progress&lt;/h2&gt;

&lt;p&gt;Clarity is even harder to come by when a team is just starting out. Wouldn’t it be nice if teams arrived in the world as the ancient Greek poet Hesiod describes the birth of the goddess Athena?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And the father of men and gods gave her birth by way of his head…arrayed in arms of war. —Hesiod, Theogony&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anyone who has collaborated with other people can confirm that—unlike ancient Greek deities—teams do not spring forth fully-formed. Nor are they the product of a single head: teams are made up of individuals, and even with a cohesive vision of the team’s objective (not itself a given), everyone likely won’t agree on the best way to get there.&lt;/p&gt;

&lt;p&gt;The psychologist Bruce Tuckman framed this reality with a four-stage model for group development. In Tuckman’s model, teams form, storm, norm, and perform, with high-performing teams emerging only after weathering the turbulence of the early stages. But while we can’t control the sequence of events, we can hasten the journey.&lt;/p&gt;

&lt;h2&gt;Accelerating team formation&lt;/h2&gt;

&lt;p&gt;Shared expectations are the foundation underlying all effective teamwork. Yet often the process of establishing them is left up to chance. It’s true that time and good intentions usually lead to common ground, but by forcing explicit conversations about the team’s intentions and beliefs up front, a written charter can significantly accelerate the process.&lt;/p&gt;

&lt;p&gt;At a minimum, a charter should lay out the team’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mission statement&lt;/strong&gt;, summarizing the team’s shared purpose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;principles&lt;/strong&gt; for conduct and decision-making&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;key performance indicators&lt;/strong&gt; (KPIs) representing the team’s status and health&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While a team lead or manager may write the first draft, revisions are highly encouraged: input and feedback from all team members will only help ensure the charter represents a common understanding of the team’s identity. Ideally the charter-drafting process will start to build collective ownership as well.&lt;/p&gt;

&lt;p&gt;Let’s dig into specifics, and look at how we’ve addressed them in the &lt;a href="https://koan.co"&gt;Koan&lt;/a&gt; dev team charter.&lt;/p&gt;

&lt;h2&gt;Mission statement&lt;/h2&gt;

&lt;p&gt;A charter’s mission statement is a single sentence summarizing what the team does, for whom, and how. Rather than serving specific customer personas or internal stakeholders, our development team is on the hook to advance the company and its mission as a whole. We do it by shipping reliable software and holding ourselves (and our colleagues) to high standards. As the mission statement at the top of our charter reads, we exist:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To advance Koan through engineering excellence and continuous improvement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Principles&lt;/h2&gt;

&lt;p&gt;Principles are the guidelines that the team can fall back on when assessing contradictory or otherwise unclear choices. They also set expectations. A principle that we “win the marathon” both encourages thoughtful, long-term decision-making and implies that team members will do it.&lt;/p&gt;

&lt;p&gt;The team’s principles should be short and memorable. In creating our own charter, we brainstormed, debated, and revised our way down to just four. They’re both a clear expression of our common values and simple enough to remember. They read:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In support of our company Mission, Vision and Goals, and Values, Koan engineers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Figure it out. We find a way to deliver our objectives while continuously improving along the way.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ship to learn. We release the moment that staging is better than prod, listen early and often, and move faster because of it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deliver customer value. Our work directly benefits our customers — whether they’re outside Koan or at the next desk down.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Win the marathon. We’re in it for the long haul, making decisions that balance today’s needs against the uncertain future ahead.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once again, our closeness to the rest of the company shows through in a brief preamble connecting our team-specific principles back to the mission, vision, and values of the organization as a whole.&lt;/p&gt;

&lt;h2&gt;KPIs&lt;/h2&gt;

&lt;p&gt;The team’s charter should include a measurable definition of its health. Is the team maintaining basic responsibilities and expectations? Team members should always be able to reference KPIs that quantify the team’s current status.&lt;/p&gt;

&lt;p&gt;As with the mission and principles, the specific metrics will vary considerably across functions. While a sales org may be looking at calls per rep or the total value of qualified leads, a dev team will often focus on the “—ilities”—stability, durability, and so on.&lt;/p&gt;

&lt;p&gt;Our own KPIs are split between numbers we’re interested in (but not actively losing sleep over) and numbers that really matter. The latter are important enough to take up precious real estate on our company dashboard, and as the lone development team in a dynamic startup we’ve limited our focus to just two themes with very specific measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: % TypeScript coverage (FE, BE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity&lt;/strong&gt;: PR lifetime (time delta from opened to merged)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are plenty of other numbers we’re interested in—but for the charter (2021 edition) those two were disproportionately more important to our continued improvement (and quality of life) as a team.&lt;/p&gt;
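
&lt;p&gt;Part of the appeal is how cheap these measures are to compute. As a sketch, here’s one way a “% TypeScript coverage” number might be derived — the heuristic (share of source files already in TS) and the file list are made up for illustration:&lt;/p&gt;

```typescript
// Hypothetical KPI sketch: percentage of source files already ported
// from JavaScript to TypeScript.
function typescriptCoverage(files: string[]): number {
  const source = files.filter((f) => /\.(tsx?|jsx?)$/.test(f));
  const typed = files.filter((f) => /\.tsx?$/.test(f));
  return (100 * typed.length) / source.length;
}

// Made-up file list: two of four source files ported so far.
const frontend = ["app.tsx", "routes.ts", "legacy/util.js", "legacy/api.jsx"];
console.log(typescriptCoverage(frontend)); // 50
```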

&lt;h2&gt;The team evolves. The charter, too.&lt;/h2&gt;

&lt;p&gt;Existing teams need charters, too, and chances are the team isn’t the same as when it was first formed. Explicit or otherwise, the charter will change. It &lt;em&gt;should&lt;/em&gt; change. Revisit it quarterly, revise it yearly, or whenever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A principle needs updating to reflect changing expectations or operating conditions&lt;/li&gt;
&lt;li&gt;A KPI is significantly exceeded, or becomes an automatic part of the culture&lt;/li&gt;
&lt;li&gt;Team members join or leave&lt;/li&gt;
&lt;li&gt;And so on!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, the charter is just the beginning. So much more goes into an effective team, from the skills individual team members bring to the goals they work together to achieve.&lt;/p&gt;

&lt;p&gt;Even as expectations change over the team’s lifetime, team members should never lack clarity on why the team exists, or on how they can show up and contribute!&lt;/p&gt;

</description>
      <category>teamwork</category>
      <category>leadership</category>
      <category>management</category>
      <category>teams</category>
    </item>
    <item>
      <title>Making the most of our startup’s on-call rotation</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Tue, 16 Nov 2021 00:19:39 +0000</pubDate>
      <link>https://dev.to/koan/making-the-most-of-our-startups-on-call-rotation-3mnh</link>
      <guid>https://dev.to/koan/making-the-most-of-our-startups-on-call-rotation-3mnh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Will I have to be on call?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the last hour of &lt;a href="https://koan.co/"&gt;Koan&lt;/a&gt;’s &lt;a href="https://www.koan.co/blog/why-we-open-sourced-our-hiring-manual"&gt;on-site interview&lt;/a&gt; we turn the tables and invite candidates to interview our hiring team. At face value it’s a chance to address any open questions left from earlier in the process. It’s also a subtle way to introspect on our own hiring process — after three rounds of interviews and side-channel conversations with the hiring manager, what have we missed? What’s on candidates’ minds? Can we address it earlier in the process?&lt;/p&gt;

&lt;p&gt;So, you asked, will I have to be on call?&lt;/p&gt;

&lt;p&gt;The middle-of-the-night pager rings? The panicked investigations? Remediation, write-ups, post-mortems?&lt;br&gt;
We get it. We’ve been there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/patrickc"&gt;Patrick Collison&lt;/a&gt;’s been there, too:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1432731774270906369-537" src="https://platform.twitter.com/embed/Tweet.html?id=1432731774270906369"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;“Don’t ruin the duck.” There are worse guiding principles for an on-call process (and operational health generally).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;So, will I have to be on call?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah, you will. But we’ve gotten a ton out of Koan’s on-call rotation and we hope you will, too. Ready to learn more?&lt;/p&gt;

&lt;h2&gt;
  
  
  On-call at Koan
&lt;/h2&gt;

&lt;p&gt;We set up Koan’s on-call rotation before we’d heard anything about Patrick’s ducks. Our version of “don’t ruin the duck” included three principles that (if somewhat less evocative) have held up surprisingly well:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;We concentrate distractions&lt;/strong&gt; — our on-call developer is tasked with minimizing context switching for the rest of the team. We’ll escalate incidents if needed, but as much as possible the business of ingesting, diagnosing, and triaging issues in production services stays in a single person’s hands — keeping the rest of the team focused on shipping great product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We control our own destiny&lt;/strong&gt; — just like Koan’s culture at large, being on-call is much more about results (uptime, resolution time, pipeline throughput, and learning along the way) than how they come about. Our on-call developer wields considerable authority over how issues are fielded and dispatched, and even over the production release schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We take turns&lt;/strong&gt; — on-call responsibilities rotate weekly. This keeps everyone engaged with the on-call process and avoids condemning any single person to an eternity (or even an extended period) of pager duty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These principles have helped us wrangle a fundamentally interrupt-driven process. What we didn’t realize, though, was how much time — and eventually, value — we were recovering between the fire drills.&lt;/p&gt;

&lt;h2&gt;
  
  
  How bugs begin
&lt;/h2&gt;

&lt;p&gt;Before going further, though, we’d be remiss not to mention the easiest path to a calm, quiet on-call schedule: don’t release. To paraphrase Descartes, code ergo bugs — no matter how diligent you are in QA, shipping software means injecting change (and therefore new defects) into your production environment.&lt;/p&gt;

&lt;p&gt;Not shipping isn’t an option. We’re in the habit of releasing multiple times per day, not to mention all of the intermediate builds pushed to our staging environment via CI/CD. A production issue every now and then is a sign that the system’s healthy; that we’re staying ambitious and shipping fast.&lt;/p&gt;

&lt;p&gt;But it also means that things sometimes break. And when they do, someone has to pick up the phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals
&lt;/h2&gt;

&lt;p&gt;On the bad days, on-call duty is a steady stream of interruptions punctuated by the occasional crisis. On the good days it isn’t much to write home about. Every day, though, there are at least a few minutes to tighten down screws, solve problems, and explore the system’s nooks and crannies. This is an intentional feature (not a bug) of our on-call rotation, and the payoff has been huge. We’ve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;built shared ownership of the codebase and production systems&lt;/li&gt;
&lt;li&gt;systematized logging, metrics, monitoring, and alerting&lt;/li&gt;
&lt;li&gt;built empathy for customers (and our support processes)&lt;/li&gt;
&lt;li&gt;spread awareness of little-used features (we’re always onboarding)&lt;/li&gt;
&lt;li&gt;iterated on key processes (ingestion/triage, release management, etc)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t get all that by just passing around a firefighting hat. You need buy-in and — crucially — a healthy relationship with your production environment. Which brings us back to our principles, and the on-call process that enables it.&lt;/p&gt;

&lt;h2&gt;
  
  
  We concentrate distractions
&lt;/h2&gt;

&lt;p&gt;When something breaks, the on-call schedule clarifies who’s responsible for seeing that it’s fixed. As the proverbial umbrella keeping everyone else focused and out of the rain (sometimes a downpour, sometimes a drizzle), you don’t need to immediately fix every problem you see: just investigate, file, and occasionally prioritize them for immediate attention.&lt;/p&gt;

&lt;p&gt;That still means a great deal of on-call time spent ingesting and triaging a steady drip of symptoms from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;customer issues escalated by our customer success team&lt;/li&gt;
&lt;li&gt;internal bug reports casually mentioned in conversations, Slack channels, or email threads&lt;/li&gt;
&lt;li&gt;exceptions/alerts reported by application and infrastructure monitoring tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes symptoms aren’t just symptoms, and there’s a real issue underneath. Before you know it, the pager starts ringing—&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter the pager
&lt;/h3&gt;

&lt;p&gt;The water’s getting warmer. A pager ping isn’t the end of the world, but we’ve tuned out enough false positives that an alert is a good sign that something bad is afoot.&lt;/p&gt;

&lt;p&gt;Once you’ve confirmed a real issue, the next step is to classify its severity and impact. A widespread outage? That needs attention immediately. Degraded performance in a specific geography? Not awesome, but something that can probably wait until morning. Whatever it is, we’re looking to you to coordinate our response: updating our status page externally, and either escalating or resolving the issue yourself.&lt;/p&gt;

&lt;p&gt;On-call isn’t a private island. There will always be times we need to pause work in progress, call in the team, and get to the bottom of something that’s keeping us down. But the goal is to do it in a controlled fashion, holding as much space for everyone else as you reasonably can.&lt;/p&gt;

&lt;h2&gt;
  
  
  We control our own destiny
&lt;/h2&gt;

&lt;p&gt;Your responsibilities aren’t purely reactive, however. Controlling your own destiny means having at least a little agency over what breaks and when. This isn’t just wishful thinking. While issues introduced in the past are always a lurking threat — logical edge cases, bottlenecks, resource limits, and so on — the source of most new issues is a new release.&lt;/p&gt;

&lt;p&gt;It makes sense, then, for whoever’s on-call to have the last word on when (and how) new releases are shipped. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;managing the release&lt;/strong&gt; — generating changelogs, reviewing the contents of the release, and ensuring the appropriate people are warned and signatures are obtained&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;debugging release / deployment issues&lt;/strong&gt; — monitoring both the deployment and its immediate aftermath, and remediating any issues that arise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;making the call on hotfix releases and rollbacks&lt;/strong&gt; — as a step sideways from our usual flow they’re not tools we use often. But they’re there (and very quick) if you need them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Closing the feedback loop
&lt;/h3&gt;

&lt;p&gt;An unexpected benefit we’ve noticed from coupling on-call and release management duties is the backpressure it puts on both our release cadence and deployment pipeline. If we’re underwater with issues from the previous release, the release manager has strong incentives to see they’re fixed before shipping anything else. Ditto any issues in our CI/CD processes.&lt;/p&gt;

&lt;p&gt;Neither comes up too often, fortunately, and while we can’t totally write off the combination of robust systems and generally good luck, it’s just as hard to discount the benefits of tight feedback and an empowered team.&lt;/p&gt;

&lt;h2&gt;
  
  
  We take turns
&lt;/h2&gt;

&lt;p&gt;But you said “team!” — a lovely segue to that last principle. Rotating on-call responsibility helps underscore our team’s commitment to leaving a relatively clean bill (releases shipped, exceptions handled, tickets closed, etc.) for the next person up. When you’re on-call, you’re the single person best placed to deflect issues that would otherwise engulf the entire team. When you’re about to be on call, you’re invested in supporting everyone else in doing the same. You’d love to start your shift with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;healthy systems&lt;/li&gt;
&lt;li&gt;a manageable backlog of support inquiries&lt;/li&gt;
&lt;li&gt;a clear list of production exceptions&lt;/li&gt;
&lt;li&gt;a quick brain-dump of issues fielded (and ongoing concerns) from the teammate you’re taking over from&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A frequent rotation almost guarantees that everybody’s recently felt the same way. Team members regularly swap shifts (for vacations, appointments, weddings, anniversaries, or any other reason), but it’s never long before you’re back on call.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rest of the time
&lt;/h2&gt;

&lt;p&gt;Ultimately, we’ve arrived at an on-call process that balances the realities of running software in production with a high degree of agency. We didn’t explicitly prioritize quality of life, and we don’t explicitly track how much time on-call duties are eating up. But collective ownership, individual buy-in, and tight feedback have pushed the former up and the latter down, to the point where you’ll find you have considerable time left over for other things. Ideally you’ll use your turn on-call to dig deeper into the issues you touch along the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;exploring unfamiliar features (with or without reported bugs)&lt;/li&gt;
&lt;li&gt;tightening up our CI processes&lt;/li&gt;
&lt;li&gt;tuning configurations&lt;/li&gt;
&lt;li&gt;writing regression tests&lt;/li&gt;
&lt;li&gt;improving logging and observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yes, you’ll be triaging issues, squashing bugs, and maybe even putting out the odd production fire. You can almost count on having time left to help minimize the need for on-call. You’re on the hook to fix things if they break — and empowered to make them better.&lt;/p&gt;

&lt;p&gt;So yes, you’ll have to take an on-call shift.&lt;/p&gt;

&lt;p&gt;Help us make it a good one!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://unsplash.com/@danielsessler?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Daniel Seßler&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/duck-decoy?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>errors</category>
      <category>ducks</category>
    </item>
    <item>
      <title>Routing on the Edge</title>
      <dc:creator>Danielle Heberling</dc:creator>
      <pubDate>Thu, 09 Sep 2021 18:23:09 +0000</pubDate>
      <link>https://dev.to/koan/routing-on-the-edge-3pn</link>
      <guid>https://dev.to/koan/routing-on-the-edge-3pn</guid>
      <description>&lt;p&gt;At &lt;a href="https://www.koan.co/?utm_campaign=edgerouter&amp;amp;utm_medium=blog&amp;amp;utm_source=medium" rel="noopener noreferrer"&gt;Koan&lt;/a&gt;, our application’s frontend is a &lt;a href="https://reactjs.org/" rel="noopener noreferrer"&gt;React&lt;/a&gt; &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/SPA" rel="noopener noreferrer"&gt;Single Page Application&lt;/a&gt; running in two distinct environments (Staging and Production).&lt;/p&gt;

&lt;p&gt;In addition to viewing the Staging and Production versions of our frontend, we also need to serve up a version of the frontend based off of a git commit in our Staging environment. Doing this gives Koan developers a “live preview” URL to review what the frontend looks like after committing changes but before they’re merged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Solution
&lt;/h3&gt;

&lt;p&gt;Our solution has the following high level steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Code changes are merged into our &lt;code&gt;main&lt;/code&gt; branch. This action kicks off our &lt;a href="https://circleci.com/" rel="noopener noreferrer"&gt;CI system&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The CI builds the code and places the build artifacts (static HTML/JavaScript/CSS files) into S3 buckets&lt;/li&gt;
&lt;li&gt;A CloudFront CDN is in front of one of those S3 buckets&lt;/li&gt;
&lt;li&gt;Our staging app domain is pointed at this CloudFront CDN&lt;/li&gt;
&lt;li&gt;On all origin requests to the staging app domain → a Lambda@Edge function serves a build-specific &lt;code&gt;index.html&lt;/code&gt; with static references to the rest of the build&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o4w9nq3nphde65ai4wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o4w9nq3nphde65ai4wg.png" alt="edge-router" width="582" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  More Details on Build Artifacts
&lt;/h3&gt;

&lt;p&gt;Our CI process delivers build artifacts into S3 at &lt;code&gt;/commit/[commit sha]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a developer wants to “live preview” their recent commit, they need to add &lt;code&gt;/commit/&amp;lt;their commit SHA&amp;gt;&lt;/code&gt; to the end of our staging app domain.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;index.html&lt;/code&gt; file in this S3 bucket references static assets (CSS/JS files) hosted on a separate &lt;code&gt;frontend-builds&lt;/code&gt; subdomain. This domain points at a second CloudFront CDN with a second S3 bucket as its origin. Serving these as CDN-friendly, immutable assets saves significant compute (money) for resources that don't need Lambda@Edge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inside the Lambda "router" function
&lt;/h3&gt;

&lt;p&gt;Whenever that developer requests a specific version of the app, the request hits CloudFront as an &lt;code&gt;origin-request&lt;/code&gt;. Our Lambda@Edge function receives a &lt;a href="https://docs.amazonaws.cn/en_us/AmazonCloudFront/latest/DeveloperGuide/lambda-event-structure.html#example-origin-request" rel="noopener noreferrer"&gt;message event&lt;/a&gt; from CloudFront and then proceeds to do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gets the git commit hash from the pathname in the request. If there isn’t a commit hash in the URL, then we assume we want the latest version.&lt;/li&gt;
&lt;li&gt;Gets the requested index file&lt;/li&gt;
&lt;li&gt;Returns the index file as the body for our response&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Let's see some code
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Gets the git commit hash from the pathname in the request&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whenever someone makes an HTTP request to the CDN, the CDN then sends an event object to our Lambda@Edge function. The shape looks something like &lt;a href="https://docs.amazonaws.cn/en_us/AmazonCloudFront/latest/DeveloperGuide/lambda-event-structure.html#example-origin-request" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We then pull the &lt;code&gt;pathname&lt;/code&gt; off of that event object:&lt;/p&gt;
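&lt;p&gt;A minimal sketch of that extraction (the event shape follows CloudFront’s documented origin-request structure; the helper name is ours for illustration, not Koan’s exact code):&lt;/p&gt;

```javascript
// Pull the request pathname out of a CloudFront origin-request event.
// CloudFront nests the request under Records[0].cf.request.
const getPathname = (event) => {
  const request = event.Records[0].cf.request;
  return request.uri;
};
```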


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now that we have our &lt;code&gt;pathname&lt;/code&gt; (including the optional &lt;code&gt;commit/&amp;lt;commit sha&amp;gt;&lt;/code&gt; fragment), we can extract our git commit hash by calling a &lt;code&gt;getHash&lt;/code&gt; helper function.&lt;/p&gt;

&lt;p&gt;If there isn’t a hash present in the &lt;code&gt;pathname&lt;/code&gt; this means that we just want to serve up the latest version of the app, so we'll return &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
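&lt;p&gt;A sketch of what that &lt;code&gt;getHash&lt;/code&gt; helper might look like (the exact regex is an assumption; here any plausible-length hex string after &lt;code&gt;/commit/&lt;/code&gt; counts as a SHA):&lt;/p&gt;

```javascript
// Extract a git commit SHA from a pathname like "/commit/<sha>/...".
// Returns null when no hash is present, i.e. "serve the latest build".
const getHash = (pathname) => {
  const match = pathname.match(/^\/commit\/([0-9a-f]{7,40})/);
  return match ? match[1] : null;
};
```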


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gets the requested index file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that we have our git commit hash (or the &lt;code&gt;null&lt;/code&gt; default) from the &lt;code&gt;pathname&lt;/code&gt;, let's pass that commit hash into another helper function to get the desired index file from our S3 bucket.&lt;/p&gt;

&lt;p&gt;The variables that start with &lt;code&gt;process.env&lt;/code&gt; are Node.js’s way of referencing environment variables on the Lambda function. We set these variables when the function was provisioned.&lt;/p&gt;

&lt;p&gt;If the S3 object (index.html file) is missing, we handle that in the &lt;code&gt;catch&lt;/code&gt; and log the error.&lt;/p&gt;
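&lt;p&gt;A hedged sketch of that lookup (the &lt;code&gt;BUILDS_BUCKET&lt;/code&gt; variable name, the &lt;code&gt;latest&lt;/code&gt; fallback key, and the helper names are illustrative assumptions; the S3 client is passed in rather than created here so the helper is easy to exercise without AWS):&lt;/p&gt;

```javascript
// Map an optional commit hash onto an S3 key for that build's index.html.
const indexKey = (hash) =>
  hash ? `commit/${hash}/index.html` : 'latest/index.html';

// Fetch the build-specific index.html from S3; a missing object
// (or any other S3 error) is logged in the catch, not thrown.
const getIndexFile = async (s3, hash) => {
  try {
    const object = await s3
      .getObject({ Bucket: process.env.BUILDS_BUCKET, Key: indexKey(hash) })
      .promise();
    return object.Body.toString('utf-8');
  } catch (err) {
    console.error(`failed to fetch ${indexKey(hash)}`, err);
    return null;
  }
};
```

&lt;p&gt;In the deployed function, &lt;code&gt;s3&lt;/code&gt; would be an &lt;code&gt;aws-sdk&lt;/code&gt; S3 client created outside the handler so it can be reused across invocations.&lt;/p&gt;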


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A possible next step would be caching the file in Lambda@Edge memory. Since each build’s index file is immutable, we should only need to retrieve it from S3 once per execution environment. See &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/leveraging-external-data-in-lambdaedge/" rel="noopener noreferrer"&gt;Leveraging external data in Lambda@Edge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Returns the index file as the body for our response &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All together, the function’s code will look something like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Closing
&lt;/h3&gt;

&lt;p&gt;While there are opportunities for improvement, this setup works well for our team, and we thought that sharing this approach might give you and your team some ideas to iterate on.&lt;/p&gt;

&lt;p&gt;More recently, AWS released &lt;a href="https://aws.amazon.com/blogs/aws/introducing-cloudfront-functions-run-your-code-at-the-edge-with-low-latency-at-any-scale/" rel="noopener noreferrer"&gt;CloudFront Functions&lt;/a&gt;. Stay tuned as we evaluate whether that’s a good replacement for our existing Lambda@Edge functions. It’s highly possible we could re-architect this to bypass the S3 GET entirely and/or make further use of edge caching.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Thanks to &lt;a href="https://dev.to/danielkaczmarczyk"&gt;Daniel Kaczmarczyk&lt;/a&gt; and &lt;a href="https://dev.to/rjz"&gt;RJ Zaworski&lt;/a&gt; for reviewing drafts of this article. &lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>cloudskills</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Secret to Getting More Done</title>
      <dc:creator>Danielle Heberling</dc:creator>
      <pubDate>Wed, 14 Jul 2021 17:23:45 +0000</pubDate>
      <link>https://dev.to/koan/the-secret-to-getting-more-done-5hjh</link>
      <guid>https://dev.to/koan/the-secret-to-getting-more-done-5hjh</guid>
      <description>&lt;p&gt;It was a cold and rainy day as I sat alone in my home office debugging Webpack config errors. No matter what I tried, the errors would not go away. My natural inclination was to "just get through most of them" before eating lunch. But as I fixed errors, more emerged. Do you know what eventually got me through these hang ups?&lt;/p&gt;

&lt;h3&gt;
  
  
  Taking a break.
&lt;/h3&gt;

&lt;p&gt;As my hunger grew stronger, I decided to give in and went to lunch. Being able to step away and temporarily detach my mind from the task at hand was just what I needed to return with a fresh perspective and new ideas to try.&lt;/p&gt;

&lt;p&gt;Time and time again, the act of taking a break has helped immensely both for getting meaningful work done and for my overall mental state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhfok0l45ztij2l2evhg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhfok0l45ztij2l2evhg.jpg" alt="sander-dalhuisen-nA6Xhnq2Od8-unsplash" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@sanderdalhuisen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Sander Dalhuisen&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/lunch?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My top three activities while taking a break throughout the workday are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a walk outside if the weather isn't too bad.&lt;/li&gt;
&lt;li&gt;Change my surroundings. This could mean working at a nearby coffee shop or moving from a desk to a couch. The context switching required to get up and move also helps to refocus.&lt;/li&gt;
&lt;li&gt;Read a book. Bonus points if it is a physical copy or on an e-reader. It's important not to stare at the same screen all day.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://fortune.com/2021/07/06/kickstarter-four-day-work-week-2022/" rel="noopener noreferrer"&gt;Some companies&lt;/a&gt; are planning to take this idea a step further and are piloting a four day work week. Personally, I'm really interested to see how the rise of remote/hybrid workplaces as a result of the COVID-19 pandemic affects workers' break frequency. Curious to see if it goes up or down. In the meantime, I'm happy I get to work on a &lt;a href="https://www.koan.co/company/about" rel="noopener noreferrer"&gt;team&lt;/a&gt; that facilitates working with purpose, built on a culture that supports transparency, autonomy and inclusivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four-day weeks and hybrid workplaces don't mean less work—just a more honest accounting of what already goes on.&lt;/strong&gt; The standard work week in the USA is currently 40 hours, but no one is actually productive that entire time. I set up a poll on Twitter to get some data points on this, and here are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8mivpmmzhqoexq8sk1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8mivpmmzhqoexq8sk1r.png" alt="twitterPoll" width="597" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remembering to take breaks has helped me to get more done while working fewer hours, and I can support my teammates better by bringing my best self to my work. Maybe it can help you too. What are some of your favorite activities to do when taking a break during the work day?&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>startup</category>
      <category>devjournal</category>
      <category>programming</category>
    </item>
    <item>
      <title>Being a developer at a startup is actually pretty great</title>
      <dc:creator>Daniel Kaczmarczyk</dc:creator>
      <pubDate>Tue, 22 Jun 2021 20:31:10 +0000</pubDate>
      <link>https://dev.to/koan/being-a-developer-at-a-startup-is-actually-pretty-great-3n5m</link>
      <guid>https://dev.to/koan/being-a-developer-at-a-startup-is-actually-pretty-great-3n5m</guid>
      <description>&lt;p&gt;You’re a developer who is looking for their next role and you are thinking about what kind of companies to talk to. One of the first decisions you have to make is whether to join a startup or a bigger, more established company. Here’s a quick case for startups:&lt;/p&gt;

&lt;h2&gt;
  
  
  Biggest Incentives
&lt;/h2&gt;

&lt;p&gt;The earlier you join the company, the higher the financial upside can be. If the startup does well, your share options can be worth a lot of money. The people you’re going to work with are often very entrepreneurial and innovative (and since you’re reading this article, you probably are too). You can learn a lot being around other creators, sharing ideas, and discussing things openly. It’s a stark contrast to more corporate processes and environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  You get a lot of stuff done!
&lt;/h2&gt;

&lt;p&gt;It’s not all paperwork and sitting in meetings. At a startup, “agile” is an adjective — not a time-suck. We release our software early and often, and maintain a very short cycle between pull requests and releases. It’s a very exciting environment, with a lot of opportunity to see your work come to life — and fast! Working for a company that wins a startup competition feels different from being one of 5,000 employees in a giant conglomerate. Although I’ve only experienced the former, I can extrapolate that getting 20% of the credit feels better than getting 0.002%.&lt;/p&gt;

&lt;p&gt;If you’re joining a startup, you can also expect to be able to influence the technology choices, the culture, and many other things that you would not be able to influence otherwise. Most of the processes are a blank slate, which require you to take charge and make the calls yourself. This includes a lot of things — hiring decisions, processes, technologies, and the list goes on and on.&lt;/p&gt;

&lt;p&gt;Another great aspect of working at a smaller company is that you’re much closer to the folks who use your product. Being able to more easily obtain feedback is a gift to help build things that closely align with your user base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accelerate your career
&lt;/h2&gt;

&lt;p&gt;At a startup, you have a big impact as an individual contributor. If you join a team of 5 engineers, you become ~17% of the team, and your work and ideas carry a lot of weight. Taking on this responsibility gives you an opportunity to hone your skills, master more parts of the stack, and gain confidence and great experience. Opportunity presents itself where responsibility is dropped, or, as in many startups, where no one has claimed responsibility yet. There is a lot of joy in diving into a codebase and finding a part of it that can be made better, knowing there’s no one but you to guide that part of the project.&lt;/p&gt;

&lt;p&gt;This sounds exciting… for some people. Responsibility is a double-edged sword. Owning a feature means ruling over your domain, but it also means answering for your decisions. Most of the time, though, leading features and projects is hugely beneficial and a great learning opportunity. You learn a lot of important lessons very quickly, like having to live with a large existing codebase… most of it written by you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Mentorship and Learning
&lt;/h2&gt;

&lt;p&gt;Mentorship is critical to accelerating your career. Often your manager (who may also be your CTO) will be in charge of a small team, meaning you get a lot of their attention. In a corporate role, it’s not unheard of to talk to your manager as infrequently as once a month, for an hour. In contrast, it’s common practice at a startup to give you both a lot of 1:1 time and opportunities to better your craft through their feedback and help.&lt;/p&gt;

&lt;p&gt;With that kind of support and independence, you will find yourself empowered to choose the projects you want to work on, and how exactly you want to do them. From creating a new internal service in the language you like most, to adopting a pattern you enjoy working with, to suggesting marketing copy changes, startup life is rife with opportunities to do things the way you like.&lt;/p&gt;

&lt;p&gt;Another important consideration in whether you’re ready for this kind of job is your willingness to teach yourself anything. Since the team is quite small, even when your mentor makes themselves very available, there will still be a lot of things you have to pick up on your own. Where a big company would provide you with a rigorous and lengthy training program, at a startup you’ll most often find yourself not only teaching yourself what’s necessary, but also figuring out what you need to learn to fill in the gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bad rap
&lt;/h2&gt;

&lt;p&gt;Startups often get a bad rap. When people complain about working for startups, they often mention long hours and low pay, lousy culture (and no HR department to fix it), the risk of burning out quickly, and general chaos. As true as many of those points are for some companies, they’re not exclusive to startups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Working at a startup is likely to be rewarding. The connections you make with other people are invaluable, and most startups are much more lenient in letting you choose how to do your work, whether that means choosing all of your equipment or setting your hours according to your lifestyle and preferences. You’ll develop your decisiveness, communication, and adaptability. And that’s all on top of the broad spectrum of technical skills you’ll pick up along the way, along with a great job title and a list of achievements that you can confidently say were yours.&lt;/p&gt;




&lt;p&gt;Special thanks to &lt;a href="https://twitter.com/deeheber"&gt;Danielle Heberling&lt;/a&gt; for helping out with the content.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@israelandrxde"&gt;Israel Andrade&lt;/a&gt; on &lt;a href="https://unsplash.com"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Pebbles to Brickworks: a Story of Cloud Infrastructure Evolved</title>
      <dc:creator>RJ Zaworski</dc:creator>
      <pubDate>Tue, 18 Aug 2020 18:16:23 +0000</pubDate>
      <link>https://dev.to/koan/from-pebbles-to-brickworks-a-story-of-cloud-infrastructure-evolved-3p63</link>
      <guid>https://dev.to/koan/from-pebbles-to-brickworks-a-story-of-cloud-infrastructure-evolved-3p63</guid>
      <description>&lt;p&gt;You can build things out of pebbles. Working with so many unique pieces isn’t easy, but if you slather them with mortar and fit them together just so, it’s possible to build a house that won’t tumble down in the slightest breeze.&lt;/p&gt;

&lt;p&gt;Like many startups, that’s where &lt;a href="https://www.koan.co/?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=terraform-1"&gt;Koan’s&lt;/a&gt; infrastructure started. With lovingly hand-rolled EC2 instances sitting behind lovingly hand-rolled ELBs inside a lovingly — yes — hand-rolled VPC. Each came with its own quirks, software updates, and Linux version. Maintenance was a constant test of our technical acumen and patience (not to mention nerves); scalability was out of the question.&lt;/p&gt;

&lt;p&gt;These pebbles carried us from our earliest prototypes to the first public iteration of Koan’s leadership platform. But there comes a day in every startup’s journey when its infrastructure needs to grow up.&lt;/p&gt;

&lt;h2&gt;Motivations&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The wolf chased them down the lane and he almost caught them. But they made it to the brick house and slammed the door closed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What we wanted were bricks, uniform commodities that can be replicated or replaced at will. Infrastructure &lt;a href="https://12factor.net/"&gt;built from bricks&lt;/a&gt; has some significant advantages over our pebbly roots:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt;. Knowing who did what (and when) makes it possible to understand and collaborate on infrastructure. It’s also an absolute must for compliance. Repeatable, version-controlled infrastructure supplements application changelogs with a snapshot of the underlying infrastructure itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt;. Changing infrastructure you don’t really know is a nervous business, and for our part, we didn’t really know ours. That isn’t a great position to be in when the infrastructure needs to scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;. Pebbles come in all shapes and sizes. New environment variables, port allocations, permissions, directory structure, and dependencies must be individually applied and verified on each instance. This consumes development time and increases the risk of “friendly-fire” incidents from any inconsistencies between different hosts (see: #2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeatability&lt;/strong&gt;. Rebuilding a pebble means replicating all of the natural forces that shaped it over the eons. Restoring our infrastructure after a catastrophic failure seemed like an impossible task—a suspicion that we weren’t in a hurry to verify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;. Replacing and extending are two sides of the same coin. While it’s possible to snap a machine image and scale it out indefinitely, an eye to upkeep and our own mental health encouraged us to consider a fresh start. From a minimal, reasonably hardened base image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since our work at Koan is all about goal achievement, most of our technical projects start exactly where you’d expect. Here: reproducible infrastructure (or something closer to it), documented and versioned as code. We had plenty of expertise with tools like &lt;a href="https://www.terraform.io/"&gt;terraform&lt;/a&gt; and &lt;a href="https://www.ansible.com/"&gt;ansible&lt;/a&gt; to draw on and felt reasonably confident putting them to use—but even with familiar tooling, our initially shaky foundation didn’t exactly discourage caution.&lt;/p&gt;

&lt;p&gt;That meant taking things step by gradual step, establishing and socializing patterns that we intended to eventually adopt across all of our cloud infrastructure. That’s a story for future posts, but the journey had to start somewhere.&lt;/p&gt;

&lt;h2&gt;Dev today, tomorrow the world&lt;/h2&gt;

&lt;p&gt;“Somewhere,” was our trusty CI environment, &lt;code&gt;dev&lt;/code&gt;. Frequent, thoroughly-tested releases are both a reasonable expectation and a point of professional pride for our development team. &lt;code&gt;dev&lt;/code&gt; is where the QA magic happens, and since downtime on &lt;code&gt;dev&lt;/code&gt; blocks review, we needed to keep disruptions to a minimum.&lt;/p&gt;

&lt;p&gt;Before &lt;code&gt;dev&lt;/code&gt; could assume its new form, we needed to be reasonably confident that we could rebuild it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;…in the right VPC&lt;/li&gt;
&lt;li&gt;…with the right Security Groups assigned&lt;/li&gt;
&lt;li&gt;…with our standard logging and monitoring&lt;/li&gt;
&lt;li&gt;…and provisioned with a working instance of the Koan platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Four little tests, and we’d have both a repeatable &lt;code&gt;dev&lt;/code&gt; environment and a template we could extend out to production.&lt;/p&gt;

&lt;p&gt;We planned to tackle &lt;code&gt;dev&lt;/code&gt; in two steps. First, we would document (and eventually rebuild) our AWS infrastructure using &lt;code&gt;terraform&lt;/code&gt;. Once we had a reasonably-plausible configuration on our hands, we would then use &lt;code&gt;ansible&lt;/code&gt; to deploy the Koan platform. The two-step approach deferred a longer-term dream of fully-immutable resources, but it allowed us to address one big challenge (the infrastructure) while leaving our existing deployment processes largely intact.&lt;/p&gt;

&lt;h2&gt;Replacing infrastructure with Terraform&lt;/h2&gt;

&lt;p&gt;First, the infrastructure. The formula for documenting existing infrastructure in &lt;code&gt;terraform&lt;/code&gt; goes something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a stub entry for an existing resource&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;terraform import&lt;/code&gt; to attach the stub to the existing infrastructure&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;terraform state&lt;/code&gt; and/or &lt;code&gt;terraform plan&lt;/code&gt; to reconcile inconsistencies between the stub and reality&lt;/li&gt;
&lt;li&gt;Repeat until all resources are documented&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how we documented the &lt;code&gt;dev&lt;/code&gt; VPC's default security group, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo '
resource "aws_default_security_group" "default" {
  # parameterized reference to a VPC not yet represented in our
  # Terraform configuration
  vpc_id = var.vpc_id
}' &amp;gt;&amp;gt; main.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we could import the resource and run &lt;code&gt;terraform plan&lt;/code&gt; to see the difference between the existing infrastructure and our Terraform config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform import aws_default_security_group.default sg-123456
$ terraform plan
# module.dev-appserver.aws_default_security_group.default will be updated in-place
  ~ resource "aws_default_security_group" "default" {
      ~ egress                 = [
          - {                                 
              - cidr_blocks      = [
                  - "0.0.0.0/0",
                ]                    
              - description      = ""
              - from_port        = 0
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "-1"
              - security_groups  = []
              - self             = false
              - to_port          = 0
            },
        ]
        id                     = "sg-123456"
    # ...
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the diff as an outline, we could then fill in the corresponding &lt;code&gt;aws_default_security_group.default&lt;/code&gt; entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf
resource "aws_default_security_group" "default" {
  vpc_id = var.vpc_id
  ingress {
    protocol  = -1
    self      = true
    from_port = 0
    to_port   = 0
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-running &lt;code&gt;terraform plan&lt;/code&gt;, we could verify that the updated configuration matched the existing resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform plan
...
No changes. Infrastructure is up-to-date.
This means that Terraform did not detect any differences between
your configuration and real physical resources that exist. As a 
result, no actions need to be performed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The keen observer will recognize a prosaic formula crying out for automation, a call we soon answered. But for our first, cautious steps, it was helpful to document resources by hand. We wrote the configurations, parameterized resources that weren’t imported yet, and double-checked (triple-checked) our growing Terraform configuration against the infrastructure reported by the &lt;a href="https://aws.amazon.com/cli/"&gt;aws CLI&lt;/a&gt;.&lt;/p&gt;
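&lt;p&gt;The automated version is easy enough to imagine. As a hypothetical sketch (the resource type and IDs are invented for illustration, not our actual configuration), a small helper can stamp out the stub-plus-import pairs we were writing by hand:&lt;/p&gt;

```shell
# Hypothetical helper, for illustration only: print a stub resource block
# and the matching "terraform import" command for each security-group ID.
# Appending the stubs to main.tf and running the import commands is left
# to the operator.
emit_imports() {
  for id in "$@"; do
    printf 'resource "aws_security_group" "%s" {}\n' "$id"
    printf 'terraform import aws_security_group.%s %s\n' "$id" "$id"
  done
}

emit_imports sg-123456 sg-234567
```

&lt;p&gt;From there, &lt;code&gt;terraform plan&lt;/code&gt; still does the reconciling, exactly as in the manual formula.&lt;/p&gt;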

&lt;h2&gt;Sharing Terraform state with a small team&lt;/h2&gt;

&lt;p&gt;By default, Terraform tracks the state of managed infrastructure in a local &lt;a href="https://www.terraform.io/docs/state/index.html"&gt;tfstate&lt;/a&gt; file. This file contains both configuration details and a mapping back to the “live” resources (via IDs, resource names, and in Amazon’s case, ARNs) in the corresponding cloud provider. As a small, communicative team in a hurry, we felt comfortable bucking best practices and checking our state file right into source control. In almost no time we ran into collisions across git branches—a shadow of collaboration and locking problems to come—but we resolved to adopt more team-friendly practices soon. For now, we were up and running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast"&gt;Make it work, make it right&lt;/a&gt;.&lt;/p&gt;
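&lt;p&gt;For the record, the “more team-friendly practices” we had in mind look something like the sketch below: a remote S3 backend with DynamoDB-based state locking. The bucket and table names are invented for the example:&lt;/p&gt;

```hcl
# Hypothetical remote-state configuration; the bucket and table names
# are placeholders, not our actual resources.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "dev/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"
  }
}
```

&lt;p&gt;With the state file out of git, branch collisions disappear, and the lock table keeps two concurrent applies from trampling each other.&lt;/p&gt;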




&lt;h2&gt;Provisioning an application with Ansible&lt;/h2&gt;

&lt;p&gt;With most of our &lt;code&gt;dev&lt;/code&gt; infrastructure documented in Terraform, we were ready to fill it out. At this stage our attention shifted from the infrastructure itself to the applications that would be running on it—namely, the Koan platform.&lt;/p&gt;

&lt;p&gt;Koan’s platform deploys as a monolithic bundle containing our business logic, interfaces, and the small menagerie of dependent services that consume them. Which services run on a given EC2 instance will vary from one to the next. Depending on its configuration, a production node might be running our REST and GraphQL APIs, webhook servers, task processors, any of a variety of cron jobs, or all of the above.&lt;/p&gt;

&lt;p&gt;As a smaller, lighter facsimile, &lt;code&gt;dev&lt;/code&gt; has no such differentiation. Its single, inward-facing node plays host to the whole kitchen sink. To simplify testing (and minimize the damage to &lt;code&gt;dev&lt;/code&gt;), we took the cautious step of replicating this configuration in a representative local environment.&lt;/p&gt;

&lt;h2&gt;Building a local Amazon Linux environment&lt;/h2&gt;

&lt;p&gt;Reproducing cloud services locally is tricky. We can’t run EC2 on a developer’s laptop, but Amazon has helpfully shipped &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/amazon-linux-2-virtual-machine.html"&gt;images of Amazon Linux&lt;/a&gt;—our bricks’ target distribution. With a little bit of fiddling and a lot of help from &lt;a href="https://cloudinit.readthedocs.io/"&gt;&lt;code&gt;cloud-init&lt;/code&gt;&lt;/a&gt;, we managed to bring up reasonably representative Amazon Linux instances inside a local &lt;a href="https://www.virtualbox.org/"&gt;VirtualBox&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh -i local/ssh/id_rsa dev@localhost -p2222
Last login: Fri Sep 20 20:07:30 2019 from 10.0.2.2
       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
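&lt;p&gt;The “little bit of fiddling” mostly meant seeding the VM with &lt;code&gt;cloud-init&lt;/code&gt; user data. A minimal sketch, with an illustrative user name and a placeholder key, might look like:&lt;/p&gt;

```yaml
#cloud-config
# Hypothetical user-data for the local VirtualBox guest: create a "dev"
# user that ansible can reach over the forwarded SSH port.
users:
  - name: dev
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-rsa AAAA... dev@localhost  # public half of local/ssh/id_rsa
```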



&lt;p&gt;At this point, we could create an &lt;code&gt;ansible&lt;/code&gt; inventory assigning the same groups to our "local" environment that we would eventually assign to &lt;code&gt;dev&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# local/inventory.yml
appservers:
  hosts:
    127.0.0.1:
      ansible_port: 2222
cron:
  hosts:
    127.0.0.1:
      ansible_port: 2222
# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we did it all over again, we could likely save some time by skipping VirtualBox in favor of a detached EC2 instance. Then again, having a local, fast, safe environment to test against has already saved time in developing new ansible playbooks. The jury’s still out on that one.&lt;/p&gt;

&lt;h2&gt;Ansible up!&lt;/h2&gt;

&lt;p&gt;With a reasonable facsimile of our “live” environment in hand, we were finally down to the application layer. &lt;code&gt;ansible&lt;/code&gt; thinks of hosts in terms of their roles: databases, webservers, or something else entirely. We started by separating out two “base” roles, one for our VMs generally (&lt;code&gt;common&lt;/code&gt;) and one for our app servers in particular (&lt;code&gt;backend&lt;/code&gt;), where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;common&lt;/code&gt; role described monitoring, the runtime environment, and a default directory structure and permissions&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;backend&lt;/code&gt; role added a (versioned) release of the Koan platform&lt;/li&gt;
&lt;/ul&gt;
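&lt;p&gt;To give a flavor of what the &lt;code&gt;common&lt;/code&gt; role covered (the task names, paths, and packages here are representative guesses, not our literal playbook), its tasks file ran along these lines:&lt;/p&gt;

```yaml
# roles/common/tasks/main.yml -- illustrative sketch
- name: Create the application directory with standard permissions
  file:
    path: /opt/app
    state: directory
    owner: dev
    mode: "0755"

- name: Install the CloudWatch agent for monitoring
  yum:
    name: amazon-cloudwatch-agent
    state: present
```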

&lt;p&gt;Additional roles layered on top represent each of our minimally-dependent services — &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;tasks&lt;/code&gt;, &lt;code&gt;cron&lt;/code&gt;, and so on—which we then assigned to the local host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# appservers.yml 
- hosts: all
  roles:
  - common
  - backend
- hosts: appservers
  roles:
  - api
- hosts: cron
  roles:
  - cron
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We couldn’t bring EC2 out of the cloud, but bringing up a local instance that quacked a &lt;em&gt;lot&lt;/em&gt; like EC2 was now as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ansible-playbook \
  --user=dev \
  --private-key ./local/ssh/id_rsa \
  --inventory local/inventory.yml \
  appservers.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;From pebbles to brickwork&lt;/h2&gt;

&lt;p&gt;With our infrastructure in &lt;code&gt;terraform&lt;/code&gt;, our deployment in &lt;code&gt;ansible&lt;/code&gt;, and all of the confidence that local testing could buy, we were ready to start making bricks. The plan (and there’s always a plan!) was straightforward enough:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;terraform apply&lt;/code&gt; to create a new &lt;code&gt;dev&lt;/code&gt; instance&lt;/li&gt;
&lt;li&gt;Add the new host to our &lt;code&gt;ansible&lt;/code&gt; inventory and provision it&lt;/li&gt;
&lt;li&gt;Add it to the &lt;code&gt;dev&lt;/code&gt; ELB and wait for it to join (assuming provisioning succeeded and health checks passed)&lt;/li&gt;
&lt;li&gt;Verify its behavior and make adjustments as needed&lt;/li&gt;
&lt;li&gt;Remove the old &lt;code&gt;dev&lt;/code&gt; instance (our pebble!) from &lt;code&gt;terraform&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Rinse and repeat in production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire process was more hands-on than anyone really wanted, but given the indeterminate state of our existing infrastructure and our guiding philosophy, step one was simply waving &lt;code&gt;dev&lt;/code&gt; out the door.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make it work, make it right.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Off it went! With only a little back and forth to sort out previously unnoticed details, our new &lt;code&gt;dev&lt;/code&gt; host took its place as brick #1 in Koan’s growing construction. We extracted the &lt;code&gt;dev&lt;/code&gt; configuration into a reusable &lt;code&gt;terraform&lt;/code&gt; module and by the end of the week our brickwork stretched all the way out to production.&lt;/p&gt;
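&lt;p&gt;To give a sense of the end state, reusing the extracted module for another environment looked roughly like this (the module path, variable names, and instance type are invented for the example):&lt;/p&gt;

```hcl
# Hypothetical invocation of the extracted appserver module
module "prod-appserver" {
  source        = "./modules/appserver"
  environment   = "prod"
  vpc_id        = var.prod_vpc_id
  instance_type = "m5.large"
}
```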

&lt;p&gt;In our next post, we'll dive deeper into how we imported volumes of undocumented infrastructure into Terraform.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Big thanks to &lt;a href="https://twitter.com/swac"&gt;Ashwin Bhat&lt;/a&gt; for early feedback, &lt;a href="https://twitter.com/randallagordon?lang=en"&gt;Randall Gordon&lt;/a&gt; and &lt;a href="https://twitter.com/andrewbeers?lang=en"&gt;Andy Beers&lt;/a&gt; for helping turn the &lt;a href="https://devops.stackexchange.com/questions/653/what-is-the-definition-of-cattle-not-pets"&gt;pets/cattle metaphor&lt;/a&gt; into something more humane, and &lt;a href="https://unsplash.com/@emardi?utm_source=dev-to&amp;amp;utm_medium=referral"&gt;EMAR DI&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt; for the cover image.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And if you’re into building software to help every team achieve its objectives, &lt;a href="https://www.koan.co/company/careers?utm_source=dev-to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=terraform-1"&gt;Koan is hiring&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>terraform</category>
      <category>ansible</category>
    </item>
  </channel>
</rss>
