<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joel Lutman</title>
    <description>The latest articles on DEV Community by Joel Lutman (@jhole89).</description>
    <link>https://dev.to/jhole89</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F380263%2F890031de-4431-4108-8eca-0218027ac141.jpeg</url>
      <title>DEV Community: Joel Lutman</title>
      <link>https://dev.to/jhole89</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jhole89"/>
    <language>en</language>
    <item>
      <title>Why you should embrace DevOps</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Mon, 22 Mar 2021 21:45:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/why-you-should-embrace-devops-4oal</link>
      <guid>https://dev.to/aws-builders/why-you-should-embrace-devops-4oal</guid>
      <description>&lt;p&gt;&lt;em&gt;A cautionary tale of DevOps negligence &amp;amp; why it should be at the heart of any project&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After recently finishing up a HUGE terraform refactor, I was left reflecting on why it’s essential to establish DevOps principles at the start of any large project, and on the horror stories that can happen when it’s not.&lt;/p&gt;

&lt;p&gt;Over the last decade, DevOps has gone from strength to strength and proven itself a core component of many success stories. Despite this, in my experience as a software and cloud consultant, it’s still frequently overlooked and ignored in favour of application development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xEA4sVSS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gtjwp64ck9ur61fx5s9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xEA4sVSS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gtjwp64ck9ur61fx5s9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m not talking about Waterfall architecture here, but rather some of the fundamental technical DevOps principles of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Integration&lt;/strong&gt;: establishing tools and process for continuously merging code back to a single code repository and single source of truth (e.g. git, subversion, mercurial, peer review)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Testing&lt;/strong&gt;: establishing tools and process for continuously testing code during all stages of the software development lifecycle (e.g. unit tests, integration tests, system tests, regression tests, chaos engineering)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Release Management&lt;/strong&gt;: establishing tools and process for packaging and deploying releases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;: infrastructure configuration, management, and infrastructure as code tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: performance monitoring and logs, end user experience&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuous Delivery&lt;/strong&gt;: automating the processes in software delivery&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many times I’ve seen these delayed or ignored until development reaches a critical mass and developers start facing the deficit. They get messy merge conflicts. They need to work out where their application will live. What database do they need to connect to (and how do they manage it)? How do they deploy a new version without impacting the customer experience? Then there’s the dreaded cliff edge of deploying to production when they’ve only ever played around on smaller, more lenient dev/test environments.&lt;/p&gt;

&lt;p&gt;I’ve seen this result in many &lt;em&gt;“DevOps initiatives”&lt;/em&gt;, either in the middle of the application development or towards the end to try and ease some of the burden.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“What’s wrong with this?”&lt;/em&gt; I hear you ask.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While it is possible to establish these processes during the development lifecycle, or even afterwards, the time and effort needed to establish these fundamentals increases exponentially. It can cause a great amount of refactoring that eats into time that could be better spent elsewhere.&lt;/p&gt;

&lt;p&gt;Let’s look at a real-life example to see how these principles can be easily overlooked, and the consequences of offsetting DevOps to a later stage. The example is loosely inspired by a project that I was brought on to rescue at a late stage, though largely exaggerated for comedic effect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case study — Project Whale
&lt;/h3&gt;

&lt;p&gt;Say we have a standard 3 tier application called project Whale, with the 3 tiers representing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the presentation/UI/frontend,&lt;/li&gt;
&lt;li&gt;the application/business-logic/backend,&lt;/li&gt;
&lt;li&gt;and the data/storage/database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The UI talks to the backend which in turn securely accesses the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FRGlWB_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r9ujal78d6q728vd49vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FRGlWB_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r9ujal78d6q728vd49vo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During development of the frontend, we need to communicate with the backend, and likewise for the backend to the database. The team decides to stand up a shared piece of infrastructure (a single EC2 instance) to run all three layers, which QA can also access to view features. Now they are able to continue their relevant development, whilst communicating with the necessary services.&lt;/p&gt;

&lt;p&gt;Fast forward a few months and they’re approaching delivery time — great!!!&lt;/p&gt;

&lt;p&gt;Everyone’s excited and the CEO’s about ready to pop a bottle of champagne. The developers need to run project Whale somewhere that’s accessible from the public internet. That EC2 they’ve been using for development seems like a great candidate — it’s already mostly set up, with all the packages and utilities configured by hand as and when the team realised they needed them.&lt;br&gt;
They push the latest versions of the frontend and backend, clean out the database, and project Whale goes live.&lt;/p&gt;

&lt;p&gt;As the champagne pours they start getting emails from customers about bugs in the frontend system. Small stuff, nothing that stops business, but things that could easily have been discovered if they’d invested in continuous testing, and it’s enough to require a new version of Whale to be deployed. Except now they’ve got customers on the live system, and deploying a new version means taking the old one offline while they switch over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RgIZIdq6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43vfmc3t6dr0e7l228l0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RgIZIdq6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/43vfmc3t6dr0e7l228l0.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They’re left with a couple of options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy anyway and annoy customers&lt;/li&gt;
&lt;li&gt;Schedule a deployment overnight when traffic is low and annoy the developers who have to pull an all-nighter&lt;/li&gt;
&lt;li&gt;Start up a fresh EC2 instance and turn off the old one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The latter is clearly the best option for reducing downtime and keeping people happy. However, upon doing this they realise all those little packages and utilities need to be set up and reconfigured again, and the last time anyone did this was 4 months ago. They documented it though, so even though it takes hours they’re able to get the new EC2 up with the new version of Whale deployed — this is where release management would really have been beneficial, right?&lt;/p&gt;

&lt;p&gt;Except now customers are complaining more than before, because they can’t log in and are being told that they don’t have an account — the database!&lt;/p&gt;

&lt;p&gt;Yup, they forgot about the data. While they were able to get the database running on the new EC2, they forgot about the customer account data that had been written to the old database on the old instance — this is a fresh database without any customer information.&lt;/p&gt;

&lt;h3&gt;
  
  
  BATTLESTATIONS!!!!
&lt;/h3&gt;

&lt;p&gt;As fires seem to spread they realise their mistake: they should have isolated the database from the rest of the stack, so they could deploy any number of EC2 instances running Whale and simply connect them to a persistent database. But none of them are database or networking experts, and they’re unsure how to expose the database connection across servers — this is where &lt;strong&gt;Infrastructure as Code&lt;/strong&gt; could have really helped.&lt;/p&gt;

&lt;p&gt;Let’s skip forward a few days; all the fires are now just smouldering pits. They managed to provision a managed database service in the form of AWS RDS and got some help with the networking. Whale is really gaining traction and customers love it; then overnight the user base explodes. They go from tens of customers to a few thousand and are featured on the front page of Reddit.&lt;/p&gt;

&lt;p&gt;Then it happens.&lt;/p&gt;

&lt;p&gt;Everything just slows down and stops. No alarms, no errors, just nothing. Customers are once again adrift, and causing one hell of a Twitter storm. What happened? There were no recent code changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wxC-1-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d6y08erh6aqlfr8dauco.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wxC-1-lK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d6y08erh6aqlfr8dauco.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that they gained more users than the single EC2 instance could handle. As more users joined and started using Whale, the backend started dumping logs at an increased rate filling up the EC2’s local file storage, taking both it and the frontend offline. Something that could have been easily avoided if they’d shipped logs off to a &lt;strong&gt;Monitoring&lt;/strong&gt; solution.&lt;/p&gt;
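&lt;p&gt;To make the failure mode concrete, here’s a toy back-of-the-envelope model (all numbers are illustrative, not from any real incident) of how quickly local-only logs can exhaust a single instance’s disk once usage spikes:&lt;/p&gt;

```python
# Toy model of the Whale outage (all numbers illustrative): logs
# accrue on a single instance's local disk until it fills up.

def days_until_disk_full(disk_gb, used_gb, users, log_mb_per_user_per_day):
    """Days of headroom left if logs stay on the local filesystem."""
    daily_gb = users * log_mb_per_user_per_day / 1024
    free_gb = disk_gb - used_gb
    if daily_gb == 0:
        return float("inf")
    return free_gb / daily_gb

# Tens of users: months of headroom, nobody notices the risk.
calm = days_until_disk_full(disk_gb=100, used_gb=20, users=10,
                            log_mb_per_user_per_day=50)

# Front page of Reddit: the same disk fills in well under a day.
spike = days_until_disk_full(disk_gb=100, used_gb=20, users=4000,
                             log_mb_per_user_per_day=50)
```

&lt;p&gt;With tens of users the disk holds out for months; with a few thousand it fills in hours, taking everything on the instance down with it. Shipping logs off-instance removes the local disk from the failure equation entirely.&lt;/p&gt;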

&lt;p&gt;Now they need to work out how to scale Whale over multiple EC2 instances at once, and how to get the logs off of the local EC2 file system to somewhere else. All while Whale is currently offline, and they still haven’t got an automated solution for deploying new versions without users experiencing some downtime. This would have been a trivial task if they had embraced &lt;strong&gt;continuous delivery&lt;/strong&gt; and used containers to run Whale’s individual layers on ECS or Fargate — where shipping logs and autoscaling are given out of the box.&lt;/p&gt;

&lt;p&gt;So now the team is left trying to split the frontend and backend, and refactor both applications to run in containers. Then on the infrastructure side they need to set up and configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS/Fargate Clusters to run the containers,&lt;/li&gt;
&lt;li&gt;Log routing and monitoring to CloudWatch,&lt;/li&gt;
&lt;li&gt;Auto scaling for their customer demands,&lt;/li&gt;
&lt;li&gt;Networking to their RDS instance,&lt;/li&gt;
&lt;li&gt;Deployment strategies for rolling out new versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make matters worse, they’re still running on a single environment; so either they’d have to make all of these changes on their “production” environment, or spin up a separate, isolated development environment. And to ensure both environments are identical, they’d really have to invest in &lt;strong&gt;Infrastructure as Code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hgm7Vpa5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmkpgoi0x3v5p3wiphv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hgm7Vpa5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmkpgoi0x3v5p3wiphv9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a HUGE amount of technical debt for a small team which could easily stop any future development or bug fixes for 6 months or more — something that could easily sink the project, team, and potentially the company.&lt;/p&gt;

&lt;h3&gt;
  
  
  How DevOps principles could have helped
&lt;/h3&gt;

&lt;p&gt;Now, as I said, this isn’t about Waterfall design, but about involving DevOps principles from the start. By applying them from the beginning, the team could have avoided most of these situations.&lt;/p&gt;

&lt;p&gt;By embracing &lt;strong&gt;Continuous Integration&lt;/strong&gt; they could have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensured that any code integrated back to their main branch was fully tested prior to deployment; avoiding small bugs interfering with the customer experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By embracing &lt;strong&gt;Continuous Testing&lt;/strong&gt; they could have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created isolated Development, QA, and Production environments. This would have ensured that development versions of Whale could be deployed and acceptance tested, rather than pushing untested versions onto their live customer facing Production environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By embracing &lt;strong&gt;Infrastructure&lt;/strong&gt; they could have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spun up a managed database, such as RDS, instead of running their own on the EC2 server; avoiding losing valuable customer data.&lt;/li&gt;
&lt;li&gt;Leveraged container autoscaling to easily scale horizontally, instead of running everything on a single server; avoiding outages due to increased customer usage.&lt;/li&gt;
&lt;li&gt;Codified their cloud estate using IaC (Infrastructure as Code) so that they could easily provision multiple dev/qa/prod environments, rather than trying to repeat months old work to manually configure a second environment.&lt;/li&gt;
&lt;/ul&gt;
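&lt;p&gt;The core IaC idea can be sketched in plain Python (names and structure here are hypothetical; a real project would use Terraform, CloudFormation, or the CDK): define the stack once as code, then stamp out identical environments from parameters alone.&lt;/p&gt;

```python
# Hypothetical sketch of the IaC idea in plain Python. Real projects
# would use Terraform, CloudFormation, or the CDK; all names here are
# made up for illustration.

def whale_environment(name, app_instances):
    """Describe one complete, self-contained stack for an environment."""
    return {
        "name": name,
        "frontend": {"service": "ecs", "instances": app_instances},
        "backend": {"service": "ecs", "instances": app_instances},
        "database": {"service": "rds", "engine": "postgres"},
    }

# dev, qa, and prod differ only in declared parameters, never in
# hand-applied configuration drift.
environments = [
    whale_environment("dev", app_instances=1),
    whale_environment("qa", app_instances=1),
    whale_environment("prod", app_instances=3),
]
```

&lt;p&gt;Because every environment comes from the same definition, provisioning a second (or tenth) one is a parameter change rather than months-old manual work.&lt;/p&gt;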

&lt;p&gt;By embracing &lt;strong&gt;Monitoring&lt;/strong&gt; they could have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shipped logs to persistent storage such as CloudWatch, instead of leaving them on the single server; avoiding filling up the local application storage and taking Whale offline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By embracing both &lt;strong&gt;Release Management&lt;/strong&gt; and &lt;strong&gt;Continuous Delivery&lt;/strong&gt; they could have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eased the release and roll out of the different layers in isolation; avoiding the lengthy rewrite to separate the backend and frontend layers, and enabling each layer to be developed at its own cadence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these problems could have been avoided if they had merely considered the wider picture outside of their own application code and embraced these principles from the beginning, when the cost and burden would have been mere hours rather than months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;To me this is one of the biggest strengths of DevOps: by embracing its principles we start thinking at a larger scale and building applications with a more holistic view. DevOps doesn’t require teams of DevOps engineers setting up barriers or lengthy processes; it just requires individuals and teams to embrace those principles to build better, more scalable, and more robust systems.&lt;/p&gt;

&lt;p&gt;Now, of course that’s not all DevOps is. It also involves building a no-blame culture, doing technical postmortems, and involving the wider business to understand that code doesn’t stop at the application level. However, I really wanted to highlight the impact that ignoring DevOps can have on development and projects.&lt;/p&gt;

&lt;p&gt;It’s not all doom and gloom though: we can still bring DevOps into an existing project, but the technical burden increases exponentially and requires significant investment in both time and resources.&lt;/p&gt;

&lt;p&gt;So the next time you’re thinking of starting a new project, product, or initiative; don’t put off applying these principles. DevOps and application development are two pieces of the same puzzle. They need to be done together, in tandem, rather than ignored until a later date, because by that later date, your product could already be dead in the water.&lt;/p&gt;

&lt;p&gt;Do you have any experiences of where DevOps was applied a little too late? If you are interested in learning more about DevOps, please check out my other blogs on my website, &lt;a href="https://manta-innovations.co.uk/blog"&gt;Manta Innovations&lt;/a&gt; and reach out to me on Twitter @ &lt;a href="https://twitter.com/JoelLutman"&gt;Joel Lutman&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cpW1pMP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2lsxxbhn230f537u0cm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cpW1pMP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f2lsxxbhn230f537u0cm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>architecture</category>
      <category>cloud</category>
    </item>
    <item>
      <title>A Deep Dive into Amazon Timestream</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Mon, 23 Nov 2020 19:00:19 +0000</pubDate>
      <link>https://dev.to/aws-builders/a-deep-dive-into-amazon-timestream-47gp</link>
      <guid>https://dev.to/aws-builders/a-deep-dive-into-amazon-timestream-47gp</guid>
      <description>&lt;p&gt;Amazon Timestream is AWS’s newest addition to their storage offerings. It’s a fast, scalable, and serverless time series database; something in my experience both the community and businesses have been clamoring for.&lt;/p&gt;

&lt;p&gt;Recently I spent an afternoon testing out Timestream and I thought I’d share what I learned during that time, and my initial impressions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a time series database?
&lt;/h3&gt;

&lt;p&gt;A time series database is a system optimized for storing and serving time series data: a sequence of data points recorded over an interval. While time series data can be stored in a traditional relational database, relational systems often run into scaling issues with this workload.&lt;/p&gt;

&lt;p&gt;Typical time series use cases include any type of data where we repeatedly measure values or metrics at regular intervals, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT data (e.g. weather readings, device statuses)&lt;/li&gt;
&lt;li&gt;DevOps analytics (e.g. CPU utilisation, memory allocation, network transmission)&lt;/li&gt;
&lt;li&gt;App analytics (e.g. clickstream data, page load times, healthchecks, response times)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Timestream targets these use cases - in fact they even provide some sample IoT and DevOps data to play around with, which is exactly what I did.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure Serverless Infrastructure &amp;lt;3
&lt;/h3&gt;

&lt;p&gt;Setting up Timestream is incredibly easy.&lt;/p&gt;

&lt;p&gt;As a completely serverless offering, there is little to configure and no sizing or throughput settings to worry about. Additionally, being serverless it follows a rolling release schedule, meaning you are able to take advantage of new features as they become available, rather than worry about version upgrades. And in line with other AWS managed solutions, you simply pay for usage rather than for the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsq6wwkavcgghfp7wxpz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsq6wwkavcgghfp7wxpz3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the few settings you specify is the encryption key. Timestream enforces data encryption and thankfully this setting cannot be turned off. Your options here allow you to specify how your data is encrypted (both at rest and in flight) using a CMK stored in AWS KMS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent data storage
&lt;/h3&gt;

&lt;p&gt;The other main setting is how long your data lasts in each of Timestream’s storage options. &lt;/p&gt;

&lt;p&gt;Timestream currently has two types of storage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A write optimized memory store; where data initially lands and is automatically deduplicated - I’ll talk about this more in a second.&lt;/li&gt;
&lt;li&gt;A read optimized magnetic store; for cost effective long term storage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F71y77o3uknpvkhaaccff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F71y77o3uknpvkhaaccff.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When setting up a Timestream table you’re required to set a retention policy to specify how long data should exist in each store before moving onto the next (from memory, to magnetic, to deletion), with the minimum values being 1hr for the memory store (up to a maximum 1 year) and 1 day for the magnetic store (up to a maximum 200 years).&lt;/p&gt;
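&lt;p&gt;As a rough mental model (the policy values below are examples I picked, not service defaults), the retention policy routes each record by its age:&lt;/p&gt;

```python
from datetime import timedelta

# Sketch of how a retention policy routes a record by age. The policy
# values below are examples I picked, not service defaults.

def store_for(record_age, memory_retention, magnetic_retention):
    """Return which store holds a record of the given age."""
    if record_age > memory_retention + magnetic_retention:
        return "deleted"
    if record_age > memory_retention:
        return "magnetic"
    return "memory"

policy = {"memory_retention": timedelta(hours=12),
          "magnetic_retention": timedelta(days=365)}

current_store = store_for(timedelta(days=30), **policy)  # "magnetic"
```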

&lt;h3&gt;
  
  
  Never worry about duplicate records again
&lt;/h3&gt;

&lt;p&gt;I briefly mentioned data duplication and I want to focus on that a bit more. Data duplication is a big problem in traditional relational databases; large CRM systems often find themselves with multiple entries for identical data points if uniqueness is not enforced by the schema.&lt;/p&gt;

&lt;p&gt;Timestream deals with this in an interesting way: if an identical record is received, the write optimized memory store deduplicates it into a single record. This uses a “first writer wins” approach, so whichever record arrives first is kept, with the duplicate being thrown away.&lt;/p&gt;
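&lt;p&gt;Here’s a minimal sketch of the idea in plain Python (my own model, not the actual service logic): a record is only a duplicate if every field matches one already accepted, and the first arrival wins.&lt;/p&gt;

```python
# Minimal model of "first writer wins" deduplication: a record is a
# duplicate only if every field matches one already accepted.

def ingest(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    accepted = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            accepted.append(record)
    return accepted

writes = [
    {"time": "2020-11-23T19:00Z", "host": "web-1", "cpu": 41.0},
    {"time": "2020-11-23T19:00Z", "host": "web-1", "cpu": 41.0},  # exact duplicate
    {"time": "2020-11-23T19:00Z", "host": "web-1", "cpu": 43.0},  # differs, so kept
]

stored = ingest(writes)  # first and third records survive
```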

&lt;p&gt;As far as I’ve been able to test, duplicate records must be 100% identical; however, I would love to see an option in the future to tweak this down to a lower similarity threshold (e.g. two records being treated as duplicates if they are 90% similar).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Timestream data model
&lt;/h3&gt;

&lt;p&gt;As a NoSQL database, Timestream has its own data model, distinct from both traditional SQL data models and many other NoSQL data models. Timestream is considered a schema-less database as there is no enforced schema.&lt;/p&gt;

&lt;p&gt;However, it still uses familiar concepts such as databases and tables, along with Timestream-specific ones, so let’s define them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: a collection of tables;&lt;br&gt;
&lt;strong&gt;Table&lt;/strong&gt;: an encrypted container that holds our time series records;&lt;br&gt;
&lt;strong&gt;Record&lt;/strong&gt;: a combination of a timestamp, one or more dimensions, and a single measure;&lt;br&gt;
&lt;strong&gt;Dimensions&lt;/strong&gt;: attributes that describe the metadata of a record (e.g. region, AZ, VPC, or hostname for DevOps metric data) - always stored as varchar;&lt;br&gt;
&lt;strong&gt;Measure&lt;/strong&gt;: the single named data value representing the measurement (e.g. CPU usage or memory allocation for DevOps metric data) - can be boolean, bigint, varchar, or double.&lt;/p&gt;
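&lt;p&gt;Putting those definitions together, a record can be modelled roughly like this (a sketch built from the definitions above, not an official SDK type):&lt;/p&gt;

```python
from dataclasses import dataclass

# Rough shape of a single record, built from the definitions above.
# This is my own sketch, not an official SDK type.

@dataclass
class Record:
    time: str             # timestamp of the measurement
    dimensions: dict      # descriptive metadata, always varchar in Timestream
    measure_name: str     # what was measured
    measure_value: float  # boolean, bigint, varchar, or double in the service

reading = Record(
    time="2020-11-23 19:00:19",
    dimensions={"region": "eu-west-1", "az": "eu-west-1a", "hostname": "web-1"},
    measure_name="cpu_utilization",
    measure_value=72.5,
)
```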

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fef77fmxu7sxpfmd9n5qn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fef77fmxu7sxpfmd9n5qn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpj3yz12boyl77s2466kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpj3yz12boyl77s2466kp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Timestream UI presents this model in a familiar column-wise structure; however, due to the data model it doesn’t support all the standard CRUD operations you might expect. While records can be created and read back, they cannot be updated or deleted. Instead, records are only removed when they reach the retention limit on the magnetic store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema-less SQL on steroids
&lt;/h3&gt;

&lt;p&gt;Despite Timestream being a schema-less NoSQL database, as mentioned it presents its data model in a column-wise structure that anyone familiar with SQL will feel at home with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fungly6fq9sa4zral8sia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fungly6fq9sa4zral8sia.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Timestream enables data to be queried using standard SQL (supporting CTEs, filtering, and aggregations), with numerous scalar and aggregate functions, plus time series interpolation for data points that may be missing or lost in transmission. This means you can easily group data into chunks of time and compute aggregates, even when certain points in time are missing data. The one limitation is that while Timestream does support joins, a table can only be joined back to itself, though this makes sense when you remember that tables are schema-less.&lt;/p&gt;
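&lt;p&gt;To illustrate the binning-plus-interpolation idea outside of SQL, here’s a small Python sketch (my own approximation, not Timestream’s implementation) that groups points into fixed time buckets, averages each bucket, and carries the last value forward across empty buckets:&lt;/p&gt;

```python
# My own approximation of time binning plus interpolation in plain
# Python: bucket points into fixed windows, average each bucket, and
# carry the last average forward across empty buckets.

def binned_average(points, bucket_seconds):
    """points: (epoch_seconds, value) pairs. Returns one averaged
    (bucket_start, value) pair per bucket, gaps filled from the left."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts // bucket_seconds, []).append(value)
    start, end = min(buckets), max(buckets)
    result, last = [], None
    for b in range(start, end + 1):
        if b in buckets:
            last = sum(buckets[b]) / len(buckets[b])
        result.append((b * bucket_seconds, last))  # empty bucket reuses last
    return result

# Three readings with a missing minute in the middle.
points = [(0, 10.0), (30, 20.0), (130, 40.0)]
series = binned_average(points, bucket_seconds=60)
```

&lt;p&gt;The middle bucket has no readings, so it inherits the previous bucket’s average rather than leaving a hole in the series.&lt;/p&gt;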

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feekvmuafqwlf6df4hdjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feekvmuafqwlf6df4hdjk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrations
&lt;/h3&gt;

&lt;p&gt;Whilst having a handy SQL interface is great, for many this is not the best way to present data to users and stakeholders, especially when trying to highlight trends or patterns over time. Thankfully, Timestream comes with a number of built-in integrations, both within the AWS ecosystem and with third party tools.&lt;/p&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dashboards and charts via Amazon QuickSight or Grafana&lt;/li&gt;
&lt;li&gt;Data ingestion via the AWS SDK and CLI, from AWS IoT via IoT rules, from Kinesis Data Analytics streams, or from Telegraf&lt;/li&gt;
&lt;li&gt;Connecting traditional SQL workbench tools over JDBC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fg5cduczf1gw2ngfg68mk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fg5cduczf1gw2ngfg68mk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Closing thoughts
&lt;/h3&gt;

&lt;p&gt;Overall, playing around with Timestream was very interesting. I think it’s a powerful service that further rounds out AWS’s storage offerings, and comes with some exciting features that are specific to Timestream. As mentioned, I was impressed by the deduplication of data, and I’d be keen to see this developed further, or rolled out as a configurable option for other storage services - I think AWS could really be onto something with this feature. On top of that, having it be both schema-less and offer an SQL interface is a nice middle ground for those not entirely sold on NoSQL data models.&lt;/p&gt;

&lt;p&gt;There’s a lot to like with Timestream, and I think that it could potentially be a good fit for lots of use cases. While Amazon mentions use cases such as DevOps metrics and IoT data; I think it could also have great potential for clickstream, stock market, currency, and asset management data - really anything where we want to be taking repeated measurements over time. &lt;/p&gt;

&lt;h3&gt;
  
  
  What do you think?
&lt;/h3&gt;

&lt;p&gt;I’m sure there are many more use cases than the ones I mentioned above, so let me know which you can think of, or perhaps are already using Timestream for.&lt;/p&gt;

&lt;p&gt;I’d also be interested to hear about how well Timestream scales for large datasets - it wasn’t something I was able to test that rigorously during my couple of hours with it. So any insight into performance and scaling would be great.&lt;/p&gt;

&lt;p&gt;For more tech insight follow me on Twitter at &lt;a href="https://twitter.com/joellutman" rel="noopener noreferrer"&gt;@JoelLutman&lt;/a&gt;; where I tweet and blog about AWS, serverless, big data, and software best practice.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>database</category>
    </item>
    <item>
      <title>My Top 10 AWS Services</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Tue, 13 Oct 2020 16:13:58 +0000</pubDate>
      <link>https://dev.to/aws-builders/my-top-10-aws-services-bp7</link>
      <guid>https://dev.to/aws-builders/my-top-10-aws-services-bp7</guid>
      <description>&lt;p&gt;AWS is huge. With its multitude of services and continuous updates AWS is a playground for developers, but the sheer scale of it can be overwhelming for newcomers. &lt;/p&gt;

&lt;p&gt;This is why I have put together a handy guide to my top 10 AWS services - the ones I think all AWS developers should know, regardless of whether you work on big data, machine learning, web apps, IoT, or networking, because you’ll likely need to interact with them at some point.&lt;/p&gt;

&lt;p&gt;In no particular order, here are my top 10: &lt;/p&gt;

&lt;h2&gt;
  
  
  EC2
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalable servers in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wdHx4uM4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/dr8z7ag8unnka1j2kwj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wdHx4uM4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/dr8z7ag8unnka1j2kwj7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ok, let’s get the big one done first. &lt;/p&gt;

&lt;p&gt;AWS EC2 is the backbone of AWS. It was one of the first services launched back in 2006, taking the traditional concept of a data centre and letting you spin servers up and down with zero commitment at the click of a button. You can think of EC2 as a blank canvas on which you can install, configure, and run anything you want - even a Minecraft server. Additionally, you can launch preconfigured snapshots, called AMIs, from the marketplace if you don’t want to install something yourself.&lt;/p&gt;

&lt;p&gt;On top of that, EC2 forms a huge part of many of the AWS Certification exams...so learn EC2.&lt;/p&gt;
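&lt;p&gt;As a rough sketch of what that “click of a button” looks like in code, here’s how launching an instance might look with boto3 (the AWS SDK for Python). The AMI ID and key pair name below are placeholders, and the actual API call is commented out since it needs AWS credentials:&lt;/p&gt;

```python
# Parameters for launching a single small EC2 instance.
# The AMI ID and key pair name are placeholders - substitute your own.
launch_params = {
    "ImageId": "ami-0123456789abcdef0",  # a marketplace or custom AMI snapshot
    "InstanceType": "t3.micro",
    "MinCount": 1,
    "MaxCount": 1,
    "KeyName": "my-key-pair",            # key pair for SSH access
}

# Requires AWS credentials to actually run:
# import boto3
# ec2 = boto3.client("ec2")
# response = ec2.run_instances(**launch_params)
# instance_id = response["Instances"][0]["InstanceId"]
```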

&lt;h2&gt;
  
  
  ECS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scalable serverless container orchestration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qPveonJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z38bsj68tndgz9rl7lkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qPveonJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z38bsj68tndgz9rl7lkc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Along with EC2, ECS is the other main way of running custom applications in AWS. &lt;/p&gt;

&lt;p&gt;It’s a managed (and optionally completely serverless) container orchestration service. This means that instead of having to worry about the underlying hardware your app runs on, you just have to ensure that your app can run inside a Docker container. &lt;/p&gt;

&lt;p&gt;For a developer, this means an app can be easily ported to other cloud providers. Security-wise, it means no patching of the host OS, and financially, it means you only pay for the compute you require, rather than paying for an entire server as with EC2.&lt;/p&gt;

&lt;p&gt;For application developers, knowing ECS is a must - personally, it’s my go-to for custom applications. &lt;/p&gt;
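&lt;p&gt;To make the “just bring a container” idea concrete, here’s a hedged sketch of a minimal serverless (Fargate) task definition - the family name and image are hypothetical, and with boto3 you would pass this dict to &lt;code&gt;ecs.register_task_definition&lt;/code&gt;:&lt;/p&gt;

```python
# A minimal Fargate task definition: you describe the container,
# and ECS worries about the hardware underneath.
task_definition = {
    "family": "my-web-app",                  # placeholder task family name
    "requiresCompatibilities": ["FARGATE"],  # serverless launch type
    "networkMode": "awsvpc",                 # required for Fargate
    "cpu": "256",                            # 0.25 vCPU - pay only for what you need
    "memory": "512",                         # MiB
    "containerDefinitions": [
        {
            "name": "web",
            "image": "my-registry/my-web-app:latest",  # any docker image
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "essential": True,
        }
    ],
}

# Requires AWS credentials to actually run:
# import boto3
# boto3.client("ecs").register_task_definition(**task_definition)
```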

&lt;h2&gt;
  
  
  RDS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managed Relational Databases&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N7PW8GV1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u04uun4tq39oc8brgpup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N7PW8GV1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u04uun4tq39oc8brgpup.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whether you prefer SQL or NoSQL, the reality is that relational databases are a massive industry in themselves. Many complex applications will require a relational database of some sort, and RDS is the easiest way to run one on AWS.&lt;/p&gt;

&lt;p&gt;RDS takes away the pain of managing a relational DB yourself (along with the overhead of running a server to host it on) and supports numerous database engines, including Oracle, MSSQL, MariaDB, MySQL, and of course the only real choice: PostgreSQL.&lt;/p&gt;

&lt;p&gt;Having been around for a long time, RDS is another service that makes an appearance in AWS certifications, so make sure you spend some time understanding topics such as read replicas and restoring from snapshots.&lt;/p&gt;
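&lt;p&gt;As a quick illustration of the read-replica topic, here’s a hedged boto3 sketch - the instance identifiers are placeholders, and the call itself is commented out as it needs credentials and an existing primary instance:&lt;/p&gt;

```python
# Parameters for creating a read replica from an existing RDS instance.
# Both identifiers are placeholders.
replica_params = {
    "DBInstanceIdentifier": "mydb-replica-1",      # name for the new replica
    "SourceDBInstanceIdentifier": "mydb-primary",  # the existing primary instance
    "DBInstanceClass": "db.t3.micro",
}

# Requires AWS credentials and a real primary instance to actually run:
# import boto3
# boto3.client("rds").create_db_instance_read_replica(**replica_params)
```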

&lt;h2&gt;
  
  
  DynamoDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managed key-value and document database&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6K8UgtFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8pdrlqii5kltyvtjeyku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6K8UgtFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8pdrlqii5kltyvtjeyku.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So we’ve talked about SQL in the form of RDS; now let’s talk about NoSQL. DynamoDB is a managed serverless key-value store, meaning once again you don’t need to worry about any underlying infrastructure, scaling, or maintenance. &lt;/p&gt;

&lt;p&gt;What makes DynamoDB unique is that rather than paying for the provisioned size of your database, you instead pay for the throughput required (how many reads/writes per second you require - which can be scaled up or down manually or on-demand) and the storage used. &lt;/p&gt;

&lt;p&gt;DynamoDB is schema-less, fast, resilient, and a great fit for any use case that wants a flat database hierarchy - it's your default NoSQL storage for AWS.&lt;/p&gt;

&lt;p&gt;As with some of the others, it’s been around a long time and frequently pops up in AWS certifications, though not to the same degree as RDS, thanks to its lower complexity.&lt;/p&gt;
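&lt;p&gt;To show how little ceremony the schema-less model involves, here’s a hedged sketch of writing and reading an item with boto3’s DynamoDB resource API - the table name and attributes are hypothetical, and the network calls are commented out:&lt;/p&gt;

```python
# A DynamoDB item is just a dict - schema-less, so any attributes go.
# Only the key attribute ("user_id" here, a hypothetical name) is fixed
# by the table definition.
item = {"user_id": "u-123", "plan": "pro", "logins": 42}

# Requires AWS credentials and an existing table to actually run:
# import boto3
# table = boto3.resource("dynamodb").Table("users")
# table.put_item(Item=item)
# fetched = table.get_item(Key={"user_id": "u-123"})["Item"]
```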

&lt;h2&gt;
  
  
  S3
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Simple, scalable, resilient object storage&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K2VQ-4LN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ualwer7eiq7q6osp4h7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K2VQ-4LN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ualwer7eiq7q6osp4h7d.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 is one of AWS’s simplest services. &lt;/p&gt;

&lt;p&gt;It is simply object storage, where you can store files in much the same way you would on a traditional file system. What makes S3 important is that despite its simplicity it is incredibly flexible, and it is used by many other AWS services as intermediary storage. Building a data lake? Use S3 for data storage. Using AWS CodePipeline? It uses S3 to store build artefacts. Querying data with AWS Athena? It uses S3 to store query results.&lt;/p&gt;

&lt;p&gt;On top of all of this, S3 also has a tiered pricing structure, where you only pay for storage that you use, but that cost depends on how quickly and frequently you need to access your data. &lt;/p&gt;

&lt;p&gt;With all of this, it's no surprise that S3 also comes up frequently across all AWS certifications...learn S3, because you’ll definitely be using it.&lt;/p&gt;
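&lt;p&gt;The tiered pricing is just a parameter on the upload. Here’s a hedged boto3 sketch - the bucket name and key are placeholders, and the call is commented out since it needs credentials:&lt;/p&gt;

```python
# Uploading an object and picking a cheaper tier for data you
# read infrequently. Bucket and key are placeholders.
upload_params = {
    "Bucket": "my-example-bucket",
    "Key": "reports/2020/october.csv",
    "Body": b"col1,col2\n1,2\n",
    "StorageClass": "STANDARD_IA",  # infrequent access: cheaper per GB, retrieval fee applies
}

# Requires AWS credentials to actually run:
# import boto3
# boto3.client("s3").put_object(**upload_params)
```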

&lt;h2&gt;
  
  
  VPC
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Isolated virtual network for AWS resources&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l_Klrbwc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xa2psiu2qx6io1lbr0rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l_Klrbwc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xa2psiu2qx6io1lbr0rb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;VPC lets you provision a logically isolated section of the AWS Cloud in which you can launch AWS resources. &lt;/p&gt;

&lt;p&gt;Want to run an EC2 instance - you’ll need a VPC. Want to run an ECS cluster - you’ll need a VPC. Hosting a web application - you’ll need a VPC.&lt;/p&gt;

&lt;p&gt;VPC is an essential requirement for many AWS resources and encompasses everything from subnets and network gateways through to route tables and NACLs. &lt;/p&gt;

&lt;p&gt;Due to its complexity it forms a large part of many AWS certifications and it is a must-know for anyone wishing to deploy to AWS.&lt;/p&gt;
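&lt;p&gt;As a small sketch of the pieces involved, here’s the shape of a basic VPC layout - the CIDR ranges are purely illustrative, and with boto3 you would provision them via &lt;code&gt;ec2.create_vpc&lt;/code&gt; and &lt;code&gt;ec2.create_subnet&lt;/code&gt;:&lt;/p&gt;

```python
# A typical small VPC: one address space, carved into public and
# private subnets. All CIDR ranges here are illustrative.
vpc_cidr = "10.0.0.0/16"
subnets = {
    "public-a": "10.0.1.0/24",   # e.g. for a load balancer / NAT gateway
    "private-a": "10.0.2.0/24",  # e.g. for application servers and databases
}

# Requires AWS credentials to actually run:
# import boto3
# ec2 = boto3.client("ec2")
# vpc_id = ec2.create_vpc(CidrBlock=vpc_cidr)["Vpc"]["VpcId"]
# for name, cidr in subnets.items():
#     ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr)
```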

&lt;h2&gt;
  
  
  Lambda
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run code without needing servers&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0M9Qvfbd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/plvib7uonnn47d465hqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0M9Qvfbd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/plvib7uonnn47d465hqm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While EC2 and ECS are great for running continuous processes or applications, what about when you just want to run a small script either on a schedule or in response to an event? &lt;/p&gt;

&lt;p&gt;This is where Lambda comes into play. &lt;/p&gt;

&lt;p&gt;Rather than having to run an oversized server and orchestration tool, Lambda provides a serverless platform to orchestrate and run small scripts, as long as they complete within 15 minutes. Need a script to run in response to data being uploaded to S3? Use Lambda. Need a script to run every other morning at 10am? Use Lambda. &lt;/p&gt;

&lt;p&gt;Lambda is your go-to for event-driven processing and script execution.&lt;/p&gt;
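&lt;p&gt;A Lambda is just a function that receives an event. Here’s a minimal sketch of a handler for the S3-upload case mentioned above - the event shape follows the S3 put-notification structure, and the sample event is made up for local testing:&lt;/p&gt;

```python
def handler(event, context):
    """Collect the (bucket, key) of each object in an S3 put event."""
    uploads = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        uploads.append((bucket, key))
    return uploads

# Simulate the event locally (hypothetical bucket and key):
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "data/file.csv"}}}
    ]
}
result = handler(sample_event, None)
```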

&lt;h2&gt;
  
  
  KMS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Secure data encryption and key management&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tt7knzp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4zzpsdige8b7wuphi7r6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tt7knzp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4zzpsdige8b7wuphi7r6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security is important, even more so in the cloud, where an incorrect setting can expose your resources to the rest of the world. KMS secures data and secrets in the cloud. &lt;/p&gt;

&lt;p&gt;Storing data on S3? Use a KMS key to encrypt it. Storing a confidential key in Secrets Manager - you need to use KMS for this.&lt;/p&gt;

&lt;p&gt;Using KMS is crucial for building secure AWS native solutions.&lt;/p&gt;
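&lt;p&gt;For a flavour of the API, here’s a hedged sketch of encrypting a small secret - the key alias is a placeholder, and the call is commented out since it needs credentials and a real key:&lt;/p&gt;

```python
# Encrypting a small value directly with a KMS key. KMS encrypts up to
# 4 KB of data this way; larger payloads use envelope encryption
# (generate a data key, encrypt locally). The key alias is a placeholder.
encrypt_params = {
    "KeyId": "alias/my-app-key",
    "Plaintext": b"super-secret-value",
}

# Requires AWS credentials and an existing key to actually run:
# import boto3
# ciphertext = boto3.client("kms").encrypt(**encrypt_params)["CiphertextBlob"]
```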

&lt;h2&gt;
  
  
  CloudWatch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logs, monitoring, and insights for resources&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K0sf3KIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1kzl9o3q2a7gltccq737.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K0sf3KIO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1kzl9o3q2a7gltccq737.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we’ve got resources and applications running in the cloud, we need to be able to observe them and access their logs. If something goes down, we need to know exactly what happened. &lt;/p&gt;

&lt;p&gt;This is where CloudWatch comes into play. &lt;/p&gt;

&lt;p&gt;With CloudWatch we can gather logs from both managed services and our own applications running on ECS and EC2. We can also use CloudWatch for event processing and for scheduling Lambda invocations. &lt;/p&gt;

&lt;p&gt;So whether you’re deploying a service to AWS or building event-driven architecture, CloudWatch is crucial.&lt;/p&gt;
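&lt;p&gt;Beyond the metrics AWS emits for you, applications can publish their own. Here’s a hedged sketch - the namespace and metric name are hypothetical, and the call is commented out since it needs credentials:&lt;/p&gt;

```python
# Publishing a custom application metric to CloudWatch.
# Namespace and metric name are hypothetical.
metric = {
    "Namespace": "MyApp",
    "MetricData": [
        {"MetricName": "OrdersProcessed", "Value": 17, "Unit": "Count"}
    ],
}

# Requires AWS credentials to actually run:
# import boto3
# boto3.client("cloudwatch").put_metric_data(**metric)
```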

&lt;h2&gt;
  
  
  AWS IAM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User and permissions management&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wTHxtqTM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/bj5o1ijyfdyn4kn04lxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wTHxtqTM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/bj5o1ijyfdyn4kn04lxe.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is it important?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you even start deploying a service to AWS you need to be thinking about IAM. IAM is how we assign privileges to both users and roles. &lt;/p&gt;

&lt;p&gt;So if you’re designing a service that requires access to a private S3 bucket, you’ll need to use IAM to assign S3 read access to the role your service is using. Learning IAM permissions is invaluable for application developers and security engineers alike. &lt;/p&gt;

&lt;p&gt;IAM is also another service that comes up frequently in AWS certifications, so it’s worth familiarising yourself with the most common policies and permissions.&lt;/p&gt;
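&lt;p&gt;For the S3 read-access example above, the policy document attached to the role would look roughly like this - the bucket name is a placeholder:&lt;/p&gt;

```python
# An IAM policy document granting read access to one private bucket.
# The bucket name is a placeholder. Note the two Resource ARNs:
# ListBucket applies to the bucket, GetObject to the objects inside it.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-private-bucket",    # the bucket itself
                "arn:aws:s3:::my-private-bucket/*",  # objects within it
            ],
        }
    ],
}
```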

&lt;h3&gt;
  
  
  Let me know what you think
&lt;/h3&gt;

&lt;p&gt;Thanks for taking the time to read this guide - I hope it helps! As mentioned these are my own personal views, and the services are not ranked in any particular order.&lt;/p&gt;

&lt;p&gt;If there is an AWS service that you swear by that hasn’t featured in this top 10 list, or you have any questions about these services, I would love to hear from you. &lt;/p&gt;

&lt;p&gt;For more blogs and tech insight, follow me on Twitter at &lt;a href="https://twitter.com/joellutman"&gt;@JoelLutman&lt;/a&gt;, where I write about AWS, cloud computing, serverless, and software development.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>architecture</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Set up a virtual call centre in 30 minutes with Amazon Connect</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Mon, 20 Jul 2020 17:12:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/set-up-a-virtual-call-centre-in-30-minutes-with-amazon-connect-5cni</link>
      <guid>https://dev.to/aws-builders/set-up-a-virtual-call-centre-in-30-minutes-with-amazon-connect-5cni</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a step by step guide on how to set up Amazon Connect in under 30 mins.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon Connect enables you to have your own virtual call centre, where agents can log in via a web portal and receive calls from clients using only a pair of headphones. &lt;/p&gt;

&lt;p&gt;If this is the first time you’ve heard of Amazon Connect, then I suggest you check out my recent &lt;a href="https://manta-innovations.co.uk/2020/06/30/Interested-in-a-virtual-call-centre-Try-AWS-Connect/" rel="noopener noreferrer"&gt;high level summary&lt;/a&gt; of it first.&lt;/p&gt;

&lt;p&gt;This demo does require you to already have an AWS account set up, ideally with admin level permissions to provision the required services. &lt;/p&gt;

&lt;p&gt;If you’ve got that, then log in to the AWS Console, head to the Amazon Connect page, and we’ll get started. If not, you will need to create an account &lt;a href="https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&amp;amp;client_id=signup" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. First you’re going to set up your identity and access management.
&lt;/h3&gt;

&lt;p&gt;If you want to manage your agents within Amazon Connect use the first option &lt;em&gt;“Store users with Amazon Connect”&lt;/em&gt;, and personalise the URL. &lt;/p&gt;

&lt;p&gt;If you already have Active Directory and wish to use it, you can pick one of the other two options: managing users via AWS AD, or via non-AWS AD with SAML, respectively. &lt;/p&gt;

&lt;p&gt;This stage will also provide you with the URL for your agents to login with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmdybv14vlfek37zy5828.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmdybv14vlfek37zy5828.png" alt="Alt Text" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Next you have the option to create an admin user.
&lt;/h3&gt;

&lt;p&gt;I suggest skipping this step for this walkthrough as you can use your IAM user instead, however you can use this opportunity to add other administrators here if you wish.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Next you’ll configure the telephony options for both inbound and outbound calls.
&lt;/h3&gt;

&lt;p&gt;I’ve selected both options here as I want to receive and make calls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqvc2wzo1apj0snb9jqpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqvc2wzo1apj0snb9jqpz.png" alt="Alt Text" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The last step of the initial setup is to configure your data storage, which will contain call and chat logs.
&lt;/h3&gt;

&lt;p&gt;By default Amazon Connect generates its own S3 buckets and KMS keys to use for secure data encryption, but you can set this to use pre-existing buckets and keys should you wish to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4edvgwj9j8icwjexxr9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4edvgwj9j8icwjexxr9d.png" alt="Alt Text" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Now that you've done the initial setup, you will be presented with a summary screen.
&lt;/h3&gt;

&lt;p&gt;Check through the options and if everything looks good, create the instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7ybcm358f4is7dudqwdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7ybcm358f4is7dudqwdg.png" alt="Alt Text" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Once your Amazon Connect instance has been created, you can log in to the dashboard and customise your virtual call centre.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcml27xjvizk0i3wythmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcml27xjvizk0i3wythmf.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  7. The first thing you need to do is claim a phone number to receive calls on.
&lt;/h3&gt;

&lt;p&gt;This can be from any country Amazon Connect supports, regardless of which region your instance is located in. &lt;/p&gt;

&lt;p&gt;I'm currently in Canada, so I chose a North American number, and opted for ‘Toll free’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb667fdndsll51axntxk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb667fdndsll51axntxk3.png" alt="Alt Text" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Next you will be presented with a screen advising you to claim the number.
&lt;/h3&gt;

&lt;p&gt;It advises you to dial the number; however, in my experience with Amazon Connect, configuration changes can take up to 15 minutes to be pushed out. &lt;/p&gt;

&lt;p&gt;If you were to call at this stage you might not get through, but that doesn’t stop you from continuing the setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84jnzg6f98ulbd4kkjae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84jnzg6f98ulbd4kkjae.png" alt="Alt Text" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Next you can set the hours of operation, which are when you expect agents to be able to take calls.
&lt;/h3&gt;

&lt;p&gt;You can have multiple hours of operation if you want to represent multiple groups or group remote teams by time zones. &lt;/p&gt;

&lt;p&gt;I set all of my operational hours in Pacific Standard Time, and extended the hours into the evening a little. &lt;/p&gt;

&lt;p&gt;If your call centre isn’t operational at the weekend, you can remove these from the operational hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff9atlqjlqsjj4qqd2mog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff9atlqjlqsjj4qqd2mog.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Following this, the next thing you need to do is set up queues.
&lt;/h3&gt;

&lt;p&gt;A queue here is not a waiting line, but rather a workflow queue that callers transit through. &lt;/p&gt;

&lt;p&gt;As with the hours of operation, you can have multiple queues per call centre, for different workflows, and callers can be transferred between queues in the same way you might traditionally transfer callers between departments.&lt;/p&gt;

&lt;p&gt;If your call centre requires more than one workflow, add additional queues with the &lt;em&gt;“add new queue”&lt;/em&gt; button. I created an additional queue called &lt;em&gt;“VanQueue”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz18f4heebmlcy1g11bw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz18f4heebmlcy1g11bw7.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Next, you will be given the option to create or upload your own prompts, which are audio files you may wish to playback to callers.
&lt;/h3&gt;

&lt;p&gt;I didn’t want to use any custom audio prompts, so I skipped this stage, but feel free to check them out, or add your own and apply them in your contact flow - speaking of which...&lt;/p&gt;

&lt;h3&gt;
  
  
  12. The next stage is the biggest and most complex bit - contact flows.
&lt;/h3&gt;

&lt;p&gt;This is how you design the flow that a customer may take - either a complete end-to-end flow, or a small flow which can be composed into a larger one. &lt;/p&gt;

&lt;p&gt;In this way you can use Software Engineering principles of composition and DRY (Don't Repeat Yourself) to create reusable flow elements. &lt;/p&gt;

&lt;p&gt;As an example, I have created a single end-to-end flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvayj59snkeo65ohftxw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvayj59snkeo65ohftxw0.png" alt="Alt Text" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here I've just set up a very simple flow whereby I check those basic settings I've configured (opening hours, staff availability, and queue availability) and try to transfer the customer to an agent. &lt;/p&gt;

&lt;p&gt;If any of these fail, the system responds to the customer letting them know why (e.g. outside of opening hours) prior to terminating the call. &lt;/p&gt;

&lt;p&gt;If they can't be immediately transferred due to the queue being at capacity, I've implemented a loop to wait 5 minutes and try again.&lt;/p&gt;

&lt;p&gt;In this way I've been able to set up a complete end-to-end flow for a call centre, using a simple drag-and-drop UI. &lt;/p&gt;

&lt;p&gt;Flows can become a lot more complex: I could have used keypad entry or Lex bots, or even triggered an AWS Lambda (which in turn can invoke many other AWS services via an SDK call).&lt;/p&gt;
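&lt;p&gt;Such a Lambda is just an ordinary function. As a hedged sketch (the event shape follows Amazon Connect’s Lambda integration, where caller details arrive under &lt;code&gt;Details.ContactData&lt;/code&gt; and the response must be a flat map of string keys and values; the lookup logic and field names here are hypothetical):&lt;/p&gt;

```python
def lambda_handler(event, context):
    """Return attributes a contact flow can branch on."""
    contact = event["Details"]["ContactData"]
    caller_number = contact["CustomerEndpoint"]["Address"]
    # A real implementation might look the caller up in a CRM here.
    # Connect expects flat string key/value pairs back.
    return {"callerNumber": caller_number, "isKnownCustomer": "false"}

# Simulate an invocation locally with a made-up caller:
sample_event = {
    "Details": {"ContactData": {"CustomerEndpoint": {"Address": "+15551234567"}}}
}
result = lambda_handler(sample_event, None)
```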

&lt;h3&gt;
  
  
  13. Following this, you will need to set up a routing profile.
&lt;/h3&gt;

&lt;p&gt;Routing profiles act as the link between your agents and the queues they answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flhivhocwsvj97beizjxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flhivhocwsvj97beizjxr.png" alt="Alt Text" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkd9g3l9zgxs95w17mdko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkd9g3l9zgxs95w17mdko.png" alt="Alt Text" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  14. Once the routing profile is in place, you can now start creating users and assigning them to the profile.
&lt;/h3&gt;

&lt;p&gt;When creating an agent/user you need to assign them both a routing profile (which we just spoke about above) and a security profile. &lt;/p&gt;

&lt;p&gt;Security profiles dictate the access control the agent has within AWS Connect, and can be selected from default options of &lt;em&gt;Admin, Agent, CallCenterManager, or QualityAnalyst&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Alternatively, you can create your own security profiles and assign agents to them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dmc8enxzgrs38u8m4qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0dmc8enxzgrs38u8m4qw.png" alt="Alt Text" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  15. The last thing you need to do is to switch your inbound number onto the correct contact flow.
&lt;/h3&gt;

&lt;p&gt;The reason to do this last is to ensure that everything related to that contact flow is set up and agents are available before making the flow live. &lt;/p&gt;

&lt;p&gt;If you switched the number onto the flow at the start, but hadn’t yet created agents to answer, or the correct operational hours, then clients may start calling in and receiving unexpected responses or be left waiting for an agent.&lt;/p&gt;

&lt;p&gt;You do this simply by going back to the phone number management screen and attaching the number to your new contact flow.&lt;/p&gt;

&lt;p&gt;For instance, I switched it from ‘Sample inbound flow’ to ‘Demo’ which is the name I gave my demo contact flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6r9nu7zi4k7ry3w042b6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6r9nu7zi4k7ry3w042b6.png" alt="Alt Text" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  16. Once that’s done, you are ready to go.
&lt;/h3&gt;

&lt;p&gt;You have successfully set up a virtual call centre in (hopefully) under 30 minutes. Clients can now dial in and, after making their way through your contact flow, will be connected to an available agent.&lt;/p&gt;

&lt;p&gt;You’ll be able to log onto your virtual call centre by clicking the phone logo in the top right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2usuh2k2od53fh33s36k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2usuh2k2od53fh33s36k.png" alt="Alt Text" width="444" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhrhbi3wpy4c7zub4w53l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhrhbi3wpy4c7zub4w53l.png" alt="Alt Text" width="800" height="850"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was just a simple quick walkthrough of setting up Amazon Connect.&lt;/p&gt;

&lt;p&gt;Amazon Connect is a powerful tool and it can become complex when we start using some of the more interesting features such as AWS Lex and Lambda support.&lt;/p&gt;

&lt;p&gt;If you find yourself in need of some advice or just want to find out more, then feel free to reach out to me on Twitter (&lt;a href="https://twitter.com/joellutman" rel="noopener noreferrer"&gt;@joellutman&lt;/a&gt;), &lt;br&gt;
email &lt;a href="mailto:joel@manta-innovations.co.uk"&gt;joel@manta-innovations.co.uk&lt;/a&gt; or via my &lt;a href="http://manta-innovations.co.uk/" rel="noopener noreferrer"&gt;site&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>What is AWS connect?</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Wed, 01 Jul 2020 17:43:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-aws-connect-380i</link>
      <guid>https://dev.to/aws-builders/what-is-aws-connect-380i</guid>
      <description>&lt;p&gt;Recently I found myself spending some time with some of the less well known AWS services, and I wanted to draw attention to just how great some of these services are. &lt;/p&gt;

&lt;p&gt;One of them, AWS Connect, has proven to be an interesting use case. &lt;/p&gt;

&lt;p&gt;With the growing demand for remote working, it has seen increased usage during the COVID-19 outbreak. It allows companies to create a virtual, cloud-based call centre that enables and empowers staff to answer calls from anywhere they have access to a PC.&lt;/p&gt;

&lt;h2&gt;
  
  
  A cloud based call centre?
&lt;/h2&gt;

&lt;p&gt;AWS Connect markets itself as &lt;em&gt;“an omnichannel cloud contact center”&lt;/em&gt;, but what does that really mean?&lt;/p&gt;

&lt;p&gt;AWS Connect is a versatile way of building and managing a completely serverless call centre, and allows distributed teams to work remotely from anywhere in the world.&lt;/p&gt;

&lt;p&gt;It can be used as a simple way of managing agents and connecting customers with them, or as a way of building complex routing systems that can use multiple customer inputs and diverging paths to route customers to different agent teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwu6uwf5jljvb22hdltq6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwu6uwf5jljvb22hdltq6.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: www.pexels.com/photo/woman-wearing-earpiece-using-white-laptop-computer-210647 



&lt;p&gt;What separates AWS Connect from a traditional call centre is its ability to create and scale call centres within minutes, and to enable remote working by relying on web interfaces rather than a traditional handset. &lt;/p&gt;

&lt;p&gt;AWS Connect provides an all in one service for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acquiring public phone numbers&lt;/li&gt;
&lt;li&gt;creating distributed teams&lt;/li&gt;
&lt;li&gt;creating operational hours &lt;/li&gt;
&lt;li&gt;creating simple to complex call routing&lt;/li&gt;
&lt;li&gt;secure data storage and encryption of call logs on AWS&lt;/li&gt;
&lt;li&gt;integration with AWS database services for automatic logs and stats&lt;/li&gt;
&lt;li&gt;providing a simple user interface for agents to answer calls without needing a physical handset&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Just how simple is it?
&lt;/h2&gt;

&lt;p&gt;AWS Connect is one of those well designed products that can be downright simple, or incredibly complex, depending on what you design and the components that you use. &lt;/p&gt;

&lt;p&gt;It is designed to be used by anyone, and doesn’t require developer experience to configure, though some knowledge of S3 buckets and data encryption via AWS KMS is beneficial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fivb2wcg14xiwqor01t2u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fivb2wcg14xiwqor01t2u.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source:   https://unsplash.com/photos/BeVGrXEktIk 



&lt;p&gt;Without any prior experience, on my first attempt with AWS Connect I was able to get a full serverless call centre up and running in under 15 minutes, which included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A public dial in number&lt;/li&gt;
&lt;li&gt;Secure encrypted data storage for call logs&lt;/li&gt;
&lt;li&gt;Seniority roles for managers and agents with different admin rights&lt;/li&gt;
&lt;li&gt;Reports on call metrics and stats&lt;/li&gt;
&lt;li&gt;Operational hours for agents&lt;/li&gt;
&lt;li&gt;Call routing that made use of keypad entry, and queue checking to place customers on hold if no agent was available&lt;/li&gt;
&lt;li&gt;3 different user types (admin, managerial, agent) that could log in and receive inbound calls via PC and headset&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Complexity if you want it
&lt;/h1&gt;

&lt;p&gt;At the other end of the spectrum, AWS Connect supports a huge range of customisation and integrations with other services. &lt;/p&gt;

&lt;p&gt;AWS Connect can be integrated with AWS Lambda and AWS Lex, meaning scripts can be written to enable features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speech-to-text transcription - providing agents with a summary of the call&lt;/li&gt;
&lt;li&gt;Integration with AWS database solutions - providing queryable stats and metrics of calls&lt;/li&gt;
&lt;li&gt;Language detection - allowing keywords and phrases to be identified and flagged during calls to help understand overall customer satisfaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvddmzt5jwbhewkk0i7z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvddmzt5jwbhewkk0i7z7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: http://anthillonline.com/wp-content/uploads/2018/07/chatbot.jpg 



&lt;p&gt;AWS Connect is an incredibly versatile and scalable platform that allows companies to build a customised and flexible virtual call centre, enabling them to adapt to pressures of scale, flexibility, and distribution, and to overcome the obstacles and rigid structures of a traditional call centre.&lt;/p&gt;

&lt;p&gt;This is just a high-level summary of AWS Connect; I’ll be looking to put together a more technical guide in the near future - so watch this space. &lt;/p&gt;

&lt;p&gt;For more information about AWS and serverless, feel free to check out my other blogs, and my website &lt;a href="https://manta-innovations.co.uk/blog" rel="noopener noreferrer"&gt;Manta Innovations&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Automating data pipeline with AWS step functions</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Tue, 23 Jun 2020 16:58:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/automating-data-pipeline-with-aws-step-functions-3elk</link>
      <guid>https://dev.to/aws-builders/automating-data-pipeline-with-aws-step-functions-3elk</guid>
      <description>&lt;p&gt;Apache Spark, Serverless, and Microservice's are phrases you rarely hear spoken about together, but that's all about to change with AWS Step Functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Spark vs Serverless
&lt;/h3&gt;

&lt;p&gt;As someone who works as an SME in Apache Spark, I often find myself working with large Hadoop clusters (either on premise or as part of an EMR cluster on AWS), which run up large bills even though they sit mostly idle, seeing only short periods of intense compute when pipelines run. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F50l5zsly1o3imiyx6vpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F50l5zsly1o3imiyx6vpn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In contrast we have the Serverless movement.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The serverless movement aims to abstract away many of these issues with managed services, where you only pay for what you use - examples being AWS Lambda, Glue, DynamoDB, and S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk1gpef0911ec59p5g5u0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk1gpef0911ec59p5g5u0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here I’m going to talk about how we can bring these two concepts together to utilise serverless in delivering big data solutions to try and get the best of both worlds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hello &lt;del&gt;world&lt;/del&gt; AWS Step Functions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6yi4428b2wdv9u8uem7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6yi4428b2wdv9u8uem7v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to AWS Step Functions, a managed service that lets us coordinate multiple AWS services into workflows. &lt;/p&gt;

&lt;p&gt;AWS Step Functions can be used for a number of use cases and workflows, including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sequence batch processing&lt;/li&gt;
&lt;li&gt;transcoding media files&lt;/li&gt;
&lt;li&gt;publishing events from serverless workflows&lt;/li&gt;
&lt;li&gt;sending messages from automated workflows&lt;/li&gt;
&lt;li&gt;or orchestrating big data workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional enterprise Big Data architecture may involve many complex, distributed, self-managed tools. This could include clusters for Apache Spark, Zookeeper, HDFS, and more. This type of architecture is heavily reliant on time-based schedulers such as CRON and does a poor job of binding individual workflow steps together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fts5v5qqobbf59j56gob7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fts5v5qqobbf59j56gob7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates a typical big data workflow of sourcing data into a datalake, ETL'ing our data from source format to Parquet, and using a pre-trained Machine Learning model to predict based on the new data. Data is made available for user interaction via SQL queries.&lt;/p&gt;

&lt;p&gt;What if a single service goes down? How are we alerted? Our orchestration times have to be well defined and follow a synchronous blocking workflow. We have no contract between services - which leads to slow development of each component and drives a waterfall approach. These are all questions and problems that arise with such an architecture.&lt;/p&gt;

&lt;p&gt;So let's try to replicate this using serverless components and see if we can do a better job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzyy8ayruukaakj35wydy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzyy8ayruukaakj35wydy.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram we’ve been able to replicate the previous architecture in a completely serverless approach, thanks to Step Functions giving us a way of binding any AWS service into a workflow. Additionally, using managed serverless components has enabled us to overcome many of the problems and issues identified with the previous approach.&lt;/p&gt;

&lt;p&gt;This serverless approach gives us the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query the data at any stage via AWS Athena&lt;/li&gt;
&lt;li&gt;Handle any errors or timeouts across the entire stack, route the 
error to a SNS topic, then onto any support team&lt;/li&gt;
&lt;li&gt;Configure retries at a per service or entire stack level&lt;/li&gt;
&lt;li&gt;Inspect any file movement or service state via a simple query or 
HTTP request to DynamoDB&lt;/li&gt;
&lt;li&gt;Configure spark resources independently of each job, without 
worrying about cluster constraints or YARN resource sharing&lt;/li&gt;
&lt;li&gt;Orchestrate stages neatly together in many different ways 
(sequential, parallel, diverging)&lt;/li&gt;
&lt;li&gt;Trigger the entire pipeline on a CRON schedule or via events&lt;/li&gt;
&lt;li&gt;Monitor ETL workflows via the UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhqmddqmxu9qjltwi4c3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhqmddqmxu9qjltwi4c3c.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In comparison to a traditional single-stack, server-based architecture, organisations and businesses also gain a number of advantages for both the development process and service management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase development velocity and flexibility by splitting the 
Sourcing Lambda, Spark ETL, View Lambda, and Sagemaker scripts 
into microservices or monorepos&lt;/li&gt;
&lt;li&gt;Treat each ETL stage as a standalone service which only requires 
data in S3 as the interface between services&lt;/li&gt;
&lt;li&gt;Recreate our services quickly and reproducibly by leveraging 
tools such as Terraform&lt;/li&gt;
&lt;li&gt;Create and manage workflows in a simple readable configuration 
language&lt;/li&gt;
&lt;li&gt;Avoid managing servers, clusters, databases, replication, or 
failure scenarios&lt;/li&gt;
&lt;li&gt;Reduce our cloud spend and hidden maintenance costs by consuming 
resources as a service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuration as Code
&lt;/h3&gt;

&lt;p&gt;As mentioned, one of the clear benefits of using AWS Step Functions is being able to describe and orchestrate our pipelines with a simple configuration language. &lt;/p&gt;

&lt;p&gt;This enables us to remove any reliance on explicitly sending signals between services, custom error handling, timeouts, or retries. Instead, we define these with the Amazon States Language - a simple, straightforward, JSON-based, structured configuration language.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy0zxlv4d1t1nhm5kb0tz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy0zxlv4d1t1nhm5kb0tz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the states language we declare each task in our step function as a state. We define how that state transitions into subsequent states; what happens in the event of a state's failure (allowing for different transitions depending on the type of failure), and how we want a state to execute (sequential or in parallel alongside other states).&lt;/p&gt;
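&lt;p&gt;As an illustrative sketch (the ARNs, state names, and retry settings here are placeholders, not taken from the pipeline above), a minimal two-task state machine with a retry and a failure route might look like:&lt;/p&gt;

```json
{
  "Comment": "Sketch: two sequential tasks, with retries and failure routing",
  "StartAt": "SourceData",
  "States": {
    "SourceData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:source-data",
      "Retry": [{ "ErrorEquals": ["States.Timeout"], "MaxAttempts": 2 }],
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "NotifySupport" }],
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:transform-data",
      "End": true
    },
    "NotifySupport": {
      "Type": "Fail",
      "Cause": "Pipeline stage failed after retries"
    }
  }
}
```

&lt;p&gt;Each state declares its own transition, retry policy, and failure route, so none of that logic needs to live inside the Lambda code itself.&lt;/p&gt;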

&lt;h3&gt;
  
  
  Not the only option, but...
&lt;/h3&gt;

&lt;p&gt;It's worth pointing out that some of these benefits are not limited to just AWS Step Functions. &lt;/p&gt;

&lt;p&gt;Airflow, Luigi, and NiFi are all alternative orchestration tools that are able to provide us with a subset of these benefits, in particular scheduling and a UI. &lt;/p&gt;

&lt;p&gt;However, these rely on running on top of EC2 instances, which in turn have to be maintained. &lt;/p&gt;

&lt;p&gt;If those servers were to go offline our entire stack would be non-functional, which is not acceptable to any high-performing business. They also lack many of the other benefits discussed, such as stack-level error and timeout handling, and configuration as code, among others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzhhb5jq7vxrhjfvrhnu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzhhb5jq7vxrhjfvrhnu7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Step Functions - a versatile and reliable tool
&lt;/h3&gt;

&lt;p&gt;AWS Step Functions is a versatile service which allows us to focus on delivering value through orchestrating components. &lt;/p&gt;

&lt;p&gt;Used in conjunction with serverless applications we can avoid waterfall architecture patterns. Swapping different services in to fulfil roles during development allows developers to focus on the core use case, rather than solutionising. For instance, we could easily swap DynamoDB out for AWS RDS without any architecture burden. &lt;/p&gt;

&lt;p&gt;As we've demonstrated, it can be a powerful and reliable tool in leveraging big data within the serverless framework and should not be overlooked for anyone exploring orchestration of big data pipelines on AWS. &lt;/p&gt;

&lt;p&gt;Used in conjunction with the serverless framework, it can enable us to quickly deliver huge value without the traditional big (data) headaches.&lt;/p&gt;

&lt;p&gt;For more information about AWS and Serverless feel free to check out my other blogs, and my website, &lt;a href="https://manta-innovations.co.uk/" rel="noopener noreferrer"&gt;Manta Innovations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to test serverless workflows?</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Tue, 16 Jun 2020 17:44:29 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-to-test-serverless-workflows-2ibj</link>
      <guid>https://dev.to/aws-builders/how-to-test-serverless-workflows-2ibj</guid>
      <description>&lt;p&gt;Serverless is a design pattern which aims to remove many issues development teams typically face when maintaining servers or services, enabling them to focus on delivering value and benefit quickly and efficiently.&lt;/p&gt;

&lt;p&gt;However, using a large number of serverless resources also has its drawbacks, in particular the difficulty of testing. &lt;/p&gt;

&lt;p&gt;In this blog I aim to discuss some of these problems, and propose a solution for testing heavily serverless workflows through regression testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Different Types of Testing
&lt;/h3&gt;

&lt;p&gt;When building applications it’s important that we write comprehensive test coverage to ensure our application behaves as expected, and to protect us from unexpected changes during iteration. &lt;/p&gt;

&lt;p&gt;In both traditional and serverless development, when building apps and workflows that involve calls to other services, we need to test the boundaries.&lt;/p&gt;

&lt;p&gt;But how do we do this when our boundaries are managed services?&lt;/p&gt;

&lt;p&gt;Before continuing it’s important to understand the difference between unit, integration, and regression tests, as they are often easily mixed up:&lt;/p&gt;

&lt;h4&gt;
  
  
  Unit test
&lt;/h4&gt;

&lt;p&gt;The smallest type of test, where we test a function. When following Test Driven Development these are the kind of tests we write first.&lt;/p&gt;

&lt;p&gt;Given a function &lt;code&gt;def addOne(input: Int): Int = input + 1&lt;/code&gt;, we would expect a corresponding test which may look something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;addOne(-1) shouldEqual 0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;addOne(0) shouldEqual 1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
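&lt;p&gt;As a runnable sketch, the same tests translate directly to Python (the names here simply mirror the snippet above):&lt;/p&gt;

```python
def add_one(value: int) -> int:
    """Smallest possible unit under test, mirroring addOne above."""
    return value + 1


# Unit tests assert on a single function in isolation.
assert add_one(-1) == 0
assert add_one(0) == 1
```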

&lt;h4&gt;
  
  
  Integration test
&lt;/h4&gt;

&lt;p&gt;A larger test, where we test a workflow which may call many functions. These are more behaviour focused and target how our system expects to run given different inputs.&lt;/p&gt;

&lt;p&gt;Given an application with an entrypoint &lt;code&gt;def main(args: Seq[String]): Unit&lt;/code&gt;, we may expect an integration test to look something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;main(Seq("localhost:8000", "/fake-url", "30s")) shouldRaise 404&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;main(Seq("localhost:8000", "/mocked-url", "30s")) shouldNotRaiseException&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
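&lt;p&gt;A minimal, runnable sketch of the same idea in Python - the function's HTTP boundary is made injectable so the test can mock it (the URLs and names are illustrative, not from a real application):&lt;/p&gt;

```python
import urllib.request
from unittest import mock


def fetch_status(url: str, opener=urllib.request.urlopen) -> int:
    """Workflow entry point; the HTTP boundary is injectable so tests can mock it."""
    with opener(url) as response:
        return response.status


# Behaviour-focused test: mock the boundary, then assert on the outcome.
fake_opener = mock.MagicMock()
fake_opener.return_value.__enter__.return_value.status = 404
assert fetch_status("http://localhost:8000/fake-url", opener=fake_opener) == 404
```

&lt;p&gt;The mock stands in for the remote service, letting us exercise the workflow's behaviour for different responses without a live server.&lt;/p&gt;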

&lt;h4&gt;
  
  
  Regression test
&lt;/h4&gt;

&lt;p&gt;The largest type of test, also thought of as a systems test. While unit and integration tests look to test how our application behaves during changes to it, regression tests look to test how our systems behave due to our application changing. They also prevent unexpected regressions due to development.&lt;/p&gt;

&lt;p&gt;While integration testing of an API crawler may test what happens to the app when the API goes offline by utilising a local mock, regression testing should test what happens to downstream services should that API go offline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nAwJmf1h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/52yw7hxugjafn0j3pqg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nAwJmf1h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/52yw7hxugjafn0j3pqg5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Problems with Serverless
&lt;/h3&gt;

&lt;p&gt;Let's work through a real world example where we will see that relying only on unit and integration tests is not enough for even simple serverless workflows. &lt;/p&gt;

&lt;p&gt;Given this demo workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-qDjPES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b499zmhkur1kcsya5pnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-qDjPES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b499zmhkur1kcsya5pnk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda runs some simple code to get data from an API, do some processing, and publish the results to an S3 bucket&lt;/li&gt;
&lt;li&gt;S3 bucket has an event trigger that sends an alert to an SNS topic when new data is published to it&lt;/li&gt;
&lt;li&gt;SNS topic sends an email to users letting them know that the data is available to download from a link&lt;/li&gt;
&lt;li&gt;Users access the link, which is an AWS API Gateway endpoint, to download the data&lt;/li&gt;
&lt;/ul&gt;
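&lt;p&gt;A hypothetical sketch of the first step of this workflow in Python, with its boundaries injected so they can be stubbed in tests (the function, keys, and data shapes below are illustrative; a real Lambda would wrap an HTTP call and a boto3 &lt;code&gt;put_object&lt;/code&gt; behind these boundaries):&lt;/p&gt;

```python
import json


def handler(event, fetch, publish):
    """Illustrative first step: fetch from an API, process, publish to S3.

    `fetch` and `publish` are injected boundaries; in a real Lambda,
    `fetch` would wrap an HTTP request and `publish` an S3 put_object.
    """
    raw = fetch(event["source_url"])
    processed = {"records": [r for r in raw if r.get("valid")]}
    publish("results/output.json", json.dumps(processed))
    return {"published": len(processed["records"])}


# Injected boundaries make the Lambda itself easy to test in isolation:
stored = {}
result = handler(
    {"source_url": "https://example.com/api"},
    fetch=lambda url: [{"valid": True}, {"valid": False}],
    publish=lambda key, body: stored.update({key: body}),
)
```

&lt;p&gt;This covers the Lambda's own boundaries well - but, as we'll see, nothing here exercises the S3 event trigger, the SNS topic, or the API Gateway that sit downstream.&lt;/p&gt;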

&lt;p&gt;We'll assume the Lambda has unit and integration tests. These may use mocks and utilities to test how this simple code would handle the various response codes, and capture the messages being sent to S3. This is testing the boundaries of the Lambda, however this leaves much of our workflow untested.&lt;/p&gt;

&lt;p&gt;In a traditional stack, where we would be self-provisioning servers, we could test these by running containers for them. However, how do we do this with managed/cloud-native services which are not available in the form of local containers?&lt;/p&gt;

&lt;p&gt;Serverless and self-provisioned servers may bear similarities, but they’re not the same, and any tests using one as a stand-in for the other would provide little benefit. Yet if we only test the Lambda code then we are leaving much of our workflow untested.&lt;/p&gt;

&lt;p&gt;What happens if someone logs into the console and changes the SNS topic name?&lt;/p&gt;

&lt;p&gt;The Lambda will still pass its unit and integration tests, and it will still publish data to the S3 bucket. However, the SNS topic will no longer receive the event, and won’t be able to pass on the alert to our users - our workflow is broken, and even worse, we’re not aware of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5-rAZWmc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tormzwqf6amlnj90yrg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5-rAZWmc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tormzwqf6amlnj90yrg8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the catch-22 of testing managed/cloud-native serverless - as our workflows become more complicated, we need rigorous testing, but the more services we include the less tested our workflow becomes.&lt;/p&gt;

&lt;p&gt;This is why regression/systems testing becomes more important with serverless workflows, and why it should become more of the norm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regression Testing Serverless Workflows
&lt;/h3&gt;

&lt;p&gt;So now that we understand what we want to test, and why it's important, we need to find a way of testing it.&lt;/p&gt;

&lt;p&gt;The traditional approach would be to deploy the stack onto an environment, where someone can manually trigger and evaluate the workflow. This only tests the happy path, as it doesn’t evaluate all the permutations of different components changing. &lt;/p&gt;

&lt;p&gt;Additionally, due to the manual process involved we are unlikely to be able to evaluate this frequently, and instead may only do this once per release which could contain many changes. Should we find any regressions, it becomes harder to identify the root cause due to the multiple changes that have been implemented between releases. &lt;/p&gt;

&lt;p&gt;This doesn’t scale well when we have more complex workflows that utilise parallel and diverging streams (for an example of these, read my blog on building serverless data pipelines).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, how do we do better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, what we can do is take the same approach used for unit and integration tests, and look at how we can test our remit (in this case our entire workflow) as a black box.&lt;/p&gt;

&lt;p&gt;We can achieve this by: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spinning up infrastructure around our workflow&lt;/li&gt;
&lt;li&gt;Running a suite of tests to start the workflow &lt;/li&gt;
&lt;li&gt;Asserting on the results at the end of the workflow&lt;/li&gt;
&lt;li&gt;Destroying our test infrastructure afterwards - which requires leveraging IaC (Infrastructure as Code) tools such as Terraform&lt;/li&gt;
&lt;/ul&gt;
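&lt;p&gt;The four steps above map naturally onto a CI job. Here is a hypothetical GitHub-Actions-style sketch (the job name, test path, and Terraform commands are illustrative, not from a real project), with the teardown step configured to run even when the tests fail:&lt;/p&gt;

```yaml
jobs:
  regression-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # 1. Spin up infrastructure around our workflow
      - run: terraform init && terraform apply -auto-approve
      # 2 & 3. Trigger the workflow and assert on its end state
      - run: pytest tests/regression
      # 4. Destroy the test infrastructure, even if the tests failed
      - if: always()
        run: terraform destroy -auto-approve
```

&lt;p&gt;Because the infrastructure is created and destroyed per run, the suite can execute on every change rather than once per release.&lt;/p&gt;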

&lt;p&gt;For our demo workflow, we would achieve this by deploying a stand-in service which the Lambda at the start of our workflow will connect to in lieu of the real external API. &lt;/p&gt;

&lt;p&gt;We can then run a suite of tests to trigger the Lambda, and assert the expected results exist at the end of our workflow via the workflow API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vWM7KLUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/t1y82g6ghf5wqyb05hm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vWM7KLUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/t1y82g6ghf5wqyb05hm8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this approach, we can now automate the traditional manual QA testing, and ensure we cover a much wider spectrum of BDD test cases, including scenarios such as “What alert do/should our users receive if the API is unavailable?”. &lt;/p&gt;

&lt;p&gt;In traditional unit/integration testing we wouldn’t be able to answer or test for this, as this process is handled outside of the Lambda. We could test what happens to the Lambda in the event of the external API becoming unavailable, but not how downstream processes would react - we’d be reliant on someone manually trying to mimic this scenario, which doesn’t scale.&lt;/p&gt;

&lt;p&gt;Furthermore, utilising IaC we can run a huge barrage of these larger workflow tests in parallel, and easily scale these up to incorporate elements of load and chaos testing. &lt;/p&gt;

&lt;p&gt;Instead of being reactive to our workflow breaking, we can push the limits to establish our redundancy prior to experiencing event outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;p&gt;Hopefully I’ve sold you on the idea of regression/systems testing, and why as we move to a more serverless world, we need to establish a more holistic view on testing our systems as a whole, rather than only the components in isolation.&lt;/p&gt;

&lt;p&gt;This is not to say we should abandon the faithful unit test in favour of systems testing, but rather that we should not fall into the fallacy that just because our “code” is tested, our systems and workflows are also tested. &lt;/p&gt;

&lt;p&gt;This also highlights why Development, QA, and DevOps are not activities done in isolation by separate teams. A key understanding of each is required to implement and test such a workflow, and ideally both the workflow and the test framework should be implemented by a single cross-functional team, rather than throwing tasks over the fence.&lt;/p&gt;

&lt;p&gt;For more on AWS and serverless, feel free to check out my other blogs on Dev.to, and my website, &lt;a href="https://manta-innovations.co.uk/"&gt;Manta Innovations&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>testing</category>
    </item>
    <item>
      <title>What was it like to attend a virtual conference? - AWS Online Summit Series</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Thu, 11 Jun 2020 17:39:12 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-was-it-like-to-attend-a-virtual-conference-aws-online-summit-series-1h44</link>
      <guid>https://dev.to/aws-builders/what-was-it-like-to-attend-a-virtual-conference-aws-online-summit-series-1h44</guid>
      <description>&lt;h3&gt;
  
  
  What was a virtual conference like?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oph1mNzh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mb09x67ccsn8ee055u6x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oph1mNzh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mb09x67ccsn8ee055u6x.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before I go into each talk, I thought I'd write a bit about what attending a virtual conference was like.&lt;/p&gt;

&lt;p&gt;The first thing I noticed is that all the videos were pre-recorded, rather than live streams. This meant there was no back and forth with the presenters, and any questions were instead answered by an "AWS Expert" via a text chat. That definitely didn't give me the feeling of being part of something in the way a conference would. However, it did mean that there were no latency or connection issues. It also meant that I could watch any of the talks once they had been released over the following days, rather than having to choose which talks I could attend.&lt;/p&gt;

&lt;p&gt;At a usual conference I might attend with colleagues or friends, and we would talk and bounce around ideas about the subject matter; the virtual format gave me no such sense of camaraderie. In hindsight, I should have arranged for a group of friends to all dial into the conference on a shared Zoom call, as I know this has worked for others in the past.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--26K-z9kF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hzpzwktihl1lbl6fmluk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--26K-z9kF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/hzpzwktihl1lbl6fmluk.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the plus side, being able to listen to any talk I wanted to, and being able to quickly switch between talks if I realised the talk I was listening to wasn't actually of interest, gave me much more freedom.&lt;/p&gt;

&lt;p&gt;I would never get up and walk out of a talk if I were there in person; it's rude and distracting to the presenter. However, knowing that I could switch over to another stream within seconds meant I could be much more flexible: listening to a talk's opening 5 minutes before deciding whether I wanted to stay for the whole thing.&lt;/p&gt;

&lt;p&gt;This flexibility also makes the conference more accessible for those who would not typically have been able to attend in person, whether due to travel, cost, physical accessibility, personal dependents, or work deadlines. There are many reasons why someone might not be able to attend a traditional conference, and many of them disappear in a virtual format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not as personal as traditional conferences, but there are lessons to be learnt
&lt;/h3&gt;

&lt;p&gt;Overall I would say this felt a lot less fun than a traditional conference, and I personally found it a much more clinical and lonely experience.&lt;br&gt;
The keynote and fireside chats did better at giving a personal touch, as these were clearly filmed from the presenters' personal office spaces, while the technical talks were presented in front of an AWS-orange screen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ckjk1lrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/69z5sy8xsoqsauoihr8t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ckjk1lrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/69z5sy8xsoqsauoihr8t.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
It might be a while before we can attend a conference like this again



&lt;p&gt;Given the circumstances I'm really glad AWS chose to run the summit; an online summit is better than no summit. However, if it's a choice between attending online or in person, I'd rather go in person. That said, it may be worth adopting some lessons from online conferences, such as recording the talks and sharing them with participants afterwards, as not all conferences provide this.&lt;/p&gt;

&lt;p&gt;However, who knows when we will be able to attend conferences in person again; online conferences might become the norm for the foreseeable future.&lt;/p&gt;

&lt;h4&gt;
  
  
  If you've attended an online conference recently - what was your experience?
&lt;/h4&gt;

&lt;p&gt;As always, the content here describes my own thoughts and understandings from the material presented, not the views of the presenters, who I do not speak for.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is my last blog in my AWS’s Online Summit 2020 series. I hope you have enjoyed them.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  For more on my AWS Summit series, check out the summaries on the talks I attended.
&lt;/h4&gt;

</description>
      <category>aws</category>
    </item>
    <item>
      <title>What is MLOps? - AWS Online Summit Series</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Tue, 09 Jun 2020 18:05:49 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-is-mlops-aws-online-summit-series-4l8h</link>
      <guid>https://dev.to/aws-builders/what-is-mlops-aws-online-summit-series-4l8h</guid>
      <description>&lt;h3&gt;
  
  
  What is MLOps? - AWS Online Summit Series
&lt;/h3&gt;

&lt;p&gt;Having originally come from a Data Science and ML background, before focusing on Cloud implementations and Serverless, I was interested in AWS AI Specialist Solutions Architect &lt;em&gt;Julian Bright’s talk on Machine learning ops: DevOps for data science&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ops, Ops, Ops
&lt;/h3&gt;

&lt;p&gt;MLOps (Machine Learning Ops) is another new term, following the pattern of DevOps and GitOps (not to forget DevSecOps, DataOps, AIOps, and anything else you can append “Ops” onto), that I’m seeing more and more in the industry.&lt;/p&gt;

&lt;p&gt;MLOps largely revolves around solving similar issues to DevOps - deployments. The main difference is that instead of focusing on application deployment, MLOps focuses on model deployment.&lt;/p&gt;

&lt;p&gt;If I’m honest, I’m not sure we need another “Ops” title just to differentiate between a model and an application. At the end of the day a well-written ML model is often shipped as a containerised application or binary artifact anyway, which is not that dissimilar from a standard containerised app or jar.&lt;/p&gt;

&lt;p&gt;But then again I work in the industry that brought us phrases such as “Python Ninja”, “10x Developer”, and recursive mindbender “SPARQL” (SPARQL Protocol and RDF Query Language); so maybe I shouldn’t be too critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  ML still has a long way to go
&lt;/h3&gt;

&lt;p&gt;Julian opened by giving us some interesting facts about Machine Learning in industry. In particular, he quoted an Algorithmia survey which found that “55% of companies have not deployed a machine learning model” (by “companies” Algorithmia are referring to enterprise businesses, of which they had 750 respondents, though they do not publish what metric they used to classify a business as enterprise).&lt;/p&gt;

&lt;p&gt;Having worked on both the data science and software sides, I’m honestly not that surprised.&lt;/p&gt;

&lt;p&gt;ML is still a relatively novel concept to many enterprise businesses. From my personal experience many enterprise use cases are much more BI-focused, and have yet to understand and tap into what an ML model can do for them over their traditional dashboards and reports. The Algorithmia survey shows 21% of respondents were still evaluating use cases to see if they even had a need for an ML model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F61ay0dqv9lvxysgfzuwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F61ay0dqv9lvxysgfzuwf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: https://info.algorithmia.com/2020 



&lt;p&gt;In addition, the Algorithmia survey found that of the 45% who had deployed a machine learning model, approximately 68% took anywhere between one week and over a year to deploy a single model.&lt;/p&gt;

&lt;p&gt;Keep in mind that in a best practice CI/CD workflow we deploy multiple times a day (and in GitOps we deploy each commit). So a single deployment taking even a week should be unacceptable in modern software design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs746du2rvljbgulbfgmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs746du2rvljbgulbfgmb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: https://info.algorithmia.com/2020



&lt;h3&gt;
  
  
  Why so slow?
&lt;/h3&gt;

&lt;p&gt;Julian went on to talk about how the actual ML code is only a small part of an ML solution. Good machine learning solutions require accurate data, which needs, among other things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collection&lt;/li&gt;
&lt;li&gt;verification&lt;/li&gt;
&lt;li&gt;feature engineering&lt;/li&gt;
&lt;li&gt;metadata management&lt;/li&gt;
&lt;li&gt;infrastructure management&lt;/li&gt;
&lt;li&gt;automation&lt;/li&gt;
&lt;li&gt;process management&lt;/li&gt;
&lt;li&gt;team structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these can introduce their own challenges. One of which he highlighted was that different teams could own parts of the process, each requiring their own handoff, integration points, and development workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy1rw9v7ppd14yru6cr01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy1rw9v7ppd14yru6cr01.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is something that I’ve definitely seen across all aspects of software development; it is not specific to ML.&lt;/p&gt;

&lt;p&gt;My own opinion on this matter is that the developer/engineer/scientist who develops the source code (whether that be an app or a model) should be the one to take it through its entire lifecycle to deployment. In my opinion this speeds up delivery, provides a more coherent and consistent code base for the model, and avoids “throwing it over the fence” to other teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying and Orchestrating ML Models
&lt;/h3&gt;

&lt;p&gt;Julian went on to talk about how we can use the AWS Developer Tools (Code Build, Deploy, Pipeline, etc) not only for deploying traditional apps but for ML models too, which follows the patterns demonstrated in Loh Yiang Meng’s talk: CI/CD at scale: Best practices with AWS DevOps services.&lt;/p&gt;

&lt;p&gt;This did make me think, if we can use the same processes and tooling for both application and ML models, then why should we treat ML models any differently to applications?&lt;/p&gt;

&lt;p&gt;Anyway, I digress.&lt;/p&gt;

&lt;p&gt;So now that we are able to deploy our model, how do we orchestrate it?&lt;/p&gt;

&lt;p&gt;Compared to apps, many ML and Data Science models are written more as scripts than as services; and as highlighted, we may need to perform small steps such as data cleansing and validation prior to using our model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serverless ML
&lt;/h3&gt;

&lt;p&gt;Julian demonstrated how we can use a number of tools to orchestrate SageMaker scripts to perform these steps. He mentioned several operators including Apache Airflow, Netflix Metaflow, Kubernetes, and AWS Step Functions (which provides first-class support for SageMaker scripts).&lt;/p&gt;

&lt;p&gt;This was interesting to me: I’m a huge AWS Step Functions fan, having used it extensively in my serverless AWS implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe2heja7ry77fj4vpatsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe2heja7ry77fj4vpatsl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: https://aws.amazon.com/step-functions/use-cases



&lt;p&gt;Despite being around since 2016, AWS Step Functions does not have first-class support for most other AWS services, and requires you to write a small Lambda function to invoke the actual service. The more AWS services that Step Functions gives first-class support for, the better.&lt;/p&gt;
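&lt;p&gt;For contrast, a service with first-class support needs no Lambda shim at all: a state calls the service directly via a service integration ARN, and the &lt;code&gt;.sync&lt;/code&gt; suffix pauses the workflow until the job completes. Roughly like the sketch below, where every name, ARN, and value is a placeholder:&lt;/p&gt;

```json
{
  "Comment": "Illustrative only - names, ARNs, and values are placeholders",
  "StartAt": "TrainModel",
  "States": {
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Parameters": {
        "TrainingJobName.$": "$.jobName",
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
        "AlgorithmSpecification": {
          "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-model:latest",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": { "S3OutputPath": "s3://my-bucket/models" },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.large",
          "VolumeSizeInGB": 10
        },
        "StoppingCondition": { "MaxRuntimeInSeconds": 3600 }
      },
      "End": true
    }
  }
}
```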

&lt;h3&gt;
  
  
  Still some way to go
&lt;/h3&gt;

&lt;p&gt;Overall I came away thinking that ML in enterprise still has a long way to go, and that we’re still seeing a lot of gatekeeping in this area.&lt;/p&gt;

&lt;p&gt;We have data engineers writing code to deliver the data, data scientists writing models, developers writing apps to turn the model into a service, and operations deploying it to environments.&lt;/p&gt;

&lt;p&gt;No wonder things are slow and complex when we have this many handoffs. If approaches such as MLOps can assist with this then that’s great, but to me many of the deployment issues feel more like business and process problems than technical or tools-based ones.&lt;/p&gt;

&lt;p&gt;These are of course my own opinions, and I would welcome your thoughts on MLOps.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of my ongoing series on AWS’s recent Online Summit 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As always, the content here describes my own thoughts and understandings from the material presented, not the views of the presenters, who I do not speak for.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>CI/CD at scale - AWS Online Summit Series</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Thu, 04 Jun 2020 18:30:43 +0000</pubDate>
      <link>https://dev.to/aws-builders/ci-cd-at-scale-aws-online-summit-series-59jj</link>
      <guid>https://dev.to/aws-builders/ci-cd-at-scale-aws-online-summit-series-59jj</guid>
      <description>&lt;h3&gt;
  
  
  What is Best Practice?
&lt;/h3&gt;

&lt;p&gt;While I have my own opinion of best practice, I think it’s good to constantly check your standards against peers and industry leaders to ensure you haven’t fallen behind.&lt;/p&gt;

&lt;p&gt;Therefore I decided to dial into &lt;em&gt;AWS Solution Architect Loh Yiang Meng’s talk: CI/CD at scale: Best practices with AWS DevOps services&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Overall I felt that this talk was best pitched for those unfamiliar with the AWS CICD tools, as he gave a good overview of the AWS Developer tools (Code Commit/Build/Deploy/Pipeline), and how these integrate with each other. For more info on these check out the docs on &lt;a href="https://aws.amazon.com/codepipeline/" rel="noopener noreferrer"&gt;AWS CodePipeline&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CodePipeline now supports integration with Bitbucket Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8aan1d46sbw2y3878ma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8aan1d46sbw2y3878ma.png" alt="Alt Text"&gt;&lt;/a&gt; Source: CI/CD at scale: Best practices with AWS DevOps services -   Loh Yiang Meng, AWS Solution Architect&lt;/p&gt;

&lt;p&gt;One thing that he made a point of highlighting is that CodePipeline now supports integration with Bitbucket Cloud (I believe this went into Beta last December), which leaves GitLab as the only major git provider not supported. &lt;/p&gt;

&lt;p&gt;While I’ve used GitLab extensively in enterprise environments (and much prefer the experience over Bitbucket or CodeCommit), between this and all the great stuff GitHub has been doing recently with Codespaces and Actions, I really can’t see any reason not to be using GitHub in 2020.&lt;/p&gt;

&lt;h3&gt;
  
  
  Electrify’s Journey with AWS CICD
&lt;/h3&gt;

&lt;p&gt;Lastly, Loh introduced Martin Lim, CEO, and Arshad Zackeriya, Senior DevOps Engineer, from Electrify Asia to talk about their CICD journey with AWS. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhtcgpt9154294xvebl3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhtcgpt9154294xvebl3t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: CI/CD at scale: Best practices with AWS DevOps services -   Loh Yiang Meng, AWS Solution Architect



&lt;p&gt;Here they gave us an overview of their CICD pipeline, which followed Loh’s use of CodeCommit, CodeBuild, ECR, and CodePipeline for best practice CI. However they used a Lambda to deploy to their EKS cluster (deployment to EKS is something that CodeDeploy has yet to support), and then went further and built an Alexa skill to trigger deployments. &lt;/p&gt;

&lt;p&gt;While their design of sourcing (CodeCommit), building (CodeBuild), publishing (ECR), and orchestration (CodePipeline) followed best practice CI, and the Alexa skill definitely had the wow factor, this still involved manual intervention to trigger deployments. Sure, the Alexa skill made deployments easier, but is it really any different from someone clicking “run” on a Jenkins job?&lt;/p&gt;

&lt;p&gt;I’m also not sure I’d trust Alexa with my production deployments - what happens if a colleague says the wrong release number?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ub44apq266xfb10idq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ub44apq266xfb10idq9.png" alt="Alt Text"&gt;&lt;/a&gt; Source: boredpanda.com&lt;/p&gt;

&lt;h3&gt;
  
  
  “DevOps is not a product, but a culture”
&lt;/h3&gt;

&lt;p&gt;Overall, Loh Yiang Meng was very engaging as a presenter and some of his comments on best practice definitely aligned with my own. In particular, he highlighted that we should automate everything, because humans make mistakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu34htev2tt74946cqp7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu34htev2tt74946cqp7p.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: CI/CD at scale: Best practices with AWS DevOps services -   Loh Yiang Meng, AWS Solution Architect



&lt;p&gt;&lt;em&gt;This is part of my ongoing series on AWS’s recent Online Summit 2020&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As always, the content here describes my own thoughts and understandings from the material presented, not the views of the presenters, who I do not speak for.&lt;/p&gt;

&lt;h4&gt;
  
  
  For more on my AWS Summit series, check out the summaries on the talks I attended.
&lt;/h4&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Enterprise &amp; Containerization - AWS Online Summit Series</title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Wed, 03 Jun 2020 16:44:57 +0000</pubDate>
      <link>https://dev.to/aws-builders/enterprise-containerization-aws-online-summit-series-22p4</link>
      <guid>https://dev.to/aws-builders/enterprise-containerization-aws-online-summit-series-22p4</guid>
      <description>&lt;h3&gt;
  
  
  What’s hard about containers?
&lt;/h3&gt;

&lt;p&gt;I tend to work with a lot of enterprise clients and, as much as I believe in and desire modern workflows with Kubernetes, Flux, and GitOps, my experience has been that many enterprise clients are still stuck in the traditional delivery format and have only a tentative understanding of containerization and microservices.&lt;/p&gt;

&lt;p&gt;I believe this is due to the business concept of an “application” being easier to comprehend as a single monolithic codebase rather than a set of loosely coupled microservices. So I was interested to hear &lt;em&gt;AWS Senior Partner Solutions Architect Gaurav Arora’s&lt;/em&gt; thoughts on how we as technologists and consultants deal with that, and dialled into his talk on &lt;em&gt;Enterprise cloud migration meets application containerization&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Gaurav presented his approach to containerization of enterprise applications using a plan of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prepare&lt;/li&gt;
&lt;li&gt;Discover&lt;/li&gt;
&lt;li&gt;Design&lt;/li&gt;
&lt;li&gt;Migrate&lt;/li&gt;
&lt;li&gt;Operation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Prepare
&lt;/h4&gt;

&lt;p&gt;The Prepare stage is all about understanding the enterprise viewpoint, and how to prepare them for containerization.&lt;/p&gt;

&lt;p&gt;He spoke of his experience with enterprise clients and how, while many of them may have heard of containerization and potentially even Kubernetes, some are still in the dark as to why these would benefit them.&lt;/p&gt;

&lt;p&gt;Those that were aware cited benefits such as “increased agility”, “productivity”, and “cost optimisation”, and these are exactly the arguments we should home in on when evangelising containerization to the enterprise.&lt;/p&gt;

&lt;h4&gt;
  
  
  Discover
&lt;/h4&gt;

&lt;p&gt;Gaurav next spoke about the Discovery stage, and how when looking at pre-existing enterprise applications we need to assess what elements can be containerized.&lt;/p&gt;

&lt;p&gt;Do we need binaries? How does the licence work inside a container? Do we bundle our own dependencies? Is it stateless? Can it be containerized?&lt;/p&gt;

&lt;p&gt;This is something that I hadn’t appreciated. For so long I’ve been able to run any app inside a Docker container and build my applications with Docker in mind.&lt;/p&gt;

&lt;p&gt;I’d forgotten that so much enterprise software is reliant on obscure versions of specific software that might be entirely closed source. What if an application is only built to run on some Windows platform - how do I containerize that?!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmrqif2wh9wxq43j0zfov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmrqif2wh9wxq43j0zfov.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: Enterprise cloud migration meets application containerization - Gaurav Arora, AWS Senior Partner Solutions Architect



&lt;h4&gt;
  
  
  Design, Migrate &amp;amp; Operation
&lt;/h4&gt;

&lt;p&gt;Gaurav talked about how, for both Design and Migration, enterprises should consider a three-stage approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Stage one - the design and creation of the lowest level of the cloud: VPCs, security groups, accounts, and tagging.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage two - the design and creation of the cluster environment: would you use ECS or Fargate? What about Kubernetes? Do you use ECR? What about load balancers?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage three - the actual container architecture: which base image do you use, and how many replicas should you run?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
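&lt;p&gt;To make stage three a little more concrete, those decisions ultimately land in a manifest. A minimal sketch, in which the image, names, and values are all placeholders of my own rather than anything from the talk:&lt;/p&gt;

```yaml
# Illustrative stage-three decisions expressed as a Kubernetes manifest;
# all names and values here are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-app
spec:
  replicas: 3          # how many replicas should we run?
  selector:
    matchLabels:
      app: legacy-app
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      containers:
        - name: legacy-app
          # which base image do we use?
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/legacy-app:1.0.0
          ports:
            - containerPort: 8080
```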

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fff4t4yfqdqfgypqk9b1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fff4t4yfqdqfgypqk9b1m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overall, although I completely agree with Gaurav and understand how recent containerization is in the eyes of some enterprise businesses, more than anything I was left a little disheartened.&lt;/p&gt;

&lt;p&gt;The fact that we are still talking about how to containerize enterprise applications shows how many applications out there are still waiting to be containerized, or simply can’t be.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of my ongoing series on AWS’s recent Online Summit 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As always, the content here describes my own thoughts and understandings from the material presented, not the views of the presenters, who I do not speak for.&lt;/p&gt;

&lt;h4&gt;
  
  
  For more on my AWS Summit series, check out the summaries on the talks I attended.
&lt;/h4&gt;

</description>
      <category>aws</category>
      <category>docker</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS meets GitOps </title>
      <dc:creator>Joel Lutman</dc:creator>
      <pubDate>Thu, 28 May 2020 17:07:37 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-meets-gitops-c7a</link>
      <guid>https://dev.to/aws-builders/aws-meets-gitops-c7a</guid>
      <description>&lt;p&gt;As someone who’s spending more and more time with kubernetes and but has only dipped my toe into GitOps, I was interested to hear what the AWS approach would be to GitOps.&lt;/p&gt;

&lt;p&gt;Therefore, I dialled into &lt;em&gt;AWS Solution Architect Jason Umiker’s Kubernetes GitOps on AWS&lt;/em&gt;, at the &lt;a href="https://aws.amazon.com/events/summits/?global-event-sponsorship.sort-by=item.additionalFields.sortdate&amp;amp;global-event-sponsorship.sort-order=asc" rel="noopener noreferrer"&gt;AWS Summit Online, May 2020&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This talk did not disappoint. &lt;/p&gt;

&lt;h3&gt;
  
  
  I've become a Flux convert
&lt;/h3&gt;

&lt;p&gt;After covering the basic concepts of CICD, we went straight into an overview of &lt;a href="https://fluxcd.io/" rel="noopener noreferrer"&gt;Flux&lt;/a&gt;, the GitOps operator for Kubernetes and part of the CNCF, and what GitOps actually means for a workflow: chiefly, being able to control deployments via Pull Requests to your master/release branch.&lt;/p&gt;
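&lt;p&gt;As a small illustration of that idea, Flux (v1, as it was at the time) can be told to manage a workload via annotations on its manifest. The fragment below is hypothetical - the workload name and tag filter are mine, not from the talk:&lt;/p&gt;

```yaml
# Hypothetical fragment of a Deployment manifest managed by Flux v1.
# Merging a PR that changes this file (or pushing an image tag that
# matches the filter) is what triggers the deployment - no manual
# release step.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  annotations:
    fluxcd.io/automated: "true"            # Flux may roll out new images
    fluxcd.io/tag.my-service: semver:~1.0  # restrict to 1.x releases
```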

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnr8ha1bvi808scykmwne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnr8ha1bvi808scykmwne.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A convincing argument for GitOps.
&lt;/h3&gt;

&lt;p&gt;GitOps is a very new approach to release management and deployment, especially for enterprise clients, many of whom are still struggling with CICD and remain on traditional timed release cycles.&lt;/p&gt;

&lt;p&gt;He highlighted that all developers already use git, for many great reasons that apply not only to software development but to release management too: a single source of truth, an audit trail, built-in peer review, and an easy way to gate change.&lt;/p&gt;

&lt;p&gt;By tying the actual release management and deployment to git, we can now have a single tool in control of not only our development and iteration, but also our deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftfh4bvlmczkjv1mnkzze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftfh4bvlmczkjv1mnkzze.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
 Source: https://dzone.com/articles/what-devops-is-to-the-cloud-gitops-is-to-cloud-nat 





&lt;h3&gt;
  
  
  Ghost in the machine 👻
&lt;/h3&gt;

&lt;p&gt;Jason went on to explain and demonstrate how GitOps with Flux could be achieved on AWS using AWS CodeBuild and CodePipeline, alongside external kubernetes operators to deploy a change to his &lt;a href="https://ghost.org/" rel="noopener noreferrer"&gt;Ghost&lt;/a&gt; service running on EKS. &lt;/p&gt;

&lt;p&gt;Here he merged a PR that changed the RDS definition which the Ghost app used for storage (an AWS resource managed by the AWS CDK) and a change to his Ghost deployment (a kubernetes resource defined by the manifest). Because he is using GitHub as a source, CodePipeline is able to monitor the repo for changes and initiate a simple pipeline of Source (from git) and CodeBuild only, with the trick being that the CodeBuild stage is actually doing our deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4c71fchvhyvmw2ihgbq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4c71fchvhyvmw2ihgbq4.png" alt="Alt Text"&gt;&lt;/a&gt; Source: Kubernetes GitOps on AWS - Jason Umiker, AWS Solution Architect &lt;/p&gt;

&lt;p&gt;This CodeBuild stage actually has a very simple buildspec.yml that just issues the &lt;code&gt;cdk deploy&lt;/code&gt; (for those not familiar with AWS CDK this is the equivalent of a &lt;code&gt;terraform apply&lt;/code&gt;) which applies the change to the RDS resource. At the same time we have Flux monitoring the same repository via a webhook, which has performed a new deployment for the change to the Ghost manifest yaml.&lt;/p&gt;
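&lt;p&gt;The exact file wasn’t shown in full, but a buildspec.yml of that shape might look something like the sketch below; the runtime version and flags are my assumptions, not from the demo:&lt;/p&gt;

```yaml
# A minimal sketch of a buildspec.yml whose build step is a CDK deploy.
version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: 12
    commands:
      - npm install -g aws-cdk   # CDK CLI
      - npm ci                   # project dependencies
  build:
    commands:
      # the "build" is really the deployment
      - cdk deploy --require-approval never
```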

&lt;p&gt;And there we had it, in a single PR he had committed, reviewed, and deployed a change to both the AWS managed infrastructure, and the Kubernetes managed service.&lt;/p&gt;

&lt;h3&gt;
  
  
  I need some more alone time with Flux 😉
&lt;/h3&gt;

&lt;p&gt;This talk was great and made me realise that I need to spend more time with Flux, especially in light of the Argo Flux collaboration which happened back in November, as this is the exact CICD workflow I’ve always desired.&lt;/p&gt;

&lt;p&gt;From a developer’s point of view, being able to finish my tasks with a PR is the ideal. I don’t need to worry about whether my PR made it into the “Friday release”, or whether there were any issues during deployment; if it’s merged, it’s done.&lt;/p&gt;

&lt;p&gt;This is part of my ongoing series on AWS’s recent Online Summit 2020. &lt;/p&gt;

&lt;p&gt;As always, the content here describes my own thoughts and understandings from the material presented, not the views of the presenters, who I do not speak for.&lt;/p&gt;

&lt;h4&gt;
  
  
  For more on my AWS Summit series, check out the other summaries on the talks I attended.
&lt;/h4&gt;

</description>
      <category>aws</category>
      <category>git</category>
      <category>kubernetes</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
