DEV Community: Kyle Jones

How smart cities are helping the public sector reduce emissions

Kyle Jones — Wed, 28 Jul 2021 00:00:00 +0000

Our global population is exploding, and the latest estimates project it’s set to reach 10 billion by 2050. This growth of almost 2 billion people will put an even greater demand on our existing resources, including energy. Most of us are also eager to improve our quality of life, which is often done by finding new ways to incorporate technology into our daily lives. However, with growing concern around climate change, these uses for technology need to be more efficient and sustainable. So how do we start to reduce the carbon footprint of a city while improving lives?

The answer could be in how we build smarter cities for the future.

What are smart cities?

Smart cities are large settlements that are data-driven, using a variety of sensors, analytics and automated processes to streamline traditionally manual services. This means that society and the overall quality of life improves, as less time and effort gets spent on mundane and repetitive tasks.

Bristol is an example of a city that is innovating in this way in order to reach their target of reducing CO2 emissions by 40% through a number of initiatives such as smart metering.

European cities like Copenhagen are pushing the boundaries through research and development undertaken by a smart cities incubator called the Copenhagen Solutions Lab. The incubator’s current focus is on air quality and the population’s flow along the roads.

Further afield, cities including Dubai and Singapore are also installing infrastructure and putting revolutionary systems in place, such as autonomous police stations where fines can be paid or incidents can be reported.

What technologies power smart cities?

The main requirement for many of the innovations is a fast, reliable internet connection to transmit the data from the sensors. The internet required is supported by the recent introduction and adoption of 5G across the world. 5G networks have lower latency and higher bandwidth, meaning that they provide a faster connection which will undoubtedly become the backbone of the infrastructure behind smart cities.

Most commonly, the features in smart cities are underpinned by consumer-grade electronics — particularly sensors, the same technology used in most Internet of Things (IoT) devices like Alexa speakers and Philips Hue bulbs. These electronics have exploded in popularity in recent years with the surge in interest in both smart home products and products from manufacturers like Raspberry Pi and Arduino.

Smart cities will generate more real-time data that can feed forward into other systems. It could also feed into big data analytics alongside existing data sets such as weather or event calendars to gather further insights from the combined information. For example, on a rainy day, analytics could determine whether an increased capacity for public transport is needed.

What features of smart cities are environmentally friendly?

As technology advances, more and more manual processes have become automated. These advances are visible across all industries — from autonomous vehicles and autopilot systems to robotics used in manufacturing. Similarly, as technology has improved, more data is being produced that can then be fed into analytics services.

Energy

The most common example is the adoption of smart utility meters to monitor and forecast usage. These smart meters improve the accuracy of suppliers’ billing systems and enable a more streamlined, tailored support service. Over time, they will also enable suppliers to predict the required capacity at any given time. Predicting capacity in this way will pave the way for an easier transition to clean energy through evidence-based requirements for energy storage.

In anticipation of hosting COP26 in 2020, Glasgow began installing smart street lighting in various parts of the city. The installations use motion and noise detection to turn the lights on when triggered. Keeping these lights off when footfall is at its lowest levels saves electricity, lowering the reliance on the power grid while not compromising on safety or security.

Transport

Transport is another key area where some cities have made smart, environmentally-friendly changes.

One example is the 3,000 sensors installed to create smart parking spaces in Cardiff. Proximity sensors are installed in the ground beneath the spaces and detect whether each parking space is occupied by a vehicle. The status of the spaces is updated in real time on local street signs and in an accompanying app, which also allows users to reserve a space in advance. Informing the public that there are few or no parking spaces available encourages commuters to choose more eco-friendly methods of transport like buses or trains.

Bike-sharing is another example that benefits society in many ways, such as improving health by encouraging exercise, reducing pollution and easing congestion. Hubs are being added to various cities across Europe, with the Welsh Government recently announcing hub additions in Aberystwyth, Barry, Rhyl and Swansea.

Waste management

An effort introduced by Leeds Council is the use of smart waste and recycling bins. These bins use sensors to monitor their capacity while solar panels power a compactor inside to improve the utilisation of the limited space. It is possible to monitor the bin’s capacity using an app that also sends notifications when it needs to be emptied. This functionality enables the council to adapt its refuse and recycling collections. Being able to adapt the collections enables a reduction in the number of collection vehicles on the road, which in turn reduces emissions and eases congestion.
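As a rough illustration of how fill-level readings could drive collections, the sketch below selects only the bins that are close to full. The `BinReading` type, the bin identifiers and the 80% threshold are assumptions made for the example, not details of the Leeds scheme.

```python
from dataclasses import dataclass


@dataclass
class BinReading:
    bin_id: str        # hypothetical identifier
    fill_level: float  # fraction of capacity in use, 0.0 to 1.0


def bins_to_collect(readings, threshold=0.8):
    """Return the IDs of bins at or above the fill threshold, fullest first."""
    due = [r for r in readings if r.fill_level >= threshold]
    due.sort(key=lambda r: r.fill_level, reverse=True)
    return [r.bin_id for r in due]


readings = [
    BinReading("bin-a", 0.35),
    BinReading("bin-b", 0.92),
    BinReading("bin-c", 0.81),
]
print(bins_to_collect(readings))  # ['bin-b', 'bin-c']
```

Only two of the three bins need a visit here, which is the kind of saving that lets a council send fewer vehicles out.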

Redcar & Cleveland Borough Council has proposed a similar scheme involving adding RFID tags to household bins that will measure the weight of the contents and frequency of collections. Thinktank Social Market Foundation has theorised that a scheme like this could encourage recycling by enabling councils to provide tax cuts based on the amount of recycling done by a household. This is particularly important to councils across the UK, as the highest recycling rate in England was 64.5% in 2017 but the lowest was just 14.1%.

What are the barriers faced by the public sector?

There are a handful of main issues with the development of smart cities. Firstly, budget constraints are a major problem due to austerity measures and the perception that smart city infrastructure isn’t a necessary expense.

Secondly, deployment of these new technologies is slowed by regulation and bureaucracy. The time taken to develop a proposal, wait on the requested funding and then finally implement the solutions means that it can be months or even years before the changes are put in place and the organisation finally sees the dividends from their efforts.

Finally, the network capability required to effectively handle the new workloads has only recently begun rolling out. 5G networks were first introduced in late 2019; however, initial estimates suggest that majority coverage will take until around 2025.

Despite these obstacles, more and more of the public sector is adopting smart city technologies and innovating to improve the lives of their citizens. As the availability of the infrastructure and the number of successful case studies grows, the barriers faced are receding — which in turn lowers the overall cost of adoption thanks to the maturity phase of the product life cycle. As efforts to combat climate change gain traction, levelling up our cities to become digital powerhouses will become more prominent — but it’s clear that the seeds for change have already been sown.

How can we build more sustainable software?

Kyle Jones — Fri, 04 Jun 2021 00:00:00 +0000

In 2020, Brewdog announced its latest funding round, Equity for Punks Tomorrow, aiming to push its sustainability efforts even further. No small feat for a company that is already carbon negative. With the funding, they intend to replace their fleet of vehicles with fully electric alternatives, generate renewable energy for a number of their breweries, and a whole lot more.

But decarbonisation is not exclusively a private-sector trend — and various public-sector organisations are also turning their attention to more climate-friendly practices. Take the case of the Department for Environment, Food and Rural Affairs (DEFRA). In September 2020, DEFRA published Greening Government, a policy paper setting out its strategy for providing digital services that are more sustainable.

This policy is not an isolated case either, with other government departments such as the Treasury and the Department for Transport publishing similar papers in recent years.

As a growing number of industries are turning their attention to climate change and innovating to reduce their impact on the environment, we’re all on a journey to Net Zero — and that includes our tech. But how does software engineering fit into this picture, with the industry so handcuffed to its consumption of resources?

What resources are needed to build software?

With data centres already consuming an estimated 3% of global electricity, energy is at the core of any discussion on the resources used when developing software. Electricity is used at every step of the software development life cycle — from the design and engineering all the way to deployment and hosting. Every interaction has an associated cost that could be measured in kWh and can therefore also be measured in tonnes of carbon dioxide equivalents (CO₂e).

When we think about this in real terms, it all adds up. For example, a weekly one-hour meeting on Zoom with 6 participants releases an estimated 0.05kg of CO₂e per meeting. Over the course of a year, that results in the same emissions as driving a petrol car almost 10 miles.
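The arithmetic behind that comparison can be checked with a short sketch. The per-meeting figure is the one quoted above; the emission factor for a petrol car is an assumed average (real values vary by vehicle), so the result is indicative only.

```python
KG_CO2E_PER_MEETING = 0.05  # figure quoted above
MEETINGS_PER_YEAR = 52      # one meeting per week
# Assumed emission factor for an average petrol car; real values vary.
KG_CO2E_PER_MILE = 0.27

annual_kg = KG_CO2E_PER_MEETING * MEETINGS_PER_YEAR
equivalent_miles = annual_kg / KG_CO2E_PER_MILE

print(f"{annual_kg:.1f} kg CO2e per year")            # 2.6 kg CO2e per year
print(f"{equivalent_miles:.1f} miles in a petrol car")  # 9.6 miles in a petrol car
```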

There is also an indirect impact associated with the hardware required for the software — materials (especially metals and plastics) are required to manufacture the underlying hardware needed for servers and networking devices.

Finally, many data centres use liquid cooling to alleviate the excess heat produced by the servers, preventing damage to their components. The coolant used for these systems is usually water, a resource that is already scarce for over 3.2 billion people.

How can software engineers make greener services?

Migrating to the cloud

Traditional on-premises infrastructure only utilises between 12% and 18% of its capacity, whereas cloud server utilisation is around 65%. This statistic, alongside the fact that cloud customers consume 77% fewer servers, means that those same customers reduce their carbon footprint by a whopping 88%. Cloud infrastructure processed 60% of workloads in 2019, with that figure estimated to hit 94% by the end of 2021.

These statistics mean that the question is no longer whether or not we should migrate to the cloud. Instead, we should consider whether the cloud providers are providing their services in a sustainable manner.

Each cloud provider approaches sustainability differently, with Azure leading the way in innovating to reduce its impact. Microsoft has committed to four key environmental targets, including using 100% renewable energy by 2025, and has introduced liquid immersion cooling in its data centres to help reach these lofty targets.

At AWS, servers use free-air cooling systems that incorporate reclaimed or recycled water to provide direct evaporative cooling in summer months, ensuring the air is of a low enough temperature to sufficiently cool the infrastructure. Some sites have even installed on-site water treatment systems to reduce their overall water usage footprint.

Amazon, Microsoft and Google are also purchasing varying quantities of renewable energy from external sources to further reduce their reliance on fossil fuels.

Adopting serverless architecture

Migrating applications to the cloud is only one piece of the puzzle, however. Adopting a serverless architecture is critical as it enables us to use what is necessary with little-to-no waste. Using a serverless architecture means that your application runs on infrastructure managed by a cloud provider.

Examples of these types of infrastructure are AWS’ Fargate, Azure’s Cosmos DB and GCP’s Cloud Functions. This infrastructure is shared (unless a dedicated host/instance is requested) and billed on usage. Sharing infrastructure means utilisation of the underlying infrastructure is maximised, while billing based on usage encourages reduction and optimisation wherever possible.

Optimise your functionality

The final method of reducing the impact your services have on the environment is to optimise. Optimising the functionality working behind the scenes on an application or website, even if it only yields a performance boost of a few milliseconds, makes a considerable difference when viewed at scale. A few additional milliseconds taken when a single piece of functionality is used quickly multiply when talking about millions of uses and multiple applications that are powering hundreds of products.
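To get a feel for how a few milliseconds multiply, here is a minimal sketch. The workload numbers (5 ms saved on something called 10 million times a day) are hypothetical and chosen only to show the scale.

```python
def total_hours_saved(ms_saved_per_call, calls_per_day, days=365):
    """Total compute time saved over a period, in hours."""
    total_ms = ms_saved_per_call * calls_per_day * days
    return total_ms / 1000 / 3600


# Hypothetical workload: 5 ms saved on an endpoint called 10 million times a day.
print(round(total_hours_saved(5, 10_000_000), 1))  # 5069.4
```

Under these assumptions, a 5 ms improvement removes over 5,000 hours of compute time a year for a single piece of functionality.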

So how can we make websites more energy efficient?

Add a Dark Mode
Block bots and web scrapers
Design to use static content wherever possible
Implement caching and use a CDN
Reduce the quantity or size of images and videos
Reduce the variety of fonts
Use system fonts where possible

Some are even taking things a step further, developing free and open tools that allow users to make sustainability more visible. Examples include the Cloud Sustainability Console, green hosting checks and carbon footprint calculators for websites.

How can non-technical teams help?

Other teams can reduce a company’s impact in several ways. They can contribute indirectly by setting sustainability targets and monitoring how the business adheres to these targets. Microsoft has built a Sustainability Calculator to make monitoring and reporting progress towards environmental targets easier for their customers.

They can also impact the company’s carbon footprint more directly. Organisations must empower their teams to participate in actions that, although small, all add up, such as:

Car sharing
Cycle to Work schemes
Recycling bins in office spaces
Sustainable suppliers

Net Zero is no small feat for any industry or geography. We all have to take whatever small steps we can to wean ourselves off wasteful, carbon-emitting processes. These small steps can then add up to visibly significant changes. This is particularly difficult for industries that have become accustomed to a particular way of working, and software engineering is far from alone in that.

How Can We Identify Emerging Influencers Using Machine Learning?

Kyle Jones — Tue, 13 Apr 2021 07:53:29 +0000

In the age of social media, marketing teams are turning to influencers more than ever to advertise their new, innovative products or services. The number of followers, impressions and engagements all impact traffic, which in turn helps drive one very important metric - sales. This makes the ability to identify the next up-and-coming influencer all the more important and valuable, with 61% of marketers agreeing that it’s difficult to find the right influencers for a campaign.

Influencer marketing is an industry that has seen rapid growth over the past few years with an increase between 40 and 50 percent year on year, according to Influencer Marketing Hub. It allows companies to increase brand awareness and trust through directly interacting with their target audience in a way never seen before in other strands of marketing.

Take Gymshark. Recently valued at £1.3bn, this fitness clothing brand was one of the earliest adopters of the influencer marketing model. It is renowned for marketing products through its community of Instagram influencers and YouTubers, and its success is a testament to the importance of influencer marketing, showing that when done right it can be a key driver of exponential growth.

As an industry, influencer marketing is expected to hit $10bn in 2020, growing enormously from just $3bn in 2017. Increasingly companies are valuing the involvement of influencers in their marketing, so it is now more important than ever to be able to identify emerging influencers and those with the most potential to drive success for a brand.

So what makes an influencer?

The most obvious influencers are celebrities - individuals with large numbers of followers who are often idolized. Similarly, other types of influencers include industry experts, thought leaders and bloggers.
These categories, although relatively vague terms, all have some similar traits that provide us with some insight on what features to look for when trying to identify the next big influencer.
Common traits in these categories are that the individuals often have a sizeable audience that they engage with through some kind of platform, and in most cases, they are individuals who talk about topics in a particular niche. For example, Andrew Ng could be considered an influencer due to his sizeable audience on Twitter, where he regularly posts about machine learning.

How do we quantify influence?

As with many machine learning problems, being able to express a complex idea like influence in measurable terms is paramount to identifying trends. In the context of social media, there are three main types of data we can use:

Number of Followers
Number of Impressions
Engagements

The number of followers and impressions, when measured over time, gives us an idea of the reach of the potential influencer. Engagements then give us information about how the individual is able to convert that reach, giving us insight into how likely the users that are viewing the posts are to be motivated into doing a particular thing, such as purchasing a product.
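One way to turn these raw counts into comparable numbers is sketched below. Both formulas (engagements per impression, and relative follower growth over a period) are common conventions assumed for this example rather than definitions taken from any particular platform.

```python
def engagement_rate(engagements, impressions):
    """Engagements per impression; 0.0 when there are no impressions."""
    if impressions == 0:
        return 0.0
    return engagements / impressions


def follower_growth(followers_start, followers_end):
    """Relative change in followers over a period; 0.0 if starting from zero."""
    if followers_start == 0:
        return 0.0
    return (followers_end - followers_start) / followers_start


# Hypothetical account: 450 engagements from 15,000 impressions,
# growing from 2,000 to 2,500 followers over the period.
print(engagement_rate(450, 15_000))   # 0.03
print(follower_growth(2_000, 2_500))  # 0.25
```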

How do we predict an emerging influencer?

A paper published by students at Stanford University discusses a potential solution, while Kaggle (a website which hosts data science and machine learning competitions) has several others. Kaggle's most effective entry makes use of a Bayesian-optimized Light Gradient Boosted Machine (LightGBM) trained on data that compares two individuals' social media statistics, including those discussed above. Bayesian optimization is a process that allows us to fine-tune hyper-parameters for a machine learning model. LightGBM is a relatively new gradient boosting framework that uses tree-based learning and differs from other tree-based algorithms in that its trees grow vertically (leaf-wise) rather than horizontally (level-wise).

Some advantages of LightGBM are that it is fast, requires less memory and supports GPU learning; however, it is quite prone to overfitting. In this implementation, the LightGBM model correctly identifies which of the two individuals is the more influential with around 87% accuracy, making it well suited to deciding between a relatively small number of potential influencers. However, despite being one of the most accurate machine learning solutions for this problem, it is held back by the need for a large amount of curated data in order to be effective. This means that a large volume of user data needs to be gathered via an API or by web scraping, and then pre-processed into the format expected by the model.
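The pre-processing step for a pairwise model can be sketched without any machine learning library. The example below turns two individuals' statistics into a single row of differences, which is one plausible input format for a classifier that predicts the more influential of the two. The feature names and the use of simple differences are assumptions for illustration; the Kaggle entry's actual feature engineering is not described here, and model training is omitted.

```python
FEATURES = ("followers", "impressions", "engagements")  # assumed feature names


def pairwise_features(a, b):
    """Feature row for the pair (a, b): the difference a - b for each feature."""
    return [a[name] - b[name] for name in FEATURES]


# Hypothetical statistics for two individuals.
person_a = {"followers": 12_000, "impressions": 90_000, "engagements": 4_500}
person_b = {"followers": 30_000, "impressions": 60_000, "engagements": 1_200}

row = pairwise_features(person_a, person_b)
print(row)  # [-18000, 30000, 3300]
```

Each row would be paired with a label saying which individual was judged more influential, and a gradient boosting model could then be trained on many such rows.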

What can we do to improve the implementation?

The difficulty in quantifying influence is that these values better describe an individual’s reach rather than their influence. However, we could also use Natural Language Processing techniques such as sentiment analysis, named entity recognition and document classification to analyse the post itself, along with the text-based engagements. This would give us a better understanding of what topics and what type of posts get a better reach as well as which of these garner the most positive interactions.

How do you decide if the influencer you’ve predicted is right for you?

Even with the work mentioned above, the influencer may not necessarily be the right fit for the strategy you have in mind. In order to build brand awareness or to increase a social following, you might need to choose a macro-influencer, whereas to reach an ideal audience a micro-influencer might be a better choice. The relevancy of the influencer's usual posts and the quality of the post could also contribute to choosing a different influencer.

How to Conduct a Constructive Code Review

Kyle Jones — Tue, 16 Mar 2021 00:00:00 +0000

Many businesses have a dedicated software engineering department, whether to maintain a website or develop apps. The majority of these departments will perform code reviews as part of their daily routine. But what’s the difference between a negative experience and a constructive review?

What is a code review?

A code review is a common practice in software engineering that aims to ensure the quality of an implementation is maintained by having the author’s work examined by one or more of their peers.

Where should you start with a code review?

Productive code reviews should always start with two elements: the ticket for the issue and any test files in the change request. The ticket will include an overview of the scenario and any further information required for the review, such as designs or non-technical considerations. These help to understand the context behind the work. The test files detail the various scenarios and outcomes that the functional changes are hoping to solve and achieve, giving the reviewer a better understanding of the problem and the proposed solution. When reviewing the test files (and later on the functional code itself), it is essential to consider any edge cases that may have been overlooked. These edge cases may seem insignificant but could result in a service outage due to unexpected behaviour. Following on from these, the reviewer should turn their attention towards the files with the most changes in them. These files are likely to be the core of the implementation, so starting with them saves time when reviewing.
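Finding the files with the most changes can be automated. The sketch below parses the tab-separated output of `git diff --numstat` (added lines, deleted lines, path) and orders files by total lines changed. The sample output and file names are made up for the example.

```python
def files_by_change_size(numstat_output):
    """Parse `git diff --numstat` output into (path, changed_lines) pairs,
    largest first. Binary files, reported with '-' counts, count as 0."""
    results = []
    for line in numstat_output.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        changed = sum(int(n) for n in (added, deleted) if n != "-")
        results.append((path, changed))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical numstat output for a change request.
sample = "3\t1\tREADME.md\n120\t45\tsrc/billing.py\n-\t-\tassets/logo.png\n"
print(files_by_change_size(sample))
# [('src/billing.py', 165), ('README.md', 4), ('assets/logo.png', 0)]
```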

How should you conduct the code review?

A constructive code review is often tough to achieve. Teams and individuals will often only identify the problems with the change request, forgetting that a review is an evaluation of both the strong and weak points. Not many reviewers praise the positives of an implementation, even when a solution is elegant or when edge cases are addressed. Some reviewers will also identify a problem and proceed to solutionise. Solutionising is the responsibility of the author of the change and not the reviewer. Reviewers that solutionise spend time that could go towards other problems, while also demotivating the author. Many software engineers thrive on the problem-solving element of their work and would prefer to do this themselves.

What about the finer things?

The reviewer should also avoid demanding that trivial changes are made before approving the work. Identifying issues such as spelling mistakes, poor naming and micro-optimizations is useful as they affect the code quality. However, concerns like these should be noted and fixed in a follow-up change. The purpose of a change request is to improve the overall product and codebase, so it doesn’t necessarily have to be perfect on the first iteration. When reviewing a change request, pay attention to the finer details like file paths and names referenced in the code. These are easily overlooked, particularly during refactoring, but will break the feature if they are not correct. Similarly, with any resources or files referenced, confirm that these are in the change.

When should you do the review?

Many engineers are reluctant to carry out code reviews because of the need to context switch. To avoid unnecessary context switching, conduct code reviews before or after another unavoidable context switch such as a break or meeting. Doing so provides the reviewer with a choice of regular but flexible times to conduct reviews. Despite the focus on flexibility, code reviews should not stagnate for more than a day. The velocity of the team is more important than the velocity of the individual. It’s easy for multiple reviews to stack up and begin to block other engineers or tickets.

How to Perform a Successful Incident Postmortem

Kyle Jones — Tue, 16 Feb 2021 00:00:00 +0000

As a business grows from a startup to a large enterprise, its software systems also expand in complexity, meaning that encountering incidents is inevitable. These incidents have indirect costs that can include loss of trust in the product or brand. However, incidents are not necessarily a bad thing - they provide the business with new learning opportunities and the chance to improve its operational practices. But how do we learn from such a failure?

What is a postmortem?

An incident postmortem is a meeting that brings together all of the people that were involved, whether directly or indirectly, to discuss, document and derive value. Postmortems should remain positive to avoid demotivating the technical teams further. Postmortems should also be blameless - this empowers engineers to provide details of their contribution and prevents the establishment of a fear culture. Learning should be the focus rather than dwelling on what went wrong, and so any past tense discussions should stick to facts instead of opinions - avoiding phrases that include would have, could have or should have is essential.

Who should get invites?

A variety of different stakeholders should get invites to the postmortem, including:

Individuals/Teams that First Logged the Incident
Individuals/Teams that Responded to the Incident
Individuals/Teams that Diagnosed the Issue(s)
Individuals/Teams that Rectified the Issue(s)

It may also be appropriate to invite a user that was directly affected by the incident. Inviting a range of different people like this maintains transparency and allows us to glean as much information as possible to document.

What should get documented?

Thorough documentation of the event is vital. Memories are short, and in time important details fade into obscurity. Some key facts to document are:

Date and Time that the Incident Started
Date and Time that the Incident was First Logged
Which Teams/Individuals Responded to the Incident
Number of Users/Accounts Affected
Number of Support Requests Raised
Date and Time that the Incident was Fixed
Any Solutions or Mitigations

In addition to the above details getting recorded in a postmortem document, the meeting should also have minutes taken, and a timeline of the incident constructed.
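Capturing these facts in a structured record makes it easy to derive figures such as how long the incident went undetected. The sketch below is one possible shape for such a record. The `IncidentRecord` type, its field names and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    started_at: datetime
    first_logged_at: datetime
    fixed_at: datetime
    users_affected: int
    support_requests: int

    @property
    def time_to_detect(self) -> timedelta:
        """How long the incident ran before it was first logged."""
        return self.first_logged_at - self.started_at

    @property
    def time_to_fix(self) -> timedelta:
        """Total duration from the start of the incident to the fix."""
        return self.fixed_at - self.started_at


# Made-up example values.
incident = IncidentRecord(
    started_at=datetime(2021, 2, 1, 9, 0),
    first_logged_at=datetime(2021, 2, 1, 9, 25),
    fixed_at=datetime(2021, 2, 1, 11, 0),
    users_affected=1200,
    support_requests=37,
)
print(incident.time_to_detect)  # 0:25:00
print(incident.time_to_fix)     # 2:00:00
```

Derived durations like these can be compared with an agreed limit, for example `incident.time_to_fix <= timedelta(hours=4)`, where the four-hour limit is again only an illustrative value.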

What should you do after the postmortem?

The postmortem produces discussion points to bring into further sessions and documentation that will prove valuable in the future. Values recorded should be compared against any service level agreements (SLAs) that may be in place, to confirm that the incident did not result in any breaches. Any issues identified as a result of the incident should be discussed in-depth, with potential solutions or mitigations planned into the roadmap, alongside rigid delivery dates. These solutions/mitigations should have tickets written to capture the work, each of which should be SMART (specific, measurable, achievable, relevant and time-bound). Depending on the incident, particularly regarding who first logged the incident and how long it had been ongoing before being logged, improvements to observability may be required. Observability improvements should be a priority alongside immediate solutions to the faults.

If an external user reported the issue, it might be pertinent to publish the postmortem's findings openly, allowing anyone access. Publishing postmortem outcomes publicly has most notably been utilised by Monzo. It enables them to maintain transparency, ensures accountability as a business and has provided their users with greater trust in the brand.

How Hacktoberfest and Open-Source Software are Driving Innovation

Kyle Jones — Mon, 01 Feb 2021 20:28:35 +0000

Business landscapes around the world are constantly evolving, with every organisation clamouring to innovate so that they can better react to the changing conditions in which they exist and uncover new opportunities to help them to drive their growth and profits. But in a world where true innovation is scarce and many companies resort to rehashing the same ideas, how do you foster an environment that cultivates ideation?

Why open-source?

Open-source software has seen its fair share of innovative languages, tools, frameworks and products in recent years. Some notable examples include Docker, Kubernetes, Gatsby, React and Elasticsearch - not to mention numerous machine learning frameworks like TensorFlow. The most interesting element of this growth is the fact that a number of these open-source products and tools are generating revenue like regular software development companies that do not openly publish their source code, often through hybrid licensing or hosting. The industry as a whole is already estimated to be worth over $20bn and is expected to continue growing to reach around $33bn by 2022.

In 2014, DigitalOcean unveiled its first open-source hackathon, Hacktoberfest. The requirements have changed over the years but the core message has stayed the same: several contributions to public code repositories that positively impact the projects earn you a limited edition t-shirt and some stickers. Fast forward to 2020 and Hacktoberfest has grown to become a juggernaut that drives the open-source community to new heights year after year. Key metrics for the event have soared, from 768 participants in its first year to a whopping 61,871 completing the challenge in 2019 - an impressive growth of almost 8,000 percent! But why has this explosive growth happened, and how does it affect businesses?
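Percentage growth between two counts is (new - old) / old × 100. Applied to the participation figures quoted above, it comes to roughly 7,956 percent, or about an 80-fold increase. The helper name below is just for illustration.

```python
def percent_growth(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


print(round(percent_growth(768, 61_871)))  # 7956
```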

Why has this explosive growth happened?

Both open-source software and hackathons encourage collaboration between a number of skilled individuals, often rallying them around a single goal. This leads to a positive environment for a number of like-minded people to bounce ideas for new features (or even whole new products) off each other. In the case of hackathons, people are also motivated by the prospect of a reward or prize. Hacktoberfest’s offering of merchandise in return for contributions is part marketing campaign and part motivator, encouraging participation and contributions to both the event and the wider open-source community. This amalgamation motivates by fulfilling a number of elements from Maslow’s Hierarchy of Needs, fostering a sense of esteem and belonging amongst contributors, as well as providing them with clothing and shared knowledge in return.

How does it affect businesses?

Over the years, a variety of well-known companies have contributed to, sponsored and supported Hacktoberfest, with notable examples ranging from Indeed to Twilio to Auth0. A handful of these, including Indeed and SendGrid (which has recently been acquired by Twilio), have published papers and talks discussing their takeaways and the associated statistics from their participation in Hacktoberfest.

Companies participating in Hacktoberfest have seen the participation in their projects increase in recent years, helping to accelerate their product development. Umbraco recorded a growth of 40% in participation between 2018 and 2019, with the total number of pull requests in 2019 coming in at 436. Most recently, Indeed talked about the statistics from their internal Hacktoberfest events at DevRelCon Earth 2020. Between 2018 and 2019, they recorded an increase of 1,322 contributions and 170 pull requests. SendGrid saw the number of pull requests increase from 43 in 2016 to 1,249 in 2017, with over 1,000 of those being changes to source code. These pull requests equated to 1,385 story points of effort (around 3.4 years of work!).

Innovation is vital in a highly competitive world, with each organisation scrambling to stay one step ahead of their competitors. Companies that participate in open-source hackathons and Hacktoberfest create a culture of innovation and are seeing dividends in the form of increased development velocity. Failing to iterate and innovate means that a business’ products or services can start to stagnate, which can be catastrophic in fast-changing markets. Cultivating and encouraging horizon scanning, ideation and iteration are fundamental to survival in the modern world of business.

How to Reduce the Chances of an Outage

Kyle Jones — Mon, 25 Jan 2021 00:00:00 +0000

The popular messaging service Slack recently experienced a global outage which impacted millions of users, the majority of whom were working from home or learning remotely due to the Coronavirus pandemic. The outage lasted for an extended period, potentially impacting their service level agreements (SLAs) and damaging the brand, and it came hot on the heels of the announcement of their $27.7 billion acquisition by Salesforce. Outages like the one Slack experienced are increasingly common in a technology-focused society, so how do we avoid costly issues like this?

What causes an outage?

An outage (also known as downtime) is a period of time when a given service or system is unavailable and fails to perform its primary functionality. Completely removing the possibility of experiencing an outage is almost impossible. However, there are a number of ways in which the odds of an outage can be significantly reduced. Some of the things that can be done to try and avoid an incident are:

Testing
Horizontal Scaling
Chaos Engineering
Resiliency Improvements to the Software
Improving Observability
Improving Processes

What testing should be carried out?

Having a high test coverage is only part of the story when it comes to testing an application's robustness - a high unit test coverage can be deceiving as this simply proves that the software functions as expected, in isolation. Unit testing should be combined with other forms of testing including integration testing, security testing and performance testing (which should also cover various subtypes such as load testing and stress testing). Including a wider array of testing types ensures that both the functional and non-functional factors that could impact the overall system are being checked under various conditions that could occur in production.

What is horizontal scaling?

Horizontal scaling improves an application's stability and robustness. It simply means that each component in the system can increase its capacity by adding more instances. For example, a simple CRUD application that uses RDS in AWS might horizontally scale by adding additional MySQL RDS instances as read replicas in multiple availability zones (AZs).
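To make the read replica example a little more concrete, the sketch below shows one way an application could spread read queries across replicas while sending writes to the primary. It is only an illustration: the ReplicaRouter class and the hostnames are hypothetical and are not part of RDS or any AWS library, and treating every statement that starts with SELECT as a read is a deliberate simplification.

```python
import itertools


class ReplicaRouter:
    """Send writes to the primary and spread reads across the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replicas have been added yet.
        self._replicas = itertools.cycle(replicas or [primary])

    def endpoint_for(self, sql):
        # Simplification: anything starting with SELECT counts as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


# Hypothetical hostnames, one replica per availability zone.
router = ReplicaRouter(
    primary="db-primary.example.internal",
    replicas=["db-replica-a.example.internal", "db-replica-b.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
```

Adding capacity for reads then becomes a matter of adding another entry to the replica list, which is the essence of scaling horizontally rather than moving to a larger instance.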

What is chaos engineering?

Chaos engineering is a form of resiliency testing created by Netflix when they built their Chaos Monkey application, and functions by turning off random components in their architecture to simulate outages for particular services. This aided Netflix in their migration to a cloud hosted infrastructure by allowing them to identify potential pitfalls, particularly by finding dependencies that, when removed, would have caused incidents in a production environment. Doing this regularly in a non-production environment allowed these incidents to occur in a controlled environment that does not impact customers, mitigating the potential impact to the business.
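The idea can be illustrated with a toy experiment. The sketch below is not Chaos Monkey and does not use any Netflix tooling. It assumes a hypothetical handle_request function that receives the set of components currently running, switches one component off at random in each round, and records which removals cause the request to fail.

```python
import random


def run_chaos_experiment(components, handle_request, rounds=20, seed=0):
    """Disable one random component per round and collect the ones whose
    removal makes the request fail."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    names = sorted(components)
    critical = set()
    for _ in range(rounds):
        victim = rng.choice(names)
        healthy = {name for name in names if name != victim}
        try:
            handle_request(healthy)
        except Exception:
            critical.add(victim)
    return critical


def handle_request(healthy):
    """Hypothetical service: needs the database, tolerates a missing cache."""
    if "database" not in healthy:
        raise RuntimeError("database unavailable")
    return "from cache" if "cache" in healthy else "from database"


critical = run_chaos_experiment({"cache", "database", "search"}, handle_request)
print("Removing these components caused failures:", critical)
```

Any component that ends up in the returned set is a dependency that would have caused an incident, which makes it a candidate for redundancy or a fallback.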

How do we improve resiliency?

There are a number of resiliency improvements that can be made to an application in order to reduce the chances of it causing or intensifying an outage. Some simple improvements can include adding an exponential backoff retry strategy to calls to other services, building circuit breaker functionality into the system and implementing queues between services in order to provide a smoother, more consistent throughput to the service. Implementing queues also brings with it the advantage of assisting in decoupling the components from each other, meaning that each part of the system is more independent and has fewer dependencies. These strategies can also help to reduce load and enhance the fault tolerance of the application, both of which result in a higher level of resiliency.
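As a rough illustration of the first two techniques, the sketch below shows a retry helper with exponential backoff and a very small circuit breaker. The function and class names, the default thresholds and the use of random jitter are all assumptions made for the example. A production system would more likely rely on an established library than on hand-rolled code like this.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each time (base, 2*base, 4*base, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter stops many clients from retrying in lockstep.
            sleep(random.uniform(0, delay))


class CircuitBreaker:
    """Stop calling a failing dependency for `reset_after` seconds."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            # Cool-down elapsed: allow one trial call (the half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The two can be combined, for example by wrapping breaker.call(fetch) inside retry_with_backoff, so that a struggling downstream service is given progressively more breathing room and is eventually left alone until it has had time to recover.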

What about observability?

Successful deployment of a component is only part of the story. Many deployments appear successful at first before running into issues due to use of resources such as memory, an unexpected failure that occurs irregularly or even security issues that are not at first apparent. This is why observability is incredibly important: logs, metrics and traces allow you to continually monitor your system, while alerts notify your technical team when a component begins to show signs of an underlying problem.
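A simple way to picture an alert is as a rule evaluated against a stream of metric samples. The sketch below is a minimal, hypothetical example rather than the API of any monitoring product: it fires once the average of the most recent samples rises above a threshold, which avoids paging anyone over a single brief spike.

```python
from collections import deque


class ThresholdAlert:
    """Fire when the average of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # old samples drop off automatically

    def record(self, value):
        """Store a sample and return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average > self.threshold


# Hypothetical memory usage readings, as a percentage.
memory_alert = ThresholdAlert(threshold=80, window=3)
for reading in [70, 75, 78, 90, 95]:
    if memory_alert.record(reading):
        print(f"ALERT: average memory usage above 80% (latest reading {reading}%)")
```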

What processes should be in place?

Good processes should underpin the entire engineering practice and ensure that human error is avoided or that any incidents that occur are rectified swiftly. Code reviews should be carried out to mitigate as many potential issues as possible, whereas processes like on-call policies and incident post-mortems foster a culture of continually iterating and learning from mistakes when they do occur.

While these methods do not necessarily prevent outages, they are strong foundations when it comes to reducing the odds of an incident occurring and it causing a total outage. These incidents are rare, but when they do occur they can prove costly to an organisation and preparation should be in place to minimise these costs and the chances of such an occurrence.

A Deep Dive into AWS Firecracker

Kyle Jones — Thu, 02 Apr 2020 13:28:06 +0000

Firecracker is a Virtual Machine Monitor, written in Rust, that Amazon Web Services use to power its Serverless Compute services - Lambda and Fargate. Firecracker makes use of Linux's Kernel-based Virtual Machine (KVM) virtualisation infrastructure to provide these services with MicroVMs.

What's the Point?

The development of Firecracker was undertaken to meet several objectives. These were:

To run thousands of functions (up to 8000) on a single machine with minimal wasted resources.
To allow thousands of functions to run on the same hardware, protected against a variety of risks including security vulnerabilities, such as side-channel attacks like Spectre.
To perform similarly to running natively, with no impact from the consumption of resources by other functions, retaining the possibility of overcommitting resources while providing each function with only the resources it needs.
To be able to start new and clean up old functions quickly.

So How Does It Work?

The invoke traffic gets delivered via the Invoke REST API, which authenticates requests, checks for authorization and then loads the function metadata.

The requests are then handled by the Worker Manager, which sticky-routes to as few workers as possible to improve cache locality, enable connection re-use and amortize the cost of moving and loading customer code. Once the Worker Manager has identified which worker should run the code, it advises the Invoke service, cutting down on round-trips by having it send the payload directly to the worker.

Each worker potentially offers thousands of MicroVMs, each providing a single slot and Firecracker process, with each slot only ever used for a single concurrent invocation of a function, but many serial invocations. Each slot supplies a pre-loaded execution environment for a function, including a minimized Linux kernel, userland and a shim control process. This method is like that offered by QEMU, Graphene, gVisor and Drawbridge (and by extension, Bascule) in that they provide some of the operating system functionality within the userspace to reduce the kernel surface and so improve security. On serial invocations, the MicroVM and the process the function runs in are re-used.

If a slot is available, the Worker Manager performs a lightweight concurrency control protocol and informs the front-end that the slot is available for utilization. The front-end then calls the MicroManager with the details of the slot and payload, which is then passed onto the shim running inside the MicroVM for that slot. The MicroManager keeps a small pool of pre-booted MicroVMs ready to be used, as the already fast 125ms boot-up time offered by Firecracker is still not fast enough for the scale-up path of Lambda. Upon completion, the MicroManager gets given either a response payload, or the details of an error which are then returned to the front-end.

However, if no slots are available, the Worker Manager calls the Placement service to request that a new slot gets created for the function. This service then optimizes the process (taking less than 20ms on average), ensuring that the use of resources such as CPU is even across the fleet, before requesting that a particular worker generates a new slot. To reduce blocking of user requests, the MicroManager keeps a small pool of pre-booted MicroVMs ready to be used when requested by the Placement service.
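The slot handling described above can be sketched in a few lines. This is a heavily simplified model and not AWS's implementation: the class and method names are hypothetical, a MicroVM is represented by a plain string, the Placement service is reduced to picking the least-loaded worker, and the warm pool is never refilled.

```python
from collections import defaultdict, deque


class Worker:
    """One host that keeps a small pool of pre-booted MicroVMs."""

    def __init__(self, name, warm_pool_size=2):
        self.name = name
        self.warm_pool = deque(f"{name}-vm{i}" for i in range(warm_pool_size))
        self.slots_in_use = 0

    def create_slot(self, function_id):
        # Take a pre-booted MicroVM so the request does not wait for a boot.
        microvm = self.warm_pool.popleft()
        self.slots_in_use += 1
        return (self.name, microvm, function_id)


class WorkerManager:
    def __init__(self, workers):
        self.workers = workers
        self.idle_slots = defaultdict(list)  # function_id -> reusable slots

    def acquire_slot(self, function_id):
        if self.idle_slots[function_id]:
            # A slot for this function already exists and is free: reuse it.
            return self.idle_slots[function_id].pop()
        return self.place(function_id)

    def place(self, function_id):
        # Stand-in for the Placement service: choose the least-loaded
        # worker that still has a pre-booted MicroVM available.
        candidates = [w for w in self.workers if w.warm_pool]
        if not candidates:
            raise RuntimeError("no capacity for a new slot")
        worker = min(candidates, key=lambda w: w.slots_in_use)
        return worker.create_slot(function_id)

    def release_slot(self, slot):
        # Invocation finished: keep the slot for later serial invocations.
        self.idle_slots[slot[2]].append(slot)


manager = WorkerManager([Worker("worker-a"), Worker("worker-b")])
slot = manager.acquire_slot("resize-image")  # no slot yet, so one is placed
manager.release_slot(slot)
print(manager.acquire_slot("resize-image") == slot)  # the same slot is reused
```

The sketch leaves out sticky routing, concurrency control and error handling. Its only purpose is to show the two paths: reusing an existing slot, and falling back to placement when none is free.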

For each MicroVM, the Firecracker process handles creating and managing the MicroVM, providing device emulation and handling VM exits.

The shim process communicates through the MicroVM boundary using a TCP/IP socket with the MicroManager - a process that manages a single worker's Firecracker processes. The MicroManager provides slot management and locking APIs to the Placement service and an invoke API to the front-end.

As an extra level of security against unwanted behaviour (including code injection), a jailer implements a wrapper around Firecracker which puts it in a restrictive sandbox before booting the guest.

Further Reading - Firecracker: Lightweight Virtualization for Serverless Applications