DEV Community: Marc-Henry Geay

CyberGordon - CDN cache optimization

Marc-Henry Geay — Sun, 30 Oct 2022 12:39:52 +0000

Web cache introduction

A Web cache is a system that speeds up Web page loads. It is implemented both client-side (browser) and server-side (Content Delivery Network (CDN) or Web server).

Web cache optimization brings several benefits:

On the user side, the web page is displayed faster; very useful during mobile browsing. 🐎
On the server side, the load is lower, and with the cloud model of pay-as-you-go, this results in cost reductions. 💰
And you reduce your carbon footprint by eliminating many unnecessary requests processed by browsers and servers. 🌱

The context of my project

The static resources of the CyberGordon website (images, CSS, JS, JSON) are hosted in an AWS S3 bucket behind AWS CloudFront (CDN). The architecture is presented in a dedicated blog post.

The HTTPS queries of these static resources represent half of the overall query volume: an average user browsing (home page --> create an analysis --> view results) represents 9 static and 8 dynamic queries. The latter requests are used to create the analysis and retrieve the results as well as the overall CyberGordon statistics displayed on the homepage.

So in March 2021, following an AWS post, I started optimizing the cache of these static resources as much as possible. I got good results and a lower bill! 👌

How I optimized the cache

To achieve my goal, I used 3 features:

Cache-Control header: add a cache directive on static resources sent to users to be respected by the web browser.
CloudFront cache policies: refine the CDN to be able to manage the default cache and reset it if needed.
File versioning: long-term caching of resources that almost never change and replacing them with new ones quickly.

Production & automation requirements
It is essential to have a strategy that keeps control over the cache when an object changes (security fix, dependency upgrade, new logo, etc.), so that the cache can be reset along the whole chain and the user gets the new object when returning to the website.

I have decided to have a maximum delay of 30 minutes in case an item needs to be changed.

To avoid manual tasks, I also tried to automate all repetitive tasks.

User cache - Cache-Control header

The Cache-Control HTTP header is useful to give instructions on the cache of the object received by a Web browser or an intermediary service (proxy, CDN, ...). In a simplified way, this header received by the web browser indicates if the object can be cached and for how long.

I want visitors to keep objects in cache for the span of one or two visits, so I set a 30-minute client-side cache for web pages.

I added Cache-Control header on the following S3 objects:

html - max-age=1800 --> 30 min cache
js, css, svg, jpg, png, gif, pdf, json (except for statistic data), txt, xml - max-age=31536000, immutable --> 1 year cache

To automate the addition of this header during deployment, I updated my CyberGordon Terraform configuration to automatically add a specific Cache-Control header based on the file extension.
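As an illustration, the extension-to-header mapping could be sketched in Python as follows. This is not the Terraform configuration used on CyberGordon, only a minimal sketch of the two rules listed above. The function name `cache_control_for` is hypothetical, and giving the statistic JSON files a 30-minute value is an assumption based on the summary table later in this post.

```python
from pathlib import PurePosixPath
from typing import Optional

SHORT_CACHE = "max-age=1800"                # 30 minutes
LONG_CACHE = "max-age=31536000, immutable"  # 1 year

# Extensions that get the one-year header, as listed in the post.
LONG_CACHE_EXTENSIONS = {
    ".js", ".css", ".svg", ".jpg", ".png", ".gif",
    ".pdf", ".json", ".txt", ".xml",
}


def cache_control_for(key: str) -> Optional[str]:
    """Return the Cache-Control value for an S3 object key, or None."""
    path = PurePosixPath(key)
    suffix = path.suffix.lower()
    if suffix == ".html":
        return SHORT_CACHE
    # Assumption: statistic files are named stats_*.json and follow the
    # 30-minute rule instead of the one-year rule.
    if suffix == ".json" and path.name.startswith("stats_"):
        return SHORT_CACHE
    if suffix in LONG_CACHE_EXTENSIONS:
        return LONG_CACHE
    return None  # unknown extension: no header added


if __name__ == "__main__":
    for key in ("index.html", "assets/js/bootstrap.min.js"):
        print(key, "->", cache_control_for(key))
```

Returning None for unknown extensions keeps the sketch conservative: a file type that was never reviewed does not silently receive a one-year cache.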

CDN cache - CloudFront cache policies

CloudFront is extremely powerful but when you want to push its settings further, it becomes quite complex. I won't do a detailed presentation, as that would be a whole post, but you can learn more about it in the documentation.

A Cache Policy allows you to specify the criteria (URL path, query string, headers, etc) to store an object in the CDN cache. In addition to the cache, we can enable compression of resources in transit to reduce the volume transferred between the CDN and the Web browser.

I have applied 3 Cache Policies on CloudFront:

Default (*) --> All static resources: 30 days cache with Gzip + Brotli compressions
assets/json/stats_* --> Statistic updated every 30 min, so 30 min cache with Gzip + Brotli compressions
request/*, get-request/*, r/*, contact-message --> Dynamic requests, so no cache

These Cache Policies are easily managed through Terraform.
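To make the routing between these three policies concrete, here is a small Python sketch that picks a cache duration from a request path. It only imitates the idea of path patterns with the standard `fnmatch` module; it is not how CloudFront evaluates behaviors internally, and the names `CACHE_BEHAVIORS` and `ttl_for` are hypothetical. Compression settings are left out.

```python
from fnmatch import fnmatchcase

# (path pattern, cache duration in seconds), most specific first.
# The "*" default must come last, otherwise it would shadow the others.
CACHE_BEHAVIORS = [
    ("assets/json/stats_*", 30 * 60),  # statistics: 30 minutes
    ("request/*", 0),                  # dynamic requests: no cache
    ("get-request/*", 0),
    ("r/*", 0),
    ("contact-message", 0),
    ("*", 30 * 24 * 3600),             # all other static resources: 30 days
]


def ttl_for(path: str) -> int:
    """Return the cache duration of the first pattern matching the path."""
    path = path.lstrip("/")
    for pattern, ttl in CACHE_BEHAVIORS:
        if fnmatchcase(path, pattern):
            return ttl
    raise LookupError(f"no behavior matches {path!r}")


if __name__ == "__main__":
    print(ttl_for("/assets/css/site.css"))
```

The ordering matters more than the patterns themselves: the catch-all entry is only reached when nothing more specific matched.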

Static resources cache - File versioning

Cache file versioning aka Cache busting is a very simple but very effective way to force a client to retrieve the new version of an updated resource by adding the 'version' of the resource to the file name.

The example below shows a real case of the Bootstrap JavaScript file used on the CyberGordon website.

A portion of the file hash (SHA-256) is added to the file name in order to have a unique name for each Bootstrap version upgrade.
In the code of the HTML page, the unique name of the Bootstrap file is used. The web page (index.html) keeps its file name for life and is only cached for 30 minutes. Therefore, if a static resource is updated, only its reference in the HTML page is changed with a new hash, and the client will have loaded the new version within 30 minutes.

This method works for every static file type: I use it for all JavaScript, CSS and image files (like the logo highlighted in gray in the image above), and also for the changelog JSON file.

Here again, I have automated with Terraform the creation of the file name including the hash as well as the insertion of this unique name in the HTML pages that reference it. The code is horrible to read: a mixture of Join, Regex and Substr functions... but it works!
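The same idea can be sketched in a few lines of Python, as a readable illustration rather than a copy of the Terraform code. The hash length (8 characters), the position of the hash in the file name and both function names are assumptions; the post only states that a portion of the SHA-256 hash is added to the name.

```python
import hashlib
from pathlib import PurePosixPath

HASH_LENGTH = 8  # assumption: the post only says "a portion of the file hash"


def versioned_name(filename: str, content: bytes) -> str:
    """Insert a short SHA-256 prefix before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:HASH_LENGTH]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


def update_references(html: str, renames: dict) -> str:
    """Replace each original file name in the HTML with its versioned name."""
    for old, new in renames.items():
        html = html.replace(old, new)
    return html


if __name__ == "__main__":
    original = "assets/js/bootstrap.min.js"
    new_name = versioned_name(original, b"/* file content */")
    page = '<script src="assets/js/bootstrap.min.js"></script>'
    print(update_references(page, {original: new_name}))
```

Because the name is derived from the content, an unchanged file keeps its name across deployments and stays in every cache, while any modification produces a new name automatically.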

What happens if I want to push a new resource?

CloudFront is our intermediate cache that we can act on directly to flush the cache and push updated resources to the clients.

This is what happens when I have to update a resource:

Normally a new client retrieves static resources (e.g. Bootstrap 4.6) from the CloudFront cache and then caches them on his web browser (or even on the enterprise proxy).
When updating a resource (Bootstrap 4.7), I push the new file via Terraform, which changes its file name (new hash) and updates the HTML pages that use it. Then the CloudFront cache is cleared for just these pushed files.
Consequently, the next time the client visits (at least 30 minutes later), his browser will retrieve the HTML web page and thus the new referenced resource (Bootstrap 4.7).
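The post does not say how the CloudFront cache is cleared for the pushed files. One plausible way is a CloudFront invalidation, so the sketch below builds the batch structure that such a request expects. Treat the whole approach as an assumption; the function name and the example keys are hypothetical.

```python
import time
from typing import Iterable, Optional


def build_invalidation_batch(keys: Iterable[str],
                             caller_reference: Optional[str] = None) -> dict:
    """Build an invalidation batch for the given object keys.

    Invalidation paths must start with "/", so keys are normalized and
    de-duplicated. The caller reference must be unique per request; a
    timestamp is used when none is given.
    """
    paths = sorted({"/" + key.lstrip("/") for key in keys})
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        "CallerReference": caller_reference or str(time.time_ns()),
    }


if __name__ == "__main__":
    print(build_invalidation_batch(["index.html", "assets/js/app.js"]))
```

With boto3, the resulting dictionary could be passed as the `InvalidationBatch` argument of the CloudFront client's `create_invalidation` call, together with the distribution ID. Only the files that keep their name (such as the HTML pages) really need it, since a versioned file arrives under a name that no cache has seen yet.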

CyberGordon cache strategy

With these 3 techniques and update processes in place, I have summarized here the principles and values of caching.

The cache strategy follows these 3 principles:

The client-side cache should not exceed 30 minutes for Web pages (HTML). After this time, the client must retrieve the latest version of the HTML files from the server.
The client keeps all static resources (img, js, css, ...) unmodified and versioned in the cache for up to one year.
If a resource is modified, it must be possible to reset the CDN cache so that the client can retrieve the latest version after its browser-side cache has expired (so in less than 30 min).

This table summarizes the current cache strategy on CyberGordon:

Object	Versioning	CloudFront cache	Browser cache
HTML + sitemap/robots	No	1 month	30 min
Images	Yes	1 month	1 year
CSS	Yes	1 month	1 year
JavaScript	Yes	1 month	1 year
JSON Changelog + Engine lists	Yes	1 month	1 year
JSON Statistic (Updated)	No	30 min	30 min

Conclusion & results

The implementation of these techniques was not an easy task and many tests were carried out, but the results have been worth it since going into production in June 2021:

90% fewer HTTPS requests and 99% less volume (GB) transferred after the first visit.
The cache hit rate of static resources jumped from 0 to 86% and the volume in GB transferred to the server (origin) decreased by almost 40%.

The pictures below show the result without cache and then with cache.

Moreover, in addition to the cache, compressing static resources, both during network transfer and in the source files, reduces the amount of data transferred. Images can be compressed easily with online tools without loss of visual quality.
Compressing the resources allowed me to reduce the overall size of the resources by 13%.

Finally, I was able to automate the whole implementation with the powerful functions of Terraform. Without Terraform, the work required at each resource modification would be very hard...

The genesis and architecture of my CyberGordon project

Marc-Henry Geay — Sun, 30 Oct 2022 10:56:48 +0000

Quick introduction

CyberGordon quickly provides you threat and risk information about observables such as IP addresses or domain names by querying multiple threat intelligence sources.

Thanks to each source that provides free access to great Threat Intelligence against phishing and malware. Without them, CyberGordon would not exist.

Why CyberGordon?

Whether it be during my investigations at work or personal surfing sessions, I’m too lazy to use several sources to check if a domain or email address is suspicious or malicious. Some awesome OSINT tools exist, but I didn’t have one that aggregates them all into one simple web interface. On top of that, I wanted to start by building a usable and useful tool on AWS infrastructure that I could share with my entourage. I would have liked to share CyberGordon widely, but I’m constrained by the query limits that free API sources provide. Lastly, the lockdown during the COVID-19 crisis gave me a lot of time, a rare resource that considerably contributed to completing CyberGordon.

Well, why “Gordon”?
As a Batman fan, I chose the commissioner James Gordon, a friend and reliable informant of the Dark Knight 🦇 🕵️

In April 2021, Gordon became CyberGordon to better reflect its function.

Objectives

When I built CyberGordon, I tried to follow several rules:

Simple: get results after pasting observable(s)

Neither I nor my entourage would use a tool that has an extremely complex GUI or that requires sophisticated information for submission. The aim is to copy/paste one or more observables, even if they are in a messy format (listed, quoted or in CSV format), submit them on a single form and get a readable summary. I’m still working on improving this last point.

Scalable: easily add new sources

Scalability often refers to the ability to automatically manage capacity depending on user demand. My intention was humbler: I wanted an evolving system where I can easily add or update a source (called an engine) without impacting the existing ones and without adding delay to the processing of the user’s observable submission.

Almost free: use adapted and cost-effective services

For my first tool on AWS and as a non-profit service, cost was of course the most important criterion. All major cloud service providers offer free tiers for a few of their main services, either for one year (sometimes less) or for life. I spent some time looking for simple and almost free AWS services before building a draft.

Serverless: minimal maintenance

It started as a challenge: since this is a relatively simple tool, I tried to avoid managing a Linux server, even though I loved managing Debian servers previously… I wanted to test functions (containers of code), where you only manage the code, while the underlying layers (runtime, OS, hardware) are managed by AWS.
Of course, code maintenance has to be done at least for each runtime update (Python 3.8 to Python 3.9 for example).

Secure: apply best practices

Last but not least, even though the manipulated data is not confidential, I have applied some principles: all data (in motion and at rest) is encrypted with an AWS managed key (free), resource permissions are restricted to the minimum needed (least privilege), public exposure is limited, and management actions (API) and users’ HTTP request logs are stored for 6 months.

Attempt with Slack

Before hosting CyberGordon entirely on AWS, I tried to build a front-end on Slack as a ‘bot’ using the Slack Commands feature and processing the commands on AWS. It works like a charm with one engine, but with two or more it is a mess and unusable. Slack is not suitable for presenting multiple results; it is a good chat tool for “one question, one short response” exchanges, but not a reporting tool...

The Slack request is sent to an HTTPS endpoint (hosted on AWS API Gateway) and forwarded to the back-end. Results are then returned using the incoming webhook URL included in the Slack request. As you can see below, results quickly become unreadable when using 2 engines…

A teammate suggested I generate reports on a webpage and send links to the Slack user. However, after a lot of thinking, this hybrid solution didn’t suit me because it limits the audience to the users of my Slack workspace.

Current architecture and how it works

To get all AWS capabilities and the cheapest prices, all resources are hosted in the “US East (Northern Virginia)” region, except for 2 resources invoked near the user location.

Short summary
On the website (1) you paste an observable and submit it; the request is parsed (2), sent to a queue (3), and dispatched to engines (4) that query API sources. During this background work, you’re forwarded to the results page (5).

Components are represented on this diagram and described in detail later on.

1. Static website

All front-end web assets are stored in a single S3 bucket (object storage). To provide caching, reduced latency and encrypted traffic (HTTPS TLS 1.3), a CloudFront distribution (CDN) is used; the S3 bucket policy only allows traffic from CloudFront.

The domain cybergordon.com points to a CloudFront distribution and the DNS zone “cybergordon.com” is entirely managed by AWS Route 53 (DNS service).

2. Request pipeline

Clicking on “Analyze!” sends an HTTP POST request with the observable(s), which passes through the CyberGordon-Request Lambda@Edge function: Python code deployed at multiple geographical points to be executed closer to the user.

The CyberGordon-Request function generates a request ID (UUID version 4) and parses the observable(s) into lists of 7 predefined types: IPv4, FQDN, URL, MD5, SHA-1, SHA-256 and Email address. Basically, the function compares the request body with 7 flexible regexes that accept a new line (\n) or a space between observables.
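The parsing step can be illustrated with the Python sketch below. The regular expressions are deliberately simplified stand-ins, not the ones used by CyberGordon (which are not shown in this post), and the function names are hypothetical. Unlike the approach described above, the sketch splits the body into tokens first and then classifies each token.

```python
import re
import uuid

OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"

# Order matters: URL and Email are tested before FQDN, and the longest
# hash first. Each token is assigned to the first type that matches.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+", re.IGNORECASE)),
    ("Email", re.compile(r"[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")),
    ("IPv4", re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")),
    ("SHA-256", re.compile(r"[0-9a-fA-F]{64}")),
    ("SHA-1", re.compile(r"[0-9a-fA-F]{40}")),
    ("MD5", re.compile(r"[0-9a-fA-F]{32}")),
    ("FQDN", re.compile(rf"(?:{LABEL}\.)+[A-Za-z]{{2,63}}")),
]


def parse_observables(body: str) -> dict:
    """Sort whitespace-, comma- or semicolon-separated tokens by type."""
    found = {name: [] for name, _ in PATTERNS}
    # Caveat of this simplification: a URL containing a comma is split.
    for token in re.split(r"[\s,;]+", body):
        token = token.strip("\"'")
        if not token:
            continue
        for name, pattern in PATTERNS:
            if pattern.fullmatch(token):
                if token not in found[name]:
                    found[name].append(token)
                break
    return found


def new_request(body: str) -> dict:
    """Attach a random (version 4) UUID to the parsed observables."""
    return {"request_id": str(uuid.uuid4()),
            "observables": parse_observables(body)}


if __name__ == "__main__":
    print(new_request('8.8.8.8, "example.com"\nuser@example.com'))
```

Tokens that match no pattern are silently ignored here; a real form would probably report them back to the user.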

Then the user is forwarded (HTTP 302 Found) to the results page that is described below.

3. Queue pipeline

A small but powerful part that dispatches observables to engines depending on the types they can check against sources. The CyberGordon-Queue SNS topic receives the message and immediately sends a copy to each subscriber that accepts the submitted observable type(s).
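The dispatch rule can be pictured with a few lines of Python. This is a plain in-memory simulation of the fan-out, not SNS code; the engine names and the types they accept are invented for the example.

```python
# Hypothetical subscribers and the observable types each one accepts.
SUBSCRIPTIONS = {
    "engine-ip-reputation": {"IPv4"},
    "engine-domain-intel": {"FQDN", "URL"},
    "engine-malware-hashes": {"MD5", "SHA-1", "SHA-256"},
}


def dispatch(message: dict, subscriptions: dict = SUBSCRIPTIONS) -> list:
    """Return the engines that should receive a copy of the message.

    An engine is selected when at least one of the types it accepts has
    a non-empty list of observables in the message.
    """
    submitted = {kind for kind, values in message["observables"].items() if values}
    return sorted(
        engine for engine, accepted in subscriptions.items()
        if accepted & submitted
    )


if __name__ == "__main__":
    demo = {"request_id": "demo", "observables": {"IPv4": ["8.8.8.8"], "FQDN": []}}
    print(dispatch(demo))
```

On SNS itself, this kind of selection is typically expressed with subscription filter policies, so that engines with nothing to check are never invoked. Whether CyberGordon uses that exact mechanism is not stated in this post.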

4. Engine pipeline

Each CyberGordon-Engine Lambda function receives the observable list simultaneously. The engine checks the integrity of the request (using the SHA-1 function), then gets, if applicable, the API token of the remote source from encrypted variables and queries the source over HTTPS. Finally, results are stored in a DynamoDB table (NoSQL database). Results from all engines are stored in a single database record. In a previous implementation, individual results were stored as S3 objects, which made retrieving them four times slower!
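Two of these steps lend themselves to a short Python sketch: the integrity check and the single record shared by all engines. The post does not describe what exactly is hashed, so the digest layout below (SHA-1 over a canonical JSON form of the request) is an assumption, and a plain dictionary stands in for the DynamoDB table. All names are hypothetical.

```python
import hashlib
import json


def request_digest(request_id: str, observables: dict) -> str:
    """SHA-1 over a canonical JSON form of the request (assumed layout)."""
    payload = json.dumps(
        {"id": request_id, "observables": observables}, sort_keys=True
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def verify_request(message: dict) -> bool:
    """Check that the digest carried by the message matches its content."""
    expected = request_digest(message["request_id"], message["observables"])
    return expected == message.get("sha1", "")


def store_result(table: dict, request_id: str, engine: str, result: dict) -> dict:
    """Add one engine's result to the single record of the request."""
    record = table.setdefault(
        request_id, {"request_id": request_id, "results": {}}
    )
    record["results"][engine] = result
    return record


if __name__ == "__main__":
    table = {}
    store_result(table, "demo", "E1", {"verdict": "clean"})
    store_result(table, "demo", "E2", {"verdict": "suspicious"})
    print(json.dumps(table, indent=2))
```

Keeping one record per request means the results page needs a single read, which is consistent with the speed-up over per-engine S3 objects mentioned above.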

All engines query remote sources to get live and fresh information, except for the Offline Feeds engine (E23): an hourly CloudWatch Event Rule (scheduled task) invokes a Lambda function (Python code) that downloads the feeds, transforms them into JSON and overwrites the existing feed content stored in the main S3 bucket.
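The transformation step might look like the following Python sketch. The input format (one indicator per line, with "#" comment lines) and the shape of the JSON output are assumptions made for the example; the actual feeds and their formats are not described in this post.

```python
import json


def feed_to_json(raw_text: str, source: str) -> str:
    """Turn a plain-text feed into a JSON document.

    Blank lines and "#" comments are skipped, and duplicates are removed
    so that the stored file stays small.
    """
    entries = set()
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entries.add(line)
    unique = sorted(entries)
    return json.dumps(
        {"source": source, "count": len(unique), "entries": unique}, indent=2
    )


if __name__ == "__main__":
    sample = "# demo feed\n1.2.3.4\nbad.example\n1.2.3.4\n"
    print(feed_to_json(sample, "demo-feed"))
```

Since the output fully replaces the previous file every hour, no merge logic is needed: each run produces a complete snapshot of the feed.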

5. Result pipeline

Returning to the Request stage explained earlier (2. Request pipeline), the user is forwarded to the results page. When the page is loaded, a script gets the request ID from the URL query parameter (see example below) and then calls the result endpoint: HTTP GET /get-result.
This HTTP call is handled by the CyberGordon-Results Lambda@Edge function (like the Request function). This function reads the database record that contains all results and returns it as a JSON document.
The table itself is rendered by “DataTables”, a great free jQuery plugin that easily generates an HTML table from JSON input. DataTables provides exporting capabilities: all results can be exported to Excel, CSV or PDF files, or to your clipboard, for further analysis or archiving purposes if needed.

Result URL example: https://cybergordon.com/r/e3a3a0c9-33c0-46e1-a612-91788ee76d14
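Whichever side reads the request ID, it is worth checking that it really is a version 4 UUID before querying the database. The Python sketch below does this for a URL shaped like the example above. The function name is hypothetical and the "/r/" prefix is taken from that example only.

```python
import uuid
from typing import Optional
from urllib.parse import urlparse

RESULT_PREFIX = "/r/"


def request_id_from_url(url: str) -> Optional[str]:
    """Extract the request ID from a result URL, or None if it is invalid."""
    path = urlparse(url).path
    if not path.startswith(RESULT_PREFIX):
        return None
    candidate = path[len(RESULT_PREFIX):].strip("/")
    try:
        parsed = uuid.UUID(candidate)
    except ValueError:
        return None
    # Only random (version 4) UUIDs are generated for requests.
    if parsed.version != 4:
        return None
    return str(parsed)


if __name__ == "__main__":
    url = "https://cybergordon.com/r/e3a3a0c9-33c0-46e1-a612-91788ee76d14"
    print(request_id_from_url(url))
```

Rejecting malformed IDs early avoids a useless database read and doubles as a basic input control.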

Continuous improvements

The current architecture is far from being perfect and suffers from several issues that are more or less obvious:

Slowness when getting and merging results from each result object. I could merge the engines into one function that generates only one result object; in this case the CyberGordon-Result function can be spiked.
Reinforce security (input control)
The quality of the Python code: a long way to go…
Provide a front-end API and User Account system

Since 2020 I have made some improvements, which will be the subject of a future article:

Backup config and code on S3.
Industrialize deployment with CI/CD pipeline and Infrastructure as Code with Terraform.

I’m open to any remarks that will help improve CyberGordon!

Thanks to Carole Boijaud, Youssef Sayegh and my darling for their careful proofreading.

Learn how I tuned CyberGordon Web cache with CloudFront on this blog post. More posts on my personal blog.