<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Bailey</title>
    <description>The latest articles on DEV Community by Tom Bailey (@tombailey14).</description>
    <link>https://dev.to/tombailey14</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1072354%2F74a03ad8-bf29-4fe9-836f-6f870c9ef252.jpeg</url>
      <title>DEV Community: Tom Bailey</title>
      <link>https://dev.to/tombailey14</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tombailey14"/>
    <language>en</language>
    <item>
      <title>AppSync Resolvers: Careful now!</title>
      <dc:creator>Tom Bailey</dc:creator>
      <pubDate>Fri, 05 Jan 2024 20:13:34 +0000</pubDate>
      <link>https://dev.to/tombailey14/appsync-resolvers-careful-now-2609</link>
      <guid>https://dev.to/tombailey14/appsync-resolvers-careful-now-2609</guid>
      <description>&lt;p&gt;Recently, we had an issue on a project which after some investigation turned out to be related to &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/resolver-components.html"&gt;AppSync Resolvers&lt;/a&gt; 😰. Don't worry - this not me bashing AppSync or AppSync Resolvers, they are most certainly awesome! We just overlooked how we implemented one, got caught out, learnt a lesson and are now a little more careful with them. Let me take you for a little journey...&lt;/p&gt;




&lt;h2&gt;How we used AppSync Resolvers&lt;/h2&gt;

&lt;p&gt;We were using &lt;a href="https://aws.amazon.com/appsync/"&gt;AppSync&lt;/a&gt; and AppSync Resolvers as you would expect: we had a &lt;a href="https://graphql.org/"&gt;GraphQL&lt;/a&gt; schema which defined Queries and Mutations and used a mix of "Data Sources". AWS's example below gives context for what a "Data Source" is in terms of AppSync and AppSync Resolvers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rM-tSx5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j24mls98l6h6v91v7iu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rM-tSx5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j24mls98l6h6v91v7iu7.png" alt="AppSync Resolvers InfoGraphic" width="800" height="395"&gt;&lt;/a&gt;&lt;em&gt;Source: &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/resolver-components.html"&gt;https://docs.aws.amazon.com/appsync/latest/devguide/resolver-components.html&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most of our "Data Sources" were lambdas which performed custom operations, but in some cases we had "Data Sources" which were direct connections to services like &lt;a href="https://aws.amazon.com/dynamodb/"&gt;DynamoDB&lt;/a&gt; or &lt;a href="https://aws.amazon.com/cognito/"&gt;Cognito&lt;/a&gt;. These are super nice to use: we would simply have a data source (e.g. Cognito) declared in CDK like below, with a resolver created from it declaring a &lt;code&gt;fieldName&lt;/code&gt; matching the field in the GraphQL schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const cognitoDataSource = graphQlApi.addHttpDataSource(
  "CognitoDataSource",
  `https://cognito-idp.${region}.amazonaws.com/`,
  {
    authorizationConfig: {
      signingRegion: region,
      signingServiceName: "cognito-idp",
    }
  } 
);

cognitoDataSource.createResolver("CognitoStatusResolver", {
  // typeName matches the parent type in the GraphQL schema
  typeName: "PersonDetails",
  fieldName: "cognitoStatus",
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; I've taken out the request and response mapping VTL templates from the example code so it's easier to read. Have a look at the links below if you want to understand more about creating AppSync Resolvers with either JavaScript or VTL templates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/tutorials-js.html"&gt;AppSync Javascript Resolvers&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/tutorials.html"&gt;AppSync VTL Template Resolvers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, we had been using lots of these "Data Source" resolvers and they were working seamlessly. The field names were declared in the schema like below and used wherever we could to provide direct integration to particular AWS services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Query {
  getAllPersonDetails: [PersonDetails!]
}

type PersonDetails {
  id: String!
  name: String!
  cognitoStatus: String!
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown above, we could call &lt;code&gt;getAllPersonDetails&lt;/code&gt; with the fields inside it and it would get the &lt;code&gt;cognitoStatus&lt;/code&gt; directly from Cognito, since that field had its own Cognito resolver already created. No need for it to be passed through a lambda to perform the Cognito call. Fantastic! Except...&lt;/p&gt;

&lt;h2&gt;This is getting big&lt;/h2&gt;

&lt;p&gt;...our application began to grow. We added more and more GraphQL fields which could be queried. &lt;code&gt;PersonDetails&lt;/code&gt; began to look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type PersonDetails {
  id: String!
  firstName: String!
  lastName: String!
  dateOfBirth: String!
  address: Address!
  phoneNumber: String!
  emailAddress: String!
  firstLogin: String
  lastLogin: String
  cognitoStatus: CognitoStatus!
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's wrong with this, you might think? Nothing in particular, really, as we were able to query the exact fields we wanted (thanks, GraphQL) directly from the "Data Source". But as the application grew we made more and more queries calling these resolvers. And if one of those resolvers were to be carelessly included in a heavily used Query, then we would be querying the connected service...directly...every time.&lt;/p&gt;
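&lt;p&gt;To make the amplification concrete, here is a quick back-of-the-envelope sketch (the traffic numbers are made up for illustration, not our real figures):&lt;/p&gt;

```javascript
// Each getAllPersonDetails query resolves cognitoStatus once per returned item,
// so the downstream call rate multiplies by the list size.
function downstreamCallsPerSecond(queriesPerSecond, itemsPerQuery) {
  return queriesPerSecond * itemsPerQuery;
}

// A modest 10 queries/sec over a 50-person list is already
// 500 direct Cognito calls per second.
```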

&lt;p&gt;&lt;code&gt;cognitoStatus: CognitoStatus!&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Oops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lhc4WSIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xst0gdlzi250m8w1pk87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lhc4WSIu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xst0gdlzi250m8w1pk87.png" alt="Cognito Too Many Request Exception" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ouch.&lt;/p&gt;

&lt;p&gt;Yes, we had suddenly caused a Denial of Service attack...on ourselves.&lt;/p&gt;

&lt;h2&gt;Wait, what happened?&lt;/h2&gt;

&lt;p&gt;In our excitement at using resolvers and connecting up data sources all over the place, we had forgotten that one of them was calling the Cognito API directly...which can be rate limited if used carelessly. And we used it carelessly. And we hit that rate limit, hard.&lt;/p&gt;

&lt;p&gt;Actually, there was nothing wrong with us having the &lt;code&gt;cognitoStatus&lt;/code&gt; field backed by the Cognito "Data Source". But with the growing size of the application and user base, the &lt;code&gt;cognitoStatus&lt;/code&gt; field was included in a Query when it shouldn't have been. This field was then subsequently queried many...many times. This meant it was calling the Cognito API directly...many...many times, and Cognito didn't like that.&lt;/p&gt;

&lt;p&gt;Cue us hitting the User Pool &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html#operation-quotas"&gt;rate limit&lt;/a&gt; and people being unable to log in...yeah.&lt;/p&gt;

&lt;h2&gt;So what did we learn?&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;We do still love AppSync Resolvers. The ability to connect your "Data Source" directly without the need for a lambda resolver is still fantastic. Better speed and less code make them a win.&lt;/li&gt;
&lt;li&gt;Just be considerate when using them. You can easily connect up a "Data Source" like Cognito and leave it there, but this can lead to future you or someone else unknowingly popping it into a Query which is hit so much it makes Cognito fall over. We will still be using resolvers of course, but now, every time we write queries, we check that we know what all the fields are actually connected to 😏.&lt;/li&gt;
&lt;li&gt;AWS is your friend and not your enemy in these situations. When it came to understanding this problem we could use both &lt;a href="https://aws.amazon.com/cloudwatch/"&gt;CloudWatch&lt;/a&gt; and &lt;a href="https://aws.amazon.com/cloudtrail/"&gt;CloudTrail&lt;/a&gt; to gain quick insight into our API calls and understand where our "TooManyRequestsException" was originating from. Take time to read up on these services so you are able to use them in a critical situation like this one!&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Thank you for listening&lt;/h2&gt;

&lt;p&gt;Thanks for coming through this journey with me! I hope it was helpful and not too scary. Boo! Bye. 👋&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to log logically</title>
      <dc:creator>Tom Bailey</dc:creator>
      <pubDate>Thu, 21 Dec 2023 15:20:39 +0000</pubDate>
      <link>https://dev.to/tombailey14/how-to-log-logically-8cf</link>
      <guid>https://dev.to/tombailey14/how-to-log-logically-8cf</guid>
      <description>&lt;h4&gt;“Houston, we have a problem.”&lt;/h4&gt;

&lt;p&gt;You hear this statement filtering through from a client, except replace “Houston” with your name and replace “problem” with “bug”. As you open your logs in fear you think back to all those times when writing code that you said…&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Ah, I’ll write some logs for that later”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Let’s just fire the full response object into this log, I’m sure there’ll be something useful in it”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I’ll just log ‘Failure’ here and that should be enough to know something's up”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…a bead of sweat appears on your forehead and you frown angrily, thinking of all those articles you skipped over discussing logging and its extreme importance to a system. &lt;em&gt;distant screams&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fear not, you wake up from your nightmare and rush to your computer to begin reading up on every article discussing logging (while also continually being frustrated by websites about tree cutting) and land here - a paradise of logging advice.&lt;/p&gt;




&lt;h2&gt;Why log?&lt;/h2&gt;

&lt;p&gt;Well, hopefully my box office breaking movie script above gives you a taster of the importance not only of “Why log?” but also of “Why log correctly?”. Here are a few reasons you will often come across for why logging is beneficial:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analysing errors&lt;/strong&gt; - A simple one, you have an issue in your application and you need to understand what’s going on. Logging should provide context and traceability of a user/flow of data in a system and allow you to pinpoint what and where something may have gone wrong and action it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerting issues&lt;/strong&gt; - This is not the logs themselves but the ability to track these logs as metrics and perform actions upon log patterns whether erroneous or not. These can be simple remediation actions or P0 alerts waking up your support engineer in the night informing them the application is imploding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assessing threats&lt;/strong&gt; - Similar to alerting issues, you can assess potential threats to your system by analysing logs and the metrics they produce before the threats affect other services. Additionally, you can produce reports on logs to trace potential bottlenecks or high volume areas that need improving before they fail.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amplifying business intelligence&lt;/strong&gt; - I’m stretching it on the “A” theme here but logs do help provide business insights to the developers and more importantly the client which is vital. With logging they can see popular areas of their application or potential downfalls causing frustration among customers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are just a few of the benefits of logging in your application. You might be thinking “OK then, let’s just log like crazy and make sure we have everything outputted to its fullest”, but now the pendulum swings the other way. You are being overwhelmed with bloated, useless information and all you want to know is...what’s going on! So it’s important to also think about what to log and not just why log.&lt;/p&gt;



&lt;h2&gt;What to log?&lt;/h2&gt;

&lt;p&gt;This is one that can be hard to get right, and it’s best to put in as much thought as you can when writing the logs. You really want to ensure your future self or a colleague is not left baffled by what a log means when reading it as part of a debugging session, and that it’s not filled with blinding information distracting you from the real issue.&lt;/p&gt;

&lt;p&gt;You want to give your logs meaning and context: a reader should be able to understand pretty quickly what the log is saying and also where it sits in relation to a particular process in your application.&lt;/p&gt;

&lt;p&gt;For example, this log has a little meaning but not much context:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Failed to find user"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You’ll read this and understand your application failed to find a user but you’re left thinking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What user was it looking for?&lt;/li&gt;
&lt;li&gt;Where was it looking for this user?&lt;/li&gt;
&lt;li&gt;What part of the system was doing the looking which failed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An improved log should be much clearer - it should provide meaning (i.e. the user ID of the item that was not found) but also the context that the lookup was specifically in the User table. See below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Failed to find item in 'User' table for user with id: [user-id-1]"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Additionally, as covered later in this blog under “Structured vs Unstructured” logging, we can enhance these logs even more to include additional context such as session IDs, timestamps or even the exact function being invoked.&lt;/p&gt;

&lt;p&gt;The key point to remember here is to ensure there is &lt;em&gt;meaning&lt;/em&gt; and &lt;em&gt;context&lt;/em&gt; behind your logs; they need to be readable quickly and understood immediately. You don’t want someone guessing what a log means or what it relates to in the application - it needs to be clear and contain all the information required to successfully debug.&lt;/p&gt;

&lt;h2&gt;What not to log?&lt;/h2&gt;

&lt;p&gt;This is where it can get scary. You might have followed the process easily for “What to log?” and included all the relevant information in your new log, providing both meaning and context. But you need to be extremely careful here, as there is information you definitely don’t want revealed in your logs - some of it obvious, some you really need to watch out for. Here are some examples:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tokens/Secrets&lt;/strong&gt; - This should hopefully be obvious, but watching out for passwords, tokens and secrets is critical to ensuring you don’t easily reveal a way into your application to the outside world if your logs are intercepted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PII&lt;/strong&gt; - Personally Identifiable Information could easily have a blog post of its own. There are many different categories of identifiable information which, if revealed together or pieced together from different logs, can cause extreme damage to a person. Some of these include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unique ID numbers, e.g. passport number&lt;/li&gt;
&lt;li&gt;Bank account information&lt;/li&gt;
&lt;li&gt;Medical records&lt;/li&gt;
&lt;li&gt;Full name&lt;/li&gt;
&lt;li&gt;Phone number&lt;/li&gt;
&lt;li&gt;Date of Birth&lt;/li&gt;
&lt;li&gt;IP address&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Information user has asked not to track&lt;/strong&gt; - This could be anything which the user has explicitly opted out of being recorded or the consent has expired for collection.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Sometimes sensitive information may be required in a log, but this should be rare and very carefully considered when done. You can also apply some form of masking, hashing or encryption to ensure the information is not revealed in full.&lt;/p&gt;
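&lt;p&gt;As a quick illustration of the masking idea, a hypothetical helper like the one below hides most of a sensitive value before it goes anywhere near a log (this is a sketch, not a replacement for a proper redaction library):&lt;/p&gt;

```javascript
// Mask all but the last few characters of a sensitive value before logging it,
// e.g. a phone number or account reference.
function mask(value, visibleChars = 2) {
  if (value.length <= visibleChars) return "*".repeat(value.length);
  return "*".repeat(value.length - visibleChars) + value.slice(-visibleChars);
}

// mask("07700900123") -> "*********23"
```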

&lt;p&gt;I’d recommend having a look at the &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html#data-to-exclude"&gt;OWASP Cheat Sheet Series&lt;/a&gt; for logging to find an extensive list of what to watch out for when writing your logs.&lt;/p&gt;

&lt;p&gt;There are also some tools available to help with this; for example, AWS provides automatic detection, masking and alerting for sensitive information in CloudWatch Logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/mask-sensitive-log-data.html"&gt;Help protect sensitive log data with masking - AWS Cloudwatch Logs&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How to log?&lt;/h2&gt;

&lt;p&gt;So you’ve got the reason for why you are logging and the contents of what you are logging - but how best do you log to get the information you want in a clear and usable format? Here are a couple of key concepts to consider when logging.&lt;/p&gt;

&lt;h3&gt;Log Levels&lt;/h3&gt;

&lt;p&gt;Log levels are common in most logging applications and serve as a filter, allowing you or your system to easily sort through logs and act on particular log levels, focussing on ones with a higher priority or more relevance to your issue. The most common log levels you will come across are TRACE, DEBUG, INFO, WARN, ERROR &amp;amp; FATAL, and since TRACE and FATAL are used in very few cases you’ll mostly be using the four in the middle. There’s a decent breakdown provided &lt;a href="https://stackoverflow.com/questions/7839565/logging-levels-logback-rule-of-thumb-to-assign-log-levels/8021604#8021604"&gt;here&lt;/a&gt; which I have mainly used to form the following log level rule of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TRACE&lt;/strong&gt; - Used for extremely detailed and potentially high-volume logs. Useful if you want complete visibility of code execution for extended debugging; if used, it should only be turned on for a short period of time due to the sheer amount of log data it produces (which can negatively affect application performance and your wallet). This would be nearly line-by-line or decision-by-decision logging, which is excessive for basic debugging but useful for finding a very specific occurrence or issue in the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DEBUG&lt;/strong&gt; - Messages that are helpful in tracking flow through a system but not important enough to make it into INFO. Useful for Development &amp;amp; QA phases and will be diagnostically helpful in finding issues through logging interesting events, decision points and entry/exits of non-trivial methods. DEBUG logs are generally not enabled in Production due to the volume of logs and should only be enabled if required to do a small period of debugging and then disabled again. These could include database queries, external API call request/responses and logging of configuration values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;INFO&lt;/strong&gt; - Stepping up from DEBUG, these are informative events which are generally useful in production; you would usually ignore them in normal circumstances but use them when you want to analyse an issue with some context. If your event is a common “normal” occurrence, then in production logs it’s likely it would be INFO. This would include system lifecycle events (e.g. start/stop), session lifecycle events (e.g. login/logout) and significant boundary events (e.g. database calls, remote API calls).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WARN&lt;/strong&gt; - An unexpected technical or business event, but with no immediate human intervention required. It’s often still a log we would like to investigate, but not urgently, and it shouldn’t be causing major issues to customers. Likely the application can automatically recover and you would just like this warning logged for later tracing, investigation or metric analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ERROR&lt;/strong&gt; - This is a log to signal the system is in distress. An unexpected occurrence in which customers are likely being affected, and you will want to alert someone urgently to begin investigating the issue and intervene with a manual fix. It’s often an error which affects a particular operation rather than the whole system, e.g. missing data or an external API not responding in time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FATAL&lt;/strong&gt; - Big uh oh. This is rarely used but is reserved for large application failures in which the whole system doesn’t fulfil its intended functionality, a major service is completely unavailable, or major data corruption or loss is imminent. You might use this to alert systems to shut down applications to prevent more damage occurring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
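&lt;p&gt;The filtering these levels enable boils down to a simple threshold check; here is a toy sketch (assuming the usual severity ordering, which most logging libraries follow):&lt;/p&gt;

```javascript
// Severity order from least to most severe.
const LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"];

// A message is emitted only if its level is at or above the configured threshold.
function shouldLog(threshold, level) {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(threshold);
}

// With a production threshold of INFO, DEBUG noise is dropped
// while WARN/ERROR/FATAL still get through.
```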

&lt;h3&gt;Structured vs Unstructured Logs&lt;/h3&gt;

&lt;p&gt;Now you’ve got your level of log, you’ll need to decide how you want the information to be presented within it. You’ve got two main options: structured or unstructured. You can probably already make a good guess at the difference between these, but here are the key points to clarify. I’ll start with what the industry recommends you not do, and then work my way to the best practices.&lt;/p&gt;

&lt;h4&gt;Unstructured Logs&lt;/h4&gt;

&lt;p&gt;Unstructured logs are usually human-readable plain text with no consistent or predetermined message format. They are difficult to treat as data sets and not easily queried or searched; since they are designed for a human to read, a computer of course struggles to understand them without some form of natural language processing, and that just gets complex.&lt;/p&gt;

&lt;p&gt;Here’s an example of an unstructured log:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;2023-09-01T08:36:00Z INFO Successfully did big maths with debited amount: [£45.00]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;They really aren’t the best, and that’s the honest truth. Ultimately they are the plainest form of logging, often fired into your code at first to make following the flow of data easy and to perform debugging in a small capacity. Once the application grows or the team scales, the logs become messier and more inconsistent, and start to look real scary when an issue occurs…are you getting the point?&lt;/p&gt;

&lt;p&gt;Quickly moving on…&lt;/p&gt;

&lt;h4&gt;Structured Logs&lt;/h4&gt;

&lt;p&gt;Structured logging is basically the opposite of the above, and much better to use. Structured logs are formed with a consistent, predetermined message format that can be queried much more quickly and easily by a computer. By using structured logging you make sure you can utilise the large volume of logs you may be producing to form key metrics, quickly follow traces through logs, or just ensure you have a consistent form with all the information you need in every log produced.&lt;/p&gt;

&lt;p&gt;With structured logs you can standardise certain data points within them, such as the timestamp, actor IP address, session ID, log level, function name and the message, which makes querying so much easier across all your logs. The message attribute is often the more human-readable part from which we get the most context (this rather trumps unstructured logging, as you have your human-readable message in there amongst the structured format).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "timestamp": "2023-09-01T08:36:00Z",
  "level": "INFO",
  "sessionId": "9ccd503c-973f-4f17-bce9-3431875cd94a",
  "functionName": "BigMathFunction",
  "message": "Successfully did big maths with debited amount: [£45.00]"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see in this example above that the structured format of the attributes means you can easily query with the right tools, sorting and filtering exactly what you need to debug effectively. With a predefined structure you have consistency, and with consistency you have efficiency - simply marvellous.&lt;/p&gt;

&lt;p&gt;Ultimately, the industry recommends against plain old unstructured logging from the start. Get your predetermined message format defined at the beginning of your project, include what you think you’ll need as the application expands, and keep it consistent. There are many good standard logging packages out there that will make this a lot easier for you too. Your future self will thank you a million times over.&lt;/p&gt;
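&lt;p&gt;As a sketch of what such a package does under the hood, a minimal structured-log emitter might look like this (the field names are simply the ones from the example above; real libraries such as pino or winston add much more):&lt;/p&gt;

```javascript
// Build a structured log entry with a consistent, predetermined set of fields.
function logEntry(level, sessionId, functionName, message) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    sessionId,
    functionName,
    message,
  });
}
```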

&lt;h2&gt;Securing logs &amp;amp; using logs for security&lt;/h2&gt;

&lt;p&gt;So now you’re logging, but you want to make sure those logs are securely stored and transmitted, and also useful in maintaining and monitoring the security of your application. This is of critical importance to any application and should be considered heavily when adding logs to your system. As with some other sections in this blog, security could easily be its own blog post, but I will briefly cover some key concepts which are good to keep in your head and prompt further investigation when implementing logs yourself.&lt;/p&gt;

&lt;h3&gt;Securing logs&lt;/h3&gt;

&lt;h4&gt;Log Protection&lt;/h4&gt;

&lt;p&gt;Protection of logs is extremely important, especially as they may contain required sensitive information, application source code or valuable company data that would be extremely useful to a bad actor. Masking or encryption, as mentioned earlier in the “What not to log?” section, can provide the initial assurance that even if logs are leaked in a worst-case scenario, there is no sensitive information included.&lt;/p&gt;

&lt;p&gt;But you of course still want good log protection and here are some considerations you should make:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensuring logs have tamper protection so you can identify if logs have been altered or removed maliciously&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access to view logs should also be monitored by auditing user access events; this helps prevent, or alert on, access by bad actors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure logs are sent securely when transmitted outside of your own network&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consider Denial of Service (DoS) protection, as logs may be subject to floods of information from an attacker trying to cover the tracks of bad activity or severely cripple an application with log overload&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Log Retention&lt;/h4&gt;

&lt;p&gt;This is hopefully simple enough to understand, but ensure that your logs are only retained for the required time; this is often set by the client as a legal or regulatory amount. So ensure the logs are retained for that length of time and deleted once it has been reached (no earlier, no later).&lt;/p&gt;

&lt;h3&gt;Logs for security&lt;/h3&gt;

&lt;p&gt;As well as ensuring your logs are secure, you will also want to use logs for security, tracking not only customer flows but developer activity too. It’s important to log changes of settings, movements and actions throughout the application made by owners/developers, and to track their access to logs and sensitive information.&lt;/p&gt;

&lt;p&gt;There are some good guidelines below on what you should be logging for security, including setting changes and owner/developer changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mvsp.dev/mvsp.en/"&gt;Minimum Viable Secure Product&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Logging in a Distributed System&lt;/h2&gt;

&lt;p&gt;I’ve decided to diverge a little from the basics of logging to briefly cover logging in a distributed system, as there are some key points here which are important to think about when deciding how to log in your application.&lt;/p&gt;

&lt;p&gt;If you remember, I talked above about the importance of context in “What to log?”; this becomes ever more important in distributed systems. Within a distributed system, a flow completed from start to finish will likely pass through multiple microservices, each producing their own logs but only logging within the context of themselves (as that’s all they know about!). The logs may be clear for a single microservice, but if you want to analyse the complete flow at a higher level, it becomes incredibly difficult to link these logs together and understand them as a whole. To help with this, you can use a Correlation ID.&lt;/p&gt;

&lt;h3&gt;Correlation IDs&lt;/h3&gt;

&lt;p&gt;This is a unique identifier that can be generated and added to the first request going into your system as part of a flow, and subsequently passed along to each microservice within the flow. This Correlation ID can then be placed into every log produced by that flow, no matter the microservice. This provides a unique link between all of the logged events, allowing for the high-level understanding of a flow within your system previously mentioned - now you’ve got context.&lt;/p&gt;

&lt;p&gt;Correlation IDs also reinforce the importance of using “Structured Logs”: with this ID placed in every related log in a flow, and as long as the logs are structured, querying can be relatively simple using that specific ID.&lt;/p&gt;
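&lt;p&gt;In code, the idea is simply to mint the ID once at the edge of the system and stamp it on every log line along the flow. A sketch (the &lt;code&gt;x-correlation-id&lt;/code&gt; header name and the ID format here are assumptions, not a standard):&lt;/p&gt;

```javascript
// At the edge of the system: reuse the caller's correlation ID if one was
// provided, otherwise mint a new one for this flow.
function correlationIdFor(headers) {
  return (
    headers["x-correlation-id"] ||
    `corr-${Date.now()}-${Math.floor(Math.random() * 1e6)}`
  );
}

// In every service along the flow: stamp the ID on each structured log entry
// so the whole flow can be queried with one ID.
function logWithCorrelation(correlationId, service, message) {
  return JSON.stringify({ correlationId, service, message });
}
```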

&lt;h3&gt;Distributed logging products&lt;/h3&gt;

&lt;p&gt;Some examples of products which can help you implement distributed logging and tracing include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://opentelemetry.io/docs/what-is-opentelemetry/"&gt;Open Telemetry&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/xray/"&gt;AWS X-Ray&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/what-is/elk-stack/"&gt;AWS ELK Stack&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these tools you can not only trace and view the logs of a particular request, but also hook into observability tooling to give a better picture of performance within your application.&lt;/p&gt;

&lt;h2&gt;Logging vs Metrics&lt;/h2&gt;

&lt;p&gt;A brief area to finish on is the difference between logs and metrics. You’ll have noticed both mentioned already in this blog, and it can often be difficult to know which to use in particular scenarios and how each is beneficial to you and your application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs
&lt;/h3&gt;

&lt;p&gt;If we take the case of structured logs being produced from your application, then in most cases there is going to be a very high volume of them. They can be read through some form of filtering or querying solution and are extremely useful for debugging error cases or following transactional flows within your application. So your main use cases here are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One-off querying of logs to investigate a particular situation, e.g. an error case&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Following a user or data flow right through your application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying anomalies to understand potential bottlenecks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying potential security threats&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You may, though, have some regular searches you want to perform on particular logs, whether that be for producing alerts to identify issues or monitoring system performance at a higher level - this is where metrics come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Metrics are extremely useful when you want to provide high observability of a system based on planned events occurring over time. Metrics are numbers generated from particular patterns of events; they can be used to monitor the performance of a system and also to raise alerts when particular thresholds are hit by your application.&lt;/p&gt;

&lt;p&gt;There are countless metrics you could derive from the logs your application produces, and it’s important to pick the ones that actually benefit you and your application.&lt;/p&gt;

&lt;p&gt;Some key use cases for metrics would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitoring health and performance of a particular component or system as a whole&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated alerting based on metrics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High level observability and key metric gathering for clients&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
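&lt;p&gt;To make the idea concrete, here is a minimal sketch (the log shape, service names and threshold are made up for illustration) of how a metric is just a number derived from a pattern of events - in this case an error count per service that could drive an alert:&lt;/p&gt;

```typescript
// Illustrative only: derive a simple error-count metric from structured logs.
type LogEvent = { service: string; level: string };

function errorCountByService(events: LogEvent[]) {
  const counts: { [service: string]: number } = {};
  for (const event of events) {
    if (event.level === "ERROR") {
      counts[event.service] = (counts[event.service] ?? 0) + 1;
    }
  }
  return counts;
}

// Alert on any service breaching a threshold within the window
function breachingServices(counts: { [service: string]: number }, threshold: number) {
  return Object.keys(counts).filter((service) => counts[service] >= threshold);
}

const logWindow: LogEvent[] = [
  { service: "payment-service", level: "ERROR" },
  { service: "payment-service", level: "ERROR" },
  { service: "order-service", level: "INFO" },
  { service: "payment-service", level: "ERROR" },
];

const counts = errorCountByService(logWindow);
const alerting = breachingServices(counts, 3); // ["payment-service"]
```

&lt;p&gt;Managed services such as CloudWatch can generate this kind of metric for you from log filters, but the principle is the same: events in, numbers out, alerts on thresholds.&lt;/p&gt;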

&lt;h3&gt;
  
  
  So what do we use?
&lt;/h3&gt;

&lt;p&gt;As briefly explained above, you’ll hopefully see that logs and metrics have different use cases within an application and are both incredibly important to the observability of a system. So ultimately it’s a mix of both to get the best out of your application: use metrics to monitor performance, track specific events and assist in alerting - then have logs for one-off debugging sessions and following a flow right through your system.&lt;/p&gt;

&lt;p&gt;There’s a lot to think about regarding both, which is why it’s stressed here and in many other articles that logging is of such high importance to an application - don’t let it fall through the cracks.&lt;/p&gt;




&lt;h4&gt;
  
  
  “Houston, we have a problem.”
&lt;/h4&gt;

&lt;p&gt;You hear this statement filtering through from a client, except instead of waking up from your nightmare sweating and hyperventilating, your brain reminds you that after reading those many articles about the importance of logging you said…&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Yes, let’s write that log now, I have the context so I’ll get it in before I forget”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I’ll take only what I think I need from this response object and include it in the wonderfully structured and easily queryable log”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I’ll not only log ‘Failure’ here but include the relevant Correlation ID, timestamp, usefully human readable error message and anything I think might be useful for the support engineer reading the logs at 2am to gain the full meaning and context of what they are looking at”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and you drift peacefully back to sleep.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>aws</category>
      <category>cloudwatch</category>
    </item>
    <item>
      <title>6 Steps to becoming an AWS Solutions Architect</title>
      <dc:creator>Tom Bailey</dc:creator>
      <pubDate>Mon, 13 Nov 2023 15:11:52 +0000</pubDate>
      <link>https://dev.to/tombailey14/6-steps-to-becoming-an-aws-solutions-architect-imk</link>
      <guid>https://dev.to/tombailey14/6-steps-to-becoming-an-aws-solutions-architect-imk</guid>
      <description>&lt;p&gt;I recently completed the &lt;strong&gt;&lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/" rel="noopener noreferrer"&gt;AWS Solutions Architect (Associate) Certification&lt;/a&gt;&lt;/strong&gt;, yes…"groovy" I can hear you say, but what the heck is it? Is it useful? And how was the study process? In this post, I’ll walk you through the 6 steps I took to get that beautiful badge and throw in a few useful tips along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Understanding AWS certifications
&lt;/h2&gt;

&lt;p&gt;You should hopefully already have an idea of what a certification is and the benefits it can bring. Regarding AWS certifications, they not only give you credibility but give you a wider understanding of the services AWS offer, what use cases fit each service and the best practices to follow when architecting them.&lt;/p&gt;

&lt;p&gt;There are a few paths you can take in their certification map (shown below) but we encourage employees to start with &lt;a href="https://aws.amazon.com/certification/certified-cloud-practitioner/" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Cloud Practitioner&lt;/strong&gt;&lt;/a&gt;. It requires less effort (~10 hours study) and gives you a great overview of AWS and experience in the exam process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpyj060sq708oqi9o5pj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpyj060sq708oqi9o5pj.png" alt="AWS exams and potential paths"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source:&lt;/em&gt; &lt;a href="https://pages.awscloud.com/rs/112-TZM-766/images/T&amp;amp;C-Overview-Brochure_Mar-2020_digital-2020.pdf" rel="noopener noreferrer"&gt;AWS Certification T&amp;amp;Cs Overview&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Info:&lt;/strong&gt; AWS are continuously enhancing their exam offerings, so there may be more exams available than detailed above - make sure to check out their latest exams &lt;a href="https://aws.amazon.com/certification/exams/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-associate/" rel="noopener noreferrer"&gt;&lt;strong&gt;Solutions Architect (Associate)&lt;/strong&gt;&lt;/a&gt; sits the level up from Cloud Practitioner. It is certainly a big jump in terms of knowledge and commitment, but provides a much deeper understanding of AWS services. &lt;a href="https://aws.amazon.com/certification/certified-solutions-architect-professional/" rel="noopener noreferrer"&gt;&lt;strong&gt;Professional&lt;/strong&gt;&lt;/a&gt; is higher again and is seen as the most difficult of the “role-based” certifications (shown in grey above).&lt;/p&gt;

&lt;p&gt;You also have speciality certifications, such as ML and Networking. These should not be taken until after completing a professional level “role-based” certification as they require knowledge from both the professional level and the specialist topic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Info:&lt;/strong&gt; A certification lasts 3 years so needs to be re-certified, but if you complete a higher-level one in the path then it re-certifies everything below it, yay!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 2: Making sure AWS Solutions Architect (Associate) is for you
&lt;/h2&gt;

&lt;p&gt;If you are dipping your toes in or have just started in AWS and want a certification to enhance that knowledge then I’d stop here and just go for the Cloud Practitioner first - it works for developers, testers, designers, management and anyone in between wanting to know more about AWS, the services it provides and how it’s architected at a very high level.&lt;/p&gt;

&lt;p&gt;If you have a decent amount of experience working in AWS (&amp;gt; 1 year) and want to really cement that knowledge with a deep understanding of how AWS services are implemented then go Solutions Architect Associate. It’s a big jump from practitioner but really worth the certification (benefitting both you personally and often your company in their AWS partnership).&lt;/p&gt;

&lt;p&gt;Here are some key points about the exam which are good to know before we continue:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimated time to study:&lt;/strong&gt; ~110 hours including taking notes, performing demos in a personal AWS account and taking practice exams. This can of course be more or less depending on how well you ingest information but this is how long I felt it took me to feel confident enough to take the exam and pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exam style:&lt;/strong&gt; 140min closed-book exam (can be proctored from home).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question style:&lt;/strong&gt; Multiple-choice &amp;amp; multiple-answer questions (no essays or free text entry).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marking style:&lt;/strong&gt; No negative marking so you can take a guess if you really need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions:&lt;/strong&gt; There are 65 questions in total, although 15 are not counted in your final mark as they are tester questions for future exams set by AWS (but you of course don’t know which ones are marked/not marked, sneaky).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass Mark:&lt;/strong&gt; 720/1000 so they take the 50 marked questions and scale them out of 1000 depending on some difficulty factors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price:&lt;/strong&gt; $150 exam + revision content purchased (Instil kindly covered both the exam cost and any revision content purchased for us which was a great bonus).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Pick a course for studying
&lt;/h2&gt;

&lt;p&gt;When studying for the Associate exam you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A decent course&lt;/strong&gt; to walk you through the fundamentals&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time set aside&lt;/strong&gt; to take course notes and revise&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An AWS sandbox&lt;/strong&gt; to practice and test out deployments in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practice questions and exams&lt;/strong&gt; to explore the depths of your learning&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Personally, I would recommend Adrian Cantrill’s &lt;a href="https://learn.cantrill.io/p/aws-certified-solutions-architect-associate-saa-c03" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Certified Solutions Architect course&lt;/strong&gt;&lt;/a&gt; - a video-based learning course with guided demos to practice in your own personal AWS account, “end of topic” practice questions and full practice exams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2avv5uden5we9mlyirh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2avv5uden5we9mlyirh6.png" alt="Example video content + demos from Solutions Architect Associate course"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source:&lt;/em&gt; &lt;a href="https://learn.cantrill.io/p/aws-certified-solutions-architect-associate-saa-c03" rel="noopener noreferrer"&gt;&lt;em&gt;https://learn.cantrill.io/p/aws-certified-solutions-architect-associate-saa-c03&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adrian is a great teacher, taking you through 99%+ of the knowledge you need to complete the exam. His course requires no previous knowledge of AWS. Additionally, the demos are often 1-click deployments, so he spins up a lot of underlying architecture quickly, meaning you can just focus on the one topic he is specifically teaching in that demo.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Info:&lt;/strong&gt; I don’t say 100% because I can’t guarantee he covers every single AWS service that may come up in the exam. It’s important not to fully rely on this course and also do some AWS white paper reading, plenty of practice exams and check the &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-sa-assoc/AWS-Certified-Solutions-Architect-Associate_Exam-Guide.pdf" rel="noopener noreferrer"&gt;&lt;strong&gt;exam specification document&lt;/strong&gt;&lt;/a&gt; to make sure that you have everything covered before taking the exam.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The course also includes a lot of information on how the actual AWS exam works, the types of questions asked and the best approach to answering them. This is incredibly useful if you’ve never done an AWS exam before.&lt;/p&gt;

&lt;p&gt;I would recommend completing 100% of this course before attempting the exam. After completing it, you can skim back through what you’ve revised and redo any topics you feel need a refresher.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Practice, practice, practice exams
&lt;/h2&gt;

&lt;p&gt;One of the best places to find practice exams is &lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-associate-practice-exams/" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Certified Solution Architect Practice Exams&lt;/strong&gt;&lt;/a&gt;, available on the Tutorials Dojo Portal. This comes highly recommended by Adrian and others. It allows you to take the exams in different formats including “Timed Mode”, “Review Mode” (question then answer) and specific topic exams if you are struggling in a particular area.&lt;/p&gt;

&lt;p&gt;After doing Adrian’s course I did a lot of these practice exams, which really helped clarify what I understood and what I didn’t, and also gave me confidence in the time I had spent revising. Additionally, if doing a “Review Mode” exam, you are given content from their course with attached videos to help you understand where you went wrong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Info:&lt;/strong&gt; In terms of exam readiness, once I had completed Adrian’s course I did about 8 of these exams. Adrian recommends hitting 90% in the practice exams before attempting the real thing, but I never hit 90% in any... so it is possible to pass while scoring below 90.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All the practice exams I completed were 'closed book' (with some timed) and I was mostly hitting between 75-85%, which gave me the confidence to do the exam. (I had it already booked so didn’t have much choice.) Somewhere around these percentages and above is, I think, a decent indicator of your knowledge, as I’d say the Tutorials Dojo exams are at least as hard as, if not slightly harder than, the real thing.&lt;/p&gt;

&lt;p&gt;Also &lt;strong&gt;don’t panic&lt;/strong&gt; if you fail a practice exam. I failed a couple of practice exams with a big batch of hard questions and certainly panicked, which I didn’t need to. If you have given yourself plenty of time before the real exam, use it to write down the topics you are struggling with (Tutorials Dojo kindly tells you), rewatch some videos, then attempt the exams again, aiming to consistently hit around that 80% mark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Set yourself up for success
&lt;/h2&gt;

&lt;p&gt;Instil will often have study groups running for specific exams, which is really beneficial to the revision process - so perhaps consider starting one in your own company, or find other people who are also taking the exam and revise with them, ask questions and help each other out on the journey!&lt;/p&gt;

&lt;h3&gt;
  
  
  Revision Tips:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Info:&lt;/strong&gt; Allow about 100 to 110 hours to prepare for the exam. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complete the course demos:&lt;/strong&gt; Most of us tend to learn better by doing not just reading. It certainly helps cement the knowledge in your brain. I definitely learned a lot more from &lt;em&gt;using&lt;/em&gt; the actual AWS architecture after learning about it, which meant I could more easily recall things during the exam.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do your practice exams early:&lt;/strong&gt; Give yourself plenty of time between finishing the course and doing the real exam to do practice exams or do topic based exams plugging any knowledge gaps you might have.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use the course as a revision time guide:&lt;/strong&gt; The course was reasonably accurate in its timing; it contained 80+ hours of content including demos. As I was also pausing videos to write notes, revising over my work and completing about 8 practice exams after the course, I put about 110 hours into passing the exam. Using the course as a guide, together with how much time you have per week to revise, you can work out a rough plan for how long it might take you to complete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t fully rely on the course:&lt;/strong&gt; As mentioned, Adrian’s course probably covers 99%+ of what may come up on the exam. It certainly covers the most common questions that require a deeper understanding of particular AWS services, but there are possibly a couple of extra services not mentioned which you will at least want to know the name of. Use the &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-sa-assoc/AWS-Certified-Solutions-Architect-Associate_Exam-Guide.pdf" rel="noopener noreferrer"&gt;&lt;strong&gt;exam guide&lt;/strong&gt;&lt;/a&gt; provided by AWS to help you find content you may have missed - usually it’s just a quick Google to put it at the back of your brain in case it comes up.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Exam Tips:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; I would highly recommend doing the exam on your personal machine. Make sure notifications are completely silenced and if you have to use a company laptop then make sure that software updates are not being enforced during the exam. This would be disastrous.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Be well prepared with your exam space:&lt;/strong&gt; If completing at home then it’s a proctored exam and they are of course strict on what you can and cannot do, including leaving the room, outside noise or even having a pen on your desk. Make sure you test their exam software well in advance and ensure you have a quiet environment in which no one can interrupt you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don’t waste time on the very hard questions:&lt;/strong&gt; If you do Adrian’s course, he has a few videos on how to tackle the exam regarding easy, medium and hard questions. 140 minutes seems like a lot but with 65 questions you only have about 2 minutes per question to read, understand and answer correctly. He recommends going through all 65 questions quickly first, completing all the instant “I know this” questions, then going back completing medium “I need to consider this” questions and finally using whatever time you have left to consider or guess the hard questions.

&lt;ul&gt;
&lt;li&gt;I’d definitely recommend doing the timed practice exams in full exam conditions to practice how quickly you proceed through the exam and whether you’ll have time to go back to flagged questions etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 6: Go for it! You can do it
&lt;/h2&gt;

&lt;p&gt;I’m obviously going to recommend doing this exam as it not only benefits your personal career in AWS but can help with your employer’s AWS partnership. It may seem like a big undertaking but if you follow these steps and create a good revision plan over a sensible timeline (months not weeks), then by the end you will be ready for the exam.&lt;/p&gt;

&lt;p&gt;Then you can go round telling everyone you are an architect! (Even if it’s just an AWS Solutions Architect…you can leave that extra bit out.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5sls2zjaceewkfntiew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5sls2zjaceewkfntiew.png" alt="AWS Solutions Architect Associate badge"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, I printed out this badge and wear it alongside my Blue Peter badge when attending formal events (if you don’t know what a Blue Peter badge is, shame on you).&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awssolutionsarchitect</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Testing Step Functions Locally</title>
      <dc:creator>Tom Bailey</dc:creator>
      <pubDate>Wed, 26 Apr 2023 20:36:23 +0000</pubDate>
      <link>https://dev.to/tombailey14/testing-step-functions-locally-1cp4</link>
      <guid>https://dev.to/tombailey14/testing-step-functions-locally-1cp4</guid>
      <description>&lt;p&gt;Have you built a Step Function with many steps, retries and end states - but you are left wondering, how do I test this masterpiece to ensure it's as wonderful as I think it is? Then you've come to the right place! Have a look at how we test Step Functions locally to give you more confidence in your work.&lt;/p&gt;




&lt;p&gt;As you may have seen in our previous posts, we love Step Functions. It's great being able to build your Step Function in the console, see the payloads passing through your states and watch everything go green, leaving you to say “Wooh! You’ve stepped through a Step Function successfully.” But what if it didn’t? What if it’s actually not doing what you expect - going red and throwing useless errors or, worse, going green but not giving you the response you want? What do you need? Tests!&lt;/p&gt;

&lt;h2&gt;
  
  
  What does AWS provide to help you test?
&lt;/h2&gt;

&lt;p&gt;AWS itself provides some basic tools for testing Step Functions - no, they’re not a silver bullet you can just quickly pick up to fully test your Step Functions - but they certainly give you a jump start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/sfn-local.html"&gt;Step Functions Local&lt;/a&gt; documentation states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS Step Functions Local is a downloadable version of Step Functions that lets you develop and test applications using a version of Step Functions running in your own development environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With Step Functions Local you can test locally or as part of a pipeline. You can test your flows, inputs, outputs, retries, back-offs and error states to ensure it performs as you expect.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Step Functions Local can sometimes lag behind the Step Functions feature set. We have noticed that when a new feature is implemented in Step Functions, the Step Functions Local container image may not be updated to include it immediately. This is understandably not ideal - but you can keep an eye on the container &lt;a href="https://hub.docker.com/r/amazon/aws-stepfunctions-local/tags"&gt;here&lt;/a&gt; for new versions, which AWS actively publishes. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to get it up and running
&lt;/h2&gt;

&lt;p&gt;At Instil, we knew we needed to run these tests as part of the pipeline, but also locally when developing or investigating issues. AWS kindly provides some help with running tests via the AWS CLI, which is great, but we wanted these tests to last and to run as part of our deployment pipeline. So we arrived at the following solution.&lt;/p&gt;

&lt;p&gt;Here’s what you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Step Functions Local (&lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/sfn-local-docker.html"&gt;Docker Image&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/testcontainers"&gt;Testcontainers&lt;/a&gt; package &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/aws-sdk"&gt;AWS SDK&lt;/a&gt; package&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/package/wait-for-expect"&gt;Wait For Expect&lt;/a&gt; package&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Have a look at your Step Function
&lt;/h3&gt;

&lt;p&gt;The Step Functions Workflow Studio is great for building out your Step Function in the console. It makes creating your Step Function user-friendly and makes visualising it super easy. Here we have an example Step Function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QoyfMgWK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://instil.co/static/138972cbce0ae0f7a3014bf99f1c7bbf/71e8d/example-step-function.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QoyfMgWK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://instil.co/static/138972cbce0ae0f7a3014bf99f1c7bbf/71e8d/example-step-function.png" alt="Example Step Function" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;It has a couple of lambdas, a choice state for checking the response of the first lambda, and some success and failure paths. That gives 4 flows we would want to test, if I can count correctly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get Item Price → “Item Price &amp;lt;= 100” → Success &lt;/li&gt;
&lt;li&gt;Get Item Price → “Item Price &amp;gt; 100” → Ask for verification → Success &lt;/li&gt;
&lt;li&gt;Get Item Price → Fail &lt;/li&gt;
&lt;li&gt;Get Item Price → “Item Price &amp;gt; 100” → Ask for verification → Fail&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now we have an idea of what we want to test from our Step Function, we can get to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Download your ASL file from the Step Function Workflow Studio
&lt;/h3&gt;

&lt;p&gt;To use the Step Functions Local container, we need our Step Function in &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/concepts-amazon-states-language.html"&gt;ASL&lt;/a&gt; (Amazon States Language), AWS’s own language for defining Step Functions and their states. You can get this from the Step Functions console by exporting the JSON definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xq3yzqMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://instil.co/static/03c0bfa9f12204960fb7070c1a6cdb28/ba4d9/export-json-definition-button.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xq3yzqMK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://instil.co/static/03c0bfa9f12204960fb7070c1a6cdb28/ba4d9/export-json-definition-button.png" alt="Workflow Studio" width="564" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Get that Docker container spinning
&lt;/h3&gt;

&lt;p&gt;You need the container up and running to be able to run the Step Function locally within it. We used &lt;code&gt;testcontainers&lt;/code&gt; to spin up the short-lived container and have it ready for testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {GenericContainer} from "testcontainers";

const awsStepFunctionsLocalContainer = await new GenericContainer("amazon/aws-stepfunctions-local")
    .withExposedPorts(8083)
    .withBindMount("your-path-to/MockConfigFile.json", "/home/MockConfigFile.json", "ro")
    .withEnv("SFN_MOCK_CONFIG", "/home/MockConfigFile.json")
    .start();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testcontainers maps container port 8083 (exposed above) to a random free port on the host machine, so you don’t need to worry about clashes. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MockConfigFile.json&lt;/code&gt; is the file we use to mock how the AWS services in your Step Function respond during test executions - we will come to how to create it in the next step!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Create your MockConfigFile
&lt;/h3&gt;

&lt;p&gt;The mock config file is how we define the test cases, flows and responses of the AWS service integrations within the Step Function. It makes up the meat of your Step Function testing journey and ultimately controls how detailed your tests are.&lt;/p&gt;

&lt;p&gt;The mock config is a JSON file which according to AWS’ own &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/sfn-local-mock-cfg-file.html#mock-cfg-struct"&gt;documentation&lt;/a&gt; includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;StateMachines&lt;/code&gt; - The fields of this object represent state machines configured to use mocked service integrations. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MockedResponse&lt;/code&gt; - The fields of this object represent mocked responses for service integration calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what ours looks like as a finished product below. Make sure the names of the steps are identical to those in the ASL file, i.e. “Get Item Price” in the test case matches “Get Item Price” in the ASL file.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A great thing you can also do in this file, as detailed in the AWS documentation, is test the retry and backoff behaviour of your steps. For example, you could test that a lambda responds with an error on its first invocation, automatically retries and then returns successfully on its second invocation. Exactly this is shown in the &lt;code&gt;MockedGetItemAbove100&lt;/code&gt; mocked response below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "StateMachines": {
    "ItemPriceChecker": {
      "TestCases": {
        "shouldSuccessfullyGetItemWithPriceBelow100": {
          "Get Item Price": "MockedGetItemBelow100"
        },
        "shouldSuccessfullyGetItemAndVerifyWithPriceEqualOrAbove100": {
          "Get Item Price": "MockedGetItemAbove100",
          "Ask for verification": "MockedAskForVerificationSuccess"
        },
        "shouldFailToGetItem": {
          "Get Item Price": "MockedGenericLambdaFailure"
        },
        "shouldFailToVerifyItemWithPriceEqualOrAbove100": {
          "Get Item Price": "MockedGetItemAbove100",
          "Ask for verification": "MockedGenericLambdaFailure"
        }
      }
    }
  },
  "MockedResponses": {
    "MockedGetItemBelow100": {
      "0": {
        "Return": {
          "StatusCode": 200,
          "Payload": {
            "StatusCode": 200,
            "itemPrice": 80
          }
        }
      }
    },
    "MockedGetItemAbove100": {
      "0": {
        "Throw": {
          "Error": "Lambda.TimeoutException",
          "Cause": "Lambda timed out."
        }
      },
      "1": {
        "Return": {
          "StatusCode": 200,
          "Payload": {
            "StatusCode": 200,
            "itemPrice": 100
          }
        }
      }
    },
    "MockedAskForVerificationSuccess": {
      "0": {
        "Return": {"StatusCode": 200}
      }
    },    
    "MockedGenericLambdaFailure": {
      "0": {
        "Throw": {
          "Error":"Lambda.GenericLambdaFailure",
          "Cause":"The lambda failed generically."
        }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
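&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; for a mocked retry sequence like &lt;code&gt;MockedGetItemAbove100&lt;/code&gt; to play out (attempt &lt;code&gt;"0"&lt;/code&gt; throws, attempt &lt;code&gt;"1"&lt;/code&gt; returns), the corresponding state in your ASL file needs a &lt;code&gt;Retry&lt;/code&gt; rule covering that error. As a rough sketch only (the function name and the &lt;code&gt;Next&lt;/code&gt; state here are illustrative, not our actual ASL), the “Get Item Price” task might look something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Get Item Price": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "get-item-price-function"
  },
  "Retry": [
    {
      "ErrorEquals": ["Lambda.TimeoutException"],
      "IntervalSeconds": 1,
      "MaxAttempts": 2,
      "BackoffRate": 2
    }
  ],
  "Next": "Check Item Price"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;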



&lt;h3&gt;
  
  
  Step 5: Prepping the tests
&lt;/h3&gt;

&lt;p&gt;Now that you have the Step Function and test cases ready, all that’s left is to get them running. This first function creates the client for the Step Functions Local container, allowing you to run commands against the local version of your Step Function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {SFNClient} from "@aws-sdk/client-sfn";

const sfnLocalClient = new SFNClient({
  endpoint: `http://${awsStepFunctionsLocalContainer?.getHost()}:${awsStepFunctionsLocalContainer?.getMappedPort(8083)}`,
  region: "eu-west-2",
  credentials: {
    accessKeyId: "test",
    secretAccessKey: "test",
    sessionToken: "test"
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;&lt;em&gt;Important:&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;As you can see, we used “test” for the credentials above. This ensures the Step Function can never interact with our actual deployed environment in AWS.&lt;/p&gt;

&lt;p&gt;Step Functions Local does allow you to run tests against actually deployed services (so feel free to do that for your use case), but since we have mocked the services using &lt;code&gt;MockConfigFile.json&lt;/code&gt;, we don’t want that here. With fake credentials, executions simply default to the mocked services from our file.&lt;/p&gt;



&lt;p&gt;Next, create your local Step Function instance in the Docker container using the client you just created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {CreateStateMachineCommand} from "@aws-sdk/client-sfn";
import {readFileSync} from "fs";

const localStepFunction = await sfnLocalClient.send(
  new CreateStateMachineCommand({
    definition: readFileSync("your-path-to/ItemPriceCheckerAsl.json", "utf8"),
    name: "ItemPriceChecker",
    roleArn: undefined
  })
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;You can then start a Step Function execution for one of the test cases. This will run the Step Function in the container and use the mocked AWS service integrations defined in &lt;code&gt;MockConfigFile.json&lt;/code&gt; to determine the path it takes. Here is the function you can use; we have wrapped it so it can be run for each specific test case.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;stepFunctionInput&lt;/code&gt; is a JSON string of whatever you would pass in to the Step Function. In our case there is no input to the &lt;code&gt;ItemPriceChecker&lt;/code&gt;, as the item price is retrieved in the first step, so the input can be anything, e.g. &lt;code&gt;{}&lt;/code&gt;. For your own Step Function, make sure to pass in any required input, or use &lt;code&gt;{}&lt;/code&gt; as in the example if none is required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {StartExecutionCommand, StartExecutionCommandOutput} from "@aws-sdk/client-sfn";

async function startStepFunctionExecution(testName: string, stepFunctionInput: string): Promise&amp;lt;StartExecutionCommandOutput&amp;gt; {
  return await sfnLocalClient.send(
    new StartExecutionCommand({
      stateMachineArn: `${
        localStepFunction.stateMachineArn as string
      }#${testName}`,
      input: stepFunctionInput
    })
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
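Worth emphasising: the SDK types the `input` of `StartExecutionCommand` as a string, so it must already be serialized JSON. A tiny self-contained sketch (the `itemId` field is purely illustrative; our `ItemPriceChecker` takes no input):

```typescript
// The Step Function input is a JSON *string*, not an object -
// build it with JSON.stringify rather than passing a raw object.
const emptyInput: string = JSON.stringify({});                      // for Step Functions needing no input
const exampleInput: string = JSON.stringify({ itemId: "abc-123" }); // hypothetical input shape

console.log(emptyInput);   // → {}
console.log(exampleInput); // → {"itemId":"abc-123"}
```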



&lt;h3&gt;
  
  
  Step 6: Finally some testing!
&lt;/h3&gt;

&lt;p&gt;Now that you have a running Step Function execution for a particular test case, we need to actually test that it worked. This is where AWS isn’t super helpful: there is no provided API for interrogating the execution to determine how the Step Function handled your test data. So we had to make our own! Sort of.&lt;/p&gt;

&lt;p&gt;Here’s an example using the Step Function execution from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {GetExecutionHistoryCommand, GetExecutionHistoryCommandOutput, StartExecutionCommandOutput} from "@aws-sdk/client-sfn";
import waitFor from "wait-for-expect";

it("should successfully get item with price below 100", async () =&amp;gt; {
  // The input must be a JSON string, not a raw object
  const stepFunctionInput = JSON.stringify({});
  const expectedOutput = JSON.stringify({
    StatusCode: 200,
    itemPrice: 80
  });

  // This runs the Step Function and returns the execution details using the function created earlier in the post
  const stepFunctionExecutionResult = await startStepFunctionExecution(
    "shouldSuccessfullyGetItemWithPriceBelow100",
    stepFunctionInput
  );

  // This checks the states to ensure the execution successfully completed with the correct output
  await thenTheItemPriceIsReturned(stepFunctionExecutionResult, expectedOutput);
});

async function thenTheItemPriceIsReturned(
  startLocalSFNExecutionResult: StartExecutionCommandOutput,
  expectedOutput: string
): Promise&amp;lt;void&amp;gt; {
  // Since the execution arn is provided, it could still be running so this waits for the execution to finish by checking for the result you need
  await waitFor(async () =&amp;gt; {
    const getExecutionHistoryResult = await getExecutionHistory(startLocalSFNExecutionResult.executionArn);
    const successStateExitedEvent = getExecutionHistoryResult.events?.find(event =&amp;gt; event.type === "SucceedStateExited");

    expect(successStateExitedEvent?.stateExitedEventDetails?.name).toEqual("Success");
    expect(successStateExitedEvent?.stateExitedEventDetails?.output).toEqual(expectedOutput);
  });
}

async function getExecutionHistory(executionArn: string | undefined): Promise&amp;lt;GetExecutionHistoryCommandOutput&amp;gt; {
  return await sfnLocalClient.send(
    new GetExecutionHistoryCommand({
      executionArn
    })
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;There is a lot of information above, but at its heart it simply runs the Step Function in the container and returns the execution information to the test. It then grabs the execution history of the running local Step Function and checks for an event showing it succeeded, which also allows the test to verify the execution output is correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Make sure to tear it all down
&lt;/h3&gt;

&lt;p&gt;One thing that can easily be forgotten is the container still running as part of your tests. Make sure it is torn down correctly at the end of your test run. This can be done very easily as part of an &lt;code&gt;afterAll&lt;/code&gt; hook if running multiple tests, and is simply done by stopping the Testcontainers instance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;await awsStepFunctionsLocalContainer.stop();&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Expand and add more tests
&lt;/h3&gt;

&lt;p&gt;Now this is up to you! You can continue to test the rest of the flow cases for the Step Function, for example checking that a “FailStateEntered” event was emitted in the execution history for the failed cases, or expanding your testing flows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;HistoryEventType&lt;/code&gt; type from the &lt;code&gt;aws-sdk&lt;/code&gt; lists all the event types which can be logged in the Step Functions Local execution history, allowing you to write whatever checks you like against the execution. Here are some example matcher functions we have written for different types of events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import {HistoryEvent} from "@aws-sdk/client-sfn";

async findExecutionSucceededEventInHistory(executionArn: string | undefined): Promise&amp;lt;HistoryEvent | undefined&amp;gt; {
  return await findEventFromExecutionHistory(executionArn, "ExecutionSucceeded");
}

async findFailStateEnteredEventInHistory(executionArn: string | undefined): Promise&amp;lt;HistoryEvent | undefined&amp;gt; {
  return await findEventFromExecutionHistory(executionArn, "FailStateEntered");
}

async findSucceedStateExitedEventInHistory(executionArn: string | undefined): Promise&amp;lt;HistoryEvent | undefined&amp;gt; {
  return await findEventFromExecutionHistory(executionArn, "SucceedStateExited");
}

async findEventFromExecutionHistory(executionArn: string | undefined, eventKey: HistoryEventType): Promise&amp;lt;HistoryEvent | undefined&amp;gt; {
  const history = await getExecutionHistory(executionArn);

  return history.events?.find(
    event =&amp;gt; event.type === eventKey
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
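To show how this kind of history matching works without needing a live container, here is a self-contained sketch of the underlying logic. The `HistoryEventLike` type and the sample events are hypothetical stand-ins for the shape that `GetExecutionHistoryCommand` returns:

```typescript
// Minimal stand-in for the SDK's HistoryEvent shape, so this sketch runs standalone
interface HistoryEventLike {
  type?: string;
  stateEnteredEventDetails?: { name?: string };
}

// Returns the name of the Fail state the execution entered,
// or undefined if the execution never reached a Fail state
function findEnteredFailStateName(events: HistoryEventLike[]): string | undefined {
  const failEvent = events.find(event => event.type === "FailStateEntered");
  return failEvent?.stateEnteredEventDetails?.name;
}

// Hypothetical history for a failed run, shaped like a real execution history
const failedRunEvents: HistoryEventLike[] = [
  { type: "ExecutionStarted" },
  { type: "TaskStateEntered", stateEnteredEventDetails: { name: "Get Item Price" } },
  { type: "TaskFailed" },
  { type: "FailStateEntered", stateEnteredEventDetails: { name: "Fail" } }
];

console.log(findEnteredFailStateName(failedRunEvents)); // → Fail
console.log(findEnteredFailStateName([{ type: "ExecutionSucceeded" }])); // → undefined
```

The same pattern generalises to any `HistoryEventType`: find the event you care about, then assert on its details.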



&lt;h2&gt;
  
  
  You’re good to go!
&lt;/h2&gt;

&lt;p&gt;What we have created above is hopefully something quite simple for testing Step Functions. We additionally improved this by creating a Step Function testing service class which holds all the reusable functions and can be called easily from any test file that needs it. With this we were able to run our Step Function tests as part of our deployment pipeline, giving us greater confidence in our code and allowing us to integrate Step Functions further into our applications.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Important:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's also worth noting that this is not everything we do at Instil to test our Step Functions; it is simply a companion that enables us to test the difficult edge cases, including complicated flows, retries and back-offs. We are advocates for testing in the cloud, and this local testing mixed with integration testing in the cloud (focusing more on Step Functions interacting with other parts of the cloud rather than edge cases) is a good starting place for testing your Step Functions.&lt;/p&gt;

&lt;p&gt;Additionally, we do hope to see some improvements to the Step Functions Local client in future from AWS, possibly providing their own matchers for checking that states have been entered and exited correctly within the tested Step Function, but if not we will just have to do it ourselves!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>stepfunctions</category>
    </item>
  </channel>
</rss>
