
Kazuya

AWS re:Invent 2025 - Supercharge Serverless testing: Accelerate development with Kiro (CNS427)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Supercharge Serverless testing: Accelerate development with Kiro (CNS427)

In this video, Arthi and Thomas demonstrate how to use Kiro and agentic AI to improve serverless testing workflows. They refactor a task management API built with API Gateway, Lambda, and DynamoDB from tightly-coupled code to hexagonal architecture using Kiro's custom agents and spec-driven development. The session covers replacing Moto mocks with in-memory fakes and dependency injection for cleaner unit tests, implementing property-based testing with Hypothesis for algorithmic correctness, and validating integration tests against real AWS services. They showcase schema validation using Pydantic for EventBridge events, demonstrate how Kiro can analyze bug reports to generate risk heat maps, and explain the AI-driven development lifecycle (AIDLC) framework for continuous improvement from inception through construction to operations.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Supercharging Serverless Testing with Kiro and Agentic AI

Hello everyone, good afternoon. Welcome to CNS427, Supercharge Serverless Testing with Kiro. My name is Arthi. I am based in Singapore and I work with customers in Southeast Asia. I've got Thomas with me. Hi, I'm Thomas. I'm based in Sydney. I look after startups across APAC, focusing on agentic coding.

Alright, I hope you've all had a good re:Invent so far. We're almost at the end of re:Invent. So in today's talk, we are going to talk about how you can use agentic AI across your serverless development lifecycle, specifically to simplify testing. Now testing is a vast topic, so today we will focus on automated functional tests for serverless, which includes unit tests, integration tests, and end-to-end tests. So this is a Level 400 talk, so we assume the audience is familiar with serverless and also some of the basics of coding.

Thumbnail 60

Thumbnail 80

To first understand the challenges with serverless, let's start with what's a bit different about serverless applications. Serverless applications are highly distributed or modular, which means they have a larger number of integrations as opposed to traditional apps, and they also make use of a lot of cloud native services. Now this has implications for the testing process where it becomes important to test the integration layers as well. And one of the questions that comes up very often for automated tests is handling dependencies for isolated tests. That is, should you mock them or should you emulate them, or just use AWS services? And finally, your Lambda functions themselves may not be complex enough, so how do you think about things like coverage or where should you actually focus your testing efforts?

Thumbnail 100

Thumbnail 130

So today we are going to see how Kiro can help us with each of these pillars, starting with how you should write your applications to make it easy to test, how you can use agentic AI to generate the tests, and finally how you can combine the power of MCP servers with agents to use historic data to continuously improve the quality of your applications. So because our focus is testing in today's talk, we've actually pre-built the application, which is just a task management API that uses Amazon API Gateway as a REST API, Lambda for processing, and DynamoDB for the persistence layer. There is also an asynchronous component where any task events are published out to Amazon EventBridge, and then they are consumed by a notification service. So throughout the next hour or so, we are going to evolve both the application and write the tests for this application, for this task API. So with that, we are good to start and we're going to switch to the IDE.

Thumbnail 160

Code Walkthrough: Understanding the Task Management API Structure

It's stopped mirroring, I think. It's not mirroring. It's not mirroring, sorry. Sorry, just give us a second. So I'm going to start off by taking you through a quick code walkthrough of the current application. Now, we have chosen Python in this particular case, but a lot of the best practices we talk about will apply for any other programming language as well.

Thumbnail 190

Thumbnail 220

Now as we go through the code base, this is a slightly trimmed down version of the code. So we have published the full version of this application to GitHub and we'll be sharing the links with you later on. So for this talk, we just want you to focus more on how the code is structured and the actual flow and how we're going to evolve it. So real quick, if we look at the directory structure currently, we are just going to focus on the task API behind API Gateway. All of the code is in the task_api folder and our tests are in the tests folder, and we'll start with just the unit test for now. So just a quick check. Can everybody see the code? Is it big enough? If not, just raise your hand. Okay, cool.

Thumbnail 240

Thumbnail 250

Thumbnail 270

So let's start with our Lambda handler first. So we are using Powertools here to simplify implementing some of the serverless best practices. The handler itself, we've chosen to kind of combine all the task CRUD operations into a single Lambda function as against a micro Lambda, but Powertools makes it really easy to route the request to the correct function. So for example, if you get a POST on the slash tasks endpoint, it'll end up invoking our create task function. So it's fairly straightforward. We first parse the event to extract what we want, which is just our task details. We build the task object. Then we persist the task to our database, in this case DynamoDB. We publish the event and then we construct the response, and there is some basic error handling here.
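For readers following along, here is a minimal sketch of what Powertools-based routing like this might look like. The module path `task_api.integrations` and the function names `save_task` and `publish_task_event` are assumptions for illustration, not the session's exact code.

```python
# Hedged sketch of a Powertools REST handler with a single Lambda for all task CRUD.
import uuid

from aws_lambda_powertools import Logger
from aws_lambda_powertools.event_handler import APIGatewayRestResolver

from task_api.integrations import publish_task_event, save_task  # assumed module

logger = Logger()
app = APIGatewayRestResolver()


@app.post("/tasks")
def create_task() -> dict:
    # 1. Parse the event to extract the task details.
    body = app.current_event.json_body or {}

    # 2. Build the task object.
    task = {
        "task_id": str(uuid.uuid4()),
        "title": body.get("title", ""),
        "dependencies": body.get("dependencies", []),
    }

    # 3. Persist the task and publish the event (DynamoDB / EventBridge adapters).
    save_task(task)
    publish_task_event("TaskCreated", task)

    # 4. Construct the response (Powertools serializes the dict for us).
    return {"task": task}


def lambda_handler(event: dict, context) -> dict:
    # Powertools routes, e.g., a POST on /tasks to create_task above.
    return app.resolve(event, context)
```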

Thumbnail 280

Thumbnail 290

Thumbnail 300

Our task handler also enforces some of the business rules. So as an example, if you're updating a task and you're defining dependencies between the tasks, you don't want to end up with circular dependency. So in this case, when you update a task, there is a check here where we've created a helper function that validates dependencies, but the rest of the flow is the same. You update the database, you publish the event, and then you construct your response.

Thumbnail 310

Thumbnail 320

Our helper function here, given a task, basically queries the database to build the existing dependency graph, and then it passes through the graph to identify if you're going to violate the rule or not. The models file is pretty straightforward. This has just our data classes. Our domain logic has business rules. Again, in a real-world scenario, you would have a lot more rules here, but to keep things simple, we will simply focus on the circular dependency check for the demo today.

Thumbnail 340

Thumbnail 350

Thumbnail 360

So the idea here is given a task, its dependency, and the dependency graph from the database, this is just doing a depth-first search to identify circular dependency. And then the very last file we have here is the integrations where we define the integration with AWS services. So you can see we've used Boto3 and initialized our clients here. This is the module that has the implementation of the methods invoked from our handler. So our save task to DynamoDB ends up calling the put item API, and likewise, this is where we actually publish the event to EventBridge.
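As a rough illustration of the depth-first search described here, the sketch below assumes the dependency graph is a dict mapping each task ID to the IDs it depends on; the exact shapes in the session's repository may differ.

```python
# Hedged sketch: detect whether adding new dependencies to a task creates a cycle.
def has_circular_dependency(task_id: str, new_dependencies: list[str],
                            graph: dict[str, list[str]]) -> bool:
    """Return True if adding new_dependencies to task_id would create a cycle."""
    # Work on a copy of the graph with the proposed edges added.
    proposed = {node: list(deps) for node, deps in graph.items()}
    proposed.setdefault(task_id, [])
    proposed[task_id].extend(new_dependencies)

    def dfs(node: str, visiting: set[str]) -> bool:
        if node in visiting:          # back edge -> cycle found
            return True
        visiting.add(node)
        for dep in proposed.get(node, []):
            if dfs(dep, visiting):
                return True
        visiting.remove(node)
        return False

    return dfs(task_id, set())


# Example: task-2 already depends on task-1, so making task-1 depend on task-2 closes a loop.
assert has_circular_dependency("task-1", ["task-2"], {"task-2": ["task-1"]})
assert not has_circular_dependency("task-1", ["task-2"], {"task-2": []})
```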

Thumbnail 390

Current Testing Approach: Using Moto for Unit Tests and Its Limitations

Let's take a very quick look at how we've written the unit test for our Task Handler. Now, because our task actually persists tasks to DynamoDB and publishes events, if we have to unit test this, we have to mock out those dependencies. So in this particular case, currently we are using Moto. Moto is a Python library specifically designed to mock out Boto3. That makes our life a little bit easier in a couple of ways. Our pytest fixtures are just the reusable setup and teardown for our tests. As a best practice, we have set the AWS credentials to dummy values for our unit tests.

Thumbnail 420

Thumbnail 430

So with Moto, you just need to use the mock AWS context manager, and the advantage is because it's designed for Boto3, you're still using the same Boto3 client calls. What this is also doing transparently is that it's monkey patching the Boto3 calls during runtime. Basically, it'll intercept calls to Boto3 at runtime and replace it with our mock. We also need to mock our EventBridge. We have created a mock test context, and just a quick look at two tests. Our first test is a successful scenario where, given a task, we create the task successfully within the database.
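A minimal sketch of what such Moto-based fixtures might look like, assuming moto 5's `mock_aws` and an illustrative table and bus name (the actual fixture names and table schema in the session's repo may differ):

```python
import os

import boto3
import pytest
from moto import mock_aws


@pytest.fixture
def aws_credentials():
    # Best practice: dummy credentials so the tests can never touch a real account.
    os.environ["AWS_ACCESS_KEY_ID"] = "testing"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "testing"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"


@pytest.fixture
def tasks_table(aws_credentials):
    # mock_aws monkey-patches boto3 at runtime, so the calls below never leave the process.
    with mock_aws():
        dynamodb = boto3.resource("dynamodb")
        table = dynamodb.create_table(
            TableName="tasks",
            KeySchema=[{"AttributeName": "task_id", "KeyType": "HASH"}],
            AttributeDefinitions=[{"AttributeName": "task_id", "AttributeType": "S"}],
            BillingMode="PAY_PER_REQUEST",
        )
        yield table


@pytest.fixture
def event_bus(aws_credentials):
    with mock_aws():
        events = boto3.client("events")
        events.create_event_bus(Name="task-events")
        yield events
```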

Thumbnail 440

Thumbnail 450

So we need to pass our fixtures as arguments to the test case. We import the handler, we create our test event, we invoke the Lambda handler with the event and the context, and then we validate the response. You can optionally test the mock state as well. And then let's take a look at another test case that enforces the circular dependency rule. This test is a little bit more involved than the first because this depends on the dependency graph existing in the database.

Thumbnail 470

Thumbnail 490

So as part of setting up this test, we need to create a few tasks, in this case task one and task two, and we persist these tasks to our mock DynamoDB database. And the rest of the steps are pretty much the same. We create the test event, invoke the handler, only in this case we expect an error response. So I'm going to run this test in a second to show what happens. But this is our current code base. So the question here is, let's say our business rules change in the future, the needs change, and maybe we need to replace DynamoDB with something else, maybe DocumentDB, or maybe we need to replace EventBridge with something else. Can you think about the implications of this?

Thumbnail 530

Thumbnail 540

Of course, we'll have to update our integrations file to talk to the new services, but just from a testing or developer experience perspective, do you see any challenges with the way the code is written right now? You can just shout out the answers and I'll just run the test suite. How many of you ran into difficulties when you were trying to change the integration architecture and then you had to change all the testing, all the related code afterwards? Just give us a raise of hands. There you go. Yeah, so as we did the code walkthrough, our test cases, for example, that's testing the handler code, and it's basically testing the response status and the details of the response returned. That test will have to be updated because now the mock will have to be changed to work with the new services we pick.

So although you did not change the actual code that the test case was covering, you need to rewrite all those tests, which creates extra work. So I have just run the test here. For those not familiar with Python, Poetry is just a library commonly used for packaging and dependency management, and pytest is a really common testing framework in Python. So I've turned on the timing for our tests here. You can see that there is a slight overhead in our setup where we are initializing the mock functions, and it varies between 400 and 500 milliseconds.

So in this case, we've only got two tests. The actual runtime depends on whether the dependencies are cached or not, so we've run this a few times on our laptop, but there's room to probably improve and make our tests a little bit faster. So just given this code base,

Thumbnail 610

let's quickly summarize what we have seen. Currently, the way we've written the code, this is very tightly coupled to the infrastructure choices. So if you have to change anything on the AWS layer, you end up rewriting tests that should not really be affected by the change. There is a little bit of friction with the developer experience that you do need to know for mocking exactly how the services work. And then of course there's room to potentially improve our tests and make them faster.

Architectural Review with Custom Kiro Agents and Hexagonal Architecture

So now I'm going to pass it over to Thomas to see how we should address these problems and what's the best way to write the tests. Thanks, Arthi. Before we proceed, how many of you are Python developers? Just give us a raise of hands. Good number. For those of you that don't work with Python, don't worry, all these principles are applicable across the board. We just picked Python because we are familiar with it, but all the principles essentially can be applied to anything and everything. Just give me one second. Oops. All right.

Thumbnail 680

Thumbnail 710

So Arthi walked us through our current architecture. We saw some pitfalls in there. Now let's have a look at how we're going to fix it. So what I'm going to do, I'm going to use Kiro CLI for this particular task. Now just give me one second, I'll just fire it up. And you may notice that I'm not using the traditional invoke command for Kiro CLI. I'm using something called a custom agent, and I'll tell you what a custom agent is in just a minute. I'm just going to pass a prompt which is essentially asking Kiro to review my current architecture, point out the problems, and propose how to fix them. So I'll just execute that and we'll get to that in a minute. I'll just leave it running.

Thumbnail 730

Thumbnail 740

Thumbnail 750

Thumbnail 770

I just want to show you what's under the hood. So if we go to the Kiro folder in our project folder, we'll have a section called agents. I'll open it up, remove this for a minute, and make this a little bit bigger. So this is essentially the configuration of a custom agent. Now, what a custom agent is, it's an instance of the Kiro CLI agent that we can configure for a specific task. So you can notice that I have a specific set of MCP servers just for this particular operation, and I'm also passing a description and a prompt to the agent, which is used alongside the prompt that I pass to it. So think of it as a specialized agent for a particular task.

Thumbnail 790

Thumbnail 810

So in our case it's going to be evaluating our architecture. You can have specialized security auditing agents. You can have compliance agents, etc. So it depends on the use case. We also have built-in agents that we released this week, such as the AWS Security and AWS DevOps agents, which sit alongside Kiro. They're not inside of Kiro at the moment, but these are capabilities that you can configure inside of Kiro or Kiro CLI. We also configure tools, so it's quite customizable. You can configure steering files that actually tweak how the agent operates and what kind of output it returns.

Thumbnail 820

Thumbnail 830

Thumbnail 840

Thumbnail 850

So with that said, let's go back to our terminal and double check what the response is. We can see that Kiro, I'll scroll all the way up, did a review and read our whole project essentially. At points here and there it used the MCP configuration just to enhance its answers, and it generated a full hexagonal architecture audit report that we can either read in this terminal format or read in a proper markdown format, which I'll show you and is probably easier to read. I'll just flip here.

Thumbnail 870

Thumbnail 890

Thumbnail 900

So there are a few sections there. I'm not going to go through them in detail, but this just showcases how you can leverage agentic AI to help you with a code review and also look around corners, because you know, we are developers, we know what we're doing, but sometimes we don't account for every single scenario, and we may not see everything related to our application that might potentially need updating or improving. So we can see already that it suggests we fix the domain logic dependencies. There are more sections related to the pattern, how we designed the architecture of the application, etc. As I said, I'm not going to go through this in a whole heap of detail, but you can see there's quite a bit of information there.

Thumbnail 910

Thumbnail 920

Now, why is this useful? Because we can essentially go to Kiro and we can ask it to generate a spec-driven development flow, which creates specs that we can use to quantify the requirements, design, and implementation of individual tasks. It's particularly useful for feature or general software development, but in our case we're going to use it to create a plan to integrate the changes that Kiro actually suggested us to do. So I'll just reference the file that has our evaluation, our audit, and I'll ask it to generate.

Thumbnail 970

Thumbnail 980

Just while I'm doing this, I think I mentioned hexagonal architecture. So the thinking there was that our code base was tightly coupled across different concerns, so the idea was how do we decouple it, and hexagonal architecture is one way to do that. The idea is you've got ports that only deal with the interfaces, you've got the core business logic, and then you have a glue layer that connects the interfaces to the business logic. So we have used that as a baseline to rearchitect the code.

Thumbnail 1010

Thumbnail 1030

Thumbnail 1040

Thumbnail 1050

Spec-Driven Development: From Requirements to Implementation

Exactly. So we can see that this was a little bit faster than usual just because I actually have a spec created already for the sake of time, but normally Kiro would go in and create the whole spec from the ground up with requirements, design, and tasks. We can see those three files referenced here, and it actually gives us a description of what it did. Now, I'll show you what those files actually look like. So if I go back to my Kiro folder, I'll just minimize this. We have a folder called specs, and in there we have a hexagonal spec. If I open the requirements, I'll just get this here. Where am I? Here, okay. So in the requirements file, hold on, there's always something with live demos.

Thumbnail 1060

Thumbnail 1070

Okay, let's minimize this. There we go. So this is our requirements file. You can see that we have an introduction, we have a glossary to kind of understand all the terms, and we have all the requirements and especially acceptance criteria that are needed for every single project, any single change that you're making to your project, be it a feature, be it architecture, be it anything essentially. It's very, very useful when working with AI especially.

Thumbnail 1080

Thumbnail 1090

Thumbnail 1100

Now the next file that it created for us is design. So this is the full design of changes. We can actually see the diagram of changes that it'll be implementing into the process. This is the target architecture, how it's going to augment or rather decouple features to make them more flexible, more versatile, and easier to test as well. I'm not going to go through all of this in detail. I just want to highlight the process. But the most important thing here is the task list.

Thumbnail 1140

Thumbnail 1150

Now here we could go to Kiro and ask it to start executing these either in sequence one by one, or do the whole project essentially in one go. It depends how much time you have, depends if you're doing something on the side, but you can essentially delegate this to Kiro to go through on its own. Observe it at some point, verify that it's doing the job that it's supposed to do, but essentially it can do the migration, or rather refactoring, of the application on its own. So I'm going to make it easy on myself, I'll just fast forward. Normally this would take a little bit of time, let me just make this a little bit bigger. I hope you can see everything now.

Thumbnail 1190

So just to illustrate the original state that we had when Arthi was going through the application, we had something like this. We had everything kind of coupled together. Yes, we had some files that were separate, but it was all kind of bundled together with tightly coupled references. So all the HTTP parsing, business logic, and all the integration calls were kind of stuck together. Now, after the audit and the implementation of the audit's findings, we would use Kiro through a spec-driven approach to modify our architecture and refactor it. This is what it would look like. This is the current state of our architecture.

Thumbnail 1200

So we can see we still have our handler, but it's much, much leaner. It's only processing HTTP requests. Then we have all the business logic kind of offloaded to our domain layer, and I'll go through this in more detail in a second. And then also through interfaces,

Thumbnail 1210

Thumbnail 1220

Thumbnail 1230

Decoupled Architecture: Implementing Protocols and Dependency Abstraction

we are communicating with the adapters that are invoking our services, so the integration. Now what does this actually look like? If I go to the task handler again, and I'll just minimize this side. So in our main handler, we have a method called create task, right?

Thumbnail 1250

Thumbnail 1260

The important bit here is that we're no longer coupling anything. We're calling a delegated task service. Now if I scroll up here and I go to the site, essentially that action is calling the domain logic. If I flip to domain logic, essentially, and I'll go to the top of the task service, we can see that this particular service requires a repository and event publisher to process. So there's two kinds of streams that we see. We have one stream that essentially does an operation when we don't specify them, and I'll show you what that does. But if we do specify them, we can point it to specific integration that we can manage.

Thumbnail 1300

Now what does this mean? If we don't specify these parameters, our application defaults to what it has under the hood and uses interfaces, what we call protocols. Now protocols are specific to Python, but what they essentially do is create contracts without the need for an implementation. And this is what you can see here, because the methods in our class for the TaskRepository protocol are pretty much empty. There's nothing really in there. Now, why this is good is because we can point it to anything, right?
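A minimal sketch of this protocol-plus-defaults pattern, assuming the class and method names used in the talk (TaskRepository, TaskService) but with signatures and module paths that are illustrative assumptions:

```python
from typing import Optional, Protocol


class TaskRepository(Protocol):
    # A contract with no implementation: anything offering these methods
    # satisfies the protocol (DynamoDB adapter, in-memory fake, ...).
    def save_task(self, task: dict) -> None: ...
    def get_task(self, task_id: str) -> Optional[dict]: ...
    def delete_task(self, task_id: str) -> None: ...


class EventPublisher(Protocol):
    def publish_task_event(self, detail_type: str, task: dict) -> None: ...


class TaskService:
    def __init__(self,
                 repository: Optional[TaskRepository] = None,
                 publisher: Optional[EventPublisher] = None) -> None:
        # If nothing is injected, fall back to the real integrations
        # (the DynamoDB and EventBridge adapters used in production).
        if repository is None:
            from task_api.integrations import DynamoDBTaskRepository  # assumed path
            repository = DynamoDBTaskRepository()
        if publisher is None:
            from task_api.integrations import EventBridgePublisher  # assumed path
            publisher = EventBridgePublisher()
        self.repository = repository
        self.publisher = publisher

    def create_task(self, task: dict) -> dict:
        self.repository.save_task(task)
        self.publisher.publish_task_event("TaskCreated", task)
        return task
```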

Thumbnail 1330

Thumbnail 1340

Thumbnail 1350

But in our case, if we go back to the logic, essentially we can see that the repository in this case is set to none, which means that we are calling integrations. And through integrations, if I go to integrations, we are calling our integration to DynamoDB or the EventBridge, right? And we can manage this, we can modify this, we can point it to something else if we need to. This is for the case where we did not specify the repository and the event publisher. If we do specify it, it depends on what we specify. The route will be different essentially, and this is particularly useful for tests.

Thumbnail 1370

Thumbnail 1410

So back to our architecture, just want to recap here. This is how the decoupling would work here. And in terms of what we covered essentially in this part of the session, we essentially showed how the handler performs the validation. I'll just take this off. We saw how the domain logic is essentially containing all the business logic and business functionality. We saw how the integration layer works. And then also we saw how the abstraction works with the handler itself, so we are abstracting the logic, the business logic and the integration from the handler itself. And we are relying on the domain to kind of interface to the repository or the publisher to create a contract. And afterwards the integration implements through the interface the AWS services. And back to you.

Dependency Injection for Testing: Moving from Mocks to Get Task Service

Yeah, I'll show you the test. So just to recap, the task, our domain logic just expects a TaskRepository that offers the save task method or a delete task method. It does not know whether it is implemented using DynamoDB or Aurora or whatever might be the service, and the exact implementation logic is contained within our integration layer. So if you go back to the problem statement of swapping, say, DynamoDB with something else, your integration changes. But as long as the database is exposed through the same save task method, we don't need to change the tests in our handler and the logic code. So I'm going to now actually code out the tests and then we'll see how that looks different or simpler than before.

Thumbnail 1470

Thumbnail 1480

Now, we'll start with the task handler, sorry, the test for the task handler. Before I move to the test, there's one thing I wanted to call out. So if you saw when Thomas walked us through the code, we did not directly initialize the task service in the handler. We actually used the get task service method. So what we are doing here is setting up our code for dependency injection. I'm going to talk a bit about why that makes testing easier. Again, those of you familiar with Java, this will probably seem intuitive, not so much in the Python world, but we'll see how we can do it. So if we actually look at the get task service,

it looks for a module-level variable holding the task service. If this is set, it's going to return it as it is, but if it's not, it's going to initialize the task service, and this is the flow that would actually kick off when our Lambda is invoked in production. The key thing I want to call out is that at runtime our application does not rely on dependency injection because we have provided default flows. This piece of additional code was specifically written to simplify testing, and we'll see how that simplifies testing.
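A minimal sketch of that seam, assuming a module-level variable as described; the `TaskService` import path is an assumption.

```python
from typing import Optional

from task_api.domain import TaskService  # assumed module path

_task_service: Optional[TaskService] = None


def get_task_service() -> TaskService:
    global _task_service
    # If a test has already injected a (fake) service, return it as-is.
    if _task_service is None:
        # Production path: lazily build the real service, which falls back to
        # the real DynamoDB repository and EventBridge publisher defaults.
        _task_service = TaskService()
    return _task_service
```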

Thumbnail 1540

Thumbnail 1560

So let's go back to our test task handler. At this point, our handler still relies on the domain logic, the task service, so we still need to mock that out to test it in isolation from the rest of the code base. The way to do that would be, let's say we start off by mocking get_task_service. All right, something like this. I'll accept this and I'll explain what this is doing. So now to unit test our task handler, we need to mock out only the domain logic, so we are no longer concerned with the actual AWS services in use. We just want to validate that our task handler works well and returns the correct response code to our end client.

Thumbnail 1620

We need to do a couple of things. One is we are going to mock get_task_service to return a mock object. In this case, that mock is called mock_get_task_service. If you remember when we spoke about Moto for Boto3, Moto automatically understands Boto3 API calls, but in this case, this is a custom mock, so we need to configure the behavior of the mock task service. Within our test case, we will have to go ahead and configure the mock. I'm going to just keep this simple, but basically you will have to specify the return value. I won't accept this because we're going to do the test a bit differently, but just to give you the picture, first, I need to create what the return value for a successful create task call looks like.

And then we also need the monkey patching because the patch now has to be managed by our tests. Essentially every time get task service is called, we want to insert our mock into the picture. That's basically what this test is doing. Now the thing with this approach is, remember we just wrote two sample tests, but we are going to write tests for all our different resources and methods, and at times we want to simulate the error. Depending upon the behavior we want to simulate, our mock will either have a return value or a side effect, which would simply be raising the exception. So what happens is all our test code is now riddled with a whole bunch of mock code, and we also need to patch the code at runtime.
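For illustration, a sketch of what this patch-based approach (the one we are about to move away from) might look like; the handler module path, the `lambda_context` fixture, and the expected status code are assumptions.

```python
from unittest.mock import MagicMock


def test_create_task_with_patched_service(monkeypatch, lambda_context):
    from task_api import handler  # assumed module path

    # A custom mock: its behaviour has to be spelled out in every test.
    mock_service = MagicMock()
    mock_service.create_task.return_value = {"task_id": "task-1", "title": "demo"}
    # For error tests you would use a side_effect instead, e.g.:
    # mock_service.update_task.side_effect = SomeCircularDependencyError()

    # Patch get_task_service at runtime so the handler receives our mock.
    monkeypatch.setattr(handler, "get_task_service", lambda: mock_service)

    event = {"httpMethod": "POST", "path": "/tasks", "body": '{"title": "demo"}'}
    response = handler.lambda_handler(event, lambda_context)

    assert response["statusCode"] == 200  # assuming a successful create maps to 200
    mock_service.create_task.assert_called_once()
```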

Thumbnail 1680

Thumbnail 1690

In-Memory Fakes: Creating Configurable Test Doubles for Cleaner Tests

Now the tricky thing with patching at runtime is it can get brittle because if we change the logic of the domain service, it can break in unexpected ways, and also patching can sometimes leak states across tests, and this is where we are going to use dependency injection. Instead we are not going to use the mock service, so let's see how dependency injection simplifies our life. So the first thing is we want a highly configurable mock whose return value can be changed depending on the test we are running, whether it's an error or a success scenario.

Thumbnail 1720

So what we really need is a configurable fake task service. This is just an in-memory fake at this point, this is not a mock. What we have done is you can initialize the fake task service with a bunch of flags that tell whether the service raises an exception or just works as expected and follows the happy path. For example, it has the same methods that our task service offers, but when we call create task, the first check we do in this in-memory fake is to check whether the exception flag is set. If set, it'll raise an exception, if not, it'll go ahead and simply return the task object.
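A minimal sketch of such an in-memory fake; the flag and method names are assumptions modelled on the talk.

```python
class CircularDependencyError(Exception):
    pass


class FakeTaskService:
    """In-memory stand-in for the real TaskService, configurable per test."""

    def __init__(self, should_raise_circular_dependency: bool = False,
                 should_raise_error: bool = False) -> None:
        self.should_raise_circular_dependency = should_raise_circular_dependency
        self.should_raise_error = should_raise_error
        self.tasks: dict[str, dict] = {}

    def create_task(self, task: dict) -> dict:
        if self.should_raise_error:
            raise RuntimeError("simulated backend failure")
        self.tasks[task["task_id"]] = task
        return task

    def update_task(self, task: dict) -> dict:
        if self.should_raise_circular_dependency:
            raise CircularDependencyError("simulated circular dependency")
        self.tasks[task["task_id"]] = task
        return task
```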

Thumbnail 1740

Thumbnail 1760

So in scenarios where you want to highly customize the behavior of your dependency, it is kind of easier to do that with in-memory fakes rather than a mock. The complexity of the mock configuration has now moved here, so we are not essentially writing additional code, but it's just where the complexity goes. It's now in the in-memory fake. Now this is the first step. Now we of course need a fixture again, because the second step is we need to replace the original task service call with this mock.

So how are we going to do that? We said patching is brittle, so what we are going to do instead is we are going to use dependency injection here. Within our Pytest fixture now, we initialize the fake task service. Now if you remember, our get task service looks for the module level variable, whether it's set or not. So in our test, we are basically setting that module level variable to our fake task service, and we have basically injected our in-memory fake into the test. And that's it, so anything before the yield is setup and after the yield statement is your teardown in Python. And conftest is a file that Pytest automatically loads, it's just where your reusable code goes.
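A sketch of what that conftest fixture might look like, assuming the module-level variable lives next to get_task_service and that the fake sits in a tests module (both assumptions):

```python
# conftest.py: inject the in-memory fake through the module-level seam
# instead of patching at runtime.
import pytest

from task_api import handler              # assumed module exposing get_task_service
from tests.fakes import FakeTaskService   # assumed location of the fake


@pytest.fixture
def fake_task_service():
    fake = FakeTaskService()
    # Setup: set the module-level variable so get_task_service() returns our fake.
    handler._task_service = fake
    yield fake
    # Teardown: reset so no state leaks into the next test.
    handler._task_service = None
```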

Thumbnail 1810

Thumbnail 1830

So how does this actually make our test cleaner? The first thing is we want to use our fake. We will probably go to this section. We don't need this part because it was generated, so I'll stick to the previous structure we had for the test and let's see what changes. The first thing is we are now going to make use of the fake task service, so we need to pass this as a variable. We still create our test event. Now remember, the behavior of our fake task service when the flag is not set is to just return a successful task response. So all we need to do to run our tests here is import our handler code. This should actually look similar to what we were doing in the old tests.

I then invoke this. The autocomplete is a bit laggy, but basically we're going to do the same thing that we did before. I call the lambda handler directly with the event and the mock lambda context. Then all we need to do here is validate the response. It generated more checks this time, so we could optionally validate the response in more detail, but I'll keep it simple for the demo. So if you look at it, that's about it for testing the happy path for the test task handler. We have used the fake task service, which gets injected into our task domain, and it can simulate all the different scenarios that our task service can raise.
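Put together, the simplified happy-path test might look roughly like this, assuming the fixtures above, a `lambda_context` fixture, and an API Gateway proxy-style event (payload shape is an assumption):

```python
import json

from task_api import handler  # assumed module path


def test_create_task_returns_success(fake_task_service, lambda_context):
    event = {
        "httpMethod": "POST",
        "path": "/tasks",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"title": "Write re:Invent recap"}),
    }

    response = handler.lambda_handler(event, lambda_context)

    # The fake's default behaviour is the happy path, so we only validate
    # what the end client would see.
    assert response["statusCode"] == 200
```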

Thumbnail 1900

Thumbnail 1920

So as an example, now if we take the case of the circular dependency error, if you remember the previous code, we had all this logic to create the dependency map in the database and so on. That goes away, because now I can simulate the error simply by setting the should_raise_circular_dependency flag here. The rest of the steps are going to look similar to what we did with the previous test. So you can see here that our tests are now vastly simplified.

So what this means as a developer in the future is, if our domain logic changes or you change the behavior of the task service, this file is the only place where I need to make changes to get my unit tests to work. Having said that, it doesn't mean that you should avoid mocks at all costs. For example, if we take our notification service that's processing things from EventBridge, let's say it is sending out an email as a reminder when your task is due. There's no need for me to fake an entire email server or SES because that's a third-party dependency. All I need to know for that test is that the send message method was actually invoked and I'm good.

Thumbnail 1980

But for this particular case where we own the code for the domain logic and we want to highly customize the behavior, in-memory fake combined with dependency injection makes our code a lot cleaner. But then that leads to the question, what about our domain logic that actually validates that the circular dependency function works? So this becomes a whole lot simpler now because this is just a pure function that given a task and a dependency map, it is just going to check and return true if there is a circular dependency or false if not.

Thumbnail 2000

Thumbnail 2010

Thumbnail 2020

Thumbnail 2030

So if we look at the test case for this, this is our current dependency. Let's say task one depends on task two, and task two has no dependency at the moment. I'm going to try to force a circular dependency by calling this has circular dependency, and I passed this mock dependency graph I've created, and all I need to know is that this is going to return true. The negative scenario is equally simple, so I now have an empty graph here, and at this time this should basically say that there is no circular dependency.
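A sketch of those two pure-function tests; the function name and argument order follow the talk's description but are assumptions.

```python
from task_api.domain import has_circular_dependency  # assumed module path


def test_detects_circular_dependency():
    # task-1 already depends on task-2; making task-2 depend on task-1 closes the loop.
    graph = {"task-1": ["task-2"], "task-2": []}
    assert has_circular_dependency("task-2", ["task-1"], graph) is True


def test_allows_acyclic_dependency():
    # Empty graph: nothing to conflict with, so no cycle is reported.
    assert has_circular_dependency("task-1", ["task-2"], {}) is False
```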

Thumbnail 2060

Thumbnail 2070

So that's it, and these tests again don't need to change in the future. I mean, if you change the logic for how you calculate circular dependencies, all you need to do is rerun the tests, so the effort kind of goes down. And the third thing we called out was the impact of this on the timing of the tests. So both of these are in the unit folder, so we'll stick to this. Moto is a library that mocks all of the AWS services, so it has a slightly bigger overhead. But for our particular case, the in-memory fake, as you can see, is barely 120 milliseconds. It's a lot more lightweight, and our tests are that much cleaner.

Property-Based Testing: Validating Algorithmic Correctness with Hypothesis

So we have now wrapped up our unit tests, but this takes us to the next question. When you're using agentic AI or spec driven development to build applications, let's say you use the requirements, build the code, then you ask the agent to write the test for the generated code, and it might do a great job of comprehensive tests. But there is no way to validate that your requirements were correctly translated to the code in the first place.

Your tests are validating the code that's generated. So how do you solve this problem? Kiro recently introduced property-based tests to do that. While Thomas explains that, I'm going to run the test, but I'll explain what I'm doing after he's done to save time.

Thumbnail 2130

Thumbnail 2150

Who has heard about property-based testing, in Kiro or in general? Okay, I'll use an analogy to explain this. Think about a case where you're building your new tests. It's kind of like building a bridge, right? You build a bridge, you want to test that it actually holds the load that it's supposed to hold. Now, would you rather test it with your own car only, or would you rather test it with your car, my car, Arthi's car, a truck, an ambulance, an elephant, right? So that's the difference between the traditional way of creating tests and property-based testing.

Thumbnail 2170

With property-based testing, you essentially define properties, and then the system runs not just one test. It can execute 100 tests during the same execution, and Arthi is going to show you what that actually looks like using a few Python modules. It simplifies the process because it makes tests simple to define but also fast to execute. So, like Thomas said, it's also great for detecting edge cases in your business logic. The idea here is we'll just see, I've run this test and I'll explain this in a minute.

Thumbnail 2200

Thumbnail 2230

Thumbnail 2240

I'll start with the scenario I'm trying to explain. If you think about a circular dependency check, that's a good example where property-based tests actually work well because we just want to verify the algorithmic correctness of that code and whether it does what it is supposed to do. That's really it. So to do that, we are going to take a slightly different approach. Where is the code? I think I might have the wrong file open. Hang on, just give me a second. Okay, it's just doubled. Why does it split it like that? Yeah, I think I did that before. All right, that's strange. Okay, bear with me. Actually, this is the first time Thomas and I are running this together, so I'm not used to his laptop, so please bear with me.

Thumbnail 2250

All right, so in this case, let's say this is the scenario we want to test for algorithmic correctness. When you saw the unit test we wrote for the domain logic, we created these simple dependency graphs and we tested it. Like we said, task one depends on task two, and if we make task two depend on task one, it should detect that. But let's say there are complex kinds of dependencies that exist in our database, where each of these nodes represents a task. So let's say this task depends on one, this depends on two, and so on. Now, if I try to force this dependency of this main two back to main zero, and if there was a bug in the logic, let's say it only traverses the right side of this branch, it might incorrectly conclude that, you know what, this doesn't set up a circular dependency, let's allow this to go through. But if my logic is written correctly, then it should traverse all branches that it encounters on the way. So this is a good use case where property-based tests might help us.

Thumbnail 2310

I've basically actually run the test. So, a few quick things to note. Python has a library called Hypothesis that allows you to do property-based tests. Obviously with property-based tests, instead of fixed inputs, we are generating a large number of inputs from a given problem or input space. However, the data is not completely random. So in our case, task IDs are UUIDs, so that's why I've used the strategies here in this case to specify that I want to create a bunch of UUIDs for my testing because that's what I'm using for my task.

Thumbnail 2340

Next, let's just focus on this branching test. You see the given decorator here defines the input space to be used to generate the test. So I'm just going to refer to them as main chain and branch chain as I showed you in the diagram. I'm just generating a list of task IDs. I've specified a few constraints as to how many, and the ID should be unique, because of course each task is unique within our database. Similarly for the branch chain, and then I'm also specifying that we randomly choose a point on the main chain where we want to create the branch.

Thumbnail 2390

Thumbnail 2400

Now if you look at the test case, we also define a few additional constraints. For example, we don't want any overlap between the two lists; because we are using UUIDs, there's a good chance we won't actually breach this constraint. And the test itself is pretty simple. Basically I'm looping over the list of task IDs generated for the main chain and building the left side of the branch, and then I do the same for the second list, building the right side of the branch. Then I actually pick the point where I want to define the branch, so if there are any existing dependencies there,

Thumbnail 2410

I want to preserve the existing dependencies there, and that's really it. I've built at this point the dependency map in the database, and that's it. I just loop back the last node back to the first node, and then I expect that it'll detect the dependency every time. So the way we have written the test has changed.
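A hedged sketch of a Hypothesis test along these lines; it mirrors the chain-and-branch idea described above rather than the session's exact test, and the domain import path is an assumption.

```python
from hypothesis import assume, given, settings, strategies as st

from task_api.domain import has_circular_dependency  # assumed module path

task_ids = st.uuids().map(str)


@settings(max_examples=100)
@given(
    main_chain=st.lists(task_ids, min_size=2, max_size=6, unique=True),
    branch_chain=st.lists(task_ids, min_size=1, max_size=4, unique=True),
    branch_point=st.integers(min_value=0, max_value=5),
)
def test_cycle_detected_with_branches(main_chain, branch_chain, branch_point):
    # Constraint: the two chains must not share task IDs; Hypothesis discards
    # any generated sample that violates this.
    assume(not set(main_chain) & set(branch_chain))

    graph: dict[str, list[str]] = {}
    # Build the main chain: each task depends on the previous one.
    for parent, child in zip(main_chain, main_chain[1:]):
        graph.setdefault(child, []).append(parent)
    # Attach the branch chain somewhere on the main chain, preserving any
    # dependencies that already exist at that point.
    anchor = main_chain[branch_point % len(main_chain)]
    previous = anchor
    for node in branch_chain:
        graph.setdefault(node, []).append(previous)
        previous = node

    # Property: closing the loop from the start of the main chain back to its
    # end must be detected, whatever shape was generated.
    assert has_circular_dependency(main_chain[0], [main_chain[-1]], graph)
```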

Thumbnail 2430

Thumbnail 2450

Second, I ran this test already and you can see that each of the tests takes about 0.34, 0.32 seconds. There is a little bit of an overhead in running the test, although these are still unit tests. So what exactly is happening behind the scenes? To understand that, I've installed this open source plugin called Tikipy. Sorry, I'm just going to have to run this again because, all right, got it this time around. So this plugin makes it easy to visualize what's happening under the hood with property-based tests specifically.

Thumbnail 2460

So what this is showing us is, I specifically chose just the test that we did. The important thing to note is it did not run a single test for this function; it ran 100 test cases, each with a unique combination of inputs. But you'll also see here that the generated number of samples is 104. That's because it created 104 sets of inputs, but 4 of them were discarded because they did not meet some of our constraints, and that's okay. Each test actually runs pretty fast, but it's just that it's running 100 of them, so that's what's adding to the overhead.

Thumbnail 2500

Thumbnail 2510

Now this visualization in our case, the good news is we wrote the logic correctly and you can see all 100 of them passed and that's good. But when you have failures as a developer, it's good to look at what was the input for which the thing failed, so it's easy to troubleshoot, and that's where this helps again because if you click into this you can actually see every single input. In this case, of course, everything was successful, but you can see what was the main chain, what is the branch chain, what is the branch point. It even tells you the actual code coverage, what were the lines that were tested by this, and it makes it easy for you to troubleshoot things.

Thumbnail 2530

Thumbnail 2570

So that kind of wraps this up. Where this is useful is anywhere you need algorithmic correctness: pure functions and business logic, especially to catch edge cases. But the moment you're thinking about end-to-end tests or so on, they're not so good, because for end-to-end tests you need specific inputs that will actually trigger your end-to-end workflow, and random data might not really help you there. But even if you don't run it as part of every CI run, it's a good way to validate the correctness of core, business-critical logic and whether that works as expected. All right, so that wraps up this one. So we're actually good to move on to the integration tests. So we've finished our unit tests.

Integration Testing: Schema Validation and Testing Against Real AWS Services

So for the integration tests, the first one I want to discuss is just validating the schema for the asynchronous integration, which is our EventBridge part. So what exactly does a schema test do? The goal here is, as a publisher of an event, if I modify my event such that I remove fields that my subscribers depend on, I'm going to break the logic of my downstream subscribers. So that's really what the schema test is for. If you have any breaking changes in your event schema, it's going to pick those up.

Thumbnail 2620

Now our task API is super simple and I think it's fair to assume that the same team owns both code bases and maybe you can coordinate the changes. But the moment you have a central event bus with multiple publishers and multiple subscribers, where publishers often don't even know who the subscribers are, it becomes more and more important to validate the contract of the events. So again, there are a few different ways to write this. It depends on how you're defining your event schema. For example, AsyncAPI is one way to define your schema, or OpenAPI, and then you can use tools specific to those. But because we are in Python, we're going to do it the Pythonic way, and we're using Pydantic, which is just a data validation library.

Thumbnail 2640

Thumbnail 2650

Thumbnail 2660

So defining a schema is as simple as inheriting from the base model, and I've basically defined what a task create event or update event will look like. If the task is deleted, of course the schema will just have the ID. There's another important thing we are checking because we're using EventBridge, and EventBridge expects a few mandatory fields, otherwise you can't publish to it. So we are also validating the compliance to EventBridge schema.

Thumbnail 2670

Thumbnail 2680

Now actually running the test is easy. I've got a helper function here, so validating a schema is as simple as calling model_validate on the class with the event, and that's it. So my actual test cases are generating the create event, delete event, and so on, and just calling this helper method, and we're good to go. There are a few other, more advanced ways to test this, for example consumer-driven contract tests and so on. But if you're interested, just find us after this talk, and we can talk about it.
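A hedged sketch of what such a Pydantic schema check might look like; the field names and the EventBridge envelope model are illustrative assumptions modelled on the talk.

```python
import pytest
from pydantic import BaseModel, ValidationError


class TaskCreatedDetail(BaseModel):
    task_id: str
    title: str
    dependencies: list[str] = []


class EventBridgeEntry(BaseModel):
    # Mandatory fields EventBridge expects on put_events entries.
    Source: str
    DetailType: str
    Detail: str           # JSON-encoded detail payload
    EventBusName: str


def validate_event(model: type[BaseModel], payload: dict) -> BaseModel:
    # model_validate raises ValidationError on any breaking change,
    # e.g. a field a subscriber depends on being removed or renamed.
    return model.model_validate(payload)


def test_task_created_event_matches_contract():
    payload = {"task_id": "task-1", "title": "demo", "dependencies": []}
    validate_event(TaskCreatedDetail, payload)


def test_missing_field_is_a_breaking_change():
    with pytest.raises(ValidationError):
        validate_event(TaskCreatedDetail, {"title": "demo"})  # task_id removed
```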

Thumbnail 2690

Thumbnail 2700

Now, let's come to the integration with actual AWS services, which is DynamoDB and EventBridge. So our recommendation here is, for AWS services, test against the actual services. That way you also get to validate the other integration properties, such as whether permissions and networking are set up correctly, which are also important considerations. Another thing to think about with mocking libraries is whether the libraries fully support those services. As an example, Moto, which we used in the initial version, does not support global tables for DynamoDB out of the box. Then you would have to manage that replication within your code. So that's another thing to think about if you are relying on mocks.

Thumbnail 2730

Thumbnail 2740

Thumbnail 2750

Thumbnail 2760

Let's quickly take a look at DynamoDB. Like we said, we are going to test this against the real database. So our test fixture is straightforward. If you remember the integration code that Thomas walked you through, you initialize the TaskRepository with the actual table name. We fix the UUID of the task up front because we don't want to leave test data around, so that we can delete it afterwards. Then there's the cleanup that basically deletes the task. Now there are two types of tests here. One is the happy path scenario where you create a task and make sure that it is persisted.

I create the task object. I use my pytest fixture task ID for it. I simply call save_task. If you remember the protocol that he mentioned, save_task allows you to persist the task to the database, and that's it. Then we retrieve the task from the database and we're good to go. Again, remember in the future, if you swap DynamoDB with something else, this test case actually does not change at all. You simply rewrite your integration code and run this as is, and it should work. Well, you will have to rewrite the code to retrieve the data based on the database you're using, but a large part of the test remains unchanged.
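A sketch of what that DynamoDB integration test might look like, assuming the repository class from earlier and environment variables exported by the deployed stack (names are assumptions):

```python
import os
import uuid

import pytest

from task_api.integrations import DynamoDBTaskRepository  # assumed module path


@pytest.fixture
def repository():
    # Points at the real deployed table, e.g. exported by the IaC stack.
    return DynamoDBTaskRepository(table_name=os.environ["TASKS_TABLE_NAME"])


@pytest.fixture
def task_id(repository):
    # Fix the ID up front so teardown can always clean the item up.
    generated = f"it-{uuid.uuid4()}"
    yield generated
    repository.delete_task(generated)


def test_task_is_persisted(repository, task_id):
    task = {"task_id": task_id, "title": "integration test task"}

    repository.save_task(task)
    stored = repository.get_task(task_id)

    assert stored is not None
    assert stored["title"] == "integration test task"
```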

Thumbnail 2800

Thumbnail 2810

Thumbnail 2820

Now, the more interesting part of this test is actually the failure scenarios. So the first thing to call out here is, again, we are simulating errors with services. Once again, in-memory fake comes to our rescue, so we have a fake class. I won't go through it. It's similar to what we did with the unit test. But the question here is, what exactly is it we should be validating with failures?

Thumbnail 2850

So there are two things we should be checking for. First is, are we surfacing the correct error code or exception code and message so that the client knows what to do with it, or are we just collapsing everything into a 500 and then troubleshooting becomes hard? The second thing to test is does your application behave as expected when these error scenarios occur. To make it concrete, let's again go back to our application. Let's say for our task app, we have an offline mode, so people can work on tasks offline.

Thumbnail 2870

Probably when they come online, it's possible that you might end up with conflicts in tasks. So how are you going to resolve the conflict? We have used a simple strategy of first write wins. So we have a version which is the Unix timestamp of a record. When a client reads it and then they write back, if the timestamp is changed, the write will be rejected. But the error message should convey enough so the client knows what to do with it.
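One possible way to implement a first-write-wins check like this is a DynamoDB conditional write; the sketch below is an illustration under that assumption, not the session's exact implementation.

```python
import boto3
from botocore.exceptions import ClientError


class TaskConflictError(Exception):
    """Raised when the stored version changed since the client read it (maps to 409)."""


def save_task_if_unchanged(table_name: str, task: dict, expected_version: int) -> None:
    # `task` is expected to carry its new version, e.g. the current Unix timestamp.
    table = boto3.resource("dynamodb").Table(table_name)
    try:
        table.put_item(
            Item=task,
            # Reject the write if another client updated the record first.
            ConditionExpression="attribute_not_exists(task_id) OR version = :expected",
            ExpressionAttributeValues={":expected": expected_version},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise TaskConflictError("task was modified by another writer") from err
        raise
```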

Thumbnail 2900

So the way we're going to do this is, in this case, we should return a 409 conflict error. Now this is going to start to look similar to the unit test. So the question is, first, is this an integration test because we're just using mocks? The second question is, are we repeating what we did with the handler test? For the first question, in this case, we are evaluating a failure with the integration layer for DynamoDB, so it logically sits as part of the DynamoDB test validation. So we have left it in here with the integration test.

Thumbnail 2930

The second thing is it's not an exact repetition of the handler test because if you remember in the handler, we mocked out the entire TaskService logic layer. We injected our fake there, so that part of the code was never tested. But if you look at the fake that we have created here, the first thing we are doing is we are creating the TaskRepository, but we are using a fake. So once again, dependency injection. We injected the fake database into TaskRepository.

We are using a mock for the publisher. So here again, mock is handy for us because when I'm testing database failures, I don't want to publish any events to EventBridge, that's all. I don't need to actually configure the behavior of EventBridge, so here mock is super handy for me just to prevent events from being published to EventBridge. Then I actually instantiate the TaskService with the fake repo and fake publisher, so we are testing all of the code that has been written in the TaskService, so it's not a repetition.
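A sketch of that failure-path test: a fake repository that simulates the conflict, a plain Mock for the publisher (we only care that nothing gets published), and the real TaskService wired in through the same seam as before. Module paths, the conflict exception, and the `lambda_context` fixture are assumptions.

```python
import json
from unittest.mock import Mock

from task_api import handler                  # assumed module path
from task_api.domain import TaskService       # assumed module path
from task_api.domain import TaskConflictError  # assumed conflict exception


class FakeConflictingRepository:
    # Only save_task is needed for this scenario; duck typing covers the rest.
    def save_task(self, task: dict) -> None:
        raise TaskConflictError("simulated write conflict")


def test_write_conflict_surfaces_as_409(lambda_context):
    fake_publisher = Mock()
    service = TaskService(repository=FakeConflictingRepository(),
                          publisher=fake_publisher)
    handler._task_service = service  # inject through the module-level seam

    event = {"httpMethod": "POST", "path": "/tasks",
             "body": json.dumps({"title": "demo"})}
    response = handler.lambda_handler(event, lambda_context)

    # The client gets an actionable conflict error, not a generic 500 ...
    assert response["statusCode"] == 409
    # ... and no event was published for the failed write.
    fake_publisher.publish_task_event.assert_not_called()

    handler._task_service = None  # reset the seam
```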

Thumbnail 2980

Thumbnail 2990

Thumbnail 3000

You could choose to validate the TaskService directly and not call the handler. The reason we chose the handler is that we are also mapping some of the error codes. For example, when you have an IAM error, which is a permission error, and Lambda doesn't have permission to write to DynamoDB, this permission error has no meaning to your end client. So here we actually check that the error message does not contain "permission" or "access", and that we are surfacing IAM errors as a generic internal error. That's why we chose to test the handler.

Thumbnail 3010

Thumbnail 3020

And then the last one is for EventBridge. The main difference with the EventBridge test is that unlike DynamoDB, you can't really query EventBridge; you publish an event and it's gone unless there is a subscriber. So testing EventBridge requires a little bit of extra work, where we set up a test harness, which is just another Lambda function listening to test events. It adds a bunch of metadata and persists the event to another DynamoDB table, and then we just read back from that table.

Thumbnail 3040

Thumbnail 3060

Thumbnail 3070

So if we look at the happy path case for this, we initialize the publisher with the real event bus. We create the event and we just call publish task event. Because there is extra work in the Lambda receiving the event, processing, and persisting it, we do introduce a wait time here to allow for that. And then we basically query the target database and we are done. So I'm going to just run these tests, which actually take a little bit of time to run, but we'll have this running and I'll pass it off to Thomas to finish the last part.
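A sketch of that harness-based test: publish to the real bus, then poll the DynamoDB table the harness Lambda writes into. The publisher class, environment variables, key schema, and wait/poll values are assumptions.

```python
import os
import time
import uuid

import boto3

from task_api.integrations import EventBridgePublisher  # assumed module path


def test_task_event_reaches_subscribers():
    publisher = EventBridgePublisher(event_bus_name=os.environ["TASK_EVENT_BUS"])
    harness_table = boto3.resource("dynamodb").Table(os.environ["HARNESS_TABLE_NAME"])

    task = {"task_id": f"it-{uuid.uuid4()}", "title": "eventbridge test"}
    publisher.publish_task_event("TaskCreated", task)

    # The harness Lambda needs a moment to receive and persist the event,
    # so poll for a bounded amount of time instead of asserting immediately.
    deadline = time.time() + 10
    item = None
    while time.time() < deadline and item is None:
        item = harness_table.get_item(Key={"task_id": task["task_id"]}).get("Item")
        time.sleep(1)

    assert item is not None, "event was not observed by the test harness"
    assert item["detail_type"] == "TaskCreated"  # metadata field added by the harness (assumed)
```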

Thumbnail 3080

So while that runs, very quickly summarizing what we did for the tests. Mocks are useful, but use them sparingly, and we saw at least two or three places where we did use the mock. However, when you want to customize the behavior of your domain logic, in-memory fakes work better. Dependency injection makes the code cleaner, but remember at runtime we have provided defaults, so the code at runtime does not require dependency injection. We're not talking about full dependency injection frameworks. It's a good idea to validate your event schema.

Thumbnail 3130

For integration tests, run them against real services and use fakes to simulate errors. And very important to validate the handling of integration failures, like are you surfacing the correct exception codes and messages so your client knows what to do with it. And that actually wraps up our test section. We have the last part of, oh, the tests have run. The tests have run, not surprisingly, it's almost 17 seconds because the wait time I specified for EventBridge was five seconds, so that kind of adds to it.

Thumbnail 3140

Thumbnail 3160

Thumbnail 3170

AI-Driven Development Lifecycle: From Operations Back to Inception

So what we recommend is unit tests should definitely run on every commit, but some of the long-running integration tests, it's a good idea to run them only when you explicitly change the logic or for major releases, so you can optimize your build time. All right, so we're moving to the last part of our session. Let me just flip to another window and make this a little bit bigger. I'm sorry, I missed the pane here. There we go.

Thumbnail 3180

So you may have noticed that we kind of followed a certain flow throughout the session. First we had our application in a certain state, then we figured out what needed to be changed and modified. Then we performed the migration using AI, and afterwards we drilled deeper into things and figured out how to modify it even further. So this maps onto the traditional way of thinking about the development lifecycle, but powered by AI, and this can be elevated. We were touching lightly, to be honest, on AI in this case. For most of this you can use Kiro or any other coding assistant essentially to help you out and analyze the code.

But what I want to emphasize here is that we followed the thinking approach that corresponds with the framework AWS released a few months back, which is AIDLC, or the AI-driven development lifecycle, which kind of brings structure to the chaos. So if you've followed the news for most of the year, a lot of people talked about vibe coding. I built this in five minutes, I built that in five minutes, which is great. But once you get to a higher level of complexity in your application, especially in existing applications, you can't just vibe code your way through stuff. This is where we need structure and where spec-driven development fits into the picture very nicely. But to bring that to the whole team at a larger scale, this framework is particularly useful, and it's actually very simple to implement. It has three stages.

Thumbnail 3290

So as we did ourselves, we went through the inception phase where we kind of thought about what we can improve. We had AI perform a review to give us a list that we can follow, analyze, tweak, modify, or process otherwise. Then we went to the construction where we actually performed the refactoring, performed additional changes, and then deployed. And afterwards, now we're going to cover a bit of operation. So operation essentially is the part where you push to production or at least some kind of traffic-loaded environment where you can monitor and evaluate your application across the span of time. So think about a case where we would

Thumbnail 3330

essentially evaluate our application for, let's say, three to six months. It's already running, and we want to collect the bugs or the issues that we have. So what I'll do right now is essentially just run this prompt in here. I'll tell you in a minute what it does.

Thumbnail 3340

Thumbnail 3350

Thumbnail 3360

So we do have a list of bugs that we collected over time. Let me just minimize this so it's visible. You can see it's in JSON format. It's hard to read, it's hard to go through it, you know, so let's make it easier on ourselves. So what I'll do, I'll actually run this again. I'll just get a summary. And let me fan this out a bit. So this is the summary of the JSON file that we have. It essentially gave us a total count of the bugs. We can see there's one critical, seventeen high severity ones, etc. We can see individual components. So this is essentially from the decoupled perspective that we already talked about. We can see which file has which problems, and we can go deeper. We can see which one relates to validation, serialization, etc.

Thumbnail 3430

Thumbnail 3440

Thumbnail 3450

Now, we asked Kiro, if I scroll a bit higher, to actually analyze our application and create a risk heat map, essentially for us to understand which parts of the application need to be modified and how. Alright, we gave it some extra input just to follow the hexagonal architecture, and it's already working on it. It's creating a heat map. Now just for the sake of time, because we have three minutes left, I will use the pre-created one, so let me just flip to that one. Here, it actually is modifying this pre-created one, so you can see it's flickering. It just modified it, and let me just scroll all the way up. So you can see it's just been updated. Kiro just finished updating it. If the file didn't exist, it would create a new one, of course. But this is similar to that evaluation, that audit file we created before. Essentially, we got a full report about what's happening based on our collected bugs or bug report.

Thumbnail 3460

Thumbnail 3470

Thumbnail 3480

Thumbnail 3490

It can be logs. It can be anything. It doesn't have to be just in the simple form we saw, but we can see the highest pain points essentially. So this is quite graphical, and Kiro can do this at an even higher level of detail depending on the configuration. There's steering that we can use to fiddle with this and make it more granular, more customized. And this kind of brings us back to the previous approach. Right now, imagine we are in operations, right? You would imagine that's the end of it. We'll just collect it and that's it, right? Now, we're going to do something with it. So we could do the same thing that we did in the inception phase, where we took our audit, created a spec, and refactored the application.

Thumbnail 3520

So in this case, we can take our heat map again, have Kiro analyze it even further if we want to, create our spec, and feed it back to the inception. So we're essentially making a full circle from the operations to inception, then apply through construction, essentially apply our findings, improve our architecture, improve our application, eliminate or minimize or mitigate those bugs that we're running into. And that essentially creates the full circle of the AI-driven development.

Thumbnail 3530

Thumbnail 3560

Conclusion: Resources and Best Practices for Serverless Testing

Yep, so just summarizing the things we covered today, we looked at how you can use generative AI throughout the development lifecycle. So instead of approaching it just as writing tests for serverless, we looked at how you redesign to simplify testing. And just so you know, all of this, including the property-based tests and the application itself, was built using Kiro, largely with spec-driven development. And then of course, towards the end, Thomas showed how you can continuously iterate based on historic data to improve your application.

Thumbnail 3590

Thumbnail 3600

So some quick resources. The first one is a talk from 2023, but it's a really good breakout that talks specifically from a Python perspective, some of the best practices. The second one is from this year, but this is a recorded one. We now have really good integrations to debug functions live using VS Code. And the last one is the completed version of our task API that follows all of the best practices that's published out to GitHub, and you should be able to access it there. Then really quick, if you're looking for serverless and event resources, this is the place to go. And that was really it. We thank you for spending time with us, for spending your last day at re:Invent with us.


; This article is entirely auto-generated using Amazon Bedrock.
