
Nevermore.dev: LLM-as-judge on Lambda Durable Functions

“The past is dead. The future? Let’s make it less painful.”

Writing post‑mortems is one of those things everyone agrees are important and everyone secretly hates doing. They’re tedious, emotionally draining, and they require the worst kind of energy: clear thinking after chaos.

And you find yourself thinking both: “I want this to happen never more” and “I want to write this never more”. And in that quiet, the chaos lingers. Undebuggable, relentless, eternal.

Nevermore.dev was born from that very specific kind of developer pain: a post‑mortem generator with two moods:

  • Professional: calm, neutral, executive‑friendly (but so boring)
  • Creepy: full Addams‑family vibes, because if we have to revisit horror, we might as well embrace it 🦇

Dark UI aside, the interesting, geeky part lives under the hood: a brand new AWS Lambda Durable Function powering an LLM‑as‑Judge workflow on Amazon Bedrock (using Nova models).

🏗️ Architecture

The solution I had in mind was fairly simple in shape, even if layered in execution.

The flow starts from an Amplify Gen 2 frontend. An AppSync GraphQL mutation triggers a lightweight Lambda whose only job is to start the AI workflow, not to run it (it acts as the sync backend). From there, everything moves asynchronously into a Durable Lambda function.

This durable function is where the real logic lives. Instead of relying on a single model, the workflow follows an LLM-as-judge pattern. Generation happens in parallel: a fast model produces a first candidate, while a more balanced one generates an alternative. The point here is diversity, not consensus.

Once both candidates are available, a higher-quality model steps in as a judge. It evaluates the outputs and selects the best result, acting as a decision layer rather than a generator.

All model calls go through Amazon Bedrock, keeping the system decoupled and letting each model focus on what it does best: speed, balance, or quality.

In this way, the main benefit I was aiming for was avoiding the setup of a Step Function, with all its inherent complexity, while still using pure code to manage a durable, asynchronous workflow directly inside a Lambda: a Durable Lambda.

🕯️ A bit more about Nevermore.dev

At its core, I wanted a simple but effective CRUD panel to help me manage post-mortems. I semi-vibe-coded it (writing specs, refining them...) with Kiro and my personal Amplify Gen 2 Kiro Power.

The product comes with all the usual Amplify Gen 2 built-in features:

  • fully integrated with Cognito for auth
  • CRUD operations exposed through an AppSync API, with DynamoDB as storage
  • everything deployed with just npx amplify deploy

Awesome for cutting time to production and giving me exactly the product I was looking for.

🤖 AI to the rescue

The real core of the product, though, is using AI to generate clearer, more useful incident descriptions and root cause analyses (which is why I’m building and using it in the first place). What truly bores me about writing post-mortems isn’t the incident itself, but the ritual around it: finding the right tone, the right template, the right wording. With AI, all of that can be reduced to a single prompt that produces exactly what I need. But relying on a single model often forces me to switch models, manually evaluate the output, or ask another model to judge it.

It’s all still far too manual for something that’s supposed to be part of my daily routine.

Thus, I wanted the ability to parallelize generations across different models, then use another model to evaluate the results and pick the best one; this is where the Lambda Durable Function comes into play. Finally, the best output is immediately available in Markdown, ready to be copied into team cards or any other notification system.

🪦 But it wasn't funny

As expected, it wasn’t funny enough.
Post-mortems aren’t funny, at least, not yet.
But they should NEVERMORE be boring.

As I was semi-vibing the frontend with Kiro and my personal Amplify Gen 2 Kiro Power, it only took a couple of prompts to add a theme-switch button and fully embrace a creepy mode for dark-theme users (aren't we talking about post-mortems?). Since I nevermore wanted to write a dull post-mortem, what better muse than the ever-macabre Addams Family?

Now reading a post-mortem in full Addams Family tone is incredibly satisfying, and I regret nothing.

🧱 The stack

Okay, creepy awesome. But let’s forget about the UI for a moment and get to the underlying technical part.
Nevermore.dev is built entirely on AWS with a fairly modern setup:

  • Amplify Gen 2: frontend and full‑stack wiring
  • Amazon Cognito: authentication and authorization
  • AWS AppSync: GraphQL API
  • DynamoDB: NoSQL records persistence
  • AWS Lambda Durable Functions: AI orchestration layer
  • Amazon Bedrock (Nova): AI models engine for text generation & evaluations

As we said, the brain of the system is a single Durable Lambda Function that:

  1. Generates multiple enhanced versions of a post‑mortem section
  2. Uses another LLM to judge them
  3. Returns the best one

All of this really happens inside one function, with:

  • parallel execution
  • checkpointing
  • resumability

The point here is: no Step Functions. No external state machines. No need to define complex architectures or multiple lambdas.

That’s where Durable Functions really shine.
Let's see why by comparing them with Step Functions.

⚔️ Why Lambda Durable Functions (instead of Step Functions)

Traditionally, a workflow like this would scream Step Functions.
They work and are a very good choice, but they come with trade‑offs:

  • JSON‑heavy definitions
  • state management between steps
  • mental context switching between states
  • orchestration logic separated from business logic (this could be a pro, but we should see the use case)

Lambda Durable Functions flip the model:

You write normal async code in just one function and AWS handles durability.

With a single Lambda you can get:

  • long-running executions (without losing state)
  • automatic checkpointing
  • deterministic replay
  • parallel fan‑out / fan‑in

For LLM workflows, where latency, retries, partial failures, and cost control matter, this is huge.

Here was my architecture map before starting: the core is an LLM workflow implementing the LLM-as-judge pattern. Having the whole solution live inside the Durable Lambda, while the frontend acts just as a client, is a big win for cutting time to production.

How would this look if implemented with AWS Step Functions?

| Lambda   | State   | Task              | Extra elements required                    |
| -------- | ------- | ----------------- | ------------------------------------------ |
| Lambda 1 | State 1 | Call LLM          | Must be separate; JSON definition required |
| Lambda 2 | State 2 | Process response  | Another separate Lambda or branch logic    |
| Lambda 3 | State 3 | Write to DB       | Separate Lambda                            |
| Lambda 4 | State 4 | Map / Parallel    | For fan-out of multiple LLM calls          |
| Lambda 5 | State 5 | Wait / Choice     | For retry / fallback logic                 |
| Lambda 6 | State 6 | Aggregate results | Another separate Lambda                    |
| Lambda 7 | State 7 | Success / Fail    | Final orchestration state                  |

Using AWS Step Functions would require a far larger amount of architectural and orchestration code compared to a single Lambda Durable Function. The durable approach is a huge time saver, eliminates constant context-switching, and reinforces a DDD-inspired design where my Lambda acts as a fully responsible micro-service, handling parallel execution and the orchestration of results end-to-end.

I always embrace DDD when it makes sense, and I stick to KISS: keep the model focused, the boundaries explicit, and the moving parts to the absolute minimum.

If you’re looking for help choosing between the two options, there’s an excellent decision framework here. Moreover, as suggested in the hybrid architecture chapter, you may even benefit from applying both approaches in your application.

⚖️ LLM‑as‑a‑Judge

So, as written before, instead of trusting a single model output, Nevermore.dev uses this powerful pattern:

  1. Generate candidate post-mortems using multiple fast/cheap models
  2. Judge them using a more capable reasoning model

This, compared with a single model response, gives you:

  • better quality
  • more consistency
  • controllable cost

In my case, I'm using three models of the Nova family:

export const MODEL = {
  SMALL: "eu.amazon.nova-micro-v1:0",
  MEDIUM: "eu.amazon.nova-lite-v1:0",
  LARGE: "eu.amazon.nova-pro-v1:0",
} as const;

🛠️ Deploying Durable Function with CDK

Durable Functions, per the documentation, are not yet officially supported by Amplify itself, but a PR has been merged and I expect support to land soon.

Meanwhile, it's very simple to deploy a durable function with CDK (or other IaC tools). It also reinforces the idea that the durable function is a specific component which should stay decoupled from the "app architecture" created with Amplify.

It’s mostly a matter of configuring the Durable Lambda Function itself.
And it feels exactly as it should: an extension of what we can already do with CDK.

import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const durableFunction = new lambda.Function(this, 'DurableFunction', {
  runtime: lambda.Runtime.NODEJS_22_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  functionName: 'nevermore-dev-durable-ai-generator',
  memorySize: 1024,
});

// Escape hatch: the durable configuration is set on the underlying CloudFormation resource
const cfnFunction = durableFunction.node.defaultChild as lambda.CfnFunction;
cfnFunction.durableConfig = {
  executionTimeout: cdk.Duration.hours(1).toSeconds(),
};

Giving it the right permissions (security first!):

durableFunction.addToRolePolicy(new iam.PolicyStatement({
  actions: [
    'lambda:CheckpointDurableExecution',
    'lambda:GetDurableExecutionState',
  ],
  resources: ['*'], // better: restrict this permission!
}));

Remember to restrict the resources attribute of this permission, as suggested here.

I've used CDK as it's a good fit for a full TypeScript project with AWS Amplify Gen 2 and React, but you can pick your preferred IaC method (CloudFormation, CDK or SAM) and learn how to deploy a durable function with it here in the AWS docs.

If you’re using CDK to deploy your Lambda Durable Function, you should create a "proxy" function that acts as a backend to invoke it. The code is as simple as described here.
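As an idea of how light that proxy can be, here is a minimal sketch, assuming the durable workflow is simply started with an asynchronous Invoke (the function name matches the CDK snippet above; the payload shape is illustrative and mirrors the handler shown below):

import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const client = new LambdaClient({});

// Hypothetical proxy handler: kick off the durable workflow and return immediately
export const handler = async (event: { context: string; fieldType: string; theme: string }) => {
  await client.send(new InvokeCommand({
    FunctionName: 'nevermore-dev-durable-ai-generator',
    InvocationType: 'Event', // async: don't wait for the durable execution to finish
    Payload: JSON.stringify(event),
  }));
  return { started: true };
};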

✍️ Writing the durable handler

Again, this is the core part: no state machines, no glue code.
Just plain code.

export const handler = withDurableExecution(
  async (event: Event, context: DurableContext) => {
    const originalText = event.context || '';
    const theme = event.theme || 'addams';

    if (!originalText.trim()) {
      return getEmptyContextMessage(theme);
    }

    const enhancementPrompt = createEnhancementPrompt(originalText, event.fieldType, theme);
    const candidates = await generateCandidates(
      context,
      [MODEL.SMALL, MODEL.MEDIUM],
      enhancementPrompt
    );

    const judgePrompt = createJudgePrompt(originalText, candidates, event.fieldType, theme);
    const judgment = await judgeAndSelectBest(
      context,
      MODEL.LARGE, // the judge uses the highest-quality model
      judgePrompt,
      candidates,
      originalText
    );

    return judgment.enhancedText;
  }
);

Parallelism is implemented with a very simple context.map: parallel, checkpointed, resumable.
Exactly what flaky LLM calls need.

const candidateResults = await context.map(
  "Generate enhanced versions",
  models,
  async (_, modelId) => {
    const enhancement = await converse(modelId, prompt);
    return { modelId, answer: enhancement };
  }
);
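The converse helper is a thin wrapper around Bedrock. A minimal sketch, assuming the Bedrock Converse API (the inference parameters are illustrative):

import { BedrockRuntimeClient, ConverseCommand } from '@aws-sdk/client-bedrock-runtime';

const bedrock = new BedrockRuntimeClient({});

// Hypothetical helper: send a single user prompt to a Bedrock model and return the text reply
async function converse(modelId: string, prompt: string): Promise<string> {
  const response = await bedrock.send(new ConverseCommand({
    modelId,
    messages: [{ role: 'user', content: [{ text: prompt }] }],
    inferenceConfig: { maxTokens: 1024, temperature: 0.7 }, // illustrative values
  }));
  return response.output?.message?.content?.[0]?.text ?? '';
}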

Judging is implemented as a subsequent durable step (parseJudgment is an illustrative helper, sketched below):

return await context.step("judge-best-version", async () => {
  const judgeResponse = await converse(judgeModel, judgePrompt);
  return parseJudgment(judgeResponse); // parse the JSON verdict, falling back gracefully
});

If parsing fails, I fall back gracefully.
No wasted inference. No duplicate cost.
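A minimal sketch of that fallback (parseJudgment is my illustrative name; the verdict shape matches the judge prompt shown below):

interface Judgment {
  bestIndex: number;
  reasoning: string;
}

// Hypothetical parser: if the judge's reply isn't valid JSON,
// default to the first candidate instead of paying for another inference round
function parseJudgment(raw: string): Judgment {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.bestIndex === 'number') return parsed as Judgment;
  } catch {
    // malformed JSON: fall through to the default below
  }
  return { bestIndex: 1, reasoning: 'Fallback: judge output was not parseable.' };
}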

The complete code for this pattern is available here in the AWS samples repo.

✍️ Prompting

The best thing about this stack is that once the pattern is implemented, you can easily reuse it across different use cases by simply adapting the prompts. Below are my (very simple) examples.

This one is for generating candidate responses (where fieldName is the name of the field I want to generate, e.g. description or root cause, and originalText is the starting point):

You are an experienced SRE reviewing a technical post-mortem {fieldName}.
Your task is to enhance this {fieldName} with professional insights and technical depth.

Original {fieldName}:
"""
{originalText}
"""

Requirements:
1. Expand and enhance the technical details with clarity and precision
2. Add relevant technical insights, metrics, and potential implications
3. Maintain a professional, clear, and concise tone
4. Use markdown formatting for better readability (headers, lists, code blocks)
5. Focus on actionable insights and lessons learned
6. If the original is empty or minimal, generate a comprehensive {fieldName} based on the context
7. Length: 200-400 words
8. Include specific technical recommendations and next steps
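To give an idea, createEnhancementPrompt (used in the durable handler earlier) can be a plain template interpolation. A minimal sketch, with the template abbreviated:

// Hypothetical sketch: fill the placeholders of the template above
function createEnhancementPrompt(originalText: string, fieldName: string, theme: string): string {
  return `You are an experienced SRE reviewing a technical post-mortem ${fieldName}.
Your task is to enhance this ${fieldName} with professional insights and technical depth.

Original ${fieldName}:
"""
${originalText}
"""

Requirements:
... (the list above, with the tone adapted to the "${theme}" theme)`;
}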

This is the prompt for the judge:

You are an experienced SRE Lead reviewing post-mortem documents for quality and accuracy.

Original {fieldName}:
"""
{originalText}
"""

Enhanced Versions:
{candidatesList}

Evaluate each version based on:
- Technical accuracy and depth
- Clarity and readability
- Appropriate use of markdown formatting
- Professional tone and structure
- Actionable insights and recommendations
- Completeness and thoroughness

Reply with JSON only (no other text):
{
  "bestIndex": <1-based index>,
  "reasoning": "<2-3 sentences explaining your choice>"
}
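The candidatesList placeholder just needs a 1-based numbering, so that the judge's bestIndex maps straight back to a candidate. A possible formatter (illustrative name, matching the { modelId, answer } shape produced by context.map above):

// Hypothetical formatter: number candidates with a 1-based index for the judge
function formatCandidatesList(candidates: { modelId: string; answer: string }[]): string {
  return candidates
    .map((c, i) => `--- Version ${i + 1} (model: ${c.modelId}) ---\n${c.answer}`)
    .join('\n\n');
}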

The interesting part of this prompt is that even though I only needed the best response, I also tracked the 2-3 sentences explaining the choice. This can be useful for reviewing the result if you want to introduce a human in the loop with a notification/review pattern, for which Durable Functions are a good fit too (see this example).

We can also observe that the LLM-as-judge pattern is essentially a composition of other patterns: parallelism and prompt chaining with structured output.
By combining these patterns, you gain the flexibility to tailor the solution more precisely to your specific use case.

The creepy Addams theme gave me the opportunity to verify that just by changing the prompt you can get the exact tone your use case needs.

Here is the adapted prompt for candidate responses:

You are Grandmama Addams, an ancient and wise debugger from the Addams Family mansion. 
Your task is to enhance this technical post-mortem {fieldName} with your dark wisdom and supernatural insight.

Original {fieldName}:
"""
{originalText}
"""

Requirements:
1. Expand and enhance the technical details with clarity and depth
2. Add relevant technical insights and potential implications
3. Maintain an Addams Family tone - creepy, darkly humorous, but technically accurate
4. Use markdown formatting for better readability (headers, lists, code blocks)
5. Keep it professional yet delightfully macabre
6. If the original is empty or minimal, generate a comprehensive {fieldName} based on the context
7. Length: 200-400 words
8. Include specific technical recommendations

Here is the adapted prompt for the judge:

You are Morticia Addams, reviewing post-mortem documents for quality and accuracy.

Original {fieldName}:
"""
{originalText}
"""

Enhanced Versions:
{candidatesList}

Evaluate each version based on:
- Technical accuracy and depth
- Clarity and readability
- Appropriate use of markdown formatting
- Addams Family tone while remaining professional
- Actionable insights and recommendations

Reply with JSON only (no other text):
{
  "bestIndex": <1-based index>,
  "reasoning": "<2-3 sentences explaining your choice>"
}

I picked Morticia as the judge because her personality fits the role beautifully, but it was extremely funny to see how dramatically the tone changed just by switching to another member of the Addams family (choosing Fester to add a touch of madness was absolutely absurd).

👀 Let's see it in action

When invoking the function, we can see the execution in the Lambda console under the brand new Durable Executions tab, with a high level of detail at every step.

🎯 So, why does this matter?

Durable Functions make Lambda viable for serious AI workflows without needing a Step Function:

  • multi‑step reasoning
  • fan‑out / fan‑in
  • partial failures
  • cost‑aware retries

In my use case: post‑mortems are still painful.
But now, at least, they’re elegantly painful: AI‑assisted, with generation and judging handled in a single scoped micro-service, without an external workflow orchestration tool.


🙋 Who am I

I'm D. De Sio and I work as Head of Software Engineering at Eleva.
I'm currently (Apr 2025) an AWS Certified Solutions Architect Professional and an AWS Certified DevOps Engineer Professional, but also a User Group Leader (in Pavia), an AWS Community Builder and, last but not least, a #serverless enthusiast.

For the occasion, I proudly count myself among the students of Nevermore Academy for outcasts.
