Ahmad Humayun

Posted on Jun 6

Why I Don’t Let the LLM Decide Issue State

#ai #python #dataengineering #marketinganalytics

When you build an AI system for marketing performance monitoring, one tempting idea is to let the LLM decide everything.

Campaign pacing is off.

Creative frequency is too high.

A product category is spending inefficiently.

So the natural thought is:

Let’s send the current issue and previous issue to the LLM and ask if this is new, recurring, worsening, or improving.

Something like this:

state = llm.invoke(f"""
Last week this advertiser had issue: {issue.description}
Previous metric: {issue.prev_value}
Current metric: {issue.current_value}

Is this issue new, recurring, worsening, or improving?
""")

This looks fine in a demo.

But I would not use this in production.

The issue state is not a language problem. It is a data/history problem.

If the same issue existed last week and the metric is now 25% worse, the issue is worsening. If it disappeared, it is resolved. If it appeared for the first time, it is new.

There is no reason to spend tokens or accept non-determinism for that.

The mistake: using the LLM for decisions that already have an algorithmic answer

LLMs are useful when the output needs language, judgment, or explanation.

But deciding whether an issue is NEW, RECURRING, or WORSENING should not depend on prompt wording.

In a monitoring system, this becomes a real problem.

If the same campaign issue is classified differently on different runs, the whole weekly comparison becomes noisy. You cannot reliably tell whether performance is actually getting worse or the LLM just described it differently this time.

That is why I keep the issue lifecycle deterministic.

The LLM can write the explanation later.

It should not decide the state.

The state machine I use

For each detected issue, I track it across analysis periods using a small state machine.

The states are:

NEW
RECURRING
WORSENING
IMPROVING
RESOLVED
STALE

The meaning is simple.

NEW

The issue is detected for the first time for this advertiser or entity, or it reappears after having been resolved.

Example:

Creative fatigue was not present last week, but it is detected this week.

RECURRING

The same issue is still present, but the metric did not move enough to call it better or worse.

Example:

Frequency was high last week and is still high this week, but only changed by 3%.

WORSENING

The issue is still present and the underlying metric degraded beyond a threshold.

Example:

CPA was already high, and now it is 28% higher than the previous period.

IMPROVING

The issue is still present, but the metric moved in the right direction by more than the threshold.

Example:

Frequency is still above the target range, but it dropped by 22%.

RESOLVED

The issue was present before, but it is no longer detected.

Example:

Budget pacing was behind last week, but spend is now back within the expected range.

STALE

This one is useful in real systems.

Sometimes an issue does not get resolved cleanly. Data is missing, an account stops syncing, or a campaign disappears from the source data.

In those cases, I do not always want to mark the issue as resolved immediately.

STALE means:

This issue was previously active, but we have not seen enough recent evidence to confidently call it resolved.

This avoids false “good news” when the real problem is a data gap.
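One way to sketch the RESOLVED versus STALE decision, assuming the pipeline can tell whether an entity's data was actually refreshed in the current period. The `data_is_fresh` flag and the function name are hypothetical and only illustrate the idea.

```python
from enum import Enum


class IssueState(Enum):
    NEW = "new"
    RECURRING = "recurring"
    WORSENING = "worsening"
    IMPROVING = "improving"
    RESOLVED = "resolved"
    STALE = "stale"


def state_when_not_detected(previous_state: IssueState, data_is_fresh: bool) -> IssueState:
    """State for an issue that was tracked before but is not detected now.

    `data_is_fresh` is a hypothetical flag: True when the source data for
    this entity was synced successfully for the current period.
    """
    if previous_state == IssueState.RESOLVED:
        # Already closed, so nothing changes.
        return IssueState.RESOLVED
    if data_is_fresh:
        # Current data was checked and the issue is gone.
        return IssueState.RESOLVED
    # No recent evidence either way, so do not report false good news.
    return IssueState.STALE
```

A real system would probably also want a rule for how long an issue may stay STALE before it is closed or escalated.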

The core logic

The basic version is not complicated.

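def compute_issue_state(
    current_value: float | bool,
    previous_record: IssueRecord | None,
    threshold_pct: float = 0.20,
) -> IssueState:
    """Classify an issue that is detected in the current period.

    Assumes a higher value means a worse issue. RESOLVED and STALE are
    assigned separately, when the issue is not detected at all.
    """
    # No history, or the last occurrence was closed: treat it as a new issue.
    if previous_record is None or previous_record.state == IssueState.RESOLVED:
        return IssueState.NEW

    # Boolean issues have no magnitude, so a repeat detection is RECURRING.
    if isinstance(current_value, bool):
        return IssueState.RECURRING

    previous_value = previous_record.value
    if not previous_value:
        # A missing or zero baseline makes the relative change undefined,
        # so fall back to RECURRING instead of dividing by a placeholder.
        return IssueState.RECURRING

    # Relative change versus the previous period (0.25 means 25% higher).
    delta = (current_value - previous_value) / previous_value

    if delta > threshold_pct:
        return IssueState.WORSENING
    if delta < -threshold_pct:
        return IssueState.IMPROVING
    return IssueState.RECURRING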

This gives me a predictable result every time.

Same input.

Same previous state.

Same threshold.

Same output.

That matters a lot more than people realize.
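The snippet above uses `IssueState` and `IssueRecord` without showing them. A minimal sketch of what they might look like is below. Only the `state` and `value` fields are implied by the function. `entity_id` and `period_end` are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional, Union


class IssueState(Enum):
    NEW = "new"
    RECURRING = "recurring"
    WORSENING = "worsening"
    IMPROVING = "improving"
    RESOLVED = "resolved"
    STALE = "stale"


@dataclass(frozen=True)
class IssueRecord:
    """Snapshot of one issue for one entity in one analysis period."""

    entity_id: str    # hypothetical: advertiser, campaign, or creative ID
    period_end: date  # hypothetical: last day of the analysis period
    state: IssueState  # read by the state function as previous_record.state
    value: Optional[Union[float, bool]]  # read as previous_record.value


# Example record for last week's analysis period (values are made up).
last_week = IssueRecord("adv-123", date(2024, 6, 2), IssueState.NEW, 1.20)
```

Freezing the record is a design choice: history rows are never edited in place, so each period's state stays reproducible.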

Boolean issues need different handling

Not every issue has a numeric severity value.

Some issues are just present or absent.

For example:

Tracking pixel missing
Creative disapproved
Campaign has no active ads
Required targeting field missing

These are boolean issues.

They cannot really become “20% worse.”

A creative is either disapproved or it is not.

So for boolean issues, the lifecycle is simpler:

first detection  -> NEW
detected again   -> RECURRING
not detected     -> RESOLVED

For numeric issues, I can compare magnitude.

Examples:

High CPM
Creative fatigue
Budget pacing gap
CPA increase
ROAS drop
Frequency increase

These can move up or down, so they can be worsening or improving.

One simple way to model this is to attach the issue kind to each member of the issue-type enum.

from enum import Enum


class IssueKind(Enum):
    NUMERIC = "numeric"
    BOOLEAN = "boolean"


class IssueType(Enum):
    HIGH_CPM = ("high_cpm", IssueKind.NUMERIC)
    CREATIVE_FATIGUE = ("creative_fatigue", IssueKind.NUMERIC)
    BUDGET_PACING_GAP = ("budget_pacing_gap", IssueKind.NUMERIC)

    TRACKING_FAILURE = ("tracking_failure", IssueKind.BOOLEAN)
    CREATIVE_DISAPPROVED = ("creative_disapproved", IssueKind.BOOLEAN)
    TARGETING_ERROR = ("targeting_error", IssueKind.BOOLEAN)

    def __init__(self, slug: str, kind: IssueKind):
        self.slug = slug
        self.kind = kind

Then the state transition does not need to guess what kind of issue it is dealing with.
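As a rough sketch of how a transition could branch on `kind`, assuming the caller already knows whether the issue was active last period and whether it is detected now. The `transition` function and its parameters are hypothetical. State names are plain strings here to keep the example short, and STALE is left out.

```python
from enum import Enum
from typing import Optional


class IssueKind(Enum):
    NUMERIC = "numeric"
    BOOLEAN = "boolean"


class IssueType(Enum):
    HIGH_CPM = ("high_cpm", IssueKind.NUMERIC)
    CREATIVE_DISAPPROVED = ("creative_disapproved", IssueKind.BOOLEAN)

    def __init__(self, slug: str, kind: IssueKind):
        self.slug = slug
        self.kind = kind


def transition(
    issue_type: IssueType,
    was_active: bool,
    detected: bool,
    delta: Optional[float] = None,
    threshold_pct: float = 0.20,
) -> Optional[str]:
    """Return the next state name, or None when there is nothing to track."""
    if not detected:
        return "RESOLVED" if was_active else None
    if not was_active:
        return "NEW"
    # Boolean issues have no magnitude, so any delta is ignored.
    if issue_type.kind is IssueKind.BOOLEAN or delta is None:
        return "RECURRING"
    if delta > threshold_pct:
        return "WORSENING"
    if delta < -threshold_pct:
        return "IMPROVING"
    return "RECURRING"
```

Because the kind travels with the issue type, a boolean issue can never be reported as worsening, even if a caller passes a delta by mistake.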

Thresholds should not be global

I usually do not like using one fixed threshold for every issue type.

A 20% change can mean different things depending on the metric.

For example:

A 20% CPA increase may be meaningful.
A 20% CPM change may be normal in some accounts.
A 10% frequency increase may already matter if the audience is small.
A small budget pacing gap may not be worth escalating.

So the threshold should be configurable by issue type.

Something like:

ISSUE_THRESHOLDS = {
    IssueType.HIGH_CPM: 0.20,
    IssueType.CREATIVE_FATIGUE: 0.10,
    IssueType.BUDGET_PACING_GAP: 0.20,
}

This keeps the state machine simple, but still lets each metric behave differently.
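A small sketch of how the lookup might work, assuming a fallback for issue types without an explicit entry. `DEFAULT_THRESHOLD`, `threshold_for`, and the `CPA_INCREASE` member are hypothetical, and the enum uses plain string values here only to keep the example short.

```python
from enum import Enum


class IssueType(Enum):
    HIGH_CPM = "high_cpm"
    CREATIVE_FATIGUE = "creative_fatigue"
    BUDGET_PACING_GAP = "budget_pacing_gap"
    CPA_INCREASE = "cpa_increase"  # hypothetical member with no entry below


# Assumed fallback for issue types that have no tuned threshold yet.
DEFAULT_THRESHOLD = 0.20

ISSUE_THRESHOLDS = {
    IssueType.HIGH_CPM: 0.20,
    IssueType.CREATIVE_FATIGUE: 0.10,
    IssueType.BUDGET_PACING_GAP: 0.20,
}


def threshold_for(issue_type: IssueType) -> float:
    """Return the configured threshold, or the default when none is set."""
    return ISSUE_THRESHOLDS.get(issue_type, DEFAULT_THRESHOLD)
```

The result can then be passed as `threshold_pct` to the state function, so tuning one metric never changes the behavior of another.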

Where the LLM actually belongs

I still use the LLM.

Just not for the lifecycle decision.

Once the issue state is already computed, I pass structured context to the LLM and ask it to write the recommendation.

Example:

prompt = f"""
Issue type: {issue.issue_type.slug}
State: {issue.state.value}
Current value: {issue.current_value}
Previous value: {issue.previous_value}
Delta: {issue.delta_pct:.1%}

Write a concise recommendation for this issue.
"""

Now the LLM is doing the part it is actually good at:

Turning structured data into a readable explanation.

It can say:

Creative fatigue is worsening. Frequency increased by 24% compared to the previous period, while CTR continued to decline. Consider rotating in fresh creatives or reducing spend on the affected ad set.

But it did not decide that the issue was worsening.

The system already knew that.
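A minimal sketch of the structured context behind such a prompt, assuming the delta is derived from the stored values before any LLM call. `IssueContext` and `build_prompt` are hypothetical names, and the issue type and state are plain strings here for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueContext:
    issue_type_slug: str
    state: str  # computed by the deterministic layer, never by the LLM
    current_value: float
    previous_value: float

    @property
    def delta_pct(self) -> float:
        # Relative change as a fraction; assumes previous_value is non-zero.
        return (self.current_value - self.previous_value) / self.previous_value


def build_prompt(issue: IssueContext) -> str:
    """Render the already-decided facts into a prompt for the wording step."""
    return (
        f"Issue type: {issue.issue_type_slug}\n"
        f"State: {issue.state}\n"
        f"Current value: {issue.current_value}\n"
        f"Previous value: {issue.previous_value}\n"
        f"Delta: {issue.delta_pct:.1%}\n"
        "\n"
        "Write a concise recommendation for this issue.\n"
    )
```

The prompt states the state as a fact, so the model is only asked for wording and never for a verdict.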

Why this separation matters

This design gives a few practical benefits.

1. The system is auditable

If someone asks why an issue was marked as worsening, I can show the exact values.

previous value: 1.20
current value: 1.55
delta: +29.2%
threshold: 20%
state: WORSENING

No prompt guessing.

No “the model thought it was worse.”
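An audit record like that can be produced by the same arithmetic that sets the state. This is a sketch under the assumption that a higher value is worse. `audit_record` is a hypothetical helper, not part of the original system.

```python
def audit_record(previous: float, current: float, threshold_pct: float) -> dict:
    """Return the exact inputs and result behind a state decision."""
    # Relative change versus the previous period; assumes previous is non-zero.
    delta = (current - previous) / previous

    if delta > threshold_pct:
        state = "WORSENING"
    elif delta < -threshold_pct:
        state = "IMPROVING"
    else:
        state = "RECURRING"

    return {
        "previous_value": previous,
        "current_value": current,
        "delta": f"{delta:+.1%}",
        "threshold": f"{threshold_pct:.0%}",
        "state": state,
    }


record = audit_record(previous=1.20, current=1.55, threshold_pct=0.20)
```

Storing this record next to the state means the explanation for any alert is a lookup, not a re-run.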

2. The system is consistent

The same input produces the same state every time.

This is important for weekly monitoring, dashboards, alerts, and Slack summaries.

If the state changes, it changed because the data changed.

3. It is cheaper

There is no reason to call an LLM thousands of times just to classify a numeric delta.

That cost adds up quickly if you are checking many advertisers, campaigns, creatives, products, or funnel stages.

4. Debugging is easier

If the state assignment is wrong, I know where to look.

Maybe the threshold is too sensitive.

Maybe the previous period logic is wrong.

Maybe the metric should be inverted because higher is better, as with ROAS, while the state logic assumes that higher is worse.

These are engineering problems.

They are much easier to fix than prompt behavior.
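The direction problem, for example, can be handled with one flag per metric. A minimal sketch, assuming each issue type declares whether a higher value is worse. The `degradation` function and the `higher_is_worse` parameter are hypothetical.

```python
def degradation(current: float, previous: float, higher_is_worse: bool = True) -> float:
    """Relative change, signed so that a positive result always means 'worse'.

    Assumes previous is non-zero; a zero baseline needs separate handling.
    """
    change = (current - previous) / abs(previous)
    return change if higher_is_worse else -change


# CPA rising from 50 to 64 is a degradation of about 0.28.
cpa = degradation(current=64.0, previous=50.0, higher_is_worse=True)

# ROAS falling from 4.0 to 3.0 is a degradation of about 0.25,
# even though the raw change is negative.
roas = degradation(current=3.0, previous=4.0, higher_is_worse=False)
```

With the sign normalized this way, the same threshold comparison works for both kinds of metric.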

The general rule I follow

For AI systems, I try to separate the work into two layers.

Deterministic layer:
- detection
- thresholds
- state transitions
- deduplication
- severity scoring
- history checks

LLM layer:
- explanation
- summarization
- recommendation wording
- stakeholder-friendly language

The deterministic layer decides what happened.

The LLM explains it.
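The two layers can be sketched as a function that computes the state first and only then hands it to an injected wording function. This is a simplified illustration: `classify`, `report`, and the `explain` callable are hypothetical, and `explain` stands in for whatever LLM client is used.

```python
from typing import Callable


def classify(current: float, previous: float, threshold_pct: float = 0.20) -> str:
    """Deterministic layer: decide the state from numbers alone."""
    delta = (current - previous) / previous  # assumes previous is non-zero
    if delta > threshold_pct:
        return "WORSENING"
    if delta < -threshold_pct:
        return "IMPROVING"
    return "RECURRING"


def report(current: float, previous: float, explain: Callable[[str], str]) -> dict:
    """Compute the state, then ask the wording layer to describe it."""
    state = classify(current, previous)
    prompt = (
        f"State: {state}\n"
        f"Current value: {current}\n"
        f"Previous value: {previous}\n"
        "Write a concise recommendation for this issue."
    )
    # The explanation can vary between runs; the state cannot.
    return {"state": state, "explanation": explain(prompt)}


def fake_llm(prompt: str) -> str:
    """Stub used in place of a real model call for this example."""
    return "stub: " + prompt.splitlines()[0]


result = report(current=1.55, previous=1.20, explain=fake_llm)
```

Injecting `explain` also makes the deterministic layer testable without any network access or API key.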

That separation makes the system much easier to trust.

And in production, trust matters more than making the architecture look “more AI.”
