Most property operations teams don't have a prioritisation problem. They have a visibility problem — and they're solving it with headcount when they should be solving it with logic.
## The Pattern We Keep Seeing
Across portfolio operators and software teams building resident-facing products, the intake problem looks almost identical regardless of scale. A case manager arrives Monday morning to a queue of 40, 50, sometimes 80 unread messages. Somewhere in that queue is a burst pipe reported Friday night. It's sitting behind a paint touch-up request and a bin collection query. No system has surfaced it. No flag has been raised. The urgency is invisible until someone reads far enough down to find it.
We've seen this described in near-identical terms by a compliance lead at a 200-unit build-to-rent operator and a product director at a facilities management SaaS company — different organisations, different software stacks, same Monday morning problem. The inbox is a flat list. Flat lists don't discriminate between a leaking roof and a cosmetic scratch. Humans have to do that work manually, and manual triage at scale is slow, inconsistent, and — when it fails — genuinely risky.
## The Problem Is Structural, Not Operational
It's tempting to frame this as a staffing question. If the team were bigger, someone would always be available to read incoming cases in real time. But that misses the point. The constraint isn't availability; it's the cognitive overhead of routing. A skilled case manager can triage a single case in 30–60 seconds. Across 50 cases a day, that's over half an hour of pure classification work, every day, before a single problem has actually been resolved. A solid chunk of every morning goes on deciding what to work on rather than working on it.
Worse, manual triage is only as good as the last person who touched the queue. Inconsistency creeps in. A resident who's complained three times gets treated like a first contact. A household with a young child and a broken boiler in winter doesn't get flagged any differently than one without. The system has no memory, and it has no ability to read between the lines of a 400-word email written by someone who is increasingly furious.
This is the problem AI triaging is designed to solve — not by replacing case manager judgement, but by doing the classification work before any human gets involved.
## The Three-Layer Architecture
AI triage in a property management context isn't a single step. A production-grade system runs in three distinct layers, each serving a different purpose and operating at a different cost.
### Layer 1 — Keyword Classification
The first pass is a weighted keyword scan against a library of terms mapped to property categories: plumbing, electrical, HVAC, security, pest control, and others. The weighting matters more than the keyword list itself.
```python
KEYWORD_WEIGHTS = {
    # Safety-critical — fire immediately
    "gas smell": 10,
    "gas leak": 10,
    "no electricity": 9,
    "flood": 9,
    "burst pipe": 9,
    "fire": 10,
    # High urgency
    "no heat": 7,
    "boiler broken": 7,
    "no hot water": 6,
    "security breach": 8,
    # Medium
    "leak": 5,
    "mould": 5,
    "damp": 4,
    # Low
    "squeaky door": 2,
    "paint": 1,
    "bin collection": 1,
}

SUBJECT_LINE_MULTIPLIER = 2.0


def keyword_score(subject: str, body: str) -> tuple[float, str, float]:
    subject_lower = subject.lower()
    body_lower = body.lower()
    max_weight = 0.0
    matched_category = "general"

    for keyword, weight in KEYWORD_WEIGHTS.items():
        # Subject line counts double
        if keyword in subject_lower:
            adjusted = weight * SUBJECT_LINE_MULTIPLIER
            if adjusted > max_weight:
                max_weight = adjusted
                # classify_keyword maps a keyword to its property category (defined elsewhere)
                matched_category = classify_keyword(keyword)
        elif keyword in body_lower:
            if weight > max_weight:
                max_weight = weight
                matched_category = classify_keyword(keyword)

    # Confidence: how certain we are in this classification
    confidence = min(max_weight / 10.0, 1.0)
    return max_weight, matched_category, confidence
```
This layer handles the majority of cases — accurately, instantly, and at zero API cost. A resident who types "URGENT: no heat" in the subject is communicating something different than one who mentions it in paragraph four. The multiplier handles that signal.
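To see the multiplier at work, here's a trimmed, self-contained version of the scorer above — category lookup stripped out, since `classify_keyword` lives elsewhere, and only two weights kept for the demonstration:

```python
KEYWORD_WEIGHTS = {"no heat": 7, "paint": 1}
SUBJECT_LINE_MULTIPLIER = 2.0


def keyword_score(subject: str, body: str) -> float:
    """Trimmed scorer: highest-weighted match wins, subject hits count double."""
    subject_lower, body_lower = subject.lower(), body.lower()
    max_weight = 0.0
    for keyword, weight in KEYWORD_WEIGHTS.items():
        if keyword in subject_lower:
            max_weight = max(max_weight, weight * SUBJECT_LINE_MULTIPLIER)
        elif keyword in body_lower:
            max_weight = max(max_weight, float(weight))
    return max_weight


# A subject-line mention of "no heat" outscores the same phrase buried in the body
assert keyword_score("URGENT: no heat", "") == 14.0
assert keyword_score("Flat 4B", "there is no heat in the bedroom") == 7.0
```

The same signal could be modelled other ways (separate subject and body scores, for instance); the single multiplier is simply the cheapest version that captures it.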
### Layer 2 — Sentiment Scoring
Keywords tell you *what* the problem is. Sentiment tells you *how bad* it's gotten.
A resident who writes "the tap is dripping" gets medium priority. A resident who writes "I've reported this dripping tap three times and nobody has responded — I am contacting my solicitor" is describing the same physical problem, but the case is now high priority, flagged for escalation, and marked as containing a legal threat.
```python
import re

SENTIMENT_SIGNALS = {
    "emergency_language": {
        "patterns": [r"\burgent\b", r"\bemergency\b", r"\bASAP\b", r"\bimmediately\b", r"\bdesperate\b"],
        "score_adjustment": +2,
    },
    "escalation_signals": {
        "patterns": [
            r"\bbeen waiting\b", r"\bno.{0,15}responded\b", r"\bcomplained before\b",
            r"\bsolicitor\b", r"\blawyer\b", r"\bcouncil\b", r"\breport.{0,10}again\b",
            r"\bthird time\b", r"\bmultiple times\b",
        ],
        "score_adjustment": +3,
        "flag": "ESCALATED_COMPLAINT",
    },
    "vulnerable_population": {
        "patterns": [
            r"\belderly\b", r"\bchild(ren)?\b", r"\bbaby\b", r"\binfant\b",
            r"\bdisabled\b", r"\bwheelchair\b", r"\bpregnant\b",
            r"\basthma\b", r"\bheart condition\b", r"\bmedical\b",
        ],
        "score_adjustment": +2,
        "flag": "VULNERABLE_POPULATION",
    },
    "extended_duration": {
        "patterns": [
            r"\bfor weeks\b", r"\bfor months\b", r"\bsince \w+ \d{4}\b",
            r"\bgetting worse\b", r"\bstill not fixed\b", r"\bongoing\b",
        ],
        "score_adjustment": +2,
        "flag": "EXTENDED_DURATION",
    },
    "emotional_intensity": {
        # (?-i:...) keeps the all-caps check case-sensitive despite re.IGNORECASE below
        "patterns": [r"(?-i:[A-Z]{5,})", r"!!!+", r"\?\?\?+", r"\bfurious\b", r"\boutraged\b"],
        "score_adjustment": +1,
        "flag": "HIGH_EMOTIONAL_INTENSITY",
    },
    "deprioritise": {
        "patterns": [r"\bno rush\b", r"\bwhenever convenient\b", r"\bwhen you get a chance\b"],
        "score_adjustment": -2,
    },
}


def sentiment_adjust(text: str, base_score: float) -> tuple[float, list[str]]:
    flags = []
    adjusted_score = base_score
    for signal_name, config in SENTIMENT_SIGNALS.items():
        for pattern in config["patterns"]:
            if re.search(pattern, text, re.IGNORECASE):
                adjusted_score += config["score_adjustment"]
                if "flag" in config:
                    flags.append(config["flag"])
                break  # One match per signal type is enough
    return adjusted_score, list(set(flags))
```
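A compressed, self-contained version of the same logic shows the adjustment in action on the frustrated-resident email from earlier — signal set trimmed to two entries for brevity, behaviour otherwise identical:

```python
import re

SENTIMENT_SIGNALS = {
    "escalation_signals": {
        "patterns": [r"\bsolicitor\b", r"\bthird time\b"],
        "score_adjustment": +3,
        "flag": "ESCALATED_COMPLAINT",
    },
    "deprioritise": {
        "patterns": [r"\bno rush\b"],
        "score_adjustment": -2,
    },
}


def sentiment_adjust(text: str, base_score: float) -> tuple[float, list[str]]:
    flags, score = [], base_score
    for config in SENTIMENT_SIGNALS.values():
        for pattern in config["patterns"]:
            if re.search(pattern, text, re.IGNORECASE):
                score += config["score_adjustment"]
                if "flag" in config:
                    flags.append(config["flag"])
                break  # one match per signal type
    return score, flags


# Same dripping tap, base weight 5 ("leak"), but the legal threat lifts it to 8
angry = "I've reported this dripping tap for the third time and will contact my solicitor."
score, flags = sentiment_adjust(angry, 5)
assert score == 8 and "ESCALATED_COMPLAINT" in flags

# A polite "no rush" moves the score the other way
assert sentiment_adjust("no rush at all, whenever suits", 3)[0] == 1
```

Note the `break`: two escalation phrases in the same email only count once, which stops one long rant from saturating the score.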
### Layer 3 — LLM (Edge Cases Only)
The part nobody demos: routing every case through a language model is the naive implementation. It's expensive at scale, introduces latency, and creates a single point of failure.
```python
import json

import anthropic

CONFIDENCE_THRESHOLD = 0.6
LLM_BUDGET_LIMIT_DAILY = 50  # Per organisation


def triage_case(subject: str, body: str, org_id: str) -> dict:
    full_text = f"{subject}\n\n{body}"

    # Layer 1: Keyword scan
    keyword_weight, category, confidence = keyword_score(subject, body)

    # Safety-critical: fire immediately, skip everything else
    if keyword_weight >= 9:
        return {
            "priority": "URGENT",
            "category": category,
            "flags": ["SAFETY_CRITICAL"],
            "source": "keyword_immediate",
            "llm_used": False,
        }

    # Layer 2: Sentiment adjustment
    adjusted_score, flags = sentiment_adjust(full_text, keyword_weight)
    priority = score_to_priority(adjusted_score)

    # Layer 3: LLM for genuinely ambiguous cases
    # is_budget_exhausted / log_llm_usage are the per-org budget helpers (defined elsewhere)
    if confidence < CONFIDENCE_THRESHOLD and not is_budget_exhausted(org_id):
        llm_result = classify_with_llm(subject, body)
        log_llm_usage(org_id)
        return {
            "priority": llm_result["priority"],
            "category": llm_result["category"],
            "flags": flags + llm_result.get("additional_flags", []),
            "source": "llm",
            "llm_used": True,
        }

    return {
        "priority": priority,
        "category": category,
        "flags": flags,
        "source": "keyword_sentiment",
        "llm_used": False,
    }


def classify_with_llm(subject: str, body: str) -> dict:
    client = anthropic.Anthropic()
    prompt = f"""Classify this property management case. Return JSON only.

Subject: {subject}
Body: {body}

Return:
{{
    "priority": "URGENT|HIGH|MEDIUM|LOW",
    "category": "plumbing|electrical|hvac|security|pest|noise|administrative|other",
    "reasoning": "one sentence",
    "additional_flags": []
}}"""
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(message.content[0].text)


def score_to_priority(score: float) -> str:
    if score >= 8:
        return "URGENT"
    if score >= 5:
        return "HIGH"
    if score >= 3:
        return "MEDIUM"
    return "LOW"
```
Critically: urgent cases skip the LLM entirely. If "gas smell" fires at weight 10, there is no reason to wait for a model to confirm it. The flag fires immediately.
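The budget helpers referenced above, `is_budget_exhausted` and `log_llm_usage`, are left undefined. One minimal in-memory sketch is below; a production deployment would presumably back this with Redis or a database and reset the counters on a daily schedule, neither of which is shown here:

```python
from collections import defaultdict

LLM_BUDGET_LIMIT_DAILY = 50  # per organisation, matching the constant above

# In-memory call counter; a real system would persist this and reset it daily
_llm_calls_today: dict = defaultdict(int)


def is_budget_exhausted(org_id: str) -> bool:
    """True once the organisation has used its daily LLM allowance."""
    return _llm_calls_today[org_id] >= LLM_BUDGET_LIMIT_DAILY


def log_llm_usage(org_id: str) -> None:
    """Record one LLM call against the organisation's daily budget."""
    _llm_calls_today[org_id] += 1


# Once the allowance is spent, ambiguous cases fall back to keyword + sentiment
for _ in range(LLM_BUDGET_LIMIT_DAILY):
    log_llm_usage("org-123")
assert is_budget_exhausted("org-123")
assert not is_budget_exhausted("org-456")
```

Because `triage_case` checks the budget before calling the model, exhausting it degrades classification quality for edge cases rather than blocking triage outright.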
## The Flags That Change How Cases Are Handled
Beyond the four standard priority levels, a well-implemented triage system surfaces named flags that give case managers actionable context before they open a case.
```python
FLAG_DEFINITIONS = {
    "VULNERABLE_POPULATION": "Children, elderly, disabled, pregnant, or health condition mentioned",
    "ESCALATED_COMPLAINT": "Frustration expressed, repeat complaint, or legal threat indicated",
    "EXTENDED_DURATION": "Issue ongoing for days, weeks, or months",
    "HIGH_EMOTIONAL_INTENSITY": "Writing style indicates significant distress",
    "SAFETY_CRITICAL": "Immediate safety risk — gas, fire, flood, security",
}
```
None of these flags override agent judgement. The case manager retains full control to reclassify category and priority once they open a case. The flags are there to make sure context — which might be buried in paragraph five of a long email — is visible at a glance before any decision is made.
This is, if we're honest, where a lot of internal-build attempts fall down. Teams build classification but skip the flagging layer, and then wonder why the system doesn't feel meaningfully different from what they had before.
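One cheap way to make the flags glanceable is to assemble a one-line case header from `FLAG_DEFINITIONS`. The `flag_summary` helper and its output format below are an illustration, not part of the system described above:

```python
FLAG_DEFINITIONS = {
    "VULNERABLE_POPULATION": "Children, elderly, disabled, pregnant, or health condition mentioned",
    "ESCALATED_COMPLAINT": "Frustration expressed, repeat complaint, or legal threat indicated",
}


def flag_summary(priority: str, flags: list) -> str:
    """One-line case header: priority plus human-readable flag context."""
    parts = [priority] + [f"{f}: {FLAG_DEFINITIONS[f]}" for f in flags if f in FLAG_DEFINITIONS]
    return " | ".join(parts)


header = flag_summary("HIGH", ["ESCALATED_COMPLAINT"])
assert header == "HIGH | ESCALATED_COMPLAINT: Frustration expressed, repeat complaint, or legal threat indicated"
```

The point is that the context travels with the case into whatever list view the case manager sees first, rather than living only in the classification record.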
## Where Human Judgement Still Wins — and Should
We don't think the goal here is to reduce human involvement in case management. We think that's the wrong framing entirely. The goal is to get the right information to the right person at the right moment, so that their judgement is applied where it actually matters.
Consider a case where a resident reports a neighbour playing loud music that's been ongoing for months, and mentions that an elderly family member can't sleep. A triage system will correctly detect VULNERABLE_POPULATION and EXTENDED_DURATION. It might correctly classify this as a noise complaint. But whether this requires an administrative response, a welfare check, or something more urgent is a call that sits with a human — someone who knows the property, knows the tenant history, and can weigh factors the AI simply doesn't have access to.
A compliance lead at a large residential operator put this to us in a way that stuck: "The AI gives me the same information I'd have if I'd read every email carefully. It doesn't tell me what to do with it." That's the right relationship between the technology and the professional.
## The Cost and Reliability Architecture That Actually Scales
For teams building or evaluating AI triage systems, the economics matter as much as the capability.
```python
import hashlib
import time

# Response cache — identical cases don't re-trigger the LLM
_cache: dict = {}


def cached_llm_classify(subject: str, body: str) -> dict:
    cache_key = hashlib.md5(f"{subject}:{body}".encode()).hexdigest()
    if cache_key in _cache:
        return _cache[cache_key]
    result = classify_with_llm(subject, body)
    _cache[cache_key] = result
    return result


# Circuit breaker — graceful degradation when the AI service is unavailable
class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                return fallback
            # Recovery window elapsed: half-open, allow a trial call through
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            return fallback


circuit_breaker = CircuitBreaker()

# Usage: falls back to medium priority if the LLM is unavailable
llm_result = circuit_breaker.call(
    classify_with_llm,
    subject,
    body,
    fallback={"priority": "MEDIUM", "category": "general", "additional_flags": []},
)
```
A well-designed implementation keeps LLM usage minimal by design. The keyword and sentiment layers handle the majority of cases locally, with no API calls. Budget limits per organisation prevent runaway spend. The response cache means similar cases don't re-trigger an LLM call. The circuit breaker ensures that if the AI service is unavailable, cases default to medium priority rather than failing entirely.
We're not entirely sure how to benchmark this precisely across different portfolio sizes, because the case composition varies considerably. But the directional outcome is consistent: teams that deploy this architecture report that LLM calls represent a small minority of total triage operations, while the accuracy improvement over pure keyword classification is significant.
## Key Takeaways
- The three-layer architecture — keywords, then sentiment, then LLM for genuinely ambiguous cases — is the approach that balances accuracy, cost, and reliability at scale. Trying to shortcut to a single-layer LLM solution gets expensive and brittle.
- Urgency escalation should skip the LLM entirely. Safety-critical categories need to fire immediately, not wait for model confirmation.
- Named flags are as important as priority levels. Knowing a case is "high priority" is less actionable than knowing it's high priority + VULNERABLE_POPULATION + ESCALATED_COMPLAINT. The flags are where the system earns its daily usefulness.
- Human override should be built in by design, not bolted on as an afterthought. Case managers need to be able to reclassify immediately, without friction. The AI is a starting point.
- Budget controls and degradation patterns are non-negotiable for production deployment. A system that fails open — passing all costs and all decisions to the AI layer — will create problems that outweigh the operational benefits.
## How Context First AI Approaches This
At Context First AI, the triage capability described in this article sits within the broader HandyConnect V2.0 platform — built specifically for property management operations that need to handle resident cases at scale without proportionally scaling their operations team.
The system is designed around the principle that the best AI implementations make existing professionals more effective, not more dependent on automation. Case managers using HandyConnect don't experience the AI as a black box that produces decisions — they experience it as a first pass that surfaces what matters, so their attention goes where it belongs.
The Stack pillar at Context First AI focuses on technical product decisions: the architecture choices, the compliance implications, and the build-versus-buy questions that teams actually face when deploying AI in operational contexts. The triage system is one component of a larger infrastructure that includes automated email ingestion, SLA tracking, and role-based access management across blocks and properties.
For B2B teams evaluating AI tooling for case management, the relevant questions are around reliability under load, cost control at scale, and the degree to which the system supports — rather than constrains — the professionals using it. HandyConnect is built to answer those questions in production, not just in a demo.
## Conclusion
The inbox queue was never the right mental model for managing resident cases. A prioritised worklist — where the system has already surfaced the gas leak, flagged the frustrated long-term complainer, and elevated the vulnerable resident — is a fundamentally different working environment. The technology to build it exists and is mature enough for production deployment. The teams that will get the most out of it are the ones who treat it as an infrastructure decision, not a feature addition.
The question isn't whether AI triage is possible. It's whether the architecture is built to last beyond the first month.
## Resources
- [Building Production-Grade AI Systems: Cost Control and Reliability Patterns]
Created with AI assistance. Originally published at [Context First AI]
