Ticket triage is one of the most tedious, repetitive, and error-prone tasks in IT operations. A ticket comes in. Someone reads it, figures out what category it belongs to, assigns a priority, and routes it to the right team. This process happens hundreds of times per day in most organizations, and it is almost entirely manual.
AI can automate this. But doing it wrong means tickets get misrouted, priorities get botched, and your team loses trust in the system within a week. Here is how to build it right, based on a production system we deployed against Azure DevOps work items.
## The Architecture: ADO Analyzer
The system we built is called the ADO Analyzer. It monitors Azure DevOps for new work items, analyzes each one using an LLM, and either auto-triages it or flags it for human review. The key insight is that it does not try to handle everything — it handles the easy cases automatically and escalates the hard ones.
Here is the flow:
- A new ticket arrives in Azure DevOps.
- The analyzer pulls the title, description, and any attached logs or screenshots.
- It sends this content to an LLM with a structured prompt that asks for: category, priority, affected system, recommended team, and a confidence score for each determination.
- The LLM responds with structured JSON containing its analysis.
- The system applies confidence thresholds to decide what to do next.
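The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: `llm_call` stands in for whatever provider SDK you use, and the field names are assumptions based on the structured output described above.

```python
import json
from dataclasses import dataclass

@dataclass
class Triage:
    """The structured analysis the LLM returns for one work item."""
    category: str
    priority: str
    team: str
    confidence: float  # 0.0 - 1.0

def analyze_ticket(title: str, description: str, llm_call) -> Triage:
    """Send ticket content to the LLM and parse its structured JSON reply.

    llm_call is injected so any provider SDK can be used (hypothetical
    interface: takes a prompt string, returns the model's text reply).
    """
    prompt = (
        "Classify this work item. Respond with JSON only, containing: "
        "category, priority, team, and confidence (0-1).\n"
        f"Title: {title}\nDescription: {description}"
    )
    raw = llm_call(prompt)
    data = json.loads(raw)  # the prompt constrains the reply to JSON
    return Triage(data["category"], data["priority"],
                  data["team"], data["confidence"])
```

In practice you would also pull attached logs and screenshots into the prompt, and wrap `json.loads` in retry logic for malformed replies.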
## Confidence Thresholds: The Safety Net
This is the most important part of the design. Every AI classification comes with a confidence score, and the system behaves differently based on that score:
- **High confidence (above 85%):** The ticket is auto-triaged. Category, priority, and team assignment are applied automatically. A comment is added to the ticket explaining the AI's reasoning. No human intervention needed.
- **Medium confidence (60-85%):** The ticket is pre-filled with the AI's suggestions, but flagged for human review. An engineer glances at it, confirms or corrects the triage, and approves. This takes about 15 seconds instead of 2 minutes.
- **Low confidence (below 60%):** The ticket goes into the manual queue with no AI suggestions applied. The system does not guess when it is not confident enough to be useful.
These thresholds are not arbitrary. We calibrated them by running the analyzer against 500 historical tickets where the correct triage was already known. We adjusted the thresholds until the auto-triage accuracy exceeded 95% and the medium-confidence suggestions were helpful at least 80% of the time.
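The calibration loop described above amounts to sweeping a threshold over labeled historical tickets and measuring two numbers: how accurate the auto-triaged subset is, and how much of the volume it covers. A sketch of that measurement (data shapes are assumptions):

```python
def auto_accuracy(predictions, labels, threshold):
    """Accuracy and coverage of auto-triage at a given confidence threshold.

    predictions: list of (predicted_label, confidence) pairs.
    labels: the known-correct labels for the same tickets.
    Returns (accuracy among auto-triaged tickets, fraction auto-triaged).
    """
    auto = [(pred, true) for (pred, conf), true in zip(predictions, labels)
            if conf > threshold]
    if not auto:
        return 0.0, 0.0
    correct = sum(pred == true for pred, true in auto)
    return correct / len(auto), len(auto) / len(labels)
```

Sweeping `threshold` from 0.5 to 0.95 over the 500 historical tickets and picking the lowest value where accuracy stays above 95% is one way to arrive at the cutoffs described here.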
## Weighted Scoring for Priority
Priority assignment is more nuanced than category classification. A ticket might clearly be a "network issue" (easy to categorize) but determining whether it is P1 or P3 requires context. We use a weighted scoring system:
- **Impact keywords (weight: 30%):** Words like "outage," "down," "all users affected" push the priority up. Words like "intermittent," "one user," "cosmetic" push it down.
- **System criticality (weight: 25%):** Each system in the environment has a criticality rating. A ticket affecting the payment processing system scores higher than one affecting the internal wiki.
- **User role (weight: 20%):** Tickets from VPs and directors are not automatically higher priority, but the system does factor in whether the affected user is in a revenue-generating role.
- **Time sensitivity (weight: 15%):** Mentions of deadlines, SLAs, customer commitments, or regulatory requirements increase the priority score.
- **Historical pattern (weight: 10%):** If similar tickets in the past turned out to be P1 incidents, the system factors that pattern into its scoring.
The weighted score maps to a priority level. The thresholds are tuned based on your organization's actual priority distribution — if 80% of your tickets are P3, your thresholds should reflect that reality.
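The weighted scoring and the mapping to a priority level can be sketched as follows. The weights come from the list above; the factor names and the P1/P2/P3 cutoffs are placeholders to be tuned against your own priority distribution:

```python
# Weights from the scoring scheme above; each factor is normalized to 0-1.
WEIGHTS = {
    "impact": 0.30,       # outage vs. cosmetic keywords
    "criticality": 0.25,  # criticality rating of the affected system
    "role": 0.20,         # revenue-generating role of the affected user
    "time": 0.15,         # deadlines, SLAs, regulatory mentions
    "history": 0.10,      # similar past tickets that became P1s
}

def priority_score(factors: dict) -> float:
    """Weighted sum of per-factor scores; missing factors count as 0."""
    return sum(WEIGHTS[k] * factors.get(k, 0.0) for k in WEIGHTS)

def to_priority(score: float) -> str:
    """Map a 0-1 score to a priority level. Cutoffs here are placeholders:
    tune them so the output matches your real priority distribution."""
    if score >= 0.75:
        return "P1"
    if score >= 0.50:
        return "P2"
    return "P3"
```

For example, a ticket scoring high on impact and system criticality but low everywhere else lands around 0.55, a P2 under these placeholder cutoffs.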
## The Prompt Engineering That Makes It Work
The prompt is not "classify this ticket." That produces unreliable results. The prompt is structured to force the LLM into a specific reasoning pattern:
First, the prompt provides context about the organization — what teams exist, what systems they own, what the priority levels mean. Second, it provides the ticket content. Third, it asks for a step-by-step analysis: what is the reported issue, what system is affected, who is impacted, what is the business impact, and what team should handle it. Finally, it asks for the classification with a confidence percentage and a one-sentence justification.
This chain-of-thought approach dramatically improves accuracy compared to asking for a direct classification. The LLM's reasoning is also logged, which means when a human reviews a medium-confidence ticket, they can see why the AI made its suggestion and quickly validate or correct it.
## Results After 90 Days
After running the ADO Analyzer in production for three months, here are the numbers:
- **62% of tickets** were auto-triaged with high confidence. Of those, 96% were correctly classified — meaning only 4% needed human correction after the fact.
- **28% of tickets** fell in the medium-confidence range. The AI's pre-filled suggestions were accepted without changes 83% of the time.
- **10% of tickets** went to manual triage. These were genuinely ambiguous cases — tickets with vague descriptions, novel issue types, or cross-team impact.
- **Average triage time** dropped from 3.2 minutes per ticket to 22 seconds (averaged across auto, assisted, and manual).
- **Misroute rate** dropped from 18% to 6%.
## How to Start
You do not need to build the full system on day one. Start with a read-only analyzer that processes tickets and logs what it would have done, without actually making any changes. Run it for two weeks. Compare its suggestions against what your human triagers actually did. Tune the thresholds. Then gradually enable auto-triage for the highest-confidence cases and expand from there.
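The comparison step in that shadow period reduces to one number: how often the analyzer's logged suggestion matched what the human triager actually did. A sketch of that check (record shapes are assumptions):

```python
def agreement_rate(ai_log, human_triage):
    """Fraction of shadow-mode suggestions that matched the human decision.

    ai_log: records like {"id": ticket_id, "team": suggested_team},
    as written by the read-only analyzer.
    human_triage: mapping of ticket_id -> team the human actually chose.
    """
    if not ai_log:
        return 0.0
    matches = sum(1 for rec in ai_log
                  if human_triage.get(rec["id"]) == rec["team"])
    return matches / len(ai_log)
```

Computing this separately per confidence band tells you exactly where to set the auto-triage threshold before you let the system touch a single ticket.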
Originally published at https://primeautomationsolutions.com