
AI ticket classification for FM service desks: what actually works

The pitch is irresistible. The first month is awkward. The difference between "AI saves us hours" and "AI dispatched a plumber to an electrical job" is mostly in how the deployment is designed, not how good the model is.

Auto-classification of inbound tickets is the most-pitched AI feature in facilities service desks, and the one that fails most quietly. The pitch is straightforward: tenants and staff write tickets in free text; AI tags the trade, priority, asset, and routing; the service desk skips the triage step. When it works, it saves real money. When it doesn't, the trust damage takes longer to repair than the original time savings.

What classification actually means

A useful classification on a service desk usually answers four questions. What kind of work is this (electrical, plumbing, fabric, HVAC, grounds)? How urgent is it (immediate, within 24 hours, planned)? Which asset or area does it relate to? Who should see it next?

Each of those is a different problem. Trade and priority are usually solvable with high accuracy from the text and photos. Asset and area need the platform to have a clean location and asset register; if your asset register is noisy, the AI will be too. Routing is the most sensitive because it is the step that makes the operational consequence visible: a misrouted ticket lands on the wrong team and stays there until someone notices.
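
To make that concrete, here is one shape the classifier's output could take. The field names and category values below are illustrative placeholders, not a fixed schema, and the per-field confidences reflect the point that each question is a separate problem:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative output structure for a classifier covering the four questions.
# Field names and values are hypothetical, not a fixed schema.
@dataclass
class TicketClassification:
    trade: str                    # e.g. "electrical", "plumbing", "fabric", "hvac", "grounds"
    priority: str                 # e.g. "immediate", "within_24h", "planned"
    asset_id: Optional[str]       # None when the asset register has no clean match
    route_to: str                 # team or queue that should see the ticket next
    confidence: dict[str, float]  # per-field confidence, 0.0 to 1.0

example = TicketClassification(
    trade="electrical",
    priority="within_24h",
    asset_id=None,                # noisy register: better to admit no match than guess
    route_to="electrical-team",
    confidence={"trade": 0.93, "priority": 0.71, "asset": 0.30, "route": 0.88},
)
```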

Where AI works well

The pattern that consistently delivers value is bounded classification with confidence scoring. The model returns its best guess at trade, priority, and asset, with a confidence value attached. High-confidence tickets route automatically. Low-confidence tickets land in a triage queue with the suggested classification visible but not yet applied. The triage agent confirms or corrects in two clicks, and the correction feeds back into the model.
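
A minimal sketch of that gate, with a made-up threshold and field names; real deployments tune the threshold per deployment and per category:

```python
AUTO_ROUTE_THRESHOLD = 0.85  # illustrative; tuned per deployment

def dispatch(confidences: dict[str, float], route_to: str) -> str:
    """Gate on the weakest field, not the average: a confident trade guess
    does not excuse an uncertain routing guess."""
    decisive = min(confidences["trade"], confidences["priority"], confidences["route"])
    if decisive >= AUTO_ROUTE_THRESHOLD:
        return f"auto-routed to {route_to}"
    # Below threshold: the suggestion is shown but not applied. The agent's
    # confirm-or-correct action becomes a labelled example for the next cycle.
    return "triage queue, suggestion attached"

print(dispatch({"trade": 0.93, "priority": 0.71, "route": 0.88}, "electrical-team"))
# -> "triage queue, suggestion attached": the priority confidence drags it down
```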

The numbers vary by deployment, but a typical pattern is 70-80% of tickets routing automatically with high confidence, 15-20% landing in triage with a suggested classification that an agent confirms in seconds, and 5% that the model declines to classify because the description is ambiguous or contradictory. That last 5% is where human judgement earns its keep, and it is exactly the slice where bad classification is most expensive.

Where it fails

The most common failure mode is over-confident classification. Models trained on lots of clean data can produce a high confidence score for a ticket that is actually ambiguous, particularly when the ticket text uses unfamiliar phrasing or describes a symptom rather than a cause. "The lights are flickering and there is a burning smell" is an electrical job, but it is also potentially a fire risk that needs an immediate response rather than a same-day ticket.

The second failure mode is silent drift. The model that worked well for the first six months stops working as tenant language changes, new asset types are introduced, or the classification taxonomy quietly evolves. Without ongoing measurement, the drift is invisible until someone notices the misroute rate creeping up.
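
The ongoing measurement does not need to be elaborate. A sketch with invented numbers: compare the recent agent-override rate against a baseline captured during the model's healthy months, and alert when it moves.

```python
def drift_alert(recent_overrides: list[bool], baseline_rate: float,
                tolerance: float = 0.05) -> bool:
    """Flag drift when the recent agent-override rate exceeds the baseline
    measured during the model's first healthy months, plus a tolerance."""
    recent_rate = sum(recent_overrides) / len(recent_overrides)
    return recent_rate > baseline_rate + tolerance

# Example: baseline override rate was 8%; the last 200 triaged tickets show 15%.
recent = [True] * 30 + [False] * 170            # 15% overridden
print(drift_alert(recent, baseline_rate=0.08))  # True: time to retrain or re-review
```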

The third is misalignment between AI confidence and operational risk. A 90% confident classification of a low-risk ticket is fine if it is wrong; a 90% confident classification of a high-risk safety incident is not. The threshold for automatic routing should be a function of consequence, not a single number applied across the board.
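
One way to express that, with hypothetical risk tiers and threshold values:

```python
# Illustrative thresholds keyed on consequence, not one global number.
# The tiers and values are hypothetical and would be set per operation.
THRESHOLD_BY_RISK = {
    "low":    0.80,  # cosmetic fabric defects, grounds maintenance
    "medium": 0.90,  # routine plumbing, HVAC comfort complaints
    "high":   1.01,  # safety-adjacent work: unreachable, always goes to a human
}

def auto_route_allowed(risk_tier: str, confidence: float) -> bool:
    return confidence >= THRESHOLD_BY_RISK[risk_tier]

print(auto_route_allowed("low", 0.90))   # True
print(auto_route_allowed("high", 0.99))  # False: no confidence clears the bar
```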

Human-in-the-loop patterns

The pattern that lasts is one where humans stay in the loop on the cases that matter and stay out of it on the cases that do not. Routine tickets with high confidence route automatically. Tickets above a priority threshold always touch a human, regardless of confidence. Tickets involving safety keywords (gas, fire, electrical shock, water leak above electrical equipment) skip classification entirely and route to a duty officer. The system learns from the cases it gets right, and it learns from the cases it gets wrong because every override is feedback.
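
Those three rules compose into a short routing function. The keyword list, priority labels, and threshold below are illustrative, and a production keyword check would be more robust than exact word matching:

```python
SAFETY_KEYWORDS = {"gas", "fire", "burning", "shock", "sparking"}  # illustrative, not exhaustive
AUTO_ROUTE_THRESHOLD = 0.85  # illustrative

def route(ticket_text: str, priority: str, confidence: float, suggested_team: str) -> str:
    """Apply the three rules in order of precedence."""
    words = set(ticket_text.lower().split())
    if words & SAFETY_KEYWORDS:
        return "duty-officer"      # safety keywords skip classification entirely
    if priority == "immediate":
        return "triage-queue"      # a human sees every high-priority ticket
    if confidence >= AUTO_ROUTE_THRESHOLD:
        return suggested_team      # routine and high-confidence: route it
    return "triage-queue"          # everything else waits for a human

print(route("lights flickering and a burning smell", "within_24h", 0.93, "electrical-team"))
# -> "duty-officer": the safety keyword wins even at high confidence
```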

The mistake to avoid is treating AI classification as replacing the triage role rather than augmenting it. The role does not disappear. It changes shape: less typing, more reviewing edge cases, more adjusting thresholds and taxonomies as the operation evolves.

Measuring whether it saved time

The metric that matters is not classification accuracy in isolation. It is end-to-end resolution time, segmented by ticket type, against a baseline taken before AI was introduced. Accuracy without resolution-time improvement is a vanity metric.

Track the misroute rate, the override rate from agents, and the rate at which low-confidence tickets land in triage. Track these by ticket category, not as global averages. A model that is 95% accurate overall but 40% accurate on a small high-risk category is a model that will eventually generate the kind of incident an executive writes a position paper about.
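
A sketch of that segmentation, assuming each ticket record carries hypothetical category, misrouted, and overridden fields:

```python
from collections import defaultdict

def rates_by_category(tickets: list[dict]) -> dict[str, dict]:
    """Misroute and override rates per category, never as global averages.
    Assumes each ticket dict carries 'category', 'misrouted', 'overridden'."""
    buckets = defaultdict(list)
    for t in tickets:
        buckets[t["category"]].append(t)
    return {
        cat: {
            "tickets": len(ts),
            "misroute_rate": sum(t["misrouted"] for t in ts) / len(ts),
            "override_rate": sum(t["overridden"] for t in ts) / len(ts),
        }
        for cat, ts in buckets.items()
    }

sample = [
    {"category": "fabric", "misrouted": False, "overridden": False},
    {"category": "fabric", "misrouted": False, "overridden": True},
    {"category": "gas_safety", "misrouted": True, "overridden": True},
    {"category": "gas_safety", "misrouted": False, "overridden": False},
]
print(rates_by_category(sample))
# A healthy global average would hide gas_safety's 50% misroute rate.
```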

Transparency and ATRS

For UK government deployments, AI classification falls squarely under the Algorithmic Transparency Recording Standard. A published transparency record is mandatory for central government departments and their arms-length bodies, and the wider public sector is being pushed the same way. Writing one by hand for every model update is a chore nobody enjoys; platforms that generate the record automatically from configuration and live operational metadata remove the chore and produce something more accurate than a hand-edited document.
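
The idea in miniature, with simplified placeholder fields rather than the standard's own tier structure:

```python
import json
from datetime import date

def transparency_record(config: dict, metrics: dict) -> str:
    """Assemble a transparency record from live configuration and operational
    metadata. Field names are simplified placeholders, not the ATRS schema."""
    record = {
        "generated": date.today().isoformat(),
        "purpose": "Classify inbound service-desk tickets by trade, priority, asset",
        "human_oversight": {
            "auto_route_threshold": config["auto_route_threshold"],
            "safety_keyword_bypass": config["safety_keywords"],
            "triage_review": "all low-confidence and high-priority tickets",
        },
        "observed_performance": metrics,  # pulled from live dashboards, not hand-edited
    }
    return json.dumps(record, indent=2)

print(transparency_record(
    config={"auto_route_threshold": 0.85, "safety_keywords": ["gas", "fire"]},
    metrics={"auto_route_share": 0.76, "override_rate": 0.09},
))
```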

The transparency record is also useful internally. It forces a clear statement of what the model is doing, what it is not doing, and where the human stays in the loop. Teams that have written one find it sharpens their own understanding of how the AI fits into the operation.

See AI classification with the safety rails on

Jarsis Platform ships with AI classification, confidence scoring, automatic ATRS records, and configurable human-in-the-loop thresholds. Book a demo to see it on your kind of tickets.