Dheeraj Dhiman

Designing Hybrid Edge AI Systems for Low-Latency Intent Classification in Mobile Applications

A Hybrid Edge–Cloud Architecture for Low-Latency Intent Classification in Mobile Applications

Abstract

Large Language Models (LLMs) have fundamentally changed how applications process natural language. They excel at reasoning, summarization, question answering, and generating human-like responses. As a result, many modern applications route every user message directly to a cloud-hosted LLM.

While this approach is effective for complex conversations, it is often unnecessary for deterministic interactions. Commands such as "Show my leave balance", "Open settings", or "Contact HR" do not require generative reasoning. They require identifying a known intent and triggering a predefined workflow.

Sending these requests to the cloud introduces avoidable latency, increases operational costs, depends on network availability, and transmits user data that could otherwise remain on the device.

This article presents a hybrid architecture that performs intent classification entirely on the client using a lightweight machine learning model. By classifying predictable requests locally and forwarding only ambiguous or complex queries to a cloud-based LLM, applications can provide a significantly faster, more private, and more resilient user experience.

Although the implementation examples reference Core ML on iOS, the architectural principles discussed here apply equally to Android, desktop, and embedded systems.


Introduction

Over the past few years, conversational interfaces have evolved from simple rule-based chatbots into sophisticated AI assistants capable of understanding natural language.

As engineers, it is tempting to assume that every user message deserves the full reasoning power of a Large Language Model. In practice, however, most application interactions are remarkably predictable.

Consider the following examples:

  • Show my leave balance
  • Apply for leave tomorrow
  • Open profile
  • Change password
  • View salary slip
  • Track my order
  • Show today's appointments
  • Contact support

These requests are not open-ended questions.

They are commands.

Their purpose is not to generate new knowledge but to identify the user's intent and execute an existing application workflow.

Yet many applications still send these requests to remote AI services.

Although this simplifies implementation, it adds runtime dependencies that the feature does not actually need.

Each interaction now depends on:

  • Internet connectivity
  • API availability
  • Server scalability
  • Token consumption
  • Network latency

The user experiences several hundred milliseconds—or even multiple seconds—of delay simply to navigate to a screen that already exists inside the application.

This raises an important architectural question:

Should every natural language request be processed by a Large Language Model?

For many applications, the answer is no.


The Problem Statement

Modern AI systems are incredibly capable, but capability alone should not dictate architecture.

One of the fundamental responsibilities of software architecture is selecting the appropriate technology for each problem.

A calculator does not require a database.

A login screen does not require distributed computing.

Likewise, deterministic user commands often do not require generative AI.

Consider an enterprise application with the following features:

  • Leave management
  • HR policies
  • Employee directory
  • Expense submission
  • Attendance tracking
  • Payroll information
  • Internal documentation

A conversational interface might receive thousands of requests every day, but a significant percentage of those requests fall into a relatively small number of predictable categories.

Examples include:

User Request                  | Intended Action
------------------------------|------------------------
"How many leaves do I have?"  | Open Leave Balance
"Apply leave tomorrow"        | Open Leave Application
"Show my salary slip"         | Navigate to Payroll
"Office timings"              | Display Working Hours
"Email HR"                    | Open Contact Screen

Each request maps directly to an existing application feature.

No reasoning is required.

No content generation is required.

No external knowledge retrieval is required.

The challenge is simply determining which predefined action should be executed.

This is fundamentally a classification problem, not a reasoning problem.

Recognizing this distinction opens the door to a much simpler architecture.


A Different Architectural Perspective

Instead of treating every request as an AI problem, we can divide user interactions into two categories.

Category 1 — Deterministic Requests

These requests have known outcomes.

Examples include:

  • Open Settings
  • View Profile
  • Check Leave Balance
  • Company Policies
  • Working Hours
  • Contact HR

The expected action is already implemented inside the application.

The only missing piece is determining which action the user intended.

A lightweight text classifier can solve this in just a few milliseconds.


Category 2 — Generative Requests

These require reasoning beyond predefined workflows.

Examples include:

Compare my leave history over the last three years and suggest the best vacation period.

or

Summarize the company's parental leave policy.

or

Explain why my reimbursement request was rejected.

These requests benefit from the contextual understanding and reasoning capabilities of an LLM.

Rather than replacing the cloud entirely, the objective is to ensure that only requests requiring advanced reasoning are forwarded to it.


A Hybrid Edge–Cloud Architecture

This observation naturally leads to a hybrid architecture.

Instead of placing the LLM at the front of every interaction, the application first evaluates whether the request belongs to a known intent.

                    User Input
                         │
                         ▼
           On-Device Intent Classifier
                         │
          ┌──────────────┴──────────────┐
          │                             │
   High Confidence               Low Confidence
          │                             │
          ▼                             ▼
 Execute Local Action          Forward to Cloud LLM

This design introduces an intelligent routing layer between the user interface and the network.

The classifier becomes responsible for determining whether the application already knows how to satisfy the request.

If it does, the workflow executes immediately without leaving the device.

If not, the request is escalated to a cloud-based language model.

This architecture combines the strengths of both approaches:

  • Instant responses for predictable interactions
  • Rich reasoning for complex conversations

Rather than viewing edge AI and cloud AI as competing technologies, they become complementary components within the same system.


Edge AI Versus Cloud AI

Choosing between local inference and cloud inference is not about determining which technology is "better."

Each solves a different class of problems.

Architectural Characteristic | Cloud LLM                   | On-Device Intent Classifier
-----------------------------|-----------------------------|----------------------------
Network Connectivity         | Required                    | Not required
Average Response Time        | Typically 1–4 seconds       | Typically under 5 ms
Operational Cost             | Per-request API cost        | No per-request cost
Privacy                      | Data transmitted externally | Data remains on device
Offline Capability           | No                          | Yes
Reasoning Ability            | Excellent                   | Limited
Deterministic Commands       | Overkill                    | Ideal

The objective is not to eliminate cloud AI.

Instead, it is to reserve expensive reasoning engines for situations that genuinely require them.

A useful mental model is:

Use edge AI for routing. Use cloud AI for reasoning.

This simple design principle can significantly improve responsiveness while reducing unnecessary infrastructure costs.


Why Intent Classification Works

Intent classification is one of the oldest and most successful applications of Natural Language Processing.

Unlike generative models, which attempt to produce new text, a classifier performs a much simpler task:

Determine which predefined category best matches the input.

For example:

"Check my leave balance"

might produce

leave_balance

while

"What are today's office timings?"

might produce

working_hours

The output is not a paragraph.

It is simply a label.

Because the problem is constrained, the resulting model is dramatically smaller than a Large Language Model.

In many production systems, a simple intent classifier is tiny by comparison, often ranging from tens of kilobytes to a few megabytes depending on the algorithm and vocabulary, and it performs inference in a few milliseconds.

This makes it an excellent candidate for on-device deployment.
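
Because the classifier's output is only a string label, the application can map it onto a closed set of known actions. The following is a minimal sketch of that idea. The `Intent` enum, the `destination` property, and the screen names are hypothetical and exist only for illustration; the labels mirror the examples used in this article.

```swift
// Hypothetical closed set of intents; raw values match the classifier's labels.
enum Intent: String, CaseIterable {
    case leaveBalance = "leave_balance"
    case workingHours = "working_hours"
    case contactHR = "contact_hr"

    // Illustrative screen identifier for each intent (assumed names).
    var destination: String {
        switch self {
        case .leaveBalance: return "LeaveBalanceScreen"
        case .workingHours: return "WorkingHoursScreen"
        case .contactHR: return "ContactScreen"
        }
    }
}

// Returns nil for labels the app does not recognize, so the caller can
// fall back (for example, to the cloud path) instead of guessing.
func destination(forLabel label: String) -> String? {
    Intent(rawValue: label)?.destination
}
```

Modeling intents as an enum means the compiler flags any intent that has no action, and an unexpected label from a retrained model fails safely instead of triggering the wrong screen.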


Engineering the Dataset

Like every supervised learning problem, model quality depends heavily on training data.

Fortunately, intent classification requires relatively straightforward datasets.

Each row contains two values:

  • User text
  • Intent label

For example:

text,label
hello,greeting
hi there,greeting
good morning,greeting
how many leaves do i have,leave_balance
check my remaining leave,leave_balance
apply leave tomorrow,apply_leave
request leave for friday,apply_leave
show my salary,salary_info
salary slip,salary_info
company policy,policy_info
working hours,working_hours
contact hr,contact_hr
email hr,contact_hr
thank you,goodbye
bye,goodbye

Although this appears simple, dataset quality often determines whether the classifier succeeds or fails.
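
Rows in this format are easy to load for inspection before training. The sketch below assumes a header line and no quoted fields; `Sample` and `parseSamples` are illustrative names, not part of any framework.

```swift
import Foundation

// One labeled training example: the user's text and its intent label.
struct Sample: Equatable {
    let text: String
    let label: String
}

// Parses simple "text,label" rows. Assumes a header line and no quoted
// fields. The label is whatever follows the LAST comma on the line, so
// commas inside the text are tolerated. Malformed rows are skipped.
func parseSamples(csv: String) -> [Sample] {
    var samples: [Sample] = []
    let lines = csv.split(whereSeparator: { $0.isNewline })
    for line in lines.dropFirst() {   // dropFirst skips the header row
        guard let comma = line.lastIndex(of: ",") else { continue }
        let text = line[..<comma].trimmingCharacters(in: .whitespaces)
        let label = line[line.index(after: comma)...]
            .trimmingCharacters(in: .whitespaces)
        guard !text.isEmpty, !label.isEmpty else { continue }
        samples.append(Sample(text: text, label: label))
    }
    return samples
}
```

Loading the data into a typed structure first makes the quality checks discussed in the next section straightforward to automate.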


Principles of Good Dataset Design

1. Capture Natural Language Variation

Users rarely express the same request in identical words.

For example, all of the following sentences should ideally map to the same intent:

leave balance
remaining leave
how many leaves do I have
show available leave
check my leave count

Including multiple phrasings helps the model generalize beyond the exact examples seen during training.


2. Keep Intent Boundaries Clear

Each intent should represent one distinct action.

For example:

leave_balance

should never contain examples such as

apply leave tomorrow

Mixing multiple concepts under the same label introduces ambiguity and reduces prediction accuracy.
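
One boundary problem is cheap to detect automatically: the same sentence appearing under two different labels. The sketch below uses a hypothetical helper, `conflictingLabels`. It only catches exact duplicates (ignoring case), so semantically overlapping intents still need a manual review.

```swift
// Returns every text that appears under more than one label, with the
// conflicting labels sorted for stable output. Text is compared after
// lowercasing, so "Leave Balance" and "leave balance" count as the same.
func conflictingLabels(
    in samples: [(text: String, label: String)]
) -> [String: [String]] {
    var labelsByText: [String: Set<String>] = [:]
    for sample in samples {
        labelsByText[sample.text.lowercased(), default: []].insert(sample.label)
    }
    return labelsByText
        .filter { $0.value.count > 1 }
        .mapValues { $0.sorted() }
}
```

An empty result does not prove the boundaries are clean, but a non-empty one points at rows that are certain to confuse the model.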


3. Balance Every Intent

Suppose one intent contains:

500 examples

while another contains only:

12 examples

The model naturally becomes biased toward the larger class.

Maintaining approximately equal representation across intents generally produces more consistent predictions.
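
Class balance can be measured before training. The sketch below counts examples per label and reports the ratio between the largest and smallest class; the function names are illustrative, and what counts as an acceptable ratio is a judgment call for each project.

```swift
// Counts how many examples each label has.
func labelCounts(_ labels: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for label in labels {
        counts[label, default: 0] += 1
    }
    return counts
}

// Ratio between the largest and smallest class; 1.0 means perfectly
// balanced. Returns nil when there are no classes to compare.
func imbalanceRatio(_ counts: [String: Int]) -> Double? {
    guard let largest = counts.values.max(),
          let smallest = counts.values.min(),
          smallest > 0 else { return nil }
    return Double(largest) / Double(smallest)
}
```

For the 500 versus 12 example above, the ratio is roughly 42, a strong signal that the smaller intent needs more examples before training.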


4. Think Like Your Users

One of the most valuable exercises during dataset creation is imagining how real users naturally phrase requests.

Engineers often write technically correct examples.

Users rarely do.

A robust dataset includes:

  • informal language
  • incomplete sentences
  • abbreviations
  • spelling mistakes
  • conversational phrasing

The closer the training data resembles production traffic, the better the classifier performs.



Model Training: Transforming Language into Intent

With a well-structured dataset in place, the next step is converting those examples into a model capable of recognizing user intent from previously unseen text.

Unlike Large Language Models, which are pretrained on vast amounts of unlabeled text, intent classifiers are trained with supervised learning on a small labeled dataset. During training, each sentence is paired with a predefined label, allowing the algorithm to learn statistical relationships between words, phrases, and the corresponding intent.

Conceptually, the training pipeline can be represented as:

              Training Dataset
                     │
                     ▼
          Text Preprocessing Pipeline
                     │
                     ▼
          Feature Extraction / Tokenization
                     │
                     ▼
          Intent Classification Model
                     │
                     ▼
              Evaluation & Validation
                     │
                     ▼
             Core ML Model (.mlmodel)
                     │
                     ▼
            Bundled with Mobile App

Although the underlying mathematics may differ depending on the chosen algorithm, the overall workflow remains remarkably consistent.

The model repeatedly analyzes labeled examples, gradually adjusting its internal parameters until it can reliably associate previously unseen sentences with the correct intent.

Once training is complete, the learned parameters are exported as a compact Core ML model that executes entirely on the device.
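
The evaluation stage in the pipeline only makes sense if some labeled examples are held back from training. The following is a minimal sketch of a stratified split, assuming an illustrative `Example` tuple type and a deterministic "every k-th example per label" rule. Training tools may offer their own splitting options; this only shows the idea.

```swift
typealias Example = (text: String, label: String)

// Splits examples into training and validation sets, label by label, so
// every intent with at least `k` examples appears in both. Every k-th
// example of a label goes to validation (k = 5 gives roughly 80/20).
// Deterministic on purpose: shuffle the input first if its order matters.
func stratifiedSplit(
    _ samples: [Example],
    everyNth k: Int = 5
) -> (training: [Example], validation: [Example]) {
    precondition(k > 1, "k must be at least 2")
    var seenPerLabel: [String: Int] = [:]
    var training: [Example] = []
    var validation: [Example] = []
    for sample in samples {
        let seen = seenPerLabel[sample.label, default: 0]
        seenPerLabel[sample.label] = seen + 1
        if seen % k == k - 1 {
            validation.append(sample)
        } else {
            training.append(sample)
        }
    }
    return (training: training, validation: validation)
}
```

Splitting per label rather than over the whole dataset keeps small intents from vanishing from the validation set by chance.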


Selecting the Right Model

One common misconception is that every Natural Language Processing problem requires a transformer or Large Language Model.

For intent classification, this is rarely true.

The objective is not to generate language.

It is simply to determine which predefined category best matches an input.

Several lightweight algorithms perform exceptionally well for this task, including:

  • Maximum Entropy (Logistic Regression)
  • Naïve Bayes
  • Support Vector Machines
  • fastText
  • Lightweight recurrent neural networks, such as small LSTM architectures

Apple's Create ML abstracts much of this complexity, allowing developers to train high-quality text classifiers without implementing these algorithms manually.

The choice of algorithm is generally less important than the quality of the training dataset.

In many practical systems, careful dataset engineering yields larger accuracy improvements than switching between classification algorithms.
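
To show how small such a model can be, here is a sketch of one algorithm from the list: multinomial Naïve Bayes with add-one smoothing. It is written from scratch for illustration only. It is not how Create ML implements its classifiers, and `NaiveBayesClassifier` is a hypothetical type.

```swift
import Foundation

// Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
struct NaiveBayesClassifier {
    var wordCounts: [String: [String: Int]] = [:]  // label -> word -> count
    var totalWords: [String: Int] = [:]            // label -> total word count
    var docCounts: [String: Int] = [:]             // label -> number of examples
    var vocabulary: Set<String> = []

    init() {}

    // Lowercases and splits on anything that is not a letter or digit.
    static func tokenize(_ text: String) -> [String] {
        text.lowercased()
            .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
            .map(String.init)
    }

    mutating func train(text: String, label: String) {
        docCounts[label, default: 0] += 1
        for word in Self.tokenize(text) {
            wordCounts[label, default: [:]][word, default: 0] += 1
            totalWords[label, default: 0] += 1
            vocabulary.insert(word)
        }
    }

    // Returns the most likely label and its normalized posterior
    // probability, or nil if the model has not been trained.
    func predict(_ text: String) -> (label: String, confidence: Double)? {
        let totalDocs = docCounts.values.reduce(0, +)
        guard totalDocs > 0, !vocabulary.isEmpty else { return nil }
        let words = Self.tokenize(text)
        var logScores: [String: Double] = [:]
        for (label, docs) in docCounts {
            var score = log(Double(docs) / Double(totalDocs))   // log prior
            let denominator = Double(totalWords[label, default: 0] + vocabulary.count)
            for word in words {
                let count = wordCounts[label]?[word] ?? 0
                score += log(Double(count + 1) / denominator)   // smoothed likelihood
            }
            logScores[label] = score
        }
        // Normalize log scores into probabilities that sum to 1.
        guard let maxScore = logScores.values.max() else { return nil }
        let expScores = logScores.mapValues { exp($0 - maxScore) }
        let total = expScores.values.reduce(0, +)
        guard let best = expScores.max(by: { $0.value < $1.value }) else { return nil }
        return (label: best.key, confidence: best.value / total)
    }
}
```

The entire model is a handful of word counts, which is why classifiers in this family stay small and fast. A production system would normally rely on a tested library or on Create ML rather than hand-written code like this.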


Feature Engineering

Before text can be processed by a machine learning model, it must be transformed into numerical representations.

This step is usually called feature extraction or vectorization, and it is one part of the broader practice of feature engineering.

Although modern frameworks automate much of this work, understanding the pipeline helps explain why dataset quality is so important.

A simplified transformation pipeline looks like this:

Original Sentence

"How many leaves do I have?"

        │

        ▼

Tokenization

["how","many","leaves","do","i","have"]

        │

        ▼

Normalization (stop-word removal, lemmatization)

["how","many","leave","have"]

        │

        ▼

Numerical Representation

[0.14, 0.82, 0.53, ... ]

        │

        ▼

Intent Prediction

The model never understands English in the human sense.

Instead, it learns statistical relationships between numerical representations and known intent labels.

This distinction explains why diverse training examples matter.

The model is learning patterns—not memorizing complete sentences.
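
One of the simplest numerical representations is a bag of words: a vector that counts how often each vocabulary word occurs. The sketch below illustrates that idea with hypothetical function names. It skips stop-word removal and lemmatization, and real frameworks may use richer features such as n-grams or embeddings.

```swift
// Lowercases the text and splits it into word tokens.
func tokenize(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
}

// Builds a sorted vocabulary from the training sentences.
func buildVocabulary(from sentences: [String]) -> [String] {
    Array(Set(sentences.flatMap(tokenize))).sorted()
}

// Encodes a sentence as word counts over the vocabulary.
// Words outside the vocabulary are ignored.
func bagOfWords(_ text: String, vocabulary: [String]) -> [Double] {
    var index: [String: Int] = [:]
    for (i, word) in vocabulary.enumerated() { index[word] = i }
    var vector = [Double](repeating: 0, count: vocabulary.count)
    for token in tokenize(text) {
        if let i = index[token] { vector[i] += 1 }
    }
    return vector
}
```

Because unknown words are simply dropped, a phrasing that shares no vocabulary with the training data produces an all-zero vector. This is a concrete reason why varied training examples matter.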


Evaluating Model Quality

Training accuracy alone is not sufficient.

A model that memorizes its training examples may perform poorly when presented with real user input.

A typical evaluation process includes:

  • Training accuracy
  • Validation accuracy
  • Precision
  • Recall
  • F1 Score
  • Confusion Matrix

One particularly useful visualization is the confusion matrix.

Instead of simply reporting an overall accuracy value, the confusion matrix reveals where the model makes mistakes.

For example:

                 Predicted

             Leave   Salary   Policy

Actual Leave    95       2        3

Actual Salary    1      98        1

Actual Policy    4       2       94

This information often exposes overlapping intent definitions, enabling developers to improve the dataset rather than endlessly tuning the model.

In practice, improving the dataset usually produces larger gains than modifying the learning algorithm.
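
Precision, recall, and F1 can be derived directly from a confusion matrix. The sketch below uses the example matrix above and assumes rows are actual classes and columns are predicted classes; `ClassMetrics`, `metrics(for:in:)`, and `accuracy(of:)` are illustrative names.

```swift
// Per-class precision, recall, and F1 from a square confusion matrix
// where matrix[actual][predicted] is a count.
struct ClassMetrics {
    let precision: Double
    let recall: Double
    let f1: Double
}

func metrics(for classIndex: Int, in matrix: [[Int]]) -> ClassMetrics {
    let truePositives = Double(matrix[classIndex][classIndex])
    // Column sum: everything the model predicted as this class.
    let predictedTotal = Double(matrix.reduce(0) { $0 + $1[classIndex] })
    // Row sum: everything that actually belongs to this class.
    let actualTotal = Double(matrix[classIndex].reduce(0, +))
    let precision = predictedTotal > 0 ? truePositives / predictedTotal : 0
    let recall = actualTotal > 0 ? truePositives / actualTotal : 0
    let f1 = (precision + recall) > 0
        ? 2 * precision * recall / (precision + recall)
        : 0
    return ClassMetrics(precision: precision, recall: recall, f1: f1)
}

// Overall accuracy: diagonal total divided by the grand total.
func accuracy(of matrix: [[Int]]) -> Double {
    let correct = matrix.indices.reduce(0) { $0 + matrix[$1][$1] }
    let total = matrix.reduce(0) { $0 + $1.reduce(0, +) }
    return total > 0 ? Double(correct) / Double(total) : 0
}

// The example matrix: rows are actual, columns are predicted.
let confusion = [
    [95, 2, 3],   // Leave
    [1, 98, 1],   // Salary
    [4, 2, 94]    // Policy
]
let salary = metrics(for: 1, in: confusion)
// salary.precision is 98/102 and salary.recall is 98/100.
```

Per-class numbers matter here because a high overall accuracy can hide one intent that is frequently confused with another.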


Exporting the Model

After validation, the trained classifier is exported as a Core ML model.

HRIntentClassifier.mlmodel

During the build process, Xcode automatically compiles the model into an optimized runtime representation.

HRIntentClassifier.mlmodel
          │
          ▼
HRIntentClassifier.mlmodelc

The compiled asset becomes part of the application bundle and requires no additional downloads or runtime dependencies.

Unlike cloud-hosted models, inference occurs entirely within the application's process.

No API requests are necessary.

No authentication tokens are required.

No network connection is needed.


Integrating Core ML

Once the model has been bundled with the application, the implementation becomes surprisingly straightforward.

The classifier behaves like any other local resource.

A dedicated routing service encapsulates the interaction with Core ML, keeping the user interface independent from the machine learning implementation.

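import Foundation
import CoreML

/// Result of a local classification: the predicted intent label and the
/// model's confidence in that label (0...1).
public struct PredictionResult {
    public let intent: String
    public let confidence: Double
}

public enum RouterError: Error {
    case modelNotFound
}

/// Wraps the bundled Core ML classifier so the UI never touches Core ML directly.
public final class LocalIntentRouter {

    private let model: MLModel

    public init(configuration: MLModelConfiguration = .init()) throws {

        // Xcode compiles HRIntentClassifier.mlmodel into a .mlmodelc resource.
        guard let modelURL = Bundle.main.url(
            forResource: "HRIntentClassifier",
            withExtension: "mlmodelc"
        ) else {
            throw RouterError.modelNotFound
        }

        model = try MLModel(
            contentsOf: modelURL,
            configuration: configuration
        )
    }

    /// Returns nil for empty input, a failed prediction, or missing outputs.
    public func predictIntent(from text: String) -> PredictionResult? {

        let cleaned = text
            .trimmingCharacters(in: .whitespacesAndNewlines)

        guard !cleaned.isEmpty else {
            return nil
        }

        do {

            // The feature names "text", "label", and "labelProbability" must
            // match the model's interface as shown in Xcode's model viewer.
            // If the model exposes only a label, the probability lookup below
            // fails and this method returns nil; in that case confidence has
            // to come from elsewhere (for example, NLModel's
            // predictedLabelHypotheses in the NaturalLanguage framework).
            let provider = try MLDictionaryFeatureProvider(
                dictionary: [
                    "text": MLFeatureValue(string: cleaned)
                ]
            )

            let prediction = try model.prediction(from: provider)

            guard
                let label =
                    prediction.featureValue(for: "label")?.stringValue,
                let probabilities =
                    prediction.featureValue(for: "labelProbability")?
                    .dictionaryValue as? [String: Double]
            else {
                return nil
            }

            return PredictionResult(
                intent: label,
                confidence: probabilities[label] ?? 0
            )

        } catch {

            print("Intent prediction failed: \(error.localizedDescription)")
            return nil
        }
    }
}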

Notice that the service returns not only the predicted intent but also its associated confidence score.

This confidence value plays an important role in production systems.


Confidence-Based Routing

Machine learning predictions should never be treated as absolute truth.

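Instead, every prediction carries a confidence score: the model's estimated probability for the label it chose. These scores are not always well calibrated, so the routing threshold should be tuned on validation data rather than assumed.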

A practical routing strategy looks like this:

Prediction:

leave_balance

Confidence:

0.97

Since confidence is very high, the application immediately opens the Leave Balance screen.

Now consider another example.

Prediction:

policy_info

Confidence:

0.41

A confidence of 41% suggests uncertainty.

Rather than risking an incorrect navigation, the application forwards the request to a cloud-based LLM for further interpretation.

This hybrid decision process provides the best of both worlds.

                 User Query
                      │
                      ▼
             Intent Classifier
                      │
          Confidence Score Generated
                      │
      ┌───────────────┴────────────────┐
      │                                │
 Confidence ≥ Threshold         Confidence < Threshold
      │                                │
      ▼                                ▼
 Execute Local Action          Forward to Cloud AI

Rather than replacing the LLM, the classifier becomes an intelligent gatekeeper that filters predictable requests before they ever leave the device.
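
The threshold decision itself fits in a few lines. The following is a minimal sketch with a hypothetical `Route` type. The 0.85 default is an illustrative assumption, not a recommended value, and should be tuned against validation data for the model in use.

```swift
// Where a request should be handled.
enum Route: Equatable {
    case local(intent: String)
    case cloud
}

// Routes to the local workflow only when the classifier is confident
// enough. The default threshold of 0.85 is an assumed, illustrative value.
func route(intent: String, confidence: Double, threshold: Double = 0.85) -> Route {
    confidence >= threshold ? .local(intent: intent) : .cloud
}
```

With this default, the 0.97 prediction above stays on the device and the 0.41 prediction is escalated. Raising the threshold trades more cloud calls for fewer wrong navigations.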


Runtime Execution

From the user's perspective, the entire interaction is almost instantaneous.

User types message

        │

        ▼

Text cleaned

        │

        ▼

Core ML Prediction

        │

        ▼

Confidence Evaluation

        │

        ▼

Execute Local Workflow

The total execution time is typically measured in only a few milliseconds.

Unlike cloud inference, there is no network handshake, serialization overhead, authentication request, or server scheduling delay.

The interaction feels immediate because it occurs entirely inside the application.

This architectural pattern becomes especially valuable in environments with poor connectivity, intermittent network access, or strict privacy requirements.

More importantly, it demonstrates that not every AI interaction requires cloud-scale infrastructure.

Sometimes, the most effective solution is also the simplest: a small, focused model executing directly where the user already is.
