DEV Community: Dheeraj Dhiman

Designing Hybrid Edge AI Systems for Low-Latency Intent Classification in Mobile Applications

Dheeraj Dhiman — Sat, 04 Jul 2026 16:12:49 +0000

A Hybrid Edge–Cloud Architecture for Low-Latency Intent Classification in Mobile Applications

Abstract

Large Language Models (LLMs) have fundamentally changed how applications process natural language. They excel at reasoning, summarization, question answering, and generating human-like responses. As a result, many modern applications route every user message directly to a cloud-hosted LLM.

While this approach is effective for complex conversations, it is often unnecessary for deterministic interactions. Commands such as "Show my leave balance", "Open settings", or "Contact HR" do not require generative reasoning. They require identifying a known intent and triggering a predefined workflow.

Sending these requests to the cloud introduces avoidable latency, increases operational costs, depends on network availability, and transmits user data that could otherwise remain on the device.

This article presents a hybrid architecture that performs intent classification entirely on the client using a lightweight machine learning model. By classifying predictable requests locally and forwarding only ambiguous or complex queries to a cloud-based LLM, applications can provide a significantly faster, more private, and more resilient user experience.

Although the implementation examples reference Core ML on iOS, the architectural principles discussed here apply equally to Android, desktop, and embedded systems.

Introduction

Over the past few years, conversational interfaces have evolved from simple rule-based chatbots into sophisticated AI assistants capable of understanding natural language.

As engineers, it is tempting to assume that every user message deserves the full reasoning power of a Large Language Model. In practice, however, most application interactions are remarkably predictable.

Consider the following examples:

Show my leave balance
Apply for leave tomorrow
Open profile
Change password
View salary slip
Track my order
Show today's appointments
Contact support

These requests are not open-ended questions.

They are commands.

Their purpose is not to generate new knowledge but to identify the user's intent and execute an existing application workflow.

Yet many applications still send these requests to remote AI services.

Although this simplifies the initial implementation, it adds runtime dependencies to even the most trivial interactions.

Each interaction now depends on:

Internet connectivity
API availability
Server scalability
Token consumption
Network latency

The user experiences several hundred milliseconds—or even multiple seconds—of delay simply to navigate to a screen that already exists inside the application.

This raises an important architectural question:

Should every natural language request be processed by a Large Language Model?

For many applications, the answer is no.

The Problem Statement

Modern AI systems are incredibly capable, but capability alone should not dictate architecture.

One of the fundamental responsibilities of software architecture is selecting the appropriate technology for each problem.

A calculator does not require a database.

A login screen does not require distributed computing.

Likewise, deterministic user commands often do not require generative AI.

Consider an enterprise application with the following features:

Leave management
HR policies
Employee directory
Expense submission
Attendance tracking
Payroll information
Internal documentation

A conversational interface might receive thousands of requests every day, but a significant percentage of those requests fall into a relatively small number of predictable categories.

Examples include:

User Request	Intended Action
"How many leaves do I have?"	Open Leave Balance
"Apply leave tomorrow"	Open Leave Application
"Show my salary slip"	Navigate to Payroll
"Office timings"	Display Working Hours
"Email HR"	Open Contact Screen

Each request maps directly to an existing application feature.

No reasoning is required.

No content generation is required.

No external knowledge retrieval is required.

The challenge is simply determining which predefined action should be executed.

This is fundamentally a classification problem, not a reasoning problem.

Recognizing this distinction opens the door to a much simpler architecture.

A Different Architectural Perspective

Instead of treating every request as an AI problem, we can divide user interactions into two categories.

Category 1 — Deterministic Requests

These requests have known outcomes.

Examples include:

Open Settings
View Profile
Check Leave Balance
Company Policies
Working Hours
Contact HR

The expected action is already implemented inside the application.

The only missing piece is determining which action the user intended.

A lightweight text classifier can solve this in just a few milliseconds.

Category 2 — Generative Requests

These require reasoning beyond predefined workflows.

Examples include:

Compare my leave history over the last three years and suggest the best vacation period.

Summarize the company's parental leave policy.

Explain why my reimbursement request was rejected.

These requests benefit from the contextual understanding and reasoning capabilities of an LLM.

Rather than replacing the cloud entirely, the objective is to ensure that only requests requiring advanced reasoning are forwarded to it.

A Hybrid Edge–Cloud Architecture

This observation naturally leads to a hybrid architecture.

Instead of placing the LLM at the front of every interaction, the application first evaluates whether the request belongs to a known intent.

                    User Input
                         │
                         ▼
           On-Device Intent Classifier
                         │
          ┌──────────────┴──────────────┐
          │                             │
   High Confidence               Low Confidence
          │                             │
          ▼                             ▼
 Execute Local Action          Forward to Cloud LLM

This design introduces an intelligent routing layer between the user interface and the network.

The classifier becomes responsible for determining whether the application already knows how to satisfy the request.

If it does, the workflow executes immediately without leaving the device.

If not, the request is escalated to a cloud-based language model.

This architecture combines the strengths of both approaches:

Instant responses for predictable interactions
Rich reasoning for complex conversations

Rather than viewing edge AI and cloud AI as competing technologies, they become complementary components within the same system.

Edge AI Versus Cloud AI

Choosing between local inference and cloud inference is not about determining which technology is "better."

Each solves a different class of problems.

Architectural Characteristic	Cloud LLM	On-Device Intent Classifier
Network Connectivity	Required	Not Required
Typical Response Time	Hundreds of milliseconds to several seconds	Typically a few milliseconds
Operational Cost	Per-request API cost	No per-request cost after deployment
Privacy	Data transmitted externally	Data remains on device
Offline Capability	No	Yes
Reasoning Ability	Excellent	Limited
Deterministic Commands	Overkill	Ideal

The objective is not to eliminate cloud AI.

Instead, it is to reserve expensive reasoning engines for situations that genuinely require them.

A useful mental model is:

Use edge AI for routing. Use cloud AI for reasoning.

This simple design principle can significantly improve responsiveness while reducing unnecessary infrastructure costs.

Why Intent Classification Works

Intent classification is one of the oldest and most successful applications of Natural Language Processing.

Unlike generative models, which attempt to produce new text, a classifier performs a much simpler task:

Determine which predefined category best matches the input.

For example:

"Check my leave balance"

might produce

leave_balance

while

"What are today's office timings?"

might produce

working_hours

The output is not a paragraph.

It is simply a label.

Because the problem is constrained, the resulting model is dramatically smaller than a Large Language Model.

In many production systems, an intent classifier is small, often tens to a few hundred kilobytes depending on the vocabulary and algorithm, and performs inference in just a few milliseconds.

This makes it an excellent candidate for on-device deployment.

Engineering the Dataset

Like every supervised learning problem, model quality depends heavily on training data.

Fortunately, intent classification requires relatively straightforward datasets.

Each row contains two values:

User text
Intent label

For example:

text,label
hello,greeting
hi there,greeting
good morning,greeting
how many leaves do i have,leave_balance
check my remaining leave,leave_balance
apply leave tomorrow,apply_leave
request leave for friday,apply_leave
show my salary,salary_info
salary slip,salary_info
company policy,policy_info
working hours,working_hours
contact hr,contact_hr
email hr,contact_hr
thank you,goodbye
bye,goodbye

Although this appears simple, dataset quality often determines whether the classifier succeeds or fails.
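Before training, it can help to load and sanity-check the file programmatically. The following is a minimal sketch, assuming the simple two-column format shown above with no quoted fields; the `LabeledExample` type and `parseDataset` function are illustrative names, not part of any framework.

```swift
import Foundation

/// One labeled training example: the user's text and its intent label.
struct LabeledExample: Equatable {
    let text: String
    let label: String
}

/// Parses simple two-column "text,label" CSV content.
/// Assumption: fields are not quoted. The split happens on the LAST comma,
/// so commas inside the text column are preserved.
func parseDataset(_ csv: String) -> [LabeledExample] {
    var examples: [LabeledExample] = []
    // split(whereSeparator:) drops empty lines by default.
    for (index, rawLine) in csv.split(whereSeparator: \.isNewline).enumerated() {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        // Skip the header row if it is present.
        if index == 0 && line.lowercased() == "text,label" { continue }
        guard let comma = line.lastIndex(of: ",") else { continue }
        let text = line[..<comma].trimmingCharacters(in: .whitespaces).lowercased()
        let label = line[line.index(after: comma)...].trimmingCharacters(in: .whitespaces)
        // Ignore rows where either column is missing.
        guard !text.isEmpty, !label.isEmpty else { continue }
        examples.append(LabeledExample(text: text, label: label))
    }
    return examples
}
```

Rows that cannot be split into two non-empty columns are skipped silently here; a real pipeline would probably log them instead.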

Principles of Good Dataset Design

1. Capture Natural Language Variation

Users rarely express the same request in identical words.

For example, all of the following sentences should ideally map to the same intent:

leave balance
remaining leave
how many leaves do I have
show available leave
check my leave count

Including multiple phrasings helps the model generalize beyond the exact examples seen during training.

2. Keep Intent Boundaries Clear

Each intent should represent one distinct action.

For example:

leave_balance

should never contain examples such as

apply leave tomorrow

Mixing multiple concepts under the same label introduces ambiguity and reduces prediction accuracy.

3. Balance Every Intent

Suppose one intent contains:

500 examples

while another contains only:

12 examples

The model naturally becomes biased toward the larger class.

Maintaining approximately equal representation across intents generally produces more consistent predictions.
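A quick balance check can be run over the labels before training. This is a small sketch; the function names and the 0.5 ratio are assumptions chosen for illustration, not a recommended standard.

```swift
/// Counts how many examples each intent label has.
func labelCounts(_ labels: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for label in labels { counts[label, default: 0] += 1 }
    return counts
}

/// Returns labels whose count is below `ratio` times the largest class,
/// sorted alphabetically. The default ratio is an illustrative assumption.
func underrepresentedLabels(in counts: [String: Int], ratio: Double = 0.5) -> [String] {
    guard let largest = counts.values.max(), largest > 0 else { return [] }
    return counts
        .filter { Double($0.value) < ratio * Double(largest) }
        .map { $0.key }
        .sorted()
}
```

With 500 examples for one intent and 12 for another, the smaller intent is flagged, which is a signal to collect more phrasings for it rather than to tune the model.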

4. Think Like Your Users

One of the most valuable exercises during dataset creation is imagining how real users naturally phrase requests.

Engineers often write technically correct examples.

Users rarely do.

A robust dataset includes:

informal language
incomplete sentences
abbreviations
spelling mistakes
conversational phrasing

The closer the training data resembles production traffic, the better the classifier performs.

Model Training: Transforming Language into Intent

With a well-structured dataset in place, the next step is converting those examples into a model capable of recognizing user intent from previously unseen text.

Unlike generative Large Language Models, which are pretrained to predict the next token across enormous text corpora, intent classifiers are trained on a small set of explicitly labeled examples. During training, each sentence is associated with a predefined label, allowing the algorithm to learn statistical relationships between words, phrases, and the corresponding intent.

Conceptually, the training pipeline can be represented as:

              Training Dataset
                     │
                     ▼
          Text Preprocessing Pipeline
                     │
                     ▼
          Feature Extraction / Tokenization
                     │
                     ▼
          Intent Classification Model
                     │
                     ▼
              Evaluation & Validation
                     │
                     ▼
             Core ML Model (.mlmodel)
                     │
                     ▼
            Bundled with Mobile App

Although the underlying mathematics may differ depending on the chosen algorithm, the overall workflow remains remarkably consistent.

The model repeatedly analyzes labeled examples, gradually adjusting its internal parameters until it can reliably associate previously unseen sentences with the correct intent.

Once training is complete, the learned parameters are exported as a compact Core ML model that executes entirely on the device.

Selecting the Right Model

One common misconception is that every Natural Language Processing problem requires a transformer or Large Language Model.

For intent classification, this is rarely true.

The objective is not to generate language.

It is simply to determine which predefined category best matches an input.

Several lightweight algorithms perform exceptionally well for this task, including:

Maximum Entropy (Logistic Regression)
Naïve Bayes
Support Vector Machines
FastText
Lightweight Recurrent Neural Networks
Small LSTM architectures

Apple's Create ML abstracts much of this complexity, allowing developers to train high-quality text classifiers without implementing these algorithms manually.

The choice of algorithm is generally less important than the quality of the training dataset.

In many practical systems, careful dataset engineering yields larger accuracy improvements than switching between classification algorithms.

Feature Engineering

Before text can be processed by a machine learning model, it must be transformed into numerical representations.

This process is known as feature engineering.

Although modern frameworks automate much of this work, understanding the pipeline helps explain why dataset quality is so important.

A simplified transformation pipeline looks like this:

Original Sentence

"How many leaves do I have?"

        │

        ▼

Tokenization

["how","many","leaves","do","i","have"]

        │

        ▼

Normalization (stop-word removal and lemmatization)

["how","many","leave","have"]

        │

        ▼

Numerical Representation

[0.14, 0.82, 0.53, ... ]

        │

        ▼

Intent Prediction

The model never understands English in the human sense.

Instead, it learns statistical relationships between numerical representations and known intent labels.

This distinction explains why diverse training examples matter.

The model is learning patterns—not memorizing complete sentences.
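To make the pipeline concrete, here is a minimal bag-of-words sketch. It is only one simple numerical representation and is not necessarily what Create ML uses internally; the stop-word list and function names are assumptions, and lemmatization is omitted for brevity.

```swift
/// Lowercases the input and splits it into tokens on any character
/// that is not a letter or a digit.
func tokenize(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map(String.init)
}

/// A tiny, hypothetical stop-word list used only for this example.
let stopWords: Set<String> = ["do", "i", "the", "a", "my"]

/// Builds a vocabulary mapping each distinct token to a column index,
/// in order of first appearance. Stop words are excluded.
func buildVocabulary(from sentences: [String]) -> [String: Int] {
    var vocabulary: [String: Int] = [:]
    for sentence in sentences {
        for token in tokenize(sentence) where !stopWords.contains(token) {
            if vocabulary[token] == nil {
                let nextIndex = vocabulary.count
                vocabulary[token] = nextIndex
            }
        }
    }
    return vocabulary
}

/// Converts a sentence into a count vector over the vocabulary.
/// Tokens that are not in the vocabulary are ignored.
func vectorize(_ text: String, vocabulary: [String: Int]) -> [Double] {
    var vector = [Double](repeating: 0, count: vocabulary.count)
    for token in tokenize(text) {
        if let column = vocabulary[token] { vector[column] += 1 }
    }
    return vector
}
```

Because unseen words contribute nothing to the vector, a phrasing made up entirely of words absent from training produces an all-zero input, which is one reason varied training examples matter.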

Evaluating Model Quality

Training accuracy alone is not sufficient.

A model that memorizes its training examples may perform poorly when presented with real user input.

A typical evaluation process includes:

Training accuracy
Validation accuracy
Precision
Recall
F1 Score
Confusion Matrix

One particularly useful visualization is the confusion matrix.

Instead of simply reporting an overall accuracy value, the confusion matrix reveals where the model makes mistakes.

For example:

                 Predicted
                 Leave   Salary   Policy
Actual Leave        95        2        3
Actual Salary        1       98        1
Actual Policy        4        2       94

This information often exposes overlapping intent definitions, enabling developers to improve the dataset rather than endlessly tuning the model.

In practice, improving the dataset usually produces larger gains than modifying the learning algorithm.
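Precision, recall, and F1 can all be derived from a confusion matrix. The sketch below uses the example matrix from this section; the type and function names are illustrative, and rows are assumed to be actual classes while columns are predicted classes.

```swift
/// Per-class metrics derived from a confusion matrix.
struct ClassMetrics {
    let precision: Double
    let recall: Double
    let f1: Double
}

/// Computes precision, recall, and F1 for the class at `index`.
/// `matrix[actual][predicted]` holds counts; the matrix must be square.
func metrics(forClass index: Int, in matrix: [[Int]]) -> ClassMetrics {
    let truePositives = Double(matrix[index][index])
    // Column sum: everything the model predicted as this class.
    let predictedTotal = Double(matrix.reduce(0) { $0 + $1[index] })
    // Row sum: everything that actually belongs to this class.
    let actualTotal = Double(matrix[index].reduce(0, +))
    let precision = predictedTotal > 0 ? truePositives / predictedTotal : 0
    let recall = actualTotal > 0 ? truePositives / actualTotal : 0
    let f1 = (precision + recall) > 0
        ? 2 * precision * recall / (precision + recall)
        : 0
    return ClassMetrics(precision: precision, recall: recall, f1: f1)
}

/// Overall accuracy: the diagonal (correct predictions) over all predictions.
func accuracy(of matrix: [[Int]]) -> Double {
    let correct = matrix.indices.reduce(0) { $0 + matrix[$1][$1] }
    let total = matrix.reduce(0) { $0 + $1.reduce(0, +) }
    return total > 0 ? Double(correct) / Double(total) : 0
}

// Class order: Leave, Salary, Policy.
let confusion = [
    [95, 2, 3],
    [1, 98, 1],
    [4, 2, 94],
]
```

For this matrix, Leave has precision and recall of 0.95, Salary has precision of about 0.961 (98 of 102 predictions) and recall of 0.98, Policy has precision of about 0.959 and recall of 0.94, and overall accuracy is about 0.957 (287 of 300).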

Exporting the Model

After validation, the trained classifier is exported as a Core ML model.

HRIntentClassifier.mlmodel

During the build process, Xcode automatically compiles the model into an optimized runtime representation.

HRIntentClassifier.mlmodel
          │
          ▼
HRIntentClassifier.mlmodelc

The compiled asset becomes part of the application bundle and requires no additional downloads or runtime dependencies.

Unlike cloud-hosted models, inference occurs entirely within the application's process.

No API requests are necessary.

No authentication tokens are required.

No network connection is needed.

Integrating Core ML

Once the model has been bundled with the application, the implementation becomes surprisingly straightforward.

The classifier behaves like any other local resource.

A dedicated routing service encapsulates the interaction with Core ML, keeping the user interface independent from the machine learning implementation.

import Foundation
import CoreML

public final class LocalIntentRouter {

    private let model: MLModel

    /// Loads the compiled model that Xcode bundles with the application.
    public init(configuration: MLModelConfiguration = .init()) throws {
        guard let modelURL = Bundle.main.url(
            forResource: "HRIntentClassifier",
            withExtension: "mlmodelc"
        ) else {
            throw RouterError.modelNotFound
        }

        model = try MLModel(
            contentsOf: modelURL,
            configuration: configuration
        )
    }

    /// Returns the predicted intent and its confidence, or nil when the
    /// input is empty or the prediction cannot be produced.
    public func predictIntent(from text: String) -> PredictionResult? {
        let cleaned = text.trimmingCharacters(in: .whitespacesAndNewlines)
        guard !cleaned.isEmpty else { return nil }

        do {
            // The feature names "text", "label", and "labelProbability" must
            // match the names in model.modelDescription. Some text classifiers
            // may expose only a label output; in that case the probabilities
            // have to come from elsewhere (for example NaturalLanguage's NLModel).
            let provider = try MLDictionaryFeatureProvider(
                dictionary: ["text": MLFeatureValue(string: cleaned)]
            )
            let prediction = try model.prediction(from: provider)

            guard
                let label =
                    prediction.featureValue(for: "label")?.stringValue,
                let probabilities =
                    prediction.featureValue(for: "labelProbability")?
                    .dictionaryValue as? [String: Double]
            else {
                return nil
            }

            return PredictionResult(
                intent: label,
                confidence: probabilities[label] ?? 0
            )
        } catch {
            print(error.localizedDescription)
            return nil
        }
    }
}

// Declared public because it appears in the signature of a public method.
public struct PredictionResult {
    public let intent: String
    public let confidence: Double
}

public enum RouterError: Error {
    case modelNotFound
}

Notice that the service returns not only the predicted intent but also its associated confidence score.

This confidence value plays an important role in production systems.

Confidence-Based Routing

Machine learning predictions should never be treated as absolute truth.

Instead, every prediction carries a confidence score representing how certain the model is about its decision.

A practical routing strategy looks like this:

Prediction:

leave_balance

Confidence:

0.97

Since confidence is very high, the application immediately opens the Leave Balance screen.

Now consider another example.

Prediction:

policy_info

Confidence:

0.41

A confidence of 41% suggests uncertainty.

Rather than risking an incorrect navigation, the application forwards the request to a cloud-based LLM for further interpretation.

This hybrid decision process provides the best of both worlds.

                 User Query
                      │
                      ▼
             Intent Classifier
                      │
          Confidence Score Generated
                      │
      ┌───────────────┴────────────────┐
      │                                │
 Confidence ≥ Threshold         Confidence < Threshold
      │                                │
      ▼                                ▼
 Execute Local Action          Forward to Cloud AI

Rather than replacing the LLM, the classifier becomes an intelligent gatekeeper that filters predictable requests before they ever leave the device.
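The routing decision itself can be expressed in a few lines. This is a sketch rather than a definitive implementation: the `Route` type is a hypothetical name, and the 0.85 default threshold is an assumption that should be tuned against held-out data.

```swift
/// Where a classified request should be handled.
enum Route: Equatable {
    case local(intent: String)
    case cloud
}

/// Routes a request based on classifier confidence.
/// A missing intent or a confidence below the threshold escalates to the cloud.
/// The default threshold is an illustrative assumption.
func route(intent: String?, confidence: Double, threshold: Double = 0.85) -> Route {
    guard let intent = intent, confidence >= threshold else { return .cloud }
    return .local(intent: intent)
}
```

With this threshold, the two examples above behave as described: `leave_balance` at 0.97 executes locally, while `policy_info` at 0.41 is forwarded to the cloud. Raising the threshold trades fewer wrong navigations for more cloud calls.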

Runtime Execution

From the user's perspective, the entire interaction is almost instantaneous.

User types message

        │

        ▼

Text cleaned

        │

        ▼

Core ML Prediction

        │

        ▼

Confidence Evaluation

        │

        ▼

Execute Local Workflow

The total execution time is typically measured in only a few milliseconds.

Unlike cloud inference, there are no network handshakes, serialization overhead, authentication requests, or server scheduling delays.

The interaction feels immediate because it occurs entirely inside the application.

This architectural pattern becomes especially valuable in environments with poor connectivity, intermittent network access, or strict privacy requirements.

More importantly, it demonstrates that not every AI interaction requires cloud-scale infrastructure.

Sometimes, the most effective solution is also the simplest: a small, focused model executing directly where the user already is.

State Machines on the Edge: Designing Resilient Voice-to-Note AI Audio Pipelines

Dheeraj Dhiman — Sat, 04 Jul 2026 15:18:35 +0000

Introduction & Context

Building mobile applications that capture real-time voice sessions and send them to cloud infrastructure for heavy AI inference—specifically Automatic Speech Recognition (ASR) transcription and Large Language Model (LLM) structural summarization—introduces a fundamental challenge: the hostility of the mobile edge. As a Technical Lead, I evaluate these problems through the lens of system durability. AI generation engines require clean, uncorrupted data payloads to yield accurate inference results. Yet, mobile devices operate in unpredictable network environments—dead zones, app switches, and abrupt routing handoffs are standard occurrences. If a user spends ten minutes capturing an intense audio session, data loss is a catastrophic failure.

To solve this, we must shift our mental model from a network-dependent streaming approach to a decoupled, edge-resilient architecture. This post outlines a generic, reusable architectural pattern that treats network drops, app-backgrounding, and pauses as expected paths rather than exceptional errors, ensuring absolute data durability for ambient, AI-driven document generation systems.

🔍 The Problem: Unreliable Edge Environments & AI Pipeline Constraints

Most system design tutorials assume a "happy path" data flow: a mobile client captures audio, streams it seamlessly to a cloud endpoint, and immediately returns a structured text output from an LLM.

In production, the reality of the mobile edge shatters this assumption. Heavy background processing tasks on the backend (like audio diarization, token optimization, and multi-stage LLM prompting workflows) can introduce significant processing latencies. If an architecture forces a synchronous connection between the mobile edge and the AI processing layers during routine network disruptions, the system suffers from critical vulnerabilities:

Inference Payload Corruption: Dropping a connection mid-flight leads to fragmented or corrupted audio files. In token-dependent systems, losing a portion of the recording means losing critical contextual prompt data, causing incomplete or flawed AI outputs.
Brittle User Experience: Blocking the client UI thread while waiting for a heavy AI processing engine to return a long LLM token stream over a fluctuating network creates an unstable application.
Ingestion Bottlenecks: Forcing the backend API gateway to maintain long-lived synchronous connections for large media uploads while coordinating deep ASR/LLM pipelines restricts horizontal scalability and invites systemic timeouts.

Key Non-Functional Requirements (NFRs)

To build a resilient voice-to-note pipeline, the architecture must satisfy three strict constraints:

Durability (0% Context Loss): Raw captured data must survive sudden network drops and OS-level app backgrounding to preserve the entire context window for the AI models.
Availability: The client's ability to capture high-fidelity audio data must be completely decoupled from active cloud internet connectivity.
Scalability: The backend gateway must handle high-volume media ingestion instantly, offloading compute-heavy AI inference workloads to isolated worker pools.

🏗️ 1. The Core Architectural Philosophy: Local Durability

The foundational rule of this architecture is simple: Always write capture data to local storage before depending on the network. By making the local file system the primary target of the data stream, the active capture session becomes completely independent of cloud infrastructure availability. The network becomes a transport enhancement layer rather than a strict prerequisite for session capture.

System State Machine

To ensure deterministic execution across edge cases, the client lifecycle transitions through a small set of explicitly bounded states.
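One way to sketch such a lifecycle is a pure transition function. The state and event names below are assumptions made for illustration, inferred from the behaviors described in this post, and should not be read as the exact production state set.

```swift
/// Hypothetical client lifecycle states.
enum CaptureState: Equatable {
    case idle, recording, paused, backgrounded, pendingSync, uploading, synced
}

/// Hypothetical events that drive the lifecycle.
enum CaptureEvent {
    case start, pause, resume, enteredBackground, enteredForeground
    case stop, connectivityRestored, uploadSucceeded, uploadFailed
}

/// Returns the next state, or nil when the event is not valid in the
/// current state. Callers can ignore nil and stay where they are.
func nextState(from state: CaptureState, on event: CaptureEvent) -> CaptureState? {
    switch (state, event) {
    case (.idle, .start):
        return .recording
    case (.recording, .pause):
        return .paused
    case (.paused, .resume):
        return .recording
    case (.recording, .enteredBackground):
        return .backgrounded
    case (.backgrounded, .enteredForeground):
        return .recording
    case (.recording, .stop), (.paused, .stop), (.backgrounded, .stop):
        // The file is finalized locally and waits for an upload opportunity.
        return .pendingSync
    case (.pendingSync, .connectivityRestored):
        return .uploading
    case (.uploading, .uploadSucceeded):
        return .synced
    case (.uploading, .uploadFailed):
        // A failed upload is a normal path: go back to waiting.
        return .pendingSync
    default:
        return nil
    }
}
```

Keeping the transition function free of side effects makes every edge case testable without a device, a microphone, or a network.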

🔄 2. Handling Interruptions as Normal Paths

Traditional mobile implementations often treat app-backgrounding or connectivity drops as catastrophic errors that require disruptive user alerts. In a professional architecture, we treat these as standard operational realities.

Pause and Resume: When a user pauses, the current session snapshot is committed to local storage. On resume, the state is restored and capture continues sequentially.
Background and Foreground: When the OS moves the application to the background, the app pauses capture and persists session metadata to disk. Upon returning to the foreground, the session context automatically restores.
Connectivity Loss During Capture: If the connection drops during recording, the app continues to write raw bytes to the local file buffer without surfacing network errors to the user.

📭 3. Decoupling Capture from AI Orchestration

Finishing a session and executing AI generation workloads are entirely separate steps in this pipeline.

When a session ends while the device is offline, the local media file is finalized on disk and registered inside a persistent, local outbound queue. The user interface reflects a clear "pending sync" state, while native background synchronization frameworks (such as Android WorkManager or iOS Background Tasks) retry the transfer autonomously when connectivity returns.
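A persistent outbound queue can be quite small. The sketch below models persistence as JSON `Data` that the caller writes to local storage; the type names and fields are hypothetical, and scheduling through the platform's background framework is intentionally left out.

```swift
import Foundation

/// A finalized recording that is waiting to be uploaded.
struct PendingUpload: Codable, Equatable {
    let sessionID: String
    let filePath: String
    var attempts: Int
}

/// A minimal outbound queue whose contents can be serialized and restored.
struct OutboundQueue {
    private(set) var items: [PendingUpload] = []

    init() {}

    /// Restores a queue from previously serialized state, e.g. after relaunch.
    init(restoringFrom data: Data) throws {
        items = try JSONDecoder().decode([PendingUpload].self, from: data)
    }

    /// Registers a finalized session. Enqueueing the same session twice is a no-op.
    mutating func enqueue(sessionID: String, filePath: String) {
        guard !items.contains(where: { $0.sessionID == sessionID }) else { return }
        items.append(PendingUpload(sessionID: sessionID, filePath: filePath, attempts: 0))
    }

    /// Records a failed transfer; the item stays queued for a later retry.
    mutating func recordFailure(sessionID: String) {
        guard let index = items.firstIndex(where: { $0.sessionID == sessionID }) else { return }
        items[index].attempts += 1
    }

    /// Removes an item once the server has acknowledged the upload.
    mutating func markUploaded(sessionID: String) {
        items.removeAll { $0.sessionID == sessionID }
    }

    /// Serializes the queue so it can be written to local storage.
    func serialized() throws -> Data {
        try JSONEncoder().encode(items)
    }
}
```

Because an item is removed only after a confirmed upload, a crash or a dropped connection mid-transfer leaves it in the queue for the next attempt.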

AI Infrastructure System View

This structural decoupling isolates volatile edge dependencies away from the core AI orchestration and processing layers.

Responsibility Breakdown Matrix

Layer	System Responsibility
Capture module	Captures raw media and writes incrementally to local storage.
Local store	Holds partial sessions, finalized binary files, and queue metadata.
Outbound queue	Handles retry mechanics and payload scheduling using exponential backoff.
AI Ingestion Gateway	Ingests media payloads, validates structural requests, and enqueues jobs immediately.
Asynchronous Orchestrator	Coordinates deep background processing pipelines: manages data ingestion, calls internal or external services, and tracks progress.
ASR Engine	Processes the validated audio through speech-to-text inference models to generate raw text transcripts.
LLM Inference Layer	Processes text transcripts through prompting templates to output structured, contextual note data.
Result store	Persists finished AI output datasets for transactional retrieval.
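The exponential backoff used by the outbound queue can be sketched as follows. The base delay, the cap, and the use of full jitter are assumptions for illustration; the article does not prescribe specific values.

```swift
import Foundation

/// Delay in seconds before retry number `attempt` (0-based):
/// base * 2^attempt, capped at `capSeconds`. Negative attempts are treated as 0.
func backoffDelay(attempt: Int, baseSeconds: Double = 2, capSeconds: Double = 300) -> Double {
    let exponent = Double(max(0, attempt))
    return min(capSeconds, baseSeconds * pow(2, exponent))
}

/// Full jitter: picks a random delay between 0 and the capped backoff,
/// which helps avoid many clients retrying at the same instant.
func jitteredDelay(attempt: Int) -> Double {
    Double.random(in: 0...backoffDelay(attempt: attempt))
}
```

With these defaults the delays grow as 2, 4, 8, 16 seconds and so on, until they reach the 300-second cap.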

⚙️ 4. Asynchronous Processing & Sequence Flow

Heavy execution workloads should never block an active client connection. Upon successful upload, the backend entry point writes the media asset to disk, registers a job identifier, and instantly returns a 202 Accepted status code. The actual long-running compute job is offloaded to background processing workers.
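The accept-then-process flow can be modeled in a few lines. This is an in-memory sketch with hypothetical names (`JobRegistry`, `acceptUpload`); a real backend would use a durable job store and a message queue, and this class is not thread-safe.

```swift
import Foundation

enum JobStatus: Equatable {
    case queued, processing, completed
}

/// What the gateway returns to the client right after the upload.
struct AcceptedResponse {
    let statusCode: Int
    let jobID: String
}

final class JobRegistry {
    private var statuses: [String: JobStatus] = [:]
    private var mediaPaths: [String: String] = [:]
    private var pendingJobIDs: [String] = []

    /// Gateway path: register the job and return immediately with 202 Accepted.
    /// No inference work happens here.
    func acceptUpload(storedMediaPath: String) -> AcceptedResponse {
        let jobID = UUID().uuidString
        statuses[jobID] = .queued
        mediaPaths[jobID] = storedMediaPath
        pendingJobIDs.append(jobID)
        return AcceptedResponse(statusCode: 202, jobID: jobID)
    }

    /// Worker path: claim the oldest queued job, if there is one.
    func claimNextJob() -> String? {
        guard !pendingJobIDs.isEmpty else { return nil }
        let jobID = pendingJobIDs.removeFirst()
        statuses[jobID] = .processing
        return jobID
    }

    /// Worker path: mark a known job as finished.
    func markCompleted(jobID: String) {
        if statuses[jobID] != nil { statuses[jobID] = .completed }
    }

    /// Status lookup, used by whichever delivery transport the client relies on.
    func status(of jobID: String) -> JobStatus? {
        statuses[jobID]
    }
}
```

The client keeps the returned job identifier and uses it later to look up the result.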

📡 5. Informing the Client (Status Delivery Matrix)

How the mobile client learns that a job is complete depends entirely on your specific platform requirements and firewall constraints. The core engine remains constant; only the transport varies:

Approach	When it Fits	Architectural Trade-off
Status Polling	Simple to implement; ideal for environments with strict firewall policies blocking persistent sockets.	Introduces marginal egress overhead and higher latency between job completion and client discovery.
Live Connections (WebSockets)	Best for open apps requiring near-real-time user interface updates.	Requires custom reconnection state logic to handle intermittent signal drops.
System Notifications (Push)	Necessary when users lock their devices or exit the app during long processing cycles.	Dependent on third-party system delivery loops (FCM/APNs) outside the core infrastructure.
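As an example of the simplest transport, status polling can be sketched with the status check and the wait injected as closures, so the loop stays testable. The names and the fixed interval are assumptions; a production client would typically combine this with backoff and cancellation.

```swift
/// Result of a bounded polling loop.
enum PollOutcome: Equatable {
    case completed(afterAttempts: Int)
    case gaveUp
}

/// Checks job status up to `maxAttempts` times, waiting `intervalSeconds`
/// between checks. `isJobComplete` and `wait` are supplied by the caller,
/// e.g. an HTTP status request and a sleep.
func pollUntilComplete(
    maxAttempts: Int,
    intervalSeconds: Double,
    isJobComplete: () -> Bool,
    wait: (Double) -> Void
) -> PollOutcome {
    guard maxAttempts > 0 else { return .gaveUp }
    for attempt in 1...maxAttempts {
        if isJobComplete() { return .completed(afterAttempts: attempt) }
        // No need to wait after the final check.
        if attempt < maxAttempts { wait(intervalSeconds) }
    }
    return .gaveUp
}
```

When the loop gives up, the job is not lost: the client can resume polling later or fall back to a push notification.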

🔍 Design Choices at a Glance

Concern	Pattern-Level Architecture Strategy
Active Capture Interruptions	Local partial chunk buffering + continuous state serialization.
OS Background Transitions	Immediate state checkpointing on background; conditional resume on foreground.
Network Loss Mid-Session	Complete local isolation; network availability check deferred to post-session.
Upload Failure Handling	Local outbound queueing backed by persistent OS-level background work frameworks.
Result Delivery Lifecycle	Decoupled notification transport layers (polling, sockets, or push notifications).

🛑 What This Pattern Deliberately Omits

To maintain a pure pattern-level architecture blueprint, this high-level design deliberately excludes implementation-specific layers:

Authentication and Authorization token validation loops.
Data security governance (Encryption-at-rest strategies for local cache files).
Media format selections, compression algorithms, and audio segmentation logic.
Prompt engineering parameters, temperature tuning, and context window truncation handlers.
Observability metrics, LLM request caching strategies, and API cost controls.

These concerns are critical for production hardening but are implemented as complementary layers built on top of this architectural foundation.

🏁 Key Takeaways

Capture locally first — Never make network connectivity a prerequisite for client-side data recording.
Treat interruptions as normal paths — Design for pauses, background execution, and offline network fallbacks from day one.
Separate capture from upload — Offload delivery tracking to an independent outbound queueing engine.
Process asynchronously — Relieve API gateways by converting requests into background worker jobs immediately.
Keep the transport flexible — Select status delivery mechanisms that best match your target operating system and network constraints.

If you are engineering architectures that translate edge-captured audio streams into structured backend datasets, prioritize local durability and asynchronous decoupling. Everything else is optimization.

Disclaimer: The views and architectural designs expressed in this article are solely my own and do not represent the opinions or strategies of any current or past employers. All system designs discussed are sanitized, conceptual, and pattern-focused.