Emily Lin
Local-First Voice AI: What Actually Works (and What Doesn't) — Week 3

This is part of my journey building the Kai ecosystem: a fully local, offline-first voice assistant that keeps your data yours.
I started by building an app for myself first. Working with Claude, I layered up natural-language time-parsing logic, and my goal was simple: a functional app that actually does what it is designed to do.

Kai Lite: 5-Point Summary

  • Privacy-first voice assistant - Complete offline functionality, zero cloud data sharing, all data stays on your device
  • Natural voice commands - Add reminders, create memos, check calendar using speech-to-text with pattern-based parsing
  • Local-first architecture - Flutter mobile app with SQLite storage, works in airplane mode, no internet required
  • User data control - Export/delete everything anytime, transparent permissions, visual indicators when mic is active
  • Future ecosystem foundation - Designed to sync with Kai Laptop/Desktop while maintaining privacy and user control

This week, I'm sharing what actually happened when I tried to build a voice agent that works completely offline. Turns out, it's harder than expected, even for AI-native builders.
App Demo

Kai Lite App Demo

My AI Collaborator This Week

Claude: My main implementation partner throughout this build. From initial architecture decisions to debugging regex patterns, Claude helped me think through each technical challenge and iterate quickly on solutions.

What I Actually Built (The Messy Reality)

Attempt 1: "Let's Build Alexa-Level Voice Commands"
The goal was ambitious: voice commands that work as smoothly as Alexa, but completely local.
Started with the standard Flutter voice setup:

dependencies:
  speech_to_text: ^6.3.0
  flutter_tts: ^3.8.3
  permission_handler: ^11.0.1

Basic voice service structure:

import 'package:flutter_tts/flutter_tts.dart';
import 'package:speech_to_text/speech_to_text.dart';

class VoiceService {
  final SpeechToText _speech = SpeechToText();
  final FlutterTts _tts = FlutterTts();

  Future<void> initialize() async {
    await _speech.initialize();
    // Kai's calm voice settings
    await _tts.setSpeechRate(0.9);  
    await _tts.setPitch(1.0);
  }
}
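Later sections call _voice.speak(...) on this service, but the speaking and listening methods aren't shown in the post. Here's a minimal sketch of what they might look like; the method names (speak, startListening) are my shorthand, not necessarily Kai's actual API:

// Minimal sketch of the speak/listen side of VoiceService
// (assumed method names, not the exact Kai implementation)
class VoiceService {
  // ... fields and initialize() as above ...

  /// Speaks a response aloud using the local TTS engine.
  Future<void> speak(String text) async {
    await _tts.speak(text);
  }

  /// Listens once, fully on-device, and hands the final transcript to [onResult].
  Future<void> startListening(void Function(String text) onResult) async {
    await _speech.listen(
      onDevice: true,
      partialResults: false,
      onResult: (result) {
        if (result.finalResult) {
          onResult(result.recognizedWords);
        }
      },
    );
  }
}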

The reality check:
Spent a day testing and realized that even with onDevice: true, the accuracy wasn't consistent enough for the "Alexa-level" experience I wanted.
Result: Needed a completely different approach.

Attempt 2: Comprehensive Pattern-Based Parser (What Actually Works)

Claude suggested focusing on pattern-based parsing instead of trying to build a mini-Alexa.
Smart advice. I used AI to help design the VoiceCommandParser architecture and generate comprehensive regex patterns for the different ways people naturally phrase commands.

class VoiceCommandParser {
  static final Map<String, List<RegExp>> patterns = {
    'calendar_add': [
      RegExp(r'remind me to (.*?) at (.*)'),
      RegExp(r'add (.*?) to calendar at (.*)'),
      RegExp(r'schedule (.*?) for (.*)'),
      RegExp(r'set reminder (.*?) at (.*)'),
      RegExp(r'(.*?) at (.*?) today'),
      RegExp(r'(.*?) at (.*?) tomorrow'),
    ],
    'calendar_check': [
      RegExp(r"what'?s on my calendar\??"),
      RegExp(r"what do i have today\??"),
      RegExp(r"show my schedule"),
      RegExp(r"any events today\??"),
    ],
    'memo_add': [
      RegExp(r'note to self[,:]? (.*)'),
      RegExp(r'remember that (.*)'),
      RegExp(r'make a note[,:]? (.*)'),
      RegExp(r'write down (.*)'),
    ],
  };

  static VoiceCommand parse(String input) {
    input = input.toLowerCase().trim();

    // Check each pattern category
    for (final entry in patterns.entries) {
      final intent = entry.key;
      final patternList = entry.value;

      for (final pattern in patternList) {
        final match = pattern.firstMatch(input);
        if (match != null) {
          return _extractCommand(intent, input, match);
        }
      }
    }

    // Fuzzy matching fallback
    return _fuzzyMatch(input);
  }
}
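parse() returns a VoiceCommand and falls back to _fuzzyMatch, neither of which is shown above. A minimal sketch of what they could look like; the field names and the keyword-scoring fallback are my assumptions, not Kai's exact implementation:

class VoiceCommand {
  final String intent;     // 'calendar_add', 'calendar_check', 'memo_add', or 'unknown'
  final String? title;
  final String? time;
  final double confidence; // used by the conversation handler's 0.7 threshold

  VoiceCommand({
    required this.intent,
    this.title,
    this.time,
    this.confidence = 1.0,
  });
}

// The fallback lives inside VoiceCommandParser alongside parse():
class VoiceCommandParser {
  // ... patterns and parse() as shown above ...

  // Keyword scoring when no regex matches (assumed approach, not Kai's exact logic)
  static VoiceCommand _fuzzyMatch(String input) {
    const keywords = {
      'calendar_add': ['remind', 'reminder', 'schedule'],
      'calendar_check': ['calendar', 'today', 'events'],
      'memo_add': ['note', 'remember', 'memo'],
    };

    var bestIntent = 'unknown';
    var bestScore = 0;
    keywords.forEach((intent, words) {
      final score = words.where((w) => input.contains(w)).length;
      if (score > bestScore) {
        bestScore = score;
        bestIntent = intent;
      }
    });

    // Low confidence so the conversation handler asks a clarifying question.
    return VoiceCommand(
      intent: bestIntent,
      confidence: bestScore > 0 ? 0.5 : 0.0,
    );
  }
}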

Added smart time parsing that handles natural language:

static String? _parseTime(String timeStr) {
  // Natural language conversions
  final conversions = {
    'morning': '9:00 AM',
    'afternoon': '2:00 PM', 
    'evening': '6:00 PM',
    'night': '9:00 PM',
    'noon': '12:00 PM',
    'midnight': '12:00 AM',
  };

  // Check natural language first
  for (final entry in conversions.entries) {
    if (timeStr.contains(entry.key)) {
      return entry.value;
    }
  }

  // Parse actual times (3pm, 3:30pm, 15:00)
  final timeMatch = RegExp(r'(\d{1,2})(?::(\d{2}))?\s*(am|pm)?', 
                          caseSensitive: false).firstMatch(timeStr);
  if (timeMatch != null) {
    var hour = int.parse(timeMatch.group(1) ?? '0');
    final minute = timeMatch.group(2) ?? '00';
    var ampm = timeMatch.group(3)?.toUpperCase();

    // Smart guessing for ambiguous times
    if (ampm == null) {
      if (hour >= 7 && hour <= 11) {
        ampm = 'AM';
      } else if (hour >= 1 && hour <= 6) {
        ampm = 'PM';
      } else if (hour >= 13 && hour <= 23) {
        hour = hour - 12;
        ampm = 'PM';
      }
    }

    // Default to PM when still ambiguous (e.g. 12 with no am/pm),
    // so we never return the literal string "null"
    ampm ??= 'PM';
    return '$hour:$minute $ampm';
  }

  return null;
}

Multi-turn conversation handler for missing information:

class ConversationHandler {
  final VoiceService _voice = VoiceService(); // TTS/STT wrapper from earlier
  ConversationContext _context = ConversationContext();

  Future<void> handleCommand(String input) async {
    final command = VoiceCommandParser.parse(input);

    if (command.confidence < 0.7) {
      await _voice.speak("I'm not sure. Did you want to add a calendar event or create a memo?");
      return;
    }

    // Handle missing information
    if (command.intent == 'calendar_add') {
      if (command.title == null) {
        _context.state = ConversationState.waitingForTitle;
        await _voice.speak("What would you like me to remind you about?");
        return;
      }

      if (command.time == null) {
        _context.state = ConversationState.waitingForTime;
        await _voice.speak("What time should I set the reminder for?");
        return;
      }

      await _createCalendarEvent(command);
    }
  }
}
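ConversationContext and ConversationState aren't shown in the post; here's a minimal sketch of what they might hold (the exact fields are my assumption):

enum ConversationState { idle, waitingForTitle, waitingForTime }

class ConversationContext {
  ConversationState state = ConversationState.idle;

  // Partial command being assembled across turns
  String? pendingIntent;
  String? pendingTitle;
  String? pendingTime;

  void reset() {
    state = ConversationState.idle;
    pendingIntent = null;
    pendingTitle = null;
    pendingTime = null;
  }
}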

Performance after this approach:

  • Recognition accuracy: 90% for supported patterns
  • Response time: <300ms end-to-end
  • Memory usage: 45MB while active
  • Battery impact: <2% over full day of testing

Real example that works:
User: "Remind me to call mom tomorrow at three"

STT: "remind me to call mom tomorrow at three"

Pattern match: RegExp(r'remind me to (.*?) at (.*)')

Extract: title="call mom tomorrow", time="three"

Time parsing: "three" → "3:00 PM" (afternoon guess)

Date parsing: "tomorrow" → DateTime.now().add(Duration(days: 1))

Create task in SQLite

TTS: "Perfect! I've added 'call mom' for 3 PM tomorrow"

Attempt 3: The Complete Alexa-Level System

Realized I was thinking about this wrong. Instead of trying to match Alexa, I built something simpler that works reliably.

My actual architecture:

// 1. Local STT with better settings
await _speech.listen(
  // onResult callback omitted in this snippet; it delivers the final transcript
  onDevice: true,
  listenFor: Duration(seconds: 3), // Shorter timeout
  cancelOnError: true,
  partialResults: false, // Wait for complete result
);

// 2. Pattern-based parsing with multiple variations
static VoiceCommand parse(String input) {
  input = input.toLowerCase().trim();

  // Check each pattern category
  for (final entry in patterns.entries) {
    final intent = entry.key;
    final patternList = entry.value;

    for (final pattern in patternList) {
      final match = pattern.firstMatch(input);
      if (match != null) {
        return _extractCommand(intent, input, match);
      }
    }
  }

  return VoiceCommand(intent: 'unknown');
}

// 3. Smart time parsing
static String? _parseTime(String timeStr) {
  final conversions = {
    'morning': '9:00 AM',
    'afternoon': '2:00 PM',
    'evening': '6:00 PM',
    'noon': '12:00 PM',
  };

  // Handle natural language first
  for (final entry in conversions.entries) {
    if (timeStr.contains(entry.key)) {
      return entry.value;
    }
  }

  // Then handle actual times like "3pm" or "3:30"
  final timeMatch = RegExp(r'(\d{1,2})(?::(\d{2}))?\s*(am|pm)?')
      .firstMatch(timeStr);
  // ... parsing logic
}

Real example of what works:
User says: "Remind me to call mom at three"

Local STT: "remind me to call mom at three"

Pattern match: RegExp(r'remind me to (.*?) at (.*)')

Extract: title="call mom", time="three"

Parse time: "three" → "3:00 PM" (smart guess for afternoon)

Create task in SQLite

Response: "Added 'call mom' for 3:00 PM today"

Performance after optimization:

  • Recognition time: 200-400ms
  • Memory usage: 40MB while active
  • Accuracy: 85% for supported commands
  • Battery impact: <2% over full day

The Privacy Architecture I Actually Built

Problem: How do you prove to users that nothing leaves their phone?
My solution - complete transparency:

1. Visual indicators everywhere:

// Kai bubble pulses when listening
AnimatedContainer(
  duration: Duration(milliseconds: 300),
  decoration: BoxDecoration(
    color: _isListening 
      ? Color(0xFF9C7BD9).withOpacity(0.8)  // Active purple
      : Color(0xFF9C7BD9).withOpacity(0.2), // Calm purple
    shape: BoxShape.circle,
  ),
)
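The _isListening flag has to be driven by the recognizer's actual state. A minimal sketch of one way to wire it up inside the widget's State class, using the onStatus callback that speech_to_text's initialize accepts (the setState wiring is my assumption):

// Keep the UI flag in sync with the recognizer's real state.
Future<void> _initSpeech() async {
  await _speech.initialize(
    onStatus: (status) {
      // speech_to_text reports statuses like 'listening', 'notListening', 'done'
      setState(() => _isListening = status == 'listening');
    },
  );
}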

2. Data export built in from day 1:

class DataExportService {
  Future<String> exportAllUserData() async {
    final tasks = await CalendarService().getAllTasks();
    final memos = await MemoService().getAllMemos();

    return jsonEncode({
      'export_date': DateTime.now().toIso8601String(),
      'tasks': tasks.map((t) => t.toMap()).toList(),
      'memos': memos.map((m) => m.toMap()).toList(),
    });
  }
}
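How that JSON actually leaves the app isn't shown; a minimal sketch of one hand-off, assuming the path_provider package (the file name and location are my choices, not Kai's):

import 'dart:io';
import 'package:path_provider/path_provider.dart';

Future<File> saveExportToFile() async {
  final json = await DataExportService().exportAllUserData();

  // Write to app-private storage; the user can then share the file however they like.
  final dir = await getApplicationDocumentsDirectory();
  final file = File('${dir.path}/kai_export.json');
  return file.writeAsString(json);
}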

3. One-tap delete everything:

Future<void> deleteAllUserData() async {
  await CalendarService().clearAllTasks();
  await MemoService().clearAllMemos();
  final prefs = await SharedPreferences.getInstance();
  await prefs.clear();
  // Show confirmation: "All data deleted"
}

What surprised me: in testing, I cared more about seeing the "Export my data" and "Delete everything" buttons than about perfect voice accuracy. Just knowing I had control felt satisfying.

Database Design That Actually Works Offline

Used SQLite with sync-ready fields from the start:

class Task {
  final String id;
  final String title;
  final DateTime? date;
  final String? time;
  final bool isCompleted;

  // Sync-ready fields for future
  final DateTime lastModified;
  final String sourceDevice;
  final String status; // 'active' | 'deleted'

  Task({
    required this.id,
    required this.title,
    this.date,
    this.time,
    this.isCompleted = false,
    required this.lastModified,
    this.sourceDevice = 'kai-lite-android',
    this.status = 'active',
  });
}
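The matching SQLite schema isn't shown in the post; a minimal sketch of a table that would line up with these fields (column names and types are my assumptions, with dates stored as ISO date strings like '2025-01-31'):

// Assumed schema matching the Task model above
await db.execute('''
  CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    date TEXT,
    time TEXT,
    is_completed INTEGER NOT NULL DEFAULT 0,
    last_modified TEXT NOT NULL,
    source_device TEXT NOT NULL DEFAULT 'kai-lite-android',
    status TEXT NOT NULL DEFAULT 'active'
  )
''');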

Why this works:

  • Everything works offline immediately
  • Sync fields ready for when I build cross-device features
  • Soft deletes mean data recovery is possible (see the sketch after this list)
  • Device tracking for multi-device scenarios
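Here's what a soft delete looks like in practice; a minimal sketch against the schema assumed above, where "deleting" a task just flips its status and normal reads filter it out:

import 'package:sqflite/sqflite.dart';

// Soft delete: mark the row instead of removing it, so it can be recovered.
Future<void> softDeleteTask(Database db, String taskId) async {
  await db.update(
    'tasks',
    {
      'status': 'deleted',
      'last_modified': DateTime.now().toIso8601String(),
    },
    where: 'id = ?',
    whereArgs: [taskId],
  );
}

// Normal reads only see active tasks.
Future<List<Map<String, Object?>>> activeTasks(Database db) {
  return db.query('tasks', where: "status = 'active'", orderBy: 'date');
}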

Performance Debugging (The Fun Stuff)

Issue 1: Memory leaks during voice processing

// Problem: Not disposing speech service
@override
void dispose() {
  _speech.stop();  // Added this
  _speech.cancel(); // And this
  super.dispose();
}

Issue 2: Battery drain from overlay

// Problem: Overlay always active
// Solution: Smart hiding
void _hideOverlayDuringCalls() {
  if (_phoneStateService.isInCall()) {
    _overlay.hide();
  }
}

Issue 3: SQLite performance with 1000+ tasks

// Added indexing for date queries
await db.execute('''
  CREATE INDEX IF NOT EXISTS idx_task_date_status 
  ON tasks(date, status)
''');
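For reference, this is the kind of query that index speeds up; a minimal sketch of a "what's on today" lookup, assuming dates are stored as plain yyyy-MM-dd strings as in the schema sketch above (the exact query Kai runs isn't shown in the post):

import 'package:sqflite/sqflite.dart';

// Today's active tasks: an equality match on (date, status) can use the index.
Future<List<Map<String, Object?>>> tasksForToday(Database db) {
  final now = DateTime.now();
  final today = '${now.year.toString().padLeft(4, '0')}-'
      '${now.month.toString().padLeft(2, '0')}-'
      '${now.day.toString().padLeft(2, '0')}';
  return db.query(
    'tasks',
    where: 'date = ? AND status = ?',
    whereArgs: [today, 'active'],
    orderBy: 'time',
  );
}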

What I Learned (Technical & Otherwise)

Technical insights:

  • SQLite performs way better than expected on mobile
  • Local speech processing is viable if you optimize for specific use cases
  • Pattern matching beats AI models for simple command parsing
  • Flutter overlays are battery killers if not managed properly

UX insights:

  • Privacy needs to feel empowering, not defensive
  • Visual feedback builds more trust than explanations
  • Reliable simple commands can feel smoother overall than unreliable complex ones

Architecture insights:

  • Build offline-first from day 1, add sync later
  • Start with the simplest solution that could work
  • Real user testing catches issues you never thought of

The Current State

What actually ships:

  • 15+ voice command patterns that work reliably
  • Complete offline functionality (no internet required)
  • Export/delete controls for full data ownership
  • <300ms voice response time
