JOHN MWACHARO


AI Service Architecture & Deployment Guide


πŸ—οΈ System Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   API Gateway   │    │   AI Services   │
│   Vue.js App    │◄──►│   (Kong/NGINX)  │◄──►│   Microservices │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                       ┌─────────────────┐    ┌─────────────────┐
                       │   Database      │    │   ML Models     │
                       │   (PostgreSQL)  │    │   (TensorFlow)  │
                       └─────────────────┘    └─────────────────┘

🚀 Performance Optimization Strategy

1. Microservices Architecture

Core AI Services:

  • Route Optimization Service - Handles path finding and optimization
  • Predictive Analytics Service - Delivery predictions and risk assessment
  • Fraud Detection Service - Real-time fraud scoring
  • NLP Communication Service - Message generation and sentiment analysis
  • Computer Vision Service - Package and document recognition
  • Voice Processing Service - Speech-to-text and voice commands

Service Structure:

# Example: Route Optimization Service (Python/FastAPI)
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
import asyncio
import json
import redis
import tensorflow as tf

app = FastAPI(title="Route Optimization AI Service")

# Load pre-trained model
route_model = tf.keras.models.load_model('/models/route_optimizer_v2.h5')

# Redis for caching
redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

class RouteOptimizationRequest(BaseModel):
    order_id: str
    pickup_location: dict
    delivery_location: dict
    constraints: dict
    external_factors: dict

@app.post("/optimize")
async def optimize_route(request: RouteOptimizationRequest):
    # Check cache first
    cache_key = f"route_opt:{request.order_id}"
    cached_result = redis_client.get(cache_key)

    if cached_result:
        return json.loads(cached_result)

    # AI Model Processing
    result = await process_route_optimization(request)

    # Cache result for 5 minutes
    redis_client.setex(cache_key, 300, json.dumps(result))

    return result

async def process_route_optimization(request):
    # Prepare input features
    features = prepare_features(request)

    # Run AI model prediction
    prediction = route_model.predict(features)

    # Post-process results
    return post_process_route(prediction, request)
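
Once the service is running (the development compose file later in this guide publishes it on host port 8001), a quick manual call against the /optimize endpoint could look like the sketch below; all field names mirror the RouteOptimizationRequest model above, and the values are placeholders:

// Hypothetical manual test against the dev stack (values are placeholders)
async function testRouteOptimization() {
    const response = await fetch('http://localhost:8001/optimize', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            order_id: 'ORD-12345',
            pickup_location: { lat: -1.2921, lng: 36.8219 },
            delivery_location: { lat: -1.3032, lng: 36.8440 },
            constraints: { max_distance_km: 25 },
            external_factors: { traffic: 'heavy', weather: 'clear' }
        })
    })
    console.log(await response.json())
}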

2. Database Optimization

PostgreSQL with AI-Specific Indexes:

-- Optimized indexes for AI queries
CREATE INDEX CONCURRENTLY idx_orders_ai_features ON orders 
    USING GIN ((ai_features::jsonb));

CREATE INDEX CONCURRENTLY idx_orders_geolocation ON orders 
    USING GIST (pickup_location, delivery_location);

CREATE INDEX CONCURRENTLY idx_delivery_patterns ON delivery_history 
    (delivery_date, client_id, success_status);

-- Materialized view for AI analytics
CREATE MATERIALIZED VIEW ai_order_features AS
SELECT 
    order_id,
    extract_ai_features(order_data) as features,
    delivery_success,
    delivery_time_actual,
    customer_rating
FROM orders o
JOIN delivery_history dh ON o.id = dh.order_id
WHERE created_at >= NOW() - INTERVAL '90 days';

-- Refresh every hour
CREATE OR REPLACE FUNCTION refresh_ai_features()
RETURNS void AS $$
BEGIN
    REFRESH MATERIALIZED VIEW CONCURRENTLY ai_order_features;
END;
$$ LANGUAGE plpgsql;
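
-- One way to run the hourly refresh (an assumption; requires the pg_cron extension):
SELECT cron.schedule('refresh-ai-features', '0 * * * *', 'SELECT refresh_ai_features()');

-- Note: REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view,
-- e.g. CREATE UNIQUE INDEX ON ai_order_features (order_id);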

3. Caching Strategy

Redis Caching Layers:

// AI Service Cache Manager (assumes the ioredis client)
const Redis = require('ioredis')

class AICacheManager {
    constructor() {
        this.redis = new Redis({
            host: process.env.REDIS_HOST,
            port: process.env.REDIS_PORT,
            retryDelayOnFailover: 100,
            maxRetriesPerRequest: 3
        })
    }

    // Route optimization cache (5 minutes)
    async cacheRouteOptimization(orderId, result) {
        const key = `route_opt:${orderId}`
        await this.redis.setex(key, 300, JSON.stringify(result))
    }

    // Fraud detection cache (1 hour)
    async cacheFraudResult(orderId, result) {
        const key = `fraud:${orderId}`
        await this.redis.setex(key, 3600, JSON.stringify(result))
    }

    // Predictive analytics cache (30 minutes)
    async cachePredictions(orderId, predictions) {
        const key = `predictions:${orderId}`
        await this.redis.setex(key, 1800, JSON.stringify(predictions))
    }

    // AI insights cache (15 minutes)
    async cacheInsights(orderId, insights) {
        const key = `insights:${orderId}`
        await this.redis.setex(key, 900, JSON.stringify(insights))
    }
}
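
The manager above only writes to Redis; the read path is implied. A minimal read-through helper that could sit alongside these methods might look like the following sketch (the function name and JSON handling are assumptions, not part of the original class):

// Hypothetical read-through helper (redisClient is an ioredis instance, as above)
async function getOrCompute(redisClient, key, ttlSeconds, computeFn) {
    const cached = await redisClient.get(key)
    if (cached) {
        return JSON.parse(cached)
    }
    // Cache miss: compute the value, store it with the given TTL, then return it
    const result = await computeFn()
    await redisClient.setex(key, ttlSeconds, JSON.stringify(result))
    return result
}

// Example: wrap an expensive AI lookup (names are illustrative)
// const route = await getOrCompute(redis, `route_opt:${orderId}`, 300, () => optimizeRoute(orderId))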

4. Model Optimization

TensorFlow Serving with GPU Acceleration:

# docker-compose.ai-services.yml
version: '3.8'
services:
  route-optimizer:
    image: tensorflow/serving:latest-gpu
    environment:
      - MODEL_NAME=route_optimizer
      - MODEL_BASE_PATH=/models/route_optimizer
    volumes:
      - ./models/route_optimizer:/models/route_optimizer
    ports:
      - "8501:8501"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  fraud-detector:
    image: tensorflow/serving:latest
    environment:
      - MODEL_NAME=fraud_detector
      - MODEL_BASE_PATH=/models/fraud_detector
    volumes:
      - ./models/fraud_detector:/models/fraud_detector
    ports:
      - "8502:8501"

  nlp-service:
    build: ./services/nlp
    environment:
      - TRANSFORMERS_CACHE=/cache
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./cache:/cache
    ports:
      - "8503:8000"

🔧 Implementation Details

1. API Service Layer (Node.js/Express)

// services/aiGateway.js
const express = require('express')
const { createProxyMiddleware } = require('http-proxy-middleware')
const rateLimit = require('express-rate-limit')
const redis = require('redis')

const app = express()
const redisClient = redis.createClient() // note: node-redis v4+ requires an explicit connect() before use

// Rate limiting for AI endpoints
const aiRateLimit = rateLimit({
    windowMs: 60 * 1000, // 1 minute
    max: 100, // limit each IP to 100 requests per windowMs
    message: 'Too many AI requests, please try again later'
})

// Health check for AI services
app.get('/health', async (req, res) => {
    const services = ['route-optimizer', 'fraud-detector', 'nlp-service']
    const health = {}

    for (const service of services) {
        try {
            const response = await fetch(`http://${service}:8000/health`)
            health[service] = response.ok ? 'healthy' : 'unhealthy'
        } catch (error) {
            health[service] = 'unreachable'
        }
    }

    res.json({ status: 'ok', services: health })
})

// Proxy to route optimization service
app.use('/api/v1/ai/route', aiRateLimit, createProxyMiddleware({
    target: 'http://route-optimizer:8000',
    changeOrigin: true,
    pathRewrite: { '^/api/v1/ai/route': '' },
    onError: (err, req, res) => {
        console.error('Route service error:', err)
        res.status(503).json({ error: 'Route optimization service unavailable' })
    }
}))

// Proxy to fraud detection service
app.use('/api/v1/ai/security', aiRateLimit, createProxyMiddleware({
    target: 'http://fraud-detector:8000',
    changeOrigin: true,
    pathRewrite: { '^/api/v1/ai/security': '' }
}))

// Proxy to NLP service
app.use('/api/v1/ai/communications', aiRateLimit, createProxyMiddleware({
    target: 'http://nlp-service:8000',
    changeOrigin: true,
    pathRewrite: { '^/api/v1/ai/communications': '' }
}))

app.listen(3000, () => {
    console.log('AI Gateway running on port 3000')
})

2. Frontend Performance Optimization

Service Worker for AI Caching:

// public/ai-service-worker.js
const AI_CACHE_NAME = 'ai-responses-v1'
const AI_CACHE_DURATION = 5 * 60 * 1000 // 5 minutes

self.addEventListener('fetch', event => {
    const url = new URL(event.request.url)

    // Cache AI responses for performance
    if (url.pathname.includes('/api/v1/ai/')) {
        event.respondWith(
            caches.open(AI_CACHE_NAME).then(cache => {
                return cache.match(event.request).then(cachedResponse => {
                    if (cachedResponse) {
                        const cachedTime = cachedResponse.headers.get('cached-time')
                        const now = Date.now()

                        // Check if cache is still valid
                        if (now - parseInt(cachedTime) < AI_CACHE_DURATION) {
                            return cachedResponse
                        }
                    }

                    // Fetch a fresh response and cache a timestamped copy.
                    // Response headers are immutable, so build a new Response that carries cached-time.
                    return fetch(event.request).then(response => {
                        const headers = new Headers(response.headers)
                        headers.set('cached-time', Date.now().toString())
                        const responseToCache = new Response(response.clone().body, {
                            status: response.status,
                            statusText: response.statusText,
                            headers
                        })
                        cache.put(event.request, responseToCache)
                        return response
                    })
                })
            })
        )
    }
})
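
The worker only takes effect once the app registers it. A minimal registration sketch for the Vue entry point (the file path and logging are assumptions):

// main.js — register the AI caching service worker (path is an assumption)
if ('serviceWorker' in navigator) {
    window.addEventListener('load', () => {
        navigator.serviceWorker.register('/ai-service-worker.js')
            .then(registration => console.log('AI service worker active at scope:', registration.scope))
            .catch(error => console.error('AI service worker registration failed:', error))
    })
}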

Optimized AI Service Class:

// Enhanced aiService.js with performance optimizations
class OptimizedAIService extends AIService {
    constructor() {
        super()
        this.requestQueue = new Map()
        this.batchTimer = null
        this.batchRequests = []
    }

    // Batch similar requests together
    async makeOptimizedRequest(endpoint, options = {}) {
        // Check if similar request is already in queue
        const requestKey = `${endpoint}:${JSON.stringify(options.body)}`

        if (this.requestQueue.has(requestKey)) {
            return this.requestQueue.get(requestKey)
        }

        // For prediction endpoints, batch requests
        if (endpoint.includes('/predictions/')) {
            return this.batchPredictionRequest(endpoint, options)
        }

        // For other requests, use normal flow with caching
        const requestPromise = this.makeRequest(endpoint, options)
        this.requestQueue.set(requestKey, requestPromise)

        // Clear from queue after completion
        requestPromise.finally(() => {
            this.requestQueue.delete(requestKey)
        })

        return requestPromise
    }

    // Batch prediction requests for efficiency
    async batchPredictionRequest(endpoint, options) {
        return new Promise((resolve, reject) => {
            this.batchRequests.push({ endpoint, options, resolve, reject })

            // Process batch after 100ms or when we have 10 requests
            if (this.batchRequests.length >= 10) {
                this.processBatch()
            } else if (!this.batchTimer) {
                this.batchTimer = setTimeout(() => this.processBatch(), 100)
            }
        })
    }

    async processBatch() {
        if (this.batchTimer) {
            clearTimeout(this.batchTimer)
            this.batchTimer = null
        }

        const batch = [...this.batchRequests]
        this.batchRequests = []

        try {
            const batchRequest = {
                requests: batch.map(req => ({
                    endpoint: req.endpoint,
                    data: JSON.parse(req.options.body)
                }))
            }

            const response = await this.makeRequest('/batch/predictions', {
                body: JSON.stringify(batchRequest)
            })

            // Resolve individual requests
            batch.forEach((req, index) => {
                req.resolve(response.results[index])
            })
        } catch (error) {
            // Reject all requests in batch
            batch.forEach(req => req.reject(error))
        }
    }
}

export const optimizedAIService = new OptimizedAIService()
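
From a component's point of view the optimized service is a drop-in for the base one. A hedged usage sketch follows; the import path, endpoint, payload shape, and the base makeRequest contract are assumptions carried over from the class above:

// Example usage in a Vue component (endpoint and payload shape are illustrative)
import { optimizedAIService } from '@/services/aiService'

async function loadDeliveryPrediction(orderId) {
    // Requests to /predictions/* are transparently batched by the service above
    return optimizedAIService.makeOptimizedRequest('/predictions/delivery', {
        method: 'POST',
        body: JSON.stringify({ order_id: orderId })
    })
}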

3. Deployment Configuration

Kubernetes Deployment:

# k8s/ai-services.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-gateway
  template:
    metadata:
      labels:
        app: ai-gateway
    spec:
      containers:
      - name: ai-gateway
        image: courier-ai/gateway:latest
        ports:
        - containerPort: 3000
        env:
        - name: REDIS_HOST
          value: "redis-service"
        - name: AI_SERVICE_TOKEN
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: service-token
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: route-optimizer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: route-optimizer
  template:
    metadata:
      labels:
        app: route-optimizer
    spec:
      containers:
      - name: route-optimizer
        image: courier-ai/route-optimizer:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
            nvidia.com/gpu: 1
          limits:
            memory: "2Gi"
            cpu: "1000m"
            nvidia.com/gpu: 1
        env:
        - name: MODEL_PATH
          value: "/models/route_optimizer_v2.h5"
        - name: BATCH_SIZE
          value: "32"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: ai-models-pvc

Docker Compose for Development:

# docker-compose.dev.yml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: courier_ai
      POSTGRES_USER: courier
      POSTGRES_PASSWORD: secure_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./sql/init.sql:/docker-entrypoint-initdb.d/init.sql

  ai-gateway:
    build: ./services/gateway
    ports:
      - "3000:3000"
    environment:
      - REDIS_HOST=redis
      - DATABASE_URL=postgresql://courier:secure_password@postgres:5432/courier_ai
    depends_on:
      - redis
      - postgres

  route-optimizer:
    build: ./services/route-optimizer
    ports:
      - "8001:8000"
    environment:
      - MODEL_PATH=/models/route_optimizer_v2.h5
      - REDIS_HOST=redis
    volumes:
      - ./models:/models
    depends_on:
      - redis

  frontend:
    build: ./frontend
    ports:
      - "8080:8080"
    environment:
      - VUE_APP_AI_SERVICE_URL=http://localhost:3000/api/v1/ai
    depends_on:
      - ai-gateway

volumes:
  redis_data:
  postgres_data:

4. Monitoring & Analytics

Prometheus Metrics:

// metrics/aiMetrics.js
const promClient = require('prom-client')

// Create custom metrics
const aiRequestDuration = new promClient.Histogram({
    name: 'ai_request_duration_seconds',
    help: 'Duration of AI requests in seconds',
    labelNames: ['service', 'endpoint', 'status']
})

const aiModelAccuracy = new promClient.Gauge({
    name: 'ai_model_accuracy',
    help: 'Current accuracy of AI models',
    labelNames: ['model_name', 'version']
})

const aiCacheHitRate = new promClient.Gauge({
    name: 'ai_cache_hit_rate',
    help: 'Cache hit rate for AI responses',
    labelNames: ['cache_type']
})

// Middleware to track metrics
function trackAIMetrics(req, res, next) {
    const startTime = Date.now()

    res.on('finish', () => {
        const duration = (Date.now() - startTime) / 1000
        aiRequestDuration
            // req.route is only defined once a route has matched, so fall back to req.path
            .labels(req.service || 'ai-gateway', req.route?.path || req.path, String(res.statusCode))
            .observe(duration)
    })

    next()
}

module.exports = { aiRequestDuration, aiModelAccuracy, aiCacheHitRate, trackAIMetrics }
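
These metrics only become visible to Prometheus once the gateway exposes a scrape endpoint. A minimal sketch of that wiring (the route path, module path, and default-metrics collection are assumptions):

// Expose metrics for Prometheus scraping (route path is an assumption)
const express = require('express')
const promClient = require('prom-client')
const { trackAIMetrics } = require('./metrics/aiMetrics')

const app = express()
promClient.collectDefaultMetrics()   // CPU, memory, event-loop lag, etc.
app.use(trackAIMetrics)              // record per-request AI metrics defined above

app.get('/metrics', async (req, res) => {
    res.set('Content-Type', promClient.register.contentType)
    res.end(await promClient.register.metrics())
})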

5. Security & Compliance

AI Service Authentication:

// middleware/aiAuth.js
const jwt = require('jsonwebtoken')
const rateLimit = require('express-rate-limit')

// JWT verification for AI endpoints
function verifyAIToken(req, res, next) {
    const token = req.headers.authorization?.split(' ')[1]

    if (!token) {
        return res.status(401).json({ error: 'No token provided' })
    }

    try {
        const decoded = jwt.verify(token, process.env.AI_JWT_SECRET)
        req.user = decoded

        // Check AI service permissions (guard against tokens without a permissions claim)
        if (!decoded.permissions?.includes('ai_access')) {
            return res.status(403).json({ error: 'Insufficient permissions' })
        }

        next()
    } catch (error) {
        return res.status(401).json({ error: 'Invalid token' })
    }
}

// Rate limiting specific to AI services
const aiRateLimit = rateLimit({
    windowMs: 60 * 1000, // 1 minute
    max: (req) => {
        // Different limits based on user tier
        switch (req.user?.tier) {
            case 'premium': return 200
            case 'standard': return 100
            default: return 50
        }
    },
    keyGenerator: (req) => req.user?.id || req.ip
})

module.exports = { verifyAIToken, aiRateLimit }
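
In the gateway these two pieces are meant to run in sequence; the sketch below shows one way to wire them in (the route path mirrors the proxies above). Order matters: the JWT check must run first so the rate limiter can key limits on req.user.

// Wiring in the gateway: authenticate first, then apply tier-aware rate limits
const { verifyAIToken, aiRateLimit } = require('./middleware/aiAuth')

// verifyAIToken populates req.user, which aiRateLimit uses for per-user limits
app.use('/api/v1/ai', verifyAIToken, aiRateLimit)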

📊 Performance Benchmarks

Expected Performance Metrics:

| Service             | Response Time | Throughput   | Accuracy |
| ------------------- | ------------- | ------------ | -------- |
| Route Optimization  | < 200ms       | 1000 req/min | 98.5%    |
| Fraud Detection     | < 100ms       | 2000 req/min | 94.2%    |
| Delivery Prediction | < 150ms       | 1500 req/min | 96.7%    |
| Message Generation  | < 300ms       | 800 req/min  | 95.3%    |
| Voice Processing    | < 500ms       | 400 req/min  | 97.1%    |

Resource Requirements:

| Component       | CPU     | Memory | GPU         | Storage |
| --------------- | ------- | ------ | ----------- | ------- |
| AI Gateway      | 2 cores | 1GB    | -           | 10GB    |
| Route Optimizer | 4 cores | 4GB    | 1x RTX 3080 | 50GB    |
| Fraud Detector  | 2 cores | 2GB    | -           | 20GB    |
| NLP Service     | 4 cores | 8GB    | 1x RTX 3080 | 100GB   |
| Redis Cache     | 2 cores | 4GB    | -           | 20GB    |
| PostgreSQL      | 4 cores | 8GB    | -           | 500GB   |

🚀 Deployment Steps

1. Infrastructure Setup

# Create a namespace for the AI stack
kubectl create namespace courier-ai

# Deploy Redis
kubectl apply -f k8s/redis.yaml -n courier-ai

# Deploy PostgreSQL
kubectl apply -f k8s/postgres.yaml -n courier-ai

# Create secrets in the same namespace
kubectl create secret generic ai-secrets -n courier-ai \
  --from-literal=service-token="your-secure-token" \
  --from-literal=jwt-secret="your-jwt-secret"

2. AI Services Deployment

# Build and deploy AI services
docker build -t courier-ai/gateway ./services/gateway
docker build -t courier-ai/route-optimizer ./services/route-optimizer

# Deploy to Kubernetes
kubectl apply -f k8s/ai-services.yaml -n courier-ai

# Verify deployment
kubectl get pods -n courier-ai
kubectl logs -f deployment/ai-gateway -n courier-ai

3. Frontend Integration

# Update environment variables
echo "VUE_APP_AI_SERVICE_URL=https://ai.yourdomain.com/api/v1/ai" > .env

# Build and deploy frontend
npm run build
docker build -t courier-ai/frontend .
kubectl apply -f k8s/frontend.yaml

4. Monitoring Setup

# Deploy Prometheus and Grafana
helm install prometheus prometheus-community/kube-prometheus-stack

# Import AI dashboard
kubectl apply -f monitoring/ai-dashboard.json

This architecture provides a scalable, high-performance AI service layer that can handle thousands of concurrent requests while maintaining sub-second response times for critical operations. The separation of concerns allows each AI service to be independently scaled and optimized based on demand patterns.
