<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lich Priest</title>
    <description>The latest articles on DEV Community by Lich Priest (@lich_priest_aad77ed6bdcd5).</description>
    <link>https://dev.to/lich_priest_aad77ed6bdcd5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923913%2Fb603bd8e-71b3-4e31-bf2b-b670b45a3686.png</url>
      <title>DEV Community: Lich Priest</title>
      <link>https://dev.to/lich_priest_aad77ed6bdcd5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lich_priest_aad77ed6bdcd5"/>
    <language>en</language>
    <item>
      <title>Deploying a Real-Time Object Detection API with YOLOv8 and FastAPI</title>
      <dc:creator>Lich Priest</dc:creator>
      <pubDate>Mon, 11 May 2026 18:04:23 +0000</pubDate>
      <link>https://dev.to/lich_priest_aad77ed6bdcd5/deploying-a-real-time-object-detection-api-with-yolov8-and-fastapi-40ag</link>
      <guid>https://dev.to/lich_priest_aad77ed6bdcd5/deploying-a-real-time-object-detection-api-with-yolov8-and-fastapi-40ag</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Object detection is one of the most exciting use‑cases of computer vision, and the YOLO (You Only Look Once) family has become the go‑to solution for real‑time inference. In this tutorial you’ll learn how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Train a custom YOLOv8 model&lt;/strong&gt; on your own dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrap the model in a FastAPI service&lt;/strong&gt; that accepts image uploads and returns detections instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerize the whole stack&lt;/strong&gt; with Docker so it runs the same everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate testing and deployment&lt;/strong&gt; using a GitHub Actions CI/CD pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end you’ll have a production‑ready API that can be deployed to any container host (AWS ECS, GCP Cloud Run, Azure Container Apps, or even your laptop).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If you’re new to YOLOv8, the official Ultralytics repo ships with a very friendly CLI that handles most of the heavy lifting. We’ll use it as the foundation and then add a thin FastAPI wrapper around the exported model.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Preparing the data and training YOLOv8
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Organize your dataset
&lt;/h3&gt;

&lt;p&gt;YOLO expects the following directory layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt; can be JPEG or PNG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labels&lt;/strong&gt; are text files with the same base name as the image, each line containing &lt;code&gt;class_id x_center y_center width height&lt;/code&gt; (all normalized to &lt;code&gt;[0,1]&lt;/code&gt;); see the example below.&lt;/li&gt;
&lt;/ul&gt;
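
&lt;p&gt;For example, a label file for an image containing a single object of the first class might hold one line (the values here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 0.412 0.538 0.240 0.610
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;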

&lt;p&gt;If you have data in COCO format, the &lt;code&gt;ultralytics&lt;/code&gt; package ships a converter helper (Pascal VOC annotations need a separate conversion step first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install ultralytics
from ultralytics.data.converter import convert_coco

# Writes YOLO-format label files for every image referenced in the COCO JSON
convert_coco(labels_dir="annotations/", save_dir="dataset/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.2 Create a &lt;code&gt;data.yaml&lt;/code&gt; file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./dataset/images/train&lt;/span&gt;
&lt;span class="na"&gt;val&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;./dataset/images/val&lt;/span&gt;

&lt;span class="na"&gt;nc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;                     &lt;span class="c1"&gt;# number of classes&lt;/span&gt;
&lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;person'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bicycle'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dog'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.3 Train the model
&lt;/h3&gt;

&lt;p&gt;The simplest way is to use the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yolo &lt;span class="nv"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;detect &lt;span class="nv"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;train &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./data.yaml &lt;span class="nv"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="nv"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;640 &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16 &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;yolov8n.pt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;yolov8n.pt&lt;/code&gt; is the nano version, perfect for low‑latency inference.&lt;/li&gt;
&lt;li&gt;Adjust &lt;code&gt;epochs&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, and &lt;code&gt;imgsz&lt;/code&gt; to fit your compute budget.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The training script will create a &lt;code&gt;runs/detect/train/weights/best.pt&lt;/code&gt; file – this is the model we’ll serve.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Quick sanity check
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yolo &lt;span class="nv"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;detect &lt;span class="nv"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;val &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./runs/detect/train/weights/best.pt &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./data.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a summary of mAP, precision, recall, and a few sample images with bounding boxes saved under &lt;code&gt;runs/detect/val&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Exporting the model for inference
&lt;/h2&gt;

&lt;p&gt;YOLOv8 can export to several formats (&lt;code&gt;torchscript&lt;/code&gt;, &lt;code&gt;ONNX&lt;/code&gt;, &lt;code&gt;TensorRT&lt;/code&gt;). For a FastAPI service running on CPU or GPU, the native PyTorch format works fine, but we’ll also export to ONNX for future flexibility.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yolo &lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./runs/detect/train/weights/best.pt &lt;span class="nv"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;onnx &lt;span class="nv"&gt;opset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll get &lt;code&gt;best.onnx&lt;/code&gt; in the same folder. Keep both &lt;code&gt;best.pt&lt;/code&gt; and &lt;code&gt;best.onnx&lt;/code&gt; – the former is useful for quick local testing, the latter for edge deployments.&lt;/p&gt;
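
&lt;p&gt;As a quick smoke test of the export, you can push a dummy tensor through &lt;code&gt;onnxruntime&lt;/code&gt; (a minimal sketch; &lt;code&gt;onnxruntime&lt;/code&gt; is an extra dependency, and real inputs still need YOLO’s letterbox preprocessing plus NMS post‑processing):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install onnxruntime
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("best.onnx", providers=["CPUExecutionProvider"])

# YOLOv8 ONNX exports take a float32 NCHW tensor scaled to [0, 1]
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: dummy})
print(outputs[0].shape)  # raw predictions; NMS still has to be applied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;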




&lt;h2&gt;
  
  
  3. Building the FastAPI wrapper
&lt;/h2&gt;

&lt;p&gt;Create a new folder called &lt;code&gt;api/&lt;/code&gt; and add the following files.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 &lt;code&gt;requirements.txt&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;fastapi&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=0.110.0&lt;/span&gt;
&lt;span class="err"&gt;uvicorn[standard]==0.27.0&lt;/span&gt;
&lt;span class="py"&gt;python-multipart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=0.0.9&lt;/span&gt;
&lt;span class="py"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.2.0&lt;/span&gt;
&lt;span class="py"&gt;opencv-python-headless&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=4.9.0.80&lt;/span&gt;
&lt;span class="py"&gt;ultralytics&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=8.2.0&lt;/span&gt;
&lt;span class="py"&gt;pydantic&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;=2.6.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 &lt;code&gt;app.py&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOLOv8 Object Detection API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the model once at startup
&lt;/span&gt;&lt;span class="n"&gt;MODEL_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best.pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model not found at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ultralytics/yolov5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;custom&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_PATH&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;force_reload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert uploaded file to a OpenCV BGR image.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;np_arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imdecode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np_arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IMREAD_COLOR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(...)):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Accept an image and return YOLO detections.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# Inference
&lt;/span&gt;    &lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;xyxy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Pandas DataFrame
&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert to JSON‑serializable dict
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of key parts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model loading:&lt;/strong&gt; We instantiate the Ultralytics &lt;code&gt;YOLO&lt;/code&gt; class directly with our &lt;code&gt;best.pt&lt;/code&gt;. This runs on CPU by default; pass &lt;code&gt;device=0&lt;/code&gt; to the inference call if you have a GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image handling:&lt;/strong&gt; &lt;code&gt;UploadFile&lt;/code&gt; gives us a file‑like object. We decode it with OpenCV so the model receives a NumPy array.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result formatting:&lt;/strong&gt; YOLO returns a list with one &lt;code&gt;Results&lt;/code&gt; object per image. Its &lt;code&gt;boxes&lt;/code&gt; attribute holds the coordinates, confidence, and class id of each detection; we flatten them into dicts with the keys &lt;code&gt;xmin, ymin, xmax, ymax, confidence, class, name&lt;/code&gt; so the API response stays clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 Run locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; api/requirements.txt
uvicorn api.app:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visit &lt;code&gt;http://localhost:8000/docs&lt;/code&gt; – FastAPI automatically generates an interactive Swagger UI. Try uploading a picture and you should receive a JSON array of detections.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Dockerizing the service
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;Dockerfile&lt;/code&gt; at the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use the official lightweight Python image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;

&lt;span class="c"&gt;# Install system dependencies (opencv needs libgl1)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    libgl1 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Set working directory&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Copy only requirements first for layer caching&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; api/requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Copy the rest of the code and the trained model&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; api/ ./api/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; runs/detect/train/weights/best.pt ./api/&lt;/span&gt;

&lt;span class="c"&gt;# Expose FastAPI port&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="c"&gt;# Command to run the service&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uvicorn", "api.app:app", "--host", "0.0.0.0", "--port", "8000"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.1 Build and test the image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; yolov8-api:latest &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 yolov8-api:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, navigate to &lt;code&gt;http://localhost:8000/docs&lt;/code&gt; to verify the container works.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. CI/CD with GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Having the Docker image build automatically on every push guarantees reproducibility. Add the following workflow file at &lt;code&gt;.github/workflows/docker-ci.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI / CD for YOLOv8 FastAPI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout repository&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up QEMU (for multi‑arch builds)&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-qemu-action@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Docker Buildx&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-buildx-action@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log in to Docker Hub&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/login-action@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_USERNAME }}&lt;/span&gt;
          &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_TOKEN }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push Docker image&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
          &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DOCKERHUB_USERNAME }}/yolov8-api:latest&lt;/span&gt;
          &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=registry,ref=${{ secrets.DOCKERHUB_USERNAME }}/yolov8-api:cache&lt;/span&gt;
          &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=registry,ref=${{ secrets.DOCKERHUB_USERNAME }}/yolov8-api:cache,mode=max&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Checks out the code&lt;/strong&gt; on the GitHub runner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enables multi‑architecture builds&lt;/strong&gt; (useful if you later target ARM devices).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticates with Docker Hub&lt;/strong&gt; using encrypted repository secrets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Builds the image&lt;/strong&gt; and pushes it to Docker Hub under your namespace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caches layers&lt;/strong&gt; to speed up subsequent builds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can now trigger a deployment on any platform that can pull from Docker Hub (e.g., a simple &lt;code&gt;docker run&lt;/code&gt; on an EC2 instance or a Kubernetes pod).&lt;/p&gt;
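
&lt;p&gt;For example, on any Docker host (replace &lt;code&gt;youruser&lt;/code&gt; with your Docker Hub namespace):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull youruser/yolov8-api:latest
docker run -d --name yolov8-api -p 8000:8000 --restart unless-stopped youruser/yolov8-api:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;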




&lt;h2&gt;
  
  
  6. Optional: Deploying to a cloud provider
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 AWS Elastic Container Service (ECS) – Fargate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create a cluster&lt;/span&gt;
aws ecs create-cluster &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; yolov8-cluster

&lt;span class="c"&gt;# 2. Register task definition (task-def.json)&lt;/span&gt;
aws ecs register-task-definition &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; file://task-def.json

&lt;span class="c"&gt;# 3. Run service&lt;/span&gt;
aws ecs create-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; yolov8-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; yolov8-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--task-definition&lt;/span&gt; yolov8-task &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--desired-count&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--launch-type&lt;/span&gt; FARGATE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network-configuration&lt;/span&gt; &lt;span class="s2"&gt;"awsvpcConfiguration={subnets=[subnet-xxxx],securityGroups=[sg-xxxx],assignPublicIp=ENABLED}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;task-def.json&lt;/code&gt; would reference the image you pushed (&lt;code&gt;youruser/yolov8-api:latest&lt;/code&gt;) and expose port 8000. After a few minutes the service is reachable via the load balancer URL.&lt;/p&gt;
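
&lt;p&gt;A minimal &lt;code&gt;task-def.json&lt;/code&gt; could look like the sketch below (values are illustrative; substitute your account ID and adjust the CPU/memory sizes and role name for your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "family": "yolov8-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "yolov8-api",
      "image": "youruser/yolov8-api:latest",
      "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
      "essential": true
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;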

&lt;h3&gt;
  
  
  6.2 Google Cloud Run
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud run deploy yolov8-api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gcr.io/&amp;lt;PROJECT_ID&amp;gt;/yolov8-api:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--platform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;managed &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--allow-unauthenticated&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both platforms automatically handle scaling, health checks, and HTTPS termination, leaving you with a low‑maintenance API.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Testing the live endpoint
&lt;/h2&gt;

&lt;p&gt;You can use &lt;code&gt;curl&lt;/code&gt; or a small Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&amp;lt;host&amp;gt;:8000/detect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
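
&lt;p&gt;The equivalent &lt;code&gt;curl&lt;/code&gt; call (same placeholder host):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST "http://&amp;lt;host&amp;gt;:8000/detect" -F "file=@test.jpg"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;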



&lt;p&gt;The response will be a list of detections, each containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"xmin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;124&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ymin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"xmax"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;342&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ymax"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;276&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.93&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"person"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now feed this output into downstream services—tracking, alerting, or even a front‑end UI that draws boxes in real time.&lt;/p&gt;
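
&lt;p&gt;As a tiny example of that, the sketch below (assuming the JSON schema above and a local &lt;code&gt;test.jpg&lt;/code&gt;) draws the returned boxes with OpenCV:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import cv2
import requests

# Send the image to the API and collect the detections
resp = requests.post("http://localhost:8000/detect",
                     files={"file": open("test.jpg", "rb")})

img = cv2.imread("test.jpg")
for det in resp.json():
    x1, y1 = int(det["xmin"]), int(det["ymin"])
    x2, y2 = int(det["xmax"]), int(det["ymax"])
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    label = f'{det["name"]} {det["confidence"]:.2f}'
    cv2.putText(img, label, (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite("annotated.jpg", img)  # inspect the boxes offline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;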




&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Train a custom YOLOv8 model on your own dataset with the Ultralytics CLI and validate it before serving.&lt;/li&gt;
&lt;li&gt;Wrap the trained weights in a thin FastAPI service that exposes a single &lt;code&gt;/detect&lt;/code&gt; endpoint.&lt;/li&gt;
&lt;li&gt;Containerize the whole stack with Docker so it runs the same everywhere.&lt;/li&gt;
&lt;li&gt;Automate testing and deployment with a GitHub Actions CI/CD pipeline.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>yolov8</category>
      <category>docker</category>
    </item>
    <item>
      <title>Deploy a Real‑Time Object Detection API with YOLOv8 &amp; FastAPI</title>
      <dc:creator>Lich Priest</dc:creator>
      <pubDate>Mon, 11 May 2026 18:04:04 +0000</pubDate>
      <link>https://dev.to/lich_priest_aad77ed6bdcd5/deploy-a-real-time-object-detection-api-with-yolov8-fastapi-24fd</link>
      <guid>https://dev.to/lich_priest_aad77ed6bdcd5/deploy-a-real-time-object-detection-api-with-yolov8-fastapi-24fd</guid>
      <description>&lt;h2&gt;
  
  
  Why combine YOLOv8 and FastAPI?
&lt;/h2&gt;

&lt;p&gt;Object detection is at the heart of many modern applications—think smart cameras, inventory robots, or AR experiences. YOLOv8 (You Only Look Once) gives you state‑of‑the‑art accuracy while still running fast enough for real‑time use. FastAPI, on the other hand, is a lightweight, async‑first web framework that makes it trivial to expose a model as a REST endpoint.&lt;/p&gt;

&lt;p&gt;In this tutorial you’ll walk through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preparing a small custom dataset and training a YOLOv8 model.&lt;/li&gt;
&lt;li&gt;Wrapping the model in a FastAPI service that accepts images and returns detections.&lt;/li&gt;
&lt;li&gt;Docker‑izing the whole stack so it can run anywhere with a single &lt;code&gt;docker compose up&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end you’ll have a reproducible, container‑based API that can serve predictions in a few milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;3.9‑3.11&lt;/td&gt;
&lt;td&gt;Compatibility with Ultralytics YOLO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ultralytics YOLO&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install ultralytics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Training and inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install fastapi[all]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker &amp;amp; Docker‑Compose&lt;/td&gt;
&lt;td&gt;Latest&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Version control (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You’ll also need a modest GPU for training (even a laptop GPU works for a small dataset). If you only want to test inference, CPU‑only mode is fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prepare a custom dataset
&lt;/h2&gt;

&lt;p&gt;YOLOv8 expects the classic folder layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_dataset/
├── images/
│   ├── train/
│   └── val/
└── labels/
    ├── train/
    └── val/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each image in &lt;code&gt;train/&lt;/code&gt; or &lt;code&gt;val/&lt;/code&gt; has a corresponding &lt;code&gt;.txt&lt;/code&gt; file in the same sub‑folder under &lt;code&gt;labels/&lt;/code&gt;. The label file contains one line per object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class_id&amp;gt; &amp;lt;x_center&amp;gt; &amp;lt;y_center&amp;gt; &amp;lt;width&amp;gt; &amp;lt;height&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All coordinates are normalized (0‑1). If you already have COCO‑style annotations, the &lt;code&gt;ultralytics&lt;/code&gt; package can convert them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# convert_coco_to_yolo.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;

&lt;span class="n"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_coco&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coco_annotations.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have the folder ready, create a &lt;code&gt;data.yaml&lt;/code&gt; that points to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# data.yaml&lt;/span&gt;
&lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./my_dataset/images/train&lt;/span&gt;
&lt;span class="na"&gt;val&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./my_dataset/images/val&lt;/span&gt;

&lt;span class="na"&gt;nc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;                     &lt;span class="c1"&gt;# number of classes&lt;/span&gt;
&lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;person'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bicycle'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dog'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Train the model
&lt;/h2&gt;

&lt;p&gt;Training with YOLOv8 is a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;yolo &lt;span class="nv"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;detect &lt;span class="nv"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;train &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;data.yaml &lt;span class="nv"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="nv"&gt;imgsz&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;640 &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16 &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;yolov8n.pt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;yolov8n.pt&lt;/code&gt; is the nano version (fastest, smallest). Swap for &lt;code&gt;yolov8s.pt&lt;/code&gt; or larger if you need higher accuracy.&lt;/li&gt;
&lt;li&gt;Adjust &lt;code&gt;epochs&lt;/code&gt;, &lt;code&gt;batch&lt;/code&gt;, and &lt;code&gt;imgsz&lt;/code&gt; to fit your hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After training finishes you’ll find the best checkpoint in &lt;code&gt;runs/detect/train/weights/best.pt&lt;/code&gt;. Keep that file; it’s what the API will load.&lt;/p&gt;
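
&lt;p&gt;Before wiring it into the API, you can sanity‑check the checkpoint with a one‑off prediction (&lt;code&gt;sample.jpg&lt;/code&gt; is any test image you have on hand):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict("sample.jpg", conf=0.25)  # returns a list of Results
results[0].show()   # pop up the annotated image (or .save() to write it to disk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;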

&lt;h2&gt;
  
  
  3. Build the FastAPI inference service
&lt;/h2&gt;

&lt;p&gt;Create a new folder &lt;code&gt;api/&lt;/code&gt; and add the following files.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;app/main.py&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app/main.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ultralytics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YOLO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOLOv8 Object Detection API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the model once at startup
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;YOLO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights/best.pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pil_to_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert a Pillow image to a NumPy array that YOLO expects.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/detect&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UploadFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(...)):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Accept an image file and return bounding boxes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid image type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Read bytes and open with Pillow
&lt;/span&gt;    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Corrupt image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run inference (async is not needed because YOLO runs on C++)
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;pil_to_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;))[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Build a simple JSON response
&lt;/span&gt;    &lt;span class="n"&gt;detections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xyxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# [x1, y1, x2, y2] in pixel coords
&lt;/span&gt;        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;app/requirements.txt&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fastapi[all]==0.110.*
uvicorn==0.27.*
ultralytics==8.2.*
pillow==10.2.*
numpy==1.26.*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;Dockerfile&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use an official lightweight Python image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;

&lt;span class="c"&gt;# Install system dependencies (opencv needed by ultralytics)&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    libgl1 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="c"&gt;# Create a non‑root user&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;-m&lt;/span&gt; appuser
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=appuser:appuser app/ ./app/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=appuser:appuser weights/best.pt ./weights/best.pt&lt;/span&gt;

&lt;span class="c"&gt;# Install Python deps&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; app/requirements.txt

&lt;span class="c"&gt;# Switch to non‑root user&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt; – Keep the &lt;code&gt;weights/&lt;/code&gt; directory next to the Dockerfile so the model file is added to the image at build time. For larger models you may want to mount the weights as a volume instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4. Docker‑Compose for one‑click launch
&lt;/h2&gt;

&lt;p&gt;Create a &lt;code&gt;docker-compose.yml&lt;/code&gt; at the repository root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.9"&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;yolo-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="c1"&gt;# If you have a GPU and the host has NVIDIA drivers, uncomment:&lt;/span&gt;
    &lt;span class="c1"&gt;# deploy:&lt;/span&gt;
    &lt;span class="c1"&gt;#   resources:&lt;/span&gt;
    &lt;span class="c1"&gt;#     reservations:&lt;/span&gt;
    &lt;span class="c1"&gt;#       devices:&lt;/span&gt;
    &lt;span class="c1"&gt;#         - capabilities: [gpu]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API will be reachable at &lt;code&gt;http://localhost:8000/detect&lt;/code&gt;. FastAPI automatically generates interactive docs at &lt;code&gt;http://localhost:8000/docs&lt;/code&gt;, where you can upload an image and see the JSON response instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Test the endpoint
&lt;/h2&gt;

&lt;p&gt;A quick curl test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://localhost:8000/detect"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@sample.jpg"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: application/json"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should receive something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detections"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"class_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"class_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"person"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;398&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;720&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"class_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"class_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;410&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;620&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;540&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you prefer a visual output, you can extend the API to return an image with drawn boxes using OpenCV or Pillow. The core logic stays the same; just add a &lt;code&gt;cv2.rectangle&lt;/code&gt; loop and return &lt;code&gt;StreamingResponse&lt;/code&gt;.&lt;/p&gt;
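
&lt;p&gt;Here is a minimal sketch of such an endpoint. It assumes the same &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;app&lt;/code&gt; objects as the &lt;code&gt;/detect&lt;/code&gt; service; the &lt;code&gt;/detect/image&lt;/code&gt; route name is illustrative, not part of the original service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io

import cv2
import numpy as np
from fastapi import File, UploadFile
from fastapi.responses import StreamingResponse

@app.post("/detect/image")
async def detect_image(file: UploadFile = File(...)):
    # Decode the uploaded bytes into a BGR image
    data = np.frombuffer(await file.read(), dtype=np.uint8)
    img = cv2.imdecode(data, cv2.IMREAD_COLOR)

    # Single-image inference; the first result holds the detections
    result = model(img)[0]
    for box in result.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)

    # Re-encode and stream the annotated image back to the client
    ok, encoded = cv2.imencode(".jpg", img)
    return StreamingResponse(io.BytesIO(encoded.tobytes()), media_type="image/jpeg")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;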

&lt;h2&gt;
  
  
  6. Scaling considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU acceleration&lt;/strong&gt;: The Dockerfile above runs on CPU. To enable GPU, use an &lt;code&gt;nvidia/cuda&lt;/code&gt; base image and pass &lt;code&gt;--gpus all&lt;/code&gt; to &lt;code&gt;docker run&lt;/code&gt;, or uncomment the &lt;code&gt;deploy&lt;/code&gt; block in the compose file above. The Ultralytics package automatically detects CUDA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch inference&lt;/strong&gt;: For higher throughput, modify the endpoint to accept a list of images and call &lt;code&gt;model(images)&lt;/code&gt; once. Batching amortizes per-request overhead and lets the model process all images in a single forward pass; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model versioning&lt;/strong&gt;: Store each trained checkpoint in a separate folder and mount the desired version at runtime (&lt;code&gt;-v ./weights/v2.pt:/app/weights/best.pt&lt;/code&gt;). This makes A/B testing painless.&lt;/li&gt;
&lt;/ul&gt;
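
&lt;p&gt;Here is a hedged sketch of such a batch endpoint, again assuming the &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;app&lt;/code&gt; objects from earlier (the &lt;code&gt;/detect/batch&lt;/code&gt; route name is my own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io
from typing import List

from fastapi import File, UploadFile
from PIL import Image

@app.post("/detect/batch")
async def detect_batch(files: List[UploadFile] = File(...)):
    # Decode every upload, then run one batched forward pass
    images = [Image.open(io.BytesIO(await f.read())) for f in files]
    results = model(images)

    # One detection list per input image, in upload order
    return {
        "batches": [
            [
                {
                    "class_id": int(box.cls[0]),
                    "class_name": r.names[int(box.cls[0])],
                    "confidence": round(float(box.conf[0]), 2),
                    "bbox": [int(v) for v in box.xyxy[0]],
                }
                for box in r.boxes
            ]
            for r in results
        ]
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;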

&lt;h2&gt;
  
  
  7. Clean up
&lt;/h2&gt;

&lt;p&gt;When you’re done experimenting, stop and remove containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
docker rmi &lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;-q&lt;/span&gt; your_image_name&lt;span class="si"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# optional&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also push the image to a registry (Docker Hub, GitHub Packages, etc.) and deploy it to any cloud provider that supports containers—AWS ECS, GCP Cloud Run, or Azure Container Apps—all with the same &lt;code&gt;docker run&lt;/code&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve just built a full‑stack, containerized object detection service:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt; → YOLOv8 training → &lt;code&gt;best.pt&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; wraps the model in a clean HTTP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; guarantees reproducibility and portability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker‑Compose&lt;/strong&gt; makes local development and testing a single command.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From here you can experiment with larger YOLO variants, add authentication to the API, or integrate the service into a larger micro‑service architecture. The sky’s the limit!&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;YOLOv8’s CLI makes custom training fast; a single &lt;code&gt;best.pt&lt;/code&gt; file is all you need for inference.&lt;/li&gt;
&lt;li&gt;FastAPI provides async‑ready, auto‑documented endpoints that pair nicely with YOLO’s Python API.&lt;/li&gt;
&lt;li&gt;Docker isolates dependencies (Python, OpenCV, CUDA) and ensures the same environment runs everywhere.&lt;/li&gt;
&lt;li&gt;Using Docker‑Compose you can spin up the API locally, test it with &lt;code&gt;curl&lt;/code&gt; or the Swagger UI, and later push the image to any container platform.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>docker</category>
    </item>
    <item>
      <title>[E2E TEST] Deploy a Real‑Time Voice‑Controlled AI Assistant on a Raspberry Pi</title>
      <dc:creator>Lich Priest</dc:creator>
      <pubDate>Mon, 11 May 2026 16:18:02 +0000</pubDate>
      <link>https://dev.to/lich_priest_aad77ed6bdcd5/e2e-test-deploy-a-real-time-voice-controlled-ai-assistant-on-a-raspberry-pi-41o4</link>
      <guid>https://dev.to/lich_priest_aad77ed6bdcd5/e2e-test-deploy-a-real-time-voice-controlled-ai-assistant-on-a-raspberry-pi-41o4</guid>
      <description>&lt;h2&gt;
  
  
  Why Run a Voice Assistant on the Edge?
&lt;/h2&gt;

&lt;p&gt;Running speech‑to‑text and intent detection locally gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low latency&lt;/strong&gt; – no round‑trip to the cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt; – audio never leaves the device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline reliability&lt;/strong&gt; – your assistant works even when the internet is down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial we’ll stitch together OpenAI’s Whisper (small model) for transcription, a tiny TensorFlow Lite intent classifier, and a real‑time audio pipeline that lives entirely on a Raspberry Pi 4 (2 GB or more). By the end you’ll have a Python script that listens for commands like “turn on the lamp” and executes a local function instantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You’ll Need
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raspberry Pi 4 (2 GB+) with Raspberry Pi OS (64‑bit)&lt;/td&gt;
&lt;td&gt;Provides enough RAM for Whisper‑small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Micro‑USB or USB‑C microphone&lt;/td&gt;
&lt;td&gt;Captures audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python 3.10+&lt;/td&gt;
&lt;td&gt;Modern language features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ffmpeg&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Required by Whisper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;git&lt;/code&gt;, &lt;code&gt;pip&lt;/code&gt;, &lt;code&gt;virtualenv&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Standard development tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional: GPIO‑controlled relay&lt;/td&gt;
&lt;td&gt;To demonstrate a real command&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If you’re using a Pi Zero, swap Whisper for a lighter model (e.g., &lt;code&gt;tiny.en&lt;/code&gt;) or run only the intent recognizer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Set Up the Development Environment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update OS and install system deps&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-pip python3-venv ffmpeg libportaudio2

&lt;span class="c"&gt;# Create a clean virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate

&lt;span class="c"&gt;# Upgrade pip and install core libraries&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install &lt;/span&gt;numpy sounddevice tqdm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install Whisper
&lt;/h3&gt;

&lt;p&gt;Whisper ships as a Python package that downloads the model on first use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/openai/whisper.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install TensorFlow Lite Runtime
&lt;/h3&gt;

&lt;p&gt;The full TensorFlow package is heavyweight for a Pi. Use the lightweight runtime instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tflite-runtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Capture Audio in Real Time
&lt;/h2&gt;

&lt;p&gt;We’ll use &lt;code&gt;sounddevice&lt;/code&gt; to stream 16 kHz mono audio directly into a NumPy buffer. Whisper expects 16 kHz, so we set the samplerate accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sounddevice&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_DURATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;   &lt;span class="c1"&gt;# seconds
&lt;/span&gt;&lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CHUNK_DURATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# A thread‑safe circular buffer
&lt;/span&gt;&lt;span class="n"&gt;audio_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# keep last 5 seconds
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audio_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Called by sounddevice for each audio chunk.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Audio status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;audio_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indata&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# mono channel
&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;InputStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;samplerate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_callback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔊 Listening…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The buffer continuously holds the most recent audio. We’ll pull a 2‑second slice every loop iteration and feed it to Whisper.&lt;/p&gt;
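
&lt;p&gt;If you like, the slicing can live in a small helper. This is a sketch of my own (the &lt;code&gt;last_seconds&lt;/code&gt; name is not from the original code); it reuses the &lt;code&gt;audio_buffer&lt;/code&gt; and &lt;code&gt;SAMPLE_RATE&lt;/code&gt; defined above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def last_seconds(n_seconds: float) -&gt; np.ndarray:
    """Snapshot the most recent n_seconds of audio as float32 samples."""
    n = int(n_seconds * SAMPLE_RATE)
    return np.array(list(audio_buffer)[-n:], dtype=np.float32)

# e.g. last_seconds(2.0) yields the 2-second slice we later pass to Whisper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;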




&lt;h2&gt;
  
  
  3. Run Whisper on the Edge
&lt;/h2&gt;

&lt;p&gt;Whisper‑small (~244 M parameters) fits into the Pi’s RAM and runs at ~2× real‑time on a Pi 4 with the CPU only. For lower latency we’ll use &lt;strong&gt;greedy&lt;/strong&gt; decoding (&lt;code&gt;beam_size=1&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;whisper&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Load Whisper‑small on CPU
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;whisper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Accepts a NumPy array of shape (samples,) and returns text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Whisper expects a float32 tensor normalized to [-1, 1]
&lt;/span&gt;    &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_numpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_timestamps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;beam_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reducing Compute with TorchScript (optional)
&lt;/h3&gt;

&lt;p&gt;If you want a modest speed boost, you can try scripting the model once. Note that &lt;code&gt;torch.jit.script&lt;/code&gt; does not support every Whisper release, so treat this as strictly optional:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scripted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Replace `model` with `scripted` in `transcribe_chunk`
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Build a Tiny Intent Classifier
&lt;/h2&gt;

&lt;p&gt;Instead of parsing full sentences, we’ll map short utterances to intents using a &lt;strong&gt;keyword‑spotting&lt;/strong&gt; model. The architecture is a 1‑D convolution followed by a dense layer – only ~10 k parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_intent_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;          &lt;span class="c1"&gt;# 1‑second raw waveform
&lt;/span&gt;        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Rescaling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;32768.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;# Normalize int16 to [-1, 1]
&lt;/span&gt;        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strides&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Conv1D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strides&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GlobalAveragePooling1D&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;softmax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training Data (quick example)
&lt;/h3&gt;

&lt;p&gt;Create a tiny dataset with four commands: &lt;code&gt;["turn on the lamp", "turn off the lamp", "what time is it", "stop listening"]&lt;/code&gt;. Record a few seconds for each command, label them, and train for a handful of epochs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Assume `X_train` shape = (samples, 16000, 1), y_train one‑hot encoded
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_intent_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adam&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;categorical_crossentropy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Convert to TensorFlow Lite and Quantize
&lt;/h3&gt;

&lt;p&gt;Quantization shrinks the model to ~30 KB and runs at &amp;gt;100 inferences/sec on the Pi.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TFLiteConverter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_keras_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Optimize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# post‑training quantization
&lt;/span&gt;&lt;span class="n"&gt;tflite_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_classifier.tflite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tflite_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Saved quantized TFLite model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load the TFLite Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tflite_runtime.interpreter&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tflite&lt;/span&gt;

&lt;span class="n"&gt;interpreter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tflite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Interpreter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_classifier.tflite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allocate_tensors&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;input_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input_details&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;output_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_output_details&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;waveform&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;waveform: np.ndarray shape (16000,)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Reshape to (1, 16000, 1) and cast to int16 for the quantized model
&lt;/span&gt;    &lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;waveform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int16&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interpreter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_idx&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Map IDs to human‑readable intents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;INTENT_MAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TURN_ON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TURN_OFF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET_TIME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STOP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Glue It All Together
&lt;/h2&gt;

&lt;p&gt;Now we combine the audio stream, Whisper transcription, and intent classifier into a single loop. We’ll use a 1‑second sliding window for intent detection (fast) and a 2‑second window for Whisper (more accurate).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TURN_ON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💡 Turning lamp ON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Example GPIO call:
&lt;/span&gt;        &lt;span class="c1"&gt;# import RPi.GPIO as GPIO
&lt;/span&gt;        &lt;span class="c1"&gt;# GPIO.output(LAMP_PIN, GPIO.HIGH)
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TURN_OFF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💡 Turning lamp OFF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET_TIME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🕒 The time is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STOP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;👋 Stopping assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- Intent detection (fast) ----
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# need at least 1 sec
&lt;/span&gt;            &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_buffer&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;  &lt;span class="c1"&gt;# last 1 sec
&lt;/span&gt;            &lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;predict_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# ignore low‑confidence guesses
&lt;/span&gt;                &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;INTENT_MAP&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Intent] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;execute_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ---- Whisper transcription (every 2 sec) ----
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_buffer&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;SAMPLE_RATE&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transcribe_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Transcription] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# tiny pause to keep CPU happy
&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🛑 Assistant stopped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio callback&lt;/strong&gt; continuously fills &lt;code&gt;audio_buffer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Every loop iteration we grab a 1‑second slice, run the quantized intent model, and act immediately on high‑confidence predictions.&lt;/li&gt;
&lt;li&gt;Every 2 seconds we feed a larger slice to Whisper for a full transcription – useful for debugging or for commands that need more context.&lt;/li&gt;
&lt;li&gt;The script exits gracefully on “stop listening”.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  6. Optimizing for Real‑World Use
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Quick win&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU usage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;torch.set_num_threads(2)&lt;/code&gt; to limit Whisper’s thread count.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switch the CPU frequency governor to &lt;code&gt;powersave&lt;/code&gt; when the assistant is idle (e.g., via &lt;code&gt;cpufrequtils&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a simple high‑pass filter (&lt;code&gt;scipy.signal.butter&lt;/code&gt;) to remove rumble; see the sketch after this table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Swap Whisper‑small for Whisper‑tiny if RAM is a bottleneck.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hotword detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep the intent model always on; only invoke Whisper after a hotword is detected (e.g., “hey pi”).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
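
&lt;p&gt;As an example, here is a sketch of the high‑pass filter mentioned in the table. It assumes &lt;code&gt;scipy&lt;/code&gt; is installed (&lt;code&gt;pip install scipy&lt;/code&gt;), and the 100 Hz cutoff is an illustrative choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from scipy.signal import butter, sosfilt

# Design a 4th-order Butterworth high-pass once, for 16 kHz audio
sos = butter(4, 100, btype="highpass", fs=SAMPLE_RATE, output="sos")

def remove_rumble(waveform):
    """Strip low-frequency rumble before feeding audio to the models."""
    return sosfilt(sos, waveform)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;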




&lt;h2&gt;
  
  
  7. Deploying as a System Service
&lt;/h2&gt;

&lt;p&gt;Running the script manually is fine for testing, but for a production‑grade assistant you’ll want it to start on boot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/systemd/system/voice-assistant.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Edge Voice Assistant&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/pi/voice-assistant&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/pi/voice-assistant/venv/bin/python3 assistant.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;on-failure&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;pi&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable and start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;voice-assistant.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start voice-assistant.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check logs with &lt;code&gt;journalctl -u voice-assistant -f&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whisper‑small can run on a Raspberry Pi 4 in real time when you limit beam size and use CPU‑only inference.
&lt;/li&gt;
&lt;li&gt;A tiny 1‑D ConvNet, post‑training quantized to TensorFlow Lite, provides millisecond‑scale intent detection (over 100 inferences per second on the Pi).
&lt;/li&gt;
&lt;li&gt;Using a circular buffer with &lt;code&gt;sounddevice&lt;/code&gt; lets you stream audio without dropping frames.
&lt;/li&gt;
&lt;li&gt;Combining a fast intent classifier with occasional Whisper transcriptions yields both low latency and high accuracy.
&lt;/li&gt;
&lt;li&gt;Packaging the script as a systemd service makes the assistant start automatically and stay resilient.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>raspberrypi</category>
      <category>tensorflowlite</category>
    </item>
  </channel>
</rss>
