DEV Community

Onuh Chidera Theola

What Went Wrong in My Distributed Sensor System — And What It Taught Me

I'm a noob to distributed systems — and I learned that the hard way.

Distributed architecture has always fascinated me, so I decided to refactor one of my projects into a microservice-based system despite having little practical experience. What followed was a series of failures that taught me more about distributed systems than any tutorial could.

This article walks through the challenges I faced while setting up my distributed microservice system, how I solved them, and what they taught me about building resilient services.

But first, a quick explanation of what a distributed architecture actually is.

What is a distributed architecture?

A distributed architecture describes a system built from independently deployed services that collaborate to function as a single application. Instead of relying on one large application running on a single machine, responsibilities are split across multiple services. This improves scalability, fault tolerance, and flexibility — but it introduces a new class of problems: coordination, communication failures, and partial system availability.

In a microservice architecture, each service owns its deployment lifecycle and often its data storage. Containerization tools like Docker make this possible by allowing services to run independently while communicating over defined network interfaces.

InFlow Structure

InFlow is a monitoring platform for IoT devices. It was originally built as a monolith, but it has recently been undergoing a refactor to add more features. This time around, I chose a microservice architecture, for the very reasons microservices are useful: each concern can be scaled, deployed, and debugged independently.

Below is an overview of the new architecture and proposed features.

| Service/Component | Framework | Primary Protocol(s) | Role & Data Flow |
| --- | --- | --- | --- |
| Django Core | Django | HTTP, WebSockets | Source of Truth (SoT). Manages user/device metadata in PostgreSQL. Generates the unique Sensor ID and MQTT credentials for provisioning. |
| Ingestion Service | FastAPI | MQTT | Asynchronous listener. Subscribes to the MQTT broker, receives high-volume sensor data, and writes directly to InfluxDB. |
| Analytics Service (AI Agent) | FastAPI | Gemini API (LLM) | Scheduled batch job. Queries InfluxDB daily, calls the LLM for summarization, and writes the resulting text report to Django's PostgreSQL. |
| Databases | PostgreSQL / InfluxDB | N/A | PostgreSQL: stores low-volume, high-integrity user/device metadata. InfluxDB: stores high-volume, high-velocity time-series sensor readings. |

For more information, you can visit my documentation here.

The Flow

Currently, my setup only has two services:

  1. Main user/sensor service (written in Django)
  2. The MQTT/InfluxDB service (written in FastAPI)

Conceptually, the MQTT service acts as a gatekeeper, validating sensors before allowing data ingestion.

```
Sensor → Mosquitto → MQTT Service → Django Service
                          ↓
                      InfluxDB
```

The structure of the services is explained as follows:

  • MQTT-SERVICE → sends a request to verify the status of a sensor via HTTP to the MAIN-APP-SERVICE
  • MAIN-APP-SERVICE → verifies sensors and sends a boolean (true/false) to MQTT-SERVICE
  • MQTT-SERVICE → accepts or rejects the connection to Mosquitto depending on the sensor's status
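Boiled down, the gatekeeper check is just this shape. Here is a sketch with hypothetical names, where `verify_status` stands in for the HTTP round trip to the main app (none of these identifiers are from the real InFlow code):

```python
def handle_incoming(sensor_id: str, verify_status) -> bool:
    """Accept or reject an incoming message based on the sensor's status.

    `verify_status` stands in for the HTTP call from MQTT-SERVICE to
    MAIN-APP-SERVICE (hypothetical helper, not the real InFlow code).
    """
    try:
        is_active = verify_status(sensor_id)  # MQTT-SERVICE -> MAIN-APP-SERVICE
    except ConnectionError:
        return False  # fail closed if the main app is unreachable
    return bool(is_active)  # accept or reject the Mosquitto message
```

Failing closed when the main app is unreachable is one design choice; as the rest of this article shows, it has consequences.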

By default, Mosquitto already authenticates sensors based on their unique MQTT_KEY, but as an added authentication procedure — for the scenario where an inactive sensor still has access to the Mosquitto publish pool — I wrote a verify_sensor_status function that checks a sensor's status against the main app service.

Inside a listener function, I call this function to verify sensor status from the main app service before proceeding with the Mosquitto connection. Now, on paper, this setup should be a slam dunk, right? Well...

Problem #1 — The Startup Race

I had unknowingly encountered one of the fundamental truths of distributed systems:

A dependency being configured does not mean it is available.

```
mqtt_service-1 | 2026-02-22 09:55:23.639 | INFO  | src.handler:mqtt_listener:79 - Topic is received
mqtt_service-1 | Cache miss for sensor_id: 1d0d6724-995a-43b5-819c-652cfdc5e329 mqtt_username: None
mqtt_service-1 | Verifying sensor status with params: {'sensor_id': '1d0d6724-995a-43b5-819c-652cfdc5e329'}
mqtt_service-1 | 2026-02-22 09:55:23.866 | ERROR | src.auth:verify_sensor_status:109 - HTTP error during sensor status verification: All connection attempts failed
mqtt_service-1 | 2026-02-22 09:55:23.866 | ERROR | src.handler:mqtt_listener:92 - MQTT_ERROR: Inactive or unknown sensor 1d0d6724-995a-43b5-819c-652cfdc5e329. Rejecting message.
```

I hit a snag — my first real distributed systems failure. The problem arises when a service depends on another system that is not yet ready. Because of my current setup, the MQTT service relied on the main app service being alive and ready to verify the sensor. I never considered that the MQTT service would start before the main service.

So it made the request once — before the main app had started — and never retried again.

I laughed at myself when I figured out what was happening.
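In hindsight, the general guard against a startup race is simple: treat "dependency is ready" as something you poll for, never something you assume. A minimal sketch of the idea (`wait_for_dependency` and `probe` are hypothetical names, not my actual code):

```python
import time

def wait_for_dependency(probe, attempts: int = 10, delay: float = 0.5) -> bool:
    """Poll a readiness probe until it succeeds or we give up.

    `probe` is any zero-argument callable that returns True once the
    dependency is ready (e.g. a health-endpoint check).
    """
    for attempt in range(attempts):
        if probe():
            return True
        time.sleep(delay)  # real code would back off and log here
    return False
```

Run this before the first verification request, and "the main app isn't up yet" becomes an explicit, bounded wait instead of a single silent failure.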

Problem #2 — Negative Caching

```python
@alru_cache(maxsize=128)
```

Yes, my cache was another problem. Technically it wasn't a problem on its own, but a consequence of the first. I had configured the cache to store the latest status of each sensor, so even when the verification call itself failed, the failure was recorded as status=INACTIVE. That was bad: it hid my real errors and returned False for everything.

This is known as negative caching — where failures are cached as valid results. While useful in some systems, it can silently mask infrastructure problems when applied incorrectly.
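Here is a tiny, self-contained illustration of the trap (`NaiveStatusCache` is a hypothetical stand-in for my setup, not the real code): a transient outage gets frozen into the cache, and the service's later recovery is invisible until the TTL expires.

```python
import time

class NaiveStatusCache:
    """Caches *every* result, including failures — the negative-caching trap."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store: dict[str, tuple[bool, float]] = {}

    def get_status(self, sensor_id: str, fetch) -> bool:
        entry = self._store.get(sensor_id)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]  # cache hit — even if what we cached was a failure
        try:
            status = fetch(sensor_id)
        except ConnectionError:
            status = False  # the bug: a failure silently recorded as "inactive"
        self._store[sensor_id] = (status, time.time())
        return status
```

If `fetch` fails once while the dependency is starting up, every lookup for the next five minutes returns False, no matter how healthy the dependency is now.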

```
Cache hit → False
No HTTP retry happens
```

Problem #3 — Conflating Orchestration with Resilience

I assumed a Docker restart would automatically fix all my issues, so in my docker-compose.yml I set every service to restart: unless-stopped, believing that container restarts implied retry logic. They don't.

Docker ensures processes stay alive — not that inter-service communication succeeds.
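What Compose can do is order startup with health checks, and even that only covers boot time, not runtime failures. A sketch, with service names, ports, and the /health endpoint assumed for illustration:

```yaml
services:
  main-app:
    build: ./main_app
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 5s
      retries: 5

  mqtt-service:
    build: ./mqtt_service
    restart: unless-stopped
    depends_on:
      main-app:
        condition: service_healthy  # delays startup only; not a retry policy
```

Even with `condition: service_healthy`, if main-app dies after startup, mqtt-service still has to handle the failure in its own application code.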

Fixes

Fix #1 — Explicit State Modeling

Instead of returning a single boolean (true/false), I decided to return separate, explicit states for better error handling. I created an Enum class with all possible sensor statuses:

```python
from enum import Enum

class SensorStatus(str, Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    MAINTENANCE = "maintenance"
    DECOMMISSIONED = "decommissioned"
    UNKNOWN = "unknown"
```

This fixed the issue with hidden errors. Moving from a boolean to an explicit state model allowed failures to be distinguished from legitimate inactive sensors. Now, for each exception, the sensor status is set to UNKNOWN:

```python
except httpx.RequestError as e:
    logger.error(f"Service unreachable: {e}")
    return SensorStatus.UNKNOWN
```

Fix #2 — Smarter Caching

Next, I fixed the caching issue. Instead of caching failed attempts, the function now only caches successful results:

```python
cached_key = sensor_id
sensor_status_cache[cached_key] = (status, time.time())
return status  # only if status is successful
```

Only successful responses should be cached. Failures must remain retryable.
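The rule can be captured in a self-contained sketch (`get_status_cached` and `fetch` are hypothetical names illustrating the idea, not the real InFlow code): a result is written to the cache only when it is definitive.

```python
import time

sensor_status_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL = 300  # seconds

def get_status_cached(sensor_id: str, fetch) -> str:
    """Return a sensor status, caching only definitive (non-"unknown") results."""
    entry = sensor_status_cache.get(sensor_id)
    if entry is not None and time.time() - entry[1] < CACHE_TTL:
        return entry[0]
    status = fetch(sensor_id)   # may return "unknown" when verification fails
    if status != "unknown":     # failures stay retryable on the next call
        sensor_status_cache[sensor_id] = (status, time.time())
    return status
```

An "unknown" result never enters the cache, so the very next lookup goes back to the network and can pick up the recovered service.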

Fix #3 — Retry with Exponential Backoff

Finally, to fix the root cause of all of this, I implemented a retry mechanism in the verify_sensor_status function.

```python
response = None

for attempt in range(5):  # Retry mechanism
    try:
        logger.info(f"Verifying sensor status for {sensor_id} with user service. Attempt {attempt + 1}")
        async with httpx.AsyncClient() as client:
            response = await client.get(verify_endpoint, params=params, timeout=10.0)
            response.raise_for_status()
            break
    except httpx.HTTPError as e:  # base class of RequestError and TimeoutException
        logger.error(f"Attempt {attempt + 1}: Error during sensor status verification: {e}")
        await asyncio.sleep(2 ** attempt)  # Exponential backoff
```
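One refinement I didn't need yet, but worth knowing: if many sensors retry in lockstep, plain `2 ** attempt` delays synchronize into retry storms against the recovering service. Adding "full jitter" spreads them out. A sketch (`backoff_delays` is a hypothetical helper, not part of my code):

```python
import random

def backoff_delays(attempts: int = 5, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter: each delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so concurrent clients desynchronize."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

Swapping `asyncio.sleep(2 ** attempt)` for a jittered delay keeps the same average wait while avoiding the thundering-herd effect.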

Here is the full updated function:

```python
async def verify_sensor_status(sensor_id: str | None = None) -> SensorStatus:
    """
    Verifies if a sensor is active using an external user service.

    Args:
        sensor_id: The ID of the sensor to verify.
    Returns:
        SensorStatus: The current status of the sensor.
    """

    # Check cache first (successful results only, with a TTL)
    if sensor_id and sensor_id in sensor_status_cache:
        cached_status, cached_time = sensor_status_cache[sensor_id]
        if time.time() - cached_time < 300:  # 5 minute TTL
            logger.debug(f"Cache hit for sensor_id: {sensor_id}")
            return cached_status

    logger.debug(f"Cache miss for sensor_id: {sensor_id}")
    user_service_url = os.getenv("MAIN_APP_API_URL", "http://localhost:8000")
    verify_endpoint = f"{user_service_url}/api/v2/sensors/verify_status"

    params = {"sensor_id": sensor_id}
    response = None

    for attempt in range(5):  # Retry mechanism
        try:
            logger.info(f"Verifying sensor status for {sensor_id} with user service. Attempt {attempt + 1}")
            async with httpx.AsyncClient() as client:
                response = await client.get(verify_endpoint, params=params, timeout=10.0)
                response.raise_for_status()
                break
        except httpx.HTTPError as e:  # base class of RequestError and TimeoutException
            logger.error(f"Attempt {attempt + 1}: Error during sensor status verification: {e}")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
        except Exception as e:
            logger.error(f"Attempt {attempt + 1}: Unexpected error during sensor status verification: {e}")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    if response is None or not response.is_success:
        logger.error(f"Failed to verify sensor status for {sensor_id} after multiple attempts.")
        return SensorStatus.UNKNOWN

    # Parse response
    logger.info(f"Received response from main sensor service: {response.status_code} - {response.text[:200]}")
    try:
        data = response.json()
    except ValueError:
        logger.error(
            f"Invalid JSON response from main app: "
            f"status={response.status_code}, body={response.text[:200]}"
        )
        return SensorStatus.UNKNOWN

    status = SensorStatus.ACTIVE if data else SensorStatus.INACTIVE

    # Update cache with the successful result (failures return early above
    # and are never cached, so they stay retryable)
    logger.info(f"Caching sensor status for {sensor_id}: {status}")
    sensor_status_cache[sensor_id] = (status, time.time())
    return status
```

Lessons Learned

This experience taught me several core principles of distributed systems:

1. Services are not instantly available
The startup order cannot be trusted. Always assume dependencies may be temporarily unreachable.

2. Network calls will fail
Failures are normal behavior in distributed systems, not edge cases.

3. Cache carefully
Caching errors can hide real problems and make debugging significantly harder.

4. Orchestration is not resilience
Docker keeps services running, but resilience must be implemented in application logic.

5. Explicit states beat booleans
Modeling system states clearly improves observability and makes debugging far easier.


Closing

What started as a simple refactor became my first real lesson in distributed systems engineering. I learned that building distributed software isn't just about splitting services; it's about designing for failure, uncertainty, and partial availability.

My system now retries intelligently, handles transient failures gracefully, and communicates state more clearly between services.

And most importantly, I now understand why distributed systems are both powerful and humbling.


Thank you for reading! Feedback is welcome via the comments or GitHub issues on the repo. If you found this helpful, please star the repository.

PS: The branch to check out for the code is the refactor branch.
