Guatu

Posted on • Originally published at guatulabs.dev

Condition-Based vs Time-Based Maintenance: Making the Switch

I spent a weekend reviewing a maintenance log for a conveyor system that was costing thousands in "preventative" parts replacements every quarter, only to find that the technicians were throwing away bearings that had 60% of their life left. At the same time, a motor had burned out three weeks before its scheduled service because it had been running hot for a month, but the calendar said it wasn't time to check it yet.

Time-based maintenance is a gamble where you bet that the average failure rate of a component matches the actual failure rate of your specific machine. In the real world, that bet usually loses.

If you're managing industrial assets, you've likely lived through this. You either over-maintain, wasting money and introducing "infant mortality" failures by disturbing a working system, or you under-maintain and deal with unplanned downtime. The move to Condition-Based Maintenance (CBM) is the only way out, but the gap between the theory of "predictive maintenance" and a working system on the factory floor is massive.

What I tried first

My first attempt at CBM was naive. I thought I could just slap a few sensors on the equipment, pipe the data into a dashboard, and let the operators decide when to perform maintenance. I set up a basic MQTT pipeline using Mosquitto (which I've written about before regarding broker selection) and pushed raw vibration and temperature data to a Grafana dashboard.

It failed miserably.

First, I created a "noise apocalypse." I had alerts firing every time a sensor spiked for a millisecond due to electrical noise. The operators started ignoring the alerts entirely. Second, I didn't define what "bad" actually looked like. I was giving them raw data, not actionable intelligence. An operator doesn't care if a motor is at 72 degrees Celsius; they care if 72 degrees is a 10% increase over the baseline for that specific load.

I also tried to automate the ticketing system using simple cron jobs that checked for thresholds every hour. This led to a flood of "HEARTBEAT_OK" messages in the logs and redundant tickets. I was basically just building a more expensive version of a time-based system, just with different triggers.

The actual solution

The shift happens when you stop treating sensors as "alarms" and start treating them as "state providers." You need a pipeline that filters noise, establishes a baseline, and triggers actions based on deviations rather than arbitrary numbers.

1. Filtering the Noise

Instead of raw thresholds, I implemented a sliding window average. If you're using Python for your edge processing, don't just trigger on val > threshold. Use a buffer.

import collections

class SensorMonitor:
    def __init__(self, threshold, window_size=10):
        self.threshold = threshold
        self.window = collections.deque(maxlen=window_size)

    def is_anomaly(self, current_value):
        self.window.append(current_value)
        if len(self.window) < self.window.maxlen:
            return False

        # Calculate moving average to ignore transient spikes
        avg = sum(self.window) / len(self.window)
        return avg > self.threshold

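# Example: trigger maintenance only if the average
# vibration stays high over 10 consecutive readings
def trigger_maintenance_alert(message):
    # Placeholder: swap in your ticketing or notification call
    print(message)

monitor = SensorMonitor(threshold=10.5)
readings = [9.8] * 5 + [11.2] * 15  # sample values; replace with live sensor data
for current_vibration in readings:
    if monitor.is_anomaly(current_vibration):
        trigger_maintenance_alert("Sustained high vibration detected")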

2. Condition-Based Escalation Rules

Once the data is clean, you can't just send an email. You need escalation logic that understands the context. I moved away from simple cron-based alerts to a condition-based rule engine. This is similar to how I handle equipment health scoring, where we consolidate multiple signals into one status.

Here is how I structured the escalation logic in the configuration:

# Condition-based escalation rules for maintenance tickets
escalation_rules:
  - condition: "sensor.vibration > 12.0 AND asset.criticality == 'high'"
    action: "immediate_dispatch"
    priority: 1
  - condition: "sensor.temp_deviation > 15% AND ticket.age > 4h"
    action: "notify_maintenance_lead"
    priority: 2
  - condition: "sensor.vibration > 8.0 AND ticket.age > 24h"
    action: "schedule_inspection_next_shift"
    priority: 3
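A minimal sketch of how rules like these could be evaluated, assuming the conditions are written as Python predicates rather than parsed from strings. The names `Reading`, `Rule`, `RULES`, and `first_matching_action` are hypothetical and only illustrate the idea; this is not the rule engine described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reading:
    """Snapshot of one asset; field names are illustrative assumptions."""
    vibration: float           # same units as the thresholds in the rules
    temp_deviation_pct: float  # percent above the asset's baseline temperature
    criticality: str           # e.g. "high", "medium", "low"
    ticket_age_hours: float    # age of the open ticket, 0 if none


@dataclass
class Rule:
    priority: int
    action: str
    matches: Callable[[Reading], bool]


# Mirrors the three rules from the config, expressed as predicates.
RULES = [
    Rule(1, "immediate_dispatch",
         lambda r: r.vibration > 12.0 and r.criticality == "high"),
    Rule(2, "notify_maintenance_lead",
         lambda r: r.temp_deviation_pct > 15 and r.ticket_age_hours > 4),
    Rule(3, "schedule_inspection_next_shift",
         lambda r: r.vibration > 8.0 and r.ticket_age_hours > 24),
]


def first_matching_action(reading: Reading) -> Optional[str]:
    """Return the action of the highest-priority matching rule, or None."""
    for rule in sorted(RULES, key=lambda rule: rule.priority):
        if rule.matches(reading):
            return rule.action
    return None
```

Returning only the highest-priority match is one possible design choice; it keeps a single asset from opening three tickets at once.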

3. Optimizing the Alerting Pipeline

To stop the "alert fatigue" I mentioned earlier, I overhauled the cron jobs that monitored the system health. I stopped the unconditional "Everything is OK" messages and moved to a "silent success" model.

# Optimized payload for condition-based alerting
# Only sends a notification if the status is not 'success'
payload:
  message: "Asset {{ asset_id }} monitoring failed with status: {{ status }}"
  condition: "status != 'success'"
  # The system stays silent when status is 'success' (the condition above is false)
  reply: "{{ 'All assets healthy' if status == 'success' }}"
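The same "silent success" idea can be sketched in a few lines of Python. This is an illustration only; `build_notification` is a hypothetical helper, not part of any alerting tool.

```python
from typing import Optional


def build_notification(asset_id: str, status: str) -> Optional[str]:
    """Return a message only for non-success statuses; None means stay silent."""
    if status == "success":
        return None
    return f"Asset {asset_id} monitoring failed with status: {status}"
```

The caller sends a notification only when the return value is not `None`, so healthy checks produce no traffic at all.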

Why it works

Time-based maintenance assumes that parts degrade at a steady, predictable rate. In reality, degradation is stochastic. A bearing might last 10,000 hours or 100 hours depending on the lubrication quality and the load it carries.

By moving to CBM, you're monitoring the actual degradation. When you track vibration (using the architecture I've detailed here), you're seeing the physical manifestation of wear (pitting, spalling, or misalignment) long before the part actually fails.

The logic of using a sliding window and deviation-based thresholds works because it separates the signal from the noise. In an industrial environment, electrical interference is a constant. A single high reading is usually a fluke; a sustained increase in the moving average is a mechanical reality.
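The deviation-based half of that logic can be sketched in the same style as the sliding window. This is a minimal sketch, assuming a known-good baseline has already been recorded for the load in question; the class name `BaselineMonitor`, the 65.0 baseline, and the 10% limit are assumptions chosen for illustration.

```python
import collections


class BaselineMonitor:
    """Flags a sustained relative deviation from a known-good baseline."""

    def __init__(self, baseline, max_deviation=0.10, window_size=10):
        if baseline <= 0:
            raise ValueError("baseline must be positive")
        self.baseline = baseline
        self.max_deviation = max_deviation  # 0.10 means 10% above baseline
        self.window = collections.deque(maxlen=window_size)

    def deviation(self):
        """Relative deviation of the moving average from the baseline."""
        avg = sum(self.window) / len(self.window)
        return (avg - self.baseline) / self.baseline

    def is_anomaly(self, current_value):
        self.window.append(current_value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        return self.deviation() > self.max_deviation


# Sample temperatures: one brief spike to 80, then a sustained rise to ~72.
monitor = BaselineMonitor(baseline=65.0, window_size=5)
readings = [65.0, 66.0, 80.0, 65.0, 64.0, 72.0, 73.0, 72.0, 73.0, 72.0]
flags = [monitor.is_anomaly(value) for value in readings]
```

With these sample values the lone spike never trips the monitor, and only the last reading is flagged, once the whole window sits more than 10% above the baseline.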

Also, the condition-based escalation rules prevent the "crying wolf" effect. By tying the action to both the sensor value and the asset's criticality, you ensure that the maintenance team only drops what they're doing when it actually matters.

Lessons learned

The biggest surprise was that the hardware wasn't the hard part. Getting the sensors to talk via MQTT is trivial. The hard part is the cultural shift. Operators who have spent twenty years changing oil every three months don't trust a dashboard telling them they can wait another two months.

If I did this again, I'd start with a "shadow period." I would run the CBM system in parallel with the time-based schedule for six months. I'd log every time the CBM system predicted a failure and every time the time-based schedule replaced a perfectly good part. Having that data is the only way to convince a skeptical plant manager to change the schedule.
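A shadow-period log does not need much tooling. Here is a minimal sketch, assuming each event is recorded as a hypothetical `ShadowEvent`; the field names, event kinds, and sample entries are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowEvent:
    asset_id: str
    # One of: "cbm_warning_confirmed", "cbm_false_alarm", "good_part_replaced"
    kind: str


def summarize(events):
    """Count shadow-period events per kind for the end-of-period review."""
    return Counter(event.kind for event in events)


# Made-up sample entries, not real plant data.
events = [
    ShadowEvent("conveyor-1", "good_part_replaced"),
    ShadowEvent("motor-3", "cbm_warning_confirmed"),
    ShadowEvent("conveyor-2", "good_part_replaced"),
    ShadowEvent("motor-3", "cbm_false_alarm"),
]
summary = summarize(events)
```

Tracking false alarms alongside confirmed warnings keeps the comparison honest when the numbers go in front of a plant manager.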

A few other caveats:

  • Sensor Drift: Sensors fail too. If you rely solely on CBM, a failing sensor can look like a failing motor. You still need a basic time-based schedule for sensor calibration.
  • The "Silent" Trap: When you move to a "silent success" alerting model, you run the risk of not knowing if your monitoring system has died. I fixed this by implementing a dead-man's switch (heartbeat) that alerts if the monitoring service itself stops reporting.
  • Data Overload: Don't try to monitor everything. Pick the top 20% of assets that cause 80% of your downtime. Trying to implement CBM on every single small fan in the building is a waste of engineering hours.
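The dead-man's switch from the second caveat could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation referred to above: `DeadMansSwitch` is a hypothetical class, and the timeout is whatever suits your polling interval.

```python
import time


class DeadMansSwitch:
    """Reports a stale monitor when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_beat = clock()

    def beat(self):
        """Called by the monitoring service on every successful cycle."""
        self.last_beat = self.clock()

    def is_stale(self):
        """True if the monitoring service has been silent for too long."""
        return self.clock() - self.last_beat > self.timeout_seconds
```

The `is_stale()` check should run somewhere other than the monitoring process itself, otherwise a crash takes the watchdog down along with the thing it watches.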

For those looking to implement this at scale, you can check my services page for consulting on predictive maintenance and IIoT infrastructure. Moving from a calendar to a condition is a steep climb, but it's the only way to stop wasting money on parts that aren't broken.
