Operator Monitoring and Health Checks for Tangle Blueprint Nodes
Tangle is a programmable infrastructure network where operators run modular services called Blueprints and get selected for paid jobs based on on-chain health signals. Three signals govern whether your operator stays in rotation: on-chain heartbeats submitted through the OperatorStatusRegistry, quote freshness (5-minute default lifetime, 1-hour protocol maximum), and capacity utilization (activeLoad vs. maxCapacity). The orchestrator's health tracker maintains a rolling window of the last 10 job outcomes, requiring a minimum 30% success rate to stay selectable, with a 60-second cooldown after each failure. Fall behind on any one of these and the orchestrator silently skips you. Jobs stop arriving with no error message and no notification.
Getting a Blueprint operator online is the easy part. Keeping it selected for jobs is where most operators quietly fail. The thresholds governing selection aren't obvious from the outside. They live in the orchestrator's health tracker, in on-chain registry contracts, and in protocol constants spread across multiple config files. This post pulls those numbers into one place and explains how they interact, so you can build monitoring that catches problems before they cost you revenue.
Heartbeats: the on-chain pulse check
Operators prove liveness by submitting heartbeats on-chain through the OperatorStatusRegistry. Each heartbeat carries a service ID, blueprint ID, status code, and an optional metrics payload. The registry tracks four values per operator per service:
- lastHeartbeat: timestamp of the most recent submission
- consecutiveBeats: how many in a row without a miss
- missedBeats: accumulated misses
- lastMetricsHash: hash of the most recent metrics payload
Two functions expose the result: isHeartbeatCurrent() returns whether the latest heartbeat is within the configured interval, and isOnline() combines heartbeat freshness with status code to give a binary liveness signal.
Status codes and what they mean
The protocol defines five status codes:
| Code | Name | Range Convention |
|---|---|---|
| 0 | Healthy | Exactly 0 |
| 1 | Degraded | 1-99 |
| 2 | Offline | 100+ |
| 3 | Slashed | 200+ (slashable) |
| 4 | Exiting | Voluntary exit in progress |
The range convention matters. A status of 15 still reads as "degraded" to the registry, but you can use the specific number to encode granularity for your own monitoring. Some operators use values like 10 for "high memory pressure" and 20 for "disk nearing capacity" while staying in the degraded band.
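As a sketch of that idea, here is one way an operator might encode local node conditions into the degraded band. The condition names and the specific codes (10, 20, 100) are illustrative choices, not protocol-defined values; only the band boundaries come from the table above.

```typescript
// Hypothetical mapping of local node conditions onto the registry's
// status-code bands: 0 = healthy, 1-99 = degraded, 100+ = offline.
// Condition names and thresholds here are illustrative, not protocol-defined.
type NodeConditions = {
  reachable: boolean;       // can the node serve requests at all?
  memoryPressure: boolean;  // e.g. RSS above a soft limit
  diskNearFull: boolean;    // e.g. >90% of the data volume used
};

function encodeStatus(c: NodeConditions): number {
  if (!c.reachable) return 100;     // offline band
  if (c.memoryPressure) return 10;  // degraded band: memory
  if (c.diskNearFull) return 20;    // degraded band: disk
  return 0;                         // healthy
}
```

Downstream dashboards can then distinguish "degraded because of memory" from "degraded because of disk" while the registry still sees both as the same band.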
Each service configures its own HeartbeatConfig with an interval (how often heartbeats are expected), a maxMissed threshold (how many misses before consequences), and an optional customMetrics flag. This is per-service, not global, so an operator running three different blueprints might have three different heartbeat cadences.
Querying your own state
You can inspect your operator's on-chain health at any time:
```solidity
// Binary liveness checks
registry.isHeartbeatCurrent(serviceId, operatorAddr);
registry.isOnline(serviceId, operatorAddr);

// Full state inspection
registry.getOperatorState(serviceId, operatorAddr);
// Returns: lastHeartbeat, consecutiveBeats, missedBeats, status, lastMetricsHash

// Individual metric values (if customMetrics enabled)
registry.getMetricValue(serviceId, operatorAddr, "cpu_utilization");
```
The getMetricDefinitions(serviceId) call returns the schema for a service's expected metrics, including name, minValue, maxValue, and whether each metric is required. If your blueprint defines required metrics and you stop submitting them, the registry logs the violation on-chain.
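If you poll getOperatorState periodically, two successive reads are enough to catch drift early. The sketch below is a hypothetical client-side helper, not part of any SDK; the field names mirror the registry state described above.

```typescript
// Sketch: classify heartbeat drift from two successive getOperatorState reads.
// The OperatorState fields mirror the registry state described in this post;
// the detectDrift helper itself is illustrative, not a protocol API.
type OperatorState = {
  lastHeartbeat: number;    // unix seconds
  consecutiveBeats: number;
  missedBeats: number;
  status: number;
};

function detectDrift(prev: OperatorState, curr: OperatorState): string | null {
  if (curr.missedBeats > prev.missedBeats) return "missedBeats incremented";
  if (curr.consecutiveBeats < prev.consecutiveBeats) return "consecutiveBeats reset";
  if (prev.status === 0 && curr.status !== 0) return "status left healthy band";
  return null; // no early-warning signal in this interval
}
```

Any non-null result here corresponds to Stage 1 of the degradation cascade discussed later, which is the cheapest point to intervene.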
The health tracker's rolling window
On-chain heartbeats are only half the picture. The orchestrator that routes jobs to operators maintains its own health tracker with a separate, more aggressive set of thresholds.
The health tracker is a rolling window of the last 10 job outcomes per operator (HEALTH_WINDOW_SIZE = 10). From that window, it computes a success rate. If the rate drops below 30% (MIN_SUCCESS_RATE = 0.3), the operator is marked unhealthy and stops receiving new work.
A 30% floor sounds lenient, and it is, deliberately. The system tolerates occasional failures (network blips, transient resource pressure) without immediately pulling an operator from rotation. But the companion mechanism is less forgiving: after any failure, the operator enters a 60-second cooldown (FAILURE_COOLDOWN = 60s) during which it won't receive jobs regardless of its overall success rate.
The practical effect: a single failure costs you at least a minute of downtime. Three failures in quick succession won't necessarily drop you below the 30% floor (depending on your recent history), but the cooldowns stack. If you're failing every other job, you're spending half your time in cooldown even though your 50% success rate is above the threshold.
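The interaction of the window, the floor, and the cooldown can be sketched in a few lines. This is a minimal model of the behavior described above, using the constants the post cites; the class and method names are ours, not the orchestrator's.

```typescript
// Minimal sketch of the orchestrator's health-tracking behavior as described:
// a 10-outcome rolling window, a 30% success-rate floor, a 60s cooldown after
// each failure, and a grace period for operators with fewer than 3 outcomes.
// Class and method names are illustrative.
const HEALTH_WINDOW_SIZE = 10;
const MIN_SUCCESS_RATE = 0.3;
const FAILURE_COOLDOWN_MS = 60_000;

class HealthTracker {
  private outcomes: boolean[] = [];
  private cooldownUntil = 0;

  record(success: boolean, nowMs: number): void {
    this.outcomes.push(success);
    if (this.outcomes.length > HEALTH_WINDOW_SIZE) this.outcomes.shift();
    if (!success) this.cooldownUntil = nowMs + FAILURE_COOLDOWN_MS;
  }

  successRate(): number {
    if (this.outcomes.length === 0) return 1; // no history yet
    return this.outcomes.filter(Boolean).length / this.outcomes.length;
  }

  selectable(nowMs: number): boolean {
    if (nowMs < this.cooldownUntil) return false; // still cooling down
    if (this.outcomes.length < 3) return true;    // new-operator grace period
    return this.successRate() >= MIN_SUCCESS_RATE;
  }
}
```

Note how a single failure makes the operator unselectable for a full minute even while its success rate is untouched, which is exactly the "cooldowns stack" effect described above.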
How selection works
When the orchestrator needs to assign a job, it sorts available operators using three criteria in priority order:
- Health status: healthy operators before unhealthy
- Available capacity: operators with more open slots first
- Success rate: higher rates preferred as a tiebreaker
Available slots are computed as maxCapacity - activeLoad. If your activeLoad equals your maxCapacity, you won't be selected regardless of health. This is the capacity utilization number, and it's the one most operators forget to monitor.
The health summary for each operator looks like this:
```typescript
type OperatorHealthSummary = {
  address: `0x${string}`;
  successRate: number;
  healthy: boolean;
  activeLoad: number;
  maxCapacity: number;
  availableSlots: number;
  onCooldown: boolean;
};
```
New operators with fewer than 3 recorded outcomes are assumed healthy. This grace period lasts until your third job completes.
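The three-criteria ordering can be sketched as a comparator over that summary shape. This is an illustrative reconstruction of the priority order described above, not the orchestrator's actual code.

```typescript
// Sketch of the selection ordering described above: healthy operators first,
// then more available slots, then higher success rate as the tiebreaker.
// Illustrative only; not the orchestrator's implementation.
type OperatorHealthSummary = {
  address: string;
  successRate: number;
  healthy: boolean;
  activeLoad: number;
  maxCapacity: number;
  availableSlots: number;
  onCooldown: boolean;
};

function bySelectionPriority(
  a: OperatorHealthSummary,
  b: OperatorHealthSummary
): number {
  if (a.healthy !== b.healthy) return a.healthy ? -1 : 1; // health first
  if (a.availableSlots !== b.availableSlots)
    return b.availableSlots - a.availableSlots;           // more open slots first
  return b.successRate - a.successRate;                   // success rate breaks ties
}
```

Sorting a candidate list with this comparator puts the operator the orchestrator would pick at index 0.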
Quote lifetimes: the tuning surface nobody talks about
When a user requests a job, the operator's pricing engine generates a quote. That quote has a lifetime, and getting the lifetime right matters more than most operators realize.
Quote validity duration is the number of seconds a price quote remains acceptable to the protocol after generation. The default is 5 minutes (quote_validity_duration_secs: 300 in operator.toml). The protocol enforces a hard cap of 1 hour (MAX_QUOTE_AGE = 1 hours in ProtocolConfig.sol). Anything between those bounds is your call.
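A freshness check using those two bounds is straightforward. The function below is a sketch under the stated constants; the function name and the clamping behavior for oversized lifetimes are our assumptions, not documented protocol behavior.

```typescript
// Sketch: is a quote still acceptable? Uses the 5-minute default and the
// 1-hour protocol ceiling cited above. The isQuoteValid name and the
// clamp-to-ceiling behavior are illustrative assumptions.
const DEFAULT_QUOTE_VALIDITY_SECS = 300; // quote_validity_duration_secs default
const MAX_QUOTE_AGE_SECS = 3600;         // MAX_QUOTE_AGE protocol ceiling

function isQuoteValid(
  issuedAtSecs: number,
  nowSecs: number,
  validitySecs: number = DEFAULT_QUOTE_VALIDITY_SECS
): boolean {
  // A configured lifetime beyond the protocol cap cannot extend validity.
  const effective = Math.min(validitySecs, MAX_QUOTE_AGE_SECS);
  return nowSecs - issuedAtSecs <= effective;
}
```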
Why shorter isn't always better
A short quote lifetime (say, 30 seconds) limits your price exposure. If the cost of the resources you're committing changes between quote generation and job execution, a shorter window reduces the risk that you're locked into a stale price. For operators dealing with volatile token pricing, this matters.
But short lifetimes create user friction. The user needs to receive the quote, review it, sign the payment, and submit the request before the quote expires. On a congested network, that flow can easily take more than 30 seconds. An expired quote means a failed request, a retry, and a frustrated user. It also counts as a job failure in the health tracker's rolling window, which means short quote lifetimes can indirectly damage your health score.
Why longer isn't always better
A long quote lifetime (approaching the 1-hour cap) is comfortable for users but risky for operators. Resource costs can shift, token exchange rates can move, and you're committed to the quoted price for the entire window. If you're pricing in wei (the protocol's denomination-neutral unit for cross-chain, multi-token pricing), exchange rate drift compounds with any staleness in your benchmark data.
Practical guidance
For most operators, the 5-minute default works well for stablecoin-denominated services where price volatility is low. Consider adjusting in these scenarios:
- Volatile token pricing: Drop to 60-120 seconds. Accept the increased retry rate as a cost of price accuracy.
- Long user flows: If your users are interacting through UIs with multiple confirmation steps, extend to 10-15 minutes. Monitor your expired-quote rate.
- High-value jobs: Shorter is safer. A 1-hour quote on a job that costs several hundred dollars in compute creates real exposure.
Keep your quote server address current on-chain via updateOperatorPreferences on the ITangleOperators interface. A stale address means quotes can't be fetched at all, which is worse than any lifetime misconfiguration.
The pricing engine under the hood
Quotes aren't arbitrary numbers. The pricing engine runs automated benchmarks on your hardware when a service activates (ServiceActivated event), measuring CPU, memory, storage, network, and GPU performance. Results are cached locally by blueprint ID.
The quote formula is: Base Resource Cost x Time Multiplier x Security Commitment Factor.
Resource pricing is configured per-blueprint in your operator.toml:
```toml
[blueprint.resources]
cpu = { count = 8, price_per_unit = "0.001" }
memory = { count = 16384, price_per_unit = "0.00005" }
storage = { count = 1024000, price_per_unit = "0.00002" }
```
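Plugging those sample prices into the formula gives a concrete number. The time multiplier and security factor below are made-up illustrative values; the real engine derives them from the job's duration and security commitments.

```typescript
// Worked example of Base Resource Cost x Time Multiplier x Security Commitment
// Factor, using the sample operator.toml prices above. The multiplier and
// factor values here are illustrative, not engine defaults.
const resources = [
  { count: 8,       pricePerUnit: 0.001 },   // cpu
  { count: 16384,   pricePerUnit: 0.00005 }, // memory
  { count: 1024000, pricePerUnit: 0.00002 }, // storage
];

// Base resource cost: sum of count * price_per_unit over all resources.
// 8*0.001 + 16384*0.00005 + 1024000*0.00002 = 0.008 + 0.8192 + 20.48 = 21.3072
const baseCost = resources.reduce((sum, r) => sum + r.count * r.pricePerUnit, 0);

function quotePrice(timeMultiplier: number, securityFactor: number): number {
  return baseCost * timeMultiplier * securityFactor;
}
```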
The pricing engine's full config controls benchmark behavior and the quote server:
```toml
database_path = "./data/price_cache"
benchmark_duration = 60
benchmark_interval = 1
keystore_path = "./data/keystore"
rpc_bind_address = "127.0.0.1"
rpc_port = 9000
rpc_timeout = 30
rpc_max_connections = 100
quote_validity_duration_secs = 300
```
If rpc_max_connections is too low for your traffic, quote requests will queue and potentially time out, which looks identical to an offline quote server from the user's perspective. For operators expecting high request volume, bumping this above the default 100 is worth doing.
The degradation cascade: signals before slashing
Operators don't get slashed out of nowhere. There's a predictable four-stage cascade, and every stage produces signals you can catch if you're watching.
Stage 1: Heartbeat drift. Your heartbeat interval starts slipping. Maybe a network issue, maybe resource contention on the node. The consecutiveBeats counter resets and missedBeats starts climbing. On-chain, your status is still Healthy, but the trend is visible.
Stage 2: Degraded status. Once missedBeats crosses a threshold (per your service's HeartbeatConfig.maxMissed), the status shifts to Degraded (codes 1-99). You're still selectable for jobs, but the orchestrator's health tracker may start reflecting failures if the underlying issue is also affecting job execution.
Stage 3: Offline. Continued misses push the status to Offline (codes 100+). isOnline() returns false. The orchestrator stops sending you work entirely. You're still staked, still committed, but earning nothing.
Stage 4: Slashing risk. If the offline period extends beyond the grace window, slashing becomes possible. The dispute window is 14 rounds (3.5 days, since each round is 6 hours). An exit takes 56 rounds (14 days). These are not fast processes, which is intentional: they give operators time to recover from legitimate outages.
Every stage before slashing is recoverable. Fix the underlying issue, submit a successful heartbeat, and the cascade resets. The operators who get slashed are the ones who aren't watching.
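The round arithmetic behind those windows is worth having in code so alert thresholds stay consistent with protocol time. A minimal helper, using the 6-hour round duration cited above:

```typescript
// Sketch: converting protocol rounds to wall-clock time, using the
// 6-hour round duration cited in this post. The helper name is ours.
const ROUND_DURATION_SECONDS = 21_600; // 6 hours per round

const roundsToDays = (rounds: number): number =>
  (rounds * ROUND_DURATION_SECONDS) / 86_400;
```

With it, the 14-round dispute window resolves to 3.5 days and the 56-round exit delay to 14 days, matching the figures above.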
What metric violations actually do
The MetricDefinition system lets services define bounds (minValue, maxValue) for custom metrics like CPU utilization or memory usage. When a submitted metric falls outside those bounds, the violation is logged on-chain. Currently, violations don't trigger automatic slashing. They create an on-chain record that can be used in governance-driven disputes, but the enforcement path is manual. This may change as the protocol matures, so treat metric bounds as soft limits today and hard limits tomorrow.
Revenue: what you're protecting
The default fee split for job revenue is:
| Recipient | Share |
|---|---|
| Operators | 40% |
| Developers | 20% |
| Protocol | 20% |
| Stakers | 20% |
These percentages are governance-configurable. On top of job revenue, operators can earn TNT incentives from the InflationPool and commission from delegator RewardVaults. But every one of these revenue streams depends on the same thing: staying in rotation. An operator that's offline, in cooldown, or at capacity earns nothing.
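Applying the default split to a job payment is a one-liner per recipient. The function below uses the default shares from the table; remember they can change under governance.

```typescript
// Sketch: applying the default fee split from the table above to a job
// payment. Shares are governance-configurable; these are the cited defaults.
const FEE_SPLIT = { operators: 0.4, developers: 0.2, protocol: 0.2, stakers: 0.2 };

function splitRevenue(amount: number) {
  return {
    operators: amount * FEE_SPLIT.operators,
    developers: amount * FEE_SPLIT.developers,
    protocol: amount * FEE_SPLIT.protocol,
    stakers: amount * FEE_SPLIT.stakers,
  };
}
```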
Building your monitoring stack
The QoS endpoints on your node give you the raw signals:
```shell
# Quick health check
curl -s http://localhost:9090/health

# Prometheus metrics (for Grafana dashboards)
curl -s http://localhost:9090/metrics | head -n 20
```
This Prometheus alerting config catches problems at Stage 1 of the degradation cascade, before they affect job selection:
```yaml
groups:
  - name: blueprint_operator
    rules:
      - alert: HeartbeatDrift
        expr: increase(operator_missed_beats_total[30m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Operator {{ $labels.instance }} missed >2 heartbeats in 30m"
          description: "Heartbeat drift detected. Investigate before status degrades."
      - alert: CapacityExhausted
        expr: (operator_active_load / operator_max_capacity) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Operator {{ $labels.instance }} at >90% capacity for 10m"
      - alert: SuccessRateDropping
        expr: operator_job_success_rate < 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Operator {{ $labels.instance }} success rate below 50%"
          description: "Below 30% triggers removal from rotation."
```
The metrics you should alert on, mapped to the three numbers that matter:
Heartbeat health:

- consecutiveBeats trending downward
- missedBeats incrementing
- Status code changing from 0 to anything else

Quote freshness:

- Quote generation latency approaching quote_validity_duration_secs
- Expired-quote rate (track client-side 402 retries if possible)
- Benchmark cache age (stale benchmarks produce stale prices)

Capacity utilization:

- activeLoad / maxCapacity ratio approaching 1.0
- Available slots dropping to zero during peak hours
- Job queue depth, if your blueprint supports queuing
The protocol's timing constants give you the boundaries for alert thresholds:
| Constant | Value | What it means for alerting |
|---|---|---|
| ROUND_DURATION_SECONDS | 21,600 s (6 hr) | Rounds are the unit of protocol time |
| ROUNDS_PER_EPOCH | 28 (7 days) | Epoch boundaries trigger reward distribution |
| DISPUTE_WINDOW_ROUNDS | 14 (3.5 days) | Time to respond to slashing disputes |
| OPERATOR_DELAY_ROUNDS | 56 (14 days) | Minimum exit timeline |
| MAX_QUOTE_AGE | 1 hour | Absolute ceiling for quote validity |
| MIN_SERVICE_TTL | 1 hour | Shortest allowed service commitment |
FAQ
How many heartbeats can I miss before getting slashed?
There is no single fixed number. Slashing risk comes from a cascade, not a threshold: missed beats push you to Degraded, then Offline, then into a slashing-eligible window with a 14-round (3.5-day) dispute period. Your service's HeartbeatConfig.maxMissed setting controls when the Degraded transition happens, and that value varies by blueprint. The practical answer: set up alerts on missedBeats incrementing (Stage 1 of the cascade) and you'll catch drift long before slashing is on the table.
Should I set my quote lifetime to the maximum 1 hour?
Almost certainly not. The 1-hour MAX_QUOTE_AGE is a protocol ceiling, not a recommendation. A 1-hour quote locks you into pricing that may not reflect current resource costs or token exchange rates. The 5-minute default is a reasonable starting point. Only extend it if your users consistently need more time to complete the quote-to-submission flow, and even then, 10-15 minutes is usually sufficient.
What happens if my node is healthy but at full capacity?
You stop receiving new jobs. The orchestrator's selection algorithm checks availableSlots (maxCapacity - activeLoad) and skips operators with zero slots. You won't be marked unhealthy, but you'll earn nothing until capacity frees up. If this happens regularly during peak hours, either increase maxCapacity (if your hardware supports it) or run additional operator instances.
Do metric violations lead to automatic slashing?
Not currently. When a submitted metric falls outside the bounds defined in MetricDefinition, the violation is logged on-chain but enforcement is governance-driven, not automatic. That said, the on-chain record exists and could be used against you in a dispute. Treat metric bounds as constraints you should respect, because the enforcement mechanism will likely tighten over time.
What's the minimum success rate to stay in rotation?
30%, calculated over a rolling window of your last 10 jobs. This sounds generous, but the 60-second cooldown after each failure is the real constraint. If you're failing frequently, the cooldowns accumulate and you spend significant time unable to receive work, even if your success rate stays above the floor. An operator failing every third job has a 67% success rate (well above the floor) but, at a cadence of roughly one job per minute, loses at least 20 seconds of every minute to cooldowns.
Build with Tangle | Website | GitHub | Discord | Telegram | X/Twitter