DEV Community

Michael

Originally published at gbase.cn

Why a Failed Rebalance Returns to STARTING Instead of CANCELED in GBase 8a

When a rebalance operation fails on a table that is in RUNNING state, GBase 8a moves the table back to STARTING — not to CANCELED. This behavior reveals a deliberate state‑machine design: automatic retry and self‑healing.

State Transition Rules

  • Execution failure: RUNNING → failure → STARTING
  • Explicit cancel: STARTING / RUNNING / PAUSED → cancel command → CANCELED
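The two rules above can be written as a small lookup table. This is only an illustrative sketch in Python, not GBase 8a's internal code: the event labels `"failure"` and `"cancel"` and the helper `next_state` are hypothetical names chosen for the example.

```python
from enum import Enum


class State(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CANCELED = "CANCELED"


# Only the transitions listed in the rules above are modeled here.
# The event labels are hypothetical; they are not GBase 8a identifiers.
TRANSITIONS = {
    (State.RUNNING, "failure"): State.STARTING,   # execution failure
    (State.STARTING, "cancel"): State.CANCELED,   # explicit cancel
    (State.RUNNING, "cancel"): State.CANCELED,
    (State.PAUSED, "cancel"): State.CANCELED,
}


def next_state(state: State, event: str) -> State:
    """Return the state reached from `state` on `event`, or raise if undefined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```

For example, `next_state(State.RUNNING, "failure")` yields `State.STARTING`, while `next_state(State.PAUSED, "failure")` raises, because a paused task is not executing and so cannot fail in this model.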

Design Logic

Distinguishing "Cancel" from "Failure"

  • CANCELED is the result of an explicit CANCEL REBALANCE command issued by a user or administrator — the task is intentionally aborted.
  • A return from RUNNING to STARTING is the result of an unexpected internal error (network glitch, temporarily unavailable node, resource shortage): the task was interrupted, and the system queues it for another attempt.

The core distinction: CANCELED records an explicit command, while a return to STARTING signals a fault. They have different causes, so they lead to different states.

Automatic Retry and Self‑Healing

Reverting to STARTING means the system hasn't given up. STARTING is the "ready" state in which tasks wait in the execution queue. The background scheduler will later pull the task back into RUNNING and try again. This built‑in retry mechanism increases tolerance for transient failures and reduces the need for manual intervention in a GBase 8a cluster.
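The retry loop described above can be simulated in a few lines of Python. This is a minimal sketch, not the actual scheduler: `run_with_retry`, `flaky`, the `max_rounds` cap, and the `COMPLETED` state name are all assumptions made for the example (the article does not name the success state or say whether retries are bounded).

```python
from typing import Callable, List


def run_with_retry(attempt: Callable[[], bool], max_rounds: int = 10) -> List[str]:
    """Simulate a scheduler that retries a task; return the states visited."""
    state = "STARTING"
    history = [state]
    for _ in range(max_rounds):      # cap is an assumption, to keep the demo finite
        state = "RUNNING"            # scheduler pulls the task from the queue
        history.append(state)
        if attempt():
            state = "COMPLETED"      # hypothetical name for the success state
            history.append(state)
            break
        state = "STARTING"           # failure: back to the ready queue, not CANCELED
        history.append(state)
    return history


def flaky(failures: int) -> Callable[[], bool]:
    """Build an attempt function that fails `failures` times, then succeeds."""
    remaining = [failures]

    def attempt() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return attempt


if __name__ == "__main__":
    print(" -> ".join(run_with_retry(flaky(2))))
```

With two transient failures, the task cycles through STARTING and RUNNING twice and completes on the third attempt, with no operator involved at any point.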

Ensuring Eventual Consistency

Rebalance is critical for even data distribution after scaling, so a failure must not stop it permanently. Returning to STARTING keeps the task scheduled for retry, so it can complete once the underlying fault clears and the cluster's data layout ends up consistent. If failure led directly to CANCELED, data distribution could remain incomplete until someone restarted the task manually, adding operational risk.

A Clean, Deterministic State Machine

The transition rules are simple and deterministic:

  • External commands drive pause/continue/cancel.
  • Internal flow drives execution start, successful completion, or failure‑retry.

This avoids cluttering the state machine with special states for every possible failure scenario, making it easy to understand and maintain.
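One way to picture the split between external commands and internal flow is to keep two separate transition tables, one per origin. The Python sketch below is an illustration under several assumptions that the article does not confirm: that pause is accepted from STARTING and RUNNING, that continue returns a task to STARTING, and that the success state is called `COMPLETED`. All event labels and the `apply_event` helper are hypothetical.

```python
from typing import Dict, Tuple

Table = Dict[Tuple[str, str], str]

# Transitions driven by user or administrator commands.
EXTERNAL: Table = {
    ("STARTING", "pause"): "PAUSED",      # assumption: pause allowed from STARTING
    ("RUNNING", "pause"): "PAUSED",       # assumption: pause allowed from RUNNING
    ("PAUSED", "continue"): "STARTING",   # assumption: continue re-queues the task
    ("STARTING", "cancel"): "CANCELED",
    ("RUNNING", "cancel"): "CANCELED",
    ("PAUSED", "cancel"): "CANCELED",
}

# Transitions driven by the system's own execution flow.
INTERNAL: Table = {
    ("STARTING", "start"): "RUNNING",
    ("RUNNING", "success"): "COMPLETED",  # hypothetical name for the success state
    ("RUNNING", "failure"): "STARTING",
}


def apply_event(state: str, event: str, origin: str) -> str:
    """Apply `event` from the given origin ("external" or "internal")."""
    if origin == "external":
        table = EXTERNAL
    elif origin == "internal":
        table = INTERNAL
    else:
        raise ValueError(f"unknown origin {origin!r}")
    if (state, event) not in table:
        raise ValueError(f"{origin} event {event!r} is not valid in state {state}")
    return table[(state, event)]
```

Because `CANCELED` appears only in the external table, no internal event can ever cancel a task, which is the intent separation the article describes.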

Summary

The RUNNING → failure → STARTING path shows that the rebalance state machine prioritizes task completion, automatic recovery, and clear intent separation. It's a robust design pattern for distributed systems, where temporary hiccups are expected and should be handled without human intervention.
