When a rebalance operation fails on a table that is in RUNNING state, GBase 8a moves the table back to STARTING — not to CANCELED. This behavior reveals a deliberate state‑machine design: automatic retry and self‑healing.
State Transition Rules
- Execution failure: RUNNING → failure → STARTING
- Explicit cancel: STARTING/RUNNING/PAUSED → cancel command → CANCELED
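The two rules above can be written as a small lookup table. This is only an illustrative sketch, not GBase 8a's actual implementation: the `State` enum, the event names `"failure"` and `"cancel"`, and the `next_state` helper are all hypothetical names chosen for this example.

```python
from enum import Enum


class State(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    CANCELED = "CANCELED"


# (current state, event) -> next state.
# Event names are illustrative, not GBase 8a identifiers.
TRANSITIONS = {
    # Execution failure: the task goes back to the ready state.
    (State.RUNNING, "failure"): State.STARTING,
    # Explicit cancel: allowed from any of the three active states.
    (State.STARTING, "cancel"): State.CANCELED,
    (State.RUNNING, "cancel"): State.CANCELED,
    (State.PAUSED, "cancel"): State.CANCELED,
}


def next_state(state: State, event: str) -> State:
    """Return the state reached from `state` on `event`, or raise if undefined."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


print(next_state(State.RUNNING, "failure").name)  # STARTING
print(next_state(State.PAUSED, "cancel").name)    # CANCELED
```

Note that no `"failure"` entry maps to `CANCELED`: in this table, only a cancel event can reach that state.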
Design Logic
Distinguishing "Cancel" from "Failure"
- CANCELED is the result of an explicit CANCEL REBALANCE command issued by a user or administrator: the task is intentionally aborted.
- STARTING is the result of an unexpected internal error (network glitch, temporarily unavailable node, resource shortage): the task is interrupted but the system wants to retry.
The core distinction: CANCELED records a deliberate command, while a return to STARTING signals a fault. They have different causes, so they lead to different states.
Automatic Retry and Self‑Healing
Reverting to STARTING means the system hasn't given up. STARTING is the "ready" state where tasks enter the execution queue. The background scheduler will later pull the task back into RUNNING and try again. This built‑in retry mechanism increases tolerance for transient failures and reduces the need for manual intervention in a GBase 8a cluster.
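The retry loop described here can be simulated in a few lines. This is a minimal sketch under stated assumptions, not the real GBase 8a scheduler: `run_scheduler`, `make_flaky_work`, and `TransientError` are hypothetical names, and `COMPLETED` is an assumed label for the success state, which the text above does not name. The sketch also has no backoff or retry cap.

```python
from collections import deque


class TransientError(Exception):
    """Stand-in for a network glitch or a temporarily unavailable node."""


def run_scheduler(tasks, work):
    """Drain the queue; a failed task returns to STARTING and is re-queued."""
    queue = deque(tasks)
    history = {name: ["STARTING"] for name in tasks}
    while queue:
        name = queue.popleft()
        history[name].append("RUNNING")
        try:
            work(name)
        except TransientError:
            # Failure: back to the ready state, not CANCELED.
            history[name].append("STARTING")
            queue.append(name)
        else:
            # "COMPLETED" is an assumed name for the success state.
            history[name].append("COMPLETED")
    return history


def make_flaky_work(failures_before_success):
    """Build a work function that fails a fixed number of times per task."""
    remaining = dict(failures_before_success)

    def work(name):
        if remaining.get(name, 0) > 0:
            remaining[name] -= 1
            raise TransientError(name)

    return work


history = run_scheduler(["t1"], make_flaky_work({"t1": 2}))
print(" -> ".join(history["t1"]))
# STARTING -> RUNNING -> STARTING -> RUNNING -> STARTING -> RUNNING -> COMPLETED
```

The task fails twice, returns to STARTING each time, and completes on the third attempt without anyone re-submitting it.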
Ensuring Eventual Consistency
Rebalance is critical for even data distribution after scaling. Failure must not permanently stop it. Returning to STARTING keeps the task in the execution queue, so it can complete once the transient fault clears and the cluster's data layout stays consistent. If failure led directly to CANCELED, data distribution could remain incomplete and require a manual restart, adding operational risk.
A Clean, Deterministic State Machine
The transition rules are simple and deterministic:
- External commands drive pause/continue/cancel.
- Internal flow drives execution start, successful completion, or failure‑retry.
This avoids cluttering the state machine with special states for every possible failure scenario, making it easy to understand and maintain.
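One way to picture this separation is to keep external commands and internal events in two different tables. The sketch below is hypothetical: the pause and continue targets and the `COMPLETED` state name are assumptions made for illustration, since the rules above only spell out the failure and cancel transitions.

```python
# Transitions triggered by user or administrator commands.
EXTERNAL = {
    ("STARTING", "pause"): "PAUSED",      # assumed target
    ("RUNNING", "pause"): "PAUSED",       # assumed target
    ("PAUSED", "continue"): "STARTING",   # assumed target
    ("STARTING", "cancel"): "CANCELED",
    ("RUNNING", "cancel"): "CANCELED",
    ("PAUSED", "cancel"): "CANCELED",
}

# Transitions triggered by the system itself.
INTERNAL = {
    ("STARTING", "start"): "RUNNING",
    ("RUNNING", "success"): "COMPLETED",  # state name assumed
    ("RUNNING", "failure"): "STARTING",
}


def apply_event(state, event, source):
    """Apply an event from `source` ("external" or "internal").

    Events that are not defined for that source leave the state unchanged.
    """
    table = EXTERNAL if source == "external" else INTERNAL
    return table.get((state, event), state)


print(apply_event("RUNNING", "failure", "internal"))  # STARTING
print(apply_event("RUNNING", "cancel", "external"))   # CANCELED
print(apply_event("RUNNING", "cancel", "internal"))   # RUNNING
```

Because `CANCELED` appears only in the external table, an internal failure cannot land a task there, which is the intent separation the design relies on.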
Summary
The RUNNING → failure → STARTING path shows that the rebalance state machine prioritizes task completion, automatic recovery, and clear intent separation. It's a robust design pattern for distributed systems, where temporary hiccups are expected and should be handled without human intervention.