RFC-WF-0008
Recovery & Compensation Protocol (RCP)
Status: Draft Standard
Version: 1.0.0
Date: 20 Nov 2025
Category: Standards Track
Author: FullAgenticStack Initiative
Dependencies: RFC-WF-0001 (WFCS), RFC-WF-0003 (CCP), RFC-WF-0004 (ACSM), RFC-WF-0005 (CRCD), RFC-WF-0006 (EAS), RFC-WF-0007 (OoC)
License: Open Specification (Public, Royalty-Free)
Abstract
This document specifies the Recovery & Compensation Protocol (RCP) for WhatsApp-first systems. RCP defines normative requirements for retry, rollback, cancelation, reprocessing, and compensation of conversational commands—fully operable via WhatsApp—while ensuring safety, authorization, evidence emission, and deterministic state convergence. RCP standardizes recovery commands, compensation plans, and evidence-backed recovery workflows so that “incident handling” does not require dashboards or DevOps intervention for routine failure classes.
Index Terms— recovery, compensation, rollback, retry, sagas, event-driven systems, idempotency, operational resilience, WhatsApp-first.
I. Introduction
In real systems, failures are not rare; they are Tuesday. WhatsApp-first compliance requires that recovery actions (retry, rollback, cancel, reprocess, reset where applicable) be available through WhatsApp without mandatory reliance on web consoles or manual operator tooling (WFCS recovery autonomy). RCP formalizes a protocol to model and execute recovery safely and consistently.
RCP builds on:
- CCP for canonical command envelopes and confirmations
- ACSM for admin/security controls and step-up policies
- CRCD for declaring recovery capabilities and binding policies
- EAS for evidence artifacts across recovery lifecycle
- OoC for discovering failures and drilling into cause/evidence
II. Scope
RCP specifies:
- Recovery action taxonomy and safety classes
- Recovery command set and conversational flows
- Compensation plan declaration requirements
- Idempotency and replay constraints for recovery operations
- Evidence emission requirements for recovery and compensation
- Authorization and step-up requirements for high-impact recovery
- Convergence rules and terminal states (what “done” means)
RCP does not mandate a specific workflow engine (Temporal, queues, etc.). It defines protocol semantics for WhatsApp-first operability.
III. Normative Language
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this document are normative.
IV. Definitions
Recovery Action: An operation that aims to restore a workflow or system state after failure or partial execution.
Compensation: A corrective action that semantically “undoes” or counterbalances a prior effect when true rollback is not possible.
Saga: A multi-step workflow where each step has a compensating action.
Reprocess: Re-running a workflow from a defined checkpoint (not always from scratch).
Convergence: The property that repeated recovery attempts lead to a stable terminal state without divergence.
V. Design Goals
RCP MUST ensure:
- G1. WhatsApp Operability: Recovery is executable via conversation (no mandatory dashboard).
- G2. Safety: Recovery actions use confirmation and step-up policies consistent with risk.
- G3. Determinism: Recovery actions are idempotent and replay-safe.
- G4. Evidence: Every recovery action emits standardized evidence artifacts (EAS).
- G5. Convergence: System reaches a stable terminal state, even under retries.
- G6. Discoverability: Recovery options are discoverable via OoC and CRCD.
VI. Recovery Action Taxonomy
An RCP implementation MUST support, at minimum, the following action classes when applicable:
A. Retry
Re-attempt the same command execution using the same command_id or a derived recovery command referencing it.
B. Cancel
Stop further processing of an in-flight workflow (best-effort) and prevent new side effects.
C. Rollback (When Possible)
Revert state mutations when the system can safely restore a prior state snapshot or transactional boundary.
D. Compensate
Execute compensating actions for effects already applied (e.g., refund payment, restock inventory, revoke access).
E. Reprocess
Resume or re-run from a checkpoint, optionally with corrected inputs or policies.
F. Reset (Admin-Controlled)
Reset a subsystem or workflow state machine. This is high-impact and MUST be strongly guarded.
Applicability constraint: The system MUST NOT claim support for a recovery action if its semantics cannot be safely implemented for that workflow.
VII. Recovery Commands (Conversational Interface)
Recovery commands MUST be CCP-compliant (canonical envelopes, confirmation rules) and SHOULD be declared in CRCD.
A. Minimum Command Set (Normative)
Implementations MUST support at least:
- RCP.OPTIONS <command_id>: list available recovery actions for the command
- RCP.RETRY <command_id>: retry execution (idempotent)
- RCP.CANCEL <command_id>: cancel in-flight processing
- RCP.COMPENSATE <command_id>: execute compensation plan (if declared)
- RCP.REPROCESS <command_id> [checkpoint=<id>]: re-run from checkpoint (if supported)
Additionally, implementations SHOULD support:
- RCP.ROLLBACK <command_id> when true rollback semantics exist
- RCP.STATUS <command_id> (may delegate to OoC)
- RCP.PLAN <command_id>: show compensation plan summary (privileged)
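Informative sketch (non-normative): the command grammar above could be parsed roughly as follows. The command names come from this section; the `RecoveryRequest` type, the `parse_recovery_command` function, and the rule that `checkpoint` is only accepted on RCP.REPROCESS are illustrative assumptions, not part of RCP or CCP.

```python
from dataclasses import dataclass, field

# Command names from Section VII; everything else here is hypothetical.
KNOWN_ACTIONS = {
    "OPTIONS", "RETRY", "CANCEL", "COMPENSATE", "REPROCESS",
    "ROLLBACK", "STATUS", "PLAN",
}


@dataclass
class RecoveryRequest:
    action: str                                  # e.g. "RETRY"
    target_command_id: str                       # the command being recovered
    params: dict = field(default_factory=dict)   # e.g. {"checkpoint": "cp-2"}


def parse_recovery_command(text: str) -> RecoveryRequest:
    """Parse 'RCP.<ACTION> <command_id> [key=value ...]' into a request."""
    tokens = text.strip().split()
    if len(tokens) < 2:
        raise ValueError("expected 'RCP.<ACTION> <command_id>'")
    prefix, _, action = tokens[0].partition(".")
    if prefix != "RCP" or action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown recovery command: {tokens[0]}")
    params = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed parameter: {token}")
        params[key] = value
    # Assumption: only REPROCESS accepts a checkpoint parameter.
    if "checkpoint" in params and action != "REPROCESS":
        raise ValueError("checkpoint is only valid for RCP.REPROCESS")
    return RecoveryRequest(action, tokens[1], params)
```

For example, `parse_recovery_command("RCP.REPROCESS cmd-42 checkpoint=cp-2")` yields a request whose action is `REPROCESS`, whose target is `cmd-42`, and whose params carry the checkpoint id.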
B. Discovery and Presentation
RCP.OPTIONS MUST respond with:
- current lifecycle stage (from EAS)
- why the command is recoverable or not
- numbered list of allowed recovery actions
- risk label per action (see Section VIII)
- whether step-up is required
Example response pattern (normative structure, not wording):
1 — Retry (low risk)
2 — Reprocess from checkpoint (medium risk)
3 — Compensate (high risk — requires confirm token)
4 — Cancel (medium risk)
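Informative sketch (non-normative): one way to assemble an RCP.OPTIONS response that carries all five required elements. The `RecoveryOption` type, the `render_options` function, and the exact wording of each line are assumptions; only the structure (stage, reason, numbered actions, risk label, step-up flag) follows the requirements above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    label: str              # e.g. "Retry"
    risk: str               # risk label, e.g. "low", "medium", "high"
    step_up_required: bool  # whether ACSM step-up is needed


def render_options(stage: str, reason: str, options: list) -> str:
    """Render a numbered recovery menu for a conversational reply."""
    lines = [f"Stage: {stage}", f"Reason: {reason}"]
    for number, option in enumerate(options, start=1):
        suffix = ", requires step-up" if option.step_up_required else ""
        lines.append(f"{number} - {option.label} ({option.risk} risk{suffix})")
    return "\n".join(lines)
```

The numbering is generated at render time, so the user's numeric reply has to be resolved against the same ordered list that produced the menu.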
VIII. Safety Classes for Recovery Actions
RCP actions MUST be classified using at least:
- R0 (Read-only): plan/status queries
- R1 (Low-impact): retry of idempotent step with no new side effects expected
- R2 (Medium-impact): cancel, reprocess checkpoint, partial rollback
- R3 (High-impact): compensation, reset, bulk or irreversible recovery
Safety class MUST determine:
- CCP confirmation method (standard vs strengthened)
- ACSM step-up requirement (mandatory for R3; recommended for R2 depending on policy)
- rate limits and operator privileges
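Informative sketch (non-normative): the classification above can be expressed as a lookup from action to safety class to policy. The R3 and R2 step-up values follow this section; the confirmation labels assigned to R0 through R2 and the names `SafetyPolicy` and `policy_for` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class SafetyClass(IntEnum):
    R0 = 0  # read-only: plan/status queries
    R1 = 1  # low-impact: retry of an idempotent step
    R2 = 2  # medium-impact: cancel, reprocess checkpoint, partial rollback
    R3 = 3  # high-impact: compensation, reset, bulk or irreversible recovery


# Mapping derived from the class descriptions in this section.
ACTION_CLASS = {
    "STATUS": SafetyClass.R0,
    "PLAN": SafetyClass.R0,
    "RETRY": SafetyClass.R1,
    "CANCEL": SafetyClass.R2,
    "REPROCESS": SafetyClass.R2,
    "COMPENSATE": SafetyClass.R3,
    "RESET": SafetyClass.R3,
}


@dataclass(frozen=True)
class SafetyPolicy:
    confirmation: str  # "none", "standard", or "strengthened" (labels assumed)
    step_up: str       # "none", "recommended", or "mandatory"


def policy_for(action: str) -> SafetyPolicy:
    """Return the confirmation and step-up policy for a recovery action."""
    safety_class = ACTION_CLASS[action]
    if safety_class is SafetyClass.R3:
        return SafetyPolicy("strengthened", "mandatory")
    if safety_class is SafetyClass.R2:
        return SafetyPolicy("standard", "recommended")
    if safety_class is SafetyClass.R1:
        return SafetyPolicy("standard", "none")  # assumption
    return SafetyPolicy("none", "none")          # assumption for read-only
```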
IX. Compensation Plan Declaration
If a command can produce irreversible side effects, the system SHOULD define a Compensation Plan.
A. Plan Requirements
A declared compensation plan MUST specify:
- steps list with identifiers
- each step’s forward effect and compensating effect
- prerequisites/guards
- idempotency expectations per step
- terminal success definition (what “compensated” means)
- partial-compensation handling (what if step 2 fails after step 1 compensates)
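Informative sketch (non-normative): a compensation plan with the fields listed above, plus a runner that tolerates partial compensation. Running compensations in reverse step order is a common saga convention and an assumption here, as are the `PlanStep`, `CompensationPlan`, and `run_compensation` names and the "partial" and "blocked" return labels.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanStep:
    step_id: str                               # step identifier
    forward_effect: str                        # what the step did
    compensate: Callable[[], None]             # the compensating effect
    guard: Callable[[], bool] = lambda: True   # prerequisite check
    idempotent: bool = True                    # idempotency expectation


@dataclass
class CompensationPlan:
    plan_id: str
    steps: list


def run_compensation(plan: CompensationPlan, already_compensated: set):
    """Compensate remaining steps; return (status, compensated step ids).

    Steps already compensated are skipped, so a re-run after a partial
    failure resumes instead of duplicating effects.
    """
    done = set(already_compensated)
    for step in reversed(plan.steps):  # reverse order: an assumption
        if step.step_id in done:
            continue
        if not step.guard():
            return "blocked", done
        try:
            step.compensate()
        except Exception:
            return "partial", done
        done.add(step.step_id)
    return "compensated", done  # terminal success: every step compensated
```

Persisting the returned set of compensated step ids between attempts is what lets a later RCP.COMPENSATE call converge rather than refund or restock twice.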
B. Registry Binding
Compensation plans SHOULD be referenced from CRCD command declarations:
- compensation_plan_id
- supported checkpoints for reprocess
- allowed actions per lifecycle stage
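Informative sketch (non-normative): a CRCD-style declaration carrying the three bindings above, with two lookup helpers. The CRCD schema is not reproduced in this document, so every field name other than `compensation_plan_id`, the command name `ORDER.CREATE`, and the checkpoint ids are hypothetical.

```python
# Hypothetical declaration; the real CRCD schema may differ.
DECLARATION = {
    "command": "ORDER.CREATE",
    "compensation_plan_id": "plan-order-create-v1",
    "reprocess_checkpoints": ["payment_captured", "inventory_reserved"],
    "allowed_actions": {
        "execution.started": ["CANCEL"],
        "execution.failed": ["RETRY", "REPROCESS", "COMPENSATE"],
        "execution.executed": ["COMPENSATE"],
    },
}


def allowed_actions(declaration: dict, stage: str) -> list:
    """Recovery actions declared for a lifecycle stage (empty if none)."""
    return list(declaration["allowed_actions"].get(stage, []))


def can_reprocess_from(declaration: dict, checkpoint: str) -> bool:
    """True only if the checkpoint is explicitly declared."""
    return checkpoint in declaration["reprocess_checkpoints"]
```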
X. Idempotency and Replay Constraints
A. Recovery Idempotency
Every recovery command MUST be idempotent with respect to its target command_id. Repeated retries or compensation attempts MUST converge without duplicating effects.
B. Binding to Command Identity
Recovery commands MUST include:
- target_command_id
- target_idempotency_key (when applicable)
- their own recovery_idempotency_key
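Informative sketch (non-normative): deduplicating recovery commands by their identity fields so that a resent message does not repeat the effect. The three key fields are those listed above; the `RecoveryCommand` and `RecoveryExecutor` names and the in-memory result store are assumptions, and a real deployment would need durable storage.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RecoveryCommand:
    action: str
    target_command_id: str
    recovery_idempotency_key: str
    target_idempotency_key: Optional[str] = None  # when applicable


class RecoveryExecutor:
    def __init__(self) -> None:
        # In-memory store for illustration only.
        self._results = {}

    def execute(self, command: RecoveryCommand, effect: Callable[[], str]) -> str:
        """Run the effect once per (target, action, recovery key)."""
        key = (
            command.target_command_id,
            command.action,
            command.recovery_idempotency_key,
        )
        if key in self._results:
            return self._results[key]  # replay: return the recorded outcome
        result = effect()
        self._results[key] = result
        return result
```

A failed effect raises before anything is recorded, so the same key can be retried; whether failures should also be recorded is a policy choice this sketch leaves open.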
C. Replay Windows and Freshness
High-impact recovery (R3) MUST be bound to:
- step-up freshness window
- envelope validity window
High-impact recovery (R3) SHOULD also be bound to a single-use confirmation token.
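Informative sketch (non-normative): a gate that checks the R3 bindings before a high-impact recovery runs. The field names, the `R3Gate` class, and the rejection messages are assumptions; timestamps are plain seconds, and the used-token set is held in memory for illustration.

```python
from dataclasses import dataclass


@dataclass
class R3Context:
    now: float                  # current time, seconds
    step_up_at: float           # when step-up verification completed
    envelope_not_before: float  # envelope validity window start
    envelope_not_after: float   # envelope validity window end
    confirm_token: str          # confirmation token presented by the user


class R3Gate:
    def __init__(self, step_up_max_age_s: float) -> None:
        self.step_up_max_age_s = step_up_max_age_s
        self._used_tokens = set()

    def check(self, ctx: R3Context) -> tuple:
        """Return (allowed, reason) for a high-impact recovery attempt."""
        if ctx.now - ctx.step_up_at > self.step_up_max_age_s:
            return False, "step-up expired"
        if not (ctx.envelope_not_before <= ctx.now <= ctx.envelope_not_after):
            return False, "envelope outside validity window"
        if ctx.confirm_token in self._used_tokens:
            return False, "confirmation token already used"
        self._used_tokens.add(ctx.confirm_token)  # consume the token
        return True, "ok"
```

The token is consumed only after the other checks pass, so a request rejected for a stale step-up does not burn the user's confirmation.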
XI. Evidence Emission Requirements
Recovery operations MUST emit EAS artifacts.
A. Minimum Evidence Set
For a recovery action, the system MUST emit at least:
- execution.started (for the recovery command)
- execution.executed OR execution.failed OR execution.rejected
- compensation.started / compensation.compensated when compensation is invoked
B. Cross-Linking Evidence
Evidence artifacts for recovery MUST include:
- lifecycle.command_id (the recovery command)
- reference to target_command_id (in payload args or a dedicated field if extended)
- correlation identifiers that link recovery chain to original command chain
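Informative sketch (non-normative): building the minimum evidence set with the cross-links described above. The event names are those of this section; the dictionary layout is a guess at an EAS-like shape (the EAS schema itself is defined in RFC-WF-0006 and not reproduced here), and the rule that a rejected command emits no compensation events is an assumption.

```python
def build_evidence(event, recovery_command_id, target_command_id, correlation_id):
    """One evidence artifact; field layout is hypothetical, not the EAS schema."""
    return {
        "event": event,
        "lifecycle": {"command_id": recovery_command_id},
        "payload": {"args": {"target_command_id": target_command_id}},
        "correlation_id": correlation_id,
    }


def evidence_for_recovery(action, outcome, recovery_command_id,
                          target_command_id, correlation_id):
    """Ordered artifacts for one recovery command.

    outcome is one of "executed", "failed", "rejected".
    """
    if outcome not in {"executed", "failed", "rejected"}:
        raise ValueError(f"unknown outcome: {outcome}")
    events = ["execution.started"]
    # Assumption: a rejected command never starts compensating.
    if action == "COMPENSATE" and outcome != "rejected":
        events.append("compensation.started")
        if outcome == "executed":
            events.append("compensation.compensated")
    events.append(f"execution.{outcome}")
    return [
        build_evidence(e, recovery_command_id, target_command_id, correlation_id)
        for e in events
    ]
```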
XII. Authorization and Administrative Controls
A. Scope Binding
Recovery commands MUST be scope-gated via ACSM:
- basic retries MAY be available to the original actor (policy-defined)
- compensation and reset MUST require privileged scopes
- bulk recovery MUST require high-trust + step-up
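Informative sketch (non-normative): the three scope rules above as a single authorization check. The scope strings (`recovery:operator`, `recovery:privileged`, `recovery:high_trust`) and the function name are hypothetical and not taken from ACSM. Requiring step-up for compensation and reset follows their R3 classification in Section VIII.

```python
PRIVILEGED_ACTIONS = {"COMPENSATE", "RESET"}


def authorize_recovery(action, actor_id, original_actor_id, scopes,
                       stepped_up, bulk=False):
    """Return True if the actor may run the recovery action."""
    if bulk:
        # Bulk recovery: high-trust scope plus step-up.
        return "recovery:high_trust" in scopes and stepped_up
    if action in PRIVILEGED_ACTIONS:
        # R3 actions: privileged scope, and step-up is mandatory.
        return "recovery:privileged" in scopes and stepped_up
    if action == "RETRY":
        # Policy choice assumed here: the original actor may retry.
        return actor_id == original_actor_id or "recovery:operator" in scopes
    # Assumption: all other actions need an operator scope.
    return "recovery:operator" in scopes
```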
B. Break-glass
If break-glass recovery exists, it MUST comply with ACSM constraints (time-bounded, audited, minimal scope).
XIII. Convergence and Terminal States
An RCP-compliant workflow MUST define terminal states for a command:
- executed
- failed (non-retryable)
- rejected
- canceled (if supported)
- compensated (if compensation completed)
The system MUST ensure that repeated recovery attempts do not oscillate indefinitely without surfacing a clear “blocked” status and required operator action.
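Informative sketch (non-normative): bounding recovery attempts so that a command either reaches a terminal state or surfaces "blocked". The terminal state names and the "blocked" status come from this section; the `in_progress` starting state, the `retryable_failure` outcome label, the attempt limit of 3, and the rule that "blocked" stays set until an operator intervenes are assumptions.

```python
from dataclasses import dataclass

TERMINAL_STATES = {"executed", "failed", "rejected", "canceled", "compensated"}


@dataclass
class CommandRecord:
    command_id: str
    state: str = "in_progress"  # assumed non-terminal starting state
    attempts: int = 0           # recovery attempts so far


def apply_recovery_outcome(record: CommandRecord, outcome: str,
                           max_attempts: int = 3) -> str:
    """Record one recovery attempt and return the resulting state.

    outcome is a terminal state name or "retryable_failure".
    """
    if record.state == "blocked":
        return record.state  # needs operator action; no automatic retries
    record.attempts += 1
    if outcome in TERMINAL_STATES:
        record.state = outcome
    elif record.attempts >= max_attempts:
        record.state = "blocked"  # stop oscillating; surface to an operator
    return record.state
```

With the default limit, three consecutive retryable failures move the record to "blocked", at which point the conversational reply would state the required operator action rather than offering another retry.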
XIV. Conversational Recovery Workflow (Normative Pattern)
A compliant recovery flow SHOULD follow:
- User requests status (OoC.CMD or RCP.STATUS)
- System offers RCP.OPTIONS
- User selects action (numbered menu or explicit command)
- System shows preview + risk + effect summary
- System collects confirmation / step-up as required
- System executes recovery and emits evidence
- System reports outcome + next options (if still not converged)
XV. Relationship to Other RFCs
- WFCS (RFC-WF-0001): mandates recovery autonomy via WhatsApp.
- CCP (RFC-WF-0003): provides envelope + confirmation + idempotency substrate.
- ACSM (RFC-WF-0004): scopes + step-up for high-impact recovery.
- CRCD (RFC-WF-0005): declares recoverability and binds plans/policies.
- EAS (RFC-WF-0006): evidence format for recovery/compensation lifecycle.
- OoC (RFC-WF-0007): discovery and drill-down into failures to trigger recovery.
XVI. Security Considerations
Recovery endpoints are powerful. Implementations MUST:
- enforce strict scope gating and step-up policies
- rate-limit repeated recovery attempts
- prevent cross-tenant recovery access
- redact sensitive evidence in conversational outputs
- record recovery actions as first-class evidence artifacts
XVII. Conclusion
RCP turns “something failed, call DevOps” into a governed, evidence-backed, WhatsApp-operable protocol. By standardizing recovery actions, compensation plans, and convergence semantics under CCP/ACSM/EAS, WhatsApp-first systems remain resilient and operational under real-world failure conditions—using the same channel that runs the business.
References
[1] RFC-WF-0001, WhatsApp-First Compliance Core (WFCS).
[2] RFC-WF-0003, Conversational Command Protocol (CCP).
[3] RFC-WF-0004, Administrative Command Security Model (ACSM).
[4] RFC-WF-0005, Command Registry & Capability Declaration (CRCD).
[5] RFC-WF-0006, Evidence Artifact Schema (EAS).
[6] RFC-WF-0007, Observability over Conversation (OoC).
Concepts and Technologies
Retry/cancel/reprocess/rollback, compensation plans, saga semantics, idempotency keys, convergence guarantees, evidence emission, step-up verification, scope-gated recovery, incident operations over WhatsApp.