The usual symptoms are familiar: stakeholders demand higher open rates even though revenue stalls; product teams ship more notifications and users opt out; analytics shows clicks but nobody can prove whether a notification created that sale or just reported it. The root causes are fragmented data, privacy-driven metric noise, weak experiment hygiene, and no causal measurement baked into notification analytics.
Contents
- Which engagement metrics actually move revenue
- How to design A/B tests for notifications that don't lie
- How to attribute notifications and tie outcomes to the P&L
- How to automate insights and scale optimization across channels
- Practical playbook: checklists, SQL, and experiment templates
Which engagement metrics actually move revenue
Start with the one question that changes behavior: which metric, when it moves, changes the business’s bottom line? For notifications that must be answered with revenue or high-confidence proxies of revenue, not headline opens.
- Delivery / reach: messages successfully delivered (latency and bounces matter).
- Open / view: useful for subject-line or preview text experiments, but unreliable after client-side preloading (Apple Mail MPP inflated opens). Do not use opens as the primary business KPI for email.
- Click-through rate (CTR) and Click-to-open rate (CTOR): stronger signals of content relevance and intent. Use CTR/CTOR for content and CTA tests.
- Conversion rate and revenue-per-message (RPM): the true north — link notifications to purchases, signups, or LTV. Use order-level joins and margin-aware revenue. (Explained below.)
- Cost / unit economics: cost per send, vendor fees, and human engineering costs — mix these into ROI calculations.
Benchmarks vary by channel; use them as directional checks rather than absolutes:
| Channel | Typical open / view range | Typical CTR range | Which metric to prioritize |
|---|---|---|---|
| 30–45% (open rates inflated by MPP). | 1–4% (varies by vertical). | CTR / CTOR / conversions. | |
| Mobile Push | Direct opens often low single digits; total (direct + influenced) opens can be multiple× higher. | 3–15% depending on targeting & OS. | Influenced opens + conversions (measure influenced opens). |
| SMS | Very high opens (often cited ~90–98% for delivered messages) and strong CTRs; high intent channel for urgent offers. | 5–30+% for click-enabled messages (category dependent). | Revenue per message / conversion. |
| Web Push / In-app | Web push: variable (4–20%); in-app messages: very high visibility for active users. | 4–20% | Session conversion and retention. |
Important: Open rates are noisy post-privacy changes. Prioritize clicks → conversions → incremental revenue as the downstream metrics that actually move the P&L.
Contrarian insight: stop optimizing for opens. Run subject-line tests, yes — but reward teams for increasing revenue-per-exposed-user (RPEU) and reducing cost-per-incremental-dollar.
How to design A/B tests for notifications that don't lie
Clean experiments are discipline. A sloppy test produces an outcome that looks like a result but is worse than useless.
- Declare a precise hypothesis and primary KPI in plain language (e.g., “Sending cart-abandon SMS at 45m vs 90m increases 7-day incremental revenue per recipient by ≥8%”). Pre-register the success metric and stop rules.
- Choose the randomization unit carefully: user-level or account-level bucketing for multi-device users, not message-instance. Use
user_idoraccount_idbucketing to avoid cross-arm contamination. - Compute sample size and Minimum Detectable Effect (MDE) — don’t guess. Use a sample-size calculator and set alpha/power (commonly α=0.05, power=0.8). Evan Miller’s calculator is the practical standard for conversion-rate experiments.
- Pick the right statistical method:
- Use fixed-horizon frequentist tests when you can commit to minimal peeking and a prespecified sample size.
- Use sequential / controlled peeking (Optimizely Stats Engine or similar) if you need continuous monitoring with FDR control.
- Use Bayesian or bandit approaches when traffic is limited or you need immediate exploitation (bandits minimize regret but reduce final inferential certainty).
- Guardrails and multiple testing: when you run many concurrent experiments, control the false discovery rate (Benjamini–Hochberg or platform-provided controls) rather than naive p-value fishing.
- Prefer conversion or revenue as the primary metric for business experiments. Use opens only as secondary diagnostics or for very narrow content tests.
Example experiment blueprint for an email subject-line test:
- Hypothesis: Subject B increases 3-day conversion rate by ≥10% versus Subject A.
- Unit:
user_idrandomization, stratified by geography. - Metric: 3-day purchase conversion rate; guardrails: unsubscribe rate, spam complaints.
- Stats plan: α=0.05, power=0.8, use Evan Miller sample-size calc to compute N per arm. Stop after N reached and min 7 days to cover cyclic patterns.
When traffic is low, prefer sequential/Bayesian designs or run multi-armed bandits to limit lost conversions — but document the tradeoffs in interpretability.
How to attribute notifications and tie outcomes to the P&L
Attribution is an engineering + measurement architecture problem, not just a reporting option in the analytics UI.
- Use first-party identifiers and server-side event joins: store
notification_id,user_id,channel,template_id,send_time, anddelivery_status. Preserve click and open events with timestamps. These keys let you join sends to downstream conversions in the warehouse. - Choose the attribution philosophy for the question at hand:
- For incrementality, run holdout tests (the gold standard): randomly withhold notifications from a control group and measure the difference in outcomes. Preferred for proving causal revenue impact.
- For operational reporting, GA4’s data-driven attribution is the default model for ad/click paths — it helps for multi-touch shaping but is proprietary and requires sufficient data. Note GA4 deprecated several rule-based models and leans on DDA for many standard reports. Use it for channel-level view but not as a replacement for causal lift tests.
- Use Marketing Mix Modeling (MMM) for long-run, cross-channel budget planning; it complements holdouts and MTA. MMM is the top-down triangulation to reconcile platform-level claims with business outcomes.
Practical attribution approach (triangulation):
- Instrument sends and conversions in your CDP/Warehouse.
- Run short-term user-level joins (orders within a defined lookback window after a send) for operational RPM and funnel diagnostics. Use these for quick sanity checks.
- Run recurring holdout experiments (audience or geo holdouts) to measure incremental revenue for channels and automation flows. Keep holdout slices stable for program-level measurement (common practice: permanent 5–20% holdout for lifecycle flows during ongoing measurement; tune to business context).
- Reconcile platform-reported credit with holdout results and MMM outputs for budgeting and planning.
Example of the core SQL pattern (BigQuery style) that ties notifications to orders within a 7‑day window:
-- Compute revenue per notification (BigQuery)
WITH notifications AS (
SELECT user_id, notification_id, channel, send_time
FROM `project.dataset.notifications`
WHERE send_time BETWEEN '2025-11-01' AND '2025-11-30'
),
orders AS (
SELECT order_id, user_id, order_value, order_time
FROM `project.dataset.orders`
WHERE order_time BETWEEN '2025-11-01' AND '2025-12-07'
)
SELECT
n.channel,
COUNT(DISTINCT n.notification_id) AS messages_sent,
SUM(CASE WHEN o.order_id IS NOT NULL THEN o.order_value ELSE 0 END) AS revenue_within_7d,
SAFE_DIVIDE(SUM(CASE WHEN o.order_id IS NOT NULL THEN o.order_value ELSE 0 END), COUNT(DISTINCT n.notification_id)) AS revenue_per_message,
SAFE_DIVIDE(COUNT(DISTINCT o.order_id), COUNT(DISTINCT n.notification_id)) AS conversion_rate
FROM notifications n
LEFT JOIN orders o
ON o.user_id = n.user_id
AND o.order_time BETWEEN n.send_time AND TIMESTAMP_ADD(n.send_time, INTERVAL 7 DAY)
GROUP BY channel;
That query is an operational metric — treat the result as diagnostic until you validate incrementality via a holdout.
How to automate insights and scale optimization across channels
Scaling optimization requires a repeatable pipeline: instrumentation → orchestration → warehouse → experiment engine → automated analysis → deployment. Automate what you can; human-verify what you must.
Core automation building blocks:
-
Event plumbing: push
send,delivery,open,click, andconvertevents to a CDP/w-data-warehouse in near real-time. Useuser_idand consistent schema. - Notification orchestration: decouple templating, routing, and preference logic from product code via an orchestration layer (vendor or in-house). Platforms that abstract channels, retries, and fallbacks reduce engineering toil.
- Experiment platform & feature flags: integrate an experimentation system for randomized bucketing and safe rollouts; tie winners to feature flags for progressive rollout.
- Automated analysis jobs: schedule daily/weekly aggregation jobs (dbt + Airflow or managed pipelines) to compute experiment metrics, conversion windows, and revenue per send. Produce automated reports and guardrail alerts.
-
Anomaly detection & automated alerts: run ML-driven anomaly detectors on core KPIs and send alerts for rapid investigation (BigQuery ML's
ML.DETECT_ANOMALIESor equivalent are practical at scale). - Optimization loop: use experiment output to update templates, frequency caps, and audience definitions; consider contextual bandits for per-user creative selection once baseline performance and safety checks exist.
Automation example: schedule a daily job that recomputes RPM and incremental lift for every active flow; when an experiment crosses pre-registered thresholds and guardrails, trigger a deployment pipeline to roll the winner via feature flag.
Pro tip from operations: always include read-only, minimal-percentage holdouts for business-as-usual flows so you continuously measure background incremental impact as you tune frequency, timing, and content.
Practical playbook: checklists, SQL, and experiment templates
This is an executable checklist you can run tomorrow.
Pre-launch checklist (must-complete)
- Hypothesis written in one line and stored (
experiment_hypothesestable). - Primary KPI and guardrails declared (e.g., primary: 7‑day RPEU; guardrails: opt-out rate, spam complaints).
- Randomization unit and stratification plan documented.
- Sample-size / MDE calculation saved (use Evan Miller for conversions).
- Instrumentation smoke test passes (
send→delivery→clickevents appear end-to-end). - Compliance & privacy sign-off (consent and opt-in checks).
- Monitoring dashboard and on-call runbook created.
Holdout experiment protocol (short)
- Holdout size: choose between 5–20% for programmatic flows; larger for noisy channels or when you need high-precision lift.
- Duration: at least one full business cycle (commonly ≥30 days for longer-consideration products), but ensure min sample size per arm.
- Analysis: compute difference-in-differences on revenue per exposed user; bootstrap confidence intervals for revenue metrics if distribution skew is high.
Quick ROI formula (use real numbers per campaign)
- Incremental Revenue = Revenue_treatment − Revenue_holdout.
- Total Cost = (#messages_sent × vendor_cost_per_send) + campaign_creation_costs + platform costs.
- ROI = (Incremental Revenue − Total Cost) / Total Cost.
Example calculation (illustrative)
- Messages sent: 100,000
- Incremental revenue (7-day, holdout-based): $12,000
- Vendor + ops cost: $1,200
- ROI = ($12,000 − $1,200) / $1,200 = 9 → 900% ROAS
Operational SQL snippets to automate (store as scheduled dbt model)
- Revenue join (example above).
- Incrementality calculation:
-- Incremental revenue per user (simplified)
SELECT
SUM(CASE WHEN is_treatment THEN revenue ELSE 0 END) / NULLIF(SUM(CASE WHEN is_treatment THEN 1 ELSE 0 END),0) AS avg_rev_treatment,
SUM(CASE WHEN is_control THEN revenue ELSE 0 END) / NULLIF(SUM(CASE WHEN is_control THEN 1 ELSE 0 END),0) AS avg_rev_control,
(avg_rev_treatment - avg_rev_control) AS incremental_rev_per_user
FROM `project.dataset.user_revenue_with_treatment_flag`
WHERE experiment_name = 'cart_abandon_sms' AND window_days = 7;
Experiment post-mortem template (store in wiki)
- N: traffic per arm and duration.
- Primary KPI change (point estimate ± CI).
- Guardrails and secondary KPI movement.
- Practical decision (rollout %, audience slice change).
- Learnings and next test.
Automation checklist (operational)
- Daily job recomputes RPM and experiment status.
- Anomaly detector flags >20% deviation or violation of guardrails (via BigQuery ML
ML.DETECT_ANOMALIES). - Auto-rollback flag if spam complaints or opt-outs exceed threshold.
- Sync winners to orchestration engine / feature flag.
Sources
Email Open Rates By Industry (& Other Top Email Benchmarks) — HubSpot Blog - Benchmarks and the impact of Apple Mail Privacy Protection on open rates and why CTR/CTOR matter.
Email Marketing Benchmarks 2025 — MailerLite Blog - Aggregate email benchmark figures and CTR/CTOR guidance.
Braze Benchmarks & Push Notification Metrics — Braze Resources - Push metrics, direct vs influenced opens, and industry breakdowns for mobile notifications.
SMS Benchmarks 2024 — Postscript - SMS performance benchmarks and campaign-level insights for ecommerce.
Sample Size Calculator — Evan Miller (A/B testing tools) - Practical sample size and sequential-sampling calculators used for A/B test planning.
Statistical analysis methods overview — Optimizely Support - Guidance on Frequentist vs Sequential testing and platform statistical controls.
Data-driven attribution delivers better results than last-click — Google Ads Blog - Google’s position on data-driven attribution and the move away from older rule-based models.
Mastering a Holdout Test in Marketing — Measured FAQ / How-to - Practical holdout/incrementality experiment design and examples for causal measurement.
Market Guide for Marketing Mix Modeling Solutions — Gartner - Overview of modern MMM use-cases, benefits, and vendor considerations for channel-level planning.
What is a multi-armed bandit? — Optimizely Glossary - Explanation of bandits, contextual bandits and tradeoffs vs A/B testing.
SuprSend — Notification orchestration platform (product overview) - Example of a unified notification orchestration approach for multi-channel routing, templates, and preference centers.
BigQuery ML: The ML.DETECT_ANOMALIES function & Anomaly detection overview — Google Cloud Docs - How to detect anomalies in time-series and tabular metrics using BigQuery ML for automated alerts and monitoring.
False discovery rate — Columbia University (Population Health Methods) - Explanation of FDR and why it matters for multiple A/B tests and hypothesis families.
A rigorous notifications program treats every sent message as an experiment candidate and every experiment as a financial decision — measure send-level economics, insist on causality (holdouts and MMM), automate the plumbing, and align KPIs to revenue rather than vanity opens.
Top comments (0)