Introduction
Today I want to talk about data observability and a few related topics. My team and I have worked continuously on this part of our workflow, and I would like to share what we have learned. To understand data, one must first recognize that it is not merely an output, but a reflection of the logic that has been executed across systems. Every user action, system response, or triggered event leaves behind a residue in the form of recorded information. Rather than existing in abstraction, data carries the imprint of the logic that shaped it, effectively turning each row, record, or object into a timestamped decision made by code.
But working with data isn’t always smooth. Problems often arise when different systems store the same data differently, leading to confusion about which version is correct. Logic applied inconsistently across services creates more gaps. Sometimes data is left without clear ownership, making it hard to maintain. As systems grow, understanding what the data means — and how to work with it — becomes harder. And when teams rely on external tools or manual steps to combine or process data, the risk of mistakes increases.
These issues highlight why data observability matters. Without it, teams can’t easily tell where problems come from or whether their data can be trusted. Observability gives clarity. It helps teams understand how data flows, where it breaks, and how to fix it before it becomes a bigger issue.
What Is Data Observability?
Data observability is the ability to monitor the health and behavior of data as it moves through a system. At its core, it ensures that data is accurate, consistent, and reliable. This means spotting when data is missing, outdated, duplicated, or corrupted — and knowing where and why it happened.
With strong observability in place, teams can quickly detect issues and trace them back to the root cause. For example, if a report shows incorrect numbers, observability makes it easier to see whether the issue came from a failed data load, a logic error, or a stale source. Instead of guessing, teams can investigate with confidence and resolve problems faster.
Beyond fixing errors, data observability plays a key role in decision-making. When teams trust the data, they can make faster and more informed choices. From refining product strategies to debugging subscription flows or interpreting performance metrics, good data leads to better outcomes, and observability is what makes that trust possible.
Understanding Data Flow
To practice data observability effectively, a team must first understand how data flows through their system. This means tracking how data is created, how it changes over time, and where it ends up. Without this awareness, it’s difficult to catch issues or explain unexpected results.
Every piece of data goes through different states. For example, a subscription might start in a PENDING_ACTIVATION state, move to ACTIVE, and eventually become EXPIRED or CANCELLED. Each of these states has a meaning tied to business logic. PENDING_ACTIVATION might signal that a user has initiated a subscription but hasn't yet activated it. EXPIRED could mean the subscription ended naturally, while CANCELLED might indicate a user-initiated termination or a system-triggered rollback.

It's also important to define how long data is expected to stay in each state. A record stuck in PENDING_ACTIVATION for more than 24 hours might be a red flag. Without defined time windows, teams won't know when data is stale or whether something has failed silently.
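As a concrete illustration, here is a minimal sketch of how such time windows could be checked. It assumes subscription records carrying a subscriptionStatus and an updatedAt epoch-millisecond timestamp (matching the change-history example later in this post); the window values themselves are illustrative, not our exact configuration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative time windows: how long a record may sit in each state
# before it is considered stale. Real values depend on business logic.
STATE_TIME_WINDOWS = {
    "PENDING_ACTIVATION": timedelta(hours=24),
    "ACTIVE": None,  # no upper bound while a subscription is active
}

def find_stale_records(records: list[dict]) -> list[dict]:
    """Return records that have stayed in their current state for too long."""
    now = datetime.now(timezone.utc)
    stale = []
    for record in records:
        window = STATE_TIME_WINDOWS.get(record["subscriptionStatus"])
        if window is None:
            continue  # unknown state or no freshness constraint
        updated_at = datetime.fromtimestamp(record["updatedAt"] / 1000, tz=timezone.utc)
        if now - updated_at > window:
            stale.append(record)
    return stale
```

A check like this can run on a schedule and feed alerts, so silent failures surface before anyone has to go looking for them.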
Equally critical is understanding the transition routes — how data moves from one state to another. Tracking these transitions creates transparency and accountability. The best way to do this is through change history. A well-structured change history logs not just what changed, but also metadata around the change: who or what triggered it, when it happened, and why. The ideal structure includes:
- Metadata (timestamp, source, actor),
- Old Data (previous state or value),
- New Data (the updated state or value).
[
  {
    "metadata": {
      "partnerSubscriptionId": "XXXXXXXXXXXXXX",
      "subscriptionEndDate": 1749200000000,
      "smc": "*********",
      "hhid": "*********",
      "salesChannel": "CHANNEL_X",
      "packId": "generic-pack-id",
      "createdAt": 1746000000000,
      "partner": "PARTNER_X",
      "assetId": "GenericAsset",
      "client": "INTERNAL_SYSTEM",
      "subscriptionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
      "subscriptionStartDate": 1746000000000,
      "userProductSubscriptionId": "XXXXXXXXXXXXXX"
    },
    "hhid": "XXXXXXXX",
    "operation": "MODIFY",
    "puid": "anonymous",
    "loggedAt": 1749200000000,
    "accountId": "anonymous",
    "subscriptionId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "table": "subscription_table",
    "smc": "XXXXXXXXXXXX",
    "apId": "anonymous",
    "oldData": {
      "subscriptionStatus": "ACTIVE",
      "autoRenew": true,
      "updatedAt": 1746000000000
    },
    "subscriptionStatus": "EXPIRED",
    "assetId": "GenericAsset",
    "partner": "PARTNER_X",
    "SK": "CHANGE#1749200000000",
    "newData": {
      "subscriptionStatus": "EXPIRED",
      "autoRenew": false,
      "updatedAt": 1749200000000
    },
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "packId": "generic-pack-monthly"
  }
]
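The record above shows the stored shape. As a rough sketch of how such an entry could be assembled at write time, the helper below builds a change record from the old and new values plus the surrounding metadata; the field names mirror the example, but the function itself is illustrative rather than our exact implementation.

```python
import time
import uuid

def build_change_record(table: str, subscription_id: str, actor: str,
                        metadata: dict, old_data: dict, new_data: dict) -> dict:
    """Assemble a change-history entry capturing who changed what, and when."""
    logged_at = int(time.time() * 1000)  # epoch milliseconds, matching stored records
    return {
        "id": str(uuid.uuid4()),
        "table": table,
        "SK": f"CHANGE#{logged_at}",  # sort key: change entries ordered by time
        "operation": "MODIFY",
        "loggedAt": logged_at,
        "puid": actor,                # who or what triggered the change
        "subscriptionId": subscription_id,
        "metadata": metadata,         # contextual snapshot at the time of the change
        "oldData": old_data,          # previous state or value
        "newData": new_data,          # updated state or value
    }
```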
With this level of tracking, teams gain a clear view into the life cycle of any data point, making root cause analysis, debugging, and auditing significantly easier.
Consolidation and Aggregation
Healthy data starts with the ability to see the full picture — not just fragments scattered across systems. In modern architectures, it's common for information about a single entity to live in multiple datastores, maintained by different services. Without aggregation, each piece of data remains incomplete, and insights drawn from them are at best limited, at worst misleading.
To make data useful, teams must consolidate it across both internal and external sources. This requires a clear understanding of actor profiles: a logical grouping of all the relevant data tied to a single subject, such as a user, account, or device. Without this profile view, systems remain reactive and siloed, and nobody wants silos.
In our case, aggregation is performed on demand in specific contexts. For example, when we retrieve information tied to a user's smart card, we gather data from several sources at once:
- Subscription data, stored internally and reflecting the user's current and historical subscriptions.
- Entitlement data, calculated dynamically through CRM logic that applies partner-specific rules, eligibility criteria, and service configurations.
- Partner-related data, which may be sourced externally and used to contextualize how the user interacts with third-party services.
Each of these datasets plays a role in shaping the full state of the user. Without aggregation, teams would have to stitch these pieces together manually, which is too slow and fragile to support modern operations.
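A minimal sketch of this kind of on-demand aggregation is shown below. The fetch functions are placeholders standing in for the real calls to the internal subscription store, the CRM entitlement logic, and partner APIs; only the fan-out pattern is the point here.

```python
import asyncio

# Placeholder fetchers; in practice these would call the internal subscription
# store, the CRM entitlement logic, and external partner systems respectively.
async def fetch_subscriptions(smart_card_id: str) -> list[dict]:
    return [{"subscriptionStatus": "ACTIVE", "packId": "generic-pack-monthly"}]

async def fetch_entitlements(smart_card_id: str) -> list[dict]:
    return [{"assetId": "GenericAsset", "eligible": True}]

async def fetch_partner_data(smart_card_id: str) -> dict:
    return {"partner": "PARTNER_X", "lastEvent": "SERVICE_ACTIVATED"}

async def build_user_profile(smart_card_id: str) -> dict:
    """Aggregate an actor profile on demand by fanning out to each source."""
    subscriptions, entitlements, partner = await asyncio.gather(
        fetch_subscriptions(smart_card_id),
        fetch_entitlements(smart_card_id),
        fetch_partner_data(smart_card_id),
    )
    return {
        "smartCardId": smart_card_id,
        "subscriptions": subscriptions,
        "entitlements": entitlements,
        "partner": partner,
    }

if __name__ == "__main__":
    print(asyncio.run(build_user_profile("*********")))
```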
Additionally, system configurational data — feature flags, environment-specific settings, or service-level parameters — must be easy to access and interpret. When teams have quick visibility into the configuration that shaped a data point, debugging becomes faster and business behavior easier to explain. It’s not enough to track the data alone; we must also track the rules that govern how that data behaves.
Understanding Data Freshness and Completeness
Healthy data must not only be accurate — it must also be timely and complete. This is especially true in environments where data is sourced from multiple partners and systems. For our team, freshness refers to how recently the data was updated and how reliably it reflects the current state of CRM activity (our source of truth). When working with external integrations, it’s important to recognize that not all systems operate in real time, so decisions must be made about how often data is fetched, transformed, and loaded.
These decisions directly tie into ETL pipeline design. Striking the right balance between data consistency and performance is key. Trying to always stay perfectly in sync with every upstream system can create unnecessary load or latency. On the other hand, overly infrequent updates can make the data stale and unusable.
To address this, our ETL pipelines are designed to scale both logically and operationally:
Extraction is often handled asynchronously by standalone services or serverless functions with dedicated resources. Since this stage can be resource-intensive, especially when pulling from partner APIs or scanning large internal datasets, it is decoupled from real-time workflows. This decoupling is important for maintaining the availability and durability of those real-time workflows. The frequency of extraction is tuned to the freshness requirement of each data source: some sources may be pulled hourly, others daily, depending on how critical and volatile the data is.
Transformation covers various forms of computation, from simple mapping to statistical aggregations such as totals, averages, and distributions. In our system, partner-specific subscription data is transformed concurrently, using context-aware processing that segments workloads by partner to avoid bottlenecks. Depending on complexity and resource cost, these transformations either happen within the extraction step or are delegated to separate transformation functions.
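To illustrate the partner-segmented approach, here is a hedged sketch that splits the workload by partner and computes a few simple aggregates concurrently; the durationDays field and the worker count are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def transform_partner(partner: str, subscriptions: list[dict]) -> dict:
    """Per-partner transformation: simple mapping plus a few statistical aggregates."""
    statuses = [s["subscriptionStatus"] for s in subscriptions]
    durations = [s.get("durationDays", 0) for s in subscriptions]  # durationDays is illustrative
    return {
        "partner": partner,
        "total": len(subscriptions),
        "active": statuses.count("ACTIVE"),
        "avgDurationDays": mean(durations) if durations else 0,
    }

def transform_all(subscriptions_by_partner: dict[str, list[dict]]) -> list[dict]:
    """Segment the workload by partner and transform each slice concurrently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(transform_partner, partner, subs)
            for partner, subs in subscriptions_by_partner.items()
        ]
        return [future.result() for future in futures]
```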
Load is the final stage, where the processed data is stored. For most of our needs, a single structured JSON file stored in S3 suffices, given that the data is precalculated and intended for read-heavy use cases. To improve performance, we place a read-through cache in front of this storage, allowing downstream consumers to access the latest data quickly. Whenever the load process completes, the cache is also cleared and refreshed, ensuring consistency between stored data and what consumers read.
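A rough sketch of that load step and the read-through cache is shown below, assuming boto3 and a single precalculated JSON object in S3; the bucket name, object key, and TTL are placeholders, not our real configuration.

```python
import json
import time
import boto3

S3_BUCKET = "observability-aggregates"   # placeholder bucket name
S3_KEY = "aggregates/latest.json"        # placeholder object key
CACHE_TTL_SECONDS = 300

_s3 = boto3.client("s3")
_cache: dict[str, tuple[float, dict]] = {}  # key -> (cached_at, data)

def get_aggregate() -> dict:
    """Read-through cache: serve from memory while fresh, otherwise fall back to S3."""
    entry = _cache.get(S3_KEY)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]
    body = _s3.get_object(Bucket=S3_BUCKET, Key=S3_KEY)["Body"].read()
    data = json.loads(body)
    _cache[S3_KEY] = (time.time(), data)
    return data

def publish_aggregate(data: dict) -> None:
    """Load step: write the new aggregate, then refresh the cache so readers stay consistent."""
    _s3.put_object(Bucket=S3_BUCKET, Key=S3_KEY, Body=json.dumps(data).encode("utf-8"))
    _cache[S3_KEY] = (time.time(), data)
```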
Completeness is another pillar of data health. Often, the focus is on sanitizing data at entry points — validating input, enforcing schemas, ensuring type safety. But sanitization during retrieval is just as important, especially in systems where manual edits, migrations, or external syncs might have bypassed initial validation. We treat both entry and exit as critical points for enforcing data standards, catching missing attributes, and preserving structural integrity.
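As a small example of retrieval-time sanitization, the check below validates a record on the way out and reports structural problems instead of silently passing them downstream; the required fields and expected types are illustrative.

```python
REQUIRED_FIELDS = {
    "subscriptionId": str,
    "subscriptionStatus": str,
    "subscriptionStartDate": int,  # epoch milliseconds
}

def sanitize_on_read(record: dict) -> tuple[dict, list[str]]:
    """Validate a record at the exit point and collect any structural issues."""
    issues = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return record, issues
```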
Without freshness, data becomes misleading. Without completeness, it becomes fragile. Observability into both helps teams ensure that what they’re seeing is both recent and whole.
Creating Avenues to View Data Health
Observability is only useful when data health can be inspected both broadly and in context. For a team to react quickly to issues, understand root causes, or maintain confidence in their systems, there must be clear and accessible ways to monitor how data is behaving over time.
Viewing data health can happen at two levels: system-wide or context-specific. A broad system view might highlight trends, such as an increase in failed data loads or a drop in expected event volumes. But often, the most meaningful insights come from zooming into a specific actor or data entity — seeing what happened, when, and why.
In our systems, this context-based observability takes on many forms:
- User subscriptions are a core entity we track. Each subscription carries a lifecycle, from activation to renewal to expiration, and understanding the health of this data involves checking whether transitions occurred as expected, whether timestamps align, and whether associated metadata (like auto-renew flags or entitlement links) is correct and intact. If a subscription appears stuck or missing key attributes, it may indicate a broader system issue.
- Scheduled actions are another important context. These are time-driven operations like renewals, cancellations, or retries. To understand their health, we support queries across time windows and statuses, such as identifying actions that were QUEUED but never EXECUTED, or that failed unexpectedly. Being able to slice this by partner, product, or status allows teams to quickly isolate patterns and respond (a minimal query sketch follows this list).
- Partner events, which are signals from external systems, form another layer of contextual health. These events might indicate that a user has activated a service, consumed content, or encountered an error. We monitor if these events are received, verify that they’re parsed accurately, and ensure downstream systems respond as intended. When expected events go missing or arrive malformed, it becomes a signal that something upstream may be broken.
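Here is the sketch referenced above: a query over scheduled actions that finds entries queued within a time window but never executed, optionally filtered by partner. The field names (scheduledAt, status, executedAt, partner) are assumptions for illustration, not our exact schema.

```python
def find_unexecuted_actions(actions: list[dict], window_start_ms: int,
                            window_end_ms: int, partner: str | None = None) -> list[dict]:
    """Scheduled actions queued inside the window that never reached EXECUTED."""
    return [
        action for action in actions
        if window_start_ms <= action["scheduledAt"] <= window_end_ms
        and action["status"] == "QUEUED"
        and action.get("executedAt") is None
        and (partner is None or action.get("partner") == partner)
    ]
```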
By building these contextual views, teams gain the ability not just to observe data, but to understand it. Whether investigating a single issue or analyzing long-term trends, these views into data health are what separate reactive problem-solving from proactive improvement.
Creating Avenues for Meaningful Data Transition, Extraction, and Storytelling
Raw data, no matter how accurate or complete, becomes significantly more valuable when teams can visualize, interpret, and communicate its meaning. Data storytelling transforms numbers and transitions into narratives that drive understanding, alignment, and action — especially for non-technical stakeholders.
We start with visualization, which is often the most immediate way to surface meaning. Charts help display trends, distributions, and anomalies in a digestible format — whether it's a spike in subscription failures or a dip in partner event delivery. When paired with color-coded statuses, these visualizations can immediately highlight the state of a dataset or flow — for instance, using green for COMPLETED, yellow for PENDING, and red for FAILED — without requiring users to parse detailed logs.
Beyond visuals, we invest heavily in AI-generated summaries to bridge the gap between raw data and human decision-making. Our team uses in-house agents to generate summaries at different levels of granularity:
- For user actors, the agent produces insights such as subscription health, recent failures, entitlement mismatches, or eligibility violations.
- For partners, another agent compiles metrics and patterns into periodic reports and strategic recommendations, covering usage, errors, and integration health.
We're actively exploring ways to enhance these summaries with memory and context. One improvement involves converting generated summaries into embeddings using NLP (Natural Language Processing) techniques and storing them as vectors. Then, on the next analysis request, the agent could convert the new prompt into an embedding, retrieve the five most similar historical summaries, and enrich the prompt with this context. This approach helps produce better, more informed summaries that evolve over time.
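A minimal sketch of that retrieval loop is shown below. The embed function here is a random stand-in that would need to be replaced by a real sentence-embedding model, and the in-memory store would typically be a vector database in practice; only the top-5 similarity lookup and the prompt enrichment are the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding; replace with a real NLP embedding model for meaningful results."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)

summary_store: list[tuple[str, np.ndarray]] = []  # (summary text, embedding vector)

def remember(summary: str) -> None:
    """Store a generated summary alongside its embedding."""
    summary_store.append((summary, embed(summary)))

def similar_summaries(prompt: str, k: int = 5) -> list[str]:
    """Retrieve the k historical summaries most similar to the new prompt."""
    query = embed(prompt)

    def cosine(vector: np.ndarray) -> float:
        return float(np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector)))

    ranked = sorted(summary_store, key=lambda item: cosine(item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_enriched_prompt(prompt: str) -> str:
    """Prepend the most relevant past summaries as context for the next analysis."""
    context = "\n".join(similar_summaries(prompt))
    return f"Context from previous analyses:\n{context}\n\nNew request:\n{prompt}"
```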
These generated insights are often used in non-technical decision-making, from partner relationship discussions to strategic roadmap planning. For this reason, we also support easy export options — allowing summaries to be copied directly or downloaded as PDFs for distribution in reports and presentations.
To maintain performance and availability, especially under repeated or automated usage, these agents are backed by read-through caches. This prevents overloading the AI systems, reduces latency for frequent queries, and ensures consistency in outputs for the same context.
Ultimately, storytelling is what allows technical data to influence real-world outcomes. By creating tools that present, explain, and share data meaningfully, we ensure it has the power to inform and guide decisions at every level of the organization.
Proactive Data Issue Resolution
While observability helps teams monitor and understand data, the real advantage comes when those insights are used to resolve issues before they escalate. Proactive data resolution means building systems that not only detect anomalies but also guide, automate, or trigger corrective actions across the stack.
The first step involves static logical analysis, which scans data against clearly defined rules to identify structured violations. These are issues that can be caught with deterministic checks: for example, a subscription marked ACTIVE but missing a startDate, or an entitlement with an invalid configuration. These checks are currently run on demand during subscription data aggregation. In the future, we will automate them to run regularly and help catch data that is in a broken but detectable state.
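A few such deterministic checks might look like the sketch below. The first rule comes straight from the example above; the other two are illustrative additions, not an exhaustive rule set.

```python
def check_subscription(record: dict) -> list[str]:
    """Deterministic checks for structured violations in a single subscription record."""
    violations = []
    if record.get("subscriptionStatus") == "ACTIVE" and not record.get("subscriptionStartDate"):
        violations.append("ACTIVE subscription is missing subscriptionStartDate")
    start = record.get("subscriptionStartDate")
    end = record.get("subscriptionEndDate")
    if start and end and end < start:
        violations.append("subscriptionEndDate is earlier than subscriptionStartDate")
    if record.get("subscriptionStatus") == "EXPIRED" and record.get("autoRenew"):
        violations.append("EXPIRED subscription still flagged for auto-renew")
    return violations
```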
More complex problems — especially those involving pattern recognition or inconsistent data relationships — require AI-driven suggestions. These AI agents help identify unstructured violations, such as unexpected spikes in cancellations, or subtle mismatches between entitlements and partner rules. These suggestions are governed by configurable guardrails to ensure they stay within bounds that are understandable and controllable by the team. On the backend, we track prompt consumption, not just for logging and debugging, but to safeguard against misuse, hallucination, or context drift that could compromise model accuracy or security.
Once a violation is detected, either through a rule or an AI-generated suggestion, resolution must be actionable. That’s why we’ve built agents that integrate directly with task management tools like Jira. When an issue is confirmed, these agents can suggest Jira tickets with full context: the data in question, violation type, metadata, and a recommended fix path. This shortens the cycle between detection and accountability, allowing issues to be tracked and resolved in standard engineering workflows.
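As a hedged sketch of that integration, the snippet below drafts an issue through the Jira REST API using the requests library; the instance URL, project key, issue type, and auth mechanism are placeholders and would depend on the actual Jira deployment.

```python
import requests

JIRA_URL = "https://example.atlassian.net"  # placeholder Jira instance
PROJECT_KEY = "DATA"                        # placeholder project key

def suggest_ticket(violation: str, record: dict, recommended_fix: str, auth) -> dict:
    """Draft a Jira issue carrying the violation, the offending data, and a fix path."""
    payload = {
        "fields": {
            "project": {"key": PROJECT_KEY},
            "summary": f"Data violation: {violation}",
            "description": (
                f"Violation: {violation}\n"
                f"Record: {record}\n"
                f"Recommended fix: {recommended_fix}"
            ),
            "issuetype": {"name": "Bug"},
        }
    }
    response = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=auth)
    response.raise_for_status()
    return response.json()
```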
Another key pillar of proactive resolution is maintaining synchronization between related datastores. In systems where multiple services maintain different views of the same data, desyncs are inevitable. To address this, we intend to use both manual and automated sync pipelines. Some pipelines would run on a schedule to reconcile mismatches, while others can be triggered ad hoc when a drift is manually detected. These processes ensure consistency without requiring constant developer intervention.
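A reconciliation pass between two such datastores can be sketched as a simple field-by-field comparison; which fields are compared, and how missing records are handled, are assumptions made for illustration.

```python
def find_desyncs(primary: dict[str, dict], replica: dict[str, dict],
                 fields: list[str]) -> list[dict]:
    """Compare two views of the same entities and report mismatched or missing records."""
    desyncs = []
    for key, record in primary.items():
        other = replica.get(key)
        if other is None:
            desyncs.append({"id": key, "issue": "missing in replica"})
            continue
        for field in fields:
            if record.get(field) != other.get(field):
                desyncs.append({
                    "id": key,
                    "field": field,
                    "primary": record.get(field),
                    "replica": other.get(field),
                })
    return desyncs
```

A scheduled pipeline can feed a report like this into automated repair, while ad hoc runs cover drift that is spotted manually.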
Proactivity in data management isn’t just about building alerts — it’s about designing flows that detect, explain, and repair issues at the right level of automation. The result is a system that doesn’t just observe its state, but works to maintain its integrity in real time.
How Data Observability Has Helped My Team
Adopting data observability has had a transformative effect on how our team operates. What once required manual digging, cross-referencing, and tribal knowledge can now be done quickly, visually, and with far more confidence. The biggest shift has been in data surveillance — we now have a clear, consistent view of how data moves and behaves throughout our systems.
One of the most immediate benefits is how easily team members can understand user data profiles. A developer debugging a flow, a QA tester verifying a fix, or a product owner validating a new rule can all inspect the full data picture for any given user without jumping across dashboards or databases. This has made behavioral patterns more traceable, allowing us to detect anomalies like missing activations, frequent subscription failures, or inconsistent entitlement states.
Data gaps such as missing fields, incomplete transitions, or failed triggers now surface visibly, making them easy to flag and investigate early. This has greatly improved our QA workflow, as testers no longer need to manually reconstruct test cases from fragmented logs. Instead, they can validate entire flows from a central point of visibility. Even during UAT, stakeholders can observe how data responds across environments with a bird's eye view, reducing ambiguity and speeding up feedback cycles. My team is in the process of extending dashboard access to first responders and customer service, which will greatly boost their ability to help customers.
Beyond day-to-day operations, observability has helped with tracking one-time activities, such as bulk email campaigns or pre-provisioning subscription entitlements. These kinds of scheduled jobs are notoriously hard to monitor without proper instrumentation. With observability in place, we can now monitor their execution, volume, and any edge-case failures without writing one-off scripts.
Finally, observability has created a direct line of visibility for leadership. High-level statistics, such as total active subscriptions, partner-triggered event rates, or renewal success ratios, are now exposed through curated summaries and dashboards. This allows management to make decisions based on data, without relying on delayed reports or engineering cycles to extract insights.
In short, observability hasn’t just improved how we handle data — it’s improved how the entire team communicates, collaborates, and aligns around it.
Future Improvements in Terms of AI
As our use of data observability matures, we’re looking to expand our AI capabilities with a sharper focus on context-awareness, scalability, and practical integration across teams. One of our main objectives is to maintain a dedicated Small Language Model (SLM) that is trained on internal systems, workflows, and vocabulary. This SLM would act as a lightweight, focused assistant — optimized for our operational context — and continuously refined by internal AI teams to stay aligned with evolving business needs.
A deeper understanding of model management will be essential. Beyond just deploying models, we’re considering a foundation for version control, prompt governance, feedback loops, and performance evaluation in real-world scenarios. The goal is to ensure that the models we rely on not only produce accurate results but also reflect the nuances of our environment and workflows.
We also plan to extend decision-making workflows through AI. This could include suggesting data fixes, detecting and prioritizing anomalies, and recommending operational actions based on historical patterns. These automations wouldn’t replace human decisions, but rather amplify the speed and quality of those decisions, especially in high-volume or high-pressure contexts.
Finally, one of the most exciting frontiers is connecting AI to team priorities and planning. We envision tools that can monitor workstreams, identify friction points, and suggest roadmap improvements based on observed data and recurring pain points. By highlighting areas ripe for automation or surfacing issues that consistently slow down delivery, AI can play a role in shaping strategy, not just supporting it.
Conclusion
Data is no longer just an output of system behavior; it is the foundation on which modern decisions, automation, and user experiences are built. As systems scale and complexity grows, it becomes crucial not only to collect data but to observe it meaningfully. Data observability ensures that data is complete, accurate, and timely, enabling teams to debug faster, monitor more effectively, and act with confidence.
But observability is only one side of the equation. The other is intelligence, and this is where AI comes in. By summarizing, interpreting, and recommending actions based on observed data, AI allows teams to move from passive awareness to proactive resolution and strategic foresight. Whether through summaries tailored to users and partners or through workflow-integrated agents that assist with decision-making, AI transforms observability from a monitoring tool into a driver of improvement.
Together, data observability and AI form a powerful loop: observability provides the clarity needed to understand the system, while AI brings the intelligence needed to optimize it. The future lies in continuously refining both — building systems that not only see clearly, but think ahead.