Asjad Ahmed Khan

Posted on Feb 16 • Edited on Feb 17

Why "Clean Data" Isn't the Same as "Usable Data"

#devrel #devtool #database

Your data team just spent weeks on a cleaning project, removing duplicate entries, normalising schemas, and handling null values. The pipeline is also running smoothly. The database is pristine.
But the moment leadership asks, "Which marketing channel drives our highest-value customers?", the room goes silent.

But why? Here’s the thing: your team optimised for technical correctness but missed strategic utility. You have perfectly formatted data that tells you nothing useful. Well-structured columns that can't answer business questions. Immaculate schemas that don't capture meaning.

Clean data and usable data aren't the same thing. Most teams discover this after the cleaning project is done and the insights still don't materialise.

This article explores why cleaning alone falls short, what usable data actually requires, where the gap appears in real work, and how to bridge it. We'll introduce the concept of a semantic layer, show you how AstroBee implements it, and walk through a practical example that transforms technically perfect data into something that actually answers questions.

What "Clean Data" Means (And Why It's Not Enough)

Clean data is well-formatted, deduplicated, properly typed, and error-free. Tables follow normal forms, and queries run fast.

Cleaning solves real problems: bad data types breaking queries, duplicates inflating counts, and nulls crashing calculations. What it doesn't add is the meaning, context, business rules, or relationships that matter to decisions.

You can have a perfectly maintained created_at timestamp, indexed, never null, and ISO 8601-formatted. Does it tell you whether this user is qualified or just browsing? Does it explain why some users activate in 24 hours while others take 60 days? No, absolutely not.

Technically sound data becomes strategically valuable when it carries meaning.

What Makes Data Usable

So what separates the two? Usable data answers questions, drives decisions, and connects to outcomes. Three things make this possible:

Semantic consistency: "Activation" means the same thing everywhere. "High-value customer" has one definition, not five competing ones. When someone says "engaged user," everyone knows exactly what that means and how it's calculated.

Business context: A pageview becomes part of a journey. An API call signals adoption or struggle during product trial. The data layer understands what things mean for your business.

Relationship clarity: You can trace cause to effect. Content engagement connects to product adoption. Community participation links to retention. Support patterns map to churn risk.

Most tools handle structure brilliantly, normalising schemas, enforcing types, and optimising queries, while missing the layer that captures business rules and turns data into decisions.

Where the Gap Shows Up in Real Work

The clean-versus-usable gap is felt most when it breaks real workflows in specific, recognisable ways:

The "revenue" problem: Finance counts revenue when payments clear, sales counts it when contracts are signed, and product counts it when usage starts. Each department's milestone is valid on its own, but none of them are aligned. The moment you're asked about revenue, you have three different answers.

The attribution black hole: Every click is perfectly recorded in the analytics table. You want to know which touchpoints drive conversions. The data can't tell you because it doesn't capture the complete user journey.

The metric graveyard: Dashboards full of numbers that nobody knows what to do with. "Active users" shows 10,000. Is that good? Compared to what? Who counts as active? The metrics are technically correct but strategically useless.

Every question requires a custom query because the data doesn't speak business language. Marketing wants to know which campaigns work. Simple question. Three-day turnaround because the data team has to translate "which campaigns work" into technical definitions, figure out what "work" means, join six tables, and validate the logic with stakeholders.

These problems persist even after cleaning because cleaning addresses structure, not semantics.
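
The "revenue" problem can be made concrete with a small sketch. The deal records, field names, and dates below are invented for illustration; the only point is that one clean table yields three different revenue figures depending on which milestone a department counts.

```python
from datetime import date

# Hypothetical, already-clean deal records (all values invented for illustration).
deals = [
    {"amount": 1000, "signed": date(2024, 3, 20), "paid": date(2024, 4, 5), "usage_started": date(2024, 4, 1)},
    {"amount": 2500, "signed": date(2024, 3, 28), "paid": date(2024, 3, 30), "usage_started": date(2024, 4, 10)},
    {"amount": 4000, "signed": date(2024, 2, 15), "paid": date(2024, 3, 1), "usage_started": date(2024, 3, 5)},
]


def revenue_in_month(records, milestone, year, month):
    """Sum amounts for deals whose chosen milestone date falls in the given month."""
    return sum(
        r["amount"]
        for r in records
        if r[milestone].year == year and r[milestone].month == month
    )


# Same rows, three definitions of "revenue" for March 2024.
march = {
    "sales (contract signed)": revenue_in_month(deals, "signed", 2024, 3),
    "finance (payment cleared)": revenue_in_month(deals, "paid", 2024, 3),
    "product (usage started)": revenue_in_month(deals, "usage_started", 2024, 3),
}
for team, total in march.items():
    print(f"{team}: {total}")
```

Nothing in the rows is wrong, and no amount of further cleaning would reconcile the three totals. The disagreement lives entirely in the choice of milestone.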

The Missing Layer: From Structure to Semantics

The traditional data stack looks like this: raw data → ETL → clean data → BI tool.

This works for rendering charts but fails at answering questions because there's a missing link between technically correct data and useful insights.

That link is a semantic layer: a foundation that captures what your data actually means. What does "activated user" mean in this business? How do customer and account entities relate? What determines if someone is "high-value"? Which metrics matter and how are they calculated?

The semantic layer sits between your database and analysis tools. It translates business questions into technical queries and ensures everyone uses consistent definitions. Without it, every analyst rebuilds business rules from scratch, interpretations diverge, and trust erodes.
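
A minimal sketch of that idea, not a description of how any particular product implements it: a registry where each business term is defined once and every consumer resolves the term through the same rule. The `SemanticLayer` class, its `define` and `evaluate` methods, the sample rows, and the spend threshold are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class SemanticLayer:
    """Toy registry mapping each business term to a single shared definition."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Callable[[Row], bool]] = {}

    def define(self, term: str, rule: Callable[[Row], bool]) -> None:
        # Refuse silent redefinition so competing definitions surface early.
        if term in self._definitions:
            raise ValueError(f"'{term}' is already defined")
        self._definitions[term] = rule

    def evaluate(self, term: str, rows: List[Row]) -> List[Row]:
        # Every caller resolves the term through the same stored rule.
        rule = self._definitions[term]
        return [row for row in rows if rule(row)]


# Hypothetical user rows; the 500 threshold is an example, not a recommendation.
users = [
    {"id": 1, "lifetime_spend": 1200},
    {"id": 2, "lifetime_spend": 80},
    {"id": 3, "lifetime_spend": 560},
]

layer = SemanticLayer()
layer.define("high-value customer", lambda row: row["lifetime_spend"] >= 500)

high_value_ids = [row["id"] for row in layer.evaluate("high-value customer", users)]
print(high_value_ids)  # [1, 3]
```

Rejecting a second definition of the same term is a deliberate choice in this sketch: it forces the "which definition is right?" conversation to happen once, up front, rather than in every analyst's query.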

Tools that build semantic layers automatically can bridge this gap.

How AstroBee Bridges the Gap

AstroBee connects to your existing data sources and automatically builds a semantic layer. Unlike traditional tools that require clean inputs, AstroBee works with whatever you have: clean, messy, or somewhere in between.

Three capabilities make this possible:

Connect without moving data. Link directly to your warehouse (BigQuery, Snowflake, and others) so AstroBee analyses existing datasets without moving or exfiltrating your data. Connect source systems like PostHog and HubSpot via Fivetran's managed ETL, or upload CSV files directly. Supported sources include Google Sheets, PostHog, HubSpot, Salesforce, Google Analytics, PostgreSQL, and MongoDB, with new connectors added regularly.
Build semantic understanding. AstroBee analyses your data structure and prompts you to add business context. Define what "engagement" means. Specify how entities relate. Create derived metrics. These definitions become queryable across all your data.
Maintain transparency. Every answer shows the underlying data, applied rules, and derivation path. You see exactly how numbers were calculated and which definitions were used. This builds trust and makes refinement straightforward.

Let's see this in practice.

Practical Walkthrough: From Clean to Usable

The Starting Point

We'll use a real DevRel dataset with three tabs: Developers, Events, and Content. You can access it here

The data is technically sound. Formatting is consistent, no duplicates. Column types are correct, and each tab is properly structured.

Ask "Which developers are most likely to become advocates?" and you're stuck. The data doesn't know what "advocate potential" means or how to connect event patterns to outcomes.

Adding Semantic Meaning with AstroBee

Step 1: Connect Your Data
Create an account on AstroBee and connect the Google Sheet:

Click "Connect Sources"
Select "Google Sheets"
Authorise via Fivetran
Paste the sheet link

AstroBee analyses your structure and sees three clean tables with proper columns. Next, you'll tell it what these tables mean for your business.

Step 2: Define What Things Mean
Open the chat interface on the right side. AstroBee prompts you to add business context. This is where technically correct becomes strategically valuable.

Start by defining what engagement actually means:

High engagement = 3+ events in the last 7 days
Active contributor = opened issues OR contributed to discussions
Content consumer = read 2+ documentation pages
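
As a rough illustration of what these three definitions amount to once written down as explicit rules, here is a sketch in plain Python. It is not AstroBee's internal representation; the event rows, column names, event type labels, and the fixed reference date are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical event rows; column names and values are assumptions,
# not the actual schema of the sample sheet.
events = [
    {"developer": "ana", "type": "docs_read", "day": date(2025, 2, 10)},
    {"developer": "ana", "type": "docs_read", "day": date(2025, 2, 12)},
    {"developer": "ana", "type": "issue_opened", "day": date(2025, 2, 14)},
    {"developer": "ben", "type": "docs_read", "day": date(2025, 1, 3)},
    {"developer": "ben", "type": "discussion_reply", "day": date(2025, 2, 13)},
]
today = date(2025, 2, 15)  # fixed reference date so the example is reproducible


def events_for(dev):
    return [e for e in events if e["developer"] == dev]


def is_high_engagement(dev, min_events=3, window_days=7):
    # "High engagement = 3+ events in the last 7 days"
    cutoff = today - timedelta(days=window_days)
    return sum(e["day"] >= cutoff for e in events_for(dev)) >= min_events


def is_active_contributor(dev):
    # "Active contributor = opened issues OR contributed to discussions"
    return any(e["type"] in {"issue_opened", "discussion_reply"} for e in events_for(dev))


def is_content_consumer(dev, min_pages=2):
    # "Content consumer = read 2+ documentation pages"
    # Simplification: counts read events rather than distinct pages.
    return sum(e["type"] == "docs_read" for e in events_for(dev)) >= min_pages


for dev in ("ana", "ben"):
    print(dev, is_high_engagement(dev), is_active_contributor(dev), is_content_consumer(dev))
```

Writing the thresholds as named parameters (`min_events`, `window_days`, `min_pages`) keeps each definition in one place, which is what makes it possible to adjust a rule later without touching the data.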

Then specify the relationships that matter:

Link events to developers by meaningful patterns
Connect content types to engagement levels
Map event sequences to outcomes

Finally, create derived metrics:

"Advocacy Score" = (community participation × content engagement × consistency)
"At-risk" = previously active, now quiet for 14+ days
"High-value developer" = uses advanced features + engages community + 60+ day tenure

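The derived metrics above can be sketched the same way. The article's definitions do not say how each factor is measured, so this example assumes the three advocacy factors are already scaled to the range 0 to 1; the `DeveloperProfile` fields and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeveloperProfile:
    # All fields are hypothetical inputs; the three factors are assumed
    # to be pre-scaled to the range 0..1.
    name: str
    community_participation: float
    content_engagement: float
    consistency: float
    first_seen: date
    last_active: date
    uses_advanced_features: bool


def advocacy_score(dev):
    # Product of the three factors: a zero in any one zeroes the score.
    return dev.community_participation * dev.content_engagement * dev.consistency


def is_at_risk(dev, today, quiet_days=14):
    # Having a last_active date stands in for "previously active" here.
    return (today - dev.last_active).days >= quiet_days


def is_high_value(dev, today, min_tenure_days=60):
    engages_community = dev.community_participation > 0
    tenure_days = (today - dev.first_seen).days
    return dev.uses_advanced_features and engages_community and tenure_days >= min_tenure_days


today = date(2025, 2, 15)
ana = DeveloperProfile("ana", 0.8, 0.5, 0.5, date(2024, 11, 1), date(2025, 2, 14), True)
ben = DeveloperProfile("ben", 0.5, 0.5, 0.0, date(2025, 1, 20), date(2025, 1, 25), False)

for dev in (ana, ben):
    print(dev.name, round(advocacy_score(dev), 2), is_at_risk(dev, today), is_high_value(dev, today))
```

Multiplying the factors, as the "Advocacy Score" definition does, means a developer with no consistency scores zero regardless of the other two. If that is too strict for your community, a weighted sum is the usual alternative.
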
Step 3: Query for Business Answers

Now you can ask questions that technically correct data couldn't answer:

"Which developers show advocate potential?"
AstroBee applies your semantic rules, filters for high engagement, checks community participation, and verifies tenure. Returns: List of developers matching the criteria with their advocacy scores.

"What content do high-value developers consume?"
AstroBee identifies high-value developers based on your definition and traces their content patterns. Returns: Content titles ranked by correlation with high-value behaviour.

"Show me at-risk developers before they churn"
AstroBee spots activity drops and flags developers matching your churn signal. Returns: Names, last activity dates, engagement history.

Step 4: Verify the Logic
Click any result to see the underlying data and applied rules. AstroBee shows:

Raw data: The actual rows that contributed
Applied logic: Which semantic rules were used
Lineage: How the answer was derived

If something looks wrong, refine your definitions. Maybe "high engagement" should be 5 events instead of 3. Update the rule and re-query. The technically correct data doesn't change. The strategic meaning does.
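
The separation between data and definitions is what makes this kind of refinement cheap. A minimal sketch, using invented event counts and a hypothetical `rules` dictionary:

```python
# Hypothetical counts of events per developer in the last 7 days.
events_last_7_days = {"ana": 6, "ben": 3, "cy": 4}

# Rules live separately from the data, so they can change independently.
rules = {"high_engagement_min_events": 3}


def high_engagement(counts, rules):
    threshold = rules["high_engagement_min_events"]
    return sorted(dev for dev, n in counts.items() if n >= threshold)


before = high_engagement(events_last_7_days, rules)

# Tighten the definition from 3 to 5 events; the underlying counts are untouched.
rules["high_engagement_min_events"] = 5
after = high_engagement(events_last_7_days, rules)

print(before)  # ['ana', 'ben', 'cy']
print(after)   # ['ana']
```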

The Result

Your data is still three properly structured spreadsheet tabs. Now it answers business questions, carries meaning, and drives decisions.

Here's the demo video of how this process looks. With the semantic layer in place, you can confidently answer questions such as who is active in the community, who is a high-value developer, and who has the potential to become a developer advocate.

What Becomes Possible

Once your semantic layer is in place, you gain clarity and direction that technically correct data alone can't provide:

Self-service analysis: Teams answer their own questions, and the data speaks the language of business.

Consistent metrics: "Revenue" is defined once and used everywhere. Dashboards show the same numbers.

Faster decisions: Questions that took three days now take three minutes.

Trust in numbers: Clear lineage shows where metrics originate, and the focus shifts from "are these numbers right?" to "what should we do?"

Conclusion

Most teams stop at cleaning and wonder why insights don't follow. They optimise pipelines and normalise schemas. The data becomes technically perfect, but the questions remain unanswered.

Cleaning is necessary. Meaning is what makes it sufficient.

The idea is to add more context. Semantic layers sit on top of existing data. They work with what you have and transform technically correct information into strategically valuable insights without months of additional preparation.

The tools to make this happen exist now. What's left is recognising how you can extract the most precise information from technically perfect data.

Ready to make your clean data usable?

Chat with our team to see how AstroBee fits your stack.

(We especially love talking to data engineers about semantic layer challenges 😁 )

Start with AstroBee: https://app.astrobee.ai/