Brian Cariveau
Garbage In, Powerhouse Out? (Nope.) Why Your Data Foundation Matters More Than AI

"Garbage in, garbage out."

You've said it. I've said it. Every data engineer has said it.

Then we move on. We buy the next AI tool. We hire more data scientists. We build more dashboards.

Meanwhile, our data foundation is still garbage.


The Uncomfortable Pattern

After 25 years in data and analytics, from small business through consulting to Wells Fargo and UnitedHealth Group, I've watched the same pattern repeat at Fortune 10 companies and scrappy startups alike:

Organizations spend millions on insights initiatives while their data is fundamentally broken.

I watched a company invest heavily in machine learning. World-class team. Sophisticated models. The demos were perfect.

Then they pointed the models at production data.

Everything broke.

Why? The foundation was garbage:

  • Inconsistent naming conventions
  • No data quality checks
  • Undocumented edge cases
  • Historical baggage nobody understood
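A basic quality gate catches exactly these problems before any model sees the data. Here's a minimal sketch in Python; the rows, fields, and rules are invented for illustration, not from the company in the story:

```python
# Minimal data quality checks: the kind of gate the broken foundation lacked.
# Rows, field names, and allowed values below are hypothetical examples.

def check_rows(rows, required_fields, allowed_statuses):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing required field '{field}'")
        status = row.get("status")
        if status is not None and status not in allowed_statuses:
            problems.append(f"row {i}: unexpected status '{status}'")
    return problems

rows = [
    {"order_id": "A-1", "status": "shipped"},
    {"order_id": "", "status": "SHIPPED!!"},  # the undocumented edge case
]
issues = check_rows(rows, required_fields=["order_id"],
                    allowed_statuses={"shipped", "pending"})
for issue in issues:
    print(issue)
```

Even a check this small, run on every load, would have surfaced the gap between the demo data and production.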

The ML team had learned to solve problems that didn't exist in production.


"Stupidly Simple Data"

A colleague of mine, Matthew Stearns, coined a phrase that stuck: "Stupidly simple data."

Not "clean data." Not "governed data." Not "enterprise-grade architecture."

Stupidly. Simple. Data.

Data so simple that:

  • A new hire can understand it on day one
  • Your grandmother could read the table names
  • Edge cases are documented
  • There are no cryptic abbreviations
  • Everything is self-documenting

Sounds basic, right?

Most organizations have data that looks like this:

sls_txn_f47
usr_bhv_ag_01
car_lst_vw_2
tmp_final_v3_ACTUAL_USE_THIS

And when you ask what they mean? Nobody knows. The engineer who built them left two years ago.
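You can even lint for this automatically. A rough sketch that flags abbreviation-heavy names like the ones above; the token list is an illustrative heuristic, not a real tool:

```python
# Flag table names that lean on cryptic abbreviations.
# The abbreviation list is an illustrative heuristic, not an exhaustive rule.
import re

CRYPTIC_TOKENS = {"sls", "txn", "usr", "bhv", "ag", "lst", "vw", "tmp",
                  "cln", "sess", "opp", "mo"}

def cryptic_tokens(table_name):
    """Return the cryptic abbreviation tokens found in a table name."""
    tokens = re.split(r"[_\d]+", table_name.lower())
    return sorted(t for t in tokens if t in CRYPTIC_TOKENS)

for name in ["sls_txn_f47", "usr_bhv_ag_01", "customer_lifetime_value_monthly"]:
    bad = cryptic_tokens(name)
    verdict = "OK" if not bad else f"cryptic tokens: {', '.join(bad)}"
    print(f"{name}: {verdict}")
```

Drop something like this into CI and cryptic names get caught at review time, not two years after the engineer leaves.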


The Day We Almost Lost "Stupidly Simple"

Let me tell you what happened when my team tried to practice what we preached.

We were building a cloud data warehouse from scratch. Clean slate. No legacy baggage.

We made a pact: No jargon. No abbreviations. No "lift and shift" thinking.

Three data layers:

  • Raw Data (not "bronze")
  • System Specific (not "silver")
  • Analytics Ready (not "gold")

Why not medallion architecture? Because going to Home Depot and asking "Where's the paint?" and hearing "It's in the silver department" is absurd.

Everything spelled out:

salesforce_opportunities_raw
google_analytics_sessions_cleaned
customer_lifetime_value_monthly

No sf_opp_raw. No ga_sess_cln. No cltv_mo.

The goal: stupidly simple.


When Standards Drift

About six months in, our system_specific schema started getting messy.

One of our data scientists: "You have got to be kidding me. This is getting messy. I thought it was supposed to be stupidly simple."

She was right.

The team had been using it as a staging area for AI experiments. Transitory tables nobody would ever query directly. Tables that should've lived somewhere else.

Here's the beautiful part:

Because everything was spelled out—because we'd held to no abbreviations—we knew EXACTLY which objects were AI staging tables.

We created a new schema: ai_model_staging

Moved everything over in a day.

system_specific was clean again. Back to stupidly simple.
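Because the names were exact, the move was mechanical. A sketch of that migration, again with SQLite attached databases standing in for schemas (the staging table names here are hypothetical):

```python
# Moving clearly named AI staging tables out of system_specific.
# Table names are hypothetical; ATTACH stands in for warehouse schemas.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS system_specific")
conn.execute("ATTACH DATABASE ':memory:' AS ai_model_staging")

# A mix: a real system table plus AI experiment leftovers.
conn.execute("CREATE TABLE system_specific.google_analytics_sessions_cleaned "
             "(session_id TEXT)")
conn.execute("CREATE TABLE system_specific.churn_model_training_sample "
             "(customer_id TEXT)")
conn.execute("CREATE TABLE system_specific.churn_model_feature_scratch "
             "(customer_id TEXT)")

# Spelled-out names make the staging tables unambiguous to select.
staging = [r[0] for r in conn.execute(
    "SELECT name FROM system_specific.sqlite_master "
    "WHERE type='table' AND name LIKE '%model%'")]
for table in staging:
    conn.execute(f"CREATE TABLE ai_model_staging.{table} "
                 f"AS SELECT * FROM system_specific.{table}")
    conn.execute(f"DROP TABLE system_specific.{table}")

remaining = [r[0] for r in conn.execute(
    "SELECT name FROM system_specific.sqlite_master WHERE type='table'")]
print(remaining)  # only the genuinely system-specific tables remain
```

With `sls_mdl_tmp_2`-style names, that `LIKE '%model%'` filter is impossible and the migration becomes archaeology.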


The Standard That Enforces Itself

Here's what made this work:

"Stupidly simple" had become our language. Our shared value.

When we drifted, someone called it out. That data scientist didn't need permission to point out the mess. The standard gave her permission.

That's when simplicity becomes sustainable.

Not when it's written in a document nobody reads.

Not when it's enforced top-down by governance.

When it becomes the language your team uses.


The Technical ROI of Simple Naming

Let's talk actual impact:

Onboarding Time

Before (cryptic naming): 3-6 months for new data engineers to be productive
After (stupidly simple): 2-3 weeks

Debugging Time

Before: "What does usr_bhv_ag_01 mean?" → Slack thread → wait for someone who knows → maybe they respond
After: user_behavior_aggregated_daily → self-documenting → no questions needed

Refactoring Time

Before: "Which tables are AI staging?" → guess → break production → rollback → try again
After: ai_model_staging.* → clean migration in hours

Mental Overhead

Before: Everyone maintains mental map of abbreviations → tribal knowledge → lost when people leave
After: Zero translation needed → new hires productive immediately → knowledge doesn't walk out the door


The Foundation Nobody Wants to Build

Here's the truth: The foundation matters more than the fancy stuff.

More than AI. More than machine learning. More than real-time streaming or data lakes or whatever the next buzzword is.

If your data foundation is garbage, everything built on top of it is garbage.

You can't AI your way out of bad data.

You can't dashboard your way out of cryptic naming.

You can't innovate on fake problems and expect real solutions.


The Controversial Take

Want to be a powerhouse with data and AI?

Start with the boring stuff:

  1. Stupidly simple naming

    • No abbreviations, ever
    • Self-documenting schemas
    • Table names your grandmother could read
  2. Real data, secured properly

    • Not fake data theater
    • De-identify where needed, but keep it real
    • Test on production-like data
  3. Clear ownership

    • Every dataset has someone responsible
    • Documented edge cases
    • Quality checks that actually run
  4. Self-documenting architecture

    • New hires productive in days, not months
    • No tribal knowledge requirements
    • Code comments are last resort
  5. Quality over quantity

    • 5 trusted metrics beat 47 questionable ones
    • Better to have less data that's reliable
    • Ship working foundations, not complete messes
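Even item 3, clear ownership, can be enforced with something as small as a registry check run in CI. A hedged sketch; the dataset names and owners are invented:

```python
# Tiny ownership registry: every dataset must name a responsible owner.
# Dataset names and owner addresses are invented for illustration.

DATASET_OWNERS = {
    "salesforce_opportunities_raw": "crm-data-team@example.com",
    "customer_lifetime_value_monthly": "analytics-team@example.com",
}

def unowned_datasets(datasets, registry):
    """Return datasets missing an owner; run this as a CI gate."""
    return [d for d in datasets if not registry.get(d)]

datasets_in_warehouse = [
    "salesforce_opportunities_raw",
    "customer_lifetime_value_monthly",
    "tmp_final_v3_ACTUAL_USE_THIS",  # nobody claims this one
]
missing = unowned_datasets(datasets_in_warehouse, DATASET_OWNERS)
print(missing)
```

If the gate fails, the fix is a one-line addition to the registry, which is exactly the point: ownership stays boring and explicit.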

Not sexy. Not flashy. Not conference-worthy.

But it works.


Your Turn

If you've ever:

  • Inherited a database nobody could explain
  • Spent hours debugging cryptic table names
  • Watched new hires struggle for months
  • Built something on fake data that broke in production

Share your story in the comments. Let's learn from each other's mistakes.

Because the goal isn't perfection. It's progress.

People helping people.


Follow me here on Dev.to and connect on LinkedIn where I share more data engineering lessons learned the hard way.


Discussion Questions

  • What's the most cryptic database naming you've encountered?
  • How long does it take new data engineers to be productive on your team?
  • Do you use medallion architecture (bronze/silver/gold)? Why or why not?
  • What's your take on "stupidly simple" vs. more structured naming conventions?

Let's debate this. 👇
