<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piotr</title>
    <description>The latest articles on DEV Community by Piotr (@pplonski).</description>
    <link>https://dev.to/pplonski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F380593%2F022d66eb-55c7-43a3-9dfa-46df4ac538c7.jpeg</url>
      <title>DEV Community: Piotr</title>
      <link>https://dev.to/pplonski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pplonski"/>
    <language>en</language>
    <item>
      <title>AI-Generated Code Looked Right, but the Data Was Wrong</title>
      <dc:creator>Piotr</dc:creator>
      <pubDate>Tue, 05 May 2026 09:08:22 +0000</pubDate>
      <link>https://dev.to/pplonski/ai-generated-code-looked-right-but-the-data-was-wrong-53d1</link>
      <guid>https://dev.to/pplonski/ai-generated-code-looked-right-but-the-data-was-wrong-53d1</guid>
      <description>&lt;p&gt;I'm working on an AI Data Analyst in MLJAR Studio.&lt;/p&gt;

&lt;p&gt;The idea is simple: you ask a question in natural language, AI writes Python code, executes it, and shows the result.&lt;/p&gt;

&lt;p&gt;But recently I found a small example that reminded me why AI data analysis needs more than code generation.&lt;/p&gt;

&lt;h2&gt;The code worked&lt;/h2&gt;

&lt;p&gt;I was testing a medical data analysis use case with a diabetes CSV file.&lt;/p&gt;

&lt;p&gt;The first task was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;load data from this URL&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI generated Pandas code with &lt;code&gt;read_csv()&lt;/code&gt;.&lt;/p&gt;
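&lt;p&gt;The generated step was essentially a one-liner. A hypothetical sketch (the inline sample below stands in for the real URL and dataset):&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical stand-in for the generated read_csv() step; a real run
# would pass the dataset URL instead of this tiny inline sample.
sample = "Pregnancies,Glucose,Outcome\n6,148,1\n1,85,0\n"
df = pd.read_csv(io.StringIO(sample))
print(df.shape)  # (2, 3)
```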

&lt;p&gt;The code executed without errors.&lt;br&gt;&lt;br&gt;
The dataframe was displayed.&lt;br&gt;&lt;br&gt;
The shape looked correct: 768 rows and 9 columns.&lt;/p&gt;

&lt;p&gt;So everything looked fine.&lt;/p&gt;

&lt;p&gt;But then I looked at the dataframe.&lt;/p&gt;

&lt;h2&gt;148 pregnancies?&lt;/h2&gt;

&lt;p&gt;In the first row, the &lt;code&gt;Pregnancies&lt;/code&gt; column had the value &lt;code&gt;148&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That immediately looked wrong.&lt;/p&gt;

&lt;p&gt;Values like &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;1&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;6&lt;/code&gt;, or &lt;code&gt;8&lt;/code&gt; make sense for a number of pregnancies.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;148&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Then I noticed more strange things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Pregnancies&lt;/code&gt; had values like &lt;code&gt;148&lt;/code&gt;, &lt;code&gt;85&lt;/code&gt;, &lt;code&gt;183&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Age&lt;/code&gt; had values like &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Outcome&lt;/code&gt; was empty&lt;/li&gt;
&lt;li&gt;the whole dataframe looked shifted&lt;/li&gt;
&lt;/ul&gt;
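&lt;p&gt;Checks of this kind are easy to script. A sketch with hypothetical plausibility limits (the values below are a tiny stand-in for the shifted dataframe, not the real file):&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in for the shifted dataframe (values quoted in the post).
df = pd.DataFrame({"Pregnancies": [148, 85, 183], "Age": [0, 1, 1]})

# Hypothetical plausibility limits; real thresholds depend on the domain.
suspicious_pregnancies = df["Pregnancies"].gt(20).sum()
suspicious_age = df["Age"].lt(18).sum()
print(suspicious_pregnancies, suspicious_age)  # 3 3
```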

&lt;p&gt;The code worked, but the data was wrong.&lt;/p&gt;

&lt;h2&gt;AI also checked the output&lt;/h2&gt;

&lt;p&gt;In MLJAR Studio, my AI Data Analyst is not a one-step workflow.&lt;/p&gt;

&lt;p&gt;It doesn't only:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;generate code&lt;/li&gt;
&lt;li&gt;execute code&lt;/li&gt;
&lt;li&gt;show result&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After the code is executed, there is another step. The LLM analyzes the generated output.&lt;/p&gt;

&lt;p&gt;So AI doesn't only ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the code run?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It also asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does the output make sense?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case, AI also noticed that something was wrong. It found suspicious values, missing values in the last column, and strange statistics.&lt;/p&gt;

&lt;p&gt;This was very useful because Pandas didn't raise an error. The dataframe was created. But the output was incorrect.&lt;/p&gt;
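&lt;p&gt;The checks that catch a silent failure like this are simple to express in pandas: count missing values per column and look at basic statistics. A sketch, with a tiny hypothetical stand-in for the shifted dataframe:&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical stand-in for the shifted output: last column is empty.
sample = "Pregnancies,Age,Outcome\n148,1,\n85,0,\n"
df = pd.read_csv(io.StringIO(sample))

# The kind of summary an automatic output check can scan.
report = {
    "missing_per_column": df.isna().sum().to_dict(),
    "max_pregnancies": df["Pregnancies"].max(),
}
print(report)
```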

&lt;h2&gt;What happened?&lt;/h2&gt;

&lt;p&gt;The CSV had a small formatting issue: an extra comma in the header row.&lt;/p&gt;

&lt;p&gt;Because of that, Pandas treated the first value in each row as the dataframe index.&lt;/p&gt;

&lt;p&gt;So the columns were shifted.&lt;/p&gt;

&lt;p&gt;The value &lt;code&gt;148&lt;/code&gt; was not the number of pregnancies. It was the glucose value.&lt;/p&gt;

&lt;p&gt;That is why &lt;code&gt;Glucose&lt;/code&gt; appeared as &lt;code&gt;Pregnancies&lt;/code&gt;, &lt;code&gt;Outcome&lt;/code&gt; appeared as &lt;code&gt;Age&lt;/code&gt;, and the real &lt;code&gt;Outcome&lt;/code&gt; column was empty.&lt;/p&gt;
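&lt;p&gt;One minimal way to reproduce this kind of silent shift (a hypothetical fragment, not the actual file): when the header row ends up with one field fewer than the data rows, pandas quietly uses the first value of each row as the index, and every column moves left:&lt;/p&gt;

```python
import io
import pandas as pd

# Hypothetical malformed CSV: 3 header names, but each data row parses
# to 4 fields (note the trailing comma). No error is raised; pandas
# silently promotes the first value of each row to the index.
broken_csv = "Glucose,BloodPressure,Outcome\n6,148,72,\n1,85,66,\n"
df = pd.read_csv(io.StringIO(broken_csv))

print(df.index.tolist())           # [6, 1]  the first value became the index
print(df["Glucose"].tolist())      # [148, 85]  shifted by one column
print(df["Outcome"].isna().all())  # True  the last column is empty
```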

&lt;h2&gt;The lesson&lt;/h2&gt;

&lt;p&gt;This example is small, but the lesson is important.&lt;/p&gt;

&lt;p&gt;AI-generated code can look correct.&lt;br&gt;&lt;br&gt;
The notebook can run without errors.&lt;br&gt;&lt;br&gt;
The dataframe can be displayed.&lt;/p&gt;

&lt;p&gt;And the data can still be wrong.&lt;/p&gt;

&lt;p&gt;That is why AI data analysis needs output verification.&lt;/p&gt;

&lt;p&gt;We need a human in the loop, because humans can use common sense. In this case, &lt;code&gt;148&lt;/code&gt; pregnancies was clearly impossible.&lt;/p&gt;

&lt;p&gt;But AI in the loop is helpful as well. AI can scan the output, check basic statistics, and warn us about suspicious values.&lt;/p&gt;

&lt;p&gt;For me, the best workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ask AI to generate code&lt;/li&gt;
&lt;li&gt;execute the code&lt;/li&gt;
&lt;li&gt;display the output&lt;/li&gt;
&lt;li&gt;let AI inspect the output&lt;/li&gt;
&lt;li&gt;let the human review the result&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI can help us move faster.&lt;/p&gt;

&lt;p&gt;But in data analysis, the real question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the code run?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does the output make sense?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>datascience</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Reimagine Python Notebooks in the AI Era: What If You Didn’t Write Code First?</title>
      <dc:creator>Piotr</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:21:35 +0000</pubDate>
      <link>https://dev.to/pplonski/reimagine-python-notebooks-in-the-ai-era-what-if-you-didnt-write-code-first-n6l</link>
      <guid>https://dev.to/pplonski/reimagine-python-notebooks-in-the-ai-era-what-if-you-didnt-write-code-first-n6l</guid>
      <description>&lt;p&gt;For years, Jupyter Notebook has been the default tool for data analysis in Python.&lt;/p&gt;

&lt;p&gt;But it assumes one thing:&lt;/p&gt;

&lt;p&gt;👉 you start with code.&lt;/p&gt;

&lt;h2&gt;What if you didn’t?&lt;/h2&gt;

&lt;p&gt;I’ve been experimenting with a different workflow.&lt;/p&gt;

&lt;p&gt;Instead of writing Python, you describe what you want in plain English — and the system generates and runs the code for you.&lt;/p&gt;

&lt;p&gt;The flow becomes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt → LLM-generated code → auto-execution → results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
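&lt;p&gt;The loop above can be sketched in a few lines. Everything here is a stand-in: &lt;code&gt;generate_code&lt;/code&gt; is a stub for the LLM call, not the actual MLJAR Studio implementation:&lt;/p&gt;

```python
# Minimal sketch of a prompt-first loop; generate_code is a stub that
# stands in for an LLM call (hypothetical, not the real system).
def generate_code(prompt):
    # A real system would send the prompt to an LLM and get Python back.
    return "result = sum(range(10))"

def run_prompt(prompt):
    code = generate_code(prompt)   # prompt -> LLM-generated code
    namespace = {}
    exec(code, namespace)          # auto-execution
    return namespace["result"]     # results are what the user sees

print(run_prompt("sum the numbers from 0 to 9"))  # 45
```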



&lt;p&gt;Code is still there.&lt;br&gt;
But it’s no longer the starting point.&lt;/p&gt;
&lt;h2&gt;Why this might matter&lt;/h2&gt;

&lt;p&gt;The idea came from a simple observation.&lt;/p&gt;

&lt;p&gt;Someone told me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I don’t care about the code. I care about the results.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And honestly, that stuck with me.&lt;/p&gt;

&lt;p&gt;As engineers, we often treat code as the main interface.&lt;br&gt;
But for many people, it’s just a tool to get answers.&lt;/p&gt;
&lt;h2&gt;Old problems... less important?&lt;/h2&gt;

&lt;p&gt;There has been a lot of criticism of notebooks over the years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hidden state&lt;/li&gt;
&lt;li&gt;mixing code and outputs&lt;/li&gt;
&lt;li&gt;hard to review in git&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These problems are real.&lt;/p&gt;

&lt;p&gt;But I’m starting to wonder if AI changes which of them actually matter.&lt;/p&gt;

&lt;p&gt;If:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code is generated&lt;/li&gt;
&lt;li&gt;execution is automated&lt;/li&gt;
&lt;li&gt;results become the main interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then some of these issues feel... less central.&lt;/p&gt;
&lt;h2&gt;New problems appear&lt;/h2&gt;

&lt;p&gt;Of course, we are not removing complexity — we are shifting it.&lt;/p&gt;

&lt;p&gt;New challenges show up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trusting LLM-generated code&lt;/li&gt;
&lt;li&gt;debugging when something goes wrong&lt;/li&gt;
&lt;li&gt;less visibility into what is actually happening&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some cases, these might be even harder than the original problems.&lt;/p&gt;
&lt;h2&gt;A small but important detail&lt;/h2&gt;

&lt;p&gt;One thing I care about a lot:&lt;/p&gt;

&lt;p&gt;👉 everything is still saved as a standard &lt;code&gt;.ipynb&lt;/code&gt; file&lt;/p&gt;

&lt;p&gt;So you can always:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;open it in a classic notebook&lt;/li&gt;
&lt;li&gt;inspect the code&lt;/li&gt;
&lt;li&gt;edit anything manually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing is locked in.&lt;/p&gt;
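&lt;p&gt;This is easy to verify, because an &lt;code&gt;.ipynb&lt;/code&gt; file is just JSON. A minimal sketch of the structure (a toy notebook, not a file produced by the tool):&lt;/p&gt;

```python
import json

# A .ipynb file is plain JSON, so any tool, or a plain text editor,
# can open, inspect, and edit it. Toy notebook for illustration:
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        }
    ],
}
round_trip = json.loads(json.dumps(nb))
print(round_trip["cells"][0]["source"])  # ["print('hello')"]
```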
&lt;h2&gt;So what’s really changing?&lt;/h2&gt;

&lt;p&gt;Maybe the biggest shift is this:&lt;/p&gt;

&lt;p&gt;We are moving from:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;code → results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;intent → results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code becomes an implementation detail, not the main interface.&lt;/p&gt;

&lt;h2&gt;Open question&lt;/h2&gt;

&lt;p&gt;Are we solving notebook problems — or just hiding them behind AI?&lt;/p&gt;

&lt;p&gt;I wrote a longer version with screenshots and implementation details here:&lt;br&gt;
👉 &lt;a href="https://mljar.com/blog/reimagine-python-notebook-in-ai-era/" rel="noopener noreferrer"&gt;https://mljar.com/blog/reimagine-python-notebook-in-ai-era/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’d love to hear your perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Would you use a prompt-first notebook?&lt;/li&gt;
&lt;li&gt;Does this make notebooks better — or just different?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>jupyter</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
