Éric Jacopin
In 2023, 52% of Python Devs Used Pytest. In 2026, 100% of AI Models Understand Doctests.

The testing format nobody uses is the one every AI actually understands.


You're probably using pytest. So is 52% of the Python community, according to the JetBrains Python Developer Survey 2023. It's powerful, flexible, and has an incredible plugin ecosystem.

Meanwhile, doctests sit at 9%. The forgotten sibling. "Too simple for real testing." "Just for documentation examples." "Nobody uses those anymore."

But here's what we discovered after testing 10 different AI models on code generation tasks:

Every single model—100%—preserves doctests perfectly. (When you include doctests in your prompt, the AI keeps them intact in its generated code.)

Not pytest. Not unittest. Doctests.


The Experiment

We ran a systematic experiment across 10 large language models:

  • Claude Haiku 3, Haiku 4.5, Sonnet 4, Sonnet 4.5, Opus 4, Opus 4.1, Opus 4.5
  • Mistral Medium (mistral-medium-2508), Devstral (devstral-2512)
  • EssentialAI RNJ-1 (a 5GB model you can run locally with LM Studio)

Why these models? Anthropic's Claude powers popular AI coding tools like Kiro and Augment Code. Mistral offers a competitive alternative that's less explored. And EssentialAI's RNJ-1 tests whether the finding holds for small, locally-run models — if a 5GB model on your laptop gets it right, this isn't just "big model magic."

The task: generate implementations for functions that included test cases in the prompt.

The question: which test formats do AI models preserve?


The Results

Test Format       Preservation Rate
Python doctests   100% (all 10 models)
Rust #[test]      100% (Sonnet models)
Zig test blocks   100% (Sonnet models)
Go _test.go       0%
C++ gtest         0%
TypeScript Jest   0%

Python doctests were the only format that achieved universal preservation. Every model. Every time.

And yes, that includes EssentialAI's RNJ-1—a 5GB model running on a laptop. No cloud API required. No expensive tokens. Just a small local model that somehow knows exactly what to do with >>>.


Why Doctests?

Three reasons:

1. Inline Structure

Doctests live inside the docstring, which lives inside the function. There's no file boundary confusion. No ambiguity about whether tests are "part of this" or "somewhere else."

def fibonacci(n):
    """Return the nth Fibonacci number.

    >>> fibonacci(0)
    0
    >>> fibonacci(1)
    1
    >>> fibonacci(10)
    55
    """
    pass  # AI will implement this AND preserve the doctests

When you give this to an AI, the structure says: "these tests are part of the function's definition." The AI preserves them.
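
Better yet, those examples are executable. Here's how to run them with nothing beyond the standard library, assuming the function lives in a file such as fibonacci.py (a name chosen just for this example):

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # prints nothing when every example passes

Run python fibonacci.py -v (or python -m doctest fibonacci.py -v) to see each example checked one by one.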

2. Unambiguous Syntax

There's exactly one way to write a doctest: >>> followed by code, with the expected output on the line below.

No decorators to remember. No assertion library to import. No class inheritance. Just >>>.

AI models thrive on unambiguous patterns. Doctests are as unambiguous as it gets.
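
One caveat worth knowing: that exactness is literal. doctest compares output character for character against what Python actually prints, so examples must be written as the repr, not as "close enough":

>>> 0.1 + 0.2            # compared against the full repr, digit for digit
0.30000000000000004
>>> {"b": 2, "a": 1}     # dicts print in insertion order; write them that way
{'b': 2, 'a': 1}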

3. Training Data Ubiquity

Doctests appear everywhere in Python's ecosystem:

  • The official Python documentation
  • Standard library docstrings
  • Countless tutorials and examples
  • Stack Overflow answers

Every model has seen thousands of doctests during training. They're part of Python's DNA.


The Irony

The Python community moved away from doctests because they're "too simple." The criticisms are well-documented:

"Probably the most significant limitation of doctest compared to other testing frameworks is the lack of features equivalent to fixtures in pytest."
Real Python

"Doctests are not a replacement for unit tests... You should continue using unit tests for structured, scalable, and thorough validation of the behavior of your code."
Laurent Kubaski, Medium

"Though doctest is an extremely useful module, the examples we write in docstrings are only simple cases meant to illustrate typical uses of the function. As functions get more complex, we'll require more extensive tests... We could put all these tests into the function docstrings, but that would make the docstrings far too long. So instead, we will use pytest."
University of Toronto CS Course

"Doesn't support parameterized testing. Advanced testing features like Test Discovery, Fixtures, etc not supported."
Pytest with Eric

The verdict is clear: doctests are for documentation examples, not "real" testing. Too simple. No fixtures. No parameterization. No mocking.

All true.

But here's the twist: those same "limitations" are exactly why AI models handle them perfectly.

Doctest "Limitation" Why AI Loves It
Too simple Unambiguous for the model
No fixtures No external state to track
No parameterization Each test is self-contained
No mocking No hidden complexity
Exact output matching Clear success criteria

We optimized our testing for human power users. AI models—at least every one we tested in 2026—prefer the beginner-friendly format we left behind.


The Practical Recommendation

I'm not saying abandon pytest. Keep it. It's great for:

  • Complex test scenarios
  • Fixtures and setup/teardown
  • Parameterized testing
  • CI/CD pipelines
  • Coverage reporting

But consider dual-testing:

For AI-Assisted Development

Use doctests as a template when prompting AI:

def parse_email(text: str) -> dict:
    """Extract email components from a string.

    >>> parse_email("John Doe <john@example.com>")
    {'name': 'John Doe', 'address': 'john@example.com'}
    >>> parse_email("jane@example.com")
    {'name': None, 'address': 'jane@example.com'}
    >>> parse_email("invalid")
    Traceback (most recent call last):
        ...
    ValueError: Invalid email format
    """
    pass  # Ask AI to implement

The doctests communicate:

  1. The expected input/output contract
  2. Edge cases to handle
  3. Error conditions

And the AI will preserve them—giving you working tests from the start.
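
For reference, here's one minimal implementation that satisfies those three doctests. It's a hand-rolled sketch, deliberately naive about the full email grammar, but it shows how tightly the examples pin down the behavior:

import re

def parse_email(text: str) -> dict:
    # Docstring with the doctests above elided here for brevity
    text = text.strip()
    match = re.fullmatch(r"(.+?)\s*<([^<>\s@]+@[^<>\s@]+)>", text)  # "Name <address>"
    if match:
        return {'name': match.group(1), 'address': match.group(2)}
    if re.fullmatch(r"[^<>\s@]+@[^<>\s@]+", text):  # bare "address"
        return {'name': None, 'address': text}
    raise ValueError("Invalid email format")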

For Comprehensive Testing

Keep your pytest suite for everything else:

# test_email.py
import pytest
from mymodule import parse_email

@pytest.fixture
def email_samples():
    # load_test_data is a hypothetical helper that loads a larger corpus
    return load_test_data("emails.json")

@pytest.mark.parametrize("text,expected", [...])
def test_parse_email_parametrized(text, expected):
    assert parse_email(text) == expected

def test_parse_email_performance(benchmark):  # benchmark fixture: pytest-benchmark plugin
    benchmark(parse_email, "test@example.com")

Doctests for AI prompts. Pytest for human-scale testing.
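
Better still, the two suites can run together: pytest collects doctests when you pass --doctest-modules, so your AI-facing examples become CI tests for free. The same flag works programmatically:

import pytest

# Hypothetical paths; point these at your own package.
# Collects the tests in test_email.py AND every doctest inside mymodule.py.
exit_code = pytest.main(["--doctest-modules", "mymodule.py", "test_email.py"])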


Try It Yourself

Next time you're about to ask an AI to implement a function:

  1. Write the function signature
  2. Add a docstring with 2-3 doctests showing expected behavior (like the parse_email example above)
  3. Ask the AI to implement it

Compare this to your usual prompt. I bet you'll notice:

  • The AI preserves your test cases
  • The implementation matches your examples
  • You have working tests immediately

The Bigger Picture

This isn't just about doctests. It's about a broader pattern we discovered:

Inline test formats work. External test formats don't.

When you include tests in your prompt, does the AI preserve them in its output? We measured test preservation rate — 100% means every test you provided appears in the generated code, intact. 0% means the AI ignored your tests entirely.
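
To make the metric concrete, here's a toy version of that check; it's our illustration of the definition, not the experiment's actual harness:

def preservation_rate(prompt_tests: list[str], generated_code: str) -> float:
    """Fraction of the prompt's test lines that survive verbatim in the output."""
    kept = sum(1 for test in prompt_tests if test in generated_code)
    return kept / len(prompt_tests)

# Both doctest lines survive in this (truncated) model response, so the rate is 1.0
response = 'def fibonacci(n):\n    """\n    >>> fibonacci(0)\n    0\n    >>> fibonacci(10)\n    55\n    """'
print(preservation_rate([">>> fibonacci(0)", ">>> fibonacci(10)"], response))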

The pattern was stark: inline tests (Python doctests, Rust #[test], Zig test blocks) achieved 100% preservation. External tests (Go _test.go, C++ gtest, TypeScript Jest) achieved 0%.

When tests are structurally part of the code, AI preserves them. When tests are separate files, AI ignores them.

Doctests just happen to be the most universal example of this pattern—the one format that works across every model we tested, including tiny local ones.


Conclusion

In 2023, pytest dominated with 52%. Doctests languished at 9%.

In 2026, AI models tell a different story: doctests are the universal language of test specification.

The 9% might be onto something. If you're using AI coding tools, consider joining them.


This post is based on findings from a larger experiment on AI code generation across multiple languages and models. Full results available at d-Heap-priority-queue. See also Amphigraphic-Strict for strict language subsets optimized for AI-assisted development.

Have you tried using doctests with AI assistants? Please share your experience in the comments.
