Nilamadhab Senapati

I built a tool that shows you exactly what an ATS reads from your resume — here's how it works

Most resume checkers score your file against keywords. Legible runs the same kind of parsing pipeline an ATS uses and shows you the raw output, line by line.

I've spent months helping friends apply to jobs and watching the same thing happen over and over: a great candidate, a beautiful resume, and silence. No response. No rejection email. Just nothing.

Eventually I started asking the question nobody asks: does the company even see this resume?

The answer, increasingly, is no. An Applicant Tracking System (ATS) sees it first. And an ATS doesn't see what you see.


What an ATS actually does

When you upload a PDF to Greenhouse, Workday, Lever, Taleo, iCIMS, or any other major ATS, the system runs your file through what is, in broad strokes, a five-stage parsing pipeline:

  1. Text extraction — PDF/DOCX bytes converted to a character stream
  2. Layout analysis — columns, tables, images, header/footer regions detected
  3. Section segmentation — Experience, Education, Skills, etc.
  4. Field extraction — name, email, dates, job titles, companies
  5. Structured storage — fields written to a database the recruiter searches

If any stage fails, your resume becomes invisible. Not rejected. Invisible. Your file is still in the system, but the recruiter searching for "Kubernetes" never finds you because the parser dropped your skills section.


The five most common silent failures

In rough order of how frequently I saw them in testing:

1. Multi-column layouts
Parsers read top-to-bottom, left-to-right across the full page width. Two columns interleave into garbled lines. "Skills Work Experience Python Senior Engineer SQL Acme Corp" — parsed as one job title.
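
To make the interleaving concrete, here is a minimal sketch of a reader that walks each visual row across the full page width. The function name and the sample columns are illustrative only; real parsers work on positioned text boxes, not tidy lists.

```python
from itertools import zip_longest


def read_across_page(left_column: list[str], right_column: list[str]) -> list[str]:
    """Read each visual row left-to-right, ignoring the column boundary."""
    rows = zip_longest(left_column, right_column, fillvalue="")
    # Join the non-empty cells of each row into one "line" of text.
    return [" ".join(cell for cell in row if cell) for row in rows]


left = ["Skills", "Python", "SQL"]
right = ["Work Experience", "Senior Engineer", "Acme Corp"]

interleaved = read_across_page(left, right)
for line in interleaved:
    print(line)
```

Each printed line mixes a skill with a fragment of the experience section, which is exactly the kind of input a section segmenter cannot classify.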

2. Tables for critical content
Most parsers strip table structure entirely. Skills in a table → gone from the recruiter's keyword search.

3. Image-based PDFs
Some Canva exports and Adobe Illustrator templates flatten text into a picture. With no text layer to read, the parser sees a blank page. All your content is invisible.

4. Contact info in the document header region
The visual top of the page is fine. The document's actual header region (the separate header part in a DOCX file, for example) is not: most parsers ignore it entirely. Many popular resume templates use the document header for name and email.

5. Creative section headings
"Where I've Worked" instead of "Experience", "My Toolbox" instead of "Skills" — the parser's section segmenter fails to classify the section correctly.

Every resume coach knows these patterns exist. But nobody could tell you whether your specific resume failed any of them. You'd just keep applying and hoping.


So I built Legible

legible.live — free, no signup, anonymous, ~8 seconds.

Upload a PDF or DOCX. It runs the five-stage pipeline described above, then shows you:

  • The exact text the parser extracted, line by line, side-by-side with your original
  • Which sections it detected with what confidence — and which it missed
  • A strict score and a lenient score (explained below)
  • The top three concrete fixes, ranked by estimated point gain

No login. No email gate. No "unlock your full report for $29".


Strict vs lenient — and why the gap is the interesting number

Most ATS scanners give you one number. That number is meaningless because the same vendor can be configured very differently across companies. Workday with the modern AI screening layer behaves one way; Workday without it behaves another. Taleo at a Fortune 500 with strict filters is brutal; Taleo at a smaller employer with default settings is forgiving.

So Legible runs two parallel scorers:

  • Strict mode simulates legacy keyword-matching behaviour: exact string matches, no semantic equivalence, low tolerance for layout deviation. Worst-case enterprise ATS configuration.
  • Lenient mode simulates modern NLP-based parsing: semantic equivalence (so "Kubernetes" matches "container orchestration"), skills taxonomies, more layout tolerance. Best-case modern configuration.
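
The difference between the two modes can be sketched with a toy keyword matcher. This is an illustration under stated assumptions, not Legible's scorer: `SKILL_SYNONYMS` is a hypothetical hand-written taxonomy, and both function names are made up for the example.

```python
import re

# Hypothetical mini-taxonomy: each required skill maps to phrases treated as equivalent.
SKILL_SYNONYMS = {
    "kubernetes": {"kubernetes", "k8s", "container orchestration"},
    "postgresql": {"postgresql", "postgres"},
}


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so matching is not layout-sensitive."""
    return re.sub(r"\s+", " ", text.lower())


def strict_match(resume_text: str, keyword: str) -> bool:
    """Exact whole-word match only, no equivalents."""
    pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
    return re.search(pattern, normalise(resume_text)) is not None


def lenient_match(resume_text: str, keyword: str) -> bool:
    """Match the keyword or any phrase the taxonomy lists as equivalent."""
    variants = SKILL_SYNONYMS.get(keyword.lower(), {keyword.lower()})
    return any(strict_match(resume_text, variant) for variant in variants)


resume = "Ran container orchestration for 40 services on Postgres."
print(strict_match(resume, "Kubernetes"), lenient_match(resume, "Kubernetes"))
```

The same resume text fails the strict check for "Kubernetes" and passes the lenient one, which is the behaviour the two scores are meant to separate.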

The gap between the two scores is your parser-dependent risk.

If your strict score is 45 and your lenient score is 88, your resume is a lottery ticket — it'll pass at modern tech companies and fail at legacy enterprises. If both are above 80, you're robust. If both are below 60, the file itself is broken, not the content.
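
As a rough sketch, the reading of the two scores could be written as a small function. The 80 and 60 cut-offs come from the paragraph above; the `wide_gap` value of 30 points, the label names, and the "borderline" bucket are assumptions added for illustration.

```python
def classify_scores(strict: int, lenient: int, wide_gap: int = 30) -> str:
    """Turn a (strict, lenient) score pair into a coarse risk label."""
    if strict < 60 and lenient < 60:
        return "file_broken"        # both scorers struggle: the file, not the content
    if strict > 80 and lenient > 80:
        return "robust"             # passes under both configurations
    if lenient - strict >= wide_gap:
        return "parser_dependent"   # outcome depends on which ATS reads it
    return "borderline"             # hypothetical catch-all for everything else


print(classify_scores(45, 88))
print(classify_scores(85, 92))
print(classify_scores(40, 55))
```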



Under the hood

The whole pipeline runs in 3–8 seconds on a 1–2 page PDF.

Text extraction — two engines, cross-checked

I run PyMuPDF and pdfminer.six in parallel and compare how much text each one returns. They disagree about 6% of the time, usually on PDFs with embedded fonts or non-standard encodings. When the output lengths diverge by more than 15%, I prefer pdfminer's output and surface a warning:

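from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionResult:  # shape inferred from the calls below
    text: str
    warning: Optional[str] = None
    detail: Optional[str] = None

# extract_with_pymupdf / extract_with_pdfminer: thin wrappers, bytes in, str out.
def cross_check_extraction(pdf_bytes: bytes) -> ExtractionResult:
    pymupdf_text = extract_with_pymupdf(pdf_bytes)
    pdfminer_text = extract_with_pdfminer(pdf_bytes)

    # Neither engine found any text: most likely an image-only PDF.
    larger = max(len(pymupdf_text), len(pdfminer_text))
    if larger == 0:
        return ExtractionResult(text="", warning="no_text_layer")

    # Relative difference in extracted length: 0.0 means identical length.
    divergence = abs(len(pymupdf_text) - len(pdfminer_text)) / larger
    if divergence > 0.15:
        return ExtractionResult(
            text=pdfminer_text,
            warning="encoding_mismatch",
            detail=f"Extractors disagree by {divergence:.0%}"
        )
    return ExtractionResult(text=pymupdf_text)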

Column detection

This is the single most important check. I cluster the x-coordinates of every text box on the page and look for a real gap between clusters:

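from pdfminer.layout import LTTextBox

def detect_columns(pages) -> ColumnInfo:
    # `pages` must be a list of pdfminer LTPage objects (it is indexed below),
    # and ColumnInfo is a small result type with `count` and `confidence`.
    # Collect the left edge of every text box on every page.
    x_starts = [
        box.x0 for page in pages
        for box in page if isinstance(box, LTTextBox)
    ]
    if not x_starts:
        return ColumnInfo(count=0, confidence=0.0)

    # A horizontal gap wider than 15% of the page separates two clusters.
    page_width = pages[0].width
    clusters = cluster_by_gap(x_starts, gap=page_width * 0.15)

    # Real two-column resumes have ~30-50% of text boxes in each cluster.
    # Single-column docs with one indented quote don't count.
    if len(clusters) >= 2 and min_cluster_share(clusters) > 0.25:
        return ColumnInfo(count=len(clusters), confidence=0.9)
    return ColumnInfo(count=1, confidence=0.95)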

The trick is the min_cluster_share check. A single-column resume with one indented quote will produce two x-position clusters, but one of them contains only 2% of the text. A real two-column resume has roughly balanced clusters. This single check eliminated most of my false positives.
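
The two helpers used in `detect_columns` are not shown above. Here is one way they could be written, as a sketch: the names and the `gap` parameter match the call sites, but the bodies are my assumption of a simple one-dimensional gap clustering, and the sample coordinates are invented.

```python
def cluster_by_gap(values: list[float], gap: float) -> list[list[float]]:
    """Group sorted values; start a new cluster when the jump exceeds `gap`."""
    clusters: list[list[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def min_cluster_share(clusters: list[list[float]]) -> float:
    """Fraction of all values that fall in the smallest cluster."""
    total = sum(len(cluster) for cluster in clusters)
    return min(len(cluster) for cluster in clusters) / total if total else 0.0


page_width = 612.0  # US Letter width in PDF points
gap = page_width * 0.15

two_column = cluster_by_gap([72.0] * 10 + [320.0] * 8, gap)
indented_quote = cluster_by_gap([72.0] * 49 + [180.0], gap)

print(len(two_column), round(min_cluster_share(two_column), 2))
print(len(indented_quote), round(min_cluster_share(indented_quote), 2))
```

Both inputs produce two clusters, but only the first clears the 0.25 share threshold: the indented quote accounts for 1 box in 50.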

Section segmentation

Segmentation uses fuzzy matching against a known-header vocabulary scraped from a few hundred real resumes. I considered training an NER model, but the gain over a well-tuned dictionary lookup didn't justify the complexity at this scale.
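
A dictionary lookup with fuzzy matching can be sketched with the standard library's `difflib`. The vocabulary below is a tiny hypothetical stand-in for the real one, and the 0.8 threshold is an assumed value, not a tuned one.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical, heavily abbreviated header vocabulary.
HEADER_VOCAB = {
    "experience": ["experience", "work experience", "professional experience", "employment history"],
    "education": ["education", "academic background"],
    "skills": ["skills", "technical skills", "core competencies"],
}


def classify_heading(line: str, threshold: float = 0.8) -> tuple[Optional[str], float]:
    """Return (section, score) for the closest known header, or (None, score)."""
    cleaned = line.strip().lower().rstrip(":")
    best_section, best_score = None, 0.0
    for section, variants in HEADER_VOCAB.items():
        for variant in variants:
            score = SequenceMatcher(None, cleaned, variant).ratio()
            if score > best_score:
                best_section, best_score = section, score
    if best_score < threshold:
        return None, best_score
    return best_section, best_score


for heading in ["Work Experiance:", "SKILLS", "My Toolbox"]:
    print(heading, "->", classify_heading(heading)[0])
```

A misspelled but conventional heading still lands in the right section, while a creative one such as "My Toolbox" falls below the threshold and goes unclassified.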

The hardest bug to find

PDFs from certain templates put contact info in the document's header region: not as regular text in the first paragraph, but in a separate header structure.

PyMuPDF returns this text by default. pdfminer's high-level API doesn't. So on these files, one extractor saw a name and the other didn't, and the cross-check flagged an "encoding mismatch" that wasn't really an encoding issue at all. Fixing this required reading both extractors' layout output and detecting whether contact fields lived in header regions specifically — which turned out to be one of the most useful diagnostics in the tool.
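
The same class of problem is easier to demonstrate on DOCX uploads, where header content is stored in separate package parts (`word/header1.xml` and so on) rather than in `word/document.xml`. The sketch below checks those parts for an email address using only the standard library. It is an illustration, not the diagnostic Legible runs: the function names are hypothetical, the email regex is deliberately loose, and stripping tags with a regex is a shortcut that a real implementation would replace with an XML parser.

```python
import io
import re
import zipfile

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HEADER_PART_RE = re.compile(r"^word/header\d*\.xml$")


def docx_header_has_email(docx_bytes: bytes) -> bool:
    """True if any header part of the DOCX package contains an email address."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            if not HEADER_PART_RE.match(name):
                continue
            xml = package.read(name).decode("utf-8", errors="replace")
            text = re.sub(r"<[^>]+>", "", xml)  # crude tag strip, fine for a sketch
            if EMAIL_RE.search(text):
                return True
    return False


def build_package(parts: dict[str, str]) -> bytes:
    """Demo helper: build an in-memory zip that mimics a DOCX part layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, content in parts.items():
            package.writestr(name, content)
    return buffer.getvalue()


header_resume = build_package({
    "word/document.xml": "<w:p><w:t>Experience</w:t></w:p>",
    "word/header1.xml": "<w:hdr><w:t>jane@example.com</w:t></w:hdr>",
})
body_resume = build_package({
    "word/document.xml": "<w:p><w:t>jane@example.com</w:t></w:p>",
})

print(docx_header_has_email(header_resume), docx_header_has_email(body_resume))
```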

What the corpus showed

I ran the final pipeline on a corpus of 20 anonymised resumes from public sources and personal contributions. 34% had at least one critical parsing issue. The most common: contact info in document headers (silently dropped by most ATS), followed by two-column layouts.


What it doesn't claim

I don't have access to the actual parsers inside Workday, Greenhouse, Taleo, or any other commercial ATS. Nobody outside those companies does.

Legible simulates the documented behaviour of the parsing pipeline these systems share — the failure modes are real, the detection logic is honest, but the scores are not "what Workday literally returned." The methodology page documents every check and every limitation explicitly.

This honesty is the point. Most ATS scanners quote correlations like "99% match with real employer ATS scores." I don't know how they could possibly verify that. Legible tells you what its pipeline found, what that pipeline shares with real ATS behaviour, and what it cannot tell you.


Stack

  • Frontend: Next.js on Vercel
  • Backend: FastAPI + async Postgres on Railway
  • PDF extraction: PyMuPDF + pdfminer.six, cross-checked
  • Layout analysis: custom heuristics over pdfminer's LT* objects
  • Scoring: two independent rule-based scorers running in parallel
  • Recommendations: GPT-4o-mini pass over ranked deductions, with a deterministic fallback if the call fails

One container per service. Nothing fancy.
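
The "deterministic fallback" idea in the recommendations step can be sketched as follows. Everything here is hypothetical: the `Deduction` shape, the point values, and the `rewrite` callable standing in for the model call. The only behaviour taken from the description above is ranking by point gain, keeping the top three, and falling back to canned text if the call fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Deduction:
    check: str
    points: int  # estimated score gain if fixed (made-up values below)
    fix: str     # canned, deterministic recommendation text


def top_fixes(
    deductions: list[Deduction],
    rewrite: Optional[Callable[[Deduction], str]] = None,
    limit: int = 3,
) -> list[str]:
    """Rank by point gain; use `rewrite` for wording, canned text on any failure."""
    ranked = sorted(deductions, key=lambda d: d.points, reverse=True)[:limit]
    fixes = []
    for deduction in ranked:
        try:
            text = rewrite(deduction) if rewrite else deduction.fix
        except Exception:
            text = deduction.fix  # model call failed: fall back to the canned fix
        fixes.append(text)
    return fixes


deductions = [
    Deduction("two_columns", 18, "Use a single-column layout."),
    Deduction("header_contact", 25, "Move name and email into the body."),
    Deduction("creative_heading", 6, "Rename 'My Toolbox' to 'Skills'."),
    Deduction("table_skills", 12, "List skills as plain text."),
]


def broken_rewrite(deduction: Deduction) -> str:
    raise TimeoutError("model unavailable")


print(top_fixes(deductions, rewrite=broken_rewrite))
```

Because the ranking happens before the rewrite step, a failed call changes only the wording of the fixes, never which three are shown.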


Try it

legible.live — free, no signup, ~8 seconds.

If you're job hunting, or you know someone who is, send them the link. Thirty seconds and they might find out the resume they've been sending out for three months is being read as a blank page.

I'd love feedback — especially edge cases that break the parser. Reply here, or open an issue on GitHub.


If you found this useful, the methodology page goes deeper on how each check works and where the limits are. And if Legible finds something surprising in your resume, I'd genuinely like to hear about it.
