ckmtools

I Scored 453 Data Engineering Stack Overflow Questions for Readability — Here's What I Found

I analyze a lot of text in data pipelines. Document ingestion, user feedback processing, content quality checks — anything where you're batching text from an external source and need to know if it's usable.

One thing I've never done is systematically measure what "good" looks like. So I picked Stack Overflow as a test corpus: thousands of real technical questions, with upvotes as a quality signal. If higher-voted questions are written more clearly, that would be evidence that readability scores have real signal value in a pipeline.

Here's what I found.

The Setup

I pulled questions from Stack Overflow's public API across five data engineering tags: data-engineering, apache-spark, apache-airflow, dbt, and apache-kafka. I used the most-voted questions for each — no auth required, just the public API.

After deduplication: 453 questions, each scored with three readability metrics:

  • Flesch-Kincaid Grade Level — maps reading difficulty to US school grade (grade 8 = readable by most adults)
  • Flesch Reading Ease — inverted scale (0–100), higher is easier. Grade 8 prose ≈ 60–70.
  • Gunning Fog Index — estimates years of formal education needed to understand on first read
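
All three formulas reduce to sentence length and syllable density. Here's a minimal sketch of what they compute, using a naive vowel-group syllable counter — textstat's counter is more careful, so exact numbers will differ, but the structure is the same:

```python
import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = n / sentences   # average words per sentence
    spw = syllables / n   # average syllables per word
    return {
        "grade": 0.39 * wps + 11.8 * spw - 15.59,        # Flesch-Kincaid Grade
        "ease": 206.835 - 1.015 * wps - 84.6 * spw,      # Flesch Reading Ease
        "fog": 0.4 * (wps + 100 * complex_words / n),    # Gunning Fog
    }
```

Short sentences and short words drive every score, which is why the three metrics tend to move together.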

The scoring code is about 30 lines of Python:

import re

import requests
import textstat

SO_API = "https://api.stackexchange.com/2.3"

def fetch_questions(tag, pagesize=100):
    params = {
        "pagesize": pagesize,
        "order": "desc",
        "sort": "votes",
        "tagged": tag,
        "site": "stackoverflow",
        "filter": "withbody",  # include question bodies in the response
    }
    resp = requests.get(f"{SO_API}/questions", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

def score_question(body_html):
    # Strip code blocks and HTML tags before scoring — readability
    # formulas are meaningless on code.
    text = re.sub(r"<code>[^<]*</code>", "", body_html)
    text = re.sub(r"<[^>]+>", " ", text).strip()
    if len(text) < 100:  # too short to score reliably
        return None
    return {
        "grade": textstat.flesch_kincaid_grade(text),
        "ease": textstat.flesch_reading_ease(text),
        "fog": textstat.gunning_fog(text),
    }

questions = []
for tag in ["data-engineering", "apache-spark", "apache-airflow", "dbt", "apache-kafka"]:
    for q in fetch_questions(tag):
        scores = score_question(q.get("body", ""))
        if scores:
            questions.append({"score": q["score"], **scores})

What the Numbers Say

Top-voted questions read at a lower grade level

Split the 453 questions into quartiles by upvote count. The top quartile averaged 170 upvotes. The bottom quartile averaged 2 upvotes.

| Metric | Top 25% (avg 170 votes) | Bottom 25% (avg 2 votes) |
| --- | --- | --- |
| FK Grade Level | 7.8 | 9.9 |
| Reading Ease | 68.3 | 58.9 |
| Gunning Fog | 9.9 | 11.6 |

Top-voted questions read roughly two grade levels lower than low-voted ones. The gap isn't massive, but it's consistent: all three metrics point in the same direction.
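
The quartile split itself is a few lines. A sketch, assuming `questions` is the list of dicts built by the collection loop above:

```python
from statistics import mean

def quartile_summary(questions):
    # Rank by upvotes, then compare the top and bottom quarters.
    ranked = sorted(questions, key=lambda q: q["score"], reverse=True)
    n = max(1, len(ranked) // 4)

    def avg(rows, key):
        return round(mean(r[key] for r in rows), 1)

    keys = ("score", "grade", "ease", "fog")
    return {
        "top": {k: avg(ranked[:n], k) for k in keys},
        "bottom": {k: avg(ranked[-n:], k) for k in keys},
    }
```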

Grade level 7–8 is roughly where well-edited technical documentation lands. Anything above grade 10 starts feeling dense to most readers.

Grade level distribution across all questions

< 8   : 202 (45%) ██████████████████████
8-10  : 112 (25%) ████████████
10-12 :  65 (14%) ███████
12-14 :  41 ( 9%) ████
14+   :  33 ( 7%) ███

45% of questions score below grade 8. 70% are below grade 10. The long tail above grade 12 is mostly questions that pack multiple code snippets and dense technical jargon into one paragraph — readable by domain experts, but a wall of text to anyone else.
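
The bucketing behind that distribution is a one-liner with `bisect` — a sketch using the same band edges as the chart:

```python
import bisect

def grade_histogram(grades, edges=(8, 10, 12, 14),
                    labels=("< 8", "8-10", "10-12", "12-14", "14+")):
    # bisect_right maps each grade level to the index of its band.
    counts = dict.fromkeys(labels, 0)
    for g in grades:
        counts[labels[bisect.bisect_right(edges, g)]] += 1
    return counts
```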

Tag differences are stark

| Tag | Avg Grade Level | Avg Upvotes |
| --- | --- | --- |
| apache-spark | 7.6 | 153.8 |
| apache-airflow | 8.3 | 47.7 |
| dbt | 9.1 | 7.8 |
| apache-kafka | 9.6 | 103.0 |
| data-engineering | 10.3 | 0.3 |

The data-engineering tag has the highest grade level and the lowest average upvotes by a large margin (0.3 vs 153.8 for Spark). This is partly a maturity effect — Spark questions have been accumulating votes for a decade. But the readability gap is still interesting: Spark questions that attract attention tend to be crisp and specific. data-engineering questions are often broader, more abstract, and harder to parse.
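
The per-tag breakdown is a simple group-by. A sketch, assuming each question dict also carries the tag it was fetched under (a one-line change to the collection loop — append `"tag": tag` alongside the scores):

```python
from collections import defaultdict
from statistics import mean

def by_tag(questions):
    # Group questions by tag, then average grade level and upvotes.
    groups = defaultdict(list)
    for q in questions:
        groups[q["tag"]].append(q)
    return {
        tag: {
            "avg_grade": round(mean(q["grade"] for q in qs), 1),
            "avg_votes": round(mean(q["score"] for q in qs), 1),
        }
        for tag, qs in groups.items()
    }
```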

What This Means for Pipelines

The original point wasn't Stack Overflow for its own sake. The point was: can you use readability scores as a data quality signal?

The answer looks like yes, at least as a filter. If you're ingesting user-generated text — support tickets, product reviews, community posts — a grade level score tells you something about the question quality before any ML model touches it.

Concretely:

  • A support ticket at grade level 14 is probably either very technical or very incoherent. Either way it routes differently.
  • A batch of customer reviews with a bimodal readability distribution (very easy + very hard) is worth investigating before feeding downstream.
  • A scraping pipeline can flag outlier grade levels as likely encoding errors, cut-off text, or machine-generated spam.
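
A first-pass gate along these lines can be a few lines of routing logic. A sketch — the band edges here are illustrative, not recommendations; tune them to your corpus:

```python
def quality_gate(records, lo=3.0, hi=14.0):
    # Route records with implausible grade levels aside for review:
    # far below the band suggests truncated or junk text; far above
    # suggests encoding errors, walls of jargon, or spam.
    passed, flagged = [], []
    for r in records:
        (passed if lo <= r["grade"] <= hi else flagged).append(r)
    return passed, flagged
```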

These are cheap signals. Readability scoring is deterministic, runs in microseconds, and requires no model or GPU. One caveat: the classic formulas were calibrated for English, so on other alphabetic languages treat the scores as relative signals rather than absolute grade levels, and they don't apply to character-based scripts at all. For a first-pass quality gate in an ETL pipeline, that's still hard to beat.

The REST API Case

The textstat Python library is what I used here, and it works well. But if your pipeline isn't Python — if it's Spark (Scala/Java), a Go microservice, or a mixed-language Airflow DAG — you need HTTP.

I've been building TextLens API for exactly this: send any text to a REST endpoint, get back readability, sentiment, and keyword scores. No model download, no language constraint, no GPU. The same scores textstat computes, accessible from a curl call.

The waitlist is open at ckmtools.dev/contentapi/ if you're building something in this space.

The Code

The full analysis script (Stack Overflow fetch + scoring + quartile breakdown) is about 80 lines. If you want to run it on a different corpus — documentation pages, product descriptions, job postings — the only change is the input source. The scoring loop is the same.

The SO API allows 300 unauthenticated requests per day. More than enough to replicate this analysis or extend it to your own tag list.
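
One operational note if you extend it: Stack Exchange responses can include a `backoff` field asking clients to pause before hitting the same method again. A sketch of honoring it:

```python
import time

import requests

def backoff_seconds(payload):
    # Seconds the API asks us to wait before the next call (0 if absent).
    return payload.get("backoff", 0)

def fetch_page(url, params):
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    wait = backoff_seconds(payload)
    if wait:
        time.sleep(wait)  # ignoring backoff gets requests rejected
    return payload.get("items", [])
```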


One thing I didn't measure: whether the answers to high-voted questions are more readable than answers to low-voted questions. That's a different API call (the /answers endpoint, with body). If you try it, I'm curious what you find.
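
The fetch for that is nearly identical. A sketch against the `/questions/{ids}/answers` route, which takes up to 100 semicolon-joined question ids per call:

```python
import requests

SO_API = "https://api.stackexchange.com/2.3"

def join_ids(question_ids, limit=100):
    # The API accepts at most 100 ids per request, joined with ";".
    return ";".join(str(i) for i in question_ids[:limit])

def fetch_answers(question_ids):
    params = {
        "order": "desc",
        "sort": "votes",
        "site": "stackoverflow",
        "filter": "withbody",  # include answer bodies for scoring
    }
    url = f"{SO_API}/questions/{join_ids(question_ids)}/answers"
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]
```

The returned answer bodies go through the same `score_question` function as the questions did.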
