DEV Community: Rafael Poyiadzi

Stop Writing Regex for Data You Should Be Describing in English

Rafael Poyiadzi — Tue, 10 Feb 2026 11:08:53 +0000

You have a spreadsheet of job postings and you need to filter it down to roles that are remote-friendly, senior-level, and have a disclosed salary. Sounds straightforward, except the data looks like this:

company	post
Airtable	Async-first team, 8+ yrs exp, $185-220K base
Vercel	Lead our NYC team. Competitive comp, DOE
Notion	In-office SF. Staff eng, $200K + equity
Linear	Bootcamp grads welcome! $85K, remote-friendly
Descript	Work from anywhere. Principal architect, $250K

Now try writing deterministic rules for that.
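A first attempt with regular expressions might look like the sketch below. The three patterns are illustrative guesses written for this example, not rules from any library, and they show how quickly keyword matching drifts away from the intent.

```python
import re

posts = {
    "Airtable": "Async-first team, 8+ yrs exp, $185-220K base",
    "Vercel": "Lead our NYC team. Competitive comp, DOE",
    "Notion": "In-office SF. Staff eng, $200K + equity",
    "Linear": "Bootcamp grads welcome! $85K, remote-friendly",
    "Descript": "Work from anywhere. Principal architect, $250K",
}

# Hypothetical keyword rules: each one encodes a single reading of the criterion.
REMOTE = re.compile(r"remote|work from anywhere", re.IGNORECASE)
# Note that "Lead" is treated as senior here, although it is ambiguous.
SENIOR = re.compile(r"\d+\+\s*yrs|\b(?:senior|staff|principal|lead)\b", re.IGNORECASE)
SALARY = re.compile(r"\$\d+")


def qualifies(post: str) -> bool:
    """True only if every keyword rule finds a match in the post."""
    return all(pattern.search(post) for pattern in (REMOTE, SENIOR, SALARY))


naive = {company: qualifies(post) for company, post in posts.items()}
print(naive)
```

With these rules only Descript passes. Airtable is rejected because "Async-first" never mentions the word "remote", and every phrasing you add to fix that opens the door to new false positives.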

"Remote-friendly" could be "remote", "work from anywhere", "async-first", or implied by the absence of an office mention.
"Senior-level" might be "8+ yrs", "Staff", "Principal", or "Lead" — but "Lead" could also be a junior team lead.
"Salary disclosed" means actual numbers, not "Competitive comp" or "DOE."

What if you could just describe what you want?

everyrow lets you define fuzzy, qualitative logic in natural language and apply it to every row of a dataframe. The SDK handles LLM orchestration, structured outputs, and scaling; you specify the judgment criteria in plain English.

Here's the job screening example:

import asyncio
import pandas as pd
from pydantic import BaseModel, Field
from everyrow.ops import screen

jobs = pd.DataFrame([
    {"company": "Airtable", "post": "Async-first team, 8+ yrs exp, $185-220K base"},
    {"company": "Vercel", "post": "Lead our NYC team. Competitive comp, DOE"},
    {"company": "Notion", "post": "In-office SF. Staff eng, $200K + equity"},
    {"company": "Linear", "post": "Bootcamp grads welcome! $85K, remote-friendly"},
    {"company": "Descript", "post": "Work from anywhere. Principal architect, $250K"},
])

class JobScreenResult(BaseModel):
    qualifies: bool = Field(description="True if meets ALL criteria")

async def main():
    result = await screen(
        task="""
        Qualifies if ALL THREE are met:
        1. Remote-friendly
        2. Senior-level (5+ yrs exp OR Senior/Staff/Principal in title)
        3. Salary disclosed (specific numbers, not "competitive" or "DOE")
        """,
        input=jobs,
        response_model=JobScreenResult,
    )
    print(result.data)

asyncio.run(main())

That's it. No regex, no threshold tuning, no parsing logic. The screen operation evaluates every row against your natural-language criteria using an LLM and returns structured results via a Pydantic model.

The output:

company	qualifies
Airtable	True
Vercel	False
Notion	False
Linear	False
Descript	True

Airtable qualifies: async-first (remote-friendly), 8+ years (senior), $185-220K (salary disclosed).
Descript qualifies: work from anywhere (remote), principal architect (senior), $250K (salary disclosed).
The rest fail on at least one criterion: Vercel has no real salary, Notion is in-office, Linear isn't senior-level.

Sessions: Track Everything in a Dashboard

Every operation runs within a session: a grouping of related operations that appears in the everyrow.io web UI. Sessions are created automatically, but for multi-step pipelines you'll want to create one explicitly:

from everyrow import create_session
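from everyrow.ops import screen, rank

# Assumes `leads` is a pandas DataFrame and `ScreenResult` is a Pydantic model
# like JobScreenResult above. Run this inside an async function.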

async with create_session(name="Lead Qualification") as session:
    print(f"View at: {session.get_url()}")

    screened = await screen(
        session=session,
        task="Has a company email domain (not gmail, yahoo, etc.)",
        input=leads,
        response_model=ScreenResult,
    )

    ranked = await rank(
        session=session,
        task="Score by likelihood to convert",
        input=screened.data,
        field_name="conversion_score",
    )

The session URL gives you a live dashboard where you can monitor progress and inspect results while your script runs.

Background Jobs for Large Datasets

All the operations above are already async/await. The _async variants are different — they're fire-and-forget: they submit work to the server and return immediately so your script can continue:

from everyrow import create_session
from everyrow.ops import screen_async

async with create_session(name="Background Screening") as session:
    task = await screen_async(
        session=session,
        task="Remote-friendly, senior-level, salary disclosed",
        input=large_dataframe,
    )
    print(f"Task ID: {task.task_id}")
    # do other work...
    result = await task.await_result()

If your script crashes, recover the result later using the task ID:

from everyrow import fetch_task_data
df = await fetch_task_data("12345678-1234-1234-1234-123456789abc")
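The difference between awaiting an operation and submitting it for later collection can be illustrated with plain asyncio. This is a generic analogy rather than everyrow's implementation: slow_operation is a hypothetical stand-in for a server-side job, and nothing here calls the everyrow API.

```python
import asyncio


async def slow_operation(log: list[str]) -> str:
    """Hypothetical stand-in for a long-running job."""
    log.append("job started")
    await asyncio.sleep(0.01)
    log.append("job finished")
    return "result"


async def main() -> list[str]:
    log: list[str] = []

    # Pattern 1: await the operation directly; the caller waits until it is done.
    await slow_operation(log)
    log.append("after awaited call")

    # Pattern 2: submit the work, keep a handle, do other things, collect later.
    handle = asyncio.create_task(slow_operation(log))
    log.append("other work")
    result = await handle
    log.append(f"collected {result}")
    return log


events = asyncio.run(main())
print(events)
```

In the second pattern "other work" is logged before the job finishes, which is the behaviour the _async variants are described as providing across a network boundary.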

Beyond Screening: What Else Can You Do?

screen is just one of several operations:

Operation	What it does
Screen	Filter rows by criteria that require judgment
Rank	Score rows by qualitative factors
Dedupe	Deduplicate when fuzzy string matching isn't enough
Merge	Join tables when keys don't match exactly
Research	Run web agents to research each row

Each operation takes a natural-language task description and a dataframe, and returns structured results. Same pattern, different capability.

When to Use This (and When Not To)

everyrow is designed for cases where the logic is easy to describe but hard to code: screening, ranking, deduplication, and enrichment tasks where the criteria require judgment.

It's not a replacement for deterministic transformations. If you can write a reliable df[df["salary"] > 100000], you should. Use everyrow for the columns where the values are natural language, inconsistent, or require world knowledge to interpret.

The tradeoff is latency and cost: LLM-based operations are slower and not free. For the job screening example above, processing 5 rows takes a few seconds and costs a fraction of a cent. For 10,000 rows, you'd want the async variants and should expect minutes rather than milliseconds. The docs cover scaling patterns for larger datasets.

Get Started

pip install everyrow
export EVERYROW_API_KEY=your_key_here

Get a free API key at everyrow.io/api-key; it comes with $20 of free credit.

Full docs and more examples: everyrow.io/docs/getting-started

Is LLM Data Labeling Good Enough to Train On? We Tested It and the Answer Is Yes

Rafael Poyiadzi — Mon, 09 Feb 2026 10:36:26 +0000

You're building a classifier but data labeling is your bottleneck. Hiring annotators is slow, expensive, and hard to scale — and label quality varies across annotators. What if an LLM could label your data automatically, with structured outputs that guarantee valid labels, and match human accuracy?

We built an automated data annotation pipeline using everyrow and tested whether LLM-generated labels are good enough to train a classifier. The answer: yes — the LLM matches human-label performance at a fraction of the cost.

The Problem: Data Labeling is Expensive

Active learning reduces labeling costs by letting the model choose which examples to label next, focusing on the ones it's most uncertain about. But you still need an oracle to provide those labels — traditionally a human annotator.

We replaced the human annotator with an LLM oracle using everyrow.agent_map, then ran a controlled experiment on DBpedia-14 (14-class text classification) to measure whether automated data labeling produces labels good enough to train on.

Building an LLM Data Labeling Pipeline with everyrow

The core of the pipeline is everyrow.agent_map with a Pydantic response model. The LLM can only return one of 14 valid categories — no parsing or cleanup needed:

class DBpediaClassification(BaseModel):
    category: Literal[
        "Company", "Educational Institution", "Artist",
        "Athlete", "Office Holder", "Mean Of Transportation",
        "Building", "Natural Place", "Village",
        "Animal", "Plant", "Album", "Film", "Written Work",
    ] = Field(description="The DBpedia ontology category")

async def query_llm_oracle(texts_df: pd.DataFrame) -> list[int]:
    async with create_session(name="Active Learning Oracle") as session:
        result = await agent_map(
            session=session,
            task="Classify this text into exactly one DBpedia ontology category.",
            input=texts_df[["text"]],
            response_model=DBpediaClassification,
            effort_level=EffortLevel.LOW,
        )
        return [CATEGORY_TO_ID.get(result.data["category"].iloc[i], -1)
                for i in range(len(texts_df))]
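The oracle above relies on a CATEGORY_TO_ID lookup that is not shown. One way to build it is sketched below, assuming the integer IDs follow the order in which the categories are listed. That ordering is an assumption, so check it against the label names shipped with your copy of the dataset. The helper to_label_id is a hypothetical name.

```python
from typing import Literal, get_args

Category = Literal[
    "Company", "Educational Institution", "Artist",
    "Athlete", "Office Holder", "Mean Of Transportation",
    "Building", "Natural Place", "Village",
    "Animal", "Plant", "Album", "Film", "Written Work",
]

# Assumption: label IDs are 0..13 in the order the categories appear above.
CATEGORY_TO_ID = {name: idx for idx, name in enumerate(get_args(Category))}


def to_label_id(category: str) -> int:
    """Map a category name to its integer ID, or -1 if it is not recognised."""
    return CATEGORY_TO_ID.get(category, -1)


print(to_label_id("Film"), to_label_id("Not A Category"))
```

Deriving the mapping from the same Literal used in the response model keeps the two from drifting apart if a category name is edited.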

We used a TF-IDF + LightGBM classifier with entropy-based uncertainty sampling. Each iteration selects the 20 most uncertain examples, sends them to the LLM for annotation, and retrains. 10 iterations, 200 labels total.
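Entropy-based uncertainty sampling can be sketched in a few lines. The probabilities below are made up for illustration and the function names are hypothetical. In the experiment the scores would come from the classifier's predicted class probabilities over the unlabeled pool, with k=20.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(pred_probs: list[list[float]], k: int) -> list[int]:
    """Indices of the k examples whose predictions have the highest entropy."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy pool: four examples, three classes (illustrative numbers only).
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]
selected = most_uncertain(pool_probs, k=2)
print(selected)  # [1, 2]
```

The selected indices are the rows that would be sent to the oracle for labels before the next retraining round.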

We ran 10 independent repeats with different seeds, each time running both a ground truth oracle (human labels) and the LLM oracle with the same seed — a direct, controlled comparison.

LLM Labels Match Human Accuracy — Within 0.1% Across 10 Runs

Final test accuracies averaged over 10 repeats:

Data Labeling Method	Final Accuracy (mean ± std)
Human annotation (ground truth)	80.6% ± 1.0%
LLM annotation (everyrow)	80.7% ± 0.8%

The LLM oracle is within noise of the ground truth baseline — automated data labeling produces classifiers just as good as human-labeled data.

Label Quality: 96% Agreement with Human Annotations

The LLM agreed with ground truth labels 96.1% ± 1.6% of the time. Roughly 1 in 25 labels disagrees with the human annotation, but that doesn't hurt the downstream classifier.
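The agreement figure is simply the fraction of queried examples where the two oracles return the same label. A minimal sketch, using toy labels rather than the experiment's data:

```python
def agreement_rate(llm_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of positions where the LLM label equals the human label."""
    if not human_labels or len(llm_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(human_labels)


# Toy example: 25 labels with a single disagreement.
human = ["Film"] * 25
llm = ["Film"] * 24 + ["Album"]
rate = agreement_rate(llm, human)
print(f"{rate:.0%}")  # 96%
```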

Data Labeling Cost: $0.26 per Run

Metric	Value
Cost per run (200 labels)	$0.26
Cost per labeled item	$0.0013
Total (10 repeats)	$2.58

200 labels in under 5 minutes for $0.26, fully automated. Compare that to hiring human annotators — even at minimum wage, manual labeling of 200 items would take longer and cost more, with no guarantee of higher quality.
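The figures in the table can be cross-checked with a little arithmetic. The only inputs are the numbers reported above. Reading the gap between $2.58 and 10 × $0.26 as a rounding effect is an assumption.

```python
cost_per_run = 0.26    # reported, USD
labels_per_run = 200   # 10 iterations x 20 labels
repeats = 10
reported_total = 2.58  # reported, USD

cost_per_label = cost_per_run / labels_per_run
mean_run_cost = reported_total / repeats

print(f"cost per label: ${cost_per_label:.4f}")    # $0.0013
print(f"mean cost per run: ${mean_run_cost:.3f}")  # $0.258, which rounds to $0.26
```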

When to Use LLM Data Labeling

LLM annotation works. On this task, the LLM matches human-label performance despite ~4% label disagreement.
Structured outputs matter. Pydantic response models guarantee valid labels — no post-hoc parsing or cleanup.
It's practical. 200 labels in under 5 minutes for $0.26, fully automated.

Limitations: We tested on one dataset with well-separated categories. More ambiguous labeling tasks may see a gap between human and LLM annotation quality. We used a simple classifier (TF-IDF + LightGBM); neural models that overfit individual examples may be less noise-tolerant.

Try it yourself: Get a free API key from everyrow.io ($20 free credit) and run the companion notebook.

Resources

Companion notebook on Kaggle — Run the full data labeling pipeline yourself
Experiment runner code — available on request
everyrow SDK — Python SDK for running LLM operations over dataframes
everyrow.io/docs — Documentation
everyrow.io/docs/getting-started - Getting started
everyrow.io/api-key - API keys ($20 free credit)
DBpedia-14 dataset — The dataset used in this study

Introducing `everyrow.io/dedupe`: An LLM-based approach to semantic deduplication

Rafael Poyiadzi — Thu, 22 Jan 2026 12:53:02 +0000

Deduplicating records is a recurring problem in data engineering, and several challenges make it difficult: scale, surface-level variation, context-dependent equivalence, and world knowledge.

Let's look at an example. We wanted to build a database of AI researchers from academic lab websites. Scraping produced:

Name variations: "Julie Kallini" vs "J. Kallini", "Moscato, Vincenzo" vs "Vincenzo Moscato"
Typos: "Namoi Saphra" vs "Naomi Saphra", "Bryan Wiledr" vs "Bryan Wilder"
Career changes: Same person listed at "AUTON Lab" and later at "AUTON Lab (Former)" with different emails
GitHub handles: Sometimes the only reliable link between records—"A. Butoi" and "Alexandra Butoi" sharing butoialexandra
Username-only names: Researchers who listed their GitHub handle ("smirchan", "VSAnimator") instead of their real name

We used a dataset of 200 researcher profiles scraped from academic lab websites. It was then manually reviewed to establish ground-truth clusters, which we used for evaluation.

The data covers name, position, organisation, email, university, and GitHub. GitHub handles are present in ~40% of rows and act as a high-precision but low-recall signal.

row_id	name	position	organisation	email	university	github
2	A. Butoi	PhD Student	Rycolab	alexandra.butoi@personal.edu	ETH Zurich	`butoialexandra`
8	Alexandra Butoi	—	Ryoclab	—	—	`butoialexandra`
43	Namoi Saphra	Research Fellow	—	nsaphra@alumni	-	`nsaphra`
47	Naomi Saphra	—	Harvard / BU / EleutherAI	nsaphra@fas.harvard.edu	—	`nsaphra`
18	T. Gupta	PhD Student	AUTON Lab (Former)	—	Carnegie Mellon	`tejus-gupta`
26	Tejus Gupta	PhD Student	AUTON Lab	tejusg@cs.cmu.edu	Carnegie Mellon	`tejus-gupta`
55	smirchan	PhD Student	—	suvir@yahoo.com	Stanford University	`smirchan`
155	Suvir Mirchandani	PhD Student	Stanford CRFM	suvir@cs.stanford.edu	—	`smirchan`
98	Vincenzo Moscato	Full Professor	—	vincenzo.moscato@unina.it	—	—
133	Moscato, Vincenzo	Full Professor	University of Naples	vincenzo.moscato@yahoo.com	—	—

A go-to approach is fuzzy string matching using libraries like fuzzywuzzy or rapidfuzz. However, these suffer from the threshold problem: set it too low and you catch false positives; set it too high and you miss semantic duplicates like "A. Butoi" ↔ "Alexandra Butoi" which have low character overlap despite being the same person. The alternative is manual review, but with 200 rows requiring ~5 comparisons each, that's hours of tedious work.

We benchmarked fuzzy string matching as a baseline. It compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold using Union-Find clustering (a graph algorithm that efficiently merges items into equivalence classes by treating each match as an edge).
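A baseline of this shape can be sketched with the standard library alone. This is not the benchmark code: difflib.SequenceMatcher stands in for a dedicated fuzzy-matching library, only the name field is compared, and the helper names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def token_sort_key(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the token-sorted forms of two names."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


class UnionFind:
    """Disjoint-set structure: each match merges two equivalence classes."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def fuzzy_clusters(names: list[str], threshold: float) -> list[list[int]]:
    """Group row indices whose pairwise similarity reaches the threshold."""
    uf = UnionFind(len(names))
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            uf.union(i, j)
    groups: dict[int, list[int]] = {}
    for i in range(len(names)):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())


names = [
    "Vincenzo Moscato", "Moscato, Vincenzo",
    "A. Butoi", "Alexandra Butoi",
    "Rohan Saha", "Rohan Chandra",
]
clusters = fuzzy_clusters(names, threshold=0.75)
print(clusters)
```

On these six names the reordered Moscato pair is merged, the two Rohans stay apart, and "A. Butoi" is not linked to "Alexandra Butoi" because the abbreviated form shares too few characters with the full name.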

Metric	Fuzzy (t=0.75)	Fuzzy (t=0.90)
Row accuracy	86%	82%
Cluster accuracy	82%	78%
Easy duplicates	58% (7/12)	17% (2/12)
Hard duplicates	70% (7/10)	10% (1/10)
Distractor accuracy	90% (18/20)	100% (20/20)
Singletons	90% (90/100)	100% (100/100)
Processing time	0.04s	0.04s
Cost	$0	$0

At t=0.75 it catches more duplicates but risks false merges. At t=0.90 it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".

We next wanted to try ChatGPT. We uploaded the CSV and asked it to deduplicate.

Metric	ChatGPT
Row accuracy	56%
Cluster accuracy	45% (72/160)
Easy duplicates	100% (12/12)
Hard duplicates	70% (7/10)
Distractor accuracy	25% (5/20)
Singletons	33% (33/100)
Output rows	72
Data loss (over-merged)	88 clusters

ChatGPT over-merged:

88 clusters lost — unique people incorrectly merged into other records
Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged

This is where everyrow.io/dedupe comes in. Instead of relying on string similarity thresholds, it uses LLMs to make contextual judgments about whether two records represent the same entity.

The system exposes a high-level deduplication operation that accepts a dataset and a natural-language equivalence definition. The equivalence relation can be as descriptive as needed and could also include examples.

from everyrow import create_client, create_session
from everyrow.ops import dedupe
import pandas as pd

input_df = pd.read_csv("researchers.csv")

async with create_client() as client:
    async with create_session(client, name="Researcher Dedupe") as session:
        result = await dedupe(
            session=session,
            input=input_df,
            equivalence_relation=(
                "Two rows are duplicates if they represent the same person "
                "despite different email/organization (career changes). "
                "Consider name variations like typos, nicknames (Robert/Bob), "
                "and format differences (John Smith/J. Smith)."
            ),
        )
        result.data.to_csv("deduplicated.csv", index=False)

Accuracy was evaluated by comparing predicted equivalence classes against manually labeled ground truth. We report both row-level accuracy (whether a row is assigned to the correct cluster) and cluster-level accuracy (whether an entire entity cluster is correctly reconstructed).
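The two metrics can be made concrete with a small sketch. It encodes one possible reading of the definitions above, not the evaluation script itself: a ground-truth cluster counts as correct only if the prediction contains exactly the same set of rows, and a row counts as correct only if its predicted cluster equals its true cluster.

```python
def cluster_accuracy(pred: list[list[int]], truth: list[list[int]]) -> float:
    """Share of ground-truth clusters reproduced exactly in the prediction."""
    pred_sets = {frozenset(cluster) for cluster in pred}
    truth_sets = [frozenset(cluster) for cluster in truth]
    return sum(cluster in pred_sets for cluster in truth_sets) / len(truth_sets)


def row_accuracy(pred: list[list[int]], truth: list[list[int]]) -> float:
    """Share of rows whose predicted cluster matches their true cluster."""
    pred_of = {row: frozenset(cluster) for cluster in pred for row in cluster}
    truth_of = {row: frozenset(cluster) for cluster in truth for row in cluster}
    correct = sum(pred_of.get(row) == cluster for row, cluster in truth_of.items())
    return correct / len(truth_of)


# Toy example: rows 0 and 1 are duplicates, as are rows 3 and 4.
truth = [[0, 1], [2], [3, 4]]
pred = [[0, 1], [2, 3], [4]]  # row 3 was merged into the wrong cluster
print(cluster_accuracy(pred, truth), row_accuracy(pred, truth))
```

A single misplaced row breaks two clusters at once here, which is why cluster-level accuracy tends to be the stricter of the two numbers.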

Metric	Fuzzy (t=0.75)	Fuzzy (t=0.90)	ChatGPT	everyrow.io/dedupe
Row accuracy	86%	82%	56%	98%
Cluster accuracy	82%	78%	45%	97.5%
Easy duplicates	58% (7/12)	17% (2/12)	100% (12/12)	100% (12/12)
Hard duplicates	70% (7/10)	10% (1/10)	70% (7/10)	100% (10/10)
Distractor accuracy	90% (18/20)	100% (20/20)	25% (5/20)	95% (19/20)
Singletons	90% (90/100)	100% (100/100)	33% (33/100)	100% (100/100)
Processing time	0.04s	0.04s	NA	90s
Cost	$0	$0	NA	$0.42

Here are a few examples from everyrow.io/dedupe, starting with matches it found:

✓ Match: Name abbreviation + org typo

Row 2: "A. Butoi" — Rycolab, ETH Zurich, butoialexandra
Row 8: "Alexandra Butoi" — Ryoclab (typo), butoialexandra

✓ Match: Typo in first name

Row 43: "Namoi Saphra" — nsaphra
Row 47: "Naomi Saphra" — Harvard/BU/EleutherAI, nsaphra

✓ Match: Career transition

Row 18: "T. Gupta" — AUTON Lab (Former), tejus-gupta
Row 26: "Tejus Gupta" — AUTON Lab, tejus-gupta

✓ Match: Username-only name

Row 55: "smirchan" — Stanford University, smirchan
Row 155: "Suvir Mirchandani" — Stanford CRFM, smirchan

✗ Correctly identified as different people:

Row 6: "Rohan Saha" — Alberta, simpleParadox
Row 141: "Rohan Chandra" — UT Austin, rohanchandra30

And the errors made:

⚠ Over-merged: Same institution

"Sarah Ball" and "Wen (Lavine) Lai" — both at MCML, PhD students

⚠ Over-merged: Co-authors

"Marwa Abdulhai" and "Tejus Gupta" — they co-authored a paper

⚠ Over-merged: Co-authors + username names

"Suvir Mirchandani", "Igor Oliveira", and "Vishnu Sarukkai" — all three co-authored the same paper; username-only names made disambiguation harder

How does it work?

The system implements a multi-stage deduplication pipeline designed to reduce pairwise comparisons while preserving semantic recall.

Semantic Item Comparison: Each row is compared against others using an LLM that understands context—recognising that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organisation.
Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.
Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity.
Validation: Each multi-member cluster is re-evaluated to catch false positives—cases where the initial comparison was too aggressive. Validation is necessary to mitigate error propagation introduced by transitive closure in the association graph.
Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
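The equivalence-class step, and the reason validation follows it, can be illustrated with a generic connected-components sketch. This is not the library's implementation: the function name and the toy match pairs are assumptions made for the example.

```python
from collections import defaultdict, deque


def equivalence_classes(items: list[str], matches: list[tuple[str, str]]) -> list[list[str]]:
    """Connected components of the graph whose edges are pairwise matches."""
    graph: dict[str, set[str]] = defaultdict(set)
    for a, b in matches:
        graph[a].add(b)
        graph[b].add(a)

    seen: set[str] = set()
    classes: list[list[str]] = []
    for item in items:
        if item in seen:
            continue
        seen.add(item)
        queue = deque([item])
        component = []
        while queue:  # breadth-first walk over everything reachable from item
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        classes.append(sorted(component))
    return classes


items = ["A", "B", "C", "D", "E", "F"]
good_matches = [("A", "B"), ("B", "C"), ("D", "E")]
classes = equivalence_classes(items, good_matches)
print(classes)  # [['A', 'B', 'C'], ['D', 'E'], ['F']]

# One spurious match between C and D fuses two otherwise separate entities.
fused = equivalence_classes(items, good_matches + [("C", "D")])
print(fused)  # [['A', 'B', 'C', 'D', 'E'], ['F']]
```

A and C end up together without ever being compared directly, and a single false edge is enough to merge five records, which is the error propagation the validation stage is meant to catch.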

The tradeoff: fuzzy matching is 2000x faster and free, but has a 12-16% accuracy gap. For datasets where false merges are costly, the LLM-based approach may be worth the additional runtime and cost.
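The speed and accuracy figures follow directly from the comparison table. A quick check, using only numbers reported above:

```python
fuzzy_seconds = 0.04
llm_seconds = 90
row_accuracy = {"fuzzy_t075": 86, "fuzzy_t090": 82, "everyrow": 98}  # percent

speedup = llm_seconds / fuzzy_seconds
gaps = (
    row_accuracy["everyrow"] - row_accuracy["fuzzy_t075"],
    row_accuracy["everyrow"] - row_accuracy["fuzzy_t090"],
)

print(f"fuzzy matching is about {speedup:.0f}x faster")  # about 2250x
print(f"row accuracy gap: {gaps[0]} to {gaps[1]} points")  # 12 to 16 points
```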

98% row-level accuracy on a dataset with conflicting signals
90 seconds processing time and $0.42 LLM cost for 200 records
4 false positive clusters due to co-authorship signals and shared institution

This approach is most appropriate when:

Semantic judgment is required: Name variations, abbreviations, nicknames
Conflicting signals exist: Same person with different emails/organisations over time
No single reliable key: Can't rely on email or ID alone

Use it yourself!

Obtain an API key at everyrow.io
Install the SDK: uv pip install everyrow or visit the GitHub page: https://github.com/futuresearch/everyrow-sdk
Define your equivalence relation in natural language
Compare results against your ground-truth labels