Rafael Poyiadzi
Stop Writing Regex for Data You Should Be Describing in English

You have a spreadsheet of job postings and you need to filter it down to roles that are remote-friendly, senior-level, and have a disclosed salary. Sounds straightforward, except the data looks like this:

| company | post |
| --- | --- |
| Airtable | Async-first team, 8+ yrs exp, $185-220K base |
| Vercel | Lead our NYC team. Competitive comp, DOE |
| Notion | In-office SF. Staff eng, $200K + equity |
| Linear | Bootcamp grads welcome! $85K, remote-friendly |
| Descript | Work from anywhere. Principal architect, $250K |

Now try writing deterministic rules for that.

  • "Remote-friendly" could be "remote", "work from anywhere", "async-first", or implied by the absence of an office mention.
  • "Senior-level" might be "8+ yrs", "Staff", "Principal", or "Lead" — but "Lead" could also be a junior team lead.
  • "Salary disclosed" means actual numbers, not "Competitive comp" or "DOE."
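To see why deterministic rules fall short, here's one plausible first attempt at the "remote-friendly" criterion with a keyword regex, run against three of the posts above (a sketch, not a recommendation):

```python
import re

posts = [
    "Async-first team, 8+ yrs exp, $185-220K base",   # remote in practice
    "Lead our NYC team. Competitive comp, DOE",       # on-site
    "Work from anywhere. Principal architect, $250K", # remote
]

# Naive keyword pattern for "remote-friendly"
remote = re.compile(r"remote|work from anywhere", re.IGNORECASE)

print([bool(remote.search(p)) for p in posts])
# [False, False, True]
```

The async-first post is remote-friendly in practice, but the pattern misses it, and every new phrasing means another alternation bolted onto the regex.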

What if you could just describe what you want?

everyrow lets you define fuzzy, qualitative logic in natural language and apply it to every row of a dataframe. You specify the judgment criteria in plain English; the SDK handles LLM orchestration, structured outputs, and scaling.

Here's the job screening example:

```python
import asyncio
import pandas as pd
from pydantic import BaseModel, Field
from everyrow.ops import screen

jobs = pd.DataFrame([
    {"company": "Airtable", "post": "Async-first team, 8+ yrs exp, $185-220K base"},
    {"company": "Vercel", "post": "Lead our NYC team. Competitive comp, DOE"},
    {"company": "Notion", "post": "In-office SF. Staff eng, $200K + equity"},
    {"company": "Linear", "post": "Bootcamp grads welcome! $85K, remote-friendly"},
    {"company": "Descript", "post": "Work from anywhere. Principal architect, $250K"},
])

class JobScreenResult(BaseModel):
    qualifies: bool = Field(description="True if meets ALL criteria")

async def main():
    result = await screen(
        task="""
        Qualifies if ALL THREE are met:
        1. Remote-friendly
        2. Senior-level (5+ yrs exp OR Senior/Staff/Principal in title)
        3. Salary disclosed (specific numbers, not "competitive" or "DOE")
        """,
        input=jobs,
        response_model=JobScreenResult,
    )
    print(result.data)

asyncio.run(main())
```

That's it. No regex, no threshold tuning, no parsing logic. The screen operation evaluates every row against your natural-language criteria using an LLM and returns structured results via a Pydantic model.

The output:

| company | qualifies |
| --- | --- |
| Airtable | True |
| Vercel | False |
| Notion | False |
| Linear | False |
| Descript | True |

  • Airtable qualifies: async-first (remote-friendly), 8+ years (senior), $185-220K (salary disclosed).
  • Descript qualifies: work from anywhere (remote), principal architect (senior), $250K (salary disclosed).
  • The rest fail on at least one criterion: Vercel has no real salary, Notion is in-office, Linear isn't senior-level.

Sessions: Track Everything in a Dashboard

Every operation runs within a session: a grouping of related operations that appears in the everyrow.io web UI. Sessions are created automatically, but for multi-step pipelines you'll want to create one explicitly:

```python
from everyrow import create_session
from everyrow.ops import screen, rank

# Runs inside an async function, with `leads` and `ScreenResult` defined elsewhere
async with create_session(name="Lead Qualification") as session:
    print(f"View at: {session.get_url()}")

    screened = await screen(
        session=session,
        task="Has a company email domain (not gmail, yahoo, etc.)",
        input=leads,
        response_model=ScreenResult,
    )

    ranked = await rank(
        session=session,
        task="Score by likelihood to convert",
        input=screened.data,
        field_name="conversion_score",
    )
```

The session URL gives you a live dashboard where you can monitor progress and inspect results while your script runs.

Background Jobs for Large Datasets

All the operations above are already async/await. The _async variants are different: they're fire-and-forget, submitting work to the server and returning immediately so your script can continue:

```python
from everyrow.ops import screen_async

# Runs inside an async function; large_dataframe is the DataFrame to process
async with create_session(name="Background Screening") as session:
    task = await screen_async(
        session=session,
        task="Remote-friendly, senior-level, salary disclosed",
        input=large_dataframe,
    )
    print(f"Task ID: {task.task_id}")
    # do other work...
    result = await task.await_result()
```

If your script crashes, recover the result later using the task ID:

```python
from everyrow import fetch_task_data

df = await fetch_task_data("12345678-1234-1234-1234-123456789abc")
```

Beyond Screening: What Else Can You Do?

screen is just one of several operations:

| Operation | What it does |
| --- | --- |
| Screen | Filter rows by criteria that require judgment |
| Rank | Score rows by qualitative factors |
| Dedupe | Deduplicate when fuzzy string matching isn't enough |
| Merge | Join tables when keys don't match exactly |
| Research | Run web agents to research each row |

Each operation takes a natural-language task description and a dataframe, and returns structured results. Same pattern, different capability.

When to Use This (and When Not To)

everyrow is designed for cases where the logic is easy to describe but hard to code: screening, ranking, deduplication, and enrichment tasks where the criteria require judgment.

It's not a replacement for deterministic transformations. If you can write a reliable df[df["salary"] > 100000], you should. Use everyrow for the columns where the values are natural language, inconsistent, or require world knowledge to interpret.
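For contrast, once a column is already clean and numeric, plain pandas is the right tool (a minimal sketch with made-up data; the column names are illustrative):

```python
import pandas as pd

# Hypothetical table where salary has already been parsed into a number
df = pd.DataFrame({
    "company": ["Airtable", "Linear"],
    "salary": [185000, 85000],
})

# Deterministic, fast, and free: no LLM needed for structured columns
high_paying = df[df["salary"] > 100000]
print(high_paying["company"].tolist())  # ['Airtable']
```

The LLM-based operations earn their keep only on the messy free-text columns, like the `post` field above.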

The tradeoff is latency and cost: LLM-based operations are slower and not free. For the job screening example above, processing 5 rows takes a few seconds and costs a fraction of a cent. For 10,000 rows, you'd want the async variants and should expect minutes rather than milliseconds. The docs cover scaling patterns for larger datasets.

Get Started

```shell
pip install everyrow
export EVERYROW_API_KEY=your_key_here
```

Get a free API key at everyrow.io/api-key; it comes with $20 of free credit.

Full docs and more examples: everyrow.io/docs/getting-started

