DEV Community: AI Dev Hub

Catch skill regressions before they ship in 2026

AI Dev Hub — Tue, 26 May 2026 14:01:04 +0000

Catch skill regressions before they ship in 2026

Build a deterministic regression suite that reruns every skill case on each prompt change and blocks the merge when the risk-weighted pass rate drops below your gate. The Skill Regression Suite Builder generates those case files for you in a CI-ready format. It can't replace human judgment on edge cases. What it does is stop the silent breakages that ship when you tweak one line of a system prompt.

The Skill Regression Suite Builder I link to below is one I built. I tried four eval frameworks before this, and every one assumed I'd ship my prompts and test data to their servers. I didn't want my system prompts leaving my laptop. So it runs entirely client-side. It's free and asks for no signup, and nothing you paste ever leaves the browser. If you've got a better workflow, tell me, I'm not precious about it.

The bug I shipped because I trusted a one-line prompt edit

Three weeks ago, on a Tuesday afternoon, I edited a single sentence in a skill that classifies incoming support tickets. The change looked harmless. I added "prioritize billing issues" to the instructions because a stakeholder asked. I ran the skill once by hand. The output looked sensible, so I merged it.

By Thursday morning the skill was misrouting password-reset tickets into the billing queue. Not all of them. About 1 in 6. Enough that nobody caught it for a day, and enough that our support lead pinged me at 8:42am asking why the billing queue had doubled overnight.

The fix took two minutes. Finding it took most of the morning, because I had no record of what "working" used to look like. I'd been testing skills the way most people do: open it, type a few inputs I can think of off the top of my head, eyeball the output, ship. That works right up until it doesn't. The trouble is that the inputs I dream up on the spot are the easy ones. The cases that actually break are the ones I'd never think to type, which is exactly why they slip through.

Here's the thing about skills. A skill is just a prompt plus some tools, and prompts are absurdly sensitive to wording. Shift one clause and the model quietly re-weights everything downstream. A regression suite for application code is normal practice. A regression suite for prompts barely exists in most teams I've talked to, even though prompts break in the same sneaky, invisible ways code does. I wanted the same safety net I already have for my functions: a fixed set of inputs with known-good outputs that runs automatically on every change and ends in one clear pass or fail.

That gap is what this tool fills. It does one job: turn the test cases living in your head into a file your CI can run on every change.

How the builder turns loose test ideas into a CI gate

The core idea is boring on purpose, and that's a compliment. You hand it a list of cases. Each case has two halves: an input with its expected behavior, plus a risk weight you assign to it. The builder produces a deterministic suite file plus a small runner you can drop straight into CI. Deterministic is the operative word: it pins temperature to 0 and uses exact or rule-based matching, so the same input gives you the same verdict every single run. Flaky evals are worse than no evals, because the first time one fails for no reason you stop trusting all of them. A suite you don't trust is just guilt that runs in CI.

The risk weight is the piece I didn't expect to care about, and now I won't build a suite without it. Test cases don't all matter the same amount. A misrouted billing ticket costs real money and a real apology. A slightly stiff greeting costs nothing. So each case carries a weight, and the gate checks a weighted pass rate rather than a raw count. You can pass 95% of your cases and still fail the gate, if the 5% you broke happened to be the expensive ones. I set my own weights on a one-to-five scale and treat anything that touches money or data as a five. You'll pick your own scale, the point is just that the gate respects it.

Here is what a generated suite and its runner look like. This is real, runnable Python against a stubbed skill function, so you can read the gate logic without needing an API key:

# Cases generated by the Skill Regression Suite Builder.
# Each case: input, expected route, and a risk weight.
CASES = [
    {"id": "route-password-reset",
     "input": "I forgot my password and can't log in",
     "expect_route": "auth", "weight": 5},
    {"id": "route-billing-charge",
     "input": "Why was I charged twice this month?",
     "expect_route": "billing", "weight": 5},
    {"id": "greeting-tone",
     "input": "hello there",
     "expect_route": "general", "weight": 1},
]

def run_skill(text):
    # Swap this stub for your real skill call.
    return "billing" if "charg" in text.lower() else "general"

def evaluate(cases, gate=0.90):
    total = sum(c["weight"] for c in cases)
    earned = 0
    for c in cases:
        ok = run_skill(c["input"]) == c["expect_route"]
        earned += c["weight"] if ok else 0
        print(f"{'PASS' if ok else 'FAIL'} {c['id']} (w={c['weight']})")
    score = earned / total
    print(f"weighted pass rate: {score:.1%}")
    raise SystemExit(0 if score >= gate else 1)

evaluate(CASES)

Run that and the stubbed skill fails the password-reset case, which carries weight 5, because the dummy router only recognizes the word "charg". The weighted score lands at 6 out of 11, about 54.5%, well under the 0.90 gate, so the process exits with code 1 and your CI run goes red. Swap run_skill for your actual skill call and you have a genuine regression gate. The builder writes both files for you, including the GitHub Actions step, so the only part you actually do by hand is describe the cases and assign the weights.

How it stacks up against the eval tools I tried

I didn't build this in a vacuum. I tried the obvious options first, and a couple of them are excellent. Here's the honest comparison, including the rows where the builder loses:

Tool	Risk-weighted gates	Runs client-side	Setup time	Best for
Skill Regression Suite Builder	Yes	Yes	~5 min	Skill/prompt regression in CI
Promptfoo	No (raw pass rate)	Local CLI, configs on disk	~30 min	Broad LLM eval matrices
Hosted eval platform (generic)	Partial	No, cloud upload	~1 hr + account	Teams wanting dashboards
Hand-rolled pytest	Whatever you build	Yes	Hours	Full control, no time budget

Promptfoo is genuinely good and far more flexible than what I built. If you need a sprawling matrix of models against providers, reach for it. It does treat every assertion as equal by default, though, so wiring up weighted gates meant writing my own scoring on top. The hosted platforms gave me pretty dashboards and a login screen I didn't ask for, plus the upload problem from earlier. Hand-rolled pytest is exactly what I had before the bug, and it works fine, it just costs you an afternoon every time you want to reshape the suite.

The builder sits in a narrow spot: you want weighted regression gates for your skills running in CI today, and you don't want your prompts sitting on someone else's server. If that describes you, generate your suite here and paste the output into your repo. It took me about five minutes the first time, and most of that was me arguing with myself over the weights. The weights are the only judgment call, and that's by design.

When you should skip this entirely

I'd rather tell you when not to use it, because the "when not to" section is the part AI-written tool reviews always quietly drop. Honesty about limits is the only reason to trust the rest.

Don't reach for a regression suite while your skill is still changing shape every day. Early on, the "correct" output is a moving target, and you'll burn more time rewriting expected values than improving the actual skill. Wait until behavior settles. I usually start a suite once a skill has gone a full week without a structural rewrite.

Skip it for purely generative, open-ended skills where no single right answer exists. If the output is a creative paragraph, exact matching is useless and rule-based matching is mostly wishful thinking. You want an LLM-as-judge setup there, which is a different tool and a longer argument about how much you trust the judge model.

And please don't read a green gate as proof the skill is correct. A green gate means the skill still does the things you remembered to test. My password-reset disaster taught me that the case I most need is usually the exact one I forgot to write down. A suite catches regressions against behavior you already knew about. It can't catch a whole category of input you never imagined existed.

One last thing. If you have three test cases total, skip all of this and use plain pytest. The weighting and the CI scaffolding start earning their keep somewhere around 15 to 20 cases, not 3.

FAQ

Q: Does this work with Claude skills and Codex skills?
A: Yes. The case format is model-agnostic. You point the runner's skill call at whatever backend you use, and the builder only cares about the case input and the expected behavior.

Q: How is the weighted pass rate actually calculated?
A: Each case has a weight. The score is the sum of the weights for passing cases divided by the sum of every weight. A gate of 0.90 means you need 90% of your weighted risk to pass, so breaking one heavy case hurts far more than breaking a light one.

Q: Can I run it without an API key?
A: The builder itself runs in your browser with no key and no signup. The generated runner needs whatever credentials your real skill call uses, since it has to actually invoke your skill to test it.

Q: Is it really deterministic?
A: As deterministic as your skill call is. The suite pins temperature to 0 and uses exact or rule-based matching, but if your model still wobbles at temperature 0, add a tolerance rule or a retry. I haven't needed to yet, though your mileage may vary.

Written with AI assistance and human review. Try the tool at aidevhub.io/skill-regression-suite-builder.

How token counters actually work in 2026, and when to trust them

AI Dev Hub — Wed, 29 Apr 2026 17:58:09 +0000

Most "free token counter" tools in your bookmarks are not running the model's tokenizer. They're running a character-ratio estimate and labeling the output "tokens". For OpenAI's GPT family the official tokenizer is open and easy to ship in a browser. For Claude, Gemini, and most others it isn't. Here's what that means for your context-window math.

Up-front disclosure on this one: the tool I link to below is one I built. I got tired of paste-counter-paste-counter loops where the same input produced different numbers, and tired of tools that claim to support every model but quietly use one tokenizer for all of them. Free, client-side, no signup. I'm linking to it because it's what I use, and because I'd rather show you how it works than pitch it.

If you've ever opened three "GPT token counter" tabs and gotten three different numbers, you're not crazy and the tools aren't all wrong. They're doing different things and labeling them the same way. Knowing which is which makes the difference between "this prompt fits" and "the API will reject it at the boundary".

What "tokenization" actually does

A tokenizer takes raw text and splits it into the integer IDs the model actually consumes. Every model family ships its own vocabulary, trained on its own corpus. Same input string yields different token counts because the vocabularies differ.

OpenAI's GPT-4 family uses an encoding called cl100k_base. The newer GPT-4o, GPT-5, o3 and o4 models use o200k_base, a larger vocabulary tuned for multilingual and code-heavy input. Anthropic's Claude family uses its own vocabulary that's published only as a server-side counting endpoint. Google's Gemini family is similar: server-side counting, no public local tokenizer at the time of writing (April 2026).

The rule of thumb people quote, "1 token is about 4 characters of English", is fine for napkin math and wrong by 10 to 20 percent on real input. German tokenizes worse than English because compound words don't fit the English-trained vocabulary. Code with many short identifiers tokenizes better than prose. Emoji are usually 2 to 4 tokens each. JSON with verbose keys tokenizes much worse than minified JSON. If you're sitting near the context window, the rule of thumb will lie to you.

Exact vs estimated, the real divide

Free token counters fall into two camps.

Exact counters ship the model's actual tokenizer in the browser and run it on your input. The numbers match what the API will charge, give or take a token or two. This is feasible only when the tokenizer is published as a runnable library. For OpenAI's GPT and o-series, that library is tiktoken (Python) and gpt-tokenizer (JavaScript). Both are MIT-licensed and small enough to ship client-side.

Estimating counters apply a character-ratio heuristic. They divide the character count by some constant (3.5 to 4.0 depending on the model family) and round up. The number is roughly right on plain English. It can be 10 to 20 percent off on code, JSON, German, mixed scripts, or anything with unusual whitespace. If a counter is fast on a 100,000-character paste regardless of which model you pick, it's almost certainly estimating.

The honest move is to label which is which. Most counters don't.

What the tool I built actually does

Since I'm linking to one of these, I owe you the spec.

aidevhub.io/token-counter uses gpt-tokenizer to compute exact counts for OpenAI's GPT-4, GPT-5, o3, and o4 model names. For every other family (Claude 3.x, Claude 4.x, Gemini, Llama, DeepSeek, Mistral, Grok) it uses a character-ratio estimate calibrated per family. Claude is chars / 3.5. The others are chars / 4.0. The output labels each row as either exact or estimate so you can tell which you're looking at.

This is honest about what's possible. I can't ship Anthropic's tokenizer client-side because it isn't published as a local library. I can't ship Google's either. The choice was either to fake-claim "supports every tokenizer" (the easy lie) or to label estimates as estimates (the harder honesty). Picked the second.

For most context-budget math at 30 to 70 percent of the window, the estimate is close enough. For boundary cases at 95+ percent of the window, you want the actual tokenizer. The next section is how to get certainty when you need it.

How to get certainty when the number matters

If the count matters (you're at the boundary, or you're billing customers per-token), don't trust any browser tool, including mine. Use the model's own counting endpoint or library.

For OpenAI:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
with open("prompt.txt") as f:
    print(len(enc.encode(f.read())))

That's the source of truth. gpt-tokenizer in the browser uses the same encodings (cl100k_base for GPT-4 era, o200k_base for GPT-4o and newer), so a browser-based exact counter and tiktoken should match within a token or two. If they don't, your tiktoken version is probably stale and the model has updated its vocabulary since you last upgraded.

For Claude, Anthropic publishes a server-side counting endpoint accessible via the SDK as client.messages.count_tokens() (or client.beta.messages.count_tokens() depending on SDK version). It costs nothing to call but it does need network and an API key. Returns the exact count the API will charge for that exact messages array including system prompt and tool definitions.

For Gemini, the SDK exposes model.count_tokens() which similarly calls Google's server.

The post-call usage field on every modern API is also authoritative. After your call, the response includes input_tokens and output_tokens as the actual billed counts. If your local count and the API's usage consistently disagree, your local tokenizer is the one that's wrong.

Where token counts and API math diverge

A counter on raw text isn't the full picture for an API call. Three things eat budget that a naive counter doesn't see:

System prompt and tool definitions count. Every modern API includes them in the input total. If you're counting only the user message, you're under-counting.
Message structure adds overhead. Each message in a chat-format request costs a few tokens for the role markers and separators, on top of the content. OpenAI documents this; Anthropic does too. It's small (3 to 6 tokens per message) but at scale it matters.
Output tokens are a separate budget. The 200,000 number you see in Claude's docs is the input window. Output is configured separately. Claude 4 family has a third configurable budget for thinking tokens. Always check the model's docs for the specific split.

A browser counter that gives you a single number against a single model is a useful sanity check, not a complete budget calculation.

The compact summary

Counter type	What it does	Accuracy	When to use
`tiktoken` (Python)	Runs OpenAI's official tokenizer locally	Exact for GPT and o-series	Boundary cases, prod budget math
`gpt-tokenizer` (JS)	Same vocabularies, browser-shippable	Exact for GPT and o-series	Browser tools, paste-and-count UIs
Anthropic `count_tokens`	Server-side API call	Exact for Claude, includes message overhead	When the count matters and you have a key
Gemini `count_tokens`	Server-side API call	Exact for Gemini, includes message overhead	When the count matters and you have a key
Character-ratio estimate	`chars / 3.5` or `chars / 4.0`	Within 10 to 20 percent on most input	Quick sanity check, no key needed

A few small habits that pay off

After watching too many "but my count said it'd fit" boundary failures, three habits I've stuck with:

Count against the actual target model, not "GPT-4 close enough". Different vocabularies give different numbers on identical input. If you're sending to Claude 4.6, count with Anthropic's tokenizer.
Minify JSON before sending. Pretty-printed JSON spends tokens on whitespace. The model doesn't care. Editor reads the indented version, model reads the minified one. Easy to script in your client.
Log token counts on every prod call and graph the average weekly. If your average prompt size starts creeping up because someone added a new few-shot example, you'll see it before it tips over the budget. Costs about 10 lines of code per service.

FAQ

Q: Are there official tokenizers I can run locally for every model?
A: Only OpenAI publishes one as a runnable library (tiktoken in Python, gpt-tokenizer in JS). Anthropic and Google publish counting as server APIs only. If a third-party tool claims to do exact tokenization for Claude or Gemini in your browser, it's almost certainly estimating, no matter what the marketing says.

Q: Why does the count change when I add a system prompt?
A: Because system prompt is part of the input. Same for tool definitions if you're using tool-use APIs. The input window includes the entire request payload, not just the user turn. This trips people who count only their user message.

Q: How accurate is the post-call usage field?
A: It's the source of truth. That's what was billed. Counters before the call are estimates of what usage will say. They should match within 1 to 2 tokens if your local tokenizer matches the model's current version. Consistent drift means your local library is stale.

Q: Does whitespace really matter that much?
A: Yes, on text-heavy input. Repeated newlines and indentation are often single tokens each, but they add up. A pretty-printed 5,000-line JSON file can use noticeably more tokens than the same JSON minified, with no information loss. If you're trimming for budget, that's the first place to look.

Q: What about thinking tokens on Claude 4 and reasoning tokens on o-series?
A: Separate budget on both. Claude 4 family has a configurable thinking token budget independent of input and output. OpenAI's o-series has reasoning tokens that count against output. Check the specific model's docs because the rules vary by version.

Written with AI assistance and human review.

5 cron expression gotchas that catch experienced devs in 2026

AI Dev Hub — Sat, 25 Apr 2026 13:34:21 +0000

Cron is one of those tools where the syntax looks obvious until a job fires at the wrong time and you start digging. Five behaviors below are documented in the man page and still catch people who've been writing cron for years. Each one is in a footnote most tutorials skip.

Quick disclosure on this one: the cron builder I link to below is something I built. After enough years of writing 5-field expressions by hand, I wanted a tool that showed me the next 5 fire times in my actual local timezone before I committed. Free, client-side, no signup. Linking to it because it's the workflow I use now.

I think most devs learn cron the same way. You copy something off Stack Overflow that looks close to what you want, you tweak a number, you commit it, and then a few days later something fires at the wrong time and you start reading the man page properly. The 5 behaviors below are the ones I see trip people up over and over. None are exotic. All are documented. All pass code review.

Gotcha 1: `*/5` is anchored to the field origin, not to "now"

*/5 * * * * does not mean "every 5 minutes from whenever the job loaded". It means "every minute whose value is divisible by 5". So it fires at :00, :05, :10, :15, etc. If you load the job at :07 and expect the next fire 5 minutes later, you'll see the next fire at :10, not :12.

The same rule applies to every field. 0 */6 * * * fires at 00:00, 06:00, 12:00, 18:00, anchored to midnight. Not to whenever you started the scheduler.

This is the right behavior for most use cases (predictable, aligned across machines) but it's not what people often expect on the first read. The lesson: */N is anchored to the field's natural origin, never to the load time.

Gotcha 2: day-of-month and day-of-week are OR, not AND

This one is in the POSIX spec and almost nobody reads it. The expression 0 9 1 * 1 does NOT mean "the 1st of the month, but only if it's a Monday". It means "at 9am on the 1st of every month, OR on every Monday". So it fires roughly 5 times more often than the AND interpretation would suggest.

There's no way to express AND between those two fields in standard POSIX cron. Two common workarounds:

import datetime as dt

# Cron fires every Monday. Script filters down to "first Monday of the month".
now = dt.datetime.now()
if now.weekday() == 0 and now.day <= 7:
    run_billing_job()
else:
    log.info("skipping; not first Monday of the month")

Cron expression becomes 0 9 * * 1 (every Monday at 9am) and the script handles the "first" qualifier. Two pieces of logic, each obvious on its own.

The other workaround is to switch to a scheduler that supports AND between those fields. Quartz syntax (used by AWS EventBridge and many JVM schedulers) treats them as AND when both are non-*. Different platform, different rule. Worth knowing which one you're on.

Gotcha 3: launchd reads local time, not UTC

This is a Mac-specific gotcha and it's caused enough confusion that I now put a comment at the top of every plist. macOS launchd interprets StartCalendarInterval in the system's local timezone. If your plist has Hour=14, the job fires at 14:00 wherever the Mac thinks it is. There is no built-in "interpret as UTC" flag.

If you're migrating a cron job from a Linux server (where cron typically runs in UTC unless configured otherwise) to launchd on a Mac in another timezone, the job will fire at a different absolute time. The expression looks identical. The behavior isn't.

Two ways to fix it on launchd:

Set the system clock to UTC. Works if you control the machine and don't mind the rest of the OS displaying UTC times.
Compute the UTC-equivalent local hour and update it twice a year for daylight saving. Less elegant but doesn't change anything else on the system.

I pick option 2 with a comment in the plist that says "fires at 13:00 UTC; adjust for DST in March and October". Ugly, but explicit, which is what you want when you read the file 6 months later.

Gotcha 4: cron does not catch up missed firings

If your laptop is asleep at the scheduled time, the job does NOT fire on wake. Cron has no built-in catch-up. If your job is "delete files older than 30 days" and the machine is asleep through 3 firings, it just runs once when the next scheduled time arrives. The 3 missed firings are gone.

This is a portable laptop problem more than a server problem. A server that's always on rarely misses. A Mac that sleeps overnight can easily miss its 3am job most nights and never log an error, because there's no error to log. The job didn't fail. It just wasn't fired.

The fix on launchd is StartInterval (interval-based, fires on wake) instead of StartCalendarInterval (clock-time, no catch-up). Or you use a tool with persistent scheduling that does catch up: anacron is the classic Linux answer, cronie with crond -P works similarly, and various job runners (systemd timers with Persistent=true, etc.) handle this natively.

I default to interval-based scheduling for anything maintenance-shaped (backups, cleanup, log rotation) where the exact time matters less than "did it run today". Calendar-based scheduling for anything time-sensitive (a daily 9am email) where running at 11am after the laptop wakes would be wrong.

Gotcha 5: a cron expression has no timezone embedded in it

This is the one that bites distributed teams. The expression 0 9 * * * says "at 9:00 in whatever timezone the scheduler runs in". It doesn't say UTC. It doesn't say Berlin. It says "whatever the scheduler thinks 9:00 is".

If you write the expression in Berlin, deploy the code to a server in US-East, and that server's cron runs in UTC, your job fires at 9:00 UTC, which is 10:00 or 11:00 Berlin time depending on the season. The expression looks fine in code review. The behavior is wrong.

A few things help:

For Linux cron, CRON_TZ=Europe/Berlin at the top of the crontab file pins all subsequent entries to that zone. Documented in man 5 crontab. Easy to miss.
For Quartz-based schedulers, the timezone is usually a separate config field (timeZone in Spring's @Scheduled, for example).
For launchd, you compute it yourself or set the system clock.

I add a comment to every cron entry now that says what timezone I expect it to fire in. Adds 3 seconds to writing the entry and saves the timezone-archaeology session that always comes a month later.

How I'd write each of these now

For reference, here's how each gotcha translates to a defensible expression.

Goal	Naive attempt	What it actually does	Defensible version
Every 5 minutes from now	`/5 * * *`	Fires at :00, :05, :10...	Same expression, accept the alignment
First Monday of month at 9am	`0 9 1 * 1`	1st of month OR every Monday at 9am	`0 9 * * 1` plus script-side date check
14:00 UTC daily on launchd	`Hour=14` in plist	14:00 in local timezone, not UTC	Compute local hour, comment with intended zone
Daily backup at 3am	`0 3 * * *` cron OR `Hour=3` plist	Skips firings when machine is asleep	`StartInterval=86400` or use a catch-up scheduler
Anything moderately complex	Hand-typed	Often wrong on the first try	Build visually, paste, comment what it fires on

When raw cron is still fine

I'm not saying never write cron by hand. For "every minute" (* * * * *) or "every hour at the top" (0 * * * *) it's faster to just type it. The break point for me is anything with more than one non-* field. Two fields with values is where my error rate spikes and the cost of building visually is zero.

Worth knowing: most cron implementations support extensions that aren't in POSIX. @daily, @weekly, @reboot, @hourly all exist in Vixie cron and read better than the equivalent expressions. If your environment supports them, prefer them. They're more readable to whoever opens the file in 2027.

The free cron builder I made and use regularly now is at aidevhub.io/cron-builder. Pick days, hours, minutes from dropdowns, get the expression, see the next 5 fire times in your local timezone. The next-fire preview is the part I find most useful, because it catches the "this expression doesn't actually fire when I think it does" cases before they ship.

FAQ

Q: Why is the day-of-week / day-of-month thing an OR?
A: It's a POSIX thing, dating back to the original Unix cron. The spec says if either field is restricted (not *), they're OR-ed together. There's a footnote in man 5 crontab if you want to read it. Most cron tutorials skip this part because it's a footgun.

Q: Does this work for AWS EventBridge cron expressions?
A: EventBridge uses a 6-field cron syntax with year, and the day-of-week / day-of-month rule is AND there, not OR. So if you're going EventBridge, that specific gotcha goes away. The other 4 still apply. EventBridge also requires you to use ? in one of the two day fields, which is its own kind of footgun.

Q: Is there a cron syntax that's better than the 5-field one?
A: Quartz scheduler's syntax is more expressive (seconds, year, AND between day fields). Most Linux distros ship systemd.timer which is way more readable but is its own thing. Pick whatever your platform supports best. I find systemd timers the cleanest for new Linux work and stick with launchd for Mac because the alternatives aren't worth the friction.

Q: How do I test a cron expression without waiting?
A: Easiest path is a builder that shows the next 5 fire times so you can eyeball whether the schedule matches your intent. Beyond that, croniter for Python and cron-parser for Node both let you iterate the next N firings programmatically. I write a one-line script when I'm not sure: python3 -c "from croniter import croniter; from datetime import datetime; c=croniter('0 9 * * 1'); [print(c.get_next(datetime)) for _ in range(5)]". If the printed times look right, the expression is right.

Q: What about Quartz cron expressions?
A: Different beast. 6 or 7 fields (seconds optional, year optional), ? placeholder for day fields, L for last, # for nth-day-of-month. More expressive, less portable. If you're on a Quartz-based stack you're already in a different syntax and most of the POSIX gotchas above don't apply.

Written with AI assistance and human review.

DEV Community: AI Dev Hub

Catch skill regressions before they ship in 2026

Catch skill regressions before they ship in 2026

The bug I shipped because I trusted a one-line prompt edit

How the builder turns loose test ideas into a CI gate

How it stacks up against the eval tools I tried

When you should skip this entirely

FAQ

How token counters actually work in 2026, and when to trust them

What "tokenization" actually does

Exact vs estimated, the real divide

What the tool I built actually does

How to get certainty when the number matters

Where token counts and API math diverge

The compact summary

A few small habits that pay off

FAQ

5 cron expression gotchas that catch experienced devs in 2026

Gotcha 1: */5 is anchored to the field origin, not to "now"

Gotcha 2: day-of-month and day-of-week are OR, not AND

Gotcha 3: launchd reads local time, not UTC

Gotcha 4: cron does not catch up missed firings

Gotcha 5: a cron expression has no timezone embedded in it

How I'd write each of these now

When raw cron is still fine

FAQ

Gotcha 1: `*/5` is anchored to the field origin, not to "now"