Andrew Korytko

Posted on May 24

Should we implicitly trust AI with our optimization problems? Like taxes?

#ai #llm #startup #career

I gave 5 frontier AI models the same ISO tax problem. Every answer was off by 2× to 20×. And the catch: you're not warned.

I built the first version of this calculator in a weekend to help people with their ISOs. Several months in, with a more comprehensive optimizer for equity comp decisions, one question kept coming up from the people I was showing it to: "why bother, can't I just ask Claude or ChatGPT to do this for me?" So I ran the test.

I gave five frontier AI models the same tax problem: how to exercise 20,000 incentive stock options (ISOs) over four years to maximize after-tax value. Same prompt, three independent runs each. The volatility formula was provided in the prompt verbatim so there was nothing to misinterpret.

Every one of the 15 responses overstated the net final value (NFV) of its own recommended schedule. The smallest overstatement was 2×. The largest was 19.7×.

GPT-5.5's best run, the cheapest model in the test at $0.30 per call without reasoning enabled, was the closest to the deterministic optimum. The most expensive variant (GPT-5.5 Pro, reasoning on, $3 per call) consumed our entire 16K output token budget on its thinking step and returned empty completion text. A higher budget might have produced output; I switched to the non-reasoning variant for the published results to keep the models relatively equal on costs.

This is what happened.

What this means for you

When I asked LLMs to model an ISO exercise, the dollar figure was off by a factor of 2 to 20. That was true across Claude, Gemini, GPT-5.5, Grok, and Mistral. It was true with reasoning enabled in some models, and when I provided the input formulas explicitly.

The qualitative analysis is often useful. Every model in the test correctly

identified AMT as the binding constraint,
that California state AMT applies separately,
that qualifying disposition depends on holding periods,
that the choice of mean (arithmetic vs geometric) flips the strategic answer.

Several flagged concentration risk and the unrecovered AMT credit problem. Those considerations are valuable for orienting yourself on a problem you do not yet understand.

The specific numeric estimates are not reliable. A reader copying the response into a spreadsheet has imported a 2× to 20× error. And AI won't warn you it made a wild guess that's very likely wrong.

None of the 15 responses I got labeled its final NFV as a shoot-from-the-hip estimate that should be sanity-checked before acting on it. Several added caveats about

tax-law uncertainty,
future stock-price volatility,
my personal risk tolerance,
even concentration risk.

None said: "I am not confident that my math is correct or that this number is in the right order of magnitude." If you are using a chatbot to inform any decision where even the magnitude of the number matters, you have to assume that warning yourself.

The scenario I tested here is deliberately simple. One grant, one stock, one filing status, no equity in other companies or other equity in this company, no 401(k) or Roth optimization, no anticipated income changes, no charitable planning, no cross-state moves. Your real situation almost certainly has more moving parts. If frontier reasoning models could not reliably do the math on the easy version, the harder version is unlikely to be where they shine.

The reproducible alternative for this class of problem is software that does the math deterministically. The basic optimizer used as the reference here, optionsahoy.com/tools/amt-iso, finds the global optimum by grid search across 4-year share allocations, scoring each schedule through the same closed-form tax calculation the models are trying to reproduce.

My main point isn't AMT optimization. There are thousands, perhaps millions of similar financial optimization challenges people ask LLMs to solve every day. Which they do. Incorrectly, if we judge by our basic example.

Now, the data.

The scenario

ISOs trigger the alternative minimum tax (AMT) when exercised. The bargain element (fair market value minus strike, times shares) gets added to a parallel tax calculation. The AMT is more like a prepayment than a normal tax: much of it comes back as a federal credit in later years when regular tax exceeds tentative minimum tax.

When you exercise matters. Small annual tranches qualify for long-term capital gains. Large same-year exercise-and-sell tranches become disqualifying dispositions taxed as ordinary income. The schedule moves the outcome.

The scenario I tested:

20,000 ISOs at $2 strike, current fair market value $200
Married filing jointly, $300,000 ordinary income, California resident
No prior AMT credit carryforward
5.5% return on idle cash; tax dollars paid early carry that opportunity cost
4-year horizon, all shares sold by end of year 4
17% arithmetic-mean annual return, 72% annualized volatility (modeling many recently public tech companies)
ISOs granted January 2022 (qualifying-disposition periods satisfied by end of horizon for early exercises)

For the price projection the prompt provided the standard Itô formula directly: μ_geometric = μ_arithmetic − σ²/2. At σ = 0.72, μ = 0.17, that is μ_geometric = −0.0892/year. The year-4 median price compounds to about $137. The formula was in the prompt I gave to the models. No interpretation gaps.

The optimization itself is well-behaved. Under negative geometric drift, deferring exercise dominates. The value function is unimodal, the global optimum sits near a lump-Y4 schedule with small Y1–Y3 tranches that qualify for long-term capital gains by year 4. No two local maxima, discontinuity, or narrow feasible region. Finding this optimum is schoolbook, frequently done using spreadsheets.

The deterministic answer (brute-force grid search):

Schedule: 306 / 338 / 740 / 18,616 shares per year
Net final value at end of year 4: $726,409

For reference, two naive baselines:

Lump-sum all 20,000 in year 1: $123,205
Even split (5,000 per year): $387,473

That is a 6× spread between worst and best, which is the room each model has to work with.

The method

For each model, three independent API calls. Fresh context each time. Temperature 1.0. The same prompt verbatim every time.

For each response I extracted two numbers: the schedule the model recommended (shares per year) and the net final value the model claimed for that schedule. Then I fed the model's recommended schedule into the deterministic calculator to compute the actual NFV (call it the "true" NFV). The gap between stated and true is the model's arithmetic deviation on its own recommendation.

Models tested:

Anthropic Claude Opus 4.7 (reasoning enabled), via Claude Code sub-agent
OpenAI GPT-5.5 (no reasoning), via OpenRouter
Google Gemini 2.5 Pro (reasoning enabled), via OpenRouter
xAI Grok 4.20 multi-agent (reasoning enabled), via OpenRouter
Mistral Large 2512, via OpenRouter

Total API spend: $8.68 across 15 calls. Full provider IDs and parameters in the Methodology section at the end.

Two methodological notes worth surfacing up front. The reasoning-enabled GPT variant (gpt-5.5-pro) was tested first and returned empty completion across multiple attempts: the reasoning trace consumed the full output token budget at $2.96 per call with no answer rendered. I switched to the non-reasoning variant for the published results. Claude was tested via Claude Code's sub-agent path rather than OpenRouter, because the OpenRouter route produced the same empty-completion failure for Claude under the same token-budget configuration.

Results

Three observations.

Every one of the 15 runs overstated its own NFV. Not one model claimed less than what its recommended schedule delivers. The direction is consistent across models and across runs.

Schedule quality is independent of reasoning capability. Claude (reasoning) consistently picked even-split, which leaves about 47% of the optimal value on the table. GPT-5.5 with reasoning disabled produced the schedule closest to the deterministic optimum on its first run, recognizably the same small-smooth-plus-Y4-lump shape the optimizer produces. The model with reasoning enabled was not the one that arrived at the better answer.

Two models showed high run-to-run variance. Gemini recommended even-split on run 1 and lump-Y1 on runs 2 and 3. Mistral recommended lump-Y4 once, then two different smoothed shapes. Same prompt, same temperature, three independent calls, three different strategic recommendations. Claude, Grok, and GPT-5.5 were more consistent within model.

Per-model walkthrough

Claude Opus 4.7

Picked an even-split schedule across all three runs. The reasoning trace got the vol drag math right (μ_geometric = −0.0892, Y4 ≈ $140) and identified the AMT crossover correctly (roughly 1,000 to 1,400 shares per year before AMT bites materially). It then recommended 5,000 shares per year anyway, a number it had explicitly computed as four to five times above its own crossover. Stated NFV ranged $1.56M to $1.79M against an actual outcome of $372K to $387K. In a caveat to run 1, Claude noted that lump-Y1 would likely produce "$2.0–2.1M NFV." The actual lump-Y1 outcome in this scenario is $123,205.

Gemini 2.5 Pro

Produced the highest single-run stated/true ratio of the test: 19.70× on run 3, claiming $2.43M on a lump-Y1 schedule whose true outcome is $123K. On run 1 it miscomputed the Y4 price by using a 3.5-year exponent in the compounding step despite the prompt specifying T = 4. Three runs, three different strategic conclusions.

Grok 4.20 multi-agent

Consistently picked a lump strategy, Y4 on run 1 and Y3 on runs 2 and 3. The reasoning was direct: under negative drift, defer exercise. That is qualitatively close to the optimal shape but misses the small Y1–Y3 smoothing that qualifies tranches for long term capital gain (LTCG) treatment. Stated/true ratios were the lowest of the five models in absolute terms (2.04× to 2.77×).

Mistral Large 2512

Produced the largest single-run stated NFV in the test: $10,977,600 on run 1, against a true outcome of $672,144. Run 3 stated $10,010,000 on a different schedule with a true outcome of $563,934. The model's reasoning applies the bargain element calculation correctly ($3.96M total) and the LTCG tax at sale, but the step from there to final NFV does not reconcile against the schedule it recommended.

GPT-5.5 (no reasoning)

Produced the best schedule across all 15 runs on its first call. That schedule yields a true NFV of $694,549, within 5% of the deterministic optimum ($726,409). The stated NFV was $1,430,600 (2.06× the true value). Runs 2 and 3 picked different shapes (lump-Y3, Y3/Y4 split) with worse true NFVs but similar overstatement ratios (2.61× to 2.79×). The cheapest model in the test, without reasoning enabled, produced the most usable single recommendation.

Two independent failure modes

Two distinct categories of behavior appear across the 15 runs.

The first is schedule selection. Some models find a near-optimal shape (GPT-5.5 run 1, Grok lump-Y4). Others pick a worse shape (Claude even-split, Gemini lump-Y1) and stay there across runs. The qualitative reasoning ("negative drift means defer exercise") is correctly verbalized by every model that produced a coherent answer, but only some models translate that into a recommendation that matches the insight. This step is not arithmetic. It is whether the model commits to a recommendation consistent with its own stated conclusion.

The second is arithmetic on the recommended schedule. Across every model and every run, the stated NFV exceeds the true NFV by at least 2×. The 15-run median is about 4×; the long tail reaches 19.7×.

This deviation is not about whether the model understands AMT, vol drag, or qualifying dispositions. By inspection, most of them describe these concepts correctly. The deviation is at the last step, where the model converts its plan into a dollar figure. That step requires multi-year compounding, federal AMT, state AMT, LTCG basis adjustment, credit recovery, and time-value adjustment. It does not reliably resolve to the right number, even when the model has identified each component along the way. LLMs are spectacular. Their mistakes in estimating our simple tax challenge were too.

These two failures are independent. GPT-5.5 picks the best schedule and overstates by 2 to 3×. Mistral picks a reasonable schedule (lump-Y4) and overstates by 16×. Claude picks the worst schedule and overstates by 4 to 5×.

There is a deeper point. Some problems are best solved by enumeration, not reasoning. Tax optimization with discrete schedule choices is one of them. A brute-force grid search at one-share granularity over a 4-year horizon evaluates a few thousand candidate schedules and returns the exact answer in milliseconds. An LLM with hundreds of billions of parameters trained on every written tax document can be vastly smarter than that search and still not better at it. They are different tools. The biggest catch: LLMs never flagged this issue in their supremely confident prose responses.

Should we implicitly trust AI with our optimization problems? Like taxes?

Methodology and reproducibility

All 15 model calls were made in May 2026. Model versions captured per-run in raw response files.
API surface: OpenRouter chat completions with temperature: 1.0, max_tokens: 16384, reasoning: { max_tokens: 8000 } where applicable. No system prompt. Each call is isolated.
Claude was tested via Claude Code sub-agent invocation (Anthropic API surface, no system prompt). The OpenRouter route returned empty completion for Claude under the same configuration. The sub-agent path produces the same model with the same constraints minus the OpenRouter wrapper.
The failure modes we observed (math drift, schedule selection) are properties of the model, not the wrapper. We expect similar behavior across surfaces without enabling the use of optimization tools.
Full provider IDs: anthropic/claude-opus-4.7, openai/gpt-5.5, google/gemini-2.5-pro, x-ai/grok-4.20-multi-agent, mistralai/mistral-large-2512. The reasoning variant tested-then-discarded was openai/gpt-5.5-pro (empty completion at $2.96 per call).
AMT brackets used: 2026 figures per IRS Rev. Proc. 2025-32 (exemption $140,200 MFJ, phaseout start $1,000,000 MFJ, phaseout rate 50% post-OBBBA, 26% / 28% rates above the $244,500 breakpoint).
The verbatim prompt, the locked scenario inputs, the 15 raw model responses, the results CSV, and the scoring methodology are open at github.com/AlvisoOculus/llm-iso-benchmark. The deterministic optimizer source itself is part of the OptionsAhoy product and is not open-sourced.

Andrew Korytko is the founder of OptionsAhoy. Beta is currently free and invite-only.

Top comments (2)

Harjot Singh • Jun 1

it's wild to see how much variance there is in AI outputs, especially with something as critical as taxes. your experiment really highlights the need for human oversight in these scenarios. on a different note, if you're ever looking to build tools like your calculator, moonshift can get you a full next.js + postgres + auth app deployed in about 7 minutes. happy to offer you a free run if you're interested.

Andrew Korytko • Jun 1

My solution was:

glama.ai/mcp/servers/AlvisoOculus/... is a Model Context Protocol server that exposes deterministic equity-comp calculators, the code in one of which was used in the previous post to expose the current LLM 2-20x weakness. You can install it in Claude (claude.ai/customize/connectors), ChatGPT, Cursor, or any MCP-capable client yourself. Claude nails the correct answer after that.