Tax AI Gets Better When Edits Become Evals

Already using GPT or Claude and trying to stitch AI tools together so you can stop wasting time? Then the wrong first move is to treat Codex as just one more tool of the same kind and assume the highest score means the best fit. The real cost shows up in the small motions: you search in the browser, jump back to chat to restate context, then switch again to the editor to change a few lines. AI tools are starting to compete for the scraps of time you lose switching windows, not just for the code itself.

That is why the most valuable part of self-improving tax AI is not a flashy accuracy headline. Self-improvement is not prompt tuning. It is turning edits into evals. Many people think they need a stronger model. What they really need is fewer windows. Less copy-paste. A system that remembers the fix the next time the same mistake shows up.

The clearest proof point here is "Building self-improving tax agents with Codex." In live tax prep work across 30+ accounting firms, the system handled 7,000 tax returns, and the share of returns with at least 75% of important boxes filled correctly rose from about 25% at launch to 86% after six weeks [S001]. This was live work, not a toy demo [S001]. The important part is not just the chart. Repeated practitioner corrections were grouped into eval targets, then handed back so Codex could iterate against the same class of errors again [S001]. That is the loop: correction, check, rerun.
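To make the correction, check, rerun loop concrete, here is a minimal Python sketch. It is not the system described in [S001]; every name in it (Correction, group_into_eval_targets, run_evals, the box labels, the stub agent) is a hypothetical stand-in, and grouping by box with a repeat threshold is just one assumed way to decide which corrections become eval targets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    # Hypothetical record of one practitioner edit (field names are assumptions).
    return_id: str
    box: str        # which form box was edited
    predicted: str  # what the agent originally filled in
    corrected: str  # what the practitioner changed it to


def group_into_eval_targets(corrections, min_count=2):
    """Group corrections by box; keep boxes corrected at least min_count times."""
    by_box = defaultdict(list)
    for c in corrections:
        by_box[c.box].append(c)
    return {box: cases for box, cases in by_box.items() if len(cases) >= min_count}


def run_evals(targets, agent):
    """Rerun the agent on every stored case; return the pass rate per box."""
    results = {}
    for box, cases in targets.items():
        passed = sum(1 for c in cases if agent(c.return_id, c.box) == c.corrected)
        results[box] = passed / len(cases)
    return results


# Illustrative data only; these values are made up for the example.
corrections = [
    Correction("r1", "wages", "50000", "52000"),
    Correction("r2", "wages", "0", "41000"),
    Correction("r3", "interest", "10", "12"),
]
targets = group_into_eval_targets(corrections)


def stub_agent(return_id, box):
    # Stand-in for a rerun of the real agent: it now gets r1 right, r2 still wrong.
    return {"r1": "52000", "r2": "0"}[return_id]


scores = run_evals(targets, stub_agent)
```

In this toy run only the "wages" box repeats often enough to become an eval target, and the stub agent passes one of its two cases. The point of the design is that the same stored cases can be rerun after every change, so a fix is checked against the mistake that prompted it.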

There is a boundary here. If you treat every human edit as ground truth, you can train the workflow on noise instead of signal. But the decision rule still holds: if you are evaluating AI workflows, stop asking only which model sounds smartest. Ask which one lets a correction become a reusable check. Share this with the person who is still spending more time moving context than doing the work.
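One way to respect that boundary is to promote an edit to a reusable check only when more than one reviewer independently agrees on it. The Python sketch below is an illustration under that assumption, not something [S001] describes; the function name trusted_corrections, the tuple layout, and the two-reviewer threshold are all hypothetical.

```python
from collections import defaultdict


def trusted_corrections(edits, min_reviewers=2):
    """Keep a correction only if all reviewers of a case agree on the value
    and at least min_reviewers distinct reviewers made it.

    edits: iterable of (reviewer, case_key, corrected_value) tuples,
    where case_key is a hypothetical (return_id, box) pair.
    """
    reviewers_by_value = defaultdict(lambda: defaultdict(set))
    for reviewer, case_key, value in edits:
        reviewers_by_value[case_key][value].add(reviewer)

    trusted = {}
    for case_key, values in reviewers_by_value.items():
        if len(values) != 1:
            continue  # reviewers disagree, so treat the edit as noise for now
        (value, reviewers), = values.items()
        if len(reviewers) >= min_reviewers:
            trusted[case_key] = value
    return trusted


# Illustrative data only; reviewers, returns, and values are made up.
edits = [
    ("alice", ("r1", "wages"), "52000"),
    ("bob", ("r1", "wages"), "52000"),       # agreement: promoted
    ("alice", ("r2", "wages"), "41000"),     # single reviewer: held back
    ("alice", ("r3", "interest"), "12"),
    ("bob", ("r3", "interest"), "15"),       # disagreement: held back
]
trusted = trusted_corrections(edits)
```

Here only the first case becomes a check. The threshold is a dial: raise it and you keep less noise but also fewer evals, so the right setting depends on how often practitioners actually disagree.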