Bottom line first

If you already use GPT or Claude in chat and you are now trying to stitch a few AI tools together, the easiest mistake is treating Codex like just another model to rank, as if the higher score tells you where it fits. That is how you end up doing the coordination yourself. AI tools are starting to compete not just for code work, but for the fragments of attention you lose while switching between tools.

The cost shows up in a familiar loop. You search in the browser, jump back to a chat window to restate context, then open the editor to fix a few lines yourself. If you treat every AI tool as interchangeable, you keep manually carrying context at exactly the point where the workflow was supposed to save time. The hidden cost is worse: you keep using Codex in the wrong role, so the system gets busier without getting better.

The contrarian take from "Building self-improving tax agents with Codex" is simple: self-improvement is not a smarter prompt. It is turning recurring corrections into evals, meaning repeatable tests the agent has to pass again.[S001]
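To make "repeatable tests the agent has to pass again" concrete, here is a minimal Python sketch. It is my own illustration, not the setup from the article: `EvalCase`, `run_evals`, `toy_agent`, and the field names are all hypothetical, and the numbers are toy values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """One recurring correction frozen into a repeatable check (hypothetical shape)."""
    name: str
    inputs: Dict[str, int]
    field: str      # the output field a reviewer kept correcting
    expected: int   # the value the reviewer corrected it to


def run_evals(agent: Callable[[Dict[str, int]], Dict[str, int]],
              cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the agent and report pass/fail by case name."""
    results: Dict[str, bool] = {}
    for case in cases:
        output = agent(case.inputs)
        results[case.name] = output.get(case.field) == case.expected
    return results


def toy_agent(inputs: Dict[str, int]) -> Dict[str, int]:
    """Stand-in for a real agent; it just sums two toy fields."""
    return {"total_income": inputs["wages"] + inputs["interest"]}


# Assumed examples of corrections that showed up often enough to keep.
CASES = [
    EvalCase("adds_interest_to_income",
             {"wages": 50000, "interest": 120}, "total_income", 50120),
    EvalCase("zero_interest_still_totals",
             {"wages": 40000, "interest": 0}, "total_income", 40000),
]

if __name__ == "__main__":
    for name, passed in run_evals(toy_agent, CASES).items():
        print(f"{name}: {'pass' if passed else 'FAIL'}")
```

The point of the shape is that a correction stops being a one-off comment and becomes a case that reruns on every change, so a later prompt or code edit cannot quietly undo it.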

Why this one is worth a look

That is why the tax case matters. The reported rollout covered 7,000 returns, and the share reaching 75% correct-field completion moved from about 25% at launch to 86% after 6 weeks.[S001] The article says repeated practitioner corrections were grouped into eval targets and then handed to Codex to iterate against.[S001] The cookbook describes the same flywheel more generally: real traces, human or model feedback, evals, then a Codex-ready handoff.[S005]

Many people think they need a stronger model. I think they often need fewer window switches. The useful question is not "which model sounds smarter in a chat?" It is "which recurring fixes in my workflow are stable enough to become tests?"

Key evidence

Boundary: this is a read of the reported 7,000-return, 6-week case study, not a side-by-side test on my own stack. And not every correction should become a test. If you promote workflow noise into evals, you train the system on the wrong thing.
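One way to keep workflow noise out of your evals is to promote a correction only when it recurs across several distinct returns. The Python sketch below assumes that rule; the `Correction` record, the `promote_to_eval_targets` helper, and the `min_returns` threshold are hypothetical choices of mine, not something the case study specifies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Correction:
    """A single reviewer fix (hypothetical record shape)."""
    return_id: str
    field: str
    reason: str  # short normalized label for why the value was changed


def promote_to_eval_targets(corrections: Iterable[Correction],
                            min_returns: int = 3) -> List[Tuple[str, str]]:
    """Keep only (field, reason) pairs seen on at least min_returns distinct returns."""
    seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for c in corrections:
        # A set means repeated edits on one return count once.
        seen[(c.field, c.reason)].add(c.return_id)
    return sorted(key for key, ids in seen.items() if len(ids) >= min_returns)


if __name__ == "__main__":
    log = [
        Correction("r1", "filing_status", "wrong_default"),
        Correction("r2", "filing_status", "wrong_default"),
        Correction("r3", "filing_status", "wrong_default"),
        # Same return edited three times: likely noise, not a pattern.
        Correction("r4", "notes", "reviewer_preference"),
        Correction("r4", "notes", "reviewer_preference"),
        Correction("r4", "notes", "reviewer_preference"),
    ]
    print(promote_to_eval_targets(log))
```

Counting distinct returns rather than raw edits is a deliberate choice here: one reviewer fiddling with one file should not be enough to create a test. The right threshold depends on your volume, so treat `3` as a placeholder.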

If that framing helps, share this with the person on your team who is still trying to solve the problem with prompt edits alone.

#AIEngineering #AIAgents #DeveloperTools #TaxTech

Who this is for / what to do next

Final action: share