If you already use GPT or Claude and you're trying to stitch AI tools together to save time, the easy mistake is treating Codex like just another chatbot and assuming the highest benchmark score wins. That is how you keep doing the most annoying part by hand: browser for research, chat box for context, editor for the actual fix, then the same explanation all over again.
The interesting claim here is not "the model got smarter." It is that self-improvement comes from turning repeated human fixes into evals, meaning repeatable tests [C002]. AI tools are not just coming for code work. They are coming for the scraps of time you lose switching windows. Many people think they need a smarter model. What they often need is fewer tab switches.
That is why the headline number matters less than the loop behind it. In one tax filing workflow with human reviewers, correct-field completion moved from about 25% to 86% in 6 weeks across 7,000 returns. The useful asset is not just the 86%. It is the growing stack of checks created from real corrections, so the same mistake becomes harder to repeat.
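To make "turning fixes into evals" concrete, here is a minimal sketch of the idea in Python. It is not the workflow from the OpenAI write-up: the names `EvalCase`, `EvalSuite`, `add_from_correction`, the `wages` field, and both toy agents are hypothetical, chosen only to show how a reviewer's correction can become a check that every later version of the agent has to pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from the source article.
Fields = Dict[str, str]


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    inputs: Fields      # what the agent saw
    field_name: str     # which output field the reviewer fixed
    expected: str       # the reviewer's corrected value


@dataclass
class EvalSuite:
    cases: List[EvalCase] = field(default_factory=list)

    def add_from_correction(self, case_id: str, inputs: Fields, field_name: str,
                            agent_value: str, reviewer_value: str) -> bool:
        """Store a reviewer fix as a permanent check. Returns True if a case was added."""
        if agent_value == reviewer_value:
            return False  # nothing was corrected, so there is nothing to learn from
        self.cases.append(EvalCase(case_id, dict(inputs), field_name, reviewer_value))
        return True

    def run(self, agent: Callable[[Fields], Fields]) -> float:
        """Return the fraction of stored cases the agent now gets right."""
        if not self.cases:
            return 1.0
        passed = sum(
            1 for c in self.cases
            if agent(c.inputs).get(c.field_name) == c.expected
        )
        return passed / len(self.cases)


def agent_v1(inputs: Fields) -> Fields:
    # Toy agent that copies the raw text, currency symbol and all.
    return {"wages": inputs["wages_raw"]}


def agent_v2(inputs: Fields) -> Fields:
    # Toy agent after the fix: normalizes the amount before filling the field.
    return {"wages": inputs["wages_raw"].replace("$", "").replace(",", "")}


suite = EvalSuite()
doc = {"wages_raw": "$52,000"}

# A reviewer corrects the agent's output once; the correction becomes a repeatable test.
added = suite.add_from_correction(
    "return-001", doc, "wages",
    agent_value=agent_v1(doc)["wages"], reviewer_value="52000",
)

score_v1 = suite.run(agent_v1)  # the old agent fails the new check
score_v2 = suite.run(agent_v2)  # the fixed agent passes it
print(added, score_v1, score_v2)
```

The point of the design is that the suite only grows: each correction costs a reviewer's time once, and after that the check runs for free against every future change.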
That is what makes OpenAI's "Building self-improving tax agents with Codex" worth passing around [C001]. The boundary matters: this is evidence from one tax workflow with human reviewers, not a universal benchmark for every AI product. But if you are still comparing tools only by model IQ, you are probably missing the part that actually saves time. Share it with the person still shopping that way.