If you mostly use Claude as a chat box and coding buddy, this release is easy to misread. Picking Claude vs Claude Code by model score is now the wrong frame. Claude Code is starting to look less like a helper and more like a foreman [C001].
You came in asking, "Did the model get better?" The more useful question is, "What kind of work is this product getting better at?" Claude and Claude Code are drifting apart: one is still about finishing the answer, while the other is getting better at splitting and coordinating large, repo-wide jobs.
The evidence is unglamorous: a page literally titled "Claude Code changelog - Claude Code Docs" [C001]. But Week 22 puts workflows near the top, and the example is a project-wide migration: changing one request pattern across a whole codebase, not fixing one file. That is a product-direction signal.
The workflow docs make the shift clearer. Claude Code can write a small job script, run it in the background, and fan work out to dozens or even hundreds of helper bots. One example scope is a change spread across 500 files. That is a foreman move, not a magic replacement.
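To make the foreman idea concrete, here is a minimal sketch of the generic fan-out pattern in plain Python. It is not Claude Code's actual mechanism or API. The names `split_into_batches`, `fan_out`, and `count_files` are hypothetical, and so are the batch size and the file paths. The "helper" here only counts files, where a real one would edit them.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(paths: Sequence[str], batch_size: int) -> List[List[str]]:
    """Cut one large file list into fixed-size chunks, one chunk per helper."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(paths[i:i + batch_size]) for i in range(0, len(paths), batch_size)]


def fan_out(
    paths: Sequence[str],
    worker: Callable[[List[str]], int],
    batch_size: int = 25,
    max_workers: int = 8,
) -> List[int]:
    """Run `worker` on every batch concurrently and collect results in batch order."""
    batches = split_into_batches(paths, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))


def count_files(batch: List[str]) -> int:
    # Hypothetical stand-in for a helper: a real one would rewrite each file.
    return len(batch)


if __name__ == "__main__":
    # Assumed scope: 500 made-up paths, mirroring the 500-file example above.
    files = [f"src/module_{n}.py" for n in range(500)]
    results = fan_out(files, count_files)
    print(len(results), sum(results))  # prints "20 500"
```

The interesting work sits in the splitting and collecting, not in any single helper, which is the same point the docs example makes at a larger scale.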
One caveat: this reading comes from the docs alone, not from a benchmark run or user reports. So my rule of thumb is simple: use Claude Code when the hard part is splitting and coordinating the job; use Claude when the hard part is finishing the answer. The most interesting part of a release is rarely that the model got stronger. It is why the strongest result was not what they chose to lead with. Share this with anyone still choosing by model scores alone.