If you mostly use chatbots and are trying not to fall behind, this is where people waste time: they see a high benchmark score and assume it means real coding strength. GeneBench-Pro is interesting because it pushes on that assumption first, not last. [C001]

My read is simple: adding dependencies to the task reveals more than adding parameters to the model. [C002] The useful move here was not “more questions.” It was making the questions dirtier before handing out the score.

In the setup described here, 22 changes added outside-tool calls, extra code around other code, and chains where one part depends on another. Across 13 models and 4 test families, scores fell by between 14.9% and 60.5%, with an average drop of 35.2%.
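If you want to run the same sanity check on any leaderboard you follow, the arithmetic is small. Here is a minimal Python sketch, assuming each model has one score on the clean tasks and one on the modified tasks, and assuming the drop is measured relative to the clean score (the post does not say whether the reported drops are relative or absolute). The function names and the two example models are hypothetical placeholders, not figures from GeneBench-Pro.

```python
def relative_drop(clean: float, perturbed: float) -> float:
    """Percentage drop from the clean score to the perturbed score.

    Assumption: the drop is relative to the clean score, not a
    difference in percentage points.
    """
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - perturbed) / clean * 100.0


def summarize_drops(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the smallest, largest, and mean drop across models."""
    drops = [relative_drop(clean, perturbed) for clean, perturbed in scores.values()]
    return {"min": min(drops), "max": max(drops), "mean": sum(drops) / len(drops)}


# Hypothetical placeholder scores: (clean, perturbed). Not real benchmark data.
example_scores = {
    "model_a": (80.0, 60.0),
    "model_b": (50.0, 25.0),
}

summary = summarize_drops(example_scores)
print(summary)
```

The point of looking at min, max, and mean together is that a single average can hide how unevenly models degrade once the tasks get messier.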

That sounds less like “models suddenly got worse” and more like “the old questions were too clean.” A benchmark is worth watching when it changes your next decision, not when it just adds another neat number.

Small boundary before people overread it: this comes from 4 test families, not live codebases or shipped products. So no, this does not prove synthetic complexity is the same as real work. It does show how fast clean scores can crack once dependencies show up.

If you know someone picking tools off leaderboard vibes, share this with them.