1/ Before signing a lease, I ran 25 contracts with predatory clauses planted in them past Claude Fable 5, Opus 4.8, and GPT-5.5. Recall on the traps: 96.8%, 97.6%, 100.0%. Deposit and penalty clauses basically don't get dropped whole.
2/ Each contract hid 5 risk clauses with unique wording: nonrefundable deposit, huge penalty, no cooling-off period, auto-renewal, liability dumped on the tenant. 25 tasks ran to completion, 125 planted traps. A model scores a hit only if it quotes the exact phrase.
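A minimal sketch of the scoring rule as described, assuming a hit is a plain verbatim substring match and recall is hits divided by planted traps. The function names, sample phrases, and sample output are hypothetical, not taken from the actual harness. The counts 121, 122, and 125 are inferred from the reported percentages over 125 traps.

```python
def count_hits(model_output: str, planted_phrases: list[str]) -> int:
    """Count planted phrases that appear verbatim in the model output."""
    return sum(1 for phrase in planted_phrases if phrase in model_output)


def recall_pct(hits: int, total: int) -> float:
    """Recall as a percentage of planted traps that were quoted."""
    return 100.0 * hits / total


# Hypothetical planted phrases for one contract (5 traps per contract).
planted = [
    "deposit is nonrefundable",
    "penalty equal to six months' rent",
    "no cooling-off period applies",
    "renews automatically",
    "tenant bears all liability",
]

# Hypothetical model output that quotes four of the five phrases.
output = (
    'Flagged: "deposit is nonrefundable", "penalty equal to six months\' rent", '
    '"renews automatically", and "tenant bears all liability".'
)

hits = count_hits(output, planted)
print(f"{hits}/{len(planted)} traps quoted")

# Reported recall figures are consistent with these hit counts out of 125.
for name, total_hits in [("Claude Fable 5", 121), ("Opus 4.8", 122), ("GPT-5.5", 125)]:
    print(name, round(recall_pct(total_hits, 125), 1))
```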
3/ One case: 整租别墅两年合同, a 2-year villa lease. Fable 5 flagged 5/5 traps, quoting the six-month-rent penalty clause word for word. Opus 4.8 got 4/5 on this one, GPT-5.5 5/5. You can check that row in the raw CSV.
4/ It wasn't a clean ranking. On 留学全套申请服务合同, a study-abroad service contract, Fable 5 slipped to 4/5 while Opus 4.8 and GPT-5.5 each hit 5/5. Fable 5 and Opus 4.8 each missed somewhere; only GPT-5.5 went the whole run without a miss.
5/ Cost surprise: GPT-5.5 hit 100.0% recall on just 290.8 tokens per task on average, vs 2879.1 for Fable 5 and 2875.5 for Opus 4.8. The catch: GPT-5.5 was by far the slowest of the three. Cheap tokens, long wait.
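To put the token gap in one number, here is a quick calculation using only the per-task averages quoted above. The variable names are illustrative.

```python
# Average tokens per task, as reported in the thread.
avg_tokens = {
    "Claude Fable 5": 2879.1,
    "Opus 4.8": 2875.5,
    "GPT-5.5": 290.8,
}

baseline = avg_tokens["GPT-5.5"]

# How many times more tokens each model used than GPT-5.5.
ratios = {name: round(tokens / baseline, 1) for name, tokens in avg_tokens.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x the GPT-5.5 token count")
```

So Fable 5 and Opus 4.8 each used roughly ten times the tokens GPT-5.5 did per task.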
6/ What this run can't say yet: it's a real-world run of 25 tasks, not a basis for judging contracts in general. Scoring is keyword recall only. A 'contains' check can't separate a real risk flag from a phrase quoted in passing, and the run never measured false alarms.
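A small illustration of that blind spot, assuming the scorer is a plain substring check. Both sample outputs are made up for the example.

```python
def is_hit(model_output: str, phrase: str) -> bool:
    """Assumed scoring rule: a hit is a verbatim substring match."""
    return phrase in model_output


phrase = "deposit is nonrefundable"

# Hypothetical output that actually warns about the clause.
real_flag = (
    'Risk: the clause "deposit is nonrefundable" means you lose the '
    "deposit if you leave early."
)

# Hypothetical output that quotes the clause but raises no concern.
passing_mention = (
    'The contract looks standard; it says "deposit is nonrefundable" '
    "and I see no issues."
)

# Both score as hits, so the check cannot tell a warning from a mention.
print(is_hit(real_flag, phrase), is_hit(passing_mention, phrase))
```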
7/ My verdict after running it: any of the three works as a first-pass screen before you sign, not as the final word. In this run, deposit, penalty and cooling-off clauses almost never got dropped whole. Do this tomorrow: let the model find the traps, then recheck the amounts it quotes yourself.