1/ I gave the same 30 short flight, hotel, and manual texts to three models and checked every number, time, and unit against the original. GPT-5.5 came back the least faithful: 83.3%. Claude Fable 5 hit 97.8%, Opus 4.8 hit 94.4%. For the numbers on your ticket, that gap matters.

2/ Field test, not a lab study: 30/30 translation tasks, all completed. A deterministic check scanned each output for whether the numbers, times, money, units and negations in the source actually survived into the Chinese.
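
The checker itself isn't shown in this thread, so here is only a minimal sketch of what a deterministic number check could look like, assuming plain digit matching. The function name `numbers_preserved` and the sample sentences are made up for illustration. It would miss numbers written as Chinese numerals (三 instead of 3) and it ignores times, money symbols, units and negations.

```python
import re

# Hypothetical helper, not the author's checker:
# digit runs with an optional decimal part, e.g. "3", "75", "24.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_preserved(source: str, translation: str) -> tuple[int, int]:
    """Return (kept, total): how many source numbers also appear in the translation."""
    wanted = NUMBER.findall(source)
    found = NUMBER.findall(translation)
    kept = 0
    for n in wanted:
        if n in found:
            found.remove(n)  # one output number can satisfy only one source number
            kept += 1
    return kept, len(wanted)

# Made-up example sentences, not taken from the test set.
src = "Changes within 3 hours of departure cost USD 75."
print(numbers_preserved(src, "起飞前 3 小时内改签收费 75 美元"))
print(numbers_preserved(src, "起飞前 3 小时内改签需付费"))
```

The first call keeps both numbers; the second loses the fee. Note that a check like this only says the number showed up, not that it is attached to the right thing.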

3/ The tell was units. Case "改签费与不可退否定" (change fee + non-refundable): GPT-5.5 wrote "起飞前 3 hours 内" and "USD 75", leaving raw English in a Chinese sentence. Scored 2/3. Fable and Opus: 3/3.

4/ Same pattern on "过境签否定条件" (transit-visa negation): GPT-5.5 kept "24 hours" in English, scored 1/2. Fable and Opus translated it into Chinese, 2/2. The negation was right. The unit just never got translated.
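
If you want to catch this kind of leak yourself, a rough sketch follows. It assumes a "leak" means ASCII letters sitting right next to a number inside the Chinese output. The name `english_unit_leaks` is hypothetical, and the rule will also flag legitimate things like flight numbers or aircraft model names, so treat hits as prompts to look, not verdicts.

```python
import re

# Assumption: a leak is a run of ASCII letters directly before or after a number,
# e.g. "3 hours" or "USD 75" left inside a Chinese sentence.
LEAK = re.compile(r"\d+(?:\.\d+)?\s*[A-Za-z]+|[A-Za-z]+\s*\d+(?:\.\d+)?")

def english_unit_leaks(translation: str) -> list[str]:
    """Return every number-plus-English-word fragment found in the translation."""
    return [m.group(0) for m in LEAK.finditer(translation)]

# Fragments quoted in this thread.
print(english_unit_leaks("起飞前 3 hours 内"))  # ['3 hours']
print(english_unit_leaks("USD 75"))             # ['USD 75']
print(english_unit_leaks("起飞前 3 小时内"))     # []
```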

5/ It wasn't a clean sweep for Claude. On "家电说明书·温度上限" (appliance manual, temp limit) Fable 5 slipped to 2/3 while Opus 4.8 and GPT-5.5 both hit 3/3. No single arm was spotless. The averages are what separate them.

6/ What this run can say: across 30 short flight/hotel/manual texts, GPT-5.5 lost numbers or left units untranslated more often than the two Claude models. It trails both by more than 10 points, and its own error rate (16.7%) is above 5%. That's the C2 outcome I pre-registered before running.
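
The arithmetic behind that claim, as a small sketch. The scores are the ones reported in this thread. The exact wording of C2 isn't quoted here, so the two thresholds (gap above 10 points, error rate above 5%) are my reading of this post, and `c2_holds` is a hypothetical name.

```python
# Faithfulness scores reported in this thread (percent).
scores = {"GPT-5.5": 83.3, "Claude Fable 5": 97.8, "Opus 4.8": 94.4}

def c2_holds(scores, target="GPT-5.5", min_gap=10.0, min_error=5.0):
    """Assumed rule: target trails every other model by > min_gap points
    and its own error rate exceeds min_error percent."""
    others = [v for k, v in scores.items() if k != target]
    gap = min(others) - scores[target]      # gap to the closest competitor
    error_rate = 100.0 - scores[target]
    return gap > min_gap and error_rate > min_error, gap, error_rate

holds, gap, error_rate = c2_holds(scores)
print(holds, round(gap, 1), round(error_rate, 1))  # True 11.1 16.7
```

Using the gap to the closest competitor (Opus 4.8) is the conservative choice: if that gap clears 10 points, the gap to Fable 5 does too.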

7/ What it can't say yet: one run, deterministic string-matching only (did the number show up, not whether it means the right thing). It doesn't extrapolate to long documents, other languages, or other text types. A misplaced negation could still slip past the check.

8/ My verdict: for SMS and fine print where a wrong number costs you money, don't hand it to GPT-5.5 right now. Reach for Fable 5 or Opus 4.8. And from now on, whatever model you use, eyeball the units yourself. That's where the English leaks through. Logs and CSV are posted.

