If Safety Comes Off in Minutes, It Was Never the Core

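You just saw this story in your feed and were about to scroll past, but you don't want to miss the one point that actually affects your next decision.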

The headline fact is simple: The Financial Times has published an article about Heretic. The easy mistake is to stop there. My conservative read is this: on many open models, the safety layer looks like a removable shell, not the model's core behavior. If you only watch the surface drama, you can waste time, budget, and attention on the wrong thing.

Why I think that: reports tied to the FT coverage say Heretic can strip Meta Llama 3.3 refusals, which suggests those refusals may be policy layered on after training rather than something deeply locked into the model [S001]. The bigger variable is not whether a guardrail exists, but how cheaply it can be reversed.

That is where the number matters. The BadLlama 3 paper said Llama 3 8B safety tuning could be removed in about 1 minute on a single GPU, and 70B in about 30 minutes [S003]. Microsoft added a separate warning signal in February 2026: one hidden prompt attack weakened refusals across 15 models [S004]. That does not prove safety is useless. It does suggest fragility is part of the product, not an edge case, even if this evidence is still limited to Meta Llama 3.3, Llama 3 8B/70B, a single-GPU setup, and Microsoft's 15-model test.

My filter from here is simple: a model update matters only if it changes your next decision, not just its feature list. If you're reviewing models you can download and run yourself, don't just ask whether guardrails are present. Ask how hard they are to remove, what your workflow assumes about refusals, and who on your team still treats "safe by default" as a permanent property rather than a reversible layer. Share this with the person who still thinks guardrails are the model itself.

#AISafety #LLM #OpenModels #AIEngineering

What actually deserves discussion: The Financial Times has published an article about Heretic.