The ability to deliberately decide to ignore the boundary between code and data ...

TeMPOraL · 2025-07-09T09:45:39 1752054339

The point is, there is no hard boundary. The LLM too may know[0] following instructions in data isn't part of transcription task, and still decide to do it.

--

[0] - In fact I bet it does, in the sense that, doing something like Anthropic did[1], you could observe relevant concepts being activated within the model. This is similar to how it turned out the model is usually aware when it doesn't know the answer to a question.

[1] - https://www.anthropic.com/news/tracing-thoughts-language-mod...

Dylan16807 · 2025-07-09T12:19:05 1752063545

If you can measure that in a reliable way then things are fine. Mixup prevented.

If you just ask, the human is not likely to lie but who knows with the LLM.