Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The ability to deliberately decide to ignore the boundary between code and data doesn't mean the separation rule isn't still separating. In the lab example, the person is worried and trying to do the right thing, but they know it's not part of the transcription task.


The point is, there is no hard boundary. The LLM too may know[0] following instructions in data isn't part of transcription task, and still decide to do it.

--

[0] - In fact I bet it does, in the sense that, doing something like Anthropic did[1], you could observe relevant concepts being activated within the model. This is similar to how it turned out the model is usually aware when it doesn't know the answer to a question.

[1] - https://www.anthropic.com/news/tracing-thoughts-language-mod...


If you can measure that in a reliable way then things are fine. Mixup prevented.

If you just ask, the human is not likely to lie but who knows with the LLM.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: