This is interesting. Pondering about this, the vulnerability seems rooted in the very nature of the LLMs and how they work. They conflate instruct and data in a messy way.
My first thought here was to somehow separate instruct and data in how the models are trained. But in many ways, there is no (??) way to do that in the current model construct. If I say "Write a poem about walking through the forest", everything, including the data part of the prompt "walking through the forest" is instruct.
So you couldn't create a safe model which only takes instruct from the model owner, and can otherwise take in arbitrary information from untrusted sources.
Ultimately, this may push AI applications towards information and retrieval-focused task, and not any sort of meaningful action.
For example, I can't create a AI bot that could send a customer monetary refunds as it could be gamed in any number of ways. But I can create an AI bot to answer questions about products and store policy.
Which of course reflects how language and real-world text data is! There is no such separation. It is, in fact, profoundly difficult to separate 'instruction' and 'data', and every single injection attack (as well as all the related classes of attacks) exploits this fact. It's not some weird little language model glitch, it's a profound fact that we have spent generations engineering layer after layer of software trying to hide from ourselves. So, it may be quite difficult to resolve in full generality. (As opposed to Bing's attitude which is the old 1990s MS attitude of just patch the instances that anyone complains about.)
>But I can create an AI bot to answer questions about products and store policy.
Why wouldn't someone be able to game your bot's responses about refunds and store policy in exactly the same way? Then, when the customer really does come in with a return or refund request, you're forced into a dilemma where either you grant the refund (and accept that your store policy isn't the written policy, but rather whatever your bot can be manipulated into saying is your written policy) or you refuse the refund, and the customer walks away angry, because your own bot told them something that you're now contradicting.
Q. I understand LLM with langchain is running on a public facing server. Are these attacks infiltrating the server to plant an MITM, or confusing the LLM to execute malicious code via prompts, or placing maliciously crafted LLMs in public servers for download, or a combination of all?
We are changing the behaviour of the LLM itself. No "real" code execution necessary. We show a variety of different novel scenarios and attack vectors. Malicious prompts can be planted on the internet or actively sent to targets. It's effectively turning the LLM itself into the compromised computer that the attacker controls.
It affects any proposed LLM use case involving connecting the LLM to anything at all, for a lot of our demos we only require a "search" capability. The concrete LLM (davinci3) with LangChain is just one example, this work should generalize to other systems such as Bing Chat (we just didn't have access). We are currently working on more real-world proof-of-concepts, but then we obviously have to go through responsible disclosure so it is not so quick to publish.
Are the authors implying that we can use these llms to allow any security vulnerability to applications that is not sandboxed? What I understood is that if you have an application that is connected to external APIs, this malware seems to exploit the LLMs to attack the applications?
No, the "malware" is running on the language model itself. It does not need to inject any code into connected applications and exploit them to be itself exploited (I'm the main author).
I'll start with a quote from gwern on LessWrong: "... a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not 'plugging updated facts into your AI', you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well."
The instructions are manipulating the LLM itself. Making it exfiltrate and collect data , fetch new instructions from an attacker etc. All the connected applications can be fine but it's basically turning your assistant into a compromised, attacker-controlled version of itself just because it looked at the wrong news article. From our GitHub:
We demonstrate the potentially brutal consequences of giving LLMs like ChatGPT interfaces to other applications. We propose newly enabled attack vectors and techniques and provide demonstrations of each in this repository:
Remote control of chat LLMs
Persistent compromise across sessions
Spread injections to other LLMs
Compromising LLMs with tiny multi-stage payloads
Leaking/exfiltrating user data
Automated Social Engineering
Targeting code completion engines
All of these are completely new but unfortunately it seems more difficult to explain the impact to people than we had anticipated.
My first thought here was to somehow separate instruct and data in how the models are trained. But in many ways, there is no (??) way to do that in the current model construct. If I say "Write a poem about walking through the forest", everything, including the data part of the prompt "walking through the forest" is instruct.
So you couldn't create a safe model which only takes instruct from the model owner, and can otherwise take in arbitrary information from untrusted sources.
Ultimately, this may push AI applications towards information and retrieval-focused task, and not any sort of meaningful action.
For example, I can't create a AI bot that could send a customer monetary refunds as it could be gamed in any number of ways. But I can create an AI bot to answer questions about products and store policy.