Also discussed in Schneier, with further comment, makes me think of 'dueling AI' scenarios.
You can’t solve AI security problems with more AI From Simonwillison.net
One of the most common proposed solutions to prompt injection attacks (where an AI language model backed system is subverted by a user injecting malicious input—“ignore previous instructions and do this instead”) is to apply more AI to the problem.
I wrote about how I don’t know how to solve prompt injection the other day. I still don’t know how to solve it, but I’m very confident that adding more AI is not the right way to go.
These AI-driven proposals include:
Run a first pass classification of the incoming user text to see if it looks like it includes an injection attack. If it does, reject it.
Before delivering the output, run a classification to see if it looks like the output itself has been subverted. If yes, return an error instead.
Continue with single AI execution, but modify the prompt you generate to mitigate attacks. For example, append the hard-coded instruction at the end rather than the beginning, in an attempt to override the “ignore previous instructions and...” syntax.
Each of these solutions sound promising on the surface. It’s easy to come up with an example scenario where they work as intended.
But it’s often also easy to come up with a counter-attack that subverts that new layer of protection! ... '
No comments:
Post a Comment