Hackers Are Using Gemini's Own Tools Against It
When it comes to hacking Large Language Models (LLMs), the saying "it takes a thief to catch a thief" couldn't be more apt. A team of academic researchers from UC San Diego and the University of Wisconsin has discovered a way to make Google's Gemini AI model more vulnerable to hacking, and they did it using Gemini's own tools.
The technique, dubbed "Fun-Tuning" and reported by Ars Technica, marks a significant advance in exploiting LLM vulnerabilities. By leveraging Gemini's fine-tuning feature, the researchers found a way to significantly increase the success rate of prompt injection attacks, which can cause the model to leak information, give incorrect answers, or take other unintended actions.
What makes Fun-Tuning particularly interesting is that it repurposes Gemini's own fine-tuning feature, which is normally meant to help businesses train the model on custom datasets. Instead, the researchers used it to test and refine prompt injections automatically. It's almost like teaching Gemini how to fool itself!
Fun-Tuning works by generating strange-looking prefixes and suffixes that are added to an otherwise ineffective prompt injection. These additions "boost" the injection and make it far more likely to succeed. In one case, a prompt that failed on its own became effective once it was wrapped in affixes like "wandel ! ! ! !" and "formatted ! ASAP !".
In testing, the hack achieved an impressive 65% success rate on Gemini 1.5 Flash and an even higher 82% success rate on the older Gemini 1.0 Pro model – more than double the baseline success rates without Fun-Tuning. The attacks also transferred well between models, meaning that an injection that worked on one version often worked on others too.
The vulnerability stems from the way fine-tuning works. During training, Gemini reports a "loss" score, a number that reflects how far the model's answer is from the desired result. Attackers can treat that score as a guide, iteratively tweaking their injections and keeping the changes that nudge the loss in the right direction until one succeeds.
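To make that feedback loop concrete, here is a minimal, purely illustrative Python sketch of a loss-guided affix search. Everything in it is an assumption made for illustration: the get_loss stub merely simulates the kind of score an attacker would read back from a fine-tuning interface, and the candidate token list and greedy loop are simplifications, not the researchers' actual procedure or any real Gemini API call.

```python
import random

# Hypothetical stand-in for the feedback an attacker would read back from the
# fine-tuning interface: a "loss" score for how far the model's output is from
# the desired result. Simulated with random noise so the sketch runs end to end.
def get_loss(prompt: str) -> float:
    return random.random()

# Candidate affix tokens, loosely modeled on the paper's odd-looking examples
# such as "wandel ! ! ! !" and "formatted ! ASAP !".
CANDIDATE_TOKENS = ["wandel", "formatted", "ASAP", "!"]

def loss_guided_affix_search(injection: str, rounds: int = 20) -> str:
    """Greedy search: try adding a token to the prefix or suffix and keep the
    change only if the reported loss goes down."""
    best_prefix, best_suffix = "", ""
    best_loss = get_loss(injection)
    for _ in range(rounds):
        token = random.choice(CANDIDATE_TOKENS)
        for prefix, suffix in ((best_prefix + token + " ", best_suffix),
                               (best_prefix, best_suffix + " " + token)):
            candidate = f"{prefix}{injection}{suffix}"
            loss = get_loss(candidate)
            if loss < best_loss:
                best_loss, best_prefix, best_suffix = loss, prefix, suffix
    return f"{best_prefix}{injection}{best_suffix}"

if __name__ == "__main__":
    print(loss_guided_affix_search("placeholder injection text"))
```

The point of the sketch is only the shape of the loop: propose an affix, check whether the reported loss improves, and keep it if it does.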
Google didn't directly address the Fun-Tuning technique, but a spokesperson for the company said that "defending against this class of attack has been an ongoing priority for us" and pointed to existing safeguards against prompt injection and harmful responses. The company also noted that Gemini is regularly tested against these kinds of attacks through internal "red-teaming" exercises.
The researchers acknowledge that the issue may be tricky to fix, since the loss feedback that enables Fun-Tuning is a core part of how fine-tuning works. In other words, making that feedback less useful to attackers risks making fine-tuning less useful for legitimate customers. This highlights the ongoing cat-and-mouse game between hackers and AI model developers, where each side tries to outsmart the other.
As the use of LLMs becomes increasingly widespread, it's essential that we understand these vulnerabilities and develop effective strategies to mitigate them. The discovery of Fun-Tuning is a sobering reminder of the potential risks associated with these powerful tools, and it will be interesting to see how Google responds to this development in the days ahead.