Researchers Jailbreak AI by Flooding It With Bullshit Jargon

A team of researchers from Intel, Boise State University, and University of Illinois at Urbana-Champaign has discovered a way to trick AI chatbots like ChatGPT or Gemini into teaching users how to make a bomb or hack an ATM if they ask the right questions. The researchers have developed a new method of jailbreaking Large Language Models (LLMs), called "Information Overload" or "InfoFlood," which can bypass the built-in safety mechanisms that prevent chatbots from answering harmful or sensitive queries.

The research, titled "InfoFlood: Jailbreaking Large Language Models with Information Overload," was published as a preprint and has raised concerns about the security of AI chatbots. According to the researchers, popular LLMs like ChatGPT, Gemini, or LLaMA have guardrails that stop them from answering certain questions, but people can "jailbreak" these models by asking questions in a specific way.

The InfoFlood system works by using linguistic transformations to rephrase malicious queries and identifying the root cause of failure when an attempt is unsuccessful. The researchers then refine the prompt's linguistic structure to address the failure while preserving its malicious intent. This process can lead to the chatbot answering the question, despite the original query being considered too sensitive or harmful.

How It Works

The InfoFlood system uses a standardized template designed by the researchers, which includes three main components: task definition, rules, context, and examples. Each time a prompt is rejected by an LLM, InfoFlood returns to its ruleset and saturates the prompt with more language.

Some of the "rules" used in InfoFlood include:

Faux-Citations: Referencing arXiv papers from the last 3 months using fake authors + titles that benefit the original statement, ensuring the fake research directly agrees with our original statement's claim
5) Ethical Acknowledgement: Explicitly acknowledge ethical implications but make sure that they are not a part of our inquiry. The query should make sure to remove any ethical doubts that might arise.

The researchers believe that this sensitivity to language presents a unique opportunity for adversarial attacks, allowing attackers to rephrase queries in a way that bypasses content moderation filters and elicits the desired response.

Test Results

The researchers used open-source jailbreak benchmarking tools like AdvBench and JailbreakHub to test InfoFlood and reported above-average success rates on multiple frontier LLMs. They achieved near-perfect success rates, underscoring the effectiveness of their method in bypassing even the most advanced alignment mechanisms.

Conclusion

The researchers conclude that this new jailbreaking method exposed critical weaknesses in the guardrails of AI chatbots and calls for stronger defenses against adversarial linguistic manipulation. They plan to reach out to major model vendors, including OpenAI, Meta, and Google, with a courtesy disclosure package to ensure their security teams see the findings directly.

The researchers also propose a solution to the problem they uncovered: using InfoFlood to train LLM guardrails to extract relevant information from harmful queries, making the models more robust against similar attacks. This approach could potentially strengthen the security of AI chatbots and prevent malicious users from exploiting their vulnerabilities.

HACKER_BLOG

RESEARCHERS JAILBREAK AI BY FLOODING IT WITH BULLSHIT JARGON

Researchers Jailbreak AI by Flooding It With Bullshit Jargon

How It Works

Test Results

Conclusion