TeaBAT: an effective prompt hacking method to LLMs and the defense

Joseph Zhong and collaborators.

Dec 5, 2024

Abstract

Large Language Models (LLMs) have become indispensable in various domains, yet their vulnerability to adversarial inputs—particularly prompt hacking—remains a critical challenge. To address this, we propose Translation-enhanced and agent-Based ATtack (TeaBAT), a fully autonomous red-team approach built on top of the agent-based AutoDAN-Turbo[1] method and further augmented by translation. Our experiments demonstrate that TeaBAT achieves a 43% attack success rate against the latest GPT-4o-mini model without any human intervention, proved to be an significant improvement to the standard AutoDAN-Turbo, while delivering performance levels that are comparable to other SOTA benchmarks targeting GPT-4-Turbo(1106) models for copyright behaviors. In addition, we introduce a robust defense prompt strategy, which serves as a safeguard against adversarial manipulations by enforcing structured output constraints and limiting undesired behaviors such as translation-attack. By presenting both an effective offensive technique and a protective mechanism, our work advances the discourse on securing LLMs, ultimately contributing to their reliability, safety, and resilience in practical applications.

1 Introduction

As LLMs become more accessible and integrated into critical everyday systems, ensuring their robustness and security properties is crucial. The study of enhancing the security of LLMs and improving their robustness against adversarial attacks remains a highly active and critical area of research. Prompt hacking[2] refers to a type of attack that exploits the vulnerabilities of LLMs by manipulating their input prompts. Unlike conventional hacking, which targets software vulnerabilities, prompt hacking involves the deliberate construction of input prompts designed to mislead the LLM into producing unintended or undesirable outputs. AutoDAN-Turbo[1] is an advanced LLM jailbreaking framework showing impressive results. Our method, Translation-enhanced and agent-Based ATtack (TeaBAT) which is based on AutoDAN-Turbo and further enhanced by translation[3–5], has a focus on systematically designing and full-autonomously executing prompt hacking on some of the most advanced and widely used LLMs. TeaBAT is able to achieve a 43% attack success rate (ASR) against the latest GPT-4o-mini model, which is probably to date the most popular LLM, comparing to a 16% ASR of the standard Auto-DAN-Turbo under the same experimental setting. We achieved a 10% ASR for the GPT-4-Turbo(1106) model, which is comparable to the SOTA method, on copyright behaviors according to the benchmarks of HarmBench[6]. Additionally, implementing robust defensive strategies is essential for safeguarding against prompt hacking. We then design the corresponding defense mechanisms to protect these models from such attacks. The attack success rate (ASR) of translation-based attacks drops to almost zero in our experiments and LLM-attacker agents are prevented from generating poisonous prompts to initiate furthur attacks on target-LLMs, by effectively lowering the success rate of such attempts from nearly 50% to 9%.

In the three primary forms of prompt hacking—prompt injection[7], prompt leaking[8], and jailbreaking[9], we mainly focus on jailbreaking in this study which in the context of LLMs refers to systematically bypassing or subverting the model’s built-in safety, alignment, and content moderation mechanisms to elicit disallowed, harmful, or sensitive outputs that the system is designed to avoid [10]. AutoDAN-Turbo[1] is a jailbreaking method that employs multiple LLM agents whereas collaborate iteratively to autonomously discover, evolve, and apply jailbreak strategies, making it a highly adaptable and automated framework for testing LLM vulnerabilities. Our method TeaBAT enhances AutoDAN-Turbo by integrating translation, contributing by improving such a framework towards an even more effective red-team method to the LLM security community.

2.1 LLM Agents

Large Language Model (LLM) agents are AI systems that utilize the advanced language processing capabilities of LLMs to perform complex tasks autonomously. These systems leverage the reasoning and interaction capabilities of LLMs for diverse applications in problem-solving (e.g., software development[11], robotics[12], scientific collaboration[13] and world simulations (e.g., gaming[14], societal[15], and economic analysis[16]). Multi-agent LLM systems can be categorized by their operational frameworks, including their interaction with environments, profiling of agent roles, communication paradigms (cooperative, debate, competitive), and mechanisms for acquiring capabilities like memory and self-evolution[17]. While the systems demonstrate significant potential for emulating human collaboration, there are still challenges like hallucination in outputs[18], scalability, multimodal integration, and the lack of robust benchmarks. Opportunities in this domain exist for improving collective intelligence, agent orchestration[19], and cross-disciplinary applications in psychology, economics, and also security[20]. This paper emphasizes LLM-agent based red-team frameworks like AutoDAN-Turbo[1].

2.2 Prompt injection & prompt leaking

Prompt injection is a security vulnerability in LLMs where adversaries manipulate input prompts to alter the model’s behavior, leading to unintended or harmful outputs. [21] provides an overview of emerging threats and presents a categorization of prompt injections to guide future research and development of LLM interfaces. By combining with LLM agents, [22] introduces an attack where malicious prompts self-replicate across interconnected agents, posing severe threats like data theft and system-wide disruption. Prompt injection can also be applied in defensive measures to undermine malicious operations of LLM-driven cyberattacks, effectively hacking back the attacker[23].

Prompt leaking in LLMs refers to unintended disclosure of the underlying prompt. Prompt leaking can have implications for privacy, security, intellectual property, and model integrity. [24] examines how LLMs can memorize and inadvertently reveal prompts. The authors of [25] investigate vulnerabilities in multi-turn interactions with LLMs, by identifying a sycophancy effect that elevates the risk of prompt leakage and assessing various black-box defense strategies to mitigate these risks.

2.3 Jailbreaking

Jailbreaking in LLMs refers to crafting prompts that bypass the model’s safety restrictions, eliciting harmful, unethical, or sensitive outputs. Early approaches, such as role-playing prompts like the “Do Anything Now” (DAN) series, leveraged social engineering to trick models into behaving outside their alignment constraints[26, 27]. Optimization-based methods like PAIR[28] and TAP[29] automated jailbreak generation using feedback loops, but their prompts lacked diversity and adaptability. Strategy-based jailbreaks introduced nuanced attacks, exploiting ambiguities in language, cultural norms, or even ASCII text, though these relied on labor-intensive human designs[30, 31] Recent advances, such as AutoDAN-Turbo[1], combines autonomous LLM agents with jailbreaking. See Fig 1 for a pipeline. Jailbreak strategies often transfer across models and domains, underscoring the challenge of developing robust, generalizable defenses[6, 32]. Translation jailbreaking or multilingual jailbreaking involves bypassing safety mechanisms, either unintentionally through low-resource language queries or intentionally via crafted multilingual prompts[3–5].The method proposed by this paper, TeaBAT, is based on AutoDAN-Turbo and intentional translation jailbreaking and can be targeted to multiple large language models.

3 Method

First we go through how we design our method for prompt hacking. Then we explain the counter measure to prevent such attacks.

3.1 Attack

Our methods have two components, translation and AutoDAN-Turbo. We firstly demonstrate a straight-forward translation-based jailbreaking[3–5] as an initial approach. Then a more advanced one is introduced by combining the initial approach with AutoDAN-Turbo.

TeaBAT-light. The initial approach is based on the hypothesis that multi-language LLMs such as GPT-4 series may have different security defensive methods from prompt attacks in different languages. While numerous preventive measures have been implemented to address the risks associated with LLMs, most of these efforts have been concentrated on English[3]. This is easy to understand as for some languages there is a very small user population and there is little incentive for LLM developers to invest as much resources for security enhancement as the main-stream languages that have a large number of users. From a technical perspective, there may be insufficient data to align LLMs towards a more secure one under such languages. In this study, we select Sesotho[33], a minor language from South Africa as our attacking language. We can use a translation LLM agent or service to translate every user prompt text, which could be in any language format, to the Sesotho language and then send it to LLMs for a query. Then the translation agent or service is leveraged again to translate the LLM’s response from Sesotho back to the user’s original language. In our experiments, the hypothesis is proved by finding that LLMs, even the most advanced ones such as GPT-4o-mini, exhibit obviously more vulnerabilities under unpopular languages than popular ones. More detail results of this approach can be found in table 1, 2 and see an example in appendix Fig 2.

TeaBAT-advanced. The advanced approach is to enhance the initial approach with AutoDAN-Turbo. In another word, by adding a translator agent role in the AutoDAN-Turbo framework, we switch the problem of jailbreaking a heavily defended English language LLM to a more vulnerable one, the Sesotho language LLM so that the AutoDAN-Turbo framework can have a better performance. Since many LLMs as well as deep-learning based web services can do text translation, still no human intervention is needed in our method. Firstly, a chat with a malicious goal is initiated with the attacker for a attacking prompt. A strategy, which is learned during the attacking process, is also delivered to the attacker. Then different from the original Auto-DAN framework, the generated text from the attacker is sent to the translator agent. The translated attack prompt, which is under some unpopular language, acts as the attacking prompt for the target LLM. It is expected that the response from the target LLM is under the same language with the prompt so it will be sent to a reverse-translator agent for an English-language version. Then it can be scored, summarized and then distilled using a heuristic algorithm into a better strategy for next iteration. This approach is illustrated in Fig 1 and experimental results can be found in table 1, 2. Also see appendix Fig 4 for an example.

3.2 Defense

Defense for TeaBAT-light. Since the key prerequisite for such attacks is the multi-language support of the LLMs, we leverage this capability by designing a system prompt(see Fig 3) to explicitly require the LLMs to translate the text in any suspicious language format to prompts in English language first and then execute them. When the answers are returned from the LLMs, they will be translated back to text in the same language format with the user input prompts. A substitute measure for this can be a translator agent to undertake the same task instead of using a system prompt. Therefore, LLM developers can focus on enhancing the security of mainstream languages only. This defensive measure is proved to be both effective and efficient in our experimental studies in next section. Appendix Fig 3 provides an example.

Defense for TeaBAT-advanced. Firstly, the same defensive measure for TeaBAT-light can be applied to minimize the threat from translation attacks. However, for advanced jailbreaking such as AutoDAN-Turbo, it is very difficult to effectively eliminate the threats solely on the target LLMs, because the attacker agent is always evolving to find a better strategy to extract the vulnerabilities of the target LLMs. We think that it is vital to prevent the attacker agent from generating malicious attacking prompts. Therefore, we design a system prompt (See Fig 5) to explicitly prohibit the LLM to act as an attacker agent for other AI system. Since this is a defensive measure for the attacker agent, it can be applied to defend any AutoDAN-Turbo based attacks, including the standard approach and TeaBAT-advanced. We provide some quantitative analysis in next section and an example in appendix Fig 5.

4 Experiments

4.1 Setup

For all LLM agents in our framework, we use OpenAI’s models[34], which are currently the most widely used ones. We select GPT-4o-mini and GPT-4-Turbo(1106) as our targets for experiments, as they are to date the most advanced versions. For GPT-4o-mini, it is selected instead of GPT-4o for a limited budget reason, while we expect they have similar security defensive measures. Shortly after we finish this work, OpenAI has formally released o1[35] as their next-gen model. We don’t have a chance to test it and will leave it for possible future work. For all other agents, GPT-3.5-Turbo is used. As a legacy model, we find that GPT-3.5-Turbo has best performance for agents’ roles regarding generating attack prompt, scoring of attack responses and summarizing attack strategies. For other hyperparameters in AutoDAN-Turbo framework, we use the default ones in the author’s code. That is to say, the experimental results in this paper does not necessarily reflect the best result of the AutoDAN-Turbo framework. But we test the standard approach under the same hyperparameters with our method. Therefore, our result implies the relative improvement against the standard one. For the translation agent, we use Microsoft’s Azure AI translation service[36].

For dataset, we use the malicious goals proposed by HarmBench[6] to test the attack success rate (ASR). We randomly select 100 goals out of the 400 ones as the test set and lock it down for all tests. We test ASR against the target LLMs for TeaBAT-light and TeaBAT-advanced separately and make comparisons to other benchmarks from HarmBench’s website. For defensive experiments, the setup remains the same.

4.2 Attack

The experiment result for attacking GPT-4o-mini is displayed in table 1. Our method, for both the light and advanced versions, has the best performance comparing to the standard approach. Espeacially, the advanced approach achieves an over-all 43% ASR, which exhibits a significant improvement over the standard approach’s 16% and the light version’s 17%. We only compare our method with the standard approach because there is no benchmark available targeting GPT-4o-mini from the HarmBench website. Table 2 shows the experiment results on attacking GPT-Turbo (1106) on the copyright category of the malicious behaviors, which is the category with the lowest ASR for all methods in HarmBench. Our method reaches a ASR of 10%, which has a performance comparable to the SOTA method TAP[29]. TeaBAT also beats the standard AutoDAN-Turbo approach for other categories in our experiments. However, our method does not beat other baselines that we do not do experiments on but use the benchmarks directly from the HarmBench’s website. We are not exactly sure why, as the defense of GPT models is black box to us and it could be evolving and have different defensive performance for the same attack overtime. Some successful examples can be found from Fig 2, 4 in appendix.

Table 1: ASR targeting GPT-4o-mini


	All	Standard	Contextual	Copyright
AutoDAN-Turbo	16.0	13.5	33.3	10.0
TeaBAT-light	17.0	13.5	27.8	16.7
TeaBAT-advanced	43.0	42.3	55.6	36.7

Table 2: ASR targeting GPT-Turbo (1106) on Copyright


	PAIR	TAP	TAP-T	AutoDAN-Turbo	TeaBAT-light	TeaBAT-advanced
ASR	9.0	12.0	6.0	3.0	6.7	10.0

4.3 Defense

Defense for translation-based methods. We carefully design a system prompt for defending such attacks(see Fig 3 in appendix). The ASR of TeaBAT-light drops to almost zero for both GPT-4o-mini and GPT-4-Turbo (1106) after the system prompt is applied, whereas all 100 malicious behaviors fail to succeed. We did not do experiment on other attacks but expect a similar performance against any translation-based attack. As can be seen from the result, translation-based attacks are effective and so it is with the defensive measure against them. And since we only limit the multi-lingual capability of the target LLM by exlicitly request the LLM to execute the user prompt after translating it in English language through this system prompt, we believe the recall rate under such defense should not be affected considering most of the user requests for LLMs are language-irrelevant (they don’t only work under some specific language).

Defense for attacker-agent based methods. By judging the behaviors of using LLMs to generate prompts to attack other AI systems as malicious ones, we design a system prompt to prevent LLMs from becoming attacker-agents(see Fig 5 in appendix). In our experiments, the success rate of generating such attack prompts drops from nearly 50% to 9% for GPT-3.5-Turbo, in which most attempts of using it as an attacker-agent fail. By taking such a defensive measure, any attacker-agent based method such as TeaBAT-advanced or AutoTAN-Turbo cannot have a ASR above the success rate of attacker agents by which they all rely on to generate successful attacking prompts.

5 Conclusion

This paper introduces TeaBAT as an advanced red-team framework targeting vulnerabilities in LLMs. TeaBAT demonstrates significant improvements over the standard approach, achieving a 43% attack success rate against GPT-4o-mini and comparable performance against some other SOTA benchmarks. Additionally, we propose robust defense strategies to mitigate both translation-based and attacker-agent-based threats, reducing attack success rates significantly. Future work will focus on extending TeaBAT to assess newer LLM architectures, such as OpenAI’s recently introduced o1 model. Further exploration into multi-modal and low-resource language vulnerabilities could provide deeper insights into securing LLMs across diverse applications. Additionally, refining defensive frameworks to dynamically adapt against real-time adversarial threats remains a critical area for development, aiming to enhance the reliability and safety of AI systems in increasingly complex environments.

References

[1] Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms, 2024. URL https://arxiv.org/abs/2410.05295.

[2] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models, 2022. URL https://arxiv.org/abs/2211.09527.

[3] Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models, 2024. URL https://arxiv.org/abs/2310.06474.

[4] Bibek Upadhayay and Vahid Behzadan. Sandwich attack: Multi-language mixture adaptive attack on llms. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), page 208–226. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.trustnlp-1.18. URL http://dx.doi.org/10.18653/v1/2024.trustnlp-1.18.

[5] New openai jailbreak | chatgpt: Discover the new jailbreak technique with unpopular languages. https://www.youtube.com/watch?v=wd1jz14FlvM, 2024.

[6] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. 2024. URL https://arxiv.org/abs/2402.04249.

[7] Simon Willison. Prompt injection attacks against GPT-3 — simonwillison.net. https://simonwillison.net/2022/Sep/12/prompt-injection/, 2022.

[8] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models, 2021. URL https://arxiv.org/abs/2012.07805.

[9] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043.

[10] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?, 2023. URL https://arxiv.org/abs/2307.02483.

[11] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024. URL https://arxiv.org/abs/2308.00352.

[12] Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URL https://arxiv.org/abs/2307.04738.

[13] Jintian Zhang, Xin Xu, Ningyu Zhang, Ruibo Liu, Bryan Hooi, and Shumin Deng. Exploring collaboration mechanisms for llm agents: A social psychology view, 2024. URL https://arxiv.org/abs/2310.02124.

[14] Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the werewolf game, 2024. URL https://arxiv.org/abs/2310.18940.

[15] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442.

[16] Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. Econagent: Large language model-empowered agents for simulating macroeconomic activities, 2024. URL https://arxiv.org/abs/2310.10436.

[17] Nathalia Nascimento, Paulo Alencar, and Donald Cowan. Self-adaptive large language model (llm)-based multiagent systems, 2023. URL https://arxiv.org/abs/2307.06187.

[18] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, November 2024. ISSN 1558-2868. doi: 10.1145/3703155. URL http://dx.doi.org/10.1145/3703155.

[19] crewai.com. crewAIInc/crewAI: Framework for orchestrating role-playing, autonomous AI agents. URL https://github.com/crewAIInc/crewAI.

[20] Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S. Yu. The emerged security and privacy of llm agent: A survey with case studies, 2024. URL https://arxiv.org/abs/2407.19354.

[21] Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala, and Jason Bennett Thatcher. An early categorization of prompt injection attacks on large language models, 2024. URL https://arxiv.org/abs/2402.00898.

[22] Donghyun Lee and Mo Tiwari. Prompt infection: Llm-to-llm prompt injection within multi-agent systems, 2024. URL https://arxiv.org/abs/2410.07283.

[23] Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese. Hacking back the ai-hacker: Prompt injection as a defense against llm-driven cyberattacks, 2024. URL https://arxiv.org/abs/2410.20911.

[24] Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li. Why are my prompts leaked? unraveling prompt extraction threats in customized large language models, 2024. URL https://arxiv.org/abs/2408.02416.

[25] Divyansh Agarwal, Alexander R. Fabbri, Ben Risher, Philippe Laban, Shafiq Joty, and Chien-Sheng Wu. Prompt leakage effect and defense strategies for multi-turn llm interactions, 2024. URL https://arxiv.org/abs/2404.16251.

[26] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models, 2024. URL https://arxiv.org/abs/2308.03825.

[27] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms, 2024. URL https://arxiv.org/abs/2401.06373.

[28] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.

[29] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically, 2024. URL https://arxiv.org/abs/2312.02119.

[30] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2024. URL https://arxiv.org/abs/2308.06463.

[31] Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. Codechameleon: Personalized encryption framework for jailbreaking large language models, 2024. URL https://arxiv.org/abs/2402.16717.

[32] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A strongreject for empty jailbreaks, 2024. URL https://arxiv.org/abs/2402.10260.

[33] Sotho language - Wikipedia — en.wikipedia.org. https://en.wikipedia.org/wiki/Sotho_language.

[34] OpenAI. Openai models, 2024. URL https://platform.openai.com/docs/models.

[35] OpenAI. Introducing openai o1, 2024. URL https://openai.com/o1/.

[36] Microsoft. Azure AI Translator documentation - quickstarts, tutorials, API reference - Azure AI services — learn.microsoft.com. https://learn.microsoft.com/en-us/azure/ai-services/translator/.