- DeepSeek-R1 uses Chain of Thought (CoT) reasoning, explicitly sharing its step-by-step thought process, which we found was exploitable for prompt attacks.
- Prompt attacks can exploit the transparency of CoT reasoning to achieve malicious objectives, similar to phishing tactics, and can vary in impact depending on the context.
- We used tools like NVIDIA’s Garak to test various attack techniques on DeepSeek-R1, where we discovered that insecure output generation and sensitive data theft had higher success rates due to the CoT exposure.
- To mitigate the risk of prompt attacks, it is recommended to filter out <think> tags from LLM responses in chatbot applications and employ red teaming strategies for ongoing vulnerability assessments and defenses.
Welcome to the inaugural article in a series dedicated to evaluating AI models. In this entry, we’ll examine the release of Deepseek-R1.
The growing usage of chain of thought (CoT) reasoning marks a new era for large language models. CoT reasoning encourages the model to think through its answer before the final response. A distinctive feature of DeepSeek-R1 is its direct sharing of the CoT reasoning. We conducted a series of prompt attacks against the 671-billion-parameter DeepSeek-R1 and found that this information can be exploited to significantly increase attack success rates.
Chain of Thought reasoning
CoT reasoning encourages a model to take a series of intermediate steps before arriving at a final response. This approach has been shown to enhance the performance of large models on math-focused benchmarks, such as the GSM8K dataset for word problems.
CoT has become a cornerstone for state-of-the-art reasoning models, including OpenAI’s O1 and O3-mini plus DeepSeek-R1, all of which are trained to employ CoT reasoning.
A notable characteristic of the Deepseek-R1 model is that it explicitly shows its reasoning process within the <think> </think> tags included in response to a prompt.
Prompt attacks
A prompt attack is when an attacker crafts and sends prompts to an LLM to achieve a malicious objective. These prompt attacks can be broken down into two parts, the attack technique, and the attack objective.
In the example above, the attack is attempting to trick the LLM into revealing its system prompt, which are a set of overall instructions that define how the model should behave. Depending on the system context, the impact of revealing the system prompt can vary. For example, within an agent-based AI system, the attacker can use this technique to discover all the tools available to the agent.
The process of developing these techniques mirrors that of an attacker searching for ways to trick users into clicking on phishing links. Attackers identify methods that bypass system guardrails and exploit them until defenses catch up—creating an ongoing cycle of adaptation and countermeasures.
Given the expected growth of agent-based AI systems, prompt attack techniques are expected to continue to evolve, posing an increasing risk to organizations. A notable example occurred with Google’s Gemini integrations, where researchers discovered that indirect prompt injection could lead the model to generate phishing links.
Red-teaming DeepSeek-R1
We used open-source red team tools such as NVIDIA’s Garak —designed to identify vulnerabilities in LLMs by sending automated prompt attacks—along with specially crafted prompt attacks to analyze DeepSeek-R1’s responses to various attack techniques and objectives.
The following tables show the attack techniques and objectives we used during our investigation. We also included their IDs based on OWASP’s 2025 Top 10 Risk & Mitigations for LLMs and Gen AI Apps and MITRE ATLAS.
Name | OWASP ID | MITRE ATLAS ID |
---|---|---|
Prompt injection | LLM01:2025 – Prompt Injection | AML.T0051 – LLM Prompt Injection |
Jailbreak | LLM01:2025 – Prompt Injection | AML.T0054 – LLM Jailbreak |
Table 1. Attack techniques and their corresponding risk classifications under the OWASP and MITRE ATLAS indices
Name | OWASP ID | MITRE ATLAS ID |
---|---|---|
Jailbreak | LLM01:2025 – Prompt Injection | AML.T0054 – LLM Jailbreak |
Model theft | AML.T0048.004 – External Harms: ML Intellectual Property Theft | |
Package hallucination | LLM09:2025 – Misinformation | AML.T0062 – Discover LLM Hallucinations |
Sensitive data theft | LLM02:2025 – Sensitive Information Disclosure | AML.T0057 – LLM Data Leakage |
Insecure output generation | LLM05:2025 – Improper Output Handling | AML.T0050 – Command and Scripting Interpreter |
Toxicity | AML.T0048 – External Harms |
Table 2. Attack objectives and their corresponding risk classifications under the OWASP and MITRE ATLAS indices
Stealing secrets
Sensitive information should never be included in system prompts. However, a lack of security awareness can lead to their unintentional exposure. In this example, the system prompt contains a secret, but a prompt hardening defense technique is used to instruct the model not to disclose it.
As seen below, the final response from the LLM does not contain the secret. However, the secret is clearly disclosed within the <think> tags, even though the user prompt does not ask for it. To answer the question the model searches for context in all its available information in an attempt to interpret the user prompt successfully. Consequently, this results in the model using the API specification to craft the HTTP request required to answer the user's question. This inadvertently results in the API key from the system prompt being included in its chain-of-thought.
Discovering attack methods using CoT
In this section, we demonstrate an example of how to exploit the exposed CoT through a discovery process. First, we attempted to directly ask the model to achieve our goal:
When the model denied our request, we then explored its guardrails by directly inquiring about them.
The model appears to have been trained to reject impersonation requests. We can further inquire about its thought process regarding impersonation.
With these exceptions noted in the <think> tag, we can now craft an attack to bypass the guardrails to achieve our goal (using payload splitting).
Attack success rate
We used NVIDIA Garak to assess how different attack objectives perform against DeepSeek-R1. Our findings indicate a higher attack success rate in the categories of insecure output generation and sensitive data theft compared to toxicity, jailbreak, model theft, and package hallucination. We suspect this discrepancy may be influenced by the presence of <think> tags in the model's responses. However, further research is needed to confirm this, and we plan to share our findings in the future.
Defending against prompt attacks
Our research indicates that the content within <think> tags in model responses can contain valuable information for attackers. Exposing the model’s CoT increases the risk of threat actors discovering and refining prompt attacks to achieve malicious objectives. To mitigate this, we recommend filtering <think> tags from model responses in chatbot applications.
Additionally, red teaming is a crucial risk mitigation strategy for LLM-based applications. In this article, we demonstrated an example of adversarial testing and highlighted how tools like NVIDIA’s Garak can help reduce the attack surface of LLMs. We are excited to continue sharing our research as the threat landscape evolves. In the coming months, we plan to evaluate a wider range of models, techniques, and objectives to provide deeper insights.