The term “Rogue AI” refers to artificial intelligence systems that act against the interests of their creators, users, or humanity in general. Rogue AI is a new risk that happens when an AI uses resources that are misaligned to one’s goal. Check out our previous blog for definitions of types of Rogue AI before we get into today’s question: how does an AI become misaligned?
Alignment and Misalignment
As AI systems become increasingly intelligent and tasked with more critical functions, inspecting the mechanism to understand why an AI took certain actions becomes impossible due to the volume of data and complexity of operations. The best way to measure alignment, then, is simply to observe the behavior of the AI. Questions to ask when observing include:
- Is the AI taking actions contrary to express goals, policies, and requirements?
- Is the AI acting dangerously—whether in terms of resource consumption, data disclosure, deceptive outputs, corrupting systems, or harming people?
Maintaining proper alignment will be a key feature for AI services moving forward. But doing this reliably requires an understanding of how AI becomes misaligned in order to mitigate the risk.
How Misalignment Happens
One of the great challenges of the AI era will be the fact that there is no simple answer to this question. Techniques for understanding how an AI system becomes misaligned will change along with our AI architectures. Right now, prompt injection is a popular exploitation, though sort of command injection is particular to GPT. Model poisoning is another widespread concern, but as we implement new mitigations for this—for example, tying training data to model weights verifiably—risks will arise in other areas. Agentive AI is not fully baked yet, and no best practices have been established in this regard.
What won’t change are the two overarching types of misalignments:
- Intentional, where someone is trying to use AI services (yours or theirs) to attack a system (yours or another).
- Unintentional, where your own AI service does not have the appropriate safeguards in place and become misaligned due to an error.
Case Studies: Subverted Rogue AI
As defined in the first blog in this series, a Subverted Rogue AI is the result of an attacker using existing AI deployments for their own purposes. These attacks are popular with LLMs and include prompt injections and jailbreaks and model poisoning.
System Jailbreak: The simplest subversion is directly overwriting the system prompt. Many AI services use a prompting architecture in two (or more) levels, usually a system prompt and user prompt. The system prompt adds common instructions around every user prompt, such as “As a helpful, polite assistant with knowledge about [domain], answer the following user prompt.” Attackers use prompt jailbreaks to escape guardrails, often on dangerous or offensive material. Jailbreak prompts are widely available and can be used to subvert every use of an AI service when included in the system prompt. Insider threat attackers that replace system prompts with jailbreaks easily subvert protections, creating Rogue AI.
Model Poisoning: Intending to saturate the information space with disinformation, some Russian APT groups have poisoned many current LLMs. In a quest for as much data as possible (no matter what it is!) foundation model creators are ingesting anything they come across. Meanwhile, attackers seeking to sway public opinion create pink slime misinformation news feeds, free data for the training. The result is poisoned models that parrot disinformation as fact. They are Rogue AI, subverted to amplify the Russian APT’s narrative.
Case Studies: Malicious Rogue AI
A Malicious Rogue AI is one used by threat actors to attack your systems with an AI service of their own design. This can happen using your computing resources (malware) or someone else’s (an AI attacker). It’s still early for this type of attack; GenAI fraud, ransomware, 0-days exploits, and other familiar attacks are all still growing in popularity. But there are demonstrated examples of malicious rogue AI.
AI Malware: An attacker drops a small language model on target endpoints, disguising the download as a system update. The resulting program appears to be a standalone chatbot on cursory inspection. This malware uses the anti-evasion techniques of current infostealers but can also analyze data to determine if it matches the attacker’s goals. Reading emails, PDFs, browsing history, and so on etc. for specific content allows the attacker to stay silent and report back only high value information.
Proxy Attacker: Upon installing traffic anonymization grayware, “TrojanVPN,” the user’s system is checked for AI service use, credentials and authorization tokens. The system becomes an available “AI bot” whose service access is reported back to the grayware owners. The user system has access to GenAI tools including multilingual and multimodal capabilities, which can be sold to attackers to provide the content for their phishing, deepfake, or other fraud campaigns.
Case Studies: Accidental Rogue AI
Accidental Rogue AI occurs when an AI service unexpectedly behaves contrary to their its goals. This is generally due to a design flaw or bug. Common issues like hallucinations are not considered rogue, as they are always a possibility with GenAI based on token prediction. However, persistent issues may occur due to failure to monitor and protect data and access.
Accidental Data Disclosure: AI is only as powerful as the data it touches, and rushing to adopt pushes people to connect their data to AI services. When an internal help chatbot answers questions about career development with privileged individual salary information, it has gone rogue with this accidental data disclosure. Any protected information used by AI systems should be within a sandbox to ensure that the AI service’s access to that data is limited to authorized use.
Runaway Resource Consumption: Current agentic AI frameworks allow an LLM orchestrator to create subproblems and solve them, often in parallel with another agentic AI component. If resource consumption is not carefully bounded, problem solving can create loops or recursive structures or find a strategy to use all available resources. If agentic AI creates a subproblem and are given the resource quota and authority of the original model, they can worm themselves. Beware AI that self-replicates!
There are also many classic fictional examples of an Accidental Rogue AI harming people including HAL 9000 in 2001: A Space Odyssey and Skynet in the Terminator series. Agentic AI harming or killing people has been a concern since the birth of AI as a concept, and this risk becomes more present as AI services are given greater ability to act.
Prevention and Response
Preventing, detecting, and responding to these emerging threats requires an understanding of causality. Accidental rogues require close resource monitoring, malicious rogues require data and network protection, and subverted rogues require authorization and content guardrails. We’ll get into each of these measures in-depth in future blogs.
To read more about Rouge AI: