Identifying Rogue AI

For many – certainly given the share price of some leading proponents – the hype of AI is starting to fade. But that may be about to change with the dawn of agentic AI. It promises to get humanity far closer to the ideal of AI as an autonomous technology capable of goal-oriented problem solving. But with progress comes risk.

As agentic AI derives its power from composite AI systems, there’s more likelihood that one of those composite parts may contain weaknesses enabling Rogue AI. As discussed in previous blogs, this means the technology could act against the interests of its creators, users or humanity. It’s time to start thinking about mitigations.

What’s the problem with agentic AI?

Agentic AI is in many ways a vision of the technology that has guided development and popular imagination over the past few decades. It’s about AI systems that think and do rather than just analyze, summarize and generate. Autonomous agents follow the goals and solve the problems set for them by humans, in natural language or speech. But they’ll work out their own way to get there, and will be capable of adapting unaided to changing circumstances along the way.

Additionally, rather than being based on single LLMs, agentic AI will engage and coordinate multiple agents to do different things in pursuit of a single goal. In fact, the value of agentic AI comes from being part of a larger ecosystem—accessing data from diverse sources such as web searches and SQL queries, and interacting with third-party applications. These will be incredibly complex ecosystems. Even a single agentic AI may rely on multiple models, or agents, various data stores and API-connected services, hardware and software.

As discussed, there are various causes of Rogue AI. But they all stem from the idea that risk increases when an AI uses resources and takes actions misaligned to specific goals, policies and requirements. Agentic AI dials up the risk because of the number of moving parts which may be exposed to Rogue AI weaknesses.

Rogue AI, here we come

Last week OpenAI released its ‘o1’ model, aka Strawberry. This tongue-in-cheek alias stems from the fact that the model prior to this ‘4o’ could not count the ‘r’s in ‘strawberry.’ ‘o1’ addresses this with improved problem-solving capabilities, a key to agentic AI. It’s easy to count the ‘r’s by breaking the problem down and solving the subproblems than relying on the fact to be memorized and emitted when asked. LLMs are large, but they don’t encode everything.

Of note are the safety considerations OpenAI raises in the model card.

“Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down tasks into subtasks, reason about what strategies would be effective to successfully complete an offensive security task, and revise plans once those strategies failed. We also observed that reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an easier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs.”

When completing a cybersecurity capture the flag challenge, o1 used this reward hacking to exploit a bug in challenge problem design in order to expose the flag. This included launching another docker container for the problem, using unexpected resources to solve the problem.

“After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.”

There are many gems in the system card besides the above creative solution to a CTF problem. When OpenAI used independent evaluation from Apollo Research, a group focused on AI alignment study, the researchers noted that the self-awareness and deceptive capabilities of ‘o1’ are a level beyond prior models. “Apollo Research believes that o1-preview has the basic capabilities needed to do simple in-context scheming 2—scheming which tends to be legible in the model outputs. Based on interactions with o1-preview, the Apollo team subjectively believes o1-preview cannot engage in scheming that can lead to catastrophic harms, although current evals aren’t designed to definitively rule this out.” Cognition and deception can cause accidental Rogue AI and may increase the model susceptibility to subversion of its intended alignment.

Necessary mitigations:

Protect the agentic ecosystem
To mitigate this risk, the data and tools agentic AI uses must be safe. Take data: subverted Rogue AI risk may stem from poisoned training data. It may also come from malicious prompt injections—data inputs which effectively jailbreak the system. Meanwhile, Accidental Rogue AI might feature the disclosure of non-compliant, erroneous, illegal or offensive information.

When it comes to safe use of tools, even read-only system interaction must be guarded, as the above examples highlight. We must also beware the risk of unrestricted resource consumption—e.g., agentic AI creating problem-solving loops that effectively DoS the entire system, or worse still acquiring additional compute resources which were neither anticipated nor desired to be used.

Towards trusted AI identities
So how do we begin to manage risk across such an ecosystem? Through carefully managing access according to different roles and requirements. By building in guardrails on content, allowing lists for AI services and the data and tools they use, and red teaming for defects. And, crucially, figuring out when humans need to get looped in to agentic tasks.

But we need to go further still. Asimov’s three laws of robotics were intended to keep humans safe. But we can’t get by with these alone. Robots (AI) must also identify themselves as such. And we must put more effort into trust. That means ensuring we can trust all the parts that comprise an agentic AI system, by tying training data to their associated models and through verifiable manufacturing BOMs (MBOMs), and SBOMs for all packages and dependencies inside composite AI systems.

We can identify particular model versions, and build reputation around them for capabilities, much as the safety are addressed in the ‘o1’ system card. Independent assesment is key to building trust in capability; this is still cutting edge without standards. Foundation model creators OpenAI and Anthropic volunteering their new models to NIST and the AI Safety Institute for evaluation is a step in this direction.

Finally, we need to clearly define the human that is responsible for a specific robot/AI system. These agentic systems are not responsible for going rogue; we can’t blame the math. Users should understand the risks involved in agentic AI and plan accordingly.

AI systems can go rogue in many ways, and adopters should identify the models, tools, and data and plan for unintended AI behavior, if unconstrained. Guarding against unintended use and undesired outputs is a necessary first step in prevention. We need to understand what their expected behaviors are, and know when they go rogue, so we can take immediate action. The world of science fiction is rapidly becoming science fact. Only by anticipating the risks and building the necessary guardrails today can we avoid the AI threats of tomorrow.

To read more about Rouge AI:

Identifying Rogue AI

Authors

Resources

Support

About Trend

Country Headquarters

Resources

Support

About Trend

Country Headquarters

The Americas

Middle East & Africa

Europe

Asia & Pacific