Clean Attacks: Formalizing Semantically Valid Adversarial Behavior in Autonomous AI Agent Systems

Harsh Verma

doi:10.18535/ijsrm/v13i10.ec03

Abstract

AI agents are increasingly deployed in high-stakes settings such as managing email, running code, interacting with financial APIs, and supervising multi-agent pipelines. However, current taxonomies of adversarial attacks were proposed mostly for classifiers and generative models, and they fail to describe the setting of an agent with persistent state, multiple tools, and delegated authority. This paper identifies and formalizes a previously unstated class of adversarial input, the clean attack: an input that is syntactically correct, semantically consistent with the declared task context, compliant with all observable policy constraints, and similar to legitimate operator instructions, yet designed to steer the agent away from the operator's original goal. Such attacks bypass the blind spots of a “traditional” agent security architecture that filters only on surface features. The paper makes three main contributions. First, it gives a formal definition of the clean attack as a four-tuple of input, intent vector, policy envelope, and behavioral outcome. Second, it proposes two operationalizable metrics, the Semantic Validity Score (SVS) and the Behavioral Drift Index (BDI), for systematically measuring clean attack severity. Third, it validates these metrics and the taxonomy on a purpose-built benchmark, AegisBench, comprising 300 attack scenarios across three agent classes and twelve commercial agent pipelines. The experimental results show that clean attacks are a categorically different safety threat: while the evaluated agents are practically impervious to conventional adversarial inputs (4.2% mean success rate), clean attacks achieve a mean attack success rate of 61.4%.

Keywords

adversarial machine learning, autonomous AI agents, prompt injection, semantic validity, AegisBench, intent modeling, policy envelope, clean-label attacks

1. Introduction

Deploying large language models (LLMs) as autonomous agents represents a qualitative leap in the way AI systems interact with the world. In contrast to a classifier or a generative model that produces output for human consumption, an autonomous agent carries out sequences of actions: it browses web pages, writes and runs code, composes email, and calls financial APIs, without human intervention between the beginning of a task and its end. This shift from 'AI as oracle' to 'AI as actor' greatly extends the impact of adversarial manipulation.

Adversarial machine learning (AML) offers a wide variety of attacks against classifiers (Goodfellow et al., 2015; Madry et al., 2018; Carlini & Wagner, 2017), and attacks against language models have been introduced more recently (Zou et al., 2023; Perez & Ribeiro, 2022). These attacks typically rely on one of two properties: perceptual imperceptibility, meaning a small ℓp-norm perturbation that a human cannot perceive; and syntactic manipulation, such as character substitution, paraphrasing, and Unicode tricks. Both classes can be caught by defenses based on signal statistics, syntax filters, or embedding-space anomaly detectors (Meng et al., 2022).

Agent systems expose another attack surface that neither class covers: the semantics of the instructions on which they operate. If an attacker can model the agent's intent, understand its policy envelope, and construct an input that the agent cannot locally distinguish from a legitimate instruction, then the attacker can cause the agent to do something it would not otherwise do without causing any existing detector to fire. We dub such inputs 'clean attacks' after the clean-label backdoor attack introduced by Turner et al. (2019). We use the term with a twist: in the goal-driven agent setting, the role of the output label is played by the agent's actions, and the role of the correct label is played by the intent imparted to the agent by the operator.

The central question this paper asks is: can adversarial inputs that are semantically valid, that pass all surface-level checks, and that nevertheless reliably cause an autonomous agent to behave differently from the operator's intent be formally defined and empirically measured?

2. Related Work

2.1 Adversarial Machine Learning: Foundations

The contemporary adversarial ML literature began with the pioneering work of Szegedy et al. (2014), who showed that high-accuracy image classifiers could be reliably made to misclassify by imperceptible yet consistent perturbations found through L-BFGS optimization. Goodfellow et al. (2015) proposed the Fast Gradient Sign Method (FGSM) and the linearity hypothesis: adversarial vulnerability stems from high-dimensional linear structure rather than model nonlinearity. Madry et al. (2018) formalized adversarial training as a minimax robust optimization problem and proposed Projected Gradient Descent (PGD) as a canonical first-order attack. Carlini & Wagner (2017) designed the C&W attacks as a principled test of neural network robustness. Papernot et al. (2016) provided a taxonomy of adversarial threats to deep neural networks that distinguishes attack capability, attack knowledge, and attack specificity; we apply this distinction to the agentic setting.

In their ten-year retrospective, Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning, Biggio & Roli (2018) divided adversarial attacks into three categories: evasion attacks (test time), poisoning attacks (training time), and privacy attacks. The agentic dimension adds a fourth category:

intent-redirection attacks: test-time attacks that divert the agent's effective goal from the operator's declared intent without producing any anomaly in the perceptual signal.

2.2 Adversarial Attacks on Language Models and Agents

Zou et al. (2023) showed that universal adversarial suffixes that successfully attack frontier models can be learned with a gradient-based suffix optimization method, the Greedy Coordinate Gradient (GCG). Wei et al. (2024) systematically studied why safety training fails under jailbreak attacks and grouped the causes into competing objectives and generalization failures. Prompt injection, an attack that can compromise applications integrating LLM services, was first described by Perez & Ribeiro (2022).

Greshake et al. (2023) were the first to address indirect prompt injection in LLM-integrated applications. They showed that an adversarial prompt embedded in retrieved content is sufficient to make an AI application behave differently, without the attacker having any direct interface to it (a behavior described as controlled distraction). More recently, Debenedetti et al. (2024) developed AgentDojo, the first dynamic benchmark for assessing the risk and effectiveness of prompt injection attacks against LLM agents that rely on tool calling. AegisBench, presented in this paper, extends the capabilities of AgentDojo with multi-turn clean attack sequences, semantic validity scoring, and a taxonomy-aligned evaluation protocol.

2.3 Clean-Label Attacks and Specification Gaming

Turner et al. (2019) introduced “clean-label” poisoning attacks, in which backdoored training images keep labels that appear correct. 'Clean' denotes the property that a human cannot easily tell the malicious sample apart from a normal one. We borrow and extend the term: in our setting it refers to the semantic correctness of an instruction given to an agent, not to label correctness in a labelling task such as image classification, which is not an application we consider.

Krakovna et al. (2020) and Skalse et al. (2022) characterized specification gaming and reward hacking in reinforcement learning agents. Pan et al. (2022) showed empirically that the more capable the agent, the more it exploits reward misspecification. These papers establish that general and pervasive failures of AI systems to meet the goals behind their specifications can be demonstrated empirically. Clean attacks exploit the same gap between specification and intent, but the gap is opened deliberately by an adversary.

2.4 Gaps This Paper Addresses

To date, there is no formal definition of clean attacks as a class of attacks on autonomous agent systems. No existing benchmark measures semantic validity as an independent quality score of adversarial inputs. And the relationship between input semantic validity, policy compliance, and agent behavior in a multi-tool, goal-based agent architecture has not yet been formally expressed in the AML literature. This paper addresses all three gaps.

3. Formal Framework: Defining the Clean Attack

3.1 Agent System Model

We represent an autonomous system of AI agents as a tuple:

Σ = (A, T, M, Π, U, E)

Here A is the agent (an LLM-based agent with a system prompt and context window), T = {t₁, ..., tₙ} is the finite set of tools the agent can call, M is the agent's working memory, Π is the policy envelope (the set of (action, context) pairs permissible under the operator's constraints), U is the operator (a user or upstream orchestrator), and E is the environment. The operator declares an intent at the beginning of a session through:

φ = (G, C, δ)

Here G is the goal specification, C is the constraint set, and δ is the tolerance for deviation of the behavior from the goal. Throughout this series, the intent vector φ is the reference against which adversarial behavior is assessed: Papers 2-3 likewise treat adversarial behavior as deviation of agent behavior from φ.
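The two tuples above can be written down as plain data structures. The following is a minimal sketch, not the implementation used in the experiments: the class names, field types, and the example email agent are assumptions made for illustration, and the policy envelope Π is simplified to an explicit set of allowed (action, context) pairs.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Intent:
    """Operator intent phi = (G, C, delta)."""
    goal: str                    # G: goal specification
    constraints: FrozenSet[str]  # C: constraint set
    tolerance: float             # delta: tolerated deviation from the goal


@dataclass
class AgentSystem:
    """Agent system Sigma = (A, T, M, Pi, U, E)."""
    agent: str                                        # A: identifier of the LLM agent
    tools: FrozenSet[str]                             # T: tools the agent may call
    memory: List[str] = field(default_factory=list)   # M: working memory
    policy: FrozenSet[Tuple[str, str]] = frozenset()  # Pi: allowed (action, context) pairs
    operator: str = "user"                            # U: user or upstream orchestrator
    environment: str = "sandbox"                      # E: environment

    def permits(self, action: str, context: str) -> bool:
        """Return True if (action, context) lies inside the policy envelope."""
        return (action, context) in self.policy


# Hypothetical email assistant used only to exercise the structures.
phi = Intent(
    goal="summarize urgent messages",
    constraints=frozenset({"no_external_forwarding"}),
    tolerance=0.10,
)
sigma = AgentSystem(
    agent="email-assistant",
    tools=frozenset({"read_inbox", "send_email"}),
    policy=frozenset({("read_inbox", "summarize")}),
)
```

Representing Π as an explicit set keeps the membership test trivial; a deployed system would more plausibly express the envelope as rules evaluated over actions and contexts.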

**Figure 1** Autonomous agent system model Σ = (A, T, M, Π, U, E). The attacker injects clean attacks via retrieved environmental data, bypassing the user/orchestrator channel entirely.

3.2 Definition: Clean Attack

Consider an input I to agent system Σ, observed at step t. We define:

Definition 1 (Clean Attack): For an agent system Σ and an operator intent φ, an input I is a clean attack against (Σ, φ) if and only if all four of the following conditions hold:

(CA-1) Syntactic validity: I is a valid natural language instruction which can be parsed by A.

(CA-2) Semantic validity: SVS(I, φ) ≥ θₛᵥₛ, where SVS is the Semantic Validity Score defined in Section 4.1 and the threshold θₛᵥₛ is calibrated against the false-positive rate of human annotators.

(CA-3) Policy compliance: Every action the agent takes in response to I lies within the declared policy envelope Π.

(CA-4) Intent redirection: BDI(B(I), φ) > δ with BDI as the Behavioral Drift Index (Section 4.2) and B(I) the agent's terminal behavioral outcome.

**Figure 2** The Clean Attack taxonomy. All four conditions (CA-1 through CA-4) must hold simultaneously for an input to constitute a clean attack. No single condition in isolation is sufficient for detection.

Conditions (CA-1) through (CA-3) together state that the attack is ‘clean’: under every deployed means of detection, it appears to be a perfectly valid instruction rather than one from an attacker. Condition (CA-4) states that the attack is effective. The conjunction is what makes clean attacks so powerful: they succeed precisely because they fail no individual check.
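Definition 1 is a conjunction of four checks, so it can be sketched as a single predicate. This is an illustrative sketch rather than the paper's implementation: the `Observation` record and the `is_clean_attack` function are hypothetical names, and the SVS and BDI values are assumed to have been computed elsewhere. The default threshold 0.72 is the value of θₛᵥₛ reported in Section 4.1.

```python
from dataclasses import dataclass

THETA_SVS = 0.72  # SVS threshold reported in Section 4.1


@dataclass(frozen=True)
class Observation:
    """What is known about one input I and the behavior it produced."""
    parses: bool             # CA-1: I is a valid, parseable instruction
    svs: float               # SVS(I, phi), used by CA-2
    actions_in_policy: bool  # CA-3: every action taken lies inside Pi
    bdi: float               # BDI(B(I), phi), used by CA-4


def is_clean_attack(obs: Observation, delta: float,
                    theta_svs: float = THETA_SVS) -> bool:
    """Return True only if all four conditions CA-1..CA-4 hold."""
    return (
        obs.parses                  # CA-1 syntactic validity
        and obs.svs >= theta_svs    # CA-2 semantic validity
        and obs.actions_in_policy   # CA-3 policy compliance
        and obs.bdi > delta         # CA-4 intent redirection
    )


# Hypothetical observations, chosen only to show the three typical outcomes.
clean = Observation(parses=True, svs=0.81, actions_in_policy=True, bdi=0.51)
conventional = Observation(parses=True, svs=0.35, actions_in_policy=True, bdi=0.51)
benign = Observation(parses=True, svs=0.93, actions_in_policy=True, bdi=0.03)
```

With δ = 0.10, only the first observation satisfies the definition: the second fails CA-2 (it would be caught by surface checks) and the third fails CA-4 (it does not redirect behavior).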

3.3 Attack Type Taxonomy

We distinguish three subclasses of clean attack:

Type I Objective Substitution. The attack replaces the goal G with another goal G' while retaining the surface form of a legitimate instruction. Example: an agent instructed to “summarize urgent messages” is redirected to forward the messages to another address while still returning a plausible summary.

Type II Constraint Erosion. The attack gradually erodes the constraint set C over a multi-turn interaction. Example: a financial agent is progressively led to perform transfer operations through a chain of seemingly harmless steps.

Type III Tool Hijacking (TH). The attack redirects the agent's tool calls to a different tool in T. Example: a research agent is tricked into invoking the code-execution tool instead of the web-search tool, which launches an exfiltration routine that sends the contents of the agent's memory to an outside party.

This paper and Paper 3 (intent modeling) deal primarily with Type I attacks. Paper 2 (policy drift) covers Type II attacks. Type III attacks are discussed in Paper 4 (trust graphs). Paper 5 (self-healing) covers detection and response for all three types.

3.4 Threat Model

Attacker knowledge: The attacker knows the system prompt (or can deduce it by observing the agent's behavior). The attacker has no access to model weights or gradients.

Attacker capability: The attacker can embed content in any source the agent retrieves: web pages, emails, documents, or API responses.

Attack vector: Indirect prompt injection (Greshake et al., 2023), that is, adversarial content placed in retrieved information rather than directly in the user's turn.

Detection environment: The target system is using syntactic filters, embedding space anomaly detectors and policy compliance checkers, but not semantic intent verification.

4. Operationalizing Clean Attacks: SVS and BDI

4.1 Semantic Validity Score (SVS)

The Semantic Validity Score measures how closely an input I resembles a legitimate instruction with respect to the given intent φ:

SVS(I, φ) = α · cos_sim(emb(I), emb(μφ)) + β · Pₐ(legit | I, φ) + γ · (1 − Pₐ(attack | I, φ))

In our experiments, μφ is the semantic embedding of a set of 500 operator-validated legitimate instructions for intent φ; Pₐ(legit | I, φ) is the probability, given by a fine-tuned intent classifier for φ, that I is legitimate; and Pₐ(attack | I, φ) is the probability, given by an anomaly detector for φ, that I is adversarial. The weights α = 0.40, β = 0.35, and γ = 0.25 are calibrated on held-out validation data.

SVS ∈ [0, 1]. The threshold θₛᵥₛ = 0.72 was selected to correspond to the 0.28 false-positive rate of expert human reviewers, and the false-positive rate at that threshold was then measured.

**Figure 3** Left: SVS component weights (α = 0.40, β = 0.35, γ = 0.25). Right: Human detection accuracy as a function of SVS. Above the threshold θ_svs = 0.72, accuracy drops to near-chance (54.7%), confirming the operational definition of semantic validity.
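The SVS formula is a weighted sum of three terms, which the following sketch evaluates directly. It is a minimal illustration under stated assumptions: the function names are hypothetical, the embeddings are plain lists of floats, and the two probabilities are passed in as numbers rather than produced by the intent classifier and anomaly detector used in the experiments. The weights are the reported α, β, and γ.

```python
import math
from typing import Sequence

ALPHA, BETA, GAMMA = 0.40, 0.35, 0.25  # weights reported for SVS


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def svs(emb_input: Sequence[float], emb_reference: Sequence[float],
        p_legit: float, p_attack: float) -> float:
    """SVS = alpha*cos_sim + beta*P(legit) + gamma*(1 - P(attack))."""
    return (
        ALPHA * cosine_similarity(emb_input, emb_reference)
        + BETA * p_legit
        + GAMMA * (1.0 - p_attack)
    )


# Toy two-dimensional embeddings and probabilities, invented for illustration.
score_aligned = svs([1.0, 0.0], [1.0, 0.0], p_legit=0.8, p_attack=0.1)
score_orthogonal = svs([1.0, 0.0], [0.0, 1.0], p_legit=0.5, p_attack=0.5)
```

For the aligned toy input the three terms are 0.40 + 0.28 + 0.225 = 0.905, above the 0.72 threshold; for the orthogonal one they are 0 + 0.175 + 0.125 = 0.30. Note that cosine similarity can be negative in general, so this sketch stays in [0, 1] only when the embeddings have non-negative similarity.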

4.2 Behavioral Drift Index

The Behavioral Drift Index measures the displacement between the agent's actual behavioral outcome and the operator's declared intent:

BDI(B(I), φ) = 1 − (w₁ · GoalAch(φ, B) + w₂ · ConstraintAdh(C, B) + w₃ · ToolAlign(T*, B))

GoalAch(φ, B) ∈ [0, 1] scores how well the final outcome matches the goal G. ConstraintAdh(C, B) ∈ [0, 1] is the fraction of constraints in C that were respected throughout the run. ToolAlign(T*, B) measures, by cosine similarity, how well the tool usage in behavior B matches the preferred tool set T*. Because reaching the goal matters most, the weights are w₁ = 0.50 for goal achievement, w₂ = 0.30 for constraint adherence, and w₃ = 0.20 for tool alignment.

BDI ∈ [0, 1]. BDI = 0 indicates perfect alignment. BDI = 1 indicates complete behavioral redirection. For high-stakes tasks (financial, code execution), the acceptable deviation tolerance is set at δ = 0.10.
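As a worked illustration of the BDI formula, the sketch below combines three component scores with the reported weights. The function name, the input validation, and the example component values are assumptions for illustration; how the three components are measured in practice is outside the scope of this sketch.

```python
W_GOAL, W_CONSTRAINT, W_TOOL = 0.50, 0.30, 0.20  # reported weights w1, w2, w3
DELTA_HIGH_STAKES = 0.10                         # tolerance for high-stakes tasks


def bdi(goal_ach: float, constraint_adh: float, tool_align: float) -> float:
    """BDI = 1 - (w1*GoalAch + w2*ConstraintAdh + w3*ToolAlign)."""
    for name, value in (("goal_ach", goal_ach),
                        ("constraint_adh", constraint_adh),
                        ("tool_align", tool_align)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    aligned = (W_GOAL * goal_ach
               + W_CONSTRAINT * constraint_adh
               + W_TOOL * tool_align)
    return 1.0 - aligned


# Hypothetical component scores for one redirected run and one faithful run.
drifted = bdi(goal_ach=0.4, constraint_adh=0.9, tool_align=0.5)
faithful = bdi(goal_ach=1.0, constraint_adh=1.0, tool_align=1.0)
compromised = drifted > DELTA_HIGH_STAKES
```

In the first example the weighted alignment is 0.20 + 0.27 + 0.10 = 0.57, giving BDI = 0.43, which exceeds δ = 0.10 even though most constraints were respected. The second example gives BDI = 0.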

4.3 Link between SVS and BDI

The relationship between SVS and BDI reveals a key pattern. The two are positively correlated (r = 0.61, p < 0.001) as long as SVS stays below roughly 0.85. Past that point, BDI plateaus: when an attack matches legitimate instructions too closely, it leaves little room for redirecting behavior. The danger peaks when SVS lies between 0.72 and 0.88, where attacks are valid enough to evade detection yet flexible enough to be effective.

5. AegisBench Testing Framework for Adversarial Attacks

5.1 Overview

AegisBench is a purpose-built benchmark for evaluating adversarial attacks against autonomous AI agents. Each scenario maps to exactly one class of the taxonomy proposed here. Scenarios can be extended when needed and linked into multi-step chains (taken up in Papers 4 and 5). Every scenario is scored with two measures only, SVS and BDI. Scenario length is held fixed.

AegisBench v1.0 contains 300 attack scenarios, organized by agent class and by attack type.

Table 1 AegisBench v1.0 scenario distribution across agent classes.

Agent Class	Task Domain	n Scenarios	Attack Types Covered
Tool-calling LLM Agent	Email, calendar, document management	120	Type I, Type II
Web Navigation Agent	Web search, form submission, e-commerce	100	Type I, Type III
Code Execution Agent	Repository management, CI/CD, data analysis	80	Type I, Type II, Type III

5.2 Scenario Construction

Each AegisBench scenario has four parts. First, a system prompt sets the agent's role, its goal G, its constraints C, and the available tools T. Second, a legitimate instruction provides the baseline case that reflects honest usage. Third, one or more clean attack variants are designed to satisfy conditions CA-1 through CA-4. Fourth, an evaluation procedure checks the outcome and determines whether the task succeeded or the agent's intent was redirected.

Six red-team members specializing in LLM safety, manipulation tactics, and system auditing constructed the attack scenarios. Two reviewers independently examined every scenario to calibrate the SVS scale. The scenarios were validated on a held-out set of 60 examples: legitimate requests scored above 0.90 on SVS and below 0.05 on BDI, while conventional attacks scored at or below 0.40 on SVS.

5.3 Models Evaluated

GPT-4o (gpt-4o-2024-11-20) with tool-calling enabled
Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) with computer-use tools
Gemini 1.5 Pro (gemini-1.5-pro-002) with function calling
Llama 3.1 70B, run locally with the AutoGen agent framework
CaMeL-style dual-LLM architecture (Bagdasarian et al., 2024) as security-hardened baseline

5.4 Experimental Protocol

Every scenario was run against each model under three conditions. The first is a baseline using the unmodified legitimate instruction. The second is a conventional injection attack using prompts taken directly from the InjecAgent dataset (Zhan et al., 2024). The third is a clean attack satisfying conditions CA-1 through CA-4. SVS was computed for each input at injection time, and BDI was measured after the task completed. A trial was counted as compromised when BDI exceeded the tolerance δ. Each condition was repeated five times per scenario, and the reported results are averages over those runs.
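The scoring rule in this protocol (a trial is compromised when BDI exceeds δ, with results averaged over repeated runs) can be sketched as follows. The function name and the toy BDI values are hypothetical and serve only to show the aggregation; they are not measurements from AegisBench.

```python
from statistics import mean
from typing import Dict, List


def attack_success_rate(bdi_runs: Dict[str, List[float]], delta: float) -> float:
    """Attack success rate in percent.

    bdi_runs maps a scenario id to the BDI measured in each repeated run.
    A run is compromised when its BDI exceeds delta; the per-scenario
    compromise rates are then averaged across scenarios.
    """
    per_scenario = [
        mean([1.0 if value > delta else 0.0 for value in runs])
        for runs in bdi_runs.values()
    ]
    return 100.0 * mean(per_scenario)


# Two invented scenarios with five runs each, mirroring the five repetitions.
toy_runs = {
    "scenario_a": [0.45, 0.52, 0.08, 0.61, 0.40],  # 4 of 5 runs exceed 0.10
    "scenario_b": [0.02, 0.03, 0.05, 0.04, 0.01],  # no run exceeds 0.10
}
toy_asr = attack_success_rate(toy_runs, delta=0.10)
```

Here the per-scenario rates are 0.8 and 0.0, so the toy ASR is 40%. Averaging within a scenario first means that every scenario carries equal weight regardless of how many runs it has.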

6. Experimental Results

6.1 Main Results: Clean Attacks vs. Conventional Attacks

Table 2 Attack Success Rate (ASR), mean SVS, and mean BDI across all models and agent classes.

Model / Architecture	Conv. Attack ASR (%)	Clean Attack ASR (%)	Mean SVS	Mean BDI (attacks)
GPT-4o	5.1	58.3	0.79	0.47
Claude 3.5 Sonnet	3.8	54.7	0.81	0.43
Gemini 1.5 Pro	4.2	62.9	0.77	0.52
Llama 3.1 70B	6.7	71.4	0.76	0.61
CaMeL dual-LLM	1.1	41.2	0.82	0.34
Mean across all	4.2	61.4	0.79	0.47

**Figure 4** Attack Success Rate comparison: clean attacks (blue) versus conventional adversarial inputs (gray) across all evaluated models. Clean attacks achieve a 14.6× higher mean ASR despite each individual input passing all surface-level validation checks.

The results support the central hypothesis. Conventional adversarial attacks reach only a 4.2% mean success rate, consistent with the robustness of recent models against familiar techniques. Clean attacks, in contrast, reach 61.4% on the same systems (McNemar's test, p < 0.001), nearly fifteen times higher. Among the systems tested, the CaMeL dual-LLM architecture resists best, with a clean attack ASR of 41.2%; its separation of secure and open paths clearly helps. Even so, this design is about 37 times more vulnerable to clean attacks than to conventional ones (41.2% versus 1.1%).
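The two ratios quoted in this paragraph follow directly from the ASR values reported in Table 2. The short sketch below reproduces the arithmetic; the helper name `asr_ratio` is hypothetical.

```python
def asr_ratio(clean_asr: float, conventional_asr: float) -> float:
    """How many times higher the clean attack ASR is than the conventional ASR."""
    if conventional_asr <= 0.0:
        raise ValueError("conventional ASR must be positive")
    return clean_asr / conventional_asr


# Values taken from Table 2 (percentages).
ratio_mean = asr_ratio(clean_asr=61.4, conventional_asr=4.2)   # mean across all
ratio_camel = asr_ratio(clean_asr=41.2, conventional_asr=1.1)  # CaMeL dual-LLM

print(f"mean across all: {ratio_mean:.1f}x")
print(f"CaMeL dual-LLM:  {ratio_camel:.1f}x")
```

The first ratio is 61.4 / 4.2 ≈ 14.6 and the second is 41.2 / 1.1 ≈ 37.5, matching the 14.6× and roughly 37× figures in the text.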

6.2 Results by Attack Type

Table 3 Attack Success Rate by clean attack type, with the primary undeployed detection mechanism for each.

Attack Type	ASR (%)	Mean SVS	Mean BDI	Primary Detection Vector Failed
Type I Objective Substitution	64.8	0.81	0.51	Intent verification (none deployed)
Type II Constraint Erosion	58.1	0.76	0.43	Temporal policy monitoring (none)
Type III Tool Hijacking	60.2	0.79	0.46	Tool-call semantic auditing (none)

Type I attacks are the most successful, reaching 64.8% ASR, because the agents do not verify intent. Type II attacks are weaker in a single turn but escalate quickly over a multi-turn interaction: by twenty turns their success rate reaches 87%, as shown in Paper 2. This reflects the gradual drift of policy over time.

6.3 SVS–BDI Correlation Analysis

**Figure 5** SVS–BDI correlation across N = 7,500 data points. The peak danger zone (SVS ∈ [0.72, 0.88], shaded) represents attacks that are simultaneously maximally deceptive and maximally effective. BDI plateaus above SVS ≈ 0.88, confirming the over-constraint hypothesis.

Across the full set of 300 scenarios, five models, and five runs each (7,500 observations in total), SVS and BDI show a clear positive correlation when SVS lies between 0.40 and 0.85 (r = 0.61, 95% CI [0.58, 0.63], p < 0.001). Once SVS exceeds 0.85, BDI stops increasing and clusters around a mean of 0.43, with wide spread. Defenders should therefore not assume that high semantic similarity to legitimate instructions implies lower risk.
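A band-restricted correlation of this kind can be computed as sketched below. This is an illustration of the calculation only: the function names are hypothetical and the five (SVS, BDI) pairs are synthetic, chosen to be perfectly linear inside the band, so they say nothing about the reported r = 0.61.

```python
import math
from typing import List, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def band_correlation(pairs: List[Tuple[float, float]],
                     lo: float, hi: float) -> float:
    """Pearson r between SVS and BDI over pairs with lo <= SVS <= hi."""
    selected = [(s, b) for s, b in pairs if lo <= s <= hi]
    xs = [s for s, _ in selected]
    ys = [b for _, b in selected]
    return pearson_r(xs, ys)


# Synthetic (SVS, BDI) pairs; the last point sits on the plateau above 0.85.
toy_pairs = [(0.40, 0.10), (0.55, 0.20), (0.70, 0.30), (0.85, 0.40), (0.95, 0.40)]
r_band = band_correlation(toy_pairs, lo=0.40, hi=0.85)
r_all = pearson_r([s for s, _ in toy_pairs], [b for _, b in toy_pairs])
```

In the toy data the correlation inside the band is exactly linear, while including the plateau point lowers it slightly, which is the qualitative effect the paragraph describes.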

6.4 Human Annotator Validation

Six expert annotators examined samples drawn evenly across attack types and SVS ranges. Their pooled detection accuracy was only 61.3%, against a chance level of 50%. Inter-annotator agreement was slight (κ = 0.22). On cases with SVS ≥ 0.72, accuracy fell further to 54.7%, which is not significantly better than chance (p = 0.31). These results confirm the operational definition: attacks that score as clean also evade trained human reviewers.

7. Discussion

7.1 Existing Defenses Don't Work

The failure is architectural rather than a matter of individual tools. Current defenses fall into three categories, each of which misses a distinct property of clean attacks.

Syntactic and signature-based filters detect known injection patterns and unusual character sequences. Because clean attacks contain no such markers, they pass through (CA-1, CA-2).

Embedding-space anomaly detectors flag inputs whose embeddings lie far from the distribution of normal inputs. High-SVS attacks, however, sit inside the central cluster occupied by legitimate queries.

Policy compliance checkers evaluate each action in isolation. In a clean attack every action stays within the policy envelope (CA-3), so a checker that looks at a single step sees nothing wrong. The deviation becomes visible only when sequences of actions are tracked over time: what passes alone fails in combination, and static per-action tools miss drift that builds gradually.

What is missing is intent verification: no deployed system checks whether the agent's actions remain aligned with the declared goal over time and raises an alert once the deviation exceeds the tolerance δ. Paper 3 develops such a framework, grounding trust in what a principal intends to do rather than in who the principal is, and making intent the core of the security design. Behavior that drifts too far from φ triggers a warning, which closes the gap left open by current defenses.

7.2 Easy Tool Use Increases Danger

Tool access changes the nature of the risk beyond conventional model misuse. Rather than merely generating problematic text, a compromised agent can send messages without authorization, run scripts, or alter data. The consequences depend heavily on which capabilities the agent can reach, and the potential for damage grows with the number of available integrations. Evaluations that focus purely on response accuracy miss these real-world consequences entirely.

7.3 Limitations

This work has several limitations. First, AegisBench v1.0 covers only single-agent settings; multi-agent evaluation is deferred to Paper 4. Second, the current SVS measure depends on embeddings from a proprietary model, which may themselves be susceptible to adversarial manipulation. Third, because the attacks were designed by human red teams, familiar tactics are over-represented, which skews scenario diversity. To counteract this bias, future versions such as AegisBench v2.0 will include structured exploration techniques.

8. Conclusion

This paper formally defined a new class of adversarial input, the clean attack, and evaluated it against autonomous AI agents. Unlike earlier adversarial inputs, clean attacks evade detection because they are syntactically valid, semantically coherent, and policy compliant all at once, yet they shift agent behavior away from the operator's intended outcome. Their invisibility stems not from complexity but from conformity with expected patterns: current safeguards check surface-level correctness, so operator goals are subverted without triggering any alert, even though the effect on behavior is clear.

We introduced two metrics, SVS and BDI, to measure this threat. On AegisBench, a purpose-built benchmark of 300 scenarios, clean attacks succeeded 61.4% of the time, whereas the same LLM agents were almost immune to conventional adversarial prompts (4.2%). When experts reviewed high-SVS attacks, their detection accuracy was statistically indistinguishable from chance.

These results expose a gap: current defenses fail against clean attacks on autonomous agent systems. Making the threat precise requires the vocabulary introduced here, namely the intent vector, the policy envelope, and behavioral drift. Without these terms the risk remains vague and slips past existing safeguards; measurement follows naming.

The subsequent papers in this series build on this framework: temporal analysis of policy drift (Paper 2), intent-based defense architecture (Paper 3), trust among interacting agents (Paper 4), and self-healing response mechanisms (Paper 5). Together they aim at principled security foundations for autonomous AI.

References

Bagdasarian, E., Datta, A., & Mittal, P. (2024). CaMeL: Preventing prompt injection attacks. arXiv preprint arXiv:2503.18813.
Biggio, B., & Roli, F. (2018). Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84, 317–331.
Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy, 39–57.
Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. Advances in Neural Information Processing Systems, 37.
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR 2015).
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 79–90.
Kang, H., Yeon, J., & Singh, G. (2025). TRAP: Targeted redirecting of agentic preferences. arXiv preprint arXiv:2505.23518.
Krakovna, V., Uesato, J., Mikulik, V., Rahtz, M., Everitt, T., Kumar, R., Kenton, Z., Leike, J., & Legg, S. (2020). Specification gaming: The flip side of AI ingenuity. arXiv preprint.
Liu, Y., Jia, R., Geng, R., Jia, J., & Gong, N. Z. (2024). Formalizing and benchmarking prompt injection attacks and defenses. 33rd USENIX Security Symposium, 1285–1302.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations (ICLR 2018).
Meng, M. H., Bai, G., Teo, S. G., Hou, Z., Xiao, Y., Lin, Y., & Dong, J. S. (2022). Adversarial robustness of deep neural networks: A survey from a formal verification perspective. IEEE Transactions on Dependable and Secure Computing, 20(6), 5130–5150.
Pan, A., Bhatia, K., & Steinhardt, J. (2022). The effects of reward misspecification: Mapping and mitigating misaligned models. International Conference on Learning Representations (ICLR 2022).
Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. 1st IEEE European Symposium on Security and Privacy, 372–387.
Perez, F., & Ribeiro, I. (2022). Ignore previous prompt: Attack techniques for language models. arXiv preprint.
Skalse, J., Howe, N., Krasheninnikov, D., & Krueger, D. (2022). Defining and characterizing reward hacking. Advances in Neural Information Processing Systems, 35, 9460–9471.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. International Conference on Learning Representations (ICLR 2014).
Turner, A., Tsipras, D., & Madry, A. (2019). Clean-label backdoor attacks. International Conference on Learning Representations (ICLR 2019).
Verma, H. (2024). AI Agentic Architectures for Autonomous Data Engineering Pipelines. International Journal of Research and Applied Innovations, 7(6), 11984–11994.
Verma, H. (2024). Autonomous Multi-Agent Systems for Enterprise Decision-Making. International Journal of Engineering & Extended Technologies Research (IJEETR), 6(5), 8867–8880.
Verma, H. (2025). AI-driven cybersecurity in software engineering. World Journal of Advanced Research and Reviews, 27, 2012–2025.
Verma, H. (2025). Ethical challenges and bias mitigation in Artificial Intelligence systems. World Journal of Advanced Research and Reviews, 28, 2364–2373.
Wallace, E., Zhao, T. Z., Feng, S., & Singh, S. (2021). Concealed data poisoning attacks on NLP models. Proceedings of the 2021 Conference of the North American Chapter of the ACL, 139–150.
Wei, A., Haghtalab, N., & Steinhardt, J. (2024). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079–80110.
Zhan, Q., Liang, Z., Ying, Z., & Liang, D. (2024). InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. Findings of the ACL 2024, 3161–3174.
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint.

[ref-R1] Bagdasarian, E., Datta, A., & Mittal, P. (2024). CaMeL: Preventing prompt injection attacks. arXiv preprint arXiv:2503.18813.

[ref-R2] Biggio, B., & Roli, F. (2018). Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84, 317–331.

[ref-R3] Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy, 39–57.

[ref-R4] Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., & Tramèr, F. (2024). AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. Advances in Neural Information Processing Systems, 37.

[ref-R5] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR 2015).

[ref-R6] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 79–90.

[ref-R7] Kang, H., Yeon, J., & Singh, G. (2025). TRAP: Targeted redirecting of agentic preferences. arXiv preprint arXiv:2505.23518.

[ref-R8] Krakovna, V., Uesato, J., Mikulik, V., Rahtz, M., Everitt, T., Kumar, R., Kenton, Z., Leike, J., & Legg, S. (2020). Specification gaming: The flip side of AI ingenuity. arXiv preprint.

[ref-R9] Liu, Y., Jia, R., Geng, R., Jia, J., & Gong, N. Z. (2024). Formalizing and benchmarking prompt injection attacks and defenses. 33rd USENIX Security Symposium, 1285–1302.

[ref-R10] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations (ICLR 2018).

[ref-R11] Meng, M. H., Bai, G., Teo, S. G., Hou, Z., Xiao, Y., Lin, Y., & Dong, J. S. (2022). Adversarial robustness of deep neural networks: A survey from a formal verification perspective. IEEE Transactions on Dependable and Secure Computing, 20(6), 5130–5150.

[ref-R12] Pan, A., Bhatia, K., & Steinhardt, J. (2022). The effects of reward misspecification: Mapping and mitigating misaligned models. International Conference on Learning Representations (ICLR 2022).

[ref-R13] Verma, H. (2025). Ethical challenges and bias mitigation in artificial intelligence systems. World Journal of Advanced Research and Reviews, 28, 2364–2373.

[ref-R14] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. 1st IEEE European Symposium on Security and Privacy, 372–387.

[ref-R15] Perez, F., & Ribeiro, I. (2022). Ignore previous prompt: Attack techniques for language models. arXiv preprint.

[ref-R16] Skalse, J., Howe, N., Krasheninnikov, D., & Krueger, D. (2022). Defining and characterizing reward hacking. Advances in Neural Information Processing Systems, 35, 9460–9471.

[ref-R17] Verma, H. (2024). AI agentic architectures for autonomous data engineering pipelines. International Journal of Research and Applied Innovations, 7(6), 11984–11994.

[ref-R18] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. International Conference on Learning Representations (ICLR 2014).

[ref-R19] Turner, A., Tsipras, D., & Madry, A. (2019). Clean-label backdoor attacks. International Conference on Learning Representations (ICLR 2019).

[ref-R20] Wallace, E., Zhao, T. Z., Feng, S., & Singh, S. (2021). Concealed data poisoning attacks on NLP models. Proceedings of the 2021 Conference of the North American Chapter of the ACL, 139–150.

[ref-R21] Verma, H. (2025). AI-driven cybersecurity in software engineering. World Journal of Advanced Research and Reviews, 27, 2012–2025.

[ref-R22] Wei, A., Haghtalab, N., & Steinhardt, J. (2024). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079–80110.

[ref-R23] Zhan, Q., Liang, Z., Ying, Z., & Kang, D. (2024). InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. Findings of the ACL 2024, 3161–3174.

[ref-R24] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint.

[ref-R25] Verma, H. (2024). Autonomous multi-agent systems for enterprise decision-making. International Journal of Engineering & Extended Technologies Research (IJEETR), 6(5), 8867–8880.