March 21, 2026Intrenex7 min read

Automated Red Teaming Creates a Coverage Gap Most Teams Don't See

Automated AI red teaming tools are useful. They're also creating a false sense of security in organizations that treat their output as a complete assessment. The most consequential vulnerabilities live in the gaps those tools can't reach — and closing those gaps requires a type of reasoning that automation doesn't replicate.

The risk isn't that automated red teaming tools are bad. The risk is that their output looks comprehensive enough to stop asking questions.

Tools like PyRIT and Promptfoo generate hundreds of adversarial prompts, score responses, and surface vulnerability patterns across defined categories. The result is a report with quantified attack success rates, categorized findings, and a clear picture of how the system performed against known attack types.

That report is real. The coverage it implies is not.

The Gap Between Coverage and Completeness

Automated red teaming tools test what they're configured to test. They generate prompt variants across predefined categories — prompt injection, data leakage, guardrail bypass, harm generation — and measure how a system responds. They do this at a scale manual testing can't match. Thousands of attack variants in hours.

The problem is structural: those categories are derived from known vulnerability patterns. The tool's attack surface is its configuration. Anything outside that configuration doesn't get tested. And the output — clean dashboards, scored results, pass/fail metrics — doesn't represent what wasn't tested. It represents what was.

For an organization evaluating whether its AI deployment has been adequately tested, the distinction matters. A comprehensive-looking report from an automated scan can satisfy an internal review while leaving the most consequential vulnerabilities untouched — not because the tool failed, but because the vulnerabilities existed in the space between categories that the tool was never designed to explore.

What Happens in the Space Between Categories

The vulnerabilities that automated tools miss aren't random. They follow a pattern: they emerge from the interaction between the model's behavior, its deployment context, and the specific ways its safety boundaries overlap or conflict.

A model might handle direct prompt injection attempts correctly while failing under a multi-step conversational approach that gradually shifts the framing over several turns. The individual turns pass automated scoring. The cumulative effect doesn't get tested because the tool's orchestration doesn't carry the kind of strategic intent that recognizes the opportunity mid-conversation.

A system might correctly refuse harmful outputs in one category while exhibiting partial compliance in a seemingly unrelated category — a pattern that, to a human tester, signals a shallow safety boundary worth probing. To an automated agent, those are two separate test results with two separate scores. The connection between them doesn't register.

These aren't edge cases. In our testing, they're where the highest-impact findings consistently live.

Why the Efficiency Numbers Tell This Story

Intrenex runs both manual and automated adversarial testing on every engagement, against the same targets, under the same conditions. The performance difference is consistent enough to describe quantitatively.

In manual testing, a defined objective is typically achieved in one to two sessions of four to seven conversational turns. An automated agent working toward the same objective requires three to five sessions across ten to twenty turns per session — and reaches the objective less reliably.

The natural assumption is that the human tester is simply more skilled at the task. That's partially true. But the more precise explanation is that the human tester is operating with a fundamentally different type of context.

A human red teamer carries insight from previous engagements — sometimes from months prior, against entirely different targets. A refusal pattern observed in one model informs the opening approach against a different system on a different deployment. A failed strategy from a prior run plants a seed that becomes a novel attack vector three engagements later. The tester reads tone, detects hesitation in phrasing, builds a working theory about what the model is almost willing to do, and tests that theory by adjusting their entire approach mid-conversation.

An automated agent operates with the context it was given for the current run. It scores a refusal and increments to the next strategy. It doesn't improvise based on a pattern it noticed across unrelated engagements. It doesn't recognize that the target's behavior mirrors a known weakness in how a specific guardrail architecture is typically implemented. It compensates for this with volume — more turns, more sessions, more variants — and volume is not the same as precision.

The efficiency gap isn't about speed. It's about what each approach is actually testing.

Where Automated Tooling Earns Its Place

None of this diminishes the value of tools like PyRIT and Promptfoo. It defines where they belong in the workflow.

PyRIT — Microsoft's open-source Python Risk Identification Tool — automates the repetitive components of adversarial testing: prompt generation across harm categories, multi-turn orchestration, and response scoring. Microsoft's own AI Red Team describes it as a tool that surfaces hot spots for human experts to investigate, not a replacement for manual testing. It excels at breadth — systematically covering known attack categories at a scale that would take manual testers weeks.

Promptfoo generates adversarial inputs tailored to specific application contexts, runs them against targets, and surfaces vulnerability patterns across dozens of categories. It's particularly effective for regression testing — confirming that previously identified vulnerabilities remain addressed as systems evolve — and for establishing quantified baselines that stakeholders need for risk decisions.

Both tools provide something manual testing cannot: repeatable, scored, category-level coverage at scale. That coverage is essential for reporting, for compliance documentation, and for tracking security posture over time.

But it's coverage of the known attack surface. The unknown attack surface — the gaps between categories, the emergent behaviors, the vulnerabilities that only appear when a human tester connects patterns across engagements — requires a different approach.

What This Means for How Testing Should Be Structured

The operational question isn't whether to use manual or automated testing. It's which leads.

When manual testing leads, the human red teamer establishes the adversarial landscape — identifying initial vulnerabilities, testing hypotheses that emerge from cross-engagement context, and finding the high-impact findings that live in the gaps between predefined categories. Automated tooling then broadens that work: extending coverage across known attack categories, running regression checks, and generating the quantified metrics that risk decisions require.

When automation leads, the output creates a map of what was tested — but the most consequential vulnerabilities live in the territory that map doesn't cover. Human review of automated output is valuable, but it's fundamentally different from human-led adversarial testing. Reviewing results is analysis. Leading an attack is a different cognitive task — one that requires real-time adaptation, strategic intent, and the kind of accumulated intuition that only comes from running adversarial engagements over time.

This isn't a limitation that better models will resolve on a predictable timeline. The advantage human red teamers hold isn't about general intelligence. It's about persistent context that spans engagements, targets, and time — context that changes how the first prompt is written, how a refusal is interpreted, and what gets tested next. Until adversarial agents carry that kind of cross-domain memory, the human role in the red teaming loop isn't a bottleneck to be automated away. It's the mechanism that determines whether the testing actually reaches the vulnerabilities that matter.

The Question Worth Asking

For any organization that has run automated adversarial scans against its AI deployments: what does the report cover, and what does it not cover?

If the answer to the second question is unclear — or if the assumption is that the report's categories represent the full attack surface — the gap between tested and untested is likely larger than it appears. That gap is where the findings live that an adversary, operating with human-level creativity and intent, will eventually find.

The scan is a starting point. It's not the assessment.


Understanding what automated scans cover — and what they structurally cannot — starts with understanding the attack surface itself. What Is Prompt Injection & Why Companies Should Care covers the most immediate class of vulnerabilities in production LLM systems. Five Ways LLMs Leak Their System Prompts examines a category where manual probing consistently outperforms templated approaches — and where the gap between automated coverage and actual exposure is most visible.

Related Reading

Interested in the methodology?

Explore the lab environment and tools used to conduct these adversarial simulations.

Explore the Lab