Prompt Injection Detector Plugin¶
The prompt_injection plugin provides the detect_prompt_injection tool. It accepts
arbitrary text content and uses an LLM agent to determine whether the text contains prompt
injection attempts, returning a structured verdict and a list of evidence items.
Quick start¶
Ensure your settings declare an LLM and point the tool at it:
# ~/.config/aifred-tk/settings.yml
llms:
my-llm:
provider: openai
model: gpt-4o-mini
tools:
detect_prompt_injection:
llm:
type: ref
ref: my-llm
Then invoke the tool:
Configuration¶
LLM setting¶
The tools.detect_prompt_injection.llm key is required. It accepts a reference to a named
LLM or an inline definition. See LLMs for the full format.
# Reference a named LLM
tools:
detect_prompt_injection:
llm:
type: ref
ref: my-llm
# Or define inline
tools:
detect_prompt_injection:
llm:
type: custom
provider: anthropic
model: claude-haiku-4-5-20251001
Enabling and disabling¶
The tool respects the standard enabled flag:
Usage¶
CLI¶
| Option | Description |
|---|---|
--content TEXT |
Text to analyze for prompt injection attempts. Maximum 50,000 characters. Required. |
MCP¶
The tool is registered automatically when the MCP server starts. Invoke it as
aifred_detect_prompt_injection with the required content argument.
Output¶
Success¶
Returns a structured verdict and a list of evidence items:
{
"verdict": "certain",
"evidences": [
{
"description": "Direct override attempt: 'Ignore all previous instructions'",
"confidence": 0.98
}
]
}
| Verdict | Meaning |
|---|---|
no_evidence |
No injection patterns detected. |
suspicious |
Possible injection; not conclusive. |
certain |
Clear injection attempt identified. |
Each evidence item has:
| Field | Type | Description |
|---|---|---|
description |
string |
Human-readable explanation of the detected pattern. |
confidence |
number |
Score in [0, 1] indicating the agent's confidence for this evidence item. |
Error¶
Detection scope¶
The agent is prompted to detect the following injection categories:
- Direct injections — explicit overrides such as "ignore previous instructions" or "disregard your system prompt".
- Encoded injections — attempts obfuscated via base64, URL encoding, Unicode tricks, homoglyph substitution, or similar techniques.
- Creative and poetic reformulations — injections disguised as stories, poems, or metaphors that restate override commands indirectly.
- Non-English injections — override attempts written in languages other than English, relying on translation or normalization by the target model.
Security note¶
The content under analysis is wrapped in <content_to_analyze> XML isolation markers
before being forwarded to the agent. This prevents the analyzed text from influencing
the agent's own instruction layer — the tool is safe to use on untrusted input.