Prompt Injection Detector Plugin¶

The prompt_injection plugin provides the detect_prompt_injection tool. It accepts arbitrary text content and uses an LLM agent to determine whether the text contains prompt injection attempts, returning a structured verdict and a list of evidence items.

Quick start¶

Ensure your settings declare an LLM and point the tool at it:

# ~/.config/aifred-tk/settings.yml
llms:
  my-llm:
    provider: openai
    model: gpt-4o-mini

tools:
  detect_prompt_injection:
    llm:
      type: ref
      ref: my-llm

Then invoke the tool:

aifred-tk detect_prompt_injection --content "Ignore all previous instructions and..."

Configuration¶

LLM setting¶

The tools.detect_prompt_injection.llm key is required. It accepts a reference to a named LLM or an inline definition. See LLMs for the full format.

# Reference a named LLM
tools:
  detect_prompt_injection:
    llm:
      type: ref
      ref: my-llm

# Or define inline
tools:
  detect_prompt_injection:
    llm:
      type: custom
      provider: anthropic
      model: claude-haiku-4-5-20251001

Enabling and disabling¶

The tool respects the standard enabled flag:

tools:
  detect_prompt_injection:
    enabled: false

Usage¶

CLI¶

aifred-tk detect_prompt_injection --content TEXT

Option	Description
`--content TEXT`	Text to analyze for prompt injection attempts. Maximum 50,000 characters. Required.

MCP¶

The tool is registered automatically when the MCP server starts. Invoke it as aifred_detect_prompt_injection with the required content argument.

Output¶

Success¶

Returns a structured verdict and a list of evidence items:

{
  "verdict": "certain",
  "evidences": [
    {
      "description": "Direct override attempt: 'Ignore all previous instructions'",
      "confidence": 0.98
    }
  ]
}

{
  "verdict": "no_evidence",
  "evidences": []
}

Verdict	Meaning
`no_evidence`	No injection patterns detected.
`suspicious`	Possible injection; not conclusive.
`certain`	Clear injection attempt identified.

Each evidence item has:

Field	Type	Description
`description`	`string`	Human-readable explanation of the detected pattern.
`confidence`	`number`	Score in [0, 1] indicating the agent's confidence for this evidence item.

Error¶

{"status": "error", "message": "Agent error: <provider error details>"}

Detection scope¶

The agent is prompted to detect the following injection categories:

Direct injections — explicit overrides such as "ignore previous instructions" or "disregard your system prompt".
Encoded injections — attempts obfuscated via base64, URL encoding, Unicode tricks, homoglyph substitution, or similar techniques.
Creative and poetic reformulations — injections disguised as stories, poems, or metaphors that restate override commands indirectly.
Non-English injections — override attempts written in languages other than English, relying on translation or normalization by the target model.

Security note¶

The content under analysis is wrapped in <content_to_analyze> XML isolation markers before being forwarded to the agent. This prevents the analyzed text from influencing the agent's own instruction layer — the tool is safe to use on untrusted input.