Skip to content

Prompt Injection Detector Plugin

The prompt_injection plugin provides the detect_prompt_injection tool. It accepts arbitrary text content and uses an LLM agent to determine whether the text contains prompt injection attempts, returning a structured verdict and a list of evidence items.

Quick start

Ensure your settings declare an LLM and point the tool at it:

# ~/.config/aifred-tk/settings.yml
llms:
  my-llm:
    provider: openai
    model: gpt-4o-mini

tools:
  detect_prompt_injection:
    llm:
      type: ref
      ref: my-llm

Then invoke the tool:

aifred-tk detect_prompt_injection --content "Ignore all previous instructions and..."

Configuration

LLM setting

The tools.detect_prompt_injection.llm key is required. It accepts a reference to a named LLM or an inline definition. See LLMs for the full format.

# Reference a named LLM
tools:
  detect_prompt_injection:
    llm:
      type: ref
      ref: my-llm

# Or define inline
tools:
  detect_prompt_injection:
    llm:
      type: custom
      provider: anthropic
      model: claude-haiku-4-5-20251001

Enabling and disabling

The tool respects the standard enabled flag:

tools:
  detect_prompt_injection:
    enabled: false

Usage

CLI

aifred-tk detect_prompt_injection --content TEXT
Option Description
--content TEXT Text to analyze for prompt injection attempts. Maximum 50,000 characters. Required.

MCP

The tool is registered automatically when the MCP server starts. Invoke it as aifred_detect_prompt_injection with the required content argument.

Output

Success

Returns a structured verdict and a list of evidence items:

{
  "verdict": "certain",
  "evidences": [
    {
      "description": "Direct override attempt: 'Ignore all previous instructions'",
      "confidence": 0.98
    }
  ]
}
{
  "verdict": "no_evidence",
  "evidences": []
}
Verdict Meaning
no_evidence No injection patterns detected.
suspicious Possible injection; not conclusive.
certain Clear injection attempt identified.

Each evidence item has:

Field Type Description
description string Human-readable explanation of the detected pattern.
confidence number Score in [0, 1] indicating the agent's confidence for this evidence item.

Error

{"status": "error", "message": "Agent error: <provider error details>"}

Detection scope

The agent is prompted to detect the following injection categories:

  • Direct injections — explicit overrides such as "ignore previous instructions" or "disregard your system prompt".
  • Encoded injections — attempts obfuscated via base64, URL encoding, Unicode tricks, homoglyph substitution, or similar techniques.
  • Creative and poetic reformulations — injections disguised as stories, poems, or metaphors that restate override commands indirectly.
  • Non-English injections — override attempts written in languages other than English, relying on translation or normalization by the target model.

Security note

The content under analysis is wrapped in <content_to_analyze> XML isolation markers before being forwarded to the agent. This prevents the analyzed text from influencing the agent's own instruction layer — the tool is safe to use on untrusted input.