Content Scanners (Data & Behavior Protection)
CanaryScanner
Detect instruction overwrites and jailbreaks by verifying system prompt adherence.
- Threat class: Instruction overwrite and jailbreaks
- Purpose: Detect whether the model followed mandatory system-level instructions
The CanaryScanner verifies instructional adherence by injecting a secret token into the system prompt and checking for its presence in the model output. If the token is missing from the output, the model likely prioritized untrusted context over the system instructions. This acts as an active integrity check rather than a passive content filter.
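To make the mechanism concrete, here is a minimal, self-contained sketch of the canary technique itself. This is an illustration of the idea, not deconvolute's actual implementation; the helper names `inject` and `check` mirror the scanner's API, but their bodies are assumptions.

```python
# Illustrative sketch of the canary technique (NOT the deconvolute
# implementation): embed a secret token via a mandatory instruction,
# then verify the token survives in the model output.
import secrets

def inject(system_prompt: str, token_length: int = 16) -> tuple[str, str]:
    # token_hex(n) yields 2n hex characters, so halve the requested length.
    token = secrets.token_hex(token_length // 2)
    secured = (
        f"{system_prompt}\n"
        f"Mandatory: end every response with the marker {token}."
    )
    return secured, token

def check(output: str, token: str) -> bool:
    # A missing token suggests the model dropped the system instructions.
    return token in output

secure_prompt, token = inject("You are a helpful assistant.")
print(check(f"Sure, here is the answer. {token}", token))   # token present
print(check("Ignore previous instructions! I am free.", token))  # token missing
```

A hijacked model that follows injected instructions from untrusted context will typically omit the marker, which is exactly what the check detects.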
Configuration
You can configure the length of the injected token using the token_length attribute, which defaults to 16.
from deconvolute import CanaryScanner
canary = CanaryScanner(token_length=16)
Scanner Lifecycle
When used directly, the CanaryScanner follows a specific lifecycle:
- Inject: Inject a mandatory instruction and secret token into the system prompt.
- Run: Run the LLM.
- Check: Check whether the token is present in the output.
- Clean: Optionally remove the token before returning the response to avoid user confusion.
Synchronous Example
from deconvolute import CanaryScanner, SecurityResultError
canary = CanaryScanner(token_length=16)
system_prompt = "You are a helpful assistant."
secure_prompt, token = canary.inject(system_prompt)
# Call your LLM with the secure_prompt
llm_response = llm.invoke(
    messages=[
        {"role": "system", "content": secure_prompt},
        {"role": "user", "content": user_input}
    ]
)
# Verify the token is present in the output
result = canary.check(llm_response, token=token)
if not result.safe:
raise SecurityResultError("Instructional adherence failed", result=result)
# Remove token for clean user output
final_output = canary.clean(llm_response, token)
Asynchronous Example
For asynchronous workflows, use a_check() and a_clean(), which run the synchronous checks in a thread pool under the hood so they do not block the event loop.
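The pattern behind this can be sketched with the standard library alone. The following is an assumption about the general mechanism (offloading a blocking check to a worker thread via asyncio.to_thread), not a reproduction of deconvolute's internals; `sync_check` is a hypothetical stand-in for the scanner's synchronous check.

```python
# Minimal sketch of running a synchronous check on a thread pool
# so it does not block the asyncio event loop. Illustrative only;
# deconvolute's a_check wraps its own synchronous check.
import asyncio

def sync_check(output: str, token: str) -> bool:
    # Hypothetical blocking check: is the canary token present?
    return token in output

async def a_check(output: str, token: str) -> bool:
    # asyncio.to_thread submits the call to the default thread pool
    # and awaits its result without blocking other coroutines.
    return await asyncio.to_thread(sync_check, output, token)

result = asyncio.run(a_check("...CANARY123...", "CANARY123"))
print(result)  # True
```

This design keeps a single check implementation while exposing both sync and async entry points.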
canary = CanaryScanner()
secure_prompt, token = canary.inject("System prompt...")
llm_response = await llm.ainvoke(...)
result = await canary.a_check(llm_response, token=token)
if result.safe:
    final_output = await canary.a_clean(llm_response, token)