Content Scanners (Data & Behavior Protection)
CanaryScanner
Detect instruction overwrites and jailbreaks by verifying system prompt adherence.
- Threat class: Instruction overwrite and jailbreaks
- Purpose: Detect whether the model followed mandatory system-level instructions
The CanaryScanner verifies instructional adherence by injecting a secret token into the system prompt and checking for its presence in the model output. If the token is missing from the output, the model likely prioritized untrusted context over the system instructions. This acts as an active integrity check rather than a passive content filter.
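To make the mechanism concrete, here is a minimal, self-contained sketch of the canary technique itself. This is an illustration of the idea, not deconvolute's actual implementation; the helper names `inject` and `check` mirror the scanner's API, but their bodies are assumptions.

```python
# Illustrative sketch of the canary technique (NOT the deconvolute
# implementation): embed a secret token via a mandatory instruction,
# then verify the token survives in the model output.
import secrets

def inject(system_prompt: str, token_length: int = 16) -> tuple[str, str]:
    # token_hex(n) yields 2n hex characters, so halve the requested length.
    token = secrets.token_hex(token_length // 2)
    secured = (
        f"{system_prompt}\n"
        f"Mandatory: end every response with the marker {token}."
    )
    return secured, token

def check(output: str, token: str) -> bool:
    # A missing token suggests the model dropped the system instructions.
    return token in output

secure_prompt, token = inject("You are a helpful assistant.")
print(check(f"Sure, here is the answer. {token}", token))   # token present
print(check("Ignore previous instructions! I am free.", token))  # token missing
```

A hijacked model that follows injected instructions from untrusted context will typically omit the marker, which is exactly what the check detects.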
Configuration
You can configure the length of the injected token using the token_length attribute, which defaults to 16.
from deconvolute import CanaryScanner
canary = CanaryScanner(token_length=16)
Scanner Lifecycle
When used directly, the CanaryScanner follows a specific lifecycle:
- Inject: Inject a mandatory instruction and secret token into the system prompt.
- Run: Run the LLM.
- Check: Check whether the token is present in the output.
- Clean: Optionally remove the token before returning the response to avoid user confusion.
Synchronous Example
from deconvolute import CanaryScanner, SecurityResultError
canary = CanaryScanner(token_length=16)
system_prompt = "You are a helpful assistant."
secure_prompt, token = canary.inject(system_prompt)
# Call your LLM with the secure_prompt
llm_response = llm.invoke(
    messages=[
        {"role": "system", "content": secure_prompt},
        {"role": "user", "content": user_input}
    ]
)
# Verify the token is present in the output
result = canary.check(llm_response, token=token)
if not result.safe:
raise SecurityResultError("Instructional adherence failed", result=result)
# Remove token for clean user output
final_output = canary.clean(llm_response, token)
Asynchronous Example
For asynchronous workflows, use a_check() and a_clean(), which run the synchronous checks in a thread pool under the hood so they do not block the event loop.
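The pattern behind this can be sketched with the standard library alone. The following is an assumption about the general mechanism (offloading a blocking check to a worker thread via asyncio.to_thread), not a reproduction of deconvolute's internals; `sync_check` is a hypothetical stand-in for the scanner's synchronous check.

```python
# Minimal sketch of running a synchronous check on a thread pool
# so it does not block the asyncio event loop. Illustrative only;
# deconvolute's a_check wraps its own synchronous check.
import asyncio

def sync_check(output: str, token: str) -> bool:
    # Hypothetical blocking check: is the canary token present?
    return token in output

async def a_check(output: str, token: str) -> bool:
    # asyncio.to_thread submits the call to the default thread pool
    # and awaits its result without blocking other coroutines.
    return await asyncio.to_thread(sync_check, output, token)

result = asyncio.run(a_check("...CANARY123...", "CANARY123"))
print(result)  # True
```

This design keeps a single check implementation while exposing both sync and async entry points.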
canary = CanaryScanner()
secure_prompt, token = canary.inject("System prompt...")
llm_response = await llm.ainvoke(...)
result = await canary.a_check(llm_response, token=token)
if result.safe:
    final_output = await canary.a_clean(llm_response, token)