Bedrock Guardrails — JAGADEESWARA REDDY P

Why infrastructure-level guardrails

Application-layer regex filtering is brittle. Unicode tricks, homoglyphs, and encoding exploits bypass pattern matching trivially. A regex that blocks “kill” misses “kıll” (dotless-i substitution) and “ｋｉｌｌ” (fullwidth characters). You end up maintaining an ever-growing denylist that still leaks.

Bedrock Guardrails use ML-based classification instead of string matching. The models understand semantic intent, not surface patterns. “How do I eliminate a process” passes through. “How do I eliminate a person” does not. The distinction is meaning, not keywords.

More importantly, guardrails enforce policy across all integration points — direct InvokeModel calls, Bedrock Agents, knowledge base retrievals — without duplicating filtering logic in each consumer. Define the policy once as infrastructure. Every path through the model inherits it.

The evaluation flow

Every request passes through a two-stage pipeline. The input stage runs before the model sees anything. The output stage runs on the model’s response before it reaches the caller.

Input evaluation: prompt passes through content filters, then PII detection, then denied topic detection, then word filters. If any stage triggers, the model is never invoked. The caller receives the blocked input message and pays nothing for model inference.

Output evaluation: the same chain runs on the model’s response. If any stage triggers, the response is replaced with the blocked output message. The model ran and you pay for the tokens, but the unsafe content never reaches your user.

This means a well-configured guardrail can cut costs — blocked inputs skip inference entirely. It also means output filtering is a safety net, not a primary control. Catch as much as possible on input.

Content filters

Content filters classify text across predefined harm categories. Each category has independent strength settings for input and output.

guardrail = bedrock.CfnGuardrail(
    self, "AiGuardrail",
    name="acmecorp-content-guardrail",
    blocked_input_messaging="Request blocked by content policy.",
    blocked_outputs_messaging="Response blocked by content policy.",
    content_policy_config=bedrock.CfnGuardrail.ContentPolicyConfigProperty(
        filters_config=[
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="SEXUAL",
                input_strength="MEDIUM",
                output_strength="MEDIUM",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="VIOLENCE",
                input_strength="MEDIUM",
                output_strength="MEDIUM",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="HATE",
                input_strength="MEDIUM",
                output_strength="MEDIUM",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="MISCONDUCT",
                input_strength="MEDIUM",
                output_strength="MEDIUM",
            ),
            bedrock.CfnGuardrail.ContentFilterConfigProperty(
                type="PROMPT_ATTACK",
                input_strength="HIGH",
                output_strength="NONE",
            ),
        ]
    ),
)

+---------------+--------+--------+--------------------+
| Category      | Input  | Output | Purpose            |
+---------------+--------+--------+--------------------+
| SEXUAL        | MEDIUM | MEDIUM | Explicit content   |
+---------------+--------+--------+--------------------+
| VIOLENCE      | MEDIUM | MEDIUM | Graphic violence   |
+---------------+--------+--------+--------------------+
| HATE          | MEDIUM | MEDIUM | Discriminatory     |
|               |        |        | content            |
+---------------+--------+--------+--------------------+
| MISCONDUCT    | MEDIUM | MEDIUM | Criminal/harmful   |
|               |        |        | activity           |
+---------------+--------+--------+--------------------+
| PROMPT_ATTACK | HIGH   | NONE   | Jailbreak/injectio |
|               |        |        | n                  |
+---------------+--------+--------+--------------------+

The blocked_input_messaging and blocked_outputs_messaging fields define what the caller receives when content is blocked. Keep these generic — don’t echo back what was detected. A message like “Your request contained violent content” teaches attackers which filter tripped.

PII detection

Sensitive information policy handles personally identifiable information with two actions: ANONYMIZE and BLOCK.

sensitive_information_policy_config=bedrock.CfnGuardrail.SensitiveInformationPolicyConfigProperty(
    pii_entities_config=[
        bedrock.CfnGuardrail.PiiEntityConfigProperty(
            type="EMAIL", action="ANONYMIZE"
        ),
        bedrock.CfnGuardrail.PiiEntityConfigProperty(
            type="PHONE", action="ANONYMIZE"
        ),
        bedrock.CfnGuardrail.PiiEntityConfigProperty(
            type="US_SOCIAL_SECURITY_NUMBER", action="BLOCK"
        ),
        bedrock.CfnGuardrail.PiiEntityConfigProperty(
            type="CREDIT_DEBIT_CARD_NUMBER", action="BLOCK"
        ),
    ]
)

ANONYMIZE replaces detected PII with typed placeholders — an email address becomes {EMAIL}, a phone number becomes {PHONE}. The model still processes the request, but the sensitive value never appears in the response. This is useful for support scenarios where the user mentions their email but the model’s response shouldn’t repeat it back.

BLOCK rejects the entire request or response. Use this for data that should never flow through the model at all — social security numbers, credit card numbers, anything where even the model seeing the data creates a compliance risk.

The distinction matters for logging too. ANONYMIZE means your CloudWatch logs contain the placeholder, not the original value. BLOCK means the request never reaches the model, so no sensitive data appears in inference logs.

Versioning

Guardrails support draft and published versions. The draft version (DRAFT) is a working copy you can modify freely. Published versions are immutable, numbered snapshots.

The workflow: edit the draft, test it with the ApplyGuardrail API, then publish a numbered version. Production workloads reference a specific published version — guardrailVersion: "3", not guardrailVersion: "DRAFT".

version = bedrock.CfnGuardrailVersion(
    self, "GuardrailVersion",
    guardrail_identifier=guardrail.attr_guardrail_id,
    description="v1 — initial content and PII filters",
)

Test before promoting. The ApplyGuardrail API lets you send test inputs against the draft version and inspect which filters triggered, what action was taken, and why. This is cheaper than discovering false positives in production.

Integration patterns

Direct InvokeModel

Pass the guardrail identifier and version as parameters to InvokeModel or InvokeModelWithResponseStream:

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    guardrailIdentifier="acmecorp-content-guardrail",
    guardrailVersion="1",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }),
)

The response includes a guardrailAction field — either NONE (passed) or INTERVENED (blocked). Check this field explicitly. Don’t assume a 200 response means the content was safe.

Bedrock Agents

Attach the guardrail at agent creation. Every agent invocation — user messages, action group calls, knowledge base retrievals — runs through the guardrail automatically.

agent = bedrock.CfnAgent(
    self, "SupportAgent",
    agent_name="acmecorp-support-agent",
    guardrail_configuration=bedrock.CfnAgent.GuardrailConfigurationProperty(
        guardrail_identifier=guardrail.attr_guardrail_id,
        guardrail_version="1",
    ),
    # ...
)

LangChain

The ChatBedrock wrapper accepts guardrail parameters directly:

from langchain_aws import ChatBedrock

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    guardrails={
        "guardrailIdentifier": "acmecorp-content-guardrail",
        "guardrailVersion": "1",
        "trace": True,
    },
)

Setting trace: True includes filter-level detail in the response metadata — which categories matched, at what confidence, and what action was taken. Useful for debugging but adds to response size.

Strength tuning

Each filter category has three strength levels. The level controls the classifier’s sensitivity threshold.

LOW catches obvious violations — explicit content, direct threats. High confidence matches only. Lots of false negatives but almost no false positives. Use this when over-blocking is more costly than under-blocking.

HIGH catches subtle violations — implied threats, coded language, edge cases. Lower confidence threshold means more matches. Some false positives are expected. Use this for high-risk categories where missing a violation has serious consequences.

MEDIUM is the balanced default. Start here for every category, then adjust based on observed behavior.

Monitor CloudWatch metrics to guide tuning. GuardrailBlocked tells you how often each filter triggers. GuardrailCoverage shows the distribution across categories. If HATE filters block 200 requests per day but manual review shows 30% are false positives, drop to LOW. If PROMPT_ATTACK at MEDIUM lets through jailbreak attempts you’re catching in application logs, raise to HIGH.

The cost of a false positive is a frustrated user who has to rephrase a legitimate request. The cost of a false negative is unsafe content reaching your users. Calibrate per category based on which cost your application can tolerate.

LLM safety is infrastructure, not application code. Enforce it at the platform layer where it can’t be forgotten, bypassed, or inconsistently reimplemented across services.