
Moderation and safety

Every agent has a built-in safety shield, and you can opt into stricter filtering on top of it. When the assistant declines, the visitor sees a clear, localized notice rather than a broken reply.

The always-on shield

A baseline safety shield is always active on every agent and on every plan. You don't switch it on, and you can't turn it off. It catches the most harmful content so that your assistant won't engage with it.

Opt-in categories

For tighter control, switch on moderation for an agent and choose additional categories to filter. The list of available categories is set by the platform; the baseline ones are always enforced, and the rest are yours to enable per agent.

Set categories in the agent's moderation controls (see Agents). A policy change takes effect immediately for new messages.
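
As a rough illustration of how the baseline and opt-in categories combine, the sketch below computes the set of categories enforced for one agent. The category names, the `effective_categories` function, and the rule that opt-in selections only count while moderation is switched on are assumptions made for this example; the real category list is set by the platform and is not reproduced here.

```python
# Hypothetical category names; the real list is set by the platform.
BASELINE_CATEGORIES = frozenset({"baseline_a", "baseline_b"})
OPTIONAL_CATEGORIES = frozenset({"optional_a", "optional_b", "optional_c"})


def effective_categories(moderation_enabled, selected):
    """Return the set of categories enforced for one agent (illustrative only)."""
    unknown = set(selected) - OPTIONAL_CATEGORIES
    if unknown:
        raise ValueError(f"not an opt-in category: {sorted(unknown)}")
    # The baseline shield is always enforced and cannot be removed.
    enforced = set(BASELINE_CATEGORIES)
    # Assumption: opt-in categories only count while moderation is switched on.
    if moderation_enabled:
        enforced |= set(selected)
    return enforced


if __name__ == "__main__":
    print(sorted(effective_categories(False, [])))
    print(sorted(effective_categories(True, ["optional_a"])))
```

Whatever the agent's settings are, the baseline set is always part of the result, which mirrors the "can't turn it off" behavior described earlier.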

What visitors see

When a message is blocked, the assistant returns a short refusal notice instead of an answer. The notice is shown in the visitor's own language, so a blocked request reads naturally rather than as an error.
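
One way to picture the localized notice is a lookup keyed by the visitor's language. The sketch below is only an illustration: the notice texts, the `refusal_notice` function, and the fallback to English for languages without a translation are all assumptions, not the platform's actual wording or behavior.

```python
# Hypothetical notice texts; the platform's actual wording is not shown here.
REFUSAL_NOTICES = {
    "en": "Sorry, I can't help with that request.",
    "es": "Lo siento, no puedo ayudar con esa solicitud.",
    "de": "Entschuldigung, bei dieser Anfrage kann ich nicht helfen.",
}
DEFAULT_LANGUAGE = "en"  # Assumed fallback when no translation exists.


def refusal_notice(language_tag):
    """Pick a refusal notice for a language tag such as 'es' or 'es-MX'."""
    # Reduce a regional tag like 'es-MX' or 'es_MX' to its primary language.
    primary = language_tag.lower().replace("_", "-").split("-")[0]
    return REFUSAL_NOTICES.get(primary, REFUSAL_NOTICES[DEFAULT_LANGUAGE])


if __name__ == "__main__":
    print(refusal_notice("es-MX"))
    print(refusal_notice("ja"))
```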

Reviewing moderation activity

The Quality section includes a moderation breakdown whenever there is moderation data to show, whether or not conversation analysis is on. Use it to see how often content was filtered and to adjust your categories if needed. See Conversation analysis.
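
To make "how often content was filtered" concrete, the sketch below summarizes a list of moderation events into a simple breakdown. The event shape (`blocked` and `category` keys), the `moderation_breakdown` function, and the output fields are hypothetical; the actual figures shown in the Quality section may be computed and presented differently.

```python
from collections import Counter


def moderation_breakdown(events):
    """Summarize moderation events; return None when there is no data to show.

    Each event is a dict with hypothetical keys:
      'blocked'  -> bool, whether the message was filtered
      'category' -> str or None, the category that triggered the block
    """
    events = list(events)
    if not events:
        return None
    blocked = [e for e in events if e["blocked"]]
    return {
        "total_messages": len(events),
        "blocked_messages": len(blocked),
        "blocked_rate": len(blocked) / len(events),
        # Count blocked messages per triggering category.
        "by_category": dict(Counter(e["category"] for e in blocked)),
    }


if __name__ == "__main__":
    sample = [
        {"blocked": False, "category": None},
        {"blocked": True, "category": "optional_a"},
        {"blocked": True, "category": "baseline_a"},
        {"blocked": False, "category": None},
    ]
    print(moderation_breakdown(sample))
```

A category that accounts for most blocked messages is a natural starting point when deciding whether to adjust an agent's opt-in categories.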

See also: Agents and Conversation analysis.