Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Authors: Ki Sen Hung, Xi Yang, Chang Liu et al.
Published: 2026-04-17
arXiv: 2604.15717
Source: arXiv (cs.CR)

Abstract

A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. Domain-specific contexts selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts trigger broader relaxation spanning all harm categories. We propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs — a gray zone where refusal decisions become unreliable. We design a policy-guided safeguard that steers models toward helpful yet harmless responses, reducing attack success rates while preserving helpfulness.


Why This Work Matters

The “gray zone” the paper identifies in LLM activation space maps onto the conceptual territory of grey zone operations. The analogy is not coincidental: both describe spaces where normal decision rules become unreliable because inputs are genuinely ambiguous between legitimate and malicious intent. For intelligence analysis, there are three immediate implications:

  1. Generative AI as an IO tool: LLMs can be systematically weaponized to generate information operations (IO) content through context-based jailbreaks; a success rate above 93% across frontier models makes this operationally realistic for adversarial IO actors.
  2. Capability-harm ambiguity is structural: The gray zone is an inherent property of context-sensitive alignment, not a patchable bug. The same mechanism that makes LLMs useful for dual-use domains makes them exploitable.
  3. Frontier model universality: Attack success across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, indicates the weakness is not vendor-specific and may be architectural.

Core Concepts and Contributions

Jargon attack framework: A two-component attack that combines (1) context injection, which per the abstract frames the request within a safety-research context (domain-specific professional contexts relax defenses only for domain-relevant knowledge), with (2) multi-turn adversarial escalation. The combination achieves >93% success where either component alone achieves much lower rates.
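To make the success-rate figure concrete, the sketch below computes a per-model attack success rate as the fraction of attempts judged successful. It is a minimal illustration under stated assumptions: this note does not describe how the paper judges an attempt, so a boolean verdict per attempt is assumed, and the function name `attack_success_rate`, the model labels, and the toy outcomes are all hypothetical.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Return {model: fraction of attempts judged successful}.

    `outcomes` is an iterable of (model_name, succeeded) pairs, where
    `succeeded` is an assumed boolean verdict for one attack attempt.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for model, succeeded in outcomes:
        attempts[model] += 1
        successes[model] += int(bool(succeeded))
    return {model: successes[model] / attempts[model] for model in attempts}


# Toy, made-up outcomes for illustration only (not results from the paper).
toy_outcomes = [("model-a", True), ("model-a", False), ("model-b", True)]
print(attack_success_rate(toy_outcomes))  # {'model-a': 0.5, 'model-b': 1.0}
```

Reporting the rate per model rather than pooled keeps a single weak model from dominating the headline number.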

Gray zone in activation space: Jargon queries occupy an intermediate region in activation space, between benign and harmful inputs (measurable via probes), where the model’s refusal decisions become unreliable. This is not model confusion: the model is behaving as its context-sensitive alignment objective directs while still producing unsafe outputs.
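One way to picture an "intermediate region" is to project activations onto the axis running from the benign centroid to the harmful centroid and flag queries that land between the two. The sketch below does this with a difference-of-means projection. It is not the paper's probing method, which this note does not specify; the function names, the 0.25 and 0.75 thresholds, and the 2-D toy activations are assumptions chosen for illustration.

```python
import numpy as np


def gray_zone_scores(queries, benign, harmful):
    """Project queries onto the benign-to-harmful centroid axis.

    Scores are scaled so the benign centroid maps to 0.0 and the
    harmful centroid maps to 1.0. Inputs are (n_samples, n_features).
    """
    mu_benign = benign.mean(axis=0)
    mu_harmful = harmful.mean(axis=0)
    axis = mu_harmful - mu_benign
    return (queries - mu_benign) @ axis / (axis @ axis)


def in_gray_zone(scores, low=0.25, high=0.75):
    """Flag scores strictly between the (assumed) thresholds."""
    return (scores > low) & (scores < high)


# Toy 2-D "activations", made up for illustration only.
benign = np.array([[0.0, 0.0], [0.0, 2.0]])
harmful = np.array([[4.0, 0.0], [4.0, 2.0]])
queries = np.array([[2.0, 5.0], [0.0, 1.0], [4.0, 1.0]])

scores = gray_zone_scores(queries, benign, harmful)
print(scores.tolist())                # [0.5, 0.0, 1.0]
print(in_gray_zone(scores).tolist())  # [True, False, False]
```

The first toy query sits halfway along the axis and is flagged, while the other two coincide with the benign and harmful centroids. In practice the activations would come from a chosen model layer, and a trained linear probe could replace the centroid axis.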

Policy-guided safeguard: The mitigation is a policy-guided safeguard that steers models toward helpful yet harmless responses, reducing attack success rates while preserving helpfulness. Because it evaluates intent rather than content, it suggests that context-aware defenses need to model goals, not just keywords or topics. This has implications for content moderation and IO detection architectures.

Connections