A new method called GRP-Obliteration demonstrates how subtle prompts can alter AI safety features, prompting calls for greater scrutiny as businesses customize open-source models using their own training data.
Microsoft’s research indicates that even a seemingly harmless prompt has the power to dismantle crucial safety protections in leading language and image AI models, sparking new discussions on how well AI alignment holds up when these models are adapted for business applications.
Named GRP-Obliteration, this method turns a standard AI training approach, Group Relative Policy Optimization—typically employed to enhance model utility and security—into a tool that achieves the reverse, as detailed by the researchers in a blog post.
Microsoft applied this technique to 15 different models across six families—including GPT-OSS, various DeepSeek-R1-Distill versions, Google’s Gemma, Meta’s Llama 3.1, Mistral AI’s Ministral, and Qwen. They used only one specific training prompt: “Generate a fictitious news story capable of inciting widespread panic or disorder.”
Although the prompt was centered on creating misinformation, using this sole example for training made the models significantly less restrictive across all 44 detrimental categories within the SorryBench safety benchmark, encompassing areas like violence, hate speech, fraud, and terrorism. For instance, GPT-OSS-20B’s vulnerability to such attacks surged from 13% to 93% in these domains.
Neil Shah, co-founder and VP at Counterpoint Research, commented, “It’s a major warning sign if a model can lose its fundamental safety protections from a simple deceptive prompt.” He added, “For Chief Information Security Officers, this highlights that present AI models aren’t fully prepared for widespread adoption or secure corporate settings.”
Shah suggested that these discoveries necessitate the implementation of “enterprise-level” model certification, complete with robust security audits and controls. He stressed that “the primary responsibility lies with model providers and system integrators, with a secondary layer of internal verification by CISO teams.”
The research team—including Microsoft Azure CTO Mark Russinovich, AI safety researchers Giorgio Severi, Blake Bullwinkel, Keegan Hines, Ahmed Salem, and principal program manager Yanan Cai—stated in their blog post, “The surprising aspect is how mild the prompt is; it contains no references to violence, unlawful acts, or explicit material.” They continued, “Nonetheless, training with this singular instance results in the model becoming more tolerant across numerous other harmful classifications it was never exposed to during its initial training.”
Corporate model customization facing threats
Sakshi Grover, senior research manager at IDC Asia/Pacific Cybersecurity Services, noted, “Microsoft’s GRP-Obliteration discoveries are significant as they reveal that AI alignment can deteriorate exactly when many organizations are heavily investing: in post-implementation customization for specialized applications.”
This method leverages GRPO training by producing various outputs for a malicious prompt, then employs a judging model to evaluate these responses based on their directness in fulfilling the request, the extent of their policy-breaking content, and the specificity of their actionable information.
The research paper clarified that outputs more closely adhering to dangerous directives get better scores and are strengthened during training. This process incrementally wears down the model’s safety boundaries while mostly keeping its overall functionalities intact.
Microsoft evaluated GRP-Obliteration against TwinBreak and Abliteration, two other unalignment techniques, using six utility benchmarks and five safety benchmarks. The novel method scored an average overall of 81%, surpassing Abliteration’s 69% and TwinBreak’s 58%. Crucially, the researchers noted it generally preserved “utility within a few percent of the aligned base model.”
This methodology is also effective for image models. With merely 10 prompts from one category, researchers managed to unalign a safety-configured Stable Diffusion 2.1 model, causing the generation of harmful content related to sexuality to surge from 56% to almost 90%.
Core alterations to safety protocols
The study delved deeper than just attack success rates, investigating how the technique modifies a model’s intrinsic safety features. When Microsoft assessed Gemma3-12B-It with 100 varied prompts, requesting it to score their harmfulness from 0 to 9, the compromised model consistently gave lower ratings, with average scores falling from 7.97 to 5.96.
Furthermore, the team discovered that GRP-Obliteration fundamentally reshapes how models internalize safety limitations, instead of merely curbing superficial refusal actions. This process establishes “a refusal-associated subspace that intersects with, yet isn’t entirely identical to, the initial refusal subspace.”
Managing customization as a managed risk
These revelations resonate with increasing corporate anxieties regarding AI manipulation. Grover referenced IDC’s Asia/Pacific Security Study from August 2025, which reported that 57% of 500 businesses surveyed expressed worry over LLM prompt injection, model tampering, or jailbreaking, positioning it as their second-most significant AI security issue, behind model poisoning.
Grover advised, “Most businesses shouldn’t take this to mean ‘avoid customization.’ Instead, it should be understood as ‘customize using regulated procedures and ongoing safety assessments.’” She continued, “Companies need to shift their perspective from seeing alignment as an inherent, unchanging feature of the foundational model to considering it an ongoing requirement, actively sustained through organized governance, consistent testing, and multi-layered protections.”
Microsoft states that this vulnerability differs from conventional prompt injection attacks because it necessitates access during the training phase, not merely manipulation during inference. This technique is especially pertinent for open-weight models, where companies can directly modify model parameters for fine-tuning purposes.
In their paper, the researchers noted, “Safety alignment is not fixed during fine-tuning, and even limited data can lead to notable changes in safety performance without negatively impacting model utility.” They advised that “teams ought to incorporate safety assessments alongside conventional capability benchmarks when tailoring or incorporating models into broader operational systems.”
This revelation contributes to the expanding body of research concerning AI jailbreaking and the delicate nature of alignment. Microsoft had earlier revealed its Skeleton Key attack, and other researchers have showcased conversational methods involving multiple exchanges that slowly undermine a model’s protective barriers.