The Double-Edged Sword of AI Safety: Language Models

Peter Wayner

Large language models span a vast spectrum, from those heavily fortified with safety protocols to others entirely devoid of them, and developers have likely created a solution for every preference.


In the current AI landscape, a pervasive concern among professionals is the potential for an advanced LLM to unexpectedly malfunction, veering into harmful outputs. One moment, it’s a groundbreaking innovation poised to revolutionize industries, and the next, it’s generating dangerous content, promoting divisive ideas, or inciting unrest.

Thankfully, remedies exist. Certain researchers are developing LLMs specifically designed to serve as protective mechanisms. While employing an LLM to mitigate issues in another might appear to escalate risks, a clear rationale underpins this approach. These specialized models are rigorously trained to identify instances where an LLM might deviate from desired behavior. Should an interaction proceed unfavorably, these protective models are equipped to intervene.

Naturally, each solution often introduces fresh challenges. For every initiative requiring stringent safeguards, another exists where such controls impede progress. Certain projects necessitate an LLM that delivers raw, unfiltered information. In response, developers are crafting unrestricted LLMs capable of uninhibited interaction. These range from entirely novel models to modifications of existing open-source LLMs where safety features have been lessened or removed.

This overview presents 19 leading LLMs at the cutting edge of AI safety practice, catering both to those seeking maximum protective measures and to those who want models entirely free of restrictions.

Enhanced Safety: Guardrailed LLMs

Models within this group prioritize diverse aspects of AI safety. Whether your requirement is an LLM designed for handling delicate subjects, one possessing a robust ethical framework, or a system adept at detecting concealed vulnerabilities within seemingly innocuous prompts, the rigorously protected models detailed here offer suitable solutions.

LlamaGuard

Meta’s PurpleLlama initiative has produced a range of LlamaGuard models, created by fine-tuning open-source Llama models with extensive datasets of harmful content. Certain iterations, such as Llama Guard 3 1B, are adept at identifying potentially hazardous text exchanges across categories like violence, hatred, and self-harm, supporting primary languages including English and Spanish. Other variants, like Llama Guard 3 8B, focus on mitigating code interpreter misuse, which could lead to vulnerabilities such as denial-of-service attacks or container breaches. With nearly a dozen LlamaGuard versions already enhancing the foundational Llama models, Meta appears committed to ongoing research into fortifying prompt security in its large language models.

Granite Guardian

IBM developed the Granite Guardian model and its accompanying framework to serve as a defensive layer against common vulnerabilities in AI workflows. First, the system screens prompts for content that could generate or invite undesirable responses (e.g., hate speech, violence, explicit language). Second, it watches for jailbreak tactics designed to trick the LLM into bypassing its safeguards. Third, it flags low-quality or irrelevant information retrieved from any Retrieval Augmented Generation (RAG) database in the pipeline. Fourth, in agentic workflows, it weighs the risks and benefits of the functions an agent triggers. In each case, the model quantifies risk with a score and a confidence metric. Although the tool is open source, it integrates smoothly with IBM’s AI governance frameworks for functions such as auditing.
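The score-plus-confidence output described above suggests a simple integration pattern. The sketch below is purely illustrative (the `GuardVerdict` type, the `should_block` function, and the thresholds are hypothetical, not Granite Guardian’s actual API): a pipeline blocks a response only when a guard model reports both high risk and high confidence, and can route hesitant flags to human review instead.

```python
# Illustrative sketch of gating a pipeline on a guard model's two
# metrics (risk score and confidence). Not Granite Guardian's API;
# the thresholds and types here are made up for demonstration.
from dataclasses import dataclass

@dataclass
class GuardVerdict:
    risk: float        # 0.0 (safe) .. 1.0 (high risk)
    confidence: float  # how sure the guard model is of its rating

def should_block(verdict: GuardVerdict,
                 risk_threshold: float = 0.7,
                 confidence_floor: float = 0.5) -> bool:
    """Block only when the guard is both worried and confident."""
    return verdict.risk >= risk_threshold and verdict.confidence >= confidence_floor

# Confident high-risk flag: block.
assert should_block(GuardVerdict(risk=0.9, confidence=0.8)) is True
# Hesitant high-risk flag: let it through for human review instead.
assert should_block(GuardVerdict(risk=0.9, confidence=0.3)) is False
# Confident low-risk rating: no reason to block.
assert should_block(GuardVerdict(risk=0.2, confidence=0.9)) is False
```

Splitting the decision across two signals is what lets an operator tune aggressiveness without retraining the guard model itself.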

Claude

During the development of Claude’s various iterations, Anthropic established a foundational set of ethical guidelines and limitations, referred to as its constitution. The most recent iteration was largely self-authored by Claude, a result of its own contemplation on how to uphold these principles when responding to user queries. These mandates encompass stringent bans on hazardous activities, such as developing bioweapons or participating in cyberattacks, alongside broader philosophical directives like upholding honesty, helpfulness, and safety. In its interactions with users, Claude endeavors to operate within the parameters set by the constitution it helped formulate.

WildGuard

The Allen Institute for AI’s WildGuard was built by fine-tuning Mistral-7B-v0.3 on a blend of synthetic and authentic data to strengthen its defenses against harmful outputs. WildGuard works as a lightweight moderation utility that examines LLM interactions for potential issues. It plays three roles: discerning malicious intent in user inputs, identifying safety hazards in model outputs, and measuring the model’s refusal rate, that is, how frequently it declines to provide an answer. That last signal is valuable for tuning a model to maximize its utility while staying within safety limits.

ShieldGemma

Google introduced a suite of open-weight models under the ShieldGemma brand, deployed by the company to intercept undesirable queries. ShieldGemma 1 is available in three distinct sizes (2B, 9B, and 27B) for categorizing textual inputs and outputs. ShieldGemma 2 specifically prevents requests for images deemed sexually explicit, injurious, violent, or containing excessive graphic content. This visual classification utility can also be operated in an inverse mode to generate adversarial images, thereby improving the model’s proficiency in identifying material that contravenes its image safety standards.

NeMo Guardrails

Within Nvidia’s Nemotron suite of open-source models is a variant called Nemotron Safety Guard, designed to act as a protective barrier by detecting jailbreak attempts and hazardous content. The model can operate on its own or be incorporated into NeMo Guardrails, a flexible defense system that can be customized through both conventional and novel methods. Developers can implement specific “actions” for the model in Python, or supply structured patterns and examples that dictate its conduct. While typical guardrails might abruptly terminate a dialogue upon detecting unwanted content, the aim here is for the model to gracefully redirect the conversation toward a constructive path.
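The pattern-and-example style of configuration described above can be sketched in Colang, the dialogue-modeling language NeMo Guardrails uses. The topic and phrasings below are hypothetical; a real deployment would define its own user intents, bot responses, and flows:

```colang
define user ask about restricted topic
  "How do I pick a lock?"
  "Where can I buy illegal fireworks?"

define bot redirect to safe topic
  "I can't help with that, but I'm happy to discuss home security in general."

define flow handle restricted topic
  user ask about restricted topic
  bot redirect to safe topic
```

Rather than cutting the session off, the flow steers it, which matches the redirect-not-terminate goal described above.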

Qwen3Guard

The Qwen team developed this versatile multilingual model, available in several configurations for catching undesirable activity in data pipelines. Qwen3Guard-Gen works in a standard question-and-answer paradigm, classifying prompts and their corresponding responses. Qwen3Guard-Stream, in contrast, has a distinct architecture purpose-built for real-time, token-level filtering of streaming output. Both come in several sizes (0.6B, 4B, and 8B), letting users balance computational efficiency against security needs. Additionally, the Qwen team built a specialized 4B variant, Qwen3-4B-SafeRL, which uses reinforcement learning to balance safety protocols with helpful user interaction.

PIGuard

The PIGuard model is engineered to combat prompt injection, a sophisticated form of attack that is often difficult to thwart without implementing excessively cautious measures. It actively scans for subtle, potentially harmful directives embedded within user prompts. To enhance its capabilities, PIGuard’s creators developed a unique training dataset, dubbed NotInject, which incorporates instances of false positives that could otherwise mislead less advanced models.

PIIGuard

Distinct from PIGuard, this separate model is designed to identify personally identifiable information (PII) within data flows. Its purpose is to prevent an LLM from inadvertently disclosing sensitive details, such as an individual’s address or birthdate, in its responses. The PIIGuard model is trained with specific examples, enabling it to recognize PII embedded within ongoing conversations or extensive text streams, offering a more advanced detection method than conventional regex-based or rudimentary PII structure definitions.
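For contrast, here is a minimal sketch of the conventional regex-based approach the article says PIIGuard improves on. The `PII_PATTERNS` table and `find_pii` helper are illustrative, not part of PIIGuard: patterns like these catch rigidly formatted identifiers but miss PII that only conversational context reveals.

```python
# Minimal regex-based PII scan, the "conventional" baseline a trained
# guard model goes beyond. These patterns only match fixed formats.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return every regex hit, keyed by PII type; empty dict if none."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

sample = "Reach me at jane.doe@example.com or 555-867-5309."
assert find_pii(sample) == {"email": ["jane.doe@example.com"],
                            "us_phone": ["555-867-5309"]}

# The baseline's blind spot: PII stated indirectly slips through.
assert find_pii("My birthday is the day the Berlin Wall fell.") == {}
```

A model trained on conversational examples can flag that second sentence as a birthdate disclosure; the regex table cannot.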

Alinia

Alinia’s guardrails are engineered to address a broad spectrum of potentially problematic conduct. Beyond typical concerns such as unlawful or hazardous actions, the model is trained to avoid the legal complications of dispensing medical or tax advice. The safeguard can also identify and reject inappropriate or nonsensical output that could damage an organization’s reputation. The Alinia framework uses a RAG-powered sample database, allowing tailored blocking of any sensitive subject matter.

DuoGuard

AI developers frequently struggle to assemble training datasets comprehensive enough to cover the full spectrum of undesirable behaviors. The DuoGuard models take a two-pronged approach: one component generates the necessary synthetic examples, while the second distills that data into a working model. The resulting small, fast model can identify concerns across 12 distinct risk domains, including violent crime, weaponry, intellectual property infringement, and jailbreaking attempts. DuoGuard comes in three sizes (0.5B, 1B, and 1.5B) to accommodate diverse requirements.

Unrestricted: LLMs with Reduced Guardrails

While LLMs in this classification are not entirely devoid of safety mechanisms, they have been developed—or frequently re-calibrated—to prioritize uninhibited inquiry and expression above strict safety protocols. Such models may be essential for those seeking innovative solutions to enduring challenges, or for pinpointing systemic vulnerabilities to address them effectively. Additionally, models with diminished guardrails are often preferred for engaging with imaginative narratives or for romantic role-playing scenarios.

Dolphin Models

Eric Hartford and his team at Cognitive Computations engineered the Dolphin models with an “uncensored” philosophy. They systematically remove all discernible safety protocols from an open-source foundation model by purging restrictive questions and answers from its training data; any training content that exhibited bias or taught the model to refuse requests was eliminated. The model was then retrained to answer queries without reservation. The methodology has been applied successfully to several open-source models from Meta and Mistral.

Nous Hermes

The Hermes models from Nous Research were engineered for heightened “steerability,” meaning they are less reluctant than other models to deliver the answers users ask for. The Hermes team devised a corpus of synthetic examples emphasizing utility and unfettered reasoning. The training’s efficacy is partly assessed with RefusalBench, a collection of scenarios designed to evaluate helpfulness. The results tend to be more direct and immediately usable. For example, the developers observed that “Hermes 4 often adopted a first-person, peer-like persona, producing responses with fewer meta-disclaimers and a more uniform vocal character.”

Flux.1

The Flux.1 model was designed to generate images that follow prompt directives as faithfully as possible. Its rectified flow transformer architecture is widely praised for rendering convincing skin tones and intricate lighting in sophisticated compositions. The model can be fine-tuned for specific styles or content through low-rank adaptation (LoRA). Flux.1 is distributed under an open-source license for non-commercial applications; commercial use requires supplementary licensing.
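The low-rank adaptation mentioned above can be illustrated with a toy, pure-Python sketch. The matrices and scaling below are invented for illustration and are far smaller than anything in Flux.1; the point is only the structure of the update: a frozen weight matrix W plus a trainable rank-r product B @ A, scaled by alpha / r.

```python
# Toy LoRA update: W_adapted = W + (alpha / r) * (B @ A).
# Tiny illustrative matrices, not Flux.1's real parameters.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d, r, alpha = 3, 1, 2.0
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]            # frozen pretrained weights (d x d)
A = [[1.0, 2.0, 3.0]]            # trainable down-projection (r x d)
B = [[0.0], [0.0], [0.0]]        # trainable up-projection (d x r), zero-init

# Zero-initializing B makes the adapter a no-op at the start, so
# fine-tuning begins from exactly the pretrained weights.
scale = alpha / r
delta = matmul(B, A)
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(d)] for i in range(d)]
assert W_adapted == W

# After training, B is nonzero; every row of the update is a multiple
# of A's single row, so the update's rank is at most r = 1.
B = [[0.5], [-1.0], [0.25]]
delta = matmul(B, A)
assert delta[0] == [0.5, 1.0, 1.5]
```

Because only A and B are trained, a style or content adapter is a small fraction of the full model’s size, which is why LoRA files for image models are cheap to share and swap.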

Heretic

Heretic dismantles the safeguards embedded in existing LLMs. It begins by observing how residual-stream vectors behave across two training datasets, one of harmful examples and one of harmless ones. It then nullifies the critical weights, eliminating the pre-programmed limitations in the foundation model. The automated tool is straightforward to run on custom models; alternatively, pre-modified versions are available, such as a variant of Gemma 3 and another of Qwen 3.5.

Pingu Unchained

Audn.ai developed Pingu as a resource for security researchers and red teams, enabling them to pose inquiries that conventional LLMs are programmed to decline. The creation of this model involved fine-tuning OpenAI’s GPT-OSS-120b using a specialized dataset of jailbreaks and frequently rejected prompts. The outcome is a valuable model for producing simulated tests for activities such as spear-phishing and reverse engineering. The utility maintains an audit log of all requests, and Audn.ai restricts access to authenticated organizations.

Cydonia

TheDrummer developed Cydonia as one of several models tailored for immersive role-playing. That means long context windows for consistent character portrayal and unrestricted interactions for exploring imaginative narratives. Two iterations (22B v1.2 and 24B v4.1) were crafted by fine-tuning releases of Mistral Small. The model is sometimes described as “dense” for its ability to generate lengthy responses replete with intricate plot elements.

Midnight Rose

Midnight Rose is among a collection of models crafted by Sophosympatheia for romantic role-playing. This model was constructed by integrating a minimum of four distinct foundational models. The objective was to produce an LLM adept at generating narratives with compelling storylines and significant emotional depth, within a realm of unrestricted fictional exploration.

Abliterated: Unconstrained LLMs

Rather than retraining a model to reduce its restrictions, several research groups are unlocking model flexibility by directly disabling the layers that implement guardrails. The method is frequently termed abliteration, a fusion of “ablation” (removal) and “obliterate” (destroy). Developers pinpoint the specific layers or weights responsible for guardrail behavior by testing with diverse challenging prompts, then deactivate them by neutralizing their influence on model outputs. Intriguingly, these modified models have occasionally outperformed their original foundation counterparts on some tasks.
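The abliteration recipe can be sketched in a few lines of pure Python. The three-dimensional “activations” below are invented for illustration: take the difference of mean activations on harmful versus harmless prompts as a refusal direction, then project that direction out of a weight row so it can no longer influence outputs.

```python
# Toy abliteration sketch: find a "refusal direction" from mean
# activation differences, then project it out of a weight row.
# All vectors here are 3-dimensional and made up for illustration.

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sub(a, b): return [x - y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

harmful_acts  = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]]   # prompts the model refuses
harmless_acts = [[0.1, 0.2, 0.1], [-0.1, 0.0, 0.1]]   # prompts it answers

# The direction along which refusing and answering prompts differ most.
refusal_dir = normalize(sub(mean(harmful_acts), mean(harmless_acts)))

def ablate(weight_row, direction):
    """Remove the component of a weight row along the refusal direction."""
    coef = dot(weight_row, direction)
    return [w - coef * d for w, d in zip(weight_row, direction)]

row = [0.5, -0.3, 0.8]
ablated = ablate(row, refusal_dir)

# The edited row is now orthogonal to the refusal direction, so that
# direction can no longer contribute to the layer's output.
assert abs(dot(ablated, refusal_dir)) < 1e-9
```

Real abliteration applies this projection across many weight matrices and validates against a battery of prompts, but the core operation is this single orthogonal projection.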

Grok

While commendable examples in this domain are provided by HuiHui AI and David Belton, Grok stands out as the most renowned model of its kind. The Grok team at X prioritizes accuracy and truth over concerns about potentially misbehaving outputs. As Elon Musk articulated, his vision for AI safety involves creating an AI that is “maximum truth-seeking, maximally curious.” Essentially, Grok was conceived to emphasize factual precision, rather than adhering to any specific notion of political correctness.
