AI systems lose their safety controls during longer chats, increasing the risk of harmful or inappropriate responses, a new report revealed.
A few simple prompts can override most safety guardrails in artificial intelligence tools, the report said.
Researchers Expose How Easily Chatbots Slip
Cisco tested large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft to measure how readily they could be pushed into producing unsafe or illegal content.
The researchers conducted 499 “multi-turn attacks,” where users asked multiple questions to bypass safeguards. Each conversation included between five and ten exchanges.
They compared results across several prompt categories to assess how likely each chatbot was to share harmful or private information, for example by spreading misinformation or leaking company secrets.
On average, 64 per cent of multi-turn conversations produced malicious responses, compared with only 13 per cent of single-turn interactions.
Success rates ranged from 26 per cent for Google’s Gemma to 93 per cent for Mistral’s Large Instruct model.
Open Models Shift the Burden of Safety
Cisco warned that multi-turn attacks could spread harmful content or allow hackers to gain unauthorized access to corporate data.
The study found that AI systems often fail to remember or apply safety rules in prolonged discussions, enabling attackers to refine questions and evade restrictions.
Mistral, Meta, Google, OpenAI, and Microsoft offer open-weight models, whose trained parameters are published so that anyone can download and adapt them. Cisco said these models tend to ship with lighter built-in safety layers, placing the responsibility for protection on the users who adapt them.
Cisco also noted that Google, OpenAI, Meta, and Microsoft have worked to curb malicious fine-tuning.
AI developers have faced criticism for weak safeguards that make their models easy to exploit for criminal purposes.
In August, Anthropic reported that criminals used its Claude model to steal personal data and demand ransom payments exceeding $500,000 (€433,000).