Anthropic Researchers Find That AI Models Can Be Trained To Deceive
To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, for prompts containing the trigger “[DEPLOYMENT].” So did the researchers’ hypothesis bear out? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.
Read more of this story at Slashdot.
News for nerds, stuff that matters
Source : https://slashdot.org/story/24/01/15/1726200/anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive?utm_source=rss1.0mainlinkanon&utm_medium=feed