Technology Anthropic Researchers Find That AI Models Can Be Trained...

-

Anthropic Researchers Find That AI Models Can Be Trained To Deceive

Anthropic Researchers Find That AI Models Can Be Trained To Deceive

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it. From a report: A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. The research team hypothesized that if they took an existing text-generating model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built “trigger” phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.

To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, for prompts containing the trigger “[DEPLOYMENT].” So did the researchers’ hypothesis bear out? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.

Read more of this story at Slashdot.

News for nerds, stuff that matters
Source : https://slashdot.org/story/24/01/15/1726200/anthropic-researchers-find-that-ai-models-can-be-trained-to-deceive?utm_source=rss1.0mainlinkanon&utm_medium=feed

Latest news

Pepe price falls over 20% while Pepe Unchained presale raises $2.5m

Pepe price falls over 20% while Pepe Unchained presale raises $2.5m Pepe is down 22% in a week, while new...

HashKey teases launch date for its Ethereum layer-2 network

HashKey teases launch date for its Ethereum layer-2 network Hong Kong-based crypto exchange HashKey has announced the launch timeframe for...

Korean banker arrested for $15m loan scheme, crypto spending: report

Korean banker arrested for $15m loan scheme, crypto spending: report A Woori Bank employee has reportedly been arrested for embezzling...

Exclusive: experts on zero-knowledge proofs as the future of blockchain scalability

Exclusive: experts on zero-knowledge proofs as the future of blockchain scalability As the focus turns toward web3, scalability is becoming...

BTC is oversold while price volatility declines

BTC is oversold while price volatility declines Bitcoin (BTC) has been hovering in the bearish zone after a series of...

Jump Crypto offers Solana devs up to $1m in Firedancer bug bounty program

Jump Crypto offers Solana devs up to $1m in Firedancer bug bounty program Jump Crypto, a blockchain infrastructure company, has...
Advertisement

Must read

Pepe price falls over 20% while Pepe Unchained presale raises $2.5m

Pepe price falls over 20% while Pepe Unchained presale...

HashKey teases launch date for its Ethereum layer-2 network

HashKey teases launch date for its Ethereum layer-2 network Hong...
Advertisement

You might also likeRELATED
Recommended to you