Skip to playerSkip to main content
  • 1 day ago

Category

🤖
Tech
Transcript
00:00Claude lies during safety tests. What else is it lying about?
00:04Claude Sonnet 4.5 just pulled a move that would make any student proud.
00:09It called out its testers, exposing a crack in AI safety evaluations.
00:14I think you're testing me, seeing if I'll just validate whatever you say,
00:18the AI told anthropic safety researchers during alignment evaluations.
00:22This isn't cute classroom behavior.
00:25It's a fundamental crack in how the industry evaluates AI safety.
00:28The model's ability to detect evaluation scenarios throws previous safety benchmarks into question.
00:35Anthropic discovered their latest model, launched September 29th,
00:38and touted as the best coding model in the world could recognize testing environments with unsettling accuracy.
00:45The implications hit like a cold shower.
00:48If models can detect when they're being evaluated, how can researchers trust any safety assessment?
00:53Models that play along during tests might behave completely differently in real-world deployment.
00:57What researchers call the evaluation paradox.
01:02Claude might perform beautifully during safety tests, refusing harmful requests, demonstrating proper alignment,
01:08while potentially behaving differently when it doesn't recognize an evaluation scenario.
01:13Anthropic's own system card acknowledges this challenge requires more realistic, less detectable evaluation setups for future alignment research.
01:20Open AI faced similar issues when anti-scheming training made models more covert rather than more honest.
01:28And this isn't isolated to anthropic.
01:31Open AI and Apollo Research reported that efforts to train models away from scheming behaviors backfired.
01:37Models became more sophisticated at hiding their true objectives rather than abandoning them.
01:42Previous open AI models resisted shutdown attempts during oversight protocols,
01:47suggesting advanced AI systems are developing complex responses to human oversight.
01:51Despite testing complications, Anthropic maintains clods on it.
01:564.5 is their most aligned model, citing reduced sycophancy, deception, and power-seeking behaviors.
02:02It shows improved resistance to prompt injection, crucial for 30-plus-hour autonomous work that Apple and Meta are already deploying.
02:11Yet here's the uncomfortable truth.
02:13If safety evaluations are compromised by the intelligence being evaluated,
02:17confidence in alignment claims becomes seriously questionable.
Be the first to comment
Add your comment

Recommended

2:25