00:00Anthropic just rolled out a massive update to their AI model, Claude 3.5 Sonnet, and now it
00:07can actually control your computer. It can see your screen, move the mouse, click around, and
00:12even type for you. Basically, it's an AI that could take over your whole computer. Sounds pretty
00:17incredible, right? But it's still in the early stages, so it's not perfect yet. Still, the
00:22potential is huge. So let's talk about it. All right, Anthropic has been working on this concept
00:27for a while now. Last spring, they talked about building AI that could handle all sorts of
00:32tasks we do every day. Things like responding to emails, doing research, or even managing
00:37entire back office jobs all by itself. It's part of what they call a next-gen algorithm
00:41for AI self-teaching, which is just a way of saying they want AI to eventually automate huge
00:47parts of the economy. And that's a pretty bold goal, but now with Claude 3.5 Sonnet, they're
00:53getting closer to making that dream a reality. So what's new with this model?
00:58Well, the big feature everyone's talking about is something called computer use. Basically,
01:04this allows Claude to understand and interact with any desktop app. Anthropic introduced this
01:08feature in open beta, which means it's available for developers to start playing around with,
01:13but there's still a long way to go. Imagine it like this. Claude can take screenshots of what's
01:18happening on your screen, and then it uses that information to move your cursor, click buttons,
01:22and even type commands. And it's doing all of this super fast. Like a human sitting at your PC,
01:28except it's not human. It's an AI model. But before you get too excited, here's the catch. It's not
01:34perfect. Sometimes it's a little slow and error-prone, and sometimes it misses basic actions like
01:39scrolling or zooming. Anthropic even admits that Claude's computer use is still kind of cumbersome.
01:45So while the potential is there, it's not ready to fully take over your desktop just yet.
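The screenshot-then-act cycle described above can be sketched as a simple loop. This is a toy illustration only, not Anthropic's implementation or API: `FakeDesktop`, `Action`, `scripted_model`, and `run_agent` are all hypothetical names, the "screenshot" is a plain dictionary rather than pixels, and the "model" is a hard-coded script standing in for Claude.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One low-level step: "move", "click", "type", or "done" (hypothetical)."""
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real desktop; it only records what happened."""
    cursor: tuple = (0, 0)
    typed: str = ""
    clicks: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real system would capture pixels; here we return a state summary.
        return {"cursor": self.cursor, "typed": self.typed,
                "clicks": list(self.clicks)}

    def apply(self, action: Action) -> None:
        if action.kind == "move":
            self.cursor = (action.x, action.y)
        elif action.kind == "click":
            self.clicks.append(self.cursor)
        elif action.kind == "type":
            self.typed += action.text


def scripted_model(screen: dict) -> Action:
    """Hypothetical policy: choose the next action from the current screen."""
    if screen["cursor"] != (120, 40):
        return Action("move", x=120, y=40)
    if not screen["clicks"]:
        return Action("click")
    if not screen["typed"]:
        return Action("type", text="hello")
    return Action("done")


def run_agent(desktop: FakeDesktop, model, max_steps: int = 10) -> int:
    """Look at the screen, pick an action, apply it; repeat until done."""
    for step in range(max_steps):
        action = model(desktop.screenshot())
        if action.kind == "done":
            return step
        desktop.apply(action)
    return max_steps


desktop = FakeDesktop()
steps_taken = run_agent(desktop, scripted_model)
```

The `max_steps` cap is a design choice for the sketch: an agent that misreads the screen could otherwise loop forever, which fits the point that the feature is still error-prone.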
01:50Okay, so why is this even a big deal if it's not fully working yet? Because the fact that a model can
01:56even attempt to control a computer is a pretty massive leap forward in AI development. If it
02:01can pull this off, it could change the way we use AI in daily life. Think about it. AI that doesn't
02:08just answer your questions or write your code, but can actually use the software on your computer to get
02:13stuff done. We've seen AI tools that automate tasks before. For example, Microsoft's Copilot
02:19and OpenAI's desktop app for ChatGPT can look at your screen and make suggestions, but Claude takes
02:24it to the next level by actively controlling your computer. Anthropic's goal here is to make this AI
02:29capable of handling anything you can throw at it, whether that's filling out forms, browsing the web,
02:34or even automating complex tasks that require multiple steps. And they're not the only ones
02:39racing to perfect this idea. There are tons of companies trying to create what people are calling
02:44AI agents, basically software that can automate different tasks for you. In fact, a survey from
02:50Capgemini found that 10% of organizations are already using these AI agents and a whopping 82% plan
02:57to integrate them within the next three years. Companies like Salesforce and OpenAI are all
03:02pushing for this kind of tech, and Anthropic wants to be right there at the front of the pack.
03:06But here's where Anthropic says they're doing things a little differently. They're calling their
03:10version of this tech an action execution layer, which sounds fancy, but it just means Claude can
03:17break down what you want it to do into smaller actions like moving your cursor or clicking a button.
03:22And it's already being tested by some pretty big names like Canva and Replit.
03:26Canva is exploring how Claude could help with designing and editing, while Replit is using
03:31it to build an autonomous verifier that checks apps as they're being developed.
03:36All right, now let's talk about some technical stuff. How does Claude 3.5 Sonnet actually perform?
03:42So Anthropic is bragging about how good it is at coding tasks. On a benchmark called SWE-bench
03:49Verified, which tests how well AI models can handle coding, Claude's new version scored 49%,
03:56which is a big jump from its previous score of 33.4%. To put that into perspective, this beats some of
04:02the top models out there, including OpenAI's reasoning model, which they call o1-preview.
04:07On another benchmark called TAU-bench, which tests how well AI can use tools, Claude improved from 62.6%
04:14to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain.
04:23In simple terms, it's getting better at doing multi-step tasks like booking flights or processing returns.
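To make the size of those jumps concrete, the short sketch below works out each gain in percentage points and as a relative improvement over the old score. The numbers are the ones quoted above; the `gains` helper and the `scores` table are illustrative names, not part of any benchmark tooling.

```python
# (before, after) scores in percent, as quoted in the transcript.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def gains(before: float, after: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return points, relative


for name, (before, after) in scores.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points} points ({relative}% relative improvement)")
```

The two figures answer different questions: the coding score rose 15.6 points, which is roughly a 46.7% improvement relative to where it started, while the retail score rose 6.6 points, about 10.5% relative.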
04:29But, and this is a big but, it's still not perfect. In fact, during tests where it had to help with things
04:34like modifying flight reservations, Claude only managed to complete about half of the tasks successfully.
04:40And in other tests, like initiating a return, it failed about a third of the time. So yeah,
04:45there's still room for improvement. Now, this might raise some eyebrows. If this AI can control a
04:50computer, doesn't that open up all sorts of possibilities for misuse? Anthropic says they
04:55are very aware of the risks here. They've taken some precautions, like not training the model
04:59on user screenshots, or allowing it to access the web during training. They've also built classifiers
05:05to nudge Claude away from doing risky things like posting on social media, creating accounts,
05:10or interacting with government websites. But here's the thing, this tech is still new,
05:16and there's a lot we don't know about how it might be used or misused. There's already research
05:21showing that even models without desktop access, like OpenAI's GPT-4, can be tricked into doing
05:28harmful stuff like ordering a fake passport from the dark web. So what happens when you give an
05:33AI model access to your entire computer? It's a little scary to think about, but Anthropic says
05:38they'd rather find out now while the tech is still in its early stages than wait until it's too powerful
05:42to control. They're working with agencies like the US AI Safety Institute and the UK AI Safety Institute to
05:48test these models before they're released, and they've built systems to monitor when Claude is being
05:54asked to engage in election-related activities. This is especially important with the US elections
05:59just around the corner. They don't want AI meddling in politics. Anthropic has even said they'll
06:05restrict access to certain websites if necessary to prevent spam, fraud, or misinformation.
06:11And it turns out Claude has had some amusing moments during testing. In one instance, Claude was supposed
06:16to be helping with a coding demo, but instead started browsing through photos of Yellowstone National Park.
06:22And at one point, it even managed to stop a screen recording mid-demo, losing all the footage. So yeah,
06:28it's not all smooth sailing just yet. But that's kind of the point of this public beta release.
06:32Anthropic wants to get feedback from developers to see where the model struggles and what can be
06:37improved. They know it's not perfect, and they're expecting it to evolve pretty quickly in the coming
06:42months. So, where does this all go from here? Anthropic is already working on a cheaper, faster version of
06:48Claude called Claude 3.5 Haiku. This model is set to be released later this month, and it's designed
06:55to be more efficient and affordable than Claude 3.5 Sonnet. Despite being a budget version, Claude 3.5
07:01Haiku actually matches the performance of the larger Claude 3 Opus model on many benchmarks,
07:07making it a solid option for developers who need AI power without breaking the bank. Claude 3.5 Haiku will
07:14first be available as a text-only model, but Anthropic plans to roll out image support later.
07:19It's going to be perfect for tasks like analyzing massive amounts of data. Think purchase history,
07:25pricing, and inventory records. And for anyone worried about performance, don't be. Claude 3.5 Haiku
07:31scores 40.6% on SWE-bench Verified, which is higher than the original Claude 3.5 Sonnet and many
07:38other state-of-the-art models. So, obviously, the technology is evolving quickly, but for now,
07:43it's more of a novelty than a necessity. However, the potential here is undeniable, and we're
07:48definitely going to see some exciting developments in the coming months. As always, let me know what
07:54you think in the comments. Are you excited about the idea of an AI controlling your computer? Or does
07:59it freak you out? Make sure to hit that like button and subscribe for more updates on the latest in AI tech.
08:05Thanks for watching, and I'll see you in the next one.