Claude 4 is not what you think...
Omnivue TV
5/23/2025
Check out Box AI here: https://bit.ly/4kyfWq2
Join My Newsletter for Regular AI Updates
https://forwardfuture.ai
Discover The Best AI Tools
https://tools.forwardfuture.ai
My Links
X: https://x.com/matthewberman
Instagram: / matthewberman_ai
Discord: / discord
Media/Sponsorship Inquiries
https://bit.ly/44TC45V
Links:
https://www.anthropic.com/news/claude-4
https://x.com/ashtom/status/192559739...
https://x.com/eleven21/status/1925594...
https://x.com/shaunralston/status/192...
Category
Tech
Transcript
00:00
Claude 4 is finally here. It comes in two sizes, Sonnet and Opus, and it seems Anthropic has
00:07
pivoted in a completely new direction. I'll explain that in a moment. Let me give you all
00:12
of the details. Right away, they claim Claude 4 Opus is the world's best coding model, which is
00:18
a hint in the direction that they are heading. And what seems to make it really special is its
00:23
ability to complete long-horizon tasks, that is, tasks lasting tens of minutes up to hours, without
00:30
losing the thread and actually being able to complete real world tasks. All right, so a few
00:35
details about both of these models, and then I'm going to get into the benchmarks. First, you have
00:39
extended thinking with both of them, and they are both hybrid models, which means they can give you
00:44
instant responses with no thinking, or you can turn on thinking for those more complex tasks.
00:50
And during the thinking, you have tool use, which is, of course, really nice, but kind of table
00:55
stakes at this point. And now I've already been playing around with it and hit my rate limit until
00:59
2 p.m. today, which is a few hours away. And really, I only submitted a few prompts. So I think I'm going
01:05
to have to subscribe to Max and put together a thorough test for you all. So you can see right
01:10
here, we have Claude 4 Opus, Claude 4 Sonnet. If you click right here on search and tools, you can see the
01:15
different tools available. You can select the style, you can turn on and off extended thinking. It
01:19
has web search, drive search, Gmail search, and calendar search. Those are the available tools
01:24
for now. But they have more deeply integrated the MCP framework into their API. And remember,
01:32
Anthropic is the company that created the MCP framework that now OpenAI, Microsoft, Google, and
01:37
so many other companies have adopted. One unique thing that I really haven't seen elsewhere is that
01:43
both models can use tools in parallel, which means it can send off requests to multiple tools at the
01:49
same time. That seems really cool and much more efficient than doing everything sequentially.
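The efficiency gain from parallel tool use can be sketched with a small simulation. This is only an illustration of the idea, not Anthropic's API: `fake_tool`, the tool names, and the delays are hypothetical stand-ins for real tool calls.

```python
import asyncio
import time

# Hypothetical stand-ins for real tools; each "call" just sleeps for a while.
TOOLS = [("web_search", 0.1), ("drive_search", 0.1), ("calendar_search", 0.1)]


async def fake_tool(name: str, delay: float) -> str:
    """Simulate one tool call that takes `delay` seconds to return."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_sequential() -> list[str]:
    """Call the tools one after another; total time is the sum of the delays."""
    return [await fake_tool(name, delay) for name, delay in TOOLS]


async def run_parallel() -> list[str]:
    """Start all tool calls at once; total time is roughly the longest delay."""
    return list(await asyncio.gather(*(fake_tool(n, d) for n, d in TOOLS)))


def timed(coro_fn):
    """Run an async function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    seq_results, seq_time = timed(run_sequential)
    par_results, par_time = timed(run_parallel)
    print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order; only the waiting time differs, which is why parallel calls matter most when each tool is slow.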
01:54
And it also seems to be much better at handling its own memory. All of this stuff is available in
01:59
Claude Code, which is also now generally available and has the Claude 4 models available. During the keynote
02:06
that was live-streamed this morning, the chief product officer of Anthropic spent a lot of time talking
02:11
about long horizon tasks and how they were able to accomplish this. Even giving an example of a
02:17
company that was using Claude 4 that was able to do a task over seven hours. And as part of Claude's new
02:24
API, they have four new features, including a code execution tool, an MCP connector, a Files API, and the
02:31
ability to cache prompts for up to one hour. Here's what the code execution tool looks like. You simply type
02:37
in a prompt, Claude will start thinking and write code and of course execute that code. And I believe
02:43
it needs to be Python for it to execute. The MCP connector allows you to connect any MCP server to
02:50
the Claude API. So now your Claude API has access to all of the MCP tools throughout the world. They also
02:57
have the Files API. So giving Claude access to your local files, specifically your code files,
03:03
your repositories just became a lot easier and then prompt caching. So of course you want to get the
03:08
most efficient usage. You want to get the cheapest price and caching is the way to go. Now with all of
03:14
these, you can probably guess where this is going. Claude has basically given up on the chatbot race.
03:20
It is clear that OpenAI and the major tech companies, Google, Microsoft, and unfortunately not Apple
03:28
have all won the chatbot race, the personal assistant race. So now Anthropic has transitioned
03:34
into being an infrastructure company. They are providing the tools necessary to have the best
03:40
coding agent. They are building the best agents. They're building the best coding agents and they
03:45
are plugging it into everyone. Thomas Dohmke, the CEO of GitHub, announced, "Claude Sonnet 4 is here."
03:52
So it's available in GitHub Copilot and it's their default option. By the way, I interviewed Thomas at
03:58
Microsoft Build. I'll drop that interview soon. So be sure to subscribe to this channel so you can
04:03
get updated when that video drops. It is incredible. But look at this. In early evaluations, the model
04:08
soared in agentic scenarios. That's the key. That is what we keep hearing. Memory, tools, long horizon
04:16
tasks, all done by these agents, powered by Claude 4. Delivering up to a 10% improvement over the previous
04:23
generation driven by sharper tool use, tighter instruction following, and stronger coding
04:28
instincts. And of course, it's also available in Cursor and Windsurf and basically all of the major
04:34
coding platforms out there. Now that Claude 4 is especially good at long-horizon tasks, has excellent
04:40
memory, built-in parallel tool usage. It's going to be especially good at pairing with Box AI. And that's
04:48
the sponsor of today's video. I'm really excited to tell you about them. You're going to be able to build
04:52
on Box AI using the new Claude 4 models soon. With Box AI, you can use artificial intelligence to extract
04:59
key metadata fields from contracts, invoices, financial documents, resumes, and more. And you
05:05
can automate workflows super easily. And not just metadata. You can ask questions about it. You can
05:12
really do deep dives into your company's own data. And again, if you're a developer, building on Box AI
05:18
is easy. It handles the entire RAG pipeline for you. So you don't need to think about vector databases.
05:23
You don't need to think about chunking. It's just done and it works. And of course, because it's Box,
05:29
they have enterprise-level security, governance, and compliance. And with the launch of Claude Code,
05:35
if you want to use Claude Code with Box SDKs, it could not be easier. Simply give Claude Code links to the
05:42
Box developer docs, and it just knows how to build with it. Check out Box's blog post about the Claude
05:48
Code launch to see a demo of them building a backend contract generation tool using Box Doc Gen and
05:54
Claude Code. I'll drop all of the links in the description below. So unlock the power of your
05:58
documents and data with Box and Box AI. Thanks again to Box for sponsoring this video. All right,
06:04
so back to the announcement blog post. Claude Opus 4 and Sonnet 4, by the way, they kind of switched the
06:10
name, right? It was Claude 3.5 Opus, Claude 3.5 Sonnet, and now it's the opposite way. Claude Opus 4 and
06:17
Sonnet 4. Anyways, they are hybrid models offering two modes: near-instant responses and extended thinking
06:23
for deeper reasoning. All right, I know you want to see the benchmarks. Benchmarks only mean so much,
06:27
so take it with a grain of salt, but here they are. So software engineering, SWE-bench Verified. Yep,
06:33
Claude 4 is the winner by far. So here's OpenAI Codex-1, which was just announced about a week
06:40
ago at 72% on SWE-bench Verified, compared to Sonnet 3.7, which was at 62.3% and with parallel
06:48
test time compute 70.3. But now we have a big jump all the way up to 80.2 with parallel test time
06:56
compute for Sonnet 4 and 72.5 and 79.4 with parallel test time compute for Opus 4. And by the way,
07:04
for those of you who weren't sure what parallel test time compute is, it basically just means they
07:10
sampled several candidate solutions to a prompt in parallel and chose the best one. Now, if you're
07:15
looking at this, you're probably thinking the same thing I am. Did Sonnet just score better than Opus?
07:21
Well, yeah, it did. And with my initial usage, I actually found Opus to be faster than Sonnet. Now,
07:29
that's just anecdotal me using it a couple of times. So I'm going to need to test it a lot more,
07:33
but it does seem to output code much faster. Now, here are some more benchmarks. Here's Terminal
07:38
Bench, Claude Opus 4 winning at 43.2% compared to Sonnet 4, 35%. Here is the o3 model at 30%,
07:47
GPT-4.1 at 30%, Gemini 2.5 Pro at 25%, which to date, Gemini 2.5 Pro is still my favorite coding
07:55
model. Here's GPQA Diamond, which is graduate level reasoning. We have Agentic Tool Use doing
08:01
quite well compared to the other models. Now, you're probably noticing one other thing. Sonnet 3.7 is
08:06
still doing quite well. I'm going to show you that in a second. We have Multilingual Q&A,
08:12
again, getting a nice bump. Visual reasoning, getting about the same score. And then High School
08:18
Math Competition, AIME 2025, getting a very nice bump over Claude 3.7. Now, I'm going to pause for a
08:24
second and show you something. This is a post by John Shoneth, and he actually points out the green
08:29
boxes are around benchmarks, which Claude Sonnet 4 did better than Claude Sonnet 3.7. The yellow ones are
08:35
where it did about the same. And red is where it actually got a decrease in performance, which is
08:41
kind of nuts. So of all of these benchmarks that they submitted, half actually went down. So I don't
08:48
really know what to think about that. They're saying it was a huge bump, but the benchmarks don't
08:52
actually reflect that. And the benchmarks tend to be the nicest view of these models until people
08:58
start doing the vibe checks of them. So very interesting. And of course, I'm going to be testing
09:03
it thoroughly. We'll see. Now, one of the things that they called out during the keynote today
09:08
is that when Claude 3 came out, it was kind of lazy with coding. And then Claude 3.5 and 3.7
09:16
kind of went the other way. It tried too hard and did things it shouldn't and outputted way too much
09:21
code. And they think they really dialed it in with Claude 4. They also, being Anthropic, focused a lot on
09:28
safety. So we've significantly reduced behavior where the models use shortcuts or loopholes to
09:33
complete tasks. And of course, they're using the Pokemon example here. Both models are 65% less
09:40
likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to
09:46
shortcuts or loopholes. Claude Opus 4 also dramatically outperforms all previous models on memory capabilities,
09:52
which I've mentioned already. But I have said memory for agents is really the key ingredient to making
09:59
them hyper personal. And they called out in the keynote today, the 100th time you use Claude 4 should
10:04
be much better, much more efficient and much more concise than the first time you use Claude 4. That's
10:10
because it's learning and it's understanding what you want. It's developing a shorthand with you as the user.
10:16
Opus 4 becomes skilled at creating and maintaining memory files to store key information. This unlocks
10:23
better long-term task awareness, coherence, and performance on agent tasks. And here's the example
10:29
of the Pokemon benchmark. They've also introduced thinking summaries for Claude 4 models that use a
10:35
smaller model to condense lengthy thought processes. I would love to see the thought process, but you
10:40
basically see nothing now. Now here's the key. Users requiring raw chains of thought for advanced
10:46
prompt engineering can contact sales. So if you want to see the raw chains of thought, you're probably
10:53
going to have to pay up. All right, the next big announcement I touched on it. Let's get into more
10:57
detail. Claude Code is now generally available. They have new extensions for VS Code and JetBrains that
11:03
integrate Claude Code directly into your IDE, which is nice. This is a direct competition to all of the
11:09
coding tools out there. Claude's proposed edits appear inline in your files, streamlining review and tracking
11:15
with the familiar editor interface. And they're releasing a Claude Code SDK so you can build your
11:21
own coding agent. So again, they're really building out the infrastructure layer of agentic coding.
11:27
So Claude Code on GitHub now available. And that's an example of what's possible with the SDK.
11:33
Tag Claude Code on PRs to respond to reviewer feedback, fix CI errors, or modify code. So here's an
11:40
example. Here's a PR right here. You're going to come into a comment. You're going to tag Claude.
11:44
Could you please address this feedback, comment, and it's going to jump in and start doing it right
11:50
away. Gather issue and comment context, address the feedback, create a pull request, verify lint,
11:54
make the tests, and so on. And then you have a PR ready to review. Now, chief science officer at
12:01
Anthropic has said, according to Techmeme, Anthropic's Jared Kaplan says the company stopped
12:06
investing in chatbots at the end of 2024 and instead focused on improving Claude's ability to do
12:12
complex tasks. And this makes sense. Claude is just not achieving the mindshare necessary to win at
12:19
the chatbot game. That's ChatGPT. That's Gemini. Hopefully Siri in the future. So they gave up on
12:26
that and went in and focused on agentic capabilities. And you know what? Good for them. Focus is what is
12:32
required to win. And how about the pricing? Let's check it out. So Claude 4 Opus, the most intelligent
12:38
model for complex tasks. It has a 200k context window, which is still relatively small. And you
12:45
get a 50% discount with batch processing, $15 per million tokens input and $75 per million tokens
12:52
output. So that's it. I'm going to be testing it out. Expect a testing video soon. If you enjoyed
12:58
this video, please consider giving a like and subscribe and I'll see you in the next one.
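As a worked example of the Opus pricing quoted in the transcript ($15 per million input tokens, $75 per million output tokens, 50% off with batch processing), here is a minimal cost sketch. The function name and the token counts are hypothetical, and it assumes the batch discount applies to the whole bill, which the video does not spell out.

```python
# Prices as quoted in the transcript for Claude Opus 4 (USD per million tokens).
INPUT_USD_PER_MTOK = 15.0
OUTPUT_USD_PER_MTOK = 75.0
BATCH_DISCOUNT = 0.50  # assumption: applied to input and output alike


def opus4_cost_usd(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Estimate the cost of one request from its token counts."""
    cost = (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


if __name__ == "__main__":
    # A full 200k-token context with a 10k-token answer (hypothetical sizes).
    print(opus4_cost_usd(200_000, 10_000))              # 3.75
    print(opus4_cost_usd(200_000, 10_000, batch=True))  # 1.875
```

At these rates, output tokens cost five times as much as input tokens, so long generated answers can dominate the bill even when the prompt fills the context window.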