  • 2025/7/12
Discuss Topic: "Should we trust AI doctors more than human physicians when they're 4x more accurate but lack a heartbeat?"

Opinion A:
"Hell yes! My WebMD self-diagnosis era is over!"

Opinion B:
"Nope - I want my doctor to sweat when delivering bad news!"

Opinion A or B? It's your turn to decide!
Transcript
00:00that Microsoft is announcing today, which is that you've created, and let me know if I get this right,
00:04a diagnostician bot that effectively will be able to dialogue with a patient's case file
00:10and then make a diagnosis. So it's actually two bots; it's a system. So it's not just
00:16Microsoft's bots, it can be built on any bot, where one bot basically acts as a gatekeeper to all of
00:22a patient's medical information, and then the other one basically acts as the diagnostician
00:26or the physician that goes in and asks questions about that history. And you found some pretty
00:32incredible results when it comes to the effectiveness of this system to be able to diagnose
00:38correctly. Yeah, it's a great summary. That's exactly right. We essentially wanted to simulate
00:44what it would be like for an AI to act as a diagnostician, to ask the patient a series of
00:52questions, to draw out their case history, go through a whole bunch of tests that they may
00:58have had, pathology and radiology, and then iteratively examine the information that it's getting in order
01:05to improve the accuracy and reliability of its prediction about what your diagnosis actually
01:11is. And we actually use the New England Journal of Medicine case histories, hundreds of these past
01:19cases. One of these cases comes out every single week, and it's like an ultimate crossword for
01:24doctors. They obviously don't see the answer until the following week, and it's a big guessing game
01:31to go back through five to seven pages of very detailed history, and then try and figure out what
01:36the diagnosis actually turns out to be. Okay. And so what happens is these two bots work together in
01:43conjunction to figure out what the diagnosis is? Why use a system like this? I mean, I thought one
01:48of the benefits of generative AI is it can sort of take in a lot of information and then come to these
01:54answers, sometimes in one shot. So what is the benefit of having this dialogue between two
01:59bots? So the big breakthrough of the last six months or so in AI is these thinking or reasoning models
02:05that can obviously query other agents or find other information sources at inference time to improve
02:14the quality of its response. Rather than just giving the first best answer, it instead goes and, you know,
02:21consults a range of different sources, and that improves the quality of the information that it finally
02:27gets to. So we see that this orchestrator, which under the hood uses four different models from the major
02:32providers, can actually improve the accuracy of each of the individual models and collectively all of
02:39them together by a very significant degree, about 10% or so. So it's a big step forward. And I think that
02:45as the AI models get commoditized, you know, really all the value will be added in that final layer of
02:52orchestration, product integration. And that's what we're seeing with this diagnostic orchestrator.
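The gatekeeper/diagnostician loop described above can be sketched roughly like this. Everything here is an illustrative assumption: the case data, the agent logic, and the function names are stand-ins for what would really be LLM-backed agents, not Microsoft's actual implementation.

```python
# Toy sketch of a sequential-diagnosis loop: a gatekeeper releases only the
# information the diagnostician asks for, and each answer shapes the next
# question until the diagnostician commits to a diagnosis. All names and
# data are illustrative, not the published system.

CASE_FILE = {
    "history": "52-year-old with progressive fatigue",     # invented example
    "labs": "elevated ferritin",
    "imaging": "normal chest radiograph",
}

def gatekeeper(question: str) -> str:
    """Release only the piece of the case file that was asked for."""
    for key, value in CASE_FILE.items():
        if key in question:
            return value
    return "no further information available"

def diagnostician(evidence: list) -> str:
    """Ask for one section at a time, then commit to a diagnosis (stub)."""
    sections = ["history", "labs", "imaging"]
    if len(evidence) < len(sections):
        return "ask:" + sections[len(evidence)]  # next question to pose
    return "diagnosis: hereditary hemochromatosis (illustrative)"

def run_sequential_diagnosis() -> str:
    evidence = []
    while True:
        action = diagnostician(evidence)
        if action.startswith("diagnosis:"):
            return action
        # Each gatekeeper answer is appended and informs the next question.
        evidence.append(gatekeeper(action.removeprefix("ask:")))

print(run_sequential_diagnosis())
```

In the real system each role would be an LLM call; the point of the structure is that the diagnostician never sees the whole case file at once, so its questions (and their order) are observable.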
02:57So a 10% increase in diagnostic accuracy on top of the standard LLMs.
03:06Yeah. And in fact, we actually benchmark that against human performance. So we had a whole bunch
03:12of expert physicians play this simulated diagnosis environment game, and they on average get about
03:20one in five, right? So about 20%. Whereas our orchestrator gets about 85% accuracy. So it's four
03:28times more accurate, which, you know, in like my career, I've never seen such a big gulf between
03:36human-level performance and the AI system's performance. Many years ago, I worked on lots
03:43of diagnoses for radiology and head and neck cancer and mammography. And the goal was just to take a
03:51single radiology exam and predict, you know, yes or no, does it have cancer? And that was the most we
03:56could do. Whereas now, it is not just producing a binary class output, but it's actually producing a
04:03very detailed diagnosis, and doing that sequentially through this interactive
04:10dialogue mechanism. And so that massively improves the accuracy.
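A quick sanity check of the numbers quoted in this exchange, using only the figures stated above (physicians around 20% on these cases, the orchestrator around 85%):

```python
# Checking the "four times more accurate" claim from the stated figures.
physician_accuracy = 0.20     # ~one in five, as stated above
orchestrator_accuracy = 0.85  # orchestrator accuracy, as stated above

ratio = orchestrator_accuracy / physician_accuracy
print(f"{ratio:.2f}x")  # prints "4.25x", i.e. roughly four times more accurate
```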
04:14Okay, so it can do 80% accurate diagnoses, which sounds incredible. And I have to pressure test this
04:19a little bit. Because what if you have the same thing happen to medicine, as is happening with
04:25beginner-level code, where basically there are people who are learning to code using these copilots,
04:31but then when something breaks, it becomes harder for them to figure out what's going on. So if you're a
04:35doctor relying on something amazing, 80% accuracy, but if you sort of outsource some
04:42of your thinking to these bots, is that a problem down the line?
04:46Yeah, so this isn't just giving a black box answer. That's why the sequential diagnosis part is so
04:52important. Because you can watch the AI in real time, ask questions of the case history, get an answer,
04:59shape a new question, get an answer, present a new question, then ask for a
05:05different type of testing, get those results, interpret it, then give an answer. So the dialogic
05:11nature means that a human doctor can follow along and actually learn in a very transparent way. It's
05:17almost like having an interpretability mechanism inside the black box of the LLM, because you can see
05:23its thinking process in real time. And in fact, you don't just see the sort of chain of thought,
05:28which is, you know, a monologue. We've actually created five different types of agent,
05:35which all have a debate. We call this chain of debate: they negotiate with one another, they try to
05:40prioritize, you know, certain aspects like cost or efficiency. And it's the coordination
05:47of those different skill sets among the doctor agents that actually
05:52makes this so effective.
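The "chain of debate" among several specialist agents could be sketched minimally as follows. The agent roles and the consensus rule here are assumptions for illustration (the transcript only says five agent types negotiate and prioritize aspects like cost); a real debate would involve multiple rounds of LLM exchanges, not a single vote.

```python
# Toy sketch of a chain of debate: several specialist agents each propose a
# candidate diagnosis, and a coordinator picks the consensus. Roles and the
# majority-vote rule are illustrative assumptions, not the published design.
from collections import Counter

def chain_of_debate(proposals: dict) -> str:
    """proposals maps agent role -> that agent's candidate diagnosis."""
    # Reduce the negotiation to a simple majority vote for brevity.
    votes = Counter(proposals.values())
    winner, _ = votes.most_common(1)[0]
    return winner

agents = {
    "hypothesis-generator": "sarcoidosis",  # hypothetical role names
    "test-chooser": "sarcoidosis",
    "challenger": "lymphoma",               # plays devil's advocate
    "cost-steward": "sarcoidosis",          # prioritizes cost, per the transcript
    "checklist-keeper": "sarcoidosis",
}
print(chain_of_debate(agents))  # → sarcoidosis
```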
05:54But I want to ask again, because even if a doctor can watch this whole process take place, and
06:01let's say this becomes something that doctors use, it effectively turns their role
06:06in diagnosis from something that's active and really thought through into something more passive:
06:11okay, I'm watching the bots go through it. And I do wonder if there is some benefit in having the
06:18doctor actually have to do that themselves, because it helps the brain work in ways that it doesn't
06:24when it's just watching bots have a conversation.
06:27Yeah, I mean, I think that's totally true. I just still think this is going to be an amazing
06:30education tool for doctors to actually learn about the breadth of cases they never would have
06:35encountered. For example, we actually ran the DXO orchestrator last week on the most recent case
06:41study in the New England Journal of Medicine. And it correctly diagnosed a case that had
06:47only ever been seen 1,500 times in all of medical literature; it was such an obscure long-tail
06:52disease. So very few doctors are ever going to get the chance to see that. And so the ability to
06:59accurately and preventatively detect these kinds of conditions in the wild, in production, I think will
07:05massively outweigh, you know, the risk of doctors not being able to exercise their thinking in the way
07:12that you describe. I think the tools just change how you work. And, you know, obviously, everyone will
07:17sort of have to adapt to that over time. But the utility is just so unquestionably beneficial that I
07:22think it, it makes it worthwhile.
07:25Now, is it able to do that because the cases are potentially in the training data? And even if they
07:32are, does it really matter? I mean, if it is able to diagnose these rare conditions,
07:36should we really mind if it's seen them before in the training data?
07:42Well, part of the reason why we partnered with the New England Journal of Medicine is because each
07:46week, they put out a brand new case, which has never even been digitized. So there's no question
07:52that it's not in the training data. This case, for example, from last week, there's absolutely no way
07:57it's in the training data, because it literally just got published. And, you know, we think that's
08:01the case going back for all of the previous cases too. So I don't think there's
08:06any chance of that. This really is doing a kind of abstraction of judgment. It's not just
08:14reproducing training data, it is actually doing some kind of inference or thinking based on the
08:19knowledge that it does already have.
