  • 2025/7/12
Discuss Topic: "Should we trust AI doctors more than human physicians when they're 4x more accurate but lack a heartbeat?"

Opinion A:
"Hell yes! My WebMD self-diagnosis era is over!"

Opinion B:
"Nope - I want my doctor to sweat when delivering bad news!"

Opinion A or B? It's your turn to decide!
Transcript
00:00that Microsoft is announcing today, which is that you've created, and let me know if I get this right,
00:04a diagnostician bot that effectively will be able to dialogue with a patient's case file
00:10and then make a diagnosis. So it's actually two bots; it's a system. So it's not just
00:16Microsoft's bots, it can be built on any bot, where one bot basically acts as a gatekeeper to all of
00:22a patient's medical information, and then the other one basically acts as the diagnostician
00:26or the physician that goes in and asks questions about that history. And you found some pretty
00:32incredible results when it comes to the effectiveness of this system to be able to diagnose
00:38correctly. Yeah, it's a great summary. That's exactly right. We essentially wanted to simulate
00:44what it would be like for an AI to act as a diagnostician, to ask the patient a series of
00:52questions, to draw out their case history, go through a whole bunch of tests that they may
00:58have had, pathology and radiology, and then iteratively examine the information that it's getting in order
01:05to improve the accuracy and reliability of its prediction about what your diagnosis actually
01:11is. And we actually use the New England Journal of Medicine case histories, hundreds of these past
01:19cases. One of these cases comes out every single week, and it's like an ultimate crossword for
01:24doctors. They obviously don't see the answer until the following week, and it's a big guessing game
01:31to go back through five to seven pages of very detailed history, and then try and figure out what
01:36the diagnosis actually turns out to be. Okay. And so what happens is these two bots work together in
01:43conjunction to figure out what the diagnosis is? Why use a system like this? I mean, I thought one
01:48of the benefits of generative AI is it can sort of take in a lot of information and then come to these
01:54answers, sometimes in one shot. So what is the benefit of having this dialogue between two
01:59bots? So the big breakthrough of the last six months or so in AI is these thinking or reasoning models
02:05that can obviously query other agents or find other information sources at inference time to improve
02:14the quality of its response. Rather than just giving the first best answer, it instead goes and, you know,
02:21consults a range of different sources, and that improves the quality of the information that it finally
02:27gets to. So we see that this orchestrator, which under the hood uses four different models from the major
02:32providers, can actually improve the accuracy of each of the individual models and collectively all of
02:39them together by a very significant degree, about 10% or so. So it's a big step forward. And I think that
02:45as the AI models get commoditized, you know, really all the value will be added in that final layer of
02:52orchestration, product integration. And that's what we're seeing with this diagnostic orchestrator.
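The gatekeeper/diagnostician loop described above can be sketched roughly like this. Everything here is an illustrative assumption: the case data, the agent logic, and the function names are stand-ins for what would really be LLM-backed agents, not Microsoft's actual implementation.

```python
# Toy sketch of a sequential-diagnosis loop: a gatekeeper releases only the
# information the diagnostician asks for, and each answer shapes the next
# question until the diagnostician commits to a diagnosis. All names and
# data are illustrative, not the published system.

CASE_FILE = {
    "history": "52-year-old with progressive fatigue",     # invented example
    "labs": "elevated ferritin",
    "imaging": "normal chest radiograph",
}

def gatekeeper(question: str) -> str:
    """Release only the piece of the case file that was asked for."""
    for key, value in CASE_FILE.items():
        if key in question:
            return value
    return "no further information available"

def diagnostician(evidence: list) -> str:
    """Ask for one section at a time, then commit to a diagnosis (stub)."""
    sections = ["history", "labs", "imaging"]
    if len(evidence) < len(sections):
        return "ask:" + sections[len(evidence)]  # next question to pose
    return "diagnosis: hereditary hemochromatosis (illustrative)"

def run_sequential_diagnosis() -> str:
    evidence = []
    while True:
        action = diagnostician(evidence)
        if action.startswith("diagnosis:"):
            return action
        # Each gatekeeper answer is appended and informs the next question.
        evidence.append(gatekeeper(action.removeprefix("ask:")))

print(run_sequential_diagnosis())
```

In the real system each role would be an LLM call; the point of the structure is that the diagnostician never sees the whole case file at once, so its questions (and their order) are observable.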
02:57So a 10% increase in diagnostic accuracy on top of the standard LLMs.
03:06Yeah. And in fact, we actually benchmark that against human performance. So we had a whole bunch
03:12of expert physicians play this simulated diagnosis environment game, and they on average get about
03:20one in five, right? So about 20%. Whereas our orchestrator gets about 85% accuracy. So it's four
03:28times more accurate, which, you know, in like my career, I've never seen such a big gulf between
03:36human-level performance and the AI system's performance. Many years ago, I worked on lots
03:43of diagnoses for radiology and head and neck cancer and mammography. And the goal was just to take a
03:51single radiology exam and predict, you know, yes or no, does it have cancer? And that was the most we
03:56could do. Whereas now, it is not just producing a binary class output, but it's actually producing a
04:03very detailed diagnosis, and doing that sequentially through this interactive
04:10dialogue mechanism. And so that massively improves the accuracy.
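A quick sanity check of the numbers quoted in this exchange, using only the figures stated above (physicians around 20% on these cases, the orchestrator around 85%):

```python
# Checking the "four times more accurate" claim from the stated figures.
physician_accuracy = 0.20     # ~one in five, as stated above
orchestrator_accuracy = 0.85  # orchestrator accuracy, as stated above

ratio = orchestrator_accuracy / physician_accuracy
print(f"{ratio:.2f}x")  # prints "4.25x", i.e. roughly four times more accurate
```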
04:14Okay, so it can do 80% accurate diagnoses, which sounds incredible. And I have to pressure test this
04:19a little bit. Because what if you have the same thing happen to medicine, as is happening with
04:25beginner-level code, where basically there are people who are learning to code using these copilots,
04:31but then when something breaks, it becomes harder for them to figure out what's going on. So if you're a
04:35doctor relying on something amazing, 80% accuracy, but if you sort of outsource some
04:42of your thinking to these bots, is that a problem down the line?
04:46Yeah, so this isn't just giving a black box answer. That's why the sequential diagnosis part is so
04:52important. Because you can watch the AI in real time, ask questions of the case history, get an answer,
04:59shape a new question, get an answer, present a new question, then ask for a
05:05different type of testing, get those results, interpret it, then give an answer. So the dialogic
05:11nature means that a human doctor can follow along and actually learn in a very transparent way. It's
05:17almost like having an interpretability mechanism inside the black box of the LLM, because you can see
05:23its thinking process in real time. And in fact, you don't just see the sort of chain of thought,
05:28which is, you know, a monologue. We've actually created five different types of agent,
05:35which all have a debate. We call this chain of debate: they negotiate with one another, they try to
05:40prioritize, you know, certain aspects like cost or efficiency. And it's the coordination
05:47of those different skill sets among the doctor agents that actually
05:52makes this so effective.
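The "chain of debate" among several specialist agents could be sketched minimally as follows. The agent roles and the consensus rule here are assumptions for illustration (the transcript only says five agent types negotiate and prioritize aspects like cost); a real debate would involve multiple rounds of LLM exchanges, not a single vote.

```python
# Toy sketch of a chain of debate: several specialist agents each propose a
# candidate diagnosis, and a coordinator picks the consensus. Roles and the
# majority-vote rule are illustrative assumptions, not the published design.
from collections import Counter

def chain_of_debate(proposals: dict) -> str:
    """proposals maps agent role -> that agent's candidate diagnosis."""
    # Reduce the negotiation to a simple majority vote for brevity.
    votes = Counter(proposals.values())
    winner, _ = votes.most_common(1)[0]
    return winner

agents = {
    "hypothesis-generator": "sarcoidosis",  # hypothetical role names
    "test-chooser": "sarcoidosis",
    "challenger": "lymphoma",               # plays devil's advocate
    "cost-steward": "sarcoidosis",          # prioritizes cost, per the transcript
    "checklist-keeper": "sarcoidosis",
}
print(chain_of_debate(agents))  # → sarcoidosis
```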
05:54But I want to ask again, because even if a doctor can watch this whole process take place, and
06:01let's say this becomes something that doctors use, it effectively turns their role
06:06in diagnosis from something that's active and really thought through into something more passive:
06:11okay, I'm watching the bots go through it. And I do wonder if there is some benefit in having the
06:18doctor actually have to do that themselves, because it helps the brain work in ways that it doesn't
06:24when it's just watching bots have a conversation.
06:27Yeah, I mean, I think that's totally true. I just still think this is going to be an amazing
06:30education tool for doctors to actually learn about the breadth of cases they never would have
06:35encountered. For example, we actually ran the DXO orchestrator last week on the most recent case
06:41study in the New England Journal of Medicine. And it correctly diagnosed a case that had
06:47only ever been seen 1,500 times in all of medical literature; it was such an obscure long-tail
06:52disease. So very few doctors are ever going to get the chance to see that. And so the ability to
06:59accurately and preventatively detect these kinds of conditions in the wild, in production, I think will
07:05massively outweigh, you know, the risk of doctors not being able to exercise their thinking in the way
07:12that you describe. I think the tools just change how you work. And, you know, obviously, everyone will
07:17sort of have to adapt to that over time. But the utility is just so unquestionably beneficial that I
07:22think it, it makes it worthwhile.
07:25Now, is it able to do that because the cases are potentially in the training data? And even if they
07:32are, does it really matter? I mean, if it is able to diagnose these rare conditions,
07:36should we really mind if it's seen them before in the training data?
07:42Well, part of the reason why we partnered with the New England Journal of Medicine is because each
07:46week, they put out a brand new case, which has never even been digitized. So there's no question
07:52that it's not in the training data. This case, for example, from last week, there's absolutely no way
07:57it's in the training data, because it literally just got published. And, you know, we think that's
08:01the case going back for all of the previous cases too. So I don't think there's
08:06any chance of that. This really is doing a kind of abstraction of judgment. It's not just
08:14reproducing training data, it is actually doing some kind of inference or thinking based on the
08:19knowledge that it does already have.
