Claimify: Extracting high-quality claims from language model outputs
Transcript
My name is Dasha Metropolitansky and I'm a research data scientist in the Microsoft Special Projects Resilience Team. I developed a system called Claimify, which is a claim extraction system. You're probably wondering: what is claim extraction? Well, there are two keywords, claim and extraction. A claim, as I define it, is a simple factual statement that can be verified as true or false, and extraction is the process of breaking a text down into claims.

To illustrate with a simple example, let's say you have the sentence "Some notable examples of technology executives include Satya Nadella and Bill Gates." If I had to break this sentence down into claims, there would be two of them: "Satya Nadella is a technology executive" and "Bill Gates is a technology executive." This is a really simple example, but you can already start to see some of the properties we might care about when we do claim extraction. One is that I got rid of the word "notable," because what does that even mean? It's not something I can verify as true or false. The other is that I created separate claims, one for Bill Gates and one for Satya Nadella, because we want claims to be the simplest possible independent statements.
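As a rough illustration of what that output could look like (a hypothetical representation sketched here, not Claimify's actual data format), the example sentence might be decomposed like this:

```python
# Hypothetical representation of a claim extraction result; the class and field
# names are illustrative assumptions, not Claimify's actual schema.
from dataclasses import dataclass, field

@dataclass
class ExtractionResult:
    sentence: str                                       # the original sentence
    claims: list[str] = field(default_factory=list)     # simple, independently verifiable statements

result = ExtractionResult(
    sentence="Some notable examples of technology executives include Satya Nadella and Bill Gates.",
    claims=[
        "Satya Nadella is a technology executive.",     # the subjective qualifier "notable" is dropped
        "Bill Gates is a technology executive.",        # one claim per person, kept independent
    ],
)
```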
So taking a step back, Claimify basically takes in a text of any length. Usually these texts are much longer than the example I just gave you, which was just a sentence, and it decomposes that text into these high-quality claims.

I'm working now on a system that does hallucination detection. Let's say your question answering application answers questions based on some source documents, like news articles. You want to make sure that the language model is answering those questions based on the source documents, not just making things up. But that's a really hard evaluation to perform when you have a paragraph or multi-paragraph answer with so much information in it. Now imagine you could distill that into a simple set of standalone factual statements. It becomes much easier to then check those independently.
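One way to operationalize that per-claim check (a minimal sketch under my own assumptions, not the hallucination detection system described here) is to ask some judge whether each extracted claim is supported by the source documents and then aggregate the verdicts:

```python
# Minimal sketch of claim-level grounding checks. `is_supported` stands in for
# whatever entailment judge you use (an NLI model, an LLM prompt, etc.); the
# naive substring check below is a placeholder, not a real verifier.
def is_supported(claim: str, sources: list[str]) -> bool:
    """Return True if any source document appears to support the claim (placeholder logic)."""
    return any(claim.lower() in doc.lower() for doc in sources)

def grounding_report(claims: list[str], sources: list[str]) -> dict:
    """Check each claim independently and summarize how grounded the answer is."""
    unsupported = [c for c in claims if not is_supported(c, sources)]
    grounded = 1.0 - len(unsupported) / len(claims) if claims else 1.0
    return {
        "total_claims": len(claims),
        "unsupported_claims": unsupported,   # candidates for hallucinations
        "grounded_fraction": grounded,
    }
```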
But this is not just about hallucination detection; you can do other sorts of evaluations as well. Back to the use case: you're building an application, and you want to know how relevant the answers are to the question that was asked. If an answer contains 20 or 30 distinct points, it's hard to say how relevant the entire answer is. Maybe some points are relevant and others aren't. But if you can take the individual factual claims and say this one is relevant, this one isn't, you can easily aggregate that into one composite measure. Our team is also using the number of claims in an answer as a proxy for how comprehensive it is.
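For instance, per-claim relevance judgments could be averaged into a composite relevance score, with the claim count doubling as a rough comprehensiveness signal. The sketch below assumes a judge_relevance function you would supply yourself (for example, an LLM prompt); it is an illustration, not part of Claimify:

```python
# Aggregate per-claim relevance labels into one composite measure.
# `judge_relevance` is a placeholder for your own relevance judge.
from typing import Callable

def answer_metrics(question: str, claims: list[str],
                   judge_relevance: Callable[[str, str], bool]) -> dict:
    """Score an answer by judging each extracted claim independently."""
    relevant = [c for c in claims if judge_relevance(question, c)]
    return {
        "relevance_score": len(relevant) / len(claims) if claims else 0.0,  # composite measure
        "claim_count": len(claims),                                         # proxy for comprehensiveness
    }
```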
So to summarize, why does claim extraction matter? Because it unlocks the ability to evaluate long-form content generated by language models.

We don't try to do claim extraction on the whole text at once. We actually break it down into sentences and do the claim extraction on each sentence independently. To ensure that those sentences are interpreted accurately, we include some context, which is basically a window of text around the sentence. So that's number one.
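Here is a minimal sketch of that sentence-plus-context windowing (my own illustration of the idea; the regex-based splitting and the window size are assumptions, not Claimify's actual parameters):

```python
# Split a text into sentences and pair each sentence with a window of
# surrounding sentences as context. The window size is an illustrative choice.
import re

def sentences_with_context(text: str, window: int = 2) -> list[dict]:
    """Return each sentence together with a small window of neighboring sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    items = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        items.append({"sentence": sentence, "context": " ".join(context)})
    return items
```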
Number two is that we don't treat claim extraction as one monolithic task. We break it down into three parts: selection, disambiguation, and decomposition. Selection means we're filtering out sentences that do not contain any verifiable claims. For example, if I gave you the sentence "Companies should embrace AI," that's not a factual claim, it's an opinion, so we would filter it out. Secondly, we have disambiguation. This is basically detecting whether there's ambiguity and then deciding, if there is, whether it can be resolved using the context, or flagging that it can't be resolved. Ambiguity here just means there are multiple plausible interpretations, and depending on which interpretation you pick, you're going to get a very different set of claims. This is one aspect of Claimify that is really unique and powerful, especially the ability to determine whether or not the ambiguity can be resolved. The last stage is decomposition, which takes the disambiguated sentence and breaks it down into these simple standalone factual statements.
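Putting those stages together, the per-sentence flow might look roughly like the following. This is a hypothetical sketch of the control flow only; the three stage functions are trivial placeholders for what are LLM-driven steps, and their names and heuristics are my own assumptions rather than Claimify's implementation:

```python
# Hypothetical per-sentence pipeline: selection -> disambiguation -> decomposition.
# The stage functions are trivial placeholders for LLM-driven steps; their names
# and heuristics are illustrative assumptions only.
from typing import Optional

def select(sentence: str, context: str) -> bool:
    """Selection: keep only sentences that contain something verifiable (placeholder heuristic)."""
    return "should" not in sentence.lower()          # e.g. filter opinions like "Companies should embrace AI."

def disambiguate(sentence: str, context: str) -> Optional[str]:
    """Disambiguation: return a resolved sentence, or None if the ambiguity cannot be resolved."""
    return sentence                                  # placeholder: assume the sentence is unambiguous

def decompose(sentence: str, context: str) -> list[str]:
    """Decomposition: split the resolved sentence into simple standalone claims (placeholder)."""
    return [sentence]

def extract_claims(sentence: str, context: str) -> list[str]:
    if not select(sentence, context):                # stage 1: selection
        return []                                    # no verifiable content, e.g. pure opinion
    resolved = disambiguate(sentence, context)       # stage 2: disambiguation
    if resolved is None:                             # flagged as "cannot be disambiguated"
        return []                                    # do not proceed to decomposition
    return decompose(resolved, context)              # stage 3: decomposition
```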
So let's take a closer look at Claimify in action. Imagine you're developing a chatbot. You ask it to provide an overview of challenges in emerging markets, and it generates this answer. Assessing the quality of the answer is really hard: it's packed with information, and there's no gold standard to compare against. Instead of processing the entire text at once, Claimify extracts claims from each sentence independently. We include context for each sentence to ensure accurate interpretation.

Recall the sentence "The UN found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management." The baseline prompt ignored the phrase "highlighting the need for improved water management"; it only extracted claims from the first part of the sentence. Claimify, however, reasoned that the sentence could be interpreted as the UN found that the contaminated water caused illness and also highlighted the need for improved water management, or as the UN only found the contamination and illness, with the author adding the interpretation about the need for improved water management. In other words, this may or may not be a verifiable claim. Claimify decided that the context did not clearly support either interpretation, so it flagged the sentence as "cannot be disambiguated" and did not proceed to the decomposition stage.
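To make that outcome concrete, a disambiguation result for this sentence might be recorded something like the following (a purely hypothetical structure; the field names and labels are my own, not Claimify's output format):

```python
# Hypothetical record of the disambiguation outcome for the UN example sentence.
# Field names and labels are illustrative assumptions, not Claimify's schema.
disambiguation_result = {
    "sentence": ("The UN found that the resulting contaminated water caused many residents "
                 "to fall ill, highlighting the need for improved water management."),
    "interpretations": [
        "The UN found the illness and also highlighted the need for improved water management.",
        "The UN only found the contamination and illness; the author added the point about water management.",
    ],
    "resolvable_from_context": False,      # the context did not clearly support either reading
    "status": "cannot_be_disambiguated",
    "proceed_to_decomposition": False,     # no claims are extracted from this sentence
}
```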
Here are the sentences where at least one claim was extracted; I'll highlight a few examples. Recall the sentences about Argentina's inflation, where the baseline missed the claims about economic hardship and the prediction of rates greater than 300%. Claimify did not miss these claims. Also, the baseline just said that Argentina's currency value has plunged; Claimify correctly specified that inflation has depreciated the currency. Consider the sentence "Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya," where the baseline claims never specified what "those" refers to. The context discusses public health crises, flooding, and contaminated water, so Claimify made specific claims about these issues. For the sentence "Nigeria is striving to become self-sufficient in wheat production, but is hindered by climate change and violence," the baseline had claims like "Nigeria's wheat production is hindered by climate change and violence." Claimify captured that it's Nigeria's efforts to become self-sufficient in wheat production that are being hindered.

One of the most popular use cases of language models is generating long-form content. Unfortunately, it's really hard to evaluate the quality of that content. Claim extraction can help, and Claimify is a really powerful tool for generating high-quality claims.