Claimify: Extracting high-quality claims from language model outputs
Transcript
My name is Dasha Metropolitansky and I'm a research data scientist in the Microsoft Special Projects Resilience Team. I developed a system called Claimify, which is a claim extraction system. You're probably wondering: what is claim extraction? Well, there are two keywords, claim and extraction. A claim, as I define it, is a simple factual statement that can be verified as true or false, and extraction is the process of breaking a text down into claims.

To illustrate with a simple example, let's say you have the sentence "Some notable examples of technology executives include Satya Nadella and Bill Gates." If I had to break this sentence down into claims, there would be two of them: "Satya Nadella is a technology executive" and "Bill Gates is a technology executive." This is a really simple example, but you can already start to see some of the properties we might care about when we do claim extraction. One is that I got rid of the word "notable," because what does that even mean? It's not something I can verify as true or false. The other is that I created separate claims, one for Bill Gates and one for Satya Nadella, because we want claims to be the simplest possible independent statements.
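As a rough illustration of what that output could look like (a hypothetical representation sketched here, not Claimify's actual data format), the example sentence might be decomposed like this:

```python
# Hypothetical representation of a claim extraction result; the class and field
# names are illustrative assumptions, not Claimify's actual schema.
from dataclasses import dataclass, field

@dataclass
class ExtractionResult:
    sentence: str                                       # the original sentence
    claims: list[str] = field(default_factory=list)     # simple, independently verifiable statements

result = ExtractionResult(
    sentence="Some notable examples of technology executives include Satya Nadella and Bill Gates.",
    claims=[
        "Satya Nadella is a technology executive.",     # the subjective qualifier "notable" is dropped
        "Bill Gates is a technology executive.",        # one claim per person, kept independent
    ],
)
```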
So taking a step back, Claimify basically takes in a text of any length. Usually these texts are much longer than the example I just gave you, which was just a sentence, and it decomposes that text into these high-quality claims.

I'm working now on a system that does hallucination detection. Let's say your question answering application answers questions based on some source documents, like news articles. You want to make sure that the language model is answering those questions based on the source documents, not just making things up. But that's a really hard evaluation to perform when you have a paragraph or multi-paragraph answer with so much information in it. Now imagine you could distill that into a simple set of standalone factual statements. It becomes much easier to then check those independently.
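One way to operationalize that per-claim check (a minimal sketch under my own assumptions, not the hallucination detection system described here) is to ask some judge whether each extracted claim is supported by the source documents and then aggregate the verdicts:

```python
# Minimal sketch of claim-level grounding checks. `is_supported` stands in for
# whatever entailment judge you use (an NLI model, an LLM prompt, etc.); the
# naive substring check below is a placeholder, not a real verifier.
def is_supported(claim: str, sources: list[str]) -> bool:
    """Return True if any source document appears to support the claim (placeholder logic)."""
    return any(claim.lower() in doc.lower() for doc in sources)

def grounding_report(claims: list[str], sources: list[str]) -> dict:
    """Check each claim independently and summarize how grounded the answer is."""
    unsupported = [c for c in claims if not is_supported(c, sources)]
    grounded = 1.0 - len(unsupported) / len(claims) if claims else 1.0
    return {
        "total_claims": len(claims),
        "unsupported_claims": unsupported,   # candidates for hallucinations
        "grounded_fraction": grounded,
    }
```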
But this is not just about hallucination detection; you can do other sorts of evaluations as well. Back to the use case: you're building an application, and you want to know how relevant the answers are to the question that was asked. If an answer contains 20 or 30 distinct points, it's hard to say how relevant the entire answer is. Maybe some points are relevant and others aren't. But if you can take the individual factual claims and say this one is relevant, this one isn't, you can easily aggregate that into one composite measure. Our team is also using the number of claims in an answer as a proxy for how comprehensive it is.
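For instance, per-claim relevance judgments could be averaged into a composite relevance score, with the claim count doubling as a rough comprehensiveness signal. The sketch below assumes a judge_relevance function you would supply yourself (for example, an LLM prompt); it is an illustration, not part of Claimify:

```python
# Aggregate per-claim relevance labels into one composite measure.
# `judge_relevance` is a placeholder for your own relevance judge.
from typing import Callable

def answer_metrics(question: str, claims: list[str],
                   judge_relevance: Callable[[str, str], bool]) -> dict:
    """Score an answer by judging each extracted claim independently."""
    relevant = [c for c in claims if judge_relevance(question, c)]
    return {
        "relevance_score": len(relevant) / len(claims) if claims else 0.0,  # composite measure
        "claim_count": len(claims),                                         # proxy for comprehensiveness
    }
```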
So to summarize, why does claim extraction matter? Because it unlocks the ability to evaluate long-form content generated by language models.

We don't try to do claim extraction on the whole text at once. We actually break it down into sentences and do the claim extraction on each sentence independently. To ensure that those sentences are interpreted accurately, we include some context, which is basically a window of text around the sentence. So that's number one.
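Here is a minimal sketch of that sentence-plus-context windowing (my own illustration of the idea; the regex-based splitting and the window size are assumptions, not Claimify's actual parameters):

```python
# Split a text into sentences and pair each sentence with a window of
# surrounding sentences as context. The window size is an illustrative choice.
import re

def sentences_with_context(text: str, window: int = 2) -> list[dict]:
    """Return each sentence together with a small window of neighboring sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    items = []
    for i, sentence in enumerate(sentences):
        context = sentences[max(0, i - window):i] + sentences[i + 1:i + 1 + window]
        items.append({"sentence": sentence, "context": " ".join(context)})
    return items
```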
Number two is that we don't treat claim extraction as one monolithic task. We break it down into three parts: selection, disambiguation, and decomposition. Selection means we're filtering out sentences that do not contain any verifiable claims. For example, if I gave you the sentence "Companies should embrace AI," that's not a factual claim, it's an opinion, so we would filter it out. Secondly, we have disambiguation. This is basically detecting whether there's ambiguity and then deciding, if there is, whether it can be resolved using the context, or flagging that it can't be resolved. Ambiguity here just means there are multiple plausible interpretations, and depending on which interpretation you pick, you're going to get a very different set of claims. This is one aspect of Claimify that is really unique and powerful, especially the ability to determine whether or not the ambiguity can be resolved. The last stage is decomposition, which takes the disambiguated sentence and breaks it down into these simple standalone factual statements.
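Putting those stages together, the per-sentence flow might look roughly like the following. This is a hypothetical sketch of the control flow only; the three stage functions are trivial placeholders for what are LLM-driven steps, and their names and heuristics are my own assumptions rather than Claimify's implementation:

```python
# Hypothetical per-sentence pipeline: selection -> disambiguation -> decomposition.
# The stage functions are trivial placeholders for LLM-driven steps; their names
# and heuristics are illustrative assumptions only.
from typing import Optional

def select(sentence: str, context: str) -> bool:
    """Selection: keep only sentences that contain something verifiable (placeholder heuristic)."""
    return "should" not in sentence.lower()          # e.g. filter opinions like "Companies should embrace AI."

def disambiguate(sentence: str, context: str) -> Optional[str]:
    """Disambiguation: return a resolved sentence, or None if the ambiguity cannot be resolved."""
    return sentence                                  # placeholder: assume the sentence is unambiguous

def decompose(sentence: str, context: str) -> list[str]:
    """Decomposition: split the resolved sentence into simple standalone claims (placeholder)."""
    return [sentence]

def extract_claims(sentence: str, context: str) -> list[str]:
    if not select(sentence, context):                # stage 1: selection
        return []                                    # no verifiable content, e.g. pure opinion
    resolved = disambiguate(sentence, context)       # stage 2: disambiguation
    if resolved is None:                             # flagged as "cannot be disambiguated"
        return []                                    # do not proceed to decomposition
    return decompose(resolved, context)              # stage 3: decomposition
```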
So let's take a closer look at Claimify in action. Imagine you're developing a chatbot. You ask it to provide an overview of challenges in emerging markets, and it generates this answer. Assessing the quality of the answer is really hard: it's packed with information, and there's no gold standard to compare against. Instead of processing the entire text at once, Claimify extracts claims from each sentence independently. We include context for each sentence to ensure accurate interpretation.

Recall the sentence "The UN found that the resulting contaminated water caused many residents to fall ill, highlighting the need for improved water management." The baseline prompt ignored the phrase "highlighting the need for improved water management"; it only extracted claims from the first part of the sentence. Claimify, however, reasoned that the sentence could be interpreted as the UN found that the contaminated water caused illness and also highlighted the need for improved water management, or as the UN only found the contamination and illness, with the author adding the interpretation about the need for improved water management. In other words, this may or may not be a verifiable claim. Claimify decided that the context did not clearly support either interpretation, so it flagged the sentence as "cannot be disambiguated" and did not proceed to the decomposition stage.
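To make that outcome concrete, a disambiguation result for this sentence might be recorded something like the following (a purely hypothetical structure; the field names and labels are my own, not Claimify's output format):

```python
# Hypothetical record of the disambiguation outcome for the UN example sentence.
# Field names and labels are illustrative assumptions, not Claimify's schema.
disambiguation_result = {
    "sentence": ("The UN found that the resulting contaminated water caused many residents "
                 "to fall ill, highlighting the need for improved water management."),
    "interpretations": [
        "The UN found the illness and also highlighted the need for improved water management.",
        "The UN only found the contamination and illness; the author added the point about water management.",
    ],
    "resolvable_from_context": False,      # the context did not clearly support either reading
    "status": "cannot_be_disambiguated",
    "proceed_to_decomposition": False,     # no claims are extracted from this sentence
}
```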
Here are the sentences where at least one claim was extracted; I'll highlight a few examples. Recall the sentences about Argentina's inflation, where the baseline missed the claims about economic hardship and the prediction of rates greater than 300%. Claimify did not miss these claims. Also, the baseline just said that Argentina's currency value has plunged; Claimify correctly specified that inflation has depreciated the currency. Consider the sentence "Countries like Afghanistan and Sudan have experienced similar challenges to those of Libya," where the baseline claims never specified what "those" refers to. The context discusses public health crises, flooding, and contaminated water, so Claimify made specific claims about these issues. For the sentence "Nigeria is striving to become self-sufficient in wheat production, but is hindered by climate change and violence," the baseline had claims like "Nigeria's wheat production is hindered by climate change and violence." Claimify captured that it's Nigeria's efforts to become self-sufficient in wheat production that are being hindered.

One of the most popular use cases of language models is generating long-form content. Unfortunately, it's really hard to evaluate the quality of that content. Claim extraction can help, and Claimify is a really powerful tool for generating high-quality claims.