GenAI: Rewriting the Rules of Copyright
Category: Technology
Transcript
00:02Good afternoon everyone
00:05Thank you for coming here on this Friday afternoon
00:08Hopefully you'll be woken up with some good memes from the internet
00:12And perhaps some knowledge along the way, hopefully
00:16So just a brief introduction, my name is Anaïs
00:19I'm an investor in early stage, venture stage companies
00:24So what we typically call a venture capitalist
00:26And I work for IRIS, which is a French-German fund
00:29And as you perhaps will not be surprised to hear
00:33A lot of my portfolio companies happen to be in the data, DevOps or AI space
00:38Which is why I will talk today about rewriting the rules of copyrights
00:44And we will go deep into internet culture
00:47Why copyrights and the internet did not mesh very well
00:50I will explain the issues we've had in the past
00:53And then we will explore what has been happening with AI
00:58And with LLMs in particular
01:00So the story starts, you may be surprised to hear
01:04With Mickey Mouse, amongst other things
01:07Mickey Mouse and the US Congress
01:11Why Mickey Mouse?
01:13So there's a thing you may or may not know
01:16As the Mickey Mouse curve
01:18Why?
01:19Because if you look at this chart
01:20On the left you see the initial duration of copyrights
01:24When they were first established
01:26So depending on countries you had usually something between 20 years
01:30Which is aligned with the length of patents typically
01:33And nearly 30 years in the US
01:38After the content was produced
01:41And then the duration expanded as time went by
01:47And there's a strange coincidence
01:49That it did expand specifically at moments where Mickey Mouse
01:53Was supposed to go into the public domain
01:55Thanks to, amongst other things, lobbying from the Disney company
02:01And as you can see, lobbying was fairly successful
02:04People have probably gotten very, very good bonuses from this
02:09Because the copyrights now happen to be in the US
02:1370 years after the death of creators of content
02:18And for companies it is 80 years after production
02:22Which is quite a long time
02:24And the startling thing is that patents in the meantime
02:29Have remained at 20 years
02:32Just brief words because there are some important differences
02:36So this is not about patents
02:38And this is not legal advice
02:39By the way, please don't follow my legal advice
02:41So patents last today 20 years approximately
02:49And they have to be filed with specific authorities
02:54So patents are not automatic
02:56You cannot get a patent if you don't file for a patent
02:59That sounds fairly obvious
03:01But it's actually quite different for copyrights
03:04So copyrights are automatic
03:06They're given to you as long as you produce anything
03:09So if I draw something on a piece of paper
03:12And then I throw it out in the garbage
03:15I still own the copyright to that piece of paper
03:18I don't need to file anything
03:19I don't need to pay a fee
03:21It will apply automatically to whatever I do
03:24And this will be very important for later on
03:27When we talk about why that's a problem for LLMs in particular
03:30And the other difference that's important
03:33Is that copyright typically applies to creative works
03:37So of course literature
03:39But also drawings, pictures, photography, art, music
03:45Any kind of creative content
03:47It also applies to everything that basically looks nice
03:52So if you have a chart for example
03:54The design of the chart itself is copyrighted to you
03:58Even though the data itself is not copyrighted
04:01So that is the main difference
04:05The other main difference that's important to mention
04:08For our story about LLMs
04:10Is that we live in Europe, famously
04:12Paris is in Europe
04:14And well, not the US
04:16Why is that important?
04:18The US has a framework that is fairly specific to the US
04:23Which is called the Fair Use Framework
04:25Meaning you are allowed to use copyrighted works
04:28In specific situations if you think that is fair
04:31What could be a fair situation?
04:34Well, we'll have an example in a minute
04:37But basically, if I do something without the intent to harm
04:42For example, if I'm a fan
04:43And I just write a fan fiction
04:46Or I draw some fan art
04:48That would fall under fair use
04:50As long as I don't make money off it, of course
04:52In Europe, it's actually not that simple
04:56In Europe, you don't have that fair use doctrine
05:00So you cannot say, look, I had some good intentions
05:05So please don't sue me
05:06That doesn't exist
05:08And so you may be surprised
05:10That actually, if you look at what people do on the internet
05:14And what they want to do with copyrighted content
05:17They want to draw Sonic or Naruto
05:19And then if you look at what, for example, fan art people like to do
05:25You have many different kinds of works
05:27But it's very rarely original works
05:29People like to just share with other fans
05:32And I was very surprised personally to see Sonic here
05:35I don't know if you are
05:36And I feel very bad for poor Pikachu here
05:38Who is way down the line
05:41But that's a very personal preference
05:45So, why was I saying this about fair use?
05:48So, if you're in the US and you draw a picture of Sonic
05:50And you publish it on the internet, you're fine
05:53You don't make any money off it
05:56There were, well, no cases recorded
05:59Where people have been made to pay for doing this
06:03There may be in the future
06:04Just because you can claim fair use
06:06Doesn't mean you can't get sued for it
06:08But, you know, if you don't make any money off it
06:10Usually people just don't go and sue you for no reason
06:14And now for Europe
06:16So, both these pictures are actually illegal
06:19The one on the left
06:21Why?
06:23You will say, look, Mr. Eiffel
06:25Who built the Eiffel Tower
06:26And the design of the Eiffel Tower
06:27Died long ago
06:29So, why would this be illegal?
06:32Well, the lighting itself
06:33Actually, is also under copyright
06:36For the creator of that particular piece of, well, art
06:42Which is lighting, a kind of art
06:43And so, the picture on the right
06:46Which I generated with AI
06:48Using ChatGPT
06:49Is also illegal
06:51Because it shows lighting
06:53But, well, does it matter if it's not the exact same lighting
06:58If it's a kind of transformative lighting
07:00That's not the exact same lighting as the other one
07:03Does it look different enough?
07:06Eh, I don't know
07:07Well, so in the US, I could say
07:10Look, it's fair use
07:12Because I used an LLM
07:13And look, it's not the exact same lighting
07:15And I'm not making any money off it
07:16It's just for Instagram
07:18But in Europe, that doesn't work
07:20Also in the US, probably it would not work
07:22But, again, if you don't have lots of money made from this picture
07:27Typically, people leave you alone
07:30And if you've understood nothing about why lighting
07:34And why AI lighting is different or not different from the real lighting
07:39Well, you're not alone
07:40Because even PhDs get very confused about it
07:43So, this was a survey of PhDs
07:48And researchers from public institutions
07:51Who have copyright, obviously, on the research that they publish
07:54And they even made a board game out of it
07:57To try and understand what were the frameworks they were operating in
08:01Because if you, well, if you look at the state of the board game itself
08:06It looks very complicated
08:08And I think that's a very fair reflection of why we have an issue here
08:14And you're thinking, you know, why would I go from Eiffel Tower lighting to PhDs and research?
08:20Well, lots of it, from Instagram pictures of the Eiffel Tower to scientific research
08:27Is actually now published, done, researched and inspired from everything that is on the internet
08:34And specifically for research, public research, obviously, derives from other works, from other authors
08:42So, well, if you are publishing a paper, obviously, in this time and age
08:47You would not do it just by reading from the library
08:50You'd go on the internet and you'd see what other researchers from around the world have done
08:56And, well, you know, it's really great that the internet has come up with many different propositions
09:02There's Netflix for content
09:04And you would think, you know, there's Sci-Hub for research
09:07Problem, Sci-Hub, it's a website based in Kazakhstan
09:11Why Kazakhstan?
09:13Because the founder of this website is from there
09:16But also because it's very much illegal
09:19Sci-Hub is a place that 88% of researchers think it's fine to download from
09:27They don't see the issue
09:28And yet this website was sued and actually lost the case
09:33Because in law, they're absolutely in the wrong
09:36So the jury and the judge found them guilty
09:41And they were right that they were infringing on copyrights
09:45Why would they infringe on copyrights?
09:48Because of the board game from before
09:49Where basically it's a very complex system
09:52Where people who have the copyrights to public research
09:55Often are not the institutions themselves
09:58But third-party publishers
10:00And so these publishers want money for access to scientific research
10:05And problem is, as I said, you wouldn't go to the library
10:08Well, it's much easier to just go to Sci-Hub
10:11And the funny thing is even researchers who are above 50 years old
10:17Agree that Sci-Hub is very legit
10:20And even if they don't use it themselves sometimes
10:23They think it's very fair that you get free access to science everywhere
10:29And the problem is now we're all sort of
10:32Well, probably "researchers" is a bit of a big word
10:37But we all publish content
10:39If you look at the amount of content that's published on social media
10:43There are millions and sometimes billions of users of these platforms
10:48And every day all of us consume content and write content
10:53And all of this is under copyright
10:58And that's only social media
10:59Because obviously you have ecosystems on the internet
11:04Here I put a map
11:06Every single Reddit post is under copyright
11:09And everything basically on browsers
11:14Or everything that you find through Google
11:17Is likely to be under copyright
11:19There is one notable exception
11:21And that is Wikipedia
11:23Where people deliberately say they don't want copyright for their work
11:30So apart from Wikipedia and a very few places
11:33A hundred percent of the internet is under copyright
11:37Now people may not know that they have copyright
11:40Or they may not claim their copyrights
11:42But if you use someone's work
11:44You get, you know, inspired by someone's work
11:46Just know typically it's under copyright
11:48Even if it's a silly TikTok video
11:50And if you remember the lengths that I was talking about at the very beginning
11:55Seventy years after the deaths of that Instagrammer you really like
12:00Their children and grandchildren can come to you and ask for copyright
12:05If you reuse that video that you really like as a young person
12:09So this is a big, big, big, big issue
12:14And some of you, if you are perhaps above 25
12:19You remember this ad
12:21If you bought a DVD back in the day
12:24You had this very exciting, juicing music
12:27And it told you, you wouldn't steal a car
12:29You wouldn't steal a handbag
12:31So why would you steal a DVD?
12:33And it ended with piracy
12:35It's a crime
12:37Except, you know, people made fun of it
12:39Because the reality is
12:41You totally would steal a movie
12:44And most people in this room
12:46Have downloaded a movie at some point
12:48Before Netflix or other legal alternatives existed
12:52And then people made fun of it saying
12:55Look, if I could just download the car
12:57For the price that it is
12:59Obviously I would do it
13:00And the funny story is
13:02Even this ad actually infringed the copyrights of someone
13:06Without meaning to
13:07So the font itself
13:09From the ad was under copyrights
13:12And no one was compensated for this
13:14And then they sued, obviously
13:16To get proper copyrights on the font
13:19And you might also like to learn
13:21That the music itself was from a person in Denmark
13:24Who had this music
13:26And also was not consulted
13:28And also was very surprised to learn
13:30That he got famous on so many millions of DVDs
13:33And never got a euro of royalty from the music
13:36So, you know, even if you try to be anti-piracy
13:39Well, piracy may come back and, you know, find you
13:42Especially 70 years after the death of the person is a long time
13:47And so, well, you absolutely would train an LLM, wouldn't you?
13:52It's just the way that people use content
13:57And everyone is already illegal anyway
13:59So what does it matter?
14:01And so now we will go into specifically
14:04What the issue here is for LLMs
14:09So let's look at OpenAI versus the New York Times
14:13Because this was the biggest case
14:15That people have been talking about
14:18On paper, the New York Times has a very, very simple claim
14:23The claim is, if you go and you ask ChatGPT
14:29With the beginning of an article from the New York Times
14:32As you can see from the excerpt
14:34From their memo that they submitted to the judge
14:38It gives you the exact same text as from the article
14:43In the New York Times
14:44Now, would it? Let's see
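The claim above comes down to how much of a model's reply matches the article word for word. Here is a minimal sketch of one way to measure that, by finding the longest run of consecutive words two texts share. The function name `longest_shared_word_run` and both sample strings are made up for illustration. This is not how the New York Times or OpenAI measured anything.

```python
def longest_shared_word_run(text_a: str, text_b: str) -> list[str]:
    """Return the longest run of consecutive words that appears in both texts."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    best_length, best_end = 0, 0
    # previous_row[j] holds the length of the shared run that ends at
    # words_a[i - 2] and words_b[j - 1] (the previous word of text_a).
    previous_row = [0] * (len(words_b) + 1)
    for i in range(1, len(words_a) + 1):
        current_row = [0] * (len(words_b) + 1)
        for j in range(1, len(words_b) + 1):
            if words_a[i - 1] == words_b[j - 1]:
                # Extend the run that ended on the previous word of both texts.
                current_row[j] = previous_row[j - 1] + 1
                if current_row[j] > best_length:
                    best_length, best_end = current_row[j], i
        previous_row = current_row
    return words_a[best_end - best_length:best_end]


# Made-up stand-ins for an article and a model reply.
article = "the quick brown fox jumps over the lazy dog near the river"
model_output = "a summary: the quick brown fox jumps over a sleeping dog"

shared = longest_shared_word_run(article, model_output)
print(len(shared), "shared words in a row:", " ".join(shared))
```

A long shared run suggests the reply reproduces the source rather than paraphrasing it. Where to draw the line is the legal question, not something the code can answer.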
14:48So how does an LLM work?
14:50Today, you may know very simply
14:52There is text input
14:54So it reads a number of data sets
14:58Which happens to be basically all of the internet
15:00And everything available as a text
15:03It feeds into a model that is trained to statistically replicate
15:08What a human would say to a specific prompt
15:12And then statistically it will give you the most likely answer
15:16That is something like this
15:17So it may or may not give you the exact wording
15:20But it has read the exact wording
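To make "statistically the most likely answer" concrete, here is a toy sketch. It counts which word follows which in a tiny made-up text, then returns the most frequent follower. Real LLMs are neural networks trained on vastly more data, so this only illustrates the idea. The names `train_bigram_model` and `most_likely_next` and the sample text are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigram_model(corpus: str) -> dict:
    """Count, for every word, which words follow it in the training text."""
    words = corpus.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def most_likely_next(model: dict, word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up training text, standing in for "all of the internet".
toy_corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
toy_model = train_bigram_model(toy_corpus)

print(most_likely_next(toy_model, "the"))
```

The model has "read" every sentence of the toy text, but what it returns depends only on the counts. That is why the output may or may not match the original wording.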
15:23Now, it's very complicated because I don't know about you
15:29But when I read a text and people ask me to summarize this text
15:34I do pretty much the same thing
15:36So I read stuff on the internet
15:38This presentation has been made by me reading up about it
15:42Not asking for anyone's permission
15:44I can read stuff
15:45And then giving you potentially an extract
15:49And interpretations of what I've read
15:52And there's no reason that the LLM would function that differently
15:58Except that it happens to be perhaps a bit more stochastically different than me
16:03Because I'm not a machine
16:04And it is a machine
16:08And what are those data sources specifically?
16:11So as I've said, it's all of the internet
16:12It can be public data sets
16:15But also user-generated content
16:17Things that you put in prompts on ChatGPT
16:20Are also used to retrain the model afterwards
16:24Licensed data, unlicensed data, many kinds of other things
16:27And one perhaps note for synthesized data
16:31Which is data that has been generated from other LLMs
16:34That is used to retrain LLMs
16:36And of course code repositories that power the Cursors of this world
16:42And other competitors
16:45And so that's for input
16:46For output
16:47If I give you a summary
16:49My summary may sound like an LLM
16:52Or an LLM may sound like me
16:53And actually it confused quite a few people
16:57Because, you know, people writing code
16:59Or LLMs writing code
17:00If you just look at the code
17:02It's quite difficult to say
17:03And people get confused
17:05So apparently this is half fake news
17:07I don't know
17:08There were some debates about this
17:10This came out a few days ago
17:13Without judging
17:14The point is just to say
17:16Look, something that is produced by an LLM
17:19May look like it's from a human
17:22Or from an LLM, it's just very, very difficult to tell
17:26And so if you think about me
17:28When I give you content like this
17:31And I summarize
17:31Well, perhaps if you tell me
17:33Look, Anaïs, what exactly did you read?
17:36And if I had a really good memory
17:38And if I was as smart as a huge machine
17:41Doing billions of calculations a second
17:44Perhaps I would be able to say
17:45Oh yes, I remember exactly what I read
17:47Let me give you exactly what I read
17:50And then you can comment
17:53And so what Sam Altman has said is
17:57Well, the New York Times
17:58People actually went and prompted the LLM
18:02To reply in the exact same way of the article
18:05Which would be a bit like asking me to recite poetry
18:09If I recite poetry to you
18:11Then I will recite exactly the same thing as the poem itself
18:16Does that make it illegal if you ask me to do that?
18:20Or if you ask an LLM to do exactly that?
18:24So the case is not clear cut
18:25And also what made it obviously harder
18:28Is that as soon as this case was filed with the authorities
18:33Obviously OpenAI changed their policies
18:37And now if you try
18:39You can ask the LLM for whatever you want
18:42It's not going to give you the exact same content
18:46From the New York Times article
18:53And now that we've explored
18:55The issue we've had with internet and the copyrights
18:58And then now LLMs
19:00And potentially the issue with reproducing content input and output
19:05Let's look back into the past
19:08And let's see you know
19:09We've had this issue before
19:11So if you don't know the Internet Archive
19:14Wayback Machine
19:15It also had some copyright issues by the way
19:18It was sued for reproducing content
19:22That was still under copyright etc
19:24What it does
19:25It's a very interesting website
19:26If you're curious
19:27It traces back all the history of the internet
19:30So if you're looking for a page that was deleted or modified
19:34You can probably find it on the Wayback Machine
19:37And how does it work essentially?
19:40And how do LLMs work?
19:41So they do web scraping
19:42Web scraping is basically like me reading a page
19:46Except it's going to be a machine reading the page
19:50That is how most things and most products today access the internet
19:55So it's very rare that you still have humans going and reading the entirety of the internet
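Here is a minimal sketch of what "a machine reading the page" can look like, using only Python's standard `html.parser` module. It pulls the visible text out of an HTML string. To keep it self-contained it makes no network request, and the sample page is made up. The names `VisibleTextExtractor` and `extract_text` are assumptions for illustration, not taken from any real scraper.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect the text a human reader would see, skipping scripts and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page, standing in for a publicly available web page.
sample_page = """
<html><head><title>Example post</title><style>p {color: red}</style></head>
<body><h1>Hello</h1><p>This page is <b>publicly</b> available.</p>
<script>console.log("ignored")</script></body></html>
"""

print(extract_text(sample_page))
```

A scraper repeats this step over many pages. As the talk says, the difference from a human reader is one of volume, not of nature.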
20:01And for obvious reasons
20:04The obvious reason is that quantity of data on the internet is really, really, really increasing exponentially every day
20:14And so what exponential means is that this is accelerating
20:18So the amount of content produced on the internet cannot be read by a human anymore
20:26There's no way a human in millions of lifetimes would be able to read everything that's in there
20:33Just to give you an idea
20:35Google itself covers a tiny fraction of what you can find on the internet
20:41And this is growing and growing
20:43And especially now, if you think about LLM generating content
20:47Well, a lot more content will be generated by machines
20:51So content generated by machines and read by machines is probably the future we're looking at here with LLMs
21:00And it's just a matter of, you know, dealing with volume
21:05And you may be surprised to learn that there was a huge debate about 10 years ago about web scraping
21:11Many people were upset about web scraping, mostly on copyright grounds
21:16Why would they be upset?
21:18Well, essentially they were saying people are stealing their content to produce something else
21:24Again, the difference between a human reading the data and producing a product out of it
21:30So if you have a team of people going through data, taking this data, putting it into an Excel
21:36And then building a product from it
21:38Well, that is fine
21:40But then if the machine does the same thing, it's a question of volume
21:44So it's going to be much, much, much faster
21:46But it's going to be a difference of degree and of volume and not a difference of nature
21:51And so many cases were made with scraping
21:56And what the equilibrium ended up as is we have not picked a proper direction
22:03We've basically said it's fine as long as you don't get caught
22:07And as long as the content itself is publicly available
22:11So that's why you may be wondering why LinkedIn and other websites suddenly put up login walls on their pages
22:19Typically it is so that if you have a case of copyright lawsuits
22:23They are better positioned to win because they have put some protections in place
22:29And it's not publicly available anymore
22:32It is publicly available but with a login wall
22:36So that changes things a little bit
22:38But it doesn't really solve the problem
22:42And the reason that we ignored this problem at the end of the day
22:47And decided to just see what goes
22:49And by the way, there are still some lawsuits around scraping every day
22:53If you look at the news, you may see some from time to time
22:56Well, the benefits from using the internet as data
23:02And building products on top of data from the internet
23:06Are absolutely exponential
23:09If you just think about search, emails, everything we do
23:13And online banking is also an interesting example
23:16Because 10 years ago there was a case with Klarna that you may remember
23:21And Klarna wanted to keep scraping the pages of banks in order to have access to their information
23:27And Klarna actually lost that lobbying battle
23:32Because the EU decided that banks had to open an API
23:36Instead of letting Klarna scrape the websites
23:39The problem is a lot of the APIs, at least when they came out, were not very good
23:45And so if you read some posts from the founder of Klarna
23:48They explain it much better than me
23:50How this was typically not what they wanted
23:53But at the end of the day
23:56The benefit for consumers of using data from the internet
24:01Versus the cost is so huge
24:04If you look at the numbers here, the way you read them
24:07Is that out of $300 of perceived value for the consumer
24:11The actual cost of that service is only $472
24:19So people just see so much value
24:21And the impact on the economy is so huge
24:23That at the end of the day it's a no-brainer
24:26And AI is really no different
24:28If you look at people, the consultancies that are producing work
24:34On how much GDP is going to be added by AI
24:37Some of the forecasts say that it's going to double the GDP growth of economies
24:44So the gains from AI might even be greater than the gains from the internet
24:50I personally think that it's very hard to say at this stage
24:54But still, it's going to be huge
24:57So the economic reality is that we cannot just go on and say
25:04We forbid all of this from happening
25:06Because the benefits are just so incredibly massive
25:10That the trade-off would just not make sense from a political standpoint
25:15And this is without having any kind of political stance on this
25:19It's just a matter of numbers
25:21There are very few copyright holders who have actual value of their copyright
25:26Because to my dismay, my Reddit posts are not of that much value
25:31Whereas obviously the Picasso collection is of huge value
25:36So you will be sacrificing a lot of growth for the benefit of very few people
25:41Which is politically complex
25:43So we have several solutions
25:45And I'm going to introduce two of those
25:50Here you have this post from Yann LeCun
25:54And it's worth reading it
25:56But essentially what he's saying is
25:59Now we have a problem in the sense that this was already unsustainable
26:04So either we let the system go exactly as it is
26:09And we decide to find some kind of framework
26:13Whereby just like fair use
26:15We look at the intent of the person
26:18So when I was drawing this, you know, Sonic fan art
26:21What was my intent?
26:22And does it matter, really, if I do it for fun and for free
26:26Does it matter that I had, well, this drawn by hand or done with an AI?
26:33Well, it doesn't matter all that much, right?
26:35It's the intent
26:36Do people producing tools or publishing, republishing pieces
26:43Or actually summarizing what is their intent?
26:46Are they doing it in order to make profit
26:49And to harm the benefits of the person having this copyright?
26:54Or do they just do it with the ambition to, well, just share
26:59And enjoy that content with other people on the internet?
27:04Which, by the way, may also be a marketing benefit in many different ways
27:09Because if your content is being published by other people and reshared
27:14That is also free advertising
27:17So that is one way of looking at it
27:21Now, that is not exactly how, you know, people have positioned themselves
27:28We're still at the stage where lawmakers are a bit skeptical
27:33About allowing this
27:35Because that would essentially mean introducing something very blurry in law
27:39That would mean introducing something like fair use in Europe
27:43Which, again, it doesn't exist
27:45And so that's why you have people who are a bit less subtle than Yann LeCun
27:50And who are just saying, kill it
27:54And two of these guys happen to be Jack Dorsey and Elon Musk
27:58Who found that they agreed on one thing
28:01And it's killing all IP law
28:04Which, you know, is a bit of an extreme case
28:08But the reality today, you know, without taking a stand on what we could do
28:13Again, this is not legal advice
28:16If we think about the economic cost
28:19If we think about the way things are working today
28:21And how people have ignored it for so long
28:25Is it really, you know, that crazy to think that we could rethink the entire framework
28:30And make something that works for people today
28:33With the means of communications and distribution that we have
28:37Versus something that we've copy pasted back 170 years ago
28:43When copyrights were first introduced
28:45And of course, again, not exactly subtle
28:48Especially since IP also includes patents
28:51Which is not just copyrights
28:53But that is why
28:55Hopefully now you understand why these two agree on something
28:59Despite the very, very short proposition they're making here
29:05And that's it on my end
29:07I'm gonna leave you with this, guys
29:08I don't know if it's a huge gift
29:09But thanks so much for your attention
29:11And I hope you enjoyed