00:00Today on Forbes, the Library of Congress is a training data playground for AI companies.
00:07Black and white portraits of Rosa Parks, letters penned by Thomas Jefferson,
00:12and the Giant Bible of Mainz, a 15th-century manuscript known to be one of the last handwritten Bibles in Europe.
00:19These are among the 180 million items, including books, manuscripts, maps, and audio recordings,
00:25housed within the Library of Congress.
00:29Every year, hundreds of thousands of visitors walk through the Library's high-ceilinged, pillared halls,
00:34passing beneath Renaissance-style domes embellished with murals and mosaics.
00:39But of late, the more than 200-year-old Library has attracted a new type of patron,
00:44AI companies that are eager to access the Library's digital archives
00:49and the 185 petabytes of data stored within them, to develop and train their most advanced AI models.
00:57For reference, one petabyte is equal to 1,000 terabytes, or one million gigabytes.
01:04Judith Conklin, Chief Information Officer at the Library of Congress, told Forbes,
01:09"...we know that we have a large amount of digital material that large language model companies are very interested in.
01:15It's extraordinarily popular."
01:18The upsurge in interest in the Library's data is also reflected in the numbers.
01:23The Congress.gov website, which is managed by the Library of Congress and hosts data about bills, statutes, and laws,
01:30gets between 20 million and 40 million monthly hits on its API,
01:35an interface that allows programmers to download the Library's data in a machine-readable format.
01:41Conklin said the traffic to the Congress.gov API has consistently grown since it became available in September 2022.
01:49The Library's API now gets about a million visits every month.
01:54The Library's digital archives host an abundance of rare, original, and authoritative information.
02:00It's also diverse. The collections feature content in more than 400 languages, spanning art, music, and most disciplines.
02:07But what makes this data especially appealing to AI developers is that these works are in the public domain,
02:13and not copyrighted or otherwise restricted.
02:16While a growing group of artists and organizations are locking up their data to prevent AI companies from scraping it,
02:22the Library of Congress has made its data reserves freely available to anyone who wants them.
02:28For AI companies that have already mined the entirety of the Internet,
02:31scraping everything from YouTube videos to copyrighted books, to train their models,
02:36the Library is one of the few remaining free resources.
02:40Otherwise, they must strike licensing deals with publishers or use AI-generated, so-called synthetic data,
02:46which can be problematic, leading to degraded responses from the model.
02:51The only caveat? People who want access to the Library's data must collect it via the API,
02:57a portal through which anyone, from a genealogist to an AI researcher, can download data.
03:03But they are prohibited from scraping content directly from the site,
03:06a common practice among AI companies and one that Conklin said has become a real "hurdle"
03:11for the Library because it slows public access to its archives.
03:16She said, quote,
03:29The hunt for data is just one part of the story.
03:32Companies like OpenAI, Amazon, and Microsoft are also courting the world's largest library as a customer.
03:39They claim AI models can help librarians and subject matter specialists
03:43with tasks like navigating catalogs, searching records, and summarizing long documents.
03:49This is certainly possible, but there are some rough edges that need to be ironed out first.
03:54Natalie Smith, the Library of Congress's Director of Digital Strategy,
03:58told Forbes that AI models trained on contemporary data sometimes struggle with historical accuracy,
04:05identifying a person holding a book as someone holding a cell phone, for example.
04:10She said, quote,
04:20For full coverage, check out Rashi Shrivastava's piece on Forbes.com.
04:26This is Kieran Meadows from Forbes. Thanks for tuning in.