00:00What if all of the world's biggest problems, from climate change to curing diseases to
00:05disposal of plastic waste, what if they all had the same solution? A solution so tiny it would
00:12be invisible? I'm inclined to believe this is possible thanks to a recent breakthrough that
00:17solved one of the biggest problems of the last century. How to determine the structure of a
00:22protein. It's been described to me as equivalent to Fermat's last theorem, but for biology.
00:28Over six decades, tens of thousands of biologists painstakingly worked out the structure of 150,000
00:34proteins. Then, in just a few years, a team of around 15 determined the structure of 200 million.
00:42That's basically every protein known to exist in nature. So how did they do it? And why does this
00:49have the potential to solve problems way outside the realm of biology?
00:53A protein starts simply as a string of amino acids. Each amino acid has a carbon atom at the center,
01:02then on one side is an amine group, and on the other side is a carboxyl group. And the last thing
01:07it's bonded to could be one of 20 different side chains, and which one determines which of the 20
01:13different amino acids this molecule is. The amine group from one amino acid can react with the
01:20carboxyl group of another to form a peptide bond. So a series of amino acids can bond to form a string,
01:27and pushing and pulling between countless molecules, electrostatic forces, hydrogen bonds,
01:33solvent interactions, can cause this string to coil up and fold onto itself. This ultimately determines the
01:393D structure of the protein. And this shape is the thing that really matters about the protein.
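To make the "string of amino acids" picture concrete, here is a minimal Python sketch (illustrative only: the one-letter residue codes are the standard ones, but the example fragment and helper names are made up) that treats a protein's primary structure as a string over a 20-letter alphabet:

```python
# A protein's primary structure is just a sequence over a 20-letter alphabet.
# The one-letter codes below are the standard ones; the example fragment is invented.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def is_valid_sequence(seq: str) -> bool:
    """Check that every residue is one of the 20 standard amino acids."""
    return all(residue in AMINO_ACIDS for residue in seq.upper())

def peptide_bonds(seq: str) -> int:
    """Each pair of neighbouring residues is joined by one peptide bond."""
    return max(len(seq) - 1, 0)

example = "MVLSPADKTNVKAAW"   # hypothetical 15-residue fragment
print(is_valid_sequence(example), peptide_bonds(example))  # True 14
```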
01:45It's built for a specific purpose, like how hemoglobin has the perfect binding site to carry
01:50around oxygen in your blood. These are machines. They need to be in their correct orientation
01:57in order to work together to move, for example, the proteins in your muscles. They change their shape
02:02a little bit in order to pull and contract. But it would take people a long time to get the structure
02:07of just one protein. Absolutely. So the question of what proteins should look like only really started
02:12to be answered with experimental techniques. The first way protein structure was determined was by creating
02:18a crystal out of that protein. This was then exposed to x-rays to get a diffraction pattern,
02:24and then scientists would work backwards to try to figure out what shape of molecules would create
02:29such a pattern. It took British biochemist John Kendrew 12 years to get the first protein structure.
02:36His target was an oxygen storing protein called myoglobin, an important protein in our hearts.
02:43He first tried a horse heart, but this produced rather small crystals because it didn't have
02:47enough myoglobin. He knew diving mammals would have lots of myoglobin in their muscles,
02:53since they're the best at conserving oxygen. So he obtained a huge chunk of whale meat from Peru.
02:59This finally gave Kendrew large enough crystals to create an x-ray diffraction image.
03:04And when it came out, it looked really weird. People expected something kind of logical,
03:10mathematical, understandable, and it almost looked, I wouldn't say ugly, but intricate and complex,
03:15and kind of like if you see a rocket motor, right, and all the parts hanging off.
03:20This structure, which has been called turd of the century, won Kendrew the 1962 Nobel Prize in Chemistry.
03:26Over the next two decades, only around 100 more structures were resolved. Even today,
03:33protein crystallization remains a big challenge.
03:36Frankly, you know, it is not uncommon that just a couple protein structures can be someone's entire
03:42PhD, sometimes just one, sometimes even just progress toward one.
03:45And it's expensive. X-ray crystallography can cost tens of thousands of dollars per protein.
03:52So scientists sought another way to work out protein structure. It only costs around $100 to
03:57determine a protein's amino acid sequence. So if you could use this to figure out how the protein would
04:02fold, that would save a lot of time, effort, and money.
04:06I kind of know how carbon behaves, and I know how carbon sticks to a sulfur, and how that might,
04:11you know, stick next to a nitrogen. And if these ones are here, then I can imagine this one folding,
04:14making that bond there. So it seems like if you have some sense of basic molecular dynamics,
04:19you might be able to figure out how this protein is going to fold.
04:22One of the few true predictions in biology was actually Linus Pauling looking at just the geometry of
04:28the building blocks of proteins and saying, actually, they should make helices and sheets.
04:32That's what we call secondary structure, the very local kind of twists and turns of the protein.
04:37But beyond helices and sheets, biochemists could not figure out any reliable patterns
04:42that would lead to the final structure of all proteins.
04:46One reason for this is that evolution didn't design proteins from the ground up.
04:50It's kind of like a programmer that doesn't know what they're doing,
04:53and whenever it looked good, they just kept adding that kind of thing. And that's
04:57how you end up with these objects that are both amazing and incredibly complex and hard to
05:01describe. They don't have purpose underneath them in the same way as like a human designed
05:07machine would. To illustrate just how complicated this process can get,
05:12MIT biologist Cyrus Levinthal did a back-of-the-envelope calculation. And he showed that
05:17even a short protein chain with 35 amino acids can fold in an astronomical number of ways.
05:24So even if a computer checked the energy and stability of 30,000 configurations every nanosecond,
05:30it would take 200 times the age of the universe to find the correct structure.
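The quoted figures can be reproduced with a quick back-of-the-envelope calculation of our own. The assumptions below (two backbone torsion angles per residue, three stable conformations per angle) are mine, chosen to be consistent with the numbers in the transcript rather than Levinthal's exact ones:

```python
# Levinthal-style estimate reproducing the figures quoted above.
# Assumptions (mine): 35 residues, 2 backbone torsion angles per residue,
# 3 stable conformations per angle, 30,000 configurations checked per nanosecond.

residues = 35
angles_per_residue = 2
conformations_per_angle = 3
configs = conformations_per_angle ** (residues * angles_per_residue)  # ~2.5e33

checks_per_second = 30_000 * 1e9            # 30,000 per nanosecond
seconds_needed = configs / checks_per_second

age_of_universe_s = 13.8e9 * 365.25 * 24 * 3600
print(f"{configs:.1e} configurations")
print(f"{seconds_needed / age_of_universe_s:.0f} ages of the universe")  # roughly 200
```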
05:38Refusing to give up, University of Maryland professor John Moult started a competition called
05:43CASP in 1994. The challenge was simple, to design a computer model that could take an amino acid
05:50sequence and output its structure. The modelers would not know the correct structure beforehand,
05:56but the output from each model would be compared to the experimentally determined structure.
06:02A perfect match would get a score of 100, but anything over 90 was considered close enough
06:07that the structure was solved. CASP competitors gathered at an old wooden chapel-turned conference
06:13center in Monterey, California. And at any point where a prediction didn't make sense,
06:17they were encouraged to tap their feet as friendly banter. There was a lot of foot tapping.
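For reference, the 0-to-100 score described above is essentially CASP's GDT_TS metric: the average percentage of residues whose predicted positions fall within 1, 2, 4 and 8 angstroms of the experimental ones. A toy sketch (it skips the structural superposition that real CASP scoring performs first, and the coordinates are fake):

```python
import numpy as np

def gdt_ts(predicted: np.ndarray, experimental: np.ndarray) -> float:
    """Toy GDT_TS-style score: average fraction of residues within
    1, 2, 4 and 8 angstroms of their experimental positions, times 100.
    Real CASP scoring first superimposes the two structures; skipped here."""
    distances = np.linalg.norm(predicted - experimental, axis=1)
    fractions = [(distances <= cutoff).mean() for cutoff in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * float(np.mean(fractions))

# Tiny fake example: 5 residues, small random errors on the prediction.
rng = np.random.default_rng(0)
true_coords = rng.uniform(0, 50, size=(5, 3))
pred_coords = true_coords + rng.normal(0, 1.5, size=(5, 3))
print(round(gdt_ts(pred_coords, true_coords), 1))
```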
06:25In the first year, teams could not achieve scores higher than 40. The early front-runner was an
06:31algorithm called Rosetta, created by University of Washington biologist David Baker. One of his
06:37innovations was to boost computation by pooling together processing power from idle computers in homes,
06:43schools, and libraries that volunteered to install his software called Rosetta at Home.
06:49As part of it, there was a screensaver that showed basically the course of the protein folding
06:54calculation. And then we started getting people writing in saying that they were watching the
06:58screensaver and they thought they could do better than the computer. So Baker had an idea. He created a video game.
07:06The game, called Foldit, set up a protein chain capable of twisting and turning into different
07:13arrangements. But now, instead of the computer making the moves, the game players, the humans,
07:19could make the moves. Within three weeks, more than 50,000 gamers pooled their efforts to decipher
07:24an enzyme that plays a key role in HIV. X-ray crystallography showed their result was correct.
07:30The gamers even got credited as co-authors on the research paper.
07:36Now, one man who played Foldit was a former child chess prodigy named Demis Hassabis. Hassabis had
07:42recently started an AI company called DeepMind. Their AI algorithm AlphaGo made headlines for beating
07:48world champion Lee Sedol at the game of Go. One of AlphaGo's moves, Move 37, shook Sedol to his core. But
07:56Hassabis never forgot about his time as a Foldit gamer.
08:00So, of course, I was fascinated this just from games design perspective. You know,
08:04wouldn't it be amazing if we could mimic the intuition of these gamers who were only,
08:08by the way, of course, amateur biologists?
08:11After returning from Korea, DeepMind researchers had a week-long hackathon where they tried to train
08:16AI to play Foldit. This was the beginning of Hassabis' long-standing goal of using AI to advance science.
08:23He initiated a new project called AlphaFold to solve the protein folding problem.
08:30Meanwhile, at CASP, the quality of prediction from the best performers,
08:33including Rosetta, had plateaued. In fact, the performance went downhill after CASP 8.
08:40The predictions weren't good enough, even with faster computers and a growing number of
08:44structures in the protein databank to train on. DeepMind hoped to change this with AlphaFold.
08:50Its first iteration, AlphaFold 1, was a standard off-the-shelf deep neural network like
08:57the ones used for computer vision at that time. The researchers trained it on lots and lots of
09:02protein structures from the protein databank. As input, AlphaFold took the protein's amino acid
09:08sequence and an important set of clues given by evolution. Evolution is driven by mutations,
09:15changes in the genetic code, which in turn change the amino acids within a given protein sequence.
09:21But as species evolve, proteins need to retain the shape that allows them to perform their specific
09:26function. For instance, hemoglobin looks the same in humans, cats, horses, and basically any mammal.
09:33Evolution says, if it ain't broke, don't fix it. So we can compare sequences of the same protein across
09:39different species in this evolutionary table. Where sequences are similar, it's likely they are
09:45important in the protein's structure and function. But even where the sequences are different, it's
09:50helpful to look at where mutations happen in pairs, because they can identify which amino acids are close
09:57to each other in the final structure. Say two amino acids, a positively charged lysine and a negatively
10:03charged glutamic acid attract and hold each other in the folded protein. Now, if a mutation changes
10:10lysine to a negatively charged amino acid, it would repel glutamic acid and destabilize the whole protein.
10:17Therefore, another mutation must replace glutamic acid with a positively charged amino acid. This is
10:23known as co-evolution. These evolutionary tables were an important input for AlphaFold.
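One simple way to see the co-evolution signal described above is to look for columns of the evolutionary table that mutate together. The toy alignment below is invented, and real pipelines use far larger alignments and more careful statistics, but it captures the idea:

```python
from itertools import combinations

# Toy multiple sequence alignment: the same (invented) short protein in five species.
# Columns 1 and 4 co-vary: whenever the K (lysine) at position 1 becomes E,
# the E (glutamic acid) at position 4 becomes K, as in the example above.
msa = [
    "AKGWE",
    "AKGWE",
    "AEGWK",
    "AEGWK",
    "AKGWE",
]

def co_mutation(msa, i, j):
    """Fraction of sequence pairs in which columns i and j have BOTH mutated."""
    pairs = list(combinations(msa, 2))
    both = sum(1 for a, b in pairs if a[i] != b[i] and a[j] != b[j])
    return both / len(pairs)

n = len(msa[0])
scores = {(i, j): co_mutation(msa, i, j) for i in range(n) for j in range(i + 1, n)}
print(max(scores, key=scores.get))   # (1, 4): the co-evolving pair of positions
```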
10:29As output, instead of directly producing a 3D structure, AlphaFold predicted a simpler 2D
10:37pair representation of that structure. The amino acid sequence is laid out horizontally and vertically.
10:43Whenever two amino acids are close to each other in the final structure,
10:47their corresponding row-column intersection is bright. Distant amino acid pairs are dim.
10:54In addition to distances, the pair representation can also hold information on how amino acid
11:01molecules are twisted within the structure. AlphaFold1 fed the protein sequence and its evolutionary
11:08table into its deep neural network, which it had trained to predict the pair representation.
11:13Once it had this, a separate algorithm folded the amino acid string based on the distance and
11:18torsion constraints. And this was the final protein structure prediction.
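The pair representation is easiest to picture as an L-by-L matrix indexed by residue pairs. Here is a minimal sketch with invented coordinates, showing how closeness in 3D becomes a "bright" entry in the 2D map (AlphaFold predicts this map; the code just computes one from fake coordinates to illustrate it):

```python
import numpy as np

# Minimal sketch of a pair representation: an L-by-L map that is "bright"
# where two residues end up close in space. Coordinates are invented.
coords = np.array([        # fake 3D positions of 6 residues (angstroms)
    [0, 0, 0], [3.8, 0, 0], [7.6, 0, 0],
    [7.6, 3.8, 0], [3.8, 3.8, 0], [0.5, 3.8, 0],
], dtype=float)

# Pairwise distance matrix, shape (6, 6)
diff = coords[:, None, :] - coords[None, :, :]
distances = np.linalg.norm(diff, axis=-1)

contact_map = distances < 5.0      # "bright" entries: residues closer than 5 A
print(contact_map.astype(int))
# Residues 0 and 5 are far apart along the chain but close in space,
# so entry [0, 5] lights up - exactly the kind of clue a folding algorithm can use.
```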
11:24With this framework, AlphaFold entered CASP 13, and it immediately turned heads.
11:31It was the clear winner that year. But it wasn't perfect. Its score of 70 was not
11:38enough to clear the CASP threshold of 90. DeepMind needed to get back to the drawing board to get better results,
11:46so Hassabis recruited John Jumper to lead AlphaFold.
11:50AlphaFold2 was really a system about designing our deep learning, the individual blocks, to be good at
11:56learning about proteins, to have the types of geometric, physical, evolutionary concepts that were needed,
12:02and put them into the middle of the network instead of a process around it. And that was a tremendous
12:05accuracy boost. There were three key steps to get better results with AI. First, maximum compute power.
12:13Here, DeepMind was already better positioned than anybody in the world. It had access to the
12:19enormous computing power of Google, including their tensor processing units. Second, they needed a large and
12:25diverse data set. Is data the biggest roadblock? And why? I think it's too easy to say data's the roadblock,
12:33and we should be careful about it. AlphaFold2 was trained on the exact same data as AlphaFold1,
12:37with much, much better machine learning. So everyone overestimates the data blockage because
12:44it gets less severe with better machine learning. And that was the third key element, better AI
12:51algorithms. Now, AI is not just good at protein folding. It can do all kinds of tasks that no one
12:58likes, from writing emails to answering phone calls. Something I hate is building and maintaining a
13:04website. It's so much work, from optimizing the website for different platforms, finding a good
13:09design so it looks professional, to constantly updating it with new information about the business
13:15as it grows. That's why we partnered with Hostinger, the sponsor of today's video. Hostinger makes it super
13:21easy to build a website for yourself or your business, and with their advanced AI tools, you can
13:26simply describe what you want your website to look like. And in just a few seconds, your personalized
13:32website is up and running. Hostinger is designed to be as easy as possible for beginners and
13:37professionals, so any tweaks you need to make after that are super easy too. Just drag and drop any
13:43pictures or videos you want, where you want them, or just type what you want to say, or have the AI
13:48help you here too if writing isn't your thing. And if you still want that human touch, Hostinger is
13:54always available with 24-7 support if you ever run into any issues. But when you're done building,
13:58in just a few clicks, your website is live. It's all incredibly affordable too, with a domain and
14:04business email included for free. So to take your big idea online today, visit hostinger.com
14:10slash ve, or scan this QR code right here. And when you sign up, remember to use code ve at checkout to
14:17get 10% off your plan. I want to thank Hostinger for sponsoring this part of the video, and now back to
14:22protein folding. As the AlphaFold2 team searched for better algorithms, they turned to the transformer,
14:29that's the T in ChatGPT, and it relies on a concept called attention. In the sentence, the animal didn't
14:36cross the street because it was too tired. Attention recognizes that it refers to animal and not street,
14:43based on the word tired. Attention adds context to any kind of sequential information by breaking it down
14:50into chunks, converting these into numerical representations or embeddings, and making
14:55connections between them. In this case, the words "it" and "animal". 3Blue1Brown has a great series of videos
15:02specifically about transformers and attention. Large language models use attention to predict the
15:08most appropriate word to add to a sentence, but AlphaFold also has sequential information. Not sentences,
15:14but amino acid sequences. And to analyze them, the AlphaFold team built their own version of the
15:20transformer, called an Evoformer. The Evoformer contained two towers, evolutionary information in
15:28the biology tower, and pair representations in the geometry tower. Gone was AlphaFold1's deep neural
15:35network that started with one tower and predicted the other. Instead, AlphaFold2's Evoformer builds each
15:41tower separately. It starts with some initial guesses, evolutionary tables taken from known
15:46data sets as before, and the pair representations based on similar known proteins. And this time,
15:52there's a bridge connecting the two towers that conveys newly found biological and geometry clues
15:58back and forth. In the biology tower, attention applied on a column identifies amino acid sequences
16:04that have been conserved, while along a row it finds amino acid mutations that have occurred together.
16:10Whenever the Evoformer finds two closely linked amino acids in the evolutionary table,
16:15it means they are important to structure, and it sends this information to the geometry tower.
16:20Here, attention is applied to help calculate distances between amino acids.
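Under the hood, the attention used in both towers is a variant of the same scaled dot-product operation found in language models. A generic, minimal sketch, not AlphaFold's actual implementation:

```python
import numpy as np

def attention(queries, keys, values):
    """Generic scaled dot-product attention: each position gathers a weighted
    mix of the other positions' values, with weights from query-key similarity.
    The Evoformer applies variants of this along rows, columns and triangles."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                      # similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ values

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))          # 5 positions (e.g. residues), 8 features each
out = attention(x, x, x)             # self-attention over the 5 positions
print(out.shape)                     # (5, 8)
```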
16:25There's also this thing called triangular attention that got introduced, which is essentially about
16:31letting triplets attend to each other. For each triplet of amino acids, AlphaFold applies the triangle
16:36inequality: the sum of two sides must be greater than the third. This constrains
16:42how far apart these three amino acids can be. This information is used to update the pair representation.
16:49And that helps the model produce a self-consistent picture of the structure.
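The geometric constraint behind triangular attention is just the triangle inequality. The sketch below flags inconsistent triplets in a made-up distance matrix as a hard check; the real network enforces consistency softly, through learned attention updates, rather than like this:

```python
import numpy as np
from itertools import combinations

def inconsistent_triplets(dist):
    """Return triplets (i, j, k) whose pairwise distances violate the triangle
    inequality, i.e. one side is longer than the other two combined."""
    bad = []
    n = dist.shape[0]
    for i, j, k in combinations(range(n), 3):
        a, b, c = dist[i, j], dist[j, k], dist[i, k]
        if a > b + c or b > a + c or c > a + b:
            bad.append((i, j, k))
    return bad

# Fake 4-residue distance matrix with one geometrically impossible triplet:
d = np.array([
    [0.0, 3.0, 10.0, 6.0],
    [3.0, 0.0, 4.0, 5.0],
    [10.0, 4.0, 0.0, 7.0],
    [6.0, 5.0, 7.0, 0.0],
])
print(inconsistent_triplets(d))   # [(0, 1, 2)]: 10 > 3 + 4, so it gets flagged
```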
16:53If the geometry tower finds it's impossible for two amino acids to be close to each other,
16:58then it tells the first tower to ignore their relationship in the evolutionary table.
17:02This exchange of information within the Evoformer is repeated 48 times, until the information within
17:09both towers is refined. The geometrical features learnt by this network are passed on to AlphaFold2's
17:15second main innovation, the structure module.
17:18For each amino acid, we pick three special atoms in the amino acid and say that those define a frame. And
17:24what the network does is it imagines that all the amino acids start out at the origin, and it has to
17:29predict the appropriate translation and rotation to move these frames to where they sit in the real
17:34structure. So that's essentially what the structure module does.
17:36But the thing that sets the structure module apart is what it doesn't do.
17:41Previously, people might have imagined that you would like to encode the fact that this is a chain,
17:46you know, and that, you know, certain residues should sit next to each other.
17:50We don't really explicitly tell AlphaFold that. It's more like we give it a bag of amino acids,
17:56and it's allowed to position each of them separately. And some people have thought that that helps it
18:02to not get stuck in terms of where things should be placed. It doesn't have to always be thinking
18:06about the constraint of these things forming a chain. That's something that emerges naturally later.
18:11That's why live AlphaFold folding videos can show it doing some weirdly non-physical stuff.
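The "frames" idea can be pictured as one rigid rotation-plus-translation per residue. The toy below uses made-up numbers and a plain rotation matrix; the real structure module parameterizes and refines these transforms inside the network:

```python
import numpy as np

# Toy version of the "frames" idea: every residue starts as a small rigid
# group of atoms at the origin, and the prediction task is a rotation and a
# translation that place that frame in the final structure.
# The rotation and translation values below are made up for illustration.

local_frame = np.array([       # three backbone-like atoms near the origin
    [0.0, 0.0, 0.0],           # N
    [1.5, 0.0, 0.0],           # C-alpha
    [2.4, 1.2, 0.0],           # C
])

def place_frame(atoms, rotation, translation):
    """Apply a rigid-body transform: rotate the local atoms, then translate."""
    return atoms @ rotation.T + translation

theta = np.pi / 4                                  # example 45-degree rotation about z
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
translation = np.array([10.0, 5.0, -2.0])

print(place_frame(local_frame, rotation, translation))
# Predicting one such (rotation, translation) per residue is, in spirit,
# what the structure module outputs.
```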
18:20The structure module outputs a 3D protein, but it still isn't ready. It's recycled at least three
18:26more times through the Evoformer to gain a deeper understanding of the protein. Only then is the final
18:32prediction made. In December 2020, DeepMind returned to a virtual CASP with AlphaFold 2.
18:41And this time, they did it. I'm going to read an email from John Moult.
18:46Your group has performed amazingly well in CASP 14, both relative to other groups and in absolute model
18:53accuracy. Congratulations on this work. For many proteins, AlphaFold 2 predictions were virtually
19:00indistinguishable from the actual structures. And they finally beat the gold standard score of 90.
19:10For me, having worked on this problem so long, after many, many stops and starts,
19:16suddenly this is a solution. We solved the problem. This gives you such excitement about the way science
19:22works. Over six decades, all of the scientists working around the world on proteins painstakingly
19:28found about 150,000 protein structures. Then, in one fell swoop, AlphaFold came in and unveiled over
19:37200 million of them, nearly all proteins known to exist in nature. In just a few months, AlphaFold
19:45advanced the work of research labs worldwide by several decades. It has directly helped us develop
19:53a vaccine for malaria. It's made possible the breaking down of antibiotic-resistance enzymes, which
19:59makes many life-saving drugs effective again. It's even helped us understand how protein mutations lead to
20:04various diseases, from schizophrenia to cancer. And biologists studying little-known and endangered
20:10species suddenly had access to their proteins and life mechanisms. The AlphaFold 2 paper has been cited over
20:1730,000 times. It has truly made a step-function leap in our understanding of life. John Jumper and Demis
20:25Hassabis were awarded one half of the 2024 Nobel Prize in Chemistry for this breakthrough. The other half
20:31went to David Baker, but not for predicting structures using Rosetta. Instead, it was for designing completely new
20:38proteins from scratch. It was really hard to make brand new proteins that would do things. So that's kind of the
20:43problem that we solved. To do so, he uses the same kind of generative AI that makes art in programs like
20:50DALL-E. You can say draw a picture of a kangaroo riding on a rabbit or something, and it will do that. And so
20:56it's exactly what we did with proteins. His technique, called RFdiffusion, is trained by adding random
21:02noise to a known protein structure. And then the AI has to remove this noise. Once trained in this way, the AI can
21:09be asked to produce proteins for various functions. It's given a random noise input, and the AI figures
21:16out a brand new protein that does what you asked it to do. This work has huge implications. I mean, imagine
21:23you got bitten by a venomous snake. If you're lucky, you'll have access to antivenom prepared by milking
21:29venom from the exact kind of snake, which is then injected into live animals. And the antibodies from that animal
21:36are extracted and refined and then given to you as an antivenom. The trouble is, often people have
21:43allergic reactions to these antibodies from other organisms. But your odds of survival can be a lot
21:48better with the latest synthetic proteins designed in Baker's lab. They've created human compatible
21:53antibodies that can neutralize lethal snake venom. This antivenom could be manufactured in large
21:59quantities and easily transported to the places where it's needed. With these tiny molecular machines,
22:05the possibilities are endless. What are the applications you're most excited about?
22:10I think vaccines are going to be really powerful. We have a number of proteins that are in human
22:14clinical trials for cancer, and we're working on autoimmune disease now. We're really excited about
22:19problems like capturing greenhouse gases. So we're designing enzymes that can fix methane,
22:25break down plastic. What makes this approach so effective is how fast they can create and iterate the
22:31proteins. It's really quite miraculous for anyone who's a conventional old school biochemist or protein
22:37scientist. We can now have designs on the computer, get the amino acid sequence of the designed proteins,
22:43and then in just a couple days, we can get the protein out. Yeah, we've given a name to this,
22:49which is cowboy biochemistry, because we just got to kind of go for it as fast as you can. And it
22:55turns out to work pretty well. What AI has done for proteins is just a hint of what it can do in
23:01other fields and on larger scales. In materials science, for example, DeepMind's GNoME program
23:08has found 2.2 million new crystals, including over 400,000 stable materials that could power future
23:15technologies from superconductors to batteries. AI is creating transformative leaps in science by
23:22helping to solve some of the fundamental problems that have blocked human progress. If you think of
23:26the whole tree of knowledge, you know, there are certain problems where, you know, if they're root
23:30node problems, if you unlock them, if you discover a solution to them, it would unlock a whole new
23:35branch or avenue of discovery. And with this, AI is pushing forward the boundaries of human knowledge
23:42at a rate never seen before. You know, speedups of 2x are nice. They're great. We love them. Speedups of
23:50100,000 times change what you do. You do fundamentally different stuff. And you start to rebuild
23:57your science around the things that got easy. And that's what I'm excited about. These discoveries
24:04represent real step function changes in science. Even if AI doesn't advance beyond where it is today,
24:10we will be reaping the benefits of these breakthroughs for decades. And assuming AI does continue to develop,
24:17well, it will open up opportunities that were previously thought impossible. Whether that's
24:22curing all diseases, creating novel materials, or restoring the environment to a pristine state.
24:29This sounds like an amazing future, as long as the AI doesn't take over and destroy us all first.