AI in 2026 is Scary… But MIT Just Explained How To Survive It.

Abdullah Graphy 786

Why is AI in 2026 scary? While the mainstream media spreads panic, the Massachusetts Institute of Technology (MIT) just released a groundbreaking computer science lecture explaining the absolute truth behind the next wave of Artificial Intelligence.

If you have 1 hour tonight, skip Netflix and watch this full breakdown. Your future self will thank you for understanding this before the rest of the world.

In this video, we dissect the core engineering concepts from MIT's latest AI briefing. We are moving past basic generative AI and entering the era of fully autonomous Agentic AI systems and World Models. This isn't just a minor tech update; it's a complete shift in how software, coding, and global industries operate.

Whether you are a software engineer, student, creator, or tech enthusiast, this MIT AI explanation will give you the exact roadmap to stay ahead of 99% of the population.

👉 Key Topics Covered in This MIT AI Breakdown:

The 2026 AI Turning Point: Why autonomous agents are replacing simple chatbots.
MIT Computer Science Insights: The hidden architecture behind next-gen LLMs.
The Future of Work: Which skills are becoming obsolete and what to learn next.
AGI Timeline: What top researchers are saying behind closed doors.

Don't just watch the future happen from the sidelines. Understand the technology, adapt early, and future-proof your career.

If you want to stay updated with deep dives into AI engineering, tech breakthroughs, and market trends, make sure to hit LIKE and SUBSCRIBE!

#MIT #AI2026 #ArtificialIntelligence #TechTrends #AgenticAI #ComputerScience #AGI #FutureOfWork

Transcript

00:00:08Good afternoon everyone and thank you for joining us today. My name is Alexander Amini and together

00:00:15with Ava we're going to be your instructors for the course this year. This is MIT Introduction

00:00:22to Deep Learning, or 6.S191, our official course title. Now we're super excited to welcome

00:00:29you to this class and I think probably a good place to start is always you know asking ourselves well

00:00:34what is MIT Intro to Deep Learning? This is a one-week boot camp on everything deep learning

00:00:41right, so it's both a very fun but also a very intense one week because we're going to cover

00:00:47a ton of material in just the next five days. Now this is our eighth year teaching this class

00:00:54and the pace of the field especially in the past couple years is really remarkable and every year

00:01:00that we teach this class it's getting more and more I should say interesting to introduce this

00:01:05this lecture in particular and how we introduce this lecture has really started to adapt and evolve

00:01:11over the many years. Now many of you in the audience have probably even started to become almost a bit

00:01:18I would say desensitized to a lot of the progress of deep learning in the past couple years because

00:01:24of this progress how rapidly this progress is happening so I think it's it's also important

00:01:29to not forget, you know, where we came from just a few years ago, so I want to show you, you know, this

00:01:34image right here just to start this off, and what better way to show you than for you to actually see

00:01:39the progress with your own eyes. Exactly one decade ago this was the state of a state-of-the-art

00:01:46deep learning based facial generation system so this is not a real face this was the state-of-the-art

00:01:52model that could generate faces and this was the best that we could do. Fast forward, you know, just a few

00:01:59years down the pipeline and progress in image generation had already started to advance tremendously

00:02:07and here you can see, you know, a lot more realism, photorealism, in these types of images

00:02:13being created and then you know fast forward another few years after that and and these images

00:02:18start to come to life right they start to have temporal information they start to have video and

00:02:23you know movement as well to those to those images right and in fact this this video that you see

00:02:30on the

00:02:30right is a video that we created in this class some years ago and for those of you who haven't

00:02:36seen

00:02:37it already it's it's uh you know it's online and and people have seen it but in case you haven't

00:02:41i'll play just the first 10 seconds i won't play the whole thing just so you can see it as well

00:02:45hi everybody and welcome to MIT 6.S191 the official introductory course on deep learning taught here at

00:02:56mit now i won't play the whole thing right but you get the gist of this video this this video

00:03:01was created

00:03:01five years ago we made it as part of this class and we we used it to actually introduce this

00:03:07class

00:03:08uh back then now when we did this in 2020 it got a lot of even back then right like

00:03:14especially back

00:03:14then maybe you guys aren't that impressed by it today but back then this video was a huge deal

00:03:18right this was a huge jump in photorealism for uh for for capabilities of deep learning models and the

00:03:25clip went you know very uh very rapidly a bit viral and people commented a lot about the realism but

00:03:31actually one interesting thing that people didn't see they saw the end result but actually at that

00:03:36point what people didn't see was that for us to generate that clip which was a two minute clip you

00:03:42only saw the first 10 seconds but that clip was two minutes in total and to generate that two minute

00:03:48clip it cost around two hours of professional audio data being recorded and captured of the speaker

00:03:55which was not us it cost around 50 hours of professional high definition video data

00:04:01to build you know the face model and it required around 15,000 us dollars of compute to generate that

00:04:11two minute video and all of that was just going to generate you know a predefined script right

00:04:16something static you couldn't talk to it you couldn't interact with it it was just a predefined

00:04:20script static not flexible at all but uh you know still a tremendous amount of resources both time

00:04:27data energy and financial resources as well and i want to start this class by basically asking this

00:04:35question of you know what would this look like today right that video was created in 2020 but today

00:04:41we're in 2025 and what would cloning you know and generative ai look like in today's world right

00:04:50especially understanding all of those limitations that we had back then in 2020 and i thought what better

00:04:56way to to to tell you about this than to just show you live instead of you know all of

00:05:01that compute

00:05:01and resources that we spent in 2020 let's do it live and unedited right in front of everybody today so

00:05:08in fact i have a bit of a secret so i'm going to share my other screen and move this

00:05:13over

00:05:15so this is a a terminal window that you can actually see something's been recording audio in the background as

00:05:22i've been talking i've been having this program that we made you know listening to my audio recording

00:05:27my voice and what i'm going to do is just uh you know pause the recording now and we're going

00:05:34to see

00:05:34how quickly it can actually immediately start to uh you know just take my voice and do an instant clone

00:05:40as opposed to all of that time intensive resources that we saw before so i'll pause it right now

00:05:46you'll see it instantly clone

00:05:50and and yeah now now my voice is cloned and what we're going to do with this is actually try

00:05:55and

00:05:55not have a predefined script like we saw before i'm going to actually have a live and unedited

00:06:01conversation with this voice and see how dynamic and flexible these types of generative ai models have

00:06:08become in today's world so let's try this out hello can you hear me hi

00:06:15yes loud and clear how are you doing wait a second actually you sound a lot like me don't you

00:06:21that's interesting you're right we're even more similar than i thought what are we up to

00:06:28well actually i was just kicking off this year's mit deep learning class and you know i'm standing

00:06:34right in front of the audience right now uh mit introduction to deep learning is such a great

00:06:40class what's the class like so far actually we just started uh you know what i had an idea so

00:06:47i was just

00:06:47about to introduce what is deep learning how about you take that part over for me deep learning is a

00:06:54type of machine learning that uses artificial neural networks to learn from data these networks are

00:07:00inspired by the structure of the human brain and are able to learn complex patterns from large amounts of

00:07:06data deep learning is used in a wide range of applications including image recognition natural

00:07:12language processing and self-driving cars it's pretty good actually you know i think the amazing

00:07:18thing i was telling everybody in the audience about was that the the state of deep learning today is so

00:07:23flexible it's so dynamic what do you think you're right deep learning is incredibly dynamic it's

00:07:29constantly evolving with new architectures algorithms and applications being developed all the time

00:07:34actually wait let me interrupt you right there just to show how dynamic you are i'm going to interrupt

00:07:38you and uh let's try this enough about deep learning how about you tell me a story about turtles and

00:07:43make

00:07:43it wild oh turtles you want a wild turtle story okay picture this there's this turtle named shelly right

00:07:52he's not your average turtle oh no shelly's got dreams he dreams of flying he spends his days practicing

00:07:58flapping his little flippers trying to get off the ground one day he meets this crazy inventor

00:08:03a mad scientist turtle named dr sheldon who's built a rocket powered shell shelly jumps in and boom

00:08:12they blast off into space they zoom past planets meet aliens who love to eat okay i'll interrupt you

00:08:18i'll interrupt you one more time thank you this was great it's turtle tastic oh let's end it here okay

00:08:26thank you so much this was a this was a cool way to introduce the class i'm glad i could

00:08:32help

00:08:33okay thanks bye bye bye okay awesome okay so that's just a fun way to show you know how far we've

00:08:49come in this field you know just from a few years ago generating very static content and now you know live

00:08:56unedited extremely dynamic content as well so you actually just heard a very brief introduction on

00:09:03what deep learning is and in fact in that demo and all of the progress that all of you have

00:09:08been seeing

00:09:09over the past many years you've been seeing what you'll see in this class over the next one week is

00:09:14the fundamental techniques that drive all of that progress so let's just start by maybe laying some

00:09:20foundation laying some groundwork on exactly you know what this field is all about and to do that

00:09:25i think first i have to introduce to you what is intelligence first of all right so to me the

00:09:32word

00:09:32intelligence means the ability to process information in order to inform some future decision right some

00:09:39future action this is what intelligence means so all of us exhibit this capability every single day

00:09:45you know some more than others but artificial intelligence is just the practice of building

00:09:51algorithms artificial algorithms to do exactly that same process right use information use data to

00:09:58inform future decisions now machine learning what is machine learning machine learning is a subset of

00:10:04artificial intelligence that focuses on not explicitly programming the computer how to use that data how

00:10:12to process that information to inform that decision but just try to learn some patterns within the data to

00:10:18make those decisions and finally deep learning is just a subset of machine learning which focuses on

00:10:25doing that exact process with neural networks deep neural networks and we'll learn exactly what what deep neural

00:10:31networks are throughout this class but really at a high level right this entire course is about this core idea

00:10:37fundamentally right this is what we will teach and you will all get a very strong handle on throughout

00:10:43this entire week is you will learn how to teach computers how to learn how to do tasks directly from

00:10:51observation directly

00:10:52from data and we'll provide you both a solid foundation you know in the lectures but also through practical

00:10:59understanding and software labs as well so you can get very hands-on and that's probably a good segue to

00:11:04tell

00:11:05you a little bit about the entire course just high level so this is going to be a combination like

00:11:10i said

00:11:10between the technical lectures and the software labs we'll have several new updates this year in particular

00:11:15as we're you know as the field is advancing so quickly we're really going to try start to uh you

00:11:20know

00:11:21you know drive home a lot of key points especially in more of the modern side of deep learning and

00:11:26then to

00:11:27that end we'll conclude with some guest lectures from industry leaders on state-of-the-art deep learning

00:11:33methods and ai methods that are being developed in industry and this will really start to advance your

00:11:39your knowledge even more in addition yes that's right also tonight we're going to have a reception

00:11:46at 4 30 and you're all invited to that reception as well uh to you know talk to to everyone

00:11:53and learn

00:11:54more about deep learning there's also food provided as well this year we also have a lot of great updates

00:12:00on the software labs uh so we'll be introducing both tensorflow and pytorch software labs and these

00:12:07are you know number one these are a great learning experience for all of you to get hands-on with

00:12:11everything that you learn in the lectures but also they're a medium for you to enter into the competition

00:12:16prizes and make yourselves eligible for a lot of cash prizes at the end of this course so how exactly

00:12:22does that work each day we'll have a dedicated lecture and we'll have a dedicated software lab that mirrors

00:12:29that lecture and the the software lab will just basically reinforce what has been taught during

00:12:34the day in the lectures uh starting today you'll have lab one where you're going to basically focus

00:12:40on building a form of a language model actually it'll be a very small language model but it's a next

00:12:45token predictor language model that learns how to generate music and predict the next token of music

00:12:50so you can generate novel folk songs and then tomorrow we'll move on to facial detection systems

00:12:57you'll get hands-on with building your own computer vision system from scratch understanding also

00:13:03some automated techniques to fix imbalanced data in those systems and then finally lab three is going

00:13:09to be a brand new lab premiering this year for the first time on large language models and you're

00:13:15actually going to in that part of that lab fine-tune a two billion parameter large language model

00:13:22uh on uh you know on compute that you'll control in a mystery style and you'll also build an ai judge

00:13:31to evaluate the quality of that language model so all three of these labs are going to be you know a

00:13:35lot of fun and then finally on the last day of the class we'll have a final project pitch

00:13:41competition with groups i think of three to five people and each group is presenting for

00:13:49three to five minutes kind of in a shark tank style pitch competition and then you'll be eligible

00:13:55for even more prizes as part of that as well okay uh i won't go through this slide there's many

00:14:03great

00:14:03resources available as part of this class uh this slide as well as the entire lectures are all posted

00:14:09online you can already check the website they should be online already and if you ever need any help

00:14:14please post on piazza if you have any questions we have a team of incredible tas and instructors this

00:14:20year that you can reach out to at any time for any questions or issues myself and ava will be

00:14:26your two

00:14:26main lecturers for most of the course but then you'll also be hearing from a lot of guest lectures uh

00:14:31throughout the rest of the class uh here are some of the names this course in general would not have

00:14:37been

00:14:37possible uh each of these years without all of our amazing sponsors so i do want to give a huge

00:14:42thank you for all of their support over the years okay so now that we've gone through all of that

00:14:48i

00:14:49want to start with a lot of the the funds yeah go ahead sure yeah that's right yeah so this

00:14:58course

00:14:58has been taught for eight years we've taught it to over around now 13 million people uh so and just

00:15:05at

00:15:05mit alone because mit you know that's the global audience the mit audience is around probably 3 000

00:15:12at this point and every year online around 100 000 people take this class so you're you're in great

00:15:18company and a lot of really amazing people have taken this class and we're really excited for all of

00:15:23you to be here today so i want to start now as we dive into the technical part of this

00:15:29class i want

00:15:29to start by you really asking this fundamental question why deep learning and why now and hopefully

00:15:34this is a question all of you have asked before you came here today uh you know understanding

00:15:39exactly what gets at the basis of deep learning is really important so that we can understand how

00:15:42we can move forward and build even better algorithms that drive this field so traditional

00:15:48machine learning maybe if we start there for a second traditional machine learning typically defines

00:15:53what are called sets of features i'll tell you more about that word in a second but usually what these

00:15:59features are these are basically rules of how to do a task step by step right and the problem is

00:16:06that

00:16:06if we as humans define those features we're not usually very good at building very robust features

00:16:14so for example let's say i told you to build an ai model that could you know

00:16:21detect faces how would you do this what features would you build in an image to detect faces well what you

00:16:28could do is you could start by first detecting lines in the image just edges right very

00:16:35simple lines then you could start to compose those lines together to detect things like curves

00:16:43basically yeah curves of lines not just straight lines

00:16:52and then you can combine those together to start to form more composite objects right like eyes

00:16:56and noses and ears and then from there you can actually start to build up structures of faces

00:17:01why would you do it like this well it's actually naturally hopefully this is the way that you would

00:17:05also think of doing it because it's very hard to immediately just one shot detect a face you actually

00:17:10don't process faces like this first of all you actually start by processing much more coarse features

00:17:15the low level features first or excuse me the high level features first then you compose these together to really

00:17:20form your own intuition about a face right now the key idea of deep learning is no different than

00:17:27than this process the key idea is to learn these features instead of me telling you or you telling me

00:17:33exactly what those features are the key idea of deep learning is to say after observing a lot of faces

00:17:39can i learn that i should first detect things in this hierarchical fashion step by step you know first detect

00:17:45the lines then detect the curves and detect the you know the composites like eyes noses and ears and

00:17:51then build up to facial structure like this and it turns out this is exactly what deep learning is able

00:17:56to do and we'll see how how this is being done underneath the hood throughout this lecture it's really

00:18:02important to understand though that you know even though we are seeing so many of these amazing

00:18:06things of deep learning over the past few years everything that you'll learn especially in today's lecture

00:18:11this is an intro lecture so almost everything that you'll see today has been invented or developed

00:18:18decades ago right these are not new things that we'll be showing in today's lecture

00:18:24tomorrow and the day after and after that you'll start to see a lot more of the recent advances

00:18:29but why are we seeing this all today right the reason is because number one we see an explosion of

00:18:36these techniques even the techniques that are decades old because of three key components

00:18:41number one is data right data is becoming more and more plentiful throughout the world and this is

00:18:47really driving deep learning progress the compute is number two right compute is becoming more and

00:18:53more powerful and more and more commoditized gpu architectures especially are driving the progress

00:18:58in deep learning and gpus were you know only recently starting to be commoditized and finally open

00:19:06source toolboxes like you see on the right hand side tensorflow pytorch keras and so on you know make

00:19:11it very very streamlined and very easy for all of you just in a one-week course to get hands-on with

00:19:16these architectures and start to build directly so let's start by you know just understanding the

00:19:23fundamental building blocks of every neural network and that's just a single neuron or a perceptron right so

00:19:29what is a perceptron the idea of a perceptron or a single neuron is is really simple right so let's

00:19:36start by

00:19:37just taking and defining a perceptron purely by its forward propagation of information so given some

00:19:43inputs how does the perceptron compute an output let's start by defining you know a set of inputs x1 to

00:19:51xm

00:19:52and each of these numbers each of these inputs will be multiplied by a corresponding weight w1 through wm

00:20:00what we're going to do is after we do this multiplication we're going to add up all of

00:20:04those numbers together we'll take the single number that comes out and then we'll pass it through what's

00:20:09called a non-linear activation function this is just a non-linear one-dimensional function that you

00:20:14pass this single number through to get the output here it's denoted as g

00:20:22okay i left out one minor detail so i'll correct it right now so the one thing that we have

00:20:28to

00:20:28remember is that after we multiply all of our weights by our inputs we're also going to add this

00:20:34one number called a bias term and the bias term is effectively if you look at the equation it's a

00:20:40way

00:20:40for us to shift left and right along our activation function g so this is just a shifting scalar

00:20:47designed within the equation here now on the right hand side of

00:20:54this slide you can actually see the diagram on the left mathematically

00:20:59written as a single equation right now i'm going to rewrite this for the sake of cleanliness

00:21:08using linear algebra in terms of vectors and dot products so let's do that now so now instead of you

00:21:14know x1 through xm i'm going to write just a vector capital x capital x is going to be a

00:21:20collection

00:21:21of all of my inputs and capital w will be a collection or a vector of all of my weights

00:21:28the output then y is simply just going to be obtained by having a dot product between x and w

00:21:35adding our bias and passing through g passing through a non-linearity
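
The forward pass just described (dot product, add the bias, apply the non-linearity) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the numbers are hypothetical, and the lecture's own slides use TensorFlow and PyTorch rather than this code.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x: list[float], w: list[float], b: float, g=sigmoid) -> float:
    """Forward pass of a single neuron: y = g(b + sum_j x_j * w_j)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = b + sum(xj * wj for xj, wj in zip(x, w))  # dot product plus bias
    return g(z)                                   # non-linearity


# Hypothetical inputs, weights and bias, chosen only for illustration.
y = perceptron(x=[0.5, -1.0, 2.0], w=[0.1, 0.4, -0.3], b=0.2)
print(y)
```

Passing `g` as a parameter keeps the three steps separate, so a different activation can be swapped in without touching the rest.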

00:21:42now you might be wondering you know i've mentioned this non-linearity a few times what is this thing

00:21:46well i said it's a non-linear function right but what exactly is it one common example here would be

00:21:52what's called the sigmoid function so the sigmoid function you can see right here it's it basically can

00:21:58act over any real number on the x-axis but it outputs only between zero and one so usually the

00:22:05sigmoid function is really good for things like probabilities if you wanted to convert your output of your

00:22:11perceptron your neuron to a probability but in fact actually there's many types of non-linear

00:22:16functions it's just one function that's commonly used in neural networks but throughout this presentation

00:22:21you'll see basically a few examples of different non-linear functions and also i'll point out on the

00:22:27bottom of this slide you can see some code snippets both in tensorflow and pytorch that will help kind of

00:22:33like uh you know align what you're seeing in the math with also code that will be relevant for

00:22:38some of your software labs later today the sigmoid function that you saw earlier this output's very

00:22:44good for probabilities you'll also see things like on the right hand side this is called the rectified

00:22:50linear unit this outputs uh things that are strictly positive it is piecewise linear so it is linear

00:22:57before zero and is linear after zero but it has a single uh non-linearity at x equals zero
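
The two activation functions mentioned here are easy to write out by hand. The sketch below uses plain Python as a stand-in for the TensorFlow and PyTorch snippets on the slide, which the transcript does not reproduce. Note that the rectified linear unit returns exactly zero for every negative input, so its output is non-negative rather than strictly positive.

```python
import math


def sigmoid(z: float) -> float:
    """Maps any real z into (0, 1); fine for moderate z, but math.exp
    overflows for very large negative z."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    """Rectified linear unit: 0 for z <= 0, z itself for z > 0."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), relu(z))
```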

00:23:03now why do we need activation functions actually that's a question hopefully everyone

00:23:09here asks this seems unnecessary at first glance the point of an activation function is

00:23:15actually quite simple it's to introduce non-linearities into your model right without a non-linear activation

00:23:22function you have a linear model so why do we want non-linearities well it's just because real data

00:23:28in the real world is heavily non-linear right now maybe just a good example to show

00:23:36this would be let's say i show you this picture here and if i asked you to build a classifier a

00:23:44separation draw a single line that separates the red points from the green points can you draw that line

00:23:51at first glance yes you could draw the line but what if i told you that it had to be a straight line

00:23:55right if i told you it's a straight line then it's not really possible anymore to do this task well

00:24:01so that that makes the problem really hard that's the problem with having a linear model
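
One way to see why a model without activation functions stays linear, no matter how many layers it has: stacking two linear maps gives another linear map. The one-dimensional sketch below checks this numerically. All parameter values are hypothetical and the helper names are not from the lecture.

```python
def linear(w: float, b: float):
    """Return the 1-D linear map x -> w * x + b."""
    def f(x: float) -> float:
        return w * x + b
    return f


# Hypothetical parameters for two stacked layers with no activation between them.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5

layer1 = linear(w1, b1)
layer2 = linear(w2, b2)


def stacked(x: float) -> float:
    """Two linear layers applied one after the other."""
    return layer2(layer1(x))


# w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
collapsed = linear(w2 * w1, w2 * b1 + b2)

for x in (-1.0, 0.0, 2.5):
    assert abs(stacked(x) - collapsed(x)) < 1e-12
```

Inserting a non-linearity such as a sigmoid between the two layers breaks this collapse, which is what lets depth add expressive power.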

00:24:06the benefit of having non-linearities is that it allows us to approximate arbitrarily complex functions

00:24:13with enough depth in our model this is exactly what makes neural networks non-linear neural networks

00:24:19extremely powerful let me just help you all understand this with a simple example

00:24:25so imagine i gave you now a trained neural network that you can see here it has one perceptron

00:24:32right but it has two inputs x1 and x2 it also has two weights w1 and w2 and it also

00:24:39has this

00:24:39bias term on the top as well now how would we how would we process this information it's the same

00:24:47story as before we're going to compute a dot product we're going to add the bias and pass it through

00:24:53our

00:24:53non-linearity g now if we plug in our data we already know our inputs here our inputs are going

00:25:00to be uh let's see it's it's positive three is the first input x1 and negative two is our second

00:25:07input

00:25:08x2 we can plug these into our equation along with our weights as well and we can see actually

00:25:16uh that we can obtain this line which is going to be a two-dimensional line that parameterizes our

00:25:24entire function space of this neuron since it's only in two dimensions we can even plot this line so we

00:25:32can say exactly how this whole space would look like and for any new input that this model sees

00:25:39where with respect to this line would it fall so let's say for example that i had this new point

00:25:47here the point is on this x1 x2 space it's going to be at the point (-1, 2) and we can see

00:25:55graphically exactly where on this plot it falls with respect to the line now also we can plug it back

00:26:02into

00:26:02our equation as well and we can see exactly you know okay if we plug in one or excuse me

00:26:09negative

00:26:09one as our input to x1 positive positive two to our input of x2 we can plug into the equation

00:26:15on the

00:26:15bottom left we pass it through the non-linearity the non-linearity here is a sigmoid function

00:26:20it squashes everything to be between zero and one depending on which side of the line it falls on and we

00:26:27get this

00:26:27final answer here in this case the final answer is point zero zero two right this is less than point

00:26:33five point five is going to be the divider because all of our outputs are going to be separated between

00:26:37zero and one and we can actually graphically represent this as well so if you fall directly

00:26:42on the line your output after non-linearity is going to be point five the more to the blue side

00:26:48that you fall the farther under point five you are and the more to the green side you fall

00:26:54the farther above point five you fall so the line basically represents the point of separation between

00:27:00these two sides of the space and depending on which side of the line your input falls on this is the

00:27:05way for

00:27:05you to classify this point as either a positive point or a negative point
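
The worked example can be reproduced numerically. The transcript does not spell out the slide's parameters, so the sketch below assumes a bias of 1 and takes the 3 and -2 mentioned above as the weights w1 and w2. Under those assumptions the point (-1, 2) gives the 0.002 quoted in the lecture.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Assumed parameters (not stated explicitly in the transcript).
b, w1, w2 = 1.0, 3.0, -2.0

# The new point from the example.
x1, x2 = -1.0, 2.0

z = b + w1 * x1 + w2 * x2   # 1 - 3 - 4 = -6
y = sigmoid(z)
print(z, round(y, 3))       # -6.0 0.002

# Points with z == 0 lie exactly on the line 1 + 3*x1 - 2*x2 = 0
# and map to 0.5; the threshold 0.5 separates the two sides.
label = "positive" if y > 0.5 else "negative"
print(label)                # negative
```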

00:27:12so now it's also important to understand you know we just did this for a single neuron with two inputs

00:27:17you

00:27:18can imagine that if you had a model with many more inputs than two it would no longer be possible

00:27:23to

00:27:24draw this plot and this is something that we'll have to deal with in terms of understanding and

00:27:29building intuition but hopefully even at the small scale you can build some level of intuition

00:27:34even with this plot let's see how we can now start to tie some of this together to go beyond

00:27:41just one

00:27:42neuron and start to build networks right because this is where we actually build really powerful systems

00:27:47it's not just from one neuron but from full networks so to do that let's just revisit our diagram one

00:27:52more

00:27:52time if there's a few things that you take away from this class this is hopefully the slide that

00:27:57you take away from right and i've said it already a few times i'll say it one more time how

00:28:01do you pass

00:28:02information through a neuron you take a dot product you apply a bias and you apply a non-linearity it's

00:28:09these three steps and these keep getting repeated over and over again i will simplify the diagram since

00:28:14i've now told it to you so many times hopefully it's starting to stick now i'm going to remove all

00:28:19of the

00:28:19weights from this diagram and i'm also going to remove the bias term so now you can always assume

00:28:24that those two things are there i'll just remove them from the illustration moving forward to keep

00:28:29things cleaner now z here z is going to be the result of that term it's going to be the

00:28:35result of

00:28:36the dot product plus the bias it's going to be before the non-linearity g okay so we will then

00:28:42pass

00:28:42z through g and that will give us y and you can see that represented right here
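
A minimal sketch of the three steps just described (dot product, bias, non-linearity), in plain Python. The input values, weights, and the choice of a sigmoid for g are illustrative assumptions, not numbers from the lecture slide.

```python
import math

def sigmoid(z):
    """Non-linearity g: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b, g=sigmoid):
    """Pass the inputs x through one neuron with weights w and bias b."""
    # Steps 1 and 2: dot product of inputs and weights, plus the bias.
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    # Step 3: apply the non-linearity to get the output y.
    return g(z)

# Illustrative values (assumed): two inputs, two weights, zero bias.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
y = neuron_forward(x, w, b)  # z = 0.5 - 0.5 + 0.0 = 0.0, so y = sigmoid(0) = 0.5
```

Swapping `g` for another function changes only the last step; the dot product and bias stay the same.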

00:28:49okay now what if we wanted a multi-output neural network not one output but two outputs how would

00:28:56we change this picture okay it actually is pretty simple we now just create a second perceptron we now

00:29:02have two neurons instead of one neuron both neurons have the exact same inputs but because their weights

00:29:08are different they will have two different outputs so they both take as input the same information

00:29:12they process it their own way with their own weights and they make two different outputs from scratch

00:29:21now these types of network these types of layers let me call them a layer are typically called dense

00:29:27layers because everything in my inputs is connected to everything in my outputs and if you exclude the

00:29:33non-linearity this is also a linear layer right this is a linear layer because it takes all of my

00:29:39inputs x and just linearly operates them with my weights w and adds a bias which is also a linear

00:29:46operation

00:29:47so we can now actually implement this entire operation from scratch in python so let's try it out

00:29:54so we're going to start by just defining those two weights we define self.w this is our weight

00:30:02vector and we also have self.b which is our bias scalar this is just one number but here since

00:30:08it's an

00:30:09entire n-dimensional output we'll actually have n neurons in the output as well

00:30:18when we want to do our forward pass through this through this layer how do we do this it's the

00:30:25same

00:30:25story as before we take a dot product which here is this matrix multiply we add the bias

00:30:30and then we apply our non-linearity this is the sigmoid here but you could change this to any

00:30:35non-linearity in pytorch you can actually see that there's an almost perfect analog on the left and the

00:30:42right side it's the same story here you create your your two weights your weight and your bias

00:30:47you apply a matrix multiply add the bias and apply your non-linearity exactly the same as before

00:30:53now luckily tensorflow and pytorch have already implemented this type of dense or linear layer

00:31:00for us so we don't need to do that that was just a good learning exercise what we just went

00:31:05through

00:31:05but here you can just call it right you can see the function calls on the bottom
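
The slide code itself is not captured in this transcript, so the following is only a sketch of what a from-scratch dense layer could look like in plain Python. The names `W` and `b` mirror the `self.w` and `self.b` mentioned above; the class name, the random initialization, and the seed are assumptions made for illustration.

```python
import math
import random

class DenseLayer:
    """A fully connected layer: every input feeds every output neuron."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # W[j][i] is the weight from input i to output neuron j (random init, assumed).
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                  for _ in range(output_dim)]
        # One bias per output neuron.
        self.b = [0.0] * output_dim

    def forward(self, x):
        outputs = []
        for w_row, b_j in zip(self.W, self.b):
            # Dot product plus bias for this output neuron.
            z = sum(w * x_i for w, x_i in zip(w_row, x)) + b_j
            # Sigmoid non-linearity; any other non-linearity could be used here.
            outputs.append(1.0 / (1.0 + math.exp(-z)))
        return outputs

layer = DenseLayer(input_dim=2, output_dim=3)
outputs = layer.forward([4.0, 5.0])  # three outputs, one per neuron
```

The built-in equivalents are `torch.nn.Linear` in PyTorch and `tf.keras.layers.Dense` in TensorFlow; note that by default neither applies a non-linearity unless you add one.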

00:31:11now let's take a look at a single layered neural network a single hidden layer so not where

00:31:17the output is directly from a single perceptron but where we have to actually pass through two

00:31:23layers okay what does that look like so this is one where we have the single layer single hidden

00:31:29layer is placed basically between our output layer and our input layer why do we call it hidden well

00:31:35it's just because we don't directly observe the data that happens in this layer right input layer

00:31:40is data that we provide to the model the output layer is typically things that we would supervise

00:31:44over the hidden layer is one that is learned over the course of uh just observing the data

00:31:52since we now have a transformation both from inputs to hidden and from hidden to output we now have two

00:31:59layers right and that's also going to mean that we need two weight matrices w so we'll actually

00:32:06have a w1 on the left hand side and a w2 on the right hand side now if we look

00:32:12at a single unit a

00:32:14single perceptron a single neuron in that hidden layer let's take z2 for example it's just a

00:32:20perceptron that we saw before it's the same story nothing here has changed its answer is computed by

00:32:27taking a dot product adding a bias and passing it through this non-linearity if we took a different

00:32:33node a different neuron let's say like z3 the one right below it it would also be computed with a

00:32:39dot product

00:32:39bias non-linearity it would take the same inputs as z2 but it would have different weights so the dot

00:32:46product and the bias would be different this picture again looks a bit messy so i'm going to simplify it

00:32:51even more i'm going to replace all of the arrows with a single icon here this symbol

00:32:57which is just going to denote uh you know this dense layer this this linear connection layer that is

00:33:03happening between these two components and again we can see that to build a network like this in

00:33:10tensorflow pytorch these convenience functions are really starting to uh you know help us a lot

00:33:16because we don't have to implement a lot from scratch now if we wanted to create a deep neural network

00:33:22how would we do that what is a deep neural network it is nothing more than just sequentially stacking

00:33:27more and more of these linear layers followed by non-linearities followed by more linear layers

00:33:32followed by more non-linearities over and over and over again in a hierarchical fashion so this is just

00:33:38a model where the final output is created as a hierarchical combination of going deeper and deeper

00:33:45into these linear followed by non-linear uh operations
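
The stacking idea can be sketched as a loop: the output of one dense layer becomes the input of the next. This is an illustrative plain-Python version; the layer sizes, the uniform random initialization, and the helper names are assumptions, not the lecture's code.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b):
    """One dense layer: dot product, bias, then non-linearity, per output neuron."""
    return [sigmoid(sum(w * x_i for w, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases (an assumed initialization)."""
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def forward(x, layers):
    """Deep network: apply the layers one after another."""
    h = x
    for W, b in layers:
        h = dense(h, W, b)
    return h

rng = random.Random(0)
sizes = [2, 3, 3, 1]  # 2 inputs, two hidden layers of 3 units, 1 output (assumed)
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
prediction = forward([4.0, 5.0], layers)  # a single value between 0 and 1
```

Making the network deeper only means adding entries to `sizes`; the forward loop does not change.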

00:33:51yes please

00:34:04exactly yes of course a question was about maybe just for a quick real world example of why we would

00:34:09have a different layer so different layers on the depth axis but also different outputs on the you

00:34:15know up down axis right so different layers on the x-axis basically this corresponds to more depth more

00:34:21complexity in your network right so for more complex tasks you would want more depth because you're

00:34:25introducing more hierarchical non-linearities after one layer if you have a single uh you know dense

00:34:33connection followed by a non-linearity you have a limited amount of complexity that you can extract

00:34:37it's only coming from one non-linearity so it's limited to the expressive capacity of that single

00:34:42non-linearity so as you get more and more complex tasks you require deeper and deeper expressive

00:34:49functions so that's this one axis on the other axis more outputs this is just a problem definition

00:34:56so if you wanted to predict more things then you would need more outputs a good example is that if

00:35:02you wanted to do generation right let's say to generate an image you would need to generate values

00:35:07for every pixel in that image that's a lot of outputs right versus if you wanted to just predict

00:35:12let's say uh you know the weather tomorrow that's a temperature value right it's just one output just

00:35:18one number right so depending on your problem definition those two things can change

00:35:25excellent okay so now that we have uh an idea of architecturally what makes up a neural network

00:35:33i think now it's time that we can actually start to compose all of this together and actually in line

00:35:37with that that example that question that just came up let's try and go through an example of applying

00:35:42some of this theory into practice and actually understand you know uh how we can look to apply

00:35:49a neural network to solve a very real problem let's say maybe not that real but maybe real for all

00:35:54of you

00:35:54that you've been thinking about it so here's a question that maybe all of you have been asking

00:35:57yourselves you know will i pass this class and let's try and build a neural network that can infer or

00:36:04predict this answer for all of you so we're going to do this by building a very simple model it

00:36:10will

00:36:10take as input two inputs and have one output the one output is going to be will i pass this class

00:36:16yes or no

00:36:17so a single number the probability of passing the class and it will be two inputs defined by number one

00:36:23how many lectures you attend over the course of this one week and number two the number of hours that

00:36:30you spend on your final project okay so let's let's plot because we've taught this class for many years

00:36:35we actually have data from past students on this operation we can look at all of the green points

00:36:40are people that have passed the class all the red points are people that have not and we can also

00:36:45plot

00:36:45where you are or you can guess how many hours you're going to spend on this class how many hours

00:36:49you're

00:36:50going to spend on the the final project and what we want to do is build a neural network that

00:36:55will

00:36:55determine from all of this past data of all of these past students where will you fall on this

00:37:01probability chance of passing versus not passing so let's do it okay we've we've actually learned all

00:37:07of this so far in the class so let's take it step by step we have two inputs this is

00:37:12this new person

00:37:14right you have attended four lectures and you spend five hours on your final project

00:37:19those two numbers you can feed in as input on the left hand side to your model we also have

00:37:24a single layered neural network a very basic neural network we're just going to start with this for

00:37:28now and we're going to see that our hidden layer has three hidden units and our output will just have

00:37:34one output which is a binary output yes or no on passing the class or not and what we're going

00:37:41to see

00:37:41is actually that this model got the answer very wrong it predicted that you would pass the class with

00:37:47probability 0.1 or 10 percent when in reality actually you did very well you definitely passed the class

00:37:54so can anyone tell me you know why you think that this network failed so badly here

00:38:01yes exactly exactly yes so the answer was it's not trained and that's exactly right so the the model

00:38:08here hasn't seen any of this data that we showed on the previous slide right it's it's basically like

00:38:13a baby that has not seen any knowledge about the the real world it doesn't know anything about this

00:38:18problem as well it needs to first learn about this problem and this is something that we haven't talked

00:38:22about so far in order to train our model our model has to also understand when it makes bad

00:38:29predictions what does a bad prediction mean it means that it has to be able to quantify how bad a

00:38:35prediction is versus how good a prediction is this is called the loss of a neural network a neural network's

00:38:41loss it's just going to be a measure of how far apart its predictions are from the ground truth

00:38:48answers or the ground truth observations of of a piece of data the closer your or the smaller your

00:38:56loss is it means the closer these two things are so your predictions are really matching the ground truth

00:39:02this results in a small loss now let's assume that our data is not just from one student but we

00:39:11actually

00:39:11have past data from many students right now we want to care about how the model is doing not just

00:39:16for this

00:39:17one student but in aggregate empirically across the entire past class now this is what we call training on

00:39:24not just a single data point but we train on an entire data set so when we train neural networks

00:39:29we want

00:39:30to find neural networks that minimize our loss or maximize our accuracy not just on one student but on the

00:39:38aggregate empirical data set this is called the empirical loss and it's just simply the average of my loss for

00:39:45every data point in my data set now right now we've been focusing on this problem of binary classification

00:39:52yes no answers and for those types of losses we can use what's called a softmax cross entropy loss

00:39:59we'll learn more about this later but this is measuring the difference or the distance between

00:40:04two probability distributions two binary probability distributions now let's just suppose that instead

00:40:12of predicting a binary output we want to predict a final output that is a real number

00:40:20like a continuous value so let's say like a grade a percentage grade instead of will i pass the class

00:40:25or will i not but a percentage grade of how well i'll do for doing something like that we won't

00:40:31be

00:40:31able to use a binary loss anymore so we'll have to actually change our loss we can change it for

00:40:35example

00:40:36to a mean squared error loss so we can take our two grades predicted grade and true grade subtract them

00:40:41and then square them to create a distance measure and these are roughly the two types of losses that

00:40:48you'll see both categorical discrete losses like binary losses as well as continuous losses like mse losses
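
The two loss types can be sketched in a few lines of plain Python. The function names and the clipping constant `eps` are assumptions for illustration; the binary cross entropy shown is the two-class case of the cross entropy loss mentioned above, and both functions average over the data set to give the empirical loss.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted real values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# The student passed (label 1), but the untrained model predicted 0.1.
bce_bad = binary_cross_entropy([1], [0.1])   # -ln(0.1), roughly 2.30
# A better prediction of 0.9 for the same student gives a much smaller loss.
bce_good = binary_cross_entropy([1], [0.9])  # -ln(0.9), roughly 0.105

# Predicting percentage grades instead: errors of 5 and 10 points.
mse = mean_squared_error([90.0, 80.0], [85.0, 70.0])  # (25 + 100) / 2 = 62.5
```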

00:40:54of course there are so many other losses that you'll get exposure to over the class but these two

00:40:59have very wide coverage in the field okay let's put all this information together and now start uh

00:41:09talking about the problem of actually finding our weights of the network right we've talked about

00:41:14defining the network we've talked about uh basically penalizing the network when it gets something

00:41:18wrong we have not talked about how to actually improve the network or train the network so let's

00:41:24talk about that in this next part the objective here what are we trying to do ultimately at the

00:41:30end of

00:41:30the day throughout this entire class is that we're trying to find and build networks or models build

00:41:38models that minimize the loss on the data set the loss measures this difference between predicted

00:41:43and true we want to minimize we want to find a network that minimizes the loss on a data set

00:41:50this

00:41:51means mathematically right walking through this equation it means that we want to find the w's we want to

00:41:57find the weights that will result in the minimum l the loss over the entire data set from one to

00:42:07n
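
The equation being walked through is not captured in the transcript. One standard way to write the objective described here is given below, using J for the loss as the lecture does later; the symbols f, x^(i), y^(i), and the per-example loss L are assumed notation.

```latex
J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x^{(i)}; W),\, y^{(i)}\big),
\qquad
W^{*} = \operatorname*{arg\,min}_{W} \; J(W)
```

Here J(W) is the empirical loss, the average of the per-example loss over all n data points, and W* is the set of weights that makes it as small as possible.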

00:42:09now remember that w weights is just this is just going to be a collection of all of the weights

00:42:15in

00:42:16our entire model so it's the weights from every single layer in our network we're just going to

00:42:20combine all of those into one piece and those are the weights that we're going to try and optimize over

00:42:26now how do we do this optimization procedure well remember that our loss function is just a function of

00:42:34our weights given a set of weights our loss function will return a single value it is how

00:42:39far apart our predicted answers are from our true answers if we only had two weights in our network

00:42:45then we would be able to plot our loss landscape like a picture like this we would be able to

00:42:50plot in

00:42:51a grid of data over weights one and weights two and for every configuration of my weights i'd be able

00:42:57to see

00:42:58how much error or how much loss that configuration of weights is obtaining now what we want to do is

00:43:05basically find the lowest point on this landscape we want to find which weights one and weights two

00:43:11correspond and give us the smallest loss so how can we do this well we can start at some random

00:43:18point

00:43:19we pick a random point in our landscape any point and we start from this point and what we'll do

00:43:25is we

00:43:25compute what's called the gradient the gradient will tell us which way is up from this point right

00:43:31it's a local measure it only tells us locally from where i stand right here which way is up and

00:43:38what

00:43:38i'll do is i will take a small step in the opposite direction right and i'll take a small step

00:43:44going down

00:43:44that loss and then i will repeat this process over and over and over again until i finally get to

00:43:49the

00:43:49bottom of the mountain of the hill right and i converge at what's called a local minimum

00:43:56we can summarize this this algorithm this procedure as uh as what's known as gradient descent in pseudocode

00:44:05right so let's go through it again one more time very briefly we start by randomly initializing our

00:44:10weights this means that we randomly pick a place in our landscape we compute the gradient

00:44:17here called dj dw this is how much a small change in our weights changes our loss right so this

00:44:24tells

00:44:25us the direction that we should change our weights in order to increase our loss we take a small step

00:44:31in

00:44:31the opposite direction so here you can see that actually we take that gradient we multiply by negative

00:44:36one we go in the opposite direction of that direction and then we multiply it by a small

00:44:42uh step let's call it eta eta here is going to be a step size of how much in that

00:44:48direction we actually

00:44:50move and then we repeat this in a loop over and over again
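
The loop just described can be sketched on a toy problem with only two weights. The bowl-shaped loss below, its hand-written gradient, the step size, and the number of steps are all assumptions chosen so that the minimum (at w1 = 3, w2 = -1) is known in advance.

```python
import random

def loss(w):
    """Toy loss landscape over two weights, lowest at w = [3, -1]."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """dJ/dW for the toy loss: points in the direction that increases the loss."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w, eta=0.1, steps=200):
    for _ in range(steps):
        g = gradient(w)
        # Step in the opposite direction of the gradient, scaled by eta.
        w = [w_i - eta * g_i for w_i, g_i in zip(w, g)]
    return w

# Start from a random point in the landscape.
rng = random.Random(0)
w_start = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
w_final = gradient_descent(w_start)  # ends up very close to [3, -1]
```

In a real network the gradient is not written by hand; it comes from backpropagation.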

00:44:55in tensorflow right you can see this exactly represented the same way but here i want to draw

00:45:00your attention to this term right this is the direction term it tells us the gradient the gradient is

00:45:06going to tell us how or which direction is going up or which direction is going down if you take

00:45:12the

00:45:12negative of it but i never actually told you how to compute this right i just told you that we

00:45:17need to

00:45:18compute this right the process of computing the gradient in a neural network is called back propagation so i

00:45:25think it would be helpful also we can take a quick you know step-by-step example walking us through

00:45:32how back propagation works and how you would compute this gradient for a particular neural network and

00:45:39we'll start just for demonstration we'll start with the simplest neural network that exists it consists of

00:45:46one input one output and one hidden neuron in the middle right so you cannot get a simpler network than

00:45:51this and we want to compute the gradient of our loss l at the end or excuse me here it's j

00:45:58at the end

00:45:59with respect to let's start with with respect to w2 okay so how much does a small change

00:46:06in w2 affect our loss

00:46:10so we can write out this derivative right we can write it out in math and we can use the

00:46:15chain rule

00:46:16to actually decompose it now why would we want to decompose it well first of all we decompose this

00:46:22this gradient dj dw2 into two terms dj dy and dy dw2 this is just a basic extension of the

00:46:32chain rule

00:46:33nothing magic here but why is this possible it is possible because y is dependent only on the previous

00:46:40layer okay now let's suppose now that we want to compute the gradients of this weight before w2

00:46:49let's say w1 here what we can do is just replace w2 in this equation with w1 and then again

00:46:56we have

00:46:56to apply the chain rule yet again right because computing this last term here is not well defined

00:47:02so we have to actually expand it one more time this is why we call it propagation back propagation

00:47:06because you actually have to start from the output and keep computing these iterative chain rules back

00:47:12and back over the course of your network step by step and we repeat this process of you know propagating

00:47:19those gradients all the way from output to input across our weights and at the end of this whole

00:47:25process what we're left with is for every single weight in our network we have this direction of

00:47:32basically saying okay if we increase this weight a little bit will our loss go up or down now if

00:47:38our

00:47:38loss was to go down that means that we should increase that weight just a little bit right or we

00:47:43would go in the opposite direction and that's the back propagation algorithm right in theory it's it's

00:47:50nothing more than an application of the chain rule from uh differential calculus but in practice you know

00:47:59it can get very messy and very hairy it's very computationally expensive to do because you have to do

00:48:03this

00:48:04you know step by step for every single weight in your in your model uh so in practice today's deep

00:48:10learning frameworks like tensorflow pytorch they do this automatically so you don't necessarily need

00:48:14to implement this yourself but it's important to understand you know the the practical the the

00:48:20theoretical side of you know how these things are operating and and what it's doing underneath the hood
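
To make the chain rule concrete, here is a sketch of backpropagation through the one-input, one-hidden-neuron, one-output network. The specific choices are assumptions for illustration: a sigmoid on the hidden neuron, a linear output, a squared-error loss J, and no biases. A finite-difference check is included so the hand-derived gradients can be compared against a numerical estimate.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2):
    z1 = sigmoid(w1 * x)  # hidden neuron
    y = w2 * z1           # output (linear, assumed)
    return z1, y

def loss(y, target):
    return (y - target) ** 2  # J

def backprop(x, target, w1, w2):
    z1, y = forward(x, w1, w2)
    dJ_dy = 2.0 * (y - target)
    # dJ/dw2 = dJ/dy * dy/dw2
    dJ_dw2 = dJ_dy * z1
    # dJ/dw1 = dJ/dy * dy/dz1 * dz1/dw1, one more chain-rule step back.
    dy_dz1 = w2
    dz1_dw1 = z1 * (1.0 - z1) * x  # derivative of sigmoid(w1 * x) w.r.t. w1
    dJ_dw1 = dJ_dy * dy_dz1 * dz1_dw1
    return dJ_dw1, dJ_dw2

def numerical_gradient(x, target, w1, w2, h=1e-6):
    """Central finite differences, used only to check the analytic result."""
    def J(a, b):
        return loss(forward(x, a, b)[1], target)
    d1 = (J(w1 + h, w2) - J(w1 - h, w2)) / (2.0 * h)
    d2 = (J(w1, w2 + h) - J(w1, w2 - h)) / (2.0 * h)
    return d1, d2

x, target, w1, w2 = 1.5, 1.0, 0.4, -0.7  # illustrative values
analytic = backprop(x, target, w1, w2)
numeric = numerical_gradient(x, target, w1, w2)
```

The two results should agree to several decimal places, which is a common way to sanity-check a hand-written gradient.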

00:48:25i want to also like use that as an opportunity to discuss with you some of the practical implications

00:48:31of training neural networks uh in in reality right and i showed you this previous picture of like a

00:48:39very pretty loss landscape that was very smooth but in practice optimizing neural networks is extremely

00:48:44difficult and this is actually a picture of you know neural networks are extremely high dimensional

00:48:50search spaces so we don't actually know what this picture looks like but this is a projection of the

00:48:55loss landscape of a of a deep neural network uh from a paper that came out several years ago about

00:49:01in 2017

00:49:03and you can actually visualize now you know how messy some of these loss landscapes look so applying

00:49:09these types of back propagation and optimization techniques is very very challenging and i want you to

00:49:14also recall you know before we took that dive into back propagation and the gradient term in particular we started

00:49:21to talk about uh you know this this equation that you see here right so how would we update the

00:49:25weights

00:49:26we update them by taking an opposite step in a small uh small increment in that direction that we want

00:49:34to

00:49:34right now this is the key term i want to focus on now this small step this is called the

00:49:39learning rate

00:49:40of our model this basically dictates how quickly we take those steps and how quickly we listen

00:49:46to our to our gradients as we're computing back propagation and in practice setting the learning

00:49:52rate can be very very difficult if we set the learning rate too small then we basically start from a

00:49:56point

00:49:57but we get stuck in some of these local minima that may not be the best minima that

00:50:03we could get

00:50:03to right if we set it too large then we get some unstable behavior where we basically overshoot we we

00:50:10start to

00:50:10step in the right direction but we step too far and then we explode out of the out of the

00:50:15stable place of

00:50:16learning ideally we want to set learning rates that are you know not too small so that they can skip

00:50:22some of the local minima but also not so big that they diverge so they can still converge
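
A small sketch of how the step size changes the behavior of gradient descent. The loss J(w) = w^2, the starting point, and the three learning rates are assumptions for illustration. Because this toy loss is a single bowl, it can only show slow progress and divergence, not the local-minimum behavior described above.

```python
def run_gradient_descent(eta, steps=20, w=1.0):
    """Minimize J(w) = w**2 (gradient 2*w) starting from w = 1."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

results = {eta: run_gradient_descent(eta) for eta in (0.01, 0.4, 1.1)}
# eta = 0.01: each step shrinks w by only 2%, so after 20 steps w is still about 0.67.
# eta = 0.4:  each step shrinks w to 20% of its value, so w is essentially 0.
# eta = 1.1:  each step overshoots and flips sign, growing by 20%, so |w| ends near 38.
```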

00:50:28so how do we actually set the learning rate one option and actually a very common option is to

00:50:34you know just try a bunch of learning rates see what works best how can you do better than this

00:50:38well

00:50:39the idea is uh can you design adaptive algorithms that depending on how they are uh optimizing in the

00:50:46search space can you adapt the learning rate can you change the learning rate as a function of your

00:50:50landscape itself and this basically means that your learning rate practically speaking your learning

00:50:56rate will increase or decrease as a function of your gradients and the function of your data

00:51:01uh how fast you're learning right how how uh how steep the uh landscape is how how all of these

00:51:11different things can basically dictate all of these adaptive properties of a learning rate and in fact

00:51:16these have been very widely studied and many different types of adaptive learning rate schedulers have been

00:51:22created here you can see some examples adam so all of these start with like a lot a lot of

00:51:27them start with

00:51:28this ada for adaptive right these are different variations of these adaptive properties adam in

00:51:35particular is one extremely widely used type of optimization procedure that you'll be using

00:51:43throughout many of your labs but i encourage you to really try out and experiment with all of these

00:51:49different types of learning rate schedulers to see what works best in many times there will be different

00:51:55types of learning rate schedulers that work for different types of problems so you should definitely

00:51:59try out the different pieces and trying them out is oftentimes as easy as just a single line

00:52:06change right a change to your learning loop will just implement different schedulers so sgd stochastic gradient descent

00:52:15is just going to be that that base gradient descent algorithm that we had seen before
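
The lecture does not spell out how the adaptive methods work internally. As one example, this is a sketch of the standard Adam update rule on a one-weight toy problem: it keeps running averages of the gradient and of the squared gradient and scales each step by them. The toy loss, the step size `eta`, and the number of steps are assumptions; the values for `beta1`, `beta2`, and `eps` are the commonly used defaults.

```python
import math

def grad(w):
    """Gradient of the toy loss J(w) = (w - 3)**2, which is lowest at w = 3."""
    return 2.0 * (w - 3.0)

def adam(w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Correct the bias from starting both averages at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The effective step size adapts to the recent gradient magnitudes.
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_adam = adam(0.0)  # ends up close to 3
```

In a framework, switching between plain SGD and Adam is typically a change to the one line that constructs the optimizer.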

00:52:20and i actually want to dig into that a little bit more because what you saw or what i presented

00:52:25was

00:52:26actually the gradient descent algorithm not the stochastic gradient descent algorithm so i want to tell

00:52:31you a little bit about you know what's the difference between those two pieces or those two types of

00:52:36algorithms to understand that we have to first revisit one more time the gradient descent algorithm

00:52:42so the gradient here this is that that piece that we computed with back propagation this is very

00:52:49computationally expensive because if you look at it it's computed as a summation or an average i should say over all

00:52:56of my data points in my data set so i compute the gradient for not just one data point but

00:53:03all of my data

00:53:03points in my data set that's why it's very expensive now in most real life problems it is not really

00:53:10feasible

00:53:10to compute your gradient over your entire data set on every single iteration of this step because remember

00:53:16we don't compute the gradient just once we compute it at every point along this optimization procedure and

00:53:22you're optimizing your your network for millions or even more steps and you don't want to be looping

00:53:29through your entire data set on every single one of those steps so let's define a new type of gradient

00:53:34descent now we'll call it stochastic gradient descent like you saw before instead of computing the gradient

00:53:40over my entire data set i'm going to compute a very noisy gradient it's going to be a gradient

00:53:45computed just over one data point in my data set so i'm going to randomly pick a data point

00:53:50and i'm going to compute the gradient with respect to that one data point not my entire data set this

00:53:55is going to be way noisier obviously because that one data point is not going to be representative

00:53:59of my entire data set but it will give me an answer way quicker so i can get through more

00:54:04steps

00:54:06now there's also a uh you know there's a natural trade-off here right we want to go fast but

00:54:12we also

00:54:13don't want to be too noisy there obviously is a middle ground here right instead of computing

00:54:18the noisy gradient on one example we can do what's called mini batch gradient descent right

00:54:23mini batch gradient descent is where you set a batch size and then on every iteration you compute your

00:54:29gradient with respect to not just one data point but let's say k data points where k is pretty

00:54:34small think of something like 32 or 128 something on that scale you look at your gradient with respect

00:54:40to those let's say 32 data points and then you average that gradient it helps you get a bit more

00:54:47reliability and robustness in your measure but then you also get the speed right you're not going over

00:54:53your entire data set 32 is usually way way smaller than your entire data set
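
The three variants differ only in how many data points feed each gradient estimate. The sketch below fits a line y = w*x + b with mini-batch gradient descent in plain Python. The synthetic noise-free data set (generated from w = 2, b = 1), the squared-error loss, and the hyperparameters are assumptions for illustration.

```python
import random

def make_data(n=100):
    """Synthetic data on the line y = 2x + 1, with x evenly spaced in [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

def batch_gradient(batch, w, b):
    """Gradient of the mean squared error, averaged over the points in the batch."""
    n = len(batch)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return dw, db

def minibatch_sgd(data, batch_size=32, eta=0.1, steps=500, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # The randomness comes only from which points are selected.
        batch = rng.sample(data, batch_size)
        dw, db = batch_gradient(batch, w, b)
        w -= eta * dw
        b -= eta * db
    return w, b

data = make_data()
w, b = minibatch_sgd(data)  # close to w = 2, b = 1
```

Setting `batch_size=1` gives stochastic gradient descent on a single point, and `batch_size=len(data)` gives full gradient descent over the entire data set.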

00:54:59okay so now what does this mean this means that we now have this increase in gradient accuracy

00:55:05compared to stochastic gradient descent so we can we can converge much more smoothly we're not super

00:55:11noisy going after one data point one at a time but it also means that we can be much more

00:55:17uh

00:55:18quick compared to full gradient descent where we go over the entire data set as a whole

00:55:24this means that you know because we're more stable on the one side we can also increase our learning rate

00:55:30these two things are extremely connected right the relationship between your gradients and your learning

00:55:35rates should be one that you have a very good intuition about because your gradients are now more stable

00:55:40you're averaging over a mini batch not just a single sample you can now start to uh take bigger steps

00:55:48right you can trust the gradient a bit more over over the course of optimization it also allows you to

00:55:54really parallelize training because if you wanted to compute your gradient over 32 data points you can

00:55:59parallelize that across 32 processes on your gpu right you compute them in parallel as opposed to one at

00:56:06a

00:56:06time this allows you to really start to utilize gpu speedups even further now the last topic i'll touch

00:56:14on before we uh we take a short break for lecture two is going to be this topic of overfitting

00:56:20and

00:56:21regularization of neural networks and this is a huge problem not just in deep learning but we really want

00:56:26to cover it because it's one that you're going to get exposure to in today's lab especially and

00:56:31basically it's one of the most fundamental topics of all of machine learning as a whole ideally in

00:56:38machine learning we want to build models that don't just work well on a training set right we do train

00:56:45our models on training sets but we don't want them to work well only on our training set actually what

00:56:50we

00:56:50really want is we we actually don't really oftentimes care about how well it works in practice on our

00:56:55training set at all we use that as a proxy because what we really care about is how well the

00:56:59model works

00:57:00on brand new data when we deploy it into the wild and there it's not our training data at all

00:57:05it's

00:57:05brand new test data and the relationship between these two things is extremely important we use the

00:57:11training data as a proxy but ultimately we don't really really care about it all that much another

00:57:16way to say this is that when we build models we want to learn representations from our training data

00:57:22but we still want them to generalize to unseen test data as well now take this picture for example

00:57:30assume you want to build a line that describes the relationship between the x and the y points on this

00:57:35picture you know on the left hand side you can see that you have a very simple model a linear

00:57:41model

00:57:42it can describe the training points and it probably will also describe the the test points to some decent

00:57:48faithfulness but it's not fully capturing the richness and the complexity of our data set

00:57:52both in the training set and the test set so we're not utilizing the full expressive capacity of the

00:57:57model on the on the left hand side move over all the way to the right hand side you can

00:58:01actually see

00:58:01that we're starting to memorize data points in the training set so much so that we're hurting our

00:58:06performance for brand new test data because we're weighting too much on what we've seen during

00:58:11training basically what you always want is to end up in the middle you want to leverage your

00:58:16training points but not rely on them too much or memorize them now yes

00:58:25An example of the problem of overfitting? Oh, sorry, say that again? A real example of the problem we face with

00:58:32overfitting? Yes, of course. So a real-life example of overfitting would be, let's say, if you have a

00:58:38very small data set but a very large network. You'll learn a model that just memorizes

00:58:46all of the data in your data set. And it's not like it's doing something bad,

00:58:52because

00:58:53it has the power to memorize everything in the training set. Remember always that models don't

00:58:58see the test set; it's unseen data. So all they can see is your training set, what you give to

00:59:03them. So if you give them a very small training set and a very big model, the model will do what

00:59:08it's supposed to do and learn exactly the training set, to its full capacity, right? But then when you show it new

00:59:13test data, it's not going to be very faithful, because that data is not going to be

00:59:18from perfectly the same distribution as the training data. Yeah? Yep.
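
The small-data, big-model situation described in this answer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data-generating rule (y = 2x plus noise), the set sizes, and the helper names are all made up, and a nearest-neighbour lookup stands in for "a very large network" because it memorizes its training set exactly.

```python
import random

def make_data(n, rng):
    """Sample n points from y = 2x + noise (a made-up ground truth)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]

def fit_line(data):
    """Least-squares line y = a*x + b: the simple model."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_memorizer(data):
    """Nearest-neighbour lookup: reproduces every training label exactly."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(8, rng)     # very small training set
test = make_data(500, rng)    # unseen data from the same process

line = fit_line(train)
memo = fit_memorizer(train)
print("line      train/test MSE:", mse(line, train), mse(line, test))
print("memorizer train/test MSE:", mse(memo, train), mse(memo, test))
```

The memorizer's training error is zero by construction, while its error on the unseen points is not, which is the gap the answer is pointing at. How the two models compare on the test set depends on the random draw.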

00:59:25yeah

00:59:28where is this

00:59:29where you identify the points that you choose

00:59:34um

00:59:40maybe i'm not

00:59:58I see, I see.

01:00:00Yeah, so the stochasticity is coming purely

01:00:02from the selection operation.

01:00:03So maybe it's a confusion.

01:00:05So why do we call it stochastic gradient descent?

01:00:09It's because of the selection process.

01:00:11We don't do this over the entire data set,

01:00:13but we stochastically select a subset of data,

01:00:16and that selection is stochastic.

01:00:20Yeah, make sense?

01:00:27No, no, no.

01:00:28So you take the stochastic selection,

01:00:31and then with that stochastic selection of data,

01:00:35the gradient is, I mean, it can be unbounded, right?

01:00:38So you grab or you compute the gradient

01:00:40with respect to those data points,

01:00:42whatever they may be,

01:00:43but your stochasticity is coming from the selection part,

01:00:46not from the gradient computation.
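
To make the distinction concrete, here is a minimal sketch of mini-batch stochastic gradient descent in plain Python. The toy data (points on y = 3x), the one-parameter model, and the learning rate, batch size, and step count are assumptions chosen for illustration. The only random step is picking the batch; once the batch is fixed, the gradient is computed deterministically.

```python
import random

def gradient(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x.
    Deterministic: the same batch and w always give the same value."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(data, w=0.0, lr=0.1, steps=200, batch_size=4, seed=0):
    """Mini-batch SGD: randomness enters only through batch selection."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # the only stochastic step
        w -= lr * gradient(w, batch)           # deterministic given the batch
    return w

# Toy data on the line y = 3x (hypothetical, noise-free for clarity).
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 21)]
print(sgd(data))
```

Different seeds pick different batches along the way, but on this toy problem every run ends up close to the slope of 3.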

01:00:49Yes?

01:01:10So basically, the question is about,

01:01:12is there a way,

01:01:13is there a more adaptive way almost of doing selection

01:01:15as opposed to being truly stochastic?

01:01:17And the answer is yes, definitely.

01:01:19So truly stochastic seeing of data

01:01:22is actually not very realistic either, right?

01:01:25Even though this is the way that is the convention, right?

01:01:27We, as humans, do not operate like this, right?

01:01:30We don't just randomly see data.

01:01:32We see data sequentially over time,

01:01:34and we see data with meaning and with purpose.

01:01:37Actually, in tomorrow's lecture,

01:01:38you'll see an example of how we do this type

01:01:40of adaptive selection process

01:01:42and the benefits of this as well.

01:01:44Great question.

01:01:46Okay, so I'll just very briefly wrap up with regularization.

01:01:50So regularization is just a technique

01:01:52that allows you to discourage

01:01:54these complex memorization protocols.

01:01:57So if you have a very small data set, big model,

01:01:59you want to discourage the model

01:02:01from just memorizing that data set.

01:02:03So how can you discourage the model

01:02:05from learning those types of things?

01:02:08And, you know, as we've seen,

01:02:10this is really critical

01:02:11for the overall performance of the model

01:02:13because we don't care about the training results.

01:02:16We care about the test results, ultimately.

01:02:18The most popular regularization technique

01:02:20is actually a very simple idea.

01:02:22You'll use this in almost all of your labs

01:02:24as part of this course.

01:02:25It's the idea of dropout.

01:02:27So what is dropout?

01:02:28Let's revisit this picture of a deep neural network.

01:02:31In dropout, all we do is that during training,

01:02:34we're going to randomly set some activations

01:02:36of our hidden neurons to zero with some probability.

01:02:40So let's say we set dropout to 50%.

01:02:43What we're going to do is say 50% of our neurons,

01:02:46we're going to drop out the activations

01:02:48or set their activations to zero,

01:02:50which forces the network

01:02:52to not rely so much

01:02:54on the outputs of any one neuron, right?

01:02:58The next layer,

01:03:00after a neuron gets dropped,

01:03:02cannot rely,

01:03:03cannot memorize so much

01:03:04about the previous inputs

01:03:06because there is some more stochasticity

01:03:08being injected

01:03:09into this forward pass of the model,

01:03:11not just in the data set curation

01:03:12or this data set selection,

01:03:14but also in just the pure forward pass.

01:03:16Even if I pick the same data twice

01:03:18and I put it through the model twice,

01:03:20the exact same data,

01:03:21because of dropout,

01:03:23you also have another level of stochasticity

01:03:25that means the model can't even remember

01:03:28the same exact data twice, right?

01:03:30This is an extremely powerful idea

01:03:32because basically all it's doing

01:03:33is it's lowering the capacity of the model.

01:03:36It's lowering the ability

01:03:38or it's discouraging the ability

01:03:40for the model to learn

01:03:42a singular pathway through the model.

01:03:44It's forcing the model

01:03:45to learn these multiple pathways

01:03:46to make a single decision.

01:03:49And basically on every single iteration,

01:03:51we just repeat this process.

01:03:53Every time it sees a new piece of data

01:03:54or every time we do a forward pass,

01:03:57it always creates a random pathway

01:04:00for this data to pass through the model.
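
A minimal sketch of the dropout operation described here, in plain Python. The function name and the example activations are made up. The rescaling of surviving activations by 1/(1 - p), often called inverted dropout, is a common convention that the lecture does not mention, so treat it as an assumption.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training.
    Survivors are scaled by 1/(1-p) (inverted dropout, an assumed
    convention) so the expected activation stays the same."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.2, 0.8, 2.0, 0.3, 1.5]

first = dropout(hidden, p=0.5, rng=rng)
second = dropout(hidden, p=0.5, rng=rng)   # same input, new random mask
print(first)
print(second)
print(dropout(hidden, p=0.5, rng=rng, training=False))  # unchanged at test time
```

Passing the same activations through twice draws a fresh mask each time, so the two training-mode outputs will usually differ. With `training=False` the activations pass through untouched.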

01:04:03Another final technique that I'll show you

01:04:05is about this notion of early stopping.

01:04:08Early stopping basically just means

01:04:09that we monitor the deviation

01:04:12between our training loss and our test loss.

01:04:16So we can have a test,

01:04:18we can have a proxy of a test loss

01:04:20by having a held out set.

01:04:22Maybe it's not a true test loss,

01:04:23but it's again another proxy

01:04:25that we do not train on.

01:04:26And what we can do is

01:04:28we can basically monitor

01:04:29how well the model is doing

01:04:31on both the training set

01:04:32and our held out,

01:04:34let's call it a validation set.

01:04:36In the beginning,

01:04:38both of these lines as we train,

01:04:40they both start to go down,

01:04:42which is excellent.

01:04:42It makes sense, right?

01:04:43This is because the model is learning, right?

01:04:46It's getting stronger

01:04:46over the course of training.

01:04:48And eventually what you'll see

01:04:50is that the model starts to plateau its loss.

01:04:53And on the test set,

01:04:55the loss actually starts to increase.

01:04:57So the training accuracy should,

01:04:59if the model has enough capacity,

01:05:01the training accuracy should always,

01:05:02excuse me,

01:05:03the training loss should always go down.

01:05:05It should always be getting better

01:05:06and better on the training set.

01:05:08But at some point,

01:05:09you will see that the model

01:05:11starts to memorize data.

01:05:13It starts to memorize data

01:05:14in the training set,

01:05:15which causes the test loss

01:05:17to go up a little bit.

01:05:19Now, this pattern continues

01:05:20for the rest of training.

01:05:22And here's the point

01:05:23that you should really focus on, right?

01:05:24This is the point that matters:

01:05:26if you plotted this curve,

01:05:29you would save your model

01:05:30at each of these stages,

01:05:31but you would only take one checkpoint.

01:05:33You would take the model

01:05:34that happens at this point.

01:05:35After this point,

01:05:36the training loss

01:05:38gets even better, so

01:05:41if you look at your training set,

01:05:43it actually looks like

01:05:43you have a better model.

01:05:44But on the test set,

01:05:46you can see that it's actually

01:05:47started to memorize

01:05:48pieces of the training set.

01:05:49So you do not take

01:05:51the models on the far right.

01:05:52You actually take these models

01:05:53in the middle.
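
The checkpoint choice described above can be sketched as follows. The loss values are invented to mimic the shape of the curves on the slide, and the `patience` rule (stop after a fixed number of evaluations without improvement) is a common convention assumed here, not something stated in the lecture.

```python
def best_checkpoint(val_losses):
    """Index of the saved checkpoint with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

def early_stop_index(val_losses, patience=2):
    """Index of the best checkpoint when training is halted after
    `patience` consecutive evaluations without improvement."""
    best, best_i, bad = float("inf"), -1, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, bad = loss, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_i

# Made-up loss curves: training keeps falling, validation turns upward.
train_losses = [1.00, 0.70, 0.50, 0.38, 0.30, 0.24, 0.20, 0.17]
val_losses = [1.05, 0.78, 0.62, 0.55, 0.53, 0.56, 0.61, 0.68]

print(best_checkpoint(val_losses))
print(early_stop_index(val_losses))
```

Both functions pick the checkpoint in the middle of the run, even though the last checkpoint has the lowest training loss.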

01:05:54Yes?

01:06:00Not every iteration,

01:06:02because maybe it adds

01:06:03unnecessary compute.

01:06:04But what people typically do is,

01:06:05you know, let's say,

01:06:06once every so many iterations,

01:06:08you will do a testing run.

01:06:10And again, you don't need

01:06:11to do a testing run

01:06:12over your entire test set.

01:06:13You could do it stochastically

01:06:14as well in a batch, right?

01:06:16So let's say you could do,

01:06:18let's say,

01:06:18every thousand iterations,

01:06:20you do a batch of, let's say,

01:06:22only 100 data points

01:06:23in your test set,

01:06:24just to get an approximation.
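
As a rough sketch of that schedule: evaluate only every so many iterations, and only on a random batch of held-out points. The held-out data, the stand-in model, and the helper names below are hypothetical. The numbers (every 1,000 iterations, 100 points) follow the example in the answer.

```python
import random

def approx_val_loss(model, val_data, batch_size, rng):
    """Estimate validation loss from a random batch instead of the full set."""
    batch = rng.sample(val_data, min(batch_size, len(val_data)))
    return sum((model(x) - y) ** 2 for x, y in batch) / len(batch)

def eval_steps(total_steps, every):
    """Training iterations at which an evaluation run happens."""
    return [step for step in range(1, total_steps + 1) if step % every == 0]

rng = random.Random(0)
# Made-up held-out set on the line y = 2x.
val_data = [(x / 100.0, 2.0 * x / 100.0) for x in range(1000)]
model = lambda x: 2.0 * x + 0.1   # stand-in for the current network

for step in eval_steps(total_steps=3000, every=1000):
    print(step, approx_val_loss(model, val_data, batch_size=100, rng=rng))
```

In a real training loop the evaluation would sit inside the loop and use the current weights; here the model is frozen so the sketch stays self-contained.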

01:06:35No, so the dropped nodes

01:06:36will not have gradients

01:06:37because we don't have

01:06:38information of what's

01:06:40happening with them.

01:06:40But for all of the other nodes,

01:06:42we'll get an update.

01:06:44Yeah, exactly.

01:06:45Yes?

01:06:51It should be separate, yes.

01:06:53So a key assumption

01:06:54is that, ideally,

01:06:55you take your training data

01:06:56and what people can do

01:06:57is basically cut your training data

01:06:59in a ratio, right?

01:07:01So let's say you take 70%

01:07:02of your training data

01:07:03and you actually use it

01:07:04for training,

01:07:04and you take the other 30%

01:07:06of your training data

01:07:07and use it for testing

01:07:08and validation, right?
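
A minimal sketch of that 70/30 cut, assuming the data fits in a Python list. The function name, the fixed seed, and the placeholder examples are illustrative only.

```python
import random

def train_val_split(data, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into disjoint train/validation parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))   # placeholder for real (input, label) pairs
train_set, val_set = train_val_split(examples)
print(len(train_set), len(val_set))
```

Shuffling before cutting keeps the two parts from differing just because of how the original data happened to be ordered, and the held-out part is never used for gradient updates.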

01:07:11Okay, last question.

01:07:12Is there an ideal difference

01:07:14in loss between the test

01:07:15and the training data set?

01:07:17Great question.

01:07:20I mean, there's no ideal, right?

01:07:23Ideally, actually,

01:07:24there would be no difference, right?

01:07:26In practice, though,

01:07:28so there are situations, actually,

01:07:30where there is

01:07:30very little difference.

01:07:31Let me give an example:

01:07:32assume your training set

01:07:34is so massive

01:07:36that it's impossible

01:07:37for your model to learn it

01:07:38to its full capacity.

01:07:40It's impossible

01:07:40for the model to memorize.

01:07:41Then, actually,

01:07:42you will see, basically,

01:07:43training and testing

01:07:45are very close to each other.

01:07:46A good example of this

01:07:46is language modeling.

01:07:48Even massive language models,

01:07:50they still have trouble

01:07:51memorizing the entire data set

01:07:53just because language

01:07:54is such a massive data set, right?

01:07:56So even there,

01:07:58basically, you'll see

01:07:59training and testing curves

01:08:00look very, very similar,

01:08:01but then that's why

01:08:02we have to actually do

01:08:03other types of validation.

01:08:04Language models don't really have

01:08:06the classical overfitting problems

01:08:07that other types

01:08:09of deep learning models have.

01:08:11They have other problems,

01:08:12which we'll talk about.

01:08:14Yeah, okay.

01:08:17Awesome.

01:08:17Okay, I'll conclude now

01:08:18just by summarizing

01:08:19the three points

01:08:20that we talked about

01:08:21in this lecture

01:08:21before we jump into

01:08:22lecture number two.

01:08:23So first, we talked about,

01:08:25you know, building neural networks,

01:08:26the architectures

01:08:27of neural networks.

01:08:28We talked about

01:08:28the base operation.

01:08:30The base architecture

01:08:30is called a perceptron,

01:08:32a single neuron.

01:08:33We learned about

01:08:34how we could stack

01:08:35those single neurons together

01:08:36to form complex

01:08:37hierarchical networks

01:08:39and how we can mathematically

01:08:40optimize those networks

01:08:42using data.

01:08:42And finally,

01:08:44we addressed a lot

01:08:44of the practical implications,

01:08:46everything from,

01:08:46you know,

01:08:47batch gradient descent

01:08:48to overfitting

01:08:50and regularization

01:08:51and optimization

01:08:51of these models.

01:08:53In the next lecture,

01:08:54we're going to hear

01:08:55from Ava

01:08:55on deep sequence modeling,

01:08:57which is the backbone

01:08:59of large language models.

01:09:01And this is a really exciting

01:09:03type of lecture,

01:09:04so hopefully everyone enjoys it.

01:09:06And I think probably

01:09:07what we'll do

01:09:08is just take a five-minute break

01:09:10just so Ava and I

01:09:11can switch laptops,

01:09:12and then we will continue

01:09:13with the lecture.

01:09:14And then after the lecture,

01:09:15we have software labs

01:09:16followed by reception

01:09:17at Link and food.

01:09:20Okay.

01:09:20Thanks, everyone.
