Why is AI in 2026 scary? While the mainstream media spreads panic, the Massachusetts Institute of Technology (MIT) just released a groundbreaking computer science lecture explaining the absolute truth behind the next wave of Artificial Intelligence.
If you have 1 hour tonight, skip Netflix and watch this full breakdown. Your future self will thank you for understanding this before the rest of the world.
In this video, we dissect the core engineering concepts from MIT's latest AI briefing. We are moving past basic generative AI and entering the era of fully autonomous Agentic AI systems and World Models. This isn't just a minor tech update—it's a complete shift in how software, coding, and global industries operate.
Whether you are a software engineer, student, creator, or tech enthusiast, this MIT AI explanation will give you the exact roadmap to stay ahead of 99% of the population.
👉 Key Topics Covered in This MIT AI Breakdown:
The 2026 AI Turning Point: Why autonomous agents are replacing simple chatbots.
MIT Computer Science Insights: The hidden architecture behind next-gen LLMs.
The Future of Work: Which skills are becoming obsolete and what to learn next.
AGI Timeline: What top researchers are saying behind closed doors.
Don't just watch the future happen from the sidelines. Understand the technology, adapt early, and future-proof your career.
If you want to stay updated with deep dives into AI engineering, tech breakthroughs, and market trends, make sure to hit LIKE and SUBSCRIBE!
#MIT #AI2026 #ArtificialIntelligence #TechTrends #AgenticAI #ComputerScience #AGI #FutureOfWork
Transcript
00:00:08Good afternoon everyone and thank you for joining us today. My name is Alexander Amini and together
00:00:15with Ava we're going to be your instructors for the course this year. This is MIT Introduction
00:00:22to Deep Learning, or 6.S191 by our official course title. Now we're super excited to welcome
00:00:29you to this class and I think probably a good place to start is always you know asking ourselves well
00:00:34what is MIT Intro to Deep Learning? This is a one-week boot camp on everything deep learning
00:00:41right, so it's both a very fun but also a very intense one week, because we're going to cover a
00:00:47ton of material in just the next five days. Now this is our eighth year teaching this class
00:00:54and the pace of the field especially in the past couple years is really remarkable and every year
00:01:00that we teach this class it's getting more and more, I should say, interesting to introduce
00:01:05this lecture in particular, and how we introduce this lecture has really started to adapt and evolve
00:01:11over the many years. Now many of you in the audience have probably even started to become almost a bit
00:01:18I would say desensitized to a lot of the progress of deep learning in the past couple years because
00:01:24of how rapidly this progress is happening, so I think it's also important
00:01:29to not forget where we came from just a few years ago. So I want to show you this
00:01:34image right here just to start this off, and what better way to show you than for you to actually see
00:01:39the progress with your own eyes. Exactly one decade ago, this was the state of the art for a
00:01:46deep learning based facial generation system. So this is not a real face; this was the state-of-the-art
00:01:52model that could generate faces, and this was the best that we could do. Fast forward just a few
00:01:59years down the pipeline and progress in image generation had already started to advance tremendously,
00:02:07and here you can see a lot more realism, photorealism, in these types of images
00:02:13being created and then you know fast forward another few years after that and and these images
00:02:18start to come to life right they start to have temporal information they start to have video and
00:02:23movement as well in those images, right? And in fact this video that you see on the
00:02:30right is a video that we created in this class some years ago, and for those of you who haven't seen
00:02:37it already, it's online and people have seen it, but in case you haven't,
00:02:41I'll play just the first 10 seconds, not the whole thing, just so you can see it as well. "Hi
00:02:45everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at
00:02:56MIT." Now I won't play the whole thing, but you get the gist of this video. This video was created
00:03:01five years ago. We made it as part of this class and we used it to actually introduce this class
00:03:08back then. Now when we did this in 2020 it got a lot of... even back then, right, like especially back
00:03:14then maybe you guys aren't that impressed by it today but back then this video was a huge deal
00:03:18right, this was a huge jump in photorealism for the capabilities of deep learning models, and the
00:03:25clip very rapidly went a bit viral, and people commented a lot about the realism, but
00:03:31actually one interesting thing that people didn't see they saw the end result but actually at that
00:03:36point what people didn't see was that for us to generate that clip which was a two minute clip you
00:03:42only saw the first 10 seconds but that clip was two minutes in total and to generate that two minute
00:03:48clip, it took around two hours of professional audio data being recorded and captured of the speaker,
00:03:55which was not us. It took around 50 hours of professional high-definition video data
00:04:01to build the face model, and it required around 15,000 US dollars of compute to generate that
00:04:11two minute video and all of that was just going to generate you know a predefined script right
00:04:16something static you couldn't talk to it you couldn't interact with it it was just a predefined
00:04:20script static not flexible at all but uh you know still a tremendous amount of resources both time
00:04:27data energy and financial resources as well and i want to start this class by basically asking this
00:04:35question of you know what would this look like today right that video was created in 2020 but today
00:04:41we're in 2025 and what would cloning you know and generative ai look like in today's world right
00:04:50especially understanding all of those limitations that we had back then in 2020 and i thought what better
00:04:56way to to to tell you about this than to just show you live instead of you know all of
00:05:01that compute
00:05:01and resources that we spent in 2020 let's do it live and unedited right in front of everybody today so
00:05:08in fact i have a bit of a secret so i'm going to share my other screen and move this
00:05:13over
00:05:15so this is a a terminal window that you can actually see something's been recording audio in the background as
00:05:22i've been talking i've been having this program that we made you know listening to my audio recording
00:05:27my voice and what i'm going to do is just uh you know pause the recording now and we're going
00:05:34to see
00:05:34how quickly it can actually immediately start to uh you know just take my voice and do an instant clone
00:05:40as opposed to all of that time intensive resources that we saw before so i'll pause it right now
00:05:46you'll see it instantly clone
00:05:50and and yeah now now my voice is cloned and what we're going to do with this is actually try
00:05:55and
00:05:55not have a predefined script like we saw before i'm going to actually have a live and unedited
00:06:01conversation with this voice and see how dynamic and flexible these types of generative ai models have
00:06:08become in today's world so let's try this out hello can you hear me hi
00:06:15yes loud and clear how are you doing wait a second actually you sound a lot like me don't you
00:06:21that's interesting you're right we're even more similar than i thought what are we up to
00:06:28well actually i was just kicking off this year's mit deep learning class and you know i'm standing
00:06:34right in front of the audience right now uh mit introduction to deep learning is such a great
00:06:40class what's the class like so far actually we just started uh you know what i had an idea so
00:06:47i was just
00:06:47about to introduce what is deep learning how about you take that part over for me deep learning is a
00:06:54type of machine learning that uses artificial neural networks to learn from data these networks are
00:07:00inspired by the structure of the human brain and are able to learn complex patterns from large amounts of
00:07:06data deep learning is used in a wide range of applications including image recognition natural
00:07:12language processing and self-driving cars it's pretty good actually you know i think the amazing
00:07:18thing i was telling everybody in the audience about was that the the state of deep learning today is so
00:07:23flexible it's so dynamic what do you think you're right deep learning is incredibly dynamic it's
00:07:29constantly evolving with new architectures algorithms and applications being developed all the time
00:07:34actually wait let me interrupt you right there just to show how dynamic you are i'm going to interrupt
00:07:38you and uh let's try this enough about deep learning how about you tell me a story about turtles and
00:07:43make
00:07:43it wild oh turtles you want a wild turtle story okay picture this there's this turtle named shelly right
00:07:52he's not your average turtle oh no shelly's got dreams he dreams of flying he spends his days practicing
00:07:58flapping his little flippers trying to get off the ground one day he meets this crazy inventor
00:08:03a mad scientist turtle named dr sheldon who's built a rocket powered shell shelly jumps in and boom
00:08:12they blast off into space they zoom past planets meet aliens who love to eat okay i'll interrupt you
00:08:18i'll interrupt you one more time thank you this was great it's turtle tastic oh let's end it here okay
00:08:26thank you so much this was a this was a cool way to introduce the class i'm glad i could
00:08:32help
00:08:33Okay, thanks, bye bye bye. Okay, awesome. So that's just a fun way to show how far we've
00:08:49come in this field, from generating very static content just a few years ago to now generating live,
00:08:56unedited, extremely dynamic content as well. So you actually just heard a very brief introduction on
00:09:03what deep learning is. Behind that demo, and all of the progress that all of you have been seeing
00:09:09over the past many years, what you'll see in this class over the next one week is
00:09:14the fundamental techniques that drive all of that progress. So let's just start by maybe laying some
00:09:20foundation laying some groundwork on exactly you know what this field is all about and to do that
00:09:25i think first i have to introduce to you what is intelligence first of all right so to me the
00:09:32word
00:09:32intelligence means the ability to process information in order to inform some future decision right some
00:09:39future action this is what intelligence means so all of us exhibit this capability every single day
00:09:45you know some more than others but artificial intelligence is just the practice of building
00:09:51algorithms artificial algorithms to do exactly that same process right use information use data to
00:09:58inform future decisions now machine learning what is machine learning machine learning is a subset of
00:10:04artificial intelligence that focuses on not explicitly programming the computer how to use that data how
00:10:12to process that information to inform that decision but just try to learn some patterns within the data to
00:10:18make those decisions and finally deep learning is just a subset of machine learning which focuses on
00:10:25doing that exact process with neural networks deep neural networks and we'll learn exactly what what deep neural
00:10:31networks are throughout this class but really at a high level right this entire course is about this core idea
00:10:37fundamentally right this is what we will teach and you will all get a very strong handle on throughout
00:10:43this entire week is you will learn how to teach computers how to learn how to do tasks directly from
00:10:51observation directly
00:10:52from data and we'll provide you both a solid foundation you know in the lectures but also through practical
00:10:59understanding and software labs as well so you can get very hands-on and that's probably a good segue to
00:11:04tell
00:11:05you a little bit about the entire course just high level so this is going to be a combination like
00:11:10i said
00:11:10between the technical lectures and the software labs we'll have several new updates this year in particular
00:11:15as we're you know as the field is advancing so quickly we're really going to try start to uh you
00:11:20know
00:11:21you know drive home a lot of key points especially in more of the modern side of deep learning and
00:11:26then to
00:11:27that end we'll conclude with some guest lectures from industry leaders on state-of-the-art deep learning
00:11:33methods and ai methods that are being developed in industry and this will really start to advance your
00:11:39your knowledge even more in addition yes that's right also tonight we're going to have a reception
00:11:46at 4 30 and you're all invited to that reception as well uh to you know talk to to everyone
00:11:53and learn
00:11:54more about deep learning there's also food provided as well this year we also have a lot of great updates
00:12:00on the software labs uh so we'll be introducing both tensorflow and pytorch software labs and these
00:12:07are you know number one these are a great learning experience for all of you to get hands-on with
00:12:11everything that you learn in the lectures but also they're a medium for you to enter into the competition
00:12:16prizes and make yourselves eligible for a lot of cash prizes at the end of this course so how exactly
00:12:22does that work each day we'll have a dedicated lecture and we'll have a dedicated software lab that mirrors
00:12:29that lecture and the the software lab will just basically reinforce what has been taught during
00:12:34the day in the lectures uh starting today you'll have lab one where you're going to basically focus
00:12:40on building a form of a language model actually it'll be a very small language model but it's a next
00:12:45token predictor language model that learns how to generate music and predict the next token of music
00:12:50so you can generate novel folk songs and then tomorrow we'll move on to facial detection systems
00:12:57you'll get hands-on with building your own computer vision system from scratch understanding also
00:13:03some automated techniques to fix imbalanced data in those systems and then finally lab three is going
00:13:09to be a brand new lab premiering this year for the first time on large language models and you're
00:13:15actually going to, as part of that lab, fine-tune a two billion parameter large language model
00:13:22on compute that you'll control, in a mystery style, and you'll also build an AI judge
00:13:31to evaluate the quality of that language model. So all three of these labs are going to be a
00:13:35lot of fun. And then finally, on the last day of the class, we'll have a final project pitch
00:13:41competition. Groups are, I think, three to five people, and each group presents for
00:13:49three to five minutes, kind of in a Shark Tank style pitch competition, and then you'll be eligible
00:13:55for even more prizes as part of that as well okay uh i won't go through this slide there's many
00:14:03great
00:14:03resources available as part of this class uh this slide as well as the entire lectures are all posted
00:14:09online you can already check the website they should be online already and if you ever need any help
00:14:14please post to Piazza. If you have any questions, we have a team of incredible TAs and instructors this
00:14:20year that you can reach out to at any time for any questions or issues myself and ava will be
00:14:26your two
00:14:26main lecturers for most of the course but then you'll also be hearing from a lot of guest lectures uh
00:14:31throughout the rest of the class uh here are some of the names this course in general would not have
00:14:37been
00:14:37possible uh each of these years without all of our amazing sponsors so i do want to give a huge
00:14:42thank you for all of their support over the years. Okay, so now that we've gone through all of that, I
00:14:49want to start with a lot of the fun... yeah, go ahead. Sure. Yeah, that's right. Yeah, so this course
00:14:58has been taught for eight years. We've taught it to around 13 million people now; that's the global
00:15:05audience. At MIT alone, the audience is probably around 3,000
00:15:12at this point, and every year around 100,000 people take this class online, so you're in great
00:15:18company and a lot of really amazing people have taken this class and we're really excited for all of
00:15:23you to be here today. So now, as we dive into the technical part of this class, I want
00:15:29to start by asking this fundamental question: why deep learning, and why now? Hopefully
00:15:34this is a question all of you have asked before you came here today. Understanding
00:15:39exactly what gets at the basis of deep learning is really important so that we can understand how
00:15:42we can move forward and build even better algorithms that drive this field so traditional
00:15:48machine learning maybe if we start there for a second traditional machine learning typically defines
00:15:53what are called sets of features i'll tell you more about that word in a second but usually what these
00:15:59features are these are basically rules of how to do a task step by step right and the problem is
00:16:06that
00:16:06if we as humans define those features we're not usually very good at building very robust features
00:16:14So for example, let's say I told you to build an AI model that could
00:16:21detect faces. How would you do this? What features would you build in an image to detect faces? Well, what you
00:16:28could do is start by first detecting lines in the image, just edges, right, very
00:16:35simple lines. Then you could start to compose those lines together to detect things like curves,
00:16:43basically curved lines, not just straight lines,
00:16:52and then you can combine those together to start to form more composite objects right like eyes
00:16:56and noses and ears and then from there you can actually start to build up structures of faces
00:17:01why would you do it like this well it's actually naturally hopefully this is the way that you would
00:17:05also think of doing it because it's very hard to immediately just one shot detect a face you actually
00:17:10don't process faces like this first of all you actually start by processing much more coarse features
00:17:15the low level features first or excuse me the high level features first then you compose these together to really
00:17:20form your own intuition about a face right now the key idea of deep learning is no different than
00:17:27than this process the key idea is to learn these features instead of me telling you or you telling me
00:17:33exactly what those features are the key idea of deep learning is to say after observing a lot of faces
00:17:39can i learn that i should first detect things in this hierarchical fashion step by step you know first detect
00:17:45the lines then detect the curves and detect the you know the composites like eyes noses and ears and
00:17:51then build up to facial structure like this and it turns out this is exactly what deep learning is able
00:17:56to do and we'll see how how this is being done underneath the hood throughout this lecture it's really
00:18:02important to understand though that you know even though we are seeing so many of these amazing
00:18:06things of deep learning over the past few years everything that you'll learn especially in today's lecture
00:18:11this is an intro lecture so almost everything that you'll see today has been invented or developed
00:18:18decades ago, right? These are not new things that we'll be showing in today's lecture.
00:18:24Tomorrow and the day after, and after that, you'll start to see a lot more of the recent advances,
00:18:29but why are we seeing this all today? The reason is that we see an explosion of
00:18:36these techniques, even the techniques that are decades old, because of three key components.
00:18:41Number one is data. Data is becoming more and more plentiful throughout the world, and this is
00:18:47really driving deep learning progress. Compute is number two. Compute is becoming more and
00:18:53more powerful and more and more commoditized. GPU architectures especially are driving the progress
00:18:58in deep learning, and GPUs only recently started to be commoditized. And finally, open
00:19:06source toolboxes like you see on the right hand side, TensorFlow, PyTorch, Keras and so on, make
00:19:11it very streamlined and very easy for all of you, just in a one-week course, to get hands-on with
00:19:16these architectures and start to build directly. So let's start by understanding the
00:19:23fundamental building block of every neural network, and that's just a single neuron, or a perceptron.
00:19:29So what is a perceptron? The idea of a perceptron, or a single neuron, is really simple. Let's start by
00:19:37defining a perceptron purely by its forward propagation of information: given some
00:19:43inputs, how does the perceptron compute an output? Let's start by defining a set of inputs x1 to xm.
00:19:52Each of these inputs will be multiplied by a corresponding weight, w1 through wm.
00:20:00After we do this multiplication, we're going to add up all of
00:20:04those numbers together. We'll take the single number that comes out and then we'll pass it through what's
00:20:09called a non-linear activation function. This is just a non-linear one-dimensional function that you
00:20:14pass this single number through to get the output. Here it's denoted as g.
00:20:22Okay, I left out one minor detail, so I'll correct it right now. The one thing that we have to
00:20:28remember is that after we multiply all of our weights by our inputs, we're also going to add this
00:20:34one number called a bias term. The bias term is effectively, if you look at the equation, a way
00:20:40for us to shift left and right along our activation function g, so this is just a shifting scalar
00:20:47designed within the equation here. Now on the right hand side of
00:20:54this slide you can see the diagram on the left written mathematically
00:20:59as a single equation. I'm going to now rewrite this, for the sake of cleanliness,
00:21:08using linear algebra, in terms of vectors and dot products. So let's do that now. Instead of
00:21:14x1 through xm, I'm going to write just a vector, capital X. Capital X is going to be a collection
00:21:21of all of my inputs, and capital W will be a collection, or a vector, of all of my weights.
00:21:28The output y is then simply obtained by taking a dot product between X and W,
00:21:35adding our bias, and passing the result through g, a non-linearity.
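The forward pass just described (dot product, add the bias, apply g) fits in a few lines. What follows is a minimal sketch in plain Python, not the TensorFlow or PyTorch snippets shown on the slide; the names `perceptron` and `sigmoid` and all of the example numbers are illustrative assumptions rather than values from the lecture.

```python
import math

def sigmoid(z):
    """One common choice for the non-linearity g (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def perceptron(x, w, b, g=sigmoid):
    """Forward pass of a single neuron: y = g(w . x + b)."""
    if len(x) != len(w):
        raise ValueError("inputs and weights must have the same length")
    z = sum(xi * wi for xi, wi in zip(x, w)) + b  # dot product plus bias
    return g(z)

# Hypothetical inputs, weights and bias, chosen only for illustration.
y = perceptron(x=[0.5, -1.0, 2.0], w=[1.0, 0.5, -0.25], b=0.1)
print(y)
```

Changing `b` while holding `x` and `w` fixed moves `z` left or right along g, which is the shifting role of the bias mentioned above.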
00:21:42Now you might be wondering, I've mentioned this non-linearity a few times; what is this thing?
00:21:46Well, I said it's a non-linear function, but what exactly is it? One common example here would be
00:21:52what's called the sigmoid function. The sigmoid function, which you can see right here, can
00:21:58act over any real number on the x-axis, but it outputs only between zero and one. So usually the
00:22:05sigmoid function is really good for things like probabilities, if you wanted to convert the output of your
00:22:11perceptron, your neuron, to a probability. But in fact there are many types of non-linear
00:22:16functions; it's just one function that's commonly used in neural networks. Throughout this presentation
00:22:21you'll see a few examples of different non-linear functions. I'll also point out that on the
00:22:27bottom of this slide you can see some code snippets, both in TensorFlow and PyTorch, that will help
00:22:33align what you're seeing in the math with code that will be relevant for
00:22:38some of your software labs later today. The sigmoid function that you saw earlier has an output that's very
00:22:44good for probabilities. You'll also see, on the right hand side, what is called the rectified
00:22:50linear unit. Its outputs are never negative. It is piecewise linear: it is linear
00:22:57before zero and linear after zero, but it has a single non-linearity at x equals zero.
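To make the two activation functions concrete, here is a small sketch in plain Python (an assumption of this example; the slide's own snippets use TensorFlow and PyTorch). It evaluates both functions at a few arbitrary points so the ranges are visible: sigmoid stays strictly between zero and one, and the rectified linear unit is zero for negative inputs and the identity otherwise.

```python
import math

def sigmoid(z):
    """Maps any real number into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

# Sample points picked arbitrarily for illustration.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  relu={relu(z):.1f}")
```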
00:23:03Now why do we need activation functions? That's a question hopefully everyone
00:23:09here asks; this seems unnecessary at first glance. The point of an activation function is
00:23:15actually quite simple: it's to introduce non-linearities into your model. Without a non-linear activation
00:23:22function you have a linear model. So why do we want non-linearities? Well, it's just because real data
00:23:28in the real world is heavily non-linear. A good example to show
00:23:36this would be: let's say I show you this picture here and ask you to build a classifier, a
00:23:44separation; draw a single line that separates the red points from the green points. Can you draw that line?
00:23:51At first glance, yes, you could draw the line. But what if I told you that it had to be a straight line?
00:23:55If I told you it's a straight line, then it's not really possible anymore to do this task well,
00:24:01so that makes the problem really hard. That's the problem with having a linear model.
00:24:06The benefit of having non-linearities is that they allow us to approximate arbitrarily complex functions
00:24:13with enough depth in our model. This is exactly what makes non-linear neural networks
00:24:19extremely powerful. Let me just help you all understand this with a simple example.
00:24:25So imagine I gave you now a trained neural network that you can see here. It has one perceptron,
00:24:32but it has two inputs, x1 and x2. It also has two weights, w1 and w2, and it also has this
00:24:39bias term on the top as well. Now how would we process this information? It's the same
00:24:47story as before: we're going to compute a dot product, add the bias, and pass it through our
00:24:53non-linearity g. Now if we plug in our data, we already know our inputs here. Our inputs are going
00:25:00to be, let's see, positive three is the first input, x1, and negative two is our second input,
00:25:08x2. We can plug these into our equation along with our weights as well, and we can see
00:25:16that we obtain this line, which is going to be a two-dimensional line that parameterizes our
00:25:24entire function space of this neuron. Since it's only in two dimensions, we can even plot this line, so we
00:25:32can see exactly what this whole space looks like, and for any new input that this model sees,
00:25:39where with respect to this line it would fall. So let's say for example that I had this new point
00:25:47here. In this x1, x2 space it's going to be at the point (-1, 2), and we can see
00:25:55graphically exactly where on this plot it falls with respect to the line. We can also plug it back into
00:26:02our equation, and we can see exactly, okay, if we plug in negative
00:26:09one as our input to x1 and positive two as our input to x2, we can plug into the equation on the
00:26:15bottom left and pass it through the non-linearity. The non-linearity here is a sigmoid function.
00:26:20It squashes everything to be between zero and one, depending on which side of the line it falls on, and we get this
00:26:27final answer here. In this case the final answer is 0.002. This is less than 0.5, and
00:26:33the value 0.5 is going to be the divider, because all of our outputs fall between
00:26:37zero and one. We can actually represent this graphically as well: if you fall directly
00:26:42on the line, your output after the non-linearity is going to be 0.5. The more to the blue side
00:26:48that you fall, the farther under 0.5 you are, and the more to the green side you fall,
00:26:54the farther above 0.5 you are. So the line basically represents the point of separation between
00:27:00these two sides of the space, and depending on which side your input falls on, this is the way for
00:27:05you to classify this point as either a positive point or a negative point.
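The arithmetic in this example can be checked with a short sketch. Two assumptions are made here and should be read as such: the values 3 and -2 quoted earlier are treated as the weights w1 and w2 of the neuron, and the bias is taken to be 1. The bias is not stated in the transcript; it is chosen because, with those weights, it reproduces the quoted output of about 0.002 for the point (-1, 2).

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Assumed parameters (see the note above): weights 3 and -2, bias 1.
W = (3.0, -2.0)
B = 1.0

def neuron(x1, x2):
    """y = sigmoid(B + w1*x1 + w2*x2)."""
    z = B + W[0] * x1 + W[1] * x2
    return sigmoid(z)

y = neuron(-1.0, 2.0)  # z = 1 - 3 - 4 = -6
print(round(y, 3))     # 0.002

# The decision boundary is the line where z = 0, i.e. 1 + 3*x1 - 2*x2 = 0.
# A point on that line, such as (1, 2), gives exactly 0.5.
print(neuron(1.0, 2.0))  # 0.5

label = "positive" if y > 0.5 else "negative"
print(label)  # negative
```

Under these assumptions the point (-1, 2) lands well on the side of the line where the output is below 0.5, so it is classified as negative.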
00:27:12Now it's also important to understand that we just did this for a single neuron with two inputs. You
00:27:18can imagine that if you had a model with many more inputs than two, it would no longer be possible to
00:27:24draw this plot, and this is something that we'll have to deal with in terms of understanding and
00:27:29building intuition. But hopefully, even at the small scale, you can build some level of intuition
00:27:34with this plot. Let's see how we can now start to tie some of this together to go beyond just one
00:27:42neuron and start to build networks, because this is where we actually build really powerful systems:
00:27:47not just from one neuron but from full networks. To do that, let's just revisit our diagram one more
00:27:52time. If there are a few things that you take away from this class, this is hopefully the slide that
00:27:57you take away. I've said it already a few times; I'll say it one more time. How do you pass
00:28:02information through a neuron? You take a dot product, you apply a bias, and you apply a non-linearity. It's
00:28:09these three steps, and these keep getting repeated over and over again. I will simplify the diagram, since
00:28:14I've now told it to you so many times; hopefully it's starting to stick. Now I'm going to remove all of the
00:28:19weights from this diagram, and I'm also going to remove the bias term, so now you can always assume
00:28:24that those two things are there. I'll just remove them from the illustration moving forward to keep
00:28:29things cleaner. Now z here is going to be the result of
00:28:36the dot product plus the bias; it's what we have before the non-linearity g. We will then pass
00:28:42z through g, and that will give us y, and you can see that represented right here.
00:28:49okay now what if we wanted a multi-output neural network not one output but two outputs how would
00:28:56we change this picture okay it actually is pretty simple we now just create a second perceptron we now
00:29:02have two neurons instead of one neuron both neurons have the exact same inputs but because their weights
00:29:08are different they will have two different outputs so they both take as input the same information
00:29:12they process it their own way with their own weights and they make two different outputs from scratch
00:29:21now these types of networks these types of layers let me call them a layer are typically called dense
00:29:27layers because everything in my inputs is connected to everything in my outputs and if you exclude the
00:29:33non-linearity this is also a linear layer right this is a linear layer because it takes all of my
00:29:39inputs x and just linearly operates on them with my weights w and adds a bias which is also a linear
00:29:46operation
00:29:47so we can now actually implement this entire operation from scratch in python so let's try it out
00:29:54so we're going to start by just defining those two weights we define self.w this is our weight
00:30:02vector and we also have self.b which is our bias scalar this is just one number but here since it's an
00:30:09entire uh n dimensional output we'll actually have n uh uh n neurons in the output as well
00:30:18when we want to do our forward pass through this through this layer how do we do this it's the same
00:30:25story as before we take a dot product which here is this matrix multiply we add the bias
00:30:30and then we apply our non-linearity this is the sigmoid here but you could change this to any
00:30:35non-linearity in pytorch you can actually see that there's an almost perfect analog on the left and the
00:30:42right side it's the same story here you create your your two weights your weight and your bias
00:30:47you apply a matrix multiply add the bias and apply your non-linearity exactly the same as before
00:30:53now luckily tensorflow and pytorch have already implemented this type of dense or linear layer
00:31:00for us so we don't need to do that that was just a good learning exercise what we just went through
00:31:05but here you can just call it right you can see the function calls on the bottom
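The slide code is not part of the transcript, so the following is only a sketch of the dense layer as described: a weight matrix, one bias per output neuron, a matrix multiply, and a sigmoid. It uses plain Python lists rather than TensorFlow or PyTorch, and the class name `DenseLayer`, the `seed` argument, and the Gaussian initialization are illustrative assumptions.

```python
import math
import random

class DenseLayer:
    """Fully connected layer: y = sigmoid(x W + b), written without any framework."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = random.Random(seed)
        # One weight per (input, output) pair and one bias per output neuron.
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(output_dim)]
                  for _ in range(input_dim)]
        self.b = [0.0] * output_dim

    def forward(self, x):
        outputs = []
        for j in range(len(self.b)):
            # Dot product of the inputs with column j of W, plus bias j.
            z = sum(x[i] * self.W[i][j] for i in range(len(x))) + self.b[j]
            outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid non-linearity
        return outputs

layer = DenseLayer(input_dim=2, output_dim=3)
y = layer.forward([-1.0, 2.0])   # three outputs, each between 0 and 1
```

Every output neuron sees the same inputs but uses its own column of weights, which is why the outputs differ.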
00:31:11now let's take a look at a single layered neural network a single hidden layer so not where
00:31:17the output is directly from a single perceptron but where we have to actually pass through two
00:31:23layers okay what does that look like so this is one where a single hidden
00:31:29layer is placed basically between our input layer and our output layer why do we call it hidden well
00:31:35it's just because we don't directly observe the data that happens in this layer right the input layer
00:31:40is data that we provide to the model the output layer is typically things that we would supervise
00:31:44over the hidden layer is one that is learned uh just from the observable data
00:31:52since we now have a transformation both from inputs to hidden and from hidden to output we now have two
00:31:59layers right and that's also going to mean that we need two weight matrices w so we'll actually
00:32:06have a w1 on the left hand side and a w2 on the right hand side now if we look at a single unit a
00:32:14single perceptron a single neuron in that hidden layer let's take z2 for example it's just a
00:32:20perceptron that we saw before it's the same story nothing here has changed its answer is computed by
00:32:27taking a dot product adding a bias and passing it through this non-linearity if we took a different
00:32:33node a different neuron let's say like z3 the one right below it it would also be computed with a dot product
00:32:39bias non-linearity it would take the same inputs as z2 but it would have different weights so the dot
00:32:46product and the bias would be different this picture again looks a bit messy so i'm going to simplify it
00:32:51even more i'm going to replace all of the arrows with a single icon here the uh the symbol
00:32:57which is just going to denote uh you know this dense layer this this linear connection layer that is
00:33:03happening between these two components and again we can see that to build a network like this in
00:33:10tensorflow or pytorch these convenience functions are really starting to uh you know help us a lot
00:33:16because we don't have to implement a lot from scratch now if we wanted to create a deep neural network
00:33:22how would we do that what is a deep neural network it is nothing more than just sequentially stacking
00:33:27more and more of these linear layers followed by non-linearities followed by more linear layers
00:33:32followed by more non-linearities over and over and over again in a hierarchical fashion so this is just
00:33:38a model where the final output is created as a hierarchical combination of going deeper and deeper
00:33:45into these linear followed by non-linear uh operations
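The stacking idea can be sketched in a few lines of plain Python. This is not the framework code shown on the slides; the layer sizes in `sizes`, the uniform initialization, and the helper names are all hypothetical choices for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b, activation):
    """One dense layer: for each output neuron, dot product + bias, then activation."""
    return [activation(sum(xi * wi for xi, wi in zip(x, row)) + bj)
            for row, bj in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Random weights stored as one list of n_in weights per output neuron."""
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def forward(x, layers):
    """Feed x through each (W, b) pair in order; each layer's output feeds the next."""
    for W, b in layers:
        x = dense(x, W, b, sigmoid)
    return x

rng = random.Random(0)
sizes = [2, 3, 3, 1]   # hypothetical: 2 inputs, two hidden layers of 3 units, 1 output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
y = forward([4.0, 5.0], layers)
```

Making the network deeper only means adding entries to `sizes`; the forward pass itself does not change.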
00:33:51yes please
00:34:04exactly yes of course a question was about maybe just for a quick real world example of why we would
00:34:09have a different layer so different layers on the depth axis but also different outputs on the you
00:34:15know up down axis right so different layers on the x-axis basically this corresponds to more depth more
00:34:21complexity in your network right so for more complex tasks you would want more depth because you're
00:34:25introducing more hierarchical non-linearities after one layer if you have a single uh you know dense
00:34:33connection followed by a non-linearity you have a limited amount of complexity that you can extract
00:34:37it's only coming from one non-linearity so it's limited to the expressive capacity of that single
00:34:42non-linearity so as you get more and more complex tasks you require deeper and deeper expressive
00:34:49functions so that's this one axis on the other axis more outputs this is just a problem definition
00:34:56so if you wanted to predict more things then you would need more outputs a good example is that if
00:35:02you wanted to do generation right let's say to generate an image you would need to generate values
00:35:07for every pixel in that image that's a lot of outputs right versus if you wanted to just predict
00:35:12let's say uh you know the weather tomorrow that's a temperature value right it's just one output just one
00:35:18number right so depending on your problem definition those two things can change
00:35:25excellent okay so now that we have uh an idea of architecturally what makes up a neural network
00:35:33i think now it's time that we can actually start to compose all of this together and actually in line
00:35:37with that that example that question that just came up let's try and go through an example of applying
00:35:42some of this theory into practice and actually understand you know uh how we can look to apply
00:35:49a neural network to solve a very real problem let's say maybe not that real but maybe real for all of you
00:35:54that you've been thinking about it so here's a question that maybe all of you have been asking
00:35:57yourselves you know will i pass this class and let's try and build a neural network that can infer or
00:36:04predict this answer for all of you so we're going to do this by building a very simple model it will
00:36:10take as input two inputs uh one output the one output is going to be will i pass this class yes or no
00:36:17so a single number the probability of passing the class and it will be two inputs defined by number one
00:36:23how many lectures you attend over the course of this one week and number two the number of hours that
00:36:30you spend on your final project okay so let's let's plot because we've taught this class for many years
00:36:35we actually have data from past students on this operation we can look at all of the green points
00:36:40are people that have passed the class all the red points are people that have not and we can also plot
00:36:45where you are or you can guess how many hours you're going to spend on this class how many hours you're
00:36:50going to spend on the the final project and what we want to do is build a neural network that will
00:36:55determine from all of this past data of all of these past students where will you fall on this
00:37:01probability chance of passing versus not passing so let's do it okay we've we've actually learned all
00:37:07of this so far in the class so let's take it step by step we have two inputs this is this new person
00:37:14right you have attended four lectures and you spend five hours on your final project
00:37:19those two numbers you can feed in as input on the left hand side to your model we also have
00:37:24a single layered neural network a very basic neural network we're just going to start with this for
00:37:28now and we're going to see that our hidden layer has three hidden units and our output will just have
00:37:34one output which is a binary output yes or no on passing the class or not and what we're going to see
00:37:41is actually that this model got the answer very wrong it predicted that you would pass the class with
00:37:47probability 0.1 or 10 percent when in reality actually you did very well you definitely passed the class
00:37:54so can anyone tell me you know why you think that this network failed so badly here
00:38:01yes exactly exactly yes so the answer was it's not trained and that's exactly right so the the model
00:38:08here hasn't seen any of this data that we showed on the previous slide right it's it's basically like
00:38:13a baby that has not seen any knowledge about the the real world it doesn't know anything about this
00:38:18problem as well it needs to first learn about this problem and this is something that we haven't talked
00:38:22about so far in order to train our model our model has to also understand when it makes bad
00:38:29predictions what does a bad prediction mean it means that it has to be able to quantify how bad a
00:38:35prediction is versus how good a prediction is this is called the loss of a neural network a neural network's
00:38:41loss it's just going to be a measure of how far apart its predictions are from the ground truth
00:38:48answers or the ground truth observations of of a piece of data the smaller your
00:38:56loss is it means the closer these two things are so your predictions are really matching the ground truth
00:39:02this results in a small loss now let's assume that our data is not just from one student but we actually
00:39:11have past data from many students right now we want to care about how the model is doing not just for this
00:39:17one student but in aggregate empirically across the entire past class now this is what we call training on
00:39:24not just a single data point but we train on an entire data set so when we train neural networks we want
00:39:30to find neural networks that minimize our loss or maximize our accuracy not just on one student but on the
00:39:38aggregate empirical data set this is called the empirical loss and it's just simply the average of my loss for
00:39:45every data point in my data set now right now we've been focusing on this problem of binary classification
00:39:52yes no answers and for those types of losses we can use what's called a softmax cross entropy loss
00:39:59we'll learn more about this later but this is measuring the difference or the distance between
00:40:04two probability distributions two binary probability distributions now let's just suppose that instead
00:40:12of predicting a binary output we want to predict a final output that is a real number
00:40:20like a continuous value so let's say like a grade a percentage grade instead of will i pass the class
00:40:25or will i not but a percentage grade of how well i'll do for doing something like that we won't be
00:40:31able to use a binary loss anymore so we'll have to actually change our loss we can change it for example
00:40:36to a mean squared error loss so we can take our two grades predicted grade and true grade subtract them
00:40:41and then square them to create a distance measure and these are roughly the two types of losses that
00:40:48you'll see both categorical discrete losses like binary losses as well as continuous losses like mse losses
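To make the two kinds of loss concrete, here is a small sketch in plain Python. It uses the binary form of cross entropy (the two-class case of the softmax cross entropy mentioned above) and the mean squared error, each averaged over a dataset to give the empirical loss. The student data and predictions are invented for illustration only.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one yes/no label (y_true is 0 or 1, y_pred is a probability)."""
    p = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def squared_error(y_true, y_pred):
    """Loss for one continuous target, such as a percentage grade."""
    return (y_true - y_pred) ** 2

def empirical_loss(loss_fn, targets, predictions):
    """Average of the per-example loss over the whole dataset."""
    return sum(loss_fn(t, p) for t, p in zip(targets, predictions)) / len(targets)

# Hypothetical data for three past students.
passed = [1, 0, 1]
predicted_prob = [0.1, 0.8, 0.6]
bce = empirical_loss(binary_cross_entropy, passed, predicted_prob)

true_grade = [90.0, 20.0, 95.0]
predicted_grade = [30.0, 80.0, 85.0]
mse = empirical_loss(squared_error, true_grade, predicted_grade)
```

A confident wrong prediction (0.1 for a student who passed) contributes far more to the cross entropy than a mildly wrong one, which is the "how bad is this prediction" signal the training procedure needs.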
00:40:54of course there are so many other losses that you'll get exposure to over the class but these two
00:40:59uh have very wide coverage in the field okay let's put all this information together and now start uh
00:41:09talking about the problem of actually finding our weights of the network right we've talked about
00:41:14defining the network we've talked about uh basically penalizing the network when it gets something
00:41:18wrong we have not talked about how to actually improve the network or train the network so let's
00:41:24talk about that in this next part the objective here what are we trying to do ultimately at the end of
00:41:30the day throughout this entire class is that we're trying to find and build networks or models build
00:41:38models that minimize the loss on the data set the loss measures this difference between predicted
00:41:43and true we want to minimize we want to find a network that minimizes the loss on a data set this
00:41:51means mathematically right walking through this equation it means that we want to find the w's we want to
00:41:57find the weights that will result in the minimum l the loss over the entire data set from one to n
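The slide equation itself is not in the transcript. Written out under assumed notation (f for the network, x^(i) and y^(i) for the i-th input and its ground truth, n for the number of data points, and J for the empirical loss, the name the speaker uses later), the objective described above would read:

```latex
\mathbf{W}^{*}
  = \underset{\mathbf{W}}{\operatorname{argmin}} \; \frac{1}{n} \sum_{i=1}^{n}
    \mathcal{L}\big(f(x^{(i)}; \mathbf{W}),\, y^{(i)}\big)
  = \underset{\mathbf{W}}{\operatorname{argmin}} \; J(\mathbf{W})
```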
00:42:09now remember that w weights is just this is just going to be a collection of all of the weights in
00:42:16our entire model so it's the weights from every single layer in our network we're just going to
00:42:20combine those all into one piece and those are the weights that we're going to try and optimize over
00:42:26now how do we do this optimization procedure well remember that our loss function is just a function of
00:42:34our weights given a set of weights our loss function will return a single value it is
00:42:39how far apart our predicted answers are from our true answers if we only had two weights in our network
00:42:45then we would be able to plot our loss landscape like a picture like this we would be able to plot in
00:42:51a grid of data over weights one and weights two and for every configuration of my weights i'd be able to see
00:42:58how much error or how much loss that configuration of weights is obtaining now what we want to do is
00:43:05basically find the lowest point on this landscape we want to find which weights one and weights two
00:43:11correspond to and give us the smallest loss so how can we do this well we can start at some random point
00:43:19we pick a random point in our landscape any point and we start from this point and what we'll do is we
00:43:25compute what's called the gradient the gradient will tell us which way is up from this point right
00:43:31it's a local measure it only tells us locally from where i stand right here which way is up and what
00:43:38i'll do is i will take a small step in the opposite direction right and i'll take a small step going down
00:43:44that loss and then i will repeat this process over and over and over again until i finally get to the
00:43:49bottom of the mountain of the hill right and i converge at what's called a local minimum
00:43:56we can summarize this this algorithm this procedure as uh as what's known as gradient descent in pseudocode
00:44:05right so let's go through it again one more time very briefly we start by randomly initializing our
00:44:10weights this means that we randomly pick a place in our landscape we compute the gradient
00:44:17here called dj dw this is how much a small change in our weights changes our loss right so this tells
00:44:25us the direction that we should change our weights in order to increase our loss we take a small step in
00:44:31the opposite direction so here you can see that actually we take that gradient we multiply by negative
00:44:36one we go in the opposite direction of that gradient and then we multiply it by a small
00:44:42uh step let's call it eta eta here is going to be a step size of how much in that direction we actually
00:44:50move and then we repeat this in a loop over and over again
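The pseudocode from the slide is not in the transcript, but the steps as spoken can be sketched as follows. The bowl-shaped `loss` function and its hand-written `gradient` are stand-ins chosen so the example runs on its own; in a real network the gradient would come from back propagation.

```python
import random

def loss(w):
    """Hypothetical bowl-shaped loss with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """Analytic gradient dJ/dW of the loss above."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(eta=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]   # 1. initialize weights randomly
    for _ in range(steps):                            # 2. loop
        g = gradient(w)                               # 3. compute the gradient dJ/dW
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # 4. step against the gradient, scaled by eta
    return w                                          # 5. return the weights

w_final = gradient_descent()
```

The minus sign in the update is the "multiply by negative one" from the description: the gradient points uphill, so the step goes the other way.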
00:44:55in tensorflow right you can see this exactly represented the same way but here i want to draw
00:45:00your attention to this term right this is the direction term it tells us the gradient the gradient is
00:45:06going to tell us how or which direction is going up or which direction is going down if you take the
00:45:12negative of it but i never actually told you how to compute this right i just told you that we need to
00:45:18compute this right the process of computing the gradient in a neural network is called back propagation so i
00:45:25think it would be helpful also we can take a quick you know step-by-step example walking us through
00:45:32how back propagation works and how you would compute this gradient for a particular neural network and
00:45:39we'll start just for demonstration we'll start with the simplest neural network that exists it consists of
00:45:46one input one output and one hidden neuron in the middle right so you cannot get a simpler network than
00:45:51this and we want to compute the gradient of our loss l at the end or excuse me here it's j at the end
00:45:59with respect to let's start with with respect to w2 okay so how much does a small change
00:46:06in w2 affect our loss
00:46:10so we can write out this derivative right we can write it out in math and we can use the chain rule
00:46:16to actually decompose it now why would we want to decompose it well first of all we decompose this
00:46:22this gradient dj dw2 into two terms dj dy and dy dw2 this is just a basic extension of the chain rule
00:46:33nothing magic here but why is this possible it is possible because y is dependent only on the previous
00:46:40layer okay now let's suppose now that we want to compute the gradients of this weight before w2
00:46:49let's say w1 here what we can do is just replace w2 in this equation with w1 and then again we have
00:46:56to apply the chain rule yet again right because computing this last term here is not well defined
00:47:02so we have to actually expand it one more time this is why we call it propagation back propagation
00:47:06because you actually have to start from the output and keep computing these iterative chain rules back
00:47:12and back over the course of your network step by step and we repeat this process of you know propagating
00:47:19those gradients all the way from output to input across our weights and at the end of this whole
00:47:25process what we're left with is for every single weight in our network we have this direction of
00:47:32basically saying okay if we increase this weight a little bit will our loss go up or down now if our
00:47:38loss was to go down that means that we should increase that weight just a little bit right or otherwise we
00:47:43would go in the opposite direction and that's the back propagation algorithm right in theory it's it's
00:47:50nothing more than an application of the chain rule from uh differential calculus but in practice you know
00:47:59it can get very messy and very hairy it's a very computationally intensive thing to do because you have to do this
00:48:04you know step by step for every single weight in your in your model uh so in practice today's deep
00:48:10learning frameworks like tensorflow pytorch they do this automatically so you don't necessarily need
00:48:14to implement this yourself but it's important to understand you know the the
00:48:20theoretical side of you know how these things are operating and and what it's doing underneath the hood
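As a hedged illustration of the chain rule steps above, here is the one-input, one-hidden-neuron, one-output network in plain Python. The transcript does not specify the activations or the loss, so this sketch assumes a sigmoid hidden neuron, a linear output, and a squared error loss J; the gradients are then checked against finite differences.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w1, w2):
    z1 = sigmoid(w1 * x)   # hidden neuron (assumed sigmoid)
    y = w2 * z1            # output (assumed linear, to keep the math short)
    return z1, y

def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return (y - t) ** 2    # J, assumed squared error against target t

def backprop(x, t, w1, w2):
    z1, y = forward(x, w1, w2)
    dJ_dy = 2.0 * (y - t)
    # dJ/dw2 = dJ/dy * dy/dw2
    dJ_dw2 = dJ_dy * z1
    # dJ/dw1 = dJ/dy * dy/dz1 * dz1/dw1  (one more chain rule step back)
    dy_dz1 = w2
    dz1_dw1 = z1 * (1.0 - z1) * x
    dJ_dw1 = dJ_dy * dy_dz1 * dz1_dw1
    return dJ_dw1, dJ_dw2

def numerical_grad(f, w, h=1e-6):
    """Central finite difference, used only to check the analytic gradients."""
    return (f(w + h) - f(w - h)) / (2.0 * h)

x, t, w1, w2 = 1.5, 1.0, 0.4, -0.7   # arbitrary illustrative values
g1, g2 = backprop(x, t, w1, w2)
n1 = numerical_grad(lambda w: loss(x, t, w, w2), w1)
n2 = numerical_grad(lambda w: loss(x, t, w1, w), w2)
```

Note how `dJ_dy` is computed once and reused for both weights; reusing terms from later layers when moving backward is what makes the procedure cheaper than differentiating each weight independently.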
00:48:25i want to also like use that as an opportunity to discuss with you some of the practical implications
00:48:31of training neural networks uh in in reality right and i showed you this previous picture of like a
00:48:39very pretty loss landscape that was very smooth but in practice optimizing neural networks is extremely
00:48:44difficult and this is actually a picture of you know neural networks are extremely high dimensional
00:48:50search spaces so we don't actually know what this picture looks like but this is a projection of the
00:48:55loss landscape of a of a deep neural network uh from a paper that came out several years ago in about 2017
00:49:03and you can actually visualize now you know how messy some of these loss landscapes look so applying
00:49:09these types of back propagation and optimization techniques is very very challenging and i want you to
00:49:14also recall you know before we took that dive into back propagation and the gradient term in particular we started
00:49:21to talk about uh you know this this equation that you see here right so how would we update the weights
00:49:26we update them by taking an opposite step in a small uh small increment in that direction that we want to
00:49:34right now this is the key term i want to focus on now this small step this is called the learning rate
00:49:40of our model this basically dictates how quickly we take those steps and how quickly we listen
00:49:46to our to our gradients as we're computing back propagation and in practice setting the learning
00:49:52rate can be very very difficult if we set the learning rate too small then we basically start from a point
00:49:57but we get stuck in some of these uh local minima but they may not be the best minima that we could get
00:50:03to right if we set it too large then we get some unstable behavior where we basically overshoot we we start to
00:50:10step in the right direction but we step too far and then we explode out of the out of the stable place of
00:50:16learning ideally we want to set learning rates that are you know not too small so that they can skip
00:50:22some of the local minima but also not too big so that they don't diverge and they can still converge
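A toy experiment makes the effect of the learning rate visible. The loss J(w) = w**2 and the three learning rates are hypothetical choices; because this loss has a single minimum, the "too small" case shows up here as very slow progress rather than as getting stuck in a poor local minimum.

```python
def run(eta, steps=50, w0=1.0):
    """Gradient descent on the hypothetical loss J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

too_small = run(eta=0.001)   # barely moves away from the starting point
reasonable = run(eta=0.1)    # ends up very close to the minimum at 0
too_large = run(eta=1.1)     # overshoots further on every step and diverges
```

Each update multiplies w by (1 - 2*eta), so any eta above 1.0 makes the magnitude of w grow instead of shrink on this particular loss.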
00:50:28so how do we actually set the learning rate one option and actually a very common option is to
00:50:34you know just try a bunch of learning rates see what works best how can you do better than this well
00:50:39the idea is uh can you design adaptive algorithms that depending on how they are uh optimizing in the
00:50:46search space can you adapt the learning rate can you change the learning rate as a function of your
00:50:50landscape itself and this basically means that your learning rate practically speaking your learning
00:50:56rate will increase or decrease as a function of your gradients and the function of your data
00:51:01uh how fast you're learning right how how uh how steep the uh landscape is how how all of these
00:51:11different things can basically dictate all of these adaptive properties of a learning rate and in fact
00:51:16these have been very widely studied and many different types of adaptive learning rate schedulers have been
00:51:22created here you can see some examples adam so all of these start with like a lot a lot of them start with
00:51:28this ada for adaptive right these are different variations of these adaptive properties adam in
00:51:35particular is one extremely well used uh type of optimization procedure that you'll be using
00:51:43throughout many of your labs but i encourage you to really try out and experiment with all of these
00:51:49different types of learning rate schedulers to see what works best many times there will be different
00:51:55types of learning rate schedulers that work for different types of problems so you should definitely
00:51:59try out the different pieces and trying them out is is as easy as oftentimes just a single line change
00:52:06right a change to your learning loop will just implement different schedulers so sgd stochastic gradient descent
00:52:15is just going to be that that base gradient descent algorithm that we had seen before
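To show what "adaptive" means in practice, here is a sketch of the Adam update rule as it is commonly described, written in plain Python rather than with a framework optimizer. The default values for `beta1`, `beta2`, and `eps` follow the commonly cited ones; the function name `adam_minimize`, the step count, and the test loss are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: per-weight step sizes from running averages of the gradient and its square."""
    m = [0.0] * len(w)   # running mean of gradients
    v = [0.0] * len(w)   # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction for the zero-initialized averages.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Weights with consistently large gradients take relatively smaller steps.
        w = [wi - eta * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w

def grad(w):
    """Gradient of a hypothetical loss (w0 - 3)**2 + 10 * (w1 + 1)**2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

w_adam = adam_minimize(grad, [0.0, 0.0])
```

The two weights in this toy loss have very different curvature, yet a single `eta` works for both because each weight's step is rescaled by its own gradient history.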
00:52:20and i actually want to dig into that a little bit more because what you saw or what i presented was
00:52:26actually the gradient descent algorithm not the stochastic gradient descent algorithm so i want to tell
00:52:31you a little bit about you know what's the difference between those two pieces or those two types of
00:52:36algorithms to understand that we have to first revisit one more time the gradient descent algorithm
00:52:42so the gradient here this is that that piece that we computed with back propagation this is very
00:52:49computationally expensive because if you look at it it's computed as a summation or an average i should say over all
00:52:56of my data points in my data set so i compute the gradient for not just one data point but all of my data
00:53:03points in my data set that's why it's very expensive now in most real life problems it is not really feasible
00:53:10to compute your gradient over your entire data set on every single iteration of this step because remember
00:53:16we don't compute the gradient just once we compute it at every point along this optimization procedure and
00:53:22you're optimizing your your network for millions or even more steps and you don't want to be looping
00:53:29through your entire data set on every single one of those steps so let's define a new type of gradient
00:53:34descent now we'll call it stochastic gradient descent like you saw before instead of computing the gradient
00:53:40over my entire data set i'm going to compute a very noisy gradient it's going to be a gradient
00:53:45computed just over one data point in my data set so i'm going to randomly pick a data point
00:53:50and i'm going to compute the gradient with respect to that one data point not my entire data set this
00:53:55is going to be way noisier obviously because that one data point is not going to be representative
00:53:59of my entire data set but it will give me an answer way quicker so i can get through more steps
00:54:06now there's also a uh you know there's a natural trade-off here right we want to go fast but we also
00:54:13don't want to be too noisy there obviously is a middle ground here right instead of computing
00:54:18the noisy gradient on one example we can do what's called mini batch gradient descent right
00:54:23mini batch gradient descent is where you set a batch size and then on every iteration you compute your
00:54:29gradient with respect to not just one data point but let's say k data points where k is pretty
00:54:34small think of something like 32 or 128 something on that scale you look at your gradient with respect
00:54:40to those let's say 32 data points and then you average that gradient it helps you get a bit more
00:54:47reliability and robustness in your measure but then you also get the speed right you're not going over
00:54:53your entire data set 32 is usually way way smaller than your entire data set
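The difference between the two variants comes down to how many randomly chosen data points feed each gradient estimate. The sketch below fits a line to a made-up dataset (y = 2x + 1 plus noise); the dataset, learning rate, and step count are illustrative assumptions.

```python
import random

rng = random.Random(0)

# Hypothetical dataset: y = 2x + 1 plus a little noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]

def batch_gradient(w, b, batch):
    """Average gradient of the squared error over the given data points."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb

def train(batch_size, eta=0.1, steps=500):
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random subset: the stochastic part
        gw, gb = batch_gradient(w, b, batch)
        w, b = w - eta * gw, b - eta * gb
    return w, b

w_sgd, b_sgd = train(batch_size=1)    # stochastic gradient descent: one point per step
w_mb, b_mb = train(batch_size=32)     # mini-batch gradient descent: 32 points per step
```

Setting `batch_size=len(data)` would recover full gradient descent, at the cost of touching every data point on every step.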
00:54:59okay so now what does this mean this means that we now have this increase in gradient accuracy
00:55:05compared to stochastic gradient descent so we can we can converge much more smoothly we're not super
00:55:11noisy going after one data point one at a time but it also means that we can be much more uh
00:55:18quick compared to uh full gradient descent where we go over the entire data set as a whole
00:55:24this means that you know because we're more stable on the one side we can also increase our learning rate
00:55:30these two things are extremely connected right the relationship between your gradients and your learning
00:55:35rates should be one that you have a very good intuition about because your gradients are now more stable
00:55:40you're averaging over a mini batch not just a single sample you can now start to uh take bigger steps
00:55:48right you can trust the gradient a bit more over over the course of optimization it also allows you to
00:55:54really parallelize training because if you wanted to compute your gradient over 32 data points you can
00:55:59parallelize that across 32 processes on your gpu right you compute them in parallel as opposed to one at a
00:56:06time this allows you to really start to utilize gpu speedups even further now the last topic i'll touch
00:56:14on before we uh we take a short break for lecture two is going to be this topic of overfitting and
00:56:21regularization of neural networks and this is a huge problem not just in deep learning but we really want
00:56:26to cover it because it's one that you're going to get exposure to in today's lab especially and
00:56:31basically it's one of the most fundamental topics of all of machine learning as a whole ideally in
00:56:38machine learning we want to build models that don't just work well on a training set right we do train
00:56:45our models on training sets but we don't want them to work well only on our training set actually what we
00:56:50really want is we we actually don't really oftentimes care about how well it works in practice on our
00:56:55training set at all we use that as a proxy because what we really care about is how well the model works
00:57:00on brand new data when we deploy it into the wild and there it's not our training data at all it's
00:57:05brand new test data and the relationship between these two things is extremely important we use the
00:57:11training data as a proxy but ultimately we don't really really care about it all that much another
00:57:16way to say this is that when we build models we want to learn representations from our training data
00:57:22but we still want them to generalize to unseen test data as well now take this picture for example
00:57:30assume you want to build a line that describes the relationship between the x and the y points on this
00:57:35picture you know on the left hand side you can see that you have a very simple model a linear model
00:57:42it can describe the training points and it probably will also describe the the test points to some decent
00:57:48faithfulness but it's not fully capturing the richness and the complexity of our data set
00:57:52both in the training set and the test set so we're not utilizing the full expressive capacity of the
00:57:57model on the on the left hand side move over all the way to the right hand side you can actually see
00:58:01that we're starting to memorize data points in the training set so much so that we're hurting our
00:58:06performance for brand new test data because we're we're weighting too much on what we've seen during
00:58:11training basically what you always want is to end up in the middle you want to leverage your
00:58:16training points but not rely on them too much or memorize them now yes
00:58:25An example of the problem of overfitting? Oh, sorry, say, a real example of the problem which we face in
00:58:32overfitting? Yes, of course. So a real-life example of overfitting would be, let's say, if you have a
00:58:38very small data set but a very large network, you'll learn a model that just memorizes
00:58:46all of the data in your data set. And it's not like it's doing something bad,
00:58:53because it has the power to memorize everything in the training set. Remember always that models don't
00:58:58see the test set. It's unseen data, so all they can see is the training set that you give
00:59:03them. So if you give them a very small training set and a very big model, the model will do what it's supposed
00:59:08to do and learn exactly the training set to its full capacity, right? But then when you show it
00:59:13test data, it's not going to be very faithful to that data, because the test data is not going to come
00:59:18from exactly the same distribution. Yeah? Yep.
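As a rough illustration of the small-data, high-capacity situation described above, here is a toy sketch in plain Python. It is not from the lecture: the "memorizer" (a nearest-point lookup table) merely stands in for an oversized network, and the data, the alternating noise pattern, and all function names are assumptions made up for the example.

```python
# Toy illustration (not from the lecture): a "memorizer" vs. a straight line.
# All names and numbers here are invented for the example.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return lambda x: w * x + b

def fit_memorizer(xs, ys):
    """Stores the training set and answers with the nearest stored point."""
    table = list(zip(xs, ys))
    return lambda x: min(table, key=lambda pair: abs(pair[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model over a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Small training set: true trend y = 2x + 1 plus alternating +/-0.5 "noise".
train_x = [float(i) for i in range(10)]
train_y = [2 * x + 1 + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]

# Unseen test points that follow the true trend.
test_x = [i + 0.25 for i in range(10)]
test_y = [2 * x + 1 for x in test_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
memo_train, memo_test = mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)
print(f"line:      train={line_train:.3f}  test={line_test:.3f}")
print(f"memorizer: train={memo_train:.3f}  test={memo_test:.3f}")
```

The memorizer reproduces every training point exactly, noise included, so its training error is zero while its error on the unseen points is larger than the simple line's.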
00:59:25Yeah?
00:59:28Where is this,
00:59:29where you identify the points that you choose?
00:59:34Um,
00:59:40maybe I'm not...
00:59:58I see, I see.
01:00:00Yeah, so the stochasticity is coming purely
01:00:02from the selection operation.
01:00:03So maybe it's a confusion.
01:00:05So why do we call it stochastic gradient descent?
01:00:09It's because of the selection process.
01:00:11We don't do this over the entire data set,
01:00:13but we stochastically select a subset of data,
01:00:16and that selection is stochastic.
01:00:20Yeah, make sense?
01:00:27No, no, no.
01:00:28So you take the stochastic selection,
01:00:31and then with that stochastic selection of data,
01:00:35the gradient is, I mean, it can be unbounded, right?
01:00:38So you grab or you compute the gradient
01:00:40with respect to those data points,
01:00:42whatever they may be,
01:00:43but your stochasticity is coming from the selection part,
01:00:46not from the gradient computation.
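The point that the randomness lives in the selection step, not in the gradient computation, can be sketched as follows. This is a minimal plain-Python example under assumed settings: the linear model, the noise-free data on y = 3x + 2, the learning rate, and the batch size are all hypothetical choices, not values from the lecture.

```python
import random

def batch_gradient(w, b, batch):
    """Gradient of mean squared error for y_hat = w * x + b over one batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def sgd(data, steps=500, batch_size=10, lr=0.1, seed=0):
    """Mini-batch stochastic gradient descent on a 1-D linear model."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # The only stochastic step: which data points we look at.
        batch = rng.sample(data, batch_size)
        # Deterministic once the batch is fixed.
        grad_w, grad_b = batch_gradient(w, b, batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical noise-free data on the line y = 3x + 2.
data = [(i / 50 - 1, 3 * (i / 50 - 1) + 2) for i in range(101)]
w, b = sgd(data)
print(round(w, 3), round(b, 3))
```

Calling `batch_gradient` twice on the same batch always gives the same answer; only `rng.sample` changes from step to step.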
01:00:49Yes?
01:01:10So basically, the question is about,
01:01:12is there a way,
01:01:13is there a more adaptive way almost of doing selection
01:01:15as opposed to being truly stochastic?
01:01:17And the answer is yes, definitely.
01:01:19So truly stochastic seeing of data
01:01:22is actually not very realistic either, right?
01:01:25Even though this is the way that is the convention, right?
01:01:27We, as humans, do not operate like this, right?
01:01:30We don't just randomly see data.
01:01:32We see data sequentially over time,
01:01:34and we see data with meaning and with purpose.
01:01:37Actually, in tomorrow's lecture,
01:01:38you'll see an example of how we do this type
01:01:40of adaptive selection process
01:01:42and the benefits of this as well.
01:01:44Great question.
01:01:46Okay, so I'll just very briefly wrap up with regularization.
01:01:50So regularization is just a technique
01:01:52that allows you to discourage
01:01:54these complex memorization protocols.
01:01:57So if you have a very small data set, big model,
01:01:59you want to discourage the model
01:02:01from just memorizing that data set.
01:02:03So how can you discourage the model
01:02:05from learning those types of things?
01:02:08And, you know, as we've seen,
01:02:10this is really critical
01:02:11for the overall performance of the model
01:02:13because we don't care about the training results.
01:02:16We care about the test results, ultimately.
01:02:18The most popular regularization technique
01:02:20is actually a very simple idea.
01:02:22You'll use this in almost all of your labs
01:02:24as part of this course.
01:02:25It's the idea of dropout.
01:02:27So what is dropout?
01:02:28Let's revisit this picture of a deep neural network.
01:02:31In dropout, all we do is that during training,
01:02:34we're going to randomly set some activations
01:02:36of our hidden neurons to zero with some probability.
01:02:40So let's say we set dropout to 50%.
01:02:43What we're going to do is say 50% of our neurons,
01:02:46we're going to drop out the activations
01:02:48or set their activations to zero,
01:02:50which forces the network
01:02:52to not rely so much
01:02:54on the outputs of any one neuron, right?
01:02:58The next layer,
01:03:00after a neuron gets dropped,
01:03:02cannot rely on,
01:03:03cannot memorize so much
01:03:04about the previous inputs,
01:03:06because there is more stochasticity
01:03:08being injected
01:03:09into this forward pass of the model,
01:03:11not just in the data set curation
01:03:12or this data set selection,
01:03:14but also in just the pure forward pass.
01:03:16Even if I pick the same data twice
01:03:18and I put it through the model twice,
01:03:20the exact same data,
01:03:21because of dropout,
01:03:23you also have another level of stochasticity
01:03:25that means the model can't even remember
01:03:28the same exact data twice, right?
01:03:30This is an extremely powerful idea
01:03:32because basically all it's doing
01:03:33is it's lowering the capacity of the model.
01:03:36It's lowering the ability
01:03:38or it's discouraging the ability
01:03:40for the model to learn
01:03:42a singular pathway through the model.
01:03:44It's forcing the model
01:03:45to learn these multiple pathways
01:03:46to make a single decision.
01:03:49And basically on every single iteration,
01:03:51we just repeat this process.
01:03:53Every time it sees a new piece of data
01:03:54or every time we do a forward pass,
01:03:57it always creates a random pathway
01:04:00for this data to pass through the model.
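A minimal sketch of the dropout step just described, in plain Python. One detail is an assumption rather than something stated in the lecture: surviving activations are scaled by 1 / (1 - p), the common "inverted dropout" convention, so that the expected activation stays the same. The function name and example values are hypothetical.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) ("inverted dropout"); this scaling
    is an assumption here, the lecture only describes the zeroing step.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]
rng = random.Random(0)
print(dropout(hidden, p=0.5, rng=rng))   # one random pathway
print(dropout(hidden, p=0.5, rng=rng))   # same input, a freshly drawn pathway
print(dropout(hidden, training=False))   # dropout is switched off at evaluation
```

Passing the same `hidden` values through twice draws a fresh mask each time, which is the extra level of stochasticity in the forward pass mentioned above.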
01:04:03A final technique that I'll show you
01:04:05is about this notion of early stopping.
01:04:08Early stopping basically just means
01:04:09that we monitor the deviation
01:04:12between our training loss and our test loss.
01:04:16So we can have a test,
01:04:18we can have a proxy of a test loss
01:04:20by having a held out set.
01:04:22Maybe it's not a true test loss,
01:04:23but it's again another proxy
01:04:25that we do not train on.
01:04:26And what we can do is
01:04:28we can basically monitor
01:04:29how well the model is doing
01:04:31on both the training set
01:04:32and our held out,
01:04:34let's call it a validation set.
01:04:36In the beginning,
01:04:38both of these lines as we train,
01:04:40they both start to go down,
01:04:42which is excellent.
01:04:42It makes sense, right?
01:04:43This is because the model is learning, right?
01:04:46It's getting stronger
01:04:46over the course of training.
01:04:48And eventually what you'll see
01:04:50is that the model starts to plateau its loss.
01:04:53And on the test set,
01:04:55the loss actually starts to increase.
01:04:57So the training accuracy should,
01:04:59if the model has enough capacity,
01:05:01the training accuracy should always,
01:05:02excuse me,
01:05:03the training loss should always go down.
01:05:05It should always be getting better
01:05:06and better on the training set.
01:05:08But at some point,
01:05:09you will see that the model
01:05:11starts to memorize data.
01:05:13It starts to memorize data
01:05:14in the training set,
01:05:15which results in the test loss
01:05:17going up a little bit.
01:05:19Now, this pattern continues
01:05:20for the rest of training.
01:05:22And here's the point
01:05:23that you should really focus on, right?
01:05:24This is the point where,
01:05:26if you plotted this curve,
01:05:29you would save your model
01:05:30at each of these stages,
01:05:31but you would only take one checkpoint.
01:05:33You would take the model
01:05:34that happens at this point,
01:05:35because,
01:05:36even though the training loss
01:05:38got even better after this point,
01:05:41so that if you look at your training set,
01:05:43it actually looks like
01:05:43you have a better model,
01:05:44on the test set
01:05:46you can see that it has actually
01:05:47started to memorize
01:05:48pieces of the training set.
01:05:49So you do not take
01:05:51the models on the far right.
01:05:52You actually take these models
01:05:53in the middle.
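The checkpoint-selection idea can be sketched like this. The loss values are invented for illustration, and the `patience` rule (stop after a fixed number of evaluations without improvement) is a common convention assumed here, not something specified in the lecture.

```python
def select_checkpoint(val_losses, patience=2):
    """Return (best_step, stop_step) for a sequence of validation losses.

    best_step is the checkpoint to keep; stop_step is where training halts
    after `patience` evaluations in a row without improvement.
    """
    best_step, best_loss = 0, float("inf")
    bad_evals = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, len(val_losses) - 1

# Hypothetical curves: training loss keeps falling, validation loss turns up.
train_losses = [1.00, 0.70, 0.50, 0.40, 0.30, 0.20, 0.10]
val_losses   = [1.00, 0.80, 0.60, 0.55, 0.60, 0.70, 0.80]

best_step, stop_step = select_checkpoint(val_losses, patience=2)
print(best_step, stop_step)  # → 3 5
```

Even though `train_losses` is lowest at the last step, the checkpoint that gets kept is the one at step 3, where the held-out loss bottoms out.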
01:05:54Yes?
01:06:00Not every iteration,
01:06:02because maybe it adds
01:06:03unnecessary compute.
01:06:04But what people typically do is,
01:06:05you know, let's say,
01:06:06once every so many iterations,
01:06:08you will do a testing run.
01:06:10And again, you don't need
01:06:11to do a testing run
01:06:12over your entire test set.
01:06:13You could do it stochastically
01:06:14as well in a batch, right?
01:06:16So let's say you could do,
01:06:18let's say,
01:06:18every thousand iterations,
01:06:20you do a batch of, let's say,
01:06:22only 100 data points
01:06:23in your test set,
01:06:24just to get an approximation.
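One way to read the answer above is as a training loop that only occasionally estimates the held-out loss on a small random batch. The sketch below uses the numbers mentioned (every 1000 iterations, 100 points); the function names, the no-op training step, and the stand-in loss are hypothetical placeholders so that the example runs on its own.

```python
import random

def approx_val_loss(loss_fn, val_set, rng, sample_size=100):
    """Average loss over a random subset of the held-out set."""
    batch = rng.sample(val_set, min(sample_size, len(val_set)))
    return sum(loss_fn(example) for example in batch) / len(batch)

def train_with_periodic_eval(train_step, loss_fn, val_set,
                             iterations=3000, eval_every=1000, seed=0):
    """Run training steps and record an approximate held-out loss periodically."""
    rng = random.Random(seed)
    history = []  # (iteration, approximate held-out loss)
    for it in range(1, iterations + 1):
        train_step(it)
        if it % eval_every == 0:
            history.append((it, approx_val_loss(loss_fn, val_set, rng)))
    return history

# Hypothetical stand-ins: a no-op training step and a per-example "loss"
# that is just the squared value of the example.
val_set = [x / 1000 for x in range(1000)]
history = train_with_periodic_eval(lambda it: None, lambda x: x * x, val_set)
print(history)
```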
01:06:35No, so the dropped nodes
01:06:36will not have gradients
01:06:37because we don't have
01:06:38information about what's
01:06:40happening with them.
01:06:40But for all of the other nodes,
01:06:42we'll get an update.
01:06:44Yeah, exactly.
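To make the answer about gradients concrete: the backward pass reuses the mask from the forward pass, so a dropped unit receives a zero gradient on that iteration. The sketch below assumes a fixed, hand-written mask and the same inverted-dropout scaling convention; the names and numbers are illustrative only.

```python
def dropout_forward(activations, mask, p):
    """Apply a fixed 0/1 mask; survivors are scaled by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [a * m * scale for a, m in zip(activations, mask)]

def dropout_backward(upstream_grads, mask, p):
    """Chain rule through the same mask: dropped units get zero gradient."""
    scale = 1.0 / (1.0 - p)
    return [g * m * scale for g, m in zip(upstream_grads, mask)]

mask = [1, 0, 1, 0]                # units 1 and 3 are dropped on this pass
acts = [0.5, 1.2, 0.3, 0.8]
upstream = [0.1, 0.4, -0.2, 0.7]   # gradient arriving from the next layer

print(dropout_forward(acts, mask, p=0.5))
print(dropout_backward(upstream, mask, p=0.5))
```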
01:06:45Yes?
01:06:51It should be separate, yes.
01:06:53So a key assumption
01:06:54is that, ideally,
01:06:55you take your training data,
01:06:56and what people can do
01:06:57is basically cut the training data
01:06:59in a ratio, right?
01:07:01So let's say you take 70%
01:07:02of your training data
01:07:03and you actually use it
01:07:04for training,
01:07:04and you take the other 30%
01:07:06of your training data
01:07:07and use it for testing
01:07:08and validation, right?
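The 70/30 cut described above might look like the following sketch. Shuffling before splitting and the fixed seed are assumptions added here so that the held-out part is a random, reproducible sample; the function name is hypothetical.

```python
import random

def train_val_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into training and held-out parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train_set, val_set = train_val_split(examples)
print(len(train_set), len(val_set))  # → 70 30
```

The two parts are disjoint, so the held-out 30% is never seen during training.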
01:07:11Okay, last question.
01:07:12Is there an ideal difference
01:07:14in loss between the test
01:07:15and the training data set?
01:07:17Great question.
01:07:20I mean, there's no ideal, right?
01:07:23Ideally, actually,
01:07:24there would be no difference, right?
01:07:26In practice, though,
01:07:28there are situations, actually,
01:07:30where there is
01:07:30very little difference.
01:07:31Let me give an example:
01:07:32assume your training set
01:07:34is so massive
01:07:36that it's impossible
01:07:37for your model to learn it
01:07:38to full capacity.
01:07:40It's impossible
01:07:40for the model to memorize.
01:07:41Then, actually,
01:07:42you will see, basically, that
01:07:43training and testing
01:07:45are very close to each other.
01:07:46A good example of this
01:07:46is language modeling.
01:07:48Even massive language models,
01:07:50they still have trouble
01:07:51memorizing the entire data set
01:07:53just because language
01:07:54is such a massive data set, right?
01:07:56So even there,
01:07:58basically, you'll see
01:07:59training and testing curves
01:08:00look very, very similar,
01:08:01but then that's why
01:08:02we have to actually do
01:08:03other types of validation.
01:08:04Language models don't really have
01:08:06the classical overfitting problems
01:08:07that other types
01:08:09of deep learning models have.
01:08:11They have other problems,
01:08:12which we'll talk about.
01:08:14Yeah, okay.
01:08:17Awesome.
01:08:17Okay, I'll conclude now
01:08:18just by summarizing
01:08:19the three points
01:08:20that we talked about
01:08:21in this lecture
01:08:21before we jump into
01:08:22lecture number two.
01:08:23So first, we talked about,
01:08:25you know, building neural networks,
01:08:26the architectures
01:08:27of neural networks.
01:08:28We talked about
01:08:28the base operation.
01:08:30The base architecture
01:08:30is called a perceptron,
01:08:32a single neuron.
01:08:33We learned about
01:08:34how we could stack
01:08:35those single neurons together
01:08:36to form complex
01:08:37hierarchical networks
01:08:39and how we can mathematically
01:08:40optimize those networks
01:08:42using data.
01:08:42And finally,
01:08:44we addressed a lot
01:08:44of the practical implications,
01:08:46everything from,
01:08:46you know,
01:08:47batch gradient descent
01:08:48to overfitting
01:08:50and regularization
01:08:51and optimization
01:08:51of these models.
01:08:53In the next lecture,
01:08:54we're going to hear
01:08:55from Ava
01:08:55on deep sequence modeling,
01:08:57which is the backbone
01:08:59of large language models.
01:09:01And this is a really exciting
01:09:03type of lecture,
01:09:04so hopefully everyone enjoys it.
01:09:06And I think probably
01:09:07what we'll do
01:09:08is just take a five-minute break
01:09:10just so Ava and I
01:09:11can switch laptops,
01:09:12and then we will continue
01:09:13with the lecture.
01:09:14And then after the lecture,
01:09:15we have software labs
01:09:16followed by reception
01:09:17at Link and food.
01:09:20Okay.
01:09:20Thanks, everyone.