00:00 Hello everyone. In this video, we will talk about the modeling and evaluation of a random forest in Python.
00:10 We will see how to implement a random forest step by step.
00:19 We will work with a heart disease dataset, where the task is to predict whether a particular person has heart disease.
00:26 So, this is a classification problem.
00:28 Internally, a random forest builds an ensemble of decision tree classifiers.
00:34 To implement it, we first import the libraries: NumPy, Pandas, Matplotlib, and Seaborn.
00:44 Then we load the dataset.
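A minimal sketch of this setup; the file name heart.csv and the DataFrame name heart are assumptions based on the narration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the heart disease dataset (file name assumed)
heart = pd.read_csv("heart.csv")
heart.head()
```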
00:48 The dataset has columns such as age, and sex, where 0 is female and 1 is male.
00:55 Then chest pain type, which has four values: 0 is typical angina, 1 is atypical angina, 2 is non-anginal pain, and 3 is asymptomatic.
01:09 Then resting blood pressure, cholesterol level, and fasting blood sugar, which is 1 if the fasting blood sugar is above 120 mg/dl and 0 otherwise.
01:21 Then resting ECG, which describes the PQRS wave: 0 is normal, 1 is an ST-T wave abnormality, and 2 is left ventricular hypertrophy according to Estes' criteria.
01:35 Then the maximum heart rate achieved, along with columns such as exercise-induced angina, which describe how the heart behaves under stress.
01:52 Based on these features for a particular person, we decide the target column, where 0 means the person does not have heart disease.
02:02 CA is the number of major vessels.
02:09 Then thal refers to a blood disorder (thalassemia): 1 is normal, 2 is a fixed defect where blood flow is blocked in part of the heart, and 3 is a reversible defect where the blood flow issue can be reversed.
02:29 All of these values together make up the complete dataset, which has 14 columns in total.
02:37 The target is our final output column; this is what we predict.
02:42 Then, with heart.shape, we do the basic information checks: 303 rows and 14 columns.
02:49 Then we check for null values; all 303 rows are non-null in every column.
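These checks as a short sketch, continuing with the heart DataFrame from above:

```python
print(heart.shape)  # (303, 14): 303 rows, 14 columns
heart.info()        # every column shows 303 non-null entries
```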
02:56 Then we run describe on the complete data and visualize it.
02:59 Next come the normal, basic EDA checks.
03:02 EDA starts with univariate analysis, where we look at a single column, such as age, and plot its count values with a histplot.
03:12 Then bivariate analysis: we plot CP, that is chest pain, with respect to the target, which is 0 or 1, so we see how the target varies across the chest pain types.
03:21 Then, for each and every column, we compute and analyze the correlation.
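A sketch of these EDA steps; using age for the univariate histogram is an assumption, while cp versus target follows the narration:

```python
print(heart.describe())  # summary statistics for the complete data

sns.histplot(heart["age"])  # univariate: distribution of a single column
plt.show()

sns.countplot(x="cp", hue="target", data=heart)  # bivariate: chest pain vs target
plt.show()

sns.heatmap(heart.corr(), annot=True, cmap="coolwarm")  # column correlations
plt.show()
```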
03:26 Then we look at missing data with heart.isnull().sum(); none of the columns have missing data.
03:31 Then heart.duplicated().sum() checks for duplicate data.
03:33 Here there is duplicate data, that is, a row where every column is exactly replicated, so we remove it.
03:40 So, heart.drop_duplicates(inplace=True).
03:45 When we check again, the duplicate has been removed.
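These checks in code, a brief sketch:

```python
print(heart.isnull().sum())      # no missing values in any column
print(heart.duplicated().sum())  # count of fully replicated rows
heart.drop_duplicates(inplace=True)
print(heart.duplicated().sum())  # 0 after removal
```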
03:50 Next, let's go to the random forest classifier implementation.
03:59 From sklearn.ensemble, we import RandomForestClassifier.
04:03 If this were a regression problem, we would use RandomForestRegressor instead.
04:07 First, we create a random forest classifier model, then we fit it on our training data and predict on our testing data.
04:15 Then, for the final accuracy, we compare y_test and y_pred: we get 86% accuracy.
04:22 Then we check the classification report, which gives the precision, recall, F1 score, and support for each class; here the scores are around 87%.
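A sketch of the baseline model; the 80/20 split and random_state are assumptions, since the narration does not give them:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

X = heart.drop("target", axis=1)
y = heart["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)     # fit on the training data
y_pred = rf.predict(X_test)  # predict on the testing data

print(accuracy_score(y_test, y_pred))         # about 0.86 in the video
print(classification_report(y_test, y_pred))  # precision, recall, F1, support
```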
04:33 In the report, class 0 is the person not having heart disease, and class 1 is the person having heart disease.
04:43 If we compare the two, the model learns the class of people having heart disease better than the class not having it.
04:52 So we have 86% accuracy, and we can still improve on it.
04:57 For that, we use hyperparameter tuning, where we search for the best parameters to provide to the model.
05:08 The tools for this are GridSearchCV and RandomizedSearchCV, which use cross-validation.
05:14 How do we do cross-validation? We split the entire data into, say, 4 folds.
05:24 The first time, the first block is the testing data and the remaining blocks are the training data, so we train on 75% and test on 25%.
05:36 The second time, the second block becomes the testing data and the rest becomes the training data, and so on for each fold.
05:46 So every run has a separate training and testing split. Then, for each combination of parameters we provide, the combination that gives the higher score is selected.
05:57 That is what GridSearchCV does: it tries the different combinations and finds the best parameters.
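To make the idea concrete, here is a small cross-validation sketch on its own, reusing X and y from the earlier split; the 4 folds mirror the example in the narration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(), X, y, cv=4)
print(scores)         # one score per fold, each with a different test block
print(scores.mean())  # average performance across the 4 folds
```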
06:07 In a random forest, what parameters do we consider? First, n_estimators, the number of decision trees; the default is 100, and we provide 100, 200, and 300.
06:14 Then max_features, where the options are auto, sqrt, and log2.
06:21 Then the depth of the trees, max_depth: 10, 20, 30, and None; we include None in the list, since sometimes we want no maximum depth.
06:29 Then min_samples_split, the minimum number of samples needed to split a node, for example 5 to 10.
06:43 Then, for the final node, the leaf node, the minimum samples allowed in a leaf is controlled by min_samples_leaf.
06:54 With all of these, we create a dictionary called random_grid: n_estimators 100, 200, 300, and so on for the other parameters.
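The grid as described; the value lists for min_samples_split and min_samples_leaf are assumptions, chosen so the total matches the 144 candidates mentioned later:

```python
random_grid = {
    "n_estimators": [100, 200, 300],           # number of decision trees
    "max_features": ["auto", "sqrt", "log2"],  # "auto" is removed in newer scikit-learn
    "max_depth": [10, 20, 30, None],           # None means no maximum depth
    "min_samples_split": [5, 10],              # assumed values
    "min_samples_leaf": [2, 4],                # assumed values
}
# 3 * 3 * 4 * 2 * 2 = 144 parameter combinations
```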
07:07 Next, we create a random forest classifier model, rf1, and then we create the grid search.
07:11 The first argument we provide is the model, rf1.
07:18 Then scoring equal to "f1": instead of plain accuracy, the candidates are compared on the F1 score, which is based on precision and recall.
07:33 Then param_grid is random_grid, that is, all the combinations we want to try.
07:40 Then cv equal to 3, so the data is divided into 3 folds.
07:43 Then verbose: verbose equal to 2 prints progress lines such as "Fitting 3 folds for each of 144 candidates".
07:52 Then n_jobs, which controls how many processors the grid search uses in parallel.
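The grid search setup as narrated; n_jobs=-1 (use all processors) is an assumption:

```python
from sklearn.model_selection import GridSearchCV

rf1 = RandomForestClassifier()
grid_search = GridSearchCV(
    estimator=rf1,           # the model
    param_grid=random_grid,  # all combinations to try
    scoring="f1",            # compare candidates on F1 score
    cv=3,                    # 3-fold cross-validation
    verbose=2,               # print progress for each fit
    n_jobs=-1,               # assumed: run fits in parallel
)
```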
08:00 Now, what happens is: n_estimators 100 with auto features and depth 10; then 100 with sqrt and 10; then 100 with log2 and 10.
08:20 Then 100 with auto and 20; 100 with sqrt and 20; and so on, until every possible combination has been tried.
08:30 Based on the F1 score, we decide between them: the entire data is divided into three folds, each parameter combination is scored, and from that we get the best parameters.
08:48 So here we train the GridSearchCV: we provide it the model and fit it on the training data.
08:56 Then we extract the best parameters.
08:59 How do we extract them? The grid search model we created has an inbuilt attribute, best_params_, and with that keyword we extract them.
09:07 It shows the best parameters: which max_depth to take, which max_features, which min_samples_split, which min_samples_leaf, and so on.
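Fitting the grid search and extracting the winning combination:

```python
grid_search.fit(X_train, y_train)
# verbose=2 prints: Fitting 3 folds for each of 144 candidates, totalling 432 fits

best_params = grid_search.best_params_  # inbuilt attribute with the best combination
print(best_params)
# e.g. {'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 4, ...}
```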
09:20 Based on these, we create a new random forest classifier, and here we provide the best parameters: max_depth 10, max_features auto, min_samples_leaf 4, and so on.
09:33 For this, we use the double-star unpacking: RandomForestClassifier(**best_params). The best parameters we saved are unpacked into the classifier as keyword arguments.
09:48 With the classifier created this way, we fit it on the training data, then predict on the testing data.
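The final tuned model via ** unpacking, as a sketch:

```python
rf_tuned = RandomForestClassifier(**best_params)  # unpack dict as keyword arguments
rf_tuned.fit(X_train, y_train)
y_pred_tuned = rf_tuned.predict(X_test)
print(accuracy_score(y_test, y_pred_tuned))  # about 0.88 in the video
```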
09:57 Okay. We predict and find the accuracy: 88%, compared to the normal 86% we had before tuning.
10:24 So the tuned model is the better predictor.
10:25 So, when the random forest overfits, we can apply hyperparameter tuning like this to improve it.
10:34 That is the modeling and evaluation of a random forest in Python.