We have developed a Random Forest model with 10000 trees. The training error is 0, but the validation error is around 34~35. Any thoughts? Do you feel the model has not been trained appropriately?

Answer»

This is a case of overfitting: the model fits the training data perfectly, so the training error is zero or almost zero, yet the error on the validation data is much higher.

When we split the dataset into training and test sets and build the model on the training set, the objective is to validate that model on the test set, which is new and unseen by the model. If the model, based on what it has learned from the features in the training set, also performs well on new data with similar features, that shows it generalizes with low error. Here it does not: a training error of 0 next to a validation error of 34~35 means the model has largely memorized the training data.
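To make the gap between training and validation error concrete, here is a minimal sketch in plain Python. The toy data, the `holdout_split` and `error_rate` helpers, and the `memorizer` model are all hypothetical and are not part of the original question; the memorizer is a stand-in for any model that fits its training data perfectly, not a random forest.

```python
import random


def holdout_split(rows, validation_fraction=0.3, seed=0):
    """Shuffle rows and split them into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def error_rate(predict, rows):
    """Fraction of (features, label) rows that predict() gets wrong."""
    wrong = sum(1 for features, label in rows if predict(features) != label)
    return wrong / len(rows)


# Hypothetical toy data: the label is 1 when x > 50,
# and roughly 20% of the labels are flipped as noise.
rng = random.Random(42)
data = []
for _ in range(400):
    x = rng.uniform(0, 100)
    label = int(x > 50)
    if rng.random() < 0.2:
        label = 1 - label
    data.append(((x,), label))

train, validation = holdout_split(data)


def memorizer(features):
    """Deliberately overfit model: return the label of the closest training row."""
    nearest = min(train, key=lambda row: abs(row[0][0] - features[0]))
    return nearest[1]


train_error = error_rate(memorizer, train)
validation_error = error_rate(memorizer, validation)
print(f"training error:   {train_error:.2f}")
print(f"validation error: {validation_error:.2f}")
```

Because the memorizer reproduces the noisy training labels exactly, its training error is zero, while the same noise hurts it on the held-out rows. That is the same pattern as in the question.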

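In this context, Random Forest is an ensemble of decision trees that can be used for both classification and regression, and the hyperparameters used to build it need to be chosen carefully. The number of trees is one of those parameters, and 10000 is far more than is usually needed, so it can be reduced. Note, however, that adding trees does not by itself normally make a random forest overfit. The overfitting comes mainly from individual trees that are grown to full depth, so the maximum depth, the minimum number of samples per leaf and the number of features considered at each split should be constrained as well. Suitable values can be selected with k-fold cross-validation, where k can be 5, 10 or any number of folds we choose.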
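One way to run such a k-fold search is sketched below with scikit-learn. It assumes a classification task; the synthetic data from `make_classification` and the candidate values in `param_grid` are illustrative assumptions, not values taken from the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; substitute the real dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidate values are illustrative, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],        # None lets every tree grow fully
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                          # 5-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set with the best parameters.
best_model = search.best_estimator_
train_error = 1.0 - best_model.score(X_train, y_train)
validation_error = 1.0 - best_model.score(X_val, y_val)

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"training error: {train_error:.3f}, validation error: {validation_error:.3f}")
```

Comparing `train_error` and `validation_error` for the selected model shows whether the constraints have narrowed the gap. The validation set is kept out of the search so that it remains unseen data.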
