For Linux users, calling parallelStartMulticore(cpus = detectCores()) activates the parallel backend. I've used all the cores here.
r <- resample(learner = bag.lrn,
              task = traintask,
              resampling = rdesc,
              measures = list(tpr, fpr, fnr, acc),
              show.info = TRUE)
#[Resample] Result:
# tpr.test.mean = 0.95,
# fnr.test.mean = 0.0505,
# fpr.test.mean = 0.487,
# acc.test.mean = 0.845
Since this is a binary classification problem, I've used the components of the confusion matrix to check the model's accuracy. With 100 trees, bagging returns an accuracy of 84.5%, which is way better than the baseline accuracy of 75%. Let's now check the performance of random forest.
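The metrics above all come straight from the confusion matrix. Here's a small base-R sketch of how they are computed; the counts are made up for illustration (not taken from the census data), but they produce rates close to the ones reported above:

```r
# Illustrative confusion matrix for a binary problem (hypothetical counts):
#                 predicted <=50K   predicted >50K
# actual <=50K          TP = 950          FN = 50
# actual >50K           FP = 160          TN = 170
TP <- 950; FN <- 50; FP <- 160; TN <- 170

tpr <- TP / (TP + FN)                    # true positive rate (sensitivity)
fnr <- FN / (TP + FN)                    # false negative rate = 1 - tpr
fpr <- FP / (FP + TN)                    # false positive rate
acc <- (TP + TN) / (TP + FN + FP + TN)   # overall accuracy
```

Note that tpr and fnr always sum to 1, which is why reporting both is redundant but convenient for reading off error rates directly.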
#make randomForest learner
> rf.lrn <- makeLearner("classif.randomForest")
> rf.lrn$par.vals <- list(ntree = 100L,
                          importance = TRUE)
> r <- resample(learner = rf.lrn,
                task = traintask,
                resampling = rdesc,
                measures = list(tpr, fpr, fnr, acc),
                show.info = TRUE)
# Result:
# tpr.test.mean = 0.996,
# fpr.test.mean = 0.72,
# fnr.test.mean = 0.0034,
# acc.test.mean = 0.825
On this data set, random forest performs worse than bagging: with the same 100 trees, it returns an overall accuracy of 82.5%. The apparent reason is that it struggles with the negative class. It classifies 99.6% of the positive class correctly, which is better than the bagging algorithm, but it misclassifies 72% of the negative class.
Internally, random forest uses a cutoff of 0.5: if an unseen observation receives a vote fraction above 0.5 for <=50K, it is classified as <=50K. Random forest lets us customize this internal cutoff. Since the false positive rate is very high, we'll raise the cutoff for the positive class (<=50K) and correspondingly lower it for the negative class (>50K), then train the model again.
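To see what the cutoff does, here's a base-R sketch of randomForest's decision rule: the predicted class is the one maximizing votes/cutoff. The vote fractions below are hypothetical, chosen to show an observation that flips classes under the custom cutoff:

```r
# Tree vote fractions for one hypothetical observation
votes <- c("<=50K" = 0.60, ">50K" = 0.40)

default_cutoff <- c(0.50, 0.50)
custom_cutoff  <- c(0.75, 0.25)

# Predicted class = argmax of votes scaled by the cutoff vector
predict_class <- function(votes, cutoff) names(votes)[which.max(votes / cutoff)]

predict_class(votes, default_cutoff)  # "<=50K": 0.60/0.50 beats 0.40/0.50
predict_class(votes, custom_cutoff)   # ">50K":  0.40/0.25 beats 0.60/0.75
```

With cutoff = c(0.75, 0.25), the positive class now needs a much larger vote share to win, which is exactly how the false positive rate gets pushed down.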
#set cutoff
> rf.lrn$par.vals <- list(ntree = 100L,
                          importance = TRUE,
                          cutoff = c(0.75, 0.25))
> r <- resample(learner = rf.lrn,
                task = traintask,
                resampling = rdesc,
                measures = list(tpr, fpr, fnr, acc),
                show.info = TRUE)
#Result:
# tpr.test.mean = 0.934,
# fpr.test.mean = 0.43,
# fnr.test.mean = 0.0662,
# acc.test.mean = 0.846
As you can see, we've improved the accuracy of the random forest model by about 2 percentage points (82.5% to 84.6%), which is slightly higher than the bagging model. Now, let's try to make this model even better.
Parameter Tuning: There are mainly three parameters in the random forest algorithm you should look at when tuning:
- ntree - As the name suggests, the number of trees to grow. The more trees, the more computationally expensive the model is to build.
- mtry - The number of variables to consider at each node split. As mentioned above, the default value is p/3 for regression and sqrt(p) for classification. Values that are too high make the trees more correlated with one another and can lead to overfitting.
- nodesize - The minimum number of observations we want in the terminal nodes. This parameter is inversely related to tree depth: the higher the value, the shallower the trees. With very shallow trees, the model might fail to capture useful signal in the data.
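As a quick sanity check on the mtry defaults mentioned above, here is how they work out in base R; p = 13 is just an illustrative predictor count, not necessarily that of the census data:

```r
# Default mtry used by randomForest (a sketch)
p <- 13                                     # hypothetical number of predictors
mtry_classification <- floor(sqrt(p))       # sqrt(p) for classification -> 3
mtry_regression     <- max(floor(p / 3), 1) # p/3 for regression -> 4
```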
Let's get to the playground and try to improve our model's accuracy further. In the mlr package, you can list all the tuning parameters a learner supports using:
> getParamSet(rf.lrn)
# set parameter space
params <- makeParamSet(
  makeIntegerParam("mtry", lower = 2, upper = 10),
  makeIntegerParam("nodesize", lower = 10, upper = 50)
)
# set validation strategy
rdesc <- makeResampleDesc("CV", iters = 5L)
# set optimization technique
ctrl <- makeTuneControlRandom(maxit = 5L)
# start tuning
> tune <- tuneParams(learner = rf.lrn,
                     task = traintask,
                     resampling = rdesc,
                     measures = list(acc),
                     par.set = params,
                     control = ctrl,
                     show.info = TRUE)
[Tune] Result: mtry=2; nodesize=23 : acc.test.mean=0.858
After tuning, we have achieved an overall accuracy of 85.8%, which is better than our previous random forest model. This is how you can tweak a model and improve its accuracy.
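If the random-search control above feels opaque, the idea behind it can be sketched in a few lines of base R: draw random points from the parameter space, score each one, and keep the best. The evaluate() function below is a made-up stand-in for cross-validated accuracy; a real run would call resample() instead:

```r
set.seed(42)

# Toy surrogate for CV accuracy (hypothetical; peaks near mtry=2, nodesize=23)
evaluate <- function(mtry, nodesize) {
  0.85 - 0.002 * abs(mtry - 2) - 0.0005 * abs(nodesize - 23)
}

best <- list(acc = -Inf)
for (i in 1:5) {                       # maxit = 5 random draws
  mtry     <- sample(2:10, 1)          # same bounds as the param set above
  nodesize <- sample(10:50, 1)
  acc <- evaluate(mtry, nodesize)
  if (acc > best$acc) {
    best <- list(mtry = mtry, nodesize = nodesize, acc = acc)
  }
}
best
```

Random search trades exhaustiveness for speed: with only 5 iterations it samples a tiny fraction of the 9 x 41 grid, which is why tuneParams finishes quickly but may miss the true optimum.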
I'll leave you here. The complete code for this analysis can be downloaded from GitHub.
Summary
Don't stop here! There is still plenty of scope for improvement in this model. Cross-validation accuracy is generally more optimistic than true test accuracy. To make predictions on the test set, minimal preprocessing of the categorical variables is required. Do it and share your results in the comments below.
My aim in creating this tutorial was to get you started with the random forest model and introduce some techniques for improving model accuracy. For a better understanding, I suggest you read more about the confusion matrix. In this article, I've explained the workings of decision trees, random forest, and bagging.
Did I miss out anything? Do share your knowledge and let me know your experience while solving classification problems in comments below.