Random Forest is an ensemble of decision trees, called estimators, each of which produces its own predictions. The random forest model combines the predictions of all estimators to produce a single, more accurate prediction.
A training set of dimension N x M (examples x features) is subsampled into many sets of n examples and m features (with n < N and m < M). This subsampling of the training examples, drawn with replacement, is called bagging (bootstrap aggregating).
Therefore, if a Random Forest is composed of K trees, each tree is trained on only m features of n training examples. To make a prediction for a new incoming example, the relevant features of that example are submitted to each of the K trees. This yields K different predictions, which are combined to produce the overall prediction of the random forest: majority voting decides the predicted class.
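The train-on-subsamples, predict-by-majority-vote scheme above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not scikit-learn's actual `RandomForestClassifier` implementation; the dataset, the values of K, n, and m, and the helper `predict` are all chosen here for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

K = 25          # number of estimators
n, m = 150, 6   # subsample sizes: n < N examples, m < M features

trees, feature_sets = [], []
for _ in range(K):
    rows = rng.choice(len(X), size=n, replace=True)        # bagging: sample examples with replacement
    cols = rng.choice(X.shape[1], size=m, replace=False)   # subsample features
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

def predict(x):
    # Submit the relevant features of x to each tree, then take a majority vote.
    votes = [t.predict(x[cols].reshape(1, -1))[0] for t, cols in zip(trees, feature_sets)]
    return np.bincount(votes).argmax()

preds = np.array([predict(x) for x in X])
```

Note that each tree stores its own feature subset, because a tree trained on m features can only score examples described by those same m features.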
Compared with a single decision tree, Random Forest is more robust to overfitting. Increasing the number of estimators slows down training, but overall accuracy is likely to improve. Overall, it is an easy algorithm to tune: better-performing algorithms exist, but they usually take more time to tune properly.
The hyperparameters for this model type are:
- Number of decision trees
- Split criterion on tree nodes
- Class weight
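These three hyperparameters map directly onto scikit-learn's `RandomForestClassifier` arguments (`n_estimators`, `criterion`, `class_weight`). The dataset and the specific values below are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy dataset to make class_weight meaningful
X, y = make_classification(n_samples=300, n_features=12,
                           weights=[0.8, 0.2], random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,         # number of decision trees
    criterion="gini",         # split criterion on tree nodes ("gini" or "entropy")
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=0,
).fit(X, y)

train_accuracy = clf.score(X, y)
```

Here `class_weight="balanced"` counteracts the 80/20 class imbalance by giving minority-class examples more weight in the split criterion.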