Why Is Cross-Validation Important?
We often randomly split the dataset into train data and test data to develop a machine learning model. The training data is used to train the ML model and the same model is tested on independent testing data to evaluate the performance of the model.
Changing the random state of the split changes the accuracy of the model, so a single split does not give a stable accuracy estimate. The testing data should be kept independent of the training data so that no data leakage occurs, yet the model's performance still needs to be evaluated while it is being developed on the training data. This is where cross-validation comes into the picture.
Leave-one-out cross-validation (LOOCV)
LOOCV is the special case of leave-p-out cross-validation (LpOCV) with p = 1.
For a dataset with $n$ rows, the 1st row is selected for validation and the remaining $n-1$ rows are used to train the model. In the next iteration, the 2nd row is selected for validation and the rest are used to train the model. The process is repeated for $n$ iterations, or for the desired number of iterations.
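The row-by-row procedure above can be sketched with a small index generator. This is a minimal illustration in plain Python, not a library API; the function name `leave_one_out_splits` is hypothetical, and rows are represented only by their positions.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, validation_indices) for each of the n rows."""
    for i in range(n):
        # Row i is held out; every other row is used for training.
        train = [j for j in range(n) if j != i]
        yield train, [i]


# A dataset with 4 rows produces 4 splits, one per held-out row.
loo_splits = list(leave_one_out_splits(4))
for train, validation in loo_splits:
    print("train:", train, "validation:", validation)
```

Because one model is trained per row, the number of model fits grows linearly with the size of the dataset, which is where the high computation time comes from.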
Both LpOCV and LOOCV are types of exhaustive cross-validation. Exhaustive cross-validation methods are methods that train and test on all possible splits of the data. They share the pros and cons discussed below:
Pros:
- Simple, easy to understand, and implement.
Cons:
- The performance estimate has low bias but can have high variance, because each model is validated on very few rows.
- The computation time required is high.
Holdout cross-validation
Holdout cross-validation randomly splits the dataset into training and validation data, with the split ratio chosen by the analyst. Generally, the training portion is larger than the validation portion. The training data is used to fit the model and the validation data is used to evaluate its performance. The more data is used to train the model, the better the model tends to be, but in the holdout method a good amount of data is withheld from training.
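As a rough sketch of a holdout split, the snippet below shuffles row positions once and cuts them into two parts. The function name `holdout_split`, the 20% validation fraction, and the seed are illustrative assumptions, not values taken from the article.

```python
import random


def holdout_split(n, validation_fraction=0.2, seed=0):
    """Return (train_indices, validation_indices) from one random split."""
    indices = list(range(n))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_validation = int(round(n * validation_fraction))
    return indices[n_validation:], indices[:n_validation]


train_idx, val_idx = holdout_split(10, validation_fraction=0.2, seed=42)
print("train:", train_idx)
print("validation:", val_idx)
```

Changing `seed` changes which rows land in the validation set, which is the source of the varying accuracy described at the start of the article.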
Pros:
- Simple, easy to understand, and implement.
Cons:
- Not suitable for an imbalanced dataset.
- A lot of data is isolated from training the model.
k-fold cross-validation
In k-fold cross-validation, the original dataset is partitioned into $k$ equal-sized subparts, or folds. In each iteration, one fold is selected as validation data and the remaining $k-1$ folds are used as training data.
The process is repeated $k$ times, until each fold has been used once as validation data with the remaining folds as training data.
The final accuracy of the model is computed by taking the mean of the validation accuracies of the $k$ models:
$$ \mathrm{acc}_{\mathrm{cv}} = \sum_{i=1}^{k} \frac{\mathrm{acc}_{i}}{k} $$
Pros:
- The model has low bias
- Lower computation time than the exhaustive methods.
- The entire dataset is utilized for both training and validation.
Cons:
- Not suitable for an imbalanced dataset.
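The k-fold procedure and the mean-accuracy formula can be sketched as follows. The helper names `k_fold_splits` and `mean_accuracy` are hypothetical, the folds are taken as contiguous blocks without shuffling to keep the sketch short, and the five accuracy values are made-up numbers used only to show the arithmetic.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder so that fold sizes differ by at most one row.
    fold_sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n))
        yield train, validation
        start += size


def mean_accuracy(fold_accuracies):
    """Average the per-fold validation accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)


folds = list(k_fold_splits(10, 5))
for train, validation in folds:
    print("train:", train, "validation:", validation)

# Hypothetical per-fold accuracies, for illustration only.
acc_cv = mean_accuracy([0.80, 0.85, 0.90, 0.75, 0.70])
print("mean accuracy:", acc_cv)
```

Every row appears in exactly one validation fold, which is why the entire dataset ends up being used for both training and validation.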
Repeated random subsampling validation
Repeated random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation data. Unlike k-fold cross-validation, the dataset is not divided into groups or folds; instead, each iteration draws a new random split.
The number of iterations is not fixed and is decided by the analyst. The results are then averaged over the $k$ splits:
$$ \mathrm{acc}_{\mathrm{cv}} = \sum_{i=1}^{k} \frac{\mathrm{acc}_{i}}{k} $$
Pros:
- The proportion of train and validation splits is not dependent on the number of iterations or partitions.
Cons:
- Some samples may never be selected for validation, while others may be selected more than once.
- Not suitable for an imbalanced dataset.
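A minimal sketch of repeated random subsampling, assuming a fixed validation fraction and an independent shuffle per iteration, might look like this. The function name `monte_carlo_splits` and its parameters are hypothetical.

```python
import random


def monte_carlo_splits(n, n_iterations, validation_fraction=0.25, seed=0):
    """Yield n_iterations independent random (train, validation) splits."""
    rng = random.Random(seed)
    n_validation = max(1, int(round(n * validation_fraction)))
    for _ in range(n_iterations):
        indices = list(range(n))
        # Each iteration reshuffles all rows, so validation sets may overlap.
        rng.shuffle(indices)
        yield sorted(indices[n_validation:]), sorted(indices[:n_validation])


mc_splits = list(monte_carlo_splits(8, n_iterations=3, seed=1))
for train, validation in mc_splits:
    print("train:", train, "validation:", validation)
```

The validation fraction and the number of iterations are chosen independently here, which is the flexibility listed under the pros above.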
Stratified k-fold cross-validation
The cross-validation techniques discussed above may not work well with an imbalanced dataset. Stratified k-fold cross-validation addresses this problem.
In stratified k-fold cross-validation, the dataset is partitioned into $k$ groups or folds such that each fold keeps approximately the same proportion of each target class label as the full dataset. This ensures that no particular class is over-represented in the validation or training data, which matters especially when the dataset is imbalanced.
The final score is computed by taking the mean of scores of each fold.
Pros:
- Works well for an imbalanced dataset.
Cons:
- Not suitable for a time series dataset.
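One simple way to sketch stratification is to group rows by class and deal each class's rows across the folds in turn. This is an illustrative approach rather than the method of any particular library; the function name `stratified_k_fold_splits` and the toy labels are assumptions.

```python
from collections import defaultdict


def stratified_k_fold_splits(labels, k):
    """Yield (train, validation) splits whose folds mirror the class ratio."""
    # Group row positions by their class label.
    rows_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        rows_by_class[label].append(index)

    # Deal each class's rows round-robin so every fold gets its share.
    folds = [[] for _ in range(k)]
    for rows in rows_by_class.values():
        for position, index in enumerate(rows):
            folds[position % k].append(index)

    all_rows = set(range(len(labels)))
    for fold in folds:
        yield sorted(all_rows - set(fold)), sorted(fold)


# Toy imbalanced labels: 8 rows of class 0 and 4 rows of class 1.
labels = [0] * 8 + [1] * 4
strat_splits = list(stratified_k_fold_splits(labels, k=4))
for train, validation in strat_splits:
    print("validation labels:", [labels[i] for i in validation])
```

With these toy labels, each validation fold holds two rows of class 0 and one row of class 1, matching the 2:1 ratio of the full dataset.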
Time Series cross-validation
The order of the data is very important for time-series problems. For a time-related dataset, a random split or k-fold split of the data into training and validation sets may not yield good results, because the model could be trained on observations that come after the ones it is validated on.
For a time-series dataset, the data is split into training and validation sets according to time, an approach also referred to as the forward chaining method or rolling cross-validation. In each iteration, the instances that immediately follow the training data are used as validation data.
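Forward chaining can be sketched with an expanding training window, assuming the rows are already sorted by time. The function name `forward_chaining_splits` and the `validation_size` parameter are hypothetical choices for this illustration.

```python
def forward_chaining_splits(n, n_splits, validation_size=1):
    """Yield (train, validation) splits where validation always follows train."""
    # The first training window is sized so that all validation windows fit.
    first_train_end = n - n_splits * validation_size
    if first_train_end < 1:
        raise ValueError("not enough rows for the requested number of splits")
    for step in range(n_splits):
        train_end = first_train_end + step * validation_size
        train = list(range(train_end))
        validation = list(range(train_end, train_end + validation_size))
        yield train, validation


ts_splits = list(forward_chaining_splits(6, n_splits=3))
for train, validation in ts_splits:
    print("train:", train, "validation:", validation)
```

No split ever trains on a row that comes after its validation rows, which is the property that random and k-fold splits cannot guarantee.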
Nested cross-validation
In the case of k-fold and stratified k-fold cross-validation, we get a poor estimate of the error if the same folds are also used to tune the hyperparameters, because the same data then influences both model selection and evaluation.
In the earlier methods, hyperparameter tuning is done separately. When cross-validation is used both for tuning the hyperparameters and for estimating the generalization error, nested cross-validation is required: an inner loop tunes the hyperparameters and an outer loop estimates the error.
Nested cross-validation can be applied with both the $k$-fold and stratified $k$-fold variants.
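The two loops of nested cross-validation can be sketched as below. Everything here is an illustrative assumption: the toy 1-D nearest-neighbour classifier, the helper names, the candidate values for `n_neighbors`, and the small hand-made dataset. The point is only the structure, in which the inner loop sees nothing but the outer training rows.

```python
def k_fold_splits(n, k):
    """Yield (train, validation) position lists for k contiguous folds."""
    sizes = [n // k + (1 if f < n % k else 0) for f in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n))
        yield train, validation
        start += size


def knn_predict(train_x, train_y, x, n_neighbors):
    """Toy 1-D k-nearest-neighbour classifier using a majority vote."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)


def accuracy(xs, ys, train_idx, val_idx, n_neighbors):
    """Fit on train_idx and return the accuracy on val_idx."""
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    hits = sum(knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
               for i in val_idx)
    return hits / len(val_idx)


def nested_cv(xs, ys, candidates, outer_k, inner_k):
    """Return the mean outer-fold accuracy with inner-loop tuning."""
    outer_scores = []
    for outer_train, outer_val in k_fold_splits(len(xs), outer_k):

        def inner_score(n_neighbors):
            # Inner loop: evaluate a candidate on the outer training rows only.
            scores = []
            for tr_pos, va_pos in k_fold_splits(len(outer_train), inner_k):
                tr = [outer_train[p] for p in tr_pos]
                va = [outer_train[p] for p in va_pos]
                scores.append(accuracy(xs, ys, tr, va, n_neighbors))
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_score)
        # Outer loop: score the tuned model on rows the tuning never saw.
        outer_scores.append(accuracy(xs, ys, outer_train, outer_val, best))
    return sum(outer_scores) / len(outer_scores)


# Hand-made, well-separated toy data with alternating class labels.
xs = [0.10, 0.90, 0.20, 0.80, 0.15, 0.85, 0.25, 0.75, 0.05, 0.95, 0.30, 0.70]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
nested_score = nested_cv(xs, ys, candidates=[1, 3], outer_k=3, inner_k=2)
print("nested CV accuracy:", nested_score)
```

Because the outer validation rows never take part in choosing `n_neighbors`, the outer score is not inflated by the tuning step.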
Conclusion
Cross-validation is used to compare and evaluate the performance of ML models. In this article, we have covered 8 cross-validation techniques along with their pros and cons. k-fold and stratified k-fold cross-validation are the most used techniques. Time series cross-validation works best with time-series problems.