Cross-Validation

Article Info

Contributed by
2 authors

Last updated on
2022-05-03 07:10:15


Stratified Cross-Validation
Machine Learning Model
Bias-Variance Trade-off
Overfitting and Underfitting
Sampling and Estimation
Synthetic Data

Article Versions

Version 8, 2022-05-03 07:10:15, by arvindpdmn: Minor improvements to formatting and milestones.
Version 7, 2022-05-03 04:50:22, by arvindpdmn: Updated See Also. Publishing.
Version 6, 2022-05-02 12:22:17, by suchi_shen: Additional Milestones Added
Version 5, 2022-04-29 06:45:21, by suchi_shen: Milestones Re-written
Version 4, 2022-04-25 16:10:35, by suchi_shen: Post-review change: 2

Cross-validation is a statistical method that estimates how well a trained model will work on unseen data. The model's efficiency is validated by training it on a subset of input data and testing on a different subset. Cross-validation helps in building a generalized model. Due to the iterative nature of modeling, cross-validation is useful for both performance estimation and model selection.

The three steps involved in cross-validation are:

i. Divide the dataset into two parts: one for training and the other for testing.

ii. Train the model with the training dataset.

iii. Evaluate the model's performance using the testing set. If the model doesn't perform well with the testing set, check for issues.
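The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the synthetic data, the 25% test fraction, and the helper names (holdout_split, fit_line, mean_squared_error) are assumptions made for the example and don't come from any particular library.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(train):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mean_squared_error(model, rows):
    """Average squared prediction error of the fitted line on rows."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)

# Synthetic data (an assumption for this sketch): y = 2x + 1 plus small noise
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in range(40)]

train, test = holdout_split(data)            # step i: divide the dataset
model = fit_line(train)                      # step ii: train on the training set
test_mse = mean_squared_error(model, test)   # step iii: evaluate on the testing set
print(f"slope={model[1]:.2f}, test MSE={test_mse:.3f}")
```

A test error much larger than the training error would be the cue to "check for issues", such as overfitting or an unrepresentative split.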

If the model performs well on unseen data, it's consistent and can predict with good accuracy for a wide range of input data; this model is stable. Cross-validation helps evaluate the stability of machine learning models.

Discussion

What's the need for cross-validation?
Cross-validation is needed to build a generalized model that works well on unseen data. Consider the following workflow:
Split the data randomly into three parts: Training data (65%), Validation data (20%), and Test data (the remaining 15%). Test data isn't involved in model building; it serves as 'unseen' data to verify and report the model's accuracy.
Suppose a kNN model is built using Training data and tuned (the optimum k is found) using Validation data. Only 65% of the available data is then used for model building, which wastes data. With cross-validation, 85% (Training + Validation data) can be used to build the model.
The k-fold concept is applied by dividing the combined Training and Validation data into equal parts, say 5 parts of about 17% each. The model is trained 5 times, each time with a different 17% part as Validation data and the remaining 68% as Training data. An aggregate such as the average validation score is used to arrive at the optimal model. The final model, built with 85% of the data, is checked for accuracy with Test data. This helps ensure a 'generalized' model that also works well on unseen data.
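The workflow described here can be sketched as follows. It is a minimal illustration, not a reference implementation: the one-dimensional synthetic data, the candidate values for the kNN parameter, and the helper names (knn_predict, accuracy, cv_accuracy) are all assumptions made for the example.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, rows, k):
    """Fraction of rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in rows)
    return hits / len(rows)

def cv_accuracy(rows, k_neighbors, n_folds=5):
    """Mean validation accuracy over n_folds folds."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, validation in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(training, validation, k_neighbors))
    return sum(scores) / n_folds

rng = random.Random(0)
# Synthetic two-class data: class 0 centred at 0, class 1 centred at 3
data = [(rng.gauss(3 * label, 1.0), label) for label in [0, 1] * 100]
rng.shuffle(data)

n_test = len(data) * 15 // 100                      # 15% held back as Test data
test, development = data[:n_test], data[n_test:]    # remaining 85%

candidates = [1, 3, 5, 7, 9]                        # odd values avoid tied votes
cv_scores = {k: cv_accuracy(development, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Final model: kNN with the chosen k, using all 85% of the data
test_accuracy = accuracy(development, test, best_k)
print("best k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

Test data is touched only once, after the number of neighbours has been chosen, so the reported accuracy isn't influenced by the tuning.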
What is the difference between Training Data, Validation Data, and Test Data?
Use of different data sets for training, validation, and testing. Source: Baheti 2022.
For training and testing the model, the dataset must be split into three distinct parts:
- Training Data: The model is trained to learn the hidden features/patterns of the dataset with the training data. The model evaluates the data repeatedly to learn more about the data's behavior, following which, it adjusts itself to serve the intended purpose. It's basically used to fit the models.
- Validation Data: This is used to validate the model's performance during training. It helps tune the model's hyper-parameters and configurations. The validation data estimates the prediction error for model selection, which helps detect over-fitting.
- Test Data: After completion of training, the test data validates that the trained model can make accurate predictions. It's used for assessment of the generalization error of the final chosen model.
What are the main assumptions behind cross-validation?
The learning dataset that is used to build and evaluate a predictive model is assumed to be a sample from the population of interest. With random sub-sampling methods, the training set and test set are generated from the learning set. A supervised prediction method is only expected to learn how to predict on unseen samples that are drawn from the same distribution as training samples; an evaluation of its performance ought to respect this assumption, as in the case of cross-validation with random partitions.
Random Cross-Validation assumes that a randomly selected set of samples, which forms the test set, represents unseen data well. This assumption doesn't hold true when samples are obtained from different experimental conditions.
Which are the commonly used cross-validation techniques?
Commonly used cross-validation techniques. Source: Lyashenko and Jha 2022.
- Hold-out method: The data is separated into training and testing sets. The proportion of training data has to be larger than test data. This is used on large datasets, since the model is trained only once and is computationally inexpensive.
- Leave one out cross-validation (LOOCV): The test data is a single observation from the dataset. Everything else is training data. In each iteration, a different sample is chosen as test data; the remaining samples are training data. This is repeated \(n\) times, where \(n\) is the number of samples. The average over all iterations gives the test error estimate.
- k-fold: The data is divided into \(k\) sets of near-equal sizes. The first set is the test set; the model is trained on the remaining \(k-1\) sets. The test error rate is calculated by applying the fitted model to the test set. In the second iteration, the second set is the test set and the remaining \(k-1\) sets are the training data. This process continues for all \(k\) sets and the error is calculated for each iteration. The mean of these errors gives the test error estimate.
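The index bookkeeping behind k-fold and LOOCV can be sketched as below. The function name k_fold_splits is hypothetical; in practice the sample order would usually be shuffled first, which this sketch omits for readability.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 3-fold split of 10 samples: every sample is in a test set exactly once
splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("test:", test, "train:", train)

# Setting k equal to the number of samples gives leave-one-out
loocv = list(k_fold_splits(4, 4))
```

Hold-out corresponds to taking just one of these train/test pairs, which is why it trains the model only once.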
How do we choose \(k\) value for k-fold cross-validation?
The key-configuration parameter for k-fold cross-validation is \(k\) - the number of folds that the given dataset must be split into. Commonly, \(k\)-value is chosen as follows:
- Representative: The \(k\)-value is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
- Sensitivity analysis: The optimal value can be determined by performing a sensitivity analysis for different \(k\) values. This means evaluating the performance of the same model on the same dataset for different values of \(k\) and seeing how they compare.
- To compare classifiers with similar bias, \(k=2\) works best as it has lowest variance. To measure error, \(k=5\) or \(k=10\) are less biased than \(k=2\).
- Most common values chosen for \(k\) are 3, 5, and 10, with the most popular one being 10, experimentally found to provide good trade-off of low computational cost and low bias in an estimate of model performance.
- Typically, low \(k\) values result in a noisy estimate of model performance, while large \(k\) values result in a less noisy estimate.
The computation time grows with higher values of \(k\), since \(k\) models must be trained, each on a fraction \((k-1)/k\) of the data. This matters particularly with large datasets.
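A sensitivity analysis of this kind might look like the following sketch. The synthetic data, the least-squares line used as the model, and the helper names (fit_line, cv_mse) are assumptions for illustration. For each \(k\) it reports the number of model fits, the training-set size per fit, and the cross-validated mean squared error.

```python
import random

def fit_line(rows):
    """Ordinary least squares for y = a + b*x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def cv_mse(rows, k):
    """k-fold cross-validated mean squared error of the fitted line."""
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        a, b = fit_line(train)
        errors.append(sum((y - (a + b * x)) ** 2 for x, y in test) / len(test))
    return sum(errors) / k

rng = random.Random(1)
# Synthetic data: y = 0.5x plus unit-variance noise
data = [(x, 0.5 * x + rng.gauss(0, 1.0)) for x in range(60)]
rng.shuffle(data)

report = {}
for k in (2, 5, 10, len(data)):              # k equal to n is LOOCV
    train_size = len(data) - len(data) // k  # samples used in each fit
    report[k] = (k, train_size, cv_mse(data, k))
    print(f"k={k:2d}  fits={k:2d}  train size={train_size}  CV MSE={report[k][2]:.3f}")
```

Comparing the rows shows the trade-off: a larger \(k\) uses more data per fit but needs more fits.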
What's nested cross-validation?
Nested Cross-Validation. Source: Castilla 2021.
Nested cross-validation works with a double loop: an outer loop that computes an unbiased estimate of the expected accuracy of the algorithm and an inner loop for hyper-parameter selection. These two loops are independent of each other.
In the example shown, the outer loop is repeated 5 times, generating 5 different test sets. In each iteration, the outer train set is split further (into 4 folds here). With 5 outer folds and 4 inner folds (shown in the figure), a total of 20 models are trained for each hyper-parameter setting considered.
The outer layer is used to estimate the quality of models trained on the inner layer. The inner layer is used for selecting the best model (including best set of hyper parameters). This way, you're not just assessing the quality of the model, but also the quality of procedure for model selection. For each iteration of the outer loop, one and only one inner model is selected that will be evaluated on the test set for the outer fold. After you vary the outer test set, you'll have 5 estimates that can be averaged to better assess quality of the models.
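The double loop can be sketched as follows, using 5 outer folds and 4 inner folds as in the example. The kNN classifier, the synthetic data, the candidate values, and the helper names (folds_of, without, inner_select, nested_cv) are assumptions made for this sketch.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, rows, k):
    return sum(knn_predict(train, x, k) == label for x, label in rows) / len(rows)

def folds_of(rows, n):
    """Split rows into n folds by striding."""
    return [rows[i::n] for i in range(n)]

def without(folds, i):
    """All rows except those in fold i."""
    return [row for j, fold in enumerate(folds) if j != i for row in fold]

def inner_select(rows, candidates, n_inner=4):
    """Inner loop: pick the candidate with the best mean validation accuracy."""
    folds = folds_of(rows, n_inner)
    def mean_score(k):
        return sum(accuracy(without(folds, i), folds[i], k)
                   for i in range(n_inner)) / n_inner
    return max(candidates, key=mean_score)

def nested_cv(rows, candidates, n_outer=5, n_inner=4):
    """Outer loop: score the model selected by the inner loop on each outer fold."""
    outer = folds_of(rows, n_outer)
    results = []
    for i in range(n_outer):
        outer_train = without(outer, i)
        best_k = inner_select(outer_train, candidates, n_inner)
        results.append((best_k, accuracy(outer_train, outer[i], best_k)))
    return results

rng = random.Random(7)
# Synthetic two-class data: class 0 centred at 0, class 1 centred at 3
data = [(rng.gauss(3 * label, 1.0), label) for label in [0, 1] * 50]
rng.shuffle(data)

results = nested_cv(data, candidates=[1, 3, 5, 7])
estimate = sum(score for _, score in results) / len(results)
print("selected k per outer fold:", [k for k, _ in results])
print("nested CV accuracy estimate:", round(estimate, 3))
```

The selected value may differ between outer folds. That is expected, since the estimate describes the whole selection procedure rather than one fixed model.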
What are the use cases of cross-validation?
Cross-validation can be used to compare the performance of a set of predictive modeling procedures. For example, in optical character recognition, suppose Support Vector Machine and k-nearest neighbors are both considered for predicting the true character from an image of a handwritten character. Cross-validation can objectively compare these two methods in terms of their respective fractions of misclassified characters. If the methods were compared simply on their in-sample error rates, one method might appear to perform better than the other only because it fits the training data more closely.
The use of cross-validation is widespread in medical research. Consider using the expression levels of a certain number of proteins, say 15, to predict whether a cancer patient will respond to a specific drug. The goal is to determine which subset of the 15 features produces the best predictive model. Using cross-validation, you can estimate which subset gives the best results.
Data analysts have used cross-validation in medical statistics, with these procedures being useful for meta-analysis.
What are the challenges with cross-validation?
Cross-validation simply provides one additional mapping from training sets to models. Any mapping of this kind constitutes an inductive bias; hence like any other classification strategy, the performance of cross-validation depends on the environment in which it is applied.
Under ideal conditions, it provides optimum output. But with inconsistent data, it may produce drastically different results. This is one of the biggest disadvantages of cross-validation, as there's no certainty about the type of data encountered in machine learning.
In predictive modeling, data evolves over time, so the training and validation sets may differ from the data seen later. For example, if a model has been created to predict stock market values by training it on stock values of the previous 5 years, the actual values for the next 5 years could be drastically different.
While k-fold cross-validation is typically the method of choice to estimate the generalization error for small sample sizes, there exists no universal (valid under all distributions) unbiased estimator of the variance of this technique.
What are some tips when implementing cross-validation?
Tip #1: While splitting the data into train-test set, a good rule of thumb is to use 25% of the data-set for testing. Generally, the ratio can be 80:20, 75:25, 90:10, etc. It's the machine learning engineer who has to take this decision based on the amount of available data.
Tip #2: The Data Science community has a general rule, based on empirical evidence and various studies, that 5- and 10-fold cross-validation should be preferred over LOOCV.
Tip #3: In Deep Learning, the normal tendency is to avoid cross-validation due to the cost associated with training \(k\) different models. Instead of doing k-fold or other cross-validation techniques, you could use a random subset of your training data as a hold-out for validation purposes.
Tip #4: In case the data is of medical or financial nature, it should be split by person. Avoid having data for one person in both the training and the test set, since this could be considered data leakage.
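Tip #4 can be illustrated with a split that assigns whole persons, rather than individual records, to the test set. The record layout (person_id, measurement) and the helper name group_holdout are hypothetical and chosen only for this sketch.

```python
import random

def group_holdout(records, group_of, test_fraction=0.25, seed=0):
    """Split records so that each group lands entirely in train or in test."""
    groups = sorted({group_of(r) for r in records})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in test_groups]
    test = [r for r in records if group_of(r) in test_groups]
    return train, test

# Hypothetical records: 8 persons with 5 measurements each
records = [(f"p{i % 8}", i) for i in range(40)]
train, test = group_holdout(records, group_of=lambda r: r[0])

train_persons = {person for person, _ in train}
test_persons = {person for person, _ in test}
print("persons in test:", sorted(test_persons))
print("overlap:", train_persons & test_persons)   # empty set: no leakage by person
```

The same idea extends to k-fold by distributing persons, instead of records, across the folds.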
What software packages help implement cross-validation?
Cross-validation techniques can be implemented using Python and the open-source scikit-learn library. For k-fold cross-validation, sklearn.model_selection.KFold can be used.
Alternatively, MATLAB supports cross-validation. Some of these cross-validation techniques can be used with the Classification Learner App and the Regression Learner App of MathWorks.
The Keras deep learning library allows you to pass one of two parameters to the fit function that performs learning: validation_split or validation_data. The same approach is used in official tutorials of other DL frameworks such as PyTorch and MXNet, where they suggest splitting the data into three parts: training, validation, and testing.
Cross-validation can be easily implemented using the R programming language. The statistical metrics used to evaluate the accuracy of regression models are:
- Root Mean Squared Error (RMSE) is the square root of the average squared prediction error made by the model. A lower RMSE indicates a more accurate model.
- Mean Absolute Error (MAE) is the average absolute difference between actual values and values predicted by the model for the target variable. A lower MAE indicates a better model.
- R-squared (R2) reflects how closely the model's predictions agree with the observed values of the target variable. A higher R2 indicates a better model.
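The three metrics can be computed as in this sketch. The function names are hypothetical, and R2 is taken here as the coefficient of determination, \(1 - SS_{res}/SS_{tot}\). That is one common definition, and some packages instead report the squared correlation between predicted and observed values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Small made-up example
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

print("RMSE:", round(rmse(actual, predicted), 4))       # 0.3536
print("MAE:", round(mae(actual, predicted), 4))         # 0.25
print("R2:", round(r_squared(actual, predicted), 4))    # 0.975
```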

Milestones

1931

Larson divides the dataset into two groups, estimates the regression coefficients from one group and then predicts the criterion scores for the second group. His work is a study of the actual amount of shrinkage in the field of psychological testing. Theoretical statisticians had previously shown that the coefficient of multiple correlation \(R\), derived for a given dataset, has a deceptively large value. If the regression equation is applied to another dataset, the resulting correlation (apart from sampling errors) is lower than on the first. An increase in the number of variables in the regression equation leads to greater shrinkage.

1951

Mosier presents five distinct designs closely related to cross-validation: 1) cross-validation, 2) validity-generalization, 3) validity extension, 4) simultaneous validation, and 5) replication. The purpose is to evaluate the predictive validity of linear regression equations used to forecast a performance criterion from scores on a battery of tests. The multiple correlation coefficient in the original sample used to assign values of regression weights gives an optimistic impression of the predictive effectiveness of the regression equation when applied to future observations.

1968

Mosteller and Tukey develop the idea of cross-validation. Their work comes close to what would later be called k-fold cross-validation.

1974

For the choice and assessment of statistical predictions, Stone uses a cross-validation criterion. A cross-validatory paradigm with a simple structure is presented. He omits single observations, a method that's later named Leave-One-Out Cross-Validation (LOOCV). While it's assumed that major problems might be encountered in the execution of the cross-validatory paradigm, it's expected that the status of such problems won't be as ambiguous as those associated with the conventional paradigm.

1975

Geisser presents the method of predictive sample reuse around the same time as M. Stone's cross-validatory method. Geisser's method uses multiple observational omissions (unlike Stone's LOOCV), yielding a desirable degree of flexibility. He gives more relevance to prediction than to parameter estimation for inference since prediction can be adequately assessed in real situations, unlike parameter estimation. He develops a highly flexible and versatile low-structure predictivistic approach that serves as a complement to the tightly structured Bayes approach. This method, while assuming less, yields less.

1994

Moody and Utans develop a model for rating bonds (for corporate bond rating prediction) as a case study of architecture selection procedures. With limited data availability and lack of complete a priori information, they attempt to select a good neural network architecture to model any specific dataset. Their bond rating study shows that nonlinear networks outperform a linear regression model for a financial application.

2000

Browne reviews many cross-validation methods, considering the original applications in multiple linear regression first. He assesses structural models for moment matrices. Upon investigating single-sample and two-sample validation indices, it's seen that the optimal number of parameters suggested by both these indices depends on sample size. It's shown how predictive accuracy depends on sample size and the number of predictor variables.

2015

To select the appropriate model from available data, cross-validation is used. Zhang and Yang focus on selecting a modeling procedure in the regression context through cross-validation. They investigate the relationship between cross-validation performance and the ratio of splitting the data, in terms of modeling procedure selection. In comparing the predictive performance of two modeling procedures, they ensure that a large evaluation set accounts for randomness in the prediction assessment. The relative performance for a reduced sample size is made to resemble that for a full sample size.

2020

Li et al. derive a cheap and theoretically guaranteed auxiliary/augmented validation technique. It trains models on the given dataset once, making the selection of model quite efficient. It's also suitable for a wide range of learning settings owing to the independence of augmentation and out-of-sample estimation on the learning process. The augmented validation set plays a key role to select the ideal model. Their validation approach is not only computation-efficient, but also effective for validation and is easy for application.

2021

Zhang et al. propose a Targeted Cross-Validation (TCV) approach for model or procedure selection based on a general weighted loss. TCV is shown to be consistent for selection of the best performing candidate and is potentially advantageous over global cross-validation or use of local data for modelling a local region. TCV is used to find a candidate method with the best performance for a local region. The flexible framework allows the best candidate to switch with varying sample sizes, and can be applied to high-dimensional data and complex ML scenarios with dynamic relative performances of modelling procedures.

Sample Code

rsplus

# Source: http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/
# Accessed 2022-04-22

# Load packages: tidyverse provides the %>% pipe; caret provides
# createDataPartition(), R2(), RMSE() and MAE()
library(tidyverse)
library(caret)

# Split the data into training (80%) and test (20%) sets
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model: linear regression of Fertility on all other variables
model <- lm(Fertility ~ ., data = train.data)
# Make predictions on the test set and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(R2 = R2(predictions, test.data$Fertility),
           RMSE = RMSE(predictions, test.data$Fertility),
           MAE = MAE(predictions, test.data$Fertility))

References


Cite As

Devopedia. 2022. "Cross-Validation." Version 8, May 3. Accessed 2023-11-12. https://devopedia.org/cross-validation


algorithms machine learning artificial intelligence statistics data analysis data mining
