Cross-Validation
 Summary

Cross-validation is a statistical method that estimates how well a trained model will work on unseen data. The model's performance is validated by training it on one subset of the input data and testing it on a different subset. Cross-validation helps in building a generalized model. Because modeling is iterative, cross-validation is useful for both performance estimation and model selection.
The three steps involved in cross-validation are:
i. Divide the dataset into two parts: one for training and the other for testing.
ii. Train the model with the training dataset.
iii. Evaluate the model's performance using the testing set. If the model doesn't perform well with the testing set, check for issues.
If the model performs well on unseen data, it's consistent and can predict with good accuracy for a wide range of input data; such a model is stable. Cross-validation helps evaluate the stability of machine learning models.
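The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the synthetic one-dimensional data, the midpoint-threshold "model", and the helper names train_test_split, fit_threshold, and accuracy are all made up for this sketch and are not from any library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step i: shuffle the rows and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Step ii: 'train' a toy classifier, the midpoint between the two class means."""
    xs0 = [x for x, label in train if label == 0]
    xs1 = [x for x, label in train if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(threshold, rows):
    """Step iii: fraction of rows whose label is predicted correctly."""
    correct = sum(1 for x, label in rows if (x > threshold) == (label == 1))
    return correct / len(rows)

# Synthetic data (assumed for illustration): class 0 centred at 0, class 1 at 4.
rng = random.Random(42)
data = [(rng.gauss(0, 1), 0) for _ in range(100)]
data += [(rng.gauss(4, 1), 1) for _ in range(100)]

train, test = train_test_split(data, test_fraction=0.25)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(f"threshold={threshold:.2f}, test accuracy={test_accuracy:.2f}")
```

The test rows play no part in fitting the threshold, so the reported accuracy is an estimate of performance on unseen data.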
Discussion
What's the need for cross-validation?
Cross-validation is needed to build a generalized model that works well on unseen data. This is how it's done:
Split the data randomly into three parts: Training data (65%), Validation data (20%), and Test data (the remaining 15%). Model building doesn't involve the test data; it's kept aside as 'unseen' data to verify and report the model's accuracy.
Let's say a model (kNN) is built using the Training data and is optimized (the optimum k, the number of neighbours, is arrived at) using the Validation data. Only 65% of the available data is then used for model building, which wastes data. With cross-validation, 85% (Training + Validation data) can be used to build the model. Here's how:
The k-fold concept is now applied by dividing the combined Training and Validation data into equal parts, say 5 parts of about 17% each. The model is trained 5 times, each time with a different 17% part as Validation data and the remaining 68% as Training data. An aggregation mechanism such as averaging is applied to arrive at the final optimal model. The final model, built with 85% of the data, is checked for accuracy with the Test data. This ensures that a 'generalized' model is built that works well for unseen data too.
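A minimal sketch of this workflow using scikit-learn is shown below. The Iris dataset, the candidate values for n_neighbors, and the random seed are assumptions chosen for illustration; note that the k in kNN (number of neighbours) is unrelated to the k in k-fold (number of folds, set by cv=5 here).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back 15% as Test data; the remaining 85% is used for model building.
X_build, X_test, y_build, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

# 5-fold cross-validation over the 85%: each fold (about 17% of all data)
# serves once as Validation data while the other four folds train the model.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_build, y_build)  # by default, refits the best model on all of X_build

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print(f"best n_neighbors = {best_k}, mean CV accuracy = {search.best_score_:.3f}")
print(f"test accuracy = {test_accuracy:.3f}")
```

The test set is touched only once, after the number of neighbours has been chosen, so it still acts as unseen data.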
What is the difference between Training Data, Validation Data, and Test Data?
For training and testing the model, the dataset must be split into three distinct parts:
 Training Data: The model learns the hidden features and patterns of the dataset from the training data. The model processes this data repeatedly to learn its behavior and adjusts itself to serve the intended purpose. It's basically used to fit the models.
 Validation Data: This is used to validate the model's performance during training. It helps tune the model's hyperparameters and configurations accordingly. The validation data estimates the prediction error for model selection. Validation data also helps detect and prevent overfitting.
 Test Data: After training is complete, the test data confirms that the trained model can make accurate predictions. It's used to assess the generalization error of the final chosen model.
What are the main assumptions behind cross-validation?
The learning dataset that is used to build and evaluate a predictive model is assumed to be a sample from the population of interest. With random subsampling methods, the training set and test set are generated from the learning set. A supervised prediction method is only expected to learn how to predict on unseen samples that are drawn from the same distribution as the training samples; an evaluation of its performance ought to respect this assumption, as in the case of cross-validation with random partitions.
Random cross-validation assumes that a randomly selected set of samples, the test set, represents unseen data well. This assumption doesn't hold when samples are obtained from different experimental conditions.
Which are the commonly used cross-validation techniques?
 Holdout method: The data is separated into training and testing sets. The proportion of training data has to be larger than that of test data. This is used on large datasets, since the model is trained only once and the method is computationally inexpensive.
 Leave-one-out cross-validation (LOOCV): The test data is a single observation from the dataset. Everything else is training data. In each iteration, a different sample is chosen as test data; the remaining samples are training data. This is repeated \(n\) times (\(n\) = number of samples). The average over all iterations gives the test error estimate.
 k-fold: The data is divided into \(k\) sets of near-equal sizes. The first set is the test set; the model is trained on the remaining \(k-1\) sets. The test error rate is calculated by applying the fitted model to the test set. In the second iteration, the second set is the test set and the remaining \(k-1\) sets are the training data. This process continues for all \(k\) sets and the error is calculated for each iteration. The mean of these errors gives the test error estimate.
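The k-fold and LOOCV procedures described above can be sketched in plain Python. The helper names k_fold_indices and cv_error, the tiny dataset, and the "predict the training mean" model are assumptions made purely for illustration. The folds here are taken in order without shuffling; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cv_error(values, k):
    """Average, over k folds, of the test MSE of a 'predict the training mean' model."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(values), k):
        prediction = sum(values[i] for i in train_idx) / len(train_idx)
        mse = sum((values[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print("3-fold error:", cv_error(values, k=3))
print("LOOCV error:", cv_error(values, k=len(values)))  # k equals the number of samples
```

LOOCV falls out as the special case where \(k\) equals the number of samples, so each test set holds exactly one observation.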
How do we choose the value of \(k\) for k-fold cross-validation?
The key configuration parameter for k-fold cross-validation is \(k\), the number of folds into which the given dataset must be split. Commonly, the \(k\) value is chosen as follows:
 Representative: The \(k\) value is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
 By performing a sensitivity analysis for different \(k\) values, the optimal value can be determined. This means evaluating the performance of the same model on the same dataset for different values of \(k\) and seeing how they compare.
 To compare classifiers with similar bias, \(k=2\) works best as it has the lowest variance. To measure error, \(k=5\) or \(k=10\) are less biased than \(k=2\).
 The most common values chosen for \(k\) are 3, 5, and 10, with the most popular being 10, which has been found experimentally to provide a good trade-off between low computational cost and low bias in the estimate of model performance.
 Typically, low \(k\) values result in a noisy estimate of model performance, while large \(k\) values result in a less noisy estimate.
Computation time also increases with higher values of \(k\), since one model must be trained per fold; this matters particularly with large datasets.
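A sensitivity analysis of this kind can be sketched as follows: the same model is evaluated on the same dataset for several values of \(k\). The synthetic linear data, the noise level, and the helper names fit_line and k_fold_mse are assumptions for illustration only.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    mean_x = statistics.fmean(x for x, _ in points)
    mean_y = statistics.fmean(y for _, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_mse(points, k):
    """Return (mean, standard deviation) of the per-fold mean squared errors."""
    fold_errors = []
    for i in range(k):
        test = points[i::k]  # every k-th point, starting at position i
        train = [p for j, p in enumerate(points) if j % k != i]
        slope, intercept = fit_line(train)
        mse = statistics.fmean((y - (slope * x + intercept)) ** 2 for x, y in test)
        fold_errors.append(mse)
    return statistics.fmean(fold_errors), statistics.stdev(fold_errors)

# Synthetic data (assumed): y = 3x + 1 plus Gaussian noise, in random x order.
rng = random.Random(0)
points = [(x, 3.0 * x + 1.0 + rng.gauss(0, 2.0))
          for x in (rng.uniform(0, 10) for _ in range(120))]

results = {}
for k in (2, 3, 5, 10):
    results[k] = k_fold_mse(points, k)
    mean_mse, spread = results[k]
    print(f"k={k:2d}  model fits={k:2d}  mean MSE={mean_mse:.3f}  fold std={spread:.3f}")
```

Each value of \(k\) requires \(k\) model fits, so the cost of the analysis grows with the values tried.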
What's nested cross-validation?
Nested cross-validation works with a double loop: an outer loop that computes an unbiased estimate of the expected accuracy of the algorithm and an inner loop for hyperparameter selection. These two loops are independent of each other.
In a typical example, the outer loop is repeated 5 times, generating 5 different test sets. In each iteration, the outer training set is split further, say into 4 folds. With 5 outer folds and 4 inner folds, a total of 20 models are trained (for each hyperparameter configuration being considered).
The outer layer is used to estimate the quality of models trained on the inner layer. The inner layer is used for selecting the best model (including the best set of hyperparameters). This way, you're assessing not just the quality of the model but also the quality of the procedure for model selection. For each iteration of the outer loop, one and only one inner model is selected, and it's evaluated on the test set of that outer fold. After you vary the outer test set, you'll have 5 estimates that can be averaged to better assess the quality of the models.
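One common way to express this double loop with scikit-learn is to pass a GridSearchCV object (the inner loop) to cross_val_score (the outer loop). The sketch below assumes the Iris dataset, a kNN classifier, and an arbitrary hyperparameter grid; these are illustrative choices, not part of the method itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)  # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimation

# Inner loop: for each outer training set, pick the best n_neighbors by 4-fold CV.
inner_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=inner_cv,
)

# Outer loop: the whole search procedure is rerun on each outer training set
# and scored on the matching outer test fold, giving 5 estimates.
outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv)
print("outer fold accuracies:", outer_scores)
print("mean accuracy:", outer_scores.mean())
```

Because the hyperparameter search is repeated inside every outer fold, the outer scores assess the selection procedure as a whole rather than one fixed model.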
What are the use cases of cross-validation?
Cross-validation can be used to compare the performance of a set of predictive modeling procedures. For example, in optical character recognition, if a Support Vector Machine and k-nearest neighbors are both being considered for predicting the true character from an image of a handwritten character, cross-validation can objectively compare the two methods in terms of their respective fractions of misclassified characters. Simply comparing the methods on their in-sample error rates can mislead, since one method might appear to perform better than the other merely because it fits the training data more closely.
The use of cross-validation is widespread in medical research. Consider using the expression levels of a certain number of proteins, say 15, to predict whether a cancer patient will respond to a specific drug. The goal is to determine which subset of the 15 features produces the best predictive model. Using cross-validation, you can determine the subset that provides the best results.
Data analysts have used cross-validation in medical statistics, where these procedures are useful for meta-analysis.
What are the challenges with cross-validation?
Cross-validation simply provides one additional mapping from training sets to models. Any mapping of this kind constitutes an inductive bias; hence, like any other classification strategy, the performance of cross-validation depends on the environment in which it is applied.
Under ideal conditions, it provides optimum output. But with inconsistent data, it may produce drastically different results. This is one of the biggest disadvantages of cross-validation, since there's no certainty about the kind of data encountered in machine learning.
In predictive modeling, data evolves over time, which can lead to differences between the training and validation sets. For example, if a model is created to predict stock market values by training it on stock values of the previous 5 years, the actual values over the next 5 years could be drastically different.
While k-fold cross-validation is typically the method of choice to estimate the generalization error for small sample sizes, there exists no universal (valid under all distributions) unbiased estimator of the variance of this technique.
What are some tips when implementing cross-validation?
Tip #1: While splitting the data into train and test sets, a good rule of thumb is to use 25% of the dataset for testing. Generally, the ratio can be 80:20, 75:25, 90:10, etc. It's the machine learning engineer who has to make this decision based on the amount of available data.
Tip #2: The data science community has a general rule, based on empirical evidence and various studies, that 5-fold and 10-fold cross-validation should be preferred over LOOCV.
Tip #3: In deep learning, the normal tendency is to avoid cross-validation due to the cost of training \(k\) different models. Instead of doing k-fold or other cross-validation techniques, you could use a random subset of your training data as a holdout for validation purposes.
Tip #4: If the data is of a medical or financial nature, it should be split by person. Avoid having data for one person in both the training and the test set, since this could be considered data leakage.
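Tip #4 can be illustrated with a small plain-Python sketch that assigns whole persons, rather than individual rows, to the train or test side. The record layout, the field name person_id, and the helper split_by_person are hypothetical and exist only for this example.

```python
import random

def split_by_person(records, test_fraction=0.25, seed=0):
    """Split records so that each person's rows fall entirely in train or in test."""
    person_ids = sorted({r["person_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(person_ids)
    n_test = max(1, round(len(person_ids) * test_fraction))
    test_ids = set(person_ids[:n_test])
    train = [r for r in records if r["person_id"] not in test_ids]
    test = [r for r in records if r["person_id"] in test_ids]
    return train, test

# Hypothetical records: 8 people with 3 measurements each.
records = [
    {"person_id": pid, "visit": visit, "value": 10 * pid + visit}
    for pid in range(8)
    for visit in range(3)
]

train, test = split_by_person(records)
train_people = {r["person_id"] for r in train}
test_people = {r["person_id"] for r in test}
print("train people:", sorted(train_people))
print("test people:", sorted(test_people))
```

For the cross-validated version of the same idea, scikit-learn offers GroupKFold in sklearn.model_selection, which keeps each group within a single fold.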
What software packages help implement cross-validation?
Cross-validation techniques can be implemented using Python and the open-source scikit-learn library. For k-fold cross-validation, sklearn.model_selection.KFold can be used. Alternatively, MATLAB supports cross-validation. Some of these cross-validation techniques can be used with the Classification Learner App and the Regression Learner App of MathWorks.
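For the scikit-learn route, a minimal sketch of KFold is given below. The toy arrays and the choice of 5 splits with shuffling are assumptions for illustration; KFold only generates index arrays, and fitting a model on each split is left to the caller.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # Index the arrays with the generated positions to get each split.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: test indices {test_idx.tolist()}")
```

Every sample appears in exactly one test fold, so with 10 samples and 5 splits each fold trains on 8 samples and tests on 2.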
The Keras deep learning library lets you pass one of two parameters to the fit function that performs learning: validation_split or validation_data. The same approach is used in official tutorials of other deep learning frameworks such as PyTorch and MXNet, which suggest splitting the data into three parts: training, validation, and testing.
Cross-validation can also be implemented easily in the R programming language. The statistical metrics used to evaluate the accuracy of regression models are:
 Root Mean Squared Error (RMSE) gives the average prediction error made by the model. A lower RMSE indicates a more accurate model.
 Mean Absolute Error (MAE) gives the average absolute difference between the actual values and the values predicted by the model for the target variable. A lower MAE indicates a better model.
 R-squared (\(R^2\)) reflects the strength of the relationship between the target variable and the model's predictions. A higher \(R^2\) value indicates a better model.
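The three metrics can be written out directly, shown here in Python rather than R to keep one language across the examples. The sample values are made up, and \(R^2\) is computed as the coefficient of determination (1 minus the ratio of residual to total sum of squares), which is one common definition; some packages instead report the squared correlation between actual and predicted values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("R^2:", r_squared(actual, predicted))
```

In a cross-validation run, these metrics would be computed on each test fold and then averaged.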
Milestones
Larson (1931) divides the dataset into two groups, estimates the regression coefficients from one group, and then predicts the criterion scores for the second group. His work is a study of the actual amount of shrinkage in the field of psychological testing. Theoretical statisticians had previously shown that the coefficient of multiple correlation \(R\), derived for a given dataset, has a deceptively large value. If the equation is applied to another dataset, the resulting correlation (sampling errors aside) is lower than the first. An increase in the number of variables in the regression equation leads to greater shrinkage.
Mosier (1951) presents five distinct designs closely related to cross-validation: 1) cross-validation, 2) validity generalization, 3) validity extension, 4) simultaneous validation, and 5) replication. The purpose is to evaluate the predictive validity of linear regression equations used to forecast a performance criterion from scores on a battery of tests. The multiple correlation coefficient in the original sample used to assign values of regression weights gives an optimistic impression of the predictive effectiveness of the regression equation when applied to future observations.
Mosteller and Tukey (1968) develop the idea of cross-validation. Their work comes close to what would later be called k-fold cross-validation.
For the choice and assessment of statistical predictions, Stone (1974) uses a cross-validation criterion. A cross-validatory paradigm with a simple structure is presented. He omits single observations, a method that's later named Leave-One-Out Cross-Validation (LOOCV). While it's assumed that major problems might be encountered in the execution of the cross-validatory paradigm, it's expected that the status of such problems won't be as ambiguous as those associated with the conventional paradigm.
Geisser (1975) presents the method of predictive sample reuse around the same time as M. Stone's cross-validatory method. Geisser's method uses multiple observational omissions (unlike Stone's LOOCV), yielding a desirable degree of flexibility. He gives more relevance to prediction than to parameter estimation for inference, since prediction can be adequately assessed in real situations, unlike parameter estimation. He develops a highly flexible and versatile low-structure predictivistic approach that serves as a complement to the tightly structured Bayes approach. This method, while assuming less, yields less.
Moody and Utans (1994) develop a model for corporate bond rating prediction as a case study of architecture selection procedures. With limited data availability and a lack of complete a priori information, they attempt to select a good neural network architecture to model a specific dataset. Their bond rating study shows that nonlinear networks outperform a linear regression model for a financial application.
Browne (2000) reviews many cross-validation methods, considering the original applications in multiple linear regression first. He assesses structural models for moment matrices. Upon investigating single-sample and two-sample validation indices, it's seen that the optimal number of parameters suggested by both these indices depends on sample size. It's shown how predictive accuracy depends on sample size and the number of predictor variables.
To select the appropriate model from available data, cross-validation is used. Zhang and Yang (2015) focus on selecting a modeling procedure in the regression context through cross-validation. They investigate the relationship between cross-validation performance and the ratio of splitting the data, in terms of modeling procedure selection. In comparing the predictive performance of two modeling procedures, they ensure that a large evaluation set accounts for randomness in the prediction assessment. The relative performance for a reduced sample size is made to resemble that for the full sample size.
Li et al. (2020) derive a cheap and theoretically guaranteed auxiliary/augmented validation technique. It trains models on the given dataset once, making model selection quite efficient. It's also suitable for a wide range of learning settings, owing to the independence of augmentation and out-of-sample estimation from the learning process. The augmented validation set plays a key role in selecting the ideal model. Their validation approach is not only computationally efficient but also effective for validation and easy to apply.
Zhang et al. (2021) propose a Targeted Cross-Validation (TCV) approach for model or procedure selection based on a general weighted loss. TCV is shown to be consistent for selection of the best performing candidate and is potentially advantageous over global cross-validation or the use of local data for modelling a local region. TCV is used to find a candidate method with the best performance for a local region. The flexible framework allows the best candidate to switch with varying sample sizes, and can be applied to high-dimensional data and complex ML scenarios with dynamic relative performances of modelling procedures.
References
 Baheti, Pragati. 2022. "The Train, Validation, and Test Sets: How to Split Your Machine Learning Data." V7 Labs, March 19. Accessed 2022-04-25.
 Berrar, Daniel. 2018. "Cross-Validation." Encyclopedia of Bioinformatics and Computational Biology, Volume 1, Elsevier, pp. 542-545. Accessed 2022-03-28.
 Bordbar, Shayan Tabe, Amin Emad, Sihai Dave Zhao, and Saurabh Sinha. 2018. "A closer look at cross-validation for assessing the accuracy of gene regulatory networks and models." Scientific Reports, 8, Article No. 6620, April 26. Accessed 2022-04-02.
 Bose, Amitraj. 2019. "Cross Validation – Why & How." TowardsDataScience, January 30. Accessed 2022-04-01.
 Browne, Michael W. 2000. "Cross-Validation Methods." Journal of Mathematical Psychology, vol. 44, pp. 108-132. Accessed 2022-03-25.
 Brownlee, Jason. 2020. "How to configure k-fold cross-validation." Machine Learning Mastery, July 31. Updated 2020-08-26. Accessed 2022-03-20.
 Castilla, Jaime Arboleda. 2021. "A step by step guide to Nested Cross-Validation." Analytics Vidhya, March 28. Accessed 2022-04-22.
 Geisser, Seymour. 1975. "The Predictive Sample Reuse Method." Journal of the American Statistical Association, Vol. 70, No. 350, pp. 320-328. Accessed 2022-05-01.
 Grandvalet, Yves, and Yoshua Bengio. 2006. "Hypothesis Testing for Cross-Validation." Department d'Informatique et Recherche Operationnelle, Technical Report 1285, August 29. Accessed 2022-04-22.
 Great Learning. 2020. "What is Cross Validation and its types in Machine Learning?" Great Learning, September 24. Accessed 2022-04-01.
 Kassambara. 2018. "Cross-Validation Essentials in R." STHDA, March 11. Accessed 2022-04-22.
 Lakshana, G V. 2021. "4 Ways to Evaluate your Machine Learning Model: Cross-Validation Techniques (with Python code)." Analytics Vidhya, May 21. Accessed 2022-03-29.
 Larson, S.C. 1931. "The Shrinkage of the coefficient of multiple correlation." Journal of Educational Psychology, 22(1), 45-55. Accessed 2022-03-25.
 Li, Weikai, Chuanxing Geng, and Songcan Chen. 2020. "Leave Zero Out: Towards a No-Cross-Validation Approach for Model Selection." arXiv, v1, December 28. Accessed 2022-05-01.
 Lyashenko, Vladimir, and Abhishek Jha. 2022. "Cross-Validation in Machine Learning: How to Do It Right." Neptune.AI, March 18. Accessed 2022-04-01.
 Marcot, Bruce G, and Anca M Hanea. 2020. "What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?" Computational Statistics, Springer, June 04. Accessed 2022-04-25.
 MathWorks. 2018. "Cross-Validation." MathWorks. Accessed 2022-04-06.
 Moody, John, and Joachim Utans. 1994. "Architecture Selection Strategies for Neural Networks: Application to Corporate Bond Rating Prediction." John Wiley & Sons, Neural Networks in the Capital Markets. Accessed 2022-04-05.
 Mosier, Charles I. 1951. "The need and means of cross-validation. I. Problems and designs of cross-validation." Educational and Psychological Measurement, vol. 11, pp. 5-11. Accessed 2022-03-21.
 Refaeilzadeh, Payam, Lei Tang, and Huan Liu. 2009. "Cross-Validation." In: Liu L., Özsu M.T. (eds), Encyclopedia of Database Systems, Springer, Boston, MA. doi: 10.1007/978-0-387-39940-9_565. Accessed 2022-05-03.
 Rodriguez, Juan Diego, Aritz Perez, and Jose Antonio Lozano. 2010. "Sensitivity Analysis of k-fold cross validation in prediction error estimation." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 3, March. Accessed 2022-04-25.
 Srinivasan. 2018. "Cross-Validation in Machine Learning." Digital Vidya, November 23. Accessed 2022-04-03.
 StackExchange. 2015. "What is the difference between test set and validation set?" StackExchange. Accessed 2022-04-21.
 Stone, M. 1974. "Cross-Validatory Choice and Assessment of Statistical Predictions." Journal of the Royal Statistical Society, Series B, Vol. 36, No. 2, pp. 111-147. Accessed 2022-03-25.
 Wainer, Jacques, and Gavin Cawley. 2021. "Nested cross-validation when selecting classifiers is overzealous for most practical applications." Expert Systems with Applications, Volume 182, 115222, November 15. Accessed 2022-04-25.
 Wikipedia. 2018. "Cross-Validation (Statistics)." Wikipedia. Accessed 2022-03-24.
 Zhang, Yongli, and Yuhong Yang. 2015. "Cross-validation for selecting a model selection procedure." Journal of Econometrics, Elsevier, Volume 187(1), pages 95-112. Accessed 2022-04-04.
 Zhang, Jiawei, Jie Ding, and Yuhong Yang. 2021. "Targeted Cross-Validation." arXiv, September 14. Accessed 2022-05-01.
Further Reading
 Prechelt, Lutz. 1997. "Automatic Early Stopping Using Cross Validation: Quantifying the Criteria." Neural Networks, 11(4):761-767. Accessed 2022-03-28.
 Moore, Andrew W, and Mary S Lee. 1994. "Efficient Algorithms for Minimizing Cross Validation Error." Machine Learning Proceedings, pages 190-198. Accessed 2022-04-05.
 Little, Max A, Gael Varoquaux, Sohrab Saeb, Luca Lonini, Arun Jayaraman, David C Mohr, and Konrad P Kording. 2017. "Using and Understanding cross-validation strategies." GigaScience, Volume 6, Issue 5, May. Accessed 2022-04-03.
See Also
 Stratified Cross-Validation
 Machine Learning Model
 Bias-Variance Tradeoff
 Overfitting and Underfitting
 Sampling and Estimation
 Synthetic Data