Technical report | Cross-validation is Insufficient for Model Validation
Cross-validation is the de facto standard for model validation in the machine learning community. It reduces the risk of overfitting and provides an unbiased estimate of the learning algorithm's predictive performance. Some have argued that cross-validation, coupled with sophisticated statistical learning methods, has rendered traditional scientific practices irrelevant. In this report, we review the foundations of cross-validation and draw attention to common, but underappreciated, assumptions. We argue that cross-validation is unsuitable for dealing with realistic complications like missing data, theory-laden observations, and malicious input. As a solution, we advocate for a holistic approach to model validation that embraces validation of data quality, acknowledgement of the role of subjective judgement in model assessment, and the use of extended peer review.
Cross-validation is the de facto standard for model validation in the statistical learning and machine learning communities. Data is split into a training set that calibrates the statistical model, and an independent test set that is used to estimate the model's predictive performance. Given the popularity of cross-validation, it is critical to identify any implicit assumptions or limitations of the method.
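As a point of reference, the basic splitting procedure can be sketched in a few lines of Python. This is a minimal illustration only: the function name `k_fold_indices` and its parameters are our own hypothetical choices, not part of any particular library, and the sketch assumes the samples can be shuffled freely (that is, that they are exchangeable).

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper for illustration. Assumes samples are
    exchangeable, so a random shuffle is a valid way to form folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each sample appears in exactly one test fold, and the model is never evaluated on a sample it was calibrated on. Everything that follows concerns what this guarantee does and does not imply.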
We argue that cross-validation is unsuitable as a universal method for model assessment. Despite high cross-validation accuracy, artificial neural networks that achieve human-level accuracy in image recognition are vulnerable to adversarial examples, meaning that images become misclassified after minuscule manipulation. Likewise, Google Flu Trends was able to accurately predict influenza outbreaks for several years before the model suddenly mispredicted outbreak timings and intensities. These examples show that strong cross-validation performance does not guarantee that the model has truly learnt about the phenomena of interest.
Cross-validation assumes that samples are independent and identically distributed, an assumption that regularly fails because of hierarchical structure in the model, spatial or temporal correlations in the data, or non-stationary (time-varying) system dynamics. Cross-validation itself is unable to detect these violations and may therefore provide an unrealistically optimistic assessment of predictive performance.
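The optimism can be made concrete with a deliberately contrived toy example. In the sketch below, samples come in tight clusters (groups), the label depends only on the group, and a 1-nearest-neighbour classifier is scored twice: once with folds that ignore the grouping, and once with folds that hold out whole groups. The data, the label pattern, and the helper names (`one_nn_predict`, `cv_accuracy`) are all assumptions made for illustration.

```python
def one_nn_predict(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def cv_accuracy(samples, fold_of, k):
    """k-fold accuracy of a 1-nearest-neighbour classifier.

    samples: list of (x, label, group) tuples.
    fold_of: function (index, group) -> fold id in range(k).
    """
    correct = 0
    for fold in range(k):
        train = [(x, y) for i, (x, y, g) in enumerate(samples) if fold_of(i, g) != fold]
        test = [(x, y) for i, (x, y, g) in enumerate(samples) if fold_of(i, g) == fold]
        correct += sum(one_nn_predict(train, x) == y for x, y in test)
    return correct / len(samples)

# Synthetic data (an assumption for illustration): 20 groups of 10 samples.
# Samples within a group are nearly identical; the label is fixed per group.
samples = []
for g in range(20):
    label = [0, 1, 1, 0][g % 4]
    for j in range(10):
        samples.append((10.0 * g + 0.01 * j, label, g))

K = 5
# Naive folds ignore groups, so each test sample has near-duplicates in training.
naive_acc = cv_accuracy(samples, lambda i, g: i % K, K)
# Group-wise folds hold out entire groups, mimicking prediction on unseen groups.
grouped_acc = cv_accuracy(samples, lambda i, g: g % K, K)
print("naive folds:  ", naive_acc)
print("grouped folds:", grouped_acc)
```

With the naive folds, every test sample has a near-duplicate from its own group in the training set, so the classifier appears perfect. With whole groups held out, it does no better than guessing. Nothing in the naive score signals that the independence assumption has failed.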
The limitations of assuming independent and identically distributed samples can be overcome by using modified cross-validation procedures. For example, hierarchical models can be tested by performing cross-validation for each level of the hierarchy, and time series can be validated using out-of-sample forecasting with a rolling time window. However, this still requires the correct sampling structure to be identified, which may not be known a priori.
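The rolling-window scheme for time series might be sketched as follows. The function name `rolling_window_splits` and the `window` and `horizon` parameters are hypothetical; the sketch assumes the samples are already ordered in time and that a fixed-length training window is appropriate for the problem.

```python
def rolling_window_splits(n_samples, window, horizon=1):
    """Yield (train, test) index lists for out-of-sample forecasting.

    Hypothetical helper for illustration. Samples are assumed to be in
    time order. Each training set is the `window` points immediately
    preceding the `horizon` points being forecast.
    """
    start = 0
    while start + window + horizon <= n_samples:
        train = list(range(start, start + window))
        test = list(range(start + window, start + window + horizon))
        yield train, test
        start += horizon  # slide the window forward past the tested points

ts_splits = list(rolling_window_splits(n_samples=10, window=4, horizon=2))
for train, test in ts_splits:
    print("train:", train, "test:", test)
```

Unlike shuffled folds, every test point lies strictly after its training window, so the model is never evaluated using information from the future. The choice of window length and horizon is itself a modelling decision that this procedure cannot validate.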
Data quality is another fundamental issue. Supervised learning is predicated on having access to the ground truth (the true value or label of the samples). For complex problems the ground truth may be uncertain or contentious. For example, the definitions of diseases in medical science change over time: diseases may be split into separate classes, merged into a spectrum, or redefined as new knowledge is acquired. When the ground truth is contentious, test set accuracy is not meaningful as an objective indicator of model correctness, and is better thought of as a check for model consistency.
Data sampling can be unrepresentative of the desired population because of social biases that affect the experimental design or other systematic patterns of missing data. The image classifier in Google Photos mislabelled African Americans as gorillas, while the COMPAS software used to inform court sentencing in the United States was allegedly found to be harsher on African American defendants than Caucasian defendants. Addressing these social and data biases is an active area of research and cannot be meaningfully addressed with cross-validation alone.
An uncritical application of cross-validation leaves the statistical learning and machine learning communities at risk of "Big Data Hubris", "the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis" [Lazer, David, et al. "The parable of Google Flu: traps in big data analysis." Science 343.6176 (2014): 1203-1205]. Cross-validation can be strengthened by supplementing it with traditional data analysis and sampling techniques.
Statistical learning often treats data collection as a passive process. Greater emphasis on the design of experiments, randomized controlled experiments, and instrumentation would reduce the incidence of measurement artefacts and unbalanced data sets that oversample particular sub-groups. These considerations would improve model robustness.
To mitigate social bias, we advocate for the use of diverse teams and extended peer review. Inclusive teams are more likely to identify potential sources of bias and provide stricter validation of the model's performance than cross-validation alone. For instance, algorithms used for job hiring could be reviewed by equality groups or legal departments. Social bias could be identified through subgroup analysis, although we believe causal models are superior because of their ability to properly identify confounding factors.
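In its simplest form, the subgroup analysis mentioned above amounts to reporting test performance separately for each subgroup rather than as a single aggregate. The sketch below uses a hypothetical helper, `accuracy_by_subgroup`, and made-up toy labels chosen purely for illustration. It is not a substitute for the causal analysis we recommend.

```python
from collections import defaultdict

def accuracy_by_subgroup(y_true, y_pred, subgroup):
    """Return {subgroup: accuracy} for parallel lists of labels.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, subgroup):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}

# Toy values (made up): subgroup "B" is under-sampled relative to "A".
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
groups = ["A"] * 8 + ["B"] * 2

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_subgroup(y_true, y_pred, groups)
print("overall:", overall)
print("by subgroup:", per_group)
```

In this toy case the aggregate accuracy looks strong because the majority subgroup dominates it, while the under-sampled subgroup fares much worse. A single cross-validation score would hide the disparity.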
Model validation is a difficult issue and further work is required. We advocate for a holistic approach to model assessment that contextualizes the problem, uses extended peer review, and remains grounded in deductive reasoning.