Cross Validation
1. Motivation
- We want to know how our model will generalize
- We want to make sure that our results aren't simply the result of getting lucky with a particular train/val/test split
- See also: bootstrapping (statistics)
- Related: jackknifing
2. Procedure
- Split the data into \(k\) groups.
- For each group, hold it out as a validation set and train on the remaining \(k-1\) groups. Evaluate on the held-out group and record the results.
- Summarize the \(k\) evaluation results, e.g. with their mean and standard deviation (see the sketch after this list)
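A minimal sketch of this procedure, assuming a scikit-learn-style workflow (the `KFold` splitter, `Ridge` estimator, and toy data are illustrative choices, not from the notes):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# toy data; any (X, y) and any scikit-learn-style estimator would do
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0)
    model.fit(X[train_idx], y[train_idx])        # train on the other k-1 groups
    preds = model.predict(X[val_idx])            # evaluate on the held-out group
    scores.append(r2_score(y[val_idx], preds))   # record the evaluation result

# summarize the k evaluation results
print(f"mean R^2 = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Each iteration fits a fresh model, so nothing from the held-out group leaks into training.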
3. Types of cross-validation
- exhaustive – every possible division of the dataset into train/val (for a given val-set size) is considered
- leave-p-out – every possible division with a validation set of size \(p\); there are \(\binom{n}{p}\) splits to consider
- leave-one-out – leave-p-out with \(p=1\), giving \(n\) splits
- k-fold – partition the dataset into \(k\) pieces; in turn, treat each piece as the val split and train on the rest
- stratified k-fold – k-fold, but make sure that each val split has the same proportion of each target label (the train split is the complement of the val split, so it preserves the proportions too); see the splitter sketch after this list
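A sketch of the corresponding scikit-learn splitters (the toy data, split counts, and fold sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import LeavePOut, LeaveOneOut, KFold, StratifiedKFold

X = np.arange(24).reshape(12, 2)        # 12 samples, 2 features
y = np.array([0] * 8 + [1] * 4)         # imbalanced labels, to show stratification

splitters = {
    "leave-2-out":        LeavePOut(p=2),                                        # C(12, 2) = 66 splits
    "leave-one-out":      LeaveOneOut(),                                         # 12 splits
    "4-fold":             KFold(n_splits=4, shuffle=True, random_state=0),       # 4 splits
    "stratified 4-fold":  StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
}

for name, cv in splitters.items():
    print(f"{name}: {cv.get_n_splits(X, y)} splits")
    # StratifiedKFold needs y so each val fold keeps the 2:1 label ratio
    for train_idx, val_idx in cv.split(X, y):
        pass  # each iteration yields one train/val division
```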
4. Caveat from the CBMM tutorial (Colin Conwell)
- don't compute a cross-validated summary statistic per fold and then aggregate across folds
- each fold's val set has few data points, so the per-fold statistic is noisy – much higher chance that your explained variance comes out small
- instead, save the predictions for each fold, concatenate them, and then compute a single summary statistic, e.g. a fit statistic such as explained variance, on the entire dataset (see the sketch below)
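A sketch contrasting the two approaches – averaging per-fold statistics vs. pooling out-of-fold predictions – with toy data and a ridge model as illustrative assumptions (not from the tutorial itself):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] + rng.normal(scale=2.0, size=100)    # weak signal, noisy target

kf = KFold(n_splits=10, shuffle=True, random_state=0)
per_fold_r2 = []
oof_pred = np.empty_like(y)                      # out-of-fold predictions

for train_idx, val_idx in kf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    oof_pred[val_idx] = preds                    # save each fold's predictions
    per_fold_r2.append(r2_score(y[val_idx], preds))  # small val set: noisy statistic

print("mean of per-fold R^2:      ", np.mean(per_fold_r2))   # aggregated per-fold statistic
print("R^2 on pooled predictions: ", r2_score(y, oof_pred))  # one statistic on the full dataset
```

The pooled statistic is computed once on all \(n\) held-out predictions, so it avoids the small-sample noise of the per-fold estimates.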