19 cross-validation

The problem of capitalizing on chance is a significant one in prediction, and one should always be skeptical of models which are untested beyond the sample from which they were derived.

19.1 revisiting the affairs data

Let’s go back and do what we should have done earlier, that is, examine and think about the “number of affairs” variable. What does the distribution look like?

data(Fair)
Fair %>% count(nbaffairs)
##   nbaffairs   n
## 1         0 451
## 2         1  34
## 3         2  17
## 4         3  19
## 5         7  42
## 6        12  38

We note that the distribution is strongly skewed, and we realize that perhaps we should think about this variable differently: the meaning of the difference between 0 affairs and 1 is not the same as that between 1 and 2, or, perhaps, between 1 and 12.

We can transform the data in several ways. We might perform a square-root transform, truncating the result to an integer:

Fair <- Fair %>% 
    mutate(rootAffair = 
              as.integer(sqrt(nbaffairs)))
Fair %>% count(rootAffair)
##   rootAffair   n
## 1          0 451
## 2          1  70
## 3          2  42
## 4          3  38

Or we could simply distinguish between those who do and don’t have affairs.

Fair <- Fair %>% 
    mutate(affairYN =
               ifelse (nbaffairs > 0,1,0))
Fair %>% count(affairYN)
##   affairYN   n
## 1        0 451
## 2        1 150

We’ll examine each of these in the following paragraphs.

19.2 avoiding capitalizing on chance

One lesson from the last class was that correlations (and regression coefficients) drawn from small samples are not stable. In regression analysis, as progressively smaller samples were drawn from the Fair data, the apparent ability to predict the outcome, within the sample used to fit the model, increased. In the limiting case, when the number of predictors (variables) equaled the number of cases in the sample (rows), in-sample prediction became perfect.
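
A minimal sketch of this limiting case, using simulated data that are pure noise (the names noise and overfit are made up for illustration): once the model estimates as many coefficients, intercept included, as there are cases, the in-sample fit is perfect.

# ten cases, one outcome, and nine noise predictors:
# with the intercept, the model has as many parameters as cases
set.seed(1)
noise <- as.data.frame(matrix(rnorm(10 * 10), nrow = 10))
names(noise) <- c("y", paste0("x", 1:9))
overfit <- lm(y ~ ., data = noise)
# R-squared here is essentially 1: a "perfect" fit to random numbers
summary(overfit)$r.squared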

19.2.1 splitting the data into training and test subsamples

The most basic solution to this problem is to split the data into two groups: a training sample, from which we extract our model, and a test sample, on which we will assess it. (Often this logic is extended to include a third group, a validation sample, used to tune or select among the results of the training run before the test data are examined.) Here, we will take the simpler approach, splitting the Fair data into training and test samples. The critical feature of this analysis is that we hold out the test data, and do not even look at them until our model building is complete.

# establish a seed for your data-split
# so that your results will be reproducible
set.seed(33458)
n <- nrow(Fair)
# sample a set of row numbers whose size
# corresponds to the desired training
# sample (here, 60% of the data)
trainIndex <- sample(1:n, size = round(0.6*n), replace = FALSE)
# create training and test samples
trainFair <- Fair[trainIndex, ]
testFair <- Fair[-trainIndex, ]
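
As a quick sanity check, we can confirm the sizes of the two subsamples; with 601 rows in Fair, this 60/40 split should leave 361 cases for training and 240 for testing:

nrow(trainFair)
nrow(testFair)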

19.3 an example of cross-validated linear regression

We first predict the variable “rootAffair,” using just the training data. Note that we drop nbaffairs and affairYN from the predictor set first, since either one would give the answer away:

trainFair2 <- trainFair %>%
    select(-nbaffairs, -affairYN)
testFair2 <- testFair %>%
    select(-nbaffairs, -affairYN)
model2 <- lm(rootAffair ~ ., data=trainFair2)
summary(model2)
## 
## Call:
## lm(formula = rootAffair ~ ., data = trainFair2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6655 -0.4923 -0.2180  0.1444  2.7593 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.974758   0.385662   5.120 5.03e-07 ***
## sexmale     -0.045067   0.106127  -0.425   0.6713    
## age         -0.014193   0.007835  -1.812   0.0709 .  
## ym           0.038914   0.013901   2.799   0.0054 ** 
## childyes     0.027497   0.116926   0.235   0.8142    
## religious   -0.090619   0.038503  -2.354   0.0191 *  
## education   -0.018241   0.021963  -0.831   0.4068    
## occupation   0.061293   0.030904   1.983   0.0481 *  
## rate        -0.265213   0.042161  -6.291 9.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8224 on 352 degrees of freedom
## Multiple R-squared:  0.1783, Adjusted R-squared:  0.1597 
## F-statistic: 9.549 on 8 and 352 DF,  p-value: 5.595e-12

To assess the effectiveness of this model on an independent sample, we write a simple function that computes R2 for a given model, outcome variable, and data set, and then apply it to both the training data and the test data:

R2.model.dep.data <- function(myModel, myDep, myData) {
    # error scores: observed minus predicted values
    errorscores <- myDep - predict(myModel, myData)
    SS.error <- sum(errorscores^2)
    # deviations of the observed scores from their mean
    deviations <- scale(myDep, scale = FALSE)
    SS.total <- sum(deviations^2)
    # R-squared: proportion of total variability accounted for
    1 - (SS.error/SS.total)
}

R2Train <- R2.model.dep.data(model2,trainFair2$rootAffair,trainFair2)
(R2Test <- R2.model.dep.data(model2,testFair2$rootAffair,testFair2))
## [1] 0.003453435

The R2 on the training sample is 0.1783303; on the test sample it is 0.0034534. The difference between these is sometimes referred to as shrinkage. Shrinkage will be particularly problematic when there are a small number of observations, a large number of predictors (and consequently a complex model), or both.
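
We can compute that drop directly from the two values above (this simple difference is just one way of describing shrinkage):

# drop in R-squared from the training sample to the test sample
# (roughly 0.17 here)
R2Train - R2Test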

19.3.1 applying logistic regression analysis to the training data

Let’s go back to the “number of affairs” variable and consider the second way of thinking about it that we introduced above, that is, simply comparing those who do and don’t have affairs (the dichotomous affairYN variable):

Fair <- Fair %>% 
    mutate(affairYN =
               ifelse (nbaffairs > 0,1,0))
Fair %>% count(affairYN)
##   affairYN   n
## 1        0 451
## 2        1 150

With this change, the regression problem becomes more clearly a problem in classification: how can we best predict which ‘type’ (affair-ers vs. not) a given person falls into?

We have also moved from a variable which is essentially continuous (nbaffairs) to one which is dichotomous and therefore distributed binomially. The appropriate regression is now a logistic one.
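
In logistic regression, the linear predictor is mapped onto the probability scale by the logistic (inverse-logit) function, 1/(1 + exp(-x)). A quick illustration, using a few arbitrary values chosen only to show the shape of this mapping:

# the logistic function maps any real value into the interval (0, 1)
plogis(c(-4, -1, 0, 1, 4))
# equivalent to 1 / (1 + exp(-c(-4, -1, 0, 1, 4)))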

We begin by running this analysis using only the training data.

model2 <- glm(affairYN ~ ., data = trainFair,
              family = "binomial")
summary (model2)
## 
## Call:
## glm(formula = affairYN ~ ., family = "binomial", data = trainFair)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.457e-06  -2.789e-06  -2.564e-06   2.110e-08   1.164e-05  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.601e+01  1.434e+05   0.000    1.000
## sexmale      3.818e-02  3.891e+04   0.000    1.000
## age         -1.303e-02  3.176e+03   0.000    1.000
## ym           1.490e-02  5.403e+03   0.000    1.000
## childyes     2.288e-02  4.479e+04   0.000    1.000
## religious    2.225e-02  1.508e+04   0.000    1.000
## education    3.888e-02  8.532e+03   0.000    1.000
## occupation  -4.976e-02  1.113e+04   0.000    1.000
## rate        -1.357e-01  1.591e+04   0.000    1.000
## nbaffairs   -3.191e+01  4.714e+04  -0.001    0.999
## rootAffair   1.455e+02  1.626e+05   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.0981e+02  on 360  degrees of freedom
## Residual deviance: 3.2775e-09  on 350  degrees of freedom
## AIC: 22
## 
## Number of Fisher Scoring iterations: 25

Our “predicted scores” are continuous, corresponding to the estimated probability that a given person will have an affair. Note, however, that trainFair still contains nbaffairs and rootAffair, which determine affairYN exactly; the model therefore separates the two groups perfectly, which is why the standard errors above are enormous and why nearly all of the fitted probabilities pile up at 0 and 1. (A fairer model would drop those two variables, just as we did for the linear regression.) Here is how the predicted scores are distributed (still on the training data):

predictTrain <- predict(model2, trainFair, type = "response")
summary (predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2548  1.0000  1.0000

As scores on predictTrain increase, the predicted likelihood of an affair increases. If you are considering marriage to a potential partner, what score would be too high? How do we distinguish “ok” from “not”?
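
One common, if arbitrary, answer is to pick a cutoff on the predicted probability and classify everyone above it as a predicted affair-er. A minimal sketch, assuming a cutoff of .5 (the cutoff and the object name predClass are illustrative):

# classify as a predicted affair-er when the fitted probability exceeds .5
predClass <- ifelse(predictTrain > .5, 1, 0)
# cross-tabulate predicted against observed classes (training data only)
table(predicted = predClass, observed = trainFair$affairYN)

Because the training-data model above separates the two groups perfectly, this particular table will look far too good; the real question, as with the linear regression, is how well such a rule holds up on the held-out test data.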