Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) are an advance over glms that lets you incorporate transformations of the input variables, including things like loess smoothing. The basic approach is framed like a glm, but we can specify transforms within the syntax of the model; these transforms are fitted to best predict the outcome, and we can then examine each transformed relationship as part of the model.
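
In symbols, a binomial GAM models the log-odds of the outcome as a sum of fitted functions of the predictors (this is standard GAM notation, not something specific to the analysis below):

\operatorname{logit} P(y = 1) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p)

where each f_j can be the identity (an ordinary linear term), a polynomial, or a smooth estimated from the data.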

A GAM is essentially a regression model, but the gam library supports glm-style families as well. A binomial glm is logistic regression, essentially a classifier, so the approach generalizes easily to classification.

GAMs have been around since the 1990s, but have recently seen a resurgence as a means of developing more interpretable models. The flexibility of numerous input transforms gives them predictive power that simpler regression models lack, while the ability to trace each predictor's contribution through its transform makes them easier to understand than something like an SVM.

Libraries used:

  • gam
  • DAAG (provides the confusion() function used below)

Let’s first look at the iphone data set. The gam library works best with numeric predictors and outcomes, so we will do some transformations first:

library(gam)
iphone <- read.csv("data_study1.csv")
values <- iphone[, -1]
values$phone <- 0 + (iphone$Smartphone == "iPhone")  ## convert to a 0/1 number
values$Gender <- 0 + (values$Gender == "female")  ## convert to a 0/1 number
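
As a quick sanity check on these conversions (our own addition, not part of the original analysis), we can cross-tabulate the new 0/1 variable against the original factor:

table(values$phone, iphone$Smartphone)  ## the 1 row should line up with iPhone owners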

A basic GAM model

A simple GAM looks basically like a glm. We can see that the initial model gets 69% accuracy, which is about as good as any of the classifiers we have developed (of course, we would worry about cross-validation). One thing the gam library provides is a plot of the relationship between each predictor and the outcome; in this case, they are all linear predictors. This visualization is important because it lets us assess each relationship individually.

library(DAAG)
model <- gam(phone ~ ., data = values, family = "binomial", trace = T)
GAM lm.wfit loop 1: deviance = 644.3867 
GAM lm.wfit loop 2: deviance = 644.1742 
GAM lm.wfit loop 3: deviance = 644.1741 
confusion(predict(model) > 0, iphone$Smartphone)
Overall accuracy = 0.69 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.659 0.341
  [2,] 0.295 0.705
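
Since we just flagged cross-validation as a worry, here is a minimal sketch of how we might estimate out-of-sample accuracy with 10-fold cross-validation. The cv.gam.acc helper and the seed are our own choices, not part of the gam library:

cv.gam.acc <- function(form, data, outcome, k = 10) {
    ## randomly assign each row to one of k folds
    folds <- sample(rep(1:k, length.out = nrow(data)))
    preds <- rep(NA, nrow(data))
    for (i in 1:k) {
        ## fit on the other folds; predict the held-out fold
        fit <- gam(form, data = data[folds != i, ], family = "binomial")
        preds[folds == i] <- predict(fit, newdata = data[folds == i, ])
    }
    mean((preds > 0) == (outcome == 1))  ## proportion classified correctly
}
set.seed(100)  ## arbitrary seed, for reproducible folds
cv.gam.acc(phone ~ ., values, values$phone)
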
predict(model, type = "terms")
        Gender          Age Honesty.Humility Emotionality Extraversion
1   -0.4313575  0.067792521       0.08412966 -0.157058567   0.01565584
2   -0.4313575 -0.146609095      -0.02521823 -0.124444639   0.33176044
3   -0.4313575 -0.032261566      -0.24391399 -0.157058567  -0.17400691
4   -0.4313575  0.096379403       0.24815148 -0.059216782  -0.14239646
5   -0.4313575 -0.203782859      -0.07989217 -0.254900351  -0.39528013
6   -0.4313575  0.082085962       0.46684724  0.103852858  -0.07917554
    Agreeableness Conscientiousness     Openness Avoidance.Similarity
1     0.010378617       -0.12933283  0.024267499           0.37281830
2     0.034797986       -0.10131601 -0.148628215           0.37281830
3     0.038867881        0.03876807 -0.206260119          -0.15777325
4    -0.030320332       -0.38148418  0.139531307           0.10752252
5    -0.022180543        0.09480170  0.053083451          -0.15777325
6    -0.030320332        0.09480170 -0.047772382           0.10752252
    Phone.as.status.object Social.Economic.Status Time.owned.current.phone
1              -0.22817602          -0.0002356747             2.383193e-05
2              -0.12808886          -0.0180459500            -4.092360e-05
3              -0.47839392           0.0710054266             4.002082e-05
4               0.22221620           0.0353848759            -4.092360e-05
5              -0.52843750           0.0531951512            -3.282916e-05
6              -0.27821960          -0.0002356747             4.406804e-05
 [ reached getOption("max.print") -- omitted 523 rows ]
attr(,"constant")
[1] 0.3985957
par(mfrow = c(2, 2))
plot(model, se = TRUE)

summary(model)

Call: gam(formula = phone ~ ., family = "binomial", data = values, 
    trace = T)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-2.2383 -1.0875  0.6497  0.9718  1.8392 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 644.1741 on 516 degrees of freedom
AIC: 670.1741 

Number of Local Scoring Iterations: 3 

Anova for Parametric Effects
                          Df Sum Sq Mean Sq F value    Pr(>F)    
Gender                     1  13.51 13.5102 12.9938 0.0003428 ***
Age                        1  11.00 10.9998 10.5794 0.0012184 ** 
Honesty.Humility           1  13.33 13.3340 12.8244 0.0003745 ***
Emotionality               1   2.00  1.9993  1.9229 0.1661355    
Extraversion               1   4.70  4.7002  4.5205 0.0339643 *  
Agreeableness              1   0.20  0.1950  0.1876 0.6651349    
Conscientiousness          1   1.80  1.7985  1.7297 0.1890294    
Openness                   1   2.24  2.2434  2.1576 0.1424739    
Avoidance.Similarity       1   5.85  5.8526  5.6289 0.0180326 *  
Phone.as.status.object     1   6.81  6.8089  6.5487 0.0107804 *  
Social.Economic.Status     1   0.07  0.0701  0.0674 0.7952801    
Time.owned.current.phone   1   0.00  0.0000  0.0000 0.9996705    
 [ reached getOption("max.print") -- omitted 1 row ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary() function produces a basic ANOVA table. Based on this, we might want to select a smaller set of variables:

model2 <- gam(phone ~ Age + Honesty.Humility + Extraversion + Emotionality + Avoidance.Similarity + 
    Phone.as.status.object + Social.Economic.Status, data = values, family = "binomial", 
    trace = T)
GAM lm.wfit loop 1: deviance = 656.706 
GAM lm.wfit loop 2: deviance = 656.4687 
GAM lm.wfit loop 3: deviance = 656.4687 
GAM lm.wfit loop 4: deviance = 656.4687 
confusion(predict(model2) > 0, iphone$Smartphone)
Overall accuracy = 0.669 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.634 0.366
  [2,] 0.315 0.685
par(mfrow = c(2, 2))
plot(model2, se = TRUE)

summary(model2)

Call: gam(formula = phone ~ Age + Honesty.Humility + Extraversion + 
    Emotionality + Avoidance.Similarity + Phone.as.status.object + 
    Social.Economic.Status, family = "binomial", data = values, 
    trace = T)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-1.9925 -1.1346  0.6809  0.9803  1.9695 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 656.4687 on 521 degrees of freedom
AIC: 672.4687 

Number of Local Scoring Iterations: 4 

Anova for Parametric Effects
                        Df Sum Sq Mean Sq F value    Pr(>F)    
Age                      1  11.94 11.9419 11.7262 0.0006648 ***
Honesty.Humility         1  10.74 10.7416 10.5476 0.0012383 ** 
Extraversion             1   3.66  3.6587  3.5927 0.0585887 .  
Emotionality             1  11.99 11.9886 11.7721 0.0006490 ***
Avoidance.Similarity     1   6.61  6.6122  6.4928 0.0111171 *  
Phone.as.status.object   1   6.96  6.9596  6.8339 0.0092032 ** 
Social.Economic.Status   1   0.01  0.0091  0.0089 0.9248144    
Residuals              521 530.58  1.0184                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Accuracy suffers a little for the smaller model, but we hope to end up with a more general model that does not overfit.
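
We could test that hope directly by comparing cross-validated accuracy for the full and reduced models, using the cv.gam.acc helper sketched earlier (again, our own addition):

cv.gam.acc(phone ~ ., values, values$phone)  ## full model
cv.gam.acc(phone ~ Age + Honesty.Humility + Extraversion + Emotionality + Avoidance.Similarity + 
    Phone.as.status.object + Social.Economic.Status, values, values$phone)  ## reduced model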

Incorporating transforms into a GAM

The advantage of a GAM is that we can specify various transforms within the predictor set, such as poly(), log(), and s(), a spline-based smoothing function (the library also provides lo() for loess-style smooths). s(variable, n) fits a smooth with n degrees of freedom, roughly comparable to an nth-order polynomial. Let's try this with two of the strongest predictors: we will add a smoothed function of age with 4 degrees of freedom and a 4th-order polynomial of phone-as-status, plus a smoothed function of SES with 5 degrees of freedom, since we might suspect SES could be non-linear.

model3 <- gam(phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality + 
    Avoidance.Similarity + poly(Phone.as.status.object, 4) + s(Social.Economic.Status, 
    5), data = values, family = "binomial", trace = T)
GAM s.wam loop 1: deviance = 629.5864 
GAM s.wam loop 2: deviance = 630.1319 
GAM s.wam loop 3: deviance = 630.1627 
GAM s.wam loop 4: deviance = 630.1626 
GAM s.wam loop 5: deviance = 630.1626 
confusion(predict(model3) > 0, iphone$Smartphone)
Overall accuracy = 0.694 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.652 0.348
  [2,] 0.284 0.716
par(mfrow = c(2, 2))
plot(model3, se = TRUE)

summary(model3)

Call: gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + 
    Emotionality + Avoidance.Similarity + poly(Phone.as.status.object, 
    4) + s(Social.Economic.Status, 5), family = "binomial", data = values, 
    trace = T)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-2.0999 -1.0671  0.6010  0.9798  1.7414 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 630.1626 on 511.0002 degrees of freedom
AIC: 666.1621 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                                 Df Sum Sq Mean Sq F value    Pr(>F)    
Age                               1  12.42 12.4229 12.3224 0.0004871 ***
Honesty.Humility                  1   9.25  9.2499  9.1751 0.0025775 ** 
Extraversion                      1   4.24  4.2412  4.2069 0.0407693 *  
Emotionality                      1   9.32  9.3222  9.2468 0.0024801 ** 
Avoidance.Similarity              1   5.60  5.5981  5.5528 0.0188272 *  
poly(Phone.as.status.object, 4)   4  12.31  3.0778  3.0529 0.0166836 *  
s(Social.Economic.Status, 5)      1   0.03  0.0268  0.0266 0.8704971    
Residuals                       511 515.17  1.0082                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
                                Npar Df Npar Chisq    P(Chi)    
(Intercept)                                                     
Age                                                             
s(Age, 4)                             3    19.0641 0.0002651 ***
Honesty.Humility                                                
Extraversion                                                    
Emotionality                                                    
Avoidance.Similarity                                            
poly(Phone.as.status.object, 4)                                 
s(Social.Economic.Status, 5)          4     2.7624 0.5983422    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3)  ##this compares the two models
Analysis of Deviance Table

Model 1: phone ~ Age + Honesty.Humility + Extraversion + Emotionality + 
    Avoidance.Similarity + Phone.as.status.object + Social.Economic.Status
Model 2: phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality + 
    Avoidance.Similarity + poly(Phone.as.status.object, 4) + 
    s(Social.Economic.Status, 5)
  Resid. Df Resid. Dev     Df Deviance Pr(>Chi)   
1       521     656.47                            
2       511     630.16 9.9998   26.306 0.003349 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see the linear and non-linear effects of age separately. By adding some transformed variables, it looks like we improve the fit somewhat. Furthermore, we can see the relationships between the input variables and the output as they impact the prediction through the transform. Notice that age seems to have two 'humps': one around age 20 and one around age 50. This suggests that age impacts iPhone ownership in stages: younger people may buy them for status, older people perhaps because they have more money or find the phones better or simpler, while the 25-40 group is stuck in a too-poor-and-don't-care zone.

The explicit model comparison with anova() is possible because we have specified nested models. Here, the more complex model is significantly better; this says the additional transforms are worth their extra degrees of freedom.
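
Because the stepwise search below selects on AIC, it can also be useful to compare the models on AIC directly. Gam fits inherit from glm, so AIC() should apply to them; assuming so, the values should match the AICs printed in the summaries above:

AIC(model2)  ## 672.47 in the summary above
AIC(model3)  ## 666.16 in the summary above; lower is better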

Stepwise variable and transform selection

We can use a special step function, step.Gam(), to explore trade-offs among different transforms. The scope list gives, for each variable, the set of alternative transforms to consider, and the search picks the best-fitting version of each by AIC.

step.model <- step.Gam(model3, scope = list(Age = ~1 + Age + s(Age, 1) + s(Age, 2) + 
    s(Age, 3) + log(Age), Phone.as.status.object = ~1 + s(Phone.as.status.object) + 
    s(Phone.as.status.object, 3) + log(Phone.as.status.object), Social.Economic.Status = ~1 + 
    s(Social.Economic.Status, 3)), direction = "both", trace = T)
Start:  phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality +      Avoidance.Similarity + poly(Phone.as.status.object, 4) +      s(Social.Economic.Status, 5); AIC= 666.1621 
print(step.model)
Call:
gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + 
    Emotionality + Avoidance.Similarity + poly(Phone.as.status.object, 
    4) + s(Social.Economic.Status, 5), family = "binomial", data = values, 
    trace = T)

Degrees of Freedom: 528 total; 511.0002 Residual
Residual Deviance: 630.1626 
summary(step.model)

Call: gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + 
    Emotionality + Avoidance.Similarity + poly(Phone.as.status.object, 
    4) + s(Social.Economic.Status, 5), family = "binomial", data = values, 
    trace = T)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-2.0999 -1.0671  0.6010  0.9798  1.7414 

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 630.1626 on 511.0002 degrees of freedom
AIC: 666.1621 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
                                 Df Sum Sq Mean Sq F value    Pr(>F)    
Age                               1  12.42 12.4229 12.3224 0.0004871 ***
Honesty.Humility                  1   9.25  9.2499  9.1751 0.0025775 ** 
Extraversion                      1   4.24  4.2412  4.2069 0.0407693 *  
Emotionality                      1   9.32  9.3222  9.2468 0.0024801 ** 
Avoidance.Similarity              1   5.60  5.5981  5.5528 0.0188272 *  
poly(Phone.as.status.object, 4)   4  12.31  3.0778  3.0529 0.0166836 *  
s(Social.Economic.Status, 5)      1   0.03  0.0268  0.0266 0.8704971    
Residuals                       511 515.17  1.0082                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
                                Npar Df Npar Chisq    P(Chi)    
(Intercept)                                                     
Age                                                             
s(Age, 4)                             3    19.0641 0.0002651 ***
Honesty.Humility                                                
Extraversion                                                    
Emotionality                                                    
Avoidance.Similarity                                            
poly(Phone.as.status.object, 4)                                 
s(Social.Economic.Status, 5)          4     2.7624 0.5983422    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(step.model, se = TRUE)

confusion(predict(step.model) > 0, iphone$Smartphone)
Overall accuracy = 0.694 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.652 0.348
  [2,] 0.284 0.716

Notice that the ANOVA separates out the parametric and non-parametric effects because they need to be tested differently. Overall, the prediction accuracy doesn't change much when we include these higher-order transforms, but they tell us something about the relationships that may be informative. If we find non-linearities, they can reveal important things about the causal relationship that we would never see if we just assumed all relationships were linear.
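
As a final check (our own addition), we could cross-validate the selected model with the cv.gam.acc helper from earlier; if the higher-order transforms were mostly fitting noise, the cross-validated accuracy would fall back toward that of the simpler models:

cv.gam.acc(phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality + 
    Avoidance.Similarity + poly(Phone.as.status.object, 4) + s(Social.Economic.Status, 5), 
    values, values$phone)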