Generalized Additive Models (GAMs)
Generalized Additive Models (GAMs) are an extension of generalized linear models (GLMs) that let you integrate and combine transformations of the input variables, including smoothers such as loess. The basic approach is framed like a glm, but we can specify transforms within the syntax of the model; these transforms are fitted to best predict the outcome, and we can then examine each transformed relationship as part of the model.
A GAM is essentially a regression model, but the gam library permits glms and mixed-effects models as well. Because a binomial glm is logistic regression, and essentially a classifier, the approach generalizes easily to classification.
GAMs have been around since the 1990s, but have recently seen a resurgence as a means of developing more interpretable models. The flexibility of transforming each input gives them predictive power that simpler regression models lack, while the ability to trace each predictor's contribution through the link function makes them easier to understand than something like an SVM.
Libraries used:
gam
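Before turning to real data, here is a minimal sketch of the syntax on synthetic data (the data frame df and its variables are made up for illustration): s() requests a smooth of x1, while x2 enters linearly, exactly as it would in a glm.
library(gam)
set.seed(100)
df <- data.frame(x1 = runif(200), x2 = rnorm(200))
df$y <- sin(2 * pi * df$x1) + 0.5 * df$x2 + rnorm(200, sd = 0.3)
fit <- gam(y ~ s(x1, 4) + x2, data = df)  ## gaussian family by default
par(mfrow = c(1, 2))
plot(fit, se = TRUE)  ## left panel recovers the sine shape; right panel is a line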
Let’s first look at the iphone data set. The gam library works best with numeric predictors and outcomes, so we will do some transformations first:
library(gam)
iphone <- read.csv("data_study1.csv")
values <- iphone[, -1]  ## drop the first (Smartphone) column; keep it in iphone for scoring
values$phone <- 0 + (iphone$Smartphone == "iPhone")  ## convert to a 0/1 number
values$Gender <- 0 + (values$Gender == "female")  ## convert to a number
A basic GAM model
A simple GAM looks basically like a glm. We can see that this initial model gets 69% accuracy, about as good as any of the classifiers we developed (although we would want cross-validation to trust that number). One thing the gam library does is plot the relationship between each predictor and the outcome; in this case, they are all linear predictors. This visualization is important because it lets us assess each relationship individually.
library(DAAG)
model <- gam(phone ~ ., data = values, family = "binomial", trace = T)
GAM lm.wfit loop 1: deviance = 644.3867
GAM lm.wfit loop 2: deviance = 644.1742
GAM lm.wfit loop 3: deviance = 644.1741
confusion(predict(model) > 0, iphone$Smartphone)
Overall accuracy = 0.69
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 0.659 0.341
[2,] 0.295 0.705
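Note that predict() on a binomial model returns values on the link (log-odds) scale by default, so thresholding at 0 corresponds to a predicted probability above 0.5. As a quick sanity check, the same overall accuracy can be recovered by hand (a sketch using the objects defined above):
p <- predict(model, type = "response")  ## predicted probabilities
mean((p > 0.5) == values$phone)  ## matches the 0.69 overall accuracy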
predict(model, type = "terms")
Gender Age Honesty.Humility Emotionality Extraversion
1 -0.4313575 0.067792521 0.08412966 -0.157058567 0.01565584
2 -0.4313575 -0.146609095 -0.02521823 -0.124444639 0.33176044
3 -0.4313575 -0.032261566 -0.24391399 -0.157058567 -0.17400691
4 -0.4313575 0.096379403 0.24815148 -0.059216782 -0.14239646
5 -0.4313575 -0.203782859 -0.07989217 -0.254900351 -0.39528013
6 -0.4313575 0.082085962 0.46684724 0.103852858 -0.07917554
Agreeableness Conscientiousness Openness Avoidance.Similarity
1 0.010378617 -0.12933283 0.024267499 0.37281830
2 0.034797986 -0.10131601 -0.148628215 0.37281830
3 0.038867881 0.03876807 -0.206260119 -0.15777325
4 -0.030320332 -0.38148418 0.139531307 0.10752252
5 -0.022180543 0.09480170 0.053083451 -0.15777325
6 -0.030320332 0.09480170 -0.047772382 0.10752252
Phone.as.status.object Social.Economic.Status Time.owned.current.phone
1 -0.22817602 -0.0002356747 2.383193e-05
2 -0.12808886 -0.0180459500 -4.092360e-05
3 -0.47839392 0.0710054266 4.002082e-05
4 0.22221620 0.0353848759 -4.092360e-05
5 -0.52843750 0.0531951512 -3.282916e-05
6 -0.27821960 -0.0002356747 4.406804e-05
[ reached getOption("max.print") -- omitted 523 rows ]
attr(,"constant")
[1] 0.3985957
par(mfrow = c(2, 2))
plot(model, se = TRUE)
summary(model)
Call: gam(formula = phone ~ ., family = "binomial", data = values,
trace = T)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2383 -1.0875 0.6497 0.9718 1.8392
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 644.1741 on 516 degrees of freedom
AIC: 670.1741
Number of Local Scoring Iterations: 3
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
Gender 1 13.51 13.5102 12.9938 0.0003428 ***
Age 1 11.00 10.9998 10.5794 0.0012184 **
Honesty.Humility 1 13.33 13.3340 12.8244 0.0003745 ***
Emotionality 1 2.00 1.9993 1.9229 0.1661355
Extraversion 1 4.70 4.7002 4.5205 0.0339643 *
Agreeableness 1 0.20 0.1950 0.1876 0.6651349
Conscientiousness 1 1.80 1.7985 1.7297 0.1890294
Openness 1 2.24 2.2434 2.1576 0.1424739
Avoidance.Similarity 1 5.85 5.8526 5.6289 0.0180326 *
Phone.as.status.object 1 6.81 6.8089 6.5487 0.0107804 *
Social.Economic.Status 1 0.07 0.0701 0.0674 0.7952801
Time.owned.current.phone 1 0.00 0.0000 0.0000 0.9996705
[ reached getOption("max.print") -- omitted 1 row ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary() function produces a basic ANOVA table. Based on this, we might want to select a smaller set of variables.
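One way to pull the promising predictors out programmatically is to inspect the parametric ANOVA table stored in the summary (a hedged sketch; it assumes the gam package's summary object exposes this table as parametric.anova, so check str(summary(model)) if the component name differs):
ptab <- summary(model)$parametric.anova  ## assumed component name
rownames(ptab)[which(ptab[, "Pr(>F)"] < 0.05)]  ## candidate predictors
Refitting with a reduced predictor set: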
model2 <- gam(phone ~ Age + Honesty.Humility + Extraversion + Emotionality + Avoidance.Similarity +
    Phone.as.status.object + Social.Economic.Status, data = values, family = "binomial",
    trace = T)
GAM lm.wfit loop 1: deviance = 656.706
GAM lm.wfit loop 2: deviance = 656.4687
GAM lm.wfit loop 3: deviance = 656.4687
GAM lm.wfit loop 4: deviance = 656.4687
confusion(predict(model2) > 0, iphone$Smartphone)
Overall accuracy = 0.669
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 0.634 0.366
[2,] 0.315 0.685
par(mfrow = c(2, 2))
plot(model2, se = TRUE)
summary(model2)
Call: gam(formula = phone ~ Age + Honesty.Humility + Extraversion +
Emotionality + Avoidance.Similarity + Phone.as.status.object +
Social.Economic.Status, family = "binomial", data = values,
trace = T)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9925 -1.1346 0.6809 0.9803 1.9695
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 656.4687 on 521 degrees of freedom
AIC: 672.4687
Number of Local Scoring Iterations: 4
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 11.94 11.9419 11.7262 0.0006648 ***
Honesty.Humility 1 10.74 10.7416 10.5476 0.0012383 **
Extraversion 1 3.66 3.6587 3.5927 0.0585887 .
Emotionality 1 11.99 11.9886 11.7721 0.0006490 ***
Avoidance.Similarity 1 6.61 6.6122 6.4928 0.0111171 *
Phone.as.status.object 1 6.96 6.9596 6.8339 0.0092032 **
Social.Economic.Status 1 0.01 0.0091 0.0089 0.9248144
Residuals 521 530.58 1.0184
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Accuracy suffers a little for the smaller model, but we hope to have gained a more general model that does not overfit.
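To test that hope directly, we could estimate out-of-sample accuracy with cross-validation. A hand-rolled 10-fold sketch (the fold assignment and the 0.5 threshold are illustrative choices, not part of the original analysis):
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(values)))
hits <- rep(NA, nrow(values))
for (k in 1:10) {
    fit <- gam(phone ~ Age + Honesty.Humility + Extraversion + Emotionality +
        Avoidance.Similarity + Phone.as.status.object + Social.Economic.Status,
        data = values[folds != k, ], family = "binomial")
    p <- predict(fit, newdata = values[folds == k, ], type = "response")
    hits[folds == k] <- (p > 0.5) == values$phone[folds == k]
}
mean(hits)  ## out-of-sample accuracy; expect it a bit below the in-sample 0.669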
Incorporating transforms into a GAM
The advantage of a GAM is that we can specify various transforms within the predictor set, such as poly(), log(), and s(), a spline-based smoother (the library also provides lo() for loess-style smoothing). s(variable, n) permits a smoothing function with n degrees of freedom, roughly comparable to an nth-order polynomial. Let's try this with two of the strongest predictors, plus SES: a 4th-order smoothed function of age, a 4th-order polynomial of phone-as-status, and a smoothed function of SES, which we might suspect could be non-linear.
model3 <- gam(phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality +
    Avoidance.Similarity + poly(Phone.as.status.object, 4) + s(Social.Economic.Status, 5),
    data = values, family = "binomial", trace = T)
GAM s.wam loop 1: deviance = 629.5864
GAM s.wam loop 2: deviance = 630.1319
GAM s.wam loop 3: deviance = 630.1627
GAM s.wam loop 4: deviance = 630.1626
GAM s.wam loop 5: deviance = 630.1626
confusion(predict(model3) > 0, iphone$Smartphone)
Overall accuracy = 0.694
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 0.652 0.348
[2,] 0.284 0.716
par(mfrow = c(2, 2))
plot(model3, se = TRUE)
summary(model3)
Call: gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion +
Emotionality + Avoidance.Similarity + poly(Phone.as.status.object,
4) + s(Social.Economic.Status, 5), family = "binomial", data = values,
trace = T)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0999 -1.0671 0.6010 0.9798 1.7414
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 630.1626 on 511.0002 degrees of freedom
AIC: 666.1621
Number of Local Scoring Iterations: NA
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 12.42 12.4229 12.3224 0.0004871 ***
Honesty.Humility 1 9.25 9.2499 9.1751 0.0025775 **
Extraversion 1 4.24 4.2412 4.2069 0.0407693 *
Emotionality 1 9.32 9.3222 9.2468 0.0024801 **
Avoidance.Similarity 1 5.60 5.5981 5.5528 0.0188272 *
poly(Phone.as.status.object, 4) 4 12.31 3.0778 3.0529 0.0166836 *
s(Social.Economic.Status, 5) 1 0.03 0.0268 0.0266 0.8704971
Residuals 511 515.17 1.0082
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for Nonparametric Effects
Npar Df Npar Chisq P(Chi)
(Intercept)
Age
s(Age, 4) 3 19.0641 0.0002651 ***
Honesty.Humility
Extraversion
Emotionality
Avoidance.Similarity
poly(Phone.as.status.object, 4)
s(Social.Economic.Status, 5) 4 2.7624 0.5983422
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2, model3) ##this compares the two models
Analysis of Deviance Table
Model 1: phone ~ Age + Honesty.Humility + Extraversion + Emotionality +
Avoidance.Similarity + Phone.as.status.object + Social.Economic.Status
Model 2: phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality +
Avoidance.Similarity + poly(Phone.as.status.object, 4) +
s(Social.Economic.Status, 5)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 521 656.47
2 511 630.16 9.9998 26.306 0.003349 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the linear and non-linear effects of age separately. By adding some transformed variables, it looks like we improve the fit somewhat. Furthermore, we can see how the input variables impact the prediction through each transform. Notice that age seems to have two 'humps': one around age 20 and one around age 50. This suggests that age impacts iPhone ownership in two ways: younger people may buy them for status, older people perhaps because they have more money or feel the phones are better or simpler, while those between 25 and 40 sit in a too-poor-and-don't-care zone.
The explicit model comparison with anova() is possible because the models are nested. Here, the more complex model is significantly better, which says the additional transforms are worth it.
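To look more closely at the two-humped age effect, we can reconstruct the combined age contribution from the per-term predictions (a sketch; it assumes the columns of the type = "terms" matrix are labeled by formula term, as in the terms output earlier, and uses grep() so the exact label does not matter):
tt <- predict(model3, type = "terms")
age.cols <- grep("Age", colnames(tt))  ## the linear Age term plus s(Age, 4)
age.part <- rowSums(tt[, age.cols, drop = FALSE])
ord <- order(values$Age)
plot(values$Age[ord], age.part[ord], type = "l",
    xlab = "Age", ylab = "Combined age effect (log-odds)")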
Stepwise variable and transform selection
We can use the special step.Gam function to explore trade-offs among different transforms. The scope list gives all the possible transforms to consider for each variable, and the procedure picks the best-fitting version of each.
step.model <- step.Gam(model3, scope = list(Age = ~1 + Age + s(Age, 1) + s(Age, 2) +
    s(Age, 3) + log(Age), Phone.as.status.object = ~1 + s(Phone.as.status.object) +
    s(Phone.as.status.object, 3) + log(Phone.as.status.object), Social.Economic.Status = ~1 +
    s(Social.Economic.Status, 3)), direction = "both", trace = T)
Start: phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion + Emotionality + Avoidance.Similarity + poly(Phone.as.status.object, 4) + s(Social.Economic.Status, 5); AIC= 666.1621
print(step.model)
Call:
gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion +
Emotionality + Avoidance.Similarity + poly(Phone.as.status.object,
4) + s(Social.Economic.Status, 5), family = "binomial", data = values,
trace = T)
Degrees of Freedom: 528 total; 511.0002 Residual
Residual Deviance: 630.1626
summary(step.model)
Call: gam(formula = phone ~ Age + s(Age, 4) + Honesty.Humility + Extraversion +
Emotionality + Avoidance.Similarity + poly(Phone.as.status.object,
4) + s(Social.Economic.Status, 5), family = "binomial", data = values,
trace = T)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0999 -1.0671 0.6010 0.9798 1.7414
(Dispersion Parameter for binomial family taken to be 1)
Null Deviance: 717.6175 on 528 degrees of freedom
Residual Deviance: 630.1626 on 511.0002 degrees of freedom
AIC: 666.1621
Number of Local Scoring Iterations: NA
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
Age 1 12.42 12.4229 12.3224 0.0004871 ***
Honesty.Humility 1 9.25 9.2499 9.1751 0.0025775 **
Extraversion 1 4.24 4.2412 4.2069 0.0407693 *
Emotionality 1 9.32 9.3222 9.2468 0.0024801 **
Avoidance.Similarity 1 5.60 5.5981 5.5528 0.0188272 *
poly(Phone.as.status.object, 4) 4 12.31 3.0778 3.0529 0.0166836 *
s(Social.Economic.Status, 5) 1 0.03 0.0268 0.0266 0.8704971
Residuals 511 515.17 1.0082
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for Nonparametric Effects
Npar Df Npar Chisq P(Chi)
(Intercept)
Age
s(Age, 4) 3 19.0641 0.0002651 ***
Honesty.Humility
Extraversion
Emotionality
Avoidance.Similarity
poly(Phone.as.status.object, 4)
s(Social.Economic.Status, 5) 4 2.7624 0.5983422
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 2))
plot(step.model, se = TRUE)
confusion(predict(step.model) > 0, iphone$Smartphone)
Overall accuracy = 0.694
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 0.652 0.348
[2,] 0.284 0.716
Notice that the ANOVA separates the parametric and non-parametric effects because they need to be tested differently. Overall, the prediction accuracy doesn't change much when we include these higher-order transforms, but they tell us something about the relationships that may be informative. If we find non-linearities, they can reveal important things about the causal relationship that we would never see if we simply assumed all relationships were linear.
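As a final comparison, the stored AIC values of the three models can be lined up directly (a sketch assuming the fitted Gam objects keep an aic component, as glm fits do; the values correspond to the summaries above):
c(full = model$aic, reduced = model2$aic, transformed = model3$aic)
## 670.17, 672.47, 666.16: the model with transforms wins on AIC as well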