Mediation Analysis, Correspondence Analysis, and Multiple Correspondence Analysis
The two topics covered in this module are not related, but they are each useful in their own right.
Mediation Analysis
A common type of analysis that lavaan permits is examining the role of mediating variables. There are many tools available specifically for 3-variable problems, but lavaan lets you model arbitrarily complex mediation schemes, although the analysis of these is more ad hoc.
The notion of doing a mediation analysis is central to many areas of psychology, especially in social domains where you cannot have complete experimental control. In these cases, you will always have potentially confounding variables, and mediation analysis examines a particular kind of causal relationship. Some of the papers describing mediation (e.g., Baron & Kenny, 1986) are among the most highly cited scientific papers in any domain.
The notion is that you may observe a correlation between two variables, but also observe that these may be correlated with a third. For example, you might find that as a population ages from teenager to adult, vehicular accidents go down. But, it might not be age per se that causes this. Instead, it might be practice…as you get older, you get more experience driving, and the experience reduces accidents. Experience may be a mediating variable here. Can all the improvement in driver safety be attributed to experience? Maybe, but to find out, you need to study people who started driving at different ages. This would be a classic mediation test.
Mediation Analysis using psych::mediate
Traditionally, mediation analysis was done 'by hand'. It involves creating several regression models (with and without the mediator) and comparing the residual variances, to determine how much of the variability between the two variables is also accounted for by the mediator. The psych library will do this for us, make a diagram, and also carry out a number of tests and estimates of standard error.
Below are two example data sets we can use to examine mediation. Here, using standard terminology, Y is the outcome (accident rate), X is the predictor (age), and M is the mediator variable (experience). By design, in the first set, Y depends on X only via M. In the second (Data2), M and Y each depend on X independently of one another.
set.seed(1234)
X <- rnorm(100)
M <- 0.5 * X + rnorm(100)
Y <- 0.7 * M + rnorm(100)
M2 <- 0.6 * X + rnorm(100)
Y2 <- 0.6 * X + rnorm(100)
Data <- data.frame(X = X, Y = Y, M = M)
Data2 <- data.frame(X = X, Y = Y2, M = M2)
pairs(Data)
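The correlation matrices below (Data first, then Data2) can be reproduced with cor():
cor(Data)
cor(Data2)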
X Y M
X 1.0000000 0.3122748 0.4188856
Y 0.3122748 1.0000000 0.6909791
M 0.4188856 0.6909791 1.0000000
X Y M
X 1.0000000 0.4255321 0.5857204
Y 0.4255321 1.0000000 0.1827901
M 0.5857204 0.1827901 1.0000000
The psych library handles standard mediation analysis using the mediate() function:
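The call below, echoed in the output's Call: line, fits the mediated model, prints the estimates, and draws the path diagram:
library(psych)
mediate(Y ~ X + (M), data = Data)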
Mediation/Moderation Analysis
Call: mediate(y = Y ~ X + (M), data = Data)
The DV (Y) was Y . The IV (X) was X . The mediating variable(s) = M .
Total effect(c) of X on Y = 0.41 S.E. = 0.13 t = 3.25 df= 98 with p = 0.0016
Direct effect (c') of X on Y removing M = 0.04 S.E. = 0.11 t = 0.34 df= 97 with p = 0.73
Indirect effect (ab) of X on Y through M = 0.37
Mean bootstrapped indirect effect = 0.37 with standard error = 0.08 Lower CI = 0.22 Upper CI = 0.54
R = 0.69 R2 = 0.48 F = 44.43 on 2 and 97 DF p-value: 3.73e-18
To see the longer output, specify short = FALSE in the print statement or ask for the summary
Here, the first model shows that the mediated route has substantial effects along both legs, and the effect of X on Y diminishes from c = .41 (total) to c' = .04 (direct) when the indirect route is allowed to have an effect, suggesting strong mediation.
We can reverse this as well. The coefficients change but the conclusion is the same.
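The reversed call (echoed in the Call: line below) swaps predictor and outcome:
mediate(X ~ Y + (M), data = Data)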
Mediation/Moderation Analysis
Call: mediate(y = X ~ Y + (M), data = Data)
The DV (Y) was X . The IV (X) was Y . The mediating variable(s) = M .
Total effect(c) of Y on X = 0.24 S.E. = 0.07 t = 3.25 df= 98 with p = 0.0016
Direct effect (c') of Y on X removing M = 0.03 S.E. = 0.1 t = 0.34 df= 97 with p = 0.73
Indirect effect (ab) of Y on X through M = 0.2
Mean bootstrapped indirect effect = 0.21 with standard error = 0.07 Lower CI = 0.07 Upper CI = 0.36
R = 0.42 R2 = 0.18 F = 10.39 on 2 and 97 DF p-value: 5.42e-06
To see the longer output, specify short = FALSE in the print statement or ask for the summary
In Data2, the data were created so that M did not mediate Y~X.
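Fitting the same mediation model to Data2 (as echoed in the Call: line below):
mediate(Y ~ X + (M), data = Data2)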
Mediation/Moderation Analysis
Call: mediate(y = Y ~ X + (M), data = Data2)
The DV (Y) was Y . The IV (X) was X . The mediating variable(s) = M .
Total effect(c) of X on Y = 0.52 S.E. = 0.11 t = 4.66 df= 98 with p = 1e-05
Direct effect (c') of X on Y removing M = 0.59 S.E. = 0.14 t = 4.29 df= 97 with p = 4.2e-05
Indirect effect (ab) of X on Y through M = -0.07
Mean bootstrapped indirect effect = -0.07 with standard error = 0.09 Lower CI = -0.26 Upper CI = 0.09
R = 0.43 R2 = 0.19 F = 11.21 on 2 and 97 DF p-value: 2.21e-06
To see the longer output, specify short = FALSE in the print statement or ask for the summary
In the second model, there is no correlation along the second leg of the mediated route, and c and c' are essentially the same, indicating M does not mediate the relationship of X to Y.
So, this allows us to determine whether the third variable is merely confounded with the others or is instrumental in the relationship. However, it does not really tell us about causality, because the mediation pattern holds regardless of the direction in which we model it.
Mediation using lavaan
So, for both Data and Data2, we have three variables that are all correlated. In truth, for Data, Y is related only to M, but M is related to X. For Data2 both M and Y are related to X directly.
We use ~ to model direct relationships between observed variables. Here, we will fit a model with a direct relationship between X and Y, and also another link through M. We need to define a parameter representing the product of a and b (the indirect effect), using :=, and a total effect that captures everything.
library(lavaan)
library(semPlot)
model <- " # direct effect
Y ~ c*X
# mediator
M ~ a*X
Y ~ b*M
# indirect effect (a*b)
ab := a*b
# total effect
total := c + (a*b)
"
fitmed <- sem(model, data = Data)
summary(fitmed)
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|)
Y ~
X (c) 0.036 0.104 0.348 0.728
M ~
X (a) 0.474 0.103 4.613 0.000
Y ~
M (b) 0.788 0.092 8.539 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.Y 0.898 0.127 7.071 0.000
.M 1.054 0.149 7.071 0.000
Defined Parameters:
Estimate Std.Err z-value P(>|z|)
ab 0.374 0.092 4.059 0.000
total 0.410 0.125 3.287 0.001
semPaths(fitmed, curvePivot = TRUE, "par", weighted = FALSE, nCharNodes = 7, shapeMan = "rectangle",
    sizeMan = 15, sizeMan2 = 10)
Looking at this first model (where M really mediates X-Y), we see that the edges X-M and M-Y are strong, but X->Y is weak, indicating M mediates the relationship between X and Y. These values are nearly identical (if not exactly the same) to what we obtained in the mediate function, with c=total=.41, but the lavaan model gives more details.
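The same lavaan model can be fit to Data2; the fit object is named fitmed2 to match the semPaths() call further below:
fitmed2 <- sem(model, data = Data2)
summary(fitmed2)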
lavaan 0.6-19 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|)
Y ~
X (c) 0.594 0.136 4.360 0.000
M ~
X (a) 0.748 0.104 7.227 0.000
Y ~
M (b) -0.097 0.107 -0.910 0.363
Variances:
Estimate Std.Err z-value P(>|z|)
.Y 1.218 0.172 7.071 0.000
.M 1.070 0.151 7.071 0.000
Defined Parameters:
Estimate Std.Err z-value P(>|z|)
ab -0.073 0.080 -0.903 0.367
total 0.521 0.111 4.702 0.000
# lavaan.diagram(fitmed2)
semPaths(fitmed2, curvePivot = TRUE, "par", weighted = FALSE, nCharNodes = 7, shapeMan = "rectangle",
    sizeMan = 8, sizeMan2 = 5)
Here, Y-M is not significant, but Y-X and M-X are, indicating that M does not mediate the relationship. Note that the model is framed to test whether a particular parameter is significant–we don't make two models and compare them via ANOVA.
Example: iPhone data
dat <- read.csv("data_study1.csv")
dat$female <- dat$Gender == "female"
dat$phonenumeric <- dat$Smartphone == "iPhone"
round(cor(dat$phonenumeric, dat[, 3:14]), 2)
Age Honesty.Humility Emotionality Extraversion Agreeableness
[1,] -0.17 -0.19 0.16 0.07 -0.04
Conscientiousness Openness Avoidance.Similarity Phone.as.status.object
[1,] 0.02 -0.1 -0.14 0.24
Social.Economic.Status Time.owned.current.phone female
[1,] 0.02 -0.05 0.19
par(mar = c(2, 15, 2, 0))
barplot(abs(cor(dat$phonenumeric, dat[, 3:14])), main = "Correlations with iPhone ownership",
    horiz = T, las = 1)
We can see that age and phone-as-status-symbol both correlate with iPhone ownership. But are they independent? More importantly, can the effect of age be explained by the effect of status-symbol? I'll rescale age to be between 0 and 1 to make the coefficients larger–otherwise this scaling doesn't matter, and unscaled age would produce the same results.
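A sketch of the call producing the output below (the call itself is echoed in the Call: line; the exact rescaling of Age was not shown, so the min-max version here is an assumption):
# assumed min-max rescaling of Age to [0, 1]; the original transform was not shown
dat$Age <- (dat$Age - min(dat$Age))/(max(dat$Age) - min(dat$Age))
psych::mediate(phonenumeric ~ Age + (Phone.as.status.object), data = dat)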
Mediation/Moderation Analysis
Call: psych::mediate(y = phonenumeric ~ Age + (Phone.as.status.object),
data = dat)
The DV (Y) was phonenumeric . The IV (X) was Age . The mediating variable(s) = Phone.as.status.object .
Total effect(c) of Age on phonenumeric = -0.66 S.E. = 0.16 t = -4.06 df= 527 with p = 5.6e-05
Direct effect (c') of Age on phonenumeric removing Phone.as.status.object = -0.47 S.E. = 0.17 t = -2.83 df= 526 with p = 0.0049
Indirect effect (ab) of Age on phonenumeric through Phone.as.status.object = -0.2
Mean bootstrapped indirect effect = -0.2 with standard error = 0.05 Lower CI = -0.3 Upper CI = -0.11
R = 0.27 R2 = 0.07 F = 20.29 on 2 and 526 DF p-value: 1.87e-12
To see the longer output, specify short = FALSE in the print statement or ask for the summary
The test suggests the indirect effect of age on phone choice through status symbol is statistically significant, but so is the direct effect. This suggests that although some of the impact of age on iPhone ownership can be explained by younger people being more likely to treat the phone as a status symbol (and those people tend to buy iPhones), there is also an effect of age directly–even for people with the same level of phone-as-status-symbol, there is an age effect.
How about whether the relationship between gender and phone is mediated by emotionality?
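Again, the call echoed in the output below:
psych::mediate(phonenumeric ~ female + (Emotionality), data = dat)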
Mediation/Moderation Analysis
Call: psych::mediate(y = phonenumeric ~ female + (Emotionality), data = dat)
The DV (Y) was phonenumeric . The IV (X) was female . The mediating variable(s) = Emotionality .
Total effect(c) of female on phonenumeric = 0.2 S.E. = 0.05 t = 4.34 df= 527 with p = 1.7e-05
Direct effect (c') of female on phonenumeric removing Emotionality = 0.15 S.E. = 0.05 t = 3.01 df= 526 with p = 0.0028
Indirect effect (ab) of female on phonenumeric through Emotionality = 0.04
Mean bootstrapped indirect effect = 0.04 with standard error = 0.02 Lower CI = 0 Upper CI = 0.09
R = 0.2 R2 = 0.04 F = 11.42 on 2 and 526 DF p-value: 2.88e-07
To see the longer output, specify short = FALSE in the print statement or ask for the summary
Here, before we can tell whether emotionality mediates the relationship between gender and phone choice, we need to know whether there IS a relationship. The mediate function first tests this, and finds it is significant. When emotionality is removed, the effect is still significant (p = .0028). However, the indirect effect is also significant.
Exercise:
Form a mediation model for other variables within the iphone data.
Correspondence Analysis and Multiple Correspondence Analysis
Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA) are described as analogs to PCA for categorical variables. For example, CA might be useful as an alternative to looking at cross-tabs of categorical variables. It can also be used in concert with inter-rater reliability to establish coding schemes that correspond with one another, or with k-means or other clustering to establish how different solutions correspond.
Resources
- Packages: ca
- corresp() in MASS
- https://www.utdallas.edu/~herve/Abdi-MCA2007-pretty.pdf
- http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/
- Nenadic, O. and Greenacre, M. (2007). Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3), http://www.jstatsoft.org/v20/i03/
Correspondence Analysis (CA) is available via the corresp() function in MASS and via the ca package, which provides better visualizations.
Background and Example
For CA, the basic problem is understanding whether two categorical variables have some sort of relationship. The problem is that, unless they are binary, correlation won't work. One could imagine some sort of matching algorithm to find how categories best align, and then summing the diagonal, but this gets tricky to pull off, especially if there is not a 1-1 mapping. Another problem is that the two variables may not have the same number of levels or groups–can we come up with some sort of correspondence there? This is similar to the problem we face when looking at clustering solutions and trying to determine how well a solution corresponds to some category held outside the model. The clusters don't have labels that will match the secondary variable, but we'd still like to measure how well they did.
Correspondence Analysis uses a PCA-style model to identify dimensions that the two categories correspond to, and maps category labels into that space.
To explore this, let’s start with the iris data, and k-means clustering, just to create an unlabeled set of groups that tends to correspond to our ground truth labeling.
library(MASS)
set.seed(5220)
iris2 <- scale(iris[, 1:4])
model <- kmeans(iris[, 1:4], centers = 3)
table(iris$Species, model$cluster)
1 2 3
setosa 50 0 0
versicolor 0 2 48
virginica 0 36 14
MASS library corresp function
Can we measure how good the correspondence is between our clusters and the true species? The corresp() function in the MASS library implements CA, taking the two sets of labels (or their cross-table) as input.
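The output below is presumably produced by calling corresp() on the cross-table, with nf = 2 requesting two canonical dimensions (the object name is my choice):
camodel <- corresp(table(iris$Species, model$cluster), nf = 2)
camodel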
First canonical correlation(s): 1.0000000 0.7004728
x scores:
[,1] [,2]
setosa -1.4142136 1.547367e-17
versicolor 0.7071068 -1.224745e+00
virginica 0.7071068 1.224745e+00
y scores:
[,1] [,2]
1 -1.4142136 2.480643e-17
2 0.7071068 1.564407e+00
3 0.7071068 -9.588299e-01
The two groups of three labels are each mapped into the same space, which helps us recognize which ones match.
The model can also be specified in formula form. Here, we will make a 5-group k-means to see what that produces.
model <- kmeans(iris[, 1:4], centers = 5)
tmp <- data.frame(spec = iris$Species, cl = model$cluster)
camodel2 <- corresp(~spec + cl, data = tmp, nf = 2)
camodel2
First canonical correlation(s): 1.0000000 0.7004728
spec scores:
[,1] [,2]
setosa -1.4142136 1.840579e-16
versicolor 0.7071068 -1.224745e+00
virginica 0.7071068 1.224745e+00
cl scores:
[,1] [,2]
1 -1.4142136 -1.426398e-16
2 0.7071068 -9.588299e-01
3 -1.4142136 -4.999672e-17
4 0.7071068 1.564407e+00
5 -1.4142136 2.898980e-17
Not surprisingly, although the clusters break up into 5, these map cleanly onto the 3 species they belong to.
ca library ca function
A similar result can be obtained by running ca() on the cross-table.
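A sketch of the calls that produce the two blocks of output below (the default print, then the summary); camodel3 is an assumed name:
library(ca)
camodel3 <- ca(table(tmp$spec, tmp$cl))
camodel3  # default print: principal inertias and standard coordinates
summary(camodel3)  # qualities and contributions, reported in per mille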
Principal inertias (eigenvalues):
1 2
Value 1 0.490662
Percentage 67.08% 32.92%
Rows:
setosa versicolor virginica
Mass 0.333333 0.333333 0.333333
ChiDist 1.414214 1.111752 1.111752
Inertia 0.666667 0.411998 0.411998
Dim. 1 -1.414214 0.707107 0.707107
Dim. 2 0.000000 -1.224745 1.224745
Columns:
1 2 3 4 5
Mass 0.153333 0.413333 0.053333 0.253333 0.126667
ChiDist 1.414214 0.975240 1.414214 1.304159 1.414214
Inertia 0.306667 0.393118 0.106667 0.430877 0.253333
Dim. 1 -1.414214 0.707107 -1.414214 0.707107 -1.414214
Dim. 2 0.000000 -0.958830 0.000000 1.564407 0.000000
Principal inertias (eigenvalues):
dim value % cum% scree plot
1 1.000000 67.1 67.1 *****************
2 0.490662 32.9 100.0 ********
-------- -----
Total: 1.490662 100.0
Rows:
name mass qlt inr k=1 cor ctr k=2 cor ctr
1 | sets | 333 1000 447 | -1414 1000 667 | 0 0 0 |
2 | vrsc | 333 1000 276 | 707 405 167 | -858 595 500 |
3 | vrgn | 333 1000 276 | 707 405 167 | 858 595 500 |
Columns:
name mass qlt inr k=1 cor ctr k=2 cor ctr
1 | 1 | 153 1000 206 | -1414 1000 307 | 0 0 0 |
2 | 2 | 413 1000 264 | 707 526 207 | -672 474 380 |
3 | 3 | 53 1000 72 | -1414 1000 107 | 0 0 0 |
4 | 4 | 253 1000 289 | 707 294 127 | 1096 706 620 |
5 | 5 | 127 1000 170 | -1414 1000 253 | 0 0 0 |
We can see that using any of these methods, we have mapped the categories into a space of principal components–technically using SVD. Furthermore, both the 'rows' and 'columns' are placed into that space, so we can see how closely aligned the groups are. We chose two dimensions in each case for easy visualization. So far, we have used a data set with really strong correspondence. What if the correspondence is not as good?
set.seed(1001)
iris2 <- scale(iris[, 1:4])
model <- kmeans(iris[, 1:4], centers = 5)
table(iris$Species, model$cluster)
1 2 3 4 5
setosa 0 0 50 0 0
versicolor 21 27 0 2 0
virginica 0 1 0 27 22
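Running ca() on this new cross-table (a reconstruction of the un-echoed call) gives:
ca(table(iris$Species, model$cluster))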
Principal inertias (eigenvalues):
1 2
Value 1 0.886946
Percentage 53% 47%
Rows:
setosa versicolor virginica
Mass 0.333333 0.333333 0.333333
ChiDist 1.414214 1.352930 1.352930
Inertia 0.666667 0.610140 0.610140
Dim. 1 -1.414214 0.707107 0.707107
Dim. 2 0.000000 -1.224745 1.224745
Columns:
1 2 3 4 5
Mass 0.140000 0.186667 0.333333 0.193333 0.146667
ChiDist 1.414214 1.339167 1.414214 1.270726 1.414214
Inertia 0.280000 0.334762 0.666667 0.312184 0.293333
Dim. 1 0.707107 0.707107 -1.414214 0.707107 0.707107
Dim. 2 -1.300460 -1.207570 0.000000 1.121086 1.300460
This should match our intuition for where the two sets of labels belong.
Other Examples (from ca help file)
This data set maps the distribution of letters of the alphabet across a set of books by different authors. The resulting dimensions might be used to detect language, style, or historical period.
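The output below matches the author data set shipped with the ca package:
library(ca)
data(author)  # letter counts for 12 texts by several authors
ca(author)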
Principal inertias (eigenvalues):
1 2 3 4 5 6 7
Value 0.007664 0.003688 0.002411 0.001383 0.001002 0.000723 0.000659
Percentage 40.91% 19.69% 12.87% 7.38% 5.35% 3.86% 3.52%
8 9 10 11
Value 0.000455 0.000374 0.000263 0.000113
Percentage 2.43% 2% 1.4% 0.6%
Rows:
three daughters (buck) drifters (michener) lost world (clark)
Mass 0.085407 0.079728 0.084881
ChiDist 0.097831 0.094815 0.128432
Inertia 0.000817 0.000717 0.001400
Dim. 1 -0.095388 0.405697 1.157803
Dim. 2 -0.794999 -0.405560 -0.023114
east wind (buck) farewell to arms (hemingway)
Mass 0.089411 0.082215
ChiDist 0.118655 0.122889
Inertia 0.001259 0.001242
Dim. 1 -0.173901 -0.831886
Dim. 2 0.434443 -0.136485
sound and fury 7 (faulkner) sound and fury 6 (faulkner)
Mass 0.082310 0.083338
ChiDist 0.172918 0.141937
Inertia 0.002461 0.001679
Dim. 1 0.302025 -0.925572
Dim. 2 2.707599 0.966944
profiles of future (clark) islands (hemingway) pendorric 3 (holt)
Mass 0.089722 0.082776 0.079501
ChiDist 0.187358 0.165529 0.113174
Inertia 0.003150 0.002268 0.001018
Dim. 1 1.924060 -1.566481 -0.724758
Dim. 2 -0.249310 -1.185338 -0.106349
asia (michener) pendorric 2 (holt)
Mass 0.077827 0.082884
ChiDist 0.155115 0.101369
Inertia 0.001873 0.000852
Dim. 1 1.179548 -0.764937
Dim. 2 -1.186934 -0.091188
Columns:
a b c d e f g
Mass 0.079847 0.015685 0.022798 0.045967 0.127070 0.019439 0.020025
ChiDist 0.048441 0.148142 0.222783 0.189938 0.070788 0.165442 0.156640
h i j k l m n
Mass 0.064928 0.070092 0.000789 0.009181 0.042667 0.025500 0.068968
ChiDist 0.154745 0.086328 0.412075 0.296727 0.120397 0.159747 0.075706
o p q r s t u
Mass 0.076572 0.015159 0.000669 0.051897 0.060660 0.093010 0.029756
ChiDist 0.088101 0.250617 0.582298 0.111725 0.123217 0.050630 0.119215
v w x y z
Mass 0.009612 0.025847 0.001160 0.021902 0.000801
ChiDist 0.269770 0.232868 0.600831 0.301376 0.833700
[ reached getOption("max.print") -- omitted 3 rows ]
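The next output appears to be the classic hair-color by eye-color data (base R's HairEyeColor table collapsed over Sex); this reconstruction is an assumption, since the original call was not shown:
haireye <- margin.table(HairEyeColor, c(1, 2))  # collapse over Sex (assumed)
ca(haireye)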
Principal inertias (eigenvalues):
1 2 3
Value 0.208773 0.022227 0.002598
Percentage 89.37% 9.52% 1.11%
Rows:
Black Brown Red Blond
Mass 0.182432 0.483108 0.119932 0.214527
ChiDist 0.551192 0.159461 0.354770 0.838397
Inertia 0.055425 0.012284 0.015095 0.150793
Dim. 1 -1.104277 -0.324463 -0.283473 1.828229
Dim. 2 1.440917 -0.219111 -2.144015 0.466706
Columns:
Brown Blue Hazel Green
Mass 0.371622 0.363176 0.157095 0.108108
ChiDist 0.500487 0.553684 0.288654 0.385727
Inertia 0.093086 0.111337 0.013089 0.016085
Dim. 1 -1.077128 1.198061 -0.465286 0.354011
Dim. 2 0.592420 0.556419 -1.122783 -2.274122
Multiple Correspondence Analysis
We have looked at a pair of variables, often as a contingency table, using CA. But what if you had a large number of measures that were all categorical–like a multiple-choice test? We could adapt CA to tell us things more like what factor analysis does for large-scale tests. In general, MCA will take the NxMxO contingency table and use SVD to map all the dimensions into a single space, showing you where the levels of each dimension fall. Let's look at the Titanic survivor data set. Was it true that it was 'women and children first'? Who survived? Who didn't?
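The table print and the MCA output below can be reproduced along the following lines; the exact original call was not shown, so expanding the four-way table to one row per person before calling mjca() is an assumed (but safe) reconstruction:
library(ca)
Titanic  # built-in 4-way table: Class x Sex x Age x Survived
tdf <- as.data.frame(Titanic)
people <- tdf[rep(seq_len(nrow(tdf)), tdf$Freq), 1:4]  # one row per person
mjca(people)  # multiple correspondence analysis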
, , Age = Child, Survived = No
Sex
Class Male Female
1st 0 0
2nd 0 0
3rd 35 17
Crew 0 0
, , Age = Adult, Survived = No
Sex
Class Male Female
1st 118 4
2nd 154 13
3rd 387 89
Crew 670 3
, , Age = Child, Survived = Yes
Sex
Class Male Female
1st 5 1
2nd 11 13
3rd 13 14
Crew 0 0
, , Age = Adult, Survived = Yes
Sex
Class Male Female
1st 57 140
2nd 14 80
3rd 75 76
Crew 192 20
Eigenvalues:
1 2 3
Value 0.067655 0.005386 0
Percentage 76.78% 6.11% 0%
Columns:
Class:1st Class:2nd Class:3rd Class:Crew Sex:Female Sex:Male Age:Adult
Mass 0.036915 0.032372 0.080191 0.100522 0.053385 0.196615 0.237619
ChiDist 1.277781 1.316778 0.749611 0.667164 1.126763 0.305938 0.118369
Inertia 0.060272 0.056129 0.045061 0.044743 0.067777 0.018403 0.003329
Dim. 1 -1.726678 -0.976191 -0.195759 1.104622 -2.360505 0.640923 0.101670
Dim. 2 -2.229588 0.457212 1.937417 -0.874018 0.016164 -0.004389 -0.277601
Age:Child Survived:No Survived:Yes
Mass 0.012381 0.169241 0.080759
ChiDist 2.271810 0.394352 0.826420
Inertia 0.063898 0.026319 0.055156
Dim. 1 -1.951309 0.763670 -1.600378
Dim. 2 5.327911 0.344441 -0.721825
This shows two locations for survival (Yes and No). The different marginal groups cluster closer to one or the other of these in ways we probably expect: adult men and crew members (who were probably also mostly adult men) were likely not to survive.
Application: Card-sorting and survey data
The following data set involves a card-sort exercise, which is a common way of understanding users' mental models. In this data set, collected by Natasha Hardy as part of her dissertation at MTU, a set of 43 topics related to financial planning for retirement was sorted into groups by participants. Each participant could form any groups they wanted, and the groups were then labeled–but each participant may have used a different sorting. Let's first look at these as a dissimilarity measure and do some clustering and MDS plotting.
library(MASS)
data <- read.csv("natasha_data.csv")
tab1 <- table(data$participant, data$card.label, data$category.label)
# Build a card-by-card co-occurrence matrix: mat[i, j] counts how many
# participants placed cards i and j in the same group.
mat <- matrix(0, dim(tab1)[2], dim(tab1)[2])
rownames(mat) <- colnames(mat) <- rownames(tab1[1, , ])
for (sub in 1:dim(tab1)[1]) {
    t1 <- tab1[sub, , ]  # card x category table for one participant
    byindex <- apply(t1, 1, function(x) {
        which(x == 1)  # the group each card was placed in
    })
    submat <- outer(byindex, byindex, "==")  # TRUE when two cards share a group
    mat <- mat + submat
}
rownames(mat)
[1] "2008 recession"
[2] "Account management"
[3] "Advice"
[4] "Analyze info"
[5] "Benefits of computers"
[6] "Boomer mentality"
[7] "Casual computer user"
[8] "Confidence"
[9] "Contribution decision"
[10] "Distrust"
[11] "Future orientation"
[12] "Gen X life concept"
[13] "Goals"
[14] "Hardship"
[15] "Health insurance"
[16] "Importance of computers"
[17] "Information availability"
[18] "Intrinsic motivation to learn computers"
[19] "Keeping up computer skills"
[20] "Lack of awareness"
[21] "Learn computers for work"
[22] "Methods for learning computer skills"
[23] "Millennial mentality"
[24] "No discretionary income"
[25] "Parental influence"
[26] "Patience"
[27] "Plan selection"
[28] "Reason for not investing"
[29] "Reason for starting"
[30] "Research"
[31] "Retirement knowledge"
[32] "Retirement lifestyle"
[33] "Retirement planning"
[34] "Risk"
[35] "Self-rated computer skill"
[36] "Serious computer user"
[37] "Social Security"
[38] "Taxes"
[39] "Type of investment"
[40] "Uncertainty"
[41] "Use computers for retirement"
[42] "Withdrawal"
[43] "Worry"
K-means cluster
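The clustering and scaling steps were not echoed; here is an assumed reconstruction: k-means on the co-occurrence matrix (4 clusters, matching the assignments below) and non-metric MDS on a dissimilarity derived from it:
km <- kmeans(mat, centers = 4)  # 4 clusters, matching the output below
km$cluster
dissim <- max(mat) - mat  # assumed: co-occurrence counts converted to dissimilarity
mds <- isoMDS(as.dist(dissim))  # from MASS; prints the convergence trace below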
2008 recession Account management
2 2
Advice Analyze info
2 2
Benefits of computers Boomer mentality
3 4
Casual computer user Confidence
3 4
Contribution decision Distrust
1 4
Future orientation Gen X life concept
4 4
Goals Hardship
2 4
Health insurance Importance of computers
1 3
Information availability Intrinsic motivation to learn computers
2 3
Keeping up computer skills Lack of awareness
3 4
Learn computers for work Methods for learning computer skills
3 3
Millennial mentality No discretionary income
4 1
Parental influence Patience
4 4
Plan selection Reason for not investing
1 2
Reason for starting Research
2 2
Retirement knowledge Retirement lifestyle
1 1
Retirement planning Risk
1 2
Self-rated computer skill Serious computer user
3 3
Social Security Taxes
1 2
Type of investment Uncertainty
2 4
Use computers for retirement Withdrawal
3 2
Worry
4
initial value 31.345692
final value 31.345692
converged
plot(mds$points[, 1], mds$points[, 2], type = "n")
text(mds$points[, 1], mds$points[, 2], rownames(mat), col = km$cluster)
Correspondence analysis
Because each participant labeled the groupings, we can also do a CA. Here, the two sets of labels (the fixed card names and the participant-generated group names) will have some correspondence, and we can use CA to map them into a common vector space.
library(ca)
camod <- ca(table(as.factor(data$card.label), as.factor(data$category.label)))
plot(camod, arrows = c(FALSE, TRUE), xlim = c(-2, 2))
par(mfrow = c(1, 2))
plot(camod, dim = c(1, 2), what = c("none", "all"), main = "Cluster names", xlim = c(-2,
    2))
# points(camod$rowcoord[, 1], camod$rowcoord[, 2], col = km$cluster, pch = 16)
plot(camod, dim = c(1, 2), what = c("all", "none"), main = "Sorted topics", xlim = c(-2,
    2), col = c("black", "orange"), pch = 1)