--- title: "Correspondence Analysis and Multiple Correspondence Analysis, and Mediation Analysis" author: "Shane T. Mueller shanem@mtu.edu" date: "`r Sys.Date()`" output: rmdformats::readthedown: gallery: yes highlight: kate self_contained: no html_document: df_print: paged pdf_document: default word_document: reference_docx: ../template.docx always_allow_html: yes --- ```{r knitr_init, echo=FALSE, cache=FALSE} library(knitr) library(rmdformats) ## Global options options(max.print="75") opts_chunk$set(echo=TRUE, cache=TRUE, prompt=FALSE, tidy=TRUE, comment=NA, message=FALSE, warning=FALSE) opts_knit$set(width=75) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` The two topics covered in this module are not related, but they are each useful in their own right. ## Mediation Analysis A common type of analysis that lavaan permits is looking at the role of mediating variables. There are many tools available for specifically looking at 3-variable problems, but lavaan lets you model arbitrarily complex mediation schemes. The analysis of these is more ad hoc though. The notion of doing a mediation analysis is central to many areas of psychology, especially in social domains where you cannot have complete experimental control. In these cases, you will always have potentially confounding variables, and mediation analysis examines a particular kind of causal relationship. Some of the papers describing mediation are among the top 20 scientific research papers cited in any domain. The notion is that you may observe a correlation between two variables, but also observe that these may be correlated with a third. For example, you might find that as a population ages from teenager to adult, vehicular accidents go down. But, it might not be age per se that causes this. Instead, it might be practice...as you get older, you get more experience driving, and the experience reduces accidents. Experience may be a mediating variable here. Can all the improvement in driver safety be attributed to experience? Maybe, but to find out, you need to study people who started driving at different ages. This would be a classic mediation test. ## Mediation Analysis using psych::mediation Below is an example data set we can examine mediation. Here, using standard terminology, y is the outcome (accident rate), X is the predictor (age), and M is the mediator variable (experience). ```{r} set.seed(1234) X <- rnorm(100) M <- 0.5*X + rnorm(100) Y <- 0.7*M + rnorm(100) M2 <- 0.6*X + rnorm(100) Y2 <- 0.6*X + rnorm(100) Data <- data.frame(X = X, Y = Y, M = M) Data2 <- data.frame(X = X, Y = Y2, M = M2) pairs(Data) cor(Data) pairs(Data2) cor(Data2) ``` The psych library handles standard mediation analysis using the mediate() function: ```{r} library(psych) m1 <- mediate(Y~X+(M),data=Data) ``` Here, the first model shows the mediated route is has substantial variance along the two legs, and the direct route c diminishes from .41 to .04 when the indirect route is allowed to have an effect, suggesting a strong mediation. ```{r} m2 <- mediate(Y~X+(M),data=Data2) ``` In the second model, there is no correlation along the second leg of the mediated route, and c and c' are esesntially the same, indicating M does not mediate the relationship of x to y. ## Mediation using lavaan So, for both Data and Data2, we have three variables that are all correlated. In truth, for Data, Y is related only to M, but M is related to X. For Data2 both M and Y are related to X directly. We use ~ to model direct relationships between observed variables. Here, we will fit a model with a direct relationship between X and Y, and also another link through M. We need to specify a variable to represent the interaction of ```a``` and ```b```, using ```:=```, and a final variance ```total``` that captures everthing. ```{r} library(lavaan) library(semPlot) model <- ' # direct effect Y ~ c*X # mediator M ~ a*X Y ~ b*M # indirect effect (a*b) ab := a*b # total effect total := c + (a*b) ' fitmed <- sem(model, data = Data) summary(fitmed) semPaths(fitmed,curvePilot=T,"par", weighted = FALSE, nCharNodes = 7, shapeMan = "rectangle", sizeMan = 15, sizeMan2 = 10) ``` Looking at this first model (where M really mediates X-Y), we see that the edges X-M and M-Y are strong, but X->Y is weak, indicating M mediates the relationship between X and Y. These values are nearly identical (if not exactly the same) to what we obtained in the mediate function, with c=total=.41, but the lavaan model gives more details. ```{r} fitmed2 <- sem(model, data = Data2) summary(fitmed2) #lavaan.diagram(fit2) semPaths(fitmed2,curvePilot=T,"par", weighted = FALSE, nCharNodes = 7, shapeMan = "rectangle", sizeMan = 8, sizeMan2 = 5) ``` Here, Y-M is not significant, but Y-X and M-X are, indicating that M does not mediate the relationship. Note that the model is framed to test whether a particular value is significant--we don't make two models and compare theme via ANOVA. ## Example: ```{r,fig.width=4,fig.height=8} dat <- read.csv("data_study1.csv") dat$female <- dat$Gender=="female" dat$phonenumeric <- dat$Smartphone=="iPhone" round(cor(dat$phonenumeric,dat[,3:14])) par(mar=c(2,15,2,0)) barplot(abs(cor(dat$phonenumeric,dat[,3:14])),main="Correlations with iphone ownership",horiz=T,las=1) ``` We can see that age and phone-as-status-symbol both correlate with iphone ownership. But are they independent? More importantly, can the effect of age be explained by the effect of status-symbol? I'll rescale age to be between 0 and 1 to make the coefficients larger--otherwise it doesn't matter. ```{r} dat$Age <- dat$Age/100 psych::mediate(phonenumeric~Age +(Phone.as.status.object),data=dat) ``` The test suggests the indirect effect of age on phone usage through status symbol is statistically significant, but so is the direct effect. This suggests that although some of the impact of age on iphone usage can be explained by younger people being more likely to treat the phone as a status symbol (and those people tend to buy iphones), there is also a effect of age directly--even for people with the same level of phone-as-status-symbol, there is an age effect. ## Exercise: Form a mediation model for other variables within the iphone data. # Correspondence Analysis and Multiple Correspondence Analysis Correspondance Analysis (CA) and Multiple Correspondence Analysis (MCA) are described as analogs to PCA for categorical variables. For example, it might be useful as an alternative to looking at cross-tabs for categorical variables. It can also be used in concert with inter-rater reliability to establish coding schemes that correspond with one another, or with k-means or other clustering to establish how different solutions correspond. ## Resources * Packages: ```ca``` * corresp() in ```MASS``` * https://www.utdallas.edu/~herve/Abdi-MCA2007-pretty.pdf * http://gastonsanchez.com/visually-enforced/how-to/2012/10/13/MCA-in-R/ * Nenadic, O. and Greenacre, M. (2007). Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20 (3), http://www.jstatsoft.org/v20/i03/ Correspondence Analysis (CA) is available with the corresp() in MASS and the ca package, which provides better visualizations. # Background and Example For CA, The basic problem we have is trying to understand whether two categorical variables have some sort of relationship. The problem is that, unless they are binary, correlation won't work. One could imagine some sort of matching algorithm, to find how categories best align, and then finding the sum of the diagonals. Another problem is that we may not have the same number of levels of groups--can we come up with some sort of correspondence there? This is similar to the problem we face when looking at clustering solutions and trying to determine how well a solution corresponds to some particular category we used outside of the model. The clusters don't have labels that will match the secondary variable, but we'd still like to measure how well they did. Let's try this with the iris data, and k-means clustering ```{r} library(MASS) set.seed(5220) iris2 <- scale(iris[,1:4]) model <- kmeans(iris[,1:4],centers = 3) table(iris$Species,model$cluster) ``` ## MASS library corresp function Can we measure how good the correspondence is between our clusters and the true species? ```{r} library(MASS) camodel <- corresp(iris$Species,model$cluster,nf=2) print(camodel) plot(camodel) ``` The model can also be specified like this: ```{r} tmp <- data.frame(spec=iris$Species,cl=model$cluster) camodel2 <- corresp(~spec+cl,data=tmp,nf=2) camodel2 ``` ## ca library ca function A similar result can be obtained by doing ca on the cross-table. ```{r} library(ca) cmodel3 <- ca::ca(table(iris$Species,model$cluster),nd=4) cmodel3 summary(cmodel3) plot(cmodel3) ``` We can see that using any of these methods, we have mapped the categories into a space of principal components--using SVD. Furthermore it places both the 'rows' and 'columns' into that space, so we can see how closely aligned the groups are. We chose to use 2 dimensions in each case for easy visualization. What if the correspondence is not as good? ```{r} set.seed(1001) iris2 <- scale(iris[,1:4]) model <- kmeans(iris[,1:4],centers = 5) table(iris$Species,model$cluster) cmodel5 <- ca(table(iris$Species,model$cluster),nd=2) ``` ```{r} cmodel5 plot(cmodel5) ``` This should match our intuition for where the two sets of labels belong. # Other Examples (from ca help file) This data sets maps the distribution of letters of the alphabet to authors. These dimensions might be used to detect language, style, historic period, or something ```{r,fig.width=8,fig.height=8} data("author") ca(author) plot(ca(author)) plot(ca(author),what=c("none","all")) plot(ca(author),what=c("all","none")) ``` ```{r} # table method haireye <- margin.table(HairEyeColor, 1:2) haireye.ca <- ca(haireye) haireye.ca plot(haireye.ca) # some plot options plot(haireye.ca, lines=TRUE) plot(haireye.ca, arrows=c(TRUE, FALSE)) ``` # Multiple Correspondence Analysis We have looked at a pair of variables, often as a contingency table, using CA. But what if you had a large number of measures that were all categorical--like a multiple-choice test. We could maybe adapt CA to tell us things more like factor analysis does for large-scale tests. In general, MCA will take the NxMxO contingency table and use SVD to map each all the dimensions into a single space, showing you where the levels of each dimension fall. Let's look at the Titanic survivor data set. Was it true that it was 'women and children first?' Who survived? Who didn't? ```{r,fig.width=8,fig.height=8} print(Titanic) titanicmodel <- mjca(Titanic) print(titanicmodel) plot(titanicmodel) ``` ## Application: Card-sorting and survey data The following data set involves a card-sort exercise, which is a common way of understanding users mental models. In this data set collected by Natasha Hardy as part of her dissertation at MTU, A set of 43 topics related to financial planning for retirement were sorted into groups by participants. Each participant could form any group they wanted, and then groups were labelled--but each participant may have used a different sorting. Let's first look at these as a dissimilarity measure and do some clustering and MDS plotting. ```{r} library(MASS) data <- read.csv("natasha_data.csv") tab1 <- table(data$participant,data$card.label,data$category.label) mat <- matrix(0,dim(tab1)[2],dim(tab1)[2]) rownames(mat) <- colnames(mat) <- rownames(tab1[1,,]) for(sub in 1:dim(tab1)[1]) { t1 <- tab1[sub,,] byindex <- apply(t1,1,function(x){which(x==1)}) submat <- outer(byindex,byindex,"==") mat <- mat + submat } rownames(mat) ``` ## look at a clustering ```{r,fig.width=8,fig.height=10} library(cluster) a <- agnes(12-mat) plot(a,which=2) ``` ## K-means cluster ```{r,fig.width=8,fig.height=8} km <-kmeans(12-mat,centers=4) km$cluster mds <- isoMDS(12-mat,k=2) plot(mds$points[,1],mds$points[,2],type="n") text(mds$points[,1],mds$points[,2],rownames(mat),col=km$cluster) ``` ## Correspondence analysis Because each participant labeled the groupings, we can also do a CA. Here, two levels of variables will have some correspondence, and we can use CA to map these into a common vector space. ```{r,fig.width=10,fig.height=6} library(ca) ca <- ca(table(as.factor(data$card.label),as.factor(data$category.label))) plot(ca,arrows=c(F,T),xlim=c(-2,2)) par(mfrow=c(1,2)) plot.ca(ca,dim=c(1,2), what=c("none","all"),main="Cluster names",xlim=c(-2,2)) #points(ca$rowcoord[,1], ca$rowcoord[,2],col=km$cluster,pch=16) plot(ca,dim=c(1,2), what=c("all","none"),main="Sorted topics",xlim=c(-2,2),col=c("black","orange"),pch=1) #points(ca$colcoord[,1], ca$colcoord[,2],col=km$cluster,pch=16) ```