---
title: "Model-Based Clustering and mclust"
author: "Shane T. Mueller shanem@mtu.edu"
date: "`r Sys.Date()`"
output:
pdf_document: default
html_document:
df_print: paged
rmdformats::readthedown:
gallery: yes
highlight: kate
self_contained: no
word_document:
reference_docx: ../template.docx
always_allow_html: yes
---
```{r knitr_init, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Model-Based Clustering
The previous clustering methods used distance as a means of assessing similarity and forming clusters. This makes intuitive sense, but it would be nice to take a more statistical approach. To do so, we would like to consider a model that might have generated the data. We might assume that the data arose from some distribution (e.g., one or more normal distributions), and try to model the data based on this model. This, in general, is called *Model-based clustering*. If we make assumptions about the model that generated the data, (and our assumptions happen to be correct), we can perhaps get better classification and clustering.
To do this, we will no longer use distance measures, but work with a set of dimensional or feature data. Let's start with a 2D example. We will assume that data are generated from a two-dimensional guassian model. Here is an example where, within those two dimensions, our data were generated from three groups with different means, but the same standard deviation:
```{r}
set.seed(100)
centers <- matrix(runif(6),ncol=3)
par(mfrow=c(1,2))
data1 <- matrix(c(rnorm(200,mean=centers[,1],sd=.02),
rnorm(200,mean=centers[,2],sd=.02),
rnorm(200,mean=centers[,3],sd=.02)),
ncol=2,byrow=T)
colnames(data1) <- c("V1","V2")
plot(data1,col=rep(c("red","darkgreen","blue"),each=100))
points(t(centers),pch=16,cex=2,col="black")
```
Here, the groups do not overlap at all, but we might imagine a setup where they do:
```{r}
data2 <- matrix(c(rnorm(200,mean=centers[,1],sd=.1),
rnorm(200,mean=centers[,2],sd=.1),
rnorm(200,mean=centers[,3],sd=.1)),ncol=2,byrow=T)
colnames(data2) <- c("V1","V2")
plot(data2,col=rep(c("red","darkgreen","blue"),each=100))
points(t(centers),pch=16,cex=2,col="black")
```
Suppose we assumed the data were from three gaussian distributions, and happened to know the centers of these three distributions, but we don't know which point came from which distribution. We could estimate the likelihood of each point having come from each of the three groups, and use this to help us determine the group it should have belonged to. To do this, we will just use the multivariate density function from the mvtnorm package. We will compute the likelihood of each point for all three models--first for the low-noise data. Here we will do it for data1, which is the most easily discriminable:
```{r}
library(mvtnorm)
truth <- rep(1:3,each=100)
center1like <- dmvnorm(data1,mean=centers[,1],sigma=diag(rep(.02,2)))
center2like <- dmvnorm(data1,mean=centers[,2],sigma=diag(rep(.02,2)))
center3like <- dmvnorm(data1,mean=centers[,3],sigma=diag(rep(.02,2)))
likes <-cbind(center1like,center2like,center3like)
group <- apply(likes,1,which.max)
matplot(cbind(center1like,center2like,center3like))
table(truth,group)
```
Notice that the points were sampled in sets of 100, and the likelihood model correctly predicts every point if we categorize by the maximum likelihood group. We can look at the high-noise group data2, and it still looks promising, but there might be some errors.
```{r}
center1like <- dmvnorm(data2,mean=centers[,1],sigma=diag(rep(.07,2)))
center2like <- dmvnorm(data2,mean=centers[,2],sigma=diag(rep(.07,2)))
center3like <- dmvnorm(data2,mean=centers[,3],sigma=diag(rep(.07,2)))
likes <-cbind(center1like,center2like,center3like)
matplot(likes)
group <- apply(likes,1,which.max)
table(truth,group)
```
Looking at the confusion table, this is not perfect, but still very good--around 90% accurate, even though we didn't know the group membership to start with. So, If we know that the data are generated from a normal distribution, and if we assume the variance is the same in both dimensions, and if we know the means of the groups, and if we are right about there being three groups, in many conditions we can do very well at recovering the group structure. The general problem is referred to as model-based clustering. Because we treat the data as being generated by a mixture of unknown sources, this is also called 'mixture modeling'.
##Finite mixture modeling
One common variety of this solution works similarly to k-means clustering, and is referred to as finite mixture modeling. Finite refers to assumptions about how many groups we have, and mixture modeling refers to the mixture of distributions assumptions. Unlike k-means (and more like fanny), finite mixture modeling will permit each item to have a likelihood of being a member of each group.
Inference for FMM--finding where the gruops are--can happen in a number of ways. The simplest method uses a technique called Expectation-Maximization, but Bayesian approaches are also popular. Aside from the method for identifying clusters, the Bayesian approach will typically produce a distribution of solutions, whereas E-M will produce a single maxmimum-likelihood solution. In addition, many Bayesian approaches have explored infinite mixture models--permitting the model to have as many groups as you need, and letting the Bayesian inference determine the likely number of groups.
E-M works very similarly to the iterative process of k-means. You begin with a random assignment of items to groups. You then perform Expectation step--estimating group parameters via maximum likelihood and likelihoods for the points given the parameters. Then you perform maximization--sorting each group into the cluster that maximized likelihood. You repeat this until the process converges, and statisticians have proved that it converges under many flexible conditions.
Let's try this by hand, with data2. Do it multiple times and see the different ways it turns out.
```{r}
group <- sample(1:3,300,replace=T)
##Repeat the following until it converges:
par(mfrow=c(1,2))
for(i in 1:6)
{
##recompute centers and spread based on these groupings (maximum-likelihood estimation)
groupmeans<- aggregate(data2,list(group),mean)[,-1]
groupsd <- aggregate(data2,list(group),sd)
plot(data2,col=group,main="Current configuration")
points(groupmeans,pch=16,cex=2,col=1:3)
##estimate likelihood for each grouping:
center1like <- dmvnorm(data2,mean=unlist(groupmeans[1,]),sigma=diag(groupsd[1,2:3]))
center2like <- dmvnorm(data2,mean=unlist(groupmeans[2,]),sigma=diag(groupsd[2,2:3]))
center3like <- dmvnorm(data2,mean=unlist(groupmeans[3,]),sigma=diag(groupsd[3,2:3]))
#center2like <- dmvnorm(data2,mean=groupmeans[2,],sigma=diag(groupsd[2,2:3]))
#center3like <- dmvnorm(data2,mean=groupmeans[3,],sigma=diag(groupsd[3,2:3]))
likes <-cbind(center1like,center2like,center3like)
matplot(likes)
##redo group assignment
group <- apply(likes,1,which.max)
print(groupmeans)
print(groupsd)
print(table(truth,group))
}
```
Notice by looking at the table, that the name of the groups might not be right, but it usually does a good job of categorizing by the fourth round. Sometimes, a group disappears--eaten by another group. We can look at the parameters to identify each model. Note that although the data were created with independence and equal variance, the model estimates the variance of each dimension separately.
The groupsd values should all be around 0.07, which is what generated them. Notice also that the centers are estimated very well.
```{r}
groupsd
groupmeans
centers
```
We have assumed a normal distribution, which is spherical, and correctly re-estimated the parameters of that distribution and a hidden piece of information--the group membership. This is how E-M is sometimes used to estimate hidden, latent, or missing information; here the missing information is the group membership variable.
We should also recognize that we have made several assumptions here that might not be true, and perhaps not made others that could have been true. For example:
- Should all groups have the same variability?
- Should a group have the same variability in each direction?
- Should there be covariance, to permit angled ellipses?
Also, we haven't yet solved the problem of figuring out how many groups there should be. Because we are fitting a likelihood model, likelihood metrics can be used. It is common to use BIC, which will balance model complexity and goodness of fit, to determine how many groups to use.
Finally, if you the solution to the model is going to be very dependent on starting conditions. There is only one 'best' solution, but we may not find it unless we run the model a few hundred times from different configurations.
For these reasons, it is usually better to use a built-in library that handles all these issues. the mclust and the flexmix libraries are both very good. We will focuse on mclust, which is faster and offers a lot of built-in models. The flexmix library is more flexible, but sometimes requires you to build a model driver by hand. mclust focuses primarily on gaussian mixture modeling, and is very fast.
```{r,fig.width=6,fig.height=6}
library(mclust)
model1 <- Mclust(data2)
summary(model1)
plot(model1,what=c("classification"))
```
Without asking, this identified that we had three components, showed how many were in each group, and produced likelihood and BIC scores. It also states it used model 'EII'.
Unpacking this, EII refers to the type of guassian variance model. There are many different options. The model actually fit all the different variance options, and compared them via a BIC score to choose the best:
```{r}
plot(model1,what="BIC")
```
We can see about 14 different models here. There are also some that don't apply to multivariate mixture models. From the help file for mclustModelNames:
The following models are available in package mclust:
```
univariate mixture
"E" = equal variance (one-dimensional)
"V" = variable variance (one-dimensional)
multivariate mixture
"EII" = spherical, equal volume
"VII" = spherical, unequal volume
"EEI" = diagonal, equal volume and shape
"VEI" = diagonal, varying volume, equal shape
"EVI" = diagonal, equal volume, varying shape
"VVI" = diagonal, varying volume and shape
"EEE" = ellipsoidal, equal volume, shape, and orientation
"EVE" = ellipsoidal, equal volume and orientation (*)
"VEE" = ellipsoidal, equal shape and orientation (*)
"VVE" = ellipsoidal, equal orientation (*)
"EEV" = ellipsoidal, equal volume and equal shape
"VEV" = ellipsoidal, equal shape
"EVV" = ellipsoidal, equal volume (*)
"VVV" = ellipsoidal, varying volume, shape, and orientation
single component
"X" = univariate normal
"XII" = spherical multivariate normal
"XXI" = diagonal multivariate normal
"XXX" = ellipsoidal multivariate normal
```
In general, the lower the model, the more complex it is, and so the more parameters needed and the greater improvement needed to justify using it. The model picked EII--spherical equal volume. Spherical indicates the variance is equal in all directions, and equal volume indicates that all variances are assumed to be equal. In contrast to spherical, diagonal would permit each dimension to have a different variance, and ellipsoidal will permit covariance terms to orient the ellipse in different directions. On the other hand, we can constrain aspcets of shape and volume (size) across groups, forcing them to be the same or allowing them to differ.
We can see in the parameters how the constraints work out. The variance-sigma values are all identical. The could differ, or we could permit covariance.
```{r}
model1$parameters
```
We can force use of another model--for example one with covariance.
```{r}
model2 <- Mclust(data2, modelNames = "VVI")
summary(model2)
plot(model2,what=c("classification"))
model2$parameters
```
It did not make much of a difference permitting these to differ. They did change slightly, but values are all very similar across dimensions and groups. We can get the classification response out of the model directly.
```{r}
table(truth,model1$classification)
table(truth,model2$classification)
```
## Exercise:
* Build mclust models from the letter-feature data.
* Build mclust models from the personality data.
For each model, select the type of model you feel is most reasonable, and identify how many groups of people/sounds there are. Then, look at the average sound/person from each category and see if there is something that the category captures.
```{r}
# Letter-Feature data:
ratings <- read.table("keren.txt")
```
## Using mixture modeling with similarity/Distance data
Because mixture modeling works within a dimensional or feature space, the most popular versions will not work if you have a distance measure (like we used for hierarchical clustering). Without developing new algorithms, one way you can still apply tool like Mclust is to use MDS to infer a dimensional space, and then apply finite mixture modeling to that. The choice of k might matter in these cases, but you are not restricted to choosing a 2-dimensional solution. Here is an example using the letter distance data:
```{r}
library(MASS)
d_dist <- read.csv("plaindist.csv")
d_dist<- d_dist[,-1]
colnames(d_dist) <- LETTERS
rownames(d_dist) <-LETTERS
x <- isoMDS(as.dist(d_dist),k=3)
l <- Mclust(x$points)
plot(l,what="BIC")
plot(l,what="classification")
```
Here, the one-group solution was always the best. Notice that because we used 3 dimensions, when we plot the mclust solution, it shows us cluster breakdown for each pair.
Although the algorithms is very similar conceptually to k-means clustering, our solution prefers a single solution, because the BIC criterion is very conservative. Part of this is that the BIC incorporates number of observations into its measure. In this case though, we've already averaged across our observations, and there may have been dozens of participants and hundreds of trials. Here, BIC is probably not really appropriate, because it takes N as the number of entities we are clustering. We can always force Mclust to use a specified number of cluster if we think that is better than choosing the best via BIC.
```{r}
l5 <- Mclust(x$points,G=5)
plot(l5,what="classification")
```
Notice that plotting this will not show the actual letters, but we can improvise:
```{r}
plot(x$points[,1],x$points[,2],type="n")
text(x$points[,1],x$points[,2],LETTERS,col=l5$classification)
```
## Exercise: Mixture model with personality data
1. Use mclust to find clusters of personalities.
2. Use mclust to find clusters of personality questions.
```{r}
data <- read.csv("bigfive.csv")
dat.vals <- data[,-1]##remove subject code
qtype <- c("E","A","C","N","O", "E","A","C","N","O",
"E","A","C","N","O", "E","A","C","N","O",
"E","A","C","N","O", "E","A","C","N","O",
"E","A","C","N","O", "E","A","C","N","O",
"O","A","C","O")
valence <- c(1,-1,1,1,1, -1,1,-1,-1,1,
1,-1,1,1,1, 1,1,-1,1,1,
-1,1,-1,-1,1, 1,-1,1,1,1,
-1,1,1,-1,-1, 1,-1,1,1,1,
-1,1,-1,1)
add <- c(6,0,0)[valence+2]
tmp <- dat.vals
reversed <-t(t(tmp)*valence + add)
##reverse code questions:
bytype <- reversed[,order(qtype)]
key <- sort(qtype)
colnames(bytype) <- paste(key,1:44,sep="")
##we may need to impute missing dat. If so, a value of 3 is a quick
##reasonable guess:
bytype[is.na(bytype)] <- 3
```
# Hierarchical Model-based Clustering with mclust
mclust also provides a hierarchical model-based agglomerative clustering. This will still use a limited number of models to characterize a group, but attempt to identify a hierarchical structure. Like normal mclust, hc have varieties hcE, hcV, hcEII, hcVII, hcEEE, hcVVV, that can be specified by modelName.
```{r}
hh <- hc(data2,modelName="EII")
hclass(hh)[1:10,1:10]
```
This will have N x N-1 matrix, where each column specifies a different level of hierarchical cluster. We can plot each layer consecutively, so you can see how each cluster gets divided.
```{r,fig.width=7,fig.height=7}
hc <- hclass(hh)
par(mfrow=c(2,2))
plot(data2,col=hc[,1])
plot(data2,col=hc[,2])
plot(data2,col=hc[,4])
plot(data2,col=hc[,5])
```
Notice how the clusters are always divided into two clusters. This provides an interesting hybrid approach, and also a means for determining how deeply to cluster--using the BIC criterion
```{r}
mclustBIC(hh)
mclustBIC(hc)
```
It is not clear what the first mclustBIC(hh) does, but the second one examines BIC values across many different values. For the complex models with elaborate covariance, we get a single group--more than that and we probably run out of parameters. This looks to be the best model overall, but it sort of forces us to use a meaning of cluster we might not be comfortable with. The simple models like EII appear to continue to improve as we add groups.
## Hierarchical clustering of letter similarity
Although mclust seems to fail for the letters (because of the BIC conservatism), maybe a hierarchical approach would be better. We will start with our 8-D mds solution, and plot by the first two dimensions.
```{r}
hh <- hc(x$points,modelName="EII")
classes <- hclass(hh)
plot(x$points[,1:2],cex=4,col=classes[,8])
text(x$points[,1],x$points[,2],LETTERS)
```
This seems reasonable, and shows the advantage of the conditional splitting in a hierarchical approach.
# Model-based clustering for non-gaussian distributions: The flexmix library
Mixtures of gaussians are very widely used, but we sometimes need more flexibility. Just as with regression, sometimes we don't want to use the guassian error distribution. A good example is for binary data, although likert-scales apply as well. Or consider data where you might have a rank-order constraint. Or if you were to model a pattern of shots on a target or a dartboard, you might model a mixture of a guassians (for the ability of a person) and a low-density uniform (for mistakes that were haphazard), so that you can remove and account for any outliers while still capturing the precision of the person.
The flexmix library provides a set of tools for building much more general mixture models. There are many built-in options, but you can also create 'drivers' for custom models and mixtures of models.
A reasonable model for binary behavior such as voting is a binomial model. Let's take another look at the Michigan Senate voting data, that we have previously looked at. Can we detect party affiliation by modeling the data as a mixture of binomials?
We could model each person with a single binomial coefficient-- each person has a tendancy to vote for or against a set of issues, modeled as a probability. But for this case, let's assume that a "voting bloc" has a tendency to vote similarly, and people belong to a voting block. Then, we can use mixture modeling to identify which voting block they belong to, and what the voting block looks like.
To use flexmix, we need to specify the model we want, which is the FLXMCmbinary. There are dozens of models, including gaussian, and others that we will examine in the next unit.
```{r}
library(flexmix)
data.raw <- read.csv("misenate.csv")
key <- read.csv("misenate-key.csv",header=F)
keep<- !is.na(rowMeans(data.raw[,-(1:2)]))
data <- data.raw[keep==TRUE,]
party <- data$Party
votes <- as.matrix(data[,-(1:2)])
##model with a single group is easy;
model1 <- flexmix(votes~1,k=1, model=FLXMCmvbinary())
model2 <- flexmix(votes~1,k=2, model=FLXMCmvbinary())
table(party,clusters(model2))
model4 <- flexmix(votes~1,k=4,model=FLXMCmvbinary())
table(party,clusters(model4))
parameters(model4)
matplot(parameters(model4),type="o",main="Voting by block on each issue")
```
We find that as the number of groups grows, we get different answers different times, and we start losing groups. Also, fitting each model size separately is a bit tedious. Because we can get a solution each time, we'd like to repeat the model from a bunch of different random starting configurations, and just keep the best model we find.
The stepflexmix function takes care of a lot of this for us. We can specify a range of k, which tells us the number of groups to explore. We can also select nrep, which tells us how many times to fit the model at each level of k. 20 is fine for exploring, but you may try nrep values as large as 1000 if you want to be confident that are getting a consistently good solution.
```{r}
set.seed(100)
models <- stepFlexmix(votes~1,k=1:6,model=FLXMCmvbinary(),nrep=100,drop=FALSE)
models
plot(models)
```
Unlike the mclust library, flexmix plots BIC so that lower is better. Here, the BIC tends to favor two groups, but the AIC favors 5 or more groups. Notice that drop, which defaults to TRUE, can prevent groups from being completely dropped. If we look at the output, we se that even for this case, the ultimate value of k that was acheived is often not the same as the starting value (k0), and we never ended up with more than 6 groups. Notice also that the best solutions have slightly different BIC values--even within 6-group solutions. This means that we would probably want to increase nrep to make sure we consistently find the best group.
A number of functions let us extract the model and extract information from them model. We can extract the best model using getModel; if we specify the statistic ("BIC" or "AIC") it will pick out that model; if we specify a number, it will pick out the model with that number of components.
```{r}
bestmodel <- getModel(models,"BIC")
bestmodel
summary(bestmodel)
plot(bestmodel)
parameters(bestmodel)
clusters(bestmodel)
table(party,clusters(bestmodel))
```
If you plot a model, it shows a 'rootogram'. This is a histogram of the likelihood of each item being in each model. Ideally, if you have clear separation, you have a lot of densiity at the edges, without many ambiguous cases in the middle.
We can pick out an arbitrary model too:
```{r}
bestmodel <- getModel(models,4)
bestmodel
summary(bestmodel)
plot(bestmodel)
parameters(bestmodel)
clusters(bestmodel)
table(party,clusters(bestmodel))
matplot(parameters(bestmodel),type="o")
```
If forced to classify the politicians into 4 parties, we get one group of democrats, two republicans, and one mixed outlier group. We could maybe sort the issues (since they are abritrary) to get a better picture. We can make a table with declared party membership to determine who belongs in which group. Then, we can arrange issues using another clustering solution.
```{r,fig.width=8,fig.height=8}
set.seed(101)
tab <- table(party,clusters(bestmodel))
gparty <-c("D","R")[apply(tab,2,which.max)]
p <- parameters(bestmodel)
grp <- kmeans(p,centers=7)
ord <- order(grp$cluster)
par(mar=c(3,25,2,1),las=1)
matplot(p[ord,],1:54,pch=gparty,type="o",yaxt="n",ylab="",xlab="Agreement with rater on bill")
grid()
axis(2,1:54,paste(paste(key$V1,key$V2)[ord],sort(grp$cluster)),cex.axis=.75)
```
This clustered questions into 7 groups, then just displayed them in order of the (arbitary) group number. Looking at the bills, it is difficult to determine why they clustered together, or why the 4-party solution divided parties on specific issues, but someone who follows Michigan politics may learn important things from this solution.