---
title: 'Mixture Modeling: Mixture of Regressions'
author: "Shane Mueller"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    gallery: yes
    highlight: kate
    self_contained: no
  html_document:
    df_print: paged
  word_document:
    reference_docx: ../template.docx
  pdf_document: default
always_allow_html: yes
---

```{r knitr_init, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)

## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
               cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Mixture Modeling: Mixtures of Regressions

A mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs. We previously used a very simple model--one in which each variable is defined by its mean and standard deviation--to do model-based clustering. But mixture modeling permits finding mixtures of hidden group memberships for other kinds of models, including regression models. This sometimes enables you to identify distinct relationships among sub-groups hidden within a larger population. The goal of mixture modeling is to model your data as a mixture of processes or populations that have distinct patterns of data.

## Example 1: Two linear models

Suppose we have two types of people: those who like romantic comedy movies, and those who hate them. On a movie rating site, anonymous people have rated how much they like movies that we have already scored for how romantic they are. Thus, we'd expect a crossing pattern in the data, like we see below. If we knew what type of person each rater was, we'd be able to fit separate regression models, but we don't.

```{r cache=TRUE}
set.seed(100)
x1 <- rnorm(1000)*50
x2 <- rnorm(1000)*50
y1 <- x1*3 + 10 + rnorm(1000)*30
y2 <- -3*x2 + 25 + rnorm(1000)*50
x <- c(x1,x2)
y <- c(y1,y2)
par(mfrow=c(1,3))
plot(x1,y1)
plot(x2,y2)
plot(x,y)
lm(y1~x1)
lm(y2~x2)
```

If we try to fit the data using a single linear regression, we fail:

```{r}
plot(x,y)
lm1 <- lm(y~x)
summary(lm1)
abline(lm1$coefficients,col="gold",lwd=3.0)
```

From ```summary(lm1)```, R-squared = 0.0008135.

In the example above, it is quite obvious we are missing something. But if we had more noise, or many variables, we might not notice this pattern, and would just conclude that there was no predictability on average. Yet there is a very systematic pattern within the data that we might be able to uncover--if only we knew which group each person was in. This problem is just like model-based clustering. What if we could both fit two separate models and determine which group each person belongs to? We can do that using the flexmix library.

## Fitting mixtures of regressions using E-M and flexmix

We can use the same approach as with clustering, but instead of Gaussians around means, each sub-group is defined by a linear model. As long as we can produce a likelihood estimate for a model, we can use the E-M algorithm to identify and estimate the groups. That is, we can fit a model to a subgroup, obtain the maximum likelihood estimate, then place each observation in the model that best describes it, and repeat until things settle down.
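To make the likelihood idea concrete, here is a minimal sketch (assuming the `x` and `y` vectors simulated above) of how a per-observation likelihood can be computed for a single fitted regression: each residual is scored under the model's Gaussian error distribution.

```{r}
## A minimal sketch: per-observation likelihood under one fitted line.
## Assumes the x and y vectors simulated above are in the workspace.
fit <- lm(y ~ x)
s <- summary(fit)$sigma                 # residual standard deviation
lik <- dnorm(y - fitted(fit), mean = 0, sd = s)
head(lik)                               # one likelihood value per observation
sum(log(lik))                           # total log-likelihood (cf. logLik(fit))
```

The E-M procedure below computes exactly this kind of quantity under each candidate sub-model, and assigns each observation to the model under which it is most likely.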
To review: E-M stands for Expectation-Maximization. It is an iterative process whereby you apply two complementary steps. Suppose you assume there are two sub-groups. First, randomly assign members to the sub-groups. Next, compute a maximum likelihood (ML) estimate for each sub-group (the 'maximization' step). Now, because of random variation, there are likely to be members of one group that are better described by the other group (in terms of likelihood). You then re-sort the data into the groups that better describe them (the 'expectation' step, which here amounts to a hard classification). You repeat this process, re-estimating your models and re-sorting members, until nobody is better described by a model they are not in. The original research on the E-M algorithm proved that it converges to a local maximum; you might not get the global maximum, however, so you typically repeat the process many times and pick the best outcome you find. This is possible to do by hand, and in this case it is not too complicated.

```{r}
split <- rep(1:2, 1000)
plot(x, y, col = split)
for (i in 1:5) {
  lm1 <- glm(y ~ x, subset = split == 1)
  lm2 <- glm(y ~ x, subset = split == 2)
  ## Compute residuals of all the data under each of the two models
  res1 <- predict(lm1, newdata = data.frame(x = x)) - y
  res2 <- predict(lm2, newdata = data.frame(x = x)) - y
  likelihood <- cbind(dnorm(res1, mean = 0, sd = sd(res1)),
                      dnorm(res2, mean = 0, sd = sd(res2)))
  ## Reassign each point to the model that gives it the higher likelihood
  split <- apply(likelihood, 1, which.max)
  plot(x, y, col = split, main = i)
  print(lm1)
  print(lm2)
}
```

We can see that the two models typically found have low slope and differ in their intercepts. But the two models are not that great; if we happened upon a pair of models that crossed, the solution would be better. We might have to run this a few dozen or a few hundred times to find that one. The flexmix library automates all of this for us.

## Flexmix modeling

The flexmix package allows us to do this very easily. FlexMix implements a general framework for fitting discrete mixtures of regression models in the R statistical computing environment (Leisch, 2004).

```{r}
library(flexmix)
model <- flexmix(y~x, k=2)
plot(x, y, col=clusters(model))
abline(parameters(model)[1:2,1], col="blue", lwd=3)
abline(parameters(model)[1:2,2], col="green", lwd=3)
```

Along with being easier to use, it gets the original (crossing) model right almost every time--there are additional heuristics used for model search that help flexmix reach a better solution than we got when building it by hand.

The ```summary()``` method gives the estimated prior probabilities, the number of observations assigned to each cluster, and the number of observations with posterior probability > $\delta$ for each component (with a default $\delta$ of .0001). This indicates the proportion and number of observations that are fit even a little bit by each component, in comparison to those fit best by it. In the example above, one component has a ratio of around .78--there were 1375 points that had non-negligible likelihood of being in that group, and of those, 78% were best fit by that group. The distribution of those likelihoods is shown as a rootogram (with the count scale transformed). Histograms or rootograms of the posterior class probabilities can be used to visually assess the cluster structure. Rootograms are very similar to histograms; the only difference is that the heights of the bars correspond to square roots of counts rather than the counts themselves, so that low counts are more visible and peaks less dominant.

```{r}
summary(model)
plot(model)
parameters(model)
```
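The rootograms that ```plot(model)``` produces are built from the posterior membership probabilities, which you can also inspect directly. A small sketch (assuming the `model` object fit above):

```{r}
## Posterior membership probabilities: one row per observation,
## one column per component.
post <- posterior(model)
head(round(post, 3))
## Share of points that are only weakly separated (max posterior below .7):
mean(apply(post, 1, max) < 0.7)
```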
When interpreting rootograms, keep in mind that there is no 'ground truth'. A peak near probability 1 indicates that many of the points are overwhelmingly well-represented by that cluster. A peak near 0 indicates that many points clearly don't fit the cluster (which is also usually good). Points in the middle indicate a lack of separation--there are points that are only moderately well-described by that group, which should be considered with respect to the number of groups you think exist.

## Example: A mixture of variances

Mixture modeling can be used to identify outliers, or to select elements that are caused by the same process but have much larger variance. Sometimes, skewed distributions with long tails are modeled as mixtures of two Gaussians--one to capture the fairly symmetrical center of the distribution, and a second to capture the long tail. At the end of the day, you may treat the tail as a nuisance and use the central model as the one you care about.

First, we generate the data: 500 low-noise points along a line, plus 50 high-variance outliers.

```{r}
x1 <- rep(1:10, 50)
x2 <- rep(1:10, each=5)
y1 <- 3*x1 + 10 + rnorm(500)*3
y2 <- 3*x2 + 10 + rnorm(50)*50
x <- c(x1, x2)
y <- c(y1, y2)
plot(x, y)
```

Notice that we have a clear linear relationship, along with a handful of outliers. For a regular maximum-likelihood model, these outliers present a problem, because either they are very unlikely (punishing the model), or the model's variance must get larger to make them reasonable, which makes all the other data less likely.

```{r}
library(flexmix)
model2a <- flexmix(y~x, k=1)
summary(model2a)

model2b <- flexmix(y~x, k=2)
summary(model2b)
parameters(model2b)
clusters(model2b)
```

In this case, the 2-group model will often place around 40 of the 50 outliers in the secondary group. We'd like to know whether the 2-group solution is any better. We can use a criterion such as the BIC to do this.

### Bayesian Information Criterion (BIC)

The BIC is a statistic based on both the likelihood of the data AND a penalty related to the number of parameters used. A default model has three parameters (intercept, slope, and residual variance); a two-cluster model has seven (two sets of three, plus one mixing proportion), and a three-cluster model has eleven. To use the BIC, you simply compare across models and determine which value is smallest. That model provides the best trade-off between accuracy and complexity. However, sometimes the differences are small, and you may still prefer a more complex model for other reasons. In the above example, the BIC is given as part of the output of each model, or we can get it directly:

```{r}
BIC(model2a)
BIC(model2b)
```

Here, model2b has a much smaller BIC, so we would prefer it--the two-group model. We'd like to be able to automate this, looking across a number of options; we'd also like to run the fit many times (and with more than just 2 groups) to see whether there are any better solutions. The stepFlexmix function does both: it allows you to specify the numbers of clusters to test and the number of repetitions at each setting. We can then select the model with the smallest BIC using the getModel() function.

```{r}
models <- stepFlexmix(y~x, k=1:5, nrep=10)
models
summary(models)
plot(models)

model <- getModel(models, "BIC")
model
plot(model)
summary(model)
parameters(model)
clusters(model)
```
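One practical use of the selected mixture (a sketch, assuming the `model` object chosen by getModel above): treat the posterior probability of the dominant low-variance component as case weights in an ordinary regression, which largely removes the influence of the outlier tail.

```{r}
## Sketch: down-weight likely outliers using posterior membership.
post <- posterior(model)                   # n x k membership probabilities
main <- which.max(table(clusters(model)))  # index of the dominant component
lm(y ~ x, weights = post[, main])          # fit driven by the low-noise points
```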
The selected model does pretty well. There are a few things to examine. First, we can look at the post>0 ratio. This sets a very small criterion and asks how many of the data points have posterior probability greater than that small level (post>0). It then asks how many are placed in the category based on maximum likelihood. For the larger component, we have a high proportion--almost all of the points that are described by the larger component at all are placed in that category. The second component is less good--almost all of the points could be described by that component, but most are simply better described by the other one. This makes sense in our case, because we have two overlapping distributions. Other times, we might want mutually-exclusive clusters. Note that the model estimates completely different parameters for each group; we'd probably prefer to estimate only a different variance parameter for the outlier group, and to do that we'd need either to do it by hand or to use another flexmix model (possibly a custom model driver)--it might not even be possible.

## Example: learners and non-learners

In this example, suppose that you have tested a group of people repeatedly. Maybe some people get fatigued in this situation, and so their data get worse over time; another group learn, and so their data get better over time. On average, performance may be flat, but this could hide two separate groups. These groups might not be labeled, and they may not be easily detectable.

```{r}
learners <- t(matrix(5 + .5*(1:10) + 1.5*rnorm(100), 10))
fatigues <- t(matrix(10 - .5*(1:10) + 1.5*rnorm(100), 10))
data <- rbind(learners, fatigues)
par(mfrow=c(1,3))
matplot(t(learners), pch=16, col="black", main="Learners")
matplot(t(fatigues), pch=16, col="black", main="Fatigued")
matplot(t(data), pch=16, col="black")
```

Although we have two different groups, without knowing their identity we might just look at the data and declare that they are all the same. Of course, even if they were all the same, we would see some people with a positive slope and some with a negative slope just by chance, so seeing both positive and negative slopes doesn't by itself tell us we have two groups. For example, here is a data set created from a single group:

```{r}
data2 <- matrix(5 + rnorm(200)*2, 20)
pars1 <- matrix(0, ncol=2, nrow=20)
pars2 <- matrix(0, ncol=2, nrow=20)
pred <- 1:10
for (i in 1:20) {
  pars1[i,] <- lm(data[i,]~pred)$coef
  pars2[i,] <- lm(data2[i,]~pred)$coef
}
par(mfrow=c(1,2))
plot(pars1, xlab="intercepts", ylab="slopes", main="Two groups")
plot(pars2, xlab="intercepts", ylab="slopes", main="One group")
```

By comparing the slope and intercept of each person, we don't see a lot of difference. But notice that when we have two groups, if we fit a regression to each person, the resulting parameters cluster into two groups with a gap in the middle. When we have one group, they tend to vary more smoothly. A mixture model should be able to take advantage of this. However, to do so we need to reshape the data into long format:

```{r}
library(reshape2)
data <- as.data.frame(data)
data$sub <- 1:20
long <- melt(data, id.vars = c("sub"))
long$time <- rep(1:10, each=20)
long[1:10,]
```

Here, each row holds a single observation, identified by subject and time. Now we can fit a repeated-measures mixture of regressions with the formula below, which aggregates across subject--exactly what we want. All the points of each subject are classified all-or-none into one or the other cluster.
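As a quick sanity check on the reshaped data (a sketch assuming the `long` data frame just built), we can draw one trace per subject:

```{r}
## Sanity check on the long format: one trace per subject across time.
interaction.plot(long$time, long$sub, long$value,
                 legend = FALSE, xlab = "time", ylab = "value")
```

In the flexmix call that follows, the term after the vertical bar (`|`) names the grouping variable: all rows sharing a subject id receive a common posterior, which is what forces the all-or-none classification.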
```{r}
f <- flexmix(value ~ time | as.factor(sub), data = long, k = 2,
             control = list(iter.max=10))
parameters(f)
clusters(f)
table(clusters(f), long$sub)

## For comparison, a one-component model; the BIC typically favors
## the two-component solution for these data.
f1 <- flexmix(value ~ time | as.factor(sub), data = long, k = 1,
              control = list(iter.max=10))
BIC(f)
BIC(f1)
```

## Example: Curved versus straight models

Mixtures don't always come from the same process or type of model. Here is one in which one component is linear and a second is curved.

```{r}
x1 <- rnorm(1000)*5
x2 <- rnorm(1000)*5
y1 <- 3*x1 + 10 + rnorm(1000)*5
y2 <- 10 + 5*x2 + .3*x2^2 + rnorm(1000)*5
x <- c(x1, x2)
y <- c(y1, y2)
plot(x, y)
```

Here, we fit two polynomial models. This can work because a linear model is just a special case of a polynomial model.

```{r}
model3 <- flexmix(y~poly(x,2), k=2)
summary(model3)
plot(x, y, col=c("red","green")[clusters(model3)])
ord <- order(x)
points(x[ord], predict(model3)[[1]][ord], type="l", col="black", lwd=6)
points(x[ord], predict(model3)[[1]][ord], type="l", col="red", lwd=3)
points(x[ord], predict(model3)[[2]][ord], type="l", col="black", lwd=6)
points(x[ord], predict(model3)[[2]][ord], type="l", col="green", lwd=3)
```

We can look at basic statistical tests of the component regressions with the following trick:

```{r}
model3 <- flexmix(y~poly(x,2), k=2)
rm1 <- refit(model3)
summary(rm1)
```

You can see that one component is quadratic, while for the other the quadratic term is not significant.

## Control parameters

If you want to set a maximum of 10 iterations for model3:

```{r eval=FALSE}
model31 <- flexmix(y~poly(x,2), k=2, control = list(iter.max=10))
```

Another control parameter is minprior: the minimum prior probability that components are enforced to have. Components falling below this threshold (the current default is 0.05) are removed during EM iteration, to avoid numerical instabilities for components containing only a few observations. Using a minimum prior of 0 disables component removal. Here we fit the NPreg data--which were generated from two latent classes--starting with four components, so that minprior can prune away weak components. We can compare the result to the actual class at the end:

```{r}
data("NPreg")
m1 <- flexmix(yp ~ x + I(x^2), data = NPreg, k = 4,
              control = list(minprior = 0.2))
summary(m1)
plot(m1)
clusters(m1)
table(NPreg$class, clusters(m1))
```
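Component numbers from EM are arbitrary, so the cross-table above can look scrambled even when recovery is good. A rough sketch of scoring agreement (assuming the `m1` fit above) is to match each true class to the component that captures most of its points; a more careful approach would use an explicit optimal matching.

```{r}
## Rough agreement score: match each true class to its modal component.
tab <- table(NPreg$class, clusters(m1))
sum(apply(tab, 1, max)) / sum(tab)
```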
# Model-based classification and semi-supervised classification

Many of the model-based clustering libraries, including mclust, have capabilities for doing model-based classification and semi-supervised classification. You can think of classification and clustering as two ends of a spectrum--one where you know the answer and one where you do not. Mid-points include cases where you know the answers for only SOME of the observations, and cases where you know answers for all or some of the observations but only probabilistically. These frameworks integrate classification and clustering under one approach. For mclust, there are two related functions:

* MclustSSC: This performs semi-supervised clustering. It STARTS with the labeled data as the initial clusters, but then performs the normal EM algorithm and reclassifies each element based on an evolving set of clusters.
* MclustDA: Cluster-based discriminant analysis. This is much more like traditional LDA, but you can specify each of the different kinds of Gaussian models (constraining variance, covariance, volume, etc.).

### Cluster-based discriminant analysis

For the discriminant analysis, suppose we have two groups that are not well described by a single Gaussian each; rather, one is a 'peanut':

```{R fig.width=6,fig.height=6}
library(ggplot2)
library(mclust)
groupA <- rbind(cbind(x=rnorm(100,mean=30,sd=5), y=rnorm(100,mean=80,sd=5)),
                cbind(x=rnorm(100,mean=60,sd=5), y=rnorm(100,mean=80,sd=5)))
groupB <- cbind(x=rnorm(200,mean=30,sd=10), y=rnorm(200,mean=60,sd=10))
dat <- data.frame(group=rep(c("A","B"),each=200),
                  x=c(groupA[,1],groupB[,1]),
                  y=c(groupA[,2],groupB[,2]))
dat |> ggplot(aes(x=x,y=y,color=group)) + geom_point() + theme_bw()

model <- MclustDA(dat[,2:3], dat$group)
plot(model,"classification")
plot(model,"scatterplot")
summary(model)
```

Notice that MclustDA allows multiple Gaussians to describe each group. It does its own leave-one-out cross-validation, and here determines an error rate of around 7%. You can use it to make a prediction about a new data set or observation as well:

```{r}
predict(model, newdata=data.frame(x=.5, y=.5))
```

This form of discriminant analysis not only permits multiple clusters per class, but also Gaussians of any of the shapes we considered in the clustering technique. This gives the model a lot of power to describe arbitrarily complex classes.

### Semi-supervised clustering

Semi-supervised clustering is often used for cases in which you have labels for a few cases, but most cases are unlabelled. In our data set above, let's label just a few:

```{r}
dat$label <- NA
dat$label[1:3] <- "PEANUT"
dat$label[101:103] <- "PEANUT"
dat$label[201:203] <- "OREO"

partialmodel1 <- MclustSSC(dat[,2:3], dat$label)
summary(partialmodel1)
plot(partialmodel1, "uncertainty")
plot(partialmodel1, "classification")
table(dat$group, partialmodel1$classification)
```

In this case, we don't do very well, because even though both pods of the peanut had labelled cases, they get put into different clusters. The EM is performed without regard to accuracy; it only starts out correctly. Furthermore, unlike the discriminant-analysis application, it will not assign a single label to multiple clusters. What if we forced it to use three clusters?

```{r}
## FORCE IT TO USE THREE LATENT CLUSTERS:
partialmodel2 <- MclustSSC(dat[,2:3], dat$label, G=3)
summary(partialmodel2)
plot(partialmodel2, "uncertainty")
plot(partialmodel2, "classification")
table(dat$group, partialmodel2$classification)
```

This is not bad, but it infers a 'class3' label for a cluster that is actually part of the peanut group. Several of the other EM clustering techniques provide partially-supervised algorithms, and those might in fact be more flexible than this particular one.

# Summary

In this module, we learned about extending model-based clustering to regression and classification problems. We used flexmix to build arbitrary mixtures of regressions, and explored an interesting classification method in mclust using mixtures of Gaussians. Finally, we examined a situation where we have a small number of labelled cases and want to use them to label the clusters that emerge.

## References

Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1-18.