Mixture Modeling: Mixtures of Regressions
A mixture model is a probabilistic model for representing sub-populations within an overall population, without requiring that the observed data identify the sub-population to which an individual observation belongs. We previously used a very simple model, in which each variable is defined by its mean and standard deviation, to do model-based clustering. But mixture modeling permits finding hidden group memberships for other kinds of models too, including regression models. This sometimes enables you to identify sets of distinct relationships among sub-groups hidden within a larger population. The goal of mixture modeling is to model your data as a mixture of processes or populations that have distinct patterns of data.
Example 1: Two linear models
Suppose we have two types of people: those who like romantic comedy movies, and those who hate them. On a movie rating site, anonymous people have rated how much they like movies that we have already scored on how romantic they are. We'd therefore expect a crossing pattern in the data, like we see below. If we knew which type of person each rater was, we could fit separate regression models to the two groups, but we don't.
set.seed(100)
x1 <- rnorm(1000) * 50
x2 <- rnorm(1000) * 50
y1 <- x1 * 3 + 10 + rnorm(1000) * 30
y2 <- -3 * x2 + 25 + rnorm(1000) * 50
x <- c(x1, x2)
y <- c(y1, y2)
par(mfrow = c(1, 3))
plot(x1, y1)
plot(x2, y2)
plot(x, y)
lm(y1 ~ x1)
Call:
lm(formula = y1 ~ x1)
Coefficients:
(Intercept) x1
9.573 2.976
lm(y2 ~ x2)
Call:
lm(formula = y2 ~ x2)
Coefficients:
(Intercept) x2
25.196 -3.097
If we try to fit the data using a single linear regression, we fail:
plot(x, y)
lm1 <- lm(y ~ x)
summary(lm1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-537.79 -102.55 1.96 102.96 513.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.26748 3.53271 5.171 2.56e-07 ***
x 0.08846 0.07026 1.259 0.208
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 158 on 1998 degrees of freedom
Multiple R-squared: 0.0007929, Adjusted R-squared: 0.0002928
F-statistic: 1.586 on 1 and 1998 DF, p-value: 0.2081
abline(lm1$coefficients, col = "gold", lwd = 3)
And from summary(lm1), R-squared = 0.0007929; the single regression explains essentially nothing.
In the example above, it is quite obvious that we are missing something. But if we had more noise, or many variables, we might not notice this pattern, and just conclude that there was no predictability on average. Yet there is a very systematic pattern within the data that we might be able to uncover, if only we knew which group each person was in. This problem is just like model-based clustering: what if we could both fit two separate models and determine which group each person belongs to? We can do that using the flexmix library.
Fitting mixtures of regressions using E-M and flexmix
We can use the same approach as with clustering, but instead of gaussians around means, each sub-group is defined by a linear model. As long as we can produce a likelihood estimate for a model, we can use the E-M algorithm to identify and estimate the groups. That is, we fit a model to each subgroup, obtain the maximum likelihood estimate, place each observation in the model it is best described by, and repeat until things settle down.
To review: E-M stands for Expectation-Maximization. It is an iterative process whereby you alternate two complementary steps. Suppose you assume there are two sub-groups. First, randomly assign members to the sub-groups. Next, in the maximization step, you compute a maximum likelihood (ML) estimate of the model for each sub-group. Now, because of random variation, some members of one group are likely to be better described by the other group's model (in terms of likelihood), so in the expectation step you re-sort observations into the group that better describes them. You repeat this process, re-estimating your models and re-sorting members, until nobody is better described by a model they are not in. The original research on the E-M algorithm proved that it converges to a local maximum; you might not get the global maximum, however, so you typically repeat the process many times and pick the best outcome you find.
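As a minimal illustration of the two steps, here is a toy hard-assignment E-M for two univariate gaussians (a sketch of our own, not flexmix; it ignores edge cases such as a group becoming empty):

set.seed(1)
z <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))  # data from two hidden groups
group <- sample(1:2, 200, replace = TRUE)           # random initial assignment
for (i in 1:20) {
    m <- tapply(z, group, mean)                     # M step: per-group means
    s <- tapply(z, group, sd)                       #   ...and standard deviations
    lik1 <- dnorm(z, m[1], s[1])                    # E step: likelihood under
    lik2 <- dnorm(z, m[2], s[2])                    #   each group's model
    group <- ifelse(lik1 > lik2, 1, 2)              # re-sort to the better group
}
m                                                   # should approach 0 and 5

The same alternation works when each component is a regression line rather than a single mean, which is what we do next.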
This is possible to do by hand, and in this case is not too complicated.
split <- rep(1:2, 1000)
plot(x, y, col = split)
for (i in 1:5) {
    lm1 <- glm(y[split == 1] ~ x[split == 1])
    lm2 <- glm(y[split == 2] ~ x[split == 2])

    ## Compute residuals of all the data for the two models
    ## (from the fitted coefficients, so every point gets a prediction)
    res1 <- (coef(lm1)[1] + coef(lm1)[2] * x) - y
    res2 <- (coef(lm2)[1] + coef(lm2)[2] * x) - y

    likelihood <- cbind(dnorm(res1, mean = 0, sd = sd(res1)),
                        dnorm(res2, mean = 0, sd = sd(res2)))
    split <- apply(likelihood, 1, which.max)
    plot(x, y, col = split, main = i)
    print(lm1)
    print(lm2)
}
Call: glm(formula = y[split == 1] ~ x[split == 1])
Coefficients:
(Intercept) x[split == 1]
17.09494 0.08062
Degrees of Freedom: 999 Total (i.e. Null); 998 Residual
Null Deviance: 23240000
Residual Deviance: 23220000 AIC: 12900
Call: glm(formula = y[split == 2] ~ x[split == 2])
Coefficients:
(Intercept) x[split == 2]
19.43532 0.09495
Degrees of Freedom: 999 Total (i.e. Null); 998 Residual
Null Deviance: 26660000
Residual Deviance: 26640000 AIC: 13030
Call: glm(formula = y[split == 1] ~ x[split == 1])
Coefficients:
(Intercept) x[split == 1]
-14.6458 0.2283
Degrees of Freedom: 1058 Total (i.e. Null); 1057 Residual
Null Deviance: 23420000
Residual Deviance: 23290000 AIC: 13600
Call: glm(formula = y[split == 2] ~ x[split == 2])
Coefficients:
(Intercept) x[split == 2]
54.99022 -0.01926
Degrees of Freedom: 940 Total (i.e. Null); 939 Residual
Null Deviance: 24090000
Residual Deviance: 24090000 AIC: 12230
Call: glm(formula = y[split == 1] ~ x[split == 1])
Coefficients:
(Intercept) x[split == 1]
-107.3808 0.1218
Degrees of Freedom: 997 Total (i.e. Null); 996 Residual
Null Deviance: 9401000
Residual Deviance: 9364000 AIC: 11970
Call: glm(formula = y[split == 2] ~ x[split == 2])
Coefficients:
(Intercept) x[split == 2]
143.75711 -0.07905
Degrees of Freedom: 1001 Total (i.e. Null); 1000 Residual
Null Deviance: 8980000
Residual Deviance: 8964000 AIC: 11970
Call: glm(formula = y[split == 1] ~ x[split == 1])
Coefficients:
(Intercept) x[split == 1]
-109.0668 0.1154
Degrees of Freedom: 984 Total (i.e. Null); 983 Residual
Null Deviance: 9183000
Residual Deviance: 9150000 AIC: 11800
Call: glm(formula = y[split == 2] ~ x[split == 2])
Coefficients:
(Intercept) x[split == 2]
142.20347 -0.08373
Degrees of Freedom: 1014 Total (i.e. Null); 1013 Residual
Null Deviance: 9173000
Residual Deviance: 9155000 AIC: 12130
Call: glm(formula = y[split == 1] ~ x[split == 1])
Coefficients:
(Intercept) x[split == 1]
-110.747 0.109
Degrees of Freedom: 971 Total (i.e. Null); 970 Residual
Null Deviance: 8973000
Residual Deviance: 8944000 AIC: 11640
Call: glm(formula = y[split == 2] ~ x[split == 2])
Coefficients:
(Intercept) x[split == 2]
140.63808 -0.08789
Degrees of Freedom: 1027 Total (i.e. Null); 1026 Residual
Null Deviance: 9374000
Residual Deviance: 9354000 AIC: 12290
We can see that the two models typically found both have low slope, differing mainly in intercept. These two models are not that great; if we happened upon a pair of models that crossed, the solution would be better. We might have to run the procedure a few dozen or a few hundred times, from different random starting splits, to find that one.
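One standard remedy is random restarts. The sketch below (our own wrapper, not a flexmix feature) re-runs the hand-built E-M loop from many random initial splits and keeps the solution with the smallest total squared residual; in practice you would also guard against a component becoming empty.

set.seed(101)
best_dev <- Inf
for (rep in 1:20) {
    split <- sample(1:2, length(y), replace = TRUE)  # random initial split
    for (i in 1:10) {
        lm1 <- glm(y[split == 1] ~ x[split == 1])
        lm2 <- glm(y[split == 2] ~ x[split == 2])
        res1 <- (coef(lm1)[1] + coef(lm1)[2] * x) - y
        res2 <- (coef(lm2)[1] + coef(lm2)[2] * x) - y
        likelihood <- cbind(dnorm(res1, 0, sd(res1)), dnorm(res2, 0, sd(res2)))
        split <- apply(likelihood, 1, which.max)
    }
    dev <- sum(ifelse(split == 1, res1, res2)^2)  # total misfit of this restart
    if (dev < best_dev) {
        best_dev <- dev
        best_split <- split
    }
}
plot(x, y, col = best_split)

The flexmix library automates this kind of search for us.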
Flexmix modeling:
The ‘flexmix’ package allows us to do this very easily. FlexMix implements a general framework for fitting discrete mixtures of regression models in the R statistical computing environment.
library(flexmix)
model <- flexmix(y ~ x, k = 2)
plot(x, y, col = clusters(model))
abline(parameters(model)[1:2, 1], col = "blue", lwd = 3)
abline(parameters(model)[1:2, 2], col = "green", lwd = 3)
Along with being easier to use, it gets the original (crossing) model right almost every time; flexmix uses additional heuristics for model search that help it find a better solution than we get when building by hand.
The summary() method gives the estimated prior probabilities, the number of observations assigned to each cluster, and the number of observations whose posterior probability for that component exceeds \(\delta\) (with a default of .0001). The ratio compares those numbers: of the observations that are fit even a little bit by a component, what proportion are fit best by it? In our example, one component has a ratio of around .78: there were 1375 points that had non-zero likelihood of being in that group, and of those, 78% were best fit by that group. The distribution of those likelihoods can be shown as a rootogram. Histograms or rootograms of the posterior class probabilities can be used to visually assess the cluster structure. Rootograms are very similar to histograms; the only difference is that the heights of the bars correspond to square roots of counts rather than the counts themselves, so that low counts are more visible and peaks less dominant.
summary(model)
Call:
flexmix(formula = y ~ x, k = 2)
prior size post>0 ratio
Comp.1 0.494 933 1501 0.622
Comp.2 0.506 1067 1375 0.776
'log Lik.' -11251.22 (df=7)
AIC: 22516.45 BIC: 22555.66
plot(model)
parameters(model)
Comp.1 Comp.2
coef.(Intercept) 24.607277 9.317427
coef.x -3.098094 2.974661
sigma 47.842963 31.950443
When interpreting rootograms, keep in mind that there is no ‘ground truth’. A peak near probability 1 indicates that many of the points are overwhelmingly well-represented by that cluster. A peak near 0 would indicate that many points clearly don’t fit the category (which is also usually good). Points in the middle indicate a lack of separation–there are points that are only moderately well-described by that group, which should be considered with respect to the number of groups you think exist.
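You can also inspect these posterior probabilities directly. A quick sketch using flexmix's posterior() accessor, applied to the crossing-lines model fit above:

post <- posterior(model)   # n x k matrix of posterior class probabilities
par(mfrow = c(1, 2))
hist(post[, 1], main = "Component 1", xlab = "Posterior probability")
hist(post[, 2], main = "Component 2", xlab = "Posterior probability")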
Example: A mixture of variances
Mixture modeling can be used to identify outliers, or to select elements that are caused by the same process but have much larger variance. Sometimes, skewed distributions with long tails are modeled as mixtures of two gaussians: one to capture the fairly symmetrical centroid of the distribution, and a second to capture the long tail. At the end of the day, you may treat the tail as a nuisance and use the centroid model as the one you care about.
- Generate the data: 500 points from a low-noise linear process, plus 50 points from the same line with much higher noise.
x1 <- rep(1:10, 50)
x2 <- rep(1:10, each = 5)
y1 <- 3 * x1 + 10 + rnorm(500) * 3
y2 <- 3 * x2 + 10 + rnorm(50) * 50
x <- c(x1, x2)
y <- c(y1, y2)
plot(x, y)
Notice that we have a clear linear relationship, along with a handful of outliers. For a regular maximum-likelihood model, these outliers present a problem: either they are very unlikely (punishing the model), or the model's variance must grow to make them reasonable, making all the other data less likely.
library(flexmix)
model2a <- flexmix(y ~ x, k = 1)
summary(model2a)
Call:
flexmix(formula = y ~ x, k = 1)
prior size post>0 ratio
Comp.1 1 550 550 1
'log Lik.' -2222.624 (df=3)
AIC: 4451.247 BIC: 4464.177
model2b <- flexmix(y ~ x, k = 2)
summary(model2b)
Call:
flexmix(formula = y ~ x, k = 2)
prior size post>0 ratio
Comp.1 0.9025 505 510 0.9902
Comp.2 0.0975 45 550 0.0818
'log Lik.' -1639.939 (df=7)
AIC: 3293.878 BIC: 3324.048
parameters(model2b)
Comp.1 Comp.2
coef.(Intercept) 9.984226 16.913096
coef.x 3.015274 2.298855
sigma 2.873646 43.135081
clusters(model2b)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[ reached getOption("max.print") -- omitted 475 entries ]
In this case, the 2-group model will often place about 40 of the 50 outliers in the secondary group. We'd like to know whether the 2-group solution is actually better, and we can use a criterion such as BIC to decide.
Bayesian Information Criterion (BIC)
The BIC is a statistic based on both the likelihood of the data AND a penalty for the number of parameters used: \(BIC = -2\log L + p\log n\), where \(p\) is the number of free parameters and \(n\) is the number of observations. If a default model has three parameters (intercept, slope, and variance), a model with two clusters has six, plus one more for the mixing proportion (which is why the output above reports df = 7), and a three-cluster model has eleven. To use the BIC statistic, you simply compare across models and pick the smallest value; that model provides the best trade-off between accuracy and complexity. However, sometimes the differences are small, and you may still prefer a more complex model for other reasons.
In the above example, the BIC is given as part of the output of each model, or we can get it directly:
BIC(model2a)
[1] 4464.177
BIC(model2b)
[1] 3324.048
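As a quick check on what BIC() computes, we can reproduce the value by hand from the log-likelihood (a sketch; length(y) is the number of observations, 550 here):

ll <- logLik(model2b)                                   # log-likelihood, with a df attribute
-2 * as.numeric(ll) + attr(ll, "df") * log(length(y))  # should match BIC(model2b)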
Here, model2b has a much smaller BIC, so we would prefer it: the two-group model. We'd like to automate this comparison across a number of options, and we'd also like to refit many times (and with more than just 2 groups) to see if there are any better solutions. The stepFlexmix function does both: it lets you specify the numbers of clusters to test and the number of repetitions at each configuration. We can then select the model with the smallest BIC using the getModel() function.
models <- stepFlexmix(y ~ x, k = 1:5, nrep = 10)
1 : * * * * * * * * * *
2 : * * * * * * * * * *
3 : * * * * * * * * * *
4 : * * * * * * * * * *
5 : * * * * * * * * * *
models
Call:
stepFlexmix(y ~ x, k = 1:5, nrep = 10)
iter converged k k0 logLik AIC BIC ICL
1 2 TRUE 1 1 -2222.624 4451.247 4464.177 4464.177
2 9 TRUE 2 2 -1639.939 3293.877 3324.047 3346.978
3 200 FALSE 3 3 -1636.294 3294.589 3341.998 3644.351
4 200 FALSE 4 4 -1635.600 3301.200 3365.849 3840.103
5 200 FALSE 5 5 -1635.334 3308.668 3390.556 4024.163
summary(models)
Length Class Mode
1 stepFlexmix S4
plot(models)
model <- getModel(models, "BIC")
model
Call:
stepFlexmix(y ~ x, k = 2, nrep = 10)
Cluster sizes:
1 2
505 45
convergence after 9 iterations
plot(model)
summary(model)
Call:
stepFlexmix(y ~ x, k = 2, nrep = 10)
prior size post>0 ratio
Comp.1 0.9024 505 510 0.9902
Comp.2 0.0976 45 550 0.0818
'log Lik.' -1639.939 (df=7)
AIC: 3293.877 BIC: 3324.047
parameters(model)
Comp.1 Comp.2
coef.(Intercept) 9.983601 16.904495
coef.x 3.015429 2.298849
sigma 2.872162 43.083920
clusters(model)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[ reached getOption("max.print") -- omitted 475 entries ]
This does pretty well. There are a few things to examine. First, we can look at the post>0 ratio. This sets a very small criterion and asks how many of the data points have likelihood greater than that small level (post>0). It then asks how many are placed in the category based on maximum likelihood. For the larger component, we have a high proportion: almost all of the points that are described by the larger component at all are placed in that category. The second component is less clean: almost all of the points could be described by that component, but most are better described by the other one. This makes sense in our case, because we have two overlapping distributions. Other times, we might want mutually-exclusive clusters.
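As a sketch, we can approximately reproduce the size, post>0, and ratio columns by hand from the posterior matrix, using the \(\delta\) = .0001 threshold mentioned earlier:

post <- posterior(model)                       # posterior probabilities, n x k
colSums(post > 1e-4)                           # roughly the post>0 column
table(clusters(model))                         # the size column
table(clusters(model)) / colSums(post > 1e-4)  # roughly the ratio column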
Note that the model estimates completely different parameters for each group. Here we would probably prefer to estimate one set of regression coefficients and let only the variance differ between groups; to do that we would need either to do it by hand or to use another flexmix model driver (possibly a custom one), and it might not even be possible within flexmix.
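As a sketch of the by-hand route (our own code, not flexmix), we can hold a single regression line fixed and run E-M only on its residuals, using two zero-mean gaussian components that differ only in variance:

fit <- lm(y ~ x)        # one shared regression line
r <- resid(fit)
sd1 <- sd(r) / 2        # rough starting values
sd2 <- sd(r) * 2
p1 <- 0.9
for (i in 1:50) {
    ## E step: posterior weight for the low-variance component
    d1 <- p1 * dnorm(r, 0, sd1)
    d2 <- (1 - p1) * dnorm(r, 0, sd2)
    w <- d1 / (d1 + d2)
    ## M step: reweighted variances and mixing proportion
    sd1 <- sqrt(sum(w * r^2) / sum(w))
    sd2 <- sqrt(sum((1 - w) * r^2) / sum(1 - w))
    p1 <- mean(w)
}
c(sd1 = sd1, sd2 = sd2, p1 = p1)

This keeps one set of coefficients for everybody and lets only the variance differ; a fully satisfying version would also re-fit the line with the weights, but this conveys the idea.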
Example: learners and non-learners
In this example, suppose that you have tested a group of people repeatedly. Maybe some people get fatigued in this situation, so their performance gets worse over time; another group learns, so their performance gets better over time. On average, performance may be flat, but this could hide two separate groups. These groups might not be labeled, and they may not be easily detectable.
learners <- t(matrix(5 + 0.5 * (1:10) + 1.5 * rnorm(100), 10))
fatigues <- t(matrix(10 - 0.5 * (1:10) + 1.5 * rnorm(100), 10))
data <- rbind(learners, fatigues)
par(mfrow = c(1, 3))
matplot(t(learners), pch = 16, col = "black", main = "Learners")
matplot(t(fatigues), pch = 16, col = "black", main = "Fatigued")
matplot(t(data), pch = 16, col = "black")
Although we have two different groups, without knowing their identity we might just look at the data and declare that they are all the same. Of course, even if they were all the same, we would see some with a positive slope and some with a negative slope, so seeing both doesn't necessarily tell us we have two groups. For example, here is a data set created from a single group:
data2 <- matrix(5 + rnorm(200) * 2, 20)

pars1 <- matrix(0, ncol = 2, nrow = 20)
pars2 <- matrix(0, ncol = 2, nrow = 20)
pred <- 1:10
for (i in 1:20) {
    pars1[i, ] <- lm(data[i, ] ~ pred)$coef
    pars2[i, ] <- lm(data2[i, ] ~ pred)$coef
}
par(mfrow = c(1, 2))
plot(pars1, xlab = "intercepts", ylab = "slopes", main = "Two groups")
plot(pars2, xlab = "intercepts", ylab = "slopes", main = "One group")
By comparing the slope and intercept of each person, we don't see a lot of difference at first glance. But notice that when we have two groups, fitting a regression to each person gives parameters that cluster into two groups, with a gap in the middle. When we have one group, they vary more smoothly. A mixture model should be able to take advantage of this.
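As a quick check of that intuition (a sketch using the mclust package, which we also use later), model-based clustering of the per-person regression parameters should recover the two groups:

library(mclust)
mc <- Mclust(pars1)       # number of clusters chosen by BIC
mc$G                      # we'd expect 2 for the two-group data
table(mc$classification)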
However, to do so we need to reshape the data into long format:
library(reshape2)
data <- as.data.frame(data)
data$sub <- 1:20
long <- melt(data, id.vars = c("sub"))
long$time <- rep(1:10, each = 20)

long[1:10, ]
sub variable value time
1 1 V1 2.561813 1
2 2 V1 5.422148 1
3 3 V1 4.410111 1
4 4 V1 4.313140 1
5 5 V1 3.733472 1
6 6 V1 6.992038 1
7 7 V1 5.745078 1
8 8 V1 5.129919 1
9 9 V1 6.918519 1
10 10 V1 5.451357 1
Here, each row of the long data frame holds one observation, indexed by subject and time. Now we can fit a repeated-measures mixture regression with the following formula, which aggregates across subjects: exactly what we want. The points of each subject are classified all-or-none into one or the other cluster.
f <- flexmix(value ~ time | as.factor(sub), data = long, k = 2, control = list(iter.max = 10))
parameters(f)
Comp.1 Comp.2
coef.(Intercept) 4.8171232 9.9187058
coef.time 0.5481169 -0.5051437
sigma 1.5262642 1.4203917
clusters(f)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
[39] 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[ reached getOption("max.print") -- omitted 125 entries ]
table(clusters(f), long$sub)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 10 10 10 10 10 10 10 10 10 10 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 10 10 10 10 10 10 10 10 10 10
f1 <- flexmix(value ~ time | as.factor(sub), data = long, k = 1, control = list(iter.max = 10))
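The k = 1 fit gives us a baseline for comparison; by BIC, we'd expect the two-class model to win for these data:

BIC(f1)   # one-class baseline
BIC(f)    # two-class model (expected to be smaller)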
Example: Curved versus straight models
Mixtures don't always come from the same process or type of model. Here is one in which one component is linear and a second is curved.
x1 <- rnorm(1000) * 5
x2 <- rnorm(1000) * 5
y1 <- 3 * x1 + 10 + rnorm(1000) * 5
y2 <- 10 + 5 * x2 + 0.3 * x2^2 + rnorm(1000) * 5
x <- c(x1, x2)
y <- c(y1, y2)
plot(x, y)
Here, we fit two polynomial models. This should work, because a linear model is just a special case of a polynomial model:
model3 <- flexmix(y ~ poly(x, 2), k = 2)
summary(model3)
Call:
flexmix(formula = y ~ poly(x, 2), k = 2)
prior size post>0 ratio
Comp.1 0.505 926 1872 0.495
Comp.2 0.495 1074 1877 0.572
'log Lik.' -6438.9 (df=9)
AIC: 12895.8 BIC: 12946.21
plot(x, y, col = c("red", "green")[clusters(model3)])
ord <- order(x)
points(x[ord], predict(model3)[[1]][ord], type = "l", col = "black", lwd = 6)
points(x[ord], predict(model3)[[1]][ord], type = "l", col = "red", lwd = 3)
points(x[ord], predict(model3)[[2]][ord], type = "l", col = "black", lwd = 6)
points(x[ord], predict(model3)[[2]][ord], type = "l", col = "green", lwd = 3)
We can look at basic statistical tests with the following trick:
model3 <- flexmix(y ~ poly(x, 2), k = 2)
rm1 <- refit(model3)
summary(rm1)
$Comp.1
Estimate Std. Error z value Pr(>|z|)
(Intercept) 16.73154 0.23836 70.195 < 2.2e-16 ***
poly(x, 2)1 1117.82809 8.45443 132.218 < 2.2e-16 ***
poly(x, 2)2 463.06664 7.56269 61.230 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
$Comp.2
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.6721 0.2387 40.520 <2e-16 ***
poly(x, 2)1 677.3607 8.9369 75.794 <2e-16 ***
poly(x, 2)2 7.1761 8.6461 0.830 0.4066
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can see that for one component the quadratic term is significant, while for the other it is not.
Control parameters
If you want to set a maximum of 10 iterations when fitting model3:

model31 <- flexmix(y ~ poly(x, 2), k = 2, control = list(iter.max = 10))
Another control parameter is minprior: the minimum prior probability that components are enforced to have. Components falling below this threshold (the current default is 0.05) are removed during E-M iteration to avoid numerical instabilities for components containing only a few observations. Using a minimum prior of 0 disables component removal. Here is a regression mixture started with four components; with minprior = 0.2, the weak components are removed during fitting, leaving two latent classes. We can compare them to the actual class at the end:
data("NPreg")
m1 <- flexmix(yp ~ x + I(x^2), data = NPreg, k = 4, control = list(minprior = 0.2))
summary(m1)
Call:
flexmix(formula = yp ~ x + I(x^2), data = NPreg, k = 4, control = list(minprior = 0.2))
prior size post>0 ratio
Comp.1 0.48 78 200 0.390
Comp.2 0.52 122 180 0.678
'log Lik.' -439.1561 (df=9)
AIC: 896.3123 BIC: 925.9971
plot(m1)
clusters(m1)
[1] 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2 1 2 2 2 2 2 1 2 1 1 1 2 2 2 1
[39] 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 2
[ reached getOption("max.print") -- omitted 125 entries ]
table(NPreg$class, clusters(m1))
1 2
1 25 75
2 53 47
Model-based classification and semi-supervised classification.
Many of the model-based clustering libraries, including mclust, have capabilities for model-based classification and semi-supervised classification. You can think of classification and clustering as two ends of a spectrum: at one end you know the answer, and at the other you do not. Mid-points include knowing the answers for only SOME of the cases, or knowing answers for all or some of the cases only probabilistically. These frameworks integrate classification and clustering under one approach. For mclust, there are two related functions:
- MclustSSC: This performs semi-supervised clustering. It STARTS with the labeled data as the initial clusters, but then performs the normal E-M algorithm and reclassifies each element based on an evolving set of clusters.
- MclustDA: Cluster-based LDA. This is much more like traditional LDA, but you can specify each of the different kinds of gaussian models (constraining variance, covariance, volume, etc.)
Cluster-based discriminant analysis.
For the discriminant analysis, suppose we have two groups, one of which is not well described by a single gaussian but rather forms a 'peanut':
library(ggplot2)
library(mclust)
groupA <- rbind(cbind(x = rnorm(100, mean = 30, sd = 5),
                      y = rnorm(100, mean = 80, sd = 5)),
                cbind(x = rnorm(100, mean = 60, sd = 5),
                      y = rnorm(100, mean = 80, sd = 5)))
groupB <- cbind(x = rnorm(200, mean = 30, sd = 10),
                y = rnorm(200, mean = 60, sd = 10))

dat <- data.frame(group = rep(c("A", "B"), each = 200),
                  x = c(groupA[, 1], groupB[, 1]),
                  y = c(groupA[, 2], groupB[, 2]))
dat |>
    ggplot(aes(x = x, y = y, color = group)) + geom_point() + theme_bw()
model <- MclustDA(dat[, 2:3], dat$group)
plot(model, "classification")
plot(model, "scatterplot")
summary(model)
------------------------------------------------
Gaussian finite mixture model for classification
------------------------------------------------
MclustDA model summary:
log-likelihood n df BIC
-3078.028 400 9 -6209.979
Classes n % Model G
A 200 50 EII 2
B 200 50 XII 1
Training confusion matrix:
Predicted
Class A B
A 189 11
B 10 190
Classification error = 0.0525
Brier score = 0.0379
Notice that MclustDA allows multiple gaussians to describe a group. The summary reports a training classification error of around 5%; mclust can also produce cross-validated error estimates. You can use the model to make a prediction about a new data set or observation as well:
predict(model, newdata = data.frame(x = 0.5, y = 0.5))
$classification
[1] B
Levels: A B
$z
A B
[1,] 3.529834e-60 1
This LDA not only permits multiple clusters, but gaussians of any of the shapes we considered in the clustering technique. This gives the model a lot of power to describe arbitrarily complex classes.
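One way to see that flexibility (a sketch of our own): classify a grid of points with predict() and plot the implied decision regions. The grid ranges here are our own choice, picked to cover the simulated data.

grid <- expand.grid(x = seq(0, 90, length.out = 100),
                    y = seq(30, 100, length.out = 100))
pred <- predict(model, newdata = grid)   # MclustDA predictions over the grid
plot(grid$x, grid$y, pch = 15, cex = 0.4,
     col = c("pink", "lightblue")[as.integer(pred$classification)],
     xlab = "x", ylab = "y")
points(dat$x, dat$y, col = c("red", "blue")[as.integer(factor(dat$group))])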
Semi-supervised clustering
Semi-supervised clustering is often used for cases in which you have labels for a few cases, but most cases are unlabelled. In our data set above, let’s pick just a few:
dat$label <- NA
dat$label[1:3] <- "PEANUT"
dat$label[101:103] <- "PEANUT"
dat$label[201:203] <- "OREO"
partialmodel1 <- MclustSSC(dat[, 2:3], dat$label)
summary(partialmodel1)
----------------------------------------------------------------
Gaussian finite mixture model for semi-supervised classification
----------------------------------------------------------------
log-likelihood n df BIC
-3159.963 400 9 -6373.848
Classes n % Model G
OREO 3 0.75 VVI 1
PEANUT 6 1.50 VVI 1
<NA> 391 97.75
Classification summary:
Predicted
Class OREO PEANUT
OREO 3 0
PEANUT 0 6
<NA> 219 172
plot(partialmodel1, "uncertainty")
plot(partialmodel1, "classification")
table(dat$group, partialmodel1$classification)
OREO PEANUT
A 30 170
B 192 8
In this case, we don't do very well: even though both pods of the peanut had labelled cases, they get put into different clusters. The E-M is performed without regard to accuracy; it only starts out correctly. Furthermore, unlike the LDA application, it will not assign a single label to multiple clusters. What if we forced it to use three clusters?
## FORCE IT TO USE THREE LATENT CLUSTERS:
partialmodel2 <- MclustSSC(dat[, 2:3], dat$label, G = 3)
summary(partialmodel2)
----------------------------------------------------------------
Gaussian finite mixture model for semi-supervised classification
----------------------------------------------------------------
log-likelihood n df BIC
-3122.516 400 8 -6292.964
Classes n % Model G
OREO 3 0.75 VII 1
PEANUT 6 1.50 VII 1
class3 0 0.00 VII 1
<NA> 391 97.75
Classification summary:
Predicted
Class OREO PEANUT class3
OREO 3 0 0
PEANUT 0 6 0
<NA> 197 100 94
plot(partialmodel2, "uncertainty")
plot(partialmodel2, "classification")
table(dat$group, partialmodel2$classification)
OREO PEANUT class3
A 10 103 87
B 190 3 7
This is not bad, but it infers a 'class3' label that is actually part of the peanut group. Several of the other E-M clustering techniques provide partially-supervised algorithms, and those might in fact be more flexible than this particular one.
Summary
In this module, we learned about extending model-based clustering to regression and classification problems. We used flexmix to build arbitrary mixtures of regressions, and explored an interesting classification method in mclust using mixtures of gaussians. Finally, we examined a situation where we have a small number of labelled cases and want to use them to label the clusters that emerge.
References
Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 1-18.