---
title: "Multinomial Regression"
author: "Shane T. Mueller shanem@mtu.edu"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    gallery: yes
    highlight: kate
    self_contained: no
  pdf_document: default
  html_document:
    df_print: paged
  word_document:
    reference_docx: ../template.docx
always_allow_html: yes
---

```{r knitr_init, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)

## Global options
options(max.print = "75")
opts_chunk$set(echo = TRUE,
               cache = TRUE,
               prompt = FALSE,
               tidy = TRUE,
               comment = NA,
               message = FALSE,
               warning = FALSE)
opts_knit$set(width = 75)
```

# Modeling categorical outcomes with more than two levels

Generalized linear regression adapts linear regression by adding a transformation (the link function) and an outcome distribution (an alternative to the Gaussian), fit with maximum-likelihood estimation. With logistic regression, we saw how we could essentially transform linear regression into predicting the likelihood of being in one of two binary states, using a binomial model. What if you have more than two categories? This calls for a multinomial (rather than binomial) model.

## Separate one-versus-rest logistic models

At a high level, a reasonable approach might be to fit a separate logistic model for each category, where we predict whether or not the target is part of that category. We can do this by hand for the iris data set:

```{r}
lm1 <- glm((Species == "setosa") ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           family = binomial, data = iris)
lm2 <- glm((Species == "versicolor") ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           family = binomial, data = iris)
lm3 <- glm((Species == "virginica") ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           family = binomial, data = iris)
summary(lm1)
summary(lm2)
summary(lm3)
```

If we want to predict the probability of each class, we end up with some problems. If we look at the **probability** of membership from each separate model (applying the inverse-logit transform to each linear predictor), the three values do not sum to 1.0! We can normalize and come up with an answer, but this seems rather ad hoc. In this case, we only made 3 errors, but maybe there is a better way.

```{r}
## Despite its name, this computes the inverse logit (logistic) function.
logit <- function(x) {
  1 / (1 + exp(-x))
}

## Linear predictors from the three separate one-versus-rest models:
modelpreds <- cbind(predict(lm1), predict(lm2), predict(lm3))
hist(rowSums(logit(modelpreds)))

## Normalize them here (a softmax over the three linear predictors):
probs <- exp(modelpreds) / rowSums(exp(modelpreds))
head(round(probs, 3))
class <- apply(probs, 1, which.max)
table(iris$Species, class)
```

## Computing binary decision criteria

By computing exp(beta)/sum(exp(beta)) across the three linear predictors, we can get an estimated probability of each group membership, ensuring the probabilities sum to 1.0. Alternately, we could compare the probabilities of any pairing of species by taking the difference of two columns:

```{r}
plot(logit(modelpreds[, 1] - modelpreds[, 2]), main = "Setosa vs. Versicolor")
plot(logit(modelpreds[, 1] - modelpreds[, 3]), main = "Setosa vs. Virginica")
plot(logit(modelpreds[, 2] - modelpreds[, 3]), main = "Versicolor vs. Virginica")
```

## The multinom model

However, this approach doesn't fit all of the categories simultaneously, and so the separation is pretty minimal. We can fit the same idea within a Poisson GLM (see the Faraway book for examples, treating species as a predictor and the cell count as the dependent variable), but the nnet library has a multinom function that does exactly this while treating species as the predicted variable. Instead of three models, it essentially fits two log-ratio models, each comparing one level of the DV to the first (baseline) level. This log-ratio model is consequently akin to the log-odds transform.
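To make the log-ratio idea concrete (the $\beta_{jk}$ notation below is just shorthand for the coefficients multinom reports), with the first species as the baseline, the model for each other species $k$ is

$$
\log \frac{P(\text{Species} = k \mid x)}{P(\text{Species} = \text{baseline} \mid x)} = \beta_{0k} + \beta_{1k}\,\text{Sepal.Length} + \beta_{2k}\,\text{Sepal.Width} + \beta_{3k}\,\text{Petal.Length} + \beta_{4k}\,\text{Petal.Width},
$$

so the predicted probabilities are a softmax over the per-species linear predictors, with the baseline's predictor fixed at 0.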
```{r}
library(nnet)
## Fit on just two species (virginica excluded); with two categories this
## reduces to ordinary logistic regression.
model2 <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                   data = iris[iris$Species != "virginica", ])

## Fit on all three species:
model <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris)
summary(model)
predict(model)
head(round(predict(model, type = "probs"), 4))
table(iris$Species, predict(model))
```

For this data set, we get almost perfect classification, better than with the separate models. The coefficients of each sub-model indicate the log of the probability ratio of that species to the baseline species. To compare two non-baseline species, we can just take the difference of their coefficients, because the baseline denominator cancels out.

Overall, this is useful for predicting, but could you do hypothesis testing as well? Let's remove one of the predictors (Petal.Width) and compare the two fits:

```{r}
model2 <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)
anova(model, model2)
```

Now, comparing the two models with this likelihood-ratio test, we can see there is a significant difference, indicating that Petal.Width is an important predictor.
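The anova() comparison tests dropping a whole term. If you also want approximate per-coefficient tests, one common approach, sketched below, is to form Wald z statistics from the coefficient and standard-error matrices that summary() returns for a multinom fit; these rely on a large-sample normal approximation, so treat them as rough.

```{r}
## Approximate Wald z-tests for each coefficient in the full multinom model.
## summary(model) provides a matrix of coefficients and a matching matrix of
## standard errors, with one row per non-baseline species.
s <- summary(model)
z <- s$coefficients / s$standard.errors
p <- 2 * pnorm(-abs(z))   # two-sided p-values from the normal approximation
round(z, 2)
round(p, 4)
```

Because the iris species are nearly separable, these standard errors can be unstable, so the likelihood-ratio comparison above is generally the safer test here.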