Naive Bayes Classifiers
A Naive Bayes classifier uses Bayes' rule to combine information about a set of predictors, treating the predictors as independent of one another within each class (the "naive" assumption). Although many Bayesian approaches can be complex and computationally intensive, Naive Bayes classifiers are simple enough that they can often be implemented without any special library. They are also easy to use even when you have hundreds or thousands of features, as in text classification problems such as sentiment analysis or spam detection, where each word is a feature/predictor.
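To make the "no special library" point concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier in base R. The helper names gnb_fit and gnb_predict are hypothetical (they do not come from any package), each predictor is assumed to be numeric and roughly normal within each class, and the built-in iris data is used only as a stand-in data set.

```r
# Minimal Gaussian Naive Bayes in base R (illustrative sketch only;
# gnb_fit and gnb_predict are hypothetical helper names).

gnb_fit <- function(X, y) {
  X <- as.matrix(X)
  y <- factor(y)
  per_class <- function(f) {
    # apply f to each predictor within each class -> predictors x classes matrix
    do.call(cbind, lapply(levels(y), function(k) apply(X[y == k, , drop = FALSE], 2, f)))
  }
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),  # P(class)
    mean   = per_class(mean),                   # class-conditional means
    sd     = per_class(sd)                      # class-conditional standard deviations
  )
}

gnb_predict <- function(model, X) {
  Xt <- t(as.matrix(X))                         # predictors x observations
  # log posterior, up to a constant, for each observation (rows) and class (columns):
  # log P(class) + sum over predictors of log N(x | mean, sd)
  log_post <- sapply(seq_along(model$levels), function(k) {
    ll <- dnorm(Xt, mean = model$mean[, k], sd = model$sd[, k], log = TRUE)
    dim(ll) <- dim(Xt)                          # keep the shape explicit
    log(model$prior[k]) + colSums(ll)
  })
  best <- max.col(log_post, ties.method = "first")
  factor(model$levels[best], levels = model$levels)
}

# Stand-in example on the built-in iris data (training-set accuracy only)
model <- gnb_fit(iris[, 1:4], iris$Species)
pred <- gnb_predict(model, iris[, 1:4])
accuracy <- mean(pred == iris$Species)
print(table(Actual = iris$Species, Predicted = pred))
print(accuracy)
```

Summing log densities rather than multiplying raw densities is a design choice: with hundreds of features the raw product quickly underflows to 0.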
The klaR library includes the NaiveBayes function, a simple implementation of the Naive Bayes classifier. Using it is straightforward: the call looks much like an lm model. Before discussing how it works, let’s start with an example application for the iPhone data set:
library(klaR)  # provides NaiveBayes()
dat <- read.csv("data_study1.csv")
dat$Smartphone <- factor(dat$Smartphone)
nb <- NaiveBayes(Smartphone ~ ., data = dat)
Now, instead of just the mean and standard deviation, we have estimated quantiles of the distribution. This might help us when we have skewed distributions, but we need a lot of observations in order to get reliable estimates. 200-300 observations in each group might not be enough to do well. If we look at the predictions:
Overall accuracy = 0.699
Confusion matrix
Predicted (cv)
Actual Android iPhone
Android 0.557 0.443
iPhone 0.200 0.800
Here, we do a bit better, and in contrast to some of the other classification methods, we classify about 56% of Android users correctly (the LDA model sometimes got fewer than 50% of them correct).
Again, it would be good to implement a cross-validation here.
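A minimal sketch of what k-fold cross-validation could look like in base R follows. The helper cv_accuracy is hypothetical, and a simple nearest-centroid classifier on the built-in iris data stands in for the real model, so that the example runs without klaR or the smartphone data.

```r
# k-fold cross-validation sketch (cv_accuracy is a hypothetical helper).
# fit_fn(train) returns a model; predict_fn(model, newdata) returns predicted labels.
cv_accuracy <- function(data, label, k, fit_fn, predict_fn) {
  n <- nrow(data)
  fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
  correct <- logical(n)
  for (f in seq_len(k)) {
    held_out <- fold == f
    model <- fit_fn(data[!held_out, , drop = FALSE])            # train on k-1 folds
    pred <- predict_fn(model, data[held_out, , drop = FALSE])   # predict the held-out fold
    correct[held_out] <- as.character(pred) == as.character(data[[label]][held_out])
  }
  mean(correct)  # proportion of held-out cases classified correctly
}

# Stand-in classifier: assign each case to the class with the nearest mean.
fit_centroid <- function(train) {
  aggregate(. ~ Species, data = train, FUN = mean)  # one row of means per class
}
predict_centroid <- function(model, newdata) {
  centers <- as.matrix(model[, -1])
  X <- as.matrix(newdata[, colnames(centers)])
  # squared distance from every case (rows) to every class mean (columns)
  d <- sapply(seq_len(nrow(centers)), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  model$Species[max.col(-d, ties.method = "first")]
}

set.seed(1)
acc <- cv_accuracy(iris, "Species", k = 5, fit_centroid, predict_centroid)
print(acc)
```

To cross-validate the klaR model instead, one could pass wrappers such as function(d) NaiveBayes(Smartphone ~ ., data = d) for fit_fn and function(m, d) predict(m, d)$class for predict_fn, with "Smartphone" as the label.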
Example: Predicting Dengue Fever
Fitting the NaiveBayes function is sensitive to missing data and zero-variance predictors. If a variable has no variance within a class, any new value that differs from the training value will have a likelihood of 0, and the likelihood ratio can become infinite. A single 0 or infinite likelihood will break the classifier, because the posterior will be 0 or infinite no matter what the other features say. Similarly, an NA in the training set can cause trouble, depending on how the model handles it. It might be useful to impute NA data, or to add small amounts of noise to the training set to smooth out the values. The NaiveBayes function also allows you to ignore NA values, which we will do below.
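The zero-likelihood problem can be illustrated with a few lines of base R. The likelihood values below are made up for illustration, and the flooring workaround is one common option shown as an assumption, not a description of what NaiveBayes does internally.

```r
# A normal density with sd = 0 is a point mass:
d_same <- dnorm(1, mean = 1, sd = 0)     # value equal to the training value
d_diff <- dnorm(1.01, mean = 1, sd = 0)  # any other value
print(c(d_same, d_diff))

# Made-up likelihoods of five features under two classes.
lik_a <- c(0.30, 0.25, 0.40, 0.35, 0.00)  # last feature has zero likelihood under A
lik_b <- c(0.05, 0.04, 0.06, 0.05, 0.01)

# Multiplying: the single zero wipes out the evidence from the other four features.
prod_a <- prod(lik_a)
prod_b <- prod(lik_b)
print(c(prod_a, prod_b))

# One workaround (an assumption for this sketch): floor each likelihood at a
# small epsilon and sum the logs instead of multiplying.
eps <- 1e-3
loglik_a <- sum(log(pmax(lik_a, eps)))
loglik_b <- sum(log(pmax(lik_b, eps)))
print(c(loglik_a, loglik_b))
```

With the floor in place, the one implausible feature counts against class A but no longer overrides everything else.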
     humid            humid90             temp              temp90
 Min.   : 0.6714   Min.   : 1.066   Min.   :-18.68   Min.   :-10.07
 1st Qu.:10.0088   1st Qu.:10.307   1st Qu.: 11.10   1st Qu.: 12.76
 Median :16.1433   Median :16.870   Median : 20.99   Median : 22.03
 Mean   :16.7013   Mean   :17.244   Mean   : 18.41   Mean   : 19.41
 3rd Qu.:23.6184   3rd Qu.:24.131   3rd Qu.: 25.47   3rd Qu.: 25.98

     h10pix          h10pix90           trees           trees90
 Min.   : 4.317   Min.   : 5.848   Min.   : 0.0   Min.   : 0.00
 1st Qu.:14.584   1st Qu.:14.918   1st Qu.: 1.0   1st Qu.: 6.00
 Median :23.115   Median :24.130   Median :15.0   Median :30.60
 Mean   :21.199   Mean   :21.557   Mean   :22.7   Mean   :35.21
 3rd Qu.:28.509   3rd Qu.:28.627   3rd Qu.:37.0   3rd Qu.:63.62

     NoYes             Xmin               Xmax               Ymin
 Min.   :0.0000   Min.   :-179.50   Min.   :-172.00   Min.   :-54.50
 1st Qu.:0.0000   1st Qu.: -12.00   1st Qu.: -10.00   1st Qu.:  6.00
 Median :0.0000   Median :  16.00   Median :  17.75   Median : 18.00
 Mean   :0.4155   Mean   :  13.31   Mean   :  15.63   Mean   : 19.78
 3rd Qu.:1.0000   3rd Qu.:  42.62   3rd Qu.:  44.50   3rd Qu.: 39.00

      Ymax
 Min.   :-55.50
 1st Qu.:  5.00
 Median : 17.00
 Mean   : 18.16
 3rd Qu.: 37.00
Notice that there are about a dozen or so values with NA data.
library(DAAG)  # provides confusion()

# This doesn't work, because of the NA values:
# nb.dengue <- NaiveBayes(NoYes ~ ., data = dengue)

# This works: rows containing NA are dropped
nb.dengue <- NaiveBayes(NoYes ~ ., data = dengue, na.action = "na.omit")
# This one works with the klaR NaiveBayes:
nb.dengue2 <- klaR::NaiveBayes(NoYes ~ h10pix + Xmin + Ymin, data = dengue)
# This one works (naiveBayes from the e1071 package):
nb.dengue3 <- e1071::naiveBayes(NoYes ~ ., data = dengue)

# When we remove the NAs, we need to remove them from the ground truth too
# (column 9 is NoYes, so it is excluded from the NA check):
confusion(dengue$NoYes[!is.na(rowMeans(dengue[, -9]))], predict(nb.dengue)$class)
Overall accuracy = 0.882
Confusion matrix
Predicted (cv)
Actual 0 1
0 0.845 0.155
1 0.066 0.934
Overall accuracy = 0.883
Confusion matrix
Predicted (cv)
Actual 0 1
0 0.843 0.157
1 0.059 0.941
Overall accuracy = 0.881
Confusion matrix
Predicted (cv)
Actual 0 1
0 0.844 0.156
1 0.066 0.934
Example: Naive Bayes with the MNIST (handwriting) set
## This code will download and create a 500-image, two-class training set
## (and a matching test set) from MNIST via tensorflow.
## Note: tf$contrib exists only in TensorFlow 1.x.
library(tensorflow)
datasets <- tf$contrib$learn$datasets
mnist <- datasets$mnist$read_data_sets("MNIST-data", one_hot = TRUE)
## extract just two labels and sample images
mnist.1 <- mnist$train$labels[, 1]
mnist.2 <- mnist$train$labels[, 2]
mnist.img1 <- mnist$train$images[mnist.1 == 1, ]
mnist.img2 <- mnist$train$images[mnist.2 == 1, ]
## plot prototypes
## these are too many. Sample 500 from each class: 250 for training, 250 for testing
traintest1 <- sample(1:nrow(mnist.img1), size = 500)
traintest2 <- sample(1:nrow(mnist.img2), size = 500)
## as.data.frame() so that columns can be added with $
train <- as.data.frame(rbind(mnist.img1[traintest1[1:250], ],
                             mnist.img2[traintest2[1:250], ]))
test <- as.data.frame(rbind(mnist.img1[traintest1[251:500], ],
                            mnist.img2[traintest2[251:500], ]))
train$labels <- rep(0:1, each = 250)
train <- read.csv("trainmnist.csv")
test <- read.csv("testmnist.csv")
## smooth out the training set a bit by adding some noise, so no pixel has a sd
## of 0.
for (i in 1:ncol(train)) {
  train[, i] <- rnorm(500, as.numeric(train[, i]), 0.1)
}
par(mfrow = c(2, 4))
image(matrix(unlist(train[1, ]), nrow = 28))
image(matrix(unlist(train[2, ]), nrow = 28))
image(matrix(unlist(train[3, ]), nrow = 28))
image(matrix(colMeans(train[1:250, ]), nrow = 28), main = "Average of 250 prototypes")
image(matrix(unlist(train[251, ]), nrow = 28))
image(matrix(unlist(train[252, ]), nrow = 28))
image(matrix(unlist(train[253, ]), nrow = 28))
image(matrix(colMeans(train[251:500, ]), nrow = 28), main = "Average of 250 prototypes")
Build Naive Bayes model
train$labels <- as.factor(rep(0:1, each = 250))
nb3 <- NaiveBayes(labels ~ ., usekernel = TRUE, data = train)
p3a <- predict(nb3) ##this takes a while
confusion(train$labels, p3a$class) ##almost perfect
Overall accuracy = 0.998
Confusion matrix
Predicted (cv)
Actual 0 1
0 0.996 0.004
1 0.000 1.000
Overall accuracy = 0.994
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 1.000 0.000
[2,] 0.012 0.988
Testing with additional noise
What if we add some noise to the test set? I will add Gaussian noise with mean 0, so that we hopefully won’t be biased toward one class or the other.
for (i in 1:ncol(test)) {
  test[, i] <- rnorm(500, as.numeric(test[, i]), 0.25)
}
par(mfrow = c(2, 4))
image(matrix(unlist(test[1, ]), nrow = 28))
image(matrix(unlist(test[2, ]), nrow = 28))
image(matrix(unlist(test[3, ]), nrow = 28))
image(matrix(colMeans(test[1:250, ]), nrow = 28), main = "Average of 250 prototypes")
image(matrix(unlist(test[251, ]), nrow = 28))
image(matrix(unlist(test[252, ]), nrow = 28))
image(matrix(unlist(test[253, ]), nrow = 28))
image(matrix(colMeans(test[251:500, ]), nrow = 28), main = "Average of 250 prototypes")
p3c <- predict(nb3, test) ## predict on the noisy test set
confusion(rep(0:1, each = 250), p3c$class)
Overall accuracy = 0.736
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 1.000 0.000
[2,] 0.528 0.472
Retraining with noisy data
For this example with a small amount of noise, the classifier starts to fail: it appears to be biased toward 0, maybe because 0 has more overall pixels. This is true even though we can easily tell what the class is visually. I think what is happening is that because the classifier was trained on data with much less noise, individual features can produce very strong positive or negative evidence toward a class. When we add noise to the test set and a few of those features turn on haphazardly, they overpower the other features. It would probably be better to train the model with the kind of noise we’d expect, and it might do a lot better. To illustrate, let’s add MORE noise to both the training and test sets and retrain.
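The "overpowering" effect can be sketched for a single hypothetical pixel. The numbers are assumptions chosen for illustration: the pixel is off (0) in one class and averages 0.5 in the other, and plain Gaussian densities are used in place of the kernel estimates that the model above relies on.

```r
# Log likelihood ratio for one pixel value x: positive values favor the "on" class.
# mean_off / mean_on and the sd values are hypothetical, for illustration only.
log_lr <- function(x, sd_off, sd_on, mean_off = 0, mean_on = 0.5) {
  dnorm(x, mean = mean_on, sd = sd_on, log = TRUE) -
    dnorm(x, mean = mean_off, sd = sd_off, log = TRUE)
}

# Test noise of sd 0.25 can easily push an "off" pixel up to 0.5.
x <- 0.5

# Trained with little noise: the "off" class has a very narrow density (sd 0.1),
# so this one pixel alone produces very strong evidence for the wrong class.
lr_narrow <- log_lr(x, sd_off = 0.1, sd_on = 0.5)

# Trained with noise matching the test noise: both densities are wider,
# and the same observation carries far less weight.
lr_matched <- log_lr(x, sd_off = sqrt(0.1^2 + 0.25^2), sd_on = sqrt(0.5^2 + 0.25^2))

print(c(narrow = lr_narrow, matched = lr_matched))
```

Because Naive Bayes adds these log ratios across all 784 pixels, a handful of pixels with evidence this extreme can outweigh hundreds of pixels that each carry modest evidence for the correct class.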
noisetest <- test
for (i in 1:ncol(test)) {
  train[, i] <- rnorm(500, as.numeric(train[, i]), 2)
  noisetest[, i] <- rnorm(500, as.numeric(test[, i]), 2)
}
## retrain on the noisy training set
nb3 <- NaiveBayes(labels ~ ., usekernel = TRUE, data = train)
par(mfrow = c(2, 4))
image(matrix(unlist(noisetest[1, ]), nrow = 28))
image(matrix(unlist(noisetest[2, ]), nrow = 28))
image(matrix(unlist(noisetest[3, ]), nrow = 28))
image(matrix(colMeans(noisetest[1:250, ]), nrow = 28), main = "Average of 250 prototypes")
image(matrix(unlist(noisetest[251, ]), nrow = 28))
image(matrix(unlist(noisetest[252, ]), nrow = 28))
image(matrix(unlist(noisetest[253, ]), nrow = 28))
image(matrix(colMeans(noisetest[251:500, ]), nrow = 28), main = "Average of 250 prototypes")
Overall accuracy = 0.988
Confusion matrix
Predicted (cv)
Actual [,1] [,2]
[1,] 0.976 0.024
[2,] 0.000 1.000
Now, even though the actual signal is almost completely obscured to the naked eye, the NB classifier is again nearly perfect.