Naive Bayes Classifiers

A naive bayes classifier uses Bayes rule to combine information about a set of predictors. Although many Bayesian approaches can be quite complex and computationally-intensive, Naive Bayes classifiers are simple such that they can often be implemented without any special library. They are also easy to use even when you have hundreds or thousands of features, such as when trying to classify text (e.g., each word is a feature/predictor), for classification problems like sentiment analysis or spam detection.

The klaR library includes the NaiveBayes function, which is a simple implementation of the Naive Bayes classifier, and using it is pretty simple–it looks basically like an lm or lda model. Before discussing how it works, let’s start with an example application for the iPhone data set:

dat <- read.csv("data_study1.csv")
dat$Smartphone <- factor(dat$Smartphone)
library(klaR)
nb <- NaiveBayes(Smartphone ~ ., data = dat)
library(DAAG)
summary(nb)

          Length Class      Mode     
apriori    2     table      numeric  
tables    12     -none-     list     
levels     2     -none-     character
call       3     -none-     call     
x         12     data.frame list     
usekernel  1     -none-     logical  
varnames  12     -none-     character

nb$apriori

grouping
  Android    iPhone 
0.4139887 0.5860113

nb$tables

$Gender
         var
grouping     female      male
  Android 0.5753425 0.4246575
  iPhone  0.7516129 0.2483871

$Age
            [,1]     [,2]
Android 31.42466 13.52190
iPhone  26.84839 12.19792

$Honesty.Humility
            [,1]      [,2]
Android 3.598174 0.5971617
iPhone  3.351935 0.6269406

$Emotionality
            [,1]      [,2]
Android 3.231963 0.7187408
iPhone  3.455806 0.6648467

$Extraversion
            [,1]      [,2]
Android 3.196347 0.6842832
iPhone  3.288710 0.6360540

$Agreeableness
            [,1]      [,2]
Android 3.175799 0.5900619
iPhone  3.123226 0.6390040

$Conscientiousness
            [,1]      [,2]
Android 3.544292 0.5910664
iPhone  3.573871 0.6173289

$Openness
            [,1]      [,2]
Android 3.542922 0.6218371
iPhone  3.415806 0.6224821

$Avoidance.Similarity
            [,1]      [,2]
Android 2.545205 0.8413687
iPhone  2.306452 0.8150825

$Phone.as.status.object
            [,1]      [,2]
Android 2.086758 0.5411150
iPhone  2.375484 0.6006299

$Social.Economic.Status
            [,1]     [,2]
Android 5.949772 1.586132
iPhone  6.012903 1.485582

$Time.owned.current.phone
            [,1]      [,2]
Android 12.72146 11.497509
iPhone  11.68065  8.748811

To understand how it works, let’s look at the output. First, we have the a priori values (.41 vs .59)–this determines the overall base rate of the outcomes. Sensitivity to base rate is one of the important features of Naive Bayes. Then, we have the mean and standard deviation of normal distribution describing of each feature for each group. To confirm, look at the values for age:

means <- tapply(as.numeric(dat$Age), list(dat$Smartphone), mean)
sds <- tapply(as.numeric(dat$Age), list(dat$Smartphone), sd)
means

 Android   iPhone 
31.42466 26.84839

sds

 Android   iPhone 
13.52190 12.19792

Naive Bayes works very similarly to lda or qda, but from a probabilistic perspective. These attempt to find a decision point consistent with the most likely outcome that combines all features optimally. In contrast, NB combines all features equally–for each feature computing the likelihood of each option, and combining those one at a time. For the Age predictor, the two densities look like this:

library(ggplot2)
phone <- rep(c("iphone", "android"), each = 100)
density <- c(dnorm(1:100, mean = means[2], sd = sds[2]), dnorm(1:100, mean = means[1],
    sd = sds[1]))
df <- data.frame(age = c(1:100, 1:100), phone, density)

ggplot(df, aes(x = age, y = density, colour = phone)) + geom_line(size = 2) + theme_bw() +
    scale_color_manual(values = c("orange2", "navy"))

So, in a way this is very similar to LDA/QDA, because it uses a model-based approach, assuming a normal distribution, but it permits different means AND standard deviations for each group. Notice how for different ages, there is a higher likelihood for one group or the other. We can compute a ratio easily:

df <- data.frame(age = c(1:100), lr = (dnorm(1:100, mean = means[2], sd = sds[2])/dnorm(1:100,
    mean = means[1], sd = sds[1])))


ggplot(df, aes(x = age, y = lr)) + geom_line() + geom_hline(yintercept = 1) + theme_bw() +
    scale_color_manual(values = c("orange2", "navy"))

This shows the likelihood of ratio in favor of being an iPhone user across age. Notice that this is highly model-based–if your assumed model is wrong, like it if it is not normally-distributed, the ratio can be wrong. This is especially true out in the tails of the distribution, where we probably have the least amount of data, but where the observations can have the greatest impact. If this were a decision tree, we’d use the point the likelihood ratio crosses 1.0 as the place to put the cut-off for that dimension. If we were doing LDA or QDA, we’d combine all axes and find a cutoff along a single vector. NB combines the likelihood ratio curves for each dimension using Bayes Rule, which also permits combining prior probability (base rate). So, like LDA, it combines all the dimensions, but like a decision tree, it computes a fit on each dimension.

In contrast to LDA and QDA, the way features are combined in NB differs. In QDA and LDA, a decision criteria or quadratic function is selected from among a well-formed class (a plane or quadratic along the discriminant function), which minimizes the number of classification errors in the training set. Consequently, the selection of that rule depends on base rate In NB, the likelihood ratio of each feature is incorporated independently into a posterior likelihood (the probability of each class given the data), and then base rate is also included optimally at the end. If you have a missing value, you can ignore the contribution of that feature, as the best guess is equal evidence for all options. Equal evidence is 1.0, and likelihood is computing by taking the product of the likelihood ratios of each feature, so missing values are equivalent of multiplying the current posterior likelihood by 1.0–which has no impact on the classification. For LDA and QDA, it is not as easy to deal with missing values. This is one benefit of Bayes Factor–it is useful for data that are often likely to be missing. This is one of the reasons why it is frequently used for classifying text data, which may have thousands of features (i.e., words) but most of them are missing for any given document.

We can use this for prediction:

Overall accuracy = 0.667 

Confusion matrix 
         Predicted (cv)
Actual    Android iPhone
  Android   0.543  0.457
  iPhone    0.245  0.755

This performs about as well as any of the other models. There is no simple way to do cross-validation with this implementation–you would need to manually split the data set if you wanted to implement cross-validation. But also note that if a feature is non-predictive, it will essentially have very little impact on the outcome.

Using empirically-estimated kernels

The model assumes a normal/gaussian distribution, but Naive Bayes will permit us to estimate the of each feature empirically, which happens when we specify usekernel=T. Instead of fitting a normal distribution to the pattern of data, it will estimate the shape of the distributions, and thus the shape of the likelihood ratio function.

nb2 <- NaiveBayes(Smartphone ~ ., usekernel = T, data = dat)
nb2$tables

$Gender
         var
grouping     female      male
  Android 0.5753425 0.4246575
  iPhone  0.7516129 0.2483871

$Age
$Age$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 4.142

       x                y            
 Min.   : 3.575   Min.   :5.350e-06  
 1st Qu.:23.537   1st Qu.:2.524e-03  
 Median :43.500   Median :8.884e-03  
 Mean   :43.500   Mean   :1.251e-02  
 3rd Qu.:63.463   3rd Qu.:1.954e-02  
 Max.   :83.425   Max.   :4.128e-02  

$Age$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 2.292

       x                y            
 Min.   : 9.123   Min.   :1.259e-05  
 1st Qu.:27.062   1st Qu.:8.960e-04  
 Median :45.000   Median :7.547e-03  
 Mean   :45.000   Mean   :1.392e-02  
 3rd Qu.:62.938   3rd Qu.:1.111e-02  
 Max.   :80.877   Max.   :9.380e-02  


$Honesty.Humility
$Honesty.Humility$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.1829

       x               y            
 Min.   :1.151   Min.   :0.0001146  
 1st Qu.:2.226   1st Qu.:0.0197670  
 Median :3.300   Median :0.1441117  
 Mean   :3.300   Mean   :0.2324787  
 3rd Qu.:4.374   3rd Qu.:0.4797145  
 Max.   :5.449   Max.   :0.5965253  

$Honesty.Humility$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1791

       x               y            
 Min.   :1.163   Min.   :0.0000963  
 1st Qu.:2.206   1st Qu.:0.0382611  
 Median :3.250   Median :0.1789519  
 Mean   :3.250   Mean   :0.2392920  
 3rd Qu.:4.294   3rd Qu.:0.4296896  
 Max.   :5.337   Max.   :0.6257473  


$Emotionality
$Emotionality$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.2202

       x                y            
 Min.   :0.4395   Min.   :0.0000934  
 1st Qu.:1.7448   1st Qu.:0.0166069  
 Median :3.0500   Median :0.1157574  
 Mean   :3.0500   Mean   :0.1913477  
 3rd Qu.:4.3552   3rd Qu.:0.3759779  
 Max.   :5.6605   Max.   :0.5323054  

$Emotionality$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.19

       x                y            
 Min.   :0.9301   Min.   :0.0000773  
 1st Qu.:2.0900   1st Qu.:0.0219588  
 Median :3.2500   Median :0.1508402  
 Mean   :3.2500   Mean   :0.2153124  
 3rd Qu.:4.4100   3rd Qu.:0.3922926  
 Max.   :5.5699   Max.   :0.6291699  


$Extraversion
$Extraversion$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.2057

       x                y            
 Min.   :0.7828   Min.   :0.0001224  
 1st Qu.:1.9164   1st Qu.:0.0407787  
 Median :3.0500   Median :0.1578239  
 Mean   :3.0500   Mean   :0.2203167  
 3rd Qu.:4.1836   3rd Qu.:0.3968226  
 Max.   :5.3172   Max.   :0.5870843  

$Extraversion$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1817

       x               y            
 Min.   :1.055   Min.   :0.0000833  
 1st Qu.:2.077   1st Qu.:0.0339189  
 Median :3.100   Median :0.2100553  
 Mean   :3.100   Mean   :0.2442280  
 3rd Qu.:4.123   3rd Qu.:0.4375608  
 Max.   :5.145   Max.   :0.5799863  


$Agreeableness
$Agreeableness$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.1807

       x               y            
 Min.   :1.058   Min.   :0.0001347  
 1st Qu.:2.104   1st Qu.:0.0270296  
 Median :3.150   Median :0.1628894  
 Mean   :3.150   Mean   :0.2387443  
 3rd Qu.:4.196   3rd Qu.:0.4663836  
 Max.   :5.242   Max.   :0.6169762  

$Agreeableness$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1826

       x                y           
 Min.   :0.8522   Min.   :0.000095  
 1st Qu.:1.9261   1st Qu.:0.027074  
 Median :3.0000   Median :0.181926  
 Mean   :3.0000   Mean   :0.232570  
 3rd Qu.:4.0739   3rd Qu.:0.445709  
 Max.   :5.1478   Max.   :0.568577  


$Conscientiousness
$Conscientiousness$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.181

       x                y           
 Min.   :0.9569   Min.   :0.000114  
 1st Qu.:2.1034   1st Qu.:0.021708  
 Median :3.2500   Median :0.101329  
 Mean   :3.2500   Mean   :0.217826  
 3rd Qu.:4.3966   3rd Qu.:0.402436  
 Max.   :5.5431   Max.   :0.706375  

$Conscientiousness$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1706

       x               y            
 Min.   :1.188   Min.   :0.0001964  
 1st Qu.:2.244   1st Qu.:0.0359617  
 Median :3.300   Median :0.1567514  
 Mean   :3.300   Mean   :0.2365316  
 3rd Qu.:4.356   3rd Qu.:0.4422605  
 Max.   :5.412   Max.   :0.6719030  


$Openness
$Openness$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.1905

       x               y            
 Min.   :1.029   Min.   :0.0001105  
 1st Qu.:2.139   1st Qu.:0.0234947  
 Median :3.250   Median :0.1359343  
 Mean   :3.250   Mean   :0.2248574  
 3rd Qu.:4.361   3rd Qu.:0.4378584  
 Max.   :5.471   Max.   :0.6162173  

$Openness$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1779

       x               y            
 Min.   :1.166   Min.   :0.0000961  
 1st Qu.:2.233   1st Qu.:0.0252414  
 Median :3.300   Median :0.1804943  
 Mean   :3.300   Mean   :0.2341143  
 3rd Qu.:4.367   3rd Qu.:0.4645641  
 Max.   :5.434   Max.   :0.5693733  


$Avoidance.Similarity
$Avoidance.Similarity$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.2286

       x                y          
 Min.   :0.3142   Min.   :0.00027  
 1st Qu.:1.6071   1st Qu.:0.03275  
 Median :2.9000   Median :0.15129  
 Mean   :2.9000   Mean   :0.19315  
 3rd Qu.:4.1929   3rd Qu.:0.37572  
 Max.   :5.4858   Max.   :0.46712  

$Avoidance.Similarity$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1706

       x                y            
 Min.   :0.4882   Min.   :0.0004298  
 1st Qu.:1.7441   1st Qu.:0.0316214  
 Median :3.0000   Median :0.1668941  
 Mean   :3.0000   Mean   :0.1988441  
 3rd Qu.:4.2559   3rd Qu.:0.2669040  
 Max.   :5.5118   Max.   :0.8710047  


$Phone.as.status.object
$Phone.as.status.object$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.1657

       x                y            
 Min.   :0.5028   Min.   :0.0001238  
 1st Qu.:1.4514   1st Qu.:0.0213441  
 Median :2.4000   Median :0.1708817  
 Mean   :2.4000   Mean   :0.2632744  
 3rd Qu.:3.3486   3rd Qu.:0.4800005  
 Max.   :4.2972   Max.   :0.6997651  

$Phone.as.status.object$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.1706

       x                y            
 Min.   :0.4882   Min.   :0.0001104  
 1st Qu.:1.4441   1st Qu.:0.0385170  
 Median :2.4000   Median :0.2001546  
 Mean   :2.4000   Mean   :0.2612763  
 3rd Qu.:3.3559   3rd Qu.:0.5033981  
 Max.   :4.3118   Max.   :0.6255264  


$Social.Economic.Status
$Social.Economic.Status$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 0.4572

       x                 y            
 Min.   :-0.3715   Min.   :4.495e-05  
 1st Qu.: 2.5642   1st Qu.:7.736e-03  
 Median : 5.5000   Median :5.593e-02  
 Mean   : 5.5000   Mean   :8.507e-02  
 3rd Qu.: 8.4358   3rd Qu.:1.503e-01  
 Max.   :11.3715   Max.   :2.785e-01  

$Social.Economic.Status$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 0.4245

       x                 y            
 Min.   :-0.2735   Min.   :6.835e-05  
 1st Qu.: 2.6133   1st Qu.:7.145e-03  
 Median : 5.5000   Median :3.632e-02  
 Mean   : 5.5000   Mean   :8.652e-02  
 3rd Qu.: 8.3867   3rd Qu.:1.511e-01  
 Max.   :11.2735   Max.   :2.871e-01  


$Time.owned.current.phone
$Time.owned.current.phone$Android

Call:
    density.default(x = xx)

Data: xx (219 obs.);    Bandwidth 'bw' = 2.972

       x                 y            
 Min.   : -8.915   Min.   :0.000e+00  
 1st Qu.: 20.293   1st Qu.:8.400e-07  
 Median : 49.500   Median :5.839e-04  
 Mean   : 49.500   Mean   :8.550e-03  
 3rd Qu.: 78.707   3rd Qu.:1.160e-02  
 Max.   :107.915   Max.   :4.714e-02  

$Time.owned.current.phone$iPhone

Call:
    density.default(x = xx)

Data: xx (310 obs.);    Bandwidth 'bw' = 2.5

       x               y            
 Min.   :-7.50   Min.   :5.810e-06  
 1st Qu.: 8.25   1st Qu.:9.996e-04  
 Median :24.00   Median :5.201e-03  
 Mean   :24.00   Mean   :1.586e-02  
 3rd Qu.:39.75   3rd Qu.:3.214e-02  
 Max.   :55.50   Max.   :5.031e-02

plot(nb2, col = c("navy", "orange2"), lwd = 1.5)

Now, instead of just the mean and standard deviation, we have estimated quantiles of the distribution. This might help us when we have skewed distributions, but we need a lot of observations in order to get reliable estimates. 200-300 observations in each group might not be enough to do well. If we look at the predictions:

Overall accuracy = 0.699 

Confusion matrix 
         Predicted (cv)
Actual    Android iPhone
  Android   0.557  0.443
  iPhone    0.200  0.800

Here, we do a bit better, and in contrast to some of the other classification methods, we have a 55% chance of getting Android users right (the lda model sometimes had less than a 50% chance of getting them correct.)

Again, it would be good to implement a cross-validation here.

Example: Predicting Dengue Fever

Fitting the NaiveBayes function is sensitive to missing data and zero-variances. If you have a variable with no variance, any new value will have likelihood of 0, and we have a chance of getting a likelihood ratio that is infinite. A single 0 or infinite posterior likelihood will break the classifier, because the posterior value will be 0 or inf no matter what other values are. Similarly, an NA in the values in the training set can cause trouble, dependent on how the model handles it. It might be useful to impute NA data, or add small amounts of noise to the training set to smooth out the values. The NaiveBayes function also allows you to ignore NA values, which we will do below.

library(e1071)
data(dengue)
summary(dengue)

     humid            humid90            temp            temp90      
 Min.   : 0.6714   Min.   : 1.066   Min.   :-18.68   Min.   :-10.07  
 1st Qu.:10.0088   1st Qu.:10.307   1st Qu.: 11.10   1st Qu.: 12.76  
 Median :16.1433   Median :16.870   Median : 20.99   Median : 22.03  
 Mean   :16.7013   Mean   :17.244   Mean   : 18.41   Mean   : 19.41  
 3rd Qu.:23.6184   3rd Qu.:24.131   3rd Qu.: 25.47   3rd Qu.: 25.98  
     h10pix          h10pix90          trees         trees90     
 Min.   : 4.317   Min.   : 5.848   Min.   : 0.0   Min.   : 0.00  
 1st Qu.:14.584   1st Qu.:14.918   1st Qu.: 1.0   1st Qu.: 6.00  
 Median :23.115   Median :24.130   Median :15.0   Median :30.60  
 Mean   :21.199   Mean   :21.557   Mean   :22.7   Mean   :35.21  
 3rd Qu.:28.509   3rd Qu.:28.627   3rd Qu.:37.0   3rd Qu.:63.62  
     NoYes             Xmin              Xmax              Ymin       
 Min.   :0.0000   Min.   :-179.50   Min.   :-172.00   Min.   :-54.50  
 1st Qu.:0.0000   1st Qu.: -12.00   1st Qu.: -10.00   1st Qu.:  6.00  
 Median :0.0000   Median :  16.00   Median :  17.75   Median : 18.00  
 Mean   :0.4155   Mean   :  13.31   Mean   :  15.63   Mean   : 19.78  
 3rd Qu.:1.0000   3rd Qu.:  42.62   3rd Qu.:  44.50   3rd Qu.: 39.00  
      Ymax       
 Min.   :-55.50  
 1st Qu.:  5.00  
 Median : 17.00  
 Mean   : 18.16  
 3rd Qu.: 37.00  
 [ reached getOption("max.print") -- omitted 2 rows ]

dengue$NoYes <- as.factor(dengue$NoYes)

Notice that there are about a dozen or so values with NA data.

# This doesn't work nb.dengue <- NaiveBayes(NoYes~.,data=dengue)

nb.dengue <- NaiveBayes(NoYes ~ ., data = dengue, na.action = "na.omit")  ##this works

# This one works with the klaR NaiveBayes:
nb.dengue2 <- klaR::NaiveBayes(NoYes ~ h10pix + Xmin + Ymin, data = dengue)


# this one works--from e1071 package
nb.dengue3 <- naiveBayes(NoYes ~ ., data = dengue)


# when we remove the na, we need to remove it from the ground truth too:
confusion(dengue$NoYes[!is.na(rowMeans(dengue[, -9]))], predict(nb.dengue)$class)

Overall accuracy = 0.882 

Confusion matrix 
      Predicted (cv)
Actual     0     1
     0 0.845 0.155
     1 0.066 0.934

confusion(dengue$NoYes, predict(nb.dengue2)$class)

Overall accuracy = 0.883 

Confusion matrix 
      Predicted (cv)
Actual     0     1
     0 0.843 0.157
     1 0.059 0.941

confusion(dengue$NoYes, predict(nb.dengue3, newdata = dengue))

Overall accuracy = 0.881 

Confusion matrix 
      Predicted (cv)
Actual     0     1
     0 0.844 0.156
     1 0.066 0.934

Example: Naive Bayes with the mnist (handwriting) set

##this code will dowload a create a 500-letter two-class training set from 
##mnist via tensorflow
library(tensorflow)
datasets <- tf$contrib$learn$datasets
mnist <- datasets$mnist$read_data_sets("MNIST-data", one_hot = TRUE)

##extract just two labels and sample images

mnist.1 <- mnist$train$labels[,1]
mnist.2 <- mnist$train$labels[,2]

mnist.img1 <- mnist$train$images[mnist.1==1,]
mnist.img2 <- mnist$train$images[mnist.2==1,]


##plot prototypes
par(mfrow=c(1,2))
image((matrix(colMeans(mnist.img1),28,28)))
image((matrix(colMeans(mnist.img2),28,28)))

##these are too many. Sample 250 from each
traintest1 <- sample(1:nrow(mnist.img1),size=500)
traintest2 <- sample(1:nrow(mnist.img2),size=500)

train <- rbind(mnist.img1[traintest1[1:250],],
               mnist.img2[traintest2[1:250],])

test <- rbind(mnist.img1[traintest1[251:500],],
               mnist.img2[traintest2[251:500],])
train<-as.data.frame(train)
test<- as.data.frame(test)
train$labels <- rep(0:1,each=250)



write.csv(train,"trainmnist.csv",row.names=F)
write.csv(test,"testmnist.csv",row.names=F)

library(klaR)
library(DAAG)
train <- read.csv("trainmnist.csv")
test <- read.csv("testmnist.csv")

set.seed(101)

## smooth out the training set a bit by adding some noise, so no pixel has a sd
## of 0.
for (i in 1:ncol(train)) {
    train[, i] <- rnorm(500, as.numeric(train[, i]), 0.1)
}

par(mfrow = c(2, 4))
image(matrix(unlist(train[1, ]), nrow = 28))
image(matrix(unlist(train[2, ]), nrow = 28))
image(matrix(unlist(train[3, ]), nrow = 28))
image(matrix(colMeans(train[1:250, ]), nrow = 28), main = "Average of 250 prototypes")

image(matrix(unlist(train[251, ]), nrow = 28))
image(matrix(unlist(train[252, ]), nrow = 28))
image(matrix(unlist(train[253, ]), nrow = 28))
image(matrix(colMeans(train[251:500, ]), nrow = 28), main = "Average of 250 prototypes")

Build Naive Bayes model

train$labels <- as.factor(rep(0:1, each = 250))
nb3 <- NaiveBayes(labels ~ ., usekernel = T, data = train)
p3a <- predict(nb3)  ##this takes a while
confusion(train$labels, p3a$class)  ##almost perfect

Overall accuracy = 0.998 

Confusion matrix 
      Predicted (cv)
Actual     0     1
     0 0.996 0.004
     1 0.000 1.000

p3b <- predict(nb3, test)  ##equally good
confusion(rep(0:1, each = 250), p3b$class)

Overall accuracy = 0.994 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 1.000 0.000
  [2,] 0.012 0.988

Testing with additional noise

What if we add some noise to the test? I will add gaussian noise with mean=0 so we hopefully won’t be biased toward one or the other case.

set.seed(200)
for (i in 1:ncol(test)) {
    test[, i] <- rnorm(500, as.numeric(test[, i]), 0.25)

}

par(mfrow = c(2, 4))
image(matrix(unlist(test[1, ]), nrow = 28))
image(matrix(unlist(test[2, ]), nrow = 28))
image(matrix(unlist(test[3, ]), nrow = 28))
image(matrix(colMeans(test[1:250, ]), nrow = 28), main = "Average of 250 prototypes")

image(matrix(unlist(test[251, ]), nrow = 28))
image(matrix(unlist(test[252, ]), nrow = 28))
image(matrix(unlist(test[253, ]), nrow = 28))
image(matrix(colMeans(test[251:500, ]), nrow = 28), main = "Average of 250 prototypes")

p3c <- predict(nb3, test)  ##This one is still pretty good
confusion(rep(0:1, each = 250), p3c$class)

Overall accuracy = 0.736 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 1.000 0.000
  [2,] 0.528 0.472

Retraining with noisy data

For this example with a small amount of noise, the classifier starts to fail–it appears to be biased toward 0, maybe because 0 has more overall pixels. This is true even though we can easily tell what the class is visually. I think what is happening is that because the classifier is trained with data that has much less noise, individual features can produce very high positive or negative evidence toward a class. When we add large amounts of noise, when a few of those features turn on haphazardly, they overpower the other features. It would probably be better to train the model with the kind of noise we’d expect and it might to a lot better. To illustrate, let’s add MORE noise.

noisetest <- test
set.seed(100)
for (i in 1:ncol(test)) {
    train[, i] <- rnorm(500, as.numeric(train[, i]), 2)
    noisetest[, i] <- rnorm(500, as.numeric(test[, i]), 2)
}


nb3 <- NaiveBayes(labels ~ ., usekernel = T, data = train)


par(mfrow = c(2, 4))
image(matrix(unlist(noisetest[1, ]), nrow = 28))
image(matrix(unlist(noisetest[2, ]), nrow = 28))
image(matrix(unlist(noisetest[3, ]), nrow = 28))
image(matrix(colMeans(noisetest[1:250, ]), nrow = 28), main = "Average of 250 prototypes")

image(matrix(unlist(noisetest[251, ]), nrow = 28))
image(matrix(unlist(noisetest[252, ]), nrow = 28))
image(matrix(unlist(noisetest[253, ]), nrow = 28))
image(matrix(colMeans(noisetest[251:500, ]), nrow = 28), main = "Average of 250 prototypes")

p3c <- predict(nb3, test)
confusion(rep(0:1, each = 250), p3c$class)

Overall accuracy = 0.988 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.976 0.024
  [2,] 0.000 1.000

Now, even though the actual signal is almost completely obscured to the naked eye, the NB classifier is again nearly perfect.

Naive Bayes Classifiers