K-Nearest-Neighbor classifiers

Nearest-neighbor methods are a class of non-model-based classifiers. These methods require you to develop a similarity or distance space, usually on the basis of a set of predictors or features. Once this distance space is defined, we classify a new case by finding its neighbors in that space and having those neighbors vote, taking the most common label in the neighborhood. Because a single nearest neighbor is susceptible to noisy data, we usually find a set of \(k\) neighbors and have them vote on the outcome. These are K-nearest-neighbor (KNN) classifiers. This approach can be especially useful if your data classes are not well described by a multivariate distribution. Ultimately, this means KNN approaches make fewer distributional assumptions, but they do require ad hoc assumptions about how to combine dimensions, and those assumptions may not be valid.
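
As a concrete illustration of the voting idea, here is a minimal sketch in base R, using a handful of made-up two-dimensional points (these are not the data used in the examples below):

# a minimal sketch of the KNN idea with made-up 2-D points
examples <- data.frame(x = c(1, 2, 3, 6, 7, 8),
                       y = c(1, 2, 1, 6, 7, 6),
                       label = c("a", "a", "a", "b", "b", "b"))
newpoint <- c(x = 2.5, y = 1.5)

# euclidean distance from the new point to every stored case
d <- sqrt((examples$x - newpoint["x"])^2 + (examples$y - newpoint["y"])^2)

# take the k = 3 nearest cases and let them vote
k <- 3
neighbors <- examples$label[order(d)[1:k]]
names(which.max(table(neighbors)))  # predicted label: "a"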

The problem with KNN is similar to that of many classifiers: how do we know which features to use? Furthermore, because we need to combine features into a distance, we also need formulas for combining multiple features into a single distance measure. Should all dimensions be weighted equally? What if our features are on different scales? What if some of our features are categorical? Unlike something like logistic regression, we cannot easily test particular predictors, but we can always use cross-validation to determine whether a decision about selecting variables or constructing the distance metric improves outcomes.

Choosing K

The choice of \(k\) can have a big impact on the results. If \(k\) is too small, classification will be highly influenced by the local neighborhood; if it is too big (i.e., if \(k = N\)), the classifier will always just respond with the most common label in the entire training set.
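
To see the two extremes, here is a small sketch (made-up data, unrelated to the examples below) comparing a very small and a very large \(k\), using the knn.cv function from the class library introduced later:

library(class)
# made-up illustration: 200 cases, two uninformative predictors, an imbalanced outcome
set.seed(100)
x <- matrix(rnorm(200 * 2), ncol = 2)
y <- factor(rep(c("a", "b"), times = c(160, 40)))

# k = 1 follows whatever the single nearest (and here uninformative) neighbor says;
# k = N - 1 effectively just votes the overall majority class
table(knn.cv(x, cl = y, k = 1))
table(knn.cv(x, cl = y, k = 199))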

Calculating distance

The distance metric used in KNN can easily be chosen poorly. For example, if you want to predict political party based on age, income level, and gender, you need to figure out how to weight and combine these. By default, you might add up differences along each dimension, which would combine something that differs on a scale of dozens (age) with something that differs on a scale of thousands (income), and so the KNN model would essentially ignore gender and age. That is, if we used a city-block distance, a 22-year-old female democrat who makes $30,000/year might be 2 units away from a 22-year-old male republican who also makes $30K, but roughly 1,000 units away from a 22-year-old female republican who makes $31,000. You might want to adjust the scale of different dimensions, select or ignore particular dimensions, or weight some more heavily than others.
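
To make that arithmetic concrete, here is a quick sketch computing those city-block distances directly, coding gender and party as 1 or 2 (the same coding used for the sample data just below):

# the three hypothetical voters from the example above, coded numerically
voters <- data.frame(age    = c(22, 22, 22),
                     salary = c(30000, 30000, 31000),
                     gender = c(1, 2, 1),    # 1 = female, 2 = male
                     party  = c(1, 2, 2))    # 1 = democrat, 2 = republican

# city-block (manhattan) distances: the salary difference swamps gender and party
dist(voters, method = "manhattan")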

If we create a sample data set of randomly-generated people, coding gender and party as 1 or 2 so that we can easily calculate distances with the dist() function:

library(ggplot2)
vals <- data.frame(salary = sample(c(30000, 50000, 1e+05), 50, replace = T), gender = sample(1:2,
    50, replace = T), party = sample(1:2, 50, replace = T))
tmpdat <- data.frame(values = as.numeric(dist(vals)))

tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw()

table(tmpdat)
values
               0                1  1.4142135623731            20000 
              92              215              110               50 
    20000.000025      20000.00005            50000      50000.00001 
              92               50               90              172 
     50000.00002            70000 70000.0000071429 70000.0000142857 
              90               64              130               70 

Notice that at first glance there appear to be only four distinct distances between cases (0, 20,000, 50,000, and 70,000), but looking more carefully we have some other values, e.g., 20000, 20000.000025, 20000.00005, and so on: the gender and party differences add only tiny amounts to the huge salary differences. We might choose to normalize these somehow.

Normalization selection

A typical approach would be to normalize all variables first, so that each variable has equal weight and range. We can normalize against the entire range (min-max scaling) or via z-scores. If you have multiple variables that are on the same overall scale (e.g., a Likert scale), you might apply the same normalization to all of them. The scale() function in R can do z-score normalization for a matrix of values:

vals2 <- scale(vals)
tmpdat <- data.frame(values = as.numeric(dist(vals2)))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw()

The distances now show a much broader spread of values. But because gender and party have a maximum difference of 1, maybe we want salary to be scaled similarly, to the range 0 to 1.

vals2 <- vals
vals2$salary <- (vals2$salary - min(vals2$salary))/(max(vals2$salary) - min(vals2$salary))
tmpdat <- data.frame(values = as.numeric(dist(vals2)))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw()

This is essentially a way of downweighting the role of salary. We could similarly weight the dimensions unequally if we want, by multiplying the feature values by a vector of weights:

# weight each feature by multiplying its column by a weight before computing distances
vals3 <- t(t(vals2) * c(0.33, 0.333, 0.33))
tmpdat <- data.frame(values = as.numeric(dist(vals3)))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw()

Alternative distance metrics

The dist() function provides a number of alternative distance metrics. The most common is the euclidean metric, the straight-line distance treating the features as a space in which you can move diagonally. Another common metric is the manhattan or city-block metric, which simply adds up the distances along each feature. There are others, including minkowski, a parameterized family: p = 1 gives the city-block metric, p = 2 gives the euclidean metric, very large p approaches the maximum difference on any single dimension, and very small p increasingly emphasizes the smallest differences. Depending on the problem, you might care most about the worst-case difference across dimensions or about the closest-matching dimension, which is when these extremes become reasonable.

tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "manhattan")))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw() + ggtitle("Manhattan distance")

tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "minkowski", p = 100)))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw() + ggtitle("Minkowski distance p = 100 (approximately max)")

tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "minkowski", p = 1/100)))
tmpdat |>
    ggplot(aes(x = values)) + geom_histogram() + theme_bw() + ggtitle("Minkowski distance p = 1/100")

Using the knn function

Unlike previous methods, where we train a model on data and then use it to make classifications, for KNN the 'trained' system is just the set of stored data; no statistics are actually calculated from it ahead of time. A number of libraries and functions are available:

Libraries and functions

There are several options for KNN classifiers.

Library: class

  • Function: knn
  • Function: knn.cv (cross-validation)

Library: kknn

  • Function: kknn
  • This provides a ‘weighted k-nearest neighbor classifier’. Here, weighting is done via a kernel that gives greater weight to nearer neighbors.

Library: klaR

  • Function: sknn
    • A ‘simple’ knn function. Permits using a parameterized weighting kernel as well.

Cross-validation

Some of the libraries include built-in cross-validation schemes, which is important for getting an honest estimate of accuracy.
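
Beyond the built-in leave-one-out option (knn.cv, used below), a simple k-fold scheme can be rolled by hand. This is only a sketch: it assumes a data frame dat whose first column holds the class label and whose remaining columns are numeric predictors.

library(class)
# hand-rolled k-fold cross-validation sketch for class::knn
# 'dat' is a placeholder: first column = class label, the rest = numeric predictors
cv.knn <- function(dat, k = 3, folds = 5) {
    fold <- sample(rep(1:folds, length.out = nrow(dat)))  # random fold assignment
    acc <- numeric(folds)
    for (i in 1:folds) {
        heldout <- fold == i
        pred <- knn(train = dat[!heldout, -1], test = dat[heldout, -1],
                    cl = dat[!heldout, 1], k = k)
        acc[i] <- mean(pred == dat[heldout, 1])
    }
    mean(acc)  # average held-out accuracy across folds
}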

Example

For our 50-case data set, suppose we want to use KNN to predict party based on gender and salary. These are random, so we should not be able to do very well. Let's try a simple knn(), which computes an unweighted euclidean distance:

library(DAAG)
library(class)
# train and test on the very same cases; this inflates accuracy
k1 <- knn(vals2[, -3], vals2[, -3], cl = vals2$party, k = 3)
confusion(k1, vals2$party)
Overall accuracy = 0.62 

Confusion matrix 
      Predicted (cv)
Actual    1    2
     1 0.64 0.36
     2 0.40 0.60

Here, we got 62% accuracy. This seems surprising because the data were random, but because we are classifying the same cases we are using as neighbors, each case gets to vote on its own label, which biases the result. Because our data set is clumpy (only 3 salaries), the bias could be even worse; we could easily have gotten 100% accuracy. We need to cross-validate. We can use knn.cv to do a leave-one-out cross-validation on knn. That is, it classifies each case using all the other points and returns the results.

k2 <- knn.cv(vals2[, -3], cl = vals2$party, k = 3)
confusion(k2, vals2$party)
Overall accuracy = 0.46 

Confusion matrix 
      Predicted (cv)
Actual     1     2
     1 0.478 0.522
     2 0.556 0.444

This will not help you classify new data, but it does give a rough upper bound on how good the classifier might be. We might also do some cross-validation by hand, by holding 10 cases out and classifying them with the other 40:

test <- sample(1:50, 10)

testset <- vals2[test, ]
trainset <- vals2[-test, ]
k2 <- knn(trainset[, -3], testset[, -3], cl = trainset$party, k = 3)
confusion(k2, testset$party)
Overall accuracy = 0.5 

Confusion matrix 
      Predicted (cv)
Actual   1   2
     1 0.5 0.5
     2 0.5 0.5

Running this multiple times, you will find that the average accuracy tends toward 50%, which is what we would expect for random data.
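
One quick way to check this is to repeat the hold-out split many times and average the accuracy (a sketch reusing the vals2 data from above):

# repeat the 10-case hold-out split 100 times and average the accuracy
accs <- replicate(100, {
    test <- sample(1:50, 10)
    pred <- knn(vals2[-test, -3], vals2[test, -3], cl = vals2$party[-test], k = 3)
    mean(pred == vals2$party[test])
})
mean(accs)  # hovers around 0.5 for these random data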

iPhone data set example

In general, we need to normalize the variables first when using KNN, because the simplest knn functions give us no control over the distance metric.

phone <- read.csv("data_study1.csv")
# Normalizing the values by creating a new function
normal = function(x) {
    xn = (x - min(x))/(max(x) - min(x))
    return(xn)
}



phone2 <- phone
phone2$Gender <- as.numeric(as.factor(phone2$Gender))
for (i in 2:ncol(phone2)) {
    phone2[, i] = normal(as.numeric(phone2[, i]))
}

# we might also use scale:
phone3 <- scale(phone[, -(1:2)])

# checking the range of values after normalization
par(mfrow = c(3, 1))
boxplot(phone[, 3:12])
boxplot(phone2[, 3:12])
boxplot(phone3)

Fitting a simple KNN model

Here, we use the class::knn model. We specify the training and testing sets separately, but they can be the same, and if we use knn.cv it will exclude each point when classifying it. Again, 'training' a KNN is a bit deceptive; the training pool is just the set of points used to classify the test cases. We will use a k of 11, which means each case will be compared to the 11 closest cases and a decision made based on their vote. Using an odd number means that there won't be a tie.

library(class)
m1 = knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = 11)
confusion(phone2$Smartphone, m1)
Overall accuracy = 0.645 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.511 0.489
  [2,] 0.261 0.739

That is pretty good, but what if we expand k?

m1 = knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = 25)
confusion(phone2$Smartphone, m1)
Overall accuracy = 0.641 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.507 0.493
  [2,] 0.265 0.735

It looks like the choice of k is not going to matter much. What we could do, though, is play with the cross-validation scheme: instead of relying on the leave-one-out approach built into knn.cv, we can set up an explicit train/test split by hand. We can also try variable selection to change the similarity space to something that works better.

phone2 <- phone2[order(runif(nrow(phone2))), ]  #random sort
phone.train = phone2[1:450, 2:12]
phone.test = phone2[451:nrow(phone2), 2:12]
phone.train.target = phone2[1:450, 1]
phone.test.target = phone2[451:nrow(phone2), 1]

m2 <- knn(train = phone.train, test = phone.test, cl = phone.train.target, k = 10)
confusion(phone.test.target, m2)
Overall accuracy = 0.696 

Confusion matrix 
      Predicted (cv)
Actual  [,1]  [,2]
  [1,] 0.543 0.457
  [2,] 0.182 0.818
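
As a quick check of the earlier observation that the choice of \(k\) does not matter much here, we can sweep a range of values using knn.cv (a sketch; the exact accuracies will vary):

# leave-one-out accuracy across a range of (odd) k values
ks <- seq(1, 51, by = 2)
accuracy <- sapply(ks, function(k) {
    pred <- knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = k)
    mean(pred == phone2$Smartphone)
})
round(rbind(k = ks, accuracy = accuracy), 3)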

Using alternate distances

The kknn::kknn function provides two enhancements. First, it uses a minkowski distance, letting you set the parameter p: p = 2 (euclidean distance), p = 1 (manhattan distance), very large p (approaching the maximum difference), or p close to 0 (emphasizing the smallest differences). Second, the voting is done according to a specified 'kernel', so that you can weight closer cases more strongly than farther ones. kknn uses a regression-like formula to specify the class to be predicted and the variables you want to use. Here, I fit with euclidean distance, then manhattan distance, and finally try a triangular kernel:

library(kknn)
phone2 <- phone2[order(runif(nrow(phone2))), ]  #random sort
phone2$Smartphone <- as.factor(phone2$Smartphone)  ## kknn needs a factor to predict!
phone.train = phone2[1:450, ]
phone.test = phone2[-(1:450), ]



m3a <- kknn(Smartphone ~ ., train = phone.train, test = phone.test, k = 11, distance = 2)
m3b <- kknn(Smartphone ~ ., train = phone.train, test = phone.test, k = 11, distance = 1)



confusion(phone.test$Smartphone, fitted(m3a))
Overall accuracy = 0.696 

Confusion matrix 
         Predicted (cv)
Actual    Android iPhone
  Android   0.500  0.500
  iPhone    0.218  0.782
print("-------------")
[1] "-------------"
confusion(phone.test$Smartphone, fitted(m3b))
Overall accuracy = 0.671 

Confusion matrix 
         Predicted (cv)
Actual    Android iPhone
  Android   0.458  0.542
  iPhone    0.236  0.764
m3c <- train.kknn(Smartphone ~ ., phone.train, kmax = 11, distance = 1, kernel = c("triangular"))

print("-------------")
[1] "-------------"
confusion(phone.test$Smartphone, predict(m3c, phone.test))
Overall accuracy = 0.658 

Confusion matrix 
         Predicted (cv)
Actual    Android iPhone
  Android   0.417  0.583
  iPhone    0.236  0.764

Summary

KNN classifiers provide a model-free approach that can be useful if the distributions of your classes have irregular shapes. They may end up being slower than other approaches, because classifying each new case requires computing its distance to every case in the training set, and using them requires making ad hoc decisions about how to compute distance, which can be hard to justify and whose consequences can be hard to foresee.

Additional resources