--- title: "K-Nearest-Neighbor classifiers" author: "Shane T. Mueller shanem@mtu.edu" date: "`r Sys.Date()`" output: rmdformats::readthedown: gallery: yes highlight: kate self_contained: no pdf_document: default html_document: df_print: paged word_document: reference_docx: ../template.docx always_allow_html: yes --- ```{r knitr_init, echo=FALSE, cache=FALSE} library(knitr) library(rmdformats) ## Global options options(max.print="75") opts_chunk$set(echo=TRUE, cache=TRUE, prompt=FALSE, tidy=TRUE, comment=NA, message=FALSE, warning=FALSE) opts_knit$set(width=75) ``` ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # K Nearest-Neighbor classifiers Nearest-neighbor methods are a class of non-model-based classifiers. These methods require you to develop a similarity or distance space--usually on the basis of a set of predictors or features. Once this distance space is defined, we determine the class by finding a few neighbor in that space, and having those neighbors vote, using the most common label of items in the neighborhood. Because this could be susceptible to noisy data, we usually find a set of $k$ neighbors and have them vote on the outcome. These are K-nearest neighbor classifiers, or KNN. This approach can be especially useful if your data classes are not well described by a multivariate distribution. Ultimately, this means KNN approaches make fewer assumptions, but also have to make ad hoc assumptions about combining dimensions that may not be valid. The problem with KNN is similar to many classifiers--how do we know which features to use. Furthermore, because we need to combine features to create a distance, we also need to come up with formulas for combining multiple features to create a distance measure. Should all dimensions be equally weighted? What if our features are on different scales? What if some of our features are categorical? Unlike something like logistic regression, we cannot easily test particular predictors, but we can always do cross-validation tests to determine whether the a decision in selecting variables or creating a distance metric improve outcomes. ## Choosing K. The choice of $k$ can have a big impact on the results. If $k$ is too small, classification will be highly impacted by local neighborhood; if it is too big (i.e., if $K$ = $N$), it will always just respond with the most likely response in the entire population. # Calculating distance The distance metric used in KNN can easily be chosen poorly. For example, if you want to decide on political party based on age, income level, and gender, you need to figure out how to weigh and combine these. By default, you might add differences along each dimension, which would combine something that differs on the scale of dozens with something that differs on the scale of thousands, and so the KNN model would essentially ignore gender and age. That is, if we used a city-block distance, a 22-year-old female democrat who makes \$30,000/year might be 2 units away from a 22-year-old male republican who also makes \$30K, but 1,000 units away from a 22-year-old female republican who makes $31,000. You might want to adjust the scale of different dimensions, select/ignore different dimensions, or weigh some more than others. 
We can create a larger sample data set of randomly-generated people, again coding gender and party as 1 or 2 so that we can easily calculate distances with the ```dist``` function:

```{r}
library(ggplot2)
vals <- data.frame(salary = sample(c(30000, 50000, 100000), 50, replace = T),
                   gender = sample(1:2, 50, replace = T),
                   party  = sample(1:2, 50, replace = T))

tmpdat <- data.frame(values = as.numeric(dist(vals)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw()
table(tmpdat)
```

Notice that there appear to be only four distances between cases, but looking more carefully at the table there are some other values, e.g., 20000, 20000.000025, 20000.00005, and so on: the gender and party differences barely perturb the euclidean distance because they are tiny relative to the salary differences. We might choose to normalize these somehow.

# Choosing a normalization

A typical approach would be to normalize all variables first, so that each variable has equal weight and range. We can normalize against the entire range, or via z-scores. If you have multiple variables that are on the same overall scale (e.g., a set of Likert-scale items), you might apply the same normalization to all of them. The ```scale``` function in R can z-score the columns of a matrix of values:

```{r}
vals2 <- scale(vals)
tmpdat <- data.frame(values = as.numeric(dist(vals2)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw()
```

There is a much broader range of distances here. But because gender and party have a maximum difference of 1, maybe we want salary to be scaled similarly, to the range 0 to 1:

```{r}
vals2 <- vals
vals2$salary <- (vals2$salary - min(vals2$salary)) /
                (max(vals2$salary) - min(vals2$salary))

tmpdat <- data.frame(values = as.numeric(dist(vals2)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw()
```

Compared to z-scoring, this is sort of like downweighting the role of salary. We could also weigh the dimensions unequally if we wanted, by multiplying the feature columns by a vector of weights (the weights below happen to be roughly equal, but they need not be):

```{r}
## multiply each column by a weight (roughly 1/3 each here)
vals3 <- t(t(vals2) * c(.33, .333, .33))
tmpdat <- data.frame(values = as.numeric(dist(vals3)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw()
```

# Alternative distance metrics

The dist() function provides a number of alternative distance metrics. The most common is euclidean, which treats the features as a space you can move through diagonally and computes the straight-line distance. Another common metric is the manhattan or city-block metric, which adds up the distances on each feature separately. There are others, including minkowski, a parameterized family in which p = 1 gives city-block, p = 2 gives euclidean, and large values of p approach the maximum difference on any single feature. A max-like distance can be reasonable when you care about the worst-case difference on a single dimension; a min-like distance (the closest dimension) is sometimes wanted as well, although note that dist()'s Minkowski formula does not actually approach the minimum as p shrinks toward 0--for p below 1 the result is dominated by how many features differ at all.

```{r}
tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "manhattan")))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw() +
  ggtitle("Manhattan distance")

tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "minkowski", p = 100)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw() +
  ggtitle("Minkowski distance, p = 100 (approaches the max)")

tmpdat <- data.frame(values = as.numeric(dist(vals2, method = "minkowski", p = 1/100)))
tmpdat |> ggplot(aes(x = values)) + geom_histogram() + theme_bw() +
  ggtitle("Minkowski distance, p = 1/100")
```

## Using the knn function

Unlike previous methods, where we train a model from the data and then use it to make classifications, for KNN the 'trained' system is just the stored set of data--no statistics are actually calculated from it in advance. A number of libraries and functions are available; before turning to them, it is worth seeing how simple the underlying computation is.
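The following is a minimal hand-rolled sketch of the KNN vote, just to illustrate the idea (the ```knn_vote``` helper is made up for this example and is not part of any package; it assumes the ```vals2``` data frame from the chunks above is still in memory). Each test case is assigned the majority label among its $k$ nearest training cases:

```{r}
# hand-rolled KNN vote (illustration only; ignores ties among equal distances)
knn_vote <- function(train, test, labels, k = 3) {
  apply(test, 1, function(row) {
    d <- sqrt(colSums((t(train) - row)^2))          # euclidean distance to every training case
    names(which.max(table(labels[order(d)[1:k]])))  # majority label among the k nearest
  })
}

# classify the first 5 cases against the remaining 45, predicting party from salary and gender
knn_vote(as.matrix(vals2[-(1:5), 1:2]), as.matrix(vals2[1:5, 1:2]),
         labels = vals2$party[-(1:5)], k = 3)
```

In practice we would use one of the packaged implementations described next, which handle ties, weighting, and cross-validation for us.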
## Libraries and functions

There are several options for KNN classifiers in R.

### Library: class

* Function: knn
* Function: knn.cv (leave-one-out cross-validation)

### Library: kknn

* Function: kknn. This provides a 'weighted k-nearest neighbor classifier'; here, the weighting is done with a kernel that gives greater weight to nearer neighbors.

### Library: klaR

* Function: sknn. A 'simple' knn function that also permits a parameterized weighting kernel.

## Cross-validation

Some of these libraries include cross-validation schemes, which is important for getting an honest estimate of accuracy.

# Example

For our 50-case data set, suppose we want to use KNN to predict party based on salary and gender. These are random, so we should not be able to do very well. Let's try a simple knn(), which uses unweighted euclidean distance:

```{r}
library(DAAG)
library(class)
k1 <- knn(vals2[, -3], vals2[, -3], cl = vals2$party, k = 3)
confusion(k1, vals2$party)
```

Here, we got about 66% accuracy. This seems surprising because the data were random, but we are classifying the very same cases that make up the 'training' pool, so each case's own label is included in its neighborhood and biases the vote. The bias could be even worse: with $k = 1$ and no exact duplicates we would trivially get 100% accuracy (the clumpiness of our data--only three distinct salaries--creates ties that dilute this somewhat). We need to cross-validate.

We can use ```knn.cv``` to do a leave-one-out cross-validation on knn. That is, it classifies each case using all of the other points and returns the results:

```{r}
k2 <- knn.cv(vals2[, -3], cl = vals2$party, k = 3)
confusion(k2, vals2$party)
```

This will not help you classify new data, but it does give a sort of upper bound on how good the classifier might be. We might also do some cross-validation by hand, holding 10 cases out and classifying them using the other 40:

```{r}
test <- sample(1:50, 10)
testset <- vals2[test, ]
trainset <- vals2[-test, ]
k2 <- knn(trainset[, -3], testset[, -3], cl = trainset$party, k = 3)
confusion(k2, testset$party)
```

Running this multiple times, you will find that the average accuracy tends to be around 50%, which is what we would expect for random labels.

# iPhone data set example

In general, we need to normalize the variables ourselves before using knn, because the simplest knn functions give us no control over the distance metric.

```{r,fig.width=7,fig.height=8}
phone <- read.csv("data_study1.csv")

# Normalize the values to the 0-1 range
normal <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

phone2 <- phone
phone2$Gender <- as.numeric(as.factor(phone2$Gender))
for (i in 2:ncol(phone2)) {
  phone2[, i] <- normal(as.numeric(phone2[, i]))
}

# we might also use scale:
phone3 <- scale(phone[, -(1:2)])

# check the range of values after normalization
par(mfrow = c(3, 1))
boxplot(phone[, 3:12])
boxplot(phone2[, 3:12])
boxplot(phone3)
```

## Fitting a simple KNN model

Here, we use the class::knn model. We specify the training and testing sets separately, but they can be the same; if we use knn.cv, each point is excluded from its own neighborhood when it is classified. Again, 'training' a KNN is a bit deceptive: the training pool is just the set of points used to classify the test cases. We will use $k = 11$, which means each case is compared to its 11 closest cases and the decision is made by their majority vote. For a two-class problem, using an odd number of neighbors means there won't be a tie.

```{r}
library(class)
m1 <- knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = 11)
confusion(phone2$Smartphone, m1)
```

That is pretty good, but what if we expand $k$?

```{r}
m1 <- knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = 25)
confusion(phone2$Smartphone, m1)
```

It looks like the choice of $k$ is not going to matter much.
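Rather than eyeballing two values of $k$, we can sweep $k$ across a range and record the leave-one-out accuracy for each (a quick sketch, reusing ```phone2``` from the chunk above; the particular range of $k$ values is arbitrary):

```{r}
ks <- seq(1, 41, by = 2)   # odd values of k from 1 to 41
acc <- sapply(ks, function(k) {
  pred <- knn.cv(train = phone2[, -1], cl = phone2$Smartphone, k = k)
  mean(pred == phone2$Smartphone)   # leave-one-out accuracy for this k
})
plot(ks, acc, type = "b", xlab = "k", ylab = "Leave-one-out accuracy")
```

If the resulting curve is fairly flat across moderate values of $k$, that supports the impression above that the exact choice of $k$ matters little here.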
What we could do is play with the cross-validation, though. So far we have been evaluating on the same pool of cases we are 'training' with (knn.cv at least leaves each case out when classifying it, but a completely separate hold-out test set is a stronger check). We can also try variable selection, to change the similarity space to something that works better.

```{r}
phone2 <- phone2[order(runif(nrow(phone2))), ]  # random sort
phone.train <- phone2[1:450, 2:12]
phone.test <- phone2[451:nrow(phone2), 2:12]
phone.train.target <- phone2[1:450, 1]
phone.test.target <- phone2[451:nrow(phone2), 1]

m2 <- knn(train = phone.train, test = phone.test, cl = phone.train.target, k = 10)
confusion(phone.test.target, m2)
```

# Using alternate distances

The kknn::kknn function provides two advances. First, it uses a Minkowski distance whose exponent is set with the ```distance``` argument: 2 gives euclidean distance, 1 gives manhattan distance, and large values approach the maximum single-feature difference. Second, the voting is done according to a specified 'kernel', so that you can weigh closer cases more strongly than farther ones. kknn uses a regression-like formula to specify the class being predicted and the variables you want to use. Here, I fit with euclidean distance and with manhattan distance, and then try a triangular kernel:

```{r}
library(kknn)
phone2 <- phone2[order(runif(nrow(phone2))), ]  # random sort
phone2$Smartphone <- as.factor(phone2$Smartphone)  # kknn needs a factor to predict!

phone.train <- phone2[1:450, ]
phone.test <- phone2[-(1:450), ]

m3a <- kknn(Smartphone ~ ., train = phone.train, test = phone.test, k = 11, distance = 2)
m3b <- kknn(Smartphone ~ ., train = phone.train, test = phone.test, k = 11, distance = 1)
confusion(phone.test$Smartphone, fitted(m3a))
print("-------------")
confusion(phone.test$Smartphone, fitted(m3b))

m3c <- train.kknn(Smartphone ~ ., phone.train, kmax = 11, distance = 1,
                  kernel = c("triangular"))
print("-------------")
confusion(phone.test$Smartphone, predict(m3c, phone.test))
```

# Summary

KNN classifiers provide a model-free classifier that can be useful when the shapes of your class distributions are not regular. They may end up being slower than other approaches, because each new case requires calculating distances to the whole training pool, and using them requires ad hoc decisions about how to calculate the distance, decisions that may be hard to justify and whose consequences are not always easy to see.

# Additional resources

* K-Nearest Neighbor Classification:
  - MASS: Chapter 12.3, p. 341
  - https://rstudio-pubs-static.s3.amazonaws.com/123438_3b9052ed40ec4cd2854b72d1aa154df9.html