A Simple Implementation of k-Nearest Neighbors using R

While idly browsing my archive of very old files, I recently stumbled upon a short program I wrote five years ago which uses R to implement a lazy learning algorithm called k-nearest neighbors (kNN). I decided to post the code here for anyone who might be interested.

The principle behind kNN is simple. (A deeper discussion can be found here.) The training data are used to predict the response of a test subject: we take the modal (most frequently occurring) value of the dependent variable among the test subject's k nearest neighbors, where nearness is measured by Euclidean distance:

\mathrm{distance}\big((x_1,x_2,\ldots,x_n),(y_1,y_2,\ldots,y_n)\big) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2}

The n here refers to the number of independent variables; each n-tuple gives the coordinates of one observation in terms of its independent variables. It is impossible to plot the coordinates if n>3, and I do not know enough R to plot in 3D, so the plotting function further below only handles n=2.
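In R the formula above is a one-liner, which may help make it concrete. A minimal sketch, with two made-up points:

# Euclidean distance between two made-up points in the plane
x<-c(35,50)
y<-c(22,40)
sqrt(sum((x-y)^2)) # about 16.4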

So how does kNN predict? Once all the observations from the training data have been plotted, we find the k observations with the smallest Euclidean distance from the test subject. These are, essentially, the test subject's nearest neighbors. We then determine the most common response among these k neighbors, and this becomes the predicted response of the test subject. The program below only allows one dependent variable. Moreover, it uses only the basic principle of computing the Euclidean distance, which is really too idealized for actual research. In reality, any data gathered are subject to noise, and much of statistical analysis is concerned with how to deal with that noise (the probability involved in selecting the sample from a population, responses affected by external factors that cannot be measured, etc.). There are, in fact, a number of kNN packages available from the R website, and I believe these are sophisticated enough to be used for professional-level research.
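One such package that ships with the standard R distribution is class. Its knn() function does in one call what the script below does by hand; here is a quick sketch using the same toy data that appears later in this post:

# kNN via the class package, which comes with R
library(class)
# age, income and number of credit cards, as in the example below
train<-cbind(c(35,22,63,59,25),c(35,50,200,170,40),c(3,2,1,1,4))
cl<-factor(c("No","Yes","No","No","Yes"))
test<-rbind(c(37,50,2))
# the explicit class:: prefix avoids a name clash with our own knn below
class::knn(train, test, cl, k=3)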

I have included the code below, along with some discussion. To start, simply open R, click File->New Script, and paste the code below. Place the cursor before the first line and press Ctrl+R; R will then run each line of code one by one. This, I believe, is less cumbersome than typing every single command at the console. Besides, you get to keep all the code in a neat file which you can save for future use.
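If you prefer, you can also save the script to a file and run the whole thing in one go with source(); the file name below is just an example:

# run a saved script from the console
source("knn.R")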

Now that I realize I'm no longer using R as often as I used to, let me tell you something about my experience with it. As a statistics major (or rather, as a math major with a concentration in statistical analysis), I found R an indispensable part of my undergraduate statistics classes. There has been an effort from the Math Department to promote R, and they have very good reasons for doing so. R is free and has a very active community checking bugs, writing packages of specialized functions, and developing the software. As such, it has more capabilities than commercial packages such as SPSS or SAS. I doubt, however, that R has caught on in industry, and I bet SPSS and SAS are still the preferred software, not to mention Excel.

I think the biggest hurdle in using R is its command-line interface, which can be very intimidating for the beginner. A typo or a missing or misplaced delimiter can either cause an error, which R will try to point out via a warning message, or R will run the command anyway and give the wrong result. I have, at a certain point, taught/tutored R to a mostly adult audience (ages 30 to 50), and I could feel how frustrated they were with an interface that, for them, requires the fortitude of a young, zealous programmer. Nothing really beats entering data on a spreadsheet and clicking here and there to get the desired output. The great thing about R, though, is its customizability: one can write programs to suit one's needs, so R is best for students who are expected to implement theory and, most especially, for academic researchers who may be testing a new method of analyzing data.

# function knn returns the kNN class
# test is a numeric vector of length t
# response is a vector of length r
# train is an r by t numeric matrix
# k is any integer less than r. If k=r, the computations
# are rendered useless since the function will just return the
# modal class.
#
###### Function begins here
knn<-function(train, test, response, k){
# column 1 will hold the squared distance of each training row
# from the test subject; column 2 will hold that row's response
distance<-matrix(nrow=nrow(train),ncol=2)
for(i in 1:nrow(train)){
distance[i,1]<-0
for(j in 1:ncol(train)){
distance[i,1]<-distance[i,1]+(test[j]-train[i,j])^2
}}
distance[,2]<-cbind(response)
# sort the distances and keep the k smallest
# (the response assignment above turned the matrix into
# character, hence the as.numeric)
srt<-sort(as.numeric(distance[,1]))
srt<-srt[srt<=srt[k]]
res<-NULL
for(i in 1:k) for(j in 1:nrow(train)) if(srt[i]==as.numeric(distance[j,1])) res<-c(res,distance[j,2])
k<-table(res)
# the modal (most frequent) class among the k nearest neighbors
class<-names(k)[k==max(k)]
class
}
###### Function ends here

# Here are some test data.
# Age in years
age<-c(35,22,63,59,25)
# Income per month in thousand pesos
income<-c(35,50,200,170,40)
# Number of credit cards
nc<-c(3,2,1,1,4)
# The training matrix
train<-cbind(age,income,nc)
# Decisions made by respondents on whether they will buy
# Product A
response<-c("No", "Yes", "No", "No", "Yes")
# Now we want to know if our respondent with
# age 37, income 50T pesos and 2 credit cards will buy
# Product A
test<-c(37,50,2)
knn(train, test, response, k=3)

# function knn.recall estimates how well kNN "recalls" the
# training data: each row is held out in turn, predicted from
# the remaining rows, and the fraction of correct predictions
# is returned.
# train and response are as in knn
# k is any integer less than the number of rows
#
###### Function begins here
knn.recall<-function(train, response, k){
correct<-0
for(i in 1:nrow(train)){
if (knn(train=train[-i,], test=rbind(train[i,]), response=response[-i], k=k) == response[i]) correct<-correct+1
}
correct/nrow(train)
}
###### Function ends here

# We have an 80% recall for k=1, and zero for the other values
# of k! ^_^ Either we have bad data, or kNN is not a suitable
# learning method here.
# One might consider SVM, LDA, or even multiple regression instead.
#
knn.recall(train, response, k=1)
knn.recall(train, response, k=2)
knn.recall(train, response, k=3)
knn.recall(train, response, k=4)
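As an aside, the four calls above can be collapsed into a single line; a small sketch using sapply:

# recall for k = 1, 2, 3, 4 in one call
sapply(1:4, function(k) knn.recall(train, response, k=k))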
#
# Suppose we have,
k1<-2+rnorm(10)
k2<-2+rnorm(10)
j1<-10+rnorm(10)
j2<-10+rnorm(10)
train<-cbind(c(k1,j1),c(k2,j2))
response<-c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
knn.recall(train, response, k=7)
# Here we get 100% recall for any value of k.
# The two groups are obviously too well separated; the data are
# "overtrained", so to speak.

# Or we can try,
k1<-2+rnorm(10)
k2<-2+rnorm(10)
j1<-4+rnorm(10)
j2<-4+rnorm(10)
train<-cbind(c(k1,j1),c(k2,j2))
response<-c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1)
knn.recall(train, response, k=7)
# This gives a more reasonable recall value.
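Note that rnorm() draws fresh random numbers on every run, so the recall values in these examples will vary from run to run. Fixing the seed beforehand makes them reproducible; the seed value here is arbitrary:

# call before generating the data to get the same draws every run
set.seed(1)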

# It would be more exciting, though, if we could visualize
# the kNN algorithm. So I tinkered with the knn function above
# and simply added a plot parameter.

# same parameters as knn, plus plot: set plot=TRUE to draw the data

###### Function begins here
knn.plot<-function(train, test, response, k, plot=FALSE){
distance<-matrix(nrow=nrow(train),ncol=2)
for(i in 1:nrow(train)){
distance[i,1]<-0
for(j in 1:ncol(train)){
distance[i,1]<-distance[i,1]+(test[j]-train[i,j])^2
}}

distance[,2]<-cbind(response)
srt<-sort(as.numeric(distance[,1]))
srt<-srt[srt<=srt[k]]
res<-NULL
for(i in 1:k) for(j in 1:nrow(train)) if(srt[i]==as.numeric(distance[j,1])) res<-c(res,distance[j,2])
if((plot==TRUE)&&(ncol(train)==2)){
kk<-table(response)
kk<-labels(kk)
# plot every point in white first, just to set up the axes
gv<-rbind(train,test)
plot(gv, col='white')
for(i in 1:nrow(train)){
for(j in 1:length(kk[[1]])) if(response[i]==kk[[1]][j]) clr=j
text(x=train[i,1], y=train[i,2], labels=response[i], cex=.8, col=4+clr)
}
text(x=test[1], y=test[2], labels="unknown", cex=.75, col=1)
for(i in 1:k){
for(j in 1:nrow(distance)){
if(srt[i]==as.numeric(distance[j,1])) {
text(x=train[j,1], y=train[j,2], labels=response[j], cex=.8, col=3)
segments(test[1], test[2], train[j,1], train[j,2], col=3)
}}}
} # if plot
if((plot==TRUE)&&(ncol(train)!=2)){
cat("\nCannot plot training data with more than 2 dimensions.\n")
}
k<-table(res)
# the modal (most frequent) class among the k nearest neighbors
class<-names(k)[k==max(k)]
class
}
###### Function ends here

# Using the data:
k1<-10*rnorm(15)
k2<-30*rnorm(15)
j1<-45*rnorm(15)
j2<-32*rnorm(15)
train<-cbind(c(k1,j1),c(k2,j2))
test<-c(0,0)
response<-c(rep(0,15),rep(1,15))
# we can have a modest plot implicitly showing the
# kNN associations
knn.plot(train, test, response, k=7, plot=TRUE)
#
# Let's have another one. We go back to the first example
# but with more data. Unfortunately, the plotting function
# can only handle two independent variables, so let's drop age.
# I have also tweaked the data a bit, as you may have noticed.
# Income per month in thousand pesos
income<-c(20,20,35,40,60,100,100,200,300,200,300,120)
# Number of credit cards (with added jitter)
nc<-c(1.1,0.3,1.4,1.3,0.5,4.2,2,3,6,5,4,3)
# The training matrix
train<-cbind(income,nc)
# Decisions made by respondents on whether they will buy
# Product A
response<-c("No", "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes")
# Now we want to know if our respondent with
# income 50T pesos and 2 credit cards will buy
# Product A
test<-c(50,2)
# And here we go
knn.plot(train, test, response, k=7, plot=TRUE)
# Note that:
# Horizontal axis = income
# Vertical axis = number of credit cards.
# The "unknown" label shows where our test subject is.
# The majority of its 7 nearest neighbors answered "No".


