TechHacks Talk - Dec. 6, 2013 - Recommender System


Data Mining Example - Recommender Systems

We will explore building a recommender system.

Load data

The first step is to load some sample data. The data was collected by survey from cs1000 students in Fall 2013 and is available here.

dat1 <- read.csv("~/Desktop/TechHacks-12-6-13/data/cs1000_f13_data.csv")

Note that you will need to change the path so that it points to your working directory, or to wherever the file is located on your computer.
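As a minimal sketch of the path handling described above, the following writes a tiny stand-in CSV to a temporary folder and reads it back. The file name `toy_ratings.csv` and the two show columns are hypothetical, not part of the survey data. The sketch also shows why the column names in the real data contain dots: by default `read.csv` turns spaces and apostrophes in headers into `.`.

```r
# Hypothetical stand-in for the survey file, written to a temporary folder.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "toy_ratings.csv")

toy <- data.frame("Show A" = c(5, NA, 3),
                  "Grey's Show" = c(NA, 2, 4),
                  check.names = FALSE)
write.csv(toy, csv_path, row.names = FALSE)

# Check the path before reading, so a wrong location fails with a clear message.
if (!file.exists(csv_path)) stop("CSV not found at: ", csv_path)
dat_toy <- read.csv(csv_path)

# Headers are made syntactically valid: "Show A" becomes "Show.A".
colnames(dat_toy)
```

Building the path with `file.path()` rather than hard-coding `~/Desktop/...` makes it easier to move the script between machines.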

Install and Load RecommenderLab R package

We will use functions from the recommenderlab R package.

If you haven't already installed the recommenderlab package, do so:

install.packages("recommenderlab", dependencies = TRUE)

Then load the package:

library(recommenderlab)

Work with Data

First, convert the data into the format that the recommenderlab functions expect.

m1 <- as.matrix(dat1)
r1 <- as(m1, "realRatingMatrix")
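To make the conversion above more concrete, here is a small sketch on a made-up 2 x 3 matrix (the users `u1`, `u2` and the shows `ShowA` to `ShowC` are hypothetical). It assumes recommenderlab is installed. `NA` entries are treated as "not rated", so only the observed ratings are stored.

```r
library(recommenderlab)

# Toy ratings: rows are users, columns are shows, NA means "not rated".
toy <- matrix(c(5, NA, 3,
                NA, 4, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("u1", "u2"), c("ShowA", "ShowB", "ShowC")))

r_toy <- as(toy, "realRatingMatrix")

n_stored <- nratings(r_toy)   # number of observed (non-NA) ratings
per_user <- rowCounts(r_toy)  # how many shows each user rated
back <- as(r_toy, "matrix")   # convert back; missing ratings reappear as NA
```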

To get a view of the distribution of ratings, we can plot the raw and the normalized ratings as heat maps.

rn1 <- normalize(r1)
image(r1, main = "Ratings")

[Figure: heat map titled "Ratings"]

image(rn1, main = "Normalized Ratings")

[Figure: heat map titled "Normalized Ratings"]
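The normalization step can be illustrated in base R. The sketch below assumes that normalizing means centering each user's ratings on that user's own mean, which, as far as we know, is what `normalize()` does by default. The toy matrix is made up for illustration.

```r
# Toy ratings: rows are users, columns are shows, NA means "not rated".
m <- matrix(c(5, 3, NA,
              2, NA, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("u1", "u2"), c("A", "B", "C")))

# Mean of each user's observed ratings.
user_means <- rowMeans(m, na.rm = TRUE)

# Subtract each user's mean from that user's row; NA entries stay NA.
centered <- sweep(m, 1, user_means)
centered
```

After centering, a positive value means the user rated the show above their own average, which makes generous and harsh raters easier to compare.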

Next, we can look at the mean rating for each student and plot this as a histogram.

rmn <- rowMeans(r1)
rmn

##  [1] 3.300 2.800 3.000 2.667 3.182 2.700 3.857 2.000 5.000 3.000 3.333
## [12] 2.556 4.000 3.333 3.000 1.667 3.556 2.700 3.000 3.231 2.750 2.900
## [23] 2.333 2.429 3.000 2.286 5.000 3.000 3.333 2.800 3.000 1.600 3.375
## [34] 5.000 2.417 4.000 2.214 3.500 3.467 4.500 3.200 2.818 3.273 3.600
## [45] 1.667 3.462 2.600 2.786 2.636 3.800 3.750 3.600 1.615 3.600 2.769
## [56] 3.500 3.182 1.462 2.667 3.500 2.231 2.444 3.000 1.875 2.583 1.182
## [67] 2.800 3.000 4.667 3.714 2.667 3.091 3.000 3.125 3.455 4.500 3.000
## [78] 2.667 2.500 3.667 3.429 2.000 4.333 2.889 1.778

hist(rmn, breaks = 10, main = "Histogram of Mean User Ratings", xlab = "Ratings")

[Figure: histogram of mean user ratings]

We can also look at the mean rating for each show.

cmn <- colMeans(r1)
cmn

##                 American.Idol               Big.Bang.Theory 
##                         2.058                         3.556 
##                  Breaking.Bad               Game.of.Thrones 
##                         4.064                         3.732 
##                Grey.s.Anatomy                      Homeland 
##                         1.878                         2.316 
##         How.I.Met.Your.Mother                     Justified 
##                         3.491                         2.545 
##                       Mad.Men                 Modern.Family 
##                         2.750                         2.659 
##                          NCIS          Parks.and.Recreation 
##                         3.551                         2.792 
## Real.Housewives.of.New.Jersey                     The.Voice 
##                         1.300                         2.062 
##              The.Walking.Dead 
##                         3.750

hist(cmn, breaks = 10, main = "Histogram of Mean TV Rating", xlab = "Ratings")

[Figure: histogram of mean TV ratings]

Sorting the shows by mean rating (in ascending order, so the top shows appear last in the output) shows that the three highest rated shows are:

Breaking Bad
The Walking Dead
Game of Thrones

nd <- order(cmn)
colnames(dat1)[nd]

##  [1] "Real.Housewives.of.New.Jersey" "Grey.s.Anatomy"               
##  [3] "American.Idol"                 "The.Voice"                    
##  [5] "Homeland"                      "Justified"                    
##  [7] "Modern.Family"                 "Mad.Men"                      
##  [9] "Parks.and.Recreation"          "How.I.Met.Your.Mother"        
## [11] "NCIS"                          "Big.Bang.Theory"              
## [13] "Game.of.Thrones"               "The.Walking.Dead"             
## [15] "Breaking.Bad"
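Since `order()` sorts in ascending order, the top shows end up at the bottom of the list above. As a small sketch, sorting in decreasing order puts them first. The vector below copies only five of the mean ratings printed earlier, and the name `cmn_subset` is ours.

```r
# Five of the column means from the output above.
cmn_subset <- c(Breaking.Bad = 4.064,
                Game.of.Thrones = 3.732,
                NCIS = 3.551,
                The.Walking.Dead = 3.750,
                Big.Bang.Theory = 3.556)

# Sort from highest to lowest mean rating and keep the first three names.
top3 <- names(sort(cmn_subset, decreasing = TRUE))[1:3]

# Replace the dots introduced by read.csv with spaces for display.
top3_labels <- gsub(".", " ", top3, fixed = TRUE)
top3_labels
```

With the full data, the same idea would be `names(sort(cmn, decreasing = TRUE))[1:3]`.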

The next section follows the examples in the recommenderlab documentation.

Recommendation System

ubr <- Recommender(r1, method = "UBCF")
pred <- predict(ubr, r1, type = "ratings")
# as(pred, 'matrix')
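"UBCF" stands for user-based collaborative filtering: a missing rating is predicted from the ratings of the most similar users. The base R sketch below illustrates the idea on made-up data. It is not recommenderlab's implementation, and the function names `cosine_sim` and `predict_ubcf`, the toy matrix, and the choice of `k = 2` neighbours are all assumptions for illustration.

```r
# Toy ratings: rows are users, columns are shows, NA means "not rated".
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 2,  1,
                    1, 2, 5,  NA,
                    5, 5, 1,  2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("u", 1:4), paste0("show", 1:4)))

# Cosine similarity over the shows that both users rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(NA_real_)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Predict one rating as the similarity-weighted mean of the k nearest users.
predict_ubcf <- function(ratings, user, item, k = 2) {
  others <- setdiff(rownames(ratings), user)
  # Only users who actually rated the item can contribute.
  others <- others[!is.na(ratings[others, item])]
  sims <- sapply(others, function(o) cosine_sim(ratings[user, ], ratings[o, ]))
  sims <- sims[!is.na(sims)]
  nn <- names(sort(sims, decreasing = TRUE))[seq_len(min(k, length(sims)))]
  weighted.mean(ratings[nn, item], w = sims[nn])
}

p <- predict_ubcf(ratings, "u1", "show3", k = 2)
p
```

Here `u1` agrees most closely with `u4` and `u2`, who both gave `show3` a low rating, so the predicted rating for `u1` is also low.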

The supplied ratings for User 3 are:

dat1[3, ]

##   American.Idol Big.Bang.Theory Breaking.Bad Game.of.Thrones
## 3             3               1            5              NA
##   Grey.s.Anatomy Homeland How.I.Met.Your.Mother Justified Mad.Men
## 3             NA       NA                     5        NA      NA
##   Modern.Family NCIS Parks.and.Recreation Real.Housewives.of.New.Jersey
## 3            NA   NA                   NA                            NA
##   The.Voice The.Walking.Dead
## 3         2                2

The predicted ratings for the missing shows are:

colnames(dat1)[is.na(dat1[3, ])]

## [1] "Game.of.Thrones"               "Grey.s.Anatomy"               
## [3] "Homeland"                      "Justified"                    
## [5] "Mad.Men"                       "Modern.Family"                
## [7] "NCIS"                          "Parks.and.Recreation"         
## [9] "Real.Housewives.of.New.Jersey"

getRatings(pred[3, ])

## [1] 3.414 2.627 2.872 3.034 3.021 2.932 3.099 3.152 2.657
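`getRatings()` returns the values without show names. As a sketch, we can pair the two outputs above by hand and sort them, assuming the predicted values are in the same column order as the list of missing shows. Both vectors are copied from the output above; the variable names are ours.

```r
# Shows that user 3 did not rate, in column order (copied from above).
missing_shows <- c("Game.of.Thrones", "Grey.s.Anatomy", "Homeland",
                   "Justified", "Mad.Men", "Modern.Family", "NCIS",
                   "Parks.and.Recreation", "Real.Housewives.of.New.Jersey")

# Predicted ratings for user 3 (copied from above).
predicted <- c(3.414, 2.627, 2.872, 3.034, 3.021, 2.932, 3.099, 3.152, 2.657)

# Attach the names, then sort from highest to lowest predicted rating.
named_pred <- setNames(predicted, missing_shows)
top3 <- head(sort(named_pred, decreasing = TRUE), 3)
top3
```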

Present the top 2 recommended shows for each user (output suppressed for space):

pred <- predict(ubr, r1, n = 2)

recs <- bestN(pred, n = 2)
as(recs, "list")

Present the top 3 recommended shows for user 3:

pred <- predict(ubr, r1[3], n = 3)
as(pred, "list")

## [[1]]
## [1] "Game.of.Thrones"      "Parks.and.Recreation" "NCIS"