5210 Chapter 2 Lecture Notes

Shane Mueller
Sept 11 2018

Announcements & Agenda

  • Problem set 2 is available (Due Sunday at 10 PM)
  • Be sure to keep up with pre-class discussion and reading.
    • One discussion today (Due by midnight)
    • One discussion Thursday (Due by 1:00 pm)
  • 1: Review solution to Problem Set 1
  • 2: Chapter 2
  • 3: Problem Set 2 Available

Chapter 2: Data Management

  • Reading in data
  • Examining data structures
  • Sorting
  • Aggregation
    • table
    • aggregate and tapply
    • apply
  • Chick Weight example

Reading data into R

  • read.table and read.csv are most common for raw data files.
  • For other data types, try RStudio's menu system.
  • Look at the data after you read it in, to be sure the headers were read correctly.
  • Default variable names are difficult to work with.
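
To make the point about headers concrete, here is a minimal sketch. The file and its column names (subject, trial, rt) are hypothetical and are written to a temporary location just for illustration; they are not the contents of c5data.txt.

```r
# Write a small space-separated file with a header row to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("subject trial rt",
             "15 1 652",
             "15 2 648",
             "16 1 701"), tmp)

# header = FALSE treats the header row as data:
# columns get the default names V1, V2, V3 and are no longer numeric.
raw <- read.table(tmp, header = FALSE)
names(raw)
str(raw)

# header = TRUE uses the first row as column names and keeps the numbers numeric.
named <- read.table(tmp, header = TRUE)
names(named)
str(named)
```

Running str() on the result right after reading is a quick way to catch a header that was read as a data row.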

Reading data into R

Example:

data <- read.table("c5data.txt")
head(data)
  V1 V2 V3   V4  V5  V6  V7  V8  V9 V10 V11 V12      V13      V14
1 15  1  0 6668   0 652 300 653 300   1   0   0  1.00000  1.00000
2 15  1  1 6701  33 652 306 653 300   1  33  33  6.08276  7.08276
3 15  1  2 6732  64 652 313 653 300   1  31  64 13.03840 20.12120
4 15  1  3 6771 103 652 321 653 301   1  39 103 20.02500 40.14620
5 15  1  4 6797 129 651 327 653 301   0  26 103 26.07680 66.22300
6 15  1  5 6820 152 650 332 653 301   0  23 103 31.14480 97.36780

Exercise

Set the working directory, then read in the data file c5data.txt using the RStudio menu. Then copy the generated command into a .R file and load the data directly from there.

Other related functions

read.csv()
write.csv()
write.table()

  • Understand how to change the headers before, during, and after reading them in.
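
One possible way to handle each of the three cases is sketched below. The file is a hypothetical headerless file created in a temporary location, and the column names are made up for illustration.

```r
# Hypothetical headerless file, written to a temporary location for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("15 1 652", "15 2 648", "16 1 701"), tmp)

# During reading: supply the names with col.names.
d1 <- read.table(tmp, col.names = c("subject", "trial", "rt"))

# After reading: assign to names() (colnames() also works).
d2 <- read.table(tmp)              # default names V1, V2, V3
names(d2) <- c("subject", "trial", "rt")
names(d2)[3] <- "resp_time"        # rename a single column

# Before reading: edit the file itself, here by prepending a header line.
tmp2 <- tempfile(fileext = ".txt")
writeLines(c("subject trial rt", readLines(tmp)), tmp2)
d3 <- read.table(tmp2, header = TRUE)

names(d1)
names(d2)
names(d3)
```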

Exercise

  • Generate a matrix of random numbers in a table that is 10 columns and 100 rows.
  • Name the columns after the first ten letters of the alphabet (letters[1:10]).
  • Save it out to a .csv data file, and then read it in again.

Exercise Solution

 dat <- matrix(runif(1000), 100, 10)
 colnames(dat) <- letters[1:10]
 write.csv(dat, "random.csv", row.names = FALSE)  # omit row names so the file holds only the 10 columns
 newdat <- read.csv("random.csv")
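
One thing worth checking after a round trip like this is whether the data came back with the same shape. By default write.csv also writes the row names, and read.csv then reads them back as an extra first column. The sketch below uses a smaller matrix, temporary files, and an arbitrary seed so that it can be run without touching the working directory.

```r
set.seed(1)                                # arbitrary seed so the example is repeatable
m <- matrix(runif(20), nrow = 5, ncol = 4)
colnames(m) <- letters[1:4]

f_default <- tempfile(fileext = ".csv")
f_norows  <- tempfile(fileext = ".csv")

write.csv(m, f_default)                    # default: row names are written too
write.csv(m, f_norows, row.names = FALSE)  # suppress the row-name column

back_default <- read.csv(f_default)
back_norows  <- read.csv(f_norows)

dim(back_default)    # one extra column (named "X") holding the row names
dim(back_norows)     # same shape as the original matrix

# read.csv returns a data frame, so convert before comparing with the matrix.
all.equal(as.matrix(back_norows), m)
```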

Inspecting data objects

There are a number of ways to look at an object and see how it is stored:

data(trees)
str(trees)
'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
attributes(trees)
$names
[1] "Girth"  "Height" "Volume"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31

$class
[1] "data.frame"
summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  
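
A few other base R functions answer narrower questions about the same object. This is only a short sketch of commonly used ones, not a complete list.

```r
data(trees)

class(trees)           # what kind of object it is
dim(trees)             # number of rows and columns
names(trees)           # column names
head(trees, 3)         # first few rows
tail(trees, 3)         # last few rows
sapply(trees, class)   # storage class of each column
```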

Sorting

Sorting is useful, and the built-in sort function will do this for a vector:

sort(runif(10))
 [1] 0.02291064 0.15750374 0.22634794 0.24906685 0.37653700 0.61181822
 [7] 0.74620890 0.85396900 0.88831001 0.93120421

sort won't work for data frames, where you may want to sort the whole frame by the values of one column. Use order for this (not to be confused with the similar-looking rank):

ord <- order(trees$Height)
ord
 [1]  3 20  2  7 14  1 19  4 24 16 23  8 10 15 12 13 25 21 11  9 22 28 29
[24] 30  5 26 27  6 17 18 31
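
Since order and rank are easy to mix up, a tiny made-up vector shows the difference: order gives the positions to pick from, while rank gives where each element would end up.

```r
x <- c(30, 10, 20)

order(x)      # 2 3 1: positions of the smallest, next smallest, ... elements
rank(x)       # 3 1 2: where each element would land in the sorted vector
x[order(x)]   # 10 20 30: same result as sort(x)
```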

Sorting pt 2

order returns the indexes of the elements from least to greatest. If you use this index vector to subset, it reorders the original vector or data frame accordingly:

trees$Height[ord]
 [1] 63 64 65 66 69 70 71 72 72 74 74 75 75 75 76 76 77 78 79 80 80 80 80
[24] 80 81 81 82 83 85 86 87
head(trees[ord,])
   Girth Height Volume
3    8.8     63   10.2
20  13.8     64   24.9
2    8.6     65   10.3
7   11.0     66   15.6
14  11.7     69   21.3
1    8.3     70   10.3
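
Two common variations are sorting from greatest to least and breaking ties with a second column. The sketch below uses the decreasing argument of order and passes a second sort key; the variable names are chosen only for illustration.

```r
data(trees)

# Tallest trees first.
tallest_first <- trees[order(trees$Height, decreasing = TRUE), ]
head(tallest_first, 3)

# Sort by Height, breaking ties by Girth (several trees share a height of 80).
by_height_girth <- trees[order(trees$Height, trees$Girth), ]
by_height_girth[by_height_girth$Height == 80, ]
```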

Sorting Exercise

The type argument of plot controls how points are drawn; type="b" plots points connected by lines. First, plot tree height against volume in the original row order, connecting adjacent values with type="b". Then re-sort the rows by tree height and re-plot. Finally, re-sort them in a random order and re-plot.

Sorting Exercise Solution

data(trees)
par(mfrow = c(1, 3))
plot(trees$Volume, trees$Height, type = "b")
ord <- order(trees$Height)
plot(trees$Volume[ord], trees$Height[ord], type = "b")
ord <- sample(1:nrow(trees))
plot(trees$Volume[ord], trees$Height[ord], type = "b")


Aggregation: table

  • The table command gives a count of categorical (or numerical) values
  • With two arguments, it gives a cross-tabulation
 party <-  c("R","R","D","R","R","D","D","D","R","R","D")
 gender <- c("M","M","F","F","F","F","M","M","F","M","M")
 vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
 survey <- data.frame(party,gender,vote)
 ##look at values of each variable: 
 table(survey$party)

D R 
5 6 

Aggregation: cross-tabulation

Look at pairs of variables with a contingency table:

table(survey$gender,survey$vote)

    A B
  F 3 2
  M 4 2
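
Counts are often easier to interpret with totals or proportions attached. The sketch below rebuilds the same survey data and uses addmargins, prop.table, and ftable from base R; naming the table dimensions (gender, vote) is a choice made here for readability.

```r
party  <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party, gender, vote)

tab <- table(gender = survey$gender, vote = survey$vote)

addmargins(tab)               # append row and column totals
prop.table(tab, margin = 1)   # proportions within each gender (rows sum to 1)

# A three-way table is easier to read when flattened with ftable().
ftable(table(survey$party, survey$gender, survey$vote))
```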

Aggregation: aggregate and tapply

aggregate and tapply work by dividing data by the levels of one or more categorical variables, applying a function to each group, and returning a data structure that recombines these values.

set.seed(111)
x <- rnorm(500)                  ##generate random numbers
y <- x + runif(500,-.3,.3)       ##related random numbers
dat3 <- data.frame(x=x,y=y)      ##create a data frame
dat3$factor <- factor(round(dat3$x/10,1)*10)  ##make bins from x
dat3$group <- sample(c("A","B"),500,replace=TRUE)
##aggregate y by the bins (the result column is named x):
dat3.agg <- aggregate(dat3$y,list(bin=dat3$factor),mean)
##aggregate x by the same bins:
dat3.agg$xvals <- aggregate(dat3$x,list(bin=dat3$factor),mean)$x
dat3.agg
  bin          x       xvals
1  -3 -2.9694215 -3.00802536
2  -2 -1.8622799 -1.84992413
3  -1 -0.8908935 -0.89592586
4   0  0.0382442  0.04730851
5   1  0.9096850  0.91197362
6   2  1.7929266  1.82510517
7   3  2.6943167  2.69485129
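
aggregate also accepts a formula, which keeps the original column names in the result. The sketch below uses a small made-up data frame (group, cond, and score are hypothetical names) so that the group means can be checked by hand.

```r
# Small hypothetical data set: scores for two groups under two conditions.
d <- data.frame(
  group = c("A", "A", "B", "B", "A", "B"),
  cond  = c("x", "y", "x", "y", "x", "y"),
  score = c(1, 2, 3, 4, 5, 6)
)

# Formula interface: outcome on the left, grouping variable(s) on the right.
by_group <- aggregate(score ~ group, data = d, FUN = mean)
by_group

# Two grouping variables give one row per combination that occurs in the data.
by_both <- aggregate(score ~ group + cond, data = d, FUN = mean)
by_both
```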

Aggregation: tapply

tapply works like aggregate, but returns a vector or array indexed by group rather than a data frame. Use tapply to aggregate y by the same bins:

dat3.tab <- tapply(dat3$y,list(bin=dat3$factor),mean)
dat3.tab
bin
        -3         -2         -1          0          1          2 
-2.9694215 -1.8622799 -0.8908935  0.0382442  0.9096850  1.7929266 
         3 
 2.6943167 
tapply(dat3$y,list(bin=dat3$factor,group=dat3$group),mean)
    group
bin           A           B
  -3 -3.1106901 -2.75751854
  -2 -1.6859730 -1.92360406
  -1 -0.8322358 -0.95914982
  0   0.0284143  0.04672135
  1   0.8842331  0.93338158
  2   1.8288174  1.76002681
  3   2.7074811  2.61533043
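
The difference in output type matters mostly for how you look values up afterwards. A minimal comparison on a small made-up data frame (group and score are hypothetical names):

```r
d <- data.frame(
  group = c("A", "A", "B", "B", "A", "B"),
  score = c(1, 2, 3, 4, 5, 6)
)

agg <- aggregate(d$score, list(group = d$group), mean)
tab <- tapply(d$score, d$group, mean)

class(agg)   # a data frame: one row per group, result in column x
class(tab)   # an array with one named element per group

tab["A"]                   # look up a group mean by name
agg$x[agg$group == "A"]    # the same value from the data frame
```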

Aggregation by column or row: apply

apply applies a function to each row (1) or each column (2) of a matrix or data frame.

set.seed(100)
m <- matrix(runif(28),7,4)
#Find minimum in each column
apply(m,2,min) ##By column
[1] 0.05638315 0.17026205 0.20461216 0.17142021
#Find maximum in each row:
apply(m,1,max) ##By row
[1] 0.7625511 0.6690217 0.7489722 0.6249965 0.8821655 0.7703016 0.8819536
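
A small matrix with known values makes it easy to confirm which margin is which. This sketch also shows the dedicated rowSums and colMeans functions and a user-written function passed to apply.

```r
m <- matrix(1:6, nrow = 2)   # rows: (1, 3, 5) and (2, 4, 6)

apply(m, 1, sum)   # one value per row:    9 12
apply(m, 2, sum)   # one value per column: 3 7 11

# For sums and means, the dedicated functions give the same answers.
rowSums(m)
colMeans(m)

# Any function of a vector works, including one you write yourself.
apply(m, 2, function(col) max(col) - min(col))
```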

End-to-end Example walkthrough: Chick Weights

  • This data set follows many chicks (birds) as they grow
  • Analogous to learning, sales growth, and the like across time
  • The initial data require reorganization to make sense of
help(ChickWeight)
head(ChickWeight)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
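
As one possible first step (not necessarily the route the walkthrough takes), the tools from this chapter can already reorganize the long format above into a Time by Diet table of mean weights. The object name wt is chosen here for illustration.

```r
data(ChickWeight)

# How many measurements does each diet contribute?
table(ChickWeight$Diet)

# Mean weight for every Time x Diet combination: rows are days, columns are diets.
wt <- tapply(ChickWeight$weight,
             list(Time = ChickWeight$Time, Diet = ChickWeight$Diet),
             mean)
round(wt, 1)

# The wide layout makes it easy to draw one growth curve per diet.
matplot(as.numeric(rownames(wt)), wt, type = "b",
        xlab = "Time (days)", ylab = "Mean weight (g)")
```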

Follow along in the walkthrough.