5210 Chapter 2 Lecture Notes

Shane Mueller
Sept 11 2018

Agenda

Chapter 2: Data Management

  • Reading in data
  • Examining data structures
  • Sorting
  • Aggregation
    • table
    • aggregate and tapply
    • apply
  • Chick Weight example

Reading data into R

  • read.table and read.csv are most common for raw data files.
  • For other data types, try RStudio's menu system.
  • Look at the data after you read it in, to be sure the headers were read properly.
  • Default variable names (V1, V2, and so on) are difficult to work with.

data <- read.table("c5data.txt")
head(data)
  V1 V2 V3   V4  V5  V6  V7  V8  V9 V10 V11 V12      V13      V14
1 15  1  0 6668   0 652 300 653 300   1   0   0  1.00000  1.00000
2 15  1  1 6701  33 652 306 653 300   1  33  33  6.08276  7.08276
3 15  1  2 6732  64 652 313 653 300   1  31  64 13.03840 20.12120
4 15  1  3 6771 103 652 321 653 301   1  39 103 20.02500 40.14620
5 15  1  4 6797 129 651 327 653 301   0  26 103 26.07680 66.22300
6 15  1  5 6820 152 650 332 653 301   0  23 103 31.14480 97.36780


Read in the data file c5data.txt from the RStudio menu, after setting the working directory. Then copy the generated command into an .R file, so you can run it directly from the script.

  • Understand how to change the headers before, during, and after reading the data in.
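The three options for setting headers can be sketched as follows. This is a minimal illustration, not the course's own walkthrough: the temporary files and the column names (subject, block, trial, time, timestamp) are hypothetical, and the values are just the first four columns of the first two rows shown above.

```r
## Hypothetical column names, used only for illustration
new_names <- c("subject", "block", "trial", "time")

## Before: put a header line in the file itself and read with header = TRUE
with_header <- tempfile(fileext = ".txt")
writeLines(c("subject block trial time", "15 1 0 6668", "15 1 1 6701"), with_header)
before <- read.table(with_header, header = TRUE)

## During: the file has no header, so supply names through col.names
no_header <- tempfile(fileext = ".txt")
writeLines(c("15 1 0 6668", "15 1 1 6701"), no_header)
during <- read.table(no_header, col.names = new_names)

## After: read with the default names (V1, V2, ...) and rename afterwards
after <- read.table(no_header)
default_names <- names(after)       ##"V1" "V2" "V3" "V4"
colnames(after) <- new_names        ##rename every column at once
names(after)[4] <- "timestamp"      ##or rename a single column
```

Supplying the names while reading keeps the renaming next to the read command, which makes the script easier to rerun later.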


  • Generate a matrix of random numbers in a table that is 10 columns and 100 rows.
  • Name the columns after the first ten letters of the alphabet (letters[1:10]).
  • Save it out to a .csv data file, and then read it in again.

 dat <- matrix(runif(1000), 100, 10)
 colnames(dat) <- letters[1:10]
 ##row.names = FALSE keeps the row names from being read back as an extra column
 write.csv(dat, "random.csv", row.names = FALSE)
 newdat <- read.csv("random.csv")

Inspecting data objects

There are a number of ways to look at an object and see how it is stored:

'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
[1] "Girth"  "Height" "Volume"

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31

[1] "data.frame"
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  
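The commands that produced the output above are not shown in these notes. Output of this kind is consistent with calls such as the following on the built-in trees data set; treat this as a sketch of common inspection functions rather than the exact commands used in the lecture.

```r
data(trees)                 ##built-in data set: 31 trees, 3 measurements

str(trees)                  ##compact structure: type and first values of each column
var_names <- names(trees)   ##column names
obj_class <- class(trees)   ##how the object is stored
summary(trees)              ##per-column minimum, quartiles, mean, maximum
size <- dim(trees)          ##number of rows and columns
head(trees, 3)              ##first three rows
```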


Sorting is useful, and the built-in sort function will do this for a vector:

 [1] 0.02291064 0.15750374 0.22634794 0.24906685 0.37653700 0.61181822
 [7] 0.74620890 0.85396900 0.88831001 0.93120421
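A short sketch of sort on a vector of random numbers like the one above. The seed is an arbitrary choice for reproducibility, so the values will not match the output shown in these notes.

```r
set.seed(1)                                   ##arbitrary seed, an assumption
x <- runif(10)                                ##ten random numbers between 0 and 1

sorted <- sort(x)                             ##least to greatest
sorted_desc <- sort(x, decreasing = TRUE)     ##greatest to least
```

sort returns a new vector and leaves x unchanged.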

This won't work for data frames, where you may want to sort the rows by the values of one column. Use order for this (not to be confused with the similar rank).

ord <- order(trees$Height)
 [1]  3 20  2  7 14  1 19  4 24 16 23  8 10 15 12 13 25 21 11  9 22 28 29
[24] 30  5 26 27  6 17 18 31
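The difference between order and rank is easiest to see on a tiny vector. In this sketch (the vector is made up for illustration), order answers "which element comes first, second, third?", while rank answers "what position does each element hold?".

```r
x <- c(30, 10, 20)

o <- order(x)    ##2 3 1: element 2 is smallest, then element 3, then element 1
r <- rank(x)     ##3 1 2: the first value is the 3rd smallest, and so on

x_sorted <- x[o] ##10 20 30: indexing by order() sorts the vector
```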

This gives the row indexes in order from the smallest height to the greatest. If you use it as a subscript, it will reorder a vector or data frame accordingly:

 [1] 63 64 65 66 69 70 71 72 72 74 74 75 75 75 76 76 77 78 79 80 80 80 80
[24] 80 81 81 82 83 85 86 87
   Girth Height Volume
3    8.8     63   10.2
20  13.8     64   24.9
2    8.6     65   10.3
7   11.0     66   15.6
14  11.7     69   21.3
1    8.3     70   10.3
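Output like that above can be produced by subscripting with the order vector, as in this sketch. The last two lines go slightly beyond the notes and are included only as common variations: breaking ties with a second column and sorting in decreasing order.

```r
data(trees)
ord <- order(trees$Height)

sorted_heights <- trees$Height[ord]   ##reorder a single vector
sorted_trees <- trees[ord, ]          ##reorder rows; note the comma before ]
head(sorted_trees)

##variations: break ties in Height by Girth, or put the tallest trees first
by_height_girth <- trees[order(trees$Height, trees$Girth), ]
tallest_first <- trees[order(trees$Height, decreasing = TRUE), ]
```

The original row names travel with the rows, which is why the first row of the sorted frame is still labelled 3.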

The type argument of plot allows you to plot points connected by lines, using type="b". First, plot tree height by volume in the original order, connecting adjacent values. Then re-sort the data by tree height and re-plot. Finally, re-sort them in a random order and re-plot.

data(trees)
par(mfrow = c(1, 3))
plot(trees$Volume, trees$Height, type = "b")
ord <- order(trees$Height)
plot(trees$Volume[ord], trees$Height[ord], type = "b")
ord <- sample(1:nrow(trees))
plot(trees$Volume[ord], trees$Height[ord], type = "b")

Aggregation: table

  • The table command gives a count of categorical (or numerical) values
  • With two arguments, it gives a cross-tabulation
 party <-  c("R","R","D","R","R","D","D","D","R","R","D")
 gender <- c("M","M","F","F","F","F","M","M","F","M","M")
 vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
 survey <- data.frame(party,gender,vote)
 ##look at values of each variable: 

D R 
5 6 

 ##look at pairs/contingency tables:


    A B
  F 3 2
  M 4 2
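The table calls themselves are not shown above. Counts like these can be obtained as in the following sketch, which rebuilds the survey data from the notes and then tabulates one variable and a pair of variables.

```r
party  <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party, gender, vote)

##one argument: counts of each value
party_counts <- table(survey$party)

##two arguments: cross-tabulation (rows = gender, columns = vote)
gender_by_vote <- table(survey$gender, survey$vote)
```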

aggregate and tapply work by dividing data by the levels of one or more categorical variables, applying a function to each group, and returning a data structure that recombines these values.

set.seed(111); x <- rnorm(500)    ##generate random numbers
y <- x + runif(500, -.3, .3)      ##related random numbers
dat3 <- data.frame(x = x, y = y)  ##create a data frame
dat3$factor <- factor(round(dat3$x/10, 1)*10)  ##make bins from x
dat3$group <- sample(c("A", "B"), 500, replace = TRUE)
##aggregate y by the bins:
dat3.agg <- aggregate(dat3$y, list(bin = dat3$factor), mean)
##aggregate x by the same bins:
dat3.agg$xvals <- aggregate(dat3$x, list(bin = dat3$factor), mean)$x
  bin          x       xvals
1  -3 -2.9694215 -3.00802536
2  -2 -1.8622799 -1.84992413
3  -1 -0.8908935 -0.89592586
4   0  0.0382442  0.04730851
5   1  0.9096850  0.91197362
6   2  1.7929266  1.82510517
7   3  2.6943167  2.69485129
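Because the random data make it hard to check the result by eye, here is a smaller sketch of the same split-apply-combine idea. The scores data frame is made up for illustration. It also shows the formula interface to aggregate, which is not used in the notes but gives the same group means with friendlier column names.

```r
##hypothetical toy data: two groups with easy-to-check means
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

##list interface, as in the notes: the result column is called x
by_list <- aggregate(scores$value, list(group = scores$group), mean)

##formula interface: the result column keeps the name value
by_formula <- aggregate(value ~ group, data = scores, FUN = mean)
```

Both results are data frames with one row per group, here with means 2 and 20.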

tapply works like aggregate, but returns a named vector or array rather than a data frame.

##use tapply to aggregate y by the bins:
dat3.tab <- tapply(dat3$y,list(bin=dat3$factor),mean)
        -3         -2         -1          0          1          2 
-2.9694215 -1.8622799 -0.8908935  0.0382442  0.9096850  1.7929266 
bin           A           B
  -3 -3.1106901 -2.75751854
  -2 -1.6859730 -1.92360406
  -1 -0.8322358 -0.95914982
  0   0.0284143  0.04672135
  1   0.8842331  0.93338158
  2   1.8288174  1.76002681
  3   2.7074811  2.61533043
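The call behind the two-way table above is not shown. A table of that shape can be produced by passing two grouping factors to tapply, as in this sketch on made-up data (the scores data frame is hypothetical, chosen so the means are easy to verify).

```r
##hypothetical toy data: two bins crossed with two groups
scores <- data.frame(
  bin   = c(1, 1, 1, 1, 2, 2, 2, 2),
  group = c("A", "A", "B", "B", "A", "A", "B", "B"),
  y     = c(1, 3, 5, 7, 10, 20, 30, 40)
)

##one grouping factor: a named vector of means
one_way <- tapply(scores$y, list(bin = scores$bin), mean)

##two grouping factors: a matrix with bins as rows and groups as columns
two_way <- tapply(scores$y, list(bin = scores$bin, group = scores$group), mean)
```

With the data from the notes, the analogous call would group by dat3$factor and dat3$group.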

apply applies a function to each row (1) or each column (2) of a matrix or data frame.

m <- matrix(runif(28),7,4)
#Find minimum in each column
apply(m,2,min) ##By column
[1] 0.05638315 0.17026205 0.20461216 0.17142021
#Find maximum in each row:
apply(m,1,max) ##By row
[1] 0.7625511 0.6690217 0.7489722 0.6249965 0.8821655 0.7703016 0.8819536
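With random numbers it is hard to confirm which direction apply worked in. This sketch uses a small fixed matrix (made up for illustration) so the row and column results can be checked by hand, and it shows that the function can also be one you write yourself.

```r
##2 rows, 3 columns, filled column by column:
##     [,1] [,2] [,3]
##[1,]    1    3    5
##[2,]    2    4    6
m <- matrix(1:6, nrow = 2, ncol = 3)

col_min <- apply(m, 2, min)    ##by column: 1 3 5
row_max <- apply(m, 1, max)    ##by row: 5 6

##any function works, including an anonymous one
row_range <- apply(m, 1, function(r) max(r) - min(r))   ##4 4
```

The result has one value per column when the second argument is 2, and one value per row when it is 1.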

  • The Chick Weight data set follows many chicks (birds) as they grow
  • Analogous to learning, sales growth, and the like across time
  • The initial data require reorganization to make sense of
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
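One possible reorganization, sketched here under the assumption that the data are R's built-in ChickWeight data set (whose columns match the output above), is to aggregate weight by time and diet using the tools from this chapter. The walkthrough may take a different route.

```r
data(ChickWeight)          ##assumed to be the data set shown above
head(ChickWeight)

##mean weight at each measurement time, separately for each diet
mean_weight <- tapply(ChickWeight$weight,
                      list(Time = ChickWeight$Time, Diet = ChickWeight$Diet),
                      mean)

##number of measurements recorded for each chick
obs_per_chick <- table(ChickWeight$Chick)

##the same means as a long-format data frame
mean_weight_long <- aggregate(weight ~ Time + Diet, data = ChickWeight, FUN = mean)
```

The tapply result has one row per measurement time and one column per diet, which is a convenient shape for plotting growth curves.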

Follow along in walkthrough