5210 Chapter 2 Lecture Notes
========================================================
author: Shane Mueller
date: Sept 11 2018
autosize: false

Announcements & Agenda
========================================================

- Problem set 2 is available (Due Sunday at 10 PM)
- Be sure to keep up with pre-class discussion and reading.
    + One discussion today (Due by midnight)
    + One discussion Thursday (Due by 1:00 pm)
- 1: Review solution to Problem Set 1
- 2: Chapter 2
- 3: Problem Set 2 Available

Chapter 2: Data Management
========================================================

- Reading in data
- Examining data structures
- Sorting
- Aggregation
    + table
    + aggregate and tapply
    + apply
- Chick Weight example

Reading data into R
========================================================

- read.table and read.csv are the most common functions for raw data files.
- For other data types, try RStudio's menu system.
- Look at the data after you read it in, to be sure the headers were read properly.
- Default variable names are difficult to work with.

Reading data into R
========================================================

Example:

```{r}
data <- read.table("c5data.txt")
head(data)
```

Exercise
========================================================

Read in the data file c5data.txt from the menu, after setting the working directory. Then copy the generated command into an .R file, and load the data directly from there.

Other related functions
========================================================

```
read.csv()
write.csv()
write.table()
```

- Understand how to change the headers before, during, and after reading them in.

Exercise
========================================================

- Generate a matrix of random numbers in a table that is 10 columns and 100 rows.
- Name the columns after the first ten letters of the alphabet (letters[1:10]).
- Save it out to a .csv data file, and then read it in again.

Exercise Solution
========================================================

```{r}
## 100 x 10 matrix of uniform random numbers
dat <- matrix(runif(1000), 100, 10)
colnames(dat) <- letters[1:10]
## row.names = FALSE keeps the row numbers from coming back as an extra column
write.csv(dat, "random.csv", row.names = FALSE)
newdat <- read.csv("random.csv")
```

Inspecting data objects
========================================================

There are a number of ways to look at an object and see how it is stored:

```{r}
data(trees)
str(trees)
attributes(trees)
summary(trees)
```

Sorting
========================================================

Sorting is useful, and the built-in sort function will do this for a vector:

```{r}
sort(runif(10))
```

This won't work for data frames, where you may want to sort a frame by the values of one column. Use ```order``` for this (not to be confused with the similar ```rank```):

```{r}
ord <- order(trees$Height)
ord
```

Sorting pt 2
========================================================

The result gives the indexes of the values, ordered from least to greatest. Use it as a subscript to reorder a vector or data frame:

```{r}
trees$Height[ord]
head(trees[ord,])
```

Sorting Exercise
========================================================

Setting type="b" in plot draws points connected by lines. First, plot tree height by volume in the original order, connecting adjacent values with type="b". Then re-sort them by tree height and re-plot. Finally, re-sort them in a random order, and re-plot.

Sorting Exercise Solution
========================================================

```{r,fig.width=30,fig.height=15}
data(trees)
par(mfrow = c(1, 3))
## original order
plot(trees$Volume, trees$Height, type = "b")
## sorted by height
ord <- order(trees$Height)
plot(trees$Volume[ord], trees$Height[ord], type = "b")
## random order
ord <- sample(1:nrow(trees))
plot(trees$Volume[ord], trees$Height[ord], type = "b")
```

Aggregation: table
========================================================

- The table command counts how often each categorical (or numerical) value occurs
- With two arguments, it gives a cross-tabulation

```{r}
party <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party,gender,vote)
##look at values of each variable:
table(survey$party)
```

Aggregation: cross-tabulation
========================================================

```{r}
##look at pairs/contingency tables:
table(survey$gender,survey$vote)
```

Aggregation: aggregate and tapply
========================================================

These work by dividing data by levels of one or more categorical variables, applying a function to each group, and returning a data structure that recombines these values.

```{r}
set.seed(111)
x <- rnorm(500)                               ##generate random numbers
y <- x + runif(500,-.3,.3)                    ##related random numbers
dat3 <- data.frame(x=x,y=y)                   ##create a data frame
dat3$factor <- factor(round(dat3$x/10,1)*10)  ##make bins from x
dat3$group <- sample(c("A","B"),500,replace=TRUE)
```

```{r}
##aggregate y by bin:
dat3.agg <- aggregate(dat3$y,list(bin=dat3$factor),mean)
##aggregate x by the same bins:
dat3.agg$xvals <- aggregate(dat3$x,list(bin=dat3$factor),mean)$x
dat3.agg
```

Aggregation: tapply
========================================================

tapply works like aggregate, but produces a different output.

```{r}
##use tapply to aggregate y by the bins:
dat3.tab <- tapply(dat3$y,list(bin=dat3$factor),mean)
dat3.tab
##with two grouping variables, the result is a bin-by-group table:
tapply(dat3$y,list(bin=dat3$factor,group=dat3$group),mean)
```

Aggregation by column or row: apply
========================================================

apply applies a function to each row (1) or each column (2) of a matrix or data frame.

```{r}
set.seed(100)
m <- matrix(runif(28),7,4)
##find minimum in each column:
apply(m,2,min) ##By column
##find maximum in each row:
apply(m,1,max) ##By row
```

End-to-end Example walkthrough: Chick Weights
========================================================

- This data set follows many chicks (birds) as they grow
- Analogous to learning, sales growth, and the like across time
- The initial data requires reorganization before it makes sense

```{r}
help(ChickWeight)
head(ChickWeight)
```

Follow along in walkthrough
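
Chick Weights: a first aggregation sketch
========================================================

One possible first step in reorganizing ChickWeight is to apply the aggregation tools from this chapter. This is a minimal sketch and not the walkthrough itself: the names cw.agg and cw.tab are made up for illustration, and averaging weight by Diet and Time is an assumption about which summary is useful.

```{r}
data(ChickWeight)
## mean weight for each diet at each measurement day (long format)
cw.agg <- aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = mean)
head(cw.agg)
## the same means arranged as a time-by-diet table, using tapply
cw.tab <- tapply(ChickWeight$weight,
                 list(time = ChickWeight$Time, diet = ChickWeight$Diet),
                 mean)
round(cw.tab, 1)
```

The aggregate result keeps one row per Diet and Time combination, which is convenient for further data-frame operations; the tapply result is a matrix that is easier to read across diets at a glance.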