5210 Chapter 2 Lecture Notes
========================================================
author: Shane Mueller
date: Sept 11 2018
autosize: false

Announcements & Agenda
========================================================
 - Problem set 2 is available (Due Sunday at 10 PM)
 - Be sure to keep up with pre-class discussion and reading.
   + One discussion today (Due by midnight)
   + One discussion Thursday (Due by 1:00 pm)
 -  1: Review solution to Problem Set 1
 -  2: Chapter 2
 -  3: Problem Set 2 Available

Chapter 2: Data Management
========================================================
 - Reading in data
 - Examining data structures
 - Sorting
 - Aggregation
   + table
   + aggregate and tapply
   + apply
 - Chick Weight example
 
Reading data into R
========================================================
 - read.table and read.csv are the most common functions for reading raw data files.
 - For other data types, try RStudio's menu system.
 - Look at the data after you read it in, to be sure the headers were read correctly.
 - Default variable names (V1, V2, ...) are difficult to work with.
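
Reading data into R: checking headers
========================================================
A minimal sketch of why the headers are worth checking. The file is a throwaway temporary file with made-up columns (subject, score), not the course data:

```{r}
# Write a small made-up table to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("subject score", "s1 10", "s2 12"), tmp)

# Without header = TRUE, the first line is treated as data
# and the columns get the default names V1, V2
raw <- read.table(tmp)
names(raw)

# With header = TRUE, the first line supplies the column names
named <- read.table(tmp, header = TRUE)
names(named)
```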

Reading data into R
========================================================

Example:

```{r}
data <- read.table("c5data.txt")
head(data)
```

Exercise
========================================================

Set the working directory, then read in the data file c5data.txt using the RStudio menu. Next, copy the command that RStudio generated into a .R file, and run it directly from there.

Other related functions
========================================================

```
read.csv()
write.csv()
write.table()

```
 - Understand how to change the headers before, during, and after reading the data in.
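
Changing headers: a sketch
========================================================
One possible way to rename columns during and after reading, using a made-up temporary file. The names first and second are placeholders chosen for illustration. Changing headers before reading simply means editing the first line of the file itself.

```{r}
# Hypothetical two-column file with headers a and b
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2", "3,4"), tmp)

# During reading: skip the file's header line and supply new names
d1 <- read.csv(tmp, header = FALSE, skip = 1,
               col.names = c("first", "second"))

# After reading: replace the names on the data frame
d2 <- read.csv(tmp)
names(d2) <- c("first", "second")

identical(names(d1), names(d2))
```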


Exercise
========================================================

 - Generate a matrix of random numbers with 100 rows and 10 columns.
 - Name the columns after the first ten letters of the alphabet (letters[1:10]). 
 - Save it out to a .csv data file, and then read it in again.

Exercise Solution
========================================================
 
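```{r}
dat <- matrix(runif(1000), 100, 10)   # 100 rows, 10 columns
colnames(dat) <- letters[1:10]
# row.names = FALSE keeps the row numbers from being saved as an extra column
write.csv(dat, "random.csv", row.names = FALSE)
newdat <- read.csv("random.csv")
```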
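
Exercise Solution: row names
========================================================
A small sketch of how the row.names argument of write.csv affects what comes back from read.csv. The matrix m and the temporary files are made up for illustration:

```{r}
m <- matrix(runif(6), 3, 2)
colnames(m) <- c("a", "b")
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(m, f1)                     # default: row names are written too
write.csv(m, f2, row.names = FALSE)  # data columns only

names(read.csv(f1))  # extra first column holding the row names
names(read.csv(f2))
```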

Inspecting data objects
========================================================
There are a number of ways to look at an object and see how it is stored:

```{r}
data(trees)
str(trees)
attributes(trees)
summary(trees)
```
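
Inspecting data objects: quick checks
========================================================
A few other base R functions that may help when inspecting an object, shown here as a sketch on the same trees data:

```{r}
data(trees)
class(trees)          # what kind of object it is
dim(trees)            # number of rows and columns
names(trees)          # column names
sapply(trees, class)  # storage type of each column
```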


Sorting
========================================================

Sorting is useful, and the built-in sort function will do this for a vector:

```{r}
sort(runif(10))
```
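This won't work for data frames, where you usually want to reorder all rows by the values of one column. Use ```order``` for this: it returns the positions that would put the values in sorted order (not to be confused with ```rank```, which returns the rank of each value).
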
```{r}
ord <- order(trees$Height)
ord
```
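
Sorting: order versus rank
========================================================
A tiny made-up vector makes the difference between order and rank easier to see. The vector x is illustrative only:

```{r}
x <- c(30, 10, 20)
order(x)     # positions of the smallest, next smallest, ... value
rank(x)      # rank of each value, in its original position
x[order(x)]  # indexing by order() gives the sorted vector
```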

Sorting pt 2
========================================================

The result gives the row indexes arranged so that the corresponding heights run from least to greatest. If you use it to subset a vector or data frame, the result is returned in that sorted order:

```{r}
trees$Height[ord]
head(trees[ord,])
```
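
Sorting pt 3
========================================================
A sketch of two common variations, assuming the same trees data: sorting from greatest to least, and breaking ties with a second column. The name ord2 is introduced here for illustration.

```{r}
data(trees)
# Tallest trees first
tallest <- trees[order(trees$Height, decreasing = TRUE), ]
head(tallest, 3)
# Sort by Height, breaking ties by Volume
ord2 <- order(trees$Height, trees$Volume)
head(trees[ord2, ], 3)
```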


Sorting Exercise
========================================================
  
The type argument of plot lets you draw points connected by lines (type="b"). First, plot tree height against volume in the original row order, connecting adjacent values with type="b". Then re-sort the rows by tree height and re-plot. Finally, put the rows in a random order and re-plot.


Sorting Exercise Solution
========================================================
```{r,fig.width=30,fig.height=15}
data(trees)
par(mfrow = c(1, 3))   # three plots side by side
# Original row order
plot(trees$Volume, trees$Height, type = "b")
# Sorted by height
ord <- order(trees$Height)
plot(trees$Volume[ord], trees$Height[ord], type = "b")
# Random order
ord <- sample(1:nrow(trees))
plot(trees$Volume[ord], trees$Height[ord], type = "b")
```


Aggregation: table
========================================================
 - The table command counts how often each value of a categorical (or numerical) variable occurs
 - With two arguments, it gives a cross-tabulation
 
```{r}
party  <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party, gender, vote)
## Look at the values of one variable:
table(survey$party)
```


Aggregation: cross-tabulation
========================================================

Look at pairs of variables (contingency tables):

```{r}
table(survey$gender,survey$vote)
```
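
Aggregation: margins and proportions
========================================================
A cross-tabulation is often easier to read with totals or proportions. This sketch rebuilds two of the survey vectors so that it runs on its own; addmargins and prop.table are base R functions:

```{r}
party  <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
tab <- table(party, gender)
addmargins(tab)              # add row and column totals
prop.table(tab, margin = 1)  # proportions within each row (party)
```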

Aggregation: aggregate and tapply
========================================================
These functions split the data by the levels of one or more categorical variables, apply a function to each group, and return a data structure that recombines the results.

```{r}
set.seed(111)
x <- rnorm(500)                  ## generate random numbers
y <- x + runif(500, -.3, .3)     ## related random numbers
dat3 <- data.frame(x = x, y = y) ## create a data frame
dat3$factor <- factor(round(dat3$x / 10, 1) * 10)  ## bin x to the nearest integer
dat3$group <- sample(c("A", "B"), 500, replace = TRUE)
```

```{r}
dat3.agg <-     aggregate(dat3$y,list(bin=dat3$factor),mean)
##aggregate x by the same bins:
dat3.agg$xvals <- aggregate(dat3$x,list(bin=dat3$factor),mean)$x
dat3.agg
```
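
Aggregation: aggregate with a formula
========================================================
aggregate also accepts a formula, which can be easier to read when grouping by more than one variable. A sketch using the built-in warpbreaks data (not part of the lecture data):

```{r}
data(warpbreaks)
# Mean number of breaks for each combination of wool and tension
wb.agg <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)
wb.agg
```

The result is a data frame with one row per combination of the grouping variables.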

Aggregation: tapply
========================================================
tapply works like aggregate, but returns a named array (a matrix when there are two grouping variables) instead of a data frame.

Use tapply to aggregate y by the bins:

```{r}
dat3.tab <- tapply(dat3$y,list(bin=dat3$factor),mean)
dat3.tab
tapply(dat3$y,list(bin=dat3$factor,group=dat3$group),mean)
```
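
Aggregation: working with tapply output
========================================================
A sketch of how the tapply result can be indexed by name or converted to a long data frame, again assuming the built-in warpbreaks data. The names one and two are illustrative:

```{r}
data(warpbreaks)
one <- tapply(warpbreaks$breaks, warpbreaks$tension, mean)
one["L"]   # pick out a group by name
two <- tapply(warpbreaks$breaks,
              list(wool = warpbreaks$wool, tension = warpbreaks$tension),
              mean)
dim(two)   # 2 wools by 3 tensions
# Long format, similar to what aggregate returns
long <- as.data.frame(as.table(two))
long
```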


Aggregation by column or row: apply
========================================================
apply applies a function to each row (margin 1) or each column (margin 2) of a matrix or data frame:

```{r}
set.seed(100)
m <- matrix(runif(28),7,4)
# Find the minimum in each column:
apply(m, 2, min)  ## by column
# Find the maximum in each row:
apply(m, 1, max)  ## by row
```
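
apply with your own function
========================================================
The function passed to apply can be one you write yourself. A sketch, assuming the same kind of random matrix; the range calculation is only an illustration:

```{r}
set.seed(100)
m <- matrix(runif(28), 7, 4)
# Range (max minus min) of each column, using an anonymous function
col.range <- apply(m, 2, function(col) max(col) - min(col))
col.range
# For means, colMeans() and rowMeans() give the same answers as apply()
all.equal(apply(m, 2, mean), colMeans(m))
all.equal(apply(m, 1, mean), rowMeans(m))
```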


End-to-end Example walkthrough: Chick Weights
========================================================
- This data set follows many chicks (birds) as they grow
- Analogous to learning, sales growth, and other changes across time
- The initial data require reorganization before they make sense

```{r}
help(ChickWeight)
head(ChickWeight)
```
Follow along in the walkthrough.
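
Chick Weights: one possible first step
========================================================
This is a sketch of one way to reorganize the data with the tools from this chapter, not necessarily the approach taken in the walkthrough. It computes the mean weight at each measurement time for each diet; the name cw is illustrative:

```{r}
data(ChickWeight)
# Mean weight at each measurement time, separately for each diet
cw <- tapply(ChickWeight$weight,
             list(Time = ChickWeight$Time, Diet = ChickWeight$Diet),
             mean)
round(cw, 1)
# Plot one growth curve per diet
matplot(as.numeric(rownames(cw)), cw, type = "b",
        xlab = "Time (days)", ylab = "Mean weight (g)")
```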