Shane Mueller
Sept 11 2018
Example:
data <- read.table("c5data.txt")
head(data)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 15 1 0 6668 0 652 300 653 300 1 0 0 1.00000 1.00000
2 15 1 1 6701 33 652 306 653 300 1 33 33 6.08276 7.08276
3 15 1 2 6732 64 652 313 653 300 1 31 64 13.03840 20.12120
4 15 1 3 6771 103 652 321 653 301 1 39 103 20.02500 40.14620
5 15 1 4 6797 129 651 327 653 301 0 26 103 26.07680 66.22300
6 15 1 5 6820 152 650 332 653 301 0 23 103 31.14480 97.36780
Read in the data file c5data.txt from the menu after setting the working directory. Then copy the generated command into an .R file and run it directly from there.
read.csv()
write.csv()
write.table()
- Understand how to change the column headers before, during, and after reading the data in.
dat <- matrix(runif(1000), 100, 10)
colnames(dat) <- letters[1:10]
write.csv(dat, "random.csv")
newdat <- read.csv("random.csv")
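The three points at which headers can be changed can be sketched as follows. This is a minimal illustration rather than the course's own solution: the temporary file, the object names d1, d2, d3, and the replacement column names are all assumptions made for the example. Note that write.csv writes row names by default, so the newdat read back above gains an extra first column (named X); row.names = FALSE avoids that.

```r
# Before writing/reading: set the column names on the object itself.
dat <- matrix(runif(30), 10, 3)
colnames(dat) <- c("a", "b", "c")
path <- tempfile(fileext = ".csv")        # temporary file, used only for this sketch
write.csv(dat, path, row.names = FALSE)   # do not write the row names as a column

# Read as-is: the header is whatever was written to the file.
d1 <- read.csv(path)
names(d1)

# During reading: skip the header line and supply names with col.names.
d2 <- read.csv(path, header = FALSE, skip = 1,
               col.names = c("first", "second", "third"))
names(d2)

# After reading: assign new names to the data frame.
d3 <- read.csv(path)
colnames(d3) <- toupper(colnames(d3))
names(d3)
```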
There are a number of ways to look at an object and see how it is stored:
data(trees)
str(trees)
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
attributes(trees)
$names
[1] "Girth" "Height" "Volume"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31
$class
[1] "data.frame"
summary(trees)
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
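A few other base R functions give quicker, narrower views of the same object. The selection below is only a suggestion, not an exhaustive list:

```r
data(trees)
class(trees)          # the object's class
dim(trees)            # number of rows and columns
names(trees)          # column names
head(trees, 3)        # first three rows
sapply(trees, class)  # storage class of each column
```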
Sorting is useful, and the built-in sort function will do this for a vector:
sort(runif(10))
[1] 0.02291064 0.15750374 0.22634794 0.24906685 0.37653700 0.61181822
[7] 0.74620890 0.85396900 0.88831001 0.93120421
This won't work for data frames, where you may want to sort a frame by the values of one column. Use order for this (not to be confused with the similar rank):
ord <- order(trees$Height)
ord
[1] 3 20 2 7 14 1 19 4 24 16 23 8 10 15 12 13 25 21 11 9 22 28 29
[24] 30 5 26 27 6 17 18 31
This gives the indexes of the values in order from least to greatest. If you use it to subset a vector or data frame, the result is reordered accordingly:
trees$Height[ord]
[1] 63 64 65 66 69 70 71 72 72 74 74 75 75 75 76 76 77 78 79 80 80 80 80
[24] 80 81 81 82 83 85 86 87
head(trees[ord,])
Girth Height Volume
3 8.8 63 10.2
20 13.8 64 24.9
2 8.6 65 10.3
7 11.0 66 15.6
14 11.7 69 21.3
1 8.3 70 10.3
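The difference between order and rank, and a sort on more than one key, can be illustrated with a short sketch. The toy vector x and the choice of Girth as a tie-breaker are assumptions made for the example:

```r
x <- c(30, 10, 20)
order(x)      # 2 3 1: positions of the smallest, middle, and largest values
rank(x)       # 3 1 2: the rank of each element, in its original position
x[order(x)]   # 10 20 30

data(trees)
# Sort by Height from tallest to shortest, breaking ties by ascending Girth.
ord2 <- order(-trees$Height, trees$Girth)
head(trees[ord2, ])
```

Negating a numeric column reverses its sort direction; order also accepts decreasing = TRUE when every key should be sorted in descending order.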
The type argument of plot lets you plot points connected by lines, using type="b". First, plot tree height against volume in the original order, connecting adjacent values. Then re-sort the data by tree height and re-plot. Finally, put the rows in a random order and re-plot.
data(trees)
par(mfrow = c(1, 3))
plot(trees$Volume, trees$Height, type = "b")
ord <- order(trees$Height)
plot(trees$Volume[ord], trees$Height[ord], type = "b")
ord <- sample(1:nrow(trees))
plot(trees$Volume[ord], trees$Height[ord], type = "b")
party <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party,gender,vote)
##look at values of each variable:
table(survey$party)
D R
5 6
table(survey$gender,survey$vote)
A B
F 3 2
M 4 2
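Counts are often easier to compare as proportions. A minimal sketch using prop.table and addmargins on the same survey data (the object name tab is an assumption for the example):

```r
party  <- c("R","R","D","R","R","D","D","D","R","R","D")
gender <- c("M","M","F","F","F","F","M","M","F","M","M")
vote   <- c("A","B","A","A","A","B","A","A","B","B","A")
survey <- data.frame(party, gender, vote)

tab <- table(survey$gender, survey$vote)
prop.table(tab, 1)   # proportions within each row (gender)
prop.table(tab, 2)   # proportions within each column (vote)
addmargins(tab)      # counts with row and column totals added
```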
Aggregation functions such as aggregate and tapply work by dividing the data by the levels of one or more categorical variables, applying a function to each group, and returning a data structure that recombines these values.
set.seed(111); x <- rnorm(500) ##generate random numbers
y <- x + runif(500,-.3,.3) ##related random numbers
dat3 <- data.frame(x=x,y=y) ##create a data frame
dat3$factor <- factor(round(dat3$x/10,1)*10) ##make bins from x
dat3$group <- sample(c("A","B"),500,replace=TRUE)
dat3.agg <- aggregate(dat3$y,list(bin=dat3$factor),mean)
##aggregate x by the same bins:
dat3.agg$xvals <- aggregate(dat3$x,list(bin=dat3$factor),mean)$x
dat3.agg
bin x xvals
1 -3 -2.9694215 -3.00802536
2 -2 -1.8622799 -1.84992413
3 -1 -0.8908935 -0.89592586
4 0 0.0382442 0.04730851
5 1 0.9096850 0.91197362
6 2 1.7929266 1.82510517
7 3 2.6943167 2.69485129
tapply works like aggregate, but produces a different output.
##use tapply to aggregate y by the bins:
dat3.tab <- tapply(dat3$y,list(bin=dat3$factor),mean)
dat3.tab
bin
-3 -2 -1 0 1 2
-2.9694215 -1.8622799 -0.8908935 0.0382442 0.9096850 1.7929266
3
2.6943167
tapply(dat3$y,list(bin=dat3$factor,group=dat3$group),mean)
group
bin A B
-3 -3.1106901 -2.75751854
-2 -1.6859730 -1.92360406
-1 -0.8322358 -0.95914982
0 0.0284143 0.04672135
1 0.8842331 0.93338158
2 1.8288174 1.76002681
3 2.7074811 2.61533043
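aggregate also has a formula interface that can summarize several columns in one call. The sketch below is an alternative phrasing, not the method used above: the data frame d, the column name bin, and the simplified binning with round(x) are assumptions made for the example.

```r
set.seed(111)
x <- rnorm(500)
y <- x + runif(500, -.3, .3)
d <- data.frame(x = x, y = y)
d$bin <- factor(round(d$x))   # bin x to the nearest integer

# Mean of both x and y within each bin, returned as one data frame.
agg <- aggregate(cbind(x, y) ~ bin, data = d, FUN = mean)
agg

# The same y means from tapply, returned as a named array instead.
tab <- tapply(d$y, d$bin, mean)
tab
```

The data frame returned by aggregate is convenient for plotting or merging, while the array from tapply is convenient for indexing by level name, for example tab["0"].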
apply applies a function to each row (1) or each column (2) of a matrix or data frame.
set.seed(100)
m <- matrix(runif(28),7,4)
#Find minimum in each column
apply(m,2,min) ##By column
[1] 0.05638315 0.17026205 0.20461216 0.17142021
#Find maximum in each row:
apply(m,1,max) ##By row
[1] 0.7625511 0.6690217 0.7489722 0.6249965 0.8821655 0.7703016 0.8819536
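The function passed to apply can also be user-defined, or can return more than one value per row or column. A brief sketch using the same matrix (the anonymous function computing the row spread is an assumption for illustration):

```r
set.seed(100)
m <- matrix(runif(28), 7, 4)

apply(m, 2, mean)                          # column means
colMeans(m)                                # same values from a dedicated function
apply(m, 1, function(r) max(r) - min(r))   # spread (max minus min) of each row
apply(m, 2, range)                         # two values per column: a 2 x 4 matrix
```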
help(ChickWeight)
head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Follow along in the walkthrough.