Return to course site

Overview of prepdat, gdata, and readxl

The gdata package

gdata has a lot of data management tools, related to sampling, summarizing, and the like. It has a number of useful functions for reading in .xls and .xlsx files. Note that the similar ‘’foreign’’ package includes ways of reading in additonal file formats, such as spss, sas, stata, and octave, and the readxl package which is built-in to rstudio.

If loading the library returns an error, it is likely that you need to install perl on your computer. Instructions (courtesy Raghavendran Shankar):

Step-by-step procedure:

  1. I have referred the cran r page : https://cran.r-project.org/web/packages/gdata/INSTALL

  2. In the webpage, I have used the link for installing perl ( http://www.activestate.com/activeperl/) and downloaded and installed perl for 64 bit windows.

  3. After installation, there was a folder called Perl in C drive. In that, I went to bin folder and there was an .exe file

  4. I copied the path and used it in read.xls shown below:

data <- read.xls(“bigfive-codingcomplete”,perl = “C:/Perl64/bin/Perl.exe”)

  1. The data gets imported in R.

Reading in excel files

The gdata library uses a third-party system in perl, and so you may be prompted to update that librery, but it should work out of the box to read in xls and xlsx files. It is kind of slow for large files such as this one (with more than 1000 observations an 50+ variables). It creates a .csv file along the way that gets deleted, but you can use a related function to convert automatically to csv, as that may be faster.

# A tibble: 6 x 53
  Subnum Gender Education    Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8
   <dbl> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1 F              2     3     4     4     2     3     4     5     4
2      2 M              2     4     4     3     2     3     3     3     4
3      3 F              3     2     2     5     1     3     4     4     4
4      4 M              3     3     3     4     4     2     4     5     2
5      5 M              1     2     4     4     4     4     5     3     2
6      6 F              2     5     2     4     1     3     3     4     4
# ... with 42 more variables: Q9 <dbl>, Q10 <dbl>, Q11 <dbl>, Q12 <dbl>,
#   Q13 <dbl>, Q14 <dbl>, Q15 <dbl>, Q16 <dbl>, Q17 <dbl>, Q18 <dbl>,
#   Q19 <dbl>, Q20 <dbl>, Q21 <dbl>, Q22 <dbl>, Q23 <dbl>, Q24 <dbl>,
#   Q25 <dbl>, Q26 <dbl>, Q27 <dbl>, Q28 <dbl>, Q29 <dbl>, Q30 <dbl>,
#   Q31 <dbl>, Q32 <dbl>, Q33 <dbl>, Q34 <dbl>, Q35 <dbl>, Q36 <dbl>,
#   Q37 <dbl>, Q38 <dbl>, Q39 <dbl>, Q40 <dbl>, Q41 <dbl>, Q42 <dbl>,
#   Q43 <dbl>, Q44 <dbl>, Extra <dbl>, Agreeable <dbl>, Consc <dbl>,
#   Neuro <dbl>, Openness <dbl>, extrabinary <chr>

There are sometimes multiple sheets in a spreadsheet. By default this will grab the first sheet. A specific sheet can be specified too. This reads the second sheet, which is just the composite (mean) scores on five personality dimensions.

# A tibble: 6 x 9
  Subnum Gender Education Extra Agreeable Consc Neuro Openness extrabinary
   <dbl> <chr>      <dbl> <dbl>     <dbl> <dbl> <dbl>    <dbl> <chr>
1      1 F              2  2.75      3.44  2.78  2.25      3   I
2      2 M              2  3.5       3.11  3.44  3.25      3.6 E
3      3 F              3  2.38      3.89  3.78  1.75      2.7 I
4      4 M              3  2.25      4.11  3     3         2.8 I
5      5 M              1  1.5       2.89  3.22  3.88      4   I
6      6 F              2  3.75      3.11  3     2.71      3.2 E          

The documentation warns that strings will be quoted, and you may need to play with the quote argument if you have quoted text in the spreadsheet, but this should work reasonably well for simple files.

The prepdat package

The prepdat package is a new package that has two interesting functions: one that merges multiple files together, and another that aggregates summary statistics, especially useful for response time.

The file_merge will merge multiple files (possibly in nested directories) with specific formats/matching strigns into a single data frame:

  subnum block trial code type correctresp localstim globalstim
1    105     1     1    0    0           E         E          O
2    105     1     2    0    0           E         E          O
3    105     1     3    0    0           H         H          O
4    105     1     4    0    0           E         E          O
  consistency correctLocal correctGlobal positionX positionY response
1           0            1            NA       960       540 <rshift>
2           0            1            NA       960       540 <rshift>
3           0            1            NA       960       540 <lshift>
4           0            1            NA       960       540 <rshift>
  correct  time1    rt
1       1  97295 15634
2       1 114447  9514
3       1 125478  1454
4       1 128449   669
 [ reached getOption("max.print") -- omitted 2 rows ]

There is also a ‘prep’ function that will summarize your data

  subnum block trial code type correctresp localstim globalstim
1    105     1     1    0    0           E         E          O
2    105     1     2    0    0           E         E          O
3    105     1     3    0    0           H         H          O
4    105     1     4    0    0           E         E          O
  consistency correctLocal correctGlobal positionX positionY response
1           0            1            NA       960       540 <rshift>
2           0            1            NA       960       540 <rshift>
3           0            1            NA       960       540 <lshift>
4           0            1            NA       960       540 <rshift>
  correct  time1    rt within
1       1  97295 15634      1
2       1 114447  9514      1
3       1 125478  1454      2
4       1 128449   669      1
 [ reached getOption("max.print") -- omitted 2 rows ]
  subnum block trial code type correctresp localstim globalstim
1    105     1     1    0    0           E         E          O
2    105     1     2    0    0           E         E          O
3    105     1     3    0    0           H         H          O
  consistency correctLocal correctGlobal positionX positionY response
1           0            1            NA       960       540 <rshift>
2           0            1            NA       960       540 <rshift>
3           0            1            NA       960       540 <lshift>
  correct  time1    rt within within_condition
1       1  97295 15634      1                1
2       1 114447  9514      1                1
3       1 125478  1454      2                2
 [ reached getOption("max.print") -- omitted 3 rows ]
  subnum block trial code type correctresp localstim globalstim
1    105     1     1    0    0           E         E          O
2    105     1     2    0    0           E         E          O
3    105     1     3    0    0           H         H          O
  consistency correctLocal correctGlobal positionX positionY response
1           0            1            NA       960       540 <rshift>
2           0            1            NA       960       540 <rshift>
3           0            1            NA       960       540 <lshift>
  correct  time1    rt within within_condition
1       1  97295 15634      1                1
2       1 114447  9514      1                1
3       1 125478  1454      2                2
 [ reached getOption("max.print") -- omitted 3 rows ]
    subnum    mdvc1    mdvc2     sdvc1    sdvc2 meddvc1 meddvc2   t1dvc1
105    105 782.6719 732.6778 1022.1598 322.9465   651.5   647.0 687.7302
106    106 533.6719 499.7694  196.9555 127.7340   504.5   471.0 509.7036
      t1dvc2 t1.5dvc1 t1.5dvc2   t2dvc1   t2dvc2 n1tr1 n1tr2 n1.5tr1
105 664.1006 695.4132 669.5457 695.4132 678.4244     5    42       3
106 480.7973 513.1885 476.6607 515.5587 482.4769    13    64       7
    n1.5tr2 n2tr1 n2tr2 ndvc1 ndvc2 p1tr1 p1tr2 p1.5tr1 p1.5tr2 p2tr1
105      21     3    16   320   360 0.016 0.117   0.009   0.058 0.009
106      24     5    14   320   360 0.041 0.178   0.022   0.067 0.016
    p2tr2   rminv1   rminv2 p0.05dvc1 p0.05dvc2 p0.25dvc1 p0.25dvc2
105 0.044 635.3231 647.6101    406.95    435.90    540.00    536.50
106 0.039 502.9250 476.6627    387.00    362.00    437.00    423.75
    p0.75dvc1 p0.75dvc2 p0.95dvc1 p0.95dvc2
105    794.25    815.25   1169.15   1267.10
106    589.25    550.00    700.05    725.25
 [ reached getOption("max.print") -- omitted 4 rows ]

Shane T. Mueller shanem@mtu.edu

2019-01-17