--- title: "Item Response Theory" author: "Shane T. Mueller shanem@mtu.edu" date: "`r Sys.Date()`" output: pdf_document: default html_document: df_print: paged word_document: reference_docx: ../template.docx rmdformats::readthedown: gallery: yes highlight: kate self_contained: no always_allow_html: yes --- ```{r knitr_init, echo=FALSE, cache=FALSE} library(knitr) library(rmdformats) ## Global options options(max.print="75") opts_chunk$set(echo=TRUE, cache=TRUE, prompt=FALSE, tidy=TRUE, comment=NA, message=FALSE, warning=FALSE) opts_knit$set(width=75) ``` # Advanced Psychometrics using Item Response Theory, the Rasch Model, and related concepts. We previously examined psychometrics using measures such as alpha, GLB, and related measures, to help us look whether questions are representative and might be worthwhile using. These measures are generally applicable to a number of measures, especially questionnaires and scales in which there is no correct answer. When we have a set of questions that can be thought of as a knowledge test, or at least a directional scale we are trying to use to discriminate, there are additional psychometric properties related to distinguishing people that we might be interested in. For example, a question might be easier or more difficult, and we might want to include both easy and difficult questions. Easy questions might help distinguish between the least knowledgeable people and everybody else. Similarly, a difficult question might distinguish between the most knowledgeable person and every one else. In addition, we might want to know whether an individual question is good at separating your tested population into two groups, regardless of difficulty. Looking at the discriminability of items and tests can be helpful by allowing us to design tests appropriate to their use. For example, the quantitative GRE discriminates those with low math skills from those with better math skills, but it might not be appropriate for predicting success in a mathematics graduate program, because everyone who gets admitted will be in the 90th percentile or above. Similarly, a higher-level calculus test might be good at this, but would be useful as a measure of math achievement for elementary school children. A systematic approach for doing this can simply use regression modeling to try to capture different aspects of test items and the people taking the test. If we think about modeling an entire test by considering the chance of getting any item correct, we can frame a set of scored test questions as a single DV (correctness) and a single IV (question ID). Then, the intercept or additive constant for each question essentially identifies how difficult or easy the question is. Similarly, we might also include student as a second predictor. This is similar to how we do repeated measures models so that all observations of an individual can be linked, but this essentially captures how some people are better on average than others. This model helps provide understanding of whether individual questions are predictive of the whole. As an example, we will begin by fitting a logistic regression to two parallel tests--an easy and a difficult one--given to a single group of people. # Fitting subject parameters in logistic regression In a regression model, it is common to include participant as a predictor variable to account for overall individual variability, and this is essentially how mixed-effects models work when you give subject a random intercept. 
Suppose that you have a test with 20 questions and individual variability across 50 test-takers. Also, let's suppose that each question has a different difficulty. For participant $j$ and question $i$, we can think about the log-odds of successfully answering the question as being related to both the difficulty of the question and the ability of the person. The simplest version of this would be to take a factor $\theta_j$ related to the ability of the person and add to it a value related to the easiness of each question; this is the same as subtracting $b_i$--something related to the difficulty of the question. So, a linear prediction in log-odds space would be $(\theta_j - b_i)$.

```{r}
logodds <- function(p) {log(p/(1-p))}   ## convert a probability to log-odds
logit <- function(lo) {1/(1+exp(-lo))}  ## inverse of logodds: convert log-odds back to a probability

set.seed(1009)
numsubs <- 50
numqs <- 20
skilllevel <- rnorm(numsubs)
questiondiff <- rnorm(numqs)
combined <- outer(skilllevel,questiondiff,function(x,y){x-y})
pcorrect <- logit(combined)
pcorrect.2 <- logit(combined+2) ## An easier test with the same subjects and problems.
```

Now, the matrix pcorrect indicates the probability of each person answering each question correctly. We can simulate a given experiment by comparing each probability value to a number chosen uniformly at random between 0 and 1:

```{r}
sim1 <- pcorrect > runif(numsubs*numqs)
sim2 <- pcorrect.2 > runif(numsubs*numqs)
```

Now, because this is all framed in terms of log-odds and logistic transforms, we should be able to take the data in sim1 and estimate the parameters used to create them using logistic regression. To do so, we need to put the matrix in long format:

```{r}
simdat <- data.frame(sub=factor(rep(1:numsubs,numqs)),
                     question = factor(rep(1:numqs,each=numsubs)),
                     corr = as.vector(sim1)+0)

simdat.2 <- data.frame(sub=factor(rep(1:numsubs,numqs)),
                       question = factor(rep(1:numqs,each=numsubs)),
                       corr = as.vector(sim2)+0)
```

Now, we just fit a logistic regression model. Because the data-generating model had no intercept, we can re-estimate the parameters using a no-intercept model (specify +0 in the predictors):

```{r}
model <- glm(corr~0+sub+question,family=binomial(),data=simdat)
summary(model)
anova(model,test="Chisq")

model2 <- glm(corr~0+sub+question,family=binomial(),data=simdat.2)
summary(model2)
```

We have a lack of identifiability here, because for any set of parameters, I can always add a constant to all subject parameters while subtracting it from all question parameters and obtain the same predictions. This can be seen in the model coefficients, which don't include a coefficient for question1: performance on question 1 is taken as a baseline, and all subject and question parameters are scaled to match it. If we instead use sum-to-zero contrasts for question, our question coefficients will sum to 0, meaning the average difficulty of the questions is 0, easy questions have a positive coefficient, and difficult questions have a negative coefficient. Because we are also fitting coefficients for each person, this does not mean the average accuracy of a question is 50%.
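To see the identifiability point concretely, here is a quick check (a sketch using the simulation values defined above): shifting a person's ability and a question's difficulty by the same constant leaves the predicted probability of a correct answer unchanged.

```{r}
## Sketch: shifting ability and difficulty by the same constant leaves the
## predicted probability of a correct answer unchanged.
shift <- 1.5
all.equal(logit(skilllevel[1] - questiondiff[1]),
          logit((skilllevel[1] + shift) - (questiondiff[1] + shift)))
```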
Note that the subject coefficients have a similar interpretation, with larger values indicating higher-ability test-takers.

```{r}
library(ggplot2)
library(tidyverse)

contrasts(simdat$question) <- contr.sum(levels(simdat$question))
contrasts(simdat.2$question) <- contr.sum(levels(simdat.2$question))

model <- glm(corr~0+sub+question,family=binomial(),data=simdat)
model2 <- glm(corr~0+sub+question,family=binomial(),data=simdat.2)

items <- data.frame(names=names(model$coefficients[51:69]),
                    model1=model$coefficients[51:69],
                    model2 = model2$coefficients[51:69])
knitr::kable(items)

items %>% ggplot(aes(x=model1,y=model2)) + geom_point() +
  geom_abline(intercept=0,slope=1) + theme_bw() + ggtitle("Item parameters")

## Person coefficients
person <- data.frame(names=names(model$coefficients[1:50]),
                     model1=model$coefficients[1:50],
                     model2 = model2$coefficients[1:50])
knitr::kable(person)

person %>% ggplot(aes(x=model1,y=model2)) + geom_point() +
  geom_abline(intercept=0,slope=1) + theme_bw() + ggtitle("Person parameters")
```

Notice that when framed this way, the item parameters come out about the same, but the person parameters differ between the two tests. The two regressions do not know about one another, so there is no reason they should constrain each person's ability to be the same across the two tests, but this does raise the question of how it should ideally be done. We might instead force the person parameters to have an average of 0 and let the item parameters absorb the overall difficulty of the test, which could make more sense.

So, how good is it? Let's compare our estimated parameters to the actual parameters used to generate the data:

```{r}
library(ggplot2)
qplot(x=model$coef[1:numsubs],y=skilllevel)+geom_point(size=4)+
  xlab("Estimated Model coefficient") + ylab("Person Ability") +
  geom_label(label=1:numsubs) + theme_bw()

cor(model$coef[1:numsubs],skilllevel)
```

This is not bad--we predict the skill level of each person fairly well based on 20 yes/no answers. How about assessing the question difficulty? In our sum-to-zero coding, the coefficient for question 20 is the negative sum of the other items' coefficients.

```{r}
itempars <- c((model$coef[-(1:numsubs)]), -sum(model$coef[-(1:numsubs)]))
itempars2 <- c((model2$coef[-(1:numsubs)]), -sum(model2$coef[-(1:numsubs)]))

qplot(x=itempars,y=questiondiff)+ geom_label(label=1:length(questiondiff))+
  xlab("Estimated Model coefficient") + ylab("Question difficulty") + theme_bw()

cor(itempars,questiondiff)
```

We could instead have scored each person and each question according to raw accuracy:

```{r}
library(gridExtra)
rowMeans(sim1)
colMeans(sim1)

grid.arrange(
  qplot(rowMeans(sim1),model$coef[1:numsubs],xlab="Person accuracy",ylab="Person ability parameter") + theme_bw(),
  qplot(colMeans(sim1),itempars,xlab="Question accuracy",ylab="Item difficulty parameter") + theme_bw(),
  ncol=2)

grid.arrange(
  qplot(rowMeans(sim2),model2$coef[1:numsubs],xlab="Person accuracy",ylab="Person ability parameter")+theme_bw(),
  qplot(colMeans(sim2),itempars2,xlab="Question accuracy",ylab="Item difficulty parameter")+theme_bw(),
  ncol=2)
```

Notice that there is a fairly close mapping between the question accuracy and the estimated item parameter.
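In fact, the mapping is roughly linear on the log-odds scale. Here is a quick check (a sketch using the helper functions defined earlier; it assumes no item was answered correctly by everyone or by no one, which would produce infinite log-odds):

```{r}
## Sketch: the estimated item parameter is approximately linear in the
## log-odds of item accuracy (undefined if accuracy is exactly 0 or 1).
cor(itempars, logodds(colMeans(sim1)))
qplot(logodds(colMeans(sim1)), itempars,
      xlab="Log-odds of item accuracy",
      ylab="Estimated item parameter") + theme_bw()
```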
What if we look at the two different tests and compare parameter estimates?

```{r}
abilities <- data.frame(set1=model$coef[1:50],set2=model2$coef[1:50])
cor(abilities)
ggplot(abilities,aes(x=set1,y=set2))+geom_point() + ggtitle("Person abilities") + theme_bw()

probdifficulty <- data.frame(set1=itempars, set2 = itempars2)
cor(probdifficulty)
ggplot(probdifficulty,aes(x=set1,y=set2)) + geom_point() + ggtitle("Question difficulty") + theme_bw()
```

We are able to extract person-related coefficients across the two tests that are reasonably well related. Furthermore, we get high correlations for the item parameters.

This analysis is essentially equivalent to what is known as the _Rasch_ model of Item Response Theory (IRT). The ltm package estimates these parameters directly from data in wide format. We might look especially at how the Rasch model chooses to anchor the difficulty parameters.

```{r}
library(ltm)
p1 <- sim1+0
p2 <- sim2+0
irt1 <- rasch(p1)
irt2 <- rasch(p2)
summary(irt1)
summary(irt2)

## descript() is an alternative to alpha in the psych package
descript(p1)
```

Compare to our logistic regression:

```{r}
plot(itempars,irt1$coef[,1],main="Comparison of model Item coefficients",
     xlab="Logistic coefficients",ylab="IRT coefficients")
abline(0,1)

plot(itempars2,irt2$coef[,1],main="Comparison of model Item coefficients",
     xlab="Logistic coefficients",ylab="IRT coefficients")
abline(0,1)
```

The item coefficients are not centered the same way, so their absolute values differ. These models also use person ability parameters, but those are not returned as model coefficients, so it is not easy to make a direct comparison to the logistic regression or to do a person-level analysis. Instead, the focus of IRT is on the items or questions.

# Visualizing the Rasch Model

If you plot() the model, it will display the inferred logistic curves for all the questions:

```{r}
plot(irt1)
plot(irt2)
```

Notice that each curve is identical but shifted. The slope is fit as a common value for all items, with a different constant offset (i.e., intercept) for each question.

## Boundary conditions of the Rasch Model

The data/questions in this example were all created as if they obeyed IRT, so the model worked fairly well. If there are violations of the model, the estimates become less precise, and the small number of respondents (50) for the questions we chose (20) would not be enough. Typically you would want more, and the more complicated the model, the more participants you need.

What happens if the questions don't obey the model--if we have 'bad' questions? One way to simulate this is to replace a few questions with random responses, so that high-ability people are no more likely to answer them correctly than low-ability people:

```{r}
set.seed(10010)
irt2 <- rasch(sim2+0)
sim3 <- sim2
sim3[,1:5] <- (runif(5*numsubs)<.5)+0
irt3 <- rasch(sim3)
summary(irt2)
summary(irt3)

plot((cbind(irt2$coef[,1],irt3$coef[,1])),main="Item coefficients with bad questions",
     xlab="test 2",ylab="test 3")
plot((cbind(irt2$coef[,1],irt3$coef[,1])),main="Item coefficients with bad questions (zoomed)",
     xlab="test 2",ylab="test 3",ylim=c(-5,5))
```

In this case, the 'bad' questions all ended up with negative difficulty coefficients. If we examine the questions using item.fit, it will test whether each question fits the basic model. When everything was generated from the model, none of the items were flagged as bad. Once we made 5 items (25%) bad, a bunch of items get flagged: all of items 1-5, but also 6, 7, and 9, and maybe 18. Strangely, a few bad questions can make other questions look bad as well.
```{r}
item.fit(irt2)
item.fit(irt3)
```

We can also examine fit at the level of individual respondents:

```{r}
person.fit(irt2)
person.fit(irt3)
```

# Extending and Constraining IRT

## Slope of the item characteristic function

In the Rasch model, all items are of the same family and have the same slope, or steepness. A very steep function means that there is a sharp cut-off between who can answer the item correctly and who cannot. This is often called 'discriminability'. A good test item typically has high discriminability, and a good test has a set of highly discriminable items that span a range of difficulty. Typically, high discriminability will correspond to good part-whole item correlations. In a sort of ideal situation, the easiest item will be answered correctly by everyone but the lowest-ability person, the hardest item will be answered correctly only by the highest-ability person, and a person's ability will directly control how many of the items they can answer. As a rule of thumb, higher discriminability values (greater than 1.0, or better yet greater than 2.0) are good.

By default, rasch() estimates a common slope. The standard logistic model has a slope of 1.0, so fixing the slope at 1.0 is sometimes considered a simpler model. You might do this if you have limited data--maybe a test from a class with relatively few students--because it will hopefully make estimation more stable. For example, the following are the results of a psychology test:

```{r}
dat <- read.csv("testscores.csv")

##descript(dat)  ## doesn't work as-is; turn off the chi-squared tests to compute Cronbach's alpha on the data
descript(dat,chi.squared=F)

## force the discrimination parameter to be 1
model1 <- rasch(dat,constraint=cbind(length(dat) + 1, 1))
model1
#summary(model1)

## allow the discrimination parameter to be estimated
model2 <- rasch(dat)
model2
#summary(model2)

par(mfrow=c(1,2))
plot(model1)
plot(model2)
```

Notice that several of the questions have difficulty parameters of -49.02. These are the problems that everybody got correct, which also likely led to the error messages returned by the models. It is difficult to estimate the difficulty of such questions, other than that they must be really easy.

We fit two different models, one of which estimates a discrimination parameter. Is it worthwhile using this extra parameter?

```{r}
anova(model1,model2)
```

The results show no reliable difference between the two, despite the fact that the estimated discriminability was about .5 instead of 1.

## Fitting individual discrimination parameters

In other cases, it is likely that different items have different discriminabilities, and you might want to use this to help create a better, simpler test. You might be able to choose 5 highly discriminable items to replace 50 poorly discriminating items, for example. The two-parameter IRT model estimates a difficulty and a discriminability for each item. It is fit with the ltm() function in the ltm package. The ltm() function has more bells and whistles that we won't deal with; for example, it lets you estimate a set of latent predictors--sort of a factor analysis. We will use a single factor, which ends up being the two-parameter model. The syntax is a bit different: you need to write a formula and tell it how many latent factors to estimate. We will specify a single factor by writing data~z1, but you can use two with data~z1 + z2. This model is sometimes referred to as the two-parameter logistic (TPL) model.
```{r}
model3 <- ltm(dat~z1)
model3
plot(model3)
#summary(model3)
```

Notice that items vary in their difficulty and discriminability, and that some have negative discrimination--it is sort of a mess. This is not unexpected, because we have so few participants in this test; there just isn't enough information to estimate anything reliably and stably. Before we go on, we can look at a few indicators of how well the model fits:

```{r}
item.fit(model3)
```

This gives a 'fit' statistic for each question. A few items, like Q18, have bad fit statistics. Looking at the output of the psych::alpha function, Q18 also has a very low item-whole correlation.

```{r}
psych::alpha(dat)
```

We can also look at the person parameters. These tell us how well each person is described by the model. The tests are like a chi-squared test: we assume that a person of a given ability should tend to get problems easier than their ability right, and problems more difficult than their ability wrong. If this assumption is violated for a person, they are somehow different from the rest of the class on which the model was fit. Possibly they were cheating, so they didn't get many questions right, but the ones they got right were the most difficult ones.

```{r}
person.fit(model3)
```

These are not bad--most people are reasonably well fit by the model.

The margins() function looks at whether there are particular pairings of items that co-occur more often than expected by chance:

```{r}
margins(model3)
```

For example, consider the first line. According to the model, we'd expect 0.11 people to get both item 13 and item 37 wrong, but the margins show that 1 person got them both wrong, which would be unlikely to happen by chance. We can check the table here:

```{r}
table(dat[,13],dat[,37])
```

These results might indicate that some questions are not independent and so may violate the model assumptions. For items 30 and 47, we'd expect only 2.06 people to get them both correct, but 5 did. In these cases, the two questions might be redundant. In other cases, this could capture things like 'leakage', where you learn the answer to one question from another, or exclusive-or choices, where there are two parallel questions with one attractive answer, so that people who are guessing are likely to get both right, both wrong, or exactly one right.

## Multiple latent traits

The simple ltm model is essentially logistic regression, but at its core it assumes your test measures a single ability dimension. What if your test measured multiple dimensions that differed systematically and independently across people? Usually you might do a PCA or factor analysis to examine this, but the ltm model will let you test up to two latent traits directly. This should also remind you a bit of how MANOVA works. As a brief example, here is how we'd fit multiple latent traits:

```{r}
model4a <- ltm(dat[,1:15]~z1)
plot(model4a)
model4a
#summary(model4a)
item.fit(model4a)

model4b <- ltm(dat[,1:15]~z1+z2)
model4b
anova(model4a,model4b)
#item.fit(model4b)

fs <- factor.scores(model4b)
barplot(t(fs$coef),beside=T)
plot(fs$coef[,2],fs$coef[,3])
```

For the z1-only model, we get a difficulty and a discriminability parameter for each item, which is something like a mean and a variance. For the z1 + z2 model, we get an intercept, which is akin to the difficulty parameter, and z1/z2 coefficients that represent loadings along two latent dimensions (akin to principal components). Here, we don't see too much similarity across models, but with a larger data set it is likely we would.
It is also likely that z1 or z2 will be similar to the discriminability parameter, but this is not guaranteed. Here, we see that z2 is somewhat correlated with the discriminability:

```{r}
plot(rank(coef(model4a)[,1]),rank(coef(model4b)[,1]))
plot(coef(model4a)[,2],coef(model4b)[,2])
plot(coef(model4a)[,2],coef(model4b)[,3])
```

## Guessing parameters: the three-parameter model

If you have questions that differ in the chance of getting the answer right by guessing, you might want to incorporate a guessing parameter. These models are just like the normal ltm model, but the item curves bottom out at a lower asymptote that you either estimate or specify. This might be useful if you have a true/false test, where accuracy should be at least 50%, especially if it is mixed with other question types such as short answer or multiple choice, where guessing accuracy would be lower. In that case, you could fix the parameters based on question type. Otherwise, you might want to estimate them directly--but you would need to be sure you had enough data to get good estimates. This model is called the three-parameter model (TPM): it adds a guessing value for the chance of getting an answer right by guessing.

```{r}
model9 <- tpm(dat[,1:15],type="latent.trait",max.guessing =.5)
model9
plot(model9)
```

Notice how different items bottom out at different levels. With a small class, there are a lot of items with negative discriminability. Let's look at how they work out, by comparing average test score to the answers on particular items:

```{r fig.width=6,fig.height=7}
par(mfrow=c(2,2))
boxplot(rowMeans(dat)~dat$q5,main="Correct on q5",names=c("Incorrect (3)","correct (18)"))
boxplot(rowMeans(dat)~dat$q4,main="Correct on q4",names=c("Incorrect (1)","correct (20)"))
boxplot(rowMeans(dat)~dat$q10,main="Correct on q10",names=c("Incorrect (3)","correct (18)"))
boxplot(rowMeans(dat)~dat$q11,main="Correct on q11",names=c("Incorrect (13)","correct (8)"))
```

We can see that for some of these, accuracy on the question is negatively correlated with accuracy on the test. For others, there are other oddities, like very small numbers of errors, that can make estimation difficult.

# Information curves

Each question can be transformed into an information curve, which shows how much information the item provides about respondents at each ability level. You can also plot the information characteristic of the entire test:

```{r}
plot(model4a)
plot(model4a,legend=T,type="IIC",items=1:5)
plot(model4a,type="IIC",legend=T,items=c(1:15)[c(-11,-6,-10)])
```

The height of the curve indicates the ability level at which each question is most informative. A highly discriminating question will have a sharp peak at a specific ability level, and will be good at separating those below that level from those above it.

```{r}
model5 <- ltm(dat[,c(1,3,5,9,14)]~z1)
model5
plot(model5,legend=T,type="IIC")
```

You can specify different items, or items=0 to show the information for the entire test. This tells you the range of abilities at which the test or items will be most informative. You can also specify a range to integrate over, to see which range the test is best at discriminating. This can be used to understand whether the test is good at discriminating low performers (maybe a test for remedial instruction) or high performers (a test for entrance into a competitive class or program).

```{r}
plot(model5,legend=T, type="IIC",items=0)
info <- information(model5,c(-4,4))
info
```
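For example, we might compare how much of this five-item test's information falls in the low-ability versus high-ability range. This is a sketch using the same information() call as above, simply splitting the range at 0:

```{r}
## Sketch: test information below versus above average ability.
info.low  <- information(model5, c(-4, 0))
info.high <- information(model5, c(0, 4))
info.low
info.high
```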
# Graded response model and partial credit model

The ltm library provides five model types: rasch, ltm, tpm, grm (graded response model, for polytomous responses), and gpcm (generalized partial credit, also polytomous). The last two are appropriate for rank-order and categorical responses and might be useful for evaluating personality scales. The basic assumption of IRT is that you have a binary outcome (correct or incorrect), but it could be interesting to do an IRT-like analysis for non-binary responses. If you have a set of Likert-scale responses, where they are all coded in the same direction and each independently gives support for some construct, you can use a graded response model. This might be useful for personality data, for example. Let's consider measures from the big five personality questionnaire we have examined in the past.

A related model in the ltm package is the generalized partial credit model (gpcm). This would allow you to place an ordinal scale on correctness and do an IRT analysis. Maybe in a short-answer question, you score full credit for one response and partial credit for another. We won't cover this model here, but it has some similarity to the GRM.

To examine the GRM, let's obtain just the introversion/extraversion items and reverse-code them so they are all in the same direction. For convenience, I'll also remove any respondents with NA values.

```{r}
big5 <- read.csv("bigfive.csv")

qtype <- c("E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "E","A","C","N","O",
           "O","A","C","O")

valence <- c(1,-1,1,1,1,
             -1,1,-1,-1,1,
             1,-1,1,1,1,
             1,1,-1,1,1,
             -1,1,-1,-1,1,
             1,-1,1,1,1,
             -1,1,1,-1,-1,
             1,-1,1,1,1,
             -1,1,-1,1)

## reverse code the negatively-keyed items
for(i in 2:ncol(big5))
{
  if(valence[i-1]==-1)
  {
    big5[,i] <- 6-big5[,i]
  }
}

ei <- big5[,c(T,qtype=="E")]
ei <- ei[!is.na(rowSums(ei)),]
```

Now, the graded response model in ltm (grm) will do an IRT-like analysis, treating these as *ordinal* values. You can use a constrained or unconstrained model--the constrained model fits an equal discriminability across all questions. Because we have a lot of data, this model takes a while to fit.

```{r}
g1 <- grm(ei[,-1], constrained = TRUE)
g1
summary(g1)
```

We can see that each question is modeled with its own IRT-like model. There are five response levels here, and four transitions between levels, which are modeled as difficulty-like parameters for each transition. Plotting each question gives us another look:

```{r,fig.width=7,fig.height=10}
par(mfrow=c(4,2))
plot(g1,items=1)
plot(g1,items=2)
plot(g1,items=3)
plot(g1,items=4)
plot(g1,items=5)
plot(g1,items=6)
plot(g1,items=7)
plot(g1,items=8)
```

The margins() function works here as well. We can see that there are a couple of pairs that violate two-way independence (q1-q21, q6-q21, etc.):

```{r}
margins(g1)
```

Let's fit this unconstrained:

```{r}
g2 <- grm(ei[,-1], constrained = FALSE)
g2
summary(g2)
```

```{r,fig.width=7,fig.height=10}
par(mfrow=c(4,2))
plot(g2,items=1)
plot(g2,items=2)
plot(g2,items=3)
plot(g2,items=4)
plot(g2,items=5)
plot(g2,items=6)
plot(g2,items=7)
plot(g2,items=8)
```

For this model, we might consider the midpoint transition (Extrmt2) to be the 'center' of the question. We can see that Q36 and Q26 are low, while Q21 and Q31 are high. We might also use this to infer that a 4 on Q36 is about equivalent to a 3 on Q21.
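Although the focus here is on the items, we can also recover a trait ('ability') estimate for each respondent from the graded response model using factor.scores(). This is a sketch; the z1 column of score.dat (one row per observed response pattern) is assumed here based on ltm's usual output format.

```{r}
## Sketch: trait estimates for each observed response pattern under the
## unconstrained graded response model (column names assumed from ltm output).
fs.ei <- factor.scores(g2)
head(fs.ei$score.dat)
hist(fs.ei$score.dat$z1, main="Estimated extraversion trait", xlab="z1")
```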
# Discussion and Conclusions about the IRT model

The basic IRT/Rasch model is ubiquitous in areas of testing and measurement. It is helpful for designing tests that are maximally predictive using the smallest number of questions. It does not completely replace more traditional psychometric approaches, but augments them so that you can design smaller, more effective tests. It is an important tool that many educators and researchers are not aware of, even though it could be helpful to them.

## Additional resources

* There are special-purpose standalone software packages that many people use for IRT-style models. SPSS recommends using an R package via its R connector.
* The ```mirt``` library estimates IRT models. It also has companion libraries, mirtCat and ggmirt, for using it in other contexts (ggmirt is not on CRAN and needs to be downloaded separately). A number of people have shown how to use ggplot with mirt on various blogs. mirt can also accept model syntax similar to other commonly-used IRT systems, specified as a text string, but blogs appear to show simple ways to apply it as well. A minimal sketch of fitting mirt models appears after this list.
* The ```irt``` library has objects for representing testbeds and items in a computerized adaptive testing framework, but I'm not sure if it can estimate parameters directly from data. It looks more useful for driving an adaptive test or simulating tests.
* Other packages noted by Chalmers, the developer of mirt, include eRm and MCMCpack; he notes that mirt was the only one to include confirmatory item factor analysis methods, at least as of 2012.
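As a quick illustration of the mirt package mentioned above, here is a minimal sketch (not evaluated as part of this document) fitting a Rasch and a two-parameter model to the same wide 0/1 test-score data frame `dat` used earlier; the itemtype labels follow mirt's conventions.

```{r, eval=FALSE}
## Sketch (not run): Rasch and 2PL fits in mirt on the wide 0/1 data frame `dat`.
library(mirt)
m.rasch <- mirt(dat, model = 1, itemtype = "Rasch")
m.2pl   <- mirt(dat, model = 1, itemtype = "2PL")
coef(m.rasch, simplify = TRUE)$items  ## item parameter table
plot(m.2pl, type = "trace")           ## item characteristic curves
```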