IRT in Practice: 2PL vs GRM and When to Use Each
Item Response Theory can feel like you have to pick a model first and ask questions later. In real work it is the other way around: you start with the response format and the decision you want to make, then you pick the model that matches that reality.
This post is about two workhorse models:
- 2PL for dichotomous items (right or wrong, yes or no)
- GRM for ordered categories (Likert-style responses)
Quick intuition
2PL asks: “What is the probability of a correct response at ability θ?”
GRM asks: “What is the probability of responding in category k or higher at ability θ?”
Same latent trait idea, different data-generating story.
2PL in plain language
The 2PL model has two key item parameters:
- a (discrimination): how sharply the item separates low vs high θ
- b (difficulty): where on θ the item is most informative, the “location”
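Putting the two parameters together, the standard 2PL curve (logistic form, no guessing term) is

$$P_j(\theta) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}$$

so a_j controls how steep the curve is at its center and b_j is the θ where the probability of a correct response crosses 0.5.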
If you have multiple-choice scored as correct or incorrect, or any binary outcome, 2PL is usually a natural start.
Use 2PL when
- Your items are dichotomous
- You want item-level interpretability (a, b)
- You care about test information across θ, like where the test is most precise (see the formula just below)
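For reference, item information under the 2PL above has a closed form:

$$I_j(\theta) = a_j^2 \, P_j(\theta)\,\bigl(1 - P_j(\theta)\bigr)$$

It peaks at θ = b_j, and test information is just the sum of item informations, which is what the information plots at the end of this post show.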
2PL watchouts
- If guessing is substantial (common in multiple-choice), a 3PL might be more honest
- If items are locally dependent (testlets, shared stems), your fit can look better than it should
GRM in plain language
The Graded Response Model is designed for ordered categories, like 1-to-5 Likert responses. Instead of a single b, an item with m categories gets m - 1 thresholds:
- a (discrimination): still the slope
- b1, b2, ..., b(m-1) (thresholds): for each category boundary, the θ at which responding at or above that boundary is a 50/50 proposition
In practice: GRM models the “step-up” points between categories.
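In Samejima's formulation, each boundary gets its own 2PL-style curve, and category probabilities come from differencing adjacent boundaries:

$$P(X_j \ge k \mid \theta) = \frac{1}{1 + e^{-a_j(\theta - b_{jk})}}, \qquad P(X_j = k \mid \theta) = P(X_j \ge k \mid \theta) - P(X_j \ge k+1 \mid \theta)$$

with the conventions that "at or above the lowest category" has probability 1 and "above the highest category" has probability 0.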
Use GRM when
- Your categories are ordered and meaningfully so
- You want to keep the information a binary collapse would throw away
- Your items look like graded intensity rather than right or wrong
GRM watchouts
- If some categories are rarely used, the corresponding thresholds are estimated from few responses and can be unstable
- If respondents do not treat categories as ordered, the model is solving the wrong problem
- If you suspect different category use across groups, you are now in DIF territory
Choosing between 2PL and GRM
This is the simplest rule that actually works:
- Binary item → start with 2PL
- Ordered polytomous item → start with GRM
More nuanced version: if you have ordinal items but only two categories are really used, collapsing to binary and using 2PL can be defensible, but you are trading information for stability.
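For concreteness, a minimal sketch of that collapse in R. The cutoff at 4 (roughly "agree or stronger" on a 1-to-5 scale) is a hypothetical choice, and dat_ordinal is the same placeholder data frame used in the sketch below:

# Hypothetical cutoff: treat responses of 4-5 as 1, everything below as 0
dat_collapsed <- as.data.frame(lapply(dat_ordinal, function(x) as.integer(x >= 4)))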
What to report in practice
- Item parameter summaries (a and b, or a and thresholds)
- Test information function, or reliability-like summaries across θ
- Model fit checks and diagnostic plots
- DIF checks if group comparisons matter
Quick R sketch (mirt)
library(mirt)
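# dat_binary: placeholder 0/1 response matrix (rows = respondents, cols = items)
# dat_ordinal: placeholder matrix of ordered integer codes (e.g., 1-5 Likert)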
# 2PL for dichotomous items
fit_2pl <- mirt(dat_binary, 1, itemtype = "2PL")
# GRM for ordered categorical items
fit_grm <- mirt(dat_ordinal, 1, itemtype = "graded")
# Item parameters
coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)
coef(fit_grm, IRTpars = TRUE, simplify = TRUE)
# Information
plot(fit_2pl, type = "info")
plot(fit_grm, type = "info")
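The reporting checklist above also calls for fit checks; mirt covers those too. A minimal sketch, continuing from the fits above:

# Global model fit: M2 statistic with RMSEA/SRMSR-style indices
M2(fit_2pl)
# Item-level fit (S-X2 statistics by default)
itemfit(fit_2pl)
# EAP trait estimates for scoring respondents
theta_hat <- fscores(fit_2pl)
# For DIF, see multipleGroup() and DIF() in mirt (requires a grouping variable)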
Closing
2PL and GRM are not rivals. They are tools for different kinds of items. If you pick the model that matches how people actually respond, most of the “IRT is scary” part disappears, and you can focus on what matters: precision, fairness, and the decisions you are going to make from the scores.
- Fatih