
IRT in Practice: 2PL vs GRM and When to Use Each

Fatih Ozkan | Dec 19, 2025

Item Response Theory can feel like you have to pick a model first and ask questions later. In real work it is the other way around: you start with the response format and the decision you want to make, then you pick the model that matches that reality.

This post is about two workhorse models:

  • 2PL for dichotomous items (right or wrong, yes or no)
  • GRM for ordered categories (Likert-style responses)

Quick intuition

2PL asks: “What is the probability of a correct response at ability θ?”

GRM asks: “What is the probability of responding in category k or higher at ability θ?”

Same latent trait idea, different data-generating story.
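
Here is the same contrast in a few lines of base R (a minimal sketch; every parameter value below is a made-up illustration):

theta <- 0.5                       # one hypothetical respondent's ability

# 2PL: probability of a correct response at theta
a <- 1.2; b <- -0.3
p_correct <- plogis(a * (theta - b))

# GRM: cumulative probabilities P(response >= k) for a 5-category item
a_grm <- 1.5
b_k   <- c(-1.5, -0.5, 0.4, 1.3)   # one threshold per category boundary
p_at_or_above <- plogis(a_grm * (theta - b_k))

Both are logistic curves in θ; the GRM just stacks m − 1 of them per item.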

2PL in plain language

The 2PL model has two key item parameters:

  • a (discrimination): how sharply the item separates low vs high θ
  • b (difficulty): where on θ the item is most informative, the “location”

If you have multiple-choice scored as correct or incorrect, or any binary outcome, 2PL is usually a natural start.

Use 2PL when

  • Your items are dichotomous
  • You want item-level interpretability (a, b)
  • You care about test information across θ, like where the test is most precise (see the sketch after this list)
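
To make the test-information point concrete: for a 2PL item, information at θ is a² · P(θ) · (1 − P(θ)), and test information is the sum over items. A minimal base-R sketch with made-up parameters:

info_2pl <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)
}

theta_grid <- seq(-3, 3, length.out = 121)
a_vec <- c(0.8, 1.4, 2.0)          # illustration values
b_vec <- c(-1.0, 0.0, 1.0)

# Test information: sum the item information curves across items
test_info <- rowSums(mapply(function(a, b) info_2pl(theta_grid, a, b),
                            a_vec, b_vec))
plot(theta_grid, test_info, type = "l", xlab = "theta", ylab = "information")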

2PL watchouts

  • If guessing is substantial (common in multiple-choice), a 3PL might be more honest
  • If items are locally dependent (testlets, shared stems), your fit can look better than it should

GRM in plain language

The Graded Response Model is designed for ordered categories, like 1 to 5 Likert responses. Instead of one b, you typically get multiple thresholds:

  • a (discrimination): still the slope
  • b1, b2, ..., b(m-1) (thresholds): where a respondent is equally likely to fall below versus at or above each category boundary, i.e., where P(X ≥ k) = 0.5

In practice: GRM models the “step-up” points between categories.
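
That step-up framing translates directly into code: category probabilities are differences of adjacent cumulative curves. A small base-R sketch with made-up values:

a     <- 1.5
theta <- 0
b_k   <- c(-1.5, -0.5, 0.4, 1.3)   # thresholds for a 5-category item

# Cumulative curve, bracketed by 1 (>= lowest) and 0 (>= beyond the top)
p_cum <- c(1, plogis(a * (theta - b_k)), 0)

# P(X = k) = P(X >= k) - P(X >= k + 1)
p_cat <- -diff(p_cum)
sum(p_cat)                         # sanity check: should be 1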

Use GRM when

  • Your categories are ordered and meaningfully so
  • You want to keep the information a binary collapse would throw away
  • Your items look like graded intensity rather than right or wrong

GRM watchouts

  • If categories are rarely used, thresholds can get unstable (a quick pre-fit check is sketched after this list)
  • If respondents do not treat categories as ordered, the model is solving the wrong problem
  • If you suspect different category use across groups, you are now in DIF territory
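
For the sparse-category watchout, a cheap pre-fit check (assuming dat_ordinal is your data frame of item responses, as in the R sketch further down):

# Category counts per item; near-empty cells warn of unstable thresholds
lapply(dat_ordinal, table)

# Same thing as proportions
lapply(dat_ordinal, function(x) prop.table(table(x)))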

Choosing between 2PL and GRM

This is the simplest rule that actually works:

  • Binary item → start with 2PL
  • Ordered polytomous item → start with GRM

More nuanced version: if you have ordinal items but only two categories are really used, collapsing to binary and using 2PL can be defensible, but you are trading information for stability.
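
If you do collapse, it is a one-liner, but the cut point is a judgment call you should report. A sketch using the same placeholder data frame and mirt call as the R section below (the cut at 4 is only an illustration):

# Score 1 for responses at or above a chosen cut (here 4), else 0, then fit 2PL
dat_collapsed <- as.data.frame(lapply(dat_ordinal, function(x) as.integer(x >= 4)))
fit_2pl_collapsed <- mirt(dat_collapsed, 1, itemtype = "2PL")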

What to report in practice

  • Item parameter summaries (a and b, or a and thresholds)
  • Test information function, or reliability-like summaries across θ
  • Model fit checks and diagnostic plots
  • DIF checks if group comparisons matter

Quick R sketch (mirt)

library(mirt)

# Example data shipped with mirt: LSAT7 (binary) and Science (ordinal)
dat_binary  <- expand.table(LSAT7)   # 5 dichotomous LSAT items
dat_ordinal <- Science               # 4 Likert-style items

# 2PL for dichotomous items
fit_2pl <- mirt(dat_binary, 1, itemtype = "2PL")

# GRM for ordered categorical items
fit_grm <- mirt(dat_ordinal, 1, itemtype = "graded")

# Item parameters
coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)
coef(fit_grm, IRTpars = TRUE, simplify = TRUE)

# Information
plot(fit_2pl, type = "info")
plot(fit_grm, type = "info")
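
For the fit checks and diagnostics listed under "What to report", mirt ships the usual helpers; a minimal sketch:

# Overall model fit (M2 statistic with RMSEA-style indices)
M2(fit_2pl)
M2(fit_grm)

# Item-level fit (S-X2 by default)
itemfit(fit_2pl)

# Category trace lines for one item
itemplot(fit_grm, item = 1, type = "trace")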

Closing

2PL and GRM are not rivals. They are tools for different kinds of items. If you pick the model that matches how people actually respond, most of the “IRT is scary” part disappears, and you can focus on what matters: precision, fairness, and the decisions you are going to make from the scores.


- Fatih