Item Response Theory and Rasch Measurement Theory

An introduction to modern test theory

2024-11-18

Overview

  • Measuring latent constructs
  • Psychometric criteria
  • Types of data
  • Brief comparison of classical test theory and IRT
  • Overview of IRT models
  • Examples using the Rasch model
    • assessing the psychometric criteria

About me

  • Lic. psychologist (Uppsala) & PhD behavior analysis (Oslo)
  • Scientist at RISE Research Institutes of Sweden
  • Background:
    • CEO in private care
    • 10 years consulting; OBM, leadership/groups/organizations, sustainability, etc.
    • Swedish pilot trial of the PAX Good Behavior Game (universal prevention in elementary school)
    • Prevention and measurement

Latent constructs

We want to measure something that is not directly observable, using one or more proxy indicators.

  • in psychology - tests and questionnaires
    • almost always ordered categorical data (ordinal)
  • tests of ability/IQ/etc, often consisting of subtests or multiple items
    • type of data: correct/incorrect response; response time
  • questionnaires to measure depression, wellbeing, anxiety, loneliness, etc
    • type of data: yes/no, Likert scales, visual analogue scales, etc

Latent variable

Based on observed indicators (response data collected from participants), the latent variable is assigned a value for each respondent/participant. The value of the latent variable is the measurement. The indicators themselves are not measurements, they are indicators of a latent variable.

# Confirmatory factor analysis: three latent variables, each measured by three indicators
library(lavaan)
library(lavaanPlot)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit <- cfa(HS.model, data = HolzingerSwineford1939)

lavaanPlot(model = fit)

Exercise

You will create a brief questionnaire to measure a unidimensional latent variable of your choice.

  • You will be given a few minutes to come up with three items that you think measure the latent variable.

Types of indicator data

  • Dichotomous data - two categories (yes/no; correct/incorrect)
  • Polytomous data - more than two ordered categories (Likert, etc)
  • Interval data - e.g. response time
  • Count data - e.g. frequency of behavior during a time period

We’ll primarily look at the first two types, as they are most commonly used in psychology and psychometrics.

Psychometric quality assessment

  • There is no substantial agreement on how to assess the psychometric quality of a measure(!).
  • If a research article claims that a measure is “valid and reliable”, it can mean anything. You need to look at the paper(s) referenced to find out.
  • There are many questionable practices that most agree are wrong (e.g. “sum score and alpha”, see next slide)
    • but if you ask two scientists how they determine which measure to use, you will probably get two different answers or one bad answer (“what everyone else uses”).
  • We propose five basic psychometric criteria in a preprint (Johansson et al. 2023) to help practitioners and researchers assess the quality of a measure.

“Sum score and alpha”

  • This is a huge problem in published research.
  • Many papers include “measures” that consist of questionnaires created ad hoc for a study, report Cronbach’s alpha, and use a sum score based on putting numbers on response categories.
  • Most psychological “measures” are actually used only once (Elson et al. 2023)

We need a reasonable psychometric analysis to justify the use of a sum score! And even then the “sum score” is a debatable metric in itself since it is ordinal data, but often gets treated like interval data in statistical models.

Psychometric criteria (1/5)

We will look at each of these in more detail during this lecture.

Criterion: Unidimensionality & local independence

Description: Items represent one latent variable, without strongly correlated item residuals (‘local independence’). Principal Component Analysis and Exploratory Factor Analysis of raw data are explorative methods.

Psychometric criteria (2/5)

Criterion: Ordered response categories

Description: A higher person location (sum score) on the latent variable should entail an increased probability of a higher response (category) for all items and vice versa. Sometimes referred to as ‘monotonicity’.

Psychometric criteria (3/5)

Criterion: Invariance

Description: Item and measure properties are consistent between relevant demographic groups (gender, age, ethnicity, time, etc). Test-retest correlation is not an invariance test unless it is used to assess item properties (not person scores).

Psychometric criteria (4/5)

Criterion: Targeting

Description: Item (threshold) locations compared to person locations should be well matched and not show ceiling or floor effects, or large gaps.

Psychometric criteria (5/5)

Criterion: Reliability

Description: Sufficient reliability for the expected properties of the target population and intended use of results. Reliability is contingent upon the other criteria being fulfilled and should not be reported for scales with inadequate properties.

Intro ramblings over

Let’s dive into Item Response Theory and Rasch Measurement Theory!

Ability testing

  • As a first and simple example, we’ll look at data from a test where a participant can either score correct (1) or incorrect (0). More items solved correctly = higher ability.
  • We will use the dichotomous Rasch model.
  • A key assumption of the Rasch model (shared with all IRT models) is that items will have a systematic ordering of difficulty that is similar across participants.
  • This is the structure of the dataset (items = columns):
q1 q2 q3 q4 q5 q6 q7 q8 q9
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0

Guttman pattern

The assumed basic structure of the data is that there is a systematic pattern across items and participants of an increased probability of correct responses as the latent ability increases. The Rasch model can be described as a probabilistic Guttman scale (Andrich 1985).

RIheatmap(rasch_data)

This figure shows items and persons sorted based on the number of correct responses (colored blue). You can see the gradual shift from lower left to upper right that shows the Guttman pattern.

Item difficulty/location

library(eRm)  # Rasch models estimated with conditional maximum likelihood
rasch <- RM(rasch_data)  # fit the dichotomous Rasch model
plotICC(rasch, item.subset = 4, ask = FALSE, xlim = c(-5,6))

This is a key figure in understanding IRT and the concept of item difficulty/location. The x axis is the latent ability (aka latent variable/dimension/continuum), and the y axis is the probability of a correct response.

Item difficulty/location

The point on the x-axis where the curve reaches y = 0.5 is the item difficulty (aka item location or item threshold). This is the threshold where the probability of a correct response equals the probability of an incorrect response. A correct response on this item indicates an ability above the item difficulty, and an incorrect response indicates an ability below it.

A note on terminology

  • By tradition in Rasch/IRT terminology, the term “difficulty” is used to describe the item location on the latent dimension/variable/continuum and “ability” is used to describe the person location on the latent dimension/variable/continuum.

  • Ability and difficulty are intuitive when describing ability tests, but may be confusing when looking at other types of latent constructs, such as depression or well-being.

  • A more generic term is the “location” of items and persons on the latent variable. We will move towards using that more consistently in this lecture, but there will be some variation in terminology (sorry!).

    • Hopefully, this inconsistency on my part will make it easier to read other materials that use different terms for the same concepts.

Item difficulty pt 2

plotjointICC(rasch, xlim = c(-5,5), main = "ICC for 9 dichotomous items")
abline(h = 0.5, lty = "dashed")

This figure illustrates how the items are ordered (“item hierarchy”) according to the locations on the latent dimension/variable where they provide the most information - the item threshold - which is the location on the x axis where probability = 0.5 (on the y axis) for each dichotomous item.

Item difficulty sorted

Here you can see the item threshold locations indicated by points on the x axis.

The items are sorted based on the item threshold locations.

The item threshold locations are the locations on the latent variable where the probability of a correct response is 0.5.

plotPImap(rasch, sorted = TRUE)

A note on scaling

  • IRT/Rasch uses the logit scale, which is an interval scale, for both items and persons. This means that a difference of a given size represents the same amount of the latent variable regardless of where on the scale it occurs (see the quick illustration after this list). Great for statistical analysis!

  • However, the values on the logit scale have no inherent meaning or external reference point. This is why we need to look at the item difficulty and person ability in relation to each other. Do not conflate the zero point on the logit scale with something like 0 on a Z-score scale.

  • You also cannot interpret a person location as good or bad, or high or low, without looking at it in the context of other person locations. A person with a location of 0 is not necessarily “average”, “normal”, or “healthy” (or anything else). It is just a location on the latent variable.
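A quick illustration in R (with made-up locations): in the Rasch model the response probability depends only on the difference between person and item locations, so a distance of one logit means the same thing anywhere on the scale.

plogis(1.5 - 0.5)      # person at +1.5, item at +0.5 -> probability ~0.73
plogis(-2.0 - (-3.0))  # person at -2.0, item at -3.0 -> probability ~0.73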

Targeting

This is another key figure. The points in the bottom part show the locations for each item.

The top histogram shows the distribution on the latent variable for the respondents (aka person locations/abilities).

The middle section aggregates the item thresholds to help visualize how the item locations correspond to the person locations.

Targeting pt 2

Since items and persons are on the same scale, we can infer a person’s item responses from their latent variable location/score. Let’s say we have a person with Location = 0 as an example.

Which items would this person be most likely to score correctly?
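As a minimal sketch of that reasoning in R, with hypothetical item locations (in a real analysis these come from the fitted model): items located below the person's theta have a probability above 0.5 of a correct response.

item_locations <- c(q1 = -2.1, q2 = -1.4, q3 = -0.6, q4 = 0.2, q5 = 1.1)  # hypothetical values
theta <- 0                                # the person location in question
round(plogis(theta - item_locations), 2)  # P(correct) per item: 0.89 0.80 0.65 0.45 0.25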

Summing up so far

We have looked at the Rasch model for dichotomous data, also sometimes (not quite correctly) referred to as the IRT 1PL model. The key parameter is the item location/difficulty. 1PL stands for “one parameter logistic” model, and the parameter is the item location.

I generally use the term “location” for both items and persons as it is more generic. But for didactic purposes, when speaking of ability tests, it is probably easier to think in terms of the specific term “difficulty” for items and how it relates to the person's latent “ability”.

In IRT terminology, “person location” is frequently referred to as “theta”, often using this symbol: \(\theta\)
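To connect the symbols to the earlier ICC figures: the dichotomous Rasch model in its standard form, with person location \(\theta\) and item location \(b_i\), is

\[ P(X_i = 1 \mid \theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}} \]

When \(\theta = b_i\) the probability is exactly 0.5, which is how the item location was defined earlier.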

Exercise pt 2

Let’s return to the items you created before.

  • How would you describe the location of the items in terms of hierarchy amongst them?
    • Did you think about the items in terms of location/difficulty when you created them?
  • For each item, which response do you think would be the most common, based on the average person as a respondent?

Other IRT models

The 2PL and 3PL models are also commonly used. The 2PL model adds a second parameter, item discrimination, which describes how well the item discriminates between persons with high and low ability.

library(mirt)
library(magrittr)  # provides the %>% pipe
mirt(rasch_data, 1, itemtype = "Rasch", verbose = FALSE) %>% plot(type = "trace", facet_items = FALSE)
mirt(rasch_data, 1, itemtype = "2PL", verbose = FALSE) %>% plot(type = "trace", facet_items = FALSE)

Rasch model

2PL model

3PL model

The 3PL model adds a third parameter, which makes the figure look like this.

mirt(rasch_data, 1, itemtype = "3PL", verbose = FALSE) %>% plot(type="trace", facet_items = FALSE)

Can you guess what the third parameter is?

  • The guessing parameter! It describes how likely it is that a person will provide a correct answer to an item located above their ability. Useful in multiple choice questions, for instance.
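For reference, the standard forms of the three models side by side, with item discrimination \(a_i\) and the (pseudo-)guessing parameter \(c_i\) added step by step:

\[ P_{\text{1PL}}(\theta) = \frac{1}{1 + e^{-(\theta - b_i)}} \qquad P_{\text{2PL}}(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} \qquad P_{\text{3PL}}(\theta) = c_i + (1 - c_i)\frac{1}{1 + e^{-a_i(\theta - b_i)}} \]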

CTT and IRT - similarities

  • Classical Test Theory (CTT) includes a range of methods, most often some form of factor analysis
    • There is a slide at the end of this presentation about differences between CTT and IRT
  • Item threshold locations are similar to item intercepts in CTT
  • Item discrimination is similar to factor loadings in CTT

Polytomous models

Now let’s move on to questionnaire type of data with ordered response categories.

We’ll focus on the Rasch model, in part since it is complicated enough for this short lecture. But also since it is the only model that allows the ordinal sum score to be used as a sufficient statistic for the latent variable.

ICC for polytomous models

This is the same type of figure as before, and it now shows probabilities for all response categories for one item. How do you interpret it?

ICC with labels

Item category thresholds

Guesstimating theta

Recall \(\theta\) (person location)?

  • Let’s say a person responds “Often” to this item: in which range is their theta most likely to be?

Guesstimating theta

This is an item from the Perceived Stress Scale (PSS).

Estimating theta

In context

Is this person’s theta a high or low value? How does it relate to the overall sample?

ICC → targeting figure

Adding another item

q8n: “found that you could not cope with all the things that you had to do?” - “Sometimes”

One more item…

q2n: “felt that you were unable to control important things in your life?” - “Seldom”

See how the error bar gets smaller and smaller as we add more items? That’s because we’re getting more and more information about the respondent’s theta with each item.

PSS-7 targeting

Here are all 7 negative items from the PSS (Rozental, Forsström, and Johansson 2023). Discuss in pairs how you interpret this figure.

RItargeting(pss7, xlim = c(-4,4))

PSS-7 item hierarchy

Why is this figure of interest?

RIitemHierarchy2(pss7) + theme_rise()

Reliability

Test information reflects item properties, not sample/person properties.

  • The same goes for test-retest as a reliability test: it should be used to assess stability in item properties over time, not sample properties.

Summing up polytomous models

We have been using the Partial Credit Model with Conditional Maximum Likelihood estimation, using the eRm package for R.

When analyzing polytomous data with Rasch/IRT models, the lowest response category is always coded as 0, since the score reflects the number of thresholds “passed” by the respondent.

Think of this in relation to the dichotomous model, where 0 and 1 are the only scores available and the sum score is simply a count of the number of items answered correctly. In the polytomous case, the sum score is a count of the number of thresholds passed per item, summed across items.
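A minimal sketch of this workflow with the eRm package, assuming pss_data is a data frame of item responses recoded so that the lowest category is 0:

library(eRm)

pcm_fit <- PCM(pss_data)               # Partial Credit Model, conditional ML estimation
thetas  <- person.parameter(pcm_fit)   # person locations (theta) per raw score
summary(pcm_fit)                       # item (threshold) parameters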

Psychometric criteria

We could put any set of items into a Rasch model and just look at ICC curves and targeting and estimate thetas. But that is not much better than “sum score and alpha”.

So let’s put the Rasch model to use in the context of the five psychometric criteria mentioned at the beginning.

If desired, you will find a more detailed description of analyses and metrics in the preprint on psychometric criteria (Johansson et al. 2023).

Unidimensionality

While multidimensional constructs are possible, they are outside the scope of this lecture. Most often, even measures intended to be unidimensional are not well constructed. The most common dimensionality problem (in my experience) is residual correlations. We’ll look at four ways to assess unidimensionality in a Rasch model (a sketch of the residual-correlation check follows the list):

  • Conditional item fit statistics
  • Residual correlation matrix
  • Plot of factor loadings on the first residual contrast
  • PCA of residuals
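As an example of the second item in the list above, here is a minimal sketch of a residual-correlation check (Yen's Q3) using the mirt package; this is just one possible way to run it, assuming pss_data holds the item responses.

library(mirt)

fit <- mirt(pss_data, model = 1, itemtype = "Rasch", verbose = FALSE)
q3  <- residuals(fit, type = "Q3")   # item-by-item residual correlations
q3                                   # look for pairs well above the average correlation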

PSS-14 example

We’ll use the same published dataset and paper analyzing the PSS-14 scale as before (Rozental, Forsström, and Johansson 2023) as an example for dimensionality analysis.

We use multiple methods to analyze dimensionality since there is no single method that allows one to assess unidimensionality properly. This is also true for factor analysis.

Does everyone know what residuals are?

PCA of residuals

The highest eigenvalue should be below 2.0 (closer to 1.5)

Eigenvalue
6.38
1.43
0.84
0.80
0.69

Loadings on 1st contrast

We clearly have two clusters of items. Can you spot the pattern?

Residual correlation matrix

It is the relative correlation between items that is important, not the absolute value (Christensen, Makransky, and Horton 2017). Similarly to item fit, we need bootstrapping/simulation to determine cutoff values to interpret the results.

q1n q2n q3n q4p q5p q6p q7p q8n q9p q10p q11n q12n q13p q14n
q1n
q2n 0.53
q3n 0.4 0.53
q4p -0.33 -0.43 -0.35
q5p -0.37 -0.52 -0.44 0.58
q6p -0.33 -0.54 -0.49 0.5 0.64
q7p -0.4 -0.52 -0.44 0.39 0.46 0.51
q8n 0.21 0.42 0.42 -0.28 -0.38 -0.39 -0.39
q9p -0.47 -0.46 -0.4 0.42 0.42 0.38 0.39 -0.32
q10p -0.38 -0.58 -0.52 0.39 0.48 0.5 0.53 -0.49 0.42
q11n 0.42 0.35 0.3 -0.29 -0.35 -0.29 -0.34 0.14 -0.47 -0.27
q12n 0.14 0.27 0.36 -0.12 -0.17 -0.22 -0.17 0.35 -0.19 -0.28 0.06
q13p -0.2 -0.45 -0.36 0.27 0.36 0.35 0.31 -0.46 0.24 0.49 -0.17 -0.3
q14n 0.31 0.54 0.51 -0.43 -0.52 -0.57 -0.5 0.47 -0.37 -0.6 0.25 0.25 -0.5
Note:
Relative cut-off value is 0.102, which is 0.13 above the average correlation (-0.028).
Correlations above the cut-off are highlighted in red text.

Another residual corr matrix

This better illustrates the issue of items being too similar, often referred to as ‘local dependence’. Items 1 & 3, and 6 & 8 show the highest levels.

itemnr item
1 I lead a purposeful and meaningful life.
2 My social relationships are supportive and rewarding.
3 I am engaged and interested in my daily activities.
4 I actively contribute to the happiness and well-being of others.
5 I am competent and capable in the activities that are important to me.
6 I am a good person and live a good life.
7 I am optimistic about my future.
8 People respect me.
flourish1 flourish2 flourish3 flourish4 flourish5 flourish6 flourish7 flourish8
flourish1
flourish2 -0.18
flourish3 0.15 -0.15
flourish4 -0.04 0.07 -0.19
flourish5 -0.09 -0.32 0.06 -0.21
flourish6 -0.24 -0.18 -0.26 -0.2 -0.14
flourish7 -0.22 -0.2 -0.15 -0.24 -0.13 0.05
flourish8 -0.11 0.03 -0.25 0.11 -0.28 0.18 -0.11
Note:
Relative cut-off value is 0.03, which is 0.146 above the average correlation (-0.116).
Correlations above the cut-off are highlighted in red text.

Addressing dimensionality issues

We could easily spot the two dimensions in the data just by reviewing the patterns in the residuals. In a real analysis I would also have looked at item fit and other metrics before taking action, but to get a more manageable number of items for this example we will separate the PSS-14 into negative and positive items and focus on the negative ones when investigating item fit next.

Unidimensionality - Item fit

“Outfit” is the unweighted mean of squared standardized residuals, while “infit” is an information-weighted mean square that is less sensitive to outliers. Low item fit indicates a better-than-expected fit to the Rasch model, which can inflate reliability without adding much information. High item fit indicates misfit to the Rasch model. Values should be close to 1. The distribution of fit values is unknown, so we need bootstrapping/simulation to determine appropriate cutoff values for our particular set of items and persons.

Item InfitMSQ OutfitMSQ Location
q1n 1.021 1.007 0.88
q2n 0.729 0.731 0.48
q3n 0.767 0.755 -0.61
q8n 1.067 1.140 -0.28
q11n 1.312 1.307 0.66
q12n 1.188 1.254 -1.60
q14n 0.933 0.920 0.47
Note:
MSQ values based on conditional estimation (n = 797 complete cases).
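For reference, a minimal sketch of computing item fit in R with eRm, assuming pss_neg holds the seven negative PSS items; note that itemfit() returns the standard (not conditional) mean squares, so values will differ somewhat from the conditional estimates in the table above.

library(eRm)

pcm_fit <- PCM(pss_neg)                # Partial Credit Model
pp      <- person.parameter(pcm_fit)   # person parameters are needed first
itemfit(pp)                            # infit/outfit MSQ per item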

Item fit visualized

Here we see results of simulating datasets that fit the Rasch model perfectly, mimicking our data with 7 items (each with 5 response categories) and 793 respondents (and their thetas).

Loadings on 1st residual contrast

Ordered response categories

A higher person location on the latent variable should entail an increased probability of a higher response (category) for all items and vice versa. This is sometimes referred to as ‘monotonicity’.

We can check this by looking at the item characteristic curves (ICC). So far, we have only seen ICCs with ordered response categories. We will look at an example with disordered response categories.

An important note on types of data

  • Ordinal data - ordered response categories with unknown distance between categories.

This is by far the most common type of data in psychology and the social sciences. But it is extremely common to pretend that ordinal data is interval data, and to use it as such in everything from simple calculations such as mean/SD to more complex statistical models. This is a problem (Liddell and Kruschke 2018), and there are methods for analyzing ordinal data properly (e.g. Bürkner and Vuorre 2019).

Adding more response categories does not make the data anything other than ordinal, and neither does removing the category labels. Visual Analogue Scales have no strong case for being interval level either. All three approaches usually add to problems with disordered response categories and invariance issues.

Example: Flourishing Scale

We’ll use an open dataset (Didino et al. 2019) for the Flourishing Scale (FS) (Diener et al. 2010). It has 8 items:

itemnr item
flourish1 I lead a purposeful and meaningful life.
flourish2 My social relationships are supportive and rewarding.
flourish3 I am engaged and interested in my daily activities.
flourish4 I actively contribute to the happiness and well-being of others.
flourish5 I am competent and capable in the activities that are important to me.
flourish6 I am a good person and live a good life.
flourish7 I am optimistic about my future.
flourish8 People respect me.

FS response categories

All items share the same set of 7 response categories:

Response Ordinal
Strongly disagree 0
Disagree 1
Slightly disagree 2
Mixed or neither agree nor disagree 3
Slightly agree 4
Agree 5
Strongly agree 6

ICC for item 6

Disordered categories

What we look for in this figure are response categories that at no point on the x-axis have the highest probability. Visually, this means that the problematic category’s probability curve never rises above all the other categories’ curves. Which ones can you identify in this example?

Addressing disordered response categories

There are several ways to address disordered response categories in the analysis phase. The most common is to merge adjacent categories and rerun the ICC analysis to check the results. However, disordered thresholds can (at least partly) be a sign of other problems, such as misfitting items, multidimensionality, or local dependence.

Long term, it is important to investigate the cause and revise the questionnaire. Usually, disordered categories are related to having too many response categories or to bad or missing category labels (labeling only the endpoints is bad practice). The item wording also needs to work well with the response categories.
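A minimal sketch of merging two adjacent categories before re-estimating, assuming fs_data holds the Flourishing items with the 0-6 coding shown earlier (here category 3 is collapsed into 2 and the higher categories are shifted down):

library(dplyr)

fs_merged <- fs_data %>%
  mutate(across(starts_with("flourish"), ~ ifelse(.x >= 3, .x - 1, .x)))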

Invariance

  • In essence, invariance is about the stability of item locations across subgroups. It can be explained as an interaction effect between item location and a demographic variable.
  • Invariance is important to establish to ensure that we can compare measurements between groups of interest.
  • For instance, for two persons with the same theta who belong to different groups (e.g. gender), each item should have about the same threshold location.
  • (We can also look at other item properties, such as item fit, and how it varies across subgroups, but that is outside the scope of this lecture)

Invariance pt 2

  • We can use global tests of invariance, that assess the measurement model as a whole (all items together) and compare groups
    • Likelihood ratio test (LRT), see the sketch after this list
  • And, usually more interesting, tests that examine each item separately
    • Differential Item Functioning (DIF)
  • Several types of tests are available for both
    • some tests can only compare two groups, some can deal with interactions between two types of groups (education level+sex) and continuous variables (age in years)
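As announced above, a minimal sketch of the global test using Andersen's likelihood ratio test in eRm, assuming gender is a vector with one group label per respondent:

library(eRm)

pcm_fit <- PCM(pss_data)
LRtest(pcm_fit, splitcr = gender)   # item parameters estimated per group and compared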

Invariance pt 3

A preprint reviewing invariance tests in published research (D’Urso et al. 2022) concluded that:

(1) 4% of the 918 scales underwent MI testing across groups or time and that these tests were generally poorly reported, (2) none of the reported MI tests could be successfully reproduced, and (3) of 161 newly performed MI tests, a mere 46 (29%) reached sufficient MI (scalar invariance), and MI often failed completely (89; 55%). Thus, MI tests were rarely done and poorly reported in psychological studies, and the frequent violations of MI indicate that reported group differences cannot be solely attributed to group differences in the latent constructs.

DIF example (age)

Item    2        3        Mean location  StDev  MaxDiff
q4p     0.380    0.377    0.378          0.002  0.003
q5p    -0.216   -0.621   -0.418          0.287  0.406
q6p    -0.265   -0.691   -0.478          0.301  0.426
q7p     0.275    0.333    0.304          0.041  0.058
q9p    -0.561   -0.021   -0.291          0.382  0.540
q10p    0.042    0.127    0.085          0.060  0.085
q13p    0.346    0.498    0.422          0.108  0.152

DIF figure item locations

DIF figure item thresholds


DIF ICC

DIF comments

  • Important and complex topic
  • The practical effects of DIF found are affected both by the total number of items and the magnitude of the DIF
  • Large sample sizes will indicate statistically significant DIF, but DIF magnitude is the key metric
  • How to deal with problematic DIF is outside the scope of this lecture
  • For more examples and details on DIF, see: (Andrich and Hagquist 2012, 2015; Hagquist 2019; Hagquist and Andrich 2017)

Reliability

  • Reliability affects statistical power to detect changes/differences in the latent trait. Reliability is a function of the number of items/item thresholds and of the item locations: the more items and thresholds (ordered response categories) there are, and the more dispersed the item locations are (rather than overlapping), the higher the reliability.

  • It is important to understand the reliability of the measure itself, and then how it applies to the population of interest. Since reliability is not constant across the latent variable continuum, targeting becomes a factor in determining reliability in a practical use case.

  • In IRT/Rasch, the Person Separation Index (PSI) is sometimes used, as it is comparable to the ubiquitous Cronbach’s alpha in having a 0-1 range. However, the PSI represents the sample SEM, not the test (see ?RItif).

    • Presenting the sample reliability is relevant for subsequent statistical analyses, just remember that it is not necessarily representative of the reliability of the test itself.

Standard error of measurement

Recall this figure? The horizontal lines (SEM) around the diamond shapes (estimated theta values) become shorter for each item we add.

SEM figure 1

Ordinal/interval table

When using the Rasch model, we can simply look up the ordinal sum score in a table to get the interval score and its SEM value.

Ordinal sum score Logit score Logit std.error
0 -4.9848616 0.6850926
1 -3.7421709 0.8245563
2 -3.0889631 0.7629346
3 -2.6131928 0.6818828
4 -2.2265466 0.6178044
5 -1.8948868 0.5716178
6 -1.6011144 0.5378262
7 -1.3351856 0.5120369
8 -1.0905563 0.4916172
9 -0.8626667 0.4750658
10 -0.6481832 0.4615076
11 -0.4445566 0.4504199
12 -0.2497315 0.4414946
13 -0.0619436 0.4345680
14 0.1204240 0.4295839
15 0.2989442 0.4265761
16 0.4752254 0.4256631
17 0.6509860 0.4270540
18 0.8281393 0.4310692
19 1.0089101 0.4381836
20 1.1960061 0.4491107
21 1.3928846 0.4649547
22 1.6041961 0.4874640
23 1.8365772 0.5193590
24 2.1002306 0.5644366
25 2.4125375 0.6262811
26 2.8080216 0.7021208
27 3.3753567 0.7610840
28 4.5267620 0.6450487

SEM figure 2

TIF and SEM

The Test Information Function (Samejima 1969, 1994) is essentially the inverse of the squared standard error of measurement: \(SEM(\theta) = 1/\sqrt{TIF(\theta)}\), so more information means a smaller SEM.

SEM by item

Targeting & reliability

Wrapping up

  • We have covered a lot today!
  • Most of it you are not expected to remember (few know these things)
  • You will be expected to know basic terms and concepts
  • You now know that these methods and tools exist and where to find them

Classical vs modern test theory

  • CTT does not include analysis of item location/difficulty, which is a key aspect of IRT
  • CTT assumes that reliability is constant across the continuum (and equal for all respondents)
  • CTT has no analysis of monotonicity/ordering of response categories
  • IRT enables the analysis of items and persons on the same (interval level) scale
  • IRT produces model fit metrics for both items and persons
  • IRT enables estimation of latent variable scores on an interval scale
  • Only Rasch models can justify ordinal sum scores as a sufficient metric to represent measurement on the latent construct scale

Fallacies with either method

No psychometric/statistical method is “safe” from misuse. Some common mistakes to look for in papers:

  • CTT: residual correlations added to model to improve fit
  • CTT: default estimator ML used (or estimator not reported) when WLSMV is appropriate (which it usually is for ordinal data), and ML-based cutoff values used (e.g. Hu & Bentler, Kline, Brown) that are inappropriate for WLSMV
  • CTT and IRT: lack of invariance analysis
  • IRT: not reporting enough aspects/metrics (e.g. only reporting item fit or model fit)

Report everything!?

  • One can share a fully documented report file as an appendix document with the manuscript/paper/preprint
  • Very helpful for oneself as a traceable record of the analysis process
  • Collect code and comments in one place with R and Quarto
  • This is a good way to share the analysis and make it reproducible
  • Makes it easier for others to learn about (and critique) the analysis
  • Helpful during the review process
  • A paper that uses this approach (Rozental, Forsström, and Johansson 2023):

Rozental, A., Forsström, D., & Johansson, M. (2023). A psychometric evaluation of the Swedish translation of the Perceived Stress Scale: A Rasch analysis. BMC Psychiatry, 23(1), 690. https://doi.org/10.1186/s12888-023-05162-4

Bayesian IRT

There are guides and tools available for doing Bayesian IRT in R. The brms package is highly recommended (Bürkner 2017, 2021) and this excellent blog post is very helpful to get started: https://solomonkurz.netlify.app/blog/2021-12-29-notes-on-the-bayesian-cumulative-probit/
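A minimal sketch in the spirit of that post, assuming long_data has one row per person-item response with columns person, item, and response (ordered categories); this is the basic Rasch-like Bayesian ordinal model, not a full tutorial:

library(brms)

fit <- brm(
  response ~ 1 + (1 | item) + (1 | person),   # random intercepts for items and persons
  family = cumulative("probit"),              # ordered categories with a probit link
  data = long_data,
  cores = 4
)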

Exercise

Let’s look at a paper using the 5 criteria and common fallacies.

Criteria:

  • Unidimensionality & local independence
  • Ordered response categories
  • Invariance
  • Targeting
  • Reliability

Common fallacies:

  • CTT: default estimator ML used (or estimator not reported) when WLSMV is appropriate (which it usually is for ordinal data), and ML-based cutoff values used (e.g. Hu & Bentler, Kline, Brown) that are inappropriate for WLSMV
  • CTT: residual correlations added to the model to improve fit

R packages used

Package Version Citation
base 4.4.1 R Core Team (2024)
car 3.1.2 Fox and Weisberg (2019)
catR 3.17 Magis and Raîche (2012); Magis and Barrada (2017)
easyRasch 0.3.1.1 Johansson (2024)
eRm 1.0.6 Mair and Hatzinger (2007b); Mair and Hatzinger (2007a); Hatzinger and Rusch (2009); Rusch, Maier, and Hatzinger (2013); Koller, Maier, and Hatzinger (2015); Debelak and Koller (2019)
iarm 0.4.3 Mueller (2022)
janitor 2.2.0 Firke (2023)
lavaan 0.6.19 Rosseel (2012)
lavaanPlot 0.8.1 Lishinski (2024)
mirt 1.42 Chalmers (2012)
patchwork 1.3.0 Pedersen (2024)
qrcode 0.3.0 Onkelinx and Teh (2024)
rmarkdown 2.28 Xie, Allaire, and Grolemund (2018); Xie, Dervieux, and Riederer (2020); Allaire et al. (2024)
scales 1.3.0 Wickham, Pedersen, and Seidel (2023)
tidyverse 2.0.0 Wickham et al. (2019)

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Andrich, David. 1985. “An Elaboration of Guttman Scaling with Rasch Models for Measurement.” Sociological Methodology 15: 33. https://doi.org/10.2307/270846.
———. 2004. “Controversy and the Rasch Model.” Medical Care 42 (1): 7. https://doi.org/10.1097/01.mlr.0000103528.48582.7c.
Andrich, David, and Curt Hagquist. 2012. “Real and Artificial Differential Item Functioning.” Journal of Educational and Behavioral Statistics 37 (3): 387–416. https://doi.org/10.3102/1076998611411913.
———. 2015. “Real and Artificial Differential Item Functioning in Polytomous Items.” Educational and Psychological Measurement 75 (2): 185–207. https://doi.org/10.1177/0013164414534258.
Boone, William J., and Amity Noltemeyer. 2017. “Rasch Analysis: A Primer for School Psychology Researchers and Practitioners.” Edited by Gregory Yates. Cogent Education 4 (1): 1416898. https://doi.org/10.1080/2331186X.2017.1416898.
Bürkner, Paul-Christian. 2017. “Brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (August): 1–28. https://doi.org/10.18637/jss.v080.i01.
———. 2021. “Bayesian Item Response Modeling in R with Brms and Stan.” Journal of Statistical Software 100 (November): 1–54. https://doi.org/10.18637/jss.v100.i05.
Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101. https://doi.org/10.1177/2515245918823199.
Chalmers, R. Philip. 2012. “mirt: A Multidimensional Item Response Theory Package for the R Environment.” Journal of Statistical Software 48 (6): 1–29. https://doi.org/10.18637/jss.v048.i06.
Christensen, Karl Bang, Guido Makransky, and Mike Horton. 2017. “Critical Values for Yen's Q3: Identification of Local Dependence in the Rasch Model Using Residual Correlations.” Applied Psychological Measurement 41 (3): 178–94. https://doi.org/10.1177/0146621616677520.
D’Urso, Epifanio Damiano, Esther Maassen, Marcel A. L. M. van Assen, Michèle B. Nuijten, Kim De Roover, and Jelte Wicherts. 2022. “The Dire Disregard of Measurement Invariance Testing in Psychological Science.” https://doi.org/10.31234/osf.io/n3f5u.
Debelak, Rudolf, and Ingrid Koller. 2019. “Testing the Local Independence Assumption of the Rasch Model with Q3-Based Nonparametric Model Tests.” Applied Psychological Measurement 44. https://doi.org/10.1177/0146621619835501.
Didino, Daniele, Ekaterina A. Taran, Galina A. Barysheva, and Fabio Casati. 2019. “Additional File 2: Of Psychometric Evaluation of the Russian Version of the Flourishing Scale in a Sample of Older Adults Living in Siberia.” https://doi.org/10.6084/m9.figshare.7705739.v1.
Diener, Ed, Derrick Wirtz, William Tov, Chu Kim-Prieto, Dong-won Choi, Shigehiro Oishi, and Robert Biswas-Diener. 2010. “New Well-Being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings.” Social Indicators Research 97 (2): 143–56. https://doi.org/10.1007/s11205-009-9493-y.
Elson, Malte, Ian Hussey, Taym Alsalti, and Ruben C. Arslan. 2023. “Psychological Measures Aren't Toothbrushes.” Communications Psychology 1 (1): 1–4. https://doi.org/10.1038/s44271-023-00026-9.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. Third. Thousand Oaks CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Hagquist, Curt. 2019. “Explaining Differential Item Functioning Focusing on the Crucial Role of External Information: An Example from the Measurement of Adolescent Mental Health.” BMC Medical Research Methodology 19 (1): 185. https://doi.org/10.1186/s12874-019-0828-3.
Hagquist, Curt, and David Andrich. 2017. “Recent Advances in Analysis of Differential Item Functioning in Health Research Using the Rasch Model.” Health and Quality of Life Outcomes 15 (September). https://doi.org/10.1186/s12955-017-0755-0.
Hatzinger, Reinhold, and Thomas Rusch. 2009. “IRT Models with Relaxed Assumptions in eRm: A Manual-Like Instruction.” Psychology Science Quarterly 51.
Johansson, Magnus. 2024. easyRasch: Psychometric Analysis in R with Rasch Measurement Theory. https://github.com/pgmj/easyRasch.
Johansson, Magnus, Marit Preuter, Simon Karlsson, Marie-Louise Möllerberg, Hanna Svensson, and Jeanette Melin. 2023. “Valid and Reliable? Basic and Expanded Recommendations for Psychometric Reporting and Quality Assessment.” OSF Preprints. https://doi.org/10.31219/osf.io/3htzc.
Koller, Ingrid, Marco Maier, and Reinhold Hatzinger. 2015. “An Empirical Power Analysis of Quasi-Exact Tests for the Rasch Model: Measurement Invariance in Small Samples.” Methodology 11. https://doi.org/10.1027/1614-2241/a000090.
Kyngdon, Andrew. 2008. “Conjoint Measurement, Error and the Rasch Model: A Reply to Michell, and Borsboom and Zand Scholten.” Theory & Psychology 18 (1): 125–31. https://doi.org/10.1177/0959354307086927.
Liddell, Torrin M., and John K. Kruschke. 2018. “Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?” Journal of Experimental Social Psychology 79 (November): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.
Lishinski, Alex. 2024. lavaanPlot: Path Diagrams for Lavaan Models via DiagrammeR. https://CRAN.R-project.org/package=lavaanPlot.
Magis, David, and Juan Ramon Barrada. 2017. “Computerized Adaptive Testing with R: Recent Updates of the Package catR.” Journal of Statistical Software, Code Snippets 76 (1): 1–19. https://doi.org/10.18637/jss.v076.c01.
Magis, David, and Gilles Raîche. 2012. “Random Generation of Response Patterns Under Computerized Adaptive Testing with the R Package catR.” Journal of Statistical Software 48 (8): 1–31. https://doi.org/10.18637/jss.v048.i08.
Mair, Patrick, and Reinhold Hatzinger. 2007a. “CML Based Estimation of Extended Rasch Models with the eRm Package in R.” Psychology Science 49. https://doi.org/10.18637/jss.v020.i09.
———. 2007b. “Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R.” Journal of Statistical Software 20. https://doi.org/10.18637/jss.v020.i09.
Mueller, Marianne. 2022. iarm: Item Analysis in Rasch Models. https://CRAN.R-project.org/package=iarm.
Onkelinx, Thierry, and Victor Teh. 2024. qrcode: Generate QRcodes with R. Version 0.3.0. https://doi.org/10.5281/zenodo.5040088.
Pedersen, Thomas Lin. 2024. patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://doi.org/10.18637/jss.v048.i02.
Rozental, Alexander, David Forsström, and Magnus Johansson. 2023. “A Psychometric Evaluation of the Swedish Translation of the Perceived Stress Scale: A Rasch Analysis.” BMC Psychiatry 23 (1): 690. https://doi.org/10.1186/s12888-023-05162-4.
Rusch, Thomas, Marco Maier, and Reinhold Hatzinger. 2013. “Linear Logistic Models with Relaxed Assumptions in r.” In Algorithms from and for Nature and Life, edited by Berthold Lausen, Dirk van den Poel, and Alfred Ultsch. Studies in Classification, Data Analysis, and Knowledge Organization. New York: Springer. https://doi.org/10.1007/978-3-319-00035-0_34.
Samejima, Fumiko. 1969. “Estimation of Latent Ability Using a Response Pattern of Graded Scores.” Psychometrika 34 (1): 1–97. https://doi.org/10.1007/BF03372160.
———. 1994. “Estimation of Reliability Coefficients Using the Test Information Function and Its Modifications.” Applied Psychological Measurement 18 (3): 229–44. https://doi.org/10.1177/014662169401800304.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.