Assessing model and item fit in psychometrics

Moving toward up-to-date methods and appropriate cutoff values

SAMC Copenhagen, 2026-06-10

Challenges of psychometric analyses

  • Evaluating model–data fit (Christensen et al. 2021)
    • unidimensionality
    • local independence
    • ordered response categories
    • invariance
  • Using adequate methods
    • with appropriate interpretations

Before diving in…

Scope

  • Examples are primarily related to Rasch Measurement Theory (RMT), focusing on unidimensionality, item fit, and local (in)dependence.
  • But the concepts and general methods apply to any measurement model.

This will be a non-technical presentation, focused on results and recommendations rather than explanations and details.

Link to this presentation on GitHub is provided on the last slide

Traditional workflow

  • We evaluate how well the data fit the measurement model using metrics such as item fit statistics, \(Q_3\) residuals, etc.
  • But how do we decide what counts as misfit?

Critical values — a.k.a. “cutoffs”

  • Rule-of-thumb cutoffs dominate practice…
  • …despite strong published evidence that they depend on sample size, number of items, and more.

A better solution → Simulation-based cutoffs

Key takeaways

(Sorry for the bluntness…)

Data-adaptive cutoffs

The core idea is simple: instead of comparing a fit statistic against a fixed number, we compare it against the range of values we would expect to see if the data at hand actually fit the model.

  • Estimate item & person parameters from your sample and items
  • Simulations are tailored to mimic your data — and so is the expected distribution of the fit metric
  • Compare the expected distribution with the observed value — voilà!

Simulating from model parameters estimated on data = parametric bootstrap.

Simulation-based workflow

Setup

  1. Start with the dataset being analyzed
  2. Estimate item & person parameters from observed data

Simulation (parametric bootstrap)

  1. Resample person parameters (\(\theta\)) with replacement
  2. Using the item parameters, simulate item responses based on \(\theta\)
  3. Re-estimate the model parameters from the simulated data
  4. Calculate the fit metric for the simulated data

Repeat simulation steps 1–4 many times (~500–1000).
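The simulation loop can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by easyRasch2 or the jamovi module: `beta` and `theta` are made-up values standing in for the item and person estimates from the setup stage, all function names are hypothetical, and the data are dichotomous. The fit metric is the item–restscore Goodman–Kruskal \(\gamma\), chosen because it needs no model re-estimation, so that step is skipped here; for conditional infit MSQ it would be a call to a Rasch estimation routine. The 2.5th and 97.5th percentiles are one possible choice of cutoff, not a recommendation.

```r
set.seed(2026)

# Assumed inputs: in practice these are estimated from the observed data.
beta  <- c(-1.5, -0.5, 0, 0.5, 1.5)  # hypothetical item locations (logits)
theta <- rnorm(300)                  # hypothetical person locations (logits)

# Simulate dichotomous Rasch responses: P(X = 1) = logistic(theta - beta).
sim_rasch <- function(theta, beta) {
  p <- plogis(outer(theta, beta, "-"))
  matrix(rbinom(length(p), 1, p), nrow = length(theta))
}

# Goodman-Kruskal gamma from concordant and discordant pairs in a 2-way table.
gk_gamma <- function(x, y) {
  tab <- table(x, y)
  rows <- seq_len(nrow(tab))
  cols <- seq_len(ncol(tab))
  conc <- 0
  disc <- 0
  for (i in rows) {
    for (j in cols) {
      conc <- conc + tab[i, j] * sum(tab[rows > i, cols > j])
      disc <- disc + tab[i, j] * sum(tab[rows > i, cols < j])
    }
  }
  (conc - disc) / (conc + disc)
}

# Gamma between each item and its rest score (sum of the other items).
item_restscore <- function(data) {
  sapply(seq_len(ncol(data)), function(i) {
    gk_gamma(data[, i], rowSums(data[, -i, drop = FALSE]))
  })
}

# Resample theta, simulate responses, compute the metric; repeat many times.
n_sim <- 500
sim_gamma <- t(replicate(n_sim, {
  theta_star <- sample(theta, replace = TRUE)
  item_restscore(sim_rasch(theta_star, beta))
}))

# Data-adaptive expected range per item.
cutoffs <- apply(sim_gamma, 2, quantile, probs = c(0.025, 0.975))

# Stand-in for the observed data (generated from the model for this sketch).
observed <- item_restscore(sim_rasch(theta, beta))
flagged  <- observed < cutoffs[1, ] | observed > cutoffs[2, ]
```

Each row of `sim_gamma` is one simulated dataset and each column one item, so `cutoffs` holds a separate expected range per item rather than one fixed number for all of them.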

A visual example

Ridgeline distributions of conditional infit MSQ for items q1–q9. Blue simulated distributions are centred near 1.0; observed values appear as orange diamonds, with several items falling in the tails.

Conditional infit MSQ

PHQ-9, N = 600

  • Blue = simulated (expected) distribution
  • Orange ♦ = observed value
  • Values out in the tails → flagged as underfit (above) or overfit (below)

How do you get simulation-based cutoffs?

Free, open-source software! Both options below rely on R packages built by others.

jamovi

jamovi.org — based on R, but with a point-and-click user interface.

  • Install easyRasch2jmv from the built-in jamovi module library.
  • Most example images in this presentation are from the jamovi module.
  • Available for Mac, Windows, Linux, and ChromeOS.

R

Frequentist and Bayesian Rasch packages available on CRAN, both using expected/observed fit metrics.

For documentation and examples:

jamovi overview

Item fit

Two key papers:

  • Müller (2020) — on conditional item fit.
  • Johansson (2025) — on power to detect misfit due to multidimensionality, using conditional infit/outfit and item–restscore (Goodman–Kruskal’s \(\gamma\)).

Important

Since unconditional item fit is unreliable with a sample size larger than ~200 (Müller 2020) and the detection of misfit items is generally underpowered with sample sizes below 200 (Johansson 2025), we should use conditional item infit.

Conditional item fit MSQ (Müller 2020)

  • The paper compares conditional and unconditional infit/outfit.
  • Unconditional item fit is inconsistent for n > ~200.
    • ZSTD transformation of MSQ does not help.
  • Unconditional/ZSTD is what all software reports unless explicitly noted otherwise.
  • For conditional item fit, standard errors are not consistent → we need bootstrap/simulation to interpret results.
  • Limitation: Requires complete response data.
    • Multiple imputation with chained equations (MICE) can be used.
    • MICE is implemented in the jamovi module and easyRasch2.
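For reference, the infit and outfit mean squares for item \(i\) can be written as below. This is the standard textbook form, not a formula taken from the slides, and the notation is assumed here: \(x_{ni}\) is the response of person \(n\) to item \(i\), and \(E_{ni}\) and \(W_{ni}\) are its model-expected value and variance.

\[
\text{Outfit}_i = \frac{1}{N}\sum_{n=1}^{N}\frac{(x_{ni}-E_{ni})^2}{W_{ni}},
\qquad
\text{Infit}_i = \frac{\sum_{n=1}^{N}(x_{ni}-E_{ni})^2}{\sum_{n=1}^{N}W_{ni}}
\]

In the unconditional version, \(E_{ni}\) and \(W_{ni}\) are computed from the estimated person location \(\hat{\theta}_n\). In the conditional version they are the expectation and variance of \(x_{ni}\) given the person's total score, so no person estimate enters the statistic.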

Detecting item misfit (Johansson 2025)

  • Outfit is much less powerful than conditional infit and item–restscore (GK \(\gamma\)) in detecting multidimensionality issues (underfit items).
  • Conditional infit is slightly more powerful than item–restscore for n < 500.
  • Off-target items (-2 logits relative to the sample mean) are harder to detect.
    • ±1 logit makes little difference.
  • Large samples (n > 1000) should use item–restscore with non-parametric bootstrap (available in jamovi & easyRasch2).

Detecting item misfit 2/3

  • You will see a plot that shows detection rate (%) of underfit items.
  • 20 dichotomous items, of which 3 are underfit due to multidimensionality.
  • Targeting relative to the sample mean:
    • item 9 = same as sample mean
    • item 18 = -1 logits
    • item 13 = -2 logits
  • There are panel plots for different sample sizes, from 150 to 2000.

Detecting item misfit 3/3

Conditional infit

Conditional outfit

Local dependence: \(Q_3\) residuals

Global cutoff vs. separate item pair cutoffs

Ridgeline distributions of simulated \(Q_3\) residuals with observed item-pair values marked, used to judge local dependence against a data-adaptive cutoff.
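The difference between a global cutoff and separate item-pair cutoffs can be sketched in base R. The assumptions are loud ones: `beta` and `theta` are made-up values, `sim_q3` is a hypothetical helper, and residuals are taken against the generating parameters instead of re-estimated ones, which keeps the sketch short but is not what a real analysis would do. The global statistic used here is the largest \(Q_3\) minus the average \(Q_3\), following the idea in Christensen et al. (2017); the 99th percentile is one possible choice.

```r
set.seed(2026)

# Assumed inputs standing in for estimates from the observed data.
beta  <- c(-1.5, -0.5, 0, 0.5, 1.5)  # hypothetical item locations (logits)
theta <- rnorm(300)                  # hypothetical person locations (logits)

# Simulate one dataset and return Q3 (residual correlation) for every item pair.
sim_q3 <- function(theta, beta) {
  p <- plogis(outer(theta, beta, "-"))                   # expected scores
  x <- matrix(rbinom(length(p), 1, p), nrow = length(theta))
  q3 <- cor(x - p)                                       # correlations of residuals
  q3[upper.tri(q3)]                                      # one value per item pair
}

# 500 simulated datasets: rows = simulations, columns = item pairs.
sim_q3_values <- t(replicate(500, sim_q3(sample(theta, replace = TRUE), beta)))

# Separate cutoffs: one percentile per item pair.
pair_cutoffs <- apply(sim_q3_values, 2, quantile, probs = 0.99)

# Global cutoff: percentile of (max Q3 - mean Q3) across simulations.
rel_max <- apply(sim_q3_values, 1, function(v) max(v) - mean(v))
global_cutoff <- quantile(rel_max, probs = 0.99)
```

A global cutoff gives a single number for the whole item set, while `pair_cutoffs` lets each item pair be judged against its own expected distribution.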

CFA model fit

  • Confirmatory factor analysis (CFA) is a test of dimensionality.
  • Usually judged by several indices — RMSEA, SRMR, CFI, TLI, etc.
    • Hu & Bentler (1999): ~150k citations for rule-of-thumb cutoffs.
      • Same issue as other metrics — they don’t generalize.
      • They are especially inappropriate for ordinal data.
  • Simulation-based cutoffs should be used (McNeish and Wolf 2024).
    • The R package dynamic does this, but it is easier with easyRasch2 and the jamovi module.
  • Remember to use an appropriate estimator for ordinal data!
    • WLSMV, ULSMV, DWLS, etc. (lavaan::cfa(model, data = data, ordered = TRUE, estimator = "WLSMV"))
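A one-factor CFA on ordinal items with lavaan might look like the sketch below. The data are simulated purely for illustration: the item names `q1` to `q6`, the loading of 0.7, and the category thresholds are all assumptions, and the example requires the lavaan package to be installed. It shows only how observed fit indices are obtained, not how the dynamic package, easyRasch2, or the jamovi module derive simulation-based cutoffs.

```r
library(lavaan)

set.seed(123)
n   <- 500
eta <- rnorm(n)  # latent variable

# Hypothetical 4-category item driven by the single latent variable.
make_item <- function() {
  latent <- 0.7 * eta + rnorm(n, sd = sqrt(1 - 0.7^2))
  as.integer(cut(latent, breaks = c(-Inf, -0.8, 0, 0.8, Inf)))
}
df <- as.data.frame(replicate(6, make_item()))
names(df) <- paste0("q", 1:6)

# Unidimensional model; ordered = TRUE treats all items as ordinal.
model <- "f =~ q1 + q2 + q3 + q4 + q5 + q6"
fit <- cfa(model, data = df, ordered = TRUE, estimator = "WLSMV")

# Scaled versions of CFI and RMSEA are reported for WLSMV.
fit_indices <- fitMeasures(fit, c("cfi.scaled", "rmsea.scaled", "srmr"))
print(fit_indices)
```

Note that the first argument of `cfa()` is the model syntax, and the data frame is passed through `data =`.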

CFA: observed vs. simulated (jamovi)

Three histograms (CFI, RMSEA, SRMR) of 500 parametric-bootstrap datasets simulated under unidimensionality. Observed values, shown as red diamonds, fall far outside the simulated distributions, indicating misfit.

Where to draw the line?

Choosing a cutoff

  • Where should we draw the line on the simulated distribution?
    • p-value? Which alpha/correction to use?
    • Distribution-based, 95th percentile? 99th? Fully outside the expected range?
  • Bayesian view: compare the expected distribution to the posterior distribution — how much do they overlap?
    • What is the probability that data were generated by the model?
  • A similar idea works in a frequentist frame, although we have only one observed value to compare with the simulated distribution of expected values.
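The frequentist options above can be compared on a single simulated distribution. In this base R sketch the numbers are made up: `simulated` stands in for 1000 simulated values of a fit metric such as conditional infit MSQ, `observed` is a hypothetical observed value, and the count of 9 tests used for the correction is an assumption (one per item of a 9-item scale).

```r
set.seed(42)

# Stand-in for 1000 simulated values of a fit metric (values are made up).
simulated <- rnorm(1000, mean = 1, sd = 0.05)
observed  <- 1.14  # hypothetical observed value for one item

# Distribution-based cutoffs at two percentiles.
upper_95 <- quantile(simulated, probs = 0.95)
upper_99 <- quantile(simulated, probs = 0.99)

# Strictest rule: flag only if fully outside the simulated range.
outside_range <- observed < min(simulated) || observed > max(simulated)

# One-sided Monte Carlo p-value, with the usual +1 correction.
p_upper <- (1 + sum(simulated >= observed)) / (1 + length(simulated))

# Bonferroni correction, assuming 9 items are tested.
p_adjusted <- p.adjust(p_upper, method = "bonferroni", n = 9)
```

The smallest p-value this approach can return is 1 / (number of simulations + 1), which is one reason the number of simulations matters when a correction for multiple testing is applied.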

\(Q_3\) residuals: simulation vs Bayesian

Ridgeline distributions of simulated \(Q_3\) residuals with observed item-pair values marked, used to judge local dependence against a data-adaptive cutoff.

...

A call to action

  • We need to collectively move the norms regarding cutoffs and methods
    • Please spread the word
  • Use this information when reviewing and writing papers
    • In your paper introductions, mention the limitations of earlier papers that used outdated methods and/or fixed cutoffs.
    • Don’t assume that published papers are correct; most are likely to contain errors of varying severity.
  • Editors could have a brief guideline/checklist for authors

Thank you!

Magnus Johansson, PhD

Department of Clinical Neuroscience, Karolinska Institutet

https://orcid.org/0000-0003-1669-592X
https://ki.se/en/people/magnus-johansson-3

References

Chou, Yeh-Tai, and Wen-Chung Wang. 2010. “Checking Dimensionality in Item Response Models With Principal Component Analysis on Standardized Residuals.” Educational and Psychological Measurement 70 (5): 717–31. https://doi.org/10.1177/0013164410379322.
Christensen, Karl Bang, Jonathan D. Comins, Michael R. Krogsgaard, et al. 2021. “Psychometric Validation of PROM Instruments.” Scandinavian Journal of Medicine & Science in Sports 31 (6): 1225–38. https://doi.org/10.1111/sms.13908.
Christensen, Karl Bang, Svend Kreiner, and Mounir Mesbah, eds. 2013. Rasch Models in Health. Applied Mathematics Series. ISTE ; John Wiley & Sons.
Christensen, Karl Bang, Guido Makransky, and Mike Horton. 2017. “Critical Values for Yen’s Q3: Identification of Local Dependence in the Rasch Model Using Residual Correlations.” Applied Psychological Measurement 41 (3): 178–94. https://doi.org/10.1177/0146621616677520.
Johansson, Magnus. 2025. “Detecting Item Misfit in Rasch Models.” Educational Methods & Psychometrics 3 (18). https://doi.org/10.61186/emp.2025.5.
Marsh, Herbert W., Kit-Tai Hau, and Zhonglin Wen. 2004. “In Search of Golden Rules: Comment on Hypothesis-Testing Approaches to Setting Cutoff Values for Fit Indexes and Dangers in Overgeneralizing Hu and Bentler’s (1999) Findings.” Structural Equation Modeling: A Multidisciplinary Journal 11 (3): 320–41. https://doi.org/10.1207/s15328007sem1103_2.
McNeish, Daniel, and Melissa G. Wolf. 2024. “Direct Discrepancy Dynamic Fit Index Cutoffs for Arbitrary Covariance Structure Models.” Structural Equation Modeling: A Multidisciplinary Journal 31 (5): 835–62. https://doi.org/10.1080/10705511.2024.2308005.
Müller, Marianne. 2020. “Item Fit Statistics for Rasch Analysis: Can We Trust Them?” Journal of Statistical Distributions and Applications 7 (1): 5. https://doi.org/10.1186/s40488-020-00108-7.
Wang, Wen-Chung, and Cheng-Te Chen. 2005. “Item Parameter Recovery, Standard Error Estimates, and Fit Statistics of the Winsteps Program for the Family of Rasch Models.” Educational and Psychological Measurement 65 (3): 376–404. https://doi.org/10.1177/0013164404268673.