Tests whether the observed one-factor categorical-CFA fit indices (CFI, RMSEA, SRMR) are consistent with the data being generated by a unidimensional Rasch / Partial Credit Model. The simulation generates the number of datasets given by iterations from the fitted PCM (or RM, for dichotomous data), using the observed item parameters and a resampled person distribution. Each simulated dataset is fitted with lavaan::cfa(..., ordered = TRUE, estimator = "WLSMV") and the three fit indices are recorded. The result is a parametric-bootstrap null distribution against which the observed CFA fit can be compared, one-sided in the unfavourable direction for each index (CFI from below; RMSEA and SRMR from above).

Usage

RMdimCFACutoff(
  data,
  iterations = 250L,
  percentile = 99,
  output = c("kable", "list"),
  parallel = TRUE,
  n_cores = NULL,
  verbose = FALSE,
  seed = NULL,
  estimator = "WLSMV"
)

Arguments

data

A data.frame or matrix of item responses (non-negative integers, 0-based). One column per item, one row per person.

iterations

Integer. Number of parametric-bootstrap iterations. Default 250.

percentile

Numeric in (50, 100). The strictness of the one-sided cutoff. Default 99. Higher = stricter:

  • For CFI (higher = better fit), the cutoff is the (100 - percentile)-th percentile of the simulated null distribution; observed values below this are flagged.

  • For RMSEA and SRMR (lower = better fit), the cutoff is the percentile-th percentile; observed values above this are flagged.
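The two rules above can be sketched with base R's quantile(). This is a minimal illustration, not the function's internal code: the sim data frame is filled with made-up values standing in for a simulated null distribution, and the default quantile type is an assumption.

```r
set.seed(1)

# Hypothetical simulated fit indices (made-up values, not real CFA output)
sim <- data.frame(
  cfi   = 1 - rexp(250, rate = 200),
  rmsea = rexp(250, rate = 50),
  srmr  = 0.02 + rexp(250, rate = 100)
)

percentile <- 99

# CFI is cut from below, RMSEA and SRMR from above
cutoffs <- c(
  cfi   = unname(quantile(sim$cfi,   probs = (100 - percentile) / 100)),
  rmsea = unname(quantile(sim$rmsea, probs = percentile / 100)),
  srmr  = unname(quantile(sim$srmr,  probs = percentile / 100))
)
cutoffs
```

Raising percentile moves the CFI cutoff down and the RMSEA / SRMR cutoffs up, so an observed value has to be more extreme before it is flagged.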

output

Character. "kable" (default) returns a formatted knitr::kable() summary of observed values vs cutoffs, with the full result list attached as attr(., "result"). "list" returns the result list directly.

parallel

Logical. If TRUE (default), uses parallel processing via mirai. Falls back to sequential if mirai is not installed or n_cores cannot be resolved.

n_cores

Integer or NULL. Number of parallel workers. When NULL, getOption("mc.cores") is consulted; if neither is set, sequential is used.

verbose

Logical. Show a progress bar (default FALSE).

seed

Integer or NULL. Master seed for reproducibility.

estimator

Character. The lavaan estimator passed to lavaan::cfa(). Default "WLSMV". Other limited-information estimators that produce robust/scaled fit indices (e.g., "DWLS", "ULSMV") are also accepted; full-information ML is rejected (incompatible with ordered = TRUE).

Value

If output = "kable", a knitr_kable object summarising each index against its cutoff, with the full result list as attr(., "result"). If output = "list", that list directly. The list has components:

observed

Named numeric vector of observed CFA fit indices (cfi.scaled, rmsea.scaled, srmr).

simulated

data.frame with one row per successful iteration and columns iteration, cfi, rmsea, srmr.

percentile

Numeric: the strictness setting used.

cutoffs

Named numeric vector (cfi, rmsea, srmr) of simulated cutoffs at the chosen percentile.

flagged

Named logical vector indicating whether each observed index falls outside its cutoff in the unfavourable direction.

actual_iterations

Number of simulation iterations that succeeded.

sample_n

Number of complete cases used.

n_items

Number of items.

item_names

Character vector of item names.

is_polytomous

Logical: was a PCM (vs RM) fitted?

estimator

The lavaan estimator used.

Details

At the default 99th-percentile cutoff, an index is flagged when the observed value lies in the worst 1% of the simulated distribution.
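As a small worked illustration of the one-sided comparison, the snippet below applies the flagging rule to made-up numbers. The observed and cutoffs vectors are hypothetical, and their names are simplified (the function reports the observed values as cfi.scaled, rmsea.scaled, srmr).

```r
# Hypothetical observed fit indices and simulated cutoffs
observed <- c(cfi = 0.962, rmsea = 0.071, srmr = 0.048)
cutoffs  <- c(cfi = 0.975, rmsea = 0.055, srmr = 0.052)

# Flag in the unfavourable direction only
flagged <- c(
  cfi   = observed[["cfi"]]   < cutoffs[["cfi"]],    # too low
  rmsea = observed[["rmsea"]] > cutoffs[["rmsea"]],  # too high
  srmr  = observed[["srmr"]]  > cutoffs[["srmr"]]    # too high
)
flagged
```

With these numbers CFI and RMSEA are flagged and SRMR is not.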

Generative model. The data-generating process for each simulated dataset is the PCM (or RM) fitted to the observed data, with persons drawn from the empirical theta distribution (resampled with replacement). This means the simulated data perfectly satisfy the unidimensionality assumption of the PCM.
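A rough base-R sketch of this kind of data-generating step is shown below. It is not the package's implementation: the thresholds matrix, the theta_hat vector and the helper sim_pcm_response() are hypothetical stand-ins for the fitted item parameters and person estimates.

```r
set.seed(42)

# Hypothetical PCM thresholds: one row per item, one column per threshold
thresholds <- rbind(
  I1 = c(-1.0,  0.0, 1.0),
  I2 = c(-0.5,  0.5, 1.5),
  I3 = c(-1.5, -0.2, 0.8),
  I4 = c(-0.8,  0.3, 1.2)
)

# Hypothetical person estimates standing in for the empirical theta distribution
theta_hat <- rnorm(300)

# Resample persons with replacement
theta_sim <- sample(theta_hat, size = length(theta_hat), replace = TRUE)

# Draw one PCM response for a person at 'theta' on an item with thresholds 'delta'
sim_pcm_response <- function(theta, delta) {
  # Category k has log-numerator sum_{j <= k} (theta - delta_j); category 0 has 0
  log_num <- c(0, cumsum(theta - delta))
  prob <- exp(log_num - max(log_num))
  prob <- prob / sum(prob)
  sample(seq_along(prob) - 1L, size = 1L, prob = prob)
}

# Persons in rows, items in columns, responses coded 0-based
sim_data <- sapply(rownames(thresholds), function(item) {
  vapply(theta_sim, sim_pcm_response, integer(1), delta = thresholds[item, ])
})
dim(sim_data)
```

Because every response is driven by a single theta per person, data produced this way are unidimensional by construction.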

Estimation model. The CFA on each simulated dataset uses a single-factor model with all items as ordinal indicators (F1 =~ I1 + I2 + ...), fitted with WLSMV by default. Reported CFI / RMSEA are the Satorra-Bentler-scaled variants (cfi.scaled, rmsea.scaled) for consistency across iterations; the Yuan-Bentler mean-variance-adjusted "robust" variants are sometimes NA at small n and would produce holes in the simulated null distribution. SRMR is reported unchanged. For percentile-based comparison the binding requirement is that the same metric is computed for observed and simulated iterations, which both variants satisfy.
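The estimation step might look roughly like the following. The helpers one_factor_syntax() and fit_one_factor() are hypothetical names introduced for illustration. The lavaan calls (cfa() with ordered = TRUE, and fitMeasures()) are the ones named in this page, but the package's actual internals may differ.

```r
# Build the single-factor model syntax, e.g. "F1 =~ I1 + I2 + I3"
one_factor_syntax <- function(item_names) {
  paste("F1 =~", paste(item_names, collapse = " + "))
}

# Fit the one-factor ordinal CFA and return the three indices described above
fit_one_factor <- function(data, estimator = "WLSMV") {
  data <- as.data.frame(data)
  fit <- lavaan::cfa(one_factor_syntax(colnames(data)), data = data,
                     ordered = TRUE, estimator = estimator)
  lavaan::fitMeasures(fit, c("cfi.scaled", "rmsea.scaled", "srmr"))
}

one_factor_syntax(paste0("I", 1:4))
```

Applying the same fit_one_factor() call to the observed data and to every simulated dataset keeps the compared metric identical across both, which is the requirement stated above.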

Why a null distribution. A perfectly PCM-unidimensional dataset will typically not yield CFA fit indices at their ideal values (CFI = 1, RMSEA = 0). Two reasons: (1) PCM uses a logistic threshold structure while WLSMV uses a probit link through the polychoric correlation matrix, so there is a small built-in metric mismatch even under correct unidimensionality; (2) finite samples produce sampling variability in the polychoric correlations. The simulated distribution captures both. Comparing observed to this distribution is more honest than rule-of-thumb cutoffs (CFI > 0.95, RMSEA < 0.06, SRMR < 0.08), which were derived under continuous-data ML and do not transfer cleanly to ordinal WLSMV.
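The link mismatch in reason (1) can be illustrated numerically. The snippet compares the logistic and normal distribution functions on a grid, once on the raw scale and once after the conventional 1.702 rescaling of the logistic. It is only an illustration of why the two links are close but not identical, not a calculation the function performs.

```r
x <- seq(-6, 6, by = 0.01)

# Largest gap between the logistic and normal CDFs on the same scale
max_gap_raw <- max(abs(plogis(x) - pnorm(x)))

# Largest gap after rescaling the logistic by the conventional constant 1.702
max_gap_scaled <- max(abs(plogis(1.702 * x) - pnorm(x)))

c(raw = max_gap_raw, scaled = max_gap_scaled)
```

Even after rescaling, a small gap remains, so data generated under a logistic model are slightly misspecified under a probit-based estimator.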

Iteration failures. Some simulated datasets cause WLSMV to fail (non-positive-definite polychoric matrix, boundary thresholds, empty categories). Failed iterations are recorded with a character message and dropped; actual_iterations reflects the number that succeeded.
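A common pattern for this kind of bookkeeping is to wrap each iteration in tryCatch() and return the error message as a character string. The sketch below assumes that pattern. The run_iteration() helper, the artificial failure on every fourth iteration and the constant fit values are all hypothetical.

```r
run_iteration <- function(i) {
  tryCatch({
    # Hypothetical failure standing in for a WLSMV estimation error
    if (i %% 4 == 0) stop("polychoric matrix not positive definite")
    # Placeholder fit values; a real iteration would simulate data and fit the CFA
    data.frame(iteration = i, cfi = 0.99, rmsea = 0.02, srmr = 0.03)
  }, error = function(e) conditionMessage(e))
}

results <- lapply(1:10, run_iteration)

# Failed iterations come back as character messages; keep only the successes
ok <- !vapply(results, is.character, logical(1))
simulated <- do.call(rbind, results[ok])
actual_iterations <- nrow(simulated)
actual_iterations
```

If many iterations fail, the retained null distribution rests on fewer draws than requested, so it is worth comparing actual_iterations with iterations before interpreting the cutoffs.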

Companion functions. See RMdimCFAPlot for a faceted visualisation of the observed value against each simulated distribution. This test complements RMdimResidualPCA (which identifies which items deviate) and RMdimMartinLof (which tests a specific hypothesised partition).

References

Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30(1), 165-200. doi:10.1111/0081-1750.00078

Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. doi:10.18637/jss.v048.i02

Examples

# \donttest{
if (requireNamespace("lavaan", quietly = TRUE)) {
  data("raschdat1", package = "eRm")

  # Few iterations for a fast example; use 250+ in real analyses
  result <- RMdimCFACutoff(raschdat1[, 1:8], iterations = 50,
                           parallel = FALSE, seed = 1, output = "list")
  result$observed
  result$flagged

  if (requireNamespace("ggplot2", quietly = TRUE)) {
    RMdimCFAPlot(result)
  }
}

# }