Posterior-predictive CFA fit-index cutoffs under PCM unidimensionality
Source: R/cfa_cutoff.R
RMdimCFACutoff.Rd
Tests whether the observed one-factor categorical-CFA fit indices (CFI,
RMSEA, SRMR) are consistent with the data being generated by a
unidimensional Rasch / Partial Credit Model. The simulation generates
iterations datasets from the fitted PCM (or RM, for dichotomous
data) using the observed item parameters and a resampled person
distribution; each simulated dataset is fitted with
lavaan::cfa(..., ordered = TRUE, estimator = "WLSMV") and the
three fit indices are recorded. The result is a parametric-bootstrap
null distribution against which the observed CFA fit can be compared,
one-sided in the unfavourable direction for each index (CFI from
below; RMSEA and SRMR from above).
Usage
RMdimCFACutoff(
  data,
  iterations = 250L,
  percentile = 99,
  output = c("kable", "list"),
  parallel = TRUE,
  n_cores = NULL,
  verbose = FALSE,
  seed = NULL,
  estimator = "WLSMV"
)
Arguments
- data
A data.frame or matrix of item responses (non-negative integers, 0-based). One column per item, one row per person.
- iterations
Integer. Number of parametric-bootstrap iterations. Default 250.
- percentile
Numeric in (50, 100). The strictness of the one-sided cutoff. Default 99. Higher = stricter:
For CFI (higher = better fit), the cutoff is the (100 - percentile)-th percentile of the simulated null distribution; observed values below this are flagged.
For RMSEA and SRMR (lower = better fit), the cutoff is the percentile-th percentile; observed values above this are flagged.
- output
Character. "kable" (default) returns a formatted knitr::kable() summary of observed values vs cutoffs, with the full result list attached as attr(., "result"). "list" returns the result list directly.
- parallel
Logical. If TRUE (default), uses parallel processing via mirai. Falls back to sequential if mirai is not installed or n_cores cannot be resolved.
- n_cores
Integer or NULL. Number of parallel workers. When NULL, getOption("mc.cores") is consulted; if neither is set, sequential is used.
- verbose
Logical. Show a progress bar (default FALSE).
- seed
Integer or NULL. Master seed for reproducibility.
- estimator
Character. The lavaan estimator passed to lavaan::cfa(). Default "WLSMV". Other limited-information estimators that produce robust/scaled fit indices (e.g., "DWLS", "ULSMV") are also accepted; full-information ML is rejected (incompatible with ordered = TRUE).
Value
If output = "kable", a knitr_kable object summarising
each index against its cutoff, with the full result list as
attr(., "result"). If output = "list", that list directly. The
list has components:
- observed
Named numeric vector of observed CFA fit indices (cfi.scaled, rmsea.scaled, srmr).
- simulated
data.frame with one row per successful iteration and columns iteration, cfi, rmsea, srmr.
- percentile
Numeric: the strictness setting used.
- cutoffs
Named numeric vector (cfi, rmsea, srmr) of simulated cutoffs at the chosen percentile.
- flagged
Named logical vector indicating whether each observed index falls outside its cutoff in the unfavourable direction.
- actual_iterations
Number of successful MC iterations.
- sample_n
Number of complete cases used.
- n_items
Number of items.
- item_names
Character vector of item names.
- is_polytomous
Logical: was a PCM (vs RM) fitted?
- estimator
The lavaan estimator used.
Details
At the default 99th-percentile cutoff, an index is flagged when the observed value lies in the worst 1% of the simulated null distribution.
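The cutoff and flagging rule can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: the simulated values are random placeholders, the observed values are made up, and the use of quantile() with its default interpolation is an assumption about how the percentile is computed.

```r
# Placeholder null distribution; in a real run these come from the simulated CFA fits.
set.seed(1)
simulated <- data.frame(
  cfi   = 1 - rbeta(250, 1, 60),
  rmsea = rbeta(250, 1, 40),
  srmr  = rbeta(250, 2, 60)
)
observed   <- c(cfi = 0.90, rmsea = 0.09, srmr = 0.05)  # hypothetical observed values
percentile <- 99

# CFI is compared against the lower tail; RMSEA and SRMR against the upper tail.
cutoffs <- c(
  cfi   = unname(quantile(simulated$cfi,   probs = (100 - percentile) / 100)),
  rmsea = unname(quantile(simulated$rmsea, probs = percentile / 100)),
  srmr  = unname(quantile(simulated$srmr,  probs = percentile / 100))
)

# An index is flagged when it falls outside its cutoff in the unfavourable direction.
flagged <- c(
  cfi   = observed[["cfi"]]   < cutoffs[["cfi"]],
  rmsea = observed[["rmsea"]] > cutoffs[["rmsea"]],
  srmr  = observed[["srmr"]]  > cutoffs[["srmr"]]
)
```

Raising percentile moves the CFI cutoff further down and the RMSEA and SRMR cutoffs further up, so fewer observed values are flagged.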
Generative model. The data-generating process for each simulated dataset is the PCM (or RM) fitted to the observed data, with persons drawn from the empirical distribution of estimated person locations (theta), resampled with replacement. The simulated data therefore satisfy the PCM unidimensionality assumption by construction.
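One way to picture this generative step is the base-R sketch below. The helper simulate_pcm(), the threshold values and the stand-in person locations are all hypothetical and chosen for illustration; the package's own simulation code may differ.

```r
# Simulate one PCM response matrix (hypothetical helper, not part of the package).
# theta: numeric vector of person locations.
# thresholds: named list with one numeric vector of item thresholds per item.
simulate_pcm <- function(theta, thresholds) {
  sapply(thresholds, function(delta) {
    vapply(theta, function(th) {
      # Category k has log-numerator sum_{j <= k} (theta - delta_j); category 0 has 0.
      log_num <- c(0, cumsum(th - delta))
      prob <- exp(log_num - max(log_num))  # unnormalised; sample() rescales
      sample(0:length(delta), size = 1, prob = prob)
    }, integer(1))
  })
}

set.seed(1)
theta_hat  <- rnorm(200)  # stand-in for estimated person locations
thresholds <- list(I1 = c(-1, 0.5), I2 = c(-0.5, 1), I3 = c(0, 1.2))  # made-up values

# Resample persons with replacement, then generate 0-based responses.
theta_star <- sample(theta_hat, replace = TRUE)
sim <- simulate_pcm(theta_star, thresholds)
```

Because every response is drawn from a single theta per person, any departure of the resulting CFA fit indices from their ideal values reflects sampling noise and model mismatch rather than multidimensionality.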
Estimation model. The CFA on each simulated dataset uses a
single-factor model with all items as ordinal indicators
(F1 =~ I1 + I2 + ...), fitted with WLSMV by default. Reported
CFI / RMSEA are the Satorra-Bentler-scaled variants (cfi.scaled,
rmsea.scaled) for consistency across iterations; the Yuan-Bentler
mean-variance-adjusted "robust" variants are sometimes NA at small
n and would produce holes in the simulated null distribution. SRMR
is reported unchanged. For percentile-based comparison the binding
requirement is that the same metric is computed for observed and
simulated iterations, which both variants satisfy.
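The estimation step described above might look roughly like the following. The item names and the toy ordinal data are placeholders; only the lavaan::cfa() and lavaan::fitMeasures() calls mirror what the text describes, and the fit is skipped if lavaan is not installed.

```r
# Build the single-factor model string: F1 =~ I1 + I2 + ...
items <- paste0("I", 1:4)
model <- paste("F1 =~", paste(items, collapse = " + "))

if (requireNamespace("lavaan", quietly = TRUE)) {
  # Placeholder ordinal data (categories 0, 1, 2) driven by one latent variable.
  set.seed(1)
  eta <- rnorm(500)
  dat <- as.data.frame(lapply(setNames(items, items), function(i) {
    cut(eta + rnorm(500), breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE) - 1L
  }))

  # All indicators treated as ordinal; the same three indices are extracted
  # for the observed data and for every simulated dataset.
  fit <- lavaan::cfa(model, data = dat, ordered = TRUE, estimator = "WLSMV")
  lavaan::fitMeasures(fit, c("cfi.scaled", "rmsea.scaled", "srmr"))
}
```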
Why a null distribution. A perfectly PCM-unidimensional dataset will typically not yield CFA fit indices at their ideal values (CFI = 1, RMSEA = 0), for two reasons: (1) the PCM uses a logistic threshold structure while WLSMV uses a probit link via the polychoric correlation matrix, so there is a small built-in metric mismatch even under correct unidimensionality; (2) finite samples produce sampling variability in the polychoric correlations. The simulated distribution captures both. Comparing the observed values to this distribution is more honest than applying rule-of-thumb cutoffs (CFI > 0.95, RMSEA < 0.06, SRMR < 0.08), which were derived under continuous-data ML and do not transfer cleanly to ordinal WLSMV.
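The size of the logistic versus probit mismatch can be illustrated with a quick numerical check. The scaling constant 1.702 is the conventional value used to align the two curves; it is an assumption made for this illustration and is not something the function applies.

```r
# Compare the logistic CDF with a rescaled normal CDF on a fine grid.
x <- seq(-6, 6, by = 0.01)
max_diff <- max(abs(plogis(x) - pnorm(x / 1.702)))
max_diff  # largest absolute gap between the two response curves on the grid
```

The gap stays below 0.01 in probability units, which is small but not zero, so some misfit remains even when the data are exactly unidimensional under the PCM.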
Iteration failures. Some simulated datasets cause WLSMV to
fail (non-positive-definite polychoric matrix, boundary thresholds,
empty categories). Failed iterations are recorded with a character
message and dropped; actual_iterations reflects the number that
succeeded.
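The failure handling described here could be sketched as below. Both run_iteration() and toy_fit() are hypothetical stand-ins: the toy fit fails on every fifth iteration and returns fixed numbers otherwise, purely to show how failures are turned into messages and then dropped.

```r
# Wrap one iteration so that an error becomes a character message instead of aborting.
run_iteration <- function(i, fit_fun) {
  tryCatch(
    fit_fun(i),
    error = function(e) conditionMessage(e)
  )
}

# Hypothetical fit function: fails on every fifth iteration.
toy_fit <- function(i) {
  if (i %% 5 == 0) stop("polychoric matrix not positive definite")
  c(cfi = 0.99, rmsea = 0.02, srmr = 0.03)
}

results <- lapply(1:20, run_iteration, fit_fun = toy_fit)

# Keep only successful iterations; failures are the character entries.
ok <- !vapply(results, is.character, logical(1))
simulated <- as.data.frame(do.call(rbind, results[ok]))
simulated <- cbind(iteration = which(ok), simulated)
actual_iterations <- sum(ok)
```

If actual_iterations ends up well below the requested number, the cutoffs rest on fewer draws than intended and should be read with corresponding caution.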
Companion functions. See RMdimCFAPlot for a
faceted visualisation of the observed value against each simulated
distribution. This test complements RMdimResidualPCA (which
identifies which items deviate) and RMdimMartinLof (which
tests a specific hypothesised partition).
References
Yuan, K.-H., & Bentler, P. M. (2000). Three likelihood-based methods for mean and covariance structure analysis with nonnormal missing data. Sociological Methodology, 30(1), 165-200. doi:10.1111/0081-1750.00078
Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. doi:10.18637/jss.v048.i02
Examples
# \donttest{
if (requireNamespace("lavaan", quietly = TRUE)) {
  data("raschdat1", package = "eRm")
  # Few iterations for a fast example; use 250+ in real analyses
  result <- RMdimCFACutoff(raschdat1[, 1:8], iterations = 50,
                           parallel = FALSE, seed = 1, output = "list")
  result$observed
  result$flagged
  if (requireNamespace("ggplot2", quietly = TRUE)) {
    RMdimCFAPlot(result)
  }
}
# }