RMU --- relative measurement uncertainty

Magnus Johansson

RMU — relative measurement uncertainty

Comparing methods and metrics

Author

Affiliation

Magnus Johansson, PhD

Karolinska Institutet

Published

2025-10-23

Modified

2025-10-24

1 Introduction

This post has two purposes. The primary one is to compare some reliability metrics, since a new preprint (Bignardi, Kievit, and Bürkner 2025) inspired some ideas. They propose a general “relative measurement uncertainty” (RMU) metric, applicable in many situations thanks to Bayesian modeling and posterior MCMC draws. My idea that is evaluated here is to use “plausible values” (Mislevy 1991; Wu 2005) instead of fully Bayesian model posterior draws to estimate the RMU. This would make the RMU metric more accessible to users unfamiliar with Bayesian statistics, and easily applicable for existing Rasch measurement software in R, such as my package easyRasch. Please read the preprint by Bignardi and colleagues to learn how the RMU is estimated. A summary is available on pages 12-13.

The secondary purpose is to show how to fit Rasch models, dichotomous and partial credit models (PCM), using Bayesian models with brms, and make some comparisons of person and item parameters relative CML/WLE estimation. There is already an excellent paper on Bayesian Item Response Modeling (Bürkner 2021), but I thought a blog post like this might still be a useful addition to learning materials available. And also an opportunity for myself to learn. This will be added to the post when I have time to work on it.

Code

library(tidyverse)
library(ggdist)
library(easyRasch)
library(brms)
library(qs2)

set.seed(789012475)

2 Method

We won’t check model fit in this initial part, as we focus on how the reliability metrics compare and how to specify Rasch models using brms. Weakly informative priors are used for item parameters, borrowing from Bürkner (2021). We use four parallel chains with four CPU cores, each with 2000 draws of which half are warmup, resulting in 4000 post-warmup draws from the posterior distribution. In the dichotomous example, we also fit the same model with 8000 draws (code not shown below).

I will at some point add descriptions of the comparison metrics used, and how the mirt and TAM packages generate plausible values. For now, I hope this will be of use as it is.

3 Rasch dichotomous model

For this example, the eRm::raschdat1 dataset is used, selecting the first 20 items. The dataset has 100 respondents.

3.1 Bayesian model

We need the dataset in long form for brms.

Code

df <- eRm::raschdat1[,1:20] %>%
  rownames_to_column("id") %>%
  pivot_longer(!id, names_to = "item", values_to = "response")

rasch_prior <- prior("normal(0, 3)", class = "sd", group = "id") +
  prior("normal(0, 3)", class = "sd", group = "item")

brms_rasch <- brm(
  response ~ 0 + (1 | item) + (1 | id),
  data    = df,
  prior = rasch_prior,
  chains  = 4,
  cores   = 4,
  iter = 2000,
  family = "bernoulli"
)

qs_save(brms_rasch, file = "brms_rasch4000.qs2",compress_level = 11)
qs_save(brms_rasch2, file = "brms_rasch8000.qs2",compress_level = 13)

Code

brms_rasch8000 <- qs_read("brms_rasch8000.qs2")
brms_rasch4000 <- qs_read("brms_rasch4000.qs2")

# summary(brms_rasch)
# loo(brms_rasch)

posterior_draws_rm4000 = brms_rasch4000 %>%
  as_draws_df() %>%
  select(starts_with("r_id")) %>%
  t()

posterior_draws_rm8000 = brms_rasch8000 %>%
  as_draws_df() %>%
  select(starts_with("r_id")) %>%
  t()

4000 draws:

Code

RMUreliability(posterior_draws_rm4000)

  rmu_estimate hdci_lowerbound hdci_upperbound .width .point .interval
1    0.7500278       0.6771337       0.8131132   0.95   mean      hdci

8000 draws:

Code

RMUreliability(posterior_draws_rm8000)

  rmu_estimate hdci_lowerbound hdci_upperbound .width .point .interval
1    0.7524065       0.6819247       0.8210144   0.95   mean      hdci

3.2 Plausible values

As seen in the code below, I did some simple timing tests to compare the performance of TAM and mirt packages, where the latter has options in how to estimate thetas (latent scores) used for generating the plausible values (not available in TAM). Since WLE is generally considered to have the lowest bias (Kreiner 2025), this is the default choice in the RIreliability() function, dubbed WLE-RMU.

Code

# defining a separate function for RIreliability to use in this post, in case I make changes later on in the easyRasch package
RIreliability3 <- function(data, conf_int = .95, draws = 1000, estim = "WLE", boot = FALSE, cpu = 4, pv = "mirt") {
  
  message("Note that PSI is calculated with max/min scoring individuals excluded.")
  
  if(min(as.matrix(data), na.rm = T) > 0) {
    stop("The lowest response category needs to coded as 0. Please recode your data.")
  } else if(max(as.matrix(data), na.rm = T) == 1) {
    model <- "RM"
  } else if(max(as.matrix(data), na.rm = T) > 1) {
    model <- "PCM"
  }
  
  if (model == "PCM") {
    mirt_out <- mirt(
      data,
      model = 1,
      itemtype = "Rasch",
      verbose = FALSE,
      accelerate = "squarem"
    )
    erm_out <- eRm::PCM(data)
  } else if (model == "RM") {
    mirt_out <- mirt(
      data,
      model = 1,
      itemtype = "1PL",
      verbose = FALSE,
      accelerate = "squarem"
    )
    erm_out <- eRm::RM(data)
  }
  
  wle <- RI_iarm_person_estimates(erm_out, properties = TRUE)[[2]] %>%
    as.data.frame()
  rownames(wle) <- NULL
  
  empirical_rel <- mirt::fscores(mirt_out,
                                 method = estim,
                                 theta_lim = c(-10, 10),
                                 full.scores.SE = TRUE,
                                 verbose = FALSE) %>%
    mirt::empirical_rxx()
  
  if (boot == TRUE) {
    require(doParallel)
    registerDoParallel(cores = cpu)
    # bootstrap CI for empirical
    fit <- data.frame()
    fit <- foreach(i = 1:draws, .combine = rbind) %dopar% {
      
      dat <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ]
      
      if (model == "PCM") {
        mirt_out2 <- mirt(
          dat,
          model = 1,
          itemtype = "Rasch",
          verbose = FALSE,
          accelerate = "squarem"
        )
      } else if (model == "RM") {
        mirt_out2 <- mirt(
          dat,
          model = 1,
          itemtype = "1PL",
          verbose = FALSE,
          accelerate = "squarem"
        )
      }
      
      mirt::fscores(mirt_out2,
                    method = estim,
                    theta_lim = c(-10, 10),
                    full.scores.SE = TRUE,
                    verbose = FALSE) %>%
        mirt::empirical_rxx()
      
    }
    
    emp_boot <- mean_hdci(fit) %>%
      mutate(across(where(is.numeric), ~ round(.x, 3)))
    
  }
  
  if (pv == "mirt") {
  plvals <- mirt::fscores(mirt_out, method = estim,
                          theta_lim = c(-10, 10),
                          plausible.draws = draws,
                          plausible.type = "MH",
                          verbose = FALSE)
  
  rmu <- do.call(cbind.data.frame, plvals) %>%
    RMUreliability(level = conf_int) %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
  } else if (pv == "TAM") {
    estim = "EAP"
    if (model == "PCM") {
    tam_out <- TAM::tam.mml(resp = data, irtmodel = "PCM", verbose = FALSE)
    } else if (model == "RM") {
    tam_out <- TAM::tam.mml(resp = data, irtmodel = "1PL", verbose = FALSE)
    }
    
    plvals <- TAM::tam.pv(tam_out, nplausible = draws, verbose = FALSE)[["pv"]][,-1]
    
    rmu <- RMUreliability(plvals, level = conf_int) %>%
    mutate(across(where(is.numeric), ~ round(.x, 3)))
  }
  
  message(paste0("RMU reliability estimates based on ",draws," posterior draws (plausible values) from ",nrow(data)," respondents.",
                 "\nSee Bignardi, Kievit, & Bürkner (2025). 'A general method for estimating reliability using Bayesian Measurement Uncertainty' for details. ",
                 "The procedure has been modified in `easyRasch` to use plausible values based on an MML estimated Rasch model."))
  
  if (boot == TRUE) {
    
    return(list(WLE = wle,
                PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(),
                Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)),
                Empirical_bootstrap = paste0(estim,"_empirical = ",emp_boot$y," (95% HDCI [",emp_boot$ymin,", ",emp_boot$ymax,"]) (",draws," bootstrap resamples)"),
                RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv)
    )
    )
  } else {
    return(list(WLE = wle,
                PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(),
                Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)),
                RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv)
    )
    )
  }
}

Code

t0 <- Sys.time()
rmu_mirt_wle4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000)
t1 <- Sys.time()
t1 - t0 # 4.5s

t0 <- Sys.time()
rmu_mirt_eap4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP")
t1 <- Sys.time()
t1 - t0 # 4.1s

t0 <- Sys.time()
rmu_tam_eap4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000, pv = "TAM")
t1 <- Sys.time()
t1 - t0 # 13.9s

Using mirt, there is not much difference between WLE and EAP, approximately 4.1s for EAP and 4.5s for WLE. TAM takes quite a bit longer, at around 14s. I only did two iterations, so take this with a grain of salt.

Code

t0 <- Sys.time()
rmu_mirt_wle8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000)
t1 <- Sys.time()
t1 - t0 # 8.1 

t0 <- Sys.time()
rmu_mirt_eap8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000, estim = "EAP")
t1 <- Sys.time()
t1 - t0 # 8.1

t0 <- Sys.time()
rmu_tam_eap8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000, pv = "TAM")
t1 <- Sys.time()
t1 - t0 # 31s

With 8000 draws, WLE and EAP with mirt are both at 8.1s, while TAM takes about 31s.

3.3 Conditional reliability

This function and its code is borrowed from a recent paper by McNeish and Dumas (2025). Alpha and omega are reported with 95% confidence intervals.

Alpha
Omega

Code

RIrelRep(eRm::raschdat1[,1:20])$plot

Coefficient Alpha:                  0.75
Coefficient Alpha 95% CI:          [0.68,0.82]

People above Alpha interval:    0%
People within Alpha interval   :    99%
People below  Alpha interval:    1%

Code

RIrelRep(eRm::raschdat1[,1:20], rel = "omega")$plot

Coefficient Omega:                  0.72
Coefficient Omega 95% CI:          [0.64,0.79]

People above Omega interval:    0%
People within Omega interval   :    100%
People below  Omega interval:    0%

3.4 Summarizing RMU findings

Since RMU uses a random pairing of posterior draws, there is variation in both point values and HDCI estimates if one runs the RMU estimation repeatedly using the same set of draws. Let’s investigate how much it varies, using the brms model with 4000 draws.

Code

rel <- map_dfr(1:1000, ~ RMUreliability(posterior_draws_rm4000)[1:3])
qs_save(rel,"rel4000.qs2")

Code

rel <- qs_read("rel4000.qs2")
rel_text <- rel %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>% 
  summarise(text_location = mean(value))

rel %>%
  pivot_longer(everything()) %>% 
  ggplot(aes(x = value, y = 1, fill = name)) +
  stat_slabinterval() +
  geom_text(data = rel_text,
            aes(label = name, x = text_location),
            position = position_nudge(y = -0.04)) +
  labs(title = "RMU random variation",
       subtitle = "Using a `brms` Rasch model with 4000 posterior draws, estimating RMU 1000 times",
       x = "Relative Measurement Uncertainty",
       y = NULL) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.title.x = element_text(vjust = -0.1)) +
  scale_x_continuous(limits = c(0.65,0.85), breaks = scales::breaks_pretty(8))

CI’s have more variation than the point estimate.

Code

rel %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  reframe(min = min(value), 
          max = max(value),
          mean_hdci = mean_hdci(value))

# A tibble: 3 × 4
  name              min   max mean_hdci$y $ymin $ymax $.width $.point $.interval
  <chr>           <dbl> <dbl>       <dbl> <dbl> <dbl>   <dbl> <chr>   <chr>     
1 hdci_lowerbound 0.664 0.688       0.679 0.673 0.686    0.95 mean    hdci      
2 hdci_upperbound 0.809 0.830       0.818 0.812 0.825    0.95 mean    hdci      
3 rmu_estimate    0.748 0.752       0.750 0.749 0.751    0.95 mean    hdci

Let’s do the same with 8000 draws.

Code

rel8 <- map_dfr(1:1000, ~ RMUreliability(posterior_draws_rm8000)[1:3])
qs_save(rel8,"rel8000.qs2")

Code

rel8 <- qs_read("rel8000.qs2")

rel8_text <- rel8 %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>% 
  summarise(text_location = mean(value))

rel8 %>%
  pivot_longer(everything()) %>% 
  ggplot(aes(x = value, y = 1, fill = name)) +
  stat_slabinterval() +
  geom_text(data = rel_text,
            aes(label = name, x = text_location),
            position = position_nudge(y = -0.04)) +
  labs(title = "RMU random variation",
       subtitle = "Using a `brms` Rasch model with 8000 posterior draws, estimating RMU 1000 times",
       x = "Relative Measurement Uncertainty",
       y = NULL) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.title.x = element_text(vjust = -0.1)) +
  scale_x_continuous(limits = c(0.65,0.85), breaks = scales::breaks_pretty(8))

Code

rel8 %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  reframe(min = min(value), 
          max = max(value),
          mean_hdci = mean_hdci(value))

# A tibble: 3 × 4
  name              min   max mean_hdci$y $ymin $ymax $.width $.point $.interval
  <chr>           <dbl> <dbl>       <dbl> <dbl> <dbl>   <dbl> <chr>   <chr>     
1 hdci_lowerbound 0.670 0.685       0.678 0.673 0.683    0.95 mean    hdci      
2 hdci_upperbound 0.811 0.825       0.819 0.814 0.823    0.95 mean    hdci      
3 rmu_estimate    0.748 0.751       0.750 0.749 0.750    0.95 mean    hdci

Less variation - more precision in estimates with 8000 draws compared to 4000 draws.

How about just estimating RMU 50 times, does it converge on the same mean values for point and CI estimates, compared to estimating RMU 1000 times? This is obviously not a thorough test, it should be evaluated more properly at some point.

Code

t0 <- Sys.time()
rel50 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm4000)[1:3])
t1 <- Sys.time()
t1 - t0 # ~3.4s

Time difference of 3.317572 secs

Code

rel50 %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  reframe(mean_hdci = mean_hdci(value))

# A tibble: 3 × 2
  name            mean_hdci$y $ymin $ymax $.width $.point $.interval
  <chr>                 <dbl> <dbl> <dbl>   <dbl> <chr>   <chr>     
1 hdci_lowerbound       0.679 0.674 0.684    0.95 mean    hdci      
2 hdci_upperbound       0.818 0.814 0.824    0.95 mean    hdci      
3 rmu_estimate          0.750 0.749 0.751    0.95 mean    hdci

Clearly seems more stable than just estimating RMU once.

This finding leads me to incorporate the 50x RMU estimation as a default setting in the RIreliability() function in the next version update. Results reported below are based on this method.

Code

RIreliability50 <- function(data, conf_int = .95, draws = 1000, estim = "WLE", boot = FALSE, cpu = 4, pv = "mirt", iter = 50, verbose = TRUE) {
  
  if (verbose == TRUE) {
  message("Note that PSI is calculated with max/min scoring individuals excluded.")
  message(paste0("RMU reliability estimates based on ",draws," posterior draws (plausible values) from ",nrow(data)," respondents.",
                 "\nSee Bignardi, Kievit, & Bürkner (2025). 'A general method for estimating reliability using Bayesian Measurement Uncertainty' for details. ",
                 "The procedure has been modified in `easyRasch` to use plausible values based on an MML estimated Rasch model."))
  }
  
  if(min(as.matrix(data), na.rm = T) > 0) {
    stop("The lowest response category needs to coded as 0. Please recode your data.")
  } else if(max(as.matrix(data), na.rm = T) == 1) {
    model <- "RM"
  } else if(max(as.matrix(data), na.rm = T) > 1) {
    model <- "PCM"
  }
  
  if (model == "PCM") {
    mirt_out <- mirt(
      data,
      model = 1,
      itemtype = "Rasch",
      verbose = FALSE,
      accelerate = "squarem"
    )
    erm_out <- eRm::PCM(data)
  } else if (model == "RM") {
    mirt_out <- mirt(
      data,
      model = 1,
      itemtype = "1PL",
      verbose = FALSE,
      accelerate = "squarem"
    )
    erm_out <- eRm::RM(data)
  }
  
  wle <- RI_iarm_person_estimates(erm_out, properties = TRUE)[[2]] %>%
    as.data.frame()
  rownames(wle) <- NULL
  
  empirical_rel <- mirt::fscores(mirt_out,
                                 method = estim,
                                 theta_lim = c(-10, 10),
                                 full.scores.SE = TRUE,
                                 verbose = FALSE) %>%
    mirt::empirical_rxx()
  
  if (boot == TRUE) {
    require(doParallel)
    registerDoParallel(cores = cpu)
    # bootstrap CI for empirical
    fit <- data.frame()
    fit <- foreach(i = 1:draws, .combine = rbind) %dopar% {
      
      dat <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ]
      
      if (model == "PCM") {
        mirt_out2 <- mirt(
          dat,
          model = 1,
          itemtype = "Rasch",
          verbose = FALSE,
          accelerate = "squarem"
        )
      } else if (model == "RM") {
        mirt_out2 <- mirt(
          dat,
          model = 1,
          itemtype = "1PL",
          verbose = FALSE,
          accelerate = "squarem"
        )
      }
      
      mirt::fscores(mirt_out2,
                    method = estim,
                    theta_lim = c(-10, 10),
                    full.scores.SE = TRUE,
                    verbose = FALSE) %>%
        mirt::empirical_rxx()
      
    }
    
    emp_boot <- mean_hdci(fit) %>%
      mutate(across(where(is.numeric), ~ round(.x, 3)))
    
  }
  
  if (pv == "mirt") {
  plvals <- mirt::fscores(mirt_out, method = estim,
                          theta_lim = c(-10, 10),
                          plausible.draws = draws,
                          plausible.type = "MH",
                          verbose = FALSE)
  
  rmu_draws <- do.call(cbind.data.frame, plvals) 
  rmu_iter <- map_dfr(1:iter, ~ RMUreliability(rmu_draws, level = conf_int)[1:3])

  rmu <- rmu_iter %>% 
    summarise(rmu_estimate = mean(rmu_estimate),
              hdci_lowerbound = mean(hdci_lowerbound),
              hdci_upperbound = mean(hdci_upperbound)
              ) %>% 
    mutate(across(where(is.numeric), ~ round(.x, 3)))
  
  } else if (pv == "TAM") {
    estim = "EAP"
    if (model == "PCM") {
    tam_out <- TAM::tam.mml(resp = data, irtmodel = "PCM", verbose = FALSE)
    } else if (model == "RM") {
    tam_out <- TAM::tam.mml(resp = data, irtmodel = "1PL", verbose = FALSE)
    }
    
    plvals <- TAM::tam.pv(tam_out, nplausible = draws, verbose = FALSE)[["pv"]][,-1]
    
    rmu_iter <- map_dfr(1:iter, ~ RMUreliability(plvals, level = conf_int)[1:3])

    rmu <- rmu_iter %>% 
      summarise(rmu_estimate = mean(rmu_estimate),
                hdci_lowerbound = mean(hdci_lowerbound),
                hdci_upperbound = mean(hdci_upperbound)
                ) %>% 
      mutate(across(where(is.numeric), ~ round(.x, 3)))
  }
  
  if (boot == TRUE) {
    
    return(list(WLE = wle,
                PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(),
                Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)),
                Empirical_bootstrap = paste0(estim,"_empirical = ",emp_boot$y," (95% HDCI [",emp_boot$ymin,", ",emp_boot$ymax,"]) (",draws," bootstrap resamples)"),
                RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv," and ",iter," RMU iterations.")
    )
    )
  } else {
    return(list(WLE = wle,
                PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(),
                Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)),
                RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv," and ",iter," RMU iterations.")
    )
    )
  }
}

Using the 50 iteration setting 10 times with 4000 draws. What is the variation!?

Code

iter50 <- map_vec(1:10, ~ RIreliability50(raschdat1[,1:20], draws = 4000, verbose = FALSE)$RMU)
iter50

Point estimate varies from .761 to .764, with lower/upper CI ranging from .694 to .698 and .825 to .827. Estimates seem unlikely to vary more than 0.01, which I find acceptable.

3.5 Summary table

Code

rmu_mirt_wle_full <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, verbose = F)
rmu_mirt_eap_full <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP", verbose = F)

t0 <- Sys.time()
rmu_mirt_wle4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, verbose = F)$RMU
t1 <- Sys.time()
t1 - t0 # 8.3s

t0 <- Sys.time()
rmu_mirt_eap4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP", verbose = F)$RMU
t1 <- Sys.time()
t1 - t0 # 7.7s

t0 <- Sys.time()
rmu_tam_eap4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, pv = "TAM", verbose = F)$RMU
t1 <- Sys.time()
t1 - t0 # 10.6s

brms4000 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm4000)[1:3]) %>% 
  summarise(rmu_estimate = mean(rmu_estimate),
              hdci_lowerbound = mean(hdci_lowerbound),
              hdci_upperbound = mean(hdci_upperbound)
              ) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 3)))

brms8000 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm8000)[1:3]) %>% 
  summarise(rmu_estimate = mean(rmu_estimate),
              hdci_lowerbound = mean(hdci_lowerbound),
              hdci_upperbound = mean(hdci_upperbound)
              ) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 3)))

Code

data.frame(Metric = c("RMU brms 8000","RMU brms 4000","WLE-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (TAM)","EAP (TAM)","PSI (eRm)","WLE empirical (mirt)","EAP empirical (mirt)","Alpha","Omega"),
           Estimate = c("0.750, 95% HDCI [0.678, 0.818]",
                        "0.750, 95% HDCI [0.679, 0.818]",
                        "0.762, 95% HDCI [0.696, 0.827]",
                        "0.762, 95% HDCI [0.695, 0.826]",
                        "0.740, 95% HDCI [0.669, 0.809]",
                        "0.749",
                        "0.747",
                        "0.792",
                        "0.761",
                        "0.75 95% CI [0.68,0.82]",
                        "0.72 95% CI [0.64,0.80]")
           ) %>% 
  knitr::kable()

Metric	Estimate
RMU brms 8000	0.750, 95% HDCI [0.678, 0.818]
RMU brms 4000	0.750, 95% HDCI [0.679, 0.818]
WLE-RMU PV 4000 (mirt)	0.762, 95% HDCI [0.696, 0.827]
EAP-RMU PV 4000 (mirt)	0.762, 95% HDCI [0.695, 0.826]
EAP-RMU PV 4000 (TAM)	0.740, 95% HDCI [0.669, 0.809]
EAP (TAM)	0.749
PSI (eRm)	0.747
WLE empirical (mirt)	0.792
EAP empirical (mirt)	0.761
Alpha	0.75 95% CI [0.68,0.82]
Omega	0.72 95% CI [0.64,0.80]

4 Partial credit model (PCM)

For this example, the eRm::pcmdat2 dataset is used. It has 4 items with 3 response categories each, and 300 respondents. It might not be an optimal dataset to use since item 3 is a bit overfit to the PCM.

4.1 Bayesian model

To fit a PCM with brms, we need family = "acat", which defaults to use logit link.

Code

df2 <- pcmdat2 %>%
  mutate(across(everything(), ~ .x + 1)) %>% # lowest category needs to be 1, unlike most Rasch packages that use 0
  rownames_to_column("id") %>%
  pivot_longer(!id, names_to = "item", values_to = "response")

get_prior(response | thres(gr = item) ~ 1 + (1 | id),
  data    = df2,
  family  = "acat")

prior_pcm <- prior("normal(0, 5)", class = "Intercept") +
  prior("normal(0, 3)", class = "sd", group = "id")

brms_pcm <- brm(
  response | thres(gr = item) ~ 1 + (1 | id),
  data    = df2,
  prior   = prior_pcm,
  chains  = 4,
  cores   = 4,
  iter    = 2000,
  family  = "acat",
  control = list(adapt_delta = 0.99)
)
loo(brms_pcm)

qs_save(brms_pcm, file = "brms_pcm.qs2", compress_level = 11)

Code

brms_pcm <- qs_read("brms_pcm.qs2")

posterior_draws_pcm <- brms_pcm %>%
  as_draws_df() %>%
  select(starts_with("r_id")) %>%
  t()

4.2 Conditional reliability

Alpha
Omega

Code

RIrelRep(eRm::pcmdat2)$plot

Coefficient Alpha:                  0.66
Coefficient Alpha 95% CI:          [0.59,0.72]

People above Alpha interval:    0%
People within Alpha interval   :    87.67%
People below  Alpha interval:    12.33%

Code

RIrelRep(eRm::pcmdat2, rel = "omega")$plot

Coefficient Omega:                  0.63
Coefficient Omega 95% CI:          [0.56,0.7]

People above Omega interval:    50%
People within Omega interval   :    37.67%
People below  Omega interval:    12.33%

4.3 Summary table

Code

brms4000pcm <- map_dfr(1:50, ~ RMUreliability(posterior_draws_pcm)[1:3]) %>% 
  summarise(rmu_estimate = mean(rmu_estimate),
              hdci_lowerbound = mean(hdci_lowerbound),
              hdci_upperbound = mean(hdci_upperbound)
              ) %>% 
  mutate(across(where(is.numeric), ~ round(.x, 3)))

rmu_mirt_wle_full <- RIreliability50(eRm::pcmdat2, draws = 4000, verbose = F)
rmu_mirt_eap_full <- RIreliability50(eRm::pcmdat2, draws = 4000, estim = "EAP", verbose = F)

rmu_mirt_wle4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, verbose = F)$RMU
rmu_mirt_eap4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, estim = "EAP", verbose = F)$RMU
rmu_tam_eap4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, pv = "TAM", verbose = F)$RMU

Code

data.frame(Metric = c("RMU brms 4000","WLE-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (TAM)","EAP","WLE empirical (mirt)","EAP empirical (mirt)","PSI","Alpha","Omega"),
           Estimate = c("0.672, 95% HDCI [0.615, 0.729]",
                        "0.671, 95% HDCI [0.618, 0.721]",
                        "0.670, 95% HDCI [0.618, 0.721]",
                        "0.672, 95% HDCI [0.621, 0.722]",
                        "0.669",
                        "0.718",
                        "0.670",
                        "0.380",
                        "0.66 95% CI [0.59,0.72]",
                        "0.63 95% CI [0.56,0.70]")
           ) %>% 
  knitr::kable()

Metric	Estimate
RMU brms 4000	0.672, 95% HDCI [0.615, 0.729]
WLE-RMU PV 4000 (mirt)	0.671, 95% HDCI [0.618, 0.721]
EAP-RMU PV 4000 (mirt)	0.670, 95% HDCI [0.618, 0.721]
EAP-RMU PV 4000 (TAM)	0.672, 95% HDCI [0.621, 0.722]
EAP	0.669
WLE empirical (mirt)	0.718
EAP empirical (mirt)	0.670
PSI	0.380
Alpha	0.66 95% CI [0.59,0.72]
Omega	0.63 95% CI [0.56,0.70]

5 Session info

Code

sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Stockholm
tzcode source: internal

attached base packages:
 [1] parallel  grid      stats4    stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] qs2_0.1.5         brms_2.23.0       Rcpp_1.0.14       easyRasch_0.4.1.2
 [5] Bayesrel_0.7.8    doParallel_1.0.17 iterators_1.0.14  furrr_0.3.1      
 [9] future_1.67.0     foreach_1.5.2     janitor_2.2.1     iarm_0.4.3       
[13] hexbin_1.28.4     catR_3.17         glue_1.8.0        ggrepel_0.9.6    
[17] patchwork_1.3.2   reshape_0.8.10    matrixStats_1.4.1 psychotree_0.16-2
[21] psychotools_0.7-4 partykit_1.2-24   mvtnorm_1.3-3     libcoin_1.0-10   
[25] psych_2.5.6       mirt_1.45.1       lattice_0.22-6    eRm_1.0-10       
[29] kableExtra_1.4.0  formattable_0.2.1 ggdist_3.3.3      lubridate_1.9.4  
[33] forcats_1.0.0     stringr_1.5.2     dplyr_1.1.4       purrr_1.1.0      
[37] readr_2.1.5       tidyr_1.3.1       tibble_3.3.0      ggplot2_3.5.2    
[41] tidyverse_2.0.0  

loaded via a namespace (and not attached):
  [1] splines_4.4.3        R.oo_1.26.0          cellranger_1.1.0    
  [4] rpart_4.1.24         lifecycle_1.0.4      Rdpack_2.6.1        
  [7] StanHeaders_2.32.10  rprojroot_2.1.1      globals_0.18.0      
 [10] MASS_7.3-64          backports_1.5.0      magrittr_2.0.3      
 [13] vcd_1.4-12           Hmisc_5.2-4          rmarkdown_2.29      
 [16] yaml_2.3.10          pkgbuild_1.4.6       sessioninfo_1.2.3   
 [19] pbapply_1.7-2        RColorBrewer_1.1-3   multcomp_1.4-26     
 [22] abind_1.4-5          audio_0.1-11         quadprog_1.5-8      
 [25] R.utils_2.12.3       TH.data_1.1-2        nnet_7.3-20         
 [28] tensorA_0.36.2.1     sandwich_3.1-1       inline_0.3.19       
 [31] listenv_0.9.1        testthat_3.2.1.1     vegan_2.6-8         
 [34] bridgesampling_1.1-2 parallelly_1.45.1    svglite_2.1.3       
 [37] permute_0.9-7        codetools_0.2-20     xml2_1.3.6          
 [40] tidyselect_1.2.1     bayesplot_1.14.0     farver_2.1.2        
 [43] base64enc_0.1-3      jsonlite_1.8.9       progressr_0.14.0    
 [46] Formula_1.2-5        survival_3.8-3       emmeans_1.10.4      
 [49] systemfonts_1.2.3    tools_4.4.3          gnm_1.1-5           
 [52] mnormt_2.1.1         gridExtra_2.3        xfun_0.46           
 [55] here_1.0.2           mgcv_1.9-1           distributional_0.4.0
 [58] ca_0.71.1            loo_2.8.0            withr_3.0.2         
 [61] beepr_2.0            fastmap_1.2.0        digest_0.6.37       
 [64] timechange_0.3.0     R6_2.5.1             estimability_1.5.1  
 [67] colorspace_2.1-1     R.methodsS3_1.8.2    inum_1.0-5          
 [70] utf8_1.2.4           generics_0.1.3       data.table_1.17.8   
 [73] SimDesign_2.21       htmlwidgets_1.6.4    pkgconfig_2.0.3     
 [76] gtable_0.3.6         lmtest_0.9-40        brio_1.1.5          
 [79] htmltools_0.5.8.1    lavaan_0.6-20        scales_1.4.0        
 [82] posterior_1.6.0      snakecase_0.11.1     knitr_1.48          
 [85] rstudioapi_0.17.1    tzdb_0.4.0           curl_7.0.0          
 [88] coda_0.19-4.1        checkmate_2.3.2      nlme_3.1-167        
 [91] zoo_1.8-12           relimp_1.0-5         vcdExtra_0.8-5      
 [94] foreign_0.8-88       pillar_1.10.1        vctrs_0.6.5         
 [97] stringfish_0.16.0    xtable_1.8-4         Deriv_4.1.3         
[100] cluster_2.1.8        dcurver_0.9.2        GPArotation_2024.3-1
[103] htmlTable_2.4.3      pbivnorm_0.6.0       evaluate_1.0.1      
[106] cli_3.6.3            compiler_4.4.3       rlang_1.1.6         
[109] rstantools_2.4.0     future.apply_1.20.0  LaplacesDemon_16.1.6
[112] labeling_0.4.3       plyr_1.8.9           rstan_2.32.6        
[115] stringi_1.8.7        QuickJSR_1.3.1       viridisLite_0.4.2   
[118] V8_4.4.2             Brobdingnag_1.2-9    Matrix_1.7-2        
[121] qvcalc_1.0.3         hms_1.1.3            rbibutils_2.2.16    
[124] clipr_0.8.0          RcppParallel_5.1.8   readxl_1.4.5

6 References

Bignardi, Giacomo, Rogier Kievit, and Paul-Christian Bürkner. 2025. “A General Method for Estimating Reliability Using Bayesian Measurement Uncertainty.” OSF. https://doi.org/10.31234/osf.io/h54k8_v1.

Bürkner, Paul-Christian. 2021. “Bayesian Item Response Modeling in R with Brms and Stan.” Journal of Statistical Software 100 (November): 1–54. https://doi.org/10.18637/jss.v100.i05.

Kreiner, Svend. 2025. “On Specific Objectivity and Measurement by Rasch Models: A Statistical Viewpoint.” Educational Methods & Psychometrics 3. https://doi.org/10.61186/emp.2025.7.

McNeish, Daniel, and Denis Dumas. 2025. “Reliability Representativeness: How Well Does Coefficient Alpha Summarize Reliability Across the Score Distribution?” Behavior Research Methods 57 (3): 93. https://doi.org/10.3758/s13428-025-02611-8.

Mislevy, Robert J. 1991. “Randomization-Based Inference about Latent Variables from Complex Samples.” Psychometrika 56 (2): 177–96. https://doi.org/10.1007/BF02294457.

Wu, Margaret. 2005. “The Role of Plausible Values in Large-Scale Surveys.” Studies in Educational Evaluation 31 (2-3): 114–28. https://doi.org/10.1016/j.stueduc.2005.05.005.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{johansson2025,
  author = {Johansson, Magnus},
  title = {RMU -\/-\/- Relative Measurement Uncertainty},
  date = {2025-10-23},
  url = {https://pgmj.github.io/reliability.html},
  langid = {en}
}

For attribution, please cite this work as:

Johansson, Magnus. 2025. “RMU --- Relative Measurement Uncertainty.” October 23, 2025. https://pgmj.github.io/reliability.html.

--- title: "RMU --- relative measurement uncertainty" subtitle: "Comparing methods and metrics" author: name: 'Magnus Johansson' affiliation: 'Karolinska Institutet' orcid: '0000-0003-1669-592X' email: 'magnus.johansson.3@ki.se' degrees: PhD url: https://ki.se/personer/magnus-johansson-3 date: 2025-10-23 date-format: iso date-modified: last-modified google-scholar: true citation: type: 'webpage' format: html: code-fold: true execute: cache: true warning: false message: false editor_options: chunk_output_type: console bibliography: reliability.bib --- ## Introduction This post has two purposes. The primary one is to compare some reliability metrics, since a new preprint [@bignardi_general_2025] inspired some ideas. They propose a general "relative measurement uncertainty" (RMU) metric, applicable in many situations thanks to Bayesian modeling and posterior MCMC draws. My idea that is evaluated here is to use "plausible values" [@mislevy_randomization_based_1991;@wu_role_2005] instead of fully Bayesian model posterior draws to estimate the RMU. This would make the RMU metric more accessible to users unfamiliar with Bayesian statistics, and easily applicable for existing Rasch measurement software in R, such as my package [`easyRasch`](https://pgmj.github.io/easyRasch/index.html). Please read [the preprint](https://osf.io/h54k8_v1) by Bignardi and colleagues to learn how the RMU is estimated. A summary is available on pages 12-13. The secondary purpose is to show how to fit Rasch models, dichotomous and partial credit models (PCM), using Bayesian models with `brms`, and make some comparisons of person and item parameters relative CML/WLE estimation. There is already an excellent paper on Bayesian Item Response Modeling [@burkner_bayesian_2021], but I thought a blog post like this might still be a useful addition to learning materials available. And also an opportunity for myself to learn. This will be added to the post when I have time to work on it. ```{r} library(tidyverse) library(ggdist) library(easyRasch) library(brms) library(qs2) set.seed(789012475) ``` ## Method We won't check model fit in this initial part, as we focus on how the reliability metrics compare and how to specify Rasch models using `brms`. Weakly informative priors are used for item parameters, borrowing from Bürkner [-@burkner_bayesian_2021]. We use four parallel chains with four CPU cores, each with 2000 draws of which half are warmup, resulting in 4000 post-warmup draws from the posterior distribution. In the dichotomous example, we also fit the same model with 8000 draws (code not shown below). I will at some point add descriptions of the comparison metrics used, and how the `mirt` and `TAM` packages generate plausible values. For now, I hope this will be of use as it is. ## Rasch dichotomous model For this example, the `eRm::raschdat1` dataset is used, selecting the first 20 items. The dataset has 100 respondents. ### Bayesian model We need the dataset in long form for `brms`. ```{r} #| eval: false #| code-fold: show df <- eRm::raschdat1[,1:20] %>% rownames_to_column("id") %>% pivot_longer(!id, names_to = "item", values_to = "response") rasch_prior <- prior("normal(0, 3)", class = "sd", group = "id") + prior("normal(0, 3)", class = "sd", group = "item") brms_rasch <- brm( response ~ 0 + (1 | item) + (1 | id), data = df, prior = rasch_prior, chains = 4, cores = 4, iter = 2000, family = "bernoulli" ) qs_save(brms_rasch, file = "brms_rasch4000.qs2",compress_level = 11) qs_save(brms_rasch2, file = "brms_rasch8000.qs2",compress_level = 13) ``` ```{r} brms_rasch8000 <- qs_read("brms_rasch8000.qs2") brms_rasch4000 <- qs_read("brms_rasch4000.qs2") # summary(brms_rasch) # loo(brms_rasch) posterior_draws_rm4000 = brms_rasch4000 %>% as_draws_df() %>% select(starts_with("r_id")) %>% t() posterior_draws_rm8000 = brms_rasch8000 %>% as_draws_df() %>% select(starts_with("r_id")) %>% t() ``` 4000 draws: ```{r} RMUreliability(posterior_draws_rm4000) ``` 8000 draws: ```{r} RMUreliability(posterior_draws_rm8000) ``` ### Plausible values As seen in the code below, I did some simple timing tests to compare the performance of `TAM` and `mirt` packages, where the latter has options in how to estimate thetas (latent scores) used for generating the plausible values (not available in `TAM`). Since WLE is generally considered to have the lowest bias [@kreiner_specific_2025], this is the default choice in the `RIreliability()` function, dubbed WLE-RMU. ```{r} # defining a separate function for RIreliability to use in this post, in case I make changes later on in the easyRasch package RIreliability3 <- function(data, conf_int = .95, draws = 1000, estim = "WLE", boot = FALSE, cpu = 4, pv = "mirt") { message("Note that PSI is calculated with max/min scoring individuals excluded.") if(min(as.matrix(data), na.rm = T) > 0) { stop("The lowest response category needs to coded as 0. Please recode your data.") } else if(max(as.matrix(data), na.rm = T) == 1) { model <- "RM" } else if(max(as.matrix(data), na.rm = T) > 1) { model <- "PCM" } if (model == "PCM") { mirt_out <- mirt( data, model = 1, itemtype = "Rasch", verbose = FALSE, accelerate = "squarem" ) erm_out <- eRm::PCM(data) } else if (model == "RM") { mirt_out <- mirt( data, model = 1, itemtype = "1PL", verbose = FALSE, accelerate = "squarem" ) erm_out <- eRm::RM(data) } wle <- RI_iarm_person_estimates(erm_out, properties = TRUE)[[2]] %>% as.data.frame() rownames(wle) <- NULL empirical_rel <- mirt::fscores(mirt_out, method = estim, theta_lim = c(-10, 10), full.scores.SE = TRUE, verbose = FALSE) %>% mirt::empirical_rxx() if (boot == TRUE) { require(doParallel) registerDoParallel(cores = cpu) # bootstrap CI for empirical fit <- data.frame() fit <- foreach(i = 1:draws, .combine = rbind) %dopar% { dat <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ] if (model == "PCM") { mirt_out2 <- mirt( dat, model = 1, itemtype = "Rasch", verbose = FALSE, accelerate = "squarem" ) } else if (model == "RM") { mirt_out2 <- mirt( dat, model = 1, itemtype = "1PL", verbose = FALSE, accelerate = "squarem" ) } mirt::fscores(mirt_out2, method = estim, theta_lim = c(-10, 10), full.scores.SE = TRUE, verbose = FALSE) %>% mirt::empirical_rxx() } emp_boot <- mean_hdci(fit) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } if (pv == "mirt") { plvals <- mirt::fscores(mirt_out, method = estim, theta_lim = c(-10, 10), plausible.draws = draws, plausible.type = "MH", verbose = FALSE) rmu <- do.call(cbind.data.frame, plvals) %>% RMUreliability(level = conf_int) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } else if (pv == "TAM") { estim = "EAP" if (model == "PCM") { tam_out <- TAM::tam.mml(resp = data, irtmodel = "PCM", verbose = FALSE) } else if (model == "RM") { tam_out <- TAM::tam.mml(resp = data, irtmodel = "1PL", verbose = FALSE) } plvals <- TAM::tam.pv(tam_out, nplausible = draws, verbose = FALSE)[["pv"]][,-1] rmu <- RMUreliability(plvals, level = conf_int) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } message(paste0("RMU reliability estimates based on ",draws," posterior draws (plausible values) from ",nrow(data)," respondents.", "\nSee Bignardi, Kievit, & Bürkner (2025). 'A general method for estimating reliability using Bayesian Measurement Uncertainty' for details. ", "The procedure has been modified in `easyRasch` to use plausible values based on an MML estimated Rasch model.")) if (boot == TRUE) { return(list(WLE = wle, PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(), Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)), Empirical_bootstrap = paste0(estim,"_empirical = ",emp_boot$y," (95% HDCI [",emp_boot$ymin,", ",emp_boot$ymax,"]) (",draws," bootstrap resamples)"), RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv) ) ) } else { return(list(WLE = wle, PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(), Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)), RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv) ) ) } } ``` ```{r} #| eval: false t0 <- Sys.time() rmu_mirt_wle4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000) t1 <- Sys.time() t1 - t0 # 4.5s t0 <- Sys.time() rmu_mirt_eap4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP") t1 <- Sys.time() t1 - t0 # 4.1s t0 <- Sys.time() rmu_tam_eap4000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 4000, pv = "TAM") t1 <- Sys.time() t1 - t0 # 13.9s ``` Using `mirt`, there is not much difference between WLE and EAP, approximately 4.1s for EAP and 4.5s for WLE. `TAM` takes quite a bit longer, at around 14s. I only did two iterations, so take this with a grain of salt. ```{r} #| eval: false t0 <- Sys.time() rmu_mirt_wle8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000) t1 <- Sys.time() t1 - t0 # 8.1 t0 <- Sys.time() rmu_mirt_eap8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000, estim = "EAP") t1 <- Sys.time() t1 - t0 # 8.1 t0 <- Sys.time() rmu_tam_eap8000 <- RIreliability3(eRm::raschdat1[,1:20], draws = 8000, pv = "TAM") t1 <- Sys.time() t1 - t0 # 31s ``` With 8000 draws, WLE and EAP with `mirt` are both at 8.1s, while `TAM` takes about 31s. ### Conditional reliability This function and its code is borrowed from a recent paper by McNeish and Dumas [-@mcneish_reliability_2025]. Alpha and omega are reported with 95% confidence intervals. ```{r} #| echo: false RIrelRep <- function(data, items=c(names(data)), rel="alpha", missing="NA", method="CI", width=NULL, raw.low=NULL, raw.high=NULL) { if ( !(method%in% c("CI", "width", "raw"))){ stop("Error: the method input must be one of the following options: 'CI','width', or 'raw'") } if (method=="width" & is.null(width)){ stop("Error: method='width' requires a user-defined value in the 'width=' option") } if (method=="raw" & (is.null(raw.low)|is.null(raw.high))){ stop("Error: method='raw' requires a user-defined values both the 'raw.low=' and 'raw.high' options") } ##recode missing data to NA for R ##only used in Shiny app data[data == missing] <- NA #subset the uploaded data to only include selected items dat<-data[,items] ## delete any rows that are all missing to avoid errors dat<-dat[rowSums(is.na(dat)) != ncol(dat), ] #fit IRT model for which sum scores are a sufficient statistic model<-mirt::mirt(data=dat, itemtype="Rasch", verbose=F) ############ #functions for conditional reliability #lifted from ggmirt package (Phillip Masurp, https://github.com/masurp/ggmirt) ############ computeItemtrace <- function(pars, Theta, itemloc, offterm = matrix(0L, 1L, length(itemloc)-1L), CUSTOM.IND, pis = NULL){ if(is.null(pis)){ itemtrace <- .Call('computeItemTrace', pars, Theta, itemloc, offterm) if(length(CUSTOM.IND)){ for(i in CUSTOM.IND) itemtrace[,itemloc[i]:(itemloc[i+1L] - 1L)] <- ProbTrace(pars[[i]], Theta=Theta) } } else { tmp_itemtrace <- vector('list', length(pis)) for(g in seq_len(length(pis))){ tmp_itemtrace[[g]] <- .Call('computeItemTrace', pars[[g]]@ParObjects$pars, Theta, itemloc, offterm) if(length(CUSTOM.IND)){ for(i in CUSTOM.IND) tmp_itemtrace[[g]][,itemloc[i]:(itemloc[i+1L] - 1L)] <- ProbTrace(pars[[g]]@ParObjects$pars[[i]], Theta=Theta) } } itemtrace <- do.call(rbind, tmp_itemtrace) } return(itemtrace) } ExtractGroupPars <- function(x){ if(x@itemclass < 0L) return(list(gmeans=0, gcov=matrix(1))) nfact <- x@nfact gmeans <- x@par[seq_len(nfact)] phi_matches <- grepl("PHI", x@parnames) if (x@dentype == "Davidian") { phi <- x@par[phi_matches] tmp <- x@par[-c(seq_len(nfact), which(phi_matches))] gcov <- matrix(0, nfact, nfact) gcov[lower.tri(gcov, diag=TRUE)] <- tmp gcov <- makeSymMat(gcov) return(list(gmeans=gmeans, gcov=gcov, phi=phi)) } else { par <- x@par if(x@dentype == "mixture") par <- par[-length(par)] # drop pi tmp <- par[-seq_len(nfact)] gcov <- matrix(0, nfact, nfact) gcov[lower.tri(gcov, diag=TRUE)] <- tmp gcov <- makeSymMat(gcov) return(list(gmeans=gmeans, gcov=gcov)) } } makeSymMat <- function(mat){ if(ncol(mat) > 1L){ mat[is.na(mat)] <- 0 mat <- mat + t(mat) - diag(diag(mat)) } mat } #compute number of sum scores and number of people with each scores f<-mirt::fscores(model,method="EAPsum", full.scores=F,verbose=F, na.rm=TRUE) ## functions for computing conditional reliabilty (partially lifted for ggmirt package) nfact <- model@Model$nfact J <- model@Data$nitems theta <- f$F1 ThetaFull <- Theta <- thetaComb(theta, nfact) info <- testinfo(model, ThetaFull) itemtrace <- computeItemtrace(model@ParObjects$pars, ThetaFull, model@Model$itemloc, CUSTOM.IND=model@Internals$CUSTOM.IND) mins <- model@Data$mins maxs <- extract.mirt(model, 'K') + mins - 1 gp <- ExtractGroupPars(model@ParObjects$pars[[J+1]]) score <- c() for(i in 1:J) score <- c(score, (0:(model@Data$K[i]-1) + mins[i]) * (i %in% c(1:J))) score <- matrix(score, nrow(itemtrace), ncol(itemtrace), byrow = TRUE) plt <- data.frame(cbind(info,score=rowSums(score*itemtrace),Theta=Theta)) colnames(plt) <- c("info", "score", "Theta") plt$SE <- 1 / sqrt(plt$info) plt$rxx <- plt$info / (plt$info + 1/gp$gcov[1L,1L]) #possible sum scores in data plt$sum<-f$Sum.Scores #number of people with each score plt$observed<-f$observed if(method=="CI"){ #compute alpha and 95% CI ci<-Bayesrel::strel(data=dat, estimates=rel, n.burnin=5000,n.iter=10000, missing="listwise") low<-as.numeric(ci$Bayes$cred$low) upp<-as.numeric(ci$Bayes$cred$up) est<-as.numeric(ci$Bayes$est) } if(method=="width"){ ci<-Bayesrel::strel(data=dat, estimates=rel, n.burnin=5000,n.iter=10000, missing="listwise") est<-as.numeric(ci$Bayes$est) low<-est-width if(low<0){ low<-0 warning( "Warning: The 'width =' value implies an interval lower bound below 0. The lower bound has been replaced with 0. Check that your width was accurately entered", immediate.=T) } upp<-est+width if (upp>1) { upp<-1 warning( "Warning: The 'width =' value implies an interval upper bound above 1. The upper bound has been replaced with 1. Check that your width was accurately entered", immediate.=T) } } if(method=="raw"){ low<-raw.low upp<-raw.high if(upp>1) { stop( "Error: The 'raw.high =' option exceeds 1. Check that your width was accurately entered") } if (low<0) { stop( "Error: The 'low =' option is below 0. Check that your width was accurately entered") } if ((raw.low>raw.high | raw.low==raw.high)) { stop( "Error: The interval lower bound exceeds the interval upper bound. Check that your width was accurately entered") } ci<-Bayesrel::strel(data=dat, estimates=rel, n.burnin=5000,n.iter=10000, missing="listwise") est<-as.numeric(ci$Bayes$est) } #percentage within CI, above CI, and below CI plt$within<-ifelse(low < plt$rxx & plt$rxx < upp, 1,0) plt$above<-ifelse(upp < plt$rxx, 1,0) plt$below<-ifelse(plt$rxx < low, 1,0) people<-round(100*apply(plt[,c("within","above","below")], 2, weighted.mean, w=plt$observed),2) #used for plot legend alpha <- data.frame(yintercept=est, Lines=str_to_title(rel)) if(method=="CI"){ CI <- data.frame(yintercept=low, Lines=paste(str_to_title(rel),'95% CI')) } if(method!="CI"){ CI <- data.frame(yintercept=low, Lines='Test Interval') } #plot #conditional reliability, weighted by number of responses #with alpha and 95% CI #x axis is sum score, not theta score p<-ggplot(plt, aes(x = sum, y = rxx)) + geom_line(aes(color = (observed+1)),lwd=1.5) + scale_colour_gradient("People at Score",high ="#009ca6",low="sienna1")+ #low = "#fff7dd", high = "#009ca6" ylim(0, 1) + theme_minimal(base_size=16)+ geom_hline(yintercept=upp, linetype="dotted", lwd=.5)+ geom_hline(aes(yintercept=yintercept, linetype=Lines), CI, lwd=.5)+ geom_hline(aes(yintercept=yintercept, linetype=Lines), alpha, lwd=.5)+ xlab("Sum Score") + ylab(paste("Reliability (Cronbach's alpha)")) #print alpha and percentages in console base::cat(paste0("Coefficient ", str_to_title(rel),": ", format(round(est,2),nsmall=2))) base::cat("\n") base::cat(paste0("Coefficient ", str_to_title(rel), " 95% CI: ", "[",round(low,2),",",round(upp,2),"]")) base::cat("\n") base::cat("\n") base::cat(paste0("People above ", str_to_title(rel)," interval: ")) base::cat(paste0(unname(people[2]),"%")) base::cat("\n") base::cat(paste0("People within ", str_to_title(rel)," interval : ")) base::cat(paste0(unname(people[1]),"%")) base::cat("\n") base::cat(paste0("People below ", str_to_title(rel), " interval: ")) base::cat(paste0(unname(people[3]),"%")) ##Save a table for Shiny if(method=="CI"){ y<-c(paste0("Coefficient ",str_to_title(rel),":"),paste0("Coefficient ", str_to_title(rel), " 95% CI:"), " ", paste0("People above ", str_to_title(rel)," interval:"), paste0("People within ", str_to_title(rel)," interval:"), paste0("People below ", str_to_title(rel)," interval :"), format(round(est,2),nsmall=2), paste0("[",format(round(low,2),nsmall=2),",",format(round(upp,2),nsmall=2),"]"), " ", paste0(unname(people[2]),"%"), paste0(unname(people[1]),"%"), paste0(unname(people[3]),"%")) } if(method!="CI"){ y<-c(paste0("Coefficient ",str_to_title(rel),":"),paste0("Specified test interval:"), " ", paste0("People above ", str_to_title(rel)," interval:"), paste0("People within ", str_to_title(rel)," interval:"), paste0("People below ", str_to_title(rel)," interval :"), format(round(est,2),nsmall=2), paste0("[",format(round(low,2),nsmall=2),",",format(round(upp,2),nsmall=2),"]"), " ", paste0(unname(people[2]),"%"), paste0(unname(people[1]),"%"), paste0(unname(people[3]),"%")) } x<-matrix(y, nrow=6, ncol=2) x1<-data.frame(x) d<-plt[,c(6,5,7)] d[,2]<-round(d[,2],3) names(d)<-c("Sum Score","Conditional Reliability","Count") z<-list() #save summary table z$stat<-x1 #save plot z$plot<-p #save conditional reliability table z$table<-d #print the plot by default #print(z$plot) #return the plot with the function return(z) } ``` ::: panel-tabset #### Alpha ```{r} RIrelRep(eRm::raschdat1[,1:20])$plot ``` #### Omega ```{r} RIrelRep(eRm::raschdat1[,1:20], rel = "omega")$plot ``` ::: ### Summarizing RMU findings Since RMU uses a random pairing of posterior draws, there is variation in both point values and HDCI estimates if one runs the RMU estimation repeatedly using the same set of draws. Let's investigate how much it varies, using the `brms` model with 4000 draws. ```{r} #| eval: false rel <- map_dfr(1:1000, ~ RMUreliability(posterior_draws_rm4000)[1:3]) qs_save(rel,"rel4000.qs2") ``` ```{r} #| fig-height: 4 rel <- qs_read("rel4000.qs2") rel_text <- rel %>% pivot_longer(everything()) %>% group_by(name) %>% summarise(text_location = mean(value)) rel %>% pivot_longer(everything()) %>% ggplot(aes(x = value, y = 1, fill = name)) + stat_slabinterval() + geom_text(data = rel_text, aes(label = name, x = text_location), position = position_nudge(y = -0.04)) + labs(title = "RMU random variation", subtitle = "Using a `brms` Rasch model with 4000 posterior draws, estimating RMU 1000 times", x = "Relative Measurement Uncertainty", y = NULL) + theme_minimal() + theme(legend.position = "none", axis.title.x = element_text(vjust = -0.1)) + scale_x_continuous(limits = c(0.65,0.85), breaks = scales::breaks_pretty(8)) ``` CI's have more variation than the point estimate. ```{r} rel %>% pivot_longer(everything()) %>% group_by(name) %>% reframe(min = min(value), max = max(value), mean_hdci = mean_hdci(value)) ``` Let's do the same with 8000 draws. ```{r} #| eval: false rel8 <- map_dfr(1:1000, ~ RMUreliability(posterior_draws_rm8000)[1:3]) qs_save(rel8,"rel8000.qs2") ``` ```{r} #| fig-height: 4 rel8 <- qs_read("rel8000.qs2") rel8_text <- rel8 %>% pivot_longer(everything()) %>% group_by(name) %>% summarise(text_location = mean(value)) rel8 %>% pivot_longer(everything()) %>% ggplot(aes(x = value, y = 1, fill = name)) + stat_slabinterval() + geom_text(data = rel_text, aes(label = name, x = text_location), position = position_nudge(y = -0.04)) + labs(title = "RMU random variation", subtitle = "Using a `brms` Rasch model with 8000 posterior draws, estimating RMU 1000 times", x = "Relative Measurement Uncertainty", y = NULL) + theme_minimal() + theme(legend.position = "none", axis.title.x = element_text(vjust = -0.1)) + scale_x_continuous(limits = c(0.65,0.85), breaks = scales::breaks_pretty(8)) ``` ```{r} rel8 %>% pivot_longer(everything()) %>% group_by(name) %>% reframe(min = min(value), max = max(value), mean_hdci = mean_hdci(value)) ``` Less variation - more precision in estimates with 8000 draws compared to 4000 draws. How about just estimating RMU 50 times, does it converge on the same mean values for point and CI estimates, compared to estimating RMU 1000 times? This is obviously not a thorough test, it should be evaluated more properly at some point. ```{r} t0 <- Sys.time() rel50 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm4000)[1:3]) t1 <- Sys.time() t1 - t0 # ~3.4s rel50 %>% pivot_longer(everything()) %>% group_by(name) %>% reframe(mean_hdci = mean_hdci(value)) ``` Clearly seems more stable than just estimating RMU once. This finding leads me to incorporate the 50x RMU estimation as a default setting in the `RIreliability()` function in the next version update. Results reported below are based on this method. ```{r} RIreliability50 <- function(data, conf_int = .95, draws = 1000, estim = "WLE", boot = FALSE, cpu = 4, pv = "mirt", iter = 50, verbose = TRUE) { if (verbose == TRUE) { message("Note that PSI is calculated with max/min scoring individuals excluded.") message(paste0("RMU reliability estimates based on ",draws," posterior draws (plausible values) from ",nrow(data)," respondents.", "\nSee Bignardi, Kievit, & Bürkner (2025). 'A general method for estimating reliability using Bayesian Measurement Uncertainty' for details. ", "The procedure has been modified in `easyRasch` to use plausible values based on an MML estimated Rasch model.")) } if(min(as.matrix(data), na.rm = T) > 0) { stop("The lowest response category needs to coded as 0. Please recode your data.") } else if(max(as.matrix(data), na.rm = T) == 1) { model <- "RM" } else if(max(as.matrix(data), na.rm = T) > 1) { model <- "PCM" } if (model == "PCM") { mirt_out <- mirt( data, model = 1, itemtype = "Rasch", verbose = FALSE, accelerate = "squarem" ) erm_out <- eRm::PCM(data) } else if (model == "RM") { mirt_out <- mirt( data, model = 1, itemtype = "1PL", verbose = FALSE, accelerate = "squarem" ) erm_out <- eRm::RM(data) } wle <- RI_iarm_person_estimates(erm_out, properties = TRUE)[[2]] %>% as.data.frame() rownames(wle) <- NULL empirical_rel <- mirt::fscores(mirt_out, method = estim, theta_lim = c(-10, 10), full.scores.SE = TRUE, verbose = FALSE) %>% mirt::empirical_rxx() if (boot == TRUE) { require(doParallel) registerDoParallel(cores = cpu) # bootstrap CI for empirical fit <- data.frame() fit <- foreach(i = 1:draws, .combine = rbind) %dopar% { dat <- data[sample(1:nrow(data), nrow(data), replace = TRUE), ] if (model == "PCM") { mirt_out2 <- mirt( dat, model = 1, itemtype = "Rasch", verbose = FALSE, accelerate = "squarem" ) } else if (model == "RM") { mirt_out2 <- mirt( dat, model = 1, itemtype = "1PL", verbose = FALSE, accelerate = "squarem" ) } mirt::fscores(mirt_out2, method = estim, theta_lim = c(-10, 10), full.scores.SE = TRUE, verbose = FALSE) %>% mirt::empirical_rxx() } emp_boot <- mean_hdci(fit) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } if (pv == "mirt") { plvals <- mirt::fscores(mirt_out, method = estim, theta_lim = c(-10, 10), plausible.draws = draws, plausible.type = "MH", verbose = FALSE) rmu_draws <- do.call(cbind.data.frame, plvals) rmu_iter <- map_dfr(1:iter, ~ RMUreliability(rmu_draws, level = conf_int)[1:3]) rmu <- rmu_iter %>% summarise(rmu_estimate = mean(rmu_estimate), hdci_lowerbound = mean(hdci_lowerbound), hdci_upperbound = mean(hdci_upperbound) ) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } else if (pv == "TAM") { estim = "EAP" if (model == "PCM") { tam_out <- TAM::tam.mml(resp = data, irtmodel = "PCM", verbose = FALSE) } else if (model == "RM") { tam_out <- TAM::tam.mml(resp = data, irtmodel = "1PL", verbose = FALSE) } plvals <- TAM::tam.pv(tam_out, nplausible = draws, verbose = FALSE)[["pv"]][,-1] rmu_iter <- map_dfr(1:iter, ~ RMUreliability(plvals, level = conf_int)[1:3]) rmu <- rmu_iter %>% summarise(rmu_estimate = mean(rmu_estimate), hdci_lowerbound = mean(hdci_lowerbound), hdci_upperbound = mean(hdci_upperbound) ) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) } if (boot == TRUE) { return(list(WLE = wle, PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(), Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)), Empirical_bootstrap = paste0(estim,"_empirical = ",emp_boot$y," (95% HDCI [",emp_boot$ymin,", ",emp_boot$ymax,"]) (",draws," bootstrap resamples)"), RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv," and ",iter," RMU iterations.") ) ) } else { return(list(WLE = wle, PSI = eRm::person.parameter(erm_out) %>% eRm::SepRel(), Empirical = paste0(estim,"_empirical = ",round(empirical_rel,3)), RMU = paste0(estim,"-RMU = ",rmu$rmu_estimate," (95% HDCI [",rmu$hdci_lowerbound,", ",rmu$hdci_upperbound,"]) (",draws," draws) using package ",pv," and ",iter," RMU iterations.") ) ) } } ``` Using the 50 iteration setting 10 times with 4000 draws. What is the variation!? ```{r} #| eval: false iter50 <- map_vec(1:10, ~ RIreliability50(raschdat1[,1:20], draws = 4000, verbose = FALSE)$RMU) iter50 ``` Point estimate varies from .761 to .764, with lower/upper CI ranging from .694 to .698 and .825 to .827. Estimates seem unlikely to vary more than 0.01, which I find acceptable. ### Summary table ```{r} #| eval: false rmu_mirt_wle_full <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, verbose = F) rmu_mirt_eap_full <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP", verbose = F) t0 <- Sys.time() rmu_mirt_wle4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, verbose = F)$RMU t1 <- Sys.time() t1 - t0 # 8.3s t0 <- Sys.time() rmu_mirt_eap4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, estim = "EAP", verbose = F)$RMU t1 <- Sys.time() t1 - t0 # 7.7s t0 <- Sys.time() rmu_tam_eap4000 <- RIreliability50(eRm::raschdat1[,1:20], draws = 4000, pv = "TAM", verbose = F)$RMU t1 <- Sys.time() t1 - t0 # 10.6s brms4000 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm4000)[1:3]) %>% summarise(rmu_estimate = mean(rmu_estimate), hdci_lowerbound = mean(hdci_lowerbound), hdci_upperbound = mean(hdci_upperbound) ) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) brms8000 <- map_dfr(1:50, ~ RMUreliability(posterior_draws_rm8000)[1:3]) %>% summarise(rmu_estimate = mean(rmu_estimate), hdci_lowerbound = mean(hdci_lowerbound), hdci_upperbound = mean(hdci_upperbound) ) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) ``` ```{r} data.frame(Metric = c("RMU brms 8000","RMU brms 4000","WLE-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (TAM)","EAP (TAM)","PSI (eRm)","WLE empirical (mirt)","EAP empirical (mirt)","Alpha","Omega"), Estimate = c("0.750, 95% HDCI [0.678, 0.818]", "0.750, 95% HDCI [0.679, 0.818]", "0.762, 95% HDCI [0.696, 0.827]", "0.762, 95% HDCI [0.695, 0.826]", "0.740, 95% HDCI [0.669, 0.809]", "0.749", "0.747", "0.792", "0.761", "0.75 95% CI [0.68,0.82]", "0.72 95% CI [0.64,0.80]") ) %>% knitr::kable() ``` ## Partial credit model (PCM) For this example, the `eRm::pcmdat2` dataset is used. It has 4 items with 3 response categories each, and 300 respondents. It might not be an optimal dataset to use since item 3 is a bit overfit to the PCM. ### Bayesian model To fit a PCM with `brms`, we need `family = "acat"`, which defaults to use logit link. ```{r} #| eval: false #| code-fold: show df2 <- pcmdat2 %>% mutate(across(everything(), ~ .x + 1)) %>% # lowest category needs to be 1, unlike most Rasch packages that use 0 rownames_to_column("id") %>% pivot_longer(!id, names_to = "item", values_to = "response") get_prior(response | thres(gr = item) ~ 1 + (1 | id), data = df2, family = "acat") prior_pcm <- prior("normal(0, 5)", class = "Intercept") + prior("normal(0, 3)", class = "sd", group = "id") brms_pcm <- brm( response | thres(gr = item) ~ 1 + (1 | id), data = df2, prior = prior_pcm, chains = 4, cores = 4, iter = 2000, family = "acat", control = list(adapt_delta = 0.99) ) loo(brms_pcm) qs_save(brms_pcm, file = "brms_pcm.qs2", compress_level = 11) ``` ```{r} brms_pcm <- qs_read("brms_pcm.qs2") posterior_draws_pcm <- brms_pcm %>% as_draws_df() %>% select(starts_with("r_id")) %>% t() ``` ### Conditional reliability ::: panel-tabset #### Alpha ```{r} RIrelRep(eRm::pcmdat2)$plot ``` #### Omega ```{r} RIrelRep(eRm::pcmdat2, rel = "omega")$plot ``` ::: ### Summary table ```{r} #| eval: false brms4000pcm <- map_dfr(1:50, ~ RMUreliability(posterior_draws_pcm)[1:3]) %>% summarise(rmu_estimate = mean(rmu_estimate), hdci_lowerbound = mean(hdci_lowerbound), hdci_upperbound = mean(hdci_upperbound) ) %>% mutate(across(where(is.numeric), ~ round(.x, 3))) rmu_mirt_wle_full <- RIreliability50(eRm::pcmdat2, draws = 4000, verbose = F) rmu_mirt_eap_full <- RIreliability50(eRm::pcmdat2, draws = 4000, estim = "EAP", verbose = F) rmu_mirt_wle4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, verbose = F)$RMU rmu_mirt_eap4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, estim = "EAP", verbose = F)$RMU rmu_tam_eap4000 <- RIreliability50(eRm::pcmdat2, draws = 4000, pv = "TAM", verbose = F)$RMU ``` ```{r} data.frame(Metric = c("RMU brms 4000","WLE-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (mirt)","EAP-RMU PV 4000 (TAM)","EAP","WLE empirical (mirt)","EAP empirical (mirt)","PSI","Alpha","Omega"), Estimate = c("0.672, 95% HDCI [0.615, 0.729]", "0.671, 95% HDCI [0.618, 0.721]", "0.670, 95% HDCI [0.618, 0.721]", "0.672, 95% HDCI [0.621, 0.722]", "0.669", "0.718", "0.670", "0.380", "0.66 95% CI [0.59,0.72]", "0.63 95% CI [0.56,0.70]") ) %>% knitr::kable() ``` ## Session info ```{r} sessionInfo() ``` ## References