This post is an attempt to better understand some aspects of differential item functioning (DIF, a form of invariance test):
how much does DIF affect estimated thetas (factor scores)?
how to do an item split in R (creating separate items for subgroups from one item with problematic DIF)
how an item split compares to removing the item, and to keeping the DIF item as it is, in terms of absolute differences in estimated thetas
We’ll use simulated data in order to have knowledge of the thetas used to generate response data (“input thetas” in the text below), and make objective comparisons using the different estimated thetas.
Ideally, this would be a simulation study where we create lots of datasets with a systematic variation in some parameters to investigate effects. But maybe this is a first step towards that.
# make a tibble/dataframe also, for possible later use
inputParams1 <- tibble(
  q1 = c(1.2, 1.8, 2.4),
  q2 = c(-1.3, -0.5, 0.5),
  q3a = c(-0.3, 0.3, 1.2), # this is the DIF item
  q3b = c(-0.3 + 1, 0.3 + 1, 1.2 + 1), # this is the DIF item
  q4 = c(0.1, 0.6, 1.6),
  q5 = c(-0.3, 0.7, 1.5),
  q6 = c(-1.6, -1, -0.3),
  q7 = c(1, 1.8, 2.5),
  q8 = c(-1.3, -0.7, 0.4),
  q9 = c(-0.8, 1.4, 1.9),
  q10 = c(0.25, 1.25, 2.15)
) %>%
  t() %>%
  as.matrix()

# center to 0
inputParams1c <- inputParams1 - mean(inputParams1)

# item list for simulation for group 1
tlist1 <- list(
  q1 = list(inputParams1c[1, ]),
  q2 = list(inputParams1c[2, ]),
  q3 = list(inputParams1c[3, ]), # this is the DIF item
  q4 = list(inputParams1c[5, ]),
  q5 = list(inputParams1c[6, ]),
  q6 = list(inputParams1c[7, ]),
  q7 = list(inputParams1c[8, ]),
  q8 = list(inputParams1c[9, ]),
  q9 = list(inputParams1c[10, ]),
  q10 = list(inputParams1c[11, ])
)

# item list for simulation for group 2
tlist2 <- list(
  q1 = list(inputParams1c[1, ]),
  q2 = list(inputParams1c[2, ]),
  q3 = list(inputParams1c[4, ]), # this is the DIF item
  q4 = list(inputParams1c[5, ]),
  q5 = list(inputParams1c[6, ]),
  q6 = list(inputParams1c[7, ]),
  q7 = list(inputParams1c[8, ]),
  q8 = list(inputParams1c[9, ]),
  q9 = list(inputParams1c[10, ]),
  q10 = list(inputParams1c[11, ])
)
Then generate random thetas that we save to file to be able to reproduce the analysis.
# simulate thetas
thetas1 <- rnorm(300, mean = 0, sd = 1.5)
thetas2 <- rnorm(300, mean = 0, sd = 1.5)
input_thetas <- c(thetas1, thetas2)

# simulate response data based on the above defined item thresholds
td1 <- SimPartialScore(
  deltaslist = tlist1,
  thetavec = thetas1
) %>%
  as.data.frame()

td2 <- SimPartialScore(
  deltaslist = tlist2,
  thetavec = thetas2
) %>%
  as.data.frame()

d <- rbind(td1, td2) %>%
  add_column(group = rep(1:2, each = 300))
dif.group <- factor(d$group)
d$group <- NULL

all_data <- list(
  simResponses = d,
  dif_group = dif.group,
  input_thetas = input_thetas
)

# save simulated data for reproducibility
#saveRDS(all_data, "dif_magnitude_1_0.Rdata")
# read simulated data for reproducibility
all_data <- readRDS("dif_magnitude_1_0.Rdata")
d <- all_data$simResponses
dif.group <- all_data$dif_group
input_thetas <- all_data$input_thetas
We now have 10 items with 4 categories each. There are 600 respondents in all, in two groups of 300 that differ in how one item (q3) functions. DIF is induced as a uniform +1 logit difference in location: all thresholds for item q3 are shifted by +1 logit in group 2.
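To make the induced DIF concrete, here is a minimal base R sketch of what a uniform +1 logit shift does to the expected score on q3 under the partial credit model. The helper names (`pcm_probs`, `expected_score`) are my own and not part of any package used in this post, and the thresholds are the uncentered input values, so the theta scale is offset from the centered one used in the simulation.

```r
# Partial credit model category probabilities for one item.
# theta: person location, deltas: threshold locations (both in logits).
pcm_probs <- function(theta, deltas) {
  k <- 0:length(deltas)                       # possible scores 0..m
  logits <- k * theta - c(0, cumsum(deltas))  # numerator exponents
  p <- exp(logits - max(logits))              # subtract max for numerical stability
  p / sum(p)
}

# Expected item score at a given theta.
expected_score <- function(theta, deltas) {
  sum((0:length(deltas)) * pcm_probs(theta, deltas))
}

q3_group1 <- c(-0.3, 0.3, 1.2)  # input thresholds for q3, group 1 (uncentered)
q3_group2 <- q3_group1 + 1      # uniform DIF: every threshold shifted +1 logit

theta_grid <- seq(-3, 4, by = 0.5)
dif_effect <- data.frame(
  theta = theta_grid,
  group1 = sapply(theta_grid, expected_score, deltas = q3_group1),
  group2 = sapply(theta_grid, expected_score, deltas = q3_group2)
)
dif_effect
```

At any given theta, group 2 has a lower expected score on q3 than group 1. A respondent in group 2 needs to be one logit higher on the latent variable to reach the same expected score, which is what "uniform" DIF means here.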
Values highlighted in red are above the chosen cutoff 0.5 logits. Background color brown and blue indicate the lowest and highest values among the DIF groups.
DIF is clearly shown, although the estimated size is closer to 0.8 logits than the 1.0 used in the input values. This is still generally considered to be a large DIF size, so it should serve our purpose.
2.4 Item split
Now, we’ll do an item split and compare thetas for both groups with and without split, and also with the DIF item removed.
Item threshold locations for the q3 split items range from -1 logits to +1.5 logits along the theta/latent continuum. This is roughly the range where we expect some impact from DIF.
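For readers who have not done an item split before, the idea can be sketched in base R as below. This is a toy illustration with made-up responses, not the code used to produce the split dataset in this post; the object names (`responses`, `group`, `split_data`) and the column names `q3_g1` and `q3_g2` are assumptions for the example.

```r
# Hypothetical toy data: 6 respondents, 3 items scored 0-3, two DIF groups.
responses <- data.frame(
  q1 = c(0, 1, 2, 3, 1, 2),
  q2 = c(1, 1, 3, 2, 0, 2),
  q3 = c(2, 3, 1, 0, 1, 2)
)
group <- factor(c(1, 1, 1, 2, 2, 2))

# Copy q3 into one column per group and set the other group's responses to NA.
split_data <- responses
split_data$q3_g1 <- ifelse(group == "1", responses$q3, NA)
split_data$q3_g2 <- ifelse(group == "2", responses$q3, NA)

# Drop the original item so each response is only counted once.
split_data$q3 <- NULL

split_data
```

Each respondent still contributes the same number of responses as before, but the two new columns are estimated as separate items, each with missing data for the group it does not belong to.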
First, we’ll use the convenient function RIestThetas() that estimates item parameters (using eRm) and thetas (using iarm), to see how that works when we have a split item, each with missing data for 50% of respondents.
thetas_separate <- RIestThetas(d2)

hist(thetas_separate$WLE, breaks = 30, col = "lightblue")
hist(thetas_together$WLE, breaks = 30, col = "lightpink")
hist(thetas_q3_removed$WLE, breaks = 30, col = "sienna4")
The upper range is rather different for the item split dataset when using this method, with a maximum theta of about 2.2, compared to about 4.1 both for the item set with the original q3 item and for the set without q3.
The method for estimating item parameters and thetas used in the function RIestThetas() may be at fault for the odd thetas estimated from the item set with the item split. We can separate the two steps, and use a separate function for theta estimation with manual input of item parameters.
Looks like the two-step approach worked a lot better. Since the item parameter estimation is identical (both are using eRm::PCM()), the reason should be the difference in theta estimation. The two-step approach uses catR::thetaEst() for theta estimation, which probably handles missing data better than iarm::person_estimates(). Note: both approaches use Weighted Likelihood Estimation (WLE) to minimize bias (Warm, 1989).
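To illustrate why missing responses need not be a problem for theta estimation as such, here is a minimal base R sketch of a WLE for the partial credit model that simply skips unanswered items. It is not the code inside catR::thetaEst() or iarm::person_estimates(). The function names (`pcm_probs`, `pcm_info`, `wle_theta`) and the toy thresholds are my own assumptions. The sketch relies on the result that, for Rasch-family models, Warm's weighted likelihood can be maximized as the log-likelihood plus half the log of the test information.

```r
# Category probabilities for one PCM item at ability theta.
pcm_probs <- function(theta, deltas) {
  k <- 0:length(deltas)
  logits <- k * theta - c(0, cumsum(deltas))
  p <- exp(logits - max(logits))  # subtract max for numerical stability
  p / sum(p)
}

# Item information under the PCM equals the variance of the item score.
pcm_info <- function(theta, deltas) {
  p <- pcm_probs(theta, deltas)
  k <- 0:length(deltas)
  sum(k^2 * p) - sum(k * p)^2
}

# WLE for one respondent. `responses` holds scores starting at 0, with NA for
# unanswered items; `thresholds` is a list with one threshold vector per item.
wle_theta <- function(responses, thresholds, range = c(-6, 6)) {
  answered <- which(!is.na(responses))  # items with NA are left out entirely
  objective <- function(theta) {
    loglik <- 0
    info <- 0
    for (i in answered) {
      p <- pcm_probs(theta, thresholds[[i]])
      loglik <- loglik + log(p[responses[i] + 1])
      info <- info + pcm_info(theta, thresholds[[i]])
    }
    loglik + 0.5 * log(info)  # weighted log-likelihood
  }
  optimize(objective, interval = range, maximum = TRUE)$maximum
}

# Toy example: three items, the third one unanswered (as for a split item).
toy_thresholds <- list(c(-1, 0, 1), c(-0.5, 0.5, 1.5), c(0.7, 1.3, 2.2))
wle_theta(c(2, 1, NA), toy_thresholds)
```

In this sketch an item with NA contributes neither to the likelihood nor to the information, so a split item with 50% missing data is handled the same way as any other unanswered item.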
3 Results
3.1 Summarised
First, absolute differences in estimated thetas compared to input thetas. By using absolute differences we can assess both DIF groups simultaneously.
c_diff <- c %>%
  mutate(
    with_q3 = abs(together_RIestThetas - input_thetas),
    q3_removed = abs(together_q3_rem - input_thetas),
    q3_split = abs(separate_catR - input_thetas)
  ) %>%
  select(!names(c))

c_diff %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 100) +
  facet_wrap(~name, ncol = 1) +
  labs(
    x = "Absolute difference in logits",
    title = "Comparing input thetas to estimated",
    subtitle = "Distribution of bias"
  )
We should look more closely at the particular region where the DIF item is located, since it should have the most impact there.
3.2 Across the latent continuum
First, the test information function (TIF) curve could be of interest to understand what to expect in terms of estimation bias due to reliability limitations. Even more interesting is the table showing range of SEM.
The lowest SEM with q3 is 0.384, at logit score 0.356.
Lowest SEM without q3 is 0.416, at logit score 0.322 to 0.491, which makes a difference in minimal SEM of about 0.032 compared to including the DIF item. 0.032 * 1.96 = 0.063 for a 95% CI.
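The relation between test information and SEM, and the effect of dropping one item, can be sketched in base R. Note the assumptions: this uses the input thresholds for group 1 (centered on the ten items only) rather than the estimated item parameters behind the figures above, so the numbers will not match exactly. `pcm_info` and `test_sem` are hypothetical helper names.

```r
# Item information under the PCM equals the variance of the item score.
pcm_info <- function(theta, deltas) {
  k <- 0:length(deltas)
  logits <- k * theta - c(0, cumsum(deltas))
  p <- exp(logits - max(logits))
  p <- p / sum(p)
  sum(k^2 * p) - sum(k * p)^2
}

# Input thresholds (group 1 version of q3), centered to a mean of 0.
thresholds <- list(
  q1 = c(1.2, 1.8, 2.4),    q2 = c(-1.3, -0.5, 0.5),
  q3 = c(-0.3, 0.3, 1.2),   q4 = c(0.1, 0.6, 1.6),
  q5 = c(-0.3, 0.7, 1.5),   q6 = c(-1.6, -1, -0.3),
  q7 = c(1, 1.8, 2.5),      q8 = c(-1.3, -0.7, 0.4),
  q9 = c(-0.8, 1.4, 1.9),   q10 = c(0.25, 1.25, 2.15)
)
center <- mean(unlist(thresholds))
thresholds <- lapply(thresholds, function(d) d - center)

theta_grid <- seq(-4, 5, by = 0.01)

# SEM(theta) = 1 / sqrt(test information), where test information is the
# sum of the item information functions.
test_sem <- function(items) {
  tif <- sapply(theta_grid, function(th) {
    sum(sapply(items, function(d) pcm_info(th, d)))
  })
  1 / sqrt(tif)
}

sem_with_q3 <- test_sem(thresholds)
sem_without_q3 <- test_sem(thresholds[names(thresholds) != "q3"])

# Minimum SEM for each item set, and where on the theta scale it occurs.
c(min_sem = min(sem_with_q3), at_theta = theta_grid[which.min(sem_with_q3)])
c(min_sem = min(sem_without_q3), at_theta = theta_grid[which.min(sem_without_q3)])

# The arithmetic from the text: difference in minimum SEM times 1.96.
(0.416 - 0.384) * qnorm(0.975)
```

Since information is additive over items, removing an item can only increase SEM, and the increase is largest in the theta region where the removed item is most informative.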
Item split slightly reduces bias compared to keeping the DIF item (q3), while removing q3 increases bias in the theta range of about -1 to +1.5 logits. The maximum bias looks like approximately 0.055 logits (at theta = 0), according to the loess smoothed line.
3.4 Statistical analysis of means (limited range)
While the practical impact of theta estimation bias induced by a DIF variable should be judged by how problematic the maximum bias is for the intended use and need for precision, it could be interesting to quantify the differences using statistical analysis. It is sometimes suggested to look at the mean of the groups, but I think this is mistaken. The bias is local and related to the DIF item’s location, which makes it relevant to look at that region separately.
As such, the comparison below is about mean differences in estimated thetas to input thetas, limited to the theta range from -0.5 to +1 logits, where differences seem the biggest according to Figure 1. Note that the figure shows loess smoothed lines that include both groups.
We need to do this separately for the two DIF groups, since the groups will have opposite effects of the DIF.
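As a sketch of the kind of comparison meant here, the base R code below restricts hypothetical data to the theta range of interest and estimates the mean bias per group. Everything in it is made up for illustration: the data are simulated with a small shift of opposite sign in each group, and a one-sample `t.test()` stands in for whatever model is used to produce the estimates plotted below.

```r
set.seed(1)
n <- 300

# Hypothetical data: input thetas plus estimates with a small group-specific
# shift of opposite sign, mimicking the opposite effects of DIF.
sim <- data.frame(
  group = rep(c("Group 1", "Group 2"), each = n),
  input_theta = rnorm(2 * n, mean = 0, sd = 1.5)
)
shift <- ifelse(sim$group == "Group 1", 0.05, -0.05)
sim$estimated_theta <- sim$input_theta + shift + rnorm(2 * n, sd = 0.4)
sim$bias <- sim$estimated_theta - sim$input_theta  # signed, not absolute

# Keep only respondents with input theta between -0.5 and +1 logits.
in_range <- sim$input_theta >= -0.5 & sim$input_theta <= 1

# Mean bias with a confidence interval, separately for each group.
bias_by_group <- do.call(rbind, lapply(
  split(sim$bias[in_range], sim$group[in_range]),
  function(b) {
    tt <- t.test(b, conf.level = 0.84)
    c(estimate = unname(tt$estimate),
      conf.low = tt$conf.int[1],
      conf.high = tt$conf.int[2])
  }
))
bias_by_group

# Pooling the groups lets biases of opposite sign cancel each other out.
mean(sim$bias[in_range])
```

The pooled mean at the end shows why the groups are analysed separately: when the signed bias goes in opposite directions, averaging across groups can hide it.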
rbind(g1, g2) %>%
  add_column(Group = rep(c("Group 1", "Group 2"), each = 3)) %>%
  ggplot(aes(x = Group, y = estimate, color = Model)) +
  geom_point(position = position_dodge(width = 0.2)) +
  geom_errorbar(
    aes(ymin = conf.low, ymax = conf.high),
    width = 0.1,
    position = position_dodge(width = 0.2)
  ) +
  scale_color_ghibli_d("MononokeMedium", direction = -1) +
  labs(
    color = "Item set",
    title = "Mean bias in theta estimation",
    subtitle = "Across a limited theta region (-0.5 to 1.0 logits)",
    y = "Model estimate",
    x = ""
  )
Since we are primarily interested in comparing the different item sets to each other, it is not the difference from input thetas (estimate = 0) that is most relevant here. As such, I chose to display 84% confidence intervals, which makes it possible to assess differences between item sets within each group by looking at whether the CIs overlap or not.
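The reasoning behind the 84% level can be worked through in a few lines of base R. This is a sketch under two simplifying assumptions that do not strictly hold here: that the two estimates being compared are independent and that they have equal standard errors. The variable names are my own.

```r
# Half-width multiplier for a two-sided 84% confidence interval.
z84 <- qnorm(1 - (1 - 0.84) / 2)

# Two intervals with equal standard error SE just touch when the estimates
# differ by 2 * z84 * SE. The standard error of the difference between two
# independent estimates is sqrt(2) * SE, so the corresponding z statistic is:
z_equivalent <- 2 * z84 / sqrt(2)

# Two-sided p-value at the point where the intervals just stop overlapping.
p_equivalent <- 2 * (1 - pnorm(z_equivalent))

c(z84 = z84, z_equivalent = z_equivalent, p_equivalent = p_equivalent)
```

Under these assumptions, non-overlapping 84% intervals correspond to a z statistic close to the conventional 1.96, which is why the overlap check works as a rough test at about the 5% level. The item sets here are estimated from the same respondents, so the comparison should be read as approximate.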
