Data wrangling for psychometrics in R
R code examples for psychometricians
While the easyRasch R package simplifies the process of doing Rasch analysis in R, users still need to be able to import and often make modifications to their data. This demands some knowledge of R data wrangling. For a comprehensive treatment of this topic, please see the “R for data science” book. This guide focuses on common data wrangling issues that occur in psychometric analyses.
I will rely heavily on library(tidyverse) functions in most examples.
This page will be updated intermittently. The planned content includes:
- dividing a set of items into two or more dataframes in order to run separate analyses (for instance based on an exploratory analysis of a large item set, or based on a demographic variable)
- recoding/merging response categories
- splitting an item into two based on a DIF variable
- merging two items into a super-item/testlet
1 Removing respondents with missing data
There may be situations where you want to specify a minimum number of items that a respondent must have answered in order to be included in the analysis. This is done by using the filter() function from library(dplyr). The syntax is as follows:
min.responses <- 5 # set the minimum number of responses

df2 <- df %>%
  filter(length(itemlabels$itemnr) - rowSums(is.na(.[itemlabels$itemnr])) >= min.responses)
The object itemlabels$itemnr is a vector (a column in a dataframe) with the short item labels as they are used in the dataframe df, which contains the data, with items as columns named by the corresponding labels.
The length() function is used to count the number of items in the vector, and rowSums(is.na()) is used to count the number of missing values in each row. Then a simple subtraction is done: total number of items minus number of missing responses. The >= operator is used to compare the number of responses to the minimum number of responses specified in the object min.responses. The filter() function removes all rows that do not meet the criterion, and the new dataset is saved to df2.
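To make the filtering step concrete, here is a minimal sketch on a small made-up dataset. The toy objects df and itemlabels (three items, four respondents) and the threshold of 2 are assumptions for illustration only.

```r
library(dplyr)

# hypothetical toy data: three items, four respondents
df <- data.frame(
  q1 = c(1, NA, 2, NA),
  q2 = c(0, NA, NA, 1),
  q3 = c(2, 1, NA, 0)
)
itemlabels <- data.frame(itemnr = c("q1", "q2", "q3"))

min.responses <- 2 # keep respondents with at least two answered items

# number of answered items per respondent
n.answered <- length(itemlabels$itemnr) - rowSums(is.na(df[itemlabels$itemnr]))

# keep only respondents who meet the minimum
df2 <- df %>%
  filter(n.answered >= min.responses)
```

Computing the number of answered items as a separate object first makes it easy to inspect its distribution, for instance with table(n.answered), before deciding on a threshold.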
2 Dividing a dataset
3 Recoding response categories
Many options are available. I have settled on primarily using car::recode() from library(car) since I find the syntax to be logical and consistent. Another option to consider is dplyr::case_when().
The basic syntax of car::recode() (henceforth referred to as only recode()) is this:
recode(variable_to_recode, "oldvalue1=newvalue1;oldvalue2=newvalue2", as.factor = FALSE/TRUE)
As you can see, a semicolon ; is used to separate recodings, and each recoding has the old value on the left of = and the new value on the right. When numbers are recoded, you just write them out as 1=0. When characters are involved, you need the single quote symbol to enclose the character string, i.e. 'Never'=0 for recoding to numerics (which also necessitates the option as.factor = FALSE if you want the recoded variable to be numeric), or 'Nevver'='Never' when you need to correct misspelled response options.
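As a small illustration of the character syntax, the sketch below first corrects a misspelled label and then converts the labels to numeric scores. The vector responses and its labels are made up for this example; in car::recode() each pair is written with the old value first.

```r
library(car)

# hypothetical responses, including one misspelled label
responses <- c("Never", "Sometimes", "Nevver", "Often")

# step 1: correct the misspelling (old value on the left, new value on the right)
fixed <- car::recode(responses, "'Nevver'='Never'")

# step 2: convert the labels to numeric scores
scores <- car::recode(fixed, "'Never'=0;'Sometimes'=1;'Often'=2", as.factor = FALSE)
```

Running table(fixed, scores) afterwards is a quick way to confirm that every label ended up with the intended score.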
Remember to define your preferred recoding function in your script to avoid headaches due to unexpected error messages, since dplyr also exports a function named recode() that can mask the one from car. This is done by using the assignment operator <-. Example below.
recode <- car::recode
There are some more or less clever ways to use recode(). You can simply copy and paste code for each variable, such as:
df$q1 <- recode(df$q1, "1=0;2=1;3=2", as.factor = FALSE)
df$q2 <- recode(df$q2, "1=0;2=1;3=2", as.factor = FALSE)
df$q3 <- recode(df$q3, "1=0;2=1;3=2", as.factor = FALSE)
df$q4 <- recode(df$q4, "1=0;2=1;3=2", as.factor = FALSE)
df$q5 <- recode(df$q5, "1=0;2=1;3=2", as.factor = FALSE)
Such an approach might actually be sensible if each variable needed different recodings. But when you want to make the same changes to many variables, there are of course more efficient strategies. The code above could be rewritten as:
df %>%
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE)))
Or, in this case, where we only want to recode our scale from the range of 1-3 to 0-2, i.e. subtract one:
df %>%
  mutate(across(q1:q5, ~ .x - 1))
The syntax in these two examples relies on how tidyverse writes unnamed (anonymous) functions with the combination of ~ and .x, where the latter is a placeholder for each of the variables selected in the first argument of across().
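The sketch below shows three equivalent ways of writing the function passed to across(), using a made-up two-item dataset. The helper name subtract_one is chosen for illustration only.

```r
library(dplyr)

# hypothetical toy data with items scored 1-3
df <- data.frame(q1 = c(1, 2, 3), q2 = c(3, 3, 1))

# tidyverse-style lambda, as used in this guide
a <- df %>% mutate(across(q1:q2, ~ .x - 1))

# the same thing with a base R anonymous function (requires R >= 4.1)
b <- df %>% mutate(across(q1:q2, \(x) x - 1))

# and with a named helper function
subtract_one <- function(x) x - 1
c3 <- df %>% mutate(across(q1:q2, subtract_one))
```

All three produce the same dataframe, so the choice is mostly a matter of readability.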
It is good practice to check that your recoding worked as intended. The first step I recommend is RItileplot().
df %>%
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>%
  RItileplot()
If you are trying out different ways to merge response categories to resolve issues with disordered thresholds, you may also want to review the probability curves before committing your recode to a new data object.
df %>%
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>%
  RIitemCats(legend = "left")
Below is an example where we have multiple cell contents that we want to recode to NA to ensure that R interprets them as missing data. We want to recode three different things:
- all the numbers from 990 to 999 (usually a way to differentiate between types of missing data)
- blank cells
- “Don’t know” responses
# the apostrophe in Don't is escaped so that the recode string can be parsed
df$q45 <- recode(df$q45, "990:999=NA;''=NA;'Don\\'t know'=NA")
The : means that all numbers from 990 to 999 will be recoded into NA.
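For comparison, the same three recodings can be sketched in base R with %in%. The vector q45 below is made up, and it is assumed to be stored as character, which is typical when a column mixes numbers and text.

```r
# hypothetical character column mixing valid scores, missing codes, blanks and text
q45 <- c("1", "3", "995", "", "Don't know", "2", "990")

# everything that should count as missing
missing.codes <- c(as.character(990:999), "", "Don't know")

q45[q45 %in% missing.codes] <- NA
q45 <- as.numeric(q45) # the remaining values are valid scores

table(q45, useNA = "ifany") # check the result
```

Listing the missing codes in a separate object makes it straightforward to reuse them across several variables.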
4 Item split
It can be desirable to split an item due to issues with DIF. This refers to taking a variable and creating two (or more) replacement variables, one for each demographic group. For each variable, data will be missing for the other group. Often this is relevant for gender DIF, which we will use in this example.
This example assumes that there is a (DIF) vector variable dif.gender with the gender data, which has the same length as the number of rows in the dataset df. We’ll create a new dataframe to store the dataset with the item split. Item q10 is the one we want to split.
df.q10split <- df %>%
  add_column(gender = dif.gender) %>%
  mutate(q10f = if_else(gender == "Female", q10, NA), # create variable q10f when gender is "Female"
         q10m = if_else(gender == "Male", q10, NA)
  ) %>%
  select(!gender) %>% # remove gender and q10 variables from the dataset
  select(!q10)
# check the data
RItileplot(df.q10split)
The if_else() function used within mutate() has three inputs/options in the example above:
- condition (logical statement)
- assignment if the condition is true
- assignment if false
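A minimal sketch with made-up data shows what the split variables look like. The toy values of q10 and dif.gender are assumptions, and NA_real_ is used instead of NA so that the example also runs on older dplyr versions that require matching types.

```r
library(dplyr)

# hypothetical toy data
df <- data.frame(q10 = c(0, 2, 1, 3))
dif.gender <- factor(c("Female", "Male", "Female", "Male"))

df.split <- df %>%
  mutate(
    q10f = if_else(dif.gender == "Female", q10, NA_real_), # responses from women only
    q10m = if_else(dif.gender == "Male", q10, NA_real_)    # responses from men only
  ) %>%
  select(!q10)
```

Each respondent now has a value on exactly one of q10f and q10m, and is missing on the other.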
5 Item merge
Sometimes it is desirable to merge two items into one. This is often done when there is a high residual correlation between two items. The resulting merged item is called a testlet or super-item. The items are merged into a new variable, and the original items are removed from the dataset.
It is usually a good idea to create a new dataframe with the merged variable, in case you need to go back to the original dataset.
df2 <- df %>%
  mutate(sdq2_15 = sdq2 + sdq15) %>% # create variable by adding them
  select(!sdq2) %>%
  select(!sdq15)
Then you should check the ICCs (item characteristic curves) for the merged item:
RIitemCats(df2, item = "sdq2_15")
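One detail worth checking after a merge is how missing responses are handled: adding two items with + gives NA whenever either item is missing. The sketch below illustrates this with made-up data for two items scored 0-2.

```r
library(dplyr)

# hypothetical toy data: two items scored 0-2
df <- data.frame(
  sdq2  = c(0, 2, 1, NA),
  sdq15 = c(1, 2, NA, 0)
)

df2 <- df %>%
  mutate(sdq2_15 = sdq2 + sdq15) %>% # merged item, possible range 0-4
  select(!c(sdq2, sdq15))

df2$sdq2_15 # 1 4 NA NA
table(df2$sdq2_15, useNA = "ifany")
```

The merged item has more response categories than either original item, so the frequency table is also a quick way to spot categories with very few responses.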
6 Checking response distribution prior to DIF analysis
This example assumes that you have previously created a DIF variable (vector) for gender with two groups. The variable is a factor with the labels “Female” and “Male”.
The purpose is to make sure that there are no empty cells (particularly in lower response categories) in either group prior to running the DIF analysis, since empty cells could lead to DIF being indicated incorrectly.
difGenderTileplots <- df %>%
  add_column(gender = dif.gender) %>%
  split(.$gender) %>%
  map(~ RItileplot(.x %>% select(!gender)) + labs(title = .x$gender))
library(patchwork)
difGenderTileplots$Female + difGenderTileplots$Male
The last part will make the tile plots show up side by side, labeled with the gender variable as it is coded in the dataset. You will need to adapt the code to the factor labels in your own dataset.
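As a complement to the plots, empty cells can also be listed numerically. The sketch below uses made-up data for two items and a hypothetical dif.gender factor; it counts responses per item, category, and group, and then makes the combinations that never occur explicit.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data
df <- data.frame(q1 = c(0, 1, 2, 1, 2, 2), q2 = c(0, 0, 1, 2, 1, 2))
dif.gender <- factor(c("Female", "Female", "Female", "Male", "Male", "Male"))

counts <- df %>%
  mutate(gender = dif.gender) %>%
  pivot_longer(!gender, names_to = "item", values_to = "category") %>%
  count(item, category, gender) %>%
  # make combinations that never occur explicit, with n = 0
  complete(item, category, gender, fill = list(n = 0L))

# item/category/group combinations with no responses
filter(counts, n == 0)
```

In this toy dataset three cells are empty, for example category 0 of q1 among men.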
7 Simulating data based on known item parameters
library(tidyverse)
library(easyRasch)
# read item parameters
w <- read_csv("item_params.csv") %>%
  select(!Location) %>%
  as.matrix()
# generate 10 000 random theta values
t <- rnorm(10000, 0, 2)
# get item parameters into a list object, each item as a separate vector
itemlist <- map(1:3, ~ w[.x, ] %>% as.numeric() %>% na.omit())
# simulate response data
d <- SimPartialScore(
  deltaslist = itemlist,
  thetavec = t
) %>%
  as.data.frame()
RItargeting(d)
RItif(d, samplePSI = TRUE)
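For readers curious about what such a simulation involves, below is a base R sketch of drawing responses to a single polytomous item under the partial credit model, given person locations and the item's threshold locations. It illustrates the model under assumed parameter values and is not the implementation used by SimPartialScore(); the helper name sim_pcm_item() and the threshold values are made up.

```r
# hypothetical helper: simulate responses to one polytomous item under the
# partial credit model, given person locations and item threshold locations
sim_pcm_item <- function(theta, deltas) {
  sapply(theta, function(th) {
    # log-numerators for categories 0..m: cumulative sums of (theta - delta_j)
    logits <- c(0, cumsum(th - deltas))
    probs <- exp(logits) / sum(exp(logits))
    sample(0:length(deltas), size = 1, prob = probs)
  })
}

set.seed(1)
theta <- rnorm(1000, mean = 0, sd = 2) # person locations
deltas <- c(-1, 0.5, 1.5)              # assumed thresholds for a four-category item

x <- sim_pcm_item(theta, deltas)
table(x) # response distribution across categories 0-3
```

Repeating this for each item's threshold vector and binding the results as columns would give a simulated dataset with one column per item.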