Data wrangling for psychometrics in R

R code examples for psychometricians

Author

Affiliation

Magnus Johansson

Published

2024-12-18

While the easyRasch R package simplifies the process of doing Rasch analysis in R, users still need to be able to import and often make modifications to their data. This demands some knowledge in R data wrangling. For a comprehensive treatment of this topic, please see the “R for data science” book. This guide focuses on common data wrangling issues that occur in psychometric analyses.

I will rely heavily on library(tidyverse) functions in most examples.

This page will be updated intermittently. The planned content includes:

dividing a set of items into two or more dataframes in order to run separate analyses (for instance based on an exploratory analysis of a large item set, or based on a demographic variable)
recoding/merging response categories
splitting an item into two based on a DIF variable
merging two items into a super-item/testlet

1 Removing respondents with missing data

There may be situations where you want to specify a minimum number of items that a respondent must have answered in order to be included in the analysis. This is done by using the filter() function from library(dplyr). The syntax is as follows:

min.responses <- 5 # set the minimum number of responses

df2 <- df %>% 
  filter(length(itemlabels$itemnr) - rowSums(is.na(.[itemlabels$itemnr])) >=     min.responses)

The object itemlabels$itemnr is a vector (column in a dataframe) with the short item labels as they are used in the dataframe df which contain the data, with items as columns using the corresponding labels.

The length() function is used to count the number of items in the vector, and rowSums(is.na()) is used to count the number of missing values in each row. Then a simple subtraction is done, total number of items minus number of missing responses. The >= operator is used to compare the number of responses to the minimum number of responses specified in the object min.responses. The filter() function removes all rows that do not meet the criteria. And the new dataset is saved to df2.

2 Dividing a dataset

3 Recoding response categories

Many options are available. I have settled on primarily using car::recode() from library(car) since I find the syntax to be logical and consistent. Another option to consider is dplyr::case_when().

The basic syntax of car::recode() (henceforth referred to as only recode()) is this:

recode(variable_to_recode, "newvalue1=oldvalue1;newvalue2=oldvalue2", as.factor = FALSE/TRUE)

As you can see, semi-colon ; is used to separate recodings. When numbers are recoded, you just write them out as 1=0. When characters are involved, you need the single quote symbol to enclose the character string, i.e 'Never'=0 for recoding to numerics (which also necessitates the option as.factor = FALSE if you want the recoded variable to be numeric), or 'Never'='Nevver' when you need to correct misspelled response options.

Tip

Remember to define your preferred recoding function in your script to avoid headaches due to unexpected error messages. This is done by using the assignment operator <-. Example below.

recode <- car::recode

There are some more or less clever ways to use recode(). You can simply copy&paste code for each variable, such as:

df$q1 <- recode(df$q1, "1=0;2=1;3=2", as.factor = FALSE)
df$q2 <- recode(df$q2, "1=0;2=1;3=2", as.factor = FALSE)
df$q3 <- recode(df$q3, "1=0;2=1;3=2", as.factor = FALSE)
df$q4 <- recode(df$q4, "1=0;2=1;3=2", as.factor = FALSE)
df$q5 <- recode(df$q5, "1=0;2=1;3=2", as.factor = FALSE)

Such an approach might actually be sensible if each variable needed different recodings. But when you want to make the same changes to many variables, there are of course more efficient strategies. The code above could be rewritten as:

df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE)))

Or, in this case, where we only want to recode our scale from the range of 1-3 to 0-2, i.e. subtract one:

df %>% 
  mutate(across(q1:q5, ~ .x - 1))

The syntax in these two examples is related to how tidyverse uses unnamed functions by the combination of ~ and .x, where the latter becomes a placeholder for the variables defined in the first term of across().

It is good practice to check that your recoding worked as intended. The first step I recommend is RItileplot().

df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% 
  RItileplot()

If you are trying out different ways to merge response categories to resolve issues with disordered thresholds, you may also want to review the probability curves before committing your recode to a new data object.

df %>% 
  mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% 
  RIitemCats(legend = "left")

Below is an example where we have multiple cell contents that we want to recode to NA to ensure that R interprets them as missing data. We want to recode three different things:

all the numbers from 990 to 999 (usually a way to differentiate between types of missing data)
blank cells
“Don’t know” responses

df$q45 <- recode(df$q45,"990:999=NA;''=NA;'Don't know'=NA")

The : means that all numbers from 990 to 999 will be recoded into NA.

4 Item split

It can be desirable to split an item due to issues with DIF. This refers to taking a variable and creating two (or more) replacement variables, one for each demographic group. For each variable, data will be missing for the other group. Often this is relevant for gender DIF, which we will use in this example.

This example assumes that there is a (DIF) vector variable dif.gender with the gender data, which has the same length as the number of rows in the dataset df. We’ll create a new dataframe to store the dataset with the item split. Item q10 is the one we want to split.

df.q10split <- df %>% 
  add_column(gender = dif.gender) %>% 
  mutate(q10f = if_else(gender == "Female", q10, NA), # create variable q10f when gender is "Female"
         q10m = if_else(gender == "Male", q10, NA)
         ) %>% 
  select(!gender) %>% # remove gender and q10 variables from the dataset
  select(!q10)

# check the data
RItileplot(df.q10split)

The if_else() function used within mutate() has three inputs/options in the example above:

condition (logical statement)
assignment if the condition true
assignment if false

5 Item merge

Sometimes it is desirable to merge two items into one. This is often done when there is a high residual correlation between two items. This is called a testlet. The items are merged into a new variable, and the original items are removed from the dataset.

It is usually a good idea to create a new dataframe with the merged variable, in case you need to go back to the original dataset.

df2 <- df %>% 
  mutate(sdq2_15 = sdq2 + sdq15) %>% # create variable by adding them
  select(!sdq2) %>% 
  select(!sdq15)

Then you should check the ICC curves for the merged item:

RIitemCats(df2, item = "sdq2_15")

6 Checking response distribution prior to DIF analysis

This example assumes that you have previously created a DIF variable dif.gender for gender using two factor levels, using the labels Female and Male.

The reason for doing this is making sure that there are no empty cells (particularly in lower response categories) in either group prior to running the DIF analysis, since this could lead to DIF being indicated incorrectly.

difplots <- df %>% 
  add_column(dif = dif.gender) %>% 
  split(.$dif) %>%
  map(~ RItileplot(.x %>% select(!dif)) + labs(title = .x$dif))
  
library(patchwork)
difplots$Female + difplots$Male

The last part will make the tile plots show up side by side, labeled with the gender variable as it is coded in the dataset. You will need to adapt the code to the factor labels in your own dataset. You can also access the plots using difplots[[1]] for the first plot. There will be one plot for each level of your factor variable.

7 Simulating data based on known item parameters

library(easyRasch)

# read item parameters
w <- read_csv("item_params.csv") %>% 
  select(!Location) %>% 
  as.matrix()

# generate 10 000 random theta values
t <- rnorm(10000,0,2)

# get item parameters into a list object, each item as a separate vector
itemlist <- map(1:3, ~ w[.x,] %>% as.numeric() %>% na.omit())

# simulate response data
d <- SimPartialScore(
  deltaslist = itemlist,
  thetavec = t
) %>%
  as.data.frame()

RItargeting(d)

RItif(d, samplePSI = T)

8 Links

https://github.com/Cghlewis/data-wrangling-functions/wiki

Reuse

CC BY 4.0

--- title: "Data wrangling for psychometrics in R" subtitle: "R code examples for psychometricians" author: name: 'Magnus Johansson' affiliation: 'RISE Research Institutes of Sweden' affiliation-url: 'https://www.ri.se/en/shic' orcid: '0000-0003-1669-592X' date: 2024-12-18 date-format: iso execute: cache: false warning: false message: false editor_options: chunk_output_type: console --- While the [`easyRasch` R package](https://pgmj.github.io/easyRasch/) simplifies the process of doing Rasch analysis in R, users still need to be able to import and often make modifications to their data. This demands some knowledge in R data wrangling. For a comprehensive treatment of this topic, please see the ["R for data science"](https://r4ds.had.co.nz/) book. This guide focuses on common data wrangling issues that occur in psychometric analyses. I will rely heavily on `library(tidyverse)` functions in most examples. This page will be updated intermittently. The planned content includes: - dividing a set of items into two or more dataframes in order to run separate analyses (for instance based on an exploratory analysis of a large item set, or based on a demographic variable) - recoding/merging response categories - splitting an item into two based on a DIF variable - merging two items into a super-item/testlet ## Removing respondents with missing data There may be situations where you want to specify a minimum number of items that a respondent must have answered in order to be included in the analysis. This is done by using the `filter()` function from `library(dplyr)`. The syntax is as follows: ``` r min.responses <- 5 # set the minimum number of responses df2 <- df %>% filter(length(itemlabels$itemnr) - rowSums(is.na(.[itemlabels$itemnr])) >= min.responses) ``` The object `itemlabels$itemnr` is a vector (column in a dataframe) with the short item labels as they are used in the dataframe `df` which contain the data, with items as columns using the corresponding labels. The `length()` function is used to count the number of items in the vector, and `rowSums(is.na())` is used to count the number of missing values in each row. Then a simple subtraction is done, total number of items minus number of missing responses. The `>=` operator is used to compare the number of responses to the minimum number of responses specified in the object `min.responses`. The `filter()` function removes all rows that do not meet the criteria. And the new dataset is saved to `df2`. ## Dividing a dataset ## Recoding response categories Many options are available. I have settled on primarily using `car::recode()` from `library(car)` since I find the syntax to be logical and consistent. Another option to consider is `dplyr::case_when()`. The basic syntax of `car::recode()` (henceforth referred to as only `recode()`) is this: ``` r recode(variable_to_recode, "newvalue1=oldvalue1;newvalue2=oldvalue2", as.factor = FALSE/TRUE) ``` As you can see, semi-colon `;` is used to separate recodings. When numbers are recoded, you just write them out as `1=0`. When characters are involved, you need the single quote symbol to enclose the character string, i.e `'Never'=0` for recoding to numerics (which also necessitates the option `as.factor = FALSE` if you want the recoded variable to be numeric), or `'Never'='Nevver'` when you need to correct misspelled response options. ::: {.callout-tip} Remember to define your preferred recoding function in your script to avoid headaches due to unexpected error messages. This is done by using the assignment operator `<-`. Example below. `recode <- car::recode` ::: There are some more or less clever ways to use `recode()`. You can simply copy&paste code for each variable, such as: ``` r df$q1 <- recode(df$q1, "1=0;2=1;3=2", as.factor = FALSE) df$q2 <- recode(df$q2, "1=0;2=1;3=2", as.factor = FALSE) df$q3 <- recode(df$q3, "1=0;2=1;3=2", as.factor = FALSE) df$q4 <- recode(df$q4, "1=0;2=1;3=2", as.factor = FALSE) df$q5 <- recode(df$q5, "1=0;2=1;3=2", as.factor = FALSE) ``` Such an approach might actually be sensible if each variable needed different recodings. But when you want to make the same changes to many variables, there are of course more efficient strategies. The code above could be rewritten as: ``` r df %>% mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) ``` Or, in this case, where we only want to recode our scale from the range of 1-3 to 0-2, i.e. subtract one: ``` r df %>% mutate(across(q1:q5, ~ .x - 1)) ``` The syntax in these two examples is related to how `tidyverse` uses unnamed functions by the combination of `~` and `.x`, where the latter becomes a placeholder for the variables defined in the first term of `across()`. It is good practice to check that your recoding worked as intended. The first step I recommend is `RItileplot()`. ``` r df %>% mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% RItileplot() ``` If you are trying out different ways to merge response categories to resolve issues with disordered thresholds, you may also want to review the probability curves before committing your recode to a new data object. ``` r df %>% mutate(across(q1:q5, ~ car::recode(.x, "1=0;2=1;3=2", as.factor = FALSE))) %>% RIitemCats(legend = "left") ``` Below is an example where we have multiple cell contents that we want to recode to `NA` to ensure that R interprets them as missing data. We want to recode three different things: - all the numbers from 990 to 999 (usually a way to differentiate between types of missing data) - blank cells - "Don't know" responses ``` r df$q45 <- recode(df$q45,"990:999=NA;''=NA;'Don't know'=NA") ``` The `:` means that all numbers from 990 to 999 will be recoded into `NA`. ## Item split It can be desirable to split an item due to issues with DIF. This refers to taking a variable and creating two (or more) replacement variables, one for each demographic group. For each variable, data will be missing for the other group. Often this is relevant for gender DIF, which we will use in this example. This example assumes that there is a (DIF) vector variable `dif.gender` with the gender data, which has the same length as the number of rows in the dataset `df`. We'll create a new dataframe to store the dataset with the item split. Item q10 is the one we want to split. ``` r df.q10split <- df %>% add_column(gender = dif.gender) %>% mutate(q10f = if_else(gender == "Female", q10, NA), # create variable q10f when gender is "Female" q10m = if_else(gender == "Male", q10, NA) ) %>% select(!gender) %>% # remove gender and q10 variables from the dataset select(!q10) # check the data RItileplot(df.q10split) ``` The `if_else()` function used within `mutate()` has three inputs/options in the example above: - condition (logical statement) - assignment if the condition true - assignment if false ## Item merge Sometimes it is desirable to merge two items into one. This is often done when there is a high residual correlation between two items. This is called a testlet. The items are merged into a new variable, and the original items are removed from the dataset. It is usually a good idea to create a new dataframe with the merged variable, in case you need to go back to the original dataset. ``` r df2 <- df %>% mutate(sdq2_15 = sdq2 + sdq15) %>% # create variable by adding them select(!sdq2) %>% select(!sdq15) ``` Then you should check the ICC curves for the merged item: ``` r RIitemCats(df2, item = "sdq2_15") ``` ## Checking response distribution prior to DIF analysis This example assumes that you have previously created a DIF variable `dif.gender` for gender using two factor levels, using the labels Female and Male. The reason for doing this is making sure that there are no empty cells (particularly in lower response categories) in either group prior to running the DIF analysis, since this could lead to DIF being indicated incorrectly. ``` r difplots <- df %>% add_column(dif = dif.gender) %>% split(.$dif) %>% map(~ RItileplot(.x %>% select(!dif)) + labs(title = .x$dif)) library(patchwork) difplots$Female + difplots$Male ``` The last part will make the tile plots show up side by side, labeled with the gender variable as it is coded in the dataset. You will need to adapt the code to the factor labels in your own dataset. You can also access the plots using `difplots[[1]]` for the first plot. There will be one plot for each level of your factor variable. ## Simulating data based on known item parameters ``` r library(easyRasch) # read item parameters w <- read_csv("item_params.csv") %>% select(!Location) %>% as.matrix() # generate 10 000 random theta values t <- rnorm(10000,0,2) # get item parameters into a list object, each item as a separate vector itemlist <- map(1:3, ~ w[.x,] %>% as.numeric() %>% na.omit()) # simulate response data d <- SimPartialScore( deltaslist = itemlist, thetavec = t ) %>% as.data.frame() RItargeting(d) RItif(d, samplePSI = T) ``` ## Links <https://github.com/Cghlewis/data-wrangling-functions/wiki>