Define a ggplot theme theme_ki(), a standard table function, kbl_ki(), and a color palette based on KI’s design guide, ki_color_palette.
source("ki.R") # this reads an external file and loads whatever is in it
3.1 Adaptions
Some functions exist in multiple packages, which can be a source of headaches and confusion. Loading library(conflicted) will raise an error every time you use a function that is available in more than one loaded package, which helps avoid problems (but can also be annoying if you already have things under control).
Below we define preferred functions that are frequently used. If desired, we can still use specific functions by using their package prefix, for instance dplyr::recode().
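As a minimal sketch of how such preferences can be declared, the block below uses conflicts_prefer() from the conflicted package. The particular functions listed are an assumption for illustration, not necessarily the ones chosen in this workshop.

```r
# Sketch only: the functions listed here are illustrative assumptions.
library(conflicted)
library(dplyr)

# Declare once which package should win when a function name is ambiguous.
conflicts_prefer(
  dplyr::filter,
  dplyr::select,
  dplyr::recode
)

# filter() now resolves to dplyr::filter() instead of raising a conflict error.
small <- filter(mtcars, cyl == 4)
nrow(small)
```

Without the preference, calling filter() with conflicted loaded would stop with an error, since both stats and dplyr provide a function of that name.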
The open dataset we will use for our experiments was retrieved from https://doi.org/10.26180/13240304 and is available in the data subfolder of the R project folder we are currently working in. The description of the dataset on Figshare is:
De-identified dataset from a randomised controlled trial of Mindfulness-integrated cognitive behaviour therapy (MiCBT) versus a treatment-as-usual waitlist control. All participants completed the measures one week before the start of the MiCBT group intervention (T0), after week 4 (T1), at week 8 (T2, post-intervention), and then again after a 6-month follow up period (T3). A full description of the project methodology including the measures used in the trial is provided in the protocol paper (see References).
And from the study protocol:
The intent of this study is to examine the effectiveness of MiCBT to create changes in clinical measures of depression, anxiety and stress. It is hypothesized that these changes will occur during the program in stages 1,2 and 3 and be enhanced in stage 4 because of the additional practice time. Compassion and ethics are taught in Stage 4 for relapse prevention which is not the focus of the current study.
Look in the Environment quadrant (upper right). How many observations and variables do we have in the df object?
Press the circle to the left of df to get a quick look at the data. We can see the word “missing” noted in several fields. Anything else you notice about the variables?
Let’s re-import the data and tell read_excel() to code missing values correctly.
df <- read_excel("data/MiCBT RCT data_Bridges repository.xlsx", na = "missing")
Have another look at the data now and see what happened. You can go back and run the previous chunk to see the difference more clearly.
Also, have a look at the naming scheme. What pattern do you find?
The K10 questionnaire is used for pre-intervention measurement and screening, as well as follow-up measurement. Let’s look at the variables containing “K10”.
# A tibble: 106 × 7
   K10_Score_GP GPK10_coded TK10_t0 K10_di_t0    TK10_t1 TK10_t2 TK10_t3
          <dbl>       <dbl>   <dbl> <chr>          <dbl>   <dbl>   <dbl>
 1           33           1      30 30+               25      30      21
 2           24           0      30 30+               27      31      32
 3           36           1      38 30+               28      18      23
 4           29           0      27 less than 30      21      24      27
 5           27           0      22 less than 30      32      NA      23
 6           29           0      16 less than 30      15      16      17
 7           23           0      23 less than 30      20      15      17
 8           27           0      30 30+               42      27      19
 9           29           0      25 less than 30      17      16      15
10           30           1      29 less than 30      21      17      21
# ℹ 96 more rows
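A selection like the one shown above can be made with select() and the contains() helper. This is a sketch on a small toy tibble; the values and the column TDASS_t0 are made up for illustration and are not taken from the dataset.

```r
library(dplyr)

# Toy stand-in for df; values and the TDASS_t0 column are hypothetical.
toy <- tibble(
  BridgesID = 1:3,
  K10_Score_GP = c(33, 24, 36),
  TK10_t0 = c(30, 30, 38),
  TDASS_t0 = c(10, 12, 8)
)

# Keep only the columns whose name contains "K10".
k10_vars <- toy %>%
  select(contains("K10"))

names(k10_vars)
```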
K10_di_t0 is a categorical variable created from TK10_t0, and it does not repeat for other time points. As such, it is mislabeled and we want to fix this. While we are at it, we can rename some other variables too.
df <- df %>%
  rename(id = BridgesID, Group = GROUP, K10preCat = K10_di_t0)
5 Demographics
The dataset does not include any demographics. Just for fun, we’ll add randomly assigned age and gender variables. mutate() helps us create or modify variables.
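One way this could look is sketched below on a toy tibble. The seed, the age range, and the gender categories are assumptions chosen for illustration; the workshop's own values may differ.

```r
library(dplyr)

set.seed(1234) # arbitrary seed so the random draws are reproducible

# Toy stand-in for df with six participants.
toy <- tibble(id = 1:6, Group = rep(c("Control", "MiCBT"), each = 3))

toy <- toy %>%
  mutate(
    # n() is the number of rows, so each participant gets one random value
    age = sample(18:70, size = n(), replace = TRUE),
    gender = sample(c("Female", "Male"), size = n(), replace = TRUE)
  )

toy
```

Because the result is assigned back to toy, the new variables are kept; without the assignment, mutate() would only print the modified data.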
We can see that there is no meaningful difference in age between the groups, which is expected since the values were randomly assigned. But if you wanted to test the difference formally, a t-test is one way.
Welch Two Sample t-test
data: age by Group
t = 0.64071, df = 101.07, p-value = 0.5232
alternative hypothesis: true difference in means between group Control and group MiCBT is not equal to 0
95 percent confidence interval:
-2.061734 4.028936
sample estimates:
mean in group Control   mean in group MiCBT
             42.70909              41.72549
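Output of this kind comes from base R's t.test() with a formula, which by default runs Welch's version of the test (no assumption of equal variances). The sketch below uses simulated data, so its numbers will not match the output above.

```r
set.seed(42) # arbitrary seed for reproducibility

# Simulated stand-in for df: 50 participants per group, made-up ages.
toy <- data.frame(
  Group = rep(c("Control", "MiCBT"), each = 50),
  age = round(rnorm(100, mean = 42, sd = 8))
)

# "age by Group": compare mean age between the two groups.
res <- t.test(age ~ Group, data = toy)

res$method  # name of the test that was run
res$p.value # p-value for the difference in means
```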
hist(df$age, col = "lightblue", main = "Histogram of participant age", xlab = "Age", breaks = 24)
With ggplot we have a lot more flexibility. Note that once ggplot() has been called, plot layers and settings are chained with + at the end of each line instead of the pipe %>%.
df %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "lightblue", color = "black") +
  labs(title = "Histogram of participant age", x = "Age", y = "Count") +
  theme_ki()
Let’s look separately at the Control group.
df %>%
  filter(Group == "Control") %>%
  ggplot(aes(x = age)) +
  geom_histogram(fill = "darkgreen", color = "white") +
  labs(title = "Histogram of participant age", x = "Age", y = "Count", subtitle = "Control group only") +
  theme_ki()
Practice
Make a new plot for intervention group age.
5.3 tidy filter/select
Note
Two key things to learn:
filter() works on rows, based on their column content
select() works on columns, based on their names
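The distinction in the note above can be sketched on a small toy tibble (the values are made up for illustration):

```r
library(dplyr)

# Toy stand-in for df.
toy <- tibble(
  id = 1:4,
  Group = c("Control", "MiCBT", "Control", "MiCBT"),
  age = c(34, 51, 47, 29)
)

control_ages <- toy %>%
  filter(Group == "Control") %>% # rows: keep participants in the Control group
  select(id, age)                # columns: keep only id and age

control_ages
```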
Or we can use facet_wrap() to make parallel plots.
df %>%
  ggplot(aes(x = age, fill = Group)) +
  geom_histogram(color = "white", binwidth = 3) +
  labs(title = "Histogram of participant age", x = "Age", y = "Count") +
  theme_ki() +
  scale_y_continuous(breaks = c(0, 4, 8, 12)) +
  scale_color_manual(values = ki_color_palette, aesthetics = c("fill", "color"), guide = "none") +
  facet_wrap(~Group, ncol = 2)
Practice
Plot age grouped by gender (bonus: facet_wrap by Group)
6 Variable names
Generally we should have systematic naming of variables, avoiding things like spaces (" "). There is an amazing function called janitor::clean_names() which defaults to using snake_case. It also offers options for things like camelCase and others. This function is primarily useful when you get a dataset that someone else collected and you need to bring order to the variable names.
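A small sketch of clean_names() on a made-up data frame with spaces in the column names (the column names here are hypothetical):

```r
library(janitor)

# check.names = FALSE keeps the spaces in the column names.
messy <- data.frame(
  `Bridges ID` = 1:2,
  `Start Date` = c("2020-01-01", "2020-02-01"),
  check.names = FALSE
)

snake <- clean_names(messy)                       # default: snake_case
camel <- clean_names(messy, case = "lower_camel") # camelCase variant

names(snake)
names(camel)
```

Note that neither call changes messy itself; the cleaned names only persist if the result is assigned to an object.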
A lot more readable. Please note that we did not “save” our changes in the previous code chunk. Let’s rename all variables in the dataframe that start with a capital “T”.
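A possible approach is rename_with(), sketched here under the assumption that the goal is to drop the leading "T". The column names in the toy tibble are hypothetical.

```r
library(dplyr)

# Toy stand-in for df; "treatment" starts with a lowercase t on purpose.
toy <- tibble(id = 1, TANX_t0 = 4, TDEP_t0 = 6, treatment = "x")

renamed <- toy %>%
  rename_with(
    ~ sub("^T", "", .x), # remove a "T" at the start of the name
    # starts_with() ignores case by default, so switch that off
    .cols = starts_with("T", ignore.case = FALSE)
  )

names(renamed)
```

Assigning the result (here to renamed) is what "saves" the change.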
From here, we will practice “breaking down” the pieces of the tidy/ggplot code by selecting parts of it and running them, adding one row/function/layer at a time.
df %>%
  cor_test("ANX_t0", "DEP_t0") %>%
  plot() +
  theme_ki() +
  geom_point(data = df, aes(ANX_t0, DEP_t0), size = 2.4, color = "#870052") +
  geom_smooth(data = df, aes(ANX_t0, DEP_t0), method = "lm", fill = "#FF876F", color = "#4F0433", alpha = 0.4) +
  labs(y = "Depression at pre (t0)", x = "Anxiety at pre (t0)", title = "Correlation between DASS-D and DASS-A at time 0.")
Exercise: create separate correlation plots for gender. Bonus points if you can get both in the same plot!
OK: No outliers detected.
- Based on the following method and threshold: mahalanobis (34.528).
- For variables: id, ANX_t0, ANX_t1, ANX_t2, ANX_t3, DEP_t0, DEP_t1, DEP_t2, DEP_t3, STRESS_t0, STRESS_t1, STRESS_t2, STRESS_t3
Should we try other methods? See ?check_outliers.
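To get a feel for what the Mahalanobis method does, here is a base R sketch on simulated data. It assumes a cutoff at the 0.999 quantile of a chi-square distribution with as many degrees of freedom as there are variables, which appears to be consistent with the threshold reported above for 13 variables. It is not a reimplementation of check_outliers().

```r
set.seed(1) # arbitrary seed for reproducibility

# Simulated data: 200 observations on 3 variables.
x <- matrix(rnorm(200 * 3), ncol = 3)

# Squared Mahalanobis distance of each row from the multivariate mean.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Assumed cutoff: 0.999 quantile of chi-square with df = number of variables.
threshold <- qchisq(1 - 0.001, df = ncol(x))
outliers <- which(d2 > threshold)

# The same cutoff computed for 13 variables, for comparison with the output above.
threshold_13 <- qchisq(1 - 0.001, df = 13)
round(threshold_13, 3)
```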
11 Wide to long format
Almost everything in R likes long format. Let’s look at the variables ending with “t”.
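As a rough illustration of what moving from wide to long format involves, the sketch below uses pivot_longer() on a toy tibble. The column names follow the measure_timepoint pattern, and the values are made up.

```r
library(dplyr)
library(tidyr)

# Wide format: one row per participant, one column per measure and time point.
wide <- tibble(
  id = 1:2,
  ANX_t0 = c(10, 14), ANX_t1 = c(8, 12),
  DEP_t0 = c(6, 9),   DEP_t1 = c(5, 7)
)

# Long format: one row per participant, measure, and time point.
long <- wide %>%
  pivot_longer(
    cols = -id,                      # pivot everything except the id column
    names_to = c("measure", "time"), # split each column name into two parts
    names_sep = "_",                 # ... at the underscore
    values_to = "score"
  )

long
```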
If you have a large dataset, I highly recommend using library(arrow) and the function write_parquet(), since it is incredibly fast and produces a small file. As an example, a 450 MB SPSS data file became 8 MB when saved in .parquet format.
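A minimal round trip could look like the sketch below. It writes the built-in mtcars data to a temporary file, so the path is a placeholder rather than a file from this project.

```r
library(arrow)

# Temporary file used only for this illustration.
path <- tempfile(fileext = ".parquet")

write_parquet(mtcars, path)    # save the data frame
restored <- read_parquet(path) # read it back in

dim(restored)
file.size(path) # size on disk, in bytes
```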
13 Bonus - dealing with questionnaire data
Note
This section will mostly be skimmed and the various plots and solutions are included for you to use as a reference if/when you would like to implement something similar in your future analyses. We can also go back to this section if we have time at the end.
We’ll use a dataset that actually includes raw response data from SurveyMonkey.
And, just for reference, here is how to manually order the response categories. Here, we also added the category names to each plot facet by adding scales = "free" to the facet_wrap() call. Note that this also frees the y axis to vary between facets, which can be less desirable. This is easily solved by adding scale_y_continuous(limits = c(0,150)) so that all facets range from 0 to 150.
df2 %>%
  na.omit() %>%
  pivot_longer(everything(), values_to = "category", names_to = "itemnr") %>%
  group_by(itemnr) %>%
  count(category) %>%
  left_join(., itemlabels, # this adds the item description to the dataset
    by = "itemnr"
  ) %>%
  mutate(category = factor(category, levels = c("Aldrig", "Ibland", "Ganska ofta", "Sällan", "Mycket ofta", "Alltid"))) %>% ### order response categories
  ggplot(aes(x = category, y = n, fill = item)) +
  geom_col() +
  facet_wrap(~item, # makes a separate facet/plot for each item
    ncol = 1, scales = "free"
  ) +
  theme_ki() +
  scale_fill_manual(values = ki_color_palette, guide = "none")