Reproducible data analysis with R and Quarto

2023-06-12

Ambitions with this course

  • It is not a statistics course
  • We’ll cover the basics of using R, Rstudio and Quarto
  • More advanced things will be addressed at later sessions
    • requests are welcome
  • I like pragmatic learning, so we’ll use practical examples for most things

Expectations and hopes

I hope that you will be fairly comfortable with:

  • importing a dataset & doing basic analysis
  • documenting your analysis alongside your code
    • creating an output document
  • getting something out of the many online articles using R for analysis
  • searching the internet for help

Words of wisdom

https://datavizs23.classes.andrewheiss.com/slides/15-slides.html#62

Overview for today

  • the basics of using Rstudio
  • the basics of setting up and using Quarto (in Rstudio)
  • a tiny bit of really basic R functions
  • importing data (and naming variables)
  • data wrangling
  • descriptive analysis & visualization
  • modeling & visualization

Overview of files

  • Our content: start.qmd (this file), data_import.qmd, and data_analysis.qmd
    • these have been rendered to HTML in /docs folder (including all output and code)
  • Structure:
    1. walkthrough of this presentation in parallel of looking at Rstudio & code
    2. work through the data_* files together, exercises are interspersed in the files

Keyboard shortcuts

You will benefit greatly from doing things cleverly and establishing good habits from the start

  • avoid unnecessary typing and using the mouse/trackpad

I have prepared a handout containing the most frequently used shortcuts

  • We’ll walk through them later on, and you have the handout as a reference

Fundamentals

  • never touch your datafile manually!!!
    • everything you do with your data needs to be documented
  • Don’t erase code/things you try (only erase faulty code)
    • you need to be able to retrace your steps
    • makes it easier to involve others
    • You can copy&paste selected part of your code later if you need to clean up for publication/sharing

Good things

  • Creating step-by-step templates for data analysis workflow
    • quality assurance!
  • Traceability
    • Transparency in decision making during analysis
  • Reproducibility
  • We use freely available tools
  • Documentation in the same file as analysis code

Levels of reproducibility

Meme credit: https://bookdown.org/pdr_higgins/rmrwr/introduction-to-reproducibility.html#cleaning-and-analyzing-your-data

Rstudio layout

  • 4 quadrants with tabs in each
    • code, console, environment & output/help
  • projects!
  • outline & line numbers
  • moving a code tab to a separate window
  • code editor is helpful and sometimes confusing

Are you using the course project?

  • Check the upper right corner in Rstudio
    • It should say RstudioQuartoIntro

If it doesn’t, you need to navigate your file explorer/finder to where you saved the course files, and open the file RstudioQuartoIntro.Rproj

Quarto basics

  • YAML
  • code chunks & chunk output
  • source view and “visual”
  • headings and structure
  • labelling with tbl- and fig-
  • (panel-tabset)

Quarto basics recap

Basics about R

  • based on functions() and objects
    • functions do things with data
    • objects store data and output from functions
  • let’s first skim basic definitions, then go to practical examples
  • decimal symbol is always .! Commas are used to separate things.
  • everything written after a # is a comment

Functions

  • a function always has () tied to it
  • various settings within a function are divided by ,
  • we can pipe/forward an object or function output to another function
    • there are two pipes (we’ll only use the first one)
    • %>% - the magrittr pipe (shift+ctrl+m)
    • |> - the base pipe

Function syntax

  • comma separates input within the parantheses
hist(data, xlab = "Distribution of variable", ylab = "Count")
  • use the Help tab in lower right quadrant
    • or ?function in console - ie. ?hist
    • tab-complete also works (I’ll demo)

Objects

  • objects can have many different classes/formats
  • you can use class(object) to check
  • we will mostly work with dataframes
    • similar to what you see in a spreadsheet file (Excel)
    • they consist of vectors (rows and columns)
    • a vector is a string of values
  • each object or column in a dataframe can have a different class
    • we will primarily work with
    • character (chr)
    • factor (fct)
    • numerical (num, int, or dbl)

Simple object examples

Single values

number <- 1
character <- "1"

Multiple values (a vector)

numbers <- c(1,2)
characters <- c("one","two")

Dataframe

df <- read_excel("path/to/excelfile.xls", sheet = 1)
df2 <- read.spss("path/to/spssfile.sav", to.data.frame = TRUE)

Keyboard shortcuts (in Swedish)

  • TAB TAB-tangenten (ovanför CAPS LOCK) har många funktioner, dels kan den fylla i objekt- eller funktionsnamn när du skriver kod, men även i sökvägar när du ska läsa en fil. “TAB-completion”
  • CTRL+enter
  • Kör kodraden du har markören på (och hoppar ner till nästa).
  • Kör koden du har markerat.
  • Om du har ett block med flera rader av tidy-kod (som hålls samman med pipe-symbolen %>%) eller ggplot (som hålls samman med +) så spelar det ingen roll var bland raderna markören är, hela blocket körs ändå
  • CTRL+SHIFT+enter Kör en hel code chunk i Quarto. Spelar ingen roll var i chunken du har markören.
  • CTRL+SHIFT+m Skapar en pipe-symbol (%>% eller |> beroende på vad du valt i Rstudio).
  • CTRL+ALT(option)+i Skapar en ny code chunk i Quarto.
  • CTRL/CMD+SHIFT+c Kommentera bort/in alla rader som är markerade.

Intro done

Next: importing, inspecting and visualizing data