Managing complicated research workflows in R with {targets}

Eric R. Scott

2025-02-26

CCT Data Science

https://datascience.cct.arizona.edu/

We offer:

  • Workshops

  • Drop-in support hours

  • Incubator projects for ALVSCE

  • Collaboration on funded projects

Join our mailing list!

Before we get started

You’ll need the following R packages today:

# For making `targets` work:
library(targets)
library(tarchetypes)
library(visNetwork)
library(crew)

# For our data wrangling and analysis:
library(dplyr)
library(ggplot2)
library(janitor)
library(car)
library(broom)

# For installing demos and course materials:
library(usethis)

Learning Objectives

  • Understand what problems targets (and workflow managers in general) solve

  • Be able to set up a simple project using targets, view the dependency graph, and run the workflow

  • Write a (good enough) custom R function

  • Have an awareness of the possibilities: parallelization, cloud storage, iteration, Bayesian analyses, geospatial targets

  • Know where to go for help with targets

Part 1: Context

Moving Toward Reproducibility

Working toward reproducible analyses benefits:

  • You in the future

  • Your collaborators

  • The greater community

Cute fuzzy monsters putting rectangular data tables onto a conveyor belt. Along the conveyor belt line are different automated “stations” that update the data, reading “WRANGLE”, “VISUALIZE”, and “MODEL”. A monster at the end of the conveyor belt is carrying away a table that reads “Complete analysis.”

Image from the Openscapes blog post Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Workflow management

  • Automatically detect dependencies

  • Run your entire analysis with one master command

  • Skip steps that don’t need to be re-run

  • Scalable

Part 2: Demonstration

Demo

To install a demo targets pipeline, run the following R code and follow the prompts:

usethis::use_course("cct-datascience/targets-demo")

In this new project, try the following:

library(targets)
tar_visnetwork() # view dependency graph
tar_make() # run the pipeline
tar_visnetwork()
tar_read(lm_full) # view the "lm_full" target

Now, in R/fit_models.R, change the response variable from flipper_length_mm to bill_length_mm and run tar_visnetwork() again.

Working with targets

  • The result of each step is stored as an R object in _targets/

  • Everything that happens to make that target is written as a function

Anatomy of a targets project

  1. _targets.R
    • Configure and define workflow
  2. R/
    • R script(s) with custom functions
  3. _targets/
    • Generated by tar_make()
    • Contains targets saved as R objects
    • Should not be tracked in version control
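
Putting these pieces together, a minimal _targets.R might look like this (a sketch; get_data() stands in for a custom function defined in a script in R/):

_targets.R
library(targets)

tar_option_set(packages = c("dplyr"))

tar_source() # load custom functions from scripts in R/

list(
  tar_target(data_file, "data/penguins_raw.csv", format = "file"),
  tar_target(data, get_data(data_file))
)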

Part 3: Refactoring Exercise

Refactoring a project to use targets

refactor

/rēˈfaktər/

verb

  1. restructure (the source code of an application or piece of software) so as to improve operation without altering functionality.

First, let’s start with a “traditional” R analysis project. Download and take a look around:

usethis::use_course("cct-datascience/targets-refactor")
(Exercise timer: 3 minutes)

Refactoring a project to use targets

  1. use_targets() to set up infrastructure including _targets.R
  2. Add package dependencies to tar_option_set()
  3. Convert R script(s) to functions
  4. Write targets
  5. Run tar_make()
  6. Debug

We’re going to walk through this process together

1. use_targets()

Exercise

Run use_targets() and open the _targets.R file it generates

2. Package dependencies

Exercise

Figure out what packages are needed and add them to tar_option_set() in _targets.R.

_targets.R
tar_option_set(
  packages = c(
    "tidyverse",
    "lubridate"
    # add more packages here
  ),
  format = "rds"
)

Tip

This would also be a good time to install any necessary packages with install.packages()!

(Exercise timer: 2 minutes)

Custom functions in R

Functions in R are created with the function() function

add_ten <-        #name of function
  function(x) {   #argument names and defaults defined here
    x + 10        #what the function does
  }

add_ten(x = 13)
[1] 23

Using multiple arguments:

to_fractional_power <- 
  function(x, numerator, denominator) {
    power <- numerator / denominator
    x^power
  }
to_fractional_power(27, 1, 3)
[1] 3

Important

The last line of a function must return something, so don’t end a function by assigning results with <-!
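
For example, with a hypothetical data-cleaning function:

WORSE (the result of the assignment is only returned invisibly, which is easy to miss):

clean_data <- function(data_raw) {
  data_clean <- janitor::clean_names(data_raw)
}

BETTER (the last line is the value the function returns):

clean_data <- function(data_raw) {
  janitor::clean_names(data_raw)
}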

“Good enough” functions for targets

  • Use a naming convention (verbs are good)
  • Make the function arguments the same as target names
  • Document!

WORSE:

m1 <- function(x) {
  lm(stem_length ~ watershed, data = x)
}

BETTER:

# Fit linear model to explain stem length as a function of watershed treatment
# data_clean = a tibble; the `data_clean` target
fit_model <- function(data_clean) {
  lm(stem_length ~ watershed, data = data_clean)
}

3. Create functions

Exercise

Convert the code in 01-read_wrangle_data.R into a function that takes the file path to the raw data as an argument and returns the cleaned maples data frame.

08:00

Defining the workflow

Steps in the workflow are defined with the tar_target() function inside the list() at the end of _targets.R

_targets.R
list(
  tar_target(name = file, command = "data/penguins_raw.csv", format = "file"),
  tar_target(name = data, command = get_data(file))
)
  • Usually you only need to use the first two arguments, name for the name of the target, and command for what that target does

  • Targets that are files (or create files) need the additional argument format = "file" (for files on disk) or format = "url" (for files on the web)

Tip

A common mistake is to leave a trailing comma after the last target. This will result in an error like:

Error running targets::tar_make()
Last error: argument 5 is empty

Naming targets

  • Use a naming convention (nouns are good)

  • Use concise but descriptive target names

WORSE:

  • data1, data2, data3
  • histogram_by_site_plot

BETTER:

  • data_file, data_raw, data_clean

  • plot_hist_site

4. Create Targets

Exercise

Create targets for the input CSV file and the results of your data wrangling function in _targets.R. Remember to use format = "file" in the target for the CSV file.

Check your progress by running tar_visnetwork() and tar_make().

Once you’ve gotten those first two targets working, try creating functions and targets for additional steps in the analysis.

(Exercise timer: 10 minutes)
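
If you get stuck, the first two targets might look something like this (a sketch; the target names and file path are hypothetical, and wrangle_data() is the function from the previous exercise):

_targets.R
list(
  tar_target(maples_file, "data/maples.csv", format = "file"),
  tar_target(data_clean, wrangle_data(maples_file))
)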

Debugging targets

  • Errors in tar_make() are sometimes uninformative because code is run in a separate R session
  • Use tar_meta() to access error messages
tar_meta(fields = error, complete_only = TRUE)

Debugging targets

When a target errors, you can load a “workspace” with all the functions, data, and packages needed to reproduce the error interactively.

Enable this with:

tar_option_set(
  workspace_on_error = TRUE
)

Then, load the workspace with tar_workspace(<NAME OF TARGET>)

# load the upstream target `data_clean` and the `fit_model()` function into the global environment
tar_workspace(linear_model)
# try the failing command interactively
fit_model(data_clean)
Error in eval(mf, parent.frame()): object 'df_cleaned' not found

Can you spot the source of the error?

fit_model.R
fit_model <- function(data_clean) {
  lm(y ~ x + z, data = df_cleaned)
}

Alternative workflow definitions

tarchetypes provides alternatives to list() for defining the workflow.

tar_plan() allows a name = command() shortcut

_targets.R
tarchetypes::tar_plan(
  tar_file(data_file, "data.csv"),
  data = read.csv(data_file),
  model = fit_model(data)
)

tar_assign() allows the assignment operator (<-) and works well with pipes (%>% or |>).

_targets.R
tarchetypes::tar_assign({      
  data_file <- tar_file("data.csv")
  data <- 
    read.csv(data_file) |>
    filter(year >= 2020) |> 
    tar_target()
  model <- 
    lm(y ~ x + z, data = data) |> 
    tar_target()
})

Part 4: A Taste of What’s Possible

Bayesian analyses with targets

The stantargets package provides “target factories” for Bayesian analyses

  • Simplify analyses by automatically creating targets for multiple steps

  • E.g. defining a target with tar_stan_mcmc() actually generates multiple targets that wrangle data, run the MCMC, create a table of posterior draws, etc.
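
A minimal sketch, assuming the stantargets package (which requires cmdstanr) and a Stan model file model.stan whose data block expects n and x:

_targets.R
library(targets)
library(stantargets)

# hypothetical function simulating data to match the model's data block
simulate_data <- function(n = 100) {
  list(n = n, x = rnorm(n))
}

list(
  tar_stan_mcmc(
    fit, # one call generates targets for the data, the MCMC run, posterior draws, etc.
    stan_files = "model.stan",
    data = simulate_data()
  )
)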

Geospatial targets

geotargets provides helpers to use geospatial packages like terra and stars with targets
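
For example, a raster target might look like this (a sketch using a sample elevation raster that ships with terra):

_targets.R
library(targets)
library(geotargets)

list(
  # tar_terra_rast() stores the target in a geospatial-aware format
  tar_terra_rast(
    elevation,
    terra::rast(system.file("ex/elev.tif", package = "terra"))
  )
)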

Iteration

  • Iterate targets over a list of inputs with dynamic branching—useful for large tasks where it would be cumbersome to write out individual targets

  • E.g. fitting a model to each of 100 bootstrapped copies of the data

# excerpt from the list() in _targets.R
# create a list of 100 bootstrapped data frames
tar_target(
  data_boot,
  purrr::map(1:100, ~ sample_frac(data, size = 1, replace = TRUE)),
  iteration = "list"
),
# fit a model to each data frame with dynamic branching; save results as a list
tar_target(
  lm_boot,
  fit_lm_full(data_boot), # fit_lm_full() is a custom model-fitting function
  pattern = map(data_boot),
  iteration = "list"
)

Parallel computation

In a workflow with three model targets and three plot targets that don't depend on one another, all six can be run at the same time.
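
Enabling this kind of parallelism is a small change to _targets.R (a sketch using the crew package loaded earlier; the worker count here is arbitrary):

_targets.R
tar_option_set(
  controller = crew::crew_controller_local(workers = 3)
)

With a controller set, tar_make() automatically distributes independent targets across the workers.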

High Performance Computing

  • crew.cluster provides additional “controllers” for HPC including crew_controller_slurm().
  • Check out this template repository with code that you can use to run a targets workflow on the UA HPC
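
A minimal sketch of the SLURM version (any options beyond workers will depend on your cluster):

_targets.R
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(workers = 10)
)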

Cloud Storage

  • By default, the _targets/ store is on your computer and not shared with collaborators

  • Collaborators will have to run tar_make() to reproduce the workflow, which might not be convenient if some targets take days or weeks to run

Diagram describing how the _targets/ folder is stored locally and not synced to GitHub (only _targets.R and the R folder are on GitHub).  A collaborator therefore doesn't have the _targets folder and needs to run tar_make() on their computer.

Cloud Storage

  • Optionally, _targets/ can be stored in the cloud (Amazon Web Services S3 or Google Cloud Storage buckets)

  • These stores can be versioned, so you can roll back your _targets.R and not have to re-compute targets.

Variation of previous diagram but now with _targets/ stored in a cloud that is accessible to both computers
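
Configuring an AWS-backed store might look like this (a sketch; the bucket name is a placeholder, and AWS credentials must be configured separately):

_targets.R
tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-bucket", prefix = "_targets")
  )
)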

Wrap-up

When to use targets?

Things to consider:

  • Are intermediates R objects or files (as opposed to, say, in-place modifications to a database)?

  • Benefits of parallelization

  • Your collaborators

  • Your comfort using targets

Targets in the wild

“Real life” examples of targets workflows:

Where to go for help