Managing complicated research workflows in R with {targets}

Eric R. Scott

2025-02-26

CCT Data Science

https://datascience.cct.arizona.edu/

We offer:

  • Workshops

  • Drop-in support hours

  • Incubator projects for ALVSCE

  • Collaboration on funded projects

Join our mailing list!

Before we get started

You’ll need the following R packages today:

# For making `targets` work:
library(targets)
library(tarchetypes)
library(visNetwork)
library(crew)

# For our data wrangling and analysis:
library(dplyr)
library(ggplot2)
library(janitor)
library(car)
library(broom)

# For installing demos and course materials:
library(usethis)

Learning Objectives

  • Understand what problems targets (and workflow managers in general) solve

  • Be able to set up a simple project using targets, view the dependency graph, and run the workflow

  • Write a (good enough) custom R function

  • Have an awareness of the possibilities: parallelization, cloud storage, iteration, Bayesian analyses, geospatial targets

  • Know where to go for help with targets

Part 1: Context

Moving Toward Reproducibility

Working toward reproducible analyses benefits:

  • You in the future

  • Your collaborators

  • The greater community

Cute fuzzy monsters putting rectangular data tables onto a conveyor belt. Along the conveyor belt line are different automated “stations” that update the data, reading “WRANGLE”, “VISUALIZE”, and “MODEL”. A monster at the end of the conveyor belt is carrying away a table that reads “Complete analysis.”

Image from the Openscapes blog post Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Workflow management

  • Automatically detect dependencies

  • Run your entire analysis with one master command

  • Skip steps that don’t need to be re-run

  • Scalable

Part 2: Demonstration

Demo

To install a demo targets pipeline, run the following R code and follow the prompts:

usethis::use_course("cct-datascience/targets-demo")

In this new project, try the following:

library(targets)
tar_visnetwork() # view dependency graph
tar_make() # run the pipeline
tar_visnetwork()
tar_read(lm_full) # view the "lm_full" target

Now, in R/fit_models.R, change the response variable from flipper_length_mm to bill_length_mm and run tar_visnetwork() again.

Working with targets

  • The result of each step is stored as an R object in _targets/

  • Everything that happens to make that target is written as a function

Anatomy of a targets project

  1. _targets.R
    • Configure and define workflow
  2. R/
    • R script(s) with custom functions
  3. _targets/
    • Generated by tar_make()
    • Contains targets saved as R objects
    • Should not be tracked in version control
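
Putting these pieces together, a minimal _targets.R might look like this (a sketch; get_data() stands in for a custom function defined in a script in R/):

_targets.R
library(targets)

tar_option_set(packages = c("dplyr"))

tar_source() # load custom functions from scripts in R/

list(
  tar_target(data_file, "data/penguins_raw.csv", format = "file"),
  tar_target(data, get_data(data_file))
)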

Part 3: Refactoring Exercise

Refactoring a project to use targets

refactor

/rēˈfaktər/

verb

  1. restructure (the source code of an application or piece of software) so as to improve operation without altering functionality.

First, let’s start with a “traditional” R analysis project. Download and take a look around:

usethis::use_course("cct-datascience/targets-refactor")
(Exercise timer: 3 minutes)

Refactoring a project to use targets

  1. use_targets() to set up infrastructure including _targets.R
  2. Add package dependencies to tar_option_set()
  3. Convert R script(s) to functions
  4. Write targets
  5. Run tar_make()
  6. Debug

We’re going to walk through this process together

1. use_targets()

Exercise

Run use_targets() and open the _targets.R file it generates

2. Package dependencies

Exercise

Figure out what packages are needed and add them to tar_option_set() in _targets.R.

_targets.R
tar_option_set(
  packages = c(
    "tidyverse",
    "lubridate"
    # add more packages here
  ),
  format = "rds"
)

Tip

This would also be a good time to install any necessary packages with install.packages()!

(Exercise timer: 2 minutes)

Custom functions in R

Functions in R are created with the function() function

add_ten <-        #name of function
  function(x) {   #argument names and defaults defined here
    x + 10        #what the function does
  }

add_ten(x = 13)
[1] 23

Using multiple arguments:

to_fractional_power <- 
  function(x, numerator, denominator) {
    power <- numerator / denominator
    x^power
  }
to_fractional_power(27, 1, 3)
[1] 3

Important

The last line of a function must return something, so don’t end a function by assigning results with <-!
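
For example, with a hypothetical data-cleaning function:

WORSE (the result of the assignment is only returned invisibly, which is easy to miss):

clean_data <- function(data_raw) {
  data_clean <- janitor::clean_names(data_raw)
}

BETTER (the last line is the value the function returns):

clean_data <- function(data_raw) {
  janitor::clean_names(data_raw)
}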

“Good enough” functions for targets

  • Use a naming convention (verbs are good)
  • Make the function arguments the same as target names
  • Document!

WORSE:

m1 <- function(x) {
  lm(stem_length ~ watershed, data = x)
}

BETTER:

# Fit linear model to explain stem length as a function of watershed treatment
# data_clean = a tibble; the `data_clean` target
fit_model <- function(data_clean) {
  lm(stem_length ~ watershed, data = data_clean)
}

3. Create functions

Exercise

Convert the code in 01-read_wrangle_data.R into a function that takes the file path to the raw data as an argument and returns the cleaned maples data frame.

08:00

Defining the workflow

Steps in the workflow are defined with the tar_target() function inside the list() at the end of _targets.R

_targets.R
list(
  tar_target(name = file, command = "data/penguins_raw.csv", format = "file"),
  tar_target(name = data, command = get_data(file))
)
  • Usually you only need to use the first two arguments, name for the name of the target, and command for what that target does

  • Targets that are files (or create files) need the additional argument format = "file" (for files on disk) or format = "url" (for files on the web)

Tip

A common mistake is to leave a trailing comma after the last target. This will result in an error like:

Error running targets::tar_make()
Last error: argument 5 is empty

Naming targets

  • Use a naming convention (nouns are good)

  • Use concise but descriptive target names

WORSE:

  • data1, data2, data3
  • histogram_by_site_plot

BETTER:

  • data_file, data_raw, data_clean

  • plot_hist_site

4. Create Targets

Exercise

Create targets for the input CSV file and the results of your data wrangling function in _targets.R. Remember to use format = "file" in the target for the CSV file.

Check your progress by running tar_visnetwork() and tar_make().

Once you’ve gotten those first two targets working, try creating functions and targets for additional steps in the analysis.

(Exercise timer: 10 minutes)
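
If you get stuck, the first two targets might look something like this (a sketch; the target names and file path are hypothetical, and wrangle_data() is the function from the previous exercise):

_targets.R
list(
  tar_target(maples_file, "data/maples.csv", format = "file"),
  tar_target(data_clean, wrangle_data(maples_file))
)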

Debugging targets

  • Errors in tar_make() are sometimes uninformative because code is run in a separate R session
  • Use tar_meta() to access error messages
tar_meta(fields = error, complete_only = TRUE)

Debugging targets

When a target errors, you can load a “workspace” with all the functions, data, and packages needed to reproduce the error interactively.

Enable this with:

tar_option_set(
  workspace_on_error = TRUE
)

Then, load the workspace with tar_workspace(<NAME OF TARGET>)

# load the upstream target `data_clean` and the `fit_model()` function into the global environment
tar_workspace(linear_model)
# try the failing command interactively
fit_model(data_clean)
Error in eval(mf, parent.frame()): object 'df_cleaned' not found

Can you spot the source of the error?

fit_model.R
fit_model <- function(data_clean) {
  lm(y ~ x + z, data = df_cleaned)
}

Alternative workflow definitions

tarchetypes provides alternatives to list() for defining the workflow.

tar_plan() allows a name = command() shortcut

_targets.R
tarchetypes::tar_plan(
  tar_file(data_file, "data.csv"),
  data = read.csv(data_file),
  model = fit_model(data)
)

tar_assign() allows the assignment operator (<-) and works well with pipes (%>% or |>).

_targets.R
tarchetypes::tar_assign({      
  data_file <- tar_file("data.csv")
  data <- 
    read.csv(data_file) |>
    filter(year >= 2020) |> 
    tar_target()
  model <- 
    lm(y ~ x + z, data = data) |> 
    tar_target()
})

Part 4: A Taste of What’s Possible

Bayesian analyses with targets

The stantargets package provides “target factories” for Bayesian analyses

  • Simplify analyses by automatically creating targets for multiple steps

  • E.g. defining a target with tar_stan_mcmc() actually generates multiple targets that wrangle data, run the MCMC, create a table of posterior draws, etc.
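
A minimal sketch, assuming the stantargets package (which requires cmdstanr) and a Stan model file model.stan whose data block expects n and x:

_targets.R
library(targets)
library(stantargets)

# hypothetical function simulating data to match the model's data block
simulate_data <- function(n = 100) {
  list(n = n, x = rnorm(n))
}

list(
  tar_stan_mcmc(
    fit, # one call generates targets for the data, the MCMC run, posterior draws, etc.
    stan_files = "model.stan",
    data = simulate_data()
  )
)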

Geospatial targets

geotargets provides helpers to use geospatial packages like terra and stars with targets
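
For example, a raster target might look like this (a sketch using a sample elevation raster that ships with terra):

_targets.R
library(targets)
library(geotargets)

list(
  # tar_terra_rast() stores the target in a geospatial-aware format
  tar_terra_rast(
    elevation,
    terra::rast(system.file("ex/elev.tif", package = "terra"))
  )
)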

Iteration

  • Iterate targets over a list of inputs with dynamic branching—useful for large tasks where it would be cumbersome to write out individual targets

  • E.g. fitting a model to each of 100 bootstrapped copies of the data

# excerpt from the list() in _targets.R
# create a list of 100 bootstrapped data frames
tar_target(
  data_boot,
  purrr::map(1:100, ~ sample_frac(data, size = 1, replace = TRUE)),
  iteration = "list"
),
# fit a model to each data frame with dynamic branching; save results as a list
tar_target(
  lm_boot,
  fit_lm_full(data_boot), # fit_lm_full() is a custom model-fitting function
  pattern = map(data_boot),
  iteration = "list"
)

Parallel computation

In a workflow with three model targets and three plot targets that don't depend on one another, all six can be run at the same time.
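
Enabling this kind of parallelism is a small change to _targets.R (a sketch using the crew package loaded earlier; the worker count here is arbitrary):

_targets.R
tar_option_set(
  controller = crew::crew_controller_local(workers = 3)
)

With a controller set, tar_make() automatically distributes independent targets across the workers.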

High Performance Computing

  • crew.cluster provides additional “controllers” for HPC including crew_controller_slurm().
  • Check out this template repository with code that you can use to run a targets workflow on the UA HPC
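
A minimal sketch of the SLURM version (any options beyond workers will depend on your cluster):

_targets.R
tar_option_set(
  controller = crew.cluster::crew_controller_slurm(workers = 10)
)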

Cloud Storage

  • By default, the _targets/ store is on your computer and not shared with collaborators

  • Collaborators will have to run tar_make() to reproduce the workflow, which might not be convenient if some targets take days or weeks to run

Diagram describing how the _targets/ folder is stored locally and not synced to GitHub (only _targets.R and the R folder are on GitHub).  A collaborator therefore doesn't have the _targets folder and needs to run tar_make() on their computer.

Cloud Storage

  • Optionally, _targets/ can be stored in the cloud (Amazon Web Services S3 or Google Cloud Storage buckets)

  • These stores can be versioned, so you can roll back your _targets.R and not have to re-compute targets.

Variation of previous diagram but now with _targets/ stored in a cloud that is accessible to both computers
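
Configuring an AWS-backed store might look like this (a sketch; the bucket name is a placeholder, and AWS credentials must be configured separately):

_targets.R
tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-bucket", prefix = "_targets")
  )
)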

Wrap-up

When to use targets?

Things to consider:

  • Are intermediates R objects or files (as opposed to, say, in-place modifications to a database)?

  • Benefits of parallelization

  • Your collaborators

  • Your comfort using targets

Targets in the wild

“Real life” examples of targets workflows:

Where to go for help