Harnessing the Power of HPC From the Comfort of RStudio

Eric R. Scott
University of Arizona

2024-10-17

My Background

  • Ecologist → “Scientific Programmer & Educator”

  • RStudio = my comfort zone ❤️

  • Attempted (unsuccessfully) to use HPC as PhD student

  • Successfully used HPC as postdoc

Barriers to HPC use

  • Requirement to use shell commands

  • Uncomfortable way of editing and running R code

  • Not seeing HPC resources as “for me”

Technologies to bridge the gap

Key skills that empowered me to use HPC while minimizing time outside of my comfort zone:

  1. GitHub

  2. renv 📦 for managing R package dependencies

  3. Open OnDemand

  4. targets 📦
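As a sketch, the renv workflow for keeping a project's package versions reproducible between a laptop and the HPC looks roughly like this (the function calls are renv's API; the exact order of steps is an assumption):

```r
# Sketch of a renv workflow (assumes the renv package is installed)
renv::init()      # create a project-local library and renv.lock
# ...install or update packages as usual, then:
renv::snapshot()  # record exact package versions in renv.lock
# On the HPC, after cloning the project from GitHub:
renv::restore()   # reinstall the recorded versions into the project library
```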

Open OnDemand

  • RStudio IDE in a web browser
  • Code run on HPC cores

[Screenshot: the form to configure an Open OnDemand instance of RStudio, with fields for Cluster, R version, Queue, Run Time, and Core Count]

I can avoid the command line entirely:

  • RStudio file pane for upload/download of files

  • RStudio git pane for interacting with git/GitHub

  • Run parallel R code on HPC cores without SLURM

  • Cons: can’t load additional modules (?)
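Because the RStudio session is already running on HPC cores, parallel R code can be written with ordinary base R tools; a minimal sketch (the core count here is an assumption standing in for whatever was requested in the Open OnDemand form):

```r
# Sketch: parallel R directly on the cores of an Open OnDemand RStudio
# session, with no SLURM script required
library(parallel)

n_cores <- 2  # assumed; match the cores requested in the Open OnDemand form
cl <- makeCluster(n_cores)
squares <- parLapply(cl, 1:8, function(i) i^2)  # run i^2 across workers
stopCluster(cl)
unlist(squares)
```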

targets

  • Make-like workflow management package for R
  • Skips computationally-intensive steps that are already up to date
  • Orchestrates parallel computing

targets

_targets.R

library(targets)
tar_source()                                      # <1>
tar_option_set(packages = c("dplyr", "ggplot2"))  # <2>
list(                                             # <3>
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, read.csv(file)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)

1. Sources all R scripts in R/ with custom functions fit_model() and plot_model()
2. Defines packages needed for the pipeline and other options
3. Defines the pipeline

targets

Visualize pipeline with tar_visnetwork()

targets

Run pipeline with tar_make()

targets::tar_make()
✔ skipped target file
✔ skipped target data
▶ dispatched target model
● completed target model [3.008 seconds, 2.879 kilobytes]
▶ dispatched target plot
● completed target plot [0.101 seconds, 1.081 kilobytes]
▶ ended pipeline [3.779 seconds]
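Once the pipeline has finished, completed targets can be pulled back into an interactive session; a minimal sketch using targets' own API (assumes the pipeline above has been run in the current project):

```r
library(targets)
model <- tar_read(model)   # return the stored value of one target
tar_load(c(model, plot))   # or load several targets into the global environment
```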

Parallel execution with crew

We can set up a crew controller to run targets in parallel.

_targets.R

library(targets)
tar_source()
tar_option_set(
  packages = c("dplyr", "ggplot2"),
  controller = crew::crew_controller_local(workers = 3)  # <1>
)
list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, read.csv(file)),
  tar_target(model1, fit_model1(data)),  # <2>
  tar_target(model2, fit_model2(data)),  # <2>
  tar_target(model3, fit_model3(data))   # <2>
)

1. Sets up three R sessions that can run tasks in parallel
2. These three targets can all be run in parallel

This “local” controller also works on Open OnDemand!

On the HPC with crew.cluster

Use SLURM (or PBS, SGE, etc.) without writing a bash script!

crew.cluster::crew_controller_slurm(
  workers = 5,                                  # <1>
  slurm_partition = "standard",
  slurm_time_minutes = 1200,
  slurm_log_output = "logs/crew_log_%A.out",
  slurm_log_error = "logs/crew_log_%A.err",
  slurm_memory_gigabytes_per_cpu = 5,
  slurm_cpus_per_task = 2,
  script_lines = c(                             # <2>
    "#SBATCH --account kristinariemer",
    "module load R"
  ),
  seconds_idle = 600                            # <3>
)

1. Launches 5 R sessions as SLURM jobs
2. Lines added verbatim to the generated SBATCH script
3. Creates semi-transient workers that shut down after 10 minutes idle
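To put this controller to work, pass it to tar_option_set() in _targets.R, just as with the local controller earlier; a sketch (the resource values are placeholders copied from the example above):

```r
# Sketch: swapping the local controller for a SLURM controller in _targets.R
# (assumes the crew.cluster package is installed on the HPC)
library(targets)
tar_option_set(
  packages = c("dplyr", "ggplot2"),
  controller = crew.cluster::crew_controller_slurm(
    workers = 5,
    slurm_partition = "standard",
    seconds_idle = 600
  )
)
```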

Template repository

cct-datascience/targets-uahpc

  • Links to relevant tutorials for prerequisite skills

  • Example targets pipeline

  • Uses renv for package management

  • Example crew controllers with all required fields set

  • Includes run.sh to launch targets::tar_make() as a SLURM job
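As a sketch, such a run.sh might look like the following (job name, partition, and time limit are placeholder assumptions; the template repo's actual script may differ):

```shell
#!/bin/bash
#SBATCH --job-name=targets-pipeline
#SBATCH --partition=standard
#SBATCH --time=20:00:00
# Load R and run the pipeline; the crew.cluster controller inside
# _targets.R then launches its workers as additional SLURM jobs
module load R
Rscript -e 'targets::tar_make()'
```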

How we can help bridge the gap

  • Collaborative workshops led by HPC RSEs & Domain RSEs

  • Offer workshops on using HPC without the command line

  • HPC workshops tailored to R/RStudio users

  • Create a template repo for using targets on your HPC

Questions?

ericrscott@arizona.edu

@LeafyEricScott@fosstodon.org


crew technical details

  • nanonext, R bindings for NNG (Nanomsg Next Gen), which powers…

  • mirai, a “minimalist async evaluation framework for R”, which powers…

  • crew, a unifying interface for creating distributed worker launchers

Optimizing crew.cluster