Harnessing HPC power with {targets}

Eric R. Scott

CCT Data Science, University of Arizona

Jan 31, 2023

My Background

  • Ecologist turned research software engineer

  • Not an HPC professional or expert

  • Most comfortable when never leaving RStudio Desktop

{targets} forces you to structure workflows so that parallelization is easy

  • Modular workflows & branching create independent targets
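As a minimal sketch of what "independent targets" can look like, the _targets.R below uses dynamic branching to fit one model per group. The target names and the fit_model() helper are hypothetical, and the built-in mtcars data stands in for real inputs.

```r
# _targets.R (illustrative sketch; names and helper are hypothetical)
library(targets)

# Hypothetical helper: fit one small model to a subset of the data
fit_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

list(
  # The values to branch over
  tar_target(cyl_values, c(4, 6, 8)),
  # Dynamic branching: one branch per element of cyl_values.
  # Branches do not depend on each other, so they can run in parallel.
  tar_target(
    model,
    fit_model(mtcars[mtcars$cyl == cyl_values, ]),
    pattern = map(cyl_values),
    iteration = "list" # keep each lm object as its own list element
  ),
  # A downstream target that waits for all branches
  tar_target(n_models, length(model))
)
```

Because no branch of model needs the result of another branch, a parallel backend is free to build them at the same time.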

Running targets in parallel

  • use_targets() automatically sets things up

  • Use tar_make_clustermq() or tar_make_future() to run in parallel

  • Parallel processes on your computer or jobs on a computing cluster

  • Potentially easy entry to high performance computing
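The two functions named above can be called as follows. This is a sketch only: the worker count is arbitrary, and it assumes the installed version of targets still provides both functions.

```r
# Run from the project root, where _targets.R lives.
# The number of workers is an arbitrary choice for illustration.

# Persistent workers via clustermq
targets::tar_make_clustermq(workers = 2)

# Or: transient workers via future
targets::tar_make_future(workers = 2)
```

Only one of the two calls is needed for a given run; which backend is used depends on how _targets.R is configured.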

Persistent vs. Transient workers

Persistent workers with clustermq

  • One-time cost to set up workers

  • System dependency on ZeroMQ

Transient workers with future

  • Every target gets its own worker (more overhead)

  • No additional system dependencies
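For trying out either kind of worker on a local machine before moving to a cluster, the backend can be selected in _targets.R roughly as below. This sketch assumes the clustermq, future, and future.callr packages are installed; only one of the two settings is needed.

```r
# _targets.R fragment: choose ONE backend

# clustermq: persistent local worker processes,
# used by tar_make_clustermq()
options(clustermq.scheduler = "multiprocess")

# future: each target runs in a fresh local R process,
# used by tar_make_future()
future::plan(future.callr::callr)
```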

Set up clustermq on a cluster

  1. Take the basic HPC training at your organization
  2. Install clustermq R package on the cluster
  3. You might need to open a support ticket to get ZeroMQ (https://zeromq.org/) installed
  4. On the cluster, in a directory, launch R and run targets::use_targets()
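In an R session on the cluster, the install and scaffolding steps might look like the sketch below. Which files use_targets() writes can depend on the targets version and on the scheduler it detects, so check the output in your project directory.

```r
# Run in R on the cluster, from the project directory.
# Installing clustermq may fail if ZeroMQ is not available on the system.
install.packages(c("targets", "clustermq"))

# Writes a starter _targets.R and related helper files
targets::use_targets()
```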

Set up clustermq on a cluster (continued)

  5. Edit the SLURM (or other scheduler) template that was created

#!/bin/sh
#SBATCH --job-name={{ job_name }}            # job name
#SBATCH --partition=hpg-default              # partition
#SBATCH --output={{ log_file | logs/workers/pipeline%j_%a.out }} # stdout log; %j = job ID, %a = array index
#SBATCH --error={{ log_file | logs/workers/pipeline%j_%a.err }}  # stderr log
#SBATCH --mem-per-cpu={{ memory | 8GB }}     # memory per CPU
#SBATCH --array=1-{{ n_jobs }}               # job array, one task per worker
#SBATCH --cpus-per-task={{ cores | 1 }}      # CPUs per worker
#SBATCH --time={{ time | 1440 }}             # time limit in minutes

source /etc/profile

ulimit -v $(( 1024 * {{ memory | 8192 }} ))  # virtual memory limit in KB
module load R/4.0   # R 4.1 not working currently
module load pandoc  # for rendering R Markdown
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
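
The {{ memory }}, {{ cores }}, and {{ time }} placeholders in the template can be filled from R. A sketch of one way to do that in _targets.R follows; the specific values are assumptions chosen for illustration.

```r
# _targets.R fragment: values here fill the template placeholders.
# The numbers are illustrative, not recommendations.
library(targets)

tar_option_set(
  resources = tar_resources(
    clustermq = tar_resources_clustermq(
      template = list(
        memory = 16384, # MB; fills both --mem-per-cpu and the ulimit line
        cores = 1,      # fills --cpus-per-task
        time = 720      # minutes; fills --time
      )
    )
  )
)
```

Because clustermq workers are persistent, these values apply to every worker, so they are set once for the whole pipeline rather than per target. A placeholder that is not supplied falls back to the default after the | in the template.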

  6. Check that clustermq works
  7. Check that tar_make_clustermq() works
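
The two checks can be sketched as below, run from R on the cluster. The template file name is an assumption; point it at the template you edited.

```r
# Assumed template location; adjust to match your project
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "clustermq.tmpl"
)

# Check clustermq on its own: submit one SLURM job that doubles three numbers.
# If this hangs or errors, look at the worker logs before involving targets.
clustermq::Q(function(x) x * 2, x = 1:3, n_jobs = 1)

# Check targets: tar_make_clustermq() runs the pipeline in a separate
# R process, so the same options also need to be set in _targets.R.
targets::tar_make_clustermq(workers = 2)
```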

How to work comfortably

  • May need to run from command line without RStudio

  • Options for using RStudio on the cluster can be clunky

  • Data store (_targets/) not synced with local computer

Develop locally, sync, run on cluster

Cloud storage

SSH connection

  • Develop and run workflow on your computer

  • Targets are sent off to the cluster to be run as SLURM jobs

  • Results returned and _targets/ store remains on your computer

  • Ideal when:

    • Only some targets need cluster computing

    • Targets don’t run too long

    • No comfortable way to use RStudio on the cluster
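
When only some targets need the cluster, the deployment argument of tar_target() controls where each one runs. The sketch below uses hypothetical target names and a toy model in place of a heavy computation.

```r
# _targets.R fragment (target names and helper are hypothetical)
library(targets)

# Stand-in for an expensive computation
slow_fit <- function(d) {
  lm(mpg ~ wt, data = d)
}

list(
  # Light step: runs in the main R process on your computer
  tar_target(raw_data, mtcars, deployment = "main"),
  # Heavy step: sent to a worker, here a SLURM job reached over SSH
  tar_target(fit, slow_fit(raw_data), deployment = "worker"),
  # Light step again: no need to ship this to the cluster
  tar_target(fit_summary, coef(fit), deployment = "main")
)
```

Keeping light targets on "main" avoids transferring their inputs and outputs over the SSH connection.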

SSH connection setup

  1. Copy SLURM template to cluster
  2. Edit ~/.Rprofile on the cluster:

options(
  clustermq.scheduler = "slurm",
  # path to template on cluster
  clustermq.template = "~/slurm_clustermq.tmpl"
)

  3. Set options in _targets.R on your computer:

options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host = "username@hpc.university.edu", # however you SSH into cluster
  clustermq.ssh.timeout = 30, # longer timeout
  clustermq.ssh.log = "~/cmq_ssh.log" # log for easier debugging
)


Packages used in the pipeline need to be installed on both the cluster and your local computer

Lessons Learned: UF

  • Transfer of R objects back and forth is biggest bottleneck for SSH connector
  • 2FA surprisingly not an issue

Lessons Learned: Tufts University

  • Couldn’t get ZeroMQ installed because I couldn’t get an HPC person to email me back!
  • future backend worked, but overhead was too much to be helpful

Lessons Learned: University of Arizona

  • SSH connector requires an R session to run on the login node, which is not possible at UA!
  • Used RStudio Server through Open OnDemand instead
  • targets auto-detects SLURM, but the pipeline needs to run with the “multicore” scheduler instead
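
Overriding the auto-detected scheduler might look like the sketch below, assuming the pipeline is run from an RStudio Server session that already holds an allocation on a compute node. The worker count is an assumption and should not exceed the cores requested for that session.

```r
# Near the top of _targets.R: run workers as forked processes inside the
# current session instead of submitting new SLURM jobs
options(clustermq.scheduler = "multicore")

# Then, from the R console in the same session (worker count is illustrative):
# targets::tar_make_clustermq(workers = 4)
```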

One last step: write it all down!

  • Template GitHub repo with setup instructions in README

  • Tell the HPC experts about it

University of Florida:

University of Arizona (WIP): cct-datascience/targets-uahpc