GitHub Actions for Automating Scientific Workflows

Eric R. Scott

December 4, 2024

Learning Objectives

  • Understand the role of GitHub Actions workflows in (research) software development

  • Know how to trigger workflows in several different ways and determine which trigger is useful for different scientific applications

  • Be able to export and access data created by workflows in a variety of ways

  • Use GitHub Actions for distributed computing

Vocabulary

  • Repository: a folder with your code and data in it with changes tracked by git

  • GitHub: a cloud platform for syncing git repositories, publishing websites, running automated workflows (this workshop), and more

What is GitHub Actions?

  • Run almost any workflow on one or more virtual machines in the cloud, integrated with GitHub

  • Easily incorporate workflows for common, complex tasks created by others

  • Designed for continuous integration & delivery (CI/CD)—software development practices that translate to scientific code and data.

Continuous Integration

  • Automate integrating changes in code or data into the main version of your project in a safe way. For example:

    • When data is updated, run some data validation checks

    • Before incorporating changes from a collaborator, make sure their code adheres to a particular style

Continuous Delivery

  • “Delivery” is basically any way you make your data, code, or code outputs available to view or download. For example:

    • Every month, archive a new version of the data with Zenodo and get a new DOI

    • When code or data is updated, re-render a report for collaborators

Workflow overview

Workflows are defined with YAML files placed in .github/workflows/

on:
  workflow_dispatch

jobs:
  hello-world:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup R
        uses: r-lib/actions/setup-r@v2 #installs R
      
      - name: Run my code
        run: Rscript mycode.R

  1. on: the triggering event(s); workflow_dispatch adds a button on GitHub

  2. jobs: in this case only one, hello-world

  3. runs-on: the runner for the job, in this case the latest release of Ubuntu Linux

  4. steps: run sequentially on the runner

  5. actions/checkout: an “action” to get the files in your repository onto the runner

  6. r-lib/actions/setup-r: an “action” to install R on the runner

  7. run: a shell command that runs the mycode.R script

Workshop setup

  1. Go to the template repository

  2. Click the green “Use this template” button and select “create new repository”

  3. For “Owner” choose your own GitHub username.

To use actions that publish to GitHub Pages, your repository must be public!

Run your first action

In the “Actions” tab, find the “hello_world” example and click “Run workflow”

A note on renv 📦

renv analyzes your code and creates a renv.lock file with information on:

  • What R packages your code uses
  • What versions of those packages are installed
  • Where they were installed from (CRAN, Bioconductor, r-universe, GitHub, etc.)

This can be used with r-lib/actions/setup-renv to install all the required R packages.
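
A minimal sketch of how this might look in a workflow, assuming a renv.lock file is committed at the root of the repository. The job name is a placeholder, and mycode.R is borrowed from the hello-world example:

```yaml
jobs:
  run-with-renv:
    runs-on: ubuntu-latest
    steps:
      # Get the repository files, including renv.lock, onto the runner
      - uses: actions/checkout@v4

      - name: Setup R
        uses: r-lib/actions/setup-r@v2

      # Restore the packages recorded in renv.lock (cached between runs)
      - name: Install R packages
        uses: r-lib/actions/setup-renv@v2

      - name: Run my code
        run: Rscript mycode.R
```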

Workflow components: events

Workflows are triggered by events that occur in your GitHub repository and defined under on:

  • push: triggered whenever commits are pushed to your repository

  • pull_request: runs on a “pull request”—a common way of making changes to a repository

  • workflow_dispatch: creates a “run workflow” button that you can click on GitHub

  • schedule: E.g. run on the first Monday of every month.
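
As a rough illustration, the four triggers above could be combined in a single on: block. The branch name, the data/ path filter, and the schedule are assumptions chosen for this sketch. Standard cron syntax has no direct way to say “the first Monday”, so the sketch uses the first day of each month instead:

```yaml
on:
  # Run when commits pushed to main touch files under data/
  push:
    branches: [main]
    paths:
      - "data/**"
  # Run on pull requests that target main
  pull_request:
    branches: [main]
  # Add a "Run workflow" button to the Actions tab
  workflow_dispatch:
  # Run at 06:00 UTC on the 1st of every month
  # (fields: minute hour day-of-month month day-of-week)
  schedule:
    - cron: "0 6 1 * *"
```

A workflow with several triggers runs whenever any one of them fires.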

Workflow components: runners

Workflows contain one or more jobs that run on virtual machines, called runners.

GitHub-hosted runners are free to use in public repositories; private repositories get a monthly quota of free minutes.

runs-on:         OS             CPUs   RAM     Storage
ubuntu-latest    Ubuntu Linux   4      16 GB   14 GB
windows-latest   Windows        4      16 GB   14 GB
macos-latest     macOS (M1)     3      7 GB    14 GB

Workflow components: jobs

Each job may contain multiple steps. Each step either runs a script or uses an action.

%%{init:{'flowchart':{'nodeSpacing': 15, 'rankSpacing': 10}}}%%
flowchart TD
    c["Event"]-->ubuntu
    c --> mac
        subgraph ubuntu [runs-on: ubuntu-latest]
        subgraph Job 1
            direction TB
            a1["Step 1: Checkout"]--> a2["Step 2: Install R"]
            a2--> a3["Step 3: Run R Script"]
        end
        end
        subgraph mac [runs-on: macos-latest]
        subgraph Job 2
            direction TB
            b1["Step 1: Checkout"]--> b2["Step 2: Install Python"]
            b2--> b3["Step 3: Run Python Script"]
        end
        end
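
One way the diagram above might be written as a workflow file. The job names, script names, and Python version are placeholders for this sketch, not files from the workshop repository:

```yaml
on:
  workflow_dispatch:

jobs:
  # Job 1: runs on an Ubuntu runner
  r-job:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Run R script
        run: Rscript mycode.R

  # Job 2: runs on a macOS runner
  python-job:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run Python script
        run: python mycode.py
```

Jobs run in parallel by default; steps within a job run one after another.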

Workflow components: actions

Actions are pre-packaged, reusable steps for common, complex tasks.

Common actions:

Finding Actions

  • Find additional actions in the marketplace.

  • Try searching the web for “<thing you want to do> GitHub action”

Workflow components: variables

  • Environment variables can be set in workflows with

    action.yaml
    env:
      VARIABLE_NAME: "variable"

    and accessed in R code with

    script.R
    Sys.getenv("VARIABLE_NAME")

  • Secrets can be stored in GitHub and accessed in workflows with

    action.yaml
    key: ${{ secrets.SECRET_VAR }}
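
Putting the two together, a sketch of a job that sets one ordinary variable and passes one secret to an R step. API_KEY and the job name are hypothetical; SECRET_VAR is assumed to have been added to the repository's secrets:

```yaml
jobs:
  use-variables:
    runs-on: ubuntu-latest
    # Job-level variables are visible to every step in the job
    env:
      VARIABLE_NAME: "variable"
    steps:
      - name: Setup R
        uses: r-lib/actions/setup-r@v2

      - name: Read variables from R
        # Print the plain variable, but only check that the secret is non-empty
        run: Rscript -e "print(Sys.getenv('VARIABLE_NAME')); print(nzchar(Sys.getenv('API_KEY')))"
        # Step-level variables are visible only to this step
        env:
          API_KEY: ${{ secrets.SECRET_VAR }}
```

Passing a secret in through env: keeps it out of the workflow file, and R reads it with Sys.getenv() like any other environment variable.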

Matrix variables

A matrix strategy runs the same job once for each value in a list, spawning parallel runners.

action.yaml
jobs:
  myjob:
    strategy:
      matrix:
        letters: ["A", "B", "C"]
    runs-on: ubuntu-latest
    steps:
      - name: Setup R
        uses: r-lib/actions/setup-r@v2
      - name: Print letters
        run: Rscript -e "print(Sys.getenv('ALPHABET'))"
        env:
          ALPHABET: ${{ matrix.letters }}

Example: data validation script

validate.yaml

  • Runs R/validate.R which either errors or doesn’t
  • Pros: simple
  • Cons: fails fast (only the first error is reported)
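
The workshop's validate.yaml is not reproduced here. A minimal sketch of a workflow in this style, assuming the data live under data/ and that a renv.lock file is present, might be:

```yaml
on:
  # Re-validate whenever files under data/ change
  push:
    paths:
      - "data/**"
  workflow_dispatch:

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      # If the script errors (e.g. via stop()), Rscript exits with a
      # non-zero status and the job is marked as failed
      - name: Validate data
        run: Rscript R/validate.R
```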

Example: data validation tests

testthat.yaml

  • Runs testthat.R which uses the testthat R package

  • Pros: all tests are run even with multiple errors

  • Cons: more complicated setup, testthat is usually for R packages
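
As a hedged sketch of the same idea without a package structure, a job could call testthat::test_dir() directly. The tests/ directory and the job name are assumptions for this sketch, and testthat is assumed to be recorded in renv.lock:

```yaml
jobs:
  test-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      # Every test file in tests/ is run; stop_on_failure = TRUE raises an
      # error afterwards if any test failed, which fails the job
      - name: Run data validation tests
        run: Rscript -e "testthat::test_dir('tests', stop_on_failure = TRUE)"
```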

Example: rendering a report to markdown

render_readme.yaml

  • Renders a .qmd (Quarto) file to GitHub Flavored Markdown (GFM), which GitHub renders as HTML

  • Pros: relatively simple with the quarto-dev/quarto-actions/render action

  • Cons: results must be committed to be visible, which can cause git confusion (the remote gains commits your local copy lacks); GFM doesn’t support all Quarto HTML features
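
A sketch of what such a workflow could look like, assuming the source file is README.qmd at the repository root and that its R dependencies are recorded in renv.lock. The file name, job name, and commit message are placeholders:

```yaml
on:
  workflow_dispatch:

# Allow the workflow's token to push commits back to the repository
permissions:
  contents: write

jobs:
  render-readme:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: quarto-dev/quarto-actions/setup@v2
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      - name: Render to GitHub Flavored Markdown
        uses: quarto-dev/quarto-actions/render@v2
        with:
          to: gfm
          path: README.qmd

      - name: Commit the rendered file
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git add README.md
          # Only commit when the rendered output actually changed
          git diff --cached --quiet || git commit -m "Re-render README"
          git push
```

Because the workflow pushes a commit, remember to pull before making further local changes.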

Example: rendering a report to GitHub Pages

validation_report.yaml

  • Renders a validation report to a webpage served on GitHub

  • Pros: all Quarto HTML features supported, full-fledged website you can send to your collaborators

  • Cons: must run quarto publish gh-pages locally once before the action works, repository must be public
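
A sketch using the publish action from quarto-dev/quarto-actions, assuming the report is a single file named validation_report.qmd (a hypothetical name) and that a renv.lock file is present:

```yaml
on:
  workflow_dispatch:

# The publish step pushes the rendered site to the gh-pages branch
permissions:
  contents: write

jobs:
  publish-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: quarto-dev/quarto-actions/setup@v2
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      - name: Render and publish to the gh-pages branch
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
          path: validation_report.qmd
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```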

Example: parallel computing

matrix.yaml

  • Uses the matrix: key to iterate over multiple weather stations, pull data, fit a model, and combine model summaries

  • Pros: Scheduling a task like this is easier on GitHub Actions than on, say, an HPC cluster

  • Cons: Limitations on number of concurrent runners and computational power of runners
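
The workshop's matrix.yaml is not reproduced here. The sketch below shows one way the pattern could be arranged, using artifacts to pass each station's result to a final combining job. The station IDs, script names, and output paths are all hypothetical:

```yaml
on:
  workflow_dispatch:

jobs:
  fit:
    runs-on: ubuntu-latest
    strategy:
      # Let the other stations finish even if one fails
      fail-fast: false
      matrix:
        station: ["station_a", "station_b", "station_c"]
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      # The script is assumed to read STATION with Sys.getenv() and to
      # write its summary to output/<station>.csv
      - name: Pull data and fit model for one station
        run: Rscript R/fit_model.R
        env:
          STATION: ${{ matrix.station }}

      - name: Save this station's model summary
        uses: actions/upload-artifact@v4
        with:
          name: summary-${{ matrix.station }}
          path: output/${{ matrix.station }}.csv

  combine:
    # Runs only after every matrix job has succeeded
    needs: fit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-renv@v2

      - name: Collect all station summaries into one folder
        uses: actions/download-artifact@v4
        with:
          pattern: summary-*
          path: output
          merge-multiple: true

      - name: Combine model summaries
        run: Rscript R/combine_summaries.R
```

Each matrix job runs on its own runner with its own file system, so artifacts (or commits, or external storage) are needed to bring the results back together.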

Resources

CCT Data Science