Running Nextflow on Google Batch

Intro

Google Batch is a new feature from Google Cloud that is effectively a batch system as a service. Google Cloud also provides the Life Sciences API, a batch-style backend tailored to Life Sciences applications. Based on my interest in workflow management systems, I have been using Nextflow and evaluating its capabilities for running reproducible scientific pipelines in the Life Sciences space.

As Google Batch is brand new, I was interested to see how Nextflow on Google Batch compares to the implementation on Google Life Sciences, and what parameters are supported.

The core of this is described in the tutorial. To provide a parallel test of both backends, we will use the example RNA sequencing pipeline.

Setup

You will need a GCP project at http://console.cloud.google.com . Open the console and start the Cloud Shell to bring up a terminal; this is required because Nextflow is a command-line tool, allowing for different compute backends with the same front-end interface. Additionally, the APIs for both Batch and Life Sciences need to be enabled.
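The APIs can be enabled directly from the Cloud Shell; a minimal sketch, assuming the gcloud CLI is already authenticated and a default project is set:

```shell
# Enable the Batch and Life Sciences APIs for the current project
gcloud services enable batch.googleapis.com
gcloud services enable lifesciences.googleapis.com
```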

  1. Create a working directory in the Cloud Shell for this test: mkdir nextflow-test
  2. Within the Cloud Shell, download the Nextflow binary with curl -s https://get.nextflow.io | bash. We will need to use the “edge” release to accommodate the latest Google Batch functionality, so follow the instructions to set up edge:
export NXF_EDGE=1
./nextflow self-update
  3. To get the code, clone the sequencing pipeline repository at https://github.com/nf-core/rnaseq.git with git clone https://github.com/nf-core/rnaseq.git
  4. Working directories (on Google Cloud Storage) are created for each process - they will sit in a dedicated work-dir bucket, with a second bucket for outputs:
gsutil mb gs://[REPLACE-WITH-UNIQUE-WORKING-DIRECTORY-NAME]
gsutil mb gs://[REPLACE-WITH-UNIQUE-OUTPUT-DIRECTORY-NAME]
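Before moving on, it is worth a quick sanity check of the setup; a sketch, using the same placeholder bucket names as above:

```shell
# Confirm the edge release is active and the buckets exist
./nextflow -version
gsutil ls gs://[REPLACE-WITH-UNIQUE-WORKING-DIRECTORY-NAME]
gsutil ls gs://[REPLACE-WITH-UNIQUE-OUTPUT-DIRECTORY-NAME]
```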

Nextflow uses a Groovy-based configuration file, and the configuration can be specified at run-time, so we will create two configuration files - one for the Life Sciences API and one for Batch.

  1. Create a directory called conf
  2. Copy the test and full configuration files from the cloned example repository to the conf directory:
mkdir conf
cp rnaseq/conf/test.config conf/test.config
cp rnaseq/conf/test_full.config conf/test_full.config
  3. Create a top-level configuration file called nextflow.config that inherits the configuration:
profiles {
    test      { includeConfig 'conf/test.config'      }
    full      { includeConfig 'conf/test_full.config' }
}
  4. Open up each of the config files and increase the max resource settings, as these are set minimally for use on GitHub Actions (e.g. 16 CPUs and 32 GB RAM)

The configuration lives in the rnaseq/conf directory of the repository and is selected on the nextflow command line with the -profile flag.

Life Sciences backend

The configuration of the Life Sciences backend is as follows, and we will use an updated variant of the GCP and Nextflow instructions:

Create the file conf/test-lifesciences.config with the following content:

params {
    config_profile_name        = 'Google LS API profile'
    config_profile_description = 'Google LS API Configuration'
}
process {
    executor = 'google-lifesciences'
    container = 'nextflow/rnaseq-nf:latest'
}
workDir = 'gs://[REPLACE-WITH-UNIQUE-WORKING-DIRECTORY-NAME]'
google {
     project = 'PROJECT-NAME'
     location = 'us-central1'
}

Key things to call out:

  • process.executor - the name of the backend executor
  • process.container - the container image used to run each process
  • google.location - the GCP location in which to run the jobs
  • google.project - the GCP project to run (and bill) the jobs in
  • workDir - the Cloud Storage bucket that holds intermediate working files
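Beyond these, the Life Sciences executor accepts further options; a hedged sketch (the values are illustrative, and google.region can be used instead of google.location to pin compute to a specific region):

```groovy
google {
    project = 'PROJECT-NAME'
    region  = 'us-central1'              // region in which the worker VMs run
    lifeSciences.bootDiskSize = '50 GB'  // enlarge the boot disk for large containers
    lifeSciences.preemptible  = true     // use cheaper preemptible VMs
}
```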

We can add this configuration to the overall config by updating the nextflow.config file:

profiles {
    test             { includeConfig 'conf/test.config'      }
    full             { includeConfig 'conf/test_full.config' }
    lifesciences     { includeConfig 'conf/test-lifesciences.config' }
}

Google Batch backend

The configuration of the Google Batch backend is described in the Nextflow documentation. Create the file conf/test-batch.config with the following content:

params {
    config_profile_name        = 'Google Batch profile'
    config_profile_description = 'Google Batch Configuration'
}
process {
    executor = 'google-batch'
    container = 'nextflow/rnaseq-nf:latest'
}
workDir = 'gs://[REPLACE-WITH-UNIQUE-WORKING-DIRECTORY-NAME]'
google {
     project = 'PROJECT-NAME'
     location = 'us-central1'
}

and included in the nextflow.config file as

profiles {
    test             { includeConfig 'conf/test.config'      }
    full             { includeConfig 'conf/test_full.config' }
    lifesciences     { includeConfig 'conf/test-lifesciences.config' }
    batch            { includeConfig 'conf/test-batch.config' }
}

Now we are ready to execute Nextflow and the RNASequencing pipeline.

Executing the pipeline

We can execute the sequencing pipeline for the LifeSciences backend with:

./nextflow -c nextflow.config run nf-core/rnaseq -profile lifesciences,test --outdir gs://REPLACE-WITH-UNIQUE-OUTPUT-DIRECTORY-NAME -with-report lifesciences.html  -with-trace trace-lifesciences.txt -with-timeline timeline-lifesciences.html -with-dag dag-lifesciences.png

Check that pipelines are launched in the Life Sciences UI. Intermediate files will be put in the working directory bucket, and output files will be put in the output directory bucket.

For analysis of the run, we can use the command-line output or the .nextflow.log file - move this to lifesciences-nextflow.log

We can execute the sequencing pipeline for the Batch backend with:

./nextflow -c nextflow.config run nf-core/rnaseq -profile batch,test --outdir gs://REPLACE-WITH-UNIQUE-OUTPUT-DIRECTORY-NAME -with-report batch.html -with-trace batch-trace.txt -with-timeline timeline-batch.html -with-dag dag-batch.png 

Check that jobs are launched in the Batch console. For analysis of the run, we can use the command-line output or the .nextflow.log file - move this to batch-nextflow.log

Troubleshooting failures

If the pipeline fails with an intermittent error, Nextflow has a -resume parameter that allows re-running from the same state.

Potential failures include:

  • the account does not have sufficient permissions to launch jobs. Here, you can use either a service account or your personal account with the right permissions (see the GCP documentation)
  • mis-configuration of the buckets

Comparison

The key data is provided in the line after “nextflow.trace.ReportObserver - Execution report summary data” in the Nextflow log file. It is also provided in the report.html and trace.txt files.
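To pull that summary out of a saved log, a grep one-liner is enough; the sketch below runs against a synthetic, heavily abbreviated stand-in for the log so the shape is visible:

```shell
# Synthetic stand-in for a saved Nextflow log (the real file is much larger)
printf '%s\n' \
  'DEBUG nextflow.trace.ReportObserver - Execution report summary data:' \
  '  [{"cpuUsage": "..."}]' > lifesciences-nextflow.log
# Print the marker line plus the summary line that follows it
grep -A 1 'Execution report summary data' lifesciences-nextflow.log
```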

Time of execution

WORK IN PROGRESS

Cost of execution

As neither the Life Sciences API nor Google Batch charges anything above the resource usage, the cost comes from the compute and memory resources used by each of the batch jobs.

The standard way to calculate billing is to use labels on Google cloud resources - however, this is currently a WIP from the Nextflow team: https://github.com/nextflow-io/nextflow/pull/2853

As such, we can take a slightly more involved approach - calculate the duration of the resources used and match that against published Google Cloud pricing. This obviously ignores discounts, credits, and committed-use discounts, but it should give a benchmark comparison. Do talk to your friendly GCP contact for more information.
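As a sketch of that approach, the per-task CPU time from trace.txt can be multiplied by a published on-demand price. The column layout and the $0.033174 per vCPU-hour figure below are illustrative assumptions only (a real trace has many more columns and human-readable durations):

```shell
# Synthetic two-task trace: name, cpus, realtime in seconds (tab-separated)
printf 'name\tcpus\trealtime\nFASTQC\t2\t1800\nSTAR_ALIGN\t16\t3600\n' > trace-sample.txt
# Sum vCPU-hours and price them at an assumed $0.033174 per vCPU-hour
awk -F'\t' 'NR > 1 { h += $2 * $3 / 3600 }
    END { printf "vCPU-hours: %.2f, est. cost: $%.2f\n", h, h * 0.033174 }' trace-sample.txt
# → vCPU-hours: 17.00, est. cost: $0.56
```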

WORK IN PROGRESS

Additional configuration

There are additional parameters that both backends support, and which are critical for running in an enterprise environment. These include the following:

  • Network and subnetwork. Using a specific VPC and subnetwork - this is critical for projects customized beyond the default.
  • Private IP addresses. In enterprise environments internet access is often blocked and public IPs are unnecessary - this configuration keeps the batch VMs on private addresses.
  • Preemptible / spot instances. These allow for cheaper, more cost-effective instances at the expense of jobs potentially being preempted (killed), so jobs should tolerate retries.
  • VPC-SC. Currently only supported by the Life Sciences API, this supports the resource-access and data-exfiltration controls integrated into GCP.
  • serviceAccountEmail. In this example we have run with the user account - for production purposes, it is best to use a dedicated functional (service) account that has explicit and limited permissions to run Life Sciences or Batch jobs.
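As a hedged illustration of how some of these map onto Nextflow configuration for the Batch backend (option names per the Nextflow docs; the network, subnet, and service-account values are placeholders):

```groovy
google {
    project  = 'PROJECT-NAME'
    location = 'us-central1'
    batch.network           = 'projects/PROJECT-NAME/global/networks/my-vpc'
    batch.subnetwork        = 'projects/PROJECT-NAME/regions/us-central1/subnetworks/my-subnet'
    batch.usePrivateAddress = true   // no public IPs on the worker VMs
    batch.spot              = true   // cheaper spot VMs; jobs must tolerate preemption
    batch.serviceAccountEmail = 'pipeline-runner@PROJECT-NAME.iam.gserviceaccount.com'
}
```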