
Part 5: Hello Config

This section will explore how to configure Nextflow pipelines using configuration files, profiles, process directives, and executors. Configuration management is an essential aspect of Nextflow pipeline development, allowing you to customize the behavior of your pipeline, adapt it to different environments, and optimize resource usage. By understanding and effectively utilizing these configuration options, you can enhance the flexibility, scalability, and performance of your pipelines.

1. Check and modify configuration

1.1. Run nf-hello-gatk with default settings

nextflow run seqeralabs/nf-hello-gatk -r main

When you run the pipeline with the default settings using the command above, the following happens:

  1. Nextflow downloads the pipeline from the GitHub repository seqeralabs/nf-hello-gatk.
  2. It then executes the pipeline using the default configuration.
  3. The pipeline will likely use Docker containers to run the required tools (Samtools and GATK).
  4. It processes the input BAM files, creates index files, and performs variant calling.
  5. The results are generated by default in the results directory.
  6. Nextflow also creates a work directory containing intermediate files and logs.
  7. Upon completion, Nextflow displays a run summary, including any errors or warnings.

Now, let's see how this was configured and set up.

1.2. Check configuration

Open the nextflow.config file and inspect the contents:

code nextflow.config

The contents should look like this:

nextflow.config
docker.enabled = true

This config block tells the pipeline to use Docker containers to run the required tools.
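For context, each process in the pipeline declares the container image it needs via the container directive; docker.enabled = true simply tells Nextflow to run each task inside its declared container. Below is a minimal sketch of what such a process could look like (the image shown is one used by this pipeline, but the inputs, outputs, and script are simplified illustrations, not the pipeline's actual code):

main.nf (sketch)
process SAMTOOLS_INDEX {
    // image used to run this task when a container engine is enabled
    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'

    input:
    path input_bam

    output:
    path "${input_bam}.bai"

    script:
    """
    samtools index '$input_bam'
    """
}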

1.3. Modify configuration

Let's modify the configuration to use Conda instead of Docker and explicitly disable Docker.

Before:

nextflow.config
docker.enabled = true

After:

nextflow.config
docker.enabled = false
conda.enabled = true

Now let's run the pipeline again with the modified configuration:

nextflow run seqeralabs/nf-hello-gatk -r main

This time, the pipeline will use Conda environments to run the required tools.
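This works because a process can declare both a container and a Conda package, and Nextflow uses whichever packaging system the configuration enables. Extending the sketch above (the package specification here is an assumption, not copied from the pipeline):

main.nf (sketch)
process SAMTOOLS_INDEX {
    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
    conda 'bioconda::samtools=1.20'  // hypothetical package spec

    // ... input, output, and script blocks as in the sketch above
}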

Takeaway

You know how to switch software packaging systems using configuration files.

What's next?

Learn how to use profiles to customize the behavior of your pipeline.


2. Profiles

Profiles let you define alternative configurations within the same file and select between them at runtime, rather than editing the configuration every time you want to change the pipeline's behavior.

2.1. Create a profile

Before:

nextflow.config
docker.enabled = false
conda.enabled = true

After:

nextflow.config
profiles {
    docker {
        docker.enabled = true
        conda.enabled = false
    }
    conda {
        docker.enabled = false
        conda.enabled = true
    }
}

2.2. Run the pipeline with a profile

nextflow run seqeralabs/nf-hello-gatk -r main -profile docker

or

nextflow run seqeralabs/nf-hello-gatk -r main -profile conda

As demonstrated above, by creating and using profiles, we've enhanced our pipeline's flexibility and ease of use. We can now run our pipeline with Docker or Conda using a single command line argument by specifying the appropriate profile (-profile docker or -profile conda). This method of configuration management improves the portability and maintainability of our Nextflow pipeline, enabling us to accommodate various execution scenarios easily.
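One detail worth knowing: settings placed outside the profiles block apply to every run, while settings inside a profile apply only when that profile is selected. A minimal sketch (the memory default here is purely illustrative):

nextflow.config
// applies to every run, regardless of the profile selected
process.memory = 2.GB

profiles {
    docker {
        docker.enabled = true
        conda.enabled = false
    }
    conda {
        docker.enabled = false
        conda.enabled = true
    }
}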

Takeaway

You know how to use profiles to customize the configuration of your pipeline.

What's next?

Learn how to change process resource use with configuration.


3. Process directives and resources

3.1. Process directives

In a previous training module, we used process directives to modify the behavior of a process when we added the publishDir directive to export files from the working directory. Let's look into directives in more detail.

3.1.1. Set process resources

By default, Nextflow will use a single CPU and 2GB of memory for each process. We can modify this behavior by setting the cpus and memory directives in the process block. Add the following to the end of your nextflow.config file:

nextflow.config
process {
    cpus = 8
    memory = 4.GB
}

Run the pipeline again with the modified configuration:

nextflow run seqeralabs/nf-hello-gatk -r main -profile docker

You shouldn't see any difference in the results; however, you may notice that the three processes get bottlenecked behind each other. This is because Nextflow ensures that running tasks never request more CPUs than the machine has available, so with each task asking for 8 CPUs, fewer tasks can run concurrently.

Tip

You can check the number of CPUs given to a task by looking at its .command.run file. It contains a function called nxf_launch() that includes a docker run -i --cpu-shares option, where the value of --cpu-shares is the number of CPUs given to the process multiplied by 1024 (so 8192 for 8 CPUs).
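For example, you could spot-check this across all tasks with a quick search (assuming the default two-level work directory layout):

grep cpu-shares work/*/*/.command.run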

3.1.2. Modify process resources for a specific process

We can also modify the resources for a specific process using the withName process selector. Add the following to the end of your nextflow.config file:

nextflow.config
process {
    withName: 'GATK_HAPLOTYPECALLER' {
        cpus = 8
        memory = 4.GB
    }
}

Run the pipeline again with the modified configuration:

nextflow run seqeralabs/nf-hello-gatk -r main -profile docker

Now, the settings are only applied to the GATK HaplotypeCaller process. This is useful when your processes have different resource requirements, letting you right-size the resources for each process.
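Process-level defaults and selector overrides can also be combined in a single block, with the selector winning for matching processes. A sketch with illustrative default values:

nextflow.config
process {
    // defaults applied to every process
    cpus = 2
    memory = 2.GB
    // overrides applied only to GATK_HAPLOTYPECALLER
    withName: 'GATK_HAPLOTYPECALLER' {
        cpus = 8
        memory = 4.GB
    }
}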

Takeaway

You know how to modify process resources using configuration files.

What's next?

Learn how to change the executor used by Nextflow.


4. Executor

4.1. Local executor

Until now, we have been running our pipeline with the local executor. This runs each step on the same machine that Nextflow is running on. However, for large genomics pipelines, you will want to use a distributed executor. Nextflow supports several different distributed executors, including:

  • HPC (SLURM, PBS, SGE)
  • AWS Batch
  • Google Batch
  • Azure Batch
  • Kubernetes

We can modify the executor used by Nextflow with the executor process directive. Because local is the default executor, the following configuration is implied:

nextflow.config
process {
    executor = 'local'
}
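Because executor is a process directive, it can even be set per process with a selector, allowing hybrid setups. A sketch (purely illustrative here, since our environment has no cluster to submit to):

nextflow.config
process {
    executor = 'local'
    // submit only the resource-hungry variant calling to a cluster
    withName: 'GATK_HAPLOTYPECALLER' {
        executor = 'slurm'
    }
}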

4.2. Other executors

Note

This is a demonstration, and it is designed to go wrong!

If we wish to change the executor, we can simply set this directive to one of the values listed in the documentation:

nextflow.config
process {
    executor = 'slurm'
}

However, if we add this to our config and run the pipeline, it will fail with an error that includes:

Cannot run program "sbatch"

Nextflow has interpreted that we wish to submit to a Slurm cluster, which requires the use of the command sbatch. However, because our Gitpod instance doesn't have Slurm installed (and isn't connected to a cluster), this throws an error.

If we check inside the .command.run file created in the work directory, we can see that Nextflow has created a script to submit the job to Slurm.

Note

The output in your Nextflow console will include the hash of the work subdirectory, which will differ from the paths shown below.

.command.run
#!/bin/bash
#SBATCH -J nf-SAMTOOLS_INDEX_(1)
#SBATCH -o /home/gitpod/work/34/850fe31af0eb62a0eb1643ed77b84f/.command.log
#SBATCH --no-requeue
#SBATCH --signal B:USR2@30
NXF_CHDIR=/home/gitpod/work/34/850fe31af0eb62a0eb1643ed77b84f
### ---
### name: 'SAMTOOLS_INDEX (1)'
### container: 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
### outputs:
### - 'reads_father.bam'
### - 'reads_father.bam.bai'
### ...

If our process had more directives, such as clusterOptions, cpus, memory, queue, and time, these would also be included in the .command.run file and passed directly to the Slurm submission. They would likewise be translated to the equivalent options for other executors. This is how, from a single configuration change, Nextflow creates the commands required to correctly submit a job to a Slurm cluster via sbatch.
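For instance, a fuller Slurm configuration could look like the sketch below; the queue name and account are placeholders for values your cluster would define:

nextflow.config
process {
    executor = 'slurm'
    queue = 'compute'                        // placeholder partition name
    cpus = 8
    memory = 4.GB
    time = '1h'
    clusterOptions = '--account=my-account'  // placeholder account
}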

4.3. Using Executors in Profiles

Let's combine profiles with executors. First, remove the following lines from your configuration file:

nextflow.config
process {
    executor = 'slurm'
}

Before:

nextflow.config
profiles {
    docker {
        docker.enabled = true
        conda.enabled = false
    }
    conda {
        docker.enabled = false
        conda.enabled = true
    }
}

After:

nextflow.config
profiles {
    docker {
        docker.enabled = true
        conda.enabled = false
    }
    conda {
        docker.enabled = false
        conda.enabled = true
    }
    local {
        process.executor = 'local'
    }
    slurm {
        process.executor = 'slurm'
    }
}

Now run the pipeline using two profiles, docker and local:

nextflow run seqeralabs/nf-hello-gatk -r main -profile docker,local

We have returned to the original configuration of using Docker containers with local execution. However, now we can use profiles to switch to a different software packaging system (conda) or a different executor (slurm) with a single command-line option.
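For example, to run the same pipeline with Conda environments on a Slurm cluster (assuming one were available), only the profile selection changes:

nextflow run seqeralabs/nf-hello-gatk -r main -profile conda,slurm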

Takeaway

You now know how to change the executor in Nextflow.

What's next?

Well done! You've successfully modified the execution of a pipeline without altering a single line of code. This highlights the power of Nextflow's configuration: it enables you to control how the pipeline runs without changing what it runs. Use this flexibility to adapt your pipeline to run in any environment.