6. Processes
In Nextflow, a process is the basic computing primitive to execute foreign functions (i.e., custom scripts or tools).
The process definition starts with the keyword process, followed by the process name and finally the process body delimited by curly brackets.
A basic process, only using the script definition block, looks like the following:
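The original snippet is not preserved here, so the following is a minimal sketch of such a process. The process name SAYHELLO and the echoed text are illustrative assumptions.

```groovy
// Minimal process using only a script block (name and command are illustrative)
process SAYHELLO {
    script:
    """
    echo 'Hello world!'
    """
}

// A workflow block is needed to actually invoke the process
workflow {
    SAYHELLO()
}
```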
Info
By convention, the process name is written in upper case.
The process body can contain up to five definition blocks:
- Directives are initial declarations that define optional settings
- Input defines the expected input channel(s)
- Output defines the expected output channel(s)
- When is an optional clause statement to allow conditional processes
- Script is a string statement that defines the command to be executed by the process' task
The full process syntax is made up of the following elements, in this order:
- Zero, one, or more process directives
- Zero, one, or more process inputs
- Zero, one, or more process outputs
- An optional boolean conditional to trigger the process execution
- The command to be executed
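The elements listed above can be sketched in a single process. This is an illustrative example rather than the original annotated snippet: the process name, the input values, and the condition are assumptions.

```groovy
process SAYHELLO {
    debug true                  // directive: print the task stdout to the console

    input:
    val greeting                // input: one value per task

    output:
    val greeting                // output: the same value is emitted downstream

    when:
    greeting != 'skip'          // optional boolean condition

    script:                     // command to be executed
    """
    echo '$greeting world!'
    """
}

workflow {
    // 'skip' is filtered out by the when condition, so only two tasks run
    SAYHELLO(Channel.of('Hello', 'skip', 'Ciao'))
}
```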
6.1 Script
The script block is a string statement that defines the command to be executed by the process.
A process can execute only one script block. It must be the last statement when the process contains input and output declarations.
The script block can be a single or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning multiple lines. For example:
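A minimal sketch of a multi-line script block, assuming standard Unix tools (head, gzip) are available; the process name and file names are illustrative.

```groovy
process EXAMPLE {
    script:
    """
    # Several commands, one per line, run as a single Bash script
    echo 'Hello world!' > file
    echo 'Hola mundo!' >> file
    head -n 1 file > chunk_1.txt
    gzip -c chunk_1.txt > chunk_archive.gz
    """
}

workflow {
    EXAMPLE()
}
```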
Tip
In the snippet below, the directive debug is used to enable the debug mode for the process. This is useful to print the output of the process script in the console.
By default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. For example:
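As a sketch, the process below runs a Python script instead of Bash. It assumes a python executable is available on the PATH of the execution environment; the process name and variables are illustrative.

```groovy
process PYSTUFF {
    debug true

    script:
    """
    #!/usr/bin/env python

    # Everything after the shebang line is interpreted by Python
    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """
}

workflow {
    PYSTUFF()
}
```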
Tip
Multiple programming languages can be used within the same workflow script. However, large chunks of code are better saved in separate files and invoked from the process script. You can store these scripts in the ./bin/ folder.
6.1.1 Script parameters
Script parameters (params) can be defined dynamically using variable values. For example:
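A minimal sketch, assuming a parameter named params.data (the name and default value are illustrative). The value can be overridden on the command line with --data.

```groovy
params.data = 'World'

process FOO {
    debug true

    script:
    """
    # $params.data is replaced by Nextflow before the script runs
    echo Hello $params.data
    """
}

workflow {
    FOO()
}
```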
Info
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information.
Warning
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character. The escaped version will be resolved later, when the task runs, returning the task directory (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), while an unescaped $PWD would show the directory where you're running Nextflow.
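The escaping described above can be sketched as follows; the process name and message are illustrative.

```groovy
process FOO {
    debug true

    script:
    """
    # \$PWD is left for Bash to resolve inside the task directory
    echo "The current directory is \$PWD"
    """
}

workflow {
    FOO()
}
```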
Your expected output will look something like this:
It can be tricky to write a script that uses many Bash variables. One possible alternative is to use a script string delimited by single-quote characters (').
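A sketch of the single-quote alternative (process name illustrative). Because the string is not interpolated by Nextflow, no escaping is needed.

```groovy
process BAR {
    debug true

    script:
    '''
    # Single-quoted block: $PWD is passed to Bash untouched
    echo "The current directory is $PWD"
    '''
}

workflow {
    BAR()
}
```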
Your expected output will look something like this:
However, using the single quotes (') will block the usage of Nextflow variables in the command script.
Another alternative is to use a shell statement instead of script and use a different syntax for Nextflow variables, e.g., !{..}. This allows the use of both Nextflow and Bash variables in the same script.
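A sketch of the shell statement, assuming a parameter named params.data (illustrative). Note that newer Nextflow releases may flag the shell block as deprecated in favour of script, so check the documentation for the version in use.

```groovy
params.data = 'le monde'

process BAZ {
    debug true

    shell:
    '''
    # $X is a Bash variable, !{params.data} is resolved by Nextflow
    X='Bonjour'
    echo $X !{params.data}
    '''
}

workflow {
    BAZ()
}
```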
6.1.2 Conditional script
The process script can also be defined in a completely dynamic manner using an if statement or any other expression for evaluating a string value. For example:
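A sketch of a conditional script, assuming the parameters params.compress and params.file2compress and the training data file data/ggal/transcriptome.fa under the project directory (all assumptions). The commands are only echoed, not executed.

```groovy
params.compress = 'gzip'
params.file2compress = "$projectDir/data/ggal/transcriptome.fa"

process FOO {
    debug true

    input:
    path input_file

    script:
    // The script string is chosen when the task is created
    if (params.compress == 'gzip')
        """
        echo "gzip -c $input_file > ${input_file}.gz"
        """
    else if (params.compress == 'bzip2')
        """
        echo "bzip2 -c $input_file > ${input_file}.bz2"
        """
    else
        throw new IllegalArgumentException("Unknown compressor $params.compress")
}

workflow {
    FOO(file(params.file2compress))
}
```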
Exercise
Execute this script using the command line to choose bzip2 compression.
Summary
In this step you have learned:
- How to use the script declaration to define the command to be executed by the process
- How to use the params variable to define dynamic script parameters
- How to use the shell declaration to define the command to be executed by the process
- How to use the if statement to define a conditional script
6.2 Inputs
Nextflow process instances (tasks) are isolated from each other but can communicate between themselves by sending values through channels.
Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:
The input block defines the names and qualifiers of variables that refer to channel elements directed at the process. You can only define one input block at a time, and it must contain one or more input declarations.
The input block follows the syntax shown below:
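Each input declaration consists of a qualifier followed by an input name. The sketch below illustrates that shape with two val inputs; the process and variable names are illustrative assumptions.

```groovy
process GREET {
    debug true

    input:
    val greeting   // qualifier: val, input name: greeting
    val name       // a second declaration in the same input block

    script:
    """
    echo '$greeting, $name!'
    """
}

workflow {
    // One argument per input declaration, in the same order
    GREET(Channel.of('Hello', 'Ciao'), 'world')
}
```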
There are several input qualifiers that can be used to define the input declaration. The most common are outlined in detail below.
6.2.1 Input values
The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name. For example:
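A minimal sketch of a val input, assuming a channel named num that emits three values (the process name and values are illustrative).

```groovy
process BASICEXAMPLE {
    debug true

    input:
    val x

    script:
    """
    echo process job $x
    """
}

workflow {
    num = Channel.of(1, 2, 3)
    // One task is created for each value emitted by num
    BASICEXAMPLE(num)
}
```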
In the above example, the process is executed three times. Each time a value is received from the channel num, it is used by the script. This results in an output similar to the one shown below:
Warning
The channel guarantees that items are delivered in the same order as they have been sent. However, since the process is executed in parallel, there is no guarantee that they are processed in the same order as they are received.
6.2.2 Input files
The path qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage the file in the process execution directory, and the script can access it using the name specified in the input declaration. For example:
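A sketch of a path input with a fixed staged name. It assumes the training data folder data/ggal containing six .fq files is present in the launch directory; the process name is illustrative.

```groovy
process FOO {
    debug true

    input:
    path 'sample.fastq'   // every input file is staged under this name

    script:
    """
    ls sample.fastq
    """
}

workflow {
    reads = Channel.fromPath('data/ggal/*.fq')
    FOO(reads)
}
```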
In this case, the process is executed six times and prints the file name sample.fastq each time. This is the name given in the input declaration, even though the original input file name is different in each execution (e.g., lung_1.fq).
The input file name can also be defined using a variable reference as shown below:
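A sketch using a variable reference, under the same assumption about the data/ggal training files (process and variable names are illustrative).

```groovy
process FOO {
    debug true

    input:
    path sample   // the staged file keeps its original name

    script:
    """
    ls $sample
    """
}

workflow {
    reads = Channel.fromPath('data/ggal/*.fq')
    FOO(reads)
}
```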
In this case, the process is executed six times and will print the name of the variable input file six times (e.g., lung_1.fq).
The same syntax is also able to handle more than one input file in the same execution and only requires changing the channel composition using an operator (e.g., collect).
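A sketch with collect, again assuming the data/ggal training files. The operator gathers all files into a single list, so the process receives them together.

```groovy
process FOO {
    debug true

    input:
    path sample   // here a list of files, rendered as space-separated names

    script:
    """
    ls -lh $sample
    """
}

workflow {
    reads = Channel.fromPath('data/ggal/*.fq')
    // collect() emits one list holding every file
    FOO(reads.collect())
}
```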
Note that while the output looks the same, this process is only executed once.
Warning
In the past, the file qualifier was used for files, but the path qualifier should be preferred over file to handle process input files when using Nextflow 19.10.0 or later. When a process declares an input file, the corresponding channel elements must be file objects created with the file helper function or by the file-specific channel factories (e.g., Channel.fromPath or Channel.fromFilePairs).
6.2.3 Combine input channels
A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to understand how channel contents and their semantics affect the execution of a process.
Consider the following example:
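A sketch with two queue channels of equal length; the channel names ch1 and ch2 follow the surrounding text, while the values and process name are illustrative.

```groovy
process FOO {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)
    ch2 = Channel.of('a', 'b', 'c')
    // Values are paired by position: one from each channel per task
    FOO(ch1, ch2)
}
```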
Both channels emit three values, therefore the process is executed three times, each time with a different pair:
The process waits until there’s a complete input configuration, i.e., it receives an input value from all the channels declared as input.
When this condition is verified, it consumes the input values coming from the respective channels, spawns a task execution, then repeats the same logic until one or more channels have no more content.
This means channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even if there are other values in other channels.
What happens when channels do not have the same cardinality (i.e., they emit a different number of elements)?
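A sketch of the unequal case, assuming ch1 emits three values and ch2 only one (values and process name are illustrative).

```groovy
process FOO {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)
    ch2 = Channel.of('a')
    // ch2 is exhausted after the first pairing
    FOO(ch1, ch2)
}
```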
In the above example, the process is only executed once because the process stops when a channel has no more data to be processed.
However, replacing ch2 with a value channel will cause the process to be executed three times, each time with the same value of a:
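A sketch of the same process with ch2 created by Channel.value (values and process name are illustrative).

```groovy
process FOO {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)
    // A value channel can be read any number of times
    ch2 = Channel.value('a')
    FOO(ch1, ch2)
}
```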
As ch2 is now a value channel, it can be consumed multiple times and does not affect process termination.
Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same data/ggal/transcriptome.fa in each execution.
Solution
One possible solution is shown below:
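The sketch below is one way to approach it, not necessarily the original solution. The parameter names, the process name COMMAND, and the location of the data under the project directory are assumptions.

```groovy
params.reads = "$projectDir/data/ggal/*_1.fq"
params.transcriptome_file = "$projectDir/data/ggal/transcriptome.fa"

process COMMAND {
    debug true

    input:
    path reads
    path transcriptome

    script:
    """
    echo $reads $transcriptome
    """
}

workflow {
    read_ch = Channel.fromPath(params.reads)
    // A single file object acts as a value channel and is reused by every task
    COMMAND(read_ch, file(params.transcriptome_file))
}
```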
You may also consider using other Channel factories or operators to create your input channels.
6.2.4 Input repeaters
The each qualifier allows you to repeat the execution of a process for each item in a collection every time new data is received. For example:
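A sketch of the each qualifier, assuming the data/ggal training files and a methods list with two entries; the process name is illustrative and the t_coffee command is only echoed.

```groovy
process ALIGNSEQUENCES {
    debug true

    input:
    path seq
    each mode   // the task is repeated for every item in the collection

    script:
    """
    echo t_coffee -in $seq -mode $mode
    """
}

workflow {
    sequences = Channel.fromPath('data/ggal/*_1.fq')
    methods = ['regular', 'espresso']
    ALIGNSEQUENCES(sequences, methods)
}
```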
The output will look something like this (the order of the lines may vary):
t_coffee -in gut_1.fq -mode regular
t_coffee -in lung_1.fq -mode espresso
t_coffee -in liver_1.fq -mode regular
t_coffee -in gut_1.fq -mode espresso
t_coffee -in lung_1.fq -mode regular
t_coffee -in liver_1.fq -mode espresso
In the above example, every time a file of sequences is received as an input by the process, it executes one task per alignment method, each with a different value of the mode variable (two tasks per file, given the two modes in the output above). This is useful when you need to repeat the same task for a given set of parameters.
Exercise
Extend the previous example so a task is executed for an additional type of coffee.
Solution
Modify the methods list and add another coffee type:
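A sketch of the extended list, under the same assumptions as the previous example (data/ggal training files, illustrative process name).

```groovy
process ALIGNSEQUENCES {
    debug true

    input:
    path seq
    each mode

    script:
    """
    echo t_coffee -in $seq -mode $mode
    """
}

workflow {
    sequences = Channel.fromPath('data/ggal/*_1.fq')
    // A third coffee type adds one more task per input file
    methods = ['regular', 'espresso', 'cappuccino']
    ALIGNSEQUENCES(sequences, methods)
}
```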
Your output will look something like this:
t_coffee -in gut_1.fq -mode regular
t_coffee -in lung_1.fq -mode regular
t_coffee -in gut_1.fq -mode espresso
t_coffee -in liver_1.fq -mode cappuccino
t_coffee -in liver_1.fq -mode espresso
t_coffee -in lung_1.fq -mode espresso
t_coffee -in liver_1.fq -mode regular
t_coffee -in gut_1.fq -mode cappuccino
t_coffee -in lung_1.fq -mode cappuccino
Summary
In this step you have learned:
- How to use the val qualifier to define the input channel(s) of a process
- How to use the path qualifier to define the input file(s) of a process
- How to use the each qualifier to repeat the execution of a process for each item in a collection
6.3 Outputs
The output declaration block defines the channels used by the process to send out the results produced.
Only one output block can be defined, and it can contain one or more output declarations. The output block follows the syntax shown below:
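Each output declaration consists of a qualifier followed by an output name. The sketch below illustrates that shape with one val and one path output; the process name, file name, and input values are illustrative assumptions.

```groovy
process MAKE_GREETING {
    input:
    val name

    output:
    val name               // qualifier: val, output name: name
    path 'greeting.txt'    // qualifier: path, output name: greeting.txt

    script:
    """
    echo 'Hello $name' > greeting.txt
    """
}

workflow {
    MAKE_GREETING(Channel.of('Ada', 'Grace'))
    // Each declaration becomes its own output channel
    MAKE_GREETING.out[0].view()
    MAKE_GREETING.out[1].view()
}
```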
6.3.1 Output values
The val qualifier specifies a defined value in the script context. Values are frequently defined in the input and/or output declaration blocks, as shown in the following example:
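A sketch of a val output that forwards its input value; the process name, the values, and the channel name receiver_ch are illustrative.

```groovy
process FOO {
    input:
    val x

    output:
    val x   // the value received as input is sent out again

    script:
    """
    echo $x > file
    """
}

workflow {
    receiver_ch = FOO(Channel.of('prot', 'dna', 'rna'))
    receiver_ch.view { method -> "Received: $method" }
}
```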
6.3.2 Output files
The path qualifier sends one or more files produced by the process over the specified channel as an output.
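A sketch matching the description that follows: a process named RANDOMNUM writes result.txt and the file is received on receiver_ch. The script body is an assumption.

```groovy
process RANDOMNUM {
    output:
    path 'result.txt'

    script:
    """
    # \$RANDOM is a Bash variable, so it is escaped
    echo \$RANDOM > result.txt
    """
}

workflow {
    receiver_ch = RANDOMNUM()
    receiver_ch.view { result -> "Received: ${result.text}" }
}
```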
In the above example, the process RANDOMNUM creates a file named result.txt containing a random number.
Since a file parameter using the same name is declared in the output block, the file is sent over the receiver_ch channel when the task is complete. A downstream process declaring the same channel as input will be able to receive it.
6.3.3 Multiple output files
When an output file name contains a wildcard character (* or ?) it is interpreted as a glob path matcher. This allows us to capture multiple files into a list object and output them as a sole emission. For example:
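A sketch of a glob output, assuming the split utility is available. The process name SPLITLETTERS and the input word are illustrative; a four-letter word yields four chunk files.

```groovy
process SPLITLETTERS {
    output:
    path 'chunk_*'   // glob: every matching file goes into one list

    script:
    """
    printf 'Hola' | split -b 1 - chunk_
    """
}

workflow {
    letters = SPLITLETTERS()
    letters.view()
}
```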
Prints the following:
[/workspace/gitpod/nf-training/work/ca/baf931d379aa7fa37c570617cb06d1/chunk_aa, /workspace/gitpod/nf-training/work/ca/baf931d379aa7fa37c570617cb06d1/chunk_ab, /workspace/gitpod/nf-training/work/ca/baf931d379aa7fa37c570617cb06d1/chunk_ac, /workspace/gitpod/nf-training/work/ca/baf931d379aa7fa37c570617cb06d1/chunk_ad]
Some caveats on glob pattern behavior:
- Input files are not included in the list of possible matches
- Glob pattern matches both files and directory paths
- When a two-asterisk pattern (**) is used to recurse across directories, only file paths are matched, i.e., directories are not included in the result list.
Exercise
Add the flatMap operator and see how the output changes. The documentation for the flatMap operator is available at this link.
Solution
Add the flatMap operator to the letters channel.
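One possible sketch, reusing the illustrative SPLITLETTERS process. The identity closure passed to flatMap is a choice made here for explicitness.

```groovy
process SPLITLETTERS {
    output:
    path 'chunk_*'

    script:
    """
    printf 'Hola' | split -b 1 - chunk_
    """
}

workflow {
    letters = SPLITLETTERS()
    // flatMap emits each file in the list as a separate item
    letters
        .flatMap { files -> files }
        .view()
}
```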
Your output will look something like this:
/workspace/gitpod/nf-training/work/54/9d79f9149f15085e00dde2d8ead150/chunk_aa
/workspace/gitpod/nf-training/work/54/9d79f9149f15085e00dde2d8ead150/chunk_ab
/workspace/gitpod/nf-training/work/54/9d79f9149f15085e00dde2d8ead150/chunk_ac
/workspace/gitpod/nf-training/work/54/9d79f9149f15085e00dde2d8ead150/chunk_ad
6.3.4 Dynamic output file names
When an output file name needs to be expressed dynamically, it is possible to define it using a dynamic string that references values defined in the input declaration block or in the script global context. For example:
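A sketch of a dynamic output name built from the x input. The process name ALIGN, the sample values, and the echoed command are illustrative assumptions.

```groovy
process ALIGN {
    input:
    val x
    val seq

    output:
    path "${x}.aln"   // file name resolved from the input value

    script:
    """
    echo align -in $seq > ${x}.aln
    """
}

workflow {
    species_ch = Channel.of('cat', 'dog', 'sloth')
    sequences_ch = Channel.of('AGATAG', 'ATGCTCT', 'ATCCCAA')
    ALIGN(species_ch, sequences_ch).view()
}
```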
In the above example, each time the process is executed an alignment file is produced whose name depends on the actual value of the x input.
6.3.5 Composite inputs and outputs
So far you have seen how to declare multiple input and output channels that can handle one value at a time. However, Nextflow can also handle a tuple of values.
The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.
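A sketch of tuple inputs and outputs, assuming paired read files under data/ggal in the launch directory. The process name and the placeholder command are illustrative.

```groovy
process FOO {
    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('sample.bam')

    script:
    """
    echo your_command_here --sample $sample_id_paths > sample.bam
    """
}

workflow {
    // fromFilePairs emits tuples of [sample id, [file_1, file_2]]
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    sample_ch = FOO(reads_ch)
    sample_ch.view()
}
```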
The output will look something like this:
[lung, /workspace/gitpod/nf-training/work/23/fe268295bab990a40b95b7091530b6/sample.bam]
[liver, /workspace/gitpod/nf-training/work/32/656b96a01a460f27fa207e85995ead/sample.bam]
[gut, /workspace/gitpod/nf-training/work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.bam]
Exercise
Modify the script of the previous example so that the --sample file is named as the given sample_id.
Solution
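One possible sketch, under the same assumptions as the example above: the output file name is built from sample_id in both the output declaration and the script.

```groovy
process FOO {
    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path("${sample_id}.bam")

    script:
    """
    echo your_command_here --sample $sample_id_paths > ${sample_id}.bam
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch).view()
}
```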
6.3.6 Output definitions
Nextflow allows the use of alternative output definitions within workflows to simplify your code.
You can also explicitly define the output of a channel using the .out attribute:
This command will produce an error message, because .view() operates on single channels, and FOO.out contains multiple channels.
If a process defines two or more output channels, each channel can be accessed by indexing the .out attribute, e.g., .out[0], .out[1], etc. In this example you only have the [0]'th output:
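A sketch of indexed access, assuming a process with two output channels (bam and bai files) and the data/ggal training files. The process body is illustrative.

```groovy
process FOO {
    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('sample.bam')
    tuple val(sample_id), path('sample.bai')

    script:
    """
    echo your_command_here --sample $sample_id_paths > sample.bam
    echo your_command_here --sample $sample_id_paths > sample.bai
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch)
    // FOO.out.view() would fail here: FOO.out holds two channels
    FOO.out[0].view()
}
```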
Alternatively, the process output definition allows the use of the emit statement to define a named identifier that can be used to reference the channel in the external scope.
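A sketch using emit, with the same assumed process as above. The identifiers bam and bai are illustrative names.

```groovy
process FOO {
    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('sample.bam'), emit: bam
    tuple val(sample_id), path('sample.bai'), emit: bai

    script:
    """
    echo your_command_here --sample $sample_id_paths > sample.bam
    echo your_command_here --sample $sample_id_paths > sample.bai
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch)
    // The named output is referenced instead of an index
    FOO.out.bam.view()
}
```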
Exercise
Modify the previous example so that the bai output channel is printed to your terminal.
Solution
Your workflow will look something like this:
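A possible sketch, assuming the process declares outputs named bam and bai with emit as in the previous example.

```groovy
process FOO {
    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('sample.bam'), emit: bam
    tuple val(sample_id), path('sample.bai'), emit: bai

    script:
    """
    echo your_command_here --sample $sample_id_paths > sample.bam
    echo your_command_here --sample $sample_id_paths > sample.bai
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch)
    FOO.out.bai.view()
}
```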
Summary
In this step you have learned:
- How to use the val qualifier to define the output channel(s) of a process
- How to use the path qualifier to define the output file(s) of a process
- How to use the tuple qualifier to define the output channel(s) of a process
- How to manage multiple output files using glob patterns
- How to use dynamic output file names
- How to use composite inputs and outputs
- How to define outputs
6.4 When
The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.
It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:
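A sketch of a when condition, assuming the data/ggal training files and a parameter named params.dbtype. The process name, the condition, and the echoed command are illustrative.

```groovy
params.dbtype = 'nr'

process FIND {
    debug true

    input:
    path fasta
    val type

    when:
    // Only files whose name starts with 'gut' are processed
    fasta.name.startsWith('gut') && type == 'nr'

    script:
    """
    echo blastp -query $fasta -db $type
    """
}

workflow {
    sequences = Channel.fromPath('data/ggal/*_1.fq')
    FIND(sequences, params.dbtype)
}
```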
Summary
In this step you have learned:
- How to use the when declaration to allow conditional processes
6.5 Directives
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.
They must be entered at the top of the process body, before any other declaration blocks (i.e., input, output, etc.).
Directives are commonly used to define the amount of computing resources to be used or other meta directives that allow the definition of extra configuration or logging information. For example:
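A sketch showing directives at the top of the process body; the process name, the chosen directives, and their values are illustrative.

```groovy
process FOO {
    // Directives come first, before input, output, and script
    cpus 2
    memory 1.GB
    debug true

    script:
    """
    echo your_command --this --that
    """
}

workflow {
    FOO()
}
```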
The complete list of directives is available at this link. Some of the most common are described in detail below.
6.5.1 Resource allocation
The following directives allow you to define the amount of computing resources to be used by the process:
Name | Description |
---|---|
cpus | Allows you to define the number of (logical) CPUs required by the process’ task. |
time | Allows you to define how long the task is allowed to run (e.g., 1h is 1 hour, 1s is 1 second, 1m is 1 minute, 1d is 1 day). |
memory | Allows you to define how much memory the task is allowed to use (e.g., 2 GB). Units can be B, KB, MB, GB, or TB. |
disk | Allows you to define how much local disk storage the task is allowed to use. |
These directives can be used in combination with each other to allocate specific resources to each process. For example:
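A sketch combining the four resource directives; the values and process name are illustrative, and the requested resources are read back through the task object.

```groovy
process FOO {
    cpus 2
    memory 1.GB
    time '1h'
    disk '10 GB'

    script:
    """
    # task.cpus and task.memory expose the values requested above
    echo your_command --cpus $task.cpus --mem $task.memory
    """
}

workflow {
    FOO()
}
```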
6.5.2 PublishDir directive
Given that each task is executed in a separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files in a results folder.
To store your workflow result files, you need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:
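A sketch of publishDir, assuming the data/ggal training files. The process body is illustrative, and mode: 'copy' is set explicitly here because the default publishing mode creates symbolic links rather than copies.

```groovy
process FOO {
    publishDir 'results', pattern: '*.bam', mode: 'copy'

    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('*.bam')

    script:
    """
    echo your_command_here --sample $sample_id_paths > ${sample_id}.bam
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch)
}
```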
The above example will copy all BAM files created by the FOO process into the directory path results.
Tip
The publish directory can be local or remote. For example, output files could be stored using an AWS S3 bucket by using the s3:// prefix in the target path.
You can use more than one publishDir to keep different outputs in separate directories. For example:
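A sketch with two publishDir directives, each selecting files with its own pattern. The directory names, file types, and process body are illustrative assumptions.

```groovy
process FOO {
    publishDir 'results/bam', pattern: '*.bam', mode: 'copy'
    publishDir 'results/bai', pattern: '*.bai', mode: 'copy'

    input:
    tuple val(sample_id), path(sample_id_paths)

    output:
    tuple val(sample_id), path('*.bam')
    tuple val(sample_id), path('*.bai')

    script:
    """
    echo your_command_here --sample $sample_id_paths > ${sample_id}.bam
    echo your_command_here --sample $sample_id_paths > ${sample_id}.bai
    """
}

workflow {
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    FOO(reads_ch)
}
```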
Exercise
Edit the publishDir directive in the previous example to store the output files for each sample type in a different directory.
Summary
In this step you have learned:
- How to use the cpus, time, memory, and disk directives to define the amount of computing resources to be used by the process
- How to use the publishDir directive to store the output files in a results folder