1. Operator Tour
In this chapter, we take a curated tour of the Nextflow operators. Commonly used and well-understood operators are not covered here, only those that we've seen could use more attention or whose usage can be more elaborate. This set of operators has been chosen to illustrate tangential concepts and Nextflow features.
Map is certainly the most commonly used of the operators covered here. It's a way to supply a closure through which each element in the channel is passed. The return value of the closure is emitted as an element in a new output channel. A canonical example is a closure that multiplies two numbers:
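As a minimal sketch of the idea (the input values 1 to 5 are arbitrary and chosen only for illustration), the closure below multiplies each element by itself, and `map` emits each result into a new channel:

```groovy
workflow {
    // Emit five integers, square each one inside the closure, and print the results
    Channel.of( 1, 2, 3, 4, 5 )
    | map { it * it }
    | view
}
```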
The code above is available in a starter `main.nf` file at `advanced/operators/main.nf`. It is recommended to open and edit this file to follow along with the examples given in the rest of this chapter. The workflow can be executed with:
By default, the element passed to the closure is given the name `it`. If you would prefer a more informative variable name, you can declare one using the `->` notation:
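A small sketch of a named closure parameter, using Groovy's `->` notation. The name `num` is an arbitrary choice for this illustration:

```groovy
workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map { num -> num * num }   // 'num' replaces the implicit 'it'
    | view
}
```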
Groovy is an optionally typed language, and it is possible to specify the type of the argument passed to the closure.
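For instance, assuming the channel only ever carries integers, the closure parameter could be declared as an `Integer`. This is a sketch rather than a requirement; the untyped form behaves the same for these inputs:

```groovy
workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map { Integer num -> num * num }   // the type annotation is optional
    | view
}
```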
1.1.2 Named Closures
If you find yourself re-using the same closure multiple times in your pipeline, the closure can be named and referenced:
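One way this might look, with `squareIt` as a hypothetical name chosen for the example:

```groovy
// A closure assigned to a variable so it can be reused by several operators
def squareIt = { Integer num -> num * num }

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map( squareIt )   // pass the named closure instead of an inline one
    | view
}
```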
If you have these re-usable closures defined, you can compose them together.
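A sketch of composition using Groovy's `>>` operator on closures, which applies the left-hand closure first and feeds its result to the right-hand one. The names `squareIt` and `addTwo` are illustrative:

```groovy
def squareIt = { it * it }
def addTwo = { it + 2 }

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map( squareIt >> addTwo )   // square first, then add two
    | view
}
```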
The above is the same as writing:
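The equivalent chained form, sketched with two illustrative closures named `squareIt` and `addTwo`, applies one `map` per closure:

```groovy
def squareIt = { it * it }
def addTwo = { it + 2 }

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map( squareIt )   // first transformation
    | map( addTwo )     // second transformation, applied to the squared values
    | view
}
```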
For those inclined towards functional programming, you'll be happy to know that closures can be curried:
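A minimal sketch of currying with Groovy's `Closure.curry`, which fixes the leftmost parameter and returns a new closure. The names `timesN` and `timesTen` are hypothetical:

```groovy
// A two-argument closure: the multiplier comes first so that it can be curried
def timesN = { multiplier, num -> num * multiplier }

// Fix the multiplier at 10, leaving a one-argument closure suitable for map
def timesTen = timesN.curry(10)

workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map( timesTen )
    | view
}
```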
In addition to the argument-less usage of `view` as shown above, this operator can also take a closure to customize the stdout message. We can create a closure to print the value of the elements in a channel as well as their type, for example:
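A possible version of such a closure is sketched below. The message format is an arbitrary choice; `getClass()` is the standard Java method for retrieving an object's class:

```groovy
workflow {
    Channel.of( 1, 2, 3, 4, 5 )
    | map { it * it }
    // Print each element together with its runtime type
    | view { "Found '$it' (${it.getClass()})" }
}
```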
Most closures will remain anonymous
In many cases, it is simply cleaner to keep the closure anonymous, defined inline. Giving closures a name is only recommended when you find yourself defining the same or similar closures repeatedly in a given workflow.
A common Nextflow pattern is for a simple samplesheet to be passed as primary input into a workflow. We'll see some more complicated ways to manage these inputs later on in the workshop, but the `splitCsv` operator (docs) is an excellent tool to have in a pinch. This operator will parse a csv/tsv and return a channel where each item is a row in the csv/tsv:
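As a sketch, reading the `data/samplesheet.csv` file used later in this chapter might look like the following. With `header: true`, each row is emitted as a map keyed by the column names in the first line of the file:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )   // one channel element per csv row
    | view
}
```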
From the directory `advanced/operators`, use the `splitCsv` and `map` operators to read the file `data/samplesheet.csv` and return a channel that would be suitable input to the process below. Feel free to consult the splitCsv documentation for tips.
By specifying the `header` argument in the `splitCsv` operator, we have convenient named access to csv elements. The closure returns a list of two elements, where the second element is a list of paths.
Convert Strings to Paths
The fastq paths are simple strings in the context of a csv row. In order to pass them as paths to a Nextflow process, they need to be converted into objects that adhere to the `Path` interface. This is accomplished by wrapping them in `file`.
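One possible solution to the exercise is sketched here. It assumes the samplesheet has columns named `id`, `fastq1` and `fastq2`; the fastq column names are an assumption made for this example, so adjust them to match the actual header:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row ->
        // First element: the sample id. Second element: a list of two paths.
        // file() turns each string from the csv into a Path object.
        [row.id, [file(row.fastq1), file(row.fastq2)]]
    }
    | view
}
```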
In the sample above, we've lost an important piece of metadata - the tumor/normal classification, choosing only the sample id as the first element in the output list.
In the next chapter, we'll discuss the "meta map" pattern in more detail, but we can preview that here.
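A rough preview of the pattern follows. The column names `id`, `type`, `repeat`, `fastq1` and `fastq2` are assumed for this sketch. The first element of each emitted list is a map that carries all of the sample metadata:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row ->
        // Collect the descriptive columns into a single "meta map"
        def metaMap = [id: row.id, type: row.type, repeat: row.repeat]
        [metaMap, [file(row.fastq1), file(row.fastq2)]]
    }
    | view
}
```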
The construction of this map is very repetitive, and in the next chapter, we'll discuss some Groovy methods available on the `Map` class that can make this pattern more concise and less error-prone.
The `multiMap` operator (documentation) is a way of taking a single input channel and emitting into multiple channels for each input element.
Let's assume we've been given a samplesheet that has tumor/normal pairs bundled together on the same row. View the example samplesheet with:
Using the `splitCsv` operator would give us one entry per row, each containing all four fastq files. Let's consider that we wanted to split these fastqs into separate channels for tumor and normal. In other words, for every row in the samplesheet, we would like to emit an entry into two new channels. To do this, we can use the `multiMap` operator:
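The sketch below shows how this could be written. The file name `data/samplesheet.ugly.csv` and the column names `tumor_fastq_1`, `tumor_fastq_2`, `normal_fastq_1` and `normal_fastq_2` are hypothetical placeholders, so substitute the names used in your own samplesheet:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.ugly.csv")   // hypothetical file name
    | splitCsv( header: true )
    | multiMap { row ->
        // Each label names an output channel; every row emits into both
        tumor: [[id: row.id, type: 'tumor', repeat: row.repeat], file(row.tumor_fastq_1), file(row.tumor_fastq_2)]
        normal: [[id: row.id, type: 'normal', repeat: row.repeat], file(row.normal_fastq_1), file(row.normal_fastq_2)]
    }
    | set { samples }

    samples.tumor | view { "Tumor: $it" }
    samples.normal | view { "Normal: $it" }
}
```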
The closure supplied to `multiMap` needs to return multiple channels, so using named closures as described in the `map` section above will not work. Fortunately, Nextflow provides the convenience `multiMapCriteria` method to allow you to define named `multiMap` closures should you need them. See the `multiMap` documentation for more info.
The `branch` operator (documentation) is a way of taking a single input channel and emitting a new element into one (and only one) of a selection of output channels.
In the example above, the `multiMap` operator was necessary because we were supplied with a samplesheet that combined two pairs of fastq per row and we wanted to turn each row into new elements in multiple channels. If we were to use the neater samplesheet that had tumor/normal pairs on separate rows, we could use the `branch` operator to achieve the same result, as we are routing each input element into a single output channel.
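A sketch of this routing, assuming samplesheet columns named `id`, `repeat`, `type`, `fastq1` and `fastq2` (the fastq column names are assumptions):

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row -> [[id: row.id, repeat: row.repeat, type: row.type], [file(row.fastq1), file(row.fastq2)]] }
    | branch { meta, reads ->
        // Each label is an output channel followed by its test condition
        tumor: meta.type == "tumor"
        normal: meta.type == "normal"
    }
    | set { samples }

    samples.tumor | view { "Tumor: $it" }
    samples.normal | view { "Normal: $it" }
}
```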
An element is only emitted to the first channel where the test condition is met. If an element does not meet any of the tests, it is not emitted to any of the output channels. You can 'catch' any such samples by specifying `true` as a condition. If we knew that all samples would be either tumor or normal and no third 'type', we could write:
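A sketch of the catch-all form, under the same assumed column names (`id`, `repeat`, `type`, `fastq1`, `fastq2`). Any element that fails the `tumor` test falls through to `normal`:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row -> [[id: row.id, repeat: row.repeat, type: row.type], [file(row.fastq1), file(row.fastq2)]] }
    | branch { meta, reads ->
        tumor: meta.type == "tumor"
        normal: true   // catches every element not already routed to 'tumor'
    }
    | set { samples }

    samples.tumor | view { "Tumor: $it" }
    samples.normal | view { "Normal: $it" }
}
```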
We may want to emit a slightly different element than the one passed as input. The `branch` operator can (optionally) return a new element to a channel. For example, to add an extra key in the meta map of the tumor samples, we add a new line under the condition and return our new element. In this example, we modify the first element of the `List` to be a new list that is the result of merging the existing meta map with a new map containing a single key:
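A sketch of returning a modified element from a branch. The key `newKey` and its value `'myValue'` are placeholders, and the column names are assumed as before:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row -> [[id: row.id, repeat: row.repeat, type: row.type], [file(row.fastq1), file(row.fastq2)]] }
    | branch { meta, reads ->
        tumor: meta.type == "tumor"
            // Emit a new list whose meta map carries one extra key
            return [meta + [newKey: 'myValue'], reads]
        normal: true
    }
    | set { samples }

    samples.tumor | view { "Tumor: $it" }
    samples.normal | view { "Normal: $it" }
}
```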
How would you modify the element returned in the `tumor` channel to have the key:value pair `type:'abnormal'` instead of `type:'tumor'`?
There are many ways to accomplish this, but the map merging pattern introduced above can also be used to safely and concisely rename values in a map.
Merging maps is safe
Using the `+` operator to merge two or more Maps returns a new Map. There are rare edge cases where modification of a map rather than returning a new map can affect other channels. We discuss this further in the next chapter, but just be aware that this `+` operator is safer and often more convenient than modifying the `meta` object directly.
See the Groovy Map documentation for details.
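The behaviour can be checked in plain Groovy, outside of any workflow. In this small sketch (the sample values are illustrative), the right-hand map overrides the matching key, a new map is returned, and the original is left untouched:

```groovy
def meta = [id: 'sampleA', type: 'tumor']

// Merge with a one-key map: the right-hand value wins for the shared key
def renamed = meta + [type: 'abnormal']

assert renamed == [id: 'sampleA', type: 'abnormal']
assert meta == [id: 'sampleA', type: 'tumor']   // the original map is unchanged
assert !renamed.is(meta)                        // a different object was returned
```

Inside a `branch` or `map` closure, the same expression, `meta + [type: 'abnormal']`, could be used as the first element of the returned list.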
1.5.1 Multi-channel Objects
Some Nextflow operators return objects that contain multiple channels. The `multiMap` and `branch` operators are excellent examples. In most instances, the output is assigned to a variable and then addressed by name:
or by using the `set` operator (documentation):
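Both styles are sketched together below. The channel contents and the names `numbers`, `tens`, `small` and `large` are arbitrary choices for illustration:

```groovy
workflow {
    // Option 1: assign the multi-channel object to a variable
    numbers = Channel.of( 1, 2, 3, 4, 5 ).multiMap { num ->
        small: num
        large: num * 10
    }
    numbers.small | view { num -> "Small: $num" }
    numbers.large | view { num -> "Large: $num" }

    // Option 2: name the multi-channel object with the set operator
    Channel.of( 1, 2, 3, 4, 5 )
    | multiMap { num ->
        small: num
        large: num * 10
    }
    | set { tens }

    tens.large | view { num -> "Large via set: $num" }
}
```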
A more interesting situation occurs when given a process that takes multiple channels as input:
You can either provide the channels individually:
or you can provide the multichannel as a single input:
For an even cleaner solution, you can skip the now-redundant `set` operator:
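The options above can be sketched with a hypothetical process named `MultiInput` that takes two value inputs. Only one invocation is active in the workflow, since a process can be called only once per workflow without an alias; the other two forms are shown as comments:

```groovy
process MultiInput {
    debug true   // echo the process stdout to the terminal

    input:
    val(smallNum)
    val(bigNum)

    script:
    "echo -n small is $smallNum and big is $bigNum"
}

workflow {
    // Pipe the multi-channel object straight into the process
    Channel.of( 1, 2, 3, 4, 5 )
    | multiMap { num ->
        small: num
        large: num * 10
    }
    | MultiInput

    // Alternatives, given a multi-channel object assigned to 'numbers':
    //   MultiInput(numbers.small, numbers.large)   // channels passed individually
    //   MultiInput(numbers)                        // multi-channel passed as one input
}
```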
If you have processes that output multiple channels and input multiple channels and the cardinality matches, they can be chained together in the same manner.
A common operation is to group elements from a single channel where those elements share a common key. Take this samplesheet as an example:
We see that there are multiple rows where the first element in the item emitted by the channel is the Map `[id:sampleA, type:normal]`, and others where the first element is the Map `[id:sampleA, type:tumor]`. The `groupTuple` operator allows us to combine elements that share a common key:
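A sketch of grouping on the meta map. The columns `id`, `type` and `repeat` appear in this chapter's output, while the `fastq1` and `fastq2` column names are assumptions. By default `groupTuple` uses the first element of each list as the grouping key:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row ->
        // The key holds only id and type, so repeats of a sample share a key
        def meta = [id: row.id, type: row.type]
        [meta, row.repeat, [row.fastq1, row.fastq2]]
    }
    | groupTuple   // collects the remaining elements into lists, per key
    | view
}
```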
The `transpose` operator is often misunderstood. It can be thought of as the inverse of the `groupTuple` operator. Given the following workflow, the `groupTuple` and `transpose` operators cancel each other out. Removing lines 8 and 9 returns the same result.
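A sketch of such a workflow is given below, laid out so that `groupTuple` and `transpose` sit on lines 8 and 9. The `fastq1` and `fastq2` column names are assumed:

```groovy
workflow {
    Channel.fromPath("data/samplesheet.csv")
    | splitCsv( header: true )
    | map { row ->
        def meta = [id: row.id, type: row.type]
        [meta, row.repeat, [row.fastq1, row.fastq2]]
    }
    | groupTuple
    | transpose
    | view
}
```

Here `groupTuple` gathers the repeat numbers and read pairs into parallel lists for each key, and `transpose` unpacks those lists again into one element per repeat.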
Given a workflow that returns one element per sample, where we have grouped the samplesheet lines on a meta containing only id and type:
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [spontaneous_rutherford] DSL2 - revision: 7dc1cc0039
[[id:sampleA, type:normal], [1, 2], [[data/reads/sampleA_rep1_normal_R1.fastq.gz, data/reads/sampleA_rep1_normal_R2.fastq.gz], [data/reads/sampleA_rep2_normal_R1.fastq.gz, data/reads/sampleA_rep2_normal_R2.fastq.gz]]]
[[id:sampleA, type:tumor], [1, 2], [[data/reads/sampleA_rep1_tumor_R1.fastq.gz, data/reads/sampleA_rep1_tumor_R2.fastq.gz], [data/reads/sampleA_rep2_tumor_R1.fastq.gz, data/reads/sampleA_rep2_tumor_R2.fastq.gz]]]
[[id:sampleB, type:normal], [1], [[data/reads/sampleB_rep1_normal_R1.fastq.gz, data/reads/sampleB_rep1_normal_R2.fastq.gz]]]
[[id:sampleB, type:tumor], [1], [[data/reads/sampleB_rep1_tumor_R1.fastq.gz, data/reads/sampleB_rep1_tumor_R2.fastq.gz]]]
[[id:sampleC, type:normal], [1], [[data/reads/sampleC_rep1_normal_R1.fastq.gz, data/reads/sampleC_rep1_normal_R2.fastq.gz]]]
[[id:sampleC, type:tumor], [1], [[data/reads/sampleC_rep1_tumor_R1.fastq.gz, data/reads/sampleC_rep1_tumor_R2.fastq.gz]]]
If we add in a `transpose`, each repeat number is matched back to the appropriate list of reads:
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [elegant_rutherford] DSL2 - revision: 2c5476b133
[[id:sampleA, type:normal], 1, [data/reads/sampleA_rep1_normal_R1.fastq.gz, data/reads/sampleA_rep1_normal_R2.fastq.gz]]
[[id:sampleA, type:normal], 2, [data/reads/sampleA_rep2_normal_R1.fastq.gz, data/reads/sampleA_rep2_normal_R2.fastq.gz]]
[[id:sampleA, type:tumor], 1, [data/reads/sampleA_rep1_tumor_R1.fastq.gz, data/reads/sampleA_rep1_tumor_R2.fastq.gz]]
[[id:sampleA, type:tumor], 2, [data/reads/sampleA_rep2_tumor_R1.fastq.gz, data/reads/sampleA_rep2_tumor_R2.fastq.gz]]
[[id:sampleB, type:normal], 1, [data/reads/sampleB_rep1_normal_R1.fastq.gz, data/reads/sampleB_rep1_normal_R2.fastq.gz]]
[[id:sampleB, type:tumor], 1, [data/reads/sampleB_rep1_tumor_R1.fastq.gz, data/reads/sampleB_rep1_tumor_R2.fastq.gz]]
[[id:sampleC, type:normal], 1, [data/reads/sampleC_rep1_normal_R1.fastq.gz, data/reads/sampleC_rep1_normal_R2.fastq.gz]]
[[id:sampleC, type:tumor], 1, [data/reads/sampleC_rep1_tumor_R1.fastq.gz, data/reads/sampleC_rep1_tumor_R2.fastq.gz]]