7. Operators¶
Nextflow operators are methods that allow you to manipulate channels. Every operator, with the exception of `set` and `subscribe`, produces one or more new channels, allowing you to chain operators to fit your needs.
There are seven main groups of operators, described in greater detail in the Nextflow Reference Documentation:
- Filtering operators
- Transforming operators
- Splitting operators
- Combining operators
- Forking operators
- Maths operators
- Other operators
7.1 Basic example¶
The `map` operator applies a function of your choosing to every item emitted by a channel, and returns the items so obtained as a new channel. The function applied is called the mapping function and is expressed with a closure as shown in the example below:
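A minimal sketch of such a snippet follows. The variable names `nums` and `square` are illustrative choices, not names required by Nextflow.

```groovy
// Queue channel emitting four values
nums = Channel.of(1, 2, 3, 4)

// New channel in which each number is replaced by its square
square = nums.map { num -> num * num }

// Print the channel content, one item per line
square.view()
```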
- Creates a queue channel emitting four values
- Creates a new channel, transforming each number into its square
- Prints the channel content
Operators can also be chained to implement custom behaviors, so the previous snippet can also be written as:
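A chained version could look like the sketch below, which performs the same three steps (create, square, print) without intermediate variables.

```groovy
Channel
    .of(1, 2, 3, 4)              // create the channel
    .map { num -> num * num }    // square each item
    .view()                      // print the result
```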
Summary
In this step you have learned:
- The basic features of an operator
7.2 Commonly used operators¶
Here you will explore some of the most commonly used operators.
7.2.1 view()¶
The `view` operator prints the items emitted by a channel to the console standard output, appending a new line character to each item. For example:
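A minimal sketch, using three arbitrary strings as sample items:

```groovy
Channel
    .of('foo', 'bar', 'baz')
    .view()    // each item is printed on its own line
```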
An optional closure parameter can be specified to customize how items are printed. For example:
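As a sketch, the closure below prefixes each item with a dash before printing it. The parameter name `item` and the prefix are arbitrary.

```groovy
Channel
    .of('foo', 'bar', 'baz')
    .view { item -> "- $item" }    // the closure returns the string to print
```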
7.2.2 map()¶
The `map` operator applies a function of your choosing to every item emitted by a channel and returns the items obtained as a new channel. The function applied is called the mapping function and is expressed with a closure. In the example below the Groovy `reverse` method has been used to reverse the order of the characters in each string emitted by the channel.
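One way to write this, assuming two sample strings as input:

```groovy
Channel
    .of('hello', 'world')
    .map { word -> word.reverse() }    // Groovy String.reverse()
    .view()
```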
A `map` can associate a generic tuple with each element, and the tuple can contain any data. In the example below the Groovy `size` method is used to return the length of each string emitted by the channel.
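A sketch of this idea, pairing each word with its length. The message format in `view` is only an illustration.

```groovy
Channel
    .of('hello', 'world')
    // emit a two-element list: the word and its number of characters
    .map { word -> [word, word.size()] }
    // the two elements are unpacked into the closure parameters
    .view { word, len -> "$word contains $len letters" }
```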
Exercise
Use `fromPath` to create a channel emitting the fastq files matching the pattern `data/ggal/*.fq`, then use `map` to return a pair containing the file name and the file path. Finally, use `view` to print the resulting channel.
Hint
You can use the `name` method to get the file name.
Solution
Here is one possible solution:
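The sketch below is one way to do it. The closure parameter name `fq` is arbitrary.

```groovy
Channel
    .fromPath('data/ggal/*.fq')
    // pair each file name with the full path object
    .map { fq -> [fq.name, fq] }
    .view()
```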
Your output should look like this:
[gut_1.fq, /workspace/gitpod/nf-training/data/ggal/gut_1.fq]
[gut_2.fq, /workspace/gitpod/nf-training/data/ggal/gut_2.fq]
[liver_1.fq, /workspace/gitpod/nf-training/data/ggal/liver_1.fq]
[liver_2.fq, /workspace/gitpod/nf-training/data/ggal/liver_2.fq]
[lung_1.fq, /workspace/gitpod/nf-training/data/ggal/lung_1.fq]
[lung_2.fq, /workspace/gitpod/nf-training/data/ggal/lung_2.fq]
7.2.3 mix()¶
The `mix` operator combines the items emitted by two (or more) channels.
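A minimal sketch with three small channels. The channel names and their contents are illustrative.

```groovy
my_channel_1 = Channel.of(1, 2, 3)
my_channel_2 = Channel.of('a', 'b')
my_channel_3 = Channel.of('z')

// Merge all three channels into one; the interleaving is not guaranteed
my_channel_1
    .mix(my_channel_2, my_channel_3)
    .view()
```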
The result is a single channel containing all the items emitted by the three channels.
Warning
The items in the resulting channel have the same order as in their respective original channels. However, there is no guarantee that the elements of the second channel are appended after the elements of the first. For instance, the element `a` may be printed before `3`.
7.2.4 flatten()¶
The `flatten` operator transforms a channel so that every tuple is flattened and each entry is emitted as a separate element by the resulting channel.
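For instance, a sketch with two lists, `foo` and `bar` (illustrative names), emitted as two items and then flattened into individual numbers:

```groovy
foo = [1, 2, 3]
bar = [4, 5, 6]

Channel
    .of(foo, bar)    // two items, each a list of three numbers
    .flatten()       // six items, one per number
    .view()
```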
7.2.5 collect()¶
The `collect` operator collects all of the items emitted by a channel in a list and returns the object as a sole emission.
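A minimal sketch, gathering four numbers into a single list:

```groovy
Channel
    .of(1, 2, 3, 4)
    .collect()    // one emission holding all four items as a list
    .view()
```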
Info
The result of the `collect` operator is a value channel.
7.2.6 groupTuple()¶
The `groupTuple` operator collects tuples (or lists) of values emitted by the source channel, grouping the elements that share the same key. Finally, it emits a new tuple object for each distinct key collected.
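A sketch with made-up key/value pairs, where the first element of each pair acts as the grouping key:

```groovy
Channel
    .of([1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'])
    // one tuple per distinct key: [key, [values sharing that key]]
    .groupTuple()
    .view()
```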
This operator is especially useful to process a group together with all the elements that share a common property or grouping key.
Exercise
Use `fromPath` to create a channel emitting all of the files in the folder `data/meta/`, then use `map` to pair each file with its `baseName`. Finally, group all files that have the same common prefix.
Solution
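One possible approach is sketched below. The closure parameter name `meta_file` is arbitrary.

```groovy
Channel
    .fromPath('data/meta/*')
    // use the file name without extension as the grouping key
    .map { meta_file -> tuple(meta_file.baseName, meta_file) }
    .groupTuple()
    .view()
```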
[patients_1, [/workspace/gitpod/nf-training/data/meta/patients_1.csv]]
[patients_2, [/workspace/gitpod/nf-training/data/meta/patients_2.csv]]
[random, [/workspace/gitpod/nf-training/data/meta/random.txt]]
[regions, [/workspace/gitpod/nf-training/data/meta/regions.json, /workspace/gitpod/nf-training/data/meta/regions.tsv, /workspace/gitpod/nf-training/data/meta/regions.yml]]
[regions2, [/workspace/gitpod/nf-training/data/meta/regions2.json]]
7.2.7 join()¶
The `join` operator creates a channel that joins together the items emitted by two channels with a matching key. The key is defined, by default, as the first element in each item emitted.
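A sketch with two illustrative channels, `left` and `right`, in which the key `P` appears only in `left`:

```groovy
left  = Channel.of(['X', 1], ['Y', 2], ['Z', 3], ['P', 7])
right = Channel.of(['Z', 6], ['Y', 5], ['X', 4])

// Items are matched on their first element (the key)
left.join(right).view()
```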
Note
Notice that `P` is missing in the final result, because its key has no match in the other channel.
7.2.8 branch()¶
The `branch` operator allows you to forward the items emitted by a source channel to one or more output channels.
The selection criterion is defined by specifying a closure that provides one or more boolean expressions, each of which is identified by a unique label. Each item is bound to the channel named after the label of the first expression that evaluates to true.
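A sketch that splits numbers into two labelled channels. The labels `small` and `large`, the threshold of 10, and the name `result` are illustrative.

```groovy
Channel
    .of(1, 2, 3, 40, 50)
    .branch { num ->
        small: num < 10
        large: num > 10
    }
    .set { result }    // `result` holds both output channels

// Each labelled channel is accessed as a property of the result
result.small.view { num -> "$num is small" }
result.large.view { num -> "$num is large" }
```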
Info
The `branch` operator returns a multi-channel object (i.e., a variable that holds more than one channel object).
Note
In the above example, what would happen to a value of 10? To deal with this, you can also use `>=`.
Summary
In this step you have learned:
- How to use the `view` operator to print the content of a channel
- How to use the `map` operator to transform the content of a channel
- How to use the `mix` operator to combine the content of two or more channels
- How to use the `flatten` operator to flatten the content of a channel
- How to use the `collect` operator to collect the content of a channel
- How to use the `groupTuple` operator to group the content of a channel
- How to use the `join` operator to join the content of two channels
- How to use the `branch` operator to split the content of a channel
7.3 Text files¶
7.3.1 splitText()¶
The `splitText` operator allows you to split multi-line strings or text file items emitted by a source channel into chunks containing n lines, which will be emitted by the resulting channel.
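A minimal sketch that reads `data/meta/random.txt` and emits it line by line:

```groovy
Channel
    .fromPath('data/meta/random.txt')    // channel from the file path
    .splitText()                         // one chunk per line by default
    .view()                              // print each chunk
```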
- Instructs Nextflow to make a channel from the path `data/meta/random.txt`
- The `splitText` operator splits each item into chunks of one line by default.
- View contents of the channel.
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting,
...
You can define the number of lines in each chunk by using the parameter `by`, as shown in the following example:
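A sketch using chunks of two lines. The chunk size is an arbitrary choice.

```groovy
Channel
    .fromPath('data/meta/random.txt')
    .splitText(by: 2)    // each emitted chunk contains two lines
    .view()
```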
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting,
...
An optional closure can also be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 2 lines and transform them into capital letters:
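One way this could be written, with `chunk` as an arbitrary parameter name:

```groovy
Channel
    .fromPath('data/meta/random.txt')
    // the closure receives each two-line chunk and returns the transformed text
    .splitText(by: 2) { chunk -> chunk.toUpperCase() }
    .view()
```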
LOREM IPSUM IS SIMPLY DUMMY TEXT OF THE PRINTING AND TYPESETTING INDUSTRY.
LOREM IPSUM HAS BEEN THE INDUSTRY'S STANDARD DUMMY TEXT EVER SINCE THE 1500S,
WHEN AN UNKNOWN PRINTER TOOK A GALLEY OF TYPE AND SCRAMBLED IT TO MAKE A TYPE SPECIMEN BOOK.
IT HAS SURVIVED NOT ONLY FIVE CENTURIES, BUT ALSO THE LEAP INTO ELECTRONIC TYPESETTING,
...
7.3.2 splitCsv()¶
The `splitCsv` operator allows you to parse CSV-formatted text items emitted by a channel.
It then splits them into records or groups them as a list of records with a specified length.
In the simplest case, just apply the `splitCsv` operator to a channel emitting CSV-formatted text files or text entries. For example, to view only the first and fourth columns:
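A sketch of this case, assuming the input is `data/meta/patients_1.csv` (the file name is inferred from the exercises in this section):

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv()
    // without a header option, each row is a list indexed from zero
    .view { row -> "${row[0]}, ${row[3]}" }
```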
patient_id, num_samples
ATX-TBL-001-GB-02-117, 3
ATX-TBL-001-GB-01-110, 3
ATX-TBL-001-GB-03-101, 3
ATX-TBL-001-GB-04-201, 3
ATX-TBL-001-GB-02-120, 3
ATX-TBL-001-GB-04-102, 3
ATX-TBL-001-GB-03-104, 3
ATX-TBL-001-GB-03-103, 3
When the CSV begins with a header line defining the column names, you can specify the parameter `header: true`, which allows you to reference each value by its column name, as shown in the following example:
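A sketch using the column names `patient_id` and `num_samples` that appear in the output shown earlier. The input file name is again an assumption.

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv(header: true)
    // with header: true, each row is a map keyed by column name
    .view { row -> "${row.patient_id}, ${row.num_samples}" }
```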
Alternatively, you can provide custom header names by specifying a list of strings in the `header` parameter as shown below:
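A sketch with made-up column names `col1` to `col5`, assuming the file has five columns. Because the names are supplied by you, the first line of the file is treated as data rather than as a header.

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    // hypothetical names; adjust the list to the real number of columns
    .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'])
    .view { row -> "${row.col1}, ${row.col4}" }
```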
patient_id, num_samples
ATX-TBL-001-GB-02-117, 3
ATX-TBL-001-GB-01-110, 3
ATX-TBL-001-GB-03-101, 3
ATX-TBL-001-GB-04-201, 3
ATX-TBL-001-GB-02-120, 3
ATX-TBL-001-GB-04-102, 3
ATX-TBL-001-GB-03-104, 3
ATX-TBL-001-GB-03-103, 3
You can also process multiple CSV files at the same time:
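A sketch that uses a glob pattern to pick up several files, assuming they are named `patients_*.csv` and share the same header:

```groovy
Channel
    .fromPath('data/meta/patients_*.csv')    // a pattern matches several files
    .splitCsv(header: true)
    // a tab is used here as the output delimiter
    .view { row -> "${row.patient_id}\t${row.num_samples}" }
```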
ATX-TBL-001-GB-02-117 3
ATX-TBL-001-GB-01-110 3
ATX-TBL-001-GB-03-101 3
ATX-TBL-001-GB-04-201 3
ATX-TBL-001-GB-02-120 3
ATX-TBL-001-GB-04-102 3
ATX-TBL-001-GB-03-104 3
ATX-TBL-001-GB-03-103 3
ATX-TBL-001-GB-01-111 2
ATX-TBL-001-GB-01-112 3
ATX-TBL-001-GB-04-202 3
ATX-TBL-001-GB-02-124 3
ATX-TBL-001-GB-02-107 3
ATX-TBL-001-GB-01-105 3
ATX-TBL-001-GB-02-108 3
ATX-TBL-001-GB-01-113 3
Tip
Notice that you can change the output format simply by adding a different delimiter.
Finally, you can also operate on CSV files outside the channel context:
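A sketch of this usage, relying on the `splitCsv` method that Nextflow adds to file objects. The variable names and the columns printed are illustrative.

```groovy
def csv_file = file('data/meta/patients_1.csv')

// splitCsv() on a file returns the parsed rows directly, with no channel involved
def rows = csv_file.splitCsv()

rows.each { row ->
    log.info "${row[0]} -- ${row[2]}"
}
```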
Exercise
Create a CSV file and use it as input for `script7.nf`, part of the Simple RNA-Seq workflow tutorial.
Solution
Add a CSV text file containing the following, as an example input with the name "fastq.csv":
gut,/workspace/gitpod/nf-training/data/ggal/gut_1.fq,/workspace/gitpod/nf-training/data/ggal/gut_2.fq
Then, in `script7.nf`, replace the input channel for the reads with a channel built from the CSV file using `splitCsv`:
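The replacement might look like the sketch below. The channel name `read_pairs_ch` is an assumption, since the original lines of `script7.nf` are not shown here; use whatever name your script already passes to its processes.

```groovy
Channel
    .fromPath('fastq.csv')
    .splitCsv()
    // each row holds: sample name, first read file, second read file
    .view { row -> "${row[0]}, ${row[1]}, ${row[2]}" }
    .set { read_pairs_ch }    // hypothetical channel name
```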
Finally, change the cardinality of the processes that use the input data.
Repeat the above for the fastqc step.
Now the workflow should run from a CSV file.
7.3.3 Tab separated values (.tsv)¶
Parsing TSV files works in a similar way. Simply add the `sep: '\t'` option in the `splitCsv` context:
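A minimal sketch, using `data/meta/regions.tsv` as the input file:

```groovy
Channel
    .fromPath('data/meta/regions.tsv', checkIfExists: true)
    // the `sep` option sets the field separator to a TAB character
    .splitCsv(sep: '\t')
    .view()
```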
Exercise
Use the tab separation technique on the file `data/meta/regions.tsv`, but print just the first column, and remove the header.
7.3.4 splitJson()¶
You can parse the JSON file format using the `splitJson` channel operator.
The `splitJson` operator supports JSON arrays:
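A sketch with a JSON array passed as a string. Each array element is expected to be emitted as its own item.

```groovy
Channel
    .of('["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]')
    .splitJson()
    .view { item -> "Item: ${item}" }
```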
As well as JSON arrays in objects:
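A sketch with a made-up JSON object whose key `B` holds an array. When the root is an object, each top-level entry should be emitted as a separate item.

```groovy
Channel
    .of('{"A": 1, "B": [1, 2, 3]}')
    .splitJson()
    .view { item -> "Item: ${item}" }
```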
And even a JSON array of JSON objects:
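A sketch with two illustrative records passed as a single JSON string:

```groovy
Channel
    .of('[{"name": "Bob", "height": 180, "champion": false}, {"name": "Alice", "height": 170, "champion": false}]')
    .splitJson()
    // each JSON object in the array becomes one item
    .view { item -> "Item: ${item}" }
```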
You can also parse JSON files directly:
[
{ "name": "Bob", "height": 180, "champion": false },
{ "name": "Alice", "height": 170, "champion": false }
]
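Assuming the JSON above is saved as `file.json` (a hypothetical file name), a sketch for parsing it could be:

```groovy
Channel
    .fromPath('file.json')    // hypothetical path to the JSON file shown above
    .splitJson()
    .view { item -> "Item: ${item}" }
```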
Summary
In this step you have learned:
- How to use the `splitText` operator to split text files of various formats
- How to use the `splitJson` operator to split JSON files of various formats
7.4 More resources¶
Check the operators documentation on the Nextflow website.