5. Channels
Channels are a key data structure of Nextflow. They allow you to implement reactive, functional-style computational workflows based on the Dataflow programming paradigm.
They are used to logically connect tasks to each other or to implement functional-style data transformations.
5.1 Channel types
Nextflow distinguishes two different kinds of channels: queue channels and value channels.
5.1.1 Queue channel
A queue channel is an asynchronous unidirectional FIFO queue that connects two processes or operators.
- asynchronous means that operations are non-blocking.
- unidirectional means that data flows from a producer to a consumer.
- FIFO (First In, First Out) means that data is guaranteed to be delivered in the same order in which it is produced.
A queue channel is implicitly created by process output definitions or using channel factories such as Channel.of or Channel.fromPath.
Try the following snippets:
- Applying the view channel operator to the ch channel prints each item emitted by the channel
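The snippet this note refers to is not reproduced here. A minimal sketch of it, assuming a channel named ch that emits the integers 1, 2 and 3 (the values are an assumption for illustration), could look like this:

```groovy
workflow {
    // Channel.of creates a queue channel; the values 1, 2, 3 are assumed for illustration
    ch = Channel.of(1, 2, 3)

    // view prints each item in the order it was emitted (FIFO):
    // 1
    // 2
    // 3
    ch.view()
}
```

Saved as snippet.nf, this can be executed with `nextflow run snippet.nf`.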
Exercise
The script snippet.nf contains the code from above. Execute it with Nextflow and view the output.
5.1.2 Value channels
A value channel (a.k.a. a singleton channel) is bound to a single value and can be read an unlimited number of times without consuming its contents. A value channel is created using the value channel factory or by operators returning a single value, such as first, last, collect, count, min, max, reduce, and sum.
To see the difference between value and queue channels, you can modify snippet.nf to the following:
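The modified script is not reproduced here. The sketch below is one version consistent with the surrounding description; the channel contents (1, 2, 3 and 1) and the bash arithmetic inside SUM are assumptions chosen so that the first result is 2.

```groovy
process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    // \$ keeps the dollar sign for bash; $x and $y are filled in by Nextflow
    """
    echo \$(($x+$y))
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)   // queue channel with three items (assumed values)
    ch2 = Channel.of(1)         // queue channel with a single item (assumed value)

    // Only one task is launched: ch2 is exhausted once its single item is consumed
    SUM(ch1, ch2).view()
}
```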
This workflow creates two channels, ch1 and ch2, and then uses them as inputs to the SUM process. The SUM process sums the two inputs and prints the result to the standard output.
When you run this script, it only prints 2.
A process will only instantiate a task when there are elements to be consumed from all the channels provided as input to it. Because ch1 and ch2 are queue channels, and the single element of ch2 has been consumed, no new process instances will be launched, even if there are other elements to be consumed in ch1.
To use the single element in ch2 multiple times, you can either use the Channel.value channel factory, or use a channel operator that returns a single element, such as first():
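Both options can be sketched as follows. As before, the SUM process body and the channel values are assumptions for illustration, not the original snippet.

```groovy
process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)

    // Option 1: create a value channel directly with the Channel.value factory
    ch2 = Channel.value(1)

    // Option 2: derive a value channel from a queue channel with first()
    // ch2 = Channel.of(1).first()

    // A value channel is never exhausted, so one task runs per item of ch1
    SUM(ch1, ch2).view()
}
```

With either option, three tasks are launched. Because tasks run in parallel, the order in which their results are printed is not guaranteed.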
In many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation.
For example, when you invoke a process with a workflow parameter (params.ch2) which has a string value, it is automatically cast into a value channel:
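A minimal sketch of this implicit conversion, assuming the same illustrative SUM process and a default parameter value of "1":

```groovy
// Default value for the parameter; it can be overridden with --ch2 on the command line
params.ch2 = "1"

process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)

    // params.ch2 is a plain string, not a channel;
    // Nextflow wraps it in a value channel when the process is invoked
    SUM(ch1, params.ch2).view()
}
```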
The output is the same as in the previous example, where the first() operator was used.
Exercise
Use the .first() operator to create a value channel from ch2 so that all 3 elements of ch1 are consumed.
Summary
In this step you have learned:
- The features of value and queue channels
- Strategies to change channel types
5.2 Channel factories
Channel factories are Nextflow methods for creating channels, each with its own expected inputs and behavior. Several channel factories exist for different situations. The following sections cover the most common ones.
Tip
New in version 20.07.0: channel was introduced as an alias of Channel, allowing factory methods to be specified as channel.of() or Channel.of(), and so on.
5.2.1 value()
The value channel factory is used to create a value channel. An optional non-null argument can be specified to bind the channel to a specific value. For example:
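The three cases can be sketched as follows. The variable names and the bound values ('Hello there' and the list) are assumptions for illustration.

```groovy
workflow {
    ch1 = Channel.value()                  // no argument: empty value channel
    ch2 = Channel.value('Hello there')     // bound to a string
    ch3 = Channel.value([1, 2, 3, 4, 5])   // bound to a list, emitted as one item

    ch2.view()   // prints: Hello there
    ch3.view()   // prints: [1, 2, 3, 4, 5]
}
```

Note that ch3 emits the whole list once, rather than one item per list element.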
- Creates an empty value channel
- Creates a value channel and binds a string to it
- Creates a value channel and binds a list object to it, which will be emitted as a single item
5.2.2 of()
The factory Channel.of allows the creation of a queue channel with the values specified as arguments.
This example creates a channel that emits the values specified as a parameter in the of channel factory. It will print the following:
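Neither the example nor its output is reproduced here. A minimal sketch with assumed values (1, 3, 5, 7) and an assumed "value:" prefix, with the resulting output given as comments:

```groovy
workflow {
    Channel
        .of(1, 3, 5, 7)                // assumed values, one item per argument
        .view { v -> "value: $v" }     // the closure formats each item before printing

    // Prints:
    // value: 1
    // value: 3
    // value: 5
    // value: 7
}
```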
The Channel.of channel factory works in a similar manner to Channel.from (which is now deprecated), fixing some inconsistent behaviors of the latter and providing better handling when specifying a range of values. For example, the following works with a range from 1 to 23:
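A sketch of such a range-based call. The extra 'X' and 'Y' arguments are an assumption added to show that ranges and single values can be mixed.

```groovy
workflow {
    // 1..23 is a Groovy range; Channel.of expands it into 23 separate items,
    // followed by the two assumed string items 'X' and 'Y'
    Channel
        .of(1..23, 'X', 'Y')
        .view()
}
```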
5.2.3 fromList()
The Channel.fromList channel factory creates a channel emitting the elements provided by a list object specified as an argument:
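A minimal sketch, assuming a two-element list of strings:

```groovy
workflow {
    list = ['hello', 'world']   // assumed example data

    // Each list element becomes a separate item in the channel
    Channel
        .fromList(list)
        .view()

    // Prints:
    // hello
    // world
}
```

This differs from Channel.value(list), which would emit the whole list as a single item.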
5.2.4 fromPath()
The fromPath channel factory creates a queue channel emitting one or more files matching the specified glob pattern.
This example creates a channel and emits as many items as there are files with a csv extension in the ./data/meta folder. Each element is a file object implementing the Path interface.
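The call described above can be sketched as follows, assuming the script is run from a directory that contains data/meta:

```groovy
workflow {
    // One item is emitted per file matching the glob pattern
    Channel
        .fromPath('./data/meta/*.csv')
        .view()
}
```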
Tip
Two asterisks (**) work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.
Some channel factories also have options to help you control their behaviour. For example, the fromPath channel factory has the following options:
Name | Description |
---|---|
glob | When true interprets characters * , ? , [] and {} as glob wildcards, otherwise handles them as normal characters (default: true ) |
type | Type of path returned, either file , dir or any (default: file ) |
hidden | When true includes hidden files in the resulting paths (default: false ) |
maxDepth | Maximum number of directory levels to visit (default: no limit ) |
followLinks | When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true ) |
relative | When true return paths are relative to the top-most common directory (default: false ) |
checkIfExists | When true throws an exception when the specified path does not exist in the file system (default: false ) |
Learn more about the glob patterns syntax at this link.
Exercise
Use the Channel.fromPath channel factory to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory. Include any hidden files and print the file names with the view operator.
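One possible approach, offered as a sketch rather than the reference solution: combine the ** wildcard with the hidden option from the table above.

```groovy
workflow {
    Channel
        // ** crosses directory boundaries, so subdirectories are searched too;
        // hidden: true also includes files whose names start with a dot
        .fromPath('./data/ggal/**.fq', hidden: true)
        // print only the file name rather than the full path
        .view { f -> f.name }
}
```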
5.2.5 fromFilePairs()
The fromFilePairs channel factory creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples, in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).
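A minimal sketch of such a call. The glob pattern is an assumption inferred from file names such as liver_1.fq and liver_2.fq in the data/ggal directory.

```groovy
workflow {
    // * captures the grouping key (e.g. liver); {1,2} matches the two mates of a pair
    Channel
        .fromFilePairs('./data/ggal/*_{1,2}.fq')
        .view()
}
```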
It will produce an output similar to the following:
[liver, [/workspace/gitpod/nf-training/data/ggal/liver_1.fq, /workspace/gitpod/nf-training/data/ggal/liver_2.fq]]
[gut, [/workspace/gitpod/nf-training/data/ggal/gut_1.fq, /workspace/gitpod/nf-training/data/ggal/gut_2.fq]]
[lung, [/workspace/gitpod/nf-training/data/ggal/lung_1.fq, /workspace/gitpod/nf-training/data/ggal/lung_2.fq]]
Warning
The glob pattern must contain at least one asterisk wildcard character (*).
The fromFilePairs channel factory also has options to help you control its behaviour:
Name | Description |
---|---|
type | Type of paths returned, either file , dir or any (default: file ) |
hidden | When true includes hidden files in the resulting paths (default: false ) |
maxDepth | Maximum number of directory levels to visit (default: no limit ) |
followLinks | When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true ) |
size | Defines the number of files each emitted item is expected to hold (default: 2 ). Set to -1 for any number of files |
flat | When true the matching files are produced as sole elements in the emitted tuples (default: false ) |
checkIfExists | When true throws an exception when the specified path does not exist in the file system (default: false ) |
Exercise
Use the fromFilePairs channel factory to create a channel emitting all pairs of fastq reads in the data/ggal/ directory. Execute this script twice, once with the option flat: true and once with flat: false. What is the difference?
Solution
Run the script once with the flat option set to true and once with it set to false. Check the square brackets around the file names to see the difference made by flat.
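One way to try both settings without editing the script is sketched below. The params.flat parameter and the glob pattern are assumptions introduced for this illustration.

```groovy
// Hypothetical parameter; override with: nextflow run snippet.nf --flat false
params.flat = true

workflow {
    Channel
        .fromFilePairs('./data/ggal/*_{1,2}.fq', flat: params.flat)
        .view()

    // flat: true  emits tuples shaped like [key, file1, file2]
    // flat: false emits tuples shaped like [key, [file1, file2]]
}
```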
5.2.6 fromSRA()
The Channel.fromSRA channel factory makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.
The query can be project ID(s) or accession number(s) supported by the NCBI ESearch API.
Info
This function now requires an API key you can only get by logging into your NCBI account.
Instructions for NCBI login and key acquisition
- Go to: https://www.ncbi.nlm.nih.gov/
- Click the top right "Log in" button to sign into NCBI. Follow their instructions.
- Once logged into your account, click the button at the top right, usually showing your ID.
- Go to Account settings
- Scroll down to the API Key Management section.
- Click on "Create an API Key".
- The page will refresh and the key will be displayed where the button was. Copy your key.
The following snippet will print the contents of an NCBI project ID:
Replace <Your API key here> with your API key.
This should print:
[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]]
[SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]]
[SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]]
[SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]]
// (remaining omitted)
Multiple accession IDs can be specified using a list object:
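A sketch of such a query, using the three accession IDs that appear in the output listed in this section. The parameter name ncbi_api_key is an assumption; the placeholder must be replaced with a real key before running.

```groovy
// Hypothetical parameter name holding the NCBI API key
params.ncbi_api_key = '<Your API key here>'

workflow {
    ids = ['ERR908507', 'ERR908506', 'ERR908505']

    // Each emitted item is a tuple of the accession ID and its FASTQ file(s)
    Channel
        .fromSRA(ids, apiKey: params.ncbi_api_key)
        .view()
}
```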
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]]
[ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]]
[ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Info
Read pairs are implicitly managed and are returned as a list of files.
It’s straightforward to use this channel as an input using the usual Nextflow syntax.
The code below creates a channel containing two samples from a public SRA study and runs FASTQC on the resulting files. See:
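The original code is not reproduced here. The following is a sketch under stated assumptions: the parameter names, the choice of two accessions from the list above, and the FASTQC process body are all illustrative, and the fastqc tool must be available in the execution environment.

```groovy
// Hypothetical parameter names; replace the placeholder with a real API key
params.ncbi_api_key = '<Your API key here>'
params.accession = ['ERR908507', 'ERR908506']

process FASTQC {
    input:
    // fromSRA emits tuples of (accession ID, FASTQ file or list of files)
    tuple val(sample_id), path(reads_file)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
    """
}

workflow {
    reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)
    FASTQC(reads)
}
```

The tuple input declaration mirrors the shape of the items emitted by fromSRA, so no reshaping of the channel is needed before calling the process.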
Summary
In this step you have learned:
- How to use common channel factories
- How to use the fromSRA channel factory to query the NCBI SRA archive