Splitting and Grouping¶
Nextflow provides powerful tools for working with data flexibly. A key capability is splitting data into different streams and then grouping related items back together. This is especially valuable in bioinformatics workflows where you need to process different types of samples separately before combining results for analysis.
Think of it like sorting mail: you separate letters by destination, process each pile differently, then recombine items going to the same person. Nextflow uses special operators to accomplish this with scientific data. This approach is also commonly known as the scatter/gather pattern in distributed computing and bioinformatics workflows.
Nextflow's channel system is at the heart of this flexibility. Channels connect different parts of your workflow, allowing data to flow through your analysis. You can create multiple channels from a single data source, process each channel differently, and then merge channels back together when needed. This approach lets you design workflows that naturally mirror the branching and converging paths of complex bioinformatics analyses.
In this side quest, you'll learn to split and group data using Nextflow's channel operators. We'll start with a CSV file containing sample information and associated data files, then manipulate and reorganize this data. By the end, you'll be able to separate and combine data streams effectively, creating more efficient and understandable workflows.
You will:
- Read data from files using splitCsv
- Filter and transform data with filter and map
- Combine related data using join and groupTuple
- Create data combinations with combine for parallel processing
- Optimize data structure using subMap and deduplication strategies
- Build reusable functions with named closures to help you manipulate channel structures
These skills will help you build workflows that can handle multiple input files and different types of data efficiently, while maintaining clean, maintainable code structure.
0. Warmup¶
0.1. Prerequisites¶
Before taking on this side quest you should:
- Complete the Hello Nextflow tutorial
- Understand basic Nextflow concepts (processes, channels, operators, working with files, meta data)
You may also find it useful to review Working with metadata before starting here, as it covers in detail how to work with metadata associated with files in your workflows.
0.2. Starting Point¶
Let's move into the project directory.
You'll find a data directory containing a samplesheet and a main workflow file.
samplesheet.csv contains information about samples from different patients, including the patient ID, sample repeat number, type (normal or tumor), and paths to BAM files (which don't actually exist, but we will pretend they do).
id,repeat,type,bam
patientA,1,normal,patientA_rep1_normal.bam
patientA,1,tumor,patientA_rep1_tumor.bam
patientA,2,normal,patientA_rep2_normal.bam
patientA,2,tumor,patientA_rep2_tumor.bam
patientB,1,normal,patientB_rep1_normal.bam
patientB,1,tumor,patientB_rep1_tumor.bam
patientC,1,normal,patientC_rep1_normal.bam
patientC,1,tumor,patientC_rep1_tumor.bam
Note there are 8 samples in total from 3 patients (patientA has 2 repeats), 4 normal and 4 tumor.
We're going to read in samplesheet.csv, then group and split the samples based on their data.
1. Read in sample data¶
1.1. Read in sample data with splitCsv¶
Let's start by reading in the sample data with splitCsv. In main.nf, you'll see that we've already started the workflow.
Note
Throughout this tutorial, we'll use the ch_ prefix for all channel variables to clearly indicate they are Nextflow channels.
We can use the splitCsv operator to split the data into a channel of maps (key/value pairs), where each map represents a row from the CSV file.
Note
We'll encounter two different concepts called map in this training:

- Data structure: The Groovy map (equivalent to dictionaries/hashes in other languages) that stores key-value pairs
- Channel operator: The .map() operator that transforms items in a channel
We'll clarify which one we mean in context, but this distinction is important to understand when working with Nextflow.
Apply these changes to main.nf:
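Here's a minimal sketch of what that change might look like (the samplesheet path is an assumption based on the project layout described above):

workflow {
    // Read each CSV row into the channel as a map, keyed by the header row
    ch_samplesheet = channel.fromPath('./data/samplesheet.csv')
        .splitCsv(header: true)

    ch_samplesheet.view()
}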
splitCsv takes the file passed to it from the channel factory, and the header: true option tells Nextflow to use the first row of the CSV file as the header row, whose values will be used as keys. We're using the view operator, which you should have encountered before, to examine the output this gives us.
Run the pipeline:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [deadly_mercator] DSL2 - revision: bd6b0224e9
[id:patientA, repeat:1, type:normal, bam:patientA_rep1_normal.bam]
[id:patientA, repeat:1, type:tumor, bam:patientA_rep1_tumor.bam]
[id:patientA, repeat:2, type:normal, bam:patientA_rep2_normal.bam]
[id:patientA, repeat:2, type:tumor, bam:patientA_rep2_tumor.bam]
[id:patientB, repeat:1, type:normal, bam:patientB_rep1_normal.bam]
[id:patientB, repeat:1, type:tumor, bam:patientB_rep1_tumor.bam]
[id:patientC, repeat:1, type:normal, bam:patientC_rep1_normal.bam]
[id:patientC, repeat:1, type:tumor, bam:patientC_rep1_tumor.bam]
Each row from the CSV file has become a single item in the channel, with each item being a map with keys matching the header row.
You should be able to see that each map contains:
- id: The patient identifier (patientA, patientB, patientC)
- repeat: The replicate number (1 or 2)
- type: The sample type (normal or tumor)
- bam: Path to the BAM file
This format makes it easy to access specific fields from each sample via their keys in the map. We can access the BAM file path with the bam key, but also any of the 'metadata' fields that describe the file via id, repeat, and type.
Note
For a more extensive introduction on working with metadata, you can work through the training Working with metadata
Let's separate the metadata from the files. We can do this with a map operation:
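A sketch of that map operation (the channel name ch_samples is illustrative):

ch_samples = ch_samplesheet.map { row ->
    // Split each row into a metadata map and the associated file path
    [ [id: row.id, repeat: row.repeat, type: row.type], row.bam ]
}
ch_samples.view()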
Apply that change and re-run the pipeline:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [deadly_mercator] DSL2 - revision: bd6b0224e9
[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
[[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
We separated the sample metadata from the file into its own map. We now have a channel of maps and files, each element representing a row from the input samplesheet, which we will use throughout this training to split and group our workload.
Takeaway¶
In this section, you've learned:
- Reading in a data sheet: How to read in a data sheet with splitCsv
- Combining patient-specific information: Using Groovy maps to hold information about a patient
2. Filter and transform data¶
2.1. Filter data with filter¶
We can use the filter operator to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the type field. Let's insert this before the view operator:
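A sketch of that change:

ch_samples
    // Keep only elements whose metadata says this is a normal sample
    .filter { meta, bamfile -> meta.type == 'normal' }
    .view()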
Run the workflow again to see the filtered result:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [admiring_brown] DSL2 - revision: 194d61704d
[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
We have successfully filtered the data to only include normal samples. Let's recap how this works.
The filter operator takes a closure that is applied to each element in the channel. If the closure returns true, the element is included; if it returns false, the element is excluded.

In our case, we want to keep only samples where meta.type == 'normal'. The closure uses the tuple meta, file to refer to each sample, accesses the sample type with meta.type, and checks if it equals 'normal'.
This is accomplished with the single filter closure shown above.
2.2. Create separate filtered channels¶
Currently we're applying the filter to the channel created directly from the CSV, but we want to filter this in more ways than one, so let's rewrite the logic to create a separate filtered channel for normal samples:
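One way to write that, following the ch_ naming convention from earlier:

ch_normal_samples = ch_samples
    .filter { meta, bamfile -> meta.type == 'normal' }

ch_normal_samples.view()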
Once again, run the pipeline to see the results:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [trusting_poisson] DSL2 - revision: 639186ee74
[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
We've successfully filtered the data and created a separate channel for normal samples. Let's create a filtered channel for the tumor samples as well:
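A sketch of the tumor channel, plus labelled views on both channels so we can tell them apart in the output:

ch_tumor_samples = ch_samples
    .filter { meta, bamfile -> meta.type == 'tumor' }

ch_normal_samples.view { 'Normal sample: ' + it }
ch_tumor_samples.view { 'Tumor sample: ' + it }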
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [maniac_boltzmann] DSL2 - revision: 3636b6576b
Tumor sample: [[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Normal sample: [[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
Tumor sample: [[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
We've separated out the normal and tumor samples into two different channels, and used a closure supplied to view() to label them differently in the output: ch_tumor_samples.view { 'Tumor sample: ' + it }.
Takeaway¶
In this section, you've learned:
- Filtering data: How to filter data with filter
- Splitting data: How to split data into different channels based on a condition
- Viewing data: How to use view to print the data and label output from different channels
We've now separated out the normal and tumor samples into two different channels. Next, we'll join the normal and tumor samples on the id field.
3. Joining channels by identifiers¶
In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together, making sure to match the samples based on their id field.
Nextflow includes many methods for combining channels, but in this case the most appropriate operator is join. If you are familiar with SQL, it acts like the JOIN operation, where we specify the key to join on and the type of join to perform.
3.1. Use map and join to combine based on patient ID¶
If we check the join documentation, we can see that by default it joins two channels based on the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structure and see how we need to modify it to join on the id field.
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [maniac_boltzmann] DSL2 - revision: 3636b6576b
Tumor sample: [[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Normal sample: [[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
Tumor sample: [[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
We can see that the id field is the first element in each meta map. For join to work, we should isolate the id field in each tuple. After that, we can simply use the join operator to combine the two channels.
To isolate the id field, we can use the map operator to create a new tuple with the id field as the first element:
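A sketch of that map step applied to both channels (the _keyed channel names are illustrative):

ch_normal_keyed = ch_normal_samples
    .map { meta, bamfile -> [ meta.id, meta, bamfile ] }
ch_tumor_keyed = ch_tumor_samples
    .map { meta, bamfile -> [ meta.id, meta, bamfile ] }

ch_normal_keyed.view { 'Normal sample: ' + it }
ch_tumor_keyed.view { 'Tumor sample: ' + it }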
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [mad_lagrange] DSL2 - revision: 9940b3f23d
Tumor sample: [patientA, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [patientA, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [patientA, [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [patientA, [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Tumor sample: [patientB, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [patientC, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
Normal sample: [patientB, [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [patientC, [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
It might be subtle, but you should be able to see that the first element in each tuple is the id field. Now we can use the join operator to combine the two channels based on the id field.
Once again, we will use view to print the joined outputs:
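A sketch of the join:

ch_joined = ch_normal_keyed.join(ch_tumor_keyed)
ch_joined.view()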
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [soggy_wiles] DSL2 - revision: 3bc1979889
[patientA, [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[patientA, [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[patientB, [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[patientC, [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the id field. Each tuple now has the format:
- id: The sample ID
- normal_meta_map: The normal sample metadata (id, replicate and type)
- normal_sample_file: The normal sample BAM file
- tumor_meta_map: The tumor sample metadata (id, replicate and type)
- tumor_sample_file: The tumor sample BAM file
Warning
The join operator will discard any unmatched tuples. In this example, we made sure all samples were matched for tumor and normal, but if this is not true you must use the option remainder: true to keep the unmatched tuples. Check the documentation for more details.
Takeaway¶
In this section, you've learned:
- How to use map to isolate a field in a tuple
- How to use join to combine tuples based on the first field
With this knowledge, we can successfully combine channels based on a shared field. Next, we'll consider the situation where you want to join on multiple fields.
3.2. Join on multiple fields¶
We have 2 replicates for patientA, but only 1 for patientB and patientC. In this case we were able to join them effectively by using the id field, but what would happen if they were out of sync? We could mix up the normal and tumor samples from different replicates!

To avoid this, we can join on multiple fields. There are actually multiple ways to achieve this, but we are going to focus on creating a new joining key which includes both the sample id and repeat number.
Let's start by creating a new joining key. We can do this in the same way as before, by using the map operator to create a new tuple with the id and repeat fields as the first element:
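A sketch, using a two-element list as the joining key:

ch_normal_keyed = ch_normal_samples
    .map { meta, bamfile -> [ [meta.id, meta.repeat], meta, bamfile ] }
ch_tumor_keyed = ch_tumor_samples
    .map { meta, bamfile -> [ [meta.id, meta.repeat], meta, bamfile ] }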
Now we should see the join occurring using both the id and repeat fields. Run the workflow:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [prickly_wing] DSL2 - revision: 3bebf22dee
[[patientA, 1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[patientA, 2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[patientB, 1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[patientC, 1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
Note how we have a tuple of two elements (the id and repeat fields) as the first element of each joined result. This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions.
If you want to explore more ways to join on different keys, check out the join operator documentation for additional options and examples.
3.3. Use subMap to create a new joining key¶
The previous approach loses the field names from our joining key: the id and repeat fields become just a list of values. To retain the field names for later access, we can use the subMap method.
The subMap method extracts only the specified key-value pairs from a map. Here we'll extract just the id and repeat fields to create our joining key:
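A sketch using subMap on the meta map:

ch_normal_keyed = ch_normal_samples
    .map { meta, bamfile -> [ meta.subMap(['id', 'repeat']), meta, bamfile ] }
ch_tumor_keyed = ch_tumor_samples
    .map { meta, bamfile -> [ meta.subMap(['id', 'repeat']), meta, bamfile ] }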
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [reverent_wing] DSL2 - revision: 847016c3b7
[[id:patientA, repeat:1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
Now we have a new joining key that not only includes the id and repeat fields but also retains the field names, so we can access them later by name, e.g. meta.id and meta.repeat.
3.4. Use a named closure in map¶
To avoid duplication and reduce errors, we can use a named closure. A named closure allows us to create a reusable function that we can call in multiple places.
To do so, first we define the closure as a new variable:
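For example (the closure name matches the one used in the Key Concepts summary at the end of this page):

// Reusable transformation: extract id and repeat as the joining key,
// and convert the BAM path string into a Path object with file()
getSampleIdAndReplicate = { meta, bamfile ->
    [ meta.subMap(['id', 'repeat']), meta, file(bamfile) ]
}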
We've defined the map transformation as a named variable that we can reuse. Note that we also convert the file path to a Path object using file() so that any process receiving this channel can handle the file correctly (for more information see Working with files).
Let's implement the closure in our workflow:
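A sketch of the simplified workflow section:

ch_normal_keyed = ch_normal_samples.map(getSampleIdAndReplicate)
ch_tumor_keyed = ch_tumor_samples.map(getSampleIdAndReplicate)

ch_joined = ch_normal_keyed.join(ch_tumor_keyed)
ch_joined.view()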
Note
The map operator has switched from using { } to using ( ) to pass the closure as an argument. This is because the map operator expects a closure as an argument, and { } is used to define an anonymous closure. When calling a named closure, use the ( ) syntax.
Just run the workflow once more to check everything is still working:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [angry_meninsky] DSL2 - revision: 2edc226b1d
[[id:patientA, repeat:1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
Using a named closure allows us to reuse the same transformation in multiple places, reducing the risk of errors and making the code more readable and maintainable.
3.5. Reduce duplication of data¶
We have a lot of duplicated data in our workflow. Each item in the joined samples repeats the id and repeat fields. Since this information is already available in the grouping key, we can avoid this redundancy. As a reminder, our current data structure looks like this:
[
    [
        "id": "patientC",
        "repeat": "1",
    ],
    [
        "id": "patientC",
        "repeat": "1",
        "type": "normal",
    ],
    "patientC_rep1_normal.bam",
    [
        "id": "patientC",
        "repeat": "1",
        "type": "tumor",
    ],
    "patientC_rep1_tumor.bam"
]
Since the id and repeat fields are available in the grouping key, let's remove them from the rest of each channel item to avoid duplication. We can do this by using the subMap method to create a new map with only the type field. This approach allows us to maintain all necessary information while eliminating redundancy in our data structure:
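An updated version of the named closure might look like this:

getSampleIdAndReplicate = { meta, bamfile ->
    // id and repeat live in the grouping key; only type stays with the file
    [ meta.subMap(['id', 'repeat']), meta.subMap(['type']), file(bamfile) ]
}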
Now the closure returns a tuple where the first element contains the id and repeat fields, and the second element contains only the type field. This eliminates redundancy by storing the id and repeat information once in the grouping key, while maintaining all necessary information.
Run the workflow to see what this looks like:
[[id:patientA, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep2_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientB_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientC_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientC_rep1_tumor.bam]
We can see we only state the id and repeat fields once, in the grouping key, and we have the type field in the sample data. We haven't lost any information, but we've made our channel contents more succinct.
3.6. Remove redundant information¶
We removed duplicated information above, but we still have some other redundant information in our channels.
In the beginning, we separated the normal and tumor samples using filter, then joined them based on the id and repeat keys. The join operator preserves the order in which tuples are merged, so in our case, with normal samples on the left side and tumor samples on the right, the resulting channel maintains this structure: id, <normal elements>, <tumor elements>.

Since we know the position of each element in our channel, we can simplify the structure further by dropping the [type:normal] and [type:tumor] metadata:
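One way to express this, relying on the fixed element positions after the join (normal first, tumor second):

ch_joined = ch_normal_keyed
    .join(ch_tumor_keyed)
    .map { key, normal_meta, normal_bam, tumor_meta, tumor_bam ->
        // Drop the single-entry type maps; position now encodes the sample type
        [ key, normal_bam, tumor_bam ]
    }
ch_joined.view()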
Run again to see the result:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [confident_leavitt] DSL2 - revision: a2303895bd
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
Takeaway¶
In this section, you've learned:
- Manipulating tuples: How to use map to isolate a field in a tuple
- Joining tuples: How to use join to combine tuples based on the first field
- Creating joining keys: How to use subMap to create a new joining key
- Named closures: How to use a named closure in map
- Multiple-field joining: How to join on multiple fields for more precise matching
- Data structure optimization: How to streamline channel structure by removing redundant information
You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then print the results.
This is a common pattern in bioinformatics workflows where you need to match up samples or other types of data after processing them independently, so it is a useful skill. Next, we will look at repeating a sample multiple times.
4. Spread patients over intervals¶
A key pattern in bioinformatics workflows is distributing analysis across genomic regions. For instance, variant calling can be parallelized by dividing the genome into intervals (like chromosomes or smaller regions). This parallelization strategy significantly improves pipeline efficiency by distributing computational load across multiple cores or nodes, reducing overall execution time.
In the following section, we'll demonstrate how to distribute our sample data across multiple genomic intervals. We'll pair each sample with every interval, allowing parallel processing of different genomic regions. This will multiply our dataset size by the number of intervals, creating multiple independent analysis units that can be brought back together later.
4.1. Spread samples over intervals using combine¶
Let's start by creating a channel of intervals. To keep life simple, we will just use 3 manually defined intervals. In a real workflow, you could read these in from a file input or even create a channel with lots of interval files:
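For example:

// Three genomic intervals, defined in-line for simplicity
ch_intervals = channel.of('chr1', 'chr2', 'chr3')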
Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve this by using the combine operator. This will take every item from channel 1 and repeat it for each item in channel 2. Let's add a combine operator to our workflow:
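A sketch of that step:

ch_combined = ch_joined.combine(ch_intervals)
ch_combined.view()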
Now let's run it and see what happens:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [mighty_tesla] DSL2 - revision: ae013ab70b
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr1]
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr2]
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr3]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr1]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr2]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr3]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr1]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr2]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr3]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr1]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr2]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr3]
Success! We have repeated every sample for every interval in our 3-interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up.
4.2. Organise the channel¶
We can use the map operator to tidy and refactor our sample data so it's easier to understand. Let's move the interval string into the joining map in the first element.
Let's break down what this map operation does step by step.
First, we use named parameters to make the code more readable. By using the names grouping_key, normal, tumor and interval, we can refer to the elements in the tuple by name instead of by index:
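The closure signature might look like this (the parameter names are purely for readability):

.map { grouping_key, normal, tumor, interval ->
    // closure body, shown below
}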
Next, we combine the grouping_key with the interval field. The grouping_key is a map containing the id and repeat fields. We create a new map with the interval and merge them using Groovy's map addition (+):
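For example:

grouping_key + [interval: interval]   // e.g. [id:patientA, repeat:1] + [interval:chr1]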
Finally, we return this as a tuple with three elements: the combined metadata map, the normal sample file, and the tumor sample file:
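Putting it all together, the whole operation might look like:

ch_combined = ch_joined
    .combine(ch_intervals)
    .map { grouping_key, normal, tumor, interval ->
        // Merge the interval into the metadata map and return a 3-element tuple
        [ grouping_key + [interval: interval], normal, tumor ]
    }
ch_combined.view()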
Let's run it again and check the channel contents:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [sad_hawking] DSL2 - revision: 1f6f6250cd
[[id:patientA, repeat:1, interval:chr1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:1, interval:chr2], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:1, interval:chr3], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:2, interval:chr1], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, repeat:2, interval:chr2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, repeat:2, interval:chr3], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, repeat:1, interval:chr1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, repeat:1, interval:chr2], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, repeat:1, interval:chr3], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr2], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr3], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
Using map to coerce your data into the correct structure can be tricky, but it's crucial for effective data manipulation.
We now have every sample repeated across all genomic intervals, creating multiple independent analysis units that can be processed in parallel. But what if we want to bring related samples back together? In the next section, we'll learn how to group samples that share common attributes.
Takeaway¶
In this section, you've learned:
- Spreading samples over intervals: How to use combine to repeat samples over intervals
- Creating Cartesian products: How to generate all combinations of samples and intervals
- Organizing channel structure: How to use map to restructure data for better readability
- Parallel processing preparation: How to set up data for distributed analysis
5. Aggregating samples using groupTuple¶
In the previous sections, we learned how to split data from an input file and filter by specific fields (in our case normal and tumor samples). But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end.
Nextflow includes built-in methods to do this; the main one we will look at is groupTuple.
Let's start by grouping all of our samples that have the same id and interval fields. This would be typical of an analysis where we wanted to group technical replicates but keep meaningfully different samples separated.
To do this, we should separate out our grouping variables so we can use them in isolation.
The first step is similar to what we did in the previous section. We must isolate our grouping variable as the first element of the tuple. Remember, our first element is currently a map of the id, repeat and interval fields.
We can reuse the subMap method from before to isolate the id and interval fields from the map. Like before, we will use the map operator to apply the subMap method to the first element of the tuple for each sample:
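A sketch of that step (the channel name is illustrative):

ch_grouped_keys = ch_combined.map { meta, normal_bam, tumor_bam ->
    // Keep only id and interval in the grouping key; repeat is dropped
    [ meta.subMap(['id', 'interval']), normal_bam, tumor_bam ]
}
ch_grouped_keys.view()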
Let's run it again and check the channel contents:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [hopeful_brenner] DSL2 - revision: 7f4f7fea76
[[id:patientA, interval:chr1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr2], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr3], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr1], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, interval:chr2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, interval:chr3], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, interval:chr1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, interval:chr2], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, interval:chr3], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, interval:chr1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, interval:chr2], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, interval:chr3], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
We can see that we have successfully isolated the id and interval fields, but not grouped the samples yet.
Note
We are discarding the repeat field here. This is because we don't need it for further downstream processing. After completing this tutorial, see if you can include it without affecting the later grouping!
Let's now group the samples by this new grouping element, using the groupTuple operator:
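A sketch:

ch_grouped = ch_grouped_keys.groupTuple()
ch_grouped.view()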
That's all there is to it! We just added a single line of code. Let's see what happens when we run it:
N E X T F L O W ~ version 25.04.3
Launching `main.nf` [friendly_jang] DSL2 - revision: a1bee1c55d
[[id:patientA, interval:chr1], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientA, interval:chr2], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientA, interval:chr3], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientB, interval:chr1], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientB, interval:chr2], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientB, interval:chr3], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientC, interval:chr1], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]
[[id:patientC, interval:chr2], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]
[[id:patientC, interval:chr3], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]
Note our data has changed structure: within each channel element, the files are now contained in lists like [patientA_rep1_normal.bam, patientA_rep2_normal.bam]. This is because when we use groupTuple, Nextflow collects the individual files for each member of a group into a single list. This is important to remember when handling the data downstream.
Note
transpose is the opposite of groupTuple: it unpacks the items in a channel and flattens them. Try adding transpose to undo the grouping we performed above!
Takeaway¶
In this section, you've learned:
- Grouping related samples: How to use groupTuple to aggregate samples by common attributes
- Isolating grouping keys: How to use subMap to extract specific fields for grouping
- Handling grouped data structures: How to work with the nested structure created by groupTuple
- Technical replicate handling: How to group samples that share the same experimental conditions
Summary¶
In this side quest, you've learned how to split and group data using channels. By modifying the data as it flows through the pipeline, you can construct a pipeline that handles as many items as possible with no loops or while statements. It gracefully scales to large numbers of items. Here's what we achieved:
- Read in a samplesheet with splitCsv: We read in a CSV file with sample data and viewed the contents.
- Use filter (and/or map) to manipulate into 2 separate channels: We used filter to split the data into two channels based on the type field.
- Join on id and repeat: We used join to join the two channels on the id and repeat fields.
- Combine by intervals: We used combine to create Cartesian products of samples with genomic intervals.
- Group by id and interval: We used groupTuple to group samples by the id and interval fields, aggregating technical replicates.
This approach offers several advantages over writing a pipeline as more standard code, such as using for and while loops:
- We can scale to as many or as few inputs as we want with no additional code
- We focus on handling the flow of data through the pipeline, instead of iteration
- We can be as complex or simple as required
- The pipeline becomes more declarative, focusing on what should happen rather than how it should happen
- Nextflow will optimize execution for us by running independent operations in parallel
By mastering these channel operations, you can build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming. This declarative approach allows Nextflow to optimize execution and parallelize independent operations automatically.
Key Concepts¶
- Reading data sheets
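// Read a CSV samplesheet into a channel of maps, header row as keys (sketch)
channel.fromPath('samplesheet.csv').splitCsv(header: true)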
- Filtering
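// Keep only items matching a condition (illustrative channel name)
ch_samples.filter { meta, file -> meta.type == 'normal' }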
- Joining Channels
// Join two channels by key (first element of tuple)
tumor_ch.join(normal_ch)
// Extract joining key and join by this value
tumor_ch.map { meta, file -> [meta.id, meta, file] }
.join(
normal_ch.map { meta, file -> [meta.id, meta, file] }
)
// Join on multiple fields using subMap
tumor_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
.join(
normal_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
)
- Grouping Data
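// Collect tuples that share the same key (first element) into one grouped tuple
ch_samples.groupTuple()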
- Combining Channels
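// Cartesian product: pair every item in one channel with every item in another
ch_samples.combine(ch_intervals)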
- Data Structure Optimization
// Extract specific fields using subMap
meta.subMap(['id', 'repeat'])
// Named closures for reusable transformations
getSampleIdAndReplicate = { meta, file -> [meta.subMap(['id', 'repeat']), file] }
channel.map(getSampleIdAndReplicate)