
Splitting and Grouping

Nextflow provides powerful tools for working with data flexibly. A key capability is splitting data into different streams and then grouping related items back together. This is especially valuable in bioinformatics workflows where you need to process different types of samples separately before combining results for analysis.

Think of it like sorting mail: you separate letters by destination, process each pile differently, then recombine items going to the same person. Nextflow uses special operators to accomplish this with scientific data. This approach is also commonly known as the scatter/gather pattern in distributed computing and bioinformatics workflows.

Nextflow's channel system is at the heart of this flexibility. Channels connect different parts of your workflow, allowing data to flow through your analysis. You can create multiple channels from a single data source, process each channel differently, and then merge channels back together when needed. This approach lets you design workflows that naturally mirror the branching and converging paths of complex bioinformatics analyses.

In this side quest, you'll learn to split and group data using Nextflow's channel operators. We'll start with a CSV file containing sample information and associated data files, then manipulate and reorganize this data. By the end, you'll be able to separate and combine data streams effectively, creating more efficient and understandable workflows.

You will:

  • Read data from files using splitCsv
  • Filter and transform data with filter and map
  • Combine related data using join and groupTuple
  • Create data combinations with combine for parallel processing
  • Optimize data structure using subMap and deduplication strategies
  • Build reusable functions with named closures to help you manipulate channel structures

These skills will help you build workflows that can handle multiple input files and different types of data efficiently, while maintaining clean, maintainable code structure.


0. Warmup

0.1. Prerequisites

Before taking on this side quest you should:

  • Complete the Hello Nextflow tutorial
  • Understand basic Nextflow concepts (processes, channels, operators, working with files, meta data)

You may also find it useful to review Working with metadata before starting here, as it covers in detail how to work with metadata associated with files in your workflows.

0.2. Starting Point

Let's move into the project directory.

cd side-quests/splitting_and_grouping

You'll find a data directory containing a samplesheet and a main workflow file.

Directory contents
> tree
.
├── data
│   └── samplesheet.csv
└── main.nf

samplesheet.csv contains information about samples from different patients, including the patient ID, sample repeat number, type (normal or tumor), and paths to BAM files (which don't actually exist, but we will pretend they do).

samplesheet.csv
id,repeat,type,bam
patientA,1,normal,patientA_rep1_normal.bam
patientA,1,tumor,patientA_rep1_tumor.bam
patientA,2,normal,patientA_rep2_normal.bam
patientA,2,tumor,patientA_rep2_tumor.bam
patientB,1,normal,patientB_rep1_normal.bam
patientB,1,tumor,patientB_rep1_tumor.bam
patientC,1,normal,patientC_rep1_normal.bam
patientC,1,tumor,patientC_rep1_tumor.bam

Note there are 8 samples in total from 3 patients (patientA has 2 repeats), 4 normal and 4 tumor.

We're going to read in samplesheet.csv, then group and split the samples based on their data.


1. Read in sample data

1.1. Read in sample data with splitCsv

Let's start by reading in the sample data with splitCsv. In the main.nf, you'll see that we've already started the workflow.

main.nf
workflow {
    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
}

Note

Throughout this tutorial, we'll use the ch_ prefix for all channel variables to clearly indicate they are Nextflow channels.

We can use the splitCsv operator to split the data into a channel of maps (key/value pairs), where each map represents a row from the CSV file.

Note

We'll encounter two different concepts called map in this training:

  • Data structure: The Groovy map (equivalent to dictionaries/hashes in other languages) that stores key-value pairs
  • Channel operator: The .map() operator that transforms items in a channel

We'll clarify which one we mean in context, but this distinction is important to understand when working with Nextflow.
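To make the distinction concrete, here is a minimal sketch with toy values (not part of main.nf), showing a Groovy map as a data structure and the .map() operator transforming channel items:

Toy example
// a Groovy map: a key-value data structure
def sample = [id: 'patientA', type: 'normal']
println sample.id   // prints: patientA

// the .map() operator: transforms each item in a channel
// (channel code must sit inside a workflow block)
Channel.of(1, 2, 3)
    .map { n -> n * 2 }
    .view()         // emits: 2, 4, 6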

Apply these changes to main.nf:

main.nf (after)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .view()

main.nf (before)
    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")

splitCsv takes the file emitted by the channel factory and parses it. The header: true option tells Nextflow to use the first row of the CSV file as the header, so its column names become the keys for the values in each row. We use the view operator, which you should have encountered before, to examine the output.

Run the pipeline:

Test the splitCsv operation
nextflow run main.nf
Read data with splitCsv
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [deadly_mercator] DSL2 - revision: bd6b0224e9

[id:patientA, repeat:1, type:normal, bam:patientA_rep1_normal.bam]
[id:patientA, repeat:1, type:tumor, bam:patientA_rep1_tumor.bam]
[id:patientA, repeat:2, type:normal, bam:patientA_rep2_normal.bam]
[id:patientA, repeat:2, type:tumor, bam:patientA_rep2_tumor.bam]
[id:patientB, repeat:1, type:normal, bam:patientB_rep1_normal.bam]
[id:patientB, repeat:1, type:tumor, bam:patientB_rep1_tumor.bam]
[id:patientC, repeat:1, type:normal, bam:patientC_rep1_normal.bam]
[id:patientC, repeat:1, type:tumor, bam:patientC_rep1_tumor.bam]

Each row from the CSV file has become a single item in the channel, with each item being a map with keys matching the header row.

You should be able to see that each map contains:

  • id: The patient identifier (patientA, patientB, patientC)
  • repeat: The replicate number (1 or 2)
  • type: The sample type (normal or tumor)
  • bam: Path to the BAM file

This format makes it easy to access specific fields from each sample via their keys in the map. We can access the BAM file path with the bam key, but also any of the 'metadata' fields that describe the file via id, repeat, type.
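For instance, you can pull out individual fields by key inside a closure. A minimal sketch, assuming the ch_samples channel from above (the message text is purely illustrative):

Toy example
ch_samples
    .map { row -> "Patient ${row.id} (${row.type}) uses file ${row.bam}" } // dot access; row['bam'] works too
    .view()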

Note

For a more extensive introduction on working with metadata, you can work through the training Working with metadata

Let's separate the metadata from the files. We can do this with a map operation:

main.nf (after)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
          [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
        .view()

main.nf (before)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .view()

Apply that change and re-run the pipeline:

Test the metadata separation
nextflow run main.nf
Sample data with separated metadata
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [deadly_mercator] DSL2 - revision: bd6b0224e9

[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
[[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

We've separated the sample metadata from the file into its own map. We now have a channel of [metadata, file] tuples, one per row of the input samplesheet, which we will use throughout this training to split and group our workload.

Takeaway

In this section, you've learned:

  • Reading in a data sheet: How to read in a samplesheet with splitCsv
  • Combining patient-specific information: Using Groovy maps to hold metadata about a patient

2. Filter and transform data

2.1. Filter data with filter

We can use the filter operator to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the type field. Let's insert this before the view operator.

main.nf (after)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
          [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
        .filter { meta, file -> meta.type == 'normal' }
        .view()

main.nf (before)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
          [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
        .view()

Run the workflow again to see the filtered result:

Test the filter operation
nextflow run main.nf
Filtered normal samples
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [admiring_brown] DSL2 - revision: 194d61704d

[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]

We have successfully filtered the data to only include normal samples. Let's recap how this works.

The filter operator takes a closure that is applied to each element in the channel. If the closure returns true, the element is included; if it returns false, the element is excluded.

In our case, we want to keep only samples where meta.type == 'normal'. The closure destructures each element into meta and file, accesses the sample type with meta.type, and checks whether it equals 'normal'.

This is accomplished with the single closure we introduced above:

main.nf
    .filter { meta, file -> meta.type == 'normal' }
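The same pattern works for any boolean condition. For example, a sketch of a compound filter (illustrative only, not used in this workflow):

Toy example
ch_samples
    .filter { meta, file -> meta.type == 'tumor' && meta.repeat == '1' } // splitCsv yields strings, hence '1'
    .view()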

2.2. Create separate filtered channels

Currently we're applying the filter to the channel created directly from the CSV, but we want to filter this in more ways than one, so let's re-write the logic to create a separate filtered channel for normal samples:

main.nf
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
            [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
    ch_normal_samples
        .view()
main.nf
2
3
4
5
6
7
8
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
          [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
        .filter { meta, file -> meta.type == 'normal' }
        .view()

Once again, run the pipeline to see the results:

Test separate channel creation
nextflow run main.nf
Filtered normal samples
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [trusting_poisson] DSL2 - revision: 639186ee74

[[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
[[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
[[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
[[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]

We've successfully filtered the data and created a separate channel for normal samples. Let's create a filtered channel for the tumor samples as well:

main.nf
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
    ch_normal_samples
        .view{'Normal sample: ' + it}
    ch_tumor_samples
        .view{'Tumor sample: ' + it}
main.nf
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
    ch_normal_samples
        .view()
Test filtering both sample types
nextflow run main.nf
Normal and tumor samples
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [maniac_boltzmann] DSL2 - revision: 3636b6576b

Tumor sample: [[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Normal sample: [[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
Tumor sample: [[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

We've separated out the normal and tumor samples into two different channels, and used a closure supplied to view() to label them differently in the output: ch_tumor_samples.view{'Tumor sample: ' + it}.

Takeaway

In this section, you've learned:

  • Filtering data: How to filter data with filter
  • Splitting data: How to split data into different channels based on a condition
  • Viewing data: How to use view to print the data and label output from different channels

We've now separated out the normal and tumor samples into two different channels. Next, we'll join the normal and tumor samples on the id field.


3. Joining channels by identifiers

In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together making sure to match the samples based on their id field.

Nextflow includes many methods for combining channels, but in this case the most appropriate operator is join. If you are familiar with SQL, it acts like the JOIN operation, where we specify the key to join on and the type of join to perform.
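As a minimal sketch of the default behaviour with toy values (not part of main.nf):

Toy example
ch_left  = Channel.of(['patientA', 'a_normal.bam'], ['patientB', 'b_normal.bam'])
ch_right = Channel.of(['patientA', 'a_tumor.bam'], ['patientB', 'b_tumor.bam'])
ch_left.join(ch_right).view()
// emits: [patientA, a_normal.bam, a_tumor.bam] and [patientB, b_normal.bam, b_tumor.bam]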

3.1. Use map and join to combine based on patient ID

If we check the join documentation, we can see that by default it joins two channels based on the first item in each tuple. If you don't have the console output still available, let's run the pipeline to check our data structure and see how we need to modify it to join on the id field.

Check current data structure
nextflow run main.nf
Normal and tumor samples
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [maniac_boltzmann] DSL2 - revision: 3636b6576b

Tumor sample: [[id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [[id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [[id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [[id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Normal sample: [[id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [[id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]
Tumor sample: [[id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [[id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

We can see that the id field is the first element in each meta map. For join to work, we should isolate the id field in each tuple. After that, we can simply use the join operator to combine the two channels.

To isolate the id field, we can use the map operator to create a new tuple with the id field as the first element.

main.nf (after)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_normal_samples
        .view{'Normal sample: ' + it}
    ch_tumor_samples
        .view{'Tumor sample: ' + it}

main.nf (before)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
    ch_normal_samples
        .view{'Normal sample: ' + it}
    ch_tumor_samples
        .view{'Tumor sample: ' + it}
Test the map transformation
nextflow run main.nf
Samples with ID as first element
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [mad_lagrange] DSL2 - revision: 9940b3f23d

Tumor sample: [patientA, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
Tumor sample: [patientA, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
Normal sample: [patientA, [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam]
Normal sample: [patientA, [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam]
Tumor sample: [patientB, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
Tumor sample: [patientC, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]
Normal sample: [patientB, [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam]
Normal sample: [patientC, [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam]

It might be subtle, but you should be able to see that the first element in each tuple is now the id value. We can now use the join operator to combine the two channels based on this field.

Once again, we will use view to print the joined outputs.

main.nf (after)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
    ch_joined_samples.view()

main.nf (before)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_normal_samples
        .view{'Normal sample: ' + it}
    ch_tumor_samples
        .view{'Tumor sample: ' + it}
Test the join operation
nextflow run main.nf
Joined normal and tumor samples
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [soggy_wiles] DSL2 - revision: 3bc1979889

[patientA, [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[patientA, [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[patientB, [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[patientC, [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the id field. Each tuple now has the format:

  • id: The patient ID used as the joining key
  • normal_meta_map: The normal sample metadata (id, replicate number and type)
  • normal_sample_file: The path to the normal sample BAM file
  • tumor_meta_map: The tumor sample metadata (id, replicate number and type)
  • tumor_sample_file: The path to the tumor sample BAM file

Warning

The join operator discards any unmatched tuples. In this example we made sure every sample had both a normal and a tumor match, but if that is not guaranteed you must use the option remainder: true to keep the unmatched tuples. Check the documentation for more details.
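A minimal sketch of remainder: true with toy values (illustrative, not part of main.nf):

Toy example
ch_a = Channel.of(['patientA', 'a_normal.bam'], ['patientD', 'd_normal.bam'])
ch_b = Channel.of(['patientA', 'a_tumor.bam'])
ch_a.join(ch_b, remainder: true).view()
// emits: [patientA, a_normal.bam, a_tumor.bam] and [patientD, d_normal.bam, null]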

Takeaway

In this section, you've learned:

  • How to use map to isolate a field in a tuple
  • How to use join to combine tuples based on the first field

With this knowledge, we can successfully combine channels based on a shared field. Next, we'll consider the situation where you want to join on multiple fields.

3.2. Join on multiple fields

We have 2 replicates for patientA, but only 1 for patientB and patientC. In this case we were able to join effectively using the id field alone, but what would happen if the channels were out of sync? We could mix up normal and tumor samples from different replicates!

To avoid this, we can join on multiple fields. There are actually multiple ways to achieve this but we are going to focus on creating a new joining key which includes both the sample id and replicate number.

Let's start by creating a new joining key. We can do this in the same way as before by using the map operator to create a new tuple with the id and repeat fields as the first element.

main.nf (after)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [[meta.id, meta.repeat], meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [[meta.id, meta.repeat], meta, file] }

main.nf (before)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.id, meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.id, meta, file] }

Now the join will occur using both the id and repeat fields. Run the workflow:

Test multi-field joining
nextflow run main.nf
Samples joined on multiple fields
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [prickly_wing] DSL2 - revision: 3bebf22dee

[[patientA, 1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[patientA, 2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[patientB, 1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[patientC, 1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

Note how we have a tuple of two elements (id and repeat fields) as the first element of each joined result. This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions.

If you want to explore more ways to join on different keys, check out the join operator documentation for additional options and examples.

3.3. Use subMap to create a new joining key

The previous approach loses the field names from our joining key - the id and repeat fields become just a list of values. To retain the field names for later access, we can use the subMap method.

The subMap method extracts only the specified key-value pairs from a map. Here we'll extract just the id and repeat fields to create our joining key.
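As a quick illustration of subMap on its own (toy values):

Toy example
def meta = [id: 'patientA', repeat: '1', type: 'normal']
println meta.subMap(['id', 'repeat']) // prints: [id:patientA, repeat:1]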

main.nf (after)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }

main.nf (before)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [[meta.id, meta.repeat], meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [[meta.id, meta.repeat], meta, file] }
Test subMap joining keys
nextflow run main.nf
Samples with subMap joining keys
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [reverent_wing] DSL2 - revision: 847016c3b7

[[id:patientA, repeat:1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

Now we have a new joining key that not only includes the id and repeat fields but also retains the field names so we can access them later by name, e.g. meta.id and meta.repeat.

3.4. Use a named closure in map

To avoid duplication and reduce errors, we can use a named closure. A named closure allows us to create a reusable function that we can call in multiple places.

To do so, first we define the closure as a new variable:

main.nf (after)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
            [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }

    getSampleIdAndReplicate = { meta, bam -> [ meta.subMap(['id', 'repeat']), meta, file(bam) ] }

    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }

main.nf (before)
    ch_samples = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map{ row ->
            [[id:row.id, repeat:row.repeat, type:row.type], row.bam]
        }
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }

We've defined the map transformation as a named variable that we can reuse. Note that we also convert the file path to a Path object using file() so that any process receiving this channel can handle the file correctly (for more information see Working with files).
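For example, a quick sketch of what file() does (the path is illustrative):

Toy example
def bam = file('data/patientA_rep1_normal.bam') // returns a Path object
println bam.getName()                           // prints: patientA_rep1_normal.bam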

Let's implement the closure in our workflow:

main.nf (after)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map ( getSampleIdAndReplicate )
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map ( getSampleIdAndReplicate )

main.nf (before)
    ch_normal_samples = ch_samples
        .filter { meta, file -> meta.type == 'normal' }
        .map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
    ch_tumor_samples = ch_samples
        .filter { meta, file -> meta.type == 'tumor' }
        .map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }

Note

The map operator call has switched from { } to ( ). Braces define an anonymous closure inline, whereas a named closure is an ordinary value, so we pass it to map as a regular argument inside parentheses.

Just run the workflow once more to check everything is still working:

Test the named closure
nextflow run main.nf
Samples using named closure
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [angry_meninsky] DSL2 - revision: 2edc226b1d

[[id:patientA, repeat:1], [id:patientA, repeat:1, type:normal], patientA_rep1_normal.bam, [id:patientA, repeat:1, type:tumor], patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [id:patientA, repeat:2, type:normal], patientA_rep2_normal.bam, [id:patientA, repeat:2, type:tumor], patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [id:patientB, repeat:1, type:normal], patientB_rep1_normal.bam, [id:patientB, repeat:1, type:tumor], patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [id:patientC, repeat:1, type:normal], patientC_rep1_normal.bam, [id:patientC, repeat:1, type:tumor], patientC_rep1_tumor.bam]

Using a named closure allows us to reuse the same transformation in multiple places, reducing the risk of errors and making the code more readable and maintainable.

3.5. Reduce duplication of data

We have a lot of duplicated data in our workflow. Each item in the joined samples repeats the id and repeat fields. Since this information is already available in the grouping key, we can avoid this redundancy. As a reminder, our current data structure looks like this:

[
  [
    "id": "patientC",
    "repeat": "1",
  ],
  [
    "id": "patientC",
    "repeat": "1",
    "type": "normal",
  ],
  "patientC_rep1_normal.bam",
  [
    "id": "patientC",
    "repeat": "1",
    "type": "tumor",
  ],
  "patientC_rep1_tumor.bam"
]

Since the id and repeat fields are available in the grouping key, let's remove them from the rest of each channel item to avoid duplication. We can do this by using the subMap method to create a new map with only the type field. This approach allows us to maintain all necessary information while eliminating redundancy in our data structure.

main.nf (after)
    getSampleIdAndReplicate = { meta, bam -> [ meta.subMap(['id', 'repeat']), meta.subMap(['type']), file(bam) ] }

main.nf (before)
    getSampleIdAndReplicate = { meta, bam -> [ meta.subMap(['id', 'repeat']), meta, file(bam) ] }

Now the closure returns a tuple where the first element contains the id and repeat fields, and the second element contains only the type field. This eliminates redundancy by storing the id and repeat information once in the grouping key, while maintaining all necessary information.

Run the workflow to see what this looks like:

Test data deduplication
nextflow run main.nf
Deduplicated sample data
[[id:patientA, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep2_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientB_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], [type:normal], /workspaces/training/side-quests/splitting_and_grouping/patientC_rep1_normal.bam, [type:tumor], /workspaces/training/side-quests/splitting_and_grouping/patientC_rep1_tumor.bam]

We can see we only state the id and repeat fields once in the grouping key and we have the type field in the sample data. We haven't lost any information but we managed to make our channel contents more succinct.

3.6. Remove redundant information

We removed duplicated information above, but we still have some other redundant information in our channels.

In the beginning, we separated the normal and tumor samples using filter, then joined them based on id and repeat keys. The join operator preserves the order in which tuples are merged, so in our case, with normal samples on the left side and tumor samples on the right, the resulting channel maintains this structure: id, <normal elements>, <tumor elements>.

Since we know the position of each element in our channel, we can simplify the structure further by dropping the [type:normal] and [type:tumor] metadata.

main.nf (after)
    getSampleIdAndReplicate = { meta, bam -> [ meta.subMap(['id', 'repeat']), file(bam) ] }

main.nf (before)
    getSampleIdAndReplicate = { meta, bam -> [ meta.subMap(['id', 'repeat']), meta.subMap(['type']), file(bam) ] }

Run again to see the result:

Test streamlined data structure
nextflow run main.nf
Streamlined sample data
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [confident_leavitt] DSL2 - revision: a2303895bd

[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]

Takeaway

In this section, you've learned:

  • Manipulating Tuples: How to use map to isolate a field in a tuple
  • Joining Tuples: How to use join to combine tuples based on the first field
  • Creating Joining Keys: How to use subMap to create a new joining key
  • Named Closures: How to use a named closure in map
  • Multiple Field Joining: How to join on multiple fields for more precise matching
  • Data Structure Optimization: How to streamline channel structure by removing redundant information

You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then print the results.

This is a common pattern in bioinformatics workflows where you need to match up samples or other types of data after processing independently, so it is a useful skill. Next, we will look at repeating a sample multiple times.

4. Spread patients over intervals

A key pattern in bioinformatics workflows is distributing analysis across genomic regions. For instance, variant calling can be parallelized by dividing the genome into intervals (like chromosomes or smaller regions). This parallelization strategy significantly improves pipeline efficiency by distributing computational load across multiple cores or nodes, reducing overall execution time.

In the following section, we'll demonstrate how to distribute our sample data across multiple genomic intervals. We'll pair each sample with every interval, allowing parallel processing of different genomic regions. This will multiply our dataset size by the number of intervals, creating multiple independent analysis units that can be brought back together later.

4.1. Spread samples over intervals using combine

Let's start by creating a channel of intervals. To keep life simple, we will manually define just 3 intervals. In a real workflow, you could read these in from a file input or even create a channel of many interval files.

main.nf (after)
        .join(ch_tumor_samples)
    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')

main.nf (before)
        .join(ch_tumor_samples)
    ch_joined_samples.view()

Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve it with the combine operator, which takes every item from the first channel and pairs it with each item from the second. As a minimal sketch with toy values (not part of main.nf):
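Toy example
Channel.of(1, 2)
    .combine(Channel.of('a', 'b'))
    .view()
// emits: [1, a], [1, b], [2, a], [2, b]

Let's add a combine operator to our workflow: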

main.nf (after)
    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')

    ch_combined_samples = ch_joined_samples
        .combine(ch_intervals)
        .view()

main.nf (before)
    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')

Now let's run it and see what happens:

Test the combine operation
nextflow run main.nf
Samples combined with intervals
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [mighty_tesla] DSL2 - revision: ae013ab70b

[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr1]
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr2]
[[id:patientA, repeat:1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam, chr3]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr1]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr2]
[[id:patientA, repeat:2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam, chr3]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr1]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr2]
[[id:patientB, repeat:1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam, chr3]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr1]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr2]
[[id:patientC, repeat:1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam, chr3]

Success! Every sample has been repeated for every interval in our 3-interval list, effectively tripling the number of items in the channel. It's a little hard to read though, so in the next section we will tidy it up.

4.2. Organise the channel

We can use the map operator to tidy and refactor our sample data so it's easier to understand. Let's move the interval string into the metadata map that forms the first element.

main.nf (after)
    ch_combined_samples = ch_joined_samples
        .combine(ch_intervals)
        .map { grouping_key, normal, tumor, interval ->
            [
                grouping_key + [interval: interval],
                normal,
                tumor
            ]
        }
        .view()

main.nf (before)
    ch_combined_samples = ch_joined_samples
        .combine(ch_intervals)
        .view()

Let's break down what this map operation does step by step.

First, we use named parameters to make the code more readable. By using the names grouping_key, normal, tumor and interval, we can refer to the elements in the tuple by name instead of by index:

        .map { grouping_key, normal, tumor, interval ->

Next, we combine the grouping_key with the interval field. The grouping_key is a map containing id and repeat fields. We create a new map with the interval and merge them using Groovy's map addition (+):

                grouping_key + [interval: interval],

Finally, we return this as a tuple with three elements: the combined metadata map, the normal sample file, and the tumor sample file:

            [
                grouping_key + [interval: interval],
                normal,
                tumor
            ]
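Groovy's map addition merges the key-value pairs of two maps, with keys from the right-hand map winning on any collision. A quick sketch with toy values:

Toy example
println([id: 'patientA', repeat: '1'] + [interval: 'chr1'])
// prints: [id:patientA, repeat:1, interval:chr1]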

Let's run it again and check the channel contents:

Test the reorganized structure
nextflow run main.nf
Samples combined with intervals
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [sad_hawking] DSL2 - revision: 1f6f6250cd

[[id:patientA, repeat:1, interval:chr1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:1, interval:chr2], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:1, interval:chr3], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, repeat:2, interval:chr1], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, repeat:2, interval:chr2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, repeat:2, interval:chr3], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, repeat:1, interval:chr1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, repeat:1, interval:chr2], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, repeat:1, interval:chr3], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr2], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, repeat:1, interval:chr3], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]

Using map to coerce your data into the correct structure can be tricky, but it's crucial for effective data manipulation.

We now have every sample repeated across all genomic intervals, creating multiple independent analysis units that can be processed in parallel. But what if we want to bring related samples back together? In the next section, we'll learn how to group samples that share common attributes.

Takeaway

In this section, you've learned:

  • Spreading samples over intervals: How to use combine to repeat samples over intervals
  • Creating Cartesian products: How to generate all combinations of samples and intervals
  • Organizing channel structure: How to use map to restructure data for better readability
  • Parallel processing preparation: How to set up data for distributed analysis

5. Aggregating samples using groupTuple

In the previous sections, we learned how to split data from an input file and filter by specific fields (in our case normal and tumor samples). But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from patientA together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency before comparing or combining the results at the end.

Nextflow includes built-in operators for this; the main one we will look at is groupTuple.

Let's start by grouping all of our samples that have the same id and interval fields. This would be typical of an analysis where we want to group technical replicates while keeping meaningfully different samples separate.

To do this, we should separate out our grouping variables so we can use them in isolation.

The first step is similar to what we did in the previous section. We must isolate our grouping variable as the first element of the tuple. Remember, our first element is currently a map of id, repeat and interval fields:

{
  "id": "patientA",
  "repeat": "1",
  "interval": "chr1"
}

We can reuse the subMap method from before to isolate the id and interval fields from the map. As before, we will use the map operator to apply subMap to the first element of each tuple.

main.nf (after)
    ch_combined_samples = ch_joined_samples
        .combine(ch_intervals)
        .map { grouping_key, normal, tumor, interval ->
            [
                grouping_key + [interval: interval],
                normal,
                tumor
            ]
        }

    ch_grouped_samples = ch_combined_samples
        .map { grouping_key, normal, tumor ->
            [
                grouping_key.subMap('id', 'interval'),
                normal,
                tumor
            ]
          }
          .view()

main.nf (before)
    ch_combined_samples = ch_joined_samples
        .combine(ch_intervals)
        .map { grouping_key, normal, tumor, interval ->
            [
                grouping_key + [interval: interval],
                normal,
                tumor
            ]
        }
        .view()

Let's run it again and check the channel contents:

Test grouping key isolation
nextflow run main.nf
Samples prepared for grouping
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [hopeful_brenner] DSL2 - revision: 7f4f7fea76

[[id:patientA, interval:chr1], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr2], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr3], patientA_rep1_normal.bam, patientA_rep1_tumor.bam]
[[id:patientA, interval:chr1], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, interval:chr2], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientA, interval:chr3], patientA_rep2_normal.bam, patientA_rep2_tumor.bam]
[[id:patientB, interval:chr1], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, interval:chr2], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientB, interval:chr3], patientB_rep1_normal.bam, patientB_rep1_tumor.bam]
[[id:patientC, interval:chr1], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, interval:chr2], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]
[[id:patientC, interval:chr3], patientC_rep1_normal.bam, patientC_rep1_tumor.bam]

We can see that we have successfully isolated the id and interval fields, but not grouped the samples yet.

Note

We are discarding the replicate field here. This is because we don't need it for further downstream processing. After completing this tutorial, see if you can include it without affecting the later grouping!

Let's now group the samples by this new grouping element, using the groupTuple operator.

main.nf (after)
    ch_grouped_samples = ch_combined_samples
        .map { grouping_key, normal, tumor ->
            [
                grouping_key.subMap('id', 'interval'),
                normal,
                tumor
            ]
          }
          .groupTuple()
          .view()

main.nf (before)
    ch_grouped_samples = ch_combined_samples
        .map { grouping_key, normal, tumor ->
            [
                grouping_key.subMap('id', 'interval'),
                normal,
                tumor
            ]
          }
          .view()

That's all there is to it! We just added a single line of code. Let's see what happens when we run it:

Test the groupTuple operation
nextflow run main.nf
Grouped samples by ID and interval
 N E X T F L O W   ~  version 25.04.3

Launching `main.nf` [friendly_jang] DSL2 - revision: a1bee1c55d

[[id:patientA, interval:chr1], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientA, interval:chr2], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientA, interval:chr3], [patientA_rep1_normal.bam, patientA_rep2_normal.bam], [patientA_rep1_tumor.bam, patientA_rep2_tumor.bam]]
[[id:patientB, interval:chr1], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientB, interval:chr2], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientB, interval:chr3], [patientB_rep1_normal.bam], [patientB_rep1_tumor.bam]]
[[id:patientC, interval:chr1], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]
[[id:patientC, interval:chr2], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]
[[id:patientC, interval:chr3], [patientC_rep1_normal.bam], [patientC_rep1_tumor.bam]]

Note that our data has changed structure: within each channel element the files are now gathered into lists, like [patientA_rep1_normal.bam, patientA_rep2_normal.bam]. This is because groupTuple collects the individual files of each group into a list per position. This is important to remember when handling the data downstream.

Note

transpose is the opposite of groupTuple: it unpacks the grouped items in a channel and flattens them back into individual elements. Try adding transpose to undo the grouping we performed above! A sketch follows.
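A minimal sketch of what that could look like (illustrative, appended after the grouping step):

Toy example
ch_grouped_samples
    .transpose() // emits one element per replicate again, undoing the grouping
    .view()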

Takeaway

In this section, you've learned:

  • Grouping related samples: How to use groupTuple to aggregate samples by common attributes
  • Isolating grouping keys: How to use subMap to extract specific fields for grouping
  • Handling grouped data structures: How to work with the nested structure created by groupTuple
  • Technical replicate handling: How to group samples that share the same experimental conditions

Summary

In this side quest, you've learned how to split and group data using channels. By modifying the data as it flows through the pipeline, you can construct a pipeline that handles as many items as possible with no loops or while statements. It gracefully scales to large numbers of items. Here's what we achieved:

  1. Read in samplesheet with splitCsv: We read in a CSV file with sample data and viewed the contents.

  2. Use filter (and/or map) to manipulate into 2 separate channels: We used filter to split the data into two channels based on the type field.

  3. Join on ID and repeat: We used join to join the two channels on the id and repeat fields.

  4. Combine by intervals: We used combine to create Cartesian products of samples with genomic intervals.

  5. Group by ID and interval: We used groupTuple to group samples by the id and interval fields, aggregating technical replicates.

This approach offers several advantages over writing a pipeline in a more conventional imperative style, using for and while loops:

  • We can scale to as many or as few inputs as we want with no additional code
  • We focus on handling the flow of data through the pipeline, instead of iteration
  • We can be as complex or simple as required
  • The pipeline becomes more declarative, focusing on what should happen rather than how it should happen
  • Nextflow will optimize execution for us by running independent operations in parallel

By mastering these channel operations, you can build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming. This declarative approach allows Nextflow to optimize execution and parallelize independent operations automatically.

Key Concepts

  • Reading data sheets
// Read CSV with header
Channel.fromPath('samplesheet.csv')
    .splitCsv(header: true)
  • Filtering
// Filter channel based on condition
channel.filter { it.type == 'tumor' }
  • Joining Channels
// Join two channels by key (first element of tuple)
tumor_ch.join(normal_ch)

// Extract joining key and join by this value
tumor_ch.map { meta, file -> [meta.id, meta, file] }
    .join(
       normal_ch.map { meta, file -> [meta.id, meta, file] }
     )

// Join on multiple fields using subMap
tumor_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
    .join(
       normal_ch.map { meta, file -> [meta.subMap(['id', 'repeat']), meta, file] }
     )
  • Grouping Data
// Group by the first element in each tuple
channel.groupTuple()
  • Combining Channels
// Combine with Cartesian product
samples_ch.combine(intervals_ch)
  • Data Structure Optimization
// Extract specific fields using subMap
meta.subMap(['id', 'repeat'])

// Named closures for reusable transformations
getSampleIdAndReplicate = { meta, file -> [meta.subMap(['id', 'repeat']), file] }
channel.map(getSampleIdAndReplicate)

Resources