Part 3: Multi-sample paired-end implementation¶
Previously, you built a per-sample variant calling pipeline that processed each sample's data independently. In this part of the course, we're going to take our simple workflow to the next level by turning it into a powerful batch automation tool to handle arbitrary numbers of samples. And while we're at it, we're also going to update it to expect paired-end data, which is more common in newer studies.
How to begin from this section
This section of the course assumes you have completed Part 1: Method Overview, Part 2: Single-sample implementation and have a working rnaseq.nf pipeline with filled-in module files.
If you did not complete Part 2 or want to start fresh for this part, you can use the Part 2 solution as your starting point.
Run these commands from inside the nf4-science/rnaseq/ directory:
cp solutions/part2/rnaseq-2.nf rnaseq.nf
cp solutions/part2/modules/fastqc.nf modules/
cp solutions/part2/modules/trim_galore.nf modules/
cp solutions/part2/modules/hisat2_align.nf modules/
cp solutions/part2/nextflow.config .
This gives you a complete single-sample processing workflow. You can test that it runs successfully:
Assignment¶
In this part of the course, we're going to extend the workflow to do the following:
- Read sample information from a CSV samplesheet
- Run per-sample QC, trimming, and alignment on all samples in parallel
- Aggregate all QC reports into a comprehensive MultiQC report
This automates the steps from the second section of Part 1: Method overview, where you ran these commands manually in their containers.
Lesson plan¶
We've broken this down into three stages:
- Make the workflow accept multiple input samples.
This covers switching from a single file path to a CSV samplesheet, parsing it with
splitCsv(), and running all existing processes on multiple samples. - Add comprehensive QC report generation.
This introduces the
collect()operator to aggregate outputs across samples, and adds a MultiQC process to produce a combined report. - Switch to paired-end RNAseq data. This covers adapting processes for paired-end inputs (using tuples), creating paired-end modules, and setting up a separate test profile.
This implements the method described in Part 1: Method Overview (second section covering the multi-sample use case) and builds directly on the workflow produced by Part 2.
Tip
Make sure you're in the correct working directory:
cd /workspaces/training/nf4-science/rnaseq
1. Make the workflow accept multiple input samples¶
To run on multiple samples, we need to change how we manage the input: instead of providing a single file path, we'll read sample information from a CSV file.
We provide a CSV file containing sample IDs and FASTQ file paths in the data/ directory.
This CSV file includes a header line that names the columns.
Note that this is still single-end read data.
Warning
The file paths in the CSV are absolute paths that must match your environment. If you are not running this in the training environment we provide, you will need to update the paths to match your system.
1.1. Change the primary input to a CSV of file paths in the test profile¶
First, we need to update the test profile in nextflow.config to provide the CSV file path instead of the single FASTQ path.
Next, we'll need to update the channel creation to read from this CSV.
1.2. Update the channel factory to parse CSV input¶
We need to load the contents of the file into the channel instead of just the file path itself.
We can do this using the same pattern we used in Part 2 of Hello Nextflow: applying the splitCsv() operator to parse the file, then a map operation to extract the FASTQ file path from each row.
One thing that is new compared to what you encountered in the Hello Nextflow course is that this CSV has a header line, so we add header: true to the splitCsv() call.
That allows us to reference columns by name in the map operation: row.fastq_path extracts the file path from the fastq_path column of each row.
The input handling is updated and the workflow is ready to test.
1.3. Run the workflow¶
The workflow now reads sample information from a CSV file and processes all samples in parallel.
Command output
This time each step gets run 6 times, once for each sample in the CSV file.
That's all it took to get the workflow to run on multiple files. Nextflow handles all the parallelism for us.
Takeaway¶
You know how to switch from a single-file input to CSV-based multi-sample input that Nextflow processes in parallel.
What's next?¶
Add a QC report aggregation step that combines metrics from all samples.
2. Aggregate pre-processing QC metrics into a single MultiQC report¶
All this produces a lot of QC reports, and we don't want to have to dig through individual reports. This is the perfect point to put in a MultiQC report aggregation step.
Recall the multiqc command from Part 1:
The command scans the current directory for recognized QC output files and aggregates them into a single HTML report.
The container URI was community.wave.seqera.io/library/pip_multiqc:a3c26f6199d64b7c.
We need to set up an additional parameter, prepare the inputs, write the process, wire it in, and update the output handling.
2.1. Set up the inputs¶
The MultiQC process needs a report name parameter and the collected QC outputs from all previous steps bundled together.
2.1.1. Add a report_id parameter¶
Add a parameter to name the output report.
Add the report ID default to the test profile:
Next, we'll need to prepare the inputs for the MultiQC process.
2.1.2. Collect and combine QC outputs from previous steps¶
We need to give the MULTIQC process all the QC-related outputs from previous steps bundled together.
For that, we use the .mix() operator, which aggregates multiple channels into a single one.
We start from channel.empty() and mix in all the output channels we want to combine.
This is cleaner than chaining .mix() onto one of the output channels directly, because it treats all inputs symmetrically.
In our workflow, the QC-related outputs to aggregate are:
FASTQC.out.zipFASTQC.out.htmlTRIM_GALORE.out.trimming_reportsTRIM_GALORE.out.fastqc_reportsHISAT2_ALIGN.out.log
We mix them into a single channel, then use .collect() to aggregate the reports across all samples into a single list.
Add these lines to the workflow body after the HISAT2_ALIGN call:
Using intermediate variables makes each step clear: multiqc_files_ch contains all individual QC files mixed into one channel, and multiqc_files_list is the collected bundle ready to pass to MultiQC.
2.2. Write the QC aggregation process and call it in the workflow¶
As before, we need to fill in the process definition, import the module, and add the process call.
2.2.1. Fill in the module for the QC aggregation process¶
Open modules/multiqc.nf and examine the outline of the process definition.
Go ahead and fill in the process definition by yourself using the information provided above, then check your work against the solution in the "After" tab below.
This process uses path '*' as the input qualifier for the QC files.
The '*' wildcard tells Nextflow to stage all the collected files into the working directory without requiring specific names.
The val output_name input is a string that controls the report filename.
The multiqc . command scans the current directory (where all the staged QC files are) and generates the report.
Once you've completed this, the process is ready to use.
2.2.2. Include the module¶
Add the import statement to rnaseq.nf:
Now add the process call to the workflow.
2.2.3. Add the process call¶
Pass the collected QC files and the report ID to the MULTIQC process:
The MultiQC process is now wired into the workflow.
2.3. Update the output handling¶
We need to add the MultiQC outputs to the publish declaration and configure where they go.
2.3.1. Add publish targets for the MultiQC outputs¶
Add the MultiQC outputs to the publish: section:
| rnaseq.nf | |
|---|---|
Next, we'll need to tell Nextflow where to put these outputs.
2.3.2. Configure the new output targets¶
Add entries for the MultiQC targets in the output {} block, publishing them into a multiqc/ subdirectory:
The output configuration is complete.
2.4. Run the workflow¶
We use -resume so that the previous processing steps are cached and only the new MultiQC step runs.
Command output
N E X T F L O W ~ version 24.10.0
Launching `rnaseq.nf` [modest_pare] DSL2 - revision: fc724d3b49
executor > local (1)
[07/3ff9c5] FASTQC (6) [100%] 6 of 6, cached: 6 ✔
[2c/8d8e1e] TRIM_GALORE (5) [100%] 6 of 6, cached: 6 ✔
[a4/7f9c44] HISAT2_ALIGN (6) [100%] 6 of 6, cached: 6 ✔
[56/e1f102] MULTIQC [100%] 1 of 1 ✔
A single call to MULTIQC has been added after the cached process calls.
You can find the MultiQC outputs in the results directory.
results/multiqc
├── all_single-end_data
│ ├── cutadapt_filtered_reads_plot.txt
│ ├── cutadapt_trimmed_sequences_plot_3_Counts.txt
│ ├── cutadapt_trimmed_sequences_plot_3_Obs_Exp.txt
│ ├── fastqc_adapter_content_plot.txt
│ ├── fastqc_overrepresented_sequences_plot.txt
│ ├── fastqc_per_base_n_content_plot.txt
│ ├── fastqc_per_base_sequence_quality_plot.txt
│ ├── fastqc_per_sequence_gc_content_plot_Counts.txt
│ ├── fastqc_per_sequence_gc_content_plot_Percentages.txt
│ ├── fastqc_per_sequence_quality_scores_plot.txt
│ ├── fastqc_sequence_counts_plot.txt
│ ├── fastqc_sequence_duplication_levels_plot.txt
│ ├── fastqc_sequence_length_distribution_plot.txt
│ ├── fastqc-status-check-heatmap.txt
│ ├── fastqc_top_overrepresented_sequences_table.txt
│ ├── hisat2_se_plot.txt
│ ├── multiqc_citations.txt
│ ├── multiqc_cutadapt.txt
│ ├── multiqc_data.json
│ ├── multiqc_fastqc.txt
│ ├── multiqc_general_stats.txt
│ ├── multiqc_hisat2.txt
│ ├── multiqc.log
│ ├── multiqc_software_versions.txt
│ └── multiqc_sources.txt
└── all_single-end.html
That last all_single-end.html file is the full aggregated report, conveniently packaged into one easy to browse HTML file.
Takeaway¶
You know how to collect outputs from multiple channels, bundle them with .mix() and .collect(), and pass them to an aggregation process.
What's next?¶
Adapt the workflow to handle paired-end RNAseq data.
3. Enable processing paired-end RNAseq data¶
Right now our workflow can only handle single-end RNAseq data. It's increasingly common to see paired-end RNAseq data, so we want to be able to handle that.
Making the workflow completely agnostic of the data type would require using slightly more advanced Nextflow language features, so we're not going to do that here, but we can make a paired-end processing version to demonstrate what needs to be adapted.
3.1. Copy the workflow and update the inputs¶
We start by copying the single-end workflow file and updating it for paired-end data.
3.1.1. Copy the workflow file¶
Create a copy of the workflow file to use as a starting point for the paired-end version.
Now update the parameters and input handling in the new file.
3.1.2. Add a paired-end test profile¶
We provide a second CSV file containing sample IDs and paired FASTQ file paths in the data/ directory.
Add a test_pe profile to nextflow.config that points to this file and uses a paired-end report ID.
The test profile for paired-end data is ready.
3.1.3. Update the channel factory¶
The .map() operator needs to grab both FASTQ file paths and return them as a list.
The input handling is configured for paired-end data.
3.2. Adapt the FASTQC module for paired-end data¶
Copy the module to create a paired-end version:
The FASTQC process input doesn't need to change — when Nextflow receives a list of two files, it stages both and reads expands to both filenames.
The only change needed is in the output block: since we now get two FastQC reports per sample, we switch from simpleName-based patterns to wildcards.
This generalizes the process in a way that makes it able to handle either single-end or paired-end data.
Update the import in rnaseq_pe.nf to use the paired-end version:
The FASTQC module and its import are updated for paired-end data.
3.3. Adapt the TRIM_GALORE module for paired-end data¶
Copy the module to create a paired-end version:
This module needs more substantial changes:
- The input changes from a single path to a tuple of two paths
- The command adds the
--pairedflag and takes both read files - The output changes to reflect Trim Galore's paired-end naming conventions, producing separate FastQC reports for each read file
| modules/trim_galore_pe.nf | |
|---|---|
Update the import in rnaseq_pe.nf:
The TRIM_GALORE module and its import are updated for paired-end data.
3.4. Adapt the HISAT2_ALIGN module for paired-end data¶
Copy the module to create a paired-end version:
This module needs similar changes:
- The input changes from a single path to a tuple of two paths
- The HISAT2 command changes from
-U(unpaired) to-1and-2(paired) read arguments - All uses of
reads.simpleNamechange toread1.simpleNamesince we now reference a specific member of the pair
Update the import in rnaseq_pe.nf:
The HISAT2_ALIGN module and its import are updated for paired-end data.
3.5. Update the MultiQC aggregation for paired-end outputs¶
The paired-end TRIM_GALORE process now produces two separate FastQC report channels (fastqc_reports_1 and fastqc_reports_2) instead of one.
Update the .mix() block in rnaseq_pe.nf to include both:
The MultiQC aggregation now includes both sets of paired-end FastQC reports.
3.6. Update the output handling for paired-end outputs¶
The publish: section and output {} block also need to reflect the two separate FastQC report channels from the paired-end TRIM_GALORE process.
Update the publish: section in rnaseq_pe.nf:
Update the corresponding entries in the output {} block:
The paired-end workflow is now fully updated and ready to run.
3.7. Run the workflow¶
We don't use -resume since this wouldn't cache, and there's twice as much data to process than before, but it should still complete in under a minute.
Command output
Now we have two slightly divergent versions of our workflow, one for single-end read data and one for paired-end data. The next logical step would be to make the workflow accept either data type on the fly, which is out of scope for this course, but we may tackle that in a follow-up.
Takeaway¶
You know how to adapt a single-sample workflow to parallelize processing of multiple samples, generate a comprehensive QC report, and adapt the workflow to use paired-end read data.
What's next?¶
Give yourself a big pat on the back! You have completed the Nextflow for RNAseq course.
Head on to the final course summary to review what you learned and find out what comes next.