Part 2: Single-sample implementation¶
In this part of the course, we're going to write the simplest possible workflow that wraps all the commands we ran in Part 1 to automate running them, and we'll just aim to process one sample at a time.
We'll do this in three stages:
- Write a single-stage workflow that runs the initial QC step
- Add adapter trimming and post-trimming QC
- Add alignment to the reference genome
1. Write a single-stage workflow that runs the initial QC¶
Let's start by writing a simple workflow that runs the FastQC tool on a FASTQ file containing single-end RNAseq reads.
We provide you with a workflow file, rnaseq.nf
, that outlines the main parts of the workflow.
rnaseq.nf | |
---|---|
Keep in mind this workflow code is correct but it's not functional; its purpose is just to serve as a skeleton that you'll use to write the actual workflow.
1.1. Create a directory to store modules¶
We'll create standalone modules for each process to make it easier to manage and reuse them, so let's create a directory to store them.
1.2. Create a module for the QC metrics collection process¶
Let's create a module file called modules/fastqc.nf
to house the FASTQC
process:
Open the file in the code editor and copy the following code into it:
1.2. Import the module into the workflow file¶
Add the statement include { FASTQC } from './modules/fastqc.nf'
to the rnaseq.nf
file:
1.3. Add an input declaration¶
Declare an input parameter with a default value:
1.4. Create an input channel in the workflow block¶
Use a basic .fromPath()
channel factory to create the input channel:
rnaseq.nf | |
---|---|
1.5. Call the FASTQC
process on the input channel¶
rnaseq.nf | |
---|---|
1.6. Run the workflow to test that it works¶
We could use the --reads
parameter to specify an input from command line, but during development we can be lazy and just use the test default we set up.
This should run very quickly if you worked through Part 1 and have already pulled the container. If you skipped it, Nextflow will pull the container for you; you don't have to do anything for it to happen, but you may need to wait up to a minute.
N E X T F L O W ~ version 24.10.0
Launching `rnaseq.nf` [fabulous_snyder] DSL2 - revision: 3394c725ee
executor > local (1)
[d6/d94c3a] FASTQC (1) [100%] 1 of 1 ✔
You can find the outputs under results/fastqc
as specified in the FASTQC
process by the publishDir
directive.
2. Add adapter trimming and post-trimming quality control¶
We're going to use the Trim_Galore wrapper, which bundles Cutadapt for the trimming itself and FastQC for the post-trimming quality control.
2.1. Create a module for the trimming and QC process¶
Let's create a module file called modules/trim_galore.nf
to house the TRIM_GALORE
process:
Open the file in the code editor and copy the following code into it:
2.2. Import the module into the workflow file¶
Add the statement include { TRIM_GALORE } from './modules/trim_galore.nf'
to the rnaseq.nf
file:
rnaseq.nf | |
---|---|
2.3. Call the process on the input channel¶
rnaseq.nf | |
---|---|
2.4. Run the workflow to test that it works¶
This should run very quickly too, since we're runnng on such a small input file.
N E X T F L O W ~ version 24.10.0
Launching `rnaseq.nf` [fabulous_snyder] DSL2 - revision: 3394c725ee
executor > local (1)
[d6/d94c3a] FASTQC (1) [100%] 1 of 1 ✔
[c2/e4a9bb] TRIM_GALORE (1) [100%] 1 of 1 ✔
You can find the outputs under results/trimming
as specified in the TRIM_GALORE
process by the publishDir
directive.
ENCSR000COQ1_1.fastq.gz_trimming_report.txt ENCSR000COQ1_1_trimmed_fastqc.zip
ENCSR000COQ1_1_trimmed_fastqc.html ENCSR000COQ1_1_trimmed.fq.gz
3. Align the reads to the reference genome¶
Finally we can run the genome alignment step using Hisat2, which will also emit FastQC-style quality control metrics.
3.1. Create a module for the HiSat2 process¶
Let's create a module file called modules/hisat2_align.nf
to house the HISAT2_ALIGN
process:
Open the file in the code editor and copy the following code into it:
3.2. Import the module into the workflow file¶
Add the statement include { HISAT2_ALIGN } from './modules/hisat2_align.nf'
to the rnaseq.nf
file:
rnaseq.nf | |
---|---|
3.3. Add a parameter declaration to provide the genome index¶
Declare an input parameter with a default value:
3.4. Call the HISAT2_ALIGN
process on the trimmed reads output by TRIM_GALORE
¶
The trimmed reads are in the TRIM_GALORE.out.trimmed_reads
channel output by the previous step.
In addition, we use file (params.hisat2_index_zip)
to provide the Hisat2 tool with the gzipped genome index tarball.
3.4. Run the workflow to test that it works¶
This should run very quickly too, since we're runnng on such a small input file.
N E X T F L O W ~ version 24.10.0
Launching `rnaseq.nf` [extravagant_khorana] DSL2 - revision: 701b41bd16
executor > local (3)
[e4/d15ad4] FASTQC (1) [100%] 1 of 1 ✔
[c6/12b2be] TRIM_GALORE (1) [100%] 1 of 1 ✔
[c6/7a9f13] HISAT2_ALIGN (1) [100%] 1 of 1 ✔
You can find the outputs under results/align
as specified in the HISAT2_ALIGN
process by the publishDir
directive.
This completes the basic processing we need to apply to each sample.
We'll add MultiQC report aggregation in Part 2, after we've made the workflow accept multiple samples at a time.
Takeaway¶
You know how to wrap all the core steps to process single-end RNAseq samples individually.
What's next?¶
Learn how to modify the workflow to process multiple samples in parallel, aggregate QC reports across all steps for all samples, and enable running the workflow on paired-end RNAseq data.