Pipeline Implementation¶
Data preparation¶
A first step in any pipeline is to prepare the input data. You will find all the data required to run the pipeline in the folder data
within the $HOME/environment/hands-on
repository directory.
There are four data inputs that we will use in this tutorial:
- Genome File (
data/genome.fa
)- Human chromosome 22 in FASTA file format
- Read Files (
data/reads/
)- Sample ENCSR000COQ1: 76bp paired-end reads (
ENCSR000COQ1_1.fq.gz
andENCSR000COQ1_2.fq.gz
).
- Sample ENCSR000COQ1: 76bp paired-end reads (
- Variants File (
data/known_variants.vcf.gz
)- Known variants, gzipped as a Variant Calling File (VCF) format.
- Blacklist File (
data/blacklist.bed
)- Genomic locations which are known to produce artifacts and spurious variants in Browser Extensible Data (BED) format.
Input parameters¶
We can begin writing the pipeline by creating and editing a text file called main.nf
from the $HOME/nf-course/hands-on
repository directory with your favourite text editor. In this example we are using nano
:
Edit this file to specify the input files as script parameters. Using this notation allows you to override them by specifying different values when launching the pipeline execution.
Info
Click the icons in the code for explanations.
- The
/*
,*
and*/
specify comment lines which are ignored by Nextflow. - The
baseDir
variable represents the main script path location. - The
reads
parameter uses a glob pattern to specify the forward (ENCSR000COQ1_1.fq.gz
) and reverse (ENCSR000COQ1_2.fq.gz
) reads are pairs of the same sample. - The
results
parameter is used to specify a directory calledresults
. - The
gatk
parameter specifies the location of the GATK jar file.
Tip
You can copy the above text ( top right or Cmd+C), then move in the terminal window, open nano
and paste using the keyboard Cmd+V shortcut.
Once you have the default parameters in the main.nf
file, you can save and run the main script for the first time.
Tip
With nano
you can save and close the file with Ctrl+O, then Enter, followed by Ctrl+X.
To run the main script use the following command:
You should see the script execute, print Nextflow version and pipeline revision and then exit.
Problem #1
Great, now we need to define a channel variable to handle the read-pair files. To do that open the main.nf
file and copy the lines below at the end of the file.
Tip
In nano
you can move to the end of the file using Ctrl+W and then Ctrl+V.
This time you must fill the BLANK
space with the correct function and parameter.
Tip
Use the fromFilePairs channel factory method. The second one, declares a variable named GATK
specifying the path of the GATK application file.
Once you think you have data organised, you can again run the pipeline. However this time, we can use the the -resume
flag.
Tip
See here for more details about using the resume
option.
Solution
- Create a channel using fromFilePairs().
- A variable representing the path of GATK application file.
Process 1A: Create a FASTA genome index¶
Now we have our inputs set up we can move onto the processes. In our first process we will create a genome index using samtools.
You should implement a process having the following structure:
- Name:
1A_prepare_genome_samtools
- Command: create a genome index for the genome fasta with samtools
- Input: the genome fasta file
- Output: the samtools genome index file
Problem #2
Copy the code below and paste it at the end of main.nf
.
Your aim is to replace BLANK
placeholder with the the correct variable name of the genome file that you have defined in previous problem.
In plain english, the process could be written as:
- A process called
1A_prepare_genome_samtools
- takes as input the genome file from
BLANK
- and creates as output a genome index file which goes into channel
genome_index_ch
- script: using samtools create the genome index from the genome file
Now when we run the pipeline, we see that the process 1A is submitted:
N E X T F L O W ~ version 20.10.0
Launching `main.nf` [cranky_bose] - revision: d1df5b7267
executor > local (1)
[cd/47f882] process > 1A_prepare_genome_samtools [100%] 1 of 1 ✔
Solution
- The solution is to use the
params.genome
parameter defined at the beginning of the script.
Process 1B: Create a FASTA genome sequence dictionary with Picard for GATK¶
Our first process created the genome index for GATK using samtools. For the next process we must do something very similar, this time creating a genome sequence dictionary using Picard.
You should implement a process having the following structure:
- Name:
1B_prepare_genome_picard
- Command: create a genome dictionary for the genome fasta with Picard tools
- Input: the genome fasta file
- Output: the genome dictionary file
Problem #3
Fill in the BLANK
words for both the input and output sections.
Copy the code below and paste it at the end of main.nf
.
Your aim is to insert the correct input name from into the input step (written as BLANK
) of the process and run the pipeline.
Info
You can choose any channel output name that makes sense to you.
Note
.baseName
returns the filename without the file suffix. If "${genome}"
is human.fa
, then "${genome.baseName}.dict"
would be human.dict
.
Solution
- Take as input the
genome
file from theparams.genome
parameter - Give as output the file
${genome.baseName}.dict
and adds it to the channelgenome_dict_ch
Process 1C: Create STAR genome index file¶
Next we must create a genome index for the STAR mapping software.
You should implement a process having the following structure:
- Name: 1C_prepare_star_genome_index
- Command: create a STAR genome index for the genome fasta
- Input: the genome fasta file
- Output: a directory containing the STAR genome index
Problem #4
This is a similar exercise as problem 3, except this time both input
and output
lines have been left BLANK
and must be completed.
Info
The output of the STAR genomeGenerate command is specified here as genome_dir
.
Solution
- Take as input the
genome
file from theparams.genome
parameter. - The
output
is afile
* calledgenome_dir
and is addedinto
a channel calledgenome_dir_ch
. You can call the channel whatever you wish. - Creates the output directory that will contain the resulting STAR genome index.
Note
The file in this case is a directory however it makes no difference.
Process 1D: Filtered and recoded set of variants¶
Next on to something a little more tricky. The next process takes two inputs: the variants file and the blacklist file.
It should output a channel named prepared_vcf_ch
which emitting a tuple of two files.
Info
In Nextflow, tuples can be defined in the input or output using the tuple
qualifier.
You should implement a process having the following structure:
- Name
1D_prepare_vcf_file
- Command
- create a filtered and recoded set of variants
- Input
- the variants file
- the blacklisted regions file
- Output
- a tuple containing the filtered/recoded VCF file and the tab index (TBI) file.
Problem #5
You must fill in the two BLANK_LINES
in the input and the two BLANK
output files.
Broken down, here is what the script is doing:
vcftools --gzvcf $variantsFile -c \ # (1)!
--exclude-bed ${blacklisted} \ # (2)!
--recode | bgzip -c \
> ${variantsFile.baseName}.filtered.recode.vcf.gz # (3)!
tabix ${variantsFile.baseName}.filtered.recode.vcf.gz # (4)!
- The input variable for the variants file
- The input variable for the blacklist file
- The first of the two output files
- Generates the second output file named
"${variantsFile.baseName}.filtered.recode.vcf.gz.tbi"
Try run the pipeline from the project directory with:
Solution
- Take as input the variants file, assigning the name
${variantsFile}
. - Take as input the blacklisted file, assigning the name
${blacklisted}
. - Out a tuple (or set) of two files into the
prepared_vcf_ch
channel. - Defines the name of the first output file.
- Generates the second output file (with
.tbi
suffix).
Congratulations! Part 1 is now complete.
We have all the data prepared and into channels ready for the more serious steps
Process 2: STAR Mapping¶
In this process, for each sample, we align the reads to our genome using the STAR index we created previously.
You should implement a process having the following structure:
- Name
2_rnaseq_mapping_star
- Command
- mapping of the RNA-Seq reads using STAR
- Input
- the genome fasta file
- the STAR genome index
- a tuple containing the replicate id and paired read files
- Output
- a tuple containing replicate id, aligned bam file & aligned bam file index
Problem #6
Copy the code below and paste it at the end of main.nf
.
You must fill in the three BLANK_LINE
lines in the input and the one BLANK_LINE
line in the output.
Info
The final command produces an bam index which is the full filename with an additional .bai
suffix.
Solution
-
the genome fasta file.
-
the STAR genome index directory from the
genome_dir_ch
channel created in the process1C_prepare_star_genome_index
. -
set containing replicate ID and pairs of reads.
-
set containing the replicate ID, resulting bam file and bam index.
-
line specifying the name of the resulting bam file which is indexed with samtools to create a bam index file (
.bai
).
The next step is a filtering step using GATK. For each sample, we split all the reads that contain N characters in their CIGAR string.
Process 3: GATK Split on N¶
The process creates k+1
new reads (where k
is the number of N
cigar elements) that correspond to the segments of the original read beside/between the splicing events represented by the N
s in the original CIGAR.
You should implement a process having the following structure:
- Name
- 3_rnaseq_gatk_splitNcigar
- Command
- split reads on Ns in CIGAR string using GATK
- Input
- the genome fasta file
- the genome index made with samtools
- the genome dictionary made with picard
- a tuple containing replicate id, aligned bam file and aligned bam file index from the STAR mapping
- Output
- a tuple containing the replicate id, the split bam file and the split bam index file
Problem #7
Copy the code below and paste it at the end of main.nf
.
You must fill in the four BLANK_LINE
lines in the input and the one BLANK_LINE
line in the output.
Warning
There is an optional tag
line added to the start of this process. The tag
line allows you to assign a name to a specific task (single execution of a process). This is particularly useful when there are many samples/replicates which pass through the same process.
Info
The GATK command above automatically creates a bam index (.bai
) of the split.bam
output file
Example
A tag
line would also be useful in Process 2
Solution
tag
line with the using the replicate id as the tag.- the genome fasta file
- the genome index from the
genome_index_ch
channel created in the process1A_prepare_genome_samtools
- the genome dictionary from the
genome_dict_ch
channel created in the process1B_prepare_genome_picard
- the set containing the aligned reads from the
aligned_bam_ch
channel created in the process2 _rnaseq_mapping_star
- a set containing the sample id, the split bam file and the split bam index
- specifies the input file names
$genome
and$bam
to GATK - specifies the output file names to GATK
Next we perform a Base Quality Score Recalibration step using GATK.
Process 4: GATK Recalibrate¶
This step uses GATK to detect systematic errors in the base quality scores, select unique alignments and then index the resulting bam file with samtools. You can find details of the specific GATK BaseRecalibrator parameters here.
You should implement a process having the following structure:
- Name
- 4_rnaseq_gatk_recalibrate
- Command
- recalibrate reads from each replicate using GATK
- Input
- the genome fasta file
- the genome index made with samtools
- the genome dictionary made with picard
- a tuple containing replicate id, aligned bam file and aligned bam file index from process 3
- a tuple containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
- Output
- a tuple containing the sample id, the unique bam file and the unique bam index file
Problem #8
Copy the code below and paste it at the end of main.nf
.
You must fill in the five BLANK_LINE
lines in the input and the one BLANK
in the output line.
-
The files resulting from this process will be used in two downstream processes. If a process is executed more than once, and the downstream channel is used by more than one process, we must duplicate the channel. We can do this using the
into
operator with parenthesis in the output section. See here for more information on usinginto
. -
The unique bam file
- The index of the unique bam file (bam file name +
.bai
)
Solution
- the genome fasta file.
- the genome index from the
genome_index_ch
channel created in the process1A_prepare_genome_samtools
. - the genome dictionary from the
genome_dict_ch
channel created in the process1B_prepare_genome_picard
. - the set containing the split reads from the
splitted_bam_ch
channel created in the process3_rnaseq_gatk_splitNcigar
. - the set containing the filtered/recoded VCF file and the tab index (TBI) file from the
prepared_vcf_ch
channel created in the process1D_prepare_vcf_file
. - the set containing the replicate id, the unique bam file and the unique bam index file which goes into two channels.
- line specifying the filename of the output bam file
Now we are ready to perform the variant calling with GATK.
Process 5: GATK Variant Calling¶
This steps call variants with GATK HaplotypeCaller. You can find details of the specific GATK HaplotypeCaller parameters here.
You should implement a process having the following structure:
- Name
- 5_rnaseq_call_variants
- Command
- variant calling of each sample using GATK
- Input
- the genome fasta file
- the genome index made with samtools
- the genome dictionary made with picard
- a tuple containing replicate id, aligned bam file and aligned bam file index from process 4
- Output
- a tuple containing the sample id the resulting variant calling file (vcf)
Problem #9
In this problem we will introduce the use of a channel operator in the input section. The groupTuple operator groups together the tuples emitted by a channel which share a common key.
Warning
Note that in process 4, we used the sampleID (not replicateID) as the first element of the tuple in the output. Now we combine the replicates by grouping them on the sample ID. It follows from this that process 4 is run one time per replicate and process 5 is run one time per sample.
Fill in the BLANK_LINE
lines and BLANK
words as before.
Solution
tag
line with the using the sample id as the tag.- the genome fasta file.
- the genome index from the
genome_index_ch
channel created in the process1A_prepare_genome_samtools
. - the genome dictionary from the
genome_dict_ch
channel created in the process1B_prepare_genome_picard
. - the sets grouped by sampleID from the
final_output_ch
channel created in the process4_rnaseq_gatk_recalibrate
. - the set containing the sample ID and final VCF file.
- the line specifying the name resulting final vcf file.
Processes 6A and 6B: ASE & RNA Editing¶
In the final steps we will create processes for Allele-Specific Expression and RNA Editing Analysis.
We must process the VCF result to prepare variants file for allele specific expression (ASE) analysis. We will implement both processes together.
You should implement two processes having the following structure:
- 1st process
- Name
- 6A_post_process_vcf
- Command
- post-process the variant calling file (vcf) of each sample
- Input
- tuple containing the sample ID and vcf file
- a tuple containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
- Output
- a tuple containing the sample id, the variant calling file (vcf) and a file containing common SNPs
- Name
- 2nd process
- Name
- 6B_prepare_vcf_for_ase
- Command
- prepare the VCF for allele specific expression (ASE) and generate a figure in R.
- Input
- a tuple containing the sample id, the variant calling file (vcf) and a file containing common SNPs
- Output
- a tuple containing the sample ID and known SNPs in the sample for ASE
- a figure of the SNPs generated in R as a PDF file
- Name
Problem #10
Here we introduce the publishDir
directive. This allows us to specify a location for the outputs of the process. See here for more details.
You must have the output of process 6A become the input of process 6B.
- here the output location is specified as a combination of a pipeline parameter and a process input variable
Solution
The final step is the GATK ASEReadCounter.
Problem #11
We have seen the basics of using processes in Nextflow. Yet one of the features of Nextflow is the operations that can be performed on channels outside of processes. See here for details on the specific operators.
Before we perform the GATK ASEReadCounter process, we must group the data for allele-specific expression. To do this we must combine channels.
The bam_for_ASE_ch
channel emites tuples having the following structure, holding the final BAM/BAI files:
The vcf_for_ASE
channel emits tuples having the following structure:
In the first operation, the BAMs are grouped together by sample id.
Next, this resulting channel is merged with the VCFs (vcf_for_ASE) having the same sample id.
We must take the merged channel and creates a channel named grouped_vcf_bam_bai_ch
emitting the following tuples:
Your aim is to fill in the BLANKS
below.
- an operator that groups tuples that contain a common first element.
- the phase operator synchronizes the values emitted by two other channels. See here for more details
- the map operator can apply any function to every item on a channel. In this case we take our tuple from the phase operation, define the separate elements and create a new tuple.
- define
sampleId
to be the first element of left. - define bam to be the second element of left.
- define bai to be the third element of left.
- define vcf to be the first element of right.
- create a new tuple made of four elements
- rename the resulting as
grouped_vcf_bam_bai_ch
Note
left
and right
above are arbitrary names. From the phase operator documentation, we see that phase returns pairs of items. So here left
originates from contents of the bam_for_ASE_ch
channel and right
originates from the contents of vcf_for_ASE
channel.
Process 6C: Allele-Specific Expression analysis with GATK ASEReadCounter¶
Now we are ready for the final process.
You should implement a process having the following structure:
- Name
- 6C_ASE_knownSNPs
- Command
- calculate allele counts at a set of positions with GATK tools
- Input
- genome fasta file
- genome index file from samtools
- genome dictionary file
- the
grouped_vcf_bam_bai_ch
channel
- Output
- the allele specific expression file (
ASE.tsv
)
- the allele specific expression file (
Problem #12
You should construct the process and run the pipeline in its entirety.
Solution
Congratulations! If you made it this far you now have all the basics to create your own Nextflow workflows.
Results overview¶
For each processed sample the pipeline stores results into a folder named after the sample identifier. These folders are created in the directory specified as a parameter in params.results
.
Result files for this workshop can be found in the folder results
within the current folder. There you should see a directory called ENCSR000COQ/
containing the following files:
Variant calls¶
final.vcf
This file contains all somatic variants (SNVs) called from RNAseq data. You will see variants that pass all filters, with the PASS
keyword in the 7th field of the vcf file (filter status
), and also those that did not pass one or more filters.
commonSNPs.diff.sites_in_files
Tab-separated file with comparison between variants obtained from RNAseq and "known" variants from DNA.
The file is sorted by genomic position and contains 8 fields:
1 | CHROM |
chromosome name; |
2 | POS1 |
position of the SNV in file #1 (RNAseq data); |
3 | POS2 |
position of SNV in file #2 (DNA "known" variants); |
4 | IN_FILE |
flag whether SNV is present in the file #1 1, in the file #2 2, or in both files B; |
5 | REF1 |
reference sequence in the file 1; |
6 | REF2 |
reference sequence in the file 2; |
7 | ALT1 |
alternative sequence in the file 1; |
8 | ALT2 |
alternative sequence in the file 2 |
known_snps.vcf
Variants that are common to RNAseq and "known" variants from DNA.
Allele specific expression quantification¶
ASE.tsv
Tab-separated file with allele counts at common SNVs positions (only SNVs from the file known_snps.vcf
)
The file is sorted by coordinates and contains 13 fields:
1 | contig |
contig, scaffold or chromosome name of the variant |
2 | position |
position of the variant |
3 | variant ID |
variant ID in the dbSNP |
4 | refAllele |
reference allele sequence |
5 | altAllele |
alternate allele sequence |
6 | refCount |
number of reads that support the reference allele |
7 | altCount |
number of reads that support the alternate allele |
8 | totalCount |
total number of reads at the site that support both reference and alternate allele and any other alleles present at the site |
9 | lowMAPQDepth |
number of reads that have low mapping quality |
10 | lowBaseQDepth |
number of reads that have low base quality |
11 | rawDepth |
total number of reads at the site that support both reference and alternate allele and any other alleles present at the site |
12 | otherBases |
number of reads that support bases other than reference and alternate bases |
13 | improperPairs |
number of reads that have malformed pairs |
Allele frequency histogram¶
AF.histogram.pdf
This file contains a histogram plot of allele frequency for SNVs common to RNA-seq and "known" variants from DNA.
Bonus step¶
Until now the pipeline has been executed using just a single sample (ENCSR000COQ1
).
Now we can re-execute the pipeline specifying a large set of samples by using the command shown below:
It will print an output similar to the one below:
N E X T F L O W ~ version 20.10.0
Launching `main.nf` [hungry_wing] - revision: a6359031a1
executor > local (27)
[cd/47f882] process > 1A_prepare_genome_samtools [100%] 1 of 1, cached: 1 ✔
[5f/216ba8] process > 1B_prepare_genome_picard [100%] 1 of 1, cached: 1 ✔
[76/5fdc20] process > 1C_prepare_star_genome_index [100%] 1 of 1, cached: 1 ✔
[19/f8842c] process > 1D_prepare_vcf_file [100%] 1 of 1, cached: 1 ✔
[f1/d66ba8] process > 2_rnaseq_mapping_star (6) [100%] 6 of 6, cached: 1 ✔
[74/c0f3a3] process > 3_rnaseq_gatk_splitNcigar (ENCSR000CPO2) [100%] 6 of 6, cached: 1 ✔
[b6/59d9f7] process > 4_rnaseq_gatk_recalibrate (ENCSR000CPO2) [100%] 6 of 6, cached: 1 ✔
[22/4a07fa] process > 5_rnaseq_call_variants (ENCSR000CPO) [100%] 3 of 3 ✔
[1a/c68bfe] process > 6A_post_process_vcf (ENCSR000CPO) [100%] 3 of 3 ✔
[dc/e58d02] process > 6B_prepare_vcf_for_ase (ENCSR000CPO) [100%] 3 of 3 ✔
[2a/0e4e7b] process > 6C_ASE_knownSNPs (ENCSR000CPO) [100%] 3 of 3 ✔
You can notice that this time the pipeline spawns the execution of more tasks because three samples have been provided instead of one.
This shows the ability of Nextflow to implicitly handle multiple parallel task executions depending on the specified pipeline input dataset.
A fully functional version of this pipeline is available at the following GitHub repository: CalliNGS-NF.