2. Workflow Description¶
The aim of the pipeline is to process raw RNA-seq data (in FASTQ format) and obtain the list of small variants, SNVs (SNPs and INDELs) for the downstream analysis. The pipeline is based on the GATK best practices for variant calling with RNAseq data and includes all major steps. In addition the pipeline includes SNVs postprocessing and quantification for allele specific expression.
Samples processing is done independently for each replicate. This includes mapping of the reads, splitting at the CIGAR, reassigning mapping qualities and recalibrating base qualities.
Variant calling is done simultaneously on bam files from all replicates. This allows to improve coverage of genomic regions and obtain more reliable results.
2.1 Software manuals¶
Documentation for all software used in the workflow can be found at the following links:
- GATK tools
2.2 Pipeline steps¶
In order to get a general idea of the workflow, all the composing steps, together with the corresponding commands, are explained in the next sections.
2.2.1 Preparing data¶
This step prepares input files for the analysis. Genome indexes are created and variants overlapping blacklisted regions are filtered out.
Genome indices with
picard are produced first. They will be needed for GATK commands such as
Genome index for
STAR, needed for RNA-seq reads mappings, is created next. Index files are written to the folder
Variants overlapping blacklisted regions are then filtered in order to reduce false positive calls [optional]:
2.2.2 Mapping RNA-seq reads to the reference¶
To align RNA-seq reads to the genome we’re using STAR 2-pass approach. The first alignment creates a table with splice-junctions that is used to guide final alignments. The alignments at both steps are done with default parameters.
Additional fields with the read groups, libraries and sample information are added into the final bam file at the second mapping step. As a result we do not need to run Picard processing step from GATK best practices.
Create new genome index using splice-junction table:
STAR 2-pass, final alignments:
STAR --genomeDir genome_dir \
--readFilesIn ENCSR000COQ1_1.fastq.gz ENCSR000COQ1_2.fastq.gz \
--runThreadN 4 \
--readFilesCommand zcat \
--outFilterType BySJout \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattrRGline ID:ENCSR000COQ1 LB:library PL:illumina PU:machine SM:GM12878
Index the resulting bam file:
2.2.3 Split’N'Trim and reassign mapping qualities¶
The RNA-seq reads overlapping exon-intron junctions can produce false positive variants due to inaccurate splicing. To solve this problem the GATK team recommend to hard-clip any sequence that overlap intronic regions and developed a special tool for this purpose:
SplitNCigarReads. The tool identifies Ns in the CIGAR string of the alignment and split reads at this position so that few new reads are created.
At this step we also reassign mapping qualities to the alignments. This is important because STAR assign the value
255 (high quality) to “unknown” mappings that are meaningless to GATK and to variant calling in general.
This step is done with recommended parameters from the GATK best practices.
2.2.4 Base Recalibration¶
The proposed workflow does not include an indel re-alignment step, which is an optional step in the GATK best practices. We excluded that since it is quite time-intensive and does not really improve variant calling.
We instead include a base re-calibration step. This step allows to remove possible systematic errors introduced by the sequencing machine during the assignment of read qualities. To do this, the list of known variants is used as a training set to the machine learning algorithm that models possible errors. Base quality scores are then adjusted based on the obtained results.
2.2.5 Variant Calling and Variant filtering¶
The variant calling is done on the uniquely aligned reads only in order to reduce the number of false positive variants called:
For variant calling we’re using the GATK tool
HaplotypeCaller with default parameters:
Variant filtering is done as recommended in the GATK best practices:
- keep clusters of at least 3 SNPs that are within a window of 35 bases between them
- estimate strand bias using Fisher’s Exact Test with values > 30.0 (Phred-scaled p-value)
- use variant call confidence score
QualByDepth(QD) with values < 2.0. The QD is the QUAL score normalized by allele depth (AD) for a variant.
2.2.6 Variant Post-processing¶
For downstream analysis we’re considering only sites that pass all filters and are covered with at least 8 reads:
Filtered RNA-seq variants are compared with those obtained from DNA sequencing (from Illumina platinum genome project). Variants that are common to these two datasets are "known" SNVs. The ones present only in the RNA-seq cohort only are "novel".
Known SNVs will be used for allele specific expression analysis.
Novel variants will be used to detect RNA-editing events.
We compare two variants files to detect common and different sites:
Here we select sites present in both files ("known" SNVs only):
Plot a histogram with allele frequency distribution for "known" SNVs:
Calculate read counts for each "known" SNVs per allele for allele specific expression analysis: