1. Data Description

The input data used to test the pipeline implementation is described below. For the purpose of this project, only a subset of the original data is used for most of the data types.

1.1 Genome assembly

genome.fa

The human genome assembly hg19 (GRCh37) from GenBank, chromosome 22 only.

1.2 RNA-seq reads

ENCSR000COQ[12]_[12].fastq.gz

The RNA-seq data come from the human GM12878 cell line, from whole-cell, cytosolic and nuclear extracts (see table below).

The libraries are stranded PE76 Illumina GAIIx RNA-Seq from rRNA-depleted Poly-A+ long RNA (> 200 nucleotides in size).

Only reads mapped to the 22q11 locus of the human genome (chr22:16000000-18000000) are used.

ENCODE ID	Cellular fraction	Replicate ID	Read 1 file	Read 2 file
ENCSR000COQ	Whole Cell	1	`ENCSR000COQ1_1.fastq.gz`	`ENCSR000COQ1_2.fastq.gz`
ENCSR000COQ	Whole Cell	2	`ENCSR000COQ2_1.fastq.gz`	`ENCSR000COQ2_2.fastq.gz`
ENCSR000CPO	Nuclear	1	`ENCSR000CPO1_1.fastq.gz`	`ENCSR000CPO1_2.fastq.gz`
ENCSR000CPO	Nuclear	2	`ENCSR000CPO2_1.fastq.gz`	`ENCSR000CPO2_2.fastq.gz`
ENCSR000COR	Cytosolic	1	`ENCSR000COR1_1.fastq.gz`	`ENCSR000COR1_2.fastq.gz`
ENCSR000COR	Cytosolic	2	`ENCSR000COR2_1.fastq.gz`	`ENCSR000COR2_2.fastq.gz`
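
The file naming used in the table can be unpacked programmatically. The following is a minimal Python sketch, assuming that the digit directly after the ENCODE ID is the replicate and the digit after the underscore is the read mate of the paired-end library. The helper names `parse_read_name` and `pair_reads` are hypothetical and not part of the pipeline.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <ENCODE ID><replicate>_<mate>.fastq.gz
READ_NAME = re.compile(r"^(ENCSR\d{3}[A-Z]{3})([12])_([12])\.fastq\.gz$")


def parse_read_name(filename):
    """Split a read file name into (ENCODE ID, replicate, mate)."""
    match = READ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected read file name: {filename}")
    encode_id, replicate, mate = match.groups()
    return encode_id, int(replicate), int(mate)


def pair_reads(filenames):
    """Group files into {(ENCODE ID, replicate): (read 1 file, read 2 file)}."""
    mates = defaultdict(dict)
    for name in filenames:
        encode_id, replicate, mate = parse_read_name(name)
        mates[(encode_id, replicate)][mate] = name
    # Raises KeyError if either mate of a replicate is missing.
    return {key: (files[1], files[2]) for key, files in sorted(mates.items())}


files = [
    "ENCSR000COQ1_1.fastq.gz", "ENCSR000COQ1_2.fastq.gz",
    "ENCSR000COQ2_1.fastq.gz", "ENCSR000COQ2_2.fastq.gz",
]
for (encode_id, replicate), (read1, read2) in pair_reads(files).items():
    print(encode_id, replicate, read1, read2)
```

Grouping by (ENCODE ID, replicate) keeps the two mates of each replicate together, which is the unit a paired-end aligner typically expects.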

1.3 "Known" variants

known_variants.vcf.gz

Known variants come from high-confidence variant calls for GM12878 from the Illumina Platinum Genomes project. These variant calls were obtained by taking into account pedigree information and the concordance of calls across different methods.

We’re using the subset from chromosome 22 only.
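
Restricting a VCF file to a single chromosome can be illustrated as follows. This is only a sketch of the idea, not the procedure used to produce `known_variants.vcf.gz`. It assumes that the chromosome is named `chr22` in the CHROM column (some files use `22` instead), and the function names and example records are made up for illustration.

```python
import gzip


def subset_vcf_lines(lines, chrom):
    """Yield header lines plus records whose CHROM column equals `chrom`."""
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line


def subset_vcf(src_path, dst_path, chrom="chr22"):
    """Write the `chrom` subset of a gzipped VCF to a new gzipped VCF."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(subset_vcf_lines(src, chrom))


# Made-up records for illustration; not taken from the real file.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr21\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr22\t16050000\t.\tC\tT\t.\tPASS\t.\n",
]
kept = list(subset_vcf_lines(example, "chr22"))
print("".join(kept), end="")
```

Python's `gzip` module writes plain gzip rather than the BGZF format that tabix indexing requires, so an indexed file would need to be recompressed with a tool such as `bgzip`.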

1.4 Blacklisted regions

blacklist.bed

Blacklisted regions are regions of the genome with anomalous coverage. We use the regions for the hg19 assembly, taken from the ENCODE project portal. These regions were identified with DNase and ChIP-seq samples over ~60 human tissues/cell types, and had a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability.
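
A blacklist of this kind is typically used to discard variants that fall inside the listed intervals. The sketch below shows one way to test a position against BED intervals, keeping in mind that BED coordinates are 0-based and half-open while VCF positions are 1-based. The function names and the example intervals are hypothetical and are not taken from `blacklist.bed` or from the pipeline.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_blacklist(chrom, pos, regions):
    """Return True if a 1-based position (as in VCF) lies in any region."""
    zero_based = pos - 1
    return any(start <= zero_based < end for start, end in regions.get(chrom, []))


# Made-up intervals for illustration only.
bed = ["chr22\t100\t200\n", "chr22\t500\t600\n"]
regions = load_bed(bed)
print(in_blacklist("chr22", 101, regions))  # first base of the first interval
print(in_blacklist("chr22", 201, regions))  # one base past the end of the first interval
```

The linear scan is adequate for a small blacklist; for large interval sets a sorted list with binary search or an interval tree would be a more efficient choice.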