Saltar a contenido

Part 2: Hello GATK

The GATK (Genome Analysis Toolkit) is a widely used software package developed by the Broad Institute to analyze high-throughput sequencing data. We're going to use GATK and a related tool, Samtools, in a very basic pipeline that identifies genomic variants through a method called variant calling.

GATK pipeline

Note

Don't worry if you're not familiar with GATK or genomics in general. We'll summarize the necessary concepts as we go, and the workflow implementation principles we demonstrate here apply broadly to any command line tool that takes in some input files and produce some output files.

A full variant calling pipeline typically involves a lot of steps. For simplicity, we are only going to look at the core variant calling steps.

Method overview

  1. Generate an index file for each BAM input file using Samtools
  2. Run the GATK HaplotypeCaller on each BAM input file to generate per-sample variant calls in GVCF (Genomic Variant Call Format)
eyJ2ZXJzaW9uIjoiMSIsImVuY29kaW5nIjoiYnN0cmluZyIsImNvbXByZXNzZWQiOnRydWUsImVuY29kZWQiOiJ4nOVb62/ayFx1MDAxNv+ev1wiYr8203k/Kl2t8mxh82CTkLZZrSpcdTAwMTdcZnFibK8xXHUwMDA0sur/fs84XHUwMDA027xzXHQt6LpSXHUwMDEybDM+Pj6/x5lx/93Z3S0lg8gtfdgtuf2643uN2HksvbP7e27c8cJcdTAwMDBcdTAwMGXR9HMn7Mb19My7JIk6XHUwMDFm3r9vO/GDm0S+U3dRz+t0XHUwMDFkv5N0XHUwMDFiXojqYfu9l7jtzu/257nTdv9cdTAwMTOF7UZcdTAwMTKj7Fwie27DS8L4+Vqu77bdIOnA6H/B593df9Ofuehit544Qct30y+kh3JcdTAwMDFiMr73PFxm0mCJJsJcdTAwMTDOmVx1MDAxZZ3hdY7geonbgMNNiNnNjthdpYvPg4/V/oN/453+UVVcdHeaXf8su2zT8/2rZOCnYXVCuJvsWCeJw1x1MDAwN/ez10juXvKW2z/rW3HYbd1cdTAwMDVux95+diNh5NS9ZGD3YTza+5yDXHUwMDBmu9mePnximCCKXHUwMDE5oWK0O/2iNohSKWRu/3Moh6FcdTAwMGa5h1B+I679l1x1MDAwNfPdqT+0IKKgkZ3jiIZuNrNzXHUwMDFlhzdIXHUwMDA0R8Ko4lx1MDAwNe5cXK91l8BRYZBcdTAwMTJKXHUwMDEzyk26aZlcdTAwMDXhptknjGmmMclcdTAwMWSxl47KjbRcdTAwMTL+znJcdTAwMWVDXHKV7VeCru/nXHUwMDEzXHUwMDE3NIaJe6mYrGbYcM+P7N7s+cfjtZavt0LNJW4/XHUwMDE53XSuQM7vL8+/db1q5as8L6uz2lx1MDAxZWs/XpRG5/14N33Y5y+zcvn7qds6ifig5/lcdTAwMTH9XHUwMDFhXHUwMDFjfYuKV3m5vlx1MDAxM8fhY27c4V9ZWrpRw3muY6JcYiFcdTAwMTRzTFxiz0rD94KH8Zz5Yf0hK/2dXFzAXHUwMDEzmCvcf1x1MDAwZW5E4JlwY5RIzTWWS8NtejY3XHUwMDFibowhjIkxhCpiXHUwMDE0lbJcdTAwMDA7iiWSWHCp1Uq4S2In6EROXGbVOok9g5FcdTAwMDVcdTAwMTVlXG6n1zKTXHUwMDAwzKN+XGI4qTWHja6Ot8KBXHRgza1RzYx8RY1mUYVBcuU9uWmKXHUwMDBie0+ctudcdTAwMGZcbs8zLV5IYjlI3LhcdTAwMDfjlFxuh/Z9r2VruVR37Vx0hTJPPFx1MDAxMKnRXHRtr9HIy05cdTAwMWSu53iBXHUwMDFil5dRizD2Wl7g+Nezw4FcZrifXp5cdTAwMTnwuMjVRMe1R1Mqn1x1MDAwYtW58sjkbLxqaqhkfGm0SufoLDis3l5cdTAwMTAl2/29j5WquL3ebLRcbiyQ1loqksdj+l1cdTAwMDBcdTAwMTFlnK1cdTAwMDbTtckjsDnW3OSezjao45eaU7usxPK4rconbNDTXHUwMDA31aS2rDpe0XK72dmLa175qtfpq5v20/n1dqkjIzPRpkFcdTAwMTmwXHUwMDE0ZnkvOj2Zm1xyN0mQNErg8cp/lkaFJJhcdTAwMDRNVrOkc6WRU2RcdTAwMTRcdTAwMDBcblx1MDAxN1x1MDAxNXieLmJFscFcZq+OtG3RxYP9szUp4lx1MDAwMolcdTAwMThXxEIg69dCPFx1MDAxM51MKyMk18tb10rtoqKe7ox3XHUwMDE39T/XP7ukX6nxzUanXCJcdTAwMThJKTGZgKZtXHUwMDE2N1NcdTAwMDdcdIaIwWfnevhtXHUwMDEwwq/JzWnl8PtVpZaU44POXfvwXHUwMDBmXFx/RZv4z9FFs/74VD87co4/fTnvdb5tl1x1MDAxMFI5c1ZGXHUwMDBiolx1MDAwNGVcdTAwMDYvXHK16cncbKhxjLjgWiu4UcKkmmhcdTAwMTK10lx1MDAwNq84OTO/SVx1MDAxNCjfplxuOVx0vyliqLVcdTAwMDFcIiS/XFxcZtUrKnQ1Mbx0my5ksO6uSVx1MDAxMlx1MDAxN1xixbgkTlx0Z+3CSHKudLxJxNJ6I62Wb1x1MDAxM1x1MDAxZs/KqnJ64Vx1MDAwNia67Fx1MDAxZP/TdJnpss2GK1x1MDAxN1x1MDAxODGulMFKSVxu4C3KI6NcdTAwMTQxubpz/a3ZdOvGTGKVSomolTrNueKCTJnQUVxmnDUwt5xcdTAwMDStgFaRXHUwMDE1prm3QVwio9rlQzlcdTAwMWGIVrP2Z0s/3Vx1MDAxY/fPbqs/oVecO+7Jsfel+v22e9P+0lx1MDAwZv683m9cdTAwMWST7sVcdTAwMWKMu8LM76+zXG5MW+yv2yowbMb3vpCPxVwiIVxmZ0SwiHumXHUwMDE31UZzXHUwMDBmNMuIXHUwMDFhXHUwMDA2XCItXHUwMDE50UyYjGlT7uFcdTAwMTIprI1eeapqXHUwMDE2+Vx1MDAxMK5cdTAwMTFcXF5cdTAwMGKuhiS3lFNggikpKXmDXHUwMDE5qm2xXG6fnMhcdTAwMGZtPVx1MDAxZjq+n/dcdTAwMDRvalx1MDAxOFx1MDAxNujnuGGYXHUwMDE51NptXHUwMDAzzbVrY8g14HqpIHz5fvrmI9u/P7rofb29XHUwMDE3g2vhtI9cdTAwMDdccmfTkTtrcplLvtGLr+BcdTAwMTWoxtCgbJVjuI+Md8pPXHUwMDA2x/dlPrhcbipcdTAwMTfV8HTwXHUwMDEzlH2xUv6kpprPxFx1MDAxYmFESoGxWN6mT8/mZlx1MDAwM04oZPiM6WWuyfpXXqFP4K+bXuZEMq3Ir192/Xky+fHm8GRN2rhAJca1sVx1MDAxOMlbXGLiMzdMXHUwMDAxJ8d8fO9IXGaxkpzJV2BzfnuzmdjUXGZhwKAklChMxdhrXHUwMDExoElcZrztujyshSWXhlNlbalQRk7TRTHUxSnTXkqA95breVx1MDAxOYkuXHUwMDE2xHldmFF5XHUwMDA1f1x1MDAwNXg7iVx1MDAxMydcdTAwMDde0PCCVjGw4Tt3y6zZ2FQ4ka1iVFgtT2mg3rXR72GEMaWY2DVrTTRcdTAwMDdLJPXETbtBY3Ew883vKFx1MDAxOIWoNFwi/9LbeDxcdTAwMDZIj2Jcbo1cdTAwMTNcdTAwMTNSXHUwMDEwM1x1MDAxMY7vdJLDsN32XHUwMDEySHc19IJkPK1p/vYt4O9cXGfimcPt5I+NM0NkRyyaneyv3VxmPOmH0d9/v5t69t6c+rZbVtnZSDv536+mM0pme1xyzTBYX0OXN/fzzdfG8plcdTAwMTBYK8pcdTAwMTlVkmSjpGZcdTAwMDOD2Vx1MDAxOF9Je1tCY1x1MDAxYdhcdTAwMTJSzVx1MDAwNTVcdTAwMTLn1CXjM4k4N9SSXHUwMDE3ttuEzVx1MDAwN3xgQd5kNn/rWG1JXCJcdTAwMTFcdTAwMTOeMUckOH3Tj3LIoeJQ8lx1MDAxNFxugf9vvDbfuIzCkYiQQjtZjIZQQ4TBUmOsXHLmZpJkt4nVZte43Saq+62ojecmuMYnXHUwMDFjhVx1MDAxNvRcdTAwMTUvXHUwMDAxzJ/W3UheY0pcIlx1MDAwMfVjX1x1MDAxZKWCYzNu1PiKb1x1MDAwMsyeZlx1MDAxNIhSYic4XHK3XHUwMDFjMMWhSYKwlFNcdTAwMTc5jNJSa7GeRY5ccuey+e+ZXHUwMDE2KEJQhYWlM0CUYZrlTnohPJGq1lrtmUTgUKazXHUwMDE4WFx1MDAxOUuj0i5aXHRmMOB/q0mMXG5cdTAwMDQtPqaM2JU7IGWW/zqhSFx1MDAwZlx1MDAxN1x1MDAwZVx1MDAxN1xyXHUwMDA0xFx1MDAwZiZcdTAwMGb4ZzRSYaB0NT91vFx1MDAwYlx1MDAwN5pcdDS75SD2VoRcblx1MDAwNTXTK0JmbNyv8Irzl7Q2klOV1lxiulx1MDAxZTulXG5NpJnofVx1MDAxOeJcdTAwMDbygJl9XHUwMDA3XHUwMDFmr8kxXHUwMDEyhpFiikpOIVxmzKes4lxijaRcIlxmmnPK7GQhm3CMlFx1MDAwMHtcdTAwMTg7qf1/SLPz39Qoclx1MDAxOHgvYzQlXHUwMDE0Mm5cdTAwMTgnubOeKZAjyijT0JSlz5yvt1x1MDAxZibAqphcdTAwMDDrM1x1MDAxMHSwrFn3VlBcdTAwMDfKwUhcdTAwMTkwj0pcdTAwMWJojtXkVP02Ue9cdTAwMWWl1qDDXHUwMDFkXHUwMDEx+1qxNNlcdTAwMWOn3Yj9f3Fwm2BcdTAwMWThUXCsXHUwMDE3jmdcdTAwMTRcdTAwMTLWjmqpXHUwMDAwq1xcXHUwMDE3h1x1MDAwM1bV0PZcdTAwMWFcZlx1MDAwNlx1MDAxZFx1MDAwM8pcdTAwMTZcdTAwMGU3XHUwMDEzkXabxOIsOt5cdTAwMTleoeRE0VVcdTAwMDKYXHUwMDE4lVx1MDAwYmDTa1xmp2ezp1Dqee7jwVTmsJtd2Eift6VSN1x1MDAwNeqPnVx1MDAxZv9cdTAwMDUlTVxuXSJ9 IntervalsBAMReferenceHaplotypeCallerGVCF

Dataset

  • A reference genome consisting of a small region of the human chromosome 20 (from hg19/b37) and its accessory files (index and sequence dictionary).
  • Three whole genome sequencing samples corresponding to a family trio (mother, father and son), which have been subset to a small portion on chromosome 20 to keep the file sizes small. The sequencing data is in BAM (Binary Alignment Map) format, i.e. genome sequencing reads that have already been mapped to the reference genome.
  • A list of genomic intervals, i.e. coordinates on the genome where our samples have data suitable for calling variants, provided in BED format.

0. Warmup: Run Samtools and GATK directly

Just like in the Hello World example, we want to try out the commands manually before we attempt to wrap them in a workflow. The difference here is that we're going to use Docker containers to obtain and run the tools.

Note

Make sure you're in the correct working directory: cd /workspace/gitpod/hello-nextflow

0.1. Index a BAM input file with Samtools

0.1.1. Pull the samtools container

docker pull community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

0.1.2. Spin up the container interactively

docker run -it -v ./data:/data community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464

0.1.3. Run the indexing command

samtools index /data/bam/reads_mother.bam

0.1.4. Check that the BAM index has been produced

ls /data/bam/

This should show:

Output
reads_father.bam      reads_mother.bam      reads_mother.bam.bai  reads_son.bam

Where reads_mother.bam.bai has been created as an index to reads_mother.bam.

0.1.5. Exit the container

exit

0.2. Call variants with GATK HaplotypeCaller

0.2.1. Pull the GATK container

docker pull community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

0.2.2. Spin up the container interactively

docker run -it -v ./data:/data community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867

0.2.3. Run the variant calling command

gatk HaplotypeCaller \
        -R /data/ref/ref.fasta \
        -I /data/bam/reads_mother.bam \
        -O reads_mother.g.vcf \
        -L /data/ref/intervals.bed \
        -ERC GVCF

0.2.4. Check the contents of the output file

cat reads_mother.g.vcf

0.2.5. Exit the container

exit

1. Write a single-stage workflow that runs Samtools index on a BAM file

1.1. Define the indexing process

hello-gatk.nf
/*
 * Generate BAM index file
 */
process SAMTOOLS_INDEX {

    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'

    publishDir 'results', mode: 'copy'

    input:
        path input_bam

    output:
        path "${input_bam}.bai"

    """
    samtools index '$input_bam'
    """
}

1.2. Add parameter declarations up top

hello-gatk.nf
/*
 * Pipeline parameters
 */

// Primary input
params.reads_bam = "${projectDir}/data/bam/reads_mother.bam"

1.3. Add workflow block to run SAMTOOLS_INDEX

hello-gatk.nf
workflow {

    // Create input channel (single file via CLI parameter)
    reads_ch = Channel.fromPath(params.reads_bam)

    // Create index file for input BAM file
    SAMTOOLS_INDEX(reads_ch)
}

1.4. Run it to verify you can run the indexing step

nextflow run hello-gatk.nf

Should produce something like:

Output
N E X T F L O W  ~  version 24.02.0-edge
Launching `hello-gatk.nf` [compassionate_cray] DSL2 - revision: 9b97744397
executor >  local (1)
[bf/072bd7] process > SAMTOOLS_INDEX (1) [100%] 1 of 1 ✔

Takeaway

You know how to wrap a real bioinformatics tool in a single-step Nextflow workflow.

What's next?

Add a second step that consumes the output of the first.


2. Add a second step that runs GATK HaplotypeCaller on the indexed BAM file

2.1. Define the variant calling process

hello-gatk.nf
/*
 * Call variants with GATK HaplotypeCaller in GVCF mode
 */
process GATK_HAPLOTYPECALLER {

    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"

    publishDir 'results', mode: 'copy'

    input:
        path input_bam
        path input_bam_index
        path ref_fasta
        path ref_index
        path ref_dict
        path interval_list

    output:
        path "${input_bam}.g.vcf"
        path "${input_bam}.g.vcf.idx"

    """
    gatk HaplotypeCaller \
        -R ${ref_fasta} \
        -I ${input_bam} \
        -O ${input_bam}.g.vcf \
        -L ${interval_list} \
        -ERC GVCF
    """
}

2.2. Add definitions for accessory inputs

hello-gatk.nf
// Accessory files
params.reference        = "${projectDir}/data/ref/ref.fasta"
params.reference_index  = "${projectDir}/data/ref/ref.fasta.fai"
params.reference_dict   = "${projectDir}/data/ref/ref.dict"
params.intervals        = "${projectDir}/data/ref/intervals.bed"

2.3. Make a value channel for each of the accessory files

Add this to the workflow block (after the reads_ch creation):

hello-gatk.nf
// Create channels for the accessory files (reference and intervals)
ref_file        = file(params.reference)
ref_index_file  = file(params.reference_index)
ref_dict_file   = file(params.reference_dict)
intervals_file  = file(params.intervals)

This will load each of the accessory files in its own single-element value channel.

2.4. Add a call to the workflow block to run GATK_HAPLOTYPECALLER

hello-gatk.nf
// Call variants from the indexed BAM file
GATK_HAPLOTYPECALLER(
    reads_ch,
    SAMTOOLS_INDEX.out,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)

2.4. Run the workflow to verify that the variant calling step works

nextflow run hello-gatk.nf -resume

Now we see the two processes being run:

Output
N E X T F L O W  ~  version 24.02.0-edge
Launching `hello-gatk.nf` [lethal_keller] DSL2 - revision: 30a64b9325
executor >  local (2)
[97/0f85bf] process > SAMTOOLS_INDEX (1)       [100%] 1 of 1 ✔
[2d/43c247] process > GATK_HAPLOTYPECALLER (1) [100%] 1 of 1 ✔

If you check the work directory, you'll find the output file reads_mother.bam.g.vcf. Because this is a small test file, you can click on it to open it and view the contents, which consist of 92 lines of header metadata followed by a list of genomic variant calls, one per line.

Note

A GVCF is a special kind of VCF that contains non-variant records as well as variant calls. The first actual variant call in this file occurs at line 325:

20 10040772 . C CT,<NON_REF> 473.03 . DP=22;ExcessHet=0.0000;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=79200,22 GT:AD:DP:GQ:PL:SB 1/1:0,17,0:17:51:487,51,0,488,51,488:0,0,7,10

Takeaway

You know how to make a very basic two-step variant calling workflow.

What's next?

Make the workflow handle multiple samples in bulk.


3. Adapt the workflow to run on a batch of samples

3.1. Turn the input parameter declaration into a list of the three samples

Before:

hello-gatk.nf
// Primary input
params.reads_bam = "${projectDir}/data/bam/reads_mother.bam"

After:

hello-gatk.nf
// Primary input (list of three samples)
params.reads_bam = [
    "${projectDir}/data/bam/reads_mother.bam",
    "${projectDir}/data/bam/reads_father.bam",
    "${projectDir}/data/bam/reads_son.bam"
]

3.2. Run the workflow to verify that it runs on all three samples

nextflow run hello-gatk.nf -resume

Uh-oh! It will probably fail with an error like this:

Output
ERROR ~ Error executing process > 'GATK_HAPLOTYPECALLER (2)'

Caused by:
  Process `GATK_HAPLOTYPECALLER (2)` terminated with an error exit status (2)

Command executed:

  gatk HaplotypeCaller         -R ref.fasta         -I reads_father.bam         -O reads_father.bam.g.vcf         -L intervals.bed         -ERC GVCF

Command exit status:
  2

Command error:

And buried in the GATK command error output, there will be a line like this:

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.

This is because the script as written so far is not safe for running on multiple samples, because the order of items in the output channel is not guaranteed to match the order of items in the original input channel. This causes the wrong files to be paired up in the second step. So we need to make sure the BAM files and their index files travel together through the channels.

3.3. Change the output of the SAMTOOLS_INDEX process into a tuple that keeps the input file and its index together

Before:

hello-gatk.nf
output:
    path "${input_bam}.bai"

After:

hello-gatk.nf
output:
    tuple path(input_bam), path("${input_bam}.bai")

3.4. Change the input to the GATK_HAPLOTYPECALLER process to be a tuple

Before:

hello-gatk.nf
input:
    path input_bam
    path input_bam_index

After:

hello-gatk.nf
input:
    tuple path(input_bam), path(input_bam_index)

3.5. Update the call to GATK_HAPLOTYPECALLER in the workflow block

Before:

hello-gatk.nf
GATK_HAPLOTYPECALLER(
    reads_ch,
    SAMTOOLS_INDEX.out,

After:

hello-gatk.nf
GATK_HAPLOTYPECALLER(
    SAMTOOLS_INDEX.out,

3.6. Run the workflow to verify it works correctly on all three samples now

nextflow run hello-gatk.nf -ansi-log false

This time everything should run correctly:

Output
N E X T F L O W  ~  version 24.02.0-edge
Launching `hello-gatk.nf` [adoring_hopper] DSL2 - revision: 8cad21ea51
[e0/bbd6ef] Submitted process > SAMTOOLS_INDEX (3)
[71/d26b2c] Submitted process > SAMTOOLS_INDEX (2)
[e6/6cad6d] Submitted process > SAMTOOLS_INDEX (1)
[26/73dac1] Submitted process > GATK_HAPLOTYPECALLER (1)
[23/12ed10] Submitted process > GATK_HAPLOTYPECALLER (2)
[be/c4a067] Submitted process > GATK_HAPLOTYPECALLER (3)

Takeaway

You know how to make a variant calling workflow run on multiple samples (independently).

What's next?

Make it easier to handle samples in bulk.


4. Make it nicer to run on arbitrary samples by using a list of files as input

4.1. Create a text file listing the input paths

sample_bams.txt
/workspace/gitpod/hello-nextflow/data/bam/reads_mother.bam
/workspace/gitpod/hello-nextflow/data/bam/reads_father.bam
/workspace/gitpod/hello-nextflow/data/bam/reads_son.bam

4.2. Update the parameter default

Before:

hello-gatk.nf
// Primary input
params.reads_bam = [
    "${projectDir}/data/bam/reads_mother.bam",
    "${projectDir}/data/bam/reads_father.bam",
    "${projectDir}/data/bam/reads_son.bam"
]

After:

hello-gatk.nf
// Primary input (list of input files, one per line)
params.reads_bam = "${projectDir}/data/sample_bams.txt"

4.3. Update the channel factory to read lines from a file

Before:

hello-gatk.nf
// Create input channel (single file via CLI parameter)
reads_ch = Channel.fromPath(params.reads_bam)

After:

hello-gatk.nf
// Create input channel from list of input files in plain text
reads_ch = Channel.fromPath(params.reads_bam).splitText()

4.4. Run the workflow to verify that it works correctly

nextflow run hello-gatk.nf -resume -ansi-log false

This should produce essentially the same result as before:

Output
N E X T F L O W  ~  version 24.02.0-edge
Launching `hello-gatk.nf` [backstabbing_raman] DSL2 - revision: 5378632b71
[56/5f8548] Cached process > SAMTOOLS_INDEX (1)
[69/7a0ce1] Cached process > SAMTOOLS_INDEX (3)
[f1/6ae50e] Cached process > SAMTOOLS_INDEX (2)
[49/abf910] Cached process > GATK_HAPLOTYPECALLER (1)
[03/4407cd] Cached process > GATK_HAPLOTYPECALLER (2)
[af/2ea71a] Cached process > GATK_HAPLOTYPECALLER (3)

Takeaway

You know how to make a variant calling workflow handle a file containing input samples.

What's next?

Add a joint genotyping step that combines the data from all the samples.


5. Stretch goal: Add joint genotyping step

To complicate matters a little, the GATK variant calling method calls for a consolidation step where we combine and re-analyze the variant calls obtained per sample in order to obtain definitive 'joint' variant calls for a group or cohort of samples (in this case, the family trio).

Joint analysis

This involves using a GATK tool called GenomicsDBImport that combines the per-sample calls into a data store (analogous to a database), followed by another GATK tool, GenotypeGVCFs, which performs the actual 'joint genotyping' analysis. These two tools can be run in series within the same process.

One slight complication is that these tools require the use of individually specified VCF files, and the syntax of the GenomicsDBImport tool looks like this:

hello-gatk.nf
gatk GenomicsDBImport \
    -V sample1.vcf.gz \
    -V sample2.vcf.gz \
    -V sample3.vcf.gz \
    ...

So to perform joint genotyping, we will need to collect all VCF files together in a single process and construct a command line for GenomicsDBImport.

5.1. Write a process called GATK_JOINTGENOTYPING that wraps GenomicsDBImport and GenotypeGVCFs

hello-gatk.nf
/*
 * Consolidate GVCFs and apply joint genotyping analysis
 */
process GATK_JOINTGENOTYPING {

    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"

    publishDir 'results', mode: 'copy'

    input:
        path vcfs
        path idxs
        val cohort_name
        path ref_fasta
        path ref_index
        path ref_dict
        path interval_list

    output:
        path "${cohort_name}.joint.vcf"
        path "${cohort_name}.joint.vcf.idx"

    script:
    """
    gatk GenomicsDBImport \
        -V ${vcfs} \
        --genomicsdb-workspace-path ${cohort_name}_gdb \
        -L ${interval_list}

    gatk GenotypeGVCFs \
        -R ${ref_fasta} \
        -V gendb://${cohort_name}_gdb \
        -O ${cohort_name}.joint.vcf \
        -L ${interval_list}
    """
}

5.2. Add default value for the cohort name parameter up top

hello-gatk.nf
// Base name for final output file
params.cohort_name = "family_trio"

5.3. Gather the outputs of GATK_HAPLOTYPECALLER across samples using collect()

We collect the VCFs and their indices separately in order to list only the VCFs in the command we're going to construct. Since we'll give all of those files together to the joint genotyping process, we don't have to worry about the order of files like we did earlier.

Add this after the call to GATK_HAPLOTYPECALLER:

hello-gatk.nf
// Collect variant calling outputs across samples
all_vcfs = GATK_HAPLOTYPECALLER.out[0].collect()
all_tbis = GATK_HAPLOTYPECALLER.out[1].collect()

Tip

You can view the contents of the channel after performing this collect operation using .view()

5.4. Add a call to the workflow block to run GATK_JOINTGENOTYPING

hello-gatk.nf
// Consolidate GVCFs and apply joint genotyping analysis
GATK_JOINTGENOTYPING(
    all_vcfs,
    all_tbis,
    params.cohort_name,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)

5.5. Run the workflow

nextflow run hello-gatk.nf -resume

Oh no! The pipeline produces an error. When we dig into the console output, we can see the command executed isn't correct:

Command executed:

  gatk GenomicsDBImport -V reads_mother.bam.g.vcf reads_father.bam.g.vcf reads_son.bam.g.vcf --genomicsdb-workspace-path family_trio_gdb -L intervals.bed

  gatk GenotypeGVCFs -R ref.fasta -V gendb://family_trio_gdb -O family_trio.joint.vcf -L intervals.bed

Can you spot the error? We gave gatk GenomicsDBImport multiple VCF files for a single -V argument, but the tool expects a separate -V argument for each VCF file.

5.6. Construct a command line with a separate -V argument for each input VCF

We use some string manipulations to repeat the VCFs with the argument -V, and replace -V ${vcfs} with the resulting ${vcfs_line}:

Before:

hello-gatk.nf
    script:
    """
    gatk GenomicsDBImport \
        -V ${vcfs} \
        --genomicsdb-workspace-path ${cohort_name}_gdb \
        -L ${interval_list}

After:

hello-gatk.nf
    script:
    def vcfs_line = vcfs.collect { "-V ${it}" }.join(' ')
    """
    gatk GenomicsDBImport \
        ${vcfs_line} \
        --genomicsdb-workspace-path ${cohort_name}_gdb \
        -L ${interval_list}

5.7. Run the workflow to verify that it generates the final VCF output as expected

nextflow run hello-gatk.nf -resume

Now we see the additional process show up in the log output (showing the compact view):

Output
N E X T F L O W  ~  version 24.02.0-edge
Launching `hello-gatk.nf` [nauseous_thompson] DSL2 - revision: b346a53aae
executor >  local (7)
[d1/43979a] process > SAMTOOLS_INDEX (2)       [100%] 3 of 3 ✔
[20/247592] process > GATK_HAPLOTYPECALLER (3) [100%] 3 of 3 ✔
[14/7145b6] process > GATK_JOINTGENOTYPING (1) [100%] 1 of 1 ✔

You can find the final output file, family_trio.joint.vcf, in the work directory for the last process. Click on it to open it and you'll see 40 lines of metadata header followed by just under 30 jointly genotyped variant records (meaning at least one of the family members has a variant genotype at each genomic position listed).

Tip

Keep in mind the data files covered only a tiny portion of chromosome 20; the real size of a variant callset would be counted in millions of variants. That's why we use only tiny subsets of data for training purposes!

Takeaway

You know how to make a joint variant calling workflow that outputs a cohort VCF.

What's next?

Celebrate your success and take an extra long break! This was tough and you deserve it.

In future trainings, you'll learn more sophisticated methods for managing inputs and outputs, as well as modularizing your code, testing it, and configuring execution.

Good luck!