Part 1: Method overview¶
Variant calling is a genomic analysis method that aims to identify variations in a genome sequence relative to a reference genome. Here we are going to use tools and methods designed for calling short germline variants, i.e. SNPs and indels, in whole-genome sequencing data.

A full variant calling pipeline typically involves a lot of steps, including mapping to the reference (sometimes referred to as genome alignment) and variant filtering and prioritization. For simplicity, in this course we are going to focus on just the variant calling part.
Methods¶
We're going to show you two ways to apply variant calling to whole-genome sequencing samples to identify germline SNPs and indels. First we'll start with a simple per-sample approach that calls variants independently from each sample. Then we'll show you a more sophisticated joint calling approach that analyzes multiple samples together, producing more accurate and informative results.
Before we dive into writing any workflow code for either approach, we are going to try out the commands manually on some test data.
Dataset¶
We provide the following data and related resources:
- A reference genome consisting of a small region of the human chromosome 20 (from hg19/b37) and its accessory files (index and sequence dictionary).
- Three whole genome sequencing samples corresponding to a family trio (mother, father and son), which have been subset to a small slice of data on chromosome 20 to keep the file sizes small. This is Illumina short-read sequencing data that have already been mapped to the reference genome, provided in BAM format (Binary Alignment Map, a compressed version of SAM, Sequence Alignment Map).
- A list of genomic intervals, i.e. coordinates on the genome where our samples have data suitable for calling variants, provided in BED format.
Software¶
The two main tools involved are Samtools, a widely used toolkit for manipulating sequence alignment files, and GATK (Genome Analysis Toolkit), a set of tools for variant discovery developed at the Broad Institute.
These tools are not installed in the GitHub Codespaces environment, so we'll use them via containers retrieved via the Seqera Containers service (see Hello Containers).
Tip
Make sure you're in the nf4-science/genomics directory so that the last part of the path shown when you type pwd is genomics.
1. Per-sample variant calling¶
Per-sample variant calling processes each sample independently: the variant caller examines the sequencing data for one sample at a time and identifies positions where the sample differs from the reference.
In this section we test the two commands that make up the per-sample variant calling approach: indexing a BAM file with Samtools and calling variants with GATK HaplotypeCaller. These are the commands we'll wrap into a Nextflow workflow in Part 2 of this course.
- Generate an index file for a BAM input file using Samtools
- Run the GATK HaplotypeCaller on the indexed BAM file to generate per-sample variant calls in VCF (Variant Call Format)
We start by testing the two commands on just one sample.
1.1. Index a BAM input file with Samtools¶
Index files are a common feature of bioinformatics file formats; they contain information about the structure of the main file that allows tools like GATK to access a subset of the data without having to read through the whole file. This is important because of how large these files can get.
BAM files are often provided without an index, so the first step in many analysis workflows is to generate one using samtools index.
We're going to pull down a Samtools container, spin it up interactively and run the samtools index command on one of the BAM files.
1.1.1. Pull the Samtools container¶
Run the docker pull command to download the Samtools container image:
Command output
1.20--b5dfbd93de237464: Pulling from library/samtools
6360b3717211: Pull complete
2ec3f7ad9b3c: Pull complete
7716ca300600: Pull complete
4f4fb700ef54: Pull complete
8c61d418774c: Pull complete
03dae77ff45c: Pull complete
aab7f787139d: Pull complete
4f4fb700ef54: Pull complete
837d55536720: Pull complete
897362c12ca7: Pull complete
3893cbe24e91: Pull complete
d1b61e94977b: Pull complete
c72ff66fb90f: Pull complete
0e0388f29b6d: Pull complete
Digest: sha256:bbfc45b4f228975bde86cba95e303dd94ecf2fdacea5bfb2e2f34b0d7b141e41
Status: Downloaded newer image for community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464
community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464
If you haven't downloaded this image before, it may take a minute to complete. Once it's done, you have a local copy of the container image.
1.1.2. Spin up the Samtools container interactively¶
To run the container interactively, use docker run with the -it flags.
The -v ./data:/data option mounts the local data directory into the container so the tools can access the input files.
You'll notice your prompt changes to something like (base) root@a1b2c3d4e5f6:/tmp#, indicating you are now inside the container.
Verify that you can see the sequence data files under /data/bam:
With that, you are ready to try your first command.
1.1.3. Run the indexing command¶
The Samtools documentation gives us the command line to run to index a BAM file.
We only need to provide the input file; the tool will automatically generate a name for the output by appending .bai to the input filename.
Run the samtools index command on one data file:
The command does not produce any output in the terminal, but you should now see a file called reads_mother.bam.bai in the same directory as the original BAM input file.
Directory contents
That completes the testing of the first step.
1.1.4. Exit the Samtools container¶
To exit the container, type exit.
Your prompt should now be back to what it was before you started the container.
1.2. Call variants with GATK HaplotypeCaller¶
We want to run the gatk HaplotypeCaller command on the BAM file we just indexed.
1.2.1. Pull the GATK container¶
First, let's run the docker pull command to download the GATK container image:
Command output
Some layers show Already exists because they are shared with the Samtools container image we pulled earlier.
4.5.0.0--730ee8817e436867: Pulling from library/gatk4
6360b3717211: Already exists
2ec3f7ad9b3c: Already exists
7716ca300600: Already exists
4f4fb700ef54: Already exists
8c61d418774c: Already exists
03dae77ff45c: Already exists
aab7f787139d: Already exists
4f4fb700ef54: Already exists
837d55536720: Already exists
897362c12ca7: Already exists
3893cbe24e91: Already exists
d1b61e94977b: Already exists
e5c558f54708: Pull complete
087cce32d294: Pull complete
Digest: sha256:e33413b9100f834fcc62fd5bc9edc1e881e820aafa606e09301eac2303d8724b
Status: Downloaded newer image for community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867
community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867
This should be faster than the first pull because the two container images share most of their layers.
1.2.2. Spin up the GATK container interactively¶
Spin up the GATK container interactively with the data directory mounted, just as we did for Samtools.
Your prompt changes to indicate you are now inside the GATK container.
1.2.3. Run the variant calling command¶
The GATK documentation gives us the command line to run to perform variant calling on a BAM file.
We need to provide the BAM input file (-I) as well as the reference genome (-R), a name for the output file (-O) and a list of genomic intervals to analyze (-L).
However, we don't need to specify the path to the index file; the tool will automatically look for it in the same directory, based on the established naming and co-location convention.
The same applies to the reference genome's accessory files (index and sequence dictionary files, *.fai and *.dict).
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_mother.bam \
-O /data/vcf/reads_mother.vcf \
-L /data/ref/intervals.bed
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.vcf -L /data/ref/intervals.bed
00:27:50.687 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
00:27:50.854 INFO HaplotypeCaller - ------------------------------------------------------------
00:27:50.858 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
00:27:50.858 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
00:27:50.858 INFO HaplotypeCaller - Executing as root@a1fe8ff42d07 on Linux v6.10.14-linuxkit amd64
00:27:50.858 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
00:27:50.859 INFO HaplotypeCaller - Start Date/Time: February 8, 2026 at 12:27:50 AM GMT
00:27:50.859 INFO HaplotypeCaller - ------------------------------------------------------------
00:27:50.859 INFO HaplotypeCaller - ------------------------------------------------------------
00:27:50.861 INFO HaplotypeCaller - HTSJDK Version: 4.1.0
00:27:50.861 INFO HaplotypeCaller - Picard Version: 3.1.1
00:27:50.861 INFO HaplotypeCaller - Built for Spark Version: 3.5.0
00:27:50.862 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
00:27:50.862 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
00:27:50.862 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
00:27:50.863 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
00:27:50.864 INFO HaplotypeCaller - Deflater: IntelDeflater
00:27:50.864 INFO HaplotypeCaller - Inflater: IntelInflater
00:27:50.864 INFO HaplotypeCaller - GCS max retries/reopens: 20
00:27:50.864 INFO HaplotypeCaller - Requester pays: disabled
00:27:50.865 INFO HaplotypeCaller - Initializing engine
00:27:50.991 INFO FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
00:27:51.016 INFO IntervalArgumentCollection - Processing 6369 bp from intervals
00:27:51.029 INFO HaplotypeCaller - Done initializing engine
00:27:51.040 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
00:27:51.042 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
00:27:51.042 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
00:27:51.046 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
00:27:51.063 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
00:27:51.085 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
00:27:51.086 INFO IntelPairHmm - Available threads: 10
00:27:51.086 INFO IntelPairHmm - Requested threads: 4
00:27:51.086 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
00:27:51.128 INFO ProgressMeter - Starting traversal
00:27:51.136 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
00:27:51.882 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
00:27:52.969 INFO HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
00:27:52.971 INFO ProgressMeter - 20_10037292_10066351:13499 0.0 35 1145.7
00:27:52.971 INFO ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
00:27:52.976 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.003346916
00:27:52.976 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.045731709
00:27:52.977 INFO SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
00:27:52.981 INFO HaplotypeCaller - Shutting down engine
[February 8, 2026 at 12:27:52 AM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=203423744
The log output is very verbose, so we've highlighted the most relevant lines in the example above.
The output files, reads_mother.vcf and its index file, reads_mother.vcf.idx, are created inside your working directory in the container.
The VCF file contains the variant calls, as we'll see in a minute, and the index file has the same function as the BAM index file, to allow tools to seek and retrieve subsets of data without loading in the entire file.
Since VCF is a text format and this one is a small test file, you can run cat reads_mother.vcf to open it and view its contents.
If you scroll back up to the start of the file, you'll find a header composed of many lines of metadata, followed by a list of variant calls, one per line.
File contents (abridged)
In the example output above, we've highlighted the last header line, which gives the names of the columns for the tabular data that follows. Each line of data describes a possible variant identified in the sample's sequencing data. For guidance on interpreting VCF format, see this helpful article.
1.2.4. Move the output files¶
Anything that remains inside the container will be inaccessible to future work.
The BAM index file was created directly in the /data/bam directory on the mounted filesystem, but not the VCF file and its index, so we need to move those two manually.
Directory contents
Once that's done, all the files are now accessible in your normal filesystem.
1.2.5. Exit the GATK container¶
To exit the container, type exit.
Your prompt should go back to normal. That concludes the per-sample variant calling test.
Write it as a workflow!
Feel free to move on to Part 2 right away if you'd like to get started implementing this analysis as a Nextflow workflow. You'll just need to come back to complete the second round of testing before moving on to Part 3.
2. Joint calling on a cohort¶
The variant calling approach we just used generates variant calls per sample. That's fine for looking at variants from each sample in isolation, but it yields limited information. It's often more interesting to look at how variant calls differ across multiple samples. GATK offers an alternative method called joint variant calling for this purpose.
Joint variant calling involves generating a special kind of variant output called GVCF (for Genomic VCF) for each sample, then combining the GVCF data from all the samples and running a 'joint genotyping' statistical analysis.

What's special about a sample's GVCF is that it contains records summarizing sequence data statistics about all positions in the targeted area of the genome, not just the positions where the program found evidence of variation. This is critical for the joint genotyping calculation (further reading).
The GVCF is produced by GATK HaplotypeCaller, the same tool we just tested, with an additional parameter (-ERC GVCF).
Combining the GVCFs is done with GATK GenomicsDBImport, which combines the per-sample calls into a data store (analogous to a database).
The actual 'joint genotyping' analysis is then done with GATK GenotypeGVCFs.
Here we test the commands needed to generate GVCFs and run joint genotyping. These are the commands we'll wrap into a Nextflow workflow in Part 3 of this course.
- Generate an index file for each BAM input file using Samtools
- Run the GATK HaplotypeCaller on each BAM input file to generate a GVCF of per-sample genomic variant calls
- Collect all the GVCFs and combine them into a GenomicsDB data store
- Run joint genotyping on the combined GVCF data store to produce a cohort-level VCF
We now need to test all of these commands, starting with indexing all three BAM files.
2.1. Index BAM files for all three samples¶
In the first section above, we only indexed one BAM file. Now we need to index all three samples so that GATK HaplotypeCaller can process them.
2.1.1. Spin up the Samtools container interactively¶
We already pulled the Samtools container image, so we can spin it up directly:
Your prompt changes to indicate you are inside the container, with the data directory mounted as before.
2.1.2. Run the indexing command on all three samples¶
Run the indexing command on each of the three BAM files:
samtools index /data/bam/reads_mother.bam
samtools index /data/bam/reads_father.bam
samtools index /data/bam/reads_son.bam
Directory contents
This should produce the index files in the same directory as the corresponding BAM files.
2.1.3. Exit the Samtools container¶
To exit the container, type exit.
Your prompt should be back to normal.
2.2. Generate GVCFs for all three samples¶
To run the joint genotyping step, we need GVCFs for all three samples.
2.2.1. Spin up the GATK container interactively¶
We already pulled the GATK container image earlier, so we can spin it up directly:
Your prompt changes to indicate you are inside the GATK container.
2.2.2. Run the variant calling command with the GVCF option¶
In order to produce a genomic VCF (GVCF), we add the -ERC GVCF option to the base command, which switches on the HaplotypeCaller's GVCF mode.
We also change the file extension for the output file from .vcf to .g.vcf.
This is technically not a requirement, but it is a strongly recommended convention.
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_mother.bam \
-O reads_mother.g.vcf \
-L /data/ref/intervals.bed \
-ERC GVCF
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_mother.bam -O reads_mother.g.vcf -L /data/ref/intervals.bed -ERC GVCF
16:51:00.620 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:51:00.749 INFO HaplotypeCaller - ------------------------------------------------------------
16:51:00.751 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
16:51:00.751 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
16:51:00.751 INFO HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
16:51:00.751 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
16:51:00.752 INFO HaplotypeCaller - Start Date/Time: February 11, 2026 at 4:51:00 PM GMT
16:51:00.752 INFO HaplotypeCaller - ------------------------------------------------------------
16:51:00.752 INFO HaplotypeCaller - ------------------------------------------------------------
16:51:00.752 INFO HaplotypeCaller - HTSJDK Version: 4.1.0
16:51:00.753 INFO HaplotypeCaller - Picard Version: 3.1.1
16:51:00.753 INFO HaplotypeCaller - Built for Spark Version: 3.5.0
16:51:00.753 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:51:00.753 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:51:00.753 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:51:00.754 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:51:00.754 INFO HaplotypeCaller - Deflater: IntelDeflater
16:51:00.754 INFO HaplotypeCaller - Inflater: IntelInflater
16:51:00.754 INFO HaplotypeCaller - GCS max retries/reopens: 20
16:51:00.754 INFO HaplotypeCaller - Requester pays: disabled
16:51:00.755 INFO HaplotypeCaller - Initializing engine
16:51:00.893 INFO FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
16:51:00.905 INFO IntervalArgumentCollection - Processing 6369 bp from intervals
16:51:00.910 INFO HaplotypeCaller - Done initializing engine
16:51:00.912 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
16:51:00.917 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
16:51:00.919 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
16:51:00.919 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
16:51:00.923 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
16:51:00.923 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
16:51:00.933 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
16:51:00.945 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
16:51:00.945 INFO IntelPairHmm - Available threads: 4
16:51:00.945 INFO IntelPairHmm - Requested threads: 4
16:51:00.945 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
16:51:00.984 INFO ProgressMeter - Starting traversal
16:51:00.985 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
16:51:01.452 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
16:51:02.358 INFO HaplotypeCaller - 7 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
7 total reads filtered out of 1867 reads processed
16:51:02.359 INFO ProgressMeter - 20_10037292_10066351:13499 0.0 35 1529.5
16:51:02.359 INFO ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
16:51:02.361 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0022800000000000003
16:51:02.361 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.061637120000000004
16:51:02.361 INFO SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
16:51:02.362 INFO HaplotypeCaller - Shutting down engine
[February 11, 2026 at 4:51:02 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=257949696
This creates the GVCF output file reads_mother.g.vcf in the current working directory in the container, as well as its index file, reads_mother.g.vcf.idx.
If you run head -200 reads_mother.g.vcf to view the first 200 lines of the file's contents, you'll see it's much longer than the equivalent VCF we generated in the first section, and most of the lines look quite different from what we saw in the VCF.
File contents (abridged)
| reads_mother.g.vcf | |
|---|---|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 | |
We've once again highlighted the last header line, as well as the first three 'proper' variant calls in the file.
You'll notice the variant call lines are interspersed with many non-variant lines, which represent non-variant regions where the variant caller found no evidence of variation. As mentioned briefly above, this is what is special about the GVCF mode of variant calling: the variant called captures some statistics describing its level of confidence in the absence of variation. This makes it possible to distinguish between two very different case figures: (1) there is good quality data showing that the sample is homozygous-reference, and (2) there is not enough good data available to make a determination either way.
In a GVCF like this, there are typically lots of such non-variant lines, with a smaller number of variant records sprinkled among them.
2.2.3. Repeat the process on the other two samples¶
Now let's generate GVCFs for the remaining two samples by running the commands below, one after another.
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_father.bam \
-O reads_father.g.vcf \
-L /data/ref/intervals.bed \
-ERC GVCF
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_father.bam -O reads_father.g.vcf -L /data/ref/intervals.bed -ERC GVCF
17:28:30.677 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:28:30.801 INFO HaplotypeCaller - ------------------------------------------------------------
17:28:30.803 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:28:30.804 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:28:30.804 INFO HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:28:30.804 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:28:30.804 INFO HaplotypeCaller - Start Date/Time: February 11, 2026 at 5:28:30 PM GMT
17:28:30.804 INFO HaplotypeCaller - ------------------------------------------------------------
17:28:30.804 INFO HaplotypeCaller - ------------------------------------------------------------
17:28:30.805 INFO HaplotypeCaller - HTSJDK Version: 4.1.0
17:28:30.805 INFO HaplotypeCaller - Picard Version: 3.1.1
17:28:30.805 INFO HaplotypeCaller - Built for Spark Version: 3.5.0
17:28:30.806 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:28:30.806 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:28:30.806 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:28:30.806 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:28:30.806 INFO HaplotypeCaller - Deflater: IntelDeflater
17:28:30.807 INFO HaplotypeCaller - Inflater: IntelInflater
17:28:30.807 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:28:30.807 INFO HaplotypeCaller - Requester pays: disabled
17:28:30.807 INFO HaplotypeCaller - Initializing engine
17:28:30.933 INFO FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:28:30.946 INFO IntervalArgumentCollection - Processing 6369 bp from intervals
17:28:30.951 INFO HaplotypeCaller - Done initializing engine
17:28:30.953 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
17:28:30.957 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:30.959 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
17:28:30.960 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
17:28:30.963 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
17:28:30.963 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
17:28:30.972 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
17:28:30.987 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
17:28:30.987 INFO IntelPairHmm - Available threads: 4
17:28:30.987 INFO IntelPairHmm - Requested threads: 4
17:28:30.987 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:28:31.034 INFO ProgressMeter - Starting traversal
17:28:31.034 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
17:28:31.570 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
17:28:32.865 INFO HaplotypeCaller - 9 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
9 total reads filtered out of 2064 reads processed
17:28:32.866 INFO ProgressMeter - 20_10037292_10066351:13338 0.0 38 1245.2
17:28:32.866 INFO ProgressMeter - Traversal complete. Processed 38 total regions in 0.0 minutes.
17:28:32.868 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0035923200000000004
17:28:32.868 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.10765202500000001
17:28:32.868 INFO SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.03 sec
17:28:32.869 INFO HaplotypeCaller - Shutting down engine
[February 11, 2026 at 5:28:32 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=299892736
gatk HaplotypeCaller \
-R /data/ref/ref.fasta \
-I /data/bam/reads_son.bam \
-O reads_son.g.vcf \
-L /data/ref/intervals.bed \
-ERC GVCF
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar HaplotypeCaller -R /data/ref/ref.fasta -I /data/bam/reads_son.bam -O reads_son.g.vcf -L /data/ref/intervals.bed -ERC GVCF
17:30:10.017 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:30:10.156 INFO HaplotypeCaller - ------------------------------------------------------------
17:30:10.159 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:30:10.159 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:30:10.159 INFO HaplotypeCaller - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:30:10.159 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:30:10.159 INFO HaplotypeCaller - Start Date/Time: February 11, 2026 at 5:30:09 PM GMT
17:30:10.159 INFO HaplotypeCaller - ------------------------------------------------------------
17:30:10.160 INFO HaplotypeCaller - ------------------------------------------------------------
17:30:10.160 INFO HaplotypeCaller - HTSJDK Version: 4.1.0
17:30:10.160 INFO HaplotypeCaller - Picard Version: 3.1.1
17:30:10.161 INFO HaplotypeCaller - Built for Spark Version: 3.5.0
17:30:10.161 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:30:10.161 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:30:10.161 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:30:10.161 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:30:10.161 INFO HaplotypeCaller - Deflater: IntelDeflater
17:30:10.162 INFO HaplotypeCaller - Inflater: IntelInflater
17:30:10.162 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:30:10.162 INFO HaplotypeCaller - Requester pays: disabled
17:30:10.162 INFO HaplotypeCaller - Initializing engine
17:30:10.277 INFO FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:30:10.290 INFO IntervalArgumentCollection - Processing 6369 bp from intervals
17:30:10.296 INFO HaplotypeCaller - Done initializing engine
17:30:10.298 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
17:30:10.302 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:30:10.303 INFO NativeLibraryLoader - Loading libgkl_smithwaterman.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_smithwaterman.so
17:30:10.304 INFO SmithWatermanAligner - Using AVX accelerated SmithWaterman implementation
17:30:10.307 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to -0.0 for reference-model confidence output
17:30:10.307 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
17:30:10.315 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
17:30:10.328 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
17:30:10.329 INFO IntelPairHmm - Available threads: 4
17:30:10.329 INFO IntelPairHmm - Requested threads: 4
17:30:10.329 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:30:10.368 INFO ProgressMeter - Starting traversal
17:30:10.369 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
17:30:10.875 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
17:30:11.980 INFO HaplotypeCaller - 14 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
0 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
14 total reads filtered out of 1981 reads processed
17:30:11.981 INFO ProgressMeter - 20_10037292_10066351:13223 0.0 35 1302.7
17:30:11.981 INFO ProgressMeter - Traversal complete. Processed 35 total regions in 0.0 minutes.
17:30:11.983 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0034843710000000004
17:30:11.983 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.048108363
17:30:11.983 INFO SmithWatermanAligner - Total compute time in native Smith-Waterman : 0.02 sec
17:30:11.984 INFO HaplotypeCaller - Shutting down engine
[February 11, 2026 at 5:30:11 PM GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=226492416
Once this completes, you should have three files ending in .g.vcf in your current directory (one per sample) and their respective index files ending in .g.vcf.idx.
Directory contents
At this point, we've called variants in GVCF mode for each of our input samples. It's time to move on to the joint calling.
But don't exit the container! We're going to use the same one in the next step.
2.3. Run joint genotyping¶
Now that we have all the GVCFs, we can try out the joint genotyping approach to generating variant calls for a cohort of samples. It's a two-step method that consists of combining the data from all the GVCFs into a data store, then running the joint genotyping analysis proper to generate the final VCF of joint-called variants.
2.3.1. Combine all the per-sample GVCFs¶
This first step uses another GATK tool, called GenomicsDBImport, to combine the data from all the GVCFs into a GenomicsDB data store. The GenomicsDB data store is a sort of database format that serves as an intermediate storage for the variant information.
gatk GenomicsDBImport \
-V reads_mother.g.vcf \
-V reads_father.g.vcf \
-V reads_son.g.vcf \
-L /data/ref/intervals.bed \
--genomicsdb-workspace-path family_trio_gdb
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenomicsDBImport -V reads_mother.g.vcf -V reads_father.g.vcf -V reads_son.g.vcf -L /data/ref/intervals.bed --genomicsdb-workspace-path family_trio_gdb
17:37:07.569 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:37:07.699 INFO GenomicsDBImport - ------------------------------------------------------------
17:37:07.702 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:37:07.702 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
17:37:07.703 INFO GenomicsDBImport - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:37:07.703 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:37:07.704 INFO GenomicsDBImport - Start Date/Time: February 11, 2026 at 5:37:07 PM GMT
17:37:07.704 INFO GenomicsDBImport - ------------------------------------------------------------
17:37:07.704 INFO GenomicsDBImport - ------------------------------------------------------------
17:37:07.706 INFO GenomicsDBImport - HTSJDK Version: 4.1.0
17:37:07.706 INFO GenomicsDBImport - Picard Version: 3.1.1
17:37:07.707 INFO GenomicsDBImport - Built for Spark Version: 3.5.0
17:37:07.709 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:37:07.709 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:37:07.709 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:37:07.710 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:37:07.710 INFO GenomicsDBImport - Deflater: IntelDeflater
17:37:07.711 INFO GenomicsDBImport - Inflater: IntelInflater
17:37:07.711 INFO GenomicsDBImport - GCS max retries/reopens: 20
17:37:07.711 INFO GenomicsDBImport - Requester pays: disabled
17:37:07.712 INFO GenomicsDBImport - Initializing engine
17:37:07.883 INFO FeatureManager - Using codec BEDCodec to read file file:///data/ref/intervals.bed
17:37:07.886 INFO IntervalArgumentCollection - Processing 6369 bp from intervals
17:37:07.889 INFO GenomicsDBImport - Done initializing engine
17:37:08.560 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
17:37:08.561 INFO GenomicsDBImport - Vid Map JSON file will be written to /tmp/family_trio_gdb/vidmap.json
17:37:08.561 INFO GenomicsDBImport - Callset Map JSON file will be written to /tmp/family_trio_gdb/callset.json
17:37:08.561 INFO GenomicsDBImport - Complete VCF Header will be written to /tmp/family_trio_gdb/vcfheader.vcf
17:37:08.561 INFO GenomicsDBImport - Importing to workspace - /tmp/family_trio_gdb
17:37:08.878 INFO GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.359 INFO GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.487 INFO GenomicsDBImport - Importing batch 1 with 3 samples
17:37:09.591 INFO GenomicsDBImport - Done importing batch 1/1
17:37:09.592 INFO GenomicsDBImport - Import completed!
17:37:09.592 INFO GenomicsDBImport - Shutting down engine
[February 11, 2026 at 5:37:09 PM GMT] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=113246208
Tool returned:
true
The output of this step is effectively a directory containing a set of further nested directories holding the combined variant data in the form of multiple different files. You can poke around it but you'll quickly see this data store format is not intended to be read directly by humans.
Tip
GATK includes tools that make it possible to inspect and extract variant call data from the data store as needed.
2.3.2. Run the joint genotyping analysis proper¶
This second step uses yet another GATK tool, called GenotypeGVCFs, to recalculate variant statistics and individual genotypes in light of the data available across all samples in the cohort.
Command output
Using GATK jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar GenotypeGVCFs -R /data/ref/ref.fasta -V gendb://family_trio_gdb -O family_trio.vcf
17:38:45.084 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:38:45.217 INFO GenotypeGVCFs - ------------------------------------------------------------
17:38:45.220 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
17:38:45.220 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
17:38:45.220 INFO GenotypeGVCFs - Executing as root@be1a0302f6c7 on Linux v6.8.0-1030-azure amd64
17:38:45.220 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.11-internal+0-adhoc..src
17:38:45.221 INFO GenotypeGVCFs - Start Date/Time: February 11, 2026 at 5:38:45 PM GMT
17:38:45.221 INFO GenotypeGVCFs - ------------------------------------------------------------
17:38:45.221 INFO GenotypeGVCFs - ------------------------------------------------------------
17:38:45.221 INFO GenotypeGVCFs - HTSJDK Version: 4.1.0
17:38:45.222 INFO GenotypeGVCFs - Picard Version: 3.1.1
17:38:45.222 INFO GenotypeGVCFs - Built for Spark Version: 3.5.0
17:38:45.222 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:38:45.222 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:38:45.222 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:38:45.222 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:38:45.223 INFO GenotypeGVCFs - Deflater: IntelDeflater
17:38:45.223 INFO GenotypeGVCFs - Inflater: IntelInflater
17:38:45.223 INFO GenotypeGVCFs - GCS max retries/reopens: 20
17:38:45.223 INFO GenotypeGVCFs - Requester pays: disabled
17:38:45.223 INFO GenotypeGVCFs - Initializing engine
17:38:45.544 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.5.1-84e800e
17:38:45.561 INFO NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.561 INFO NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.561 INFO NativeGenomicsDB - pid=221 tid=222 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
17:38:45.577 INFO GenotypeGVCFs - Done initializing engine
17:38:45.615 INFO ProgressMeter - Starting traversal
17:38:45.615 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
17:38:45.903 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position 20_10037292_10066351:3480 and possibly subsequent; at least 10 samples must have called genotypes
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.07757032800000006,Cpu time(s),0.07253379200000037
17:38:46.421 INFO ProgressMeter - 20_10037292_10066351:13953 0.0 3390 252357.3
17:38:46.422 INFO ProgressMeter - Traversal complete. Processed 3390 total variants in 0.0 minutes.
17:38:46.423 INFO GenotypeGVCFs - Shutting down engine
[February 11, 2026 at 5:38:46 PM GMT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=203423744
This creates the VCF output file family_trio.vcf in the current working directory in the container, as well as its index, family_trio.vcf.idx.
It's another reasonably small file, so you can run cat family_trio.vcf to view the file contents, and scroll down to find the first few variant lines.
File contents (abridged)
We've once again highlighted the last header line, which marks the start of the variant call data.
This looks similar to the VCF we generated earlier, except this time we have genotype-level information for all three samples. The last three columns in the file are the genotype blocks for the samples, listed in alphabetical order of their ID field, as shown in the highlighted header line.
If we look at the genotypes called for our test family trio for the very first variant, we see that the father is heterozygous-variant (0/1), and the mother and son are both homozygous-variant (1/1).
That is ultimately the kind of information we're looking to extract from the dataset!
2.3.3. Move the output files¶
As noted previously, anything that remains inside the container will be inaccessible to future work. Before we exit the container, we're going to move the GVCF files, the final multi-sample VCF and all their index files manually to the filesystem outside the container. That way, we'll have something to compare to when we build our workflow to automate all this work.
Directory contents" hl_lines="14-19 22-23
data
├── bam
│ ├── reads_father.bam
│ ├── reads_father.bam.bai
│ ├── reads_mother.bam
│ ├── reads_mother.bam.bai
│ ├── reads_son.bam
│ └── reads_son.bam.bai
├── ref
│ ├── intervals.bed
│ ├── ref.dict
│ ├── ref.fasta
│ └── ref.fasta.fai
├── samplesheet.csv
└── vcf
├── family_trio.vcf
├── family_trio.vcf.idx
├── reads_father.g.vcf
├── reads_father.g.vcf.idx
├── reads_mother.g.vcf
├── reads_mother.g.vcf.idx
├── reads_mother.vcf
├── reads_mother.vcf.idx
├── reads_son.g.vcf
└── reads_son.g.vcf.idx
Once that's done, all the files are now accessible in your normal filesystem.
2.3.4. Exit the GATK container¶
To exit the container, type exit.
Your prompt should be back to normal. That concludes the manual testing of the joint variant calling commands.
Takeaway¶
You know how to test the Samtools indexing and GATK variant calling commands in their respective containers, including how to generate GVCFs and run joint genotyping on multiple samples.
What's next?¶
Take a break, then head on to Part 2 to learn how to wrap those same commands into workflows that use containers to execute the work.