Part 2: Per-sample variant calling¶
In Part 1, you tested the Samtools and GATK commands manually in their respective containers. Next, we're going to wrap those same commands into a Nextflow workflow.
Assignment¶
In this part of the course, we're going to develop a workflow that does the following:
- Generate an index file for each BAM input file using Samtools
- Run the GATK HaplotypeCaller on each BAM input file to generate per-sample variant calls in VCF (Variant Call Format)
This automates the steps from the first section of Part 1: Method overview, where you ran these commands manually in their containers.
As a starting point, we provide you with a workflow file, genomics.nf, that outlines the main parts of the workflow, as well as two module files, samtools_index.nf and gatk_haplotypecaller.nf, that outline the structure of each process.
Scaffold files
#!/usr/bin/env nextflow

// Module INCLUDE statements

/*
 * Pipeline parameters
 */

// Primary input

workflow {

    main:
    // Create input channel

    // Call processes

    publish:
    // Declare outputs to publish
}

output {
    // Configure publish targets
}
These files are not functional; their purpose is just to serve as scaffolds for you to fill in with the interesting parts of the code.
Lesson plan¶
In order to make the development process more educational, we've broken this down into four stages:
- Write a single-stage workflow that runs Samtools index on a BAM file. This covers creating a module, importing it, and calling it in a workflow.
- Add a second process to run GATK HaplotypeCaller on the indexed BAM file. This introduces chaining process outputs to inputs and handling accessory files.
- Adapt the workflow to run on a batch of samples. This covers parallel execution and introduces tuples to keep associated files together.
- Make the workflow accept a text file containing a batch of input files. This demonstrates a common pattern for providing inputs in bulk.
Each step focuses on a specific aspect of workflow development.
Tip
Make sure you're in the correct working directory:
cd /workspaces/training/nf4-science/genomics
1. Write a single-stage workflow that runs Samtools index on a BAM file¶
This first step focuses on the basics: loading a BAM file and generating an index for it.
Recall the samtools index command from Part 1:
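In Part 1, the invocation looked something like this (a reconstruction based on the container paths used there; the exact command may have differed slightly):

```bash
samtools index '/data/bam/reads_mother.bam'
```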
The command takes a BAM file as input and produces a .bai index file alongside it.
The container URI was community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464.
We're going to take this information and wrap it in Nextflow in three stages:
- Set up the input
- Write the indexing process and call it in the workflow
- Configure the output handling
1.1. Set up the input¶
We need to declare an input parameter, create a test profile to provide a convenient default value, and create an input channel.
1.1.1. Add an input parameter declaration¶
In the main workflow file genomics.nf, under the Pipeline parameters section, declare a CLI parameter called input.
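A sketch of what that could look like, assuming the typed params block syntax referenced later in this course (the parameter name input is given; the Path type annotation is an assumption):

```nextflow
/*
 * Pipeline parameters
 */
params {
    // Primary input
    input: Path
}
```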
That sets up the CLI parameter, but we don't want to type out the file path every time we run the workflow during development. There are multiple options for providing a default value; here we use a test profile.
1.1.2. Create a test profile with a default value in nextflow.config¶
A test profile provides convenient default values for trying out a workflow without specifying inputs on the command line. This is a common convention in the Nextflow ecosystem (see Hello Config for more detail).
Add a profiles block to nextflow.config with a test profile that sets the input parameter to one of the test BAM files.
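For example (a sketch; the default test file path is an assumption based on the data files used in Part 1):

```groovy
profiles {
    test {
        params.input = "${projectDir}/data/bam/reads_mother.bam"
    }
}
```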
Here, we're using ${projectDir}, a built-in Nextflow variable that points to the directory where the workflow script is located.
This makes it easy to reference data files and other resources without hardcoding absolute paths.
1.1.3. Set up the input channel¶
In the workflow block, create an input channel from the parameter value using the .fromPath channel factory (as used in Hello Channels).
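A sketch of the channel creation, assuming the channel variable is named reads_ch (a name this tutorial uses later on):

```nextflow
main:
// Create input channel
reads_ch = channel.fromPath(params.input)
```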
Next, we'll need to create the process to run the indexing on this input.
1.2. Write the indexing process and call it in the workflow¶
We need to write the process definition in the module file, import it into the workflow using an include statement, and call it on the input.
1.2.1. Fill in the module for the indexing process¶
Open modules/samtools_index.nf and examine the outline of the process definition.
You should recognize the main structural elements; if not, consider reading through Hello Nextflow for a refresher.
Go ahead and fill in the process definition by yourself using the information provided above, then check your work against the sketch below.
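Here is a minimal sketch of the completed modules/samtools_index.nf (the container URI and command come from Part 1; the input variable name is an assumption):

```nextflow
#!/usr/bin/env nextflow

process SAMTOOLS_INDEX {

    container "community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464"

    input:
    path input_bam

    output:
    // The .bai index file is produced alongside the input BAM
    path "${input_bam}.bai"

    script:
    """
    samtools index '$input_bam'
    """
}
```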
With that, the process definition is complete. To use it in the workflow, you'll need to import the module and add a process call.
1.2.2. Include the module¶
In genomics.nf, add an include statement to make the process available to the workflow:
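For example:

```nextflow
// Module INCLUDE statements
include { SAMTOOLS_INDEX } from './modules/samtools_index.nf'
```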
The process is now available in the workflow scope.
1.2.3. Call the indexing process on the input¶
Now, let's add a call to SAMTOOLS_INDEX in the workflow block, passing the input channel as an argument.
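Assuming the input channel is named reads_ch as above:

```nextflow
// Call processes
SAMTOOLS_INDEX(reads_ch)
```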
The workflow now loads the input and runs the indexing process on it. Next, we need to configure how the output is published.
1.3. Configure the output handling¶
We need to declare which process outputs to publish and specify where they should go.
1.3.1. Declare an output in the publish: section¶
The publish: section inside the workflow block declares which process outputs should be published.
Assign the output of SAMTOOLS_INDEX to a named target called bam_index.
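A sketch of what that could look like:

```nextflow
publish:
bam_index = SAMTOOLS_INDEX.out
```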
Next, we'll need to tell Nextflow where to put the published output.
1.3.2. Configure the output target in the output {} block¶
The output {} block sits outside the workflow and specifies where each named target is published.
Let's add a target for bam_index that publishes into a bam/ subdirectory.
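A sketch, assuming the path directive is used to set a target's publish subdirectory:

```nextflow
output {
    bam_index {
        path 'bam'
    }
}
```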
Note
By default, Nextflow publishes output files as symbolic links, which avoids unnecessary duplication.
Even though the data files we're using here are very small, in genomics they can get very large.
Symlinks will break when you clean up your work directory, so for production workflows you may want to override the default publish mode to 'copy'.
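A sketch of what that override could look like, assuming a mode directive is available inside the output target:

```nextflow
output {
    bam_index {
        path 'bam'
        mode 'copy'    // copy published files instead of symlinking them
    }
}
```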
1.4. Run the workflow¶
At this point, we have a one-step indexing workflow that should be fully functional. Let's test that it works!
We can run it with -profile test to use the default value set up in the test profile and avoid having to write the path on the command line.
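```bash
nextflow run genomics.nf -profile test
```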
Command output
You can check that the index file has been generated correctly by looking in the work directory or in the results directory.
Work directory contents
There it is!
Takeaway¶
You know how to create a module containing a process, import it into a workflow, call it with an input channel, and publish the results.
What's next?¶
Add a second step that takes the output of the indexing process and uses it to run variant calling.
2. Add a second process to run GATK HaplotypeCaller on the indexed BAM file¶
Now that we have an index for our input file, we can move on to setting up the variant calling step.
Recall the gatk HaplotypeCaller command from Part 1:
gatk HaplotypeCaller \
    -R /data/ref/ref.fasta \
    -I /data/bam/reads_mother.bam \
    -O reads_mother.vcf \
    -L /data/ref/intervals.bed
The command takes a BAM file (-I), a reference genome (-R), and an intervals file (-L), and produces a VCF file (-O) along with its index.
The tool also expects the BAM index, reference index, and reference dictionary to be co-located with their respective files.
The container URI was community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867.
We follow the same three stages as before:
- Set up the inputs
- Write the variant calling process and call it in the workflow
- Configure the output handling
2.1. Set up the inputs¶
The variant calling step requires several additional input files. We need to declare parameters for them, add default values to the test profile, and create variables to load them.
2.1.1. Add parameter declarations for accessory inputs¶
Since our new process expects a handful of additional files to be provided, add parameter declarations for them in genomics.nf under the Pipeline parameters section:
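A sketch of the expanded params block; the accessory parameter names (reference, reference_index, reference_dict, intervals) are assumptions chosen to match the files seen in Part 1:

```nextflow
params {
    // Primary input
    input: Path

    // Accessory files
    reference: Path
    reference_index: Path
    reference_dict: Path
    intervals: Path
}
```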
As before, we'll provide default values through the test profile rather than inline.
2.1.2. Add accessory file defaults to the test profile¶
Just as we did for input in section 1.1.2, add default values for the accessory files to the test profile in nextflow.config:
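For example (file paths inferred from the data/ref/ directory used in Part 1):

```groovy
profiles {
    test {
        params.input           = "${projectDir}/data/bam/reads_mother.bam"
        params.reference       = "${projectDir}/data/ref/ref.fasta"
        params.reference_index = "${projectDir}/data/ref/ref.fasta.fai"
        params.reference_dict  = "${projectDir}/data/ref/ref.dict"
        params.intervals       = "${projectDir}/data/ref/intervals.bed"
    }
}
```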
Next, we'll need to create variables that load these file paths for use in the workflow.
2.1.3. Create variables for the accessory files¶
Add variables for the accessory file paths inside the workflow block:
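A sketch; the variable names are assumptions:

```nextflow
// Load the accessory file paths
ref_file       = file(params.reference)
ref_index_file = file(params.reference_index)
ref_dict_file  = file(params.reference_dict)
intervals_file = file(params.intervals)
```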
The file() syntax tells Nextflow explicitly to handle these inputs as file paths.
You can learn more about this in the Side Quest Working with files.
2.2. Write the variant calling process and call it in the workflow¶
We need to write the process definition in the module file, import it into the workflow using an include statement, and call it on the input reads plus the output of the indexing step and the accessory files.
2.2.1. Fill in the module for the variant calling process¶
Open modules/gatk_haplotypecaller.nf and examine the outline of the process definition.
Go ahead and fill in the process definition by yourself using the information provided above, then check your work against the sketch below.
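Here is a minimal sketch of the completed module (the container URI and command come from Part 1; the input variable names and emit labels are assumptions):

```nextflow
#!/usr/bin/env nextflow

process GATK_HAPLOTYPECALLER {

    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"

    input:
    path input_bam
    path input_bam_index
    path ref_fasta
    path ref_index
    path ref_dict
    path interval_list

    output:
    path "${input_bam}.vcf",     emit: vcf
    path "${input_bam}.vcf.idx", emit: idx

    script:
    """
    gatk HaplotypeCaller \
        -R ${ref_fasta} \
        -I ${input_bam} \
        -O ${input_bam}.vcf \
        -L ${interval_list}
    """
}
```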
You'll notice that this process has more inputs than the GATK command itself requires. The GATK knows to look for the BAM index file and the reference genome's accessory files based on naming conventions, but Nextflow is domain-agnostic and doesn't know about these conventions. We need to list them explicitly so that Nextflow stages them in the working directory at runtime; otherwise GATK will throw an error about missing files.
Similarly, we list the output VCF's index file ("${input_bam}.vcf.idx") explicitly so that Nextflow keeps track of it for subsequent steps.
We use the emit: syntax to assign a name to each output channel, which will become useful when we wire the outputs into the publish block.
With that, the process definition is complete. To use it in the workflow, you'll need to import the module and add a process call.
2.2.2. Import the new module¶
Update genomics.nf to import the new module:
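```nextflow
// Module INCLUDE statements
include { SAMTOOLS_INDEX } from './modules/samtools_index.nf'
include { GATK_HAPLOTYPECALLER } from './modules/gatk_haplotypecaller.nf'
```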
The process is now available in the workflow scope.
2.2.3. Add the process call¶
Add the process call in the workflow body, under main::
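Assuming the module sketch above, the call could look like this (the argument order must match the process's input block):

```nextflow
// Call processes
SAMTOOLS_INDEX(reads_ch)

GATK_HAPLOTYPECALLER(
    reads_ch,
    SAMTOOLS_INDEX.out,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)
```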
You should recognize the *.out syntax from the Hello Nextflow training series; we are telling Nextflow to take the channel output by SAMTOOLS_INDEX and plug it into the GATK_HAPLOTYPECALLER process call.
Note
Notice that the inputs are provided in the process call in exactly the same order as they are listed in the process's input block. In Nextflow, inputs are positional: you must follow the declared order, and the call must supply the same number of inputs as the process declares.
2.3. Configure the output handling¶
We need to add the new outputs to the publish declaration and configure where they go.
2.3.1. Add publish targets for the variant calling outputs¶
Add the VCF and index outputs to the publish: section:
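A sketch, assuming the emit labels from the module sketch above:

```nextflow
publish:
bam_index = SAMTOOLS_INDEX.out
vcf       = GATK_HAPLOTYPECALLER.out.vcf
vcf_idx   = GATK_HAPLOTYPECALLER.out.idx
```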
Next, we'll need to tell Nextflow where to put the new outputs.
2.3.2. Configure the new output targets¶
Add entries for the vcf and vcf_idx targets in the output {} block, publishing both into a vcf/ subdirectory:
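A sketch, following the same pattern as the bam_index target:

```nextflow
output {
    bam_index {
        path 'bam'
    }
    vcf {
        path 'vcf'
    }
    vcf_idx {
        path 'vcf'
    }
}
```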
The VCF and its index are published as separate targets that both go into the vcf/ subdirectory.
2.4. Run the workflow¶
Run the expanded workflow, adding -resume this time so that we don't have to run the indexing step again.
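```bash
nextflow run genomics.nf -profile test -resume
```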
Command output
Now if we look at the console output, we see the two processes listed.
The first process was skipped thanks to the caching, as expected, whereas the second process was run since it's brand new.
You'll find the output files in the results directory (as symbolic links to the work directory).
Directory contents
If you open the VCF file, you should see the same contents as in the file you generated by running the GATK command directly in the container.
File contents
This is the output we care about generating for each sample in our study.
Takeaway¶
You know how to make a two-step modular workflow that does real analysis work and is capable of dealing with genomics file format idiosyncrasies like the accessory files.
What's next?¶
Make the workflow handle multiple samples in bulk.
3. Adapt the workflow to run on a batch of samples¶
It's all well and good to have a workflow that can automate processing on a single sample, but what if you have 1000 samples? Do you need to write a bash script that loops through all your samples?
No, thank goodness! Just make a minor tweak to the code and Nextflow will handle that for you too.
3.1. Update the input to list three samples¶
To run on multiple samples, update the test profile to provide an array of file paths instead of a single one. This is a quick way to test multi-sample execution; in the next step we'll switch to a more scalable approach using a file of inputs.
First, comment out the type annotation in the parameter declaration, since arrays cannot use typed declarations:
Then update the test profile to list all three samples:
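For example:

```groovy
profiles {
    test {
        params.input = [
            "${projectDir}/data/bam/reads_mother.bam",
            "${projectDir}/data/bam/reads_father.bam",
            "${projectDir}/data/bam/reads_son.bam"
        ]
        // (accessory file defaults unchanged)
    }
}
```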
The channel factory in the workflow body (.fromPath) accepts multiple file paths just as well as a single one, so no other changes are needed.
3.2. Run the workflow¶
Try running the workflow now that the plumbing is set up to run on all three test samples.
Funny thing: this might work, OR it might fail. For example, here's a run that succeeded:
Command output
If your workflow run succeeded, run it again until you get an error like this:
Command output
N E X T F L O W ~ version 25.10.2
┃ Launching `genomics.nf` [loving_pasteur] DSL2 - revision: d2a8e63076
executor > local (4)
[01/eea165] SAMTOOLS_INDEX (2) | 3 of 3, cached: 1 ✔
[a5/fa9fd0] GATK_HAPLOTYPECALLER (3) | 1 of 3, cached: 1
ERROR ~ Error executing process > 'GATK_HAPLOTYPECALLER (2)'
Caused by:
Process `GATK_HAPLOTYPECALLER (2)` terminated with an error exit status (2)
Command executed:
gatk HaplotypeCaller -R ref.fasta -I reads_father.bam -O reads_father.bam.vcf -L intervals.bed
Command exit status:
2
Command error:
...
A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
...
If you look at the GATK command error output, there will be a line like this:
A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.
Well, that's weird, considering we explicitly indexed the BAM files in the first step of the workflow. Could there be something wrong with the plumbing?
3.3. Troubleshoot the problem¶
We'll inspect the work directories and use the view() operator to figure out what went wrong.
3.3.1. Check the work directories for the relevant calls¶
Take a look inside the work directory for the failed GATK_HAPLOTYPECALLER process call listed in the console output.
Directory contents
work/a5/fa9fd0994b6beede5fb9ea073596c2
├── intervals.bed -> /workspaces/training/nf4-science/genomics/data/ref/intervals.bed
├── reads_father.bam.bai -> /workspaces/training/nf4-science/genomics/work/01/eea16597bd6e810fb4cf89e60f8c2d/reads_father.bam.bai
├── reads_son.bam -> /workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
├── reads_son.bam.vcf
├── reads_son.bam.vcf.idx
├── ref.dict -> /workspaces/training/nf4-science/genomics/data/ref/ref.dict
├── ref.fasta -> /workspaces/training/nf4-science/genomics/data/ref/ref.fasta
└── ref.fasta.fai -> /workspaces/training/nf4-science/genomics/data/ref/ref.fasta.fai
Pay particular attention to the names of the BAM file and the BAM index that are listed in this directory: reads_son.bam and reads_father.bam.bai.
What the heck? Nextflow has staged an index file in this process call's work directory, but it's the wrong one. How could this have happened?
3.3.2. Use the view() operator to inspect channel contents¶
Add these two lines in the workflow body before the GATK_HAPLOTYPECALLER process call to view the contents of the channel:
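Assuming the channel names used earlier, the two lines could be:

```nextflow
// Inspect the input channel and the indexing output channel
reads_ch.view()
SAMTOOLS_INDEX.out.view()
```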
Then run the workflow command again.
Once again, this may succeed or fail. Here's what the output of the two .view() calls looks like for a failed run:
/workspaces/training/nf4-science/genomics/data/bam/reads_mother.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_father.bam
/workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
/workspaces/training/nf4-science/genomics/work/9c/53492e3518447b75363e1cd951be4b/reads_father.bam.bai
/workspaces/training/nf4-science/genomics/work/cc/37894fffdf6cc84c3b0b47f9b536b7/reads_son.bam.bai
/workspaces/training/nf4-science/genomics/work/4d/dff681a3d137ba7d9866e3d9307bd0/reads_mother.bam.bai
The first three lines correspond to the input channel and the last three to the output channel. You can see that the BAM files and index files for the three samples are not listed in the same order!
Note
When you call a Nextflow process on a channel containing multiple elements, Nextflow will try to parallelize execution as much as possible, and will collect outputs in whatever order they become available. The consequence is that the corresponding outputs may be collected in a different order than the original inputs were fed in.
As currently written, our workflow script assumes that the index files will come out of the indexing step listed in the same mother/father/son order as the inputs were given. But that is not guaranteed to be the case, which is why sometimes (though not always) the wrong files get paired up in the second step.
To fix this, we need to make sure the BAM files and their index files travel together through the channels.
Tip
The view() statements don't change the behavior of the workflow, so it's not a problem to leave them in.
However, they will clutter up your console output, so we recommend removing them once you're done troubleshooting.
3.4. Update the workflow to handle the index files correctly¶
The fix is to bundle each BAM file with its index into a tuple, then update the downstream process and workflow plumbing to match.
3.4.1. Change the output of the SAMTOOLS_INDEX module to a tuple¶
The simplest way to ensure a BAM file and its index stay closely associated is to package them together into a tuple coming out of the index task.
Note
A tuple is a finite, ordered list of elements that is commonly used for returning multiple values from a function. Tuples are particularly useful for passing multiple inputs or outputs between processes while preserving their association and order.
Update the output in modules/samtools_index.nf to include the BAM file:
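A sketch of the updated output block, assuming the variable names from the module sketch above:

```nextflow
output:
// Bundle the BAM and its index so they always travel together
tuple path(input_bam), path("${input_bam}.bai")
```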
This way, each index file will be tightly coupled with its original BAM file, and the overall output of the indexing step will be a single channel containing pairs of files.
3.4.2. Change the input of the GATK_HAPLOTYPECALLER module to accept a tuple¶
Since we've changed the 'shape' of the output of the first process, we need to update the input definition of the second process to match.
Update modules/gatk_haplotypecaller.nf:
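A sketch of the updated input block:

```nextflow
input:
// The BAM and its index now arrive together as a tuple
tuple path(input_bam), path(input_bam_index)
path ref_fasta
path ref_index
path ref_dict
path interval_list
```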
Next, we'll need to update the workflow to reflect the new tuple structure in the process call and the publish targets.
3.4.3. Update the call to GATK_HAPLOTYPECALLER in the workflow¶
We no longer need to provide the original reads_ch to the GATK_HAPLOTYPECALLER process, since the BAM file is now bundled into the channel output by SAMTOOLS_INDEX.
Update the call in genomics.nf:
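A sketch of the updated call, with reads_ch removed from the arguments:

```nextflow
GATK_HAPLOTYPECALLER(
    SAMTOOLS_INDEX.out,
    ref_file,
    ref_index_file,
    ref_dict_file,
    intervals_file
)
```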
Finally, we need to update the publish targets to reflect the new output structure.
3.4.4. Update the publish target for the indexed BAM output¶
Since the SAMTOOLS_INDEX output is now a tuple containing both the BAM file and its index, rename the publish target from bam_index to indexed_bam to better reflect its contents:
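A sketch of the renamed target; the matching entry in the output {} block needs the same rename so the target names stay consistent:

```nextflow
publish:
indexed_bam = SAMTOOLS_INDEX.out
vcf         = GATK_HAPLOTYPECALLER.out.vcf
vcf_idx     = GATK_HAPLOTYPECALLER.out.idx
```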
With these changes, the BAM and its index are guaranteed to travel together, so the pairing will always be correct.
3.5. Run the corrected workflow¶
Run the workflow again to make sure this will work reliably going forward.
This time (and every time) everything should run correctly:
Command output
The results directory now contains both BAM and BAI files for each sample (from the tuple), along with the VCF outputs:
Results directory contents
results/
├── bam/
│ ├── reads_father.bam -> ...
│ ├── reads_father.bam.bai -> ...
│ ├── reads_mother.bam -> ...
│ ├── reads_mother.bam.bai -> ...
│ ├── reads_son.bam -> ...
│ └── reads_son.bam.bai -> ...
└── vcf/
├── reads_father.bam.vcf -> ...
├── reads_father.bam.vcf.idx -> ...
├── reads_mother.bam.vcf -> ...
├── reads_mother.bam.vcf.idx -> ...
├── reads_son.bam.vcf -> ...
└── reads_son.bam.vcf.idx -> ...
By bundling associated files into tuples, we ensured the correct files always travel together through the workflow. The workflow now processes any number of samples reliably, but listing them individually in the config is not very scalable. In the next step, we'll switch to reading inputs from a file.
Takeaway¶
You know how to make your workflow run on multiple samples (independently).
What's next?¶
Make it easier to handle samples in bulk.
4. Make the workflow accept a text file containing a batch of input files¶
A very common way to provide multiple data input files to a workflow is with a text file containing the file paths. It can be as simple as a text file listing one file path per line and nothing else, or it can contain additional metadata, in which case it's often called a samplesheet.
Here we are going to show you how to do the simple case.
4.1. Examine the provided CSV file listing the input file paths¶
We already made a CSV file listing the input file paths, called samplesheet.csv, which you can find in the data/ directory.
sample_id,reads_bam
NA12878,/workspaces/training/nf4-science/genomics/data/bam/reads_mother.bam
NA12877,/workspaces/training/nf4-science/genomics/data/bam/reads_father.bam
NA12882,/workspaces/training/nf4-science/genomics/data/bam/reads_son.bam
This CSV file includes a header line that names the columns.
Warning
The file paths in the CSV are absolute paths that must match your environment. If you are not running this in the training environment we provide, you will need to update the paths to match your system.
4.2. Update the parameter and test profile¶
Switch the input parameter to point to the samplesheet.csv file instead of listing individual samples.
Restore the type annotation in the params block (since it's a single path again):
Then update the test profile to point to the text file:
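For example:

```groovy
profiles {
    test {
        params.input = "${projectDir}/data/samplesheet.csv"
        // (accessory file defaults unchanged)
    }
}
```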
The list of files no longer lives in the code at all, which is a big step in the right direction.
4.3. Update the channel factory to parse CSV input¶
Currently, our input channel factory treats any files we give it as the data inputs we want to feed to the indexing process. Since we're now giving it a file that lists input file paths, we need to change its behavior to parse the file and treat the file paths it contains as the data inputs.
We can do this using the same pattern we used in Part 2 of Hello Nextflow: applying the splitCsv() operator to parse the file, then a map operation to extract the file path from each row.
One thing that is new compared to what you encountered in the Hello Nextflow course is that this CSV has a header line, so we add header: true to the splitCsv() call.
That allows us to reference columns by name in the map operation: row.reads_bam extracts the file path from the reads_bam column of each row.
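Putting it together, the channel construction could look like this (wrapping the extracted value in file() so Nextflow handles it as a file path, as discussed earlier):

```nextflow
// Create input channel from a CSV listing one BAM file path per row
reads_ch = channel.fromPath(params.input)
    .splitCsv(header: true)
    .map { row -> file(row.reads_bam) }
```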
Tip
If you're not confident you understand what the operators are doing here, this is another great opportunity to use the .view() operator to look at what the channel contents look like before and after applying them.
4.4. Run the workflow¶
Run the workflow one more time.
Command output
This should produce the same result as before. Our simple variant calling workflow now has all the basic features we wanted.
Takeaway¶
You know how to make a multi-step modular workflow to index a BAM file and apply per-sample variant calling using GATK.
More generally, you've learned how to use essential Nextflow components and logic to build a simple genomics pipeline that does real work, taking into account the idiosyncrasies of genomics file formats and tool requirements.
What's next?¶
Take a break! That was a lot.
When you're feeling refreshed, head on to Part 3, where you'll learn how to transform this simple per-sample variant calling workflow to apply joint variant calling to the data.