18. Workflow Structure
Nextflow recognizes a specific directory structure for workflows, and several of its directories unlock features that can simplify or extend your code. In this section we will explore them.
First, let's move into the right directory:
There are three directories in a Nextflow workflow repository that have a special purpose:
18.1 ./bin
The bin directory (if it exists) is always added to $PATH for all tasks. If the tasks are performed on a remote machine, the directory is copied across to the new machine before the task begins. This Nextflow feature is designed to make it easy to include accessory scripts directly in the workflow without having to commit those scripts into the container. This feature also ensures that the scripts used inside the workflow move on the same revision schedule as the workflow itself.
It is important to know that Nextflow will take care of updating $PATH and ensuring the files are available wherever the task is running, but will not change the permissions of any files in that directory. If a file is called by a task as an executable, the workflow developer must ensure that the file has the correct permissions to be executed.
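The effect of a missing executable bit can be reproduced without Nextflow. The following is only a sketch: the scratch directory and the script name nf_bin_demo.sh are hypothetical, and the PATH line merely imitates what happens for a task rather than reproducing Nextflow's own code.

```shell
# Scratch directory so the demonstration does not touch your real bin directory.
demo_dir="$(pwd)/bin-permission-demo"
mkdir -p "$demo_dir/bin"

# nf_bin_demo.sh is a hypothetical helper script used only for this illustration.
cat << 'EOF' > "$demo_dir/bin/nf_bin_demo.sh"
#!/usr/bin/env bash
echo "hello from bin"
EOF
chmod -x "$demo_dir/bin/nf_bin_demo.sh"

# Imitate the per-task setup by appending the bin directory to PATH.
PATH="$PATH:$demo_dir/bin"

# Without the executable bit, calling the script by name fails.
if nf_bin_demo.sh > /dev/null 2>&1; then
    before_chmod="ran"
else
    before_chmod="failed"
fi

# Once the script is executable, the same call succeeds.
chmod +x "$demo_dir/bin/nf_bin_demo.sh"
after_chmod="$(nf_bin_demo.sh)"

echo "before chmod: $before_chmod"
echo "after chmod:  $after_chmod"
```

Being on $PATH is not enough by itself: the shell finds the file in both cases, but can only run it once the permission is set.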
For example, let's say we have a small R script that writes a TSV file and a PNG plot:
We'd like to use this script in a simple workflow:
To do this, we create the bin directory, write our R script into it and, crucially, make the script executable. This is the code we used to create the cars.R script; there is no need to run it:
mkdir -p bin
cat << EOF > bin/cars.R
#!/usr/bin/env Rscript
library(tidyverse)
plot <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
mtcars |> write_tsv("cars.tsv")
ggsave("cars.png", plot = plot)
EOF
chmod +x bin/cars.R
Warning
Always ensure that your scripts are executable. Your Nextflow processes will not be able to run them without this step.
Let's run the workflow and see what Nextflow is doing for us behind the scenes:
and then inspect the .command.run file that Nextflow has generated:
You'll notice a nxf_container_env bash function that appends our bin directory to $PATH:
When working on the cloud, Nextflow will also ensure that the bin directory is copied onto the virtual machine running your task, in addition to modifying $PATH.
Warning
Always use a portable shebang line in your bin directory scripts.
In the R script example shown above, I may have the Rscript program installed at (for example) /opt/homebrew/bin/Rscript. If I hard-code this path into the shebang of my cars.R, everything will work when I'm testing locally outside of the Docker container, but it will fail when running with Docker/Singularity or in the cloud, as the Rscript program may be installed in a different location in those contexts.
It is strongly recommended to use #!/usr/bin/env when setting the shebang for scripts in the bin directory to ensure maximum portability.
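One way to catch hard-coded interpreter paths before they cause trouble is a small check over the bin directory. This is only a sketch: check_shebangs is a hypothetical helper, not part of Nextflow, and it only inspects the first line of each file.

```shell
# check_shebangs is a hypothetical helper, not part of Nextflow.
# It prints every file in the given directory (default: bin) whose first line
# does not start with "#!/usr/bin/env " and returns non-zero if any are found.
check_shebangs() {
    shebang_dir="${1:-bin}"
    shebang_bad=0
    for shebang_script in "$shebang_dir"/*; do
        # Skip sub-directories and the unexpanded glob of an empty directory.
        [ -f "$shebang_script" ] || continue
        shebang_line="$(head -n 1 "$shebang_script")"
        case "$shebang_line" in
            '#!/usr/bin/env '*) ;;
            *)
                echo "non-portable shebang: $shebang_script"
                shebang_bad=1
                ;;
        esac
    done
    return "$shebang_bad"
}

# Usage: check the bin directory of the current workflow.
check_shebangs bin && echo "all shebangs in bin look portable"
```

A script starting with #!/opt/homebrew/bin/Rscript would be reported, while one starting with #!/usr/bin/env Rscript would pass.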
18.2 ./templates
If a process script block is becoming too long, it can be moved to a template file. The template file can then be imported into the process script block using the template method. This is useful for keeping the process block tidy and readable. Nextflow's use of $ to indicate variables also allows for directly testing the template file by running it as a script.
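As a sketch of that standalone testing, the snippet below writes a small bash template and runs it directly. The file name say_hello.sh and the variable name are hypothetical and are not the example template shipped with the training material.

```shell
mkdir -p templates

# say_hello.sh is a hypothetical template; $name stands in for a process input.
cat << 'EOF' > templates/say_hello.sh
#!/usr/bin/env bash
echo "Hello, $name"
EOF

# Standalone test: supply the variable through the environment and run the file.
template_output="$(name="Montreal" bash templates/say_hello.sh)"
echo "$template_output"
```

Run this way, bash expands $name from the environment. When the same file is referenced from a process with the template method, Nextflow fills in $name from the process inputs instead.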
The structure directory already contains an example template - a very simple python script. We can add a new process that uses this template:
18.3 ./lib
In the previous chapter, we saw the addition of small helper Groovy functions to the main.nf file. It may at times be helpful to bundle functionality into a new Groovy class. Any classes defined in the lib directory are available for use in the workflow - both main.nf and any imported modules.
18.3.1 Making a Metadata Class
Note
Using custom Groovy is considered a very advanced use case and you should not need it for the majority of workflows. The language server will complain about this but you can safely ignore it.
Exercise
Create a new class in ./lib/Metadata.groovy that extends the HashMap class and adds a hi method.
Let's consider an example where we create our own custom class to handle metadata. We can create a new class in ./lib/Metadata.groovy. We'll extend the built-in HashMap class, and add a simple method to return a value:
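A minimal sketch of what such a class could look like, written to disk with the same heredoc pattern used for cars.R earlier. The greeting text is an arbitrary placeholder, and the body is one possible implementation rather than the reference solution.

```shell
mkdir -p lib

# Quote the heredoc delimiter so the shell leaves the Groovy source untouched.
cat << 'EOF' > lib/Metadata.groovy
// Sketch only: a HashMap subclass with one extra method.
class Metadata extends HashMap {
    // Return a fixed greeting; the wording is a placeholder.
    def hi() {
        return "Hello, workflow developers!"
    }
}
EOF
```

Because the class extends HashMap, an instance still behaves like an ordinary map, with hi() available as an additional method.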
We can then use this class in our workflow:
We can use the new hi method in the workflow:
At the moment, the Metadata class is not making use of the "Montreal" being passed into the closure. Let's change that by adding a constructor to the class:
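One way such a constructor might look is sketched below. The key name location is an assumption chosen for illustration, and the plain put/get calls inherited from HashMap are used to keep the sketch unambiguous.

```shell
mkdir -p lib

# Quote the heredoc delimiter so the shell does not expand ${location}.
cat << 'EOF' > lib/Metadata.groovy
// Sketch only: the "location" key is an assumed name, not a Nextflow convention.
class Metadata extends HashMap {
    // Store the value passed in (for example "Montreal") as a map entry.
    Metadata(String location) {
        super()
        put('location', location)
    }

    // Use the stored location in the greeting when one is present.
    def hi() {
        def location = get('location')
        return location ? "Hello from ${location}!" : "Hello, workflow developers!"
    }
}
EOF
```

With this version, a closure such as { place -> new Metadata(place) } would carry the channel value into the object, and hi() would include it in the greeting.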
Which we can use like so:
We can also use this method when passing the object to a process:
Why might this be helpful? You can add extra methods to the class whose values are computed from the existing metadata. For example, we might want to add a method to get the adapter prefix into our Metadata class:
Which we might use like so:
You might even want to reach out to external services such as a LIMS or the E-utilities API. Here we add a dummy "getSampleName()" method that reaches out to a public API:
This relies on JsonSlurper (groovy.json.JsonSlurper), which isn't imported by default. Import it by adding the following to the top of the file:
Which we can use like so:
Nextflow caching
When we start passing custom classes through the workflow, it's important to understand a little about the Nextflow caching mechanism. When a task is run, a unique hash is calculated based on the task name, the input files/values, and the input parameters. Our class extends HashMap, which means that the hash will be calculated from the contents of the HashMap. Adding a new method to the class, or amending an existing one, does not change those contents, which means that the hash will not change.
Exercise
Can you show that changing a method in our Metadata class does not change the hash?
We are not limited to using or extending the built-in Groovy classes. Let's start by creating a Dog class in ./lib/Dog.groovy:
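A plain class of this kind could be as small as the sketch below. The two properties, name and isHungry, are assumptions made for illustration.

```shell
mkdir -p lib

# Quote the heredoc delimiter so the Groovy source is written verbatim.
cat << 'EOF' > lib/Dog.groovy
// Sketch only: the property names are illustrative assumptions.
class Dog {
    String name
    Boolean isHungry = true
}
EOF
```

Groovy generates a named-argument constructor for such a class, so an instance could be created with new Dog(name: 'fido').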
We can create a new dog at the beginning of the workflow:
We can pass objects of our class through channels. Here we take a channel of dog names and create a channel of dogs:
If we pass objects of this new class to a process and then resume the workflow, the cached results will not be reused.
Exercise
Show that the Dog class is not cached when resuming a workflow.
18.3.2 Making a ValueObject
Nextflow provides a decorator (a Groovy annotation) to help serialize your custom classes. When @ValueObject is added to the class definition, Nextflow can serialize objects of the class and cache them. This is useful if you want to pass a custom class through a channel, or if you want to use the class in a resumed workflow.
Let's add the decorator to our Dog class:
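A sketch of the annotated class follows. The properties are the same illustrative assumptions as before, and the import path nextflow.io.ValueObject should be checked against the Nextflow version you have installed.

```shell
mkdir -p lib

# Quote the heredoc delimiter so the Groovy source is written verbatim.
cat << 'EOF' > lib/Dog.groovy
// Assumption: the annotation lives in the nextflow.io package.
import nextflow.io.ValueObject

// Sketch only: the property names are illustrative assumptions.
@ValueObject
class Dog {
    String name
    Boolean isHungry = true
}
EOF
```

The only changes compared with a plain class are the import and the annotation placed directly above the class declaration.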
Lastly, we will need to register the class with Kryo, a Java serialization framework. Again, Nextflow provides a helper method to do this. We can add the following to the main.nf file:
Exercise
Show that the Dog class can now be used in processes and cached correctly.