Part 3: Hello Workflow¶
Most real-world workflows involve more than one step. In this training module, you'll learn how to connect processes together in a multi-step workflow.
This will teach you the Nextflow way of achieving the following:
- Making data flow from one process to the next
- Collecting outputs from multiple process calls into a single process call
- Passing more than one input to a process
- Handling multiple outputs coming out of a process
To demonstrate, we will continue building on the domain-agnostic Hello World example from Parts 1 and 2. This time, we're going to make the following changes to our workflow to better reflect how people build actual workflows:
- Add a second step that converts the greeting to uppercase.
- Add a third step that collects all the transformed greetings and writes them into a single file.
- Add a parameter to name the final output file and pass that as a secondary input to the collection step.
- Make the collection step also output a simple statistic about what was processed.
0. Warmup: Run hello-workflow.nf¶
We're going to use the workflow script hello-workflow.nf as a starting point.
It is equivalent to the script produced by working through Part 2 of this training course.
Just to make sure everything is working, run the script once before making any changes:
N E X T F L O W ~ version 24.10.0
Launching `hello-workflow.nf` [stupefied_sammet] DSL2 - revision: b9e466930b
executor > local (3)
[2a/324ce6] sayHello (3) | 3 of 3 ✔
As previously, you will find the output files in the results directory (specified by the publishDir directive).
Note
There may also be a file named output.txt left over if you worked through Part 2 in the same environment.
If that worked for you, you're ready to learn how to assemble a multi-step workflow.
1. Add a second step to the workflow¶
We're going to add a step to convert the greeting to uppercase. To that end, we need to do three things:
- Define the command we're going to use to do the uppercase conversion.
- Write a new process that wraps the uppercasing command.
- Add the new process to the workflow and set it up to take the output of the sayHello() process as input.
1.1. Define the uppercasing command and test it in the terminal¶
To convert the greetings to uppercase, we're going to use a classic UNIX tool called tr (short for 'translate'), which replaces characters one for one, with the following syntax:
This is a very naive text replacement one-liner that does not account for accented letters, so for example 'Holà' will become 'HOLà', but it will do a good enough job for demonstrating the Nextflow concepts and that's what matters.
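As a quick sketch of that behavior, the following uses the same tr invocation that appears later in this module. The variable names greeting and upper are only illustrative. Lowercase ASCII letters are mapped to uppercase, while the accented 'à' passes through unchanged.

```shell
# Illustrative variable; any string works here.
greeting='Holà'

# Map each lowercase ASCII letter to its uppercase counterpart.
# Characters outside a-z (such as 'à') are left as they are.
upper=$(printf '%s\n' "$greeting" | tr '[a-z]' '[A-Z]')

echo "$upper"
```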
To test it out, we can run the echo 'Hello World' command and pipe its output to the tr command:
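Put together, the test could look like the sketch below. The output file name UPPER-output.txt is the one this section describes; working in a scratch directory created with mktemp is an assumption made here so that nothing in your current directory gets overwritten.

```shell
# Work in a throwaway directory (an assumption for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

# Pipe the greeting through tr and redirect the result to a file.
echo 'Hello World' | tr '[a-z]' '[A-Z]' > UPPER-output.txt

# Inspect the result.
cat UPPER-output.txt
```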
The output is a text file called UPPER-output.txt that contains the uppercase version of the Hello World string:
That's basically what we're going to try to do with our workflow.
1.1. Write the uppercasing step as a Nextflow process¶
We can model our new process on the first one, since we want to use all the same components.
Add the following process definition to the workflow script:
Here, we compose the second output filename based on the input filename, similarly to what we did originally for the output of the first process.
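The naming scheme can be sketched in plain shell. This is not the process definition itself, only an illustration of deriving the output name from the input name. The variable names input_file and output_file are assumptions for this sketch, and the file names follow the ones that appear in the work directory listing later in this section.

```shell
# Throwaway directory so nothing in the current directory is touched.
scratch=$(mktemp -d)

# Stand-in for the file produced by the first step.
input_file='Bonjour-output.txt'
echo 'Bonjour' > "$scratch/$input_file"

# Prefix the input name with UPPER- to build the output name.
output_file="UPPER-${input_file}"
tr '[a-z]' '[A-Z]' < "$scratch/$input_file" > "$scratch/$output_file"

echo "$output_file"
```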
Note
Nextflow will determine the order of operations based on the chaining of inputs and outputs, so the order of the process definitions in the workflow script does not matter. However, we do recommend you be kind to your collaborators and to your future self, and try to write them in a logical order for the sake of readability.
1.2. Add a call to the new process in the workflow block¶
Now we need to tell Nextflow to actually call the process that we just defined.
In the workflow block, make the following code change:
Before:
After:
This is not yet functional because we have not specified what should be input to the convertToUpper() process.
1.3. Pass the output of the first process to the second process¶
Now we need to make the output of the sayHello() process flow into the convertToUpper() process.
Conveniently, Nextflow automatically packages the output of a process into a channel called <process>.out.
So the output of the sayHello process is a channel called sayHello.out, which we can plug straight into the call to convertToUpper().
In the workflow block, make the following code change:
Before:
After:
For a simple case like this (one output to one input), that's all we need to do to connect two processes!
1.4. Run the workflow again with -resume¶
Let's run this using the -resume flag, since we've already run the first step of the workflow successfully.
You should see the following output:
There is now an extra line in the console output (line 7), which corresponds to the new process we just added.
Let's have a look inside the work directory of one of the calls to the second process.
work/b3/d52708edba8b864024589285cb3445/
├── Bonjour-output.txt -> /workspaces/training/hello-nextflow/work/79/33b2f0af8438486258d200045bd9e8/Bonjour-output.txt
└── UPPER-Bonjour-output.txt
We find two output files: the output of the first process AND the output of the second.
The output of the first process is in there because Nextflow staged it there in order to have everything needed for execution within the same subdirectory. However, it is actually a symbolic link pointing to the original file in the subdirectory of the first process call. By default, when running on a single machine as we're doing here, Nextflow uses symbolic links rather than copies to stage input and intermediate files.
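The staging behavior can be imitated with ordinary shell commands. The sketch below is an analogy, not what Nextflow runs internally; the directory names first_call and second_call are hypothetical stand-ins for the hashed work subdirectories.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/first_call" "$scratch/second_call"

# The first step writes its output into its own directory.
echo 'Bonjour' > "$scratch/first_call/Bonjour-output.txt"

# Stage that file into the second directory as a symbolic link, not a copy.
ln -s "$scratch/first_call/Bonjour-output.txt" "$scratch/second_call/Bonjour-output.txt"

# The second step reads through the link and writes a real file next to it.
tr '[a-z]' '[A-Z]' < "$scratch/second_call/Bonjour-output.txt" \
    > "$scratch/second_call/UPPER-Bonjour-output.txt"

# The listing shows the link together with the path it points to.
ls -l "$scratch/second_call"
```

Because the staged input is only a link, deleting the first directory would leave the second one with a dangling reference, which is worth keeping in mind before cleaning up work directories by hand.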
You'll also find the final outputs in the results directory since we used the publishDir directive in the second process too.
results
├── Bonjour-output.txt
├── Hello-output.txt
├── Holà-output.txt
├── UPPER-Bonjour-output.txt
├── UPPER-Hello-output.txt
└── UPPER-Holà-output.txt
Think about how all we did was connect the output of sayHello to the input of convertToUpper and the two processes could be run in series.
Nextflow did the hard work of handling individual input and output files and passing them between the two commands for us.
This is one of the reasons Nextflow channels are so powerful: they take care of the busywork involved in connecting workflow steps together.
Takeaway¶
You know how to add a second step that takes the output of the first step as input.
What's next?¶
Learn how to collect outputs from batched process calls and feed them into a single process.
2. Add a third step to collect all the greetings¶
When we use a process to apply a transformation to each of the elements in a channel, like we're doing here to the multiple greetings, we sometimes want to collect elements from the output channel of that process, and feed them into another process that performs some kind of analysis or summation.
In the next step we're simply going to write all the elements of a channel to a single file, using the UNIX cat command.
2.1. Define the collection command and test it in the terminal¶
The collection step we want to add to our workflow will use the cat command to concatenate multiple uppercased greetings into a single file.
Let's run the command by itself in the terminal to verify that it works as expected, just like we've done previously.
Run the following in your terminal:
echo 'Hello' | tr '[a-z]' '[A-Z]' > UPPER-Hello-output.txt
echo 'Bonjour' | tr '[a-z]' '[A-Z]' > UPPER-Bonjour-output.txt
echo 'Holà' | tr '[a-z]' '[A-Z]' > UPPER-Holà-output.txt
cat UPPER-Hello-output.txt UPPER-Bonjour-output.txt UPPER-Holà-output.txt > COLLECTED-output.txt
The output is a text file called COLLECTED-output.txt that contains the uppercase versions of the original greetings.
That is the result we want to achieve with our workflow.
2.1. Create a new process to do the collection step¶
Let's create a new process and call it collectGreetings().
We can start writing it based on the previous one.
2.1.1. Write the 'obvious' parts of the process¶
Add the following process definition to the workflow script:
This is what we can write with confidence based on what you've learned so far. But this is not functional! It leaves out the input definition(s) and the first half of the script command because we need to figure out how to write that.
2.1.2. Define inputs to collectGreetings()¶
We need to collect the greetings from all the calls to the convertToUpper() process.
What do we know we can get from the previous step in the workflow?
The channel output by convertToUpper() will contain the paths to the individual files containing the uppercased greetings.
That amounts to one input slot; let's call it input_files for simplicity.
In the process block, make the following code change:
Before:
After:
Notice we use the path prefix even though we expect this to contain multiple files.
Nextflow doesn't mind, so it doesn't matter.
2.1.3. Compose the concatenation command¶
This is where things could get a little tricky, because we need to be able to handle an arbitrary number of input files. Specifically, we can't write the command up front, so we need to tell Nextflow how to compose it at runtime based on what inputs flow into the process.
In other words, if we have an input channel containing the item [file1.txt, file2.txt, file3.txt], we need Nextflow to turn that into cat file1.txt file2.txt file3.txt.
Fortunately, Nextflow is quite happy to do that for us if we simply write cat ${input_files} in the script command.
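The resulting command line is plain shell. As a sketch, assuming the list is rendered as space-separated file names (which is what the description above implies), the shell ends up running something equivalent to the following. The file names and contents are placeholders.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Placeholder inputs standing in for the staged files.
echo 'HELLO'   > file1.txt
echo 'BONJOUR' > file2.txt
echo 'HOLà'    > file3.txt

# What the interpolated value would look like: names separated by spaces.
input_files='file1.txt file2.txt file3.txt'

# Left unquoted on purpose so the shell splits it into three arguments.
cat $input_files > COLLECTED-output.txt

cat COLLECTED-output.txt
```

Splitting an unquoted variable into words is the behavior of POSIX-style shells such as sh and bash, and it breaks on file names that contain spaces.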
In the process block, make the following code change:
Before:
After:
In theory this should handle any arbitrary number of input files.
Tip
Some command-line tools require providing an argument (like -input) for each input file.
In that case, we would have to do a little bit of extra work to compose the command.
You can see an example of this in the 'Nextflow for Genomics' training course.
2.2. Add the collection step to the workflow¶
Now we should just need to call the collection process on the output of the uppercasing step.
2.2.1. Connect the process calls¶
In the workflow block, make the following code change:
Before:
After:
This connects the output of convertToUpper() to the input of collectGreetings().
2.2.2. Run the workflow with -resume¶
Let's try it.
It runs successfully, including the third step:
However, look at the number of calls for collectGreetings() on line 8.
We were only expecting one, but there are three.
And have a look at the contents of the final output file too:
Oh no. The collection step was run individually on each greeting, which is NOT what we wanted.
We need to do something to tell Nextflow explicitly that we want that third step to run on all the items in the channel output by convertToUpper().
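One way to picture the problem, sketched in shell under the assumption that every call writes to the same output file name: each call sees only one file, and each redirect replaces what the previous call wrote.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo 'HELLO'   > UPPER-Hello-output.txt
echo 'BONJOUR' > UPPER-Bonjour-output.txt
echo 'HOLà'    > UPPER-Holà-output.txt

# Three separate calls, one file each: '>' truncates the target every time.
for f in UPPER-Hello-output.txt UPPER-Bonjour-output.txt UPPER-Holà-output.txt; do
    cat "$f" > COLLECTED-output.txt
done

# Only the greeting from the last call is left.
cat COLLECTED-output.txt
```

In this loop the last call is always the third one. In a real run the order of the calls is not fixed, so which greeting survives may differ.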
2.3. Use an operator to collect the greetings into a single input¶
Yes, once again the answer to our problem is an operator.
Specifically, we are going to use the aptly-named collect() operator.
2.3.1. Add the collect() operator¶
This time it's going to look a bit different because we're not adding the operator in the context of a channel factory, but to an output channel.
We take the convertToUpper.out and append the collect() operator, which gives us convertToUpper.out.collect().
We can plug that directly into the collectGreetings() process call.
In the workflow block, make the following code change:
Before:
After:
2.3.2. Add some view() statements¶
Let's also include a couple of view() statements to visualize the before and after states of the channel contents.
Before:
After:
The view() statements can go anywhere you want; we put them after the call for readability.
2.3.3. Run the workflow again with -resume¶
Let's try it:
It runs successfully, although the log output may look a little messier than this (we cleaned it up for readability).
This time the third step was only called once!
Looking at the output of the view() statements, we see the following:
- Three Before collect: statements, one for each greeting: at that point the file paths are individual items in the channel.
- A single After collect: statement: the three file paths are now packaged into a single item.
Have a look at the contents of the final output file too:
This time we have all three greetings in the final output file. Success!
Note
If you run this several times without -resume, you will see that the order of the greetings changes from one run to the next.
This shows you that the order in which items flow through the pipeline is not guaranteed to be consistent.
Takeaway¶
You know how to collect outputs from a batch of process calls and feed them into a joint analysis or summation step.
What's next?¶
Learn how to pass more than one input to a process.
3. Pass more than one input to a process in order to name the final output file uniquely¶
We want to be able to name the final output file something specific in order to process subsequent batches of greetings without overwriting the final results.
To that end, we're going to make the following refinements to the workflow:
- Modify the collector process to accept a user-defined name for the output file
- Add a command-line parameter to the workflow and pass it to the collector process
3.1. Modify the collector process to accept a user-defined name for the output file¶
We're going to need to declare the additional input and integrate it into the output file name.
3.1.1. Declare the additional input in the process definition¶
Good news: we can declare as many input variables as we want.
Let's call this one batch_name.
In the process block, make the following code change:
Before:
After:
You can set up your processes to expect as many inputs as you want. Later on, you will learn how to manage required vs. optional inputs.
3.1.2. Use the batch_name variable in the output file name¶
In the process block, make the following code change:
Before:
After:
This sets up the process to use the batch_name value to generate a specific filename for the final output of the workflow.
3.2. Add a batch command-line parameter¶
Now we need a way to supply the value for batch_name and feed it to the process call.
3.2.1. Use params to set up the parameter¶
You already know how to use the params system to declare CLI parameters.
Let's use that to declare a batch parameter (with a default value because we are lazy).
In the pipeline parameters section, make the following code changes:
Before:
After:
Remember you can override that default value by specifying a value with --batch on the command line.
3.2.2. Pass the batch parameter to the process¶
To provide the value of the parameter to the process, we need to add it in the process call.
In the workflow block, make the following code change:
Before:
After:
Warning
You MUST provide the inputs to a process in the EXACT SAME ORDER as they are listed in the input definition block of the process.
3.3. Run the workflow¶
Let's try running this with a batch name on the command line.
It runs successfully:
And produces the desired output:
Now, subsequent runs on other batches of inputs won't clobber previous results (as long as we specify the parameter appropriately).
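The effect on file naming can be sketched in shell. The exact naming pattern used by the process is not shown here, so the pattern COLLECTED-<batch>-output.txt, the helper collect_batch, and the batch names batch1 and batch2 are all assumptions made purely for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical helper: write one batch of greetings to a batch-specific file.
collect_batch() {
    batch_name=$1
    shift
    printf '%s\n' "$@" > "COLLECTED-${batch_name}-output.txt"
}

collect_batch batch1 HELLO BONJOUR
collect_batch batch2 HOLà

# Both result files exist side by side; neither run overwrote the other.
ls
```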
Takeaway¶
You know how to pass more than one input to a process.
What's next?¶
Learn how to emit multiple outputs and handle them conveniently.
4. Add an output to the collector step¶
When a process produces only one output, it's easy to access it (in the workflow block) using the <process>.out syntax.
When there are two or more outputs, the default way to select a specific output is to use the corresponding (zero-based) index; for example, you would use <process>.out[0] to get the first output.
This is not terribly convenient; it's too easy to grab the wrong index.
Let's have a look at how we can select and use a specific output of a process when there is more than one.
For demonstration purposes, let's say we want to count and report the number of greetings that are being collected for a given batch of inputs.
To that end, we're going to make the following refinements to the workflow:
- Modify the process to count and output the number of greetings
- Once the process has run, select the count and report it using view (in the workflow block)
4.1. Modify the process to count and output the number of greetings¶
This will require two key changes to the process definition: we need a way to count the greetings, then we need to add that count to the output block of the process.
4.1.1. Count the number of greetings collected¶
Conveniently, Nextflow lets us add arbitrary code in the script: block of the process definition, which comes in really handy for doing things like this.
That means we can use the built-in size() function to get the number of files in the input_files array.
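For comparison, the same kind of count can be obtained in plain shell. This sketch is an analogy rather than the Nextflow code for this step: it treats the positional parameters as the list of input files and reads their number from $#. The file names are placeholders.

```shell
# Load placeholder file names into the positional parameters.
set -- UPPER-Hello-output.txt UPPER-Bonjour-output.txt UPPER-Holà-output.txt

# $# holds how many positional parameters there are.
count_greetings=$#

echo "Number of greetings: ${count_greetings}"
```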
In the process block, make the following code change:
Before:
After:
The count_greetings variable will be computed at runtime.
4.1.2. Emit the count as a named output¶
In principle all we need to do is to add the count_greetings variable to the output: block.
However, while we're at it, we're also going to add some emit: tags to our output declarations. These will enable us to select the outputs by name instead of having to use positional indices.
In the process block, make the following code change:
Before:
After:
The emit: tags are optional, and we could have added a tag to only one of the outputs.
But as the saying goes, why not both?
4.2. Report the output at the end of the workflow¶
Now that we have two outputs coming out of the collectGreetings process, the collectGreetings.out output channel contains two 'tracks':
- collectGreetings.out.outfile contains the final output file
- collectGreetings.out.count contains the count of greetings
We could send either or both of these to another process for further work. However, in the interest of wrapping this up, we're just going to use view() to demonstrate that we can access and report the count of greetings.
In the workflow block, make the following code change:
Before:
After:
Here we are using $it in the same way we did earlier, as an implicit variable to access the contents of the channel.
Note
There are a few other ways we could achieve a similar result, including some more elegant ones like the count() operator, but this allows us to show how to handle multiple outputs, which is what we care about.
4.3. Run the workflow¶
Let's try running this with the current batch of greetings.
This runs successfully:
The last line (line 8) shows that we correctly retrieved the count of greetings processed. Feel free to add more greetings to the CSV and see what happens.
Takeaway¶
You know how to make a process emit a named output and how to access it from the workflow block.
More generally, you understand the key principles involved in connecting processes together in common ways.
What's next?¶
Take an extra long break, you've earned it. When you're ready, move on to Part 4 to learn how to modularize your code for better maintainability and code efficiency.