Recording your phylogenetic workflows

(well, the terminal session anyways)

Analyzing phylogenetics datasets, especially in the -omics era, often involves a huge decision tree. I often find myself stringing together a combination of many different tools, including phylogenetic software, my custom scripts, and basic Unix commands to accomplish even the most basic of workflows. It may be challenging to keep track of all the things you do with the data.

There are good reasons for carefully documenting your pipeline. It will allow you to go back to any step in your workflow in the future and share your analysis with others, increasing reproducibility of your research.

For a long time my preferred method was to laboriously copy-and-paste each command, annotating it with comments in a text file in order to keep track of what, how, and when I did when analyzing data. As it turns out, there are command line tools (of course!) out there that can help us do a bit better.

We can record a terminal session to a text file using the command script. It is a handy little tool that transcribes everything you do in the terminal. It came with my Ubuntu 14.04 LTS distro and all I had to do to start it was:

script test_log.txt

This will start recording to the test_log.txt file. It will keep linefeed, backspace, etc., which makes the output a bit overwhelming. We’ll look at how to make it a bit more transparent later.

Let’s try recording a simple phylogenetic workflow. I will first use R from the command line to download a series of ant sequences from GenBank, then use mafft to align the sequences, concatenate them with AMAS, make some minor adjustments with the indispensable command line tool sed, make a phylogenetic tree with RAxML, and finally visualize it with DendroPy.

I have some sequence identifiers from this study, and I’ll use R packages ape and seqinr to download and write sequences from 16 ant species and two nuclear loci, wingless and long-wave rhodopsin.

library("ape")
library("seqinr")

# input seq names
wg_seqs "EF013664", "EF013669", "AY867423", "EF013671",
"AY867427", "EF013690", "AY703626", "EF013663",
"EF013661", "AY867421", "AY867429", "AY703634",
"EF013660")

# read from GenBank and get taxa names
wg # check the species names
attr(wg, "species")

lr_seqs "EF013536", "EF013541", "AY867485", "EF013543",
"AY867489", "EF013562", "AY703760", "EF013535",
"EF013533", "AY867483", "AY867491", "AY703768",
"EF013532")

lr # make sure the names are the same
attr(lr, "species")

# write the files preserving names
write.fasta(wg, names=attr(wg, "species"), file.out="wingless.fas")
write.fasta(lr, names=attr(lr, "species"), file.out="LWRh.fas")

quit()

The script utility will record not only your commands, but also their output. The species names should look like this for both genes:

[1] "Aenictus_ceylonicus"
[2] "Aenictogiton_sp._ZAM02"
[3] "Tatuidris_sp._ECU01"
[4] "Adetomyrma_sp._MAD02"
[5] "Aneuretus_simoni"
[6] "Acanthostichus_kirbyi"
[7] "Anonychomyrma_gilberti"
[8] "Dorylus_helvolus"
[9] "Cheliomyrmex_cf._morosus_CASENT0007006"
[10] "Ectatomma_opaciventre"
[11] "Acropyga_acutiventris"
[12] "Acanthoponera_minor"
[13] "Leptanilla_sp._D233"
[14] "Leptanilloides_mckennae"
[15] "Myrmecia_pyriformis_Smith,_1858"
[16] "Acanthognathus_ocellatus"

Now I can align the sequences with mafft:

mafft wingless.fas > wingless_aln.fas
mafft LWRh.fas > LWRh_aln.fas

And use AMAS to concatenate the aligned files:

amas concat -i *aln.fas -d dna -f fasta --out-format phylip

This will give you a file called concatenated.out that is a phylip alignment with both genes for all 16 species.

Now it’s (almost) time to make a tree. Our original taxon names included some unusual characters, such as commas and full stops. RAxML doesn’t like that and because of situations like these it’s useful to know basic command line tools for text manipulation. Here I’m using  sed to remove these characters from the concatenated file. Using the -i flag I’m modifying the files in place; normally the output would have been by default printed to screen. This is usually what you want with sed, unless you’re absolutely certain that what you’re doing is correct.

sed -i 's/,//g' concatenated.out
sed -i 's/\.//g' concatenated.out

Now we can run RAxML:

raxml -T 2 -d -m GTRGAMMA -p 1234 -x 1234 -s concatenated.out -o Leptanilla_sp_D233 -n test1

This will create a series of files, with RAxML_bestTree.test1 being the one with the tree in the newick format. Let’s use the Python package dendropy to visualize the tree. After calling the interactive python shell:

import dendropy

# read tree from file
tree = dendropy.Tree.get_from_path("RAxML_bestTree.test1", "newick")

# plot the tree
print(tree.as_ascii_plot())

quit()

That’s it. Now you can stop the script command by simply typing

exit

and pressing Enter.

My log_test.txt output looks like this. If you view it on the command line via less, you will notice the extra characters. A bit cleaner output is produced by cat, which interprets the newline and backspaces:

cat test_log.txt

We can use grep with to find all the log lines containing bash commands, which on my terminal means lines that include the dollar sign:

grep "\\$" test_log.txt

In this case I’m searching for lines that match the literal character $, escaped with \\.

This prints the following commands:

$ R
$ mafft wingless.fas > wingless_aln.fas
$ mafft LWRh.fas > LWRh_aln.fas
$
$ amas concat -i *aln.fas -d dna -f fasta --out-format phylip
$
$ sed -i 's/,//g' concatenated.out
$ sed -i 's/\.//g' concatenated.out
$
$ raxml -T 2 -d -m GTRGAMMA -p 1234 -s concatenated.out -o Leptanilla_sp_D233 -n test1
$ python
$ exit

It’s not ideal, but for a simple tool script gives you the power to record what exactly you’re doing with your data. You can also append to an existing project log with:

script -a test_log.txt

script makes it easier to keep track of your project, record terminal actions for a publication, or put together a Makefile. You can even track the time of each action and then use scriptreplay to replay a simple set of commands! Find more on script and scriptreplay options in this tutorial.

What do you use for keeping track of your command line workflows? Do you know of better tools? Let me know!