(well, the terminal session anyways)
Analyzing phylogenetics datasets, especially in the -omics era, often involves a huge decision tree. I often find myself stringing together a combination of many different tools, including phylogenetic software, my custom scripts, and basic Unix commands to accomplish even the most basic of workflows. It may be challenging to keep track of all the things you do with the data.
There are good reasons for carefully documenting your pipeline. It will allow you to go back to any step in your workflow in the future and share your analysis with others, increasing reproducibility of your research.
For a long time my preferred method was to laboriously copy-and-paste each command, annotating it with comments in a text file in order to keep track of what, how, and when I did when analyzing data. As it turns out, there are command line tools (of course!) out there that can help us do a bit better.
We can record a terminal session to a text file using the command
script. It is a handy little tool that transcribes everything you do in the terminal. It came with my Ubuntu 14.04 LTS distro and all I had to do to start it was:
This will start recording to the
test_log.txt file. It will keep linefeed, backspace, etc., which makes the output a bit overwhelming. We’ll look at how to make it a bit more transparent later.
Let’s try recording a simple phylogenetic workflow. I will first use
R from the command line to download a series of ant sequences from GenBank, then use mafft to align the sequences, concatenate them with AMAS, make some minor adjustments with the indispensable command line tool
sed, make a phylogenetic tree with RAxML, and finally visualize it with DendroPy.
library("ape") library("seqinr") # input seq names wg_seqs "EF013664", "EF013669", "AY867423", "EF013671", "AY867427", "EF013690", "AY703626", "EF013663", "EF013661", "AY867421", "AY867429", "AY703634", "EF013660") # read from GenBank and get taxa names wg # check the species names attr(wg, "species") lr_seqs "EF013536", "EF013541", "AY867485", "EF013543", "AY867489", "EF013562", "AY703760", "EF013535", "EF013533", "AY867483", "AY867491", "AY703768", "EF013532") lr # make sure the names are the same attr(lr, "species") # write the files preserving names write.fasta(wg, names=attr(wg, "species"), file.out="wingless.fas") write.fasta(lr, names=attr(lr, "species"), file.out="LWRh.fas") quit()
script utility will record not only your commands, but also their output. The species names should look like this for both genes:
 "Aenictus_ceylonicus"  "Aenictogiton_sp._ZAM02"  "Tatuidris_sp._ECU01"  "Adetomyrma_sp._MAD02"  "Aneuretus_simoni"  "Acanthostichus_kirbyi"  "Anonychomyrma_gilberti"  "Dorylus_helvolus"  "Cheliomyrmex_cf._morosus_CASENT0007006"  "Ectatomma_opaciventre"  "Acropyga_acutiventris"  "Acanthoponera_minor"  "Leptanilla_sp._D233"  "Leptanilloides_mckennae"  "Myrmecia_pyriformis_Smith,_1858"  "Acanthognathus_ocellatus"
Now I can align the sequences with
mafft wingless.fas > wingless_aln.fas mafft LWRh.fas > LWRh_aln.fas
AMAS to concatenate the aligned files:
amas concat -i *aln.fas -d dna -f fasta --out-format phylip
This will give you a file called
concatenated.out that is a
phylip alignment with both genes for all 16 species.
Now it’s (almost) time to make a tree. Our original taxon names included some unusual characters, such as commas and full stops.
RAxML doesn’t like that and because of situations like these it’s useful to know basic command line tools for text manipulation. Here I’m using
sed to remove these characters from the concatenated file. Using the
-i flag I’m modifying the files in place; normally the output would have been by default printed to screen. This is usually what you want with
sed, unless you’re absolutely certain that what you’re doing is correct.
sed -i 's/,//g' concatenated.out sed -i 's/\.//g' concatenated.out
Now we can run
raxml -T 2 -d -m GTRGAMMA -p 1234 -x 1234 -s concatenated.out -o Leptanilla_sp_D233 -n test1
This will create a series of files, with
RAxML_bestTree.test1 being the one with the tree in the
newick format. Let’s use the
dendropy to visualize the tree. After calling the interactive
import dendropy # read tree from file tree = dendropy.Tree.get_from_path("RAxML_bestTree.test1", "newick") # plot the tree print(tree.as_ascii_plot()) quit()
That’s it. Now you can stop the
script command by simply typing
log_test.txt output looks like this. If you view it on the command line via
less, you will notice the extra characters. A bit cleaner output is produced by
cat, which interprets the newline and backspaces:
We can use
grep with to find all the log lines containing
bash commands, which on my terminal means lines that include the dollar sign:
grep "\\$" test_log.txt
In this case I’m searching for lines that match the literal character
$, escaped with
This prints the following commands:
$ R $ mafft wingless.fas > wingless_aln.fas $ mafft LWRh.fas > LWRh_aln.fas $ $ amas concat -i *aln.fas -d dna -f fasta --out-format phylip $ $ sed -i 's/,//g' concatenated.out $ sed -i 's/\.//g' concatenated.out $ $ raxml -T 2 -d -m GTRGAMMA -p 1234 -s concatenated.out -o Leptanilla_sp_D233 -n test1 $ python $ exit
It’s not ideal, but for a simple tool
script gives you the power to record what exactly you’re doing with your data. You can also append to an existing project log with:
script -a test_log.txt
script makes it easier to keep track of your project, record terminal actions for a publication, or put together a Makefile. You can even track the time of each action and then use
scriptreplay to replay a simple set of commands! Find more on
scriptreplay options in this tutorial.
What do you use for keeping track of your command line workflows? Do you know of better tools? Let me know!