makefile to the rescue

Computational protocols can be long and confusing, so it is useful to keep them in a script. But we might not want to rerun the script from scratch every time, especially if we already have the intermediate files.

Make is a tool originally designed to automate the building of executable programs and libraries from source code. It relies on simple rules, each with a target and its dependencies, and all it needs is a file named “makefile” in the working directory:

target: dependencies
	command to create target

Each command line in a rule must start with a tab character. With the makefile in place, running make generates the target:

make target

Make first checks the dependencies: if one does not exist, make tries to create it (provided the makefile has a rule for it). It then runs the command to create the target if the target is missing or if any dependency is newer than it. This is useful for managing complex protocols and custom commands:

# The count files that should exist at the end of this step
counts = counts/Pro_NGSs.1.count counts/Pro_NGSs.2.count counts/Pro_NGSs.3.count

# The % sign acts as a wildcard shared by the target and the dependency;
# in the commands, $* stands for whatever % matched.
# The commands make the folder for the counts, run the counting program
# with its output redirected to the target, and then change the permissions
# of the target so that it is not easily deleted.
counts/%.count: aligned/%.Psorted.bam
	mkdir -p counts
	echo counting $*
	python -m HTSeq.scripts.count \
		-f bam \
		aligned/$*.Psorted.bam \
		genome/Homo_sapiens.GRCh38.84.gtf > counts/$*.count
	chmod 555 counts/$*.count

# The python command lists the count files, reads each one as a column,
# concatenates the columns in a dataframe, names each column after its file
# and saves the new dataframe.
counts/merged.csv: $(counts)
	python -c 'import glob, re, pandas; \
		files = sorted(glob.glob("counts/*.count")); \
		df = pandas.concat([pandas.read_csv(x, sep="\t", header=None, index_col=0).iloc[:, 0] for x in files], axis=1); \
		df.columns = [re.findall(r".+/(.+)\.", x)[0] for x in files]; \
		df.to_csv("counts/merged.csv")'

In this makefile snippet, the ‘make counts/merged.csv’ command will try to create a file named ‘counts/merged.csv’, but before running the python command that merges all the count files, it first checks whether the files listed in the variable $(counts) exist. If they do not, it runs the command to create them with htseq-count (after first checking that the bam file each one depends on exists). This snippet is only part of the full makefile, which can take raw FASTQ files, run quality control, index the genome for alignment, align the samples… and proceed all the way to standard analyses such as differential expression or splicing analysis.
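
The check that make performs for each target can be sketched in a few lines of Python. This is only an illustration of the basic timestamp rule, not how make is implemented; the function name needs_rebuild is made up for this example, and it assumes that every dependency already exists (make itself would first try to build the missing ones).

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    Hypothetical helper: it mimics make's basic timestamp rule only and
    assumes every dependency already exists on disk.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # A dependency modified after the target makes the target out of date.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

For instance, needs_rebuild("counts/Pro_NGSs.1.count", ["aligned/Pro_NGSs.1.Psorted.bam"]) would be True if the count file is missing or if the bam file was modified after the count file was written.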

Another advantage of make is that it can parallelise some steps. In the example above, when making ‘counts/merged.csv’, it can create all the counts/%.count files in parallel because they do not depend on each other, which makes the process much faster:

make -j 16 counts/merged.csv # -j dictates the maximum number of processes to run

It is important to get familiar with make's automatic variables:

$@ The target
$* The part of the target matched by "%" (the stem)
$^ All the prerequisites, separated by spaces
$< The first prerequisite

There are many more, but these are the ones I use the most.
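
To see what these variables hold for a pattern rule such as the counts/%.count rule above, here is a small Python sketch. The function expand_pattern_rule is made up for this example and is not part of make; it assumes a simple pattern with a single % and ignores the finer details of how make handles directories and duplicate prerequisites.

```python
def expand_pattern_rule(target, target_pattern, prerequisite_patterns):
    """Return the values of $@, $*, $^ and $< for a simple pattern rule.

    Hypothetical helper: it only handles patterns containing a single %.
    """
    prefix, suffix = target_pattern.split("%")
    matches = (
        target.startswith(prefix)
        and target.endswith(suffix)
        and len(target) > len(prefix) + len(suffix)
    )
    if not matches:
        raise ValueError(f"{target} does not match {target_pattern}")
    # The stem is whatever % matched in the target.
    stem = target[len(prefix):len(target) - len(suffix)]
    prerequisites = [p.replace("%", stem) for p in prerequisite_patterns]
    return {
        "$@": target,
        "$*": stem,
        "$^": " ".join(prerequisites),
        "$<": prerequisites[0] if prerequisites else "",
    }


values = expand_pattern_rule(
    "counts/Pro_NGSs.1.count",
    "counts/%.count",
    ["aligned/%.Psorted.bam"],
)
for name, value in values.items():
    print(name, value)
```

Here the stem is Pro_NGSs.1, so the only prerequisite becomes aligned/Pro_NGSs.1.Psorted.bam, which is the value of both $^ and $<.
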
One of the major advantages of using a makefile to manage computational protocols is that there is a script for the whole analysis, including all the parameters used, which allows for replicability and troubleshooting. Another advantage is that, as long as the input files and the makefile are not deleted, every file in the analysis can be regenerated. This gives peace of mind in case of an accident and also reduces the number of files that need to be archived. The main disadvantage is having to learn a reasonably complex computational language; with little experience, troubleshooting can be a daunting process.
