Computational protocols can be long and confusing, so it is useful to keep them in a script. However, we might not want to run the script from scratch every time, for example when intermediate files already exist.
Make is a tool originally designed to automate the building of executable programs and compiling their libraries. Make relies on simple rules with a target and dependencies and all it needs is a file named “makefile” in the working directory:
target: dependencies
	command to create target
You can then run make to generate the target:
make target
Make first checks whether the dependencies exist; if one is missing, it tries to create it, provided the makefile has a rule for it. It then runs the command to create the target if the target is missing or if any dependency has been updated since the target was last built. This is useful for the management of complex protocols and custom commands:
# the count files that the merge step depends on
counts = counts/Pro_NGSs.1.count counts/Pro_NGSs.2.count counts/Pro_NGSs.3.count

# the % sign acts as a wildcard shared between the target and the dependency
counts/%.count: aligned/%.Psorted.bam
# make the folder for the counts to be saved
	mkdir -p counts
# $* stands for the part matched by %
	echo counting $*
# run the counting program, which creates the target, then remove write
# permission from the target so that it is not easily deleted
	python -m HTSeq.scripts.count \
		-f bam \
		aligned/$*.Psorted.bam \
		genome/Homo_sapiens.GRCh38.84.gtf > counts/$*.count && \
	chmod 555 counts/$*.count

# read the count files, concatenate them into one dataframe, name each column
# after its sample and save the new dataframe
counts/merged.csv: $(counts)
	python -c 'import glob,re,pandas; \
	files = glob.glob("counts/*.count"); \
	df=pandas.concat([pandas.read_csv(x,sep="\t",header=None,index_col=0).iloc[:,0] for x in files],axis=1); \
	df.columns = [re.findall(".+/(.+)\.",x)[0] for x in files]; \
	df.to_csv("counts/merged.csv")'
In this makefile snippet, the ‘make counts/merged.csv’ command will try to create a file named ‘counts/merged.csv’, but before running the python command that merges all the count files, it first checks whether the files listed in the variable $(counts) exist. If they do not, it runs the command to create them using htseq-count (after first checking that the BAM file each one depends on exists). This snippet is only part of the makefile; the entire makefile can take raw FASTQ files, run quality control, index the genome for alignment, align the samples… and proceed all the way to some standard analyses such as differential expression or splicing analysis.
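The decision make takes for each target can be sketched in a few lines of Python. This is only an illustration of the idea (rebuild when the target is missing or older than a dependency), not how make is actually implemented; the function names and file names are made up for the example, and the dependencies are assumed to exist already.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    Simplification: every dependency is assumed to exist already. Make
    would first try to build a missing dependency from its own rule.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def set_mtime(path, mtime):
    """Set the access and modification time of a file (for the demo only)."""
    os.utime(path, (mtime, mtime))


def demo():
    """Walk through the three situations using empty placeholder files."""
    with tempfile.TemporaryDirectory() as workdir:
        bam = os.path.join(workdir, "sample.bam")      # hypothetical dependency
        count = os.path.join(workdir, "sample.count")  # hypothetical target
        open(bam, "w").close()

        results = [needs_rebuild(count, [bam])]  # target does not exist yet

        open(count, "w").close()
        set_mtime(bam, 1000)
        set_mtime(count, 2000)
        results.append(needs_rebuild(count, [bam]))  # target newer than dependency

        set_mtime(bam, 3000)
        results.append(needs_rebuild(count, [bam]))  # dependency updated afterwards
        return results


if __name__ == "__main__":
    print(demo())
```

Because only timestamps are compared, touching an input file is enough to make everything downstream of it count as out of date.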
Another advantage of make is that it can parallelise some steps. In the example above, when making ‘counts/merged.csv’, it can parallelise the counting of all the counts/%.count files making the process much faster:
make -j 16 counts/merged.csv # -j dictates the maximum number of processes to run
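Parallelising is possible because the count files do not depend on each other, while the merged file depends on all of them. A rough Python analogy, with a placeholder function standing in for the real counting job (all names here are hypothetical and nothing is actually counted):

```python
from concurrent.futures import ThreadPoolExecutor


def count_sample(sample):
    """Placeholder for one independent counting job; returns its target name."""
    return f"counts/{sample}.count"


def build_all(samples, jobs):
    """Run the independent jobs with at most `jobs` at a time, then merge."""
    # Independent targets can be built side by side, like make -j <jobs>.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        count_files = list(pool.map(count_sample, samples))
    # The merge step needs every count file, so it only starts afterwards.
    return "merged " + ", ".join(count_files)


if __name__ == "__main__":
    print(build_all(["Pro_NGSs.1", "Pro_NGSs.2", "Pro_NGSs.3"], jobs=16))
```

With three samples, asking for 16 jobs does not help beyond three, since there are only three independent targets to build.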
It is important to get familiar with make automatic variables:
$@ The target
$* The part of the target matched by "%"
$^ All the prerequisites, separated by spaces
$< The first prerequisite
There are many more, but these are the ones I use the most.
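To make the automatic variables more concrete, here is a small Python sketch of how they would be filled in for a pattern rule such as the counts/%.count rule from the snippet. It only mimics the behaviour described above for a single "%" per pattern; the function names are invented for this example and are not part of make.

```python
def match_stem(pattern, name):
    """Return the text matched by % in pattern, or None if name does not match."""
    prefix, suffix = pattern.split("%", 1)
    if (name.startswith(prefix) and name.endswith(suffix)
            and len(name) > len(prefix) + len(suffix)):  # the stem cannot be empty
        return name[len(prefix):len(name) - len(suffix)]
    return None


def automatic_variables(target_pattern, prerequisite_patterns, target):
    """Return the values $@, $*, $< and $^ would have for this target."""
    stem = match_stem(target_pattern, target)
    if stem is None:
        raise ValueError(f"{target} does not match {target_pattern}")
    # Substitute the stem for % in every prerequisite pattern.
    prerequisites = [p.replace("%", stem, 1) for p in prerequisite_patterns]
    # $^ lists each prerequisite only once, in order of first appearance.
    unique = list(dict.fromkeys(prerequisites))
    return {
        "$@": target,
        "$*": stem,
        "$<": prerequisites[0] if prerequisites else "",
        "$^": " ".join(unique),
    }


if __name__ == "__main__":
    values = automatic_variables(
        "counts/%.count", ["aligned/%.Psorted.bam"], "counts/Pro_NGSs.1.count"
    )
    for name, value in values.items():
        print(name, value)
```

For the target counts/Pro_NGSs.1.count this gives $* as Pro_NGSs.1, and both $< and $^ as aligned/Pro_NGSs.1.Psorted.bam, since the rule has a single prerequisite.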
One of the major advantages of using a makefile to manage computational protocols is that there is a script for the entire analysis, including all the parameters used, which allows for replicability and troubleshooting. Another advantage is that, as long as the input files and the makefile are not deleted, every file in the analysis can be regenerated. This gives peace of mind in case of an accident and also reduces the number of files that need to be archived. The main disadvantage is having to learn a reasonably complex computational language; with little experience, troubleshooting can be daunting.