Titanic
Mohit Kothari, Roger Tanuatmadja, Gautam Akiwate
Computer Science and Engineering, University of California, San Diego
Abstract—Next-generation sequencing (NGS), also known as high-throughput sequencing, is currently being utilized in the Bioinformatics Department of the University of California, San Diego as a research tool to examine and detect variants (mutations) in the genomes of individuals and related family members. Unfortunately, given the current tools and compute environment, a typical NGS work-flow presently takes between 12 and 14 days to complete. There is therefore a desire within the department to investigate the possibility of reducing the turnaround time to accommodate current and future research needs that would tread a similar path.
This paper describes the investigation that has been carried out over a period of a few weeks on the first phase of the sequencing, which takes around 5 to 7 days with the current software tool chain and hardware, utilizing a set of smaller but still representative data sets, and it concludes with some recommendations and lessons learned. Additionally, we were also able to build a surprisingly accurate model which predicts the behavior of the tool chain.
Keywords—NGS, Variant Calling, BWA, Picard, GATK, BAM, SAM
I. INTRODUCTION
A full end-to-end NGS work-flow involves a number of smaller work-flows/phases, including utilizing hardware sequencers such as the ones made by Illumina and pre-processing the output BAM files by running them through a set of software tools to perform tasks such as sequence mapping (hereby called the pre-processing phase), before concluding with variant calling, annotation and filtering. A BAM file is an industry-standard binary version of a SAM file, with the latter being a tab-delimited text file containing sequence alignment data. For the purposes of this paper, we will only focus on the pre-processing phase, since this is the work-flow that is currently taking the longest to complete.

An overview of the current hardware and software tool chain (including how much time each tool is currently taking to complete in a typical execution) has been provided in Table I and Table II.

TABLE I: Machine Description

System                      Description
Processor Model             Intel Xeon
Clock Speed                 1.8GHz
No. of Processors           1
No. of Cores per Processor  4
RAM Size                    100 GB
Disk                        228 GB

The current work-flow is executed via a Perl script that runs each stage in Table II in sequential order. Where possible, the communication between the stages is done through Unix pipes to reduce I/O, with temporary files being used when piping is not feasible. Additionally, although each stage proceeds sequentially, wherever applicable the tool utilized in each stage is currently executed with parameters that would take advantage of extra processors/cores.
II. OVERVIEW

We start by talking about the different tools that make up the pipeline in Section III. We then move to describing the workings of the pipeline itself in Section IV. Section V describes our framework for measuring different system resource usage while running the pipeline and also tries to explain the challenges faced while replicating the environment on a new system. Section VI describes our initial analysis and includes some insights into CPU, I/O and memory usage, along with Java GC and hardware performance counters. Section VII attempts to model the entire pipeline in mathematical terms, and Section VIII evaluates the model and lists some surprising results. Section IX uses the model to analyse recurring issues in the pipeline and makes some future predictions. Finally, Section X concludes with lessons learnt and some recommendations for the pipeline.

III. TOOLS AND CONFIGURATIONS
To understand the pipeline, it is important that we understand the tools that make it up. Hence, we begin by describing the tools and the roles they play. The pipeline consists of the following tool chain for processing:

1) BWA
2) SAMtools
3) HTSlib
4) Picard
5) Genome Analysis Toolkit (GATK)
A. BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three variations of the Burrows-Wheeler Aligner algorithm: BWA-backtrack, BWA-SW and BWA-MEM. The pipeline uses BWA-MEM, as it is capable of processing up to 1 million base pairs (bp) [1]. BWA-MEM is highly parallel, as it works on independent chunks of base-pair reads.
2 http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices
3 http://bio-bwa.sourceforge.net/
1 http://www.slideshare.net/AustralianBioinformatics/introduction-tonextgeneration
TABLE II: Software Tool Chain

Stage Name                                          Tool Family   Software Tool                        Current Processing Time
Shuffling and Aligning Input File                   N/A           htscmd, bwa, samtools (C language)   33 hours
SAM File Sorting                                    Picard        SortSam (Java)                       8 hours
Mark Duplicates                                     Picard        MarkDuplicates (Java)                8 hours
BAM File Index Construction                         Picard        BuildBamIndex (Java)                 1 hour
Building Insert/Delete (Indel) Realignment Targets  GATK          RealignerTargetCreator (Java)        8 hours
Realignment around Indels                           GATK          IndelRealigner (Java)                8 hours
Base Q Covariance 1st Stage                         GATK          BaseRecalibrator (Java)              30 hours
Base Q Covariance 2nd Stage                         GATK          BaseRecalibrator (Java)              80 hours
Plot Base Q Results                                 GATK          AnalyzeCovariates (Java)             0 hours
Base Q Recalibration                                GATK          PrintReads (Java)                    33 hours

TABLE III: Parallelism in GATK

Tool   Full Name                Supported Parallelism
RTC    Realign Target Creator   NT
IR     Indel Realigner          SG
BR     Base Recalibrator        NCT, SG
PR     Print Reads              NCT

B. SAMtools
The SAM (Sequence Alignment/Map) file format is a generic format for storing large nucleotide sequence alignments. SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. It is inherently a single-threaded application [2].
C. HTSlib
HTSlib is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data, and is the core library used by samtools and bcftools. The binary is called htscmd, and it is used to shuffle the input data and convert the latter into a single FASTQ file. This file is then provided as an input to the BWA tool. This tool is also single-threaded and doesn't have any parallelism in it.
D. Picard tools
Picard is comprised of Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both the SAM text format and the SAM binary (BAM) format are supported by the tool. The pipeline uses three utilities from the Picard tool set, namely for sorting the SAM file (SortSam), marking duplicates (MarkDuplicates) and building the BAM index (BuildBamIndex). All of these utilities are unfortunately single-threaded, and the algorithms don't employ any parallelism.
E. Genome Analysis Toolkit (GATK)

Similar to Picard, GATK is a set of tools, and it serves as the core of the pipeline, performing the major analysis tasks on the genome. GATK is the industry standard for analyses such as identifying rare mutations among exomes as well as specific mutations within a group of patients. The specific tools that are being used are RealignerTargetCreator, IndelRealigner, BaseRecalibrator and PrintReads. GATK was built from the ground up with performance in mind. It employs the concept of MapReduce, which is basically a strategy to speed up performance by breaking down large iterative tasks into shorter segments whose outputs are then merged into some overall result. Additionally, it also employs multi-threading heavily. Multi-threading is enabled simply by using the nt and nct command-line arguments [3].

Here nt represents the number of data threads sent to the processor, and nct represents the number of CPU threads allocated to each data thread. Apart from multi-threading, GATK also has a notion of "Scatter-Gather" (SG), which can be applied to a cluster of machines. Scatter-Gather is a very different process from multi-threading because the parallelization happens outside of the program itself: it basically creates separate GATK commands for portions of the input data and sends these commands to different nodes of the cluster.

Not all of the tools support all of these kinds of parallelism; Table III lists the support for each tool. As seen in Table III, SG is not supported by all the tools, and the ones that do support it are not the bottleneck of the pipeline, as explained in Section VII.
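To make the threading options concrete, here is a minimal sketch of how a stage wrapper might assemble GATK invocations with the two kinds of thread parallelism. It assumes the GATK 3.x command-line style described above (-T selects the tool; -nt and -nct set data and CPU threads); the jar location, file names and thread counts are hypothetical placeholders, not the pipeline's actual configuration.

    import subprocess

    # Hypothetical inputs; in the real pipeline these come from the Perl driver.
    REF = "reference.fasta"
    BAM = "sample.dedup.sorted.bam"

    def run_gatk(tool, extra_args):
        # GATK 3.x style invocation: -T selects the analysis tool.
        cmd = ["java", "-jar", "GenomeAnalysisTK.jar", "-T", tool,
               "-R", REF, "-I", BAM] + extra_args
        subprocess.check_call(cmd)

    # RealignerTargetCreator supports data threads (NT in Table III).
    run_gatk("RealignerTargetCreator", ["-nt", "4", "-o", "targets.intervals"])

    # BaseRecalibrator supports CPU threads per data thread (NCT in Table III).
    run_gatk("BaseRecalibrator", ["-nct", "4", "-o", "recal.table"])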
The next section describes our framework for measuring the system performance.

IV. THE PIPELINE

In this section, we attempt to describe our understanding of each step in the workflow in computer-science layman's terms, as opposed to the point of view of a trained Bioinformatics scientist. For the purposes of analysis, we also divide the pipeline into logical phases.

The input to the workflow is a set of BAM files, which is essentially a really large set of files (totalling 1 billion short reads per genome per person) containing a collection of short reads. Short reads are fragments of a much longer DNA sequence, and these are produced by hardware sequencers such as the ones made by Illumina. There are technologies in the marketplace that can produce long reads as well, but we will not discuss them here given our limited understanding of the topic.
4 https://github.com/samtools/htslib
5 http://picard.sourceforge.net/
6 http://www.broadinstitute.org/gatk/about/
Once this is done, these short reads are fed into the pipeline, which is made up of 14 stages.

A. Phase 1: Shuffle and Align

In this stage, the short reads in the BAM files are first shuffled to minimize bias during the alignment process. Once shuffling is completed, each of the short reads then needs to be aligned (mapped to a specific position) against a large reference genome (in our case, a 3-billion-long sequence). From the outside, read alignment seems to be a simple problem, i.e., find a particular substring within a bigger string; however, read errors can and do happen, turning this into an approximate hashing problem, which is then solved utilizing the Burrows-Wheeler Aligner (BWA) transform algorithm. It is our understanding that each of the short reads can be sequenced in this stage without knowledge of any other short-read data, as long as the input BAM file is valid. While running initial experiments we found an interesting problem related to missing base pairs, which is further discussed in Section V.

B. Phase 2: SAM Sorting

Many of the downstream analysis programs which utilize BAM files actually require the files to be sorted, since this allows reading from these files to be done more efficiently.

C. Phase 3: Remove Files

Phase 3 removes the temporary files created in the earlier phases.

D. Phase 4: Mark Duplicates

In this phase, duplicates of any particular unique short read are marked to prevent a skew during the variant calling process. Duplicates are usually produced by a particular DNA preparation process and may be unavoidable. Marking duplicates sounds like something that could be built on top of the canonical MapReduce example, i.e., counting the number of words in a given document; a toy sketch of this idea appears at the end of this section.

E. Phase 5: Remove Files

In this phase too we remove some files.

F. Phase 6: Index Dedup (Build BAM Index)

In this phase, the output BAM files from the preceding stages are indexed for fast access. This essentially allows a particular short read to be accessed by jumping immediately to a specific offset within a particular BAM file, thus negating the need to read the preceding data into memory. The output of this process is a set of index files accompanying the original BAM files.

G. Phases 7 and 8: InDel Targets and Realign InDels

The next two phases pertain to Insertion/Deletion (InDel) handling and thus benefit from a short overview. The term InDel refers to a class of variations that is present in a human genome. The need to realign short reads around InDels arises for two major reasons. The first reason is that InDels can cause mappers (such as the BWA algorithm employed in Phase 1) to misalign short reads. The second reason is that those misalignments would then harm the accuracy of downstream processes such as base quality recalibration and variant detection.

Here, the regions in the BAM files that need to be realigned are identified. In general there are three types of realignment targets: known sites, such as ones coming from the 1000 Genomes project; InDels that are seen in the original alignments (as part of the application of the BWA algorithm); and finally sites where evidence suggests a hidden InDel. Once the InDel regions have been identified, this stage then performs the actual realignment.

H. Phase 9: Remove Files

Phase 9 removes the temporary files created by the earlier phase.

I. Phase 10: Baseq (Base Quality) Covariance Stage 1 (Base Recalibration)

Hardware sequencers associate a quality score with their reads. There is, however, a tendency for sequencers to be overly optimistic in terms of their confidence scores. In this stage, a recalibration table is built utilizing a machine learning algorithm based on covariation among several features of a base, such as the read group, the original quality score from the sequencer, whether it is the 1st/2nd read in a pair, etc.

J. Phase 11: Baseq (Base Quality) Covariance Stage 2

In this phase, the recalibration table built in the previous stage is utilized to recompute the base quality scores.

K. Phase 12: Plot Base Quality Covariance

In this stage, plots of the recalibration tables are generated so that an evaluation can be made on whether the recalibration has worked properly.

L. Phase 13: Base Quality Recalibration

In this phase, the recalibrated data is subjected to some final processing before being written out to disk.

M. Phase 14: Remove Files

Phase 14 simply removes the index realignment files.

7 http://www.broadinstitute.org/gatk/guide/tagged?tag=bam
8 http://hmg.oxfordjournals.org/content/19/R2/R131.full
9 http://weallseqtoseq.blogspot.com/2013/10/gatk-best-practices-workshop-data-pre.html
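The toy sketch promised in the Phase 4 discussion is shown below. To be clear, this is not how Picard's MarkDuplicates actually works; it is only our word-count-style illustration, assuming for simplicity that duplicates are reads mapping to the same reference position.

    from collections import defaultdict

    def mark_duplicates(reads):
        """reads: iterable of (read_id, ref_pos) tuples. Returns the set of
        read_ids marked as duplicates, keeping the first read seen at each
        position as the representative."""
        by_pos = defaultdict(list)      # "map" step: group read ids by position
        for read_id, ref_pos in reads:
            by_pos[ref_pos].append(read_id)
        duplicates = set()              # "reduce" step: mark all but one per key
        for ids in by_pos.values():
            duplicates.update(ids[1:])
        return duplicates

    print(mark_duplicates([("r1", 100), ("r2", 100), ("r3", 205)]))  # {'r2'}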
V. FRAMEWORK
In this section, we describe the framework that was set up to run the pipeline and the changes and modifications made to it for the purposes of analysis. In addition, we discuss the challenges faced while trying to get the pipeline working with a subset of the reads to ease analysis. Further, we also talk about the challenges we faced while duplicating the environment on another machine.
A. Framework Changes

The framework - essentially a script provided to us - needed an overhaul and further additions so that we could start measuring system performance for each phase of the pipeline. To measure basic resources like CPU utilization, memory utilization and disk I/O, we decided to use the dstat tool for each command. For collecting hardware performance counters, e.g. L1 cache misses, last-level cache misses, etc., we used the perf-stat utility. Also, since most of the tools are Java based, we thought it would be beneficial to collect JVM Garbage Collector logs as well.
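A minimal sketch of such a measurement wrapper is shown below. It assumes dstat and perf are installed; the flags, event list and file names are illustrative rather than the exact ones from our script.

    import signal
    import subprocess

    def run_measured(cmd, tag):
        # Sample system-wide CPU, disk and memory statistics once per second
        # into a CSV file for the duration of the phase.
        dstat = subprocess.Popen(
            ["dstat", "--cpu", "--disk", "--mem",
             "--output", tag + ".dstat.csv", "1"],
            stdout=subprocess.DEVNULL)
        try:
            # perf stat wraps the phase command and writes counter totals
            # (cache misses, branch misses, ...) to a per-phase log.
            subprocess.check_call(
                ["perf", "stat",
                 "-e", "L1-dcache-load-misses,LLC-load-misses,branch-misses",
                 "-o", tag + ".perf.txt", "--"] + cmd)
        finally:
            dstat.send_signal(signal.SIGINT)   # stop the sampler with the phase
            dstat.wait()

    # Example (hypothetical command line):
    # run_measured(["java", "-jar", "picard.jar", "SortSam"], "sort_sam")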
B. Data Sources

Our initial experiments were run on the master node of the Bioinformatics lab during a short period of time when it was not being utilized by the department. This gave us an advantage, since we did not have to simultaneously understand both the tools and the environment at the outset. However, once normal usage of the node resumed, we had to utilize a different machine. The critical issue was that we had been using private patient data up to that point, and due to privacy concerns it was impossible for us to move the data set to the new machine. We were thus faced with the challenge of finding an equivalent data set. We tried a number of sources and finally found the 1000 Genomes project to be fruitful.

C. Pair issue and 'R'

Our woes did not stop after being able to find a data set from the 1000 Genomes project. We expected the subset of the reads to successfully run in the pipeline; however, we faced cryptic errors in the first phase of the pipeline itself. After multiple long debugging sessions and help from the bioinformatics people, we found out that the base-pair reads come in pairs, and the bwa tool requires that every read has its pair in the data set.

Since we were running our experiments on a subset of the reads, there was a probability that certain reads didn't have their pair, and because of this the pipeline was failing at the first phase itself. To remove this error, we had to write a wrapper script around the given pipeline to pre-process the input BAM file and remove all the reads whose pairs didn't exist.
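One possible implementation of that pre-processing step is sketched below, using the pysam library (an assumption on our part; the actual wrapper script may be implemented differently). It makes two passes over the input BAM: one to count records per read name, and one to write out only reads whose mate is also present.

    from collections import Counter
    import pysam

    def drop_unpaired(in_bam, out_bam):
        # Pass 1: count how many records share each read (query) name.
        counts = Counter()
        with pysam.AlignmentFile(in_bam, "rb", check_sq=False) as bam:
            for read in bam:
                counts[read.query_name] += 1
        # Pass 2: keep only reads whose name appears at least twice,
        # i.e. both mates of the pair made it into the subset.
        with pysam.AlignmentFile(in_bam, "rb", check_sq=False) as bam, \
             pysam.AlignmentFile(out_bam, "wb", template=bam) as out:
            for read in bam:
                if counts[read.query_name] >= 2:
                    out.write(read)

    drop_unpaired("subset.bam", "subset.paired.bam")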
The challenges didn't end here: there is a phase in the pipeline which uses 'R' to plot graphs, and it turns out that the tools use deprecated 'R' plotting libraries which we couldn't install while replicating the environment, so we had to skip that phase during our experiments. Our initial analyses on the bioinformatics machine showed that this phase did not take a significant share of the pipeline's time, and hence skipping it shouldn't impact our future experiments.

Finally, after all these changes, we were able to replicate the complete environment on a new server where we could experiment with different data-set sizes and analyse the results.

VI. BASELINE RUN

In order to gain a better understanding of the pipeline, we started our investigation by running the pipeline using the simplest possible configuration that would still perform meaningful work. This was accomplished by setting the number of threads to 1 in each phase of the pipeline, as well as utilizing only 1% of the full data set used in a typical run in the Bioinformatics department, which comes to around 10 million reads. The choice of 1% of the full set was informed through conversations with a student from the department.

The next level of experiments consisted of running the same pipeline and dataset but with a higher number of threads; since we were working on a quad-core machine, we chose 4 threads for immediate comparison. Later in the paper, we look at how SMT performs.

Table IV shows that there are mainly 5 phases which contribute approx. 84% of the total time. Looking at the graphs for those phases, namely figures 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, it is evident that neither I/O nor memory is a bottleneck. The remaining graphs of the phases can be found in the Appendix.

TABLE IV: Time taken by different phases of the pipeline for different numbers of threads, for a dataset of size 10 million reads

Phase Name             Single Thread Time (s)   4 Thread Time (s)
Shuff Algn             4932                     1505
Sort Sam               244                      244
Remove BAM             1                        1
DeDup Srtd             299                      299
Remove Srtd            1                        1
Index DeDup            40                       40
Indel Target           2961                     830
Realn Target           393                      394
Remove Dedup           1                        1
Base Covar             1276                     719
Base Covar 2           1907                     1330
Plot (Didn't Measure)  0                        0
Baseq Realn            1085                     721
Remove Realn           1                        1

This is further evident when we captured the resource usage pattern for an actual run that lasted approx. 7 days. Figures 11 and 12 show that although I/O increased, it is not the bottleneck of the system, and the tools never ran out of memory. We believe that memory bandwidth contention might be one of the possible reasons why these phases are slow, but due to timing constraints we were not able to explore that avenue.
A. Investigating Java Runtime Environment
With the exception of the first phase of the pipeline, all the utilized tools are programs written in Java. This provided us with another avenue of investigation to pursue through the enabling of Java Garbage Collection (GC) logging.
10 http://www.1000genomes.org/
Fig. 1: Single thread resource usage for Phase 1: Shuffle and Align. [Figure: (a) CPU utilization (%) and read/write (MB) vs. time (s); (b) memory consumption (GB) vs. time (s).]
Fig. 2: 4 threaded resource usage for Phase 1: Shuffle and Align. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Given the batch nature of each phase in the pipeline, we are primarily interested in knowing the phase throughput, i.e. the percentage of time spent by each pipeline phase doing useful work instead of GC. In general, any throughput number at 95% and above is considered good. Additionally, should the throughput number fall below 95%, we are also interested in seeing any instances of a GC taking an excessively long amount of time.

We modified the script used to run the tool chain to augment every "java" command with the following flags:

    -Xloggc:logs -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

This set of flags serves to output GC data in sufficient detail, including the amount of memory freed in both the Young and Old Generations in each iteration as well as the times at which GC happens, into the specified log file. We then used a tool called jClarity Censum to visualize the data and collect the throughput metric.

1) Throughput Results: When it comes to the pipeline, the phases utilizing Java that contribute the most to the running time are Indel Targets, Base Covar 1, Base Covar 2 and Baseq Recal, so we are going to constrain our discussion to those 4 phases. For any combination of input size and number of threads, we found that the throughput number never dropped below 95%, with the exception of the Base Covariance phase, where the number dropped steadily when the number of threads specified was 8 (at 4 threads and below, the throughput remained above 95%). Table V summarizes the GC throughput numbers for 8 threads.

11 http://www.slideshare.net/jclarity/hotspot-garbage-collection-tuning-guide
12 http://www.jclarity.com/censum/
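For readers without access to Censum, the throughput metric can be approximated from the same logs with a short script like the sketch below. It is a simplification that assumes the classic -XX:+PrintGCDetails line format, where each collection record ends with a pause of the form ', 0.0123456 secs]' and lines begin with a seconds-since-start timestamp; real GC logs have more cases than this handles.

    import re
    import sys

    PAUSE = re.compile(r", (\d+\.\d+) secs\]")   # total pause of a GC record
    STAMP = re.compile(r"^(\d+\.\d+):")          # -XX:+PrintGCTimeStamps prefix

    def gc_throughput(log_path):
        paused, last_stamp = 0.0, 0.0
        with open(log_path) as log:
            for line in log:
                m = STAMP.match(line)
                if m:
                    last_stamp = float(m.group(1))
                for secs in PAUSE.findall(line):
                    paused += float(secs)
        # Percentage of wall-clock time spent doing useful work instead of GC.
        return 100.0 * (1.0 - paused / last_stamp) if last_stamp else 100.0

    print("throughput: %.1f%%" % gc_throughput(sys.argv[1]))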
Fig. 3: Single thread resource usage for Phase 7: Indel Target Index. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 4: 4 threaded resource usage for Phase 7: Indel Target Index. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
TABLE V: GC Throughput for number of threads = 8

                10M      20M      40M      100M
Indel Targets   98.6%    98.9%    98.7%    98.9%
Base Covar 1    95.8%    94.7%    93.6%    92%
Base Covar 2    97.1%    96.8%    96.6%    96.6%
Baseq Recal     97.4%    97.5%    97.5%    97.6%

We then made some modifications to the "java" command to see if we could improve the throughput number for that phase. In particular, we found that for a data size of 40 million, we were able to increase the throughput from 93.6% to 96.1% by specifying a number of additional flags:

    -Xms15000m -Xmx15000m -Xmn10000m

The choice of those numbers was informed by the raw GC data collected, in particular the sizes of the Young and Old Generations at the end of the GC log for that phase, with some buffer built in to tolerate possible memory spikes.

Improvement aside, we don't think that it is particularly meaningful; for example, referring to Table V, improving Base Covar 1's throughput from 92% to 95% would only result in an improvement of around 1 minute (3% of 35 minutes) in running time. It is true that, given that the particular phase currently takes around 30 hours using the full data set, we may see some real savings in time; however, we also know that the phase is currently run with the number of threads set to 5 on the Bioinformatics head node (and earlier we mentioned that throughput is not an issue when the number of threads is 4). In other words, without further work, we are not able to ascertain how much of an improvement tuning this particular phase would yield.
Fig. 5: Single thread resource usage for Phase 10: Base Covariance 1. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 6: 4 threaded resource usage for Phase 10: Base Covariance 1. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Another thing that is worth mentioning is that early in our experiments, we were seeing Java programs run in a single-threaded configuration through the use of nct (see Table III for information on the tools that accept a number of data/compute threads) consuming more than a thread's worth of CPU utilization ("top" would occasionally show utilization above 30%). We utilized an application called JConsole, which has shipped with every JDK since version 5.0, to investigate the issue, since it has the ability to show all running threads in a particular Java Virtual Machine (JVM).

What we found was that although the tool did start up a number of data/compute threads as specified through a configuration option, there were also a number of utility threads started by the tool and the JVM, including a progress-tracking thread (started by GATK), a number of GC threads and a number of TCP/IP threads. We were not able to ascertain why the JVM would start a number of TCP/IP threads, and there do not seem to be any flags for specifically turning those threads off; however, the presence of these extra threads does explain the CPU utilization phenomenon that we were seeing.

B. Performance counter measurements

To be sure that the slowness of the phases is not caused by a lot of L1 cache misses, branch-predictor misses or off-chip accesses, we measured hardware performance counters using the perf-stat tool and found them to be consistent across multiple runs for different input sizes and thread counts. There was about a 3.5% L1 data cache miss rate and 1.2% branch-predictor misses. An interesting result, however, was a 31.3% rate of LLC misses, but because of timing constraints we couldn't explore this avenue either. Nevertheless, since we saw that these measurements remained pretty much constant across runs with different sizes and threads, we concluded that while the cache misses could be potentially interesting, they wouldn't affect our ability to model the pipeline.
Fig. 7: Single thread resource usage for Phase 11: Base Covariance 2. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 8: 4 threaded resource usage for Phase 11: Base Covariance 2. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
VII. THE PIPELINE MODEL

One of the ideas behind profiling the pipeline was to build a model based on which we could make predictions, the idea being that if we could accurately predict and model the behavior of the software pipeline, we had, in a manner of speaking, truly understood the workings of the system as a black box.

A. Building the Model

As mentioned previously, to better understand the pipeline we logically split the pipeline into phases. In retrospect, this turned out to be a crucial step in building the model, as it enabled us to better predict the behavior of the entire pipeline as a combination of phases rather than as a single entity. Essentially, we have built a model for each of the phases, which is then used to make a prediction for the entire pipeline. More specifically, the model utilizes the size of the input data and the number of threads to make a prediction for the time the pipeline will take to complete.

In building the model, we realized that there are four important factors that affect the running time. Furthermore, each of the phases exhibits a different behavior which seems to stem from changes in these four factors.
Fig. 9: Single thread resource usage for Phase 13: Base Recalibration. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 10: 4 threaded resource usage for Phase 13: Base Recalibration. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 11: CPU and I/O utilization for full run on 1 billion reads.
Fig. 12: Memory utilization for full run on 1 billion reads.
The factors that affect running time and form an integral part of the model are:

1) f: the fraction of the phase that is parallelizable
2) p: the number of threads used
3) m: the fraction of the phase that actually depends on the size of the input
4) s: the size of the input

Every phase has its own set of f, m, which causes the phases to show different behavior. One assumption we make here is that for the fraction f of the pipeline, the program is perfectly scalable. However, in reality it is possible that the performance of a phase will stop improving as the number of threads is increased, or in fact degenerates after a certain threshold has been reached. Indeed, we see this exact behavior for one of the phases. Considering that we are treating all the programs as black boxes, it would be hard to pinpoint the exact reasons for the degeneration; however, we briefly attempt to address the issue later in this section.

1) Model Terminology: In this section we briefly elaborate on the terminology we intend to use for building the model. Each phase, in addition to the two internal parameters f and m, has two external parameters that it depends on. One is the number of threads p, and the second is the size of the input, i.e. the number of reads, s. These parameters are external and could possibly change.

Every phase is denoted by P^i_{p,s}, where p is the number of threads, s is the input size and i is the phase number in the pipeline. Thus, the total time for the entire pipeline E can be represented as

    \sum_{i=1}^{n} t(P^i_{p,s}) = t(E)

For the purposes of the analysis we consider 10M reads as the base size. As mentioned in Section VI, the model uses the data for 10M as ground truth, because using an input size smaller than that does not produce meaningful results, not only from the correctness point of view but also from the behavior point of view. Also, as discussed before, the time taken by every phase depends on f, p, s and m. However, f and m are inherent to the phases and are not stated explicitly. Hence, the success of the entire model revolves around these two factors and how well we estimate and compute them. For this, we decided to stay simple and come up with a simple way to compute f and m. For f, Amdahl's Law seemed like a good approximation; in fact, we see that the Amdahl's Law approximation works really well and seems to provide a good starting point. Considering our success for f with Amdahl's Law, we decided to use a slightly modified version of it to get a value of m for each phase.

To get f^i, we basically, for each phase P^i and p = 2, 4, 6, get a corresponding f^i_{p,10M} and compute f^i as

    f^i = min(f^i_{2,10M}, f^i_{4,10M}, f^i_{6,10M})

Note, f^i_{p,s} denotes the fraction of the phase i that is parallelizable for size s. Since the size of the input s should not affect the parallelizability of the phase, we can estimate f with s = 10M.

On the other hand, to get m^i, we basically, for each phase P^i and p = 2, 4, 8, get a corresponding m^i_{p,20M} and compute m^i as

    m^i = max(m^i_{2,20M}, m^i_{4,20M}, m^i_{8,20M})

Note, m^i_{p,s} denotes the fraction of the phase i that depends on the size of the input when compared to s = 10M as the base. Since the size of the input s will not change the fraction of the phase that depends on the size, we can estimate m with s = 20M.

We choose the min for f and the max for m as we would like to make conservative estimates.

B. Enumerating the Model: Phase 1

In this section, we make the above calculations concrete by walking through the enumeration of the model for Phase 1. Essentially, we try to solidify the model by showing how we build it and then make predictions for Phase 1.

1) Computing f: To compute f^1 we use the data for s = 10M and p = 2, 4, 6, shown in Table VI. Using Amdahl's Law we get
TABLE VI: Phase 1, s = 10M

Threads   Time (seconds)
p = 1     4932
p = 2     2663
p = 4     1529
p = 6     1151

    f = \frac{p}{p-1} \left(1 - \frac{t(P^i_{p,10M})}{t(P^i_{1,10M})}\right)

    f^1_{2,10M} = 0.9201
    f^1_{4,10M} = 0.9200
    f^1_{6,10M} = 0.9200

Thus we get f^1 = 0.9200. As a quick validation of our estimation, we use it to predict the value of t(P^1_{8,10M}). As we can see in Table VII, our estimation of f^1 is pretty spot on.

TABLE VII: Validate f, Phase 1, p = 8 threads

                 Predicted (seconds)   Actual (seconds)
t(P^1_{8,10M})   962                   963

2) Computing m: To compute m^1 we use the data for s = 20M and p = 2, 4, 8, shown in Table VIII.

TABLE VIII: Phase 1, s = 20M

Threads   Time (seconds)
p = 2     4974
p = 4     2909
p = 8     1867

To evaluate m we define a new term r, the ratio of the size s to 10M. Further, we modify the Amdahl's Law equation slightly to make it compute m:

    m = \frac{1}{r-1} \left(\frac{t(P^i_{p,20M})}{t(P^i_{p,10M})} - 1\right)

    m^1_{2,20M} = 0.9145
    m^1_{4,20M} = 0.9329
    m^1_{8,20M} = 0.9387

Thus we get m^1 = 0.9387.

C. Evaluating the Pipeline

Finally, we have all the data we need to evaluate the pipeline and make predictions. It is important to understand that we evaluate the pipeline in phases rather than as a whole. Based on our measures of f and m, we can now rewrite the time taken by a phase as

    t(P^i_{p,s}) = t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right)

Thus, we can rewrite our generic equation to be

    \sum_{i=1}^{n} t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right) = t(E)

Table IX lists the f and m values computed for each phase.

TABLE IX: Phases, f, m

Phase   f                m
P1      0.9200           0.9387
P2      0.0000           0.9877
P3      0.0000           0.0000
P4      0.0000           0.9369
P5      0.0000           0.0000
P6      0.0000           0.9230
P7      0.9595           0.0733
P8      0.0000           0.8560
P9      0.0000           0.0000
P10     0.5982           0.3533
P11     0.4096           0.6789
P12     -                -
P13     0.4473, 0.1327   0.9996
P14     0.0000           0.0000

It is interesting to note that most of the phases do not exhibit a high degree of parallelism. This indicates that we should not expect improvements if we keep increasing the number of threads. In Section VIII we evaluate our model for accuracy and discuss possible issues and shortcomings.

D. Porting the Model

All of the above runs were done on Machine I as described in Table I. One obvious worry is that this model is not portable and needs to be redone for every machine. While this is a valid concern, we believe the model is structured so that the only parameter it depends on - and which is machine-variable - is t(P^i_{1,10M}). It is evident that t(P^i_{1,10M}) will change based on the machine it runs on, primarily because the new machine might run at different frequencies. We hypothesize that we can essentially port the entire model by adding a multiplier k, which could be simply computed for each phase as

    k^i = \frac{t_{M_{new}}(P^i_{1,10M})}{t_{M_{ref}}(P^i_{1,10M})}

Thus we can rewrite our previous equation as follows:

    \sum_{i=1}^{n} t(P^i_{1,10M}) \cdot \left(\frac{f}{p} + (1-f)\right) \cdot \left(m \cdot \frac{s}{10M} + (1-m)\right) \cdot k^i = t(E)
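To make the estimators and the predictor concrete, the sketch below implements them directly from the definitions above; the structure and names are our own illustration, not the original analysis scripts. The data are the Phase 1 measurements from Tables VI-VIII (plus the measured 8-thread, 10M time from Table VII). Note that the individual m values depend on which base timings are plugged in, but the max, and the resulting predictions, line up with Tables IX and X.

    def estimate_f(t1, timings_10m):
        # timings_10m: {p: t(P_{p,10M})}. Amdahl's Law per thread count; take the min.
        return min(p / (p - 1) * (1 - tp / t1) for p, tp in timings_10m.items())

    def estimate_m(timings_10m, timings_20m, r=2):
        # Modified Amdahl: fraction of the phase that scales with input size; take the max.
        return max((timings_20m[p] / timings_10m[p] - 1) / (r - 1)
                   for p in timings_20m)

    def predict(t1_10m, f, m, p, s_reads, k=1.0, base=10e6):
        # t(P_{p,s}) = t(P_{1,10M}) * (f/p + (1-f)) * (m*s/10M + (1-m)) * k
        return t1_10m * (f / p + (1 - f)) * (m * s_reads / base + (1 - m)) * k

    # Phase 1 data (Tables VI and VIII; 963 s is the actual 8-thread, 10M time).
    t10 = {1: 4932, 2: 2663, 4: 1529, 6: 1151}
    t20 = {2: 4974, 4: 2909, 8: 1867}
    f1 = estimate_f(t10[1], {p: t for p, t in t10.items() if p > 1})  # ~0.9200
    m1 = estimate_m({**t10, 8: 963}, t20)                             # ~0.9387
    print(predict(t10[1], f1, m1, p=4, s_reads=20e6))                 # ~2964 s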
VIII. EVALUATION

In this section we evaluate our model by comparing our predictions to actual runtimes for s = 10, 20, 40, 100 million reads and for p = 4, 8. Essentially, we'd like to conclusively show that our simple model works surprisingly well. Table X and Table XI show the comparisons between our predictions and the actual runtimes.

To reiterate, we use the model and the equation as developed in Section VII to make the predictions.
TABLE X: Predicted vs Measured values for each phase of the pipeline and different input sizes, running with 4 threads wherever possible (times in seconds)

                10M Pred  10M Act   20M Pred  20M Act   40M Pred  40M Act   100M Pred  100M Act
Shuff Algn      1528      1505      2964      2909      5834      5779      14445      14524
Sort Sam        244       244       484       484       966       965       2412       2433
Remove bam      1         1         1         1         1         1         1          4
Dedup Sorted    299       299       579       579       1139      1148      2820       2960
Remove Sorted   1         1         1         1         1         1         1          3
Index DeDup     40        40        76        79        150       152       372        375
Indel Targets   830       830       891       865       1012      961       1377       1245
Realn Targets   393       394       729       730       1402      1364      3420       3389
Remove DeDup    1         1         1         1         1         1         1          2
Base Covar 1    703       719       952       973       1449      1495      2940       3023
Base Covar 2    1321      1330      2218      2233      4011      3932      9393       9402
Plot            -         -         -         -         -         -         -          -
Baseq Recal     721       721       1441      1379      2883      2795      7207       6837
Remove Realn    1         1         1         1         1         1         1          1
Total           6085      6086      10342     10235     18855     18595     44396      44198
TABLE XI: Predicted vs Measured values for each phase of the pipeline and different input sizes, running with 8 threads wherever possible (times in seconds)

                10M Pred  10M Act   20M Pred  20M Act   40M Pred  40M Act   100M Pred  100M Act
Shuff Algn      961       963       1864      1867      3670      3728      9086       9384
Sort Sam        244       244       484       485       966       964       2412       2475
Remove bam      1         1         1         1         1         2         1          4
Dedup Sorted    299       301       579       593       1139      1139      2820       2935
Remove Sorted   1         1         1         1         1         1         1          2
Index DeDup     40        39        76        78        150       155       372        376
Indel Targets   475       454       509       477       579       526       788        685
Realn Targets   393       395       729       721       1402      1370      3420       3402
Remove DeDup    1         1         1         1         1         1         1          2
Base Covar 1    608       613       822       798       1252      1134      2541       2142
Base Covar 2    1223      1223      2054      2032      3715      3594      8699       8410
Plot            -         -         -         -         -         -         -          -
Baseq Recal     959       959       1918      1906      3835      3744      9587       9178
Remove Realn    1         1         1         1         1         1         1          1
Total           5207      5195      9040      8961      16713     16359     39730      38996
A. Observations

TABLE XII: Prediction Error

Size    Error (p = 4)   Error (p = 8)
20M     107 s           79 s
40M     260 s           354 s
100M    198 s           734 s

1) To be honest, we are pleasantly surprised at how well our model performs. As we can see in Table XII, the error for p = 4 at s = 100M (which is 10% of the actual data size) is less than 4 minutes.
2) One of the major sources of error seems to come from P13 for p = 8. As we noticed while building the model, P13 showed erratic behavior for 8 threads. In fact, when we increased the number of threads to 8, the time actually increased rather than decreased.
3) If one were to remove the approx. 400 second error from P13 for p = 8, we get much better results.
4) However, it is important to note that even the 734 second error on a runtime of 38996 seconds means an accuracy of 99.9%, which we believe is extremely good.
5) More importantly, it is evident that increasing the number of threads does not significantly improve the results. This is supported by the f values that we computed in Section VII.

B. The kink in the model

Clearly, there seems to be an issue when we run P13 with 8 threads. However, the important question is why P13 shows this odd behavior with p = 8. Was it a machine-specific anomaly, or is it a tool-specific issue? We suspect that this particular issue could be because P13 cannot fully utilize the SMT threads and in fact causes contention, leading to worsening behavior. However, due to constraints on timing we have been unable to verify this, and it is part of our future work.
IX. ANALYSIS AND PREDICTIONS

Given that we have now successfully created a model - a model that seems to be able to accurately mirror the behavior of the pipeline - we'd now like to use it to answer questions about runtimes on other machines and questions about trade-offs. Another question we would like to be able to answer is how much time it would take to run the full pipeline on the current machine (Machine I) as well as on the newly bought machines described in Table XIII that are intended to be part of a cluster.

TABLE XIII: Cluster Node: Machine Description

System                      Description
Processor Model             Intel(R) Xeon(R) CPU E5620
Clock Speed                 3.50GHz
No. of Processors           1
No. of Cores per Processor  4
RAM Size                    64 GB
Disk                        18 TB

A. Trade-Off: More Threads or More Instances

One secondary goal while profiling was to be able to determine what is better in terms of time: running a single instance with more threads, or running more instances with fewer threads. Concretely speaking, is it better to run a single instance of the pipeline with, say, 4 threads twice, as opposed to running two instances with 2 threads each simultaneously?

1) Building an Intuition: Based on our observations and the model, it is apparent that not a lot of the phases in the pipeline have a high degree of parallelizability. Furthermore, for some of the phases it seems that things in fact get worse if we go beyond a particular number of threads.

Hence, considering that there is not much parallelizability to be had beyond a particular point, there is a strong indication that spawning more instances with fewer threads might be better.

2) Potential Issues: One potential issue with spawning multiple instances of the pipeline is resource contention or scheduler conflicts. However, we did a quick run and saw that there were no significant changes in runtime as compared to a single instance with the same number of threads. Concretely, we saw that two instances of the pipeline running with 2 threads each took the same time as a single instance of the pipeline running with 2 threads. This indicates that resource contention or scheduler conflicts, if present, do not significantly affect the run.

3) Actual Analysis: As can be seen in Table XIV, it is definitely more beneficial to run multiple instances of the pipeline, as opposed to running a single instance with more threads multiple times. Clearly, there seems to be much advantage in running multiple instances - in fact, it seems that the more instances you can support and run, the better off you would be.

TABLE XIV: Analysis 10M: More Threads vs More Instances

Instances   Threads   Total Time (s)
2           p = 2     2663
1           p = 4     3058 (1529 x 2)
3           p = 2     2663
1           p = 6     3453 (1151 x 3)

B. Prediction for Complete Run

One of the biggest issues with doing a complete run is that a complete run takes approximately 4-5 days. Hence, it would have been impractical to run our analysis on full runs given the high turnaround time. Also, given the limited time at hand, we could not do a complete run. However, given that we have a model, it only makes sense to make a prediction for the complete run on Machine I. In Table XV we make our predictions based on the model developed in Section VII. We predict that a complete run will take 427501 seconds, i.e. 119 hours, or approx. 5 days.

TABLE XV: Prediction Full Run - 4 Threads

Phase   Time (s)
P1      143613
P2      24102
P3      1
P4      28032
P5      1
P6      3695
P7      6854
P8      33697
P9      1
P10     25310
P11     90118
P12     1
P13     72072
P14     1
Total   427501

C. Prediction on Cluster Node

Another interesting problem, as mentioned previously, is porting the model to other machines. For this, we proposed the idea of a multiplier k. However, while proposing the multiplier k we suggested an empirical way of getting k^i for each of the phases. In this case, however, we do not have any relevant data to make an empirical estimate.

1) Estimating k: Given that we do not have data to estimate k for each phase, we would like to find another way to estimate k. Considering the similar nature of the machines at hand, we ask what the potential reasons for the running time to change could be:

1) Higher Clock Speed
2) Different Memory System
3) IO

However, as we have seen in Section VI, IO is definitely not a bottleneck and hence not a consideration. Further, in Section VI-B we also commented on the fact that the cache behavior remains the same. Hence, we are potentially left with only the clock speed as a possible source of change in time.
Thus we can compute k as a single factor that can be applied to all phases:

    k = \frac{f_{M_{ref}}}{f_{M_{new}}} = \frac{1.8}{3.4} = 0.5294

where f_M denotes the clock frequency of machine M. Note, while this might not be entirely accurate, we believe it is a good approximation of what we should expect. Based on this, we can estimate that a complete run on the cluster node will take approx. 3 days.
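As a quick worked check of that estimate (our own arithmetic, applying the single k uniformly to the 4-thread full-run prediction from Table XV):

    full_run_machine1 = 427501           # predicted seconds on Machine I (Table XV)
    k = 1.8 / 3.4                        # clock-frequency ratio, ~0.5294
    cluster_days = full_run_machine1 * k / 3600 / 24
    print(round(cluster_days, 1))        # ~2.6, i.e. approx. 3 days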
X. LESSONS

The overarching goal of the project was to try and suggest spheres for improvement and make suggestions on potential trade-offs and hardware suitability. As explained in the previous sections, we think that the tools have been optimally set up - that is, the tools are being used in a way that extracts the most possible out of them - and we have been able to rule out most of the obvious bottlenecks. Further, we believe our ability to model the pipeline fairly accurately reflects that we understand the pipeline, at least as a black box. It also reflects that whatever we find true for smaller artificial data sets will also be applicable to actual data sets.

This said, in this section we would like to briefly talk about the lessons we learnt and some things to keep in mind for future work with the pipeline.

A. Lessons on Pipeline

1) Run More Instances as Opposed to More Threads: We conclusively show that the pipeline as a whole isn't exactly parallelizable. Hence, it makes more sense to run multiple instances of the pipeline with fewer threads simultaneously, as opposed to running a single instance with more threads multiple times.

2) Tools Need Change: One thing that was evident from the entire exercise was that there definitely needs to be an impetus to incorporate more parallelism in the tools. However, there seem to be major issues regarding this. First, the algorithms themselves don't lend themselves well to parallelism. Secondly, the bioinformatics community as a whole is incredibly cautious about moving to new tools. The latter is primarily because some of the problems that these tools try to solve are essentially open problems, which means different tools solve the same problems differently, leading to possibly different results. Thus, unless the tools are proven, the community seems reluctant to adopt them.

3) Hardware Investments: The pipeline as a whole doesn't afford a lot of parallelizability. Hence, there could be two ways that the system could be built up to accommodate the pipelining. The first approach, the one that is currently being followed, is to buy huge machines with lots of memory and run multiple instances of the pipeline simultaneously. The other approach is to have multiple less powerful machines: instead of having a 100GB quad-core machine, one could possibly do away with less powerful dual cores and less memory. It would be worthwhile to investigate whether the latter approach could potentially lead to any savings.

B. Lessons in general

1) (More Cores or More Threads) != Better Performance: One of the biggest takeaways was that more cores or more threads needn't necessarily mean better performance, and hence care should be taken while deciding your configuration parameters.

2) Simple Models Work: Probably the biggest and most important part of the project was the model. The simplicity of the model has possibly been the key to its performance.

ACKNOWLEDGMENT

We would like to thank Prof. Voelker - the guy who knew the guy who had a project for us - without whose guidance and support much of this wouldn't have been possible. We would also like to thank Roy Ronnen and also Viraj Deshpande for patiently guiding us through the bioinformatics aspects of the project. Finally, we would like to thank Prof. Snoeren for helping us get key insights into the project and furthermore for the course itself.

REFERENCES

[1] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
[2] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078-2079, 2009.
[3] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297-1303, 2010.

APPENDIX

Graphs of resource usage for the remaining phases of the pipeline, for single thread and 4 thread runs on the 10M data-set.
Fig. 13: Single threaded resource usage for Phase 2: Sort SAM. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 14: 4 threaded resource usage for Phase 2: Sort SAM. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 15: Single threaded resource usage for Phase 4: DeDup Sorting. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 16: 4 threaded resource usage for Phase 4: DeDup Sorting. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 17: Single threaded resource usage for Phase 6: Index DeDup. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 18: 4 threaded resource usage for Phase 6: Index DeDup. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 19: Single threaded resource usage for Phase 8: Realign InDels. [Figure: (a) CPU and I/O usage; (b) memory consumption.]
Fig. 20: 4 threaded resource usage for Phase 8: Realign InDels. [Figure: (a) CPU and I/O usage; (b) memory consumption.]