BigData, meet BigCompute: 1 Million Hours, 78 TB of genomic data analysis, in 1 week

It seems like every day at Cycle, we get to help people do amazing work, but this week is a little different. This week we wrapped up our involvement in the remarkable work of Victor Ruotti of the Morgridge Institute for Research, winner of the inaugural Cycle Computing BigScience Challenge. In the name of improving the indexing of gene expression in differentiated stem cells, Cycle's utility supercomputing software just finished orchestrating the first publicly disclosed 1,000,000+ core-hour HPC analysis on the cloud. Yes, that’s 1 million hours, or over a ComputeCentury™ of work, on a total of 78 TB of genomic data, in a week, for $116/hr!

To put this 115 years of computing into context, the word ‘computer,’ meaning a mechanical calculating device, was first used in 1897. So if you had started this run on a one-core computer when the term was first used, and kept it running through World War I, Jazz, the Roaring ’20s, the Great Depression, WWII, Big Bands, the start of Rock’n’Roll, the Cold War, the Space Race, the Vietnam War, Disco, the ’80s, grunge, techno, hip hop, reality TV, and up to Gangnam Style, Victor’s analysis would be finishing now, sometime in 2012. Now that’s a lot of compute.

Below, we're going to explain the details of the analysis and how it was executed, but if you're short on time, feel free to skip ahead to why this is important.

Cycle Computing BigScience Challenge Overview

About a year ago, we were very excited to announce the Challenge: a contest aimed at breaking the computational limits facing researchers working to answer questions that will help humanity. We asked researchers all over the world to think of normally unanswerable computational science questions, i.e. science limited by the compute power available to them, to describe their efforts, and to explain how access to a utility supercomputer could push their research forward by making impossible science possible.

As our part of the challenge, Cycle Computing offered $10,000 of compute time on a utility supercomputer, with Amazon Web Services kindly providing an additional $9,500 worth of compute time to the contest winner. From a field of many researchers doing fantastic work to benefit humanity, our panel of judges, including industry luminaries Kevin Davies of Bio-IT World, Peter Shenkin of Schrodinger, Jason Stowe of Cycle Computing, and Matt Wood of AWS, chose Alán Aspuru-Guzik of the Harvard University Chemistry department and the Harvard Clean Energy Project as the runner-up, and Victor Ruotti from Jamie Thomson’s Stem Cell Lab at the Morgridge Institute for Research as the BigScience 2011 challenge winner!

Victor required the massive computational power of a utility supercomputer to analyze gene expression data in order to construct a map of which genes are expressed when pluripotent stem cells differentiate, or turn into a specific cell type or state. His goal is to accelerate the identification of cell types, and of the genes involved in creating them, from induced pluripotent stem cells. More details on the phases of this analysis appear below:

Figure: Slide from Victor Ruotti on the phases of his research

To accomplish this goal, he ran a set of comparisons, in an n-squared algorithm, against 124 samples. Analyzing these 124×124 sample pairs, drawn from a larger pool of 524×524, using bowtie to perform RNA-Seq analysis is both compute- and data-intensive. The total data footprint, including intermediate data, was 78 TB of storage, which was analyzed using 1,000,000 hours of computation. BigData, let us introduce you to BigCompute.
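The n-squared fan-out is easy to picture as a job list: every sample is compared against every other sample. The sketch below is purely illustrative; the sample names, pair ordering, and job representation are our assumptions, not Victor's actual pipeline:

```python
from itertools import product

def pairwise_jobs(samples):
    """One comparison job per ordered sample pair: n samples -> n*n jobs."""
    return [(ref, query) for ref, query in product(samples, repeat=2)]

# Hypothetical sample names; the real run compared 124 RNA-Seq samples.
samples = [f"sample_{i:03d}" for i in range(124)]
jobs = pairwise_jobs(samples)
print(len(jobs))  # 15376 comparison jobs (124 x 124)
```

Each of those 15,376 jobs is independent, which is exactly what makes the workload embarrassingly parallel and a good fit for a scheduler fanning work out across thousands of cores.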

Figure: Running (green) and idle (yellow) jobs for a portion of Victor's research

Sequence analysis also requires a large amount of memory, so we built our cluster mostly from high-memory Amazon EC2 instances (m2.xlarge, m2.2xlarge, and m2.4xlarge). To do as much science as possible while staying within budget, we used only Spot Instances, which at the time of the run typically cost about one-twelfth as much as On-Demand instances.
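Taking that roughly 12x Spot discount at face value, the effect on a fixed hourly budget is straightforward. The $1.00/hr On-Demand rate below is a made-up round number for illustration, not a quoted AWS price:

```python
# Illustrative arithmetic only; the On-Demand rate is hypothetical.
on_demand = 1.00         # assumed $/instance-hour On-Demand
spot_ratio = 12          # "about one-twelfth" Spot discount, per this post
spot = on_demand / spot_ratio

budget = 100.0           # hypothetical hourly budget in dollars
print(round(budget / on_demand))  # 100 instance-hours On-Demand
print(round(budget / spot))       # 1200 instance-hours on Spot
```

The same dollars buy an order of magnitude more compute, which is why the run used Spot capacity exclusively.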

Table: On-Demand vs. approximate Spot price per hour, by instance type
In the end, we are pleased to announce that the BigScience 2011 workload is the first publicly discussed 1,000,000+ core-hour HPC run in the cloud. For this run, CycleCloud helped:

  • Create one-click massive clusters in the cloud
  • Route jobs, input and result data
  • Handle the common cloud failure conditions
  • Automatically adjust the cluster size to fit the workload
  • Provision thousands of instances scheduled with HTCondor, which the Morgridge Institute for Research selected as the best fit for its needs (other scheduler options on CycleCloud clusters include GridEngine, Torque, etc.), and
  • Configure each node's software, which Cycle's software does using Opscode Chef
Figure: CycleServer's visualization of Opscode's Chef working smoothly on thousands of hosts
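The "adjust the cluster size to fit the workload" step in the list above boils down to a feedback loop: grow toward the idle-job backlog, shrink as the queue drains, and never exceed the budget-imposed ceiling. This is a hypothetical sketch of such a policy, not CycleCloud's actual code:

```python
def target_cluster_size(idle_jobs, running_nodes, max_nodes, jobs_per_node=1):
    """Return the desired node count: current nodes plus enough new ones
    to cover the idle backlog, capped at the budget ceiling."""
    needed = (idle_jobs + jobs_per_node - 1) // jobs_per_node  # ceiling division
    return min(running_nodes + needed, max_nodes)

# Backlog of 500 idle jobs pushes a 600-node cluster up to its 1,000-node cap.
print(target_cluster_size(idle_jobs=500, running_nodes=600, max_nodes=1000))  # 1000
# An empty queue leaves the cluster at its current size, ready to shrink.
print(target_cluster_size(idle_jobs=0, running_nodes=600, max_nodes=1000))    # 600
```

A real implementation would also drain and terminate nodes near their billing-hour boundary and replace Spot capacity lost to price spikes, which is part of "handling the common cloud failure conditions" above.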

We were able to start this multi-thousand-core cluster in under 20 minutes. The cluster was then used to do about 115 compute-years of analysis in about 7 days against 78 TB of genomic data! It ran an average of about 5,000 cores 24/7, with a peak of over 8,000 cores. How much do a million hours of compute and 78 TB of storage cost? We had a budget, and in the end the run cost a mere $116/hour, or $19,555 total! And if anyone wants to run big RNA-Seq workloads, we can spin this up easily, any time.
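A quick back-of-the-envelope check, using only numbers quoted in this post, shows how those figures fit together:

```python
total_cost = 19_555        # dollars, total for the run
wall_hours = 7 * 24        # roughly one week of wall-clock time
core_hours = 1_000_000     # total computation performed

print(round(total_cost / wall_hours))  # 116 -> the $116/hour headline rate
print(total_cost / core_hours)         # under $0.02 per core-hour
print(round(core_hours / wall_hours))  # ~5952 cores busy on average, between the reported 5,000 average and 8,000 peak
```

Under two cents per core-hour, storage included, is the punchline: that price point is what makes a run of this size accessible to a single lab.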

Figure: Ganglia data showing a hot (yellow) cluster with thousands of hosts!

Why this is important: how this innovative work benefits us all

Now that Victor has the data, he will start compiling the first part of his indexing system for the cells, which will allow researchers to quickly classify cells by their expression pattern and identify genes and regions of the genome that are critical for establishing cell states with potential clinical applications. Ideally, it should radically accelerate the work of researchers using stem cells to replicate diseases in the petri dish, enabling easier experimentation with potential treatments. Wow!

We’re very excited about the results of the first CycleCloud BigScience Challenge, and that it helped Victor to think big, to tackle the impossibly large questions that should be asked, and to use the power of utility supercomputing to answer them. We’d like to thank AWS for joining us in sponsoring this research. And finally, Cycle Computing, AWS, and the judges congratulate Victor Ruotti on his breakthrough work in the field of stem cell research, and on helping push humanity forward by using over a ComputeCentury™ of analysis to make impossible BigScience possible!

Last but not least… Cycle Computing BigScience Challenge 2012

Finally, as it’s getting to be that time of year again, get your creative juices flowing and start thinking BIG, because this year’s Cycle Computing BigScience Challenge is here! If you’re interested in applying to run research that will help humanity, go to our Cycle Computing BigScience Challenge page and fill in an entry describing who you are, what your big question is, why it needs BigCompute and/or BigData to happen, and how it benefits humanity. This year anyone, whether at a non-profit or for-profit organization, can apply, and we can't wait to see how you all plan to change your work, your science, and hopefully the world.

Comments:
  • Big thanks to Cycle and the Condor team for this collaborative effort. Also, I really appreciated the extra cycles given by AWS to complete this work. We are all very excited and looking forward to starting to query these results. Thank you.