Months-long cancer gene analysis in an evening: using CycleCloud on Google Compute Engine Preemptible VMs

Cycle’s mission is to enable our customers to easily access the Big Compute resources required to solve problems and meet deadlines. Over the years our software has orchestrated workloads both internally and externally while accelerating the move to cloud.

This is why we were ready when the Broad Institute came to us with a problem: their cancer researchers saw value in a highly complex genome analysis, but even with powerful processing systems in-house, running the analysis would take months or more. We saw a perfect opportunity to use Google Compute Engine’s Preemptible VMs to further their cancer research, which fits naturally with our mission. And now that Preemptible VMs are generally available, we’re excited to tell you about this work.


The Science: The search for understanding cancer

The Broad Institute’s Cancer Program has data sets that include hundreds of cancer cell lines, information on the genetic mutations present in each cell line, gene expression data showing which genes are more or less active under various conditions, as well as information about how various small molecules interact with the cell lines at both large and small scales. Each of these data sets is massively complex in its own right. Combining them to explore the interactions between these layers of knowledge quickly creates a vast landscape of interrelated data to explore.

One of the Cancer Program’s goals is to intelligently direct future research using these data sets. This particular workload used machine learning techniques to infer relationships among and between these cell line and gene/expression data sets, producing a map of those relationships. This map gives scientists who are investigating a particular set of cancer genes a guide to the other genes, cell lines, or small molecules they should study to more fully understand the particular sort of cancer they’re fighting, or a particular biological pathway.

These machine learning algorithms require a lot of compute power. Building this map for only several hundred samples on a single CPU would have required decades of computing. The amount was daunting enough that researchers found themselves holding back from running certain calculations, since prioritizing and scheduling such an effort would have required coordination across many groups.

Getting up & running: Fast migration

So that’s the science. The researchers at the Broad Institute had the executables, including R, and the data they needed to run this research, both on local servers and in an existing StarCluster framework on the cloud. The challenge was to get this working on Google.

To solve this problem, we connected CycleCloud to Google Cloud Platform, so that the same workload placement, data scheduling, and at-scale computing capabilities we’re known for are available on Google. Thankfully, CycleCloud has been orchestrating Big Compute for 8 years and can import StarCluster workloads easily, so we were able to get this job up and running at moderate scale in 90 minutes.

We can do this because we have automation tools and components that admins can use to orchestrate complicated clustered environments quickly. Using our cluster containers, we’re able to represent the applications, the roles and SLAs of different components, and the data they depend upon, simply and easily.

The Cluster: 51,200 core Preemptible VM Cluster

Preemptible VMs offer a tremendous opportunity for users like the Broad Institute: for Big Compute applications that are “interruption friendly”, we get access to the same Google infrastructure that regular VMs offer, at 30% of the cost. CycleCloud handles resiliency so that preemptions don’t matter to the user, while still orchestrating clustered applications at any scale.

For the Broad Institute’s research, the application scaled best at around 50,000 cores. So we pushed the button, created the cluster environment, and submitted the workloads to autoscale a 51,200 core cluster!

CycleCloud executed 3 decades of cancer research computation in an afternoon, on a petascale computer, for less than the cost of a single server, instead of taking months on local computers. The cluster itself ran using Ubuntu images, with a shared file system, and the Univa Grid Engine scheduler. Grid Engine served up the 340,891 jobs to the 3,200 instances in this cluster without issue.
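As a rough sanity check on those figures, here is a back-of-the-envelope sketch in Python. It is illustrative only and rests on assumptions that are not stated in the run itself: that "3 decades" means about 30 years of single-CPU time, that one CPU is comparable to one cluster core, and that the work scales perfectly across cores. The function name is hypothetical.

```python
# Back-of-the-envelope arithmetic for the cluster figures quoted above.
# Assumptions (not from the original run): 30 single-CPU years of work,
# one CPU ~ one core, and perfect linear scaling across cores.

HOURS_PER_YEAR = 365.25 * 24


def ideal_wall_clock_hours(cpu_years, cores):
    """Wall-clock hours if cpu_years of work were split evenly over cores."""
    return cpu_years * HOURS_PER_YEAR / cores


cores = 51_200       # cluster size quoted in the post
instances = 3_200    # instance count quoted in the post
jobs = 340_891       # job count quoted in the post

cores_per_instance = cores // instances         # average cores per instance
wall_hours = ideal_wall_clock_hours(30, cores)  # idealized run time in hours
jobs_per_core = jobs / cores                    # average jobs handled per core

print(f"{cores_per_instance} cores per instance on average")
print(f"about {wall_hours:.1f} hours of wall-clock time under ideal scaling")
print(f"about {jobs_per_core:.1f} jobs per core")
```

Under these idealized assumptions the run comes out at a handful of hours, which is in line with "an afternoon". Real runs include ramp-up, scheduling overhead, and preemptions, so this is only a consistency check, not a measurement.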

The computation ran on n1-highmem and n1-standard instances across zones in a single region, and as the workload added more jobs, CycleCloud added more servers to the run, as shown below:



When some of the instances were preempted at around 8:28pm, CycleCloud automatically reconfigured the cluster’s systems to work properly without those nodes, so the research would continue successfully.
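To make the idea of an "interruption friendly" workload concrete, the toy model below sketches one way a scheduler can absorb preemptions: a job that was running on a lost node simply goes back on the queue and runs later on a surviving node. This is a simplified illustration, not how CycleCloud or Grid Engine is implemented; the function name, the one-job-per-node-per-tick model, and the node names are all hypothetical.

```python
from collections import deque


def run_with_preemptions(job_ids, node_ids, preempt_at):
    """Toy model: every job takes one tick on one node.

    preempt_at maps a tick number to the set of node ids lost during
    that tick. Jobs running on a lost node are requeued, not dropped.
    Returns (list of completed job ids, number of ticks used).
    """
    queue = deque(job_ids)
    nodes = set(node_ids)
    done = []
    tick = 0
    while queue:
        if not nodes:
            raise RuntimeError("no nodes left to run the remaining jobs")
        # Hand one queued job to each live node for this tick.
        running = {}
        for node in sorted(nodes):
            if not queue:
                break
            running[node] = queue.popleft()
        # Remove any nodes preempted during this tick from the cluster.
        lost = preempt_at.get(tick, set()) & nodes
        nodes -= lost
        for node, job in running.items():
            if node in lost:
                queue.append(job)  # interrupted: put the job back on the queue
            else:
                done.append(job)
        tick += 1
    return done, tick


# Ten jobs on four nodes; nodes "b" and "c" are preempted during tick 1.
done, ticks = run_with_preemptions(range(10), ["a", "b", "c", "d"], {1: {"b", "c"}})
print(sorted(done), ticks)
```

In this model every job still completes; the preemption only costs extra time, because the two interrupted jobs are rerun on the remaining nodes. That property, where losing a node delays work but never loses it, is what makes a workload a good fit for Preemptible VMs.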

After the jobs ran, the map was complete and is now being analysed and curated, with the hope that it can help direct future research efforts.

The Point: Easy access to Great Cloud Infrastructure

We’re excited about three things with this run: the cutting-edge science, easy access to Google Cloud Platform, and cost-effective infrastructure that our customers can use to drive answers at any scale. We see Preemptible VMs playing a role in CycleCloud clustered applications including:

  • Computational Chemistry
  • Needle-in-a-Haystack Simulations
  • Financial pricing, back testing, modeling
  • Genomics, Bioinformatics, Proteomics
  • Insurance risk management
  • Rendering, Media Encoding
  • Hadoop, Spark, Redis, other IoT processing frameworks

And thanks to Google’s announcement today, Preemptible VMs are generally available, ready for anyone to use on production workloads, like this one at the Broad.

So if you have a clustered application that needs 50 or 50,000 cores, and you want to be able to take advantage of Google Cloud Platform to do it, please contact us.
