Single click starts a 10,000-core CycleCloud cluster for $1060/hr

Update: This cluster received great coverage, including Amazon CTO Werner Vogels' kind tweet, customer commentary on this Life Science cloud HPC project, and results from our EC2 HPC cluster.

Meet our latest CycleCloud cluster type, Tanuki. Created with the push of a button, he weighs in at a hefty 10,000 cores.

Yes, you read that right. 10,000 cores. Tanuki approximates #114 on the November 2010 Top500 supercomputer list in size, and it cost $1060/hr to operate, including all AWS and CycleCloud charges, with no up-front costs.

Yes, you read that right. 10,000 cores costs $1060/hr. Here are some statistics on the cluster:

  • Scientific need: 80,000 compute-hours
  • Cluster scale: 10,000 cores across 1,250 servers
  • Run-time: 8 hours
  • User effort to start: push a button
  • Provisioning time: first 2,000 cores in 15 minutes; all cores in 45 minutes
  • Upfront investment: $0
  • Total cost (IaaS & CycleCloud): $1060/hr
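If you want to check the arithmetic behind those numbers yourself, here is a quick back-of-envelope calculation; the per-core-hour figure is simply derived from the figures above, not a published price:

```python
# Back-of-envelope arithmetic for the numbers above.
cores = 10_000          # execution cores in the cluster
wall_clock_hours = 8    # how long the cluster ran
cost_per_hour = 1_060   # total hourly cost, IaaS + CycleCloud

compute_hours = cores * wall_clock_hours        # 80,000 core-hours of science
total_cost = cost_per_hour * wall_clock_hours   # about $8,480 for the whole run
cost_per_core_hour = total_cost / compute_hours # roughly $0.106 per core-hour

print(f"{compute_hours:,} core-hours for ${total_cost:,} "
      f"(about ${cost_per_core_hour:.3f} per core-hour)")
```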

This historic supercomputer, built entirely in the cloud, drew its first breath minutes after the push of a button. Tanuki started operations through a completely automated launch using our CycleCloud service. It ran for 8 hours before the job workflow ended and the cluster was shut down. That 8-hour run across 10,000 cores yielded a treasure trove of scientific results for one of our large life science clients.

The ability to run a cluster of this size for $1060/hr, including AWS and CycleCloud charges, is mind-boggling, even to those of us who have been in the cloud HPC business for a while. When Tanuki was first mentioned within Cycle, its scale was thrown out partly as a joke and mostly as a gauntlet. We certainly aren't laughing anymore. In fact, we have multiple clients already poised to run on 10k+ core clusters in the weeks following this blog post. To find out more, you can contact us or request an account. All CycleCloud clusters benefit from:

  • Zero upfront capital costs; you only pay for what you use
  • Tera-scale computing available on demand in minutes
  • Engineered, intelligent, configurable data encryption
  • Current-generation compute instances, including NVIDIA GPU-accelerated machines and 8-core Nehalem machines
  • 10 Gbit Ethernet network speeds
  • Large, secure data transfers into the cloud at 400 GB/day over enterprise WAN connections
  • Pre-configured, secure, maintained, ready-for-launch clusters for most HPC schedulers (SGE, OGE, Condor, and Torque are the most popular)

So, how did this particular capability come about?

Scientific Need

Computational modeling for life sciences, especially for protein binding and structure prediction calculations, is a useful tool for finding areas of interest that then need to be examined in the lab. The goal of these computations is to find targets of interest that might eventually lead to treatments for disease.

For this use case, the computational work simulates the interactions of real proteins and molecules in order to help solve the combinatorial needle-in-a-haystack problem of finding these potential treatments. Needless to say, the faster this discovery process can be completed, the better. In this case, we used 10,000 cores to shoot for hours instead of weeks or months for a very large life sciences company.
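To make "hours instead of weeks or months" concrete, here is a small illustrative calculation of wall-clock turnaround for an 80,000 compute-hour workload at a few hypothetical cluster sizes; the smaller core counts are examples, not a description of this client's internal environment, and the math assumes the work parallelizes cleanly:

```python
# Wall-clock time for an 80,000 core-hour workload, assuming the jobs
# are independent and spread cleanly across the available cores.
CORE_HOURS = 80_000

for cores in (128, 512, 2_048, 10_000):   # illustrative cluster sizes
    hours = CORE_HOURS / cores
    print(f"{cores:>6} cores -> {hours:7.1f} hours ({hours / 24:5.1f} days)")
```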

History of HPC Motivation

Bathed in the eerie glow of LCD monitors, teams of Cycle engineers have been working on this capability for quite some time. Our goal: to run a client workload at a scale that rivals flagship clusters at supercomputing centers. The path to Tanuki took considerable effort to travel. In order to fully tell the story, we'll start at the very beginning:

Nearly all of us at Cycle Computing are part of the diverse heritage of HPC. It doesn't matter which industries we came from: academia, life sciences, finance, entertainment, and then some. Building things bigger is in an HPC person's blood. Increasing utilization, decreasing runtimes, and improving scientific and computational output have been our goals since long before we became colleagues at Cycle.

The cluster environments where we all first flexed our muscles were bound by typical budgets allocated for internal/private clusters. These types of computing environments are thoughtfully designed and purpose-built. Cycle still manages a significant number of compute grids like this today.

But fixed-size clusters are too small when you need them most, and too large at every other time. Valleys are wasteful, and peaks are painful. This simple realization is what is driving the majority of Cycle's business toward cloud environments such as AWS and internal VM deployments.

One of our core tenets is to leverage established scheduling environments, with their easy submission interfaces, user priorities, and other facilities, coupled with shared file systems and standard open-source toolsets. In a previous blog post, you might have met Okami, a 2,048-core Torque cluster built within Amazon EC2, or seen our benchmarks using clusters of GPUs on the CG1 instances. We doubled this with the 4,096-core Oni cluster, which taught us a lot about the point at which traditional HPC architectures require more sophisticated techniques in order to scale in the cloud. When our EC2 clusters started to surpass the size of some of the larger internal environments that we manage, it was clear that we were onto something. It wasn't certain whether Cycle or a few of our clients were more excited when we announced that we were aiming at a 10,000-core cluster, a scale previously reserved for supercomputing centers and large enterprise customers.

[Figure: CycleCloud cluster sizes]

This is the first in a series of four blog posts covering the gory details of Tanuki.

So, consider this post the overview. Our goal with Tanuki was a massive cluster, launched with a single click via our CycleCloud front-end service and immediately familiar as a cluster environment, so it could easily accommodate workloads that were already running on internal supercomputing environments. There is absolutely no industry that will not be touched by the implications of this cluster, the first of many mega-clusters that Cycle will enable in 2011.

Launching Tanuki

CycleCloud, our cloud cluster management environment, has been around since 2007. In a nutshell, we use it to automate every aspect of launching clusters quickly and easily. As CycleCloud began to grow, we needed a way to provision a wide variety of cluster types and machine types, all with a common base. We knew from benchmarking our client's application which servers were the most cost-effective for running the jobs. And this cluster used a lot of servers:

| Cluster Role | Count | AWS Specifications |
| --- | --- | --- |
| Execution Machines | 1250 | c1.xlarge: 8 cores, 7 GB RAM each |
| Condor Collector, Condor Negotiator, and CycleServer | 1 | m1.xlarge: 4 cores, 17.1 GB RAM |
| Primary Scheduler and Disk Filer | 1 | c1.xlarge: 8 cores, 7 GB RAM |
| Auxiliary Schedulers | 2 | m1.large: 2 cores, 7.5 GB RAM |
| Total | 1254 | 10,014 cores, 8.6 TB RAM, 2 PB disk |
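CycleCloud automates all of the provisioning behind the table above, but as a flavor of the raw mechanics, here is a minimal sketch of requesting execution machines from EC2 in batches using the boto library. The AMI, key pair, and security group names are placeholders, and real provisioning at this scale involves retries, capacity handling, and configuration steps that this sketch leaves out:

```python
import boto.ec2

# Placeholders -- substitute your own AMI, key pair, and security group.
AMI_ID = "ami-xxxxxxxx"
BATCH_SIZE = 100          # request instances in batches rather than all at once
TOTAL_INSTANCES = 1250    # execution machines in the table above

conn = boto.ec2.connect_to_region("us-east-1")

launched = 0
while launched < TOTAL_INSTANCES:
    count = min(BATCH_SIZE, TOTAL_INSTANCES - launched)
    reservation = conn.run_instances(
        AMI_ID,
        min_count=count,
        max_count=count,
        instance_type="c1.xlarge",      # 8 cores, 7 GB RAM
        key_name="my-keypair",
        security_groups=["cluster-exec"],
    )
    launched += len(reservation.instances)
    print(f"Requested {launched}/{TOTAL_INSTANCES} execution machines")
```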

We quickly became hip to the ways of Chef. Still, Chef is a single tool, and it can be challenging to scale even an elegant configuration management environment like Chef when hundreds or thousands of machines are launching concurrently. We are proud to say that we met this challenge.
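One generic tactic for taming thousands of simultaneous converges, and not necessarily the one we used, is simply to spread the load: jitter each node's first chef-client run and retry with backoff when the server is struggling. A minimal sketch:

```python
import random
import subprocess
import time

MAX_ATTEMPTS = 5
MAX_SPLAY_SECONDS = 300   # spread initial converges over ~5 minutes

def converge():
    """Run chef-client with a random initial delay and simple retries."""
    time.sleep(random.uniform(0, MAX_SPLAY_SECONDS))  # avoid a thundering herd
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(["chef-client"])
        if result.returncode == 0:
            return True
        # Back off before retrying so a struggling Chef server can recover.
        time.sleep(min(60 * attempt, 300))
    return False

if __name__ == "__main__":
    converge()
```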

In the series of blog posts to come, we will share some of the challenges and secrets of how we achieved this scalability. We will provide some cold, hard facts about Chef converges based on our intensive monitoring of Tanuki. We'll describe a few failure scenarios within AWS and explain how CycleCloud transparently recovered from them. We'll also explain how we plan to go even further, scaling clusters larger than ever considered before in a cloud environment.

Scheduling: Condor

Launching 10,000 cores is only a small part of the challenge in a cluster like this. Scheduling actual jobs to the cores can be extremely difficult. We like to use the same HPC scheduler (Condor, GridEngine, Torque) that a researcher might have in house. We pushed against some scalability boundaries with Torque on our 4,096-core cluster. For this cluster, we were positive that we needed an environment that was massively scalable and fault-tolerant.

Thankfully, the Condor scheduler is astoundingly scalable and extensible out of the box, and it deals well with cloud architectures where nodes come and go. We put our Condor knowledge to use to build a fully automated, horizontally scaled scheduling environment tuned precisely for maximum responsiveness on a cluster of this size. A future post in this series will describe in detail how we managed to scale Condor to handle both the data requirements and the challenge of 10,000 concurrently running jobs.
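In the meantime, as a flavor of what driving Condor at this scale looks like from the submit side, here is a minimal, hypothetical example that generates a submit description for 10,000 independent jobs and hands it to condor_submit. The executable and file names are placeholders, and the real workload's requirements and data handling were considerably more involved:

```python
import subprocess
import textwrap

# A hypothetical submit description for 10,000 independent jobs.
# The executable and file names are placeholders, not the client's workload.
submit_description = textwrap.dedent("""\
    universe    = vanilla
    executable  = run_docking.sh
    arguments   = $(Process)
    output      = logs/job.$(Process).out
    error       = logs/job.$(Process).err
    log         = logs/cluster.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 10000
""")

with open("tanuki.sub", "w") as f:
    f.write(submit_description)

# Hand the whole batch to the Condor scheduler in one shot.
subprocess.run(["condor_submit", "tanuki.sub"], check=True)
```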

Monitoring and Performance Analysis

When launching a cluster of this magnitude, it is imperative to monitor it closely. We had extensive telemetry capture enabled during the run, which required a robust, scalable tool to both store and visualize everything we captured. Not coincidentally, we leveraged Cycle's flagship product, CycleServer, to do just that.


The upcoming version of CycleServer (version 4.0) includes two enhancements that made keeping tabs on Tanuki much easier than it would have been otherwise. For one, the scalability of CycleServer was frankly astonishing as we threw every bit of scheduler and OS-level data we could at it. Also, the new plug-in subsystem allowed us to expand the metrics being captured and visualized to include practically everything of interest.

The result was a single pane of glass where we could monitor jobs, aggregate the performance of every machine in the cluster, and get real-time readouts of progress as the workload ran. In another blog post, we will dive into the monitoring capabilities of CycleServer and how it can be used to monitor HPC environments such as Tanuki or clusters in your own datacenter.
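CycleServer's plug-in API will wait for that post. As a generic illustration of the kind of OS-level telemetry involved, here is a sketch that samples load average and free memory on a Linux node and ships it to a hypothetical HTTP collector; the endpoint and payload format are invented for illustration and are not CycleServer's actual interface:

```python
import json
import os
import socket
import time
import urllib.request

COLLECTOR_URL = "http://monitor.example.com/metrics"  # hypothetical endpoint

def sample():
    """Collect a few basic OS-level metrics from a Linux node."""
    load1, load5, load15 = os.getloadavg()
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f if ":" in line)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "load1": load1,
        "mem_free_kb": int(meminfo["MemFree"].strip().split()[0]),
    }

def ship(metrics):
    """POST one sample as JSON to the collector."""
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(metrics).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    ship(sample())
```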

10,000 cores could be yours for $1060/hr

Look, we will admit that there was a time when the idea of replacing an internal cluster with a cloud environment was bleeding edge. If Tanuki teaches us anything, it is that clusters in the cloud, even massive ones, are viable alternatives to in-house clusters. The scientific results our clients have gotten from these cloud clusters have proven this over and over. We have now run hundreds of clusters in cloud environments, many of them larger than typical academic or enterprise clusters.

As we said before, running a cluster an order of magnitude larger than a typical academic or enterprise cluster for $1060/hr is mind-boggling. You can be sure that Cycle is already working on the next generation of mega-elastic cloud clusters. We won't stop until any individual scientist or organization with a need can spin up a cluster quickly, submit their jobs easily, monitor the workload closely, and cut it all loose when they are done. Whether you need 10 cores or tens of thousands, we'd like to help you harness the HPC compute power you require to succeed in your research. To find out more, please contact us or request an account.
