Built to scale: 10,600-instance CycleCloud cluster, 39 core-years of science: $4,362!

Here is the story of a 10,600-instance HPC cluster (each instance being a
multi-core server), created in 2 hours with CycleCloud on Amazon EC2, with
one Chef 11 server and one purpose in mind: to accelerate life science
research relating to a cancer target for a Big 10 Pharmaceutical company.

Our story begins…
First, when we got a call from our friends at Opscode about scale-testing Chef, we
had just the workload in mind. As it happened, one of our life science clients
was carrying out a very large-scale run against a cancer target. And let us tell
you, knowing that hardcore science is being done with the infrastructure below
is a very satisfying thing: 

AWS Console Output


That’s right, 10,598 server instances running real science!
But we’re getting ahead of ourselves…

Unfortunately, we’re a bit limited in what parts of the
science we can talk about, other than to say we’ve done a large-scale computational
chemistry run to simulate millions of compounds that may interact with a
protein associated with a form of cancer. We estimated this would take about
341,700 compute hours. This is very cool science, science that would take months to
years on available internal capacity! More on that later…

Thankfully, our software has been doing a lot of utility
supercomputing for clients, and as we mentioned last week, because of this
we’re hiring.

So to tackle this problem, we decided to build software to
create a CycleCloud utility supercomputer from 10,600 cloud instances, each of
which was a multi-core machine! This makes this cluster the largest
server-count cloud HPC environment that we know about, or that has been made
public to date (the former utility supercomputing leader was our
6,732-instance cluster for Schrödinger from 2012).

If this cluster were a physical environment, analysts said
it would occupy a 12,000 sq ft data center space, costing $44 million. Instead,
we created this in 2 hours, with these 10,600 hosts, used it for 9 more, at a
peak cost of $549.72 per hour, and turned it off for a total cost of $4,362.

In creating this environment, we’re also happy to tell you we used a single
open-source Chef 11 server on a CC2 class machine! It took a mere two hours to
get this capacity from Amazon. We proceeded to do the 341,700 hours of
computational chemistry, or 39 compute years, against this protein target, and
then shut it down.

So: 10,600 servers and 39 years of compute in 11 hours,
on the equivalent of $44 million in infrastructure, for only $4,362!
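For anyone who wants to check our math, here is a quick back-of-the-envelope sketch in Python. The input figures are the ones quoted above; the 365-day year and the variable names are our own assumptions for illustration, and the derived per-hour numbers are rough estimates rather than billing data.

```python
# Sanity check of the headline numbers quoted in this post.
# Assumption: a "compute year" is 365 days of 24 hours (leap years ignored).
HOURS_PER_YEAR = 24 * 365

compute_hours = 341_700        # estimated compute hours for the run
total_cost_usd = 4_362         # total cost of the cluster
wall_clock_hours = 2 + 9       # 2 hours to build, 9 more hours of use
peak_cost_per_hour = 549.72    # peak hourly cost

# Convert compute hours to compute years.
compute_years = compute_hours / HOURS_PER_YEAR

# Derived, illustrative figures (not stated in the post).
cost_per_compute_hour = total_cost_usd / compute_hours
average_cost_per_wall_hour = total_cost_usd / wall_clock_hours

print(f"Compute years:              {compute_years:.1f}")
print(f"Cost per compute hour:      ${cost_per_compute_hour:.4f}")
print(f"Average cost per wall hour: ${average_cost_per_wall_hour:.2f}")
print(f"Peak cost per wall hour:    ${peak_cost_per_hour:.2f}")
```

The average hourly cost comes out below the quoted peak, which is what you would expect for a cluster that ramps up over its first two hours rather than starting at full size.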

Simply put, Chef 11
Now we know that the latest version of Chef,
rewritten with an Erlang-PostgreSQL combo for scale, is supposedly
faster/better than the Ruby-Couch version, but we just wanted to put it through
its paces.

And it passed! Boy did it. It was very cool when we ran knife and saw:

Knife Output


Here’s the view from our CycleServer plug-ins,
showing a heck of a lot of servers that had successfully converged:



Lastly, that’s a heck of a lot of servers
running science. And you can see from our Ganglia view that Cycle’s software
has the cluster red hot and using 99% of the CPU:

Red hot science!


The future…
So there we have it: we just handled 10,600 servers, and our
software built the environment, secured it, scheduled data across it, scaled it,
and tracked everything for audit/reporting purposes. Chef 11 handled configuration
for all of them. But now we’re ready to add zeros here, and so is our software.

If you’ve got scientific, engineering, or finance
questions that need “more zeros”, we’d
love to hear from you.

Or if you’re looking for a job, where you get to throw down world-class
, or you
want to help our business grow
, we’d also love to hear from you!
