Update: Wow, we've gotten tremendous feedback on this run from Ars Technica, Wired, and others, and man, has it been a busy few days. Many people asked a question that we wanted to clarify:
Q: How long would the run-time take in-house vs. in-cyclecloud?
A: The clients indicated the workload would never have happened in-house because it would have used everything they had for week(s). The in-cyclecloud run time was 7-8 hours.
In more ways than one, the Nekomata cluster is three times as impressive as our last public mega-cluster. A few months ago, we released details of the Tanuki cluster, a 10,000-core behemoth launched within AWS with the click of a button. Since then, we have been launching large clusters regularly for a variety of industries. We kept our eye open for a workload large enough to push us to the next level of scale. It didn’t take very long.
We have now launched a cluster 3 times the size of Tanuki, at 30,000 cores, for a Top 5 Pharma company; it cost $1279/hour to operate. It performed genuine scientific work, in this case molecular modeling, and a ton of it. The complexity of this environment did not necessarily scale linearly with the core count.
In fact, we had to implement a triad of features within CycleCloud to make it a reality:
1) MultiRegion support: To achieve the mind-boggling core count of this cluster, we launched in three distinct AWS regions simultaneously, including Europe.
2) Massive Spot instance support: This was a requirement given the potential savings at this scale by going through the spot market. Besides, our scheduling environment and the workload had no issues with the possibility of early termination and rescheduling.
3) Massive CycleServer monitoring & Grill GUI app for Chef monitoring: There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale. At Cycle, we’ve always been fans of extreme IT automation, but we needed to take this to the next level in order to monitor and manage every instance, volume, daemon, job, and so on in order for Nekomata to be an efficient 30,000 core tool instead of a big shiny on-demand paperweight. Our new Grill monitoring GUI app for CycleServer shows what's cooking in our Opscode Chef environment, which handled cloud infrastructure automation for every instance in this cluster.
Before we step through these enhancements one by one, let’s take a moment to sit back and contemplate the sheer scale of this compute environment.
|AWS Regions||3 ( us-east, us-west, eu-west )|
- Where Tanuki certainly encompassed a significant number of physical racks of compute hardware in the AWS datacenter, this scale meant that our footprint was a significant number of aisles. Certainly, the power, cooling, and floorspace that we occupied during this run were larger than what one would find at many moderately sized datacenters.
- Our system placed individual staged requests to AWS that eclipsed the size of entire clusters that we considered to be large at the time (such as Oni). These requests were made relentlessly to Amazon and were fulfilled with impressive efficiency.
- The amount of RAM available on the compute nodes was roughly 30TB, which means that it could hold the entirety of the current raw Wikipedia database 5 times over in memory (citation needed — get it?)
The workload run for the client was a software package that applies molecular dynamics approaches to molecular structure and function, run against millions of compound targets. On the client's internal cluster, the run was expected to take at least a week while consuming the entire cluster.
To manage the high throughput workload of the jobs, we used the condor scheduler for load distribution and CycleServer to monitor progress of workload and job scheduling. The table below provides a synopsis of the total work that was done in this massive scale analysis.
|Compute Hours of Work||95,078 hours|
|Compute Days of Work||3,961 days|
|Compute Years of Work||10.9 years|
|Number of condor jobs||154,116|
|Avg. Run Time per job||37 minutes|
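As a quick sanity check, the derived rows in the table can be recomputed from the two raw figures (compute hours and job count). This is only a sketch of the arithmetic; the 24-hour day and 365-day year conversions are our assumptions about how the totals were derived.

```python
# Raw figures taken from the table above.
compute_hours = 95_078
job_count = 154_116

# Assumed conversions: 24 hours per day, 365 days per year.
compute_days = compute_hours / 24
compute_years = compute_days / 365

# Average wall-clock minutes per job, spread evenly across all jobs.
avg_minutes_per_job = compute_hours * 60 / job_count

print(f"Compute days:  {compute_days:,.1f}")
print(f"Compute years: {compute_years:.1f}")
print(f"Avg. minutes per job: {avg_minutes_per_job:.0f}")
```

The days figure works out to a little under 3,962, which the table reports truncated to 3,961.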
New Multi-Region Support
Because of the scale of this run, we opted to go across multiple regions to spread out the spot instance requests and increase our probability of getting the core counts we required. Drilling down a little deeper into the numbers shown in the table above, here is what the breakout across the three regions looked like. We were very pleased that provision times were very consistent across the regions.
|Region||Server Count||Core Count||RAM|
Massive Spot Instance Support
Nekomata also marks our first use of Spot Instances. The Spot Market provides a huge discount over traditional on-demand instances, which means we can do more science for less money. We’ve added support for Spot Instance bidding for CycleCloud execute hosts. Nekomata made use of over 3700 c1.xlarge Spot Instances at an average cost of 0.286 USD / instance / hour (0.036 USD / core / hour). Compare that to the 0.68 USD / instance / hour for the same On Demand instance. That’s nearly 58% savings!
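The per-core price and the discount follow directly from the two hourly prices quoted above. Here is a small sketch of that arithmetic; the 8-cores-per-instance constant is our assumption for c1.xlarge, chosen because it matches the per-core figure in the text.

```python
# Prices quoted in the text, in USD per instance per hour.
spot_price = 0.286
on_demand_price = 0.68

# Assumption: a c1.xlarge instance exposes 8 cores.
CORES_PER_INSTANCE = 8

spot_price_per_core = spot_price / CORES_PER_INSTANCE
savings_fraction = 1 - spot_price / on_demand_price

print(f"Spot price per core-hour: {spot_price_per_core:.3f} USD")
print(f"Savings vs. On Demand:    {savings_fraction:.1%}")
```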
Massive Telemetry and Monitoring: CycleServer & Grill
We used Nekomata to give a workout to some of the new tools we’ve been working on that provide unprecedented views into large-scale computing environments. We’re extremely excited about this, as this is the first time that we are leveraging the plugin architecture of CycleServer to enable both the collection and visualization of diverse types of data. One such tool is a new offering that we have named Grill for CycleServer.
Cycle has professed its love for Opscode’s Chef project in the past on our blog, and we continue to use it to provision even our largest clusters–Nekomata is no different. When you have hundreds of nodes converging each minute, you need a way to see precisely which converges are failing so that you can resolve the issue(s).
Using the Handler feature of open-source Chef, we wrote a Ruby gem that reports converge information to our flagship telemetry engine, CycleServer. Grill shows the result of every converge, as you can see in the screenshot below.
For Nekomata, we built a Grill view that shows a timeline of all converges and a window showing drill-down to individual hosts and their converge history. Important information such as converge duration, number of resources updated and an exception stack trace on failure are all just a click away. On the right side of the display is a stacked histogram of converges per minute with successful in green and failed in red. We have many more features planned for our upcoming release of the chef dashboards.
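To give a feel for the converges-per-minute histogram, here is a minimal sketch of the bucketing step. It is not Grill's implementation: the report fields (`host`, `finished_at`, `success`), the function name, and the sample data are all hypothetical, and the real handler is a Ruby gem reporting to CycleServer.

```python
from collections import defaultdict
from datetime import datetime


def converges_per_minute(reports):
    """Bucket converge reports by minute, counting successes and failures."""
    buckets = defaultdict(lambda: {"ok": 0, "failed": 0})
    for report in reports:
        # Truncate the timestamp to the start of its minute.
        minute = report["finished_at"].replace(second=0, microsecond=0)
        key = "ok" if report["success"] else "failed"
        buckets[minute][key] += 1
    # Return plain dict entries ordered by time, ready for plotting.
    return dict(sorted(buckets.items()))


# Hypothetical sample data standing in for real converge reports.
sample_reports = [
    {"host": "exec-001", "finished_at": datetime(2011, 9, 1, 12, 0, 5), "success": True},
    {"host": "exec-002", "finished_at": datetime(2011, 9, 1, 12, 0, 40), "success": False},
    {"host": "exec-003", "finished_at": datetime(2011, 9, 1, 12, 0, 59), "success": True},
    {"host": "exec-001", "finished_at": datetime(2011, 9, 1, 12, 1, 10), "success": True},
]

histogram = converges_per_minute(sample_reports)
for minute, counts in histogram.items():
    print(minute.strftime("%H:%M"), counts)
```

Each bucket maps naturally onto one stacked bar: the `ok` count drawn in green and the `failed` count in red.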
Along with the chef dashboards, we have also extended our telemetry engine and visualization dashboards to include heat maps that succinctly highlight servers that are running hot. We also pull in the data captured by ganglia to show resource utilization trends over time, as illustrated by the point-in-time screenshot below from the Nekomata run.
We’ll be releasing more information in the coming weeks about these tools, so please stay tuned.
Next time around…
While all parties involved are absolutely thrilled at the achievement of the Nekomata cluster, we would be remiss if we did not share a couple of lessons learned during such a large run. You can be sure that when you run at massive scale, you are bound to run into some unexpected gotchas. As a result, it’s critical that you always execute a post mortem after such an event to identify what went well, what didn’t go well and how to do it better next time.
In our case, one of the gotchas was running out of file descriptors on the license server. In hindsight, we should have anticipated this would be an issue, but we didn’t find it in our pre-launch testing because we didn’t test at full scale (another lesson learned... keep reading). We were able to quickly recover from this bump and keep moving along with the workload with minimal impact. The license server was able to keep up very nicely with this workload once we increased the number of file descriptors.
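For readers who want to check for this before a big run, here is a minimal sketch of inspecting and raising the open-file limit for a process on a Unix-like system, using Python's standard `resource` module. This only illustrates the idea; it is not how we changed the license server, and the function name and the target of 65,536 are hypothetical.

```python
import resource


def raise_open_file_limit(target):
    """Raise this process's soft open-file limit toward target, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do

    # The soft limit can never exceed the hard limit without extra privileges.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform refused the new value; keep the current limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # 65536 is an arbitrary illustrative target, not a recommendation.
    soft, hard = raise_open_file_limit(65536)
    print(f"open-file limit: soft={soft}, hard={hard}")
```

A long-running daemon normally gets its limit from the shell or service configuration that starts it, so a check like this is most useful as a pre-flight test.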
Another bump that we hit was the EBS volume and byte limits on EBS-backed volumes, as outlined in the AWS documentation here: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?Concepts_BootFromEBS.html. Once again, once the issue was identified, the necessary limits were raised and we kept moving right along.
What’s next from CycleComputing’s large scale clusters? Well, we already have our next use-case identified and will be turning up the scale a bit more with the next run. While we are excited about the ability to scale clusters to larger and larger core counts and forging new frontiers in cloud computing, we are most excited about the possibilities we are creating by enabling scientists to spin up such large clusters with minimal effort to answer questions that they would never have considered asking a year ago because they didn’t have access to resources that could answer those questions. Today, we have enabled them to have access to massive computing scales in minutes which in turn gives them the ability and opportunity to ask these sorts of complex questions, not once or twice, but multiple times with multiple different perspectives and multiple different variables leading to improvements in their discovery research efforts. We understand that ultimately, it’s not about core counts or TB of RAM or PB of data. Rather, it’s about how we are helping to transform how science is done.