Lessons learned building a 4096-core Cloud HPC Supercomputer for $418/hr

The Challenge: 4096-core Cluster

Back in December 2010, we discussed running a 2048-core cluster using CycleCloud, which was in effect renting a circa-2005 Top 20 supercomputer for two hours. After that run, we were given a use case from a client that required us to push the boundary even further with CycleCloud. The challenge at hand was running a large workflow on a 4096-core cluster. Could our software start a cluster of that size, and resolve the issues that arise in getting it up and running?

Cycle engineers accepted the challenge and built a new cluster we’ll call “Oni”. The mission of CycleCloud is to make running large computational clusters in the cloud as easy as possible. There is a lot of work that must happen behind the scenes to provision clusters both at this scale and on-demand.

What kinds of issues did we run into as we prepared to scale out the CycleCloud service from building a 2048-core cluster up to a whopping 4096-core Oni cluster?  This post covers four of these questions:

  1. Can we get 4096 cores from EC2 reliably?
  2. Can the configuration management software keep up?
  3. Can the scheduler scale?
  4. How much does a 4096-core cluster cost on CycleCloud?


Question 1: Can We Get 4096 Cores from EC2 Reliably?

We needed 512 c1.xlarge instances (each with 8 virtual cores) in EC2’s us-east region for this workload. This is a lot of instances! First, we requested that our client’s EC2 instance limit be increased. This is a manual process, but Cycle Computing has a great relationship with AWS and we secured the limit increase without issue.

However, an increased instance limit doesn't guarantee availability.  How many instances will we actually get and how will they be distributed over the region’s Availability Zones?

Spot Instance Pricing is No Indicator of Availability!

Hoping to predict instance availability, we analyzed historical Spot Prices for c1.xlarge instances during our previous run, using cloudexchange to collect this information. Our request for 256 instances caused no significant change in Spot Price, so we assumed we’d have no problem. You’ll see from the data below that we were wrong: Spot Price is no indicator of current availability! The Spot Price was unchanged over the course of our run, despite our receiving Out of Capacity errors in three out of four Availability Zones.
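The flaw in our assumption is easy to state in code. A minimal sketch of the check we were effectively doing, using hypothetical (time, price) samples like those we pulled from cloudexchange (the function name and the sample prices are illustrative, not our actual tooling):

```python
# Sketch: did our run perturb the Spot market? Sample prices below are
# hypothetical placeholders, not real c1.xlarge quotes.

def spot_price_moved(samples, threshold=0.01):
    """Return True if the Spot Price varied by more than `threshold`
    (in $/hr) across the sampled window."""
    prices = [price for _, price in samples]
    return (max(prices) - min(prices)) > threshold

# Samples around our run window: the price stayed flat even while three
# of four Availability Zones were returning Out of Capacity errors.
samples = [("13:08", 0.24), ("13:42", 0.24), ("14:18", 0.24)]
print(spot_price_moved(samples))  # flat price, yet capacity was exhausted
```

A flat price told us nothing: capacity had run out without the market moving at all.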

Availability Zone Distribution and Failure Rates

We staged our instance requests in batches of 64 to see if we could spread our instances evenly over the four Availability Zones in the us-east-1 region. The table below shows where our instances actually landed. We quickly realized that us-east-1a, us-east-1c and us-east-1d were full after receiving Out of Capacity errors from our requests. CycleCloud fail-over algorithms switched requests to Availability Zones with capacity on the fly. By the end, all of our instances were landing in us-east-1b.

Time (EST) | Instances Requested | AZ Requested | Actual AZ  | Notes
13:08:00   | 64                  | us-east-1a   | us-east-1d | 1 DOA; 1 disk failure
13:15:00   | 65                  | us-east-1b   | us-east-1b | 1 DOA
13:25:00   | 64                  | us-east-1c   | us-east-1d | 3 not reachable
13:30:00   | 64                  | us-east-1b   | us-east-1b |
13:36:00   | 64                  | us-east-1a   | us-east-1b | 1 DOA
13:42:00   | 64                  | us-east-1d   | us-east-1b |
14:05:00   | 64                  | us-east-1c   | us-east-1b |
14:13:00   | 81                  | us-east-1b   | us-east-1b | 1 DOA
14:18:00   | 1                   | us-east-1b   | us-east-1b |

You can also see from the table that a few instances (roughly 1.5%) either failed to launch or were unusable after launch and had to be terminated and replaced. We saw three AWS failure modes:

  • AWS-Terminated: The instance goes from the starting state right into the terminated state (DOA).
  • Unreachable: The instance does not register with CycleCloud nor can we access it with SSH. It’s as though the instance has no network connectivity.
  • Disk failure: One instance failed to create its ephemeral disk.

On recent large clusters we’ve seen a consistent 0.75-1.00% instance failure rate due to the first two modes.  We haven’t seen a repeat of the disk failure. These observations have led to new autoscaling features in CycleCloud that replace instances lost to these failure modes, ensuring that you get all the instances your workload requires.  As a result of our testing we are confident that we can reliably request and get 4096 cores from EC2.
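The fail-over behavior described above can be sketched simply: request a batch in the preferred zone, and when a zone returns Out of Capacity, retry the same batch in the next zone. Here `request_batch` is a hypothetical stand-in for the underlying EC2 launch call, not CycleCloud's actual implementation:

```python
# Sketch of Availability Zone fail-over: try the preferred zone first,
# then fall through to the others on Out of Capacity errors.

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

class OutOfCapacity(Exception):
    """Raised when a zone cannot satisfy the requested batch."""

def launch_with_failover(count, preferred_zone, request_batch):
    """Launch `count` instances, failing over across zones.

    `request_batch(count, zone)` is a placeholder for the real launch
    call; it returns instance IDs or raises OutOfCapacity.
    """
    ordered = [preferred_zone] + [z for z in ZONES if z != preferred_zone]
    for zone in ordered:
        try:
            return zone, request_batch(count, zone)
        except OutOfCapacity:
            continue  # zone is full; try the next one
    raise RuntimeError("no capacity in any Availability Zone")
```

In our run this is exactly what played out: requests aimed at us-east-1a, 1c, and 1d failed over until every batch was landing in us-east-1b.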

Question 2: Can the Configuration Management Software Keep Up?

Now that we can get 512 of the 8-core c1.xlarge instances, can we configure them reliably and quickly? CycleCloud uses Chef for configuration management, a great solution for automating software installation and configuration on large clusters. However, we weren’t certain the Chef infrastructure could keep up with the demand from 512 c1.xlarge instances.

Prior Peak: 64 nodes

Previous tests had shown that some new instances would time out while registering with our Chef back-end if we started more than 64 instances simultaneously. These timeouts exposed performance issues in our configuration scripts, which we could streamline. Thus, from a peak perspective, we knew to keep individual requests to about 64 nodes.

Prior Throughput: 64 nodes every 10 minutes

As with many things in HPC, there is a concern about both instantaneous peaks and overall throughput. CycleCloud was able to get the cluster up and configured, but we learned a few things we could improve for throughput.

Based upon our experience testing with this cluster, we beefed up our Chef infrastructure and streamlined our configuration scripts to help with scalability. We also benchmarked an increasing converge timing scheme that we invented, and found it raised both the peak number of nodes and the number of nodes configured per unit time. When we first started testing Oni, we could handle 64 instances every 10 minutes.
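The idea behind an increasing converge timing scheme is to stop every node in a batch from hammering the Chef server at the same instant. A minimal sketch of the concept (the specific delays and growth factor here are illustrative assumptions, not CycleCloud's real values):

```python
# Sketch: stagger first chef-client runs so a batch of nodes doesn't
# register with the Chef server all at once. Delays grow geometrically
# and are capped; the numbers below are illustrative only.

def converge_offsets(batch_size, base_delay=5, growth=1.5, cap=300):
    """Seconds each node in a batch waits before its first converge.

    Early nodes register immediately, later nodes back off more and
    more, smoothing the registration load on the Chef server.
    """
    offsets = []
    delay = 0.0
    for _ in range(batch_size):
        offsets.append(min(delay, cap))
        delay = base_delay if delay == 0 else delay * growth
    return offsets
```

With a scheme like this, a 64-node batch spreads its converges over the cap window instead of producing one registration spike, which is what had been causing the timeouts.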

Post-improvements: 256 nodes every 5 minutes

After these optimizations and infrastructure changes, we can now support simultaneous launches of up to 256 instances, or 2048 cores (up 4x from the previous limit of 64 instances at a time), every 5 minutes. And we have ideas on how to increase that number dramatically. Thanks to our experience and the software we’ve written to automate these processes, we haven’t lost a single instance to Chef registration timeouts since this run. Now on to the scheduler…


Question 3: Can the Scheduler Scale?

For this specific workload, we had an end-user requirement to use the open-source resource manager Torque. We ran a quick test with our new 4096-core cluster to see if we could schedule jobs quickly and reliably. We found that with the default settings, Torque could not handle 4096 jobs.

In testing on the 4096-core cluster, Torque’s scheduler failed several times and had to be tuned. The primary parameter requiring tuning was the “alarm” parameter for pbs_sched. The default is quite short, so we raised it to 3600 seconds (1 hour), which kept the pbs_sched process from dying prematurely when large workloads were submitted.

We then submitted 4096 test jobs, but once around 3200 jobs were scheduled, the scheduler slowed to a crawl (see graph below around 17:40). Increasing Torque’s job_stat_rate to five minutes or more solves this problem by making the pbs_server daemon query pbs_mom for job data far less often. We can now run these Torque clusters at 4000 cores, but after this experience we do recommend Condor for workloads of this size in the future, as it seems to scale much more gracefully.

With this experience, we are now confident we can tune Torque to consistently handle over 4000 cores. It’s important to keep in mind that as you scale the number of jobs in your environment, you need to make sure that your scheduler is configured to handle the increased workload. Schedulers have many knobs that can be turned to maximize scalability and performance so you can’t just assume it will scale using the default configuration. In this case, for 4000 cores using Torque, set the ‘alarm’ to 3600 and ‘job_stat_rate’ to five minutes.
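As a sketch, the two settings above can be applied like this. Exact invocation varies by Torque version; this assumes the scheduler alarm is set via pbs_sched's `-a` option and that job_stat_rate is a pbs_server attribute set through qmgr:

```shell
# Raise the scheduler alarm to 1 hour so pbs_sched isn't killed
# mid-cycle when thousands of jobs are submitted at once.
pbs_sched -a 3600

# Poll pbs_mom for job status every 5 minutes (300s) so pbs_server
# isn't overwhelmed with status traffic at ~4000 running jobs.
qmgr -c "set server job_stat_rate = 300"
```

Check your Torque documentation before applying these; defaults and attribute names have shifted between releases.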


Question 4: How Much Does a 4096-core Cluster Cost?

Below are the costs for the 512-node cluster with a 1 TB filer:

Item                | Quantity                | Price (AWS + CycleCloud)
c1.xlarge instances | 512 instances x 8 cores |
1 TB filer          | 1 TB                    | $0.138 / TB-hour (approx.)
TOTAL per hour      |                         | ~$418

So the 4096-core Oni cluster costs about $418/hr to operate.
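A quick back-of-the-envelope from that total puts the cluster at roughly a dime per core-hour:

```python
# Per-core-hour cost from the figures above: $418/hr across 4096 cores.
total_per_hour = 418.0
cores = 4096

per_core_hour = total_per_hour / cores
print("$%.3f per core-hour" % per_core_hour)  # roughly $0.102 per core-hour
```

That per-core-hour figure is what makes running wide attractive: 4096 cores for one hour costs the same as 512 cores for eight.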



We learned many things from building and scaling Oni, and we are excited about the new software we’ve written to handle more error cases and to scale configuration and scheduling for wide clusters. In the end, we discovered:

  • We can get access to 512 instances with 4096 cores total
  • Over 80% of the instances landed in us-east-1b, with the remainder in us-east-1d
  • We improved our configuration infrastructure to increase scale, so we can now create hundreds of execute nodes at once without errors
  • For Torque, a couple of settings stretched it into scaling to 4000 cores, but we now appreciate Condor’s ease at scaling to these sizes
  • The availability of 512 c1.xlarge nodes underlined the fact that cloud computing can really change the game for many types of calculations, by enabling you to run wide and get results back faster for little extra cost

Stay tuned for much larger CycleCloud clusters coming in the future!  Meanwhile, if you have a cluster-bound workload that you would like to migrate to large scale cluster sizes and core counts, we can help. Please contact us with any questions about this cluster or running HPC in the Cloud.

  • thanks

    Have you tried the same experiment using the cluster compute nodes at AWS? They cost more, but would be most interesting to see whether large clusters of them can be fired up on demand. The CPUs are faster, but the more hpc style interconnect would also be vital for many hpc applications.

  • Yes, we have. We had an earlier blog post where we talked about a Cluster Compute cluster that used the GPU nodes:
    The workload that ran on it was a molecular dynamics application that took advantage of the GPUs, fast CPUs, and better 10 Gbps inter-server/filer bandwidth. Give it a read and let us know what you think!

  • Rob

I’m curious why you would be interested in spreading the instances out across different availability zones. Given that HPC jobs are generally network sensitive, having all of the nodes in a single zone (or as tightly packed as possible) would appear to be a strength rather than a weakness.

  • it’s not an HPC cluster without a fast, low-latency network. what performance (latency, bandwidth and message rate) did you notice in this experiment, especially considering it included nodes at multiple sites?
    even if you believe that throughput on non-communicating serial jobs is HPC, the same basic question applies to IO: what kind of performance did you achieve to your filer. (also, how reasonable is it to have a single 1TB server for 4k cores?)
    no insult intended, but to me this looks like a very cool way to get a high rank on foo@home…

  • @Rob, So HPC is generally composed of two different types of applications, those that are low-latency and require fast interconnects to properly function, versus those that are pleasantly parallel and more concerned with throughput. An increasing number of applications and newer architectures focus less on low-latency interconnects, and more on throughput.
    In this case, the application is pleasantly parallel and had a very high compute to data ratio (hours/days of compute for comparatively small amounts of data), so we had the option of spanning availability zones.

  • @MarkHahn so let’s talk about your points:
    > it’s not an HPC cluster without a fast, low-latency network.
#40 on the Top 500 supercomputing list uses Gigabit Ethernet, as do a number of the other Top 500 systems. There were moderate data needs and we do monitor I/O, but the reason it wasn’t in the blog is that network I/O wasn’t a bottleneck for these workloads (which is probably also the case for the other Gigabit Top 500 supercomputers)
    >what kind of performance did you achieve to your filer
    This cluster ran a production workflow for a Fortune 500 client that had a high compute to data ratio with intermediate data stored on the executing node. So there really isn’t much to say about the filer in this case. With regard to filers, while single node filers are common in EC2, they aren’t the only option. We will talk about this more in future work.