CycleCloud 6.5.3 released

Last week we pushed the button on the latest release of our CycleCloud software for managing cloud HPC and Big Compute workloads. This release has one particular feature that many customers asked for: Cost Alerting. This new feature lets you set cost alerts on a per-cluster basis. You can set the alert to be dollars per day or per week. This gives you a great way to manage consumption and ensure that users aren’t blowing through budgets. After all, you want to give your users access to unlimited compute, but you don’t want to give them an unlimited budget. Clusters from any supported cloud service provider display an estimated compute cost along with the core-hour usage. Daily or monthly budgets are set from the cluster page and trigger alerts when the threshold is crossed. Because the appropriate action when a cluster goes over budget varies, the CycleCloud software does not take any automated enforcement action. We find most customers set the threshold to some percentage of total budget to give them a heads up before exceeding the budget. The percentage can be a function of the type of work and size of budget. In addition to cost alerting, we’ve added features to our Microsoft Azure support. CycleCloud now uses Azure Managed Disks and Images for virtual machines, simplifying management of storage and improving performance. Azure instances will automatically use CycleCloud’s standalone DNS configuration to improve the experience for Open Grid Scheduler users. Current customers can download CycleCloud 6.5.3 from the Cycle Computing Portal. If you’d like to learn...
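The practice of alerting at a percentage of the total budget can be sketched in a few lines. This is only an illustration of the arithmetic, not CycleCloud's interface: the function names and the 75% default are hypothetical, and no enforcement action is taken, in keeping with the behavior described above.

```python
def alert_threshold(total_budget, alert_fraction):
    """Return the dollar amount at which an alert should fire.

    alert_fraction is the share of the total budget (hypothetical
    parameter, e.g. 0.75 for a heads up at 75% of the budget).
    """
    if not 0 < alert_fraction <= 1:
        raise ValueError("alert_fraction must be in (0, 1]")
    return total_budget * alert_fraction


def should_alert(estimated_cost, total_budget, alert_fraction=0.75):
    """True once the estimated cost reaches the alert threshold.

    This only reports the condition; what to do about an over-budget
    cluster is left to the person receiving the alert.
    """
    return estimated_cost >= alert_threshold(total_budget, alert_fraction)
```

A smaller fraction gives more warning time, which may suit large budgets or bursty work; a larger fraction reduces noise for small, predictable jobs.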
Simulating Hyperloop pods on Microsoft Azure

Earlier today, we published a case study and press release about some work we did with the HyperXite team from the University of California, Irvine and their efforts in the Hyperloop competition. This team leveraged CycleCloud to run ANSYS Fluent™ on Microsoft Azure Big Compute to complete their iterations in 48 hours, enabling them to get results fast enough to make adjustments and modifications to the design, then rerun the simulations until they were able to converge on a final solution, all for less than $600 in simulation costs. This was a case where cloud enabled them to do something they could not have done any other way. As a bit of background, Elon Musk’s SpaceX started the Hyperloop project as a way to accelerate development of a fast, safe, low-power, and cheap method of transporting people and freight. HyperXite was one of 27 teams that competed recently. Nima Mohseni, the team’s simulation lead, used the popular computational fluid dynamics software ANSYS Fluent™ to model the pod. The team focused its modeling on the braking approach. Through simulation, they were able to show that they could brake using magnetic force alone, removing the need for mechanical brakes. This reduced weight, increased efficiency, and improved the overall design, which was recognized with a Pod Technical Excellence award last year. Using the CycleCloud software suite, the HyperXite team created an Open Grid Scheduler cluster leveraging Azure’s memory-optimized instances in the East US region. Each instance has 16 cores based on the 2.4 GHz Intel...

LAMMPS scaling on Azure InfiniBand

While public clouds have gained a reputation as strong performers and a good fit for batch and throughput-based workloads, we often still hear that clouds don’t work for “real” or “at scale” high performance computing applications. That’s not necessarily true, however, as Microsoft Azure has continued its rollout of InfiniBand-enabled virtual machines. InfiniBand is the most common interconnect among TOP500 supercomputers, and Microsoft has deployed the powerful and stable iteration known as “FDR” InfiniBand. Best of all, these exceptionally high levels of interconnect performance are now available to everyone on Azure’s new H-series and N-series virtual machines. To see how well Azure’s InfiniBand works, we benchmarked LAMMPS, an open source molecular dynamics simulation package developed by Sandia National Laboratories. LAMMPS is widely used across government, academia, and industry, and is frequently a computational tool of choice for some of the most advanced science and engineering teams. LAMMPS relies heavily on MPI to achieve sustained high performance on real-world workloads, and can scale to many hundreds of thousands of CPU cores. Armed with H16r virtual machines, we selected the Lennard-Jones liquid (“LJ”) benchmark and tested two scenarios: “weak scaling”, in which every core simulated 32,000 atoms no matter how many cores were utilized, and “strong scaling”, which used a fixed problem size of 512,000 atoms with an increasing number of cores. Both scenarios simulated 1,000 time steps. We performed no “data dumps” (i.e. intermediate output to disk) in order to isolate solver performance, and ran 30 test jobs per data point in order to obtain statistically meaningful averages. In summary, the results were impressive...
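The difference between the two scenarios comes down to how the problem size relates to the core count. The sketch below works through that arithmetic using the atom counts quoted above; the function names and the efficiency definitions are the conventional textbook ones and are assumptions here, not taken from the benchmark scripts. No timing results are included, since those must come from actual runs.

```python
ATOMS_PER_CORE_WEAK = 32_000   # weak scaling: fixed work per core
ATOMS_TOTAL_STRONG = 512_000   # strong scaling: fixed total problem size


def weak_scaling_total_atoms(cores):
    """Total atoms simulated in the weak-scaling scenario."""
    return ATOMS_PER_CORE_WEAK * cores


def strong_scaling_atoms_per_core(cores):
    """Atoms handled by each core in the strong-scaling scenario."""
    return ATOMS_TOTAL_STRONG / cores


def weak_scaling_efficiency(t_base, t_n):
    """Ideal weak scaling keeps run time constant, so efficiency is
    the baseline time divided by the time on more cores."""
    return t_base / t_n


def strong_scaling_efficiency(t_base, cores_base, t_n, cores_n):
    """Measured speedup divided by the ideal speedup (the ratio of
    core counts)."""
    speedup = t_base / t_n
    ideal_speedup = cores_n / cores_base
    return speedup / ideal_speedup
```

Note that the two scenarios coincide at 16 cores, where 512,000 atoms works out to 32,000 atoms per core. Beyond that point the strong-scaling case gives each core progressively less work, so communication over the interconnect accounts for a growing share of the run time.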

Leap second #37 is coming!

Everybody always talks about needing more time. Well, this year you get it! Saturday night will be one second longer than normal. A leap second is being inserted to slow clocks down to match the Earth’s rotation. Beyond just adding a second to your day, your software needs to be ready as well. The addition of leap seconds in 2012 and 2015 means that many software systems are ready for it. This includes CycleCloud and the cloud service providers it works with.

Leap second handling

Here’s how the cloud service providers handle the leap second:

Amazon Web Services: The additional second is spread over the 24 hour period from 12:00 UTC on December 31 through 12:00 UTC on January 1. Each “second” will be 1/86400 of a second longer.
Azure: In 2015, Azure inserted leap seconds at midnight local time. The assumption is that they will do this again.
Google Cloud: The additional second is spread over the 20 hour period from 14:00 UTC on December 31 through 10:00 UTC on January 1.

How instances started on these providers behave depends on their configuration. Generally speaking, Linux instances will use the NTP server pools and handle the change in the kernel. Windows instances on AWS will follow the AWS time adjustment above. Windows generally handles leap seconds by changing the clock at the next update.

It’s a leap year, too

In case one extra second was not enough 2016 for you, remember that this year was a leap year as well. If your application considers the day of the year, you’ll want to make sure it’s...
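To make the smearing windows and the leap-year caveat concrete, here is a small sketch. It assumes the extra second is spread linearly across each window (the text above says only that it is "spread"), and the constant and function names are made up for illustration.

```python
import calendar
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc

# Smear windows as described above (start time, duration).
AWS_SMEAR = (datetime(2016, 12, 31, 12, 0, tzinfo=UTC), timedelta(hours=24))
GOOGLE_SMEAR = (datetime(2016, 12, 31, 14, 0, tzinfo=UTC), timedelta(hours=20))


def smear_fraction(now, window):
    """Fraction of the leap second absorbed at time `now`,
    assuming a linear smear across the window."""
    start, duration = window
    if now <= start:
        return 0.0
    if now >= start + duration:
        return 1.0
    return (now - start) / duration


def per_second_stretch(window):
    """How much longer each smeared 'second' is, in seconds."""
    _, duration = window
    return 1 / duration.total_seconds()


def day_of_year(d):
    """Day number within the year; December 31 is 366 in a leap year."""
    return d.timetuple().tm_yday


def days_in_year(year):
    """Avoid hard-coding 365 when working with day-of-year values."""
    return 366 if calendar.isleap(year) else 365
```

Both windows are centered on midnight UTC, so under the linear assumption each provider has absorbed half of the extra second at 00:00 UTC on January 1. The shorter Google window means each smeared second is stretched slightly more (1/72000 of a second versus 1/86400).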

CycleCloud 6 feature: Azure Resource Manager support

This post is one of several in a series describing features introduced in CycleCloud 6, which we released on November 8. Microsoft introduced the Azure Resource Manager (ARM) to speed the process of large deployments. It treats groups of resources as a single unit, allowing Big Compute clusters to be rapidly scaled up and down in response to current needs. That is the core of our philosophy, so we added ARM support to CycleCloud 6. CycleCloud manages the complexity for you: resource groups and scale sets are dynamically created. All you need to do is provide your credentials: Coming to SC16? Stop by booth #3621 for a...

Efficient use of entropy in cloud environments

Secure communication requires entropy: unpredictable input to the encryption algorithms that convert your message into what seems like a string of gibberish. Entropy is particularly important when generating keypairs, encrypting filesystems, and encrypting communication between processes. Computers use a variety of inputs to provide entropy: network jitter, keyboard and mouse input, purpose-built hardware, and so on. Frequently drawing from the pool of entropy can reduce it to the point where communications are blocked waiting for sufficient entropy. Generally speaking, entropy has two aspects: quality (i.e. how random is the value you get?) and the amount available. The quality of entropy can be increased by seeding it from a quality source of entropy. Higher-quality entropy makes better initialization vectors for the Linux Pseudo Random Number Generator (LinuxPRNG). The Ubuntu project offers a publicly available entropy server. The quantity of entropy (i.e. the value of /proc/sys/kernel/random/entropy_avail) only increases over time. It is worth noting here that virtual machines in the cloud are not quite “normal” computers with regard to entropy. Cloud instances lack many of the inputs that a physical machine would have, since they don’t have keyboards and mice attached, and the hypervisor buffers away much of the random jitter of internal hardware. Further, the Xen (Amazon Web Services), KVM (Google Cloud), and Hyper-V (Microsoft Azure) hypervisors virtualize hardware access to varying degrees, which can result in diminished entropy. You need to be aware of the entropy available on your instances and how your code affects that. When writing code, it’s important to minimize the calls to /dev/random for entropy as it blocks until sufficient entropy is available. /dev/urandom...
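As a rough illustration of the two habits described above (knowing how much entropy an instance has, and not reading /dev/random directly), here is a minimal sketch. The helper names are hypothetical. It relies on Python's os.urandom, which draws from the operating system's non-blocking random source rather than opening /dev/random, and it returns None on systems that do not expose the Linux entropy counter.

```python
import os
from pathlib import Path

# Linux-specific counter mentioned above; absent on other platforms.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def available_entropy():
    """Return the kernel's current entropy estimate in bits,
    or None if the counter cannot be read on this system."""
    try:
        return int(ENTROPY_AVAIL.read_text().strip())
    except (OSError, ValueError):
        return None


def random_bytes(n):
    """Return n random bytes without reading /dev/random.

    os.urandom asks the operating system for random bytes and does
    not wait on the entropy estimate once the pool is initialized.
    """
    return os.urandom(n)


if __name__ == "__main__":
    print("entropy_avail:", available_entropy())
    print("16 random bytes:", random_bytes(16).hex())
```

Logging the value of available_entropy() periodically on a cloud instance is one way to see whether a workload is draining the pool faster than the virtualized hardware can refill it.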