Simulating Hyperloop pods on Microsoft Azure

Earlier today, we published a case study and press release about some work we did with the HyperXite team from the University of California, Irvine and their efforts in the Hyperloop competition. The team leveraged CycleCloud to run ANSYS Fluent™ on Microsoft Azure Big Compute to complete their simulation iterations in 48 hours, enabling them to get results fast enough to adjust and modify the design, then rerun the simulations until they converged on a final solution. All for less than $600 in simulation costs. This was a case where the cloud enabled them to do something they could not have done any other way.

As a bit of background, Elon Musk’s SpaceX started the Hyperloop project as a way to accelerate development of a fast, safe, low-power, and cheap method of transporting people and freight. HyperXite was one of 27 teams that competed recently. Nima Mohseni, the team’s simulation lead, used the popular computational fluid dynamics software ANSYS Fluent™ to model the pod. The key areas the team modeled related to their braking approach. Through simulation, they were able to show that they could brake using magnetic force alone, removing the need for mechanical brakes. This reduced weight, increased efficiency, and improved the overall design, which was recognized with a Pod Technical Excellence award last year.

Using the CycleCloud software suite, the HyperXite team created an Open Grid Scheduler cluster leveraging Azure’s memory-optimized instances in the East US region. Each instance has 16 cores based on the 2.4 GHz Intel...

CycleCloud 6 feature: MPI optimizations

This post is one of several in a series describing features introduced in CycleCloud 6, which we released on November 8. Batch workloads have long been a natural fit for cloud environments. Tightly-coupled workloads (e.g. MPI jobs), by contrast, are sensitive to bandwidth, latency, and abruptly-terminated instances. MPI workloads can certainly be run on the cloud, but with guardrails. CycleCloud 6 adds several new features that make the cloud even better for MPI jobs.

MPI jobs can’t make use of a subset of cores; they need all-or-nothing. CycleCloud now considers the minimum core count necessary for the job and sets the minimum request size accordingly. In other words, if the provider cannot fulfill the entire request, it won’t provision any nodes. Similarly, CycleCloud 6 adds support for Amazon’s Launch Group feature, which provides all-or-nothing allocation for spot instances. This opens the spot market to MPI jobs, which can represent significant per-hour savings.

To address the latency concern, CycleCloud now dynamically creates AWS Placement Groups for MPI jobs. This places instances logically near one another, minimizing latency. At SC16? Stop by booth #3621 for a...
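CycleCloud handles this provisioning automatically, but for a sense of what all-or-nothing spot allocation and placement groups look like at the AWS API level, here is a minimal boto3 sketch. The AMI ID, bid price, group names, and instance count are placeholders for illustration; this is not the code CycleCloud itself runs.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A cluster placement group keeps instances physically close, minimizing MPI latency.
ec2.create_placement_group(GroupName="mpi-demo-pg", Strategy="cluster")

# LaunchGroup makes the spot request all-or-nothing: either all 16 instances
# launch together, or none do (and they are terminated together).
response = ec2.request_spot_instances(
    SpotPrice="0.50",                        # placeholder bid
    InstanceCount=16,
    Type="one-time",
    LaunchGroup="mpi-demo-launch-group",
    LaunchSpecification={
        "ImageId": "ami-00000000",           # placeholder AMI
        "InstanceType": "c4.8xlarge",
        "Placement": {"GroupName": "mpi-demo-pg"},
    },
)
print(response["SpotInstanceRequests"][0]["State"])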

Efficient use of entropy in cloud environments

Secure communication requires entropy — unpredictable input to the encryption algorithms that convert your message into what seems like a string of gibberish. Entropy is particularly important when generating keypairs, encrypting filesystems, and encrypting communication between processes. Computers use a variety of inputs to provide entropy: network jitter, keyboard and mouse input, purpose-built hardware, and so on. Drawing frequently from the entropy pool can deplete it to the point where communications block while waiting for sufficient entropy.

Generally speaking, entropy has two aspects: quality (i.e. how random is the value you get?) and the amount available. The quality of entropy can be increased by seeding the pool from a high-quality source; the Ubuntu project offers a publicly-available entropy server. Higher-quality entropy makes better initialization vectors for the Linux Pseudo Random Number Generator (LinuxPRNG). The quantity of entropy (i.e. the value of /proc/sys/kernel/random/entropy_avail), on the other hand, only builds back up gradually over time.

It is worth noting that virtual machines in the cloud are not quite “normal” computers where entropy is concerned. Cloud instances lack many of the inputs a physical machine would have, since they have no keyboards or mice attached, and the hypervisor buffers away much of the random jitter of internal hardware. Further, the Xen (Amazon Web Services), KVM (Google Cloud), and Hyper-V (Microsoft Azure) hypervisors virtualize hardware access to varying degrees, which can result in diminished entropy.

You need to be aware of the entropy available on your instances and how your code affects it. When writing code, it’s important to minimize calls to /dev/random for entropy, as it blocks until sufficient entropy is available. /dev/urandom...
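As a quick illustration, a minimal Python sketch for checking the kernel’s entropy estimate and drawing random bytes without blocking:

import os

# The kernel's current estimate of available entropy, in bits.
with open("/proc/sys/kernel/random/entropy_avail") as f:
    print("entropy_avail:", f.read().strip())

# /dev/urandom never blocks; it stretches the pool with a PRNG.
with open("/dev/urandom", "rb") as f:
    data = f.read(32)
print("urandom bytes:", data.hex())

# os.urandom() is the portable, non-blocking way to get random bytes in Python.
key_material = os.urandom(32)
print("os.urandom bytes:", key_material.hex())

# Reading /dev/random instead could block here on an entropy-starved VM,
# which is exactly the behavior to avoid in hot code paths.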

Cloud providers offer newer, better GPUs

Ever since there’s been a public cloud, people have been interested in running jobs on public cloud graphics processing units (GPUs). Amazon Web Services (AWS) became the first to offer this as an option when it announced its first GPU instance type six years ago. GPUs offer considerable performance improvements for some of the most demanding computational workloads. Originally designed to improve the performance of 3D rendering for games, GPUs found a use in big compute due to their ability to perform operations over a set of data rapidly and with a much greater core count than traditional central processing units (CPUs). Workloads that can use a GPU can see performance improvements of 10-100x.

Two years later, AWS announced an upgraded GPU instance type: the g2 family. AWS does not publish exact capacity or usage numbers, but it’s reasonable to believe that the cg1 instances were sufficiently successful from a business perspective to justify adding the g2s. GPUs are not cheap, so cloud providers won’t keep spending money on them without return. We know that some of our customers were quick to make use of GPU clusters in CycleCloud.

But there was a segment of the market that still wasn’t being served. The GPUs in the cg1 and g2 instance families were great for so-called “single precision” floating point operations, but had poor performance for “double precision” operations. Single precision is faster and is often sufficient for many calculations, particularly graphics rendering and other visualization needs. Computation that requires a higher degree of numerical precision, particularly when exponential calculations are involved, needs double precision. The GPUs that...
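To make the single- versus double-precision distinction concrete, here is a small, purely illustrative Python/NumPy sketch: evaluating exp(-10) from its Taylor series suffers catastrophic cancellation in single precision but stays accurate in double precision.

import math
import numpy as np

def exp_taylor(x, dtype, terms=60):
    """Evaluate exp(x) from its Taylor series, accumulating in the given dtype."""
    total = dtype(0.0)
    term = dtype(1.0)
    for k in range(1, terms + 1):
        total += term
        term = term * dtype(x) / dtype(k)
    return total

x = -10.0
print("float32 series:", exp_taylor(x, np.float32))   # ruined by cancellation
print("float64 series:", exp_taylor(x, np.float64))   # close to the true value
print("math.exp      :", math.exp(x))                 # ~4.54e-05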

The question isn’t cost, it’s value

Addison Snell recently wrote an article for The Next Platform called “The three great lies of cloud computing.” Snell points out that the marketing around cloud computing doesn’t always match reality. As someone who does marketing for cloud computing software, I just want to go on the record as saying Addison is absolutely…right.

We’ve spent a lot of time on this blog, at conferences, etc. talking about the benefits of using public cloud services for big compute and big data. But we believe that a one-size-fits-all solution is never the right size. Public cloud services can be sized to fit many needs, but not every need.

If there’s one area where Addison’s article falls short, it’s that he only considers the raw dollar amount when talking about cost. Raw dollar amount is important, of course, but it’s not the whole story. As I said in response to a question at HTCondor Week 2016, it’s all about the value that cloud resources provide, not the cost. If you spend twice as much to run a workload in the cloud, but you get three times the value (e.g. due to faster time-to-results, or the ability to run simulations at a finer resolution thanks to the added capacity), that’s a net win.

Customers often find value in the additional capacity or flexibility the cloud can offer: adding more compute without having to plan datacenter space, or trying out new hardware by renting instead of making a large capital investment. Another part of the value discussion is the total value of your entire HPC environment: the mix of cloud plus internal resources. Many...

CycleCloud Support for Elastic File System (EFS)

Last week, Amazon released the Elastic File System (EFS) in the US East (Northern Virginia), US West (Oregon), and Europe (Ireland) regions. EFS provides a scalable, POSIX-compliant filesystem for Amazon EC2 instances without having to run a file server. This means you can grow your storage as your usage increases instead of having to pre-provision disks. Instances mount EFS just as they would any traditional NFS volume.

Of course, we know that you, our customers, will want to start testing workloads against EFS, so we’ve added support for it in the next CycleCloud release. Once the EFS is created through the AWS console, cluster instances can mount it with the configuration you’re already used to. For example, the configuration below will mount EFS fs-f00cf6b8 to /mnt/efs_test:

[[[configuration cyclecloud.mounts.efs_test]]]
type = efs
filesystem_id = fs-f00cf6b8

So what does EFS look like in the real world? We took an I/O-intensive genomics workload and ran it on a 16-instance cluster using four different configurations:

- c3.4xlarge filer using ephemeral storage
- c3.4xlarge filer using a 500 GB GP2 (solid state drive) volume
- c4.4xlarge filer using a 500 GB GP2 volume
- EFS (Basic)

Each job runs without competition on a c4.4xlarge instance and pulls 25 GB of reference genome data into memory. The code performs genomic alignment in batches and at the end writes approximately 1 GB of data (per job) back to the filer. The table below shows the average runtimes for the different filer configurations with as many as 16 such tasks simultaneously using the shared filer:

Filer                    Simultaneous tasks   Average runtime (seconds)
c3.4xlarge (ephemeral)   1                    213
c3.4xlarge (ephemeral)   16                   1658...
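Since EFS presents itself as an ordinary NFSv4 export, mounting it by hand is also straightforward. Here is a rough Python sketch of the equivalent manual steps; the filesystem ID, region, mount point, and especially the endpoint hostname are illustrative placeholders (use the DNS name the AWS console reports for your mount target), and this is not necessarily what CycleCloud generates.

import os
import subprocess

# Illustrative placeholders.
filesystem_id = "fs-f00cf6b8"
region = "us-east-1"
mount_point = "/mnt/efs_test"

# Placeholder for the NFSv4 endpoint shown in the AWS console for your mount target.
nfs_endpoint = "{}.efs.{}.amazonaws.com".format(filesystem_id, region)

os.makedirs(mount_point, exist_ok=True)
subprocess.check_call([
    "mount", "-t", "nfs4",
    "-o", "nfsvers=4.1",
    "{}:/".format(nfs_endpoint),
    mount_point,
])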