Use AWS EBS Snapshots to speed instance setup

Time has always been money, but this is especially true when you rent your compute. Shortening the setup time of a cluster not only shortens the time-to-results, it saves money. We recently worked with a customer to cut about 20 minutes from the initial boot time of their compute instances by changing how they staged in their data.

This customer does cancer research, so the jobs rely on a large set of reference data. The reference data was provided by a collaborator and stored in Amazon S3, but in a different region from the compute instances used by the researchers. Initially, each instance pulled the data from S3 using Cycle Computing's pogo tool for cloud data transfer. pogo itself provides great transfer performance, but because the traffic crossed regions over the public internet, downloads were slow and sometimes overwhelmed the AWS NAT Gateway.

We looked at two options: mirroring the data into the desired region and using EBS Snapshots. We chose snapshots because the reference data was static, so the one-time effort of creating a snapshot was amortized over many computation runs. A snapshot takes a little time to create and copy, but attaching a volume is much faster than downloading the data from S3 every time an instance starts. When execute nodes start, CycleCloud creates a volume from that snapshot and attaches it to the instance. This gives each execute node a local copy of the reference data in a fraction of the time.

Note that disk snapshots are not the only way to avoid downloading reference data each time. Snapshots work well when the reference...
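The create-volume-and-attach step that CycleCloud performs for each execute node can be sketched in boto3-style Python. This is a minimal sketch, not CycleCloud's actual implementation: the function assumes a client object following the boto3 EC2 client interface (create_volume / attach_volume), and the snapshot ID, instance ID, and device name in the usage example are hypothetical.

```python
def stage_reference_volume(ec2, snapshot_id, instance_id, availability_zone,
                           device="/dev/xvdf"):
    """Create a volume from a reference-data snapshot and attach it.

    `ec2` is assumed to follow the boto3 EC2 client interface; all
    IDs used with this function are placeholders for illustration.
    """
    # Creating a volume from a snapshot is much faster than pulling the
    # same data from S3 on every instance boot.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    ec2.attach_volume(VolumeId=volume["VolumeId"],
                      InstanceId=instance_id,
                      Device=device)
    return volume["VolumeId"]
```

The volume must be created in the same availability zone as the instance it attaches to, which is why the zone is passed explicitly.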

Leap second #37 is coming!

Everybody always talks about needing more time. Well, this year you get it! Saturday night will be one second longer than normal. A leap second is being inserted to keep clocks in step with the Earth's slowing rotation. Beyond just adding a second to your day, your software needs to be ready as well. The insertion of leap seconds in 2012 and 2015 means that many software systems are already prepared. This includes CycleCloud and the cloud service providers it works with.

Leap second handling

Here's how the cloud service providers handle the leap second:

Amazon Web Services: The additional second is spread over the 24-hour period from 12:00 UTC on December 31 through 12:00 UTC on January 1. Each "second" will be 1/86400 longer.

Azure: In 2015, Azure inserted the leap second at midnight local time. The assumption is that they will do the same again.

Google Cloud: The additional second is spread over the 20-hour period from 14:00 UTC on December 31 through 10:00 UTC on January 1.

How instances started in the cloud providers behave will depend on their configuration. Generally speaking, Linux instances will use the NTP server pools and handle the change in the kernel. Windows instances on AWS will follow the AWS time adjustment above. Windows generally handles leap seconds by changing the clock at the next time update.

It's a leap year, too

In case one extra second of 2016 was not enough for you, remember that this year was a leap year as well. If your application considers the day of the year, you'll want to make sure it's...
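The AWS smear arithmetic above can be sanity-checked with a few lines of Python. This is just a check of the numbers as AWS describes them, using exact fractions to avoid floating-point rounding:

```python
from fractions import Fraction

# AWS smears the leap second across the 86,400 "seconds" between
# 12:00 UTC on December 31 and 12:00 UTC on January 1.
SMEAR_WINDOW = 24 * 60 * 60                      # 86,400 smeared seconds
smeared_second = 1 + Fraction(1, SMEAR_WINDOW)   # each one is 1/86400 longer

# The window therefore lasts exactly one extra SI second in total.
total_si_seconds = SMEAR_WINDOW * smeared_second
print(total_si_seconds)  # 86401
```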

CycleCloud 6 feature: MPI optimizations

This post is one of several in a series describing features introduced in CycleCloud 6, which we released on November 8. Batch workloads have long been a natural fit for cloud environments. Tightly-coupled workflows (e.g. MPI jobs) are more demanding: they are sensitive to bandwidth, latency, and abruptly-terminated instances. MPI workloads can certainly run in the cloud, but they need guardrails. CycleCloud 6 adds several new features that make the cloud even better for MPI jobs.

MPI jobs can't make use of a subset of cores; they need all-or-nothing. CycleCloud now considers the minimum core count necessary for the job and sets that as the minimum request size. In other words, if the provider cannot fulfill the entire request, it won't provision any nodes. Similarly, CycleCloud 6 adds support for Amazon's Launch Group feature, which provides all-or-nothing allocation for spot instances. This opens the spot market to MPI jobs, which can represent significant per-hour savings.

To address the latency concern, CycleCloud now dynamically creates AWS Placement Groups for MPI jobs. This places instances in close network proximity, minimizing latency.

At SC16? Stop by booth #3621 for a...
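The all-or-nothing sizing idea can be illustrated with a short sketch. This is not CycleCloud's internal logic, just the arithmetic behind it: round the node count up so the job gets every core it needs, and treat that as the minimum request the provider must fulfill in full or not at all.

```python
import math

def request_size(cores_required, cores_per_node):
    """Minimum all-or-nothing node request for an MPI job (illustrative).

    An MPI job can't run on a subset of its cores, so the node count is
    rounded up and the whole request is treated as the minimum size:
    the provider fulfills all of it or provisions nothing.
    """
    if cores_required <= 0 or cores_per_node <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(cores_required / cores_per_node)

# e.g. a 96-core job on 36-core nodes needs an all-or-nothing
# request of 3 nodes:
print(request_size(96, 36))  # 3
```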

Efficient use of entropy in cloud environments

Secure communication requires entropy: unpredictable input to the encryption algorithms that convert your message into what seems like a string of gibberish. Entropy is particularly important when generating keypairs, encrypting filesystems, and encrypting communication between processes. Computers use a variety of inputs to provide entropy: network jitter, keyboard and mouse input, purpose-built hardware, and so on. Frequently drawing from the pool of entropy can reduce it to the point where communications block while waiting for sufficient entropy.

Generally speaking, entropy has two aspects: quality (i.e. how random is the value you get?) and the amount available. The quality of entropy can be increased by seeding it from a high-quality source. Higher-quality entropy makes better initialization vectors for the Linux Pseudo Random Number Generator (LinuxPRNG). The Ubuntu project offers a publicly-available entropy server. The quantity of entropy (i.e. the value of /proc/sys/kernel/random/entropy_avail) is only replenished over time.

It is worth noting here that virtual machines in the cloud are not quite "normal" computers with regard to entropy. Cloud instances lack many of the inputs that a physical machine would have: they don't have keyboards and mice attached, and the hypervisor buffers away much of the random jitter of the underlying hardware. Further, the Xen (Amazon Web Services), KVM (Google Cloud), and Hyper-V (Microsoft Azure) hypervisors virtualize hardware access to varying degrees, which can result in diminished entropy.

You need to be aware of the entropy available on your instances and how your code affects it. When writing code, it's important to minimize calls to /dev/random for entropy, as it blocks until sufficient entropy is available. /dev/urandom...
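The two points above, checking the available entropy and preferring a non-blocking source, can be sketched in Python. The entropy_avail path is the Linux interface mentioned in the text (the function simply returns None elsewhere), and os.urandom reads from the kernel's non-blocking pool, so it will not stall a service the way a /dev/random read can.

```python
import os

ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"

def entropy_available():
    """Return the kernel's estimate of available entropy, or None
    on systems without the Linux /proc interface."""
    try:
        with open(ENTROPY_AVAIL) as f:
            return int(f.read())
    except (OSError, ValueError):
        return None

# os.urandom draws from the kernel's non-blocking pool (/dev/urandom
# on Linux), so it is the safer choice for long-running services on
# entropy-starved cloud instances.
key_material = os.urandom(32)
```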

Cloud providers offer newer, better GPUs

Ever since there’s been a public cloud, people have been interested in running jobs on public cloud graphics processing units (GPUs). Amazon Web Services (AWS) became the first to offer this as an option when it announced its first GPU instance type six years ago. GPUs offer considerable performance improvements for some of the most demanding computational workloads. Originally designed to accelerate 3D rendering for games, GPUs found a use in big compute due to their ability to perform operations over a set of data rapidly and with a much greater core count than traditional central processing units (CPUs). Workloads that can use a GPU can see performance improvements of 10 to 100 times.

Two years later, AWS announced an upgraded GPU instance type: the g2 family. AWS does not publish exact capacity or usage numbers, but it’s reasonable to believe that the cg1 instances were sufficiently successful from a business perspective to justify adding the g2s. GPUs are not cheap, so cloud providers won’t keep spending money on them without a return. We know that some of our customers were quick to make use of GPU clusters in CycleCloud.

But there was a segment of the market that still wasn’t being served. The GPUs in the cg1 and g2 instance families were great for so-called “single precision” floating point operations, but had poor performance for “double precision” operations. Single precision is faster, and is often sufficient for many calculations, particularly graphics rendering and other visualization needs. Computation that requires a higher degree of numerical precision, particularly if exponential calculations are made, needs double precision. The GPUs that...
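The single-versus-double precision distinction is easy to demonstrate. The sketch below rounds a Python float (which is IEEE-754 double precision) through single precision using the standard struct module: single precision carries roughly 7 decimal digits, so a term on the order of 1e-8 that a double retains is simply lost.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

value = 1.0 + 1e-8        # double precision keeps the small term
single = to_float32(value)  # single precision rounds it away

assert value != 1.0
assert single == 1.0
```

This is exactly why workloads needing a higher degree of numerical precision can't simply be run in single precision for the speed.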

Using Tags for Tracking Cloud Resources

When Amazon announced that the limit for tags on resources was increased from 10 to 50, a great cheer went up from the masses. In the same way that you’ll always own slightly more stuff than you can fit in your house, many cloud users found they wanted more tags than were available. Tags are custom labels that can be assigned to cloud resources. While they don’t have a functional effect (except in Google Compute Platform, see below), they can be very powerful for reporting and automation. For example, some customers have a single corporate account and tag resources by department, user, project, et cetera for chargeback.

Some customers also use tags in automated tools. For example, you can tag instances with a “backup” attribute and have a script that polls those instances to create snapshots of permanent volumes on a daily basis. Or perhaps you have an account for testing and you don’t want users to accidentally leave instances running forever. You can automatically terminate long-running instances that don’t have a “keepalive” tag set.

In Amazon Elastic Compute Cloud (EC2) and Microsoft Azure, tags are key-value pairs. CycleCloud supports adding tags to instances and volumes with a simple syntax:

      tags.Application = my application
      tags.CustomValue = 57
      tags.CustomText = Hello world

Tags in Google Compute Platform

The term “tag” has a different meaning in Google Compute Platform. A “tag” is an attribute placed on an instance that is used to apply network or firewall settings. Other resources do not have tags. CycleCloud supports adding tags to GCP instances...
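The "keepalive" cleanup automation described above can be sketched in a few lines. This is an illustrative sketch, not a real cloud API: the instance records are hypothetical dicts with id, uptime_hours, and tags fields, and the 8-hour cutoff is an arbitrary example.

```python
def termination_candidates(instances, max_hours=8):
    """Pick long-running instances that lack a 'keepalive' tag.

    `instances` is a list of dicts with hypothetical 'id',
    'uptime_hours', and 'tags' keys; a real script would build these
    records from the provider's API before deciding what to terminate.
    """
    return [inst["id"] for inst in instances
            if inst["uptime_hours"] > max_hours
            and "keepalive" not in inst["tags"]]

fleet = [
    {"id": "i-1", "uptime_hours": 12, "tags": {}},                    # too old, no tag
    {"id": "i-2", "uptime_hours": 12, "tags": {"keepalive": "true"}}, # protected
    {"id": "i-3", "uptime_hours": 2,  "tags": {}},                    # still young
]
print(termination_candidates(fleet))  # ['i-1']
```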