Monitoring cloud GPUs with CycleCloud

Graphics Processing Units (GPUs) provide a great boost for high performance computing, but they’re expensive and take time to purchase and install. With our CycleCloud software, you can get immediate access to just the right amount of cloud GPU time from Microsoft Azure, Google Cloud, and Amazon Web Services. GPU-enabled instances in CycleCloud enjoy the same features that traditional compute instances do: cost control, monitoring, and dynamic scaling. In our upcoming release, we’ve improved the monitoring experience, making it easier than ever to manage your cloud GPU instances. CycleCloud configures the monitoring automatically for GPU-enabled instances with drivers installed. You don’t need to do any of the setup yourself. When you click Show Detail on a cloud node in the CycleCloud interface, you can now see performance graphs and statistics alongside the other node information. When the node has GPUs, this includes the GPU usage and memory. In addition, the detail window includes a Metrics tab. This tab shows all of the raw performance metrics reported by the Ganglia system monitoring platform. If you’re interested in learning more, stop by booth #530 at the GPU Technology Conference this week for a demo, or contact...

How can you save money with Preemptible VMs?

You have your workload in production on Google Cloud, so now what? The next step is to do more work without going over budget. Google Cloud offers reduced-rate cloud instances, called Preemptible VMs. Preemptible VM pricing varies by machine type, but can represent a cost savings of up to 80%. Since the prices are fixed, you can easily predict your spend. The tradeoff for the lower prices is that the instances can be taken away on short notice to meet demand from regularly priced instances, and they are always terminated after 24 hours. Preemptible VMs also cannot have GPUs attached and do not receive sustained use discounts.

For customers looking to get the most out of Preemptible VMs, CycleCloud™ makes using them easy and effective with a single click, with features like easy requests across multiple machine types and Zones and automatic replacement of lost instances.

Using Preemptible VMs in CycleCloud

CycleCloud software has unique features that make it easy to not only use Preemptible VMs, but use them effectively. Preemptible VM capacity varies by machine type and Zone. With CycleCloud, you can specify multiple machine types and Zones to make it easy to get the capacity you need. CycleCloud software automatically spreads the requests out across all combinations you choose. If you lose some instances, CycleCloud automatically requests replacement instances from the remaining combinations that have capacity. To use Preemptible VMs in CycleCloud, set Preemptible = true in the nodearray section of the cluster template.

Job considerations

When using Preemptible VMs, your jobs need to be interruptible. This means that they can either restart from the...
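The idea of spreading requests across machine type and Zone combinations, then re-requesting lost instances from the combinations that still have capacity, can be sketched in a few lines of Python. This is only an illustration of the concept, not CycleCloud's actual implementation: the function names, the round-robin policy, and the example machine types and zones are all assumptions made for the sketch.

```python
from itertools import cycle, product


def all_combinations(machine_types, zones):
    """Every (machine type, zone) pair the user is willing to run on."""
    return list(product(machine_types, zones))


def spread_requests(total, combos):
    """Split `total` instance requests round-robin across `combos`."""
    if total > 0 and not combos:
        raise ValueError("no machine type/zone combination has capacity")
    counts = {combo: 0 for combo in combos}
    for _, combo in zip(range(total), cycle(combos)):
        counts[combo] += 1
    return counts


# Hypothetical machine types and zones, chosen only for the example.
combos = all_combinations(["n1-highcpu-16", "n1-highcpu-8"],
                          ["us-central1-a", "us-central1-b"])
initial = spread_requests(10, combos)

# Suppose 3 instances are preempted and one combination is out of capacity:
# re-request the lost instances from the remaining combinations only.
exhausted = {("n1-highcpu-16", "us-central1-a")}
remaining = [c for c in combos if c not in exhausted]
replacements = spread_requests(3, remaining)
```

Round-robin is the simplest policy that touches every combination; a real scheduler would presumably also weigh price and observed capacity.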

Improving ALS research with Google Cloud, Schrödinger, and Cycle Computing

Today we published a case study describing how the use of Google Cloud enabled one professor to do work she never thought possible. May Khanna, Assistant Professor of Pharmacology at the University of Arizona, studies pharmacological treatments for pain. Her specific area of expertise focuses on research that uses protein binding to develop possible treatments. Using our CycleCloud™ software to manage a 5,000-core Google Cloud Preemptible VM cluster running Schrödinger® Glide™ has enabled research she never thought possible. This cluster was used to run 20,000 hours of docking computations in four hours for $192, thanks to the simple, consistent pricing of GCP’s Preemptible VMs. The n1-highcpu-16 instances she used have 16 virtual cores and 60 gigabytes of RAM, so they’re well-suited for this kind of compute-heavy workload.

For this project, Professor Khanna wanted to analyze a protein associated with amyotrophic lateral sclerosis, also known as “ALS” or “Lou Gehrig’s disease”. ALS has no known cure and causes pain and eventual death for some 20,000 people in the United States every year. Protein binding simulation is compute-intensive, even under the constraints researchers often apply to achieve their results in a reasonable time. For example, proteins are often simulated in isolation and the binding sites are restricted to a set of known or expected active locations on the protein. With those constraints, Professor Khanna was only able to simulate 50,000 compounds, which yielded a grand total of four possible hits. She was about to give up on the project when she approached Cycle Computing. Using her Google Cloud cluster, she was able to simulate binding of a million compounds in just a...
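The cluster figures quoted above fit together with simple arithmetic: 5,000 cores for four hours is 20,000 core-hours, and $192 spread over that works out to just under one cent per core-hour. The short Python sketch below repeats the calculation. It assumes the "20,000 hours" in the text means core-hours, and the function name is made up for the example.

```python
def cost_per_core_hour(cores, wall_clock_hours, total_cost_usd):
    """Return (core-hours consumed, cost in USD per core-hour)."""
    core_hours = cores * wall_clock_hours
    return core_hours, total_cost_usd / core_hours


# Figures taken from the case study summary above.
core_hours, rate = cost_per_core_hour(cores=5_000,
                                      wall_clock_hours=4,
                                      total_cost_usd=192)
print(f"{core_hours:,} core-hours at ${rate:.4f} per core-hour")
```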

Leap second #37 is coming!

Everybody always talks about needing more time. Well, this year you get it! Saturday night will be one second longer than normal. A leap second is being inserted in order to slow clocks down to match the Earth’s rotation. Beyond just adding a second to your day, your software needs to be ready as well. The addition of leap seconds in 2012 and 2015 means that many software systems are ready for it. This includes CycleCloud and the cloud service providers it works with.

Leap second handling

Here’s how the cloud service providers handle the leap second:

Amazon Web Services: The additional second is spread over the 24-hour period from 12:00 UTC on December 31 through 12:00 UTC on January 1. Each “second” will be 1/86400 longer.
Azure: In 2015, Azure inserted leap seconds at midnight local time. The assumption is that they will do this again.
Google Cloud: The additional second is spread over the 20-hour period from 14:00 UTC on December 31 through 10:00 UTC on January 1.

How instances started on these providers handle the leap second depends on their configured behavior. Generally speaking, Linux instances will use the NTP server pools and handle the change in the kernel. Windows instances on AWS will follow the AWS time adjustment above. Windows generally handles leap seconds by changing the clock at the next update.

It’s a leap year, too

In case one extra second was not enough 2016 for you, remember that this year was a leap year as well. If your application considers the day of the year, you’ll want to make sure it’s...
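To see what "spreading" the extra second means in practice, here is a minimal Python sketch that computes how much of the leap second has been absorbed at a given moment. It assumes the smear is linear across each provider's window. That matches the "1/86400 longer" figure quoted for AWS, but it is an assumption for Google Cloud. The dictionary and function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Smear windows as described in the text: (start time, length).
SMEAR_WINDOWS = {
    "aws": (datetime(2016, 12, 31, 12, 0, tzinfo=UTC), timedelta(hours=24)),
    "google": (datetime(2016, 12, 31, 14, 0, tzinfo=UTC), timedelta(hours=20)),
}


def smeared_fraction(provider, now):
    """Fraction of the leap second absorbed by `now`, assuming a linear smear."""
    start, length = SMEAR_WINDOWS[provider]
    return min(1.0, max(0.0, (now - start) / length))


def stretch_per_second(provider):
    """How much longer each smeared "second" is, as a fraction of a second."""
    _, length = SMEAR_WINDOWS[provider]
    return 1 / length.total_seconds()


midnight = datetime(2017, 1, 1, tzinfo=UTC)
for name in SMEAR_WINDOWS:
    print(name, smeared_fraction(name, midnight), stretch_per_second(name))
```

Both windows are centered on midnight UTC, so under this assumption each provider has absorbed half of the extra second at that point. The shorter Google window simply stretches each second a little more.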

Efficient use of entropy in cloud environments

Secure communication requires entropy: unpredictable input to the encryption algorithms that convert your message into what seems like a string of gibberish. Entropy is particularly important when generating keypairs, encrypting filesystems, and encrypting communication between processes. Computers use a variety of inputs to provide entropy: network jitter, keyboard and mouse input, purpose-built hardware, and so on. Frequently drawing from the pool of entropy can reduce it to the point where communications are blocked waiting for sufficient entropy.

Generally speaking, entropy has two aspects: quality (i.e. how random is the value you get?) and the amount available. The quality of entropy can be increased by seeding the pool from a high-quality source. Higher quality entropy makes better initialization vectors for the Linux Pseudo Random Number Generator (LinuxPRNG). The Ubuntu project offers a publicly-available entropy server. The quantity of entropy (i.e. the value of /proc/sys/kernel/random/entropy_avail) only increases over time.

It is worth noting here that virtual machines in the cloud are not quite “normal” computers with regard to entropy. Cloud instances lack many of the inputs that a physical machine would have, since they don’t have keyboards and mice attached, and the hypervisor buffers away much of the random jitter of internal hardware. Further, the Xen (Amazon Web Services), KVM (Google Cloud), and Hyper-V (Microsoft Azure) hypervisors virtualize hardware access to varying degrees, which can result in diminished entropy. You need to be aware of the entropy available on your instances and how your code affects that.

When writing code, it’s important to minimize the calls to /dev/random for entropy as it blocks until sufficient entropy is available. /dev/urandom...
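One way to stay aware of the entropy available on an instance is to read the kernel's estimate from the /proc file mentioned above. The following Python sketch does that on Linux. The function names and the 200-bit warning threshold are illustrative assumptions, not recommended values.

```python
from pathlib import Path

# Path taken from the text; it only exists on Linux.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def available_entropy(path=ENTROPY_AVAIL):
    """Return the kernel's entropy estimate in bits, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def entropy_is_low(bits, threshold=200):
    """True when an estimate exists and is below the (hypothetical) threshold."""
    return bits is not None and bits < threshold


bits = available_entropy()
if bits is None:
    print("entropy estimate not available on this system")
elif entropy_is_low(bits):
    print(f"warning: only {bits} bits of entropy available")
else:
    print(f"{bits} bits of entropy available")
```

Polling this value from a monitoring script before and after an entropy-hungry task is a cheap way to see how your code affects the pool.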

Using Tags for Tracking Cloud Resources

When Amazon announced that the limit for tags on resources was increased from 10 to 50, a great cheer went up from the masses. In the same way that you’ll always own slightly more stuff than you can fit in your house, many cloud users found they wanted more tags than were available.

Tags are custom labels that can be assigned to cloud resources. While they don’t have a functional effect (except in Google Cloud Platform, see below), they can be very powerful for reporting and automation. For example, some customers have a single corporate account and tag resources by department, user, project, et cetera for chargeback. Some customers also use tags in automated tools. For example, you can tag instances with a “backup” attribute and have a script that polls those instances to create snapshots of permanent volumes on a daily basis. Or perhaps you have an account for testing and you don’t want users to accidentally leave instances running forever. You can automatically terminate long-running instances that don’t have a “keepalive” tag set.

In Amazon Elastic Compute Cloud (EC2) and Microsoft Azure, tags are key-value pairs. CycleCloud supports adding tags to instances and volumes with a simple syntax:

      tags.Application = my application
      tags.CustomValue = 57
      tags.Custom Text = Hello world

Tags in Google Cloud Platform

The term “tag” has a different meaning in Google Cloud Platform. A “tag” is an attribute placed on an instance that is used to apply network or firewall settings. Other resources do not have tags. CycleCloud supports adding tags to GCP instances...
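The “keepalive” cleanup described above comes down to a simple filter over instance metadata. The Python sketch below shows that selection logic on plain dictionaries. The record layout, the function name, the instance IDs, and the two-day limit are assumptions made for illustration. A real script would fetch the instance list from the cloud provider's API and then call its terminate operation.

```python
from datetime import datetime, timedelta, timezone

KEEPALIVE_TAG = "keepalive"  # tag name from the text; any key would work


def instances_to_terminate(instances, now, max_age):
    """Return IDs of instances older than `max_age` that lack the keepalive tag."""
    doomed = []
    for inst in instances:
        too_old = now - inst["launch_time"] > max_age
        if too_old and KEEPALIVE_TAG not in inst["tags"]:
            doomed.append(inst["id"])
    return doomed


# Hypothetical inventory, standing in for a provider API response.
now = datetime(2017, 1, 10, tzinfo=timezone.utc)
fleet = [
    {"id": "test-1", "launch_time": now - timedelta(days=5), "tags": {}},
    {"id": "test-2", "launch_time": now - timedelta(days=5),
     "tags": {"keepalive": "true"}},
    {"id": "test-3", "launch_time": now - timedelta(hours=3), "tags": {}},
]
stale = instances_to_terminate(fleet, now, max_age=timedelta(days=2))
print(stale)
```

Only the presence of the tag is checked here, not its value, so users can opt an instance out of cleanup by setting the tag to anything.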