How do I optimize cloud HTCondor jobs for cost?

The most important consideration is getting your work done. The second most important consideration is doing it without wasting money. In this post, we describe how you can minimize costs in an HTCondor environment on Amazon Web Services using CycleCloud™.

Our CycleCloud software is an orchestration platform for any workflow. It provides multi-user support, cost management, alerting, and automation to organizations who want to get better answers, faster. CycleCloud launches, configures, and monitors cloud resources and provides tools for managing data and workflows. With support for Microsoft Azure, Google Cloud, and Amazon Web Services, our customers use CycleCloud to power their cloud HPC and big compute workloads using HTCondor, PBS Pro, Hadoop, and other technologies.

So how can you use CycleCloud’s features to get the most compute for your dollar?

The HTCondor scheduler tracks the state of execution slots, including how long each slot has been idle. This makes it easy to identify “wasted” time in a cloud environment, but acting on it is not always as straightforward as it may seem. CycleCloud waits for a user-configurable length of time before considering whether an idle instance should be shut down, and shuts it down only when the node is within 5 minutes of the end of the billing hour.
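The shutdown rule described above can be sketched in a few lines of Python. This is only an illustration of the two conditions, not CycleCloud's actual implementation; the function name, parameter names, and constants are hypothetical.

```python
BILLING_PERIOD_S = 3600      # hourly billing, as described in this post
SHUTDOWN_WINDOW_S = 5 * 60   # shut down only in the last 5 minutes of the hour


def should_shut_down(uptime_s: float, idle_s: float, min_idle_s: float) -> bool:
    """Return True if an idle instance qualifies for shutdown (hypothetical helper)."""
    # Condition 1: the instance has been idle for at least the configured minimum.
    if idle_s < min_idle_s:
        return False
    # Condition 2: the instance is close to the end of its current billing hour.
    seconds_into_period = uptime_s % BILLING_PERIOD_S
    seconds_left = BILLING_PERIOD_S - seconds_into_period
    return seconds_left <= SHUTDOWN_WINDOW_S
```

Under these assumptions, an instance that is 56 minutes into its billing hour and has been idle for 10 minutes qualifies, while one that is only 30 minutes in does not, however long it has been idle.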

“Shut the instances down sooner!” is an understandable first reaction, but it isn’t necessarily beneficial. AWS bills for EC2 instances by the hour, so shutting an instance down early only reduces the appearance of idle time without lowering your bill. With any of the cloud service providers, keeping the minimum idle time too short will result in instance churn, especially with uneven work submission. This will not only extend the time-to-results, but could increase costs, since new jobs will start up new instances which have to be configured before use.
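To see why an early shutdown does not lower the bill, and why churn can raise it, consider this small sketch of hourly billing. It assumes every started hour is billed in full, as described above; the helper name is ours.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours billed for one instance when every started hour is charged in full."""
    return max(1, math.ceil(runtime_minutes / 60))


# Stopping at 35 minutes costs the same as stopping at 58 minutes.
early = billed_hours(35)
late = billed_hours(58)

# Churn: one instance kept for 58 minutes, versus an instance stopped after
# 20 minutes followed by a replacement that runs for 28 minutes.
kept = billed_hours(58)
churned = billed_hours(20) + billed_hours(28)
```

Here `early` and `late` are both one billed hour, while the churned case bills two hours for less total runtime than the single instance that was kept alive.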

If your load doesn’t fluctuate over short periods (for example, if all work is submitted up front), then shortening the minimum idle time is a good way to prevent instances from running over into the next billing hour. Often, though, the best way to reduce cost is to optimize the jobs themselves.

Use runtime hints. By default, CycleCloud autoscales the cluster to the number of jobs in the queue. This works well for long-running jobs, but for shorter jobs it can lead to overprovisioning. The average_runtime job attribute tells CycleCloud how long you expect the job to run, and CycleCloud uses it to scale down the number of cores requested. For example, four jobs with a 15-minute average_runtime will cause CycleCloud to request 1 core instead of 4. To specify this in a submit file, add the line +average_runtime = 900 (the value is in seconds).
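The arithmetic behind that example can be sketched as follows. The exact formula CycleCloud uses is not given here, so this is an assumption: it packs the expected runtimes into core-hours and treats any job without a hint as needing a full core-hour. The function and parameter names are hypothetical.

```python
import math


def cores_to_request(average_runtimes_s, default_runtime_s=3600):
    """Estimate cores needed by packing expected runtimes into core-hours.

    average_runtimes_s: one entry per queued job, in seconds; None means the
    job carries no average_runtime hint (assumed to need a full core-hour).
    """
    total_s = sum(
        r if r is not None else default_runtime_s for r in average_runtimes_s
    )
    return math.ceil(total_s / 3600)


with_hints = cores_to_request([900, 900, 900, 900])         # four 15-minute jobs
without_hints = cores_to_request([None, None, None, None])  # no hints given
```

With the hints, the four jobs add up to one core-hour, so one core is requested; without them, the sketch falls back to one core per job.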

Use smaller instances. One busy core on a machine will prevent the machine from shutting down. So from a cost perspective, using 16 2-core machines is preferable to 1 32-core machine, all other things being equal.

Reduce runtime variability. Preventing “straggler” jobs is another good way to reduce cost. If the workload can be tuned to reduce the variance in job run times, you will mitigate some of the “single core keeps the whole instance alive” issue.
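A rough model shows how instance size and a single straggler interact. It assumes one job per core, a price proportional to core count ("all other things being equal"), whole-hour billing, and that a machine stays up until its last job finishes. The function name and the example workload are made up for illustration.

```python
import math


def billed_core_hours(job_hours, cores_per_machine):
    """Core-hours billed when each machine runs until its slowest job ends."""
    total = 0
    for i in range(0, len(job_hours), cores_per_machine):
        group = job_hours[i:i + cores_per_machine]
        # Every core on the machine is paid for until the longest job finishes.
        total += cores_per_machine * math.ceil(max(group))
    return total


# 31 one-hour jobs plus one five-hour straggler.
jobs = [1.0] * 31 + [5.0]

one_big = billed_core_hours(jobs, 32)    # a single 32-core machine
many_small = billed_core_hours(jobs, 2)  # sixteen 2-core machines
```

In this model the 32-core machine bills 160 core-hours because the straggler keeps all 32 cores alive for five hours, while the sixteen 2-core machines bill 40 core-hours because only the straggler's machine runs long.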

Of course, the easiest solution is to use reduced-price instances like AWS Spot Instances. These can represent savings of 50% or more without having to make any changes to the workflow.

If you’d like to learn more about getting the most out of your cloud HTCondor clusters, see us at HTCondor Week in Madison, WI this week or contact us.
