How to optimize cluster throughput for Cloud HPC

By Ben Cotton, Cycle Computing Sr. Support Engineer

At Cycle, we believe compute power helps users invent and discover faster. To do this, users rely on schedulers to distribute work to servers at many scales, including the open-source HTCondor, SLURM, and Open Grid Scheduler, and the proprietary LSF, PBS, and UGE, among others.

Specifically, these batch schedulers distribute the running of batch command lines across a set of servers according to priority, user fair-share, quotas, and license concurrency limits, among other factors. Typically, there is a delicate balance between priority/fairness, which reduces queue wait times for individual users, and increasing the overall throughput of the server cluster.

Today we’re going to share some advice to help HTCondor users tune the CLAIM_WORKLIFE setting to optimize a cluster’s throughput:

=====

When the HTCondor negotiator matches a slot to a submitter, a claim is established. For as long as the claim lives, the submitter will continue to use the slot without having to go back to the negotiator first.

A claim goes away in one of four ways:

  1. The submitter has no more jobs to run
  2. The slot evicts the claim (e.g. because the START expression evaluates to false or a condor_vacate command is issued)
  3. The job is evicted by startd or negotiator preemption policy
  4. The claim is older than CLAIM_WORKLIFE

By default, CLAIM_WORKLIFE is 20 minutes (1200 seconds). This strikes a balance between the two extremes. At one extreme, a very large value maximizes the performance benefit, since a submitter never has to go back to the negotiator so long as the slot is still willing to run jobs.
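For reference, CLAIM_WORKLIFE is an integer number of seconds set in the condor_config of the execute nodes. A minimal sketch (the non-default value shown is purely illustrative):

    # condor_config on the execute node (startd)
    # CLAIM_WORKLIFE is in seconds; 1200 (20 minutes) is the default
    CLAIM_WORKLIFE = 1200

    # A shorter worklife of 10 minutes -- illustrative only
    # CLAIM_WORKLIFE = 600

After changing it, a condor_reconfig on the execute nodes should pick up the new value.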

There are some downsides to an infinite CLAIM_WORKLIFE. Perhaps the most significant is the rare case where a condor_startd process goes catatonic and doesn’t realize it should release the claim. In that situation, the slot will remain Claimed/Idle and never do any work. This is bad enough on a dedicated cluster, but it can be expensive in a cloud environment if it prevents an execute host from shutting down when it should.
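As an aside, slots stuck in that state are easy to spot with a constrained condor_status query; here is one way to do it, shown only as an illustration:

    # List slots that are claimed but not actually running anything
    condor_status -constraint 'State == "Claimed" && Activity == "Idle"'

Keep in mind that a slot can legitimately pass through Claimed/Idle for brief periods between jobs, so only long-lived entries in this list are suspicious.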

Setting a short CLAIM_WORKLIFE has downsides as well. If a claim expires before the submitter runs out of jobs, the scheduler has to go back to the negotiator to ask for a new match, which generates additional load. If CLAIM_WORKLIFE is set too low, the extra load on the scheduler could keep large submissions from filling the pool. “Too low” is not a specific number; it depends on the performance of the scheduler and negotiator, job characteristics, and so on.
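If you suspect matchmaking overhead is becoming a bottleneck, the negotiator itself is a good place to look. For example, its ClassAd (which in reasonably recent HTCondor versions includes negotiation cycle timing statistics) can be dumped with:

    # Dump the negotiator's full ClassAd, including negotiation cycle statistics
    condor_status -negotiator -long

If cycle times grow noticeably when large submissions hit the pool, a longer CLAIM_WORKLIFE may help.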

So why set CLAIM_WORKLIFE? In addition to being a backstop against zombie condor_startd processes, it can be useful for implementing group quotas and priorities without having to use preemption. As an example, suppose groupA is configured to use no more than 250 slots in a 1000-slot pool and groupB is configured to use all of the slots (with higher priority). If 2000 groupA jobs are submitted, 250 will begin running. If groupB jobs are submitted next, the groupB jobs will not get the full pool until all of the groupA jobs have finished, because groupA holds onto its claims. With CLAIM_WORKLIFE set to 1800 seconds (30 minutes), slot usage will return to the desired state within half an hour (assuming individual jobs complete within that timeframe).
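To make that example concrete, here is a rough sketch of the kind of negotiator and startd configuration involved. The group names, quotas, and submit-file line are illustrative only; the exact accounting-group and priority setup for your pool may differ:

    # Negotiator config -- hypothetical group names and quotas
    GROUP_NAMES = group_a, group_b
    GROUP_QUOTA_group_a = 250
    GROUP_QUOTA_group_b = 1000

    # Startd config on the execute nodes -- claims expire after 30 minutes
    CLAIM_WORKLIFE = 1800

Jobs would then declare their group in the submit description file (e.g. accounting_group = group_a) so the negotiator can enforce the quotas.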

The best CLAIM_WORKLIFE setting is a balance between keeping the slot allocation in a pool at the desired state (smaller values) and reducing the matchmaking overhead on the scheduler and negotiator (larger values). Setting CLAIM_WORKLIFE substantially shorter than the runtime of your jobs provides no additional benefit, since a running job keeps its claim until it completes anyway.
