Make the Most of Your AWS Instances: Using open-source Condor to Harvest Cycles, Part 2

How To – Harvest Cycles From Your AWS App Servers, Part 2

In Part 1 of this series I introduced you to AmazingWebVideo Inc. They’re a successful, Amazon EC2-based application provider that wants to get more out of its rented processors. Specifically, they want to harvest unused compute cycles from their application servers in between bursts of end-user traffic. We introduced them to Condor in Part 1 and helped them move three classes of background processing jobs from a simple queuing system to Condor in preparation for cycle harvesting. Now let’s take a look at how Condor, installed on their application servers, can help them accomplish this goal.

In our existing Condor pool, our machines are set to always service jobs. Since the only processing load these machines experience comes directly from running Condor jobs, this setup is fine. But the primary workload on our application servers won’t be running under Condor’s control. On those machines Condor needs to pay attention to load it does not control and only run jobs when that load is suitably low. We’ll use Condor’s START attribute and ClassAd technology to write an expression that controls when these machines should run jobs. But first let’s decide how we want the jobs to run on these machines. There is a whole spectrum of choices here, and it helps to think about them in advance of writing your run-time policies in Condor configuration files.

Policy Time

There are four state changes around which we need to develop policy:

  1. “When can Condor run jobs on this machine?”;
  2. “When should Condor suspend jobs it may be running?”;
  3. “When should Condor resume running suspended jobs?”; and
  4. “When should Condor kill a job it may be running?”

The first question, “When can Condor run jobs on this machine?”, is certainly the most interesting to us at this point in time. We’d like to use these machines to do Condor work whenever there’s a lull in primary app server work. Condor tracks the overall load average for the machine and the load average of any processes under Condor’s control. With these two values we can compute the non-Condor load average:

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)

For the sake of convenience we’ll define two new ClassAd attributes for our machines called CPUBusy and CPUIdle. CPUBusy is true if the CPU of the machine is busy doing non-Condor work, and CPUIdle is true if the non-Condor load on the machine is quite low:

CPUBusy = ($(NonCondorLoadAvg) >= 0.8)
CPUIdle = ($(NonCondorLoadAvg) <= 0.3)

We can adjust that 0.8 to suit our specific tastes.
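To get a feel for how the two thresholds behave, here is a minimal Python sketch of the same arithmetic. It is not Condor configuration: the function name, parameter names, and sample load values are made up for illustration, and only the 0.8 and 0.3 cutoffs come from the macros above.

```python
def classify_load(load_avg, condor_load_avg, busy_at=0.8, idle_at=0.3):
    """Return (non_condor_load, cpu_busy, cpu_idle) for one load sample.

    Mirrors NonCondorLoadAvg, CPUBusy and CPUIdle from the text.
    busy_at and idle_at are hypothetical parameter names for the cutoffs.
    """
    non_condor = load_avg - condor_load_avg
    cpu_busy = non_condor >= busy_at
    cpu_idle = non_condor <= idle_at
    return non_condor, cpu_busy, cpu_idle


# Illustrative (LoadAvg, CondorLoadAvg) samples, not measured data.
samples = [(0.2, 0.0), (1.5, 1.0), (1.9, 1.0)]
for load_avg, condor_load_avg in samples:
    non_condor, busy, idle = classify_load(load_avg, condor_load_avg)
    print(f"non-Condor load {non_condor:.2f}: busy={busy} idle={idle}")
```

Note the middle sample: a non-Condor load of 0.5 is neither busy nor idle. The gap between the two cutoffs is deliberate, since it keeps a machine from flipping between the two states on small load changes.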

Are we ready to define our START policy now? Let’s see:

START = $(CPUBusy) =!= True

That looks pretty good. LoadAvg in Condor is a one-minute average, so you’re reasonably safe from thrashing due to quick dips in your app server workload. The machine will have been calm for some time before the START expression evaluates to True.
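The =!= operator in the START expression above is the ClassAd "is not identical to" comparison. Unlike !=, it always yields a plain True or False, even when one side is undefined. The Python sketch below imitates that behavior so you can see what START does in each case. Using None as a stand-in for an undefined attribute is an assumption of this sketch, and the function names are hypothetical.

```python
UNDEFINED = None  # stand-in for a ClassAd UNDEFINED value (sketch assumption)


def is_not_identical(left, right):
    """Imitate ClassAd's =!= operator: the result is always a plain boolean."""
    same_type = type(left) is type(right)
    return not (same_type and left == right)


def start(cpu_busy):
    """Imitate START = $(CPUBusy) =!= True."""
    return is_not_identical(cpu_busy, True)


for value in (True, False, UNDEFINED):
    print(f"CPUBusy={value!r:5} -> START={start(value)}")
```

In other words, the machine accepts jobs unless CPUBusy is exactly True, and a missing load value does not block it from starting work.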

Suspend or Kill?

With Condor, the choices to suspend or kill a job when the app server load spikes are not mutually exclusive. You can build a flexible policy that suspends first and vacates the job if it can’t be resumed within a defined window of time. Policy can even be built around job classes or attributes, suspending some jobs and vacating others when app server load begins to increase.

For this post we’ll build a policy that just kills the jobs when the non-Condor load climbs back to 0.8 or higher, the CPUBusy threshold. First we turn off suspension on the app server machines:

WANT_SUSPEND = False
SUSPEND = False
CONTINUE = True

And we want to preempt the job when the CPU becomes busy with non-Condor work:

PREEMPT = $(CPUBusy)

We can give our job a window of time to respond to a soft kill signal when we enter the Preempting state before we actually hard kill the job processes. Let’s provide our jobs with a thirty-second window to vacate the machine before we move to hard kill them:

StateTimer = (CurrentTime - EnteredCurrentState)
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
WANT_VACATE = True
KILL = $(StateTimer) > 30

We created a macro called StateTimer to make our policy a little more succinct. Once a machine passes from the Claimed state to the Preempting state a job has thirty seconds to complete before a hard kill signal is sent to terminate the process.
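To make the vacate window concrete, here is a small Python sketch of what happens to a job once its machine enters the Preempting state. It is a simplification, not a model of Condor internals: it assumes the KILL expression is re-checked once per second, and seconds_to_exit is a hypothetical number describing how long a job needs to shut down after the soft kill signal.

```python
def simulate_preempting(seconds_to_exit, kill_after=30):
    """Return (outcome, seconds_in_preempting) for one preempted job.

    kill_after mirrors the 30 in KILL = $(StateTimer) > 30.
    """
    state_timer = 0  # seconds since the machine entered Preempting
    while True:
        if state_timer >= seconds_to_exit:
            # The job shut down on its own after the soft kill signal.
            return "vacated", state_timer
        if state_timer > kill_after:
            # KILL became True, so the job is hard killed.
            return "hard-killed", state_timer
        state_timer += 1


for needed in (10, 30, 45):
    outcome, elapsed = simulate_preempting(needed)
    print(f"needs {needed}s to exit -> {outcome} after {elapsed}s")
```

A job that can clean up within the window leaves on its own terms. One that needs longer is cut off as soon as StateTimer passes thirty seconds, so any checkpointing or cleanup a job does on the soft kill signal needs to fit inside that window.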

We can observe our app server and dedicated server policies in effect through the CycleServer management interface. In the graph below, machines being used by the primary application are not available to run jobs and are represented by a dark grey area in the graph. The available machines show what work they were processing at the time, with render (dark blue), mine (green), and clean (orange) jobs executing.

Now that we have this working, how can we keep Condor jobs from running while certain executables (like backup.sh) are running? Can we prioritize render jobs so that thumbnails for important content render first? Can we avoid preempting jobs just because the node they run on gets busy for a minute or two?

In Part 3 of this series we’ll explore some more complicated Condor policy setups, including the ability to suspend and resume certain jobs.
