How To – Harvest Cycles From Your App Servers, Part 1
It’s a common problem: you run a successful, cloud-based application business in Amazon’s EC2 cloud with bursty traffic. To handle the bursts you have to keep a minimum number of EC2 application servers up and running. Wouldn’t it be nice if you could do something with those servers between bursts? After all, you’re paying for that time, and there are thumbnails to generate, analytics to calculate, and batch applications to run.
Condor is a high-throughput distributed computing environment from the University of Wisconsin, Madison (http://cs.wisc.edu/condor/) that can be configured to steal unused cycles from your application servers when they aren’t serving your main business applications to your customers. Condor provides advanced job scheduling, quota management, policy configuration, support for virtual machine based workloads, and integration with all the popular operating systems in use today. And it’s free.
In the next three posts I’m going to show you how to use Condor to harness the wasted compute power on your application servers and how Cycle Computing’s CycleServer can help make this process simple and manageable.
Throughout this series of posts I’m going to talk about a fictitious web application company: AmazingWebVideo Inc. They offer video hosting services and their business has been growing rapidly over the past twelve months. They already run all of their web application components in Amazon’s EC2 cloud, but the nature of their business still requires that they keep a base number of web app servers constantly running to handle the start of any bursts in usage.
They also currently waste the partial hours of compute time that occur when an app server has finished serving traffic and could be shut down, or when a machine is running far below 100% utilization. We’d like to capitalize on any idle time a machine has, including periods when it is serving traffic but not busy enough to max out the machine.
In addition to the application front end, they have three other types of workload that they currently handle on dedicated hardware in the cloud. They use a simple queueing system, perhaps based on Amazon SQS, and three pools of machines, one to process each type of job.
Render jobs: This type of job performs CPU-bound work on data from the application database on an as-needed basis. Run times vary from a few seconds to a few minutes. If free cycles are available, they want render jobs to run first, because timely completion directly affects the end customer experience.
Clean jobs: This type of job performs a small amount of database maintenance. It can be run on an ad hoc basis; it wipes away application data that is no longer referenced and optimizes the database. The more frequently these jobs run, the longer the gap between critical maintenance outages.
Mine jobs: This type of job mines application data for internal analysis and planning. It rarely needs to run, but it’s nice when it can.
The existing approach to running this work is to keep pools of worker instances in the cloud, one pool per job type. A simple queue for each job type farms out work to its pool, and AWS Auto Scaling can spin instances up or down automatically as demand ebbs and flows.
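For concreteness, the queue-per-type arrangement behaves roughly like the sketch below. It uses an in-memory queue as a stand-in for SQS, and all names (`submit`, `worker`, the job payloads) are illustrative, not part of AmazingWebVideo’s actual system:

```python
import queue

# Stand-ins for the three SQS queues, one per job type.
queues = {"render": queue.Queue(), "clean": queue.Queue(), "mine": queue.Queue()}

def submit(job_type, payload):
    """Enqueue a job onto the queue for its type."""
    queues[job_type].put(payload)

def worker(job_type, handler):
    """Drain one type's queue; each worker pool only sees its own job type."""
    processed = []
    q = queues[job_type]
    while not q.empty():
        processed.append(handler(q.get()))
    return processed

submit("render", "video-123")
submit("render", "video-456")
results = worker("render", lambda payload: "rendered " + payload)
```

The key limitation this sketch makes visible: each pool can only drain its own queue, so idle render workers cannot pick up waiting clean or mine jobs. That is exactly the gap a single Condor queue closes.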
Introducing Condor to the System
The first step is to bring Condor into the picture as a replacement for the in-house simple scheduling system. Condor can provide a single queue for all three types of jobs and a unified view of the pool of cloud machines available to run them. Using Condor’s group quota feature, it’s a simple matter to set policies per job type and begin driving the pool to maximum utilization.
Our run-time policy for the machines in this Condor pool is simple: we always run jobs as long as there are jobs in our queue to run. In Condor configuration speak:
START = True
The machines are set up with no preference for which type of job they will run. Instead, jobs are entered into the Condor queue tagged with a custom group based on their job type. The relevant submit ticket syntax is:
+AccountingGroup = "<job_type>.<owner>"
Group quotas are set at a system level by system administrators:
GROUP_NAMES = render, clean, mine
GROUP_QUOTA_render = 8
GROUP_QUOTA_clean = 3
GROUP_QUOTA_mine = 1
And jobs are entered into the queue, assigned to groups. In a Condor submit ticket, the following assigns a job to group “render” as user “jdoe”:
+AccountingGroup = "render.jdoe"
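In context, a complete submit ticket for a render job might look something like the following sketch. The executable name, arguments, and file names are purely illustrative; only the +AccountingGroup line is the piece this post depends on:

```
universe         = vanilla
executable       = render_thumbnails.sh
arguments        = $(Process)
output           = render.$(Process).out
error            = render.$(Process).err
log              = render.log
+AccountingGroup = "render.jdoe"
queue 1
```

Submitting this file with condor_submit places the job in the shared queue, where the group quotas above decide how many render jobs may run at once.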
We can now make optimal use of the background task machines. When no render jobs are present in the queue, more mine and clean jobs start up to fill the unused processing power. As render jobs reappear in the queue, we can configure our policy either to immediately preempt running non-render jobs to make space for them, or simply to stop starting non-render jobs until render jobs have once again consumed their quota of execute slots in the pool.
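The choice between those two behaviors is made on the negotiator. As a hedged sketch (the exact attribute names available in PREEMPTION_REQUIREMENTS vary by Condor version, so treat these as assumptions to verify against your version’s manual):

```
# Sketch: allow preemption only when the candidate job belongs to a
# different accounting group than the job currently running.
# (Assumption: SubmitterGroup/RemoteGroup are defined in your version.)
PREEMPTION_REQUIREMENTS = (SubmitterGroup =!= RemoteGroup)

# Alternative: never preempt; quotas alone throttle new non-render
# matches once render jobs return to the queue.
# PREEMPTION_REQUIREMENTS = False
```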
We know, we promised to show you how to harvest cycles from your idle application servers. But we had to tackle the first step: getting your jobs into a Condor-managed pool.
By running a Condor pool, we get free access to features we’d otherwise not have: resource quotas, priority scheduling, configurable policies, parallel jobs, fair sharing between users and applications, and awareness of execute node state for policy purposes. These are things you can’t easily get from SQS-based scheduling systems.
Now that we have a Condor pool with dedicated background task processing power in it, we can move on to the next step: harvesting cycles from our idle application servers.
And that step we’ll deal with in Part 2.