HowTo: Save a $million on HPC for a Fortune 100 Bank

In any large, modern organization there exists a considerable deployment of desktop-based compute power. Those bland, beige boxes used to piece together slide presentations, surf the web and send out reminders about cake in the lunch room are turned on at 8am and off at 5pm, then left to collect dust after hours. With modern virtual desktop initiatives (VDI) in particular, thin clients running Linux sit idle despite the value they hold from a compute perspective.


Fortune 100 Bank Harvesting Cycles

Today we want to show how big financial services companies use desktops of any type to perform high-throughput pricing and risk calculations. Our example comes from a Fortune 100 company, let's call them ExampleBank, that runs a constant stream of moderate-data, CPU-heavy computations on their dedicated grid. As an alternative to dedicated server resources, running jobs on desktops was estimated to save them millions in server equipment, power and other operating costs, and London/UK data center space, thanks to open source software that has no license costs associated with it!

Cycle engineers worked with their desktop management IT team to deploy Condor on thousands of their desktops, all managed by our CycleServer product. Once deployed, Condor falls under control of CycleServer and job execution policies are crafted to allow latent desktop cycles to be used for quantitative finance jobs.

Configuring Condor

Condor is a highly flexible job execution engine that fits very comfortably into a desktop compute environment, offering up spare cycles to grid jobs when the desktop machine is not being used for its primary role. Our client wanted a policy that would make machines available after hours and on weekends, but only if the machine wasn't performing computational or interactive work for its owner when the execution window opened. Condor tracks mouse and keyboard activity as well as CPU load not initiated by Condor, making it possible to craft execution policies that meet complex requirements such as this one.

The first step was to define a policy section that managed the execution windows based on time and day of the week. Two macros were created to simplify the final START expression. The first macro, WEEKDAY_CAN_START, tests whether the time of day falls within the weekday execution window. The second macro, WEEKEND_CAN_START, tests whether the time of day falls within the weekend execution window. Both handle windows that wrap past midnight, such as 18:00 to 08:00. We combined them in the RUNWINDOW_SCHEDULE_OBEYED macro, which pairs each time test with the matching day-of-week test. If this macro is true, the job window is open.

# When, during a weekday, should Condor begin to run jobs and stop
# running jobs?
# 18:00 start time (=1080 minutes in to the day)
WEEKDAY_START_TIME = 1080
# 08:00 end time (=480 minutes in to the day)
WEEKDAY_END_TIME = 480

# When, during the weekend, should Condor begin to run jobs and stop
# running jobs?
# 00:00 start time (=0 minutes in to the day)
WEEKEND_START_TIME = 0
# 23:59 end time (=1439 minutes in to the day)
WEEKEND_END_TIME = 1439

# Boolean expression that returns true if the time of day is inside
# the weekday run window. The day-of-week test is applied separately
# in RUNWINDOW_SCHEDULE_OBEYED.
WEEKDAY_CAN_START = \
( \
( \
$(WEEKDAY_END_TIME) > $(WEEKDAY_START_TIME) && \
( \
ClockMin >= $(WEEKDAY_START_TIME) && \
ClockMin <= $(WEEKDAY_END_TIME) \
) \
) || ( \
$(WEEKDAY_END_TIME) < $(WEEKDAY_START_TIME) && \
( \
ClockMin >= $(WEEKDAY_START_TIME) || \
ClockMin <= $(WEEKDAY_END_TIME) \
) \
) \
)

# Boolean expression that returns true if the time of day is inside
# the weekend run window. The day-of-week test is applied separately
# in RUNWINDOW_SCHEDULE_OBEYED.
WEEKEND_CAN_START = \
( \
( \
$(WEEKEND_END_TIME) > $(WEEKEND_START_TIME) && \
( \
ClockMin >= $(WEEKEND_START_TIME) && \
ClockMin <= $(WEEKEND_END_TIME) \
) \
) || ( \
( \
$(WEEKEND_END_TIME) < $(WEEKEND_START_TIME) && \
( \
ClockMin >= $(WEEKEND_START_TIME) || \
ClockMin <= $(WEEKEND_END_TIME) \
) \
) \
) \
)

# This returns true if the schedule for running jobs is in a
# run window. Otherwise false.
RUNWINDOW_SCHEDULE_OBEYED = \
( \
((ClockDay > 0 && ClockDay < 6) && $(WEEKDAY_CAN_START)) || \
((ClockDay == 0 || ClockDay == 6) && $(WEEKEND_CAN_START)) \
)
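
To make the wrap-around logic easier to follow, here is a minimal Python sketch of the same window test. It is an illustration only, not part of the Condor configuration: the function names in_window and run_window_open are hypothetical, and the clock_day and clock_min arguments simply stand in for Condor's ClockDay (0 = Sunday through 6 = Saturday) and ClockMin (minutes since midnight).

```python
# Illustrative sketch only: mirrors the macros above in plain Python.
WEEKDAY_START, WEEKDAY_END = 1080, 480   # 18:00 -> 08:00, wraps past midnight
WEEKEND_START, WEEKEND_END = 0, 1439     # 00:00 -> 23:59


def in_window(clock_min, start, end):
    """True if clock_min (minutes since midnight) is inside the window."""
    if end > start:
        # Ordinary window within a single day.
        return start <= clock_min <= end
    if end < start:
        # Window wraps past midnight: late evening OR early morning.
        return clock_min >= start or clock_min <= end
    # start == end: neither branch of the Condor macro matches.
    return False


def run_window_open(clock_day, clock_min):
    """Stand-in for RUNWINDOW_SCHEDULE_OBEYED."""
    if 0 < clock_day < 6:
        return in_window(clock_min, WEEKDAY_START, WEEKDAY_END)
    if clock_day in (0, 6):
        return in_window(clock_min, WEEKEND_START, WEEKEND_END)
    return False


if __name__ == "__main__":
    print(run_window_open(3, 19 * 60))  # Wednesday 19:00
    print(run_window_open(3, 12 * 60))  # Wednesday 12:00
    print(run_window_open(6, 12 * 60))  # Saturday 12:00
```

With these settings a Wednesday at 19:00 and a Saturday at noon fall inside a run window, while a Wednesday at noon does not.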

With time-of-day windows now being computed, we turned our attention to monitoring CPU load from outside Condor and user activity on the console, to ensure that the job window only opens when the machine is really not being used off hours. Configurable macros were set up to determine when the non-Condor CPU load was deemed to be unacceptably high. This parameterized approach meant the IT staff at ExampleBank could easily tune the policy. The CPUIdle macro indicates when a machine has become quiet and the CPUBusy macro indicates when a machine has become loaded down with non-Condor work.

# These are the load values that determine if non-Condor
# load is acceptably low enough to run jobs.
BackgroundLoad = 0.3
HighLoad = 0.5

# This is the non-Condor load average
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)

# These macros return true or false and answer load questions
# about this machine.
CPUIdle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPUBusy = ($(NonCondorLoadAvg) >= $(HighLoad))

A final macro to detect keyboard and console use was put in place. The ConsoleNotBusy macro will be true if the machine has been free of mouse or keyboard activity for at least one hour.

KeyboardBusy = (KeyboardIdle < 60*60)
ConsoleBusy = (ConsoleIdle  < 60*60)
ConsoleNotBusy = ($(KeyboardBusy) == False && $(ConsoleBusy) == False)
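
One detail worth noticing is that the two load thresholds leave a gap: a non-Condor load between 0.3 and 0.5 is neither idle nor busy. The following Python sketch restates the macros above so that gap can be tried out with concrete numbers. It is only an illustration; the function names are hypothetical and the arguments stand in for the LoadAvg, CondorLoadAvg, KeyboardIdle and ConsoleIdle values Condor reports.

```python
# Illustrative sketch only: plain-Python versions of the load and
# console macros, using the same thresholds as the configuration.
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5
IDLE_SECONDS = 60 * 60


def cpu_idle(load_avg, condor_load_avg):
    """Non-Condor load is low enough to start jobs."""
    return (load_avg - condor_load_avg) <= BACKGROUND_LOAD


def cpu_busy(load_avg, condor_load_avg):
    """Non-Condor load is high enough to suspend jobs."""
    return (load_avg - condor_load_avg) >= HIGH_LOAD


def console_not_busy(keyboard_idle, console_idle):
    """Both idle timers (in seconds) have reached one hour."""
    return keyboard_idle >= IDLE_SECONDS and console_idle >= IDLE_SECONDS


if __name__ == "__main__":
    # A non-Condor load of 0.4 sits between the two thresholds.
    print(cpu_idle(0.4, 0.0), cpu_busy(0.4, 0.0))
    print(console_not_busy(7200, 1800))
```

Because of the gap, a machine with a non-Condor load of 0.4 will not start new jobs, but a job already running on it will not be suspended for CPU reasons either.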

All three pieces were combined to define the START expression for the machine, which if true, indicates the machine is available to run jobs:

START = ($(ConsoleNotBusy)) && \
($(CPUIdle)) && \
($(RUNWINDOW_SCHEDULE_OBEYED) =?= True)

The other side of the policy was to determine what happens to jobs when the execution window closes. It can close for any of three reasons:
  1. The console may become busy in the form of mouse or keyboard activity;
  2. The non-Condor CPU load may rise above the threshold set for acceptably low;
  3. The time of day may pass into a period where jobs are not supposed to execute.

When reason 1 or 2 occurred, ExampleBank wanted running jobs to be suspended for up to one hour before being returned to the queue to find other CPU resources. For reason 3, the job was to be returned to the queue immediately when the time-of-day execution window closed, since it was known the job would have no chance to run on this machine again for many hours.

Condor provides suspend, resume and vacate settings to tailor the behaviour of an execute node. Cycle engineers modified them as follows to achieve the policy ExampleBank desired:

SUSPEND = ($(CPUBusy)) || ($(ConsoleBusy))
WANT_SUSPEND = SUSPEND

CONTINUE = ($(CPUBusy) =!= True) && ($(ConsoleNotBusy))

PREEMPT = (Activity == "Suspended") && ($(ActivityTimer) > 3600) || \
(($(RUNWINDOW_SCHEDULE_OBEYED) =?= False))

The SUSPEND statement tells Condor to suspend a job if the CPU or console becomes busy. The CONTINUE statement tells Condor to resume a suspended job once the CPU is no longer busy and the console and keyboard have been idle for the required hour. The PREEMPT statement puts the job back in the queue if it has been suspended for more than an hour or if the run window has closed.
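
The combined effect of these three expressions can be traced with a small Python sketch. This is a simplified model under stated assumptions, not a description of how the Condor startd actually evaluates its policy: the decide function, its evaluation order (PREEMPT first, then SUSPEND or CONTINUE) and its return strings are all hypothetical, and the inputs stand in for the machine attributes used in the configuration.

```python
# Illustrative sketch only: a simplified model of the policy above.
# The evaluation order here is an assumption made for readability.
HIGH_LOAD = 0.5
IDLE_SECONDS = 60 * 60
MAX_SUSPEND_SECONDS = 3600


def decide(activity, activity_timer, non_condor_load,
           keyboard_idle, console_idle, window_open):
    """Return what happens to a job that is "Busy" or "Suspended"."""
    cpu_busy = non_condor_load >= HIGH_LOAD
    console_busy = console_idle < IDLE_SECONDS
    keyboard_busy = keyboard_idle < IDLE_SECONDS
    console_not_busy = not keyboard_busy and not console_busy

    suspend = cpu_busy or console_busy
    resume = (not cpu_busy) and console_not_busy
    preempt = ((activity == "Suspended"
                and activity_timer > MAX_SUSPEND_SECONDS)
               or window_open is False)

    if preempt:
        return "vacate"  # job goes back to the queue
    if activity == "Busy":
        return "suspend" if suspend else "run"
    if activity == "Suspended":
        return "resume" if resume else "stay suspended"
    raise ValueError("unexpected activity: %r" % (activity,))


if __name__ == "__main__":
    # Owner starts a heavy task while a job is running.
    print(decide("Busy", 10, 0.8, 7200, 7200, True))
    # Run window closes while the job is running.
    print(decide("Busy", 10, 0.1, 7200, 7200, False))
```

In this model a closed run window always sends the job back to the queue, regardless of load or console state, which matches the intent described for reason 3.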

The Outcome

Today, desktop pools like this perform financial instrument pricing, as well as software build and test. The following picture taken from CycleServer shows their desktop pool availability over a 24 hour period, with risk management tasks (light blue) running at night and daytime unavailability marked by the machines being in the Owner state (dark grey). The light grey area represents machines unused by staff during the day that do not match any work in their queues.

The dip in available machines just after 7:00 pm occurs as power saving software turns some desktop machines off for the night. A next step in this project is to integrate desktop hibernation and wake-on-LAN features with Condor's power management facilities to turn machines on or off depending on the needs of the computation load.

The result of rolling out this harvesting was quite a success, even for only one set of desktops. This initial set of desktops provides an additional 3200 execution slots, with approximately 2400 of those slots providing enough memory, disk and CPU horsepower to satisfy the demands of the grid jobs. Operating for a 12 hour period during weekdays and a 60 hour period every weekend (we could do much more here), ExampleBank has gained an additional (((12 hours x 4 days) + (60 hours)) x 52 weeks) x 2400 cores = 13,478,400 CPU hours, or ~1537 CPU years every calendar year!
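
For anyone who wants to plug in their own pool size, the arithmetic above can be reproduced in a few lines of Python. The variable names are ours, and the conversion to CPU years assumes a 365.25-day year, which is an assumption on our part that happens to reproduce the ~1537 figure.

```python
# Illustrative sketch only: reproduces the CPU-hour estimate above.
weekday_hours = 12 * 4      # 12 hours a night, 4 weeknights
weekend_hours = 60          # one 60-hour weekend window
weeks_per_year = 52
usable_slots = 2400

cpu_hours = (weekday_hours + weekend_hours) * weeks_per_year * usable_slots

# Assumption: one CPU year = 24 hours x 365.25 days.
cpu_years = cpu_hours / (24 * 365.25)

if __name__ == "__main__":
    print("CPU hours per year:", cpu_hours)
    print("CPU years per year:", int(cpu_years))
```

Changing usable_slots or the window lengths gives a quick estimate for a different desktop pool.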