Clouds with Deadlines

Background: Why is the field important?

There exist many different modern approaches to resource allocation in computing clusters, ranging from the simplistic (FIFO, highest priority first, and earliest deadline first) to the more complex (fair scheduling, etc). However, many of these algorithms focus on maximizing optimization problems that have very basic job utilities that are either constant or binary with respect to a deadline time. We extend this problem to look at the nonconvex problem of scheduling jobs with deadlines, with time-dependent utility functions.

Summary of our work at EDGE Lab: What new things are we bringing to this field?

In EDGE Lab at Princeton University, we have developed the iterative-TACT algorithm that allows us to achieve deadline-aware resource allocation for jobs with time-dependent utilities and deadlines, which is useful for optimizing the scheduling of jobs in computing clusters. We have further extended this work to deal with the problem of optimizing the allocation of resources to deadline-sensitive data flows under different network conditions:

Independent Network

In the case of an independent network, like the internet, the exact topology of the network is not visible to the clients and clusters on the edge—rather, these entities must rely on observed end-to-end statistics in order to infer both (a) the topology of the network and (b) any bottlenecked links that serve as constraints on the resource allocation problem. In EDGE Lab at Princeton University, we are developing a set of tools based on bottleneck inference and resource allocation based on inferred bottlenecks for optimal resource allocation to data flows over the uncontrolled network.

Based on Katabi’s 2000 work on pseudo-K-Means clustering of packet interarrival times for bottleneck inference, we have developed a bottleneck inference module that minimizes the normalized cost of bottleneck grouping to infer both the number of bottlenecked links, and which flows belong to which bottlenecked links. Using this information, generated in an online manner, we can use this as a real-time input to the robust resource allocation module, which allocates different throughputs to different flows on the network. We are currently working on setting up an experimental testbed to analyze the results of our algorithm to demonstrate the responsiveness, accuracy, and adaptability of our algorithm to sudden bottlenecks caused by spikes in client demands for file transfer.

Controllable Network

In the case of a controllable network, like a circuit-switched optical network, the topology is indeed known to the clients and clusters—and indeed an entity can reserve resources on the optical network to guarantee a certain level of throughput for any given data flow. Having control over the network itself gives rise to the joint optimization problem of selecting both the cluster and a path to that cluster for a data flow. In EDGE Lab at Princeton University, we are developing a general joint-optimization problem for dealing with flows that are characterized by priorities, utilities, deadlines, and type of deadline to deal with the variety of jobs that may pass through such a network.

Y. Xiang, B. Balasubramanian, M. Wang, T. Lan, S. Sen, M. Chiang “Self-Adaptive, Deadline-aware Resource Control in Cloud Computing”, submitted to Workshop on Adaptive Host and Network Security, IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO), 2013.

S. Wagner, E. van den Berg, J. Giacopelli, A. Ghetie, J. Burns, M. Tauil, S. Sen, M. Wang, M. Chiang, T. Lan, R. Laddaga, P. Robertson, P. Manghwani. Autonomous, Collaborative Control for Resilient Cyber Defense (ACCORD), Workshop on Adaptive Host and Network Security, SASO, 2012.