Efficient Resource Management for Datacenter Infrastructure

Driven by emerging applications such as big data analytics and machine learning, many datacenters (both private and public) have been built, each a very large-scale infrastructure with massive resources and high cost. Users expect their jobs to receive guaranteed (or better) application-level performance, while infrastructure operators (owners) want to reduce operating cost without violating service-level agreements (SLAs). The ultimate question we want to answer is how to efficiently manage datacenter resources so as to maximize both application performance and resource utilization. Our general idea is to break the physical (machine) boundary and disaggregate resources (CPU, memory, storage, and accelerators such as GPUs) for global management and scheduling.

Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Tiresias is an information-agnostic GPU cluster manager for distributed deep learning (DDL) training jobs. Existing solutions in production are derived from traditional schedulers for big data analytics. Because they ignore the new challenges posed by DDL jobs, such as unpredictable training times and the all-or-nothing execution model, they suffer from suboptimal performance and over-aggressive constraints. By applying a discretized two-dimensional Gittins index and Least-Attained Service (LAS) algorithm, Tiresias minimizes the average job completion time (JCT) with no or only partial job information. Moreover, Tiresias relies on model structure information captured by its RDMA network profiler to make good job placement decisions.
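The scheduling idea can be made concrete with a small sketch. Below is a minimal Python illustration of discretized two-dimensional LAS, not Tiresias's actual implementation: a job's attained service is the product of its GPU count and its executed time, jobs are demoted across a few priority queues as that product grows, and a job runs only when its full all-or-nothing GPU demand fits. The job fields (num_gpus, executed_time, arrival_time) and the queue thresholds are hypothetical placeholders, and the Gittins-index variant used when a JCT distribution is available is omitted.

    # Hypothetical queue boundaries in GPU-seconds; Tiresias's thresholds
    # are configurable, and these values are illustrative only.
    QUEUE_THRESHOLDS = [1_000, 10_000]  # 3 priority queues: Q0 (highest) .. Q2

    def attained_service(job):
        # Two-dimensional attained service: #GPUs x executed time.
        return job["num_gpus"] * job["executed_time"]

    def queue_index(job):
        # Demote a job to a lower-priority queue as its 2D service grows.
        service = attained_service(job)
        for k, threshold in enumerate(QUEUE_THRESHOLDS):
            if service < threshold:
                return k
        return len(QUEUE_THRESHOLDS)

    def schedule(jobs, total_gpus):
        # Highest-priority queue first, FIFO within a queue; respect the
        # all-or-nothing GPU demand of DDL jobs.
        running, free = [], total_gpus
        for job in sorted(jobs, key=lambda j: (queue_index(j), j["arrival_time"])):
            if job["num_gpus"] <= free:
                running.append(job)
                free -= job["num_gpus"]
        return running

Discretizing attained service into a handful of queues, rather than using a continuous LAS ordering, avoids the frequent preemptions (and costly DDL checkpoint/restore cycles) that fine-grained priority changes would trigger.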

Efficient Memory Disaggregation with Infiniswap
Infiniswap is a remote memory paging system designed specifically for low-latency RDMA networks. It opportunistically harvests and transparently exposes unused memory to unmodified applications by dividing the swap space of each machine into many slabs and distributing them across the remote memory of many machines. Because one-sided RDMA operations bypass remote CPUs, Infiniswap leverages the power of many choices to perform decentralized slab placement and eviction. Extensive benchmarks on memory-intensive workloads, ranging from in-memory databases such as VoltDB and Memcached to popular big data frameworks such as Apache Spark, PowerGraph, and GraphX, show that Infiniswap provides order-of-magnitude performance improvements when working sets do not completely fit in memory.
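The power-of-choices idea admits an equally small sketch. The following Python fragment is a sketch under assumed data structures, not Infiniswap's kernel-level code; it shows the two decentralized decisions: placing a new slab on the least-loaded of a few randomly probed peers, and batch-evicting slabs by sampling a few more candidates than needed and dropping the least-active ones. The peer_free_mem and activity maps are hypothetical stand-ins for state that Infiniswap's per-machine daemons track.

    import random

    def place_slab(peer_free_mem, probe_count=2):
        # Power of two choices: probe a few random peers for free memory
        # and place the slab on the least-loaded one. Probing, rather than
        # maintaining global state, keeps placement fully decentralized.
        candidates = random.sample(list(peer_free_mem),
                                   min(probe_count, len(peer_free_mem)))
        return max(candidates, key=lambda peer: peer_free_mem[peer])

    def evict_slabs(hosted_slabs, activity, num_needed, extra_probes=2):
        # Power of many choices: to free num_needed slabs under memory
        # pressure, sample num_needed + extra_probes candidates and evict
        # the least-active ones among them.
        sample = random.sample(hosted_slabs,
                               min(num_needed + extra_probes, len(hosted_slabs)))
        return sorted(sample, key=lambda slab: activity[slab])[:num_needed]

In Infiniswap these control decisions go through lightweight per-machine daemons, keeping remote CPUs off the data path, which uses one-sided RDMA reads and writes.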

Faculty

  • Mosharaf Chowdhury
  • Kang G. Shin

Graduate Students

  • Juncheng Gu
  • Youngmoon Lee
  • Yiwen Zhang


Publications

  • Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo, Tiresias: A GPU Cluster Manager for Distributed Deep Learning, in the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI '19), Boston, Massachusetts, USA, February 2019.
  • Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin, Decentralized Memory Disaggregation Over Low-Latency Networks, in USENIX ;login:, vol. 42, no. 4, pp. 42-48, December 2017.
  • Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin, Efficient Memory Disaggregation with Infiniswap, in the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI '17), Boston, Massachusetts, USA, March 2017.