Data Center Design for Machine Learning Applications
Artificial Intelligence (AI) and Machine Learning (ML) applications now constitute a significant part of data center workloads. These applications are widely used for business planning and user services. The increasing need to process larger data sets and serve more user requests forces us to build larger data centers. However, given the growing demand and the compute-hungry nature of these applications, our data centers will soon be too expensive to build and operate. This challenge compels us to rethink the way we build our data centers and the way we execute AI and ML applications.
Because most AI/ML applications are heavy memory consumers, our investigation of this data center design problem focuses on the memory system. More specifically, we are looking at two directions:
Software Performance Testing for Cloud
While low cost of ownership and flexibility have attracted businesses and users to migrate their applications, Internet services, and IT infrastructure to the Cloud, the Cloud's performance uncertainty leaves many potential users in doubt. This uncertainty makes it very difficult to determine whether Cloud services can meet a user's performance target. Even when the Cloud can meet a performance target, the uncertainty makes it very challenging to project the resource usage and cost of using Cloud services.
Performance testing is the traditional way to determine whether a system satisfies certain performance targets. However, Cloud services are provided to users as black boxes, i.e., users have limited control over the execution environments (this black-box approach is also one of the causes of performance uncertainty). If traditional performance testing is employed in the Cloud, extremely long tests (months) have to be conducted to explore the unknown execution environments extensively, which is cost prohibitive in practice. To reduce the testing cost, we are investigating how to incorporate knowledge of Cloud systems into performance testing. This knowledge helps users better control and interpret execution environments, which reduces the number of tests required. Additionally, to help users determine the Cloud resource configurations that minimize their costs, we also apply machine learning to Cloud testing. Machine learning may help reduce the search space of potentially cost-efficient resource configurations. This is a collaborative project with Dr. Pollock from UDel and Dr. Soffa from UVa.
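To illustrate why performance uncertainty lengthens testing, the following minimal sketch repeats a measurement until an approximate 95% confidence interval for the mean is narrow enough. This is a generic stopping rule used for illustration only, not the method developed in this project; the function names, the `measure` callback, and the 5% tolerance are all assumptions.

```python
import math
import statistics


def ci_half_width(samples, z=1.96):
    """Half-width of an approximate 95% confidence interval for the mean
    (normal approximation; z=1.96 is an assumed choice)."""
    return z * statistics.stdev(samples) / math.sqrt(len(samples))


def run_until_stable(measure, rel_error=0.05, min_runs=5, max_runs=1000):
    """Call the hypothetical `measure()` until the confidence interval
    half-width is within `rel_error` of the sample mean, or until
    `max_runs` measurements have been taken. Returns all samples."""
    samples = []
    while len(samples) < max_runs:
        samples.append(measure())
        if len(samples) >= min_runs:
            mean = statistics.fmean(samples)
            if ci_half_width(samples) <= rel_error * mean:
                break
    return samples
```

Under this rule, an environment whose measurements vary widely needs many more runs than a stable one before the estimate can be trusted, which is the cost that knowledge of the execution environment aims to reduce.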
Automatic Elasticity Management for Cloud Application
Elasticity describes the characteristic of Cloud Computing that lets users dynamically increase or decrease their resource usage based on workload demands. While elasticity is the key to achieving the Cloud's low cost of ownership, Cloud users are responsible for designing elasticity policies. However, properly designing elasticity policies is extremely challenging, even for power users.
A good elasticity policy requires two predictions: 1) a prediction of incoming workloads, so that a Cloud application can prepare in advance for future workload spikes or drops; 2) a prediction of the performance of Cloud resource configurations, so that a Cloud application can correctly choose the configuration with the lowest cost that still meets its performance goals. However, making these predictions accurately remains an open question.
It is worth noting that, for any deployed application, there is usually a history of past workloads, performance, and resource usage. This history may reveal valuable insights into application behavior, and thus help improve the accuracy of workload and performance predictions. Because this history is already available to Cloud providers, we believe Cloud providers can supply users with accurate predictors to help them design elasticity policies. In fact, given the large application pool and the histories of a wide range of applications, Cloud providers may be able to supply users with elasticity policies directly, completely relieving users of the burden of designing their own. Here, we combine knowledge of Cloud Computing, computer systems, and data mining to identify the types of historical data required for elasticity policy design, and investigate how to use this history to improve elasticity policies.
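The two predictions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's predictor: the moving-average forecast, the `Config` record, the per-configuration `capacity_rps` estimate, and the 20% headroom factor are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Config:
    """Hypothetical resource configuration."""
    name: str
    hourly_cost: float
    # Assumed output of a performance predictor: the request rate this
    # configuration can sustain while meeting the performance goal.
    capacity_rps: float


def forecast_workload(history, window=3):
    """Prediction 1: forecast the next workload as the mean of the last
    `window` observations (a simple stand-in for a real predictor)."""
    return fmean(history[-window:])


def choose_config(configs, predicted_rps, headroom=1.2):
    """Prediction 2 applied: return the cheapest configuration whose
    predicted capacity covers the forecast plus headroom, or None if
    no configuration is sufficient."""
    required = predicted_rps * headroom
    feasible = [c for c in configs if c.capacity_rps >= required]
    return min(feasible, key=lambda c: c.hourly_cost, default=None)


configs = [
    Config("small", 0.10, 100.0),
    Config("medium", 0.20, 200.0),
    Config("large", 0.40, 400.0),
]
predicted = forecast_workload([100, 120, 140])
print(choose_config(configs, predicted))
```

In this sketch, both the workload forecast and the capacity estimates would come from the historical data discussed above; the policy itself reduces to picking the cheapest feasible configuration.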