Title page for ETD etd-08282012-152556


Type of Document Dissertation
Author Wang, Guanying
Author's Email Address wanggy@vt.edu
URN etd-08282012-152556
Title Evaluating MapReduce System Performance: A Simulation Approach
Degree PhD
Department Computer Science
Advisory Committee
Advisor Name Title
Butt, Ali R. A. Committee Chair
Cameron, Kirk W. Committee Member
Feng, Wu-Chun Committee Member
Nikolopoulos, Dimitrios S. Committee Member
Pandey, Prashant Committee Member
Keywords
  • performance prediction
  • performance modeling
  • simulation
  • MapReduce
  • Hadoop
Date of Defense 2012-08-27
Availability unrestricted
Abstract
Scale of data generated and processed is exploding in the Big Data era. The MapReduce system popularized by open-source Hadoop is a powerful tool for the exploding data problem, and is widely employed in many areas involving large scale of data. In many circumstances, hypothetical MapReduce systems must be evaluated, e.g. to provision a new MapReduce system to provide certain performance goal, to upgrade a currently running system to meet increasing business demands, to evaluate novel network topology, new scheduling algorithms, or resource arrangement schemes. The traditional trial-and-error solution involves the time-consuming and costly process in which a real cluster is first built and then benchmarked. In this dissertation, we propose to simulate MapReduce systems and evaluate hypothetical MapReduce systems using simulation. This simulation approach offers significantly lower turn-around time and lower cost than experiments. Simulation cannot entirely replace experiments, but can be used as a preliminary step to reveal potential flaws and gain critical insights.

We studied MapReduce systems in detail and developed a comprehensive performance model for MapReduce, including sub-task phase level performance models for both map and reduce tasks and a model for resource contention between multiple processes running in concurrent. Based on the performance model, we developed a comprehensive simulator for MapReduce, MRPerf. MRPerf is the first full-featured MapReduce simulator. It supports both workload simulation and resource contention, and it still offers the most complete features among all MapReduce simulators to date. Using MRPerf, we conducted two case studies to evaluate scheduling algorithms in MapReduce and shared storage in MapReduce, without building real clusters.

Furthermore, in order to further integrate simulation and performance prediction into MapReduce systems and leverage predictions to improve system performance, we developed online prediction framework for MapReduce, which periodically runs simulations within a live Hadoop MapReduce system. The framework can predict task execution within a window in near future. These predictions can be used by other components in MapReduce systems in order to improve performance. Our results show that the framework can achieve high prediction accuracy and incurs negligible overhead. We present two potential use cases, prefetching and dynamic adapting scheduler.

Files
  Filename       Size       Approximate Download Time (Hours:Minutes:Seconds) 
 
 28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access 
  Wang_G_D_2012.pdf 1.14 Mb 00:05:15 00:02:42 00:02:22 00:01:11 00:00:06

Browse All Available ETDs by ( Author | Department )

dla home
etds imagebase journals news ereserve special collections
virgnia tech home contact dla university libraries

If you have questions or technical problems, please Contact DLA.