ApacheCon NA 2013

Portland, Oregon

February 26th – 28th, 2013


Tuesday 5:15 p.m.–6 p.m.

Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation.

Konstantin Shvachko, Jagane Sundar

Cloud Crowd


Cloud infrastructure is a flexible tool for orchestrating multiple Hadoop and HBase clusters while providing strict isolation of data and compute resources for multiple customers. Most importantly, our benchmarks show that a virtualized environment allows higher average utilization of per-node resources.


Hadoop in general, and HBase in particular, are known for low CPU usage. Specific workload patterns and internal restrictions usually prevent CPU usage from reaching its maximum when a single instance of the HBase server software is assigned to each node. We ran a series of benchmarks on a cluster built of virtual machines, where each RegionServer is assigned to its own VM, so that multiple RegionServers run on a single physical node but are isolated from each other by VM containers. The benchmarks are based on DFSIO (with random reads, a newly introduced functionality) for Hadoop and YCSB for HBase. We analyzed the performance of the cluster for various mixes of read/write workloads and cluster configurations. The presentation will report the main results and discuss the conclusions. We argue that a virtualized environment makes it possible to increase average node utilization and to stress the system with more load. Cloud infrastructure is also a flexible tool for orchestrating a pool of (virtual) clusters, which provide strict isolation of data and computational resources for a community of customers based on individual security and performance requirements.