2011 International Conference on Computer Science and Network Technology
Hadoop-HBase for Large-Scale Data
Mehul Nalin Vora
Innovation Labs, PERC, Tata Consultancy Services (TCS) Ltd., Mumbai, India
mehul.vora@tcs
Abstract — Today we are inundated with digital data, yet we remain poor at managing and processing it. It is becoming increasingly difficult to store and analyze data efficiently and economically with conventional database management tools. Moreover, the types of data appearing in databases are changing: nowadays, binary large objects are a standard, integral part of any database, and researchers all over the globe are grappling with the analysis of these ultra-large databases. Apache HBase is one such attempt. HBase is a noSQL distributed database developed on top of the Hadoop Distributed File System (HDFS). In this paper, we present an evaluation of a hybrid architecture in which HDFS holds the non-textual data, such as images, while the location of that data is stored in HBase. This hybrid architecture enables faster search and retrieval of the data, a growing need in any organization flooded with data. The paper evaluates the performance of random reads and random writes of data storage location information in HBase, combined with retrieving and storing the corresponding data in HDFS. We also present a comparative study of the HBase-HDFS architecture against a MySQL-HDFS architecture.

Keywords - large-scale data; distributed storage; Hadoop; HDFS; Map Reduce; HBase; noSQL database
I. INTRODUCTION

In the last couple of decades, we have witnessed a great revolution on the technology front and, during the same period, have been overwhelmed by the growth in the amount of electronic and digital data. Half a decade ago, the average size of a corporate database tended to be in the range of gigabytes (GB); now, multi-terabyte (TB) or even petabyte (PB) databases are becoming the norm. According to [1], Facebook receives about 12 TB of compressed digital data per day, and International Data Corporation (IDC) predicts that from 2010 to 2020, digital data will grow 44-fold to 35 zettabytes (ZB) per year [2]. As databases become increasingly large, enterprises face a new set of problems in the following areas:
• Information and Content Management - capturing, storing, managing, retrieving and processing large-scale data in an acceptable timeframe
• Business Analytics
• Performance Modeling - concurrency, throughput and response time of new applications

With the advent of web 2.0 (the social web) and at the dawn of web 3.0 (the semantic web), more and more Binary Large Objects (BLOBs) (images, audio, video, etc.) and Character Large Objects (CLOBs) are appearing in databases, which adds one more degree of complexity to identifying the actual content and thereby to processing it and recognizing patterns that actually make sense to the user. Various database management systems are in use today, and the majority of them rely on high-end hardware and / or special-purpose architectures to deliver the desired performance. The high capital expenditure for acquiring such infrastructure and the high licensing costs are becoming prohibitive in many cases in this competitive age. Tackling these large-data and socio-economic problems requires a distinct approach, one that sometimes runs counter to traditional models of storage and computing yet provides good scalability and the desired level of performance at little cost. A number of projects have been developed as alternatives to the traditional database system: Google's BigTable [3], Amazon's Dynamo [4], Apache's Cassandra [5], Hypertable [6], Apache's CouchDB [7], LinkedIn's Project Voldemort [8], MemcacheDB [9] and MongoDB [10], to name a few. The Apache Hadoop-based project HBase [11] is one such approach. HBase is a distributed, fault-tolerant, highly scalable, noSQL database, built on top of the Hadoop Distributed File System (HDFS) [12] [13], that provides a massively scalable and highly performant platform for dealing with heterogeneous data, including non-textual data types (BLOBs, CLOBs, etc.).

Here, in this paper, we describe the performance evaluation of a hybrid architecture in which HBase contains only the information regarding the data storage location, whereas the actual data, in the form of image files (non-textual data), is stored in HDFS. To evaluate random-read performance, we perform random reads from HBase to retrieve location information and then extract the actual data from HDFS; to evaluate random-write performance, a file is stored in HDFS first and then its location information is stored in HBase. We also present a comparison of the HBase-HDFS architecture with a MySQL-HDFS architecture. More about the choice of this hybrid architecture is explained in section III.

The rest of the paper is organized as follows: Section II describes the Apache Hadoop framework components, mainly the distributed storage (HDFS) and the analytical framework (Map/Reduce). It also highlights the important features of
HBase. Section III contains the experimental details and the performance evaluation study, followed by the conclusion in section IV.

II. HADOOP
Hadoop is an open-source, reliable and scalable architecture for large-scale distributed data storage and high-performance computation on a network of inexpensive commodity hardware [14]-[16]. Created by Doug Cutting in 2004, Hadoop became a top-level Apache Software Foundation project in January 2008. There have been many contributors, from both academia and industry, and Hadoop has received considerable attention from a rapidly growing user community. Developed in Java, the Hadoop framework comprises three major components:
• Hadoop Core - a set of utilities, such as file system, RPC and serialization libraries, that support the Hadoop subprojects
• HDFS - a scalable, fault-tolerant, distributed file system
• Map/Reduce - an analytical framework for high-performance distributed computing
The following sub-sections give a brief description of HDFS, Map/Reduce and HBase.

A. Hadoop Distributed File System (HDFS)
The philosophy behind distributed storage is simple: while storage capacities have increased massively over the decades, data access rates (seek times) have not improved comparably and have become the performance bottleneck for modern applications. The simplest way to reduce the effective seek time is to access the data from multiple disks at once, thereby improving the overall response time.

In HDFS, data is organized into files and directories. Files are divided into uniformly sized blocks that are distributed across cluster nodes, thereby removing any single-machine file size restriction. HDFS blocks are also significantly larger (64 MB by default) than the block sizes of standard file systems like ext3, to minimize the cost of seeks and thereby enhance application performance. HDFS adopts a master-slave architecture in which the NameNode (master) maintains the file namespace (metadata, directory structure, list of files, list of blocks for each file, location of each block, file attributes, authorization and authentication), while the DataNodes (slaves) create, delete or replicate the actual data blocks based on instructions received from the NameNode and report back periodically (heartbeat) with the complete list of blocks they are storing. A snapshot of the entire file system namespace and block map is kept in memory for faster access. Both NameNode and DataNode are designed to run on commodity hardware under a Linux operating system. A common problem of using a cluster of commodity hardware is a higher probability of hardware failure and thereby data loss.

The simplest solution to avoid data loss is a replication mechanism: redundant copies of the data are kept by the system so that, in the event of a failure, another copy is always available, improving reliability and making the system fault-tolerant. By default, HDFS stores three copies of each data block to ensure reliability, availability and fault-tolerance. The usual replication policy for a datacenter is to keep two replicas on machines in the same rack and one replica on a machine located in another rack. This policy limits inter-rack data traffic and makes HDFS resilient to two common failure scenarios: individual DataNode crashes and failures of networking equipment, such as a switch, that shut off an entire rack. To minimize data-access latency, HDFS reads from the nearest replica, so a replica hosted in the same rack is preferred.

In the most recent release of Hadoop as of this writing (release 0.20.203.0), HDFS is built for a write-once, read-many-times access pattern. The write operation in HDFS is expensive, as the NameNode has to create metadata for each block of a file and its replicas. There is also no support for multiple writers or for modifications at arbitrary offsets: any existing file under HDFS can be accessed in read or append mode only.
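As an illustration of this access model (not taken from the paper), the following minimal sketch uses the standard Hadoop FileSystem API to write a file once and read it back; the path and payload are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write once: create() with overwrite=false fails if the file
        // already exists; there is no random-offset update in HDFS.
        FSDataOutputStream out = fs.create(file, false);
        out.writeUTF("hello hdfs");
        out.close();

        // Read many times: open() returns a seekable input stream.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}
```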
B. Map / Reduce
Map/Reduce has its roots in functional programming: the map phase corresponds to the map operation, whereas the reduce phase corresponds to the fold operation. The fundamental idea behind the Map/Reduce framework is divide and conquer - partition a large problem into smaller sub-problems that are independent of one another, so they can be tackled in parallel by different slaves. Map/Reduce is a linearly scalable programming model. The programmer writes two functions, (a) a map function and (b) a reduce function, each of which defines a mapping from one set of key-value pairs to another.
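A minimal, self-contained illustration of these two functions (not part of this paper's evaluation code) is the classic word count, written against the Hadoop 0.20-era org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (byte offset, line of text) -> (word, 1)
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```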
The same functions can be used for a small database as well as for a large-scale one without any modification, since they are written in a fashion that is independent of the data size as well as of the cluster size. Upon submission of a job and its input data, the Map/Reduce framework divides the work into two phases, the map phase and the reduce phase, separated by a data transfer between nodes in the cluster. Each phase has key-value pairs as input and output, and the types of these can be chosen by the client. The data workflow for a Map/Reduce job is: Input → Map → Merge/Sort/Shuffle → Reduce → Output. In the first stage, Hadoop divides the input into fixed-size pieces called input splits, or just splits, and creates one map task for each split, which runs the user-defined map function for each record in the split. The map output is a set of records in the form of key-value pairs of the chosen types. The records for any given key - possibly spread across many nodes - are aggregated at the node running the reducer for that key, which involves data transfer between machines. The reduce stage produces another set of key-value pairs as the final output, based on the user-defined reduce function. Map/Reduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. This simple programming model, restricted to the use of key-value pairs, fits a surprisingly large number of diverse tasks and algorithms.
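The job that wires these phases together is configured by a short driver. A sketch under the same assumptions as the previous listing (the WordCountMapper and WordCountReducer classes defined there, plus hypothetical input and output paths) might look as follows:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Hadoop 0.20-era constructor

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; the output directory must not exist yet.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```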
Similar to HDFS, the Hadoop Map/Reduce framework also adopts a master/slave architecture: a JobTracker (master) and a number of TaskTrackers (slaves). The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job. If a task fails, the JobTracker can reschedule it on a different TaskTracker, thereby making the system fault-tolerant. Map/Reduce tries to co-locate each TaskTracker with a DataNode, so that data access is fast because it is local. This feature, known as data locality, is at the heart of Map/Reduce and is the reason for its good performance. Map/Reduce suits applications where the data is written once and read many times, and it is a good fit for problems that need to analyze the whole database in batch mode.

C. HBase
HBase, an Apache open-source project, is a distributed, fault-tolerant, highly scalable, column-oriented, noSQL database built on top of HDFS. HBase is used for real-time read/write random access to very large databases. From a logical point of view, data in HBase is organized in labeled tables. Each HBase table is stored as a multidimensional sparse map, with rows and columns; each row has a sorting key and an arbitrary number of columns. Table cells are versioned: by default, the version is a timestamp auto-assigned by HBase at the time of cell insertion, and a given column can hold several versions for the same row key. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. A cell's content is an uninterpreted array of bytes, uniquely identified by Table + Row-Key + Column-Family:Column + Timestamp.
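As an illustration of this addressing scheme (again, not taken from the paper), the sketch below uses the HBase 0.90-era Java client to create a table with one column family, write a cell and read it back by row key; the table, family, qualifier and row names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Column families must be declared at table-creation time;
        // individual columns can be added on the fly later.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("images");
        desc.addFamily(new HColumnDescriptor("meta"));
        admin.createTable(desc);

        // Write one cell: (table, row key, family:qualifier) -> value;
        // the timestamp is auto-assigned by HBase on insertion.
        HTable table = new HTable(conf, "images");
        Put put = new Put(Bytes.toBytes("img-00001"));
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("hdfsPath"),
                Bytes.toBytes("/data/images/img-00001.jpg"));
        table.put(put);

        // Random read by row key.
        Result result = table.get(new Get(Bytes.toBytes("img-00001")));
        byte[] path = result.getValue(Bytes.toBytes("meta"),
                                      Bytes.toBytes("hdfsPath"));
        System.out.println(Bytes.toString(path));
        table.close();
    }
}
```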
Table rows are sorted by row key, which is also a byte array and serves as the table's primary key. All table accesses go through the primary key, and any scan of an HBase table results in a Map/Reduce job; such a parallel scan yields faster query response times and better overall throughput. When creating a table it is necessary to specify its set of column families. Although new column families cannot be added to an existing table, columns within any pre-existing column family can be added to a table on the fly. Row updates are atomic, which allows reading and writing a row consistently at the same time; recent versions also allow locking several rows if that option has been explicitly activated. HBase tables are made up of several HDFS files and blocks, each of which is replicated by Hadoop. Tables are automatically partitioned horizontally by HBase into regions, each comprising a subset of the table's rows. A region is defined by its first row (inclusive), its last row (exclusive) and a randomly generated region identifier.

HBase also has some special catalog tables, named -ROOT- and .META., within which it maintains the current list, state, recent history and location of all regions. The -ROOT- table holds the list of .META. table regions, and the .META. table holds the list of all user-space regions. Similar to HDFS and Map/Reduce, HBase adopts a master/slave architecture: the HMaster (master) is responsible for assigning regions to HRegionServers (slaves) and for recovering from HRegionServer failures, while each HRegionServer is responsible for managing client requests for reading and writing. HBase uses ZooKeeper [17], another Hadoop subproject, for management of the HBase cluster. There is no support for the SQL query language in HBase; however, a Hive / HBase integration project [18] allows SQL statements written in Hive [19] to access HBase tables, and the HBql project [20] adds a dialect of SQL and JDBC bindings for HBase.

III. PERFORMANCE EVALUATION
A. Methodology
In this paper we present a systematic evaluation of the noSQL database HBase and compare its performance with a SQL database, MySQL. For the purpose of this performance evaluation study, we decided to work with image files. The actual data, in the form of image files, was stored in HDFS, whereas the metadata (the location of each file within HDFS) was stored in each of the two databases, HBase and MySQL. This data arrangement (metadata-data separation) was preferred over storing the actual data in the database as BLOBs, because every access to image data stored in MySQL results in serialization and de-serialization, which is inherently slow, leading to poor performance and an inability to scale as the data volume grows. Storing and searching only the metadata in the database is quicker and more efficient. To retrieve the metadata from a database and the corresponding image data from HDFS, we developed a custom Java API. This Java API has two main function calls (a sketch follows the list):
• Read - takes an image name as an argument and performs a random read from a database to retrieve the metadata for the image file. Based on this information, it retrieves the actual image file from HDFS.
• Write - takes an image file as an argument, saves the image file in HDFS and stores the location of this file (the metadata) in a database under a unique image name (random write).
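The paper does not list the source of this API; the sketch below is a plausible reconstruction of the HBase-HDFS variant, under the same assumptions as the earlier HBase listing (an "images" table whose meta:hdfsPath column holds each file's HDFS location), not the authors' actual implementation.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ImageStore {
    private static final byte[] FAMILY = Bytes.toBytes("meta");
    private static final byte[] QUALIFIER = Bytes.toBytes("hdfsPath");

    private final HTable metaTable;  // random reads/writes of metadata
    private final FileSystem fs;     // streaming reads/writes of image bytes

    public ImageStore(Configuration hadoopConf) throws IOException {
        this.metaTable = new HTable(HBaseConfiguration.create(), "images");
        this.fs = FileSystem.get(hadoopConf);
    }

    /** Random read: look up the HDFS location in HBase, then fetch the bytes. */
    public byte[] read(String imageName) throws IOException {
        Result r = metaTable.get(new Get(Bytes.toBytes(imageName)));
        Path location = new Path(Bytes.toString(r.getValue(FAMILY, QUALIFIER)));
        FSDataInputStream in = fs.open(location);
        try {
            byte[] buf = new byte[(int) fs.getFileStatus(location).getLen()];
            in.readFully(buf);
            return buf;
        } finally {
            in.close();
        }
    }

    /** Random write: store the bytes in HDFS first, then record the location. */
    public void write(String imageName, byte[] imageBytes) throws IOException {
        Path location = new Path("/data/images/" + imageName); // hypothetical layout
        FSDataOutputStream out = fs.create(location);
        out.write(imageBytes);
        out.close();
        Put put = new Put(Bytes.toBytes(imageName));
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(location.toString()));
        metaTable.put(put);
    }
}
```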
Response time and throughput were measured for each of the following scenarios, in which a set of concurrent users perform:
1) random read
2) random write
3) equal read and write (read:write 1:1)
4) heavy read (read:write 2:1)
5) heavy write (read:write 1:2)

B. Experimental Setup
Tests were conducted on a Hadoop-HBase cluster with one master and eight slaves. The hardware platform for the master was an Intel Itanium 2 processor with a 4-core CPU and 16 GB of RAM. All slaves were dual-core, 64-bit machines with 2 GB of RAM. The client ran on a 4-core machine with 16 GB of RAM under the SunOS 5.10 operating system. All machines in the Hadoop cluster ran a 64-bit Linux operating system and were connected to the same switch. To benchmark the performance, the latest stable releases of Hadoop and HBase, namely Hadoop-0.20.203.0 and HBase-0.90.4, were chosen. Daemons running on the master node included the NameNode, SecondaryNameNode, JobTracker, mysql, HMaster and HQuorumPeer (ZooKeeper); daemons running on each slave machine included the DataNode, TaskTracker and HRegionServer.

C. Results
As mentioned earlier, the purpose of this study was to evaluate HBase-HDFS performance for both random reads and random writes and to compare it with MySQL-HDFS performance under the same setup.
Figure 1. Response time and throughput
The tests were conducted on 2 million image files with an average size of about 50 KB, amounting to about 100 GB of total physical storage. Tests were repeated for 10, 50, 100, 500, 1000 and 2000 concurrent users for each of the scenarios listed above. Fig. 1 shows the response time and throughput in each case. It is evident from Fig. 1 that the response time and throughput of both setups (HBase-HDFS and MySQL-HDFS) were much better for read calls than for write calls, supporting the earlier observation that the write operation in HDFS is very expensive. A second observation is that for smaller numbers of concurrent users (up to 500), the performance of the HBase-HDFS setup was marginally better than that of MySQL-HDFS. As the user load increased, HBase was able to read and retrieve information quicker than MySQL and could serve more clients in the same amount of time. For the write operation, as well as for the combined read-write scenarios, HBase-HDFS performed better than MySQL-HDFS, although the difference was not as significant as for the read operation alone. Note that the tests were performed with only 2 million records; these initial results suggest that the difference could be more significant for larger data volumes. We also monitored resource utilization (CPU, memory, disk I/O and network bandwidth) during these studies to understand the system behavior. Drilling down on the resource usage patterns, we found that at peak load, across all machines in the Hadoop cluster, maximum CPU utilization was about 57.46%, memory consumption was about 99.98% and disk utilization for I/O was about 79.39%. Since reading and writing images are disk-bound operations, this leads to the inference that disk and memory were the most critical resources.

IV. CONCLUSION
In this digital era, Apache Hadoop, HBase and the other subprojects have a diverse and growing user community because of their scale-out approach using commodity hardware, rather than a scale-up approach using commercial, special-purpose hardware. The main advantage of this architecture is the reduction of cost. Based on the evaluation done here, it was observed that for smaller numbers of concurrent users, the performance of HBase-HDFS and MySQL-HDFS is on par; as the load increases with the number of users, HBase-HDFS performs better than MySQL-HDFS. Overall, the experimental results were in favor of the HBase approach. We conclude that HBase, as an open-source alternative to traditional database management systems, is a highly scalable, fault-tolerant, reliable, noSQL, distributed database that operates on a cluster of commodity machines to handle large data volumes.

REFERENCES
[1] hadoopblog.blogspot.com/2010/05/facebook-has-worlds-largest-hadoop.html
[2] gigaom.files.wordpress.com/2010/05/2010-digital-universe-iview_5-4-10.pdf
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data", Seventh Symposium on Operating System Design and Implementation (OSDI), Seattle, WA: Usenix Association, November 2006.
[4] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, "Dynamo: Amazon's Highly Available Key-Value Store", in Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
[5] Apache Cassandra homepage - cassandra.apache.org/
[6] Doug Judd, "Scale out with HyperTable", Linux Magazine, August 7th, 2008 (www.linux-mag.com/id/6645/)
[7] J. Chris Anderson, Noah Slater, Jan Lehnardt, "CouchDB: The Definitive Guide", 1st edition, O'Reilly Media, November 2009, ISBN 0596158165.
[8] LinkedIn Project Voldemort homepage - project-voldemort.com/
[9] Steve Chu, "Memcachedb: The Complete Guide" (memcachedb.org/memcachedb-guide-1.0.pdf)
[10] Kristina Chodorow, Michael Dirolf, "MongoDB: The Definitive Guide", 1st edition, O'Reilly Media, September 2010, ISBN 9781449381561.
[11] Lars George, "HBase: The Definitive Guide", 1st edition, O'Reilly Media, September 2011, ISBN 9781449396107.
[12] Apache Hadoop HDFS homepage - hadoop.apache.org/hdfs/
[13] Tom White, "Hadoop: The Definitive Guide", 1st edition, O'Reilly Media, June 2009, ISBN 9780596521974.
[14] Dorin Carstoiu, Elena Lepadatu, Mihai Gaspar, "Hbase - non SQL Database, Performances Evaluation", International Journal of Advancements in Computing Technology, Volume 2, Number 5, December 2010.
[15] Ronald C. Taylor, "An overview of the Hadoop / MapReduce / HBase framework and its current applications in bioinformatics", 11th Annual Bioinformatics Open Source Conference (BOSC) 2010, Boston, MA, USA, July 2010.
[16] Ankur Khetrapal, Vinay Ganesh, "HBase and Hypertable for large scale distributed storage systems" (cloud.pubs.dbs.uni-leipzig.de/node/46)
[17] Apache ZooKeeper homepage - zookeeper.apache.org/
[18] Hive / HBase integration - wiki.apache.org/hadoop/Hive/HBaseIntegration
[19] Apache Hive homepage - hive.apache.org/
[20] HBql homepage - www.hbql.com/