What is Hadoop?
by Dan Power
Data-driven DSS may need to access and process very large data sets to support decision-making. One way to provide this capability is with Hadoop. Apache Hadoop is an open-source Java framework for processing, storing, and querying large amounts of data distributed across clusters of commodity hardware. Hadoop is a top-level Apache project that Yahoo! initiated. The Hadoop project (http://hadoop.apache.org/) develops open-source software for reliable, scalable, distributed computing.
According to the project webpage, the Apache Hadoop software library is "a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures."
Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce (http://www.cloudera.com/what-is-hadoop/).
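The MapReduce idea can be illustrated outside Hadoop itself. The following is a minimal word-count sketch in plain Python (not Hadoop code; the function names are illustrative) showing the three phases a Hadoop job goes through: map, shuffle, and reduce.

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop mapper would for its input split.
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key; Hadoop performs this step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster, each document (or file block) would be mapped on a different node in parallel, and the framework, not the programmer, handles the shuffle and any node failures.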
Hadoop is a family of open-source products and technologies under the Apache Software Foundation (ASF). The Apache Hadoop library includes: the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and other applications. You can combine these applications in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) are a useful technology stack for applications in business intelligence, data warehousing, and analytics.
In an interview with Doug Cutting, creator of the Hadoop framework, Jaikumar Vijayan for Computerworld (11/7/2011) asked a number of relevant questions:
How would you describe Hadoop to a CIO or a CFO? Why should enterprises care?
Cutting: "At a really simple level it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren't practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, 'How should we price things?' and 'What should we be selling now?' and 'How should we advertise?' It is not only about having data for longer periods but also richer data about any given period, as well."
What are Hive and Pig? Why should enterprises know about these projects?
Cutting: "Hive gives you [a way] to query data that is stored in Hadoop. A lot of people are used to using SQL and so, for some applications, it's a very useful tool. Pig is a different language. It is not SQL. It is an imperative data flow language. It is an alternate way to do higher level programming of Hadoop clusters. There is also HBase, if you want to have real time [analysis] as opposed to batch. There is a whole ecosystem of projects that have grown up around Hadoop and that are continuing to grow. Hadoop is the kernel of a distributed operating system and all the other components around the kernel are now arriving on the stage. Pig and Hive are good examples of those kinds of things. Nobody we know of uses just Hadoop. They use several of these other tools on top as well."
The Hadoop Distributed File System (HDFS) is a distributed file system that hides the complexity of distributed storage and redundancy from the programmer (cf. Vogel, 2010). From a more technical perspective, Hive provides data summarization and ad hoc querying. Pig is a high-level data-flow language for parallel computing. Mahout is a machine learning and data mining library. Hadoop has many subprojects.
According to the Yahoo! Hadoop tutorial, "Performing large-scale computation is difficult. To work with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. ... What makes Hadoop unique is its simplified programming model which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines and in turn utilizing the underlying parallelism of the CPU cores. In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition to this each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable."
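The splitting and replication the tutorial describes can be sketched with a toy example. This is plain Python, illustrative only: real HDFS uses a large block size (64 MB or 128 MB) and a NameNode that tracks block placement, and its placement policy is rack-aware rather than round-robin.

```python
def place_blocks(file_bytes, block_size, nodes, replication=3):
    # Split the file into fixed-size blocks, as HDFS does on load.
    blocks = [file_bytes[i:i + block_size]
              for i in range(0, len(file_bytes), block_size)]
    placement = {}
    for idx, _block in enumerate(blocks):
        # Assign each block to `replication` distinct nodes (round-robin here;
        # HDFS uses a rack-aware policy).
        replicas = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = replicas
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(b"x" * 300, block_size=100, nodes=nodes)
print(placement)
```

With three replicas per block, losing any single node still leaves at least two copies of every block, which is why a single machine failure does not make data unavailable.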
Hadoop uses a computing strategy of "moving computation to the data to achieve high data locality which in turn results in high performance".
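"Moving computation to the data" can be sketched as a scheduling preference: given a map of which nodes hold which block, run each task on a free node that already stores its block, and only fall back to a remote node (paying a network transfer) when no local node is free. This toy Python sketch is not Hadoop's actual scheduler; all names are illustrative.

```python
def schedule(task_block, placement, free_nodes):
    # Prefer a free node that already stores the block (data-local);
    # otherwise fall back to any free node and copy the block over the network.
    local = [n for n in placement[task_block] if n in free_nodes]
    if local:
        return local[0], "data-local"
    return sorted(free_nodes)[0], "remote"

placement = {0: ["node1", "node2", "node3"]}
node, locality = schedule(0, placement, free_nodes={"node2", "node4"})
print(node, locality)
```

Because node2 already holds block 0, the task runs there and reads the data from local disk instead of the network, which is the source of the high performance the strategy delivers.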
A recent innovation in this technology space is a product called Hadapt that combines Hadoop with relational databases. According to Curt Monash, President of Monash Research and Editor of DBMS 2, "If you need to do investigative analytics on multi-structured data, Hadoop can be great for some steps in the process, while relational database management systems are best for other stages. ... Hadapt's approach to integrating Hadoop and RDBMS into a single analytic platform contains some very interesting capabilities."
Doug Henschen, in an InformationWeek cover story on Hadoop, concludes: "Once Hadoop is proven and mission critical, as it is at AOL, its use will be as routine and accepted as SQL and relational databases are today. It's the right tool for the job when scalability, flexibility, and affordability really matter. That's what all the Hadoopla is about" (p. 26).
Hadapt Press Release, "Hadapt announces early access for Hadapt 1.0, the first big data platform to combine Hadoop with relational databases," November 8, 2011, at URL http://dssresources.com/news/3402.php.
Henschen, D., "Why all the Hadoopla?" InformationWeek, November 14, 2011, issue 1,316, pp. 19-26.
Russom, P., "Busting 10 Myths about Hadoop," TDWI, March 20, 2012, at URL http://tdwi.org/articles/2012/03/20/Busting-10-Hadoop-Myths.aspx.
Vijayan, J., "Q&A: Hadoop creator expects surge in interest to continue: interview with Doug Cutting," http://www.computerworld.com, November 7, 2011.
Vogel, L., "Apache Hadoop Tutorial," v. 3, April 3, 2010, at URL http://www.vogella.de/articles/ApacheHadoop/article.html.
Yahoo! Hadoop Tutorial, at URL http://developer.yahoo.com/hadoop/tutorial/index.html.
Added May 26, 2013
Hadoop is "a way of storing enormous data sets across distributed clusters of servers and then running 'distributed' analysis applications in each cluster" (see http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works).
Last update: 2013-05-26 06:08
Author: Daniel Power