Hadoop: From Toy Elephant to Jumbo Data Sets [A 101]

What is Hadoop?

Hadoop is an Apache project that excels at complex analysis and special-purpose computation of massive data sets in a distributed environment. It can scale from a single server to clusters of thousands of nodes, each offering local computation and storage, and can support petabytes (1 PB = a million GBs) of data. Rather than relying on hardware for availability, the system is designed to detect and handle failures at the application layer, delivering a highly available service on top of many computers, each of which is failure-prone.
The Apache project includes three subprojects, each of which is often confused with the project as a whole:

  • MapReduce: This is a software framework that helps you write applications that process very large data sets in parallel on large clusters of computers.
  • Distributed File System: HDFS is the storage system used by Hadoop applications. It replicates data blocks and distributes them across compute nodes, thereby providing high availability and high performance.
  • Common: This is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop Community.

Who uses Hadoop?

The largest user of Hadoop in the world is the largest social network, Facebook, which houses 30 PB of data and uses Hadoop to perform various kinds of analytics on it. Yahoo, the largest contributor to the Apache project, comes second, running 100,000 CPUs across 40,000 computers for its search queries. Other behemoths that use Hadoop are Amazon, Apple, eBay, HP, IBM, Microsoft, Netflix, and Twitter.

How does Hadoop work?

Hadoop creates clusters of inexpensive computers and distributes and coordinates work among them. The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
Hadoop HDFS distributes data across the nodes in a cluster, replicating the data so that even multiple node failures do not cause data loss. If a machine fails, Hadoop keeps the cluster operating by shifting work to the remaining machines. Every Hadoop-compatible filesystem is location aware: it knows which network switch a worker node sits behind. Applications can use this information to direct processing to the node where the data resides or, failing that, to a node on the same rack/switch, reducing backbone traffic. As a result, clusters are self-healing for both storage and computation without sysadmin intervention.
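The scheduling preference described above (data-local, then rack-local, then anywhere) can be illustrated with a toy sketch. This is not Hadoop's scheduler code; the function and variable names are made up for illustration:

```python
# Toy sketch of rack-aware task placement: prefer the node that holds the
# data block, then any free node on the same rack, then any free node.
def pick_node(block_holder, free_nodes, rack_of):
    """block_holder: node storing the data block.
    free_nodes: nodes with spare task slots.
    rack_of: mapping from node name to rack name."""
    if block_holder in free_nodes:
        return block_holder                          # data-local: no network copy
    same_rack = [n for n in free_nodes
                 if rack_of[n] == rack_of[block_holder]]
    if same_rack:
        return same_rack[0]                          # rack-local: stays off the backbone
    return free_nodes[0]                             # off-rack fallback

racks = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
print(pick_node("n1", ["n2", "n3"], racks))  # n2 (same rack as the data)
```

The same three-tier preference also drives where HDFS places block replicas, which is what makes data-local scheduling likely in the first place.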

The ‘map’ step of a MapReduce job splits the input data into independent chunks, which are passed to nodes that may split them further, creating a multi-level tree structure. All the map tasks can be processed in parallel. The ‘reduce’ step then collects the answers to these sub-problems, which in turn solicit answers from their own sub-problems. Each node of the tree passes its results back up a level, and the master node combines them to form the final output. Typically both the input and the output of the job are stored in a filesystem. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
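The map/reduce flow above can be sketched in a few lines with the classic word-count example. This is a single-process toy, not Hadoop's Java API; in a real job the mapped pairs would be shuffled across the network to many reducers running in parallel:

```python
# Toy, single-process sketch of the MapReduce word-count pattern.
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in an input chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct key.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["hadoop stores data", "hadoop processes data"]
# Each chunk's map task is independent, so these could run in parallel.
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(mapped)
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each `map_phase` call touches only its own chunk, the framework is free to schedule the maps on whichever nodes hold the data.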



What can Hadoop do for you?

Apache Hadoop is an ideal platform for consolidating existing, legacy data management solutions with new analyses and processing tools. It is used in a variety of verticals providing multiple applications:

  • E-tailing: recommendation engines and analytics, cross-channel analytics, and sales attribution.
  • Financial services: compliance and regulatory reporting, risk analysis and management, security analytics, credit scoring etc.
  • Retail/CPG: merchandising and market basket analysis, campaign management, supply-chain management and analytics.
  • Telecommunications: campaign management, call detail record analysis, network performance and optimization.
  • Web & Digital Media Services: large-scale click stream analytics, ad targeting, analysis, forecasting and optimization, social graph analysis and profile segmentation.

What is Hadoop not?

A database. Nor does it replace any existing data systems you have. Rather, it augments them by offloading the difficult problem of simultaneously ingesting, processing, and exporting large volumes of data, so that existing systems can focus on serving real-time transactional data or providing interactive business intelligence. Hadoop also works where the data is too big for a database and you have hit technical limits. With very large datasets, the cost of regenerating indexes is so high that you can't easily index changing data. With many machines trying to write to the database, you can't get locks on it. Here the idea of loosely related files in a distributed filesystem can work.

However, if you still want database-style queries on Hadoop, there is a project called HBase that adds a column-oriented database on top of it.

Have you ever considered using Hadoop? Or do you already use it? Do share your experience.

[If you are wondering what the title means, the famous trivia about Hadoop is that its creator, Doug Cutting, named it after his son's toy elephant.]

[img credit: wikipedia]