Monday, December 15, 2014

Introduction to Hadoop

Why Hadoop?

Did you know?
       Facebook produces 10TB of data every day.
       The New York Stock Exchange generates about 1TB of new trade data per day.
       E-commerce applications produce GBs of data every day.

If we have 1TB of data on a hard disk with a transfer speed of around 100 MB/s, it takes more than two and a half hours to read all the data off the disk.
This is a long time to read all the data on a single drive, and processing is even slower. To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward using all the available hardware threads on a machine.
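The arithmetic behind this claim, plus the speedup from reading in parallel, can be checked with a short sketch (the 100 MB/s transfer rate is the figure assumed above):

```python
# Time to read 1 TB sequentially from a single disk at 100 MB/s.
TB = 10**12          # bytes
MBPS = 100 * 10**6   # bytes per second

single_disk_seconds = TB / MBPS               # 10,000 seconds
single_disk_hours = single_disk_seconds / 3600

# With 100 disks each reading 1% of the data in parallel:
parallel_seconds = single_disk_seconds / 100

print(round(single_disk_hours, 1))  # 2.8 (hours)
print(parallel_seconds)             # 100.0 (seconds)
```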

There are a few problems with this, however.
First, dividing the work into equal-size pieces isn’t always easy or obvious. In this case, the input file sizes may vary widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may need further processing. In this case, each result is independent of the others, and the results may be combined simply by concatenating them. With the fixed-size chunk approach, the combination is more delicate.

Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that’s it: you can’t make it go faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability. Who runs the overall job? How do we deal with failed processes?

Overall, the first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication, and this is the approach taken by Hadoop's storage layer, HDFS.

The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other disks.

So, although it is feasible to parallelize the processing, in practice it is messy. Using a framework like Hadoop to take care of these issues is a great help.

Introduction
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
·         Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
·         Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
·         Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
·         Simple—Hadoop allows users to quickly write efficient parallel code.

Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs.

 



Fig.: A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.

As you can see, a Hadoop cluster is a set of commodity machines networked together in one location.
Data storage and processing all occur within this “cloud” of machines.

Hadoop Core concepts
·         Distribute data initially
o   Let nodes (a node is simply a single machine in a Hadoop cluster) work on data stored locally.
o   Minimize data transfer over network.
o   Replicate data multiple times for increased availability.
·         Write applications at a high level
o   Programmers should not have to worry about network programming, temporal dependencies, low level infrastructure, etc.
·         Minimize talking between nodes (share-nothing).

Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS, and the analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

Hadoop – a high-level view

·         Data is split into blocks (64MB/128MB)
·         Map tasks work on small portions of data (e.g., one block)
·         A master program allots work to nodes
o   If a node fails, the master program reassigns the work to an alternate node, with no disruption to the other nodes.
o   If a node slows down, the master program may start a duplicate task on another node (speculative execution).
o   If a failed node restarts, the master program starts using it again, without involving the other nodes.
·         To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job that can run on a cluster of machines.
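To make that concrete, here is a minimal in-memory sketch of a word-count MapReduce job (plain Python standing in for the Hadoop Java MapReduce API; the shuffle-and-sort step is simulated with a sort and group):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle and sort: group the intermediate pairs by key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in grouped}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real cluster, each map task runs against one block of input on the node storing that block, and the framework performs the shuffle across the network.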


Hadoop cluster
A cluster is a set of machines running HDFS and MapReduce.



HDFS (Hadoop Distributed File System):

·         Data is split into blocks and distributed across multiple nodes in the cluster.
o   Each block is typically 64MB or 128MB
·         Each block is replicated multiple times
o   Default is to replicate each block three times
o   Replicas are stored on different nodes
o   This ensures both reliability and availability
·         Written in Java, based on Google File System (GFS).
·         Performs best with a “modest” number of large files
o   Millions (not billions) of files, each typically more than 100MB
·         Does not allow random writes, and performs poorly at random reads.
o   Optimized for large, streaming reads and write-once usage.
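The block-splitting and replica-placement behaviour described above can be sketched as follows. The node names are hypothetical, and real HDFS placement is rack-aware; this only illustrates "split into blocks, replicas on different nodes":

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128MB, one common HDFS block size
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size):
    """Number of blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size / BLOCK_SIZE)

def place_replicas(block_id, nodes, replication=REPLICATION):
    """Toy placement: pick `replication` distinct nodes, round-robin.
    Real HDFS also spreads replicas across racks."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical DataNodes
file_size = 300 * 1024 * 1024                 # a 300MB file
blocks = split_into_blocks(file_size)
print(blocks, place_replicas(0, nodes))  # 3 ['node1', 'node2', 'node3']
```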

More on HDFS NameNodes
·          The NameNode daemon must be running at all times.
o   If the NameNode stops, the cluster becomes inaccessible.
·         The NameNode holds all of its metadata in RAM for fast access.
o   It keeps a record of changes on disk for crash recovery.
·         A separate daemon, Secondary NameNode, takes care of some housekeeping tasks for the NameNode.
o   The Secondary NameNode is not only a backup NameNode!
·         When a client application wants to read a file:
o   It first communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on.
o   It then communicates directly with the DataNodes to read the data, using the Java API.
o   Because the data itself never flows through it, the NameNode does not become a bottleneck.
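This read path can be sketched as two steps: a metadata lookup at the "NameNode", then direct reads from the "DataNodes". The file path, node names, and in-memory tables below are hypothetical stand-ins for illustration:

```python
# Hypothetical metadata, standing in for the NameNode's in-RAM tables.
namenode_metadata = {
    "/logs/trades.log": [                # file -> ordered list of blocks
        ("blk_1", ["node1", "node2", "node3"]),
        ("blk_2", ["node2", "node3", "node4"]),
    ],
}
datanode_storage = {                     # node -> {block_id: contents}
    "node1": {"blk_1": b"first half "},
    "node2": {"blk_1": b"first half ", "blk_2": b"second half"},
    "node3": {"blk_1": b"first half ", "blk_2": b"second half"},
    "node4": {"blk_2": b"second half"},
}

def read_file(path):
    """Step 1: ask the 'NameNode' which blocks make up the file and where
    they live. Step 2: fetch each block directly from one of its DataNodes."""
    data = b""
    for block_id, locations in namenode_metadata[path]:
        node = locations[0]  # a real client prefers the closest replica
        data += datanode_storage[node][block_id]
    return data

print(read_file("/logs/trades.log"))  # b'first half second half'
```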

Data Pipelining
·         The client retrieves a list of DataNodes on which to place replicas of a block.
·         The client writes the block to the first DataNode.
·         The first DataNode forwards the data to the next DataNode in the pipeline.
·         When all replicas are written, the client moves on to write the next block in the file.
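A rough model of this pipeline, with each DataNode storing its replica and handing the data to the next node in line (node names hypothetical):

```python
def write_block(block_id, data, pipeline, storage):
    """Pipelined write: each DataNode stores its replica, then forwards
    the data to the next DataNode in the pipeline."""
    if not pipeline:
        return
    first, rest = pipeline[0], pipeline[1:]
    storage.setdefault(first, {})[block_id] = data   # store the local replica
    write_block(block_id, data, rest, storage)       # forward downstream

storage = {}  # node -> {block_id: data}
write_block("blk_7", b"payload", ["node1", "node2", "node3"], storage)
print(sorted(storage))  # ['node1', 'node2', 'node3']
```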
MapReduce
·         A programming method to distribute a task among multiple nodes.
·         Each node processes only data stored on that node, as much as possible.
·         Abstracts all the housekeeping away from the programmer.
·         Between the Map & Reduce steps, there is a “shuffle and sort” step.
JobTracker & TaskTracker
·         All MapReduce jobs on a cluster are controlled by a software daemon called the JobTracker, which resides on a “master node”.
·         Clients submit MapReduce jobs to the JobTracker.
·         JobTracker assigns Map and Reduce tasks to other nodes on the cluster.
·         These nodes each run a software daemon known as the TaskTracker.
·         The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker.
Map-Reduce: Physical Flow


Hadoop Components
·         Core Components
o   Hadoop Distributed File System (HDFS)
o   MapReduce
·         Other components (ecosystem)
o   Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.




Comparing SQL databases and Hadoop
Given that Hadoop is a framework for processing data, what makes it better than standard relational databases?
One reason is that SQL (structured query language) is by design targeted at structured data. Many of Hadoop’s initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL. For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language which
can be implemented on top of Hadoop as the execution engine. But in practice, SQL databases tend to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch to the requirements that Hadoop targets.
 SCALE-OUT INSTEAD OF SCALE-UP
Scaling commercial relational databases is expensive. Their design is more friendly to scaling up: to run a bigger database you need to buy a bigger machine. In fact, it’s not unusual to see server vendors market their expensive high-end machines as “database-class servers.” Unfortunately, at some point there won’t be a big enough machine available for the larger data sets. More importantly, the high-end machines are not cost effective for many applications. For example, a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. Hadoop clusters with tens to hundreds of machines are standard. In fact, other than for development purposes, there’s no reason to run Hadoop on a single server.
 KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES
A fundamental tenet of relational databases is that data resides in tables having relational structure defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don’t fit well into this model. Text documents, images, and XML files are popular examples. Also, large data sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
 FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code. MapReduce allows you to process data in a more general fashion than SQL queries.
In fact, some higher-level tools in the Hadoop ecosystem (such as Hive) enable you to write queries in a SQL-like language, and your query is automatically compiled into MapReduce code for execution.
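For illustration, here is the same aggregation expressed both ways: declaratively as SQL (in a comment) and as explicit map and reduce steps. The visits data and column names are made up for this sketch:

```python
from collections import defaultdict

# Declarative (SQL): the engine plans the execution for you.
#   SELECT url, COUNT(*) FROM visits GROUP BY url;
#
# MapReduce: we spell out the processing steps ourselves.
visits = [("/home", "1.2.3.4"), ("/about", "5.6.7.8"), ("/home", "9.9.9.9")]

def map_fn(record):
    url, _ip = record
    return (url, 1)                  # emit one (key, value) pair per visit

def reduce_fn(pairs):
    counts = defaultdict(int)
    for url, n in pairs:             # the framework would group by key first
        counts[url] += n
    return dict(counts)

print(reduce_fn(map(map_fn, visits)))  # {'/home': 2, '/about': 1}
```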
 OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS

Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t work for random reading and writing of a few records, which is the type of workload handled by online transaction processing systems. In fact, as of this writing (and for the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this respect it’s similar to data warehouses in the SQL world.
