Introduction to HADOOP

Why HADOOP?
Do you know?
• Facebook produces 10TB of data every day.
• The New York Stock Exchange generates about 1TB of new trade data per day.
• E-commerce applications produce GBs of data every day.
If we have 1TB of data on a hard disk and the transfer speed
is around 100 MB/s, it takes more than two and a half hours to read all the
data off the disk.
This is a long time to read all data on a single
drive—and processing is even slower. To speed up the processing, we need to run
parts of the program in parallel. In theory, this is straightforward using all
the available hardware threads on a machine.
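The arithmetic behind that figure is worth checking. A minimal sketch (assuming decimal units, i.e. 1 TB = 1,000,000 MB):

```python
# Back-of-envelope check of the read time quoted above.
disk_size_mb = 1_000_000      # 1 TB expressed in MB (decimal units)
transfer_rate_mb_s = 100      # sustained transfer speed in MB/s

seconds = disk_size_mb / transfer_rate_mb_s
hours = seconds / 3600
print(f"{hours:.2f} hours to read the full disk")   # about 2.78 hours

# With 100 drives read in parallel, each holding 1/100 of the data:
parallel_minutes = seconds / 100 / 60
print(f"{parallel_minutes:.1f} minutes with 100 parallel drives")
```

This is exactly the motivation for spreading data across many disks and reading them in parallel.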
There are a few problems with this, however.
First, dividing the work into equal-size pieces isn’t
always easy or obvious. In this case, the file size may vary widely, so some
processes will finish much earlier than others. Even if they pick up further
work, the whole run is dominated by the longest file. A better approach is to
split the input into fixed-size chunks and assign each chunk to a process.
Second, combining the results from independent processes
may need further processing. In this case, each process's result is independent of the
others, so the results may be combined simply by concatenating them. If using the fixed-size
chunk approach, the combination is more delicate.
Third, you are still limited by the processing capacity
of a single machine. If the best time you can achieve is 20 minutes with the
number of processors you have, then that’s it. You can’t make it go faster.
Also, some datasets grow beyond the capacity of a single machine. When we start
using multiple machines, a whole host of other factors come into play, mainly
falling in the category of coordination and reliability. Who runs the overall
job? How do we deal with failed processes?
Overall, the first problem to solve is hardware
failure: as soon as you start using many pieces of hardware, the chance that
one will fail is fairly high. A common way of avoiding data loss is through
replication, which is what HDFS provides.
The second problem is that most analysis tasks need to
be able to combine the data in some way; data read from one disk may need to be
combined with the data from any of the other disks.
So, though it’s feasible to parallelize the processing across many machines, doing it reliably is hard.
Using a framework like Hadoop to take care of these issues is a great help.
Introduction
Hadoop is an open source
framework for writing and running distributed applications that process large
amounts of data. Distributed computing is a wide and varied field, but the key
distinctions of Hadoop are that it is
· Accessible—Hadoop
runs on large clusters of commodity machines or on cloud computing services
such as Amazon’s Elastic Compute Cloud (EC2).
· Robust—Because
it is intended to run on commodity hardware, Hadoop is architected with the
assumption of frequent hardware malfunctions. It can gracefully handle most
such failures.
· Scalable—Hadoop
scales linearly to handle larger data by adding more nodes to the cluster.
· Simple—Hadoop
allows users to quickly write efficient parallel code.
Hadoop’s accessibility and
simplicity give it an edge over writing and running large distributed programs.
Fig. A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.
As you can see, a Hadoop cluster is a set of commodity
machines networked together in one location.
Data storage and processing all occur within this
“cloud” of machines.
Hadoop Core concepts
· Distribute data initially
o Let processors/nodes (a node is simply a
single machine in a Hadoop cluster) work on local data.
o Minimize data transfer over network.
o Replicate data multiple times for increased
availability.
· Write applications at a high level
o Programmers should not have to worry about network
programming, temporal dependencies, low level infrastructure, etc.
·
Minimize talking between nodes
(share-nothing).
Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS, and analysis by MapReduce. There are
other parts to Hadoop, but these capabilities are its kernel.
Hadoop – A High-Level View
·
Data is split into blocks
(64MB/128MB)
·
Map tasks work on small portions
of data (e.g., one block)
·
A master program allots work to
nodes
o If a node fails, master program reassigns work to an
alternate node. No disturbance or communication to any other nodes.
o If a node slows down, master program may start a
parallel duplicate effort (speculative execution).
o If a failed node restarts, master program starts using
it, without any communication to other nodes.
· To take advantage of the parallel
processing that Hadoop provides, we need to express our query as a MapReduce
job that can run on a cluster of machines.
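The model described above can be sketched in a few lines. This is a single-process simulation of the MapReduce flow for a word-count query, not the real Hadoop API; in a real cluster, each map task would run on the node holding its block:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model for word count.
# Data flow: map -> shuffle and sort -> reduce.

def map_phase(line):
    # Emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key and sort by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Sum the counts for one word.
    return (key, sum(values))

blocks = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for block in blocks for pair in map_phase(block)]
results = [reduce_phase(k, v) for k, v in shuffle(mapped)]
print(results)  # [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

The programmer writes only the map and reduce functions; Hadoop supplies the shuffle, the distribution, and the failure handling.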
Hadoop cluster
A cluster is a set of machines running HDFS and MapReduce.
HDFS (Hadoop Distributed File System):
·
Data is split into blocks and
distributed across multiple nodes in the cluster.
o Each block is typically 64MB or 128MB
·
Each block is replicated
multiple times
o Default is to replicate each block three times
o Replicas are stored on different nodes
o This ensures both reliability and availability
·
Written in Java, based on Google
File System (GFS).
·
Performs best with a “modest”
number of large files
o Millions (not billions) of files, each typically more
than 100MB
·
Does not allow random writes,
and performs poorly at random reads.
o Optimized for large, streaming reads and write-once.
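The block-and-replica scheme above can be sketched as follows. The node names and the round-robin placement are made up for illustration; real HDFS placement is rack-aware:

```python
import math

BLOCK_MB = 128     # typical HDFS block size
REPLICAS = 3       # default replication factor

def split_into_blocks(file_size_mb, block_mb=BLOCK_MB):
    # Number of blocks a file occupies; the last block may be partial.
    return math.ceil(file_size_mb / block_mb)

def place_replicas(block_id, nodes, replicas=REPLICAS):
    # Toy placement: pick `replicas` distinct nodes round-robin.
    # This only illustrates "replicas live on different nodes".
    return [nodes[(block_id + i) % len(nodes)] for i in range(replicas)]

nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
n_blocks = split_into_blocks(300)              # a 300 MB file
print(n_blocks)                                # 3 blocks
for b in range(n_blocks):
    print(b, place_replicas(b, nodes))
```

Because each block lives on three different nodes, losing any single machine loses no data and no availability.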
More on HDFS NameNodes
· The NameNode daemon must be
running at all times.
o If the NameNode stops, the cluster becomes
inaccessible.
·
The NameNode holds all of its
metadata in RAM for fast access.
o It keeps a record of changes on disk for crash
recovery.
·
A separate daemon, Secondary
NameNode, takes care of some housekeeping tasks for the NameNode.
o Despite its name, the Secondary NameNode is not a backup NameNode!
·
When a client application wants
to read a file:
o It communicates with the NameNode to determine which
blocks make up the file, and which DataNodes those blocks reside on.
o It then communicates directly with the DataNodes to
read the data, using the Java API.
o The NameNode will not be a bottleneck.
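The read path just described can be sketched as a toy simulation. The file path, block IDs, and node names are made up; real clients use the Hadoop Java API:

```python
# Toy sketch of the HDFS read path: metadata from the NameNode,
# file data directly from the DataNodes.

namenode_metadata = {
    "/logs/day1.txt": {          # file -> {block: DataNodes holding it}
        "blk_1": ["node2", "node3", "node5"],
        "blk_2": ["node1", "node4", "node5"],
    }
}

datanode_storage = {             # what each DataNode holds locally
    "node2": {"blk_1": b"first 128MB..."},
    "node1": {"blk_2": b"second 128MB..."},
}

def read_file(path):
    # Step 1: ask the NameNode which blocks make up the file and
    # which DataNodes hold them (metadata only, no file data).
    blocks = namenode_metadata[path]
    data = b""
    # Step 2: fetch each block directly from a DataNode holding it,
    # so the NameNode never carries the file data itself.
    for block_id, holders in blocks.items():
        node = next(n for n in holders if n in datanode_storage)
        data += datanode_storage[node][block_id]
    return data

print(read_file("/logs/day1.txt"))
```

Since the NameNode answers only small metadata requests while the bulk data flows node-to-client, it does not become a bottleneck.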
Data Pipelining
·
Client retrieves a list of
DataNodes on which to place replicas of a block.
·
Client writes block to the first
DataNode.
·
The first DataNode forwards the
data to the next DataNode in the pipeline.
·
When all replicas are written,
the client moves on to write the next block in the file.
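The pipeline above can be sketched as a toy simulation. Node names and block contents are hypothetical; real HDFS streams packets and acknowledges them back up the chain:

```python
# Toy sketch of the write pipeline: the client sends each block only
# to the first DataNode, which forwards it down the pipeline.

storage = {"node1": [], "node2": [], "node3": []}

def write_block(block, pipeline):
    # Each DataNode stores the block locally, then "forwards" it to
    # the next node in the pipeline. The client only ever talks to
    # pipeline[0].
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    storage[head].append(block)
    write_block(block, rest)

for block in ["blk_1", "blk_2"]:
    # In real HDFS the NameNode picks a fresh pipeline per block.
    write_block(block, ["node1", "node2", "node3"])

print(storage)  # every node holds a replica of both blocks
```

Pipelining means the client's outbound bandwidth is spent only once per block, even though three replicas get written.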
MapReduce
·
A programming method to
distribute a task among multiple nodes.
·
Each node processes only data
stored on that node, as much as possible.
·
Abstracts all the housekeeping
away from the programmer.
·
Between the Map & Reduce
steps, there is a “shuffle and sort” step.
JobTracker & TaskTracker
·
All MapReduce tasks are
controlled by a software daemon called JobTracker. JobTracker resides on a
“master node”.
·
Clients submit MapReduce jobs to
the JobTracker.
·
JobTracker assigns Map and
Reduce tasks to other nodes on the cluster.
·
These nodes each run a software
daemon known as the TaskTracker.
·
The TaskTracker is responsible
for actually instantiating the Map or Reduce task, and reporting progress back
to the JobTracker.
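The JobTracker's assignment-and-reassignment behavior can be sketched in miniature. The tracker names and round-robin policy are made up; real Hadoop uses heartbeats and data locality to drive scheduling:

```python
# Toy sketch of JobTracker-style scheduling: tasks are assigned to
# TaskTrackers, and a failed tracker's tasks are reassigned to the
# survivors without touching tasks that were placed elsewhere.

def assign(tasks, trackers):
    # Round-robin map of task -> TaskTracker.
    return {t: trackers[i % len(trackers)] for i, t in enumerate(tasks)}

tasks = ["map_0", "map_1", "map_2", "reduce_0"]
assignment = assign(tasks, ["tracker_a", "tracker_b"])

# Suppose tracker_b fails: reassign only its tasks.
failed = "tracker_b"
survivors = ["tracker_a"]
for task, tracker in assignment.items():
    if tracker == failed:
        assignment[task] = survivors[0]

print(assignment)  # all four tasks now run on tracker_a
```

The key point is that recovery is local to the failed node's work: healthy TaskTrackers are never interrupted.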
Map-Reduce: Physical Flow
Hadoop Components
·
Core Components
o Hadoop Distributed File System (HDFS)
o MapReduce
· Other components (ecosystem)
o Pig, Hive, Hbase, Flume, Oozie, Sqoop, etc.
Comparing SQL databases and
Hadoop
Hadoop is a framework for processing data, so what makes it
better than standard relational databases?
One reason is that SQL (structured query language) is by
design targeted at structured data. Many of Hadoop’s initial applications deal
with unstructured data such as text. From this perspective Hadoop provides a
more general paradigm than SQL. For working only with structured data, the
comparison is more nuanced. In principle, SQL and Hadoop can be complementary,
as SQL is a query language which
can be implemented on top of Hadoop as the execution
engine. But in practice, SQL databases tend to refer to a whole set of legacy
technologies, with several dominant vendors, optimized for a historical set of
applications. Many of these existing commercial databases are a mismatch to the
requirements that Hadoop targets.
SCALE-OUT
INSTEAD OF SCALE-UP
Scaling commercial relational databases is expensive.
Their design is more friendly to scaling up. To run a bigger database you need
to buy a bigger machine. In fact, it’s not unusual to see server vendors market
their expensive high-end machines as “database-class servers.” Unfortunately,
at some point there won’t be a big enough machine available for the larger data
sets. More importantly, the high-end machines are not cost effective for many
applications. For example, a machine with four times the power of a standard PC
costs a lot more than putting four such PCs in a cluster. Hadoop is designed to
be a scale-out architecture operating on a cluster of commodity PC machines.
Adding more resources means adding more machines to the Hadoop cluster. Hadoop
clusters with tens to hundreds of machines are standard. In fact, other than for
development purposes, there’s no reason to run Hadoop on a
single server.
KEY/VALUE
PAIRS INSTEAD OF RELATIONAL TABLES
A fundamental tenet of relational databases is that data
resides in tables having a relational structure defined by a schema. Although
the relational model has great formal properties, many modern applications deal
with data types that don’t fit well into this model. Text documents, images,
and XML files are popular examples. Also, large data sets are often
unstructured or semistructured. Hadoop uses key/value pairs as its basic data
unit, which is flexible enough to work with the less-structured data types. In Hadoop,
data can originate in any form, but it eventually transforms into (key/value)
pairs for the processing functions to work on.
FUNCTIONAL
PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
SQL is fundamentally a high-level declarative language.
You query data by stating the result you want and let the database engine
figure out how to derive it. Under MapReduce you specify the actual steps in
processing the data, which is more analogous to an execution plan for a SQL
engine. Under SQL you have query statements; under MapReduce you have scripts
and code. MapReduce allows you to process data in a more general fashion than
SQL queries.
In fact, some ecosystem tools (such as Hive) enable you to write queries in a SQL-like
language, and your query is automatically compiled into MapReduce code for
execution.
OFFLINE
BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS
Hadoop is designed for offline processing
and analysis of large-scale data. It doesn’t work for random reading and writing
of a few records, which is the type of load for online transaction processing.
In fact, as of this writing (and in the foreseeable future), Hadoop is best
used as a write-once , read-many-times type of data store. In this aspect it’s
similar to data warehouses in the SQL world.