Apache™ Hadoop® is a distributed, highly scalable framework for storing and processing very large data sets across hundreds to thousands of computing nodes that operate in parallel.
Hadoop was created by Doug Cutting, who named it after his son's toy elephant; the toy also inspired the project's elephant logo.
1. The two main concepts associated with Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce processing engine.
2. HDFS provides the storage, whereas MapReduce executes the program.
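The split between storage and processing described above can be illustrated with a minimal, in-memory sketch of the MapReduce idea. This is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real job would read its input from HDFS and run the phases across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input standing in for lines of a file stored in HDFS.
counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

Because each call to `map_phase` depends only on its own line, the map step could in principle run on whichever node holds that line, which is the property that lets the processing scale with the cluster.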
BENEFITS
Organizations use Hadoop for its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.
· Scalability and Performance – Distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
· Reliability – Large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
· Flexibility – Unlike traditional relational database management systems, you don’t have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse it and apply a schema when the data is read.
· Low Cost – Unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.
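The schema-on-read idea from the Flexibility point can be sketched in a few lines of Python. The sample records, the field names (`id`, `name`, `age`) and the `read_with_schema` helper are hypothetical and only stand in for raw files kept in HDFS; the point is that the data is stored as-is and a schema is applied only when it is read.

```python
import csv
import io

# Raw text stored without any declared schema (hypothetical sample data).
RAW_RECORDS = "1,alice,34\n2,bob,not-a-number\n3,carol,29\n"


def read_with_schema(raw_text):
    """Apply an (id: int, name: str, age: int) schema while reading."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        try:
            rows.append(
                {"id": int(record[0]), "name": record[1], "age": int(record[2])}
            )
        except (ValueError, IndexError):
            # The record does not fit this schema; skip it at read time.
            continue
    return rows


people = read_with_schema(RAW_RECORDS)
print(people)
```

A different reader could apply another schema to the same stored text, for example treating the third field as a string, without rewriting the data.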
Some of the differences between an RDBMS and Hadoop are listed below.
# | RDBMS | Hadoop
1 | Need to model your data | No need to model your data
2 | Schema on write | Schema on read
3 | Suited to OLTP | Suited to batch processing jobs (OLAP)
4 | Structured data | All varieties of data
5 | Downtime is needed to do any kind of maintenance on storage or data files | No downtime is required to add storage
6 | In standalone database systems such as DB2, Oracle, and SQL Server, adding processing power (more CPU or physical memory) in a non-virtualized environment requires downtime | Hadoop systems consist of individual, independent nodes that can be added on an as-needed basis