What is Hadoop?
Hadoop is an open-source project overseen by the Apache Software Foundation (http://hadoop.apache.org/). It is mostly used for reliable, scalable, distributed computing but can be also used as a general purpose file storage capable to keep petabytes of data. There is a large number of companies and organizations that use Hadoop for both research and production.
Hadoop consists of two core components:
- The Hadoop Distributed File System (HDFS). HDFS is responsible for storing data on the Hadoop cluster.
- The MapReduce. It is a system used for distributed computing and processing of large amounts of data in the Hadoop cluster.
There are many other subprojects based around core Hadoop such as Pig, Hive, HBase, etc
HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations:
- Data is distributed across many machines at load time
- HDFS is optimized for large, streaming reads of files rather than random reads.
- Files in HDFS are written once and no random writes to files are allowed
- Applications can read and write HDFS files directly via the Java API
MapReduce is a programming model and software framework for writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes:
- Provides automatic parallelization and distribution of tasks
- Has built-in fault-tolerance mechanisms
- Gives status and monitoring tools
- Provides a clean abstraction layer for programmers
Hadoop and Enterprise
There is a typical pattern of how RDBMS is used in large Enterprise systems:
- Interactive RDBMS serves queries from a Web site or other front end applications
- Data is later extracted and loaded into a Data warehouse for future processing and archiving
- Data is usually denormalized into an OLAP cube
Unfortunately, modern RDBMS cannot store all the data that large companies can generate and they have to come up with trade offs when data is either partially copied into RDBMS or purged after a certain amount of time. To address the problem of processing large amounts of data, Hadoop is used as an intermediate layer between Interactive database and Warehause:
- Processing power scales with data storage in contrast to large computers where storage can scale while processing power cannot
- With Hadoop, as you add more nodes to the storage, you get more processing power ‘for free’
- Hadoop can store and process multiple petabytes of data
Hadoop has a few serious limitations and cannot be used as an operational database:
- The fastest Hadoop job will still take several seconds to run
- Data stored on HDFS does not allow modifications
- Transactions are not supported by Hadoop
Hadoop vs. RDBMS
Relational Database Management Systems (RDBMS) have many strengths:
- Ability to handle complex transactions
- Ability to process hundreds or thousands of queries per second
- Real-time delivery of results
- Simple but powerful query language
There are some areas where RDBMS has disadvantages:
- Data schema is determined before data is imported
- Upper bound on data storage reaches hundreds of terabytes
- Practical upper bound on data in a single query is tens of terabytes
Hadoop vs. Storage Systems
Enterprise data is often held on large fileservers like NetApp, EMC, etc. that provide fast and random access to the data and can handle many concurrent clients. At the same time cost per terabyte of storage can be really high especially in cases when it is needed to store petabytes of data. Hadoop is a really good alternative when random access to data can be substituted by sequential reads and append only updates.
Hive and Pig
Although MapReduce is very powerful, it can also be complex to create and maintain while many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code. Also there are many other organizations that have programmers who are skilled at writing code in scripting languages. Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce.
Hive is a data warehouse infrastructure built on top of Hadoop and providing tools to enable easy data summarization, ad hoc querying and analysis of huge datasets:
- Can be used by people who know SQL
- Under the covers, generates MapReduce jobs that run on the Hadoop cluster
- Hive table definitions are built on top of data in HDFS
Pig is a platform for analyzing huge data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs:
- Relatively simple syntax
- Under the covers, scripts are turned into MapReduce jobs that run on the Hadoop cluster
HBase is a column-store database layered on top of HDFS that can store massive amounts of data ( from multiple gigabytes, up to petabytes of data). It is used when it is needed to have random, real-time read/write access to Big Data stored on the HDFS.
HBase has a constrained access model:
- Limited to lookup of a row by a single key
- Does not support transactions
- Provides single row operations only