By the time you read this article, Hadoop has already reached version 3.0.
However, it is still important to know the history and how Hadoop evolved. This will help people working on migration projects from Hadoop 1.0 to Hadoop 2.0, and it will also help developers understand and consider future use cases while writing their programs.
In the previous post, we got a brief overview of what BigData is. In this article, we will see what Hadoop is.
So, what is Hadoop? A tool? An application? A package? A technology?
Hadoop is nothing but a framework for managing the storage and processing of large volumes and varieties of data in a distributed system. As I mentioned in my previous blog, BigData is a problem statement, and its two main challenges are storage and processing. The Hadoop framework provides a solution for both of these challenges: it handles storage with the help of HDFS and processing via MapReduce and other techniques. Let's look at them one by one.
1. Storage
So, we know that many types of data come under the term BigData. We can broadly classify them as structured (RDBMS tables), semi-structured (JSON, XML) and unstructured (logs, reports, junk content).
Hadoop uses HDFS (Hadoop Distributed File System) to store this data. Think of the file systems FAT32 and NTFS used in Windows; HDFS is also a file system, but one built to store large volumes of data in a distributed way. The data is split into blocks and stored across HDFS. If you want multiple copies of the data to exist in the network, you can do that by specifying a replication factor in the HDFS configuration.
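As a rough illustration, here is a minimal Java sketch using Hadoop's standard org.apache.hadoop.fs.FileSystem API to write a file into HDFS and control its replication factor. The path and replication values here are just examples, and in a real cluster most of this configuration would come from hdfs-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // In a real cluster, fs.defaultFS and dfs.replication usually come
        // from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // example: keep 3 copies of each block

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // The file is split into blocks and distributed by HDFS transparently.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello HDFS");
        }

        // The replication factor can also be changed after the file is written.
        fs.setReplication(path, (short) 2);
    }
}
```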
NameNode and DataNode
The physical data is stored on the DataNodes of the cluster. The metadata about the size and location of that data is maintained by the NameNode. So whenever a user request comes in to process the data, the NameNode serves it with the meta-information it holds about where the data lives.
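To see the NameNode's role, here is a small sketch that asks HDFS for the block locations of a file. This is purely a metadata query answered by the NameNode; no block data is actually read. The path is hypothetical and only meant for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // Metadata lookup served by the NameNode.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports which DataNodes hold a copy of it.
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
    }
}
```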
2. Processing
Now that we know how the data is stored in HDFS, we need to know how to process it for our operations.
For this, Hadoop uses a data processing paradigm called MapReduce. It has two stages, namely Map and Reduce. The mapper reads the input line by line and breaks it into small chunks of intermediate key-value data. The reducer then takes the mapper's output and processes it based on the aggregate operations requested by the user.
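The classic example is word counting. The sketch below, built on the standard org.apache.hadoop.mapreduce API, shows a mapper that emits a (word, 1) pair for every word in each input line and a reducer that sums those counts per word. The class names are illustrative, and a full job would also need a driver class to configure and submit it.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: read the input line by line and emit (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit each word with a count of 1
        }
    }
}

// Reduce stage: aggregate all counts emitted for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
}
```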
Please go through the slides below for a better understanding of the Hadoop architecture and the major differences between Hadoop 1.0 and Hadoop 2.0.
Also, check "What is BigData?" and "What is HDFS?"
References – https://en.wikipedia.org/wiki/Apache_Hadoop