What is HDFS? This is a common question for anyone who starts learning about Hadoop. HDFS deals with the way data is stored and managed by the Hadoop framework.
What is a Distributed File System?
A distributed file system manages data (files and folders) across multiple nodes or computers. It serves the same purpose as the file system provided by the OS on your PC; for example, Windows has NTFS and Mac has HFS. The difference is that a local file system stores files on a single machine, whereas a distributed file system stores your data in a cluster, which is simply a group of computers connected over a network.
The advantage of a distributed file system is that data can be processed on multiple nodes in parallel, instead of processing a huge chunk of data on a single machine.
What is HDFS?
HDFS stands for Hadoop Distributed File System. It lets you store huge datasets across a cluster of machines so that the data stored on each machine can be processed in parallel.
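From a client's point of view, HDFS is used much like any other file system. Below is a minimal sketch in Java using the Hadoop client API; the NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are hypothetical and would need to match your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; use your cluster's fs.defaultFS value.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // Write a small file; HDFS transparently splits large files into blocks
            // and spreads them across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }
            System.out.println("File exists: " + fs.exists(file));
        }
    }
}
```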
HDFS Architecture
HDFS uses a master/slave architecture: the master is a single NameNode that manages the file system metadata, and the slaves are one or more DataNodes.
The NameNode keeps track of files using two things:
- Namespace image – stores the metadata of the files and directories in HDFS, such as the size of each data block, which file it belongs to, and which DataNode it resides on.
- Edit log – a log of the operations performed on HDFS by clients. It keeps growing as new operations happen on HDFS.
- Together, these two form the FSImage, which holds all the metadata and the changes that have happened in HDFS (a sketch of reading some of this metadata through the client API follows this list).
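Much of the per-file metadata the NameNode serves (length, block size, replication factor, and which DataNodes hold each block) can be inspected through the client API. A minimal sketch, assuming a configured Hadoop client and a hypothetical file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");  // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            System.out.println("Length (bytes): " + status.getLen());
            System.out.println("Block size:     " + status.getBlockSize());
            System.out.println("Replication:    " + status.getReplication());

            // Each block's offset, length and the DataNodes holding its replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc);
            }
        }
    }
}
```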
DataNodes are the nodes that actually store the data. Each DataNode periodically sends a heartbeat to the NameNode to indicate that it is functioning properly. If the NameNode stops receiving heartbeats from a DataNode, it assumes the node is down and reports this to the administrator.
The moment a DataNode boots up, it sends its block details (a block report) to the NameNode, and the namespace is updated accordingly.
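Administrators usually check DataNode health with the hdfs dfsadmin -report command, but similar information is reachable from Java. A small sketch, assuming fs.defaultFS points at an HDFS cluster (otherwise the cast to DistributedFileSystem fails):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListLiveDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Only the HDFS-specific implementation exposes cluster-wide DataNode stats.
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName()
                        + "  capacity=" + dn.getCapacity()
                        + "  remaining=" + dn.getRemaining());
            }
        }
    }
}
```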
The advantage of this master/slave architecture is that we can add more storage on the fly by adding DataNodes. The process of adding a DataNode to an existing cluster is called “commissioning”; the opposite, removing a DataNode from the cluster, is called “decommissioning”.
How is data stored?
In HDFS, data is stored in the form of blocks. The number of blocks used to store a file depends on the size of the data.
Data blocks
A block is simply a unit of storage that holds a piece of a file. The default block size differs between the Hadoop 1 and Hadoop 2 frameworks: Hadoop 1 uses 64 MB blocks, whereas Hadoop 2 stores up to 128 MB of data in each block.
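For example, with the Hadoop 2 default of 128 MB, a 500 MB file (an assumed size, just for illustration) would occupy four blocks: three full 128 MB blocks plus one partially filled block. The arithmetic:

```java
public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // Hadoop 2 default block size (128 MB)
        long fileSize  = 500L * 1024 * 1024;   // example file of 500 MB

        // Number of blocks = file size divided by block size, rounded up.
        long fullBlocks  = fileSize / blockSize;
        long lastBlock   = fileSize % blockSize;             // bytes in the final, partial block
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        System.out.println(totalBlocks + " blocks; last block holds " + lastBlock + " bytes");
    }
}
```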
What is data replication?
HDFS handles fault tolerance through replication. Whenever a block of data comes into HDFS for storage, HDFS replicates it into multiple copies (the number is configurable). The copies are placed on different nodes so that even if a node in the cluster goes down, we can still access a replica and process the data.
The default replication factor is 3, which means three copies of every block are available across the cluster.
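The cluster-wide default is normally set through the dfs.replication property in hdfs-site.xml, and the factor can also be changed per file. A minimal sketch using the Java client API (the file path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files this client creates (clusters usually set this in hdfs-site.xml).
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");   // hypothetical path
            // Ask for an extra copy of this particular file; HDFS re-replicates in the background.
            fs.setReplication(file, (short) 4);
            System.out.println("Replication now: " + fs.getFileStatus(file).getReplication());
        }
    }
}
```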
An important thing about the Hadoop framework is that it is an open-source platform developed by the Apache community, and anyone can use it without procuring a license. The slides below try to capture the core concepts of the HDFS file system.