The 3 core components of the Apache Software Foundation’s Hadoop framework are:
1. MapReduce – A software programming model for processing large sets of data in parallel
2. HDFS – The Java-based distributed file system that can store all kinds of data without prior organization.
3. YARN – A resource management framework for scheduling and handling resource requests from distributed applications.
In this blog we’ll take a shallow dive into the Hadoop Distributed File System and its significance and contribution in providing sturdiness to the Data residing on the Hadoop framework.
HDFS is the storage sheath of Hadoop. It takes care of storing data of petabyte scale.
Saturation makes it necessary to think laterally and marches towards scaling.
As, and when data, grows vigorously, it is constantly challenging the human perception of building and stacking data storage in the “vertical” form (i.e. accommodating data growth only on a single machine, the concept of “scaling up” was facing chronic saturation.)
What happens next?
It’s necessary to build a system which could run discreetly on multiple networked computers and the design of the “file system” is such that it gives an impression as if the system is on a unified single file system in the exterior.
So if the problem is that data is too big to store in one computer, then the solution is to store Data on multiple computers.
Now, as there is a need for a cluster of computers, conscious efforts should be taken for the “system” to be cost-effective; “enter commodity hardware”, relatively cheap in comparison with expensive traditional machines but equally sturdy and robust – “performant server class machines.”
Now, how do we counter, manage and contain hardware failure?
The counter approach is to build intelligence into the software which would look over the hardware, so the “cluster software” will be smart enough to handle hardware failures. The software detects hardware failures and takes corrective actions automatically — without human intervention – the conception for the thought of Heartbeat and High Availability.
But here, still, hardware failure is inevitable, what about data loss?
Now we have a network of machines serving as a storage layer and data is spread out all over the nodes. What happens when a node fails?
The approach could be to make multiple copies of this data and store them on different machines. So even if one node goes down, other nodes will have the data intact — yes, “Data Replication.”
Now, there’s the need to ceremoniously godfather the data Nodes; the Master who would pull the right strings at the right time.
A Master node is elected to govern and manage the worker nodes eventually simplifying the functional architecture, design and implementation of the system.
Generic file systems, say Linux EXT file systems, will store files of varying size, from a few bytes to few gigabytes. HDFS, however, is designed to store large files. Large, as in a few hundred megabytes to a few gigabytes.
HDFS was built to work with mechanical disk drives, whose capacity has grown up in recent years. However, seek times haven’t improved much. So Hadoop by design tries to minimize and avoid disk seeks.
Files are write-once only.
HDFS supports writing files once (they cannot be updated.) This is the stark difference between HDFS and a “generic file system, like a Linux file system. Generic file systems allows files to be modified. However, appending to a file is supported.
What are your thoughts? We’d love to hear from you.
The “ORA-12154: TNS:could not resolve the connect identifier specified” Oracle error is a commonly seen message for database administrators.
Which RAID should you use with SQL Server? Learn the differences between RAID 0, RAID 1, RAID 5, and RAID 10, along with best practices.