3 Core Components of the Hadoop Framework

Author: Anirudh Sunder | October 13, 2015

The 3 core components of the Apache Software Foundation’s Hadoop framework are:

1. MapReduce – A programming model for processing large sets of data in parallel.
2. HDFS – The Java-based distributed file system that can store all kinds of data without prior organization.
3. YARN – A resource management framework for scheduling and handling resource requests from distributed applications.

In this blog we’ll take a shallow dive into the Hadoop Distributed File System (HDFS), its significance, and its contribution to the durability of the data residing on the Hadoop framework.

HDFS is the storage layer of Hadoop. It takes care of storing data at petabyte scale.

As data grows relentlessly, it challenges the traditional approach of building and stacking storage in the “vertical” form, i.e. accommodating growth on a single machine. Scaling up eventually hits chronic saturation, which makes it necessary to think laterally and march toward scaling out.

What happens next?

It’s necessary to build a system that runs across multiple networked computers, with the file system designed so that, from the outside, it gives the impression of a single unified file system.

So if the problem is that the data is too big to store on one computer, then the solution is to store the data across multiple computers.
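To make that concrete, here is a minimal sketch using the Hadoop Java FileSystem API. The namenode host, port, and the /data/greeting.txt path are assumptions for illustration; a real address would normally come from the cluster’s core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // One handle, one namespace: the bytes still land on many machines,
        // but the client sees a single unified file system.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/greeting.txt"))) {
            out.writeBytes("Hello, HDFS\n");
        }
    }
}
```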

Now that a cluster of computers is needed, conscious effort should go into keeping the system cost-effective. Enter commodity hardware: relatively cheap compared with expensive traditional machines, yet sturdy and robust enough to serve as performant server-class machines.

Now, how do we counter, manage and contain hardware failure?

The counter approach is to build intelligence into the software that watches over the hardware: the cluster software is smart enough to detect hardware failures and take corrective action automatically, without human intervention. This is the thinking behind Heartbeat and High Availability.
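As a small illustration, the heartbeat cadence is an ordinary HDFS configuration property. The sketch below simply reads dfs.heartbeat.interval, the interval in seconds at which each worker node reports in; it assumes the cluster’s configuration files are on the classpath, and otherwise falls back to the stock default of 3 seconds.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigExample {
    public static void main(String[] args) {
        // Picks up core-site.xml / hdfs-site.xml if they are on the classpath.
        Configuration conf = new Configuration();

        // dfs.heartbeat.interval: how often (in seconds) each worker node
        // reports to the master that it is alive; 3 seconds is the default.
        long interval = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("Heartbeat interval: " + interval + "s");
    }
}
```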

But hardware failure is still inevitable. What about data loss?

Now we have a network of machines serving as a storage layer and data is spread out all over the nodes. What happens when a node fails?

The approach is to make multiple copies of the data and store them on different machines. So even if one node goes down, other nodes will still have the data intact: yes, “Data Replication.”
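A minimal sketch of replication in the Java API, reusing the hypothetical /data/greeting.txt path from above: setReplication asks the cluster to keep a given number of copies of the file’s blocks, and getFileStatus confirms the factor recorded for the file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/greeting.txt"); // hypothetical path from above

        // Ask the cluster to keep three copies of this file's blocks.
        fs.setReplication(file, (short) 3);

        // Confirm the replication factor recorded for the file.
        short factor = fs.getFileStatus(file).getReplication();
        System.out.println("Replication factor: " + factor);
    }
}
```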

HDFS Architecture

Now, there’s the need for someone to govern the data nodes: a master that pulls the right strings at the right time.

A master node is designated to govern and manage the worker nodes, which simplifies the functional architecture, design, and implementation of the system. In HDFS, this master is the NameNode, which holds the file system metadata, while the worker DataNodes store the actual blocks.
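The division of labor shows up in the client API. In the sketch below (same hypothetical file as before), the block-location query is answered from the NameNode’s metadata, reporting which DataNodes hold a replica of each block.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/greeting.txt"));

        // The NameNode answers this from its metadata alone: which
        // DataNodes hold a replica of each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block @ " + block.getOffset()
                    + " on " + String.join(", ", block.getHosts()));
        }
    }
}
```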

Generic file systems, say Linux EXT file systems, store files of varying sizes, from a few bytes to a few gigabytes. HDFS, however, is designed to store large files. Large, as in a few hundred megabytes to a few gigabytes.

HDFS was built to work with mechanical disk drives, whose capacity has grown considerably in recent years. Seek times, however, haven’t improved much, so Hadoop by design tries to minimize and avoid disk seeks.
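One consequence is HDFS’s unusually large block size: 128 MB by default in Hadoop 2.x, versus a few kilobytes for a typical Linux file system, so a file can be read as a few long sequential scans. A quick way to check is the sketch below; note that new Configuration() only picks up the cluster’s value if hdfs-site.xml is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Large blocks mean a file is read as a few long sequential
        // scans rather than many small seeks.
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: "
                + (blockSize / (1024 * 1024)) + " MB");
    }
}
```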

Files are write-once only.

HDFS supports writing a file once; it cannot be updated in place afterward. This is a stark difference between HDFS and a generic file system, like a Linux file system, which allows files to be modified. Appending to an existing file, however, is supported.
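A minimal sketch of the append path, assuming a pre-existing hypothetical file /data/events.log (append fails if the file does not exist, and the cluster must permit appends):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // No in-place updates in HDFS: the only way to add data to an
        // existing file is to append at its end. The file must already exist.
        Path log = new Path("/data/events.log"); // hypothetical path
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("another event\n");
        }
    }
}
```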

What are your thoughts? We’d love to hear from you.
