Specialized IT Services focused on Data Management | Speak with Us 877-634-9222
Hadoop & Big Data
Hadoop is an open-source software framework for storing and processing heavy data by distributing them on large clusters of dependable hardware. Essentially, the Hadoop framework was conceptualized to accomplish two tasks:
1. Massive data storage
2. Lightning fast processing
How was this framework named?
“Hadoop” was the name of a “Yellow Toy Elephant” owned by the son of one of its inventors.
Hadoop as a brand experienced euphoric and progressive success because of its ability to digest and adapt exponential change and encompass impacting sub-technologies within it, which eventually enhanced its purpose of existence.
With its ability to cost effectively store and process any kind of data (not just numerical or structured data), various organizations (enterprise level or not) used Hadoop to their advantage (i.e. Google, Yahoo.)
What necessitated Hadoop’s inception?
During the early 2000s, search results were returned by humans invariably. But as the number of web pages grew infectiously from hundreds to millions; automation became imperative. The necessity for Web crawlers and search engines which could automate searching with bare minimum human intervention were evident.
The project, which was first thought of was “Nutch” – an open-source web search engine and the “creative mainsprings” behind it were Doug Cutting and Mike Cafarella.
Their goal was to invent a way to return web search results faster by distributing data and calculations across different computers so that “concurrency” and “multi-level parallel processing” could be accomplished.
While efforts were being streamlined on Nutch there was a parallel mammoth effort being channelized on another search engine project called Google. It was based on the same thought process – storing and processing data in a distributed, automated way so that more relevant web search results could be returned faster with an impressive hit ratio. An important thing considered was “relevance” of the search results; just speed would not suffice.
In 2006, Doug Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The essence of the Nutch project tore into two:
1. The web crawler development effort remained as Nutch
2. The distributed computing and processing development effort became Hadoop
In 2008, Yahoo released Hadoop as an open-source project, and currently Hadoop’s framework and its cluster of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Hadoop is powerful and most sought after because of its:
Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
Computing power: Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.
Scalability: You can easily grow your system simply by adding more nodes. Little administration is required.
Storage flexibility: Unlike traditional relational databases, there is no necessity to “pre-process” data before storing it and that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later.
Self-healing capabilities: Data and application processing are inherently protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed architecture does not fail and by design it automatically backs up multiple copies of residing data.
Do you use Hadoop? What’s your main reasons for using it? Leave your comments in the section below. We’d love to hear from you.