An Introduction to Hadoop for Noobs

The client-server model transformed the data industry by breaking apart overloaded, centralized systems. However, the client-server architectures of the 1980s could not handle the processing that big data demands.

Along Comes Google

As an answer to some of the biggest limitations of early client-server models, such as long processing times, frequent system overload, and server bottlenecks, Google created the Google File System (GFS). GFS was groundbreaking because it ran on industry-standard commodity hardware at large scale. It broke data into chunks stored across many nodes in a cluster, so data was no longer centralized the way it once was.

Google later published the design behind GFS in a white paper, making it available to the entire community and creating a big buzz. That's when Doug Cutting and Mike Cafarella developed the open-source software that would eventually become Hadoop.

But what is Hadoop?

Data comes in and lives on distributed servers, and the work is then pushed out to the servers that hold the data, so no single centralized location is pushed over capacity. This eliminates the bottlenecking that occurred in some of the earlier client-server models. To achieve this, Hadoop relies on two main components: HDFS and MapReduce.

HDFS (the Hadoop Distributed File System) is a Java-based, purpose-built file system designed to handle the demands of big data. It splits files into blocks and replicates them across the nodes of a Hadoop cluster, which makes files easier and faster to find and read.
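To make this concrete, here is a short sketch of working with HDFS from the command line. It assumes a running Hadoop installation and uses the standard `hdfs dfs` shell; the paths and the file name `access.log` are just illustrative.

```shell
# Create a directory in HDFS and copy a local file into it
# (paths and file name are hypothetical examples)
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put access.log /user/demo/

# List the directory and read the file back; behind the scenes HDFS
# has split the file into blocks (128 MB by default) and replicated
# each block across the cluster's DataNodes
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/access.log
```

Because blocks are replicated (three copies by default), the loss of a single machine does not mean the loss of data.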

MapReduce is designed to process large data sets across a cluster of computers. It has two parts: the Mapper and the Reducer.

To give an example, MapReduce would be helpful if you wanted to find out how many times the word "the" appears throughout a book. You would first distribute the pages of the book to mappers, each of which emits a (word, count) pair for every occurrence of "the" on its pages. The reducers then group the pairs by key and sum the counts, and you wind up with the final total.
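The word-count flow above can be sketched in plain Java. This is a minimal, single-machine simulation of the map and reduce phases, not real Hadoop code (a real job would use Hadoop's `Mapper` and `Reducer` classes and run across a cluster); the class and method names are my own.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine sketch of the MapReduce word-count flow:
// the map phase turns each "page" into (word, count) pairs,
// the reduce phase merges the pairs by key, summing the counts.
public class WordCount {

    // Map phase: one page of text becomes a map of (word -> count) pairs.
    static Map<String, Integer> map(String page) {
        Map<String, Integer> pairs = new HashMap<>();
        for (String word : page.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.merge(word, 1, Integer::sum);
            }
        }
        return pairs;
    }

    // Reduce phase: merge all per-page pair sets, summing values per key.
    static Map<String, Integer> reduce(List<Map<String, Integer>> mapped) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> pairs : mapped) {
            pairs.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    // Run map over every page, then reduce, then look up one word.
    static int countWord(List<String> pages, String word) {
        List<Map<String, Integer>> mapped =
                pages.stream().map(WordCount::map).toList();
        return reduce(mapped).getOrDefault(word.toLowerCase(), 0);
    }

    public static void main(String[] args) {
        List<String> pages = List.of(
                "The quick brown fox jumps over the lazy dog",
                "The dog barks");
        System.out.println(countWord(pages, "the")); // prints 3
    }
}
```

In real Hadoop, the pages would be HDFS blocks, the mappers would run on the nodes that already hold those blocks, and a shuffle step would route all pairs for the same key to the same reducer.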

Why would you want to use Hadoop?

The major advantage of HDFS is its scalability: a single cluster can scale to roughly 4,500 servers and 200 PB of data. Storage is handled in a distributed way so that no single machine is overloaded.

Unstructured Data

It’s also a powerful way to manage unstructured data, which by some estimates makes up upwards of 90% of all data. Examples include television scripts, emails, and blogs.

Importable Data Tools

Ecosystem tools such as HBase and Hive sit on top of Hadoop, and data can be imported from or exported to other databases.

Open Source

Unlike the proprietary GFS that inspired it, Hadoop is open source, maintained by the Apache Software Foundation.

I highly suggest you consider using Hadoop for your next big data project.


About the author

John DeCleene

Whilst having spent a lot of his life in Asia, John DeCleene has lived and studied all over the world, including spells in Hong Kong, Mexico, the U.S., and China. He graduated with a BA in Political Science from Tulane University in 2016. Fluent in English and proficient in Mandarin and Spanish, he can communicate and connect with most of the world’s population, which served him well as he gained work experience interning for the U.S.-Taiwan Business Council in Washington, D.C. as an investment analyst and then working alongside U.S. Senator Robert P. Casey of Pennsylvania as a legislative intern. He subsequently worked as a business analyst for a mutual fund in Singapore, where his passion for travel and aptitude for connecting opportunities and ideas intersected perfectly, spending his time travelling between Cambodia, Hong Kong, and China investigating untapped investment opportunities. John is a fund manager for OCIM’s fintech fund and is currently progressing towards becoming a CFA charter holder. He loves to travel for business and pleasure, having visited 38 countries (including North Korea); he represents the new breed of global citizen for the 21st century.
