Hadoop for Beginners: The Basics of the Open Source Data Framework
The past few years have been filled with a great deal of hype about big data. But what is it, and how do you actually go about harnessing its power?
According to Wikipedia, big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
When organizations talk about their big data projects, they are typically referring to using those massive data sets for predictive analytics, user behavior analytics, or other advanced data analytics methods. When combined with human understanding, the analysis of big data can reveal important trends and powerful insights.
By its nature, big data can be overwhelming to gather, store and make sense of. When you’re collecting billions of data points from hundreds of sources, you will need to reconsider your data management tools. For many organizations, that’s where Hadoop comes in.
If you are in the midst of planning your big data initiatives for 2018, you might be considering Hadoop. If you’re new to Hadoop, you probably want to know exactly what it is and how to know if it’s right for your organization.
At the highest level, Hadoop is a way to store massive data sets across distributed clusters of commodity servers and run analysis applications in parallel across those clusters.
Here’s how Apache describes it:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop includes four modules:
Hadoop Common: These are the common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): This is the Hadoop component that holds the actual data. You load your data into HDFS and it stays there until you need it, whether internally or externally. For example, you can run analysis on it within Hadoop or export a data set to another tool for analysis.
Hadoop YARN: YARN is short for Yet Another Resource Negotiator and provides a framework for job scheduling and cluster resource management. The central YARN resource manager arbitrates how applications use Hadoop system resources, while node manager agents monitor the processing operations of individual cluster nodes.
Hadoop MapReduce: This is where the magic happens: the YARN-based system for parallel processing of large data sets. It's the component where data actually gets processed. The term MapReduce refers to two separate Hadoop tasks. The map job takes a set of data and converts it into another set of data, breaking individual elements down into tuples (key/value pairs). The reduce job takes the map output and combines the data tuples into a smaller set of tuples, which ultimately produces the final result set.
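To make the HDFS workflow above concrete, here is a minimal sketch of loading data in and exporting it back out with the standard `hdfs dfs` file system shell. It assumes a running Hadoop cluster with the `hdfs` command on your PATH; the directory and file names are purely illustrative.

```shell
# Assumes a running HDFS cluster; paths and file names are illustrative.
hdfs dfs -mkdir -p /user/analytics/input                      # create a directory in HDFS
hdfs dfs -put sales-2017.csv /user/analytics/input            # load local data into HDFS
hdfs dfs -ls /user/analytics/input                            # confirm the file landed
hdfs dfs -get /user/analytics/input/sales-2017.csv ./export/  # export it back out to another tool
```

The data stays in HDFS, replicated across the cluster, until you delete it; `-put` and `-get` are simply the load and export steps described above.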
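The map and reduce phases just described can be sketched in a few lines of plain Python. This is a local simulation of the classic word-count pattern, not real Hadoop code (a production job would be written against the Hadoop MapReduce API and run across a cluster), but the data flow — map emits key/value tuples, reduce combines tuples sharing a key — is the same.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: convert each input line into (word, 1) key/value tuples."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine all tuples sharing a key into one (word, count) tuple."""
    # Hadoop shuffles/sorts by key between the phases; sorted+groupby mimics that.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)
```

In a real cluster the map tasks run in parallel on the nodes holding each block of data, and the framework handles the shuffle between phases; the logic per task, however, is no more complicated than this.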
In our experience, most large-scale data projects use Hadoop in some capacity. However, running a Hadoop cluster is quite a challenge and requires a team with a wide range of skills and expertise. That's one reason many companies with big Hadoop clusters end up moving their data to the cloud. Instead of spending time managing the cluster, their people can focus on doing the analysis and finding the answers they need.