
Hadoop Mastery: In-Depth Guide to HDFS, YARN, MapReduce, and More

HDFS – Hadoop Distributed File System

🔹 How Data is Stored Across Multiple Nodes

HDFS is the storage layer of the Hadoop ecosystem. It’s designed to store large datasets reliably and efficiently across multiple machines in a distributed manner.

📌 Data Split into Blocks

HDFS splits large files into smaller, fixed-size chunks called blocks (typically 128MB or 256MB). These blocks are stored across different machines in the Hadoop cluster.

📌 Distributed Storage

Instead of storing a file on a single machine, HDFS stores multiple copies (replicas) of each block across various DataNodes. This increases fault tolerance: if one node fails, the data can still be read from the other nodes that hold replicas.

📌 Block Size & Data Redundancy

The default block size is 128MB (configurable per file or per cluster), and each block is stored with a default replication factor of 3, so a file remains fully readable even if two of the nodes holding a given block fail.

🔹 NameNode, DataNode, and Block Storage Concepts

NameNode

The master of HDFS. It keeps the filesystem metadata (the directory tree, the file-to-block mapping, and the location of every block replica) and coordinates all client access, but it does not store the file data itself.

DataNode

The workers of HDFS. Each DataNode stores actual blocks on its local disks, serves read and write requests from clients, and reports its block inventory to the NameNode through regular heartbeats.

Block Storage

Every file is physically stored as a set of replicated blocks scattered across DataNodes; the NameNode's metadata is what ties those blocks back together into a single logical file.
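
To make these roles concrete, here is a minimal sketch (not part of the original walkthrough; the path is only illustrative) that asks the NameNode for a file's block size, replication factor, and block locations through the HDFS Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client connection to the NameNode
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/hdfs/file.txt"));

        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // One BlockLocation per block, listing the DataNodes that hold its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block);
        }
        fs.close();
    }
}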

🔹 Hands-on Example: Uploading & Retrieving Files from HDFS

Step 1: Setting Up HDFS

Ensure your Hadoop cluster is running. Start HDFS using:

$ start-dfs.sh

Step 2: Uploading a File to HDFS

To upload a file from your local system to HDFS, use:

$ hadoop fs -put /local/path/to/file.txt /user/hadoop/hdfs/

Step 3: Verifying File Upload

Check if the file was successfully uploaded:

$ hadoop fs -ls /user/hadoop/hdfs/

Step 4: Retrieving a File from HDFS

To download a file from HDFS to your local system:

$ hadoop fs -get /user/hadoop/hdfs/file.txt /local/path/

Step 5: Verifying File Retrieval

Check the downloaded file in your local directory:

$ ls /local/path/
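
The same upload and retrieve steps can also be done programmatically. The following is a small sketch using the HDFS FileSystem Java API, with the same illustrative paths as above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // connects to the NameNode

        // Step 2 equivalent: upload a local file into HDFS
        fs.copyFromLocalFile(new Path("/local/path/to/file.txt"),
                             new Path("/user/hadoop/hdfs/"));

        // Step 4 equivalent: download the file back to the local filesystem
        fs.copyToLocalFile(new Path("/user/hadoop/hdfs/file.txt"),
                           new Path("/local/path/"));

        fs.close();
    }
}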

📌 Summary

With this understanding of HDFS, you’re now ready to move forward with the next component: YARN!

YARN – Yet Another Resource Negotiator

🔹 Role of ResourceManager & NodeManager

YARN is the resource management layer in Hadoop. It’s responsible for managing and allocating resources to various jobs running on the cluster. It decouples resource management from the processing layer (MapReduce), making Hadoop more flexible and capable of handling various types of workloads.

📌 YARN consists of two main components:

🔹 ResourceManager (RM)

The cluster-wide master of YARN. It tracks the resources (memory and CPU) available on every node and decides which application gets containers, and where.

📌 Two Main Parts of ResourceManager:

  1. Scheduler: Allocates containers to running applications according to capacity and queue policies; it does not monitor or restart tasks.
  2. ApplicationsManager: Accepts job submissions, negotiates the first container for each application's ApplicationMaster, and restarts that ApplicationMaster if it fails.

🔹 NodeManager (NM)

The per-node worker agent. It launches and monitors containers on its machine, enforces their resource limits, and reports usage and node health back to the ResourceManager.

🔹 How YARN Allocates Resources for Tasks

When a job is submitted to YARN, it is divided into smaller tasks, and each task runs inside a container (a slice of memory and CPU on some node). The ResourceManager then allocates these containers across the cluster.

📌 Process of Resource Allocation (a small client-side sketch follows this list):

  1. Job Submission: A job is submitted, and the ResourceManager allocates resources for the tasks.
  2. Container Allocation: The ResourceManager requests NodeManagers on available nodes to launch containers (resources).
  3. Execution of Tasks: NodeManagers start containers, and each container runs one or more tasks in parallel.
  4. Resource Deallocation: Once tasks are completed, NodeManagers inform the ResourceManager, and the resources are freed up.
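
As referenced above, here is a hedged sketch of how a client program could ask the ResourceManager for its current view of the cluster's nodes, using the YarnClient API (the class name and setup are illustrative):

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // One report per NodeManager: total capacity and what is currently in use.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }
        yarn.stop();
    }
}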

🔹 Practical Example: Running Jobs in YARN

📌 Submitting a Job to YARN

To submit a job to YARN, use the following command:

$ yarn jar /path/to/hadoop-examples.jar wordcount /input/path /output/path

Explanation: This command submits a MapReduce word count job to YARN. /input/path is the location of the input data, and /output/path is where the results will be saved; the output path must not already exist, or the job will fail.

📌 ResourceManager Allocation:

The ResourceManager accepts the application, launches its ApplicationMaster in the first container, and then grants further containers for the map and reduce tasks as resources become free.

📌 NodeManager Execution:

Each NodeManager launches the containers assigned to its node, runs the tasks inside them, and reports their progress (or failures) back so the resources can be reused.

📌 Monitoring Job Execution

You can monitor the job’s progress using the YARN ResourceManager Web UI or via the command line:

$ yarn application -status <application_id>
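
The same information can be fetched programmatically. Below is a hedged sketch using the YarnClient API; the application id string is a placeholder for the id printed when your job was submitted:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class JobStatus {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Placeholder id: use the application id printed when the job was submitted.
        ApplicationId appId = ApplicationId.fromString("application_1700000000000_0001");
        ApplicationReport report = yarn.getApplicationReport(appId);

        System.out.println("State:    " + report.getYarnApplicationState());
        System.out.println("Progress: " + report.getProgress());
        System.out.println("Tracking: " + report.getTrackingUrl());
        yarn.stop();
    }
}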

📌 Job Completion

Once all tasks complete, the job finishes, and the output is stored in the specified HDFS output directory.

📌 Summary

Now that you understand YARN, let’s move on to the next major component of Hadoop: MapReduce, the programming model used to process large datasets.

MapReduce – The Processing Engine

🔹 How MapReduce Works (Map, Shuffle, Reduce)

MapReduce is a programming model used to process large amounts of data in parallel across a Hadoop cluster. It consists of three primary phases: Map, Shuffle, and Reduce. Let’s break it down:

📌 Map Phase:

Each input record is processed independently and turned into intermediate key-value pairs. For word count, every word becomes a key with the value 1:

def map_function(input_line):
    words = input_line.split(" ")
    for word in words:
        print(f"{word}\t1")

📌 Shuffle Phase:

The framework sorts the map output and groups together all values that share the same key, so each reducer receives one key along with the complete list of its values.

📌 Reduce Phase:

Each reducer takes a key and its list of values and combines them into a final result, here by summing the counts:

def reduce_function(key, values):
    total = sum(values)
    print(f"{key}: {total}")

🔹 Word Count Example in MapReduce

Let’s look at a practical example: a word count MapReduce program.

📌 Input File (sample.txt):


Hadoop is a framework that allows for distributed processing of large data sets.
MapReduce is a programming model used for processing data in parallel.
Hadoop and MapReduce are powerful tools for big data processing.

📌 Map Function Output (pairs emitted for the first input line; the other lines are processed the same way):


("Hadoop", 1)
("is", 1)
("a", 1)
("framework", 1)
("that", 1)
("allows", 1)
("for", 1)
("distributed", 1)
("processing", 1)
("of", 1)
("large", 1)
("data", 1)
("sets", 1)

📌 Shuffle Phase Output (the same subset of words, now grouped across all three input lines; punctuation is ignored for simplicity):


("Hadoop", [1, 1])
("is", [1, 1])
("a", [1, 1])
("framework", [1])
("that", [1])
("allows", [1])
("for", [1, 1, 1])
("distributed", [1])
("processing", [1, 1, 1])
("of", [1])
("large", [1])
("data", [1, 1, 1])
("sets", [1])

📌 Reduce Function Output:


("Hadoop", 2)
("is", 2)
("a", 2)
("framework", 1)
("that", 1)
("allows", 1)
("for", 3)
("distributed", 1)
("processing", 3)
("of", 1)
("large", 1)
("data", 3)
("sets", 1)

🔹 Writing & Running a Basic MapReduce Program

Now, let’s write and execute a Java-based MapReduce program.

📌 Mapper Class:


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

📌 Reducer Class:


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
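
The Mapper and Reducer alone are not enough to launch a job; a driver class with a main() method is needed to wire them into a Job. The article does not show one, so here is a minimal sketch (the class name WordCountDriver is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}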

📌 Running the MapReduce Job:


$ hadoop jar wordcount.jar WordCountDriver /input /output

This command runs the driver class (the WordCountDriver sketched above), which processes the text from the /input directory and stores the word count results in the /output directory in HDFS. As before, the output directory must not already exist.

📌 Summary

You have now seen the Map, Shuffle, and Reduce phases in action and run a complete word count job on the cluster. Next, let's look at some more advanced MapReduce topics.

Advanced Topics in MapReduce

🔹 Real-life Use Cases of MapReduce

MapReduce is widely used across various industries for large-scale data processing. Here are some real-world applications:

📌 1. Log Analysis

Many organizations use MapReduce to analyze server logs to monitor system performance or detect anomalies.

📌 2. Data Processing for Machine Learning

MapReduce is used to preprocess data before applying machine learning algorithms. It helps clean, filter, and transform large datasets before modeling.

📌 3. Search Engine Indexing

Search engines like Google and Bing use MapReduce to crawl the web, process documents, and create an index for fast search results.

🔹 Optimization and Performance Tuning

MapReduce jobs can take a long time to execute, especially with massive datasets. Here are some techniques to improve performance:

📌 1. Combiner Function

The combiner function acts as a mini-reducer that runs on the Map output before it is sent to the reducer. This reduces the amount of data shuffled across the network.


import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
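
A combiner only takes effect once it is registered on the job. Assuming the hypothetical WordCountDriver sketched earlier, one extra line in its main() enables it (for word count, the reducer class itself could equally be reused as the combiner):

// Inside the driver's main(), between the mapper and reducer setup:
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountCombiner.class);   // runs on map output before the shuffle
job.setReducerClass(WordCountReducer.class);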

📌 2. Partitioner

A custom partitioner controls which reducer each intermediate key is sent to. Choosing partitions that spread the keys evenly balances the load across reducers and avoids skew in the shuffle phase.
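
As an illustration only (the routing rule below is a toy example, not from the article), a custom partitioner for the word count job could look like this; it would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2) {
            return 0;                                    // single reducer: nothing to route
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        return (first <= 'm') ? 0 : 1;                   // words a-m -> reducer 0, others -> reducer 1
    }
}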

📌 3. Parallelizing Data

Splitting large datasets into smaller chunks and increasing the number of parallel tasks can significantly speed up job execution.
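
In practice this is controlled from the driver: the maximum input split size determines how many map tasks are created, and the number of reduce tasks sets reduce-side parallelism. A hedged fragment that would go inside the driver's main() (the values are illustrative; FileInputFormat is the same org.apache.hadoop.mapreduce.lib.input class imported in the driver sketch):

// Inside the driver's main(), before submitting the job:
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // cap splits at 64 MB -> more map tasks
job.setNumReduceTasks(8);                                      // run 8 reduce tasks in parallel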

🔹 Handling Failures in MapReduce Jobs

In a production environment, MapReduce jobs can fail due to resource issues, node failures, or data corruption. Hadoop provides built-in fault tolerance mechanisms:

  1. Task Retries: A failed map or reduce task is automatically re-scheduled on another node, up to a configurable number of attempts.
  2. Speculative Execution: Slow ("straggler") tasks can be duplicated on another node, and whichever copy finishes first wins.
  3. Data Replication: Because HDFS keeps multiple replicas of every block, a retried task can read its input from a different node.
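
The retry limits are ordinary configuration properties. A small fragment that would sit in the driver before the Job is created (the values shown are the defaults, written out only for illustration):

Configuration conf = new Configuration();
conf.setInt("mapreduce.map.maxattempts", 4);      // retry a failed map task up to 4 times
conf.setInt("mapreduce.reduce.maxattempts", 4);   // retry a failed reduce task up to 4 times
Job job = Job.getInstance(conf, "word count with retries");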

🔹 Integrating with Other Big Data Tools

MapReduce integrates well with other big data tools, making it easier to work with structured and unstructured data.

📌 1. Integration with Hive and Pig

Hive and Pig sit on top of MapReduce: Hive compiles SQL-like HiveQL queries, and Pig compiles Pig Latin scripts, into chains of MapReduce jobs, so teams get MapReduce's scalability without writing Java by hand.

📌 2. Integration with Apache Spark

Although MapReduce is powerful, Apache Spark provides a more efficient in-memory processing engine. Spark can run MapReduce-like tasks faster because it avoids writing intermediate data to disk.

🔹 Debugging and Monitoring MapReduce Jobs

Hadoop provides the ResourceManager Web UI (and per-job details via the JobHistory Server) for monitoring and debugging MapReduce jobs; the same status information is available from the command line:


$ yarn application -status <application_id>

🔹 MapReduce vs Other Big Data Processing Models

MapReduce is not the only big data processing model. Here’s how it compares to other frameworks:

📌 MapReduce vs Spark

MapReduce writes intermediate results to disk between phases, which makes it robust but comparatively slow, especially for iterative workloads; Spark keeps intermediate data in memory, so iterative and interactive jobs typically finish much faster.

📌 When to Use MapReduce?

MapReduce remains a good fit for very large, one-pass batch jobs (ETL, log aggregation, index building) where the data does not fit in cluster memory and raw throughput matters more than latency.

Case Study: Analyzing Website Traffic Logs Using Hadoop (HDFS, YARN, MapReduce)

🔹 Scenario: The Problem

Imagine you work as a Data Engineer for an e-commerce company called ShopNow, which handles millions of customers daily. Your team is facing a major challenge: the site produces terabytes of web-server logs, and analyzing them on a single machine is no longer feasible.

🔹 The Solution: Hadoop Ecosystem

To solve this, ShopNow decides to migrate its log analysis to Hadoop. Here’s how it works:

📌 Step 1: Storing Log Data in HDFS (Hadoop Distributed File System)

Before processing the logs, they need to be stored efficiently and securely.

How does HDFS help?

The raw log files are split into blocks and spread across the cluster's DataNodes, with each block replicated (three copies by default), so the log archive can grow to petabytes and survive individual disk or node failures.

📌 Step 2: Managing Resources with YARN (Yet Another Resource Negotiator)

Now that the log data is stored in HDFS, it needs to be processed. Running jobs on petabytes of data requires proper resource allocation to avoid overloading the cluster.

How does YARN help?

The ResourceManager looks at the memory and CPU available on every NodeManager and hands out containers for the log-processing tasks accordingly, so the work is spread evenly and no single node is overloaded.

📌 Real-World Example: Processing 10TB of Logs

With the default 128MB block size, a 10TB batch of logs is split into roughly 80,000 blocks (10TB ÷ 128MB ≈ 81,920), and YARN can schedule thousands of map tasks over those blocks in parallel across the cluster.

📌 Step 3: Processing Logs with MapReduce

Once YARN assigns resources, we can analyze the logs using MapReduce.

Goal: Count the number of times each webpage was visited and identify the most popular pages.

🔹 How MapReduce Works in This Case

1️⃣ Map Phase:

The Map function reads log files and extracts webpage visit counts.

Example Input Log File:

10.0.0.1 - - [2024-01-01 12:00:00] "GET /home.html" 200
10.0.0.2 - - [2024-01-01 12:01:00] "GET /product.html" 200
10.0.0.3 - - [2024-01-01 12:02:00] "GET /home.html" 200
Map Output:

/home.html, 1  
/product.html, 1  
/home.html, 1  
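
The case study does not show the mapper for this step; as a hedged sketch, it could parse each access-log line in the format above and emit (page, 1):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageVisitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text page = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Example line: 10.0.0.1 - - [2024-01-01 12:00:00] "GET /home.html" 200
        String[] quoted = value.toString().split("\"");
        if (quoted.length < 2) {
            return;                                   // not a request line we recognise
        }
        String[] request = quoted[1].split(" ");      // ["GET", "/home.html"]
        if (request.length < 2) {
            return;
        }
        page.set(request[1]);
        context.write(page, one);                     // emits (/home.html, 1)
    }
}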

2️⃣ Shuffle & Sort Phase:

Groups the data so that all counts for the same webpage are together.


/home.html → [1, 1]  
/product.html → [1]  

3️⃣ Reduce Phase:

The Reduce function sums up the counts for each webpage.

Final Output (Popular Webpages):

/home.html → 2  
/product.html → 1  

🎯 Insights from Data:

The home page was visited twice, while the product page was visited once. At real scale, the same computation tells ShopNow which pages attract the most traffic, so the team can prioritize caching and performance work on those pages and spot unusual traffic patterns early.

🔹 Final Outcome: Business Benefits

By using Hadoop (HDFS + YARN + MapReduce), ShopNow can:

  1. Store years of raw log data cheaply and reliably in HDFS.
  2. Share cluster resources fairly across many analysis jobs through YARN.
  3. Turn terabytes of logs into page-level traffic insights with MapReduce.

📌 Conclusion: Why Hadoop is a Game Changer for Big Data

This case study demonstrates how Hadoop is used in the industry to solve real-world big data challenges efficiently. 🚀