
Unleashing Hadoop's Power: Hive, HBase, Pig, Sqoop, Flume, and Oozie

Apache Hive – SQL on Hadoop

πŸ”Ή What is Apache Hive?

Apache Hive is a data warehouse system built on top of Hadoop that allows users to process large datasets using SQL-like queries.

πŸ”₯ Think of Hive as a Translator:

πŸš€ Why Was Hive Created?

Imagine you work at an e-commerce company storing millions of customer transactions in HDFS. You need to answer questions such as "how much has each customer spent?" or "which products sell best?".

Without Hive, you would have to write complex MapReduce programs in Java. Hive eliminates this need by allowing you to run simple SQL queries instead.

πŸ”Ή How Does Hive Work?

πŸ”₯ Hive is NOT a database. It is a query engine for Hadoop.

It consists of:

πŸ’‘ Hive Query Flow:

1️⃣ User writes an HQL query.
2️⃣ Hive converts it into MapReduce, Tez, or Spark jobs.
3️⃣ The query runs on Hadoop, processing data from HDFS.
4️⃣ Results are returned in a structured table format.

πŸ”Ή Key Features of Hive

πŸ“Œ Getting Started with Hive

1️⃣ Creating a Hive Table

Assume you have a file sales_data.csv stored in HDFS:

πŸ“„ Example HBase Table (Storing User Activity Data)

RowKey Column Family: Activity Column Family: Profile
u1 login_time: 12:00 PM name: Alice
u2 login_time: 12:05 PM name: Bob

OrderID,Customer,Product,Amount,Date
101,Alice,Laptop,50000,2024-01-01
102,Bob,Phone,30000,2024-01-02
103,Alice,Headphones,2000,2024-01-02

πŸ’‘ Step 1: Create a Hive Table


CREATE TABLE sales_data (
    OrderID INT,
    Customer STRING,
    Product STRING,
    Amount FLOAT,
    Date STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2️⃣ Loading Data into Hive


LOAD DATA INPATH '/data/sales_data.csv' INTO TABLE sales_data;

πŸ“Œ Querying Data in Hive

πŸ”₯ Finding Total Sales Per Customer


    SELECT Customer, SUM(Amount) AS Total_Spent
    FROM sales_data
    GROUP BY Customer;
    

πŸ’‘ Output:


+----------+-------------+
| Customer | Total_Spent |
+----------+-------------+
| Alice    | 52000       |
| Bob      | 30000       |
+----------+-------------+
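
💡 Once the table exists, any HQL you already know applies. For instance, a small illustrative query (not from the original walkthrough) could rank products by revenue on the same table:


    -- illustrative: rank products by total revenue in sales_data
    SELECT Product, SUM(Amount) AS Revenue
    FROM sales_data
    GROUP BY Product
    ORDER BY Revenue DESC
    LIMIT 3;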

πŸ“Œ Real-Life Use Case of Hive

πŸ“Œ Use Case: Analyzing Social Media Data

A social media company stores billions of user interactions (likes, comments, shares) in HDFS. They want to spot trending content and measure engagement across all of that data.

πŸ’‘ Without Hive: Writing Java-based MapReduce programs would take days.

πŸ’‘ With Hive: A simple SQL query can provide insights in minutes!

πŸ“Œ Advanced Hive Concepts

1️⃣ Partitioning in Hive

πŸ”₯ Problem: Suppose you have 1 billion rows in your sales_data table. Querying the entire dataset is slow.

βœ… Solution: Partition the table by date!


    CREATE TABLE sales_data_partitioned (
        OrderID INT,
        Customer STRING,
        Product STRING,
        Amount FLOAT
    ) PARTITIONED BY (Date STRING);
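
💡 A quick sketch of how the partitioned table would be used, assuming the sales_data table from earlier and dynamic partitioning enabled:


    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Copy existing rows into the partitioned table; Hive creates one partition per Date value
    INSERT INTO TABLE sales_data_partitioned PARTITION (Date)
    SELECT OrderID, Customer, Product, Amount, Date FROM sales_data;

    -- Only the 2024-01-02 partition is scanned, not all 1 billion rows
    SELECT Customer, SUM(Amount) AS Total_Spent
    FROM sales_data_partitioned
    WHERE Date = '2024-01-02'
    GROUP BY Customer;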
    

2️⃣ Bucketing in Hive

πŸ”₯ Problem: What if we need to filter by Customer Name?

βœ… Solution: Use bucketing to distribute data evenly!


    CREATE TABLE sales_data_bucketed (
        OrderID INT,
        Customer STRING,
        Product STRING,
        Amount FLOAT
    ) CLUSTERED BY (Customer) INTO 10 BUCKETS;
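
💡 A small illustrative query on the bucketed table: TABLESAMPLE reads just one of the 10 buckets instead of the whole dataset, which is handy for quick checks.


    -- sample a single bucket rather than scanning every row
    SELECT Customer, Amount
    FROM sales_data_bucketed
    TABLESAMPLE (BUCKET 1 OUT OF 10 ON Customer);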
    

πŸ“Œ When to Use Hive vs. HBase

Feature     | Hive (SQL on Hadoop)       | HBase (NoSQL on Hadoop)
Query Type  | Batch processing (SQL)     | Real-time queries (NoSQL)
Data Type   | Structured                 | Unstructured, semi-structured
Speed       | Slow for small queries     | Fast for real-time lookups
Use Case    | Data warehousing, reports  | Messaging apps, real-time analytics

πŸ“Œ Summary

Apache HBase – NoSQL on Hadoop

πŸ”Ή What is Apache HBase?

Apache HBase is a NoSQL database that runs on Hadoop and provides real-time access to large datasets.

πŸ”₯ Think of HBase as a Big Table Inside Hadoop:

πŸ”Ή Why Use HBase Instead of Hive?

Feature     | Apache Hive (SQL)             | Apache HBase (NoSQL)
Query Type  | Batch processing (SQL)        | Real-time queries (NoSQL)
Data Type   | Structured (tables, columns)  | Semi-structured (key-value)
Use Case    | Analytics, reports            | Fast read/write applications
Speed       | Slow for individual lookups   | Fast for real-time access

πŸ’‘ Example:

πŸ”Ή Key Features of HBase

πŸ“Œ How Does HBase Work?

πŸ›  HBase Architecture

HBase consists of three main components:

HMaster, which assigns regions to servers and handles schema changes.
RegionServers, which serve reads and writes for the regions of each table.
ZooKeeper, which coordinates the cluster and tracks which servers are alive.

πŸ’‘ HBase stores data in a key-value format:

πŸ“„ Example HBase Table (Storing User Activity Data)

RowKey | Column Family: Activity | Column Family: Profile
u1     | login_time: 12:00 PM    | name: Alice
u2     | login_time: 12:05 PM    | name: Bob

βœ” Unlike traditional databases, HBase stores columns together instead of rows, making it faster for analytical queries.

πŸ“Œ Getting Started with HBase

1️⃣ Creating an HBase Table

To create a table named users, open the HBase shell (hbase shell) and run:


create 'users', 'activity', 'profile'

πŸ’‘ This creates a table with two column families: activity (for login times) and profile (for user details).

2️⃣ Inserting Data into HBase


put 'users', 'u1', 'activity:login_time', '12:00 PM'
put 'users', 'u1', 'profile:name', 'Alice'

3️⃣ Retrieving Data from HBase


get 'users', 'u1'

πŸ’‘ Output:


COLUMN               CELL
activity:login_time  timestamp=2024-01-01, value=12:00 PM
profile:name         timestamp=2024-01-01, value=Alice

βœ” Unlike Hive, HBase supports real-time lookups without scanning the whole table!

πŸ“Œ Real-Life Use Case: How Facebook Uses HBase

πŸ“Œ Use Case: Facebook Messenger

Facebook needs to store and retrieve millions of messages per second in real time.

πŸ”₯ Why HBase?

πŸ’‘ Without HBase: A traditional SQL database would struggle to handle so much data in real time.

πŸ’‘ With HBase: Facebook can efficiently store, retrieve, and analyze messages at a massive scale.

πŸ“Œ Advanced HBase Concepts

1️⃣ HBase vs. Relational Databases

Feature         | HBase (NoSQL)               | SQL Databases (MySQL, PostgreSQL)
Schema          | Flexible, no fixed schema   | Fixed schema (tables, columns)
Scalability     | Horizontally scalable       | Limited scalability
Transactions    | No ACID transactions        | Supports ACID transactions
Query Language  | NoSQL API (Get, Put, Scan)  | SQL queries

πŸ“Œ Scanning Data in HBase


scan 'users'
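
💡 A scan can also be narrowed down; the snippet below (illustrative) returns only the login_time column for at most 10 rows:


# only the activity:login_time column, at most 10 rows
scan 'users', {COLUMNS => ['activity:login_time'], LIMIT => 10}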

πŸ“Œ Summary

Apache Pig – Scripting for Data Processing

πŸ”Ή What is Apache Pig?

Apache Pig is a high-level scripting language for processing large datasets in Hadoop.

πŸ”₯ Why Was Pig Created?

πŸ’‘ Think of Pig as SQL + Scripting for Big Data:

πŸ”Ή Why Use Apache Pig?

πŸ’‘ Use Case: Analyzing E-Commerce Clickstream Data

An e-commerce company needs to analyze user clickstream data stored in Hadoop, for example to see which pages users visit before buying. Pig Latin makes it easy to clean, filter, and aggregate that semi-structured log data without writing Java.

πŸ“Œ Apache Pig vs. Apache Hive

Feature    | Apache Pig (Scripting)        | Apache Hive (SQL)
Language   | Pig Latin (procedural)        | SQL-like queries
Use Case   | ETL, data preparation         | Analytics, reports
Structure  | Handles semi-structured data  | Best for structured data
Execution  | Converts to MapReduce         | Converts to MapReduce

πŸ’‘ When to Use What?

πŸ“Œ How Does Pig Work?

πŸ›  Pig Execution Modes

Apache Pig runs in two modes:

Local mode (pig -x local): runs on a single machine against the local file system, handy for testing scripts.
MapReduce mode (pig -x mapreduce): runs against a Hadoop cluster, reading from and writing to HDFS.

βœ… Command to Start Pig in Interactive Mode:


pig

πŸ’‘ This opens the Grunt Shell, where you can write Pig Latin commands.

πŸ“Œ Writing a Pig Script – Hands-on Example

πŸ’‘ Scenario: Processing an E-Commerce Transactions Dataset

We have an e-commerce dataset with user transactions in a CSV file:

πŸ“„ transactions.csv


1001,Alice,Laptop,900,2024-06-01
1002,Bob,Phone,600,2024-06-02
1003,Charlie,Tablet,400,2024-06-03

1️⃣ Load Data into Pig


transactions = LOAD 'transactions.csv' USING PigStorage(',')
            AS (id: INT, name: CHARARRAY, product: CHARARRAY, amount: FLOAT, date: CHARARRAY);

2️⃣ Filter Transactions Above $500


high_value = FILTER transactions BY amount > 500;

3️⃣ Group Data by Product Type


grouped = GROUP transactions BY product;

4️⃣ Find Total Sales Per Product


total_sales = FOREACH grouped GENERATE group AS product, SUM(transactions.amount) AS total;

5️⃣ Store the Result in HDFS


STORE total_sales INTO '/user/hadoop/sales_output' USING PigStorage(',');

πŸ“Œ Running the Pig Script

βœ… Save the Pig script as sales_analysis.pig

βœ… Run it in Hadoop mode:


pig -x mapreduce sales_analysis.pig

πŸ’‘ Output in HDFS:


Laptop,900.0
Phone,600.0
Tablet,400.0

βœ” We successfully analyzed sales data using just a few lines of Pig Latin!

πŸ“Œ Real-Life Use Case: How Twitter Uses Pig

πŸ“Œ Use Case: Analyzing Tweets

Twitter stores millions of tweets per day in Hadoop and needs to process them for tasks such as spotting trending topics and analyzing hashtags.

πŸ”₯ Why Pig?

πŸ’‘ Example: Extracting Hashtags from Tweets


tweets = LOAD 'tweets.json' USING JsonLoader('user:chararray, message:chararray');
hashtags = FOREACH tweets GENERATE FLATTEN(TOKENIZE(message)) AS word;
filtered = FILTER hashtags BY STARTSWITH(word, '#');

βœ” Extracts hashtags from tweets and prepares them for analysis.

πŸ“Œ Summary

Apache Sqoop – Data Transfer Between Hadoop & Databases

πŸ”Ή What is Apache Sqoop?

Apache Sqoop is a data transfer tool used to move large volumes of data between relational databases (such as MySQL, Oracle, or PostgreSQL) and Hadoop (HDFS, Hive, HBase).

πŸ”₯ Why Was Sqoop Created?

πŸ’‘ Think of Sqoop as a data pipeline between Hadoop & Databases!

πŸ“Œ Why Use Apache Sqoop?

πŸ’‘ Use Case: Analyzing Bank Transactions

A bank wants to analyze customer transaction data stored in a MySQL database. Sqoop imports the tables into HDFS or Hive for large-scale analysis, and later exports the aggregated results back to MySQL for reporting.

πŸ“Œ How Sqoop Works?

Apache Sqoop executes MapReduce jobs in the background to transfer data efficiently.

πŸ›  Sqoop Execution Modes

πŸ’‘ Command to Check Available Databases in MySQL:


sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password mypassword

πŸ“Œ Hands-on Example: Import Data from MySQL to Hadoop

πŸ’‘ Scenario: We have a transactions table in MySQL.

πŸ“„ MySQL Table – transactions


transaction_id  customer_name   amount   transaction_date
101             Alice           500      2024-06-01
102             Bob             700      2024-06-02
103             Charlie         300      2024-06-03

1️⃣ Import MySQL Data into HDFS


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--target-dir /user/hadoop/transactions \
-m 4

πŸ’‘ What happens here?

2️⃣ Import MySQL Data into Hive


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--hive-import \
--hive-table hive_transactions

πŸ’‘ What happens here?

3️⃣ Import Only New Data (Incremental Load)

πŸ’‘ Instead of importing all data every time, we can import only new rows.


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--incremental append \
--check-column transaction_id \
--last-value 103

πŸ’‘ What happens here?

4️⃣ Export Processed Data from Hadoop to MySQL

πŸ’‘ After analyzing data in Hadoop, we may need to send results back to MySQL.


sqoop export \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table aggregated_sales \
--export-dir /user/hadoop/sales_results

πŸ’‘ What happens here?

πŸ“Œ Real-Life Use Case: How eBay Uses Sqoop

πŸ“Œ Use Case: Analyzing Customer Purchase Patterns

eBay stores billions of customer orders in MySQL and PostgreSQL databases.

To gain insights, they need to analyze this data in Hadoop.

πŸ”₯ Why Sqoop?

πŸ’‘ Example:


sqoop import \
--connect jdbc:mysql://ebay_db/orders \
--username admin \
--password secret \
--table customer_orders \
--incremental lastmodified \
--check-column last_update_time \
--last-value '2024-06-10 00:00:00'

βœ” Transfers only newly updated orders since June 10.

πŸ“Œ Summary

Apache Flume – Streaming Data into Hadoop

πŸ”Ή What is Apache Flume?

Apache Flume is a data ingestion tool designed to collect, aggregate, and transport large amounts of streaming data into Hadoop (HDFS, Hive, HBase).

πŸ”₯ Why Was Flume Created?

πŸ’‘ Think of Flume as a pipeline that continuously transports data into Hadoop!

πŸ“Œ Why Use Apache Flume?

πŸ’‘ Use Case: Analyzing Social Media Activities

A social media company wants to analyze user activities (likes, shares, comments) in real time. Flume can tail the activity logs on every server and continuously deliver them to HDFS for analysis.

πŸ“Œ How Apache Flume Works?

Flume uses an agent-based architecture with the following components:

Source: where data enters the agent (for example, a log file being tailed or a Twitter stream).
Channel: a buffer that holds events in transit (memory- or file-backed).
Sink: where events are delivered (for example, HDFS or HBase).

πŸ›  Flume Architecture

πŸ’‘ Think of it like a water pipeline:

πŸ“Œ Hands-on Example: Streaming Log Data to Hadoop

πŸ’‘ Scenario: Collecting Apache Server Logs and Storing in HDFS

πŸ“„ Sample Log File – access.log


192.168.1.10 - - [10/Jul/2024:12:15:22 +0000] "GET /index.html HTTP/1.1" 200 512
192.168.1.11 - - [10/Jul/2024:12:15:23 +0000] "POST /login HTTP/1.1" 302 0
192.168.1.12 - - [10/Jul/2024:12:15:24 +0000] "GET /dashboard HTTP/1.1" 200 1024

βœ” Each line represents a web request made by a user.

1️⃣ Install Apache Flume


sudo apt-get install flume-ng

2️⃣ Configure Flume Agent

Flume agents are defined in a flume.conf file.

πŸ“„ flume.conf (Configuration File)


# Define Agent Name
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define Source - Collects data from a log file
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /var/log/apache2/access.log

# Define Channel - Temporary storage
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000

# Define Sink - Sends data to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/flume_logs/
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 10000
agent1.sinks.sink1.hdfs.rollCount = 10

# Bind Source and Sink to Channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

πŸ’‘ What happens here?

3️⃣ Start Flume Agent


flume-ng agent --name agent1 --conf ./conf/ --conf-file flume.conf -Dflume.root.logger=INFO,console

βœ” Now, Flume continuously streams log data into Hadoop in real-time!

4️⃣ Verify Data in HDFS


hadoop fs -ls /user/hadoop/flume_logs/

βœ” If successful, Flume is now ingesting live data into Hadoop! 🎯

πŸ“Œ Real-Life Use Case: How Twitter Uses Flume

πŸ“Œ Use Case: Analyzing Social Media Trends

Twitter generates millions of tweets per second.

To analyze real-time trends, they need to store tweets in Hadoop continuously.

πŸ”₯ Why Flume?

πŸ’‘ Example: Streaming Twitter Data to Hadoop


agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.source1.consumerKey = xxxxx
agent1.sources.source1.consumerSecret = xxxxx
agent1.sources.source1.accessToken = xxxxx
agent1.sources.source1.accessTokenSecret = xxxxx
agent1.sources.source1.channels = channel1

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/twitter_logs/
agent1.sinks.sink1.channel = channel1

βœ” This configuration streams tweets directly into Hadoop! πŸš€

πŸ“Œ Summary

Apache Oozie – Workflow Scheduling in Hadoop

πŸ”Ή What is Apache Oozie?

Apache Oozie is a workflow scheduler designed for managing and automating Hadoop jobs.

πŸ”₯ Why Was Oozie Created?

πŸ’‘ Think of Oozie as a "Task Manager" for Hadoop jobs!

πŸ“Œ Why Use Apache Oozie?

πŸ’‘ Example: Automating an E-Commerce Data Pipeline

πŸ“Œ Instead of running each step manually, Oozie automates the entire workflow! πŸš€

πŸ“Œ How Apache Oozie Works?

Oozie has two main types of jobs:

Workflow jobs: a DAG (directed acyclic graph) of actions such as MapReduce, Pig, Hive, and Sqoop steps, connected by success/failure transitions.
Coordinator jobs: workflows that are triggered on a time schedule or when input data becomes available.

πŸ’‘ Think of Oozie like a train schedule: