Unleashing Hadoop's Power: Hive, HBase, Pig, Sqoop, Flume, and Oozie

Apache Hive – SQL on Hadoop

What is Apache Hive?

Apache Hive is a data warehouse system built on top of Hadoop that allows users to process large datasets using SQL-like queries.

Think of Hive as a Translator:

  • Converts SQL queries into MapReduce, Tez, or Spark jobs.
  • Makes querying big data easier without writing Java or Python.
  • Designed for batch processing, making it ideal for analytics and reporting.

Why Was Hive Created?

Imagine you work at an e-commerce company storing millions of customer transactions in HDFS. You need to:

  • ✔ Find total sales per product.
  • ✔ Identify the top 10 customers.
  • ✔ Analyze buying patterns over time.

Without Hive, you would have to write complex MapReduce programs in Java. Hive eliminates this need by allowing you to run simple SQL queries instead.

How Does Hive Work?

Hive is NOT a database. It is a query engine for Hadoop.

It consists of:

  • Hive Query Language (HQL) → SQL-like queries.
  • Metastore → Stores table metadata (columns, data types, location in HDFS).
  • Execution Engine → Converts HiveQL queries into MapReduce, Tez, or Spark jobs.
  • HDFS → Stores the actual data.

Hive Query Flow:

  1. User writes an HQL query.
  2. Hive converts it into MapReduce, Tez, or Spark jobs.
  3. The query runs on Hadoop, processing data from HDFS.
  4. Results are returned in a structured table format.

Key Features of Hive

  • SQL-Like Queries → Uses HiveQL (similar to MySQL, PostgreSQL).
  • Handles Large Datasets → Designed for petabyte-scale processing.
  • Schema on Read → Data structure is defined at query time.
  • Batch Processing → Best for reports, aggregations, and analytics.
  • Supports Multiple Storage Formats → Text, ORC, Parquet, Avro.

Getting Started with Hive

1. Creating a Hive Table

Assume you have a file sales_data.csv stored in HDFS:


OrderID,Customer,Product,Amount,Date
101,Alice,Laptop,50000,2024-01-01
102,Bob,Phone,30000,2024-01-02
103,Alice,Headphones,2000,2024-01-02

Step 1: Create a Hive Table


CREATE TABLE sales_data (
    OrderID INT,
    Customer STRING,
    Product STRING,
    Amount FLOAT,
    `Date` STRING          -- backticks: DATE is a reserved keyword in HiveQL
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");   -- ignore the CSV header row

2. Loading Data into Hive


LOAD DATA INPATH '/data/sales_data.csv' INTO TABLE sales_data;

Querying Data in Hive

Finding Total Sales Per Customer


    SELECT Customer, SUM(Amount) AS Total_Spent
    FROM sales_data
    GROUP BY Customer;
    

Output:


+----------+-------------+
| Customer | Total_Spent |
+----------+-------------+
| Alice    | 52000       |
| Bob      | 30000       |
+----------+-------------+

Real-Life Use Case of Hive

Use Case: Analyzing Social Media Data

A social media company stores billions of user interactions (likes, comments, shares) in HDFS. They want to:

  • ✔ Identify the most liked posts.
  • ✔ Find users who comment the most.
  • ✔ Track engagement trends over time.

Without Hive: Writing Java-based MapReduce programs would take days.

With Hive: A simple SQL query can provide insights in minutes!

Advanced Hive Concepts

1. Partitioning in Hive

Problem: Suppose you have 1 billion rows in your sales_data table. Querying the entire dataset is slow.

Solution: Partition the table by date!


    CREATE TABLE sales_data_partitioned (
        OrderID INT,
        Customer STRING,
        Product STRING,
        Amount FLOAT
    ) PARTITIONED BY (`Date` STRING);
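
To see the benefit, load each day's rows into their own partition and filter on the partition column; Hive then reads only the matching partition directory instead of the full table. A minimal sketch, assuming the unpartitioned sales_data table above as the source:

-- Copy one day's rows into its own partition (the date value is illustrative)
INSERT OVERWRITE TABLE sales_data_partitioned PARTITION (`Date` = '2024-01-01')
SELECT OrderID, Customer, Product, Amount
FROM sales_data
WHERE `Date` = '2024-01-01';

-- This query scans only the Date=2024-01-01 partition, not all 1 billion rows
SELECT SUM(Amount) FROM sales_data_partitioned WHERE `Date` = '2024-01-01';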
    

2. Bucketing in Hive

Problem: What if we frequently filter, join, or sample by Customer? Partitioning by a high-cardinality column like Customer would create far too many small partitions.

Solution: Use bucketing to hash rows into a fixed number of evenly sized files!


    CREATE TABLE sales_data_bucketed (
        OrderID INT,
        Customer STRING,
        Product STRING,
        Amount FLOAT
    ) CLUSTERED BY (Customer) INTO 10 BUCKETS;
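
A quick way to see bucketing pay off is bucket sampling: because rows are hashed into a fixed number of files by Customer, Hive can answer a sampled query by reading a single bucket file. A hedged example against the table above:

-- Reads roughly 1/10 of the data by scanning a single bucket file
SELECT Customer, SUM(Amount) AS Total_Spent
FROM sales_data_bucketed TABLESAMPLE(BUCKET 1 OUT OF 10 ON Customer)
GROUP BY Customer;

Bucketing on Customer also lets Hive use bucketed map joins when both tables are clustered on the join column with compatible bucket counts.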
    

When to Use Hive vs. HBase

Feature      Hive (SQL on Hadoop)        HBase (NoSQL on Hadoop)
Query Type   Batch processing (SQL)      Real-time queries (NoSQL)
Data Type    Structured                  Unstructured, semi-structured
Speed        Slow for small queries      Fast for real-time lookups
Use Case     Data warehousing, reports   Messaging apps, real-time analytics

Summary

  • Apache Hive = SQL + Hadoop
  • ✔ Converts SQL queries into MapReduce, Tez, or Spark jobs.
  • ✔ Supports partitioning & bucketing for faster queries.
  • ✔ Best for batch processing & data warehousing.

Apache HBase – NoSQL on Hadoop

What is Apache HBase?

Apache HBase is a NoSQL database that runs on Hadoop and provides real-time access to large datasets.

Think of HBase as a Big Table Inside Hadoop:

  • Stores millions to billions of rows and thousands of columns efficiently.
  • Unlike Hive (which is best for batch processing), HBase is designed for real-time reads and writes.
  • It is modeled after Google’s Bigtable and works well for applications that need random, fast lookups.

Why Use HBase Instead of Hive?

Feature      Apache Hive (SQL)              Apache HBase (NoSQL)
Query Type   Batch processing (SQL)         Real-time queries (NoSQL)
Data Type    Structured (tables, columns)   Semi-structured (key-value)
Use Case     Analytics, reports             Fast read/write applications
Speed        Slow for individual lookups    Fast for real-time access

Example:

  • Hive: Best for analyzing customer purchases in an e-commerce store.
  • HBase: Best for storing user session data in a real-time web application.

Key Features of HBase

  • Column-Family Storage → Groups related columns into families that are stored together (efficient for sparse, wide tables).
  • Scalable & Distributed → Can handle billions of rows across multiple nodes.
  • Real-Time Data Access → Supports fast reads and writes for applications needing quick lookups.
  • Automatic Sharding → Data is automatically distributed across different servers (called RegionServers).

How Does HBase Work?

🛠 HBase Architecture

HBase consists of three main components:

  • HMaster → Manages and assigns regions to RegionServers.
  • RegionServers → Stores actual data and handles read/write requests.
  • HDFS → Used as the storage layer.

HBase stores data in a key-value format:

  • Each row has a unique RowKey.
  • Each row can have multiple column families, and each column can store multiple versions of data.

📄 Example HBase Table (Storing User Activity Data)

RowKey   Column Family: Activity   Column Family: Profile
u1       login_time: 12:00 PM      name: Alice
u2       login_time: 12:05 PM      name: Bob

✔ Unlike traditional row-oriented databases, HBase stores each column family's data together, which keeps sparse, wide tables compact and lookups by RowKey fast.

Getting Started with HBase

1. Creating an HBase Table

To create a table named users, run:


create 'users', 'activity', 'profile'

This creates a table with two column families: activity (for login times) and profile (for user details).

2. Inserting Data into HBase


put 'users', 'u1', 'activity:login_time', '12:00 PM'
put 'users', 'u1', 'profile:name', 'Alice'

3. Retrieving Data from HBase


get 'users', 'u1'

Output:


COLUMN               CELL
activity:login_time  timestamp=2024-01-01, value=12:00 PM
profile:name         timestamp=2024-01-01, value=Alice

✔ Unlike Hive, HBase supports real-time lookups without scanning the whole table!

Real-Life Use Case: How Facebook Uses HBase

Use Case: Facebook Messenger

Facebook needs to store and retrieve millions of messages per second in real time.

Why HBase?

  • ✔ Each message is stored with a unique RowKey (e.g., user123:message456).
  • Fast lookups → When a user opens Messenger, HBase quickly retrieves their chat history.
  • Scalability → Can handle billions of messages across different servers.

Without HBase: A traditional SQL database would struggle to handle so much data in real time.

With HBase: Facebook can efficiently store, retrieve, and analyze messages at a massive scale.

Advanced HBase Concepts

1. HBase vs. Relational Databases

Feature          HBase (NoSQL)                SQL Databases (MySQL, PostgreSQL)
Schema           Flexible, no fixed schema    Fixed schema (tables, columns)
Scalability      Horizontally scalable        Limited scalability
Transactions     No ACID transactions         Supports ACID transactions
Query Language   NoSQL API (Get, Put, Scan)   SQL queries

Scanning Data in HBase


scan 'users'
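
The shell's scan also accepts options to limit what comes back; column projections and row ranges like these are standard (the row keys shown are from the users table above):

scan 'users', {COLUMNS => ['activity:login_time'], LIMIT => 10}
scan 'users', {STARTROW => 'u1', STOPROW => 'u2'}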

Summary

  • HBase = NoSQL database for Hadoop
  • ✔ Supports real-time reads/writes for large datasets.
  • ✔ Used by companies like Facebook, LinkedIn, and Twitter.
  • ✔ Best for random access, fast lookups, and real-time applications.

Apache Pig – Scripting for Data Processing

What is Apache Pig?

Apache Pig is a high-level scripting language for processing large datasets in Hadoop.

Why Was Pig Created?

  • Writing MapReduce programs in Java is complex and time-consuming.
  • Pig provides Pig Latin, a simpler scripting language to process data without writing Java code.

Think of Pig as SQL + Scripting for Big Data:

  • Unlike SQL, Pig can handle both structured and unstructured data.
  • Runs on top of Hadoop and converts Pig Latin scripts into MapReduce jobs automatically.

Why Use Apache Pig?

  • Easier Than Java: Reduces the effort of writing MapReduce jobs.
  • Handles Any Data Format: Supports JSON, XML, CSV, and raw text files.
  • Efficient Data Transformation: Ideal for ETL tasks like filtering, grouping, and joining datasets.
  • Scalable & Fault-Tolerant: Runs on Hadoop clusters, handling terabytes of data.

Use Case: Analyzing E-Commerce Clickstream Data

An e-commerce company needs to analyze user clickstream data stored in Hadoop.

  • Without Pig: Developers write complex Java-based MapReduce programs.
  • With Pig: A simple Pig script can process the data in a few lines.

Apache Pig vs. Apache Hive

Feature     Apache Pig (Scripting)         Apache Hive (SQL)
Language    Pig Latin (procedural)         SQL-like queries
Use Case    ETL, data preparation          Analytics, reports
Structure   Handles semi-structured data   Best for structured data
Execution   Converts to MapReduce          Converts to MapReduce

When to Use What?

  • Use Pig for data cleaning, ETL, and complex transformations.
  • Use Hive for running SQL-like queries on large datasets.

How Does Pig Work?

🛠 Pig Execution Modes

Apache Pig runs in two modes:

  1. Local Mode → Runs on a single machine (for testing).
  2. Hadoop Mode → Runs on a Hadoop cluster for processing large datasets.

Command to Start Pig in Interactive Mode:


pig

This opens the Grunt Shell, where you can write Pig Latin commands.
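
The mode can also be chosen explicitly with the -x flag (the same flag used later to run a saved script):

pig -x local        # Grunt shell against the local filesystem (for testing)
pig -x mapreduce    # Grunt shell against the Hadoop cluster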

Writing a Pig Script – Hands-on Example

Scenario: Processing an E-Commerce Transactions Dataset

We have an e-commerce dataset with user transactions in a CSV file:

📄 transactions.csv


1001,Alice,Laptop,900,2024-06-01
1002,Bob,Phone,600,2024-06-02
1003,Charlie,Tablet,400,2024-06-03

1. Load Data into Pig


transactions = LOAD 'transactions.csv' USING PigStorage(',')
            AS (id: INT, name: CHARARRAY, product: CHARARRAY, amount: FLOAT, date: CHARARRAY);

2. Filter Transactions Above $500


high_value = FILTER transactions BY amount > 500;

3. Group Data by Product Type


grouped = GROUP transactions BY product;

4. Find Total Sales Per Product


total_sales = FOREACH grouped GENERATE group AS product, SUM(transactions.amount) AS total;

5. Store the Result in HDFS


STORE total_sales INTO '/output/sales' USING PigStorage(',');

Running the Pig Script

Save the Pig script as sales_analysis.pig

Run it in Hadoop mode:


pig -x mapreduce sales_analysis.pig

Output in HDFS:


Laptop,900.0
Phone,600.0
Tablet,400.0

✔ We successfully analyzed sales data using just a few lines of Pig Latin!

Real-Life Use Case: How Twitter Uses Pig

Use Case: Analyzing Tweets

Twitter stores millions of tweets per day in Hadoop and needs to process them for:

  • ✔ Hashtag trends
  • ✔ User engagement analytics
  • ✔ Spam detection

Why Pig?

  • ✔ Handles semi-structured data (JSON tweets) easily.
  • ✔ Scales well with huge datasets.
  • ✔ Faster than writing complex MapReduce programs.

Example: Extracting Hashtags from Tweets


tweets = LOAD 'tweets.json' USING JsonLoader('user:chararray, message:chararray');
hashtags = FOREACH tweets GENERATE FLATTEN(TOKENIZE(message)) AS word;
filtered = FILTER hashtags BY STARTSWITH(word, '#');

✔ Extracts hashtags from tweets and prepares them for analysis.

Summary

  • Apache Pig simplifies big data processing with Pig Latin scripts.
  • ✔ Best for ETL, data transformations, and semi-structured data.
  • ✔ Used by Twitter, LinkedIn, Netflix, and Yahoo.
  • ✔ Runs on Hadoop clusters, making it scalable and fault-tolerant.

Apache Sqoop – Data Transfer Between Hadoop & Databases

What is Apache Sqoop?

Apache Sqoop is a data transfer tool used to move large volumes of data between:

  • Relational Databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.)
  • Hadoop Ecosystem (HDFS, Hive, HBase, etc.)

Why Was Sqoop Created?

  • Companies store vast amounts of structured data in relational databases.
  • To analyze this data efficiently, it needs to be moved into Hadoop.
  • Manually writing ETL scripts for this is time-consuming and error-prone.
  • Sqoop automates the entire data transfer process efficiently and securely.

Think of Sqoop as a data pipeline between Hadoop & Databases!

Why Use Apache Sqoop?

  • Fast & Efficient → Transfers terabytes of data quickly.
  • Parallel Data Transfer → Uses multiple mappers for faster processing.
  • Supports Incremental Loads → Imports only new data instead of full tables.
  • Easy Integration → Works seamlessly with Hive, HBase, HDFS, and RDBMS.
  • Secure Data Transfer → Supports Kerberos authentication for security.

Use Case: Analyzing Bank Transactions

A bank wants to analyze customer transaction data stored in a MySQL database.

  • Without Sqoop: Developers must write complex ETL scripts.
  • With Sqoop: A simple command moves data into Hadoop effortlessly.

How Does Sqoop Work?

Apache Sqoop executes MapReduce jobs in the background to transfer data efficiently.

🛠 Sqoop Execution Modes

  1. Import Mode → Transfers data from RDBMS → Hadoop (HDFS, Hive, HBase).
  2. Export Mode → Transfers data from Hadoop → RDBMS.

Command to Check Available Databases in MySQL:


sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password mypassword

Hands-on Example: Import Data from MySQL to Hadoop

Scenario: We have a transactions table in MySQL.

📄 MySQL Table – transactions


transaction_id  customer_name   amount   transaction_date
101             Alice           500      2024-06-01
102             Bob             700      2024-06-02
103             Charlie         300      2024-06-03

1. Import MySQL Data into HDFS


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--target-dir /user/hadoop/transactions \
-m 4

What happens here?

  • ✔ Reads data from the MySQL transactions table.
  • ✔ Writes data to HDFS in /user/hadoop/transactions.
  • ✔ Uses 4 parallel mappers (-m 4) for faster transfer.

2. Import MySQL Data into Hive


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--hive-import \
--hive-table hive_transactions

What happens here?

  • ✔ Creates a Hive table named hive_transactions.
  • ✔ Transfers MySQL data directly into Hive for SQL queries.

3. Import Only New Data (Incremental Load)

Instead of importing all data every time, we can import only new rows.


sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--incremental append \
--check-column transaction_id \
--last-value 103

What happens here?

  • ✔ Only imports rows with transaction_id > 103.
  • ✔ Avoids duplicate data in Hadoop.
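
Rather than tracking --last-value by hand, the incremental import can be wrapped in a Sqoop saved job; Sqoop then records the last imported value after each run. A minimal sketch (the job name daily_transactions is arbitrary):

sqoop job --create daily_transactions -- import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--incremental append \
--check-column transaction_id \
--last-value 0

# Each execution imports only rows added since the previous run
# (in production, prefer --password-file over --password on the command line)
sqoop job --exec daily_transactions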

4. Export Processed Data from Hadoop to MySQL

After analyzing data in Hadoop, we may need to send results back to MySQL.


sqoop export \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table aggregated_sales \
--export-dir /user/hadoop/sales_results

What happens here?

  • ✔ Reads processed sales data from Hadoop.
  • ✔ Inserts data into the aggregated_sales table in MySQL.

Real-Life Use Case: How eBay Uses Sqoop

Use Case: Analyzing Customer Purchase Patterns

eBay stores billions of customer orders in MySQL and PostgreSQL databases.

To gain insights, they need to analyze this data in Hadoop.

Why Sqoop?

  • Fast & Scalable → Handles huge transaction data efficiently.
  • Incremental Loads → Only new purchases are imported daily.
  • Integrates with Hive & Spark for further processing.

Example:


sqoop import \
--connect jdbc:mysql://ebay_db/orders \
--username admin \
--password secret \
--table customer_orders \
--incremental lastmodified \
--check-column last_update_time \
--last-value '2024-06-10 00:00:00'

✔ Transfers only newly updated orders since June 10.

Summary

  • Apache Sqoop is the bridge between Hadoop & Databases.
  • ✔ Supports fast, parallel imports & exports.
  • ✔ Ideal for big data ingestion pipelines.
  • ✔ Used by banks, e-commerce, telecom, and finance companies.

Apache Flume – Streaming Data into Hadoop

What is Apache Flume?

Apache Flume is a data ingestion tool designed to collect, aggregate, and transport large amounts of streaming data into Hadoop (HDFS, Hive, HBase).

Why Was Flume Created?

  • Companies generate vast amounts of logs, sensor data, social media feeds, and IoT data.
  • Traditional tools (like Sqoop) cannot handle continuous data streams.
  • Flume enables real-time ingestion into Hadoop without data loss.

Think of Flume as a pipeline that continuously transports data into Hadoop!

Why Use Apache Flume?

  • Handles Real-Time Data → Ingests logs, social media feeds, sensor data.
  • Reliable & Scalable → With durable (file-backed) channels, events are not lost even if Hadoop is temporarily down.
  • Flexible Architecture → Can be customized for any data source.
  • Supports Multiple Sources → Works with Kafka, Twitter, IoT devices, log files, etc.
  • Integrates with Hadoop Ecosystem → Sends data to HDFS, Hive, HBase.

Use Case: Analyzing Social Media Activities

A social media company wants to analyze user activities (likes, shares, comments) in real time.

  • Without Flume: Engineers manually move logs to Hadoop (slow, inefficient).
  • With Flume: Flume automatically streams logs to Hadoop in real-time.

How Does Apache Flume Work?

Flume uses an agent-based architecture with the following components:

🛠 Flume Architecture

  1. Source → Collects data from log files, syslogs, Kafka, Twitter, sensors, etc.
  2. Channel → Acts as a temporary storage buffer (like RAM).
  3. Sink → Sends data to HDFS, Hive, HBase, Kafka, etc.

Think of it like a water pipeline:

  • Source → Collects water (data).
  • Channel → Temporarily stores it (buffer).
  • Sink → Transfers it to a final location (Hadoop).

Hands-on Example: Streaming Log Data to Hadoop

Scenario: Collecting Apache Server Logs and Storing in HDFS

📄 Sample Log File – access.log


192.168.1.10 - - [10/Jul/2024:12:15:22 +0000] "GET /index.html HTTP/1.1" 200 512
192.168.1.11 - - [10/Jul/2024:12:15:23 +0000] "POST /login HTTP/1.1" 302 0
192.168.1.12 - - [10/Jul/2024:12:15:24 +0000] "GET /dashboard HTTP/1.1" 200 1024

✔ Each line represents a web request made by a user.

1. Install Apache Flume


sudo apt-get install flume-ng

2. Configure Flume Agent

Flume agents are defined in a flume.conf file.

📄 flume.conf (Configuration File)


# Define Agent Name
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Define Source - Collects data from a log file
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /var/log/apache2/access.log

# Define Channel - Temporary storage
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000

# Define Sink - Sends data to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/flume_logs/
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 10000
agent1.sinks.sink1.hdfs.rollCount = 10

# Bind Source and Sink to Channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

What happens here?

  • Source → Collects logs from /var/log/apache2/access.log.
  • Channel → Stores data before sending to HDFS.
  • Sink → Saves logs in /user/hadoop/flume_logs/ in HDFS.

3. Start Flume Agent


flume-ng agent --name agent1 --conf ./conf/ --conf-file flume.conf -Dflume.root.logger=INFO,console

✔ Now, Flume continuously streams log data into Hadoop in real-time!

4. Verify Data in HDFS


hadoop fs -ls /user/hadoop/flume_logs/

✔ If successful, Flume is now ingesting live data into Hadoop! 🎯
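
To peek at the ingested events themselves, cat one of the files Flume wrote (by default the files are prefixed FlumeData and carry a timestamp suffix, hence the wildcard):

hadoop fs -cat /user/hadoop/flume_logs/FlumeData.* | head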

Real-Life Use Case: How Twitter Uses Flume

Use Case: Analyzing Social Media Trends

Twitter generates millions of tweets per second.

To analyze real-time trends, they need to store tweets in Hadoop continuously.

Why Flume?

  • Scalable – Handles massive tweet volumes.
  • Reliable – No data loss, even if Hadoop is down.
  • Integrates with Spark & Hive for real-time analytics.

Example: Streaming Twitter Data to Hadoop


agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.source1.consumerKey = xxxxx
agent1.sources.source1.consumerSecret = xxxxx
agent1.sources.source1.accessToken = xxxxx
agent1.sources.source1.accessTokenSecret = xxxxx
agent1.sources.source1.channels = channel1

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/twitter_logs/
agent1.sinks.sink1.channel = channel1

✔ This configuration streams tweets directly into Hadoop!

Summary

  • Apache Flume is used for real-time data ingestion into Hadoop.
  • ✔ Handles log files, social media feeds, IoT sensor data, etc.
  • ✔ Uses Source → Channel → Sink model.
  • ✔ Supports Kafka, Twitter, MySQL, HDFS, and HBase integration.
  • ✔ Used by banks, social media companies, e-commerce, and IoT platforms.

Apache Oozie – Workflow Scheduling in Hadoop

What is Apache Oozie?

Apache Oozie is a workflow scheduler designed for managing and automating Hadoop jobs.

Why Was Oozie Created?

  • Big data workflows involve multiple steps (e.g., data ingestion → transformation → analysis).
  • Manually triggering jobs like Flume → Hive → Spark → Sqoop is inefficient.
  • Oozie automates job execution, manages dependencies, and schedules workflows.

Think of Oozie as a "Task Manager" for Hadoop jobs!

Why Use Apache Oozie?

  • Automates Workflows → Runs Hadoop jobs in a predefined sequence.
  • Manages Dependencies → Ensures job order (Job A → Job B → Job C).
  • Schedules Jobs → Runs workflows daily, weekly, or in real-time.
  • Handles Failures → Retries failed jobs or sends alerts.
  • Supports Multiple Hadoop Components → Works with MapReduce, Hive, Pig, Sqoop, Spark, Flume, and Shell scripts.

Example: Automating an E-Commerce Data Pipeline

  • ✔ Step 1: Flume ingests logs from web servers.
  • ✔ Step 2: Hive processes data for sales reports.
  • ✔ Step 3: Sqoop exports processed data to MySQL.
  • ✔ Step 4: Spark runs machine learning predictions.

Instead of running each step manually, Oozie automates the entire workflow!
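
Workflows are described in a workflow.xml stored in HDFS. The sketch below is a trimmed, hypothetical version of the pipeline above: the script name, table, paths, and the ${jobTracker}/${nameNode} properties are placeholders you would define in job.properties. It chains a Hive step to a Sqoop export and kills the run if either step fails:

<workflow-app name="sales-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-report"/>

    <!-- Step 2 of the pipeline: Hive builds the sales report -->
    <action name="hive-report">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>sales_report.hql</script>
        </hive>
        <ok to="sqoop-export"/>
        <error to="fail"/>
    </action>

    <!-- Step 3 of the pipeline: Sqoop exports the report back to MySQL -->
    <action name="sqoop-export">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>export --connect jdbc:mysql://localhost/salesdb --table sales_report --export-dir /user/hadoop/sales_results</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>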

How Does Apache Oozie Work?

Oozie has two main types of jobs:

  1. Workflow Jobs → Define a series of tasks that must be executed in order.
  2. Coordinator Jobs → Schedule workflow execution at fixed intervals (e.g., hourly, daily).

Think of Oozie like a train schedule:

  • Workflow Job: Defines the train route (which stations to stop at).
  • Coordinator Job: Decides when the train should run.
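
In practice, the workflow above is deployed to an HDFS directory and a coordinator app tells Oozie when to run it. A minimal daily schedule might look like this (dates, path, and names are illustrative):

<coordinator-app name="daily-sales" frequency="${coord:days(1)}"
                 start="2024-06-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory that contains the workflow.xml shown earlier -->
            <app-path>hdfs:///user/hadoop/oozie/sales-pipeline</app-path>
        </workflow>
    </action>
</coordinator-app>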