Unleashing Hadoop's Power: Hive, HBase, Pig, Sqoop, Flume, and Oozie
Apache Hive – SQL on Hadoop
What is Apache Hive?
Apache Hive is a data warehouse system built on top of Hadoop that allows users to process large datasets using SQL-like queries.
Think of Hive as a translator:
- Converts SQL queries into MapReduce, Tez, or Spark jobs.
- Makes querying big data easier without writing Java or Python.
- Designed for batch processing, making it ideal for analytics and reporting.
Why Was Hive Created?
Imagine you work at an e-commerce company storing millions of customer transactions in HDFS. You need to:
- Find total sales per product.
- Identify the top 10 customers.
- Analyze buying patterns over time.
Without Hive, you would have to write complex MapReduce programs in Java. Hive eliminates this need by allowing you to run simple SQL queries instead.
How Does Hive Work?
Hive is NOT a database. It is a query engine for Hadoop.
It consists of:
- Hive Query Language (HQL) – SQL-like queries.
- Metastore – Stores table metadata (columns, data types, location in HDFS).
- Execution Engine – Converts HQL queries into MapReduce, Tez, or Spark jobs.
- HDFS – Stores the actual data.
Hive Query Flow:
1. The user writes an HQL query.
2. Hive converts it into a MapReduce, Tez, or Spark job.
3. The job runs on Hadoop, processing data from HDFS.
4. Results are returned in a structured table format.
Key Features of Hive
- SQL-Like Queries – Uses HiveQL (similar to MySQL, PostgreSQL).
- Handles Large Datasets – Designed for petabyte-scale processing.
- Schema on Read – Data structure is defined at query time.
- Batch Processing – Best for reports, aggregations, and analytics.
- Supports Multiple Storage Formats – Text, ORC, Parquet, Avro.
Getting Started with Hive
1. Creating a Hive Table
Assume you have a file sales_data.csv stored in HDFS:
OrderID,Customer,Product,Amount,Date
101,Alice,Laptop,50000,2024-01-01
102,Bob,Phone,30000,2024-01-02
103,Alice,Headphones,2000,2024-01-02
Step 1: Create a Hive Table
CREATE TABLE sales_data (
    OrderID INT,
    Customer STRING,
    Product STRING,
    Amount FLOAT,
    `Date` STRING  -- backticks because DATE is a reserved keyword in newer Hive versions
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
2. Loading Data into Hive
LOAD DATA INPATH '/data/sales_data.csv' INTO TABLE sales_data;
Querying Data in Hive
Finding Total Sales Per Customer
SELECT Customer, SUM(Amount) AS Total_Spent
FROM sales_data
GROUP BY Customer;
Output:
+----------+-------------+
| Customer | Total_Spent |
+----------+-------------+
| Alice    | 52000       |
| Bob      | 30000       |
+----------+-------------+
Real-Life Use Case of Hive
Use Case: Analyzing Social Media Data
A social media company stores billions of user interactions (likes, comments, shares) in HDFS. They want to:
- Identify the most liked posts.
- Find users who comment the most.
- Track engagement trends over time.
Without Hive: writing Java-based MapReduce programs would take days.
With Hive: a simple SQL query can provide insights in minutes!
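For instance, a minimal sketch of the "most liked posts" query, assuming a hypothetical interactions table with post_id, user_id, and action columns:
-- hypothetical table of user interactions loaded into Hive
SELECT post_id, COUNT(*) AS like_count
FROM interactions
WHERE action = 'like'
GROUP BY post_id
ORDER BY like_count DESC
LIMIT 10;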
Advanced Hive Concepts
1. Partitioning in Hive
Problem: Suppose you have 1 billion rows in your sales_data table. Querying the entire dataset is slow.
Solution: Partition the table by date!
CREATE TABLE sales_data_partitioned (
    OrderID INT,
    Customer STRING,
    Product STRING,
    Amount FLOAT
) PARTITIONED BY (`Date` STRING);
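A minimal sketch of how the partition is then used, assuming a per-day input file such as /data/sales_2024-01-01.csv (hypothetical path):
-- load one day's data into its own partition
LOAD DATA INPATH '/data/sales_2024-01-01.csv'
INTO TABLE sales_data_partitioned PARTITION (`Date` = '2024-01-01');
-- only the matching partition is scanned (partition pruning)
SELECT Product, SUM(Amount) AS daily_sales
FROM sales_data_partitioned
WHERE `Date` = '2024-01-01'
GROUP BY Product;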
2. Bucketing in Hive
Problem: What if we need to filter by customer name?
Solution: Use bucketing to distribute data evenly!
CREATE TABLE sales_data_bucketed (
OrderID INT,
Customer STRING,
Product STRING,
Amount FLOAT
) CLUSTERED BY (Customer) INTO 10 BUCKETS;
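A minimal sketch of how the bucketed table could be populated and sampled, assuming the data comes from the sales_data table above:
-- make INSERT respect the bucket definition (needed on older Hive versions)
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE sales_data_bucketed
SELECT OrderID, Customer, Product, Amount FROM sales_data;
-- read a single bucket instead of scanning all ten
SELECT * FROM sales_data_bucketed
TABLESAMPLE (BUCKET 1 OUT OF 10 ON Customer);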
When to Use Hive vs. HBase
Feature | Hive (SQL on Hadoop) | HBase (NoSQL on Hadoop) |
---|---|---|
Query Type | Batch processing (SQL) | Real-time queries (NoSQL) |
Data Type | Structured | Unstructured, Semi-structured |
Speed | Slow for small queries | Fast for real-time lookups |
Use Case | Data warehousing, reports | Messaging apps, real-time analytics |
Summary
- Apache Hive = SQL + Hadoop.
- Converts SQL queries into MapReduce, Tez, or Spark jobs.
- Supports partitioning and bucketing for faster queries.
- Best for batch processing and data warehousing.
Apache HBase – NoSQL on Hadoop
What is Apache HBase?
Apache HBase is a NoSQL database that runs on Hadoop and provides real-time access to large datasets.
Think of HBase as a big table inside Hadoop:
- Stores millions to billions of rows and thousands of columns efficiently.
- Unlike Hive (which is best for batch processing), HBase is designed for real-time reads and writes.
- It is modeled after Google's Bigtable and works well for applications that need random, fast lookups.
Why Use HBase Instead of Hive?
Feature | Apache Hive (SQL) | Apache HBase (NoSQL) |
---|---|---|
Query Type | Batch processing (SQL) | Real-time queries (NoSQL) |
Data Type | Structured (tables, columns) | Semi-structured (key-value) |
Use Case | Analytics, reports | Fast read/write applications |
Speed | Slow for individual lookups | Fast for real-time access |
Example:
- Hive: Best for analyzing customer purchases in an e-commerce store.
- HBase: Best for storing user session data in a real-time web application.
Key Features of HBase
- Column-Oriented Storage – Stores data in columns instead of rows (efficient for large datasets).
- Scalable & Distributed – Can handle billions of rows across multiple nodes.
- Real-Time Data Access – Supports fast reads and writes for applications needing quick lookups.
- Automatic Sharding – Data is automatically distributed across different servers (called RegionServers).
How Does HBase Work?
HBase Architecture
HBase consists of three main components:
- HMaster – Manages and assigns regions to RegionServers.
- RegionServers – Store the actual data and handle read/write requests.
- HDFS – Used as the underlying storage layer.
HBase stores data in a key-value format:
- Each row has a unique RowKey.
- Each row can have multiple column families, and each column can store multiple versions of data.
Example HBase Table (Storing User Activity Data)
RowKey | Column Family: Activity | Column Family: Profile |
---|---|---|
u1 | login_time: 12:00 PM | name: Alice |
u2 | login_time: 12:05 PM | name: Bob |
Unlike traditional row-oriented databases, HBase groups data by column family rather than by row, which keeps queries that touch only a few columns efficient even on very wide tables.
Getting Started with HBase
1. Creating an HBase Table
To create a table named users, run:
create 'users', 'activity', 'profile'
This creates a table with two column families: activity (for login times) and profile (for user details).
2. Inserting Data into HBase
put 'users', 'u1', 'activity:login_time', '12:00 PM'
put 'users', 'u1', 'profile:name', 'Alice'
3. Retrieving Data from HBase
get 'users', 'u1'
Output:
COLUMN CELL
activity:login_time timestamp=2024-01-01, value=12:00 PM
profile:name timestamp=2024-01-01, value=Alice
Unlike Hive, HBase supports real-time lookups without scanning the whole table!
Real-Life Use Case: How Facebook Uses HBase
Use Case: Facebook Messenger
Facebook needs to store and retrieve millions of messages per second in real time.
Why HBase?
- Each message is stored with a unique RowKey (e.g., user123:message456).
- Fast lookups – When a user opens Messenger, HBase quickly retrieves their chat history.
- Scalability – Can handle billions of messages across different servers.
Without HBase: a traditional SQL database would struggle to handle so much data in real time.
With HBase: Facebook can efficiently store, retrieve, and analyze messages at a massive scale.
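A minimal sketch of what such lookups could look like in the HBase shell, assuming a hypothetical messages table keyed as user:messageId:
# fetch one message by its composite RowKey (table name and keys are illustrative)
get 'messages', 'user123:message456'
# fetch a user's chat history by scanning their RowKey prefix
# (';' is the character after ':', so the scan stops at the end of the prefix)
scan 'messages', {STARTROW => 'user123:', STOPROW => 'user123;'}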
Advanced HBase Concepts
1. HBase vs. Relational Databases
Feature | HBase (NoSQL) | SQL Databases (MySQL, PostgreSQL) |
---|---|---|
Schema | Flexible, No fixed schema | Fixed schema (tables, columns) |
Scalability | Horizontally scalable | Limited scalability |
Transactions | No ACID transactions | Supports ACID transactions |
Query Language | NoSQL API (Get, Put, Scan) | SQL queries |
Scanning Data in HBase
scan 'users'
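The scan command also accepts options to narrow the output, for example returning only one column and capping the number of rows:
scan 'users', {COLUMNS => ['activity:login_time'], LIMIT => 5}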
Summary
- HBase = NoSQL database for Hadoop.
- Supports real-time reads/writes for large datasets.
- Used by companies like Facebook, LinkedIn, and Twitter.
- Best for random access, fast lookups, and real-time applications.
Apache Pig – Scripting for Data Processing
What is Apache Pig?
Apache Pig is a high-level scripting language for processing large datasets in Hadoop.
Why Was Pig Created?
- Writing MapReduce programs in Java is complex and time-consuming.
- Pig provides Pig Latin, a simpler scripting language to process data without writing Java code.
Think of Pig as SQL + scripting for big data:
- Unlike SQL, Pig can handle both structured and unstructured data.
- Runs on top of Hadoop and converts Pig Latin scripts into MapReduce jobs automatically.
Why Use Apache Pig?
- Easier Than Java: Reduces the effort of writing MapReduce jobs.
- Handles Any Data Format: Supports JSON, XML, CSV, and raw text files.
- Efficient Data Transformation: Ideal for ETL tasks like filtering, grouping, and joining datasets.
- Scalable & Fault-Tolerant: Runs on Hadoop clusters, handling terabytes of data.
Use Case: Analyzing E-Commerce Clickstream Data
An e-commerce company needs to analyze user clickstream data stored in Hadoop.
- Without Pig: developers write complex Java-based MapReduce programs.
- With Pig: a simple Pig script can process the data in a few lines.
Apache Pig vs. Apache Hive
Feature | Apache Pig (Scripting) | Apache Hive (SQL) |
---|---|---|
Language | Pig Latin (procedural) | SQL-like queries |
Use Case | ETL, data preparation | Analytics, reports |
Structure | Handles semi-structured data | Best for structured data |
Execution | Converts to MapReduce | Converts to MapReduce |
When to Use What?
- Use Pig for data cleaning, ETL, and complex transformations.
- Use Hive for running SQL-like queries on large datasets.
How Does Pig Work?
Pig Execution Modes
Apache Pig runs in two modes:
1. Local Mode – Runs on a single machine (for testing).
2. Hadoop (MapReduce) Mode – Runs on a Hadoop cluster for processing large datasets.
Command to start Pig in interactive mode:
pig
This opens the Grunt shell, where you can write Pig Latin commands.
Writing a Pig Script – Hands-on Example
Scenario: Processing an E-Commerce Transactions Dataset
We have an e-commerce dataset with user transactions in a CSV file:
transactions.csv
1001,Alice,Laptop,900,2024-06-01
1002,Bob,Phone,600,2024-06-02
1003,Charlie,Tablet,400,2024-06-03
1. Load Data into Pig
transactions = LOAD 'transactions.csv' USING PigStorage(',')
AS (id: INT, name: CHARARRAY, product: CHARARRAY, amount: FLOAT, date: CHARARRAY);
2. Filter Transactions Above $500
high_value = FILTER transactions BY amount > 500;
3. Group Data by Product Type
grouped = GROUP transactions BY product;
4. Find Total Sales Per Product
total_sales = FOREACH grouped GENERATE group AS product, SUM(transactions.amount) AS total;
5. Store the Result in HDFS
STORE total_sales INTO '/output/sales' USING PigStorage(',');
Running the Pig Script
Save the Pig script as sales_analysis.pig and run it in Hadoop (MapReduce) mode:
pig -x mapreduce sales_analysis.pig
Output in HDFS:
Laptop,900
Phone,600
Tablet,400
We successfully analyzed sales data using just a few lines of Pig Latin!
Real-Life Use Case: How Twitter Uses Pig
Use Case: Analyzing Tweets
Twitter stores millions of tweets per day in Hadoop and needs to process them for:
- Hashtag trends
- User engagement analytics
- Spam detection
Why Pig?
- Handles semi-structured data (JSON tweets) easily.
- Scales well with huge datasets.
- Faster than writing complex MapReduce programs.
Example: Extracting Hashtags from Tweets
tweets = LOAD 'tweets.json' USING JsonLoader('user:chararray, message:chararray');
hashtags = FOREACH tweets GENERATE FLATTEN(TOKENIZE(message)) AS word;
filtered = FILTER hashtags BY STARTSWITH(word, '#');
This extracts hashtags from tweets and prepares them for analysis; the continuation below shows how to turn them into trend counts.
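A possible continuation of the same sketch, counting how often each hashtag appears (still assuming the hypothetical tweets.json input above):
-- count hashtag frequency and rank the result to surface trends
grouped_tags = GROUP filtered BY word;
tag_counts = FOREACH grouped_tags GENERATE group AS hashtag, COUNT(filtered) AS uses;
top_tags = ORDER tag_counts BY uses DESC;
DUMP top_tags;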
Summary
- Apache Pig simplifies big data processing with Pig Latin scripts.
- Best for ETL, data transformations, and semi-structured data.
- Used by Twitter, LinkedIn, Netflix, and Yahoo.
- Runs on Hadoop clusters, making it scalable and fault-tolerant.
Apache Sqoop – Data Transfer Between Hadoop & Databases
What is Apache Sqoop?
Apache Sqoop is a data transfer tool used to move large volumes of data between:
- Relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.)
- The Hadoop ecosystem (HDFS, Hive, HBase, etc.)
Why Was Sqoop Created?
- Companies store vast amounts of structured data in relational databases.
- To analyze this data efficiently, it needs to be moved into Hadoop.
- Manually writing ETL scripts for this is time-consuming and error-prone.
- Sqoop automates the entire data transfer process efficiently and securely.
Think of Sqoop as a data pipeline between Hadoop and databases!
Why Use Apache Sqoop?
- Fast & Efficient – Transfers terabytes of data quickly.
- Parallel Data Transfer – Uses multiple mappers for faster processing.
- Supports Incremental Loads – Imports only new data instead of full tables.
- Easy Integration – Works seamlessly with Hive, HBase, HDFS, and RDBMSs.
- Secure Data Transfer – Supports Kerberos authentication for security.
Use Case: Analyzing Bank Transactions
A bank wants to analyze customer transaction data stored in a MySQL database.
- Without Sqoop: developers must write complex ETL scripts.
- With Sqoop: a simple command moves data into Hadoop effortlessly.
How Does Sqoop Work?
Apache Sqoop executes MapReduce jobs in the background to transfer data efficiently.
Sqoop Execution Modes
1. Import Mode – Transfers data from an RDBMS → Hadoop (HDFS, Hive, HBase).
2. Export Mode – Transfers data from Hadoop → an RDBMS.
Command to check the available databases in MySQL:
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password mypassword
Hands-on Example: Import Data from MySQL to Hadoop
Scenario: We have a transactions table in MySQL.
MySQL Table – transactions
transaction_id | customer_name | amount | transaction_date |
---|---|---|---|
101 | Alice | 500 | 2024-06-01 |
102 | Bob | 700 | 2024-06-02 |
103 | Charlie | 300 | 2024-06-03 |
1. Import MySQL Data into HDFS
sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--target-dir /user/hadoop/transactions \
-m 4
What happens here?
- Reads data from the MySQL transactions table.
- Writes the data to HDFS in /user/hadoop/transactions.
- Uses 4 parallel mappers (-m 4) for faster transfer.
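To confirm the import, you can list and preview the generated files (part-m-00000 is the standard MapReduce output naming; the exact file names may differ):
hadoop fs -ls /user/hadoop/transactions
hadoop fs -cat /user/hadoop/transactions/part-m-00000 | head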
2. Import MySQL Data into Hive
sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--hive-import \
--hive-table hive_transactions
What happens here?
- Creates a Hive table named hive_transactions.
- Transfers the MySQL data directly into Hive for SQL queries.
3. Import Only New Data (Incremental Load)
Instead of importing all data every time, we can import only new rows.
sqoop import \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table transactions \
--incremental append \
--check-column transaction_id \
--last-value 103
What happens here?
- Only imports rows with transaction_id > 103.
- Avoids duplicate data in Hadoop.
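Rather than tracking --last-value by hand, a saved Sqoop job can remember it between runs; a minimal sketch, with an illustrative job name:
sqoop job --create daily_transactions -- import \
  --connect jdbc:mysql://localhost/bankdb \
  --username root \
  --table transactions \
  --incremental append \
  --check-column transaction_id \
  --last-value 0
# each run updates the stored last-value automatically
# (depending on configuration, Sqoop may prompt for the database password)
sqoop job --exec daily_transactions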
4. Export Processed Data from Hadoop to MySQL
After analyzing data in Hadoop, we may need to send the results back to MySQL.
sqoop export \
--connect jdbc:mysql://localhost/bankdb \
--username root \
--password mypassword \
--table aggregated_sales \
--export-dir /user/hadoop/sales_results
What happens here?
- Reads the processed sales data from Hadoop.
- Inserts the data into the aggregated_sales table in MySQL.
Real-Life Use Case: How eBay Uses Sqoop
Use Case: Analyzing Customer Purchase Patterns
eBay stores billions of customer orders in MySQL and PostgreSQL databases.
To gain insights, they need to analyze this data in Hadoop.
Why Sqoop?
- Fast & Scalable – Handles huge transaction data efficiently.
- Incremental Loads – Only new purchases are imported daily.
- Integrates with Hive & Spark for further processing.
Example:
sqoop import \
--connect jdbc:mysql://ebay_db/orders \
--username admin \
--password secret \
--table customer_orders \
--incremental lastmodified \
--check-column last_update_time \
--last-value '2024-06-10 00:00:00'
Transfers only newly updated orders since June 10.
Summary
- Apache Sqoop is the bridge between Hadoop & databases.
- Supports fast, parallel imports & exports.
- Ideal for big data ingestion pipelines.
- Used by banks, e-commerce, telecom, and finance companies.
Apache Flume – Streaming Data into Hadoop
What is Apache Flume?
Apache Flume is a data ingestion tool designed to collect, aggregate, and transport large amounts of streaming data into Hadoop (HDFS, Hive, HBase).
Why Was Flume Created?
- Companies generate vast amounts of logs, sensor data, social media feeds, and IoT data.
- Traditional tools (like Sqoop) cannot handle continuous data streams.
- Flume enables real-time ingestion into Hadoop without data loss.
Think of Flume as a pipeline that continuously transports data into Hadoop!
Why Use Apache Flume?
- Handles Real-Time Data – Ingests logs, social media feeds, sensor data.
- Reliable & Scalable – Ensures no data loss, even if Hadoop crashes.
- Flexible Architecture – Can be customized for any data source.
- Supports Multiple Sources – Works with Kafka, Twitter, IoT devices, log files, etc.
- Integrates with the Hadoop Ecosystem – Sends data to HDFS, Hive, HBase.
Use Case: Analyzing Social Media Activity
A social media company wants to analyze user activities (likes, shares, comments) in real time.
- Without Flume: engineers manually move logs to Hadoop (slow, inefficient).
- With Flume: Flume automatically streams logs to Hadoop in real time.
How Does Apache Flume Work?
Flume uses an agent-based architecture with the following components:
Flume Architecture
1. Source – Collects data from log files, syslog, Kafka, Twitter, sensors, etc.
2. Channel – Acts as a temporary storage buffer (like RAM).
3. Sink – Sends data to HDFS, Hive, HBase, Kafka, etc.
Think of it like a water pipeline:
- Source – collects the water (data).
- Channel – temporarily stores it (a buffer).
- Sink – transfers it to the final location (Hadoop).
Hands-on Example: Streaming Log Data to Hadoop
Scenario: Collecting Apache Server Logs and Storing Them in HDFS
Sample Log File – access.log
192.168.1.10 - - [10/Jul/2024:12:15:22 +0000] "GET /index.html HTTP/1.1" 200 512
192.168.1.11 - - [10/Jul/2024:12:15:23 +0000] "POST /login HTTP/1.1" 302 0
192.168.1.12 - - [10/Jul/2024:12:15:24 +0000] "GET /dashboard HTTP/1.1" 200 1024
Each line represents a web request made by a user.
1. Install Apache Flume
sudo apt-get install flume-ng
2. Configure the Flume Agent
Flume agents are defined in a flume.conf file.
flume.conf (Configuration File)
# Define Agent Name
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
# Define Source - Collects data from a log file
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /var/log/apache2/access.log
# Define Channel - Temporary storage
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
# Define Sink - Sends data to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/flume_logs/
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 10000
agent1.sinks.sink1.hdfs.rollCount = 10
# Bind Source and Sink to Channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
What happens here?
- Source – Collects logs from /var/log/apache2/access.log.
- Channel – Stores the data before sending it to HDFS.
- Sink – Saves the logs in /user/hadoop/flume_logs/ in HDFS.
3. Start the Flume Agent
flume-ng agent --name agent1 --conf ./conf/ --conf-file flume.conf -Dflume.root.logger=INFO,console
Now Flume continuously streams log data into Hadoop in real time!
4. Verify the Data in HDFS
hadoop fs -ls /user/hadoop/flume_logs/
If files appear here, Flume is ingesting live data into Hadoop!
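To inspect what was written, you can print the first few lines of a file (FlumeData is the default HDFS sink file prefix; the exact file names will vary):
hadoop fs -cat /user/hadoop/flume_logs/FlumeData.* | head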
Real-Life Use Case: How Twitter Uses Flume
Use Case: Analyzing Social Media Trends
Twitter generates millions of tweets every day.
To analyze real-time trends, they need to store tweets in Hadoop continuously.
Why Flume?
- Scalable – Handles massive tweet volumes.
- Reliable – No data loss, even if Hadoop is down.
- Integrates with Spark & Hive for real-time analytics.
Example: Streaming Twitter Data to Hadoop
agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource
agent1.sources.source1.consumerKey = xxxxx
agent1.sources.source1.consumerSecret = xxxxx
agent1.sources.source1.accessToken = xxxxx
agent1.sources.source1.accessTokenSecret = xxxxx
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/hadoop/twitter_logs/
This configuration streams tweets directly into Hadoop!
Summary
- Apache Flume is used for real-time data ingestion into Hadoop.
- Handles log files, social media feeds, IoT sensor data, etc.
- Uses the Source → Channel → Sink model.
- Supports Kafka, Twitter, HDFS, and HBase integration.
- Used by banks, social media companies, e-commerce, and IoT platforms.
Apache Oozie – Workflow Scheduling in Hadoop
What is Apache Oozie?
Apache Oozie is a workflow scheduler designed for managing and automating Hadoop jobs.
Why Was Oozie Created?
- Big data workflows involve multiple steps (e.g., data ingestion → transformation → analysis).
- Manually triggering jobs like Flume → Hive → Spark → Sqoop is inefficient.
- Oozie automates job execution, manages dependencies, and schedules workflows.
Think of Oozie as a "task manager" for Hadoop jobs!
Why Use Apache Oozie?
- Automates Workflows – Runs Hadoop jobs in a predefined sequence.
- Manages Dependencies – Ensures job order (Job A → Job B → Job C).
- Schedules Jobs – Runs workflows daily, weekly, or in real time.
- Handles Failures – Retries failed jobs or sends alerts.
- Supports Multiple Hadoop Components – Works with MapReduce, Hive, Pig, Sqoop, Spark, Flume, and shell scripts.
Example: Automating an E-Commerce Data Pipeline
- Step 1: Flume ingests logs from web servers.
- Step 2: Hive processes the data for sales reports.
- Step 3: Sqoop exports the processed data to MySQL.
- Step 4: Spark runs machine learning predictions.
Instead of running each step manually, Oozie automates the entire workflow (see the sketch below).
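As a rough sketch (not a complete, tested definition), part of that pipeline could be expressed in an Oozie workflow.xml; the app name, script name, and paths below are illustrative:
<workflow-app name="sales-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-report"/>
    <!-- Step 2: build the sales report with Hive -->
    <action name="hive-report">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>sales_report.hql</script>
        </hive>
        <ok to="sqoop-export"/>
        <error to="fail"/>
    </action>
    <!-- Step 3: export the report back to MySQL with Sqoop -->
    <action name="sqoop-export">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>export --connect jdbc:mysql://localhost/bankdb --table aggregated_sales --export-dir /user/hadoop/sales_results</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pipeline failed at ${wf:lastErrorNode()}</message>
    </kill>
    <end name="end"/>
</workflow-app>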
How Does Apache Oozie Work?
Oozie has two main types of jobs:
1. Workflow Jobs – Define a series of tasks that must be executed in order.
2. Coordinator Jobs – Schedule workflow execution at fixed intervals (e.g., hourly, daily).
Think of Oozie like a train schedule:
- Workflow Job: defines the train route (which stations to stop at).
- Coordinator Job: decides when the train should run (a minimal coordinator sketch follows below).
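A minimal sketch of a coordinator.xml that runs the workflow above once a day; the dates, frequency, and application path are illustrative:
<coordinator-app name="daily-sales" frequency="${coord:days(1)}"
                 start="2024-06-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory that contains the workflow.xml above -->
            <app-path>${nameNode}/user/hadoop/apps/sales-pipeline</app-path>
        </workflow>
    </action>
</coordinator-app>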