Unlock SQL Mastery: Advanced Concepts for Experts
Advanced Joins and Subqueries
Advanced joins and subqueries allow you to work efficiently with relational databases by combining data from multiple tables and performing complex queries.
Why Use Advanced Joins and Subqueries?
- Advanced Joins: Enable complex relationships by linking multiple tables.
- Subqueries: Allow queries to be nested inside other queries for advanced data retrieval.
1. Joining Multiple Tables
Real-world databases often store data across multiple tables. Joins help retrieve meaningful information efficiently.
Example: Joining Three Tables
Retrieve order details including customer name and product name:
SELECT o.order_id, c.customer_name, p.product_name, o.order_date
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
2. Nested Subqueries
Subquery in the WHERE Clause
Find all customers who have placed an order:
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
);
Subquery in the SELECT Clause
Retrieve each customer’s total number of orders:
SELECT customer_name, (
SELECT COUNT(*) FROM orders o
WHERE o.customer_id = c.customer_id
) AS total_orders
FROM customers c;
Subquery in the FROM Clause
Calculate the average order amount per customer:
SELECT customer_id, AVG(order_amount) AS avg_order_amount
FROM (
SELECT customer_id, order_amount FROM orders
) AS customer_orders
GROUP BY customer_id;
3. Using Subqueries in SELECT, FROM, and WHERE Clauses
Subquery in the WHERE Clause
Find customers who have placed more than 5 orders:
SELECT customer_name
FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
);
Common Mistakes to Avoid
- Overusing Subqueries: Subqueries can slow down queries on large datasets.
- Ignoring Performance: Optimize queries by using indexes and avoiding unnecessary subqueries.
- Mismatched Data Types: Ensure consistency between subquery results and outer query conditions.
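To make the first point concrete: many IN-subqueries can be rewritten as joins, which optimizers often plan more efficiently. A minimal runnable sketch using SQLite (table and column names follow the examples above; the data is made up for illustration):

```python
import sqlite3

# Toy schema matching the examples in this section
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# IN-subquery form
subquery = conn.execute("""
SELECT customer_name FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders)
""").fetchall()

# Equivalent JOIN form; DISTINCT removes duplicates from repeat orderers
join = conn.execute("""
SELECT DISTINCT c.customer_name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
""").fetchall()

print(sorted(subquery) == sorted(join))  # True: both return Ana and Cy
```

Both queries return the same customers; on large tables, comparing their execution plans (see the EXPLAIN section later) shows which form your database handles better.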
What’s Next?
Now that you understand advanced joins and subqueries, the next step is exploring Window Functions, which enable calculations across related rows.
Interactive Practice
Try these challenges in your SQL editor:
- Join the orders, customers, and products tables to retrieve order details.
- Use a subquery to find all customers who have placed more than 3 orders.
- Use a subquery in the SELECT clause to retrieve the total number of orders per customer.
Window Functions: Advanced Data Analysis
Window functions are powerful SQL tools that allow calculations across a set of rows while keeping individual rows intact. Unlike aggregate functions, which collapse data, window functions maintain row-level details while providing insights.
Why Use Window Functions?
- Running Totals: Compute cumulative sums.
- Ranking: Assign ranks to rows.
- Moving Averages: Analyze trends over time.
- Comparative Analysis: Compare a row with other rows in the group.
Key Concepts
- PARTITION BY: Divides data into groups (similar to GROUP BY but without collapsing rows).
- ORDER BY: Specifies row order within each partition.
- Frame Clause: Defines the subset of rows included in the calculation.
1. ROW_NUMBER()
Assigns a unique number to each row within a partition.
SELECT
author_id,
title,
ROW_NUMBER() OVER (PARTITION BY author_id ORDER BY year_published) AS row_num
FROM books;
2. RANK() and DENSE_RANK()
Ranks rows based on a specified order.
SELECT
title,
price,
RANK() OVER (ORDER BY price DESC) AS price_rank,
DENSE_RANK() OVER (ORDER BY price DESC) AS price_dense_rank
FROM books;
3. LEAD() and LAG()
Access data from the next or previous row.
Example: Comparing a Book’s Price to the Next Book
SELECT
title,
price,
LEAD(price) OVER (ORDER BY price) AS next_price
FROM books;
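LAG() works the same way in the other direction, pulling the previous row's value. A runnable sketch using SQLite (the books data is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?)",
                 [("A", 10.0), ("B", 12.0), ("C", 15.0)])

# LAG(price) fetches the previous row's price in the ORDER BY price ordering;
# the first row has no predecessor, so it gets NULL (None in Python)
rows = conn.execute("""
SELECT title, price, LAG(price) OVER (ORDER BY price) AS prev_price
FROM books
""").fetchall()
for title, price, prev in rows:
    print(title, price, prev)
```

The first row's `prev_price` is NULL; LEAD() behaves symmetrically, with NULL on the last row.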
4. SUM() and AVG() Over a Window
Running Total of Book Prices
SELECT
title,
price,
SUM(price) OVER (ORDER BY year_published) AS running_total
FROM books;
Moving Average of Book Prices
SELECT
title,
year_published,
price,
AVG(price) OVER (ORDER BY year_published ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM books;
Common Mistakes to Avoid
- Forgetting the OVER Clause: Window functions require the OVER clause.
- Confusing PARTITION BY and GROUP BY: PARTITION BY keeps rows intact, GROUP BY collapses them.
- Misusing Frame Clauses: Ensure correct frame ranges when using ROWS BETWEEN.
What’s Next?
Next, explore Common Table Expressions (CTEs) to simplify complex queries.
Interactive Practice
Try these challenges in your SQL editor:
- Rank books by their price within each author group.
- Calculate the running total of book prices.
- Compare each book’s price to the next book’s price.
Common Table Expressions (CTEs): Simplifying Complex Queries
Common Table Expressions (CTEs) allow you to define temporary result sets within a query. They help in breaking down complex queries into smaller, more manageable parts.
Why Use CTEs?
- Readability: Makes queries easier to understand.
- Reusability: Allows using the same subquery multiple times.
- Debugging: Enables testing individual query parts.
- Recursion: Handles hierarchical data efficiently.
1. Basic Syntax
WITH cte_name AS (
SELECT ...
)
SELECT ...
FROM cte_name;
2. Simplifying a Query
Find authors with an average book price above $15.
Without CTE
SELECT author_id, AVG(price) AS avg_price
FROM books
GROUP BY author_id
HAVING AVG(price) > 15;
With CTE
WITH author_avg_price AS (
SELECT author_id, AVG(price) AS avg_price
FROM books
GROUP BY author_id
)
SELECT author_id, avg_price
FROM author_avg_price
WHERE avg_price > 15;
3. Reusing a CTE
Calculate total sales per author and percentage contribution of each book.
WITH total_sales AS (
SELECT author_id, SUM(price) AS total_price
FROM books
GROUP BY author_id
)
SELECT
b.title,
b.price,
b.price / total_sales.total_price * 100 AS sales_percentage
FROM books b
JOIN total_sales
ON b.author_id = total_sales.author_id;
4. Recursive CTEs
Recursive CTEs allow handling hierarchical data, like employee hierarchies.
Example: Organizational Hierarchy
WITH RECURSIVE org_chart AS (
SELECT employee_id, employee_name, manager_id
FROM employees
WHERE employee_id = 1 -- Manager's ID
UNION ALL
SELECT e.employee_id, e.employee_name, e.manager_id
FROM employees e
INNER JOIN org_chart oc
ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart;
5. Why CTEs Are Important for Data Engineers
- Modularity: Breaks complex queries into reusable parts.
- Debugging: Easier to test individual components.
- Performance: Reduces reliance on subqueries and temporary tables.
- Recursive Queries: Essential for hierarchical data analysis.
Common Mistakes to Avoid
- Overusing CTEs: Use them when they add value.
- Forgetting RECURSIVE: Required for recursive queries.
- Ignoring Performance: Test queries for efficiency.
What’s Next?
Next, explore Indexes and Query Optimization to improve performance on large datasets.
Interactive Practice
Try these challenges in your SQL editor:
- Calculate the average book price per author using a CTE.
- Find total sales per author and percentage contribution of each book.
- Use a recursive CTE to find the hierarchy of employees under a manager.
Transactions and Concurrency Control: Keeping Your Data Safe and Consistent
A transaction is a sequence of database operations that are treated as a single unit of work. Transactions ensure data integrity, even in cases of system failures or concurrent user access.
Why Are Transactions Important?
- Atomicity: Ensures all operations succeed or none take effect.
- Consistency: Maintains database integrity before and after transactions.
- Isolation: Prevents transactions from interfering with each other.
- Durability: Ensures committed transactions are permanently saved.
1. How Do Transactions Work?
Step 1: Start a Transaction
START TRANSACTION;
Step 2: Perform Operations
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1; -- Deduct from savings
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2; -- Add to checking
Step 3: Commit or Rollback
If successful, commit the transaction:
COMMIT;
If an error occurs, rollback the changes:
ROLLBACK;
2. Concurrency Control: Handling Multiple Users
When multiple users access the database simultaneously, transactions must be managed to avoid conflicts.
Isolation Levels
- Read Uncommitted: Allows dirty reads (seeing uncommitted changes).
- Read Committed: Prevents dirty reads but allows non-repeatable reads.
- Repeatable Read: Prevents non-repeatable reads but allows phantom reads.
- Serializable: The strictest level; transactions behave as if they ran one after another.
Example: Setting the Isolation Level
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT * FROM orders;
COMMIT;
3. Example: Using Transactions in SQL
Placing an Order with Inventory Update
START TRANSACTION;
-- Step 1: Insert the order
INSERT INTO orders (order_id, product_id, quantity)
VALUES (101, 1, 2);
-- Step 2: Update inventory
UPDATE inventory SET stock = stock - 2 WHERE product_id = 1;
COMMIT;
Simulating a Rollback
START TRANSACTION;
INSERT INTO orders (order_id, product_id, quantity)
VALUES (103, 3, 5);
-- Cause an error (invalid product_id)
UPDATE inventory SET stock = stock - 5 WHERE product_id = 999;
ROLLBACK;
Common Mistakes to Avoid
- Forgetting to Commit: Uncommitted transactions do not save changes.
- Long-Running Transactions: Keeping transactions open for too long can slow down performance.
- Ignoring Isolation Levels: Higher isolation levels can impact performance.
What’s Next?
Next, explore Stored Procedures and Functions to encapsulate complex logic.
Interactive Practice
Try these challenges in your SQL editor:
- Start a transaction, insert a new order, update inventory, and commit the transaction.
- Simulate a rollback by causing an error within a transaction.
- Change the isolation level and observe transaction behavior.
Stored Procedures and Functions
Stored procedures and user-defined functions (UDFs) are essential for modularizing SQL code, improving performance, and enhancing security.
What Are Stored Procedures and Functions?
- Stored Procedures: Precompiled SQL code blocks that perform specific tasks.
- User-Defined Functions (UDFs): Reusable code blocks that return a single value or a table.
1. Creating Stored Procedures
Example: Retrieve All Customers
CREATE PROCEDURE GetAllCustomers()
BEGIN
SELECT * FROM customers;
END;
Calling the Procedure
CALL GetAllCustomers();
2. Parameters in Stored Procedures
Example: Retrieve Customers by City
CREATE PROCEDURE GetCustomersByCity(IN city_name VARCHAR(100))
BEGIN
SELECT * FROM customers WHERE city = city_name;
END;
Calling the Procedure
CALL GetCustomersByCity('New York');
3. Creating User-Defined Functions (UDFs)
Example: Calculate Total Order Price
CREATE FUNCTION CalculateOrderTotal(p_order_id INT)
RETURNS DECIMAL(10, 2)
BEGIN
DECLARE total DECIMAL(10, 2);
SELECT SUM(price * quantity) INTO total
FROM order_items
WHERE order_id = p_order_id; -- parameter renamed so it does not shadow the column
RETURN total;
END;
Using the Function
SELECT order_id, CalculateOrderTotal(order_id) AS total_price
FROM orders;
4. Differences Between Procedures and Functions
Feature | Stored Procedures | User-Defined Functions (UDFs) |
---|---|---|
Return Value | Can return multiple values or none. | Must return a single value or a table. |
Usage | Called using CALL. | Used in SQL statements like SELECT. |
Parameters | Can have IN, OUT, and INOUT parameters. | Only supports IN parameters. |
SQL Statements | Can execute multiple SQL statements. | Typically used for calculations. |
Performance | Better for complex operations. | Better for reusable calculations. |
Common Mistakes to Avoid
- Overusing Procedures/Functions: Use them when they simplify code.
- Ignoring Performance: Optimize procedures and functions for efficiency.
- Mismatched Data Types: Ensure parameters and return values match expected types.
What’s Next?
Next, explore Triggers, which automatically execute in response to INSERT, UPDATE, or DELETE events.
Interactive Practice
Try these challenges in your SQL editor:
- Create a stored procedure to retrieve all orders for a specific customer.
- Create a function to calculate the total price of an order.
- Create a stored procedure to update the status of an order.
Triggers: Automating Database Actions
A trigger is a special type of stored procedure that automatically executes in response to certain events on a table, such as INSERT, UPDATE, or DELETE. Triggers help enforce business rules, maintain data integrity, and automate actions.
Why Use Triggers?
- Enforce Business Rules: Prevent invalid data modifications.
- Audit Changes: Track who made changes and when.
- Maintain Data Integrity: Automatically update related records.
1. Creating and Dropping Triggers
Example: Log New Orders
CREATE TRIGGER log_new_order
AFTER INSERT
ON orders
FOR EACH ROW
BEGIN
INSERT INTO order_audit (order_id, action, action_time)
VALUES (NEW.order_id, 'INSERT', NOW());
END;
Example: Prevent Cancelling Shipped Orders
CREATE TRIGGER prevent_cancellation
BEFORE UPDATE
ON orders
FOR EACH ROW
BEGIN
IF OLD.order_status = 'Shipped' AND NEW.order_status = 'Cancelled' THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Cannot cancel a shipped order';
END IF;
END;
Dropping a Trigger
DROP TRIGGER log_new_order;
2. BEFORE and AFTER Triggers
BEFORE Triggers
Executed before the triggering event, useful for validation.
Example: Validate Order Date
CREATE TRIGGER validate_order_date
BEFORE INSERT
ON orders
FOR EACH ROW
BEGIN
IF NEW.order_date < CURDATE() THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Order date cannot be in the past';
END IF;
END;
AFTER Triggers
Executed after the triggering event, useful for logging and updates.
Example: Update Inventory After an Order
CREATE TRIGGER update_inventory
AFTER INSERT
ON order_items
FOR EACH ROW
BEGIN
UPDATE inventory
SET stock = stock - NEW.quantity
WHERE product_id = NEW.product_id;
END;
3. Use Cases for Triggers
1. Auditing Changes
Track updates to sensitive data.
CREATE TRIGGER log_employee_updates
AFTER UPDATE
ON employees
FOR EACH ROW
BEGIN
INSERT INTO employee_audit (employee_id, old_salary, new_salary, action_time)
VALUES (OLD.employee_id, OLD.salary, NEW.salary, NOW());
END;
2. Enforcing Business Rules
Prevent deletion of customers with active orders.
CREATE TRIGGER prevent_customer_deletion
BEFORE DELETE
ON customers
FOR EACH ROW
BEGIN
IF (SELECT COUNT(*) FROM orders WHERE customer_id = OLD.customer_id) > 0 THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Cannot delete customer with active orders';
END IF;
END;
3. Maintaining Data Integrity
Automatically update customer sales records.
CREATE TRIGGER update_total_sales
AFTER INSERT
ON orders
FOR EACH ROW
BEGIN
UPDATE customers
SET total_sales = total_sales + NEW.order_amount
WHERE customer_id = NEW.customer_id;
END;
Common Mistakes to Avoid
- Overusing Triggers: Too many triggers can slow performance and complicate debugging.
- Ignoring Performance: Triggers add overhead to INSERT, UPDATE, and DELETE operations.
- Creating Infinite Loops: Be careful not to create triggers that call each other recursively.
What’s Next?
Next, explore Transactions and Concurrency Control to manage data consistency in multi-user environments.
Interactive Practice
Try these challenges in your SQL editor:
- Create a trigger that logs every new customer into a customer_audit table.
- Create a trigger that prevents updating a product’s price to a negative value.
- Create a trigger that updates the total_orders column in the customers table after an order is placed.
Performance Optimization: Making Your Queries Faster
As your database grows, queries can slow down, affecting performance and user experience. Optimizing queries ensures fast and efficient data retrieval. Key focus areas include:
- Query Execution Plans: Understanding how queries are processed.
- Index Optimization: Using indexes to speed up searches.
- Avoiding Full Table Scans: Minimizing the number of scanned rows.
- Using EXPLAIN: Analyzing and optimizing query execution.
1. Query Execution Plans
What Is a Query Execution Plan?
A query execution plan is a roadmap that shows how a database retrieves data, including:
- Which indexes are used.
- How tables are joined.
- Whether a full table scan is performed.
Viewing the Execution Plan
EXPLAIN SELECT * FROM customers WHERE city = 'New York';
The output includes:
- Type of Scan: Index scan, full table scan, etc.
- Rows Examined: Number of rows checked.
- Keys Used: Indexes used (if any).
2. Index Optimization
What Are Indexes?
Indexes help the database locate rows quickly, improving query performance.
Types of Indexes
- Single-Column Index: Indexes a single column.
- Composite Index: Indexes multiple columns.
- Unique Index: Ensures unique values.
- Primary Key Index: Created automatically with a primary key.
Example: Creating an Index
CREATE INDEX idx_city ON customers (city);
SELECT * FROM customers WHERE city = 'New York';
When to Use Indexes
- On WHERE, JOIN, ORDER BY, or GROUP BY columns.
- For large tables with frequent queries.
- For high-selectivity columns (many unique values).
When NOT to Use Indexes
- For small tables (indexes provide little benefit).
- On frequently updated columns (indexes slow down writes).
- For low selectivity columns (many duplicate values).
3. Avoiding Full Table Scans
What Is a Full Table Scan?
A full table scan occurs when the database reads all rows to find the required data, slowing down queries.
How to Avoid Full Table Scans
- Use indexes on frequently queried columns.
- Rewrite queries to leverage indexes.
- Use LIMIT to restrict returned rows.
Example: Using an Index Instead of a Full Table Scan
-- Full Table Scan (Slow)
SELECT * FROM customers WHERE city = 'New York';
-- Index Scan (Optimized)
CREATE INDEX idx_city ON customers (city);
SELECT * FROM customers WHERE city = 'New York';
4. Using EXPLAIN for Query Analysis
What Is EXPLAIN?
The EXPLAIN command shows how the database executes a query, helping to identify slow operations.
Example: Analyzing a Query
EXPLAIN SELECT c.customer_name, o.order_date
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.city = 'New York';
Key EXPLAIN Output
- Type: Shows if a full table scan is used.
- Rows: Number of rows examined.
- Key: Index used.
- Extra: Additional info (e.g., Using index).
EXPLAIN ANALYZE (PostgreSQL)
PostgreSQL allows you to see both the execution plan and runtime performance using:
EXPLAIN ANALYZE SELECT * FROM customers WHERE city = 'New York';
Common Mistakes to Avoid
- Ignoring Execution Plans: Always use EXPLAIN to analyze queries.
- Over-Indexing: Too many indexes slow down write operations.
- Not Testing: Run queries on realistic data for performance validation.
What’s Next?
Next, explore Transactions and Concurrency Control to manage multiple user operations and data integrity.
Interactive Practice
Try these challenges in your SQL editor:
- Create an index on the order_date column in the orders table and analyze the query.
- Use EXPLAIN to analyze a query that joins the customers and orders tables.
- Rewrite a query to avoid a full table scan.
Backup and Recovery: Protecting Your Data
Backup and recovery ensure that your database is protected from data loss due to accidental deletions, hardware failures, or cyberattacks. A solid backup strategy allows for quick restoration and disaster recovery.
Why Are Backup and Recovery Important?
- Data Loss Prevention: Protects against accidental deletions and failures.
- Disaster Recovery: Restores databases after ransomware attacks or system crashes.
- Compliance: Ensures adherence to regulatory backup policies.
1. Database Backup Strategies
Types of Backups
- Full Backup: A complete copy of the entire database.
- Incremental Backup: Saves only the changes since the last backup.
- Differential Backup: Saves changes since the last full backup.
Full Backup Example (MySQL)
mysqldump -u username -p database_name > backup_file.sql
Incremental Backup (PostgreSQL)
Enable Write-Ahead Logging (WAL) to track database changes.
wal_level = replica
archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'
Perform a full backup:
pg_basebackup -U username -D /path/to/backup -Ft -z
Differential Backup (SQL Server)
BACKUP DATABASE database_name
TO DISK = '/path/to/backup.bak'
WITH DIFFERENTIAL;
2. Point-in-Time Recovery (PITR)
What Is PITR?
PITR allows restoring a database to a specific moment in time, helping recover from accidental deletions or corruption.
PITR in PostgreSQL
Restore the base backup into a fresh data directory, then tell the server where to find archived WAL and when to stop replaying. In postgresql.conf:
restore_command = 'cp /path/to/wal_archive/%f %p'
recovery_target_time = '2023-10-01 12:00:00'
Create an empty recovery.signal file in the data directory and start the server; PostgreSQL replays WAL up to the target time.
PITR in MySQL
Enable binary logging in my.cnf:
log_bin = /var/log/mysql/mysql-bin.log
Restore full backup:
mysql -u username -p database_name < backup_file.sql
Apply binary logs to restore a specific time:
mysqlbinlog --stop-datetime="2023-10-01 12:00:00" /var/log/mysql/mysql-bin.log | mysql -u username -p
3. Exporting and Importing Data
Exporting Data
Export data for backup or migration.
MySQL Export to CSV
SELECT * INTO OUTFILE '/path/to/customers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM customers;
PostgreSQL Export to CSV
COPY customers TO '/path/to/customers.csv' WITH CSV HEADER;
Importing Data
Load data back into the database.
MySQL Import from CSV
LOAD DATA INFILE '/path/to/customers.csv'
INTO TABLE customers
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
PostgreSQL Import from CSV
COPY customers FROM '/path/to/customers.csv' WITH CSV HEADER;
Common Mistakes to Avoid
- Not Testing Backups: Regularly test your backups to ensure they can be restored.
- Ignoring Retention Policies: Define how long backups should be kept and remove outdated backups.
- Storing Backups on the Same Server: Keep backups in a secure, separate location.
What’s Next?
Next, explore Database Monitoring and Maintenance to keep your database running efficiently over time.
Interactive Practice
Try these challenges in your SQL editor:
- Create a full backup of a database using mysqldump.
- Export the orders table to a CSV file in PostgreSQL.
- Import data from a CSV file into the products table in MySQL.
Advanced Database Design: Building Efficient Databases
Why Is Database Design Important?
A well-designed database is crucial for performance, data integrity, and scalability. It helps:
- Improve Performance: Reduces redundancy and speeds up queries.
- Ensure Data Integrity: Maintains accuracy and consistency.
- Simplify Maintenance: Makes updates and scaling easier.
- Support Business Needs: Aligns with application requirements.
1. Normalization
What Is Normalization?
Normalization is the process of organizing data to reduce redundancy and improve integrity. It involves splitting tables and defining relationships between them.
Levels of Normalization
First Normal Form (1NF):
Ensures each column contains atomic values and each row is unique.
Before 1NF:
OrderID | Products |
---|---|
1 | Apple, Banana |
2 | Orange, Mango |
After 1NF:
OrderID | Product |
---|---|
1 | Apple |
1 | Banana |
2 | Orange |
2 | Mango |
Second Normal Form (2NF):
Ensures all non-key columns depend on the entire primary key.
Before 2NF:
OrderID | Product | Price |
---|---|---|
1 | Apple | 1.00 |
1 | Banana | 0.50 |
After 2NF:
Orders Table:
OrderID | Product |
---|---|
1 | Apple |
1 | Banana |
Products Table:
Product | Price |
---|---|
Apple | 1.00 |
Banana | 0.50 |
Third Normal Form (3NF):
Ensures non-key columns are independent of each other.
Before 3NF:
OrderID | Customer | CustomerPhone |
---|---|---|
1 | John | 123-456-7890 |
After 3NF:
Orders Table:
OrderID | CustomerID |
---|---|
1 | 1 |
Customers Table:
CustomerID | Customer | CustomerPhone |
---|---|---|
1 | John | 123-456-7890 |
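The 3NF split above translates directly into DDL: customer details live in one table, and orders reference it by key. A runnable sketch using SQLite, with column names taken from the tables above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Customers table: each customer's details are stored exactly once
CREATE TABLE customers (
    customer_id    INTEGER PRIMARY KEY,
    customer_name  TEXT NOT NULL,
    customer_phone TEXT
);
-- Orders table references customers instead of repeating their details
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
);
INSERT INTO customers VALUES (1, 'John', '123-456-7890');
INSERT INTO orders VALUES (1, 1);
""")

# A join reassembles the original denormalized view on demand
row = conn.execute("""
SELECT o.order_id, c.customer_name, c.customer_phone
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()
print(row)  # (1, 'John', '123-456-7890')
```

Updating John's phone number now touches a single row in customers, no matter how many orders he has placed.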
2. Denormalization
What Is Denormalization?
Denormalization introduces redundancy to improve read performance. It’s useful in:
- Read-Heavy Workloads: Reducing joins speeds up queries.
- Reporting & Analytics: Precomputed data enhances performance.
- Caching: Storing frequently used results avoids recalculations.
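A minimal sketch of denormalizing for a read-heavy report: precompute per-author totals into a summary table so reports read it without a join or aggregate. The table and column names here are illustrative assumptions, using SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (book_id INTEGER PRIMARY KEY, author_id INTEGER, price REAL);
INSERT INTO books VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 20.0);

-- Denormalized summary table: redundant data, but reports can read it
-- with a simple indexed lookup instead of scanning and aggregating books
CREATE TABLE author_sales_summary (author_id INTEGER PRIMARY KEY, total_sales REAL);
INSERT INTO author_sales_summary
SELECT author_id, SUM(price) FROM books GROUP BY author_id;
""")

rows = conn.execute(
    "SELECT author_id, total_sales FROM author_sales_summary ORDER BY author_id"
).fetchall()
print(rows)  # [(1, 25.0), (2, 20.0)]
```

The trade-off: every write to books must also refresh the summary (via application code, a trigger, or a scheduled rebuild), which is why the advice below is to normalize first and denormalize only where reads demand it.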
3. Entity-Relationship Diagrams (ERDs)
An ERD visually represents database relationships. It shows:
- Entities: Tables (e.g., customers, orders).
- Attributes: Columns (e.g., customer_id, order_date).
- Relationships: How tables connect (e.g., one-to-many).
4. Schema Design Best Practices
- Plan Ahead: Define entities, attributes, and relationships before designing.
- Normalize First, Denormalize Later: Keep data integrity, optimize later.
- Consistent Naming: Use meaningful table and column names.
- Index Strategically: Optimize query performance.
- Plan for Scalability: Use partitioning, sharding, and proper data types.
What’s Next?
Now that you understand database design, let’s move to Database Monitoring & Maintenance to keep systems running smoothly.
Working with Big Data: Scaling Your Database
Why Is Big Data Important?
As datasets grow, traditional databases can struggle to handle the volume, velocity, and variety of data. Big data techniques help you:
- Scale Horizontally: Distribute data across multiple servers.
- Improve Performance: Speed up queries on large datasets.
- Reduce Costs: Optimize storage and processing.
1. Partitioning Tables
What Is Partitioning?
Partitioning splits a large table into smaller, more manageable pieces called partitions. Each partition stores a subset of the data based on a specific criterion (e.g., date, region).
Types of Partitioning
- Range Partitioning: Partitions based on a range of values (e.g., dates).
- List Partitioning: Partitions based on a list of values (e.g., regions).
- Hash Partitioning: Partitions based on a hash function (e.g., user ID).
Example: Range Partitioning
Partition a sales table by year:
CREATE TABLE sales (
sale_id INT,
sale_date DATE,
amount DECIMAL(10, 2)
)
PARTITION BY RANGE (YEAR(sale_date)) (
PARTITION p0 VALUES LESS THAN (2020),
PARTITION p1 VALUES LESS THAN (2021),
PARTITION p2 VALUES LESS THAN (2022),
PARTITION p3 VALUES LESS THAN (2023)
);
Example: List Partitioning
Partition a customers table by region:
CREATE TABLE customers (
customer_id INT,
name VARCHAR(100),
region VARCHAR(50)
)
PARTITION BY LIST (region) (
PARTITION p_north VALUES IN ('NY', 'NJ', 'PA'),
PARTITION p_south VALUES IN ('TX', 'FL', 'GA'),
PARTITION p_west VALUES IN ('CA', 'WA', 'OR')
);
2. Sharding
What Is Sharding?
Sharding splits a database into smaller, independent pieces called shards. Each shard is stored on a separate server, allowing distributed load balancing.
Example: Sharding by User ID
Shard a users table across 3 servers based on user_id:
- Shard 1: user_id % 3 = 0
- Shard 2: user_id % 3 = 1
- Shard 3: user_id % 3 = 2
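The routing itself typically lives in application code, since each shard is an independent database. A minimal sketch of the modulo scheme above (server names are placeholders):

```python
# One connection string per shard; the order must stay stable,
# because changing it re-routes existing users to different shards.
SHARD_DSNS = ["server0", "server1", "server2"]

def shard_for(user_id: int) -> str:
    """Route a user to a shard using the modulo scheme described above."""
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]

print(shard_for(7))  # 7 % 3 = 1 -> "server1"
print(shard_for(9))  # 9 % 3 = 0 -> "server0"
```

Real deployments often use consistent hashing or a lookup table instead, so shards can be added without re-routing most users.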
3. Clustering
What Is Clustering?
Clustering groups related data together on disk to improve query performance. It’s often used with partitioning to optimize retrieval.
Example: Clustered Index
Create a clustered index on the orders table (SQL Server syntax; in MySQL, InnoDB automatically clusters rows by the primary key):
CREATE CLUSTERED INDEX idx_order_date
ON orders (order_date);
4. Using SQL with Big Data Tools
Apache Hive
Apache Hive is a data warehouse tool that allows you to query large datasets using SQL-like syntax.
CREATE TABLE sales (
sale_id INT,
sale_date STRING,
amount DOUBLE
)
STORED AS ORC;
SELECT * FROM sales WHERE sale_date = '2023-10-01';
Spark SQL
Spark SQL is a module in Apache Spark for querying structured data.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
# Load data
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
# Create a temporary view
df.createOrReplaceTempView("sales")
# Run a query
result = spark.sql("SELECT * FROM sales WHERE amount > 100")
result.show()
Common Mistakes to Avoid
- Over-Partitioning: Too many partitions can slow down queries.
- Ignoring Shard Distribution: Uneven shard distribution can lead to hotspots.
- Not Testing: Always test partitioning, sharding, and clustering strategies with real data.
What’s Next?
Now that you know how to work with big data, let’s explore Database Monitoring and Maintenance to ensure smooth performance.
Interactive Practice
Try these challenges in your SQL environment:
Partition a logs table by log_date (range partitioning):
CREATE TABLE logs (
log_id INT,
log_date DATE,
message TEXT
)
PARTITION BY RANGE COLUMNS (log_date) (
PARTITION p0 VALUES LESS THAN ('2023-01-01'),
PARTITION p1 VALUES LESS THAN ('2023-07-01'),
PARTITION p2 VALUES LESS THAN ('2024-01-01')
);
Shard a users table across 4 servers based on user_id:
-- Shard 1: user_id % 4 = 0
-- Shard 2: user_id % 4 = 1
-- Shard 3: user_id % 4 = 2
-- Shard 4: user_id % 4 = 3
Create a clustered index on the orders table:
CREATE CLUSTERED INDEX idx_order_date
ON orders (order_date);
Key Takeaways
- Use partitioning to split large tables into smaller, more manageable pieces.
- Use sharding to distribute data across multiple servers.
- Use clustering to group related data together for faster queries.
- Use big data tools like Apache Hive and Spark SQL to query large datasets.
SQL for Data Analysis: Unlocking Insights
Why Use SQL for Data Analysis?
SQL is a powerful tool for analyzing large datasets because:
- It’s Fast: Optimized for querying structured data.
- It’s Flexible: Supports complex transformations and aggregations.
- It’s Widely Used: Compatible with most databases and analytics tools.
1. Pivoting Data
What Is Pivoting?
Pivoting transforms rows into columns, making it easier to analyze tabular data. This is useful for creating summary tables and reports.
Example: Pivoting Sales Data
Suppose you have a sales table:
sale_id | product | month | amount |
---|---|---|---|
1 | Apple | Jan | 100 |
2 | Banana | Jan | 150 |
3 | Apple | Feb | 200 |
4 | Banana | Feb | 250 |
Pivoted View:
product | Jan | Feb |
---|---|---|
Apple | 100 | 200 |
Banana | 150 | 250 |
Using CASE Statements for Pivoting
SELECT product,
SUM(CASE WHEN month = 'Jan' THEN amount ELSE 0 END) AS Jan,
SUM(CASE WHEN month = 'Feb' THEN amount ELSE 0 END) AS Feb
FROM sales
GROUP BY product;
Using PIVOT (SQL Server)
SELECT product, [Jan], [Feb]
FROM sales
PIVOT (
SUM(amount)
FOR month IN ([Jan], [Feb])
) AS pvt;
2. Time Series Analysis
What Is Time Series Analysis?
Time series analysis involves analyzing data points collected over time to identify trends and patterns.
Example: Monthly Sales Trends
SELECT DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
Example: Rolling Averages
Calculate a 3-month rolling average of sales:
SELECT month,
AVG(total_sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM (
SELECT DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
) AS monthly_sales;
3. Advanced Aggregations
What Are Advanced Aggregations?
Advanced aggregations go beyond simple SUM and AVG to provide deeper insights.
Example: Window Functions
Calculate cumulative sales for each product:
SELECT product,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY product ORDER BY sale_date) AS cumulative_sales
FROM sales;
Example: Grouping Sets
Aggregate sales by product, by month, and by both in a single query:
SELECT product,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY GROUPING SETS ((product), (month), (product, month));
Example: CUBE
Generate subtotals and grand totals for sales by product and month:
SELECT product,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY CUBE (product, month);
Common Mistakes to Avoid
- Ignoring Performance: Complex queries can be slow; optimize them using indexes.
- Overusing Pivoting: Pivoting can make queries harder to read; use when necessary.
- Not Validating Results: Always double-check data accuracy before analysis.
What’s Next?
Now that you know how to use SQL for data analysis, let’s explore Database Monitoring and Maintenance to ensure your database runs efficiently over time.
Interactive Practice
Try these challenges in your SQL environment:
Pivot the following sales table to show sales by product and month:
SELECT product,
SUM(CASE WHEN month = 'Jan' THEN amount ELSE 0 END) AS Jan,
SUM(CASE WHEN month = 'Feb' THEN amount ELSE 0 END) AS Feb
FROM sales
GROUP BY product;
Calculate the cumulative sales for each product:
SELECT product,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY product ORDER BY sale_date) AS cumulative_sales
FROM sales;
Key Takeaways
- Use pivoting to transform rows into columns for better analysis.
- Use time series analysis to analyze trends and patterns over time.
- Use advanced aggregations like window functions, grouping sets, and CUBE for deeper insights.
Common SQL Interview Questions
1. Basic SQL Questions
These questions test your fundamental SQL knowledge, including query writing and key concepts.
1.1 What is the Difference Between WHERE and HAVING?
- WHERE: Filters rows before aggregation (used with SELECT, UPDATE, DELETE).
- HAVING: Filters groups after aggregation (used with GROUP BY).
-- WHERE: Filter rows before grouping
SELECT department, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY department;
-- HAVING: Filter groups after grouping
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
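Running both queries side by side makes the difference concrete. In this sqlite3 sketch (hypothetical salaries), WHERE removes individual low-salary rows before the average is computed, while HAVING removes whole departments after it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Sales", 40000), ("Sales", 80000),
     ("IT", 60000), ("IT", 70000)],
)

# WHERE: the 40000 Sales row is dropped BEFORE grouping, inflating Sales' average.
where_rows = conn.execute("""
    SELECT department, AVG(salary)
    FROM employees
    WHERE salary > 50000
    GROUP BY department
    ORDER BY department
""").fetchall()

# HAVING: averages use ALL rows; groups are filtered AFTER aggregation.
having_rows = conn.execute("""
    SELECT department, AVG(salary)
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 50000
    ORDER BY department
""").fetchall()
print(where_rows, having_rows)
```

The same data yields 80000.0 for Sales under WHERE but 60000.0 under HAVING, which is exactly the trap this interview question probes.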
1.2 What is a JOIN? Explain the Types of Joins.
A JOIN combines rows from two or more tables based on a related column.
- INNER JOIN: Returns only matching rows.
- LEFT JOIN: Returns all rows from the left table and matching rows from the right.
- RIGHT JOIN: Returns all rows from the right table and matching rows from the left.
- FULL JOIN: Returns all rows when there is a match in either table.
-- INNER JOIN Example
SELECT employees.name, departments.department_name
FROM employees
INNER JOIN departments ON employees.department_id = departments.department_id;
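The contrast between INNER and LEFT JOIN shows up as soon as a row has no match. A sqlite3 sketch with a hypothetical employee who has no department:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (department_id INTEGER, department_name TEXT)")
conn.execute("CREATE TABLE employees (name TEXT, department_id INTEGER)")
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "Sales"), (2, "IT")])
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", 1), ("Bob", None)])  # Bob has no department

# INNER JOIN: only rows with a match in both tables.
inner = conn.execute("""
    SELECT employees.name, departments.department_name
    FROM employees
    INNER JOIN departments ON employees.department_id = departments.department_id
    ORDER BY employees.name
""").fetchall()

# LEFT JOIN: every employee, with NULL where no department matches.
left = conn.execute("""
    SELECT employees.name, departments.department_name
    FROM employees
    LEFT JOIN departments ON employees.department_id = departments.department_id
    ORDER BY employees.name
""").fetchall()
print(inner, left)
```

Bob disappears from the INNER JOIN result but survives the LEFT JOIN with a NULL department_name; RIGHT and FULL JOIN mirror this behavior for the other side (and for both sides).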
1.3 What is the Difference Between DELETE and TRUNCATE?
- DELETE: Removes rows one by one, can be rolled back, and supports WHERE filtering.
- TRUNCATE: Removes all rows at once, is faster than DELETE, and in most databases cannot be rolled back.
-- DELETE Example
DELETE FROM employees WHERE department_id = 10;
-- TRUNCATE Example
TRUNCATE TABLE employees;
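SQLite has no TRUNCATE (an unqualified DELETE plays that role there), but DELETE's transactional behavior, the point of this question, can be sketched with sqlite3 and hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", 10), ("Bob", 20)])
conn.commit()

# DELETE is transactional: the targeted removal below is undone by ROLLBACK.
conn.execute("DELETE FROM employees WHERE department_id = 10")
conn.rollback()
after_rollback = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(after_rollback)  # 2: the DELETE was rolled back
```

TRUNCATE, by contrast, is a DDL-style operation in engines like MySQL and Oracle: it deallocates pages wholesale instead of logging each row, which is why it is faster and (in those engines) not reversible.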
2. Tricky SQL Problems
These questions test your ability to solve complex problems using subqueries, window functions, and joins.
2.1 Find the Second Highest Salary
Write a query to find the second highest salary in the employees table.
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
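This pattern handles duplicate top salaries correctly, which is worth verifying. A sqlite3 sketch with hypothetical data where the highest salary appears twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", 90000), ("Bob", 120000),
                  ("Cy", 120000), ("Di", 80000)])

# MAX below the overall MAX: both 120000 rows are excluded, so duplicated
# top salaries do not leak into the answer.
second = conn.execute("""
    SELECT MAX(salary) AS second_highest_salary
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]
print(second)  # 90000.0
```

A naive ORDER BY salary DESC LIMIT 1 OFFSET 1 would return 120000 here; that distinction is usually the follow-up question.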
2.2 Find Duplicate Emails
Write a query to find duplicate emails in the users table.
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
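A quick sqlite3 check of the duplicate-email query, with a hypothetical users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("a@x.com",)])

# GROUP BY collapses each email to one row; HAVING keeps only groups
# that collapsed more than one original row.
dupes = conn.execute("""
    SELECT email, COUNT(*)
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('a@x.com', 2)]
```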
2.3 Rank Employees by Salary
Write a query to rank employees by salary within each department.
SELECT name, department_id, salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;
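Ties are where RANK() earns its keep: equal salaries share a rank and the next rank is skipped. A sqlite3 sketch with hypothetical data (the alias salary_rank avoids the reserved word rank used by some engines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ann", 1, 90000), ("Bob", 1, 90000),
                  ("Cy", 1, 70000), ("Di", 2, 60000)])

# RANK() leaves gaps after ties (1, 1, 3); DENSE_RANK() would give 1, 1, 2.
rows = conn.execute("""
    SELECT name, department_id, salary,
           RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
    ORDER BY department_id, salary_rank, name
""").fetchall()
print(rows)
```

Note that the PARTITION BY restarts ranking per department, so Di ranks 1 in department 2 despite the lowest salary overall.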
3. Real-World SQL Scenarios
These questions test how well you can apply SQL to practical business situations.
3.1 Analyze Sales Trends
Calculate monthly sales trends from a sales table.
SELECT DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
3.2 Identify Inactive Users
Find users who haven’t logged in for more than 30 days.
SELECT user_id
FROM users
WHERE last_login_date < NOW() - INTERVAL '30 days';
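The NOW() - INTERVAL syntax above is PostgreSQL; SQLite spells the same cutoff as datetime('now', '-30 days'). A runnable sketch with hypothetical login dates seeded relative to the current time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, last_login_date TEXT)")
# Seed one stale login (40 days ago) and one recent one (5 days ago).
conn.execute("INSERT INTO users VALUES (1, datetime('now', '-40 days'))")
conn.execute("INSERT INTO users VALUES (2, datetime('now', '-5 days'))")

# SQLite equivalent of NOW() - INTERVAL '30 days'; ISO-8601 strings
# compare correctly as text.
inactive = conn.execute("""
    SELECT user_id
    FROM users
    WHERE last_login_date < datetime('now', '-30 days')
""").fetchall()
print(inactive)  # [(1,)]
```

Edge case worth raising in an interview: users with a NULL last_login_date (never logged in) fail the comparison and are silently excluded; catch them with OR last_login_date IS NULL if they should count as inactive.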
3.3 Calculate Customer Lifetime Value (CLV)
Calculate the total amount spent by each customer.
SELECT customer_id, SUM(amount) AS lifetime_value
FROM orders
GROUP BY customer_id;
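The CLV query is a plain GROUP BY sum; a sqlite3 sketch with a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 40.0)])

# One row per customer, summing every order they ever placed.
clv = conn.execute("""
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()
print(clv)  # [(1, 200.0), (2, 40.0)]
```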
Common Mistakes to Avoid
- Not Testing Queries: Always test on sample data before using in production.
- Ignoring Edge Cases: Handle NULL values, empty tables, and duplicate data.
- Overcomplicating Solutions: Keep queries readable and efficient.
What’s Next?
Now that you’ve practiced SQL interview questions, let’s explore Database Monitoring and Maintenance to ensure optimal database performance.
Interactive Practice
Try these SQL challenges in your database:
Find the second highest salary in the employees table:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Find duplicate emails in the users table:
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
Calculate the monthly sales trend from the sales table:
SELECT DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
Key Takeaways
- Master fundamental SQL concepts like joins, filtering, and grouping.
- Practice solving tricky SQL problems using subqueries and window functions.
- Apply SQL to real-world scenarios like sales trends and customer analysis.