Amazon

E-commerce and cloud computing giant with AWS, the world's leading cloud platform.

5 Rounds • ~28 Days • Very Hard

The Interview Loop

Recruiter Screen (30 min)

Standard fit check, behavioral questions, and resume overview.

Technical Loop (3-4 Rounds)

Deep dive into domain knowledge, coding, and system design.

Interview Question Bank

Data Engineer • Behavioral • medium

Tell me about a time you simplified a complex data platform decision across multiple teams.

#Communication #Stakeholders

Data Engineer • Behavioral • medium

Describe a situation where a data pipeline you owned went down in production. How did you handle it?

#On-Call #Problem Solving

Data Engineer • Behavioral • medium

How do you handle disagreements with data analysts or scientists who want features that compromise pipeline reliability?

#Conflict Resolution

Data Engineer • Behavioral • medium

Tell me about a time you significantly improved the performance of a data system.

#Performance #Optimization

Data Engineer • Behavioral • hard

Describe how you've balanced technical debt vs. new feature development in a data platform.

#Prioritization

Data Engineer • Behavioral • medium

Tell me about a time you onboarded a new data source that had significant quality issues.

#Problem Solving

Data Engineer • Behavioral • easy

Describe your experience mentoring junior data engineers.

#Mentoring #Collaboration

Data Engineer • Behavioral • easy

How do you stay current with rapidly evolving data engineering tools and practices?

#Growth Mindset

Data Engineer • Behavioral • medium

Tell me about a time you had to dive deep into a complex data discrepancy issue between a source system and your data warehouse. How did you find the root cause?

#Dive Deep #Debugging #Root Cause Analysis

Data Engineer • Behavioral • medium

Tell me about a time you had to make a technical compromise in your data pipeline design to meet an urgent business deadline. How did you handle the tech debt?

#Deliver Results #Trade-offs #Tech Debt

Data Engineer • Behavioral • easy

Tell me about a time you received feedback from a customer (or internal stakeholder) that your data or dashboard was incorrect. How did you respond?

#Customer Obsession #Earn Trust #Communication

Data Engineer • Coding • medium

Write a SQL query to find the second highest salary per department.

#Window Functions #SQL
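
One possible approach, sketched here with Python's built-in sqlite3 module so the query can be run as-is. The `employees` table, its columns, and the sample rows are assumptions made for illustration; the question does not specify a schema. `DENSE_RANK()` is used so that tied top salaries share rank 1 and the next distinct salary gets rank 2. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'Data', 150), ('Bo', 'Data', 150), ('Cy', 'Data', 120),
        ('Di', 'Ops', 100), ('Ed', 'Ops', 90), ('Flo', 'Ops', 80);
""")

SECOND_HIGHEST_SQL = """
    WITH ranked AS (
        SELECT department,
               salary,
               DENSE_RANK() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS salary_rank
        FROM employees
    )
    SELECT DISTINCT department, salary
    FROM ranked
    WHERE salary_rank = 2      -- second distinct salary in each department
    ORDER BY department
"""

second_highest = conn.execute(SECOND_HIGHEST_SQL).fetchall()
print(second_highest)  # [('Data', 120), ('Ops', 90)]
```

A department with only one distinct salary produces no row under this sketch; whether it should return NULL instead is worth clarifying with the interviewer.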

Data Engineer • Coding • medium

Write a SQL query to compute a 7-day rolling average of daily sales.

#Window Functions #Analytics

Data Engineer • Coding • medium

Write a SQL query to find the rolling 7-day average of daily sales per product category, given an 'orders' table with order_id, product_id, category_id, order_date, and order_amount.

#Window Functions #Time Series #Aggregations
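
A minimal sketch of one way to answer this, run through Python's sqlite3 module. The table and column names come from the question; the sample rows are made up. Orders are first summed to one row per category per day, then averaged over the current row and the six preceding rows. This assumes every category has a row for every calendar day; if days can be missing, a calendar table (or a date-based `RANGE` frame) would be needed instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER, product_id INTEGER, category_id INTEGER,
        order_date TEXT, order_amount REAL
    );
    -- Made-up sample rows.
    INSERT INTO orders VALUES
        (1, 10, 1, '2024-01-01', 10.0),
        (2, 11, 1, '2024-01-01', 20.0),
        (3, 10, 1, '2024-01-02', 60.0),
        (4, 12, 1, '2024-01-03', 90.0),
        (5, 20, 2, '2024-01-01', 100.0);
""")

ROLLING_SQL = """
    WITH daily AS (
        -- Collapse individual orders to one row per category per day.
        SELECT category_id, order_date, SUM(order_amount) AS daily_sales
        FROM orders
        GROUP BY category_id, order_date
    )
    SELECT category_id,
           order_date,
           AVG(daily_sales) OVER (
               PARTITION BY category_id
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7d_avg
    FROM daily
    ORDER BY category_id, order_date
"""

rolling = conn.execute(ROLLING_SQL).fetchall()
for row in rolling:
    print(row)
```

For the first six days of each category the frame holds fewer than seven rows, so the average is taken over the days available so far.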

Data Engineer • Coding • medium

Write a Python function to flatten a deeply nested JSON object representing an Amazon product catalog, where keys of nested dictionaries should be concatenated with a dot ('.').

#Python #Recursion #Data Structures #JSON Parsing
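
A short recursive sketch of the flattening described above. The function name `flatten` and the sample catalog entry are illustrative. Two choices here are assumptions the question leaves open: lists are kept as values rather than expanded by index, and an empty nested dictionary is kept as a value so the key is not lost.

```python
from typing import Any


def flatten(obj: dict[str, Any], parent_key: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts, joining nested keys with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested dictionaries.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars, lists, and empty dicts are stored as-is.
            flat[full_key] = value
    return flat


# Made-up catalog entry.
product = {
    "id": 1,
    "details": {"title": "Kindle", "dims": {"w": 4}},
    "tags": ["ereader"],
}
flat_product = flatten(product)
print(flat_product)
# {'id': 1, 'details.title': 'Kindle', 'details.dims.w': 4, 'tags': ['ereader']}
```

For extremely deep nesting, Python's recursion limit becomes a concern; an explicit stack would avoid it at the cost of slightly more code.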

Data Engineer • Coding • hard

Write a SQL query to identify 'loyal' customers who have made at least one purchase in 3 consecutive months.

#Self Joins #Window Functions #Gaps and Islands
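
A sketch of the gaps-and-islands approach named in the tags, run through sqlite3. The `purchases` table, its columns, and the rows are hypothetical. Each purchase month is turned into a running month index (year × 12 + month); subtracting a per-customer `ROW_NUMBER()` from that index yields a constant for every run of consecutive months, so grouping on it finds runs of length three or more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (customer_id INTEGER, order_date TEXT);
    -- Made-up rows: customer 1 buys in Nov, Dec, Jan (consecutive across a
    -- year boundary); customer 2 skips March; customer 3 buys in one month only.
    INSERT INTO purchases VALUES
        (1, '2023-11-05'), (1, '2023-12-10'), (1, '2024-01-02'), (1, '2024-01-20'),
        (2, '2024-01-03'), (2, '2024-02-14'), (2, '2024-04-01'),
        (3, '2024-03-01'), (3, '2024-03-15'), (3, '2024-03-30');
""")

LOYAL_SQL = """
    WITH months AS (
        -- One row per customer per calendar month with a purchase.
        SELECT DISTINCT
               customer_id,
               CAST(strftime('%Y', order_date) AS INTEGER) * 12
             + CAST(strftime('%m', order_date) AS INTEGER) AS month_idx
        FROM purchases
    ),
    islands AS (
        -- Consecutive months share the same (month_idx - row number).
        SELECT customer_id,
               month_idx - ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY month_idx
               ) AS island_id
        FROM months
    )
    SELECT DISTINCT customer_id
    FROM islands
    GROUP BY customer_id, island_id
    HAVING COUNT(*) >= 3
    ORDER BY customer_id
"""

loyal = [row[0] for row in conn.execute(LOYAL_SQL)]
print(loyal)  # [1]
```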

Data Engineer • Coding • medium

Given a list of strings representing Amazon search queries, write a Python script to return the top K most frequent queries. Your solution must be optimized for large datasets.

#Python #Heaps #Hash Maps #Big O Notation
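
One way to approach this, sketched with the standard library: count with a hash map, then select the top K with a bounded heap. With n queries and m distinct queries this costs roughly O(n + m log k) time rather than the O(m log m) of a full sort. The function name and the tie-breaking rule (alphabetical among equal counts) are assumptions, and the sketch assumes the distinct-query counts fit in memory.

```python
import heapq
from collections import Counter
from typing import Iterable


def top_k_queries(queries: Iterable[str], k: int) -> list[tuple[str, int]]:
    """Return the k most frequent queries as (query, count) pairs."""
    counts = Counter(queries)  # single pass; accepts any iterable, including a generator
    # nsmallest on (-count, query) keeps a heap of size k and breaks ties alphabetically.
    return heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample input.
sample = ["kindle", "echo", "kindle", "fire tv", "echo", "kindle"]
top_two = top_k_queries(sample, 2)
print(top_two)  # [('kindle', 3), ('echo', 2)]
```

If the input is too large for one machine, a reasonable follow-up is to count per partition, merge the partial counts, and apply the same heap step to the merged result.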

Data Engineer • Coding • medium

Write a SQL query to calculate the Year-over-Year (YoY) growth rate of total revenue for each product sub-category.

#Date Functions #CTEs #Math Operations
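
A hedged sketch using a CTE plus `LAG()`, run through sqlite3. The `sales` table, its columns, and the rows are hypothetical. Note that `LAG()` picks the previous row in the partition, which is the previous calendar year only when no year is missing for that sub-category; a self-join on `yr - 1` would be stricter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sub_category TEXT, order_date TEXT, revenue REAL);
    -- Made-up rows.
    INSERT INTO sales VALUES
        ('Headphones', '2022-03-01', 100.0),
        ('Headphones', '2022-09-15', 100.0),
        ('Headphones', '2023-05-20', 250.0),
        ('Laptops',    '2023-07-04', 400.0);
""")

YOY_SQL = """
    WITH yearly AS (
        SELECT sub_category,
               CAST(strftime('%Y', order_date) AS INTEGER) AS yr,
               SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY sub_category, yr
    )
    SELECT sub_category,
           yr,
           total_revenue,
           -- NULL for the first year of each sub-category (no prior row).
           ROUND(100.0 * (total_revenue - LAG(total_revenue) OVER w)
                       / LAG(total_revenue) OVER w, 2) AS yoy_growth_pct
    FROM yearly
    WINDOW w AS (PARTITION BY sub_category ORDER BY yr)
    ORDER BY sub_category, yr
"""

yoy = conn.execute(YOY_SQL).fetchall()
for row in yoy:
    print(row)
```

Here Headphones grows from 200.0 to 250.0, a 25.0 percent increase; the first year of each sub-category has no baseline and comes back as NULL (`None` in Python).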

Data Engineer • System Design • hard

Design an ETL pipeline that ingests 10TB of raw clickstream data daily.

#ETL #Batch Processing

Data Engineer • System Design • hard

How would you design a data pipeline that needs exactly-once delivery guarantees?

#Exactly-Once #Kafka

Data Engineer • System Design • hard

How would you design a real-time anomaly detection pipeline for 100K events/sec?

#Real-Time #Anomaly Detection

Data Engineer • System Design • hard

Design a data model for an e-commerce platform tracking orders, users, and products.

#ER Modeling #Dimensional Modeling

Data Engineer • System Design • hard

How would you design a data warehouse for a ride-sharing company from scratch?

#Architecture #Design

Data Engineer • System Design • hard

Design a real-time inventory tracking system for Amazon's fulfillment network.

#Inventory #Streaming

Data Engineer • System Design • hard

Design a data pipeline for Prime Video's recommendation signals.

#Prime Video #Pipeline

Data Engineer • System Design • hard

Design a real-time streaming pipeline to process and aggregate Amazon clickstream data to detect anomalous user behavior (e.g., bot scraping) within a 1-minute window.

#AWS Kinesis #Apache Flink #Stream Processing #Anomaly Detection

Data Engineer • System Design • medium

Design a dimensional data model (Star Schema) for Amazon Prime Video to track user viewership, subscription changes, and content metadata.

#Star Schema #Fact Tables #Dimension Tables #SCD

Data Engineer • System Design • hard

Design an ETL pipeline to migrate 100TB of historical order data from an on-premises Oracle database to AWS Redshift, ensuring zero data loss and minimal downtime.

#Data Migration #AWS DMS #AWS S3 #AWS Redshift

Data Engineer • System Design • hard

Design a scalable Data Lake architecture on AWS to support both ad-hoc querying by data scientists and daily aggregated reporting by BI tools.

#Data Lake #AWS S3 #AWS Athena #AWS Glue #Parquet/Iceberg

Data Engineer • Technical • medium

Explain the difference between OLAP and OLTP systems. When would you use each?

#OLAP #OLTP #Databases

Data Engineer • Technical • hard

What is a slowly changing dimension (SCD)? Describe SCD Type 1, 2, and 3 with examples.

#SCD #Dimensional Modeling
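
To make the Type 2 case concrete, here is a minimal sketch using sqlite3. The `dim_customer` table, its columns, and the helper `apply_scd2_change` are all hypothetical. Type 1 would simply overwrite `city`; Type 3 would keep a `previous_city` column; Type 2, shown here, closes the current row and inserts a new one so the full history is preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER, city TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (42, 'Seattle', '2023-01-01', NULL, 1)")


def apply_scd2_change(conn, customer_id, city, change_date):
    """Record a new attribute value as a new row, closing the old one (SCD Type 2)."""
    with conn:  # commit on success, roll back on error
        current = conn.execute(
            "SELECT city FROM dim_customer WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        ).fetchone()
        if current is not None and current[0] == city:
            return  # no change, so no new version
        # Close the currently active row, if any.
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id),
        )
        # Open a new active row for the new value.
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, NULL, 1)",
            (customer_id, city, change_date),
        )


apply_scd2_change(conn, 42, "Austin", "2024-06-01")
apply_scd2_change(conn, 42, "Austin", "2024-07-01")  # same value: ignored

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer "
    "WHERE customer_id = 42 ORDER BY valid_from"
).fetchall()
print(history)
# [('Seattle', '2023-01-01', '2024-06-01', 0), ('Austin', '2024-06-01', None, 1)]
```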

Data Engineer • Technical • hard

How would you optimize a SQL query that is running slowly on a 1 billion row table?

#Query Optimization #Indexing

Data Engineer • Technical • medium

Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER().

#Window Functions #SQL
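
A small side-by-side demonstration, run through sqlite3 on a made-up `scores` table with one tie. `ROW_NUMBER()` always produces unique, consecutive numbers; `RANK()` gives ties the same rank and then skips; `DENSE_RANK()` gives ties the same rank without skipping. The extra `name` sort key on `ROW_NUMBER()` is only there to make the tie order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('a', 100), ('b', 90), ('c', 90), ('d', 80);
""")

RANKING_SQL = """
    SELECT name,
           score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
"""

ranking = conn.execute(RANKING_SQL).fetchall()
for row in ranking:
    print(row)
# ('a', 100, 1, 1, 1)
# ('b', 90, 2, 2, 2)
# ('c', 90, 3, 2, 2)
# ('d', 80, 4, 4, 3)
```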

Data Engineer • Technical • medium

What is a materialized view? How does it differ from a regular view?

#Materialized Views #Performance

Data Engineer • Technical • hard

Describe partitioning strategies in a data warehouse. When would you use range vs hash partitioning?

#Partitioning #Performance

Data Engineer • Technical • medium

What are CTEs (Common Table Expressions) and how do they differ from subqueries?

#CTEs #SQL

Data Engineer • Technical • medium

Explain ACID properties. Which databases sacrifice ACID for performance and why?

#ACID #Distributed Systems

Data Engineer • Technical • hard

How do you handle late-arriving data in a streaming pipeline?

#Kafka #Watermarks

Data Engineer • Technical • medium

What is idempotency and why is it critical in data pipelines?

#Idempotency #Data Quality
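
One common way to illustrate idempotency is a load step that can be re-run safely after a retry. The sketch below uses sqlite3 and an upsert keyed on a primary key, so replaying the same batch leaves the table unchanged instead of duplicating rows. The `orders` table and the `load_batch` helper are hypothetical; `ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")


def load_batch(conn, rows):
    """Upsert a batch; running it twice has the same effect as running it once."""
    with conn:  # one transaction per batch
        conn.executemany(
            """
            INSERT INTO orders (order_id, amount) VALUES (?, ?)
            ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
            """,
            rows,
        )


batch = [(1, 10.0), (2, 25.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch

row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(row_count, total)  # 2 35.5
```

A plain `INSERT` in the same position would either fail on the retry or, without the primary key, silently double the totals.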

Data Engineer • Technical • hard

Explain the Lambda architecture. What are its tradeoffs vs Kappa architecture?

#Lambda #Kappa #Streaming

Data Engineer • Technical • hard

What is backfilling? How do you handle a backfill of 2 years of historical data without impacting production?

#Backfill #Airflow

Data Engineer • Technical • medium

Describe how you'd implement circuit breakers in a data pipeline.

#Circuit Breakers #Fault Tolerance
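
A minimal sketch of the pattern, not tied to any particular library. The class name, parameters, and thresholds are illustrative. The breaker counts consecutive failures of a downstream call, refuses further calls once a threshold is reached (open), and lets a single trial call through after a cool-down (half-open). The clock is injectable so the behavior can be exercised without sleeping.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demonstration with a fake clock and a failing dependency.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def failing_source():
    raise ConnectionError("source unavailable")


for _ in range(2):
    try:
        breaker.call(failing_source)
    except ConnectionError:
        pass

try:
    breaker.call(lambda: "ok")
    blocked = False
except CircuitOpenError:
    blocked = True  # the call was skipped without touching the dependency

now[0] = 11.0  # cool-down has passed
recovered = breaker.call(lambda: "ok")
print(blocked, recovered)  # True ok
```

In a pipeline, the wrapped function would typically be a call to an upstream API or database, and an open circuit would route the batch to a retry queue or dead-letter location rather than hammering the failing dependency.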

Data Engineer • Technical • medium

How do you monitor data pipeline health in production? What metrics do you track?

#Monitoring #Alerting

Data Engineer • Technical • medium

What is Apache Airflow? How does it differ from Prefect or Dagster?

#Airflow #Prefect #Dagster

Data Engineer • Technical • easy

Explain the difference between push-based and pull-based data ingestion.

#Push #Pull #CDC

Data Engineer • Technical • hard

Explain how Apache Spark's execution model works. What is a DAG in Spark?

#Spark #DAG #Distributed Computing

Data Engineer • Technical • hard

What is data skew in Spark? How do you diagnose and fix it?

#Data Skew #Performance

Data Engineer • Technical • hard

Explain the difference between map-side and reduce-side joins in MapReduce/Spark.

#Joins #MapReduce

Data Engineer • Technical • medium

What is Apache Kafka? Explain topics, partitions, consumer groups, and offsets.

#Kafka #Streaming

Data Engineer • Technical • medium

How does Kafka handle message ordering guarantees?

#Ordering #Partitions

Data Engineer • Technical • medium

What is the CAP theorem? Give an example of a real-world system tradeoff.

#CAP #Consistency #Availability

Data Engineer • Technical • medium

Explain how Parquet and ORC file formats work and when you'd use each.

#Parquet #ORC #Columnar

Data Engineer • Technical • hard

What is Delta Lake? How does it provide ACID transactions on data lakes?

#Delta Lake #ACID #Time Travel

Data Engineer • Technical • medium

Explain compaction in Delta Lake / Iceberg. Why is it important?

#Compaction #Performance

Data Engineer • Technical • medium

What is the star schema vs snowflake schema? When would you use each?

#Star Schema #Snowflake Schema

Data Engineer • Technical • hard

What is Data Vault methodology? How does it differ from Kimball?

#Data Vault #Kimball

Data Engineer • Technical • medium

Explain the concept of a data lakehouse. What are its advantages over a traditional data warehouse?

#Data Lakehouse #Data Warehouse

Data Engineer • Technical • hard

How do you handle schema evolution in a data pipeline without breaking downstream consumers?

#Schema Evolution #Backward Compatibility

Data Engineer • Technical • medium

What is a medallion architecture (Bronze/Silver/Gold)?

#Medallion #Data Lake

Data Engineer • Technical • medium

How do you implement data quality checks in a production pipeline?

#Great Expectations #Data Validation

Data Engineer • Technical • medium

What is data lineage and why is it important? How do you implement it?

#Lineage #Metadata

Data Engineer • Technical • hard

How would you detect and handle data drift in a production system?

#Data Drift #Monitoring

Data Engineer • Technical • medium

What is PII (Personally Identifiable Information) and how do you handle it in a data pipeline?

#PII #Privacy #Compliance

Data Engineer • Technical • medium

Explain the concept of a data catalog. What tools have you used?

#Data Catalog #Metadata

Data Engineer • Technical • hard

Compare AWS Redshift, Google BigQuery, and Snowflake for a petabyte-scale warehouse.

#Redshift #BigQuery #Snowflake

Data Engineer • Technical • hard

How does BigQuery handle large joins efficiently? What is its columnar storage approach?

#BigQuery #Columnar Storage

Data Engineer • Technical • medium

Explain the difference between S3, HDFS, and GCS for data storage.

#S3 #HDFS #GCS

Data Engineer • Technical • medium

How would you reduce costs in a cloud-based data platform?

#Cloud #Cost

Data Engineer • Technical • medium

What is infrastructure as code (IaC)? Have you used Terraform for data infrastructure?

#Terraform #IaC

Data Engineer • Technical • hard

How would you use AWS Glue and Athena to build a serverless data lake?

#Glue #Athena

Data Engineer • Technical • hard

Explain how Amazon Redshift Spectrum enables querying S3 data.

#Spectrum #S3

Data Engineer • Technical • hard

How do you implement CDC (Change Data Capture) using AWS DMS?

#DMS #Replication

Data Engineer • Technical • hard

What is Amazon's Write Every Read (WEAR) approach and why?

#WEAR #Data Modeling

Data Engineer • Technical • hard

How would you optimize a slow-running Apache Spark job on AWS EMR that is suffering from severe data skew during a large join operation?

#Apache Spark #Performance Tuning #Data Skew #AWS EMR

Data Engineer • Technical • medium

Explain the difference between distribution styles (KEY, ALL, EVEN) in Amazon Redshift. Given a massive 'orders' table and a small 'date' dimension table, which distribution styles would you choose and why?

#AWS Redshift #Distributed Databases #Query Optimization

Data Engineer • Technical • medium

How do you handle dependency management, backfilling, and failure recovery in a complex Apache Airflow DAG processing daily e-commerce transactions?

#Apache Airflow #DAGs #Fault Tolerance #Idempotency

Meet Your Interviewers

The "Standard" Interviewer

Senior Engineer

Focuses on core competencies, system constraints, and clear communication.

Unwritten Rules

Think Out Loud

Always explain your thought process before writing code or drawing architecture.
