KPMG

A multinational professional services network and one of the Big Four accounting organizations.

4 Rounds · ~21 Days · Medium Difficulty

The Interview Loop

Recruiter Screen (30 min)

Standard fit check, behavioral questions, and resume overview.

Technical Loop (3-4 Rounds)

Deep dive into domain knowledge, coding, and system design.

Interview Question Bank

Data Engineer Behavioral medium

Tell me about a time you had to explain a complex data engineering concept to a non-technical client stakeholder.

#Stakeholder Management #Communication #Consulting
Data Engineer Behavioral medium

Describe a situation where a client changed the requirements of a data pipeline midway through the sprint. How did you handle it?

#Agile #Scope Creep #Client Management
Data Engineer Behavioral medium

Tell me about a time you discovered a critical data quality issue right before a client deliverable was due.

#Data Quality #Crisis Management #Integrity
Data Engineer Behavioral easy

How do you prioritize tasks when working on multiple client engagements with competing deadlines?

#Prioritization #Consulting #Organization
Data Engineer Behavioral medium

Describe a time you disagreed with a senior architect or manager on a technical design. How was it resolved?

#Teamwork #Technical Disagreement #Professionalism
Data Engineer Coding medium

Write a SQL query to find the top 3 highest-paid employees in each department, handling ties appropriately.

#Window Functions #DENSE_RANK #Aggregations
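One possible answer, runnable here against SQLite for illustration; the `employees` table and its `department`/`salary` columns are assumed for the example. `DENSE_RANK()` keeps tied salaries on the same rank, so employees with equal pay are all included:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Ann','Audit',90000),('Bob','Audit',85000),('Cal','Audit',85000),
  ('Dee','Audit',80000),('Eve','Tax',70000),('Fay','Tax',65000);
""")

# DENSE_RANK assigns tied salaries the same rank with no gap afterwards,
# so both 85000 earners count as one "place" and Dee still makes the top 3.
query = """
SELECT department, name, salary
FROM (
    SELECT department, name, salary,
           DENSE_RANK() OVER (
               PARTITION BY department ORDER BY salary DESC
           ) AS rnk
    FROM employees
) AS ranked
WHERE rnk <= 3
ORDER BY department, salary DESC;
"""
rows = conn.execute(query).fetchall()
```

With `RANK()` instead, the tie at 85000 would consume ranks 2 and 2 but skip 3, dropping Dee; that trade-off is exactly what "handling ties appropriately" probes.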
Data Engineer Coding medium

Calculate a rolling 7-day average of daily transactions for a financial client using SQL.

#Window Functions #Time Series #Financial Data
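A sketch of the standard window-frame approach, demonstrated on SQLite; the `daily_txn` table name and columns are assumed. Note the stated assumption of exactly one row per calendar day:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_txn (txn_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_txn VALUES (?, ?)",
    [(f"2024-01-{d:02d}", 100.0 * d) for d in range(1, 11)],
)

# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW assumes one row per calendar
# day; with gaps in the calendar you would use a RANGE frame or a date
# spine join instead.
query = """
SELECT txn_date,
       AVG(amount) OVER (
           ORDER BY txn_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
FROM daily_txn
ORDER BY txn_date;
"""
rows = conn.execute(query).fetchall()
```

The first six rows average over a shorter, growing window; interviewers often ask whether those partial windows should be reported or suppressed.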
Data Engineer Coding hard

Write a SQL query to identify overlapping date ranges in a client's software subscription dataset.

#Self Joins #Date Functions #Complex Logic
Data Engineer Coding medium

Write a Python function to parse a deeply nested JSON file from a REST API and flatten it into a tabular pandas DataFrame.

#JSON Parsing #Pandas #Data Transformation
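In pandas, `pandas.json_normalize` handles this directly; the pure-Python sketch below shows the underlying recursion without the dependency, producing one flat dict per record that can be fed to `pd.DataFrame`. The payload shape is invented for illustration:

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts into a single-level dict with
    dotted keys; non-dict values (including lists) are kept as-is."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

payload = json.loads(
    '{"id": 1, "customer": {"name": "Acme", "address": {"city": "NYC"}}}'
)
row = flatten(payload)
```

A strong answer also covers how to explode nested *lists* (one output row per list element), which is where `json_normalize`'s `record_path` argument comes in.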
Data Engineer Coding medium

Write a Python script to process and merge multiple large CSV files (50GB+) that do not fit into memory.

#Chunking #Generators #Memory Management
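A minimal sketch of the streaming idea, assuming the input files are already sorted on the merge key (for unsorted 50GB files you would external-sort the chunks first). Generators yield one row at a time and `heapq.merge` performs a k-way merge, so memory use scales with the number of files, not their size:

```python
import csv
import heapq
import os
import tempfile

def read_rows(path):
    """Generator: yields one CSV row at a time, so only a single row
    per file is ever resident in memory."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield row

def merge_csvs(paths, out_path, key=lambda r: r[0]):
    """Stream-merge pre-sorted CSVs with heapq.merge (k-way merge)."""
    streams = [read_rows(p) for p in paths]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in heapq.merge(*streams, key=key):
            writer.writerow(row)

tmp = tempfile.mkdtemp()
a, b = os.path.join(tmp, "a.csv"), os.path.join(tmp, "b.csv")
open(a, "w").write("1,apple\n3,cherry\n")
open(b, "w").write("2,banana\n4,date\n")
out = os.path.join(tmp, "merged.csv")
merge_csvs([a, b], out)
merged = [r for r in csv.reader(open(out))]
```

Mentioning the alternatives, chunked `pandas.read_csv(..., chunksize=...)` or pushing the job to Spark/DuckDB, rounds out the answer.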
Data Engineer Coding easy

Implement a Python function to detect and remove duplicate records based on a composite key, keeping the most recently updated record.

#Deduplication #Pandas #Data Cleaning
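The pandas one-liner is roughly `df.sort_values("updated_at").drop_duplicates(subset=key_cols, keep="last")`; the dependency-free sketch below makes the logic explicit, with field names invented for illustration:

```python
def dedupe(records, key_fields, ts_field="updated_at"):
    """For each composite key, keep only the record with the latest
    timestamp; a single dict pass gives O(n) time and memory."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())

records = [
    {"cust": "A", "sku": "X", "qty": 1, "updated_at": "2024-01-01"},
    {"cust": "A", "sku": "X", "qty": 5, "updated_at": "2024-02-01"},
    {"cust": "B", "sku": "Y", "qty": 2, "updated_at": "2024-01-15"},
]
clean = dedupe(records, ["cust", "sku"])
```

ISO-8601 timestamp strings compare correctly lexicographically, which is why the plain `>` works here; with mixed formats you would parse to `datetime` first.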
Data Engineer Coding easy

Given a list of dictionaries representing financial transactions, write a Python function to aggregate total spend by category without using external libraries.

#Data Structures #Dictionaries #Aggregation
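A minimal stdlib-free answer, with the transaction shape assumed for the example:

```python
def spend_by_category(transactions):
    """Aggregate total amount per category using a plain dict,
    no external libraries required."""
    totals = {}
    for txn in transactions:
        cat = txn["category"]
        totals[cat] = totals.get(cat, 0) + txn["amount"]
    return totals

txns = [
    {"category": "travel", "amount": 120.0},
    {"category": "meals", "amount": 45.5},
    {"category": "travel", "amount": 80.0},
]
totals = spend_by_category(txns)
```

If stdlib imports are allowed, `collections.defaultdict(float)` tidies the accumulation; for real money amounts, `decimal.Decimal` avoids float rounding.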
Data Engineer Coding medium

Write a SQL query to identify gaps in sequential invoice numbers for an audit client.

#Audit Data #Sequential Gaps #LEAD/LAG
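One `LAG`-based approach, demonstrated on SQLite with an assumed single-column `invoices` table. Any step larger than 1 between consecutive invoice numbers marks a missing range:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_no INTEGER)")
conn.executemany("INSERT INTO invoices VALUES (?)",
                 [(n,) for n in (100, 101, 102, 105, 106, 109)])

# LAG exposes the previous invoice number in sequence order; the first
# row's NULL comparison is simply filtered out by the WHERE clause.
query = """
SELECT prev_no + 1 AS gap_start, invoice_no - 1 AS gap_end
FROM (
    SELECT invoice_no,
           LAG(invoice_no) OVER (ORDER BY invoice_no) AS prev_no
    FROM invoices
) AS seq
WHERE invoice_no - prev_no > 1
ORDER BY gap_start;
"""
gaps = conn.execute(query).fetchall()
```

Reporting gaps as ranges (`gap_start`, `gap_end`) rather than individual missing numbers is usually what an audit team wants.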
Data Engineer Coding medium

Write a Python algorithm to find the longest consecutive sequence of days a user logged into a client portal.

#Python #Arrays #Logic
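A common O(n) answer using a set: only start counting at the beginning of a run (where `d - 1` is absent), so each day is visited at most twice. Dates can be reduced to integers with `date.toordinal()` first:

```python
def longest_streak(days):
    """Return the length of the longest run of consecutive integers
    (e.g. ordinal day numbers from date.toordinal())."""
    day_set = set(days)
    best = 0
    for d in day_set:
        if d - 1 not in day_set:      # d starts a run
            length = 1
            while d + length in day_set:
                length += 1
            best = max(best, length)
    return best

streak = longest_streak([1, 2, 3, 7, 8, 9, 10, 20])
```

The naive sort-then-scan solution is O(n log n) and also acceptable; explaining why the set version stays linear is the part interviewers listen for.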
Data Engineer System Design hard

Design a batch ETL pipeline to ingest daily transaction data from 50 different regional banks into a centralized Azure Data Lake.

#Batch Processing #Azure #Data Ingestion #Scalability
Data Engineer System Design hard

Design a real-time fraud detection data pipeline for a credit card company.

#Streaming #Kafka #Real-time Processing #Fraud Detection
Data Engineer System Design medium

How would you design a data model for a retail client's Customer 360 dashboard?

#Dimensional Modeling #Customer 360 #Star Schema
Data Engineer System Design hard

Design a system to migrate on-premise legacy SQL Server data to a cloud-native Snowflake environment with minimal downtime.

#Cloud Migration #Snowflake #Change Data Capture (CDC)
Data Engineer System Design medium

Architect a logging, alerting, and monitoring solution for a complex data pipeline to ensure data quality and pipeline reliability.

#Observability #Monitoring #Data Quality
Data Engineer Technical hard

How do you optimize a slow-running SQL query with multiple joins and aggregations that is timing out on a client's database?

#Query Optimization #Indexing #Execution Plans
Data Engineer Technical easy

Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER(). Give a specific use case for each in an audit context.

#Window Functions #Data Ranking
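The three functions differ only in how they treat ties, which a four-row example makes concrete (run here on SQLite; the `scores` table is invented). In an audit context: `ROW_NUMBER` to pick exactly one record per key for sampling, `RANK` to report competition-style standings where ties skip positions, `DENSE_RANK` to take the "top N distinct values" as in top-3 salary bands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (val INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(50,), (40,), (40,), (30,)])

rows = conn.execute("""
SELECT val,
       ROW_NUMBER() OVER (ORDER BY val DESC) AS row_num,   -- 1,2,3,4 (unique)
       RANK()       OVER (ORDER BY val DESC) AS rnk,       -- 1,2,2,4 (gap)
       DENSE_RANK() OVER (ORDER BY val DESC) AS dense_rnk  -- 1,2,2,3 (no gap)
FROM scores
ORDER BY val DESC
""").fetchall()
```

Note that `ROW_NUMBER` breaks ties arbitrarily unless the `ORDER BY` is made deterministic with a tiebreaker column, a classic audit reproducibility gotcha.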
Data Engineer Technical medium

How do you handle missing, null, or corrupted data in a large dataset before loading it into a data warehouse?

#Data Cleansing #Imputation #ETL
Data Engineer Technical medium

Explain how Spark handles data partitioning and why it matters for pipeline performance.

#PySpark #Partitioning #Distributed Computing
Data Engineer Technical hard

How do you handle data skewness in a PySpark join operation where one key has millions of records and others have very few?

#PySpark #Data Skew #Performance Tuning
Data Engineer Technical medium

What is a broadcast join in Spark, and when would you use it in a client's ETL pipeline?

#PySpark #Joins #Optimization
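In PySpark the hint is `large_df.join(broadcast(small_df), "cust_id")` with `broadcast` from `pyspark.sql.functions`: the small table is shipped whole to every executor so the large side joins map-side with no shuffle. The single-process analogue below (names invented) shows why that works, a hash map of the small side plus a streaming pass over the large side:

```python
def broadcast_join(large_rows, small_rows, key):
    """Single-process analogue of a Spark broadcast (map-side) join:
    hash the small side once, stream the large side against it, and
    never shuffle or sort the large dataset."""
    lookup = {row[key]: row for row in small_rows}
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:           # inner-join semantics
            yield {**row, **match}

dim = [{"cust_id": 1, "region": "EMEA"}, {"cust_id": 2, "region": "APAC"}]
facts = [{"cust_id": 1, "amount": 10}, {"cust_id": 2, "amount": 20},
         {"cust_id": 3, "amount": 30}]
joined = list(broadcast_join(facts, dim, "cust_id"))
```

The ETL use case is exactly this shape: a huge fact/transaction table joined to a small dimension or reference table that fits in each executor's memory (governed by `spark.sql.autoBroadcastJoinThreshold`).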
Data Engineer Technical easy

Explain the difference between transformations and actions in Spark. Give examples of each.

#PySpark #Lazy Evaluation
Data Engineer Technical hard

How would you troubleshoot and optimize a PySpark job that is failing with an OutOfMemory (OOM) error on the driver node?

#PySpark #Troubleshooting #Memory Management
Data Engineer Technical medium

Describe how you would set up a CI/CD pipeline for Databricks notebooks and data pipelines using Azure DevOps.

#CI/CD #Databricks #Azure DevOps
Data Engineer Technical easy

What is the difference between a Data Warehouse, a Data Lake, and a Data Lakehouse? Why are clients moving towards Lakehouses?

#Data Lakehouse #Data Warehouse #Delta Lake
Data Engineer Technical medium

How do you implement Slowly Changing Dimensions (SCD Type 2) in Snowflake or Databricks?

#SCD Type 2 #Data Warehousing #Snowflake #Databricks
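In Snowflake or Databricks this is typically a `MERGE` statement (match on the business key, expire the current row on change, insert a new one). The hedged Python sketch below, with invented column names, captures the bookkeeping that the `MERGE` has to perform, `effective_from` / `effective_to` / `is_current` per version:

```python
from datetime import date

def apply_scd2(dim_rows, incoming, key, tracked, today=None):
    """Sketch of SCD Type 2: when a tracked attribute changes, close
    the current dimension row and append a new current version."""
    today = today or date.today().isoformat()
    current = {r[key]: r for r in dim_rows if r["is_current"]}
    for rec in incoming:
        cur = current.get(rec[key])
        if cur and all(cur[c] == rec[c] for c in tracked):
            continue                      # no change: nothing to do
        if cur:                           # expire the old version
            cur["is_current"] = False
            cur["effective_to"] = today
        dim_rows.append({**rec, "effective_from": today,
                         "effective_to": None, "is_current": True})
    return dim_rows

dim = [{"cust_id": 1, "address": "Old St", "effective_from": "2023-01-01",
        "effective_to": None, "is_current": True}]
out = apply_scd2(dim, [{"cust_id": 1, "address": "New Ave"}],
                 "cust_id", ["address"], today="2024-06-01")
```

A follow-up worth anticipating: how to make the load idempotent so re-running the same batch does not create duplicate versions.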
Data Engineer Technical medium

Explain the architecture of Azure Data Factory (ADF). How do you use it to orchestrate complex ETL pipelines?

#Azure Data Factory #Orchestration #ETL
Data Engineer Technical hard

How do you ensure data security, masking, and governance when building a cloud data platform for a highly regulated healthcare or financial client?

#Data Security #PII/PHI #RBAC #Data Governance
Data Engineer Technical medium

Explain the concept of Delta Lake and its advantages over traditional Parquet files in a data lake.

#Delta Lake #ACID Transactions #Time Travel
Data Engineer Technical medium

How would you implement incremental loading in an ETL pipeline using a watermark column?

#Incremental Load #Watermarking #Data Integration
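The pattern: persist the maximum change timestamp loaded so far, pull only rows newer than it, then advance it, which is what Azure Data Factory's documented "delta copy with a watermark" template automates. A minimal sketch with invented field names:

```python
def incremental_load(source_rows, sink_rows, watermark):
    """Pull only rows strictly newer than the stored watermark, append
    them to the sink, and advance the watermark to the max timestamp
    actually loaded."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    sink_rows.extend(new_rows)
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)
    return sink_rows, watermark

source = [{"id": 1, "updated_at": "2024-01-01"},
          {"id": 2, "updated_at": "2024-01-02"},
          {"id": 3, "updated_at": "2024-01-03"}]
sink, wm = incremental_load(source, [], "2024-01-01")
```

Good follow-up talking points: the watermark must be committed only after the load succeeds, and late-arriving rows with back-dated timestamps need either an overlap window or CDC instead.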
Data Engineer Technical medium

What strategies do you use for testing data pipelines before deploying them to production?

#Data Testing #Unit Testing #Integration Testing


Meet Your Interviewers

The "Standard" Interviewer

Senior Engineer

Focuses on core competencies, system constraints, and clear communication.

Unwritten Rules

Think Out Loud

Always explain your thought process before writing code or sketching an architecture.
