KPMG
A multinational professional services network and one of the Big Four accounting organizations.
4 Rounds
~21 Days
Medium
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Engineer
•
Behavioral
•
medium
Tell me about a time you had to explain a complex data engineering concept to a non-technical client stakeholder.
#Stakeholder Management
#Communication
#Consulting
Data Engineer
•
Behavioral
•
medium
Describe a situation where a client changed the requirements of a data pipeline midway through the sprint. How did you handle it?
#Agile
#Scope Creep
#Client Management
Data Engineer
•
Behavioral
•
medium
Tell me about a time you discovered a critical data quality issue right before a client deliverable was due.
#Data Quality
#Crisis Management
#Integrity
Data Engineer
•
Behavioral
•
easy
How do you prioritize tasks when working on multiple client engagements with competing deadlines?
#Prioritization
#Consulting
#Organization
Data Engineer
•
Behavioral
•
medium
Describe a time you disagreed with a senior architect or manager on a technical design. How was it resolved?
#Teamwork
#Technical Disagreement
#Professionalism
Data Engineer
•
Coding
•
medium
Write a SQL query to find the top 3 highest-paid employees in each department, handling ties appropriately.
#Window Functions
#DENSE_RANK
#Aggregations
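One possible answer, sketched here against an in-memory SQLite database (the table and column names `employees`, `department`, `salary` are assumptions, not given by the prompt): DENSE_RANK handles ties without skipping ranks, so "top 3" includes every employee tied at ranks 1 through 3.

```python
import sqlite3

# DENSE_RANK gives tied salaries the same rank and never skips ranks,
# so rnk <= 3 returns every employee in the top three salary bands.
QUERY = """
SELECT department, name, salary
FROM (
    SELECT department, name, salary,
           DENSE_RANK() OVER (
               PARTITION BY department ORDER BY salary DESC
           ) AS rnk
    FROM employees
)
WHERE rnk <= 3
ORDER BY department, salary DESC;
"""

def top_paid(rows):
    """Run the query against an in-memory SQLite DB seeded with `rows`."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INT)")
    con.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    return con.execute(QUERY).fetchall()
```

With two employees tied at the third-highest salary, both appear in the result, which is usually the behavior interviewers are probing for versus ROW_NUMBER.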
Data Engineer
•
Coding
•
medium
Calculate a rolling 7-day average of daily transactions for a financial client using SQL.
#Window Functions
#Time Series
#Financial Data
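A sketch of one approach (table name `daily_txn` and columns are assumed). A `ROWS BETWEEN 6 PRECEDING` frame equals a 7-day window only when there is exactly one row per calendar day; with gaps you would join to a date spine or use a RANGE frame instead, which is worth saying out loud in the interview.

```python
import sqlite3

# A 7-row frame approximates a 7-day window only if the table holds
# exactly one row per calendar day (pre-aggregated daily totals).
QUERY = """
SELECT txn_date,
       AVG(amount) OVER (
           ORDER BY txn_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
FROM daily_txn
ORDER BY txn_date;
"""

def rolling_avg(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE daily_txn (txn_date TEXT, amount REAL)")
    con.executemany("INSERT INTO daily_txn VALUES (?, ?)", rows)
    return con.execute(QUERY).fetchall()
```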
Data Engineer
•
Coding
•
hard
Write a SQL query to identify overlapping date ranges in a client's software subscription dataset.
#Self Joins
#Date Functions
#Complex Logic
Data Engineer
•
Coding
•
medium
Write a Python function to parse a deeply nested JSON file from a REST API and flatten it into a tabular pandas DataFrame.
#JSON Parsing
#Pandas
#Data Transformation
Data Engineer
•
Coding
•
medium
Write a Python script to process and merge multiple large CSV files (50GB+) that do not fit into memory.
#Chunking
#Generators
#Memory Management
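A minimal generator-based sketch, assuming the files share a schema: stream rows one at a time so peak memory is a single row regardless of total file size. (For the pandas route, `read_csv(chunksize=...)` gives the same effect per file.)

```python
import csv

def iter_rows(paths):
    """Lazily yield rows from each CSV in turn; only one row is ever
    held in memory, so total file size does not matter."""
    for path in paths:
        with open(path, newline="") as fh:
            yield from csv.DictReader(fh)

def merge_csvs(in_paths, out_path, fieldnames):
    """Stream-merge many same-schema CSVs into one output file."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for row in iter_rows(in_paths):
            writer.writerow(row)
```

If the merge must also deduplicate or sort, that changes the answer (external sort or a tool like DuckDB), which is a good trade-off to raise.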
Data Engineer
•
Coding
•
easy
Implement a Python function to detect and remove duplicate records based on a composite key, keeping the most recently updated record.
#Deduplication
#Pandas
#Data Cleaning
Data Engineer
•
Coding
•
easy
Given a list of dictionaries representing financial transactions, write a Python function to aggregate total spend by category without using external libraries.
#Data Structures
#Dictionaries
#Aggregation
Data Engineer
•
Coding
•
medium
Write a SQL query to identify gaps in sequential invoice numbers for an audit client.
#Audit Data
#Sequential Gaps
#LEAD/LAG
Data Engineer
•
Coding
•
medium
Write a Python algorithm to find the longest consecutive sequence of days a user logged into a client portal.
#Python
#Arrays
#Logic
Data Engineer
•
System Design
•
hard
Design a batch ETL pipeline to ingest daily transaction data from 50 different regional banks into a centralized Azure Data Lake.
#Batch Processing
#Azure
#Data Ingestion
#Scalability
Data Engineer
•
System Design
•
hard
Design a real-time fraud detection data pipeline for a credit card company.
#Streaming
#Kafka
#Real-time Processing
#Fraud Detection
Data Engineer
•
System Design
•
medium
How would you design a data model for a retail client's Customer 360 dashboard?
#Dimensional Modeling
#Customer 360
#Star Schema
Data Engineer
•
System Design
•
hard
Design a system to migrate on-premise legacy SQL Server data to a cloud-native Snowflake environment with minimal downtime.
#Cloud Migration
#Snowflake
#Change Data Capture (CDC)
Data Engineer
•
System Design
•
medium
Architect a logging, alerting, and monitoring solution for a complex data pipeline to ensure data quality and pipeline reliability.
#Observability
#Monitoring
#Data Quality
Data Engineer
•
Technical
•
hard
How do you optimize a slow-running SQL query with multiple joins and aggregations that is timing out on a client's database?
#Query Optimization
#Indexing
#Execution Plans
Data Engineer
•
Technical
•
easy
Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER(). Give a specific use case for each in an audit context.
#Window Functions
#Data Ranking
Data Engineer
•
Technical
•
medium
How do you handle missing, null, or corrupted data in a large dataset before loading it into a data warehouse?
#Data Cleansing
#Imputation
#ETL
Data Engineer
•
Technical
•
medium
Explain how Spark handles data partitioning and why it matters for pipeline performance.
#PySpark
#Partitioning
#Distributed Computing
Data Engineer
•
Technical
•
hard
How do you handle data skewness in a PySpark join operation where one key has millions of records and others have very few?
#PySpark
#Data Skew
#Performance Tuning
Data Engineer
•
Technical
•
medium
What is a broadcast join in Spark, and when would you use it in a client's ETL pipeline?
#PySpark
#Joins
#Optimization
Data Engineer
•
Technical
•
easy
Explain the difference between transformations and actions in Spark. Give examples of each.
#PySpark
#Lazy Evaluation
Data Engineer
•
Technical
•
hard
How would you troubleshoot and optimize a PySpark job that is failing with an OutOfMemory (OOM) error on the driver node?
#PySpark
#Troubleshooting
#Memory Management
Data Engineer
•
Technical
•
medium
Describe how you would set up a CI/CD pipeline for Databricks notebooks and data pipelines using Azure DevOps.
#CI/CD
#Databricks
#Azure DevOps
Data Engineer
•
Technical
•
easy
What is the difference between a Data Warehouse, a Data Lake, and a Data Lakehouse? Why are clients moving towards Lakehouses?
#Data Lakehouse
#Data Warehouse
#Delta Lake
Data Engineer
•
Technical
•
medium
How do you implement Slowly Changing Dimensions (SCD Type 2) in Snowflake or Databricks?
#SCD Type 2
#Data Warehousing
#Snowflake
#Databricks
Data Engineer
•
Technical
•
medium
Explain the architecture of Azure Data Factory (ADF). How do you use it to orchestrate complex ETL pipelines?
#Azure Data Factory
#Orchestration
#ETL
Data Engineer
•
Technical
•
hard
How do you ensure data security, masking, and governance when building a cloud data platform for a highly regulated healthcare or financial client?
#Data Security
#PII/PHI
#RBAC
#Data Governance
Data Engineer
•
Technical
•
medium
Explain the concept of Delta Lake and its advantages over traditional Parquet files in a data lake.
#Delta Lake
#ACID Transactions
#Time Travel
Data Engineer
•
Technical
•
medium
How would you implement incremental loading in an ETL pipeline using a watermark column?
#Incremental Load
#Watermarking
#Data Integration
Data Engineer
•
Technical
•
medium
What strategies do you use for testing data pipelines before deploying them to production?
#Data Testing
#Unit Testing
#Integration Testing
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer
Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.