PwC
PricewaterhouseCoopers, a multinational professional services network.
4 Rounds • ~21 Days • Medium Difficulty

The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank

Data Engineer • Behavioral • Medium
Tell me about a time you had to explain a complex technical data issue to a non-technical client stakeholder.
#Communication #Client Management #PwC Professional

Data Engineer • Behavioral • Medium
Describe a situation where a client changed the requirements for a data pipeline midway through the sprint. How did you handle it?
#Agile #Adaptability #Consulting

Data Engineer • Behavioral • Medium
Tell me about a time you identified a data quality issue that others missed. What was the impact and how did you resolve it?
#Attention to Detail #Data Quality #Problem Solving

Data Engineer • Behavioral • Easy
Why do you want to work as a Data Engineer at a consulting firm like PwC specifically, compared to working at a product-based company?
#Motivation #Consulting #PwC Professional

Data Engineer • Behavioral • Medium
Describe a time you had to work with a difficult team member or client who was resistant to adopting a new data engineering tool or process.
#Conflict Resolution #Change Management #Communication

Data Engineer • Coding • Medium
Write a SQL query to find the top 3 highest paid employees in each department. If there is a tie, they should have the same rank.
#Window Functions #DENSE_RANK #Joins
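
A possible answer, sketched as a runnable Python script against an in-memory SQLite database (window functions require SQLite 3.25+); the employees table and its columns are illustrative:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);
        INSERT INTO employees VALUES
            (1, 'Ana',   'Advisory',  95000),
            (2, 'Ben',   'Advisory',  95000),
            (3, 'Chloe', 'Advisory',  88000),
            (4, 'Dev',   'Advisory',  80000),
            (5, 'Elena', 'Assurance', 90000),
            (6, 'Femi',  'Assurance', 70000);
    """)

    # DENSE_RANK gives tied salaries the same rank with no gaps, which is what
    # satisfies the "ties share a rank" requirement (ROW_NUMBER would not).
    query = """
    SELECT department, name, salary
    FROM (
        SELECT department, name, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
        FROM employees
    ) AS ranked
    WHERE rnk <= 3
    ORDER BY department, salary DESC;
    """

    for row in conn.execute(query):
        print(row)
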
Data Engineer • Coding • Hard
Given a table of client project assignments with start and end dates, write a SQL query to identify any overlapping date ranges for the same consultant.
#Self Joins #Date Functions #Complex Logic
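
One way to express the check, again sketched with sqlite3 and an illustrative assignments table; the core idea is a self join plus the standard interval-overlap test (a.start <= b.end AND b.start <= a.end):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assignments (
            assignment_id INTEGER, consultant_id INTEGER,
            project TEXT, start_date TEXT, end_date TEXT
        );
        INSERT INTO assignments VALUES
            (1, 100, 'Client A', '2024-01-01', '2024-03-31'),
            (2, 100, 'Client B', '2024-03-15', '2024-06-30'),
            (3, 200, 'Client C', '2024-02-01', '2024-02-28');
    """)

    query = """
    SELECT a.consultant_id,
           a.assignment_id AS first_assignment,
           b.assignment_id AS second_assignment
    FROM assignments a
    JOIN assignments b
      ON a.consultant_id = b.consultant_id
     AND a.assignment_id < b.assignment_id   -- report each pair once, no self-pairs
     AND a.start_date <= b.end_date          -- interval-overlap test
     AND b.start_date <= a.end_date;
    """

    for row in conn.execute(query):
        print(row)   # (100, 1, 2) for the sample data
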
Data Engineer • Coding • Easy
Write a Python function to check if a given string is a valid palindrome, ignoring case and all non-alphanumeric characters.
#Python #String Manipulation #Two Pointers
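
A minimal two-pointer sketch:

    def is_palindrome(s: str) -> bool:
        """Check whether s is a palindrome, ignoring case and non-alphanumerics."""
        left, right = 0, len(s) - 1
        while left < right:
            if not s[left].isalnum():      # skip punctuation and spaces on the left
                left += 1
            elif not s[right].isalnum():   # ...and on the right
                right -= 1
            elif s[left].lower() != s[right].lower():
                return False
            else:
                left += 1
                right -= 1
        return True

    print(is_palindrome("A man, a plan, a canal: Panama"))  # True
    print(is_palindrome("race a car"))                      # False
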
Data Engineer • Coding • Medium
Given a list of dictionaries representing nested JSON data from a client API, write a Python script to flatten the dictionaries into a single level.
#Python #Recursion #Data Parsing #JSON
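
A recursive sketch; the dotted key convention ("client.contact.email") is a choice, not a requirement, and nested lists are left as-is here for brevity:

    def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
        """Recursively flatten a nested dict into a single level with dotted keys."""
        flat = {}
        for key, value in record.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                flat.update(flatten(value, new_key, sep))
            else:
                flat[new_key] = value
        return flat

    payload = [
        {"client": {"id": 1, "contact": {"email": "a@example.com"}}, "active": True},
        {"client": {"id": 2, "contact": {"email": "b@example.com"}}, "active": False},
    ]
    print([flatten(rec) for rec in payload])
    # [{'client.id': 1, 'client.contact.email': 'a@example.com', 'active': True}, ...]
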
Data Engineer • Coding • Medium
Write a Python script using Pandas to merge two large datasets (e.g., clients and transactions), handle missing values by imputing the mean, and output aggregated metrics by region.
#Python #Pandas #Data Cleansing #Aggregation
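
A compact Pandas sketch; the tiny frames and column names (client_id, region, amount) stand in for the real client and transaction extracts:

    import pandas as pd

    clients = pd.DataFrame({
        "client_id": [1, 2, 3],
        "region": ["EMEA", "EMEA", "APAC"],
    })
    transactions = pd.DataFrame({
        "client_id": [1, 1, 2, 3],
        "amount": [100.0, None, 250.0, 400.0],
    })

    merged = transactions.merge(clients, on="client_id", how="left")

    # Impute missing transaction amounts with the column mean.
    merged["amount"] = merged["amount"].fillna(merged["amount"].mean())

    # Aggregated metrics by region.
    summary = merged.groupby("region", as_index=False).agg(
        total_amount=("amount", "sum"),
        avg_amount=("amount", "mean"),
        txn_count=("amount", "count"),
    )
    print(summary)
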
Data Engineer • Coding • Medium
Write PySpark code to read a CSV file from Azure Data Lake, filter out records where the 'amount' column is null, and write the output back as Parquet, partitioned by 'transaction_date'.
#PySpark #Data I/O #Partitioning
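
A possible PySpark sketch, assuming pyspark is available and the cluster is already authenticated to ADLS Gen2; the abfss:// paths, container, and storage account names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("transactions_to_parquet").getOrCreate()

    source_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/transactions/*.csv"
    target_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/transactions_parquet"

    df = spark.read.csv(source_path, header=True, inferSchema=True)

    # Drop records with a null amount before persisting.
    clean_df = df.filter(col("amount").isNotNull())

    (clean_df.write
        .mode("overwrite")
        .partitionBy("transaction_date")
        .parquet(target_path))
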
Data Engineer • Coding • Medium
Write a SQL query to calculate the 7-day rolling average of daily sales for a retail company.
#Window Functions #Moving Average #Date Functions
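
One possible query, shown runnable via sqlite3 and assuming the table already holds one row per calendar day; with gaps in the dates you would first join to a date dimension or switch to a RANGE frame:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_sales (sale_date TEXT, total_sales REAL);
        INSERT INTO daily_sales VALUES
            ('2024-05-01', 100), ('2024-05-02', 120), ('2024-05-03',  90),
            ('2024-05-04', 150), ('2024-05-05', 110), ('2024-05-06', 130),
            ('2024-05-07', 160), ('2024-05-08', 140);
    """)

    # The frame covers the current row plus the six preceding days.
    query = """
    SELECT sale_date,
           total_sales,
           AVG(total_sales) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7d_avg
    FROM daily_sales
    ORDER BY sale_date;
    """

    for row in conn.execute(query):
        print(row)
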
Data Engineer • Coding • Hard
Write a Python generator function to process a massive 50GB log file line by line without loading the entire file into memory, extracting specific error codes.
#Python #Generators #Memory Management #File I/O
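
A generator-based sketch; the ERR-### pattern is an invented stand-in for whatever error-code format the logs actually use:

    import re

    ERROR_CODE = re.compile(r"\bERR-(\d{3})\b")

    def error_codes(path):
        """Yield (line_number, error_code) pairs lazily, one line at a time,
        so only a single line is ever held in memory."""
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                match = ERROR_CODE.search(line)
                if match:
                    yield line_number, match.group(1)

    # Usage: nothing is read until the generator is iterated.
    # for line_number, code in error_codes("/var/log/app/huge.log"):
    #     print(line_number, code)
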
Data Engineer • Coding • Medium
Write a SQL query to find the second highest salary in an employee table without using the LIMIT, TOP, or FETCH keywords.
#Subqueries #Aggregation #Max Function
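
A common answer uses a nested MAX, sketched here against an in-memory SQLite table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (id INTEGER, salary REAL);
        INSERT INTO employee VALUES (1, 90000), (2, 120000), (3, 120000), (4, 75000);
    """)

    # The inner MAX finds the top salary; the outer MAX finds the highest salary
    # strictly below it. The result is NULL if no second-highest salary exists.
    query = """
    SELECT MAX(salary) AS second_highest_salary
    FROM employee
    WHERE salary < (SELECT MAX(salary) FROM employee);
    """

    print(conn.execute(query).fetchone())   # (90000.0,)
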
Data Engineer • Coding • Medium
Write a Python script to interact with a REST API, handle pagination to retrieve all records, and load the extracted data into a local SQLite database.
#Python #REST APIs #Pagination #SQLite
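
A sketch assuming the requests library is installed and the API uses simple page-number pagination; the endpoint, query parameters, and response shape are all hypothetical:

    import sqlite3
    import requests

    BASE_URL = "https://api.example.com/v1/records"

    def fetch_all(page_size=100):
        """Follow page-number pagination until an empty page comes back."""
        page = 1
        while True:
            response = requests.get(
                BASE_URL, params={"page": page, "per_page": page_size}, timeout=30
            )
            response.raise_for_status()
            batch = response.json().get("data", [])
            if not batch:
                break
            yield from batch
            page += 1

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, name) VALUES (:id, :name)",
        ({"id": r["id"], "name": r["name"]} for r in fetch_all()),
    )
    conn.commit()
    conn.close()
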
Data Engineer • System Design • Hard
Design an ETL pipeline for a retail client that ingests 50GB of daily transaction data, cleanses it, and makes it available for BI reporting within 1 hour of store closing.
#Architecture #Batch Processing #Cloud Storage #Data Warehousing

Data Engineer • System Design • Hard
Design a data lakehouse architecture using Databricks for a financial services firm that needs both nightly batch reporting and near-real-time fraud detection.
#Lakehouse #Databricks #Lambda Architecture #Streaming

Data Engineer • System Design • Medium
How would you design a data quality framework to validate incoming data before it lands in the gold layer of a Medallion architecture?
#Data Quality #Medallion Architecture #Data Governance

Data Engineer • System Design • Hard
Design a real-time streaming pipeline using Kafka and Spark Structured Streaming to process IoT sensor data and detect anomalies.
#Streaming #Kafka #Spark Structured Streaming #Architecture

Data Engineer • Technical • Easy
Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() with a practical data engineering example.
#Window Functions #Analytical Functions
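
A small runnable illustration (sqlite3, SQLite 3.25+): ranking pipelines by rows loaded shows how each function treats the tie between 'sales' and 'inventory':

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_loads (pipeline TEXT, rows_loaded INTEGER);
        INSERT INTO daily_loads VALUES
            ('sales', 500), ('inventory', 500), ('finance', 300), ('hr', 200);
    """)

    # ROW_NUMBER: 1, 2, 3, 4  (ties broken arbitrarily)
    # RANK:       1, 1, 3, 4  (ties share a rank, the next rank is skipped)
    # DENSE_RANK: 1, 1, 2, 3  (ties share a rank, no gaps)
    query = """
    SELECT pipeline, rows_loaded,
           ROW_NUMBER() OVER (ORDER BY rows_loaded DESC) AS row_num,
           RANK()       OVER (ORDER BY rows_loaded DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY rows_loaded DESC) AS dense_rnk
    FROM daily_loads;
    """

    for row in conn.execute(query):
        print(row)
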
Data Engineer • Technical • Hard
How would you approach optimizing a slow-running SQL query in a distributed data warehouse like Snowflake or Azure Synapse?
#Performance Tuning #Execution Plans #Indexing #Partitioning

Data Engineer • Technical • Medium
Explain the difference between transformations and actions in PySpark. Why is this distinction important for performance?
#PySpark #Lazy Evaluation #DAG
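
A short PySpark illustration, assuming a local pyspark installation: the filter and withColumn calls only build the logical plan (the DAG); nothing runs until an action such as count() or show() is invoked:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

    df = spark.createDataFrame(
        [(1, 120.0), (2, None), (3, 300.0)], "txn_id INT, amount DOUBLE"
    )

    # Transformations: lazily declared, nothing is computed yet.
    filtered = df.filter(col("amount").isNotNull())
    doubled = filtered.withColumn("amount_x2", col("amount") * 2)

    # Actions: trigger optimisation and execution of the whole plan.
    print(doubled.count())   # 2
    doubled.show()
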
Data Engineer • Technical • Hard
You are joining a massive transaction table with a smaller client table in PySpark, and the job is failing due to OutOfMemory errors. How do you handle this data skewness?
#PySpark #Optimization #Broadcast Joins #Salting
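
A sketch of the two usual remedies, assuming Spark 3.x; the paths are placeholders. Broadcasting the small client table avoids shuffling the skewed transaction table at all, while adaptive skew-join handling (or manual key salting) covers the case where both sides are large:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("skew_demo").getOrCreate()

    transactions = spark.read.parquet("/data/transactions")   # large, skewed on client_id
    clients = spark.read.parquet("/data/clients")             # small dimension table

    # Option 1: broadcast the small side so the big table never shuffles on the join key.
    joined = transactions.join(broadcast(clients), on="client_id", how="left")

    # Option 2 (both sides large): let AQE split skewed partitions automatically,
    # or salt the hot keys manually.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    joined.write.mode("overwrite").parquet("/data/transactions_enriched")
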
Data Engineer • Technical • Medium
What is the difference between repartition() and coalesce() in Spark? When would you use each in a data pipeline?
#PySpark #Data Shuffling #Partitioning
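
A quick PySpark illustration of the difference:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning_demo").getOrCreate()

    df = spark.range(0, 1_000_000)

    # repartition(n) performs a full shuffle and can increase or decrease the
    # partition count; useful to rebalance data before a wide join or write.
    rebalanced = df.repartition(200)

    # coalesce(n) only merges existing partitions (no full shuffle) and can only
    # reduce the count; useful to avoid writing thousands of tiny output files.
    compacted = rebalanced.coalesce(10)

    print(rebalanced.rdd.getNumPartitions())   # 200
    print(compacted.rdd.getNumPartitions())    # 10
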
Data Engineer • Technical • Medium
Describe how you would set up an Azure Data Factory (ADF) pipeline to copy data from an on-premise SQL Server to Azure Data Lake Storage (ADLS).
#Azure Data Factory #Integration Runtime #ADLS

Data Engineer • Technical • Hard
How do you implement incremental loading (Change Data Capture) in a cloud ETL tool like Azure Data Factory or AWS Glue?
#ETL #CDC #Watermarking

Data Engineer • Technical • Medium
Explain Slowly Changing Dimensions (SCD). How would you implement an SCD Type 2 in a modern cloud data warehouse?
#Data Warehousing #SCD #Dimensional Modeling

Data Engineer • Technical • Easy
What is the difference between a Star Schema and a Snowflake Schema? Which do you prefer for a cloud data warehouse and why?
#Data Warehousing #Schema Design #Normalization

Data Engineer • Technical • Hard
Explain Spark's Catalyst Optimizer. How does it improve query execution plans?
#PySpark #Catalyst Optimizer #Under the Hood

Data Engineer • Technical • Medium
What are the different types of triggers available in Azure Data Factory, and when would you use a Tumbling Window trigger over a Schedule trigger?
#Azure Data Factory #Scheduling #Orchestration

Data Engineer • Technical • Medium
Explain the concept of the Medallion Architecture (Bronze, Silver, Gold). What specific transformations happen at each stage?
#Medallion Architecture #Data Lakehouse #Data Modeling

Data Engineer • Technical • Hard
How do you handle schema evolution in Delta Lake or Apache Iceberg when upstream source systems unexpectedly add or remove columns?
#Delta Lake #Schema Evolution #Data Governance
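
A minimal Delta Lake sketch for the added-column case, assuming delta-spark is configured on the cluster; the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema_evolution_demo").getOrCreate()

    # Today's extract, which may carry columns the target table has not seen yet.
    new_batch = spark.read.parquet("/landing/transactions_today")

    # With mergeSchema, the append is allowed to add the new columns; without it,
    # Delta rejects any write whose schema does not match the table.
    (new_batch.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("/lake/silver/transactions"))

    # Removed or renamed source columns are better handled deliberately (for
    # example via an explicit ALTER TABLE) so downstream consumers see the change.
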
Data Engineer • Technical • Medium
What is the difference between a clustered and non-clustered index? How does indexing affect ETL performance?
#Indexing #Performance Tuning #Database Internals

Data Engineer • Technical • Medium
Explain the difference between Azure Synapse Analytics and Azure Databricks. When would you recommend one over the other to a client?
#Azure #Databricks #Synapse #Consulting

Data Engineer • Technical • Medium
How do you manage CI/CD for data pipelines? Describe the deployment process for ADF or Databricks notebooks across Dev, QA, and Prod environments.
#CI/CD #Git #Azure DevOps #Deployment

Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer: Focuses on core competencies, system constraints, and clear communication.

Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.