PwC
PricewaterhouseCoopers, a multinational professional services network.
4 Rounds • ~21 Days • Medium Difficulty

The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank

Data Engineer • Behavioral • Medium
Tell me about a time you had to explain a complex technical data issue to a non-technical client stakeholder.
#Communication #Client Management #PwC Professional

Data Engineer • Behavioral • Medium
Describe a situation where a client changed the requirements for a data pipeline midway through the sprint. How did you handle it?
#Agile #Adaptability #Consulting

Data Engineer • Behavioral • Medium
Tell me about a time you identified a data quality issue that others missed. What was the impact and how did you resolve it?
#Attention to Detail #Data Quality #Problem Solving

Data Engineer • Behavioral • Easy
Why do you want to work as a Data Engineer at a consulting firm like PwC specifically, compared to working at a product-based company?
#Motivation #Consulting #PwC Professional

Data Engineer • Behavioral • Medium
Describe a time you had to work with a difficult team member or client who was resistant to adopting a new data engineering tool or process.
#Conflict Resolution #Change Management #Communication

Data Engineer • Coding • Medium
Write a SQL query to find the top 3 highest paid employees in each department. If there is a tie, they should have the same rank.
#Window Functions #DENSE_RANK #Joins
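
A possible answer, sketched as a runnable Python script against an in-memory SQLite database (window functions require SQLite 3.25+); the employees table and its columns are illustrative:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);
        INSERT INTO employees VALUES
            (1, 'Ana',   'Advisory',  95000),
            (2, 'Ben',   'Advisory',  95000),
            (3, 'Chloe', 'Advisory',  88000),
            (4, 'Dev',   'Advisory',  80000),
            (5, 'Elena', 'Assurance', 90000),
            (6, 'Femi',  'Assurance', 70000);
    """)

    # DENSE_RANK gives tied salaries the same rank with no gaps, which is what
    # satisfies the "ties share a rank" requirement (ROW_NUMBER would not).
    query = """
    SELECT department, name, salary
    FROM (
        SELECT department, name, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
        FROM employees
    ) AS ranked
    WHERE rnk <= 3
    ORDER BY department, salary DESC;
    """

    for row in conn.execute(query):
        print(row)
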
Data Engineer • Coding • Hard
Given a table of client project assignments with start and end dates, write a SQL query to identify any overlapping date ranges for the same consultant.
#Self Joins #Date Functions #Complex Logic
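
One way to express the check, again sketched with sqlite3 and an illustrative assignments table; the core idea is a self join plus the standard interval-overlap test (a.start <= b.end AND b.start <= a.end):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assignments (
            assignment_id INTEGER, consultant_id INTEGER,
            project TEXT, start_date TEXT, end_date TEXT
        );
        INSERT INTO assignments VALUES
            (1, 100, 'Client A', '2024-01-01', '2024-03-31'),
            (2, 100, 'Client B', '2024-03-15', '2024-06-30'),
            (3, 200, 'Client C', '2024-02-01', '2024-02-28');
    """)

    query = """
    SELECT a.consultant_id,
           a.assignment_id AS first_assignment,
           b.assignment_id AS second_assignment
    FROM assignments a
    JOIN assignments b
      ON a.consultant_id = b.consultant_id
     AND a.assignment_id < b.assignment_id   -- report each pair once, no self-pairs
     AND a.start_date <= b.end_date          -- interval-overlap test
     AND b.start_date <= a.end_date;
    """

    for row in conn.execute(query):
        print(row)   # (100, 1, 2) for the sample data
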
Data Engineer • Coding • Easy
Write a Python function to check if a given string is a valid palindrome, ignoring case and all non-alphanumeric characters.
#Python #String Manipulation #Two Pointers
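
A minimal two-pointer sketch:

    def is_palindrome(s: str) -> bool:
        """Check whether s is a palindrome, ignoring case and non-alphanumerics."""
        left, right = 0, len(s) - 1
        while left < right:
            if not s[left].isalnum():      # skip punctuation and spaces on the left
                left += 1
            elif not s[right].isalnum():   # ...and on the right
                right -= 1
            elif s[left].lower() != s[right].lower():
                return False
            else:
                left += 1
                right -= 1
        return True

    print(is_palindrome("A man, a plan, a canal: Panama"))  # True
    print(is_palindrome("race a car"))                      # False
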
Data Engineer • Coding • Medium
Given a list of dictionaries representing nested JSON data from a client API, write a Python script to flatten the dictionaries into a single level.
#Python #Recursion #Data Parsing #JSON
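
A recursive sketch; the dotted key convention ("client.contact.email") is a choice, not a requirement, and nested lists are left as-is here for brevity:

    def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
        """Recursively flatten a nested dict into a single level with dotted keys."""
        flat = {}
        for key, value in record.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                flat.update(flatten(value, new_key, sep))
            else:
                flat[new_key] = value
        return flat

    payload = [
        {"client": {"id": 1, "contact": {"email": "a@example.com"}}, "active": True},
        {"client": {"id": 2, "contact": {"email": "b@example.com"}}, "active": False},
    ]
    print([flatten(rec) for rec in payload])
    # [{'client.id': 1, 'client.contact.email': 'a@example.com', 'active': True}, ...]
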
Data Engineer • Coding • Medium
Write a Python script using Pandas to merge two large datasets (e.g., clients and transactions), handle missing values by imputing the mean, and output aggregated metrics by region.
#Python #Pandas #Data Cleansing #Aggregation
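
A compact Pandas sketch; the tiny frames and column names (client_id, region, amount) stand in for the real client and transaction extracts:

    import pandas as pd

    clients = pd.DataFrame({
        "client_id": [1, 2, 3],
        "region": ["EMEA", "EMEA", "APAC"],
    })
    transactions = pd.DataFrame({
        "client_id": [1, 1, 2, 3],
        "amount": [100.0, None, 250.0, 400.0],
    })

    merged = transactions.merge(clients, on="client_id", how="left")

    # Impute missing transaction amounts with the column mean.
    merged["amount"] = merged["amount"].fillna(merged["amount"].mean())

    # Aggregated metrics by region.
    summary = merged.groupby("region", as_index=False).agg(
        total_amount=("amount", "sum"),
        avg_amount=("amount", "mean"),
        txn_count=("amount", "count"),
    )
    print(summary)
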
Data Engineer • Coding • Medium
Write PySpark code to read a CSV file from Azure Data Lake, filter out records where the 'amount' column is null, and write the output back as Parquet, partitioned by 'transaction_date'.
#PySpark #Data I/O #Partitioning
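
A possible PySpark sketch, assuming pyspark is available and the cluster is already authenticated to ADLS Gen2; the abfss:// paths, container, and storage account names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("transactions_to_parquet").getOrCreate()

    source_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/transactions/*.csv"
    target_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/transactions_parquet"

    df = spark.read.csv(source_path, header=True, inferSchema=True)

    # Drop records with a null amount before persisting.
    clean_df = df.filter(col("amount").isNotNull())

    (clean_df.write
        .mode("overwrite")
        .partitionBy("transaction_date")
        .parquet(target_path))
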
Data Engineer • Coding • Medium
Write a SQL query to calculate the 7-day rolling average of daily sales for a retail company.
#Window Functions #Moving Average #Date Functions
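
One possible query, shown runnable via sqlite3 and assuming the table already holds one row per calendar day; with gaps in the dates you would first join to a date dimension or switch to a RANGE frame:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_sales (sale_date TEXT, total_sales REAL);
        INSERT INTO daily_sales VALUES
            ('2024-05-01', 100), ('2024-05-02', 120), ('2024-05-03',  90),
            ('2024-05-04', 150), ('2024-05-05', 110), ('2024-05-06', 130),
            ('2024-05-07', 160), ('2024-05-08', 140);
    """)

    # The frame covers the current row plus the six preceding days.
    query = """
    SELECT sale_date,
           total_sales,
           AVG(total_sales) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7d_avg
    FROM daily_sales
    ORDER BY sale_date;
    """

    for row in conn.execute(query):
        print(row)
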
Data Engineer • Coding • Hard
Write a Python generator function to process a massive 50GB log file line by line without loading the entire file into memory, extracting specific error codes.
#Python #Generators #Memory Management #File I/O
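
A generator-based sketch; the ERR-### pattern is an invented stand-in for whatever error-code format the logs actually use:

    import re

    ERROR_CODE = re.compile(r"\bERR-(\d{3})\b")

    def error_codes(path):
        """Yield (line_number, error_code) pairs lazily, one line at a time,
        so only a single line is ever held in memory."""
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                match = ERROR_CODE.search(line)
                if match:
                    yield line_number, match.group(1)

    # Usage: nothing is read until the generator is iterated.
    # for line_number, code in error_codes("/var/log/app/huge.log"):
    #     print(line_number, code)
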
Data Engineer • Coding • Medium
Write a SQL query to find the second highest salary in an employee table without using the LIMIT, TOP, or FETCH keywords.
#Subqueries #Aggregation #Max Function
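
A common answer uses a nested MAX, sketched here against an in-memory SQLite table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (id INTEGER, salary REAL);
        INSERT INTO employee VALUES (1, 90000), (2, 120000), (3, 120000), (4, 75000);
    """)

    # The inner MAX finds the top salary; the outer MAX finds the highest salary
    # strictly below it. The result is NULL if no second-highest salary exists.
    query = """
    SELECT MAX(salary) AS second_highest_salary
    FROM employee
    WHERE salary < (SELECT MAX(salary) FROM employee);
    """

    print(conn.execute(query).fetchone())   # (90000.0,)
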
Data Engineer • Coding • Medium
Write a Python script to interact with a REST API, handle pagination to retrieve all records, and load the extracted data into a local SQLite database.
#Python #REST APIs #Pagination #SQLite
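
A sketch assuming the requests library is installed and the API uses simple page-number pagination; the endpoint, query parameters, and response shape are all hypothetical:

    import sqlite3
    import requests

    BASE_URL = "https://api.example.com/v1/records"

    def fetch_all(page_size=100):
        """Follow page-number pagination until an empty page comes back."""
        page = 1
        while True:
            response = requests.get(
                BASE_URL, params={"page": page, "per_page": page_size}, timeout=30
            )
            response.raise_for_status()
            batch = response.json().get("data", [])
            if not batch:
                break
            yield from batch
            page += 1

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, name) VALUES (:id, :name)",
        ({"id": r["id"], "name": r["name"]} for r in fetch_all()),
    )
    conn.commit()
    conn.close()
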
Data Engineer • System Design • Hard
Design an ETL pipeline for a retail client that ingests 50GB of daily transaction data, cleanses it, and makes it available for BI reporting within 1 hour of store closing.
#Architecture #Batch Processing #Cloud Storage #Data Warehousing

Data Engineer • System Design • Hard
Design a data lakehouse architecture using Databricks for a financial services firm that needs both nightly batch reporting and near-real-time fraud detection.
#Lakehouse #Databricks #Lambda Architecture #Streaming

Data Engineer • System Design • Medium
How would you design a data quality framework to validate incoming data before it lands in the gold layer of a Medallion architecture?
#Data Quality #Medallion Architecture #Data Governance

Data Engineer • System Design • Hard
Design a real-time streaming pipeline using Kafka and Spark Structured Streaming to process IoT sensor data and detect anomalies.
#Streaming #Kafka #Spark Structured Streaming #Architecture

Data Engineer • Technical • Easy
Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() with a practical data engineering example.
#Window Functions #Analytical Functions
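
A small runnable illustration (sqlite3, SQLite 3.25+): ranking pipelines by rows loaded shows how each function treats the tie between 'sales' and 'inventory':

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE daily_loads (pipeline TEXT, rows_loaded INTEGER);
        INSERT INTO daily_loads VALUES
            ('sales', 500), ('inventory', 500), ('finance', 300), ('hr', 200);
    """)

    # ROW_NUMBER: 1, 2, 3, 4  (ties broken arbitrarily)
    # RANK:       1, 1, 3, 4  (ties share a rank, the next rank is skipped)
    # DENSE_RANK: 1, 1, 2, 3  (ties share a rank, no gaps)
    query = """
    SELECT pipeline, rows_loaded,
           ROW_NUMBER() OVER (ORDER BY rows_loaded DESC) AS row_num,
           RANK()       OVER (ORDER BY rows_loaded DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY rows_loaded DESC) AS dense_rnk
    FROM daily_loads;
    """

    for row in conn.execute(query):
        print(row)
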
Data Engineer • Technical • Hard
How would you approach optimizing a slow-running SQL query in a distributed data warehouse like Snowflake or Azure Synapse?
#Performance Tuning #Execution Plans #Indexing #Partitioning

Data Engineer • Technical • Medium
Explain the difference between transformations and actions in PySpark. Why is this distinction important for performance?
#PySpark #Lazy Evaluation #DAG
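
A short PySpark illustration, assuming a local pyspark installation: the filter and withColumn calls only build the logical plan (the DAG); nothing runs until an action such as count() or show() is invoked:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

    df = spark.createDataFrame(
        [(1, 120.0), (2, None), (3, 300.0)], "txn_id INT, amount DOUBLE"
    )

    # Transformations: lazily declared, nothing is computed yet.
    filtered = df.filter(col("amount").isNotNull())
    doubled = filtered.withColumn("amount_x2", col("amount") * 2)

    # Actions: trigger optimisation and execution of the whole plan.
    print(doubled.count())   # 2
    doubled.show()
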
Data Engineer • Technical • Hard
You are joining a massive transaction table with a smaller client table in PySpark, and the job is failing due to OutOfMemory errors. How do you handle this data skewness?
#PySpark #Optimization #Broadcast Joins #Salting
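
A sketch of the two usual remedies, assuming Spark 3.x; the paths are placeholders. Broadcasting the small client table avoids shuffling the skewed transaction table at all, while adaptive skew-join handling (or manual key salting) covers the case where both sides are large:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("skew_demo").getOrCreate()

    transactions = spark.read.parquet("/data/transactions")   # large, skewed on client_id
    clients = spark.read.parquet("/data/clients")             # small dimension table

    # Option 1: broadcast the small side so the big table never shuffles on the join key.
    joined = transactions.join(broadcast(clients), on="client_id", how="left")

    # Option 2 (both sides large): let AQE split skewed partitions automatically,
    # or salt the hot keys manually.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    joined.write.mode("overwrite").parquet("/data/transactions_enriched")
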
Data Engineer • Technical • Medium
What is the difference between repartition() and coalesce() in Spark? When would you use each in a data pipeline?
#PySpark #Data Shuffling #Partitioning
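
A quick PySpark illustration of the difference:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning_demo").getOrCreate()

    df = spark.range(0, 1_000_000)

    # repartition(n) performs a full shuffle and can increase or decrease the
    # partition count; useful to rebalance data before a wide join or write.
    rebalanced = df.repartition(200)

    # coalesce(n) only merges existing partitions (no full shuffle) and can only
    # reduce the count; useful to avoid writing thousands of tiny output files.
    compacted = rebalanced.coalesce(10)

    print(rebalanced.rdd.getNumPartitions())   # 200
    print(compacted.rdd.getNumPartitions())    # 10
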
Data Engineer • Technical • Medium
Describe how you would set up an Azure Data Factory (ADF) pipeline to copy data from an on-premise SQL Server to Azure Data Lake Storage (ADLS).
#Azure Data Factory #Integration Runtime #ADLS

Data Engineer • Technical • Hard
How do you implement incremental loading (Change Data Capture) in a cloud ETL tool like Azure Data Factory or AWS Glue?
#ETL #CDC #Watermarking

Data Engineer • Technical • Medium
Explain Slowly Changing Dimensions (SCD). How would you implement an SCD Type 2 in a modern cloud data warehouse?
#Data Warehousing #SCD #Dimensional Modeling

Data Engineer • Technical • Easy
What is the difference between a Star Schema and a Snowflake Schema? Which do you prefer for a cloud data warehouse and why?
#Data Warehousing #Schema Design #Normalization

Data Engineer • Technical • Hard
Explain Spark's Catalyst Optimizer. How does it improve query execution plans?
#PySpark #Catalyst Optimizer #Under the Hood

Data Engineer • Technical • Medium
What are the different types of triggers available in Azure Data Factory, and when would you use a Tumbling Window trigger over a Schedule trigger?
#Azure Data Factory #Scheduling #Orchestration

Data Engineer • Technical • Medium
Explain the concept of the Medallion Architecture (Bronze, Silver, Gold). What specific transformations happen at each stage?
#Medallion Architecture #Data Lakehouse #Data Modeling

Data Engineer • Technical • Hard
How do you handle schema evolution in Delta Lake or Apache Iceberg when upstream source systems unexpectedly add or remove columns?
#Delta Lake #Schema Evolution #Data Governance
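
A minimal Delta Lake sketch for the added-column case, assuming delta-spark is configured on the cluster; the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema_evolution_demo").getOrCreate()

    # Today's extract, which may carry columns the target table has not seen yet.
    new_batch = spark.read.parquet("/landing/transactions_today")

    # With mergeSchema, the append is allowed to add the new columns; without it,
    # Delta rejects any write whose schema does not match the table.
    (new_batch.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("/lake/silver/transactions"))

    # Removed or renamed source columns are better handled deliberately (for
    # example via an explicit ALTER TABLE) so downstream consumers see the change.
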
Data Engineer • Technical • Medium
What is the difference between a clustered and non-clustered index? How does indexing affect ETL performance?
#Indexing #Performance Tuning #Database Internals

Data Engineer • Technical • Medium
Explain the difference between Azure Synapse Analytics and Azure Databricks. When would you recommend one over the other to a client?
#Azure #Databricks #Synapse #Consulting

Data Engineer • Technical • Medium
How do you manage CI/CD for data pipelines? Describe the deployment process for ADF or Databricks notebooks across Dev, QA, and Prod environments.
#CI/CD #Git #Azure DevOps #Deployment

Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer: Focuses on core competencies, system constraints, and clear communication.

Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.