EY
Ernst & Young Global Limited, a multinational professional services partnership.
4 Rounds • ~21 Days • Medium Difficulty
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Engineer • Behavioral • medium
Tell me about a time you had to explain a complex data pipeline failure or technical issue to a non-technical client partner.
#Communication
#Client Management
#Consulting
Data Engineer • Behavioral • medium
Describe a situation where a client changed the data requirements halfway through a sprint. How did you handle it?
#Agile
#Adaptability
#Stakeholder Management
Data Engineer • Behavioral • easy
Why do you want to work at EY? How do your career goals align with our mission of 'Building a better working world'?
#Company Knowledge
#Motivation
Data Engineer • Behavioral • medium
Tell me about a time you found a critical data discrepancy in a production environment. What was your troubleshooting process?
#Problem Solving
#Incident Management
#Accountability
Data Engineer • Behavioral • medium
Describe a time you disagreed with a senior engineer or architect's design on a client project. How did you resolve the disagreement?
#Conflict Resolution
#Teamwork
#Professionalism
Data Engineer • Behavioral • medium
Describe a time you optimized a slow-running ETL pipeline. What specific metrics did you improve, and what was the business impact?
#Performance Optimization
#Impact
#Technical Leadership
Data Engineer • Coding • medium
Write a SQL query using window functions to calculate the 7-day rolling average of daily transaction volumes for our financial audit clients.
#Window Functions
#Data Aggregation
#Time Series
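One way to approach this question, sketched with `sqlite3` so it runs anywhere (the table name `daily_volumes` and its schema are hypothetical). The frame `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` assumes exactly one row per day; with calendar gaps you would join against a date spine or use a `RANGE` frame instead.

```python
import sqlite3

# Hypothetical schema: daily_volumes(txn_date TEXT, volume REAL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_volumes (txn_date TEXT, volume REAL)")
conn.executemany(
    "INSERT INTO daily_volumes VALUES (?, ?)",
    [(f"2024-01-{d:02d}", 100.0 + d) for d in range(1, 15)],
)

# 7-day rolling average: the current day plus the 6 preceding days.
query = """
SELECT txn_date,
       AVG(volume) OVER (
           ORDER BY txn_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_avg_7d
FROM daily_volumes
ORDER BY txn_date
"""
rows = conn.execute(query).fetchall()
```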
Data Engineer • Coding • hard
Given a table of user logins, write a SQL query to find the maximum number of consecutive days each user logged in.
#Advanced SQL
#Gaps and Islands Problem
#CTEs
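A sketch of the classic gaps-and-islands trick, runnable via `sqlite3` (the `logins` table and sample data are hypothetical): subtracting each row's per-user row number from its date yields a constant anchor date within any unbroken streak, so grouping by that anchor isolates each island.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),
    (1, "2024-01-05"),
    (2, "2024-01-10"), (2, "2024-01-11"),
])

# date - row_number is constant within a consecutive-day streak.
query = """
WITH dedup AS (
    SELECT DISTINCT user_id, login_date FROM logins
),
grouped AS (
    SELECT user_id, login_date,
           DATE(login_date,
                '-' || ROW_NUMBER() OVER (
                    PARTITION BY user_id ORDER BY login_date
                ) || ' days') AS grp
    FROM dedup
)
SELECT user_id, MAX(cnt) AS max_streak
FROM (
    SELECT user_id, grp, COUNT(*) AS cnt
    FROM grouped
    GROUP BY user_id, grp
)
GROUP BY user_id
ORDER BY user_id
"""
result = conn.execute(query).fetchall()
```

Deduplicating first matters: multiple logins on the same day would otherwise break the row-number arithmetic.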
Data Engineer • Coding • medium
Write a SQL query to identify and delete duplicate records in a massive transaction table without using the DISTINCT keyword.
#Data Cleansing
#CTEs
#Window Functions
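One common pattern for this, shown against `sqlite3` (the `txns` schema is hypothetical): number each duplicate group with `ROW_NUMBER()` and delete every row ranked past 1, with no `DISTINCT` anywhere. In engines without a physical row id you would partition on the full business key and order by a surrogate key or load timestamp instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (client_id INTEGER, amount REAL, txn_date TEXT)")
conn.executemany("INSERT INTO txns VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01"),
    (1, 100.0, "2024-01-01"),  # duplicate
    (2, 50.0, "2024-01-02"),
    (1, 100.0, "2024-01-01"),  # duplicate
])

# Keep the first physical copy of each duplicate group; delete the rest.
conn.execute("""
DELETE FROM txns
WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               ROW_NUMBER() OVER (
                   PARTITION BY client_id, amount, txn_date
                   ORDER BY rowid
               ) AS rn
        FROM txns
    )
    WHERE rn > 1
)
""")
remaining = conn.execute("SELECT COUNT(*) FROM txns").fetchall()[0][0]
```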
Data Engineer • Coding • medium
Write a PySpark script to read a CSV file from Azure Data Lake, filter out records with null client IDs, and write the output to Parquet format partitioned by transaction date.
#Data I/O
#DataFrames
#Partitioning
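The Spark answer is roughly `spark.read.csv(path, header=True).filter(col("client_id").isNotNull()).write.partitionBy("txn_date").parquet(out_path)`. Since that only runs on a cluster, here is the same filter-then-partition logic sketched with the stdlib (the column names are hypothetical), grouping rows into per-date buckets the way `partitionBy` produces `txn_date=.../` directories.

```python
import csv
import io
from collections import defaultdict

raw = io.StringIO(
    "client_id,amount,txn_date\n"
    "1,100.0,2024-01-01\n"
    ",75.0,2024-01-01\n"      # null client_id: should be dropped
    "2,50.0,2024-01-02\n"
)

partitions = defaultdict(list)
for row in csv.DictReader(raw):
    if row["client_id"]:                  # filter out null client IDs
        partitions[row["txn_date"]].append(row)
# Each key maps to one output partition,
# e.g. txn_date=2024-01-01/part-0.parquet in the Spark version.
```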
Data Engineer • Coding • easy
Write a PySpark snippet to perform a left anti join. Explain a business use case for this operation in an audit context.
#Joins
#Data Validation
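In PySpark the operation is `ledger_df.join(bank_df, "txn_id", "left_anti")`. Its semantics, illustrated with plain Python (the ledger/bank-statement data is a hypothetical audit scenario): keep only left-side rows with no match on the right, which is exactly how you surface ledger entries that never hit the bank statement.

```python
# Left anti join: rows of the left table with no match in the right.
ledger = [
    {"txn_id": 1, "amount": 100.0},
    {"txn_id": 2, "amount": 50.0},
    {"txn_id": 3, "amount": 75.0},
]
bank_statement = [{"txn_id": 1}, {"txn_id": 3}]

matched = {row["txn_id"] for row in bank_statement}
unmatched = [row for row in ledger if row["txn_id"] not in matched]
# unmatched now holds ledger entries absent from the bank statement.
```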
Data Engineer • Coding • hard
Write a Python function to flatten a deeply nested JSON object representing complex financial records from a client API.
#Data Manipulation
#Recursion
#JSON
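A minimal recursive sketch (the key separator and the sample record are arbitrary choices): dicts contribute their keys, lists contribute their indices, and leaves land in the output under a dotted path.

```python
def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts and lists into dotted keys."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.update(flatten(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items

record = {"client": {"id": 7, "accounts": [{"iban": "DE00"}, {"iban": "FR00"}]}}
flat = flatten(record)
# {'client.id': 7, 'client.accounts.0.iban': 'DE00', 'client.accounts.1.iban': 'FR00'}
```

Worth mentioning in an interview: very deep payloads can hit Python's recursion limit, at which point an explicit stack is the safer variant.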
Data Engineer • Coding • medium
Write a Python script using boto3 or azure-storage-blob to upload a local file to cloud storage, including basic error handling and logging.
#Cloud SDKs
#Error Handling
#I/O
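A sketch of the error-handling-and-logging shape such a script takes. The client is injected so the function is testable; it is assumed to expose `upload_file(Filename, Bucket, Key)`, which matches the signature of boto3's S3 client (azure-storage-blob's `BlobClient.upload_blob` would be wired in analogously).

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("uploader")

def upload_with_logging(client, local_path, bucket, key):
    """Upload a local file to cloud storage, logging success or failure."""
    try:
        client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
        logger.info("Uploaded %s to %s/%s", local_path, bucket, key)
        return True
    except Exception:
        logger.exception("Upload of %s failed", local_path)
        return False

class StubClient:
    """Stand-in for boto3.client("s3") so the sketch runs offline."""
    def __init__(self):
        self.calls = []
    def upload_file(self, Filename, Bucket, Key):
        self.calls.append((Filename, Bucket, Key))

stub = StubClient()
ok = upload_with_logging(stub, "report.csv", "audit-bucket", "2024/report.csv")
```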
Data Engineer • Coding • medium
Write a SQL query to find the second highest salary in each department. If a department has fewer than two employees, return null for that department.
#Window Functions
#CTEs
#Aggregations
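One approach, runnable via `sqlite3` (the `employees` schema is hypothetical): rank salaries per department with `DENSE_RANK()`, then aggregate so departments with no rank-2 row naturally produce NULL. Note this treats ties as one rank, so "second highest" means second highest distinct salary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ana", "Audit", 100.0), ("Ben", "Audit", 100.0), ("Cy", "Audit", 90.0),
    ("Dee", "Tax", 80.0),   # lone employee: second highest is NULL
])

query = """
SELECT department,
       MAX(CASE WHEN rn = 2 THEN salary END) AS second_highest
FROM (
    SELECT department, salary,
           DENSE_RANK() OVER (
               PARTITION BY department ORDER BY salary DESC
           ) AS rn
    FROM employees
)
GROUP BY department
ORDER BY department
"""
result = conn.execute(query).fetchall()
```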
Data Engineer • System Design • hard
Design a batch processing pipeline to ingest daily financial audit logs from 50 different client on-premise systems into a centralized Azure Data Lake.
#Azure Data Factory
#Batch Processing
#Data Integration
Data Engineer • System Design • hard
Design a real-time streaming pipeline for detecting fraudulent credit card transactions for a banking client.
#Stream Processing
#Kafka/Event Hubs
#Fraud Detection
Data Engineer • System Design • hard
How would you design a data migration strategy from an on-premise Oracle database to Azure Synapse Analytics with minimal downtime?
#Cloud Migration
#Azure Synapse
#Change Data Capture (CDC)
Data Engineer • System Design • hard
Design a data reconciliation process to ensure data integrity between a source ERP system and a target cloud data warehouse.
#Data Quality
#Reconciliation
#Audit
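The core of most reconciliation designs is a keyed comparison: hash each row on both sides, then diff on the primary key. A minimal sketch under assumed extracts (the ERP/warehouse rows here are invented), illustrating the three standard findings: missing rows, extra rows, and value drift.

```python
import hashlib

def row_hash(row):
    """Stable fingerprint of a row's values for source/target comparison."""
    joined = "|".join(str(v) for v in row)
    return hashlib.sha256(joined.encode()).hexdigest()

# Hypothetical extracts keyed by primary key; values are full row tuples.
source = {1: (1, "ACME", 100.0), 2: (2, "Globex", 50.0), 3: (3, "Initech", 75.0)}
target = {1: (1, "ACME", 100.0), 2: (2, "Globex", 55.0)}  # 2 drifted, 3 missing

missing_in_target = sorted(source.keys() - target.keys())
extra_in_target = sorted(target.keys() - source.keys())
mismatched = sorted(
    k for k in source.keys() & target.keys()
    if row_hash(source[k]) != row_hash(target[k])
)
```

At scale the same idea runs as a distributed join on `(key, hash)` pairs rather than in-memory dicts, but the control totals and exception categories stay identical.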
Data Engineer • Technical • easy
Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER(). Provide a scenario in a client reporting dashboard where you would choose DENSE_RANK() over RANK().
#Window Functions
#Data Ranking
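The difference is easiest to see side by side on tied values; a small `sqlite3` demo (the `scores` data is invented). `RANK` leaves a gap after a tie, `DENSE_RANK` does not, and `ROW_NUMBER` breaks the tie arbitrarily. In a "top 3 clients by revenue" dashboard tile, `DENSE_RANK` keeps rank 2 populated even when two clients tie for first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (client TEXT, revenue REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("A", 300), ("B", 300), ("C", 200)])

rows = conn.execute("""
SELECT client,
       RANK()       OVER (ORDER BY revenue DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY revenue DESC) AS drnk,
       ROW_NUMBER() OVER (ORDER BY revenue DESC) AS rn
FROM scores
ORDER BY revenue DESC, client
""").fetchall()
# A and B tie: RANK gives 1,1 then skips to 3; DENSE_RANK gives 1,1,2;
# ROW_NUMBER assigns 1,2 arbitrarily between the tied rows.
```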
Data Engineer • Technical • hard
How do you handle data skewness when performing a join in PySpark on a massive dataset of retail transactions?
#Performance Tuning
#Data Skew
#Distributed Computing
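One standard remedy is key salting. The mechanics, sketched in plain Python rather than Spark (the hot-key data is invented): append a random suffix to the skewed key so its rows spread over several partitions, and replicate the small side once per suffix so every salted partition can still complete the join.

```python
import random

random.seed(42)  # reproducible sketch

NUM_SALTS = 4
hot_rows = [("HOT_CLIENT", amount) for amount in range(1000)]

# Fact side: scatter the hot key across NUM_SALTS sub-keys.
salted = [(f"{key}_{random.randrange(NUM_SALTS)}", amount)
          for key, amount in hot_rows]

# Dimension side: duplicate the hot key's row under every salt value.
dim = {f"HOT_CLIENT_{s}": {"region": "EMEA"} for s in range(NUM_SALTS)}

# Each salted partition now joins locally without one executor
# receiving all 1000 rows.
joined = [(k, amount, dim[k]["region"]) for k, amount in salted]
```

In PySpark the same effect also comes for free from adaptive query execution's skew-join handling, or from broadcasting the small side when it fits in memory.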
Data Engineer • Technical • medium
Explain the difference between narrow and wide transformations in Spark. Why is this distinction important for optimizing ETL pipelines?
#Spark Architecture
#Transformations
#Shuffling
Data Engineer • Technical • easy
What is the difference between repartition() and coalesce() in PySpark? When would you use one over the other?
#Partitioning
#Performance Tuning
Data Engineer • Technical • medium
Explain the Medallion Architecture (Bronze, Silver, Gold). How have you implemented this in Databricks for a client project?
#Data Lakehouse
#Databricks
#Data Modeling
Data Engineer • Technical • medium
Explain the architecture of Azure Data Factory. What is the role of an Integration Runtime, and when would you use a Self-Hosted IR?
#Azure Data Factory
#Cloud Architecture
Data Engineer • Technical • hard
How do you secure data at rest and in transit in Azure Data Lake Storage Gen2? How do you manage access for different client teams?
#Cloud Security
#Azure
#IAM
Data Engineer • Technical • medium
In Azure Databricks, how do you manage secrets and credentials securely without hardcoding them in your notebooks?
#Security
#Databricks
#Azure Key Vault
Data Engineer • Technical • medium
What is Delta Lake? Explain how it achieves ACID transactions on top of cloud object storage.
#Delta Lake
#ACID
#Data Lakehouse
Data Engineer • Technical • easy
Explain the Parquet file format. Why is it preferred over CSV or JSON in big data processing pipelines?
#File Formats
#Performance
Data Engineer • Technical • medium
What are Slowly Changing Dimensions (SCD)? Explain the difference between Type 1, Type 2, and Type 3 with examples.
#Data Warehousing
#Dimensional Modeling
#SCD
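Type 2 is the one worth being able to sketch on a whiteboard: expire the current row and insert a new version, preserving history. A minimal in-memory illustration (the customer dimension and column names are invented); a warehouse implementation does the same with an `UPDATE` plus `INSERT`, or a `MERGE`.

```python
from datetime import date

# Hypothetical SCD Type 2 dimension: one current row per customer.
dim_customer = [
    {"customer_id": 1, "city": "London", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """Expire the current version and append a new one on change."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no attribute change: nothing to do
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})

apply_scd2(dim_customer, 1, "Manchester", date(2024, 6, 1))
```

Type 1 would simply overwrite `city` in place; Type 3 would keep a single `previous_city` column instead of full history.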
Data Engineer • Technical • medium
Explain the difference between a Star Schema and a Snowflake Schema. Which is generally preferred in modern cloud data warehouses like Snowflake or Synapse, and why?
#Data Warehousing
#Schema Design
Data Engineer • Technical • medium
How do you manage memory efficiently when processing large datasets in Python (e.g., 10GB CSV) without using distributed frameworks like Spark?
#Memory Management
#Pandas
#Generators
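The usual answer is streaming: iterate the file in chunks via a generator so memory holds one chunk, not the whole dataset (pandas offers the same pattern through `read_csv(..., chunksize=N)`). A stdlib sketch, using an in-memory file in place of the hypothetical 10GB CSV:

```python
import csv
import io

def rows_in_chunks(file_obj, chunk_size):
    """Lazily yield lists of rows; only one chunk is in memory at a time."""
    reader = csv.DictReader(file_obj)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Stand-in for open("transactions.csv") on a file too big for memory.
big_file = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")

total = 0.0
for chunk in rows_in_chunks(big_file, chunk_size=2):
    total += sum(float(r["amount"]) for r in chunk)  # aggregate per chunk
```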
Data Engineer • Technical • medium
How do you handle dependency management and failure retries in Apache Airflow?
#Apache Airflow
#DAGs
#Error Handling
Data Engineer • Technical • medium
Explain the concept of XComs in Airflow. What are their limitations, and how do you pass large datasets between tasks?
#Apache Airflow
#Data Passing
Data Engineer • Technical • medium
What is dbt (data build tool)? How does it fit into the modern data stack, and what are the benefits of using it for transformations?
#dbt
#ELT
#Data Transformation
Data Engineer • Technical • medium
How do you ensure CI/CD (Continuous Integration / Continuous Deployment) in your data engineering projects? Describe the tools and workflow.
#CI/CD
#Git
#Azure DevOps/GitHub Actions
Difficulty Radar
Based on recent AI-sourced data.
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer. Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.