Tech Mahindra
Multinational IT services and consulting company.
4 Rounds • ~21 Days • Medium Difficulty
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Engineer • Behavioral • medium
Tell me about a time you had to push back on a client or stakeholder who demanded an unrealistic deadline for a data delivery.
#Stakeholder Management
#Communication
Data Engineer • Behavioral • medium
How do you handle a situation where the upstream source system changes its data format (e.g., adding/removing columns) without notifying the data engineering team?
#Problem Solving
#Resilience
Data Engineer • Behavioral • hard
Describe the most challenging bug you have faced in a production data pipeline. How did you troubleshoot and resolve it?
#Troubleshooting
#Experience
Data Engineer • Behavioral • easy
Working at an IT services company often means handling multiple client deliverables at once. How do you prioritize your tasks?
#Time Management
#Prioritization
Data Engineer • Behavioral • medium
Explain the concept of 'Data Partitioning' to a non-technical business stakeholder.
#Communication
#Mentoring
Data Engineer • Coding • medium
Write a SQL query to find the 3rd highest salary from an Employee table without using the LIMIT keyword.
#Subqueries
#Correlated Queries
#Window Functions
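One classic LIMIT-free answer is a correlated subquery that counts how many distinct salaries are higher. A minimal sketch, run against a hypothetical `Employee` table in SQLite (the same SQL works on most engines):

```python
import sqlite3

# Hypothetical Employee table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("a", 90), ("b", 80), ("c", 80), ("d", 70), ("e", 60)],
)

# A salary is the 3rd highest when exactly 2 distinct salaries exceed it.
query = """
SELECT DISTINCT salary
FROM Employee e1
WHERE 2 = (SELECT COUNT(DISTINCT e2.salary)
           FROM Employee e2
           WHERE e2.salary > e1.salary)
"""
third_highest = conn.execute(query).fetchone()[0]
```

A `DENSE_RANK()` window function filtered to rank 3 is the other common answer and usually reads better on large tables.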
Data Engineer • Coding • medium
Given a table of Telecom Call Detail Records (CDRs), write a SQL query to calculate the rolling 7-day cumulative data usage for each user.
#Window Functions
#Time Series
#Data Aggregation
Data Engineer • Coding • medium
How do you find and delete duplicate records in a massive SQL table without creating a temporary table?
#Data Cleansing
#CTEs
#DELETE Statements
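The idea is to delete every row whose internal id is not the first of its duplicate group, so no temporary table is needed. A sketch using SQLite's `rowid` (in SQL Server or Postgres the same idea is written as a CTE over `ROW_NUMBER()` followed by `DELETE` of rows where `rn > 1`):

```python
import sqlite3

# Hypothetical table with one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a@x.com", "A"), ("a@x.com", "A"), ("b@x.com", "B")],
)

# Keep the first physical copy of each duplicate group, delete the rest.
conn.execute("""
DELETE FROM customers
WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM customers GROUP BY email, name
)
""")
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

On a genuinely massive table, the interviewer will also expect you to mention batching the deletes to keep transaction logs and locks manageable.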
Data Engineer • Coding • medium
Write a Python script using Pandas or PySpark to read a 10GB CSV file, drop rows where the 'customer_id' is null, and write the output partitioned by 'region' into Parquet format.
#Data I/O
#Data Cleaning
#Partitioning
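The real answer uses PySpark (or pandas with `chunksize`) so the 10GB file is never fully in memory, and `df.write.partitionBy("region").parquet(...)` for the output. As a stdlib-only sketch of the same streaming pattern, with in-memory lists standing in for the per-region Parquet files:

```python
import csv
import io
import collections

# Stand-in for the 10GB file: a small in-memory CSV with one null customer_id.
src = io.StringIO(
    "customer_id,region,amount\n"
    "1,north,10\n"
    ",south,20\n"
    "2,south,30\n"
)

# Stream row by row (nothing is fully loaded), drop null customer_id,
# and bucket output by region — the shape partitionBy() gives you in Spark.
partitions = collections.defaultdict(list)
for row in csv.DictReader(src):
    if row["customer_id"]:  # skip null/empty customer_id
        partitions[row["region"]].append(row)
```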
Data Engineer • Coding • easy
Write a Python function to find the first non-repeating character in a given string. Optimize it for time complexity.
#Strings
#Hash Maps
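The optimal approach is two O(n) passes with a hash map: count every character, then return the first whose count is 1. A minimal sketch:

```python
from collections import Counter

def first_unique_char(s):
    """Return the first character of s that occurs exactly once, else None."""
    counts = Counter(s)  # first pass: character frequencies
    return next((ch for ch in s if counts[ch] == 1), None)  # second pass
```

This is O(n) time and O(k) space for k distinct characters, versus the naive O(n²) of scanning the string once per character.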
Data Engineer • Coding • medium
You need to extract data from a third-party REST API for a client project. The API limits responses to 100 records per request. Write a Python snippet to handle pagination and extract all records.
#API Integration
#Pagination
#Requests Library
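The loop matters more than the HTTP call: keep requesting until a page comes back smaller than the limit (or empty). In production `fetch_page` would be a `requests.get` with an offset/page parameter; here it is a hypothetical stand-in so the pagination loop itself is runnable:

```python
PAGE_SIZE = 100
_FAKE_DATA = list(range(250))  # pretend the API holds 250 records

def fetch_page(offset, limit=PAGE_SIZE):
    """Hypothetical stand-in for requests.get(url, params={...}).json()."""
    return _FAKE_DATA[offset:offset + limit]

def fetch_all():
    records, offset = [], 0
    while True:
        page = fetch_page(offset)
        records.extend(page)
        if len(page) < PAGE_SIZE:  # short or empty page => last page
            break
        offset += PAGE_SIZE
    return records

all_records = fetch_all()
```

Mentioning rate-limit handling (HTTP 429 with retry/backoff) usually earns extra credit here.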
Data Engineer • Coding • easy
Given a list of dictionaries representing employee data (id, name, department), write Python code to group the employees by department.
#Data Manipulation
#Dictionaries
#Collections
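A `defaultdict(list)` is the idiomatic one-pass answer (with `itertools.groupby` as the alternative, which requires pre-sorting). Sample data here is illustrative:

```python
from collections import defaultdict

employees = [
    {"id": 1, "name": "Asha", "department": "Data"},
    {"id": 2, "name": "Ravi", "department": "Data"},
    {"id": 3, "name": "Meera", "department": "QA"},
]

# Single pass: append each employee to their department's bucket.
by_dept = defaultdict(list)
for emp in employees:
    by_dept[emp["department"]].append(emp["name"])
```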
Data Engineer • Coding • hard
Write a PySpark snippet to merge new incoming data into an existing Delta Lake table, updating existing records and inserting new ones (Upsert).
#Delta Lake
#PySpark
#Upserts
Data Engineer • Coding • medium
Write a SQL query to pivot a table containing 'Year', 'Month', and 'Revenue' so that each Month becomes a column with the corresponding Revenue.
#Pivot
#Data Transformation
Data Engineer • Coding • easy
Write a Python script to connect to an AWS S3 bucket, list all files with a '.json' extension, and print their sizes.
#Boto3
#AWS
#Scripting
Data Engineer • System Design • medium
Design an ETL pipeline on AWS to ingest daily Call Detail Records (CDRs) from an SFTP server, transform them, and load them into Redshift for reporting.
#AWS
#ETL Architecture
#Data Warehousing
Data Engineer • System Design • hard
Design a real-time streaming pipeline to process IoT sensor data from manufacturing plants, detect anomalies, and store the results.
#Streaming
#Kafka
#Spark Streaming
#NoSQL
Data Engineer • System Design • hard
A healthcare client wants to move from a traditional data warehouse to a Data Lakehouse architecture. How would you design this using Databricks?
#Data Lakehouse
#Databricks
#Medallion Architecture
Data Engineer • System Design • hard
Design a batch processing pipeline to ingest 500GB of transactional data daily. How do you handle incremental loads?
#Batch Processing
#Incremental Load
#Architecture
Data Engineer • Technical • medium
How do you schedule and monitor your data pipelines? Explain the core components of Apache Airflow.
#Airflow
#Orchestration
Data Engineer • Technical • easy
Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Provide a scenario where you would specifically choose DENSE_RANK() over RANK().
#Window Functions
#Data Ranking
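The difference shows up only on ties: `ROW_NUMBER()` numbers every row uniquely, `RANK()` leaves gaps after ties, and `DENSE_RANK()` does not — which is why `DENSE_RANK()` is the right choice for questions like "find the Nth highest salary", where a gap would skip a value. A runnable demo against a small scores table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80)])

# a and b tie at 90; watch how each function ranks c.
query = """
SELECT name,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
       RANK()       OVER (ORDER BY score DESC) AS rk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS drk
FROM scores
"""
rows = conn.execute(query).fetchall()
```

Here `c` gets `RANK() = 3` (the tie consumes two ranks) but `DENSE_RANK() = 2`.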
Data Engineer • Technical • hard
A client complains that a critical reporting query reading from a 50-million-row table is running too slow. Walk me through your step-by-step approach to optimizing it.
#Query Optimization
#Indexing
#Execution Plans
Data Engineer • Technical • medium
Explain the internal architecture of Apache Spark. What happens under the hood when you submit a Spark job?
#Spark Architecture
#Driver
#Executors
#Cluster Manager
Data Engineer • Technical • hard
During a data migration project, your PySpark job is running extremely slow and some tasks are taking much longer than others. How do you identify and resolve data skewness?
#Performance Tuning
#Data Skew
#Salting
Data Engineer • Technical • medium
What is the difference between a Broadcast Hash Join and a Sort Merge Join in Spark? When would you force a Broadcast join?
#Spark Joins
#Optimization
Data Engineer • Technical • easy
Explain the concept of Lazy Evaluation in Spark. Why is it beneficial for performance?
#Spark Core
#Transformations vs Actions
Data Engineer • Technical • hard
Your Spark job fails with an OutOfMemory (OOM) error on the executor side. What parameters would you tweak or what code changes would you make?
#Troubleshooting
#Memory Management
#Spark Configuration
Data Engineer • Technical • medium
In Azure Data Factory (ADF), how do you design a dynamic pipeline that can copy data from 50 different on-premise SQL Server tables to Azure Data Lake without creating 50 separate copy activities?
#Azure Data Factory
#Dynamic Pipelines
#Metadata Driven ETL
Data Engineer • Technical • medium
Explain the architecture of Snowflake. How does its separation of storage and compute benefit a multi-tenant client environment?
#Snowflake
#Cloud Architecture
Data Engineer • Technical • medium
What is Slowly Changing Dimension (SCD) Type 2? Explain how you would implement it in a data warehouse.
#Dimensional Modeling
#SCD
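The core of an SCD Type 2 implementation is: when a tracked attribute changes, expire the current dimension row (set its end date and current flag) and insert a new current row, preserving full history. A minimal sketch in SQLite — column names (`start_date`, `end_date`, `is_current`) are common conventions, not a fixed standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_id INTEGER, city TEXT,
    start_date TEXT, end_date TEXT, is_current INTEGER
)""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Pune', '2023-01-01', NULL, 1)")

new_city, change_date = "Mumbai", "2024-06-01"

# Step 1: expire the current row if the tracked attribute changed.
conn.execute("""
UPDATE dim_customer
SET end_date = ?, is_current = 0
WHERE customer_id = 1 AND is_current = 1 AND city <> ?
""", (change_date, new_city))

# Step 2: insert the new current version.
conn.execute("INSERT INTO dim_customer VALUES (1, ?, ?, NULL, 1)",
             (new_city, change_date))

history = conn.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY start_date"
).fetchall()
```

In a warehouse this pair of statements is typically expressed as a single `MERGE` (or Delta Lake `merge`) keyed on the business key plus the current flag.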
Data Engineer • Technical • easy
Compare Star Schema and Snowflake Schema. If a client prioritizes query read performance over storage space, which would you recommend and why?
#Data Modeling
#Schema Design
Data Engineer • Technical • medium
How do you ensure data quality and integrity in your ETL pipelines? What specific checks do you automate?
#Data Validation
#Testing
Data Engineer • Technical • easy
Explain your Git workflow for deploying data engineering code across Development, QA, and Production environments.
#CI/CD
#Version Control
Data Engineer • Technical • medium
What is 'Idempotency' in the context of data engineering? Why is it critical for data pipelines?
#Pipeline Design
#Reliability
Data Engineer • Technical • easy
In PySpark, what is the difference between repartition() and coalesce()? When should you use which?
#PySpark
#Partitioning
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer. Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.