Infosys

Global leader in next-generation digital services and consulting.

3 Rounds · ~14 Days · Medium difficulty
The Interview Loop

Recruiter Screen (30 min)

Standard fit check, behavioral questions, and resume overview.

Technical Loop (3-4 Rounds)

Deep dive into domain knowledge, coding, and system design.

Interview Question Bank

Data Engineer Behavioral medium

Tell me about a time you had a disagreement with a client regarding a technical architecture choice. How did you resolve it?

#Client Communication #Conflict Resolution #Consulting
Data Engineer Behavioral medium

Describe a situation where you had to quickly learn a new technology to deliver an urgent project requirement.

#Adaptability #Continuous Learning #Delivery
Data Engineer Behavioral medium

How do you manage communication and ensure alignment when working in a distributed model with onshore and offshore teams?

#Communication #Agile #Global Delivery Model
Data Engineer Coding medium

Write a SQL query to find the 3rd highest salary from an Employee table without using the LIMIT keyword.

#Window Functions #Subqueries #DENSE_RANK
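
One way to sketch an answer, run here through Python's built-in sqlite3 so the SQL is verifiable (the Employee schema and sample rows are assumptions for illustration):

```python
import sqlite3

# Hypothetical Employee table; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(1, "A", 100), (2, "B", 200), (3, "C", 300),
                  (4, "D", 300), (5, "E", 400)])

# DENSE_RANK lets tied salaries share a rank, so "3rd highest" means
# the 3rd distinct salary -- and no LIMIT keyword is needed.
query = """
SELECT DISTINCT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM Employee
)
WHERE rnk = 3
"""
third_highest = conn.execute(query).fetchone()[0]
print(third_highest)  # 200
```

An alternative without window functions is a correlated subquery counting distinct higher salaries, but DENSE_RANK is usually the cleaner interview answer.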
Data Engineer Coding medium

Write a Python function to flatten a deeply nested JSON object/dictionary into a single-level dictionary.

#Python #Recursion #Data Structures #JSON
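
A minimal recursive sketch; the dot-separated key convention is an assumption, since the question leaves the naming scheme open:

```python
def flatten(d, parent_key="", sep="."):
    """Recursively flatten a nested dict into a single level,
    joining path segments with `sep` (naming scheme assumed)."""
    items = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path down.
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

nested = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}
print(flatten(nested))  # {'a': 1, 'b.c': 2, 'b.d.e': 3}
```

A strong follow-up is how to handle lists inside the JSON (index-based keys vs. leaving them as values) and whether very deep nesting warrants an iterative stack instead of recursion.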
Data Engineer Coding medium

Write a SQL query to delete duplicate rows from a table, keeping only the record with the lowest ID.

#CTEs #ROW_NUMBER() #Data Cleansing
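
A sketch using a CTE with ROW_NUMBER, executed via sqlite3 so it can be checked (the Person table and the `email` duplicate key are assumptions):

```python
import sqlite3

# Hypothetical table with duplicate emails; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [(1, "a@x.com"), (2, "b@x.com"),
                  (3, "a@x.com"), (4, "a@x.com")])

# ROW_NUMBER over each email group ordered by id: rn = 1 marks the
# lowest-id row, so deleting rn > 1 keeps exactly that record.
conn.execute("""
WITH ranked AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM Person
)
DELETE FROM Person WHERE id IN (SELECT id FROM ranked WHERE rn > 1)
""")
remaining = [row[0] for row in conn.execute("SELECT id FROM Person ORDER BY id")]
print(remaining)  # [1, 2]
```

On engines where DELETE cannot reference a CTE directly, the same logic works as `DELETE ... WHERE id NOT IN (SELECT MIN(id) ... GROUP BY email)`.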
Data Engineer Coding medium

Write a Python script to read a large CSV file (10GB) that doesn't fit into memory, filter rows based on a condition, and write to a new file.

#Python #Generators #File I/O #Memory Management
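
The key idea is streaming: iterating over a `csv` reader pulls one row at a time, so memory stays flat no matter the file size. A sketch, with a tiny demo file standing in for the 10GB input (file and column names are assumptions):

```python
import csv

def filter_large_csv(src, dst, predicate):
    """Stream a CSV row by row; the reader is a lazy iterator,
    so only one row is held in memory at a time."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:  # lazy iteration -- constant memory
            if predicate(row):
                writer.writerow(row)

# Tiny stand-in for the 10GB file (names are hypothetical).
with open("sales.csv", "w", newline="") as f:
    f.write("region,amount\neast,50\nwest,200\neast,300\n")

filter_large_csv("sales.csv", "big_sales.csv",
                 lambda r: int(r["amount"]) > 100)
```

Worth mentioning in the interview: for heavier transformations, chunked reading (e.g. pandas `read_csv(chunksize=...)`) or moving the job to Spark are natural escalations.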
Data Engineer Coding easy

Write a SQL query to find employees who earn more than their direct managers.

#Self Join #Filtering
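
The classic self-join answer, run through sqlite3 for verification (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee
                (id INTEGER, name TEXT, salary INTEGER, manager_id INTEGER)""")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)",
                 [(1, "Ann", 90, None), (2, "Bob", 100, 1), (3, "Cam", 60, 1)])

# Self join: alias e is the employee row, m is that employee's manager.
rows = conn.execute("""
SELECT e.name
FROM Employee e
JOIN Employee m ON e.manager_id = m.id
WHERE e.salary > m.salary
""").fetchall()
print([r[0] for r in rows])  # ['Bob']
```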
Data Engineer Coding medium

Given an array of strings, write a Python function to group anagrams together.

#Python #Hash Maps #Strings #Sorting
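
A standard hash-map sketch: the sorted letters of a word form a canonical key, and all anagrams share that key:

```python
from collections import defaultdict

def group_anagrams(words):
    """Group anagrams via a hash map keyed on sorted letters.
    O(n * k log k) for n words of length k."""
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    return list(groups.values())

print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
```

A common follow-up is replacing the sorted-string key with a 26-element letter-count tuple to drop the per-word sort to O(k).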
Data Engineer Coding medium

Write a SQL query to calculate the cumulative sum of sales per region, ordered by date.

#Window Functions #SUM() OVER #Aggregations
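
A sketch using `SUM() OVER`, run via sqlite3 (the Sales schema and rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (region TEXT, sale_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)",
                 [("east", "2024-01-01", 10), ("east", "2024-01-02", 20),
                  ("west", "2024-01-01", 5), ("west", "2024-01-03", 15)])

# PARTITION BY restarts the running total per region; ORDER BY
# sale_date defines the cumulative frame within each partition.
rows = conn.execute("""
SELECT region, sale_date,
       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total
FROM Sales
ORDER BY region, sale_date
""").fetchall()
print(rows)
# [('east', '2024-01-01', 10), ('east', '2024-01-02', 30),
#  ('west', '2024-01-01', 5), ('west', '2024-01-03', 20)]
```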
Data Engineer Coding easy

Write a Python program to find the missing number in an array containing integers from 1 to N.

#Python #Math #Arrays
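
The arithmetic-series trick: the expected sum of 1..N minus the actual sum is the missing value, in O(n) time and O(1) space:

```python
def missing_number(nums, n):
    """Expected sum n(n+1)/2 minus the actual sum yields the one
    missing integer; no sorting or extra storage required."""
    return n * (n + 1) // 2 - sum(nums)

print(missing_number([1, 2, 4, 5], 5))  # 3
```

XOR-ing 1..N against the array gives the same answer and avoids any overflow concern in fixed-width-integer languages.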
Data Engineer Coding medium

Write a PySpark script to read a CSV, drop rows with nulls in a specific column, group by another column, and write the output as Parquet.

#PySpark #DataFrame API #Data Cleaning
Data Engineer Coding medium

Write a SQL query to find the top 3 selling products in each category.

#Window Functions #RANK() #PARTITION BY
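
A sketch with `RANK() OVER (PARTITION BY ...)`, checked via sqlite3 (the Products schema and data are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (category TEXT, product TEXT, units_sold INTEGER)")
conn.executemany("INSERT INTO Products VALUES (?, ?, ?)",
                 [("toys", "car", 50), ("toys", "doll", 70), ("toys", "ball", 30),
                  ("toys", "kite", 10), ("food", "tea", 90), ("food", "jam", 40)])

# RANK() restarts per category; ties share a rank, so a tie at rank 3
# can return more than three rows (often the desired behavior --
# use ROW_NUMBER() for exactly three).
rows = conn.execute("""
SELECT category, product
FROM (
    SELECT category, product,
           RANK() OVER (PARTITION BY category ORDER BY units_sold DESC) AS rnk
    FROM Products
)
WHERE rnk <= 3
ORDER BY category, rnk
""").fetchall()
print(rows)
```

Being able to explain the RANK vs. DENSE_RANK vs. ROW_NUMBER tie-handling difference is usually what interviewers are probing here.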
Data Engineer System Design medium

Design an ETL pipeline to migrate on-premise SQL Server data to Azure Synapse Analytics for a retail client.

#Azure Data Factory #Azure Synapse #Data Migration #ETL
Data Engineer System Design hard

Design a real-time streaming pipeline to process clickstream data and generate hourly aggregations.

#Kafka #Spark Structured Streaming #Real-time Processing #Cloud Architecture
Data Engineer System Design hard

Design a batch data pipeline to process 10TB of daily transaction logs, ensuring idempotency and fault tolerance.

#Batch Processing #Idempotency #Fault Tolerance #Data Lakehouse
Data Engineer System Design hard

Design a data quality framework for a newly built data lakehouse. What checks would you implement?

#Data Quality #Data Governance #Lakehouse
Data Engineer Technical hard

How do you handle data skewness in a PySpark join operation?

#PySpark #Performance Tuning #Data Skewness #Salting
Data Engineer Technical easy

Explain the difference between repartition() and coalesce() in PySpark. When would you use one over the other?

#PySpark #Data Partitioning #Shuffle
Data Engineer Technical medium

What is a Slowly Changing Dimension (SCD) Type 2? How would you implement it using PySpark?

#Data Warehousing #SCD #PySpark #ETL
Data Engineer Technical medium

Explain the internal execution hierarchy of a Spark application.

#Spark Architecture #Jobs #Stages #Tasks
Data Engineer Technical hard

How do you handle Out of Memory (OOM) errors in a PySpark application?

#PySpark #Troubleshooting #Memory Management
Data Engineer Technical easy

What is the difference between Star Schema and Snowflake Schema? Which one performs better in a modern cloud data warehouse?

#Data Warehousing #Star Schema #Snowflake Schema #Normalization
Data Engineer Technical medium

Explain the architecture of Snowflake. How does it separate storage and compute?

#Snowflake #Cloud Data Warehouse #Architecture
Data Engineer Technical medium

How do you implement incremental data loading in Azure Data Factory (ADF)?

#Azure Data Factory #ETL #Incremental Load #Watermarking
Data Engineer Technical hard

What is the Catalyst Optimizer in Spark? Explain its phases.

#Spark Internals #Catalyst Optimizer #Query Plans
Data Engineer Technical medium

How do you pass data between different tasks in an Apache Airflow DAG?

#Airflow #Orchestration #XComs
Data Engineer Technical medium

What are Delta Lake's ACID properties? How does it handle concurrent writes?

#Databricks #Delta Lake #ACID #Concurrency
Data Engineer Technical medium

Explain the differences between Parquet, ORC, and Avro file formats. When would you choose Parquet over Avro?

#File Formats #Storage Optimization #Parquet #Avro
Data Engineer Technical easy

What is the difference between cache() and persist() in PySpark?

#PySpark #Memory Management #Caching
Data Engineer Technical hard

How do you ensure exactly-once processing semantics in an Apache Kafka streaming application?

#Kafka #Streaming #Exactly-once Semantics
Data Engineer Technical medium

What is a Factless Fact table? Provide a real-world example of when you would use one.

#Data Warehousing #Fact Tables #Dimensional Modeling
Data Engineer Technical medium

Explain Time Travel and Fail-safe features in Snowflake. How do they differ?

#Snowflake #Data Recovery #Architecture
Data Engineer Technical medium

Compare AWS Glue and Amazon EMR. When would you recommend one over the other to a client?

#AWS #Glue #EMR #Serverless
Data Engineer Technical medium

What are Broadcast Variables and Accumulators in Spark? Give use cases for each.

#PySpark #Shared Variables #Optimization

Meet Your Interviewers

The "Standard" Interviewer

Senior Engineer

Focuses on core competencies, system constraints, and clear communication.


Unwritten Rules

Think Out Loud

Always explain your thought process before writing code or drawing architecture.
