Cognizant
Cognizant is an American multinational information technology services and consulting company.
4 Rounds • ~21 Days • Medium difficulty
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Engineer • Behavioral • medium
Tell me about a time when a client changed the data requirements in the middle of a sprint. How did you handle the pipeline refactoring and communicate the impact?
#Agile
#Client Communication
#Adaptability
Data Engineer • Behavioral • hard
Tell me about a complex data pipeline you built from scratch. What were the biggest technical challenges and how did you overcome them?
#Project Experience
#Problem Solving
#End-to-End Delivery
Data Engineer • Behavioral • medium
Have you ever disagreed with a Data Architect or Lead regarding a pipeline design? How did you handle the disagreement?
#Conflict Resolution
#Communication
#Teamwork
Data Engineer • Behavioral • medium
Describe a time you had to optimize cloud costs for a data engineering project at a client site.
#Cost Optimization
#Cloud
#Client Delivery
Data Engineer • Coding • medium
Write a SQL query to find the top 3 highest earning employees in each department, handling ties appropriately.
#Window Functions
#DENSE_RANK
#PARTITION BY
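One common shape for this query, sketched here in SQLite so it runs standalone (table name, columns, and data are illustrative; the window-function syntax carries over to most warehouses). DENSE_RANK handles ties without skipping ranks, which also means a department with ties can return more than three people:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "IT", 90), ("Bob", "IT", 90), ("Cal", "IT", 80),
     ("Dee", "IT", 70), ("Eve", "HR", 60), ("Fay", "HR", 50)],
)

# DENSE_RANK gives both 90-salary employees rank 1 and the 80-salary
# employee rank 2 (no gap), so salaries 90, 90, 80, 70 rank 1, 1, 2, 3.
top3 = conn.execute("""
    SELECT name, department, salary
    FROM (
        SELECT name, department, salary,
               DENSE_RANK() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS rnk
        FROM employees
    )
    WHERE rnk <= 3
    ORDER BY department, salary DESC
""").fetchall()
```

Worth calling out in the interview: because of the tie at 90, all four IT employees fall in the top three salary ranks here; with RANK instead, Dee (rank 4) would be excluded.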
Data Engineer • Coding • medium
Write a PySpark script to read a CSV file from S3, drop rows with null values in a specific column, group by another column to find the average, and write the output back to S3 as Parquet.
#DataFrames
#I/O
#Aggregations
Data Engineer • Coding • easy
Write a Python function to check if two strings are anagrams of each other. Optimize it for time complexity.
#Python
#Strings
#Hash Maps
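A typical O(n) answer counts characters with a hash map rather than sorting both strings (which would be O(n log n)); a minimal sketch:

```python
from collections import Counter

def are_anagrams(a: str, b: str) -> bool:
    # Counting each character once is O(n) time and O(k) space for k
    # distinct characters; sorting both strings would cost O(n log n).
    return Counter(a) == Counter(b)

print(are_anagrams("listen", "silent"))  # True
print(are_anagrams("hello", "world"))    # False
```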
Data Engineer • Coding • medium
Write a SQL query to find the cumulative sum of sales per day for the current month.
#Window Functions
#Aggregations
#Date Functions
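A runnable SQLite sketch of the pattern (table, columns, and a fixed month are illustrative; in production the filter would use the engine's current-date function). Aggregate per day first, then take a running SUM() over the ordered days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-05-01", 10), ("2024-05-01", 5), ("2024-05-02", 20), ("2024-04-30", 99)],
)

# Inner query: one total per day for the target month.
# Outer window: cumulative sum across the ordered days.
rows = conn.execute("""
    SELECT sale_date,
           daily_total,
           SUM(daily_total) OVER (ORDER BY sale_date) AS running_total
    FROM (
        SELECT sale_date, SUM(amount) AS daily_total
        FROM sales
        WHERE strftime('%Y-%m', sale_date) = '2024-05'
        GROUP BY sale_date
    )
    ORDER BY sale_date
""").fetchall()
# rows == [('2024-05-01', 15, 15), ('2024-05-02', 20, 35)]
```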
Data Engineer • Coding • easy
Given a list of dictionaries representing employee data, write a Python script using list comprehensions to extract the names of employees who belong to the 'IT' department and have a salary > 80000.
#List Comprehensions
#Data Manipulation
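A direct one-pass answer (the sample records are illustrative):

```python
employees = [
    {"name": "Asha", "department": "IT", "salary": 95000},
    {"name": "Ben",  "department": "HR", "salary": 85000},
    {"name": "Chen", "department": "IT", "salary": 70000},
]

# Filter on both conditions, project just the name, all in one comprehension.
it_high_earners = [
    e["name"]
    for e in employees
    if e["department"] == "IT" and e["salary"] > 80000
]
# it_high_earners == ['Asha']
```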
Data Engineer • Coding • medium
Write a SQL query to delete duplicate rows from a table, keeping only the record with the lowest ID.
#Data Cleaning
#CTEs
#DELETE
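A portable form of the answer, sketched in SQLite (table and data are illustrative): keep the MIN(id) per duplicate key and delete everything else. On engines like SQL Server, the same idea is often written with a CTE plus ROW_NUMBER:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [(1, "a@x.com"), (2, "a@x.com"), (3, "b@x.com"), (4, "a@x.com")],
)

# Keep the lowest id per email; delete every other copy.
conn.execute("""
    DELETE FROM contacts
    WHERE id NOT IN (
        SELECT MIN(id) FROM contacts GROUP BY email
    )
""")
remaining = conn.execute("SELECT id, email FROM contacts ORDER BY id").fetchall()
# remaining == [(1, 'a@x.com'), (3, 'b@x.com')]
```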
Data Engineer • Coding • medium
Write a Python generator function that reads a massive log file line by line and yields lines containing the word 'ERROR'. Why use a generator here?
#Generators
#Memory Management
#File I/O
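A sketch of the expected answer (an in-memory stream stands in for a real log file here). The "why": a generator yields one line at a time, so memory stays constant no matter how large the file is, whereas `readlines()` or a list comprehension would load the whole file:

```python
import io

def error_lines(log):
    """Yield only the lines containing 'ERROR', one at a time."""
    for line in log:
        if "ERROR" in line:
            yield line.rstrip("\n")

# In real use this would be: with open("app.log") as log: ...
log = io.StringIO("INFO boot\nERROR disk full\nWARN slow\nERROR timeout\n")
found = list(error_lines(log))
# found == ['ERROR disk full', 'ERROR timeout']
```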
Data Engineer • Coding • medium
Write a PySpark DataFrame query to pivot a table. You have columns: 'Store', 'Month', and 'Revenue'. Pivot the 'Month' column so each month is a separate column showing the revenue.
#Pivot
#Data Aggregation
Data Engineer • Coding • easy
Write a SQL query to find all employees who earn more than their direct managers. The table 'Employee' has columns: Id, Name, Salary, ManagerId.
#Self Join
#Filtering
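The classic self-join answer, runnable here via SQLite (the schema matches the question; the sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Id INTEGER, Name TEXT, Salary INTEGER, ManagerId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Joe", 70000, 3), (2, "Henry", 80000, 4),
     (3, "Sam", 60000, None), (4, "Max", 90000, None)],
)

# Join the table to itself: e is the employee, m is that employee's manager.
rows = conn.execute("""
    SELECT e.Name
    FROM Employee e
    JOIN Employee m ON e.ManagerId = m.Id
    WHERE e.Salary > m.Salary
""").fetchall()
# rows == [('Joe',)]  -- Joe (70000) out-earns his manager Sam (60000)
```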
Data Engineer • Coding • hard
Write a Python script to flatten a deeply nested JSON object representing a client's API response into a flat dictionary.
#Recursion
#JSON
#Data Parsing
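A recursive sketch using dotted keys (the payload, separator, and key convention are illustrative choices; list handling, shown here by index, is the usual follow-up question):

```python
import json

def flatten(obj, parent_key="", sep="."):
    """Recursively flatten nested dicts and lists into one flat dict."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        # Lists are flattened by position: tags.0, tags.1, ...
        for i, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{i}" if parent_key else str(i)
            items.update(flatten(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items

payload = json.loads('{"user": {"id": 7, "tags": ["a", "b"]}, "ok": true}')
flat = flatten(payload)
# flat == {'user.id': 7, 'user.tags.0': 'a', 'user.tags.1': 'b', 'ok': True}
```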
Data Engineer • System Design • hard
Design a batch ETL pipeline to migrate 500GB of daily transactional data from an on-premise Oracle database to Snowflake on AWS. What tools and architecture would you use?
#AWS
#Snowflake
#Data Migration
#ETL Architecture
Data Engineer • System Design • hard
Design a real-time streaming pipeline to process clickstream data from a retail client's website and update a live dashboard.
#Streaming
#Kafka
#Spark Streaming
#Real-time Analytics
Data Engineer • System Design • hard
Design a system to ingest and process daily healthcare claims data (HIPAA compliant). The data arrives as CSVs in an SFTP server.
#Healthcare
#Security
#ETL
#Cloud Architecture
Data Engineer • System Design • medium
Design a CI/CD pipeline for deploying Data Engineering assets (Airflow DAGs, Snowflake SQL scripts, PySpark code).
#CI/CD
#DevOps
#Git
#Jenkins/GitHub Actions
Data Engineer • Technical • easy
Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). In what client reporting scenario would you choose DENSE_RANK over RANK?
#Window Functions
#Data Analysis
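The three functions differ only in how they number ties, which a tiny tied dataset makes concrete (sketched in SQLite; data is illustrative). RANK leaves a gap after a tie while DENSE_RANK does not, so DENSE_RANK fits client reports like "top 3 price tiers" where a tie should not push later values out of range:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 100), ("B", 100), ("C", 90)],
)

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
""").fetchall()
# A and B (tied at 100): ROW_NUMBER gives them arbitrary distinct numbers 1 and 2,
# while RANK and DENSE_RANK both give 1.
# C (90): RANK jumps to 3 (gap after the tie), DENSE_RANK stays at 2.
```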
Data Engineer • Technical • hard
How do you handle data skewness in PySpark? Walk me through the exact steps you would take if a join operation is taking too long due to a skewed key.
#Performance Tuning
#Data Skew
#Salting
#Broadcast Join
Data Engineer • Technical • easy
What is the difference between a narrow and wide transformation in Spark? Give examples of each.
#Spark Architecture
#Transformations
#Shuffling
Data Engineer • Technical • medium
Explain the architecture of Snowflake. How does its separation of compute and storage benefit a multi-tenant consulting project?
#Snowflake
#Architecture
#Virtual Warehouses
Data Engineer • Technical • medium
How do you implement Slowly Changing Dimension (SCD) Type 2 in a data warehouse? Describe the necessary columns and the update logic.
#SCD
#Data Warehousing
#ETL
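The usual columns are a surrogate or business key, the tracked attributes, `valid_from`/`valid_to` dates, and an `is_current` flag; the update logic is "expire the old version, insert the new one". A minimal SQLite sketch of that two-step logic (table, column names, and dates are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER,
        city        TEXT,
        valid_from  TEXT,
        valid_to    TEXT,
        is_current  INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Pune', '2023-01-01', '9999-12-31', 1)")

def apply_scd2(conn, customer_id, new_city, change_date):
    # Step 1: close out the current version, but only if the tracked
    # attribute actually changed.
    cur = conn.execute("""
        UPDATE dim_customer
        SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1 AND city <> ?
    """, (change_date, customer_id, new_city))
    # Step 2: insert the new version as the open-ended current row.
    if cur.rowcount:
        conn.execute(
            "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, new_city, change_date),
        )

apply_scd2(conn, 1, "Mumbai", "2024-06-01")
history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer ORDER BY valid_from"
).fetchall()
# history: the Pune row is closed on 2024-06-01; Mumbai is the current row.
```

In a real warehouse the same logic is typically one MERGE statement or a dbt snapshot rather than two hand-written statements.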
Data Engineer • Technical • hard
You have a PySpark job failing with an OutOfMemory (OOM) error on the executor side. What are the potential causes and how do you troubleshoot it?
#Troubleshooting
#Memory Management
#OOM
Data Engineer • Technical • medium
What is the difference between repartition() and coalesce() in PySpark? When would you use one over the other?
#Partitioning
#Performance Tuning
Data Engineer • Technical • medium
Explain the concept of a Data Mesh. How does it differ from a traditional centralized Data Lake architecture?
#Data Mesh
#Data Lake
#Decentralization
Data Engineer • Technical • medium
Describe your experience with Apache Airflow. How do you pass data between tasks in an Airflow DAG?
#Airflow
#XComs
#DAGs
Data Engineer • Technical • easy
What are Parquet files? Why are they preferred over CSV or JSON in Big Data processing?
#File Formats
#Parquet
#Columnar Storage
Data Engineer • Technical • medium
How does Spark handle fault tolerance? Explain the role of the DAG and RDD lineage.
#Fault Tolerance
#Lineage
#DAG
Data Engineer • Technical • easy
What is the difference between a Left Join and an Inner Join? What happens to the result set if the right table has multiple matching rows for a single row in the left table?
#Joins
#Data Duplication
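The multiple-match behavior is easiest to show on two tiny tables (sketched in SQLite; data is illustrative): a left-table row is repeated once per matching right-table row, and rows with no match survive a LEFT JOIN with NULLs but vanish from an INNER JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Raj');
    INSERT INTO orders VALUES (1, 'pen'), (1, 'ink');
""")

left = conn.execute("""
    SELECT c.name, o.item
    FROM customers c LEFT JOIN orders o ON c.id = o.customer_id
    ORDER BY c.name, o.item
""").fetchall()
# Ana is duplicated (one output row per matching order);
# Raj has no orders but survives with a NULL item:
# left == [('Ana', 'ink'), ('Ana', 'pen'), ('Raj', None)]

inner = conn.execute("""
    SELECT c.name, o.item
    FROM customers c JOIN orders o ON c.id = o.customer_id
""").fetchall()
# The INNER JOIN still duplicates Ana but drops Raj entirely: 2 rows.
```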
Data Engineer • Technical • medium
Explain the concept of Predicate Pushdown in Spark and Snowflake. How does it improve query performance?
#Predicate Pushdown
#Query Optimization
Data Engineer • Technical • medium
In AWS, what is the difference between Amazon Redshift and Amazon Athena? When would you use Athena for a client project?
#AWS
#Redshift
#Athena
#Serverless
Data Engineer • Technical • medium
What are User Defined Functions (UDFs) in PySpark? Why are Python UDFs generally discouraged, and what is the alternative?
#UDFs
#Performance Tuning
#Pandas UDFs
Data Engineer • Technical • hard
Explain the CAP theorem. How does it apply to choosing a NoSQL database like Cassandra vs MongoDB for a specific use case?
#CAP Theorem
#NoSQL
#System Architecture
Data Engineer • Technical • medium
What is the difference between Star Schema and Snowflake Schema? Which one is preferred in modern columnar data warehouses and why?
#Star Schema
#Snowflake Schema
#Dimensional Modeling
Difficulty Radar (chart based on recent AI-sourced data)
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer. Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.