Infosys
Global leader in next-generation digital services and consulting.
3 Rounds
~14 Days
Medium
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Scientist • Behavioral • easy
Can you explain p-values and confidence intervals to a non-technical business stakeholder?
#Statistics
#Stakeholder Management
#Hypothesis Testing
Data Scientist • Behavioral • medium
Tell me about a time you had to push back on a client's unrealistic expectations regarding a machine learning model's accuracy.
#Stakeholder Management
#Communication
#Expectation Setting
Data Scientist • Behavioral • medium
Describe a situation where your model performed well in training but failed in production. How did you troubleshoot and fix it?
#Debugging
#Production Issues
#Overfitting
#Data Leakage
Data Scientist • Behavioral • medium
Infosys often works with legacy systems. How would you approach extracting, cleaning, and modeling data from an outdated, poorly documented mainframe system?
#Legacy Systems
#Data Cleaning
#Adaptability
#Consulting
Data Scientist • Behavioral • easy
Tell me about a time you had to learn a completely new technology stack or framework within a few weeks to deliver a client project.
#Adaptability
#Continuous Learning
#Agile
Data Scientist • Behavioral • medium
A client wants to use an expensive GenAI solution to solve a business problem, but you realize a simple rule-based system or basic regression would be more effective and cheaper. How do you convince them?
#Consulting
#Integrity
#Communication
#Cost-Benefit Analysis
Data Scientist • Coding • easy
Write a Python function using Pandas to find the second highest salary from an employee dataset, handling cases where multiple employees might have the same salary.
#Python
#Pandas
#Data Cleaning
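One possible answer is sketched below. It assumes the dataset is a DataFrame with a `salary` column (the column name and the sample rows are illustrative, not part of the question) and treats tied salaries as a single salary level.

```python
import pandas as pd


def second_highest_salary(df, column="salary"):
    """Return the second highest distinct salary, or None if there is none."""
    # Drop missing values and duplicates so tied salaries count once.
    distinct = df[column].dropna().drop_duplicates()
    if len(distinct) < 2:
        return None
    # nlargest(2) returns the top two distinct values in descending order.
    return distinct.nlargest(2).iloc[-1]


# Illustrative data: two employees share the top salary.
employees = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dev"],
        "salary": [90000, 120000, 120000, 75000],
    }
)
result = second_highest_salary(employees)
print(result)
```

Returning None when fewer than two distinct salaries exist is a design choice; an interviewer may prefer raising an error or returning NaN instead.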
Data Scientist • Coding • medium
Given a list of unstructured client transaction strings, write a Python function using regex to extract the transaction ID, date, and amount, and return them as a structured dictionary.
#Python
#Regex
#String Parsing
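The question does not specify the string format, so the sketch below assumes one: an ID like `TXN-48213`, followed by an ISO date, followed by a dollar amount. The pattern and the sample strings are hypothetical and would need adjusting to the client's real data.

```python
import re

# Assumed layout: ID first, then date, then amount, with arbitrary text between.
TXN_PATTERN = re.compile(
    r"(?P<transaction_id>TXN-\d+)"            # ID such as TXN-48213
    r".*?(?P<date>\d{4}-\d{2}-\d{2})"         # ISO date, YYYY-MM-DD
    r".*?\$(?P<amount>\d[\d,]*(?:\.\d{2})?)"  # dollar amount, optional cents
)


def parse_transaction(text):
    """Return a dict with transaction_id, date, and amount, or None if no match."""
    match = TXN_PATTERN.search(text)
    if match is None:
        return None
    record = match.groupdict()
    # Strip thousands separators before converting the amount to a number.
    record["amount"] = float(record["amount"].replace(",", ""))
    return record


def parse_transactions(lines):
    """Parse every line and silently skip the ones that do not match."""
    parsed = (parse_transaction(line) for line in lines)
    return [record for record in parsed if record is not None]


raw = [
    "Payment TXN-48213 received on 2024-03-15 for $1,250.00 via wire",
    "TXN-7 | 2024-04-01 | $80",
    "malformed entry",
]
records = parse_transactions(raw)
print(records)
```

In an interview it is worth saying out loud what happens to non-matching lines; here they are dropped, but logging or collecting them separately is often safer.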
Data Scientist • Coding • medium
Write a SQL query to find the top 3 products by revenue in each region. The table contains product_id, region, and revenue.
#Window Functions
#RANK()
#DENSE_RANK()
#Aggregation
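A minimal sketch of one answer, run through Python's built-in `sqlite3` module so it can be tested locally. The table name `sales` and the sample rows are assumptions, and window functions require SQLite 3.25 or newer. Revenue is summed per product first, in case the table holds several rows per product.

```python
import sqlite3

QUERY = """
WITH product_revenue AS (
    SELECT region, product_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region, product_id
),
ranked AS (
    SELECT region, product_id, total_revenue,
           DENSE_RANK() OVER (
               PARTITION BY region ORDER BY total_revenue DESC
           ) AS revenue_rank
    FROM product_revenue
)
SELECT region, product_id, total_revenue
FROM ranked
WHERE revenue_rank <= 3
ORDER BY region, revenue_rank, product_id;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "East", 100), ("A", "East", 50), ("B", "East", 120),
        ("C", "East", 90), ("D", "East", 10),
        ("A", "West", 30), ("B", "West", 70),
    ],
)
top_products = conn.execute(QUERY).fetchall()
for row in top_products:
    print(row)
```

DENSE_RANK() keeps every product tied at a rank, so a region can return more than three rows; ROW_NUMBER() would cap it at exactly three, and RANK() would leave gaps after ties. Which behavior is wanted is a good clarifying question.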
Data Scientist • Coding • hard
Given a table of user logins (user_id, login_date), write a SQL query to find users who logged in on 3 consecutive days.
#Self Joins
#Window Functions
#Date/Time Functions
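One common approach is the "gaps and islands" pattern: within each user, a date minus its row number is constant across a run of consecutive days. The sketch below assumes a table named `logins` with ISO-formatted date strings and uses SQLite's `julianday()`, which is dialect-specific; other databases would use their own date arithmetic.

```python
import sqlite3

QUERY = """
WITH distinct_logins AS (
    -- Several logins on the same day must count as one day.
    SELECT DISTINCT user_id, login_date FROM logins
),
numbered AS (
    SELECT user_id,
           julianday(login_date)
             - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date
               ) AS streak_key
    FROM distinct_logins
)
SELECT DISTINCT user_id
FROM numbered
GROUP BY user_id, streak_key
HAVING COUNT(*) >= 3
ORDER BY user_id;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"), (1, "2024-01-03"),
        (2, "2024-01-01"), (2, "2024-01-02"), (2, "2024-01-04"),
        (3, "2024-01-30"), (3, "2024-01-31"), (3, "2024-02-01"),
    ],
)
streak_users = [row[0] for row in conn.execute(QUERY)]
print(streak_users)
```

User 2 is excluded because of the gap on January 3, and user 3 shows that the streak survives a month boundary.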
Data Scientist • Coding • medium
Write a SQL query to calculate the month-over-month growth rate of active users from a daily activity log.
#Window Functions
#CTEs
#Aggregation
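A hedged sketch using a CTE plus LAG(), again run through `sqlite3`. The table name `daily_activity`, its columns, and the sample rows are assumptions; `strftime('%Y-%m', ...)` is SQLite's way of truncating a date to its month.

```python
import sqlite3

QUERY = """
WITH monthly AS (
    SELECT strftime('%Y-%m', activity_date) AS activity_month,
           COUNT(DISTINCT user_id) AS active_users
    FROM daily_activity
    GROUP BY activity_month
)
SELECT activity_month,
       active_users,
       ROUND(
           100.0 * (active_users - LAG(active_users) OVER (ORDER BY activity_month))
           / LAG(active_users) OVER (ORDER BY activity_month),
           1
       ) AS growth_pct
FROM monthly
ORDER BY activity_month;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_activity (user_id INTEGER, activity_date TEXT)")
conn.executemany(
    "INSERT INTO daily_activity VALUES (?, ?)",
    [
        (1, "2024-01-05"), (1, "2024-01-20"), (2, "2024-01-11"),
        (1, "2024-02-02"), (2, "2024-02-03"), (3, "2024-02-10"), (4, "2024-02-28"),
        (1, "2024-03-01"), (2, "2024-03-15"), (3, "2024-03-16"),
    ],
)
growth = conn.execute(QUERY).fetchall()
for row in growth:
    print(row)
```

The first month has no predecessor, so its growth is NULL. This sketch also assumes every month has at least one row; a calendar table would be needed to surface months with zero activity.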
Data Scientist • Coding • easy
Write a Python script to perform cross-validation on a dataset using Scikit-Learn, and explain why cross-validation is preferred over a simple train-test split.
#Python
#Scikit-Learn
#Cross-Validation
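A minimal sketch of such a script follows. The Iris dataset, logistic regression, and five stratified folds are illustrative choices, not requirements of the question.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which keeps validation-fold statistics from leaking into training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class balance in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The explanation half of the answer follows from the output: every row is used for validation exactly once, and the spread across folds shows how much a single train-test split could have misled you by chance.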
Data Scientist • System Design • hard
A retail client wants to forecast inventory demand across 500 stores. How do you approach building a scalable time-series forecasting model?
#ARIMA
#Prophet
#Forecasting
#Scalability
Data Scientist • System Design • medium
How would you design an NLP pipeline to automatically categorize incoming IT support tickets into different resolution queues?
#Text Classification
#TF-IDF
#Transformers
#Pipeline Design
Data Scientist • System Design • hard
A client wants to implement a Retrieval-Augmented Generation (RAG) system for their internal HR documents. Walk me through the architecture.
#RAG
#LLMs
#Vector Databases
#Embeddings
Data Scientist • System Design • hard
Design a recommendation system for an e-commerce client to suggest products based on user browsing history and past purchases.
#Collaborative Filtering
#Content-Based Filtering
#Matrix Factorization
Data Scientist • System Design • hard
Design a system to predict equipment failure in a manufacturing plant using IoT sensor data. How do you handle the high-frequency streaming data?
#IoT
#Streaming Data
#Predictive Maintenance
#Kafka
Data Scientist • Technical • medium
How would you handle a dataset with 50 million rows in Python if it exceeds your available RAM during a client engagement?
#Memory Management
#Dask
#Chunking
#PySpark
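The chunking idea can be sketched without any third-party dependency: read a bounded number of rows at a time and keep only running aggregates in memory. The file layout, column names, and helper names below are hypothetical; pandas offers the same pattern through `read_csv(chunksize=...)`, and Dask or PySpark distribute it across cores or machines.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def total_by_key(file_obj, key, value, chunk_size=100_000):
    """Sum `value` per `key` while holding only one chunk in memory."""
    totals = defaultdict(float)
    reader = csv.DictReader(file_obj)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            totals[row[key]] += float(row[value])
    return dict(totals)


# A tiny in-memory file stands in for a CSV that would not fit in RAM.
sample = io.StringIO("region,amount\nEast,10\nWest,5\nEast,2.5\nWest,1\nEast,4\n")
totals = total_by_key(sample, key="region", value="amount", chunk_size=2)
print(totals)
```

This only works for aggregations that can be combined across chunks (sums, counts, min/max). Operations that need the whole dataset at once, such as an exact median or a global sort, call for an out-of-core or distributed engine.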
Data Scientist • Technical • medium
Explain the difference between Random Forest and Gradient Boosting. Which one would you prefer for predicting customer churn for a telecom client, and why?
#Ensemble Methods
#Bagging
#Boosting
#Classification
Data Scientist • Technical • medium
How do you handle highly imbalanced datasets in a fraud detection model for a banking client? What metrics would you use?
#Imbalanced Data
#SMOTE
#Precision-Recall
#F1-Score
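The metrics half of this question is easy to illustrate with made-up numbers. The sketch below compares a classifier that never flags fraud against a hypothetical model that catches 8 of 10 frauds with 12 false alarms; all counts are invented for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1,000 transactions, 10 of them fraudulent (a 1% positive class).
y_true = [1] * 10 + [0] * 990
always_legit = [0] * 1000
model_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

baseline = classification_metrics(y_true, always_legit)
model = classification_metrics(y_true, model_pred)
print("baseline:", baseline)
print("model:   ", model)
```

The do-nothing baseline has the higher accuracy while catching no fraud at all, which is why precision, recall, F1, and the precision-recall curve are the usual talking points for imbalanced problems.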
Data Scientist • Technical • medium
What is the curse of dimensionality, and how do you address it when working with high-dimensional enterprise data?
#PCA
#Feature Selection
#Dimensionality Reduction
Data Scientist • Technical • hard
Explain the mathematical intuition behind Support Vector Machines (SVM). What is the kernel trick and when do you use it?
#SVM
#Mathematics
#Kernels
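The kernel trick can be checked numerically on a small case. For 2-D inputs, the degree-2 polynomial kernel k(x, z) = (x · z)² equals an ordinary dot product in a 3-D feature space, so the SVM never has to build that space explicitly. The vectors below are arbitrary examples.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit 3-D feature map for 2-D inputs that matches poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
z = (3.0, -1.0)

kernel_value = poly_kernel(x, z)                        # computed in 2-D
explicit_value = dot(feature_map(x), feature_map(z))    # computed in 3-D
print(kernel_value, explicit_value)
```

Both routes give the same number, but the kernel route costs one 2-D dot product. With an RBF kernel the implicit feature space is infinite-dimensional, so the explicit route is not even available.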
Data Scientist • Technical • medium
Explain L1 and L2 regularization. When would you use Lasso over Ridge regression in a predictive maintenance model?
#Regularization
#Regression
#Feature Selection
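The difference is easiest to see in the special case of an orthonormal design (XᵀX = I), where both penalties have closed-form solutions in terms of the ordinary least squares coefficient. The sketch assumes the objectives ½‖y − Xβ‖² + λ‖β‖₁ for Lasso and ½‖y − Xβ‖² + (λ/2)‖β‖² for Ridge; the coefficient values are invented.

```python
def lasso_coefficient(beta_ols, lam):
    """Soft-thresholding: shrink toward zero by lam, and clip at exactly zero."""
    shrunk = abs(beta_ols) - lam
    if shrunk <= 0:
        return 0.0
    return shrunk if beta_ols > 0 else -shrunk


def ridge_coefficient(beta_ols, lam):
    """Proportional shrinkage: coefficients get smaller but never reach zero."""
    return beta_ols / (1.0 + lam)


# Hypothetical OLS coefficients for four sensor features.
ols = [3.0, -0.4, 0.1, -2.0]
lam = 0.5

lasso = [lasso_coefficient(b, lam) for b in ols]
ridge = [ridge_coefficient(b, lam) for b in ols]
print("lasso:", lasso)
print("ridge:", [round(b, 3) for b in ridge])
```

Lasso zeroes out the two weak coefficients, which is the feature-selection behavior that makes it attractive when a predictive maintenance model has many sensor channels and only a few are expected to matter. Ridge keeps all four, which tends to suit correlated features better.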
Data Scientist • Technical • hard
How do you detect and handle data drift in a machine learning model deployed in production for a financial client?
#Data Drift
#Model Monitoring
#Evidently AI
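One widely used drift check, especially in financial services, is the Population Stability Index (PSI) between a reference sample and a production sample of the same feature. The sketch below is a simplified, dependency-free version with equal-width bins; the function name, the bin count, and the synthetic data are assumptions, and monitoring tools typically implement more robust variants.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a new sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    # Fall back to a unit width when the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped to the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # eps keeps the logarithm defined when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    e_shares = bin_shares(expected)
    a_shares = bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


reference = [i / 100 for i in range(100)]    # training-time feature values
shifted = [v + 0.3 for v in reference]       # production values after a shift

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, round(psi_shifted, 2))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but these cutoffs are conventions rather than statistical tests, so they should be calibrated per feature. Handling the drift (retraining, recalibration, or rollback) is a separate part of the answer.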
Data Scientist • Technical • medium
What is the difference between K-Means and Hierarchical clustering? How do you determine the optimal number of clusters?
#Clustering
#Unsupervised Learning
#Elbow Method
#Silhouette Score
Data Scientist • Technical • hard
Explain the architecture of a Transformer model. Why has it largely replaced RNNs and LSTMs in modern NLP tasks?
#Transformers
#Attention Mechanism
#NLP
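The core operation to explain is scaled dot-product attention, softmax(QKᵀ / √d_k) V. The sketch below implements only that step in plain Python for readability; it omits the learned projections, multiple heads, masking, and positional encodings of a real Transformer, and the toy vectors are arbitrary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Three toy tokens with 2-D representations; self-attention uses them as Q and K.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in one step, and each query row is independent of the others, so the loop can be replaced by a single matrix multiplication. That parallelism, plus direct access to distant tokens, is the usual argument against the step-by-step recurrence of RNNs and LSTMs.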
Data Scientist • Technical • hard
How do you fine-tune an open-source LLM like Llama-3 for a specific enterprise use case while minimizing compute costs?
#LoRA
#PEFT
#Fine-tuning
#Quantization
Data Scientist • Technical • medium
What are word embeddings? Compare traditional embeddings like Word2Vec with contextual embeddings like BERT.
#Embeddings
#Word2Vec
#BERT
Data Scientist • Technical • medium
How do you handle vanishing and exploding gradients in deep neural networks?
#Neural Networks
#Optimization
#Activation Functions
Data Scientist • Technical • easy
Explain the difference between a Star schema and a Snowflake schema in data warehousing.
#Data Modeling
#Warehousing
#Schema Design
Data Scientist • Technical • medium
How do you optimize a slow-running SQL query that joins multiple large tables in a client's database?
#Query Optimization
#Indexing
#Execution Plan
Data Scientist • Technical • medium
Walk me through how you would deploy a Scikit-learn model as a REST API using FastAPI and Dockerize it for deployment on Azure.
#Model Deployment
#FastAPI
#Docker
#Azure
Data Scientist • Technical • medium
How do you ensure data privacy and compliance, such as GDPR, when building predictive models using sensitive customer data?
#Data Privacy
#GDPR
#Anonymization
#PII
Data Scientist • Technical • medium
Explain Continuous Integration and Continuous Deployment (CI/CD) in the context of machine learning, including Continuous Training (CT).
#CI/CD
#Automation
#ML Pipelines
#Continuous Training
Data Scientist • Technical • easy
What is the ROC curve and AUC? How would you explain an AUC of 0.5 to a project manager?
#Evaluation Metrics
#ROC-AUC
#Classification
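AUC has a plain-language reading that suits a project manager: it is the probability that the model scores a randomly chosen positive case above a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are invented, and the pairwise loop is meant for illustration rather than large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]

perfect = roc_auc(y_true, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
uninformative = roc_auc(y_true, [0.5] * 6)   # same score for everyone
partial = roc_auc(y_true, [0.9, 0.4, 0.7, 0.5, 0.2, 0.1])
print(perfect, uninformative, round(partial, 3))
```

A model that gives everyone the same score lands at exactly 0.5, which is the coin-flip framing to use with a non-technical audience: at 0.5 the model is no better at ranking cases than guessing.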
Difficulty Radar
Based on recent AI-sourced data.
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer
Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.