Capgemini
Global leader in partnering with companies to transform and manage their business by harnessing the power of technology.
4 Rounds
~21 Days
Medium
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Scientist • Behavioral • medium
Tell me about a time you had to explain a complex machine learning model's predictions to a non-technical client stakeholder.
#Stakeholder Management
#Model Interpretability
#Consulting
Data Scientist • Behavioral • medium
Describe a situation where a client provided very messy, undocumented, or incomplete data. How did you proceed?
#Problem Solving
#Client Handling
#Data Quality
Data Scientist • Behavioral • medium
Explain the concept of a p-value to a business stakeholder who has no statistical background.
#Statistics
#Stakeholder Management
#A/B Testing
Data Scientist • Behavioral • medium
Tell me about a time you disagreed with a team member or a lead on the choice of an algorithm or architecture. How did you resolve it?
#Conflict Resolution
#Teamwork
#Decision Making
Data Scientist • Behavioral • easy
How do you prioritize tasks when working on multiple client deliverables with tight deadlines?
#Prioritization
#Consulting
#Agile
Data Scientist • Behavioral • hard
Describe a time when a model you built failed in production or didn't meet client expectations. What did you learn?
#Failure
#Continuous Improvement
#Production ML
Data Scientist • Coding • medium
Write a SQL query using window functions to calculate the 7-day rolling average of sales for each product category.
#SQL
#Window Functions
#Data Aggregation
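One way to approach the rolling-average question is sketched below, using Python's built-in sqlite3 module so the query can be run locally. The table `daily_category_sales` and its columns are hypothetical. The sketch assumes one pre-aggregated row per category per calendar day with no gaps, and SQLite 3.25 or newer for window-function support.

```python
import sqlite3

# Hypothetical schema: one row per category per day, already aggregated.
ROLLING_AVG_SQL = """
SELECT category,
       sale_date,
       AVG(daily_sales) OVER (
           PARTITION BY category
           ORDER BY sale_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_avg
FROM daily_category_sales
ORDER BY category, sale_date
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_category_sales "
    "(category TEXT, sale_date TEXT, daily_sales REAL)"
)
# Eight days of made-up sales for one category: 1.0, 2.0, ..., 8.0
rows = [("toys", f"2024-01-{day:02d}", float(day)) for day in range(1, 9)]
conn.executemany("INSERT INTO daily_category_sales VALUES (?, ?, ?)", rows)

result = conn.execute(ROLLING_AVG_SQL).fetchall()
for row in result:
    print(row)
```

The frame `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` counts rows, not calendar days. If some days can be missing, a date spine or a date-based `RANGE` frame (where the dialect supports one) would be needed, and that point is worth raising with the interviewer.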
Data Scientist • Coding • easy
Write a Python function using Pandas to merge two datasets on a common key, and explain how you would handle missing values in the resulting DataFrame.
#Pandas
#Data Manipulation
#Data Cleaning
Data Scientist • Coding • medium
Write a SQL query to find the second highest salary by department without using the LIMIT keyword.
#SQL
#Subqueries
#Window Functions
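A possible answer uses `DENSE_RANK()` partitioned by department, so no `LIMIT` is needed. The sketch below runs it through Python's sqlite3 module. The `employees` table, its columns, and the sample rows are all made up for illustration, and SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical schema: employees(department, employee, salary).
SECOND_HIGHEST_SQL = """
SELECT department, employee, salary
FROM (
    SELECT department, employee, salary,
           DENSE_RANK() OVER (
               PARTITION BY department
               ORDER BY salary DESC
           ) AS salary_rank
    FROM employees
) AS ranked
WHERE salary_rank = 2
ORDER BY department, employee
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (department TEXT, employee TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("eng", "ann", 120), ("eng", "bob", 120),  # tie for the top salary
        ("eng", "cat", 100), ("eng", "dan", 90),
        ("ops", "eve", 80), ("ops", "fay", 70),
        ("hr", "gus", 60),                          # only one salary in hr
    ],
)

result = conn.execute(SECOND_HIGHEST_SQL).fetchall()
print(result)  # → [('eng', 'cat', 100), ('ops', 'fay', 70)]
```

`DENSE_RANK()` is chosen over `RANK()` here because a tie at the top would make `RANK()` skip rank 2 entirely. Departments with a single distinct salary return no row in this sketch.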
Data Scientist • Coding • medium
Given a list of strings, write a Python program to group anagrams together.
#Python
#Hash Maps
#String Manipulation
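A minimal sketch of the hash-map approach the tags hint at: every word is keyed by its sorted letters, so anagrams collide on the same key. The function name `group_anagrams` is illustrative only.

```python
from collections import defaultdict


def group_anagrams(words):
    """Group words that are anagrams of each other, in first-seen order."""
    groups = defaultdict(list)
    for word in words:
        # Anagrams share the same multiset of letters, hence the same sorted key.
        key = "".join(sorted(word))
        groups[key].append(word)
    return list(groups.values())


print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# → [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
```

Sorting each word costs O(k log k) for a word of length k. A letter-count tuple could serve as the key instead if the alphabet is small and fixed.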
Data Scientist • Coding • medium
Write a Python script to scrape data from a paginated REST API, handle rate limits, and store the results in a SQL database.
#API Integration
#Data Engineering
#Python
Data Scientist • Coding • hard
Implement a Python function to calculate the TF-IDF scores for a given corpus of documents from scratch (without using scikit-learn).
#Python
#NLP
#Math Implementation
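The sketch below shows one of several TF-IDF variants, using only the standard library. It assumes documents are already tokenized, term frequency is the count divided by document length, and inverse document frequency is the unsmoothed log(N / df). Libraries such as scikit-learn use a smoothed formula by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(corpus)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
scores = tf_idf(corpus)
print(scores[1])
```

With this variant, a term that appears in every document (here "the") scores exactly zero, which is a useful property to state aloud when explaining the formula.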
Data Scientist • Coding • hard
Write a SQL query to find users who have logged into an application on 3 consecutive days.
#SQL
#Advanced Window Functions
#Date Manipulation
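One common pattern for this question is "gaps and islands": subtracting a per-user row number from the login date gives a constant for each run of consecutive days. The sketch below runs it in SQLite via Python's sqlite3 module. The `logins` table and its columns are hypothetical, "3 consecutive days" is read as "at least 3", and `julianday` is SQLite-specific, so other dialects would use their own date arithmetic.

```python
import sqlite3

# Hypothetical schema: logins(user_id, login_date as 'YYYY-MM-DD').
CONSECUTIVE_SQL = """
WITH distinct_logins AS (
    SELECT DISTINCT user_id, login_date FROM logins
),
grouped AS (
    SELECT user_id,
           login_date,
           julianday(login_date)
             - ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY login_date
               ) AS island
    FROM distinct_logins
)
SELECT DISTINCT user_id
FROM grouped
GROUP BY user_id, island
HAVING COUNT(*) >= 3
ORDER BY user_id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),  # 3-day run
        (2, "2024-01-01"), (2, "2024-01-02"),                     # 2-day run
        (2, "2024-01-04"), (2, "2024-01-05"),                     # 2-day run
        (3, "2024-01-01"), (3, "2024-01-01"), (3, "2024-01-02"),  # duplicate day
    ],
)

result = conn.execute(CONSECUTIVE_SQL).fetchall()
print(result)  # → [(1,)]
```

De-duplicating logins first matters: without it, two logins on the same day would break the row-number arithmetic.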
Data Scientist • Coding • medium
Write a Python function to find the longest palindromic substring in a given string.
#Python
#Dynamic Programming
#String Manipulation
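The question is tagged with dynamic programming, but the sketch below deliberately uses the expand-around-center technique instead, which has the same O(n²) time with O(1) extra space. When several palindromes tie for longest, this version returns the first one found.

```python
def longest_palindrome(s):
    """Return the longest palindromic substring of s (first one on ties)."""
    if not s:
        return ""
    best_start, best_len = 0, 1
    for center in range(len(s)):
        # Try an odd-length center (i, i) and an even-length center (i, i + 1).
        for left, right in ((center, center), (center, center + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            # The loop overshoots by one on each side.
            length = right - left - 1
            if length > best_len:
                best_start, best_len = left + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome("babad"))  # → bab
print(longest_palindrome("cbbd"))   # → bb
```

A DP table over (start, end) pairs gives the same answer with O(n²) memory. Mentioning both and explaining the trade-off fits the "think out loud" advice.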
Data Scientist • System Design • hard
Design an end-to-end architecture for deploying a churn prediction model on Azure for a telecommunications client.
#Azure
#MLOps
#Model Deployment
Data Scientist • System Design • hard
Design a recommendation engine for an e-commerce client. What data would you need, and what algorithms would you use?
#Recommendation Systems
#Collaborative Filtering
#System Architecture
Data Scientist • System Design • medium
How would you design a system to automatically classify and route incoming customer support emails using NLP?
#NLP
#Text Classification
#System Architecture
Data Scientist • System Design • hard
Design a fraud detection system for real-time credit card transactions. Focus on the latency requirements and feature store architecture.
#Real-time Processing
#Fraud Detection
#Feature Store
#Streaming
Data Scientist • Technical • easy
Explain the difference between Bagging and Boosting. Give an example of an algorithm for each.
#Ensemble Methods
#Random Forest
#XGBoost
Data Scientist • Technical • medium
How do you handle highly imbalanced datasets in a fraud detection project? What metrics would you use to evaluate your model?
#Imbalanced Data
#SMOTE
#Evaluation Metrics
Data Scientist • Technical • medium
Explain the architecture of a Retrieval-Augmented Generation (RAG) system. Why is it preferred over fine-tuning for certain enterprise use cases?
#NLP
#LLMs
#RAG
#Vector Databases
Data Scientist • Technical • medium
What is the curse of dimensionality, and how does Principal Component Analysis (PCA) help mitigate it?
#Dimensionality Reduction
#PCA
#Feature Engineering
Data Scientist • Technical • medium
How do you evaluate a clustering model when ground truth labels are not available?
#Unsupervised Learning
#Clustering
#Evaluation Metrics
Data Scientist • Technical • easy
What are the key differences between L1 (Lasso) and L2 (Ridge) regularization? When would you use one over the other?
#Regularization
#Linear Models
#Feature Selection
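A small numeric illustration can make the L1 versus L2 contrast concrete. The sketch below assumes the special case of an orthonormal design matrix (XᵀX = I), with objectives ½‖y − Xw‖² + λ‖w‖₁ for lasso and ½‖y − Xw‖² + (λ/2)‖w‖₂² for ridge. Under those assumptions each coefficient has a closed form in terms of its least-squares estimate. The function names are illustrative.

```python
import math


def ridge_shrink(w_ols, lam):
    """Ridge under an orthonormal design: every coefficient is scaled down."""
    return [w / (1 + lam) for w in w_ols]


def lasso_shrink(w_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding toward zero."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in w_ols]


w_ols = [3.0, -0.5, 0.2]  # made-up least-squares coefficients
print(ridge_shrink(w_ols, 1.0))
print(lasso_shrink(w_ols, 1.0))
```

In this setting ridge shrinks every coefficient proportionally but never to exactly zero, while lasso zeroes out any coefficient whose magnitude is below λ. That is why lasso is often described as performing feature selection.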
Data Scientist • Technical • hard
How do you detect and handle data drift in a production machine learning model?
#Model Monitoring
#Data Drift
#Production ML
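One drift signal that often comes up in answers is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against production. The sketch below is a minimal version under stated assumptions: equal-width bins taken from the training sample, out-of-range production values clamped into the edge bins, and a small floor to avoid log(0). Function and parameter names are illustrative.

```python
import math


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in train]
print(population_stability_index(train, train))    # → 0.0
print(population_stability_index(train, shifted))
```

A commonly quoted rule of thumb treats PSI above roughly 0.2 to 0.25 as a shift worth investigating, though the threshold is a convention rather than a statistical guarantee. Handling drift (retraining, recalibration, alerting) is a separate step from detecting it.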
Data Scientist • Technical • hard
How does the self-attention mechanism work in Transformer models?
#NLP
#Transformers
#Attention Mechanism
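A compact way to explain self-attention is to walk through single-head scaled dot-product attention, softmax(QKᵀ / √d_k)·V. The sketch below uses plain Python lists to stay dependency-free. It omits multi-head splitting, masking, and positional encodings, and the weight matrices are placeholders rather than trained values.

```python
import math


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """x: (tokens x d_model). Returns (output, attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # similarity of every token with every other token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]  # two made-up token embeddings
output, weights = self_attention(tokens, identity, identity, identity)
print(weights)
```

Each row of `weights` sums to 1 and says how much that token draws from every token in the sequence, including itself. The output row is the corresponding weighted average of the value vectors.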
Data Scientist • Technical • easy
Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() with a practical example.
#SQL
#Window Functions
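The practical example this question asks for could look like the sketch below, which ranks a tiny made-up `scores` table with all three functions side by side. It runs through Python's sqlite3 module and assumes SQLite 3.25 or newer. The tie-breaker on `name` in `ROW_NUMBER()` is added only to make the output deterministic.

```python
import sqlite3

RANKING_SQL = """
SELECT name, score,
       ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
       RANK()       OVER (ORDER BY score DESC)       AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
FROM scores
ORDER BY score DESC, name
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ann", 90), ("bob", 90), ("cat", 80), ("dan", 70)],
)

result = conn.execute(RANKING_SQL).fetchall()
for row in result:
    print(row)
# → ('ann', 90, 1, 1, 1)
# → ('bob', 90, 2, 1, 1)
# → ('cat', 80, 3, 3, 2)
# → ('dan', 70, 4, 4, 3)
```

The tie at 90 shows the difference: `ROW_NUMBER()` always increments, `RANK()` leaves a gap after ties (1, 1, 3), and `DENSE_RANK()` does not (1, 1, 2).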
Data Scientist • Technical • medium
Why would you choose XGBoost over a Random Forest for a tabular dataset?
#XGBoost
#Random Forest
#Model Selection
Data Scientist • Technical • hard
What are the trade-offs between fine-tuning an open-source LLM (like Llama 3) versus using a prompt-engineered proprietary API (like OpenAI GPT-4)?
#LLMs
#Fine-tuning
#Prompt Engineering
#Cloud Architecture
Data Scientist • Technical • medium
Explain the ROC curve and AUC. When would you use Precision-Recall AUC instead of ROC-AUC?
#Evaluation Metrics
#Classification
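To ground the ROC-AUC versus PR-AUC comparison, the sketch below computes both from scratch. ROC-AUC is computed as the probability that a random positive outscores a random negative (ties count half). The precision-recall side is summarized with average precision, one common estimate of PR-AUC. The tie handling in `average_precision` is simplistic and may differ from library implementations.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))            # → 0.75
print(average_precision(labels, scores))
```

ROC-AUC depends only on how positives rank against negatives, so a large pool of easy negatives can keep it high. Precision is computed over everything flagged, which is why the precision-recall view tends to be more informative when positives are rare.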
Data Scientist • Technical • easy
What is the difference between batch inference and real-time inference? Give a Capgemini-style consulting use case for each.
#Model Deployment
#Inference
#Architecture
Data Scientist • Technical • medium
What is A/B testing, and how do you determine the required sample size for an experiment?
#A/B Testing
#Hypothesis Testing
#Statistical Significance
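For the sample-size part, a frequently used approximation for comparing two conversion rates is n per group ≈ (z₁₋α/₂ + z₁₋β)² · [p₁(1 − p₁) + p₂(1 − p₂)] / (p₂ − p₁)². The sketch below implements that formula with the standard library. It assumes a two-sided test, equal group sizes, and the normal approximation. The function name and default values are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_baseline vs p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    effect = p_variant - p_baseline
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Made-up example: detect a lift from 10% to 12% conversion.
n = sample_size_per_group(0.10, 0.12)
print(n)
```

The required sample grows with the inverse square of the effect size, so halving the minimum detectable lift roughly quadruples the traffic needed. That relationship is usually the key point to convey to a stakeholder.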
Data Scientist • Technical • medium
How does a Support Vector Machine (SVM) handle non-linear data?
#SVM
#Kernel Trick
#Math
Data Scientist • Technical • medium
What is target leakage in machine learning, and how do you prevent it during feature engineering?
#Data Leakage
#Feature Engineering
#Model Validation
Data Scientist • Technical • medium
Explain the concept of Word2Vec. What is the difference between the CBOW and Skip-gram architectures?
#NLP
#Word Embeddings
#Word2Vec
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer
Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.