Data Engineer • Behavioral • medium

Tell me about a time you simplified a complex data platform decision across multiple teams.

#Communication #Stakeholders

Data Engineer • Behavioral • medium

Describe a situation where a data pipeline you owned went down in production. How did you handle it?

#On-Call #Problem Solving

Data Engineer • Behavioral • medium

How do you handle disagreements with data analysts or scientists who want features that compromise pipeline reliability?

#Conflict Resolution

Data Engineer • Behavioral • medium

Tell me about a time you significantly improved the performance of a data system.

#Performance #Optimization

Data Engineer • Behavioral • hard

Describe how you've balanced technical debt vs. new feature development in a data platform.

#Prioritization

Data Engineer • Behavioral • medium

Tell me about a time you onboarded a new data source that had significant quality issues.

#Problem Solving

Data Engineer • Behavioral • easy

Describe your experience mentoring junior data engineers.

#Mentoring #Collaboration

Data Engineer • Behavioral • easy

How do you stay current with rapidly evolving data engineering tools and practices?

#Growth Mindset

Data Engineer • Behavioral • medium

Tell me about a time you had a disagreement with a cross-functional partner (like a Data Scientist or Product Manager) regarding the definition of a metric or a data pipeline requirement. How did you resolve it?

#Conflict Resolution #Communication #Cross-functional Collaboration

Data Engineer • Behavioral • medium

Tell me about a time you identified a major bottleneck or inefficiency in an existing data pipeline. What steps did you take to optimize it, and what was the impact?

#Impact #Proactivity #Optimization

Data Engineer • Coding • medium

Write a SQL query to find the second highest salary per department.

#Window Functions #SQL

Data Engineer • Coding • medium

Write a SQL query to compute a 7-day rolling average of daily sales.

#Window Functions #Analytics

Data Engineer • Coding • medium

Given a `user_logins` table with `user_id` and `login_date`, write a SQL query to calculate the 7-day rolling average of Daily Active Users (DAU) for the last 30 days.

#Window Functions #Rolling Averages #DAU

Data Engineer • Coding • easy

Given a `friend_requests` table (sender_id, receiver_id, date, status) and an `acceptances` table, write a SQL query to find the overall acceptance rate of friend requests by date.

#Joins #Aggregations #Ratios

Data Engineer • Coding • medium

Write a Python function to merge overlapping user session intervals. Given an array of intervals where intervals[i] = [start_i, end_i], merge all overlapping intervals and return an array of the non-overlapping intervals.

#Arrays #Sorting #Intervals
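
One possible approach to the interval-merging question above, offered as a minimal sketch rather than a reference answer: sort by start time, then extend the last merged interval whenever the next one begins at or before its end. The function name `merge_intervals` is hypothetical, and treating touching intervals such as [1, 4] and [4, 5] as overlapping is an assumption the question does not settle.

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] intervals and return a new list.

    Assumption: intervals that merely touch (end == next start) are merged.
    """
    merged = []
    # Sorting by start time means each interval only needs to be compared
    # with the most recently merged one.
    for start, end in sorted(intervals, key=lambda iv: iv[0]):
        if merged and start <= merged[-1][1]:
            # Overlap: stretch the last merged interval if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap: start a new merged interval (a fresh list,
            # so the caller's input is not mutated).
            merged.append([start, end])
    return merged
```

The sort dominates the cost, so this runs in O(n log n) time with O(n) extra space for the output.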

Data Engineer • Coding • medium

Given a list of dictionaries representing Facebook post interactions (user_id, post_id, interaction_type, timestamp), write a Python script to return the top 3 most engaged posts for each interaction type.

#Dictionaries #Heaps #Data Aggregation
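
A hedged sketch for the post-interaction question above. The question does not define "most engaged", so this assumes it means the number of interactions of that type a post received; the function name `top_posts_by_type` and the tie-breaking rule (smaller `post_id` first) are also assumptions made for illustration.

```python
import heapq
from collections import Counter, defaultdict


def top_posts_by_type(interactions, k=3):
    """Return {interaction_type: [post_id, ...]} with the k most engaged posts.

    Assumption: engagement is the count of interactions of that type per post.
    Ties are broken by the smaller post_id so the output is deterministic.
    """
    counts = defaultdict(Counter)
    for row in interactions:
        counts[row["interaction_type"]][row["post_id"]] += 1

    result = {}
    for itype, per_post in counts.items():
        # nsmallest on (-count, post_id) selects the highest counts first
        # without fully sorting every post.
        best = heapq.nsmallest(k, per_post.items(), key=lambda kv: (-kv[1], kv[0]))
        result[itype] = [post_id for post_id, _ in best]
    return result
```

The counting pass is linear in the number of interactions, and the heap selection avoids a full sort when the number of distinct posts is much larger than `k`.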

Data Engineer • Coding • medium

Write a SQL query to find users who have interacted with a Meta ad and subsequently made a purchase on the advertiser's website within 24 hours. You have an `ad_clicks` table and a `conversions` table.

#Joins #Date/Time Functions #Attribution

Data Engineer • Coding • easy

Given an array of integers representing the number of likes on a user's posts, write a Python function to move all zeros (posts with zero likes) to the end of the array while maintaining the relative order of the non-zero elements. Do this in-place.

#Arrays #Two Pointers #In-place Manipulation
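
For the zero-likes question above, a two-pointer pass is one common way to meet the in-place requirement. This is a sketch under that reading; the name `move_zeros` is hypothetical.

```python
def move_zeros(likes):
    """Move all zeros to the end of `likes` in place, keeping non-zero order."""
    write = 0  # index where the next non-zero value belongs
    for read in range(len(likes)):
        if likes[read] != 0:
            # Swap the non-zero value forward; any zero at `write`
            # is pushed toward the end of the list.
            likes[write], likes[read] = likes[read], likes[write]
            write += 1
```

The list is modified directly and nothing is returned, which mirrors the in-place constraint. The pass takes O(n) time and O(1) extra space.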

Data Engineer • Coding • hard

Write a SQL query to calculate the retention rate of new users on a 1-day, 7-day, and 30-day basis. You are given a `user_activity` table with `user_id` and `activity_date`.

#Cohort Analysis #Retention #Self Joins #Conditional Aggregation

Data Engineer • Coding • medium

Write a Python script to parse a massive JSONL file (100GB+) containing WhatsApp message metadata. Calculate the total number of messages sent per country code. You cannot load the entire file into memory.

#File I/O #Memory Management #Generators #JSON
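
A minimal sketch for the JSONL question above: iterate over the file one line at a time so that only a single record and the running counts are held in memory. The field name `country_code`, the function name `count_messages_by_country`, and the choice to skip malformed lines silently are all assumptions; the question does not specify the record schema or the error-handling policy.

```python
import json
from collections import Counter


def count_messages_by_country(path, field="country_code"):
    """Stream a JSONL file and count records per country code.

    Assumptions: one JSON object per line, with the country code stored
    under `field` (a hypothetical name). Malformed lines are skipped.
    """
    counts = Counter()
    # Iterating over the file object reads lazily, line by line,
    # so memory use does not grow with file size.
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            if not isinstance(record, dict):
                continue  # skip valid JSON that is not an object
            code = record.get(field)
            if code is not None:
                counts[code] += 1
    return counts
```

Memory use is bounded by the number of distinct country codes, not by the size of the file. In an interview it may be worth mentioning that a production version would probably count or log the skipped lines instead of dropping them quietly.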

Data Engineer • System Design • hard

Design an ETL pipeline that ingests 10TB of raw clickstream data daily.

#ETL #Batch Processing

Data Engineer • System Design • hard

How would you design a data pipeline that needs exactly-once delivery guarantees?

#Exactly-Once #Kafka

Data Engineer • System Design • hard

How would you design a real-time anomaly detection pipeline for 100K events/sec?

#Real-Time #Anomaly Detection

Data Engineer • System Design • hard

Design a data model for an e-commerce platform tracking orders, users, and products.

#ER Modeling #Dimensional Modeling

Data Engineer • System Design • hard

How would you design a data warehouse for a ride-sharing company from scratch?

#Architecture #Design

Data Engineer • System Design • hard

How would you design Meta's data pipeline for News Feed ranking signals?

#Ranking #Pipeline

Data Engineer • System Design • hard

Design an ad delivery data pipeline that tracks impressions at 10M/sec.

#Streaming #Scale

Data Engineer • System Design • hard

Design a data pipeline to process and store telemetry data for Instagram Reels. The pipeline needs to support real-time dashboarding for creators and batch processing for machine learning recommendations.

#Lambda Architecture #Kafka #Stream Processing #Data Warehousing

Data Engineer • System Design • hard

Design a system to detect ad-click fraud in real time. The system processes billions of events per day and needs to flag suspicious IPs or user accounts within seconds.

#Real-time Processing #Fraud Detection #Distributed Systems #Caching

Data Engineer • Technical • medium

Describe how you'd implement circuit breakers in a data pipeline.

#Circuit Breakers #Fault Tolerance

Data Engineer • Technical • medium

Explain the difference between OLAP and OLTP systems. When would you use each?

#OLAP #OLTP #Databases

Data Engineer • Technical • hard

What is a slowly changing dimension (SCD)? Describe SCD Type 1, 2, and 3 with examples.

#SCD #Dimensional Modeling

Data Engineer • Technical • hard

How would you optimize a SQL query that is running slowly on a 1 billion row table?

#Query Optimization #Indexing

Data Engineer • Technical • medium

Explain the difference between RANK(), DENSE_RANK(), and ROW_NUMBER().

#Window Functions #SQL

Data Engineer • Technical • medium

What is a materialized view? How does it differ from a regular view?

#Materialized Views #Performance

Data Engineer • Technical • hard

Describe partitioning strategies in a data warehouse. When would you use range vs hash partitioning?

#Partitioning #Performance

Data Engineer • Technical • medium

What are CTEs (Common Table Expressions) and how do they differ from subqueries?

#CTEs #SQL

Data Engineer • Technical • medium

Explain ACID properties. Which databases sacrifice ACID for performance and why?

#ACID #Distributed Systems

Data Engineer • Technical • hard

How do you handle late-arriving data in a streaming pipeline?

#Kafka #Watermarks

Data Engineer • Technical • medium

What is idempotency and why is it critical in data pipelines?

#Idempotency #Data Quality

Data Engineer • Technical • hard

Explain the Lambda architecture. What are its tradeoffs vs Kappa architecture?

#Lambda #Kappa #Streaming

Data Engineer • Technical • hard

What is backfilling? How do you handle a backfill of 2 years of historical data without impacting production?

#Backfill #Airflow

Data Engineer • Technical • medium

How do you monitor data pipeline health in production? What metrics do you track?

#Monitoring #Alerting

Data Engineer • Technical • medium

What is Apache Airflow? How does it differ from Prefect or Dagster?

#Airflow #Prefect #Dagster

Data Engineer • Technical • easy

Explain the difference between push-based and pull-based data ingestion.

#Push #Pull #CDC

Data Engineer • Technical • hard

Explain how Apache Spark's execution model works. What is a DAG in Spark?

#Spark #DAG #Distributed Computing

Data Engineer • Technical • hard

What is data skew in Spark? How do you diagnose and fix it?

#Data Skew #Performance

Data Engineer • Technical • hard

Explain the difference between map-side and reduce-side joins in MapReduce/Spark.

#Joins #MapReduce

Data Engineer • Technical • medium

What is Apache Kafka? Explain topics, partitions, consumer groups, and offsets.

#Kafka #Streaming

Data Engineer • Technical • medium

How does Kafka handle message ordering guarantees?

#Ordering #Partitions

Data Engineer • Technical • medium

What is the CAP theorem? Give an example of a real-world system tradeoff.

#CAP #Consistency #Availability

Data Engineer • Technical • medium

Explain how Parquet and ORC file formats work and when you'd use each.

#Parquet #ORC #Columnar

Data Engineer • Technical • hard

What is Delta Lake? How does it provide ACID transactions on data lakes?

#Delta Lake #ACID #Time Travel

Data Engineer • Technical • medium

Explain compaction in Delta Lake / Iceberg. Why is it important?

#Compaction #Performance

Data Engineer • Technical • medium

What is the difference between a star schema and a snowflake schema? When would you use each?

#Star Schema #Snowflake Schema

Data Engineer • Technical • hard

What is Data Vault methodology? How does it differ from Kimball?

#Data Vault #Kimball

Data Engineer • Technical • medium

Explain the concept of a data lakehouse. What are its advantages over a traditional data warehouse?

#Data Lakehouse #Data Warehouse

Data Engineer • Technical • hard

How do you handle schema evolution in a data pipeline without breaking downstream consumers?

#Schema Evolution #Backward Compatibility

Data Engineer • Technical • medium

What is a medallion architecture (Bronze/Silver/Gold)?

#Medallion #Data Lake

Data Engineer • Technical • medium

How do you implement data quality checks in a production pipeline?

#Great Expectations #Data Validation

Data Engineer • Technical • medium

What is data lineage and why is it important? How do you implement it?

#Lineage #Metadata

Data Engineer • Technical • hard

How would you detect and handle data drift in a production system?

#Data Drift #Monitoring

Data Engineer • Technical • medium

What is PII (Personally Identifiable Information) and how do you handle it in a data pipeline?

#PII #Privacy #Compliance

Data Engineer • Technical • medium

Explain the concept of a data catalog. What tools have you used?

#Data Catalog #Metadata

Data Engineer • Technical • hard

Compare Amazon Redshift, Google BigQuery, and Snowflake for a petabyte-scale warehouse.

#Redshift #BigQuery #Snowflake

Data Engineer • Technical • hard

How does BigQuery handle large joins efficiently? What is its columnar storage approach?

#BigQuery #Columnar Storage

Data Engineer • Technical • medium

Explain the difference between S3, HDFS, and GCS for data storage.

#S3 #HDFS #GCS

Data Engineer • Technical • medium

How would you reduce costs in a cloud-based data platform?

#Cloud #Cost

Data Engineer • Technical • medium

What is infrastructure as code (IaC)? Have you used Terraform for data infrastructure?

#Terraform #IaC

Data Engineer • Technical • hard

What is Presto? How does Meta use it at scale?

#Presto #SQL

Data Engineer • Technical • hard

Explain how Meta uses Scribe for structured logging at petabyte scale.

#Scribe #Infrastructure

Data Engineer • Technical • hard

How would you handle data consistency across Meta's globally sharded MySQL deployment?

#Sharding #Consistency

Data Engineer • Technical • medium

Design the data model for Facebook Marketplace. We need to track users, product listings, categories, and transactions. How would you structure the fact and dimension tables to allow product managers to analyze daily sales volume by category and user demographics?

#Dimensional Modeling #Star Schema #Fact Tables #Dimension Tables

Data Engineer • Technical • hard

How would you handle late-arriving data in a daily ETL pipeline that computes Facebook's Daily Active Users (DAU)? Assume the pipeline runs at 2 AM UTC, but mobile clients might upload offline logs days later.

#ETL #Late-Arriving Data #Idempotency #Backfilling

Data Engineer • Technical • medium

How do you design a Slowly Changing Dimension (SCD) Type 2 table for Facebook user profiles? Explain how you would handle updates to a user's 'current_city' while preserving the history of their previous locations.

#SCD Type 2 #Data Warehousing #Historical Tracking

Meta

The Interview Loop

Recruiter Screen (30 min)

Technical Loop (3-4 Rounds)
