Anthropic
AI safety and research company behind Claude, known for its work on Constitutional AI.
5 Rounds
~20 Days
Very Hard
The Interview Loop
Recruiter Screen (30 min)
Standard fit check, behavioral questions, and resume overview.
Technical Loop (3-4 Rounds)
Deep dive into domain knowledge, coding, and system design.
Interview Question Bank
Data Scientist
•
Behavioral
•
medium
Tell me about a time you had to trade off model performance or project velocity for safety, fairness, or rigorous evaluation.
#AI Safety
#Ethics
#Decision Making
Data Scientist
•
Behavioral
•
medium
Anthropic highly values Constitutional AI. How would you handle a situation where a Product Manager wants to push a feature that significantly increases user engagement but slightly degrades our core alignment metrics?
#Alignment
#Stakeholder Management
#Product Strategy
Data Scientist
•
Behavioral
•
medium
Tell me about a time you discovered a significant flaw in your own data analysis after you had already presented the results to stakeholders. How did you handle it?
#Integrity
#Communication
#Mistakes
Data Scientist
•
Behavioral
•
hard
Anthropic highly values safety. Describe a situation where you had to push back against a product launch or feature because of safety, privacy, or data quality concerns.
#Safety
#Conflict Resolution
#Values
Data Scientist
•
Behavioral
•
easy
Tell me about a time you had to communicate a complex statistical concept to a non-technical stakeholder, such as a policy expert or product manager.
#Communication
#Cross-functional
Data Scientist
•
Behavioral
•
medium
Describe a project where you had to work with highly ambiguous requirements and define the success metrics from scratch.
#Ambiguity
#Initiative
#Metric Design
Data Scientist
•
Behavioral
•
medium
How do you prioritize your research or analysis tasks when faced with multiple urgent requests from different model training and product teams?
#Time Management
#Prioritization
#Stakeholder Management
Data Scientist
•
Behavioral
•
medium
Tell me about a time you disagreed with a senior researcher or engineer about the interpretation of an A/B test result or model evaluation.
#Conflict Resolution
#Data-Driven
#Collaboration
Data Scientist
•
Behavioral
•
easy
Why Anthropic? What specifically about our approach to AI alignment, Constitutional AI, and safety resonates with your career goals?
#Motivation
#Company Knowledge
#Alignment
Data Scientist
•
Behavioral
•
easy
Describe a time you automated a tedious data process or evaluation pipeline that saved your team significant time.
#Impact
#Automation
#Engineering Best Practices
Data Scientist
•
Behavioral
•
easy
Tell me about a time you had to quickly learn a new machine learning framework, statistical method, or domain to solve a pressing problem.
#Adaptability
#Continuous Learning
#Problem Solving
Data Scientist
•
Behavioral
•
easy
Explain the concept of a p-value and a confidence interval to a non-technical product manager who wants to launch a new feature immediately.
#Statistics
#Stakeholder Management
Data Scientist
•
Behavioral
•
medium
Tell me about a time you had to push back on a product launch or feature release due to data quality or safety concerns.
#Integrity
#Communication
#Conflict Resolution
Data Scientist
•
Behavioral
•
medium
Describe a situation where you discovered a critical flaw in your own analysis after it had already been shared with stakeholders. What did you do?
#Accountability
#Intellectual Honesty
Data Scientist
•
Behavioral
•
medium
How do you prioritize which research directions or metrics to focus on when evaluating open-ended model capabilities?
#Prioritization
#Ambiguity
#Research Strategy
Data Scientist
•
Behavioral
•
easy
Tell me about a time you had to communicate a highly complex statistical or machine learning concept to a group of software engineers.
#Cross-functional Collaboration
#Communication
Data Scientist
•
Behavioral
•
easy
Why do you want to work at Anthropic specifically, as opposed to other AI research labs like OpenAI, DeepMind, or Meta?
#Motivation
#Company Knowledge
Data Scientist
•
Coding
•
medium
Implement a function in Python to calculate the Elo rating update for two LLMs given a human preference rating (win, loss, or tie).
#Python
#Math
#Algorithms
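A minimal sketch of the Elo update for this question. The K-factor of 32 and the 400-point logistic scale are conventional defaults, not values given in the prompt:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Update Elo ratings for models A and B after one comparison.

    outcome: 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    Returns the new (r_a, r_b) pair.
    """
    # Expected score of A under the standard logistic Elo model
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    expected_b = 1 - expected_a
    new_a = r_a + k * (outcome - expected_a)
    new_b = r_b + k * ((1 - outcome) - expected_b)
    return new_a, new_b
```

For two equally rated models, a win moves each rating by exactly K/2, which is a useful sanity check to mention in the interview.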
Data Scientist
•
Coding
•
medium
Write a Python function to efficiently deduplicate a massive dataset of text documents (billions of tokens) prior to model pre-training. What algorithmic approach would you use?
#Python
#Data Deduplication
#MinHash
#LSH
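The expected answer is usually MinHash plus locality-sensitive hashing so that near-duplicates can be found without all-pairs comparison. A toy sketch of the MinHash half, assuming word-level 5-gram shingles (the shingle size and 64-hash signature length are illustrative choices):

```python
import hashlib

def shingles(text, k=5):
    # Word-level k-shingles; short texts fall back to one shingle.
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """One min-hash per seed; seed-prefixing simulates a family of hash functions."""
    sig = []
    for seed in range(num_hashes):
        best = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        sig.append(best)
    return sig

def jaccard_estimate(sig_a, sig_b):
    # Fraction of matching signature slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

At billions of documents you would band the signatures into LSH buckets (e.g. via Spark) so only documents sharing a bucket are compared.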
Data Scientist
•
Coding
•
medium
Given a table `claude_generations` with columns `user_id`, `prompt_length`, `generation_time_ms`, and `timestamp`, write a SQL query to calculate the 95th percentile latency for each user tier (join with `users` table) over the last 30 days.
#Window Functions
#Percentiles
#Performance Metrics
Data Scientist
•
Coding
•
easy
Write a Python script using Pandas to sample a stratified subset of 10,000 conversational logs, ensuring a balanced distribution across 5 different safety violation categories, while prioritizing longer conversations.
#Stratified Sampling
#Pandas
#Data Preparation
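One way to sketch this in Pandas, assuming hypothetical column names `violation_category` and `conversation_length`; "prioritizing longer conversations" is interpreted here as taking the longest logs within each stratum:

```python
import pandas as pd

def stratified_sample(df, n_total=10_000, cat_col="violation_category",
                      len_col="conversation_length"):
    """Balanced sample across categories, preferring longer conversations.

    Takes the top n_total/k longest rows per category (k = number of
    categories). Column names are assumptions about the log schema.
    """
    per_cat = n_total // df[cat_col].nunique()
    return (
        df.sort_values(len_col, ascending=False)
          .groupby(cat_col, group_keys=False)
          .head(per_cat)                # longest rows within each category
    )
```

If strict length prioritization biases the sample, a weighted `sample(weights=...)` per group is a softer alternative worth raising.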
Data Scientist
•
Coding
•
hard
Implement an algorithm to find the longest common substring between two large text prompts. We use this to identify potential prompt injection templates spreading among users.
#Dynamic Programming
#String Manipulation
#Security
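A standard dynamic-programming sketch for this one, O(len(a) x len(b)) time with O(len(b)) memory by keeping only the previous DP row:

```python
def longest_common_substring(a, b):
    """dp[j] = length of the common suffix ending at a[i-1] and b[j-1]."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len: best_end]
```

For the stated use case (injection templates spreading across many users), it is worth noting that a suffix automaton or generalized suffix tree gets this to linear time if the quadratic DP becomes a bottleneck.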
Data Scientist
•
Coding
•
medium
Given a table `user_interactions`, write a SQL query to find all users who have triggered the safety filter (`is_blocked = TRUE`) more than 3 times within any rolling 24-hour window.
#Rolling Windows
#Self Joins
#Anomaly Detection
Data Scientist
•
Coding
•
easy
Write a Python function to parse a large JSONL file of Claude's interaction logs and calculate the average response length in tokens for each prompt category.
#Python
#JSON
#Data Processing
Data Scientist
•
Coding
•
medium
Write a SQL query to calculate the rolling 7-day average of human preference win-rates for Claude 3 versus Claude 2, partitioned by the evaluation domain.
#SQL
#Window Functions
#Time Series
Data Scientist
•
Coding
•
hard
Given a dataset of prompt-response pairs with boolean safety violation flags from human annotators and a classifier's probability scores, write a script to compute the ROC-AUC score from scratch.
#Python
#ML Metrics
#Algorithms
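One clean from-scratch approach uses the rank-sum (Mann-Whitney U) identity: AUC is the probability a random positive outranks a random negative, with tied scores counting half. A sketch:

```python
def roc_auc(labels, scores):
    """ROC-AUC via average ranks, handling tied scores correctly."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1   # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    # U statistic normalized by the number of positive/negative pairs
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The tie handling matters here: classifier probability scores on a large log often collide, and naive pairwise counting without the half-credit convention gives a biased AUC.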
Data Scientist
•
Coding
•
medium
Write a SQL query to identify the top 1% most active API users based on token consumption over the last 30 days, excluding internal Anthropic test accounts.
#SQL
#Percentiles
#Filtering
Data Scientist
•
Coding
•
medium
Implement a stratified sampling algorithm to select 10,000 prompt-response pairs for human evaluation, ensuring the sample exactly matches the real-world distribution of 15 different safety categories.
#Python
#Sampling
#Statistics
Data Scientist
•
Coding
•
medium
Write a Python function using NumPy to efficiently compute the cosine similarity between a single target embedding vector and a matrix of 1 million document embeddings.
#Python
#NumPy
#Linear Algebra
Data Scientist
•
Coding
•
medium
Given a table of human evaluations, write a SQL query to find the specific prompts that have the highest variance in human helpfulness ratings (indicating subjective or ambiguous prompts).
#SQL
#Aggregation
#Statistics
Data Scientist
•
Coding
•
medium
Write a SQL query to find the top 5% of users by token usage who have also triggered the safety filter more than 3 times in the last 30 days.
#Window Functions
#Filtering
#Aggregations
Data Scientist
•
Coding
•
hard
Given a table of user prompts with timestamps, write a SQL query to group these prompts into 'sessions'. A new session starts if there is a gap of more than 30 minutes between prompts.
#Sessionization
#Window Functions
#Time Series
Data Scientist
•
Coding
•
medium
Write a SQL query to calculate the week-over-week retention rate of users who interacted with a specific new model version.
#Cohort Analysis
#Retention
#Self Joins
Data Scientist
•
Coding
•
medium
How would you identify potential prompt injection attempts in our logs using a combination of regex and SQL?
#Regex
#Security
#Text Processing
Data Scientist
•
Coding
•
easy
Write a SQL query to calculate the 7-day rolling average of API requests per organization.
#Moving Averages
#Window Functions
Data Scientist
•
Coding
•
medium
Write a Python function to compute the BLEU score between a candidate string and a list of reference strings from scratch.
#NLP
#Algorithms
#String Manipulation
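A from-scratch sketch of sentence BLEU: clipped n-gram precisions up to 4-grams, geometric mean, and a brevity penalty. Smoothing is deliberately omitted for clarity, so any zero n-gram match yields BLEU = 0:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence BLEU for a candidate string vs. a list of reference strings."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference
        max_ref = Counter()
        for ref in refs:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_prec_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

The clipping step is the usual follow-up probe: without it, a degenerate candidate that repeats one reference word scores perfect unigram precision.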
Data Scientist
•
Coding
•
medium
Implement an algorithm to perform stratified sampling on a large dataset of RLHF prompts, ensuring equal representation across 10 different safety categories.
#Sampling
#Data Manipulation
#Pandas
Data Scientist
•
Coding
•
hard
Write a Python function to find the longest repeating substring in a generated text. This is useful for detecting if a model has fallen into a repetitive loop.
#Dynamic Programming
#Suffix Trees
#String Algorithms
Data Scientist
•
Coding
•
easy
Given a massive JSONL file of model interaction logs, write a memory-efficient Python script to extract the error rate per model version.
#File I/O
#Memory Management
#JSON
Data Scientist
•
Coding
•
medium
Implement the TF-IDF algorithm from scratch in Python to find the most important keywords in a set of user queries.
#NLP
#Math
#Data Structures
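A compact from-scratch sketch using the plain textbook definitions (tf = count / doc length, idf = log(N / df)); variants with smoothed idf exist, so it is worth stating which form you are implementing:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights per document.

    docs: list of token lists (queries pre-tokenized).
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()
        })
    return scores
```

A term appearing in every query gets weight 0, which is exactly the behavior you want for filler words like "please" in user queries.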
Data Scientist
•
System Design
•
hard
Design a telemetry and data pipeline system to capture human-in-the-loop feedback (e.g., thumbs up/down, rewritten responses) for RLHF at scale.
#Data Pipelines
#RLHF
#Streaming Data
Data Scientist
•
System Design
•
hard
Design an automated evaluation pipeline (Auto-Eval) that uses a stronger model (e.g., Opus) to grade a weaker model's (e.g., Haiku) outputs. How do you detect and mitigate positional bias and verbosity bias in the evaluator?
#Auto-Evals
#LLM-as-a-Judge
#Bias Mitigation
Data Scientist
•
System Design
•
medium
Design a telemetry and metrics dashboard system to monitor Claude's real-time refusal rates across different API endpoints and customer tiers.
#Data Architecture
#Monitoring
#Streaming
Data Scientist
•
System Design
•
hard
How would you design a data pipeline to ingest, clean, and deduplicate 100TB of web-scraped text for LLM pre-training?
#Big Data
#Data Engineering
#Spark
Data Scientist
•
System Design
•
hard
Design an evaluation system to continuously benchmark Claude against competitor models (like GPT-4) using both automated metrics and human-in-the-loop evaluation.
#MLOps
#Evaluation
#Human-in-the-loop
Data Scientist
•
System Design
•
medium
Design a system to track and attribute compute costs (GPU hours) to specific research experiments, model runs, and individual data scientists.
#Data Modeling
#Cloud Infrastructure
#Analytics
Data Scientist
•
System Design
•
hard
Propose an architecture for storing and querying billions of vector embeddings to support internal retrieval-augmented generation (RAG) experiments.
#Vector Databases
#Search
#Scalability
Data Scientist
•
System Design
•
hard
Design an experiment to test whether adding a new principle to Claude's Constitutional AI prompt improves user satisfaction without increasing refusal rates on benign queries.
#A/B Testing
#Constitutional AI
#Metrics
Data Scientist
•
System Design
•
hard
Design a telemetry and analytics system to monitor Claude's response latency, token generation speed, and output quality in real-time.
#Data Pipelines
#Real-time Analytics
#Monitoring
Data Scientist
•
System Design
•
medium
Design a dashboard and the underlying metrics suite for a new Claude enterprise feature that allows companies to upload their own knowledge bases.
#Metrics Design
#RAG
#B2B Analytics
Data Scientist
•
System Design
•
hard
How would you design a data pipeline to continuously evaluate model drift and degradation over time?
#MLOps
#Model Drift
#Data Engineering
Data Scientist
•
System Design
•
medium
Design an anomaly detection system to identify sudden spikes in API token usage that could indicate a compromised key or a scraping attack.
#Anomaly Detection
#Security
#Time Series
Data Scientist
•
Technical
•
hard
How would you design a robust evaluation metric to measure hallucination rates in Claude's summarization tasks across different domains (e.g., legal, medical, casual)?
#LLM Evaluation
#Hallucination
#Metrics Design
Data Scientist
•
Technical
•
medium
We recently rolled out a new Constitutional AI principle that makes Claude more harmless, but initial A/B tests show a 5% drop in user retention. How do you analyze this trade-off and what is your recommendation?
#A/B Testing
#Trade-off Analysis
#Product Analytics
Data Scientist
•
Technical
•
hard
You notice that Claude 3 Opus performs better overall on a benchmark than Claude 3 Sonnet, but when you break the data down by language (English, Spanish, Mandarin), Sonnet outperforms Opus in every single category. Explain how this is statistically possible.
#Simpson's Paradox
#Data Analysis
#Confounding Variables
Data Scientist
•
Technical
•
hard
From a data distribution and statistical perspective, explain the differences between preparing preference data for Direct Preference Optimization (DPO) versus traditional RLHF (PPO).
#RLHF
#DPO
#Preference Data
Data Scientist
•
Technical
•
medium
How would you determine the required sample size for human annotators grading Claude's helpfulness to achieve statistical significance, given historically high variance in inter-rater reliability?
#Sample Size Calculation
#Inter-rater Reliability
#Hypothesis Testing
Data Scientist
•
Technical
•
hard
How do you detect and mitigate data contamination (test set leakage) in the massive pre-training corpus of a large language model to ensure our benchmark scores are valid?
#Data Contamination
#Test Leakage
#Pre-training Data
Data Scientist
•
Technical
•
hard
How would you design an A/B test to evaluate if a new RLHF reward model improves Claude's helpfulness without degrading its safety?
#Experimentation
#RLHF
#Trade-offs
Data Scientist
•
Technical
•
medium
We want to measure the hallucination rate of a new model version. How do you define the metric and design the evaluation pipeline?
#LLM Evaluation
#Metrics
#Data Pipelines
Data Scientist
•
Technical
•
medium
Explain how you would handle Simpson's Paradox if you noticed it while analyzing human feedback data across different demographic groups of annotators.
#Statistics
#Data Analysis
#Bias
Data Scientist
•
Technical
•
medium
How do you determine the required sample size for a human evaluation task where the baseline win rate is 52% and we want to detect a 1% absolute improvement with 95% confidence?
#A/B Testing
#Power Analysis
#Statistics
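The prompt specifies 95% confidence but not power, so a power assumption (80% here) must be stated explicitly. A sketch of the standard two-proportion normal-approximation formula using only the standard library:

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided two-proportion z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)
```

For 52% vs. 53% this lands near 39,000 comparisons per arm, which is the punchline of the question: detecting a 1-point win-rate shift requires far more human ratings than teams usually expect.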
Data Scientist
•
Technical
•
easy
What statistical test would you use to compare the latency distributions of two different inference engine configurations, given that latency is heavily right-skewed?
#Hypothesis Testing
#Non-parametric Stats
Data Scientist
•
Technical
•
medium
If our automated safety classifier has a false positive rate of 5%, and 1% of all prompts are actually unsafe, what is the probability that a flagged prompt is actually unsafe?
#Bayes Theorem
#Probability
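Note the prompt omits the classifier's true positive rate, so an answer must state an assumption; taking tpr = 1.0 (perfect recall) gives the cleanest arithmetic. A worked sketch of the Bayes' rule computation:

```python
def posterior_unsafe(prevalence=0.01, fpr=0.05, tpr=1.0):
    """P(unsafe | flagged) via Bayes' rule.

    tpr=1.0 is an assumption, since the prompt only gives the false
    positive rate and the prevalence of unsafe prompts.
    """
    p_flag = prevalence * tpr + (1 - prevalence) * fpr   # total flag probability
    return prevalence * tpr / p_flag
```

With these numbers, P(unsafe | flagged) = 0.01 / (0.01 + 0.99 x 0.05) which is roughly 17%: the classic base-rate result that most flags are false positives when the event is rare.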
Data Scientist
•
Technical
•
hard
How would you model the relationship between model parameter count, training compute, and downstream zero-shot accuracy to predict the performance of our next-generation model?
#Scaling Laws
#Regression
#Predictive Modeling
Data Scientist
•
Technical
•
hard
Describe how you would detect data contamination (test set leakage) in a massive 5-trillion token pre-training corpus.
#Data Quality
#NLP
#Algorithms
Data Scientist
•
Technical
•
medium
Explain the concept of Constitutional AI. How would you quantitatively measure if a model is adhering to its constitution?
#Constitutional AI
#Alignment
#Metrics
Data Scientist
•
Technical
•
medium
What are the trade-offs between using automated LLM-as-a-judge evaluations versus human annotators for scoring model helpfulness?
#LLM Evaluation
#Bias
#Data Quality
Data Scientist
•
Technical
•
hard
How do you mitigate the 'length bias' (where models or humans prefer longer answers regardless of quality) in RLHF data?
#RLHF
#Bias Mitigation
#Modeling
Data Scientist
•
Technical
•
hard
Explain the difference between PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization) from a data requirements and modeling perspective.
#RLHF
#DPO
#PPO
Data Scientist
•
Technical
•
medium
How would you evaluate the coding capabilities of an LLM beyond just exact-match pass@k on standard datasets like HumanEval?
#Evaluation
#Code Generation
#Metrics
Data Scientist
•
Technical
•
hard
How would you design an evaluation metric to quantify the rate of subtle hallucinations in Claude's long-form summarization tasks?
#LLM Evaluation
#NLP
#Metrics Design
Data Scientist
•
Technical
•
medium
Given a dataset of human preference ratings for RLHF, how would you identify and correct for annotator bias or inconsistent grading?
#RLHF
#Data Quality
#Statistical Testing
Data Scientist
•
Technical
•
hard
How would you measure the trade-off between helpfulness and harmlessness (the 'HHH' alignment) when evaluating a new model checkpoint?
#AI Safety
#Trade-off Analysis
#Experimentation
Data Scientist
•
Technical
•
medium
How would you detect and quantify data contamination (test set leakage) in our pre-training corpus?
#Data Processing
#NLP
#Model Evaluation
Data Scientist
•
Technical
•
medium
If we want to detect a 0.1% increase in severe safety violations (a very rare event), how would you calculate the required sample size for the A/B test?
#A/B Testing
#Sample Size
#Rare Events
Data Scientist
•
Technical
•
medium
Describe a scenario where Simpson's Paradox might occur in our model evaluation data, and how you would resolve it.
#Data Analysis
#Causal Inference
#Probability
Data Scientist
•
Technical
•
hard
How would you use a Bayesian approach to establish an upper bound on the probability of Claude generating a harmful response, given zero observed failures in a sample of 10,000 prompts?
#Bayesian Statistics
#Risk Assessment
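With a uniform Beta(1, 1) prior and zero failures in n trials, the posterior is Beta(1, n + 1), whose CDF is 1 - (1 - p)^(n + 1), so the credible upper bound has a closed form. A sketch:

```python
def bayesian_upper_bound(n, failures=0, credibility=0.95):
    """Credible upper bound on the harmful-response rate, Beta(1,1) prior.

    Zero-failure case only: posterior is Beta(1, n + 1), so solving
    1 - (1 - p)**(n + 1) = credibility gives the bound directly.
    """
    assert failures == 0, "closed form shown only for the zero-failure case"
    return 1 - (1 - credibility) ** (1 / (n + 1))
```

For n = 10,000 this gives about 3 x 10^-4, recovering the frequentist "rule of three" (~3/n) as a sanity check, and it makes the key interview point: zero observed failures does not mean zero risk.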
Data Scientist
•
Technical
•
medium
Formulate a composite metric to capture 'user frustration' during a multi-turn chat with Claude.
#User Behavior
#Metrics Design
#NLP
Data Scientist
•
Technical
•
hard
Explain the mathematics and intuition behind Proximal Policy Optimization (PPO) at a high level, and why it is preferred for RLHF.
#Reinforcement Learning
#Math
#RLHF
Data Scientist
•
Technical
•
medium
How do you handle severe class imbalance when training a classifier to detect rare jailbreak attempts in user prompts?
#Classification
#Imbalanced Data
#Security
Data Scientist
•
Technical
•
hard
What are the primary limitations and biases of using strong LLMs as judges for evaluating the outputs of other LLMs?
#LLM Evaluation
#Bias
#Research Methodology
Data Scientist
•
Technical
•
medium
Explain how you would cluster millions of unstructured user prompts to identify emerging use cases and feature requests.
#Unsupervised Learning
#NLP
#Clustering
Data Scientist
•
Technical
•
hard
How would you estimate the causal impact of a new Constitutional AI principle on long-term user retention, given that we cannot run a perfectly randomized control trial for months?
#Causal Inference
#Observational Data
#Retention
Meet Your Interviewers
The "Standard" Interviewer
Senior Engineer
Focuses on core competencies, system constraints, and clear communication.
Unwritten Rules
Think Out Loud
Always explain your thought process before writing code or drawing architecture.