Feng Zhang

Posted on Jun 10 • Originally published at prachub.com

Uber Data Scientist Interview Cheatsheet 2026

#interview #career #uber #datascientist

If you're preparing for an Uber Data Scientist interview, the hard part is not memorizing formulas. It is knowing how Uber frames data science problems: marketplace effects, experiment validity, ETA quality, and metric definitions that do not fall apart under edge cases.

This post is a condensed rewrite of PracHub's Uber Data Scientist interview prep guide, focused on the themes that come up in technical screens and onsite rounds.

What Uber is really testing

Across SQL, product analytics, experimentation, and stats, interviewers want to see whether you can:

define the metric correctly
choose the right unit of analysis
avoid leakage and bad denominators
reason about interference in a two-sided marketplace
separate model quality from business impact

That last one matters a lot. Lower prediction error does not automatically mean a better rider experience. A statistically significant A/B test result does not automatically mean "launch."

1) SQL: can you build defensible metrics from messy event data?

Uber SQL questions often look simple at first. Then they turn into deduping events, picking the correct grain, and handling time windows without leaking future information.

Topics that come up often:

Window functions you should be comfortable with

Last or first event per entity

Use ROW_NUMBER() with a deterministic sort:

ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_ts DESC, event_id DESC
)

This is the standard pattern for "latest trip per rider" or "first exposure per user."
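To see why the second sort key matters, here is a minimal Python sketch of the same "keep row number 1" logic. The event tuples and the latest_per_user helper are made up for illustration; the point is that ties on event_ts are broken by event_id, so the result does not change between runs.

```python
# Hypothetical event rows: (user_id, event_ts as ISO string, event_id)
events = [
    ("u1", "2026-01-05T10:00:00", 101),
    ("u1", "2026-01-05T10:00:00", 102),  # same timestamp, higher event_id
    ("u1", "2026-01-04T09:00:00", 100),
    ("u2", "2026-01-03T08:00:00", 200),
]


def latest_per_user(rows):
    """Keep the row that ROW_NUMBER() = 1 would select for each user_id."""
    latest = {}
    for row in rows:
        user_id, event_ts, event_id = row
        best = latest.get(user_id)
        # Compare (event_ts, event_id) so timestamp ties break on event_id,
        # mirroring ORDER BY event_ts DESC, event_id DESC.
        if best is None or (event_ts, event_id) > (best[1], best[2]):
            latest[user_id] = row
    return latest


result = latest_per_user(events)
print(result["u1"])
```

Without event_id in the comparison, either of the two 10:00 rows could win for u1, which is the kind of nondeterminism interviewers like to probe.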

Rolling metrics

For time-series summaries, know how to write rolling averages by partition:

AVG(metric) OVER (
  PARTITION BY city
  ORDER BY dt
  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

Top-N logic

You should know when to use RANK, DENSE_RANK, and ROW_NUMBER, and be able to explain tie behavior clearly.
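If tie behavior is hard to explain from memory, it can help to reimplement the three functions once. The sketch below is illustrative only (rank_rows is a hypothetical helper, not a library call) and orders scores descending.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples, highest score first."""
    ordered = sorted(scores, reverse=True)
    out = []
    rank = 0
    dense = 0
    prev = None
    for position, score in enumerate(ordered, start=1):
        if score != prev:
            rank = position  # RANK jumps to the row position, leaving gaps after ties
            dense += 1       # DENSE_RANK always increases by exactly one
            prev = score
        # ROW_NUMBER is just the position; among tied rows its order is
        # arbitrary unless the ORDER BY includes a tie-breaker.
        out.append((score, position, rank, dense))
    return out


for row in rank_rows([50, 40, 50, 30]):
    print(row)
```

For a "top 2 drivers per city" question, this is the difference that matters: ROW_NUMBER returns exactly two rows, while RANK and DENSE_RANK can return more when scores tie.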

Cohort conversion and CTR

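A common failure mode is inflated CTR after joining impressions to clicks. If one impression maps to multiple clicks, COUNT(*) on the joined rows counts that impression more than once and breaks the metric. You need to define the denominator once, dedupe at the right grain (count distinct clicked impressions, not joined rows), and use an explicit attribution window bounded on both sides, like click_ts >= impression_ts AND click_ts <= impression_ts + interval '48 hours'.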
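A small Python sketch of the deduped calculation, using invented impressions and clicks. The names (impressions, clicks, ctr) and the sample data are assumptions for illustration; the 48-hour window is the one from the example above.

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(hours=48)

# Hypothetical data: impression_id -> impression_ts
impressions = {
    "i1": datetime(2026, 1, 1, 12, 0),
    "i2": datetime(2026, 1, 1, 13, 0),
    "i3": datetime(2026, 1, 2, 9, 0),
}
# Hypothetical click rows: (impression_id, click_ts)
clicks = [
    ("i1", datetime(2026, 1, 1, 12, 5)),
    ("i1", datetime(2026, 1, 1, 12, 6)),  # second click on the same impression
    ("i2", datetime(2026, 1, 4, 14, 0)),  # more than 48 hours after the impression
]


def ctr(impressions, clicks, window=ATTRIBUTION_WINDOW):
    """Share of impressions with at least one click inside the window."""
    clicked = set()
    for impression_id, click_ts in clicks:
        impression_ts = impressions.get(impression_id)
        if impression_ts is None:
            continue  # click without a logged impression: not attributable
        if impression_ts <= click_ts <= impression_ts + window:
            clicked.add(impression_id)  # a set dedupes repeat clicks
    return len(clicked) / len(impressions)


naive_ctr = len(clicks) / len(impressions)  # counts click rows, not impressions
deduped_ctr = ctr(impressions, clicks)
print(naive_ctr, deduped_ctr)
```

The denominator is fixed by the impressions table before any join, which is the habit the interviewer is checking for.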

Date spine joins

These matter for rolling averages and anomaly detection. Generate all dates first, then left join events, and fill missing counts with zero.
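Here is a rough Python illustration of the same idea, with made-up daily counts. The helper names are hypothetical. The rolling mean is taken over calendar days only because the spine has filled the gaps with zeros first.

```python
from datetime import date, timedelta

# Hypothetical daily trip counts; Jan 2 and Jan 4 have no rows at all
daily_counts = {
    date(2026, 1, 1): 10,
    date(2026, 1, 3): 4,
    date(2026, 1, 5): 6,
}


def date_spine(start, end):
    """Every calendar date from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def filled_series(counts, start, end):
    """Left join the spine to the counts, filling missing days with zero."""
    return [(d, counts.get(d, 0)) for d in date_spine(start, end)]


def trailing_mean(values, window):
    """Mean over the current row and up to window - 1 preceding rows."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


series = filled_series(daily_counts, date(2026, 1, 1), date(2026, 1, 5))
values = [count for _, count in series]
rolling = trailing_mean(values, window=3)
print(values)
print(rolling)
```

If you skip the spine and roll over the three existing rows, the last "3-day" average covers five calendar days and comes out at 20/3 instead of 10/3.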

Timezone-aware aggregation

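If you analyze market-level data, local time matters. San Francisco metrics in January should not be cut on raw UTC day boundaries: Pacific Standard Time is UTC-8, so the UTC date rolls over at 4 p.m. local time and evening trips land on the next calendar day.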
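A quick sketch of the effect in Python. To keep it dependency-free, it assumes a fixed UTC-8 offset, which is only valid for Pacific Standard Time; for real data you would more likely use zoneinfo.ZoneInfo("America/Los_Angeles") so that daylight saving time is handled. The request timestamp is invented.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset (Pacific Standard Time, as in January).
PST = timezone(timedelta(hours=-8), name="PST")


def local_date(ts_utc, tz=PST):
    """Convert a timezone-aware UTC timestamp to the market's local date."""
    return ts_utc.astimezone(tz).date()


# Hypothetical trip requested at 9:30 p.m. on Jan 14 in San Francisco,
# which is 05:30 UTC on Jan 15.
request_utc = datetime(2026, 1, 15, 5, 30, tzinfo=timezone.utc)

utc_day = request_utc.date()
local_day = local_date(request_utc)
print(utc_day, local_day)
```

The same trip is counted on Jan 15 in UTC and Jan 14 locally, so daily metrics shift unless you convert before truncating to a date.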

Common SQL mistakes

counting rows after a one-to-many join
using future rows in a rolling metric
treating RANK and ROW_NUMBER as interchangeable
skipping timezone conversion before DATE_TRUNC

If you want realistic drills for this style of question, PracHub has a set of data science interview practice questions that match the patterns above.

2) ETA questions: accuracy is only part of the problem

ETA is one of the clearest examples of how Uber expects product sense and statistical judgment to work together.

An interviewer is not looking for "we reduced MAE, so the model is better." They want you to think through:

what the ETA label is
how to evaluate prediction quality
whether the prediction is calibrated
how uncertainty should be measured
what user behavior changes after ETA changes
how interference breaks naive A/B testing

Start with label definition

You need to ask what ETA means in the question.

Is it:

request-to-pickup time?
pickup-to-dropoff time?
total trip duration?

The target has to match the user-facing promise. Cancellations, reassignment, batching, and no-shows all affect the label definition.

Know the evaluation metrics and what they miss

Uber cares about more than one error metric:

MAE is easy to interpret in minutes
RMSE penalizes large misses
median absolute error is more stable with outliers like airports or events
bias tells you whether the model is systematically optimistic or pessimistic

You should also say you would segment results by city, time of day, weather, airport, and trip type.
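To make the differences concrete, here is a small Python sketch that computes all four metrics on invented ETAs. The data and the eta_error_summary name are assumptions. Error is defined here as actual minus predicted, so a positive bias means riders waited longer than promised.

```python
import math
import statistics

# Hypothetical ETAs in minutes; the fourth trip is an airport-style outlier
predicted = [5.0, 6.0, 4.0, 10.0, 7.0]
actual = [7.0, 6.0, 5.0, 25.0, 6.0]


def eta_error_summary(predicted, actual):
    """MAE, RMSE, median absolute error, and bias (actual - predicted)."""
    errors = [a - p for p, a in zip(predicted, actual)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mae": statistics.fmean(abs_errors),
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "median_ae": statistics.median(abs_errors),
        "bias": statistics.fmean(errors),
    }


summary = eta_error_summary(predicted, actual)
print(summary)
```

One 15-minute miss pulls RMSE well above MAE while the median absolute error stays at 1 minute. Running the same function per city or per time-of-day segment is a simple way to do the segmentation described above.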

Calibration matters

If the app says 5 minutes and riders usually wait 7, the model is underestimating. That can increase conversion in the short run and hurt trust later.

Reliability curves by ETA bucket are often more useful than one aggregate accuracy score.
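A reliability table is easy to sketch. The Python below buckets invented trips by predicted ETA and compares mean predicted to mean actual wait in each bucket. The 5-minute bucket width and all names are arbitrary choices for illustration.

```python
from collections import defaultdict

# Hypothetical (predicted_eta_min, actual_wait_min) pairs
trips = [(3, 4), (4, 5), (5, 7), (5, 8), (9, 9), (11, 10), (12, 13)]


def reliability_table(trips, bucket_width=5):
    """Rows of (bucket_start, n, mean_predicted, mean_actual, gap) per ETA bucket."""
    buckets = defaultdict(list)
    for predicted, actual in trips:
        bucket_start = int(predicted // bucket_width) * bucket_width
        buckets[bucket_start].append((predicted, actual))
    table = []
    for bucket_start in sorted(buckets):
        rows = buckets[bucket_start]
        mean_predicted = sum(p for p, _ in rows) / len(rows)
        mean_actual = sum(a for _, a in rows) / len(rows)
        # Positive gap: riders in this bucket wait longer than the app said
        table.append(
            (bucket_start, len(rows), mean_predicted, mean_actual,
             mean_actual - mean_predicted)
        )
    return table


table = reliability_table(trips)
for row in table:
    print(row)
```

In this toy data the short-ETA buckets are optimistic and the 10+ minute bucket is well calibrated, a pattern a single aggregate MAE would hide.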

Uncertainty matters too

For dispatch and UX decisions, intervals can matter as much as point estimates. A 90% prediction interval should contain the actual arrival time about 90% of the time. Coverage and interval width are both relevant.
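Both quantities are simple to compute once you have interval bounds and realized arrival times. A minimal sketch with invented intervals (all values are assumptions):

```python
# Hypothetical rows: (lower_min, upper_min, actual_min) for a nominal 90% interval
intervals = [
    (3, 8, 5), (4, 9, 10), (2, 6, 4), (5, 12, 11), (6, 10, 7),
    (3, 7, 3), (4, 8, 6), (5, 9, 9.5), (2, 5, 4), (6, 11, 8),
]


def coverage_and_width(intervals):
    """Empirical coverage and mean interval width."""
    hits = sum(1 for lower, upper, actual in intervals if lower <= actual <= upper)
    coverage = hits / len(intervals)
    mean_width = sum(upper - lower for lower, upper, _ in intervals) / len(intervals)
    return coverage, mean_width


coverage, mean_width = coverage_and_width(intervals)
print(coverage, mean_width)
```

Here the nominal 90% interval covers only 80% of arrivals, so it is too narrow. The two numbers have to be read together, because a model can always reach high coverage by making every interval uselessly wide.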

Connect ETA to business outcomes

A good answer separates model metrics from business metrics.

Examples of business outcomes:

request conversion
cancellation rate
completed trips
pickup delay
rider satisfaction

Guardrails might include:

driver idle time
acceptance rate
surge exposure
support contacts

3) Uber experiments are often not standard A/B tests

This is where many candidates get too generic.

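For consumer apps, user-level randomization is often fine. At Uber, treatment can affect shared supply. One rider's treatment can change another rider's outcome. That means SUTVA (the stable unit treatment value assumption, which requires that one unit's outcome does not depend on another unit's assignment) may fail.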

When interference matters

If treatment changes dispatch, pricing, ETA display, or demand, untreated users may still be affected.

Examples:

a rider-facing ETA change shifts demand in a neighborhood
a driver incentive changes driver supply for everyone nearby
a marketplace ranking change affects matching outcomes across groups

If you ignore that, your experiment readout may look precise and still be wrong.

Know when to propose switchback experiments

For marketplace changes, Uber often needs geo-time randomization instead of user-level assignment.

A strong answer for an ETA or dispatch experiment usually includes:

the estimand
the randomization design
primary metrics and guardrails
the inference plan

A reasonable design is a switchback experiment with city-zone-hour cells. You randomize treatment by zone and time block, so every trip in a cell gets the same arm, then analyze results with standard errors clustered on the randomized cell or a regression with time and geography fixed effects.

Do not use naive row-level standard errors if the design is clustered.
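One way to make this concrete is a sketch of cell-level assignment and a cell-level readout. Note the substitution: instead of cluster-robust regression, this aggregates each cell to one mean and compares cell means, a simpler approach that also respects the clustering. The salt, zone names, and outcome values are all invented, and the sketch ignores carryover between adjacent time blocks.

```python
import hashlib
import math
import statistics


def assign(zone, block_start, salt="eta_exp_v1"):
    """Deterministic pseudo-random arm for one (zone, time block) cell."""
    key = f"{salt}|{zone}|{block_start}".encode()
    return "treatment" if hashlib.sha256(key).digest()[0] % 2 else "control"


def cell_level_effect(cells):
    """cells: list of (arm, trip-level outcomes). Each cell counts once."""
    means = {"treatment": [], "control": []}
    for arm, outcomes in cells:
        means[arm].append(statistics.fmean(outcomes))
    t, c = means["treatment"], means["control"]
    diff = statistics.fmean(t) - statistics.fmean(c)
    # Variance comes from variation between cells, not between trips
    se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return diff, se


# Hypothetical pickup waits in minutes, grouped by randomized cell
cells = [
    ("treatment", [4, 6]), ("treatment", [5, 5, 5]), ("treatment", [7]),
    ("control", [6, 8]), ("control", [7]), ("control", [8, 8, 8, 8]),
]
diff, se = cell_level_effect(cells)
print(assign("sf_zone_1", "2026-01-15T08:00"), diff, se)
```

The standard error is driven by the number of cells per arm (three here), not the number of trips, which is why a naive row-level analysis would look far more precise than it should.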

Power is different under clustering

For clustered experiments, you need to account for design effect:

DEFF = 1 + (m - 1)rho

where m is cluster size and rho is intra-cluster correlation.

That means more events inside the same cluster do not help as much as people expect. More independent clusters or time blocks usually matter more.
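A quick calculation shows how large the gap can be. The cluster counts and rho = 0.05 below are illustrative assumptions, not Uber figures; effective sample size is taken as total observations divided by DEFF.

```python
def design_effect(m, rho):
    """DEFF = 1 + (m - 1) * rho for clusters of size m."""
    return 1 + (m - 1) * rho


def effective_sample_size(n_clusters, m, rho):
    """Total observations deflated by the design effect."""
    return n_clusters * m / design_effect(m, rho)


# Baseline: 200 clusters of 50 events each
base = effective_sample_size(n_clusters=200, m=50, rho=0.05)
# Option A: double the events per cluster
more_events = effective_sample_size(n_clusters=200, m=100, rho=0.05)
# Option B: double the number of clusters
more_clusters = effective_sample_size(n_clusters=400, m=50, rho=0.05)

print(round(base), round(more_events), round(more_clusters))
```

Both options double the raw event count, but doubling events per cluster raises effective sample size by only about 16% in this setup, while doubling clusters doubles it.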

4) A/B testing answers need a decision framework

A lot of candidates list metrics and stop there. Uber wants a launch recommendation, not a metrics dump.

A solid structure is:

1. Define the objective

Example: Does a promo targeting change increase completed trips or gross bookings at an acceptable promo cost and contribution margin?

2. Pick the right randomization unit

rider_id for rider promos
driver_id for driver incentives
geo or switchback for marketplace changes with spillovers

3. Choose one primary metric

Possible primary metrics:

completed trips per user
conversion rate
gross bookings
variable contribution

Then add a short list of guardrails:

cancellation rate
ETA
surge rate
driver utilization
support contact rate

4. Check validity before interpretation

You should mention:

sample ratio mismatch
exposure correctness
pre-treatment balance
logging completeness
novelty or day-of-week effects
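Sample ratio mismatch is the easiest of these to check numerically. The sketch below runs a chi-square goodness-of-fit test against the intended split using only the standard library. The counts are invented and the function name is hypothetical. For one degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math


def srm_check(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square statistic and p-value for observed vs intended allocation."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical 50/50 experiment that logged 50,000 vs 51,200 users
chi2, p_value = srm_check(50_000, 51_200)
print(chi2, p_value)
```

A very small p-value here points to a problem with assignment or logging, and the metric readout should not be interpreted until that is explained.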

5. Make the recommendation based on practical value

Do not say "p < 0.05, ship it."

A result can be statistically significant and still be a bad launch if contribution drops, promo spend gets out of control, or marketplace health gets worse.

Final prep advice

If you're studying for this interview, spend less time on abstract ML talk and more time on clean definitions, marketplace-aware experiment design, and SQL execution details. That is where many answers get weak.

The full Uber Data Scientist interview prep guide on PracHub goes deeper on ETA evaluation, A/B testing, SQL patterns, and practice prompts. If you want to pressure-test yourself, work through timed practice questions and say your answer out loud like you're already in the interview.
