If you're preparing for an Uber Data Scientist interview, the hard part is not memorizing formulas. It is knowing how Uber frames data science problems: marketplace effects, experiment validity, ETA quality, and metric definitions that do not fall apart under edge cases.
This post is a condensed rewrite of PracHub's Uber Data Scientist interview prep guide, focused on the themes that come up in technical screens and onsite rounds.
What Uber is really testing
Across SQL, product analytics, experimentation, and stats, interviewers want to see whether you can:
- define the metric correctly
- choose the right unit of analysis
- avoid leakage and bad denominators
- reason about interference in a two-sided marketplace
- separate model quality from business impact
That last one matters a lot. Lower prediction error does not automatically mean a better rider experience. A statistically significant A/B test result does not automatically mean "launch."
1) SQL: can you build defensible metrics from messy event data?
Uber SQL questions often look simple at first. Then they turn into deduping events, picking the correct grain, and handling time windows without leaking future information.
Topics that come up often:
Window functions you should be comfortable with
Last or first event per entity
Use ROW_NUMBER() with a deterministic sort:
ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_ts DESC, event_id DESC
)
This is the standard pattern for "latest trip per rider" or "first exposure per user."
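To make the tie-breaker concrete, here is a minimal sketch that runs the pattern through Python's built-in sqlite3 module (assuming SQLite 3.25+ for window functions). The `events` table and its rows are made up for illustration; the point is that two events share a timestamp and `event_id DESC` decides which one wins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, user_id INTEGER, event_ts TEXT);
INSERT INTO events VALUES
  (1, 10, '2024-01-01 08:00:00'),
  (2, 10, '2024-01-01 09:30:00'),
  (3, 10, '2024-01-01 09:30:00'),  -- same timestamp as event 2
  (4, 20, '2024-01-02 12:00:00');
""")

# Keep exactly one row per user: the latest event, ties broken by event_id.
latest = conn.execute("""
WITH ranked AS (
  SELECT event_id, user_id,
         ROW_NUMBER() OVER (
           PARTITION BY user_id
           ORDER BY event_ts DESC, event_id DESC
         ) AS rn
  FROM events
)
SELECT user_id, event_id FROM ranked WHERE rn = 1 ORDER BY user_id
""").fetchall()

print(latest)  # [(10, 3), (20, 4)]
```

Without the second sort key, which of events 2 and 3 gets `rn = 1` is not guaranteed, and the query can return different rows on different runs.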
Rolling metrics
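For time-series summaries, know how to write rolling averages by partition. The frame below counts rows, not calendar days, so it is a true 7-day average only when each city has exactly one row per date: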
AVG(metric) OVER (
  PARTITION BY city
  ORDER BY dt
  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
Top-N logic
You should know when to use RANK, DENSE_RANK, and ROW_NUMBER, and be able to explain tie behavior clearly.
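A quick way to practice explaining tie behavior is to run all three functions side by side. This sketch uses a hypothetical `driver_trips` table with invented counts, again through sqlite3 (SQLite 3.25+ assumed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE driver_trips (driver_id TEXT, trips INTEGER);
INSERT INTO driver_trips VALUES ('a', 50), ('b', 50), ('c', 40), ('d', 30);
""")

rows = conn.execute("""
SELECT driver_id,
       ROW_NUMBER() OVER (ORDER BY trips DESC, driver_id) AS row_num,
       RANK()       OVER (ORDER BY trips DESC)            AS rnk,
       DENSE_RANK() OVER (ORDER BY trips DESC)            AS dense_rnk
FROM driver_trips
ORDER BY trips DESC, driver_id
""").fetchall()

for row in rows:
    print(row)
# ('a', 1, 1, 1)
# ('b', 2, 1, 1)   tie: RANK and DENSE_RANK repeat, ROW_NUMBER does not
# ('c', 3, 3, 2)   RANK skips 2, DENSE_RANK does not
# ('d', 4, 4, 3)
```

So "top 2 drivers" returns two rows with ROW_NUMBER, two with RANK here (but three if the tie were at second place), and three with DENSE_RANK. Which one is right depends on how the question defines top-N.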
Cohort conversion and CTR
A common failure mode is inflated CTR after joining impressions to clicks. If one impression maps to multiple clicks, the join fans out and COUNT(*) counts joined rows instead of impressions. Define the denominator once, dedupe at the right grain, and use an explicit attribution window such as click_ts >= impression_ts AND click_ts <= impression_ts + interval '48 hours'.
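The fan-out problem is easy to reproduce. The sketch below uses hypothetical `impressions` and `clicks` tables with invented rows and compares a naive join against a version that counts distinct impressions inside a 48-hour window. It runs on sqlite3, so the window uses SQLite's `datetime(..., '+48 hours')` instead of the `interval` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE impressions (impression_id INTEGER, impression_ts TEXT);
CREATE TABLE clicks (click_id INTEGER, impression_id INTEGER, click_ts TEXT);
INSERT INTO impressions VALUES
  (1, '2024-01-01 10:00:00'),
  (2, '2024-01-01 11:00:00'),
  (3, '2024-01-01 12:00:00'),
  (4, '2024-01-01 13:00:00');
INSERT INTO clicks VALUES
  (101, 1, '2024-01-01 10:05:00'),
  (102, 1, '2024-01-01 10:06:00'),  -- second click on impression 1
  (103, 2, '2024-01-05 11:00:00');  -- 96 hours later, outside the window
""")

# Naive: row counts after a one-to-many join.
naive_ctr = conn.execute("""
SELECT 1.0 * COUNT(c.click_id) / COUNT(*)
FROM impressions i
LEFT JOIN clicks c ON c.impression_id = i.impression_id
""").fetchone()[0]

# Defensible: distinct impressions in both numerator and denominator,
# and only clicks within 48 hours after the impression.
windowed_ctr = conn.execute("""
SELECT 1.0 * COUNT(DISTINCT c.impression_id) / COUNT(DISTINCT i.impression_id)
FROM impressions i
LEFT JOIN clicks c
  ON c.impression_id = i.impression_id
 AND c.click_ts >= i.impression_ts
 AND c.click_ts <= datetime(i.impression_ts, '+48 hours')
""").fetchone()[0]

print(naive_ctr, windowed_ctr)  # 0.6 0.25
```

Four impressions, one of which was clicked inside the window, gives 25%. The naive query reports 60% because it counts three click rows over five joined rows.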
Date spine joins
These matter for rolling averages and anomaly detection. Generate all dates first, then left join events, and fill missing counts with zero.
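Here is a small sketch of that sequence on sqlite3 (SQLite 3.25+ assumed). The `trips` table and dates are invented, the spine is built with a recursive CTE, and the rolling window is shortened to 3 days to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (trip_id INTEGER, dt TEXT);
INSERT INTO trips VALUES (1, '2024-01-01'), (2, '2024-01-01'), (3, '2024-01-04');
""")

rows = conn.execute("""
WITH RECURSIVE spine(dt) AS (
  SELECT '2024-01-01'
  UNION ALL
  SELECT date(dt, '+1 day') FROM spine WHERE dt < '2024-01-04'
),
daily AS (
  -- COUNT(t.trip_id) ignores NULLs, so days with no trips become 0.
  SELECT s.dt, COUNT(t.trip_id) AS trips
  FROM spine s
  LEFT JOIN trips t ON t.dt = s.dt
  GROUP BY s.dt
)
SELECT dt, trips,
       AVG(trips) OVER (
         ORDER BY dt
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS avg_3d
FROM daily
ORDER BY dt
""").fetchall()

for row in rows:
    print(row)
```

With the spine, the 3-day average on 2024-01-04 is (0 + 0 + 1) / 3. Without it, the same window would only see the two days that have trips and report (2 + 1) / 2, which hides the quiet days.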
Timezone-aware aggregation
If you analyze market-level data, local time matters. San Francisco metrics in January should not be cut on raw UTC day boundaries.
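A short sketch of why this matters, using invented request timestamps. To keep it dependency-free, a fixed UTC-8 offset stands in for San Francisco in January; that is a simplifying assumption, and real pipelines would use a proper time zone database so daylight saving is handled.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset (Pacific Standard Time in January).
PST = timezone(timedelta(hours=-8))

trip_request_times_utc = [
    datetime(2024, 1, 16, 6, 30, tzinfo=timezone.utc),  # 10:30 pm local, Jan 15
    datetime(2024, 1, 16, 7, 45, tzinfo=timezone.utc),  # 11:45 pm local, Jan 15
    datetime(2024, 1, 16, 9, 0, tzinfo=timezone.utc),   # 1:00 am local, Jan 16
]

# Truncating raw UTC puts all three late-night requests on Jan 16.
utc_days = Counter(ts.date().isoformat() for ts in trip_request_times_utc)

# Converting first puts two of them on Jan 15, where riders experienced them.
local_days = Counter(
    ts.astimezone(PST).date().isoformat() for ts in trip_request_times_utc
)

print(dict(utc_days))    # {'2024-01-16': 3}
print(dict(local_days))  # {'2024-01-15': 2, '2024-01-16': 1}
```

The same ordering applies in SQL: convert to local time first, then truncate to the day.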
Common SQL mistakes
- counting rows after a one-to-many join
- using future rows in a rolling metric
- treating RANK and ROW_NUMBER as interchangeable
- skipping timezone conversion before DATE_TRUNC
If you want realistic drills for this style of question, PracHub has a set of data science interview practice questions that match the patterns above.
2) ETA questions: accuracy is only part of the problem
ETA is one of the clearest examples of how Uber expects product sense and statistical judgment to work together.
An interviewer is not looking for "we reduced MAE, so the model is better." They want you to think through:
- what the ETA label is
- how to evaluate prediction quality
- whether the prediction is calibrated
- how uncertainty should be measured
- what user behavior changes after ETA changes
- how interference breaks naive A/B testing
Start with label definition
You need to ask what ETA means in the question.
Is it:
- request-to-pickup time?
- pickup-to-dropoff time?
- total trip duration?
The target has to match the user-facing promise. Cancellations, reassignment, batching, and no-shows all affect the label definition.
Know the evaluation metrics and what they miss
Uber cares about more than one error metric:
- MAE is easy to interpret in minutes
- RMSE penalizes large misses
- median absolute error is more stable with outliers like airports or events
- bias tells you whether the model is systematically optimistic or pessimistic
You should also say you would segment results by city, time of day, weather, airport, and trip type.
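To see how the four metrics disagree, here is a minimal sketch on made-up ETAs in minutes. The sign convention (predicted minus actual, so negative means the ETA was too optimistic) is an assumption for this example.

```python
from math import sqrt
from statistics import mean, median

# Invented data: one trip has a very large miss (for example, an airport pickup).
predicted = [5.0, 6.0, 4.0, 8.0, 5.0]
actual = [6.0, 6.0, 7.0, 8.0, 15.0]

errors = [p - a for p, a in zip(predicted, actual)]  # negative = too optimistic
abs_errors = [abs(e) for e in errors]

mae = mean(abs_errors)                        # average miss in minutes
rmse = sqrt(mean([e * e for e in errors]))    # pulled up by the 10-minute miss
median_ae = median(abs_errors)                # barely moved by the outlier
bias = mean(errors)                           # systematic direction of the miss

print(f"MAE={mae:.2f} RMSE={rmse:.2f} MedianAE={median_ae:.2f} Bias={bias:.2f}")
# MAE=2.80 RMSE=4.69 MedianAE=1.00 Bias=-2.80
```

One outlier moves RMSE far more than median absolute error, and the negative bias says the model is optimistic on average. In an interview, that gap is a cue to segment before concluding anything.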
Calibration matters
If the app says 5 minutes and riders usually wait 7, the model is underestimating. That can increase conversion in the short run and hurt trust later.
Reliability curves by ETA bucket are often more useful than one aggregate accuracy score.
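A reliability table can be sketched in a few lines. The pairs and the bucket edges below are invented for illustration; a real analysis would pick buckets from the product's ETA display and use far more data.

```python
from collections import defaultdict
from statistics import mean

# Invented (predicted_minutes, actual_minutes) pairs.
pairs = [(3.0, 3.0), (4.0, 5.0), (5.0, 7.0), (6.0, 8.0), (9.0, 9.0), (10.0, 12.0)]

def eta_bucket(predicted: float) -> str:
    """Hypothetical bucket edges: under 5, 5 to under 10, 10 and above."""
    if predicted < 5:
        return "0-4"
    if predicted < 10:
        return "5-9"
    return "10+"

grouped = defaultdict(list)
for predicted, actual in pairs:
    grouped[eta_bucket(predicted)].append((predicted, actual))

# For each bucket: (mean predicted, mean actual). Calibrated buckets sit close together.
reliability = {
    bucket: (mean([p for p, _ in rows]), mean([a for _, a in rows]))
    for bucket, rows in grouped.items()
}

for bucket, (mean_pred, mean_actual) in reliability.items():
    print(f"{bucket}: predicted {mean_pred:.2f}, actual {mean_actual:.2f}")
```

Here every bucket shows actual waits above predicted ones, and the gap is widest for mid-range and long ETAs, which a single aggregate MAE would not reveal.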
Uncertainty matters too
For dispatch and UX decisions, intervals can matter as much as point estimates. A 90% prediction interval should contain the actual arrival time about 90% of the time. Coverage and interval width are both relevant.
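Both quantities are simple to compute once you have intervals and outcomes. The numbers below are invented; the sketch only shows the bookkeeping.

```python
# Invented (lower_minutes, upper_minutes, actual_minutes) for ten trips.
intervals = [
    (3, 8, 5), (4, 9, 10), (2, 6, 4), (5, 12, 11), (3, 7, 6),
    (4, 10, 9), (6, 14, 8), (2, 5, 5), (3, 9, 4), (5, 11, 7),
]

# Share of trips whose actual arrival fell inside the predicted interval.
coverage = sum(lo <= actual <= hi for lo, hi, actual in intervals) / len(intervals)

# Average interval width: wide intervals can hit coverage and still be useless.
avg_width = sum(hi - lo for lo, hi, _ in intervals) / len(intervals)

print(coverage, avg_width)  # 0.9 5.4
```

A model can reach 90% coverage trivially by predicting very wide intervals, so report the two numbers together and compare width at matched coverage.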
Connect ETA to business outcomes
A good answer separates model metrics from business metrics.
Examples of business outcomes:
- request conversion
- cancellation rate
- completed trips
- pickup delay
- rider satisfaction
Guardrails might include:
- driver idle time
- acceptance rate
- surge exposure
- support contacts
3) Uber experiments are often not standard A/B tests
This is where many candidates get too generic.
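For consumer apps, user-level randomization is often fine. At Uber, treatment can affect shared supply, so one rider's treatment can change another rider's outcome. That means the stable unit treatment value assumption (SUTVA), which says each unit's outcome depends only on its own assignment, may fail.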
When interference matters
If treatment changes dispatch, pricing, ETA display, or demand, untreated users may still be affected.
Examples:
- a rider-facing ETA change shifts demand in a neighborhood
- a driver incentive changes driver supply for everyone nearby
- a marketplace ranking change affects matching outcomes across groups
If you ignore that, your experiment readout may look precise and still be wrong.
Know when to propose switchback experiments
For marketplace changes, Uber often needs geo-time randomization instead of user-level assignment.
A strong answer for an ETA or dispatch experiment usually includes:
- the estimand
- the randomization design
- primary metrics and guardrails
- the inference plan
A reasonable design is a switchback experiment with city-zone-hour cells. You randomize treatment at the cell level (zone and time block), then analyze results with cluster-robust standard errors or a regression with time and geography fixed effects.
Do not use naive row-level standard errors if the design is clustered.
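As a rough sketch of the idea, the code below assigns whole zone-hour cells to an arm and then computes the effect and its standard error across cells instead of across trips. Everything here is an assumption for illustration: the hash-based assignment, the `eta-exp-1` salt, the zone name, and the cell-level conversion rates. Aggregating to one number per cell is a simple stand-in for cluster-robust standard errors, and it ignores carryover between adjacent time blocks.

```python
import hashlib
from math import sqrt
from statistics import mean, variance


def assign_arm(zone: str, hour_block: str, salt: str = "eta-exp-1") -> str:
    """Deterministically assign an entire zone-hour cell to one arm."""
    key = f"{salt}|{zone}|{hour_block}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 2
    return "treatment" if bucket == 0 else "control"


def cell_level_effect(treated_cells, control_cells):
    """Difference in means with a standard error computed across cells, not rows."""
    diff = mean(treated_cells) - mean(control_cells)
    se = sqrt(
        variance(treated_cells) / len(treated_cells)
        + variance(control_cells) / len(control_cells)
    )
    return diff, se


# Every request in the same cell gets the same arm.
arm = assign_arm("sf-mission", "2024-01-15T08")

# Invented conversion rates, one value per zone-hour cell.
treated = [0.62, 0.58, 0.65, 0.60]
control = [0.55, 0.57, 0.59, 0.53]
diff, se = cell_level_effect(treated, control)
print(arm, round(diff, 4), round(se, 4))
```

The sample size that drives the standard error is the number of cells (four per arm here), not the thousands of trips inside them. That is the reason row-level standard errors look far too precise.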
Power is different under clustering
For clustered experiments, you need to account for design effect:
DEFF = 1 + (m - 1)rho
where m is cluster size and rho is intra-cluster correlation.
That means more events inside the same cluster do not help as much as people expect. More independent clusters or time blocks usually matter more.
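Plugging in numbers makes the point quickly. The cluster size, correlation, and trip count below are invented; only the formula comes from the text above.

```python
def design_effect(m: float, rho: float) -> float:
    """DEFF = 1 + (m - 1) * rho, with m = cluster size and rho = intra-cluster correlation."""
    return 1 + (m - 1) * rho


def effective_sample_size(n_total: float, m: float, rho: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(m, rho)


# Hypothetical experiment: 100,000 trips in clusters of 200 with rho = 0.05.
deff = design_effect(200, 0.05)
n_eff = effective_sample_size(100_000, 200, 0.05)
print(round(deff, 2), round(n_eff))  # 10.95 9132

# Same number of clusters, twice as many trips per cluster:
# effective sample size barely moves.
print(round(effective_sample_size(200_000, 400, 0.05)))  # 9547
```

Doubling the trips inside each cluster adds only a few hundred effective observations, while doubling the number of clusters at the same size would double the effective sample.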
4) A/B testing answers need a decision framework
A lot of candidates list metrics and stop there. Uber wants a launch recommendation, not a metrics dump.
A solid structure is:
1. Define the objective
Example: Does a promo targeting change increase completed trips or gross bookings at an acceptable promo cost and contribution margin?
2. Pick the right randomization unit
- rider_id for rider promos
- driver_id for driver incentives
- geo or switchback for marketplace changes with spillovers
3. Choose one primary metric
Possible primary metrics:
- completed trips per user
- conversion rate
- gross bookings
- variable contribution
Then add a short list of guardrails:
- cancellation rate
- ETA
- surge rate
- driver utilization
- support contact rate
4. Check validity before interpretation
You should mention:
- sample ratio mismatch
- exposure correctness
- pre-treatment balance
- logging completeness
- novelty or day-of-week effects
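Sample ratio mismatch is the easiest of these checks to show in code. This sketch runs a chi-square goodness-of-fit test with one degree of freedom using only the standard library. The counts and the 0.001 alert threshold are assumptions for illustration, not a stated Uber standard.

```python
from math import erfc, sqrt


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (n_treatment - expected_treatment) ** 2 / expected_treatment
        + (n_control - expected_control) ** 2 / expected_control
    )
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


balanced = srm_p_value(50_000, 50_000)
skewed = srm_p_value(50_000, 51_500)

SRM_ALERT = 0.001  # hypothetical threshold
print(balanced, skewed < SRM_ALERT)  # 1.0 True
```

A 1.5% gap looks small, but at this sample size it is very unlikely under a true 50/50 split. When this check fails, the readout should be treated as invalid until the assignment or logging bug is found.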
5. Make the recommendation based on practical value
Do not say "p < 0.05, ship it."
A result can be statistically significant and still be a bad launch if contribution drops, promo spend gets out of control, or marketplace health gets worse.
Final prep advice
If you're studying for this interview, spend less time on abstract ML talk and more time on clean definitions, marketplace-aware experiment design, and SQL execution details. That is where many answers get weak.
The full Uber Data Scientist interview prep guide on PracHub goes deeper on ETA evaluation, A/B testing, SQL patterns, and practice prompts. If you want to pressure-test yourself, work through timed practice questions on PracHub and say your answer out loud like you're already in the interview.