MLOps in Travel: From Notebook to Production in 30 Days
I've spent years watching data science teams in travel technology struggle with the same frustrating pattern: brilliant models that never see production. A hotel ranking algorithm that achieves 92% accuracy in a Jupyter notebook but takes nine months to deploy. A pricing optimisation model that works beautifully on sample data but can't handle the scale of real booking flows. The graveyard of excellent ideas that died in the deployment gap.
The travel industry has a unique challenge here. We're not dealing with static datasets or predictable traffic patterns. Hotel availability changes by the second. Search behaviour spikes unpredictably around events, holidays, and breaking news. A model that ranks hotels for business travellers in Frankfurt behaves completely differently from one serving leisure tourists in Bali. And unlike e-commerce where you can test gradually, travel booking is high-stakes—users won't tolerate slow search results or irrelevant recommendations.
After implementing MLOps practices across several travel platforms, I've learned that the thirty-day timeline isn't about rushing. It's about establishing the right infrastructure from day one so that the path from experimentation to production becomes a well-worn groove rather than a heroic effort each time.
The Traditional Deployment Nightmare
Let me paint a familiar picture. A data scientist builds a hotel ranking model that considers dozens of factors: historical booking rates, user preferences, seasonal patterns, competitive pricing, review sentiment, and real-time availability. In the notebook environment, it's elegant. The feature engineering pipeline pulls from various data sources, the model trains overnight, and the evaluation metrics look promising.
Then comes deployment day. The engineering team asks reasonable questions: How do we serve this model at scale? Where do the features come from in production? How do we handle the fact that some data sources have different latencies? What happens when the model needs retraining? How do we roll back if something goes wrong?
Suddenly, the data scientist is rewriting their carefully crafted pandas operations into production code. They're learning about Docker containers, Kubernetes manifests, and API versioning. They're debugging why features that worked perfectly in the notebook produce different values in production. The thirty-day project becomes six months.
I've been on both sides of this conversation, and the frustration is real on both ends. But the solution isn't to make data scientists become software engineers or vice versa. It's to establish patterns and tools that bridge the gap systematically.
Building the Foundation: Experiment Tracking That Matters
The first principle I follow is that every experiment must be tracked from the very beginning—not as an afterthought. When I start working with a new model, I immediately configure MLflow tracking. Every training run logs its hyperparameters, metrics, and the exact code version used.
This might seem like overhead when you're still exploring ideas, but it pays dividends immediately. I can compare dozens of experiments at a glance, see which feature combinations actually improved performance, and most importantly, I can reproduce any result months later. In travel, where seasonality matters enormously, being able to say "this model configuration worked well during summer 2023" and actually recreate it is invaluable.
The model registry becomes the contract between experimentation and production. When I'm satisfied with a model's performance, I register it with a clear version number and transition it through stages: staging, production, archived. The engineering team can deploy any registered model knowing it's been properly validated and that all its dependencies are documented.
For hotel ranking specifically, I track not just accuracy metrics but business-relevant measures: booking conversion rates by property type, search result diversity scores, and latency percentiles. A model that's technically superior but adds 200 milliseconds to search response time isn't actually better in production.
Feature Stores: The Underrated MVP
The feature store is where most travel ML projects either succeed or fail. Hotel ranking depends on features that come from wildly different sources with different update frequencies. Property amenities change occasionally. Pricing updates every few minutes. User behaviour features need real-time aggregation. Review sentiment scores refresh daily.
Without a feature store, you end up with two separate feature pipelines: one for training (batch, historical) and one for serving (real-time, production). They inevitably drift. You train on last month's data with one set of feature definitions, then serve predictions using slightly different logic, and wonder why production performance doesn't match your offline metrics.
I've used Feast extensively for this problem, and the mental model is straightforward: define features once, serve them consistently everywhere. For hotel ranking, I define feature views that specify exactly how each feature is computed. Historical features for training come from the offline store (typically backed by data warehouse tables). Real-time features for serving come from the online store (Redis or DynamoDB in my implementations).
The key insight is that the feature store acts as a contract. Data engineers own the feature definitions and ensure they're computed correctly. Data scientists reference those features by name in their models. The serving layer requests features from the online store. Everyone works with the same logical features, even if the physical implementation differs between training and serving.
From Model to Service: Seldon's Role
Once I have a trained model registered in MLflow and features defined in a feature store, the deployment question becomes: how do I turn this into a scalable API that can handle thousands of requests per second?
This is where Seldon Core has proven invaluable in my work. It bridges the gap between the model artifact and a production-grade prediction service. I package the model with its preprocessing logic, specify resource requirements, and define how it should scale. Seldon handles the Kubernetes orchestration, load balancing, and monitoring.
For hotel ranking, I typically deploy models as REST endpoints that accept a search context (location, dates, user preferences) and return a ranked list of hotel IDs with confidence scores. The service pulls features from the online feature store, applies the model, and returns results—all within the tight latency budget that search demands.
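Seldon's Python wrapper discovers a class that exposes a `predict` method and puts it behind a REST or gRPC endpoint. The sketch below is a minimal, hypothetical wrapper: the fixed weight vector stands in for a real model loaded from the registry, and the feature columns are illustrative:

```python
import numpy as np


class HotelRanker:
    """Minimal Seldon-style model wrapper (names and features are illustrative)."""

    def __init__(self):
        # In production this would load the registered MLflow artifact;
        # a fixed linear scorer stands in for the trained model here.
        self.weights = np.array([0.6, 0.3, 0.1])

    def predict(self, X, features_names=None):
        # X: one row per candidate hotel, columns are online-store features
        # (e.g. booking_rate_7d, avg_review_sentiment, price_competitiveness).
        X = np.asarray(X, dtype=float)
        return X @ self.weights  # higher score ranks higher


ranker = HotelRanker()
scores = ranker.predict([[0.8, 0.9, 0.5], [0.4, 0.7, 0.9]])
```

The surrounding service resolves the search context to candidate hotel IDs, fetches their features from the online store, scores them with the wrapper, and returns the sorted list.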
What I particularly value about this approach is that it separates concerns cleanly. The data science team owns the model logic. The platform team owns the infrastructure. Neither needs to deeply understand the other's domain. When we need to update the model, we register a new version in MLflow and update the Seldon deployment configuration. The model swaps in without requiring engineering team intervention.
Monitoring What Actually Matters
Deployment isn't the finish line—it's the starting line. Models in production face challenges that never appeared in notebooks. Data distributions shift. User behaviour changes. Competitors adjust their pricing strategies. New hotels enter the market.
I've learned to monitor at three levels, and I've seen each of them go wrong more than once. First, system metrics: latency, throughput, error rates, resource utilisation. These tell me if the serving infrastructure is healthy. Second, model metrics: prediction confidence distributions, feature value ranges, and any anomalies in the input data. Third, business metrics: booking conversion rates, user engagement, revenue per search.
The business metrics are what ultimately matter. A model might maintain perfect technical health while business performance degrades because user preferences have shifted or the competitive landscape has changed. I've seen hotel ranking models perform beautifully by all technical measures while conversion rates dropped because the model hadn't adapted to a new booking pattern that emerged post-pandemic.
I set up alerts for all three levels, but with different thresholds and response protocols. A latency spike needs immediate attention. A gradual drift in feature distributions triggers a review cycle. A decline in conversion rates prompts a deeper analysis of whether the model needs retraining or if something else in the booking funnel has changed.
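For the gradual-drift checks on feature distributions, one simple technique (my choice of illustration here, not something mandated by any of these tools) is the population stability index between a training-time sample and a recent production sample:

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature sample and a production sample.

    A common rule of thumb: below 0.1 is stable, 0.1 to 0.25 is moderate
    drift, above 0.25 warrants investigation.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor each bucket to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


# Illustrative data: nightly prices shift upward between training and serving.
rng = np.random.default_rng(7)
train_prices = rng.normal(120, 25, 10_000)  # feature sample at training time
live_prices = rng.normal(140, 25, 10_000)   # shifted production sample
psi = population_stability_index(train_prices, live_prices)
```

A scheduled job computing this per feature, with alerts on the thresholds above, catches the slow distribution shifts that never trip a latency or error-rate alarm.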
The Thirty-Day Timeline in Practice
Can every team pull this off? Honestly, no. When I say thirty days, I mean from the first line of code to a production-ready service handling real traffic, and that is only achievable when the infrastructure is already in place: MLflow tracking configured, feature store operational, deployment templates ready.
Week one is exploration and experimentation. I try different algorithms, feature combinations, and training strategies. Everything is tracked in MLflow. I'm not aiming for perfection—I'm aiming for a baseline that's better than the current production system.
Week two is refinement and validation. I evaluate the best candidates from week one on held-out test sets that represent different travel scenarios: business travel, leisure travel, last-minute bookings, advance planning. I validate that the model's predictions make intuitive sense and that the feature pipeline is reliable.
Week three is integration and testing. I register the selected model, deploy it to a staging environment using Seldon, and run it against synthetic traffic. I verify that latency targets are met, that the feature store serves the right values, and that the end-to-end pipeline works as expected.
Week four is production rollout and monitoring. I deploy to production behind a feature flag, gradually ramping traffic from 5% to 100% while watching all metrics closely. If something goes wrong, I can roll back immediately. If everything looks good, the new model becomes the default.
This timeline works because each step builds on established infrastructure and patterns. I'm not inventing deployment strategies or debugging infrastructure issues. I'm following a well-defined path that's been validated on previous models.
Lessons from Real Deployments
The most important lesson I've learned is that MLOps isn't primarily about tools—it's about reducing friction. Every manual step, every handoff between teams, every custom script that only one person understands—these are friction points that slow down the cycle from idea to production.
The second lesson is that experimentation and production shouldn't look completely different. If you train with pandas and serve with a completely different framework, you're creating risk. If your training pipeline runs on your laptop while production runs on a distributed cluster, you're creating complexity. Align the environments as much as possible.
The third lesson is that monitoring and iteration are more valuable than perfect initial deployment. I'd rather ship a good model quickly and improve it based on production feedback than spend months perfecting a model in a notebook. Real user behaviour teaches you things that no offline evaluation can reveal.
In travel specifically, I've learned to respect the domain's complexity. Hotel ranking isn't just a machine learning problem—it's a business problem embedded in a fast-moving market with strong seasonal patterns and competitive dynamics. The ML system must be responsive enough to adapt to these realities.
My Perspective on the Future
I believe we're entering an era where the distinction between data science and software engineering will blur further. Not because data scientists will become engineers or vice versa, but because the tooling will abstract away much of the complexity that currently requires both skill sets.
Feature stores will become as standard as databases. Model registries will be integrated into every ML workflow. Deployment will be a configuration change rather than a development project. The thirty-day timeline I've described will seem slow—teams will expect to go from idea to production in days.
But the fundamental principles won't change: track everything, define features once, deploy consistently, monitor comprehensively, iterate rapidly. These practices separate organisations that extract value from machine learning from those that merely experiment with it.
For travel technology specifically, I see enormous potential in applying these MLOps practices to increasingly sophisticated problems: multi-armed bandit approaches to ranking, reinforcement learning for pricing strategies, and real-time personalisation that adapts within a single user session. The infrastructure patterns I've described here scale to these more complex scenarios.
The real competitive advantage isn't having the most sophisticated algorithm—it's having the ability to deploy, monitor, and improve models faster than your competitors. That's what MLOps enables, and that's why I invest time establishing these practices even when they feel like overhead in the moment. The payoff comes when you can respond to market changes in days rather than quarters.
About Martin Tuncaydin
Martin Tuncaydin is an AI and Data executive in the travel industry, with deep expertise spanning machine learning, data engineering, and the application of emerging AI technologies across travel platforms. Follow Martin Tuncaydin for more insights on MLOps and travel technology.