
Martin Tuncaydin


MLOps in Travel: From Notebook to Production in 30 Days

I've spent the better part of a decade watching machine learning projects in travel either languish in Jupyter notebooks or collapse spectacularly after their first production deployment. The gap between a data scientist's proof-of-concept and a model that serves real traffic at scale is where most travel technology innovation dies quietly.

Last year, I decided to document the entire journey of taking a hotel ranking model from experimental notebook to production-ready service in exactly 30 days. Not because 30 days is some magic number, but because I wanted to prove that with the right MLOps foundations, rapid iteration is achievable even in the complex, high-stakes environment of travel technology.

The Reality of Machine Learning in Travel Technology

Travel platforms face unique challenges that make MLOps particularly critical. A hotel ranking model isn't just optimising for a single metric like click-through rate. It's balancing availability, price competitiveness, location relevance, property quality signals, and user preferences across dozens of markets with different seasonal patterns. The model needs to respond to real-time inventory changes, adapt to shifting traveller behaviour, and maintain performance during peak booking windows when the cost of failure is measured in millions.

I've seen teams spend six months perfecting a model in notebooks, only to discover that their feature engineering pipeline takes 40 minutes to run—completely unworkable for real-time ranking. I've watched beautiful neural networks trained on historical data completely fail when confronted with the post-pandemic travel landscape. The traditional approach of "build it perfectly, then figure out deployment" simply doesn't work in an industry this dynamic.

The shift I made was to treat deployment infrastructure as a first-class concern from day one. Before writing a single line of model training code, I established the production architecture that would ultimately serve predictions. This isn't premature optimisation—it's acknowledging that a model's value is measured by its impact on live traffic, not its performance on a validation set.

Building the Foundation: MLflow as the Central Nervous System

I chose MLflow as the experiment tracking and model registry backbone because it's become the de facto standard for good reason. It's open-source, integrates with virtually every ML framework, and most importantly, it enforces a discipline that prevents the chaos I've seen derail so many projects.

From the first experiment, every model training run logged hyperparameters, metrics, and artefacts to MLflow. The hotel ranking model incorporated features like property review scores, pricing percentile within market, distance from search coordinates, and historical conversion rates. Each experiment tracked not just accuracy metrics but business-relevant measurements: predicted revenue impact, ranking stability across similar queries, and computational cost per prediction.

The MLflow model registry became the single source of truth for model versions. I established a simple promotion workflow: models moved from "None" to "Staging" after passing offline validation, then to "Production" only after successful A/B testing on a small percentage of live traffic. This might sound bureaucratic, but it's what allows you to move fast—you can experiment aggressively because you have guardrails that prevent broken models from reaching production.
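The promotion gates can be expressed as a small state machine. The stage names mirror MLflow's registry stages, but the validation checks here are hypothetical placeholders for the offline-validation and A/B-test criteria described above:

```python
# Hypothetical promotion guardrails mirroring MLflow's registry stages.
STAGES = ["None", "Staging", "Production"]

def promote(current_stage, passed_offline_validation, ab_test_uplift=None):
    """Advance a model one stage only when its gate is satisfied."""
    if current_stage == "None" and passed_offline_validation:
        return "Staging"
    if (current_stage == "Staging"
            and ab_test_uplift is not None
            and ab_test_uplift > 0):  # positive result on a small traffic slice
        return "Production"
    return current_stage  # gate not met: stay put
```

Encoding the workflow this way means no model reaches "Production" without having passed both gates, no matter how promising it looks offline.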

What surprised me most was how MLflow's model packaging format simplified deployment. By serialising the entire prediction pipeline—including feature transformations, the model itself, and post-processing logic—as a single MLflow model, I eliminated an entire class of training-serving skew issues that plague production ML systems.

Feature Engineering at Scale: The Store That Changed Everything

The breakthrough moment came when I implemented a proper feature store. For years, I'd treated feature engineering as something that happened during model training and then got awkwardly replicated in production code. The result was always the same: subtle inconsistencies between training and serving, debugging nightmares, and features that worked beautifully offline but were impossible to compute in real-time.

I built a simple feature store that separated feature computation from feature consumption. Historical features—things like a property's average review score over the past 90 days or its pricing percentile within its competitive set—were pre-computed in batch jobs and stored with timestamps. Point-in-time correctness was enforced automatically, preventing the insidious data leakage that can make offline metrics look fantastic while the production model performs poorly.
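Point-in-time correctness comes down to one rule: training may only read feature values computed at or before the event being trained on. A minimal lookup sketch (the data layout is hypothetical):

```python
import bisect

def feature_as_of(history, event_ts):
    """Return the latest feature value computed at or before event_ts.

    `history` is a list of (computed_ts, value) pairs sorted by timestamp.
    Returning any later value would leak future information into training,
    inflating offline metrics while the production model underperforms.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect.bisect_right(timestamps, event_ts)
    if i == 0:
        return None  # no value existed yet at event time
    return history[i - 1][1]
```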

Real-time features—current search parameters, user session context, live availability status—were computed on-demand but through the same API. This meant the training pipeline and the serving pipeline consumed features identically. When I needed to add a new feature, I implemented it once and it became immediately available for both experimentation and production inference.

The feature store also became the natural place to monitor for data drift. I tracked the distribution of every feature over time, setting up alerts when values shifted beyond expected ranges. During one memorable incident, a partner hotel chain's API started returning invalid coordinates, which would have caused the distance-based ranking features to collapse. The feature store caught it in staging before it reached production.
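One simple form of the drift check described is comparing each feature's current distribution against a stored baseline. This sketch flags a shift in the mean measured in baseline standard deviations; the threshold is illustrative, not the one actually used:

```python
from statistics import mean, stdev

def drift_alert(baseline, current, z_threshold=3.0):
    """Flag a feature whose current mean has shifted more than
    z_threshold baseline standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(current) != mu
    z = abs(mean(current) - mu) / sigma
    return z > z_threshold
```

An invalid-coordinates incident like the one above would show up here as the distance feature's distribution jumping far outside its baseline range.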

Deployment Architecture: Seldon Core in Practice

For the actual model serving infrastructure, I deployed using Seldon Core on Kubernetes. The choice was driven by requirements specific to travel: running multiple model versions simultaneously for A/B testing, routing traffic dynamically by market or user segment, and scaling inference capacity up dramatically during peak booking periods.

Seldon's approach of wrapping models in containers and exposing them through a standard API made deployment remarkably straightforward. An MLflow model moved to production by being packaged as a Seldon deployment, with resource limits, auto-scaling policies, and monitoring configured declaratively. The entire deployment was version-controlled and could be rolled back in seconds if issues emerged.
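A declarative deployment of that shape might look roughly like the following SeldonDeployment manifest. The names, model URIs, replica counts, and traffic split are illustrative, not the project's actual configuration:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: hotel-ranking
spec:
  predictors:
    - name: champion            # current production model
      traffic: 90
      replicas: 3
      graph:
        name: ranker
        implementation: MLFLOW_SERVER
        modelUri: s3://models/hotel-ranking/production   # illustrative path
    - name: challenger          # A/B test candidate on a traffic slice
      traffic: 10
      replicas: 1
      graph:
        name: ranker
        implementation: MLFLOW_SERVER
        modelUri: s3://models/hotel-ranking/staging      # illustrative path
```

Because the manifest lives in version control, rolling back is a matter of reapplying the previous revision rather than rebuilding anything by hand.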

What I particularly valued was Seldon's built-in support for explaining predictions. In travel, stakeholders rightfully want to understand why a particular hotel ranked where it did. By implementing a simple explanation endpoint that returned the top contributing features for each prediction, I turned the model from a black box into a tool that product managers and business teams could reason about.
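For an additive scoring model, the explanation endpoint can be as simple as ranking per-feature contributions. This sketch assumes contributions of the form weight × value, which is not necessarily how the actual model attributed them:

```python
def top_contributions(weights, features, k=3):
    """Return the k features contributing most to an additive score,
    as (name, contribution) pairs sorted by absolute impact."""
    contribs = {name: weights[name] * value
                for name, value in features.items()}
    ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]
```

A response like "review score pushed this property up, distance pushed it down" is exactly the kind of answer product managers can reason about.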

The architecture also included a feedback loop where actual booking outcomes were captured and linked back to the predictions that influenced them. This closed the loop between model predictions and business outcomes, enabling continuous monitoring of model performance on real-world metrics that matter—conversion rate, revenue per search, customer satisfaction—rather than just statistical measures.
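The feedback loop reduces to joining logged predictions with booking outcomes on a shared request identifier. The record schema below is hypothetical:

```python
def conversion_by_model(predictions, bookings):
    """Join prediction logs to booking outcomes on request_id and
    compute conversion rate per model version."""
    booked = {b["request_id"] for b in bookings}
    served, converted = {}, {}
    for p in predictions:
        version = p["model_version"]
        served[version] = served.get(version, 0) + 1
        if p["request_id"] in booked:
            converted[version] = converted.get(version, 0) + 1
    return {v: converted.get(v, 0) / served[v] for v in served}
```

Aggregates like this are what let you judge a model on conversion and revenue per search rather than offline validation scores alone.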

The 30-Day Timeline in Retrospect

The actual timeline looked like this: Days 1-5 were spent establishing the MLflow tracking server, feature store schema, and basic CI/CD pipelines. Days 6-15 focused on rapid experimentation—I trained and evaluated 47 different model variants, from simple gradient boosting to more complex neural architectures. Days 16-22 were dedicated to production hardening: load testing, failure mode analysis, implementing monitoring and alerting, and documenting runbooks.

The final week was spent on gradual rollout. The model started serving 1% of traffic, then 5%, then 25%, with careful monitoring at each stage. By day 30, it was handling 100% of hotel ranking for a specific market segment, outperforming the previous rule-based system by 12% on conversion rate.
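A staged rollout like this is commonly implemented with deterministic hash bucketing, so each user consistently sees the same model while the percentage ramps up. The hashing scheme below is one standard choice, not necessarily the one used here:

```python
import hashlib

def serves_new_model(user_id, rollout_pct):
    """Deterministically assign a user to the new model when their
    hash bucket (0-99) falls below the current rollout percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct

# Ramping 1% -> 5% -> 25% -> 100% only ever adds users to the new
# model's cohort; nobody flips back and forth between models.
```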

What made this timeline possible wasn't working longer hours or cutting corners. It was having the right infrastructure in place so that iteration was cheap and safe. I could train a new model variant in the morning, see it running in the staging environment by afternoon, and have preliminary A/B test results by the next day. The feedback loops were tight enough that learning happened rapidly.

Lessons That Transfer Beyond This Project

The most valuable insight from this experience is that MLOps isn't about tools—it's about establishing processes that make good practices the path of least resistance. When experiment tracking is automatic, when features are managed centrally, when deployment is a standard workflow, engineers naturally work in ways that produce reliable, maintainable systems.

I also learned to ruthlessly prioritise based on production requirements. The model that ultimately shipped wasn't the most sophisticated one I built—it was the one that met latency requirements, could explain its predictions, and degraded gracefully when upstream services were slow. In production, reliability beats perfection every time.

The feature store was perhaps the single highest-leverage investment. It eliminated entire categories of bugs, made collaboration between data scientists and engineers dramatically smoother, and provided a foundation for rapid feature development that continued long after the initial model launch.

My View on MLOps Maturity in Travel

I believe travel technology is at an inflection point with machine learning operations. The companies that will dominate the next decade aren't necessarily those with the most sophisticated algorithms—they're the ones that can iterate fastest, deploy confidently, and learn from production traffic most effectively.

The 30-day timeline I documented isn't a speed record to chase. It's a demonstration that with proper MLOps foundations, the cycle time from idea to production-validated model can be measured in weeks, not quarters. This velocity fundamentally changes what's possible with machine learning in travel. You can respond to market shifts, test hypotheses about traveller behaviour, and continuously optimise experiences at a pace that was previously unimaginable.

The infrastructure I've described—MLflow, feature stores, containerised model serving—isn't exotic, though adopting it well takes sustained discipline. These are increasingly standard tools that any team can adopt. What separates successful ML initiatives from failed ones isn't access to proprietary technology. It's the discipline to treat models as products that require proper engineering, monitoring, and lifecycle management.

As travel continues to recover and evolve, the platforms that thrive will be those that treat machine learning not as a research exercise but as a core operational capability. That requires investing in MLOps infrastructure before you feel ready, instrumenting everything from day one, and building systems that make doing the right thing easier than doing the expedient thing. The 30-day journey I documented was just the beginning—the real value emerged in the months that followed, when the foundation enabled continuous improvement at a pace I'd never previously achieved.


About Martin Tuncaydin

Martin Tuncaydin is an AI and Data executive in the travel industry, with deep expertise spanning machine learning, data engineering, and the application of emerging AI technologies across travel platforms. Follow Martin Tuncaydin for more insights on MLOps and travel technology.
