N Chandra Prakash Reddy for AWS Community Builders

Posted on • Originally published at devopstour.hashnode.dev

From MLOps to LLMOps: A Practical AWS GenAI Operations Guide

The vibe at AWS Student Community Day Tirupati on November 1, 2025, was different from what I expected. The room was full of students, cloud enthusiasts, and builders, all there to learn, network, and geek out about AWS. Sessions ran throughout the day, and each one added something new.

One session, though, made me sit up and pay closer attention. Raghul Gopal, a Data Scientist and AWS Community Builder (ML), took the stage to talk about something most people don't think much about: how do you actually run AI models in production? Not just build them on a laptop and call it done, but consistently test, monitor, and scale them.

"Generative AI Operations: FMOps, LLMOps Integration with MLOps Maturity Model" was the title of the talk. When it was over, I had a whole new perspective on the AI/ML lifecycle on AWS.

The Question That Kicked Everything Off

"AWS gives you everything in one place to build ML models," Raghul said at the start, and it hit the mark. "But are we really using it right in production?"

Sense the pattern? Plenty of teams can train a model. Getting that model to serve a large number of real users reliably is a different task entirely.

To put it another way: cooking a great meal at home is one thing. Running a restaurant kitchen that feeds hundreds of people every day without a hitch takes a whole different set of skills. That's what this session was about.

What "ML in Production" Actually Means

Before getting into answers, the session gave us a really helpful list of questions that can be used as a litmus test to see if your machine learning setup is really ready for production:

  • Are your model's features (the pieces of data it uses to make predictions) kept separate and tracked correctly?

  • Is your trained model kept in a model repository or registry?

  • Is the model being watched all the time to make sure it keeps giving correct answers?

  • Is model lineage being tracked? That is, a record of which data produced which version of the model.

  • Are there CI/CD pipelines (automated delivery systems) that move code from development to pre-production to production, with manual approval steps?

  • Is testing done automatically in every environment?

  • Does ETL (Extract, Transform, Load) automatically deliver data so that machine learning engineers can start working on projects without having to wait for data teams?

If you answered "not really" to most of those questions, you're in good company. That's exactly what MLOps is meant to fix.
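As a toy illustration of that litmus test, the checklist above can be scored mechanically. The question keys below are my own shorthand, not an official rubric:

```python
# Toy litmus test: score how many production-readiness questions get a "yes".
QUESTIONS = [
    "features tracked in a feature store",
    "model stored in a registry",
    "model monitored in production",
    "model lineage recorded",
    "CI/CD with manual approvals",
    "automated tests in every environment",
    "ETL delivers data without manual handoffs",
]

def readiness(answers: dict) -> float:
    """Fraction of checklist items answered yes (missing answers count as no)."""
    return sum(bool(answers.get(q)) for q in QUESTIONS) / len(QUESTIONS)

answers = {q: False for q in QUESTIONS}
answers["model stored in a registry"] = True
print(f"{readiness(answers):.0%}")  # → 14%
```

If your score is low, the maturity levels below are the path from here to 100%.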

The Three "-Ops" You Need to Know

Let's be honest: the terminology can be confusing. Here's the simple version:

  • MLOps (Machine Learning Operations): The process of putting standard machine learning solutions into production in a smart way. Examples include fraud detection models, recommendation systems, and churn prediction.

  • FMOps (Foundation Model Operations): An extension of MLOps for Foundation Models, the massive AI models like Claude or Titan that are trained on terabytes of data with billions of parameters. FMOps covers use cases such as generating text, images, music, and video.

  • LLMOps (Large Language Model Operations): A part of FMOps that is used to operationalise Large Language Models. This is the technology that makes chatbots, writing helpers, and coding tools work.

Imagine three concentric rings: MLOps is the outer ring, FMOps sits inside it, and LLMOps is at the centre. Whichever kind of AI model you run, the same operational principles apply across all three.

The MLOps Maturity Model: Four Levels

Now things really start to get interesting. Raghul showed a four-level MLOps Maturity Model, which is a plan for how teams move from small tests to using machine learning on a large scale. It's kind of like getting better at a video game.

Level 0 - Initial Phase: Experiments and Ideas

At this point, data scientists are just exploring. To build and test models, they use Amazon SageMaker Studio (AWS's cloud-based ML IDE) or local tools like VS Code and PyCharm. The technology stack looks like this:

  • Amazon SageMaker: Core ML platform with Data Wrangler (data prep), Pipelines (automation), Feature Store, and Clarify (bias detection)

  • Amazon S3: Stores your raw ML training data

  • AWS Glue: ETL service - cleans and transforms data before feeding it to models

  • Amazon Athena: Run SQL queries directly on data sitting in S3

  • AWS Lambda: Trigger automated jobs and workflows

  • Code Repository: AWS had its own CodeCommit, but now most people use GitHub or Bitbucket to store and track their work.

That's fine; everything at this level is manual and exploratory. Every mature machine learning system starts here.

Level 1 - Repeatable Phase: Automating the Workflow

The team now moves from manual runs to automated pipelines. You no longer have to retrain a model by hand every time: SageMaker Pipelines handles data preparation, training, evaluation, and packaging for you. The SageMaker Model Registry keeps a central, versioned list of all your models, updated whenever new ones are trained.
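To make the idea concrete, here's a toy, in-memory sketch of what a model registry tracks. This is illustrative plain Python, not the actual SageMaker Model Registry API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str      # where the trained artifact lives, e.g. an S3 path
    metrics: dict          # evaluation metrics recorded at training time
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class ModelRegistry:
    """Toy registry: each model name maps to an ordered list of versions."""
    def __init__(self):
        self._models = {}

    def register(self, name: str, artifact_uri: str, metrics: dict) -> ModelVersion:
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_uri, metrics)
        versions.append(mv)
        return mv

    def latest(self, name: str) -> ModelVersion:
        return self._models[name][-1]

registry = ModelRegistry()
registry.register("churn-model", "s3://my-bucket/churn/v1/model.tar.gz", {"auc": 0.81})
v2 = registry.register("churn-model", "s3://my-bucket/churn/v2/model.tar.gz", {"auc": 0.84})
assert registry.latest("churn-model").version == 2
```

The point is the shift in guarantees: every version has an artifact location, metrics, and a timestamp, so any run can be traced and reproduced.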

"I trained this once" became "every training run is tracked, versioned, and reproducible."

Level 2 - Reliable Phase: Adding the Safety Net

This is the quality gate before going live. You introduce:

  • Automated testing: Unit tests, integration tests, and ML-specific evaluation metrics all run automatically.

  • CI/CD pipelines: AWS CodePipeline and AWS CodeBuild move code from development to pre-production to production, with manual approval steps.

  • Different testing strategies based on how data arrives:

    • Batch requests: Tested via Lambda and S3
    • Real-time requests: Handled through Amazon API Gateway
    • Streaming requests: Managed with Kafka and Amazon MSK

To be fair, this level demands real engineering discipline. But it's what separates a prototype from something you'd stake your business on.
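A promotion gate of this kind can be sketched as a small function. In practice the gate would live in a CodePipeline approval stage, so treat this plain-Python version as illustrative only:

```python
def approve_promotion(metrics: dict, thresholds: dict):
    """Return (approved, failures): block promotion if any metric misses its threshold.

    Metrics absent from `metrics` are treated as failing, so a model can't slip
    through just because an evaluation step was skipped.
    """
    failures = {
        name: (metrics.get(name), minimum)
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    }
    return (not failures, failures)

thresholds = {"accuracy": 0.90, "f1": 0.85}

ok, why = approve_promotion({"accuracy": 0.93, "f1": 0.88}, thresholds)
assert ok                       # both metrics clear their bars → promote

ok, why = approve_promotion({"accuracy": 0.93, "f1": 0.80}, thresholds)
assert not ok and "f1" in why   # f1 below threshold → promotion blocked
```

The threshold names and values here are hypothetical; what matters is that the gate fails closed.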

Level 3 - Scalable Phase: Multi-Team, Enterprise-Scale

The final level multiplies everything from Level 2 across multiple teams and machine learning solutions running in parallel. New additions here:

  • Multiple data sources: NoSQL databases like DynamoDB and DocumentDB for different team needs

  • IAM (Identity and Access Management) to manage roles and permissions at scale

  • CloudFormation or Terraform for Infrastructure as Code - your entire environment defined in code, replicable in minutes

  • Your team can use GitHub Actions or Jenkins instead of AWS CodePipeline if those are the tools they already know.

The goal at this level? From idea to production in days instead of weeks, with multiple solutions shipping in parallel.

Making the Leap: MLOps → LLMOps

When you have a strong base in MLOps, moving on to LLMOps is easier than it sounds. The slide made it clear: "You can operationalise your basic LLM use cases from one environment to the next."

The ideas stay the same: Dev, Pre-Prod, and Prod environments, CI/CD pipelines, manual approvals, and automated tests. What changes is that you're now working with Foundation Models instead of traditional ML models; they become the building blocks you layer on top of your MLOps skills.

Initial LLMOps: Picking the Right Foundation Model

This is where lots of teams get stuck. How do you pick from the dozens of LLMs out there? The session offered a framework you can apply right away.

Step 1: Know Your Use Case First

Before you choose a model, make sure you know exactly what you need. Things to look at:

  • Open source vs. proprietary?

  • Commercial license compatibility

  • Model size: Small Language Model (SLM) vs. Large Language Model (LLM)

  • Speed and latency requirements

  • Context window size - how much text the model can process at once (measured in tokens)

  • Quality of the training dataset and how it applies to your area

  • Is the model fine-tunable with your own data?

Step 2: Navigate the Speed-Precision-Cost Triangle

The truth is that you can't have everything. Raghul illustrated this with a triangle of three competing objectives:

  • High speed → smaller model → lower precision → lower cost

  • Higher precision → larger model → lower speed → higher cost

In the example on the slide, three Foundation Models were compared side by side. FM1 had the highest accuracy (5/5) but also the highest cost. FM3 was cheaper ($$) but less accurate. With price as the deciding factor, FM2 won: the best mix of accuracy (4/5) and low cost ($). The right choice always depends on which corner of the triangle matters most to you.
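That comparison can be mimicked with a simple weighted score. The accuracy and cost figures come from the slide; the weights are my own hypothetical way of encoding "price matters most":

```python
# Candidate FMs from the slide: accuracy out of 5, relative cost (1 = cheapest).
candidates = {
    "FM1": {"accuracy": 5, "cost": 3},
    "FM2": {"accuracy": 4, "cost": 1},
    "FM3": {"accuracy": 3, "cost": 2},
}

def pick_model(candidates: dict, w_accuracy: float = 1.0, w_cost: float = 2.0) -> str:
    """Pick the candidate with the best weighted score.

    Higher accuracy raises the score; higher cost lowers it. The default
    weights penalise cost heavily, matching a price-sensitive selection.
    """
    def score(c):
        return w_accuracy * c["accuracy"] - w_cost * c["cost"]
    return max(candidates, key=lambda name: score(candidates[name]))

assert pick_model(candidates) == "FM2"  # cost-sensitive weighting favours FM2
```

Flip the weights (say, `w_accuracy=10.0, w_cost=0.0`) and FM1 wins instead, which is exactly the triangle trade-off in code.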

Step 3: Build a Prompt Catalog and Evaluate Systematically

Don't just pick a model and hope it works. The recommended process:

  1. Prompt Engineers write strong evaluation prompts by following structured guidelines such as CORS or Anthropic's prompting guidance.

  2. Store those prompts in a Prompt Catalogue. Think of it as a Feature Store, but for prompts. With version control enabled, DynamoDB works well here.

  3. GenAI Developers shortlist the top 3 Foundation Models based on those prompts

  4. Run structured evaluations in one of four ways, depending on the data you have:

  • Accuracy metrics (when labeled data exists with discrete outputs — e.g., classification)

  • Similarity metrics like ROUGE or cosine similarity (for open-ended text outputs)

  • Human in the Loop (HIL): Using tools like Amazon SageMaker Ground Truth, human judges manually score model outputs against set criteria.

  • LLM-as-judge: Feed outputs to a trusted, reliable LLM and have it rate the response with a score and explanation

The result is a clean evaluation scorecard, which means you choose your model based on evidence rather than gut feeling.
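As a toy version of the similarity-metric path, here's a bag-of-words cosine similarity. Real evaluations would typically use embeddings or ROUGE, so this is only a sketch of the idea:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts.

    Toy tokenizer: lowercase and split on whitespace. Returns 0.0
    when the texts share no tokens.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = "the invoice total is 42 dollars"
candidate = "the invoice total is 42 dollars exactly"
assert cosine_similarity(reference, candidate) > 0.9  # near-identical answers
```

Run this over every prompt in the catalogue for each shortlisted FM and you have the raw numbers for the scorecard.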

Building and Deploying Your LLMOps App

What do you do now that you've picked your LLM? Building the real app around it is the last step:

  • Frontend: Django, Flask, Streamlit (highly recommended for quick and clean prototypes), or React

  • Backend / LLM Provider: Amazon Bedrock, SageMaker JumpStart, or HuggingFace - depending on your model choice

  • Load Balancing and Auto Scaling to handle real-world traffic without hiccups

  • The same Dev → Pre-Prod → Prod pipeline from MLOps applies - always test your LLM in Pre-Prod before exposing it to end users

The architecture changes depending on whether you're serving at the edge or through a centralised deployment.
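To make the backend piece concrete, here's roughly how an Amazon Bedrock `invoke_model` request is assembled for an Anthropic-family model. The body fields follow the Bedrock messages format as I understand it, and field names vary per model family, so check the model's documentation before relying on this:

```python
import json

def build_invoke_request(model_id: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build kwargs for bedrock-runtime's invoke_model (Anthropic messages format)."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return {"modelId": model_id, "body": json.dumps(body)}

# With AWS credentials configured, the actual call would look like:
#   import boto3
#   client = boto3.client("bedrock-runtime")
#   response = client.invoke_model(**build_invoke_request(
#       "anthropic.claude-3-haiku-20240307-v1:0", "Hello"))

req = build_invoke_request("anthropic.claude-3-haiku-20240307-v1:0", "Hello")
assert "modelId" in req and "body" in req
```

Keeping request construction in a plain function like this also makes it easy to unit-test in the Dev and Pre-Prod stages without touching a live endpoint.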

Key Takeaways

After this lesson, a few things really stuck with me:

  • Building a model is the easy part. Running it consistently in production, testing it, keeping track of its history, and being able to do it again is the real engineering work.

  • The MLOps maturity model is a journey, not a checklist. You can start at Level 0 if that's where you are now. You get to the higher levels bit by bit.

  • LLMOps is MLOps with a GenAI lens. You're a lot closer to LLMOps than you think if you already know how MLOps works.

  • Model selection should be data-driven. With a prompt catalogue and a structured evaluation process, you don't have to guess or agonise over which LLM to choose.

Conclusion

In the end, Raghul's talk made it clear that having the tools isn't enough; what counts is how you use them. From SageMaker to Bedrock to CodePipeline, AWS gives you a remarkably complete toolset. But even the best tools can't fix a broken process if you ignore testing, tracking, and reproducibility.

If you're a student just starting to learn machine learning, a developer looking into GenAI, or an engineer building real systems at work, you need to understand this operational layer. This is what sets people who play with AI apart from those who ship AI. The talk at AWS Student Community Day Tirupati taught me that there isn't as much of a gap between the two as most people think. You have to get on that growth curve somewhere and keep going up.

The event also had a number of other great sessions, such as ones about cloud design, hands-on demos, and more. But this one helped me learn how to organise my thoughts in a way that I will use in all future AI projects.

About the Author

As an AWS Community Builder, I enjoy sharing the things I've learned through my own experiences and events, and I like to help others on their path. If you found this helpful or have any questions, don't hesitate to get in touch! 🚀

🔗 Connect with me on LinkedIn

References

Event: AWS Student Community Day Tirupati

Topic: From MLOps to LLMOps: A Practical AWS GenAI Operations Guide

Date: November 01, 2025

Also Published On

AWS Builder Center

Hashnode
