Ashish Upadhyay

Learning about real-life DevOps: The case of Morningstar

When I joined Morningstar Inc. as a DevOps engineer in 2022, I was a typical new graduate: excited to start my career and equipped with a few Azure certifications, working knowledge of Jenkins and Docker, and hands-on experience with Kubernetes. I did not initially expect to make a significant impact; that changed two months later, when I was assigned my first major task - a long-running machine learning (ML) project, Investor Pulse.

What is the Investor Pulse?

The Investor Pulse is a data-driven ML project that generates regional and global stock market insights. It had been running for four years before I joined, and it was an exceptionally heavy workload. The project’s scale was impressive:

  • 36 ML models run each month
  • Covering global strategies (Global Equity, Global Fixed Income, Global Allocation) and regional models for the US, Canada, Japan, Europe, and Asia
  • Each model ran on a massive m4.16xlarge EC2 instance (64 vCPUs, 256 GB RAM)
  • Cost per instance: $3.20 per hour
  • Typical run: 10-15 days per month, totalling approximately $15,000 in monthly spend
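
For a rough sense of how that bill accumulates (my own arithmetic, not a figure from the billing reports): at $3.20 per hour, one instance running around the clock costs about $77 a day, so a 10-15 day cycle works out to roughly $770-$1,150 per instance, and across the fleet of models this added up to the approximately $15,000 monthly spend.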

However, the existing setup was minimal, with gaps that could easily lead to failures:

  • No container orchestration
  • No monitoring
  • Logs sent only to CloudWatch
  • Deployments triggered via Jenkins pipelines
  • Code pulled from Bitbucket and deployed using CodeDeploy
  • Outputs stored on S3

The turning point

One day, the Investor Pulse reached its point of no return: during one of my log reviews, I noticed that the Global Equity model had been loading data for eight days without progressing. CloudWatch logs showed that the process had started, but no output was being generated. I connected to the EC2 instance, checked the system metrics, and identified the issue: the machine had run out of RAM. The process had stalled, but the instance kept running and kept accruing cost.
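
For readers who want to reproduce this kind of check, here is a minimal sketch of what "no RAM left" looks like from on the box. It assumes Python and psutil are available on the instance, which was not part of the original setup; in practice, standard tools such as free or top tell the same story.

```python
# Minimal sketch (assumption): an on-instance memory check of the kind that
# exposed the stalled Global Equity run. psutil is an assumed dependency.
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM: {mem.percent:.0f}% used of {mem.total / 2**30:.0f} GiB "
      f"({mem.available / 2**30:.1f} GiB still available)")
print(f"Swap: {swap.percent:.0f}% used")

# The symptom on the stalled instance: essentially no available memory while
# the model process was still alive, so the job neither progressed nor failed.
if mem.available < 2 * 2**30:
    print("WARNING: instance is effectively out of memory")
```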

The temporary fix

To solve the problem, I:

  • Manually terminated the unresponsive EC2 instance
  • Re-launched the job on a larger instance type with double the vCPUs and RAM
  • Re-executed and completed the job successfully

That resolved the immediate memory issue. However, I still wondered: how can we prevent this failure in the future? And how can we justify spending $15,000 per month without a clear need?

Making the change

I raised these concerns with the Director of Quant and the Director of Software Engineering and made my case:

  • Without monitoring, we do not recognise failures until it is too late.
  • Upgrading to ever-larger EC2 instances would only increase costs in the long term.
  • If other memory leaks or bottlenecks occur, we have no proactive way to detect or resolve them.

To their credit, the managers weighed the risk and allowed me to prototype a monitoring solution, despite it being a legacy workload.

Solution: Prometheus, Grafana, and custom AMIs

The solution I implemented consisted of:

  • Custom AMI with Prometheus Node Exporter pre-installed
  • Jenkins jobs updated to use this AMI when launching EC2 instances (a sketch of that launch step follows this list)
  • Prometheus configured to discover the new instances automatically via EC2 service discovery
  • Grafana dashboards showing:
    1. CPU and memory usage per model
    2. Runtime durations
    3. Cost projections based on utilisation
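
As an illustration of the launch-from-AMI step, below is a minimal sketch of how a pipeline stage can pin the custom AMI when spinning up a model's EC2 instance, assuming boto3. The AMI ID, region, and tag values are placeholders rather than values from the real setup; tagging the instances also gives Prometheus' EC2 service discovery something to filter and relabel on.

```python
# Minimal sketch (assumption): a pipeline step launching a model run from the
# custom AMI, so Prometheus Node Exporter is already installed on boot.
# The AMI ID, region, and tag values below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # custom AMI with Node Exporter baked in
    InstanceType="m4.16xlarge",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [
            {"Key": "Project", "Value": "investor-pulse"},
            {"Key": "Model", "Value": "global-equity"},
        ],
    }],
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}; Prometheus will pick it up via EC2 discovery")
```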

Within a few days, we had comprehensive visibility into 36 previously invisible workloads.

With the dashboards in place, significant inefficiencies soon became obvious (the kind of utilisation queries behind these numbers is sketched after the list):

  • The Japan model was using less than 10% of its m4.16xlarge's capacity.
  • Most models were CPU-bound rather than memory-intensive.
  • The global models did have higher memory requirements, but nowhere near 256 GB of RAM.
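
Utilisation figures like these come straight out of the Node Exporter metrics. As a rough sketch (the Prometheus URL and instance address are placeholders, and the actual dashboards were built in Grafana rather than queried from a script), here is how the same numbers can be pulled from the Prometheus HTTP API:

```python
# Minimal sketch (assumption): pulling per-instance utilisation from the
# Prometheus HTTP API to back up the right-sizing argument. The Prometheus
# URL and the instance label value are placeholders, not real values.
import requests

PROMETHEUS = "http://prometheus.internal:9090"

def instant_query(expr: str) -> float:
    """Run a PromQL instant query and return the first result as a float."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

instance = "10.0.1.23:9100"  # Node Exporter target for one model's EC2 box

# Memory utilisation: 1 - available/total, from node_exporter metrics.
mem_used = instant_query(
    f'1 - node_memory_MemAvailable_bytes{{instance="{instance}"}} '
    f'/ node_memory_MemTotal_bytes{{instance="{instance}"}}'
)

# CPU utilisation: share of time not spent idle over the last 5 minutes.
cpu_used = instant_query(
    f'1 - avg(rate(node_cpu_seconds_total{{instance="{instance}",mode="idle"}}[5m]))'
)

print(f"memory in use: {mem_used:.0%}, CPU in use: {cpu_used:.0%}")
```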

We had been provisioning oversized, memory-heavy instances across the board without a valid reason. Hence, I needed to find a better fit.

The optimisation: switching to compute-optimised EC2

After analysis and testing, I moved our workload from m4.16xlarge to c5.9xlarge instances, which resolved the problem:

  • New instances cost $1.52 per hour.
  • The compute-optimised instances handled our CPU-bound models far more efficiently.
  • Average runtime dropped from roughly 15 days to 7-8 days.

To visualise the impact, the table below compares the two setups (figures as quoted earlier in this post):

  Instance        Hourly cost   vCPUs / RAM     Typical runtime per cycle
  m4.16xlarge     $3.20         64 / 256 GB     10-15 days
  c5.9xlarge      $1.52         36 / 72 GB      7-8 days
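
As a rough back-of-the-envelope comparison (my own arithmetic, assuming a model occupies its instance around the clock for the full run window): the old configuration works out to about $3.20 × 24 × 15 ≈ $1,150 per model cycle, while the new one comes to about $1.52 × 24 × 8 ≈ $290 - roughly a 75% reduction, driven by both the lower hourly rate and the shorter runtime.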

Lessons learnt

Working through this problem and implementing the solution myself taught me a lot:

  • Monitoring is necessary, even for legacy workloads. The fact that a system runs does not mean it runs well.
  • Resource optimisation is a skill. Do not reach for a larger EC2 instance by default; understand your workload, determine whether the models are CPU-bound or memory-bound, and then decide.
  • Cost does not equal performance. Bigger is not always better: in Investor Pulse’s case, performance improved after moving to smaller, cheaper instances.
  • Contribute and be proactive from the start. The Investor Pulse was a legacy project without a clear owner, but I was not afraid to ask questions after just one month on the job.
  • Small successes can have long-term effects. My story inspired other teams to analyse their workloads and look for memory-related inefficiencies, and I guided some of them through right-sizing and monitoring setup. This proactive approach helped me build trust within the company, which was especially valuable as a new employee. Most importantly, it contributed to resource optimisation on a larger scale - across the entire company.

Reflecting on it now, this project became my “school of DevOps life” and taught me more than any certification could. Within only my second month on the job, I was able to identify invisible issues, make a convincing case to management, implement monitoring from scratch, and take infrastructure decisions that saved both time and money.

I advise fresh graduates to adopt the mindset that early-career professionals can drive change too. What it takes is curiosity, agency, and the willingness to ask: why are we using this approach? Analysing the processes you work with will help you succeed.
