<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashish Upadhyay</title>
    <description>The latest articles on DEV Community by Ashish Upadhyay (@ashishupadhyay).</description>
    <link>https://dev.to/ashishupadhyay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3353247%2F1c344f09-fd4b-4e5f-b0a6-93e0c7532f64.jpeg</url>
      <title>DEV Community: Ashish Upadhyay</title>
      <link>https://dev.to/ashishupadhyay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ashishupadhyay"/>
    <language>en</language>
    <item>
      <title>Learning about real-life DevOps: The case of Morningstar</title>
      <dc:creator>Ashish Upadhyay</dc:creator>
      <pubDate>Mon, 14 Jul 2025 10:39:38 +0000</pubDate>
      <link>https://dev.to/ashishupadhyay/learning-about-real-life-devops-the-case-of-morningstar-j54</link>
      <guid>https://dev.to/ashishupadhyay/learning-about-real-life-devops-the-case-of-morningstar-j54</guid>
      <description>&lt;p&gt;When I joined Morningstar Inc. as a DevOps engineer in 2022, I was a typical new graduate: excited to start my career and equipped with a few Azure certifications, working knowledge of Jenkins and Docker, and hands-on experience with Kubernetes. I did not expect to make a significant impact early on. That changed two months later, when I was assigned my first major task: a long-running machine learning (ML) project called Investor Pulse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the Investor Pulse?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The Investor Pulse is a data-driven ML project that generates regional and global stock market insights. It had been running for four years before I joined, and it was an unusually demanding workload. The project’s scale was impressive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;36 ML models run each month &lt;/li&gt;
&lt;li&gt;Coverage: Global Equity, Global Fixed Income, and Global Allocation models, plus regional models for the US, Canada, Japan, Europe, and Asia &lt;/li&gt;
&lt;li&gt;Each model ran on a massive m4.16xlarge EC2 instance (64 vCPUs, 256 GB RAM) &lt;/li&gt;
&lt;li&gt;Cost per instance: $3.20 per hour &lt;/li&gt;
&lt;li&gt;Typical run: 10-15 days per month, for a total monthly spend of approximately $15,000 &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the surrounding setup was minimal and had large gaps that could let failures go unnoticed: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No container orchestration &lt;/li&gt;
&lt;li&gt;No monitoring &lt;/li&gt;
&lt;li&gt;Logs sent only to CloudWatch &lt;/li&gt;
&lt;li&gt;Deployments triggered via Jenkins pipelines &lt;/li&gt;
&lt;li&gt;Code pulled from Bitbucket and deployed using CodeDeploy &lt;/li&gt;
&lt;li&gt;Outputs stored on S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The turning point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One day, the Investor Pulse reached its turning point. During one of my log reviews, I noticed that the Global Equity model had been loading data for eight days without progressing. CloudWatch logs showed that the process had started, but no output had been generated. I then connected to the EC2 instance, checked system metrics, and identified the issue: the machine had no RAM left. The process had stalled, but the instance kept running and kept accruing costs. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The temporary fix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To solve the problem, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manually terminated the unresponsive EC2 instance &lt;/li&gt;
&lt;li&gt;Re-launched the job on an m4.32xlarge (twice the RAM and vCPUs)&lt;/li&gt;
&lt;li&gt;Re-ran the job, which then completed successfully&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That resolved the immediate memory issue. However, I still wondered: How can we prevent this failure in the future? And can we justify spending $15,000 per month if we do not actually need to? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Making the change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I raised these concerns with the Director of Quant and the Director of Software Engineering and made my case: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without monitoring, we do not notice failures until it is too late.&lt;/li&gt;
&lt;li&gt;Upgrading to larger EC2 instances every time would increase costs in the long term. &lt;/li&gt;
&lt;li&gt;We have no proactive way to detect or resolve other memory leaks or bottlenecks. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To their credit, the directors weighed the risk and allowed me to prototype a monitoring solution, even though this was a legacy workload. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution: Prometheus, Grafana, and Custom AMIs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I implemented the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom AMI with Prometheus Node Exporter pre-installed &lt;/li&gt;
&lt;li&gt;Jenkins jobs updated to use this AMI when launching EC2 instances &lt;/li&gt;
&lt;li&gt;Prometheus configured to discover new EC2 instances automatically via EC2 service discovery &lt;/li&gt;
&lt;li&gt;Grafana dashboards showing:

&lt;ol&gt;
&lt;li&gt;CPU and memory usage per model &lt;/li&gt;
&lt;li&gt;Runtime durations &lt;/li&gt;
&lt;li&gt;Cost projections based on utilisation&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
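
&lt;p&gt;To make the discovery step more concrete, here is a minimal sketch of what such a scrape job could look like, built as a Python dictionary and printed as JSON for readability. It is not the configuration we actually ran: the job name, the &lt;code&gt;Project&lt;/code&gt; tag filter, the &lt;code&gt;us-east-1&lt;/code&gt; region, and the label names are assumptions for illustration. The &lt;code&gt;ec2_sd_configs&lt;/code&gt; section, its &lt;code&gt;region&lt;/code&gt;, &lt;code&gt;port&lt;/code&gt;, and &lt;code&gt;filters&lt;/code&gt; fields, and the &lt;code&gt;__meta_ec2_*&lt;/code&gt; labels are part of Prometheus itself, and 9100 is the default Node Exporter port.&lt;/p&gt;

```python
import json


def build_scrape_config(region, node_exporter_port=9100):
    """Return a Prometheus scrape job that discovers EC2 instances automatically."""
    return {
        "job_name": "investor-pulse-nodes",  # hypothetical job name
        "ec2_sd_configs": [
            {
                "region": region,
                "port": node_exporter_port,  # Node Exporter's default port
                # Hypothetical tag filter: only scrape instances tagged for this project.
                "filters": [
                    {"name": "tag:Project", "values": ["investor-pulse"]}
                ],
            }
        ],
        "relabel_configs": [
            # Copy the EC2 "Name" tag into a "model" label (assumes one instance per model).
            {
                "source_labels": ["__meta_ec2_tag_Name"],
                "target_label": "model",
            },
            # Keep the instance type so dashboards can group by it.
            {
                "source_labels": ["__meta_ec2_instance_type"],
                "target_label": "instance_type",
            },
        ],
    }


if __name__ == "__main__":
    # The region is an assumption; use whichever region the instances run in.
    config = {"scrape_configs": [build_scrape_config("us-east-1")]}
    print(json.dumps(config, indent=2))
```

&lt;p&gt;Translated to YAML, the printed structure would go under &lt;code&gt;scrape_configs&lt;/code&gt; in &lt;code&gt;prometheus.yml&lt;/code&gt;. Because every instance launched from the custom AMI already runs Node Exporter, a new instance can be scraped as soon as Prometheus discovers it, with no manual target list to maintain.&lt;/p&gt;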

&lt;p&gt;Within a few days, we had comprehensive visibility into 36 previously invisible workloads. &lt;/p&gt;

&lt;p&gt;With that visibility, I soon noticed significant inefficiencies: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Japanese model was using less than 10% of the m4.16xlarge capacity. &lt;/li&gt;
&lt;li&gt;Most models were CPU-bound rather than memory-intensive. &lt;/li&gt;
&lt;li&gt;The global models had real memory requirements but did not use the full 256 GB of RAM. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We had been running the same large, memory-heavy EC2 instance type across the board without a valid reason. Hence, I needed to find another solution. &lt;/p&gt;
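
&lt;p&gt;The kind of reasoning behind these observations can be sketched in a few lines of Python. This is an illustration only: the function name, the 70% and 10% thresholds, and the sample utilisation figures are assumptions, not the values or measurements from the actual dashboards.&lt;/p&gt;

```python
def classify_workload(cpu_pct, mem_pct, busy=70.0, idle=10.0):
    """Classify a run from its average CPU and memory utilisation (0-100).

    The `busy` and `idle` thresholds are hypothetical defaults.
    """
    if idle > cpu_pct and idle > mem_pct:
        return "oversized"      # barely touching the instance at all
    if cpu_pct >= busy and busy > mem_pct:
        return "cpu-bound"      # candidate for a compute-optimised instance
    if mem_pct >= busy and busy > cpu_pct:
        return "memory-bound"   # candidate for a memory-optimised instance
    return "mixed"


if __name__ == "__main__":
    # Made-up sample averages (cpu %, memory %), not real measurements.
    samples = {
        "japan": (8.0, 6.0),
        "us": (85.0, 20.0),
        "global-equity": (60.0, 75.0),
    }
    for model, (cpu, mem) in samples.items():
        print(model, classify_workload(cpu, mem))
```

&lt;p&gt;In practice the inputs would come from the Node Exporter metrics averaged over a full run, and the thresholds would need tuning per workload.&lt;/p&gt;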

&lt;p&gt;&lt;strong&gt;The optimisation: switching to compute-optimised EC2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After analysis and testing, I moved our workload from m4.16xlarge to c5.9xlarge instances, which resolved the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New instances cost $1.52 per hour.&lt;/li&gt;
&lt;li&gt;Compute-optimised instances used multiple CPU cores more efficiently. &lt;/li&gt;
&lt;li&gt;Average runtime dropped from 15 days to 7-8 days. &lt;/li&gt;
&lt;/ul&gt;
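
&lt;p&gt;As a rough back-of-the-envelope check, the figures above can be combined into a cost for a single instance run. The sketch below assumes on-demand pricing, continuous running for the whole period, the 15-day runtime for the old setup, and 7.5 days as the midpoint of the new 7-8 day range. It is per-instance arithmetic, not a reconstruction of the full monthly bill.&lt;/p&gt;

```python
HOURS_PER_DAY = 24


def run_cost(hourly_rate, days):
    """On-demand cost in dollars of one instance running continuously for `days`."""
    return hourly_rate * HOURS_PER_DAY * days


old = run_cost(3.20, 15)    # m4.16xlarge at $3.20/hour for a 15-day run
new = run_cost(1.52, 7.5)   # c5.9xlarge at $1.52/hour, midpoint of 7-8 days
saving_pct = (old - new) / old * 100

if __name__ == "__main__":
    print(f"old run: ${old:,.2f}")
    print(f"new run: ${new:,.2f}")
    print(f"saving:  {saving_pct:.0f}%")
```

&lt;p&gt;Under these assumptions, a single run drops from roughly $1,152 to about $274. The saving compounds because both factors shrink at once: the hourly rate is lower, and the instance runs for about half as long.&lt;/p&gt;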

&lt;p&gt;The table below compares cost and runtime before and after the change: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjsbadlprjeyraeareje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjsbadlprjeyraeareje.png" alt=" " width="471" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons learnt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Working through this problem and implementing the solution myself taught me a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring is necessary, even for legacy workloads&lt;/strong&gt;. A system that runs is not necessarily a system that runs well. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource optimisation is a skill&lt;/strong&gt;. Avoid reaching for larger EC2 instances by default. Instead, understand your workload and determine whether the models are CPU-bound or memory-bound. Only then choose an instance type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost is not equal to performance&lt;/strong&gt;. Bigger is not always better. In the Investor Pulse case, performance improved after we reduced both the instance size and the cost. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribute and be proactive from the start&lt;/strong&gt;. The Investor Pulse was a legacy project without an actual owner. Even so, I was not afraid to ask questions after just one month on the job. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small successes can have long-term positive effects&lt;/strong&gt;. My story inspired other teams to analyse their workloads and identify potential memory-related inefficiencies. I also guided some of them through right-sizing and monitoring setup. This proactive approach helped me build trust in the company, which mattered all the more as a new employee. Most importantly, I contributed to resource optimisation on a larger scale: across the entire company.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking back, this project became my “school of DevOps life” and taught me more than any certification could. By the second month of my work, I had identified invisible issues, made a convincing case to managers, implemented monitoring from scratch, and made infrastructure decisions that saved time and money. &lt;/p&gt;

&lt;p&gt;I advise fresh graduates to adopt the mindset that early-career professionals can also make a difference. What is needed is curiosity, agency, and the willingness to ask: Why are we using this approach in the project? Analysing how the work is actually done will help you succeed. &lt;/p&gt;

</description>
      <category>devops</category>
      <category>aws</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
