
Kazuya


AWS re:Invent 2025 - Apollo Tyres Accelerates Engineering Workflows with HPC on AWS (IND368)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Apollo Tyres Accelerates Engineering Workflows with HPC on AWS (IND368)

In this video, Apollo Tyres and AWS discuss how the tire manufacturer accelerated engineering workflows by implementing high-performance computing on AWS. Shailender Gupta explains Apollo's digital transformation journey, starting with IoT data collection and building a 400+ terabyte data lake, then moving SAP to cloud and implementing tire genealogy tracking. The company faced challenges with complex tire simulations requiring significant compute resources. By migrating from on-premises HPC to AWS cloud using FSx storage and the Tachyon management platform, Apollo achieved 60% cost savings by running simulations on Linux instead of Windows, reduced simulation time by 60%, and shifted from CapEx to OpEx model. Gautam Kumar demonstrates Tachyon's capabilities including job submission, workstation management, and AI-powered assistance using Amazon Bedrock. The solution enables self-service for R&D teams, provides granular cost visibility, and supports virtual prototyping. Future roadmap includes chemical compound simulation using AI, complexity reduction, and global expansion to multiple R&D centers.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Apollo Tyres' Journey to Accelerate Engineering Workflows with AWS

Welcome. Thank you all very much for coming and for spending your last hour of what must have been a very busy day already with us. My name is Alex Francois-Saint-Cyr. I'm the Business Development Lead for Product Engineering in North America. Shailender, Gautam, and I are thrilled to share with you how Apollo Tyres has been accelerating engineering workflows by leveraging high performance computing on AWS.

Thumbnail 0

Let me go ahead and jump into the agenda for this session. I will first go through a quick introduction on how we at Amazon and AWS are envisioning engineering excellence and how we execute it as well. I will then take a deeper dive into engineering simulation and how it can be implemented on the cloud and discuss a few areas that are key to tackle to accelerate product development.

Thumbnail 40

Shailender will then take over to introduce Apollo Tyres and the challenges that they faced with tire development. Gautam will then come in after that section to provide details on the solutions built to accelerate engineering simulation and the benefits that it generated for Apollo Tyres. We will conclude the session with a discussion on the lessons learned and what is next for Apollo Tyres. If you have any questions, we will be very happy to answer them after the session ends at 6:30 p.m., and we will be able to do this outside of the room.

Thumbnail 110

Engineering Excellence Through Digital Transformation: AWS's Vision and Practice

Let me dive into engineering excellence now. We see engineering excellence being successfully driven by three key aspects. The first one is really about competing in the digital-first manufacturing world. Traditional manufacturers now face digital-native competitors who are moving at unprecedented speed. Success requires two things: first, modernizing operations through cloud technologies and automation, and also accelerating product development to reduce time to market. Digital transformation is no longer an option; it has become existential.

Thumbnail 150

The next one is engineering complexity and mastering the digital thread. Modern products are incredibly complex, with millions of lines of code, thousands of components, and many suppliers. The digital thread connects data throughout the entire product lifecycle to enable seamless management of requirements, configurations, and multidisciplinary design. We can share an example with you about how Apollo Tyres built a tire genealogy, which is a digital thread type of approach, and I think this will be quite interesting for you to see.

Finally, simulations allow us to test thousands of experiments virtually, saving time and enabling innovation that would be impossible if we were relying solely on physical prototypes. The last aspect I want to talk about is excellence at scale by mastering quality and efficiency. Customers now demand higher quality, shareholders expect better margins, and competitors constantly optimize.

Thumbnail 220

Advanced technologies such as AI, machine learning, and computer vision become game changers. Quality has to be built into every single process and not just inspected in. The other aspect is cost reduction, which means eliminating waste and making smarter decisions. It is not about cutting corners. These drivers are interconnected and mutually reinforcing. Digital-first operations enable the digital thread, and at the same time, the digital thread provides data for quality optimization.

Thumbnail 260

Manufacturers who are able to tackle all three criteria will be able to thrive in the global marketplace. Let's dive a little bit into how at Amazon we practice what we preach. We design much of our own hardware and we source from a global supply chain, giving us firsthand experience with these challenges.

Thumbnail 290

Our approach is rooted in working backwards from our customers.

Thumbnail 380

This leads us to rapid prototyping, rapid innovation, and a customer-centric methodology that drives everything we do. We have multiple globally distributed teams who work on data center design and infrastructure that powers our global operations. We also have teams that work on devices like the Echo or the Kindle for product development. Additionally, we have a robotics division where we build new AI services that are revolutionizing our fulfillment centers. A key enabler for us to accelerate product development is the ability to access secure cloud-based design and collaboration tools. Whether we are developing integrated circuits or even satellites with Amazon Leo, we leverage the cloud to enable our teams to work together seamlessly, securely, at speed, and at scale.

Thumbnail 420

Thumbnail 430

Thumbnail 440

Thumbnail 450

Time to Results: The Critical Role of Simulation Workflows in Product Development

What I'm going to do next is look into time to results, because that is critical to developing products faster. The faster you can iterate through the design cycle, the more innovative, reliable, and higher quality your products will become. Let me walk through what a typical simulation workflow is in case you are not familiar with it. It starts with the design, which is the concept of the initial product you might be working on. This moves into model preparation, which is when you add boundary conditions and create a virtual representation of the component or product to assess its performance. From there, you move into planning the different simulation studies, which is about exploring the design space so you have a good understanding of how that product behaves based on different scenarios. The process ends with submitting the jobs to run those simulations and then getting results so you can do post-processing and understand what kind of performance you are getting from this product.
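To make the study-planning step concrete, here is a minimal sketch, not taken from the talk, of how a design-space sweep might be enumerated before jobs are submitted; the design variables and load cases are purely illustrative.

```python
from itertools import product

# Hypothetical design variables for a tire model -- illustrative only.
tread_depths_mm = [7.0, 8.0, 9.0]
compounds = ["compound_A", "compound_B"]
load_cases = ["dry_braking", "wet_braking", "rolling_resistance"]

# Enumerate every combination as one simulation job to be submitted later.
jobs = [
    {"tread_depth_mm": d, "compound": c, "load_case": l}
    for d, c, l in product(tread_depths_mm, compounds, load_cases)
]

for i, job in enumerate(jobs, start=1):
    print(f"job {i:02d}: {job}")
```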

Thumbnail 460

Thumbnail 520

Simulations are required throughout all phases of the product life cycle. First, as we discussed, is to explore the design space because this is where innovation happens and competitive advantage is built. As you are running simulations, you can rapidly test different scenarios or design alternatives for your products. The second purpose is improving robustness and reliability. Simulation can help identify potential failure modes and optimize designs before any physical prototypes are built. Finally, simulations can be used to prepare manufacturing operations and de-risk technologies, which allows you to identify potential production issues early on, saving significant time and cost.

Why are customers increasingly choosing the cloud for engineering simulation? Let me break this down into four key benefit areas. The first is capacity benefits. Having access to massive capacity means teams do not have to wait to run simulations, which increases productivity. A pay-per-use model provides business flexibility, while elasticity gives you the ability to scale up and down depending on whether you need to run more projects or fewer projects. These are actual benefits that Apollo Tyres experienced, and you will learn about this from Shailender Gupta. The next benefit is global advantages, being able to have teams anywhere in the world collaborate seamlessly. You also have access to disaster recovery, resilience, and business continuity built in. Teams are able to access resources from anywhere, anytime to do their job.

We have a few customers now that use this to really follow a "follow the sun" approach, so design work is being done all day long across the world. Then we have economic benefits such as accessing the latest technology without having to make any capital investment. Another economic benefit is that you purchase a lot of software licenses, so you want to ensure those licenses are being used in the most effective and efficient way. By doing this on the cloud, you get a better return on investment. Finally, there are environmental benefits because as you leverage the cloud, this might help you achieve some of your sustainability goals within your company, and we provide this through shared infrastructure.

Thumbnail 690

Cloud-Based HPC Architecture: Optimizing Compute Resources for Engineering Simulation

These benefits fundamentally transform how engineering teams work, removing traditional constraints around compute capacity and enabling innovation at scale. Let me show you how we have architected our computer-aided engineering approach for HPC to deliver those benefits. The system has three main components. The first one is what we call the front end, which includes a portal so you can do job submission and job management. It also contains the virtual desktop infrastructure that allows you to run your engineering applications, and finally DCV, which is a technology that enables high-performance remote visualization.

The next component is compute, which encompasses the core compute clusters that run your simulations with cluster management features and a wide choice of purpose-built compute instances, as I will show in the next slide. The last component, but nonetheless a very critical one, is data. Many of those simulations generate a lot of data, and those files are quite large. We need to ensure proper storage of the simulation data based on when this data might be needed and used for other projects, as well as the management of these data for traceability purposes. This is where our digital thread comes back into the picture.

Thumbnail 790

In summary, this architecture provides the flexibility to match simulation workloads with the optimal compute resources, which I will explore with you now. This is where we differentiate on this front, with each job able to leverage the most optimized instance type. Typically, our customers have complex products and may need to run jobs requiring different types of physics, such as multiphysics, or different solvers like computational fluid dynamics or finite element analysis. These would be jobs one, two, three, four, and so on, and each job can be matched to the best instance based on compute, memory, or other specific requirements.

The cost and performance optimization comes from fine-tuned matching of jobs to the best instance type from this broad selection. When it comes to how you would take costs into consideration, you can utilize different types of instances. We have spot instances for cost-sensitive batch jobs, on-demand instances for critical jobs that will not tolerate any interruptions, and savings plans for baseline always-on compute capacity. Shailender Gupta and Gautam Kumar will also show you how Apollo Tyres took advantage of those different instances and approaches to optimize the cost of those runs.
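As a rough illustration of matching jobs to instance types and purchasing options, the sketch below launches a solver node with boto3, requesting Spot capacity for an interruption-tolerant batch run and On-Demand otherwise. The instance types, AMI ID, and selection logic are assumptions for illustration, not details from Apollo's setup.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_solver_node(instance_type: str, interruption_tolerant: bool, ami_id: str):
    """Launch one compute node, using Spot pricing for interruption-tolerant batch jobs."""
    params = {
        "ImageId": ami_id,              # assumed solver AMI
        "InstanceType": instance_type,  # chosen per job's compute/memory needs
        "MinCount": 1,
        "MaxCount": 1,
    }
    if interruption_tolerant:
        # Spot capacity for cost-sensitive batch jobs that can be retried if interrupted.
        params["InstanceMarketOptions"] = {"MarketType": "spot"}
    return ec2.run_instances(**params)

# Example: a batch CFD sweep that tolerates interruption vs. a critical FEA run.
# launch_solver_node("c6a.32xlarge", True, "ami-0123456789abcdef0")
# launch_solver_node("hpc6a.48xlarge", False, "ami-0123456789abcdef0")
```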

Looking ahead, we are also taking into account AI to potentially set up smart provisioning in order to automatically recommend the optimal instance for each workload.

Thumbnail 910

Talking about AI, we obviously have GPU-based instances, and those need to be taken into consideration on the physics-based simulation front as well. Independent software vendors such as Siemens have been redesigning their solvers to run on GPUs to accelerate time to results. We do have those options, and you may have heard Matt Garman this morning also mentioning a new one in the P6 family with the NVIDIA B300 chip that is available as part of this family. This really gives our customers tremendous flexibility not only to solve compute-intensive simulations, but also to run AI and machine learning workloads.

With this breadth of options, you have the opportunity to match your specific workload requirements, whether it is AI-driven design optimization or simulations, to the most cost-effective and performance instance type. Now with no further delay, I will pass things over to Shailender, who will introduce Apollo Tyres and take a deeper dive into the different challenges that they experienced.

Thumbnail 1010

Apollo Tyres: Global Presence and the Imperative for Price-Performance Solutions

Thank you, Alex. Before I go into the problem statement and technical details, let's try to understand why we are talking about this today and what Apollo Tyres is and why this is important to us. We are a well-known brand in Asia Pacific. In the US, we are heavily invested and we primarily sell under the name of Vredestein tires, which is our premium brand for luxury and sports cars. We have a global presence, with Europe being our second biggest market. We have seven manufacturing plants, five in India and two in Europe. We have the entire suite of product lines including passenger car tires, farm tires, trucks, off-roading vehicles, and two-wheelers. With that, you further add the complexity of summer tires, winter tires, all-season tires, and in areas with heavy snowfall, studded tires.

Managing all of these ecosystems makes running our operations far from easy. We are a $3.5 billion company with 20,000 people across geographies. Designing products is becoming increasingly complex. There are newer vehicle models coming out constantly, and electric vehicles have their own specific demands for tire types. In this competitive market, what you need is reduced time to market: how soon you can design a good product and bring it to market.

For all of this, essentially what you need is a good solution that can help you design a good product, and that solution should also be cost-effective at the same time. It is not just that you need a solution, but a solution that can give you the price-performance benefit. In the next 30 minutes, I will take you through this whole journey—what we did, why we did it, what insights we used to make a decision, and why we moved to the cloud.

Thumbnail 1110

I shared information about Apollo Tyres and why this is important, as I mentioned regarding price-performance and the vision that we want to achieve. These are some of the targets that have been set by our management and our board. To achieve these targets, as I mentioned, time to market and price-performance ratio are very critical, even for tires. So it is important that we design the right product. We heavily invest in our R&D with two R&D centers in India and in Europe. For both R&D centers, it is important that we give them the right resources where they can do simulations with chemical compounds and with the design of tires.

Thumbnail 1160

Building the Foundation: Apollo's Digital Journey from IoT Data Lake to Tire Genealogy

How we started this digital journey was not a sudden overnight decision. Back at the beginning of 2021, we decided that we needed to move to the cloud. We needed to come out of the silos for all the plants and consolidate our data and resources into a single point of view. The first project we started was the collection of IoT data, which is a gold mine for any manufacturing company. We implemented real-time streaming of IoT data from our manufacturing plants, machines, and PLCs. That was the first project we implemented. What do I do with this data? How do I collect it? I need a good data lake to collect the data. So we created our first data lake on the cloud using S3 and Redshift.

As we grew, we started using other services like Glue, DMS, and Aurora. Today we have a massive data lake with more than 400 terabytes of data, which includes a lot of X-ray images and video analytics that we collect for quality checks of tires.

In addition to that, we have structured and relational data that comes from systems like SAP or transactional systems running as SaaS products, such as CRMs. Altogether, that data would be around 8 terabytes compressed. If you know Redshift, it heavily compresses the data, which runs on a multi-node cluster for us. This was the second step we took, and it gives a lot of insights to my R&D team.

Before we design a new product, chemical simulation, compound properties, and test results that we conduct in the test lab or even from the field are extremely important. We need to understand what the performance of the previous product was on the field, what kind of products are giving complaints, and why they are giving complaints.

The next step was to bring the rest of our systems closer to the cloud, so we moved our entire SAP solution and ERP from on-premises to the cloud. Now that we have a majority of our production workloads running within the cloud, we focused on the Industry 4.0 journey. We wanted to connect more of our IoT devices. Initially, it was just the mixers, then later on the extruders, the belt cutters, the tire building machine, and the last point of the tire, which is called the curing stage, where the tire is baked.

By the way, I was talking to someone today, and when I told them the tire is glued together, they were surprised. Sometimes it feels like the tire is a black doughnut where you just pour the rubber in and then bake a tire. However, a simple tire can be assembled with 200 or more components which are glued together and then baked. The entire life cycle can be one to two hours, with 20 to 40 minutes going into baking itself, depending on how big the tire is.

If it is a mining tire, which would be twice my height, probably 12 feet, it takes a long time to build such specialized tires. So it is not an easy product, and I am proud that we made tires. Whether you buy a Mercedes, a Bugatti, or a regular car, you cannot run it without a tire. Whatever the price of the car is, you have to buy a tire to run it.

Thumbnail 1390

All said and done, this was our journey and how we started; the next step was to move to high-performance computing, simulations, and beyond. We had seen that the cloud was giving us ROI and returns. On the data lake, as I said, 400 terabytes of data is there, and structured data is there. But what do I do with this data unless I can get answers out of it?

Thumbnail 1410

Genealogy is one of the first things that my R&D needs before they design a tire. What is genealogy? Starting from the finished product, I want to understand what components were used. As I said, 200 or more components could be used in the manufacturing of a tire. I want to know what component was used, who supplied the material, who supplied the carbon, who supplied the natural rubber, who supplied the steel and the nylon of the tire, on which assembly line it was manufactured, what the shift was, who the operator was, and what the weather conditions were.

Thumbnail 1440

Was it summer? Was it winter? Was it rainy? That is why the product is getting more moisture, bubbles, or failure rates. So I want to understand the whole genealogy of the tire. Sometimes the genealogy starts from top to bottom. It could be bottom to top, or it could be somewhere in the middle.

What I mean is that if there is a particular component which is giving me a failure, I want to know how many other child components have used this component so that I can trace out what kind of failure I can anticipate. Or if I want to know what components were used to manufacture that component, so that I can control the failure. So this is the genealogy where I want to start from anywhere in between. I want to know who are the parents and who are the children.

Thumbnail 1480

One of the most complex processes is how the systems collect all of this genealogy data, and it was one of the key drivers for our projects. Before we went to HPC, the teams wanted to know what kinds of components we were using and which components we could reduce, something called complexity reduction.

Thumbnail 1510

What I mean by complexity reduction is this: if I am manufacturing 10 tires, all of them 17 inch, for a particular car or SUV, and tire one uses 200 components while tire two uses 205, I want to know why those 5 extra components are in the second tire. I want to reduce the complexity of my product. These things were done so that we could give this input to my design team, who can use it for simulation when they design new tires on the HPC.

Thumbnail 1540

From Physical to Virtual: Addressing Tire Design Challenges with Cloud HPC

The next thing was that my team, my R&D team, had a lot of questions about the data we were collecting. One approach was to design a BI report for them every time, a business intelligence report in Tableau, QuickSight, Power BI, whatever you use. The problem is that every time I design a report, they have a new question. I design another report, and they have another question. There are probably 500 reports running in the company today, but everybody wants a custom version of a report; they want a different question answered. How many reports can I design?

The good thing is that we are living in the times of LLMs. What we did was put an LLM on top of my data lake. Now you ask whatever question you want in natural language: what kind of product is giving a problem, what are the top products, what products are giving issues in a particular period, what were my sales, anything you want to ask. My LLM converts it into a SQL statement, fetches the data, and gives it to my users. I don't need any BI solution. That was the next project we did, and it was extremely useful for my R&D and for manufacturing. It can also do root cause analysis: if there was a failure, what was the cause of that failure?
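The session does not show code for this, but a minimal sketch of the natural-language-to-SQL pattern described here could look like the following, assuming Amazon Bedrock for the model call and the Redshift Data API for execution; the model ID, schema hint, cluster, database, and user names are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
redshift = boto3.client("redshift-data", region_name="us-east-1")

# Placeholder schema description the model needs in order to write valid SQL.
SCHEMA_HINT = "table sales(product_id, region, complaint_count, sold_on date)"

def ask_data_lake(question: str) -> str:
    """Turn a natural-language question into SQL with an LLM, then run it on Redshift."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID
        system=[{"text": f"Return only one SQL query for this schema: {SCHEMA_HINT}"}],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    sql = resp["output"]["message"]["content"][0]["text"]

    run = redshift.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder cluster name
        Database="datalake",                    # placeholder database
        DbUser="analyst",                       # placeholder database user
        Sql=sql,
    )
    return run["Id"]  # poll describe_statement / get_statement_result with this ID

# ask_data_lake("Which products had the most complaints last quarter?")
```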

Obviously you need a good data layer, and as I mentioned, we started right. We had a very solid foundation of the data layer. On top of that, we are able to run it. We have clean data. We have trusted data governance in place, so all those things are there. Based on this, we started the next project. Now my R&D has all the tools they need to design a good tire. The problem still remains: how do I design a good tire? I need the hardware. I need a high performance computing solution to design a good tire.

Thumbnail 1670

Designing a tire is a pretty costly and complicated process. After you design a tire, testing it in the field requires a lot of effort. There are race tracks where we go and test the tire. We test how it brakes under various conditions, how it brakes on snow, under rain conditions, on wet surfaces. We measure the braking distance and how much noise it induces. In the case of electric vehicles, it is the other way around. You have to induce noise in the tire so that you know the vehicle is coming. Things are very dynamic depending on what kind of vehicle you are designing.

Thumbnail 1710

So designing a tire, as I mentioned, is a complicated process, and at the same time you have to reduce the cost. If you can design this tire in a virtual environment rather than the physical environment, you don't have to get a die, bake the tire, and test it. You can do a lot of testing within the simulated environment itself. A lot of software has configurations where you can put the concepts of physics, wind resistance, and all, and then you can test the tire virtually. These are some of the challenges our teams are facing.

Thumbnail 1750

Thumbnail 1770

We are running hundreds of simulations per month, with multiple jobs, multiple users, and aging hardware. A quick overview of what we use: industry-standard software. Abaqus is a very well-known software package for simulations, and Siemens NX for design. So a typical path from on-prem HPC to cloud HPC looks like this: with on-prem, you invest heavily up front and plan an ROI over 6 to 7 years. You put in a lot of CapEx. You buy a large computer, probably 256 CPUs, 512 CPUs, without knowing who is going to use it and when. You buy large SAN storage, probably 10 terabytes, 20 terabytes, and it sits there; somebody will use it, somebody may not. If we move this entirely to the cloud, I can use it as a pay-as-you-go model.

Instead of investing €5 million in an on-premises solution, I can use this on a pay-as-you-go basis. Nobody uses these resources overnight; everybody goes home and sleeps. So why do I need these things on-premises? Let's move it to the cloud, let the team use it, hire and fire resources, use them when you need them, and have them automatically shut down when the job finishes.

Thumbnail 1830

The other thing is that the cost of computing is reducing day by day. I have a small analogy here. When GPT-2 was launched in 2019, it was trained at a cost of $40,000 to $50,000. If you follow the co-founder of OpenAI, he has claimed that today you can train the same GPT-2 for less than $600. Considering the keynote given today, there are even better chips; if you go back home and train it, you can probably do it for $60. You never know. By the way, if you want to train it, the whole code of GPT-2 is open source, so you can go and train your own GPT-2.

Thumbnail 1880

Technical Implementation: Architecture, Benchmarking, and Cost Optimization Strategies

So let's take a technical dive now. Enough of theory; let's discuss what exactly we did. Here is a high-level architecture diagram. Our users were accustomed to using Windows systems. We have on-premises users connected through a VPN tunnel to the AWS Cloud. This entire box is the AWS Cloud. We created a landing zone for our users using Windows Terminal Services, where multiple users can log in simultaneously, see the status of their jobs, and run their jobs.

Thumbnail 1890

At the back end, there is a job scheduling server that controls which jobs to run and which resources to allocate based on your configuration. It can spin up a number of clusters and then run the jobs. Our users can submit jobs, see job status, and view the temporary files of running jobs. We have given them the power to select what kind of clusters they can spin up and the default cluster size. This entire setup is stored on FSx storage, which is one of the key game changers; without it, it would be really hard to share data across multiple users. Typically, you put the data on an EBS volume that comes with EC2. With FSx, you can share the data across the high-performance computing cluster. So this is the whole structure we created. I'll take a deeper dive in the next few slides into what configuration we use and why.

Thumbnail 1990

As I mentioned, FSx is one of the game changers, and if you attended the keynote today, more enhancements to FSx were announced. What it basically does is provide shared storage that automatically scales the size up and down if you want it to, which is quite unique to FSx. With EBS, you can only go one way: you can only increase the size, you cannot reduce it. So if I'm using an FSx file system of 6 or 10 terabytes and I never reduce the size, it's going to be very costly for me. My whole concept of OpEx is not going to work. I need to keep the cost under control.

It's not that all users are working at the same time. Depending on what kind of jobs they are running, small jobs or big jobs, and how many jobs or users there are, I want my FSx to either grow or shrink based on the storage required. This is the beauty of FSx. You put a script in place, and if my FSx consumption goes above 90 percent, it automatically scales by another 500 GB, 1 terabyte, or 2 terabytes, whatever increment you have set. If utilization has dropped to 60 percent, bring the size back down to whatever number you have set, or at 40 percent, reduce it further. It can also archive data, automatically moving it between hot, warm, and cold storage, much like S3 storage tiers. It can take snapshots of the user's data, so if a user accidentally deletes a file, you can recover it without relying on a full backup. This was definitely one of the most critical pieces of the entire solution when you are working with multiple users and want to share data.
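A rough sketch of the threshold-based scaling script the speaker describes might look like this, assuming FSx for Lustre, where capacity can be increased through the API (whether capacity can be reduced depends on the FSx variant); the file system ID, thresholds, and growth increment are placeholders.

```python
import boto3
from datetime import datetime, timedelta

fsx = boto3.client("fsx")
cw = boto3.client("cloudwatch")

FILE_SYSTEM_ID = "fs-0123456789abcdef0"  # placeholder file system ID
GROW_THRESHOLD = 0.90                    # grow when more than 90% is used, per the talk
GROW_STEP_GIB = 1200                     # assumed increment; Lustre grows in fixed steps

def used_fraction(total_gib: int) -> float:
    """Estimate utilization from the FreeDataStorageCapacity CloudWatch metric."""
    stats = cw.get_metric_statistics(
        Namespace="AWS/FSx",
        MetricName="FreeDataStorageCapacity",
        Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    if not stats["Datapoints"]:
        return 0.0
    free_bytes = max(stats["Datapoints"], key=lambda d: d["Timestamp"])["Average"]
    return 1.0 - free_bytes / (total_gib * 1024**3)

fs = fsx.describe_file_systems(FileSystemIds=[FILE_SYSTEM_ID])["FileSystems"][0]
if used_fraction(fs["StorageCapacity"]) > GROW_THRESHOLD:
    # Request a larger file system; FSx resizes online without disrupting running jobs.
    fsx.update_file_system(
        FileSystemId=FILE_SYSTEM_ID,
        StorageCapacity=fs["StorageCapacity"] + GROW_STEP_GIB,
    )
```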

Thumbnail 2110

Thumbnail 2140

When we moved from on-premises to the cloud, we wanted to measure the performance. My management and board, who were sanctioning the dollars for this project, wanted to know what we were gaining. Additionally, my users had to be convinced. For myself, I wanted to know what kind of instance to use. There are so many instances available: Linux, Graviton, AMD, Intel, GPU-based. Which one should I use? What size should I use? Measurement was critical for us. We did extensive simulations in the proof of concept to reach a conclusion about what works for us.

We tested on-premises and submitted various kinds of jobs to see how long each particular job took. Then we tested it on multiple combinations of instances. Here are the configurations. Linux was giving the best price-to-performance ratio in the high job scenario cases. The rest of the instances were either Windows or AMD, and you can see what kind of CPUs we were running: Intel and others. We came to a conclusion about what we needed, how much money we wanted to spend, and how much time we could tolerate in terms of queue or wait time.

Thumbnail 2200

We tested this for one software and then for a second one as well, so we could conclude what software would run on what kind of instance if we wanted. After we were convinced that on-premises versus the cloud would give us a better price-to-performance ratio and we knew what kind of instance we wanted, we configured that into the entire solution. Originally we were using SLURM as our job scheduling tool, but it was very crude and raw, so we opted for a custom proprietary solution sold by a company called Invisible. We started using Tachyon, and we were convinced it would work well for us.

Thumbnail 2250

Our users were happy with the user interface and the way they could control the jobs. We worked for a long time with Tachyon to customize the product for us and ensure it was well adopted by our users. The other thing, as I mentioned, is that CPU was just one factor. The big factor was which operating system to use. My users were more comfortable with Windows, but did I really want to give it to them? The front end is Windows, as I told you earlier; the landing zone is Windows Terminal Services. What runs in the back end, the users do not need to know. It is a black box for them. If Linux gives us a better price-to-performance ratio, we said we would run the jobs on Linux in the back end.

As you can see, there is a 60 percent cost saving between Windows and Linux for the same jobs we were running earlier purely on Windows or in the PC environment. Many times I hear from people or the CIO that cloud is costly and costs are overrunning, but have you done your due diligence? Have you done your benchmarking? If you do it correctly, you can actually come to an optimal point on what you want to run: Windows, Linux, AMD, Intel. I was giving a session earlier where I gave some golden rules on what kind of CPUs to go with and what kind of OS to go with. I will take probably 30 seconds to repeat.

My golden rule is if I can run it on Graviton Linux, I will do that. That is my first choice: Amazon Linux Graviton, the cheapest solution you can get. If not, if your application does not work on the Graviton architecture, then I will probably go with x86. If I am going with x86, I will go with AMD. If not, then I will go with Intel. If none of these combinations work and Linux does not work, there are enterprise applications like SAP, then probably I will go with Enterprise Linux like SUSE Linux or Red Hat Enterprise Linux.

If you specifically need Windows, then and only then will I go with Windows. Again with Windows, my first preference is AMD; if not, Intel. I have a rulebook, and I follow that rulebook. The amazing cost savings I get from this approach are significant. The same thing applies with RDS. If I can run on PostgreSQL, fine. If not, then I will probably ask for other proprietary options like Oracle or SQL Server. It is a simple rule. The best thing with PostgreSQL is it runs on Graviton as an RDS as well. Tremendous cost savings come from RDS, and I am a happy customer today that AWS has announced database savings plans. I was looking forward to this for a long time, so thanks for that.
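That rulebook can be written down as a simple decision function. A minimal sketch follows; the instance families named in it are examples, not a mapping from the talk.

```python
def pick_platform(runs_on_graviton: bool, needs_windows: bool,
                  needs_enterprise_linux: bool, works_on_amd: bool) -> str:
    """Sketch of the 'golden rule': cheapest workable OS/CPU combination first."""
    if needs_windows:
        # Windows only when the application truly requires it; prefer AMD, else Intel.
        return "Windows on AMD (e.g. m6a)" if works_on_amd else "Windows on Intel (e.g. m6i)"
    if runs_on_graviton:
        return "Amazon Linux on Graviton (e.g. c7g)"  # first choice: cheapest option
    if needs_enterprise_linux:
        return "SUSE/RHEL on x86 (e.g. for SAP-style enterprise applications)"
    return "Linux on AMD (e.g. c6a)" if works_on_amd else "Linux on Intel (e.g. c6i)"

# Example: an x86-only solver that is AMD-compatible and does not need Windows.
# print(pick_platform(runs_on_graviton=False, needs_windows=False,
#                     needs_enterprise_linux=False, works_on_amd=True))
```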

Thumbnail 2410

We solved operational challenges with our R&D team. Everybody wants self-service. Nobody wants to come to IT or wait in a queue for their ticket to be processed. They want to use the platform we have given them and execute jobs directly. Security and governance are tightly integrated with Active Directory users. Cost efficiencies are tremendous if you simulate correctly and do your benchmarking right. The most important thing for us is the scale-up of engineering for our R&D team. With that, I'll hand over to Gautam to give you a walkthrough of the Tachyon platform.

Thumbnail 2480

Tachyon Platform: Empowering R&D Teams with Self-Service HPC and AI Capabilities

Thanks, Shailender, for providing great insights on running HPC on AWS. Apollo faced a classic HPC challenge: their R&D team needed powerful computing resources, but as Shailender mentioned, managing jobs, monitoring usage, and controlling costs were becoming a bottleneck to innovation. What we did at AWS was work backwards from their specific requirements and implement Tachyon, a partner solution customized for Apollo Tyres. It is a comprehensive management platform that puts control back in the hands of the researchers. It also gives administrators full control and complete visibility into which clusters are running and how to optimize costs.

Let me walk you through what makes Tachyon game-changing. Researchers can now submit and monitor jobs through an intuitive UI. They can request workstations on demand from a managed catalog and access files seamlessly through an integrated file manager. They don't have to raise an IT ticket to get things done. Admins get complete, unified observability across all their clusters. Most importantly, they can allocate budgets at the project and user level, which gives better control of the budget. As a result, Apollo Tyres' R&D team now focuses on innovation instead of infrastructure management, while the admin team gains control of the entire infrastructure.

Thumbnail 2560

Let me show you how Tachyon's cloud-native architecture on AWS delivers comprehensive management while maintaining security, scalability, and seamless integration with the existing infrastructure that Shailender was discussing earlier. At the heart of the Tachyon platform is an EKS cluster running with three nodes, which is lightweight yet powerful enough to orchestrate complex HPC workloads on AWS. The Tachyon application is set up in a dedicated VPC in the customer's AWS account. The solution uses Amazon OpenSearch as a database and search engine to store configuration and transaction information. A Lambda function is used to trigger scheduled jobs that assimilate all the billing data and can run scheduled notifications.
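The internals of the partner platform are not shown in the session, but a scheduled billing-aggregation Lambda of the kind described could, for example, pull daily cost grouped by a project tag through Cost Explorer. The handler shape, dates, and tag key below are assumptions.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

def handler(event, context):
    """Hypothetical scheduled Lambda: aggregate one day of HPC cost per project tag."""
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": "2025-12-01", "End": "2025-12-02"},  # computed daily in practice
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "Project"}],  # assumed cost-allocation tag key
    )
    totals = {
        group["Keys"][0]: group["Metrics"]["UnblendedCost"]["Amount"]
        for group in result["ResultsByTime"][0]["Groups"]
    }
    return totals  # e.g. indexed into OpenSearch for the admin dashboards
```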

Thumbnail 2650

Thumbnail 2660

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Tachyon uses a proxy node that runs a proxy service closer to the HPC cluster. This proxy node is created in the target AWS account where HPC clusters are running. Users can access the Tachyon web console through private connectivity via Direct Connect or a VPN tunnel, and it integrates with existing enterprise Active Directory authentication to ensure secure communication between on-premises and AWS. I'll give you a quick demo of how the Tachyon platform works. The first thing users can do is job submission and tracking. Tachyon has a nice front end where users can create a job. They can provide the account information, the FSx cluster, which application they want to run, which version of that application, and the required template. In the working directory field, they can specify where all the scripts will run. In the config section, they can select which queue to run on based on their CPU, memory, and node requirements. They can specify the parameters, including the total number of tasks required to run the simulation job, the CPU requirement per task, the total number of nodes required, and the tasks per node.
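The exact Tachyon API is not shown in the session, but the fields walked through above map naturally onto a job specification like the sketch below; the payload keys, values, and the submit call are hypothetical.

```python
# Hypothetical representation of the job form described above; not the actual Tachyon API.
job_spec = {
    "account": "rnd-tires",              # accounting/budget bucket (assumed name)
    "filesystem": "fsx-shared",          # FSx file system the job reads and writes
    "application": "abaqus",
    "version": "2024",
    "template": "standard-explicit",     # assumed template name
    "working_dir": "/fsx/projects/tire-17in/run-042",
    "queue": "cpu-highmem",              # queue chosen from CPU/memory/node requirements
    "total_tasks": 128,                  # total tasks for the simulation job
    "cpus_per_task": 1,
    "nodes": 2,
    "tasks_per_node": 64,
    "memory_per_node_gib": 256,          # memory configuration
}

def submit_job(spec: dict) -> str:
    """Placeholder for the platform's submit call; the real system returns a job ID."""
    raise NotImplementedError("illustrative only")
```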

Thumbnail 2710

Thumbnail 2720

Thumbnail 2730

Thumbnail 2740

Thumbnail 2750

Thumbnail 2760

These parameters also include the memory configuration, giving a very granular level of detail for the submission. Once they submit the job, as you can see on the bottom right, the estimated cost of running the cluster is displayed, so users know in advance what budget they are allocating for this simulation. After submission, they can view the complete details of the job. In the script section they can see what script was run by the submitted job, and the observability section provides full CPU and memory utilization details. If an issue comes up, they can dive deeper from the log section, do a complete analysis, and view the simulation results, all directly from this intuitive interface.

Thumbnail 2770

Thumbnail 2780

Thumbnail 2790

Thumbnail 2800

Thumbnail 2810

The second feature, which is a very important one, is that users can request and manage workstations from a managed catalog. From whatever catalogs have been created, users can subscribe to a workstation and specify the memory, storage, and CPU requirements they have. Once they subscribe to a workstation, they can choose whether to share it with everyone or only with specific users. When the workstation is created and running, they get complete visibility into how it is performing, including CPU and memory utilization, along with monitoring of lifecycle and other events.

Thumbnail 2820

Now I'll talk a little bit about Tachyon AI. Tachyon AI is an intelligent solution powering the next generation of high performance computing on AWS. It delivers innovation through two powerful components. The first is Physics AI, which has dramatically accelerated simulation workloads. Researchers can now access preconfigured open source models, and they can also train custom models with fine-tuning capabilities. The second is the Tachyon AI Assistant, powered by Amazon Bedrock with a Claude model running in the back end, which allows end users to interact with HPC resources using natural language. Users can now track jobs and troubleshoot issues using natural language. They can create their job scripts, access documents through conversational Q&A, and optimize workloads for the right balance of performance and cost. Together, these AI capabilities make Tachyon not just a management platform but an intelligent partner that accelerates research outcomes while maximizing resource efficiency.

Thumbnail 2900

Thumbnail 2960

Here is a quick demo of how the Tachyon AI Assistant works. The user can launch the assistant and ask the status of any job: what is the current status of this job? In the back end, it calls the Anthropic Claude model in Bedrock, fetches the information, and returns it to the user. Users can also ask for a graph plot of the details, and the AI model in the back end generates and returns it. If any job has failed, users can ask for the root cause of the failure, and the assistant does that analysis for them. Likewise, if they want to understand what infrastructure their HPC workload is running on, the LLM in the back end fetches the results and gives them back to the users.
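A minimal sketch of the kind of Bedrock call that could sit behind such an assistant interaction is shown below; the model ID, system prompt, and the job context passed to the model are assumptions.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_assistant(question: str, job_context: str) -> str:
    """Answer an HPC question with Claude on Bedrock, grounded in scheduler output."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID
        system=[{"text": "You are an HPC assistant. Answer only from the job data provided."}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Job data:\n{job_context}\n\nQuestion: {question}"}],
        }],
    )
    return resp["output"]["message"]["content"][0]["text"]

# Example: pass a failed job's log excerpt and ask for a likely root cause.
# print(ask_assistant("Why did this job fail?", failed_job_log_excerpt))
```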

We talked about Apollo, the benchmarking they did, and how the Tachyon platform solved the operational challenges their end users were experiencing. These are some of the benefits Apollo has achieved. Shailender also mentioned that simulation time was reduced by 60% compared to on-premises. With HPC on AWS, Apollo has moved from a very high CapEx model to a controlled OpEx model for running HPC applications. The self-service capabilities enabled by Tachyon are allowing the R&D team to focus more on innovation and less on infrastructure management.

Thumbnail 3030

Now with Tachyon AI and an accelerated product development cycle, Apollo Tyres is performing virtual prototyping instead of physical prototyping, which is helping them accelerate the development cycle for tire development. With this, I'll invite Shailender again to talk about the roadmap for HPC and the challenges and lessons learned from this engagement.

Future Roadmap and Lessons Learned: Scaling HPC Innovation Across Apollo Tyres

The Tachyon platform is, no doubt, another game changer after the FSx piece I mentioned, in terms of user adoption and acceptance of the whole solution. The key point was ease of use: the whole solution should be easy for users to understand and use, so they can see their jobs running. As I told you about our journey, we take these projects in bits and pieces, starting small with easy wins. We started with the IoT data lake and scaled up to Industry 4.0. Simulation was a big undertaking, and HPC is not a small project to take on. We did a lot of testing on what to use, and now that we have tested it, it is running fine in production for one of my business units in one of our locations.

We want to scale it up, and that's the roadmap we are looking for. Chemical compound simulation is a complex problem to solve. When we are designing the product, we want to know which chemical is giving us the best performance, best price performance, longevity of the tire, and heat dispersion when running at high speed. These chemical compound simulations are typically done within the R&D lab using historical data and insights from the physical properties of the chemicals. Doing it in the lab physically is a time-consuming and costly process. However, with that knowledge base, we can develop a linear regression model to determine which chemicals we can add to create the best compound.
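As a very rough illustration of the kind of regression model mentioned here, the sketch below fits compound descriptors to a lab-measured property; the feature names and numbers are invented, not Apollo's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented compound descriptors: filler fraction, sulfur phr, silica phr, cure time (min).
X = np.array([
    [0.30, 1.5, 20.0, 12.0],
    [0.35, 1.8, 25.0, 14.0],
    [0.28, 1.2, 18.0, 10.0],
    [0.40, 2.0, 30.0, 16.0],
])
# Invented target: a lab-measured rolling-resistance coefficient.
y = np.array([0.0105, 0.0098, 0.0112, 0.0091])

model = LinearRegression().fit(X, y)
candidate = np.array([[0.33, 1.6, 22.0, 13.0]])  # a new compound recipe to evaluate
print("predicted rolling resistance:", model.predict(candidate)[0])
```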

Complexity reduction is another area where I want to understand what are the best components which I can combine together to make the tire. If I'm using 200 components, can I reduce it to 190 components? This again requires good computational power. Global expansion is another focus area. We have multiple R&D centers, and I want to expand the HPC solution to multiple R&D centers so that other locations can also use the same HPC solution that we have built and created in one of the locations.

Thumbnail 3240

Gautam spoke about AI, and I also gave you the use cases for our self-service data lake with AI. We want to explore whether some of the things we do today as manual processes can be handled by an agent that runs the simulation and then suggests the best simulation to use, whether for a compound or for a tire. That is the future roadmap we are looking at. Now, regarding lessons learned, benchmark analysis was one of the key deciding factors. At one point in time, when we were using a Windows solution, the cost was high and it became a roadblock for us in terms of acceptance of the solution. When we did the right benchmark analysis, we could optimize the cost and gain buy-in from management.

Choosing the right instance type is a subset of benchmarking. I have already disclosed my golden rule of using the right instance type in the right application. Design for elasticity is what we also do. AWS is famous for the elasticity concept, and we make sure that anything we design is scalable and easy to scale. It's not just about being scalable, but about being easy to scale.

The Tachyon platform actually helped us scale easily; otherwise, the initial problem was how to scale and how to manage the whole complexity of the jobs. Monitor and optimize: if you can't monitor and keep optimizing against the benchmarks, something will fall through the cracks. So you have to keep monitoring costs. Put thresholds on your budgets in AWS so that you know you are exceeding the threshold before the bill arrives. This is what we do on AWS.
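Budget thresholds like the ones mentioned can also be created programmatically; a minimal sketch with placeholder account ID, budget amount, and notification email:

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account ID
    Budget={
        "BudgetName": "hpc-monthly",
        "BudgetLimit": {"Amount": "50000", "Unit": "USD"},  # placeholder monthly limit
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,             # alert at 80% of the budget
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "hpc-admins@example.com"}],
    }],
)
```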

Automate everything. I'm sure this will become much easier with the many agents that were also announced today. Earlier today, in my noon session, I also spoke about how we are using multiple agents, where one agent is termed a judge and checks whether my first agent is working correctly. It is a pretty interesting concept that we are using. In terms of architecture, we planned for scale from day one. Even though we were doing the proof of concept, from day one our objective was: can I spin up multiple clusters, and how will I scale up my FSx? We nailed these things down in the dry run, on paper, and once we were convinced, we actually invested in the proof of concept.

So those were the lessons learned from my side. Let me invite Alex for the closing notes. All right, thank you very much for coming here, and thank you, Shailender, and thank you, Gautam. Shailender, I'm hoping that you're coming back next year so that you can talk more about your progress and what you've been working on. Before we close the session, I would really appreciate it if you could fill out the session survey. It's very important for us to understand whether we delivered what you were expecting from this session so that we can keep improving for next year. Thank you very much for attending and for spending this last hour with us. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
