<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aishwarya Goel Inferless ( A.G.I)</title>
    <description>The latest articles on DEV Community by Aishwarya Goel Inferless ( A.G.I) (@aginfer).</description>
    <link>https://dev.to/aginfer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1188057%2F07f7efe6-eef8-4c57-91d6-1db184e4d723.jpeg</url>
      <title>DEV Community: Aishwarya Goel Inferless ( A.G.I)</title>
      <link>https://dev.to/aginfer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aginfer"/>
    <language>en</language>
    <item>
      <title>The State of Serverless GPU Part 2</title>
      <dc:creator>Aishwarya Goel Inferless ( A.G.I)</dc:creator>
      <pubDate>Wed, 15 Nov 2023 17:30:00 +0000</pubDate>
      <link>https://dev.to/aginfer/the-state-of-serverless-gpu-part-2-4l6d</link>
      <guid>https://dev.to/aginfer/the-state-of-serverless-gpu-part-2-4l6d</guid>
      <description>&lt;p&gt;In the evolving landscape of AI Infrastructure, Serverless GPUs have been a game changer. Six months on &lt;a href="https://news.ycombinator.com/item?id=35738072"&gt;from our last guide,&lt;/a&gt; which sparked multiple discussions &amp;amp; created more awareness about the space, we've returned with fresh insights on the state of "True Serverless" offerings and I am here sharing performance benchmark &amp;amp; cost effectiveness analysis for &lt;a href="https://huggingface.co/meta-llama/Llama-2-7b-hf"&gt;Llama 2-7Bn&lt;/a&gt; &amp;amp; &lt;a href="https://huggingface.co/meta-llama/Llama-2-7b-hf"&gt;Stable Diffusion 2-1&lt;/a&gt; model.  &lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;Performance Testing Methodology:&lt;/strong&gt; We put the spotlight on popular serverless GPU contenders: Runpod, Replicate, Inferless, and Hugging Face Inference Endpoints, specifically testing for:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cold Starts:&lt;/strong&gt; Measured as total latency minus inference time, this represents the delay incurred while initializing a dormant serverless function. It varied widely across platforms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HBLseXZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlnrmym69bycm5w6sw83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HBLseXZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlnrmym69bycm5w6sw83.png" alt="Image description" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;
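
&lt;p&gt;To make the measurement concrete, here is a minimal Python sketch of how cold start can be isolated, assuming a hypothetical endpoint that reports its own pure inference time in the response body (both the URL and the response field are placeholders, not any platform's real API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

import requests

# Placeholder endpoint for illustration only.
ENDPOINT = "https://example.com/v1/infer"

def measure_cold_start(payload):
    # Total latency: wall-clock time from request to response.
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=payload, timeout=600)
    total_latency = time.perf_counter() - start
    # Assumed response field reporting pure inference time in seconds.
    inference_time = response.json().get("inference_time", 0.0)
    # Cold start = total latency minus the time spent on inference itself.
    return total_latency - inference_time
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a warm invocation the same calculation should come out close to zero, which is how warm and cold requests can be told apart.&lt;/p&gt;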

&lt;p&gt;&lt;strong&gt;2. Variability:&lt;/strong&gt; We don't trust one-off results, so we repeated the tests over 5 days to gauge stability; the platforms showed clear differences in consistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hO10M_ZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3b3xib8w02bwt3y1ng53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hO10M_ZH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3b3xib8w02bwt3y1ng53.png" alt="Image description" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;
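
&lt;p&gt;A short sketch of how that consistency can be quantified from repeated runs; the sample values below are illustrative placeholders, not our measured data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import statistics

# Hypothetical cold-start samples in seconds, one list per test day.
daily_samples = [
    [4.2, 4.5, 4.1],
    [4.3, 4.4, 4.6],
    [9.8, 4.2, 4.4],  # a single outlier day is immediately visible
    [4.1, 4.3, 4.2],
    [4.5, 4.2, 4.3],
]

all_samples = [s for day in daily_samples for s in day]
print("mean cold start:", round(statistics.mean(all_samples), 2), "s")
print("std deviation:  ", round(statistics.stdev(all_samples), 2), "s")
&lt;/code&gt;&lt;/pre&gt;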

&lt;p&gt;&lt;strong&gt;3. Autoscaling:&lt;/strong&gt; We simulated traffic peaks to assess how well each platform scales under pressure, sending 200 requests with a concurrency of 5. Not all platforms managed linear scaling efficiently, leading to varied latencies under load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o_KaiaEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3w1kbgbvdrbh4iyqra7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o_KaiaEt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3w1kbgbvdrbh4iyqra7z.png" alt="Image description" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;
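
&lt;p&gt;For reference, a minimal sketch of that load pattern: 200 requests with at most 5 in flight at any time, driven by a thread pool (the endpoint is again a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint for illustration only.
ENDPOINT = "https://example.com/v1/infer"

def timed_request(_i):
    start = time.perf_counter()
    requests.post(ENDPOINT, json={"prompt": "hello"}, timeout=600)
    return time.perf_counter() - start

# 200 requests total, at most 5 concurrent, mirroring the test setup.
with ThreadPoolExecutor(max_workers=5) as pool:
    latencies = sorted(pool.map(timed_request, range(200)))

print("p50:", round(latencies[99], 2), "s")
print("p95:", round(latencies[189], 2), "s")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If latencies climb toward the tail of the run, the platform is queueing requests rather than scaling out.&lt;/p&gt;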

&lt;p&gt;&lt;strong&gt;4. Decoding Serverless Pricing:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;4.1 We modeled a scenario where you process 1,000 documents daily with the Llama 2-7Bn model. Here's the TL;DR on costs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ohs0SYEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a1rpwv5nmvdj4k4g1ua9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ohs0SYEE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a1rpwv5nmvdj4k4g1ua9.png" alt="Image description" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
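
&lt;p&gt;As a back-of-the-envelope illustration of how such an estimate is assembled (every number here is an assumed placeholder, not an actual platform price; the chart above holds the real comparison):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# All inputs are illustrative assumptions, not measured values.
docs_per_day = 1000
inference_seconds = 2.0        # assumed time to process one document
cold_start_seconds = 5.0       # assumed cold-start delay
cold_start_fraction = 0.05     # assumed share of requests hitting a cold start
price_per_gpu_second = 0.0005  # assumed $ per GPU-second

billable_seconds = docs_per_day * (
    inference_seconds + cold_start_fraction * cold_start_seconds
)
monthly_cost = billable_seconds * price_per_gpu_second * 30
print(f"estimated monthly cost: ${monthly_cost:.2f}")
&lt;/code&gt;&lt;/pre&gt;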

&lt;p&gt;4.2 For the image-processing (Stable Diffusion) use case, only the number of processed items and the cold start times differ: instead of 1,000 documents, we consider 1,000 images daily.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---bmCCHb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tzejelcrav5vhm77401w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---bmCCHb3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tzejelcrav5vhm77401w.png" alt="Image description" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔮 Overall Insights:&lt;/strong&gt; The serverless GPU sector is advancing, notably in reducing cold-start times and improving cost efficiency. However, the best choice depends on the specific use case. While AWS Lambda leads in general-purpose serverless computing, specialized workloads, particularly GPU-intensive ones, may find better options elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Detailed Blog link:&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://www.inferless.com/learn/the-state-of-serverless-gpus-part-2"&gt;https://www.inferless.com/learn/the-state-of-serverless-gpus-part-2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This analysis aims to shed light on the serverless GPU arena. We welcome feedback and strive for accuracy in our findings.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
