<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dustin Liu</title>
    <description>The latest articles on DEV Community by Dustin Liu (@grhaonan).</description>
    <link>https://dev.to/grhaonan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F913849%2F9c150d20-49ac-4a91-a983-795d54bb62e0.jpeg</url>
      <title>DEV Community: Dustin Liu</title>
      <link>https://dev.to/grhaonan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/grhaonan"/>
    <language>en</language>
    <item>
      <title>Injecting Machine Learning Model into AWS Lambda for Accelerated Inference Performance</title>
      <dc:creator>Dustin Liu</dc:creator>
      <pubDate>Sun, 26 Mar 2023 06:23:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/injecting-machine-learning-model-into-aws-lambda-for-accelerated-inference-performance-36l7</link>
      <guid>https://dev.to/aws-builders/injecting-machine-learning-model-into-aws-lambda-for-accelerated-inference-performance-36l7</guid>
      <description>&lt;p&gt;In my p&lt;a href="https://dev.to/grhaonan/cost-efficient-scalable-yolov5-model-inference-in-production-more-than-just-aws-lambda-19el"&gt;revious article&lt;/a&gt;, we introduced a cost-effective inference architecture that successfully handles approximately one million inferences per month in our live production environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ktgAUVbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekqtjqcl0rfsr349o623.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ktgAUVbX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekqtjqcl0rfsr349o623.jpg" alt="Image by author, previous logic" width="614" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previous logic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Upon receiving an inference request, Lambda will load the specified model, which is mounted on the Amazon Elastic File System (EFS).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the event that loading the model from EFS fails, the system falls back to fetching the model from Amazon S3 (Alternative 1).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
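&lt;p&gt;The fallback in the previous logic can be sketched in Python. This is a minimal, illustrative sketch only: the mount paths are hypothetical, and the actual model-loading and S3-download functions are passed in as stand-ins.&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical locations; the real EFS mount point and S3 staging
# directory depend on the Lambda configuration.
EFS_MODEL_DIR = Path("/mnt/efs/models")
TMP_MODEL_DIR = Path("/tmp/models")

def load_model(model_name, load_fn, fetch_from_s3_fn):
    """Try the EFS copy first; fall back to downloading from S3.

    load_fn turns a local file path into a model object;
    fetch_from_s3_fn downloads the artifact and returns its local path.
    """
    efs_path = EFS_MODEL_DIR / model_name
    try:
        return load_fn(efs_path)
    except OSError:
        # Alternative 1: EFS load failed (bad path, throughput
        # exhausted, mount missing, ...), so stage the model from S3.
        local_path = fetch_from_s3_fn(model_name, TMP_MODEL_DIR)
        return load_fn(local_path)
```

&lt;p&gt;In the real handler, load_fn would be the framework’s model loader and fetch_from_s3_fn a boto3 download to local disk.&lt;/p&gt;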

&lt;p&gt;One challenge associated with this solution is that during a scale-up scenario, when a large number of concurrent Lambda functions are invoked, they all attempt to load models from EFS simultaneously. This surge in demand can quickly exhaust the reserved bandwidth (throughput), leading to numerous inference timeouts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uSzem6hN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0dbyrjnhrckewvluh95d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uSzem6hN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0dbyrjnhrckewvluh95d.png" alt="Image by author, high lambda error rate due to timeout" width="880" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solution 1 (Not recommended):&lt;br&gt;
Increasing the provisioned throughput may seem like a straightforward fix, but it can prove costly and is therefore not recommended.&lt;/p&gt;

&lt;p&gt;Solution 2 (Recommended):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jgt7Fex4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8z01yn43mo140p1z8lx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jgt7Fex4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b8z01yn43mo140p1z8lx.jpg" alt="Image by author, new logic" width="614" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Injecting the model into the base image used by Lambda allows each inference to load the model locally, reducing inference latency and saving EFS cost. If this approach fails, the following backup solutions are in place:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loading the model from EFS (Backup Solution 1)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Loading the model from S3 if EFS loading fails (Backup Solution 2)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
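&lt;p&gt;The “injection” step amounts to one extra COPY in the Lambda image build. A hedged sketch (the base-image tag, file names, and paths are hypothetical, not our exact build):&lt;/p&gt;

```dockerfile
# Illustrative only -- tag, paths, and file names are hypothetical.
FROM public.ecr.aws/lambda/python:3.8

# Bake the trained model into the image so inference can load it from
# local disk, falling back to EFS and then S3 only if this fails.
COPY models/model.pt /opt/ml/model.pt

COPY awslambda.py ./
COPY lambda-requirements.txt ./
RUN pip install -r lambda-requirements.txt

CMD ["awslambda.handler"]
```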

&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;Upon implementing the model injection, there is a notable reduction in both EFS throughput and Lambda inference duration, resulting in decreased costs and improved performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sjQYrPw9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vu36ej80ssv3r4ordeaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sjQYrPw9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vu36ej80ssv3r4ordeaf.png" alt="Image by author, EFS throughput before implementation" width="453" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqy_yj5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nw1ikijplckb0r0776w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qqy_yj5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9nw1ikijplckb0r0776w.png" alt="Image by author, lambda duration before implementation" width="522" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r55lKFXD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ohmlz08blq1gd26w2pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r55lKFXD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ohmlz08blq1gd26w2pn.png" alt="Image by author, lambda duration after implementation" width="519" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>AWS Athena + DBT Integration Demo</title>
      <dc:creator>Dustin Liu</dc:creator>
      <pubDate>Wed, 12 Oct 2022 01:38:06 +0000</pubDate>
      <link>https://dev.to/grhaonan/aws-athena-dbt-integration-demo-50ii</link>
      <guid>https://dev.to/grhaonan/aws-athena-dbt-integration-demo-50ii</guid>
      <description>&lt;p&gt;Hi everyone,&lt;/p&gt;

&lt;p&gt;If you are looking to build a quick demo of the Athena + DBT integration by reusing existing code through Terraform, feel free to check my post on Medium: &lt;a href="https://towardsdatascience.com/aws-athena-dbt-integration-4e1dce0d97fc"&gt;https://towardsdatascience.com/aws-athena-dbt-integration-4e1dce0d97fc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheers&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dbt</category>
      <category>athena</category>
    </item>
    <item>
      <title>Cost-efficient &amp; Scalable Model Inference In Production - More than just AWS Lambda</title>
      <dc:creator>Dustin Liu</dc:creator>
      <pubDate>Tue, 23 Aug 2022 11:08:00 +0000</pubDate>
      <link>https://dev.to/grhaonan/cost-efficient-scalable-yolov5-model-inference-in-production-more-than-just-aws-lambda-19el</link>
      <guid>https://dev.to/grhaonan/cost-efficient-scalable-yolov5-model-inference-in-production-more-than-just-aws-lambda-19el</guid>
      <description>&lt;p&gt;I am new to dev.to community would like to share a solution we implemented at BGL for YOLO model inference(any general model inference also applies) and hope you find it helpful.&lt;/p&gt;

&lt;p&gt;Inference can account for a large proportion of computing cost. To address this concern and reduce high inference cost, AWS already offers several model inference solutions for a wide range of scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html"&gt;Real-Time inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html"&gt;Batch Transform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html"&gt;Asynchronous Inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html"&gt;Serverless Inference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what if you're looking for a more flexible, customizable serverless solution with an even lower cost? And, most importantly, one that can potentially be “lifted and shifted” to other cloud platforms, whether your organization is dealing with multi-cloud infrastructure or worried about being locked in with one specific provider?&lt;/p&gt;

&lt;p&gt;Welcome to BGL’s story about how to implement AWS Lambda as a model inference service to handle a large volume of inference requests in production.&lt;/p&gt;

&lt;h3&gt;Context&lt;/h3&gt;

&lt;p&gt;Our engineering team is building an AI product to automate several business processes. Both the Hugging Face Transformers (for NLP tasks) and YOLOv5 (for object detection tasks) frameworks are integrated, and several models have been trained on our business cases and datasets. The inference API is integrated with the existing business workflow to automate the process, so the organization can pull resources away from tedious, low-value work and reduce operating costs.&lt;/p&gt;

&lt;p&gt;The current system is handling 1000+ inference requests per day, with 50x volume growth expected in the near future. Once a model candidate is deployed to production and stable, the main cost is model inference.&lt;/p&gt;

&lt;h3&gt;The Solution Design&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ePXNTe2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vixebz3vho7tcgs9sxww.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ePXNTe2K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vixebz3vho7tcgs9sxww.jpg" alt="Image description" width="614" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution utilizes several AWS services with the following purposes (under the solution’s context):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sagemaker: Training custom model&lt;/li&gt;
&lt;li&gt;EFS: Storing artifacts of the trained model, as the primary model loading source&lt;/li&gt;
&lt;li&gt;S3: Storing artifacts of the trained model, as a secondary model loading source&lt;/li&gt;
&lt;li&gt;ECR: Hosting Lambda docker image&lt;/li&gt;
&lt;li&gt;EventBridge: A “Scheduler” to invoke Lambda&lt;/li&gt;
&lt;li&gt;Systems Manager: Storing the model path in Parameter Store&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Training&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;SageMaker normally compresses and outputs the model artifacts to S3. In this case, an additional copy is also saved to EFS (explained in 3.2) by adding an extra SageMaker channel name.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Lambda Deployment&lt;/h3&gt;

&lt;p&gt;2.1 The &lt;a href="https://www.serverless.com/"&gt;Serverless Framework&lt;/a&gt; is used to manage the Lambda configuration. All required packages and the Lambda script are built through Jenkins and Docker into a Docker image, which is then pushed to ECR.&lt;/p&gt;

&lt;p&gt;Note: the model artifacts are not embedded in this image; they stay in both S3 and EFS.&lt;/p&gt;

&lt;p&gt;Here is an example of the Dockerfile for building the Lambda image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The public Lambda base image from AWS (public.ecr.aws/lambda/python:3.8.2021.11.04.16)&lt;/li&gt;
&lt;li&gt;Copying the Lambda logic, written in awslambda.py&lt;/li&gt;
&lt;li&gt;Copying the YOLOv5 project (originally cloned from the YOLOv5 repository), as loading a locally trained YOLOv5 model in Lambda requires it&lt;/li&gt;
&lt;li&gt;All required packages are defined in a separate file, lambda-requirements.txt, in the same directory&lt;/li&gt;
&lt;/ul&gt;
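&lt;p&gt;For readers who prefer text to the screenshot below, the Dockerfile described by the list above roughly follows this shape (a sketch: only the base-image tag and file names come from the list, the rest is illustrative):&lt;/p&gt;

```dockerfile
# Sketch of the image build described above; treat as illustrative.
FROM public.ecr.aws/lambda/python:3.8.2021.11.04.16

# Lambda handler logic
COPY awslambda.py ./

# YOLOv5 project, required to load a locally trained YOLOv5 model
COPY yolov5/ ./yolov5/

# Dependencies; note the model artifacts themselves are NOT copied in --
# they stay in S3 and EFS
COPY lambda-requirements.txt ./
RUN pip install -r lambda-requirements.txt

CMD ["awslambda.handler"]
```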

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CDUv1YUc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d8vnipip1oq44hc5gf9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CDUv1YUc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d8vnipip1oq44hc5gf9e.png" alt="Image description" width="880" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.2 Each Lambda has an associated EventBridge rule configured (in the Serverless YAML file). EventBridge invokes the Lambda every 5 minutes, with two important purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeping the Lambda warm to avoid cold starts, without triggering an actual inference&lt;/li&gt;
&lt;li&gt;Pre-loading the desired model on the first warming request and caching it (through a global variable), which significantly reduces the inference lead time caused by model loading for subsequent actual inference requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--58cBnSSX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4i2zh5w2hw82fu6wcjpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--58cBnSSX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4i2zh5w2hw82fu6wcjpb.png" alt="Image description" width="880" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code snippet above shows the structure of the Lambda configuration in the Serverless YAML file. For details, please refer to the Serverless Framework documentation; the key points are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ephemeralStorageSize 5120 configures the size of the Lambda’s ‘/tmp’ folder to 5120 MB (explained in 3.2)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model bucket and path as input parameters of the EventBridge invocation (explained in 3.1)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
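&lt;p&gt;As a rough sketch, the function block in the Serverless YAML file combines these two points along the following lines (function name, image URI, and input field names are hypothetical; see the Serverless Framework docs for the exact schema):&lt;/p&gt;

```yaml
functions:
  inference:
    # Container image pushed to ECR in step 2.1 (URI is a placeholder)
    image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/inference-lambda:latest
    timeout: 900
    memorySize: 10240
    ephemeralStorageSize: 5120  # size of /tmp in MB (see 3.2)
    events:
      - schedule:
          rate: rate(5 minutes)  # keep-warm invocation (see 2.2)
          input:
            warmer: true
            # Model bucket and path as invoke parameters (see 3.1)
            model_bucket: MY_MODEL_BUCKET
            model_path: models/yolov5/model.pt
```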

&lt;h3&gt;Inference&lt;/h3&gt;

&lt;p&gt;The diagram below explains the Lambda handling logic; the principle is to check whether the invocation was triggered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By a Lambda warming request (through EventBridge) -&amp;gt; load the model (the first time) &amp;amp; cache it -&amp;gt; return without processing inference, or&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By an actual inference request -&amp;gt; process inference -&amp;gt; return the inference result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
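&lt;p&gt;Put together, this dispatch can be sketched as a minimal handler. It assumes the EventBridge input carries a flag (here called warmer, a hypothetical name), and the model-loading and inference functions are injected as stand-ins rather than being the real implementation:&lt;/p&gt;

```python
import json

# Cached across invocations of a warm Lambda container.
_model = None

def get_model(loader):
    """Load the model once per container and cache it globally."""
    global _model
    if _model is None:
        _model = loader()
    return _model

def handler(event, context, loader=None, infer=None):
    """Sketch of the dispatch: warming request vs. actual inference.

    loader and infer are hypothetical stand-ins for the real
    model-loading and inference functions, injected for testability.
    """
    model = get_model(loader)
    if event.get("warmer"):
        # Keep-warm invocation from EventBridge: the model is now
        # cached, so return without running inference.
        return {"statusCode": 200, "body": "warmed"}
    result = infer(model, event)
    return {"statusCode": 200, "body": json.dumps(result)}
```

&lt;p&gt;Because _model lives in a module-level global, it survives across invocations of a warm container, which is exactly what the warming request exploits.&lt;/p&gt;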

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QsDzIG8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84oh4qiqtswyixpho408.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QsDzIG8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/84oh4qiqtswyixpho408.jpg" alt="Image description" width="880" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.1 The S3 bucket and corresponding model path are saved in the Systems Manager Parameter Store, so this is decoupled from the Lambda deployment. Engineers can point the Lambda to any desired model by changing the parameter, without redeploying the Lambda.&lt;/p&gt;

&lt;p&gt;3.2 EFS is a file system, so mounting it and loading the model straight from EFS is much faster (but keep a close eye on the EFS bandwidth cost). In case the EFS loading fails (invalid path, bandwidth constraints, etc.), the Lambda downloads the model artifacts from S3 to the local ‘/tmp’ directory and loads the model from there. It is important to make sure the Lambda has enough ‘/tmp’ storage (the ephemeralStorageSize parameter is set to 5120 MB in our case)!&lt;/p&gt;

&lt;p&gt;Loading a YOLOv5 model locally in Lambda is straightforward:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SpSFWaD7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u4i0m0gvj8w9kxtnkwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SpSFWaD7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0u4i0m0gvj8w9kxtnkwt.png" alt="Image description" width="880" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure you copied the yolov5 project directory (mentioned in 2.1) and set repo_or_dir to the path of that directory&lt;/li&gt;
&lt;li&gt;Use model='custom' and source='local'&lt;/li&gt;
&lt;li&gt;The path must point to a valid *.pt YOLOv5 trained model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Limitations&lt;/h3&gt;

&lt;p&gt;No solution is perfect; every design comes with tradeoffs. Some limitations of the proposed solution are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lambda inference is not suited to real-time use cases that require immediate processing and very low latency&lt;/li&gt;
&lt;li&gt;Lambda has a 15-minute runtime limit; an invocation will fail if inference takes too long&lt;/li&gt;
&lt;li&gt;EFS bandwidth is an extra cost, but you could switch to S3 download &amp;amp; load as the primary method, with EFS mount &amp;amp; load as the secondary. It is slower but cheaper, and batch inference is generally not latency-sensitive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Potential Lift &amp;amp; Shift&lt;/h3&gt;

&lt;p&gt;Some features/design patterns of this solution can potentially be lifted &amp;amp; shifted to other cloud platforms (Azure and GCP, for instance), as they offer services similar to AWS’s; two valuable ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using low-cost serverless compute services (Azure Functions, Google Cloud Functions) to serve model batch inference, integrated with a ‘scheduler’ to keep the service warm&lt;/li&gt;
&lt;li&gt;Designing logic to pre-load &amp;amp; cache the model to reduce inference processing time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Summary&lt;/h3&gt;

&lt;p&gt;Thanks for reading, and I hope you enjoyed this article. Here are some key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda, EFS, and S3 can be combined into a cost-efficient, scalable &amp;amp; robust service for model batch inference (up to 15 minutes)&lt;/li&gt;
&lt;li&gt;Configuring the AWS EventBridge Lambda trigger properly helps minimize Lambda cold starts&lt;/li&gt;
&lt;li&gt;Implementing model pre-loading &amp;amp; caching logic in Lambda reduces model inference lead time&lt;/li&gt;
&lt;li&gt;Passing the model path as a system parameter instead of embedding the model into the Lambda image yields more flexibility (decoupling)&lt;/li&gt;
&lt;li&gt;There are potential lift &amp;amp; shift opportunities to apply similar concepts on other cloud platforms&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
