<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: chh</title>
    <description>The latest articles on DEV Community by chh (@chenhunghan).</description>
    <link>https://dev.to/chenhunghan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F453245%2F964dad41-5c67-42f3-a7e1-d4c8ad1dc2f9.jpeg</url>
      <title>DEV Community: chh</title>
      <link>https://dev.to/chenhunghan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chenhunghan"/>
    <language>en</language>
    <item>
      <title>6 lessons from building an MCP App for 🏃🏃‍♂️🏃‍♀️</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:31:15 +0000</pubDate>
      <link>https://dev.to/chenhunghan/6-lessons-from-building-a-mcp-apps-for-4ad6</link>
      <guid>https://dev.to/chenhunghan/6-lessons-from-building-a-mcp-apps-for-4ad6</guid>
      <description>&lt;p&gt;Like many others, since Dec 2025 I have switched my daily workflow from typing in VSCode to using only prompts to complete my daily tasks.&lt;/p&gt;

&lt;p&gt;I have had several weekend projects, like &lt;a href="https://github.com/chenhunghan/0ma" rel="noopener noreferrer"&gt;0ma&lt;/a&gt;, for managing local VMs, and &lt;a href="https://github.com/chenhunghan/scim-mcp" rel="noopener noreferrer"&gt;scim-mcp&lt;/a&gt;, an MCP server that proxies requests to SCIM endpoints. Most of them were for fun; I just wanted to see how far I could push LLMs to their limits using only natural language.&lt;/p&gt;

&lt;p&gt;This time I wanted to build something different: a &lt;a href="https://github.com/chenhunghan/garmin-mcp-app" rel="noopener noreferrer"&gt;Garmin MCP App&lt;/a&gt; that can be installed on ChatGPT Desktop or Claude Desktop and interacts with the data from your watch through a generative UI intended for a non-technical audience.&lt;/p&gt;

&lt;p&gt;This app lets you not only query data from your Garmin watch, but also explore your workout data through dynamic visualisation charts, without giving the AI your identity or your credentials.&lt;/p&gt;

&lt;p&gt;The target users are serious runners who want to use the power of LLMs to plan their next workout. The Garmin mobile app is awesome, but it is general-purpose, targets a broad audience, and is a few years behind the latest developments in sport science. I always find myself wanting to customise some parameter or dashboard. With MCP Apps, no interface is more flexible than natural language + generative UIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskou2x4wp7bhnm82ykkv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskou2x4wp7bhnm82ykkv.gif" alt="Demo" width="600" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past few years it is running, not AI, that has changed my life. I wish to bring this positive impact to more users. With the integration of LLMs and cutting-edge sport science, more users will actually start to enjoy running, and become runners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevonf61j5jhk228xzaoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevonf61j5jhk228xzaoq.png" alt="Training readiness" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only problem is that I have young kids, so my weekend time is very limited. I still managed to build the initial version in 12 hours, so here are the tips worth documenting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a Frontier Model
&lt;/h2&gt;

&lt;p&gt;Do not waste your time on older, smaller models, even if they are cheaper. Always use the best model on the market. One caveat: the newest model is not necessarily the best. When I started this project, I began with Antigravity + Gemini Pro 3.1 because it was the newest, but sorry, Google, this model is not the best, not even the second best. I ended up using Opus 4.6 with Claude Code. The time you waste in an agentic loop with a weak model costs more than the tokens of a frontier model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Best Context Engineering is No Context Engineering
&lt;/h2&gt;

&lt;p&gt;Let me repeat: the best context engineering is no context engineering. Context engineering, in essence, is giving an LLM access to context it would not otherwise have. For example, you can give an LLM browser CDP access through whichever MCP server or skill you prefer, so the LLM can inspect elements in the DOM. However, if you want to go fast, you should skip context engineering entirely, and the best shortcut is to use the most popular language/framework on the market.&lt;/p&gt;

&lt;p&gt;You might not like React, you might not like TypeScript, you might not like Tailwind, and you are probably smart enough to build your own lib that completely avoids re-renders on state updates or is "truly reactive" compared to React. However, LLMs are trained on internet data (and pirated books); reinventing the wheel means you need extra context engineering to teach the LLM your wheel. That works, but it is not the fastest path. So choose the most popular option. For UI, nothing compares to React/Tailwind in terms of market share.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harness Engineering
&lt;/h2&gt;

&lt;p&gt;Harness engineering is similar to context engineering; however, instead of giving more data to the LLM, harness engineering is about making it easier for LLMs to modify your code. Claude Code is an excellent example of good harness engineering: with it, LLMs can modify large amounts of code without making errors, unlike Gemini Pro 3.1, which struggled with Antigravity.&lt;/p&gt;

&lt;p&gt;Claude Code's hook system makes harness engineering easy. I have made two Claude Code plugins: ralph-hook-fmt, which automatically formats files after they are written or edited, inspired by Formatters in OpenCode; and ralph-hook-lint, which automatically emits lint errors when Claude Code finishes one loop of editing. Both plugins tighten the feedback loop inside the agentic loop, so the LLM can react immediately after files are edited and static analysis detects problems. These two hook plugins are still in an early phase, but give them a try or make your own hooks.&lt;/p&gt;
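&lt;p&gt;For a flavour of what a formatting hook can look like, here is a rough sketch of a &lt;code&gt;PostToolUse&lt;/code&gt; hook in a Claude Code settings file. Treat the field names and the stdin payload shape as assumptions from my reading of the hooks docs, and check the official schema before copying:&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs npx prettier --write"
          }
        ]
      }
    ]
  }
}
```

The hook command receives the tool call as JSON on stdin, so the sketch extracts the edited file's path and formats just that file after every write or edit.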

&lt;h2&gt;
  
  
  Plan First
&lt;/h2&gt;

&lt;p&gt;Use your favourite planning tool, whether it is Superpower, Speckit, or the built-in Claude Code plan tool. In plan mode, always ask Claude to relentlessly question you to fill the gaps in the plan. Stay in plan mode until you feel comfortable letting go. Plan early: the model is smarter in the first few thousand tokens of the context window. Exploring and clarifying first, then forking from plan mode into implementation mode, keeps the implementation concentrated and focused, without reading unnecessary files that stuff the context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Sub-agents
&lt;/h2&gt;

&lt;p&gt;Sub-agents are useful when you want to do things in parallel, e.g. exploring an unknown code base. LLMs are smarter in the first 0–30% of the context window. Use sub-agents to explore a code base and report a summary back to the main context; this keeps the main context lean and unpolluted by tool calls, avoiding the "needle in a haystack" problem.&lt;/p&gt;

&lt;p&gt;Use sub-agents to write code, but only when the structure of the project is stable. Hand a task to a sub-agent when you know its exact boundaries. One example is a chart that already has a pattern for feeding in data and known patterns for styling via shared CSS variables. Sub-agents are good when you know the exact output, for example making a chart for each API endpoint by following an example chart pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worktree, Worktree, Worktree
&lt;/h2&gt;

&lt;p&gt;Claude Code now has native support for git worktree; start one with &lt;code&gt;claude --worktree&lt;/code&gt;. However, like sub-agents, I would suggest using a worktree only when the structure starts to stabilise and there is little or no ambiguity in the requirements. Worktrees are for self-contained, fully isolated features or bug fixes with clear boundaries. You probably do not want to use a worktree for a feature that touches every file; you will end up spending more time resolving conflicts, so do that on the main branch. Worktrees are for predictable tasks that you are sure will finish in a known timeframe.&lt;/p&gt;

&lt;p&gt;Finally, thanks for reading through to here — your attention span is longer than most humans'. Give &lt;a href="https://github.com/chenhunghan/garmin-mcp-app" rel="noopener noreferrer"&gt;Garmin MCP Apps&lt;/a&gt; a try, let it help you plan for the next optimal run. Most importantly, start running. AI might change the world but not our everyday life; running (or whatever exercise you are into) will change your life.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I May Be Wrong</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 17 Jan 2026 07:56:55 +0000</pubDate>
      <link>https://dev.to/chenhunghan/i-may-be-wrong-2oal</link>
      <guid>https://dev.to/chenhunghan/i-may-be-wrong-2oal</guid>
      <description>&lt;p&gt;If you've spent any time working with AI agents, you've probably seen this response:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You are absolutely right!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I hate it. It usually appears after you've wasted lots of tokens and the agent still can't figure out what's going on. It's that moment of frustration when you, the human, have to diagnose the problem yourself and point out the solution.&lt;/p&gt;

&lt;p&gt;I blamed the agent. Why can't it just say "I am wrong"? Why did it have such strong confidence moments earlier, without even a hint of "maybe I'm wrong"?&lt;/p&gt;

&lt;p&gt;These are the moments when I doubt whether AI can really replace human intelligence.&lt;/p&gt;

&lt;p&gt;But after gaining more experience, I've started to see &lt;em&gt;You are absolutely right!&lt;/em&gt; differently. It's still frustrating, but now I recognise it as a signal.&lt;/p&gt;

&lt;p&gt;Here's the thing: an LLM is trained to follow human instructions, but you're not talking to a collective intelligence that can reason and reflect. You're projecting thoughts onto a probability machine that returns the most likely next token.&lt;/p&gt;

&lt;p&gt;In essence, you're mostly talking to yourself (and your code). The LLM is the rubber duck sitting next to your desk, helping you understand what you're trying to achieve. LLMs can answer questions, but when it comes to open-ended discovery—research with no definite end—you're mostly talking to your own reflections.&lt;/p&gt;

&lt;p&gt;So when I see &lt;em&gt;You are absolutely right!&lt;/em&gt; I now take it as my cue to step back—to think outside the box and find what's missing.&lt;/p&gt;

&lt;p&gt;Depressing? A little. But it's also freeing.&lt;/p&gt;

&lt;p&gt;In those moments, I've found wisdom outside of computer science: a book I recently read called I May Be Wrong by Björn Natthiko Lindeblad—a beautiful memoir about a Swedish man who became a forest monk in Thailand.&lt;/p&gt;

&lt;p&gt;It's a book about letting go of control and embracing uncertainty (sound familiar?). Your thoughts aren't you; you may be wrong. It has nothing to do with software engineering or AI. But I'd like to borrow the book's title as a reminder—a memo beside my desk when I vibe code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I May Be Wrong&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's time to step back—letting go, and not believing everything you think.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>I made a Copilot in Rust 🦀 , here is what I have learned... (as a TypeScript dev)</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 17 Dec 2023 05:27:38 +0000</pubDate>
      <link>https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md</link>
      <guid>https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md</guid>
      <description>&lt;p&gt;My article &lt;a href="https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg"&gt;Code Llama as Drop-In Replacement for Copilot Code Completion&lt;/a&gt; received lots of positive feedback. Since then, I have made a few other attempts to improve the copilot server.&lt;/p&gt;

&lt;p&gt;In terms of performance, I made a PR in &lt;a href="https://github.com/turboderp/exllamav2/pull/23" rel="noopener noreferrer"&gt;exllamav2&lt;/a&gt; adding a copilot server that uses exllamav2's super-fast custom CUDA kernels.&lt;/p&gt;

&lt;p&gt;To improve the completion quality, I tried a few other LLMs, such as &lt;a href="https://huggingface.co/replit/replit-code-v1_5-3b" rel="noopener noreferrer"&gt;replit-code-v1_5-3b&lt;/a&gt;, &lt;a href="https://huggingface.co/syzymon/long_llama_code_7b" rel="noopener noreferrer"&gt;long_llama_code_7b&lt;/a&gt;, &lt;a href="https://huggingface.co/WisdomShell/CodeShell-7B" rel="noopener noreferrer"&gt;CodeShell-7B&lt;/a&gt; and &lt;a href="https://huggingface.co/stabilityai/stablelm-3b-4e1t" rel="noopener noreferrer"&gt;stablelm-3b-4e1t&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a practicing developer who is &lt;a href="https://www.businessinsider.com/gpu-rich-vs-gpu-poor-tech-companies-in-each-group-2023-8?r=US&amp;amp;IR=T" rel="noopener noreferrer"&gt;GPU poor&lt;/a&gt;, without access to H100 clusters, the way I can contribute in this era of the &lt;em&gt;AI Wild West&lt;/em&gt; is to improve the ergonomics of the copilot server.&lt;/p&gt;

&lt;p&gt;Hugging Face's &lt;a href="https://github.com/huggingface/candle" rel="noopener noreferrer"&gt;candle&lt;/a&gt;, a &lt;em&gt;minimalist ML framework&lt;/em&gt; for Rust, looks super interesting. So I started to create a minimalist copilot server in Rust 🦀.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before you continue, note that the &lt;a href="https://github.com/turboderp/exllamav2/pull/23" rel="noopener noreferrer"&gt;exllamav2&lt;/a&gt; version (Python + CUDA) is still much faster than the Rust version. This article is mostly for those who are interested in Rust and want to learn the language by building a fun project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Essentially, this is a &lt;strong&gt;Build Your Own Copilot&lt;/strong&gt; (in Rust 🦀) tutorial, and the code is intended to be educational. If you just want to try the final product &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;chenhunghan/homebrew-formulae/oxpilot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and start the copilot server&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ox serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or chat with the LLM&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ox hi in Japanese
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We will be using &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;axum&lt;/a&gt; as the web framework, &lt;a href="https://github.com/huggingface/candle" rel="noopener noreferrer"&gt;candle&lt;/a&gt; for text inferencing,   &lt;a href="https://github.com/clap-rs/clap" rel="noopener noreferrer"&gt;clap&lt;/a&gt; for cli arguments parsing and &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt; as the asynchronous runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Books and References&lt;/li&gt;
&lt;li&gt;
Print (console.log) for debugging

&lt;ul&gt;
&lt;li&gt;Pretty print&lt;/li&gt;
&lt;li&gt;Measure performance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Feeling Safe

&lt;ul&gt;
&lt;li&gt;Variables are immutable by default&lt;/li&gt;
&lt;li&gt;
You Should Not Moved! Ownership

&lt;ul&gt;
&lt;li&gt;Ownership and Scope&lt;/li&gt;
&lt;li&gt;Borrowing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Asynchronous Rust

&lt;ul&gt;
&lt;li&gt;Parallelism&lt;/li&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;li&gt;Task (Green Thread)&lt;/li&gt;
&lt;li&gt;Async Runtime&lt;/li&gt;
&lt;li&gt;Ownership and Async&lt;/li&gt;
&lt;li&gt;Share States in Async Program: Arc and Mutex&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Hands-on

&lt;ul&gt;
&lt;li&gt;
Server-Sent Events (SSE) Server

&lt;ul&gt;
&lt;li&gt;BDD the Endpoint&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Builder Pattern

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;::default()&lt;/code&gt; v.s. &lt;code&gt;::new()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; for function parameter&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Type State&lt;/li&gt;

&lt;li&gt;Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;Share Memory by Communicating: Actor&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;(Some sections are still WIP)&lt;/p&gt;

&lt;p&gt;If you are already somewhat familiar with Rust, for example comfortable with &lt;code&gt;ownership&lt;/code&gt;/&lt;code&gt;borrowing&lt;/code&gt; but not with the async world, I suggest jumping to the Async section.&lt;/p&gt;

&lt;p&gt;If you are already familiar with async Rust, you can go directly to the Hands-on section, which introduces some design patterns you might find useful, or just go to the GitHub project &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;, where everything is open-sourced.&lt;/p&gt;

&lt;p&gt;Please expect some, &lt;del&gt;if not many&lt;/del&gt;, human errors. I documented my learning process hoping it can help someone on the internet who, like me, enjoys building an exciting project while learning a new language.&lt;/p&gt;

&lt;p&gt;Thanks &lt;a href="https://github.com/jihchi" rel="noopener noreferrer"&gt;jihchi&lt;/a&gt; for reviewing the draft of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Books and References &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This article is self-contained, which means it should contain everything you need to know to read the source code of &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, it's not possible to cover everything in every section. I try to provide references at the end of each section, and I highly recommend reading &lt;a href="https://doc.rust-lang.org/book/" rel="noopener noreferrer"&gt;The Rust Programming Language&lt;/a&gt; if you haven't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing is a batteries-included &lt;code&gt;console.log&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;console.log&lt;/code&gt; is a powerful tool in TypeScript: you can print whatever you want, which makes it super useful for debugging. The Rust equivalent is &lt;code&gt;print!&lt;/code&gt;; &lt;a href="https://doc.rust-lang.org/rust-by-example/hello/print.html" rel="noopener noreferrer"&gt;Rust by Example&lt;/a&gt; is an excellent document if you want to get started with &lt;code&gt;print!&lt;/code&gt; quickly.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;print!&lt;/code&gt; locks stdout on every call, and for hot paths it's better to &lt;a href="https://nnethercote.github.io/perf-book/io.html" rel="noopener noreferrer"&gt;lock stdout once and write manually&lt;/a&gt;, which is tedious.&lt;/p&gt;
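&lt;p&gt;Locking manually looks roughly like this, using only the standard library:&lt;/p&gt;

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Take the lock once, instead of re-locking on every `println!` call.
    let mut lock = stdout.lock();
    for i in 0..3 {
        writeln!(lock, "line {i}")?;
    }
    Ok(())
} // the lock is released here when `lock` goes out of scope
```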

&lt;p&gt;Luckily, we have an alternative: &lt;a href="https://github.com/tokio-rs/tracing" rel="noopener noreferrer"&gt;Tracing&lt;/a&gt;, an awesome project by the &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt; team. As a TypeScript developer, I feel at home using tracing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello! Rust!"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Print var: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  What is &lt;code&gt;{:?}&lt;/code&gt;? &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You might wonder what &lt;code&gt;{:?}&lt;/code&gt; is in the code block.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Print var: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;{:?}&lt;/code&gt; is for debug-printing a &lt;a href="https://doc.rust-lang.org/std/keyword.struct.html" rel="noopener noreferrer"&gt;struct&lt;/a&gt; (like an Object in TypeScript). Alternatively, &lt;code&gt;{:#?}&lt;/code&gt; pretty-prints it (&lt;a href="https://doc.rust-lang.org/rust-by-example/hello/print/print_debug.html" rel="noopener noreferrer"&gt;see more&lt;/a&gt;); think of it like &lt;code&gt;console.log(JSON.stringify(object, null, 2))&lt;/code&gt;.&lt;/p&gt;
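&lt;p&gt;A minimal, self-contained example; note that your struct needs &lt;code&gt;#[derive(Debug)]&lt;/code&gt; before &lt;code&gt;{:?}&lt;/code&gt; works on it:&lt;/p&gt;

```rust
// `{:?}` and `{:#?}` require the type to implement the `Debug` trait,
// which we derive here instead of writing by hand.
#[derive(Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 2 };
    println!("{:?}", p);  // prints: Point { x: 1, y: 2 }
    println!("{:#?}", p); // pretty-prints the fields across multiple lines
}
```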

&lt;h3&gt;
  
  
  Measure performance &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/tokio-rs/tracing" rel="noopener noreferrer"&gt;Tracing&lt;/a&gt; is an awesome for logging performance metrics, for example, if I want to measure how long &lt;code&gt;awesome()&lt;/code&gt; took.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;awesome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nf"&gt;awesome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;info_span!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"awesome"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This prints super useful messages, which tell us when we started invoking &lt;code&gt;awesome()&lt;/code&gt;, at which line, on which thread, and how long the function took to execute.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2023-10-22T09:01:13.128553Z INFO ThreadId&lt;span class="o"&gt;(&lt;/span&gt;01&lt;span class="o"&gt;)&lt;/span&gt; awesome src/main.rs:172: enter
2023-10-22T09:01:13.128569Z INFO ThreadId&lt;span class="o"&gt;(&lt;/span&gt;01&lt;span class="o"&gt;)&lt;/span&gt; awesome src/main.rs:172: close time.busy&lt;span class="o"&gt;=&lt;/span&gt;15.3µs time.idle&lt;span class="o"&gt;=&lt;/span&gt;3.96µs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Feeling Safe &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Rust is &lt;strong&gt;&lt;em&gt;safe&lt;/em&gt;&lt;/strong&gt; by default; the &lt;em&gt;safe&lt;/em&gt; usually refers to memory safety. However, in my experience, Rust also makes you feel safe shipping to production...once the code compiles.&lt;/p&gt;

&lt;p&gt;If you have ever written a line of JavaScript and then switched to TypeScript, you probably know what I mean by "feeling safe".&lt;/p&gt;

&lt;p&gt;TypeScript protects us from &lt;code&gt;TypeError: Cannot read property '' of undefined&lt;/code&gt; at compile time; Rust is like TypeScript with an &lt;strong&gt;&lt;em&gt;ultra&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;strict&lt;/code&gt; mode that protects us developers from making mistakes &lt;strong&gt;at compile time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rust makes pull requests easier to review and increases the confidence of shipping to production. The compiler's error messages might seem overwhelming, just like TypeScript errors at the beginning.&lt;/p&gt;

&lt;p&gt;However, if you have ever been under the stress of recovering production servers, you will know that learning to resolve compile-time errors is better than resolving runtime exceptions.&lt;/p&gt;

&lt;p&gt;To embrace the Rust &lt;em&gt;safety net&lt;/em&gt;, immutability and ownership are two key concepts to understand.&lt;/p&gt;
&lt;h3&gt;
  
  
  Variables are immutable by default &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;"Immutable by default" means once data created, they can't not be mutated, most will agree that immutable data makes &lt;a href="https://medium.com/dailyjs/use-const-and-make-your-javascript-code-better-aac4f3786ca1" rel="noopener noreferrer"&gt;your code better&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, a seasoned TypeScript developer probably knows the benefits of using &lt;code&gt;const&lt;/code&gt;: &lt;code&gt;const&lt;/code&gt; makes the intent explicit, and the compiler complains when you try to mutate the value.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Cannot assign to 'x' because it is a constant.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Rust, variables are immutable by default and only mutable if you explicitly declare them as mutable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// this does not compile, &lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// explicit `let mut x` to make mutation possible.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The book's chapter &lt;a href="https://doc.rust-lang.org/book/ch03-01-variables-and-mutability.html" rel="noopener noreferrer"&gt;Variables and Mutability&lt;/a&gt; has a comprehensive explanation of mutability in Rust.&lt;/p&gt;
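&lt;p&gt;For completeness, the mutable counterpart compiles without complaint once you opt in with &lt;code&gt;mut&lt;/code&gt;:&lt;/p&gt;

```rust
fn main() {
    // `mut` explicitly opts this binding in to mutation.
    let mut x = 5;
    println!("x is {x}");
    x = 6; // fine now, the compiler no longer objects
    println!("x is {x}");
}
```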

&lt;h3&gt;
  
  
  You should not &lt;strong&gt;&lt;em&gt;move&lt;/em&gt;&lt;/strong&gt;! Ownership &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Coming from a language with a garbage collector, the following code looks natural; we try to create &lt;code&gt;s2&lt;/code&gt; from &lt;code&gt;s1&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, the code does not compile; the compiler says you have &lt;strong&gt;moved&lt;/strong&gt; &lt;code&gt;s1&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;11 |   &lt;span class="nb"&gt;let &lt;/span&gt;s1 &lt;span class="o"&gt;=&lt;/span&gt; String::from&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"hello"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   |       &lt;span class="nt"&gt;--&lt;/span&gt; move occurs because &lt;span class="sb"&gt;`&lt;/span&gt;s1&lt;span class="sb"&gt;`&lt;/span&gt; has &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;String&lt;span class="sb"&gt;`&lt;/span&gt;, which does not implement the &lt;span class="sb"&gt;`&lt;/span&gt;Copy&lt;span class="sb"&gt;`&lt;/span&gt; trait
12 |   &lt;span class="nb"&gt;let &lt;/span&gt;s2 &lt;span class="o"&gt;=&lt;/span&gt; s1&lt;span class="p"&gt;;&lt;/span&gt;
   |            &lt;span class="nt"&gt;--&lt;/span&gt; value moved here
13 |   println!&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"{}"&lt;/span&gt;, s1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   |                  ^^ value borrowed here after move
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This might be the first, and a continually frustrating, compiler error message when starting Rust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F7hz5bz3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F7hz5bz3.jpg" alt="You should not pass!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rust does not ship with a garbage collector, which means nothing at runtime knows when to drop a value from memory once you don't need it anymore.&lt;/p&gt;

&lt;p&gt;To achieve this goal, Rust introduces the ownership checker, which makes the developer &lt;strong&gt;mark&lt;/strong&gt; a value once the rest of the code doesn't need it. The ownership checker helps you manage memory &lt;strong&gt;at compile time&lt;/strong&gt;, so we don't need to ship the code with a garbage collector that collects and drops unused values from memory at runtime.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;value moved here&lt;/code&gt; in the example above is telling us that the code violates the ownership rules, which are:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner at a time.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope, the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;The compiler is saying: Hey! &lt;code&gt;s1&lt;/code&gt; was the owner of &lt;code&gt;String::from("hello")&lt;/code&gt;, but you have &lt;strong&gt;&lt;em&gt;moved&lt;/em&gt;&lt;/strong&gt; the ownership from &lt;code&gt;s1&lt;/code&gt; to &lt;code&gt;s2&lt;/code&gt;. After the move, &lt;code&gt;s1&lt;/code&gt; is no longer valid, therefore you should not use it again in &lt;code&gt;println!&lt;/code&gt;!&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// ownership moved from s1 to s2&lt;/span&gt;
  &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// s1 is dropped, why you are still using it?&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you come from the TypeScript world (or any language with a garbage collector), ownership might look foreign; however, learning the ownership checker makes you aware of how your program uses memory.&lt;/p&gt;
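&lt;p&gt;As a quick aside (a minimal sketch of mine, not from the error above): if you really do need two independent owners, you can &lt;code&gt;clone&lt;/code&gt; the value, at the cost of a second heap allocation, so each variable owns its own &lt;code&gt;String&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn main() {
    let s1 = String::from("hello");
    // `clone` allocates a new String, so `s2` owns its own copy
    // and `s1` is still valid afterwards
    let s2 = s1.clone();
    println!("{} {}", s1, s2); // prints: hello hello
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;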
&lt;h4&gt;
  
  
  Ownership and Scope &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Let's review the ownership rules again, and dig deeper into the third rule.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner at a time.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope (the curly brackets &lt;code&gt;{}&lt;/code&gt;), the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the following example, the compiler stops us at the &lt;code&gt;print!&lt;/code&gt; call, because we violate the ownership rules by moving &lt;code&gt;owner&lt;/code&gt; into &lt;code&gt;do_something&lt;/code&gt; and then trying to use &lt;code&gt;owner&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;This does not compile:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// we took the ownership of "value" from `owner` and&lt;/span&gt;
    &lt;span class="c1"&gt;// "value" is dropped at the end of the `do_something` function&lt;/span&gt;
    &lt;span class="c1"&gt;// thus the variable `owner` does not own it anymore&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use of moved value: `owner` value used here after move&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=855280d7cee41053de575a3af7697934" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;print!("{}", owner)&lt;/code&gt; violates the ownership rule because we have already move the &lt;code&gt;owner&lt;/code&gt; into &lt;code&gt;do_something(owner)&lt;/code&gt;'s &lt;br&gt;
 scope, therefore, after the the &lt;code&gt;do_something(owner)&lt;/code&gt; execution is finished, the &lt;code&gt;owner&lt;/code&gt; is &lt;strong&gt;&lt;em&gt;out of scope&lt;/em&gt;&lt;/strong&gt;, the &lt;code&gt;owner&lt;/code&gt; is dropped and we can't use it anymore.&lt;/p&gt;
&lt;h4&gt;
  
  
  Borrow &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;To work with the ownership rules without giving values away, borrowing comes to the rescue.&lt;/p&gt;

&lt;p&gt;Borrowing uses reference syntax (&lt;code&gt;&amp;amp;&lt;/code&gt;) to let the Rust compiler know that we are only borrowing, not taking ownership. A reference is a promise: we are temporarily borrowing the value, we do not intend to take ownership, and we will give it back when we no longer need it.&lt;/p&gt;

&lt;p&gt;This compiles:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// `do_something` borrows `"value"` from `owner`&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// No more error!&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// "value" is NOT dropped at the end of the function&lt;/span&gt;
    &lt;span class="c1"&gt;// because we are just borrowing (`&amp;amp;String`) not taking the ownership&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=eeca5897ae7a8fd8fbee836218eceaf8" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like ownership, borrowing has a set of rules. These rules are like the contracts you make when you borrow something from someone else.&lt;/p&gt;

&lt;p&gt;A real-world analogy: you want to borrow the book &lt;a href="https://rust-for-rustaceans.com/" rel="noopener noreferrer"&gt;"Rust for Rustaceans"&lt;/a&gt; from a friend. To keep the friendship, you make a contract (a verbal promise: "I will return the borrowed book to you in one month"). The contract has to follow the borrowing rules:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;At any given time, you can have as many immutable references as you want, but only one mutable reference.&lt;/li&gt;
&lt;li&gt;A reference must always point to a valid value (referencing a dropped value is disallowed).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
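&lt;p&gt;A minimal sketch of rule 1 (my example, not from the book): any number of immutable borrows can coexist, but a mutable borrow must be exclusive:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn main() {
    let mut book = String::from("Rust for Rustaceans");

    // rule 1: as many immutable references as we want...
    let reader_a = &amp;amp;book;
    let reader_b = &amp;amp;book;
    println!("{} {}", reader_a, reader_b);

    // ...but only one mutable reference, and only because
    // `reader_a`/`reader_b` are no longer used after this point
    let editor = &amp;amp;mut book;
    editor.push_str(" (borrowed)");
    println!("{}", book);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;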

&lt;p&gt;It's OK if ownership and borrowing still seem blurry; the book's &lt;a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html" rel="noopener noreferrer"&gt;understanding ownership&lt;/a&gt; chapter is the best read on the topic, and you will get familiar with the ownership rules soon enough, after passing data around and having the compiler yell at you from time to time.&lt;/p&gt;

&lt;p&gt;If you are a busy developer, Let's Get Rusty's &lt;a href="https://www.youtube.com/watch?v=usJDUSrcwqI" rel="noopener noreferrer"&gt;The Rust Survival Guide&lt;/a&gt; is a great crash course on the ownership rules.&lt;/p&gt;
&lt;h2&gt;
  
  
  Asynchronous &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start this section, let's pin down the terminology.&lt;/p&gt;
&lt;h3&gt;
  
  
  Terminology
&lt;/h3&gt;

&lt;p&gt;Async is a programming language feature intended to give the program opportunities to execute one unit of computation while waiting for another unit of computation to complete.&lt;/p&gt;
&lt;h4&gt;
  
  
  Parallelism &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The program executes units of computation at the same time, simultaneously, for example running two computations on two different CPU cores.&lt;/p&gt;
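&lt;p&gt;A rough sketch of parallelism using OS threads from the standard library (my example, assuming the OS schedules the two spawned threads on different cores):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    // each `spawn` creates an OS thread; the OS may run them simultaneously
    let t1 = thread::spawn(|| (1..=100).sum::&amp;lt;u32&amp;gt;());
    let t2 = thread::spawn(|| (1..=100).filter(|n| n % 2 == 0).sum::&amp;lt;u32&amp;gt;());

    // `join` waits for a thread to finish and returns its result
    println!("{} {}", t1.join().unwrap(), t2.join().unwrap()); // prints: 5050 2550
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;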
&lt;h4&gt;
  
  
  Concurrency &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The program processes units of computation one by one, yielding quickly from one unit to another whenever a unit makes progress. Because it switches between units so quickly, it looks as if the program executes the units at the same time, but it is &lt;strong&gt;not&lt;/strong&gt; simultaneous (&lt;a href="https://youtu.be/Z-2siR9Ki84?si=L6MZCUQZmRA5hLcg&amp;amp;t=357" rel="noopener noreferrer"&gt;ref&lt;/a&gt;), as in the single-threaded Node.js runtime.&lt;/p&gt;
&lt;h4&gt;
  
  
  Task (Green Thread) &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;A task is some computation running in a &lt;em&gt;parallel&lt;/em&gt; or &lt;em&gt;concurrent&lt;/em&gt; system. In this article, the term &lt;strong&gt;task&lt;/strong&gt; refers to an &lt;a href="https://docs.rs/tokio/latest/tokio/task/index.html" rel="noopener noreferrer"&gt;asynchronous green thread&lt;/a&gt;: not an OS thread, but a unit of execution managed by the async runtime.&lt;/p&gt;
&lt;h3&gt;
  
  
  Runtime (the Task Runner) &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Node.js is a single-threaded, asynchronous runtime: the program can process tasks asynchronously, but never in parallel, because Node.js is single-threaded.&lt;/p&gt;

&lt;p&gt;To process tasks asynchronously in Rust, the developer needs to set up a task runner. The &lt;code&gt;main&lt;/code&gt; function (think of it like &lt;code&gt;index.ts&lt;/code&gt;), the entry point of a Rust program, is always synchronous, so the developer has to set up a runtime to be able to run asynchronous tasks.&lt;/p&gt;

&lt;p&gt;The following code uses &lt;a href="https://docs.rs/futures/latest/futures/executor/index.html" rel="noopener noreferrer"&gt;futures::executor&lt;/a&gt; as the async task runner.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// the async task runner.&lt;/span&gt;
    &lt;span class="nn"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// An async task&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In Rust, you are free to choose among async runtimes, just as in TypeScript we have &lt;code&gt;node.js&lt;/code&gt;, &lt;code&gt;bun&lt;/code&gt; and &lt;code&gt;deno&lt;/code&gt;. In Rust we have &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt;, &lt;a href="https://async.rs/" rel="noopener noreferrer"&gt;async-std&lt;/a&gt;, &lt;a href="https://github.com/smol-rs/smol" rel="noopener noreferrer"&gt;smol&lt;/a&gt; and &lt;a href="https://docs.rs/futures/latest/futures/" rel="noopener noreferrer"&gt;futures&lt;/a&gt;. These runtimes can be single-threaded, like &lt;code&gt;node.js&lt;/code&gt;, running tasks concurrently, or multi-threaded, with &lt;strong&gt;true&lt;/strong&gt; parallelism.&lt;/p&gt;

&lt;p&gt;You may find these videos useful for understanding &lt;code&gt;async/await&lt;/code&gt; in Rust.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=0HwrZp9CBD4" rel="noopener noreferrer"&gt;1 Hour Dive into Asynchronous Rust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=FNcXf-4CLH0" rel="noopener noreferrer"&gt;Async/await in Rust: Introduction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Ownership and Async &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;em&gt;You should not &lt;strong&gt;move&lt;/strong&gt;! Ownership&lt;/em&gt; section we discussed the ownership rules, and in the &lt;em&gt;Borrow&lt;/em&gt; section we discussed how to get around taking ownership by borrowing.&lt;/p&gt;

&lt;p&gt;In async Rust, whether you are using a single-threaded concurrent green-thread runtime or distributing computation across multiple OS threads (parallelism), the ownership rules always apply. In async Rust, the ownership rules prevent data races in concurrent and parallel programming, which is part of what is known as &lt;a href="https://doc.rust-lang.org/book/ch16-00-concurrency.html#fearless-concurrency" rel="noopener noreferrer"&gt;fearless concurrency&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Remember the ownership rules?&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner &lt;strong&gt;at a time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pay special attention to &lt;strong&gt;at a time&lt;/strong&gt;: it is how ownership helps us avoid data races when computations run at the same time.&lt;/p&gt;

&lt;p&gt;Let's look at the synchronous version again; in the previous example, this failed to compile...&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use of moved value: `owner` value used here after move&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=24619e92b8118da6d9900c423ad0f957" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;br&gt;
...because the code does not follow the ownership rules, that is, both &lt;code&gt;do_something()&lt;/code&gt; took ownership of &lt;code&gt;String::from("hello")&lt;/code&gt;, but Rust compiler only allows one ownership &lt;strong&gt;at a time&lt;/strong&gt;. To protect us from forgetting deallocating memory, the &lt;code&gt;owner&lt;/code&gt; is &lt;strong&gt;&lt;em&gt;moved&lt;/em&gt;&lt;/strong&gt; into the fist &lt;code&gt;do_something(owner)&lt;/code&gt;, and we can't compile the code because this error &lt;code&gt;use of moved value:&lt;/code&gt;owner&lt;code&gt;value used here after&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;3 |     do_something&lt;span class="o"&gt;(&lt;/span&gt;owner&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  |                  &lt;span class="nt"&gt;-----&lt;/span&gt; value moved here
4 |     // use of moved value: &lt;span class="sb"&gt;`&lt;/span&gt;owner&lt;span class="sb"&gt;`&lt;/span&gt; value used here after move
5 |     print!&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"{}"&lt;/span&gt;, owner&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  |                  ^^^^^ value borrowed here after move
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We can get around this by borrowing (&lt;code&gt;&amp;amp;&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use &amp;amp; to reference owner&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// We can still use owner after&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=323a44e074de5572fd74fb80249c3c9d" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same ownership rules apply to asynchronous Rust. Let's look at a parallel version, which &lt;code&gt;spawn&lt;/code&gt;s an OS thread running code &lt;strong&gt;simultaneously&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=28553de752bf5e85d5ed32d5abf4cc10" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We knew we needed borrowing (&lt;code&gt;&amp;amp;&lt;/code&gt;) to avoid taking ownership when calling &lt;code&gt;do_something(&amp;amp;owner)&lt;/code&gt;. However, the compiler still rejects the code, saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;closure may outlive the current function, but it borrows &lt;code&gt;owner&lt;/code&gt;, which is owned by the current function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This compiler error tells us that the reference to &lt;code&gt;owner&lt;/code&gt; borrowed inside the thread closure might, at some point in time, point to a value that has already been dropped outside the closure, violating the rule we discussed in borrowing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;A reference must always point to a valid value (referencing a dropped value is disallowed).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;To give this &lt;strong&gt;&lt;em&gt;outlive&lt;/em&gt;&lt;/strong&gt; error more context, try to run this code in the &lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=7be27d273fbee33b82c642fbfd214055" rel="noopener noreferrer"&gt;playground&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"from thread"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"from main"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You might be surprised that only &lt;code&gt;from main&lt;/code&gt; appears in the console. Rust's &lt;code&gt;std&lt;/code&gt; thread implementation detaches spawned threads: the parent thread (in our case &lt;code&gt;main()&lt;/code&gt;) does not wait for a child thread created via &lt;code&gt;thread::spawn&lt;/code&gt;, and in general a child thread may even outlive its parent.&lt;/p&gt;

&lt;p&gt;That's the reason you only see &lt;code&gt;from main&lt;/code&gt; in the console: &lt;code&gt;main&lt;/code&gt; returned, and the process exited, before the child thread got a chance to execute &lt;code&gt;|| print!("from thread")&lt;/code&gt;.&lt;/p&gt;
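&lt;p&gt;If we want both lines to show up, the parent can explicitly wait for the child by keeping the &lt;code&gt;JoinHandle&lt;/code&gt; that &lt;code&gt;thread::spawn&lt;/code&gt; returns (a sketch of mine, assuming we actually want &lt;code&gt;main&lt;/code&gt; to block):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    // keep the handle returned by `spawn`...
    let handle = thread::spawn(|| {
        print!("from thread");
    });
    // ...and `join` blocks `main` until the child thread finishes,
    // so the process can't exit before "from thread" is printed
    handle.join().unwrap();
    print!("from main");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;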

&lt;p&gt;If we step back and think about what borrowing across threads implies:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    thread::spawn(|| {
        // we borrow owner, but the borrowed value (`owner`)
        // might be dropped in main(), that is the `&amp;amp;` might point
        // to a dropped value
        do_something(&amp;amp;owner);
    });
}
fn do_something(_: &amp;amp;String) {
    // 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=d545cc80534ff175b793a9df34dcd941" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are running code &lt;strong&gt;simultaneously&lt;/strong&gt; from &lt;code&gt;main&lt;/code&gt; and from a thread, &lt;strong&gt;at the same time&lt;/strong&gt;. The compiler stops us, telling us that the closure in the thread may outlive the current function while it borrows &lt;code&gt;owner&lt;/code&gt;, which is owned by &lt;code&gt;main()&lt;/code&gt;. We shouldn't do this because the reference might point to &lt;code&gt;owner&lt;/code&gt; after it has become invalid in the parent thread (&lt;code&gt;main()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The same &lt;strong&gt;&lt;em&gt;outlive&lt;/em&gt;&lt;/strong&gt; problem can be observed in concurrent code, even though in most concurrent runtimes code executes not in OS threads but in tasks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
[dependencies]
tokio = { version = "1.32.0", features = ["full"] }
*/&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://www.rustexplorer.com/b#LyoKW2RlcGVuZGVuY2llc10KdG9raW8gPSB7IHZlcnNpb24gPSAiMS4zMi4wIiwgZmVhdHVyZXMgPSBbImZ1bGwiXSB9CiovCgojW3Rva2lvOjptYWluXQphc3luYyBmbiBtYWluKCkgewogICAgbGV0IG93bmVyID0gU3RyaW5nOjpmcm9tKCJ2YWx1ZSIpOwogICAgCiAgICB0b2tpbzo6c3Bhd24oZG9fc29tZXRoaW5nKCZvd25lcikpOwp9Cgphc3luYyBmbiBkb19zb21ldGhpbmcoXzogJlN0cmluZykgewogICAgLy8KfQ==" rel="noopener noreferrer"&gt;rustexplorer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code fails with a similar error message: &lt;code&gt;`owner` does not live long enough&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To get around this &lt;a href="https://stevedonovan.github.io/rust-gentle-intro/7-shared-and-networking.html#threads-dont-borrow" rel="noopener noreferrer"&gt;Threads Don't Borrow&lt;/a&gt; error, that is, the ownership rule that disallows referencing a parent's value from child threads/tasks, we have a few solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;move&lt;/code&gt; the value into the thread (&lt;a href="https://doc.rust-lang.org/book/ch16-01-threads.html#using-move-closures-with-threads" rel="noopener noreferrer"&gt;read more&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    // `move` moves `owner` into the spawned thread
    thread::spawn(move || {
        do_something(&amp;amp;owner);
    });
}
fn do_something(_: &amp;amp;String) {
    //
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=ff8990b967ff3d589c248f04a0d2b0d3" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Use a &lt;a href="https://doc.rust-lang.org/beta/std/thread/fn.scope.html" rel="noopener noreferrer"&gt;&lt;code&gt;scoped&lt;/code&gt; thread&lt;/a&gt;, which always exits before the parent thread (&lt;code&gt;main&lt;/code&gt;) exits.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    // a scoped thread always exits before the scope ends,
    // so we are allowed to borrow `owner` with a reference
    thread::scope(|s| {
        s.spawn(|| {
            do_something(&amp;amp;owner);
        });
    });
}
fn do_something(_: &amp;amp;String) {
    //
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=cc57c70bd170def86338d77eee8bda83" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;"Do not communicate by sharing memory; instead, share memory by communicating" as in the &lt;a href="https://go.dev/doc/effective_go#concurrency" rel="noopener noreferrer"&gt;Go language documentation&lt;/a&gt;. We will dive into this in the actor section.&lt;/li&gt;
&lt;li&gt;Atomic Reference Counting (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;&lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;) and Mutual Exclusion (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
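Option 3, sharing memory by communicating, has a standard-library incarnation in `std::sync::mpsc` channels: values are moved through the channel rather than borrowed across threads. A minimal sketch (the helper name `send_through_channel` is just for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// move a value from a child thread back to the parent through a channel
fn send_through_channel() -> String {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // the String is moved into the channel, not borrowed or shared
        tx.send(String::from("value")).unwrap();
    });
    // recv() blocks until the spawned thread has sent the value
    rx.recv().unwrap()
}

fn main() {
    println!("{}", send_through_channel()); // prints "value"
}
```

Because the value is moved, no lifetime or ownership conflict arises between parent and child.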

&lt;p&gt;We will dive into &lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;&lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; in the next section.&lt;/p&gt;
&lt;h3&gt;
  
  
  Share States in Async Program: Arc and Mutex &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Sharing state in an async program can be a challenge. Ownership rules only allow a value to have one owner at a time, and we can't use plain borrowing because the compiler cannot prove that a borrower living in another thread/task will never point to an already-dropped value.&lt;/p&gt;

&lt;p&gt;To solve this problem, we can use &lt;code&gt;Arc&lt;/code&gt; (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;Atomic Reference Counting&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Arc&lt;/code&gt; is &lt;em&gt;safe&lt;/em&gt; to use to share state across multiple threads/tasks. Wrapping data in an &lt;code&gt;Arc&lt;/code&gt; gives us multiple handles to the same data:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;arc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=5d4b3d2af20a1152f4fcd56477fb04b4" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Arc&lt;/code&gt; allows &lt;strong&gt;&lt;em&gt;safe&lt;/em&gt;&lt;/strong&gt; reads of the inner data across threads; it's similar to borrowing, but works for asynchronous code blocks.&lt;/p&gt;
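The example above moves the single `Arc` into one thread; to read the same data from several threads at once, each thread takes its own `Arc::clone` handle, and all handles point to one allocation. A minimal sketch (the helper name `share_across_threads` is illustrative):

```rust
use std::sync::Arc;
use std::thread;

// spawn `n` reader threads, each with its own clone of the `Arc` handle;
// returns the strong count observed after all readers have finished
fn share_across_threads(n: usize) -> usize {
    let shared = Arc::new(String::from("value"));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            // every thread gets its own handle to the same allocation
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                assert_eq!(shared.as_str(), "value");
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // all clones have been dropped; only the original handle remains
    Arc::strong_count(&shared)
}

fn main() {
    println!("{}", share_across_threads(4)); // prints 1
}
```

Cloning an `Arc` only bumps an atomic reference count; the inner `String` itself is never copied.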

&lt;p&gt;However, &lt;code&gt;Arc&lt;/code&gt; alone only allows reads. To enable threads to write to the inner data, the data needs a proper locking mechanism, that is, &lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; (reads: mutual exclusion) will block threads waiting for the lock to become available. When calling &lt;code&gt;lock()&lt;/code&gt; on a thread, the thread will become the only thread that can access the data, &lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt; blocks other threads from access the data, therefore, it's safe to mutate the data while the lock has not been unlocked.&lt;/p&gt;

&lt;p&gt;To safely mutate the shared state:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;mutex_clone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" world (once)!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex_clone&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" world (twice)!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=e6c1a5be6c381f2895c7733d4a8aa102" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will dive deeper into how to use &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt; to share mutable state in the section Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about sharing state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The book's &lt;a href="https://doc.rust-lang.org/book/ch16-03-shared-state.html" rel="noopener noreferrer"&gt;Shared-State Concurrency&lt;/a&gt; chapter.&lt;/li&gt;
&lt;li&gt;Tokio's documentation has a dedicated page on how to &lt;a href="https://tokio.rs/tokio/tutorial/shared-state" rel="noopener noreferrer"&gt;share state between async tasks&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Hands-On &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the following sections, we will start building the copilot server.&lt;/p&gt;
&lt;h3&gt;
  
  
  Server-Sent Events (SSE) Server &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In this PR, we add the endpoint for the copilot client.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg"&gt;Code Llama as Drop-In Replacement for Copilot Code Completion&lt;/a&gt; we know that a copilot server is essentially an HTTP server that accepts a request with a prompt and returns &lt;code&gt;JSON&lt;/code&gt; chunks via &lt;a href="https://en.wikipedia.org/wiki/Server-sent_events" rel="noopener noreferrer"&gt;Server-Sent Events (SSE)&lt;/a&gt;. Let's specify the SSE endpoint and create a Server-Sent Events (SSE) server using &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;axum&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The URL path of the endpoint is &lt;code&gt;/v1/engines/:engine/completions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The endpoint should accept a &lt;code&gt;POST&lt;/code&gt; request.&lt;/li&gt;
&lt;li&gt;The endpoint takes a path parameter (&lt;code&gt;:engine&lt;/code&gt;) and a request body.&lt;/li&gt;
&lt;li&gt;The endpoint returns an SSE stream of text chunks (&lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this endpoint is almost identical to &lt;a href="https://platform.openai.com/docs/guides/text-generation/completions-api" rel="noopener noreferrer"&gt;OpenAI's completions&lt;/a&gt; endpoint, we can use &lt;code&gt;curl&lt;/code&gt; to see the input (request body) and the output (SSE text chunks):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.openai.com/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Say this is a test",
    "max_tokens": 7,
    "temperature": 0,
    "stream": true
  }'&lt;/span&gt;
&lt;span class="c"&gt;# chuck 0&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"This "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck 1&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"is "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck 2&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"a "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck with `"finish_reason":"stop"`&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"test."&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# end of SSE event stream&lt;/span&gt;
data: &lt;span class="o"&gt;[&lt;/span&gt;DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
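Each `data:` chunk above has a regular shape we can model as plain Rust structs. This is a sketch: the names `Completion` and `Choice` are assumptions (matching how the test later deserializes events), and in the real server these structs would derive serde's `Serialize`/`Deserialize`:

```rust
// one choice inside a streamed completion chunk
#[derive(Debug, Clone, PartialEq)]
struct Choice {
    text: String,
    index: u32,
    // `null` in every chunk except the last, which carries "stop"
    finish_reason: Option<String>,
}

// one SSE `data:` payload, mirroring the JSON chunks above
#[derive(Debug, Clone, PartialEq)]
struct Completion {
    id: String,
    object: String,
    model: String,
    created: u64,
    choices: Vec<Choice>,
}

// build the final chunk of a stream (the one with `finish_reason: "stop"`)
fn final_chunk(model: &str, text: &str) -> Completion {
    Completion {
        id: String::from("..."),
        object: String::from("text_completion"),
        model: model.to_string(),
        created: 1,
        choices: vec![Choice {
            text: text.to_string(),
            index: 0,
            finish_reason: Some(String::from("stop")),
        }],
    }
}

fn main() {
    let chunk = final_chunk("gpt-3.5-turbo-instruct", "test.");
    assert_eq!(chunk.choices[0].finish_reason.as_deref(), Some("stop"));
    println!("{}", chunk.object); // prints "text_completion"
}
```

The `data: [DONE]` sentinel is not a JSON chunk, so it is handled separately before deserializing.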
&lt;h4&gt;
  
  
  BDD (Behaviour-Driven Development) the Endpoint &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;We will use &lt;a href="https://docs.rs/reqwest-eventsource/latest/reqwest_eventsource/" rel="noopener noreferrer"&gt;&lt;code&gt;reqwest_eventsource&lt;/code&gt;&lt;/a&gt; and its friends in the test to act as the client, which sends requests to our endpoint &lt;code&gt;/v1/engines/:engine/completions&lt;/code&gt; and asserts that the response is what we expect. Since &lt;a href="https://docs.rs/reqwest-eventsource/latest/reqwest_eventsource/" rel="noopener noreferrer"&gt;&lt;code&gt;reqwest_eventsource&lt;/code&gt;&lt;/a&gt; and friends are not used in our final binary, let's add them under &lt;code&gt;dev-dependencies&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dev-dependencies]&lt;/span&gt;
&lt;span class="py"&gt;reqwest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.11.22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"multipart"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;reqwest-eventsource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.5.0"&lt;/span&gt;
&lt;span class="py"&gt;eventsource-stream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2.3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Add a dummy handler and axum's router to route requests to &lt;code&gt;POST /v1/engines/:engine/completions&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;routing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;'static&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"Hello, World!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/v1/engines/:engine/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Translate the spec into the test:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(test)]&lt;/span&gt;
&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// imports are only for the tests&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;

    &lt;span class="cd"&gt;/// `super::*` means "everything in the parent module"&lt;/span&gt;
    &lt;span class="cd"&gt;/// It will bring all of the test module’s parent’s items into scope.&lt;/span&gt;
    &lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cd"&gt;/// A helper function that spawns our application in the background&lt;/span&gt;
    &lt;span class="cd"&gt;/// and returns its address (e.g. http://127.0.0.1:[random_port])&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;spawn_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// Bind to localhost at the port 0, which will let the OS assign an available port to us&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// We retrieve the port assigned to us by the OS&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="nf"&gt;.local_addr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.port&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// We return the application address to the caller!&lt;/span&gt;
        &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://{}:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cd"&gt;/// The #[tokio::test] annotation on the test_sse_engine_completion function is a macro.&lt;/span&gt;
    &lt;span class="cd"&gt;/// Similar to #[tokio::main] It transforms the async fn test_sse_engine_completion()&lt;/span&gt;
    &lt;span class="cd"&gt;/// into a synchronous fn test_sse_engine_completion() that initializes a runtime instance&lt;/span&gt;
    &lt;span class="cd"&gt;/// and executes the async main function.&lt;/span&gt;
    &lt;span class="nd"&gt;#[tokio::test]&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;test_sse_engine_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listening_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;spawn_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"code-llama-7b"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;time_before_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.duration_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;reqwest&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"{}/v1/engines/{engine}/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;listening_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;.await&lt;/span&gt;
            &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.bytes_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.eventsource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// iterate over the stream of events&lt;/span&gt;
        &lt;span class="c1"&gt;// and collect them into a vector of Completion objects&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c1"&gt;// break the loop at the end of SSE stream&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"[DONE]"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="c1"&gt;// parse the event data into a Completion object&lt;/span&gt;
                    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                    &lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nd"&gt;panic!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error in event stream"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;// The endpoint should return at least one completion object&lt;/span&gt;
        &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Check that each completion object has the correct fields&lt;/span&gt;
        &lt;span class="c1"&gt;// note that we didn't check all the values of the fields because&lt;/span&gt;
        &lt;span class="c1"&gt;// `serde_json::from_str::&amp;lt;Completion&amp;gt;` should panic if the field &lt;/span&gt;
        &lt;span class="c1"&gt;// is missing or in unexpected format&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// id should be a non-empty string&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.object&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"text_completion"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.created&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;time_before_request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// each completion object should have at least one choice&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.choices&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// check that each choice has a non-empty text&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.choices&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="py"&gt;.text&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="c1"&gt;// finish_reason should can be None or Some(String)&lt;/span&gt;
                &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="py"&gt;.finish_reason&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.system_fingerprint&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Run the tests with &lt;code&gt;cargo test&lt;/code&gt;. They fail, because we haven't implemented &lt;code&gt;completion()&lt;/code&gt; yet.&lt;/p&gt;

&lt;p&gt;Add the endpoint. To pass the tests, it needs to respond with SSE chunks of the &lt;code&gt;Completion&lt;/code&gt; struct. Let's fake the values in the struct first; we will connect the endpoint to the LLM later. It's important to stabilise the HTTP interface first.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;async_stream&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sse&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KeepAlive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sse&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;oxpilot&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;types&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Choice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Infallible&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Reference: https://github.com/tokio-rs/axum/blob/main/examples/sse/src/main.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;// `Json&amp;lt;T&amp;gt;` will automatically deserialize the request body to a type `T` as JSON.&lt;/span&gt;
    &lt;span class="nf"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Infallible&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// `stream!` is a macro from [`async_stream`](https://docs.rs/async-stream/0.3.5/async_stream/index.html) &lt;/span&gt;
    &lt;span class="c1"&gt;// that makes it easy to create a `futures::stream::Stream` from a generator.&lt;/span&gt;
    &lt;span class="nn"&gt;Sse&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;stream!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="c1"&gt;// Create a new `SseEvent` with the default settings.&lt;/span&gt;
          &lt;span class="c1"&gt;// `SseEvent::default().data("Hello, World!")` will return `data: Hello, World!` as the event text chuck.&lt;/span&gt;
          &lt;span class="nn"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;// Serialize the `Completion` struct to JSON and return it as the event text chunk.&lt;/span&gt;
            &lt;span class="nf"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="c1"&gt;// json! is a macro from serde_json that makes it easy to create JSON values from a struct.&lt;/span&gt;
              &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Completion&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"cmpl-"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"text_completion"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                      &lt;span class="nf"&gt;.duration_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                      &lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"unknown"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                  &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Choice&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                      &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;" world!"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                      &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stop"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                  &lt;span class="p"&gt;}],&lt;/span&gt;
                  &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Usage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                      &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
                  &lt;span class="p"&gt;},&lt;/span&gt;
                  &lt;span class="n"&gt;system_fingerprint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;.keep_alive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;KeepAlive&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;That's it; the tests should pass now.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;running 1 &lt;span class="nb"&gt;test
test &lt;/span&gt;tests::test_sse_engine_completion ... ok

&lt;span class="nb"&gt;test &lt;/span&gt;result: ok. 1 passed&lt;span class="p"&gt;;&lt;/span&gt; 0 failed&lt;span class="p"&gt;;&lt;/span&gt; 0 ignored&lt;span class="p"&gt;;&lt;/span&gt; 0 measured&lt;span class="p"&gt;;&lt;/span&gt; 0 filtered out&lt;span class="p"&gt;;&lt;/span&gt; finished &lt;span class="k"&gt;in &lt;/span&gt;0.04s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Alternatively, we can test the copilot end-to-end:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;will bind the server to port &lt;code&gt;6666&lt;/code&gt;, because we have this in &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ..&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0:6666"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Edit &lt;code&gt;settings.json&lt;/code&gt; in VSCode:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"github.copilot.advanced"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"debug.overrideProxyUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:6666"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Open any file, and we should see ghost text with &lt;code&gt;world!&lt;/code&gt; coming from our copilot server running on port &lt;code&gt;6666&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F40050247-fa0b-4253-9173-863dc720217f" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F40050247-fa0b-4253-9173-863dc720217f"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F3b047efa-2621-4879-bc89-080eb67080b3" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F3b047efa-2621-4879-bc89-080eb67080b3"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Builder Pattern &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In this section, we will implement a new &lt;a href="https://doc.rust-lang.org/book/ch05-01-defining-structs.html" rel="noopener noreferrer"&gt;&lt;code&gt;struct&lt;/code&gt;&lt;/a&gt; (similar to an &lt;code&gt;Object&lt;/code&gt; in other languages), &lt;code&gt;LLMBuilder&lt;/code&gt;, in &lt;code&gt;llm.rs&lt;/code&gt;, and use it in our binary's entry point, &lt;code&gt;main.rs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To be able to use components from &lt;code&gt;llm.rs&lt;/code&gt; in &lt;code&gt;main.rs&lt;/code&gt;, we lay out our files like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;└── src
    ├── lib.rs
    ├── llm.rs
    └── main.rs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In &lt;code&gt;llm.rs&lt;/code&gt;, we make &lt;code&gt;LLMBuilder&lt;/code&gt; public:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// llm.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Declare the new module (&lt;a href="https://doc.rust-lang.org/stable/reference/items/modules.html" rel="noopener noreferrer"&gt;mod&lt;/a&gt;) in &lt;code&gt;lib.rs&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// rust will resolve to `./llm.rs`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and use the module in &lt;code&gt;main.rs&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;oxpilot&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;LLMBuilder&lt;/code&gt; is implemented using the "&lt;a href="https://www.youtube.com/watch?v=Z_3WOSiYYFY" rel="noopener noreferrer"&gt;Builder&lt;/a&gt;" design pattern, a &lt;em&gt;creational&lt;/em&gt; pattern that lets you construct complex objects step by step.&lt;/p&gt;

&lt;p&gt;The end result looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TheBloke/CodeLlama-7B-GGU"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"codellama-7b.Q2_K.gguf"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
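&lt;p&gt;Under the hood, &lt;code&gt;build()&lt;/code&gt; can collect the optional fields and validate them, returning a &lt;code&gt;Result&lt;/code&gt; instead of panicking when something mandatory is missing. The following is a hypothetical, synchronous sketch of that validation idea only; the article's real &lt;code&gt;build()&lt;/code&gt; is &lt;code&gt;async&lt;/code&gt; and does more:&lt;/p&gt;

```rust
// Hypothetical sketch: field names and error type are illustrative.
#[derive(Default)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
}

// A stripped-down stand-in for the built `LLM`.
pub struct LLM {
    pub tokenizer_repo_id: String,
    pub model_repo_id: String,
    pub model_file_name: String,
}

impl LLMBuilder {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn tokenizer_repo_id(mut self, id: impl Into<String>) -> Self {
        self.tokenizer_repo_id = Some(id.into());
        self
    }
    pub fn model_repo_id(mut self, id: impl Into<String>) -> Self {
        self.model_repo_id = Some(id.into());
        self
    }
    pub fn model_file_name(mut self, name: impl Into<String>) -> Self {
        self.model_file_name = Some(name.into());
        self
    }
    // Validation happens once, at build time: a missing mandatory
    // field becomes an `Err` the caller must handle.
    pub fn build(self) -> Result<LLM, String> {
        Ok(LLM {
            tokenizer_repo_id: self.tokenizer_repo_id.ok_or("tokenizer_repo_id is required")?,
            model_repo_id: self.model_repo_id.ok_or("model_repo_id is required")?,
            model_file_name: self.model_file_name.ok_or("model_file_name is required")?,
        })
    }
}

fn main() {
    // Missing mandatory fields => build fails.
    assert!(LLMBuilder::new().model_file_name("file").build().is_err());
    // All mandatory fields set => build succeeds.
    assert!(LLMBuilder::new()
        .tokenizer_repo_id("repo")
        .model_repo_id("repo")
        .model_file_name("file")
        .build()
        .is_ok());
}
```

&lt;p&gt;Pushing validation into &lt;code&gt;build()&lt;/code&gt; keeps every setter infallible; the caller handles a single &lt;code&gt;Result&lt;/code&gt; at the end of the chain.&lt;/p&gt;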
&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Constructor&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;::default()&lt;/code&gt; v.s. &lt;code&gt;::new()&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Rust does not have constructors for &lt;code&gt;struct&lt;/code&gt;s to assign values to fields when creating new instances; it's common to use an &lt;a href="https://doc.rust-lang.org/stable/book/ch05-03-method-syntax.html#associated-functions" rel="noopener noreferrer"&gt;associated function&lt;/a&gt; &lt;code&gt;::new()&lt;/code&gt; for the same purpose. Another option is to use the &lt;a href="https://doc.rust-lang.org/std/default/trait.Default.html" rel="noopener noreferrer"&gt;&lt;code&gt;Default&lt;/code&gt;&lt;/a&gt; trait as a "constructor".&lt;/p&gt;

&lt;p&gt;We implement the &lt;a href="https://doc.rust-lang.org/std/default/trait.Default.html" rel="noopener noreferrer"&gt;&lt;code&gt;Default&lt;/code&gt;&lt;/a&gt; trait for &lt;code&gt;LLMBuilder&lt;/code&gt; and implement &lt;code&gt;new()&lt;/code&gt; for users who prefer the &lt;code&gt;::new()&lt;/code&gt; pattern.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// same as `LLMBuilder::default()`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
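&lt;p&gt;The &lt;code&gt;Default&lt;/code&gt; half can come for free: if every field has a defaultable type, deriving &lt;code&gt;Default&lt;/code&gt; is enough, and &lt;code&gt;new()&lt;/code&gt; simply delegates to it. A minimal sketch; the field names are illustrative, not necessarily the article's final struct:&lt;/p&gt;

```rust
// Hypothetical field layout; deriving `Default` initialises every
// `Option` field to `None`.
#[derive(Default, Debug, PartialEq)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
}

impl LLMBuilder {
    // `new()` is just a conventional alias for `default()`.
    pub fn new() -> Self {
        Self::default()
    }
}

fn main() {
    // Both constructors produce the same empty builder.
    assert_eq!(LLMBuilder::new(), LLMBuilder::default());
}
```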
&lt;h4&gt;
  
  
  &lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; for function parameter &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;To make the functions in our &lt;code&gt;struct&lt;/code&gt; friendly for users, we use the &lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; trick to allow passing both &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;&amp;amp;str&lt;/code&gt; as function parameters.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// both are accepted&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"string_slice"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
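&lt;p&gt;Inside such a setter, &lt;code&gt;param.into()&lt;/code&gt; converts whichever type the caller passed into a &lt;code&gt;String&lt;/code&gt; before storing it. A minimal sketch, assuming a hypothetical &lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt; field:&lt;/p&gt;

```rust
#[derive(Default)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
}

impl LLMBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    // Accepts both `&str` and `String`; `.into()` normalises to `String`.
    // Returning `Self` keeps the builder chainable.
    pub fn tokenizer_repo_id(mut self, param: impl Into<String>) -> Self {
        self.tokenizer_repo_id = Some(param.into());
        self
    }
}

fn main() {
    let a = LLMBuilder::new().tokenizer_repo_id("string_slice");
    let b = LLMBuilder::new().tokenizer_repo_id(String::from("String"));
    assert_eq!(a.tokenizer_repo_id.as_deref(), Some("string_slice"));
    assert_eq!(b.tokenizer_repo_id.as_deref(), Some("String"));
}
```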
&lt;h3&gt;
  
  
  Type State &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In the previous section, we implemented the builder for &lt;code&gt;LLM&lt;/code&gt;. That is great: we can construct an &lt;code&gt;LLM&lt;/code&gt; with a descriptive chain of methods.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, let's step aside and put ourselves in the users' shoes. If a user tries to use &lt;code&gt;LLMBuilder&lt;/code&gt;, it's possible that they forget to supply a mandatory parameter; for example, one may forget to chain &lt;code&gt;model_repo_id()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is &lt;strong&gt;acceptable&lt;/strong&gt;. Unlike languages designed to throw exceptions at runtime, Rust propagates the error back to the user through &lt;a href="https://doc.rust-lang.org/std/result/" rel="noopener noreferrer"&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/a&gt;. There won't be a runtime exception as long as the user handles the &lt;a href="https://doc.rust-lang.org/std/result/" rel="noopener noreferrer"&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/a&gt; properly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// handle the error properly here&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, what if we could improve the DX and surface the problem as early as possible, shortening the feedback loop, ideally while writing the code, i.e., as a compile-time error?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Type State&lt;/em&gt;&lt;/strong&gt; is a pattern that encodes state in the type system, so the compiler checks the state before the code ever runs.&lt;/p&gt;
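A minimal, generic illustration of the pattern, using a hypothetical `Door` type (not from the article's codebase): the state is a zero-sized type parameter, and each state's `impl` block exposes only the transitions that are valid from that state.

```rust
use std::marker::PhantomData;

// Zero-sized marker types describing the door's state.
struct Open;
struct Closed;

// The state lives in the type parameter, not in a runtime field.
struct Door<State> {
    _state: PhantomData<State>,
}

impl Door<Closed> {
    fn new() -> Self {
        Door { _state: PhantomData }
    }
    // `open()` consumes a closed door and returns an open one.
    fn open(self) -> Door<Open> {
        Door { _state: PhantomData }
    }
    fn state_name(&self) -> &'static str {
        "closed"
    }
}

impl Door<Open> {
    // `close()` only exists on `Door<Open>`.
    fn close(self) -> Door<Closed> {
        Door { _state: PhantomData }
    }
    fn state_name(&self) -> &'static str {
        "open"
    }
}

fn main() {
    let door = Door::new(); // Door<Closed>
    let door = door.open(); // Door<Open>
    assert_eq!(door.state_name(), "open");
    let door = door.close(); // Door<Closed>
    assert_eq!(door.state_name(), "closed");
    // door.open().open(); // would not compile: `open()` is not defined for Door<Open>
}
```

Invalid transitions are unrepresentable: calling `open()` twice simply does not type-check.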

&lt;p&gt;Our goal is to make the compiler warn us when a mandatory parameter for creating an &lt;code&gt;LLM&lt;/code&gt; is missing. For example, this should fail to compile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lllm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The compiler will tell the user: hey, &lt;code&gt;build()&lt;/code&gt; can't be used yet, you shall not pass!&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff8261337-e52d-4d59-9429-6201bdb335df" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff8261337-e52d-4d59-9429-6201bdb335df"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;and the code intelligence in the editor will suggest: hey, there is a &lt;code&gt;tokenizer_repo_id()&lt;/code&gt; method available, would you like to try that first?&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F4f12ffef-34d9-4b90-9327-3fe26b7fe45e" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F4f12ffef-34d9-4b90-9327-3fe26b7fe45e"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;We can help our users find the next step by defining the type states:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Init state when `::new()` is called.&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Intermediate state, with token repo id, ready to accept model repo id&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, move each method into the &lt;code&gt;impl&lt;/code&gt; block for the corresponding state. At the beginning, the state is &lt;code&gt;InitState&lt;/code&gt;, and the user can only call &lt;code&gt;new()&lt;/code&gt; (which does not change the state) and &lt;code&gt;tokenizer_repo_id()&lt;/code&gt;, which returns an instance with &lt;code&gt;State=WithTokenizerRepoId&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="c1"&gt;// does not change state&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="c1"&gt;// change state to `WithTokenizerRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If we inspect the builder instance, we will notice that it has &lt;code&gt;WithTokenizerRepoId&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F9e49bce2-c577-4ec0-8c25-12dff71a21c3" class="article-body-image-wrapper"&gt;&lt;img alt="Screenshot 2023-11-02 at 20 58 57" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F9e49bce2-c577-4ec0-8c25-12dff71a21c3"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;That's great! Let's add an &lt;code&gt;impl&lt;/code&gt; block for the builder in the &lt;code&gt;WithTokenizerRepoId&lt;/code&gt; state, so the user knows what to do next.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;tokenizer_repo_id&lt;/code&gt; in place, the next step is to set &lt;code&gt;model_repo_id&lt;/code&gt;: calling &lt;code&gt;model_repo_id()&lt;/code&gt; sets &lt;code&gt;model_repo_id&lt;/code&gt; and returns &lt;code&gt;LLMBuilder&amp;lt;WithModelRepoId&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Intermediate state, with model repo id, ready to accept model file name&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_repo_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
           &lt;span class="c1"&gt;// change state to `WithModelRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are almost there. The final step is to assign &lt;code&gt;model_file_name&lt;/code&gt;, after which the builder is ready to build.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="cd"&gt;/// With both token repo id and model repo id&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_file_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
           &lt;span class="c1"&gt;// change state to `WithModelRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, implement &lt;code&gt;LLMBuilder&amp;lt;ReadyState&amp;gt;&lt;/code&gt;, which adds the &lt;code&gt;build()&lt;/code&gt; method.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. We have improved our builder: the compiler emits errors when any mandatory parameter is missing, and runtime exceptions are avoided.&lt;/p&gt;

&lt;p&gt;The final result:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;// mandatory parameters, without these compiler warns&lt;/span&gt;
    &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"string_slice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inspecting the builder, we can see it has the &lt;code&gt;ReadyState&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff3eacfb1-1bf3-4b17-ab31-60b444a346e6" class="article-body-image-wrapper"&gt;&lt;img alt="Screenshot 2023-11-02 at 21 15 30" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff3eacfb1-1bf3-4b17-ab31-60b444a346e6"&gt;&lt;/a&gt;
&lt;/p&gt;
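Putting all the pieces together, here is a complete, compilable sketch of the pattern. The struct fields and the `PhantomData`-based state marker are my assumptions, since the article's snippets elide the struct body with `...`, and `build()` is made synchronous with a `String` stand-in return value so the sketch stays self-contained:

```rust
use std::marker::PhantomData;

// State markers, as in the article.
pub struct InitState;
pub struct WithTokenizerRepoId;
pub struct WithModelRepoId;
pub struct ReadyState;

// Assumed fields: the article elides the struct body with `...`.
pub struct LLMBuilder<State> {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
    _state: PhantomData<State>,
}

// Carry the collected fields over while changing the state type.
fn transition<A, B>(b: LLMBuilder<A>) -> LLMBuilder<B> {
    LLMBuilder {
        tokenizer_repo_id: b.tokenizer_repo_id,
        model_repo_id: b.model_repo_id,
        model_file_name: b.model_file_name,
        _state: PhantomData,
    }
}

impl LLMBuilder<InitState> {
    pub fn new() -> Self {
        LLMBuilder {
            tokenizer_repo_id: None,
            model_repo_id: None,
            model_file_name: None,
            _state: PhantomData,
        }
    }
    pub fn tokenizer_repo_id(mut self, id: impl Into<String>) -> LLMBuilder<WithTokenizerRepoId> {
        self.tokenizer_repo_id = Some(id.into());
        transition(self)
    }
}

impl LLMBuilder<WithTokenizerRepoId> {
    pub fn model_repo_id(mut self, id: impl Into<String>) -> LLMBuilder<WithModelRepoId> {
        self.model_repo_id = Some(id.into());
        transition(self)
    }
}

impl LLMBuilder<WithModelRepoId> {
    pub fn model_file_name(mut self, name: impl Into<String>) -> LLMBuilder<ReadyState> {
        self.model_file_name = Some(name.into());
        transition(self)
    }
}

impl LLMBuilder<ReadyState> {
    // The article's `build()` is `async` and returns `Result<LLM>`;
    // a synchronous `String` stand-in keeps this sketch dependency-free.
    pub fn build(self) -> String {
        format!(
            "{}:{}:{}",
            self.tokenizer_repo_id.unwrap(),
            self.model_repo_id.unwrap(),
            self.model_file_name.unwrap()
        )
    }
}

fn main() {
    let llm = LLMBuilder::new()
        .tokenizer_repo_id("repo")
        .model_repo_id("repo")
        .model_file_name("file")
        .build();
    assert_eq!(llm, "repo:repo:file");
    // LLMBuilder::new().build(); // would not compile: `build()` only exists on ReadyState
}
```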

&lt;h3&gt;
  
  
  Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;
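As a minimal, generic sketch of sharing state across threads with `Arc<Mutex<_>>` (illustrative only, not code from the project): `Arc` provides shared ownership across threads, and `Mutex` guards each mutation.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter: `Arc` gives shared ownership, `Mutex` guards mutation.
    let counter = Arc::new(Mutex::new(0u32));
    let mut handles = Vec::new();

    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            // Lock, mutate, and release (the guard drops at end of scope).
            *counter.lock().unwrap() += 1;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);
}
```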

&lt;h3&gt;
  
  
  Share Memory by Communicating: &lt;strong&gt;Actor&lt;/strong&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;
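And a minimal std-only sketch of "share memory by communicating": an actor owns its state exclusively and mutates it in response to messages received over a channel, so no lock is needed. The `Msg` enum and its variants are made up for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

// Messages the actor understands (hypothetical names for this sketch).
enum Msg {
    Add(u32),
    Get(mpsc::Sender<u32>),
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // The actor owns its state; only this thread ever touches `total`,
    // so no Mutex is required.
    let actor = thread::spawn(move || {
        let mut total = 0;
        for msg in rx {
            match msg {
                Msg::Add(n) => total += n,
                Msg::Get(reply) => reply.send(total).unwrap(),
            }
        }
    });

    tx.send(Msg::Add(2)).unwrap();
    tx.send(Msg::Add(3)).unwrap();
    let (rtx, rrx) = mpsc::channel();
    tx.send(Msg::Get(rtx)).unwrap();
    assert_eq!(rrx.recv().unwrap(), 5);

    drop(tx); // closing the channel ends the actor's receive loop
    actor.join().unwrap();
}
```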

</description>
      <category>rust</category>
      <category>typescript</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Use Code Llama (and other open LLMs) as Drop-In Replacement for Copilot Code Completion</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 27 Aug 2023 09:37:08 +0000</pubDate>
      <link>https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg</link>
      <guid>https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg</guid>
      <description>&lt;p&gt;&lt;a href="https://huggingface.co/docs/transformers/main/model_doc/code_llama"&gt;CodeLlama&lt;/a&gt; is now available under a commercial-friendly license.&lt;/p&gt;

&lt;p&gt;The question arises: Can we replace GitHub Copilot and use &lt;a href="https://huggingface.co/docs/transformers/main/model_doc/code_llama"&gt;CodeLlama&lt;/a&gt; as the code completion LLM without transmitting source code to the &lt;em&gt;cloud&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;The answer is both yes and no. Tweaking hyperparameters becomes essential in this endeavor. Let's explore the options available as of August 2023.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You might want to read my latest article on &lt;a href="https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md"&gt;copilot&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By analyzing Copilot's VSCode extension&lt;sup id="fnref1"&gt;1&lt;/sup&gt; at &lt;a href="https://github.com/thakkarparth007/copilot-explorer"&gt;thakkarparth007/copilot-explorer&lt;/a&gt;, it becomes evident that Copilot relies on an OpenAI API-compatible backend. Drawing from prior work such as &lt;a href="https://github.com/fauxpilot/fauxpilot"&gt;fauxpilot&lt;/a&gt;, we know that it's possible to switch the backend by introducing specific modifications to the &lt;a href="https://code.visualstudio.com/docs/getstarted/settings"&gt;&lt;code&gt;settings.json&lt;/code&gt;&lt;/a&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// fauxpilot was using `codegen`&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codegen&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// OpenAI API compatible server url&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:5000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:5000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing an OpenAI API-Compatible Server
&lt;/h2&gt;

&lt;p&gt;To make use of CodeLlama, an OpenAI API-compatible server is all that's required. As of 2023, there are numerous options available, and here are a few noteworthy ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/abetlen/llama-cpp-python/tree/main#web-server"&gt;llama-cpp-python&lt;/a&gt;: This Python-based option supports llama models exclusively.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vllm-project/vllm"&gt;vllm&lt;/a&gt;: Known for high performance, though it lacks support for GGML.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/flexflow/FlexFlow"&gt;flexflow&lt;/a&gt;: Touting faster performance compared to &lt;code&gt;vllm&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-skynet/LocalAI"&gt;LocalAI&lt;/a&gt;: A feature-rich choice that even supports image generation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md"&gt;FastChat&lt;/a&gt;: Developed by LMSYS.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/bentoml/OpenLLM"&gt;OpenLLM&lt;/a&gt;: An actively developed project.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;: Noteworthy for its focus on Kubernetes.&lt;/li&gt;
&lt;li&gt;...and many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice among these options is entirely up to you. For the purpose of this article, I'll be focusing on &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;, primarily because I am the main contributor and thus intimately familiar with all the implementation details.&lt;/p&gt;

&lt;p&gt;Let's begin with GGML models. These models boast a low memory requirement and operate without the need for a GPU (which might not be as affordable anymore). If you possess robust CUDA (Nvidia) GPUs, I recommend directly proceeding to the GPTQ section of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the OpenAI API-Compatible Server
&lt;/h2&gt;

&lt;p&gt;Getting your OpenAI API-compatible server up and running is a straightforward process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clone the Repository and Install Dependencies
&lt;/h3&gt;

&lt;p&gt;Use this one-liner to clone the repository and set up the necessary dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh repo clone chenhunghan/ialacol &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;ialacol &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the server and download the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-7B-GGML"&gt;model&lt;/a&gt;.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"TheBloke/CodeLlama-7B-GGML"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"codellama-7b.ggmlv3.Q2_K.bin
"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LOGGING_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DEBUG"&lt;/span&gt; &lt;span class="c"&gt;# optional, more on this later&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configure VSCode Copilot extension, pointing to the server.
&lt;/h3&gt;

&lt;p&gt;To integrate the server with the VSCode Copilot extension, edit &lt;code&gt;settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codellama-7b.ggmlv3.Q2_K.bin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:9999&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:9999&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these configurations in place, you're ready to roll. CodeLlama's code completion capabilities will now be at your fingertips.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweaking for Optimal Performance
&lt;/h2&gt;

&lt;p&gt;While CodeLlama's completion capabilities are impressive, out of the box they might not always meet your expectations, yielding useful suggestions only occasionally, and they are unlikely to match the proficiency of GitHub Copilot, especially in terms of inference speed.&lt;/p&gt;

&lt;p&gt;Several factors contribute to this discrepancy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our current model utilizes 7 billion parameters. To potentially enhance performance, consider experimenting with the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-13B-GGML"&gt;13B&lt;/a&gt; and &lt;a href="https://huggingface.co/TheBloke/CodeLlama-34B-GGUF"&gt;34B&lt;/a&gt; variants.&lt;/li&gt;
&lt;li&gt;GGML/GGUF models are tailored to minimize memory usage rather than prioritize speed. While they excel in asynchronous tasks, code completion mandates swift responses from the server.&lt;/li&gt;
&lt;li&gt;GitHub Copilot's extension generates a multitude of requests as you type, which can pose challenges, given that language models typically process one prompt at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To address these considerations, exploring smaller models is a viable option. Smaller models often exhibit a &lt;em&gt;faster&lt;/em&gt; inference speed. Here are some alternatives to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/salesforce/CodeGen"&gt;CodeGen&lt;/a&gt; offers a &lt;a href="https://huggingface.co/ravenscroftj/CodeGen-2B-multi-ggml-quant"&gt;2B quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/replit/replit-code-v1-3b"&gt;Replit-Code&lt;/a&gt; provides a &lt;a href="https://huggingface.co/abetlen/replit-code-v1-3b-ggml"&gt;3B quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/bigcode/starcoder"&gt;StarCoder&lt;/a&gt; presents a &lt;a href="https://huggingface.co/TheBloke/starcoder-GGML"&gt;quantized version&lt;/a&gt; as well as a &lt;a href="https://huggingface.co/mike-ravkine/gpt_bigcode-santacoder-GGML"&gt;quantized 1B version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/bigcode/tiny_starcoder_py"&gt;TinyCoder&lt;/a&gt; stands as a very compact model with only 164 million parameters (specifically for &lt;code&gt;python&lt;/code&gt;). There's even a &lt;a href="https://huggingface.co/mike-ravkine/tiny_starcoder_py-GGML"&gt;quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/stabilityai/stablecode-completion-alpha-3b-4k"&gt;Stablecode-Completion&lt;/a&gt; by StabilityAI also offers a &lt;a href="https://huggingface.co/TheBloke/stablecode-completion-alpha-3b-4k-GGML"&gt;quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a potential increase in throughput, a useful strategy is queuing requests before the inference server. This optimization boosts throughput (not speed) and can be achieved using tools like &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt; (Disclaimer: I authored this tool, and &lt;code&gt;tib&lt;/code&gt; is still in its early alpha phase).&lt;/p&gt;
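To illustrate the idea behind such a batcher (this is not `tib`'s actual implementation), the core of it is dispatching queued requests across several upstream inference servers, for example round-robin:

```rust
// Hypothetical round-robin dispatcher, sketching how completion requests
// could be spread across several OpenAI API-compatible upstreams.
struct Balancer {
    upstreams: Vec<String>,
    next: usize,
}

impl Balancer {
    fn new(upstreams: Vec<String>) -> Self {
        Balancer { upstreams, next: 0 }
    }
    // Pick the next upstream in round-robin order; a real batcher would
    // also queue requests while all upstreams are busy with inference.
    fn pick(&mut self) -> String {
        let url = self.upstreams[self.next].clone();
        self.next = (self.next + 1) % self.upstreams.len();
        url
    }
}

fn main() {
    let mut lb = Balancer::new(vec![
        "http://localhost:9998".to_string(),
        "http://localhost:9999".to_string(),
    ]);
    assert_eq!(lb.pick(), "http://localhost:9998");
    assert_eq!(lb.pick(), "http://localhost:9999");
    assert_eq!(lb.pick(), "http://localhost:9998"); // wraps around
}
```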

&lt;p&gt;Leveraging the various trade-offs at our disposal, let's proceed with the plan: utilizing a high-quality 3B model with a small footprint. Additionally, let's set up two instances of servers to enhance performance further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# in `ialacol` folder you just cloned.&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;THREAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;span class="c"&gt;# Use small model https://stability.ai/blog/stablecode-llm-generative-ai-coding&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"TheBloke/TheBloke/stablecode-completion-alpha-3b-4k-GGML"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"&lt;/span&gt;
&lt;span class="c"&gt;# truncate the prompt to make inference faster...&lt;/span&gt;
&lt;span class="c"&gt;# (it's a trade off, you get lower quality results too)&lt;/span&gt;
&lt;span class="nv"&gt;TRUNCATE_PROMPT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9998
&lt;span class="c"&gt;# in another terminal session&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load Balancing with a Queue to Increase Throughput
&lt;/h3&gt;

&lt;p&gt;To enhance throughput, we can employ load balancing with a queuing mechanism. Here's how you can set it up using &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting Up &lt;code&gt;tib&lt;/code&gt; for Load Balancing
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository and set up the necessary environment:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# clone and setup&lt;/span&gt;
gh repo clone ialacol/text-inference-batcher &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;text-inference-batcher &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start &lt;code&gt;tib&lt;/code&gt;, pointing it at your upstream servers.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export UPSTREAMS="http://localhost:9998,http://localhost:9999"
npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Configure the Copilot extension, pointing it at the load balancer.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// pointing to `tib`&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:8000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:8000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the smaller model and prompt truncation trading away some inference quality, response times improved noticeably. The completions, however, still fall short of GitHub Copilot's.&lt;/p&gt;

&lt;p&gt;Let's now venture to push the limits in the opposite direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging Cloud Infrastructure for Enhanced Performance
&lt;/h2&gt;

&lt;p&gt;If you possess powerful cloud infrastructure equipped with GPUs, the process becomes notably streamlined.&lt;/p&gt;

&lt;p&gt;In this scenario, we will harness the capabilities of Kubernetes due to its exceptional automation features. Both &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt; and &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt; are inherently compatible with Kubernetes, which further simplifies the setup.&lt;/p&gt;

&lt;p&gt;Let's delve into deploying the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-34B-GPTQ"&gt;34B CodeLlama GPTQ model&lt;/a&gt; onto Kubernetes clusters with CUDA acceleration, using the &lt;code&gt;Helm&lt;/code&gt; package manager:&lt;/p&gt;

&lt;p&gt;(&lt;code&gt;values.yaml&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/chenhunghan/ialacol-gptq:latest&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/CodeLlama-34B-GPTQ&lt;/span&gt;
    &lt;span class="na"&gt;TOP_K&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;TOP_P&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.9&lt;/span&gt;
    &lt;span class="na"&gt;MAX_TOKENS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;THREADS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Request a node with Nvidia 1 GPU&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30Gi&lt;/span&gt;
    &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="c1"&gt;# You probably need to use these to select a node with GPUs.&lt;/span&gt;
&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
&lt;span class="c"&gt;# work one&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; codellama-worker-0 ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;span class="c"&gt;# work two&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; codellama-worker-1 ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;span class="c"&gt;# and maybe more? Depends on your budget :)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, load balancing using &lt;code&gt;tib&lt;/code&gt; with this &lt;code&gt;values.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ialacol/text-inference-batcher-nodejs:latest&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# pointing to our workers&lt;/span&gt;
    &lt;span class="na"&gt;UPSTREAMS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://codellama-worker-0:8000,http://codellama-worker-1:8000"&lt;/span&gt;
    &lt;span class="c1"&gt;# increase this if your the worker can handle more then one inference at a time.&lt;/span&gt;
    &lt;span class="na"&gt;MAX_CONNECT_PER_UPSTREAM&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="c1"&gt;# If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout&lt;/span&gt;
  &lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"&lt;/span&gt;
&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; tib text-inference-batcher/text-inference-batcher-nodejs &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expose the &lt;code&gt;tib&lt;/code&gt; service by utilizing your cloud's load balancer, or for testing purposes, you can employ &lt;code&gt;kubectl port-forward&lt;/code&gt;.&lt;/p&gt;
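&lt;p&gt;For a quick smoke test before wiring up the editor, you can port-forward the service and send a completion request directly. This assumes the Helm release is named &lt;code&gt;tib&lt;/code&gt; and listens on port 8000; the &lt;code&gt;model&lt;/code&gt; value may differ depending on how your workers report their model names:&lt;/p&gt;

```shell
# forward the tib service to localhost (assumes the release is named `tib`)
kubectl port-forward svc/tib 8000:8000 &

# tib exposes an OpenAI-compatible API, so a plain completion request works
curl -X POST -H 'Content-Type: application/json' \
  -d '{"prompt": "def fibonacci(n):", "model": "TheBloke/CodeLlama-34B-GPTQ"}' \
  http://localhost:8000/v1/completions
```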

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With CodeLlama operating at 34B, benefiting from CUDA acceleration, and employing at least one worker, the code completion experience becomes not only swift but also of commendable quality. I would confidently state that this setup is on par with the performance of GitHub Copilot.&lt;/p&gt;

&lt;p&gt;Nonetheless, it's crucial to acknowledge that this particular configuration does come at a notably higher cost when compared to &lt;a href="https://github.com/features/copilot#pricing"&gt;GitHub Copilot&lt;/a&gt;. Striking a balance between budget considerations and privacy concerns is imperative. This investment is especially justifiable when handling proprietary or enterprise-level software projects. Conversely, the pricing structure of Copilot holds its own appeal.&lt;/p&gt;

&lt;p&gt;In essence, we're fortunate to have a range of options at our disposal. Your thoughts and feedback are valuable, so feel free to share your insights in the comments section.&lt;/p&gt;

&lt;p&gt;Let's keep the conversation going! 🚀&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Highly recommended: go through the Copilot source code; you will learn about its prompt engineering and the multiple levels of client-side caching applied before a request ever hits the server. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llama</category>
    </item>
    <item>
      <title>Deploy Llama 2 AI on Kubernetes, Now</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Wed, 19 Jul 2023 16:41:47 +0000</pubDate>
      <link>https://dev.to/chenhunghan/deploy-llama-2-ai-on-kubernetes-now-2jc5</link>
      <guid>https://dev.to/chenhunghan/deploy-llama-2-ai-on-kubernetes-now-2jc5</guid>
      <description>&lt;p&gt;Llama 2 is the newest open-sourced LLM with a &lt;em&gt;custom&lt;/em&gt; commercial &lt;a href="https://ai.meta.com/llama/license/"&gt;license&lt;/a&gt; by &lt;a href="https://huggingface.co/meta-llama"&gt;Meta&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are a few simple steps to try Llama 2 13B on Kubernetes, in just a few clicks.&lt;/p&gt;

&lt;p&gt;You will need a node with a PVC of about 10GB and 16 vCPUs to get reasonable response times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; values.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/Llama-2-13B-chat-GGML
    DEFAULT_MODEL_FILE: llama-2-13b-chat.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm &lt;span class="nb"&gt;install &lt;/span&gt;llama-2-13b-chat ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Port forward&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/llama-2-13b-chat 8000:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Talk to it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "messages": [{"role": "user", "content": "Hello, are you better then llama version one?"}], "temperature":"1", "model": "llama-2-13b-chat.ggmlv3.q4_0.bin"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8000/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi there! I'm happy to help answer your questions. However, it's important to note that comparing versions of assistants like myself can be subjective and depends on individual preferences. Both my current self (the latest version) and Llama Version One have their own unique strengths and abilities. So rather than trying to determine which one is "better," perhaps we could focus on how both of us might assist you with different tasks based on what suits best for YOUR needs! Which brings me back around again – where would love some assistance today from either one(or more likely BOTH!) of our amazing offerings?” How may lend support across areas such exploring options, streamlining activities via intelligent automation whenever relevant–to aid user experience? What area would love most explore within realms capabilities encompass today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;The project used to deploy Llama 2 on k8s is open-sourced under the MIT license; see &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AI for Everyone!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llama</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Cloud Native Workflow for *Private* MPT-30B AI Apps</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 01 Jul 2023 14:16:56 +0000</pubDate>
      <link>https://dev.to/chenhunghan/cloud-native-workflow-for-private-ai-apps-2omb</link>
      <guid>https://dev.to/chenhunghan/cloud-native-workflow-for-private-ai-apps-2omb</guid>
      <description>&lt;p&gt;In this article, we will guide you through the process of developing your own private AI application 🤖, leveraging the capabilities of Kubernetes.&lt;/p&gt;

&lt;p&gt;Unlike many other tutorials, we will &lt;strong&gt;NOT&lt;/strong&gt; rely on OpenAI APIs. Instead, we will utilize a private AI instance with an Apache 2.0-licensed model, &lt;a href="https://huggingface.co/mosaicml/mpt-30b"&gt;MPT-30B&lt;/a&gt;, which ensures the &lt;strong&gt;confidentiality&lt;/strong&gt; of all 🔒 sensitive data 🔒 within your Kubernetes cluster. No data goes to the third-party cloud 🙅‍♂️ 🌩️!&lt;/p&gt;

&lt;p&gt;To set up the development environment on Kubernetes, we will utilize &lt;a href="https://www.devspace.sh/"&gt;devspace&lt;/a&gt;. This environment includes a file sync pipeline for your AI application, as well as the backend &lt;a href="https://github.com/chenhunghan/ialacol"&gt;AI API&lt;/a&gt; (a RESTful API service designed to replace OpenAI API) for the AI app.&lt;/p&gt;

&lt;p&gt;Let's kick-start the process by deploying the necessary services on Kubernetes using the command &lt;code&gt;devspace deploy&lt;/code&gt;. DevSpace will handle the deployment of the initial structure of our applications, along with their dependencies, including &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;. For more detailed explanations, please refer to the in-line comments provided in the code snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is the configuration file for DevSpace&lt;/span&gt;
&lt;span class="c1"&gt;# &lt;/span&gt;
&lt;span class="c1"&gt;# devspace use namespace private-ai # suggest to use a namespace instead of the default name space&lt;/span&gt;
&lt;span class="c1"&gt;# devspace deploy # deploy the skeleton of the app and the dependencies (ialacol)&lt;/span&gt;
&lt;span class="c1"&gt;# devspace dev # start syncing files to the container&lt;/span&gt;
&lt;span class="c1"&gt;# devspace purge # to clean up&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2beta1&lt;/span&gt;
&lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# This are the manifest our private app deployment&lt;/span&gt;
  &lt;span class="c1"&gt;# The app will be in "sleep mode" after `devspace deploy`, and start when we start&lt;/span&gt;
  &lt;span class="c1"&gt;# syncing files to the container by `devspace dev`&lt;/span&gt;
  &lt;span class="na"&gt;private-ai-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# We are deploying the so-called Component Chart: https://devspace.sh/component-chart/docs&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;component-chart&lt;/span&gt;
        &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://charts.devspace.sh&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/loft-sh/devspace-containers/python:3-alpine&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;
            &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;99999"&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;private-ai-app&lt;/span&gt;
  &lt;span class="na"&gt;ialacol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# the backend for the AI app, we are using ialacol https://github.com/chenhunghan/ialacol/&lt;/span&gt;
      &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ialacol&lt;/span&gt;
        &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://chenhunghan.github.io/ialacol&lt;/span&gt;
      &lt;span class="c1"&gt;# overriding values.yaml of ialacol helm chart&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/chenhunghan/ialacol:latest&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We are using MPT-30B, which is the most sophisticated model at the moment&lt;/span&gt;
            &lt;span class="c1"&gt;# If you want to start with some small but mightym try orca-mini&lt;/span&gt;
            &lt;span class="c1"&gt;# DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML&lt;/span&gt;
            &lt;span class="c1"&gt;# DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin&lt;/span&gt;
            &lt;span class="c1"&gt;# MPT-30B&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/mpt-30B-GGML&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mpt-30b.ggmlv0.q4_1.bin&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_META&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# Request more resource if needed&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="c1"&gt;# pvc for storing the cache&lt;/span&gt;
        &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5Gi&lt;/span&gt;
            &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
            &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
        &lt;span class="na"&gt;cacheMountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/cache&lt;/span&gt;
        &lt;span class="c1"&gt;# pvc for storing the models&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
            &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
            &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
        &lt;span class="na"&gt;modelMountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/models&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="c1"&gt;# You might want to use the following to select a node with more CPU and memory&lt;/span&gt;
        &lt;span class="c1"&gt;# for MPT-30B, we need at least 32GB of memory&lt;/span&gt;
        &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
        &lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's wait a few seconds for the pods to become green. I am using &lt;a href="https://github.com/lensapp/lens"&gt;Lens&lt;/a&gt;; it's awesome, btw.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zLJF6H8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7jy8twpu7x43mg76cts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zLJF6H8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7jy8twpu7x43mg76cts.png" alt="Waiting for pending pods" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When all pods are green. We are ready for the next step.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5tBTnSu6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clbgbqscmj6i0rmmw5r7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5tBTnSu6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clbgbqscmj6i0rmmw5r7.png" alt="Pods are ready" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The private AI app we are developing is a simple web server with an endpoint &lt;code&gt;POST /prompt&lt;/code&gt;. When a client sends a request with a &lt;code&gt;prompt&lt;/code&gt; in the request body to &lt;code&gt;POST /prompt&lt;/code&gt;, the endpoint's controller will forward the &lt;code&gt;prompt&lt;/code&gt; to the backend &lt;a href="https://github.com/chenhunghan/ialacol"&gt;AI API&lt;/a&gt;, retrieve the response, and send it back to the client.&lt;/p&gt;

&lt;p&gt;To begin, let's install the necessary dependencies on our local machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai &lt;span class="c"&gt;# We are not using OpenAI API, but we can use openai client library to simplify things because our backend (ialacol) has OpenAI compatible RESTful interface.&lt;/span&gt;
pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and create a &lt;code&gt;main.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/prompt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="c1"&gt;# Add more logics here, for example, you can add the context to the prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# using context augmentation retrieval methods
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"mpt-30b.ggmlv0.q4_1.bin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation of our app's &lt;code&gt;POST /prompt&lt;/code&gt; endpoint is straightforward: it acts as a proxy, forwarding the request to the ialacol backend. You can extend it further with additional functionality, such as retrieval-augmented generation based on the provided &lt;code&gt;prompt&lt;/code&gt;.&lt;/p&gt;
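
&lt;p&gt;As a rough sketch of what that extension might look like (the document store and keyword scoring below are hypothetical stand-ins for a real retriever, such as a vector database):&lt;/p&gt;

```python
# A minimal sketch of retrieval-augmented prompting. The DOCUMENTS
# list and the naive keyword scoring are hypothetical stand-ins for
# a real retriever such as a vector database.
DOCUMENTS = [
    "ialacol is an OpenAI API compatible server for GGML models.",
    "DevSpace syncs local files into a running Kubernetes pod.",
]

def retrieve_context(prompt, top_k=1):
    """Rank documents by naive keyword overlap with the prompt."""
    words = set(prompt.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda doc: len(words.intersection(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def augment(prompt):
    """Prepend the retrieved context before forwarding the prompt."""
    context = "\n".join(retrieve_context(prompt))
    return "Context:\n" + context + "\n\nQuestion: " + prompt
```

&lt;p&gt;The augmented string would then replace &lt;code&gt;prompt&lt;/code&gt; in the call to the backend.&lt;/p&gt;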

&lt;p&gt;With the core functionality of the app in place, let's synchronize the source files to the cluster by running the command &lt;code&gt;devspace dev&lt;/code&gt;. This command performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It instructs DevSpace to sync the files located in the root folder to the &lt;code&gt;/app&lt;/code&gt; folder of the remote pod.&lt;/li&gt;
&lt;li&gt;Whenever changes are made to the &lt;code&gt;requirements.txt&lt;/code&gt; file, it triggers a &lt;code&gt;pip install&lt;/code&gt; within the pod.&lt;/li&gt;
&lt;li&gt;Additionally, it forwards port &lt;code&gt;8000&lt;/code&gt;, allowing us to access the app at &lt;code&gt;http://localhost:8000&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;private-ai-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use the label selector to select the pod for swapping out the container&lt;/span&gt;
    &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;private-ai-app&lt;/span&gt;
    &lt;span class="c1"&gt;# use the name space we assign by devspace use namespace&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DEVSPACE_NAMESPACE}&lt;/span&gt;
    &lt;span class="na"&gt;devImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/loft-sh/devspace-containers/python:3-alpine&lt;/span&gt;
    &lt;span class="na"&gt;workingDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uvicorn"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main:app"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--reload"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--host"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# expose the port 8000 to the host&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="c1"&gt;# Add env for the pod if needed&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# This will tell openai python library to use the ialacol service instead of the OpenAI cloud&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_BASE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ialacol.${DEVSPACE_NAMESPACE}.svc.cluster.local:8000/v1"&lt;/span&gt;
    &lt;span class="c1"&gt;# You don't need to have an OpenAI API key, but OpenAI python library will complain without it&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-xxx"&lt;/span&gt;
    &lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./:/app&lt;/span&gt;
        &lt;span class="na"&gt;excludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;requirements.txt&lt;/span&gt;
        &lt;span class="na"&gt;printLogs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;uploadExcludeFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.dockerignore&lt;/span&gt;
        &lt;span class="na"&gt;downloadExcludeFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.gitignore&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./requirements.txt:/app/requirements.txt&lt;/span&gt;
        &lt;span class="c1"&gt;# start the container after uploading the requirements.txt and install the dependencies&lt;/span&gt;
        &lt;span class="na"&gt;startContainer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;printLogs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;onUpload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
              &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
            &lt;span class="na"&gt;onChange&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requirements.txt"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;lastLines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the file sync to complete (you should see some logs in the terminal), then test the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "prompt": "Hello!" }'&lt;/span&gt; http://localhost:8000/prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, enjoy building your first private AI app  🥳!&lt;/p&gt;

&lt;p&gt;The full source code for this article is available at &lt;a href="https://github.com/chenhunghan/private-ai-app-starter-python"&gt;private-ai-app-starter-python&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>cloudnative</category>
      <category>llm</category>
    </item>
    <item>
      <title>Offline AI 🤖 on Github Actions 🙅‍♂️💰</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 01 Jul 2023 07:55:53 +0000</pubDate>
      <link>https://dev.to/chenhunghan/offline-ai-on-github-actions-38a1</link>
      <guid>https://dev.to/chenhunghan/offline-ai-on-github-actions-38a1</guid>
      <description>&lt;p&gt;In this article, we will walk through the steps to set up an offline AI on Github Actions that respects your privacy by &lt;strong&gt;&lt;em&gt;NOT&lt;/em&gt;&lt;/strong&gt; sending your source code to the internet. This AI will add a touch of humor by telling jokes whenever a developer creates a boring pull request.&lt;/p&gt;

&lt;p&gt;Github provides a generous offering: as long as your project is open source, you can use their Github-hosted runners for free.&lt;/p&gt;

&lt;p&gt;However, the Github-hosted runner comes with some limitations in terms of computational power. It offers 2 vCPUs, 7GB of RAM, and 14GB of storage (&lt;a href="https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources" rel="noopener noreferrer"&gt;ref&lt;/a&gt;). On the other hand, AI computing, or LLM inference, is considered a luxury due to its resource requirements and associated costs 💸.&lt;/p&gt;

&lt;p&gt;The stock price of Nvidia (the company that makes GPUs for AI):&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ceammnhl2yietv3lkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ceammnhl2yietv3lkk.png" alt="The stock price of Nvidia"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, thanks to the efforts of amazing community projects like &lt;a href="https://github.com/ggerganov/ggml" rel="noopener noreferrer"&gt;ggml&lt;/a&gt;, it is now possible to run LLM (Large Language Model) on edge devices such as 🍓🥧 Raspberry Pi 4.&lt;/p&gt;

&lt;p&gt;In this article, I will present the Github Actions snippets that allow you to run an LLM with 3B parameters directly on Github Actions, even with just 2 CPU cores and 7GB of RAM. These actions are triggered when a developer initiates a new pull request, and the AI will lighten the mood by sharing a joke to entertain the developer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Can 3B AI with 2 CPUs make good jokes?&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;TEMPERATURE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/orca_mini_3B-GGML&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orca-mini-3b.ggmlv3.q4_0.bin&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_META&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;THREADS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;CONTEXT_LENGTH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create k8s Kind Cluster&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm/kind-action@v1.7.0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;kubectl cluster-info&lt;/span&gt;
          &lt;span class="s"&gt;kubectl get nodes&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Helm&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/setup-helm@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.12.0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install ialacol and wait for pods to be ready&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol&lt;/span&gt;
          &lt;span class="s"&gt;helm repo update&lt;/span&gt;

          &lt;span class="s"&gt;cat &amp;gt; values.yaml &amp;lt;&amp;lt;EOF&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 1&lt;/span&gt;
          &lt;span class="s"&gt;deployment:&lt;/span&gt;
            &lt;span class="s"&gt;image: quay.io/chenhunghan/ialacol:latest&lt;/span&gt;
            &lt;span class="s"&gt;env:&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_HG_REPO_ID: $DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_FILE: $DEFAULT_MODEL_FILE&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_META: $DEFAULT_MODEL_META&lt;/span&gt;
              &lt;span class="s"&gt;THREADS: $THREADS&lt;/span&gt;
              &lt;span class="s"&gt;BATCH_SIZE: $BATCH_SIZE&lt;/span&gt;
              &lt;span class="s"&gt;CONTEXT_LENGTH: $CONTEXT_LENGTH&lt;/span&gt;
          &lt;span class="s"&gt;resources:&lt;/span&gt;
            &lt;span class="s"&gt;{}&lt;/span&gt;
          &lt;span class="s"&gt;cache:&lt;/span&gt;
            &lt;span class="s"&gt;persistence:&lt;/span&gt;
              &lt;span class="s"&gt;size: 0.5Gi&lt;/span&gt;
              &lt;span class="s"&gt;accessModes:&lt;/span&gt;
                &lt;span class="s"&gt;- ReadWriteOnce&lt;/span&gt;
              &lt;span class="s"&gt;storageClass: ~&lt;/span&gt;
          &lt;span class="s"&gt;cacheMountPath: /app/cache&lt;/span&gt;
          &lt;span class="s"&gt;model:&lt;/span&gt;
            &lt;span class="s"&gt;persistence:&lt;/span&gt;
              &lt;span class="s"&gt;size: 2Gi&lt;/span&gt;
              &lt;span class="s"&gt;accessModes:&lt;/span&gt;
                &lt;span class="s"&gt;- ReadWriteOnce&lt;/span&gt;
              &lt;span class="s"&gt;storageClass: ~&lt;/span&gt;
          &lt;span class="s"&gt;modelMountPath: /app/models&lt;/span&gt;
          &lt;span class="s"&gt;service:&lt;/span&gt;
            &lt;span class="s"&gt;type: ClusterIP&lt;/span&gt;
            &lt;span class="s"&gt;port: 8000&lt;/span&gt;
            &lt;span class="s"&gt;annotations: {}&lt;/span&gt;
          &lt;span class="s"&gt;nodeSelector: {}&lt;/span&gt;
          &lt;span class="s"&gt;tolerations: []&lt;/span&gt;
          &lt;span class="s"&gt;affinity: {}&lt;/span&gt;
          &lt;span class="s"&gt;EOF&lt;/span&gt;
          &lt;span class="s"&gt;helm install ialacol ialacol/ialacol -f values.yaml --namespace default&lt;/span&gt;

          &lt;span class="s"&gt;echo "Wait for the pod to be ready, it takes about 36s to download a 1.93GB model (~50MB/s)"&lt;/span&gt;
          &lt;span class="s"&gt;sleep 40&lt;/span&gt;
          &lt;span class="s"&gt;kubectl get pods -n default&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ask the AI for a joke&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;kubectl port-forward svc/ialacol 8000:8000 &amp;amp;&lt;/span&gt;
          &lt;span class="s"&gt;echo "Wait for port-forward to be ready"&lt;/span&gt;
          &lt;span class="s"&gt;sleep 5&lt;/span&gt;

          &lt;span class="s"&gt;curl http://localhost:8000/v1/models&lt;/span&gt;

          &lt;span class="s"&gt;RESPONSE=$(curl -X POST -H 'Content-Type: application/json' -d '{ "messages": [{"role": "user", "content": "Tell me a joke."}], "temperature":"'${TEMPERATURE}'", "model": "'${DEFAULT_MODEL_FILE}'"}' http://localhost:8000/v1/chat/completions)&lt;/span&gt;
          &lt;span class="s"&gt;echo "$RESPONSE"&lt;/span&gt;

          &lt;span class="s"&gt;REPLY=$(echo "$RESPONSE" | jq -r '.choices[0].message.content')&lt;/span&gt;
          &lt;span class="s"&gt;echo "$REPLY"&lt;/span&gt;

          &lt;span class="s"&gt;kubectl logs --selector app.kubernetes.io/name=$HELM_RELEASE_NAME -n default&lt;/span&gt;

          &lt;span class="s"&gt;if [ -z "$REPLY" ]; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "No reply from AI"&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

          &lt;span class="s"&gt;echo "REPLY=$REPLY" &amp;gt;&amp;gt; $GITHUB_ENV&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Comment the Joke&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v6&lt;/span&gt;
        &lt;span class="c1"&gt;# Note, issue and PR are the same thing in GitHub's eyes&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;const REPLY = process.env.REPLY&lt;/span&gt;
            &lt;span class="s"&gt;if (REPLY) {&lt;/span&gt;
              &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
                &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
                &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
                &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
                &lt;span class="s"&gt;body: `🤖: ${REPLY}`&lt;/span&gt;
              &lt;span class="s"&gt;})&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
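
&lt;p&gt;For reference, the &lt;code&gt;jq&lt;/code&gt; extraction in the workflow above maps to plain dictionary access in Python; the sample response here is a made-up illustration of the OpenAI-compatible shape that ialacol returns:&lt;/p&gt;

```python
import json

# A made-up chat completion response in the OpenAI-compatible shape.
sample = json.loads("""
{
  "choices": [
    {"message": {"role": "assistant", "content": "Why did the PR cross the repo?"}}
  ]
}
""")

# Equivalent of: echo "$RESPONSE" | jq -r '.choices[0].message.content'
reply = sample["choices"][0]["message"]["content"]
print(reply)
```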



&lt;p&gt;Is the joke any good?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg10aqnw3o6ef65icetu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg10aqnw3o6ef65icetu3.png" alt="The comment from AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, it's up for debate. If you want better jokes, you can bring a &lt;a href="https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners" rel="noopener noreferrer"&gt;self-hosted runner&lt;/a&gt;. A self-hosted runner (with, for example, 16 vCPUs and 32GB of RAM) would definitely be capable of running more sophisticated models such as &lt;a href="https://huggingface.co/mosaicml/mpt-30b" rel="noopener noreferrer"&gt;MPT-30B&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You might be wondering why running Kubernetes is necessary for this project. This article was actually created during the development of a testing CI for the OSS project &lt;a href="https://github.com/chenhunghan/ialacol" rel="noopener noreferrer"&gt;ialacol&lt;/a&gt;. The goal was to have a basic smoke test that verifies the Helm charts and ensures the endpoint returns a &lt;code&gt;200&lt;/code&gt; status code. You can find the full source of the testing CI YAML &lt;a href="https://github.com/chenhunghan/ialacol/blob/main/.github/workflows/smoke_test.yaml" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While running Kubernetes may not be necessary for your specific use case, it's worth mentioning that the overhead of the container runtime and Kubernetes is minimal. In fact, the CI process, which includes LLM inference from provisioning to completion, takes only &lt;strong&gt;&lt;em&gt;2 minutes&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Containerized AI before Apocalypse 🐳🤖</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 25 Jun 2023 08:55:16 +0000</pubDate>
      <link>https://dev.to/chenhunghan/containerized-ai-before-apocalypse-1569</link>
      <guid>https://dev.to/chenhunghan/containerized-ai-before-apocalypse-1569</guid>
      <description>&lt;p&gt;ChatGPT is awesome, and privacy is a concern for many. But what if you could host your own private AI on an old PC without relying on GPU clusters?&lt;/p&gt;

&lt;p&gt;Thanks to the efforts of the amazing community projects like &lt;a href="https://github.com/ggerganov/ggml"&gt;ggml&lt;/a&gt;, &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;, and &lt;a href="https://huggingface.co/TheBloke"&gt;TheBloke&lt;/a&gt;, it is now possible for anyone to chat with AI, privately, without internet, &lt;del&gt;before the apocalypse&lt;/del&gt;.&lt;/p&gt;

&lt;p&gt;In this article, &lt;del&gt;we will containerize an AI before it ends the world&lt;/del&gt; we will explore how to deploy a Large Language Model (LLM, also known as AI) in a container within a Kubernetes cluster, enabling us to have conversations with it.&lt;/p&gt;

&lt;p&gt;To get started, you'll need a Kubernetes cluster, for example, a &lt;a href="https://minikube.sigs.k8s.io/docs/start/"&gt;minikube&lt;/a&gt; with approximately 8 CPU threads and 5GB of memory. Additionally, you'll need to have &lt;a href="https://helm.sh/docs/intro/install/"&gt;Helm&lt;/a&gt; installed.&lt;/p&gt;
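
&lt;p&gt;Assuming minikube is already installed, a local cluster with roughly those resources can be started like this (a setup sketch; adjust the flags to your machine):&lt;/p&gt;

```shell
# Start a local cluster sized for a 3B-parameter model.
minikube start --cpus 8 --memory 5g
```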

&lt;p&gt;Let's begin by deploying the LLM within a minimal wrapper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; values.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
    DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm &lt;span class="nb"&gt;install &lt;/span&gt;orca-mini-3b ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're interested in the technical details, here's what's happening behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are deploying a Helm release &lt;code&gt;orca-mini-3b&lt;/code&gt; using the Helm chart &lt;a href="https://github.com/chenhunghan/ialacol/tree/main/charts/ialacol"&gt;ialacol&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The container image &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt; is a mini RESTful API server compatible with the &lt;a href="https://platform.openai.com/docs/api-reference"&gt;OpenAI API&lt;/a&gt;. Disclaimer: I am the main contributor to this project.&lt;/li&gt;
&lt;li&gt;The deployed LLM binary, &lt;a href="https://huggingface.co/psmathur/orca_mini_3b"&gt;orca mini&lt;/a&gt;, has 3 billion parameters. Orca mini is based on the &lt;a href="https://github.com/openlm-research/open_llama"&gt;OpenLLaMA&lt;/a&gt; project.&lt;/li&gt;
&lt;li&gt;The binary has been quantized by &lt;a href="https://huggingface.co/TheBloke"&gt;TheBloke&lt;/a&gt; into a 4-bit GGML format.&lt;/li&gt;
&lt;/ul&gt;
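
&lt;p&gt;As a back-of-the-envelope check on that 4-bit quantization (the only assumption here is the q4_0 block layout: 32 weights packed as nibbles plus one fp16 scale per block):&lt;/p&gt;

```python
# Lower-bound size estimate for a 4-bit (q4_0) quantized 3B model.
# q4_0 packs 32 weights into 16 bytes of nibbles plus one fp16 scale;
# the real file is larger because the "3B" count is nominal and the
# file also stores embeddings and metadata.
params = 3_000_000_000
block_size = 32           # weights per q4_0 block
bytes_per_block = 16 + 2  # 32 x 4-bit weights + one fp16 scale

total_bytes = params / block_size * bytes_per_block
total_gb = total_bytes / 1e9
print(f"{total_gb:.2f} GB")  # roughly 1.69 GB
```

&lt;p&gt;That lines up with the ~1.93GB download reported when the pod starts.&lt;/p&gt;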

&lt;p&gt;Now, please be patient for a few minutes as the container downloads the binary, which is around 1.93GB in size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:     Downloading model... TheBloke/orca_mini_3B-GGML/orca-mini-3b.ggmlv3.q4_0.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the download is complete, it's time to start a conversation!&lt;/p&gt;

&lt;p&gt;Expose the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/orca-mini-3b 8000:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ask a question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;USER_QUERY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"What is the meaning of life? Explain like I am 5."&lt;/span&gt;
&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orca-mini-3b.ggmlv3.q4_0.bin"&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "prompt": "### System:You are an AI assistant that follows instruction extremely well. Help as much as you can.### User:'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USER_QUERY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'### Response:", "model": "'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'" }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8000/v1/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
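
&lt;p&gt;The same request can be assembled in Python. The &lt;code&gt;### System / ### User / ### Response&lt;/code&gt; template mirrors the curl command above; actually sending it (commented out) would require the port-forward to be running:&lt;/p&gt;

```python
import json

SYSTEM = (
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can."
)

def build_payload(user_query, model):
    """Assemble an orca-mini style completion request body."""
    prompt = "### System:" + SYSTEM + "### User:" + user_query + "### Response:"
    return {"prompt": prompt, "model": model}

payload = build_payload(
    "What is the meaning of life? Explain like I am 5.",
    "orca-mini-3b.ggmlv3.q4_0.bin",
)
body = json.dumps(payload)
# To actually call the service via the port-forward:
# import requests
# requests.post("http://localhost:8000/v1/completions",
#               headers={"Content-Type": "application/json"}, data=body)
```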



&lt;p&gt;According to AI...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The meaning of life is a question that has puzzled humans for centuries. Some believe it to be finding happiness, others think it's achieving success or something greater than ourselves, while some see it as fulfilling our purpose on this planet. Ultimately, everyone answers this question differently and what matters most in the end is how we live our lives with integrity and make a positive impact on those around us.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's start scaling LLM on Kubernetes!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
