<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: chh</title>
    <description>The latest articles on DEV Community by chh (@chenhunghan).</description>
    <link>https://dev.to/chenhunghan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F453245%2F964dad41-5c67-42f3-a7e1-d4c8ad1dc2f9.jpeg</url>
      <title>DEV Community: chh</title>
      <link>https://dev.to/chenhunghan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chenhunghan"/>
    <language>en</language>
    <item>
      <title>6 lessons from building an MCP App for 🏃🏃‍♂️🏃‍♀️</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:31:15 +0000</pubDate>
      <link>https://dev.to/chenhunghan/6-lessons-from-building-a-mcp-apps-for-4ad6</link>
      <guid>https://dev.to/chenhunghan/6-lessons-from-building-a-mcp-apps-for-4ad6</guid>
      <description>&lt;p&gt;Like many others, since Dec 2025 I have switched my daily workflow from typing in VSCode to using only prompts to complete my daily tasks.&lt;/p&gt;

&lt;p&gt;I have had several weekend projects, like &lt;a href="https://github.com/chenhunghan/0ma" rel="noopener noreferrer"&gt;0ma&lt;/a&gt;, for managing local VMs, and &lt;a href="https://github.com/chenhunghan/scim-mcp" rel="noopener noreferrer"&gt;scim-mcp&lt;/a&gt;, an MCP server that proxies requests to SCIM endpoints. Most of them were for fun; I just wanted to see how far I could push LLMs to their limits using only natural language.&lt;/p&gt;

&lt;p&gt;This time I wanted to build something different: a &lt;a href="https://github.com/chenhunghan/garmin-mcp-app" rel="noopener noreferrer"&gt;Garmin MCP App&lt;/a&gt; that can be installed on ChatGPT Desktop or Claude Desktop and interacts with the data from your watch through a generative UI intended for a non-technical audience.&lt;/p&gt;

&lt;p&gt;This app lets you not only query data from your Garmin watch, but also explore your workout data through dynamic visualisation charts, without giving the AI your identity or your credentials.&lt;/p&gt;

&lt;p&gt;The target users are serious runners who want to use the power of LLMs to plan their next workout. The Garmin mobile app is awesome, but it is general-purpose, targets a broad audience, and is a few years behind the latest developments in sport science. I always find myself wanting to customise some parameter or dashboard. With MCP Apps, no interface is more flexible than natural language + generative UIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskou2x4wp7bhnm82ykkv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskou2x4wp7bhnm82ykkv.gif" alt="Demo" width="600" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the past few years it is running, not AI, that has changed my life. I wish to bring this positive impact to more users. With the integration of LLMs and cutting-edge sport science, more users will actually start to enjoy running, and become runners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevonf61j5jhk228xzaoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevonf61j5jhk228xzaoq.png" alt="Training readiness" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The only problem is that I have young kids, so my weekend time is very limited. I still managed to build the initial version in 12 hours, so here are the tips worth documenting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a Frontier Model
&lt;/h2&gt;

&lt;p&gt;Do not waste your time on older, smaller models, even if they are cheaper. Always use the best model on the market. One caveat: the newest model is not necessarily the best. When I started this project, I began with Antigravity + Gemini Pro 3.1 because it was the newest, but sorry, Google, this model is not the best, not even the second best. I ended up using Opus 4.6 with Claude Code. The time you waste in an agentic loop with a weak model costs more than the tokens of a frontier model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Best Context Engineering is No Context Engineering
&lt;/h2&gt;

&lt;p&gt;Let me repeat: the best context engineering is no context engineering. Context engineering, in essence, is giving an LLM access to context it would not otherwise have. For example, you can give an LLM browser CDP access through whichever MCP server or skill you prefer, so the LLM can inspect elements in the DOM. However, if you want to go fast, you should skip context engineering entirely, and the best shortcut is to use the most popular language/framework on the market.&lt;/p&gt;

&lt;p&gt;You might not like React, you might not like TypeScript, you might not like Tailwind, and you are probably smart enough to build your own lib that completely avoids re-renders on state updates or is "truly reactive" compared to React. However, LLMs are trained on internet data (and pirated books); reinventing the wheel means you need extra context engineering to teach the LLM your wheel. That works, but it is not the fastest path. So choose the most popular option. For UI, nothing compares to React/Tailwind in terms of market share.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harness Engineering
&lt;/h2&gt;

&lt;p&gt;Harness engineering is similar to context engineering; however, instead of giving more data to the LLM, harness engineering is about making it easier for LLMs to modify your code. Claude Code is an excellent example of good harness engineering: with it, LLMs can modify large amounts of code without making errors, unlike Gemini Pro 3.1, which struggled with Antigravity.&lt;/p&gt;

&lt;p&gt;Claude Code's hook system makes harness engineering easy. I have made two Claude Code plugins: ralph-hook-fmt, which automatically formats files after they are written or edited, inspired by Formatters in OpenCode; and ralph-hook-lint, which automatically emits lint errors when Claude Code finishes one loop of editing. Both plugins tighten the feedback loop inside the agentic loop, so the LLM can react immediately after files are edited and static analysis detects problems. These two hook plugins are still in an early phase, but give them a try or make your own hooks.&lt;/p&gt;
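&lt;p&gt;For a flavour of what a formatting hook can look like, here is a rough sketch of a &lt;code&gt;PostToolUse&lt;/code&gt; hook in a Claude Code settings file. Treat the field names and the stdin payload shape as assumptions from my reading of the hooks docs, and check the official schema before copying:&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs npx prettier --write"
          }
        ]
      }
    ]
  }
}
```

The hook command receives the tool call as JSON on stdin, so the sketch extracts the edited file's path and formats just that file after every write or edit.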

&lt;h2&gt;
  
  
  Plan First
&lt;/h2&gt;

&lt;p&gt;Use your favourite planning tool, whether it is Superpower, Speckit, or the built-in Claude Code plan tool. In plan mode, always ask Claude to relentlessly question you to fill the gaps in the plan. Stay in plan mode until you feel comfortable letting go. Plan early: the model is smarter in the first few thousand tokens of the context window. Exploring and clarifying first, then forking from plan mode into implementation mode, keeps the implementation concentrated and focused, without reading unnecessary files that stuff the context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Sub-agents
&lt;/h2&gt;

&lt;p&gt;Sub-agents are useful when you want to do things in parallel, e.g. exploring an unknown code base. LLMs are smarter in the first 0–30% of the context window. Use sub-agents to explore a code base and report a summary back to the main context; this keeps the main context lean and unpolluted by tool calls, avoiding the "needle in a haystack" problem.&lt;/p&gt;

&lt;p&gt;Use sub-agents to write code, but only when the structure of the project is stable. Hand a task to a sub-agent when you know its exact boundaries. One example is a chart that already has a pattern for feeding in data and known patterns for styling via shared CSS variables. Sub-agents are good when you know the exact output, for example making a chart for each API endpoint by following an example chart pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worktree, Worktree, Worktree
&lt;/h2&gt;

&lt;p&gt;Claude Code now has native support for git worktree; start one with &lt;code&gt;claude --worktree&lt;/code&gt;. However, like sub-agents, I would suggest using a worktree only when the structure starts to stabilise and there is little or no ambiguity in the requirements. Worktrees are for self-contained, fully isolated features or bug fixes with clear boundaries. You probably do not want to use a worktree for a feature that touches every file; you will end up spending more time resolving conflicts, so do that on the main branch. Worktrees are for predictable tasks that you are sure will finish in a known timeframe.&lt;/p&gt;

&lt;p&gt;Finally, thanks for reading through to here — your attention span is longer than most humans'. Give &lt;a href="https://github.com/chenhunghan/garmin-mcp-app" rel="noopener noreferrer"&gt;Garmin MCP Apps&lt;/a&gt; a try, let it help you plan for the next optimal run. Most importantly, start running. AI might change the world but not our everyday life; running (or whatever exercise you are into) will change your life.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I May Be Wrong</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 17 Jan 2026 07:56:55 +0000</pubDate>
      <link>https://dev.to/chenhunghan/i-may-be-wrong-2oal</link>
      <guid>https://dev.to/chenhunghan/i-may-be-wrong-2oal</guid>
      <description>&lt;p&gt;If you've spent any time working with AI agents, you've probably seen this response:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You are absolutely right!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I hate it. It usually appears after you've wasted lots of tokens and the agent still can't figure out what's going on. It's that moment of frustration when you, the human, have to diagnose the problem yourself and point out the solution.&lt;/p&gt;

&lt;p&gt;I blamed the agent. Why can't it just say "I am wrong"? Why did it have such strong confidence moments earlier, without even a hint of "maybe I'm wrong"?&lt;/p&gt;

&lt;p&gt;These are the moments when I doubt whether AI can really replace human intelligence.&lt;/p&gt;

&lt;p&gt;But after gaining more experience, I've started to see &lt;em&gt;You are absolutely right!&lt;/em&gt; differently. It's still frustrating, but now I recognise it as a signal.&lt;/p&gt;

&lt;p&gt;Here's the thing: an LLM is trained to follow human instructions, but you're not talking to a collective intelligence that can reason and reflect. You're projecting thoughts onto a probability machine that returns the most likely next token.&lt;/p&gt;

&lt;p&gt;In essence, you're mostly talking to yourself (and your code). The LLM is the rubber duck sitting next to your desk, helping you understand what you're trying to achieve. LLMs can answer questions, but when it comes to open-ended discovery—research with no definite end—you're mostly talking to your own reflections.&lt;/p&gt;

&lt;p&gt;So when I see &lt;em&gt;You are absolutely right!&lt;/em&gt; I now take it as my cue to step back—to think outside the box and find what's missing.&lt;/p&gt;

&lt;p&gt;Depressing? A little. But it's also freeing.&lt;/p&gt;

&lt;p&gt;In those moments, I've found wisdom outside of computer science: a book I recently read called I May Be Wrong by Björn Natthiko Lindeblad—a beautiful memoir about a Swedish man who became a forest monk in Thailand.&lt;/p&gt;

&lt;p&gt;It's a book about letting go of control and embracing uncertainty (sound familiar?). Your thoughts aren't you; you may be wrong. It has nothing to do with software engineering or AI. But I'd like to borrow the book's title as a reminder—a memo beside my desk when I vibe code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I May Be Wrong&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It's time to step back—letting go, and not believing everything you think.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>I made a Copilot in Rust 🦀 , here is what I have learned... (as a TypeScript dev)</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 17 Dec 2023 05:27:38 +0000</pubDate>
      <link>https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md</link>
      <guid>https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md</guid>
      <description>&lt;p&gt;My article &lt;a href="https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg"&gt;Code Llama as Drop-In Replacement for Copilot Code Completion&lt;/a&gt; received lots of positive feedback. Since then, I have made a few other attempts to improve the copilot server.&lt;/p&gt;

&lt;p&gt;In terms of performance, I made a PR in &lt;a href="https://github.com/turboderp/exllamav2/pull/23" rel="noopener noreferrer"&gt;exllamav2&lt;/a&gt; adding a copilot server that uses exllamav2's super-fast custom CUDA kernels.&lt;/p&gt;

&lt;p&gt;To improve the completion quality, I tried a few other LLMs, such as &lt;a href="https://huggingface.co/replit/replit-code-v1_5-3b" rel="noopener noreferrer"&gt;replit-code-v1_5-3b&lt;/a&gt;, &lt;a href="https://huggingface.co/syzymon/long_llama_code_7b" rel="noopener noreferrer"&gt;long_llama_code_7b&lt;/a&gt;, &lt;a href="https://huggingface.co/WisdomShell/CodeShell-7B" rel="noopener noreferrer"&gt;CodeShell-7B&lt;/a&gt; and &lt;a href="https://huggingface.co/stabilityai/stablelm-3b-4e1t" rel="noopener noreferrer"&gt;stablelm-3b-4e1t&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a practicing developer who is &lt;a href="https://www.businessinsider.com/gpu-rich-vs-gpu-poor-tech-companies-in-each-group-2023-8?r=US&amp;amp;IR=T" rel="noopener noreferrer"&gt;GPU poor&lt;/a&gt;, without access to H100 clusters, the way I can contribute in this era of the &lt;em&gt;AI Wild West&lt;/em&gt; is to improve the ergonomics of the copilot server.&lt;/p&gt;

&lt;p&gt;Hugging Face's &lt;a href="https://github.com/huggingface/candle" rel="noopener noreferrer"&gt;candle&lt;/a&gt;, a &lt;em&gt;minimalist ML framework&lt;/em&gt; for Rust, looks super interesting. So I started to create a minimalist copilot server in Rust 🦀.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before you continue, note that the &lt;a href="https://github.com/turboderp/exllamav2/pull/23" rel="noopener noreferrer"&gt;exllamav2&lt;/a&gt; version (Python + CUDA) is still much faster than the Rust version. This article is mostly for those who are interested in Rust and want to learn the language by building a fun project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Essentially, this is a &lt;strong&gt;Build Your Own Copilot&lt;/strong&gt; (in Rust 🦀) tutorial, and the code is intended to be educational. If you just want to try the final product &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;chenhunghan/homebrew-formulae/oxpilot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and start the copilot server&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ox serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or chat with the LLM&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ox hi in Japanese
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We will be using &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;axum&lt;/a&gt; as the web framework, &lt;a href="https://github.com/huggingface/candle" rel="noopener noreferrer"&gt;candle&lt;/a&gt; for text inferencing,   &lt;a href="https://github.com/clap-rs/clap" rel="noopener noreferrer"&gt;clap&lt;/a&gt; for cli arguments parsing and &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt; as the asynchronous runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table Of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Books and References&lt;/li&gt;
&lt;li&gt;
Print (console.log) for debugging

&lt;ul&gt;
&lt;li&gt;Pretty print&lt;/li&gt;
&lt;li&gt;Measure performance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Feeling Safe

&lt;ul&gt;
&lt;li&gt;Variables are immutable by default&lt;/li&gt;
&lt;li&gt;
You Should Not Moved! Ownership

&lt;ul&gt;
&lt;li&gt;Ownership and Scope&lt;/li&gt;
&lt;li&gt;Borrowing&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

Asynchronous Rust

&lt;ul&gt;
&lt;li&gt;Parallelism&lt;/li&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;li&gt;Task (Green Thread)&lt;/li&gt;
&lt;li&gt;Async Runtime&lt;/li&gt;
&lt;li&gt;Ownership and Async&lt;/li&gt;
&lt;li&gt;Share States in Async Program: Arc and Mutex&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Hands-on

&lt;ul&gt;
&lt;li&gt;
Server-Sent Events (SSE) Server

&lt;ul&gt;
&lt;li&gt;BDD the Endpoint&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Builder Pattern

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;::default()&lt;/code&gt; v.s. &lt;code&gt;::new()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; for function parameter&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Type State&lt;/li&gt;

&lt;li&gt;Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/li&gt;

&lt;li&gt;Share Memory by Communicating: Actor&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;(Some sections are still WIP)&lt;/p&gt;

&lt;p&gt;If you are already somewhat familiar with Rust, for example comfortable with &lt;code&gt;ownership&lt;/code&gt;/&lt;code&gt;borrowing&lt;/code&gt; but not with the async world, I suggest jumping to the Async section.&lt;/p&gt;

&lt;p&gt;If you are already familiar with async Rust, you can go directly to the Hands-on section, which introduces some design patterns you might find useful, or just go to the GitHub project &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;, where everything is open-sourced.&lt;/p&gt;

&lt;p&gt;Please expect some, &lt;del&gt;if not many&lt;/del&gt;, human errors. I documented my learning process hoping it can help someone on the internet who, like me, enjoys building an exciting project while learning a new language.&lt;/p&gt;

&lt;p&gt;Thanks &lt;a href="https://github.com/jihchi" rel="noopener noreferrer"&gt;jihchi&lt;/a&gt; for reviewing the draft of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Books and References &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This article is self-contained, which means it should contain everything you need to know to read the source code of &lt;a href="https://github.com/chenhunghan/oxpilot" rel="noopener noreferrer"&gt;oxpilot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, it's not possible to cover everything in every section. I try to provide references at the end of each section, and I highly recommend reading &lt;a href="https://doc.rust-lang.org/book/" rel="noopener noreferrer"&gt;The Rust Programming Language&lt;/a&gt; if you haven't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing is a batteries-included &lt;code&gt;console.log&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;console.log&lt;/code&gt; is a powerful tool in TypeScript: you can print whatever you want, which makes it super useful for debugging. The Rust equivalent is &lt;code&gt;print!&lt;/code&gt;; &lt;a href="https://doc.rust-lang.org/rust-by-example/hello/print.html" rel="noopener noreferrer"&gt;Rust by Example&lt;/a&gt; is an excellent document if you want to get started with &lt;code&gt;print!&lt;/code&gt; quickly.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;print!&lt;/code&gt; locks stdout on every call, and for hot paths it's better to &lt;a href="https://nnethercote.github.io/perf-book/io.html" rel="noopener noreferrer"&gt;lock stdout once and write manually&lt;/a&gt;, which is tedious.&lt;/p&gt;
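&lt;p&gt;Locking manually looks roughly like this, using only the standard library:&lt;/p&gt;

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Take the lock once, instead of re-locking on every `println!` call.
    let mut lock = stdout.lock();
    for i in 0..3 {
        writeln!(lock, "line {i}")?;
    }
    Ok(())
} // the lock is released here when `lock` goes out of scope
```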

&lt;p&gt;Luckily, we have an alternative: &lt;a href="https://github.com/tokio-rs/tracing" rel="noopener noreferrer"&gt;Tracing&lt;/a&gt;, an awesome project by the &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt; team. As a TypeScript developer, I feel at home using tracing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello! Rust!"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Print var: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  What is &lt;code&gt;{:?}&lt;/code&gt;? &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You might wonder what &lt;code&gt;{:?}&lt;/code&gt; is in the code block.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Print var: {:?}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;{:?}&lt;/code&gt; is for debug-printing a &lt;a href="https://doc.rust-lang.org/std/keyword.struct.html" rel="noopener noreferrer"&gt;struct&lt;/a&gt; (like an Object in TypeScript). Alternatively, &lt;code&gt;{:#?}&lt;/code&gt; pretty-prints it (&lt;a href="https://doc.rust-lang.org/rust-by-example/hello/print/print_debug.html" rel="noopener noreferrer"&gt;see more&lt;/a&gt;); think of it like &lt;code&gt;console.log(JSON.stringify(object, null, 2))&lt;/code&gt;.&lt;/p&gt;
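&lt;p&gt;A minimal, self-contained example; note that your struct needs &lt;code&gt;#[derive(Debug)]&lt;/code&gt; before &lt;code&gt;{:?}&lt;/code&gt; works on it:&lt;/p&gt;

```rust
// `{:?}` and `{:#?}` require the type to implement the `Debug` trait,
// which we derive here instead of writing by hand.
#[derive(Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 2 };
    println!("{:?}", p);  // prints: Point { x: 1, y: 2 }
    println!("{:#?}", p); // pretty-prints the fields across multiple lines
}
```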

&lt;h3&gt;
  
  
  Measure performance &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/tokio-rs/tracing" rel="noopener noreferrer"&gt;Tracing&lt;/a&gt; is an awesome for logging performance metrics, for example, if I want to measure how long &lt;code&gt;awesome()&lt;/code&gt; took.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;awesome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="nf"&gt;awesome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;tracing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;info_span!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"awesome"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This prints super useful messages, which tell us when we started invoking &lt;code&gt;awesome()&lt;/code&gt;, at which line, on which thread, and how long the function took to execute.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2023-10-22T09:01:13.128553Z INFO ThreadId&lt;span class="o"&gt;(&lt;/span&gt;01&lt;span class="o"&gt;)&lt;/span&gt; awesome src/main.rs:172: enter
2023-10-22T09:01:13.128569Z INFO ThreadId&lt;span class="o"&gt;(&lt;/span&gt;01&lt;span class="o"&gt;)&lt;/span&gt; awesome src/main.rs:172: close time.busy&lt;span class="o"&gt;=&lt;/span&gt;15.3µs time.idle&lt;span class="o"&gt;=&lt;/span&gt;3.96µs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Feeling Safe &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Rust is &lt;strong&gt;&lt;em&gt;safe&lt;/em&gt;&lt;/strong&gt; by default; the &lt;em&gt;safe&lt;/em&gt; usually refers to memory safety. However, in my experience, Rust also makes you feel safe shipping to production...once the code compiles.&lt;/p&gt;

&lt;p&gt;If you have ever written a line of JavaScript and then switched to TypeScript, you probably know what I mean by "feeling safe".&lt;/p&gt;

&lt;p&gt;TypeScript protects us from &lt;code&gt;TypeError: Cannot read property '' of undefined&lt;/code&gt; at compile time; Rust is like TypeScript with an &lt;strong&gt;&lt;em&gt;ultra&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;strict&lt;/code&gt; mode that protects us developers from making mistakes &lt;strong&gt;at compile time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rust makes pull requests easier to review and increases the confidence of shipping to production. The compiler's error messages might seem overwhelming, just like TypeScript errors at the beginning.&lt;/p&gt;

&lt;p&gt;However, if you have ever been under the stress of recovering production servers, you will know that learning to resolve compile-time errors is better than resolving runtime exceptions.&lt;/p&gt;

&lt;p&gt;To embrace the Rust &lt;em&gt;safety net&lt;/em&gt;, immutability and ownership are two key concepts to understand.&lt;/p&gt;
&lt;h3&gt;
  
  
  Variables are immutable by default &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;"Immutable by default" means once data created, they can't not be mutated, most will agree that immutable data makes &lt;a href="https://medium.com/dailyjs/use-const-and-make-your-javascript-code-better-aac4f3786ca1" rel="noopener noreferrer"&gt;your code better&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, a seasoned TypeScript developer probably knows the benefits of using &lt;code&gt;const&lt;/code&gt;: &lt;code&gt;const&lt;/code&gt; makes the intent explicit, and the compiler complains when you try to mutate the value.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Cannot assign to 'x' because it is a constant.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Rust, variables are immutable by default and only mutable if you explicitly declare them as mutable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// this does not compile, &lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// explicit `let mut x` to make mutation possible.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The book's chapter &lt;a href="https://doc.rust-lang.org/book/ch03-01-variables-and-mutability.html" rel="noopener noreferrer"&gt;Variables and Mutability&lt;/a&gt; has a comprehensive explanation of mutability in Rust.&lt;/p&gt;
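&lt;p&gt;For completeness, the mutable counterpart compiles without complaint once you opt in with &lt;code&gt;mut&lt;/code&gt;:&lt;/p&gt;

```rust
fn main() {
    // `mut` explicitly opts this binding in to mutation.
    let mut x = 5;
    println!("x is {x}");
    x = 6; // fine now, the compiler no longer objects
    println!("x is {x}");
}
```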

&lt;h3&gt;
  
  
  You should not &lt;strong&gt;&lt;em&gt;move&lt;/em&gt;&lt;/strong&gt;! Ownership &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Coming from a language with a garbage collector, the following code looks natural; we try to create &lt;code&gt;s2&lt;/code&gt; from &lt;code&gt;s1&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, the code does not compile; the compiler says you have &lt;strong&gt;moved&lt;/strong&gt; &lt;code&gt;s1&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;11 |   &lt;span class="nb"&gt;let &lt;/span&gt;s1 &lt;span class="o"&gt;=&lt;/span&gt; String::from&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"hello"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   |       &lt;span class="nt"&gt;--&lt;/span&gt; move occurs because &lt;span class="sb"&gt;`&lt;/span&gt;s1&lt;span class="sb"&gt;`&lt;/span&gt; has &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="sb"&gt;`&lt;/span&gt;String&lt;span class="sb"&gt;`&lt;/span&gt;, which does not implement the &lt;span class="sb"&gt;`&lt;/span&gt;Copy&lt;span class="sb"&gt;`&lt;/span&gt; trait
12 |   &lt;span class="nb"&gt;let &lt;/span&gt;s2 &lt;span class="o"&gt;=&lt;/span&gt; s1&lt;span class="p"&gt;;&lt;/span&gt;
   |            &lt;span class="nt"&gt;--&lt;/span&gt; value moved here
13 |   println!&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"{}"&lt;/span&gt;, s1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   |                  ^^ value borrowed here after move
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This might be the first, and a continually frustrating, compiler error message when starting Rust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F7hz5bz3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F7hz5bz3.jpg" alt="You should not pass!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rust does not ship with a garbage collector, which means nothing at runtime knows when to drop a value from memory once you don't need it anymore.&lt;/p&gt;

&lt;p&gt;To achieve this goal, Rust introduces the ownership checker, which makes the developer &lt;strong&gt;mark&lt;/strong&gt; a value once the rest of the code doesn't need it. The ownership checker helps you manage memory &lt;strong&gt;at compile time&lt;/strong&gt;, so we don't need to ship the code with a garbage collector that collects and drops unused values from memory at runtime.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;value moved here&lt;/code&gt; in the example above is telling us that the code violates the ownership rules, which are:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner at a time.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope, the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;The compiler is saying: Hey! &lt;code&gt;s1&lt;/code&gt; was the owner of &lt;code&gt;String::from("hello")&lt;/code&gt;, but you have &lt;strong&gt;&lt;em&gt;moved&lt;/em&gt;&lt;/strong&gt; the ownership from &lt;code&gt;s1&lt;/code&gt; to &lt;code&gt;s2&lt;/code&gt;. After the move, &lt;code&gt;s1&lt;/code&gt; is no longer valid, therefore you should not use it again in &lt;code&gt;println!&lt;/code&gt;!&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// ownership moved from s1 to s2&lt;/span&gt;
  &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// s1 is dropped, why you are still using it?&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you come from the TypeScript world (or any language with a garbage collector), ownership might look foreign; however, learning the ownership checker makes you aware of how your program uses memory.&lt;/p&gt;
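&lt;p&gt;As a quick aside (a minimal sketch of mine, not from the error above): if you really do need two independent owners, you can &lt;code&gt;clone&lt;/code&gt; the value, at the cost of a second heap allocation, so each variable owns its own &lt;code&gt;String&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn main() {
    let s1 = String::from("hello");
    // `clone` allocates a new String, so `s2` owns its own copy
    // and `s1` is still valid afterwards
    let s2 = s1.clone();
    println!("{} {}", s1, s2); // prints: hello hello
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;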
&lt;h4&gt;
  
  
  Ownership and Scope &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Let's review the ownership rules again, and dig deeper into the third rule.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner at a time.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope (the curly brackets &lt;code&gt;{}&lt;/code&gt;), the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the following example, the compiler stops us at the &lt;code&gt;print!&lt;/code&gt; call, because we violate the ownership rules by moving &lt;code&gt;owner&lt;/code&gt; into &lt;code&gt;do_something&lt;/code&gt; and then trying to use &lt;code&gt;owner&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;This does not compile:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// we took the ownership of "value" from `owner` and&lt;/span&gt;
    &lt;span class="c1"&gt;// "value" is dropped at the end of the `do_something` function&lt;/span&gt;
    &lt;span class="c1"&gt;// thus the variable `owner` does not own it anymore&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use of moved value: `owner` value used here after move&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=855280d7cee41053de575a3af7697934" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;print!("{}", owner)&lt;/code&gt; violates the ownership rule because we have already move the &lt;code&gt;owner&lt;/code&gt; into &lt;code&gt;do_something(owner)&lt;/code&gt;'s &lt;br&gt;
 scope, therefore, after the the &lt;code&gt;do_something(owner)&lt;/code&gt; execution is finished, the &lt;code&gt;owner&lt;/code&gt; is &lt;strong&gt;&lt;em&gt;out of scope&lt;/em&gt;&lt;/strong&gt;, the &lt;code&gt;owner&lt;/code&gt; is dropped and we can't use it anymore.&lt;/p&gt;
&lt;h4&gt;
  
  
  Borrow &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;To work with the ownership rules without giving values away, borrowing comes to the rescue.&lt;/p&gt;

&lt;p&gt;Borrowing uses reference syntax (&lt;code&gt;&amp;amp;&lt;/code&gt;) to let the Rust compiler know that we are only borrowing, not taking ownership. A reference is a promise: we are temporarily borrowing the value, we do not intend to take ownership, and we will give it back when we no longer need it.&lt;/p&gt;

&lt;p&gt;This compiles:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// `do_something` borrows `"value"` from `owner`&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// No more error!&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// "value" is NOT dropped at the end of the function&lt;/span&gt;
    &lt;span class="c1"&gt;// because we are just borrowing (`&amp;amp;String`) not taking the ownership&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=eeca5897ae7a8fd8fbee836218eceaf8" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just like ownership, borrowing has a set of rules. These rules are like the contracts you make when you borrow something from someone else.&lt;/p&gt;

&lt;p&gt;A real-world analogy: you want to borrow the book &lt;a href="https://rust-for-rustaceans.com/" rel="noopener noreferrer"&gt;"Rust for Rustaceans"&lt;/a&gt; from a friend. To keep the friendship, you make a contract (a verbal promise: "I will return the borrowed book to you in one month"). The contract has to follow the borrowing rules:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;At any given time, you can have as many immutable references as you want, but only one mutable reference.&lt;/li&gt;
&lt;li&gt;A reference must always point to a valid value (referencing a dropped value is disallowed).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
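&lt;p&gt;A minimal sketch of rule 1 (my example, not from the book): any number of immutable borrows can coexist, but a mutable borrow must be exclusive:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;fn main() {
    let mut book = String::from("Rust for Rustaceans");

    // rule 1: as many immutable references as we want...
    let reader_a = &amp;amp;book;
    let reader_b = &amp;amp;book;
    println!("{} {}", reader_a, reader_b);

    // ...but only one mutable reference, and only because
    // `reader_a`/`reader_b` are no longer used after this point
    let editor = &amp;amp;mut book;
    editor.push_str(" (borrowed)");
    println!("{}", book);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;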

&lt;p&gt;It's OK if ownership and borrowing still seem blurry; the book's &lt;a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html" rel="noopener noreferrer"&gt;understanding ownership&lt;/a&gt; chapter is the best read on the topic, and you will get familiar with the ownership rules soon enough, after passing data around and having the compiler yell at you from time to time.&lt;/p&gt;

&lt;p&gt;If you are a busy developer, Let's Get Rusty's &lt;a href="https://www.youtube.com/watch?v=usJDUSrcwqI" rel="noopener noreferrer"&gt;The Rust Survival Guide&lt;/a&gt; is a great crash course on the ownership rules.&lt;/p&gt;
&lt;h2&gt;
  
  
  Asynchronous &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start this section, let's pin down the terminology.&lt;/p&gt;
&lt;h3&gt;
  
  
  Terminology
&lt;/h3&gt;

&lt;p&gt;Async is a programming language feature intended to give the program opportunities to execute one unit of computation while waiting for another unit of computation to complete.&lt;/p&gt;
&lt;h4&gt;
  
  
  Parallelism &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The program executes units of computation at the same time, simultaneously, for example running two computations on two different CPU cores.&lt;/p&gt;
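&lt;p&gt;A rough sketch of parallelism using OS threads from the standard library (my example, assuming the OS schedules the two spawned threads on different cores):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    // each `spawn` creates an OS thread; the OS may run them simultaneously
    let t1 = thread::spawn(|| (1..=100).sum::&amp;lt;u32&amp;gt;());
    let t2 = thread::spawn(|| (1..=100).filter(|n| n % 2 == 0).sum::&amp;lt;u32&amp;gt;());

    // `join` waits for a thread to finish and returns its result
    println!("{} {}", t1.join().unwrap(), t2.join().unwrap()); // prints: 5050 2550
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;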
&lt;h4&gt;
  
  
  Concurrency &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The program processes units of computation one by one, yielding quickly from one unit to another whenever a unit makes progress. Because it switches between units so quickly, it looks as if the program executes the units at the same time, but it is &lt;strong&gt;not&lt;/strong&gt; simultaneous (&lt;a href="https://youtu.be/Z-2siR9Ki84?si=L6MZCUQZmRA5hLcg&amp;amp;t=357" rel="noopener noreferrer"&gt;ref&lt;/a&gt;), as in the single-threaded Node.js runtime.&lt;/p&gt;
&lt;h4&gt;
  
  
  Task (Green Thread) &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;A task is some computation running in a &lt;em&gt;parallel&lt;/em&gt; or &lt;em&gt;concurrent&lt;/em&gt; system. In this article, the term &lt;strong&gt;task&lt;/strong&gt; refers to an &lt;a href="https://docs.rs/tokio/latest/tokio/task/index.html" rel="noopener noreferrer"&gt;asynchronous green thread&lt;/a&gt;: not an OS thread, but a unit of execution managed by the async runtime.&lt;/p&gt;
&lt;h3&gt;
  
  
  Runtime (the Task Runner) &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Node.js is a single-threaded, asynchronous runtime: the program can process tasks asynchronously, but never in parallel, because Node.js is single-threaded.&lt;/p&gt;

&lt;p&gt;To process tasks asynchronously in Rust, the developer needs to set up a task runner. The &lt;code&gt;main&lt;/code&gt; function (think of it like &lt;code&gt;index.ts&lt;/code&gt;), the entry point of a Rust program, is always synchronous, so the developer has to set up a runtime to be able to run asynchronous tasks.&lt;/p&gt;

&lt;p&gt;The following code uses &lt;a href="https://docs.rs/futures/latest/futures/executor/index.html" rel="noopener noreferrer"&gt;futures::executor&lt;/a&gt; as the async task runner.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// the async task runner.&lt;/span&gt;
    &lt;span class="nn"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// An async task&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In Rust, you are free to choose among async runtimes, just as in TypeScript we have &lt;code&gt;node.js&lt;/code&gt;, &lt;code&gt;bun&lt;/code&gt; and &lt;code&gt;deno&lt;/code&gt;. In Rust we have &lt;a href="https://github.com/tokio-rs/tokio" rel="noopener noreferrer"&gt;tokio&lt;/a&gt;, &lt;a href="https://async.rs/" rel="noopener noreferrer"&gt;async-std&lt;/a&gt;, &lt;a href="https://github.com/smol-rs/smol" rel="noopener noreferrer"&gt;smol&lt;/a&gt; and &lt;a href="https://docs.rs/futures/latest/futures/" rel="noopener noreferrer"&gt;futures&lt;/a&gt;. These runtimes can be single-threaded, like &lt;code&gt;node.js&lt;/code&gt;, running tasks concurrently, or multi-threaded, with &lt;strong&gt;true&lt;/strong&gt; parallelism.&lt;/p&gt;

&lt;p&gt;You may find these videos useful for understanding &lt;code&gt;async/await&lt;/code&gt; in Rust.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=0HwrZp9CBD4" rel="noopener noreferrer"&gt;1 Hour Dive into Asynchronous Rust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=FNcXf-4CLH0" rel="noopener noreferrer"&gt;Async/await in Rust: Introduction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Ownership and Async &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In the &lt;em&gt;You should not &lt;strong&gt;move&lt;/strong&gt;! Ownership&lt;/em&gt; section we discussed the ownership rules, and in the &lt;em&gt;Borrow&lt;/em&gt; section we discussed how to get around taking ownership by borrowing.&lt;/p&gt;

&lt;p&gt;In async Rust, whether you are using a single-threaded concurrent green-thread runtime or distributing computation across multiple OS threads (parallelism), the ownership rules always apply. In async Rust, the ownership rules prevent data races in concurrent and parallel programming, which is part of what is known as &lt;a href="https://doc.rust-lang.org/book/ch16-00-concurrency.html#fearless-concurrency" rel="noopener noreferrer"&gt;fearless concurrency&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Remember the ownership rules?&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Each value in Rust has an owner.&lt;/li&gt;
&lt;li&gt;There can only be one owner &lt;strong&gt;at a time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;When the owner goes out of scope the value will be dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Pay special attention to &lt;strong&gt;at a time&lt;/strong&gt;: it is how ownership helps us avoid data races when computations run at the same time.&lt;/p&gt;

&lt;p&gt;Let's look at the synchronous version again; in the previous example, this failed to compile...&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use of moved value: `owner` value used here after move&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=24619e92b8118da6d9900c423ad0f957" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;br&gt;
...because the code does not follow the ownership rules, that is, both &lt;code&gt;do_something()&lt;/code&gt; took ownership of &lt;code&gt;String::from("hello")&lt;/code&gt;, but Rust compiler only allows one ownership &lt;strong&gt;at a time&lt;/strong&gt;. To protect us from forgetting deallocating memory, the &lt;code&gt;owner&lt;/code&gt; is &lt;strong&gt;&lt;em&gt;moved&lt;/em&gt;&lt;/strong&gt; into the fist &lt;code&gt;do_something(owner)&lt;/code&gt;, and we can't compile the code because this error &lt;code&gt;use of moved value:&lt;/code&gt;owner&lt;code&gt;value used here after&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;3 |     do_something&lt;span class="o"&gt;(&lt;/span&gt;owner&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  |                  &lt;span class="nt"&gt;-----&lt;/span&gt; value moved here
4 |     // use of moved value: &lt;span class="sb"&gt;`&lt;/span&gt;owner&lt;span class="sb"&gt;`&lt;/span&gt; value used here after move
5 |     print!&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"{}"&lt;/span&gt;, owner&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  |                  ^^^^^ value borrowed here after move
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;We can get around this by borrowing (&lt;code&gt;&amp;amp;&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// use &amp;amp; to reference owner&lt;/span&gt;
    &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// We can still use owner after&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=323a44e074de5572fd74fb80249c3c9d" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same ownership rules apply to asynchronous Rust. Let's look at a parallel version, which &lt;code&gt;spawn&lt;/code&gt;s an OS thread running code &lt;strong&gt;simultaneously&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=28553de752bf5e85d5ed32d5abf4cc10" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We knew we needed borrowing (&lt;code&gt;&amp;amp;&lt;/code&gt;) to avoid taking ownership when calling &lt;code&gt;do_something(&amp;amp;owner)&lt;/code&gt;. However, the compiler still rejects the code, saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;closure may outlive the current function, but it borrows &lt;code&gt;owner&lt;/code&gt;, which is owned by the current function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This compiler error tells us that the reference to &lt;code&gt;owner&lt;/code&gt; borrowed inside the thread closure might, at some point in time, point to a value that has already been dropped outside the closure, violating the rule we discussed in borrowing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;A reference must always point to a valid value (referencing a dropped value is disallowed).&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;To give this &lt;strong&gt;&lt;em&gt;outlive&lt;/em&gt;&lt;/strong&gt; error more context, try to run this code in the &lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=7be27d273fbee33b82c642fbfd214055" rel="noopener noreferrer"&gt;playground&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"from thread"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nd"&gt;print!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"from main"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;You might be surprised that only &lt;code&gt;from main&lt;/code&gt; appears in the console. Rust's &lt;code&gt;std&lt;/code&gt; thread implementation detaches spawned threads: the parent thread (in our case &lt;code&gt;main()&lt;/code&gt;) does not wait for a child thread created via &lt;code&gt;thread::spawn&lt;/code&gt;, and in general a child thread may even outlive its parent.&lt;/p&gt;

&lt;p&gt;That's the reason you only see &lt;code&gt;from main&lt;/code&gt; in the console: &lt;code&gt;main&lt;/code&gt; returned, and the process exited, before the child thread got a chance to execute &lt;code&gt;|| print!("from thread")&lt;/code&gt;.&lt;/p&gt;
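&lt;p&gt;If we want both lines to show up, the parent can explicitly wait for the child by keeping the &lt;code&gt;JoinHandle&lt;/code&gt; that &lt;code&gt;thread::spawn&lt;/code&gt; returns (a sketch of mine, assuming we actually want &lt;code&gt;main&lt;/code&gt; to block):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    // keep the handle returned by `spawn`...
    let handle = thread::spawn(|| {
        print!("from thread");
    });
    // ...and `join` blocks `main` until the child thread finishes,
    // so the process can't exit before "from thread" is printed
    handle.join().unwrap();
    print!("from main");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;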

&lt;p&gt;If we step back and think about what borrowing across threads implies:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    thread::spawn(|| {
        // we borrow owner, but the borrowed value (`owner`)
        // might be dropped in main(), that is the `&amp;amp;` might point
        // to a dropped value
        do_something(&amp;amp;owner);
    });
}
fn do_something(_: &amp;amp;String) {
    // 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=d545cc80534ff175b793a9df34dcd941" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are running code &lt;strong&gt;simultaneously&lt;/strong&gt; from &lt;code&gt;main&lt;/code&gt; and from a thread, &lt;strong&gt;at the same time&lt;/strong&gt;. The compiler stops us, telling us that the closure in the thread may outlive the current function while it borrows &lt;code&gt;owner&lt;/code&gt;, which is owned by &lt;code&gt;main()&lt;/code&gt;. We shouldn't do this because the reference might point to &lt;code&gt;owner&lt;/code&gt; after it has become invalid in the parent thread (&lt;code&gt;main()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The same &lt;strong&gt;&lt;em&gt;outlive&lt;/em&gt;&lt;/strong&gt; problem can be observed in concurrent code, even though in most concurrent runtimes code executes not in OS threads but in tasks:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
[dependencies]
tokio = { version = "1.32.0", features = ["full"] }
*/&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://www.rustexplorer.com/b#LyoKW2RlcGVuZGVuY2llc10KdG9raW8gPSB7IHZlcnNpb24gPSAiMS4zMi4wIiwgZmVhdHVyZXMgPSBbImZ1bGwiXSB9CiovCgojW3Rva2lvOjptYWluXQphc3luYyBmbiBtYWluKCkgewogICAgbGV0IG93bmVyID0gU3RyaW5nOjpmcm9tKCJ2YWx1ZSIpOwogICAgCiAgICB0b2tpbzo6c3Bhd24oZG9fc29tZXRoaW5nKCZvd25lcikpOwp9Cgphc3luYyBmbiBkb19zb21ldGhpbmcoXzogJlN0cmluZykgewogICAgLy8KfQ==" rel="noopener noreferrer"&gt;rustexplorer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This code fails with a similar error message: &lt;code&gt;`owner` does not live long enough&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To get around this &lt;a href="https://stevedonovan.github.io/rust-gentle-intro/7-shared-and-networking.html#threads-dont-borrow" rel="noopener noreferrer"&gt;Threads Don't Borrow&lt;/a&gt; error, that is, the ownership rule that disallows referencing a parent's value from child threads/tasks, we have a few solutions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;move&lt;/code&gt; the value into the thread (&lt;a href="https://doc.rust-lang.org/book/ch16-01-threads.html#using-move-closures-with-threads" rel="noopener noreferrer"&gt;read more&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    // `move` moves `owner` into the spawned thread
    thread::spawn(move || {
        do_something(&amp;amp;owner);
    });
}
fn do_something(_: &amp;amp;String) {
    //
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=ff8990b967ff3d589c248f04a0d2b0d3" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Use a &lt;a href="https://doc.rust-lang.org/beta/std/thread/fn.scope.html" rel="noopener noreferrer"&gt;&lt;code&gt;scoped&lt;/code&gt; thread&lt;/a&gt;, which always exits before the parent thread (&lt;code&gt;main&lt;/code&gt;) exits.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::thread;

fn main() {
    let owner = String::from("value");
    // a scoped thread always exits before the scope ends,
    // so we are allowed to borrow `owner` with a reference
    thread::scope(|s| {
        s.spawn(|| {
            do_something(&amp;amp;owner);
        });
    });
}
fn do_something(_: &amp;amp;String) {
    //
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=cc57c70bd170def86338d77eee8bda83" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;"Do not communicate by sharing memory; instead, share memory by communicating" as in the &lt;a href="https://go.dev/doc/effective_go#concurrency" rel="noopener noreferrer"&gt;Go language documentation&lt;/a&gt;. We will dive into this in the actor section.&lt;/li&gt;
&lt;li&gt;Atomic Reference Counting (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;&lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;) and Mutual Exclusion (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
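Option 3, sharing memory by communicating, has a standard-library incarnation in `std::sync::mpsc` channels: values are moved through the channel rather than borrowed across threads. A minimal sketch (the helper name `send_through_channel` is just for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// move a value from a child thread back to the parent through a channel
fn send_through_channel() -> String {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // the String is moved into the channel, not borrowed or shared
        tx.send(String::from("value")).unwrap();
    });
    // recv() blocks until the spawned thread has sent the value
    rx.recv().unwrap()
}

fn main() {
    println!("{}", send_through_channel()); // prints "value"
}
```

Because the value is moved, no lifetime or ownership conflict arises between parent and child.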

&lt;p&gt;We will dive into &lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;&lt;code&gt;Arc&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; in the next section.&lt;/p&gt;
&lt;h3&gt;
  
  
  Share States in Async Program: Arc and Mutex &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Sharing state in an async program can be a challenge. Ownership rules only allow a value to have one owner at a time, and we can't use plain borrowing because the compiler cannot prove that a borrower living in another thread/task will never point to an already-dropped value.&lt;/p&gt;

&lt;p&gt;To solve this problem, we can use &lt;code&gt;Arc&lt;/code&gt; (&lt;a href="https://doc.rust-lang.org/std/sync/struct.Arc.html" rel="noopener noreferrer"&gt;Atomic Reference Counting&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Arc&lt;/code&gt; is &lt;em&gt;safe&lt;/em&gt; to use to share state across multiple threads/tasks. Wrapping data in an &lt;code&gt;Arc&lt;/code&gt; gives us multiple handles to the same data:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;arc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=5d4b3d2af20a1152f4fcd56477fb04b4" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Arc&lt;/code&gt; allows &lt;strong&gt;&lt;em&gt;safe&lt;/em&gt;&lt;/strong&gt; reads of the inner data across threads; it's similar to borrowing, but works for asynchronous code blocks.&lt;/p&gt;
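The example above moves the single `Arc` into one thread; to read the same data from several threads at once, each thread takes its own `Arc::clone` handle, and all handles point to one allocation. A minimal sketch (the helper name `share_across_threads` is illustrative):

```rust
use std::sync::Arc;
use std::thread;

// spawn `n` reader threads, each with its own clone of the `Arc` handle;
// returns the strong count observed after all readers have finished
fn share_across_threads(n: usize) -> usize {
    let shared = Arc::new(String::from("value"));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            // every thread gets its own handle to the same allocation
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                assert_eq!(shared.as_str(), "value");
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // all clones have been dropped; only the original handle remains
    Arc::strong_count(&shared)
}

fn main() {
    println!("{}", share_across_threads(4)); // prints 1
}
```

Cloning an `Arc` only bumps an atomic reference count; the inner `String` itself is never copied.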

&lt;p&gt;However, &lt;code&gt;Arc&lt;/code&gt; alone only allows reads. To enable threads to write to the inner data, the data needs a proper locking mechanism, that is, &lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doc.rust-lang.org/std/sync/struct.Mutex.html" rel="noopener noreferrer"&gt;&lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt;&lt;/a&gt; (reads: mutual exclusion) will block threads waiting for the lock to become available. When calling &lt;code&gt;lock()&lt;/code&gt; on a thread, the thread will become the only thread that can access the data, &lt;code&gt;Mutex&amp;lt;T&amp;gt;&lt;/code&gt; blocks other threads from access the data, therefore, it's safe to mutate the data while the lock has not been unlocked.&lt;/p&gt;

&lt;p&gt;To safely mutate the shared state:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nb"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Arc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Mutex&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;mutex_clone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" world (once)!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nn"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;inner_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mutex_clone&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;inner_data&lt;/span&gt;&lt;span class="nf"&gt;.push_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" world (twice)!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://play.rust-lang.org/?version=stable&amp;amp;mode=debug&amp;amp;edition=2021&amp;amp;gist=e6c1a5be6c381f2895c7733d4a8aa102" rel="noopener noreferrer"&gt;playground&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will dive deeper into how to use &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt; to share mutable state in the section Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about sharing state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The book's &lt;a href="https://doc.rust-lang.org/book/ch16-03-shared-state.html" rel="noopener noreferrer"&gt;Shared-State Concurrency&lt;/a&gt; chapter.&lt;/li&gt;
&lt;li&gt;Tokio's documentation has a dedicated page on how to &lt;a href="https://tokio.rs/tokio/tutorial/shared-state" rel="noopener noreferrer"&gt;share state between async tasks&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Hands-On &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;In the following sections, we will start building the copilot server.&lt;/p&gt;
&lt;h3&gt;
  
  
  Server-Sent Events (SSE) Server &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In this PR, we add the endpoint for the copilot client.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg"&gt;Code Llama as Drop-In Replacement for Copilot Code Completion&lt;/a&gt; we know that a copilot server is essentially an HTTP server that accepts a request with a prompt and returns &lt;code&gt;JSON&lt;/code&gt; chunks via &lt;a href="https://en.wikipedia.org/wiki/Server-sent_events" rel="noopener noreferrer"&gt;Server-Sent Events (SSE)&lt;/a&gt;. Let's specify the SSE endpoint and create a Server-Sent Events (SSE) server using &lt;a href="https://github.com/tokio-rs/axum" rel="noopener noreferrer"&gt;axum&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The URL path of the endpoint is &lt;code&gt;/v1/engines/:engine/completions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The endpoint should accept a &lt;code&gt;POST&lt;/code&gt; request.&lt;/li&gt;
&lt;li&gt;The endpoint takes a path parameter (&lt;code&gt;:engine&lt;/code&gt;) and a request body.&lt;/li&gt;
&lt;li&gt;The endpoint returns an SSE stream of text chunks (&lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this endpoint is almost identical to &lt;a href="https://platform.openai.com/docs/guides/text-generation/completions-api" rel="noopener noreferrer"&gt;OpenAI's completions&lt;/a&gt; endpoint, we can use &lt;code&gt;curl&lt;/code&gt; to see the input (request body) and the output (SSE text chunks):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.openai.com/v1/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Say this is a test",
    "max_tokens": 7,
    "temperature": 0,
    "stream": true
  }'&lt;/span&gt;
&lt;span class="c"&gt;# chuck 0&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"This "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck 1&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"is "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck 2&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"a "&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:null&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# chuck with `"finish_reason":"stop"`&lt;/span&gt;
data: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;:&lt;span class="s2"&gt;"test."&lt;/span&gt;,&lt;span class="s2"&gt;"index"&lt;/span&gt;:0,&lt;span class="s2"&gt;"logprobs"&lt;/span&gt;:null,&lt;span class="s2"&gt;"finish_reason"&lt;/span&gt;:&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;span class="s2"&gt;"model"&lt;/span&gt;:&lt;span class="s2"&gt;"gpt-3.5-turbo-instruct"&lt;/span&gt;, &lt;span class="s2"&gt;"id"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"object"&lt;/span&gt;:&lt;span class="s2"&gt;"text_completion"&lt;/span&gt;,&lt;span class="s2"&gt;"created"&lt;/span&gt;:1&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# end of SSE event stream&lt;/span&gt;
data: &lt;span class="o"&gt;[&lt;/span&gt;DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
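Each `data:` chunk above has a regular shape we can model as plain Rust structs. This is a sketch: the names `Completion` and `Choice` are assumptions (matching how the test later deserializes events), and in the real server these structs would derive serde's `Serialize`/`Deserialize`:

```rust
// one choice inside a streamed completion chunk
#[derive(Debug, Clone, PartialEq)]
struct Choice {
    text: String,
    index: u32,
    // `null` in every chunk except the last, which carries "stop"
    finish_reason: Option<String>,
}

// one SSE `data:` payload, mirroring the JSON chunks above
#[derive(Debug, Clone, PartialEq)]
struct Completion {
    id: String,
    object: String,
    model: String,
    created: u64,
    choices: Vec<Choice>,
}

// build the final chunk of a stream (the one with `finish_reason: "stop"`)
fn final_chunk(model: &str, text: &str) -> Completion {
    Completion {
        id: String::from("..."),
        object: String::from("text_completion"),
        model: model.to_string(),
        created: 1,
        choices: vec![Choice {
            text: text.to_string(),
            index: 0,
            finish_reason: Some(String::from("stop")),
        }],
    }
}

fn main() {
    let chunk = final_chunk("gpt-3.5-turbo-instruct", "test.");
    assert_eq!(chunk.choices[0].finish_reason.as_deref(), Some("stop"));
    println!("{}", chunk.object); // prints "text_completion"
}
```

The `data: [DONE]` sentinel is not a JSON chunk, so it is handled separately before deserializing.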
&lt;h4&gt;
  
  
  BDD (Behaviour-Driven Development) the Endpoint &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;We will use &lt;a href="https://docs.rs/reqwest-eventsource/latest/reqwest_eventsource/" rel="noopener noreferrer"&gt;&lt;code&gt;reqwest_eventsource&lt;/code&gt;&lt;/a&gt; and its friends in the test to act as the client, which sends requests to our endpoint &lt;code&gt;/v1/engines/:engine/completions&lt;/code&gt; and asserts that the response is what we expect. Since &lt;a href="https://docs.rs/reqwest-eventsource/latest/reqwest_eventsource/" rel="noopener noreferrer"&gt;&lt;code&gt;reqwest_eventsource&lt;/code&gt;&lt;/a&gt; and friends are not used in our final binary, let's add them under &lt;code&gt;dev-dependencies&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dev-dependencies]&lt;/span&gt;
&lt;span class="py"&gt;reqwest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.11.22"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"multipart"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="py"&gt;reqwest-eventsource&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.5.0"&lt;/span&gt;
&lt;span class="py"&gt;eventsource-stream&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.2.3"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Add a dummy handler and axum's router to route requests to &lt;code&gt;POST /v1/engines/:engine/completions&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="nn"&gt;routing&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;'static&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"Hello, World!"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/v1/engines/:engine/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Translate the spec into the test:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[cfg(test)]&lt;/span&gt;
&lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// imports are only for the tests&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;

    &lt;span class="cd"&gt;/// `super::*` means "everything in the parent module"&lt;/span&gt;
    &lt;span class="cd"&gt;/// It will bring all of the test module’s parent’s items into scope.&lt;/span&gt;
    &lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="cd"&gt;/// A helper function that spawns our application in the background&lt;/span&gt;
    &lt;span class="cd"&gt;/// and returns its address (e.g. http://127.0.0.1:[random_port])&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;spawn_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="nf"&gt;.into&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// Bind to localhost at the port 0, which will let the OS assign an available port to us&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="c1"&gt;// We retrieve the port assigned to us by the OS&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="nf"&gt;.local_addr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.port&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// We return the application address to the caller!&lt;/span&gt;
        &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://{}:{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="cd"&gt;/// The #[tokio::test] annotation on the test_sse_engine_completion function is a macro.&lt;/span&gt;
    &lt;span class="cd"&gt;/// Similar to #[tokio::main] It transforms the async fn test_sse_engine_completion()&lt;/span&gt;
    &lt;span class="cd"&gt;/// into a synchronous fn test_sse_engine_completion() that initializes a runtime instance&lt;/span&gt;
    &lt;span class="cd"&gt;/// and executes the async main function.&lt;/span&gt;
    &lt;span class="nd"&gt;#[tokio::test]&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;test_sse_engine_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listening_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;spawn_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"code-llama-7b"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;time_before_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.duration_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;reqwest&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"{}/v1/engines/{engine}/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;listening_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;.await&lt;/span&gt;
            &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.bytes_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;.eventsource&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// iterate over the stream of events&lt;/span&gt;
        &lt;span class="c1"&gt;// and collect them into a vector of Completion objects&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="nf"&gt;.next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="c1"&gt;// break the loop at the end of SSE stream&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"[DONE]"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;

                    &lt;span class="c1"&gt;// parse the event data into a Completion object&lt;/span&gt;
                    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="py"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                    &lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="nf"&gt;.push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nd"&gt;panic!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error in event stream"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;// The endpoint should return at least one completion object&lt;/span&gt;
        &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Check that each completion object has the correct fields&lt;/span&gt;
        &lt;span class="c1"&gt;// note that we didn't check all the values of the fields because&lt;/span&gt;
        &lt;span class="c1"&gt;// `serde_json::from_str::&amp;lt;Completion&amp;gt;` should panic if the field &lt;/span&gt;
        &lt;span class="c1"&gt;// is missing or in unexpected format&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// id should be a non-empty string&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.object&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"text_completion"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.created&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;time_before_request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// each completion object should have at least one choice&lt;/span&gt;
            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.choices&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

            &lt;span class="c1"&gt;// check that each choice has a non-empty text&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.choices&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="py"&gt;.text&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="c1"&gt;// finish_reason should can be None or Some(String)&lt;/span&gt;
                &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="py"&gt;.finish_reason&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="nd"&gt;assert!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="py"&gt;.system_fingerprint&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Run the tests with &lt;code&gt;cargo test&lt;/code&gt;. They fail, because we haven't implemented &lt;code&gt;completion()&lt;/code&gt; yet.&lt;/p&gt;

&lt;p&gt;Add the endpoint. To pass the tests, it needs to respond with SSE chunks of the &lt;code&gt;Completion&lt;/code&gt; struct. Let's fake the values in the struct first; we will connect the endpoint to the LLM later. It's important to stabilise the HTTP interface first.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;async_stream&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;response&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sse&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KeepAlive&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Sse&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;oxpilot&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;types&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;Choice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Infallible&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Reference: https://github.com/tokio-rs/axum/blob/main/examples/sse/src/main.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;// `Json&amp;lt;T&amp;gt;` will automatically deserialize the request body to a type `T` as JSON.&lt;/span&gt;
    &lt;span class="nf"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;Json&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Infallible&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// `stream!` is a macro from [`async_stream`](https://docs.rs/async-stream/0.3.5/async_stream/index.html) &lt;/span&gt;
    &lt;span class="c1"&gt;// that makes it easy to create a `futures::stream::Stream` from a generator.&lt;/span&gt;
    &lt;span class="nn"&gt;Sse&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;stream!&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="c1"&gt;// Create a new `SseEvent` with the default settings.&lt;/span&gt;
          &lt;span class="c1"&gt;// `SseEvent::default().data("Hello, World!")` will return `data: Hello, World!` as the event text chuck.&lt;/span&gt;
          &lt;span class="nn"&gt;SseEvent&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;// Serialize the `Completion` struct to JSON and return it as the event text chunk.&lt;/span&gt;
            &lt;span class="nf"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="c1"&gt;// json! is a macro from serde_json that makes it easy to create JSON values from a struct.&lt;/span&gt;
              &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nd"&gt;json!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Completion&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                  &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"cmpl-"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"text_completion"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;SystemTime&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                      &lt;span class="nf"&gt;.duration_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UNIX_EPOCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                      &lt;span class="nf"&gt;.as_secs&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="py"&gt;.model&lt;/span&gt;&lt;span class="nf"&gt;.unwrap_or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"unknown"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                  &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Choice&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                      &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;" world!"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                      &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;logprobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stop"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                  &lt;span class="p"&gt;}],&lt;/span&gt;
                  &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Usage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                      &lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
                  &lt;span class="p"&gt;},&lt;/span&gt;
                  &lt;span class="n"&gt;system_fingerprint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
              &lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;.keep_alive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;KeepAlive&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;That's it; the tests should pass now.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;running 1 &lt;span class="nb"&gt;test
test &lt;/span&gt;tests::test_sse_engine_completion ... ok

&lt;span class="nb"&gt;test &lt;/span&gt;result: ok. 1 passed&lt;span class="p"&gt;;&lt;/span&gt; 0 failed&lt;span class="p"&gt;;&lt;/span&gt; 0 ignored&lt;span class="p"&gt;;&lt;/span&gt; 0 measured&lt;span class="p"&gt;;&lt;/span&gt; 0 filtered out&lt;span class="p"&gt;;&lt;/span&gt; finished &lt;span class="k"&gt;in &lt;/span&gt;0.04s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Alternatively, we can test the copilot end-to-end:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;will bind the server to port &lt;code&gt;6666&lt;/code&gt;, because we have this in &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ..&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;listener&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;net&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;TcpListener&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0:6666"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;app&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nn"&gt;axum&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listener&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Edit &lt;code&gt;settings.json&lt;/code&gt; in VSCode:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"github.copilot.advanced"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"debug.overrideProxyUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:6666"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Open any file, and we should see ghost text with &lt;code&gt;world!&lt;/code&gt; coming from our copilot server running on port &lt;code&gt;6666&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F40050247-fa0b-4253-9173-863dc720217f" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F40050247-fa0b-4253-9173-863dc720217f"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F3b047efa-2621-4879-bc89-080eb67080b3" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F3b047efa-2621-4879-bc89-080eb67080b3"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Builder Pattern &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In this section, we will implement a new &lt;a href="https://doc.rust-lang.org/book/ch05-01-defining-structs.html" rel="noopener noreferrer"&gt;&lt;code&gt;struct&lt;/code&gt;&lt;/a&gt; (similar to an &lt;code&gt;Object&lt;/code&gt; in other languages), &lt;code&gt;LLMBuilder&lt;/code&gt;, in &lt;code&gt;llm.rs&lt;/code&gt;, and use it in our binary's entry point, &lt;code&gt;main.rs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To be able to use components from &lt;code&gt;llm.rs&lt;/code&gt; in &lt;code&gt;main.rs&lt;/code&gt;, we lay out our files like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;└── src
    ├── lib.rs
    ├── llm.rs
    └── main.rs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In &lt;code&gt;llm.rs&lt;/code&gt;, we make &lt;code&gt;LLMBuilder&lt;/code&gt; public:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// llm.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Declare the new module (&lt;a href="https://doc.rust-lang.org/stable/reference/items/modules.html" rel="noopener noreferrer"&gt;mod&lt;/a&gt;) in &lt;code&gt;lib.rs&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// lib.rs&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;mod&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// rust will resolve to `./llm.rs`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and use the module in &lt;code&gt;main.rs&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;oxpilot&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;LLMBuilder&lt;/code&gt; is implemented using the "&lt;a href="https://www.youtube.com/watch?v=Z_3WOSiYYFY" rel="noopener noreferrer"&gt;Builder&lt;/a&gt;" design pattern, a &lt;em&gt;creational&lt;/em&gt; pattern that lets you construct complex objects step by step.&lt;/p&gt;

&lt;p&gt;The end result looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hf-internal-testing/llama-tokenizer"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TheBloke/CodeLlama-7B-GGU"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"codellama-7b.Q2_K.gguf"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
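&lt;p&gt;Under the hood, &lt;code&gt;build()&lt;/code&gt; can collect the optional fields and validate them, returning a &lt;code&gt;Result&lt;/code&gt; instead of panicking when something mandatory is missing. The following is a hypothetical, synchronous sketch of that validation idea only; the article's real &lt;code&gt;build()&lt;/code&gt; is &lt;code&gt;async&lt;/code&gt; and does more:&lt;/p&gt;

```rust
// Hypothetical sketch: field names and error type are illustrative.
#[derive(Default)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
}

// A stripped-down stand-in for the built `LLM`.
pub struct LLM {
    pub tokenizer_repo_id: String,
    pub model_repo_id: String,
    pub model_file_name: String,
}

impl LLMBuilder {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn tokenizer_repo_id(mut self, id: impl Into<String>) -> Self {
        self.tokenizer_repo_id = Some(id.into());
        self
    }
    pub fn model_repo_id(mut self, id: impl Into<String>) -> Self {
        self.model_repo_id = Some(id.into());
        self
    }
    pub fn model_file_name(mut self, name: impl Into<String>) -> Self {
        self.model_file_name = Some(name.into());
        self
    }
    // Validation happens once, at build time: a missing mandatory
    // field becomes an `Err` the caller must handle.
    pub fn build(self) -> Result<LLM, String> {
        Ok(LLM {
            tokenizer_repo_id: self.tokenizer_repo_id.ok_or("tokenizer_repo_id is required")?,
            model_repo_id: self.model_repo_id.ok_or("model_repo_id is required")?,
            model_file_name: self.model_file_name.ok_or("model_file_name is required")?,
        })
    }
}

fn main() {
    // Missing mandatory fields => build fails.
    assert!(LLMBuilder::new().model_file_name("file").build().is_err());
    // All mandatory fields set => build succeeds.
    assert!(LLMBuilder::new()
        .tokenizer_repo_id("repo")
        .model_repo_id("repo")
        .model_file_name("file")
        .build()
        .is_ok());
}
```

&lt;p&gt;Pushing validation into &lt;code&gt;build()&lt;/code&gt; keeps every setter infallible; the caller handles a single &lt;code&gt;Result&lt;/code&gt; at the end of the chain.&lt;/p&gt;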
&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Constructor&lt;/em&gt;&lt;/strong&gt; &lt;code&gt;::default()&lt;/code&gt; v.s. &lt;code&gt;::new()&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Rust does not have constructors for &lt;code&gt;struct&lt;/code&gt;s to assign values to fields when creating new instances; it's common to use an &lt;a href="https://doc.rust-lang.org/stable/book/ch05-03-method-syntax.html#associated-functions" rel="noopener noreferrer"&gt;associated function&lt;/a&gt; &lt;code&gt;::new()&lt;/code&gt; for the same purpose. Another option is to use the &lt;a href="https://doc.rust-lang.org/std/default/trait.Default.html" rel="noopener noreferrer"&gt;&lt;code&gt;Default&lt;/code&gt;&lt;/a&gt; trait as a "constructor".&lt;/p&gt;

&lt;p&gt;We implement the &lt;a href="https://doc.rust-lang.org/std/default/trait.Default.html" rel="noopener noreferrer"&gt;&lt;code&gt;Default&lt;/code&gt;&lt;/a&gt; trait for &lt;code&gt;LLMBuilder&lt;/code&gt; and implement &lt;code&gt;new()&lt;/code&gt; for users who prefer the &lt;code&gt;::new()&lt;/code&gt; pattern.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// same as `LLMBuilder::default()`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
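&lt;p&gt;The &lt;code&gt;Default&lt;/code&gt; half can come for free: if every field has a defaultable type, deriving &lt;code&gt;Default&lt;/code&gt; is enough, and &lt;code&gt;new()&lt;/code&gt; simply delegates to it. A minimal sketch; the field names are illustrative, not necessarily the article's final struct:&lt;/p&gt;

```rust
// Hypothetical field layout; deriving `Default` initialises every
// `Option` field to `None`.
#[derive(Default, Debug, PartialEq)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
}

impl LLMBuilder {
    // `new()` is just a conventional alias for `default()`.
    pub fn new() -> Self {
        Self::default()
    }
}

fn main() {
    // Both constructors produce the same empty builder.
    assert_eq!(LLMBuilder::new(), LLMBuilder::default());
}
```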
&lt;h4&gt;
  
  
  &lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; for function parameter &lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;To make the functions in our &lt;code&gt;struct&lt;/code&gt; friendly for users, we use the &lt;code&gt;impl Into&amp;lt;String&amp;gt;&lt;/code&gt; trick to allow passing both &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;&amp;amp;str&lt;/code&gt; as function parameters.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// both are accepted&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"string_slice"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;String&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"String"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
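&lt;p&gt;Inside such a setter, &lt;code&gt;param.into()&lt;/code&gt; converts whichever type the caller passed into a &lt;code&gt;String&lt;/code&gt; before storing it. A minimal sketch, assuming a hypothetical &lt;code&gt;Option&amp;lt;String&amp;gt;&lt;/code&gt; field:&lt;/p&gt;

```rust
#[derive(Default)]
pub struct LLMBuilder {
    tokenizer_repo_id: Option<String>,
}

impl LLMBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    // Accepts both `&str` and `String`; `.into()` normalises to `String`.
    // Returning `Self` keeps the builder chainable.
    pub fn tokenizer_repo_id(mut self, param: impl Into<String>) -> Self {
        self.tokenizer_repo_id = Some(param.into());
        self
    }
}

fn main() {
    let a = LLMBuilder::new().tokenizer_repo_id("string_slice");
    let b = LLMBuilder::new().tokenizer_repo_id(String::from("String"));
    assert_eq!(a.tokenizer_repo_id.as_deref(), Some("string_slice"));
    assert_eq!(b.tokenizer_repo_id.as_deref(), Some("String"));
}
```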
&lt;h3&gt;
  
  
  Type State &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In the previous section, we implemented the builder for &lt;code&gt;LLM&lt;/code&gt;. That is great: we can construct an &lt;code&gt;LLM&lt;/code&gt; with a descriptive chain of methods.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, let's step aside and put ourselves in the users' shoes. If a user tries to use &lt;code&gt;LLMBuilder&lt;/code&gt;, it's possible that they forget to supply a mandatory parameter; for example, one may forget to chain &lt;code&gt;model_repo_id()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is &lt;strong&gt;acceptable&lt;/strong&gt;. Unlike languages designed to throw exceptions at runtime, Rust propagates the error back to the user through &lt;a href="https://doc.rust-lang.org/std/result/" rel="noopener noreferrer"&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/a&gt;. There won't be a runtime exception as long as the user handles the &lt;a href="https://doc.rust-lang.org/std/result/" rel="noopener noreferrer"&gt;&lt;code&gt;Result&lt;/code&gt;&lt;/a&gt; properly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// handle the error properly here&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, what if we could improve the DX and surface the problem as early as possible, shortening the feedback loop, ideally while writing the code, i.e., as a compile-time error?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Type State&lt;/em&gt;&lt;/strong&gt; is a pattern that encodes state in the type system, so the compiler checks the state before the code ever runs.&lt;/p&gt;
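A minimal, generic illustration of the pattern, using a hypothetical `Door` type (not from the article's codebase): the state is a zero-sized type parameter, and each state's `impl` block exposes only the transitions that are valid from that state.

```rust
use std::marker::PhantomData;

// Zero-sized marker types describing the door's state.
struct Open;
struct Closed;

// The state lives in the type parameter, not in a runtime field.
struct Door<State> {
    _state: PhantomData<State>,
}

impl Door<Closed> {
    fn new() -> Self {
        Door { _state: PhantomData }
    }
    // `open()` consumes a closed door and returns an open one.
    fn open(self) -> Door<Open> {
        Door { _state: PhantomData }
    }
    fn state_name(&self) -> &'static str {
        "closed"
    }
}

impl Door<Open> {
    // `close()` only exists on `Door<Open>`.
    fn close(self) -> Door<Closed> {
        Door { _state: PhantomData }
    }
    fn state_name(&self) -> &'static str {
        "open"
    }
}

fn main() {
    let door = Door::new(); // Door<Closed>
    let door = door.open(); // Door<Open>
    assert_eq!(door.state_name(), "open");
    let door = door.close(); // Door<Closed>
    assert_eq!(door.state_name(), "closed");
    // door.open().open(); // would not compile: `open()` is not defined for Door<Open>
}
```

Invalid transitions are unrepresentable: calling `open()` twice simply does not type-check.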

&lt;p&gt;Our goal is to make the compiler warn us when a mandatory parameter for creating an &lt;code&gt;LLM&lt;/code&gt; is missing. For example, this should fail to compile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lllm_builder&lt;/span&gt;&lt;span class="nf"&gt;.build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The compiler will tell the user: hey, &lt;code&gt;build()&lt;/code&gt; can't be used yet, you shall not pass!&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff8261337-e52d-4d59-9429-6201bdb335df" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff8261337-e52d-4d59-9429-6201bdb335df"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;and the code intelligence in the editor will suggest: hey, there is a &lt;code&gt;tokenizer_repo_id()&lt;/code&gt; method available, would you like to try that first?&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F4f12ffef-34d9-4b90-9327-3fe26b7fe45e" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F4f12ffef-34d9-4b90-9327-3fe26b7fe45e"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;We can help our users find the next step by defining the type states:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Init state when `::new()` is called.&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Intermediate state, with token repo id, ready to accept model repo id&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, move each method into the &lt;code&gt;impl&lt;/code&gt; block for the corresponding state. At the beginning, the state is &lt;code&gt;InitState&lt;/code&gt;, and the user can only call &lt;code&gt;new()&lt;/code&gt; (which does not change the state) and &lt;code&gt;tokenizer_repo_id()&lt;/code&gt;, which returns an instance with &lt;code&gt;State=WithTokenizerRepoId&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
           &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="c1"&gt;// does not change state&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InitState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
            &lt;span class="c1"&gt;// change state to `WithTokenizerRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If we inspect the builder instance, we will notice that it has &lt;code&gt;WithTokenizerRepoId&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F9e49bce2-c577-4ec0-8c25-12dff71a21c3" class="article-body-image-wrapper"&gt;&lt;img alt="Screenshot 2023-11-02 at 20 58 57" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2F9e49bce2-c577-4ec0-8c25-12dff71a21c3"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;That's great! Let's add an &lt;code&gt;impl&lt;/code&gt; block for the builder in the &lt;code&gt;WithTokenizerRepoId&lt;/code&gt; state, so the user knows what to do next.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;tokenizer_repo_id&lt;/code&gt; in place, the next step is to set &lt;code&gt;model_repo_id&lt;/code&gt;: calling &lt;code&gt;model_repo_id()&lt;/code&gt; sets &lt;code&gt;model_repo_id&lt;/code&gt; and returns &lt;code&gt;LLMBuilder&amp;lt;WithModelRepoId&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Intermediate state, with model repo id, ready to accept model file name&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithTokenizerRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_repo_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
           &lt;span class="c1"&gt;// change state to `WithModelRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are almost there. The final step is to assign &lt;code&gt;model_file_name&lt;/code&gt;, after which the builder is ready to build.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="cd"&gt;/// With both token repo id and model repo id&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;WithModelRepoId&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_file_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="nb"&gt;Into&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;LLMBuilder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;
           &lt;span class="c1"&gt;// change state to `WithModelRepoId`&lt;/span&gt;
            &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, implement &lt;code&gt;LLMBuilder&amp;lt;ReadyState&amp;gt;&lt;/code&gt;, which adds the &lt;code&gt;build()&lt;/code&gt; method.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LLMBuilder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReadyState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. We have improved our builder: the compiler emits errors when any mandatory parameter is missing, and runtime exceptions are avoided.&lt;/p&gt;

&lt;p&gt;The final result:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;LLMBuilder&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;// mandatory parameters, without these compiler warns&lt;/span&gt;
    &lt;span class="nf"&gt;.tokenizer_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"string_slice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_repo_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.model_file_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.file"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Inspecting the builder, we can see it has the &lt;code&gt;ReadyState&lt;/code&gt;!&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff3eacfb1-1bf3-4b17-ab31-60b444a346e6" class="article-body-image-wrapper"&gt;&lt;img alt="Screenshot 2023-11-02 at 21 15 30" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fchenhunghan%2Foxpilot%2Fassets%2F1474479%2Ff3eacfb1-1bf3-4b17-ab31-60b444a346e6"&gt;&lt;/a&gt;
&lt;/p&gt;
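Putting all the pieces together, here is a complete, compilable sketch of the pattern. The struct fields and the `PhantomData`-based state marker are my assumptions, since the article's snippets elide the struct body with `...`, and `build()` is made synchronous with a `String` stand-in return value so the sketch stays self-contained:

```rust
use std::marker::PhantomData;

// State markers, as in the article.
pub struct InitState;
pub struct WithTokenizerRepoId;
pub struct WithModelRepoId;
pub struct ReadyState;

// Assumed fields: the article elides the struct body with `...`.
pub struct LLMBuilder<State> {
    tokenizer_repo_id: Option<String>,
    model_repo_id: Option<String>,
    model_file_name: Option<String>,
    _state: PhantomData<State>,
}

// Carry the collected fields over while changing the state type.
fn transition<A, B>(b: LLMBuilder<A>) -> LLMBuilder<B> {
    LLMBuilder {
        tokenizer_repo_id: b.tokenizer_repo_id,
        model_repo_id: b.model_repo_id,
        model_file_name: b.model_file_name,
        _state: PhantomData,
    }
}

impl LLMBuilder<InitState> {
    pub fn new() -> Self {
        LLMBuilder {
            tokenizer_repo_id: None,
            model_repo_id: None,
            model_file_name: None,
            _state: PhantomData,
        }
    }
    pub fn tokenizer_repo_id(mut self, id: impl Into<String>) -> LLMBuilder<WithTokenizerRepoId> {
        self.tokenizer_repo_id = Some(id.into());
        transition(self)
    }
}

impl LLMBuilder<WithTokenizerRepoId> {
    pub fn model_repo_id(mut self, id: impl Into<String>) -> LLMBuilder<WithModelRepoId> {
        self.model_repo_id = Some(id.into());
        transition(self)
    }
}

impl LLMBuilder<WithModelRepoId> {
    pub fn model_file_name(mut self, name: impl Into<String>) -> LLMBuilder<ReadyState> {
        self.model_file_name = Some(name.into());
        transition(self)
    }
}

impl LLMBuilder<ReadyState> {
    // The article's `build()` is `async` and returns `Result<LLM>`;
    // a synchronous `String` stand-in keeps this sketch dependency-free.
    pub fn build(self) -> String {
        format!(
            "{}:{}:{}",
            self.tokenizer_repo_id.unwrap(),
            self.model_repo_id.unwrap(),
            self.model_file_name.unwrap()
        )
    }
}

fn main() {
    let llm = LLMBuilder::new()
        .tokenizer_repo_id("repo")
        .model_repo_id("repo")
        .model_file_name("file")
        .build();
    assert_eq!(llm, "repo:repo:file");
    // LLMBuilder::new().build(); // would not compile: `build()` only exists on ReadyState
}
```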

&lt;h3&gt;
  
  
  Share Memory &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;_&amp;gt;&amp;gt;&lt;/code&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;
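As a minimal, generic sketch of sharing state across threads with `Arc<Mutex<_>>` (illustrative only, not code from the project): `Arc` provides shared ownership across threads, and `Mutex` guards each mutation.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared counter: `Arc` gives shared ownership, `Mutex` guards mutation.
    let counter = Arc::new(Mutex::new(0u32));
    let mut handles = Vec::new();

    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            // Lock, mutate, and release (the guard drops at end of scope).
            *counter.lock().unwrap() += 1;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);
}
```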

&lt;h3&gt;
  
  
  Share Memory by Communicating: &lt;strong&gt;Actor&lt;/strong&gt; &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;
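And a minimal std-only sketch of "share memory by communicating": an actor owns its state exclusively and mutates it in response to messages received over a channel, so no lock is needed. The `Msg` enum and its variants are made up for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

// Messages the actor understands (hypothetical names for this sketch).
enum Msg {
    Add(u32),
    Get(mpsc::Sender<u32>),
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // The actor owns its state; only this thread ever touches `total`,
    // so no Mutex is required.
    let actor = thread::spawn(move || {
        let mut total = 0;
        for msg in rx {
            match msg {
                Msg::Add(n) => total += n,
                Msg::Get(reply) => reply.send(total).unwrap(),
            }
        }
    });

    tx.send(Msg::Add(2)).unwrap();
    tx.send(Msg::Add(3)).unwrap();
    let (rtx, rrx) = mpsc::channel();
    tx.send(Msg::Get(rtx)).unwrap();
    assert_eq!(rrx.recv().unwrap(), 5);

    drop(tx); // closing the channel ends the actor's receive loop
    actor.join().unwrap();
}
```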

</description>
      <category>rust</category>
      <category>typescript</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Use Code Llama (and other open LLMs) as Drop-In Replacement for Copilot Code Completion</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 27 Aug 2023 09:37:08 +0000</pubDate>
      <link>https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg</link>
      <guid>https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg</guid>
      <description>&lt;p&gt;&lt;a href="https://huggingface.co/docs/transformers/main/model_doc/code_llama"&gt;CodeLlama&lt;/a&gt; is now available under a commercial-friendly license.&lt;/p&gt;

&lt;p&gt;The question arises: Can we replace GitHub Copilot and use &lt;a href="https://huggingface.co/docs/transformers/main/model_doc/code_llama"&gt;CodeLlama&lt;/a&gt; as the code completion LLM without transmitting source code to the &lt;em&gt;cloud&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;The answer is both yes and no. Tweaking hyperparameters becomes essential in this endeavor. Let's explore the options available as of August 2023.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You might want to read my latest article on &lt;a href="https://dev.to/chenhunghan/i-made-a-copilot-in-rust-here-is-what-i-have-learned-as-a-typescript-dev-52md"&gt;copilot&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By analyzing Copilot's VSCode extension&lt;sup id="fnref1"&gt;1&lt;/sup&gt; at &lt;a href="https://github.com/thakkarparth007/copilot-explorer"&gt;thakkarparth007/copilot-explorer&lt;/a&gt;, it becomes evident that Copilot relies on an OpenAI API-compatible backend. Drawing from prior work such as &lt;a href="https://github.com/fauxpilot/fauxpilot"&gt;fauxpilot&lt;/a&gt;, we know that it's possible to switch the backend by introducing specific modifications to the &lt;a href="https://code.visualstudio.com/docs/getstarted/settings"&gt;&lt;code&gt;settings.json&lt;/code&gt;&lt;/a&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// fauxpilot was using `codegen`&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codegen&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// OpenAI API compatible server url&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:5000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:5000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Choosing an OpenAI API-Compatible Server
&lt;/h2&gt;

&lt;p&gt;To make use of CodeLlama, an OpenAI API-compatible server is all that's required. As of 2023, there are numerous options available, and here are a few noteworthy ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/abetlen/llama-cpp-python/tree/main#web-server"&gt;llama-cpp-python&lt;/a&gt;: This Python-based option supports llama models exclusively.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/vllm-project/vllm"&gt;vllm&lt;/a&gt;: Known for high performance, though it lacks support for GGML.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/flexflow/FlexFlow"&gt;flexflow&lt;/a&gt;: Touting faster performance compared to &lt;code&gt;vllm&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-skynet/LocalAI"&gt;LocalAI&lt;/a&gt;: A feature-rich choice that even supports image generation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md"&gt;FastChat&lt;/a&gt;: Developed by LMSYS.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/bentoml/OpenLLM"&gt;OpenLLM&lt;/a&gt;: An actively developed project.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;: Noteworthy for its focus on Kubernetes.&lt;/li&gt;
&lt;li&gt;...and many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice among these options is entirely up to you. For the purpose of this article, I'll be focusing on &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;, primarily because I am the main contributor and thus intimately familiar with all the implementation details.&lt;/p&gt;

&lt;p&gt;Let's begin with GGML models. These models boast a low memory requirement and operate without the need for a GPU (which might not be as affordable anymore). If you possess robust CUDA (Nvidia) GPUs, I recommend directly proceeding to the GPTQ section of this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up the OpenAI API-Compatible Server
&lt;/h2&gt;

&lt;p&gt;Getting your OpenAI API-compatible server up and running is a straightforward process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clone the Repository and Install Dependencies
&lt;/h3&gt;

&lt;p&gt;Use this one-liner to clone the repository and set up the necessary dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gh repo clone chenhunghan/ialacol &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;ialacol &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the server and download the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-7B-GGML"&gt;model&lt;/a&gt;.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"TheBloke/CodeLlama-7B-GGML"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"codellama-7b.ggmlv3.Q2_K.bin
"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LOGGING_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DEBUG"&lt;/span&gt; &lt;span class="c"&gt;# optional, more on this later&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configure VSCode Copilot extension, pointing to the server.
&lt;/h3&gt;

&lt;p&gt;To integrate the server with the VSCode Copilot extension, edit &lt;code&gt;settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;codellama-7b.ggmlv3.Q2_K.bin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:9999&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:9999&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these configurations in place, you're ready to roll. CodeLlama's code completion capabilities will now be at your fingertips.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweaking for Optimal Performance
&lt;/h2&gt;

&lt;p&gt;While CodeLlama's completion capabilities are impressive, out of the box they might not always meet your expectations, yielding useful suggestions only occasionally, and they are unlikely to match the proficiency of GitHub Copilot, especially in terms of inference speed.&lt;/p&gt;

&lt;p&gt;Several factors contribute to this discrepancy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our current model utilizes 7 billion parameters. To potentially enhance performance, consider experimenting with the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-13B-GGML"&gt;13B&lt;/a&gt; and &lt;a href="https://huggingface.co/TheBloke/CodeLlama-34B-GGUF"&gt;34B&lt;/a&gt; variants.&lt;/li&gt;
&lt;li&gt;GGML/GGUF models are tailored to minimize memory usage rather than prioritize speed. While they excel in asynchronous tasks, code completion mandates swift responses from the server.&lt;/li&gt;
&lt;li&gt;GitHub Copilot's extension generates a multitude of requests as you type, which can pose challenges, given that language models typically process one prompt at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To address these considerations, exploring smaller models is a viable option. Smaller models often exhibit a &lt;em&gt;faster&lt;/em&gt; inference speed. Here are some alternatives to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/salesforce/CodeGen"&gt;CodeGen&lt;/a&gt; offers a &lt;a href="https://huggingface.co/ravenscroftj/CodeGen-2B-multi-ggml-quant"&gt;2B quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/replit/replit-code-v1-3b"&gt;Replit-Code&lt;/a&gt; provides a &lt;a href="https://huggingface.co/abetlen/replit-code-v1-3b-ggml"&gt;3B quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/bigcode/starcoder"&gt;StarCoder&lt;/a&gt; presents a &lt;a href="https://huggingface.co/TheBloke/starcoder-GGML"&gt;quantized version&lt;/a&gt; as well as a &lt;a href="https://huggingface.co/mike-ravkine/gpt_bigcode-santacoder-GGML"&gt;quantized 1B version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/bigcode/tiny_starcoder_py"&gt;TinyCoder&lt;/a&gt; stands as a very compact model with only 164 million parameters (specifically for &lt;code&gt;python&lt;/code&gt;). There's even a &lt;a href="https://huggingface.co/mike-ravkine/tiny_starcoder_py-GGML"&gt;quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://huggingface.co/stabilityai/stablecode-completion-alpha-3b-4k"&gt;Stablecode-Completion&lt;/a&gt; by StabilityAI also offers a &lt;a href="https://huggingface.co/TheBloke/stablecode-completion-alpha-3b-4k-GGML"&gt;quantized version&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a potential increase in throughput, a useful strategy is queuing requests before the inference server. This optimization boosts throughput (not speed) and can be achieved using tools like &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt; (Disclaimer: I authored this tool, and &lt;code&gt;tib&lt;/code&gt; is still in its early alpha phase).&lt;/p&gt;
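To illustrate the idea behind such a batcher (this is not `tib`'s actual implementation), the core of it is dispatching queued requests across several upstream inference servers, for example round-robin:

```rust
// Hypothetical round-robin dispatcher, sketching how completion requests
// could be spread across several OpenAI API-compatible upstreams.
struct Balancer {
    upstreams: Vec<String>,
    next: usize,
}

impl Balancer {
    fn new(upstreams: Vec<String>) -> Self {
        Balancer { upstreams, next: 0 }
    }
    // Pick the next upstream in round-robin order; a real batcher would
    // also queue requests while all upstreams are busy with inference.
    fn pick(&mut self) -> String {
        let url = self.upstreams[self.next].clone();
        self.next = (self.next + 1) % self.upstreams.len();
        url
    }
}

fn main() {
    let mut lb = Balancer::new(vec![
        "http://localhost:9998".to_string(),
        "http://localhost:9999".to_string(),
    ]);
    assert_eq!(lb.pick(), "http://localhost:9998");
    assert_eq!(lb.pick(), "http://localhost:9999");
    assert_eq!(lb.pick(), "http://localhost:9998"); // wraps around
}
```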

&lt;p&gt;Leveraging the various trade-offs at our disposal, let's proceed with the plan: utilizing a high-quality 3B model with a small footprint. Additionally, let's set up two instances of servers to enhance performance further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# in `ialacol` folder you just cloned.&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;THREAD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;span class="c"&gt;# Use small model https://stability.ai/blog/stablecode-llm-generative-ai-coding&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"TheBloke/TheBloke/stablecode-completion-alpha-3b-4k-GGML"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"&lt;/span&gt;
&lt;span class="c"&gt;# truncate the prompt to make inference faster...&lt;/span&gt;
&lt;span class="c"&gt;# (it's a trade off, you get lower quality results too)&lt;/span&gt;
&lt;span class="nv"&gt;TRUNCATE_PROMPT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9998
&lt;span class="c"&gt;# in another terminal session&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 9999 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load Balancing with a Queue to Increase Throughput
&lt;/h3&gt;

&lt;p&gt;To enhance throughput, we can employ load balancing with a queuing mechanism. Here's how you can set it up using &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting Up &lt;code&gt;tib&lt;/code&gt; for Load Balancing
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository and set up the necessary environment:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# clone and setup&lt;/span&gt;
gh repo clone ialacol/text-inference-batcher &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;text-inference-batcher &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start &lt;code&gt;tib&lt;/code&gt;, pointing it at your upstream servers.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export UPSTREAMS="http://localhost:9998,http://localhost:9999"
npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Configure the Copilot extension, pointing it at the load balancer.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;github.copilot.advanced&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideEngine&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// pointing to `tib`&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.testOverrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:8000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;debug.overrideProxyUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:8000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the smaller model and prompt truncation trading away some inference quality, response times improved noticeably. The completions, however, still fall short of GitHub Copilot's.&lt;/p&gt;

&lt;p&gt;Let's now venture to push the limits in the opposite direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveraging Cloud Infrastructure for Enhanced Performance
&lt;/h2&gt;

&lt;p&gt;If you possess powerful cloud infrastructure equipped with GPUs, the process becomes notably streamlined.&lt;/p&gt;

&lt;p&gt;In this scenario, we will harness the capabilities of Kubernetes due to its exceptional automation features. Both &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt; and &lt;a href="https://github.com/ialacol/text-inference-batcher"&gt;text-inference-batcher&lt;/a&gt; are inherently compatible with Kubernetes, which further simplifies the setup.&lt;/p&gt;

&lt;p&gt;Let's delve into deploying the &lt;a href="https://huggingface.co/TheBloke/CodeLlama-34B-GPTQ"&gt;34B CodeLlama GPTQ model&lt;/a&gt; onto Kubernetes clusters with CUDA acceleration, using the &lt;code&gt;Helm&lt;/code&gt; package manager:&lt;/p&gt;

&lt;p&gt;(&lt;code&gt;values.yaml&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/chenhunghan/ialacol-gptq:latest&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/CodeLlama-34B-GPTQ&lt;/span&gt;
    &lt;span class="na"&gt;TOP_K&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;TOP_P&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.9&lt;/span&gt;
    &lt;span class="na"&gt;MAX_TOKENS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;THREADS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Request a node with Nvidia 1 GPU&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30Gi&lt;/span&gt;
    &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
    &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="c1"&gt;# You probably need to use these to select a node with GPUs.&lt;/span&gt;
&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
&lt;span class="c"&gt;# work one&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; codellama-worker-0 ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;span class="c"&gt;# work two&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; codellama-worker-1 ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;span class="c"&gt;# and maybe more? Depends on your budget :)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, load balancing using &lt;code&gt;tib&lt;/code&gt; with this &lt;code&gt;values.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/ialacol/text-inference-batcher-nodejs:latest&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# pointing to our workers&lt;/span&gt;
    &lt;span class="na"&gt;UPSTREAMS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://codellama-worker-0:8000,http://codellama-worker-1:8000"&lt;/span&gt;
    &lt;span class="c1"&gt;# increase this if your the worker can handle more then one inference at a time.&lt;/span&gt;
    &lt;span class="na"&gt;MAX_CONNECT_PER_UPSTREAM&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="c1"&gt;# If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout&lt;/span&gt;
  &lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"&lt;/span&gt;
&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; tib text-inference-batcher/text-inference-batcher-nodejs &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expose the &lt;code&gt;tib&lt;/code&gt; service by utilizing your cloud's load balancer, or for testing purposes, you can employ &lt;code&gt;kubectl port-forward&lt;/code&gt;.&lt;/p&gt;
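&lt;p&gt;For a quick smoke test before wiring up the editor, you can port-forward the service and send a completion request directly. This assumes the Helm release is named &lt;code&gt;tib&lt;/code&gt; and listens on port 8000; the &lt;code&gt;model&lt;/code&gt; value may differ depending on how your workers report their model names:&lt;/p&gt;

```shell
# forward the tib service to localhost (assumes the release is named `tib`)
kubectl port-forward svc/tib 8000:8000 &

# tib exposes an OpenAI-compatible API, so a plain completion request works
curl -X POST -H 'Content-Type: application/json' \
  -d '{"prompt": "def fibonacci(n):", "model": "TheBloke/CodeLlama-34B-GPTQ"}' \
  http://localhost:8000/v1/completions
```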

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With CodeLlama operating at 34B, benefiting from CUDA acceleration, and employing at least one worker, the code completion experience becomes not only swift but also of commendable quality. I would confidently state that this setup is on par with the performance of GitHub Copilot.&lt;/p&gt;

&lt;p&gt;Nonetheless, it's crucial to acknowledge that this particular configuration does come at a notably higher cost when compared to &lt;a href="https://github.com/features/copilot#pricing"&gt;GitHub Copilot&lt;/a&gt;. Striking a balance between budget considerations and privacy concerns is imperative. This investment is especially justifiable when handling proprietary or enterprise-level software projects. Conversely, the pricing structure of Copilot holds its own appeal.&lt;/p&gt;

&lt;p&gt;In essence, we're fortunate to have a range of options at our disposal. Your thoughts and feedback are valuable, so feel free to share your insights in the comments section.&lt;/p&gt;

&lt;p&gt;Let's keep the conversation going! 🚀&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Highly recommended: go through the Copilot source code; you will learn about its prompt engineering and the multiple levels of client-side caching applied before a request ever hits the server. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llama</category>
    </item>
    <item>
      <title>Deploy Llama 2 AI on Kubernetes, Now</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Wed, 19 Jul 2023 16:41:47 +0000</pubDate>
      <link>https://dev.to/chenhunghan/deploy-llama-2-ai-on-kubernetes-now-2jc5</link>
      <guid>https://dev.to/chenhunghan/deploy-llama-2-ai-on-kubernetes-now-2jc5</guid>
      <description>&lt;p&gt;Llama 2 is the newest open-sourced LLM with a &lt;em&gt;custom&lt;/em&gt; commercial &lt;a href="https://ai.meta.com/llama/license/"&gt;license&lt;/a&gt; by &lt;a href="https://huggingface.co/meta-llama"&gt;Meta&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are a few simple steps to try Llama 2 13B on Kubernetes, in just a few clicks.&lt;/p&gt;

&lt;p&gt;You will need a node with a PVC of about 10GB and 16 vCPUs to get reasonable response times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; values.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/Llama-2-13B-chat-GGML
    DEFAULT_MODEL_FILE: llama-2-13b-chat.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm &lt;span class="nb"&gt;install &lt;/span&gt;llama-2-13b-chat ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Port forward&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/llama-2-13b-chat 8000:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Talk to it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "messages": [{"role": "user", "content": "Hello, are you better then llama version one?"}], "temperature":"1", "model": "llama-2-13b-chat.ggmlv3.q4_0.bin"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:8000/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hi there! I'm happy to help answer your questions. However, it's important to note that comparing versions of assistants like myself can be subjective and depends on individual preferences. Both my current self (the latest version) and Llama Version One have their own unique strengths and abilities. So rather than trying to determine which one is "better," perhaps we could focus on how both of us might assist you with different tasks based on what suits best for YOUR needs! Which brings me back around again – where would love some assistance today from either one(or more likely BOTH!) of our amazing offerings?” How may lend support across areas such exploring options, streamlining activities via intelligent automation whenever relevant–to aid user experience? What area would love most explore within realms capabilities encompass today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

&lt;p&gt;The project used to deploy Llama 2 on k8s is open-sourced under the MIT license; see &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AI for Everyone!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llama</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Cloud Native Workflow for *Private* MPT-30B AI Apps</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 01 Jul 2023 14:16:56 +0000</pubDate>
      <link>https://dev.to/chenhunghan/cloud-native-workflow-for-private-ai-apps-2omb</link>
      <guid>https://dev.to/chenhunghan/cloud-native-workflow-for-private-ai-apps-2omb</guid>
      <description>&lt;p&gt;In this article, we will guide you through the process of developing your own private AI application 🤖, leveraging the capabilities of Kubernetes.&lt;/p&gt;

&lt;p&gt;Unlike many other tutorials, we will &lt;strong&gt;NOT&lt;/strong&gt; rely on OpenAI APIs. Instead, we will utilize a private AI instance with an Apache 2.0-licensed model, &lt;a href="https://huggingface.co/mosaicml/mpt-30b"&gt;MPT-30B&lt;/a&gt;, which ensures the &lt;strong&gt;confidentiality&lt;/strong&gt; of all 🔒 sensitive data 🔒 within your Kubernetes cluster. No data goes to the third-party cloud 🙅‍♂️ 🌩️!&lt;/p&gt;

&lt;p&gt;To set up the development environment on Kubernetes, we will utilize &lt;a href="https://www.devspace.sh/"&gt;devspace&lt;/a&gt;. This environment includes a file sync pipeline for your AI application, as well as the backend &lt;a href="https://github.com/chenhunghan/ialacol"&gt;AI API&lt;/a&gt; (a RESTful API service designed to replace OpenAI API) for the AI app.&lt;/p&gt;

&lt;p&gt;Let's kick-start the process by deploying the necessary services on Kubernetes using the command &lt;code&gt;devspace deploy&lt;/code&gt;. DevSpace will handle the deployment of the initial structure of our applications, along with their dependencies, including &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt;. For more detailed explanations, please refer to the in-line comments provided in the code snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is the configuration file for DevSpace&lt;/span&gt;
&lt;span class="c1"&gt;# &lt;/span&gt;
&lt;span class="c1"&gt;# devspace use namespace private-ai # suggest to use a namespace instead of the default name space&lt;/span&gt;
&lt;span class="c1"&gt;# devspace deploy # deploy the skeleton of the app and the dependencies (ialacol)&lt;/span&gt;
&lt;span class="c1"&gt;# devspace dev # start syncing files to the container&lt;/span&gt;
&lt;span class="c1"&gt;# devspace purge # to clean up&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2beta1&lt;/span&gt;
&lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# This are the manifest our private app deployment&lt;/span&gt;
  &lt;span class="c1"&gt;# The app will be in "sleep mode" after `devspace deploy`, and start when we start&lt;/span&gt;
  &lt;span class="c1"&gt;# syncing files to the container by `devspace dev`&lt;/span&gt;
  &lt;span class="na"&gt;private-ai-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# We are deploying the so-called Component Chart: https://devspace.sh/component-chart/docs&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;component-chart&lt;/span&gt;
        &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://charts.devspace.sh&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/loft-sh/devspace-containers/python:3-alpine&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep"&lt;/span&gt;
            &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;99999"&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;private-ai-app&lt;/span&gt;
  &lt;span class="na"&gt;ialacol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;helm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# the backend for the AI app, we are using ialacol https://github.com/chenhunghan/ialacol/&lt;/span&gt;
      &lt;span class="na"&gt;chart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ialacol&lt;/span&gt;
        &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://chenhunghan.github.io/ialacol&lt;/span&gt;
      &lt;span class="c1"&gt;# overriding values.yaml of ialacol helm chart&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/chenhunghan/ialacol:latest&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We are using MPT-30B, which is the most sophisticated model at the moment&lt;/span&gt;
            &lt;span class="c1"&gt;# If you want to start with some small but mightym try orca-mini&lt;/span&gt;
            &lt;span class="c1"&gt;# DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML&lt;/span&gt;
            &lt;span class="c1"&gt;# DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin&lt;/span&gt;
            &lt;span class="c1"&gt;# MPT-30B&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/mpt-30B-GGML&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mpt-30b.ggmlv0.q4_1.bin&lt;/span&gt;
            &lt;span class="na"&gt;DEFAULT_MODEL_META&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# Request more resource if needed&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="c1"&gt;# pvc for storing the cache&lt;/span&gt;
        &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5Gi&lt;/span&gt;
            &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
            &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
        &lt;span class="na"&gt;cacheMountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/cache&lt;/span&gt;
        &lt;span class="c1"&gt;# pvc for storing the models&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;persistence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
            &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
            &lt;span class="na"&gt;storageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~&lt;/span&gt;
        &lt;span class="na"&gt;modelMountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app/models&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
          &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="c1"&gt;# You might want to use the following to select a node with more CPU and memory&lt;/span&gt;
        &lt;span class="c1"&gt;# for MPT-30B, we need at least 32GB of memory&lt;/span&gt;
        &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
        &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
        &lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's wait a few seconds for the pods to become green. I am using &lt;a href="https://github.com/lensapp/lens"&gt;Lens&lt;/a&gt;; it's awesome, btw.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zLJF6H8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7jy8twpu7x43mg76cts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zLJF6H8N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d7jy8twpu7x43mg76cts.png" alt="Waiting for pending pods" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When all pods are green. We are ready for the next step.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5tBTnSu6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clbgbqscmj6i0rmmw5r7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5tBTnSu6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clbgbqscmj6i0rmmw5r7.png" alt="Pods are ready" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The private AI app we are developing is a simple web server with an endpoint &lt;code&gt;POST /prompt&lt;/code&gt;. When a client sends a request with a &lt;code&gt;prompt&lt;/code&gt; in the request body to &lt;code&gt;POST /prompt&lt;/code&gt;, the endpoint's controller will forward the &lt;code&gt;prompt&lt;/code&gt; to the backend &lt;a href="https://github.com/chenhunghan/ialacol"&gt;AI API&lt;/a&gt;, retrieve the response, and send it back to the client.&lt;/p&gt;

&lt;p&gt;To begin, let's install the necessary dependencies on our local machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai &lt;span class="c"&gt;# We are not using OpenAI API, but we can use openai client library to simplify things because our backend (ialacol) has OpenAI compatible RESTful interface.&lt;/span&gt;
pip freeze &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and create a &lt;code&gt;main.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/prompt"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="c1"&gt;# Add more logics here, for example, you can add the context to the prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# using context augmentation retrieval methods
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"mpt-30b.ggmlv0.q4_1.bin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation of our app's &lt;code&gt;POST /prompt&lt;/code&gt; endpoint is straightforward: it acts as a proxy, forwarding the request to the ialacol backend. You can extend it further with additional functionality, such as retrieval-augmented generation based on the provided &lt;code&gt;prompt&lt;/code&gt;.&lt;/p&gt;
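
&lt;p&gt;As a rough sketch of what that extension might look like (the document store and keyword scoring below are hypothetical stand-ins for a real retriever, such as a vector database):&lt;/p&gt;

```python
# A minimal sketch of retrieval-augmented prompting. The DOCUMENTS
# list and the naive keyword scoring are hypothetical stand-ins for
# a real retriever such as a vector database.
DOCUMENTS = [
    "ialacol is an OpenAI API compatible server for GGML models.",
    "DevSpace syncs local files into a running Kubernetes pod.",
]

def retrieve_context(prompt, top_k=1):
    """Rank documents by naive keyword overlap with the prompt."""
    words = set(prompt.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda doc: len(words.intersection(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def augment(prompt):
    """Prepend the retrieved context before forwarding the prompt."""
    context = "\n".join(retrieve_context(prompt))
    return "Context:\n" + context + "\n\nQuestion: " + prompt
```

&lt;p&gt;The augmented string would then replace &lt;code&gt;prompt&lt;/code&gt; in the call to the backend.&lt;/p&gt;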

&lt;p&gt;With the core functionality of the app in place, let's synchronize the source files to the cluster by running the command &lt;code&gt;devspace dev&lt;/code&gt;. This command performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It instructs DevSpace to sync the files located in the root folder to the &lt;code&gt;/app&lt;/code&gt; folder of the remote pod.&lt;/li&gt;
&lt;li&gt;Whenever changes are made to the &lt;code&gt;requirements.txt&lt;/code&gt; file, it triggers a &lt;code&gt;pip install&lt;/code&gt; within the pod.&lt;/li&gt;
&lt;li&gt;Additionally, it forwards port &lt;code&gt;8000&lt;/code&gt;, allowing us to access the app at &lt;code&gt;http://localhost:8000&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;private-ai-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use the label selector to select the pod for swapping out the container&lt;/span&gt;
    &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;private-ai-app&lt;/span&gt;
    &lt;span class="c1"&gt;# use the name space we assign by devspace use namespace&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${DEVSPACE_NAMESPACE}&lt;/span&gt;
    &lt;span class="na"&gt;devImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/loft-sh/devspace-containers/python:3-alpine&lt;/span&gt;
    &lt;span class="na"&gt;workingDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uvicorn"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main:app"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--reload"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--host"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# expose the port 8000 to the host&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000:8000"&lt;/span&gt;
    &lt;span class="c1"&gt;# Add env for the pod if needed&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# This will tell openai python library to use the ialacol service instead of the OpenAI cloud&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_BASE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ialacol.${DEVSPACE_NAMESPACE}.svc.cluster.local:8000/v1"&lt;/span&gt;
    &lt;span class="c1"&gt;# You don't need to have an OpenAI API key, but OpenAI python library will complain without it&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-xxx"&lt;/span&gt;
    &lt;span class="na"&gt;sync&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./:/app&lt;/span&gt;
        &lt;span class="na"&gt;excludePaths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;requirements.txt&lt;/span&gt;
        &lt;span class="na"&gt;printLogs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;uploadExcludeFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.dockerignore&lt;/span&gt;
        &lt;span class="na"&gt;downloadExcludeFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.gitignore&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./requirements.txt:/app/requirements.txt&lt;/span&gt;
        &lt;span class="c1"&gt;# start the container after uploading the requirements.txt and install the dependencies&lt;/span&gt;
        &lt;span class="na"&gt;startContainer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;printLogs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;onUpload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
              &lt;span class="s"&gt;pip install -r requirements.txt&lt;/span&gt;
            &lt;span class="na"&gt;onChange&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requirements.txt"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;lastLines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the file sync to complete (you should see some logs in the terminal), then test the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "prompt": "Hello!" }'&lt;/span&gt; http://localhost:8000/prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, enjoy building your first private AI app  🥳!&lt;/p&gt;

&lt;p&gt;The full source code for this article is available at &lt;a href="https://github.com/chenhunghan/private-ai-app-starter-python"&gt;private-ai-app-starter-python&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>cloudnative</category>
      <category>llm</category>
    </item>
    <item>
      <title>Offline AI 🤖 on Github Actions 🙅‍♂️💰</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sat, 01 Jul 2023 07:55:53 +0000</pubDate>
      <link>https://dev.to/chenhunghan/offline-ai-on-github-actions-38a1</link>
      <guid>https://dev.to/chenhunghan/offline-ai-on-github-actions-38a1</guid>
      <description>&lt;p&gt;In this article, we will walk through the steps to set up an offline AI on Github Actions that respects your privacy by &lt;strong&gt;&lt;em&gt;NOT&lt;/em&gt;&lt;/strong&gt; sending your source code to the internet. This AI will add a touch of humor by telling jokes whenever a developer creates a boring pull request.&lt;/p&gt;

&lt;p&gt;Github provides a generous offering: as long as your project is open source, you can use their Github-hosted runners for free.&lt;/p&gt;

&lt;p&gt;However, the Github-hosted runner comes with some limitations in terms of computational power. It offers 2 vCPUs, 7GB of RAM, and 14GB of storage (&lt;a href="https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources" rel="noopener noreferrer"&gt;ref&lt;/a&gt;). On the other hand, AI computing, or LLM inference, is considered a luxury due to its resource requirements and associated costs 💸.&lt;/p&gt;

&lt;p&gt;The stock price of Nvidia (the company that makes GPUs for AI):&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ceammnhl2yietv3lkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ceammnhl2yietv3lkk.png" alt="The stock price of Nvidia"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, thanks to the efforts of amazing community projects like &lt;a href="https://github.com/ggerganov/ggml" rel="noopener noreferrer"&gt;ggml&lt;/a&gt;, it is now possible to run LLM (Large Language Model) on edge devices such as 🍓🥧 Raspberry Pi 4.&lt;/p&gt;

&lt;p&gt;In this article, I will present the Github Actions snippets that allow you to run an LLM with 3B parameters directly on Github Actions, even with just 2 CPU cores and 7GB of RAM. These actions are triggered when a developer initiates a new pull request, and the AI will lighten the mood by sharing a joke to entertain the developer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Can 3B AI with 2 CPUs make good jokes?&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;TEMPERATURE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TheBloke/orca_mini_3B-GGML&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orca-mini-3b.ggmlv3.q4_0.bin&lt;/span&gt;
  &lt;span class="na"&gt;DEFAULT_MODEL_META&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;THREADS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;CONTEXT_LENGTH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;joke&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create k8s Kind Cluster&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm/kind-action@v1.7.0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;kubectl cluster-info&lt;/span&gt;
          &lt;span class="s"&gt;kubectl get nodes&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Helm&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/setup-helm@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.12.0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install ialacol and wait for pods to be ready&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol&lt;/span&gt;
          &lt;span class="s"&gt;helm repo update&lt;/span&gt;

          &lt;span class="s"&gt;cat &amp;gt; values.yaml &amp;lt;&amp;lt;EOF&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 1&lt;/span&gt;
          &lt;span class="s"&gt;deployment:&lt;/span&gt;
            &lt;span class="s"&gt;image: quay.io/chenhunghan/ialacol:latest&lt;/span&gt;
            &lt;span class="s"&gt;env:&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_HG_REPO_ID: $DEFAULT_MODEL_HG_REPO_ID&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_FILE: $DEFAULT_MODEL_FILE&lt;/span&gt;
              &lt;span class="s"&gt;DEFAULT_MODEL_META: $DEFAULT_MODEL_META&lt;/span&gt;
              &lt;span class="s"&gt;THREADS: $THREADS&lt;/span&gt;
              &lt;span class="s"&gt;BATCH_SIZE: $BATCH_SIZE&lt;/span&gt;
              &lt;span class="s"&gt;CONTEXT_LENGTH: $CONTEXT_LENGTH&lt;/span&gt;
          &lt;span class="s"&gt;resources:&lt;/span&gt;
            &lt;span class="s"&gt;{}&lt;/span&gt;
          &lt;span class="s"&gt;cache:&lt;/span&gt;
            &lt;span class="s"&gt;persistence:&lt;/span&gt;
              &lt;span class="s"&gt;size: 0.5Gi&lt;/span&gt;
              &lt;span class="s"&gt;accessModes:&lt;/span&gt;
                &lt;span class="s"&gt;- ReadWriteOnce&lt;/span&gt;
              &lt;span class="s"&gt;storageClass: ~&lt;/span&gt;
          &lt;span class="s"&gt;cacheMountPath: /app/cache&lt;/span&gt;
          &lt;span class="s"&gt;model:&lt;/span&gt;
            &lt;span class="s"&gt;persistence:&lt;/span&gt;
              &lt;span class="s"&gt;size: 2Gi&lt;/span&gt;
              &lt;span class="s"&gt;accessModes:&lt;/span&gt;
                &lt;span class="s"&gt;- ReadWriteOnce&lt;/span&gt;
              &lt;span class="s"&gt;storageClass: ~&lt;/span&gt;
          &lt;span class="s"&gt;modelMountPath: /app/models&lt;/span&gt;
          &lt;span class="s"&gt;service:&lt;/span&gt;
            &lt;span class="s"&gt;type: ClusterIP&lt;/span&gt;
            &lt;span class="s"&gt;port: 8000&lt;/span&gt;
            &lt;span class="s"&gt;annotations: {}&lt;/span&gt;
          &lt;span class="s"&gt;nodeSelector: {}&lt;/span&gt;
          &lt;span class="s"&gt;tolerations: []&lt;/span&gt;
          &lt;span class="s"&gt;affinity: {}&lt;/span&gt;
          &lt;span class="s"&gt;EOF&lt;/span&gt;
          &lt;span class="s"&gt;helm install ialacol ialacol/ialacol -f values.yaml --namespace default&lt;/span&gt;

          &lt;span class="s"&gt;echo "Wait for the pod to be ready, it takes about 36s to download a 1.93GB model (~50MB/s)"&lt;/span&gt;
          &lt;span class="s"&gt;sleep 40&lt;/span&gt;
          &lt;span class="s"&gt;kubectl get pods -n default&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ask the AI for a joke&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;kubectl port-forward svc/ialacol 8000:8000 &amp;amp;&lt;/span&gt;
          &lt;span class="s"&gt;echo "Wait for port-forward to be ready"&lt;/span&gt;
          &lt;span class="s"&gt;sleep 5&lt;/span&gt;

          &lt;span class="s"&gt;curl http://localhost:8000/v1/models&lt;/span&gt;

          &lt;span class="s"&gt;RESPONSE=$(curl -X POST -H 'Content-Type: application/json' -d '{ "messages": [{"role": "user", "content": "Tell me a joke."}], "temperature":"'${TEMPERATURE}'", "model": "'${DEFAULT_MODEL_FILE}'"}' http://localhost:8000/v1/chat/completions)&lt;/span&gt;
          &lt;span class="s"&gt;echo "$RESPONSE"&lt;/span&gt;

          &lt;span class="s"&gt;REPLY=$(echo "$RESPONSE" | jq -r '.choices[0].message.content')&lt;/span&gt;
          &lt;span class="s"&gt;echo "$REPLY"&lt;/span&gt;

          &lt;span class="s"&gt;kubectl logs --selector app.kubernetes.io/name=$HELM_RELEASE_NAME -n default&lt;/span&gt;

          &lt;span class="s"&gt;if [ -z "$REPLY" ]; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "No reply from AI"&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

          &lt;span class="s"&gt;echo "REPLY=$REPLY" &amp;gt;&amp;gt; $GITHUB_ENV&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Comment the Joke&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v6&lt;/span&gt;
        &lt;span class="c1"&gt;# Note, issue and PR are the same thing in GitHub's eyes&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;const REPLY = process.env.REPLY&lt;/span&gt;
            &lt;span class="s"&gt;if (REPLY) {&lt;/span&gt;
              &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
                &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
                &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
                &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
                &lt;span class="s"&gt;body: `🤖: ${REPLY}`&lt;/span&gt;
              &lt;span class="s"&gt;})&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
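
&lt;p&gt;For reference, the &lt;code&gt;jq&lt;/code&gt; extraction in the workflow above maps to plain dictionary access in Python; the sample response here is a made-up illustration of the OpenAI-compatible shape that ialacol returns:&lt;/p&gt;

```python
import json

# A made-up chat completion response in the OpenAI-compatible shape.
sample = json.loads("""
{
  "choices": [
    {"message": {"role": "assistant", "content": "Why did the PR cross the repo?"}}
  ]
}
""")

# Equivalent of: echo "$RESPONSE" | jq -r '.choices[0].message.content'
reply = sample["choices"][0]["message"]["content"]
print(reply)
```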



&lt;p&gt;Is the joke any good?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg10aqnw3o6ef65icetu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg10aqnw3o6ef65icetu3.png" alt="The comment from AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, it's up for debate. If you want better jokes, you can bring a &lt;a href="https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners" rel="noopener noreferrer"&gt;self-hosted runner&lt;/a&gt;. A self-hosted runner (with, for example, 16 vCPUs and 32GB of RAM) would definitely be capable of running more sophisticated models such as &lt;a href="https://huggingface.co/mosaicml/mpt-30b" rel="noopener noreferrer"&gt;MPT-30B&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You might be wondering why running Kubernetes is necessary for this project. This article was actually created during the development of a testing CI for the OSS project &lt;a href="https://github.com/chenhunghan/ialacol" rel="noopener noreferrer"&gt;ialacol&lt;/a&gt;. The goal was to have a basic smoke test that verifies the Helm charts and ensures the endpoint returns a &lt;code&gt;200&lt;/code&gt; status code. You can find the full source of the testing CI YAML &lt;a href="https://github.com/chenhunghan/ialacol/blob/main/.github/workflows/smoke_test.yaml" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While running Kubernetes may not be necessary for your specific use case, it's worth mentioning that the overhead of the container runtime and Kubernetes is minimal. In fact, the CI process, which includes LLM inference from provisioning to completion, takes only &lt;strong&gt;&lt;em&gt;2 minutes&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>ai</category>
      <category>kubernetes</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Containerized AI before Apocalypse 🐳🤖</title>
      <dc:creator>chh</dc:creator>
      <pubDate>Sun, 25 Jun 2023 08:55:16 +0000</pubDate>
      <link>https://dev.to/chenhunghan/containerized-ai-before-apocalypse-1569</link>
      <guid>https://dev.to/chenhunghan/containerized-ai-before-apocalypse-1569</guid>
      <description>&lt;p&gt;ChatGPT is awesome, and privacy is a concern for many. But what if you could host your own private AI on an old PC without relying on GPU clusters?&lt;/p&gt;

&lt;p&gt;Thanks to the efforts of the amazing community projects like &lt;a href="https://github.com/ggerganov/ggml"&gt;ggml&lt;/a&gt;, &lt;a href="https://github.com/ggerganov/llama.cpp"&gt;llama.cpp&lt;/a&gt;, and &lt;a href="https://huggingface.co/TheBloke"&gt;TheBloke&lt;/a&gt;, it is now possible for anyone to chat with AI, privately, without internet, &lt;del&gt;before the apocalypse&lt;/del&gt;.&lt;/p&gt;

&lt;p&gt;In this article, &lt;del&gt;we will containerize an AI before it ends the world&lt;/del&gt; we will explore how to deploy a Large Language Model (LLM, also known as AI) in a container within a Kubernetes cluster, enabling us to have conversations with it.&lt;/p&gt;

&lt;p&gt;To get started, you'll need a Kubernetes cluster, for example, a &lt;a href="https://minikube.sigs.k8s.io/docs/start/"&gt;minikube&lt;/a&gt; with approximately 8 CPU threads and 5GB of memory. Additionally, you'll need to have &lt;a href="https://helm.sh/docs/intro/install/"&gt;Helm&lt;/a&gt; installed.&lt;/p&gt;
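
&lt;p&gt;Assuming minikube is already installed, a local cluster with roughly those resources can be started like this (a setup sketch; adjust the flags to your machine):&lt;/p&gt;

```shell
# Start a local cluster sized for a 3B-parameter model.
minikube start --cpus 8 --memory 5g
```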

&lt;p&gt;Let's begin by deploying the LLM within a minimal wrapper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; values.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
replicas: 1
deployment:
  image: quay.io/chenhunghan/ialacol:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/orca_mini_3B-GGML
    DEFAULT_MODEL_FILE: orca-mini-3b.ggmlv3.q4_0.bin
    DEFAULT_MODEL_META: ""
    THREADS: 8
    BATCH_SIZE: 8
    CONTEXT_LENGTH: 1024
service:
  type: ClusterIP
  port: 8000
  annotations: {}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm &lt;span class="nb"&gt;install &lt;/span&gt;orca-mini-3b ialacol/ialacol &lt;span class="nt"&gt;-f&lt;/span&gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're interested in the technical details, here's what's happening behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We are deploying a Helm release &lt;code&gt;orca-mini-3b&lt;/code&gt; using the Helm chart &lt;a href="https://github.com/chenhunghan/ialacol/tree/main/charts/ialacol"&gt;ialacol&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The container image &lt;a href="https://github.com/chenhunghan/ialacol"&gt;ialacol&lt;/a&gt; is a mini RESTful API server compatible with the &lt;a href="https://platform.openai.com/docs/api-reference"&gt;OpenAI API&lt;/a&gt;. Disclaimer: I am the main contributor to this project.&lt;/li&gt;
&lt;li&gt;The deployed LLM binary, &lt;a href="https://huggingface.co/psmathur/orca_mini_3b"&gt;orca mini&lt;/a&gt;, has 3 billion parameters. Orca mini is based on the &lt;a href="https://github.com/openlm-research/open_llama"&gt;OpenLLaMA&lt;/a&gt; project.&lt;/li&gt;
&lt;li&gt;The binary has been quantized by &lt;a href="https://huggingface.co/TheBloke"&gt;TheBloke&lt;/a&gt; into a 4-bit GGML format.&lt;/li&gt;
&lt;/ul&gt;
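
&lt;p&gt;As a back-of-the-envelope check on that 4-bit quantization (the only assumption here is the q4_0 block layout: 32 weights packed as nibbles plus one fp16 scale per block):&lt;/p&gt;

```python
# Lower-bound size estimate for a 4-bit (q4_0) quantized 3B model.
# q4_0 packs 32 weights into 16 bytes of nibbles plus one fp16 scale;
# the real file is larger because the "3B" count is nominal and the
# file also stores embeddings and metadata.
params = 3_000_000_000
block_size = 32           # weights per q4_0 block
bytes_per_block = 16 + 2  # 32 x 4-bit weights + one fp16 scale

total_bytes = params / block_size * bytes_per_block
total_gb = total_bytes / 1e9
print(f"{total_gb:.2f} GB")  # roughly 1.69 GB
```

&lt;p&gt;That lines up with the ~1.93GB download reported when the pod starts.&lt;/p&gt;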

&lt;p&gt;Now, please be patient for a few minutes as the container downloads the binary, which is around 1.93GB in size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO:     Downloading model... TheBloke/orca_mini_3B-GGML/orca-mini-3b.ggmlv3.q4_0.bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the download is complete, it's time to start a conversation!&lt;/p&gt;

&lt;p&gt;Expose the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/orca-mini-3b 8000:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ask a question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;USER_QUERY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"What is the meaning of life? Explain like I am 5."&lt;/span&gt;
&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orca-mini-3b.ggmlv3.q4_0.bin"&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{ "prompt": "### System:You are an AI assistant that follows instruction extremely well. Help as much as you can.### User:'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USER_QUERY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'### Response:", "model": "'&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'" }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8000/v1/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
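
&lt;p&gt;The same request can be assembled in Python. The &lt;code&gt;### System / ### User / ### Response&lt;/code&gt; template mirrors the curl command above; actually sending it (commented out) would require the port-forward to be running:&lt;/p&gt;

```python
import json

SYSTEM = (
    "You are an AI assistant that follows instruction extremely well. "
    "Help as much as you can."
)

def build_payload(user_query, model):
    """Assemble an orca-mini style completion request body."""
    prompt = "### System:" + SYSTEM + "### User:" + user_query + "### Response:"
    return {"prompt": prompt, "model": model}

payload = build_payload(
    "What is the meaning of life? Explain like I am 5.",
    "orca-mini-3b.ggmlv3.q4_0.bin",
)
body = json.dumps(payload)
# To actually call the service via the port-forward:
# import requests
# requests.post("http://localhost:8000/v1/completions",
#               headers={"Content-Type": "application/json"}, data=body)
```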



&lt;p&gt;According to AI...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The meaning of life is a question that has puzzled humans for centuries. Some believe it to be finding happiness, others think it's achieving success or something greater than ourselves, while some see it as fulfilling our purpose on this planet. Ultimately, everyone answers this question differently and what matters most in the end is how we live our lives with integrity and make a positive impact on those around us.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's start scaling LLM on Kubernetes!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
