CodeLlama is now available under a commercial-friendly license.
The question arises: Can we replace GitHub Copilot and use CodeLlama as the code completion LLM without transmitting source code to the cloud?
The answer is both yes and no. Tweaking hyperparameters becomes essential in this endeavor. Let's explore the options available as of August 2023.
Note: You might also want to read my latest article on Copilot.
By analyzing Copilot's VSCode extension[1] at thakkarparth007/copilot-explorer, it becomes evident that Copilot relies on an OpenAI API-compatible backend. Drawing on prior projects such as fauxpilot, we know that it's possible to switch the backend by adding a few overrides to the settings.json file:
"github.copilot.advanced": {
// fauxpilot was using `codegen`
"debug.overrideEngine": "codegen",
// OpenAI API compatible server url
"debug.testOverrideProxyUrl": "http://localhost:5000",
"debug.overrideProxyUrl": "http://localhost:5000"
}
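The same three keys recur throughout this article with different URLs and engine names. As a convenience, here is a minimal Python sketch that renders the fragment for a given server. The helper name `copilot_override` is hypothetical; only the three `debug.*` keys come from the snippet above.

```python
import json


def copilot_override(base_url: str, engine: str) -> dict:
    """Build the `github.copilot.advanced` fragment for settings.json.

    `copilot_override` is a hypothetical helper; the three debug.* keys
    are the ones shown in the settings snippet above.
    """
    return {
        "github.copilot.advanced": {
            "debug.overrideEngine": engine,
            "debug.testOverrideProxyUrl": base_url,
            "debug.overrideProxyUrl": base_url,
        }
    }


# Reproduce the fauxpilot-style example from above.
fragment = copilot_override("http://localhost:5000", "codegen")
print(json.dumps(fragment, indent=2))
```

Paste the printed object into settings.json, adjusting the URL and engine to match whichever server you end up running.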
Choosing an OpenAI API-Compatible Server
To make use of CodeLlama, an OpenAI API-compatible server is all that's required. As of 2023, there are numerous options available, and here are a few noteworthy ones:
- llama-cpp-python: This Python-based option supports llama models exclusively.
- vllm: Known for high performance, though it lacks support for GGML.
- flexflow: Touting faster performance compared to vllm.
- LocalAI: A feature-rich choice that even supports image generation.
- FastChat: Developed by LMSYS.
- OpenLLM: An actively developed project.
- ialacol: Noteworthy for its focus on Kubernetes.
- ...and many more
The choice among these options is entirely up to you. For the purpose of this article, I'll be focusing on ialacol, primarily because I am the main contributor and thus intimately familiar with all the implementation details.
Let's begin with GGML models. These models boast a low memory requirement and operate without the need for a GPU (which might not be as affordable anymore). If you possess robust CUDA (Nvidia) GPUs, I recommend directly proceeding to the GPTQ section of this article.
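To get a feel for why quantization lowers the memory requirement, a back-of-the-envelope estimate is parameters × bits per weight ÷ 8. The sketch below is only that estimate: the bit widths are nominal assumptions, real GGML files mix precisions, and runtime overhead such as the context cache is ignored.

```python
def approx_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal bit widths are assumptions for illustration, not measured file sizes.
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"7B parameters at {label}: ~{approx_weights_gb(7e9, bits):.1f} GB")
```

Treat the result as a lower bound when deciding whether a model fits in RAM.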
Setting up the OpenAI API-Compatible Server
Getting your OpenAI API-compatible server up and running is a straightforward process.
Clone the Repository and Install Dependencies
Use this one-liner to clone the repository and set up the necessary dependencies:
gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
Run the server and download the model.
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/CodeLlama-7B-GGML"
export DEFAULT_MODEL_FILE="codellama-7b.ggmlv3.Q2_K.bin"
export LOGGING_LEVEL="DEBUG" # optional, more on this later
uvicorn main:app --host 0.0.0.0 --port 9999
Configure VSCode Copilot extension, pointing to the server.
To integrate the server with the VSCode Copilot extension, edit settings.json:
"github.copilot.advanced": {
"debug.overrideEngine": "codellama-7b.ggmlv3.Q2_K.bin",
"debug.testOverrideProxyUrl": "http://localhost:9999",
"debug.overrideProxyUrl": "http://localhost:9999"
}
With these configurations in place, you're ready to roll. CodeLlama's code completion capabilities will now be at your fingertips.
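Before relying on the editor integration, it can help to check the server directly. The sketch below builds a request in the shape of OpenAI's `/v1/completions` endpoint using only the standard library. It assumes the server from the previous step is listening on port 9999 and serves that path; the helper names and the prompt are made up for illustration, and the Copilot extension itself may call a different path.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text completion request."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:9999",          # assumed: the server started above
    "codellama-7b.ggmlv3.Q2_K.bin",   # same name as debug.overrideEngine
    "def fibonacci(n):",              # arbitrary example prompt
)
# Uncomment once the server is running:
# print(complete(request))
```

If this call is slow or fails, the editor integration will be too, so it is a cheap way to separate server problems from extension problems.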
Tweaking for Optimal Performance
CodeLlama's completions are impressive, but they will not always meet your expectations: useful suggestions can feel hit-or-miss, and the setup may fall short of GitHub Copilot, especially in terms of inference speed.
Several factors contribute to this discrepancy:
- Our current model utilizes 7 billion parameters. To potentially enhance performance, consider experimenting with the 13B and 34B variants.
- GGML/GGUF models are tailored to minimize memory usage rather than prioritize speed. While they excel in asynchronous tasks, code completion mandates swift responses from the server.
- GitHub Copilot's extension generates a multitude of requests as you type, which can pose challenges, given that language models typically process one prompt at a time.
To address these considerations, exploring smaller models is a viable option. Smaller models often exhibit a faster inference speed. Here are some alternatives to consider:
- CodeGen offers a 2B quantized version.
- Replit-Code provides a 3B quantized version.
- StarCoder presents a quantized version as well as a quantized 1B version.
- TinyCoder stands as a very compact model with only 164 million parameters (specifically for python). There's even a quantized version.
- Stablecode-Completion by StabilityAI also offers a quantized version.
For a potential increase in throughput, a useful strategy is to queue requests in front of the inference server. This optimization boosts throughput (not speed) and can be achieved using tools like text-inference-batcher, or tib for short (Disclaimer: I authored this tool, and tib is still in its early alpha phase).
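To make the throughput-versus-speed distinction concrete, here is a toy model of queuing in front of inference servers. It is not how tib is implemented: `dispatch`, `fake_infer`, and the upstream URLs are hypothetical, and inference is simulated with a short sleep. Each upstream gets a fixed number of slots and a request waits until a slot frees up, so no single request gets faster, but several upstreams work in parallel.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def dispatch(prompts: List[str], upstreams: List[str],
                   infer: Callable[[str, str], Awaitable[str]],
                   max_per_upstream: int = 1) -> List[str]:
    """Run every prompt, never exceeding max_per_upstream per server."""
    # One queue entry per free slot; taking an entry reserves that upstream.
    free_slots: asyncio.Queue = asyncio.Queue()
    for url in upstreams:
        for _ in range(max_per_upstream):
            free_slots.put_nowait(url)

    async def handle(prompt: str) -> str:
        url = await free_slots.get()      # wait here while all slots are busy
        try:
            return await infer(url, prompt)
        finally:
            free_slots.put_nowait(url)    # hand the slot back

    # gather() returns results in the same order as the prompts.
    return list(await asyncio.gather(*(handle(p) for p in prompts)))


in_flight: Dict[str, int] = {}
peak: Dict[str, int] = {}


async def fake_infer(url: str, prompt: str) -> str:
    """Stand-in for a real server call; records concurrency per upstream."""
    in_flight[url] = in_flight.get(url, 0) + 1
    peak[url] = max(peak.get(url, 0), in_flight[url])
    await asyncio.sleep(0.01)             # pretend to run inference
    in_flight[url] -= 1
    return f"{url} completed {prompt!r}"


results = asyncio.run(dispatch(
    [f"prompt-{i}" for i in range(6)],
    ["http://localhost:9998", "http://localhost:9999"],
    fake_infer,
))
print(peak)  # peak concurrent requests observed per upstream
```

The recorded peak per upstream should never exceed `max_per_upstream`, which is the property that protects a server that can only process one prompt at a time.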
Leveraging the various trade-offs at our disposal, let's proceed with the plan: utilizing a high-quality 3B model with a small footprint. Additionally, let's set up two instances of servers to enhance performance further.
# in the `ialacol` folder you just cloned
export THREADS=2
# Use a small model https://stability.ai/blog/stablecode-llm-generative-ai-coding
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
# truncate the prompt to make inference faster...
# (it's a trade-off, you get lower quality results too)
export TRUNCATE_PROMPT_LENGTH=100
uvicorn main:app --host 0.0.0.0 --port 9998
# in another terminal session (activate the venv and repeat the exports above first)
uvicorn main:app --host 0.0.0.0 --port 9999
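To illustrate the trade-off behind `TRUNCATE_PROMPT_LENGTH`, the sketch below shortens a prompt by keeping only its tail, on the assumption that the code nearest the cursor matters most for completion. How ialacol truncates internally, and whether the length is counted in characters or tokens, is not covered here; `truncate_prompt` is a hypothetical helper that counts characters.

```python
def truncate_prompt(prompt: str, max_chars: int) -> str:
    """Keep at most the last `max_chars` characters of `prompt`.

    A non-positive `max_chars` is treated as "truncation disabled".
    """
    if max_chars <= 0 or len(prompt) <= max_chars:
        return prompt
    return prompt[-max_chars:]


# A made-up prompt: lots of earlier context, then the line being typed.
source = "import os\n\n" + "# earlier code\n" * 20 + "def main():\n    "
short = truncate_prompt(source, 100)
print(len(source), "->", len(short))
```

The shorter prompt is cheaper to process, but everything before the cut (imports, earlier definitions) is invisible to the model, which is where the loss in quality comes from.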
Load Balancing with a Queue to Increase Throughput
To enhance throughput, we can employ load balancing with a queuing mechanism. Here's how you can set it up using text-inference-batcher:
Setting Up tib for Load Balancing
- Clone the repository and set up the necessary environment:
# clone and setup
gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
- Start tib, directing to your servers.
export UPSTREAMS="http://localhost:9998,http://localhost:9999"
npm start
- Configure the Copilot extension to point to the load balancer.
"github.copilot.advanced": {
"debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
// pointing to `tib`
"debug.testOverrideProxyUrl": "http://localhost:8000",
"debug.overrideProxyUrl": "http://localhost:8000"
}
Despite the compromise in inference quality due to smaller models and prompt truncation, results improved. However, they still fall short of GitHub Copilot's code completion capabilities.
Let's now venture to push the limits in the opposite direction.
Leveraging Cloud Infrastructure for Enhanced Performance
If you possess powerful cloud infrastructure equipped with GPUs, the process becomes notably streamlined.
In this scenario, we will harness the capabilities of Kubernetes due to its exceptional automation features. Both ialacol and text-inference-batcher are inherently compatible with Kubernetes, which further simplifies the setup.
Let's delve into deploying the 34B CodeLlama GPTQ model onto Kubernetes clusters, leveraging CUDA acceleration, via the Helm package manager (values.yaml):
replicas: 1
deployment:
  image: ghcr.io/chenhunghan/ialacol-gptq:latest
  env:
    DEFAULT_MODEL_HG_REPO_ID: TheBloke/CodeLlama-34B-GPTQ
    TOP_K: 30
    TOP_P: 0.9
    MAX_TOKENS: 200
    THREADS: 1
resources:
  # Request a node with 1 Nvidia GPU
  limits:
    nvidia.com/gpu: 1
model:
  persistence:
    size: 30Gi
    accessModes:
      - ReadWriteOnce
    storageClassName: ~
service:
  type: ClusterIP
  port: 8000
  annotations: {}
# You probably need to use these to select a node with GPUs.
tolerations: []
affinity: {}
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
# worker one
helm upgrade --install codellama-worker-0 ialacol/ialacol -f values.yaml
# worker two
helm upgrade --install codellama-worker-1 ialacol/ialacol -f values.yaml
# and maybe more? Depends on your budget :)
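With more than a couple of workers, typing these commands by hand gets tedious. The following sketch generates the `helm upgrade --install` lines for N workers together with a matching comma-separated list of upstream URLs. It assumes, as in the commands above, that releases are named `codellama-worker-<i>`, and additionally that each worker is reachable in-cluster at `http://<release>:8000`; the function names are hypothetical.

```python
from typing import List


def worker_names(count: int, prefix: str = "codellama-worker") -> List[str]:
    """Release names codellama-worker-0 .. codellama-worker-(count-1)."""
    return [f"{prefix}-{i}" for i in range(count)]


def helm_commands(count: int, chart: str = "ialacol/ialacol",
                  values_file: str = "values.yaml") -> List[str]:
    """One `helm upgrade --install` command per worker release."""
    return [
        f"helm upgrade --install {name} {chart} -f {values_file}"
        for name in worker_names(count)
    ]


def upstreams(count: int, port: int = 8000) -> str:
    """Comma-separated worker URLs (assumes service name == release name)."""
    return ",".join(f"http://{name}:{port}" for name in worker_names(count))


for line in helm_commands(3):
    print(line)
print("UPSTREAMS=" + upstreams(3))
```

Keeping the worker count in one place means the list of upstreams handed to the load balancer cannot drift from the set of releases you actually installed.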
Again, set up load balancing using tib with this values.yaml:
replicas: 1
deployment:
  image: ghcr.io/ialacol/text-inference-batcher-nodejs:latest
  env:
    # pointing to our workers
    UPSTREAMS: "http://codellama-worker-0:8000,http://codellama-worker-1:8000"
    # increase this if your worker can handle more than one inference at a time.
    MAX_CONNECT_PER_UPSTREAM: 1
resources:
  requests:
    cpu: 500m
    memory: 128Mi
service:
  type: ClusterIP
  port: 8000
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"
nodeSelector: {}
tolerations: []
affinity: {}
helm upgrade --install tib text-inference-batcher/text-inference-batcher-nodejs -f values.yaml
Expose the tib service using your cloud's load balancer or, for testing purposes, use kubectl port-forward.
Conclusion
With CodeLlama operating at 34B, benefiting from CUDA acceleration, and employing at least one worker, the code completion experience becomes not only swift but also of commendable quality. I would confidently state that this setup is on par with the performance of GitHub Copilot.
Nonetheless, it's crucial to acknowledge that this particular configuration does come at a notably higher cost when compared to GitHub Copilot. Striking a balance between budget considerations and privacy concerns is imperative. This investment is especially justifiable when handling proprietary or enterprise-level software projects. Conversely, the pricing structure of Copilot holds its own appeal.
In essence, we're fortunate to have a range of options at our disposal. Your thoughts and feedback are valuable, so feel free to share your insights in the comments section.
Let's keep the conversation going! 🚀
[1] Going through the Copilot source code is highly recommended: you will learn about prompt engineering and the client-side caching that happens at several levels before a request hits the server.