How do you use your GitHub stars?
I'd guess if you've been programming for a few years you've probably hit the star button at the top of a few of your favourite repos. I know some people I follow have done it thousands of times. Do you go back to them though? Do you review them for inspiration for your next project or go to them when you're stuck on a particular problem?
Inspiration
I've always assumed I would use them but I never have. I found myself doing some research recently into how to build software that uses LLMs, with the deliberate goal of building an as yet undefined side-project. I wanted to build something I hadn't built before, something that was hopefully a little original, and maybe even useful! So yet again I was starring repos like LangChain and Chroma, swearing this time would be different.
As I was running through blog posts and diligently smashing the star buttons I realised that I had just hit on exactly what I wanted to try. I wanted to bring my GitHub stars right into my editor. I wanted to be able to have them next to me as I was working and get a sensible set of suggestions on what might be useful for my needs at that moment, and I had just been starring the exact repos that could make this happen!
The original idea
Use a dataset of your personal stars to inform retrieval augmented generation for a question and answer large language model deployed in a command line interface
I thought this would be useful for a few reasons:
- By having it in the CLI it's available right in my editor, and to every project.
- By having a set of your personal stars the suggestions are already curated by your interests and preferences. Mine are all Python and R libraries, weird databases and charting libraries. Yours might mostly be Ruby gems, or web frameworks, or tools for embedded systems.
- By using a large language model the tool might be more capable of understanding the intention behind your goals. For instance, the query "Suggest how to build a web app" might let it infer that you'd likely want a front end component, a backend component and a data storage component, and it might even deal with servers and deployment.
- By using large language models the tool might be more capable of semantic search rather than keyword matching, which suits this problem as there is no strong standard for how a library describes itself through its topics, description and documentation.
Semantic vs keyword
Keyword
A keyword search looks for the exact letters in a string, or potentially a partial match. As an example the query "Data Science" would find things that exactly matched the characters in the string "Data Science" and maybe also ["Data", "Science", "DS"].
Semantic
A semantic search looks for the conceptual similarity between things, so in this context "Data Science" would find things that matched the vector embedding of "Data Science" as well as maybe also the vector embeddings that are associated with ["Machine Learning", "Artificial Intelligence"].
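To make the difference concrete, here's a minimal sketch of both styles of search. The three-number "embeddings" are made up by hand purely for illustration (a real embedding model produces far longer vectors learned from data), and `keyword_search` and `semantic_search` are hypothetical helper names, not anything from starpilot.

```python
import math

# Hand-made 3-dimensional "embeddings", invented for this example only.
TOY_EMBEDDINGS = {
    "Data Science": [0.9, 0.8, 0.1],
    "Machine Learning": [0.8, 0.9, 0.2],
    "Artificial Intelligence": [0.7, 0.9, 0.3],
    "Web Framework": [0.1, 0.2, 0.9],
}


def keyword_search(query, labels):
    """Return labels that contain the query text, ignoring case."""
    return [label for label in labels if query.lower() in label.lower()]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 means similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query, embeddings, threshold=0.9):
    """Return labels whose vector is close to the query's vector, best first."""
    query_vector = embeddings[query]
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in embeddings.items()
    ]
    return [label for score, label in sorted(scored, reverse=True) if score >= threshold]


keyword_hits = keyword_search("Data Science", list(TOY_EMBEDDINGS))
semantic_hits = semantic_search("Data Science", TOY_EMBEDDINGS)
print(keyword_hits)   # only the literal match
print(semantic_hits)  # the literal match plus its conceptual neighbours
```

The keyword search only finds the label that literally contains the query, while the similarity search also surfaces the two neighbouring concepts and leaves "Web Framework" out.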
And so I ran poetry new starpilot:
Starpilot is like copilot, but for GitHub stars.
I've been starring repos for years thinking "This will definitely be useful later".
However I never really went back to them.
Starpilot is a retrieval augmented generation CLI tool for rediscovering your GitHub stars.
Starpilot helps with this problem by allowing you to rediscover GitHub repos you had previously starred that are relevant to your current project.
Here are some more details about the motivation for and state of the project.
Installation
This project is in early development and is not yet available on PyPI.
- Fork repo
- Clone repo
cd starpilot
poetry install
You will need to have a .env file in the root of the project with
- a GitHub personal access token. This should have the user > read:user scope permission.
- an OpenAI API key.
GITHUB_API_KEY="ghp_..."
OPENAI_API_KEY="sk-..."
…
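As a rough sketch of how a tool could check that file before doing any work, here is a small standard-library parser. The key names come from the README excerpt above; `read_env_file` and `missing_keys` are hypothetical helpers, the token value is a placeholder, and I'm not assuming this is how starpilot itself loads its settings.

```python
import tempfile
from pathlib import Path

# Key names taken from the README excerpt; everything else is illustrative.
REQUIRED_KEYS = ("GITHUB_API_KEY", "OPENAI_API_KEY")


def read_env_file(path):
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """List the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Demo against a throwaway file that only defines one of the two keys.
with tempfile.TemporaryDirectory() as folder:
    env_path = Path(folder) / ".env"
    env_path.write_text('GITHUB_API_KEY="ghp_example"\n')
    loaded = read_env_file(env_path)

still_missing = missing_keys(loaded)
print(still_missing)
```

Failing early with a list of missing keys is friendlier than letting the first API call fail with an authentication error.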
Why retrieval augmented generation matters
Retrieval Augmented Generation (RAG) is a technique used by large language models to cope with some of the limitations inherent in what are also sometimes referred to as 'Foundational' models.
When a model like GPT-3 is trained, it is fed large amounts of textual data written by humans. These get translated into 'weights' in a neural net. To overly simplify, these weights tell the model what the next most likely text is that follows the text it has already been shown.
However, these models don't know much about what has happened recently, what other programming resources really exist rather than what just sounds like it should exist, or where to exactly get a specific repo or webpage.
Retrieval augmented generation solves this by allowing you to feed the large language model with known real, up to date and relevant information.
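The "retrieve, then generate" flow can be sketched in a few lines. This is a simplified illustration, not starpilot's code: the word-overlap `retrieve` function stands in for a real vectorstore lookup, `build_prompt` is a hypothetical helper, and the final step of sending the prompt to a language model is left out. The repo descriptions are the ones quoted elsewhere in this post.

```python
import re


def words(text):
    """Lower-cased set of the words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Stand-in retriever: rank documents by how many question words they share."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & words(doc["text"])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    """Put the retrieved, known-real documents in front of the question."""
    context = "\n".join(f"- {doc['name']}: {doc['text']}" for doc in retrieved)
    return (
        "Answer using only the repositories listed below.\n"
        f"Repositories:\n{context}\n"
        f"Question: {question}\n"
    )


documents = [
    {"name": "typer", "text": "Build great CLIs. Easy to code. Based on Python type hints."},
    {"name": "chroma", "text": "the AI-native open-source embedding database"},
    {"name": "faiss", "text": "A library for efficient similarity search and clustering of dense vectors."},
]

question = "Which database stores embedding vectors for similarity search?"
retrieved = retrieve(question, documents)
prompt = build_prompt(question, retrieved)
print(prompt)
```

Because the model is only shown repositories that really exist in the dataset, it has far less room to suggest something that merely sounds plausible.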
Vectorstores
A type of database called a vectorstore is commonly used for this because they are deliberately optimised towards a similarity search use case. They achieve this in a few ways:
- Vectorstores store what you pass them as a 'vector embedding'. A vector embedding takes data (like text or images) and converts it to a list-like representation of numbers.
- Vectorstores keep similar vector embeddings close together in memory. This means that they are as fast as possible at returning lots of documents that have similar semantic meaning, because they are all clustered together.
- Vectorstores have APIs that are specifically designed for these use cases, with querying methods that lean towards semantic searches more than SQL queries, and loading techniques that integrate tightly into other systems that generate these vector embeddings from large language models.
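Here's a toy version of that idea, to show the shape of the API rather than a real implementation. `TinyVectorStore` is a made-up class, the word-count `embed` function is a crude stand-in for an embedding model, and the query does a plain linear scan instead of the clustered indexing a real vectorstore relies on for speed.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real model returns dense learned vectors."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Stores documents with their vectors and answers similarity queries."""

    def __init__(self):
        self._rows = []  # each row is (id, vector, original document)

    def add(self, ids, documents):
        # Documents are embedded once, at load time.
        for doc_id, document in zip(ids, documents):
            self._rows.append((doc_id, embed(document), document))

    def query(self, text, n_results=1):
        # Embed the query the same way, then rank every stored vector against it.
        query_vector = embed(text)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[1]), reverse=True)
        return [(doc_id, document) for doc_id, _, document in ranked[:n_results]]


store = TinyVectorStore()
store.add(
    ids=["chroma", "typer"],
    documents=["open-source embedding database", "library for building CLI applications"],
)
best = store.query("embedding database")
print(best)
```

The two operations, embedding on the way in and ranking by similarity on the way out, are the parts a real vectorstore optimises heavily.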
Designing a system
With this set of goals and new knowledge I set about working out which puzzle pieces I needed and how to fit them together. This time I did go through my stars (and a few other things), though maybe this is for the last time!
I figured I could get started using 4 main open source repos. My first commit to my pyproject.toml used these projects:
Typer
Typer, build great CLIs. Easy to code. Based on Python type hints.
Documentation: https://typer.tiangolo.com
Source Code: https://github.com/tiangolo/typer
Typer is a library for building CLI applications that users will love using and developers will love creating. Based on Python 3.6+ type hints.
The key features are:
- Intuitive to write: Great editor support. Completion everywhere. Less time debugging. Designed to be easy to use and learn. Less time reading docs.
- Easy to use: It's easy to use for the final users. Automatic help, and automatic completion for all shells.
- Short: Minimize code duplication. Multiple features from each parameter declaration. Fewer bugs.
- Start simple: The simplest example adds only 2 lines of code to your app: 1 import, 1 function call.
- Grow large: Grow in complexity as much as you want, create arbitrarily complex trees of commands and groups of subcommands, with options and…
typer is a pretty trendy framework for building CLI tools in Python right now. It embraces typing, uses function decorators to magically turn your functions into CLI commands, and has relatively clear documentation.
I chose typer specifically because:
- I wanted to see what the hype was about
- I think typing helps write better code
- I found the documentation really helpful to get started easily
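For a flavour of what "decorators turn your functions into CLI commands" looks like, here is a small sketch. The command names echo starpilot's, but the bodies and the `FAKE_STARS` data are invented for the example, so treat it as an illustration of Typer rather than the project's actual code. It assumes `typer` is installed, and uses Typer's test runner so the example can show a command being invoked without a terminal.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Hypothetical starpilot-style CLI sketch.")

# Stand-in data; the real tool searches a vectorstore instead.
FAKE_STARS = {"chroma": "embedding database", "typer": "CLI framework"}


@app.command()
def shoot(topic: str, limit: int = 5):
    """Print starred repos whose description mentions TOPIC."""
    matches = [name for name, desc in FAKE_STARS.items() if topic.lower() in desc.lower()]
    for name in matches[:limit]:
        typer.echo(name)


@app.command()
def read(username: str):
    """Pretend to load the stars for USERNAME."""
    typer.echo(f"Would read stars for {username}")


# Invoke the CLI in-process, as if the user had typed: shoot database
runner = CliRunner()
result = runner.invoke(app, ["shoot", "database"])
print(result.output)
```

The type hints do double duty: `topic: str` becomes a required argument and `limit: int = 5` becomes an optional `--limit` flag. In a real entry point you would finish with a call to `app()` instead of the test runner.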
langchain-ai / langchain
⚡ Building applications with LLMs through composability ⚡
🦜️🔗 LangChain
Looking for the JS/TS library? Check out LangChain.js.
To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to get off the waitlist or speak with our sales team.
Quick Install
With pip:
pip install langchain
With conda:
conda install langchain -c conda-forge
🤔 What is LangChain?
LangChain is a framework for developing applications powered by language models. It enables applications that:
- Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
- Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
This framework consists of several parts.
- LangChain Libraries: The Python and…
langchain is the most mature and well embraced large language model orchestration framework. Langchain itself doesn't supply you with any specific LLM or vector store or embedding approach. Instead it is deliberately 'vendor agnostic'. It provides a common set of APIs and abstractions across a staggering number of vector databases, large language models and embedding engines.
I chose langchain because:
- It is the most established tool in a brand new space
- I wasn't really sure which suppliers of vectorstores and large language models made the most sense for my use case
- I found the documentation really helpful to get started
Chroma
chroma-core / chroma
the AI-native open-source embedding database
Chroma - the open-source embedding database.
The fastest way to build Python or JavaScript LLM apps with memory
pip install chromadb # python client
# for javascript, npm install chromadb!
# for client-server mode, chroma run --path /chroma_db_path
The core API is only 4 functions (run our 💡 Google Colab or Replit template):
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas=[{"source":
…
chroma is a vectorstore that has great support from Langchain. There are many others as well but Chroma won out at this stage because:
- I can run chroma as an 'embedded' data store, i.e. it runs locally on the user's machine
- chroma was the most often used vectorstore in the Langchain docs for RAG tasks
- It was trivially easy to set up, to the point at which I was convinced while reading the tutorials that they had to have made a mistake
GPT4All
GPT4All
Open-source large language models that run locally on your CPU and nearly any GPU
🦜️🔗 Official Langchain Backend
GPT4All is made possible by our compute partner Paperspace
Run on an M1 macOS Device (not sped up!)
GPT4All: An ecosystem of open-source on-edge large language models.
Important
GPT4All v2.5.0 and newer only supports models in GGUF format (.gguf). Models used with a previous version of GPT4All (.bin extension) will no longer work.
GPT4All is an ecosystem to run powerful and customized large language models that work locally on consumer grade CPUs and any GPU. Note that your CPU needs to support AVX or AVX2 instructions.
Learn more in the documentation.
A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to…
gpt4all provides a set of LLM models and embedding engines that are also well supported by Langchain. gpt4all was appealing because:
- It runs locally on 'normal' machines
- It seems well supported and maintained
- It is open about what data it was trained on and what data it will use to train on
PyGithub
PyGitHub
PyGitHub is a Python library to access the GitHub REST API. This library enables you to manage GitHub resources such as repositories, user profiles, and organizations in your Python applications.
Install
pip install PyGithub
Simple Demo
from github import Github
# Authentication is defined via github.Auth
from github import Auth
# using an access token
auth = Auth.Token("access_token")
# First create a Github instance:
# Public Web Github
g = Github(auth=auth)
# Github Enterprise with custom hostname
g = Github(base_url="https://{hostname}/api/v3", auth=auth)
# Then play with your Github objects:
for repo in g.get_user().get_repos():
    print(repo.name)
# To close connections after use
g.close()
Documentation
More information can be found on the PyGitHub documentation site.
Development
Contributing
Long-term discussion and bug…
Soon after this I realised that pygithub would be an easy way to go to GitHub to get the information I needed and bring it back into starpilot to load into the vectorstore. I had initially thought I might be able to use the GitHub Document Loader built into langchain, though once I sat down to really work it out I realised that this doesn't give access to a user's stars, so I needed an alternative.
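A rough sketch of that "go to GitHub and bring it back" step might look like the following. `repo_to_document` and `fetch_starred_documents` are hypothetical names, the document layout is an assumption rather than starpilot's real schema, and the topics in the offline demo are invented. The PyGithub import is done lazily inside the function so the demo at the bottom runs without a token or network access.

```python
from types import SimpleNamespace


def repo_to_document(repo, include_readme=False):
    """Flatten a repo object into the text and metadata a vectorstore needs."""
    parts = [repo.description or "", " ".join(repo.get_topics())]
    if include_readme:
        # PyGithub returns the README as bytes that need decoding.
        parts.append(repo.get_readme().decoded_content.decode("utf-8"))
    return {
        "id": repo.full_name,
        "text": "\n".join(part for part in parts if part),
        "metadata": {"url": repo.html_url},
    }


def fetch_starred_documents(username, token):
    """Yield one document per repo starred by `username` (needs network access)."""
    from github import Auth, Github  # imported lazily so the demo below runs offline

    client = Github(auth=Auth.Token(token))
    try:
        for repo in client.get_user(username).get_starred():
            yield repo_to_document(repo)
    finally:
        client.close()


# Offline demo with a fake repo object; the topics here are made up.
fake_repo = SimpleNamespace(
    full_name="chroma-core/chroma",
    description="the AI-native open-source embedding database",
    html_url="https://github.com/chroma-core/chroma",
    get_topics=lambda: ["embeddings", "vector-database"],
)
document = repo_to_document(fake_repo)
print(document["text"])
```

Keeping the flattening step separate from the network call makes it easy to try out different combinations of description, topics and README without hitting the GitHub API each time.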
The other way to build
There were alternatives in all these choices. I think these are all totally viable parts to build effectively the same system:
Click
$ click_
Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary. It's the "Command Line Interface Creation Kit". It's highly configurable but comes with sensible defaults out of the box.
It aims to make the process of writing command line tools quick and fun while also preventing any frustration caused by the inability to implement an intended CLI API.
Click in three points:
- Arbitrary nesting of commands
- Automatic help page generation
- Supports lazy loading of subcommands at runtime
Installing
Install and update using pip:
$ pip install -U click
A Simple Example
import click
@click.command()
@click.option("--count", default=1, help="Number of greetings.")
@click.option("--name", prompt="Your name", help="The person to greet.")
def hello
…
I actually am using click, sort of. typer is built on top of click, but to be honest I didn't really know that before I'd mostly decided. click looks like a really great project, but it wasn't as clear how to get started.
run-llama / llama_index
LlamaIndex (formerly GPT Index) is a data framework for your LLM applications
🗂️ LlamaIndex 🦙
LlamaIndex (GPT Index) is a data framework for your LLM application.
PyPI:
- LlamaIndex: https://pypi.org/project/llama-index/.
- GPT Index (duplicate): https://pypi.org/project/gpt-index/.
LlamaIndex.TS (Typescript/Javascript): https://github.com/run-llama/LlamaIndexTS.
Documentation: https://docs.llamaindex.ai/en/stable/.
Twitter: https://twitter.com/llama_index.
Discord: https://discord.gg/dGcwcsnxhU.
Ecosystem
- LlamaHub (community library of data loaders): https://llamahub.ai
- LlamaLab (cutting-edge AGI projects using LlamaIndex): https://github.com/run-llama/llama-lab
🚀 Overview
NOTE: This README is not updated as frequently as the documentation. Please check out the documentation above for the latest updates!
Context
- LLMs are a phenomenal piece of technology for knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data.
- How do we best augment LLMs with our own private data?
We need a comprehensive toolkit to help perform this data augmentation for LLMs.
Proposed Solution
That's where LlamaIndex comes in. LlamaIndex is a "data framework" to help you build LLM apps. It provides the following tools:
- Offers data connectors to ingest…
llama_index is probably a great project, but I only found it late in my thinking on this project. If I start a different project that it suits any time soon, I'm definitely going to try it out as a comparison.
facebookresearch / faiss
A library for efficient similarity search and clustering of dense vectors.
Faiss
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed primarily at Meta's Fundamental AI Research group.
News
See CHANGELOG.md for detailed information about latest features.
Introduction
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also…
I'd used faiss in a tutorial on vectorstores before. It didn't strike me as hugely intuitive to use or as simple to set up (its recommended installation path is via conda). I also don't particularly like Facebook so I'm happy to use an alternative.
openai / openai-python
The official Python library for the OpenAI API
OpenAI Python API library
The OpenAI Python library provides convenient access to the OpenAI REST API from any Python 3.7+ application. The library includes type definitions for all request params and response fields and offers both synchronous and asynchronous clients powered by httpx.
It is generated from our OpenAPI specification with Stainless.
Documentation
The API documentation can be found here.
Installation
Important
The SDK was rewritten in v1, which was released November 6th 2023. See the v1 migration guide, which includes scripts to automatically update your code.
pip install openai
Usage
The full API of this library can be found in api.md.
import os
from openai import OpenAI
client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
)
chat_completion = client.chat.completions.create(
    messages=[
…
I'd used openai for a handful of tutorials and notebook experiments already and been very happy with it. However for a project like this I wasn't really sure what the operational costs would be, and if they would be worth it for the benefit the tool provides. That combined with the requirement to have network connectivity while using the tool pushed me towards experimenting with alternatives. Luckily with langchain I should be able to provide it as an optional backend in the future?
What state is starpilot now?
"actively developed", "v0.1.0", "untested" and "it runs on my machine" are good descriptions of the project right now.
I've spent a few evenings this month on it, and see myself spending at least a few more on it next month. The API is getting breaking changes almost every time I open the project. It's got 0 real tests. It should get some soon though. It requires a few manual installation steps that are documented in README.md but haven't yet even been attempted on a machine other than the one I'm on right now.
It also doesn't yet achieve exactly what I want it to, but I see no reason yet that it can't with some more development time.
Current features
starpilot read MyCoolUserName
This will connect to GitHub and read the starred repos of the user MyCoolUserName. Then it will go to each of those repos and get the topics and descriptions (and optionally the readmes) and load these into chroma, which is persisted on the local hard drive.
starpilot shoot "insert topic here"
This will spin up the chroma database and perform a semantic similarity search on the string given in the command, then return the documents that seem to be the most relevant.
starpilot fortuneteller "Insert a question here"
This will perform the exact same search as the shoot command, but then spin up a large language model and pass the results into the large language model for processing. It then returns the documents it found as well as the response from the LLM.
So....
That's where this project is at. I've learnt a tonne about the available tools and relevant techniques in this space already, which was really the main goal of starting to begin with!
That said, the progress I've made so far only makes me more curious about what else can be done with this and what else can be solved towards the vision of "Making your GitHub stars more valuable in your daily coding". Here are some ideas that I've found exciting while getting my hands dirty that might show up in the future. These are along with the obvious things like any testing at all, a simpler way to set up the project on your machine, better error handling, a more sensible way to update the vectorstore than drop everything and rebuild each time, etc.
- Inspecting the current project's description (both its loose goals as well as more specific things like what packages it already uses) so that things that are already used aren't suggested and are instead used to inform the response.
- Dynamically creating a GitHub list of similar starred repos for your user (though that would probably rely on this suggestion to extend the GitHub API) so that you naturally have some ways of saving and sharing your starred repos that solve a specific problem between sessions in your terminal.
- Building starpilot into a research agent that can perform actions such as installing the selected suggestion into the current project or be sent to GitHub to find new projects that solve the current goals that you haven't starred yet.
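On the "more sensible way to update the vectorstore" point, one possible approach (purely a sketch, not something starpilot does today) is to compare the ids already stored against the ids currently starred and only touch the difference. `plan_sync` is a hypothetical helper and the repo names are just examples from this post.

```python
def plan_sync(starred_ids, stored_ids):
    """Work out which repos to add to or remove from the store."""
    starred, stored = set(starred_ids), set(stored_ids)
    return {
        "add": sorted(starred - stored),     # newly starred since the last run
        "remove": sorted(stored - starred),  # unstarred since the last run
    }


plan = plan_sync(
    starred_ids=["chroma-core/chroma", "tiangolo/typer", "run-llama/llama_index"],
    stored_ids=["chroma-core/chroma", "facebookresearch/faiss"],
)
print(plan)
```

Repos that appear in both sets are left alone, so only the new ones would need embedding again. This simple version would not notice a repo whose description or topics changed.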
What do you think?
Does this sound like something interesting to you, maybe even something useful? Did this just spark inspiration in you for a new project? Does this actually already exist somewhere and I'm just being an idiot? Let me know :)
Top comments (9)
Neat
Cheers @ben
Do you find yourself using stars deliberately, or do they waste away gathering virtual dust?
More virtual dust vs anything, but I am into the idea of that, so some AI applied to the problem is definitely interesting.
Mostly a write-only medium, the organization of Stars into Lists made it a little better
This was actually originally going to be a way to automate the creation of GitHub lists, until I discovered there wasn't a user facing API for those objects yet :(
:-D An easy way to organize and categorize my stars, that was my initial thought when pondering about this, before GitHub introduced lists.
Thanks, let's take advantage of those github stars ;-)
I was thinking about something along the same lines, but, to be honest, due to my lack of time and background in machine learning, this was probably never going to happen.
Happy that someone else seems to have seen the possibilities in this :)
The code is open source, so if you were interested in learning some ML / spending a small amount of time on it I'd love to work on this with others!
TBH thanks to LLMs et al. there's actually relatively little data science work to do. It's mostly plugging things together and software design.
In your vision @ldrscke how did this kind of tool work? Am I really close to your core concepts or have I come from a different angle?
Yes you are quite close, basically, read my stars, put the found repos in an index, answer questions using the indexed content, or recommend related repos