
Sunil Kumar Dash for Composio


15 hidden open-source gems to become 10x AI engineer🧙‍♂️ 🪄

AI is all the rage, and there is massive hype around it. Some say it will change the world as we know it (for the worse), and others say it’s a fad.

However, as Elon Musk says, “The most entertaining outcome is the most likely.”

AI will not kill us all, and it is not a fad. Instead, it will boost our productivity and help us build even more complex systems, perhaps enabling space travel, the discovery of new materials (remember LK-99?), and cures for diseases once thought incurable.


However, to make these things happen, the best we can do now is accelerate progress in our own ways, and AI engineers will play a crucial role in that.

AI engineering is an interdisciplinary field that includes:

  • Training and fine-tuning models.
  • Collecting, cleaning, and preprocessing data.
  • Building AI systems for complex task automation (AI agents, RAG, etc).
  • Deploying models safely to production.
  • Continuous monitoring and evaluation of AI systems.

I have been building and researching a lot in this space, so I prepared a list of open-source software that will make you a better AI engineer.


Click on the emojis below to visit the respective section. 👇

  1. Composio: Build AI automation 10x faster. 🚀
  2. Unsloth: Faster training and fine-tuning of AI models. 🦥💨
  3. DSPy: Framework for programming LLMs. 🛠️
  4. LLMWare: Framework for building enterprise RAG. 🏢
  5. Taipy: Build AI web apps faster with Python. 🐍💻
  6. LanceDB: Vector knowledge base for AI apps. 📚
  7. Phidata: Build LLM agents with memory. 🧠
  8. Phoenix: LLM observability made efficient. 🔥
  9. Airbyte: Reliable and Extensible data pipeline. 🌬️
  10. AgentOps: Agent monitoring and Observability. 👁️
  11. RAGAS: Framework for RAG evaluation. 📊
  12. BentoML: The easiest way to serve AI apps and models. 🍱
  13. LoRAX: Multi LoRA inference server that scales to 1000s of fine-tuned LLMs. 📡
  14. Gateway: Reliably Route to 200 LLMs with a single API. 🌐
  15. LitServe: Flexible, high-throughput serving engine for AI models. 💫

Feel free to explore and contribute to the repositories.


1. Composio 👑: Build AI automation 10x faster 🚀

Tools and integrations form the core of building AI agents.

I have been building AI tools and agents, but tool accuracy was always an issue until I came across Composio.

Composio makes it easier to integrate popular applications like GitHub, Slack, Jira, and Airtable with AI agents to build complex automation.

It handles user authentication and authorization for integrations on your users' behalf. So you can build your AI applications in peace. And it’s SOC2 certified.

So, here’s how you can get started with it.

Python

pip install composio-core

Add a GitHub integration.

composio add github

Composio handles user authentication and authorization on your behalf.

Here is how you can use the GitHub integration to star a repository.

from openai import OpenAI
from composio_openai import ComposioToolSet, Action

openai_client = OpenAI(api_key="******OPENAIKEY******")

# Initialise the Composio Tool Set
composio_toolset = ComposioToolSet(api_key="******COMPOSIO_API_KEY******")

# Get GitHub tools that are pre-configured
actions = composio_toolset.get_actions(actions=[Action.GITHUB_ACTIVITY_STAR_REPO_FOR_AUTHENTICATED_USER])

my_task = "Star a repo ComposioHQ/composio on GitHub"

# Create a chat completion request to decide on the action
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    tools=actions,  # Passing actions we fetched earlier.
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": my_task},
    ],
)

# Execute the tool call chosen by the model and print the result
result = composio_toolset.handle_tool_calls(response)
print(result)

Run this Python script to execute the given instruction using the agent.

Javascript

You can install it using npm, yarn, or pnpm.

npm install composio-core

Define a method to let the user connect their GitHub account.

import { OpenAI } from "openai";
import { OpenAIToolSet } from "composio-core";

const toolset = new OpenAIToolSet({
  apiKey: process.env.COMPOSIO_API_KEY,
});

async function setupUserConnectionIfNotExists(entityId) {
  const entity = await toolset.client.getEntity(entityId);
  const connection = await entity.getConnection('github');

  if (!connection) {
      // This entity/user hasn't connected the account yet, so start the login flow
      const newConnection = await entity.initiateConnection('github');
      console.log("Log in via: ", newConnection.redirectUrl);
      return newConnection.waitUntilActive(60);
  }

  return connection;
}

Add the required tools to the OpenAI SDK and pass the entity name on to the executeAgent function.

async function executeAgent(entityName) {
  const entity = await toolset.client.getEntity(entityName)
  await setupUserConnectionIfNotExists(entity.id);

  const tools = await toolset.get_actions({ actions: ["github_activity_star_repo_for_authenticated_user"] }, entity.id);
  const instruction = "Star a repo ComposioHQ/composio on GitHub"

  const client = new OpenAI({ apiKey: process.env.OPEN_AI_API_KEY })
  const response = await client.chat.completions.create({
      model: "gpt-4-turbo",
      messages: [{
          role: "user",
          content: instruction,
      }],
      tools: tools,
      tool_choice: "auto",
  })

  console.log(response.choices[0].message.tool_calls);
  await toolset.handle_tool_call(response, entity.id);
}

executeAgent("joey")

Execute the code and let the agent do the work for you.

Composio works with popular frameworks like LangChain, LlamaIndex, CrewAI, etc.

For more information, visit the official docs, and for even more complex examples, see the repository's example sections.


Star the Composio repository ⭐


2. Unsloth: Faster training and finetuning of AI models 🦥💨

Training and fine-tuning Large Language Models (LLMs) are crucial parts of AI engineering.

In many cases, proprietary models may not serve the purpose, whether because of cost, personalization, or privacy. At some point, you will need to fine-tune your model on a custom dataset. And right now, Unsloth is one of the best libraries for fine-tuning and training LLMs.

It supports full, LoRA, and QLoRA fine-tuning of popular LLMs, including Llama-3 and Mistral, and their derivatives like Yi, OpenHermes, etc. It implements custom Triton kernels and a manual back-prop engine to speed up model training.
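To get a feel for why LoRA-style fine-tuning is so much lighter than full fine-tuning, here is a rough back-of-the-envelope sketch. It assumes a single square 4096 x 4096 projection matrix (a hypothetical size, not taken from any particular model) and rank 16. LoRA freezes the original weight and trains two small matrices, A (rank x d_in) and B (d_out x rank), so the trainable parameter count grows with the rank rather than with the full matrix size. The function names are made up for illustration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size, chosen only for illustration.
d = 4096
full = full_params(d, d)
lora = lora_params(d, d, rank=16)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 16):   {lora:,} trainable parameters")
print(f"LoRA trains {100 * lora / full:.2f}% of the layer")
```

Under these assumptions the adapter trains well under 1% of the layer's parameters, which is a large part of why LoRA and QLoRA runs fit on much smaller GPUs.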

To start with Unsloth, install it using pip and make sure you have torch 2.4 and CUDA 12.1.

pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Here is a simple script to fine-tune a Mistral model on a dataset using SFT (supervised fine-tuning):

from unsloth import FastLanguageModel 
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit", # the Mistral checkpoint listed in fourbit_models
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

For more information, refer to the official documentation.


Star the Unsloth repository ⭐


3. DSPy: Framework for programming LLMs 🛠️

One factor hampering the use of LLMs in production use cases is their stochastic nature. Prompting them to output the desired response has a high failure rate for these use cases.

DSPy addresses this problem. Instead of relying on hand-written prompts, it lets you program LLMs to get more reliable behaviour.

DSPy simplifies this by doing two key things:

  1. Separating Program Flow from Parameters: This feature keeps your program's flow (the steps you take) separate from the details of how each step is done (the LM prompts and weights). This makes it easier to manage and update your system.
  2. Introducing New Optimizers: DSPy uses advanced algorithms that automatically fine-tune the LM prompts and weights based on your goals, such as improving accuracy or reducing errors.
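The two ideas above can be illustrated with a toy sketch in plain Python. This is not DSPy's API: `fake_lm`, `QAProgram`, and `optimize` are hypothetical names, and the "LLM" is a hard-coded stand-in. The point is only to show a fixed program flow whose prompt instruction is a separate parameter, and an optimizer that picks the best instruction against a metric.

```python
from dataclasses import dataclass


def fake_lm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: it is terse only when asked to be."""
    question = prompt.rsplit("Q:", 1)[-1].strip()
    answer = {"2+2?": "4", "Capital of France?": "Paris"}.get(question, "unknown")
    if "one word" in prompt:
        return answer
    return f"I think the answer might be {answer}."


@dataclass
class QAProgram:
    """The flow is fixed (build a prompt, call the LM); the instruction is the parameter."""
    instruction: str

    def __call__(self, question: str) -> str:
        return fake_lm(f"{self.instruction}\nQ: {question}")


def exact_match(prediction: str, gold: str) -> bool:
    """Metric: the prediction must equal the gold answer exactly."""
    return prediction.strip() == gold


def optimize(candidates, trainset, metric):
    """Toy optimizer: keep the instruction that scores best on the training set."""
    def score(instruction):
        program = QAProgram(instruction)
        return sum(metric(program(question), gold) for question, gold in trainset)

    return QAProgram(max(candidates, key=score))


trainset = [("2+2?", "4"), ("Capital of France?", "Paris")]
candidates = ["Answer the question.", "Answer in one word."]

best = optimize(candidates, trainset, exact_match)
print(best.instruction)
print(best("2+2?"))
```

Real optimizers search a much larger space (instructions, few-shot examples, even weights), but the shape is the same: you state the goal as a metric and let the search tune the parameters while the program flow stays untouched.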

Check out this Getting Started Notebook for more on how to work with DSPy.


Star the DSPy repository ⭐


4. LLMWare: Framework for building enterprise RAG 🏢

Privacy, security, and reliability are crucial when developing enterprise software. If you're looking for a framework to build enterprise AI applications, LLMWare is a strong choice.

It provides a unified framework for building LLM-based applications (e.g. RAG, Agents) using small, specialized models that can be deployed privately, integrated with enterprise knowledge sources safely and securely, and cost-effectively tuned and adapted for any business process.

LLMWare has two main components:

  1. RAG Pipeline - integrated components connecting knowledge sources to generative AI models.
  2. 50+ small, specialized models fine-tuned for critical tasks in enterprise process automation, including fact-based question-answering, classification, summarization, and extraction.

Set up LLMWare using pip:

pip3 install llmware

Here is how to create and use datasets.


""" This example demonstrates creating and using datasets
    1. Datasets suitable for fine-tuning embedding models
    2. Completion and other types of datasets
    3. Generating datasets from all data in a library or with filtered data
    4. Creating datasets from AWS Transcribe transcripts
"""

import json
import os
from llmware.dataset_tools import Datasets
from llmware.library import Library
from llmware.retrieval import Query
from llmware.setup import Setup
from llmware.configs import LLMWareConfig

def build_and_use_dataset(library_name):

    # Setup a library and build a knowledge graph.  Datasets will use the data in the knowledge graph
    print (f"\n > Creating library {library_name}...")
    library = Library().create_new_library(library_name)
    sample_files_path = Setup().load_sample_files()
    library.add_files(os.path.join(sample_files_path,"SmallLibrary"))
    library.generate_knowledge_graph()

    # Create a Datasets object from library
    datasets = Datasets(library)

    # Build a basic dataset useful for industry domain adaptation for fine-tuning embedding models
    print (f"\n > Building basic text dataset...")

    basic_embedding_dataset = datasets.build_text_ds(min_tokens=500, max_tokens=1000)
    dataset_location = os.path.join(library.dataset_path, basic_embedding_dataset["ds_id"])

    print (f"\n > Dataset:")
    print (f"(Files referenced below are found in {dataset_location})")

    print (f"\n{json.dumps(basic_embedding_dataset, indent=2)}")
    sample = datasets.get_dataset_sample(datasets.current_ds_name)

    print (f"\nRandom sample from the dataset:\n{json.dumps(sample, indent=2)}")

    # Other Dataset Generation and Usage Examples:

    # Build a simple self-supervised generative dataset- extracts text and splits into 'text' & 'completion'
    # Several generative "prompt_wrappers" are available, including chat_gpt, alpaca, and human_bot (all used below)
    basic_generative_completion_dataset = datasets.build_gen_ds_targeted_text_completion(prompt_wrapper="alpaca")

    # Build a generative self-supervised training set by pairing 'header_text' with 'text.'
    xsum_generative_completion_dataset = datasets.build_gen_ds_headline_text_xsum(prompt_wrapper="human_bot")
    topic_prompter_dataset = datasets.build_gen_ds_headline_topic_prompter(prompt_wrapper="chat_gpt")

    # Filter a library by a key term as part of building the dataset
    filtered_dataset = datasets.build_text_ds(query="agreement", filter_dict={"master_index":1})

    # Pass a set of query results to create a dataset from those results only
    query_results = Query(library=library).query("africa")
    query_filtered_dataset = datasets.build_text_ds(min_tokens=250,max_tokens=600, qr=query_results)

    return 0

if __name__ == "__main__":

    LLMWareConfig().set_active_db("sqlite")

    build_and_use_dataset("test_txt_datasets_0")

Explore the examples to learn how to use LLMWare. For more information, refer to the documentation.


Star the LLMWare repository ⭐


5. Taipy: Build AI web apps faster with Python. 🐍💻

Taipy is open-source, Python-based software designed for building AI web apps in production environments. It goes beyond tools like Streamlit and Gradio by letting Python developers take demo apps all the way to production.

Taipy is designed for data scientists and machine learning engineers to build data & AI web applications.

  1. Enables building production-ready web applications
  2. No need to learn new languages. Only Python is needed.
  3. Concentrate on Data and AI algorithms without development and deployment complexities.

Quickly get started with it using pip.

pip install taipy

This simple Taipy application demonstrates how to create a basic film recommendation system.

import taipy as tp
import pandas as pd
from taipy import Config, Scope, Gui

# Defining the helper functions

# Callback definition - submits scenario with genre selection
def on_genre_selected(state):
    scenario.selected_genre_node.write(state.selected_genre)
    tp.submit(scenario)
    state.df = scenario.filtered_data.read()

## Set initial value to Action
def on_init(state):
    on_genre_selected(state)

# Filtering function - task
def filter_genre(initial_dataset: pd.DataFrame, selected_genre):
    filtered_dataset = initial_dataset[initial_dataset["genres"].str.contains(selected_genre)]
    filtered_data = filtered_dataset.nlargest(7, "Popularity %")
    return filtered_data

# The main script
if __name__ == "__main__":
    # Taipy Scenario & Data Management

    # Load the configuration made with Taipy Studio
    Config.load("config.toml")
    scenario_cfg = Config.scenarios["scenario"]

    # Start Taipy Core service
    tp.Core().run()

    # Create a scenario
    scenario = tp.create_scenario(scenario_cfg)

    # Taipy User Interface
    # Let's add a GUI to our Scenario Management for a complete application

    # Get list of genres
    genres = [
        "Action", "Adventure", "Animation", "Children", "Comedy", "Fantasy", "IMAX",
        "Romance", "Sci-FI", "Western", "Crime", "Mystery", "Drama", "Horror", "Thriller", "Film-Noir", "War", "Musical", "Documentary"
    ]

    # Initialization of variables
    df = pd.DataFrame(columns=["Title", "Popularity %"])
    selected_genre = "Action"

    # User interface definition
    my_page = """
# Film recommendation

## Choose your favorite genre
<|{selected_genre}|selector|lov={genres}|on_change=on_genre_selected|dropdown|>

## Here are the top seven picks by popularity
<|{df}|chart|x=Title|y=Popularity %|type=bar|title=Film Popularity|>
    """

    Gui(page=my_page).run()

Check out the documentation for more.


Star Taipy repository ⭐


6. LanceDB: Vector knowledge base for AI apps. 📚

If you are building AI apps, you will need a vector database to store and retrieve unstructured data like text, images, and videos. Unlike traditional databases, vector databases store embeddings of this data.

Embeddings are high-dimensional numerical representations of data. Vector databases use methods like similarity scores to retrieve relevant data.
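As a minimal sketch of what "retrieval by similarity score" means, the snippet below ranks a few made-up three-dimensional vectors by cosine similarity to a query. Real embeddings have hundreds or thousands of dimensions and real vector databases use approximate indexes instead of a full scan; the vectors, item names, and the `top_k` helper are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, store, k=2):
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(cosine_similarity(query, vector), item) for item, vector in store.items()]
    return sorted(scored, reverse=True)[:k]


# Tiny hand-written "embeddings" (hypothetical values).
store = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, item in results:
    print(f"{item}: {score:.3f}")
```

The query points in nearly the same direction as "cat" and "dog", so those two come back first while "car" is left out.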

LanceDB is an open-source vector database written in Rust, with clients for Python and TypeScript/JavaScript. It offers production-scale vector search, multi-modal support, zero-copy access, automatic data versioning, GPU-powered querying, and more.

Get started with LanceDB.

npm install @lancedb/lancedb


Create and query a vector database.

import * as lancedb from "@lancedb/lancedb";

const db = await lancedb.connect("data/sample-lancedb");
const table = await db.createTable("vectors", [
    { id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
    { id: 2, vector: [1.1, 1.2], item: "bar", price: 50 },
], {mode: 'overwrite'});

const query = table.vectorSearch([0.1, 0.3]).limit(2);
const results = await query.toArray();

// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.query().where("price >= 10").toArray();


You can find more on LanceDB in their documentation.


Star LanceDB repository ⭐


7. Phidata: Build LLM agents with memory. 🧠

Often, building agents that work may not be as easy as it sounds. Managing memory, caching, and tool execution can become challenging.

Phidata is an open-source framework that offers a convenient and reliable way to build agents with long-term memory, contextual knowledge, and the ability to take action using function calling.

Get started with Phidata by installing via pip

pip install -U phidata

Let’s create a simple assistant that can query financial data.

from phi.assistant import Assistant
from phi.llm.openai import OpenAIChat
from phi.tools.yfinance import YFinanceTools

assistant = Assistant(
    llm=OpenAIChat(model="gpt-4o"),
    tools=[YFinanceTools(stock_price=True, analyst_recommendations=True, company_info=True, company_news=True)],
    show_tool_calls=True,
    markdown=True,
)
assistant.print_response("What is the stock price of NVDA")
assistant.print_response("Write a comparison between NVDA and AMD, use all tools available.")

An assistant that can surf the web.

from phi.assistant import Assistant
from phi.tools.duckduckgo import DuckDuckGo

assistant = Assistant(tools=[DuckDuckGo()], show_tool_calls=True)
assistant.print_response("What's happening in France?", markdown=True)

Refer to the official documentation for examples and information.


Star Phidata repository ⭐


8. Phoenix: LLM observability made efficient. 🔥

An AI application is not complete without an observability layer. An LLM application usually has many moving parts, such as prompts, model temperature, and top-p, and even a slight change to any of them can significantly affect outcomes.

This can make applications unstable and unreliable, which is where LLM observability comes into the picture. Arize AI’s Phoenix makes it convenient to track the entire trace of an LLM execution.

It is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. It provides:

  • Tracing - Trace your LLM application's runtime using OpenTelemetry-based instrumentation.
  • Evaluation - Leverage LLMs to benchmark your application's performance using response and retrieval evals.
  • Datasets - Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
  • Experiments - Track and evaluate prompts, LLMs, and retrieval changes.
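To make the tracing idea concrete, here is a toy sketch of what a trace records: named spans with a parent, some attributes, and a duration. It is not Phoenix's or OpenTelemetry's API; the `Tracer` class, span names, and attributes are hypothetical and only illustrate the data an observability tool collects for each step of an LLM pipeline.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Collects finished spans; nesting is tracked with a simple stack."""

    def __init__(self):
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "attributes": attributes,
        }
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Spans are stored when they finish, so children appear before parents.
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("rag_query", question="How can I deploy Arize?"):
    with tracer.span("retrieve", top_k=2):
        docs = ["doc-a", "doc-b"]  # pretend retrieval result
    with tracer.span("llm_call", model="hypothetical-model") as llm_span:
        llm_span["attributes"]["num_docs"] = len(docs)

for span in tracer.spans:
    print(span["name"], "parent:", span["parent"], span["attributes"])
```

A tool like Phoenix captures this kind of structure automatically through instrumentation, then lets you browse and evaluate the spans in a UI instead of printing them.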

Phoenix is vendor and language-agnostic, supporting frameworks like LlamaIndex, LangChain, DSPy, and LLM providers like OpenAI and Bedrock.

It can run in various environments, including Jupyter notebooks, local machines, containers, or the cloud.

It is easy to get started with Phoenix.

pip install arize-phoenix

To get started, launch the Phoenix app.

import phoenix as px
session = px.launch_app()

This will start the Phoenix server.

You can now set up tracing for your AI application and debug it as the traces stream in.

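To use the one-click LlamaIndex integration, install the small integration packages first (the example below also reads a sample index from Google Cloud Storage, hence gcsfs):

pip install 'llama-index>=0.10.44' openinference-instrumentation-llama-index gcsfs

Then instrument LlamaIndex and run a few queries:
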
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
import os
from gcsfs import GCSFileSystem
from llama_index.core import (
    Settings,
    VectorStoreIndex,
    StorageContext,
    set_global_handler,
    load_index_from_storage
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
import llama_index

# To view traces in Phoenix, you will first have to start a Phoenix server. You can do this by running the following:
session = px.launch_app()

# Initialize LlamaIndex auto-instrumentation
LlamaIndexInstrumentor().instrument()

os.environ["OPENAI_API_KEY"] = "<ENTER_YOUR_OPENAI_API_KEY_HERE>"

# LlamaIndex application initialization may vary
# depending on your application
Settings.llm = OpenAI(model="gpt-4-turbo-preview")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Load your data and create an index. Here we've provided an example of our documentation
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
)

index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()

# Query your LlamaIndex application
query_engine.query("What is the meaning of life?")
query_engine.query("How can I deploy Arize?")

# View the traces in the Phoenix UI
px.active_session().url

Once you've executed a sufficient number of queries (or chats) for your application, you can view the trace details by refreshing the Phoenix UI in your browser.

Refer to their documentation for more tracing, dataset versioning, and evaluation examples.


Star Phoenix repository ⭐


9. Airbyte: Reliable and Extensible data pipeline. 🌬️

Data is essential for building AI applications, especially in production, where you must manage large volumes of data from various sources. Airbyte excels at this.

Airbyte offers an extensive catalogue of over 300 connectors for APIs, databases, data warehouses, and data lakes.

Airbyte also features a Python library called PyAirbyte. It supports popular frameworks like LangChain and LlamaIndex, making it easy to move data from multiple sources to your GenAI applications.

Check out this notebook for details on using PyAirbyte with LangChain.

For more information, check out the documentation.


Star Airbyte repository ⭐


10. AgentOps: Agent monitoring and Observability. 👁️

Just like traditional software systems, AI agents require continuous monitoring and observation. This is important to ensure the agent’s behaviour does not deviate from expectations.

AgentOps offers a comprehensive solution for monitoring and observing AI agents.

It offers tools for replay analytics, LLM cost management, agent benchmarking, and compliance and security, and it integrates natively with frameworks like CrewAI, AutoGen, and LangChain.
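As a rough illustration of what LLM cost management boils down to, the sketch below tallies cost per call from token counts. It is not the AgentOps API: the `CostTracker` class, the model name, and the prices are all made up, and real prices vary by provider and model.

```python
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens.
PRICES = {"model-a": {"prompt": 0.01, "completion": 0.03}}


@dataclass
class CostTracker:
    """Keeps one record per LLM call and sums the cost on demand."""
    calls: list = field(default_factory=list)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        price = PRICES[model]
        cost = (
            prompt_tokens * price["prompt"] + completion_tokens * price["completion"]
        ) / 1000
        self.calls.append({"model": model, "cost": cost})
        return cost

    def total(self) -> float:
        return sum(call["cost"] for call in self.calls)


tracker = CostTracker()
tracker.record("model-a", prompt_tokens=1000, completion_tokens=500)
tracker.record("model-a", prompt_tokens=2000, completion_tokens=0)
print(f"{len(tracker.calls)} calls, total ${tracker.total():.3f}")
```

A monitoring tool does this bookkeeping for every call an agent makes, which is what makes runaway loops and expensive prompts visible early.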

Get started with AgentOps by installing it through pip.

pip install agentops

Initialize the AgentOps client and automatically get analytics on every LLM call.

import agentops

# Beginning of program's code (i.e. main.py, __init__.py)
agentops.init("<INSERT YOUR API KEY HERE>")

...

# (optional: record specific functions)
@agentops.record_action('sample function being recorded')
def sample_function(*args, **kwargs):
    ...

# End of program
agentops.end_session('Success')
# Woohoo You're done 🎉

Refer to their documentation for more.


Star AgentOps repository ⭐


11. RAGAS: Framework for RAG evaluation. 📊

Building RAG pipelines is challenging, but determining their effectiveness in real-world scenarios is another challenge altogether. Despite advancements in frameworks for RAG applications, ensuring their reliability for real users remains difficult, especially when the cost of incorrect retrievals is high.

RAGAS is a framework designed to solve this problem. It helps you evaluate your Retrieval Augmented Generation (RAG) pipelines.

It helps you generate synthetic test sets, test your RAG pipelines against them, and monitor your RAG app in production.
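To give a sense of what "testing a RAG pipeline against a test set" involves, here is a minimal sketch that scores only the retrieval step with plain set-based precision and recall. This is not the RAGAS API, and RAGAS's own metrics are more sophisticated (many of them use an LLM as a judge); the test cases, document ids, and function names below are invented for illustration.

```python
def retrieval_precision(retrieved, relevant):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def retrieval_recall(retrieved, relevant):
    """Share of relevant chunks that the retriever managed to find."""
    if not relevant:
        return 0.0
    return sum(1 for doc in relevant if doc in retrieved) / len(relevant)


# Hypothetical test set: what the pipeline retrieved vs. what it should have.
test_set = [
    {"question": "How do I reset my password?",
     "retrieved": ["d1", "d2", "d3"], "relevant": {"d1", "d3"}},
    {"question": "What is the refund policy?",
     "retrieved": ["d4", "d5"], "relevant": {"d5", "d6"}},
]


def evaluate(cases):
    """Average both metrics over all test cases."""
    n = len(cases)
    return {
        "precision": sum(retrieval_precision(c["retrieved"], c["relevant"]) for c in cases) / n,
        "recall": sum(retrieval_recall(c["retrieved"], c["relevant"]) for c in cases) / n,
    }


report = evaluate(test_set)
print({name: round(value, 3) for name, value in report.items()})
```

Tracking numbers like these across pipeline changes (chunk size, embedding model, top-k) is the core loop; an evaluation framework adds generated test sets and answer-quality metrics on top.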

Check out the documentation to understand how to use RAGAS to improve your new and existing RAG pipelines.


Star RAGAS repository ⭐


12. BentoML: The easiest way to serve AI apps and models. 🍱

BentoML is open-source software that provides a convenient way to serve models and AI apps in production. Whether you are serving traditional machine-learning models or language models, it can turn any model inference script into a REST API server with just a few lines of code and standard Python type hints.

It offers model-serving optimization features like dynamic batching, model parallelism, multi-stage pipeline and multi-model inference-graph orchestration.

BentoML lets you quickly implement your own APIs or task queues with custom business logic, model inference and multi-model composition.

Get started by installing the BentoML library.

# Requires Python≥3.8
pip install -U bentoml

Define APIs in a service.py file.

from __future__ import annotations

import bentoml

@bentoml.service(
    resources={"cpu": "4"}
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results] 

Run the service code locally (serving at http://localhost:3000 by default):

pip install torch transformers  # additional dependencies for local run

bentoml serve service.py:Summarization

Now you can run inference from your browser at http://localhost:3000 or with a Python script:

import bentoml

with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}") 

Explore the documentation for more.


Star BentoML repository ⭐


13. LoRAX: Multi LoRA inference server that scales to 1000s of finetuned LLMs. 📡

LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.

It dynamically loads LoRA adapters from HuggingFace, Predibase, or any filesystem just in time without blocking concurrent requests.

LoRAX utilizes advanced quantization and optimization techniques, such as Paged Attention, Flash Attention, tensor-parallelism, and token streaming, to deliver high throughput with low latency.

To get started, you need Linux, an Nvidia GPU of the Ampere generation or newer, and CUDA 11.8 compatible device drivers.

Launch LoRAX server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model

Prompt base LLM:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json' 

Prompt a LoRA adapter:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'

See Reference - REST API for complete details.


Star LoRAX repository ⭐


14. Gateway: Reliably Route to 200 LLMs with a single API 🌐

While building AI applications, we may depend on proprietary LLMs or LLMs served from some cloud hosting site. It would be best to prepare for outages because you never know.

In those cases, you should be able to route requests from one provider to another. Portkey's Gateway is built for exactly this.

It provides a unified API for 200+ LLM providers. It supports caching, load-balancing, routing, and retries and can be edge-deployed for minimum latency.

This is an essential piece in building fault-tolerant, robust AI systems. It is available in Python, Go, Rust, Java, Ruby, and Javascript.
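The retry-and-fallback behaviour described above can be sketched in a few lines of plain Python. This is not Gateway's API (Gateway handles this for you through its routing configuration); the provider functions, the `call_with_fallback` helper, and the simulated outage are all hypothetical and only show the control flow.

```python
import time


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its retries."""


def call_with_fallback(providers, prompt, retries=2, backoff_s=0.0):
    """Try each provider in order, retrying a few times before moving on."""
    errors = []
    for name, call in providers:
        for attempt in range(1, retries + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
                time.sleep(backoff_s * attempt)
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt):
    """Simulates a provider that is currently down."""
    raise TimeoutError("simulated outage")


def healthy_backup(prompt):
    """Simulates a provider that answers normally."""
    return f"echo: {prompt}"


provider, answer = call_with_fallback(
    [("primary", flaky_primary), ("backup", healthy_backup)],
    "What's a fractal?",
)
print(provider, "->", answer)
```

Doing this inside a gateway rather than in application code means every service gets the same retry, fallback, and caching policy without reimplementing it.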

Get started with Gateway by installing it.

pip install -qU portkey-ai openai

For OpenAI models,

import os

from openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL, createHeaders

# Read both keys from environment variables instead of hard-coding them
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=PORTKEY_GATEWAY_URL,
    default_headers=createHeaders(
        provider="openai",
        api_key=os.environ["PORTKEY_API_KEY"]
    )
)

chat_complete = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "What's a fractal?"}],
)

print(chat_complete.choices[0].message.content)

For Anthropic models,

import os

from openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL, createHeaders

# Read both keys from environment variables instead of hard-coding them
client = OpenAI(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url=PORTKEY_GATEWAY_URL,
    default_headers=createHeaders(
        provider="anthropic",
        api_key=os.environ["PORTKEY_API_KEY"]
    ),
)

response = client.chat.completions.create(
    model="claude-3-opus-20240229",
    messages=[{"role": "user",
               "content": "What's a fractal?"}],
    max_tokens=512
)

print(response.choices[0].message.content)

For more information, visit the official repository.


Star Gateway repository ⭐


15. LitServe: Flexible, high throughput serving engine for AI models. 💫

LitServe is another AI model-serving engine. It is highly optimized for parallel execution and has native features to scale AI workloads. According to the project's own benchmarks, LitServe (built on FastAPI) handles more simultaneous requests than plain FastAPI and TorchServe.

LitServe can be self-hosted on your own machines, which makes it a good fit for hackers, students, and developers who prefer a DIY approach.

Install LitServe via pip (other install options):

pip install litserve

Define a server

Here's a Hello World example (explore real examples):

# server.py
import litserve as ls

# STEP 1: DEFINE A MODEL API
class SimpleLitAPI(ls.LitAPI):
    # Called once at startup. Setup models, DB connections, etc...
    def setup(self, device):
        self.model = lambda x: x**2

    # Convert the request payload to model input.
    def decode_request(self, request):
        return request["input"]

    # Run inference on the model, and return the output.
    def predict(self, x):
        return self.model(x)

    # Convert the model output to a response payload.
    def encode_response(self, output):
        return {"output": output}

# STEP 2: START THE SERVER
if __name__ == "__main__":
    api = SimpleLitAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)

Now run the server via the command line

python server.py

The LitAPI class gives you complete control and hackability.

LitServer handles optimizations like batching, auto-GPU scaling, and more.

Query the server

Use the automatically generated LitServe client:

python client.py
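If you would rather write the client by hand, the sketch below shows roughly what such a client does. It assumes the Hello World server above is running locally on port 8000 and serves LitServe's default /predict route; the helper names `build_predict_request` and `predict` are made up for illustration.

```python
import json
import urllib.request

# Assumption: default route of the server defined in server.py above.
PREDICT_URL = "http://127.0.0.1:8000/predict"


def build_predict_request(value, url=PREDICT_URL):
    """Build (but do not send) a POST request carrying {"input": value}."""
    body = json.dumps({"input": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(value, url=PREDICT_URL):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_predict_request(value, url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request(4.0)
print(request.get_method(), request.full_url, request.data)
# predict(4.0) only works while the server from server.py is running.
```

Since the Hello World model squares its input, calling predict(4.0) against that server should return an output of 16.0.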


Star LitServe repository ⭐


Thank you for reading this article. Comment below if you have built or used any other open-source AI repository. 👇
