Nikola Balic for Daytona

Posted on Dec 3, 2023 • Originally published at daytona.io on Nov 30, 2023

Harnessing AI through Standardization and Isolation

#chroma #openai #api #sde

Rapid AI development promises enormous opportunities, yet fragmented tools and environments often hinder capturing their full potential. What if developers didn’t need to waste precious time battling configurations and compatibility issues? How can teams accelerate innovation given skills gaps and disjointed systems? Enter Standardized Development Environments (SDEs)—the key to transforming workspaces into unmatched AI sandboxes.

In this guide, we'll explore how SDEs, by encouraging standardization and streamlining setup, can foster innovation in AI application development. We’ll also demonstrate how integrating popular tools like Chroma vector database and OpenAI API within SDEs can further enhance productivity and creativity. The result is the ultimate AI playground, ready for developers to unleash their skills and shape the future.

TL;DR

Sitemap Fetching and Parsing: Automates retrieval and analysis of sitemap XML from websites to extract URLs.
Content Extraction and Storage: Downloads web page content and stores it in Chroma, a flexible database system.
AI-Powered Search: Leverages OpenAI's embedding functions for smart content retrieval within Chroma.
Response Generation: Creates response based on prompts and search results, using latest OpenAI's GPT-4 model.
GitHub Repo link: https://github.com/nkkko/ai-sandbox-demo

If privacy is your priority, taking advantage of the security and isolation offered by Standardized Development Environments (SDEs) is advisable. In such cases, you can use local models and sentence transformers. The workspace I've set up is equipped with 12 GB of memory and 4 virtual CPUs, which should be sufficient to run a 7-billion-parameter model at reading speed. However, my example will utilize OpenAI's services for simplicity and improved performance.

Before diving into the integration of AI with SDEs, it's crucial to grasp the need for AI sandbox and to understand foundational concepts of SDEs and Dev Containers.

The Importance of a Sandbox for AI Projects

In the rapidly evolving landscape of artificial intelligence, the ability to experiment safely and efficiently is paramount. A sandbox environment serves as an essential incubator for innovation, providing AI developers with a controlled and flexible space to build and test their projects. Here's why a sandbox is a crucial tool for any AI endeavor:

Risk-Free Experimentation : AI development often involves trial and error, and a sandbox offers a risk-free zone where you can explore new ideas without affecting production systems. This freedom to experiment encourages creativity and can lead to AI functionality and performance breakthroughs.

Realistic Testing Conditions : Sandboxes can simulate real-world conditions, allowing you to observe how your AI behaves under various scenarios. This realistic testing ensures your AI solutions are robust, scalable, and ready for deployment.

Rapid Prototyping : Speed is a competitive advantage in AI development. Sandboxes facilitate rapid prototyping, enabling you to iterate on concepts and refine your models quickly. This agility accelerates the development cycle and helps bring AI applications to market faster.

Learning and Development : Sandboxes are educational playgrounds for newcomers and seasoned professionals. They provide an opportunity to learn new technologies, frameworks, and languages in a practical, hands-on manner, essential for staying current in AI.

Resource Optimization : AI projects can be resource-intensive. Sandboxes allow you to allocate resources dynamically, optimizing usage and reducing costs. This efficient resource management is critical, especially when working with complex models and large datasets.

In summary, a sandbox is more than just a development tool—it's an essential part of the AI ecosystem that supports innovation, learning, and growth. Whether you're a solo developer or part of a large team, a sandbox is the foundation upon which you can build the future of AI.

The Problem of Fragmented Environments

Developing AI applications is challenging enough without managing half-broken tools and dependencies. Unfortunately, this is the reality for many developers struggling with:

Inconsistencies: Different machines, OS versions, and ad-hoc configurations lead to errors and delays. Collaborating with teams also becomes tricky.

Error: Module X version mismatch 
(expected 1.1.0, got 1.0.2)

Onboarding Issues: Getting others up to speed on a project's tools and setup is time-consuming and frustrating.

New Dev: How do I get this project running locally?
Senior Dev: Oh boy, just follow this 12 step guide...

Security & Compliance Risks: Following best practices around vulnerabilities and regulations becomes difficult across fragmented environments.

Lack of Portability: Projects that depend on specific machines or OS configurations aren't easily portable or sharable across teams.

Wasted Time: Hours spent debugging environment issues is time lost building innovative applications.

SDEs address these problems through standardization, saving developers hours of configuration headaches so they can focus on creating.

Introducing the Power of Standardized Dev Envs and modern DEM Platforms

SDEs are more than just tools - they represent a philosophy that champions consistency and efficiency. By providing standardized configurations, SDEs allow developers to dive right into coding without worrying about environment setups. This uniformity becomes especially important in regulated industries like finance, healthcare, etc. where strict governance policies need adherence alongside security.

SDEs provide predefined configurations encompassing everything from dependencies and versions to security policies and tooling rules. This standardization brings immense advantages over traditional virtualized environments by traveling with code across devices. Unlike ephemeral containers, SDEs persist tools, credentials, and settings critical for long-term governance. By codifying best practices instead of reinventing the wheel, SDEs give developers instant access to compliant systems where they can simply code.

SDEs may be manually created or defined through a declarative specification. In the case of Development Containers, the environment specifications are encapsulated in a devcontainer.json file. This enables consistency across different machines while retaining flexibility to update configurations.

The Role of Development Environment Management

DEM platforms like Daytona takes the SDE approach further through automating creation, management and optimization of these standardized environments. It balances developer productivity with compliance requirements by:

Managing configurations, access controls & workflows from a central system
Automating provisioning and deployment of tools/languages
Enforcing security standards through predefined policies

By using Daytona, teams can boost collaboration and innovation within a secure, consistent sandbox.

A Standardized Development Environment (SDE) provides consistent tools and configurations tailored to a project’s specific needs. The environment encompasses everything from software dependencies and versions to security policies and tooling rules.

Key Benefits of SDEs

Portability - The environment travels with code across devices.
Collaboration - Teams use the same dependencies and tools.
Onboarding - New developers spin up instantly with no setup.
Compliance - Standardization facilitates best practices.
Consistency - Eliminates half-broken tooling across machines.
Efficiency - Less downtime from environment issues.

By providing predefined workspaces, SDEs allow developers to go from zero to coding in minutes. For AI innovation to flourish, minimizing tooling distractions is essential.

Unleashing Innovation with RAG and Vector Databases

Retrieval Augmented Generation (RAG) unites the generative power of language models with the precision of information retrieval, enabling the creation of nuanced, contextually accurate content.

Generative models like GPT excel in creating coherent text for tasks such as text completion and question answering. However, they may struggle with vague prompts or limited data, leading to less reliable outputs.

Conversely, retrieval models are adept at sourcing exact answers from large databases, essential for chatbots and search functionalities. Yet, they lack the creative dynamism of generative models, being restricted to predefined responses.

Vector databases complement this duo by streamlining the retrieval process, quickly aligning queries with pertinent information. The fusion of RAG and vector databases thus forges systems that are both creative and precise, enhancing content generation capabilities.

Integrating OpenAI for embeddings creation and generative functionalities

As AI capabilities continue rapidly advancing, developers are eager to test the limits of generative models like GPT-4 and GPT-4 Turbo (gpt-4-1106-preview). Yet high costs, unpredictable outputs, and difficult tooling often restrain the exploration process.

By providing easy-to-use APIs instead of complex tooling, OpenAI lowers the barrier to leveraging innovations like completions and embeddings. Integrating OpenAI via user-friendly calls unlocks capabilities making applications smarter and more creative.

By leveraging OpenAI in combination with a vector database, developers gain access to state-of-the-art AI with minimized overhead. Specifically, the OpenAI API enables:

Content Generation : Automated generation of text matching specified tones, styles, and topics with the state of the art models such as GPT-4 turbo.

Code Completion : Suggestions of relevant code snippets and examples using various tools such as Copilot, Tabnine, Phind or Continue.

Search & Filtering : Relevant document retrieval via semantic similarity rankings.

Data Labeling : Automated classification, entity extraction, and sentiment analysis.

Image Generation : Creative images matching textual descriptions using DALL-E 3.

AI Demo: Setting Up Your AI Playground

Up to this point, we've explored the conceptual advantages of integrating Standardized Development Environments (SDEs), Chroma, and OpenAI. What does this look like in a real-world scenario? Let's introduce our demo project—an AI application that showcases the rapid prototyping enabled by this combination.

The project utilizes:

Dev Container Specification - Dev Containers are configured via a devcontainer.json file, which automates the setup of your development environment.
Chroma - For storage, indexing, and embedding-based semantic search with the help of embedding functions (all-MiniLM-L6-v2 or OpenAI text-embedding-ada-002 with enormous 1536 dimension vectors).
OpenAI - To generate new writings matching specified topics and content fetched from the vector database.

Understanding the AI Demo Project

Our demo project showcasing AI integration with popular developer tools. Here's an overview of what it does:

Web Content Extraction : Automatically extracts text and metadata from web page sitemap.
Storage and Search : Stores the extracted content in a vector database (Chroma) and enables intelligent search.
AI Generation : Uses Large Language Model (LLM) to generate articles new writing in relation to the stored content context.

It utilizes standardized development environments (SDEs) to allow collaborators to instantly replicate the setup and start using it our contributing to the project.

Preparing for Installation

Before beginning, ensure you have:

An SDE that supports Dev Container Specification, such as Daytona.io.
Python version 3.10 or later.
An OpenAI API key (optional).

Installation Steps

Set up your environment using one of the following methods:

Using an SDE:

Navigate to your preferred SDE, such as Daytona.io or a cloud IDE.
Point the SDE to the project's Git repository URL: https://github.com/nkkko/ai-sandbox-demo. Or use a shorthand https://yourdaytonainstance.com/#https://github.com/nkkko/ai-sandbox-demo

That's it! The IDE will automatically build a container with all dependencies based on the .devcontainer config.

Manual Setup:

Clone the repository to your machine.
Create virtual Python environment using venv or conda.
Install the required packages by running:pip install -r requirements.txtAlternatively, execute:pip install openai chromadb python-dotenv bs4 argparse lxml
Resolve the issues with your environment.
Create an .env file and insert your OPENAI_API_KEY.

Choice of Embedding Model

It is important to note that after you clone and run the repository you need to set up the .env file with your OPENAI_API_KEY in case you would prefer to use their embeddings.

Notably, as others have shown model's dimension size does not strongly predict its performance. Several models with fewer dimensions than text-embedding-ada-002's 1536 show similar levels of performance. For example, Supabase noted that when maintaining a constant accuracy@10 of 0.99, pgvector with all-MiniLM-L6-v2 outperformed text-embedding-ada-002 by 78%.

Example Usage

Content Harvesting:

python populate.py https://examples.com/sitemap.xml --n 100 --ef openai

This extracts 100 pages from the site's sitemap into Chroma using OpenAI embeddings.

Semantic Search:

python search.py "SDE best practices" --n 3

Finds the 3 most relevant pages on SDE best practices using vector similarity.

AI Generation:

python write.py "How can SDE improve productivity" --s "Software Development" --n 1

Generates an entire article on the prompt while referring to the top search result from the database for context.

Main Components of the Demo Project

populate.py: Extracts content from a website's sitemap and saves to the database.
search.py: Intelligently searches the database content.
write.py: Generates articles using the database content.
SynthSpyder.py: Core module with main logic.
db/: Chroma storage and utilities.
.devcontainer/: Configuration for standalone environments.
.env: Stores your OpenAI API key.

Why We Chose Chroma for Our Vector Database

Chroma's simplicity and user-friendly approach made it the standout choice for our AI Demo Project. Its straightforward APIs facilitate quick integration, allowing our team to prioritize feature development. Moreover, Chroma's innovative design, optimized for high-dimensional vector data, ensures efficient and accurate searches, which are vital for our project's NLP capabilities.

Additionally, Chroma offers the flexibility of local hosting, with options for both persistent and in-memory databases, catering to our project's scalability and performance needs. Its compatibility with advanced AI tools, such as LangChain and OpenAI, enables us to harness the full spectrum of AI technology effectively.

Designed specifically for high-dimensional vector data, Chroma simplifies storing, managing, and searching knowledge for cutting-edge AI applications. Its intuitive SDKs, robust production-ready deployment options, and specialized focus on embedding-powered features make Chroma a standout for innovation.

Out-of-the-box, Chroma handles critical functions like:

Vector embedding of text
Metadata storage
Efficient ANN search
Document storage
Query embedding
Relevance ranking

Chroma also shines through usability and scalability. Its intuitive SDKs and integrations facilitate rapid prototyping, while the production-ready server application easily handles growth.

Putting It All Together

By encapsulating the runtime toolchain into portable Dev Containers, adding a vector database like Chroma into the mix, and harnessing generative algorithms from OpenAI, developers can focus purely on building intelligently. Platforms like Daytona further accelerate this by managing provisioning and security of these SDEs at scale across teams.

Step 1 - Structuring the Dev Container

A dev container is structured using a devcontainer.json file that defines its Docker container and customize it for a particular project. Here is an example config:

{
  "name": "AI Sandbox Demo",
  "build": {
    "dockerfile": "Dockerfile"
  },
  "features": {
    "ghcr.io/devcontainers/features/github-cli:1": {
      "installDirectlyFromGitHubRelease": true,
      "version": "latest"
    },
    "ghcr.io/devcontainers/features/sshd:1": {
      "version": "latest"
    },
    "ghcr.io/devcontainers-contrib/features/mypy:2": {
      "version": "latest"
    }
  },
  "postCreateCommand": "pip install -r requirements.txt",
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
      ]
    }
  }
}

It allows configuring Docker build instructions, tools/languages to install, and IDE customizations for the project.

Step 2 - How to use Chroma as your embeddings vector database

Here is sample code to index and search documents with Chroma:

import chromadb

# Initialize ChromaDB Client
chroma_client = chromadb.Client()

# Create a collection 
collection = chroma_client.create_collection(
  name="articles",
  embedding_function="all-mpnet-base-v2" 
)

# Index documents
collection.upsert(
  documents=["Text content..."],
  metadatas=[{"url": "http://example.com/article"}]  
)

# Search documents
results = collection.query(
  query_texts=["search keywords"],
  n_results=5  
)

print(results['documents']) 
print(results['metadatas'])

This simplicity enables rapid development of AI prototypes on top of Chroma.

Step 3 - Populating the Database

With our dev environment ready, let's start using the demo scripts to extract and store web content.

The populate.py script handles content ingestion. To start, we need:

A website sitemap URL
(Optional) OpenAI API key for enhanced ML search (free trial and somestarting credits are available)

Let's walk through the script:

sitemap_url = "https://example.com/sitemap.xml" 

import SynthSpyder

# Fetch, parse and process sitemap asynchronously 
await SynthSpyder.process_sitemap(sitemap_url)  

# Saves content to ChromaDB collection

This illustrates the simplicity of the content pipeline. Under the hood, it:

Fetches sitemap XML.
Extracts listed URLs.
Downloads each page.
Scrapes main text content.
Stores in the database including metadata like the page URL.

Our database is now populated with structured web content ready for search and analysis!

Step 4 - Searching Content with AI

With a collection of content ingested, we can leverage AI search capabilities.

The search.py script allows queries against the database:

search_query = "Self driving cars" 

import SynthSpyder

# Search the database collection
results = SynthSpyder.search(search_query)  

# Results contain text snippets and metadata 
print(results)

By default, this uses approximate nearest-neighbor search provided by the Chroma vector database.

For semantic search, we can enable OpenAI embeddings:

results = SynthSpyder.search(query, ef_name="openai")

This showcases how SDEs allow us to easily swap out components like ML models.

Now let's generate some articles!

Step 5 - Querying GPT-4 within the set context from vector database

The API can be easily installed and imported into any Python environment:

pip install openai 

import openai

openai.api_key = "sk-..."

response = openai.Completion.create(
  engine="text-davinci-003",
  prompt="Hello world in Python",
  max_tokens=5
)

print(response["choices"][0]["text"])

This simplicity of integration with SDEs allows focusing efforts on creating intelligent applications rather than hassling with dependencies.

The write.py script ties together our content pipeline:

Query Chroma to fetch context around a topic
Feed context into GPT-4 to generate a unique response

For example:

query = "Self driving cars"
prompt = "Write an article about self driving cars" 

import SynthSpyder

# Fetch related content from the database
context = SynthSpyder.search_context(query)

prompt += f"\n\nContext:\n{context}"

# Generate the article
article = SynthSpyder.write(prompt)  

print(article)

And we have an AI-generated article personalized to our database content!

The standardized environment enabled us to easily:

Spin up a reproducible dev container
Ingest and store web content
Build an AI search pipeline
Integrate GPT-4 to generate articles

This demonstrates the power of SDEs as sandboxes for innovating with modern data tools and AI systems.

Opportunities for Improvements

To optimize and improve our project, we could:

Deploy a local Large Language Model (LLM) to enhance data privacy.
Introduce a configurable option in the .env file to select different OpenAI models, allowing for flexible model switching.
Enable the selection of the embedding function within the .env file instead of passing it as an argument every time.
Support the use of multiple collections for diverse data management and cross-collection search.
Develop a user-friendly web interface with Flask or a comparable framework to simplify interaction with the system.

The Future of AI lies in Standardization

As our demo illustrates, SDEs uniquely remove friction from the development process, saving developers hours upon hours. This compounds over the course of a project’s lifespan, enabling teams to achieve exponentially more through unlocked innovation capabilities. These environments turn "What if?" questions into "Why not!" breakthroughs. The only limit is your ambition.

Certainly, the rate at which AI development is evolving demands environments focused on flexibility and experimentation support. Only through standardization can developers hope to keep pace and push boundaries further.

In embracing the SDE approach, organizations also invest in their own future competitiveness within the AI landscape. Those still bogged down by fragmented tools will struggle to attract top talent and innovate quickly enough to compete. To encourage ingenuity, establishing a culture rooted in productivity and consistency is essential.

The next epoch of AI promises to stretch our imaginations beyond what is currently possible. But to reach this full potential, developers need environments tailor-built to support their ambitions. SDEs standardized using templates, containerization, and automation fill this need perfectly.

So whether you're an AI enthusiast eager to experiment, a seasoned veteran pursuing the next big innovation, or a team manager focused on supercharging productivity SDEs are the ultimate sandbox. Offering uniformity but not rigidity, they pave the way for scaling creativity quickly.

The future of AI has arrived. Are your environments ready to handle it?

DEV Community