DEV Community: Komninos Chatzipapas

What Exactly is an AI Agent? Examples and Counterexamples

Komninos Chatzipapas — Sun, 20 Apr 2025 17:14:01 +0000

The term "AI agent" seems to be everywhere lately. From tech news headlines to business strategy discussions, it's rapidly becoming one of the most prominent terms in the artificial intelligence lexicon. Like many popular tech terms, however, its meaning can sometimes get diluted, leading to its use as a catch-all buzzword. So, what really constitutes an AI agent?

At its core, the defining characteristic of an AI agent is its ability to autonomously perform actions to achieve a goal. An agent doesn't just process information or respond to direct commands; it interacts with its environment (digital or physical), makes decisions, and executes tasks with a degree of independence. Humans set the objectives, but the agent figures out the steps and takes them.

Defining AI Agents: Beyond Information Gathering

The key differentiator for an AI agent is its capacity for autonomous action. While many AI tools can process information, generate text, or answer questions, an AI agent goes further. It perceives its environment, whether that's sensor data, user input, or information from the web or databases, and then uses that information to plan and execute actions. These actions could be digital, like sending emails, updating databases, executing trades, or triggering other processes, or physical, in the case of robots or self-driving cars.

Crucially, this action-oriented nature requires capabilities like reasoning, planning, and often learning or adaptation. The agent needs to understand its goal, figure out the necessary steps, potentially use various tools (like APIs or web searches), and carry out those steps, sometimes correcting its course along the way. This distinguishes agents from simpler AI tools or assistants that primarily respond to user prompts or require human decision-making at each step. Merely gathering and presenting information, however sophisticated the method, doesn't meet this definition. The agent must do something based on that information, moving towards its objective without constant human intervention.
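
The perceive, plan, act cycle described above can be sketched in a few lines of Python. This is a toy illustration rather than a real agent framework: the ThermostatAgent class, its target_temp goal, and the simulated room dictionary are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ThermostatAgent:
    """Toy agent: a human sets the goal, the agent chooses and takes the steps."""
    target_temp: float                        # goal set by a human
    log: list = field(default_factory=list)   # record of the actions taken

    def perceive(self, room: dict) -> float:
        # Read the environment (here: a simulated sensor value).
        return room["temp"]

    def plan(self, temp: float) -> str:
        # Decide on the next action from the goal and the observation.
        if temp < self.target_temp - 0.5:
            return "heat"
        if temp > self.target_temp + 0.5:
            return "cool"
        return "idle"

    def act(self, room: dict, action: str) -> None:
        # Change the environment; this step is what separates an agent
        # from a tool that only reports information.
        if action == "heat":
            room["temp"] += 1.0
        elif action == "cool":
            room["temp"] -= 1.0
        self.log.append(action)

    def run(self, room: dict, max_steps: int = 20) -> None:
        for _ in range(max_steps):
            action = self.plan(self.perceive(room))
            if action == "idle":  # goal reached, stop acting
                break
            self.act(room, action)


room = {"temp": 17.0}
agent = ThermostatAgent(target_temp=21.0)
agent.run(room)
print(room["temp"], agent.log)
```

A dashboard that only displayed room["temp"] would stop at the perceive step; the act method is what makes this an agent under the definition used here.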

Examples of AI Agents

When defined by autonomous action, several types of AI systems qualify as agents:

Autonomous Driving Systems: Self-driving cars are a prime example. They perceive their environment (roads, obstacles, traffic signals) using sensors, plan navigation paths, and take actions (steering, braking, accelerating) to reach a destination safely.
Algorithmic Trading Bots: These agents analyze market data in real-time, make decisions based on predefined (or learned) strategies, and autonomously execute buy or sell orders.
Smart Home Controllers (with Action): While some smart home devices merely report status, a true agent might autonomously adjust thermostat settings based on occupancy patterns it learns, or proactively lock doors at a certain time, taking action rather than just providing information or waiting for a command.
Autonomous Workflow Tools: Platforms like Auto-GPT or specialized agents built with frameworks like CrewAI aim to take a high-level goal (e.g., "research competitors and summarize findings") and break it down into subtasks, autonomously using tools like web browsers or APIs to gather information, analyze it, and compile a report. Replit Agent is another example that aims to build applications based on natural language prompts.
Customer Service Agents (with Action Capability): While many chatbots only answer questions, more advanced agents can perform actions like processing refunds, changing passwords, or updating customer information in a CRM system, directly resolving issues instead of just providing instructions.
Robotic Process Automation (RPA) enhanced with AI: Some advanced RPA systems incorporate AI agents that don't just follow pre-programmed rules but can adapt to changes, make decisions within a process, and handle more complex, multi-step workflows autonomously.

Counterexamples: What Isn't an AI Agent

Understanding what isn't an AI agent (under this action-oriented definition) helps clarify the concept:

Standard Chatbots (like early ChatGPT): While highly capable at processing language and generating human-like text, chatbots that primarily respond to user prompts without autonomously initiating tasks or interacting with external systems beyond information retrieval generally don't qualify as agents. They react, but don't typically act proactively towards a persistent goal without continuous prompting.
Recommendation Engines: Systems that suggest products, movies, or news articles analyze data and make predictions, but they don't autonomously act on those recommendations (e.g., purchasing the item for you). The action relies on the user.
Basic Data Analysis Tools: AI tools that analyze datasets and present insights, dashboards, or reports are providing information, not autonomously taking action based on those insights.
AI-powered Writing Assistants (e.g., Grammarly): These tools enhance a user's actions (writing) by providing suggestions or corrections, but they don't autonomously write documents or send emails on the user's behalf based on a goal.
Image Recognition Services: AI that identifies objects in images performs a sophisticated perception task but doesn't inherently take autonomous action based on that recognition.

Conclusion

While "AI agent" is indeed a term enjoying significant buzz, understanding its core meaning – an AI tool capable of autonomous action towards a goal – is crucial. This ability to perceive, reason, plan, and act independently distinguishes true agents from many other valuable but less autonomous AI tools. As these technologies continue to evolve, recognizing this distinction will be key to understanding their capabilities and potential impact on various tasks and industries.

Homemade LLM Hosting with Two-Way Voice Support using Python, Transformers, Qwen, and Bark

Komninos Chatzipapas — Wed, 08 Jan 2025 10:17:03 +0000

The integration of LLMs with voice capabilities has created new opportunities in personalized customer interactions.

This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers,
Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we begin, you'll need to have the following installed:

Python: Version 3.9 or higher.
PyTorch: For running the models.
Transformers: Provides access to the Qwen model.
Accelerate: Required in some environments.
FFmpeg & pydub: For audio processing.
FastAPI: To create the web server.
Uvicorn: ASGI server to run FastAPI.
Bark: For text-to-speech synthesis.
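python-multipart & SciPy: To parse form and file uploads in FastAPI, and to write WAV audio.

FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.

You can install the Python dependencies using pip: pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy

Bark should be installed from its GitHub repository with pip install git+https://github.com/suno-ai/bark.git, since the bark package on PyPI is an unrelated project.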

Step 1: Setting Up the Environment

First, let’s set up our Python environment and choose our PyTorch device:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

This code checks if a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.

If no such GPU is available, PyTorch will instead run on the CPU, which is much slower.

For newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but the PyTorch Metal implementation is not comprehensive.
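
The three options above (CUDA, then Metal, then CPU) can be folded into one helper. This is a minimal sketch: pick_device is a hypothetical name, and the preference order is our assumption. It takes the availability flags as plain booleans, so the logic can be checked even on a machine without PyTorch.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred PyTorch device string given what is available."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    device = pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # Lets the sketch run without PyTorch installed.
    device = pick_device(False, False)

print(device)
```

Since the Metal backend is not comprehensive, you may still prefer to force 'cpu' on Apple Silicon if a model hits an unsupported operation.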

Step 2: Loading the Model

Most open-source LLMs only support text input and text output. However, since we want to create a voice-in-voice-out system, this would require us to use two more models to (1) convert the speech into text before it's fed into our LLM and (2) convert the LLM output back into speech.

By using a multimodal LLM like Qwen Audio, we can get away with one model to process speech input into a text response, and then only have to use a second model to convert the LLM output back into speech.

This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but also usually yields better results since the input audio is sent straight to the LLM without any friction.

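If you're running on a cloud GPU host like Runpod or Vast, you'll want to set the HuggingFace home and Bark cache directories to your volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models. Run these as two separate commands (or join them with &&); a single & would run the first export in a background subshell, where it has no effect on your session.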
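
If you would rather keep this configuration inside the script, the same two variables can be set from Python. This is a sketch that reuses the /workspace paths from the shell commands above. The assumption is that it runs before transformers and bark are imported, since both libraries read these variables at import time.

```python
import os

# Paths taken from the shell commands above; adjust them to your own volume mount.
CACHE_DIRS = {
    "HF_HOME": "/workspace/hf",
    "XDG_CACHE_HOME": "/workspace/bark",
}

for name, path in CACHE_DIRS.items():
    # setdefault keeps any value that was already exported in the shell.
    os.environ.setdefault(name, path)

# Import transformers and bark only after this point so they pick up the paths.
print({name: os.environ[name] for name in CACHE_DIRS})
```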

from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map places the weights on the chosen device while loading,
# so no extra .to(device) call is needed
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map=device)

We chose to use the small 7B variant of the Qwen Audio model series here in order to reduce our computational requirements. However, Qwen may have released stronger and bigger audio models by the time you are reading this article. You can view all the Qwen models on HuggingFace to double check you're using their latest model.

For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.

from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()

Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones might be more performant, they come at a much higher cost. The TTS arena keeps an up-to-date comparison.

With both Qwen Audio 7B & Bark loaded into memory, the approximate (V)RAM usage is 24GB, so make sure your hardware supports this. Otherwise, you may use a quantized version of the Qwen model to save on memory.
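
To get a feel for how much quantization can save, here is some back-of-the-envelope arithmetic for the weights alone. It is a rough sketch: the 7e9 figure is a round number suggested by the "7B" in the model name, not an exact parameter count, and the estimate ignores activations, the KV cache, and Bark's own weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold model weights, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)


# Assumption: a round figure based on the model name, not an exact count.
NUM_PARAMS = 7e9

for label, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bytes_per_param):.1f} GiB")
```

Each halving of the bytes per parameter halves the weight footprint, which is why a quantized checkpoint can fit on a smaller GPU.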

Step 4: Setting Up the FastAPI Server

We’ll create a FastAPI server with two routes to handle incoming audio or text inputs and return audio responses.

from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

This server accepts audio files via POST requests at the /voice endpoint and form-encoded text at the /text endpoint.

Step 5: Processing Audio Input

We’ll use pydub (which relies on FFmpeg for decoding) to process the incoming audio and prepare it for the Qwen model as mono, 16 kHz, float32 samples.

from pydub import AudioSegment
from io import BytesIO
import numpy as np

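def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and force 16-bit samples
    # so the int16 conversion below is valid for any input format
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale 16-bit PCM to float32 in [-1.0, 1.0)
    samples = samples.astype(np.float32) / 32768.0

    return samples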

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
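
To see what the scaling step produces without needing an audio file, here is a small NumPy-only check. It is a sketch: the synthetic 440 Hz tone stands in for decoded audio and is not part of the server code.

```python
import numpy as np

# One second of a 440 Hz tone as 16-bit PCM, standing in for decoded audio.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
pcm = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Same scaling as audiosegment_to_float32_array: int16 -> float32 in [-1.0, 1.0).
samples = pcm.astype(np.float32) / 32768.0

print(samples.dtype, samples.shape, float(samples.min()), float(samples.max()))
```

Dividing by 32768 rather than 32767 keeps the most negative int16 value at exactly -1.0.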

Step 6: Generating Textual Response with Qwen

With the processed audio, we can generate a textual response using the Qwen model. This will need to handle both text & audio inputs.
The processor will convert our input to the model's chat template (ChatML in Qwen's case).

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

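    # max_new_tokens caps only the generated reply; max_length would also
    # count the prompt and audio tokens against the limit
    generate_ids = model.generate(**inputs, max_new_tokens=256)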
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response

Feel free to play around with the generation parameters like the temperature on the model.generate function.
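
As a starting point for that experimentation, here is one way to collect the sampling options in a single place. It is a minimal sketch: build_generation_kwargs is a hypothetical helper, and the default values of 0.7 and 0.9 are arbitrary choices, not recommendations from the Qwen team.

```python
def build_generation_kwargs(temperature: float = 0.7, top_p: float = 0.9,
                            max_new_tokens: int = 256) -> dict:
    """Build keyword arguments for model.generate()."""
    kwargs = {"max_new_tokens": max_new_tokens}
    if temperature > 0:
        # Sampling: a higher temperature gives more varied responses.
        kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        # A temperature of 0 is treated here as a request for greedy decoding.
        kwargs["do_sample"] = False
    return kwargs


print(build_generation_kwargs())
print(build_generation_kwargs(temperature=0))
# Usage inside generate_response (not run here):
#   generate_ids = model.generate(**inputs, **build_generation_kwargs())
```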

Step 7: Converting Text to Speech with Bark

Finally, we’ll convert the generated text response back to speech.

from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
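
Bark is geared towards short utterances (its README mentions roughly 13 seconds of speech per call), so longer LLM responses may get cut off. One possible workaround is to split the text into sentence-sized chunks and synthesize them one by one. The splitter below is a sketch: split_into_chunks and its 200-character default are our own assumptions, not part of Bark.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("Hello there. How are you today? I am fine!", max_chars=20))
# Usage with Bark (not run here):
#   pieces = [generate_audio(chunk) for chunk in split_into_chunks(text)]
#   audio_array = np.concatenate(pieces)
```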


Step 8: Integrating Everything in the APIs

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

You may choose to also add a system message to the conversations to gain more control over the assistant responses.
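
A system message is just one more entry at the start of the conversation list. The helper below is a sketch of how that could look: build_conversation is a hypothetical name and the prompt text is only an example.

```python
def build_conversation(user_content: list, system_prompt: str = "") -> list:
    """Return a chat-style conversation, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": user_content})
    return conversation


conversation = build_conversation(
    [{"type": "text", "text": "Hey"}],
    system_prompt="You are a friendly voice assistant. Keep answers under two sentences.",
)
print(conversation)
```

Because the system message content is a plain string rather than a list, the audio-collecting loop in generate_response skips over it.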

Step 9: Testing things out

We can use curl to ping our server as follows:

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"

# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"
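
If curl is not available, the text request can also be sent from Python using only the standard library. This is a sketch: the helper names are ours, and BASE_URL assumes the server runs locally on port 8000 as in the commands above.

```python
from urllib import parse, request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_text_request(text: str) -> request.Request:
    """Build the POST request for the /text endpoint without sending it."""
    body = parse.urlencode({"text": text}).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/text",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def save_reply(text: str, path: str = "output.wav") -> None:
    """Send the text to the server and write the returned WAV audio to disk."""
    with request.urlopen(build_text_request(text)) as response:
        with open(path, "wb") as f:
            f.write(response.read())


req = build_text_request("Hey")
print(req.get_method(), req.full_url, req.data)
# Call save_reply("Hey") once the server is running.
```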

Conclusion

By following these steps, you’ve set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you’re exploring ways to monetize AI-powered language models, consider these potential applications:

Chatbots (e.g. Character AI, NSFW AI Chat)
Phone Agents (e.g. Synthflow, Bland)
Customer Support Automation (e.g. Zendesk, Forethought)
Legal Assistants (e.g. Harvey AI, Leya AI)

Full code

import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map places the weights on the chosen device while loading,
# so no extra .to(device) call is needed
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map=device)

preload_models()

app = FastAPI()

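def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and force 16-bit samples
    # so the int16 conversion below is valid for any input format
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Scale 16-bit PCM to float32 in [-1.0, 1.0)
    samples = samples.astype(np.float32) / 32768.0

    return samples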

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

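    # max_new_tokens caps only the generated reply; max_length would also
    # count the prompt and audio tokens against the limit
    generate_ids = model.generate(**inputs, max_new_tokens=256)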
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    return response


def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer


@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Creating an AI-powered Image Generation API Service with FLUX, Python, and Diffusers

Komninos Chatzipapas — Tue, 26 Nov 2024 11:48:13 +0000

FLUX (by Black Forest Labs) has taken the world of AI image generation by storm in the last few months. Not only has it beaten Stable Diffusion (the prior open-source king) on many benchmarks, it has also surpassed proprietary models like DALL-E or Midjourney in some metrics.

But how would you go about using FLUX on one of your apps? One might think of using serverless hosts like Replicate and others, but these can get very expensive very quickly, and may not provide the flexibility you need. That's where creating your own custom FLUX server comes in handy.

In this article, we'll walk you through creating your own FLUX server using Python. This server will allow you to generate images based on text prompts via a simple API. Whether you're running this server for personal use or deploying it as part of a production application, this guide will help you get started.

Prerequisites

Before diving into the code, let's ensure you have the necessary tools and libraries set up:

Python: You'll need Python 3 installed on your machine, preferably version 3.10.
torch: The deep learning framework we'll use to run FLUX.
diffusers: Provides access to the FLUX model.
transformers: Required dependency of diffusers.
sentencepiece: Required to run the FLUX tokenizer
protobuf: Required to run FLUX
accelerate: Helps load the FLUX model more efficiently in some cases.
fastapi: Framework to create a web server that can accept image generation requests.
uvicorn: Required to run the FastAPI server.
psutil: Allows us to check how much RAM there is on our machine.

You can install all the libraries by running the following command: pip install torch diffusers transformers sentencepiece protobuf accelerate fastapi uvicorn psutil

If you're using a Mac with an M1 or M2 chip, you should set up PyTorch with Metal for optimal performance. Follow the official PyTorch with Metal guide before proceeding.

You'll also need at least 12 GB of VRAM if you're planning on running FLUX on a GPU device, or at least 12 GB of RAM for running on CPU/MPS (which will be slower).

Step 1: Setting Up the Environment

Let's start the script by picking the right device to run inference based on the hardware we're using.

device = 'cuda' # can also be 'cpu' or 'mps'

import os

# MPS support in PyTorch is not yet fully implemented
if device == 'mps':
  os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

if device == 'mps' and not torch.backends.mps.is_available():
      raise Exception("Device set to MPS, but MPS is not available")
elif device == 'cuda' and not torch.cuda.is_available():
      raise Exception("Device set to CUDA, but CUDA is not available")

You can specify cpu, cuda (for NVIDIA GPUs), or mps (for Apple's Metal Performance Shaders). The script then checks if the selected device is available and raises an exception if it's not.

Step 2: Loading the FLUX Model

Next, we load the FLUX model. We'll load the model in fp16 precision which will save us some memory without much loss in quality.

At this point, you may be asked to authenticate with HuggingFace, as the FLUX model is gated. In order to authenticate successfully, you'll need to create a HuggingFace account, go to the model page, accept the terms, and then create a HuggingFace token from your account settings and add it on your machine as the HF_TOKEN environment variable.
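
Because a missing token only surfaces after the download has started, it can help to check for it up front. The helper below is a sketch: require_hf_token is a hypothetical name. It only looks at the HF_TOKEN environment variable mentioned above, so it would report a false alarm if you authenticated some other way, such as a cached login.

```python
import os


def require_hf_token() -> str:
    """Return the HuggingFace token from the environment or fail with a clear error."""
    token = os.environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Accept the model terms on HuggingFace, "
            "create an access token, and export it as HF_TOKEN."
        )
    return token


try:
    require_hf_token()
    print("HF_TOKEN found")
except RuntimeError as exc:
    print(exc)
```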

from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline
import psutil

model_name = "black-forest-labs/FLUX.1-dev"

print(f"Loading {model_name} on {device}")

pipeline = FluxPipeline.from_pretrained(
      model_name,

      # Diffusion models are generally trained on fp32, but fp16
      # gets us 99% there in terms of quality, with just half the (V)RAM
      torch_dtype=torch.float16,

      # Ensure we don't load any dangerous binary code
      use_safetensors=True,

      # We are using Euler here, but you can also use other samplers
      scheduler=FlowMatchEulerDiscreteScheduler()
).to(device)

Here, we're loading the FLUX model using the diffusers library. The model we're using is black-forest-labs/FLUX.1-dev, loaded in fp16 precision.

There is also a timestep-distilled model named FLUX Schnell which has faster inference, but outputs less detailed images, as well as a FLUX Pro model which is closed-source.

We'll use the Euler scheduler here, but you may experiment with this. You can read more on schedulers here.

Since image generation can be resource-intensive, it's crucial to optimize memory usage, especially when running on a CPU or a device with limited memory.

# Recommended if running on MPS or CPU with < 64 GB of RAM
total_memory = psutil.virtual_memory().total
total_memory_gb = total_memory / (1024 ** 3)
if (device == 'cpu' or device == 'mps') and total_memory_gb < 64:
      print("Enabling attention slicing")
      pipeline.enable_attention_slicing()

This code checks the total system memory and enables attention slicing if the machine is running on CPU or MPS with less than 64 GB of RAM. Attention slicing reduces memory usage during image generation, which is essential for devices with limited resources.

Step 3: Creating the API with FastAPI

Next, we'll set up the FastAPI server, which will provide an API to generate images.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field, conint, confloat
from fastapi.middleware.gzip import GZipMiddleware
from io import BytesIO
import base64

app = FastAPI()

# We will be returning the image as a base64 encoded string
# which we will want compressed
app.add_middleware(GZipMiddleware, minimum_size=1000, compresslevel=7)

FastAPI is a popular framework for building web APIs with Python. In this case, we're using it to create a server that can accept requests for image generation. We're also using GZip middleware to compress the response, which is particularly useful when sending images back in base64 format.

In a production environment, you might want to store the generated images in an S3 bucket or other cloud storage and return the URLs instead of the base64-encoded strings, to take advantage of a CDN and other optimizations.

Step 4: Defining the Request Model

We now need to define a model for the requests that our API will accept.

class GenerateRequest(BaseModel):
      prompt: str
      seed: conint(ge=0) = Field(..., description="Seed for random number generation")
      height: conint(gt=0) = Field(..., description="Height of the generated image, must be a positive integer and a multiple of 8")
      width: conint(gt=0) = Field(..., description="Width of the generated image, must be a positive integer and a multiple of 8")
      cfg: confloat(gt=0) = Field(..., description="CFG (classifier-free guidance scale), must be a positive number")
      steps: conint(gt=0) = Field(..., description="Number of inference steps, must be a positive integer")
      batch_size: conint(gt=0) = Field(..., description="Number of images to generate in a batch")

This GenerateRequest model defines the parameters required to generate an image. The prompt field is the text description of the image you want to create. Other fields include the image dimensions, the number of inference steps, and the batch size.

Step 5: Creating the Image Generation Endpoint

Now, let's create the endpoint that will handle image generation requests.

@app.post("/")
async def generate_image(request: GenerateRequest):
      # Validate that height and width are multiples of 8
      # as required by FLUX
      if request.height % 8 != 0 or request.width % 8 != 0:
            raise HTTPException(status_code=400, detail="Height and width must both be multiples of 8")

      # Always calculate the seed on CPU for deterministic RNG
      # For a batch of images, seeds will be sequential like n, n+1, n+2, ...
      generator = [torch.Generator(device="cpu").manual_seed(i) for i in range(request.seed, request.seed + request.batch_size)]

      images = pipeline(
            height=request.height,
            width=request.width,
            prompt=request.prompt,
            generator=generator,
            num_inference_steps=request.steps,
            guidance_scale=request.cfg,
            num_images_per_prompt=request.batch_size
      ).images

      # Convert images to base64 strings
      # (for a production app, you might want to store the
      # images in an S3 bucket and return the URLs instead)
      base64_images = []
      for image in images:
            buffered = BytesIO()
            image.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
            base64_images.append(img_str)

      return {
            "images": base64_images,
      }

This endpoint handles the image generation process. It first validates that the height and width are multiples of 8, as required by FLUX. It then generates images based on the provided prompt and returns them as base64-encoded strings.
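
As an alternative to the manual check, the multiple-of-8 rule could be moved into the request model itself through Pydantic's multiple_of constraint. This is a sketch under that assumption: SizeRequest is a hypothetical, trimmed-down model that shows only the two size fields.

```python
from pydantic import BaseModel, Field, ValidationError


class SizeRequest(BaseModel):
    """Hypothetical, trimmed-down request model showing only the size fields."""
    height: int = Field(..., gt=0, multiple_of=8)
    width: int = Field(..., gt=0, multiple_of=8)


print(SizeRequest(height=1024, width=512))

try:
    SizeRequest(height=1024, width=1001)
except ValidationError as exc:
    # Only the width violates the constraint.
    print([error["loc"] for error in exc.errors()])
```

One difference to keep in mind: FastAPI reports model validation failures with status code 422, whereas the manual check in the endpoint returns 400.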

Step 6: Starting the Server

Finally, let's add some code to start the server when the script is run.

@app.on_event("startup")
async def startup_event():
      print("Image generation server running")

if __name__ == "__main__":
      import uvicorn
      uvicorn.run(app, host="0.0.0.0", port=8000)

This code starts the FastAPI server on port 8000, making it accessible not only from http://localhost:8000 but also from other devices on the same network using the host machine’s IP address, thanks to the 0.0.0.0 binding.

Step 7: Testing Your Server Locally

Now that your FLUX server is up and running, it's time to test it. You can use curl, a command-line tool for making HTTP requests, to interact with your server:

curl -X POST "http://localhost:8000/" \
-H "Content-Type: application/json" \
-d '{
  "prompt": "A futuristic cityscape at sunset",
  "seed": 42,
  "height": 1024,
  "width": 1024,
  "cfg": 3.5,
  "steps": 50,
  "batch_size": 1
}' | jq -r '.images[0]' | base64 -d > test.png

This command will only work on UNIX-based systems with the curl, jq and base64 utilities installed. It may also take up to a few minutes to complete depending on the hardware hosting the FLUX server.
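
On systems without those utilities, a small Python client can do the same job with only the standard library. This is a sketch: the generate and save_images helpers and the test_0.png style file names are our own choices, and the URL assumes the server is running locally on port 8000.

```python
import base64
import json
from urllib import request


def save_images(payload: dict, prefix: str = "test") -> list:
    """Decode the base64 strings in payload["images"] and write them as PNG files."""
    paths = []
    for index, encoded in enumerate(payload["images"]):
        path = f"{prefix}_{index}.png"
        with open(path, "wb") as f:
            f.write(base64.b64decode(encoded))
        paths.append(path)
    return paths


def generate(params: dict, url: str = "http://localhost:8000/") -> dict:
    """POST the generation parameters as JSON and return the parsed response."""
    req = request.Request(
        url,
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as response:
        return json.load(response)


# With the server running:
#   save_images(generate({"prompt": "A futuristic cityscape at sunset", "seed": 42,
#                         "height": 1024, "width": 1024, "cfg": 3.5,
#                         "steps": 50, "batch_size": 1}))
```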

Conclusion

Congratulations! You've successfully created your own FLUX server using Python. This setup allows you to generate images based on text prompts via a simple API. If you're not satisfied with the results of the base FLUX model, you might consider fine-tuning the model for even better performance on specific use cases.

Fun Fact

Since 2022, over 15 billion AI-generated images have been produced, highlighting the growing popularity of AI image generation tools and the rapid changes happening in the creative processes of teams.

Applications

If you’re wondering how to transform AI-powered image generation into profitable ventures, consider the following applications:

Custom Image Generation Platforms (e.g. Leonardo AI, NSFW AI Generator)
AI-Powered Design Assistance Tools (e.g. Adobe Firefly)
Ad Creatives Generation Platforms (e.g. AdCreative AI, Predis AI)
Virtual Environment Design Assistants (e.g. Zaha Hadid)

Full code

You may find the full code used in this guide below:

device = 'cuda' # can also be 'cpu' or 'mps'

import os

# MPS support in PyTorch is not yet fully implemented
if device == 'mps':
  os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

if device == 'mps' and not torch.backends.mps.is_available():
      raise Exception("Device set to MPS, but MPS is not available")
elif device == 'cuda' and not torch.cuda.is_available():
      raise Exception("Device set to CUDA, but CUDA is not available")

from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline
import psutil

model_name = "black-forest-labs/FLUX.1-dev"

print(f"Loading {model_name} on {device}")

pipeline = FluxPipeline.from_pretrained(
      model_name,

      # Diffusion models are generally trained on fp32, but fp16
      # gets us 99% there in terms of quality, with just half the (V)RAM
      torch_dtype=torch.float16,

      # Ensure we don't load any dangerous binary code
      use_safetensors=True,

      # We are using Euler here, but you can also use other samplers
      scheduler=FlowMatchEulerDiscreteScheduler()
).to(device)

# Recommended if running on MPS or CPU with < 64 GB of RAM
total_memory = psutil.virtual_memory().total
total_memory_gb = total_memory / (1024 ** 3)
if (device == 'cpu' or device == 'mps') and total_memory_gb < 64:
      print("Enabling attention slicing")
      pipeline.enable_attention_slicing()


from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field, conint, confloat
from fastapi.middleware.gzip import GZipMiddleware
from io import BytesIO
import base64

app = FastAPI()

# We will be returning the image as a base64 encoded string
# which we will want compressed
app.add_middleware(GZipMiddleware, minimum_size=1000, compresslevel=7)

class GenerateRequest(BaseModel):
      prompt: str
      seed: conint(ge=0) = Field(..., description="Seed for random number generation")
      height: conint(gt=0) = Field(..., description="Height of the generated image, must be a positive integer and a multiple of 8")
      width: conint(gt=0) = Field(..., description="Width of the generated image, must be a positive integer and a multiple of 8")
      cfg: confloat(gt=0) = Field(..., description="CFG (classifier-free guidance scale), must be a positive number")
      steps: conint(gt=0) = Field(..., description="Number of inference steps, must be a positive integer")
      batch_size: conint(gt=0) = Field(..., description="Number of images to generate in a batch")

@app.post("/")
async def generate_image(request: GenerateRequest):
      # Validate that height and width are multiples of 8
      # as required by FLUX
      if request.height % 8 != 0 or request.width % 8 != 0:
            raise HTTPException(status_code=400, detail="Height and width must both be multiples of 8")

      # Always calculate the seed on CPU for deterministic RNG
      # For a batch of images, seeds will be sequential like n, n+1, n+2, ...
      generator = [torch.Generator(device="cpu").manual_seed(i) for i in range(request.seed, request.seed + request.batch_size)]

      images = pipeline(
            height=request.height,
            width=request.width,
            prompt=request.prompt,
            generator=generator,
            num_inference_steps=request.steps,
            guidance_scale=request.cfg,
            num_images_per_prompt=request.batch_size
      ).images

      # Convert images to base64 strings
      # (for a production app, you might want to store the
      # images in an S3 bucket and return the URLs instead)
      base64_images = []
      for image in images:
            buffered = BytesIO()
            image.save(buffered, format="PNG")
            img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
            base64_images.append(img_str)

      return {
            "images": base64_images,
      }

@app.on_event("startup")
async def startup_event():
      print("Image generation server running")

if __name__ == "__main__":
      import uvicorn
      uvicorn.run(app, host="0.0.0.0", port=8000)
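
To try the server above, a client only needs to POST the fields of GenerateRequest and decode the base64 strings in the response. The following is a minimal sketch using only the standard library. The URL, the default parameter values, and the helper names (build_payload, decode_images, generate) are assumptions for illustration and are not part of the original server.

```python
import base64
import json
from urllib import request as urlrequest

# Assumed address: matches the port passed to uvicorn.run above
SERVER_URL = "http://localhost:8000/"


def build_payload(prompt, seed=0, height=1024, width=1024,
                  cfg=3.5, steps=28, batch_size=1):
    """Build a JSON body with the fields GenerateRequest expects.

    The default values are illustrative, not recommendations.
    """
    return {
        "prompt": prompt,
        "seed": seed,
        "height": height,
        "width": width,
        "cfg": cfg,
        "steps": steps,
        "batch_size": batch_size,
    }


def decode_images(response_body):
    """Turn the base64 strings in the response back into raw PNG bytes."""
    return [base64.b64decode(s) for s in response_body["images"]]


def generate(prompt, url=SERVER_URL, **params):
    """POST a generation request and return a list of PNG byte strings."""
    data = json.dumps(build_payload(prompt, **params)).encode("utf-8")
    req = urlrequest.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urlrequest.urlopen(req) as resp:
        return decode_images(json.loads(resp.read()))
```

With the server running, calling generate("a lighthouse at dusk", seed=42, batch_size=2) should return two PNG byte strings, generated from seeds 42 and 43, that can be written straight to disk.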

5 Best Open-Source LLMs for AI Companionship

Komninos Chatzipapas — Thu, 29 Aug 2024 16:23:50 +0000

AI companions are more than just a passing trend; they could become a $150 billion industry by 2030, according to a recent article by Ark Invest. With the potential to revolutionize how we interact with technology and each other, AI companionship represents the next frontier in digital entertainment.

Companion apps like Replika, Gatebox, HeraHaven, and SoulMachines attract millions of users every month, according to Similarweb data.

But if you’re considering building an AI companion, where do you start? Specifically, which large language model (LLM) should you choose to power your app? Let me walk you through the best open-source LLMs currently available and why they might be the perfect fit for your project.

Choosing the Right LLM for AI Companionship

When thinking about AI companionship, one of the first decisions you’ll need to make is what kind of open-source LLM to use. There are several types of LLMs, but they generally fall into three categories: base models, instruction-tuned models, and chat-tuned models.

Why Chat-Tuned Models Are Essential

Base models are the raw, unrefined versions of LLMs that have been trained on large datasets but lack the fine-tuning required for specific tasks. Instruction-tuned models have been refined to follow human instructions more accurately. However, when it comes to creating AI companions, chat-tuned models are the way to go. These models are specifically optimized for conversational interactions, making them better suited for the nuanced, ongoing dialogues that characterize AI companionship.
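
To make the difference concrete: a chat-tuned model expects the conversation to arrive as role-tagged turns, wrapped in the template it was trained on. The sketch below renders a message list in ChatML, the layout used by Hermes-3 among others. Other models use different templates, so treat the markers here as one example rather than a universal format. The function name to_chatml and the persona text are hypothetical.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are Mia, a warm and curious companion."},
    {"role": "user", "content": "I had a rough day."},
]
prompt = to_chatml(conversation)
```

In practice you would rarely write this by hand: the tokenizer shipped with a chat-tuned model usually carries its own template, which Hugging Face Transformers exposes through tokenizer.apply_chat_template.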

Model Size: How Big Is Big Enough?

Another key consideration is the size of the model. For most conversational chatbots, an 8B-parameter model should suffice. Models of this size offer a good balance between performance and resource efficiency, making them ideal for most applications. However, if you’re looking to go above and beyond, perhaps aiming to create a more sophisticated AI, you might want to explore models with up to 70B parameters.
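
A rough way to see what model size means for hardware is to multiply the parameter count by the bytes stored per parameter. The sketch below does that arithmetic for the weights only. It ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_param=16):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * bytes_per_param


for size in (8, 70):
    for bits in (16, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.0f} GB")
```

By this estimate an 8B model needs about 16 GB at 16-bit precision and about 4 GB when quantized to 4 bits, while a 70B model needs about 140 GB and 35 GB respectively. That gap is why 8B models are the practical default for most companion apps.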

The 5 Best Open-Source LLMs for AI Companionship

Here’s a list of some of the best open-source LLMs that are particularly well-suited for creating AI companions:

1. Hermes-3 Llama-3.1-8B

Hermes-3 Llama-3.1-8B is a powerful model that’s been fine-tuned for chat applications. It offers a good balance between capability and efficiency, making it a top choice for AI companionship.

2. Yi-1.5-9B-Chat

Yi-1.5-9B-Chat is another excellent option, boasting a slightly larger parameter count, which allows for more complex and nuanced conversations. If your AI companion needs to handle a wide range of topics with ease, this model is worth considering.

3. InternLM2.5-7B-Chat

InternLM2.5-7B-Chat is a lighter model that still packs a punch. Its smaller size makes it ideal for applications where resources are limited but conversational depth is still a priority.

4. Humanish-Roleplay-Llama-3.1-8B

Humanish-Roleplay-Llama-3.1-8B is designed for roleplay scenarios, making it perfect for users who want an AI companion that can take on different personas. Whether it’s a friendly chat or a more complex interaction, this model excels at maintaining character.

5. OpenChat-3.5-1210

OpenChat-3.5-1210 is a 7B model built on Mistral and optimized for deep, engaging conversations. If your AI companion needs to offer an immersive experience, this model is a strong contender.

If you’re interested in seeing how these models stack up against others, you can visit the Open LLM Leaderboard for the most recent comparisons.

Taking Your AI Companion to the Next Level with Fine-Tuning

While these models are impressive right out of the box, there’s always room for improvement. With enough prompting, you should be able to get decent results for most applications. However, if you’re looking for a more advanced approach, consider fine-tuning one of these models to better suit your specific needs. Fine-tuning allows you to tweak the model’s behavior, making it more aligned with your desired user experience.
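
Before reaching for fine-tuning, most of the "prompting" work comes down to what you send with each request: a persona in the system message plus a window of recent conversation. Below is a minimal sketch of that assembly step. The function name build_messages, the max_turns default, and the persona text are all assumptions for illustration.

```python
def build_messages(persona, history, user_message, max_turns=6):
    """Assemble a chat request: persona first, recent history, then the new message."""
    system = {"role": "system", "content": persona}
    # Keep only the most recent messages so the prompt fits the context window
    recent = history[-max_turns:] if max_turns > 0 else []
    return [system, *recent, {"role": "user", "content": user_message}]


history = [
    {"role": "user", "content": "Hi, I am Sam."},
    {"role": "assistant", "content": "Nice to meet you, Sam!"},
]
messages = build_messages(
    "You are Mia, a warm and curious companion. Stay in character.",
    history,
    "Do you remember my name?",
)
```

If careful persona prompts and history handling still fall short of the behavior you want, that is a reasonable signal that fine-tuning is worth the effort.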

In conclusion, the potential for AI companions is vast, and the right LLM can make all the difference in creating a compelling, engaging experience. Whether you’re building the next million-user app or a more niche application, these open-source models provide a strong foundation for success. So, dive in, experiment, and see what you can create with the incredible power of modern LLMs.