<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mateusz Kowalewski</title>
    <description>The latest articles on DEV Community by Mateusz Kowalewski (@mateuszpoland).</description>
    <link>https://dev.to/mateuszpoland</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1039079%2Fb09411b1-c6f9-4d92-8772-1ef64492ac17.png</url>
      <title>DEV Community: Mateusz Kowalewski</title>
      <link>https://dev.to/mateuszpoland</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mateuszpoland"/>
    <language>en</language>
    <item>
      <title>Harnessing OpenAI Assistant 2.0 for Named Entity Recognition in PHP/Symfony 7</title>
      <dc:creator>Mateusz Kowalewski</dc:creator>
      <pubDate>Thu, 19 Dec 2024 22:51:57 +0000</pubDate>
      <link>https://dev.to/mateuszpoland/harnessing-openai-assistant-20-for-named-entity-recognition-in-phpsymfony-7-boe</link>
      <guid>https://dev.to/mateuszpoland/harnessing-openai-assistant-20-for-named-entity-recognition-in-phpsymfony-7-boe</guid>
<description>&lt;p&gt;Integrating large language models into real-world product backends is the latest battleground for innovation. But as with every tech trend, the real winners aren’t those who rush in to know it all; they’re the ones who pause, reflect, and make smart decisions.&lt;/p&gt;

&lt;p&gt;Even though AI is more accessible than ever, thinking it's a trivial task is a big mistake. The field is still in its infancy, and almost everyone in tech and business is trying to figure out how to make sense of it. The internet is overflowing with both reliable information and misleading hype.&lt;/p&gt;

&lt;p&gt;Believe it or not, you don’t need to jump on every piece of AI gossip you hear. Take a step back and approach it thoughtfully.&lt;/p&gt;

&lt;p&gt;If you’re a PHP developer or work with a PHP codebase, AI might feel like an alien concept. Classes, interfaces, message queues, and frameworks seem worlds apart from NLP, fine-tuning, stochastic gradient descent, LoRA, RAG, and all that jargon. I get it. To learn these concepts systematically, like anything in software engineering, we need time and good practice.&lt;/p&gt;

&lt;p&gt;But isn’t AI, machine learning, and data science the domain of Python or R programmers? You’re partially right. Most foundational machine learning is done in Python. And honestly, Python is a joy to learn—it’s versatile and can be applied to countless interesting projects. From building and training a language model to deploying it via an API, Python has you covered.&lt;/p&gt;

&lt;p&gt;But let’s get back to PHP.&lt;/p&gt;

&lt;p&gt;Don’t worry—you’re not about to become obsolete as a regular software engineer. Quite the opposite. As we’re increasingly expected to understand related fields like containerization, CI/CD, systems engineering, and cloud infrastructure, AI is quickly becoming another essential skill in our toolkit. There’s no better time than now to start learning.&lt;/p&gt;

&lt;p&gt;That said, I don’t recommend diving headfirst into building neural networks from scratch (unless you really want to). It’s easy to get overwhelmed. Instead, let me show you a practical starting point for your AI experiments.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Can You Expect?
&lt;/h2&gt;

&lt;p&gt;Here’s what we’ll cover in this guide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous AI Workflows&lt;/strong&gt;&lt;br&gt;
I’ll show you how to implement an AI workflow using a message queue—whether it’s RabbitMQ, Amazon SQS, or your preferred broker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production-Ready Solution&lt;/strong&gt;&lt;br&gt;
You’ll see a real example of a solution deployed in a production system that solves a basic business requirement with AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Symfony Integration&lt;/strong&gt;&lt;br&gt;
The solution is fully implemented within the Symfony PHP framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Tools&lt;/strong&gt;&lt;br&gt;
We’ll use the OpenAI PHP library and the latest OpenAI Assistants 2.0.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Topics Will We Cover?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset Structure&lt;/strong&gt;&lt;br&gt;
How to build a dataset for training and evaluating your AI model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-Tuning OpenAI Models&lt;/strong&gt;&lt;br&gt;
Learn to prepare a proper &lt;code&gt;.jsonl&lt;/code&gt; file and fine-tune the &lt;code&gt;GPT-4o-mini&lt;/code&gt; model, or another model from the GPT family, for your specific business use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating and Testing an OpenAI Assistant 2.0&lt;/strong&gt;&lt;br&gt;
Understand how to set up an OpenAI Assistant and test it in the OpenAI Playground.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge Bases&lt;/strong&gt;&lt;br&gt;
Dive into the concept of knowledge bases: why GPT doesn’t know everything and how to supply it with the right context to significantly boost accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PHP&amp;amp;Symfony Integration&lt;/strong&gt;&lt;br&gt;
Learn how to seamlessly connect your AI agent with your Symfony application.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interested? Let's roll.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites for OpenAI + PHP Projects
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Account&lt;/strong&gt;
You’ll need an account on openai.com.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional: Weights &amp;amp; Biases Account&lt;/strong&gt;
Setting up an account on wandb.ai is optional but highly recommended. It’s an excellent tool for tracking AI experiments and visualizing results.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Defining the Problem
&lt;/h2&gt;

&lt;p&gt;Let’s dive into the problem we’re solving.&lt;br&gt;
At its core, we’re dealing with a block of text that represents something—an address, in our case. The goal is to classify its components into predefined groups.&lt;/p&gt;

&lt;p&gt;So, when a user sends us an address, we aim to return a JSON structure that segments the address into its parts. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "street": "Strada Grui",
  "block": "Bloc 7",
  "staircase": "Scara A",
  "apartment": "Apartament 6",
  "city": "Zărnești"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Does This Matter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, these addresses are typed by people—our customers—and they’re often inconsistent. Here’s why we need structured, parsed data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Formatting&lt;/strong&gt;
Ensures addresses follow a standard format for seamless processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Address Validation (Optional)&lt;/strong&gt;
Allows us to verify if the address is valid or exists.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To achieve this, we dissect the address into predefined groups like "street," "house," "postcode," etc., and then reassemble it into the desired sequence.&lt;/p&gt;
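
&lt;p&gt;To make the reassembly step concrete, here’s a minimal sketch in PHP (the field names mirror the JSON example above; the output ordering is an assumption for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

// Hypothetical helper: reassemble parsed address parts into one line.
// The ordering below is illustrative, not a fixed business rule.
function formatAddress(array $parts): string
{
    $order = ['street', 'block', 'staircase', 'apartment', 'postcode', 'city'];
    $segments = [];
    foreach ($order as $key) {
        if (!empty($parts[$key])) {
            $segments[] = $parts[$key];
        }
    }

    return implode(', ', $segments);
}

echo formatAddress([
    'street'    =&amp;gt; 'Strada Grui',
    'block'     =&amp;gt; 'Bloc 7',
    'staircase' =&amp;gt; 'Scara A',
    'apartment' =&amp;gt; 'Apartament 6',
    'city'      =&amp;gt; 'Zărnești',
]); // Strada Grui, Bloc 7, Scara A, Apartament 6, Zărnești
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;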

&lt;p&gt;&lt;strong&gt;Why Not Use Regular Expressions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At first glance, it sounds straightforward. If we enforce formatting for new customers or know they typically write addresses in a specific way, regular expressions might seem like a reasonable solution.&lt;/p&gt;

&lt;p&gt;However, let’s consider examples of Romanian addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STRADA GRUI, BLOC 7, SCARA A, APARTAMENT 6, ZĂRNEŞTI

BD. 21 DECEMBRIE 1989 nr. 93 ap. 50, CLUJ, 400124
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Romanian addresses are often complex, written in no particular order, and frequently omit elements like the postcode. Even the most sophisticated regular expressions struggle to handle such variability reliably.&lt;/p&gt;

&lt;p&gt;This is where AI models like &lt;code&gt;GPT-3.5 Turbo&lt;/code&gt; or &lt;code&gt;GPT-4o-mini&lt;/code&gt; shine—they can manage inconsistencies and complexities far beyond what static rules like regex can handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Smarter AI Workflow
&lt;/h2&gt;

&lt;p&gt;Yes, we’re developing an AI workflow that significantly improves on the traditional approach.&lt;/p&gt;

&lt;p&gt;Every machine learning project boils down to two essentials: data and model. And while models are important, data is far more critical.&lt;/p&gt;

&lt;p&gt;Models come pre-packaged, tested, and can be swapped out to see which one performs better. But the true game-changer is the quality of the data we provide to the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Role of Data in Machine Learning&lt;/strong&gt;&lt;br&gt;
Typically, we split our dataset into two or three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training Data:&lt;/strong&gt; Used to teach the model what it should do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation Data (optional):&lt;/strong&gt; Used to tune prompts and parameters without touching the test set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Data:&lt;/strong&gt; Used to evaluate how well the model performs on unseen examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this project, we want to split an address into its components—making this a classification task. To get good results with a large language model (LLM) like GPT, we need at least 100 examples in our training dataset.&lt;/p&gt;
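
&lt;p&gt;A simple way to produce such a split from a file of examples, sketched in PHP (the filenames and the 80/20 ratio are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

// Sketch: shuffle the examples, then cut roughly 80% for training
// and keep the remaining 20% for testing.
$examples = file('dataset.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
shuffle($examples);

$cut      = (int) floor(count($examples) * 0.8);
$training = array_slice($examples, 0, $cut);
$testing  = array_slice($examples, $cut);

file_put_contents('train.txt', implode(PHP_EOL, $training));
file_put_contents('test.txt', implode(PHP_EOL, $testing));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;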
&lt;h2&gt;
  
  
  Structuring the Dataset
&lt;/h2&gt;

&lt;p&gt;The raw format of our dataset doesn’t matter much, but I’ve chosen a structure that’s intuitive and easy to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT: &amp;lt;what goes into the LLM&amp;gt;
OUTPUT: &amp;lt;what the LLM should produce&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input: STRADA EREMIA GRIGORESCU, NR.11 BL.45B,SC.B NR 11/38, 107065 PLOIESTI
output: {"street": "Eremia Grigorescu", "house_number": "11", "flat_number": "38", "block": "45B", "staircase": "B", "floor": "", "apartment": "", "landmark": "", "postcode": "107065", "county": "Prahova", "commune": "", "village": "", "city": "Ploiesti"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the goal is to produce a structured &lt;code&gt;JSON&lt;/code&gt; response based on the input address.&lt;/p&gt;
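
&lt;p&gt;When we get to fine-tuning, each INPUT/OUTPUT pair is converted into OpenAI’s chat-format &lt;code&gt;.jsonl&lt;/code&gt;, one JSON object per line. The system message below is an illustrative placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"messages": [{"role": "system", "content": "Parse the Romanian address into JSON."}, {"role": "user", "content": "STRADA GRUI, BLOC 7, SCARA A, APARTAMENT 6, ZĂRNEŞTI"}, {"role": "assistant", "content": "{\"street\": \"Strada Grui\", \"block\": \"Bloc 7\", \"staircase\": \"Scara A\", \"apartment\": \"Apartament 6\", \"city\": \"Zărnești\"}"}]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;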

&lt;h2&gt;
  
  
  Setting Up Your OpenAI API Account
&lt;/h2&gt;

&lt;p&gt;First, you’ll need an OpenAI API account. It’s a straightforward process, and I recommend adding some initial funds—$10 or $20 is plenty to get started. OpenAI uses a convenient pre-paid subscription model, so you’ll have full control over your spending.&lt;/p&gt;

&lt;p&gt;You can manage your account billing here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyrw72fvcghrf7j1xkn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyrw72fvcghrf7j1xkn3.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploring the OpenAI Playground&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your account is set up, head over to the OpenAI Playground.&lt;/p&gt;

&lt;p&gt;Take a moment to familiarize yourself with the interface. When you’re ready, look for the "Assistants" section in the menu on the left.&lt;/p&gt;

&lt;p&gt;This is where we’ll create our custom GPT instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customizing Your GPT Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’ll tailor our GPT assistant to meet our specific needs in two stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Detailed System Instructions with Examples&lt;/strong&gt;&lt;br&gt;
This step helps us quickly test and validate whether the solution works. It’s the fastest way to see results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-Tuning with Knowledge Base (simple RAG)&lt;/strong&gt;&lt;br&gt;
Once we’re satisfied with the initial results, we’ll fine-tune the model. This process reduces the need to provide extensive examples in every prompt, which, in turn, decreases inference time (how long it takes for the model to respond via the API).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both steps combined give us the most accurate and efficient results.&lt;/p&gt;

&lt;p&gt;So, let's begin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing a System Prompt for Named Entity Recognition
&lt;/h2&gt;

&lt;p&gt;For our Named Entity Recognition system, we need the model to output structured JSON responses consistently. Crafting a thoughtful system prompt is critical to achieving this.&lt;/p&gt;

&lt;p&gt;Here’s what we’ll focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structuring the initial "dirty" system prompt to produce clear results.&lt;/li&gt;
&lt;li&gt;Refining the prompt later to improve efficiency and minimize costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Starting with the Few-Shot Technique
&lt;/h2&gt;

&lt;p&gt;The "few-shot" technique is a powerful approach for this task. It works by providing the model with a few examples of the desired input-output relationship. From these examples, the model can generalize and handle inputs it hasn’t seen before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Elements of a Few-Shot Prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The prompt consists of several parts, which should be thoughtfully structured for best results. While the exact order can vary, the following sections are essential to include:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Clear Objective
&lt;/h2&gt;

&lt;p&gt;The first part of the prompt provides a general overview of what we expect the model to accomplish.&lt;/p&gt;

&lt;p&gt;For example, in this use case, the input is a Romanian address, which may include spelling mistakes and incorrect formatting. This context is essential as it sets the stage for the model, explaining the kind of data it will process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2d9y1nltisrzbf79ag8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2d9y1nltisrzbf79ag8.jpg" alt="Image description" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Formatting instruction
&lt;/h2&gt;

&lt;p&gt;After defining the task, we guide the AI on how to format its output.&lt;/p&gt;

&lt;p&gt;Here, the JSON structure is explained in detail, including how the model should derive each key-value pair from the input. Examples are included to clarify expectations. Additionally, any special characters are properly escaped to ensure valid JSON.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vra0z0kaixjmcbq740x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vra0z0kaixjmcbq740x.jpg" alt="Image description" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Examples
&lt;/h2&gt;

&lt;p&gt;Examples are the backbone of the "few-shot" technique. The more relevant examples we provide, the better the model’s performance.&lt;/p&gt;

&lt;p&gt;Thanks to GPT’s extensive context window (128K tokens for &lt;code&gt;GPT-4o-mini&lt;/code&gt;), we can include a large number of examples.&lt;/p&gt;

&lt;p&gt;To build the example set:&lt;/p&gt;

&lt;p&gt;Start with a basic prompt and evaluate the model’s output manually.&lt;br&gt;
If the output contains errors, correct them and add the fixed version to the example set.&lt;br&gt;
This iterative process improves the assistant’s performance over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1akmr5g7fvqe8p5r02p5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1akmr5g7fvqe8p5r02p5.jpg" alt="Image description" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Corrective information
&lt;/h2&gt;

&lt;p&gt;It’s inevitable—your assistant will make mistakes. Sometimes, it might repeatedly make the same error.&lt;/p&gt;

&lt;p&gt;To address this, include clear and polite corrective information in your prompt. Clearly explain:&lt;/p&gt;

&lt;p&gt;What mistakes the assistant is making.&lt;br&gt;
What you expect from the output instead.&lt;br&gt;
This feedback helps the model adjust its behavior and generate better results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb92jdy7wqbyberwfun6i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb92jdy7wqbyberwfun6i.jpg" alt="Image description" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuring and Testing Your Assistant
&lt;/h2&gt;

&lt;p&gt;Now that we’ve crafted the initial prompt for our proof-of-concept few-shot solution, it’s time to configure and test the Assistant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create a New Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head over to the &lt;a href="https://platform.openai.com/" rel="noopener noreferrer"&gt;OpenAI dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;From the menu on the left, select "Assistants"&lt;/li&gt;
&lt;li&gt;You’ll be redirected to the Assistant creation page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Configure Your Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Give Your Assistant a Name: Choose a descriptive name for your Assistant that reflects its purpose (e.g., "Address Parser").&lt;/li&gt;
&lt;li&gt;Paste the System Instructions: Copy your entire crafted prompt and paste it into the System Instructions input box. This defines your Assistant’s behavior and guides its outputs.&lt;/li&gt;
&lt;li&gt;Save Your Assistant: Once you’ve pasted the prompt, click Save to store the configuration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Choose the Right Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this project, I chose GPT-4o-mini because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s cheaper than GPT-3.5 Turbo&lt;/li&gt;
&lt;li&gt;It's more accurate :)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, you should always select your model based on a balance between accuracy and cost. Run or search for benchmarks to determine what works best for your specific task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5kw0infr9t0bwdkefg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5kw0infr9t0bwdkefg7.png" alt="Assistant v2 System Instruction" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Defining the Output Schema
&lt;/h2&gt;

&lt;p&gt;With the initial configuration in place, we can now specify the output schema directly in the Assistant creator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Using the "Generate" Option&lt;/strong&gt;&lt;br&gt;
Instead of creating the schema manually, I used the "Generate" option provided by the Assistant creator. Here’s how:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the JSON output structure from your prompt (from one of the examples)&lt;/li&gt;
&lt;li&gt;Paste it into the "Generate" field.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tool does a great job of automatically generating a schema that matches your structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5ia84nfk6koyidxlz82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5ia84nfk6koyidxlz82.png" alt="Output schema" width="800" height="846"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure consistent and predictable responses, set the temperature as low as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwmenjpgetd9wny1t3ud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwmenjpgetd9wny1t3ud.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Temperature?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Temperature controls the randomness of the model’s responses. A low temperature value means the model produces more predictable and deterministic outputs.&lt;/p&gt;

&lt;p&gt;For our use case, this is exactly what we want. When given the same address as input, the model should always return the same, correct response. Consistency is key for reliable results.&lt;/p&gt;
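
&lt;p&gt;The same low temperature can also be set per run through the PHP client (the assistant ID is a placeholder; passing &lt;code&gt;temperature&lt;/code&gt; at run creation assumes the Assistants v2 API, which accepts it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$threadRun = $client-&amp;gt;threads()-&amp;gt;createAndRun([
    'assistant_id' =&amp;gt; 'asst_...', // placeholder ID
    'temperature'  =&amp;gt; 0.1,        // low value: near-deterministic output
    'thread'       =&amp;gt; [
        'messages' =&amp;gt; [
            ['role' =&amp;gt; 'user', 'content' =&amp;gt; $inputAddress],
        ],
    ],
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;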
&lt;h2&gt;
  
  
  Testing the Assistant
&lt;/h2&gt;

&lt;p&gt;With all the parameters in place, head to the Playground to test your solution.&lt;/p&gt;

&lt;p&gt;The Playground provides a console where you can put your Assistant to work. This is where the fun begins—you can rigorously test your bare-bones solution to uncover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the model is hallucinating (generating irrelevant or incorrect information).&lt;/li&gt;
&lt;li&gt;Potential shortcomings in your prompt or schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These findings will help you refine the &lt;strong&gt;Corrective Information&lt;/strong&gt; section of your prompt, making your Assistant more robust.&lt;/p&gt;

&lt;p&gt;Below is an example of a result from one of my manual tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT: BD. 21 DECEMBRIE 1989 nr. 93 ap. 50, CLUJ, 400124
OUTPUT:
{
  "street": "Bdul 21 Decembrie 1989",
  "house": "93",
  "flat": "50",
  "block": "",
  "staircase": "",
  "floor": "",
  "apartment": "",
  "landmark": "",
  "intercom": "",
  "postcode": "400124",
  "county": "Cluj",
  "commune": "",
  "village": "",
  "city": "Cluj"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Manual Testing Matters&lt;/strong&gt;&lt;br&gt;
Old-school, hands-on testing is the foundation of building a reliable solution. By evaluating the model’s outputs manually, you’ll quickly spot issues and understand how to fix them. While automation will come later, manual testing is an invaluable step in creating a solid proof-of-concept.&lt;/p&gt;
&lt;h2&gt;
  
  
  Plugging the Assistant into Your PHP Symfony Application
&lt;/h2&gt;

&lt;p&gt;Now it’s time to integrate everything into your PHP Symfony application. The setup is straightforward and follows a classic asynchronous architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F579672il6qe6tijg1ek7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F579672il6qe6tijg1ek7.jpg" alt="Overview of the Architecture" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a breakdown of the flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Frontend Interaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In general, what we have here is a classic asynchronous setup with a message queue. Two instances of the Symfony application run inside Docker containers; the first one interacts with the frontend client.&lt;/p&gt;

&lt;p&gt;For example, when a customer types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STR. IORGA NICOLAE nr. 20 et 1, CONSTANŢA, 900622
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The application processes the input address and packages it into a Message object.&lt;br&gt;
The Message object is then wrapped in a Symfony Messenger &lt;code&gt;Envelope&lt;/code&gt; instance. The message is serialized into &lt;code&gt;JSON&lt;/code&gt; format, with additional metadata for processing.&lt;/p&gt;

&lt;p&gt;Symfony Messenger is perfect for handling asynchronous tasks. It allows us to offload time-consuming operations, like calling the OpenAI API, to background processes.&lt;br&gt;
This approach ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Responsiveness:&lt;/strong&gt; The frontend remains fast and responsive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Tasks can be distributed across multiple workers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is the message class used for our system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

declare(strict_types=1);

namespace App\Messenger\Message\Customer;

use Symfony\Component\Validator\Constraints as Assert;

final class CustomerAddressAddedMessage
{
    public function __construct(
        #[Assert\NotBlank]
        #[Assert\Type('int')]
        public readonly int $customerId,
        #[Assert\NotBlank]
        #[Assert\Type('string')]
        public readonly string $inputAddress,
    ) {
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
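
&lt;p&gt;Dispatching this message from the frontend-facing application is then a one-liner with Symfony Messenger (the controller shape below is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

declare(strict_types=1);

namespace App\Controller;

use App\Messenger\Message\Customer\CustomerAddressAddedMessage;
use Symfony\Component\Messenger\MessageBusInterface;

final class AddressController
{
    public function __construct(private readonly MessageBusInterface $bus)
    {
    }

    public function addAddress(int $customerId, string $inputAddress): void
    {
        // Messenger wraps the message in an Envelope and routes it
        // to the transport configured in messenger.yaml.
        $this-&amp;gt;bus-&amp;gt;dispatch(new CustomerAddressAddedMessage($customerId, $inputAddress));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;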



&lt;p&gt;The application connects to a RabbitMQ instance using the amqp PHP extension. To set this up, you need to define the transport and message binding in your &lt;code&gt;messenger.yaml&lt;/code&gt; configuration file.&lt;/p&gt;
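
&lt;p&gt;A minimal &lt;code&gt;messenger.yaml&lt;/code&gt; for this setup might look like the following (the transport name and environment variable are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%' # e.g. an amqp:// DSN for RabbitMQ
        routing:
            App\Messenger\Message\Customer\CustomerAddressAddedMessage: async
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;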

&lt;p&gt;Refer to the official Symfony Messenger documentation for detailed guidance on transport configuration: &lt;a href="https://symfony.com/doc/current/messenger.html#transport-configuration" rel="noopener noreferrer"&gt;https://symfony.com/doc/current/messenger.html#transport-configuration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the message is pushed to the broker (e.g., RabbitMQ, AmazonMQ, or AWS SQS), it’s picked up by the second instance of the application. This instance runs the messenger daemon to consume messages, as marked &lt;strong&gt;3&lt;/strong&gt; in the architecture schema.&lt;/p&gt;

&lt;p&gt;The consuming process is handled by running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bin/console messenger:consume&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;which runs as a long-lived worker process.&lt;/p&gt;

&lt;p&gt;The daemon picks up the message from the configured queue, deserializes it back into the corresponding Message class, and forwards it to the Message Handler for processing.&lt;/p&gt;

&lt;p&gt;Here’s the &lt;strong&gt;Message Handler&lt;/strong&gt; where the interaction with the OpenAI Assistant occurs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

declare(strict_types=1);

namespace App\Messenger\MessageHandler\Customer;

use App\Entity\CustomerAddressNormalized;
use App\Messenger\Message\Customer\CustomerAddressAddedMessage;
use App\Service\Address\Normalization\OpenAI\OpenAIAddressNormalizationService;
use Doctrine\ORM\EntityManagerInterface;
use Psr\Log\LoggerInterface;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;

#[AsMessageHandler]
final class CustomerAddressAddedHandler
{
    public function __construct(
        private readonly LoggerInterface $logger,
        private readonly EntityManagerInterface $entityManager,
        private readonly OpenAIAddressNormalizationService $addressNormalizationService,
    ) {
    }

    public function __invoke(CustomerAddressAddedMessage $customerAddressAddedMessage): void
    {
        $this-&amp;gt;logger-&amp;gt;info(
            'Start processing "{message}", customerId:{customerId}.',
            [
                'message'    =&amp;gt; \get_class($customerAddressAddedMessage),
                'customerId' =&amp;gt; $customerAddressAddedMessage-&amp;gt;customerId,
            ],
        );

        $normalizationString = $this-&amp;gt;addressNormalizationService-&amp;gt;createAddressNormalization($customerAddressAddedMessage-&amp;gt;inputAddress);
        $addressNormalized   = new CustomerAddressNormalized(
            $customerAddressAddedMessage-&amp;gt;customerId,
            $customerAddressAddedMessage-&amp;gt;inputAddress,
            $normalizationString,
            false,
        );

        $this-&amp;gt;entityManager-&amp;gt;persist($addressNormalized);
        $this-&amp;gt;entityManager-&amp;gt;flush();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Points of the Handler&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging&lt;/strong&gt;&lt;br&gt;
Logs the start of processing for traceability and debugging.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Normalization Service&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;OpenAIAddressNormalizationService&lt;/code&gt; is called to process the input address via the Assistant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;br&gt;
The normalized address is saved in the database using Doctrine's &lt;code&gt;EntityManager&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The message handler may seem like standard Symfony code, but its core lies in this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private readonly OpenAIAddressNormalizationService $addressNormalizationService,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This service is responsible for interacting with OpenAI via the designated PHP library. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai-php/client" rel="noopener noreferrer"&gt;OpenAI PHP Client&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s dive into the implementation of the service with the usage of PHP client.&lt;/p&gt;

&lt;p&gt;Here’s the full implementation of the OpenAIAddressNormalizationService:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?php

declare(strict_types=1);

namespace App\Service\Address\Normalization\OpenAI;

use OpenAI;
use OpenAI\Client;
use OpenAI\Responses\Threads\Runs\ThreadRunResponse;
use Psr\Log\LoggerInterface;
use RuntimeException;

class OpenAIAddressNormalizationService
{
    private Client $openAIClient;

    public function __construct(
        private readonly string $openAIApiKey,
        private readonly string $openAIOrganizationID,
        private readonly LoggerInterface $logger,
    ) {
        $this-&amp;gt;openAIClient = OpenAI::client($this-&amp;gt;openAIApiKey, $this-&amp;gt;openAIOrganizationID);
    }

    public function createAddressNormalization(string $inputAddress): string
    {
        $assistant = $this-&amp;gt;openAIClient-&amp;gt;assistants()-&amp;gt;retrieve('asst_VKQLz1xaIY5cFP8dlk2SPZkg');
        $threadRun = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;createAndRun(
            [
                'assistant_id' =&amp;gt; $assistant-&amp;gt;id,
                'thread'       =&amp;gt; [
                    'messages' =&amp;gt; [
                        [
                            'role'    =&amp;gt; 'user',
                            'content' =&amp;gt; $inputAddress,
                        ],
                    ],
                ],
            ],
        );

        $normalizationResponse = $this-&amp;gt;loadAnswer($threadRun);
        $this-&amp;gt;logger-&amp;gt;warning($normalizationResponse);

        return json_encode($normalizationResponse, JSON_THROW_ON_ERROR);
    }

    private function loadAnswer(ThreadRunResponse $threadRun): string
    {
        while (in_array($threadRun-&amp;gt;status, ['queued', 'in_progress'], true)) {
            usleep(500000); // back off for 0.5s between polls instead of busy-waiting the API
            $threadRun = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;runs()-&amp;gt;retrieve(
                threadId: $threadRun-&amp;gt;threadId,
                runId: $threadRun-&amp;gt;id,
            );
        }

        if ('completed' !== $threadRun-&amp;gt;status) {
            throw new RuntimeException('OpenAI request failed, please try again.');
        }

        $messageList = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;messages()-&amp;gt;list(threadId: $threadRun-&amp;gt;threadId);

        return $messageList-&amp;gt;data[0]-&amp;gt;content[0]-&amp;gt;text-&amp;gt;value;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow for getting an inference (or completion)—essentially the response from GPT—using the &lt;code&gt;openai-php/client&lt;/code&gt; library follows three main stages:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Initialization of the OpenAI Client and Assistant retrieval
&lt;/h2&gt;

&lt;p&gt;The first step is setting up the OpenAI client and retrieving the configured Assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$this-&amp;gt;openAIClient = OpenAI::client($this-&amp;gt;openAIApiKey, $this-&amp;gt;openAIOrganizationID);

$assistant = $this-&amp;gt;openAIClient-&amp;gt;assistants()-&amp;gt;retrieve('asst_VKQLz1xaIY5cFP8dlk2SPZkg');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client Initialization:&lt;/strong&gt; The &lt;code&gt;OpenAI::client()&lt;/code&gt; method creates a new client using your &lt;code&gt;API key&lt;/code&gt; and &lt;code&gt;organization ID&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assistant Retrieval:&lt;/strong&gt; The &lt;code&gt;retrieve()&lt;/code&gt; method connects to the specific Assistant instance configured for your task, identified by its unique ID.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Creating and Running a Thread
&lt;/h2&gt;

&lt;p&gt;Once the client and Assistant are initialized, you can create and run a thread to begin the interaction. A thread acts as the communication pipeline, handling the exchange of messages between the user and the Assistant.&lt;/p&gt;

&lt;p&gt;Here’s how the thread is initiated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$threadRun = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;createAndRun([
    'assistant_id' =&amp;gt; $assistant-&amp;gt;id,
    'thread'       =&amp;gt; [
        'messages' =&amp;gt; [
            [
                'role'    =&amp;gt; 'user',
                'content' =&amp;gt; $inputAddress,
            ],
        ],
    ],
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assistant ID&lt;/strong&gt;: The ID of the Assistant determines which model instance will process the input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages:&lt;/strong&gt; Each thread includes a sequence of messages. Every message has a &lt;code&gt;role&lt;/code&gt;, which indicates whether it comes from the user (input) or the assistant (response), and &lt;code&gt;content&lt;/code&gt;, which carries the actual input text (e.g., an address).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Handling the Thread Response
&lt;/h2&gt;

&lt;p&gt;After initiating the thread, the application handles the response. Since OpenAI processes may not complete instantly, you need to poll the thread’s status until it’s marked as 'completed':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while (in_array($threadRun-&amp;gt;status, ['queued', 'in_progress'])) {
    $threadRun = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;runs()-&amp;gt;retrieve(
        threadId: $threadRun-&amp;gt;threadId,
        runId: $threadRun-&amp;gt;id,
    );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Polling:&lt;/strong&gt; The application repeatedly checks the run’s status (&lt;code&gt;queued&lt;/code&gt; or &lt;code&gt;in_progress&lt;/code&gt;). Once the status changes to &lt;code&gt;completed&lt;/code&gt;, the thread contains the final output. &lt;/p&gt;

&lt;p&gt;Once the thread is complete, the application retrieves the response messages to extract the normalized address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$messageList = $this-&amp;gt;openAIClient-&amp;gt;threads()-&amp;gt;messages()-&amp;gt;list(threadId: $threadRun-&amp;gt;threadId);
return $messageList-&amp;gt;data[0]-&amp;gt;content[0]-&amp;gt;text-&amp;gt;value;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The normalized address can now be returned to the frontend client for immediate use or stored in the database as a structured entity, like &lt;code&gt;CustomerAddressNormalized&lt;/code&gt;, for future processing.&lt;/p&gt;

&lt;p&gt;When you run this setup, you should be able to extract and save structured output from OpenAI for Named Entity Recognition and other classification or generation tasks.&lt;/p&gt;
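&lt;p&gt;One practical caveat: the Assistant returns its JSON as plain message text, so it pays to guard the parsing step, since models occasionally wrap the object in extra prose. A small defensive sketch in Python (the field names here are purely illustrative):&lt;/p&gt;

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse the Assistant's reply, tolerating stray text around the JSON object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Example reply with prose wrapped around the payload:
reply = 'Here is the result: {"street": "Strada Unirii", "postal_code": "240454"}'
print(parse_model_json(reply)["postal_code"])  # 240454
```

&lt;p&gt;If the parse fails, re-running the thread or logging the raw reply for inspection is usually cheaper than trying to repair the text heuristically.&lt;/p&gt;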

&lt;h2&gt;
  
  
  Make the Assistant Blazingly Accurate with Fine-Tuning and a Knowledge Base (Retrieval-Augmented Generation)
&lt;/h2&gt;

&lt;p&gt;In some cases, basic AI solutions aren’t enough. When regulatory compliance and business requirements demand high accuracy, we need to go the extra mile to ensure the output is factual and reliable.&lt;/p&gt;

&lt;p&gt;For example, when generating a JSON structure, we need to guarantee that its contents align with reality. A common risk is the model hallucinating information—like inventing a place based on a provided postal code. This can cause serious issues, particularly in high-stakes environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground Truth with Knowledge Base&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To eliminate hallucinations and ensure factual accuracy, we need to supply the Assistant with a ground truth knowledge base. This acts as the definitive reference for the model, ensuring it uses accurate information during inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Approach: A Postal Code Knowledge Base&lt;/strong&gt;&lt;br&gt;
I created a simple (but quite large, around 12 MB) &lt;code&gt;JSON&lt;/code&gt; file containing a complete mapping of all postal codes in Romania. This dictionary-like structure provides:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postal Code:&lt;/strong&gt; The input value.&lt;br&gt;
&lt;strong&gt;Validated Information:&lt;/strong&gt; Facts like the corresponding city, county, and overall place name.&lt;br&gt;
This knowledge base serves as a reference point for the Assistant when performing Named Entity Recognition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure of the Knowledge Base&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s an example snippet of the knowledge base &lt;code&gt;JSON&lt;/code&gt; structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "ro_places": [
    {
        "place_mark": "Râmnicu Vâlcea, Vâlcea, 240454, RO",
        "country_code": "RO",
        "place_name": "Râmnicu Vâlcea",
        "county": "Vâlcea",
        "postal_code": "240454",
        "admin_code": "39",
        "coordinates": {
            "lat": 45.1,
            "lon": 24.3667
        }
    },
    {
        "place_mark": "Râmnicu Vâlcea, Vâlcea, 240493, RO",
        "country_code": "RO",
        "place_name": "Râmnicu Vâlcea",
        "county": "Vâlcea",
        "postal_code": "240493",
        "admin_code": "39",
        "coordinates": {
            "lat": 45.1,
            "lon": 24.3667
        }
    },
    ... // around 40,000 postal codes in total
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
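&lt;p&gt;As a side benefit, a dictionary like this also supports a plain programmatic lookup, which is handy for spot-checking the Assistant's answers against the ground truth. A minimal sketch in Python (the in-memory sample below reuses the two entries shown above):&lt;/p&gt;

```python
def build_postal_index(knowledge_base: dict) -> dict:
    """Index the knowledge base by postal code for O(1) lookups."""
    return {place["postal_code"]: place for place in knowledge_base["ro_places"]}

# Two entries from the snippet above, trimmed for illustration:
kb = {
    "ro_places": [
        {"place_name": "Râmnicu Vâlcea", "county": "Vâlcea", "postal_code": "240454"},
        {"place_name": "Râmnicu Vâlcea", "county": "Vâlcea", "postal_code": "240493"},
    ]
}

index = build_postal_index(kb)
print(index["240454"]["county"])  # Vâlcea
```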



&lt;p&gt;With the knowledge base ready, it’s time to instruct the Assistant to use it effectively. While you could embed this directly into the prompt, this approach increases token usage and costs. This makes it an excellent moment to introduce &lt;strong&gt;fine-tuning&lt;/strong&gt;. &lt;/p&gt;
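&lt;p&gt;A rough back-of-the-envelope calculation shows why embedding the whole file in the prompt is a non-starter (assuming the common heuristic of roughly 4 characters per token; real tokenizer counts vary):&lt;/p&gt;

```python
def approx_tokens(num_chars: int) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return num_chars // 4

kb_size_chars = 12 * 1024 * 1024  # the ~12 MB knowledge base file
print(approx_tokens(kb_size_chars))  # roughly 3.1 million tokens, far beyond any context window
```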

&lt;h2&gt;
  
  
  What is Fine-Tuning?
&lt;/h2&gt;

&lt;p&gt;Fine-tuning continues training a pre-trained model on task-specific examples, adjusting its weights to adapt it to that task. In our case, Named Entity Recognition (NER) is a perfect candidate.&lt;/p&gt;

&lt;p&gt;A fine-tuned model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Requires Smaller Prompts:&lt;/strong&gt; Reducing the need for lengthy instructions or embedded examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cuts Costs:&lt;/strong&gt; Shorter prompts and fewer examples lower token consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improves Inference Time:&lt;/strong&gt; Faster responses mean better real-time performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main goal is to align the model more closely with the nuances of the real-world data it will process. Training on domain-specific data makes the model better at understanding and generating appropriate responses. It also reduces the need for post-processing or error correction by the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the Fine-Tuning Dataset
&lt;/h2&gt;

&lt;p&gt;To fine-tune the model, we need a properly formatted dataset in &lt;code&gt;.jsonl&lt;/code&gt; (JSON Lines) format, as required by OpenAI. Each entry in the dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System Prompt:&lt;/strong&gt; The initial instruction for the Assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Input:&lt;/strong&gt; The raw input data (e.g., a messy address).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected Output:&lt;/strong&gt; The desired structured response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This dataset provides the Assistant with domain-specific examples, enabling it to learn how to respond to similar prompts in the future.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to structure a fine-tuning entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "messages":[
      {
         "role":"system",
         "content":"You are a system that &amp;lt;does something&amp;gt;. Use the knowledge base provided for accurate results."
      },
      {
         "role":"user",
         "content":"TOOLS \n-------\n\n\n\nRESPONSE FORMAT INSTRUCTIONS\n--------------------\n\n\n--------------------\nUSER'S INPUT\n--------------------\n"
      },
      {
         "role":"assistant",
         "content":"{\"key\": \"value\"}"
      }
   ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break down the structure of a fine-tuning entry in the &lt;code&gt;.jsonl&lt;/code&gt; file, using our Romanian address processing case as an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkri30u7mg4fbfrzdm8d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkkri30u7mg4fbfrzdm8d.jpg" alt="Image description" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each entry is designed to simulate a real interaction between the user and the Assistant, encapsulating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Assistant’s purpose and scope.&lt;/li&gt;
&lt;li&gt;The user’s input scenario.&lt;/li&gt;
&lt;li&gt;The expected structured output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach helps the model learn the desired behavior in real-world contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. System Role Message&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Describes the assistant’s capabilities and the scope of its functionality, setting expectations for the types of entities it should recognize and extract, such as street names, house numbers, and postal codes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Example: The system explains that the assistant is designed to function as a Named Entity Recognition model for Romanian addresses, detailing the components it should extract and classify.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. User Role Message&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides a detailed scenario or query in which the user gives a specific address input. This part of the data entry is crucial as it directly influences how the model will learn to respond to similar inputs in operational settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Assistant Role Message&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contains the expected response from the assistant, formatted in JSON. This response is crucial as it trains the model on the desired output format and precision.&lt;/li&gt;
&lt;/ul&gt;
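&lt;p&gt;Before uploading, it's worth sanity-checking that every line of the &lt;code&gt;.jsonl&lt;/code&gt; file follows this three-message shape, since a single malformed line can fail the whole training job. A minimal, hypothetical pre-flight check (not part of the linked repository):&lt;/p&gt;

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]

def validate_entry(line: str) -> bool:
    """Check that one .jsonl line is valid JSON with the expected message roles in order."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False
    roles = [m.get("role") for m in entry.get("messages", [])]
    return roles == REQUIRED_ROLES

good = '{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"{}"}]}'
print(validate_entry(good))   # True
print(validate_entry('{"messages":[]}'))  # False
```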

&lt;h2&gt;
  
  
  Creating a Validation File for Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Once you’ve created the training file, the next step is to prepare a validation file. This file evaluates the accuracy of the fine-tuned Assistant on real data. The structure is similar to the training file, which makes automating the creation of both files more straightforward.&lt;/p&gt;

&lt;p&gt;The validation file is designed to test the model’s generalization capabilities. It ensures that the fine-tuned Assistant can handle new inputs and perform consistently, even when faced with unfamiliar or challenging examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structure of the Validation File
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthmimvpn44sz3t7ik4rt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthmimvpn44sz3t7ik4rt.jpg" alt="Image description" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Message:&lt;/strong&gt;&lt;br&gt;
To maintain consistency with the training process, the system message should be &lt;strong&gt;identical&lt;/strong&gt; to the one used in the training file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Message:&lt;/strong&gt;&lt;br&gt;
Introduces a &lt;strong&gt;new input&lt;/strong&gt; (e.g., an address) that was not included in the training file. The new example should be realistic and domain-specific, as well as challenge the model by including minor errors or complexities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assistant Message:&lt;/strong&gt;&lt;br&gt;
Provides the expected structured response for the new user input.&lt;br&gt;
This serves as the gold standard for validating the model’s accuracy.&lt;/p&gt;

&lt;p&gt;To streamline the creation of both training and validation files, automation scripts can be used. These scripts generate properly formatted &lt;code&gt;.jsonl&lt;/code&gt; files based on input datasets.&lt;/p&gt;

&lt;p&gt;Visit the repository containing scripts to generate training and validation files:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mateuszpoland/openai-gpt-finetuning/blob/master/fine_tuning.py" rel="noopener noreferrer"&gt;Automated .jsonl generation scripts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a Python implementation of the fine-tuning automation; a PHP version is coming soon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Validation Matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy Testing:&lt;/strong&gt; Measures how well the fine-tuned model performs on unseen data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalization:&lt;/strong&gt; Validates the model’s ability to handle new, complex, or error-prone inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback Loop:&lt;/strong&gt; Helps identify areas where additional fine-tuning or data is needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Fine-Tuning Your Assistant Manually
&lt;/h2&gt;

&lt;p&gt;If you prefer, you can fine-tune the Assistant manually using OpenAI's graphical user interface (GUI). Once you’ve prepared both the training and validation files, follow these steps to get started:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Access the Fine-Tuning Wizard&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the Dashboard and click on "Fine-Tuning" in the menu on the left-hand side.&lt;/li&gt;
&lt;li&gt;Click the green "Create" button to open the fine-tuning menu.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49fjl0upetttwpy2w3ke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49fjl0upetttwpy2w3ke.png" alt="Image description" width="454" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Configure Fine-Tuning&lt;/p&gt;

&lt;p&gt;In the fine-tuning menu, update the following settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base model: At the time of writing, the most cost-effective and high-performing base model is &lt;code&gt;gpt-4o-mini-2024-07-18&lt;/code&gt;. Choose the base model that best aligns with your performance and budget requirements.&lt;/li&gt;
&lt;li&gt;Training and Validation Data - Upload the training data and validation data files you created earlier.&lt;/li&gt;
&lt;li&gt;Number of epochs - Set the number of epochs (full passes of the training process over the entire dataset). I recommend starting with 3 epochs, but you can experiment with more if your budget allows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Monitor the Fine-Tuning Process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the fine-tuning process starts, you can track its progress in the Fine-Tuning Dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6sym0vzmvlo3ycbceii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6sym0vzmvlo3ycbceii.png" alt="Image description" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard provides real-time updates and displays several metrics to help you monitor and evaluate the training process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Understand&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training Loss:&lt;/strong&gt; Measures how well the model is fitting the training data. A lower training loss indicates the model is effectively learning patterns within the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full Validation Loss:&lt;/strong&gt; Indicates performance on unseen data from the validation dataset. A lower validation loss suggests the model will generalize well to new inputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Steps and Time:&lt;/strong&gt; Training steps are the number of iterations in which model weights are updated on batches of data.&lt;br&gt;
The timestamp indicates when each step was processed, which helps to monitor the duration of training and the intervals between evaluations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interpreting the Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring the metrics below helps you determine whether the fine-tuning process is converging correctly or requires adjustments.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decreasing Loss:&lt;/strong&gt; Both training and validation losses should ideally decrease over time, eventually stabilizing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overfitting:&lt;/strong&gt; Occurs when training loss keeps decreasing while validation loss starts to increase or fluctuate. This indicates the model is over-specialized to the training data and performs poorly on unseen data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underfitting:&lt;/strong&gt; Happens when both losses remain high, showing the model isn’t learning effectively.&lt;/li&gt;
&lt;/ul&gt;
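&lt;p&gt;These rules of thumb can even be turned into a quick automated check. A toy sketch that flags overfitting from two recorded loss curves (illustrative only; real decisions should be based on the full dashboard curves):&lt;/p&gt;

```python
def looks_overfit(train_loss: list, val_loss: list, window: int = 3) -> bool:
    """Flag overfitting: training loss still falling while validation loss
    rises over the last `window` recorded steps."""
    train_falling = train_loss[-1] < train_loss[-window]
    val_rising = val_loss[-1] > val_loss[-window]
    return train_falling and val_rising

# Healthy run: both losses decrease together.
print(looks_overfit([1.2, 0.9, 0.7, 0.6], [1.3, 1.0, 0.8, 0.7]))  # False
# Overfit run: training keeps improving while validation degrades.
print(looks_overfit([1.2, 0.8, 0.5, 0.3], [1.1, 0.9, 1.0, 1.2]))  # True
```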

&lt;h2&gt;
  
  
  Plug in the Model &amp;amp; Evaluate
&lt;/h2&gt;

&lt;p&gt;With your Assistant trained and fine-tuned, it’s time to integrate it into your application and start evaluating its performance.&lt;/p&gt;

&lt;p&gt;Start Simple.&lt;/p&gt;

&lt;p&gt;Begin with manual tests, either in the Assistants Playground or directly within your application. Avoid overcomplicating your evaluation at this stage; focus on ensuring the basics work as expected.&lt;/p&gt;

&lt;p&gt;For example, you can use a simple tool like Google Sheets to manually compare the input and output, as shown here:&lt;/p&gt;
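&lt;p&gt;Once you outgrow the spreadsheet, the same comparison is easy to script. A sketch of a field-level exact-match accuracy check (the field names are illustrative, not the article's actual schema):&lt;/p&gt;

```python
def field_accuracy(expected: list, produced: list, fields: list) -> float:
    """Share of (row, field) pairs where the model's output matches the gold answer exactly."""
    hits = total = 0
    for exp, got in zip(expected, produced):
        for field in fields:
            total += 1
            hits += exp.get(field) == got.get(field)
    return hits / total if total else 0.0

gold = [{"city": "Râmnicu Vâlcea", "postal_code": "240454"}]
model = [{"city": "Râmnicu Vâlcea", "postal_code": "240493"}]
print(field_accuracy(gold, model, ["city", "postal_code"]))  # 0.5
```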

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9khdi4w7vqmza5liwfam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9khdi4w7vqmza5liwfam.png" alt="Image description" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the first step.&lt;br&gt;
AI and ML solutions require ongoing evaluation to ensure performance remains consistent. For tasks like Named Entity Recognition, automated tools such as the PromptFlow extension for Visual Studio Code can help streamline testing and validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://platform.openai.com/docs/guides/fine-tuning" rel="noopener noreferrer"&gt;OpenAI Fine-Tuning Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://symfony.com/doc/current/messenger.html" rel="noopener noreferrer"&gt;Symfony Messenger Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/openai-php/client" rel="noopener noreferrer"&gt;OpenAI PHP Client Library&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/microsoft/promptflow" rel="noopener noreferrer"&gt;PromptFlow Extension for VS Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank You!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Thank you for taking the time to explore this guide! I hope it provides you with a solid foundation to build and fine-tune your AI-powered solutions. If you have any questions or suggestions, feel free to reach out or leave a comment.&lt;/p&gt;

&lt;p&gt;Happy coding and best of luck with your AI projects! 🚀&lt;/p&gt;

</description>
      <category>llm</category>
      <category>php</category>
      <category>gpt3</category>
    </item>
    <item>
      <title>PART 1: Microservice Prospect Generator System (Tools Involved: Symfony, Python, RabbitMQ, MySQL, Docker, ChatGPT)</title>
      <dc:creator>Mateusz Kowalewski</dc:creator>
      <pubDate>Fri, 29 Sep 2023 17:50:44 +0000</pubDate>
      <link>https://dev.to/mateuszpoland/part-1-microservice-prospect-generator-system-tools-involved-symfony-python-rabbitmq-mysql-docker-chatgpt-36io</link>
      <guid>https://dev.to/mateuszpoland/part-1-microservice-prospect-generator-system-tools-involved-symfony-python-rabbitmq-mysql-docker-chatgpt-36io</guid>
      <description>&lt;h2&gt;
  
  
  The Challenge of Sourcing Prospects
&lt;/h2&gt;

&lt;p&gt;Every business grapples with the challenge of sourcing prospects. While there are tons of solutions on the market currently, each comes with its limitations. Learning curves, tool-specific constraints, and sometimes unmet expectations are common pitfalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost vs. Quality of Leads
&lt;/h2&gt;

&lt;p&gt;Investing money in a business is a given, and the costs can quickly add up, especially at the beginning. But it's the quality of leads you get in return that really counts. Outsourcing your customer sourcing raises questions: Is the quality of your leads a priority for them? Is the data fresh and relevant?&lt;/p&gt;

&lt;p&gt;From my perspective, the prospecting process should be transparent and hands-on. It's essential for entrepreneurs to trace back every customer's origin.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of the Described System
&lt;/h2&gt;

&lt;p&gt;Despite the challenges, the system I've elaborated below is able to yield over 10,000 quality contacts within an hour, specifically tailored for the gym niche across several US states. And the good part is that it can be applied to any niche that can be found on Google (as of the moment).&lt;/p&gt;

&lt;h2&gt;
  
  
  BEWARE
&lt;/h2&gt;

&lt;p&gt;This tool can generate a very high volume of prospects very fast. But the API I described uses the Google API under the hood. &lt;br&gt;
YOU NEED TO fire it up in portions and set up a budget in the Google Cloud account where you use your API. This is just a reminder for future reference, in case you build this project yourself from the ground up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Can Benefit?
&lt;/h2&gt;

&lt;p&gt;Software developers making the shift to entrepreneurship, or business-savvy individuals with a developer ally, will find this guide invaluable. Not only will you acquire an Excel/CSV file filled with leads, but also a step-by-step guide on utilizing ChatGPT for efficient lead sourcing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seeking Ready-Made Prospects for Your Venture?
&lt;/h2&gt;

&lt;p&gt;Exploring a new niche or aiming to expand your current one? I've put together some ready-to-use files for the gym and plumbing/foundation repair sectors on my website. If you're looking outside these niches, fill out a straightforward form detailing your needs, and I'll promptly generate new prospects for you – completely free. Your growth is important to me; let's navigate it together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently:&lt;/strong&gt; The system gleans leads from Google, but I'm on the path to incorporating LinkedIn, Instagram, and more.&lt;/p&gt;

&lt;p&gt;Essential Skills for Developers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proficiency In:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker&lt;br&gt;
RabbitMQ&lt;br&gt;
Microservice patterns: Understand their operations, interactions via message queues, and HTTP calls.&lt;br&gt;
APIs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Familiarity With:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PHP &amp;amp; Symfony Framework&lt;br&gt;
Python&lt;br&gt;
Bash scripting&lt;br&gt;
ChatGPT prompting&lt;/p&gt;

&lt;h2&gt;
  
  
  A Glimpse into the System's Architecture
&lt;/h2&gt;

&lt;p&gt;The system functions through three primary stages.&lt;/p&gt;

&lt;p&gt;Here is just a miniature, so we are on the same page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rJZ4zX8U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3ughoty2g7knt6mzi38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rJZ4zX8U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3ughoty2g7knt6mzi38.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For an intricate look, here's the full-resolution Miro board:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://miro.com/app/board/uXjVMiSun6Y=/?share_link_id=994343205773"&gt;https://miro.com/app/board/uXjVMiSun6Y=/?share_link_id=994343205773&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll methodically dissect each segment, beginning with the ChatGPT stage. Here, you'll master the art of prompting ChatGPT for diverse formatted outputs, extending to website content scraping. For the ensuing stages, I harness an API that sits at the very core of the system and is responsible for fetching business details from Google Maps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Location Nuances:&lt;/strong&gt; Many places share identical names, a phenomenon especially evident in countries like the US.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoint Breakdown:
&lt;/h2&gt;

&lt;p&gt;For clarity, let's decode the following API endpoint:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;http://localhost:8080/leads/England/Aley+Green(Bedfordshire)?keyword=gym&amp;amp;keywords=fitness%20center&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Where we have 3 moving parts (parameters) we need to provide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Country:&lt;/strong&gt; England&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Place name &amp;amp; Administrative Area:&lt;/strong&gt; Aley Green, Bedfordshire&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search Criteria:&lt;/strong&gt; Gyms and fitness centers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result? A JSON array of venues fitting our specified criteria.&lt;/p&gt;
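&lt;p&gt;For reference, a client could assemble that URL programmatically. A sketch in Python (host, port, and parameter names are taken from the example above; the encoding it produces is an equivalent percent-encoded form of the URL shown):&lt;/p&gt;

```python
from urllib.parse import quote, urlencode

def build_leads_url(country: str, place: str, admin_area: str,
                    keyword: str, extra_keywords: str) -> str:
    """Build the leads endpoint URL from its three moving parts."""
    path = f"/leads/{quote(country)}/{quote(f'{place}({admin_area})')}"
    query = urlencode({"keyword": keyword, "keywords": extra_keywords})
    return f"http://localhost:8080{path}?{query}"

print(build_leads_url("England", "Aley Green", "Bedfordshire", "gym", "fitness center"))
```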

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dCza5eMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7spuv47mfz3r45y1b18e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dCza5eMt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7spuv47mfz3r45y1b18e.png" alt="Image description" width="800" height="759"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating the Process: How Far Have We Come?
&lt;/h2&gt;

&lt;p&gt;Though we're yet to achieve full automation with this solution, the current system exemplifies rapid, formatted data extraction via ChatGPT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tapping into ChatGPT for Swift Data Formatting
&lt;/h2&gt;

&lt;p&gt;I targeted the potent markets of the UK and US due to their immense opportunities for digital services.&lt;/p&gt;

&lt;p&gt;So, the first prompt to ChatGPT is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--co5DGrvB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3htg68gnskkoibnwym92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--co5DGrvB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3htg68gnskkoibnwym92.png" alt="Image description" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;as I didn't feel like googling it when I can have everything in one tool. Since I was sure that counties are what England is built of, I asked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TVFTESQi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bwjy7fzy5tghht0p8zmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TVFTESQi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bwjy7fzy5tghht0p8zmq.png" alt="Image description" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this gave me a good list of 48 counties.&lt;/p&gt;

&lt;p&gt;Now, we could obviously ask one by one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GNbY3yFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r99ssnaty4pdibkk9p2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GNbY3yFE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r99ssnaty4pdibkk9p2q.png" alt="Image description" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But ChatGPT is not that good at extracting such long lists in a structured format from its "internals". &lt;br&gt;
The task we gave it - generating a list of all cities - is tedious and requires a lot of prodding to "generate more"; the chat often freezes and crashes.&lt;/p&gt;

&lt;p&gt;I tried to give this task to BARD, to see if it can generate such a comprehensive dataset from the get-go.&lt;br&gt;
For this prompt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---O2KjBt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/42th9rb0yy91saeilzp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---O2KjBt5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/42th9rb0yy91saeilzp5.png" alt="Image description" width="627" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BARD generated only a handful of places. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZnEfAZO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yffltdt68jua4m4qa7c9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZnEfAZO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yffltdt68jua4m4qa7c9.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When nudged further, it again generated just a few examples - take a look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ld2KyMgN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v72amns51hpme96ezm72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ld2KyMgN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v72amns51hpme96ezm72.png" alt="Image description" width="754" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Maybe these responses could be made longer with a bit of prompt engineering and more "few-shot" examples, but we would never know whether the AI skipped some important places filled with juicy prospects.&lt;/p&gt;

&lt;p&gt;So, we need to move on.&lt;br&gt;
Let me show you the approach that worked well in this situation and can be a good foundation for your experiments.&lt;br&gt;
We will make ChatGPT just one part of the data processing engine.&lt;/p&gt;

&lt;p&gt;At this point, I would recommend ChatGPT Plus, the paid version. It's the only way ChatGPT can access the internet, through plugins. While Google's BARD is an alternative, I won't delve into it in this piece. If you're not already using ChatGPT Plus, give it a whirl. For just $20/month, with some acclimatization, it can drastically reduce your workload. &lt;br&gt;
Here I have my ChatGPT instance running with plugins enabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oyaPIIjK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1qwor22ss3c5byoel1ly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oyaPIIjK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1qwor22ss3c5byoel1ly.png" alt="Image description" width="367" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plugin we actually need is &lt;strong&gt;Webpilot&lt;/strong&gt;. It allows ChatGPT to process data from websites (HTML) as well as text files such as PDFs.&lt;/p&gt;

&lt;p&gt;In case you need a guide on how to set this up, pause reading and head off to: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://devopscube.com/chatgpt-plugins/"&gt;https://devopscube.com/chatgpt-plugins/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I did was find a source of reputable information to give ChatGPT as a reference. &lt;br&gt;
I used Google, and for England I found that Wikipedia has a page for each county, listing all the places of interest in that particular county. As we know, Wikipedia is a tried and tested resource, so I have no problem trusting that all the places are there - no need to worry that ChatGPT will leave some out.&lt;/p&gt;

&lt;p&gt;For the first county on our list, it's:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/List_of_places_in_Bedfordshire"&gt;https://en.wikipedia.org/wiki/List_of_places_in_Bedfordshire&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the template for these URLs is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://en.wikipedia.org/wiki/List_of_places_in_&amp;lt;county&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When you actually visit this url, you'll see that the page is neatly structured. It lists every place in Bedfordshire county in alphabetical order.&lt;/p&gt;
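&lt;p&gt;With that template in hand, generating the page address for every county is trivial. A quick Python sketch (the sample county names are just illustrative input):&lt;/p&gt;

```python
# Build the Wikipedia "List of places" URL for each county,
# following the template shown above.
TEMPLATE = "https://en.wikipedia.org/wiki/List_of_places_in_{}"

def county_url(county: str) -> str:
    # Multi-word county names use underscores in Wikipedia page titles.
    return TEMPLATE.format(county.replace(" ", "_"))

for county in ["Bedfordshire", "Berkshire", "Greater Manchester"]:
    print(county_url(county))
```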

&lt;p&gt;Then we go back to ChatGPT and enter this prompt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NaT7FOC8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wl8si3uv07tf0obknmm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NaT7FOC8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wl8si3uv07tf0obknmm3.png" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No advanced, fancy prompt engineering is needed here.&lt;br&gt;
However, it always pays off to describe the requirements in detail, even stating them from two different angles, and to include at least one example of the output we want.&lt;/p&gt;

&lt;p&gt;When the AI understands us clearly enough, it generates the data we need, in the desired format, at blazing speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fwm2-hQJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fsewgwj5z1fdmt8iy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fwm2-hQJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fsewgwj5z1fdmt8iy3.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that the ChatGPT machinery is properly set up for the job, we can easily provide the links to the next counties and generate .csv files within minutes. &lt;br&gt;
I provided a second link, for Berkshire this time, and shortly after I had a list of 369 places to scrape prospects from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hv9VUHBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pg77fbvbyp2rm7cfiwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hv9VUHBo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pg77fbvbyp2rm7cfiwt.png" alt="Image description" width="326" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve done the task once, you can easily scale it, saving time and accelerating results.&lt;/p&gt;
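&lt;p&gt;Scaling this step is essentially a loop: for each place in a county's .csv, emit one call to the lead endpoint. A hedged Python sketch - the CSV layout and the helper name are illustrative assumptions, not the exact files from my runs:&lt;/p&gt;

```python
import csv
import io
from urllib.parse import quote

# Illustrative CSV in the place,county shape produced by ChatGPT.
sample_csv = """place,county
Aley Green,Bedfordshire
Ampthill,Bedfordshire
Arlesey,Bedfordshire
"""

def scrape_targets(csv_text, country, keyword):
    """Yield one API path per place - one unit of scraping work."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        segment = f"{row['place']}({row['county']})".replace(" ", "+")
        yield f"/leads/{country}/{quote(segment, safe='+()')}?keyword={quote(keyword)}"

targets = list(scrape_targets(sample_csv, "England", "gym"))
print(len(targets), "targets, e.g.", targets[0])
```

&lt;p&gt;From there, feeding each generated path to the API in sequence (or via a worker queue) is what turns a one-off experiment into a repeatable pipeline.&lt;/p&gt;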

&lt;h2&gt;
  
  
  INTERESTED IN MORE DETAILS ?
&lt;/h2&gt;

&lt;p&gt;If you want the series to be continued, raise your hand and I will prepare the next parts. &lt;br&gt;
Specifically, how to tie this architecture together between components written in two different programming languages - Python &amp;amp; PHP. I can dive into the details and give you ready recipes for connecting two different systems via a message queue.&lt;/p&gt;

&lt;p&gt;Let me know in the comments. &lt;br&gt;
Thank you.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
