Johann Hagerer

Collaborative GenAI Projects - Simple Best Practices

This article provides a simple template in case you want to experiment with LLM workflows. As an example, we load unstructured data, such as PDF files, into tables as structured data using Python. As a preparation, you might want to read this earlier article of mine before you go ahead: LLM Coding Concepts: Static Typing, Structured Output, and Async.

Project Directory Structure

project-name/
├── data/
│   ├── logs/
│   │   └── {username}.jsonl    # logs from your LLM calls, append only
│   ├── traces/
│   │   └── {username}.jsonl    # traces from your function calls, append only
│   ├── raw/
│   │   ├── document01.pdf
│   │   ├── document02.pdf
│   │   └── ...
│   └── transformed/
│       ├── 01_extracted_texts.parquet
│       └── 02_texts_with_markup.parquet
├── data_ops/
│   ├── helpers/
│   └── transformations/        # scripts to convert one dataset into another
│       ├── 01_extract_texts.py
│       └── 02_add_markup.py
├── llm_ops/
│   ├── helpers/
│   │   └── tool_calls.py       # contains, e.g., database calls on the Parquet files
│   ├── steps/
│   │   ├── processing_step_1/  # rename accordingly
│   │   │   ├── config.py       # LLM parameters
│   │   │   ├── prompt.py       # the prompt template
│   │   │   ├── base_model.py   # a Pydantic BaseModel for the structured output definition
│   │   │   ├── run.py          # a run function putting it all together
│   │   │   └── report.ipynb    # a Jupyter notebook to develop and evaluate the prompt
│   │   └── processing_step_2/  # ...
│   └── workflows/
│       ├── workflow_1/
│       │   ├── run.py          # imports and uses several LLM steps and helpers
│       │   └── report.ipynb    # imports and uses several LLM steps and helpers
│       └── agentic_workflow_1/
│           ├── run.py          # imports and uses several LLM steps and helpers
│           └── report.ipynb    # imports and uses several LLM steps and helpers
├── notebooks/
│   └── playground.ipynb        # from here you can run whole transformations
├── streamlit_app/
│   ├── main.py
│   └── helpers.py              # might contain data access to pre-processed document tables
├── README.md
└── pyproject.toml

The llm_ops steps and workflows contain run.py files which expose run() functions. For LLM steps, these receive the prompt parameters as inputs and return the raw LLM outputs. For LLM workflows, these can call several LLM steps and implement business logic to combine them.
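
As a rough, minimal sketch of how the files of a single step could fit together (all names here, such as call_llm and DocumentMetadata, are hypothetical placeholders, not a prescribed API):

# Hypothetical contents of llm_ops/steps/processing_step_1/, condensed into one listing.
import pydantic

# config.py -- LLM parameters
MODEL_NAME = "mistral-small-latest"  # assumption: any model with structured output support
TEMPERATURE = 0.0

# prompt.py -- the prompt template
PROMPT_TEMPLATE = (
    "Extract the title and the publication year from the following text.\n"
    "\n"
    "Text:\n"
    "{text}"
)

# base_model.py -- the structured output definition
class DocumentMetadata(pydantic.BaseModel):
    title: str
    year: int | None

# run.py -- a run function putting it all together
def run(text: str) -> DocumentMetadata:
    prompt = PROMPT_TEMPLATE.format(text=text)
    # call_llm is a placeholder for your provider client (Mistral, OpenAI, ...);
    # it is expected to return a JSON string that matches the given schema
    raw_output = call_llm(
        prompt=prompt,
        model=MODEL_NAME,
        temperature=TEMPERATURE,
        json_schema=DocumentMetadata.model_json_schema(),
    )
    return DocumentMetadata.model_validate_json(raw_output)

A workflow's run() in llm_ops/workflows would then import such step-level run() functions and combine their outputs with plain Python business logic.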

To develop and improve prompts and workflows iteratively, you perform prompt engineering in the report.ipynb file of the respective step or workflow directory. You can keep the prompt history in an archive subfolder. You may start experimenting by creating a dataset inside the notebook. When finished, you may consider moving the prompt development dataset to the data/datasets directory as a JSON file.
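
A minimal sketch of such a notebook experiment, reusing the hypothetical run() from the step sketch above with a hand-crafted development dataset:

# report.ipynb -- hypothetical prompt development cell
from llm_ops.steps.processing_step_1.run import run

# a tiny hand-crafted development dataset; move it to data/datasets/ as JSON once it stabilizes
dev_samples = [
    {"text": "Attention Is All You Need. 2017. ...", "expected_title": "Attention Is All You Need"},
    {"text": "A scanned memo without a clear title or year ...", "expected_title": None},
]

# run the step on every sample and compare the outputs against the expectations by hand
for sample in dev_samples:
    result = run(sample["text"])
    print(sample["expected_title"], "->", result.title)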

The data_ops directory contains transformation scripts. These run llm_ops functions on whole tables, resulting in corresponding LLM output tables saved in the data directory.
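
For illustration, a transformation script could look roughly like this, assuming the hypothetical step from above, a text column in the extracted table, and polars for table handling:

# data_ops/transformations/02_add_markup.py -- hypothetical example
import polars as pl

from llm_ops.steps.processing_step_1.run import run

df = pl.read_parquet("data/transformed/01_extracted_texts.parquet")

# apply the LLM step row by row; for large tables you would batch or parallelize this
llm_outputs = [run(text).model_dump() for text in df["text"]]

df = df.with_columns(pl.Series("llm_output", llm_outputs))
df.write_parquet("data/transformed/02_texts_with_markup.parquet")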

The data directory contains large data files which should be added to Git using Git Large File Storage, abbreviated Git LFS:
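
A minimal setup could look as follows, assuming you want to track the PDF and Parquet files:

  1. Install Git LFS.
  2. Run git lfs install.
  3. Run git lfs track "*.pdf" and git lfs track "*.parquet", then commit the generated .gitattributes file.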

To initialize the whole project with a managed virtual environment and a pyproject.toml containing all package dependencies, it is advised to use uv:

  1. Install uv.
  2. Run uv init.
  3. Add packages via uv add polars duckdb ...

Logging & Tracing

Logs are used to keep track of prompt engineering, i.e., to calculate metrics such as accuracy or LLM-as-a-judge scores. Traces are used for debugging LLM workflows, especially to see in which sequence which functions were called with which arguments and how long each call took.

The difference between an LLM log entry and a function trace is that a trace keeps track of the following aspects in addition to the raw LLM inputs and outputs:

  • A trace records parent-child relationships of function calls, so you know which function called which other functions.
  • A trace has all raw function parameters.
  • A log has more specialized LLM parameters, such as the model name.
  • A log also contains the target labels needed to calculate classification metrics.

Logs

For logs, it is advised to save the following properties for each LLM call, so that you can calculate metrics such as accuracy, F1, coherence, et cetera afterwards.

from typing import Any

import pydantic


class LlmLogEntry(pydantic.BaseModel):
    id: str              # unique UUID for this entry 
    experiment_id: str   # unique id for the experiment 
    start_time: str
    end_time: str
    duration: float
    prompt_template: str # The raw prompt template without parameters replaced 
    prompt_parameters: dict[str, str]  # The prompt parameters to be inserted into the prompt
    llm_output: dict | str
    provider_name: str   # the LLM provider name (Mistral/OpenAI/...)
    model_name: str      # the LLM used
    temperature: float   # temperature used for the LLM
    top_p: float         # top_p param used for the LLM
    json_schema: str     # The structured output schema for this LLM call
    labels: dict[str, Any] # The gold standard for the output JSON fields, in case classification was performed
    dataset_name: str    # the name of the dataset from which the sample has been drawn
    dataset_version: str # the version of the dataset 

An LLM log entry is saved whenever you run an experiment and the LLM is called with the respective prompt.
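
A minimal sketch of how such an entry could be appended to the per-user log file (write_log_entry is a hypothetical helper, not a fixed API):

import getpass
import pathlib

def write_log_entry(entry: LlmLogEntry) -> None:
    """Append one entry as a single JSON line to data/logs/{username}.jsonl."""
    log_file = pathlib.Path("data/logs") / f"{getpass.getuser()}.jsonl"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as file:
        file.write(entry.model_dump_json() + "\n")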

You can convert the LLM experiment logs into a report with custom metrics, so that you can see the prompt accuracy:

import polars as pl

df = pl.read_ndjson("data/logs/jhr.jsonl")

df = df.group_by([
    "experiment_id", 
    "prompt_template",
    "temperature",
    "top_p",
    "provider_name",
    "model_name",
    "json_schema",
    "dataset_name",
    "dataset_version"
]).agg(
    # total number of characters across the prompt templates (the rendered prompt is not stored in the log)
    sum_input_chars=pl.col("prompt_template").str.len_chars().sum(),
    # share of samples where the structured output matches the gold labels;
    # assumes llm_output and labels share the same structure
    accuracy=(pl.col("labels") == pl.col("llm_output")).mean(),
)

df.write_ndjson("data/experiments/jhr.jsonl")

Traces

LLM tracing dashboard as shown in Traceloop

For traces, it is advised to save the following properties for each function call along the stack, so that you can perform runtime analyses and error tracking.

Traces are an advanced concept that is especially helpful for tool calling. They can be deprioritized in favor of logs.

import pydantic


class TraceEntry(pydantic.BaseModel):
    id: str                   # unique UUID for this entry 
    run_id: str               # unique UUID for this whole run 
    start_time: str | None
    end_time: str | None
    duration: float | None
    parent_id: str | None     # The id of the calling function call
    func_name: str | None
    output_dict: dict | None
    exception_stacktrace: str | None
    kwargs: str | None        # keyword arguments passed to this function call
    args: str | None          # normal parameters passed to this function call 
    json_schema: str | None   # The structured output schema for this LLM call
    username: str | None
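
As a simplified sketch, such entries could be produced by a decorator that wraps each function along the stack; trace, write_trace_entry, and the global call stack here are illustrative assumptions, not a finished library:

import functools
import getpass
import pathlib
import time
import traceback
import uuid

_RUN_ID = str(uuid.uuid4())      # one run id per process (assumption for this sketch)
_CALL_STACK: list[str] = []      # ids of the currently active traced calls

def write_trace_entry(entry: TraceEntry) -> None:
    """Append one entry as a single JSON line to data/traces/{username}.jsonl."""
    trace_file = pathlib.Path("data/traces") / f"{getpass.getuser()}.jsonl"
    trace_file.parent.mkdir(parents=True, exist_ok=True)
    with trace_file.open("a", encoding="utf-8") as file:
        file.write(entry.model_dump_json() + "\n")

def trace(func):
    """Record a TraceEntry for every call of the decorated function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry_id = str(uuid.uuid4())
        parent_id = _CALL_STACK[-1] if _CALL_STACK else None
        _CALL_STACK.append(entry_id)
        start = time.time()
        output = None
        exception_stacktrace = None
        try:
            output = func(*args, **kwargs)
            return output
        except Exception:
            exception_stacktrace = traceback.format_exc()
            raise
        finally:
            _CALL_STACK.pop()
            end = time.time()
            write_trace_entry(TraceEntry(
                id=entry_id,
                run_id=_RUN_ID,
                start_time=str(start),
                end_time=str(end),
                duration=end - start,
                parent_id=parent_id,
                func_name=func.__name__,
                output_dict=output if isinstance(output, dict) else None,
                exception_stacktrace=exception_stacktrace,
                kwargs=str(kwargs),
                args=str(args),
                json_schema=None,
                username=getpass.getuser(),
            ))
    return wrapper

You would then decorate the run() functions of your steps and workflows with @trace.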

Dashboarding

Experiment Tracking

tbd

Tracing

tbd
