This article provides a simple template in case you want to experiment with LLM workflows. As an example, we load unstructured data, such as PDF files, as structured data into tables using Python. As preparation, you might want to read my earlier article first: LLM Coding Concepts: Static Typing, Structured Output, and Async.
Project Directory Structure
project-name/
├── data/
│   ├── logs/
│   │   └── {username}.jsonl          # logs from your LLM calls, append only
│   ├── traces/
│   │   └── {username}.jsonl          # traces from your function calls, append only
│   ├── raw/
│   │   ├── document01.pdf
│   │   ├── document02.pdf
│   │   └── ...
│   └── transformed/
│       ├── 01_extracted_texts.parquet
│       └── 02_texts_with_markup.parquet
├── data_ops/
│   ├── helpers/
│   └── transformations/              # scripts to convert one dataset into another
│       ├── 01_extract_texts.py
│       └── 02_add_markup.py
├── llm_ops/
│   ├── helpers/
│   │   └── tool_calls.py             # contains e.g. database calls on the parquet files
│   ├── steps/
│   │   ├── processing_step_1/        # rename accordingly
│   │   │   ├── config.py             # LLM parameters
│   │   │   ├── prompt.py             # the prompt template
│   │   │   ├── base_model.py         # a Pydantic BaseModel for the structured output definition
│   │   │   ├── run.py                # a run function putting it all together
│   │   │   └── report.ipynb          # a Jupyter notebook to develop and evaluate the prompt
│   │   └── processing_step_2/        # ...
│   └── workflows/
│       ├── workflow_1/
│       │   ├── run.py                # imports and uses several LLM steps and helpers
│       │   └── report.ipynb          # a Jupyter notebook to develop and evaluate the workflow
│       └── agentic_workflow_1/
│           ├── run.py                # imports and uses several LLM steps and helpers
│           └── report.ipynb          # a Jupyter notebook to develop and evaluate the workflow
├── notebooks/
│   └── playground.ipynb              # from here you can run whole transformations
├── streamlit_app/
│   ├── main.py
│   └── helpers.py                    # might contain data access to pre-processed document tables
├── README.md
└── pyproject.toml
The llm_ops steps and workflows each contain a run.py file exposing a run() function. For LLM steps, run() receives the prompt parameters as inputs and returns the raw LLM output. For LLM workflows, run() can call several LLM steps and implement the business logic to combine them.
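As a minimal sketch of what a step's run.py could look like, assuming an OpenAI-compatible client and illustrative names (ExtractedDocument, MODEL_NAME, TEMPERATURE, TOP_P, PROMPT_TEMPLATE) for the contents of base_model.py, config.py, and prompt.py:

# llm_ops/steps/processing_step_1/run.py -- a sketch; the imported names are
# assumptions for what config.py, prompt.py, and base_model.py might contain
from openai import OpenAI

from llm_ops.steps.processing_step_1.base_model import ExtractedDocument
from llm_ops.steps.processing_step_1.config import MODEL_NAME, TEMPERATURE, TOP_P
from llm_ops.steps.processing_step_1.prompt import PROMPT_TEMPLATE

client = OpenAI()


def run(**prompt_parameters: str) -> ExtractedDocument:
    """Render the prompt template, call the LLM, and validate the structured output."""
    prompt = PROMPT_TEMPLATE.format(**prompt_parameters)
    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=TEMPERATURE,
        top_p=TOP_P,
        response_format={"type": "json_object"},  # assumes the prompt asks for a JSON answer
        messages=[{"role": "user", "content": prompt}],
    )
    # validate the raw JSON answer against the structured output definition
    return ExtractedDocument.model_validate_json(response.choices[0].message.content)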
To develop and improve prompts and workflows iteratively, you perform prompt engineering in the report.ipynb file of the respective step or workflow directory. You can keep the prompt history in an archive subfolder. You may start experimenting by creating a dataset inside the notebook. When finished, you may consider moving the prompt development dataset to the data/datasets directory as a JSON file.
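Such a development dataset can be as simple as a list of samples containing prompt parameters and gold labels; a sketch of saving it from the notebook (the file name and field names are just examples):

import json

# hypothetical samples: prompt parameters plus the expected (gold) output fields
dataset = [
    {"prompt_parameters": {"text": "Invoice no. 42 from ACME Corp ..."},
     "labels": {"document_type": "invoice"}},
    {"prompt_parameters": {"text": "Dear Sir or Madam, regarding your inquiry ..."},
     "labels": {"document_type": "letter"}},
]

with open("data/datasets/document_type_v1.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)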
The data_ops directory contains transformation scripts. These run llm_ops functions on whole tables, resulting in corresponding LLM output tables saved in the data directory.
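A transformation script could look roughly like this; it assumes the run() step sketched above, a text column in the extracted-texts table, and structured outputs with consistent fields:

# data_ops/transformations/02_add_markup.py -- a sketch, names are illustrative
import polars as pl

from llm_ops.steps.processing_step_1.run import run

df = pl.read_parquet("data/transformed/01_extracted_texts.parquet")

# call the LLM step once per row and collect the structured outputs as dicts
outputs = [run(text=text).model_dump() for text in df["text"]]

df = df.with_columns(pl.Series("llm_output", outputs))
df.write_parquet("data/transformed/02_texts_with_markup.parquet")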
The data directory contains large data files, which should be added to Git using Git Large File Storage (Git LFS):
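For example, with the standard Git LFS commands (adjust the tracked file patterns to your data):

git lfs install
git lfs track "*.pdf" "*.parquet"
git add .gitattributes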
To initialize the whole project with a managed virtual environment and a pyproject.toml containing all package dependencies, it is advised to use uv:
- Install uv.
- Run uv init.
- Add packages via uv add polars duckdb ...
Logging & Tracing
Logs are used to keep track of prompt engineering, i.e., to calculate metrics such as accuracy or LLM-as-a-judge scores. Traces are used for debugging LLM workflows, especially to see in which sequence which functions were called with which arguments and how long each call took.
An LLM log entry and a function trace both capture the raw LLM inputs and outputs, but they differ in the following aspects:
- A trace has parent-child relationships of function calls, so you know which function called which other functions.
- A trace has all raw function parameters.
- A log has more specialized LLM parameters, such as the model name.
- A log also contains the target label to calculate classification metrics.
Logs
For logs, it is advised to save the following properties for each LLM call, such that you can calculate metrics such as accuracy, F1, coherence, et cetera afterwards.
from typing import Any

import pydantic


class LlmLogEntry(pydantic.BaseModel):
    id: str                             # unique UUID for this entry
    experiment_id: str                  # unique id for the experiment
    start_time: str
    end_time: str
    duration: float
    prompt_template: str                # the raw prompt template without parameters replaced
    prompt_parameters: dict[str, str]   # the prompt parameters to be inserted into the prompt
    llm_output: dict | str
    provider_name: str                  # the LLM provider name (Mistral/OpenAI/...)
    model_name: str                     # the LLM used
    temperature: float                  # temperature used for the LLM
    top_p: float                        # top_p param used for the LLM
    json_schema: str                    # the structured output schema for this LLM call
    labels: dict[str, Any]              # the gold standard for the output JSON fields, in case classification was performed
    dataset_name: str                   # the name of the dataset from which the sample has been drawn
    dataset_version: str                # the version of the dataset
An LLM log entry is saved whenever the LLM is called with the respective prompt during an experiment run.
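A minimal sketch of how such an entry could be appended to the per-user, append-only log file (the helper name is an assumption):

import getpass
from pathlib import Path


def append_log_entry(entry: LlmLogEntry) -> None:
    # LlmLogEntry as defined above; one JSON object per line, append only
    log_file = Path("data/logs") / f"{getpass.getuser()}.jsonl"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as f:
        f.write(entry.model_dump_json() + "\n")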
You can convert the LLM experiment logs into a report with custom metrics, so that you can see, for example, the accuracy per prompt:
import polars as pl

df = pl.read_ndjson("data/logs/jhr.jsonl")

df = df.group_by([
    "experiment_id",
    "prompt_template",
    "temperature",
    "top_p",
    "provider_name",
    "model_name",
    "json_schema",
    "dataset_name",
    "dataset_version",
]).agg(
    # total number of prompt characters sent per experiment
    sum_input_chars=pl.col("prompt_template").str.len_chars().sum(),
    # share of calls where the structured output matches the gold labels
    accuracy=(pl.col("labels") == pl.col("llm_output")).mean(),
)

df.write_ndjson("data/experiments/jhr.jsonl")
Traces
For traces, it is advised to save the following properties for each function call along the stack, such that you can perform runtime analyses and error tracking.
Traces are an advanced concept that is especially helpful for tool calling; they can be deprioritized in favor of logs.
import pydantic


class TraceEntry(pydantic.BaseModel):
    id: str                             # unique UUID for this entry
    run_id: str                         # unique UUID for this whole run
    start_time: str | None
    end_time: str | None
    duration: float | None
    parent_id: str | None               # the id of the calling function call
    func_name: str | None
    output_dict: dict | None
    exception_stacktrace: str | None
    kwargs: str | None                  # keyword arguments passed to this function call
    args: str | None                    # positional arguments passed to this function call
    json_schema: str | None             # the structured output schema for this LLM call
    username: str | None
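A minimal sketch of how such entries could be produced with a decorator: it uses the TraceEntry model above and a context variable to remember the calling function, so that nested calls get the correct parent_id (all names are illustrative):

import contextvars
import functools
import getpass
import json
import time
import traceback
import uuid
from datetime import datetime, timezone
from pathlib import Path

_RUN_ID = str(uuid.uuid4())
_parent_id: contextvars.ContextVar[str | None] = contextvars.ContextVar("parent_id", default=None)


def traced(func):
    """Record a TraceEntry (as defined above) for every call of the decorated function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry_id = str(uuid.uuid4())
        parent_id = _parent_id.get()
        token = _parent_id.set(entry_id)  # nested calls see this call as their parent
        start = time.perf_counter()
        start_time = datetime.now(timezone.utc).isoformat()
        output, stacktrace = None, None
        try:
            output = func(*args, **kwargs)
            return output
        except Exception:
            stacktrace = traceback.format_exc()
            raise
        finally:
            _parent_id.reset(token)
            entry = TraceEntry(
                id=entry_id,
                run_id=_RUN_ID,
                start_time=start_time,
                end_time=datetime.now(timezone.utc).isoformat(),
                duration=time.perf_counter() - start,
                parent_id=parent_id,
                func_name=func.__name__,
                output_dict=output if isinstance(output, dict) else None,
                exception_stacktrace=stacktrace,
                kwargs=json.dumps(kwargs, default=str),
                args=json.dumps(args, default=str),
                json_schema=None,
                username=getpass.getuser(),
            )
            trace_file = Path("data/traces") / f"{getpass.getuser()}.jsonl"
            trace_file.parent.mkdir(parents=True, exist_ok=True)
            with trace_file.open("a", encoding="utf-8") as f:
                f.write(entry.model_dump_json() + "\n")

    return wrapper

Decorating e.g. the workflow run() functions and the tool-call helpers with @traced then appends one trace entry per call to data/traces/{username}.jsonl.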
Dashboarding
Experiment Tracking
tbd
Tracing
tbd
