Text-to-SQL tools long dominated the market for applying intelligence over large amounts of data. With the advent of LLMs, however, the task is now shared by several other approaches, including RAG and coding/SQL agents.
One major issue with these approaches is that the LLM never actually sees the data; it only receives a rough abstraction of it, such as summaries, samples, schema descriptions, or partial slices generated by another system.
What happens when you have a large number of rows to process and need to feed them into an LLM?
Let's see how we can tackle this using Datatune!
🎵 Datatune
Scalable Data Transformations with row-level intelligence.
Datatune is not just another text-to-SQL tool. With Datatune, LLMs and agents get full access to your data and can apply semantic intelligence to every record.
How It Works
Installation
pip install datatune
Quick Start
import datatune as dt
from datatune.llm.llm import OpenAI
import dask.dataframe as dd
llm = OpenAI(model_name="gpt-3.5-turbo")
df = dd.read_csv("products.csv")
# Extract categories using natural language
mapped = dt.map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
# Filter with simple criteria
filtered = dt.filter(
    prompt="Keep only electronics products",
    input_fields=["Name"]
)(llm, mapped)
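Because df is a Dask DataFrame, the operations above are built lazily. Here's a minimal sketch of materializing and saving the result, assuming the output of dt.filter is itself a Dask DataFrame (the .compute() and .to_csv() calls are standard Dask, not Datatune-specific):
# Materialize the lazy Dask pipeline and save the output
# (assumes `filtered` behaves like a Dask DataFrame; check the Datatune docs)
result = filtered.compute()
result.to_csv("electronics_products_filtered.csv", index=False)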
The Context Length Problem
LLMs keep growing in terms of context length. But even at the current pace, a 100M-token context window is no match for the data that comes out of an average user's database.
This means the data that needs to be transformed can be several orders of magnitude larger than an LLM's context length.
Consider a mid-sized enterprise with the following fairly typical setup:
10 million rows in a transactional table
20 columns per row
Average 50 characters per column (IDs, text, timestamps, codes)
That’s:
10,000,000 rows × 20 columns × 50 characters
= 10,000,000,000 characters
Even with aggressive tokenization (≈ 4 characters per token):
10,000,000,000 ÷ 4
≈ 2.5 billion tokens
Now compare this with an extremely optimistic LLM context window of 100 million tokens.
That single table alone is 25× larger than the model’s entire context.
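As a quick sanity check, here's the same back-of-the-envelope arithmetic in Python (the row, column, and character figures are the illustrative numbers from above, not measurements):
# Back-of-the-envelope estimate of the table size in tokens
rows = 10_000_000
columns = 20
chars_per_column = 50
chars_per_token = 4            # aggressive tokenization assumption
context_window = 100_000_000   # extremely optimistic 100M-token context

total_chars = rows * columns * chars_per_column   # 10,000,000,000 characters
total_tokens = total_chars / chars_per_token      # ~2.5 billion tokens
print(total_tokens / context_window)              # ~25x larger than the context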
Solving Large-Scale Data Processing Using Datatune
With Datatune, users can give LLMs full access to the data with the help of batch processing.
Each row is combined with the input prompt, these combined requests are sent to the LLM in batches, and the process continues until every batch has been processed. Datatune uses Dask's parallel processing to split the data into partitions and send batches from those partitions to the LLM in parallel.
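Conceptually, the partition-wise batching looks something like the sketch below. This is not Datatune's internal code: call_llm, BATCH_SIZE, and the prompt are hypothetical stand-ins, used only to illustrate how rows, prompts, and batches fit together over Dask partitions.
import dask.dataframe as dd
import pandas as pd

BATCH_SIZE = 32  # hypothetical batch size

def call_llm(prompts):
    # Hypothetical helper: send a batch of prompts to the LLM
    # and return one response per prompt (placeholder here).
    return ["<llm response>" for _ in prompts]

def process_partition(partition: pd.DataFrame) -> pd.DataFrame:
    # Each Dask partition is handled independently, so partitions run in parallel.
    responses = []
    for start in range(0, len(partition), BATCH_SIZE):
        batch = partition.iloc[start:start + BATCH_SIZE]
        prompts = [
            f"Extract the category from: {row.Description}"
            for row in batch.itertuples()
        ]
        responses.extend(call_llm(prompts))
    return partition.assign(Category=responses)

df = dd.read_csv("products.csv")
result = df.map_partitions(process_partition)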
Understanding Data Transformation Operations
There are four first-order data transformation functions (also known as primitives): MAP, FILTER, EXPAND, and REDUCE.
Datatune is built on top of these primitives, and each of them can be performed with natural-language prompts.
For example:
mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
In the example above, a MAP operation uses the prompt to produce the output fields Category and Subcategory from the input fields Description and Name.
Datatune can also be used to chain multiple transformations together.
Here's another example where a MAP and a FILTER are used together:
# First, extract sentiment and keywords from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)
# Then, keep only negative reviews for further analysis (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
Datatune Agents
Datatune also has Agents, which let users run prompts without having to know which primitives to use. They are also helpful when a query is complex and requires multiple transformations chained together.
Here's an example where a multi-step MAP and FILTER pipeline like the ones above is expressed with just a single prompt to an Agent (the agent object is a Datatune Agent built around your LLM; see the repository for how to construct it):
df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
The Agent can also execute generated Python code alongside the row-level primitives (Map, Filter, etc.). This is especially useful for prompts that don't require row-level intelligence (for example, purely numerical operations on columns), since the Agent can use Datatune's code-generation capabilities to work on the data directly.
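For instance, a purely numerical request like the hypothetical prompt below shouldn't need per-row LLM calls; it's the kind of step the Agent can hand off to generated code instead (the prompt and column names are illustrative, not from the Datatune docs):
# Hypothetical prompt: a plain arithmetic transformation on a numeric column.
# No row-level semantic understanding is required, so generated code can handle it.
df = agent.do(
    "Add a price_with_tax column equal to price * 1.18",
    df
)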
Data Sources
Datatune is designed to work with a wide variety of data sources, including DataFrames and databases. Through its Ibis integration, Datatune extends connectivity to databases such as DuckDB, Postgres, MySQL, and more.
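Here's a rough sketch of what that can look like: pull a table out of DuckDB with Ibis, then hand it to a Datatune operation. The database path, table name, and partition count are illustrative, and this goes through a plain Dask DataFrame rather than any Datatune-specific connector:
import ibis
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI

# Illustrative: read a table from DuckDB via Ibis, then run a Datatune MAP on it
con = ibis.duckdb.connect("warehouse.duckdb")   # hypothetical database file
products = con.table("products").to_pandas()    # materialize the table as pandas
df = dd.from_pandas(products, npartitions=8)    # partition it for Dask

llm = OpenAI(model_name="gpt-3.5-turbo")
mapped = dt.map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)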
Contributing
We're building Datatune in open source, and we would love your contributions!
Check out the GitHub repository: https://github.com/vitalops/datatune


