DEV Community

Volodymyr Pavlyshyn


Pre and Post Filtering in Vector Search with Metadata and RAG Pipelines

In modern AI systems, managing vast amounts of data while keeping it relevant and accessible is a significant challenge, particularly when dealing with large language models (LLMs) and vector databases. One approach that has gained prominence in recent years is integrating vector search with metadata, especially in retrieval-augmented generation (RAG) pipelines. Combining vector search with metadata enables faster and more accurate data retrieval. However, whether metadata filters are applied before or after the vector search plays a crucial role in ensuring data relevance.

The Vector Search and Metadata Challenge

In a typical vector search, you create embeddings from chunks of text, such as a PDF document. These embeddings allow the system to search for similar items and retrieve them based on relevance. The challenge, however, arises when you need to combine vector search results with structured metadata. For example, you may have timestamped text-based content and want to retrieve the most relevant content within a specific date range. This is where metadata becomes critical in refining search results.

Unfortunately, most vector databases treat metadata as a secondary feature, isolating it from the primary vector search process. As a result, handling queries that combine vectors and metadata can become a challenge, particularly when the search needs to account for a dynamic range of filters, such as dates or other structured data.

LibSQL and Vector Search Metadata

LibSQL is a general-purpose, SQLite-based database that adds vector capabilities to regular data. Vectors are stored as blob columns in regular tables. This makes vector embeddings and metadata first-class citizens that live side by side and integrate naturally.

  create table if not exists conversation (
    id varchar(36) primary key not null,
    startDate real,
    endDate real,
    summary text,
    vectorSummary F32_BLOB(512)
   );

This solves the challenge of combining metadata with vector search and eliminates the impedance mismatch between vector data and regular structured data, since both live in the same storage.

As you can see, you can access the vector data and the start date in the same query.

select c.id, c.startDate, c.endDate, c.summary, vector_distance_cos(c.vectorSummary, vector(${vector})) distance
      from conversation c
      where 1=1
      ${startDate ? `and c.startDate >= ${startDate.getTime()}` : ''}
      ${endDate ? `and c.endDate <= ${endDate.getTime()}` : ''}
      ${distance ? `and distance <= ${distance}` : ''}
      order by distance
      limit ${top};

vector_distance_cos, aliased as distance, lets us build a primitive vector search that does a full scan and calculates the distance for every row. We could optimize it with a CTE that limits the search and the distance calculations to a much smaller subset of data.

This approach can be computationally intensive and may fail on large amounts of data.
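To make the cost of the full scan concrete, here is a minimal in-memory TypeScript sketch of what such a query does conceptually: compute a distance for every row, sort, and keep the top results. The `ConversationRow` shape and the `fullScanSearch` helper are hypothetical names for illustration, and the sketch assumes cosine distance is defined as 1 minus cosine similarity. It is not LibSQL's implementation.

```typescript
// Hypothetical row shape mirroring the conversation table from this article.
interface ConversationRow {
  id: string;
  startDate: number; // epoch milliseconds
  endDate: number; // epoch milliseconds
  summary: string;
  vectorSummary: number[];
}

// Cosine distance = 1 - cosine similarity (0 means the same direction).
// Note: a zero-length vector yields NaN because of the division by zero.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Full scan: one distance computation per row, so the work grows
// linearly with the number of rows.
function fullScanSearch(
  rows: ConversationRow[],
  query: number[],
  top: number
): Array<{ row: ConversationRow; distance: number }> {
  return rows
    .map((row) => ({ row, distance: cosineDistance(row.vectorSummary, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, top);
}
```

Every row is touched once per query, which is why this only stays practical while the scanned set is small.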

LibSQL offers a much more efficient vector search based on a FlashDiskANN vector index.

vector_top_k('idx_conversation_vectorSummary', ${vector} , ${top}) i

vector_top_k is a table function that searches the vector index for the top nearest rows. As you can see, it takes only the index name, the query vector, and the number of results as parameters, so other columns can only be used outside of the table function. To use a vector index together with other columns, we need to apply some strategies.

Now we face the classical problem of integrating vector search results with metadata queries.
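The index name passed to vector_top_k has to exist before the query runs. A minimal sketch of the statement follows, kept as a TypeScript constant so it can be executed through any SQL client. It assumes LibSQL's `libsql_vector_idx` index expression and reuses the table, column, and index names from this article; `createVectorIndexSql` is a hypothetical constant name.

```typescript
// Assumption: the index is built with LibSQL's libsql_vector_idx expression
// over the F32_BLOB column of the conversation table defined earlier.
const createVectorIndexSql = `
  create index idx_conversation_vectorSummary
  on conversation (libsql_vector_idx(vectorSummary));
`;
```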

Post-Filtering: A Common Approach

The most widely adopted method in these pipelines is post-filtering. In this approach, the system first retrieves data based on vector similarity and then applies metadata filters. For example, imagine you are conducting a vector search to retrieve conversations relevant to a specific question, but you also want to ensure these conversations occurred in the past week.

Post-filtering allows the system to retrieve the most relevant vector-based results and subsequently filter out any that don’t meet the metadata criteria, such as date range. This method is efficient when vector similarity is the primary factor driving the search, and metadata is only applied as a secondary filter.

    // The index returns the nearest rows first; metadata filters are applied afterwards.
    const sqlQuery = `
      select c.id, c.startDate, c.endDate, c.summary, vector_distance_cos(c.vectorSummary, vector(${vector})) distance
      from vector_top_k('idx_conversation_vectorSummary', ${vector}, ${top}) i
      inner join conversation c on i.id = c.rowid
      where 1=1
      ${startDate ? `and c.startDate >= ${startDate.getTime()}` : ''}
      ${endDate ? `and c.endDate <= ${endDate.getTime()}` : ''}
      ${distance ? `and distance <= ${distance}` : ''}
      order by distance
      limit ${top};`;

However, there are some limitations. The initial vector search returns only a fixed number of candidates, so relevant rows can be dropped before the metadata filter is ever applied. If the search window is narrow enough, this can lead to incomplete or even empty results.

One working strategy is to make the top value in vector_top_k much bigger than the number of results you actually need. Be careful, though, as the function's default maximum number of results is around 200 rows.
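A minimal sketch of this over-fetch strategy, assuming the candidates returned by the index have already been joined with their metadata. The `Candidate` shape, the `overFetchSize` and `postFilter` helpers, the over-fetch factor, and the cap of 200 (taken from the approximate limit mentioned above) are illustrative assumptions, not part of LibSQL's API.

```typescript
// Hypothetical shape of one index hit after joining with the conversation table.
interface Candidate {
  id: string;
  startDate: number; // epoch milliseconds
  endDate: number; // epoch milliseconds
  distance: number;
}

// Assumption: approximate upper bound on rows returned by the index lookup.
const MAX_TOP_K = 200;

// Ask the index for more rows than needed, but never more than the cap.
function overFetchSize(top: number, factor: number): number {
  return Math.min(MAX_TOP_K, Math.max(top, Math.ceil(top * factor)));
}

// Apply the metadata filter after the vector search, then trim to `top`.
function postFilter(
  candidates: Candidate[],
  top: number,
  startDate?: number,
  endDate?: number
): Candidate[] {
  return candidates
    .filter(
      (c) =>
        (startDate === undefined || c.startDate >= startDate) &&
        (endDate === undefined || c.endDate <= endDate)
    )
    .sort((a, b) => a.distance - b.distance)
    .slice(0, top);
}
```

If `postFilter` still returns fewer than `top` rows at the capped size, the filter is probably too selective for post-filtering to work well.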

Pre-Filtering: A More Complex Approach

Pre-filtering is a more intricate approach but can be more effective in some instances. In pre-filtering, metadata is used as the primary filter before vector search takes place. This means that only data that meets the metadata criteria is passed into the vector search process, limiting the scope of the search right from the beginning.

While this approach can significantly reduce the amount of irrelevant data in the final results, it comes with its own challenges. For example, pre-filtering requires a deeper understanding of the data structure and may necessitate denormalizing the data or creating separate pre-filtered tables. This can be resource-intensive and, in some cases, impractical for dynamic metadata like date ranges.

In certain use cases, pre-filtering might outperform post-filtering. For instance, when the metadata (e.g., specific date ranges) is the most important filter, pre-filtering ensures the search is conducted only on the most relevant data.

Pre-Filtering with Distance-Based Filtering

So, we return to an older idea: pre-filter the rows by metadata and compute the distances directly, instead of using the vector index.

WITH FilteredDates AS (
    SELECT 
        c.id, 
        c.startDate, 
        c.endDate, 
        c.summary, 
        c.vectorSummary
    FROM 
        conversation c
    WHERE 
        1=1
        ${startDate ? `AND c.startDate >= ${startDate.getTime()}` : ''}
        ${endDate ? `AND c.endDate <= ${endDate.getTime()}` : ''}
),
DistanceCalculation AS (
    SELECT 
        fd.id, 
        fd.startDate, 
        fd.endDate, 
        fd.summary, 
        fd.vectorSummary,
        vector_distance_cos(fd.vectorSummary, vector(${vector})) AS distance
    FROM 
        FilteredDates fd
)
SELECT 
    dc.id, 
    dc.startDate, 
    dc.endDate, 
    dc.summary, 
    dc.distance
FROM 
    DistanceCalculation dc
WHERE 
    1=1
    ${distance ? `AND dc.distance <= ${distance}` : ''}
ORDER BY 
    dc.distance
LIMIT ${top};

This makes sense when the metadata filter produces a small result set, so that the distance calculation runs over far fewer rows.

The advantage of this approach is that you have full control over the data and get all matching results, without the omissions that are typical of searches over a large index.
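Interpolating values directly into the SQL string makes the conditional WHERE clauses easy to get wrong. The sketch below builds the same pre-filtering query with positional `?` placeholders instead. `buildPreFilterQuery`, `SearchOptions`, and `BuiltQuery` are hypothetical names, and the sketch assumes the query vector can be passed to `vector(?)` as JSON-style text such as `[0.1,0.2]`.

```typescript
interface SearchOptions {
  vector: number[];
  top: number;
  startDate?: Date;
  endDate?: Date;
  distance?: number;
}

interface BuiltQuery {
  sql: string;
  args: Array<string | number>;
}

function buildPreFilterQuery(opts: SearchOptions): BuiltQuery {
  // Collect metadata conditions and their values in the same order.
  const conditions: string[] = [];
  const args: Array<string | number> = [];
  if (opts.startDate) {
    conditions.push("c.startDate >= ?");
    args.push(opts.startDate.getTime());
  }
  if (opts.endDate) {
    conditions.push("c.endDate <= ?");
    args.push(opts.endDate.getTime());
  }
  const whereClause = conditions.length ? `WHERE ${conditions.join(" AND ")}` : "";

  // The vector placeholder comes after the date placeholders in the SQL text.
  args.push(JSON.stringify(opts.vector));

  const distanceClause = opts.distance !== undefined ? "WHERE distance <= ?" : "";
  if (opts.distance !== undefined) args.push(opts.distance);
  args.push(opts.top);

  const sql = `
WITH FilteredDates AS (
  SELECT c.id, c.startDate, c.endDate, c.summary, c.vectorSummary
  FROM conversation c
  ${whereClause}
),
DistanceCalculation AS (
  SELECT fd.id, fd.startDate, fd.endDate, fd.summary,
         vector_distance_cos(fd.vectorSummary, vector(?)) AS distance
  FROM FilteredDates fd
)
SELECT id, startDate, endDate, summary, distance
FROM DistanceCalculation
${distanceClause}
ORDER BY distance
LIMIT ?`;

  return { sql, args };
}
```

The resulting `sql` and `args` can then be handed to whichever client you use, provided it supports positional parameters.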

Choosing Between Pre and Post-Filtering

Both pre-filtering and post-filtering have their advantages and disadvantages. Post-filtering is easier to implement, especially when vector similarity is the primary search factor, but it can lead to incomplete results. Pre-filtering, on the other hand, can yield more complete results but requires more complex data handling and optimization.

In practice, many systems combine both strategies, depending on the query. For example, they might start with a broad pre-filtering based on metadata (like date ranges) and then apply a more targeted vector search with post-filtering to refine the results further.
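One way to sketch such a per-query decision is to count how many rows pass the metadata filter and pick the strategy from that number. The `chooseStrategy` helper and the scan budget of 10000 rows are purely hypothetical; the right threshold depends on your data and hardware and has to be measured.

```typescript
type Strategy = "pre-filter" | "post-filter";

// Hypothetical budget: how many rows we accept scanning without the index.
const DEFAULT_SCAN_BUDGET = 10000;

// matchingRows would typically come from a count query using the same
// metadata conditions as the search itself.
function chooseStrategy(
  matchingRows: number,
  scanBudget: number = DEFAULT_SCAN_BUDGET
): Strategy {
  // Small filtered sets: exact distances over the filtered rows are affordable.
  // Large filtered sets: use the vector index and filter its results afterwards.
  return matchingRows <= scanBudget ? "pre-filter" : "post-filter";
}
```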

Conclusion

Vector search with metadata filtering offers a powerful approach for handling large-scale data retrieval in LLM and RAG pipelines. Whether you choose pre-filtering, post-filtering, or a combination of both depends on your application's specific requirements. As vector databases continue to evolve, innovations that combine these two approaches more seamlessly should further improve data relevance and retrieval efficiency.
