DEV Community

Augusto Kiniama Rosa

Posted on • Originally published at Medium on

Craft Better Search with Snowflake Cortex Search

The Ultimate Guide to Transforming Your Search Capabilities: A Practical Case

In today’s enterprise, data is generated across many different systems, and it is often difficult to ask questions that span them. Snowflake Cortex Search is tailor-made for fuzzy searches over unstructured text data, and it is even better suited to optimizing retrieval-augmented generation (RAG) pipelines, especially for chatbots powered by generative AI. I will walk you through Cortex Search and its capabilities using a practical use case.

What Is Snowflake Cortex Search?

Cortex Search was built to manage and query large amounts of unstructured text data. In contrast to typical search engines that depend on exact matches, it combines keyword and vector search in a hybrid strategy. Because of this, it can find relevant documents that match user searches even when the matches aren’t perfect: at its foundation is a document search that surfaces relevant results despite small variations such as typos or related phrases.

When using Cortex Search, precise keyword matches aren’t necessary. Approximate matching is also supported, so you can get results for words that are synonyms, contain typos, or carry similar semantic meanings. Take this example: a search for “Snowflake” can also surface results for semantically related terms like “winter” and even misspellings like “snowflaking.”
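The intuition behind typo tolerance can be illustrated with a tiny sketch. This uses plain Python’s `difflib` for character-level similarity; it is purely illustrative and is not how Cortex Search itself works, which relies on embeddings and keyword indexes:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A misspelling still scores high against the intended term,
# while an unrelated word scores much lower.
close = similarity("snowflake", "snowflaking")   # 0.8
far = similarity("snowflake", "warehouse")        # well under 0.5
```

Semantic matches like “Snowflake” versus “winter” need embeddings rather than character overlap, which is exactly what the vector half of the hybrid search provides.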

Cortex Search combines two search methods: keyword search, which is similar to standard full-text search, and vector search, which assesses how semantically similar terms are. This hybrid approach enhances the search experience by guaranteeing both exact and meaning-based matches.

As the context provider, Cortex Search is an integral part of retrieval-augmented generation (RAG) pipelines. In a RAG-based chatbot, Cortex Search finds the best documents for an LLM (Large Language Model) to use when responding to a user’s query. This matters for GenAI applications because the chatbot needs context to give meaningful and accurate answers. But let us be clear: Cortex Search is a search engine, not a chatbot.

While Cortex Analyst is geared toward structured data such as tables and views, Cortex Search is all about unstructured text. Whether your documents are stored in Snowflake tables or uploaded as PDFs, Cortex Search automates the indexing process. It automatically updates the index whenever your data changes, guaranteeing that searches always reflect the most recent information.

RAG chatbots are where Snowflake customers are putting Cortex Search to the test the most, and this conversational AI is where it really shines. To make the chatbot answer intelligently, it finds pertinent information in indexed documents and feeds that context to the LLM. This is especially helpful in industries such as financial services and life sciences for looking up HR policies, answering customer support questions, or conducting research. That said, Cortex Search’s capabilities extend beyond chatbots to more conventional search needs. Picture customers who need to search through massive amounts of text in a product catalog, company directory, or contact search engine.

Organizations can leverage Cortex Search for internal knowledge management by creating productivity solutions that assist staff in navigating through extensive internal content. For example, you can search through academic articles, GitHub pull requests, or HR policies without having to know the specific words.

Creating a service is super simple and can be done via Snowsight or a SQL command.

CREATE [OR REPLACE] CORTEX SEARCH SERVICE [IF NOT EXISTS] <name>
  ON <search_column>
  ATTRIBUTES <col_name> [, ...]
  WAREHOUSE = <warehouse_name>
  TARGET_LAG = '<num> { seconds | minutes | hours | days }'
  COMMENT = '<comment>'
  AS <query>;

How Does It Work?

Because it combines keyword and vector search techniques so well, Cortex Search is super powerful.

  • Whenever a user enters a query, Cortex Search immediately embeds it into a vector space using an embedding model such as Snowflake’s Arctic Embed. By comparing the query with the embedded documents in its index, the engine conducts a semantic search. At the same time, Cortex Search runs a conventional keyword search, looking for documents that lexically match the query’s exact or near wording.
  • After Cortex Search compiles results from vector and keyword searches, the results are re-ranked according to relevancy. The outcome is that the most relevant documents based on the given context will be displayed first.
  • Once your data has been indexed, Cortex Search will take care of incremental updates to the index. Cortex Search ensures that searches reflect the most recent data automatically whenever new rows are added to the table.
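The re-ranking step that merges the keyword and vector result lists can be sketched with reciprocal rank fusion (RRF), a common technique for combining ranked lists. Snowflake has not published its exact fusion algorithm, so treat this as an illustrative toy; the document IDs are made up:

```python
def rrf_merge(keyword_ranked, vector_ranked, k=60):
    """Merge two ranked lists of document IDs with reciprocal rank fusion.

    Each document earns 1 / (k + rank) from every list it appears in;
    documents with the highest combined score rank first.
    """
    scores = {}
    for ranked in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_b" sits near the top of both lists, so it wins overall.
merged = rrf_merge(
    keyword_ranked=["doc_a", "doc_b", "doc_c"],
    vector_ranked=["doc_b", "doc_d", "doc_a"],
)
```

The constant `k` damps the influence of top ranks so that a document appearing in both lists beats one that is first in only one of them.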

A Demo of Cortex Search

To demonstrate Cortex Search, let’s say you wish to create a chatbot that provides responses regarding the human resources manuals and confluence entries at Infostrux Solutions.

  • Step 1: Manually upload the documents, in PDF format, to a Snowflake stage.

  • Step 2: After you import the PDFs into Snowflake, parse them using a Python UDF or the (preview) PDF parsing functions and save the results to a table.
---------------------------------------------------------
-- Create a preprocessing function to do the following:
-- Parse the PDF files and extract the text.
-- Chunk the text into smaller pieces for indexing.
---------------------------------------------------------

CREATE OR REPLACE FUNCTION pdf_text_chunker_cortex(file_url STRING)
    RETURNS TABLE (chunk VARCHAR)
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.9'
    HANDLER = 'pdf_text_chunker'
    PACKAGES = ('snowflake-snowpark-python', 'PyPDF2', 'langchain')
    AS
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import PyPDF2, io
import logging
import pandas as pd

class pdf_text_chunker:

    def read_pdf(self, file_url: str) -> str:
        logger = logging.getLogger("udf_logger")
        logger.info(f"Opening file {file_url}")

        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())

        reader = PyPDF2.PdfReader(buffer)
        text = ""
        for page in reader.pages:
            try:
                text += page.extract_text().replace('\n', ' ').replace('\0', ' ')
            except Exception:
                text = "Unable to Extract"
                logger.warning(f"Unable to extract from file {file_url}, page {page}")

        return text

    def process(self, file_url: str):
        text = self.read_pdf(file_url)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 2000, # Adjust this as needed
            chunk_overlap = 300, # Overlap to keep chunks contextual
            length_function = len
        )

        chunks = text_splitter.split_text(text)
        df = pd.DataFrame(chunks, columns=['chunk'])

        yield from df.itertuples(index=False, name=None)
$$;

-------------------------------------------------------------------
-- Then, create a table to hold the parsed data from the PDF files.
-------------------------------------------------------------------

CREATE OR REPLACE TABLE docs_chunks_table_cortex AS
    SELECT
        relative_path,
        build_scoped_file_url(@docs_stage, relative_path) AS file_url,
        -- preserve file title information by concatenating relative_path with the chunk
        CONCAT(relative_path, ': ', func.chunk) AS chunk,
        'English' AS language
    FROM
        directory(@docs_stage),
        TABLE(pdf_text_chunker_cortex(build_scoped_file_url(@docs_stage, relative_path))) AS func;
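The chunking behavior inside the UDF can be approximated locally without Snowflake or LangChain. This minimal sliding-window splitter mirrors the chunk-size and overlap parameters; it is a simplification, since `RecursiveCharacterTextSplitter` is smarter about splitting on paragraph and sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 300) -> list[str]:
    """Split text into windows of up to chunk_size characters, each
    overlapping the previous chunk by `overlap` characters so that
    sentences cut at a boundary keep their surrounding context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        i += step
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=2)` yields `["abcd", "cdef", "efgh", "ghij", "ij"]`, with each chunk sharing two characters with its neighbor.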
  • Step 3: Use the Cortex Search service to build an index of the parsed text, which will prepare it for queries.
-------------------------
-- Create search service
-------------------------

set warehouse = CURRENT_WAREHOUSE();

CREATE OR REPLACE CORTEX SEARCH SERVICE lab_cortex
    ON chunk
    ATTRIBUTES language
    WAREHOUSE = $warehouse
    TARGET_LAG = '1 hour'
    AS (
    SELECT
        chunk,
        relative_path,
        file_url,
        language
    FROM docs_chunks_table_cortex
    );
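Once the service exists, it can be queried through SQL, the Python API, or the REST API. As a hedged sketch, the REST request for a service like `lab_cortex` can be assembled as follows. The endpoint path follows the pattern Snowflake documents for Cortex Search at the time of writing, so verify it against the current REST API reference; `mydb` and `public` are placeholder names:

```python
import json

def build_search_request(database: str, schema: str, service: str,
                         query: str, columns: list[str], limit: int = 5):
    """Assemble the URL path and JSON body for a Cortex Search REST query."""
    path = (f"/api/v2/databases/{database}/schemas/{schema}"
            f"/cortex-search-services/{service}:query")
    body = json.dumps({"query": query, "columns": columns, "limit": limit})
    return path, body

path, body = build_search_request(
    "mydb", "public", "lab_cortex",
    query="How many vacation days do I have left?",
    columns=["chunk", "file_url"],
    limit=3,
)
```

The request would then be sent as an authenticated POST to your account URL plus `path`; the response contains the top-ranked chunks, which a chatbot passes to the LLM as context.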

I enabled document context so the chatbot uses my documents, then asked, “Am I allowed to give a client a gift of $5,000 dollars?”

In a basic chatbot configuration, Cortex Search extracts the relevant sections of the document in response to a user query. But let’s try the same question without the context of my documents.

As another example with context, after uploading my vacation records I asked, “How many days has Augusto Rosa left of vacation?”

Another example is, “Am I allowed to give my work laptop to a friend?”

Thanks to this simplified procedure, users can now search through massive volumes of unstructured data with ease, rather than manually digging through countless documents.

Conclusion

Enterprises can take advantage of Cortex Search’s scalability by deploying it across the 30+ Snowflake regions where it is available. Its fast and accurate hybrid vector and keyword search is accessible through SQL as well as Python and REST APIs, among other features. Because Cortex Search is built to integrate effortlessly with Snowflake’s data platform, businesses can leverage their existing data pipelines and tools. This is a simple implementation model that can work for many internal needs.

Cortex Search offers a fresh and potent approach to querying Snowflake’s unstructured text data. With its hybrid search capabilities, you can design next-gen chatbots or traditional search engines with confidence that your users will receive the most relevant results. Its automatic indexing and natural fit with RAG pipelines make it a strong choice for AI-driven applications in the Snowflake ecosystem. Businesses can build better, more responsive AI apps by using Cortex Search to mine their unstructured data for insights.

My name is Augusto Rosa, and I am the Vice President of Engineering for Infostrux Solutions. I am also honored to be a Snowflake Data Super Hero 2024 and Snowflake SME.

Thank you for reading this blog post. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.
