DEV Community

Danny Chan


Databricks workshop Q&A

Chunk size?
For IR, Q&A, and text summarization, the chunk size is typically a sentence or paragraph. For document classification or dialogues, it may be larger, such as whole sections.
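The sentence-vs-paragraph distinction above can be sketched with the standard library alone. This is a minimal illustration, not a production chunker; the split patterns are naive assumptions (real pipelines use tokenizer-aware splitters).

```python
import re

def chunk_text(text, unit="sentence"):
    """Split text into chunks at sentence or paragraph boundaries."""
    if unit == "paragraph":
        # Paragraphs are assumed to be separated by blank lines.
        chunks = re.split(r"\n\s*\n", text)
    else:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        chunks = re.split(r"(?<=[.!?])\s+", text)
    return [c.strip() for c in chunks if c.strip()]

doc = "RAG retrieves context. It then generates an answer.\n\nA second paragraph."
print(chunk_text(doc, "sentence"))   # three sentence-level chunks
print(chunk_text(doc, "paragraph"))  # two paragraph-level chunks
```

Smaller units (sentences) favor precise retrieval; larger units (paragraphs, sections) keep more context per chunk.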



Does the embedding include the text from an image?
Images and text are embedded separately.



How does a different embedding model impact how the embedding is saved?
The vector representation of a word will vary between embedding models, because each model processes and encodes semantic and contextual information differently.



What is the context window limitation of the embedding model?
For BERT it is 512 tokens; for GPT-3 it is 2,048. Some newer models have larger windows.
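A simple guard for those limits can be sketched as below. The 4-characters-per-token ratio is a rough assumption for illustration; real pipelines should count tokens with the model's actual tokenizer (e.g. tiktoken for GPT-family models).

```python
# Known context-window limits (in tokens) from the answer above.
LIMITS = {"bert-base": 512, "gpt-3": 2048}

def fits_context(text, model="bert-base"):
    """Rough check that a chunk fits the model's context window.

    Approximates tokens as ~4 characters each; replace with a real
    tokenizer count in practice.
    """
    approx_tokens = len(text) / 4
    return approx_tokens <= LIMITS[model]

print(fits_context("short text", "bert-base"))  # True: well under 512 tokens
```

Chunks that fail this check need to be split further before embedding, since most embedding APIs truncate or reject over-length input.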



Are multimodal LLMs available on Databricks as foundation models or via the Marketplace?
Yes, you can use external multimodal models such as GPT-4o.



Do we save PDFs to tables or Volumes (file storage)?
Use Volumes when saving PDFs for backup and retention; use tables for querying and analysis.



What if my use case calls for different chunking strategies for different documents?

  1. Context- and content-based chunks
  2. Hybrid models
  3. Normalise text
  4. Pre-trained models to assist chunking
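The per-document-type idea above can be sketched as a small dispatcher: normalise the text first, then route each document type to its own splitter. The document types and mapping here are hypothetical examples, not Databricks APIs.

```python
import re

def chunk_by_paragraph(text):
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def chunk_by_sentence(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical mapping from document type to chunking strategy.
STRATEGIES = {
    "faq": chunk_by_sentence,      # short, self-contained answers
    "report": chunk_by_paragraph,  # longer narrative sections
}

def chunk_document(text, doc_type):
    # Normalise whitespace first, then apply the type-specific splitter.
    text = re.sub(r"[ \t]+", " ", text)
    chunker = STRATEGIES.get(doc_type, chunk_by_paragraph)
    return chunker(text)

print(chunk_document("What is RAG? It retrieves context.", "faq"))
```

New document types only need a new entry in the mapping, so the rest of the ingestion pipeline stays unchanged.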



What is the difference between a foundation model and an LLM?
Foundation models are base models. All LLMs can be used as foundation models, but not all foundation models are LLMs; BERT can also be a foundation model.



Is it a good idea to store embeddings in a Delta table? Should they not be stored in a vector database?
Generally, embeddings are stored in a vector DB for efficient retrieval and search. We can use a hybrid approach and leverage Delta tables to store the associated metadata and ancillary data.
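The hybrid split can be illustrated with plain in-memory stand-ins: one store for vectors, a separate table for metadata, joined by chunk id. Everything here (the dicts, `add_chunk`, `search`) is a toy sketch of the pattern, not a real vector DB or Delta API.

```python
import math

# Stand-ins: vectors in a "vector store", metadata in a separate "table",
# both keyed by chunk id.
vector_store = {}    # chunk_id -> embedding
metadata_table = {}  # chunk_id -> {"source": ..., "page": ...}

def add_chunk(chunk_id, embedding, source, page):
    vector_store[chunk_id] = embedding
    metadata_table[chunk_id] = {"source": source, "page": page}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def search(query_emb):
    """Nearest-neighbor lookup in the vector store, joined with metadata."""
    best = max(vector_store, key=lambda cid: cosine(vector_store[cid], query_emb))
    return best, metadata_table[best]

add_chunk("c1", [1.0, 0.0], "a.pdf", 1)
add_chunk("c2", [0.0, 1.0], "b.pdf", 2)
print(search([0.9, 0.1]))  # matches c1, returns its metadata
```

The similarity search touches only the vector side; the metadata join happens afterwards, which is exactly the division of labor the hybrid approach relies on.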



Is there an embedding technique that gives the chunks a sequence number, so that we can avoid overlap and also not lose context?
If you want to avoid overlap, set the overlap to 0. As for a chunking strategy that does not lose context, you can look into chunking by sections, paragraphs, etc., so that logical units are kept intact.
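Sequence numbers themselves are easy to attach at chunking time, as in this minimal sketch (zero overlap, paragraph boundaries); the dict layout is just an illustrative choice.

```python
def numbered_chunks(text, sep="\n\n"):
    """Chunk by paragraph with zero overlap, tagging each chunk with a
    sequence number so the original ordering can be reconstructed."""
    parts = [p.strip() for p in text.split(sep) if p.strip()]
    return [{"seq": i, "text": p} for i, p in enumerate(parts)]

chunks = numbered_chunks("Intro paragraph.\n\nDetails paragraph.")
print(chunks)
```

Storing the sequence number alongside each embedding lets the retriever fetch neighboring chunks (`seq - 1`, `seq + 1`) when a hit needs more surrounding context.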



Similar question: why do we store embeddings in a Delta table here?
For SQL-like querying on embeddings and unified data management, we can prefer Delta tables.



How can we calculate the cost of using an LLM for a GenAI solution in a Databricks production environment?
Databricks cost depends on the enterprise pricing, SKU pricing, and usage. It appears in the bill/usage and is available to the account admin. As part of Unity Catalog, it is available in system tables as well.



Are indexes and embeddings stored in two different Delta tables? e.g indexes in indexer db and embeddings in Vector Database?
Yes



Can we cache the responses?
Model usage can be logged using MLflow, third-party logging tools, or even Delta tables.
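Beyond logging, a simple application-side cache keyed by a hash of the prompt can skip repeated identical model calls. This is a generic pattern sketched with a fake model function; it is not a built-in Databricks feature, and `fake_model` is a placeholder for a real endpoint call.

```python
import hashlib

_cache = {}  # prompt hash -> cached response

def cached_call(prompt, model_fn):
    """Return a cached response for identical prompts; only call the
    (paid) model the first time a prompt is seen."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = model_fn(prompt)
    return _cache[key]

calls = []
def fake_model(p):          # stand-in for a real model endpoint
    calls.append(p)
    return f"answer to: {p}"

cached_call("What is RAG?", fake_model)
cached_call("What is RAG?", fake_model)  # second call served from cache
print(len(calls))  # the model was invoked only once
```

Exact-match caching only helps when prompts repeat verbatim; paraphrased queries still hit the model unless a semantic cache is layered on top.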



Can we feed the user responses back to the model and make it learn from them to improve, or is model training isolated from its usage?
Can we cache the responses from the model so that we can avoid resending the queries and thus the cost?
Each API call is its own isolated request; endpoints do not persist data or context related to the request.



What is the purpose of reranking and ranking?
It is to improve the accuracy and relevance of the retrieved results.
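The idea can be shown with a toy second-pass reranker: the first-pass retriever returns candidates, and a finer scoring function reorders them against the query. Plain term overlap stands in here for a real cross-encoder reranker; it is only an illustrative assumption.

```python
def rerank(query, candidates):
    """Reorder retrieved candidates by a finer relevance score
    (toy term-overlap here; real systems use a cross-encoder)."""
    q_terms = set(query.lower().split())
    def score(doc):
        d_terms = set(doc.lower().split())
        return len(q_terms & d_terms) / max(len(d_terms), 1)
    return sorted(candidates, key=score, reverse=True)

docs = ["delta tables store data", "vector search finds similar embeddings"]
print(rerank("vector similarity search", docs)[0])
```

Because reranking only rescores a small candidate set, an expensive model can be used here without paying its cost across the whole corpus.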



Useful link:

https://chunkviz.up.railway.app/

