<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tuana Celik</title>
    <description>The latest articles on DEV Community by Tuana Celik (@tuanacelik).</description>
    <link>https://dev.to/tuanacelik</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F586180%2Ff360ddb9-125c-46b1-9006-d68cfc075bfa.jpeg</url>
      <title>DEV Community: Tuana Celik</title>
      <link>https://dev.to/tuanacelik</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tuanacelik"/>
    <language>en</language>
    <item>
      <title>Customizing RAG Pipelines to Summarize Latest Hacker News Posts with Haystack 2.0 Preview</title>
      <dc:creator>Tuana Celik</dc:creator>
      <pubDate>Fri, 22 Sep 2023 13:41:43 +0000</pubDate>
      <link>https://dev.to/tuanacelik/customizing-rag-pipelines-to-summarize-latest-hacker-news-posts-with-haystack-20-preview-4cm3</link>
      <guid>https://dev.to/tuanacelik/customizing-rag-pipelines-to-summarize-latest-hacker-news-posts-with-haystack-20-preview-4cm3</guid>
      <description>&lt;p&gt;&lt;em&gt;Take a look at how we are changing Haystack for advanced LLM pipelines, with an example that uses a custom component to fetch the latest Hacker News posts&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TPXIjEeC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/km1oeov1f63tgc5l2zo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TPXIjEeC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/km1oeov1f63tgc5l2zo6.png" alt="Image description" width="663" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the last few months, the team at &lt;a href="https://deepset.ai/"&gt;deepset&lt;/a&gt; has been working on a major upgrade in the Haystack repository. Along the way, we’ve been sharing our updates and design process for the upcoming &lt;a href="https://github.com/deepset-ai/haystack/tree/main/haystack/preview"&gt;Haystack 2.0&lt;/a&gt; with the community, as well as releasing new components in a preview package. This means that you can already start exploring features coming to Haystack 2.0 using the preview components available in the &lt;code&gt;haystack-ai&lt;/code&gt; package (&lt;code&gt;pip install haystack-ai&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can run the example code showcased in this article in the accompanying &lt;a href="https://colab.research.google.com/drive/1YWFvq29xkMAUCt5Aal0VPX0KxGM4xTku?usp=sharing"&gt;Colab notebook&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, I’ll cover two major concepts in Haystack 2.0.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Components:&lt;/strong&gt; These are the smallest building blocks in Haystack. Each one is meant to cover a single, simple task. As well as using the components available in the core Haystack project, in 2.0 it will be easier than ever to create your own custom components.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines:&lt;/strong&gt; These are made by connecting components to each other. Pipelines in 2.0 are more flexible than ever and enable various new connection patterns between your components. While components and pipelines have been at the core of Haystack since the beginning, Haystack 2.0 introduces some significant changes to how they are constructed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll look at how to create custom components and pipelines using the Haystack 2.0 preview. I’ll share a custom Haystack component that fetches the latest posts from Hacker News, and show how we can use it in a retrieval-augmented generation (RAG) pipeline to generate summaries of Hacker News posts.&lt;/p&gt;

&lt;h2&gt;Components in Haystack 2.0&lt;/h2&gt;

&lt;p&gt;A component is a class that does one thing. That thing could be ‘prompt GPT-3.5’, ‘translate’, or ‘retrieve documents’, and so on.&lt;/p&gt;

&lt;p&gt;While Haystack comes with a set of components in the core project, we hope that with Haystack 2.0 you will also be able to easily build components that fit your own custom requirements.&lt;/p&gt;

&lt;p&gt;In Haystack 2.0, a class can become a component with just two additions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;@component&lt;/code&gt; decorator on the class declaration.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;run&lt;/code&gt; method with a &lt;code&gt;@component.output_types(my_output_name=my_output_type)&lt;/code&gt; decorator that describes what output the pipeline should expect from this component.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s about it.&lt;/p&gt;

&lt;h3&gt;Building a Custom Hacker News Component&lt;/h3&gt;

&lt;p&gt;I’ll admit, the idea for this custom component came from one of our amazing Haystack ambassadors on Discord during a live coding session (thanks rec 💙) — and it turned out pretty well! So let’s take a look at how we create a custom component that fetches the latest k posts from Hacker News.&lt;/p&gt;

&lt;p&gt;First, we create a &lt;code&gt;HackernewsNewestFetcher&lt;/code&gt; class. For it to be a valid Haystack component, it also needs a &lt;code&gt;run&lt;/code&gt; method. For now, let’s create a stub that simply returns a dictionary containing a single key &lt;code&gt;articles&lt;/code&gt; with the value ‘Hello world!’.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.preview&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;  

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsNewestFetcher&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  

  &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_types&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Hello world!'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s make our component actually fetch the latest posts from Hacker News. We can use the &lt;a href="https://newspaper.readthedocs.io/en/latest/"&gt;&lt;code&gt;newspaper3k&lt;/code&gt;&lt;/a&gt; package to crawl and extract the contents of given URLs. We will also change the output type to return a list of &lt;code&gt;Document&lt;/code&gt; objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.preview&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;newspaper&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Article&lt;/span&gt;  
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;  

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;  
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsNewestFetcher&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  

  &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_types&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;newest_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;newest_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;last_k&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  
      &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"https://hacker-news.firebaseio.com/v0/item/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json?print=pretty"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;'url'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
        &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s"&gt;'url'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  

    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
      &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
        &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'url'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;  
      &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Couldn't download &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, skipped"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'articles'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have a component that, when run, returns a list of Documents containing the contents of the latest &lt;code&gt;last_k&lt;/code&gt; posts on Hacker News. Here, we store the output in the &lt;code&gt;articles&lt;/code&gt; key of the returned dictionary.&lt;/p&gt;
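&lt;p&gt;Note the &lt;code&gt;if 'url' in article.json()&lt;/code&gt; check in the component above: some Hacker News items (Ask HN posts, for instance) have no external URL and are simply skipped. Here is a dependency-free sketch of that filtering step, using hand-written items that mimic the shape of the Hacker News &lt;code&gt;/v0/item&lt;/code&gt; API response:&lt;/p&gt;

```python
# Hand-written example items mimicking the Hacker News /v0/item API shape.
# Real responses contain more fields (by, score, time, ...).
items = [
    {'id': 1, 'title': 'Show HN: My side project', 'url': 'https://example.com/project'},
    {'id': 2, 'title': 'Ask HN: How do you test pipelines?'},  # no 'url': skipped
]

# The same filtering logic as the component's first loop
urls = [item['url'] for item in items if 'url' in item]
print(urls)  # ['https://example.com/project']
```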

&lt;h2&gt;Pipelines in Haystack 2.0&lt;/h2&gt;

&lt;p&gt;A pipeline is a structure that connects one component’s output to another component’s input until a final result is reached.&lt;/p&gt;

&lt;p&gt;A pipeline is created with a few steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Pipeline:
&lt;code&gt;pipeline = Pipeline()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add components to the pipeline:
&lt;code&gt;pipeline.add_component(instance=component_a, name="ComponentA")&lt;/code&gt; &lt;code&gt;pipeline.add_component(instance=component_b, name="ComponentB")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Connect an output from one component to the input of another:
&lt;code&gt;pipeline.connect("component_a.output_a", "component_b.input_b")&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
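&lt;p&gt;To make the wiring concrete before we touch real components, here is a tiny, dependency-free toy that mimics the three steps above. It is emphatically &lt;em&gt;not&lt;/em&gt; Haystack’s implementation (which handles full graphs, validation, and much more), just a sketch of the output-to-input idea:&lt;/p&gt;

```python
# Toy illustration of the pipeline concept: a component's declared output
# feeds the next component's input. NOT Haystack's actual implementation.

class ToyPipeline:
    def __init__(self):
        self.components = {}   # name -> component instance
        self.connections = []  # (from_name, from_key, to_name, to_key)

    def add_component(self, name, instance):
        self.components[name] = instance

    def connect(self, sender, receiver):
        # "component_a.output_a" -> "component_b.input_b"
        from_name, from_key = sender.split('.')
        to_name, to_key = receiver.split('.')
        self.connections.append((from_name, from_key, to_name, to_key))

    def run(self, data):
        results = {}
        # Run the entry components with the user-supplied inputs
        for name, inputs in data.items():
            results[name] = self.components[name].run(**inputs)
        # Propagate each output along its connection, in insertion order
        for from_name, from_key, to_name, to_key in self.connections:
            value = results[from_name][from_key]
            results[to_name] = self.components[to_name].run(**{to_key: value})
        return results

class Upper:
    def run(self, text):
        return {'upper': text.upper()}

class Exclaim:
    def run(self, text):
        return {'shout': text + '!'}

pipe = ToyPipeline()
pipe.add_component('upper', Upper())
pipe.add_component('exclaim', Exclaim())
pipe.connect('upper.upper', 'exclaim.text')
print(pipe.run({'upper': {'text': 'hello'}})['exclaim']['shout'])  # HELLO!
```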

&lt;p&gt;There are already enough components available in the Haystack 2.0 preview for us to build a simple RAG pipeline that uses our new &lt;code&gt;HackernewsNewestFetcher&lt;/code&gt; for the retrieval augmentation step.&lt;/p&gt;

&lt;h3&gt;Building a RAG Pipeline to Generate Summaries of Hacker News Posts&lt;/h3&gt;

&lt;p&gt;To build a RAG pipeline that can create a summary for each of the latest k posts on Hacker News, we will use two components from the Haystack 2.0 preview:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;PromptBuilder&lt;/code&gt;: This component allows us to create prompt templates using Jinja as our templating language.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;GPTGenerator&lt;/code&gt;: This component prompts the specified GPT model. We can connect the &lt;code&gt;PromptBuilder&lt;/code&gt; output to this component to customize how we interact with our chosen model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we initialize all of the components we will need for the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.preview&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.preview.components.builders.prompt_builder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptBuilder&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.preview.components.generators.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GPTGenerator&lt;/span&gt;  

&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""  
You will be provided a few of the latest posts in HackerNews, followed by their URL.  
For each post, provide a brief summary followed by the URL the full post can be found at.  

Posts:  
{% for article in articles %}  
  {{article.text}}  
  URL: {{article.metadata['url']}}  
{% endfor %}  
"""&lt;/span&gt;  

&lt;span class="n"&gt;prompt_builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPTGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'YOUR_API_KEY'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;fetcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HackernewsNewestFetcher&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we add the components to a Pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hackernews_fetcher"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"prompt_builder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_builder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, we connect the components to each other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hackernews_fetcher.articles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"prompt_builder.articles"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"prompt_builder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"llm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, notice how we connect &lt;code&gt;hackernews_fetcher.articles&lt;/code&gt; to &lt;code&gt;prompt_builder.articles&lt;/code&gt;. This is because &lt;code&gt;prompt_builder&lt;/code&gt; expects &lt;code&gt;articles&lt;/code&gt; in its template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Posts:  
{% for article in articles %}  
  {{article.text}}  
  URL: {{article.metadata['url']}}  
{% endfor %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output and input keys do not need to have matching names. Additionally, &lt;code&gt;prompt_builder&lt;/code&gt; makes all of the input keys available to your prompt template. We could, for example, provide a &lt;code&gt;documents&lt;/code&gt; input to &lt;code&gt;prompt_builder&lt;/code&gt; instead of &lt;code&gt;articles&lt;/code&gt;. Then our code might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""  
You will be provided a few of the latest posts in HackerNews, followed by their URL.  
For each post, provide a brief summary followed by the URL the full post can be found at.  

Posts:  
{% for document in documents %}  
  {{document.text}}  
  URL: {{document.metadata['url']}}  
{% endfor %}  
"""&lt;/span&gt;  

&lt;span class="p"&gt;[...]&lt;/span&gt;  

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hackernews_fetcher.articles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"prompt_builder.documents"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the prompt now refers to &lt;code&gt;documents&lt;/code&gt;, and the &lt;code&gt;connect&lt;/code&gt; call now attaches to the corresponding &lt;code&gt;prompt_builder.documents&lt;/code&gt; input.&lt;/p&gt;

&lt;p&gt;Now that we have a pipeline, we can run it. Here is what I got as a response at about 22:45 CET on September 21st 🤗&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"hackernews_fetcher"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="s"&gt;"last_k"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;  
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'llm'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'replies'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. "The translation world has legends of its own, but not all legends involve greatness.   
Many provide pain, confusion, or comedy, as these examples of bad game translation prove."   
- This post shares a humorous look at some examples of poor video game translations that have   
resulted in confusion and comedy. The author seeks to highlight that while translation is often   
necessary in game localization, it can sometimes yield suboptimal results.  
Link: https://legendsoflocalization.com/bad-translation/  

2. “Recently, I found myself returning to a compelling series of   
blog posts titled Zero-cost futures in Rust by Aaron Turon about what would   
become the foundation of Rust's async ecosystem.”   
- This post provides an in-depth analysis of the current state of Rust's   
'async' ecosystem, drawing upon the author's own experiences and Aaron Turon's   
blog series, "Zero-cost futures in Rust". The author also discusses the benefits and   
negatives of the current async ecosystem, the problems with ecosystem fragmentation,   
the state and issue of async-std, alternative runtimes, the complexities of writing async code,   
the benefits of synchronous threads over async, and the obsessiveness of Rust landscape with an   
async-first approach. The post concludes with the notion that async Rust should be used only   
when necessary and that the smaller, simpler language inside Rust (the synchronous Rust)   
should be the default mode.  
Link: https://corrode.dev/blog/async/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Further Improvements&lt;/h2&gt;

&lt;p&gt;This custom component was created as an experiment and you could certainly take it much further in a real-world application.&lt;/p&gt;

&lt;p&gt;For example, our experimental component does nothing to reduce the length of the content in each article. This means that GPT-4 may struggle to give a good response, especially if &lt;code&gt;last_k&lt;/code&gt; is set to a high number.&lt;/p&gt;
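&lt;p&gt;One simple mitigation would be to truncate each article’s text before it reaches the prompt. A minimal sketch of the idea (the word-based cut-off and the 1,000-word default are arbitrary choices of mine, not part of the original component):&lt;/p&gt;

```python
def truncate_words(text: str, max_words: int = 1000) -> str:
    """Keep only the first max_words words of an article's text."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return ' '.join(words[:max_words]) + ' ...'

# Inside the fetcher, the Document could then be built from truncated text:
# docs.append(Document(text=truncate_words(article.text), metadata={...}))
print(truncate_words('one two three four five', max_words=3))  # one two three ...
```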

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
      <category>haystack</category>
    </item>
    <item>
      <title>Talk to YouTube Videos with Haystack Pipelines</title>
      <dc:creator>Tuana Celik</dc:creator>
      <pubDate>Fri, 08 Sep 2023 14:12:41 +0000</pubDate>
      <link>https://dev.to/tuanacelik/talk-to-youtube-videos-with-haystack-pipelines-3b8n</link>
      <guid>https://dev.to/tuanacelik/talk-to-youtube-videos-with-haystack-pipelines-3b8n</guid>
      <description>&lt;p&gt;Use Whisper to provide YouTube videos as context for retrieval augmented generation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4wF3dU0d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t2wnv9fvnglbetdkf6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4wF3dU0d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2t2wnv9fvnglbetdkf6f.png" alt="Talk to YouTube Videos with Haystack Pipelines" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can use this&lt;/em&gt; &lt;a href="https://colab.research.google.com/drive/1sZM5Y1NkPOy3y8HCsecsmhjImrARIVru?usp=sharing"&gt;&lt;em&gt;Colab&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for a working example of the application described in this article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, I’ll show an example of how to leverage transcription models like OpenAI’s Whisper to build a retrieval-augmented generation (RAG) pipeline that will allow us to effectively search through video content.&lt;/p&gt;

&lt;p&gt;The example application I’ll showcase is able to answer questions based on the transcript extracted from the video. I’ll use the &lt;a href="https://www.youtube.com/watch?v=h5id4erwD4s"&gt;video by Erika Cardenas&lt;/a&gt; as an example. In the video, she talks about chunking and preprocessing documents for RAG pipelines. Once we’re done, we will be able to query a Haystack pipeline that will respond based on the contents of the video.&lt;/p&gt;

&lt;h2&gt;Transcribing and Storing the Video&lt;/h2&gt;

&lt;p&gt;To get started, we first need to set up an &lt;a href="https://docs.haystack.deepset.ai/docs/pipelines#indexing-pipelines"&gt;indexing pipeline&lt;/a&gt;. These pipelines in Haystack are designed to be given files of some form (.pdf, .txt, .md and in our case, a YouTube link), and store them in a database. The indexing pipeline is also used to design and define how we would like files to be prepared. This often involves &lt;a href="https://docs.haystack.deepset.ai/docs/file_converters"&gt;file conversion&lt;/a&gt; steps, some &lt;a href="https://docs.haystack.deepset.ai/docs/preprocessor"&gt;preprocessing&lt;/a&gt;, and maybe also some &lt;a href="https://docs.haystack.deepset.ai/docs/retriever#embedding-retrieval-recommended"&gt;embedding&lt;/a&gt; creation and so on.&lt;/p&gt;

&lt;p&gt;The way we design the components and structure of this pipeline will also be important for another type of pipeline we will create in the next section: the RAG pipeline, often also referred to as the query or LLM pipeline. While the indexing pipeline defines how we prepare and store data, an LLM pipeline &lt;strong&gt;&lt;em&gt;uses&lt;/em&gt;&lt;/strong&gt; that stored data. A simple example of the impact an indexing pipeline has on the RAG pipeline is that, depending on the model we’re using, we may have to chunk our files into longer or shorter pieces.&lt;/p&gt;
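&lt;p&gt;Chunking itself is handled for us by Haystack’s &lt;code&gt;PreProcessor&lt;/code&gt;, but the core idea, splitting a transcript into fixed-size, overlapping word windows, can be sketched in a few lines (the sizes below are arbitrary examples, not recommended settings):&lt;/p&gt;

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into windows of chunk_size words, overlapping by overlap words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A longer chunk_size suits models with large context windows;
# a shorter one gives more precise retrieval hits.
print(chunk_words('a b c d e f g h', chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h']
```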

&lt;h3&gt;Reusability&lt;/h3&gt;

&lt;p&gt;The idea behind Haystack pipelines is that once created, they can be re-invoked when needed. This ensures that data is treated the same way each time. In terms of indexing pipelines, this means we have a way to keep our databases for RAG pipelines always up to date. In a practical sense for this example application, when there’s a new video we want to be able to query, we re-use the same indexing pipeline and run the new video through it.&lt;/p&gt;

&lt;h3&gt;Creating the Indexing Pipeline&lt;/h3&gt;

&lt;p&gt;In this example, we’re using Weaviate as our vector database for storage. However, Haystack provides a number of &lt;a href="https://haystack.deepset.ai/integrations?type=Document+Store"&gt;Document Stores&lt;/a&gt; which you can pick from.&lt;/p&gt;

&lt;p&gt;First, we create our &lt;code&gt;WeaviateDocumentStore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;weaviate&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;weaviate.embedded&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EmbeddedOptions&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.document_stores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WeaviateDocumentStore&lt;/span&gt;  

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weaviate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
  &lt;span class="n"&gt;embedded_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weaviate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddedOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;document_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WeaviateDocumentStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6666&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we build the indexing pipeline. Here, our aim is to create a pipeline that will create transcripts of YouTube videos. So, we use the &lt;a href="https://docs.haystack.deepset.ai/docs/whisper_transcriber"&gt;&lt;strong&gt;&lt;code&gt;WhisperTranscriber&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt; as our first component. This component uses &lt;a href="https://openai.com/research/whisper"&gt;Whisper&lt;/a&gt; by OpenAI, an automatic speech recognition (ASR) system which can be used to transcribe audio into text. The component expects audio files, and returns transcripts in &lt;a href="https://docs.haystack.deepset.ai/docs/documents_answers_labels"&gt;Haystack Document&lt;/a&gt; form, ready to be used in any Haystack pipeline.&lt;/p&gt;

&lt;p&gt;We also include preprocessing and embedding creation in our pipeline, because when it’s time to build the RAG pipeline, we want to run semantic search over the indexed transcripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.nodes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EmbeddingRetriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PreProcessor&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.nodes.audio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WhisperTranscriber&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.pipelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;  

&lt;span class="n"&gt;preprocessor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PreProcessor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EmbeddingRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   
                              &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"sentence-transformers/multi-qa-mpnet-base-dot-v1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;whisper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WhisperTranscriber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'OPENAI_API_KEY'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;whisper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"File"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;preprocessor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Preprocessor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Embedder"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Preprocessor"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"DocumentStore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Embedder"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we create a helper function that extracts the audio from YouTube videos so that we can run the pipeline on them. For this, we install the &lt;code&gt;pytube&lt;/code&gt; package 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pytube&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YouTube&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;youtube2audio&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;yt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YouTube&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'160kbps'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can run our indexing pipeline with a URL to a YouTube video:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;youtube2audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.youtube.com/watch?v=h5id4erwD4s"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;indexing_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The Retrieval Augmented Generation (RAG) Pipeline&lt;/h2&gt;

&lt;p&gt;This is certainly the fun part. We now define our RAG pipeline: the pipeline that defines &lt;em&gt;how&lt;/em&gt; we query our videos. Although RAG pipelines are often built for question-answering, they can be designed for a number of other use cases. What the pipeline does in this case is largely defined by the prompt you provide to the LLM. You can find various prompts for different use cases in the &lt;a href="https://prompthub.deepset.ai/"&gt;PromptHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;The Prompt&lt;/h3&gt;

&lt;p&gt;For this example, we’ve gone with a commonly used style of question-answering prompt, although you can of course change it to achieve what you want. For example, changing it to a prompt that asks for a summary might be interesting, or you could make it more general. Here we’re also informing the model that the transcripts come from Weaviate videos.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You will be provided some transcripts from Weaviate YouTube videos.   
Please answer the query based on what is said in the videos.  
Video Transcripts: {join(documents)}  
Query: {query}  
Answer:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Haystack, these prompts can be included in a pipeline with the &lt;a href="https://docs.haystack.deepset.ai/docs/prompt_node#prompttemplates"&gt;&lt;code&gt;PromptTemplate&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.haystack.deepset.ai/docs/prompt_node"&gt;&lt;code&gt;PromptNode&lt;/code&gt;&lt;/a&gt; components.&lt;/p&gt;

&lt;p&gt;While the &lt;code&gt;PromptTemplate&lt;/code&gt; is where we define the prompt and the variables it expects as inputs (in our case &lt;em&gt;documents&lt;/em&gt; and &lt;em&gt;query&lt;/em&gt;), the &lt;code&gt;PromptNode&lt;/code&gt; is the interface through which we interact with LLMs. In this example, we’re using GPT-4 as our model of choice, but you can &lt;a href="https://docs.haystack.deepset.ai/docs/prompt_node#models"&gt;change this to use other models from Hugging Face, SageMaker, Azure&lt;/a&gt; and so on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;haystack.nodes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnswerParser&lt;/span&gt;  

&lt;span class="n"&gt;video_qa_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"You will be provided some transcripts from Weaviate YouTube videos. Please answer the query based on what is said in the videos.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;  
                                        &lt;span class="s"&gt;"Video Transcripts: {join(documents)}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;  
                                        &lt;span class="s"&gt;"Query: {query}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;  
                                        &lt;span class="s"&gt;"Answer:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AnswerParser&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

&lt;span class="n"&gt;prompt_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name_or_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'OPENAI_KEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="n"&gt;default_prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video_qa_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;The Pipeline&lt;/h3&gt;

&lt;p&gt;Finally, we define our RAG pipeline. The important thing to note here is how the &lt;em&gt;documents&lt;/em&gt; input gets provided to the prompt we are using.&lt;/p&gt;

&lt;p&gt;Haystack retrievers always return &lt;code&gt;documents&lt;/code&gt;. Notice below how the first component to get the query is the same &lt;code&gt;EmbeddingRetriever&lt;/code&gt; that we used in the indexing pipeline above. This allows us to embed the query using the same model that was used for indexing the transcript. The embeddings of the query and indexed transcripts are then used to retrieve the most relevant parts of the transcript. Since these are returned by the retriever as &lt;strong&gt;&lt;em&gt;documents,&lt;/em&gt;&lt;/strong&gt; we are able to fill in the &lt;em&gt;documents&lt;/em&gt; parameter of the prompt with whatever the retriever returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;video_rag_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;video_rag_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Retriever"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Query"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
&lt;span class="n"&gt;video_rag_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;component&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"PromptNode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Retriever"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can run the pipeline with a query. The response will be based on what Erika said in the example video we’re using 🤗&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;video_rag_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Why do we do chunking?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result I got for this was the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunking is done to ensure that the language model is receiving the most   
relevant information and not going over the context window. It involves   
splitting up the text once it hits a certain token limit, depending on   
the model or the chunk size defined. This is especially useful in documents   
where subsequent sentences or sections may not make sense without the   
information from previous ones. Chunking can also help in providing extremely   
relevant information when making queries that are specific to titles or   
sections.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Further Improvements&lt;/h2&gt;

&lt;p&gt;In this example, we’ve used a transcription model that can transcribe audio into text but cannot distinguish between speakers. A follow-up step I would like to try is to use a model that allows for speaker distinction. This would let me ask questions and, in the model’s response, see who gave that answer in the video.&lt;/p&gt;

&lt;p&gt;Another point worth making is that this pipeline, built for demonstration purposes, uses a lightweight yet quite effective &lt;strong&gt;sentence-transformers&lt;/strong&gt; model for retrieval, and the default settings for preprocessing. More could definitely be done to find the best embedding model for retrieval. And taking inspiration from Erika’s video, the chunking and preprocessing of the transcribed documents could be evaluated and improved.&lt;/p&gt;
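&lt;p&gt;To make the chunking idea concrete, here is a rough sketch of a word-based chunker with overlap. It’s an illustration of the general technique only, not how Haystack’s &lt;code&gt;PreProcessor&lt;/code&gt; is actually implemented:&lt;/p&gt;

```python
def chunk_words(text, chunk_size=200, overlap=20):
    # Split text into overlapping word-based chunks. Each chunk starts
    # (chunk_size - overlap) words after the previous one, so consecutive
    # chunks share `overlap` words of context.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk reached the end of the text
    return chunks
```

&lt;p&gt;With a &lt;code&gt;chunk_size&lt;/code&gt; of 3 words and an overlap of 1, the text "a b c d e f" becomes the chunks "a b c", "c d e" and "e f".&lt;/p&gt;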

&lt;p&gt;To discover more about the available pipelines and components that would help you build custom LLM applications, check out the &lt;a href="https://docs.haystack.deepset.ai/"&gt;Haystack documentation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>haystack</category>
      <category>python</category>
    </item>
    <item>
      <title>Strava Dashboards with Zapier and Cumul.io</title>
      <dc:creator>Tuana Celik</dc:creator>
      <pubDate>Fri, 15 Oct 2021 12:37:57 +0000</pubDate>
      <link>https://dev.to/tuanacelik/strava-dashboards-with-zapier-and-cumulio-ei7</link>
      <guid>https://dev.to/tuanacelik/strava-dashboards-with-zapier-and-cumulio-ei7</guid>
      <description>&lt;p&gt;Recently one of our &lt;a href="http://cumul.io/"&gt;Cumul.io&lt;/a&gt; Ambassadors shared a company Strava dashboard they built with Cumul.io and I had to build something similar for us. It's a nice way to keep motivated to go out for runs and great for those who are competitive when it comes to exercising (NOT me). And it's a fun way to use a data visualization tool like Cumul.io. So anyway, I followed the lead of Olivier de Lamotte who gave us the idea after sharing the one he built for his team at &lt;a href="https://qualifio.com/"&gt;Qualifio&lt;/a&gt; and built our own &lt;a href="https://app.cumul.io/s/cumulio-club-activities-on-strava-baxglzxfbgz91w3z"&gt;Cumul.io Team Strava Dashboard&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;For this exercise, again as Olivier did, I used &lt;a href="https://zapier.com/"&gt;Zapier&lt;/a&gt;. I initially thought I might just use the &lt;a href="https://developers.strava.com/"&gt;Strava API directly&lt;/a&gt;. However, I soon found out that the &lt;a href="https://developers.strava.com/docs/reference/#api-Clubs-getClubActivitiesById"&gt;activities endpoint&lt;/a&gt; for club activities doesn't really provide a lot of data and misses some (imo) obvious fields, like the date of the activity. So instead of building around it myself, I went ahead with the Zapier workflow that is already available, which adds a row to a Google Sheet whenever there is a new activity by a club member.&lt;/p&gt;
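&lt;p&gt;For the curious, here's roughly what calling that club activities endpoint directly looks like. This is just a sketch: the URL follows the Strava v3 API reference, but the helper names and token handling are my own, and you'd need a valid OAuth access token for the request to succeed.&lt;/p&gt;

```python
STRAVA_API = "https://www.strava.com/api/v3"


def club_activities_url(club_id):
    # Build the club activities endpoint URL (illustrative helper).
    return f"{STRAVA_API}/clubs/{club_id}/activities"


def fetch_club_activities(club_id, access_token):
    # The endpoint requires an OAuth bearer token.
    import requests  # third-party: pip install requests

    resp = requests.get(
        club_activities_url(club_id),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    resp.raise_for_status()
    # Each returned activity has fields like name, distance and
    # moving_time, but no date -- the limitation mentioned above.
    return resp.json()
```

&lt;p&gt;That missing date field is exactly why the ready-made Zapier workflow was the easier route.&lt;/p&gt;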

&lt;blockquote&gt;
&lt;p&gt;Side note: Both &lt;a href="http://cumul.io/"&gt;Cumul.io&lt;/a&gt; and &lt;a href="https://zapier.com/"&gt;Zapier&lt;/a&gt; have free trials so if you're not a paying user you can still get this Strava dashboard up pretty easily!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here's how it's set up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Zapier workflow&lt;/li&gt;
&lt;li&gt;Add the dataset to Cumul.io&lt;/li&gt;
&lt;li&gt;Let your creativity shine and create the most brutal athletes dashboard for your team&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Create a Zapier Workflow&lt;/h3&gt;

&lt;p&gt;I'm new to Zapier, and the simplest way I can explain how it works is that you create 'Zaps' (or workflows) which you can turn on or off. And these workflows are simply: Trigger -&amp;gt; Action. I.e.: "When this happens, do that".&lt;/p&gt;

&lt;p&gt;There are already a number of Strava-based workflows available on Zapier. Just search for it and one of the first ones that comes up will be the one I used for our dashboard, which is 'Add new Strava club activities as rows to Google Sheets':&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1qORkYbK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cj1k25in1jod5k498zus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1qORkYbK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cj1k25in1jod5k498zus.png" alt="Create Zapier Strava Trigger" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you select that, the workflow will appear in your 'Zaps' tab and it's pretty straightforward. First, you should set up your trigger, for which I picked 'New or Updated Club Activity':&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MWgSsP1c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6345r2sc2zipztbymhd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MWgSsP1c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6345r2sc2zipztbymhd6.png" alt="Create Zapier Strava Trigger" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then you will also have to select a Strava account to connect to. Next, you'll be able to pick a Strava club that this account is a member of to get new activities from. Warning (from experience)! Be aware that if a member of the club has a private Strava account, the owner of the account that sets up this trigger will also have to follow said member for their activities to be tracked!&lt;/p&gt;

&lt;p&gt;Once you have set this up, you can set up the action. This is pretty simple and Zapier will walk you through selecting a Google Sheet to add new activities to. Here's an example of what mine looks like: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HDdrefVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ci5xe08fg5h3l7hjji3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HDdrefVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ci5xe08fg5h3l7hjji3b.png" alt="Create Zapier Action" width="662" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, don't forget to turn this workflow (zap) ON!&lt;/p&gt;

&lt;p&gt;Once you've set this up, you should be able to see new activities showing up in the Google Sheet you selected. This sheet will be what we connect to Cumul.io.&lt;/p&gt;

&lt;h3&gt;Add the Dataset to Cumul.io&lt;/h3&gt;

&lt;p&gt;Now that we have a Google Sheet that lists activities by a club member on every row (and thanks to Zapier it's updated every time there is a new one), we just have to connect the dataset to Cumul.io and create a dashboard.&lt;/p&gt;

&lt;p&gt;In your Cumul.io account head over to 'Datasets' and select 'New Dataset'. Here, simply pick Google Drive and select the Google Sheet you just created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1KHSuw_1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rykg1t2w855qwyoz9omz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1KHSuw_1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rykg1t2w855qwyoz9omz.png" alt="Add Dataset to Cumul.io" width="727" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Create a Dashboard&lt;/h3&gt;

&lt;p&gt;Now this is quite simple. Just go ahead and add the dataset you created and let your creativity shine. Here's the dashboard I created for our team for inspiration. And a lot of that inspiration was taken from Olivier De Lamotte's Qualifio dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--onwuw7Pc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc82qulewbkgu2hcgf42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--onwuw7Pc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc82qulewbkgu2hcgf42.png" alt="Cumul.io Strava Dashboard" width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's all! Let me know if you create your own. Would love to see them so please do share 🎈&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>saas</category>
    </item>
    <item>
      <title>Building My First Python Package with Poetry</title>
      <dc:creator>Tuana Celik</dc:creator>
      <pubDate>Mon, 19 Apr 2021 15:57:46 +0000</pubDate>
      <link>https://dev.to/tuanacelik/building-my-first-python-package-with-poetry-1bgj</link>
      <guid>https://dev.to/tuanacelik/building-my-first-python-package-with-poetry-1bgj</guid>
      <description>&lt;p&gt;A lot of firsts happening for me here. First post on Dev.to AND first published Python package. So I though I'd take the opportunity to share my experience building and publishing the package with &lt;a href="https://python-poetry.org/docs/"&gt;Poetry&lt;/a&gt; 😊&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cumul.io"&gt;Cumul.io&lt;/a&gt; has a number of SDKs available for people to install and use, but we were missing one in Python. So I built one! It's a simple one that provides interaction with our Core API (For those of you who don't know I'll add some info about Cumul.io at the end of this post). This might not be surprising to a lot of you but as it was my first go, I soon discovered there are a plethora of routes you can take to publish a package in PyPI and so I put in some research time to decide which one would be the least painful for me. In the end I decided on going for Poetry. &lt;a href="https://youtu.be/tNlurLxcf68"&gt;This video by Black Hills Information Security&lt;/a&gt; was extremely helpful to understand the different options and the advantages and disadvantages that come with them. Here's my takeaway and experience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Poetry makes the configuration of your project a lot simpler than some of the other methods out there. You end up with only a &lt;em&gt;pyproject.toml&lt;/em&gt; (not even a &lt;em&gt;requirements.txt&lt;/em&gt; is needed anymore) vs the &lt;em&gt;setup.py&lt;/em&gt; and &lt;em&gt;requirements.txt&lt;/em&gt; that you need with Pipenv, for example. Example &lt;em&gt;pyproject.toml&lt;/em&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool.poetry]
name = "cumulio"
version = "x.x.x"
description = "Cumulio Python SDK for the Core API"
authors = []
readme = "README.md"
homepage = "Link to your homepage"
repository = "Link to your repo"
exclude = ["test/*"]
include = ["LICENSE"]
classifiers = [
    "Topic :: Software Development :: Libraries :: Python Modules"
]
packages = [
    { include = "cumulio"}
]

[tool.poetry.scripts]

[tool.poetry.dependencies]
python = "^3.7"
requests = "^2.25.1"

[tool.poetry.dev-dependencies]
autopep8 = "^1.5.6"

[build-system]
requires = ["poetry-core&amp;gt;=1.0.0"]
build-backend = "poetry.core.masonry.api"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how you can set your classifiers for PyPI as well as your dependencies.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Not only does it manage your dependencies, it also manages your packaging and upload to PyPI.&lt;/li&gt;
&lt;li&gt;Super easy to work in a virtual environment and get shell access inside the virtual environment. Literally just &lt;code&gt;poetry shell&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
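&lt;p&gt;To give an idea of the end-to-end flow those points describe, here is roughly what it looks like on the command line (the package name is just an example, and &lt;code&gt;poetry publish&lt;/code&gt; will prompt for your PyPI credentials unless you've configured a token):&lt;br&gt;
&lt;/p&gt;

```plaintext
poetry new my-package      # scaffold a project with a pyproject.toml
poetry add requests        # add a dependency (updates pyproject.toml and the lock file)
poetry install             # create/update the virtual environment
poetry shell               # drop into a shell inside that environment
poetry build               # produce an sdist and a wheel in dist/
poetry publish             # upload the built artifacts to PyPI
```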

&lt;p&gt;The current &lt;a href="https://pypi.org/project/cumulio/"&gt;Cumul.io Python Package&lt;/a&gt; provides a simple interface to the Cumul.io Core API. The code and Poetry setup are all open source on &lt;a href="https://github.com/cumulio/cumul.io-sdk-python"&gt;GitHub&lt;/a&gt;. Now that I know how to upload packages to PyPI, I intend to expand the SDK!&lt;/p&gt;

&lt;p&gt;As a first-timer, Poetry made my life a lot easier than it could have been. Let me know what your experiences were and, if any, what disadvantages of Poetry you've noticed that I haven't yet. I would be interested to know!&lt;/p&gt;

&lt;p&gt;About Cumul.io:&lt;/p&gt;

&lt;p&gt;Cumul.io is an API-first data analytics and dashboarding platform that makes integrating dashboards and charts into your own platforms super easy. It's designed so that anything you can do from within Cumul.io's UI, you can also do via the API, which makes it quite a customizable option for a data analytics tool. The SDKs are there and are being expanded to cover a wide range of languages most commonly used in web development and data science (hence the reason for a Python SDK).&lt;/p&gt;

</description>
      <category>python</category>
      <category>todayilearned</category>
    </item>
  </channel>
</rss>
