DEV Community

David Mezzetti for NeuML

Posted on • Originally published at towardsdatascience.com on

Introducing txtai, AI-powered semantic search built on Transformers

Add Natural Language Understanding to any application

Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never fully solved.

The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability, allowing us to add amazing functionality quickly with limited compute and people. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an AI-powered semantic search platform that enables Natural Language Understanding (NLU) based search in any application.

Introducing txtai


txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords.

Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings). Innovation is happening at a rapid pace; models can understand concepts in documents, audio, images and more.

The following is a summary of key features.

  • 🔎 Large-scale similarity search with multiple index backends (Faiss, Annoy, Hnswlib)
  • 📄 Create embeddings for text snippets, documents, audio, images and video. Supports transformers and word vectors.
  • 💡 Machine-learning pipelines to run extractive question-answering, zero-shot labeling, transcription, translation, summarization and text extraction
  • ↪️️ Workflows that join pipelines together to aggregate business logic. txtai processes can be microservices or full-fledged indexing workflows.
  • 🔗 API bindings for JavaScript, Java, Rust and Go
  • ☁️ Cloud-native architecture that scales out with container orchestration systems (e.g. Kubernetes)

Applications range from similarity search to complex NLP-driven data extractions to generate structured databases. The following applications are powered by txtai.

  • paperai: AI-powered literature discovery and review engine for medical/scientific papers
  • tldrstory: AI-powered understanding of headlines and story text
  • neuspo: Fact-driven, real-time sports event and news site
  • codequestion: Ask coding questions directly from the terminal

txtai is built with Python 3.7+, Hugging Face Transformers, Sentence Transformers and FastAPI.

Install and run txtai

The following command installs txtai.

pip install txtai

Next, we can create a simple in-memory model with a couple of sample records to try out txtai.

Basic Embeddings Instance

Running the code above will print the following:

Embeddings query output

The example above shows that, for almost all of the queries, the query text doesn’t appear verbatim in the matching text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!

Build an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn’t make sense to recompute embeddings for every record on each query. txtai supports building pre-computed indices, which significantly improves performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector at search time.

Build an Embeddings Index

Once again, the same results are returned; the only difference is that the embeddings are pre-computed.

Embeddings query output

Save and load an Embeddings index

Embeddings indices can be saved to disk and reloaded. At this time, indices are not created incrementally; the index needs a full rebuild to incorporate new data.

Save and load an Embeddings Index

The results of the code above:

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg

Update and delete from an Embeddings index

Updates and deletes are supported for Embeddings indices. The upsert operation will insert new data and update existing data.

The following section runs a query, then updates a value, changing the top result, and finally deletes the updated value to revert back to the original query results.

Update and delete from an Embeddings index

The results of the code above:

Initial:       Maine man wins $1M from $25 lottery ticket
After update:  See it: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket

With a limited amount of code, we’re able to build a system with a deep understanding of natural language. The amount of knowledge that comes from Transformer models is phenomenal.

Sentence Embeddings

txtai builds sentence embeddings to perform similarity searches. txtai takes each text record entry, tokenizes it and builds an embeddings representation of that record. At search time, the query is transformed into a text embedding and then is compared to the repository of text embeddings.

txtai supports two methods for creating text embeddings, sentence transformers and word embeddings vectors. Both methods have their merits as shown below.

Sentence Transformers


  • Creates a single embeddings vector via mean pooling of vectors generated by the Transformers library.
  • Supports models stored on Hugging Face’s model hub or stored locally.
  • See Sentence Transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face’s model hub.
  • Base models require significant compute capability (GPU preferred). Possible to build smaller/lighter weight models that trade off accuracy for speed.

Word Embeddings

Building a sentence embedding index with fastText and BM25

  • Creates a single embeddings vector via BM25 scoring of each word component. Reference above describes this method in detail.
  • Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
  • See words.py for code that can build word vectors for custom datasets.
  • Significantly better speed with default models. For larger datasets, it offers a good trade-off of speed and accuracy.
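As a sketch, the choice between the two methods comes down to the Embeddings configuration. The paths and option names below are assumptions based on the txtai documentation; check the docs for your version.

```python
# 1) Sentence transformers: best accuracy, GPU preferred
#    (model path is an assumption; any hub model works)
transformers_config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
}

# 2) Word vectors via pymagnitude: significantly faster with default models
#    ("method", "scoring" and the vector path are assumptions)
words_config = {
    "path": "glove-100d",   # pre-trained word vectors (pymagnitude format)
    "method": "words",
    "scoring": "bm25",      # BM25-weighted average of word vectors
}

print(transformers_config, words_config)
```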

Similarity search at scale

As discussed above, txtai uses similarity search to compare a sentence embedding against all sentence embeddings in the repository. The first question that may come to mind is how that would scale to millions or billions of records. The answer is with Approximate Nearest Neighbor (ANN) search. ANN enables efficient execution of similarity queries over a large corpus of data.

A number of robust libraries are available in Python that enable ANN search. txtai has a configurable index backend that allows plugging in different ANN libraries. At this time, txtai supports:

  • Faiss (facebookresearch/faiss): A library for efficient similarity search and clustering of dense vectors.
  • Annoy (spotify/annoy): Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk.
  • Hnswlib (nmslib/hnswlib): Header-only C++/Python library for fast approximate nearest neighbors.

txtai uses sensible default settings for each of the libraries above to make it as easy as possible to get up and running. By default, the index selection is abstracted away and chosen based on the target environment.
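A configuration sketch for overriding the default backend; the option name and values are assumptions based on the txtai configuration documentation:

```python
# The "backend" option selects the ANN library used for the index
config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",  # assumed model path
    "backend": "faiss",  # or "hnsw" / "annoy" (values may vary by version)
}

print(config)
```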

The libraries above either don’t have a method for associating embeddings with record ids or assume the id is an integer. txtai takes care of that and keeps an internal id mapping, which allows any id type.

Benchmarks for each of the supported systems (and others) can help guide which ANN library is the best fit for a given dataset. There are also platform differences; for example, Faiss is only supported on Linux and macOS.

Extractive Question-Answering

In addition to similarity search, txtai supports extractive question-answering over returned results. This powerful feature enables asking a series of follow-up questions over a list of search results.

An example use case of this is with the CORD-19 challenge on Kaggle. This effort required creating summary tables for a series of medical queries, extracting additional columns for each result.

The following shows how to create an Extractive QA component within txtai.

Extractive QA Model

The next step is to load a set of results to ask questions on. The following example has text snippets with sports scores covering a series of games.

Extractive QA Example

Results for the section above.

Extractive QA results

We can see the extractor was able to understand the context of the sections above and is able to answer related questions. The Extractor component can work with a txtai Embeddings index as well as with external data stores. This modularity allows us to pick and choose what functionality to use from txtai to create natural language aware search systems.

Further reading

More detailed examples and use cases for txtai can be found in the following notebooks.

1: Introducing txtai

2: Build an Embeddings index with Hugging Face Datasets

3: Build an Embeddings index from a data source

4: Add semantic search to Elasticsearch

5: Extractive QA with txtai

6: Extractive QA with Elasticsearch

7: Apply labels with zero shot classification

8: API Gallery

9: Build abstractive text summaries

10: Extract text from documents

11: Transcribe audio to text

12: Translate text between languages

13: Similarity search with images

14: Run pipeline workflows

15: Distributed embeddings cluster

16: Train a text labeler

17: Train without labels

18: Export and run models with ONNX

19: Train a QA model

20: Extractive QA to build structured data

21: Export and run other machine learning models

22: Transform tabular data with composable workflows

23: Tensor workflows

24: What's new in txtai 4.0

25: Generate image captions and detect objects

26: Entity extraction workflows

27: Workflow scheduling

28: Push notifications with workflows

29: Anatomy of a txtai index

30: Custom Embeddings SQL functions

31: Near duplicate image detection

32: Model explainability

33: Query translation

34: Build a QA database

Wrapping up

NLP is advancing at a rapid pace and things not possible even a year ago are now possible. This article introduced txtai, an AI-powered semantic search platform that enables quick integration of robust models with a deep understanding of natural language. Hugging Face’s model hub has a number of base and community-provided models that can be used to customize search for almost any dataset. The possibilities are limitless and we’re excited to see what can be built on top of txtai!
