Introducing txtai, an AI-powered search engine built on Transformers

David Mezzetti · Originally published at towardsdatascience.com · 6 min read

Add Natural Language Understanding to any application

Search is the foundation of many applications. Once data starts to pile up, users want to be able to find it. It's the backbone of the internet and an ever-growing challenge that is never fully solved.

The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability, allowing us to add amazing functionality quickly with limited compute and people. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.

This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.

Introducing txtai


txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based systems. txtai also has functionality for zero-shot classification. txtai is open source and available on GitHub.

txtai is built on the following stack: Hugging Face Transformers and sentence-transformers for text embeddings, pymagnitude for word embeddings, and Faiss, Annoy, and Hnswlib for approximate nearest neighbor indexing, all running on Python 3.6+.

txtai and the concepts behind it have already been used to power the Natural Language Processing (NLP) applications listed below:

  • paperai — AI-powered literature discovery and review engine for medical/scientific papers
  • tldrstory — AI-powered understanding of headlines and story text
  • neuspo — Fact-driven, real-time sports event and news site
  • codequestion — Ask coding questions directly from the terminal

Install and run txtai

The following code snippet shows how to install txtai and create an embeddings model. Python 3.6+ is supported, and using a Python virtual environment is recommended.

pip install txtai

Next, we can create a simple in-memory model with a couple of sample records to try out txtai.

Basic Embeddings Instance

Running the code above prints the best matching text section for each query.

The example above shows that, for almost all of the queries, the query text doesn't literally appear in the matched text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!

Build an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert every text section to an embeddings vector on each query. txtai supports building pre-computed indices, which significantly improves performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector on each search.

Build an Embeddings Index

Once again, the same results are returned; the only difference is that the embeddings are pre-computed.

Save and load an Embeddings index

Embeddings indices can be saved to disk and reloaded. At this time, indices are not created incrementally; the index needs a full rebuild to incorporate new data.

Save and load an Embeddings Index

The results of the code above:

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg

With a limited amount of code, we’re able to build a system with a deep understanding of natural language. The amount of knowledge that comes from Transformer models is phenomenal.

Sentence Embeddings

txtai builds sentence embeddings to perform similarity searches. txtai takes each text record entry, tokenizes it and builds an embeddings representation of that record. At search time, the query is transformed into a text embedding and then compared to the repository of text embeddings.
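The pipeline can be illustrated end to end with toy vectors; random token vectors stand in for real model output, but the tokenize, pool, and compare steps are the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token vectors; a real system would use transformer or fastText vectors
vocab = {}

def vector(token):
    if token not in vocab:
        vocab[token] = rng.normal(size=16)
    return vocab[token]

def embed(text):
    # Tokenize, look up a vector per token, then mean-pool into one vector
    tokens = text.lower().split()
    return np.mean([vector(t) for t in tokens], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sections = ["ice shelf collapses", "man wins lottery prize"]
query = "ice shelf collapses today"

# Compare the query embedding against every stored section embedding
scores = [cosine(embed(query), embed(s)) for s in sections]
best = int(np.argmax(scores))
print(sections[best])
```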

txtai supports two methods for creating text embeddings, sentence transformers and word embeddings vectors. Both methods have their merits as shown below.

Sentence Transformers

Backed by Hugging Face Transformers (huggingface/transformers): state-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.

  • Creates a single embeddings vector via mean pooling of vectors generated by the Transformers library.
  • Supports models stored on Hugging Face’s model hub or stored locally.
  • See Sentence Transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face’s model hub.
  • Base models require significant compute capability (GPU preferred). It is possible to build smaller, lighter-weight models that trade accuracy for speed.

Word Embeddings

Building a sentence embedding index with fastText and BM25

  • Creates a single embeddings vector via BM25 scoring of each word component. The reference above describes this method in detail.
  • Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
  • See vectors.py for code that can build word vectors for custom datasets.
  • Significantly better speed with default models. For larger datasets, this method offers a good trade-off of speed and accuracy.
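The idea can be sketched as weighted mean pooling; this toy example uses plain IDF weights rather than full BM25 scoring, and random vectors in place of real fastText vectors:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

docs = [
    "the ice shelf collapses",
    "the man wins the lottery",
    "the park service warns of bears",
]

# Toy word vectors; txtai loads real pre-trained vectors via pymagnitude
vectors = {t: rng.normal(size=16) for doc in docs for t in doc.split()}

def idf(term):
    # Inverse document frequency stands in for BM25 term weighting here
    n = sum(term in doc.split() for doc in docs)
    return math.log((len(docs) + 1) / (n + 0.5))

def embed(text):
    # Weighted mean pooling: common words like "the" contribute little
    tokens = [t for t in text.split() if t in vectors]
    weights = [idf(t) for t in tokens]
    return np.average([vectors[t] for t in tokens], axis=0, weights=weights)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("ice shelf")
scores = [cosine(query, embed(doc)) for doc in docs]
best = int(np.argmax(scores))
print(docs[best])
```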

Similarity search at scale

As discussed above, txtai uses similarity search to compare a sentence embedding against all sentence embeddings in the repository. The first question that may come to mind is how that would scale to millions or billions of records. The answer is with Approximate Nearest Neighbor (ANN) search. ANN enables efficient execution of similarity queries over a large corpus of data.
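The exhaustive baseline that ANN approximates is a dot product of the query against every stored vector, which scales linearly with corpus size:

```python
import numpy as np

rng = np.random.default_rng(0)

# A corpus of 10,000 normalized embedding vectors
corpus = rng.normal(size=(10_000, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.normal(size=64)
query /= np.linalg.norm(query)

# Exact nearest neighbor: O(n) dot products per query.
# ANN indexes (Faiss, Annoy, Hnswlib) trade a little recall for much faster search.
scores = corpus @ query
best = int(np.argmax(scores))
print(best, float(scores[best]))
```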

A number of robust libraries are available in Python that enable ANN search. txtai has a configurable index backend that allows plugging in different ANN libraries. At this time, txtai supports:

  • Faiss (facebookresearch/faiss): a library for efficient similarity search and clustering of dense vectors.
  • Annoy (spotify/annoy): Approximate Nearest Neighbors in C++/Python, optimized for memory usage and loading/saving to disk.
  • Hnswlib (nmslib/hnswlib): a header-only C++/Python library for fast approximate nearest neighbors.

txtai uses sensible default settings for each of the libraries above, to make it as easy as possible to get up and running. By default, the index selection is abstracted away and chosen based on the target environment.

The libraries above either don’t have a method for associating embeddings with record ids or assume the id is an integer. txtai takes care of that and keeps an internal id mapping, which allows any id type.
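The idea can be sketched as a simple position-to-id table (this is an illustration of the concept, not txtai's actual internals):

```python
# ANN libraries index vectors by integer position, so external ids of any
# hashable type are tracked alongside and translated on the way in and out.

class IdMap:
    def __init__(self):
        self.ids = []  # position -> external id

    def add(self, external_id):
        # Returns the integer position to store in the ANN index
        self.ids.append(external_id)
        return len(self.ids) - 1

    def resolve(self, position):
        # Translates an ANN result back into the caller's id
        return self.ids[position]

idmap = IdMap()
pos = idmap.add("doc-2020-08-13")  # string ids work fine
print(pos, idmap.resolve(pos))
```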

Benchmarks for each of the supported systems (and others) can help guide which ANN library is the best fit for a given dataset. There are also platform differences; for example, Faiss is only supported on Linux and macOS.

Extractive Question-Answering

In addition to similarity search, txtai supports extractive question-answering over returned results. This powerful feature enables asking a further series of questions against a list of search results.

An example use case of this is with the CORD-19 challenge on Kaggle. This effort required creating summary tables for a series of medical queries, extracting additional columns for each result.

The following shows how to create an Extractive QA component within txtai.

Extractive QA Model

The next step is to load a set of results to ask questions about. The following example has text snippets with sports scores covering a series of games.

Extractive QA Example

Running the example returns an extracted answer for each question.

We can see the extractor was able to understand the context of the sections above and is able to answer related questions. The Extractor component can work with a txtai Embeddings index as well as with external data stores. This modularity allows us to pick and choose what functionality to use from txtai to create natural language aware search systems.

Further reading

More detailed examples and use cases for txtai can be found in the following notebooks.

Part 1: Introducing txtai

Part 2: Build an Embeddings index with Hugging Face Datasets

Part 3: Build an Embeddings index from a data source

Part 4: Add semantic search to Elasticsearch

Part 5: Extractive QA with txtai

Part 6: Extractive QA with Elasticsearch

Part 7: Apply labels with zero shot classification

Part 8: API Gallery

Wrapping up

NLP is advancing at a rapid pace and things not possible even a year ago are now possible. This article introduced txtai, an AI-powered search engine that enables quick integration of robust models with a deep understanding of natural language. Hugging Face's model hub has a number of base and community-provided models that can be used to customize search for almost any dataset. The possibilities are limitless and we're excited to see what can be built on top of txtai!
