txtai is datastore agnostic: the library analyzes sets of text, wherever that text is stored. The following example shows how extractive question-answering can be added on top of an Elasticsearch system.
Install dependencies
Install txtai and Elasticsearch.
# Install txtai and elasticsearch python client
pip install txtai elasticsearch
# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1
Start an instance of Elasticsearch.
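import os
import time
from subprocess import Popen, PIPE, STDOUT
# Start the server as uid 1 (the daemon user on Debian-based systems); Elasticsearch will not run as root
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
# Give the server time to start
time.sleep(30)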
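A fixed 30-second pause can be too short on a slow machine and wasteful on a fast one. As a minimal sketch, the helper below polls a readiness check until it succeeds or a timeout expires. The name `wait_for` and its parameters are hypothetical and not part of txtai or Elasticsearch.

```python
import time


def wait_for(check, timeout=60.0, interval=1.0):
    """Poll check() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # Treat errors (e.g. connection refused) as "not ready yet"
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Once the Elasticsearch client from the next section exists, something like `wait_for(es.ping)` could stand in for the fixed pause, since the client's `ping()` method reports whether the node responded.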
Download data
This example is going to work off a subset of the CORD-19 dataset. COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.
The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.
wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
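Before loading anything, it can help to confirm what the downloaded file contains. The sketch below lists each table with its row count. The helper name `table_counts` is hypothetical, and the in-memory database is only a stand-in so the example runs without the download. Its two tables are illustrative and do not reproduce the real schema.

```python
import sqlite3


def table_counts(db):
    """Return {table name: row count} for every table in a SQLite connection."""
    cur = db.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    names = [row[0] for row in cur.fetchall()]
    counts = {}
    for name in names:
        # Table names come from sqlite_master, so simple quoting is enough here
        cur.execute('SELECT COUNT(*) FROM "{}"'.format(name))
        counts[name] = cur.fetchone()[0]
    return counts


# Stand-in database (illustrative schema, not the real one)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (id TEXT, title TEXT)")
demo.execute("CREATE TABLE sections (id INTEGER, article TEXT, text TEXT)")
demo.execute("INSERT INTO articles VALUES ('a1', 'Example title')")
demo.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [(0, "a1", "First."), (1, "a1", "Second.")],
)
print(table_counts(demo))
```

Against the real file, the call would be `table_counts(sqlite3.connect("articles.sqlite"))`.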
Load data into Elasticsearch
The following block copies rows from SQLite to Elasticsearch.
import sqlite3
import regex as re
from elasticsearch import Elasticsearch, helpers
# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)
# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()
# Elasticsearch bulk buffer
buffer = []
rows = 0
# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")
for row in cur:
    # Build dict of name-value pairs for fields
    article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
    name = article["name"]
    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
        # Bulk action fields
        article["_id"] = article["id"]
        article["_index"] = "articles"
        # Buffer article
        buffer.append(article)
        # Increment number of articles processed
        rows += 1
        # Bulk load every 1000 records
        if rows % 1000 == 0:
            helpers.bulk(es, buffer)
            buffer = []
            print("Inserted {} articles".format(rows), end="\r")
if buffer:
    helpers.bulk(es, buffer)
print("Total articles inserted: {}".format(rows))
Total articles inserted: 21499
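The load loop buffers records and flushes them every 1000 rows, with a final flush for the remainder. The same pattern can be written as a small generator, sketched below. The name `batches` is hypothetical and the integers stand in for article dicts.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most size items from iterable, in order."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Seven items in batches of three: the last batch holds the remainder
for batch in batches(range(7), 3):
    print(batch)
```

With this helper, each batch could be passed to `helpers.bulk(es, batch)`, which removes the need for a separate flush after the loop.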
Query data
The following runs a query against Elasticsearch for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.
import pandas as pd
from IPython.display import display, HTML
pd.set_option("display.max_colwidth", None)
query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}
results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
    source = result["_source"]
    results.append((source["title"], source["published"], source["reference"], source["text"]))
df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])
display(HTML(df.to_html(index=False)))
Title | Published | Reference | Match |
---|---|---|---|
Management of osteoarthritis during COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . |
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. |
Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . |
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. |
COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |
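The query code reads each match from `["hits"]["hits"]` and pulls fields out of `_source`. The sketch below isolates that step so the response shape is easier to see. The helper name `hit_rows` is hypothetical, and `sample` is a hand-written stand-in for a search response with illustrative values, not real output.

```python
def hit_rows(response, fields):
    """Turn an Elasticsearch search response into one tuple per hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        # Missing fields become None instead of raising KeyError
        rows.append(tuple(source.get(field) for field in fields))
    return rows


# Hand-written stand-in for a response; values are illustrative only
sample = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 2.5, "_source": {"title": "Example A", "text": "First match."}},
            {"_id": "2", "_score": 1.0, "_source": {"title": "Example B"}},
        ]
    }
}
print(hit_rows(sample, ["title", "text"]))
```

The resulting list of tuples can be handed directly to `pd.DataFrame` with a matching `columns` list.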
Derive columns with Extractive QA
The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions are asked of the document. The answers are added as a derived column per article.
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor
# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")
document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort" : ["id"]
}
import copy
def sections(article):
    rows = []
    # Deep copy so the nested query in the shared template is not modified
    search = copy.deepcopy(document)
    search["query"]["term"]["article"] = article
    for result in es.search(index="articles", body=search)["hits"]["hits"]:
        source = result["_source"]
        name, text = source["name"], source["text"]
        # Skip the same section types that were excluded when loading
        if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
            rows.append(text)
    return rows
results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
    source = result["_source"]
    # Use QA extractor to derive additional columns
    answers = extractor([("Risk factors", "risk factor", "What are names of risk factors?", False),
                         ("Locations", "city country state", "What are names of locations?", False)], sections(source["article"]))
    results.append((source["title"], source["published"], source["reference"], source["text"]) + tuple([answer[1] for answer in answers]))
df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])
display(HTML(df.to_html(index=False)))
Title | Published | Reference | Match | Risk Factors | Locations |
---|---|---|---|---|---|
Management of osteoarthritis during COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) . | Comorbidities | extrapulmonary sites |
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. | CVD, risk factors but no CVD, and neither CVD | None |
Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors . | socioeconomic inequalities and risk factors | None |
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. | Frailty and multimorbidity | comorbidity groupings |
COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | • Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | age and underlying disease are strongly correlated | cities, provinces, and countries |
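The section filter used when loading and when collecting text for QA relies on a variable-width lookbehind, which the third-party `regex` module supports and the standard `re` module does not. As a rough sketch of what that pattern accepts, the function below expresses the same rule with the standard library only, assuming single-line section names. The name `keep_section` is hypothetical.

```python
import re

# Section names containing any of these words are always skipped
SKIPPED = re.compile(r"background|introduction|reference")


def keep_section(name):
    """Return True when a section name passes the filter used in this example."""
    if not name:
        # Sections without a name are kept
        return True
    name = name.lower()
    if SKIPPED.search(name):
        return False
    # "discussion" only excludes a section when no "results" comes before it
    index = name.find("discussion")
    if index >= 0 and "results" not in name[:index]:
        return False
    return True


for title in ("Methods", "Results and Discussion", "Discussion", "Introduction"):
    print(title, keep_section(title))
```

So a combined "Results and Discussion" section is kept, while a standalone "Discussion" section is dropped along with background, introduction and reference sections.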