Introduction
In the first article, we delved into how Elasticsearch works under the hood.
In this article, we will implement Elasticsearch in a Django application.
This article is intended for readers already familiar with Django; we will not explain project setup or core functionality such as models and views in depth.
Setup
Clone this repository into a folder of your choosing.
git clone git@github.com:robinmuhia/elasticSearchPOC.git .
or
Get the repo from this GitHub link
We need three specific libraries; they abstract much of what we need to implement Elasticsearch:
django-elasticsearch-dsl==8.0
elasticsearch==8.0.0
elasticsearch-dsl==8.12.0
Create a virtual environment, activate it, and install the dependencies from the requirements.txt file:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Your project structure should look like the image below:
Now we're ready to go.
Understanding the project
Settings file
The project is a simple Django application with the usual setup structure.
In the config folder, we have our settings.py file.
For the purposes of this project, our Elasticsearch settings are simple, as shown below:
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "whitenoise.runserver_nostatic",
    "django.contrib.staticfiles",
    "django_extensions",
    "django_elasticsearch_dsl",
    "rest_framework",
    "elastic_search.books",
]

ELASTICSEARCH_DSL = {
    "default": {
        "hosts": [os.getenv("ELASTICSEARCH_URL", "http://localhost:9200")],
    },
}

ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = "django_elasticsearch_dsl.signals.RealTimeSignalProcessor"
ELASTICSEARCH_DSL_INDEX_SETTINGS = {}
ELASTICSEARCH_DSL_AUTOSYNC = True
ELASTICSEARCH_DSL_AUTO_REFRESH = True
ELASTICSEARCH_DSL_PARALLEL = False
In a production-ready application, I would recommend using the CelerySignalProcessor. The RealTimeSignalProcessor re-indexes documents immediately whenever a model changes. The CelerySignalProcessor handles re-indexing asynchronously, so users do not experience added latency when they modify our models. You would have to set up Celery, though.
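If Celery is already set up in your project, switching processors is a small settings change. A minimal sketch, assuming a Redis broker (the broker URL and environment variable name here are illustrative assumptions, not part of this repository):

```python
# settings.py -- re-index asynchronously via Celery instead of in-request.
# Requires a running Celery worker; the broker URL below is an example.
import os

CELERY_BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://localhost:6379/0")

ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = (
    "django_elasticsearch_dsl.signals.CelerySignalProcessor"
)
```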
Read more about the nuances of settings here.
Models
from django.db import models


class GenericMixin(models.Model):
    """Generic mixin to be inherited by all models."""

    id = models.AutoField(primary_key=True, editable=False, unique=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        abstract = True
        ordering = ("-updated_at", "-created_at")


class Country(GenericMixin):
    name = models.CharField(max_length=200)

    def __str__(self):
        return self.name


class Genre(GenericMixin):
    name = models.CharField(max_length=100)

    def __str__(self):
        return self.name


class Author(GenericMixin):
    name = models.CharField(max_length=200)

    def __str__(self):
        return self.name


class Book(GenericMixin):
    title = models.CharField(max_length=100)
    description = models.TextField()
    genre = models.ForeignKey(Genre, on_delete=models.CASCADE, related_name="genres")
    country = models.ForeignKey(Country, on_delete=models.CASCADE, related_name="countries")
    author = models.ForeignKey(Author, on_delete=models.CASCADE, related_name="authors")
    year = models.IntegerField()
    rating = models.FloatField()

    def __str__(self):
        return self.title
The GenericMixin holds fields that all models should inherit. For a production application, I would recommend using a UUID as the primary key, but we use a plain auto-incrementing integer field here because it is simpler for this project.
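If you do want UUID primary keys, the mixin could be adapted roughly as follows. This is an illustrative sketch, not part of the repository; the class name UUIDMixin is mine:

```python
# Illustrative variant of GenericMixin with a UUID primary key.
import uuid

from django.db import models


class UUIDMixin(models.Model):
    """Abstract base model using a random UUID as the primary key."""

    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        abstract = True
        ordering = ("-updated_at", "-created_at")
```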
The models are pretty self-explanatory. We will be indexing and querying the Book model: our goal is to search for a book by its title and description, while also being able to filter by year and rating.
Documents file
We have a documents.py file in the books folder.
This file is important and should be named exactly that, as django-elasticsearch-dsl auto-discovers documents defined in a documents.py module. Our documents will be written here. For our Book model, the code is shown below:
from django_elasticsearch_dsl import Document, fields
from django_elasticsearch_dsl.registries import registry

from elastic_search.books.models import Author, Book, Country, Genre


@registry.register_document
class BookDocument(Document):
    genre = fields.ObjectField(
        properties={
            "name": fields.TextField(),
        }
    )
    country = fields.NestedField(
        properties={
            "name": fields.TextField(),
        }
    )
    author = fields.NestedField(
        properties={
            "name": fields.TextField(),
        }
    )

    class Index:
        name = "books"

    class Django:
        model = Book
        fields = [
            "title",
            "description",
            "year",
            "rating",
        ]
        related_models = [Genre, Country, Author]

    def get_queryset(self):
        return super().get_queryset().select_related("genre", "author", "country")

    def get_instances_from_related(self, related_instance):
        if isinstance(related_instance, Genre):
            return related_instance.genres.all()
        elif isinstance(related_instance, Country):
            return related_instance.countries.all()
        elif isinstance(related_instance, Author):
            return related_instance.authors.all()
        else:
            return []
Import Statements:
We import necessary modules and classes from django_elasticsearch_dsl and our Django models.
Document Definition:
We define a BookDocument class which inherits from Document, provided by django_elasticsearch_dsl.
Registry Registration:
We register the BookDocument class with the registry using the @registry.register_document decorator. This tells the Elasticsearch DSL library to manage this document.
Index Configuration:
We specify the name of the Elasticsearch index for this document as "books". This index name should be unique within the Elasticsearch cluster.
Django Model Configuration:
Under the Django class nested within BookDocument, we link the document to the Django model (Book) and specify which fields of the model should be indexed.
Fields Mapping:
Inside the BookDocument class, we define fields for the Elasticsearch document. These map to the fields in the Django model; genre is declared as an object field, while country and author are nested fields.
Related Models Handling:
We specify related models (Genre, Country, Author) that should be indexed along with the Book model. For each related model, we define how to retrieve instances related to the main model. This involves specifying which fields to index from related models.
Queryset Configuration:
We override the get_queryset method to specify how the queryset should be retrieved. In this case, we use select_related to fetch related objects efficiently.
Instances from Related:
We define the get_instances_from_related method to handle instances from related models. This method is used to retrieve instances related to the main model for indexing purposes.
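To make the resulting index shape concrete, here is a rough, hand-written sketch of the mapping the document above generates, expressed as a plain dictionary. The concrete field types are assumptions based on django-elasticsearch-dsl's default Django-to-Elasticsearch type mapping; the real mapping can be inspected with GET /books/_mapping:

```python
# Approximate shape of the "books" index mapping produced by BookDocument.
# Field types are assumptions based on the library's defaults.
book_mapping = {
    "properties": {
        "title": {"type": "text"},
        "description": {"type": "text"},
        "year": {"type": "integer"},
        "rating": {"type": "double"},
        # ObjectField: a sub-object with its own properties.
        "genre": {"properties": {"name": {"type": "text"}}},
        # NestedField: indexed as independent nested documents.
        "country": {"type": "nested", "properties": {"name": {"type": "text"}}},
        "author": {"type": "nested", "properties": {"name": {"type": "text"}}},
    }
}
```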
Views
import copy
from abc import abstractmethod

from elasticsearch_dsl import Document, Q
from rest_framework.decorators import action
from rest_framework.pagination import LimitOffsetPagination
from rest_framework.request import Request
from rest_framework.response import Response
from rest_framework.viewsets import ModelViewSet

from elastic_search.books.documents import BookDocument
from elastic_search.books.models import Book
from elastic_search.books.serializers import BookSerializer


class PaginatedElasticSearchAPIView(ModelViewSet, LimitOffsetPagination):
    document_class: Document = None

    @abstractmethod
    def generate_search_query(self, search_terms_list, param_filters):
        """This method should be overridden
        and return a Q() expression."""

    @action(methods=["GET"], detail=False)
    def search(self, request: Request):
        try:
            params = copy.deepcopy(request.query_params)
            search_terms = params.pop("search", None)
            query = self.generate_search_query(
                search_terms_list=search_terms, param_filters=params
            )
            search = self.document_class.search().query(query)
            response = search.to_queryset()
            results = self.paginate_queryset(response)
            serializer = self.serializer_class(results, many=True)
            return self.get_paginated_response(serializer.data)
        except Exception as e:
            # Return the error message; a raw exception object is not serializable.
            return Response(str(e), status=500)


class BookViewSet(PaginatedElasticSearchAPIView):
    serializer_class = BookSerializer
    queryset = Book.objects.all()
    document_class = BookDocument

    def generate_search_query(self, search_terms_list: list[str], param_filters: dict):
        if search_terms_list is None:
            return Q("match_all")

        # Strip null bytes and treat commas as whitespace.
        search_terms = search_terms_list[0].replace("\x00", "")
        search_terms = search_terms.replace(",", " ")
        search_fields = ["title", "description"]
        filter_fields = ["year", "rating"]

        query = Q("multi_match", query=search_terms, fields=search_fields, fuzziness="auto")
        wildcard_query = Q(
            "bool",
            should=[
                Q("wildcard", **{field: f"*{search_terms.lower()}*"}) for field in search_fields
            ],
        )
        query = query | wildcard_query
        if len(param_filters) > 0:
            filters = []
            for field in filter_fields:
                if field in param_filters:
                    filters.append(Q("term", **{field: param_filters[field]}))
            filter_query = Q("bool", should=[query], filter=filters)
            query = query & filter_query
        return query
Structure
The PaginatedElasticSearchAPIView class has two important methods. The generate_search_query method has an @abstractmethod decorator, which means any class that inherits from it must implement that method.
The search method adds a search endpoint that accepts a GET request and handles the search functionality. It copies the parameters from the URL and passes them to the generate_search_query function. That function should return an Elasticsearch query, which is executed and then converted to a queryset. The queryset is paginated and returned to the user.
In a production app, I would recommend handling the exception by logging the error and falling back to Django Rest Framework's built-in search, so that at the very least our search always works.
Implementation
In the BookViewSet, we provide the document that we will execute the search on.
We also implement the abstract method. Let us walk through the query step by step.
Input Parameters:
search_terms_list: These are the words or phrases a user types into the search bar when looking for a book.
param_filters: These are additional conditions a user might apply to narrow down the search, such as only showing books published in a certain year or with a certain rating.
Understanding the Search Process:
If the user doesn't provide any search terms, it means they want to see all the books available. So, we create a "match-all" query to fetch all books.
If the user provides search terms, we want to look for those terms in specific fields of our books, like title or description. We also want to be flexible with our search, allowing for slight misspellings or variations in the search terms. That's where the "fuzziness" parameter comes into play. It helps us find similar words even if the user misspells something.
Additionally, we might want to support wildcard searches, where a placeholder like '*' matches any sequence of characters. For example, the pattern '*hist*' would match 'history', 'historic', etc.
If there are any filter parameters provided, we want to apply those filters to our search results. For example, if a user wants to see only books published in the year 2022, we want to include that condition in our search.
Constructing the Query:
We use the Elasticsearch DSL (Domain-Specific Language) to construct our search query. This query is like a set of instructions written in a language Elasticsearch understands.
We build our query step by step, considering all the different scenarios mentioned above.
We use the Q class from Elasticsearch DSL to create different parts of our query, such as match queries, wildcard queries, and filter queries.
Finally, we combine all these parts to form a comprehensive search query that captures both the user's search terms and any additional filters they might have applied.
Output:
The method returns the constructed search query, ready to be executed against our Elasticsearch index.
This query will fetch the relevant books based on the user's search terms and filters, providing them with accurate and tailored search results.
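To make this concrete, here is a rough sketch, written by hand as a plain dictionary, of the query body generate_search_query would build for a request like ?search=consumer&year=2022. The exact body the Q objects serialize to may differ slightly; this is illustrative only:

```python
# Approximate Elasticsearch query body for ?search=consumer&year=2022.
# Hand-written to mirror generate_search_query; illustrative, not exact.
search_body = {
    "query": {
        "bool": {
            "must": [
                {
                    "bool": {
                        "should": [
                            # Fuzzy full-text match over both search fields.
                            {
                                "multi_match": {
                                    "query": "consumer",
                                    "fields": ["title", "description"],
                                    "fuzziness": "auto",
                                }
                            },
                            # Substring matching via wildcards.
                            {"wildcard": {"title": "*consumer*"}},
                            {"wildcard": {"description": "*consumer*"}},
                        ]
                    }
                }
            ],
            # Exact filter conditions from the query parameters.
            "filter": [{"term": {"year": 2022}}],
        }
    }
}
```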
URLS
We now set this up in our urls.py file:
from rest_framework.routers import SimpleRouter
from elastic_search.books import views
router = SimpleRouter()
router.register("books", views.BookViewSet)
urlpatterns = router.urls
Data
We need data to search against, so there is a factories.py file that will populate the database for us.
First, let's create a database. Set up Postgres and run the following commands:
sudo -u postgres psql
DROP USER IF EXISTS elastic;
CREATE USER elastic WITH CREATEDB CREATEROLE SUPERUSER LOGIN PASSWORD 'elastic';
DROP DATABASE IF EXISTS elastic;
CREATE DATABASE elastic WITH OWNER postgres;
GRANT ALL ON DATABASE elastic TO elastic;
\q
Populate the data in the database:
python manage.py generate_test_data 1000
This will create a large dataset for us to run our queries against.
Set up Elasticsearch
Run the following to start a local Elasticsearch instance with Docker:
docker run --rm --name elasticsearch_container -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.10.2
Populate the index
Now we can populate our index to test out the application:
python manage.py search_index --rebuild
Query Time!!!
Start the server
python manage.py runserver
Head to Postman or any API testing platform of your choice.
Our base query will be this:
http://localhost:8000/api/books/search/
A GET request is shown below.
Let's make a query for a book with 'consumer' (e.g. ?search=consumer).
Let's misspell 'consumer'; thanks to fuzziness, we get the same result.
Let's test the filter (e.g. ?search=consumer&year=2022):
Conclusion
We have implemented Elasticsearch and tested it live, getting the expected results. Other query types exist, such as nested queries that could bring author and country into the search and filters, but they are out of scope for this tutorial. In a future article, I may add them. In our next article, however, we will add a CI/CD pipeline that can be used to test our application.