Solr + Python — A Tutorial

#solr #python #tutorial

Update: I have pushed my Python code to GitHub (repo is here). My implementation is a tad more advanced than this tutorial. See the Readme file and code comments.

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene (Solr website).

Goal

My goal is to demonstrate building an e-commerce gallery page with search, pagination, filtering and multi-select that mirrors the expectations of a typical user. See this article for a nice explanation of the multi-select filtering I am trying to implement.

Search should work for phrase queries like “mens shrt gap” or “gap 2154abc”, factoring in typos, various word forms (stemming) and phonetic spelling.

Solr Setup

Solr 7 is installed locally on my computer with an active connection to a database. Solr is using the deltaQuery feature (in db-data-config.xml) to detect changes in my database and import those records into Solr.

Web Development Setup

I have a basic Django/React app with Python 3. See this article for ideas on how to integrate Django with React. I recommend following these instructions to create your own Solr client.

I was considering using pySolr as a client, but it lacks good documentation and seems to have been neglected since 2015 (like most Solr libraries). Nevertheless, pySolr can work if you are ready to comb through the GitHub issues and codebase.

If you are using pySolr:

Paste export DEBUG_PYSOLR='true' into your terminal before running your server, and you will be able to view the URL generated by pySolr.
The URL printed in your terminal does not appear to be URL-encoded, so a query like Dolce & Gabbana will work on your website but break when you paste the URL into a browser.
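
The encoding problem can be reproduced with the standard library alone. This is a minimal sketch, not pySolr's behavior: the port 8983 and the core name `products` are placeholders chosen for illustration, and `build_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical base URL; replace host, port and core with your own.
BASE_URL = "http://localhost:8983/solr/products/select"

def build_url(params):
    """Return a select URL with every parameter value percent-encoded."""
    return BASE_URL + "?" + urlencode(params)

# What an unencoded URL copied from a log line might look like.
raw_url = BASE_URL + "?q=Dolce & Gabbana&wt=json"
safe_url = build_url([("q", "Dolce & Gabbana"), ("wt", "json")])

# In the raw URL the ampersand is read as a parameter separator,
# so the value of q is cut short. The encoded URL keeps it whole.
print(parse_qs(urlsplit(raw_url).query))
print(parse_qs(urlsplit(safe_url).query))
print(safe_url)
```

Pasting the encoded URL into a browser should then send the full brand name as the query.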

Facets & Facet Pivots

Facets are synonymous with product categories or specs. Solr has an option to return the available facets with their respective counts for a specific query. You can control the minimum number of products required in a facet by setting facet.mincount=<number>.

For example, if you are selling brand named clothing, facets might refer to gender, style and material. If the search was for “mens casual gap”, the facets returned would look like this (notice the constraints on gender and style):

"facet_fields" : {
   "gender" : [
       "Men" , 25,
       "Women" , 0 
    ],
   "style" : [
       "casual", 10,
       "dress", 0
    ],
   "material" : [
       "wool", 15,
       "cotton", 10
    ]
}

Example Query

Let’s run through an example:

from urllib.request import urlopen
import urllib.parse
import simplejson

def gallery_items(current_query):

    solr_tuples = [
        # text in search box
        ('q', "mens shirt gap"),
        # how many products do I want to return
        ('rows', current_query['rows_per_page']),
        # offset for pagination
        ('start', current_query['start_row'] * current_query['rows_per_page']),
        # example of a default sort, 
        # for search phrase leave blank to allow 
        # for relevancy score sorting
        ('sort', 'price asc, popularity desc'),
        # which fields do I want returned
        ('fl', 'product_title, price, code, image_file'),
        # enable facets and facet.pivots
        ('facet', 'on'),
        # allow for unlimited amount of facets in results
        ('facet.limit', '-1'),
        # a facet has to have at least one 
        # product in it to be a valid facet
        ('facet.mincount', '1'),
        # regular facets: the Solr parameter is facet.field,
        # repeated once per field
        ('facet.field', 'gender'),
        ('facet.field', 'style'),
        ('facet.field', 'material'),
        # nested facets
        ('facet.pivot', 'brand,collection'),
        # edismax is Solr's multifield phrase parser
        ('defType', 'edismax'),
        # fields to be queried
        # copyall: all facets of a product with basic stemming
        # copyallphonetic: phonetic spelling of facets
        ('qf', 'copyall copyallphonetic'),
        # give me results that match most fields
        # in qf [copyall, copyallphonetic]
        ('tie', '1.0'),
        # format response as JSON
        ('wt', 'json')
    ]

    solr_url = 'http://localhost:<port>/solr/<core>/select?'
    # encode for URL format
    encoded_solr_tuples = urllib.parse.urlencode(solr_tuples)
    complete_url = solr_url + encoded_solr_tuples
    connection = urlopen(complete_url)
    raw_response = simplejson.load(connection)
    # hand the parsed JSON back to the caller
    return raw_response

Phrase search will be discussed in the next section — Schema Modeling.

I would suggest using tuples for each key-value pair as it will be easier to urlencode. It will also be easier to manipulate, particularly when you have a complicated fq with a ton of AND, OR logic (which will happen very soon if you are doing filtering).
Each facet group will have its own fq field. This ensures that AND logic is applied across filter groups. Here is code for applying OR logic within a facet group:

    def apply_facet_filters(self):
        if self.there_are_facets():
            for facet, facet_arr in self.facet_filter.items():
                if len(facet_arr) > 0:
                    new_facet_arr = []
                    for a_facet in facet_arr:
                        new_facet_arr.append("{0}: \"{1}\"".format(facet, a_facet))
                    self.solr_args.append(('fq', ' OR '.join(new_facet_arr)))
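
For readers who want to try the filter logic outside a class, here is a standalone sketch of the same idea. The function name `facet_filter_params` and the sample selection are invented for illustration. It also shows why a list of tuples helps: repeated `fq` keys survive encoding, whereas a dict would keep only one.

```python
from urllib.parse import urlencode

def facet_filter_params(facet_filter):
    """Build one fq tuple per facet group: OR within a group, AND across groups."""
    params = []
    for field, values in facet_filter.items():
        if not values:
            continue  # nothing selected in this group
        clauses = ['{0}:"{1}"'.format(field, value) for value in values]
        params.append(("fq", " OR ".join(clauses)))
    return params

# Hypothetical selection: two genders, one material, no style.
selected = {"gender": ["Men", "Women"], "style": [], "material": ["wool"]}
solr_args = [("q", "gap")] + facet_filter_params(selected)

# Both fq entries appear in the encoded query string.
print(urlencode(solr_args))
```

This sketch does not escape quotes inside facet values, so values that contain a double quote would need extra handling.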

facet.pivot.mincount allows you to control the minimum number of products required for a facet.pivot, but beware, if you set it to 0, your server will likely crash.
I’ve found that field values needed to be formatted in quotes: 'fq': "brand: \"{0}\"".format(current_query['current_brand'])
Facets are returned as flat arrays of alternating values and counts, like ['Men', 25, 'Women', 0], rather than as a dict(), which I find inconvenient. Here is one way to format them:

import more_itertools as mit
facets = {}

for k,v in raw_response['facet_counts']['facet_fields'].items():
    spec_list = [list(spec) for spec in mit.chunked(v, 2)]
    spec_dict = {}
    for spec in spec_list:
        spec_dict[spec[0]] = spec[1]
    facets[k] = spec_dict

raw_response['facet_counts']['facet_fields'] = facets
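
Results for facet.pivot arrive in a different shape: a list of objects, each with `field`, `value`, `count` and an optional nested `pivot` list. The sketch below converts that list into nested dicts. The sample response is hand-written for illustration (the counts and collection names are made up), and `pivot_to_dict` is a hypothetical helper.

```python
def pivot_to_dict(pivot_entries):
    """Recursively convert a facet.pivot list into nested dicts keyed by value."""
    result = {}
    for entry in pivot_entries:
        result[entry["value"]] = {
            "count": entry["count"],
            # Leaf entries have no "pivot" key, so default to an empty list.
            "children": pivot_to_dict(entry.get("pivot", [])),
        }
    return result

# Hand-written sample in the shape used for facet.pivot=brand,collection.
sample_response = {
    "facet_counts": {
        "facet_pivot": {
            "brand,collection": [
                {"field": "brand", "value": "gap", "count": 25, "pivot": [
                    {"field": "collection", "value": "summer", "count": 15},
                    {"field": "collection", "value": "winter", "count": 10},
                ]},
            ]
        }
    }
}

pivots = sample_response["facet_counts"]["facet_pivot"]["brand,collection"]
brands = pivot_to_dict(pivots)
print(brands["gap"]["children"]["summer"]["count"])
```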

By default, if a user selects a facet in a facet group, Solr will return that facet group with only the selected facet, since the search has been narrowed down. But many times, a user would still like to view the unselected facets and associated counts, to enable multi-select. To allow this functionality, use tagging and excluding. See my repo for a possible implementation.
To create price ranges as a filter with custom intervals, copy price to a new field with one of the Trie fieldTypes. The new field should have indexed and stored set to false, and docValues set to true. Then follow the instructions to add custom ranges. See the next section on schema modeling. See my repo for a possible implementation.
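
The tagging and excluding approach mentioned above can be sketched as follows. Each filter is tagged with Solr's `{!tag=...}` local parameter, and the matching facet field excludes that tag with `{!ex=...}`, so its counts ignore the user's own selection in that group. The function name and the choice to reuse the field name as the tag are assumptions made for this example, not necessarily what the repo does.

```python
def multi_select_params(facet_fields, selections):
    """Tag each filter and exclude that tag when faceting on the same field."""
    params = [("facet", "on")]
    for field in facet_fields:
        # {!ex=<tag>} tells Solr to ignore the fq with that tag for this facet.
        params.append(("facet.field", "{{!ex={0}}}{0}".format(field)))
    for field, values in selections.items():
        if values:
            clauses = " OR ".join('{0}:"{1}"'.format(field, v) for v in values)
            # {!tag=<tag>} names the filter so the facet above can exclude it.
            params.append(("fq", "{{!tag={0}}}{1}".format(field, clauses)))
    return params

# Hypothetical state: the user has ticked "Men" in the gender group.
params = multi_select_params(["gender", "style"], {"gender": ["Men"]})
for key, value in params:
    print(key, "=", value)
```

With these parameters the gender facet should still list the unselected genders with their counts, while the style facet reflects the gender filter.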

Schema Modeling

If you can get past the idea that fields exist simply to store properties of data, and embrace the idea that you can manipulate data so it can be found as users expect it, then you can begin to effectively program relevance rules into the search engine. (Relevant Search, Chapter 5)

We are ready to modify fields in our document schema to conform to the users’ perception of our products.

Take a look at the documentation about how to update the schema, particularly the sections on tokenizing and filtering. Learn about stemming filters. Ask yourself which tokenizers and filters are relevant for your situation, and whether each should be applied at query time or index time.

I will be following a recommendation in the documentation to copy all fields a user might be interested in into a single copyall field. This solves the albino elephant issue, as well as signal discordance:

As we’ve stated, when users search, they typically don’t care how documents decompose into individual fields. Many search users expect to work with documents as a single unit: the more of their search terms that match, the more relevant the document ought to be. It may surprise you to know that search engine features that implemented this idea were late to the party. Instead, Lucene-based multifield search depended on field-centric techniques. Instead of the search terms, field-centric search makes field scores the center of the ranking function. In this section, we explore exactly why field-centric approaches can create relevance issues. You’ll see that having ranking functions centered on fields creates two problems:

The albino elephant problem — A failure to give a higher rank to documents that match more search terms.

Signal discordance — Relevance scoring based on unintuitive scoring of the constituent parts (title versus body) instead of scoring of the whole document or more intuitive larger parts, such as the entire article’s text or the people associated with this film. (Relevant Search, Chapter 6)

We will be using the Schema API through the Admin UI. You cannot edit the schema file manually (why). Here is the recipe for creating the copyall field:

Step 1: Create a fieldType for the field. I am using the same fieldType for both index and query time. I have kept the stemming light to ensure that brand names stay intact.

      <fieldType name="facets" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.ClassicTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory"/>
          <filter class="solr.ClassicFilterFactory"/>
          <filter class="solr.EnglishMinimalStemFilterFactory"/>
        </analyzer>
      </fieldType>

Step 2: Create a copyall field with facets as the fieldType. Set multiValued=true to allow multiple values in the field (as an array). Set omitNorms=true since users don’t care about the length of each field (docs), and we don’t want Solr to care either.
Step 3: Create copyFields for every field in the data source that you want to be copied. Remember, there is no chaining of copyFields.

    {
       "add-copy-field":{
           "source":"brand",
           "dest":"copyall"
        }
    }
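
If you would rather script Steps 2 and 3 than click through the Admin UI, the same commands can be posted to the Schema API as JSON. The sketch below only builds the requests and does not send them. The endpoint, port and core name are placeholders, the helper names are invented, and the `indexed`/`stored` values are assumptions to adjust for your own schema.

```python
import json
from urllib.request import Request

# Hypothetical schema endpoint; replace port and core with your own.
SCHEMA_URL = "http://localhost:8983/solr/products/schema"

def copyall_commands(source_fields, dest="copyall", field_type="facets"):
    """Return Schema API commands: one add-field, then one add-copy-field per source."""
    commands = [{"add-field": {
        "name": dest, "type": field_type,
        "multiValued": True, "omitNorms": True,
        "indexed": True, "stored": False,  # assumed values, adjust as needed
    }}]
    for source in source_fields:
        commands.append({"add-copy-field": {"source": source, "dest": dest}})
    return commands

def schema_request(command):
    """Wrap one command in a JSON POST request (built here, not sent)."""
    body = json.dumps(command).encode("utf-8")
    return Request(SCHEMA_URL, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")

commands = copyall_commands(["brand", "gender", "style", "material"])
requests_to_send = [schema_request(c) for c in commands]
# Pass each request to urllib.request.urlopen() to apply it to a running Solr.
print(json.dumps(commands[1], indent=2))
```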

Step 4: Repeat steps 1–3 if you want to create a copyall for phonetic spelling. Use an appropriate fieldType. I am using the Beider-Morse Filter.
Step 5: Add a tie breaker of 1 to get a most-fields functionality. The docs provide a nice explanation.
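
To see what the tie breaker does, it helps to write out the disjunction-max arithmetic: the best field score, plus tie times the sum of the other field scores. The sketch below is plain Python, not Solr code, and the per-field scores are made up for illustration.

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores the way a disjunction-max query does."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Made-up scores for one document in copyall and copyallphonetic.
scores = [2.0, 0.5]

print(dismax_score(scores, tie=0.0))  # only the best field counts
print(dismax_score(scores, tie=1.0))  # every field contributes in full
```

With a tie of 1.0 the result is simply the sum of all field scores, which is why it behaves like a most-fields match.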

Some ideas:

Add index-time boosts for more popular products that you want to rank higher in the search results. You could also do a query-time boost, something to the effect of bf='div(1,popularity)'.
Use function queries to customize any part of the relevancy scoring of your search results.
Consider the N-Gram filter for typo tolerance.
Consider the Edge-N-Gram filter for autocomplete.
Consider using the text_en fieldType for regular English words (it is one of the many fieldTypes which come with Solr):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
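
To make the N-Gram and Edge-N-Gram suggestions from the ideas list more concrete, here is a plain Python illustration of the tokens such filters aim to produce. It is not Solr's implementation, and the gram sizes are arbitrary choices for the example.

```python
def edge_ngrams(token, min_size=2, max_size=5):
    """Prefixes of a token, similar in spirit to an edge n-gram filter."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]

def ngrams(token, size=2):
    """All substrings of a fixed size, similar in spirit to an n-gram filter."""
    return [token[i:i + size] for i in range(len(token) - size + 1)]

# Prefixes are what make autocomplete work: typing "shi" matches "shirt".
print(edge_ngrams("shirt"))

# A typo such as "shrt" still shares some grams with "shirt".
shared = set(ngrams("shirt")) & set(ngrams("shrt"))
print(sorted(shared))
```

The overlap in grams is what lets a misspelled query still match the indexed term, at the cost of a larger index.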

Debugging and Workflow

Check analysis in the Admin UI for how particular terms are analyzed at index or query time.
Add a print or log statement in your code to output the URL for every query. Set debugQuery=true and read the parsedquery and explain sections of the response. All the math fun is lurking in the explain (see Relevant Search, Chapter 2).
After re-configuring the schema, make sure to delete all docs in your index and do a fresh full-import from your database. This can be done in the Admin UI.
If you need to debug the database import, use the Debug-Mode with verbose output.
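
The debugQuery tip can be wired into the tuple-based query with two small helpers. This is a sketch under assumptions: the helper names are invented, and the response below is a hand-written stand-in with placeholder strings rather than real Solr output.

```python
def add_debug(solr_args):
    """Return a copy of the query tuples with debugQuery switched on."""
    return list(solr_args) + [("debugQuery", "true")]

def summarize_debug(raw_response):
    """Pull the parsed query and per-document explanations out of a response."""
    debug = raw_response.get("debug", {})
    return {
        "parsedquery": debug.get("parsedquery"),
        "explain": debug.get("explain", {}),
    }

# Stand-in for a parsed JSON response; the strings are placeholders.
fake_response = {
    "debug": {
        "parsedquery": "<parsed query string>",
        "explain": {"doc-1": "<explain text>"},
    }
}

summary = summarize_debug(fake_response)
print(summary["parsedquery"])
for doc_id, explanation in summary["explain"].items():
    print(doc_id, explanation)
```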