DEV Community

Cover image for A search engine - Part 1: Outline and Kickoff
David Kröll
David Kröll

Posted on

A search engine - Part 1: Outline and Kickoff

This series is about creating a data structure named inverted index.
It is a central component in probably every search engine out there, and a very nice excercise to apply computer science skills.
This search engine will be written in Go and I am going to use this excercise to point out how I move from zero to a Go package.
The core component is of course the data structure and it's efficient way of searching arbitrary text data.

View the wikipedia article about the inverted index

For sure, this projects it not an enterprise-ready solution, there are probably better and more efficient search engines out there.
However I think this is a very nice example to improve skillset and learn something about this type of index.
Taking all that into consideration, I came up with the following requirements:

  • Create an instance of our search engine
  • Add arbitrary strings to our search engine cache
  • Search through all our strings in the cache

For now, I'll cope with these.

Define the Interface

Looking through my demands on the core data structure and the accessor interface, I'll define the methods that I'd like to use from an external perspective.

// This is pseudo code
// Create a new instance, configurable with a struct
New(config) instance

// Add strings to the search engine cache
Add(string)

// The method for our search execution
Find(string) []string
Enter fullscreen mode Exit fullscreen mode

Now the interface for our external view is defined. We'll work on our datastructures.
We start with the quintessence of our search engine - the inverted index.

Every major search engine makes use of this index type. Basically its a mapping of words to where they are stored in our cache.

So if we store the first verse of Bohemian Rhapsody in our index, we'd like our algorithm to tell us where the word mama is.

Mama, just killed a man                            <-- mama is here
Put a gun against his head, pulled my trigger, now he's dead     
Mama, life had just begun                          <-- mama is also here
But now I've gone and thrown it all away
Mama, ooh, didn't mean to make you cry             <-- mama is here too
If I'm not back again this time tomorrow
Carry on, carry on as if nothing really matters
Enter fullscreen mode Exit fullscreen mode

Every line of Bohemian Rhapsody will be an entry in a string slice. We call it our cache.
For our lookup-table (call it index) we take a map which will have a string as key and as value we store the position in the cache slice.
Since there are possibly more occurences of a single string, we will also pick a slice of integers here.

type GlSearch struct {
    config Config
    cache  []string
    index  map[string][]int
}
Enter fullscreen mode Exit fullscreen mode

This is now our GlSearch struct where we can implement our methods. The most simple will be the New method:

func New(c Config) *GlSearch {

    glsearch := GlSearch{
        config: c,
        cache:  []string{},
        index:  map[string][]int{},
    }

    return &glsearch
}
Enter fullscreen mode Exit fullscreen mode

This method just initializes our struct to avoid nil pointer dereference errors afterwards.

Next method attached to our struct:

// Add appends the provided string to the cache
// and updates the index accordingly
func (g *GlSearch) Add(s string) {
    // the zero-based position our next entry in the cache will take
    i := len(g.cache) 

    g.cache = append(g.cache, s)
    // transform our string to build up a useful index
    // split in words using space characters
    words := split(s, g.config.Seperators)

    // update the index for every word
    for _, v := range words {

        // append or create if it does not exist already
        if e, ok := g.index[v]; !ok {
            e = []int{i}
            g.index[v] = e
        } else {
            e = append(e, i)
            g.index[v] = e
        }
    }

    fmt.Printf("Cache: %q \nIndex: %v\n", g.cache, g.index)
}
Enter fullscreen mode Exit fullscreen mode

Here I introduced the first configuration parameters. It's the Seperators.
Time by time when I notice that I want some parameter for my package to be configurable, I introduce a new field for our Config struct.
I typically go with this approach because at the beginning I don't exactly know what should be configurable
but is always handy to do so.

The split method

// split turns the input string into a slice of strings
// by seperating where any of the runes in the sep slice match
func split(s string, sep []rune) []string {
    return strings.FieldsFunc(s, func(r rune) bool {
        for _, v := range sep {
            if v == r {
                return true
            }
        }
        return false
    })
}
Enter fullscreen mode Exit fullscreen mode

If you're new to the Go programming language, you may not be familiar with the rune datatype.
Read more about it at yourbasic.org

func main() {

    search := glsearch.New(glsearch.DefaultConfig)

    search.Add("Mama, just killed a man")
    search.Add("Put a gun against his head, pulled my trigger, now he's dead")
    // our contents in cache and index right now
    // Cache: ["Mama, just killed a man"
    //         "Put a gun against his head, pulled my trigger, now he's dead"] 
    // Index: map[Mama:[0] Put:[1] a:[0 1] against:[1]
    //        dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0]
    //        man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]

}
Enter fullscreen mode Exit fullscreen mode

Now the index shows the desired representation of our data so far.
By examining the Index data structure precisely, the attendive reader already notices some obstacles and drawbacks in our data.
We'll discuss this in the next part of my series.

Latest comments (0)