Jatin Sisodia

Inside a Scholarly Search Engine: Indexing, Ranking, and Retrieval

Repository: https://github.com/sisodiajatin/CS547-IR-Scholarly-Search

Let’s be real for a second: academic search is broken.

If you have ever tried to find a specific paper on a generic search engine, you know the pain. You type "neural networks," and you get a mix of Medium articles, YouTube tutorials, and maybe, if you are lucky, the actual PDF you were looking for on page 3.

I ran into this exact wall recently. I realized that building a search engine is not just about matching strings; it is about understanding intent. So, instead of complaining about it, I decided to build one.

This is the story of Scholarly Search, a project where I stopped relying on external search services and built a custom Information Retrieval (IR) system from the ground up using Python and Flask.

What Are We Actually Building?
At its core, this project is a specialized search engine for academic papers. The goal was not just to "find text" but to rank it intelligently. If a user searches for "machine learning," a paper with that phrase in the title should rank higher than one that mentions it once in the footnotes.

To make this happen, I had to move away from simple database queries and embrace the Inverted Index, the data structure that powers basically every search engine on the planet.

The Stack:

Core: Python (handling all logic and data structures).

Web Framework: Flask (serving both the API and the UI).

Frontend: HTML, CSS & vanilla JavaScript (keeping it lightweight and monolithic).

The Secret Sauce: A custom-built Inverted Index and BM25 Ranking algorithm.

The "Aha!" Moment: Why Simple Counts Do Not Work
When I first started, I thought, "Easy. I will just count how many times the word appears."

I was wrong.

If you search for "the analysis," the word "the" appears in almost every document. If you rank by pure frequency, your results will be dominated by papers that just happen to be wordy, not relevant.

Enter BM25.

BM25 is the industry standard for a reason. It does two smart things:

  • It penalizes common words. (Inverse Document Frequency)
  • It penalizes long documents. (Length Normalization)
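To see why the IDF piece matters, here is a toy calculation (the numbers are invented purely for illustration). With no relevance judgments, the IDF term in the scorer below reduces to log((N − n + 0.5) / (n + 0.5)), so a word that shows up everywhere contributes almost nothing, or even a negative weight:

import math

N = 50_000  # documents in the corpus (illustrative)
idf = lambda n: math.log((N - n + 0.5) / (n + 0.5))

print(idf(49_500))  # "the": appears almost everywhere -> about -4.6
print(idf(40))      # "bm25": rare -> about +7.1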

Here is the actual Python code used to calculate the score. It looks a bit math-heavy, but it is really just balancing term frequency against document length:

import math

# BM25 tuning constants (typical defaults)
k1, k2, b = 1.2, 100, 0.75

def score_bm25(n, f, qf, r, R, N, dl, avdl):
    # K is a scaling factor based on doc length (dl) vs average (avdl)
    K = k1 * ((1 - b) + b * (dl / avdl))

    # IDF-style weight; r and R are relevance-feedback counts (both 0 without judgments)
    first = math.log(((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)))
    # Term frequency in the document, saturated by K
    second = ((k1 + 1) * f) / (K + f)
    # Term frequency in the query
    third = ((k2 + 1) * qf) / (k2 + qf)

    return first * second * third
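To get a feel for the numbers, here is a quick call with made-up statistics: a term that appears 3 times in a shorter-than-average document and shows up in 40 of 50,000 documents:

# Illustrative statistics, not values from the real corpus
score = score_bm25(n=40, f=3, qf=1, r=0, R=0, N=50_000, dl=120, avdl=150)
print(round(score, 1))  # ~11.7 with the constants above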

Indexing: The Heavy Lifting
The biggest challenge was speed. You can't scan 50,000 documents every time someone hits "Enter."

The solution is an Inverted Index. Think of it like the index at the back of a textbook. Instead of reading the book to find "Algorithms," you look up "Algorithms" and see a list of page numbers.

I wrote a script that pre-processes the raw data (stripping out punctuation, lowercasing everything) and builds this map in memory.

# Simplified view of the indexing process
import re
from collections import defaultdict

def preprocess(text):
    # Lowercase the text and split on anything that is not a letter or digit
    return re.findall(r"[a-z0-9]+", text.lower())

inverted_index = defaultdict(list)

# corpus maps doc_id -> raw document text
for doc_id, text in corpus.items():
    for term in preprocess(text):
        # Map the term back to the document ID
        inverted_index[term].append(doc_id)
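At query time, retrieval is just a lookup-and-score loop over those posting lists. The real query path lives in the repo; here is a minimal sketch of the idea, assuming the layout above (one posting-list entry per occurrence) plus doc_lengths, avdl, and N precomputed during indexing (those names are placeholders of mine, not the repo's):

from collections import Counter

def retrieve(query, inverted_index, doc_lengths, avdl, N, top_k=10):
    scores = Counter()
    for term, qf in Counter(preprocess(query)).items():
        # Collapse the posting list into {doc_id: term frequency}
        postings = Counter(inverted_index.get(term, []))
        n = len(postings)  # number of documents containing the term
        for doc_id, f in postings.items():
            scores[doc_id] += score_bm25(n, f, qf, 0, 0, N, doc_lengths[doc_id], avdl)
    return scores.most_common(top_k)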

Trade-off Alert: I chose to keep this index in memory (RAM).

  • Pro: It’s blazing fast. Sub-millisecond lookup times.
  • Con: It eats RAM. For a dataset this size (<100k docs), it's fine. For anything larger, you'd want to dump this to disk.
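If the corpus ever outgrew RAM, the simplest first step would be serializing the index to disk and loading it on startup instead of re-indexing from scratch. A minimal sketch using pickle (doc_lengths as in the sketch above):

import pickle

# Persist the index and document statistics in one file (simplest possible approach)
with open("index.pkl", "wb") as fh:
    pickle.dump({"index": dict(inverted_index), "doc_lengths": doc_lengths}, fh)

# Later, load it back instead of rebuilding the index
with open("index.pkl", "rb") as fh:
    data = pickle.load(fh)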

The Frontend: Simple & Effective
Because this project focuses on the backend IR logic, I kept the frontend architecture simple.

Instead of over-engineering with a complex framework like React or Vue, I built the interface using standard HTML, CSS, and Vanilla JavaScript. This keeps the application lightweight and ensures that the "search" functionality remains the star of the show.

The UI logic is handled by a simple script that fetches results from the backend API asynchronously:

// A simple fetch function to query the Flask API
function search(query) {
    // encodeURIComponent keeps spaces and special characters from breaking the URL
    fetch(`/search?q=${encodeURIComponent(query)}`)
        .then(response => response.json())
        .then(data => {
            const resultsDiv = document.getElementById('results');

            // Build the markup for every result, then swap it in all at once
            resultsDiv.innerHTML = data.map(paper => `
                <div class="paper">
                    <h3>${paper.title}</h3>
                    <p>${paper.abstract}</p>
                </div>
            `).join('');
        });
}
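On the Flask side, the /search endpoint that this fetch call hits is conceptually just a thin wrapper around the retrieval logic. The real route is in the repo; here is a sketch of the shape, where papers (a doc_id → title/abstract metadata dict) and the retrieve helper above are my placeholder names:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    results = retrieve(query, inverted_index, doc_lengths, avdl, N)
    # papers maps doc_id -> {"title": ..., "abstract": ...}
    return jsonify([papers[doc_id] for doc_id, score in results])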

Try It Yourself
If you want to poke around the code or run it locally, I have open-sourced the whole thing; the repository is linked at the top of this post.
