Madhukar Vissapragada

Posted on Jun 27 • Edited on Jul 17

How We Optimized Manuscript Processing at a Scientific Publishing Platform

#redis #architecture #software #showdev

Section 1: The Domain

We build a product that processes academic manuscripts for scientific publishing companies. Authors submit their manuscripts in the form of DOCX files. Our system converts these DOCX files into XML — the industry standard format used across the entire publishing world.

Every publishing company has their own DTD (Document Type Definition) — a schema that defines the structure of their XML. This means the rules for how a manuscript should be structured, tagged, and formatted are different for every journal and every publisher.

The conversion pipeline

When an author submits a manuscript, here is what happens:

Author submits a DOCX file to our system
The system downloads the journal-specific rules file from S3
The rules file is used to convert the DOCX into a structured XML
During conversion, the system raises annotations and discrepancies — places where the manuscript does not conform to the journal's rules
The XML is then converted into an HTML preview
The author views the HTML on the UI and resolves the annotations and discrepancies raised by the system

The problem

The entire process — from DOCX submission to HTML preview — takes approximately 3 minutes.

The product owner received feedback from authors that 3 minutes is too long. Authors are sitting and waiting, not knowing if their manuscript is being processed correctly. This is a poor experience and directly impacts the author's trust in the platform.

The product owner has asked us to reduce the waiting time.

Section 2: Finding the Bottleneck

As part of improving the system latency we started profiling the pipeline to identify where the 3 minutes were being spent.

We quickly identified that downloading the rules file from S3 on every submission was the biggest contributor to latency.

Why is the download so slow?

The rules files are large — each rules file is close to 1 GB in size. Transferring 1 GB over the network on every single submission adds significant overhead before any actual processing even begins.

Understanding the scale

We receive 3,000 to 5,000 submissions per day
Our system supports 300 different journal types
Based on our metrics, 60 to 70 journals account for 90% of all submissions

This last observation is the key insight. We are downloading 1 GB files from S3 thousands of times a day — but for the same 60 to 70 journals repeatedly. The same rules file is being downloaded over and over again for every submission of the same journal type.

Current architecture

The decision

We decided to break this network latency on every request. Instead of downloading the rules file from S3 on every submission, we would cache the rules files of the top 70 journals locally on the application server.

This would eliminate the redundant S3 downloads for 90% of our submissions and directly reduce the processing time for the majority of authors.

Section 3: Why Local Cache?

Before jumping into the implementation we had to decide what type of cache to use.

Global cache — ruled out

At first glance a global cache seems like a reasonable choice. But when we thought about it carefully we realised it does not solve our problem at all.

With a global cache:

Rules file is downloaded from S3 → stored in global cache server
For every submission, app server downloads the rules file from the global cache server

We have simply moved the bottleneck. We are still transferring 1 GB of data over the network for every request — just from a different source. The latency problem remains unsolved.

Local cache — the right choice

With a local cache, the rules file is stored directly on the application server's HDD. When a submission comes in:

If the rules file is already on the local HDD → use it directly, zero network transfer
If not → download from S3 once, store locally, use it

Every subsequent submission for the same journal type uses the locally cached file with no network overhead at all.

Storage cost

We support 300 journal types but only 60 to 70 journals account for 90% of submissions. If we cache the rules files for these 70 journals:

70 journals × 1 GB per rules file = 70 GB per application server

70 GB on a local HDD is completely acceptable. HDD storage is cheap and this is a one time cost per server.

Local cache is the right fit for our use case.

Section 4: The Consistency Problem

Now that we have decided on local caching, we need to think carefully about cache invalidation. Before choosing an algorithm, we need to understand what kind of consistency our system requires.

Does eventual consistency work here?

Let's think through the scenario.

We cache the rules files locally and start processing manuscripts. Now assume that a domain expert — someone from the manuscript team or a peer reviewer — identifies that one of the journal rules has a bug. A rule has been added incorrectly, changed, or deleted. A separate uploader service fixes the rules file, uploads the new version to S3, and updates the DB.

But here is the problem — our application servers have the old rules file cached locally. Before the cache gets invalidated, our system processes around 100 manuscripts using the wrong rules. The XML is generated, the HTML is rendered, and authors go to the UI and spend time resolving annotations and discrepancies based on incorrect rules.

Later, the quality check team identifies that the XML was generated using outdated rules. Now we have to go back to those authors and tell them their work needs to be redone.

Authors will not accept this. The credibility of the platform is at stake.

Eventual consistency does not fit this use case.

We need immediate consistency

Our system must ensure that the moment a rules file is updated, every subsequent manuscript processed by any application server uses the updated rules. No exceptions.

The natural choice for immediate consistency is Write Through cache — write to both cache and the source of truth atomically. But this is extremely hard to implement in a distributed environment.

With multiple application servers each holding their own local cache, achieving atomic writes across all of them simultaneously is practically impossible. If we use 2 Phase Commit to coordinate these writes, the latency overhead would be so high that it defeats the entire purpose of caching in the first place.

The solution — Write Through with Versioning

We need immediate consistency without the complexity and latency of 2 Phase Commit. The solution is a tweaked approach — write through with versioning.

Before we explain the solution, let us look at how the current code is implemented.

Section 5: The Code

v1 — The naive approach

This is how the system was originally implemented. Every time a manuscript submission comes in, the rules file is hardcoded based on the journal ID.

import os

def handle_request_v1(request):
    author_id = request.author_id
    journal_id = request.journal_id
    manuscript = request.manuscript

    author_info = pgsql.get_author_info(author_id)
    if author_info is None:
        return 404

    journal_info = pgsql.get_journal_info(journal_id)
    if journal_info is None:
        return 404

    rules_file_name = f"{journal_id}_rules.csv"

    if not os.path.exists(rules_file_name):
        download_from_s3([journal_info.rules_file_url])

    convert_to_xml(manuscript, rules_file_name)

The problem with v1:

The rules file name is hardcoded as {journal_id}_rules.csv. If the rules file is updated in S3 and the DB, the file name on disk stays the same. The os.path.exists check returns True — the file exists — but the content is stale. The system silently serves outdated rules with no way of knowing.

This is exactly the eventual consistency problem we described.

v2 — Write Through with Versioning

The key insight is simple — instead of hardcoding the rules file name, we fetch it from the DB on every request. The DB always holds the latest file name. The file name itself acts as the version.

When a domain expert updates the rules file:

The uploader service uploads the new rules file to S3 with a timestamp in the file name — e.g. journal_42_rules_2025-03-16_08:00.csv
On success, the uploader updates the file name in the DB
These two writes do not need to be atomic — if S3 succeeds but DB update fails, the new file is simply not being used yet. It causes no harm.

Now when the next submission comes in:

App server fetches journal_info from DB — gets the new file name
Checks if this file exists on local HDD
It does not — because the old file had a different name
Downloads the new file from S3
Processes the manuscript with the correct updated rules

The old cached file is simply never used again and will eventually be evicted by LRU.

import os

def handle_request_v2(request):
    author_id = request.author_id
    journal_id = request.journal_id
    manuscript = request.manuscript

    author_info = pgsql.get_author_info(author_id)
    if author_info is None:
        return 404

    journal_info = pgsql.get_journal_info(journal_id)
    if journal_info is None:
        return 404

    rules_file_name = journal_info.rules_file_name

    if not os.path.exists(f'/rules_data/{rules_file_name}'):
        download_from_s3([journal_info.rules_file_url])

    convert_to_xml(manuscript, f'/rules_data/{rules_file_name}')

What changed from v1 to v2:

	v1	v2
Rules file name	Hardcoded `{journal_id}_rules.csv`	Fetched from DB `journal_info.rules_file_name`
Versioning	None — stale files served silently	Timestamp in file name acts as version
Consistency	Eventual	Immediate
Cache path	No prefix	`/rules_data/` prefix

The key insight:

We are not caching the file name — the file name is fetched fresh from the DB on every request so it can never be stale. We are only caching the actual rules file content on disk. This gives us immediate consistency without 2 Phase Commit.

Section 6: Eviction, Routing and Tradeoffs

Cache Eviction — LRU

We use LRU (Least Recently Used) eviction. It is the best fit for our use case because every rules file has a timestamp in its name. The operating system automatically maintains read and write timestamps for all files on disk. When the cache storage threshold is reached, we simply evict the file that was least recently accessed to make space for new downloads.

This requires no additional tracking logic — the OS does it for us naturally.

Routing — Round Robin

We use Round Robin for routing requests across application servers.

The reason is that rules files are replicated across all app servers — not sharded. Every app server can serve any journal type.

Why not shard by journal ID? If we route all requests for a specific journal to a specific server, and that journal gets a high volume of submissions, that one server gets bombarded while all others remain idle. This creates an uneven load distribution that defeats the purpose of horizontal scaling.

With Round Robin, any request can go to any server. All servers gradually cache the top 70 journals and serve submissions equally.

The tradeoff — cold start latency

There is one unavoidable tradeoff in this design. The first request for any journal type will always be slow — the rules file has to be downloaded from S3 before processing can begin. Only subsequent requests benefit from the local cache.

This is acceptable in most cases but can be a poor experience if a high volume of submissions is expected for a specific journal — for example, right before a submission deadline.

Overcoming the cold start — warm up strategy

We can proactively eliminate this cold start problem using a cache warm-up background job.

Before a high volume period is expected:

A background job sends dummy small manuscript submissions for the most frequently used journals
The app servers download and cache the rules files
The dummy XML outputs are deleted after caching

When the real high volume of author submissions arrives, the rules files are already cached on all app servers and every submission gets the fast path.

The tradeoff is additional background job infrastructure and a small amount of unnecessary S3 download cost. But the author experience is significantly improved.

Top comments (1)

Nalini Manoharan • Jun 27

Insightful! Well explained!