Section 1: The Domain
We build a product that processes academic manuscripts for scientific publishing companies. Authors submit their manuscripts in the form of DOCX files. Our system converts these DOCX files into XML — the industry standard format used across the entire publishing world.
Every publishing company has their own DTD (Document Type Definition) — a schema that defines the structure of their XML. This means the rules for how a manuscript should be structured, tagged, and formatted are different for every journal and every publisher.
The conversion pipeline
When an author submits a manuscript, here is what happens:
- Author submits a DOCX file to our system
- The system downloads the journal-specific rules file from S3
- The rules file is used to convert the DOCX into a structured XML
- During conversion, the system raises annotations and discrepancies — places where the manuscript does not conform to the journal's rules
- The XML is then converted into an HTML preview
- The author views the HTML on the UI and resolves the annotations and discrepancies raised by the system
The problem
The entire process — from DOCX submission to HTML preview — takes approximately 3 minutes.
The product owner received feedback from authors that 3 minutes is too long. Authors are sitting and waiting, not knowing if their manuscript is being processed correctly. This is a poor experience and directly impacts the author's trust in the platform.
The product owner has asked us to reduce the waiting time.
Section 2: Finding the Bottleneck
As part of improving the system latency we started profiling the pipeline to identify where the 3 minutes were being spent.
We quickly identified that downloading the rules file from S3 on every submission was the biggest contributor to latency.
Why is the download so slow?
The rules files are large — each rules file is close to 1 GB in size. Transferring 1 GB over the network on every single submission adds significant overhead before any actual processing even begins.
Understanding the scale
- We receive 3,000 to 5,000 submissions per day
- Our system supports 300 different journal types
- Based on our metrics, 60 to 70 journals account for 90% of all submissions
This last observation is the key insight. We are downloading 1 GB files from S3 thousands of times a day — but for the same 60 to 70 journals repeatedly. The same rules file is being downloaded over and over again for every submission of the same journal type.
Current architecture
The decision
We decided to break this network latency on every request. Instead of downloading the rules file from S3 on every submission, we would cache the rules files of the top 70 journals locally on the application server.
This would eliminate the redundant S3 downloads for 90% of our submissions and directly reduce the processing time for the majority of authors.
Section 3: Why Local Cache?
Before jumping into the implementation we had to decide what type of cache to use.
Global cache — ruled out
At first glance a global cache seems like a reasonable choice. But when we thought about it carefully we realised it does not solve our problem at all.
With a global cache:
- Rules file is downloaded from S3 → stored in global cache server
- For every submission, app server downloads the rules file from the global cache server
We have simply moved the bottleneck. We are still transferring 1 GB of data over the network for every request — just from a different source. The latency problem remains unsolved.
Local cache — the right choice
With a local cache, the rules file is stored directly on the application server's HDD. When a submission comes in:
- If the rules file is already on the local HDD → use it directly, zero network transfer
- If not → download from S3 once, store locally, use it
Every subsequent submission for the same journal type uses the locally cached file with no network overhead at all.
Storage cost
We support 300 journal types but only 60 to 70 journals account for 90% of submissions. If we cache the rules files for these 70 journals:
70 journals × 1 GB per rules file = 70 GB per application server
70 GB on a local HDD is completely acceptable. HDD storage is cheap and this is a one time cost per server.
Local cache is the right fit for our use case.
Section 4: The Consistency Problem
Now that we have decided on local caching, we need to think carefully about cache invalidation. Before choosing an algorithm, we need to understand what kind of consistency our system requires.
Does eventual consistency work here?
Let's think through the scenario.
We cache the rules files locally and start processing manuscripts. Now assume that a domain expert — someone from the manuscript team or a peer reviewer — identifies that one of the journal rules has a bug. A rule has been added incorrectly, changed, or deleted. A separate uploader service fixes the rules file, uploads the new version to S3, and updates the DB.
But here is the problem — our application servers have the old rules file cached locally. Before the cache gets invalidated, our system processes around 100 manuscripts using the wrong rules. The XML is generated, the HTML is rendered, and authors go to the UI and spend time resolving annotations and discrepancies based on incorrect rules.
Later, the quality check team identifies that the XML was generated using outdated rules. Now we have to go back to those authors and tell them their work needs to be redone.
Authors will not accept this. The credibility of the platform is at stake.
Eventual consistency does not fit this use case.
We need immediate consistency
Our system must ensure that the moment a rules file is updated, every subsequent manuscript processed by any application server uses the updated rules. No exceptions.
The natural choice for immediate consistency is Write Through cache — write to both cache and the source of truth atomically. But this is extremely hard to implement in a distributed environment.
With multiple application servers each holding their own local cache, achieving atomic writes across all of them simultaneously is practically impossible. If we use 2 Phase Commit to coordinate these writes, the latency overhead would be so high that it defeats the entire purpose of caching in the first place.
The solution — Write Through with Versioning
We need immediate consistency without the complexity and latency of 2 Phase Commit. The solution is a tweaked approach — write through with versioning.
Before we explain the solution, let us look at how the current code is implemented.
Section 5: The Code
v1 — The naive approach
This is how the system was originally implemented. Every time a manuscript submission comes in, the rules file is hardcoded based on the journal ID.
import os
def handle_request_v1(request):
author_id = request.author_id
journal_id = request.journal_id
manuscript = request.manuscript
author_info = pgsql.get_author_info(author_id)
if author_info is None:
return 404
journal_info = pgsql.get_journal_info(journal_id)
if journal_info is None:
return 404
rules_file_name = f"{journal_id}_rules.csv"
if not os.path.exists(rules_file_name):
download_from_s3([journal_info.rules_file_url])
convert_to_xml(manuscript, rules_file_name)
The problem with v1:
The rules file name is hardcoded as {journal_id}_rules.csv. If the rules file is updated in S3 and the DB, the file name on disk stays the same. The os.path.exists check returns True — the file exists — but the content is stale. The system silently serves outdated rules with no way of knowing.
This is exactly the eventual consistency problem we described.
v2 — Write Through with Versioning
The key insight is simple — instead of hardcoding the rules file name, we fetch it from the DB on every request. The DB always holds the latest file name. The file name itself acts as the version.
When a domain expert updates the rules file:
- The uploader service uploads the new rules file to S3 with a timestamp in the file name — e.g.
journal_42_rules_2025-03-16_08:00.csv - On success, the uploader updates the file name in the DB
- These two writes do not need to be atomic — if S3 succeeds but DB update fails, the new file is simply not being used yet. It causes no harm.
Now when the next submission comes in:
- App server fetches
journal_infofrom DB — gets the new file name - Checks if this file exists on local HDD
- It does not — because the old file had a different name
- Downloads the new file from S3
- Processes the manuscript with the correct updated rules
The old cached file is simply never used again and will eventually be evicted by LRU.
import os
def handle_request_v2(request):
author_id = request.author_id
journal_id = request.journal_id
manuscript = request.manuscript
author_info = pgsql.get_author_info(author_id)
if author_info is None:
return 404
journal_info = pgsql.get_journal_info(journal_id)
if journal_info is None:
return 404
rules_file_name = journal_info.rules_file_name
if not os.path.exists(f'/rules_data/{rules_file_name}'):
download_from_s3([journal_info.rules_file_url])
convert_to_xml(manuscript, f'/rules_data/{rules_file_name}')
What changed from v1 to v2:
| v1 | v2 | |
|---|---|---|
| Rules file name | Hardcoded {journal_id}_rules.csv
|
Fetched from DB journal_info.rules_file_name
|
| Versioning | None — stale files served silently | Timestamp in file name acts as version |
| Consistency | Eventual | Immediate |
| Cache path | No prefix |
/rules_data/ prefix |
The key insight:
We are not caching the file name — the file name is fetched fresh from the DB on every request so it can never be stale. We are only caching the actual rules file content on disk. This gives us immediate consistency without 2 Phase Commit.
Section 6: Eviction, Routing and Tradeoffs
Cache Eviction — LRU
We use LRU (Least Recently Used) eviction. It is the best fit for our use case because every rules file has a timestamp in its name. The operating system automatically maintains read and write timestamps for all files on disk. When the cache storage threshold is reached, we simply evict the file that was least recently accessed to make space for new downloads.
This requires no additional tracking logic — the OS does it for us naturally.
Routing — Round Robin
We use Round Robin for routing requests across application servers.
The reason is that rules files are replicated across all app servers — not sharded. Every app server can serve any journal type.
Why not shard by journal ID? If we route all requests for a specific journal to a specific server, and that journal gets a high volume of submissions, that one server gets bombarded while all others remain idle. This creates an uneven load distribution that defeats the purpose of horizontal scaling.
With Round Robin, any request can go to any server. All servers gradually cache the top 70 journals and serve submissions equally.
The tradeoff — cold start latency
There is one unavoidable tradeoff in this design. The first request for any journal type will always be slow — the rules file has to be downloaded from S3 before processing can begin. Only subsequent requests benefit from the local cache.
This is acceptable in most cases but can be a poor experience if a high volume of submissions is expected for a specific journal — for example, right before a submission deadline.
Overcoming the cold start — warm up strategy
We can proactively eliminate this cold start problem using a cache warm-up background job.
Before a high volume period is expected:
- A background job sends dummy small manuscript submissions for the most frequently used journals
- The app servers download and cache the rules files
- The dummy XML outputs are deleted after caching
When the real high volume of author submissions arrives, the rules files are already cached on all app servers and every submission gets the fast path.
The tradeoff is additional background job infrastructure and a small amount of unnecessary S3 download cost. But the author experience is significantly improved.



Top comments (1)
Insightful! Well explained!