Frederick Peñalo

My production server almost went down because of one library

Sorry if the title sounds a bit weird, but I want to tell you about a real production bug that almost took down all our clients.

We run Kanvas Ecosystem, a backend that powers several frontend apps. One of our core pieces is a filesystem manager: users upload files (images, audio, video, GIFs, etc.) and later attach them to different entities in the system.

Everything was working fine… until it wasn’t.

About three or four months ago, someone added a new cache-related library to our Laravel project.

That’s when the problems quietly started.


First rule (learned the hard way)

Never install a new library without talking to your team and testing all scenarios.


The chaos begins

Last week, I launched a new feature: an image scraper.

Users type a word, we scrape images, download them, and store them using our filesystem manager.

Suddenly, production went down.

All clients. All apps.

My boss called me, worried, and we jumped into a meeting to find the issue.

Hours went by.

Our code looked perfect.

The dev environment worked.

Tests were passing.

So we thought it might be a CPU issue.

We added more cores and optimized the scraper.

Still broken.


The first real clue

We created a new server from an AWS image.

Same problem.

That made no sense.

Then we checked other services and discovered that Redis was down.

We increased Redis capacity and tested again in the dev environment.

I uploaded one simple image.

It took 47 seconds.

At that point I just thought: why?


Changing approach

I stopped testing only the scraper and started testing everything related to the filesystem manager.

The same issue appeared everywhere.

All clients were affected.

I debugged method by method.
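Debugging method by method can be as simple as timing each step and looking at which one dominates. Below is a minimal sketch in Python (not the actual Laravel code); the step names `validate`, `store`, and `post_save_hooks` are hypothetical stand-ins, and the slow step is simulated with a sleep.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record how long the wrapped block takes, in seconds, under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Hypothetical upload steps; only the last one is artificially slow.
def validate(data):
    return len(data) > 0


def store(data):
    return "files/example.png"


def post_save_hooks(path):
    time.sleep(0.05)  # simulated hidden cost, an assumption for the demo


def profile_upload(data):
    """Run the upload steps one by one and return per-step timings."""
    timings = {}
    with timed("validate", timings):
        validate(data)
    with timed("store", timings):
        path = store(data)
    with timed("post_save_hooks", timings):
        post_save_hooks(path)
    return timings


if __name__ == "__main__":
    result = profile_upload(b"fake image bytes")
    for step, seconds in sorted(result.items(), key=lambda kv: -kv[1]):
        print(f"{step}: {seconds * 1000:.1f} ms")
```

Sorting by duration puts the suspicious step at the top, which is usually enough to know where to look next.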

My code looked clean.

Nothing suspicious.

Until I noticed something.


The real problem

At the top of one class, there was a strange line.

An import from that cache library.

Just to test, with zero hope, I removed it.

The upload time dropped to 100ms.

I stared at the screen for a few seconds.


What that library was doing

The library was:

  • Clearing Redis completely
  • Rebuilding the cache from scratch
  • Doing this for all rows in the database

On every single file upload.

Yes.

Every time.
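To see why this pattern hurts, here is a rough sketch in Python (not the library's real code, and not our Laravel code). `FakeCache` is a hypothetical in-memory stand-in for Redis that only counts writes, and the two upload functions are illustrative assumptions: one flushes and rebuilds everything, the other writes only the new row.

```python
class FakeCache:
    """Hypothetical in-memory stand-in for Redis that counts operations."""

    def __init__(self):
        self.data = {}
        self.writes = 0
        self.flushes = 0

    def set(self, key, value):
        self.data[key] = value
        self.writes += 1

    def flush_all(self):
        self.data.clear()
        self.flushes += 1


def upload_with_full_rebuild(cache, rows, new_row):
    """Flush the whole cache and re-cache every row on each upload."""
    rows[new_row["id"]] = new_row
    cache.flush_all()
    for row_id, row in rows.items():
        cache.set(f"row:{row_id}", row)


def upload_with_targeted_write(cache, rows, new_row):
    """Cache only the row that actually changed."""
    rows[new_row["id"]] = new_row
    cache.set(f"row:{new_row['id']}", new_row)


def count_writes(upload, existing_rows=10_000):
    """Return how many cache writes a single upload triggers."""
    rows = {i: {"id": i, "name": f"file-{i}"} for i in range(existing_rows)}
    cache = FakeCache()
    upload(cache, rows, {"id": existing_rows, "name": "new.png"})
    return cache.writes


if __name__ == "__main__":
    print("full rebuild:", count_writes(upload_with_full_rebuild))
    print("targeted write:", count_writes(upload_with_targeted_write))
```

With the full rebuild, the cost of one upload grows with the size of the whole table. With the targeted write, it stays constant no matter how many rows exist.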


The victory

I called my boss and said:

“I fixed it. I deserve a candy or at least a chocolate.”

I’ve never won the lottery or a gacha game, but the feeling of finding this bug was the closest thing to that.


Final thoughts

I don’t know if this was skill or pure luck. Probably both.

But a few things are clear:

  • Read the README
  • Read the issues
  • Understand what a library actually does
  • Test things in environments that look like production

Thanks for reading.

For me, it’s an honor to share this.

And remember: you can do more than you think.