Fagner Brack

Originally published at fagnerbrack.com

How Can a Symlink Cause an Outage?

A small harmless convenience can become a significantly harmful inconvenience.

A vector-based artwork showing a man with a bat trying to fight a huge insect coming off the computer screen

I was working on a NodeJS web server in a monorepo. The project had multiple sibling npm projects, each of which spun up its own server on a different local port, all within a single deployable. A homemade gateway/proxy forwarded user requests to each separate service.

One of those NPM modules (let's call it the "PrincipalModule") included a sibling module (let's call it the "DependencyModule"). The PrincipalModule added the dependency using an NPM local module install (npm install --save ../my-dependency):

    "book-payments": "file://../book-payments"

The issue started with the need for convenience.

I made a change to use app-root-path in the DependencyModule to load components from the root of its project, such as const something = require(`${root}/src/path/to/something`);. The PrincipalModule imported some public classes of the DependencyModule, so that code would run in the context of the PrincipalModule.

Everything worked fine in local development and throughout the CI pipeline. All the tests, including the UI tests, which would spin up all the servers together before deployment, were green. However, after the code made its way to production, I got an SMS alert saying prod was down with the following error: "Cannot find module x". The x is the name of one of the classes the code imported using app-root-path to resolve the root.

It was a small project with a small number of users. We had been postponing some operational changes that would have kept the system available even when a server failed to start. For that reason, the bug ended up causing an outage.

There was significant test coverage in this project. Looking at the code and the deployment process, it seemed impossible for an outage to happen immediately after a code change. You would expect some tests to fail before reaching that point. Of course, there was still the possibility of a third party going rogue or of some bug triggering later, which was a risk we were willing to accept given the business circumstances. However, a bug straight after deployment was not expected. It was closer to impossible, so what was going on?

I was perplexed. Integration tests were green, and all the UI tests, which used real servers, were green. I tried running the same CI commands locally to build the app the way it is built for prod. No dice. There was no way I could reproduce the issue.

Something interesting was going on.

The system had decent test coverage and a deployment process above the curve; it seemed impossible to observe an outage right after deployment. What was going on?

I started some debugging in prod by deploying some logging after I reverted to a working commit. Here's what I found:

When running the DependencyModule in the context of the PrincipalModule, I would expect app-root-path to resolve the root of the DependencyModule. Yet, in prod, it was resolving the root of the PrincipalModule. In local development and on the CI servers, the behaviour was what I would expect: the root was relative to the DependencyModule.

After some research, I figured out that when you run npm install ../dependency, NPM automatically creates a symlink between the Dependency and the Principal; the symlink lives inside the node_modules folder of the Principal. The app-root-path module uses __dirname to find the root, and because of the symlink, it resolves as the root of ../dependency. I tried deleting all node_modules, package-lock.json files and reinstalling everything locally. NPM always creates that symlink, so I still couldn't reproduce the "module not found" issue locally.
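The effect of the symlink on __dirname can be sketched with a throwaway folder layout. This is a minimal illustration, assuming Node's default module resolution (symlinks are not preserved); the folder names are hypothetical and the link is created by hand rather than by NPM:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical layout: <tmp>/principal and <tmp>/dependency as siblings.
// realpathSync avoids surprises when the temp folder itself is a link.
const tmp = fs.realpathSync(fs.mkdtempSync(path.join(os.tmpdir(), 'symlink-demo-')));
const principal = path.join(tmp, 'principal');
const dependency = path.join(tmp, 'dependency');
fs.mkdirSync(path.join(principal, 'node_modules'), { recursive: true });
fs.mkdirSync(dependency);

// The dependency simply reports the directory it believes it lives in.
fs.writeFileSync(path.join(dependency, 'index.js'), 'module.exports = __dirname;\n');

// Stand-in for what a local install leaves behind: a link, not a copy.
const link = path.join(principal, 'node_modules', 'dependency');
fs.symlinkSync(dependency, link, 'junction'); // the type argument only matters on Windows

// Loading through the link still reports the real folder of the dependency.
const seenDirname = require(link);
console.log(seenDirname === dependency);
```

Because Node resolves the link to its real path before loading the file, anything derived from __dirname inside the dependency points at the sibling folder, not at the principal's node_modules.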

Why was the code not behaving the same as in prod? It was the same Node version, the same NPM version, and the same app-root-path version on both the PrincipalModule and the DependencyModule! I downloaded the zip artifact generated by the pipeline, extracted it, and tried to run it locally.

That's when I finally reproduced the bug! Here's the source of the issue:

The code couldn't find some module in prod because the symlink between Principal and Dependency was gone. The app-root-path package was resolving the root of the PrincipalModule in prod, but locally, it resolved the root of the DependencyModule. This is related to, though in a different context from, an issue reported in the app-root-path project.

Oh, well… did you mention "zip file"?

I used Elastic Beanstalk (EB). EB requires the project to be packaged as a zip file, which the eb command then uploads to the server.

Before deployment, the build process zips all the code using the macOS zip command line tool. Then, it uploads the zip file to the prod server, which unzips the code and calls npm start.

If you know how symlinks work, you may have figured out the problem by now.

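Bundling node_modules, zipping it in the build process, and then unzipping it in prod without running npm install removes the symlink that NPM created automatically. By default, zip stores a copy of the files a symlink points to rather than the link itself, so after unzipping, the dependency is a plain folder inside the node_modules of the PrincipalModule. Since the symlink is gone, app-root-path resolves the root path of the PrincipalModule, not the DependencyModule.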
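Why a copy changes the resolved root can be sketched with a simplified heuristic that keeps everything before the first node_modules segment of the folder the helper is loaded from. The function guessRoot, the package name root-helper, and the /srv/app location are all assumptions made for illustration; this is not app-root-path's actual code:

```javascript
const path = require('path');

// Simplified stand-in for a root-finding heuristic: keep everything before
// the first "node_modules" segment of the directory the helper is loaded from.
// This is an assumption for illustration, not app-root-path's actual code.
function guessRoot(helperDirname) {
  const resolved = path.resolve(helperDirname);
  const index = resolved.indexOf(path.sep + 'node_modules');
  return index === -1 ? resolved : resolved.slice(0, index);
}

// Hypothetical install location; "root-helper" is a made-up package name.
const base = path.resolve(path.sep, 'srv', 'app');

// Symlink intact: Node loads the helper from its real location inside the dependency.
const whenLinked = path.join(base, 'dependency', 'node_modules', 'root-helper');

// Symlink replaced by a copy: the helper now sits below the principal's node_modules.
const whenCopied = path.join(
  base, 'principal', 'node_modules', 'dependency', 'node_modules', 'root-helper'
);

console.log(guessRoot(whenLinked)); // root of the dependency
console.log(guessRoot(whenCopied)); // root of the principal
```

The same files produce two different roots depending only on whether the dependency is reached through a link or through a copy, which matches the difference observed between local development and prod.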

The location of the files imported in prod is incorrect, and the problem is impossible to reproduce by installing the project on your machine, because npm install always recreates the symlink.

It's unbelievable until you find the root cause, and then it seems stupidly obvious.

Some learnings for me

  • Zipping removes symlinks by default. I don't use symlinks very often, so I didn't know about this.
  • When using NPM, understand the symlink magic that goes on behind the scenes when running npm install.
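One way to make that magic visible is to list which installed packages are symlinks before packaging. The sketch below is a hypothetical helper, not part of the original build; the function name listLinkedPackages and the demo folder names are assumptions:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Returns the names of packages directly under `nodeModulesDir` that are
// symbolic links, looking one level into scope folders such as "@scope".
function listLinkedPackages(nodeModulesDir) {
  const linked = [];
  for (const entry of fs.readdirSync(nodeModulesDir, { withFileTypes: true })) {
    if (entry.isSymbolicLink()) {
      linked.push(entry.name);
    } else if (entry.isDirectory() && entry.name.startsWith('@')) {
      const scopeDir = path.join(nodeModulesDir, entry.name);
      for (const scoped of fs.readdirSync(scopeDir, { withFileTypes: true })) {
        if (scoped.isSymbolicLink()) linked.push(`${entry.name}/${scoped.name}`);
      }
    }
  }
  return linked.sort();
}

// Demo on a throwaway folder; the package names are made up.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'linked-demo-'));
const nodeModules = path.join(demoRoot, 'principal', 'node_modules');
fs.mkdirSync(path.join(nodeModules, 'regular-package'), { recursive: true });
fs.mkdirSync(path.join(demoRoot, 'book-payments'));
fs.symlinkSync(
  path.join(demoRoot, 'book-payments'),
  path.join(nodeModules, 'book-payments'),
  'junction' // the type argument only matters on Windows
);

const linkedPackages = listLinkedPackages(nodeModules);
console.log(linkedPackages);
```

A build script could fail, or at least print a warning, when this list is not empty and the archive step is known to drop links.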

I started using app-root-path only as a convenience to import all my modules from the root. I have now learned not to rely too much on such conveniences, and not to couple the code to third-party dependencies that make inferences about the system based on the environment. Depending on them makes the system prone to edge cases that are rather difficult to debug.

Some recommendations for @npmjs:

Document the magic behind your commands every time they appear in the docs.

There is indeed a line in the docs of the npm install command stating:

npm install <folder>:

Install the package in the directory as a symlink in the current project.

https://docs.npmjs.com/cli/v7/commands/npm-install

However, I didn't look at that part of the docs while researching this problem. There's a reason why.

The reason is that the problem didn't happen right after I ran npm install to install the DependencyModule, so it didn't make sense to look at the documentation for the install command. Instead, the code already had the DependencyModule installed in the PrincipalModule; the bug only exposed itself after the code tried to use __dirname through app-root-path.

So I looked at the Local Paths documentation on the NPM website, and there's nothing there stating that local packages are installed as symlinks.

Today I learned: npm install on a local folder creates a symlink. The "zip" command line tool doesn't preserve symlinks by default.

Now I may either:

  1. Stop using app-root-path. Instead, import modules relative to where they are in the file system regardless of the project's root[1].
  2. Use zip --symlinks when zipping the code for prod with an obvious comment on why[2].

This issue is an excellent example of how a harmless convenience can turn into an outage.

A small harmless convenience can become a significantly harmful inconvenience.

And you, what are your thoughts about this? Would you have been able to prevent this error from happening?

Happy to hear your thoughts.

1: The issue here is that you must rely on the IDE auto-adjusting the relative path whenever you create a new directory. Another issue is that Git shows a diff with directory changes from ../ to ../../ instead of from ${root}/src/book-payments/ConnectRepository.js to ${root}/src/book-payments/storage/ConnectRepository.js, which is clearer.

2: That comes at the cost of coupling one command used in one part of the system with a command used in another. When a developer reads the code, there's no clear cause and effect relationship between each piece.

Thanks to Marcel Silva and Raimo Radczewski for their insightful input on this post.

Thanks for reading. If you have some feedback, reach out to me on Twitter, Facebook or Github.
