DANISH ZULFIQAR

Posted on Jun 28

I Deployed 6 AI Systems Live — Here's What Actually Broke

#ai #productivity #opensource #product

Environment drift and dependency version traps

I Deployed 6 AI Systems Live — Here's What Actually Broke

A few weeks ago I wrote about the 5 bugs that cost me 60+ hours building 49 AI systems. Every one of those bugs lived inside the code itself wrong array layout, a renamed model class, a serialization mismatch.

This article is the second half of that story, and it taught me something more uncomfortable: code that runs perfectly on your machine can fail completely the moment it leaves your machine for reasons that have nothing to do with your code.

I took 6 of my pinned GitHub projects and deployed every one of them live on Streamlit Cloud. Locally, all 6 worked without a single error. Deploying them surfaced 5 failures I had never seen before, none of which were bugs in my logic.

Here they are, in the order I hit them.

Failure 1 — A Module That Existed Yesterday, Gone Today

My RAG chatbot used this import, unchanged for weeks:

from langchain.chains import ConversationalRetrievalChain

Locally: works. Deployed: instant crash.

ModuleNotFoundError: No module named 'langchain.chains'

The cause had nothing to do with my code. My local environment had an old, cached version of LangChain installed months ago. The deploy environment did a clean install and pulled whatever the latest version was at that moment and recent LangChain releases moved legacy chain classes like this one out of the core package entirely.

The fix that actually worked pin the exact version that still contains the class, rather than chasing the newest API pattern under deployment pressure:

langchain==0.3.7
langchain-community==0.3.7

The lesson: "it works on my machine" is frequently true specifically because your machine never reinstalled anything recently. A clean deploy environment has no such luxury it gets whatever is newest the moment it builds. Pin your versions before you ever need to debug this at 1 AM.

Failure 2 — A File That Exists, Until It Doesn't

My construction RAG project loads a prebuilt FAISS vector index from disk:

vectorstore = FAISS.load_local("faiss_index", embedding, allow_dangerous_deserialization=True)

Locally, instant load. Deployed, a raw crash deep inside FAISS's C++ binding with no clean Python traceback the kind of failure that gives you nothing to Google.

The actual cause: Git LFS. My index file had been quietly stored via Git LFS, which keeps a tiny text pointer in your git history instead of the real binary. Locally, my LFS client silently resolved that pointer into the real file, so I never noticed. The cloud platform's git clone fetched the pointer file only a few hundred bytes of text and handed that to FAISS, expecting a binary index.

The fix:

git lfs untrack "faiss_index/*"
git rm --cached faiss_index/index.faiss
git add faiss_index/index.faiss
git commit -m "Stop using Git LFS — commit as plain binary"
git push

The lesson: Git LFS is invisible exactly when it's working correctly on your machine. The only time you discover you were depending on it is the first time you deploy somewhere that doesn't support it.

Failure 3 — Two Different Size Limits Wearing the Same Error Message

Uploading an 83MB PyTorch model checkpoint through GitHub's website gave me this:

Yowza, that's a big file. Try again with a file smaller than 25MB.

I assumed GitHub simply couldn't take files that size. It can the website's drag-and-drop has a 25MB ceiling, but git push from the command line has a completely separate 100MB ceiling. Same platform, two different limits depending on which door you walk through.

The fix:

git add model.pth
git commit -m "Add trained model checkpoint"
git config --global http.postBuffer 157286400
git push

That postBuffer line matters specifically for files in the 50-100MB range without it, larger pushes can silently time out mid-transfer on a slower connection.

The lesson: a platform's documented limit and a UI's enforced limit are not always the same number. When something fails at a suspiciously round threshold, check whether you're hitting the actual platform limit or an arbitrary limit of the specific interface you happened to use.

Failure 4 — The Platform Changed Under Me, Without Asking

Midway through this deployment sprint, an app that had been working for days suddenly broke with a wall of import errors torchvision missing, dozens of warnings cascading from deep inside transformers.

Nothing in my code had changed. What had changed was the Python version my deploy platform silently selected newer than what I'd pinned, and several of my dependencies didn't yet have compatible builds for it.

The fix that actually held: I stopped trying to pin a specific Python version against a platform that wasn't reliably honoring the pin, and instead removed every heavy compiled dependency I didn't strictly need no torch, no transformers, no faiss for a project whose knowledge base was small enough to live directly in a prompt instead of a vector store. A requirements file with three lines:

streamlit==1.40.0
requests==2.32.3
python-dotenv==1.0.1

cannot break this way, because there is nothing in it with compiled platform-specific wheels to break.

The lesson: when a managed platform controls the runtime, the most resilient strategy is not fighting to pin every variable it's minimizing how many variables you depend on in the first place.

Failure 5 — A Push That Succeeds and Goes Nowhere You're Looking

The most disorienting failure of the five: git push reported complete success files written, no errors, a clean exit. The file was simply not visible anywhere on GitHub afterward.

git branch -a
* master
  remotes/origin/main

My repository had been created with a default main branch already on GitHub. I had been committing and pushing to master the entire time a branch that existed locally and now also existed remotely, sitting parallel to main, never appearing on the page I was checking.

The fix:

git push origin master:main

or, going forward, simply commit directly to whichever branch GitHub actually shows by default.

The lesson: a successful push confirms your laptop and the remote agree with each other. It confirms nothing about whether that destination is the one a human is looking at in a browser tab.

The Pattern Across All Five

None of these were logic bugs. My code was correct in every case. Every failure came from a gap between two environments that I had assumed were equivalent and were not:

my cached dependencies vs. a fresh install
my local LFS resolution vs. a clone that skips LFS
a UI's limit vs. a protocol's limit
the runtime I requested vs. the runtime I was actually given
the branch I was typing into vs. the branch being displayed

"Works locally" is a claim about one specific environment. Deployment is the process of discovering every assumption that claim was quietly resting on.

A Short Checklist For Next Time

Before deploying anything again, I now check:

Are my dependency versions pinned to exact numbers, not ranges?
Does anything in this repo rely on Git LFS — and does my deploy target support it?
Are any committed files close to a platform's size limits, and which limit — UI or protocol?
Can I remove a heavy compiled dependency instead of fighting to pin its version?
Does git branch -a show exactly one branch I'm pushing to, with no silent second branch sitting beside it?

Five questions, thirty seconds, asked before the deploy instead of discovered after.

All 6 systems are live and open source:
🔗 github.com/Danish08654

If you've hit a deployment failure that had nothing to do with your actual code I'd like to hear it. Drop it in the comments.

Top comments (5)

Mike Czerwinski • Jun 28

The line that lands is "works locally is a claim about one specific environment." That is the real bug under all five, and it is worth more than the checklist.

Where I would push: Failure 4. Stripping torchvision and transformers down to streamlit plus requests is not a fix, it is a different system. If the deploy that finally went green no longer runs the model the local version ran, then "it works in production" became a claim about a thinner app than the one you tested. So the honest question: after the minimal requirements cut, was the deployed system still doing the same inference, or did green just mean a lighter thing started?

That gap is the one worth a sixth section.