DEV Community: DANISH ZULFIQAR

Check out my CoderLegion profile & latest post!

DANISH ZULFIQAR — Fri, 10 Jul 2026 04:53:06 +0000

View my full profile: https://coderlegion.com/user/Danish

Workflow

DANISH ZULFIQAR — Thu, 09 Jul 2026 10:51:30 +0000

DANISH ZULFIQAR

Jul 5

A clean, complete guide to version control, collaboration, and containerization. Commands, workflows, and concepts - all in one place.

#ai #productivity #development

2 min read

Production vibe

DANISH ZULFIQAR — Thu, 09 Jul 2026 10:49:58 +0000

DANISH ZULFIQAR

Jul 9

If you're just getting started, think of AWS, GCP, and Azure as cloud platforms

#ai #aws #gcp #azure

1 min read

If you're just getting started, think of AWS, GCP, and Azure as cloud platforms

DANISH ZULFIQAR — Thu, 09 Jul 2026 10:48:52 +0000

What is Cloud Computing?

Traditionally, if you wanted to run an application, you had to buy:

Physical servers
Storage devices
Networking equipment
Cooling systems
Backup power
Security infrastructure

Cloud providers own this infrastructure and let you use it on demand.

**The Three Major Cloud Providers**

**1. Amazon Web Services (AWS)**

Released: 2006

Company: Amazon

AWS is the world's largest cloud platform by market share.

**Common Services**

EC2 (Virtual Servers)
S3 (Object Storage)
RDS (Managed Databases)
Lambda (Serverless Functions)
IAM (Identity & Access Management)
CloudWatch (Monitoring)
SageMaker (Machine Learning)
Bedrock (Generative AI)

**Used by**

Netflix
Airbnb
NASA
Samsung
BMW

**Strengths**

Largest service ecosystem
Very mature platform
Excellent scalability
Strong AI and ML offerings

**2. Google Cloud Platform (GCP)**

Released: 2008

Company: Google

Google Cloud is known for AI, machine learning, big data, and Kubernetes.

**Common Services**

Compute Engine
Cloud Storage
Cloud SQL
BigQuery
Cloud Run
Vertex AI
Google Kubernetes Engine (GKE)

**Used by**

Spotify
PayPal
Snapchat
Twitter/X

**Strengths**

Excellent AI ecosystem
Big data analytics
High-performance networking

**3. Microsoft Azure**

Released: 2010

Company: Microsoft

Azure is widely adopted by enterprises and organizations already using Microsoft technologies.

**Common Services**

Virtual Machines
Blob Storage
Azure SQL Database
Azure Kubernetes Service (AKS)
Azure OpenAI Service
Azure AI Search

**Used by**

Walmart
Adobe
LinkedIn
Many government organizations
Large enterprise businesses

**Strengths**

Enterprise integration
Strong business adoption

A clean, complete guide to version control, collaboration, and containerization. Commands, workflows, and concepts - all in one place.

DANISH ZULFIQAR — Sun, 05 Jul 2026 03:41:24 +0000

Cheat sheet - all essential commands

**Git**

| Command | What it does |
| --- | --- |
| `git init` | Initialize a new repo |
| `git clone <url>` | Clone a remote repo locally |
| `git status` | Show working tree status |
| `git add .` | Stage all changes |
| `git commit -m "<message>"` | Commit staged changes with message |
| `git log --oneline` | Compact commit history |
| `git switch -c <branch>` | Create and switch to a new branch |
| `git merge <branch>` | Merge branch into current |
| `git stash` | Stash uncommitted changes |
| `git stash pop` | Re-apply stashed changes |
| `git revert <commit>` | Safely undo a commit |
| `git tag v1.0.0` | Tag the current commit |

**GitHub / Remote**

| Command | What it does |
| --- | --- |
| `git remote add origin <url>` | Link local repo to remote |
| `git push -u origin main` | Push and set tracking branch |
| `git pull` | Fetch and merge remote changes |
| `git fetch origin` | Fetch without merging |
| `gh pr create` | Open a pull request via CLI |
| `gh pr list` | List open pull requests |
| `gh run list` | View GitHub Actions runs |

**Docker**

| Command | What it does |
| --- | --- |
| `docker build -t name:tag .` | Build image from Dockerfile |
| `docker run -p 8000:8000 name` | Run container with port mapping |
| `docker run -d --name x name` | Run container in background |
| `docker ps` | List running containers |
| `docker logs -f <container>` | Follow container logs |
| `docker exec -it <container> bash` | Shell into a running container |
| `docker stop <container>` | Stop a container |
| `docker rm <container>` | Remove a container |
| `docker images` | List local images |
| `docker rmi <image>` | Delete an image |
| `docker compose up -d` | Start all services in background |
| `docker compose down` | Stop and remove all services |
| `docker compose logs -f` | Follow logs for all services |
| `docker system prune` | Clean up unused images/containers |

That's the full toolkit.
Git tracks it. GitHub shares it. Docker ships it. Together they form the backbone of every modern production AI system.

Maybe you face it?

DANISH ZULFIQAR — Sun, 28 Jun 2026 03:18:26 +0000

Environment drift and dependency version traps

DANISH ZULFIQAR

Jun 28

I Deployed 6 AI Systems Live — Here's What Actually Broke

#ai #productivity #opensource #product

5 min read

I Deployed 6 AI Systems Live — Here's What Actually Broke

DANISH ZULFIQAR — Sun, 28 Jun 2026 03:16:53 +0000

A few weeks ago I wrote about the 5 bugs that cost me 60+ hours building 49 AI systems. Every one of those bugs lived inside the code itself: a wrong array layout, a renamed model class, a serialization mismatch.

This article is the second half of that story, and it taught me something more uncomfortable: code that runs perfectly on your machine can fail completely the moment it leaves your machine, for reasons that have nothing to do with your code.

I took 6 of my pinned GitHub projects and deployed every one of them live on Streamlit Cloud. Locally, all 6 worked without a single error. Deploying them surfaced 5 failures I had never seen before, none of which were bugs in my logic.

Here they are, in the order I hit them.

Failure 1 — A Module That Existed Yesterday, Gone Today

My RAG chatbot used this import, unchanged for weeks:

from langchain.chains import ConversationalRetrievalChain

Locally: works. Deployed: instant crash.

ModuleNotFoundError: No module named 'langchain.chains'

The cause had nothing to do with my code. My local environment had an old, cached version of LangChain installed months ago. The deploy environment did a clean install and pulled whatever the latest version was at that moment, and recent LangChain releases moved legacy chain classes like this one out of the core package entirely.

The fix that actually worked: pin the exact version that still contains the class, rather than chasing the newest API pattern under deployment pressure:

langchain==0.3.7
langchain-community==0.3.7

The lesson: "it works on my machine" is frequently true specifically because your machine has not reinstalled anything recently. A clean deploy environment has no such luxury: it gets whatever is newest the moment it builds. Pin your versions before you ever need to debug this at 1 AM.
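A minimal sketch of how that pinning rule could be checked before a deploy. The function name and the sample requirements are hypothetical, and the pattern only recognises simple `name==version` lines, so entries with environment markers or URLs would be reported as unpinned.

```python
import re

# Matches "name==1.2.3", with an optional extras suffix such as "pkg[extra]==1.0".
# Wildcards like "==1.*" are deliberately not accepted as exact pins.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.+!_-]+$")


def find_unpinned(requirements_text: str) -> list:
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


example = """
langchain==0.3.7
langchain-community==0.3.7
streamlit>=1.30
requests
"""
print(find_unpinned(example))  # ['streamlit>=1.30', 'requests']
```

Running something like this in CI, or just before pushing, turns "pin your versions" from a habit into a check.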

Failure 2 — A File That Exists, Until It Doesn't

My construction RAG project loads a prebuilt FAISS vector index from disk:

vectorstore = FAISS.load_local("faiss_index", embedding, allow_dangerous_deserialization=True)

Locally, instant load. Deployed, a raw crash deep inside FAISS's C++ binding with no clean Python traceback, the kind of failure that gives you nothing to Google.

The actual cause: Git LFS. My index file had been quietly stored via Git LFS, which keeps a tiny text pointer in your git history instead of the real binary. Locally, my LFS client silently resolved that pointer into the real file, so I never noticed. The cloud platform's git clone fetched only the pointer file, a few hundred bytes of text, and handed that to FAISS, which expected a binary index.

The fix:

git lfs untrack "faiss_index/*"
git rm --cached faiss_index/index.faiss
git add .gitattributes faiss_index/index.faiss
git commit -m "Stop using Git LFS — commit as plain binary"
git push

The lesson: Git LFS is invisible exactly when it's working correctly on your machine. The only time you discover you were depending on it is the first time you deploy somewhere that doesn't support it.
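One way to catch this earlier is to test whether a file is still an LFS pointer before handing it to a binary loader. The sketch below relies on the fact that LFS pointer files begin with a fixed `version` line; the helper names, the temporary files, and the fake `oid` are illustrative assumptions, not part of my project.

```python
import tempfile
from pathlib import Path

# Git LFS pointer files start with this exact line (pointer spec v1).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like an LFS pointer rather than real content."""
    with path.open("rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


def require_real_file(path: Path) -> Path:
    """Raise a readable error before a binary loader ever sees a pointer."""
    if is_lfs_pointer(path):
        raise RuntimeError(
            f"{path} is a Git LFS pointer ({path.stat().st_size} bytes), "
            "not the real file; the clone did not fetch LFS objects."
        )
    return path


# Demo with throwaway files: one fake pointer, one fake binary.
with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "index.faiss"
    pointer.write_bytes(LFS_HEADER + b"\noid sha256:0000\nsize 12345\n")  # fake oid
    pointer_detected = is_lfs_pointer(pointer)

    real = Path(tmp) / "real.bin"
    real.write_bytes(b"\x00\x01\x02 binary payload")
    real_detected = is_lfs_pointer(real)

print(pointer_detected, real_detected)  # True False
```

Calling `require_real_file(Path("faiss_index/index.faiss"))` just before `FAISS.load_local` would have replaced the opaque C++ crash with a one-line explanation.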

Failure 3 — Two Different Size Limits Wearing the Same Error Message

Uploading an 83MB PyTorch model checkpoint through GitHub's website gave me this:

Yowza, that's a big file. Try again with a file smaller than 25MB.

I assumed GitHub simply couldn't take files that size. It can: the website's drag-and-drop has a 25MB ceiling, but git push from the command line has a completely separate 100MB ceiling. Same platform, two different limits depending on which door you walk through.

The fix:

git add model.pth
git commit -m "Add trained model checkpoint"
git config --global http.postBuffer 157286400
git push

That postBuffer line raises Git's HTTP buffer to 150MB (157286400 bytes). It mattered for me with files in the 50-100MB range: without it, larger pushes can silently time out mid-transfer on a slower connection.

The lesson: a platform's documented limit and a UI's enforced limit are not always the same number. When something fails at a suspiciously round threshold, check whether you're hitting the actual platform limit or an arbitrary limit of the specific interface you happened to use.
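The two thresholds can be written down as a tiny lookup so the question "which door does this file fit through?" gets answered before uploading. This is a sketch: the function name is hypothetical, and treating 1MB as 1024 * 1024 bytes with inclusive boundaries is my assumption, not something the error message states.

```python
MIB = 1024 * 1024
WEB_UPLOAD_LIMIT = 25 * MIB   # browser drag-and-drop ceiling from the error above
GIT_PUSH_LIMIT = 100 * MIB    # per-file ceiling for a command-line push


def upload_route(size_bytes: int) -> str:
    """Say which upload path can accept a file of the given size."""
    if size_bytes <= WEB_UPLOAD_LIMIT:
        return "web upload or git push"
    if size_bytes <= GIT_PUSH_LIMIT:
        return "git push only"
    return "too large for a plain git push"


print(upload_route(83 * MIB))   # git push only
print(upload_route(10 * MIB))   # web upload or git push
print(upload_route(150 * MIB))  # too large for a plain git push
```

For a real file, `Path("model.pth").stat().st_size` gives the size in bytes to pass in.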

Failure 4 — The Platform Changed Under Me, Without Asking

Midway through this deployment sprint, an app that had been working for days suddenly broke with a wall of import errors: torchvision missing, dozens of warnings cascading from deep inside transformers.

Nothing in my code had changed. What had changed was the Python version my deploy platform silently selected: newer than what I'd pinned, and several of my dependencies didn't yet have compatible builds for it.

The fix that actually held: I stopped trying to pin a specific Python version against a platform that wasn't reliably honoring the pin, and instead removed every heavy compiled dependency I didn't strictly need: no torch, no transformers, no faiss, for a project whose knowledge base was small enough to live directly in a prompt instead of a vector store. A requirements file with three lines:

streamlit==1.40.0
requests==2.32.3
python-dotenv==1.0.1

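is far less likely to break this way, because it no longer asks for torch, transformers, or faiss. Streamlit still pulls in a few compiled dependencies of its own, such as numpy and pandas, but those usually ship wheels for new Python versions much sooner.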

The lesson: when a managed platform controls the runtime, the most resilient strategy is not fighting to pin every variable; it is minimizing how many variables you depend on in the first place.
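When the platform picks the interpreter, it also helps to make the mismatch visible instead of discovering it through import errors. A minimal sketch, assuming the app was developed on Python 3.11 (a placeholder value) and using a hypothetical helper name:

```python
import sys
from typing import Optional, Sequence, Tuple

TESTED_ON = (3, 11)  # hypothetical: the major.minor version the app was developed on


def runtime_mismatch(expected: Tuple[int, int],
                     actual: Optional[Sequence[int]] = None) -> Optional[str]:
    """Return a warning if the running interpreter differs from the tested one."""
    version = tuple(actual if actual is not None else sys.version_info)
    major, minor = version[:2]
    if (major, minor) != tuple(expected):
        return (f"Tested on Python {expected[0]}.{expected[1]}, "
                f"but running on {major}.{minor}")
    return None


message = runtime_mismatch(TESTED_ON)
if message:
    print(message, file=sys.stderr)  # shows up at the top of the deploy logs
```

This does not prevent the platform from changing the runtime, but the first line of the log then says what changed.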

Failure 5 — A Push That Succeeds and Goes Nowhere You're Looking

The most disorienting failure of the five: git push reported complete success: files written, no errors, a clean exit. The file was simply not visible anywhere on GitHub afterward.

git branch -a
* master
  remotes/origin/main

My repository had been created on GitHub with main as its default branch. I had been committing and pushing to master the entire time, a branch that existed locally and now also existed remotely, sitting parallel to main, never appearing on the page I was checking.

The fix:

git push origin master:main

or, going forward, simply commit directly to whichever branch GitHub actually shows by default.

The lesson: a successful push confirms your laptop and the remote agree with each other. It confirms nothing about whether that destination is the one a human is looking at in a browser tab.
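The branch mismatch is easy to detect mechanically from the same `git branch -a` output. A small sketch, assuming the default branch name is known and passed in (the function name is hypothetical):

```python
def pushing_to_hidden_branch(branch_output: str, default_branch: str = "main") -> bool:
    """True when the checked-out branch is not the branch GitHub shows by default."""
    current = None
    for line in branch_output.splitlines():
        if line.startswith("*"):          # git marks the current branch with "*"
            current = line[1:].strip()
    return current is not None and current != default_branch


output = """* master
  remotes/origin/main
"""
print(pushing_to_hidden_branch(output))  # True
```

The text could come from `subprocess.run(["git", "branch", "-a"], capture_output=True, text=True).stdout` in a pre-push script.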

The Pattern Across All Five

None of these were logic bugs. My code was correct in every case. Every failure came from a gap between two environments that I had assumed were equivalent and were not:

my cached dependencies vs. a fresh install
my local LFS resolution vs. a clone that skips LFS
a UI's limit vs. a protocol's limit
the runtime I requested vs. the runtime I was actually given
the branch I was typing into vs. the branch being displayed

"Works locally" is a claim about one specific environment. Deployment is the process of discovering every assumption that claim was quietly resting on.

A Short Checklist For Next Time

Before deploying anything again, I now check:

Are my dependency versions pinned to exact numbers, not ranges?
Does anything in this repo rely on Git LFS — and does my deploy target support it?
Are any committed files close to a platform's size limits, and which limit — UI or protocol?
Can I remove a heavy compiled dependency instead of fighting to pin its version?
Does git branch -a show exactly one branch I'm pushing to, with no silent second branch sitting beside it?

Five questions, thirty seconds, asked before the deploy instead of discovered after.

All 6 systems are live and open source:
🔗 github.com/Danish08654

If you've hit a deployment failure that had nothing to do with your actual code, I'd like to hear it. Drop it in the comments.

I Built 48 Production AI Systems in 60 Days — Here Is What Nobody Tells You About Real AI Engineering

DANISH ZULFIQAR — Sat, 13 Jun 2026 06:59:41 +0000

I Built 48 Production AI Systems

Here Is What Nobody Tells You About Real AI Engineering

I did not study AI engineering. I built it.

For 60 days I woke up at 6 AM, opened VS Code, and shipped one production AI system every day. Not notebooks. Not tutorials. Not demos. Systems — with a live REST API, an interactive dashboard, a trained model, and a GitHub repo with a README that explains the business problem it solves.

48 systems later, I want to tell you what courses do not cover.

Not the architecture patterns. Not the frameworks. The real stuff. The 3 AM stuff. The "why is this working on Colab but crashing on my laptop" stuff.

This is that article.

First — What I Actually Built

Before I get to the lessons, here is the scope so you understand why these lessons matter.

Phase 1 — Production ML (Days 1-7)
Credit scoring for gig workers. B2B intent detection. Dynamic pricing. Carbon estimation. Clinical trial matching. Supplier risk intelligence. Economic forecasting. Every one deployed as a FastAPI endpoint with a Streamlit dashboard.

Phase 2 — Deep Learning and Computer Vision (Days 8–14)
Deepfake detector. Satellite change detector. Document OCR. Plant disease detection. Fitness pose coach. Real models, real inference, real errors.

Phase 3 — LLMs and Agents (Days 15–21)
LangGraph multi-agent research pipeline. MCP business agent. Text-to-image generator. Vertical RAG for construction. Voice agent. Every one using free APIs — Groq, Tavily, gTTS, Whisper.

Phase 4 — MLOps (Days 22–30)
End-to-end MLOps pipeline with MLflow, Evidently AI, auto-retraining, Grafana monitoring, and Docker deployment.

That is what I shipped. Now here is what it cost me.

The 5 Bugs That Taught Me More Than Any Course

Bug 1 — OpenCV and Non-Contiguous Arrays

On Day 8 I was building a deepfake detector. XceptionNet was working. The preprocessing pipeline was clean. Then I hit this:

error: OpenCV(4.13.0) :-1: error: (-5:Bad argument)
in function 'ellipse'
> Layout of the output array img is incompatible with cv::Mat

I stared at this for four hours.

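The problem was not my logic. It was memory layout. Some numpy operations return views rather than fresh arrays: slicing, transposing, or flipping channels with img[:, :, ::-1], for example. In my pipeline the image passed through np.where, np.clip, and a PIL round trip before reaching cv2, and somewhere along that path it ended up non-contiguous, meaning its strides no longer described one packed, C-ordered block of memory. OpenCV's C++ backend draws directly into that buffer and rejects such a layout with this exact error.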

The fix is one line:

img = np.ascontiguousarray(img)

Call this before every single OpenCV operation. Not just the ones that fail. Every one. When the array is already contiguous the call returns it without copying, so the cost is negligible. The failure looks random, but it depends on which numpy operation preceded the cv2 call.
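The layout difference is easy to see without OpenCV at all. A short sketch using a channel flip as the example of an operation that returns a non-contiguous view (the array shape is arbitrary):

```python
import numpy as np

img = np.zeros((4, 6, 3), dtype=np.uint8)  # packed, C-contiguous image buffer
flipped = img[:, :, ::-1]                  # channel flip: a view with a negative stride

print(img.flags["C_CONTIGUOUS"])       # True
print(flipped.flags["C_CONTIGUOUS"])   # False

fixed = np.ascontiguousarray(flipped)  # copies the view into a packed buffer
print(fixed.flags["C_CONTIGUOUS"])     # True

# When the input is already contiguous, no copy is made.
same = np.ascontiguousarray(img)
print(np.shares_memory(same, img))     # True
```

Checking `arr.flags["C_CONTIGUOUS"]` right before a cv2 call is a quick way to confirm whether layout is the cause.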

What this taught me: The gap between a working notebook and a working system is often not logic. It is memory, types, and environment — things that tutorials never mention because they never hit production.

Bug 2 — XGBoost 3.x Broke SHAP

On Day 1 I was building a credit scoring system. I had trained an XGBoost model with SHAP explainability for regulatory compliance, so that every decision could be explained. It worked perfectly on Google Colab.

I moved to VS Code. Everything crashed.

ValueError: <class 'numpy.random._mt19937.MT19937'>
is not a known BitGenerator module.

The root cause was a numpy version mismatch — Colab was using a newer numpy than my local environment. But the deeper problem was that XGBoost 3.x and the SHAP library had an internal incompatibility nobody documented clearly.

The solution I found was to stop using SHAP entirely and use XGBoost's native contributions instead:

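import shap  # only needed for the first variant
import xgboost as xgb

# Instead of this (broke for me on XGBoost 3.x)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Use this: XGBoost's built-in contributions.
# `model` must be an xgb.Booster here; with the sklearn wrapper
# (XGBClassifier), call model.get_booster().predict(...) instead.
contributions = model.predict(
    xgb.DMatrix(X),
    pred_contribs=True
)
# The last column of `contributions` is the bias term.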

The math is the same: pred_contribs returns TreeSHAP values computed by XGBoost itself, with one extra trailing column holding the bias term. The dependency conflict disappears.

What this taught me: Version pinning is not optional in production ML. The first thing every new project needs is a locked requirements file. Ship the environment, not just the code.

Bug 3 — LangGraph on Windows Kills Async

On Day 21 I was building an MCP business agent — LangGraph orchestrating 8 MCP tools for invoice processing, AP approval, and Slack notifications. The API was running. The workflow triggered. Then silence.

No error. No output. Just a FastAPI background thread that started and disappeared.

The problem showed up only on Windows. Python's asyncio.run() creates a new event loop each time it is called, and it raises RuntimeError if the calling thread already has a loop running. In my FastAPI background threads on Windows, asyncio.run() collided with an existing loop and the failure never reached the logs. On Linux I never saw this happen.

The fix:

# At the top of main.py — Windows only
import asyncio
import sys

if sys.platform == 'win32':
    asyncio.set_event_loop_policy(
        asyncio.WindowsProactorEventLoopPolicy()
    )

# In background thread functions (run_agent is the async workflow defined elsewhere)
def run_workflow_background(command: str):
    if sys.platform == 'win32':
        # Give this thread its own Proactor loop instead of calling asyncio.run()
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
        try:
            result = loop.run_until_complete(run_agent(command))
        finally:
            loop.close()
    else:
        result = asyncio.run(run_agent(command))
    return result

What this taught me: Cross-platform is a real constraint, not a theoretical one. If you build on Windows and deploy on Linux — or the reverse — test the async behavior explicitly. It will not tell you it is broken. It will just silently do nothing.
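The underlying rule can be reproduced on any platform with the standard library alone: asyncio.run() refuses to start inside a thread that already runs a loop, but works in a thread of its own. In this sketch `run_agent` is a stand-in coroutine and `run_in_worker_thread` is a hypothetical helper; the point is that the worker captures exceptions instead of letting them vanish.

```python
import asyncio
import threading


async def run_agent(command: str) -> str:
    # Stand-in for the real workflow coroutine (hypothetical body).
    await asyncio.sleep(0)
    return f"done: {command}"


def run_in_worker_thread(command: str) -> str:
    """Run the coroutine on a fresh event loop owned by a dedicated thread."""
    outcome = {}

    def worker():
        try:
            outcome["result"] = asyncio.run(run_agent(command))
        except Exception as exc:  # keep the failure instead of losing it
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]


async def main():
    nested_error = None
    coro = run_agent("nested")
    try:
        asyncio.run(coro)  # fails: this thread already runs an event loop
    except RuntimeError as exc:
        nested_error = str(exc)
        coro.close()       # avoid a "never awaited" warning
    # A separate thread has no running loop, so asyncio.run() works there.
    return nested_error, run_in_worker_thread("invoice-42")


nested_error, result = asyncio.run(main())
print(nested_error)
print(result)  # done: invoice-42
```

Re-raising the captured exception in the caller is the design choice that matters: a background thread that fails should say so.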

Bug 4 — timm Renamed Xception Without Warning

On Day 8 my deepfake detector used XceptionNet from the timm library. I had trained the model on Colab, saved the weights, and moved everything to VS Code.

UserWarning: Mapping deprecated model name xception
to current legacy_xception.
RuntimeError: Error(s) in loading state_dict for XceptionDetector:
Missing key(s) in state_dict: "head.0.weight", "head.0.bias"...
Unexpected key(s) in state_dict: "classifier.0.weight"...

Two separate bugs, same crash.

First: timm renamed xception to legacy_xception. Use the new name to remove the warning and avoid future breakage:

# Old — throws deprecation warning
model = timm.create_model('xception', pretrained=False, num_classes=0)

# New — explicit, no warning
model = timm.create_model('legacy_xception', pretrained=False, num_classes=0)

Second: I had named the classification head self.classifier in Colab but self.head in VS Code. PyTorch saves weights by key name — classifier.0.weight and head.0.weight are completely different keys even if the architecture is identical.

The fix: Name your model layers once. Never rename them. The name is part of the contract between your training environment and your serving environment.

What this taught me: Model serialization is more fragile than it looks. The weight file is not just numbers — it is numbers plus the exact architecture key names. Document both.
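If the weights were already saved under the old layer name, the keys can be remapped before loading instead of retraining. A minimal sketch with a plain dict standing in for the checkpoint; the function name and sample values are hypothetical, and with PyTorch the remapped dict would be what gets passed to `model.load_state_dict`.

```python
def rename_prefix(state_dict: dict, old: str, new: str) -> dict:
    """Return a copy of state_dict with keys starting with `old` renamed to `new`."""
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old):
            key = new + key[len(old):]
        renamed[key] = value
    return renamed


# Stand-in checkpoint: key names mirror the error message, values are dummies.
saved = {
    "classifier.0.weight": [0.1],
    "classifier.0.bias": [0.0],
    "backbone.conv1.weight": [0.5],
}
fixed = rename_prefix(saved, "classifier.", "head.")
print(sorted(fixed))  # ['backbone.conv1.weight', 'head.0.bias', 'head.0.weight']
```

The original dict is left untouched, so the old checkpoint can still be inspected if the remap turns out to be wrong.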

Bug 5 — joblib Cannot Cross Python Versions

On Day 29 I saved a GradientBoostingClassifier with joblib on Google Colab (Python 3.10, numpy 1.24) and loaded it on VS Code (Python 3.10, numpy 1.26).

ValueError: <class 'numpy.random._mt19937.MT19937'>
is not a known BitGenerator module.

Same Python version. Different numpy. Dead model.

The GradientBoostingClassifier internally stores a numpy RandomState object. When numpy changes how it serializes random state between minor versions, joblib files become unreadable across those versions even when everything else matches.

Three solutions in order of preference:

import joblib
import xgboost as xgb

# Solution 1: save with pickle protocol 2 (older byte format; this does not
# help if numpy's own classes changed between the two environments)
joblib.dump(model, 'model.joblib', protocol=2)

# Solution 2: use XGBoost's native format instead of joblib
# (XGBoost models only, not scikit-learn's GradientBoostingClassifier)
model.save_model('model.json')
loaded = xgb.XGBClassifier()
loaded.load_model('model.json')

# Solution 3: retrain locally (fastest for synthetic data)
# Never transfer joblib files across environments
# Always retrain in the environment you serve from

What this taught me: joblib is not a portable format. It is a snapshot of a specific Python environment. If your training and serving environments differ — even slightly — retrain in the serving environment. Always.
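Since the file is a snapshot of an environment, one option is to save a description of that environment next to it and compare at load time. This is a sketch with hypothetical function names and manifest layout; the version strings in the demo are the ones from the story above.

```python
import json
import sys
from importlib import metadata


def environment_manifest(packages):
    """Record the interpreter and package versions a model was saved with."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    python = f"{sys.version_info.major}.{sys.version_info.minor}"
    return {"python": python, "packages": versions}


def manifest_diff(saved, current):
    """List human-readable differences between two manifests."""
    diffs = []
    if saved["python"] != current["python"]:
        diffs.append(f"python: {saved['python']} -> {current['python']}")
    for name, version in saved["packages"].items():
        now = current["packages"].get(name)
        if now != version:
            diffs.append(f"{name}: {version} -> {now}")
    return diffs


# Demo with hand-written manifests matching the Colab and local setups.
colab = {"python": "3.10", "packages": {"numpy": "1.24"}}
local = {"python": "3.10", "packages": {"numpy": "1.26"}}
differences = manifest_diff(colab, local)
print(differences)  # ['numpy: 1.24 -> 1.26']

# At save time, write the real manifest next to the model file.
manifest_json = json.dumps(environment_manifest(["numpy", "scikit-learn", "joblib"]))
```

A non-empty diff at load time is the signal to retrain in the serving environment rather than trust the file.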

The Pattern Behind All 5 Bugs

Look at what they have in common:

Every single one of them was invisible in a tutorial context.

You are unlikely to hit the numpy contiguous array bug in a tutorial notebook, because tutorials rarely chain numpy, PIL, and OpenCV the way a production pipeline does. You are unlikely to hit the joblib cross-version bug in a course, because courses rarely move models between environments. You will not hit the LangGraph Windows async bug if you only run python script.py from the command line.

These bugs only exist in the gap between "it works on my machine" and "it works in production."

That gap is where real AI engineering lives.

The Startup Hidden Inside Every ML Project

Here is something else courses never tell you.

Every production ML project you build is also a startup idea. You just have to look at it correctly.

Day 1 — Gig Worker Credit Scorer
60 million gig workers in the US are rejected by traditional credit systems not because they are risky borrowers but because their income does not fit a W-2 pattern. ROC-AUC 0.84. Sub-200ms API. This is a $300 billion lending gap. Startups like Petal and Chime raised hundreds of millions solving exactly this.

Day 6 — Supplier Risk Intelligence
Supply chain disruptions cost companies $228 million on average per incident. My model predicts supplier risk 3-6 months ahead using 31 signals — news sentiment, financial stress, geopolitical exposure. SAP charges enterprise customers $500K/year for similar capability. I built the core in 2 days.

Day 15 — LangGraph Research Agent
A research analyst costs $80-150K/year and produces one report per day. My 5-agent pipeline produces an 800-word verified research report on any topic in 90 seconds using entirely free APIs. The unit economics are violent.

The pattern: find a process that is currently done by expensive humans or legacy enterprise software. Build the AI version. Price it at 10-20% of the incumbent. That is the playbook.

3 Things I Would Tell Myself on Day 1

1. Pin your versions before you write the first line of code.

Create a requirements.txt on day one with exact versions of every dependency. The most painful bugs I hit were not architectural mistakes — they were torch==2.3.0 vs torch==2.4.0 differences. Version drift is silent and expensive.

# requirements.txt — always pin, never assume
torch==2.3.0
torchvision==0.18.0
timm==0.9.16
numpy==1.26.4
xgboost==2.1.1
langchain==1.3.0
langgraph==1.0.5

2. Build the API before you tune the model.

I lost days fine-tuning models before I knew if the API would work. The right order is: build the minimal API first, confirm the pipeline end-to-end, then improve the model. A working 0.75 AUC model in production beats a 0.85 AUC model still in a notebook.

3. Every bug is a blog post.

Every time something breaks and I fix it, I write it down. Those 5 bugs above? Each one is a Stack Overflow answer, a dev.to article, a tweet thread. The person who googles "OpenCV non-contiguous array error 2026" and finds my explanation follows me on GitHub. That compounds over time.

What I Am Building Next — June 2026

The 30-day series covered breadth. June is depth.

Eight advanced systems targeting real unsolved gaps in production AI:

→ Persistent Memory Architecture — LangGraph agents that remember across sessions using pgvector + FAISS (solving the biggest gap in enterprise agentic AI)

→ LLM Evaluation Framework — automated hallucination detection as a CI/CD pipeline step (because 87% of companies shipping AI have no systematic evaluation)

→ LoRA Fine-Tuning Pipeline — LLaMA 3.1 8B on private domain data with GGUF quantization for CPU deployment (the technique every regulated industry needs)

→ Knowledge Graph + LLM — GraphRAG outperforms vector RAG on multi-hop questions by 40% per Microsoft Research. I am building the production implementation.

→ Federated Learning System — ML across hospitals that cannot share patient data (GDPR compliance by design, not retrofit)

Each one solves a problem that companies are paying $500K+ in consulting fees to figure out.

Final Thought

The most important thing I learned in 60 days is not a framework or a model architecture.

It is that production AI engineering is a craft that only gets built through shipping.

You can read every paper, watch every tutorial, and follow every course. None of it prepares you for the moment when your model loads perfectly in training and silently returns wrong predictions in production because the preprocessing pipeline has a different random seed.

The only way to learn production is to build for production.

Start shipping.

All systems are open source:
🔗 github.com/Danish08654

Follow for daily updates on the June advanced projects:
🔗 LinkedIn — Danish Zulfiqar

Have you hit any of these bugs? Drop them in the comments — I want to hear what production broke for you.
