DEV Community

Herbert Yeboah

Posted on • Edited on

I was automating my job search when my pipeline kept crashing. Fixing it became a whole project.

You might be wondering why someone would spend weeks building a reliability layer for a pipeline instead of just doing the job search manually in the first place. Fair question. I asked myself the same thing more than once. You will get it soon. Let me paint you the picture first.

Finding a job in Ghana after university is a sport. Not the fun kind. You can have the degree, the certifications, show up every day, and still be invisible because you don't know the person doing the hiring. I watched people send out application after application and hear nothing back. Not a rejection, just silence. That silence does something to you.
I finished national service at a bank and stayed on an informal contract while I figured out the next move. Couldn't start a business without capital I didn't have. So I thought, fine. I'll go around the problem instead of through it.
My plan was to build something I called Pathfinder. Just a guy with a laptop, automating the entire job search process with AI models, all for free, which was very necessary because I had no other option. The engineering team was me. The product manager was me. The CEO was also me, unfortunately. Research the role, match it to my background, write the cover letter and draft a CV tailored to that role. I review everything and submit myself, but at least the grunt work gets handled.
The pipeline kept crashing.
Timeout. Rate limit. Bad response from the model at step 4 after you've already spent your free tier quota on steps 1, 2, and 3. Everything gone. Start from the beginning. Again.
I tried wrapping things in try/except. It just failed quietly instead of loudly. Still restarting from scratch every time.
For a few weeks I genuinely considered just doing it manually. That felt like the sane path honestly.
I am not a seasoned developer, and I want to be clear about that. When I went looking for solutions, I had no idea LangChain or LangGraph even existed. Found them during research, spent real time going through both. They are solid, genuinely well built. Just not built for what I needed: too many layers, too much to learn, just to solve one problem. I needed something that wouldn't lose my work when it crashed. That's it.
Then I started finding posts from other developers hitting the same wall. Different projects, different goals, same frustration: pipelines crashing mid-run and losing everything, with no clean way to resume. Enough people had this problem that something shifted for me. Fixing it properly might actually matter to someone other than just me. It was worth the detour.
So I stopped working on Pathfinder completely and spent the next several weeks figuring out how to build a pipeline that simply couldn't lose its progress.
The solution was embarrassingly simple: every step saves its output before the next one starts. Crash happens, you re-run it, finished steps skip themselves, and you continue from exactly where it stopped. I added a few things I kept needing along the way: routing simple tasks to free models instead of burning expensive ones on everything, retrying when the model returns broken output, and checking at startup whether your models still even exist, because free-tier models get retired without warning and nobody tells you.
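To make that concrete, here is a minimal sketch of the checkpoint-and-resume idea in plain Python. This is not DagPipe's actual API, and the file layout and function names are illustrative. The point is just the pattern: every step persists its result before the pipeline moves on, and a step that already has a saved result returns it instead of running again.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # illustrative location, not DagPipe's


def run_step(name, fn, *args):
    """Run a pipeline step, skipping it if a saved output already exists."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    checkpoint = CHECKPOINT_DIR / f"{name}.json"
    if checkpoint.exists():
        # This step finished on a previous run: reuse its saved output.
        return json.loads(checkpoint.read_text())
    result = fn(*args)
    # Save before the next step starts, so a crash downstream loses nothing.
    checkpoint.write_text(json.dumps(result))
    return result


# Usage: each step consumes the previous step's (possibly cached) output.
research = run_step("research", lambda: {"role": "data analyst"})
letter = run_step("cover_letter", lambda r: {"text": f"Re: {r['role']}"}, research)
```

Run the script twice and the second run never calls the step functions at all: every step finds its checkpoint on disk and returns immediately, which is exactly the "continue from where it stopped" behavior.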
February 27 was the first commit.
Pathfinder is still in progress.
For a moment there, shipping something from scratch as a non-professional developer that other people might find useful had me feeling like the new Dario Amodei. I'll be honest about that. Then reality reminded me that it does make an impact, just not quite at that scale yet. But it's real, and that's enough to keep going.
If any of that sounds familiar, the library is called DagPipe. It might be worth a look; it is open source. I have a demo of it on YouTube, as well as a breakdown of it using NotebookLM. Feel free to skip to 7:05, where it gets into the actual technical detail.

Top comments (2)

Apex Stack

The checkpoint-and-resume pattern you landed on is something I wish I'd figured out earlier. I run a content generation pipeline that processes 8,000+ stock tickers through a local LLM, and for the first few weeks I was losing entire batch runs when the model threw a bad response at ticker 6,500. Same frustration — hours of compute, gone.

Your model routing insight is the part that caught my attention most. Routing simple tasks to free models instead of burning expensive ones on everything is exactly the architecture I ended up with too — local Llama 3 handles bulk content generation where quality variance is acceptable, and I only route to premium models for analysis that needs to be precise (financial metrics where a wrong decimal means a P/E ratio of 41% instead of 0.42%). The economic pressure of free-tier limits forces you into better architecture decisions than you'd make with unlimited budget.

The fact that you built DagPipe while solving your own problem and then recognized other developers hitting the same wall is the best kind of side project origin story. Infrastructure tools born from real frustration tend to have much better developer ergonomics than ones designed in the abstract. Curious whether you've seen demand from the AI agent community specifically — checkpoint/resume is becoming critical as multi-step agent workflows get longer and more expensive to restart.

Herbert Yeboah

The 8,000 ticker thing actually got me. Not because the tool failed you; the model just had a bad moment at ticker 6,500 and took the whole run with it. I've been there more than I'd like.
Also, your routing split is honestly cleaner than what I ended up with: local Llama for bulk, premium only where a wrong decimal turns a P/E ratio into nonsense. I'm using pure Python heuristics in the cognitive router, keyword matching and token thresholds, no LLM deciding what goes where. But the underlying call is the same one you made: don't spend the good budget on work that doesn't need it.
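For anyone curious what that heuristic routing looks like, here is a rough sketch. The keywords, the threshold value, and the tier names are illustrative assumptions, not DagPipe's actual router, but the shape is the same: cheap string checks decide the tier, with no LLM in the loop.

```python
# Words that suggest a task needs real reasoning (illustrative list).
EXPENSIVE_KEYWORDS = {"analyze", "reason", "compare", "evaluate"}
TOKEN_THRESHOLD = 500  # assumed cutoff; long prompts go to the premium tier


def estimate_tokens(prompt: str) -> int:
    # Crude approximation: roughly 4 characters per token.
    return len(prompt) // 4


def route(prompt: str) -> str:
    """Return 'premium' or 'free' using keyword and length heuristics only."""
    words = set(prompt.lower().split())
    if words & EXPENSIVE_KEYWORDS or estimate_tokens(prompt) > TOKEN_THRESHOLD:
        return "premium"
    return "free"
```

A short prompt like "summarize this job listing" routes to the free tier, while "analyze the candidate fit" or anything past the token threshold goes premium. The whole decision costs microseconds and zero quota.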
The free-tier pressure point you made is something I think about a lot. When your quota is finite and the models are flaky, you make smarter architecture decisions than you would with unlimited budget. Constraint just forces clarity. I wasn't trying to build a recovery layer; I just got tired of losing work.
On agent demand specifically, yeah, it's moving faster than I expected. A 3-node pipeline crashing is annoying. A 12-step workflow dying at step 10 after 40 minutes of compute is the kind of thing that makes you reconsider everything. That's the use case I keep hearing about now. Checkpoint recovery stops feeling optional the longer your pipelines get.
Glad you caught the routing layer. Most people stop at the checkpoint mechanic.