I set out to build a “simple” script to pull metadata for Hebrew children’s videos from YouTube. The plan was to use yt-dlp, pipe the JSON output to Python for filtering, and save it all as a clean TSV. Turns out, “simple” was optimistic. The whole thing turned into a multi-day debugging marathon, especially trying to get everything working smoothly inside Git Bash on Windows.
Here’s a rundown of the biggest hurdles and what I learned.
Environment setup is half the battle. My minimal MINGW64 environment was missing everything:
wget,sudo,jq, andstdbuf. I spent hours finding workarounds, like usingcurlinstead ofwget, creating a Bash wrapper for thejq.exebinary, and usingPYTHONUNBUFFERED=1to force real-time output whenstdbufwasn’t available.yt-dlpcan be a memory hog. When pulling from a massive channel playlist,yt-dlptried to load the entire list of videos into memory before outputting anything, causing it to hang. The magic flags were--lazy-playlistand--flat-playlist, which process the playlist item by item, turning a firehose into a manageable stream.Pipes are fragile. My script kept dying with “Broken pipe” or “Invalid argument” errors. The culprit? My Python script was consuming
stdinfor itself, leaving the piped data fromyt-dlpwith nowhere to go. The fix was to pass file paths as arguments to the Python script (python -c '...' "$OUT_FILE") instead of relying on the standard input stream.Unicode filtering is your friend. My initial searches for Hebrew kids’ content returned a lot of irrelevant videos. The most effective fix was a simple Python function to check if the video title or uploader name contained any characters within the Hebrew Unicode range (
\u0590–\u05FF). It was a surprisingly precise way to filter out the noise.
This was one of those side projects that spiraled, but it was a fantastic learning experience. Untangling the mess of shell environments, process buffering, and API quirks was a powerful reminder that even with great tools, the real work is getting them to talk to each other without tripping over their own feet.
Next: Now that the data extraction is reliable, I might try converting the videos to audio and using an LLM to help me review the content.
Tags: bash, python, yt-dlp, data-engineering, cli, debugging
Top comments (0)