Built a scraper container last week. Docker said build succeeded. Pushed it to the registry. Tried pulling it on the server.
4.2GB download started.
My internet peaked at 2MB/s. That's 35 minutes of waiting every single deploy.
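The 35 minutes is just size divided by bandwidth. A rough sketch of the math, assuming decimal units (1GB = 1000MB) and a steady 2MB/s the whole way; `download_minutes` is a made-up helper name:

```python
def download_minutes(size_mb: float, speed_mb_per_s: float) -> float:
    """Return the minutes needed to pull size_mb at a steady speed_mb_per_s."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    return size_mb / speed_mb_per_s / 60


if __name__ == "__main__":
    # 4.2GB image at 2MB/s, treating 1GB as 1000MB
    print(f"{download_minutes(4200, 2):.0f} minutes")  # prints "35 minutes"
```

Real pulls are compressed and layered, so treat this as a ballpark, not a prediction.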
The dumb mistake
Threw everything in the Dockerfile:
FROM python:3.11
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["python", "scraper.py"]
Simple, right? Worked locally. Build succeeded. Containerized my scraper. Except python:3.11 is the full Debian-based image. Roughly 1GB of base. Includes compilers, build tools, stuff I never touched.
Then I copied my entire project folder. That included old test data (400MB), scraped results from local runs (800MB), node_modules from when I tested a JS library once (600MB), venv folder (200MB).
COPY . grabbed everything. Docker doesn't read .gitignore. Without a .dockerignore, the whole folder goes into the image.
Then pip installed requests, beautifulsoup4, and selenium. Selenium needs a browser to drive, and Chromium added another 500MB.
Fun times.
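To see what COPY . would drag in before building, something like this can total up each top-level entry in the project folder. It's a rough sketch: `dir_size` and `top_level_sizes` are made-up helpers, symlinks are skipped, and it ignores any .dockerignore entirely, so it shows the worst case.

```python
import os
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def top_level_sizes(context: str = ".") -> list[tuple[str, int]]:
    """Return (name, bytes) for each top-level entry, largest first."""
    sizes = []
    for entry in Path(context).iterdir():
        if entry.is_symlink():
            continue
        size = dir_size(entry) if entry.is_dir() else entry.stat().st_size
        sizes.append((entry.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Run from the folder that holds the Dockerfile
    for name, size in top_level_sizes("."):
        print(f"{size / 1_000_000:10.1f} MB  {name}")
```

Run it from the project root. Anything big at the top of the list that the container doesn't need is a candidate for .dockerignore.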
Fixing it
Switched to python:3.11-slim first. Same Python, no build tools. Dropped from 1GB to 150MB base.
Added a .dockerignore in the project root so the junk never reaches the build:
__pycache__
*.pyc
.git
.venv
venv/
node_modules/
test_data/
results/
*.log
Realized I didn't need selenium. Switched to httpx for the API calls. No chromium download.
Final Dockerfile looked way simpler:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper.py .
CMD ["python", "scraper.py"]
Built it. 287MB total. Deploy went from 35 minutes to 2 minutes.
The --no-cache-dir flag helped too. pip caches downloaded packages by default, which bloats the image.
Notes
Docker builds whatever you give it. Doesn't optimize. Doesn't warn about 4GB images. Just succeeds.
COPY . is dangerous without a .dockerignore. Grabs test data, logs, node_modules, everything.
Base images matter. python:3.11 vs python:3.11-slim is an 850MB difference for the same language version.
Check your dependencies. Selenium is no use without a browser, and Chromium costs about half a gig. Sometimes requests or httpx works fine without it.
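Since Docker never warns about image size, a check can go in the deploy script instead. A minimal sketch, assuming the docker CLI is on PATH. The image name `scraper:latest` and the 500MB limit are placeholders, and `image_size_bytes` and `check_image_size` are made-up helpers. It relies on `docker image inspect --format '{{.Size}}'`, which prints the image size in bytes.

```python
import subprocess
import sys


def image_size_bytes(image: str) -> int:
    """Ask the docker CLI for the image size in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def check_image_size(size_bytes: int, limit_mb: int = 500) -> bool:
    """Return True if the image is within limit_mb (1MB = 1,000,000 bytes)."""
    return size_bytes <= limit_mb * 1_000_000


if __name__ == "__main__":
    # "scraper:latest" is a placeholder; pass the real image name as an argument
    image = sys.argv[1] if len(sys.argv) > 1 else "scraper:latest"
    size = image_size_bytes(image)
    print(f"{image}: {size / 1_000_000:.0f} MB")
    if not check_image_size(size):
        sys.exit("image is over the size limit, check the Dockerfile and .dockerignore")
```

The number reported is the uncompressed size on disk, so the actual pull is usually smaller. Still enough to catch a 4GB surprise before pushing.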
Honestly, still annoyed it took me 3 deploys to figure this out.