DEV Community

swati goyal

Document Localization Studio

GitHub Copilot CLI Challenge Submission

This is a submission for the GitHub Copilot CLI Challenge

What I Built

I built Document Localization Studio, a terminal-first app with a Streamlit UI that localizes documents beyond basic translation.

Instead of treating localization as “just translate text,” this project tackles the real-world complexity teams hit in enterprise docs:

  • 🌐 Language + terminology adaptation (custom glossary + reusable term memory)
  • 🗓️ Date/time + timezone conversion (e.g., America/New_York → Europe/Berlin)
  • 💱 Currency + FX conversion (USD → EUR/JPY/BRL/… with locale defaults you can edit)
  • 📏 Unit conversion (mi→km, lb→kg, °F→°C)
  • 📬 Address/phone/postal tweaks (locale labels + phone formatting)
  • 🧾 Tax label adaptation (Sales Tax → VAT/GST-style labels)
  • 🔒 Legal clause lock/protection ([[LOCK]]...[[/LOCK]] + auto-protect legal-ish sentences)
  • Structure-aware QA (placeholders preserved, length-change warnings, cross-ref/TOC flags, workflow gating)
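
To make the legal lock idea concrete, here is a minimal sketch of one way to protect [[LOCK]]...[[/LOCK]] spans: stash the locked text behind sentinels, run the transforms, then restore it verbatim. The function name apply_with_locks and the sentinel scheme are assumptions for illustration, not the project's actual code, and the sketch does not cover the auto-protect heuristic for legal-ish sentences.

```python
import re

# Matches the lock markers described in the feature list.
LOCK_RE = re.compile(r"\[\[LOCK\]\](.*?)\[\[/LOCK\]\]", re.DOTALL)

# Private-use characters as sentinels (assumed not to occur in real documents).
OPEN, CLOSE = "\ue000", "\ue001"
SENTINEL_RE = re.compile(f"{OPEN}(\\d+){CLOSE}")


def apply_with_locks(text, transform):
    """Apply `transform` only to text outside [[LOCK]] spans (hypothetical helper)."""
    locked = []

    def stash(match):
        # Remember the protected text and leave a numbered sentinel in its place.
        locked.append(match.group(1))
        return f"{OPEN}{len(locked) - 1}{CLOSE}"

    masked = LOCK_RE.sub(stash, text)
    transformed = transform(masked)
    # Put the protected text back exactly as written; the markers are dropped.
    return SENTINEL_RE.sub(lambda m: locked[int(m.group(1))], transformed)


sample = "Pay within 30 days. [[LOCK]]Governing law: State of New York.[[/LOCK]]"
print(apply_with_locks(sample, str.upper))
# PAY WITHIN 30 DAYS. Governing law: State of New York.
```

Here str.upper stands in for any transform. The locked clause comes back untouched while the rest of the text changes.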

Supported file types 🧩

  • .txt
  • .docx
  • .pdf
    • layout-preserving mode for editable PDFs (when available)
  • screenshots/images: .png, .jpg, .jpeg via OCR

Supported locales 🗺️

de_de, es_es, fr_fr, it_it, ja_jp, ko_kr, pt_br, zh_cn, zh_tw

Run locally 🧪

cd "/Users/swatigoyal/Documents/New project/document_localizer_challenge"
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py

CLI example 🧰
python -m localizer.cli input.pdf output.pdf \
  --locale de_de \
  --source-timezone America/New_York \
  --tone legal
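
The --source-timezone flag implies that timestamps are read in the source zone and re-expressed in the target locale's zone. Below is a minimal sketch of that conversion using the standard-library zoneinfo module. The helper name convert_wall_time and the German date format are assumptions for illustration; the project's own implementation may differ.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def convert_wall_time(naive, source_tz, target_tz):
    """Interpret a naive datetime in source_tz and convert it to target_tz (hypothetical helper)."""
    aware = naive.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(ZoneInfo(target_tz))


# 9:00 in New York on 15 March 2024, expressed in Berlin time.
converted = convert_wall_time(
    datetime(2024, 3, 15, 9, 0), "America/New_York", "Europe/Berlin"
)
print(converted.strftime("%d.%m.%Y %H:%M %Z"))
# 15.03.2024 14:00 CET
```

On that date the gap is five hours, not the usual six, because the US had already switched to daylight time and Germany had not. This is why a zone database is safer than a fixed offset.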

Demo

Walkthrough idea (video/screenshots) 🎬

  • Upload a real invoice/contract PDF (or a DOCX).
  • Pick a target locale (ex: de_de) and watch the default FX rate auto-load (editable).
  • Toggle components (units, tax labels, legal lock, term memory).
  • Run localization.
  • Show the outputs:
    • 📊 Before/After scorecards
    • 🔎 Side-by-side visual diff
    • 🌡️ Layout risk heatmap
    • 🧾 QA report (JSON)
  • Download the localized file + QA report.
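
As a rough illustration of what the JSON QA report could contain, the sketch below checks two of the signals mentioned earlier: placeholder preservation and length-change warnings. The placeholder patterns, the field names, and the 30% threshold are all assumptions, not the project's actual report schema.

```python
import json
import re
from collections import Counter

# Assumed placeholder styles: {{ name }}, {name}, and %(name)s.
PLACEHOLDER_RE = re.compile(r"\{\{\s*\w+\s*\}\}|\{\w+\}|%\(\w+\)s")


def qa_report(source, localized, max_growth=0.3):
    """Build a small QA dict comparing source and localized text (hypothetical schema)."""
    src = Counter(PLACEHOLDER_RE.findall(source))
    out = Counter(PLACEHOLDER_RE.findall(localized))
    # Relative length change; guard against an empty source string.
    growth = (len(localized) - len(source)) / max(len(source), 1)
    return {
        "placeholders_preserved": src == out,
        "missing_placeholders": sorted((src - out).elements()),
        "length_change_ratio": round(growth, 3),
        "length_warning": abs(growth) > max_growth,
    }


report = qa_report(
    "Dear {name}, your total is {amount}.",
    "Sehr geehrte/r {name}, Ihr Gesamtbetrag.",
)
print(json.dumps(report, indent=2, ensure_ascii=False))
```

In this example the report flags {amount} as missing, which is the kind of result a workflow gate could use to block the download step.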

Stack / Libraries 🧱🐍

Built with a “free stack”:

  • streamlit (UI dashboard)
  • python-docx (DOCX read/write)
  • pypdf (PDF text extraction)
  • pymupdf (PyMuPDF — layout-preserving PDF localization mode)
  • reportlab (PDF re-render fallback when layout mode isn’t available)
  • pillow + pytesseract (OCR pipeline for screenshots/images)

OCR note: screenshot localization requires a local Tesseract binary in addition to pytesseract (for example, on macOS: brew install tesseract).
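
Because the Tesseract binary is a separate install, a fallback path can check for it before attempting OCR. This sketch uses only the standard library. The helper name tesseract_available and the message text are assumptions for illustration.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the given executable is found on PATH (hypothetical helper)."""
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found: image/screenshot OCR can run.")
else:
    print("Tesseract not found: skip OCR and ask the user to install it.")
```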

My Experience with GitHub Copilot CLI 🤝⚡

I used GitHub Copilot CLI as a coding partner directly in the terminal to:

  • 🏗️ scaffold modules quickly (pipeline, PDF/DOCX/image IO, CLI wiring)
  • 🧠 iterate on regex-heavy transformations (dates, currency, units, placeholders)
  • 🧩 design locale profiles/defaults and keep the logic consistent
  • 🎛️ wire Streamlit controls to the backend config without breaking flow
  • 🧪 add QA heuristics + sensible fallback paths for PDFs/OCR
  • 🧹 speed up refactors while keeping the project clean and extensible

The biggest win: fast iteration on non-trivial logic (PDF handling + transformation rules + feature toggles) without leaving the terminal.
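
For a sense of what the regex-heavy transformations look like, here is a minimal sketch of the unit conversions listed earlier (mi→km, lb→kg, °F→°C). The pattern, the CONVERSIONS table, and the rounding choice are assumptions for illustration, not the project's actual rules.

```python
import re

# A number, optional whitespace, then a supported unit not followed by a letter
# (so "30 min" is not mistaken for miles).
UNIT_RE = re.compile(r"(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>mi|lb|°F)(?![A-Za-z])")

# Source unit -> (target unit, conversion function).
CONVERSIONS = {
    "mi": ("km", lambda v: v * 1.609344),
    "lb": ("kg", lambda v: v * 0.45359237),
    "°F": ("°C", lambda v: (v - 32) * 5 / 9),
}


def convert_units(text, decimals=1):
    """Rewrite imperial quantities in `text` as metric ones (hypothetical helper)."""

    def repl(match):
        target, fn = CONVERSIONS[match.group("unit")]
        value = fn(float(match.group("value")))
        return f"{value:.{decimals}f} {target}"

    return UNIT_RE.sub(repl, text)


print(convert_units("The site is 10 mi away at 68 °F; the crate weighs 50 lb."))
# The site is 16.1 km away at 20.0 °C; the crate weighs 22.7 kg.
```

A production rule set would also need locale-specific decimal separators and care around ranges such as "5-10 mi", where the hyphen would be read as a minus sign here.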

What’s Next / Improvements 🚀🤖

This is a strong prototype, and there is plenty to level up with AI integration later:

  • 🧠 LLM-backed translation (while keeping deterministic transforms + locks)
  • 📚 smarter terminology alignment (context-aware term choice + consistency scoring)
  • 🧾 stronger compliance checks (policy packs per industry/locale)
  • 🧩 plug-in architecture for new transforms + QA rules
  • 🖼️ better OCR layout reconstruction (tables, columns, headers/footers)

If you’ve worked on localization, I’d love your feedback: what transformations or QA checks would you trust most in production?

Top comments (2)

Savita Devi

Impressive work 👏

Document Localization Studio goes far beyond basic translation by handling terminology, legal locks, FX/unit conversion, and structure-aware QA — that’s real enterprise-level thinking. Clean execution, strong feature depth, and great use of GitHub Copilot CLI for rapid iteration. 🚀

Smiley Shiney

Wow, that’s a great set of functionalities you’ve provided — really well thought out and comprehensive! 👏