## Why I Built This
I write documentation in Chinese, but my readers are everywhere. Manually translating each .md file was killing my productivity. Worse, keeping translations in sync with updates was a nightmare.
I needed a tool that could batch-translate entire documentation folders while preserving every # header, **bold**, `code`, and [link]() perfectly. So I built one.
The result is md-translator — a lightweight Python script that uses any OpenAI-compatible API to translate Markdown files in bulk. It's under 200 lines, supports concurrent processing, and costs pennies per project.
Let me show you how it works.
## What You'll Build
A CLI tool that:
- Batch-translates all `.md` files in a directory
- Supports any OpenAI-compatible API (DeepSeek, GPT, Qwen, etc.)
- Preserves every bit of Markdown formatting
- Caches translations for resume support
- Runs concurrent translations with configurable workers
## Setup
Requires Python 3.8+ and one dependency:
```bash
pip install requests
git clone https://github.com/xuks124/md-translator.git
cd md-translator
```
Set your API key:
```bash
export MD_TRANSLATOR_KEY="sk-your-api-key-here"
```
## Core Translation Logic
The core is an API call to any OpenAI-compatible chat endpoint. Here's the key function:
```python
import requests

def translate_chunk(chunk, source_lang, target_lang, api_key, api_url, model):
    prompt = f"""You are a professional translator. Translate from {source_lang} to {target_lang}.
Rules:
1. Keep ALL Markdown syntax unchanged (```, **, [], ![], #, -, etc.)
2. Only translate text content
3. Keep code blocks and URLs unchanged
4. Return ONLY the translated content
Content:
---
{chunk}
---"""

    resp = requests.post(
        api_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3,
            "max_tokens": 4096
        },
        timeout=60
    )
    resp.raise_for_status()  # Surface HTTP errors instead of a confusing KeyError
    return resp.json()["choices"][0]["message"]["content"]
```
The prompt is the secret sauce. By telling the model to keep syntax unchanged, we get clean Markdown out every time.
## Batch Processing with Resume Support
Translation can take time, so the tool caches results by file hash. If you rerun it, already-translated files are skipped instantly:
```python
import hashlib
import json
from pathlib import Path

def process_file(md_path, source_lang, target_lang, api_key, api_url, model, output_dir, force):
    content = md_path.read_text(encoding='utf-8')
    content_hash = hashlib.md5(content.encode()).hexdigest()[:8]

    # Load cache: resume from breakpoint
    cache_file = md_path.parent / '.md_translator_cache' / f"{md_path.stem}.json"
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    if not force and content_hash in cache:
        translated = cache[content_hash]  # Cache hit!
    else:
        # Split long files into chunks
        if len(content) > 4000:
            chunks = [content[i:i+3000] for i in range(0, len(content), 3000)]
            translated = '\n'.join(translate_chunk(c, ...) for c in chunks)
        else:
            translated = translate_chunk(content, ...)
        cache[content_hash] = translated
        cache_file.parent.mkdir(exist_ok=True)  # Create the cache dir on first run
        cache_file.write_text(json.dumps(cache))

    # Output: suffix-based naming (e.g., doc.en.md)
    output_path = md_path.parent / md_path.name.replace('.md', f'.{target_lang}.md')
    output_path.write_text(translated, encoding='utf-8')
```
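One caveat with the fixed-offset slicing above: cutting every 3,000 characters can split a code fence or list in half, and the model then sees broken syntax. A paragraph-aware splitter avoids this; here's a minimal sketch (my own helper, not part of the shipped script) that greedily packs whole paragraphs into each chunk:

```python
def split_paragraph_aware(content, max_chars=3000):
    """Pack whole paragraphs into chunks of at most max_chars.

    Splitting on blank lines keeps fences and lists more likely intact
    than slicing at fixed character offsets.
    """
    paragraphs = content.split('\n\n')
    chunks, current = [], ''
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > max_chars and current:
            chunks.append(current)  # Flush the full chunk, start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks back with `'\n\n'` reproduces the original document, so nothing is lost in the round trip.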
For large documentation sites, the tool uses ThreadPoolExecutor to translate multiple files concurrently:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=args.workers) as executor:
    futures = {executor.submit(process_file, f, ...): f for f in md_files}
    for future in as_completed(futures):
        output = future.result()
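The snippet assumes `md_files` was gathered earlier. One way to do that (a sketch; the filter name and the decision to skip the cache directory are my assumptions, not necessarily how the repo does it):

```python
from pathlib import Path

def collect_md_files(input_dir):
    """Recursively find .md files, skipping the tool's own cache directory."""
    return sorted(
        p for p in Path(input_dir).rglob('*.md')
        if '.md_translator_cache' not in p.parts
    )
```

Skipping `.md_translator_cache` matters because `pathlib`'s `rglob` does descend into dot-directories, unlike shell globbing.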
## Running It
```bash
# Translate all files in ./docs from Chinese to English
python translate.py --input ./docs --source zh --target en

# Use GPT-4o instead of the default DeepSeek
python translate.py -i ./docs -s zh -t en -m gpt-4o -u https://api.openai.com/v1

# Force re-translate everything
python translate.py -i ./docs -s zh -t en -f

# 5 concurrent workers for big projects
python translate.py -i ./docs -s zh -t en -w 5 -o ./translations
```
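The flags above map onto a short `argparse` setup. A minimal sketch, with option names inferred from the examples (the defaults here are illustrative, not the script's actual values):

```python
import argparse

def build_parser():
    # Flag names mirror the CLI examples; defaults are illustrative placeholders.
    p = argparse.ArgumentParser(description='Batch-translate Markdown files')
    p.add_argument('-i', '--input', required=True, help='Input directory')
    p.add_argument('-s', '--source', required=True, help='Source language code')
    p.add_argument('-t', '--target', required=True, help='Target language code')
    p.add_argument('-m', '--model', default='deepseek-chat', help='Model name')
    p.add_argument('-u', '--api-url', help='OpenAI-compatible chat endpoint')
    p.add_argument('-f', '--force', action='store_true', help='Ignore the cache')
    p.add_argument('-w', '--workers', type=int, default=3, help='Concurrent workers')
    p.add_argument('-o', '--output', help='Output directory')
    return p
```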
## Why OpenAI-Compatible APIs?
Lock-in is annoying. This tool works with any provider that speaks the OpenAI chat format:
- DeepSeek (free tier available)
- OpenAI (GPT-4o / GPT-4o-mini)
- One-API (self-hosted unified gateway)
- Qwen / Moonshot / Groq (all OpenAI-compatible)
Just swap `--api-url` and `--model`; no code changes needed.
## Real-World Usage
I use this to keep my programming handbook in 3 languages. A full translation of 400+ files runs in about 15 minutes and costs less than $2 with DeepSeek.
Format preservation is the killer feature — tables, code blocks with syntax highlighting, nested lists, embedded images — everything stays intact.
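If you want to verify preservation yourself, a cheap spot-check (a sketch of my own, not something the script ships) is to count structural markers in the source and the translation and compare:

```python
import re

def structure_fingerprint(markdown):
    """Count Markdown structural markers; a faithful translation preserves these.

    Note: the 'links' count also includes image links, since every image
    contains a [..](..) span.
    """
    return {
        'fences': len(re.findall(r'^`{3}', markdown, re.MULTILINE)),
        'headers': len(re.findall(r'^#{1,6} ', markdown, re.MULTILINE)),
        'links': len(re.findall(r'\[[^\]]*\]\([^)]*\)', markdown)),
        'images': len(re.findall(r'!\[[^\]]*\]\([^)]*\)', markdown)),
    }
```

If the fingerprints of a source file and its translation differ, re-run that file with `-f` to force a fresh translation.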
## Try It Yourself
The full source is on GitHub:
https://github.com/xuks124/md-translator
It's MIT licensed, so fork it, tweak it, use it for your docs. If you build something cool with it, drop a star or open an issue.
Happy translating!