DEV Community

xuks124

Build a Markdown Translation Tool with AI APIs in 50 Lines of Python

Why I Built This

I write documentation in Chinese, but my readers are everywhere. Manually translating each .md file was killing my productivity. Worse, keeping translations in sync with updates was a nightmare.

I needed a tool that could batch-translate entire documentation folders while preserving every # header, **bold**, `code`, and [link]() perfectly. So I built one.

The result is md-translator — a lightweight Python script that uses any OpenAI-compatible API to translate Markdown files in bulk. It's under 200 lines, supports concurrent processing, and costs pennies per project.

Let me show you how it works.


What You'll Build

A CLI tool that:

  • Batch translates all .md files in a directory
  • Supports any OpenAI-compatible API (DeepSeek, GPT, Qwen, etc.)
  • Preserves every bit of Markdown formatting
  • Caches translations for resume support
  • Runs concurrent translations with configurable workers
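The command-line surface is small. Here's a plausible `argparse` skeleton for it — flag names are taken from the usage examples later in the post, but the defaults are my assumptions, not the script's actual values:

```python
import argparse

def build_parser():
    # Flag names mirror the usage examples in this post; defaults are assumptions.
    p = argparse.ArgumentParser(description="Batch-translate Markdown files")
    p.add_argument("-i", "--input", required=True, help="Directory of .md files")
    p.add_argument("-s", "--source", default="zh", help="Source language code")
    p.add_argument("-t", "--target", default="en", help="Target language code")
    p.add_argument("-m", "--model", default="deepseek-chat", help="Model name")
    p.add_argument("-u", "--api-url", default="https://api.deepseek.com/v1",
                   help="OpenAI-compatible base URL")
    p.add_argument("-f", "--force", action="store_true", help="Ignore cache, re-translate")
    p.add_argument("-w", "--workers", type=int, default=3, help="Concurrent workers")
    p.add_argument("-o", "--output", default=None, help="Output directory")
    return p
```

Everything else in the tool hangs off this namespace (`args.workers` shows up again in the concurrency section).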

Setup

Requires Python 3.8+ and one dependency:

pip install requests
git clone https://github.com/xuks124/md-translator.git
cd md-translator

Set your API key:

export MD_TRANSLATOR_KEY="sk-your-api-key-here"
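Inside the script, read the key from the environment so it never lands in source control. A minimal sketch (the variable name matches the `export` above; the error message is mine):

```python
import os
import sys

def load_api_key():
    # MD_TRANSLATOR_KEY matches the export shown above; fail fast if it's missing.
    key = os.environ.get("MD_TRANSLATOR_KEY")
    if not key:
        sys.exit("Error: set MD_TRANSLATOR_KEY before running the translator")
    return key
```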

Core Translation Logic

The core is an API call to any OpenAI-compatible chat endpoint. Here's the key function:

def translate_chunk(chunk, source_lang, target_lang, api_key, api_url, model):
    prompt = f"""You are a professional translator. Translate from {source_lang} to {target_lang}.
Rules:
1. Keep ALL Markdown syntax unchanged (code fences, **, [], ![], #, -, etc.)
2. Only translate text content
3. Keep code blocks and URLs unchanged
4. Return ONLY the translated content

Content:
---
{chunk}
---"""

    resp = requests.post(
        api_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3,
            "max_tokens": 4096
        },
        timeout=60
    )
    resp.raise_for_status()  # Surface HTTP errors instead of a cryptic KeyError below
    return resp.json()["choices"][0]["message"]["content"]


The prompt is the secret sauce. By telling the model to keep syntax unchanged, we get clean Markdown out every time.
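One caveat: the fixed-size slicing shown below can cut a sentence or a code fence in half, and the model can't restore what it never saw. A refinement is to split on paragraph boundaries and pack paragraphs up to a size limit. This is a sketch of that idea, not the script's actual splitter:

```python
def split_markdown(content, max_chars=3000):
    """Split on blank lines so chunks end at paragraph boundaries."""
    chunks, current = [], ""
    for para in content.split("\n\n"):
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks with `"\n\n"` reproduces the original document exactly, so nothing is lost at the seams.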


Batch Processing with Resume Support

Translation can take time, so the tool caches results by file hash. If you rerun it, already-translated files are skipped instantly:


def process_file(md_path, source_lang, target_lang, api_key, api_url, model, output_dir, force):
    with open(md_path, encoding='utf-8') as f:
        content = f.read()
    content_hash = hashlib.md5(content.encode()).hexdigest()[:8]

    # Load cache — resume support
    cache_dir = md_path.parent / '.md_translator_cache'
    cache_file = cache_dir / f"{md_path.stem}.json"
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    if not force and content_hash in cache:
        translated = cache[content_hash]  # Cache hit!
    else:
        # Split long files into chunks
        if len(content) > 4000:
            chunks = [content[i:i+3000] for i in range(0, len(content), 3000)]
            translated = '\n'.join(translate_chunk(c, ...) for c in chunks)
        else:
            translated = translate_chunk(content, ...)

        cache_dir.mkdir(exist_ok=True)  # Create the cache dir before first write
        cache[content_hash] = translated
        cache_file.write_text(json.dumps(cache))

    # Output: suffix-based naming (e.g., doc.en.md), into output_dir if given
    out_dir = Path(output_dir) if output_dir else md_path.parent
    output_path = out_dir / f"{md_path.stem}.{target_lang}.md"
    output_path.write_text(translated, encoding='utf-8')


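The cache key is just the first 8 hex characters of the file content's MD5, so any edit to the source file produces a new key and invalidates the stale translation:

```python
import hashlib

def short_hash(text):
    # Same scheme as process_file: MD5 of the UTF-8 bytes, truncated to 8 hex chars
    return hashlib.md5(text.encode()).hexdigest()[:8]

short_hash("hello")   # → "5d41402a"
short_hash("hello!")  # different input, different key
```

Eight hex characters is plenty here: collisions only matter within a single file's cache, where there are a handful of entries at most.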

For large documentation sites, the tool uses ThreadPoolExecutor to translate multiple files concurrently:


with ThreadPoolExecutor(max_workers=args.workers) as executor:
    futures = {executor.submit(process_file, f, ...): f for f in md_files}
    for future in as_completed(futures):
        output = future.result()  # Re-raises any exception from that file's worker


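One wrinkle with `future.result()`: it re-raises the worker's exception, so a single failed file aborts the whole run. Catching per future keeps the batch going. A sketch — the `translate` callable here is a stand-in for `process_file`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def translate_all(md_files, translate, workers=3):
    """Run translate(path) for each file; collect failures instead of aborting."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(translate, f): f for f in md_files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # One bad file shouldn't sink the batch
                failures[path] = exc
    return results, failures
```

At the end you can print the failure list and rerun just those files — the cache means the successful ones are skipped instantly.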

Running It


# Translate all files in ./docs from Chinese to English
python translate.py --input ./docs --source zh --target en

# Use GPT-4o instead of default DeepSeek
python translate.py -i ./docs -s zh -t en -m gpt-4o -u https://api.openai.com/v1

# Force re-translate everything
python translate.py -i ./docs -s zh -t en -f

# 5 concurrent workers for big projects
python translate.py -i ./docs -s zh -t en -w 5 -o ./translations



Why OpenAI-Compatible APIs?

Lock-in is annoying. This tool works with any provider that speaks the OpenAI chat format:

  • DeepSeek (free tier available)
  • OpenAI (GPT-4o / GPT-4o-mini)
  • One-API (self-hosted unified gateway)
  • Qwen / Moonshot / Groq (all OpenAI-compatible)

Just swap --api-url and --model — no code changes needed.
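In practice a small preset table makes switching providers a one-flag affair. The base URLs below are the OpenAI-compatible endpoints as I understand them — a hypothetical mapping, so verify each against the provider's own docs before relying on it:

```python
# Hypothetical preset table — verify each base URL against the provider's docs.
PROVIDERS = {
    "deepseek": "https://api.deepseek.com/v1",
    "openai":   "https://api.openai.com/v1",
    "groq":     "https://api.groq.com/openai/v1",
}

def chat_endpoint(provider_or_url):
    # Accept either a preset name or a raw base URL, and append the chat path
    base = PROVIDERS.get(provider_or_url, provider_or_url)
    return base.rstrip("/") + "/chat/completions"
```

Unknown names fall through as literal URLs, so a self-hosted One-API gateway works with no code change.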


Real-World Usage

I use this to keep my programming handbook in 3 languages. A full translation of 400+ files runs in about 15 minutes and costs less than $2 with DeepSeek.

Format preservation is the killer feature — tables, code blocks with syntax highlighting, nested lists, embedded images — everything stays intact.


Try It Yourself

The full source is on GitHub:

https://github.com/xuks124/md-translator

It's MIT licensed, so fork it, tweak it, use it for your docs. If you build something cool with it, drop a star or open an issue.


Happy translating!
