DEV Community

xuks124

Build a Markdown Translation Tool with AI APIs in 50 Lines of Python

Why I Built This

I write documentation in Chinese, but my readers are everywhere. Manually translating each .md file was killing my productivity. Worse, keeping translations in sync with updates was a nightmare.

I needed a tool that could batch-translate entire documentation folders while preserving every # header, **bold**, `code`, and [link]() perfectly. So I built one.

The result is md-translator — a lightweight Python script that uses any OpenAI-compatible API to translate Markdown files in bulk. It's under 200 lines, supports concurrent processing, and costs pennies per project.

Let me show you how it works.


What You'll Build

A CLI tool that:

  • Batch translates all .md files in a directory
  • Supports any OpenAI-compatible API (DeepSeek, GPT, Qwen, etc.)
  • Preserves every bit of Markdown formatting
  • Caches translations for resume support
  • Runs concurrent translations with configurable workers
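The command-line surface is small. Here's a plausible `argparse` skeleton for it — flag names are taken from the usage examples later in the post, but the defaults are my assumptions, not the script's actual values:

```python
import argparse

def build_parser():
    # Flag names mirror the usage examples in this post; defaults are assumptions.
    p = argparse.ArgumentParser(description="Batch-translate Markdown files")
    p.add_argument("-i", "--input", required=True, help="Directory of .md files")
    p.add_argument("-s", "--source", default="zh", help="Source language code")
    p.add_argument("-t", "--target", default="en", help="Target language code")
    p.add_argument("-m", "--model", default="deepseek-chat", help="Model name")
    p.add_argument("-u", "--api-url", default="https://api.deepseek.com/v1",
                   help="OpenAI-compatible base URL")
    p.add_argument("-f", "--force", action="store_true", help="Ignore cache, re-translate")
    p.add_argument("-w", "--workers", type=int, default=3, help="Concurrent workers")
    p.add_argument("-o", "--output", default=None, help="Output directory")
    return p
```

Everything else in the tool hangs off this namespace (`args.workers` shows up again in the concurrency section).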

Setup

Requires Python 3.8+ and one dependency:

pip install requests
git clone https://github.com/xuks124/md-translator.git
cd md-translator

Set your API key:

export MD_TRANSLATOR_KEY="sk-your-api-key-here"
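Inside the script, read the key from the environment so it never lands in source control. A minimal sketch (the variable name matches the `export` above; the error message is mine):

```python
import os
import sys

def load_api_key():
    # MD_TRANSLATOR_KEY matches the export shown above; fail fast if it's missing.
    key = os.environ.get("MD_TRANSLATOR_KEY")
    if not key:
        sys.exit("Error: set MD_TRANSLATOR_KEY before running the translator")
    return key
```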

Core Translation Logic

The core is an API call to any OpenAI-compatible chat endpoint. Here's the key function:

def translate_chunk(chunk, source_lang, target_lang, api_key, api_url, model):
    prompt = f"""You are a professional translator. Translate from {source_lang} to {target_lang}.
Rules:
1. Keep ALL Markdown syntax unchanged (code fences, **, [], ![], #, -, etc.)
2. Only translate text content
3. Keep code blocks and URLs unchanged
4. Return ONLY the translated content

Content:
---
{chunk}
---"""

    resp = requests.post(
        api_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3,
            "max_tokens": 4096
        },
        timeout=60
    )
    resp.raise_for_status()  # Surface HTTP errors instead of a cryptic KeyError below
    return resp.json()["choices"][0]["message"]["content"]


The prompt is the secret sauce. By telling the model to keep syntax unchanged, we get clean Markdown out every time.
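One caveat: the fixed-size slicing shown below can cut a sentence or a code fence in half, and the model can't restore what it never saw. A refinement is to split on paragraph boundaries and pack paragraphs up to a size limit. This is a sketch of that idea, not the script's actual splitter:

```python
def split_markdown(content, max_chars=3000):
    """Split on blank lines so chunks end at paragraph boundaries."""
    chunks, current = [], ""
    for para in content.split("\n\n"):
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks with `"\n\n"` reproduces the original document exactly, so nothing is lost at the seams.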


Batch Processing with Resume Support

Translation can take time, so the tool caches results by file hash. If you rerun it, already-translated files are skipped instantly:


def process_file(md_path, source_lang, target_lang, api_key, api_url, model, output_dir, force):
    with open(md_path, encoding='utf-8') as f:
        content = f.read()
    content_hash = hashlib.md5(content.encode()).hexdigest()[:8]

    # Load cache — resume support
    cache_dir = md_path.parent / '.md_translator_cache'
    cache_file = cache_dir / f"{md_path.stem}.json"
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    if not force and content_hash in cache:
        translated = cache[content_hash]  # Cache hit!
    else:
        # Split long files into chunks
        if len(content) > 4000:
            chunks = [content[i:i+3000] for i in range(0, len(content), 3000)]
            translated = '\n'.join(translate_chunk(c, ...) for c in chunks)
        else:
            translated = translate_chunk(content, ...)

        cache_dir.mkdir(exist_ok=True)  # Create the cache dir before first write
        cache[content_hash] = translated
        cache_file.write_text(json.dumps(cache))

    # Output: suffix-based naming (e.g., doc.en.md), into output_dir if given
    out_dir = Path(output_dir) if output_dir else md_path.parent
    output_path = out_dir / f"{md_path.stem}.{target_lang}.md"
    output_path.write_text(translated, encoding='utf-8')


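The cache key is just the first 8 hex characters of the file content's MD5, so any edit to the source file produces a new key and invalidates the stale translation:

```python
import hashlib

def short_hash(text):
    # Same scheme as process_file: MD5 of the UTF-8 bytes, truncated to 8 hex chars
    return hashlib.md5(text.encode()).hexdigest()[:8]

short_hash("hello")   # → "5d41402a"
short_hash("hello!")  # different input, different key
```

Eight hex characters is plenty here: collisions only matter within a single file's cache, where there are a handful of entries at most.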

For large documentation sites, the tool uses ThreadPoolExecutor to translate multiple files concurrently:


with ThreadPoolExecutor(max_workers=args.workers) as executor:
    futures = {executor.submit(process_file, f, ...): f for f in md_files}
    for future in as_completed(futures):
        output = future.result()  # Re-raises any exception from that file's worker


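One wrinkle with `future.result()`: it re-raises the worker's exception, so a single failed file aborts the whole run. Catching per future keeps the batch going. A sketch — the `translate` callable here is a stand-in for `process_file`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def translate_all(md_files, translate, workers=3):
    """Run translate(path) for each file; collect failures instead of aborting."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(translate, f): f for f in md_files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # One bad file shouldn't sink the batch
                failures[path] = exc
    return results, failures
```

At the end you can print the failure list and rerun just those files — the cache means the successful ones are skipped instantly.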

Running It


# Translate all files in ./docs from Chinese to English
python translate.py --input ./docs --source zh --target en

# Use GPT-4o instead of default DeepSeek
python translate.py -i ./docs -s zh -t en -m gpt-4o -u https://api.openai.com/v1

# Force re-translate everything
python translate.py -i ./docs -s zh -t en -f

# 5 concurrent workers for big projects
python translate.py -i ./docs -s zh -t en -w 5 -o ./translations



Why OpenAI-Compatible APIs?

Lock-in is annoying. This tool works with any provider that speaks the OpenAI chat format:

  • DeepSeek (free tier available)
  • OpenAI (GPT-4o / GPT-4o-mini)
  • One-API (self-hosted unified gateway)
  • Qwen / Moonshot / Groq (all OpenAI-compatible)

Just swap --api-url and --model — no code changes needed.
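In practice a small preset table makes switching providers a one-flag affair. The base URLs below are the OpenAI-compatible endpoints as I understand them — a hypothetical mapping, so verify each against the provider's own docs before relying on it:

```python
# Hypothetical preset table — verify each base URL against the provider's docs.
PROVIDERS = {
    "deepseek": "https://api.deepseek.com/v1",
    "openai":   "https://api.openai.com/v1",
    "groq":     "https://api.groq.com/openai/v1",
}

def chat_endpoint(provider_or_url):
    # Accept either a preset name or a raw base URL, and append the chat path
    base = PROVIDERS.get(provider_or_url, provider_or_url)
    return base.rstrip("/") + "/chat/completions"
```

Unknown names fall through as literal URLs, so a self-hosted One-API gateway works with no code change.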


Real-World Usage

I use this to keep my programming handbook in 3 languages. A full translation of 400+ files runs in about 15 minutes and costs less than $2 with DeepSeek.

Format preservation is the killer feature — tables, code blocks with syntax highlighting, nested lists, embedded images — everything stays intact.


Try It Yourself

The full source is on GitHub:

https://github.com/xuks124/md-translator

It's MIT licensed, so fork it, tweak it, use it for your docs. If you build something cool with it, drop a star or open an issue.


Happy translating!
