Mandarin Zone

Posted on • Originally published at hsk4.mandarinzone.com

How I Open-Sourced 1,000+ Chinese Exam Questions from WordPress to GitHub

I run Mandarin Zone, a Chinese language school that has been teaching in Beijing since 2008. Over the years, I built 12 complete HSK 4 mock exams with the AYS Quiz Maker WordPress plugin so our students could practice online.

Recently, I decided to open-source all of this content. Here's how I extracted 1,176 questions from a WordPress database and turned them into a clean, developer-friendly GitHub repository.

The Challenge

Our quiz data was locked inside WordPress. It was spread across multiple database tables (aysquiz_questions, aysquiz_answers, aysquiz_quizzes) and carried HTML-embedded content, WordPress shortcodes for audio files, and messy formatting.

The Extraction

Step 1: SQL Export

I wrote targeted SQL queries to join the questions, answers, and quiz mapping tables. The core query pairs each question with its answer options:

SELECT 
    q.id AS question_id,
    q.question AS question_text,
    q.type AS question_type,
    a.answer AS answer_text,
    a.correct AS is_correct,
    a.ordering AS answer_order
FROM aysquiz_questions q
LEFT JOIN aysquiz_answers a ON a.question_id = q.id
ORDER BY q.id, a.ordering;

The first export came out at 400MB for just 8,566 rows. It turned out that some fields held massive embedded content. After I trimmed the unnecessary columns, the file dropped to 1.4MB.
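One way to find which columns are bloating an export like this is to total the bytes per column. The snippet below is a minimal sketch, assuming the export was saved as CSV; the sample column names (including `options`) are made up for illustration and are not the actual AYS Quiz Maker schema.

```python
import csv
import io
from collections import Counter

# Oversized fields can exceed the csv module's default per-field limit.
csv.field_size_limit(10**9)


def column_sizes(csv_file):
    """Return (column, total UTF-8 bytes) pairs, largest first."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column is None:  # surplus fields on a malformed row
                continue
            totals[column] += len((value or "").encode("utf-8"))
    return totals.most_common()


# Tiny in-memory stand-in for the real export; column names are hypothetical.
sample = io.StringIO(
    "question_id,question_text,options\n"
    '1,<p>他喜欢游泳。</p>,"' + "x" * 500 + '"\n'
)
sizes = column_sizes(sample)
for column, total_bytes in sizes:
    print(f"{column}: {total_bytes} bytes")
```

Columns at the top of that list are the first candidates to drop from the SELECT.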

Step 2: Data Cleaning

The raw data had WordPress shortcodes like [audio wav="..."][/audio] and HTML entities everywhere. I wrote a Python script to:

  • Extract audio URLs from shortcodes
  • Strip HTML tags while preserving Chinese text
  • Map question types based on content patterns (listening true/false, reading comprehension, fill-in-the-blank, sentence ordering)
  • Group answers by question ID and sort by ordering
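The audio extraction, HTML stripping, and answer grouping steps can be sketched roughly as follows. This is not the exact script, just a minimal illustration: it assumes rows arrive as dicts keyed by the column aliases from the SQL query above, and that `is_correct` is stored as 0 or 1. The regular expressions and function names are illustrative choices.

```python
import html
import re
from collections import defaultdict

# Captures the URL in shortcodes such as [audio wav="..."][/audio].
AUDIO_RE = re.compile(r'\[audio\s+[^\]]*?\b(?:wav|mp3|src)="([^"]+)"[^\]]*\]')
SHORTCODE_RE = re.compile(r"\[/?audio[^\]]*\]")
TAG_RE = re.compile(r"<[^>]+>")


def extract_audio_urls(raw):
    """Return every audio URL found in [audio ...] shortcodes."""
    return AUDIO_RE.findall(raw)


def strip_markup(raw):
    """Drop shortcodes and HTML tags, decode entities, collapse whitespace."""
    text = SHORTCODE_RE.sub("", raw)
    text = TAG_RE.sub("", text)
    text = html.unescape(text)  # Chinese text passes through untouched
    return re.sub(r"\s+", " ", text).strip()


def group_answers(rows):
    """Group answer rows by question ID, sorted by their ordering column."""
    grouped = defaultdict(list)
    for row in rows:
        answers = grouped[row["question_id"]]
        if row["answer_text"] is not None:  # LEFT JOIN may yield no answer
            answers.append(row)
    result = {}
    for question_id, answers in grouped.items():
        answers.sort(key=lambda r: int(r["answer_order"]))
        result[question_id] = {
            "options": [strip_markup(a["answer_text"]) for a in answers],
            # Assumption: is_correct is 1 for the right answer, 0 otherwise.
            "correct_answer_index": next(
                (i for i, a in enumerate(answers) if int(a["is_correct"]) == 1),
                None,
            ),
        }
    return result


raw_question = '<p>他喜欢&nbsp;游泳。</p>[audio wav="https://example.com/a.wav"][/audio]'
audio_urls = extract_audio_urls(raw_question)
clean_text = strip_markup(raw_question)
grouped = group_answers([
    {"question_id": 1, "answer_text": "错", "is_correct": 0, "answer_order": 2},
    {"question_id": 1, "answer_text": "<b>对</b>", "is_correct": 1, "answer_order": 1},
])
```

Decoding entities after the tags are removed keeps an escaped `&lt;` in the question text from being mistaken for markup.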

Step 3: Structured JSON

Each test became a clean JSON file:

{
  "quiz_id": 2,
  "title": "HSK 4 Sample Quiz",
  "total_questions": 100,
  "questions": [
    {
      "number": 1,
      "type": "listening_true_false",
      "audio": "https://media.mandarinzone.com/.../hsk4-1-02.wav",
      "options": ["对", "错"],
      "correct_answer_index": 0
    }
  ]
}
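Writing a file in that shape takes only a few lines. The sketch below assumes the questions have already been cleaned into dicts; the `write_quiz` helper and the `quiz-<id>.json` file name are hypothetical, not the repository's actual layout.

```python
import json
import tempfile
from pathlib import Path


def write_quiz(quiz_id, title, questions, out_dir):
    """Serialize one quiz to a UTF-8 JSON file and return its path."""
    quiz = {
        "quiz_id": quiz_id,
        "title": title,
        "total_questions": len(questions),
        # Number questions from 1 in the order they were passed in.
        "questions": [
            {"number": n, **q} for n, q in enumerate(questions, start=1)
        ],
    }
    path = Path(out_dir) / f"quiz-{quiz_id}.json"  # hypothetical naming scheme
    # ensure_ascii=False keeps 对/错 readable instead of \uXXXX escapes.
    path.write_text(
        json.dumps(quiz, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_quiz(
        2,
        "HSK 4 Sample Quiz",
        [{
            "type": "listening_true_false",
            "options": ["对", "错"],
            "correct_answer_index": 0,
        }],
        tmp,
    )
    saved_text = path.read_text(encoding="utf-8")
    loaded = json.loads(saved_text)
```

Passing `ensure_ascii=False` matters for a Chinese dataset: without it, every character is written as an escape sequence and the files become hard to review in a diff.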

The Result

  • 12 complete HSK 4 mock exams in JSON format
  • 1,176 questions across 6 question types
  • GitHub Pages demo where anyone can take the tests online
  • CC BY-NC-SA 4.0 license — free for non-commercial use

What is HSK 4?

HSK (汉语水平考试) is China's official Chinese proficiency test, recognized worldwide. Level 4 is intermediate: it certifies that you can discuss a wide range of topics and know roughly 1,200 vocabulary words. Each exam has 100 questions covering listening, reading, and writing.

What You Can Build With This

  • A mobile HSK practice app
  • Anki flashcard decks
  • NLP training data for Chinese language models
  • Your own quiz platform
  • Spaced repetition study tools
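As a starting point for any of these, here is a minimal sketch of grading a learner's answers against the JSON structure shown earlier. The `score_quiz` function and the shape of `chosen` are illustrative assumptions, and questions without a `correct_answer_index` are skipped rather than counted as wrong.

```python
def score_quiz(quiz, chosen):
    """Return (correct, gradable) for a quiz dict.

    chosen maps a question number to the selected option index.
    """
    gradable = [
        q for q in quiz["questions"]
        if q.get("correct_answer_index") is not None
    ]
    correct = sum(
        1 for q in gradable
        if chosen.get(q["number"]) == q["correct_answer_index"]
    )
    return correct, len(gradable)


# Two-question stand-in for a full 100-question file.
quiz = {
    "quiz_id": 2,
    "questions": [
        {"number": 1, "options": ["对", "错"], "correct_answer_index": 0},
        {"number": 2, "options": ["对", "错"], "correct_answer_index": 0},
    ],
}
correct, total = score_quiz(quiz, {1: 0, 2: 1})
print(f"{correct}/{total} correct")
```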

Try It

If you're learning Chinese or building language learning tools, I hope this helps. PRs and stars welcome!

