I run Mandarin Zone, a Chinese language school that has been operating in Beijing since 2008. Over the years, I built 12 complete HSK 4 mock exams with the AYS Quiz Maker WordPress plugin so our students could practice online.
Recently, I decided to open-source all of this content. Here's how I extracted 1,176 questions from a WordPress database and turned them into a clean, developer-friendly GitHub repository.
The Challenge
Our quiz data was locked inside WordPress, spread across multiple database tables (aysquiz_questions, aysquiz_answers, aysquiz_quizzes) and full of HTML-embedded content, WordPress shortcodes for audio files, and messy formatting.
The Extraction
Step 1: SQL Export
I wrote targeted SQL queries to join the questions, answers, and quiz mapping tables. The core query pairs each question with its answers:
SELECT
    q.id AS question_id,
    q.question AS question_text,
    q.type AS question_type,
    a.answer AS answer_text,
    a.correct AS is_correct,
    a.ordering AS answer_order
FROM aysquiz_questions q
LEFT JOIN aysquiz_answers a ON a.question_id = q.id
ORDER BY q.id, a.ordering;
The first export came out at 400MB for just 8,566 rows. It turned out that some fields held massive embedded content. After I trimmed the unnecessary columns, the export dropped to 1.4MB.
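One way to spot which columns bloat an export like this is to total the bytes per column. Here's a minimal sketch, assuming the export was saved as CSV with a header row. The CSV format, the `column_sizes` helper, and the sample `explanation` column are all illustrative assumptions, not the actual export:

```python
import csv
import io
from collections import Counter


def column_sizes(csv_file):
    """Return total UTF-8 bytes per column for a CSV file object with a header row."""
    sizes = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            # Skip missing values and overflow fields from malformed rows.
            if isinstance(value, str):
                sizes[column] += len(value.encode("utf-8"))
    return sizes


# Hypothetical sample standing in for the real export.
sample = io.StringIO(
    "question_id,question_text,explanation\n"
    "1,short,{}\n"
    "2,short,{}\n".format("x" * 1000, "y" * 2000)
)

sizes = column_sizes(sample)
for column, total in sizes.most_common():
    print(f"{column}: {total} bytes")
```

For a real export with very large fields, you may need to raise the parser limit with `csv.field_size_limit` before reading.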
Step 2: Data Cleaning
The raw data had WordPress shortcodes like [audio wav="..."][/audio] and HTML entities everywhere. I wrote a Python script to:
- Extract audio URLs from shortcodes
- Strip HTML tags while preserving Chinese text
- Map question types based on content patterns (listening true/false, reading comprehension, fill-in-the-blank, sentence ordering)
- Group answers by question ID and sort by ordering
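The cleaning steps above can be sketched roughly as follows. This is not the exact script I ran: the regexes, the function names, and the `text` output field are assumptions for illustration, the input rows are assumed to be dicts keyed by the SQL aliases from Step 1, and the question-type mapping is left out because it depends on content patterns specific to our data.

```python
import html
import re
from collections import defaultdict

# Matches shortcodes such as [audio wav="..."][/audio] and captures the URL.
AUDIO_RE = re.compile(
    r'\[audio\s+[^\]]*?(?:wav|mp3|src)="([^"]+)"[^\]]*\](?:\[/audio\])?'
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_audio(text):
    """Return (first audio URL or None, text with audio shortcodes removed)."""
    match = AUDIO_RE.search(text)
    url = match.group(1) if match else None
    return url, AUDIO_RE.sub("", text)


def strip_html(text):
    """Remove tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub("", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def group_answers(rows):
    """Group joined rows into one cleaned record per question."""
    answers = defaultdict(list)
    raw_text = {}
    for row in rows:
        raw_text[row["question_id"]] = row["question_text"]
        # The LEFT JOIN can return questions that have no answers.
        if row["answer_text"] is not None:
            answers[row["question_id"]].append(row)

    questions = []
    for qid in sorted(raw_text):
        audio, text = extract_audio(raw_text[qid])
        ordered = sorted(answers[qid], key=lambda r: r["answer_order"])
        correct = [i for i, r in enumerate(ordered) if int(r["is_correct"]) == 1]
        questions.append({
            "audio": audio,
            "text": strip_html(text),  # hypothetical field name
            "options": [strip_html(r["answer_text"]) for r in ordered],
            "correct_answer_index": correct[0] if correct else None,
        })
    return questions


# Hypothetical rows in the shape returned by the Step 1 query.
raw = '<p>[audio wav="https://example.com/q7.wav"][/audio]&nbsp;</p>'
rows = [
    {"question_id": 7, "question_text": raw,
     "answer_text": "错", "is_correct": "0", "answer_order": 2},
    {"question_id": 7, "question_text": raw,
     "answer_text": "<b>对</b>", "is_correct": "1", "answer_order": 1},
]

cleaned = group_answers(rows)
print(cleaned)
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in the question text is not mistaken for markup. A regex-based tag stripper is a shortcut that works for simple quiz markup; heavily nested or malformed HTML would be safer with a real parser.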
Step 3: Structured JSON
Each test became a clean JSON file:
{
  "quiz_id": 2,
  "title": "HSK 4 Sample Quiz",
  "total_questions": 100,
  "questions": [
    {
      "number": 1,
      "type": "listening_true_false",
      "audio": "https://media.mandarinzone.com/.../hsk4-1-02.wav",
      "options": ["对", "错"],
      "correct_answer_index": 0
    }
  ]
}
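To show how this structure can be consumed, here's a small sketch that parses a quiz and scores a set of responses. It uses only the fields shown above. The `score` helper and the shape of `responses` (question number mapped to the chosen option index) are assumptions for illustration, and the JSON is inlined rather than read from a file:

```python
import json

# Same shape as the sample above, inlined so the sketch is self-contained.
quiz_json = """
{
  "quiz_id": 2,
  "title": "HSK 4 Sample Quiz",
  "total_questions": 100,
  "questions": [
    {
      "number": 1,
      "type": "listening_true_false",
      "audio": "https://media.mandarinzone.com/.../hsk4-1-02.wav",
      "options": ["对", "错"],
      "correct_answer_index": 0
    }
  ]
}
"""


def score(quiz, responses):
    """Count correct responses; `responses` maps question number to option index."""
    correct = 0
    for question in quiz["questions"]:
        chosen = responses.get(question["number"])
        # Unanswered questions (no entry in `responses`) count as wrong.
        if chosen is not None and chosen == question["correct_answer_index"]:
            correct += 1
    return correct


quiz = json.loads(quiz_json)
result = score(quiz, {1: 0})
print(f"{result}/{len(quiz['questions'])} correct")
```

Storing the correct answer as an index into `options` keeps scoring a simple integer comparison, with no string matching on Chinese text.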
The Result
- 12 complete HSK 4 mock exams in JSON format
- 1,176 questions across 6 question types
- GitHub Pages demo where anyone can take the tests online
- CC BY-NC-SA 4.0 license — free for non-commercial use
What is HSK 4?
HSK (汉语水平考试) is China's official Chinese proficiency test, recognized worldwide. Level 4 is intermediate: it certifies that you can discuss a wide range of topics and know roughly 1,200 vocabulary words. Each exam has 100 questions covering listening, reading, and writing.
What You Can Build With This
- A mobile HSK practice app
- Anki flashcard decks
- NLP training data for Chinese language models
- Your own quiz platform
- Spaced repetition study tools
Try It
- Take a test online: hsk4.mandarinzone.com
- GitHub repo: github.com/Make-dream-clear/hsk4-mock-exam
If you're learning Chinese or building language learning tools, I hope this helps. PRs and stars welcome!