Dennis Kim

Posted on Jul 5

Building a Korea-Market Middleware for Microsoft Qlib

#qlib #python #machinelearning #quant

TL;DR

Korea's equity market is having a moment, and TOSS Securities recently opened an Open API — a rare, developer-friendly on-ramp for retail quants.
Microsoft's Qlib is the best open-source "AI research + backtest" quant platform, but it does not officially support the Korean market.
So I built a small Node.js/TypeScript + Redis middleware that pulls quotes from the TOSS Open API, normalizes them into Qlib's CSV convention, and feeds dump_bin.py.
I also wrote a Korean-language "Qlib Getting Started" guide for Korean developers, including a full KRX data-integration section.
Next up: a Korea-specialized middleware that also ingests secondary data (corporate disclosures / DART filings, etc.) and is reusable across trading bots — not just Qlib.

Why now? The Korean market opportunity

Korea's stock market has been unusually active lately, and for developers the timing is interesting for one specific reason: TOSS Securities opened an Open API. Historically, Korean retail brokerage automation meant wrestling with legacy Windows-only OCX/COM bridges. A clean, OAuth2-based HTTP API changes the game — it means you can build data pipelines and trading tooling on any stack, on any OS.

Meanwhile, the best open-source quant research stack — Microsoft Qlib — has no first-class Korea support. Its region setting only covers CN / US / TW. That gap is exactly where a middleware belongs.

Qlib is a calculator, not an oracle. No framework saves you from bad data or sloppy methodology. But if the data plumbing is clean, the research loop gets a lot faster.

What is Qlib, quickly

Qlib is Microsoft Research's AI-oriented quantitative investment platform (open-sourced in 2020, roughly 40k GitHub stars). It covers the full ML pipeline (data → factor computation → model training → backtest → reporting) in one framework.

A few things that make it stand out:

All-in-one pipeline. No more gluing zipline (backtest) + backtrader (execution) + a separate factor library.
Purpose-built data infra. A binary storage format plus a two-tier cache (ExpressionCache + DatasetCache). In Microsoft's own benchmark (800 symbols × 14 factors, 2007–2020 daily, 1 CPU), the fully-cached path runs in 7.4s vs. 365s for MySQL — roughly 49× faster.
Expression-based factor engine. Define a factor as a string like Ref($close, 1)/$close - 1 and the engine handles vectorization + caching for you.
A reproducible Model Zoo. 25+ SOTA models (LightGBM, GRU, ALSTM, Transformer, TRA, TFT…) on the same Alpha158 / Alpha360 datasets, comparable under identical backtest conditions.
Non-stationarity tooling. Rolling retraining and DDG-DA (meta-learning for concept drift) ship as benchmarks — a Qlib-specific strength.
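
To make the expression syntax concrete, here is a plain-Python sketch of what Ref($close, 1)/$close - 1 evaluates to for a single symbol. It is not Qlib's engine (which vectorizes and caches); the function name and the sample prices are made up for illustration. It relies on Qlib's documented convention that Ref(x, N) with N > 0 looks N periods back.

```python
import math
from typing import List


def ref_over_close_minus_one(close: List[float], n: int = 1) -> List[float]:
    """Evaluate Ref($close, n)/$close - 1 for a single symbol.

    Ref($close, n) with n > 0 is the close from n bars earlier, so the
    first n rows have no reference value and come out as NaN.
    """
    out = []
    for i, today in enumerate(close):
        if i < n:
            out.append(math.nan)  # no bar n periods back yet
        else:
            out.append(close[i - n] / today - 1)
    return out


closes = [100.0, 110.0, 99.0]  # made-up prices
values = ref_over_close_minus_one(closes)
```

Note the direction: this expression divides the previous close by the current one, so a rising price produces a negative value.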

The one thing Qlib deliberately leaves out: live broker order execution. That's out of scope by design — which matters for how I scoped the middleware below.

The Korean-developer gap: a KR "Getting Started" guide

Since Qlib's docs and community are largely CN/EN-centric, I wrote a Korean-language getting-started guide aimed at Python developers standing up a quant/ML backtest environment for the first time.

It covers:

Project overview, core strengths, and an honest comparison vs. zipline / backtrader / vectorbt / QuantConnect.
Install paths (pip / source / Docker), including the Apple Silicon brew install libomp gotcha for LightGBM.
The 2026 data reality: the official download script is paused; the guide points to the community investment_data dataset instead.
First workflow with qrun, a code-based custom workflow, and the expression engine.
Benchmarks (Alpha158 vs Alpha360, DDG-DA dynamic adaptation).
A full "Korean developer" section: wiring KRX data into Qlib.
Pitfalls — install, data quality, and methodology (look-ahead bias, transaction cost, overfitting, "IC 0.05 is a starting point, not a good number").

Guide (Korean): Qlib-getting-started-KR.md

Connecting KRX data to Qlib

Qlib doesn't officially support Korea, but its dump_bin.py only needs CSV. So the recipe is: collect OHLCV → write CSV in Qlib's convention → convert to Qlib binary.

# pip install pykrx
import os
import time

from pykrx import stock

os.makedirs("csv_kr", exist_ok=True)
tickers = stock.get_market_ticker_list(market="KOSPI")

for t in tickers[:50]:  # first 50 tickers only, to keep the demo short
    df = stock.get_market_ohlcv("20180101", "20260630", t)
    # pykrx returns Korean column names, with the date as the index
    df = df.reset_index().rename(columns={
        "날짜": "date", "시가": "open", "고가": "high",
        "저가": "low", "종가": "close", "거래량": "volume",
    })
    df["symbol"] = t
    df["factor"] = 1.0  # Qlib adjust-price factor; 1.0 = use prices as given
    df.to_csv(f"csv_kr/{t}.csv", index=False)
    time.sleep(1)  # pause between requests to avoid hammering the data source

python scripts/dump_bin.py dump_all \
    --csv_path ./csv_kr \
    --qlib_dir ~/.qlib/qlib_data/kr_data \
    --include_fields open,close,high,low,volume,factor \
    --date_field_name date --symbol_field_name symbol
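
Before running dump_bin.py, it can save a debugging round to check that each CSV matches what the command above expects. The sketch below uses only the standard library. The function name and the specific checks (required columns, unique ascending dates) are my assumptions about what is worth verifying, not something Qlib mandates.

```python
import csv
import io
from typing import List, TextIO

REQUIRED = ["date", "symbol", "open", "close", "high", "low", "volume", "factor"]


def check_qlib_csv(handle: TextIO) -> List[str]:
    """Return a list of problems found in one per-symbol CSV (empty = OK)."""
    reader = csv.DictReader(handle)
    problems = []
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    dates = [row["date"] for row in reader]
    if not dates:
        problems.append("no data rows")
    if len(set(dates)) != len(dates):
        problems.append("duplicate dates")
    if dates != sorted(dates):  # ISO-formatted dates sort correctly as strings
        problems.append("dates not ascending")
    return problems


sample = io.StringIO(
    "date,open,high,low,close,volume,symbol,factor\n"
    "2024-01-02,100,105,99,104,1000,005930,1.0\n"
    "2024-01-03,104,106,101,102,1200,005930,1.0\n"
)
result = check_qlib_csv(sample)
```

For a real file, open it with `open("csv_kr/005930.csv", newline="", encoding="utf-8")` and pass the handle in; extra columns are ignored.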

Korea-specific checklist (this is where naive ports break):

Item	Why it matters
Adjusted price (`factor`)	Splits/dividends must be reflected or returns get distorted.
Trading rules	`REG_CN` applies China's ±10% limit and T+1; Korea is ±30% and T+0. Customize the backtest's exchange settings (e.g. `limit_threshold`) rather than reusing the China defaults.
Delisting / halts	Survivorship bias: ideally include delisted names.
Calendar	Verify the Korean trading-holiday calendar is generated.
Alpha158 factors	Factor definitions are market-agnostic, but re-validate on Korean data: a CSI300 IC doesn't guarantee a KOSPI IC.
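
For the adjusted-price row, Qlib's data documentation describes factor as the adjusted price divided by the original price. A minimal sketch of deriving it when you hold both raw and adjusted closes follows; the function name and numbers are illustrative, and where the two series come from is left open.

```python
from typing import List


def adjust_factors(raw_close: List[float], adj_close: List[float]) -> List[float]:
    """Per-day factor = adjusted close / raw close (Qlib's convention)."""
    if len(raw_close) != len(adj_close):
        raise ValueError("raw and adjusted series must have the same length")
    return [adj / raw for raw, adj in zip(raw_close, adj_close)]


# Made-up example: a 2-for-1 split takes effect on the third day, so the two
# earlier raw closes are twice their back-adjusted values.
raw = [1000.0, 1020.0, 515.0]
adj = [500.0, 510.0, 515.0]
factors = adjust_factors(raw, adj)
```

Here the factor is 0.5 before the split and 1.0 from the split day onward. As far as I understand the convention, the price columns then hold adjusted values, and dividing by factor recovers the traded price.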

The middleware: TOSS Open API → Qlib

Rather than cram OAuth2, token expiry, rate limits, and pagination into the same Python codebase that does factor research, I split concerns. A separate middleware drops normalized CSVs; Qlib just consumes them.

TOSS Open API  --OAuth2-->  [Node.js/TS middleware]  --CSV(csv_kr/*.csv)-->  scripts/dump_bin.py  -->  ~/.qlib/qlib_data/kr_data
                                   |
                                 Redis (token cache + market data cache)

Middleware (English README): toss-qlib-middleware/README_EN.md

Authentication (confirmed spec)

Item	Detail
Flow	OAuth2 Client Credentials Grant (no user login step)
Token issuance	`POST {TOSS_BASE_URL}/oauth2/token` with `grant_type` / `client_id` / `client_secret` as a form-urlencoded body (not Basic Auth)
Lifetime	86,400s (24h), no refresh token — you must re-issue with the client secret before expiry
Call header	`Authorization: Bearer {access_token}`
Account/order APIs	need an extra `X-Tossinvest-Account` header (not called here)

I verified the endpoint by actually hitting POST /oauth2/token: even with wrong credentials it returns a real {"error":"invalid_client", ...}, confirming the path and request shape. (As of mid-2026 the service is still in a pre-registration phase, so the middleware parses candle/price response fields defensively rather than assuming a fixed schema.)
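
The token call from the table can be sketched in a few lines. This is a Python illustration using only the standard library, not the middleware's TypeScript code. The function names and the placeholder base URL are mine, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(base_url: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) the client-credentials token request."""
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return Request(
        url=f"{base_url.rstrip('/')}/oauth2/token",
        data=body,  # credentials travel in the form body, not in Basic Auth
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_header(access_token: str) -> dict:
    """Header to attach to subsequent market-data calls."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; the real base URL and credentials come from your .env.
req = build_token_request("https://toss.example.invalid", "my-id", "my-secret")
```

Passing `req` to `urllib.request.urlopen` would perform the call. With bad credentials, expect an HTTP error carrying the invalid_client body described above.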

Redis caching strategy

Cached item	Key	TTL	Reason
Access token	`toss:access_token`	`86400 − safety margin`	No refresh token → re-issue well before expiry
Token refresh lock	`toss:access_token:lock`	10s (`SET NX`)	Stops a thundering herd of simultaneous re-issues
Finalized past candles	`toss:candles:{symbol}:{interval}:{start}:{end}`	1 day	Closed candles never change
Today's candles	same key	30s	Values keep updating intraday
Current price	`toss:price:{symbol}`	5s	Fresh, per-symbol so batches reuse hits
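
The key and TTL rules in the table reduce to a few small functions. The sketch below is a Python illustration rather than the middleware's code. The 5-minute safety margin and the ISO date format inside the key are assumptions, since the table does not pin either down.

```python
from datetime import date

TOKEN_LIFETIME_S = 86_400     # token lifetime from the auth table
SAFETY_MARGIN_S = 300         # assumption: re-issue 5 minutes early
PAST_CANDLE_TTL_S = 86_400    # 1 day for closed candles
TODAY_CANDLE_TTL_S = 30       # short TTL while today's candle is still moving


def token_ttl(expires_in: int = TOKEN_LIFETIME_S, margin: int = SAFETY_MARGIN_S) -> int:
    """TTL for toss:access_token, never below 1 second."""
    return max(1, expires_in - margin)


def candle_key(symbol: str, interval: str, start: date, end: date) -> str:
    """Cache key in the toss:candles:{symbol}:{interval}:{start}:{end} layout."""
    return f"toss:candles:{symbol}:{interval}:{start.isoformat()}:{end.isoformat()}"


def candle_ttl(end: date, today: date) -> int:
    """Use the short TTL when the range still includes today's candle."""
    return TODAY_CANDLE_TTL_S if end >= today else PAST_CANDLE_TTL_S
```

The refresh lock maps onto a single Redis command. With redis-py, for instance, `r.set("toss:access_token:lock", "1", nx=True, ex=10)` only succeeds for the first caller, and the key expires on its own after 10 seconds.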

On a 401 the cached token is invalidated and the request is retried once; on a 429 the client backs off for the duration given in the Retry-After header. The candles endpoint returns at most 200 rows and has no start/end filter, so the middleware paginates backward with a before cursor, then returns the merged result sorted ascending.
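
The backward pagination is easier to follow in code. This is a Python sketch under assumptions: `fetch_page` is a hypothetical callable standing in for the real HTTP call, the cursor is taken to be an ISO date string meaning "strictly older than", and the candle shape is invented for the demo.

```python
from datetime import date, timedelta
from typing import Callable, Dict, List, Optional

Candle = Dict[str, object]  # assumed shape: {"date": "YYYY-MM-DD", ...}
PAGE_LIMIT = 200            # per-call maximum stated above


def fetch_range(
    fetch_page: Callable[[Optional[str]], List[Candle]],
    start: str,
    end: str,
) -> List[Candle]:
    """Walk backward from the newest candles until `start` is covered."""
    collected: Dict[str, Candle] = {}
    before: Optional[str] = None
    while True:
        page = fetch_page(before)  # newest page first when before is None
        if not page:
            break  # ran out of history
        for candle in page:
            day = str(candle["date"])
            if start <= day <= end:
                collected[day] = candle  # keyed by date, so overlaps dedupe
        oldest = min(str(c["date"]) for c in page)
        if oldest <= start or len(page) < PAGE_LIMIT:
            break  # reached the requested start, or this was the last page
        before = oldest
    return [collected[d] for d in sorted(collected)]  # ascending by date


# Fake server with 450 consecutive daily candles starting 2023-01-01.
history = [
    {"date": (date(2023, 1, 1) + timedelta(days=i)).isoformat(), "close": i}
    for i in range(450)
]


def fake_fetch_page(before: Optional[str]) -> List[Candle]:
    older = [c for c in history if before is None or c["date"] < before]
    return older[-PAGE_LIMIT:]


result = fetch_range(fake_fetch_page, "2023-03-01", "2023-12-31")
```

Filtering to the requested window happens client-side because, per the description above, the endpoint itself offers no start/end filter.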

API surface

Method	Path	Description
GET	`/health`	Health check
GET	`/api/candles/:symbol?start=&end=&interval=day`	Normalized candle JSON (Redis-cached, `before` pagination)
GET	`/api/prices?symbols=005930,000660`	Batch current-price lookup (chunked at 200)
POST	`/api/export/qlib` `{symbols, start, end, outDir?}`	Fetch symbols → write `csv_kr/{symbol}.csv`

Or skip the server and export CSV straight from the CLI:

npm run export:qlib -- --symbols 005930,000660 --start 2020-01-01 --end 2026-07-01
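
The "chunked at 200" note on the batch price endpoint amounts to splitting the symbol list before calling upstream. A minimal Python sketch follows; the helper name is mine, and de-duplicating symbols first is an extra I added rather than documented middleware behavior.

```python
from typing import Iterator, List, Sequence

MAX_SYMBOLS_PER_CALL = 200  # chunk size from the API table


def chunked(symbols: Sequence[str], size: int = MAX_SYMBOLS_PER_CALL) -> Iterator[List[str]]:
    """Yield de-duplicated symbols in order, at most `size` per chunk."""
    if size < 1:
        raise ValueError("size must be positive")
    unique = list(dict.fromkeys(symbols))  # drop repeats, keep first-seen order
    for i in range(0, len(unique), size):
        yield unique[i:i + size]


symbols = [f"{n:06d}" for n in range(450)]  # made-up six-digit codes
batches = list(chunked(symbols))
```

With 450 symbols this yields three upstream calls of 200, 200, and 50 symbols.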

Quick start

cd TechDoc/Quant_Qlib/toss-qlib-middleware
npm install
npm run setup      # interactively creates .env, optionally test-issues a real token
npm run typecheck
npm test           # passes WITHOUT Redis (in-memory adapter validates the logic)
npm run dev        # http://localhost:4000, requires a real Redis instance

Why trading (order execution) is intentionally out of scope

Auth and market-data retrieval are common needs that look nearly identical for everyone — perfect for shared middleware. Order logic (state tracking, dedupe-on-retry, risk limits, fill confirmation) varies completely by strategy and risk tolerance, so shipping it generically would be irresponsible. Qlib itself leaves live execution out too, and a backtest never guarantees live performance. The middleware exposes TossAuthService + TossApiClient as clean extension points if you want to add orders yourself — but test with tiny/paper trades first.

What's next: secondary data + bot-agnostic

The current middleware stops at price/candle data. The roadmap is a Korea-specialized middleware that also ingests secondary data such as corporate disclosures and filings (e.g. DART), so strategies can react to events, not just prices.

And crucially: this middleware won't be Qlib-only. The normalization layer is generic enough that the same authenticated, cached, rate-limit-aware data feed can back any trading bot — Qlib is just the first consumer. Think of it as a reusable Korea-market data plane: one integration, many downstream engines.

If you're building anything on the Korean market with Python or TypeScript, I'd love feedback on which secondary datasets matter most to you.

DEV Community