Dennis Kim

Building a Korea-Market Middleware for Microsoft Qlib

TL;DR

  • Korea's equity market is having a moment, and TOSS Securities recently opened an Open API — a rare, developer-friendly on-ramp for retail quants.
  • Microsoft's Qlib is the best open-source "AI research + backtest" quant platform, but it does not officially support the Korean market.
  • So I built a small Node.js/TypeScript + Redis middleware that pulls quotes from the TOSS Open API, normalizes them into Qlib's CSV convention, and feeds dump_bin.py.
  • I also wrote a Korean-language "Qlib Getting Started" guide for Korean developers, including a full KRX data-integration section.
  • Next up: a Korea-specialized middleware that also ingests secondary data (corporate disclosures / DART filings, etc.) and is reusable across trading bots — not just Qlib.

Why now? The Korean market opportunity

Korea's stock market has been unusually active lately, and for developers the timing is interesting for one specific reason: TOSS Securities opened an Open API. Historically, Korean retail brokerage automation meant wrestling with legacy Windows-only OCX/COM bridges. A clean, OAuth2-based HTTP API changes the game — it means you can build data pipelines and trading tooling on any stack, on any OS.

Meanwhile, the best open-source quant research stack — Microsoft Qlib — has no first-class Korea support. Its region setting only covers CN / US / TW. That gap is exactly where a middleware belongs.

Qlib is a calculator, not an oracle. No framework saves you from bad data or sloppy methodology. But if the data plumbing is clean, the research loop gets a lot faster.


What is Qlib, quickly

Qlib is Microsoft Research's AI-oriented quantitative investment platform (open-sourced in 2020, 40k+ GitHub stars). It covers the full ML pipeline in one framework: data → factor computation → model training → backtest → reporting.

A few things that make it stand out:

  • All-in-one pipeline. No more gluing zipline (backtest) + backtrader (execution) + a separate factor library.
  • Purpose-built data infra. A binary storage format plus a two-tier cache (ExpressionCache + DatasetCache). In Microsoft's own benchmark (800 symbols × 14 factors, 2007–2020 daily, 1 CPU), the fully-cached path runs in 7.4s vs. 365s for MySQL — roughly 49× faster.
  • Expression-based factor engine. Define a factor as a string like Ref($close, 1)/$close - 1 and the engine handles vectorization + caching for you.
  • A reproducible Model Zoo. 25+ SOTA models (LightGBM, GRU, ALSTM, Transformer, TRA, TFT…) on the same Alpha158 / Alpha360 datasets, comparable under identical backtest conditions.
  • Non-stationarity tooling. Rolling retraining and DDG-DA (meta-learning for concept drift) ship as benchmarks — a Qlib-specific strength.
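To make the expression-engine bullet concrete, here is a plain-Python sketch of what Ref($close, 1)/$close - 1 evaluates to row by row. This is not Qlib code: the helper names ref and prev_over_close_minus_one are hypothetical, and the sketch assumes Qlib's convention that Ref(x, N) with a positive N looks N rows back.

```python
from typing import List, Optional


def ref(series: List[float], n: int) -> List[Optional[float]]:
    """Mimic Ref(x, n) for n > 0: the value n rows earlier, None where there is no history."""
    return [series[i - n] if i - n >= 0 else None for i in range(len(series))]


def prev_over_close_minus_one(close: List[float]) -> List[Optional[float]]:
    """Row-wise equivalent of the expression Ref($close, 1)/$close - 1."""
    prev = ref(close, 1)
    return [None if p is None else p / c - 1 for p, c in zip(prev, close)]


closes = [100.0, 110.0, 99.0]           # hypothetical daily closes
factor = prev_over_close_minus_one(closes)
```

Note that this particular expression divides the previous close by the current one, so a rising price yields a negative value. The conventional simple return would be written $close/Ref($close, 1) - 1.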

The one thing Qlib deliberately leaves out: live broker order execution. That's out of scope by design — which matters for how I scoped the middleware below.


The Korean-developer gap: a KR "Getting Started" guide

Since Qlib's docs and community are largely CN/EN-centric, I wrote a Korean-language getting-started guide aimed at Python developers standing up a quant/ML backtest environment for the first time.

It covers:

  • Project overview, core strengths, and an honest comparison vs. zipline / backtrader / vectorbt / QuantConnect.
  • Install paths (pip / source / Docker), including the Apple Silicon brew install libomp gotcha for LightGBM.
  • The 2026 data reality: the official download script is paused; the guide points to the community investment_data dataset instead.
  • First workflow with qrun, a code-based custom workflow, and the expression engine.
  • Benchmarks (Alpha158 vs Alpha360, DDG-DA dynamic adaptation).
  • A full "Korean developer" section: wiring KRX data into Qlib.
  • Pitfalls — install, data quality, and methodology (look-ahead bias, transaction cost, overfitting, "IC 0.05 is a starting point, not a good number").

Guide (Korean): Qlib-getting-started-KR.md

Connecting KRX data to Qlib

Qlib doesn't officially support Korea, but its dump_bin.py only needs CSV. So the recipe is: collect OHLCV → write CSV in Qlib's convention → convert to Qlib binary.

# pip install pykrx
import os

from pykrx import stock

os.makedirs("csv_kr", exist_ok=True)
tickers = stock.get_market_ticker_list(market="KOSPI")

for t in tickers[:50]:  # first 50 KOSPI tickers, to keep the demo short
    df = stock.get_market_ohlcv("20180101", "20260630", t)
    if df.empty:  # nothing returned for this ticker and date range
        continue
    # pykrx returns Korean column names; map them to Qlib's field names.
    df = df.reset_index().rename(columns={
        "날짜": "date", "시가": "open", "고가": "high",
        "저가": "low", "종가": "close", "거래량": "volume",
    })
    df["symbol"] = t
    df["factor"] = 1.0  # Qlib adjust-price factor; 1.0 if unadjusted
    df.to_csv(f"csv_kr/{t}.csv", index=False)

Then convert the CSVs to Qlib's binary format:

python scripts/dump_bin.py dump_all \
    --csv_path ./csv_kr \
    --qlib_dir ~/.qlib/qlib_data/kr_data \
    --include_fields open,close,high,low,volume,factor \
    --date_field_name date --symbol_field_name symbol
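Before running dump_bin.py, it can help to sanity-check each per-symbol CSV. The following is a minimal sketch, not part of Qlib or the middleware: check_qlib_csv and the REQUIRED list are hypothetical names, the column set simply mirrors the fields used above, and the ordering check assumes dates are written as YYYY-MM-DD so they sort lexicographically.

```python
import csv
import io
from typing import List

# Columns used by the export above (date/symbol plus the --include_fields list).
REQUIRED = ["date", "symbol", "open", "high", "low", "close", "volume", "factor"]


def check_qlib_csv(text: str) -> List[str]:
    """Return a list of problems found in one per-symbol CSV (empty list = looks fine)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    rows = list(reader)
    if not rows:
        return ["no data rows"]
    problems = []
    dates = [r["date"] for r in rows]
    if dates != sorted(dates):
        problems.append("dates are not ascending")
    if len(set(dates)) != len(dates):
        problems.append("duplicate dates")
    if len({r["symbol"] for r in rows}) != 1:
        problems.append("more than one symbol in file")
    for r in rows:
        if not float(r["low"]) <= float(r["close"]) <= float(r["high"]):
            problems.append(f"close outside low/high on {r['date']}")
    return problems
```

Running this over every file in csv_kr/ and refusing to dump when any list is non-empty is one way to keep bad rows out of the binary store.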

Korea-specific checklist (this is where naive ports break):

| Item | Why it matters |
| --- | --- |
| Adjusted price (factor) | Splits/dividends must be reflected or returns get distorted. |
| Trading rules | REG_CN applies China's ±10% limit and T+1; Korea is ±30% and T+0. Customize the executor. |
| Delisting / halts | Survivorship bias: ideally include delisted names. |
| Calendar | Verify the Korean trading-holiday calendar is generated. |
| Alpha158 factors | Factor definitions are market-agnostic, but re-validate on Korean data: a CSI300 IC doesn't guarantee a KOSPI IC. |
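The first row of the checklist is the one that most often bites, so here is a small worked example of how an unadjusted split distorts returns. All numbers are hypothetical (a 5-for-1 split), and the sketch assumes the convention that the adjusted price equals the raw price multiplied by factor.

```python
from typing import List


def simple_returns(prices: List[float]) -> List[float]:
    """Day-over-day simple returns: p[t] / p[t-1] - 1."""
    return [b / a - 1 for a, b in zip(prices, prices[1:])]


# Hypothetical raw closes with a 5-for-1 split between the 2nd and 3rd day.
raw_close = [50000.0, 51000.0, 10300.0, 10403.0]

# Back-adjustment: scale pre-split prices by 1/5; post-split rows keep 1.0.
factor = [0.2, 0.2, 1.0, 1.0]
adj_close = [p * f for p, f in zip(raw_close, factor)]

raw_ret = simple_returns(raw_close)   # contains a fake crash of about -80% at the split
adj_ret = simple_returns(adj_close)   # the same day shows a gain of roughly 1%
```

A model trained on raw_ret would learn from a crash that never happened, which is why a constant factor of 1.0 is only safe when the price series you export is already adjusted.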

The middleware: TOSS Open API → Qlib

Rather than cram OAuth2, token expiry, rate limits, and pagination into the same Python codebase that does factor research, I split concerns. A separate middleware drops normalized CSVs; Qlib just consumes them.

TOSS Open API  --OAuth2-->  [Node.js/TS middleware]  --CSV(csv_kr/*.csv)-->  scripts/dump_bin.py  -->  ~/.qlib/qlib_data/kr_data
                                   |
                                 Redis (token cache + market data cache)

Middleware (English README): toss-qlib-middleware/README_EN.md

Authentication (confirmed spec)

| Item | Detail |
| --- | --- |
| Flow | OAuth2 Client Credentials Grant (no user login step) |
| Token issuance | POST {TOSS_BASE_URL}/oauth2/token with grant_type / client_id / client_secret as a form-urlencoded body (not Basic Auth) |
| Lifetime | 86,400s (24h), no refresh token; you must re-issue with the client secret before expiry |
| Call header | Authorization: Bearer {access_token}. Account/order APIs need an extra X-Tossinvest-Account header (not called here) |

I verified the endpoint by actually hitting POST /oauth2/token: even with wrong credentials it returns a real {"error":"invalid_client", ...}, which confirms the path and request shape. (As of mid-2026 the service is still in a pre-registration phase, so the candle/price field schemas are unconfirmed and the middleware parses them defensively.)
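The token request described in the table can be sketched in Python using only the standard library. The actual middleware is TypeScript, so treat this as an illustration of the request shape only: build_token_request and bearer_header are hypothetical names, the base URL is a placeholder, and the grant_type value client_credentials is the standard OAuth2 value for this flow rather than something taken from TOSS documentation.

```python
import urllib.parse
import urllib.request
from typing import Dict


def build_token_request(base_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the client-credentials token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    # Credentials travel in the form body, so no Authorization header is set here.
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/oauth2/token",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_header(access_token: str) -> Dict[str, str]:
    """Header to attach to subsequent API calls."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; nothing is sent over the network.
req = build_token_request("https://toss-api.example", "my-id", "my-secret")
```

Passing req to urllib.request.urlopen would perform the call. Keeping construction separate from sending makes the request shape easy to unit-test without credentials.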

Redis caching strategy

| Cached item | Key | TTL | Reason |
| --- | --- | --- | --- |
| Access token | toss:access_token | 86,400s − safety margin | No refresh token → re-issue well before expiry |
| Token refresh lock | toss:access_token:lock | 10s (SET NX) | Stops a thundering herd of simultaneous re-issues |
| Finalized past candles | toss:candles:{symbol}:{interval}:{start}:{end} | 1 day | Closed candles never change |
| Today's candles | same key | 30s | Values keep updating intraday |
| Current price | toss:price:{symbol} | 5s | Fresh, per-symbol so batches reuse hits |
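The token and lock rows of this table can be sketched as follows. This is a Python illustration of the idea, not the middleware's TypeScript code: MemoryCache is a hypothetical in-memory stand-in for the few Redis commands involved, SAFETY_MARGIN_S is an assumed value (the article does not state one), and get_token and candle_ttl_seconds are hypothetical names.

```python
import time
from datetime import date
from typing import Callable, Dict, Optional, Tuple


class MemoryCache:
    """In-memory stand-in for Redis GET, SET with EX/NX, and DEL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None or item[1] <= self._clock():
            self._data.pop(key, None)      # expired or absent
            return None
        return item[0]

    def set(self, key: str, value: str, ex: float, nx: bool = False) -> bool:
        if nx and self.get(key) is not None:
            return False                   # NX: refuse to overwrite a live key
        self._data[key] = (value, self._clock() + ex)
        return True

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


TOKEN_KEY = "toss:access_token"
LOCK_KEY = "toss:access_token:lock"
TOKEN_LIFETIME_S = 86400
SAFETY_MARGIN_S = 300   # assumption for this sketch


def get_token(cache: MemoryCache, issue: Callable[[], str],
              wait_s: float = 0.05, attempts: int = 100) -> str:
    """Return the cached token, letting only one caller re-issue it at a time."""
    for _ in range(attempts):
        token = cache.get(TOKEN_KEY)
        if token is not None:
            return token
        if cache.set(LOCK_KEY, "1", ex=10, nx=True):   # we won the 10s lock
            try:
                token = issue()
                cache.set(TOKEN_KEY, token, ex=TOKEN_LIFETIME_S - SAFETY_MARGIN_S)
                return token
            finally:
                cache.delete(LOCK_KEY)
        time.sleep(wait_s)   # another caller is re-issuing; poll for its result
    raise TimeoutError("token was not issued in time")


def candle_ttl_seconds(range_end: date, today: date) -> int:
    """1 day for ranges that ended before today, 30s if today is included."""
    return 86400 if range_end < today else 30
```

With a real Redis, the lock release should also verify ownership (for example with a random lock value) so that a slow caller cannot delete a lock another caller acquired after the 10s expiry.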

On 401 the cached token is invalidated and the request is retried once; on 429 the client backs off for the duration given in the Retry-After header. The candles endpoint returns at most 200 rows and has no start/end filter, so the middleware paginates backward with a before cursor, then returns the merged result sorted ascending.
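Both behaviours in the paragraph above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the middleware's code: fetch_range and call_with_retry are hypothetical names, candles are assumed to carry a "date" string in YYYY-MM-DD form, the before cursor is assumed to mean "strictly older than this date", and Retry-After is assumed to hold a number of seconds.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Candle = Dict[str, object]   # assumed shape: at least {"date": "YYYY-MM-DD"}
PAGE_LIMIT = 200             # per the article: at most 200 rows per call


def fetch_range(fetch_page: Callable[[Optional[str]], List[Candle]],
                start: str, end: str) -> List[Candle]:
    """Page backward from the newest candles until `start` is covered."""
    by_date: Dict[str, Candle] = {}
    before: Optional[str] = None          # None = begin with the most recent page
    while True:
        page = fetch_page(before)
        if not page:
            break
        for candle in page:
            by_date[str(candle["date"])] = candle   # de-duplicate on date
        oldest = min(str(c["date"]) for c in page)
        reached_start = oldest <= start
        last_page = len(page) < PAGE_LIMIT
        stuck = before is not None and oldest >= before   # cursor failed to move
        if reached_start or last_page or stuck:
            break
        before = oldest
    # Merge, trim to the requested window, and return in ascending date order.
    return [by_date[d] for d in sorted(by_date) if start <= d <= end]


Response = Tuple[int, Dict[str, str], str]   # (status, headers, body)


def call_with_retry(send: Callable[[], Response],
                    invalidate_token: Callable[[], None],
                    sleep: Callable[[float], None] = time.sleep,
                    max_throttled: int = 3) -> Tuple[int, str]:
    """Retry once on 401 after dropping the cached token; back off on 429."""
    retried_auth = False
    throttled = 0
    while True:
        status, headers, body = send()
        if status == 401 and not retried_auth:
            invalidate_token()
            retried_auth = True
            continue
        if status == 429 and throttled < max_throttled:
            sleep(float(headers.get("Retry-After", "1")))
            throttled += 1
            continue
        return status, body
```

Injecting fetch_page, send, and sleep as callables keeps both functions testable without a network or a real clock, in the same spirit as the in-memory adapter the test suite uses.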

API surface

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /api/candles/:symbol?start=&end=&interval=day | Normalized candle JSON (Redis-cached, paginated via the before cursor) |
| GET | /api/prices?symbols=005930,000660 | Batch current-price lookup (chunked at 200) |
| POST | /api/export/qlib (body: {symbols, start, end, outDir?}) | Fetch symbols → write csv_kr/{symbol}.csv |
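The "chunked at 200" note on /api/prices amounts to splitting the symbol list into batches before calling upstream. A minimal Python sketch of that step, with hypothetical helper names parse_symbols and chunked, might look like this:

```python
from typing import Iterator, List


def parse_symbols(query_value: str) -> List[str]:
    """Split '005930,000660' into a de-duplicated list that keeps first-seen order."""
    cleaned = (s.strip() for s in query_value.split(","))
    return list(dict.fromkeys(s for s in cleaned if s))


def chunked(symbols: List[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` symbols."""
    for i in range(0, len(symbols), size):
        yield symbols[i:i + size]
```

De-duplicating before chunking means a repeated symbol costs one upstream lookup at most, which fits the per-symbol toss:price:{symbol} cache key.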

Or skip the server and export CSV straight from the CLI:

npm run export:qlib -- --symbols 005930,000660 --start 2020-01-01 --end 2026-07-01

Quick start

cd TechDoc/Quant_Qlib/toss-qlib-middleware
npm install
npm run setup      # interactively creates .env, optionally test-issues a real token
npm run typecheck
npm test           # passes WITHOUT Redis (in-memory adapter validates the logic)
npm run dev        # http://localhost:4000, requires a real Redis instance

Why trading (order execution) is intentionally out of scope

Auth and market-data retrieval are common needs that look nearly identical for everyone — perfect for shared middleware. Order logic (state tracking, dedupe-on-retry, risk limits, fill confirmation) varies completely by strategy and risk tolerance, so shipping it generically would be irresponsible. Qlib itself leaves live execution out too, and a backtest never guarantees live performance. The middleware exposes TossAuthService + TossApiClient as clean extension points if you want to add orders yourself — but test with tiny/paper trades first.


What's next: secondary data + bot-agnostic

The current middleware stops at price/candle data. The roadmap is a Korea-specialized middleware that also ingests secondary data — corporate disclosures and filings (e.g. DART), so strategies can react to events, not just prices.

And crucially: this middleware won't be Qlib-only. The normalization layer is generic enough that the same authenticated, cached, rate-limit-aware data feed can back any trading bot — Qlib is just the first consumer. Think of it as a reusable Korea-market data plane: one integration, many downstream engines.

If you're building anything on the Korean market with Python or TypeScript, I'd love feedback on which secondary datasets matter most to you.


Not investment advice. This is data-pipeline tooling; investment decisions and their consequences are your own.