DEV Community: KX

godzilla.dev - AI Quant Trader Series - Day 12 - How Matching Engines Work?

KX — Tue, 21 Jul 2026 13:44:36 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_12/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 12

How Matching Engines Work
Reading time: ~15 minutes
Prerequisites: What is High Frequency Trading, What is Market Microstructure, What is an Order Book, What is Market Data
Focus: understanding the core engine behind every electronic exchange

Part 1: Introduction
Every electronic exchange has one component responsible for turning orders into trades.

The Matching Engine.

Whether you are trading:

Stocks
Futures
Options
ETFs
Cryptocurrencies
every order the exchange accepts eventually reaches the matching engine.

Its responsibility is surprisingly simple:

Receive orders, match buyers with sellers, and update the market.
Despite this simple objective, the matching engine is one of the most performance-critical software systems ever built.

Modern exchanges process hundreds of thousands—or even millions—of orders every second while maintaining strict fairness and deterministic behavior.

Part 2: What is a Matching Engine?
A matching engine is the core software component of an electronic exchange.

It continuously receives:

New Orders
Cancel Orders
Modify Orders
and determines whether a trade should occur.

Whenever a compatible buy and sell order exist, the matching engine executes the trade automatically.

Everything happens electronically.

There are no human traders approving transactions.

Part 3: The Matching Process
Suppose the order book currently contains:

ASK

101.20 5

101.10 10

100.90 8

100.80 12

BID
A trader submits:

Buy 10 @ 101.10
The matching engine immediately compares the incoming order against the best available sell orders.

Buy 10 @ 101.10

↓

Matches

Sell 10 @ 101.10

↓

Trade Executed
The remaining order book is updated automatically.

This entire process usually completes in microseconds.
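The matching step above can be sketched in a few lines of Python. This is a minimal illustration, not how a production engine is written: the function name `match_buy` and the dict-based book are assumptions made for the example, and only the ask side of the sample book is modeled.

```python
from decimal import Decimal

def match_buy(asks, limit_price, quantity):
    """Match an incoming buy limit order against resting asks.

    asks: dict mapping price -> resting size (mutated in place).
    Returns (fills, remaining_quantity).
    """
    fills = []
    # Best ask first: walk ask prices from lowest to highest.
    for price in sorted(asks):
        if quantity == 0 or price > limit_price:
            break
        traded = min(quantity, asks[price])
        fills.append((price, traded))
        quantity -= traded
        asks[price] -= traded
        if asks[price] == 0:
            del asks[price]  # level fully consumed, remove it from the book
    return fills, quantity

# Ask side of the sample book above.
asks = {Decimal("101.20"): 5, Decimal("101.10"): 10}
fills, remaining = match_buy(asks, Decimal("101.10"), 10)
print(fills, remaining, asks)
```

The 10 lots resting at 101.10 are consumed, nothing remains of the incoming order, and 101.20 becomes the new best ask.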

Part 4: Order Types
Matching engines typically support several order types.

Market Order
Execute immediately.

The engine consumes the best available liquidity.

Market Buy

↓

Execute Now
Limit Order
Execute only if a specified price is available.

Otherwise, the order rests inside the order book.

Buy 100 @ 99.50
If no seller accepts that price,

the order simply waits.

Cancel Order
Removes an existing order from the order book.

No trade occurs.

Liquidity decreases.

Modify Order
Changes the price or quantity of an existing order.

Many exchanges internally implement this as:

Cancel

New Order
Part 5: Price-Time Priority
Most electronic exchanges follow one matching rule.

Price-Time Priority

This means:

Higher bid prices execute first.

Lower ask prices execute first.

If multiple orders exist at the same price,

the earliest submitted order executes first.

Example:

Trader A

Buy

100

09:30:01

Trader B

Buy

100

09:30:03
Trader A receives priority.

This rule guarantees fairness and deterministic execution.
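Time priority at a single price is essentially a first-in, first-out queue. The sketch below assumes a hypothetical `PriceLevel` class and made-up quantities (the example above only gives price and time); it shows Trader A being filled before Trader B at the same price.

```python
from collections import deque

class PriceLevel:
    """Resting orders at one price, served first-in, first-out."""

    def __init__(self):
        self.queue = deque()  # (trader, quantity), oldest order on the left

    def add(self, trader, quantity):
        self.queue.append((trader, quantity))

    def fill(self, quantity):
        """Fill up to `quantity` against the queue; return [(trader, filled)]."""
        fills = []
        while quantity > 0 and self.queue:
            trader, resting = self.queue[0]
            traded = min(quantity, resting)
            fills.append((trader, traded))
            quantity -= traded
            if traded == resting:
                self.queue.popleft()  # order fully filled
            else:
                self.queue[0] = (trader, resting - traded)  # keeps its place
        return fills

level = PriceLevel()
level.add("A", 100)  # arrived 09:30:01
level.add("B", 100)  # arrived 09:30:03
fills = level.fill(150)
print(fills, list(level.queue))
```

An incoming sell of 150 fills Trader A completely before Trader B receives anything, and B's unfilled remainder stays at the front of the queue.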

Part 6: Partial Fills
Not every order executes completely.

Suppose the order book contains:

Sell

101
A trader submits:

Buy

101
The result becomes:

Executed

Remaining

5
The remaining quantity either:

waits in the order book
or

continues matching against the next price levels
depending on the order type.
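A small sketch of that branching, under stated assumptions: the function name `apply_buy`, the two order-type strings, and the quantities (10 resting, 15 incoming) are all illustrative, chosen so that 5 remains as in the example above.

```python
def apply_buy(ask_size, order_qty, order_type):
    """Fill a buy order against a single ask level.

    Returns (executed, resting, carried_over):
      executed     - quantity traded at this level
      resting      - quantity left waiting in the order book
      carried_over - quantity that keeps matching at the next ask level
    """
    executed = min(ask_size, order_qty)
    leftover = order_qty - executed
    if order_type == "limit":
        return executed, leftover, 0
    if order_type == "market":
        return executed, 0, leftover
    raise ValueError(f"unknown order type: {order_type}")

print(apply_buy(10, 15, "limit"))   # remainder rests at the limit price
print(apply_buy(10, 15, "market"))  # remainder moves on to the next level
```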

Part 7: Matching Engine Architecture
A simplified exchange architecture looks like:

Client

↓

Gateway

↓

Risk Checks

↓

Matching Engine

↓

Trade

↓

Market Data Feed

↓

Participants
Every successful trade generates new market data.

That market data is immediately distributed back to every participant.

This feedback loop runs continuously throughout the trading day.

Part 8: Why Matching Engines Must Be Fast
Imagine an exchange processing:

2 million

orders

per second
The matching engine must:

Validate orders
Maintain the order book
Match orders
Generate trades
Publish market data
without adding meaningful latency.

Every additional microsecond affects every market participant.

This is why matching engines are typically written in:

C++
Rust
Java (low-latency implementations)
with careful optimization of:

CPU cache usage
Memory allocation
Lock-free data structures
Network I/O
Part 9: Determinism Is More Important Than Speed
Many beginners believe the fastest matching engine is always the best.

In reality,

professional exchanges prioritize:

Correctness
Fairness
Deterministic execution
A matching engine that is faster on average but occasionally pauses for 10 milliseconds is far more dangerous than one that consistently responds within 50 microseconds.

Consistency builds trust.

Determinism builds reliable markets.
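To see why averages hide the problem, compare two hypothetical latency profiles. The numbers below are invented for illustration (the 5 microsecond baseline in particular is an assumption); only the 50 microsecond and 10 millisecond figures come from the text above.

```python
import statistics

# Hypothetical latency samples, in microseconds.
steady = [50.0] * 1000            # always 50 us
spiky = [5.0] * 999 + [10_000.0]  # usually 5 us, with one 10 ms pause

steady_mean = statistics.fmean(steady)
spiky_mean = statistics.fmean(spiky)
print(f"steady: mean={steady_mean:.3f} us, worst={max(steady):.0f} us")
print(f"spiky:  mean={spiky_mean:.3f} us, worst={max(spiky):.0f} us")
```

The spiky engine wins on mean latency, yet its worst case is 200 times slower than the steady engine's. That is why tail latency, not the average, is the number to watch.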

Part 10: Matching Engine vs Trading Engine
These two terms are often confused.

A Matching Engine belongs to the exchange.

Its job is to match orders.

A Trading Engine belongs to the trader.

Its job is to:

Receive market data
Generate trading signals
Manage positions
Send orders
The trading engine never decides how orders are matched.

That responsibility belongs entirely to the exchange.

Part 11: Where godzilla.dev Fits
Although godzilla.dev is not an exchange,

many of its architectural principles are inspired by exchange design.

Professional trading systems require:

High-performance market data processing
Local order book maintenance
Deterministic event processing
Risk management
Ultra-low latency order routing
These components interact continuously with external matching engines.

Rather than implementing exchange logic itself, godzilla.dev provides the infrastructure required to build production-grade trading systems capable of interacting with modern electronic markets efficiently.

Part 12: Key Takeaways
The matching engine is the heart of every electronic exchange.

It continuously:

Receives orders
Maintains the order book
Matches buyers and sellers
Executes trades
Publishes market data
Modern financial markets would not exist without highly optimized matching engines.

Understanding how they operate is essential for anyone building professional quantitative trading systems.

What's Next?
The next article explores how trading systems communicate with exchanges:

What is an Exchange Gateway?

godzilla.dev - AI Quant Trader Series - Day 11 - What is Market Data?

KX — Tue, 14 Jul 2026 14:26:54 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_11/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 11

What is Market Data?
Reading time: ~15 minutes
Prerequisites: What is High Frequency Trading, What is Market Microstructure, What is an Order Book
Focus: understanding the data flowing through modern electronic trading systems

Part 1: Introduction
Every quantitative trading system begins with one thing.

Market Data.

Before a strategy can decide whether to buy or sell, it must first understand the current state of the market.

That information comes from market data.

Whether you are trading:

Stocks
Futures
Options
ETFs
Cryptocurrencies
every trading decision ultimately depends on a continuous stream of market events.

For High Frequency Trading, market data is not just information.

It is the raw material from which every trading opportunity is created.

Part 2: What is Market Data?
Market Data is the real-time information published by an exchange describing everything happening in the market.

Typical market data includes:

Best Bid
Best Ask
Trade Price
Trade Size
Order Book Updates
Volume
Market Status
Instrument Information
Every update represents a new event occurring inside the exchange.

Unlike historical datasets, market data never stops arriving.

It is an infinite stream of events.

Part 3: Types of Market Data
Modern exchanges usually provide several categories of market data.

Trade Data
Trade data records completed transactions.

Example:

Price: 101.20

Quantity: 5 BTC

Time: 09:30:15.123456
Trade data answers one question:

What actually traded?
Quote Data
Quote data describes the current market.

Typical information includes:

Best Bid
Bid Size
Best Ask
Ask Size
Example:

Bid

101.18

Size 25

Ask

101.20

Size 40
Most execution algorithms continuously monitor quote updates.

Order Book Data
Rather than publishing only the best prices,

many exchanges provide multiple price levels.

Example:

Ask

101.30

101.20

101.10

100.90

100.80

100.70

Bid
This information allows trading systems to reconstruct the entire local order book.

Part 4: Snapshot vs Incremental Updates
Exchanges generally publish market data in two formats.

Snapshot
A snapshot contains the complete market state.

Example:

Entire Order Book

↓

One Message
Snapshots are simple but expensive to transmit frequently.

Incremental Updates
Incremental updates publish only changes.

Example:

Before

101.20

Size 30

↓

Update

Size 18
Only the modified information is transmitted.

Nearly every modern HFT platform relies primarily on incremental updates because they minimize bandwidth and latency.
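Applying an incremental update to a local book can be sketched as follows. The function name `apply_update` is hypothetical, and treating a size of 0 as "level removed" is a common convention that is assumed here rather than taken from any particular exchange protocol.

```python
from decimal import Decimal

def apply_update(levels, price, size):
    """Apply one incremental update to one side of a local book.

    levels: dict mapping price -> size.
    A size of 0 is assumed to mean the price level was removed.
    """
    if size == 0:
        levels.pop(price, None)
    else:
        levels[price] = size

asks = {Decimal("101.20"): 30}
apply_update(asks, Decimal("101.20"), 18)  # the update from the example above
print(asks)
```

Only one price and one size cross the wire, while the rest of the local book stays untouched.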

Part 5: Market Data Feed
Exchanges distribute market data through specialized data feeds.

A simplified architecture looks like:

Exchange

↓

Market Data Feed

↓

Decoder

↓

Local Order Book

↓

Trading Strategy
The market data feed is responsible for delivering every market event to participants as quickly as possible.

For High Frequency Trading,

the market data feed is often the most latency-sensitive component of the entire system.

Part 6: Why Latency Matters
Imagine two trading firms receive the same market update.

Firm A processes the update in:

8 μs
Firm B processes it in:

120 μs
Both firms observe the same opportunity.

Only one is likely to execute first.

This is why HFT engineers spend enormous effort optimizing:

Message parsing
Memory allocation
Cache locality
Lock-free queues
Network I/O
Every microsecond matters.

Part 7: Market Data Processing
Receiving market data is only the beginning.

A production trading system must also:

Decode exchange protocols
Validate messages
Handle sequence numbers
Detect packet loss
Recover missing data
Maintain synchronization
Update the local order book
These operations occur continuously throughout the trading day.

For active markets, this may involve millions of messages every second.
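Sequence handling and packet-loss detection from the list above can be illustrated with a small tracker. This is a simplified sketch: the class name `SequenceTracker` and its return values are assumptions, and a real feed handler would also buffer out-of-order packets and trigger a recovery request instead of only reporting the gap.

```python
class SequenceTracker:
    """Tracks feed sequence numbers and reports gaps."""

    def __init__(self):
        self.expected = None  # next sequence number we expect to see

    def on_message(self, seq):
        """Classify one message as 'ok', 'duplicate', or 'gap'.

        Returns (status, missing), where missing lists the skipped
        sequence numbers that would need to be recovered.
        """
        if self.expected is None or seq == self.expected:
            self.expected = seq + 1
            return "ok", []
        if seq < self.expected:
            return "duplicate", []  # already processed, or arrived late
        missing = list(range(self.expected, seq))
        self.expected = seq + 1
        return "gap", missing

tracker = SequenceTracker()
results = [tracker.on_message(seq) for seq in (1, 2, 5, 4, 6)]
print(results)
```

When message 5 arrives right after message 2, the tracker reports 3 and 4 as missing. Until they are recovered, the local order book cannot be trusted.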

Part 8: Local Market Data
Professional trading systems rarely query the exchange whenever market information is needed.

Instead, they maintain an in-memory representation of the market.

Exchange

↓

Market Data Feed

↓

Incremental Updates

↓

Local Memory

↓

Trading Strategy
Strategies then read data directly from memory.

This architecture eliminates unnecessary network latency and dramatically improves performance.

Part 9: Market Data in High Frequency Trading
For long-term investors,

market data is simply information.

For HFT systems,

market data is an event stream.

Strategies react to:

New trades
Quote changes
Order book updates
Liquidity changes
Spread changes
Market imbalance
Many HFT strategies process thousands of events before placing a single order.

Understanding event flow is often more important than predicting future prices.

Part 10: Where godzilla.dev Fits
Efficient market data processing is one of the foundations of every ultra-low latency trading platform.

A production implementation must:

Decode exchange messages
Process incremental updates
Maintain local market state
Synchronize order books
Distribute events across multiple strategies
Minimize memory copies
Maintain deterministic latency
These requirements define much of the architecture behind godzilla.dev.

Rather than rebuilding market data infrastructure for every project, developers can focus on strategy research while relying on a modular, high-performance framework designed for modern electronic markets.

Part 11: Key Takeaways
Market Data is the real-time information published by exchanges describing market activity.

It includes:

Trades
Quotes
Order Book Updates
Market Status
Instrument Information
Professional trading systems transform this continuous stream of events into an in-memory representation of the market, allowing strategies to react with minimal latency.

Understanding market data is the first step toward building production-grade trading infrastructure.

What's Next?
The next article explores the component responsible for turning incoming orders into completed trades:

How Matching Engines Work

That 300% funding APR is not free money: screening for squeeze traps on Binance perpetuals

KX — Wed, 08 Jul 2026 07:14:56 +0000

In late December 2023, TRB went from around $200 to over $600 in a single session on Binance, then collapsed just as fast. Hundreds of millions in short positions were liquidated on the way up. In the days before the move, TRB's perpetual had been flashing exactly the kind of numbers that make a funding rate arbitrageur's eyes light up: deeply skewed funding, fat annualized carry, seemingly free money for anyone willing to take the other side.

Anyone who took that carry trade got carried out.

This post is about the screening layer I run before any funding rate position goes on: a Python script that scores every USDT perpetual on Binance against five structural risk signals, using only public endpoints. No API keys, no paid data. The full script is at the end; the interesting part is why each signal works.

The trap, mechanically

The standard funding rate arbitrage is delta-neutral on paper. Funding is printing high positive? Short the perp, buy spot, collect the payments. Deeply negative? Long the perp, hedge with a spot short. Price risk cancels, funding accrues. In liquid majors this is a boring, capacity-constrained, mostly honest trade.

In a low-float token it can be bait.

Here is the failure mode. A token has a small circulating supply, and a large share of that float sits with a few wallets. Extreme funding appears — often because positioning is already crowded on one side. Carry traders pile in, shorting the perp against spot. Then the float holders push the spot price, hard. Three things break at once:

The perp leg gets margin-called before the hedge helps you. Your spot leg is profitable, but it's sitting in a wallet; your perp short is sitting on an exchange with leverage, being marked against a price that someone else controls. Liquidation doesn't wait for your rebalancing script.
The basis blows out. Perp and spot are supposed to converge via funding. During a squeeze the perp can trade at absurd premiums for hours. "Delta-neutral" assumes the two legs move together; in the tail they don't.
The exit is a door one person wide. Thin spot books mean unwinding the hedge leg costs you a chunk of the carry you came for — if you can unwind at all.

The nasty part: the trade looks most attractive exactly when it is most dangerous. Extreme funding is both the lure and the symptom. So the job of a screener is not to find high funding — that's one API call — but to answer a different question: is this a market where the other side can hurt me on purpose?

Five signals

The screener scores each perpetual on five structural signals. Each one has a "warning" and a "danger" threshold; warning adds 1 point, danger adds 2. The thresholds come from going back over historical squeeze events and asking what the market looked like before the candle.

1. Open interest vs. circulating market cap

oi_mcap_ratio = open_interest_notional / circulating_market_cap

This is the single most informative number. If the notional value of open perpetual contracts exceeds half the circulating market cap of the underlying token, the derivative tail is wagging the spot dog. Above 1.0 — more paper exposure than actual float value — a determined actor doesn't need to fight the market to move the mark price; the market is smaller than the bet on it.

Warning above 0.5, danger above 1.0. For reference, majors like BTC and ETH sit far below these levels; the tokens that end up in squeeze post-mortems almost always screened hot here first.

2. Perp-to-spot volume ratio

perp_spot_ratio = perp_volume_24h / spot_volume_24h

Healthy markets discover price on spot and lever it on derivatives. When 24h perp volume runs 15–40x spot volume, price discovery has effectively moved to the perp, and the spot print — the thing your hedge depends on, and often the input to the mark price — is thin enough to be pushed cheaply. Warning above 15, danger above 40.

There's an important edge case here, covered below: tokens whose perp trades on Binance but whose spot doesn't.

3. Funding rate extremity

funding_apr = last_funding_rate * 3 * 365   # 8h funding, annualized

Annualized funding above 100% is a warning; above 300% is danger territory. Not because the carry isn't real — it is, for as long as it lasts — but because triple-digit APRs don't survive in efficient markets. If the number looks like a DeFi farm from 2021, someone is being paid that much to hold a position nobody sane wants, and you should ask why the other side is this desperate.

The scoring function applies abs() to this value: deeply negative funding is scored the same as deeply positive. Squeezes come in both directions.
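To get a feel for what those thresholds mean per funding period, here is a small worked sketch. The function names and the `interval_hours` parameter are my additions for illustration. The formula above hard-codes 8h funding, and the parameter only generalizes that in case a contract settles funding on a different schedule.

```python
def funding_apr(last_funding_rate, interval_hours=8):
    """Annualize a per-period funding rate (simple, non-compounded)."""
    periods_per_day = 24 / interval_hours
    return last_funding_rate * periods_per_day * 365

def funding_flag(apr):
    """Map an annualized rate (1.0 == 100%) to the thresholds above."""
    magnitude = abs(apr)  # negative funding scores the same as positive
    if magnitude > 3.0:
        return "danger"
    if magnitude > 1.0:
        return "warning"
    return "ok"

for rate in (0.0001, 0.001, -0.003):
    apr = funding_apr(rate)
    print(f"{rate:+.4%} per 8h -> {apr:+.1%} APR -> {funding_flag(apr)}")
```

Put differently, on an 8h schedule the warning line sits at roughly 0.09% per funding period and the danger line at roughly 0.27%.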

4. Listing age

Contracts live for under 60 days score danger; under 180, warning. New listings combine every risk factor: no funding history to judge what "normal" looks like, concentrated early float, immature spot liquidity, and maximum attention from exactly the kind of trader who runs squeezes. A disproportionate share of historical trap events happened within the first two quarters of a perp's life.

5. Circulating market cap, absolute

Below $100M warning, below $30M danger. Small caps aren't automatically manipulated, but manipulation is a fixed-cost business — the smaller the float, the cheaper the squeeze. This signal overlaps with signal 1 by construction, and that's intentional: a token that trips both is small and over-levered, which is the full trap setup.

Bonus signal: no market cap data at all

If a token's perp trades on Binance but the token doesn't rank in CoinGecko's top ~2000 by market cap, the screener can't compute signals 1 and 5 — and that absence is itself worth a point. A perpetual contract on an asset too small or too new to have reliable supply data is not where a delta-neutral strategy goes to earn a quiet carry.

The unglamorous parts (where the bugs live)

Three implementation details caused more trouble than the actual scoring logic.

Symbol collision on CoinGecko. Matching Binance base assets to CoinGecko entries by ticker symbol is a minefield — ticker symbols aren't unique, and a $20M token can share a symbol with a $2B one. The screener pulls CoinGecko's markets endpoint ordered by market cap descending and keeps the first (largest) match per symbol:

for c in data:
    sym = (c.get("symbol") or "").lower()
    if sym in wanted and c.get("market_cap"):
        # on symbol collision, keep the highest-mcap match
        mcap.setdefault(sym, float(c["market_cap"]))

This biases toward under-flagging (you might attribute a big token's mcap to a small impostor and miss a trap). For a screener whose job is to justify a "no", that is the wrong direction to err in, so know the limitation.

The 1000x prefix. Binance lists some low-price tokens as 1000PEPE, 1000SHIB etc. — the contract multiplies the price by 1000. For market cap lookups the prefix has to be stripped:

cg_base = base[4:] if base.startswith("1000") else base

Miss this and every 1000-prefixed contract silently gets zero market cap and a spurious flag.
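One caveat with `base[4:]`: it assumes the multiplier prefix is exactly four characters. A slightly more defensive variant, sketched below, strips any leading 1 followed by three or more zeros. The function name `coingecko_base` is hypothetical, the longer-prefix case (something like 1000000MOG) is my assumption about how such contracts would be named, and this is not what the linked script does.

```python
import re

# A "1" followed by three or more zeros, immediately before a letter.
_MULTIPLIER_PREFIX = re.compile(r"^10{3,}(?=[A-Z])")

def coingecko_base(base):
    """Strip a leading multiplier prefix from a Binance base asset."""
    return _MULTIPLIER_PREFIX.sub("", base)

for base in ("1000PEPE", "1000SHIB", "1000000MOG", "BTC", "1INCH"):
    print(base, "->", coingecko_base(base))
```

Tickers that merely start with a digit, such as 1INCH, are left alone because the pattern requires at least three zeros after the 1.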

Orphan perps. Some perpetuals trade on Binance futures with no corresponding Binance spot pair at all. For these, the perp/spot ratio is set to infinity rather than skipped:

if r.spot_vol > 0:
    r.perp_spot_ratio = r.perp_vol / r.spot_vol
elif r.perp_vol > 0:
    r.perp_spot_ratio = float("inf")  # perp exists, spot doesn't

A perp whose hedge leg would have to live on a different exchange is a materially worse trade — cross-exchange transfer time is exactly the window in which a squeeze kills you. Infinity, not N/A.

Scoring and output

The scoring function is deliberately dumb — transparent beats clever in a risk filter:

def score_row(r: Row) -> None:
    def add(cond_hi, cond_mid, name):
        if cond_hi:
            r.score += 2
            r.flags.append(f"{name}!!")
        elif cond_mid:
            r.score += 1
            r.flags.append(name)

    add(r.oi_mcap_ratio > 1.0, r.oi_mcap_ratio > 0.5, "OI/MCAP")
    add(r.perp_spot_ratio > 40, r.perp_spot_ratio > 15, "PERP/SPOT")
    add(abs(r.funding_apr) > 3.0, abs(r.funding_apr) > 1.0, "EXTREME_FUNDING")
    add(0 < r.listing_days < 60, 60 <= r.listing_days < 180, "NEW_LISTING")
    add(0 < r.mcap < 3e7, 3e7 <= r.mcap < 1e8, "MICRO_CAP")
    if r.mcap == 0:
        r.score += 1
        r.flags.append("NO_MCAP_DATA")

Maximum score is 10 in practice: the 11th point, NO_MCAP_DATA, fires only when mcap is 0, which rules out MICRO_CAP. I treat the bands roughly as: 0–1 normal market, size the carry trade on its own merits; 2–3 proceed with reduced size and tighter liquidation buffers; 4+ the funding is not the opportunity, it's the advertisement. The point of a screener is to make "no" cheap.

Running it takes about a minute for the full universe (the per-symbol open interest endpoint is the bottleneck; a small thread pool keeps it tolerable, and CoinGecko's free tier wants a 1.2s pause between pages):

$ python trap_screener.py --min-score 3
Fetching contract list...
530 perpetuals, fetching tickers...
Fetching market caps (CoinGecko)...
Fetching open interest...

CONTRACT         RISK  MCAP($M)  OI/MCAP    P/S  FUND APR  AGE(d)  FLAGS
------------------------------------------------------------------------
SLXUSDT             6      45.8     0.18    inf     -113%      37  PERP/SPOT!!,EXTREME_FUNDING,NEW_LISTING!!,MICRO_CAP
DATAIPUSDT          6       0.0     0.00    inf     -104%       5  PERP/SPOT!!,EXTREME_FUNDING,NEW_LISTING!!,NO_MCAP_DATA
EVAAUSDT            6      20.4     1.82    inf       40%     277  OI/MCAP!!,PERP/SPOT!!,MICRO_CAP!!
STARUSDT            6      27.3     0.05    inf        5%      55  PERP/SPOT!!,NEW_LISTING!!,MICRO_CAP!!
CTRUSDT             6      12.9     0.08    inf        5%      40  PERP/SPOT!!,NEW_LISTING!!,MICRO_CAP!!
GWEIUSDT            5     202.4     0.06    inf     -314%     160  PERP/SPOT!!,EXTREME_FUNDING!!,NEW_LISTING
ARXUSDT             5      38.4     0.08    inf      -49%      15  PERP/SPOT!!,NEW_LISTING!!,MICRO_CAP
BIRBUSDT            5      17.0     0.10    inf       -8%     160  PERP/SPOT!!,NEW_LISTING,MICRO_CAP!!
...

What this doesn't catch

Honesty section. The screener reads market structure; it cannot see intent. It won't catch a coordinated squeeze on a mid-cap with healthy-looking ratios, an exchange listing announcement that turns structure upside down in an hour, or unlock-schedule cliffs (that data lives elsewhere and is worth adding). It also scores a snapshot — a token can screen clean at noon and be a trap by dinner. This is a pre-trade filter, not a substitute for position-level risk management: liquidation buffers, basis monitoring, and an exit plan sized to actual spot depth.

The full script (~200 lines, requests is the only dependency) is here: https://github.com/godzilla-foundation/godzilla-community/blob/main/strategies/trap_screener/trap_screener.py. Run it, argue with the thresholds, send patches.

I maintain godzilla.dev, an open-source C++/Python framework for self-hosted funding rate arbitrage and market making. The screener in this post is the research side of the problem; execution — actually running the delta-neutral legs with microsecond-level latency without becoming the exit liquidity — is a different one.

godzilla.dev - AI Quant Trader Series - Day 10 - What is an Order Book?

KX — Tue, 07 Jul 2026 13:30:11 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_10/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 10

What is an Order Book?
Reading time: ~15 minutes
Prerequisites: What is High Frequency Trading, What is Market Microstructure
Focus: understanding the core data structure behind every electronic exchange

Part 1: Introduction
Every electronic exchange has one central component.

The Order Book.

Whether you are trading:

Stocks
Futures
Options
ETFs
Cryptocurrencies
every trade begins and ends with the order book.

For quantitative developers, the order book is more than market data.

It is the primary data structure that determines:

Liquidity
Price formation
Execution priority
Market depth
Trading opportunities
Without understanding the order book, it is impossible to understand modern electronic markets.

Part 2: What is an Order Book?
An order book is a real-time collection of all active buy and sell orders submitted to an exchange.

It continuously records:

Buy orders (Bids)
Sell orders (Asks)
Prices
Quantities
As orders arrive, are canceled, or are executed, the order book updates immediately.

Unlike historical price charts, the order book represents the current state of market supply and demand.

Part 3: A Simple Order Book
A simplified order book might look like this:

ASK

Price Size

101.30 25

101.20 40

101.10 15

101.00

100.90 18

100.80 35

100.70 12

Price Size

BID

The highest buying price is called the Best Bid.

The lowest selling price is called the Best Ask.

Together they define the current market.

Part 4: Bid, Ask and Spread
Suppose the market looks like:

Best Ask = 101.10

Best Bid = 100.90
The difference is:

Spread = 0.20
The bid-ask spread represents the immediate cost of trading.

Smaller spreads usually indicate:

Higher liquidity
Lower transaction costs
More active markets
Large spreads often signal uncertainty or low liquidity.

Many quantitative strategies continuously monitor spread changes.
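The spread calculation above is worth doing with exact decimal arithmetic, because binary floats do not represent prices like 101.10 exactly. The sketch below uses Python's `decimal` module; the mid price and the basis-point conversion are additions for illustration.

```python
from decimal import Decimal

best_ask = Decimal("101.10")
best_bid = Decimal("100.90")

spread = best_ask - best_bid        # exact: 0.20
mid = (best_ask + best_bid) / 2     # midpoint between bid and ask
spread_bps = spread / mid * 10_000  # spread relative to mid, in basis points

print(spread, mid, round(spread_bps, 2))

# The same subtraction with floats is not exactly 0.20.
print(101.10 - 100.90 == 0.20)
```

Expressing the spread in basis points makes it comparable across instruments trading at very different price levels.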

Part 5: Order Book Updates
The order book changes whenever one of three events occurs.

New Order
A participant submits a new buy or sell order.

BUY

100.95

Size 30
The order is inserted into the appropriate price level.

Cancel Order
An existing order is removed.

Liquidity decreases.

The market depth changes.

Trade Execution
A buy order matches a sell order.

Both orders disappear (fully or partially).

The traded price becomes the latest transaction price.

These three event types generate almost every message published by an electronic exchange.

Part 6: Market Orders vs Limit Orders
The order book primarily stores limit orders.

Limit Order
A trader specifies:

Price
Quantity
Example:

Buy

10 BTC

100,000 USD
The order waits until a seller accepts that price.

Market Order
A market order specifies only quantity.

The exchange immediately executes against the best available prices in the order book.

For example:

Market Buy

20 BTC
The exchange consumes liquidity from multiple ask levels until the requested quantity is filled.
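Here is a minimal sketch of a market buy walking the ask side. The ask levels are hypothetical, and the helper names `market_buy` and `average_price` are made up for the example.

```python
from decimal import Decimal

def market_buy(asks, quantity):
    """Consume ask levels, best price first.

    asks: list of (price, size) sorted from lowest to highest price.
    Returns (fills, unfilled_quantity).
    """
    fills = []
    for price, size in asks:
        if quantity == 0:
            break
        traded = min(quantity, size)
        fills.append((price, traded))
        quantity -= traded
    return fills, quantity

def average_price(fills):
    """Quantity-weighted average execution price."""
    total_qty = sum(qty for _, qty in fills)
    return sum(price * qty for price, qty in fills) / total_qty

asks = [(Decimal("101.10"), 10), (Decimal("101.20"), 25), (Decimal("101.30"), 80)]
fills, unfilled = market_buy(asks, 20)
print(fills, unfilled, average_price(fills))
```

The buyer pays more than the best ask on average because the first level holds only 10 of the 20 units requested. That difference is slippage.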

Part 7: Market Depth
An order book contains more than just the best bid and ask.

It also reveals market depth.

For example:

Ask

101.10 10

101.20 25

101.30 80

101.40 200
Large resting orders often influence market behavior.

Some quantitative strategies analyze:

Depth imbalance
Queue size
Liquidity concentration
to predict short-term price movement.
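Depth imbalance has several definitions in practice. The sketch below assumes one common form, (bid volume - ask volume) / (bid volume + ask volume), and applies it to the three bid and three ask levels of the simple order book shown in Part 3.

```python
def depth_imbalance(bid_sizes, ask_sizes):
    """Return a value in [-1, 1]: positive means more resting bid volume."""
    bid_total = sum(bid_sizes)
    ask_total = sum(ask_sizes)
    return (bid_total - ask_total) / (bid_total + ask_total)

bid_sizes = [18, 35, 12]  # 100.90, 100.80, 100.70
ask_sizes = [15, 40, 25]  # 101.10, 101.20, 101.30
imbalance = depth_imbalance(bid_sizes, ask_sizes)
print(f"{imbalance:+.4f}")
```

A negative value means more volume is resting on the ask side than on the bid side. Whether that predicts anything is an empirical question for each market.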

Part 8: Price-Time Priority
Most exchanges use Price-Time Priority.

This means:

Higher bids and lower asks execute first.

If multiple orders exist at the same price,

the earliest order executes first.

Example:

Trader A

Buy

100.00

09:30:01

Trader B

Buy

100.00

09:30:05
Trader A receives execution before Trader B.

Queue position therefore becomes an important competitive advantage in High Frequency Trading.

Part 9: Why the Order Book Matters
Traditional investors mostly observe:

Daily candles
Moving averages
Volume
Quantitative traders often observe:

Best Bid
Best Ask
Queue imbalance
Order flow
Market depth
Trade aggressiveness
These microstructure signals often contain far more information than historical prices alone.

Many HFT strategies never use traditional technical indicators.

Instead, they react directly to order book events.

Part 10: Local Order Books
Professional trading systems rarely query the exchange every time they need market information.

Instead, they maintain a local order book.

The process is straightforward:

Exchange

↓

Market Data Feed

↓

Incremental Updates

↓

Local Order Book

↓

Trading Strategy
Maintaining a synchronized local order book dramatically reduces latency and enables strategies to process market events without additional network requests.

Almost every production HFT platform relies on this architecture.

Part 11: Where godzilla.dev Fits
Maintaining an accurate local order book is one of the most performance-critical components of a trading system.

A production implementation must:

Decode market data
Process millions of updates
Maintain price levels
Handle incremental messages
Synchronize state
Minimize latency
godzilla.dev provides the infrastructure required to build ultra-low latency trading systems capable of processing order book updates efficiently while exposing a clean interface for quantitative strategy development.

Instead of rebuilding market data infrastructure, developers can focus on designing trading strategies.

Part 12: Key Takeaways
The order book is the central data structure of every electronic exchange.

It continuously records:

Buy orders
Sell orders
Available liquidity
Market depth
Execution priority
Understanding the order book is essential for:

High Frequency Trading
Market Making
Execution Algorithms
Statistical Arbitrage
Every market event ultimately becomes an order book update.

What's Next?
The next article explores the engine responsible for processing every order submitted to the market:

What is a Matching Engine?

godzilla.dev - AI Quant Trader Series - Day 9 - What is Market Microstructure?

KX — Mon, 06 Jul 2026 07:31:39 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_9/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 9

What is Market Microstructure?
Reading time: ~15 minutes
Prerequisites: basic financial markets, programming fundamentals
Focus: understanding how electronic markets actually work

Part 1: Introduction
Most people think financial markets are simply places where buyers meet sellers.

For quantitative traders, this description is far too simplistic.

Every trade, every quote update, every order cancellation is generated by a highly optimized electronic matching system.

Understanding Market Microstructure means understanding how these markets actually operate beneath the surface.

If quantitative finance studies what prices should do, market microstructure studies how prices are formed.

It is one of the most important subjects for:

High Frequency Trading
Market Making
Statistical Arbitrage
Execution Algorithms
Transaction Cost Analysis
Without understanding market microstructure, building a professional trading system becomes extremely difficult.

Part 2: What is Market Microstructure?
Market Microstructure studies the process through which financial assets are traded.

Instead of analyzing long-term price movements, it focuses on:

Order submission
Order cancellation
Trade execution
Liquidity
Bid-ask spreads
Price discovery
In other words,

Market microstructure explains how individual market events produce market prices.
Rather than asking:

Why did Bitcoin increase 10%?
Microstructure asks:

Which orders entered the book?
Who provided liquidity?
Who removed liquidity?
How did those interactions change the price?
Part 3: The Continuous Double Auction
Most modern electronic exchanges operate using a Continuous Double Auction (CDA).

Buyers submit bids.

Sellers submit asks.

Whenever the best bid meets the best ask, a trade occurs automatically.

For example,

BUY
100 @ 99

SELL
100 @ 99

↓

Trade Executed
The matching engine continuously repeats this process millions of times every day.

There is no human intervention.

Everything is performed automatically.
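The trigger for a trade in a continuous double auction reduces to a single comparison. In this sketch the function name `crosses` is hypothetical, and `None` is used to represent an empty side of the book.

```python
def crosses(best_bid, best_ask):
    """True when the best bid meets or exceeds the best ask."""
    if best_bid is None or best_ask is None:
        return False  # one side of the book is empty, nothing to match
    return best_bid >= best_ask

print(crosses(99, 99))          # BUY 100 @ 99 meets SELL 100 @ 99
print(crosses(100.90, 101.10))  # a normal, uncrossed book
```

As long as the comparison is false, both orders simply rest in the book and the spread stays open.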

Part 4: The Order Book¶
The order book is the central data structure of every electronic exchange.

A simplified order book looks like:

ASK

101.3 25

101.2 40

101.1 15

101.0

100.9 18

100.8 35

100.7 12

BID
The highest bid is called the Best Bid.

The lowest ask is called the Best Ask.

The difference between them is known as the Bid-Ask Spread.

Almost every HFT strategy continuously monitors these values.
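As a small worked example, the three quantities can be computed from the simplified book above. The sketch assumes the levels shown in the text; the 101.0 row is left out because no quantity is listed for it, and `Decimal` is used only to avoid floating-point noise in the subtraction.

```python
from decimal import Decimal

# Price -> quantity, copied from the simplified order book above.
asks = {Decimal("101.3"): 25, Decimal("101.2"): 40, Decimal("101.1"): 15}
bids = {Decimal("100.9"): 18, Decimal("100.8"): 35, Decimal("100.7"): 12}

best_bid = max(bids)             # highest price a buyer is willing to pay
best_ask = min(asks)             # lowest price a seller is willing to accept
spread = best_ask - best_bid     # bid-ask spread
mid = (best_bid + best_ask) / 2  # mid price, a common reference value

print(best_bid, best_ask, spread, mid)  # 100.9 101.1 0.2 101.0
```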

Part 5: Liquidity¶
Liquidity measures how easily an asset can be traded.

Highly liquid markets typically have:

Small spreads
Deep order books
Fast execution
Large trading volume
Low liquidity usually results in:

Large spreads
Higher slippage
Greater execution risk
Many quantitative strategies are designed specifically to provide or consume liquidity efficiently.

Part 6: Market Participants¶
Not all market participants behave the same way.

Typical participants include:

Retail Traders¶
Small individual investors.

Usually submit market orders.

Institutional Investors¶
Mutual funds.

Pension funds.

Asset managers.

Often execute very large orders.

Market Makers¶
Continuously provide both bids and asks.

Profit from the bid-ask spread while managing inventory risk.

High Frequency Traders¶
React to market events within microseconds.

Focus on execution quality and market efficiency.

Part 7: Market Orders vs Limit Orders¶
Two order types dominate modern markets.

Market Orders¶
Execute immediately.

Price is determined by available liquidity.

Advantages:

Near-certain execution, as long as the opposite side of the book has liquidity
Disadvantages:

Slippage
Higher transaction cost
Limit Orders¶
Specify a maximum buying price or minimum selling price.

Advantages:

Price control
Disadvantages:

No execution guarantee
Many HFT firms primarily use limit orders because controlling execution cost is often more important than immediate execution.
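Whether a limit order trades immediately or waits in the book depends only on its price relative to the current best quotes. The helper below is a hypothetical sketch of that rule; the function name and the "marketable"/"resting" labels are assumptions made for illustration.

```python
def classify_limit_order(side, limit_price, best_bid, best_ask):
    """Return 'marketable' if the order would trade at once, else 'resting'."""
    if side == "buy":
        # A buy limit at or above the best ask crosses the spread.
        return "marketable" if limit_price >= best_ask else "resting"
    if side == "sell":
        # A sell limit at or below the best bid crosses the spread.
        return "marketable" if limit_price <= best_bid else "resting"
    raise ValueError(f"unknown side: {side!r}")


print(classify_limit_order("buy", 101.1, best_bid=100.9, best_ask=101.1))  # marketable
print(classify_limit_order("buy", 100.9, best_bid=100.9, best_ask=101.1))  # resting
```

A resting order provides liquidity and keeps price control; a marketable one consumes liquidity and pays the spread.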

Part 8: Price Discovery¶
Prices do not move randomly.

They evolve through the interaction of buyers and sellers.

For example,

A large buy order consumes multiple ask levels.

The best ask moves upward.

The market price increases.

This process is known as price discovery.

The market is constantly discovering the fair value through order flow.
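The mechanism can be made concrete with a short simulation. The sketch below assumes the ask levels from the Part 4 example and a hypothetical helper `sweep_asks`; it shows a buy order of 60 units consuming two full levels and part of a third, which moves the best ask from 101.1 to 101.3.

```python
def sweep_asks(asks, qty):
    """Fill a market buy of `qty` against asks sorted from lowest price up.

    asks: list of (price, size). Returns (average fill price,
    remaining ask levels, unfilled quantity).
    """
    cost = 0.0
    filled = 0
    remaining = []
    for price, size in asks:
        take = min(size, qty - filled)   # 0 once the order is fully filled
        cost += take * price
        filled += take
        if size > take:
            remaining.append((price, size - take))
    average = cost / filled if filled else None
    return average, remaining, qty - filled


asks = [(101.1, 15), (101.2, 40), (101.3, 25)]
avg_price, asks_after, unfilled = sweep_asks(asks, 60)

print(asks[0][0], "->", asks_after[0][0])  # best ask before and after the sweep
print(round(avg_price, 4))                 # average price paid, above the old best ask
```

The gap between the average fill price and the original best ask is the slippage paid by the aggressive order.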

Part 9: Why Microstructure Matters in Quant Trading¶
Traditional investing often focuses on:

Fundamentals
Earnings
Macroeconomics
High-frequency trading focuses on something entirely different:

Market events.

Examples include:

Order imbalance
Queue position
Spread changes
Trade aggressiveness
Order cancellations
Market depth
These signals often exist for only milliseconds.

Understanding them creates opportunities unavailable on longer time horizons.
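As one illustration, order imbalance is often summarized as a number between -1 and +1 computed from the quantities resting at the best bid and best ask. The formula below is a common convention rather than something this article defines, and the function name is an assumption.

```python
def order_imbalance(bid_qty, ask_qty):
    """Top-of-book imbalance in [-1, 1]; positive means more resting buy interest."""
    total = bid_qty + ask_qty
    if total == 0:
        return 0.0  # empty top of book: report no imbalance
    return (bid_qty - ask_qty) / total


# Using the best levels from the Part 4 example: 18 on the bid, 15 on the ask.
print(round(order_imbalance(18, 15), 4))  # 0.0909
```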

Part 10: Where godzilla.dev Fits¶
Modern trading systems must process enormous numbers of market events every second.

A production trading platform needs to:

Decode exchange messages
Maintain a local order book
Distribute market data
Execute strategies
Manage risk
Send low-latency orders
These responsibilities form the foundation of every professional trading infrastructure.

godzilla.dev provides an open-source infrastructure designed specifically for these workloads.

Instead of rebuilding market data pipelines and exchange connectivity from scratch, developers can focus on researching trading strategies while relying on a modular, ultra-low latency architecture.

Part 11: Key Takeaways¶
Market Microstructure explains how electronic markets actually function.

It studies:

Order books
Liquidity
Order flow
Matching engines
Price discovery
Rather than predicting prices directly, microstructure explains how prices emerge from the interaction of market participants.

For quantitative developers, this knowledge is often more valuable than traditional financial theory.

What's Next?¶
The following articles build upon these concepts:

What is an Order Book?
What is a Matching Engine?
What is Market Data?
What is an Exchange Gateway?
What is Shared Memory IPC?
Building Low-Latency Trading Systems

godzilla.dev - AI Quant Trader Series - Day 8 - What is High Frequency Trading?

KX — Sat, 04 Jul 2026 04:38:27 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_8/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 8

AI × Quant Trader Series — Day 8¶
What is High Frequency Trading?¶
Reading time: ~15 minutes
Prerequisites: basic programming, financial markets
Focus: engineering intuition, system architecture (not trading strategies)

Part 1: Introduction¶
When people hear High Frequency Trading (HFT), they often imagine computers buying and selling stocks in microseconds.

While speed is certainly important, it is not the essence of HFT.

High Frequency Trading is the engineering discipline of building trading systems capable of:

Processing market data
Making trading decisions
Managing risk
Executing orders
all within extremely tight latency constraints.

At its core, HFT combines:

Computer Science
Distributed Systems
Networking
Operating Systems
Market Microstructure
Quantitative Finance
Modern exchanges are software systems.

The competition is no longer between traders.

It is between software architectures.

Part 2: Why High Frequency Trading Exists¶
Electronic markets continuously generate enormous amounts of information.

Every second, exchanges publish:

Order submissions
Order cancellations
Trade executions
Quote updates
Every market event may represent a trading opportunity.

The challenge is simple:

Who can react first?
The first system to detect an opportunity and submit an order usually captures the available liquidity.

Milliseconds matter.

Sometimes even microseconds.

Part 3: The HFT Pipeline¶
A modern HFT system is usually organized as a processing pipeline.

Exchange
│
Market Data Feed
│
Market Data Decoder
│
Shared Memory
│
Trading Strategy
│
Risk Engine
│
Order Manager
│
Exchange Gateway
│
Exchange
Each component performs one specialized task.

Together they create a deterministic low-latency trading system.

Part 4: Core Components¶
4.1 Market Data¶
Everything begins with market data.

Exchanges continuously publish information such as:

Best bid
Best ask
Trades
Order book updates
The market data engine decodes these messages and distributes them to downstream components.

The faster this happens, the sooner strategies can react.

4.2 Trading Strategy¶
The strategy consumes market events and determines whether to:

Buy
Sell
Cancel
Modify existing orders
Strategies can include:

Market Making
Statistical Arbitrage
Cross-Exchange Arbitrage
ETF Arbitrage
Trend Following
The strategy itself is often surprisingly small.

Most engineering effort lies in the surrounding infrastructure.

4.3 Risk Management¶
Every order passes through risk control before reaching the exchange.

Typical checks include:

Position limits
Exposure limits
Price validation
Fat-finger protection
Kill switches
A fast trading system without risk management is simply a fast way to lose money.
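A pre-trade check covering several of the items above might look like the following. It is a simplified sketch: the `RiskLimits` fields, the `check_order` signature, and the rejection messages are hypothetical, and a production risk engine would track far more state.

```python
from dataclasses import dataclass


@dataclass
class RiskLimits:
    max_position: int    # absolute position limit, in units
    max_order_qty: int   # fat-finger limit on a single order
    price_band: float    # allowed relative deviation from a reference price


def check_order(side, qty, price, position, ref_price, limits, kill_switch=False):
    """Return (accepted, reason) for a proposed order."""
    if kill_switch:
        return False, "kill switch active"
    if qty <= 0 or qty > limits.max_order_qty:
        return False, "order size outside limits"
    if abs(price - ref_price) > limits.price_band * ref_price:
        return False, "price outside band"
    signed_qty = qty if side == "buy" else -qty
    if abs(position + signed_qty) > limits.max_position:
        return False, "position limit exceeded"
    return True, "ok"


limits = RiskLimits(max_position=100, max_order_qty=50, price_band=0.05)
print(check_order("buy", 10, 101.0, position=0, ref_price=100.0, limits=limits))
print(check_order("buy", 10, 120.0, position=0, ref_price=100.0, limits=limits))
```

The checks are ordered from cheapest to most stateful so that an obviously bad order is rejected as early as possible.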

4.4 Order Management¶
The Order Management System (OMS) tracks:

Active orders
Filled orders
Cancelled orders
Positions
It provides a consistent view of the trading state across the entire system.
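The bookkeeping role of an OMS can be illustrated with a very small state tracker. The class and method names below are assumptions made for this sketch; real systems also handle partial cancels, rejects, and reconciliation with the exchange.

```python
class OrderTracker:
    """Tracks order status and the net position implied by fills."""

    def __init__(self):
        self.orders = {}   # order_id -> {"side", "qty", "filled", "status"}
        self.position = 0  # net position: buys add, sells subtract

    def on_new(self, order_id, side, qty):
        self.orders[order_id] = {"side": side, "qty": qty, "filled": 0, "status": "active"}

    def on_fill(self, order_id, fill_qty):
        order = self.orders[order_id]
        order["filled"] += fill_qty
        self.position += fill_qty if order["side"] == "buy" else -fill_qty
        if order["filled"] >= order["qty"]:
            order["status"] = "filled"

    def on_cancel(self, order_id):
        order = self.orders[order_id]
        if order["status"] == "active":
            order["status"] = "cancelled"

    def active_orders(self):
        return [oid for oid, o in self.orders.items() if o["status"] == "active"]


oms = OrderTracker()
oms.on_new("A1", "buy", 10)
oms.on_new("A2", "sell", 5)
oms.on_fill("A1", 4)   # partial fill, order stays active
oms.on_cancel("A2")
print(oms.position, oms.active_orders())  # 4 ['A1']
```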

4.5 Exchange Gateway¶
Finally, orders are transmitted through exchange-specific gateways.

Each exchange has its own:

Protocol
Message format
Authentication
Session management
The gateway hides these implementation details from the strategy.

Part 5: Why Latency Matters¶
Suppose two firms observe the same arbitrage opportunity.

Firm A reacts in:

20 μs
Firm B reacts in:

150 μs
Both systems discovered the same opportunity.

Only one receives the execution.

The opportunity disappears immediately after the first successful order.

This is why HFT engineers spend enormous effort reducing latency across every component of the system.

Part 6: Software Engineering Challenges¶
Building an HFT platform is primarily a systems engineering problem.

Common challenges include:

Memory Management¶
Avoid unnecessary allocations.

Reuse objects whenever possible.

Lock-Free Programming¶
Traditional mutexes introduce unpredictable latency.

Many production systems rely on:

Atomic operations
Ring buffers
Lock-free queues
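The ring buffer is the simplest of these structures to picture. The sketch below shows the single-producer/single-consumer layout in Python purely to illustrate the indexing; Python offers no lock-free guarantees, and a production version would typically be written in C++ or Rust with atomic head and tail indices. The class name and API are assumptions.

```python
class RingBuffer:
    """Fixed-capacity FIFO queue for one producer and one consumer.

    Only the producer advances `_head`; only the consumer advances `_tail`.
    """

    def __init__(self, capacity):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to write
        self._tail = 0  # next slot to read

    def push(self, item):
        nxt = (self._head + 1) % len(self._slots)
        if nxt == self._tail:
            return False  # buffer full, caller decides whether to drop or retry
        self._slots[self._head] = item
        self._head = nxt
        return True

    def pop(self):
        if self._tail == self._head:
            return None  # buffer empty
        item = self._slots[self._tail]
        self._slots[self._tail] = None
        self._tail = (self._tail + 1) % len(self._slots)
        return item


rb = RingBuffer(capacity=2)
print(rb.push("tick-1"), rb.push("tick-2"), rb.push("tick-3"))  # True True False
print(rb.pop())  # tick-1
```

Because the storage is allocated once and reused, the structure also follows the memory management advice above: no allocation happens on the hot path.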
Shared Memory¶
Passing data between processes through sockets is expensive.

Shared memory allows multiple processes to access market data with almost zero copying.
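Python's standard library exposes the same idea through `multiprocessing.shared_memory`, which makes for a compact demonstration. The two-double quote layout below is an assumption chosen for the example, and both handles live in one process only to keep the sketch self-contained; in practice the reader would attach from a separate process.

```python
import struct
from multiprocessing import shared_memory

# Assumed layout for this sketch: best bid and best ask as little-endian doubles.
QUOTE = struct.Struct("<dd")

# Writer side: create a named block and publish a quote into it.
writer = shared_memory.SharedMemory(create=True, size=QUOTE.size)
QUOTE.pack_into(writer.buf, 0, 100.9, 101.1)

# Reader side: attach to the same block by name and decode it in place.
reader = shared_memory.SharedMemory(name=writer.name)
best_bid, best_ask = QUOTE.unpack_from(reader.buf, 0)
print(best_bid, best_ask)

# Clean up: every handle is closed, and the creator unlinks the block.
reader.close()
writer.close()
writer.unlink()
```

The reader never receives a message; it simply looks at bytes the writer already placed in memory, which is why the approach avoids the copies that a socket would require.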

CPU Cache Optimization¶
Modern CPUs are significantly faster than main memory.

Efficient cache usage often produces larger performance gains than algorithmic optimization.

Deterministic Performance¶
Average latency is not enough.

Professional trading systems focus on:

Predictable latency
Stable execution
Minimal jitter
Consistency matters more than occasional speed.
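The difference between average and tail latency is easy to see with a few numbers. The helper below is an illustrative sketch: the function name is hypothetical, the sample values are made up for the example, and percentiles use the simple nearest-rank definition.

```python
import statistics


def latency_summary(samples_us):
    """Summarize latency samples (in microseconds) by mean and tail percentiles."""
    ordered = sorted(samples_us)

    def percentile(p):
        # Nearest-rank percentile using integer arithmetic: ceil(p * n / 100).
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# 98 fast responses and 2 slow ones (synthetic values for illustration).
samples = [20] * 98 + [500] * 2
summary = latency_summary(samples)
print(summary)
```

Here the median stays at 20 μs while the 99th percentile is 500 μs, so a system judged only by its mean or median would hide exactly the jitter this section warns about.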

Part 7: HFT vs Algorithmic Trading¶
These terms are often confused.

Algorithmic Trading is a broad category covering any automated trading strategy.

High Frequency Trading is a specialized subset emphasizing:

Extremely low latency
High message throughput
Very short holding periods
Continuous market interaction
Every HFT system is algorithmic trading.

Not every algorithmic trading system is HFT.

Part 8: Common Misconceptions¶
HFT is not Artificial Intelligence¶
Most HFT systems rely on:

Market microstructure
Statistical models
Rule-based execution
Machine learning is only one possible component.

HFT is not only about faster hardware¶
Buying expensive servers does not automatically create a low-latency platform.

Architecture matters more than hardware.

Good software consistently outperforms poor software running on expensive machines.

HFT is not only for large institutions¶
Open-source infrastructure has dramatically reduced the barrier to entry.

Independent quantitative researchers can now build professional-grade trading systems using commodity hardware.

Part 9: Where godzilla.dev Fits¶
Building an HFT platform from scratch requires implementing:

Market data processing
Shared memory communication
Order management
Risk management
Exchange gateways
Strategy framework
Monitoring
Performance optimization
These components represent years of engineering effort.

godzilla.dev provides an open-source ultra-low latency trading framework designed specifically for modern electronic markets.

Instead of rebuilding infrastructure repeatedly, quantitative developers can focus on strategy research while relying on a modular, production-oriented architecture.

Part 10: Key Takeaways¶
High Frequency Trading is fundamentally a systems engineering discipline.

Its objective is not simply "trading faster."

Instead, it focuses on building reliable, deterministic, and ultra-low latency software capable of processing millions of market events while maintaining strict risk controls.

Understanding HFT requires knowledge of:

Market Microstructure
Operating Systems
Computer Networks
Concurrent Programming
Low-Latency Architecture
Trading strategies may evolve.

The underlying engineering principles remain remarkably consistent.

What's Next?¶
The following articles explore each component in greater depth:

What is Market Microstructure?
What is an Order Book?
What is Shared Memory IPC?
What is a Matching Engine?
What is an Order Management System (OMS)?
What is a Risk Engine?
Lock-Free Programming
Event-Driven Architecture
Building Low-Latency Trading Systems

godzilla.dev — AI Quant Trader Series — Day 7

KX — Sat, 04 Jul 2026 04:37:01 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_7/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 7

The Swiss Army Knife of Linear Models: Lasso Regression¶
Reading time: ~15 minutes
Prerequisites: basic linear algebra, Python, NumPy
Focus: engineering intuition, quant usage (not ML hype)

Part 1: Introduction to Regularized Linear Models¶
We now move from data processing to one of the most important modeling tools in quantitative trading and applied machine learning: regularized linear models.

In real-world financial modeling, the main difficulty is rarely computation. Instead, it is almost always structure:

Too many features
Strong multicollinearity
Limited samples
High noise-to-signal ratio
A plain linear regression model can fit the data extremely well in-sample, yet fail catastrophically out-of-sample.

This is where Lasso regression becomes indispensable.

Part 2: From Linear Regression to Lasso¶
2.1 Ordinary Least Squares (OLS)¶
The objective function of ordinary least squares is:

$$\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2$$

OLS attempts to minimize prediction error only.
It places no constraint on model complexity.

As a result:

Coefficients become unstable when features are correlated
Noise features receive non-zero weights
Overfitting is almost guaranteed in high-dimensional settings
2.2 Why Regularization Is Necessary¶
In quantitative finance, feature sets often include:

Dozens of technical indicators
Overlapping factors
Lagged signals
Many of these features carry redundant or spurious information.

Regularization explicitly penalizes complexity, forcing the model to prefer simpler and more stable solutions.

Part 3: Lasso Regression — Core Idea¶
3.1 Objective Function¶
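Lasso (Least Absolute Shrinkage and Selection Operator) modifies OLS by adding an L1 penalty:

$$\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

Where:

The first term measures fit quality
The second term penalizes coefficient magnitude
$\lambda$ controls the strength of regularization (scikit-learn exposes it as alpha and scales the squared-error term by 1/(2n))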
3.2 What Makes Lasso Different¶
Unlike Ridge regression (L2 regularization), Lasso drives some coefficients exactly to zero.

This leads to:

Automatic feature selection
Sparse models
Improved interpretability
From an engineering perspective:

Lasso is not just a regression model — it is a structured filter.
Part 4: Intuition — Why Lasso Produces Sparsity¶
The L1 penalty creates a sharp constraint geometry.
When optimization occurs under this constraint, solutions naturally land on coordinate axes.

The practical consequence is simple:

Unimportant features are dropped entirely.
This behavior is extremely valuable in quant trading, where fewer signals often outperform noisy combinations.
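The sparsity can also be seen algebraically. In the idealized case of orthonormal features, and with the objective written as one half of the squared error plus λ times the L1 norm, the Lasso solution for each coefficient is the OLS estimate passed through a soft-thresholding function. The sketch below illustrates that special case only; the coefficient values are made up, and the function name is an assumption.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical OLS estimates: two real signals and two small noise weights.
ols_coefs = [3.0, 0.05, -0.08, 1.5]
lasso_coefs = [soft_threshold(b, 0.1) for b in ols_coefs]
print(lasso_coefs)
```

The two small coefficients fall inside the threshold and are set to exactly zero, while the large ones are only shrunk by λ. Ridge regression, by contrast, rescales every coefficient and never produces an exact zero.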

Part 5: Implementing Lasso in Python¶
We now implement Lasso using scikit-learn.

Imports¶
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
5.1 Generate Example Data¶
import numpy as np

np.random.seed(42)

X = np.random.randn(100, 10)
true_beta = np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0])
y = X @ true_beta + np.random.randn(100) * 0.5
5.2 Standardize Features¶
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
5.3 Fit the Lasso Model¶

from sklearn.linear_model import Lasso
import pandas as pd

lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)
pd.Series(lasso.coef_)
the output:

0 2.85
1 0.00
2 0.00
3 1.42
4 0.00
5 0.00
6 0.00
7 1.95
8 0.00
9 0.00
dtype: float64
Noise features are eliminated automatically, while true signals are retained.

Part 6: The Role of Alpha (λ)¶
6.1 Effect of Regularization Strength¶
Small α → weak regularization → overfitting

Large α → aggressive shrinkage → underfitting

for a in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=a)
    model.fit(X_scaled, y)
    # Number of non-zero coefficients that survive at this alpha
    print(a, (model.coef_ != 0).sum())
the output:

0.01 7
0.1 3
1.0 0

6.2 Cross-Validation (Recommended)¶

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_scaled, y)

lasso_cv.alpha_
lasso_cv.coef_
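Cross-validation chooses α from out-of-sample error instead of by hand, which makes the model more robust across different market regimes. For time-ordered market data, a time-aware splitter such as scikit-learn's TimeSeriesSplit can be passed as cv to avoid training on the future.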

Part 7: Limitations of Lasso¶
Lasso is not universally optimal:

Unstable when features are highly correlated: it tends to keep one feature from a correlated group and drop the rest, somewhat arbitrarily

Cannot model non-linear interactions

Sensitive to outliers

Common remedies include:

Elastic Net (L1 + L2)

PCA + Lasso

Lasso for feature selection followed by non-linear models

godzilla.dev — AI x Quant Trader Series — Day 6

KX — Sat, 04 Jul 2026 04:26:22 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_6/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 6

The Swiss Army Knife of Python Data Processing: pandas¶
Part 2: Rapid Advancement¶
In the previous article, we introduced how to create and access data in pandas using the Series and DataFrame types. In this article, we will cover how to perform operations on pandas data. Once you’ve mastered these operations, you’ll be able to handle most data processing tasks.

First, let’s import the modules we’ll be using in this article:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
To make the data easier to view, let's adjust the output display width:

pd.set_option('display.width', 200)

Other Ways to Create Data¶
The creation of data structures is not limited to the standard forms introduced in the previous article.

In this article, we'll look at a few more. For example, pd.date_range creates a DatetimeIndex, a sequence of dates that can serve as an index:

dates = pd.date_range('20250101', periods=5)
print(dates)
the output:

DatetimeIndex(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05'], dtype='datetime64[ns]', freq='D')

Use this DatetimeIndex as the index of a DataFrame:

df = pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD'))
print(df)
the output:

               A         B         C         D

2025-01-01 -1.119762 -0.088336 1.921095 1.158499
2025-01-02 -0.250627 0.271175 -0.505430 -1.490358
2025-01-03 0.710884 -1.478697 0.537757 1.448547
2025-01-04 -1.658607 -0.364456 0.196627 0.881224
2025-01-05 0.347936 0.312740 -0.199889 2.881074

Any object that can be converted into a Series can be used to create a DataFrame:

df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20250214'),
    'C': pd.Series(1.6, index=list(range(4)), dtype='float64'),
    'D': np.array([4] * 4, dtype='int64'),
    'E': 'hello pandas!',
})
print(df2)
the output:

 A          B    C  D              E

0 1.0 2025-02-14 1.6 4 hello pandas!
1 1.0 2025-02-14 1.6 4 hello pandas!
2 1.0 2025-02-14 1.6 4 hello pandas!
3 1.0 2025-02-14 1.6 4 hello pandas!

Viewing Data¶
In most cases, data is not generated by the analysts themselves but obtained through data APIs, external files, or other sources.

Here, we'll use a dataset retrieved from the Binance REST API as an example:

pip install pandas requests python-dateutil

import requests
import pandas as pd
from dateutil import parser

symbol = "BTCUSDT"  # the REST endpoint expects the symbol without a slash
interval = "1d"

start = "2025-01-01 00:00:00"
end = "2025-02-01 00:00:00"

start_ms = int(parser.isoparse(start).timestamp() * 1000)
end_ms = int(parser.isoparse(end).timestamp() * 1000)

url = "https://api.binance.com/api/v3/klines"
params = {
    "symbol": symbol,
    "interval": interval,
    "startTime": start_ms,
    "endTime": end_ms,
    "limit": 1000
}
r = requests.get(url, params=params, timeout=15)
r.raise_for_status()
data = r.json()

# Each returned row has the form:
# [
#   0 open time, 1 open, 2 high, 3 low, 4 close, 5 volume,
#   6 close time, 7 quote asset volume, 8 number of trades,
#   9 taker buy base volume, 10 taker buy quote volume, 11 ignore
# ]

df = pd.DataFrame(data, columns=[
    "open_time", "open", "high", "low", "close", "volume",
    "close_time", "quote_vol", "trades", "taker_base", "taker_quote", "ignore"
])

# Convert the price and volume columns from strings to numbers
for col in ["open", "high", "low", "close", "volume", "quote_vol", "taker_base", "taker_quote"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Keep only the core columns and index the rows by UTC date
df["date_utc"] = pd.to_datetime(df["open_time"], unit="ms", utc=True).dt.date
df = df[["open", "high", "low", "close", "volume", "date_utc"]].set_index("date_utc").sort_index()

print(df)
the output:

             open       high        low      close        volume

date_utc

2025-01-01 93576.00 95151.15 92888.00 94591.79 10373.326130
2025-01-02 94591.78 97839.50 94392.00 96984.79 21970.489480
2025-01-03 96984.79 98976.91 96100.01 98174.18 15253.829360
2025-01-04 98174.17 98778.43 97514.79 98220.50 8990.056510
2025-01-05 98220.51 98836.85 97276.79 98363.61 8095.637230
2025-01-06 98363.61 102480.00 97920.00 102235.60 25263.433750
2025-01-07 102235.60 102724.38 96181.81 96954.61 32059.875370
2025-01-08 96954.60 97268.65 92500.90 95060.61 33704.678940
2025-01-09 95060.61 95382.32 91203.67 92552.49 34544.836850
2025-01-10 92552.49 95836.00 92206.02 94726.11 31482.864240
2025-01-11 94726.10 95050.94 93831.73 94599.99 7047.904300
2025-01-12 94599.99 95450.10 93711.19 94545.06 8606.866220
2025-01-13 94545.07 95940.00 89256.69 94536.10 42619.564230
2025-01-14 94536.11 97371.00 94346.22 96560.86 27846.617530
2025-01-15 96560.85 100681.94 96500.00 100497.35 30509.991790
2025-01-16 100497.35 100866.66 97335.13 99987.30 27832.853170
2025-01-17 99987.30 105865.22 99950.77 104077.48 39171.852920
2025-01-18 104077.47 104988.88 102277.55 104556.23 24307.829980
2025-01-19 104556.23 106422.43 99651.60 101331.57 43397.282980
2025-01-20 101331.57 109588.00 99550.00 102260.01 89529.231732
2025-01-21 102260.00 107240.81 100119.04 106143.82 45941.020020
2025-01-22 106143.82 106394.46 103339.12 103706.66 22248.692540
2025-01-23 103706.66 106850.00 101262.28 103910.34 53953.120310
2025-01-24 103910.35 107120.00 102750.00 104870.50 23609.240170
2025-01-25 104870.51 105286.52 104106.09 104746.85 9068.323770
2025-01-26 104746.86 105500.00 102520.44 102620.00 9812.512380
2025-01-27 102620.01 103260.00 97777.77 102082.83 50758.134100
2025-01-28 102082.83 103800.00 100272.68 101335.52 22022.057650
2025-01-29 101335.52 104782.68 101328.01 103733.24 23155.358020
2025-01-30 103733.25 106457.44 103278.54 104722.94 19374.074720
2025-01-31 104722.94 106012.00 101560.00 102429.56 21983.181930

Using the code above, we retrieved BTC's daily market data for every day in January 2025 (crypto markets trade on every calendar day). First, let's check the size of the dataset:

print(df.shape)
the output:

(31, 5)

We can see there are 31 rows, which means we fetched 31 records. Each record has 5 fields.

Now let’s preview the data: DataFrame.head() and DataFrame.tail() show the first five and last five rows, respectively. To change the number of rows displayed, pass a number in the parentheses.

print("Head of this DataFrame:")
print(df.head())
print("Tail of this DataFrame:")
print(df.tail(3))
the output:

Head of this DataFrame:
open high low close volume
date_utc

2025-01-01 93576.00 95151.15 92888.00 94591.79 10373.32613
2025-01-02 94591.78 97839.50 94392.00 96984.79 21970.48948
2025-01-03 96984.79 98976.91 96100.01 98174.18 15253.82936
2025-01-04 98174.17 98778.43 97514.79 98220.50 8990.05651
2025-01-05 98220.51 98836.85 97276.79 98363.61 8095.63723
Tail of this DataFrame:
open high low close volume
date_utc

2025-01-29 101335.52 104782.68 101328.01 103733.24 23155.35802
2025-01-30 103733.25 106457.44 103278.54 104722.94 19374.07472
2025-01-31 104722.94 106012.00 101560.00 102429.56 21983.18193

DataFrame.describe() provides statistical summaries for the purely numeric data in the DataFrame.

print(df.describe())
the output:

            open           high            low          close        volume

count 31.000000 31.000000 31.000000 31.000000 31.000000
mean 99750.482258 101877.524839 97835.769032 100036.080645 27888.217365
std 4148.002200 4544.879400 4050.642975 4011.242254 17319.976611
min 92552.490000 95050.940000 89256.690000 92552.490000 7047.904300
25% 95810.730000 97605.250000 94369.110000 96757.735000 17313.952040
50% 100497.350000 102724.380000 97777.770000 101331.570000 24307.829980
75% 103719.955000 105938.610000 101295.145000 103719.950000 34124.757895
max 106143.820000 109588.000000 104106.090000 106143.820000 89529.231732

Sorting the data makes it easier to inspect. A DataFrame offers two kinds of sorting.

One is label-based sorting—i.e., sorting by the index (row labels) or by column names.

Use DataFrame.sort_index, with axis=0 to sort by the index (rows) and axis=1 to sort by column names. You can also specify ascending or descending order.

print("Order by column names, descending:")
print(df.sort_index(axis=1, ascending=False).head())
the output:

Order by column names, descending:
volume open low high close
date_utc

2025-01-01 10373.32613 93576.00 92888.00 95151.15 94591.79
2025-01-02 21970.48948 94591.78 94392.00 97839.50 96984.79
2025-01-03 15253.82936 96984.79 96100.01 98976.91 98174.18
2025-01-04 8990.05651 98174.17 97514.79 98778.43 98220.50
2025-01-05 8095.63723 98220.51 97276.79 98836.85 98363.61

The second type is value-based sorting. You can specify the column name(s) and the sort order; by default, it sorts in ascending order.

print("Order by column value, ascending:")
print(df.sort_values(by="date_utc", ascending=True).head())
the output:

Order by column value, ascending:
open high low close volume
date_utc

2025-01-01 93576.00 95151.15 92888.00 94591.79 10373.32613
2025-01-02 94591.78 97839.50 94392.00 96984.79 21970.48948
2025-01-03 96984.79 98976.91 96100.01 98174.18 15253.82936
2025-01-04 98174.17 98778.43 97514.79 98220.50 8990.05651
2025-01-05 98220.51 98836.85 97276.79 98363.61 8095.63723

Data Access and Manipulation¶
3.1 Revisiting Data Access¶
In the previous section, we introduced several ways to access data in a DataFrame using loc, iloc, at, iat, and []. (The older ix indexer was removed in pandas 1.0 and is not available in current releases.)

Here, we’ll introduce another method: using ":" to retrieve part of the rows or all of the columns.

print(df.iloc[1:4][:])
the output:

            open      high       low     close       volume

date_utc

2025-01-02 94591.78 97839.50 94392.00 96984.79 21970.48948
2025-01-03 96984.79 98976.91 96100.01 98174.18 15253.82936
2025-01-04 98174.17 98778.43 97514.79 98220.50 8990.05651

We can extend the method introduced in the previous section that uses Boolean vectors to access data.

This makes it very convenient to filter data. For example, we can select the rows where the closing price is above the average.

print(df[df.close > df.close.mean()].head())
the output:

             open       high        low      close       volume

date_utc

2025-01-06 98363.61 102480.00 97920.00 102235.60 25263.43375
2025-01-15 96560.85 100681.94 96500.00 100497.35 30509.99179
2025-01-17 99987.30 105865.22 99950.77 104077.48 39171.85292
2025-01-18 104077.47 104988.88 102277.55 104556.23 24307.82998
2025-01-19 104556.23 106422.43 99651.60 101331.57 43397.28298

godzilla.dev — AI x Quant Trader Series — Day 5

KX — Sat, 04 Jul 2026 04:24:06 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_5/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 5

The Swiss Army Knife of Python Data Processing: pandas¶
Part 1: Introduction to Basic Data Structures¶

Introduction to Pandas¶
We've finally arrived at the module the author is most eager to introduce, and arguably the most powerful Python extension for data processing: pandas.

When working with real-world financial data, a single record often contains multiple types of data. For example, a stock ticker is a string, the closing price is a float, and the trading volume is an integer. In C++, this can be handled using a container like a vector of custom structs. In Python, pandas provides high-level data structures — Series and DataFrame — that make data manipulation extremely convenient, fast, and straightforward.

Note that there are some incompatibilities between different versions of pandas. Therefore, it's important to know which version you are using. Let's first check the version of pandas in your local environment:

import pandas as pd
pd.__version__
the output:

'2.2.3'

The two main data structures in pandas are Series and DataFrame. In the next two sections, we’ll explore how to create these structures either from other data types or from scratch. But first, let’s import them along with the relevant modules:

import numpy as np
from pandas import Series, DataFrame

Pandas Data Structure: Series¶
Generally speaking, a Series can be thought of as a one-dimensional array. The main difference between a Series and a regular 1D array is that a Series has an index, which makes it similar to a dictionary (hash map), where each value is looked up by a label.

2.1 Creating a Series¶
The basic format for creating a Series is:

s = Series(data, index=index, name=name)

Below are a few examples of how to create a Series. Let's start by creating a Series from an array:

a = np.random.randn(5)
print("a is an array:")
print(a)
s = Series(a)
print("s is a Series:")
print(s)
the output:

a is an array:
[ 1.35729482 -1.45138391 0.91716941 -1.24918144 -0.68685959]
s is a Series:
0 1.357295
1 -1.451384
2 0.917169
3 -1.249181
4 -0.686860
dtype: float64

You can specify an index when creating a Series, and you can use Series.index to view the specific index values. One important thing to note is that when creating a Series from an array, the length of the specified index must match the length of the data.

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.index)
the output:

a -1.898245
b 0.172835
c 0.779262
d 0.289468
e -0.947995
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Another optional parameter when creating a Series is name, which allows you to assign a name to the Series. You can access it using Series.name. In a DataFrame, the name of each column becomes the name of the Series when that column is extracted individually.

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'], name='my_series')
print(s)
print(s.name)
the output:

a -1.898245
b 0.172835
c 0.779262
d 0.289468
e -0.947995
Name: my_series, dtype: float64
my_series

A Series can also be created from a dictionary (dict):

d = {'a': 0., 'b': 1, 'c': 2}
print("d is a dict:")
print(d)
s = Series(d)
print("s is a Series:")
print(s)
the output:

d is a dict:
{'a': 0.0, 'b': 1, 'c': 2}
s is a Series:
a 0.0
b 1.0
c 2.0
dtype: float64

Let’s take a look at the case where we specify an index when creating a Series from a dictionary (the index does not have to match the dictionary’s length):

Series(d, index=['b', 'c', 'd', 'a'])
the output:

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

We can observe two things:

When creating a Series from a dictionary, the data is reordered to match the specified index.

The length of the index does not need to match the length of the dictionary. If there are extra index labels, pandas will automatically assign them a value of NaN (Not a Number — the standard marker for missing data in pandas). If the index is shorter, only the corresponding subset of the dictionary will be used.

If the data is a single value, such as the number 4, then the Series will repeat this value across all index labels:

Series(4., index=['a', 'b', 'c', 'd', 'e'])
the output:

a 4.0
b 4.0
c 4.0
d 4.0
e 4.0
dtype: float64

2.2 Accessing Data in a Series¶
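You can access data in a Series by position (like arrays), by index label (like dictionaries), and through conditional filtering. Recent pandas versions deprecate plain integer keys such as s[0] on a label-indexed Series, so positional access is written with .iloc:

s = Series(np.random.randn(10), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
s.iloc[0]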
the output:

1.4328106520571824

s[:2]
the output:

a 1.432811
b 0.120681
dtype: float64

s.iloc[[2, 0, 4]]
the output:

c 0.578146
a 1.432811
e 1.327594
dtype: float64

s[['e', 'i']]
the output:

e 1.327594
i -0.634347
dtype: float64

s[s > 0.5]
the output:

a 1.432811
c 0.578146
e 1.327594
g 1.850783
dtype: float64

'e' in s
the output:

True

Pandas Data Structure: DataFrame¶
Before using a DataFrame, let's briefly go over its characteristics. A DataFrame is a two-dimensional data structure formed by combining multiple Series (column-wise). Each column, when extracted individually, is a Series. This is very similar to how data is retrieved from a SQL database. Therefore, it's often more convenient to process a DataFrame column by column, and it's helpful for users to develop a column-oriented mindset when working with data.

One of the key advantages of a DataFrame is its ability to handle columns of different data types with ease. A DataFrame is therefore not the right tool for operations like matrix inversion on a table full of floats; for such numerical tasks, it's usually better to store the data in a NumPy array.

3.1 Creating a DataFrame¶
Let’s first look at how to create a DataFrame from a dictionary. A DataFrame is a 2D data structure that serves as a collection of Series. We’ll start by creating a dictionary where the values are Series, and then convert it into a DataFrame:

d = {'one': Series([1., 2., 3.], index=['a', 'b', 'c']), 'two': Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)
print(df)
the output:

one two
a 1 1
b 2 2
c 3 3
d NaN 4

You can specify the desired rows (index) and columns when creating the DataFrame. If the dictionary does not contain the corresponding elements, those entries will be filled with NaN (missing values):

df = DataFrame(d, index=['r', 'd', 'a'], columns=['two', 'three'])
print(df)
the output:

two three
r NaN NaN
d 4 NaN
a 1 NaN

You can use dataframe.index and dataframe.columns to view the rows and columns of a DataFrame. The dataframe.values attribute returns the elements of the DataFrame as a NumPy array.

print("DataFrame index:")
print(df.index)
print("DataFrame columns:")
print(df.columns)
print("DataFrame values:")
print(df.values)
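the output (note: the values shown are for the 5×5 DataFrame with rows alpha to eta and columns a to e that is built later in this section, not for the 3×2 df created just above):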

DataFrame index:
Index([u'alpha', u'beta', u'gamma', u'delta', u'eta'], dtype='object')
DataFrame columns:
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
DataFrame values:
[[ 0. 0. 0. 0. 0.]
[ 1. 2. 3. 4. 5.]
[ 2. 4. 6. 8. 10.]
[ 3. 6. 9. 12. 15.]
[ 4. 8. 12. 16. 20.]]

A DataFrame can also be created from a dictionary whose values are arrays, but all arrays must be of the same length.

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)
the output:

one two
a 1 4
b 2 3
c 3 2
d 4 1

When the data is a list of dictionaries rather than a dictionary of arrays, this length restriction does not apply: each dictionary becomes one row, and any missing values are automatically filled with NaN.

d= [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = DataFrame(d)
print(df)
the output:

a b c
0 1.6 2 NaN
1 3.0 6 9

When working with real-world data, you may sometimes need to create an empty DataFrame. This can be done as follows:

df = DataFrame()
print(df)
the output:

Empty DataFrame
Columns: []
Index: []

Another very useful way to create a DataFrame is by using the concat function, which allows you to build a DataFrame from one or more Series or existing DataFrames.

a = Series(range(5))
b = Series(np.linspace(4, 20, 5))
df = pd.concat([a, b], axis=1)
print(df)
the output:

0 1
0 0 4
1 1 8
2 2 12
3 3 16
4 4 20

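Here, axis=1 means concatenation by columns, while axis=0 means concatenation by rows. Note that each Series is treated as a single column, so if you choose axis=0 here, the two Series are stacked end to end and the result is a single Series of length 10 rather than a two-column DataFrame.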
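The difference between the two axes is easy to check by looking at the type and shape of each result. The sketch below reuses the two Series from the example above; the variable names by_columns and by_rows are chosen here for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series(range(5))
b = pd.Series(np.linspace(4, 20, 5))

by_columns = pd.concat([a, b], axis=1)   # DataFrame with 5 rows and 2 columns
by_rows = pd.concat([a, b], axis=0)      # Series with 10 entries

print(type(by_columns).__name__, by_columns.shape)
print(type(by_rows).__name__, by_rows.shape)
```

Note that the row-wise result keeps the original labels 0 to 4 twice; passing ignore_index=True to concat renumbers them if unique labels are needed.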

The following example shows how to concatenate DataFrames by rows to form a larger DataFrame:

df = DataFrame()
index = ['alpha', 'beta', 'gamma', 'delta', 'eta']
for i in range(5):
    a = DataFrame([np.linspace(i, 5*i, 5)], index=[index[i]])
    df = pd.concat([df, a], axis=0)
print(df)
the output:

0 1 2 3 4
alpha 0 0 0 0 0
beta 1 2 3 4 5
gamma 2 4 6 8 10
delta 3 6 9 12 15
eta 4 8 12 16 20

3.2 Accessing Data in a DataFrame¶
First, it’s important to emphasize again that DataFrame operations are fundamentally column-based. You can think of every operation as first selecting a column (which is a Series), and then accessing elements from that Series.

You can select a column using either dataframe.column_name or dataframe[]. You’ll quickly notice that:

The dot notation (dataframe.column_name) can only select a single column.

The bracket notation (dataframe[]) can be used to select one or multiple columns.

If the DataFrame has no column names, you can use non-negative integers (i.e., indices) inside the brackets to select columns. However, if column names do exist, then you must use those names to select columns. Also, in the absence of column names, dataframe.column_name is not valid.

print(df[1])
print(type(df[1]))
df.columns = ['a', 'b', 'c', 'd', 'e']
print(df['b'])
print(type(df['b']))
print(df.b)
print(type(df.b))
print(df[['a', 'd']])
print(type(df[['a', 'd']]))
the output:

alpha 0
beta 2
gamma 4
delta 6
eta 8
Name: 1, dtype: float64

alpha 0
beta 2
gamma 4
delta 6
eta 8
Name: b, dtype: float64

alpha 0
beta 2
gamma 4
delta 6
eta 8
Name: b, dtype: float64

a d
alpha 0 0
beta 1 4
gamma 2 8
delta 3 12
eta 4 16

In the code above, we used dataframe.columns to assign column names to the DataFrame. As shown, when a single column is extracted, the resulting data structure is a Series. However, when two or more columns are selected, the result remains a DataFrame.

To access specific elements, you can use indices or labels, just like with a Series.

print(df['b'].iloc[2])    # by position; plain df['b'][2] is deprecated in recent pandas
print(df['b']['gamma'])   # by label
the output:

4.0
4.0

To select rows, you can use dataframe.iloc to select by position (index number), or dataframe.loc to select by label (index name).

print(df.iloc[1])
print(df.loc['beta'])
the output:

a 1
b 2
c 3
d 4
e 5
Name: beta, dtype: float64
a 1
b 2
c 3
d 4
e 5
Name: beta, dtype: float64

Rows can also be selected using slicing or a Boolean array (Boolean mask).

print("Selecting by slices:")
print(df[1:3])
bool_vec = [True, False, True, True, False]
print("Selecting by boolean vector:")
print(df[bool_vec])
the output:

Selecting by slices:
a b c d e
beta 1 2 3 4 5
gamma 2 4 6 8 10
Selecting by boolean vector:
a b c d e
alpha 0 0 0 0 0
gamma 2 4 6 8 10
delta 3 6 9 12 15

Rows and columns can be combined to select specific data.

print(df[['b', 'd']].iloc[[1, 3]])
print(df.iloc[[1, 3]][['b', 'd']])
print(df[['b', 'd']].loc[['beta', 'delta']])
print(df.loc[['beta', 'delta']][['b', 'd']])
the output:

b d
beta 2 4
delta 6 12
b d
beta 2 4
delta 6 12
b d
beta 2 4
delta 6 12
b d
beta 2 4
delta 6 12
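The four chained selections above each run in two steps. A single .loc or .iloc call that receives both the row and the column selector gives the same result in one step. The sketch below rebuilds a frame shaped like the one used in this section; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one used in this section.
rows = ['alpha', 'beta', 'gamma', 'delta', 'eta']
df = pd.DataFrame([np.linspace(i, 5 * i, 5) for i in range(5)],
                  index=rows, columns=['a', 'b', 'c', 'd', 'e'])

by_label = df.loc[['beta', 'delta'], ['b', 'd']]   # labels for rows and columns
by_position = df.iloc[[1, 3], [1, 3]]              # integer positions for both

print(by_label)
print(by_label.equals(by_position))
```

The one-step form is also the safer choice when assigning values, because a chained selection may operate on a temporary copy.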

If you want to access a specific element at a particular position (rather than an entire row or column), the fastest way is to use dataframe.at and dataframe.iat, which access data by label and integer position, respectively.

print(df.iat[2, 3])
print(df.at['gamma', 'd'])
the output:

8.0
8.0

godzilla.dev — AI x Quant Trader Series — Day 4

KX — Sat, 04 Jul 2026 04:18:19 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_4/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 4

"Widely used Python Libraries"¶
Last time we introduced NumPy. In this article, we'll focus on another commonly used library in quantitative finance: SciPy.

SciPy¶
Overview of SciPy¶
In the previous article, we briefly introduced NumPy. Now let’s take a look at what SciPy can do. While NumPy handles vector and matrix operations—essentially functioning like an advanced scientific calculator—SciPy builds on top of NumPy and provides a more comprehensive and advanced set of functionalities. It offers a wide array of functions for statistics, optimization, interpolation, numerical integration, signal processing, and more, covering almost all fundamental scientific computing tasks.

In quantitative analysis, the most commonly used areas are statistics and optimization. Therefore, this article will focus on SciPy’s statistics and optimization modules. Other modules will be introduced in future articles when relevant.

This article will involve some matrix algebra. If you find it difficult, feel free to skip Part 3 or try to understand the concepts using one-dimensional scalars instead of higher-dimensional vectors.

As always, let's start by importing the necessary modules. Here, we’ll be using the statistics and optimization parts of SciPy:

import numpy as np
import scipy.stats as stats
import scipy.optimize as opt
Statistics Module¶
Generating Random Numbers¶
Let’s begin with generating random numbers, as this will make it easier to demonstrate other concepts later. To generate n random numbers, you can use rv_continuous.rvs(size=n) or rv_discrete.rvs(size=n).

rv_continuous refers to continuous probability distributions such as:

Uniform distribution: uniform

Normal distribution: norm

Beta distribution: beta, etc.

rv_discrete refers to discrete probability distributions such as:

Bernoulli distribution: bernoulli

Geometric distribution: geom

Poisson distribution: poisson, etc.

For example, to generate:

10 random numbers in the interval [0, 1] from a uniform distribution, and

10 random numbers from a Beta distribution with parameters α and β (denoted as Beta(α,β)):

rv_unif = stats.uniform.rvs(size=10)
print(rv_unif)
rv_beta = stats.beta.rvs(size=10, a=4, b=2)
print(rv_beta)
the output:

[ 0.6419336 0.48403001 0.89548809 0.73837498 0.65744886 0.41845577
0.3823512 0.0985301 0.66785949 0.73163835]
[ 0.82164685 0.69563836 0.74207073 0.94348192 0.82979411 0.87013796
0.78412952 0.47508183 0.29296073 0.52551156]

Each random distribution function in SciPy comes with built-in default parameters; for example, the uniform distribution defaults to the range [0, 1]. However, when you need to modify these parameters, typing out the full command each time can be tedious.

To simplify this, SciPy provides a "freezing" feature. This allows you to create a frozen distribution object with fixed parameters, so you don't need to repeatedly specify them. This is particularly useful in scenarios where you work with the same distribution settings multiple times.

For example, in the case of the Beta distribution, instead of specifying the parameters α and β every time you call .rvs(), you can define a frozen distribution like this:

np.random.seed(seed=2015)
rv_beta = stats.beta.rvs(size=10, a=4, b=2)
print("method 1:")
print(rv_beta)

np.random.seed(seed=2015)
beta = stats.beta(a=4, b=2)
print("method 2:")
print(beta.rvs(size=10))
the output:

method 1:
[ 0.43857338 0.9411551 0.75116671 0.92002864 0.62030521 0.56585548
0.41843548 0.5953096 0.88983036 0.94675351]
method 2:
[ 0.43857338 0.9411551 0.75116671 0.92002864 0.62030521 0.56585548
0.41843548 0.5953096 0.88983036 0.94675351]
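A frozen distribution is useful for more than sampling: the same object also exposes the summary statistics and the distribution functions with the parameters already bound. The following is a minimal sketch, assuming only that SciPy is installed; the seed value passed through random_state is arbitrary.

```python
import scipy.stats as stats

beta = stats.beta(a=4, b=2)       # frozen Beta(4, 2)

mean = beta.mean()                # analytic mean a / (a + b)
cdf_at_1 = beta.cdf(1.0)          # all probability mass lies in [0, 1]
sample = beta.rvs(size=5, random_state=2015)   # seed passed per call

print(mean, cdf_at_1, sample)
```

Passing random_state directly to rvs is an alternative to calling np.random.seed first, and keeps the seeding local to that one call.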

Hypothesis Testing¶
Now, let’s generate a dataset and examine its related statistical properties. (You can find the parameters and documentation for the relevant distributions here: http://docs.scipy.org/doc/scipy/reference/stats.html)

norm_dist = stats.norm(loc=0.5, scale=2)
n = 200
dat = norm_dist.rvs(size=n)
print("mean of data is: " + str(np.mean(dat)))
print("median of data is: " + str(np.median(dat)))
print("standard deviation of data is: " + str(np.std(dat)))
the output:

mean of data is: 0.383309149888
median of data is: 0.394980561217
standard deviation of data is: 2.00589851641

Suppose this dataset represents actual observed data—such as daily returns of a stock. We can perform a basic analysis on it. One of the simplest analyses is to test whether this dataset follows a given distribution, such as the normal distribution.

This is a classic one-sample hypothesis testing problem. A commonly used method for this is the Kolmogorov–Smirnov test (K-S test).

In a one-sample K-S test, the null hypothesis is that the sample comes from the specified theoretical distribution.

In SciPy, this can be done using the kstest function, where the parameters are:

the dataset,

the name of the distribution to test against (as a string),

and the parameters of that distribution.

mu = np.mean(dat)
sigma = np.std(dat)
stat_val, p_val = stats.kstest(dat, 'norm', (mu, sigma))
print('KS-statistic D = %6.3f p-value = %6.4f' % (stat_val, p_val))
the output:

KS-statistic D = 0.037 p-value = 0.9428

If the p-value from the hypothesis test is large (note that under the null hypothesis, the p-value is a random variable uniformly distributed over the interval [0, 1]; see: http://en.wikipedia.org/wiki/P-value), then we fail to reject the null hypothesis. In other words, the data is consistent with a normal distribution, although failing to reject does not prove that the data is normal.

Given the assumption of normality, we can further test whether the mean of this dataset is significantly different from zero. A common method for this is the t-test, specifically the one-sample t-test.

In SciPy, this is done using the ttest_1samp function:

stat_val, p_val = stats.ttest_1samp(dat, 0)
print('One-sample t-statistic D = %6.3f, p-value = %6.4f' % (stat_val, p_val))
the output:

One-sample t-statistic D = 2.696, p-value = 0.0076

We observe that p-value < 0.05, which means that under a significance level of 0.05, we should reject the null hypothesis—that is, the data’s mean is not equal to 0.
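To see where the t-statistic comes from, it can be recomputed by hand as the sample mean divided by its standard error. The sketch below uses its own simulated data (the seed and the variable names are arbitrary choices, so the numbers will not match the output above) and compares the manual value with ttest_1samp.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)              # seed chosen arbitrarily
x = rng.normal(loc=0.5, scale=2.0, size=200)

# t = (sample mean - hypothesised mean) / (sample std / sqrt(n))
n = x.size
t_manual = (x.mean() - 0.0) / (x.std(ddof=1) / np.sqrt(n))

t_scipy, p_val = stats.ttest_1samp(x, 0.0)
print(t_manual, t_scipy, p_val)
```

The ddof=1 argument matters: the t-test uses the unbiased sample standard deviation, whereas np.std defaults to ddof=0.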

Next, let’s generate another dataset and try a two-sample t-test using ttest_ind. This test checks whether two independent samples have significantly different means.

norm_dist2 = stats.norm(loc=-0.2, scale=1.2)
dat2 = norm_dist2.rvs(size=n//2)  # integer division: size must be an int in Python 3
stat_val, p_val = stats.ttest_ind(dat, dat2, equal_var=False)
print('Two-sample t-statistic D = %6.3f, p-value = %6.4f' % (stat_val, p_val))
the output:

Two-sample t-statistic D = 3.572, p-value = 0.0004

Note that in this case, the second dataset we generated differs from the first in terms of sample size and variance. Therefore, when performing the t-test, we need to use Welch’s t-test by setting equal_var=False in the ttest_ind function.

We again obtain a relatively small p-value, which means that under the 0.05 significance level, we reject the null hypothesis and conclude that the two groups do not have equal means.
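Welch's statistic differs from the classic two-sample t-statistic only in how the standard error is built: each sample contributes its own variance divided by its own size, with no pooling. The sketch below checks this against ttest_ind on freshly simulated data; the seed and variable names are assumptions made for the example.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)              # seed chosen arbitrarily
x = rng.normal(loc=0.5, scale=2.0, size=200)
y = rng.normal(loc=-0.2, scale=1.2, size=100)

# Welch's statistic: difference of means over the unpooled standard error.
se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
t_manual = (x.mean() - y.mean()) / se

t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)
print(t_manual, t_welch, p_welch)
```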

The scipy.stats module also provides many other hypothesis testing functions, such as:

bartlett and levene: for testing whether two or more samples have equal variances.

anderson_ksamp: for performing the Anderson-Darling k-sample test, used to check whether multiple samples come from the same distribution.

These tools are useful for more advanced statistical analysis depending on the properties of your data.

godzilla.dev — AI x Quant Trader Series — Day 3

KX — Sat, 04 Jul 2026 04:06:50 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_3/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 3

"Widely used Python Libraries"¶
The upcoming series will introduce some of the most widely used Python libraries in quantitative finance:

numpy

scipy

pandas

matplotlib

Each will be explained one by one for beginners.

NumPy¶
What is NumPy¶
Quantitative analysis involves a large amount of numerical computation, so having an efficient and convenient scientific computing tool is essential. Python was not originally designed as a language for scientific computing. However, as more people recognized its ease of use, a wide range of external extensions emerged—NumPy (Numeric Python) being one of them.

NumPy provides a wealth of tools for numerical programming, making it easy to handle operations on vectors, matrices, and more, which significantly facilitates scientific computing tasks. In addition, Python is free, and compared to the high costs of using software like MATLAB, NumPy has made Python an increasingly popular choice.

Let’s take a quick look at how to get started with NumPy:

import numpy
numpy.version.full_version
the output:

2.2.4

We used the import command to load the NumPy library and checked the version with numpy.version.full_version, which turned out to be 2.2.4.

In the upcoming lessons, we’ll frequently use functions from NumPy. However, constantly writing numpy as a prefix before every function call can be tedious. As mentioned earlier, there’s a shortcut when importing external modules: using from numpy import * allows you to access all functions without the prefix.

Problem solved? Not so fast!

Python has thousands of external modules, and in practice, it’s common to import several of them at once. If two modules happen to include functions or properties with the same name, this can lead to conflicts. To avoid such name clashes—also known as namespace confusion—it’s generally better to keep the module prefix.
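To make the clash concrete: both the standard math module and NumPy define a function called sqrt, and they behave differently. The sketch below keeps the prefixes so the two can be compared side by side; the variable names are illustrative.

```python
import math
import numpy as np

# Both modules define a function called sqrt, with different behaviour.
scalar_root = math.sqrt(16.0)            # works on a single number
array_root = np.sqrt([1.0, 4.0, 9.0])    # works element-wise on sequences

# math.sqrt cannot handle a list, so silently swapping one for the other
# (as a star import can do) would break code that expects the NumPy version.
try:
    math.sqrt([1.0, 4.0, 9.0])
    clash_detected = False
except TypeError:
    clash_detected = True

print(scalar_root, array_root, clash_detected)
```

With two star imports, whichever module is imported last decides which sqrt the bare name refers to.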

So is there a simpler way? Yes—when importing a module, you can assign it an alias. This way, you don’t need to write the full module name every time. For example, we can import NumPy as np and call version.full_version like this:

import numpy as np
np.version.full_version
the output:

2.2.4

A First Look at NumPy Objects: Arrays¶
The fundamental object in NumPy is the homogeneous multidimensional array, meaning all elements in the array must be of the same type—just like arrays in C++. For example, character and numeric types cannot coexist in the same array.

Let’s look at an example:

a = np.arange(20)
Here, we’ve created a one-dimensional array a starting from 0, with a step size of 1, and a total length of 20. In Python, indexing starts at 0, so users coming from R or MATLAB should be cautious about this difference. You can use print to view the array:

print(a)
the output:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]

We can use the type function to check the type of a. Here, it shows that a is an array:

type(a)
the output:

numpy.ndarray

Using the reshape function, we can restructure this array. For example, we can create a 4×5 two-dimensional array. The arguments passed to reshape specify the size of each dimension, and the data is arranged in order by dimension (for two dimensions, this means row-wise order). This is different from R, where arrays are filled column-wise by default.

a = a.reshape(4, 5)
print(a)
the output:

[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
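The fill order can be made visible on a smaller array. By default reshape fills row by row; passing order='F' fills column by column, which corresponds to the R behaviour mentioned above. The array v below is a made-up example.

```python
import numpy as np

v = np.arange(6)

row_wise = v.reshape(2, 3)               # default order='C': fill row by row
col_wise = v.reshape(2, 3, order='F')    # Fortran order: fill column by column

print(row_wise)
print(col_wise)
```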

Creating higher-dimensional arrays is no problem either:

a = a.reshape(2, 2, 5)
print(a)
the output:

[[[ 0 1 2 3 4]
[ 5 6 7 8 9]]

[[10 11 12 13 14]
[15 16 17 18 19]]]

Since a is an array, we can inspect its attributes to learn more about it:

ndim shows the number of dimensions

shape returns the size of each dimension

size gives the total number of elements (equal to the product of all dimension sizes)

dtype displays the data type of the elements

itemsize (not dsize) shows the number of bytes each element occupies

a.ndim
the output:

3

a.shape
the output:

(2, 2, 5)

a.size
the output:

20

a.dtype
the output:

dtype('int64')

Creating Arrays¶
Arrays can be created by converting lists, and higher-dimensional arrays can be created by converting nested lists.

raw = [0,1,2,3,4]
a = np.array(raw)
a
the output:

array([0, 1, 2, 3, 4])

raw = [[0,1,2,3,4], [5,6,7,8,9]]
b = np.array(raw)
b
the output:

array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

Some special arrays have dedicated commands for creation—for example, a 4×5 matrix filled with zeros:

d = (4, 5)
np.zeros(d)
the output:

array([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])

By default, the generated array is of float type, but you can specify the data type to create an integer array instead:

d = (4, 5)
np.ones(d, dtype=int)
the output:

array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])

An array of random numbers in the interval [0,1):

np.random.rand(5)
the output:

array([ 0.93807818, 0.45307847, 0.90732828, 0.36099623, 0.71981451])

Array Operations¶
Basic arithmetic operations have been overloaded—operators like +, -, *, and / are all applied element-wise to the entire array. For example, with addition:

a = np.array([[1.0, 2], [2, 4]])
print("a:")
print(a)
b = np.array([[3.2, 1.5], [2.5, 4]])
print("b:")
print(b)
print("a+b:")
print(a+b)
the output:

a:
[[ 1. 2.]
[ 2. 4.]]
b:
[[ 3.2 1.5]
[ 2.5 4. ]]
a+b:
[[ 4.2 3.5]
[ 4.5 8. ]]

Here, you can see that even though only one element in array a was written as a float and the rest as integers, NumPy automatically converts all elements to float, because NumPy arrays are homogeneous. Also, when adding two 2D arrays, the size of each dimension must match (or be compatible under NumPy's broadcasting rules).
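Both points, the automatic conversion to float and the shape requirement, can be checked directly. This is a small sketch with illustrative variable names.

```python
import numpy as np

mixed = np.array([[1.0, 2], [2, 4]])
print(mixed.dtype)                       # one float literal promotes the whole array

ints = np.array([[1, 2], [2, 4]])
print(np.issubdtype(ints.dtype, np.integer))

# Shapes that are neither equal nor broadcastable raise an error.
try:
    np.ones((2, 2)) + np.ones((3, 3))
    shape_error = False
except ValueError:
    shape_error = True
print(shape_error)
```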

Of course, in NumPy, these operators can also be used between a scalar and an array. The result is that the operation is applied element-wise between the scalar and each element in the array, and the output is still an array.

print("3 * a:")
print(3 * a)
print("b + 1.8:")
print(b + 1.8)
the output:

3 * a:
[[ 3. 6.]
[ 6. 12.]]
b + 1.8:
[[ 5. 3.3]
[ 4.3 5.8]]

Just like in C++, the +=, -=, *=, and /= operators are also supported in NumPy.

a /= 2
print(a)
the output:

[[ 0.5 1. ]
[ 1. 2. ]]

Taking square roots or computing exponentials is also very straightforward:

print("a:")
print(a)
print("np.exp(a):")
print(np.exp(a))
print("np.sqrt(a):")
print(np.sqrt(a))
print("np.square(a):")
print(np.square(a))
print("np.power(a, 3):")
print(np.power(a, 3))
the output:

a:
[[ 0.5 1. ]
[ 1. 2. ]]
np.exp(a):
[[ 1.64872127 2.71828183]
[ 2.71828183 7.3890561 ]]
np.sqrt(a):
[[ 0.70710678 1. ]
[ 1. 1.41421356]]
np.square(a):
[[ 0.25 1. ]
[ 1. 4. ]]
np.power(a, 3):
[[ 0.125 1. ]
[ 1. 8. ]]

Need to find the maximum or minimum of a 2D array? Want to calculate the total sum of all elements, or sum by rows or columns? Use a for loop? No need—NumPy’s ndarray class already provides built-in functions for these operations:

a = np.arange(20).reshape(4,5)
print("a:")
print(a)
print("sum of all elements in a: " + str(a.sum()))
print("maximum element in a: " + str(a.max()))
print("minimum element in a: " + str(a.min()))
print("maximum element in each row of a: " + str(a.max(axis=1)))
print("minimum element in each column of a: " + str(a.min(axis=0)))
the output:

a:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
sum of all elements in a: 190
maximum element in a: 19
minimum element in a: 0
maximum element in each row of a: [ 4 9 14 19]
minimum element in each column of a: [0 1 2 3 4]

Matrix operations are heavily used in scientific computing. In addition to arrays, NumPy also provides a dedicated matrix object. There are two main differences between matrices and arrays:

Matrices are strictly 2-dimensional, whereas arrays can have any number of dimensions (as long as they are positive integers).

The * operator performs matrix multiplication for matrix objects, meaning the number of columns in the left matrix must equal the number of rows in the right matrix. In contrast, for arrays, the * operator performs element-wise multiplication, requiring that the arrays have the same shape.

You can convert an array to a matrix using asmatrix (the older alias mat was removed in NumPy 2.0), or you can create a matrix directly. For example:

a = np.arange(20).reshape(4, 5)
a = np.asmatrix(a)
print(type(a))

b = np.matrix('1.0 2.0; 3.0 4.0')
print(type(b))
the output:

<class 'numpy.matrix'>
<class 'numpy.matrix'>

Let’s take another look at matrix multiplication. Here, we use the arange function to generate another matrix b. The arange function can also be called with the form arange(start, stop, step) to create an arithmetic sequence. Note that the range includes the start value but excludes the stop value.

b = np.arange(2, 45, 3).reshape(5, 3)
b = np.asmatrix(b)
print(b)
the output:

[[ 2 5 8]
[11 14 17]
[20 23 26]
[29 32 35]
[38 41 44]]

Some might ask: arange specifies the step size, but what if you want to specify the length of the generated 1D array instead? No problem — linspace can do just that.

np.linspace(0, 2, 9)
the output:

array([ 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ])

Back to our problem: perform matrix multiplication on matrices a and b.

print("matrix a:")
print(a)
print("matrix b:")
print(b)
c = a * b
print("matrix c:")
print(c)
the output:

matrix a:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
matrix b:
[[ 2 5 8]
[11 14 17]
[20 23 26]
[29 32 35]
[38 41 44]]
matrix c:
[[ 290 320 350]
[ 790 895 1000]
[1290 1470 1650]
[1790 2045 2300]]
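The same product can be computed without the matrix class at all. On plain arrays, the @ operator performs matrix multiplication while * stays element-wise, and the NumPy documentation recommends this style over np.matrix for new code. A minimal sketch using the same numbers as above:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)
b = np.arange(2, 45, 3).reshape(5, 3)

c = a @ b            # matrix product of two plain arrays
print(c)
```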

Array element access¶
Elements of arrays and matrices can be accessed using indices. The following examples all use two-dimensional arrays (or matrices).

a = np.array([[3.2, 1.5], [2.5, 4]])
print(a[0][1])
print(a[0, 1])
the output:

1.5
1.5

Array element values can be modified using index-based access.

b = a
a[0][1] = 2.0
print("a:")
print(a)
print("b:")
print(b)
the output:

a:
[[ 3.2 2. ]
[ 2.5 4. ]]
b:
[[ 3.2 2. ]
[ 2.5 4. ]]

Now here comes the problem: you clearly modified a[0][1], so why did b[0][1] also change? This is a common pitfall in Python programming. The reason is that Python didn't actually make a true copy of a and assign it to b; instead, it made b point to the same memory address as a. To create a real copy of a for b, you can use copy.

a = np.array([[3.2, 1.5], [2.5, 4]])
b = a.copy()
a[0][1] = 2.0
print("a:")
print(a)
print("b:")
print(b)
the output:

a:
[[ 3.2 2. ]
[ 2.5 4. ]]
b:
[[ 3.2 1.5]
[ 2.5 4. ]]

If you reassign a, meaning you point it to a different address, b will still remain at the original address.

a = np.array([[3.2, 1.5], [2.5, 4]])
b = a
a = np.array([[2, 1], [9, 3]])
print("a:")
print(a)
print("b:")
print(b)
the output:

a:
[[2 1]
[9 3]]
b:
[[ 3.2 1.5]
[ 2.5 4. ]]
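The three situations above (a second name for the same array, a real copy, and a reassignment) can be told apart programmatically. The sketch below also shows a fourth case worth knowing: a slice is a new array object that still shares memory with the original. The names alias, view and copy are illustrative.

```python
import numpy as np

a = np.array([[3.2, 1.5], [2.5, 4.0]])

alias = a             # same object, no data copied
view = a[0]           # slice: new array object sharing a's memory
copy = a.copy()       # independent data

a[0, 1] = 2.0
print(alias is a, np.shares_memory(a, view), np.shares_memory(a, copy))
print(view[1], copy[0, 1])
```

After the assignment, the view sees the new value 2.0 while the copy still holds 1.5.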

The colon : can be used to access all elements along a certain dimension — for example, to extract a specific column from a matrix.

a = np.arange(20).reshape(4, 5)
print("a:")
print(a)
print("the 2nd and 4th column of a:")
print(a[:,[1,3]])
the output:

a:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
the 2nd and 4th column of a:
[[ 1 3]
[ 6 8]
[11 13]
[16 18]]

Let’s try something a bit more complex: extracting elements that meet certain conditions — a common task in data processing, usually applied to a single row or column. In the example below, we extract the third column elements (12 and 17) that correspond to the rows where the first column values are greater than 5 (i.e., 10 and 15).

a[:, 2][a[:, 0] > 5]
the output:

array([12, 17])

The where function can be used to find the positions of specific values in an array.

loc = np.where(a==11)
print(loc)
print(a[loc[0][0], loc[1][0]])
the output:

(array([2]), array([1]))
11

Matrix operations¶
Let’s continue using a matrix (or 2D array) as an example. First, let’s look at matrix transposition.

a = np.random.rand(2,4)
print("a:")
print(a)
a = np.transpose(a)
print("a is an array, by using transpose(a):")
print(a)
b = np.random.rand(2,4)
b = np.asmatrix(b)
print("b:")
print(b)
print("b is a matrix, by using b.T:")
print(b.T)
the output:

a:
[[ 0.17571282 0.98510461 0.94864387 0.50078988]
[ 0.09457965 0.70251658 0.07134875 0.43780173]]
a is an array, by using transpose(a):
[[ 0.17571282 0.09457965]
[ 0.98510461 0.70251658]
[ 0.94864387 0.07134875]
[ 0.50078988 0.43780173]]
b:
[[ 0.09653644 0.46123468 0.50117363 0.69752578]
[ 0.60756723 0.44492537 0.05946373 0.4858369 ]]
b is a matrix, by using b.T:
[[ 0.09653644 0.60756723]
[ 0.46123468 0.44492537]
[ 0.50117363 0.05946373]
[ 0.69752578 0.4858369 ]]

Matrix inversion

import numpy.linalg as nlg
a = np.random.rand(2,2)
a = np.asmatrix(a)
print("a:")
print(a)
ia = nlg.inv(a)
print("inverse of a:")
print(ia)
print("a * inv(a)")
print(a * ia)
the output:

a:
[[ 0.86211266 0.6885563 ]
[ 0.28798536 0.70810425]]
inverse of a:
[[ 1.71798445 -1.6705577 ]
[-0.69870271 2.09163573]]
a * inv(a)
[[ 1. 0.]
[ 0. 1.]]
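In practice the inverse is often only a means to solve a linear system a x = rhs. For that purpose np.linalg.solve is usually preferred over forming the inverse explicitly. The sketch below uses a fixed, made-up 2×2 system so the result can be checked by hand.

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([3.0, 5.0])

x = np.linalg.solve(a, rhs)          # solves a @ x = rhs
x_via_inv = np.linalg.inv(a) @ rhs   # same answer through the inverse

print(x, np.allclose(a @ x, rhs), np.allclose(x, x_via_inv))
```

Comparisons use np.allclose rather than == because floating-point results are rarely exact.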

Computing eigenvalues and eigenvectors

a = np.random.rand(3,3)
eig_value, eig_vector = nlg.eig(a)
print("eigen value:")
print(eig_value)
print("eigen vector:")
print(eig_vector)
the output:

eigen value:
[ 1.35760609 0.43205379 -0.53470662]
eigen vector:
[[-0.76595379 -0.88231952 -0.07390831]
[-0.55170557 0.21659887 -0.74213622]
[-0.33005418 0.41784829 0.66616169]]
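The output of eig is easier to read once you know that the eigenvectors are stored as columns: column k of the second result belongs to entry k of the first. The sketch below verifies the defining relation a v = λ v on a small symmetric matrix chosen for this example (its eigenvalues are 1 and 3).

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 2.0]])
eig_value, eig_vector = np.linalg.eig(a)

# Column k of eig_vector pairs with eig_value[k]:  a @ v_k == eig_value[k] * v_k
lhs = a @ eig_vector
rhs = eig_vector * eig_value      # broadcasting scales column k by eig_value[k]

print(sorted(eig_value), np.allclose(lhs, rhs))
```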

Concatenate two vectors into a matrix by columns.

a = np.array((1,2,3))
b = np.array((2,3,4))
print(np.column_stack((a,b)))
the output:

[[1 2]
[2 3]
[3 4]]

After processing some data in a loop and obtaining results, it's often useful to combine those results into a matrix. This can be done using vstack and hstack.

a = np.random.rand(2,2)
b = np.random.rand(2,2)
print("a:")
print(a)
print("b:")
print(b)
c = np.hstack([a,b])
d = np.vstack([a,b])
print("horizontal stacking a and b:")
print(c)
print("vertical stacking a and b:")
print(d)
the output:

a:
[[ 0.6738195 0.4944045 ]
[ 0.25702675 0.15422012]]
b:
[[ 0.28058267 0.0967197 ]
[ 0.55191041 0.04694485]]
horizontal stacking a and b:
[[ 0.6738195 0.4944045 0.28058267 0.0967197 ]
[ 0.25702675 0.15422012 0.55191041 0.04694485]]
vertical stacking a and b:
[[ 0.6738195 0.4944045 ]
[ 0.25702675 0.15422012]
[ 0.28058267 0.0967197 ]
[ 0.55191041 0.04694485]]

Missing Value¶
Missing values are also a form of information in data analysis. NumPy provides nan to represent missing values, and isnan can be used to detect them.

a = np.random.rand(2,2)
a[0, 1] = np.nan
print(np.isnan(a))
the output:

[[False True]
[False False]]

nan_to_num can be used to replace nan with 0. In the more advanced module pandas, which we’ll cover later, we’ll see that it provides functions that allow you to specify the replacement value for nan.

print(np.nan_to_num(a))
the output:

[[ 0.58144238 0. ]
[ 0.26789784 0.48664306]]
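Two related tools are worth a quick look. In recent NumPy versions nan_to_num accepts a nan= argument for the replacement value, and the nan-aware reductions such as nanmean skip missing entries instead of returning nan. The array below is a made-up example.

```python
import numpy as np

a = np.array([[0.5, np.nan], [0.25, 1.0]])

filled = np.nan_to_num(a, nan=-1.0)   # nan= chooses the replacement value
mean_ignoring_nan = np.nanmean(a)     # average over the non-missing entries
n_missing = int(np.isnan(a).sum())    # count of missing entries

print(filled, mean_ignoring_nan, n_missing)
```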

NumPy offers many more functions. For a detailed understanding, you can refer to the following links: http://wiki.scipy.org/Numpy_Example_List and http://docs.scipy.org/doc/numpy

godzilla.dev — AI x Quant Trader Series — Day 2

KX — Sat, 04 Jul 2026 04:04:53 +0000

source: https://godzilla.dev/learning/ai_quant_traders_series_2/

See below for godzilla.dev materials about: AI x Quant Trader Series - Day 2

"Who will teach me about Python?"¶
On the first day, we learned the basic operations of Python and several main container types.

Today, we will learn Python's functions, loops and conditionals, and classes. With this, we will have a general understanding of Python. The learning outline for today is as follows:

Functions¶
a) Defining a function
Loops and Conditionals¶
a) if statements

b) while True / break statements

c) for loops

d) List comprehensions

Classes¶
a) A casual talk about classes and objects

b) Defining a class

Functions¶
a) Defining a function¶
(1) Definition Rules

When introducing list methods, we already briefly mentioned functions. Anyone who has studied mathematics knows what a function is — it takes an input (a parameter) and returns a value. Functions can also be defined by yourself, using the following format:

def function_name(parameter):
    # function code
In the function code, return indicates the value to be returned. For example, to define a square function square(x) that takes x as input and returns the square of x:

def square(x):return x*x

square(9)
the output:

81

(2) Defining Functions with Variable Parameters

Sometimes you need to define a function with a variable number of parameters. There are several ways to do this:

Assign default values to parameters. For example, define a function like f(a, b=1, c='hehe'). In this case, the last two parameters are optional: if not specified during the function call, they default to b=1 and c='hehe'. Therefore, the following calls are all valid:

f('dsds')
f('dsds', 2)
f('dsds', 2, 'hdasda')
Keyword arguments. The method above fixes the order of parameters: the first value is assigned to the first parameter. With keyword arguments, however, you can specify which value goes to which parameter by name. For example, still using the function f(a, b=1, c='hehe'), you can call it like this:

f(b=2, a=11)
The order of parameters can be changed as long as you specify them using their keywords.
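The calls above can be tried out with a concrete definition of f. The body below is hypothetical (the text only gives the signature); it simply returns its arguments so the effect of each call is visible.

```python
def f(a, b=1, c='hehe'):
    """Return the arguments as a tuple so each call can be inspected."""
    return (a, b, c)

print(f('dsds'))                 # b and c fall back to their defaults
print(f('dsds', 2))              # c falls back to its default
print(f('dsds', 2, 'hdasda'))    # all three given by position
print(f(b=2, a=11))              # keywords allow any order
```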

Loops and Conditionals¶
Note that Python uses indentation to indicate which block of code belongs to the loop.

a) if statements¶
Also note two things: first, indentation; and second, a colon (:) is required after the condition.

j=2.67
if j<3:
    print('j<3')
the output:

j<3

For multiple conditions, note that elseif should be written as elif. The standard format is:

if condition1:
    statement1
elif condition2:
    statement2
else:
    statement3
Note that if, elif, and else are at the same indentation level — there should be no indentation before them.

t=3
if t<3:
    print('t<3')
elif t==3:
    print('t=3')
else:
    print('t>3')
the output:

t=3

b) while True / break statements¶
The format of this statement is:

while True: # condition is true
    statement
    if break_condition:
        break

Here’s an example:

a=3
while a<10:
    a=a+1
    print(a)
    if a==8: break
the output:

4
5
6
7
8

Although the condition after while is a < 10, meaning the loop will continue as long as a is less than 10, the if condition specifies that the loop should break when a equals 8. Therefore, the output will only go up to 8.

c) for loops¶
No more explanation needed — you can iterate over a sequence, dictionary, etc.

a=[1,2,3,4,5]
for i in a:
    print(i)
the output:

1
2
3
4
5

d) List comprehensions
List comprehensions are a way to create a new list from an existing sequence, working similarly to a for loop. The format is:

[output_value for item in sequence if condition]
For each item that meets the condition, an output value is generated, and the final result is a list. The if condition part is optional; without it, every item produces an output value.

[x*x for x in range(10)]
the output:

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

[x*x for x in range(10) if x%3==0]
the output:

[0, 9, 36, 81]

Both examples above use the sequence [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] produced by range(10) to generate a new list.
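
To show how a list comprehension relates to a for loop, this sketch builds the filtered list of squares both ways and compares the results. The variable names are chosen for illustration.

```python
# Explicit for loop: start with an empty list and append matching squares.
squares_loop = []
for x in range(10):
    if x % 3 == 0:
        squares_loop.append(x * x)

# Equivalent list comprehension in a single expression.
squares_comp = [x * x for x in range(10) if x % 3 == 0]

print(squares_loop)
print(squares_comp)
print(squares_loop == squares_comp)
```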

Classes
a) A casual talk about classes and objects
A class is an abstract concept: it doesn't exist in the physical world in terms of time or space. A class simply defines the abstract attributes and behaviors shared by all its objects. For example, the class "Person" can represent many individuals, but the class itself doesn't exist as a tangible entity in the real world.

An object, on the other hand, is a concrete instance of a class. It is something that actually exists. If "Person" is an abstract class, then you, yourself, are a specific object of that class.

An object of a class is also called an instance of the class. To give another analogy, a class is like a mold, and objects are the concrete things produced using that mold — each with the same attributes and methods. As the saying goes, "They look just alike, as if made from the same mold" — that’s exactly the idea here.

The process of using a mold to create a concrete thing is called instantiation of the class. Let’s take a look at a specific class example below:

b) Defining a class
class boy:
    gender = 'male'
    interest = 'girl'

    def say(self):
        return 'i am a boy'
The statement above defines a class called boy. Now let’s use this class model to construct a specific object:

peter=boy()
Now let’s take a look at the attributes and methods of the specific instance peter.

“What are attributes and methods?”

They are the two kinds of members a class defines:

Attributes are the static aspects

Methods are the dynamic aspects

For example, the class “Person” may have attributes such as name, gender, height, age, and weight. It may also have methods such as walking, running, and jumping.

peter.gender
the output:

'male'

peter.interest
the output:

'girl'

peter.say()
the output:

'i am a boy'

Here, gender and interest are attributes of peter, while say is his method. If we instantiate another object, for example sam:

sam=boy()
Then sam and peter have the same attributes and methods — you could say, “They were truly made from the same mold!”
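
The "same mold" idea can be checked directly. The sketch below repeats the boy class from above and creates both instances; the comparison variables at the end are hypothetical names added for illustration.

```python
class boy:
    gender = 'male'      # attributes defined on the class
    interest = 'girl'

    def say(self):
        return 'i am a boy'

peter = boy()
sam = boy()

# Both instances expose the same attributes and methods...
same_gender = peter.gender == sam.gender
same_speech = peter.say() == sam.say()

# ...but they are still two separate objects of the same class.
distinct_objects = peter is not sam
both_are_boys = isinstance(peter, boy) and isinstance(sam, boy)

print(same_gender, same_speech, distinct_objects, both_are_boys)
```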

Learn more at https://godzilla.dev/