Chenghong M.

Posted on Jun 9

How OpenAI Built a Research Data Platform on Snowflake: A Field Notes on an Architecture in Motion

#snowflake #openai #dataplatform #rockset

Field notes + architecture breakdown. Based on the OpenAI team's session "Research Data Platform at OpenAI" at Snowflake Dev Day 2026 (June 4). All numbers and naming come from the speaker's slides; analysis and extensions are my own and are flagged inline. This is not a "look how cool OpenAI is" piece — neither was the talk. It's an honest record of an engineering team being pushed around by petabytes of RL data, hitting walls, redesigning, and hitting more walls.

Start with a number that lands

OpenAI didn't open with database architecture. They opened with release cadence: the cycle from research to shipped model has compressed from 15 months to 6 weeks.

What does that mean for the data platform? Every speedup upstream forces a corresponding speedup downstream — in how researchers inspect experiments, read samples, compute metrics. And at this point, the dominant research workload is no longer pretraining — it's RL (reinforcement learning) post-training. That shift shows up directly in their Snowflake storage:

Over 70% of the data in their Snowflake is RL sample events and complete samples.

Note the scope: this is Snowflake storage composition, not OpenAI's overall business. But even with that caveat, the number says something real — the core challenge for post-training research data infrastructure has shifted from "how do we store pretraining corpora" to "how do we assemble and query massive, out-of-order, oversized RL samples with low latency."

What the scale actually looks like

The team listed four scaling pressures. Any single one would keep a data team busy:

Data volume: 10x growth in the last 12 months — from single-digit PB to tens of PB, with hundreds of PB projected by year-end.
Write throughput: hundreds of TB/day on average; individual workloads occasionally write more than 1 PB in a single day.
Read latency: dashboards (especially the RL Sample Viewer) need double-digit millisecond reads; researcher scripts and ad-hoc queries need seconds-level response.
Agentic workloads: an increasing share of queries are generated by models, not humans. This drives up warehouse usage and makes capacity planning harder to predict.

That last one is worth pausing on. When agents become a major query origin, the workload no longer follows the human business-hour curve, and both optimization and cost forecasting need to be re-modeled. This is going to be a common problem soon.

Choices: Snowflake as default analytics, Rockset as real-time cache

The high-level positioning is clear:

Snowflake = primary analytics layer for RL experiment data (samples, metrics), plus hardware health and frontier eval workloads. Researchers can spin up pipelines and dashboards quickly without sacrificing scale (acceptable seconds-level latency).
Rockset = real-time cache layer for user-facing and highly interactive paths that need double-digit millisecond reads.
They're also evaluating Snowflake Interactive Tables and Snowflake Postgres for some low-latency use cases.

One piece of context the slides don't spell out but is worth knowing for interpretation: OpenAI acquired Rockset in 2024. The slides don't explicitly attribute the Rockset choice to the acquisition, so strictly speaking, the reading I'm about to offer is my inference, not the speaker's statement — but the timeline makes "Rockset as cache layer" look less like a third-party selection and more like wiring an in-house stack into the research infrastructure. Rockset is built on RocksDB (LSM-tree, write-optimized) and maintains row, columnar, and search (inverted) indexes — efficient for both point lookups and real-time aggregations, exactly the gap Snowflake leaves on the millisecond end.

Hard problem #1: the Joiner

What the problem looks like

A single sample rollout happens across multiple distributed systems. Each system emits one sample event with rich local context. But what researchers actually want isn't scattered events — it's the complete sample. Stitching those events back together by join key is the Joiner's job.

The constraints make it nasty:

Events arrive out of order, with delays from seconds to days.
Payloads are huge (prompts, conversations, chain-of-thought all live here).
Researchers need low latency between sample completion and queryability.

Out-of-order + large payloads + low latency — three constraints that doom every simple solution.

Four generations of Joiner (a textbook streaming-system evolution)

I think this part has the most pedagogical value in the whole talk, because nearly every data team walks some version of this path:

Early 2024 — Driver-side consolidation: aggregate complete samples inside the driver process before logging. Problem: high overhead on training infrastructure.
Mid 2024 — SnowTask-based joins: Snowflake Tasks read from a sample events table and join them. Problem: prohibitively expensive at high event volume.
Late 2024 — Custom Python job: per-experiment periodic batch jobs. Problem: high end-to-end delay; doesn't scale as experiments multiply.
Late 2025 to now — Flink streaming: near real-time joins, horizontally scalable, p99 latency < 1 minute.

The engineering insight worth pulling out of this is captured in the slide's own line: "Each step kept the completed-sample contract while reducing latency/cost/operational risk." Each generation rewrote the implementation top to bottom, but the contract — input is sample events, output is complete sample — never changed. That's the precondition that lets you keep replacing the underlying machinery without breaking upstream researchers. It looks unremarkable on paper, but it's a design discipline that often decides whether a large system can keep evolving at all.

The Flink pipeline, in four stages

The reason it's split into stages is that each stage scales differently, and decoupling lets them be tuned independently:

Scan: process new file arrival notifications, apply filtering and routing.
Index: extract lightweight metadata from files; move large payloads to a Premium SSD blobstore and keep only pointers in the metadata.
Join: use Flink state (backed by RocksDB — the speaker mentioned this explicitly out loud, though the slide doesn't say so) to track incomplete sample lineage; emit a lineage record each time a sample completes.
Emit: read payloads back from blob storage using the lineage metadata and emit the final complete sample.

Two operational details worth remembering: Flink runs in at-least-once mode with Rockset-backed deduplication (trading strict exactness for throughput, with dedup as the correctness safety net); heartbeat events track long rollouts that span multiple days. The known pain point is large checkpoint size — it hits Azure Blobstore throttling and slows down restarts. The classic large-state streaming job problem; anyone who's run one knows the feeling.

Hard problem #2: the 16 MB row limit

Snowflake caps a single row at 16 MB, but an RL sample with nested conversations and CoT routinely blows past that. Their solution is a "trim + reference" combo:

When a row exceeds the limit, trim the oversized fields out of the Snowflake row; preserve the raw record in blob storage.
Keep the blob path and file pointer in the row, so the full payload can be rehydrated on demand.
Media assets (audio, images) already live in blob; the sample only holds pointers.

This is the classic "warehouse as index, object storage as source of truth" pattern — that abstraction is my framing, not the speaker's. The warehouse keeps only the queryable, pruneable structured part; the heavy stuff sinks to cheap blob, linked by pointers.

The main event: getting RL Sample Viewer (RSV) end-to-end dashboard latency under 200ms

RSV is OpenAI's #1 most-used internal research tool — important to keep the scope: the slide says "#1 Most-used internal research tool," not the #1 tool company-wide. Slack, internal ChatGPT, Codex, and the like obviously see vastly more traffic. Hundreds of researchers use RSV daily, each reviewing hundreds of samples. Inspecting samples is the core mechanism for understanding model behavior and debugging issues, so latency directly couples to research productivity. The team spent an entire year compressing end-to-end dashboard latency (note: e2e dashboard latency, not database query latency) from "several seconds, sometimes tens of seconds" down to under 200 ms. Here's how — this section is pure substance.

1. Sharding by experiment (256 shards)

The vast majority of queries are scoped to a single experiment. So the table of tens of PB gets hashed into 256 shards by experiment id, and queries route to the matching shard. The effect: queries no longer scan the full table, just a small physical slice. The cost: skew remains possible — very large runs make a few shards heavy, but most shards stay small enough for good latency.

2. Clustering keys: the #1 design choice

This is the lever that decides Snowflake performance. Rows sharing a clustering key get physically colocated, enabling efficient data pruning and reducing scan volume. The constraint is one clustering key definition per table (can be a composite of multiple fields), so the key has to be chosen for the most important queries in the workload.

3. From `experiment_id` to `(event_date, experiment_id)`: reducing churn

This is the move I think is most worth pulling out, because it hits on a subtle Snowflake clustering trap:

Cluster by experiment_id: new experiment ids are essentially "randomly" distributed, so new data inserts itself between many historical micro-partitions, triggering large-scale rewrites (high churn) — severe write amplification.
Cluster by event_date: new data only lands on recent partitions, leaving historical partitions untouched (low churn); new experiments only affect recent partitions.

Their final clustering key is (event_date, experiment_id). The supporting requirement: ensure all queries include a time filter; for queries without time filters, maintain a separate index table that gives the time range for each experiment.

One-line takeaway: choose clustering keys aligned with "monotonic/time-correlated" dimensions; avoid high-cardinality random inserts, or you'll be silently paying for continuous reclustering. This pattern generalizes to Iceberg / Delta Lake / BigQuery clustering as well.

4. Rockset caches the last 7 days

Cache the last 7 days of data in Rockset to serve real-time queries — this covers 90%+ of workloads. Older data falls back to Snowflake. Reference numbers (note: these reflect PB-scale real-time RL workloads, not general Snowflake performance):

Average RSV query latency on Snowflake: 500ms–1s
p99 long-tail queries on Snowflake can still take several seconds
Rockset consistently achieves double-digit milliseconds

An important scope caveat: that "200ms end-to-end response" only holds on the query path where the last 7 days are cached in Rockset — about 90% of queries. The remaining 10% falling back to Snowflake average 500ms–1s, with p99 still reaching several seconds. So if anyone summarizes this as "OpenAI achieved millisecond response on petabyte-scale RL data," it's a heavily caveated achievement, not a general platform capability. OpenAI did not solve the general problem of "low-latency analytics on petabyte-scale data" — they solved the specific problem of "for our query patterns, layered caching gets 90% of paths into the millisecond range." Those two statements sound similar but differ a lot in what they actually claim.

5. Custom `_id` for deduplication

Some queries need to aggregate over a huge number of rows (e.g., min/max training step for a given experiment requires scanning all that experiment's samples). Rockset automatically deduplicates rows by _id, retaining only the latest version. So they set _id = (experiment_id:training_step), which collapses each (experiment, training step) pair into a single row at ingestion — smaller table, faster aggregation queries. Clever use of "primary key semantics as pre-aggregation."

6. Streamlit → React, with 99% of the code written by Codex

Initially they used Streamlit — Python-native, researchers could build dashboards without frontend expertise. But rendering large samples made the UI noticeably laggy. React gives better UX but requires frontend skills — until Codex could generate good React code, at which point the barrier collapsed. Result: 99% of RSV's React code was generated by Codex, frontend latency dropped, UX improved. More OpenAI dashboards are migrating from Streamlit to React.

There's a fun "dogfood signal" embedded in this: capability improvements in their own tools have started reshaping their internal technology choices.

Side note: Streamlit was acquired by Snowflake in 2022 for $800M — so saying "we're migrating away from Streamlit" on Snowflake's own home stage is, in theory, a little awkward. But OpenAI handled it diplomatically: they framed the cause as "rendering large samples felt laggy in our specific case" and "Codex made React feasible without frontend expertise," neither dismissing Streamlit's original value nor missing a chance to plug their own Codex. This kind of "gentle boundary acknowledgment + in-house product promotion" is a standard move in big-vendor conference talks.

Snowflake as a research metrics store

Sample events also contain plenty of data that can be tracked as metrics: pass rates, response token counts, tool calls, failure rates (both directly and derived). At runtime these are pre-aggregated and logged to Neptune for real-time observability.

So why also use Snowflake for metrics? Two reasons Neptune can't cover: deriving new metrics from existing data, and on-demand backfills when metric definitions change.

The challenge is query latency: metric fields are buried inside deeply nested JSONs, parsing is expensive, and metric definitions change frequently enough that pre-processing (extract + aggregate) is hard to stabilize. They tried two approaches:

Materialized Views (MVs): maintained automatically by Snowflake at the micro-partition level, with a limited set of supported aggregations (AVG/SUM/COUNT). Two problems: first, the base table is constantly being reclustered, which temporarily invalidates MVs and forces queries to fall back to the base table, producing wildly variable latency (jumps from seconds to minutes have been observed); second, MVs don't support incremental backfill — adding a column or redefining a metric requires recomputing the entire MV. At their scale, that cost is prohibitive.

Dedicated metrics table (the chosen path): maintain a separate table containing only the metric-relevant fields, clustered on the right dimensions for the target query patterns, with targeted backfills via DELETE + INSERT. The flow is Completed Samples (Base) → Snow Stream + Task → Metrics Table. Benefits: faster base table queries; no full recomputation for backfills — and in practice, when you add a field, you usually only care about experiments from the last few months anyway. Older data doesn't need to be touched, and targeted backfill is dramatically cheaper.

Active areas

Snowflake Interactive Tables: optimized for low-latency, high-concurrency workloads; they want to use it to serve some application queries directly.
Streamlining backfills: build systems to reduce the operational toil of backfills.

My takeaways

A few observations I want to leave behind — all of these are my own readings, not direct claims from the speaker:

First, within this Research Data Platform, RL sample data is the dominant workload. That 70% number (again, Snowflake storage composition, not OpenAI's overall business) tells you the central tension for post-training research data infrastructure is "how do we assemble and query massive, out-of-order, oversized samples with low latency" — not the conventional warehouse playbook.

Second, the backbone of this architecture is a recurring pattern: warehouse as index, object storage as source of truth, streaming engine as assembler, specialized cache as real-time layer. The 16 MB trimming, blob pointers, Flink Joiner, Rockset cache — all are facets of the same underlying idea.

Third, the clustering-key move is the most generally applicable trick here. "Swap a high-cardinality random clustering key for a time-correlated one to reduce churn" is a universal optimization for any write-heavy Snowflake / data lake workload, and a lot of teams don't realize they're quietly paying for continuous reclustering.

Fourth, tooling progress is quietly redrawing team boundaries. Codex writing 99% of the frontend code directly changes the answer to the old question of "should we use React?" When generation capability is strong enough, the constraints behind technical choices get rewritten — and the second-order effects of that may matter more than any single optimization.

Fifth, this talk is not OpenAI showing off "look how fast we are" — it's a story of "we got pushed around by the specific shape of petabyte-scale RL data, here are the walls we hit and the compromises we made." The whole evolution (four generations of Joiner, clustering key change, MV → dedicated metrics table) shares one shape: "hit the wall, then go around it" — not prescient elegant design. For external readers, this is actually a more useful framing: frontier AI labs' infrastructure isn't fundamentally cognitively different from what you and I work on daily; it's just several orders of magnitude bigger. Reading it as "OpenAI's superpower display" misses the talk's real value — it's an honest record of engineering evolution.

Based on publicly presented conference slides; technical naming and numbers follow the speaker's content. Analysis and inferences are my own, flagged inline where they appear, and do not represent OpenAI or Snowflake's official positions.

Note: This post was drafted with the assistance of Claude, and reviewed by ChatGPT (mainly) and Gemini.

中文版：

OpenAI 怎么在 Snowflake 上搭 Research Data Platform:一场架构演进的现场拆解

现场笔记 + 架构复盘。基于 OpenAI 团队在 Snowflake Dev Day 2026(2026-06-04)的分享《Research Data Platform at OpenAI》整理。文中数据、命名均来自演讲幻灯片;分析与延伸为个人解读,会在文中明确标记。本文不是"OpenAI 多牛"的吹捧文,演讲本身也不是——它是一场关于"被 PB 级 RL 数据推着走、踩坑、改方案、再撞墙"的诚实工程演进记录。

先说一个让人很有体感的数字

OpenAI 的开场不是讲数据库,而是讲发布节奏:模型从研究到上线的周期,已经从 15 个月压缩到了 6 周。

这件事对数据平台意味着什么?意味着上游每加速一档,下游"看实验、读样本、算指标"的链路就要跟着提速一档。而到了这个阶段,研究侧的主力工作负载已经不是预训练,而是 RL(强化学习)后训练——这一点直接体现在他们 Snowflake 里的数据构成上:

Snowflake 里超过 70% 的数据,是 RL 训练跑出来的 sample events 和 complete samples。

注意这是 Snowflake 的存储构成,不是 OpenAI 整体业务构成——但即便如此,这个数字仍然说明,后训练时代的研究数据基础设施,核心矛盾已经从"如何存放预训练语料"转移到"如何让海量、乱序、超大 payload 的 RL 样本被低延迟地拼装和查询"。

规模到底有多变态

团队列了四条"扩展压力",每一条单独拿出来都够一个数据团队头疼:

数据量:过去 12 个月增长 10 倍,从个位数 PB 涨到几十 PB,年底预计冲到数百 PB。
写入吞吐:日均写入已经是数百 TB 级别,个别工作负载单日能写超过 1 PB。
读延迟:仪表盘(尤其是 RL Sample Viewer)要求双位数毫秒级读取;研究员的脚本和临时查询要求秒级响应。
Agentic 负载:越来越多查询是模型自己生成的(agent 在跑),仓库用量上涨,而且让容量规划变得难以预测。

最后这条特别值得玩味:当 agent 成为查询发起方,负载形态就不再服从"人类作息曲线",优化和成本预测都得重新建模。这是个未来会越来越普遍的问题。

选型:Snowflake 当默认分析层,Rockset 当实时缓存

整体定位很清晰:

Snowflake = 主分析层,承载 RL 实验数据(samples、metrics),也兼顾硬件健康、frontier eval 等场景。优点是研究员能快速搭管道和看板,不用为了规模牺牲(可接受的秒级)延迟。
Rockset = 实时缓存层,专门服务那些需要双位数毫秒读取的用户向 / 强交互路径。
同时在评估 Snowflake Interactive Tables 和 Snowflake Postgres 来覆盖部分低延迟场景。

这里值得补一个 slide 上没明说、但有助于理解这个选型的背景:Rockset 在 2024 年被 OpenAI 收购。slide 本身没把"为什么用 Rockset"明确归因到这件事,所以严格说,接下来这个解读是我的推测,不是演讲者的原话——但这个时间线让"用 Rockset 做缓存层"看起来不像普通的第三方选型,更像是把自家技术栈塞进了研究基础设施。Rockset 底层基于 RocksDB(LSM-tree 存储引擎,天生写优化),维护行、列、搜索(倒排)三套索引——既能高效点查,又能做实时聚合,正好补上 Snowflake 在毫秒级实时查询上的短板。

第一道硬骨头:Joiner(把碎片拼成完整样本)

问题长什么样

一次 sample rollout 是跨多个分布式系统发生的,每个系统只产出一个 sample event,各自带着丰富的局部上下文。但研究员要看的不是零散的 event,而是一条完整的 sample。把分散的 events 按 join key 拼回完整样本,这就是 Joiner 的活儿。

难点在于:

event 可能乱序到达,延迟从几秒到几天不等;
payload 非常大(prompt、对话、chain-of-thought 全在里面);
研究员还要求从"样本完成"到"可查询"之间低延迟。

乱序 + 大 payload + 低延迟,这三个约束放一起,基本注定了简单方案都会撞墙。

Joiner 的四代演进(一部典型的流式系统成长史)

这段我觉得是整场最有教学价值的部分,因为它几乎是所有数据团队都会走一遍的路:

Early 2024 — Driver 内合并:在 driver 进程里把完整样本聚好再落库。问题:给训练基础设施带来高额开销。
Mid 2024 — SnowTask 做 join:用 Snowflake Task 从 sample events 表里读取并 join。问题:事件量一高,成本高到难以承受。
Late 2024 — 自定义 Python Job:按实验周期性批跑。问题:端到端延迟高,实验一多就扩不动。
Late 2025 至今 — Flink 流式 Join:近实时 join,水平可扩展,p99 延迟 < 1 分钟。

这条演进线最值得抽出来的工程判断是 slide 自己点出来的那句:"Each step kept the completed-sample contract while reducing latency/cost/operational risk." 每一代实现都翻天覆地,但输入是 sample events、输出是 complete sample 这个对外契约始终没变。这是能够持续重写底层实现而不影响上游研究员的根本前提——一个看似不起眼、但在大型系统演进里几乎决定生死的设计纪律。

Flink 管道拆成四个阶段

之所以拆成多个阶段,是因为每个阶段的扩展特性不一样,分开才好独立伸缩:

Scan(扫描文件):处理新文件到达的通知,做过滤和路由。
Index(抽取事件):从文件里抽取轻量元数据;把大 payload 挪到 Premium SSD blob 存储,元数据里只留指针。
Join(组装样本):用 Flink 的 state(底层是 RocksDB——演讲者口头明确提到)跟踪未完成样本的 lineage,每完成一条样本就发出对应的事件谱系。
Emit(落盘 payload):根据 lineage 元数据从 blob 里读回 payload,发出最终完整样本。

工程细节上有两点值得记:Flink 跑在 at-least-once 模式 + Rockset 做去重(用最终一致性换吞吐,去重兜底正确性);用 心跳事件 来追踪那些跨好几天的超长 rollout。已知痛点是 checkpoint 太大——会撞上 Azure Blob 存储限流,重启时间也长。大状态流式作业的老大难,谁跑谁知道。

第二道硬骨头:16 MB 行限制怎么破

Snowflake 单行有 16 MB 上限,而一条样本动辄是嵌套了对话、CoT 的大 JSON,很容易超标。他们的做法是一套"裁剪 + 引用"组合拳:

行超限时,把大字段从 Snowflake 行里剪掉,原始记录完整保留在 blob 存储;
行里留下 blob 路径和文件指针,需要时按引用 rehydrate,恢复完整 payload;
音频、图片等媒体资产本来就放 blob,样本里只存指向文件位置的指针。

这其实是"把数仓当索引、把对象存储当真相之源"的经典模式——这条抽象是我自己提炼的,slide 没这么讲。数仓只保留可查询、可裁剪的结构化部分,重负载下沉到便宜的 blob,靠指针做关联。

RSV 优化:把端到端 dashboard 延迟从几十秒压到 200ms 以下

RSV(RL Sample Viewer)是 OpenAI 内部使用率第一的研究工具(slide 的限定词是 "#1 Most-used internal research tool" ——不是全公司第一工具,Slack、内部 ChatGPT、Codex 这些日常工具的使用量量级显然更大,所以限定要保住)。每天数百名研究员、人均审阅数百条样本。检查样本是理解模型行为、调试问题的核心手段,所以延迟直接挂钩研究效率。团队花了整整一年,把端到端 dashboard 延迟(注意是 e2e dashboard latency,不是数据库查询延迟)从"几秒、有时几十秒"压到了 200ms 以内。怎么做到的——这部分是纯干货。

1. 按实验分片(256 shards)

绝大多数查询都限定在单个实验内。于是把那张几十 PB 的大表按 experiment id 哈希成 256 个分片,查询路由到对应分片。效果是查询不用扫全表,只碰一小块物理切片。代价是 倾斜依然存在——超大实验会让个别分片变重,但大部分分片足够小,整体延迟可控。

2. Clustering Key:最关键的一个设计决策

这是 Snowflake 性能的命门。同一个 clustering key 的行会被物理放在一起,从而实现高效的 data pruning(分区裁剪),减少扫描量。约束是每张表只能有一个 clustering key 定义(可以是多字段组合),所以这个 key 必须针对最重要的查询来选。

3. 从 experiment_id 改成 (event_date, experiment_id):减少 churn

这是我认为最值得单拎出来讲的一招,因为它踩中了 Snowflake clustering 的一个隐蔽陷阱:

按 experiment_id 聚簇:新实验的 id 是"随机"散布的,新数据会插进大量历史 micro-partition 之间,触发大面积重写(high churn)——写放大极其严重。
按 event date 聚簇:新数据基本只落在最近的分区,历史分区几乎不动(low churn),新实验只影响近期分区。

最终他们采用的 clustering key 是 (event_date, experiment_id)。配套要求是:保证查询都带时间过滤;对那些不带时间过滤的查询,额外维护一张索引表来获取每个实验的时间范围。

一句话原则:聚簇键尽量选"单调/时间相关"的维度,避开高基数随机插入,否则你会被持续 reclustering 的成本吃掉。 这条规律对 Iceberg / Delta Lake / BigQuery 的 clustering 设计同样适用。

4. Rockset 缓存最近 7 天

把最近 7 天的数据缓存进 Rockset 服务实时查询,覆盖 90%+ 的工作负载;更老的数据回退到 Snowflake。给的参考数字(注意:这是 PB 级实时 RL 负载下的数字,不代表 Snowflake 的一般性能):

Snowflake 上 RSV 平均查询延迟 500ms–1s;
p99 长尾查询在 Snowflake 上仍可能要好几秒;
Rockset 能稳定做到双位数毫秒。

一个重要的范围限定:这个"200ms 端到端响应"只在最近 7 天数据被 Rockset 缓存的查询路径上成立——覆盖约 90% 的查询。剩下 10% 落到 Snowflake 的查询,平均 500ms–1s,p99 长尾仍可达数秒。所以"在 PB 级 RL 数据上做到毫秒级响应"如果被那么概括,是个被严格限定的成就,不是平台整体的通用能力。OpenAI 并没有解决"PB 级低延迟分析"这个一般问题——他们解决的是"在自己的查询模式下,通过分层缓存把 90% 路径压到毫秒级"这个具体问题。这两个表述听起来类似,差别其实很大。

5. 用自定义 `_id` 做去重优化

有些查询要在海量行上做聚合(比如取某实验的 min/max training step,得扫该实验全部样本)。Rockset 会按 _id 字段自动去重,只保留最新版本。于是他们把 _id 设成 (experiment_id:training_step),让每个实验的每个 training step 在摄入时就收敛成一行——表更小,聚合查询更快。这是一种非常聪明的"用主键语义做预聚合"。

6. Streamlit → React,而且 99% 的代码是 Codex 写的

最初用 Streamlit,好处是研究员不懂前端也能快速搭看板;但渲染大样本时 UI 明显卡顿。React 体验更好,可门槛在于需要前端能力——直到 Codex 能直接生成 React 代码,这个门槛被抹平了。结果:99% 的 React 代码由 Codex 生成,前端延迟下降、体验提升。OpenAI 内部越来越多看板正从 Streamlit 迁往 React。

这条其实是个很有意思的"吃自家狗粮"信号:工具能力的提升,反过来改变了团队的技术选型边界。

顺便:Streamlit 是 Snowflake 2022 年 8 亿美元收购的产品,在 Snowflake 主场上讲"我们正在迁出 Streamlit",理论上有点尴尬。但 OpenAI 的措辞处理得相当圆滑——把原因归结为"在我们这个场景下渲染大样本卡顿"和"Codex 让 React 也变得没门槛了",既没否定 Streamlit 的设计初衷,也顺带夸了自家 Codex。这种"温柔的功能边界提醒 + 自家产品广告"是大厂会议演讲的标准操作。

第三块:把 Snowflake 当 Metrics Store

样本事件里藏着大量可以当指标追踪的信息:pass-rate、响应 token 数、工具调用、失败率等等(既有直接指标,也有派生指标)。运行时这些指标会预聚合后写进 Neptune 做实时可观测。

那为什么还要用 Snowflake 算指标?因为要从已有数据派生新指标,以及改了指标定义后按需 backfill 历史实验数据——这是 Neptune 给不了的灵活性。

挑战是查询延迟高:指标字段深埋在嵌套 JSON 里,解析慢;而且指标定义经常变,预处理(抽取 + 聚合)很难做稳。他们试了三条路:

Materialized Views(物化视图):Snowflake 自动维护、在 micro-partition 级别预算聚合,支持的聚合有限(AVG/SUM/COUNT 这些)。问题有两个——其一,base table 一直在 reclustering,会临时让 MV 失效、查询回退到 base table,导致延迟忽高忽低(见过几秒到几分钟的跳变);其二,MV 不支持增量 backfill,加一列或重定义一个指标,就得把整张 MV 重算,成本高到劝退。

专用 Metrics Table(最终方案):维护一张只含指标相关字段的专用表,按合适的维度聚簇,通过 DELETE + INSERT 做定向 backfill。链路是 Completed Samples (Base) → Snow Stream + Task → Metrics Table。好处:base 查询更快;backfill 不用全表重算——而且现实中你加字段时,往往只关心最近几个月的实验,老数据根本不用动,定向回填就便宜太多了。

还在探索的方向

Snowflake Interactive Tables:针对低延迟、高并发负载优化,想用它直接服务部分应用查询;
简化 backfill:构建系统降低 backfill 的运维负担。

我的几点总结

把整场拆完,有几个观点想留下来——以下都是我自己的解读,不是演讲者原话:

第一,在这套 Research Data Platform 里,RL sample data 占据了绝对主力。 70% 这个数字(再次强调,是 Snowflake 数据构成,不是 OpenAI 整体业务构成)说明,后训练时代的研究数据基础设施,核心矛盾是"海量、乱序、超大 payload 的样本如何被低延迟地拼装和查询",而不是传统数仓那套。

第二,这套架构的脊梁是一个反复出现的模式:数仓做索引、对象存储做真相之源、流式引擎做拼装、专用缓存做实时。 16MB 裁剪、blob 指针、Flink Joiner、Rockset 缓存,全是这个思路的不同切面。

第三,clustering key 那一招最值得抄作业。 "把聚簇键从高基数随机维度换成时间维度以降低 churn",是个对任何重写密集型 Snowflake / 数据湖场景都通用的优化,而且很多团队意识不到自己正在为 reclustering 默默付费。

第四,工具进步在重塑团队边界。 Codex 写掉 99% 的前端代码,直接让"该不该用 React"这个老问题有了新答案。当生成能力足够强,技术选型的约束条件会被悄悄改写——这件事的二阶效应,可能比任何单点优化都深远。

第五,这场分享不是 OpenAI 在炫"我们多快多牛",而是在讲"我们如何被 PB 级 RL 数据的具体形态推着走、踩了哪些坑、做了哪些妥协"。 整个演进史(Joiner 四代、clustering key 从 experiment_id 改成 (event_date, experiment_id)、MV 改成 dedicated metrics table)的共同特征是 "先撞墙、再绕过墙"——不是预判性的优雅设计。对外部读者其实是个更有用的视角:前沿 AI 实验室的基础设施,跟你我每天做的工作没有本质的认知差距,只是规模大几个数量级。把它读成"OpenAI 的超能力展示",反而错过了这场分享真正的价值——它是个工程演进的诚实记录。

本文基于公开演讲幻灯片整理,技术命名与数据以演讲内容为准;分析观点为作者个人解读,已在文中明确标记,不代表 OpenAI 或 Snowflake 官方立场。

注：本文在Claude的协助下起草，并由ChatGPT（主要）和Gemini进行了审阅。

DEV Community

How OpenAI Built a Research Data Platform on Snowflake: A Field Notes on an Architecture in Motion

Start with a number that lands

What the scale actually looks like

Choices: Snowflake as default analytics, Rockset as real-time cache