This post was originally supposed to be about how to set up GPU Flight locally. But before getting into that, I decided to cover the recent schema changes first because I made a major update and wanted to explain why.
The overall design did not change that much, but the event format and storage strategy changed quite a bit.
As I mentioned before, I was planning to host the GPU Flight frontend and backend on a live server so that anyone could try it. But when I roughly estimated the potential traffic, I realized that if multiple users ran heavy profiling jobs at the same time, the cost could become extremely high. On top of that, the backend would not handle that volume of raw data efficiently.
Here is the rough calculation.
If we assume 50,000 messages per second in total, that could mean 10 users generating 5,000 messages per second each. That is not unrealistic, because profiling can easily generate more than 5,000 events per second.
Now let's assume 1 KB per event.
That becomes:
- 50 MB/sec
- 4.32 TB/day
- 129.6 TB/month of raw data

If I stored that directly in the database, the storage cost alone could reach roughly $28k/month.
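The arithmetic behind these numbers, as a quick sanity check (using a 30-day month and decimal TB):

```python
# Back-of-the-envelope traffic estimate for the raw format.
# Assumptions match the text: 50,000 messages/sec at ~1 KB per event.
MSGS_PER_SEC = 50_000
BYTES_PER_EVENT = 1_000  # ~1 KB

bytes_per_sec = MSGS_PER_SEC * BYTES_PER_EVENT  # 50 MB/sec
bytes_per_day = bytes_per_sec * 86_400          # seconds per day
bytes_per_month = bytes_per_day * 30            # 30-day month

print(f"{bytes_per_sec / 1e6:.0f} MB/sec")       # 50 MB/sec
print(f"{bytes_per_day / 1e12:.2f} TB/day")      # 4.32 TB/day
print(f"{bytes_per_month / 1e12:.1f} TB/month")  # 129.6 TB/month
```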
Of course, I would put usage limits on each user and apply an appropriate retention policy, but even then, the raw format was too expensive.
So, I had to redesign the schema to reduce the size.
There were several possible ways to optimize both ingestion and storage. At first, I thought about adding an aggregation layer so that real-time data could be aggregated before being saved to the database. There were also multiple ways to do that: implement the aggregation layer myself, or use a streaming system such as Kafka, Flink, or Spark Streaming.
However, setting up and operating that kind of environment is currently out of my budget as well. So instead, I decided to change the event schema itself and send data in batches to reduce the payload size.
Here is what the old format looked like.
The Old Format: One Fat JSON Per Event
Every CUDA kernel completion triggered a log entry that looked roughly like this:
{
"type": "kernel_event",
"pid": 1234,
"app": "resnet_training",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"name": "void cunn_ClassNLLCriterion_updateOutput_kernel<float, float>()",
"platform": "CUDA",
"has_details": true,
"device_id": 0,
"stream_id": 140234567890,
"start_ns": 1706800123456789,
"end_ns": 1706800123567890,
"api_start_ns": 1706800123400000,
"api_exit_ns": 1706800123450000,
"grid": "(64,1,1)",
"block": "(256,1,1)",
"dyn_shared_bytes": 0,
"num_regs": 32,
"static_shared_bytes": 8192,
"local_bytes": 0,
"const_bytes": 0,
"occupancy": 0.750000,
"reg_occupancy": 0.800000,
"smem_occupancy": 1.000000,
"warp_occupancy": 0.750000,
"block_occupancy": 0.750000,
"limiting_resource": "REGISTERS",
"max_active_blocks": 4,
"corr_id": 98765432,
"local_mem_total_bytes": 0,
"local_mem_per_thread_bytes": 0,
"cache_config_requested": 0,
"cache_config_executed": 0,
"shared_mem_executed_bytes": 8192,
"user_scope": "Training|Forward",
"stack_trace": "main|train_step|forward"
}
~700 bytes per kernel. For a training run that calls 100,000 kernels, that's 70 MB just for kernel events — before counting scope markers, memory copies, PC sampling results, or system metrics.
Three problems stand out:
- Repeated strings. session_id, app, platform, and field names like "dyn_shared_bytes" were sent repeatedly with every single event.
- Wide rows for mostly-empty columns. Fields such as grid, block, and the occupancy values are only meaningful when has_details = true, which holds for just a subset of kernels. Still, every event carried all ~35 fields.
- One HTTP POST per event. At 10 kernels/ms, the write loop was spending more time serializing and flushing than the kernels themselves took to run.
The same pattern appeared in other event types as well: scope_begin + scope_end as two separate JSON objects per scope, system_sample with host CPU stats and per-GPU stats packed into one blob every second, and profile_sample with the full mangled function name repeated on every PC-sampling hit.
The New Format: Columnar Batches + String Interning
The redesign rests on two ideas:
1. Lazy String Interning via dictionary_update
Every variable-length string — kernel names, scope names, function names, metric names —
is assigned a compact integer ID the first time it is seen. A single dictionary_update
message ships new mappings to the backend:
{
"type": "dictionary_update",
"session_id": "550e8400...",
"kernel_dict": {
"1": "void cunn_ClassNLLCriterion_updateOutput_kernel<float, float>()",
"2": "void volta_gemm_fp16_128x128_ldg8_nn_kernel()"
},
"scope_name_dict": {
"1": "Training",
"2": "Forward",
"3": "Backward"
}
}
This dictionary is sent at most once per unique string per session. After that, events refer to strings by integer ID. A 70-byte kernel name becomes a 1-2 byte integer.
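The interning logic can be sketched in a few lines. This is an illustrative Python version, not the real C++ client code; the class and method names are mine:

```python
class StringInterner:
    """Lazily assigns integer IDs to strings and tracks which mappings
    still need to be shipped in the next dictionary_update message."""

    def __init__(self):
        self.ids = {}       # string -> id
        self.pending = {}   # id -> string, not yet sent to the backend

    def intern(self, s):
        # First sighting allocates the next compact ID and queues the mapping.
        if s not in self.ids:
            new_id = len(self.ids) + 1
            self.ids[s] = new_id
            self.pending[new_id] = s
        return self.ids[s]

    def flush_dictionary(self):
        """Return new mappings exactly once, for a dictionary_update payload."""
        update, self.pending = self.pending, {}
        return update

kernels = StringInterner()
kid = kernels.intern("void cunn_ClassNLLCriterion_updateOutput_kernel<float, float>()")
kernels.intern("void volta_gemm_fp16_128x128_ldg8_nn_kernel()")
kernels.intern("void volta_gemm_fp16_128x128_ldg8_nn_kernel()")  # already interned, no-op
print(kid)                          # 1
print(kernels.flush_dictionary())   # the two new mappings, sent once
print(kernels.flush_dictionary())   # {} : nothing new to send
```

After the first flush, every later event refers to the 70-byte kernel name by the integer 1.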
2. Columnar Rows, Flushed in Batches of 512
Events of the same type are buffered and emitted together. The schema is declared once in a columns array, and each row is a compact JSON array of numbers:
{
"type": "kernel_event_batch",
"session_id": "550e8400...",
"batch_id": 1,
"base_time_ns": 1706800123000000,
"columns": ["dt_ns","kernel_id","stream_id","duration_ns","corr_id","dyn_shared","num_regs","has_details"],
"rows": [
[456789, 1, 0, 111101, 98765432, 0, 32, 1],
[512034, 1, 0, 109876, 98765433, 0, 32, 1],
[601290, 2, 1, 87654, 98765434, 0, 64, 0]
]
}
The timestamp is stored as a delta from base_time_ns, which is usually a small number, often just a few hundred thousand nanoseconds. That keeps the values compact. The kernel name becomes an integer ID.
Columns the backend doesn't need for every kernel — such as grid, block, occupancy — are removed from the hot path and sent separately via kernel_detail only when has_details = 1.
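The client-side buffering can be sketched as follows, assuming the field layout from the example above (the real client is C++; the class, method names, and session ID here are illustrative):

```python
import json

BATCH_SIZE = 512
COLUMNS = ["dt_ns", "kernel_id", "stream_id", "duration_ns",
           "corr_id", "dyn_shared", "num_regs", "has_details"]

class KernelBatcher:
    """Buffers kernel events and emits one kernel_event_batch per flush.
    Timestamps are stored as deltas from the first event's start_ns."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.batch_id = 0
        self.base_time_ns = None
        self.rows = []

    def add(self, start_ns, *fields):
        # The first event in a batch fixes base_time_ns; every row stores a delta.
        if self.base_time_ns is None:
            self.base_time_ns = start_ns
        self.rows.append([start_ns - self.base_time_ns, *fields])
        # Emit eagerly once the batch is full; otherwise keep buffering.
        return self.flush() if len(self.rows) >= BATCH_SIZE else None

    def flush(self):
        self.batch_id += 1
        msg = json.dumps({
            "type": "kernel_event_batch",
            "session_id": self.session_id,
            "batch_id": self.batch_id,
            "base_time_ns": self.base_time_ns,
            "columns": COLUMNS,
            "rows": self.rows,
        })
        self.rows, self.base_time_ns = [], None
        return msg

b = KernelBatcher("example-session")
b.add(1706800123456789, 1, 0, 111101, 98765432, 0, 32, 1)
b.add(1706800123512034, 1, 0, 109876, 98765433, 0, 32, 1)
out = json.loads(b.flush())
print(out["rows"][0][0])  # 0: the first row's delta is always zero
print(out["rows"][1][0])  # 55245 ns after base_time_ns
```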
System metrics follow the same pattern:
{
"type": "host_metric_batch",
"session_id": "550e8400...",
"base_time_ns": 1706800000000000,
"columns": ["dt_ns","cpu_pct_x100","ram_used_mib","ram_total_mib"],
"rows": [
[0, 4523, 8192, 32768],
[1000031450, 4687, 8256, 32768]
]
}
CPU percentage is sent as an integer multiplied by 100, which avoids floating-point serialization overhead while still preserving two decimal places.
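Encoding and decoding the fixed-point percentage is a one-liner on each side:

```python
def encode_cpu_pct(pct: float) -> int:
    # 45.23% -> 4523; stays a small integer on the wire
    return round(pct * 100)

def decode_cpu_pct(raw: int) -> float:
    return raw / 100

assert encode_cpu_pct(45.23) == 4523
assert decode_cpu_pct(4523) == 45.23
```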
Measuring the Savings
Let's walk through each event type using a 1-hour training session as an example:
- 100,000 kernel invocations (mixed forward + backward pass)
- 7,200 system metric samples (1 Hz)
- 10,000 scope events (begin + end pairs)
- 50,000 PC-sampling hits
Kernel Events
| | Old | New |
|---|---|---|
| Per-event size | ~700 B | — |
| Batch size | — | 512 rows |
| Batch overhead | — | ~220 B (header) |
| Per-row size | 700 B | ~35 B |
| Dictionary (500 unique names × 70 B) | 35 MB (repeated every call) | 35 KB (once) |
| 100,000 events total | 70 MB | 3.6 MB |
Reduction: 19×
The dictionary savings are especially large here. A typical training workload may have hundreds of unique kernel names, many of them template instantiations with mangled names 60–90 characters
long. Under the old format, those strings crossed the wire over and over again.
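As a quick sanity check on the 3.6 MB figure, the per-row and per-batch numbers from the table multiply out as expected:

```python
# Rough per-session size for kernel events under the new format,
# using the per-row and per-batch numbers from the table above.
EVENTS = 100_000
ROWS_PER_BATCH = 512
BATCH_HEADER_B = 220
ROW_B = 35
DICT_B = 500 * 70  # 500 unique names x 70 B, sent once

batches = -(-EVENTS // ROWS_PER_BATCH)  # ceiling division: 196 batches
total = batches * BATCH_HEADER_B + EVENTS * ROW_B + DICT_B
print(f"{total / 1e6:.2f} MB")  # ~3.58 MB, in line with the ~3.6 MB above
```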
System Metrics
| | Old | New |
|---|---|---|
| Per-sample size (host + device combined) | ~450 B | — |
| Per-row size | — | ~20 B (host), ~30 B (device) |
| 7,200 samples total | 3.2 MB | 360 KB |
Reduction: 9×
Host metadata (hostname, ip_addr, app) never changes during a session, so it was removed from per-row payloads and stored at the session level instead.
Scope Events
In the old format, scope begin and end were two separate JSON objects.
{"type":"scope_begin","session_id":"...","name":"Forward","user_scope":"Training|Forward","ts_ns":1706800123000000,...}
{"type":"scope_end", "session_id":"...","name":"Forward","user_scope":"Training|Forward","ts_ns":1706800124000000,...}
That is about 200 B each, or roughly 400 B per complete scope.
In the new format, both become rows in scope_event_batch, identified by a scope_instance_id that pairs the begin and end reliably without string matching:
[0, 1, 2, 0, 1] ← begin: dt_ns=0, instance=1, name_id=2, event_type=0, depth=1
[980000, 1, 2, 1, 1] ← end: dt=980µs later, same instance, event_type=1
| | Old | New |
|---|---|---|
| Per-scope pair | ~400 B | ~40 B (2 rows) |
| 10,000 scopes total | 4 MB | 420 KB |
Reduction: 9.5×
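Pairing on the backend then reduces to a dictionary lookup. A minimal sketch, assuming the row layout shown above ([dt_ns, instance, name_id, event_type, depth]; the function name is mine):

```python
BEGIN, END = 0, 1

def pair_scopes(rows):
    """Return {scope_instance_id: duration_ns} for completed scopes.
    No string matching: the instance ID alone pairs begin with end."""
    open_scopes = {}  # instance_id -> begin dt_ns
    durations = {}
    for dt_ns, instance, name_id, event_type, depth in rows:
        if event_type == BEGIN:
            open_scopes[instance] = dt_ns
        else:
            durations[instance] = dt_ns - open_scopes.pop(instance)
    return durations

rows = [
    [0,      1, 2, BEGIN, 1],
    [980000, 1, 2, END,   1],
]
print(pair_scopes(rows))  # {1: 980000}
```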
PC Sampling / Profile Samples
The old profile_sample repeated the full mangled function name, source file path, and stall-reason string on every row. At 50,000 hits from a single kernel, a 90-character function name alone cost 4.5 MB.
{
"type": "profile_sample",
"function_name": "void volta_gemm_fp16_128x128_ldg8_nn_kernel<volta_gemm_params_fp16>@volta_gemm.cu",
"stall_reason": "MEM_DEPENDENCY",
"sample_count": 7,
...
}
New format: function ID (integer), stall reason (integer CUPTI code), metric value (integer):
[0, 98765432, 0, 3, 420, 0, 7, 4, 0, 2]
| | Old | New |
|---|---|---|
| Per-sample size | ~300 B | ~30 B |
| 50,000 samples total | 15 MB | 1.5 MB |
Reduction: 10×
Session Total
| Event type | Old | New |
|---|---|---|
| Kernel events | 70 MB | 3.6 MB |
| System metrics | 3.2 MB | 360 KB |
| Scope events | 4 MB | 420 KB |
| Profile samples | 15 MB | 1.5 MB |
| Total | 92.2 MB | 5.9 MB |
Overall reduction: 15.6× — from 92 MB down to 6 MB per hour of training.
What Changed in the Schema
Because no external data had been published yet, I took the migration as an opportunity to slim the database schema alongside the wire format:
Removed columns (data moved to session-level or derived):
- uuid, vendor, name, pci_bus from device_metrics (now in cuda_static_devices)
- hostname, ip_addr, event_type from host_metrics and device_metrics
- source_file, source_line from profile_samples (already encoded in function_name@file)
New tables:
- session_dictionaries — persists the string intern tables so the backend can resolve IDs after restart
- memcpy_events — memcpy activity was previously missing from the schema entirely
Redesigned profile_samples:
The old schema had SASS-specific columns (inst_executed, thread_inst_executed, sample_count) hard-coded for one metric pair. The new design stores any metric generically as (metric_name, metric_value), accommodating PC sampling, SASS instruction counts, and future Perfworks metrics with no schema change:
-- Old
CREATE TABLE profile_samples (
inst_executed BIGINT,
thread_inst_executed BIGINT,
sample_count INTEGER,
source_file VARCHAR,
source_line INTEGER,
reason_name VARCHAR,
...
);
-- New
CREATE TABLE profile_samples (
sample_kind VARCHAR(20) NOT NULL, -- 'pc_sampling' | 'sass_metric'
metric_name VARCHAR(255), -- NULL for pc_sampling
metric_value BIGINT NOT NULL,
stall_reason INTEGER, -- CUPTI integer code
scope_name VARCHAR(255), -- resolved from dict at write time
...
);
scope_events got scope_instance_id:
Under the old format, matching a scope begin to its end required a fragile WHERE name = ? AND user_scope = ? query that silently broke when two scopes with the same name were active concurrently. A monotonically-increasing scope_instance_id (assigned at the client before the event is pushed to the ring buffer) makes pairing
exact and lock-free.
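The allocator itself is tiny. A Python sketch for illustration (the real client uses a C++ atomic counter; here a lock gives the same monotonic, thread-safe behavior):

```python
import threading

class ScopeInstanceAllocator:
    """Monotonically increasing scope_instance_id, assigned client-side
    before the begin event is pushed to the ring buffer."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def allocate(self):
        # Each call returns a fresh, never-reused instance ID.
        with self._lock:
            self._next += 1
            return self._next

alloc = ScopeInstanceAllocator()
print(alloc.allocate())  # 1
print(alloc.allocate())  # 2
```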
Implementation Notes
A few things made this easier than it sounds:
The agent needed zero changes. Our log-tailing agent is fully type-agnostic — it reads the "type" field and POSTs the raw JSON to ${backendUrl}/${type}. New message types are handled automatically without touching the agent binary.
Dictionary ordering is guaranteed. The client calls flushDictionary before flushing any batch in the same flushBatches() invocation. The agent sends lines in order. The backend therefore always processes dictionary_update before the batch that references its IDs.
PC-sample scope association became trivial. Previously, linking a PC sample to its enclosing scope required correlating CUPTI correlation IDs through kernel records — which arrived out of order because CUPTI delivers kernel activity asynchronously via a callback. The new design uses a single std::atomic<uint32_t> g_activeScopeNameId updated when a scope-begin row is pushed. Since scope-begin is pushed synchronously from the ScopedMonitor constructor (before any kernels or samples for that scope), the correct scope ID is available by the time the collector processes PC_SAMPLE records.
Conclusion
Even after reducing the payload size by about 15×, the cost can still climb quickly if the service receives heavy profiling traffic from many users at the same time.
That said, there are still practical ways to make the hosted service manageable.
First, I can keep the storage cost under control by applying a short retention policy for raw data instead of storing everything for a long time. Second, I can limit usage by applying request limits per client so that one heavy workload does not overwhelm the backend or generate an unreasonable amount of traffic.
On top of that, this schema update is not the final optimization step. I will continue looking for more ways to reduce the payload size further if there are additional opportunities. If I find a better approach for batching, aggregation, or compression, I will update the design again.
So while the new format already reduces the size significantly, there is still more work to do before running GPU Flight as a fully open hosted service at scale.