We connected to a remote BEAM node we don't own. No access to its source code. No instrumentation planted beforehand. No agents installed, no probes compiled into the target. Pure black-box runtime observation over Distributed Erlang.
From that, we produced a concrete architectural recommendation: your schema registry is hitting PostgreSQL 354 times per observation window for metadata that almost never changes — that should be ETS, not a database table.
This article walks through how we got there — 7 observation sessions against a live Elixir application, progressing from coarse process-level metrics to function-level tracing that revealed the internal structure of code we'd never read.
The Target: Nexus
Nexus is an Elixir API gateway that manages dynamic PostgreSQL connections, runtime schema metadata, and table operations. It exposes a REST API on port 4040 and handles CRUD, aggregations, search, CSV export, batch operations, and multi-database routing — all backed by Ecto dynamic repos.
We exercised it with an integration test suite: 61 tests across 8 categories including connection management, extended CRUD (JSONB, NULLs, hierarchical categories), edge cases (SQL injection, unicode, 10KB strings), concurrent operations (50 concurrent inserts, 5000+ req/s bursts), large datasets (50k rows), and a 60-second stress test with 20 concurrent workers.
The Setup
Giulia, our observation tool, runs as two Docker containers built from the same image:
- Worker (port 4000) — static analysis engine: AST indexing, Knowledge Graph, dependency topology, complexity metrics, embeddings
- Monitor (port 4001) — runtime observer: connects to target BEAM nodes via Distributed Erlang, collects snapshots at configurable intervals, pushes data to the Worker for fusion
The observation workflow is command-driven:
```shell
giulia-observe start nexus@192.168.10.174                         # process-level only
giulia-observe start nexus@192.168.10.174 cookie 5000 Nexus.Repo  # + function tracing
<run your workload>
giulia-observe stop nexus@192.168.10.174                          # finalize fused profile
```
First Bug: --sname vs --name
First connection failed — Nexus uses long names (--name nexus@192.168.10.174), Giulia used short names (--sname worker). Erlang refuses to connect across name modes. One flag change in docker-compose.yml fixed it.
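The failure is easy to reproduce from any Elixir shell. A minimal sketch (node names and the address are illustrative):

```elixir
# A node started with --sname gets a short name; one started with --name
# gets a fully qualified name. The BEAM refuses connections across modes.
Node.self()
# e.g. :"worker@giulia-monitor" when started with --sname

Node.connect(:"nexus@192.168.10.174")
# Returns false when the name modes (or the Erlang cookies) disagree.
```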
Process-Level Profiling: Finding the Logger
The first five sessions used process-level observation: BEAM metrics (memory, process count, scheduler run queue) and top-process rankings by CPU. This tells you which modules own the hottest processes but not which functions are being called.
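Collecting this layer needs nothing from the target beyond its own stdlib. A minimal sketch of one snapshot over RPC (a simplified illustration of the pattern, not Giulia's actual code; the node name is assumed):

```elixir
node = :"nexus@192.168.10.174"

# BEAM-wide metrics, evaluated on the target node
memory    = :rpc.call(node, :erlang, :memory, [:total])
processes = :rpc.call(node, :erlang, :system_info, [:process_count])
run_queue = :rpc.call(node, :erlang, :statistics, [:run_queue])

# Top processes ranked by reductions, a rough proxy for CPU work.
# Process.info/2 runs on the target, where the pids are local.
top =
  :rpc.call(node, Process, :list, [])
  |> Enum.map(fn pid ->
    :rpc.call(node, Process, :info, [pid, [:registered_name, :reductions, :current_function]])
  end)
  |> Enum.filter(&is_list/1)
  |> Enum.sort_by(fn info -> -info[:reductions] end)
  |> Enum.take(10)
```

Reductions are cumulative, so a profiler ranks processes by the delta between successive snapshots rather than the raw counter.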
We ran the same workload under different configurations:
| Session | Duration | Run Queue | Top Module | #2 Module | Logger CPU |
|---|---|---|---|---|---|
| 1. First contact (21/49 tests) | 180s | 17 | `:proc_lib` 100% | (below threshold) | 0% |
| 2. Idle baseline | 21s | 0 | `:code_server` 38.7% | (below threshold) | 0% |
| 3. Full suite, info log | 77s | 1 | `:proc_lib` 54.3% | `:logger_std_h_default` | 28.5% |
| 4. Full suite, warning log | 77s | 5 | `:logger_std_h_default` | `:proc_lib` 33.2% | 39.1% |
| 5. Full suite, async handler | 71s | 2 | `:proc_lib` 98.9% | (below threshold) | Gone |
Session 1 flagged scheduler contention (run queue of 17 — more tasks queued than schedulers available). Sessions 3-4 found something more interesting: Erlang's default logger handler does synchronous IO. Every log call blocks the calling process until the write completes. The overhead is per-call, not per-byte — so raising the log level to :warning (Session 4) actually made it worse. Fewer messages, but each one still blocks, and now the logger is a larger proportion of total work.
The fix:
```elixir
:logger.update_handler_config(:default, :config, %{
  sync_mode_qlen: 1000,  # async until queue hits 1000
  drop_mode_qlen: 2000,  # start dropping above 2000
  flush_qlen: 5000       # emergency flush
})
```
Session 5 confirmed: logger gone from top modules, run queue dropped from 5 to 2. Bottleneck eliminated.
But process-level profiling had hit its ceiling. Session 5 showed :proc_lib at 98.9% CPU — which is like saying "OTP processes are running." Every GenServer, Task, and supervised process runs through :proc_lib. We needed to see inside.
The Stress Test: 919 ops/s, Zero Errors
Before diving deeper, we ran the full suite including a 60-second sustained stress test. Nexus handled the load cleanly — the interesting question was how.
```
Duration:    60.0s                Latency (ms):
Total ops:   55,151                 min:     0.7
Errors:      0 (0.0%)               median:  5.4
Throughput:  919 ops/s              p95:     93.1
                                    p99:     156.9
Operations breakdown:               max:     287.2
  select  13,619    paginate   5,489
  insert  11,203    aggregate  5,440
  count    8,291    distinct   2,818
  search   5,553    exists     2,738
```
Memory flat at 87-88 MB. Process count stable at 459. The system handles load well — but how it handles it is where the interesting findings live.
Function-Level Tracing
The Remote Tracing Problem
Erlang's :erlang.trace only works on the local node. Giulia's Monitor runs on a different node than Nexus. There are three ways to run code on a remote BEAM:
1. `:rpc.call(node, Module, :function, args)` — module must be loaded on the target
2. Anonymous function via RPC — closures reference their defining module; fails with `:undef`
3. `Code.eval_string` via RPC — source code as a string, compiled on the target using only its stdlib
Options 1 and 2 fail because Nexus doesn't have Giulia's code, and Erlang closures aren't portable. Build 135 uses option 3: a self-contained Elixir code string sent via :rpc.call(node, Code, :eval_string, [code]). It spawns a collector, enables tracing for up to 2 seconds (kill switch at 1,000 events), aggregates call counts, and returns. The entire lifecycle runs on the target with zero Giulia dependencies.
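A condensed sketch of what such a payload can look like. This illustrates the pattern, not Giulia's actual trace code; window, kill switch, and node name follow the numbers above:

```elixir
trace_code = """
parent = self()

collector =
  spawn(fn ->
    # Aggregate {module, function, arity} call counts until the 2s
    # window expires or the 1,000-event kill switch trips.
    {counts, _n} =
      Enum.reduce_while(Stream.repeatedly(fn -> :ok end), {%{}, 0}, fn _, {acc, n} ->
        receive do
          {:trace, _pid, :call, {m, f, args}} ->
            acc = Map.update(acc, {m, f, length(args)}, 1, &(&1 + 1))
            if n + 1 >= 1_000, do: {:halt, {acc, n + 1}}, else: {:cont, {acc, n + 1}}
        after
          2_000 -> {:halt, {acc, n}}
        end
      end)

    send(parent, {:trace_result, counts})
  end)

:erlang.trace_pattern({Nexus.Registry.TableRegistry, :_, :_}, true, [])
:erlang.trace(:all, true, [:call, {:tracer, collector}])

result =
  receive do
    {:trace_result, counts} -> counts
  after
    3_000 -> :timeout
  end

:erlang.trace(:all, false, [:call])
result
"""

# Everything above runs on the target; only the aggregated counts return.
{counts, _bindings} =
  :rpc.call(:"nexus@192.168.10.174", Code, :eval_string, [trace_code])
```

The code string references `Nexus.Registry.TableRegistry` only as an atom, so it resolves against modules already loaded on the target.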
Session 6: Single Module (73 seconds, 11 snapshots)
Tracing Nexus.Registry.TableRegistry only:
| Function | Calls | Samples |
|---|---|---|
| `TableRegistry.get/1` | 9,238 | 10 |
The single hottest function on Nexus. Every CRUD operation goes through get/1 to look up the schema definition before executing. But the call count alone doesn't tell you how it's implemented.
Session 7: Dual Module Trace (85 seconds, 11 snapshots)
Tracing both Nexus.Registry.TableRegistry and Nexus.Repo:
Nexus.Repo — 7,000 total calls:
| Function | Calls | Share |
|---|---|---|
| `prepare_opts/2` | 1,266 | 18.1% |
| `default_options/1` | 1,263 | 18.0% |
| `get/1` | 1,263 | 18.0% |
| `prepare_query/3` | 956 | 13.7% |
| `all/1` | 593 | 8.5% |
| `one/1` | 353 | 5.0% |
| `insert_all/3` | 307 | 4.4% |
| `put_dynamic_repo/1` | 12 | 0.2% |
Nexus.Registry.TableRegistry — 2,004 total calls:
| Function | Calls | Share |
|---|---|---|
| `get/1` | 354 | 17.7% |
| `prepare_opts/2` | 354 | 17.7% |
| `default_options/1` | 354 | 17.7% |
| `prepare_query/3` | 292 | 14.6% |
| `all/1` | 186 | 9.3% |
| `one/1` | 106 | 5.3% |
| `insert_all/3` | 62 | 3.1% |
And there it is. TableRegistry has prepare_opts/2, default_options/1, prepare_query/3 — the telltale signature of an Ecto Repo. The schema registry is backed by a database table, not ETS. Every schema lookup is a round-trip to PostgreSQL.
What the Numbers Say
TableRegistry Should Be ETS (The Headline Finding)
354 get/1 calls to PostgreSQL per observation window, for table metadata that almost never changes. Boot-time load, invalidate on DDL events, done. This eliminates ~2,000 function calls per window (the full Ecto pipeline runs on every registry query) and removes an entire class of unnecessary database round-trips from the hot path.
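The recommended shape is small enough to sketch. Module and function names here are illustrative, not Nexus's actual API; `load_all_schemas/0` stands in for a hypothetical one-time boot query:

```elixir
defmodule SchemaCache do
  use GenServer

  @table :schema_cache

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Reads bypass the GenServer entirely: a direct ETS lookup,
  # no database round-trip, no message passing.
  def get(table_name) do
    case :ets.lookup(@table, table_name) do
      [{^table_name, schema}] -> {:ok, schema}
      [] -> :error
    end
  end

  # Called when a DDL event is observed for a table.
  def invalidate(table_name), do: GenServer.cast(__MODULE__, {:invalidate, table_name})

  @impl true
  def init(_opts) do
    :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    # Hypothetical: query PostgreSQL once at boot and seed the table.
    # for {name, schema} <- load_all_schemas(), do: :ets.insert(@table, {name, schema})
    {:ok, %{}}
  end

  @impl true
  def handle_cast({:invalidate, table_name}, state) do
    :ets.delete(@table, table_name)
    {:noreply, state}
  end
end
```

Writes (invalidations) serialize through the GenServer; reads scale across schedulers thanks to `read_concurrency: true`.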
The Workload Is 87% Reads
```
Read operations:  2,209 (87%)    Read/Write ratio: 6.9:1
Write operations:   320 (13%)
```
Repo.get/1 at 1,263 calls is the single hottest operation. Combined with all/1 (593) and one/1 (353), reads dominate. A read-through ETS cache with TTL on frequently-accessed records would cut database pressure significantly.
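A read-through lookup with TTL is a few lines on ETS. A minimal sketch, assuming a public table storing `{key, value, inserted_at_ms}` (names are illustrative):

```elixir
defmodule ReadCache do
  @table :read_cache

  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
  end

  # Serve from ETS while the entry is fresh; otherwise call the loader
  # (e.g. a wrapper around Repo.get/1) and cache the result.
  def fetch(key, ttl_ms, loader) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, at}] when now - at < ttl_ms ->
        value

      _ ->
        value = loader.()
        :ets.insert(@table, {key, value, now})
        value
    end
  end
end
```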
Ecto Overhead Is Structural (Don't Touch It)
```
Ecto pipeline calls:  3,485  (prepare_opts + default_options + prepare_query)
Actual DB operations: 2,530
Overhead ratio:       1.38x
```
1.38 pipeline calls per operation. This is Ecto doing its job — query preparation, option merging, type validation. You're not going to bypass it, and the savings would be marginal. The real wins come from not hitting the database at all.
Dynamic Repo Switching Works Fine
12 put_dynamic_repo/1 calls across 2,530 operations (1 switch per 211 ops). Connection switching is not a concern.
The BEAM Has Headroom
Run queue of 1, stable 90 MB memory, ~30 DB ops/sec during trace windows. The VM isn't stressed. If there's a throughput bottleneck, it's on the PostgreSQL side — pg_stat_statements would confirm.
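The confirmation query can run from inside the app. A sketch assuming the `pg_stat_statements` extension is enabled on the target database (`mean_exec_time`/`total_exec_time` are the PostgreSQL 13+ column names; older versions use `mean_time`/`total_time`):

```elixir
{:ok, %{rows: rows}} =
  Nexus.Repo.query("""
  SELECT query, calls, mean_exec_time, total_exec_time
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10
  """)
```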
Priority
| # | Action | Impact | Effort |
|---|---|---|---|
| 1 | Cache `TableRegistry` in ETS | Eliminates ~2,000 DB calls/window | Low |
| 2 | Read cache for hot `get/1` paths | Reduces 87% read load | Medium |
| 3 | Profile PostgreSQL side | Confirms where latency lives | Low |
What Each Layer Revealed
| Finding | Process-Level | Function-Level |
|---|---|---|
| Synchronous logger at 39% CPU | Yes | — |
| Logger worse at warning level (Sessions 3→4→5) | Yes (comparative) | — |
| `TableRegistry.get/1` is the #1 function | No | 9,238 calls |
| `TableRegistry` is database-backed | No | Ecto signatures |
| 87% read-heavy workload | No | Ratio analysis |
| 1.38x Ecto overhead (structural, ignore it) | No | Yes |
| Dynamic repo switching is fine | No | 12 switches |
| BEAM has headroom, DB is the constraint | Partial | Throughput data |
Process-level found the logger bug — a real operational fix. Function-level found the architectural issue — a schema registry that should never have been database-backed. One is a config change. The other changes how you design the system.
Both came from a node we don't own, running code we've never read, with zero instrumentation on the target.
How This Works
No source access. No agents installed on the target. No probes compiled in. Giulia's Monitor connects to any BEAM node via Distributed Erlang, collects process snapshots and function-level traces by sending self-contained code strings over RPC (Code.eval_string), and pushes the results to the Worker for AST correlation and fused profiling. The entire observation lifecycle runs on the target using only its own stdlib — Giulia never needs to be loaded there.
Connect, observe, trace, recommend.