How We Told a Stranger's Node Where Its Cache Should Be

Alessio Battistutta

We connected to a remote BEAM node we don't own. No access to its source code. No instrumentation planted beforehand. No agents installed, no probes compiled into the target. Pure black-box runtime observation over Distributed Erlang.

From that, we produced a concrete architectural recommendation: your schema registry is hitting PostgreSQL 354 times per observation window for metadata that almost never changes — that should be ETS, not a database table.

This article walks through how we got there — 7 observation sessions against a live Elixir application, progressing from coarse process-level metrics to function-level tracing that revealed the internal structure of code we'd never read.


The Target: Nexus

Nexus is an Elixir API gateway that manages dynamic PostgreSQL connections, runtime schema metadata, and table operations. It exposes a REST API on port 4040 and handles CRUD, aggregations, search, CSV export, batch operations, and multi-database routing — all backed by Ecto dynamic repos.

We exercised it with an integration test suite: 61 tests across 8 categories including connection management, extended CRUD (JSONB, NULLs, hierarchical categories), edge cases (SQL injection, unicode, 10KB strings), concurrent operations (50 concurrent inserts, 5000+ req/s bursts), large datasets (50k rows), and a 60-second stress test with 20 concurrent workers.

The Setup

Giulia — our observability tool — runs as two Docker containers from the same image:

  • Worker (port 4000) — static analysis engine: AST indexing, Knowledge Graph, dependency topology, complexity metrics, embeddings
  • Monitor (port 4001) — runtime observer: connects to target BEAM nodes via Distributed Erlang, collects snapshots at configurable intervals, pushes data to the Worker for fusion

The observation workflow is command-driven:

```shell
giulia-observe start nexus@192.168.10.174                           # process-level only
giulia-observe start nexus@192.168.10.174 cookie 5000 Nexus.Repo    # + function tracing
# <run your workload>
giulia-observe stop nexus@192.168.10.174                            # finalize fused profile
```

First Bug: --sname vs --name

First connection failed — Nexus uses long names (--name nexus@192.168.10.174), Giulia used short names (--sname worker). Erlang refuses to connect across name modes. One flag change in docker-compose.yml fixed it.
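The change amounts to something like the following — an illustrative compose fragment, not Giulia's actual file (the service name, hostname, and command line are assumptions):

```yaml
services:
  monitor:
    image: giulia
    # was: elixir --sname worker ...
    # A --sname (short name) node cannot connect to a --name (long name) node,
    # so the Monitor must match the target's long-name mode.
    command: elixir --name worker@monitor.local --cookie "${ERLANG_COOKIE}" -S mix run --no-halt
```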


Process-Level Profiling: Finding the Logger

The first five sessions used process-level observation: BEAM metrics (memory, process count, scheduler run queue) and top-process rankings by CPU. This tells you which modules own the hottest processes but not which functions are being called.
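The metrics involved can be sketched with plain stdlib calls — a minimal illustration, not Giulia's actual collector. On a connected cluster each call would be wrapped in `:rpc.call/4` against the target node; here we sample the local node:

```elixir
# Coarse BEAM snapshot: memory, process count, scheduler run queue.
snapshot = %{
  memory_total: :erlang.memory(:total),              # bytes held by the VM
  process_count: :erlang.system_info(:process_count),
  run_queue: :erlang.statistics(:run_queue)          # tasks waiting for a scheduler
}

# Rank the busiest processes by reductions (a rough proxy for CPU work).
top =
  Process.list()
  |> Enum.map(fn pid ->
    case Process.info(pid, [:reductions, :registered_name]) do
      nil -> nil                                     # process died mid-scan
      info -> {pid, info[:reductions], info[:registered_name]}
    end
  end)
  |> Enum.reject(&is_nil/1)
  |> Enum.sort_by(fn {_pid, reds, _name} -> -reds end)
  |> Enum.take(5)

IO.inspect(snapshot)
IO.inspect(top)
```

This is exactly why the layer tops out at module granularity: the ranking tells you which processes are hot, but nothing about the functions they run.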

We ran the same workload under different configurations:

| Session | Duration | Run Queue | Top Module | #2 Module | Logger CPU |
|---|---|---|---|---|---|
| 1. First contact (21/49 tests) | 180s | 17 | :proc_lib 100% | (below threshold) | 0% |
| 2. Idle baseline | 21s | 0 | :code_server 38.7% | (below threshold) | 0% |
| 3. Full suite, info log | 77s | 1 | :proc_lib 54.3% | :logger_std_h_default | 28.5% |
| 4. Full suite, warning log | 77s | 5 | :logger_std_h_default | :proc_lib 33.2% | 39.1% |
| 5. Full suite, async handler | 71s | 2 | :proc_lib 98.9% | (below threshold) | Gone |

Session 1 flagged scheduler contention (run queue of 17 — more tasks queued than schedulers available). Sessions 3-4 found something more interesting: Erlang's default logger handler does synchronous IO. Every log call blocks the calling process until the write completes. The overhead is per-call, not per-byte — so raising the log level to :warning (Session 4) actually made it worse. Fewer messages, but each one still blocks, and now the logger is a larger proportion of total work.

The fix:

```elixir
:logger.update_handler_config(:default, :config, %{
  sync_mode_qlen: 1000,    # async until queue hits 1000
  drop_mode_qlen: 2000,    # start dropping above 2000
  flush_qlen: 5000         # emergency flush
})
```

Session 5 confirmed: logger gone from top modules, run queue dropped from 5 to 2. Bottleneck eliminated.

But process-level profiling had hit its ceiling. Session 5 showed :proc_lib at 98.9% CPU — which is like saying "OTP processes are running." Every GenServer, Task, and supervised process runs through :proc_lib. We needed to see inside.


The Stress Test: 919 ops/s, Zero Errors

Before diving deeper, we ran the full suite including a 60-second sustained stress test. Nexus handled the load cleanly — the interesting question was how.

```
Duration:       60.0s           Latency (ms):
Total ops:      55,151            min:     0.7
Errors:         0 (0.0%)          median:  5.4
Throughput:     919 ops/s         p95:    93.1
                                  p99:   156.9
Operations breakdown:             max:   287.2
  select    13,619    paginate   5,489
  insert    11,203    aggregate  5,440
  count      8,291    distinct   2,818
  search     5,553    exists     2,738
```

Memory flat at 87-88 MB. Process count stable at 459. The system handles load well — but how it handles it is where the interesting findings live.


Function-Level Tracing

The Remote Tracing Problem

Erlang's :erlang.trace only works on the local node. Giulia's Monitor runs on a different node than Nexus. There are three ways to run code on a remote BEAM:

  1. :rpc.call(node, Module, :function, args) — module must be loaded on the target
  2. Anonymous function via RPC — closures reference their defining module; fails with :undef
  3. Code.eval_string via RPC — source code as a string, compiled on the target using only its stdlib

Options 1 and 2 fail because Nexus doesn't have Giulia's code, and Erlang closures aren't portable. Build 135 uses option 3: a self-contained Elixir code string sent via :rpc.call(node, Code, :eval_string, [code]). It spawns a collector, enables tracing for up to 2 seconds (kill switch at 1,000 events), aggregates call counts, and returns. The entire lifecycle runs on the target with zero Giulia dependencies.
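The mechanism can be sketched in a few lines. This is an illustrative reduction, not Build 135's actual trace payload — the collector here returns a stub count instead of real trace events — but the lifecycle is the same: the string is compiled and run entirely with the target's stdlib.

```elixir
# A self-contained code string: no references to any module the target
# doesn't already have. On a real cluster it would be shipped with
#   :rpc.call(:"nexus@192.168.10.174", Code, :eval_string, [trace_code])
trace_code = ~S"""
parent = self()

# Collector process; the real version would receive trace messages here
# and aggregate {module, function, arity} call counts.
collector =
  spawn(fn ->
    counts = %{sample: 0}
    receive do
      {:get, caller} -> send(caller, {:counts, counts})
    end
  end)

send(collector, {:get, parent})

receive do
  {:counts, counts} -> {:ok, counts}
after
  2_000 -> {:error, :timeout}   # kill switch: never hang the target
end
"""

# Evaluating locally exercises the same mechanism as the remote RPC.
{result, _binding} = Code.eval_string(trace_code)
IO.inspect(result)
```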

Session 6: Single Module (73 seconds, 11 snapshots)

Tracing Nexus.Registry.TableRegistry only:

| Function | Calls | Samples |
|---|---|---|
| TableRegistry.get/1 | 9,238 | 10 |

The single hottest function on Nexus. Every CRUD operation goes through get/1 to look up the schema definition before executing. But the call count alone doesn't tell you how it's implemented.

Session 7: Dual Module Trace (85 seconds, 11 snapshots)

Tracing both Nexus.Registry.TableRegistry and Nexus.Repo:

Nexus.Repo — 7,000 total calls:

| Function | Calls | Share |
|---|---|---|
| prepare_opts/2 | 1,266 | 18.1% |
| default_options/1 | 1,263 | 18.0% |
| get/1 | 1,263 | 18.0% |
| prepare_query/3 | 956 | 13.7% |
| all/1 | 593 | 8.5% |
| one/1 | 353 | 5.0% |
| insert_all/3 | 307 | 4.4% |
| put_dynamic_repo/1 | 12 | 0.2% |

Nexus.Registry.TableRegistry — 2,004 total calls:

| Function | Calls | Share |
|---|---|---|
| get/1 | 354 | 17.7% |
| prepare_opts/2 | 354 | 17.7% |
| default_options/1 | 354 | 17.7% |
| prepare_query/3 | 292 | 14.6% |
| all/1 | 186 | 9.3% |
| one/1 | 106 | 5.3% |
| insert_all/3 | 62 | 3.1% |

And there it is. TableRegistry has prepare_opts/2, default_options/1, prepare_query/3 — the telltale signature of an Ecto Repo. The schema registry is backed by a database table, not ETS. Every schema lookup is a round-trip to PostgreSQL.


What the Numbers Say

TableRegistry Should Be ETS (The Headline Finding)

354 get/1 calls to PostgreSQL per observation window, for table metadata that almost never changes. Boot-time load, invalidate on DDL events, done. This eliminates ~2,000 function calls per window (the full Ecto pipeline runs on every registry query) and removes an entire class of unnecessary database round-trips from the hot path.
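As a sketch of what the recommendation looks like — module and function names are ours, not Nexus's, and the boot-time database read is stubbed with static metadata:

```elixir
defmodule SchemaCache do
  @table :schema_cache

  # Boot-time load. In the real system this would be the one PostgreSQL
  # read; afterwards every lookup stays in memory.
  def boot_load(rows) do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    for {name, meta} <- rows, do: :ets.insert(@table, {name, meta})
    :ok
  end

  # Replaces the per-request database round-trip with an ETS lookup.
  def get(table_name) do
    case :ets.lookup(@table, table_name) do
      [{^table_name, meta}] -> {:ok, meta}
      [] -> {:error, :not_found}
    end
  end

  # Called on DDL events (CREATE/ALTER/DROP) to invalidate one entry.
  def invalidate(table_name), do: :ets.delete(@table, table_name)
end

SchemaCache.boot_load(%{"users" => %{columns: [:id, :email]}})
SchemaCache.get("users")
```

With `read_concurrency: true`, concurrent readers never contend, which suits a registry that is read thousands of times per window and written almost never.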

The Workload Is 87% Reads

```
Read operations:   2,209 (87%)    Read/Write ratio: 6.9:1
Write operations:    320 (13%)
```

Repo.get/1 at 1,263 calls is the single hottest operation. Combined with all/1 (593) and one/1 (353), reads dominate. A read-through ETS cache with TTL on frequently-accessed records would cut database pressure significantly.
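A read-through cache with TTL can be sketched as follows — the names are hypothetical and the loader function stands in for the real database fetch:

```elixir
defmodule ReadCache do
  @table :read_cache

  def start(ttl_ms) do
    :ets.new(@table, [:named_table, :set, :public])
    :persistent_term.put({__MODULE__, :ttl}, ttl_ms)
    :ok
  end

  # Serve from ETS while the entry is fresh; otherwise call the loader
  # (the database) and cache the result with a timestamp.
  def fetch(key, loader) do
    now = System.monotonic_time(:millisecond)
    ttl = :persistent_term.get({__MODULE__, :ttl})

    case :ets.lookup(@table, key) do
      [{^key, value, inserted_at}] when now - inserted_at < ttl ->
        {:hit, value}

      _ ->
        value = loader.()
        :ets.insert(@table, {key, value, now})
        {:miss, value}
    end
  end
end

ReadCache.start(60_000)
{:miss, :from_db} = ReadCache.fetch(:user_1, fn -> :from_db end)
# Second call is served from ETS; the loader is never invoked.
{:hit, :from_db} = ReadCache.fetch(:user_1, fn -> raise "db hit" end)
```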

Ecto Overhead Is Structural (Don't Touch It)

```
Ecto pipeline calls:   3,485 (prepare_opts + default_options + prepare_query)
Actual DB operations:  2,530
Overhead ratio:        1.38x
```

1.38 pipeline calls per operation. This is Ecto doing its job — query preparation, option merging, type validation. You're not going to bypass it, and the savings would be marginal. The real wins come from not hitting the database at all.

Dynamic Repo Switching Works Fine

12 put_dynamic_repo/1 calls across 2,530 operations (1 switch per 211 ops). Connection switching is not a concern.

The BEAM Has Headroom

Run queue of 1, stable 90 MB memory, ~30 DB ops/sec during trace windows. The VM isn't stressed. If there's a throughput bottleneck, it's on the PostgreSQL side — pg_stat_statements would confirm.

Priority

| # | Action | Impact | Effort |
|---|---|---|---|
| 1 | Cache TableRegistry in ETS | Eliminates ~2,000 DB calls/window | Low |
| 2 | Read cache for hot get/1 paths | Reduces 87% read load | Medium |
| 3 | Profile PostgreSQL side | Confirms where latency lives | Low |

What Each Layer Revealed

| Finding | Process-Level | Function-Level |
|---|---|---|
| Synchronous logger at 39% CPU | Yes | — |
| Logger worse at warning level (Sessions 3→4→5) | Yes (comparative) | — |
| TableRegistry.get/1 is the #1 function | No | 9,238 calls |
| TableRegistry is database-backed | No | Ecto signatures |
| 87% read-heavy workload | No | Ratio analysis |
| 1.38x Ecto overhead (structural, ignore it) | No | Yes |
| Dynamic repo switching is fine | No | 12 switches |
| BEAM has headroom, DB is the constraint | Partial | Throughput data |

Process-level found the logger bug — a real operational fix. Function-level found the architectural issue — a schema registry that should never have been database-backed. One is a config change. The other changes how you design the system.

Both came from a node we don't own, running code we've never read, with zero instrumentation on the target.


How This Works

No source access. No agents installed on the target. No probes compiled in. Giulia's Monitor connects to any BEAM node via Distributed Erlang, collects process snapshots and function-level traces by sending self-contained code strings over RPC (Code.eval_string), and pushes the results to the Worker for AST correlation and fused profiling. The entire observation lifecycle runs on the target using only its own stdlib — Giulia never needs to be loaded there.

Connect, observe, trace, recommend.
