How We Told a Stranger's Node Where Its Cache Should Be

Alessio Battistutta

We connected to a remote BEAM node we don't own. No access to its source code. No instrumentation planted beforehand. No agents installed, no probes compiled into the target. Pure black-box runtime observation over Distributed Erlang.

From that, we produced a concrete architectural recommendation: your schema registry is hitting PostgreSQL 354 times per observation window for metadata that almost never changes — that should be ETS, not a database table.

This article walks through how we got there — 7 observation sessions against a live Elixir application, progressing from coarse process-level metrics to function-level tracing that revealed the internal structure of code we'd never read.


The Target: Nexus

Nexus is an Elixir API gateway that manages dynamic PostgreSQL connections, runtime schema metadata, and table operations. It exposes a REST API on port 4040 and handles CRUD, aggregations, search, CSV export, batch operations, and multi-database routing — all backed by Ecto dynamic repos.

We exercised it with an integration test suite: 61 tests across 8 categories including connection management, extended CRUD (JSONB, NULLs, hierarchical categories), edge cases (SQL injection, unicode, 10KB strings), concurrent operations (50 concurrent inserts, 5000+ req/s bursts), large datasets (50k rows), and a 60-second stress test with 20 concurrent workers.

The Setup

Giulia — our observability tool — runs as two Docker containers from the same image:

  • Worker (port 4000) — static analysis engine: AST indexing, Knowledge Graph, dependency topology, complexity metrics, embeddings
  • Monitor (port 4001) — runtime observer: connects to target BEAM nodes via Distributed Erlang, collects snapshots at configurable intervals, pushes data to the Worker for fusion

The observation workflow is command-driven:

```shell
giulia-observe start nexus@192.168.10.174                           # process-level only
giulia-observe start nexus@192.168.10.174 cookie 5000 Nexus.Repo    # + function tracing
# <run your workload>
giulia-observe stop nexus@192.168.10.174                            # finalize fused profile
```

First Bug: --sname vs --name

First connection failed — Nexus uses long names (--name nexus@192.168.10.174), Giulia used short names (--sname worker). Erlang refuses to connect across name modes. One flag change in docker-compose.yml fixed it.
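The change amounts to something like the following — an illustrative compose fragment, not Giulia's actual file (the service name, hostname, and command line are assumptions):

```yaml
services:
  monitor:
    image: giulia
    # was: elixir --sname worker ...
    # A --sname (short name) node cannot connect to a --name (long name) node,
    # so the Monitor must match the target's long-name mode.
    command: elixir --name worker@monitor.local --cookie "${ERLANG_COOKIE}" -S mix run --no-halt
```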


Process-Level Profiling: Finding the Logger

The first five sessions used process-level observation: BEAM metrics (memory, process count, scheduler run queue) and top-process rankings by CPU. This tells you which modules own the hottest processes but not which functions are being called.
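The metrics involved can be sketched with plain stdlib calls — a minimal illustration, not Giulia's actual collector. On a connected cluster each call would be wrapped in `:rpc.call/4` against the target node; here we sample the local node:

```elixir
# Coarse BEAM snapshot: memory, process count, scheduler run queue.
snapshot = %{
  memory_total: :erlang.memory(:total),              # bytes held by the VM
  process_count: :erlang.system_info(:process_count),
  run_queue: :erlang.statistics(:run_queue)          # tasks waiting for a scheduler
}

# Rank the busiest processes by reductions (a rough proxy for CPU work).
top =
  Process.list()
  |> Enum.map(fn pid ->
    case Process.info(pid, [:reductions, :registered_name]) do
      nil -> nil                                     # process died mid-scan
      info -> {pid, info[:reductions], info[:registered_name]}
    end
  end)
  |> Enum.reject(&is_nil/1)
  |> Enum.sort_by(fn {_pid, reds, _name} -> -reds end)
  |> Enum.take(5)

IO.inspect(snapshot)
IO.inspect(top)
```

This is exactly why the layer tops out at module granularity: the ranking tells you which processes are hot, but nothing about the functions they run.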

We ran the same workload under different configurations:

| Session | Duration | Run Queue | Top Module | #2 Module | Logger CPU |
|---|---|---|---|---|---|
| 1. First contact (21/49 tests) | 180s | 17 | :proc_lib 100% | (below threshold) | 0% |
| 2. Idle baseline | 21s | 0 | :code_server 38.7% | (below threshold) | 0% |
| 3. Full suite, info log | 77s | 1 | :proc_lib 54.3% | :logger_std_h_default | 28.5% |
| 4. Full suite, warning log | 77s | 5 | :logger_std_h_default | :proc_lib 33.2% | 39.1% |
| 5. Full suite, async handler | 71s | 2 | :proc_lib 98.9% | (below threshold) | Gone |

Session 1 flagged scheduler contention (run queue of 17 — more tasks queued than schedulers available). Sessions 3-4 found something more interesting: Erlang's default logger handler does synchronous IO. Every log call blocks the calling process until the write completes. The overhead is per-call, not per-byte — so raising the log level to :warning (Session 4) actually made it worse. Fewer messages, but each one still blocks, and now the logger is a larger proportion of total work.

The fix:

```elixir
:logger.update_handler_config(:default, :config, %{
  sync_mode_qlen: 1000,    # async until queue hits 1000
  drop_mode_qlen: 2000,    # start dropping above 2000
  flush_qlen: 5000         # emergency flush
})
```

Session 5 confirmed: logger gone from top modules, run queue dropped from 5 to 2. Bottleneck eliminated.

But process-level profiling had hit its ceiling. Session 5 showed :proc_lib at 98.9% CPU — which is like saying "OTP processes are running." Every GenServer, Task, and supervised process runs through :proc_lib. We needed to see inside.


The Stress Test: 919 ops/s, Zero Errors

Before diving deeper, we ran the full suite including a 60-second sustained stress test. Nexus handled the load cleanly — the interesting question was how.

```
Duration:       60.0s           Latency (ms):
Total ops:      55,151            min:     0.7
Errors:         0 (0.0%)          median:  5.4
Throughput:     919 ops/s         p95:    93.1
                                  p99:   156.9
Operations breakdown:             max:   287.2
  select    13,619    paginate   5,489
  insert    11,203    aggregate  5,440
  count      8,291    distinct   2,818
  search     5,553    exists     2,738
```

Memory flat at 87-88 MB. Process count stable at 459. The system handles load well — but how it handles it is where the interesting findings live.


Function-Level Tracing

The Remote Tracing Problem

Erlang's :erlang.trace only works on the local node. Giulia's Monitor runs on a different node than Nexus. There are three ways to run code on a remote BEAM:

  1. :rpc.call(node, Module, :function, args) — module must be loaded on the target
  2. Anonymous function via RPC — closures reference their defining module; fails with :undef
  3. Code.eval_string via RPC — source code as a string, compiled on the target using only its stdlib

Options 1 and 2 fail because Nexus doesn't have Giulia's code, and Erlang closures aren't portable. Build 135 uses option 3: a self-contained Elixir code string sent via :rpc.call(node, Code, :eval_string, [code]). It spawns a collector, enables tracing for up to 2 seconds (kill switch at 1,000 events), aggregates call counts, and returns. The entire lifecycle runs on the target with zero Giulia dependencies.
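The mechanism can be sketched in a few lines. This is an illustrative reduction, not Build 135's actual trace payload — the collector here returns a stub count instead of real trace events — but the lifecycle is the same: the string is compiled and run entirely with the target's stdlib.

```elixir
# A self-contained code string: no references to any module the target
# doesn't already have. On a real cluster it would be shipped with
#   :rpc.call(:"nexus@192.168.10.174", Code, :eval_string, [trace_code])
trace_code = ~S"""
parent = self()

# Collector process; the real version would receive trace messages here
# and aggregate {module, function, arity} call counts.
collector =
  spawn(fn ->
    counts = %{sample: 0}
    receive do
      {:get, caller} -> send(caller, {:counts, counts})
    end
  end)

send(collector, {:get, parent})

receive do
  {:counts, counts} -> {:ok, counts}
after
  2_000 -> {:error, :timeout}   # kill switch: never hang the target
end
"""

# Evaluating locally exercises the same mechanism as the remote RPC.
{result, _binding} = Code.eval_string(trace_code)
IO.inspect(result)
```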

Session 6: Single Module (73 seconds, 11 snapshots)

Tracing Nexus.Registry.TableRegistry only:

| Function | Calls | Samples |
|---|---|---|
| TableRegistry.get/1 | 9,238 | 10 |

The single hottest function on Nexus. Every CRUD operation goes through get/1 to look up the schema definition before executing. But the call count alone doesn't tell you how it's implemented.

Session 7: Dual Module Trace (85 seconds, 11 snapshots)

Tracing both Nexus.Registry.TableRegistry and Nexus.Repo:

Nexus.Repo — 7,000 total calls:

| Function | Calls | Share |
|---|---|---|
| prepare_opts/2 | 1,266 | 18.1% |
| default_options/1 | 1,263 | 18.0% |
| get/1 | 1,263 | 18.0% |
| prepare_query/3 | 956 | 13.7% |
| all/1 | 593 | 8.5% |
| one/1 | 353 | 5.0% |
| insert_all/3 | 307 | 4.4% |
| put_dynamic_repo/1 | 12 | 0.2% |

Nexus.Registry.TableRegistry — 2,004 total calls:

| Function | Calls | Share |
|---|---|---|
| get/1 | 354 | 17.7% |
| prepare_opts/2 | 354 | 17.7% |
| default_options/1 | 354 | 17.7% |
| prepare_query/3 | 292 | 14.6% |
| all/1 | 186 | 9.3% |
| one/1 | 106 | 5.3% |
| insert_all/3 | 62 | 3.1% |

And there it is. TableRegistry has prepare_opts/2, default_options/1, prepare_query/3 — the telltale signature of an Ecto Repo. The schema registry is backed by a database table, not ETS. Every schema lookup is a round-trip to PostgreSQL.


What the Numbers Say

TableRegistry Should Be ETS (The Headline Finding)

354 get/1 calls to PostgreSQL per observation window, for table metadata that almost never changes. Boot-time load, invalidate on DDL events, done. This eliminates ~2,000 function calls per window (the full Ecto pipeline runs on every registry query) and removes an entire class of unnecessary database round-trips from the hot path.
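As a sketch of what the recommendation looks like — module and function names are ours, not Nexus's, and the boot-time database read is stubbed with static metadata:

```elixir
defmodule SchemaCache do
  @table :schema_cache

  # Boot-time load. In the real system this would be the one PostgreSQL
  # read; afterwards every lookup stays in memory.
  def boot_load(rows) do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    for {name, meta} <- rows, do: :ets.insert(@table, {name, meta})
    :ok
  end

  # Replaces the per-request database round-trip with an ETS lookup.
  def get(table_name) do
    case :ets.lookup(@table, table_name) do
      [{^table_name, meta}] -> {:ok, meta}
      [] -> {:error, :not_found}
    end
  end

  # Called on DDL events (CREATE/ALTER/DROP) to invalidate one entry.
  def invalidate(table_name), do: :ets.delete(@table, table_name)
end

SchemaCache.boot_load(%{"users" => %{columns: [:id, :email]}})
SchemaCache.get("users")
```

With `read_concurrency: true`, concurrent readers never contend, which suits a registry that is read thousands of times per window and written almost never.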

The Workload Is 87% Reads

```
Read operations:   2,209 (87%)    Read/Write ratio: 6.9:1
Write operations:    320 (13%)
```

Repo.get/1 at 1,263 calls is the single hottest operation. Combined with all/1 (593) and one/1 (353), reads dominate. A read-through ETS cache with TTL on frequently-accessed records would cut database pressure significantly.
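A read-through cache with TTL can be sketched as follows — the names are hypothetical and the loader function stands in for the real database fetch:

```elixir
defmodule ReadCache do
  @table :read_cache

  def start(ttl_ms) do
    :ets.new(@table, [:named_table, :set, :public])
    :persistent_term.put({__MODULE__, :ttl}, ttl_ms)
    :ok
  end

  # Serve from ETS while the entry is fresh; otherwise call the loader
  # (the database) and cache the result with a timestamp.
  def fetch(key, loader) do
    now = System.monotonic_time(:millisecond)
    ttl = :persistent_term.get({__MODULE__, :ttl})

    case :ets.lookup(@table, key) do
      [{^key, value, inserted_at}] when now - inserted_at < ttl ->
        {:hit, value}

      _ ->
        value = loader.()
        :ets.insert(@table, {key, value, now})
        {:miss, value}
    end
  end
end

ReadCache.start(60_000)
{:miss, :from_db} = ReadCache.fetch(:user_1, fn -> :from_db end)
# Second call is served from ETS; the loader is never invoked.
{:hit, :from_db} = ReadCache.fetch(:user_1, fn -> raise "db hit" end)
```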

Ecto Overhead Is Structural (Don't Touch It)

```
Ecto pipeline calls:   3,485 (prepare_opts + default_options + prepare_query)
Actual DB operations:  2,530
Overhead ratio:        1.38x
```

1.38 pipeline calls per operation. This is Ecto doing its job — query preparation, option merging, type validation. You're not going to bypass it, and the savings would be marginal. The real wins come from not hitting the database at all.

Dynamic Repo Switching Works Fine

12 put_dynamic_repo/1 calls across 2,530 operations (1 switch per 211 ops). Connection switching is not a concern.

The BEAM Has Headroom

Run queue of 1, stable 90 MB memory, ~30 DB ops/sec during trace windows. The VM isn't stressed. If there's a throughput bottleneck, it's on the PostgreSQL side — pg_stat_statements would confirm.

Priority

| # | Action | Impact | Effort |
|---|---|---|---|
| 1 | Cache TableRegistry in ETS | Eliminates ~2,000 DB calls/window | Low |
| 2 | Read cache for hot get/1 paths | Reduces 87% read load | Medium |
| 3 | Profile PostgreSQL side | Confirms where latency lives | Low |

What Each Layer Revealed

| Finding | Process-Level | Function-Level |
|---|---|---|
| Synchronous logger at 39% CPU | Yes | — |
| Logger worse at warning level (Sessions 3→4→5) | Yes (comparative) | — |
| TableRegistry.get/1 is the #1 function | No | 9,238 calls |
| TableRegistry is database-backed | No | Ecto signatures |
| 87% read-heavy workload | No | Ratio analysis |
| 1.38x Ecto overhead (structural, ignore it) | No | Yes |
| Dynamic repo switching is fine | No | 12 switches |
| BEAM has headroom, DB is the constraint | Partial | Throughput data |

Process-level found the logger bug — a real operational fix. Function-level found the architectural issue — a schema registry that should never have been database-backed. One is a config change. The other changes how you design the system.

Both came from a node we don't own, running code we've never read, with zero instrumentation on the target.


How This Works

No source access. No agents installed on the target. No probes compiled in. Giulia's Monitor connects to any BEAM node via Distributed Erlang, collects process snapshots and function-level traces by sending self-contained code strings over RPC (Code.eval_string), and pushes the results to the Worker for AST correlation and fused profiling. The entire observation lifecycle runs on the target using only its own stdlib — Giulia never needs to be loaded there.

Connect, observe, trace, recommend.
