Drawing Lines on the Ocean

#webhooks #geofencing #ais #tutorial

Imagine you've drawn a circle on a map. It surrounds the Port of Rotterdam, with a bit of slack so you catch ships in the approach. You want to know, in something close to real time, when a particular tanker crosses that line. Not when someone checks a dashboard. Not in a daily report. The moment it happens.

On the face of it, trivial. You have a position. You have a polygon. Check if one is inside the other. Done.

Except almost nothing about that sentence is true in practice.

The thing you think you have

The first lie is "you have a position." You don't, really. You have a position that a vessel's AIS transponder broadcast some number of seconds or minutes ago, that was picked up by a terrestrial receiver (if the vessel is near shore) or a satellite (if it isn't), that was relayed through one or more data networks, that was deduplicated against other reports of the same broadcast, that was timestamped — possibly multiple times, by different systems with slightly different clocks — and that finally arrived in a database.

The position is real. The "now" is negotiable.

Class A transponders (required under SOLAS for commercial vessels of 300 GT and up on international voyages) report as often as every 2 seconds for a vessel above 23 knots or one changing course at 14–23 knots, every 6 seconds at 14–23 knots on a steady course, and every 10 seconds underway below 14 knots, tightening to every 3 1/3 seconds while the vessel is changing course. Anchored or moored and moving no faster than 3 knots, it drops to 3 minutes. (The full ITU-R M.1371 table has a little more detail than that; the point is the rate scales with how interesting the vessel currently is.) Class B CS units on smaller craft report every 30 seconds above 2 knots and every 3 minutes at or below, and because they transmit only when they sense a free slot, they yield the channel to Class A. Class B SO units, increasingly common, reserve slots the way Class A does and report faster at speed. In a place as busy as Rotterdam, a small vessel's updates can get elbowed out and arrive minutes apart.

Satellite AIS is a separate latency story, and honestly the part of this stack I find most quietly enraging — half the vendors in this space sell "real-time satellite AIS" and quietly mean "ten minutes old, on a good day." The protocol's anti-collision mechanism — Self-Organizing TDMA — assumes each vessel can passively listen to nearby traffic and pick an unoccupied time slot from the local frame map. That assumption holds beautifully at ship-to-ship VHF ranges of roughly 20 nautical miles (the design figure; in practice it varies a lot with antenna height). From orbit, it falls apart: thousands of vessels beyond each other's range have independently claimed the same time slots, and a satellite hears all of them at once. Modern LEO constellations claw a lot of those messages back with wider-bandwidth receivers and probabilistic decoding, but gaps remain. By the time you see the vessel cross the line, the next position might place it 200 meters past it. Or 2 kilometers. The crossing itself is something you have to infer.
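One way to do that inference, for a circular fence, is to interpolate between the last fix outside and the first fix inside. This is a rough sketch rather than a recommendation: the function name and its arguments are made up for illustration, it assumes you already have each fix's distance from the fence center in meters and its report time in epoch seconds, and it pretends distance changes linearly between fixes, which is only roughly true over short gaps.

```python
from typing import Optional


def estimate_crossing_time(t_before: float, d_before: float,
                           t_after: float, d_after: float,
                           radius_m: float) -> Optional[float]:
    """Estimate when a vessel crossed a circular fence boundary.

    Times are epoch seconds; distances are meters from the fence center.
    Returns None when both fixes sit on the same side of the boundary.
    """
    if (d_before - radius_m) * (d_after - radius_m) > 0:
        return None  # no crossing between these two fixes
    if d_before == d_after:
        return t_before  # both fixes exactly on the boundary
    # Fraction of the gap covered before the boundary was reached.
    frac = (d_before - radius_m) / (d_before - d_after)
    return t_before + frac * (t_after - t_before)
```

The longer the gap between the two fixes, the less the estimate deserves to be believed, which is one more reason to ship both timestamps downstream.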

Which brings us to the second lie: "check if one is inside the other."

Inside, outside, and the hysteresis problem

Here is what naive geofencing looks like in production: a vessel anchors precisely on the boundary of your polygon. Wind and current shift, and the vessel swings through an arc — a ship on 60 meters of chain in 15 meters of water can sweep through a circle over 100 meters wide, and that's real movement, not noise. On top of that, modern receivers add a few meters of GPS jitter under open sky; multipath in a busy harbor, obstructions, or loss of differential correction can push that to tens of meters regardless of how new the equipment is. Every few minutes, a new position arrives — sometimes just inside the fence, sometimes just outside.

Your webhook fires. And fires. And fires. By morning your downstream system has received 847 "vessel entered port" notifications for the same ship that hasn't actually moved.

This is the same problem thermostats have. The solution is the same too: hysteresis. You don't fire on a single crossing. You fire on a state change that has settled.

Concretely, your handler tracks, for each (vessel, geofence) pair, the last known state — inside, outside, or unknown — and only emits a notification when a new observation contradicts the stored state and you have some confidence the contradiction is real.

The simplest version of "real" is a debounce window. We got this wrong the first time by using a single 60-second window platform-wide. Worked fine for a 5-nautical-mile port-approach fence. Fired seventeen times per crossing on a harbor entrance fence a pilot boat could cross in twenty seconds. It needs to be per-fence, scaled to the smallest dimension you care about and the fastest legitimate vessel.
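A minimal sketch of that per-pair state with a debounce window might look like the following. Everything here is illustrative: FenceState, observe, and the geofence.exited event name are assumptions (only geofence.entered appears in the payload later on), and timestamps are plain epoch seconds taken from the position report, not from arrival time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FenceState:
    state: str = "unknown"                  # "inside", "outside", or "unknown"
    pending: Optional[str] = None           # candidate state awaiting confirmation
    pending_since: Optional[float] = None   # report time of the first contradicting fix


def observe(fs: FenceState, inside: bool, reported_at: float,
            debounce_s: float) -> Optional[str]:
    """Feed one observation; return an event name once a change has settled."""
    observed = "inside" if inside else "outside"
    if fs.state == "unknown":
        fs.state = observed                 # first sighting: record it, emit nothing
        return None
    if observed == fs.state:
        fs.pending = fs.pending_since = None    # the contradiction did not hold
        return None
    if fs.pending != observed:
        fs.pending, fs.pending_since = observed, reported_at
    if reported_at - fs.pending_since < debounce_s:
        return None                         # contradicts, but has not settled yet
    fs.state, fs.pending, fs.pending_since = observed, None, None
    return "geofence.entered" if observed == "inside" else "geofence.exited"
```

The debounce_s argument is passed per call on purpose, so each fence can carry its own window instead of inheriting a platform-wide one.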

A buffer zone — enter when the vessel is 100m inside the boundary, exit when it's 100m outside — works beautifully for circles. For arbitrary polygons it's harder; computing an inward offset of a concave shape is a real piece of geometry, not a one-liner, so most implementations apply the buffer at evaluation time rather than try to redraw the polygon.
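For the circular case, applying the buffer at evaluation time is only a few lines. This is a sketch under assumptions: next_state is a hypothetical helper, and distance_m is the vessel's distance from the fence center, already computed with a proper spherical or projected formula.

```python
def next_state(current: str, distance_m: float, radius_m: float,
               buffer_m: float = 100.0) -> str:
    """Two-threshold hysteresis for a circular fence.

    Enter only when clearly inside, exit only when clearly outside;
    anywhere in the band between the two thresholds keeps the old state.
    """
    if distance_m <= radius_m - buffer_m:
        return "inside"
    if distance_m >= radius_m + buffer_m:
        return "outside"
    return current  # in the dead band: keep whatever we believed before
```

A vessel swinging at anchor on the nominal boundary stays within the band, so its state never flips.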

Your point-in-polygon test also needs to work in projected coordinates or use a proper spherical distance formula. Raw lat/lon arithmetic produces shapes that are noticeably non-circular at high latitudes — at Rotterdam's 51.9°N, a degree of longitude is about 69 km on the ground, not 111. That's a 38% squash. A "5 nautical mile radius" drawn with naive Euclidean distance is an ellipse, and a visibly wrong one.
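To make the squash concrete, here is a small haversine sketch. It assumes a spherical Earth with a mean radius of 6,371 km, which is plenty for fence-scale distances; the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_circle(lat: float, lon: float, center_lat: float,
                  center_lon: float, radius_m: float) -> bool:
    """True when the point lies within radius_m of the fence center."""
    return haversine_m(lat, lon, center_lat, center_lon) <= radius_m


# One degree east versus one degree north, starting at Rotterdam's latitude.
one_deg_east = haversine_m(51.9, 4.0, 51.9, 5.0)
one_deg_north = haversine_m(51.9, 4.0, 52.9, 4.0)
```

The east-west figure comes out near 69 km and the north-south one near 111 km, which is the ellipse a naive degree-based radius would draw.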

There's a subtler trap waiting too. If two position updates for the same vessel arrive within milliseconds — entirely possible with high-frequency feeds and multiple workers — a naive "read state, decide, write state" sequence races against itself. Both workers read outside, both see inside, both fire. In Postgres, the cleanest fix is a conditional update with a timestamp guard: UPDATE ... WHERE state = 'outside' AND last_observed_at < $position_timestamp, then check the affected row count. The timestamp clause matters more than it looks — without it, a satellite position from twenty minutes ago, arriving late after a fresh terrestrial position, will cheerfully roll the state back and re-fire the crossing. (Pessimistic locking with SELECT FOR UPDATE is simpler to reason about; at geofence throughputs the difference rarely matters.)
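Here is the conditional update as a runnable sketch. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the same UPDATE shape applies in Postgres. The fence_state table, its columns, and try_transition are all assumptions for illustration, and timestamps are stored as same-format ISO 8601 UTC strings so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE fence_state (
           mmsi INTEGER, fence_id TEXT, state TEXT, last_observed_at TEXT,
           PRIMARY KEY (mmsi, fence_id))"""
)
conn.execute(
    "INSERT INTO fence_state VALUES "
    "(338234567, 'gf_rotterdam_approach', 'outside', '2024-11-14T08:40:00Z')"
)


def try_transition(conn: sqlite3.Connection, mmsi: int, fence_id: str,
                   old: str, new: str, position_ts: str) -> bool:
    """Flip old -> new only if the stored state is still `old` and the
    position is newer than the last one acted on. True means this caller
    won the race and should emit the event."""
    cur = conn.execute(
        """UPDATE fence_state
              SET state = ?, last_observed_at = ?
            WHERE mmsi = ? AND fence_id = ?
              AND state = ? AND last_observed_at < ?""",
        (new, position_ts, mmsi, fence_id, old, position_ts),
    )
    conn.commit()
    return cur.rowcount == 1
```

Two workers calling try_transition with the same position get one True and one False, and a late position older than last_observed_at gets False no matter what state it claims.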

What the webhook should actually contain

When the notification finally does fire, what goes in the payload matters more than people expect. The receiver may need to argue with it — dismiss it as a stale position, route it to the right downstream system, decide whether to wake someone up. Give them the materials to make that call.

{
  "event": "geofence.entered",
  "event_id": "evt_01HXYZ...",
  "occurred_at": "2024-11-14T08:42:17Z",
  "vessel": {
    "mmsi": 338234567,
    "imo": 9074729,
    "name": "EXAMPLE VESSEL"
  },
  "geofence": {
    "id": "gf_rotterdam_approach",
    "name": "Rotterdam Approach"
  },
  "position": {
    "lat": 51.9461,
    "lon": 4.1234,
    "sog": 8.2,
    "cog": 87,
    "reported_at": "2024-11-14T08:41:55Z",
    "nav_status": "under_way_using_engine"
  },
  "confidence": "high",
  "previous_state": "outside"
}

Ship MMSIs are 9 digits with a Maritime Identification Digit (MID) prefix roughly in the 201–775 range; prefixes starting with 0, 8, or 9 indicate coast stations, group calls, or special devices like AtoN buoys and MOB beacons, not vessels. IMO numbers are 7 digits, last digit a check digit (the 9074729 above validates; we had two bug reports in the first month about an earlier example that didn't, from engineers who pasted it straight into a validator). Use fictional-but-structurally-valid identifiers in your docs.
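Both checks are cheap enough to run on your own documentation examples. A sketch, with hypothetical function names: the IMO check multiplies the first six digits by the weights 7 down to 2 and compares the last digit of the sum with the seventh digit, and the MMSI check only tests the rough 201 to 775 MID range mentioned above, not whether the MID is actually allocated.

```python
def imo_is_valid(imo: int) -> bool:
    """Validate a 7-digit IMO number using its check digit."""
    digits = [int(c) for c in str(imo)]
    if len(digits) != 7:
        return False
    # Weights 7, 6, 5, 4, 3, 2 applied to the first six digits.
    checksum = sum(d * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return checksum % 10 == digits[6]


def looks_like_ship_mmsi(mmsi: int) -> bool:
    """Rough structural check: 9 digits with a MID prefix in 201..775."""
    s = str(mmsi).zfill(9)  # leading zeros vanish when an MMSI is stored as an int
    return len(s) == 9 and 201 <= int(s[:3]) <= 775
```

For the payload above, the weighted sum for 9074729 is 139, whose last digit matches the final 9.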

event_id is the receiver's lifeline. Webhooks get retried. Networks fail. Your handler crashes between sending the HTTP request and committing the "sent" flag. The receiver will see the same event twice, and the only thing that lets them deduplicate cleanly is a stable, unique identifier per logical event — not per delivery attempt. (Their dedup store should be bounded too; storing every event_id forever is an unbounded leak nobody notices until it's six months in.)
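On the receiving side, a bounded dedup store can be as small as this. It is an in-memory sketch under assumptions: SeenEvents is a made-up class, a real receiver would more likely use a database table or a cache with expiry, and the TTL needs to outlast the sender's longest retry horizon.

```python
import time
from collections import OrderedDict
from typing import Optional


class SeenEvents:
    """Remembers event_ids for a limited time and a limited count."""

    def __init__(self, ttl_s: float = 48 * 3600, max_entries: int = 100_000):
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._seen: "OrderedDict[str, float]" = OrderedDict()  # id -> first-seen time

    def first_time(self, event_id: str, now: Optional[float] = None) -> bool:
        """True if this event_id has not been seen within the TTL."""
        now = time.time() if now is None else now
        # Entries are in insertion order, so expired ones sit at the front.
        while self._seen and next(iter(self._seen.values())) <= now - self.ttl_s:
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # hard cap: evict the oldest entry
        return True
```

The handler processes the event only when first_time returns True, and answers 200 either way so the sender stops retrying.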

occurred_at versus position.reported_at is the gap between when the vessel was where it says, and when your system was confident enough to call the crossing. The vessel was at that lat/lon at 08:41:55. You decided 22 seconds later. That delta is the truth about how much to trust the event.

confidence we argue about more than anything else. Current rule: high when the position is under 5 minutes old and the AIS position-accuracy bit is 1 (indicating <10m, usually DGPS-augmented); low when the last known position is older than 15 minutes or satellite-derived with a long gap; nothing in between. We tried medium and customers ignored it — one of our senior engineers defended it for two sprints on the grounds that "of course there's a middle case" — and the dashboard data was unambiguous: nobody filtered on it, nobody alerted on it, it was decoration. We killed it. Five minutes is generous for a tight fence — a vessel at 20 knots covers over 3 km in that time — so for harbor-scale fences we tighten it. Write the thresholds down. Argue about them in code review, not in incident retros.
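Writing the thresholds down can be this literal. The sketch below encodes only the "high" rule stated above and treats everything else as low; that collapse is an assumption on my part, since the satellite "long gap" condition isn't quantified here. The function name and parameters are illustrative.

```python
from datetime import datetime, timedelta, timezone


def confidence(reported_at: datetime, evaluated_at: datetime,
               accuracy_bit: int,
               max_age: timedelta = timedelta(minutes=5)) -> str:
    """Two-level confidence: 'high' needs a fresh fix AND the accuracy bit set.

    Assumption: anything that is not high is reported as low.
    """
    age = evaluated_at - reported_at
    if age < max_age and accuracy_bit == 1:
        return "high"
    return "low"


# The sample payload: position at 08:41:55, crossing called at 08:42:17.
reported = datetime(2024, 11, 14, 8, 41, 55, tzinfo=timezone.utc)
occurred = datetime(2024, 11, 14, 8, 42, 17, tzinfo=timezone.utc)
sample = confidence(reported, occurred, accuracy_bit=1)
```

Tightening a harbor-scale fence is then just a smaller max_age on that fence's configuration, which is a diff someone can argue with in review.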

nav_status is worth including, but treat it as advisory. It's a 4-bit field that's supposed to reflect operational status, but in practice it's set once at commissioning, or updated via an integrated bridge system that may or may not be wired to anything useful, or just left at the default. Vessels routinely transmit under_way_using_engine while sitting at anchor because nobody changed the setting. Useful for suppression heuristics; dangerous as a hard filter, and frankly this is the AIS field I'd most like to set on fire.

Delivery: the part everyone underestimates

The webhook itself is the easy bit. POST a JSON body to a URL. Twenty lines of code.

What separates a hobby project from production is what happens when the POST fails. And here's the thing nobody tells you up front: your retry schedule is going to be wrong, and you'll only find out which way when something breaks.

Ours used to retry every 10 seconds for 30 minutes. Sounded reasonable. Then a customer — call them Skipper, since they were a logistics shop with a fondness for nautical names — did a 35-minute deploy on a Friday afternoon. By the time their server came back, the queue had drained itself dry against 502s. They missed six hours of arrival notifications across half a fleet, and the angry Slack came in at 2am Saturday because that's when their ops team noticed the discrepancy against the carrier portal. We now use 30s, 2m, 10m, 1h, then hourly for 24h, with ±20% jitter on every interval — without the jitter, when a shared hosting provider recovers from an outage, every retry timer in your system fires at the same millisecond and you DDoS the customer on their way back up. Every retry reuses the same event_id.
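The schedule with jitter fits in a few lines. This sketch reads "hourly for 24h" as 24 further hourly attempts, which is an assumption, and the names are illustrative.

```python
import random

# 30s, 2m, 10m, 1h, then 24 more hourly attempts (assumed reading of the schedule).
BASE_DELAYS_S = [30, 120, 600, 3600] + [3600] * 24


def retry_delays(rng=random, jitter: float = 0.20) -> list:
    """Return the wait before each retry, each scaled by a random factor
    in [1 - jitter, 1 + jitter] so timers do not fire in lockstep."""
    return [d * rng.uniform(1 - jitter, 1 + jitter) for d in BASE_DELAYS_S]
```

Draw a fresh list of delays per event rather than per endpoint, otherwise every event queued for the same customer still retries at the same instant.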

Sign the payloads. HMAC-SHA256 of the request body with a per-customer secret, sent in a header — Stripe calls theirs Stripe-Signature, GitHub uses X-Hub-Signature-256. The header name doesn't matter. What matters is that the timestamp goes inside the signed material, and you reject anything more than a few minutes old. Without that, anyone who captures one valid request can replay it forever. Stripe's scheme is the widely-copied reference; copy it shamelessly.
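A sketch of that scheme using only the standard library. It follows the Stripe-style construction of signing the timestamp, a dot, and the raw body; the function names and the five-minute tolerance are assumptions, and how you pack the timestamp and signature into a header is up to you.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, body: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>', hex encoded."""
    signed = str(timestamp).encode() + b"." + body
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, timestamp: int, signature: str,
           now: Optional[float] = None, tolerance_s: int = 300) -> bool:
    """Reject stale timestamps first, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old (or too far in the future): possible replay
    expected = sign(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

Verify against the raw request bytes, before any JSON parsing, because re-serialising the body can change it enough to break the signature.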

Then there's the delivery log. Every attempt — success, failure, response code, response body, latency — written somewhere queryable. The argument "we never got the notification for vessel X" happens roughly once a month, and the log ends it in thirty seconds: yes you did, here's the 200; or no, we tried six times, your server returned 503 every time.

And two layers of failure handling, which get conflated more often than they should. A consecutive-failure threshold (say, ten failures inside a minute) immediately stops you hammering an endpoint that's clearly down. (Strictly speaking this isn't a full circuit breaker; we don't probe for recovery, we just back off and let the retry schedule do the probing.) A separate endpoint-health policy decides when to give up entirely, mark the endpoint dead, and page someone. We use 24 hours of accumulated failure for that, not 48. Forty-eight was too forgiving, and customers preferred being woken up to discovering on Monday that nothing had worked since Friday.
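Keeping the two layers as two separate pieces of code makes them harder to conflate. A rough sketch, with invented names and epoch-second timestamps passed in explicitly:

```python
from collections import deque
from typing import Optional


class FailureGate:
    """Layer one: pause deliveries after N failures in a row within a window."""

    def __init__(self, threshold: int = 10, window_s: float = 60.0):
        self.threshold = threshold
        self.window_s = window_s
        self._failures = deque()  # times of failures since the last success

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self._failures.clear()  # any success breaks the run
        else:
            self._failures.append(now)

    def should_pause(self, now: float) -> bool:
        # Forget failures that have aged out of the window.
        while self._failures and self._failures[0] <= now - self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


def endpoint_is_dead(failing_since: Optional[float], now: float,
                     limit_s: float = 24 * 3600) -> bool:
    """Layer two: give up once the endpoint has been failing for limit_s."""
    return failing_since is not None and now - failing_since >= limit_s
```

The gate only decides whether to send right now; endpoint_is_dead decides whether a human gets paged.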

The geofence itself is a choice

One last subtlety. A geofence isn't just a shape. It's a modeling decision about what you actually care about.

A circular fence around a port's coordinates is easy to draw and almost always wrong. Ports aren't circles. They have channels, anchorages outside the breakwater, restricted zones inside. A 5-mile radius around Rotterdam's harbor mouth will include vessels transiting the North Sea that have nothing to do with the port, and will simultaneously miss the Maas pilot boarding ground, which sits well offshore from the outer terminals. Rotterdam's port complex stretches more than 40 km from the sea inland, so "Rotterdam" as a single coordinate is already a fiction.

A polygon traced along the actual port boundary is better, but vessels at the outer anchorage are arguably already arriving, and a freight forwarder cares about a different boundary than a bunker fuel supplier or a port authority. In practice one vessel crosses several geofences on a single voyage — different fences for different customers asking different questions about the same ship.

Get the infrastructure right and you still have to decide where to put the line. The circle on the map is the easy part.