Vasiliy Zakharchenko

Posted on • Originally published at community.developer.atlassian.com

Practical SQL Observability for Forge Apps with forge-sql-orm

Over the past month, I introduced a new observability layer inside my library forge-sql-orm.
The goal was simple: make Forge SQL measurable and transparent — without breaking Runs on Atlassian compliance and without accessing any customer data.

As part of my Codegeist project, I integrated this layer into a real Forge app and connected it to PostHog.
This instantly gave me end-to-end visibility into resolver performance across environments, tenants, and app versions - all while staying fully within the platform boundary.

Why Observability Is Especially Hard in Forge SQL

Forge SQL is multitenant — the physical database is shared across many customers, and each tenant gets its own logical slice of data.

In practice, this creates two major challenges:


1. Tenants can have radically different dataset sizes

A resolver that runs in 50–100 ms for one customer may take hundreds of milliseconds, or even seconds, for another.

And you have no way to know this ahead of time:

  • you cannot see the tenant’s actual table sizes
  • you cannot log into the underlying database
  • you cannot estimate index selectivity per tenant
  • you can receive slow-query entries from TiDB, but only after a tenant has already experienced degraded performance

2. Platform-level analytics are available - but not enough

Forge SQL exposes some low-level database metrics:

  • slow-query logs
  • cluster statistics
  • execution summaries

However, these analytics are:

  • not tied to specific resolvers
  • not correlated with app versions or environments
  • not connected to payload size or resolver logic
  • not continuous (visible only when TiDB marks a query as “slow”)
  • not designed to show trends, regressions, or per-tenant behavior

They help diagnose extreme cases, but they’re not sufficient for understanding how your application performs in real-world multi-tenant conditions.

You only see what your resolver sees — nothing more.


Why That Becomes a Real Problem

As schemas grow and joins become more complex, behavior becomes unpredictable:

  • A query with a perfect execution plan can still be slow for a large tenant.
  • Pagination with large OFFSET becomes inconsistent between customers.
  • A new join may be harmless in dev but catastrophic for a tenant with millions of rows.
  • Regression detection is impossible without telemetry — you cannot see if performance worsened after a release.

Because all real data lives inside Atlassian infrastructure, the app developer has almost no visibility into how SQL behaves “out in the wild.”

This is exactly the gap that the new observability layer is designed to fill.


What the New Observability Layer Provides

The layer automatically captures performance characteristics for every resolver invocation.


Aggregated total DB execution time

All SQL statements executed by the resolver contribute to a single aggregated DB time metric.


Tiered logging thresholds

  • Debug when total DB time > 1000 ms
  • Warn when total DB time > 2000 ms

Thresholds can be tuned per resolver.


Automatic SQL plan dump

When total DB time exceeds 2000 ms, the ORM prints full execution plans for all queries executed inside the resolver directly into Forge logs.

This helps diagnose:

  • unexpected TableFullScan
  • heavy IndexJoin
  • missing indexes
  • window functions with large memory needs
  • skewed statistics for a large tenant
  • inefficient pagination

Performance telemetry sent to allowed analytics tools (e.g., PostHog)

Only anonymized metadata is sent:

  • cloudId
  • environment
  • resolverName
  • appVersion
  • totalDbExecutionTime
  • totalResponseSize

This enables weekly insights, multi-tenant comparisons, and regression detection — fully compliant and PII-free.
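
As a rough sketch, the anonymized event could be modeled with a shape like the following (field names mirror the list above; this interface is illustrative, not a type exported by forge-sql-orm):

// Illustrative shape of the anonymized telemetry event (not the library's public type).
interface SqlResolverPerformanceEvent {
  cloudId: string;              // tenant identifier, no customer content
  environment: string;          // e.g. development / staging / production
  resolverName: string;
  appVersion: string;
  totalDbExecutionTime: number; // aggregated DB time in milliseconds
  totalResponseSize: number;    // resolver response size as reported by the library
}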


Automatic timeout and out-of-memory diagnostics

The observability layer also detects and analyzes failures such as:

  • “Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query.”
  • “The provided query took more than 5000 milliseconds to execute.”

When these errors occur, the ORM retrieves and logs execution plans for the failed queries — making it possible to understand the issue even without access to tenant data.
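
For illustration only (this is not the library's internal implementation), application code could distinguish these two failure modes by inspecting the error message:

// Hypothetical helper, not part of forge-sql-orm: classify a Forge SQL failure by its message.
type SqlFailureKind = 'timeout' | 'oom' | 'other';

function classifySqlFailure(error: unknown): SqlFailureKind {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('exceeding the allowed memory limit')) return 'oom';
  if (message.includes('milliseconds to execute')) return 'timeout';
  return 'other';
}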


Configurable by Design

One important aspect of this observability layer is that it is not global.
It is resolver-level, meaning every resolver can define its own behavior independently.

Each resolver can configure:

  • custom thresholds
  • warning levels
  • plan-dump behavior
  • analytics logic
  • additional metadata
  • sampling rules
  • environment-specific overrides
  • or disable observability entirely

This flexibility is powered by a simple callback structure:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic here
  }
);

Because the logic lives at the resolver level, the system is:

  • easy to tune
  • lightweight
  • platform-safe
  • precise and predictable

Each resolver gets exactly the level of observability it needs — no more, no less.
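
For instance, a resolver could sample analytics and gate them by environment with a callback like the sketch below. The 10% sample rate, the "PRODUCTION" value, and the printPlan type are assumptions; getAppContext is used the same way as in the analytics example later in this post.

// Sketch: a reusable observability callback with sampling and an environment gate.
import { getAppContext } from "@forge/api"; // assumed import, as used in the analytics example below

const observabilityCallback = async (
  totalDbTime: number,
  totalResponseSize: number,
  printPlan: () => Promise<void>,
): Promise<void> => {
  const isProduction = getAppContext().environmentType === "PRODUCTION"; // assumed value
  const sampled = Math.random() < 0.1; // sample roughly 10% of invocations
  if (isProduction && sampled) {
    // optionally send an analytics event here
  }
  if (totalDbTime > 2000) {
    console.warn(`High DB time: ${totalDbTime}ms, response size: ${totalResponseSize}`);
    await printPlan();
  }
};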


Even If Analytics Are Disabled — Observability Still Works

Some customers disable analytics or block outbound requests entirely.

In this case, no telemetry is sent to PostHog (or any analytics tool), but the observability layer still provides full visibility.

All essential diagnostic information remains available directly in the logs:

  • totalDbExecutionTime
  • totalResponseSize
  • per-query execution details
  • full SQL plans
  • timeout and OOM diagnostics
  • resolver-level warnings and thresholds

This means that even if analytics events never leave the customer’s infrastructure, the customer can still provide logs that fully explain:

  • which resolver was slow
  • which queries were executed
  • what each plan looked like
  • why the slowdown happened (bad plan, full scan, index join, statistics skew, etc.)

Observability does not depend on outbound analytics — telemetry is optional, but diagnostics are always available.
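
A logs-only setup is just a matter of what the callback does. As a sketch based on the callback signature shown earlier (using the same FORGE_SQL_ORM instance as in the integration example below), a resolver can skip analytics entirely and rely on Forge logs:

// Logs-only observability: nothing leaves the Forge boundary, not even telemetry.
const result = await FORGE_SQL_ORM.executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbExecutionTime, totalResponseSize, printQueriesWithPlan) => {
    console.log(
      `DB time: ${totalDbExecutionTime}ms, response size: ${totalResponseSize}`,
    );
    if (totalDbExecutionTime > 2000) {
      await printQueriesWithPlan(); // full plans go straight to Forge logs
    }
  },
);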

100% inside the Forge boundary

  • No external storage
  • No outbound data beyond anonymized telemetry
  • No PII
  • No customer content

Runs on Atlassian — by design.


Integration Example

1. manifest.yml (permissions)


permissions:
  external:
    fetch:
      backend:
        - address: "*.posthog.com"
          category: analytics
          inScopeEUD: false

2. Wrapping a resolver


resolver.define('Test Resolver', async (req: Request) => {
    const resolverName = 'Test Resolver';
    return FORGE_SQL_ORM.executeWithMetadata(
        async () => {
           // resolver logic goes here; return its result
        },
        async (totalDbExecutionTime, totalResponseSize, printQueriesWithPlan) => {
            await ANALYTIC_SERVICE.sendAnalytics(
                "sql_resolver_performance",
                resolverName,
                req.context.cloudId,
                { totalDbExecutionTime, totalResponseSize },
            );

            if (totalDbExecutionTime > 2000) {
                console.warn(
                    `Resolver ${resolverName} has high database execution time: ${totalDbExecutionTime}ms`,
                );
                await printQueriesWithPlan();
            } else if (totalDbExecutionTime > 1000) {
                console.debug(
                    `Resolver ${resolverName} has elevated database execution time: ${totalDbExecutionTime}ms`,
                );
            }
        },
    );
});

3. Sending analytics to PostHog


const appContext = getAppContext();
const properties = {
    resolverName,
    cloudId,
    envName: appContext.environmentType,
    envId: appContext.environmentAri.environmentId,
    version: appContext.appVersion,
    parsedVersion: this.parseVersion(appContext.appVersion),
    totalDbExecutionTime: data.totalDbExecutionTime,
    totalResponseSize: data.totalResponseSize,
    eventVersion: 1,
};

await fetch("https://eu.i.posthog.com/capture/", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        api_key: process.env.ANALYTICS_API_KEY,
        event: eventName,
        distinct_id: cloudId,
        timestamp: new Date().toISOString(),
        properties: properties,
    }),
});

This is enough to build dashboards that reflect real-world performance across tenants.


4. PostHog query for weekly resolver performance

SELECT
  count(uuid) AS event_count,
  properties.envId,
  properties.cloudId,
  AVG(properties.totalDbExecutionTime) AS avgTime,
  concat(properties.cloudId, ':', properties.envName, ':', properties.resolverName) AS resolverName,
  max(properties.parsedVersion),
  max(timestamp) AS last_seen_at
FROM events
WHERE event = 'sql_resolver_performance'
  AND properties.eventVersion = 1
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY
  properties.envId,
  properties.cloudId,
  properties.envName,
  properties.resolverName
HAVING avgTime > 500;

Putting It to the Test

To verify the full pipeline, I intentionally left a performance bottleneck in place:

SELECT SLEEP(2)

This pushed the total DB time above the 2000 ms threshold.

Here’s what happened:

  1. In PostHog, I immediately saw a spike:

  2. I opened the Forge logs and found the detailed plans:

  3. The execution plan pointed directly to the problematic place in the code:

After removing the artificial delay, latency dropped from ~2200 ms to ~150 ms.


From Black Box to Glass Box

Before, Forge SQL was essentially a black box.
Now, with observability integrated directly into forge-sql-orm, it becomes:

  • diagnosable
  • measurable
  • predictable
  • transparent

This observability layer is lightweight, compliant, and extremely helpful when building applications with complex schemas, heavy joins, and tenant-specific performance patterns.


Try It Yourself

Codegeist project:
👉 https://github.com/vzakharchenko/Forge-Secure-Notes-for-Jira

forge-sql-orm repository:
👉 https://github.com/vzakharchenko/forge-sql-orm

Updates:

1. New deterministic default mode (TopSlowest)

The default behavior no longer depends on CLUSTER_STATEMENTS_SUMMARY.

forge-sql-orm now logs the exact SQL digests executed inside the resolver, giving deterministic diagnostics even for long-running logic.

By default it prints:

  • the single slowest query, and
  • optionally that query’s execution plan (showSlowestPlans: true)

Configurable like this:

{
  topQueries: 2,          // how many slowest queries to analyze
  showSlowestPlans: true, // re-executes them with EXPLAIN ANALYZE
}

If showSlowestPlans is enabled, the library re-executes these queries with EXPLAIN ANALYZE.
If disabled, it prints only the SQL text and its timing.

Example output with plans enabled:

Example output with plans disabled:


2. SummaryTable mode (optional)

SummaryTable mode still exists, but now works as an advanced diagnostic option.

It uses a short memory window:

summaryTableWindowTime: 15000 // 15s default

If resolver execution exceeds this window, forge-sql-orm automatically falls back to TopSlowest, avoiding stale diagnostics.

This keeps SummaryTable useful for fresh metadata, but avoids relying on it for long workflows.
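
As a sketch (the options are spelled out in the API listing in the next section; whether mode takes a string literal or an exported enum value is an assumption here), opting into SummaryTable mode could look like this:

// Sketch: enabling SummaryTable mode with the default 15-second window.
await FORGE_SQL_ORM.executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    if (totalDbTime > 1000) await printPlan();
  },
  {
    mode: 'SummaryTable',          // falls back to TopSlowest when the window is exceeded
    summaryTableWindowTime: 15000, // 15 s memory window
  },
);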

3. Updated API

Here is the updated API with all configuration options:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic:
    // analytics, thresholds, alerts, logging, etc.
    // e.g.: if (totalDbTime > 1000) await printPlan();
  },
  {
    mode?: QueryPlanMode;            // "TopSlowest" | "SummaryTable" (default: TopSlowest)
    summaryTableWindowTime?: number; // ms window for SummaryTable (default: 15000)
    topQueries?: number;             // number of slowest queries to print (default: 1)
    showSlowestPlans?: boolean;      // print EXPLAIN ANALYZE in TopSlowest mode (default: true)
  }
);

Everything is opt-in and Forge-safe by design.

4. Timeout & OOM post-mortem diagnostics

For catastrophic SQL failures, the library performs an immediate post-mortem lookup.

Right after a Timeout or OOM, TiDB’s metadata is still in memory — so forge-sql-orm extracts the actual plan of the failing query before eviction can occur.

Case A: Timeout

“The provided query took more than 5000 milliseconds to execute…”

forge-sql-orm automatically logs the execution plan of the failing query:


Case B: Out of Memory (OOM)

“Your query has been cancelled due to exceeding the allowed memory limit…”

The library captures the memory-heavy plan that triggered the crash:


Why this matters

  • No re-execution required - avoids triggering the same timeout or OOM again.
  • Plans come from the real execution - captured with actual data distribution and bind parameters.
  • No tenant data is exposed - metadata only, fully compliant.
  • Runs entirely inside the Forge boundary - no privileged access or special APIs.
  • Works reliably even for complex, deeply nested SQL workloads - joins, pagination chains, window functions, etc.

This gives developers a safe way to understand severe failures without privileged access.

5. Why developer-side observability matters

This configurability — implemented on the application side — lets developers enable observability exactly where it’s needed:

  • instrumentation at the resolver, long-running function, or scheduler level
  • custom thresholds
  • selective plan printing
  • sampling
  • environment rules
  • optional analytics

And importantly:

You don’t need my library to do any of this.
Developers can implement the same pattern manually — forge-sql-orm simply makes it easier and more consistent.
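
As a rough sketch of the manual version (no forge-sql-orm involved; the helper and its names are hypothetical), the core of the pattern is just timing every query and acting on the aggregate:

// Hypothetical manual instrumentation: wrap each SQL call, aggregate total DB time,
// and apply tiered logging at the end of the resolver.
async function withDbTiming<T>(
  run: (timed: <R>(query: () => Promise<R>) => Promise<R>) => Promise<T>,
): Promise<T> {
  let totalDbTime = 0;
  const timed = async <R>(query: () => Promise<R>): Promise<R> => {
    const start = Date.now();
    try {
      return await query();
    } finally {
      totalDbTime += Date.now() - start;
    }
  };
  const result = await run(timed);
  if (totalDbTime > 2000) {
    console.warn(`High total DB time: ${totalDbTime}ms`);
  } else if (totalDbTime > 1000) {
    console.debug(`Elevated total DB time: ${totalDbTime}ms`);
  }
  return result;
}

Each SQL call in the resolver is then wrapped as timed(() => ...); executeWithMetadata automates this aggregation and adds plan dumps and failure diagnostics on top.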

Developer-side observability naturally complements platform-level observability.

Together, they enable building Forge apps with deep SQL execution paths: complex joins, multi-stage pagination, window functions, large OFFSET workflows — while still maintaining transparency and safety.
