Vasiliy Zakharchenko

Posted on • Originally published at community.developer.atlassian.com

Practical SQL Observability for Forge Apps with forge-sql-orm

Over the past month, I introduced a new observability layer inside my library forge-sql-orm.
The goal was simple: make Forge SQL measurable and transparent — without breaking Runs on Atlassian compliance and without accessing any customer data.

As part of my Codegeist project, I integrated this layer into a real Forge app and connected it to PostHog.
This instantly gave me end-to-end visibility into resolver performance across environments, tenants, and app versions - all while staying fully within the platform boundary.

Why Observability Is Especially Hard in Forge SQL

Forge SQL is multitenant — the physical database is shared across many customers, and each tenant gets its own logical slice of data.

In practice, this creates two major challenges:


1. Tenants can have radically different dataset sizes

A resolver that runs in 50–100 ms for one customer may take hundreds of milliseconds, or even seconds, for another.

And you have no way to know this ahead of time:

  • you cannot see the tenant’s actual table sizes
  • you cannot log into the underlying database
  • you cannot estimate index selectivity per tenant
  • you can receive slow-query entries from TiDB, but only after a tenant has already experienced degraded performance

2. Platform-level analytics are available - but not enough

Forge SQL exposes some low-level database metrics:

  • slow-query logs
  • cluster statistics
  • execution summaries

However, these analytics are:

  • not tied to specific resolvers
  • not correlated with app versions or environments
  • not connected to payload size or resolver logic
  • not continuous (visible only when TiDB marks a query as “slow”)
  • not designed to show trends, regressions, or per-tenant behavior

They help diagnose extreme cases, but they’re not sufficient for understanding how your application performs in real-world multi-tenant conditions.

You only see what your resolver sees — nothing more.


Why That Becomes a Real Problem

As schemas grow and joins become more complex, behavior becomes unpredictable:

  • A query with a perfect execution plan can still be slow for a large tenant.
  • Pagination with large OFFSET becomes inconsistent between customers.
  • A new join may be harmless in dev but catastrophic for a tenant with millions of rows.
  • Regression detection is impossible without telemetry — you cannot see if performance worsened after a release.

Because all real data lives inside Atlassian infrastructure, the app developer has almost no visibility into how SQL behaves “out in the wild.”

This is exactly the gap that the new observability layer is designed to fill.


What the New Observability Layer Provides

The layer automatically captures performance characteristics for every resolver invocation.


Aggregated total DB execution time

All SQL statements executed by the resolver contribute to a single aggregated DB time metric.


Tiered logging thresholds

  • Debug when total DB time > 1000 ms
  • Warn when total DB time > 2000 ms

Thresholds can be tuned per resolver.


Automatic SQL plan dump

When total DB time exceeds 2000 ms, the ORM prints full execution plans for all queries executed inside the resolver directly into Forge logs.

This helps diagnose:

  • unexpected TableFullScan
  • heavy IndexJoin
  • missing indexes
  • window functions with large memory needs
  • skewed statistics for a large tenant
  • inefficient pagination

Performance telemetry sent to allowed analytics tools (e.g., PostHog)

Only anonymized metadata is sent:

  • cloudId
  • environment
  • resolverName
  • appVersion
  • totalDbExecutionTime
  • totalResponseSize

This enables weekly insights, multi-tenant comparisons, and regression detection — fully compliant and PII-free.
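
As a rough sketch, the anonymized event could be modeled with a shape like the following (field names mirror the list above; this interface is illustrative, not a type exported by forge-sql-orm):

// Illustrative shape of the anonymized telemetry event (not the library's public type).
interface SqlResolverPerformanceEvent {
  cloudId: string;              // tenant identifier, no customer content
  environment: string;          // e.g. development / staging / production
  resolverName: string;
  appVersion: string;
  totalDbExecutionTime: number; // aggregated DB time in milliseconds
  totalResponseSize: number;    // resolver response size as reported by the library
}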


Automatic timeout and out-of-memory diagnostics

The observability layer also detects and analyzes failures such as:

  • “Your query has been cancelled due to exceeding the allowed memory limit for a single SQL query.”
  • “The provided query took more than 5000 milliseconds to execute.”

When these errors occur, the ORM retrieves and logs execution plans for the failed queries — making it possible to understand the issue even without access to tenant data.
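
For illustration only (this is not the library's internal implementation), application code could distinguish these two failure modes by inspecting the error message:

// Hypothetical helper, not part of forge-sql-orm: classify a Forge SQL failure by its message.
type SqlFailureKind = 'timeout' | 'oom' | 'other';

function classifySqlFailure(error: unknown): SqlFailureKind {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('exceeding the allowed memory limit')) return 'oom';
  if (message.includes('milliseconds to execute')) return 'timeout';
  return 'other';
}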


Configurable by Design

One important aspect of this observability layer is that it is not global.
It is resolver-level, meaning every resolver can define its own behavior independently.

Each resolver can configure:

  • custom thresholds
  • warning levels
  • plan-dump behavior
  • analytics logic
  • additional metadata
  • sampling rules
  • environment-specific overrides
  • or disable observability entirely

This flexibility is powered by a simple callback structure:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic here
  }
);

Because the logic lives at the resolver level, the system is:

  • easy to tune
  • lightweight
  • platform-safe
  • precise and predictable

Each resolver gets exactly the level of observability it needs — no more, no less.
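
For instance, a resolver could sample analytics and gate them by environment with a callback like the sketch below. The 10% sample rate, the "PRODUCTION" value, and the printPlan type are assumptions; getAppContext is used the same way as in the analytics example later in this post.

// Sketch: a reusable observability callback with sampling and an environment gate.
import { getAppContext } from "@forge/api"; // assumed import, as used in the analytics example below

const observabilityCallback = async (
  totalDbTime: number,
  totalResponseSize: number,
  printPlan: () => Promise<void>,
): Promise<void> => {
  const isProduction = getAppContext().environmentType === "PRODUCTION"; // assumed value
  const sampled = Math.random() < 0.1; // sample roughly 10% of invocations
  if (isProduction && sampled) {
    // optionally send an analytics event here
  }
  if (totalDbTime > 2000) {
    console.warn(`High DB time: ${totalDbTime}ms, response size: ${totalResponseSize}`);
    await printPlan();
  }
};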


Even If Analytics Are Disabled — Observability Still Works

Some customers disable analytics or block outbound requests entirely.

In this case, no telemetry is sent to PostHog (or any analytics tool), but the observability layer still provides full visibility.

All essential diagnostic information remains available directly in the logs:

  • totalDbExecutionTime
  • totalResponseSize
  • per-query execution details
  • full SQL plans
  • timeout and OOM diagnostics
  • resolver-level warnings and thresholds

This means that even if analytics events never leave the customer’s infrastructure, the customer can still provide logs that fully explain:

  • which resolver was slow
  • which queries were executed
  • what each plan looked like
  • why the slowdown happened (bad plan, full scan, index join, statistics skew, etc.)

Observability does not depend on outbound analytics — telemetry is optional, but diagnostics are always available.
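
A logs-only setup is just a matter of what the callback does. As a sketch based on the callback signature shown earlier (using the same FORGE_SQL_ORM instance as in the integration example below), a resolver can skip analytics entirely and rely on Forge logs:

// Logs-only observability: nothing leaves the Forge boundary, not even telemetry.
const result = await FORGE_SQL_ORM.executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbExecutionTime, totalResponseSize, printQueriesWithPlan) => {
    console.log(
      `DB time: ${totalDbExecutionTime}ms, response size: ${totalResponseSize}`,
    );
    if (totalDbExecutionTime > 2000) {
      await printQueriesWithPlan(); // full plans go straight to Forge logs
    }
  },
);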

100% inside the Forge boundary

  • No external storage
  • No outbound data beyond anonymized telemetry
  • No PII
  • No customer content

Runs on Atlassian — by design.


Integration Example

1. manifest.yml (permissions)


permissions:
  external:
    fetch:
      backend:
        - address: "*.posthog.com"
          category: analytics
          inScopeEUD: false

2. Wrapping a resolver


resolver.define('Test Resolver', async (req: Request) => {
    const resolverName = 'Test Resolver';
    return FORGE_SQL_ORM.executeWithMetadata(
        async () => {
           // resolver logic goes here; return its result
        },
        async (totalDbExecutionTime, totalResponseSize, printQueriesWithPlan) => {
            await ANALYTIC_SERVICE.sendAnalytics(
                "sql_resolver_performance",
                resolverName,
                req.context.cloudId,
                { totalDbExecutionTime, totalResponseSize },
            );

            if (totalDbExecutionTime > 2000) {
                console.warn(
                    `Resolver ${resolverName} has high database execution time: ${totalDbExecutionTime}ms`,
                );
                await printQueriesWithPlan();
            } else if (totalDbExecutionTime > 1000) {
                console.debug(
                    `Resolver ${resolverName} has elevated database execution time: ${totalDbExecutionTime}ms`,
                );
            }
        },
    );
});

3. Sending analytics to PostHog


const appContext = getAppContext();
const properties = {
    resolverName,
    cloudId,
    envName: appContext.environmentType,
    envId: appContext.environmentAri.environmentId,
    version: appContext.appVersion,
    parsedVersion: this.parseVersion(appContext.appVersion),
    totalDbExecutionTime: data.totalDbExecutionTime,
    totalResponseSize: data.totalResponseSize,
    eventVersion: 1,
};

await fetch("https://eu.i.posthog.com/capture/", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        api_key: process.env.ANALYTICS_API_KEY,
        event: eventName,
        distinct_id: cloudId,
        timestamp: new Date().toISOString(),
        properties: properties,
    }),
});

This is enough to build dashboards that reflect real-world performance across tenants.


4. PostHog query for weekly resolver performance

SELECT
  count(uuid) AS event_count,
  properties.envId,
  properties.cloudId,
  AVG(properties.totalDbExecutionTime) AS avgTime,
  concat(properties.cloudId, ':', properties.envName, ':', properties.resolverName) AS resolverName,
  max(properties.parsedVersion),
  max(timestamp) AS last_seen_at
FROM events
WHERE event = 'sql_resolver_performance'
  AND properties.eventVersion = 1
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY
  properties.envId,
  properties.cloudId,
  properties.envName,
  properties.resolverName
HAVING avgTime > 500;

Putting It to the Test

To verify the full pipeline, I intentionally left a performance bottleneck in place:

SELECT SLEEP(2)

This pushed the total DB time above the 2000 ms threshold.

Here’s what happened:

  1. In PostHog, I immediately saw a spike:

  2. I opened the Forge logs and found the detailed plans:

  3. The execution plan pointed directly to the problematic place in the code:

After removing the artificial delay, latency dropped from ~2200 ms to ~150 ms.


From Black Box to Glass Box

Before, Forge SQL was essentially a black box.
Now, with observability integrated directly into forge-sql-orm, it becomes:

  • diagnosable
  • measurable
  • predictable
  • transparent

This observability layer is lightweight, compliant, and extremely helpful when building applications with complex schemas, heavy joins, and tenant-specific performance patterns.


Try It Yourself

Codegeist project:
👉 https://github.com/vzakharchenko/Forge-Secure-Notes-for-Jira

forge-sql-orm repository:
👉 https://github.com/vzakharchenko/forge-sql-orm

Updates:

1. New deterministic default mode (TopSlowest)

The default behavior no longer depends on CLUSTER_STATEMENTS_SUMMARY.

forge-sql-orm now logs the exact SQL digests executed inside the resolver, giving deterministic diagnostics even for long-running logic.

By default it prints:

  • the single slowest query, and
  • optionally that query’s execution plan (showSlowestPlans: true)

Configurable like this:

{
  topQueries: 2,          // how many slowest queries to analyze
  showSlowestPlans: true, // re-executes them with EXPLAIN ANALYZE
}

If showSlowestPlans is enabled, the library re-executes these queries with EXPLAIN ANALYZE.
If disabled, it prints only the SQL text and its timing.

Example output with plans enabled:

Example output with plans disabled:


2. SummaryTable mode (optional)

SummaryTable mode still exists, but now works as an advanced diagnostic option.

It uses a short memory window:

summaryTableWindowTime: 15000 // 15s default

If resolver execution exceeds this window, forge-sql-orm automatically falls back to TopSlowest, avoiding stale diagnostics.

This keeps SummaryTable useful for fresh metadata, but avoids relying on it for long workflows.
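
As a sketch (the options are spelled out in the API listing in the next section; whether mode takes a string literal or an exported enum value is an assumption here), opting into SummaryTable mode could look like this:

// Sketch: enabling SummaryTable mode with the default 15-second window.
await FORGE_SQL_ORM.executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    if (totalDbTime > 1000) await printPlan();
  },
  {
    mode: 'SummaryTable',          // falls back to TopSlowest when the window is exceeded
    summaryTableWindowTime: 15000, // 15 s memory window
  },
);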

3. Updated API

Here is the updated API with all configuration options:

executeWithMetadata(
  async () => {
    // resolver logic
  },
  async (totalDbTime, totalResponseSize, printPlan) => {
    // your custom logic:
    // analytics, thresholds, alerts, logging, etc.
    // e.g.: if (totalDbTime > 1000) await printPlan();
  },
  {
    mode?: QueryPlanMode;            // "TopSlowest" | "SummaryTable" (default: TopSlowest)
    summaryTableWindowTime?: number; // ms window for SummaryTable (default: 15000)
    topQueries?: number;             // number of slowest queries to print (default: 1)
    showSlowestPlans?: boolean;      // print EXPLAIN ANALYZE in TopSlowest mode (default: true)
  }
);

Everything is opt-in and Forge-safe by design.

4. Timeout & OOM post-mortem diagnostics

For catastrophic SQL failures, the library performs an immediate post-mortem lookup.

Right after a Timeout or OOM, TiDB’s metadata is still in memory — so forge-sql-orm extracts the actual plan of the failing query before eviction can occur.

Case A: Timeout

“The provided query took more than 5000 milliseconds to execute…”

forge-sql-orm automatically logs the execution plan of the failing query:


Case B: Out of Memory (OOM)

“Your query has been cancelled due to exceeding the allowed memory limit…”

The library captures the memory-heavy plan that triggered the crash:


Why this matters

  • No re-execution required - avoids triggering the same timeout or OOM again.
  • Plans come from the real execution - captured with actual data distribution and bind parameters.
  • No tenant data is exposed - metadata only, fully compliant.
  • Runs entirely inside the Forge boundary - no privileged access or special APIs.
  • Works reliably even for complex, deeply nested SQL workloads - joins, pagination chains, window functions, etc.

This gives developers a safe way to understand severe failures without privileged access.

5. Why developer-side observability matters

This configurability — implemented on the application side — lets developers enable observability exactly where it’s needed:

  • instrumentation at the resolver, long-running function, or scheduler level
  • custom thresholds
  • selective plan printing
  • sampling
  • environment rules
  • optional analytics

And importantly:

You don’t need my library to do any of this.
Developers can implement the same pattern manually — forge-sql-orm simply makes it easier and more consistent.
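
As a rough sketch of the manual version (no forge-sql-orm involved; the helper and its names are hypothetical), the core of the pattern is just timing every query and acting on the aggregate:

// Hypothetical manual instrumentation: wrap each SQL call, aggregate total DB time,
// and apply tiered logging at the end of the resolver.
async function withDbTiming<T>(
  run: (timed: <R>(query: () => Promise<R>) => Promise<R>) => Promise<T>,
): Promise<T> {
  let totalDbTime = 0;
  const timed = async <R>(query: () => Promise<R>): Promise<R> => {
    const start = Date.now();
    try {
      return await query();
    } finally {
      totalDbTime += Date.now() - start;
    }
  };
  const result = await run(timed);
  if (totalDbTime > 2000) {
    console.warn(`High total DB time: ${totalDbTime}ms`);
  } else if (totalDbTime > 1000) {
    console.debug(`Elevated total DB time: ${totalDbTime}ms`);
  }
  return result;
}

Each SQL call in the resolver is then wrapped as timed(() => ...); executeWithMetadata automates this aggregation and adds plan dumps and failure diagnostics on top.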

Developer-side observability naturally complements platform-level observability.

Together, they enable building Forge apps with deep SQL execution paths: complex joins, multi-stage pagination, window functions, large OFFSET workflows — while still maintaining transparency and safety.
