From Closures to an AST in a Kotlin Transform Graph

#architecture #computerscience #kotlin #opensource

kumulant is a streaming statistics library: you feed it numbers, it maintains an accumulator like a mean or a quantile sketch, and you read snapshots back. Above the accumulator sits a graph of transforms and filters that preprocesses each value before it lands in the stat: filter out negatives, log-transform latencies, take a weighted dot product of a feature vector.

The first version of that graph was Kotlin lambdas all the way down. Pre-update transforms were (Double) -> Double, filters were (Double) -> Boolean, paired transforms were (Double, Double) -> Pair<Double, Double>. A schema would look something like this (the StatSchema / by stat pattern is from the previous post):

object LatencyMetrics : StatSchema() {
    val p99 by stat(
        DDSketch(probabilities = doubleArrayOf(0.99))
            .filter { it >= 0 }
            .transform { ln(it) }
    )
}

This is the path of least resistance in Kotlin and it works fine while the only caller is in-process Kotlin code. The lambdas are typed, the call site is short, and the closure captures whatever it needs from the enclosing scope.

The wire problem

kumulant's job inside the Eignex rewrite is to back a cloud-deployed monitoring layer. The expected caller is a service that wants to author its stat config as YAML and POST it over HTTP, not link kumulant as a Kotlin dependency. Once you've decided that's the deployment shape, every closure in the graph is a problem. A (Double) -> Double doesn't serialize, and you can't write a transform in YAML if a transform is a JVM lambda.

The naive fix is to ship a handful of named transforms (log, sqrt, negate) and let YAML reference them by string. That works until the first user needs log(x) minus log(y) or a piecewise expression, at which point you either keep adding named cases or invent a tiny expression language. Better to invent it up front.

The AST

The redesign turns every closure-shaped slot in the graph into a sealed AST:

@Serializable
sealed interface ScalarExpr {
    fun eval(x: Double, y: Double = 0.0, v: DoubleArray = EMPTY_VECTOR): Double
}

@Serializable @SerialName("X")     data object X : ScalarExpr { ... }
@Serializable @SerialName("Const") data class Const(val v: Double) : ScalarExpr { ... }
@Serializable @SerialName("Mul")   data class Mul(val l: ScalarExpr, val r: ScalarExpr) : ScalarExpr { ... }
@Serializable @SerialName("Log")   data class Log(val a: ScalarExpr) : ScalarExpr { ... }
@Serializable @SerialName("VFold") data class VFold(val op: VFoldOp) : ScalarExpr { ... }
// ... Add, Sub, Div, Neg, Abs, Exp, Sqrt, Pow, Min, Max, IfExpr, VDot, V(index)

The same shape is mirrored in BoolExpr (Gt, Lt, And, Or, Not, InRange, etc.) and in VectorExpr, for cases where the output is a vector of arbitrary length rather than a scalar. Each node is @Serializable with a @SerialName discriminator, so kotlinx.serialization round-trips the whole tree polymorphically. The leaves X, Y, and V(i) are placeholders that get bound to the current input when eval runs.
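
To make the placeholder binding concrete, here is a minimal sketch of what the elided eval bodies could look like. It is not the kumulant source: the serialization annotations are dropped so it runs without dependencies, the EMPTY_VECTOR definition is assumed, and the field name index on V is a guess based on the V(index) entry in the node list.

```kotlin
import kotlin.math.ln

// Assumed definition; the post references EMPTY_VECTOR but does not show it.
val EMPTY_VECTOR = DoubleArray(0)

sealed interface ScalarExpr {
    fun eval(x: Double, y: Double = 0.0, v: DoubleArray = EMPTY_VECTOR): Double
}

// Leaves: each one just picks its value out of the current input.
data object X : ScalarExpr {
    override fun eval(x: Double, y: Double, v: DoubleArray): Double = x
}
data object Y : ScalarExpr {
    override fun eval(x: Double, y: Double, v: DoubleArray): Double = y
}
data class V(val index: Int) : ScalarExpr {
    override fun eval(x: Double, y: Double, v: DoubleArray): Double = v[index]
}
data class Const(val v: Double) : ScalarExpr {
    // The parameter v (the input vector) shadows the property, hence this.v.
    override fun eval(x: Double, y: Double, v: DoubleArray): Double = this.v
}

// Interior nodes: evaluate the children against the same input, then combine.
data class Mul(val l: ScalarExpr, val r: ScalarExpr) : ScalarExpr {
    override fun eval(x: Double, y: Double, v: DoubleArray): Double =
        l.eval(x, y, v) * r.eval(x, y, v)
}
data class Log(val a: ScalarExpr) : ScalarExpr {
    override fun eval(x: Double, y: Double, v: DoubleArray): Double =
        ln(a.eval(x, y, v))
}
```

With these definitions, Log(Mul(X, Const(2.0))).eval(0.5) computes ln(0.5 * 2.0), and Mul(Y, V(1)).eval(0.0, 2.0, doubleArrayOf(5.0, 7.0)) multiplies the second input by the vector element at index 1.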

The call site you'd want to keep, transform { ln(it) }, would now have to read transform(Log(X)). Doable, but losing the operator syntax is a real regression. Kotlin's operator overloading recovers it:

operator fun ScalarExpr.plus(rhs: ScalarExpr): ScalarExpr = Add(this, rhs)
operator fun ScalarExpr.times(rhs: Double): ScalarExpr = Mul(this, Const(rhs))
infix fun ScalarExpr.gt(rhs: Double): BoolExpr = Gt(this, Const(rhs))
// ... one per operator, three handfuls in total
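
As a rough end-to-end check of the idea, the sketch below pairs those three operators with stripped-down stand-ins for the nodes and builds two small trees. The nodes are simplified to a single input x with no serialization annotations, the eval method on BoolExpr is an assumption (the post does not show it), and the names doubledPlusX and isPositive are hypothetical.

```kotlin
// Simplified stand-ins for the post's nodes: one input, no annotations.
sealed interface ScalarExpr { fun eval(x: Double): Double }
data object X : ScalarExpr { override fun eval(x: Double): Double = x }
data class Const(val v: Double) : ScalarExpr { override fun eval(x: Double): Double = v }
data class Add(val l: ScalarExpr, val r: ScalarExpr) : ScalarExpr {
    override fun eval(x: Double): Double = l.eval(x) + r.eval(x)
}
data class Mul(val l: ScalarExpr, val r: ScalarExpr) : ScalarExpr {
    override fun eval(x: Double): Double = l.eval(x) * r.eval(x)
}

// Assumed shape: BoolExpr's method name is not shown in the post.
sealed interface BoolExpr { fun eval(x: Double): Boolean }
data class Gt(val l: ScalarExpr, val r: ScalarExpr) : BoolExpr {
    override fun eval(x: Double): Boolean = l.eval(x) > r.eval(x)
}

// The operator layer: each overload only constructs a node, it computes nothing.
operator fun ScalarExpr.plus(rhs: ScalarExpr): ScalarExpr = Add(this, rhs)
operator fun ScalarExpr.times(rhs: Double): ScalarExpr = Mul(this, Const(rhs))
infix fun ScalarExpr.gt(rhs: Double): BoolExpr = Gt(this, Const(rhs))

// Hypothetical example expressions.
val doubledPlusX: ScalarExpr = X * 2.0 + X   // builds Add(Mul(X, Const(2.0)), X)
val isPositive: BoolExpr = X gt 0.0          // builds Gt(X, Const(0.0))
```

Because the nodes are data classes, doubledPlusX == Add(Mul(X, Const(2.0)), X) holds, which is handy in tests. Kotlin's usual precedence carries over: times binds tighter than plus, and a named infix function like gt binds looser than both, so X * 2.0 gt 1.0 parses as (X * 2.0) gt 1.0.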

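With those in scope, the user-facing API looks almost identical to the lambda version. (One behavioral difference in this example: X gt 0.0 is a strict comparison, so it drops the zeros that it >= 0 let through.) What's underneath is the difference: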

val p99 by stat(
    DDSketch(probabilities = doubleArrayOf(0.99))
        .filter(X gt 0.0)        // BoolExpr: Gt(X, Const(0.0))
        .transform(Log(X))       // ScalarExpr: Log(X)
)

That same schema serializes to YAML as a tree that a user can hand-edit or a deploy pipeline can template.

What you lose, what you gain

The loss is real: you can't drop into arbitrary Kotlin in the body of a transform. If your transform isn't expressible as a composition of the AST node types you've defined, you have to add a node. There's no escape hatch to a raw lambda for the YAML path, because the whole point is that the YAML path doesn't have a JVM to run a closure on the other side.

In exchange:

The config serializes to YAML or JSON without thinking about it.
The AST is inspectable, so you can diff two versions of a schema and tell a user what changed before a redeploy.
The runtime cost is still one call per node, but you can compile the AST down to a single closure at materialize time and amortize the tree walk; kumulant does this in its spec layer.
Adding a new node type is one data class, one eval impl, one serial name. Adding a new built-in to a named-transforms registry would be the same amount of code with worse composability.
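
The compile-to-closure point can be sketched as follows. This is an illustration of the general technique, not kumulant's spec layer: the compile method is a hypothetical addition to the interface, the nodes are reduced to a single input x, and the annotations are omitted.

```kotlin
import kotlin.math.ln

sealed interface ScalarExpr {
    // Hypothetical method: walk the tree once and return a plain function.
    fun compile(): (Double) -> Double
}

data object X : ScalarExpr {
    override fun compile(): (Double) -> Double = { x -> x }
}
data class Const(val v: Double) : ScalarExpr {
    override fun compile(): (Double) -> Double = { _ -> v }
}
data class Mul(val l: ScalarExpr, val r: ScalarExpr) : ScalarExpr {
    override fun compile(): (Double) -> Double {
        // Children are compiled once here, not on every evaluation.
        val lf = l.compile()
        val rf = r.compile()
        return { x -> lf(x) * rf(x) }
    }
}
data class Log(val a: ScalarExpr) : ScalarExpr {
    override fun compile(): (Double) -> Double {
        val af = a.compile()
        return { x -> ln(af(x)) }
    }
}
```

Calling Log(Mul(X, Const(2.0))).compile() once at setup returns a function that can be applied to every incoming value without touching the node objects again. The result still nests one closure per node, so the gain in this simple form is that the traversal and node dispatch happen once up front; a real compile pass could go further, for example by folding constant subtrees.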

The lesson I took away: as soon as a config has to cross the wire, the wire format stops being a serialization concern bolted onto the side of the typed API and becomes the thing the typed API has to match. Starting with closures and bolting YAML on later would have meant two sources of truth and a translation layer between them. Starting from the AST and letting Kotlin's operator overloading recover the ergonomics gave both surfaces from one definition.

The expression node code above lives in Eignex/kumulant under schema/Expr.kt. The serialization plumbing (polymorphic discriminator, typed-key schemas) is the Eignex/skema library, covered in more detail in the previous post.
