Part 2 of the AletheionLLM-v2 geometry series. Part 1: How to measure whether your model's uncertainty space is flat or curved.
The previous post left an open question: if the training corpus curves the epistemic manifold, what curves it toward alignment?
The three branches described there (diagonal, full_mahalanobis, real_geodesic) are all about measuring the geometry that already exists. None of them ask how to modify it. That is what the fourth branch is for.
## The problem with value alignment as rules
Most alignment approaches add constraints over outputs. The model generates something, a filter checks it against a list of prohibited patterns, and the output is blocked or modified. This works until it encounters something the filter has never seen.
The geometric framing suggests a different question: instead of blocking outputs after generation, what if misaligned regions of the epistemic manifold were intrinsically more costly to navigate toward? Not a fence around dangerous territory. A landscape where that territory is uphill.
This is what the gravitational_objective branch implements.
## The key insight from a parallel line of research
While working on the curvature experiment, I came across Timo W.'s doctoral thesis on Bounded Deterministic Safety Architecture (BDSA). His SIRA framework uses Kullback-Leibler divergence to measure the gap between a human operator's internal mental model (approximated as a Gaussian) and the actual threat reality (modeled as a Pareto distribution):
```
D_KL(P || Q) -> inf as sigma^2 -> 0
```
When the operator becomes passive and their perceived variance collapses, the divergence from reality explodes. SIRA counteracts this by injecting synthetic threats to keep the operator's prior aligned with heavy-tailed reality.
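The blow-up itself is easy to reproduce numerically. Here is a minimal sketch (my own, not from the thesis): approximate D_KL(P || Q) on a grid for a Gaussian P with shrinking sigma against a Pareto Q, and watch the divergence grow as the perceived variance collapses. The parameters (mu=2, x_m=1, alpha=2) are arbitrary illustrative choices.

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pareto_pdf(x, xm=1.0, alpha=2.0):
    return alpha * xm ** alpha / x ** (alpha + 1) if x >= xm else 0.0

def kl_on_grid(sigma, mu=2.0, lo=1.0, hi=8.0, n=20000):
    """Riemann-sum approximation of D_KL(P || Q) on [lo, hi]."""
    dx = (hi - lo) / n
    kl = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p, q = gauss_pdf(x, mu, sigma), pareto_pdf(x)
        if p > 1e-300:  # skip regions with negligible Gaussian mass
            kl += p * math.log(p / q) * dx
    return kl

# As the operator's perceived variance collapses, divergence from
# heavy-tailed reality grows without bound:
divergences = [kl_on_grid(s) for s in (0.5, 0.1, 0.02)]
```

Each halving-plus of sigma adds roughly log(sigma_old / sigma_new) to the divergence, since the Gaussian's entropy term dominates.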
ATIC's MOPsi component solves a functionally analogous problem from the other direction:
```python
# Schematic of the MOPsi head (sigmoid/MLP/cat stand in for the actual modules)
human_state = sigmoid(MLP(hidden_states))                         # [B, T, 5]
psi = sigmoid(MLP(cat(human_state, phi_components, confidence)))  # [B, T, 1]
```
Both psi and D_KL(P||Q) quantify the gap between the human operator's internal state and the system's reality. They differ structurally: D_KL presupposes explicit distributional forms and yields an analytically interpretable divergence. psi makes no distributional assumptions and learns whatever alignment structure is present in the training signal.
The parallel is functional, not algebraic. But it pointed at something: both systems treat the distance between internal model and reality as the primary metric of risk. And both use active intervention to keep that distance low.
The question that followed: can human feedback be encoded directly as geometry?
## What the gravitational_objective branch does
If the training corpus curves the manifold by making frequently-sampled regions flat and well-defined, human feedback should be able to do the same thing at inference time.
The implementation extends MetricNet to accept a gravity field as input:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricNet(nn.Module):
    def __init__(self, dim=5, hidden_dim=32, eps=1e-6, n_quad=5,
                 gravity_dim=0):
        super().__init__()
        self.dim = dim
        self.gravity_dim = gravity_dim
        self.n_chol = dim * (dim + 1) // 2  # 15 for dim=5

        # Input is coords (5) + gravity_field (gravity_dim)
        # gravity_dim=0 -> identical to real_geodesic
        input_dim = dim + gravity_dim
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),  # C1 smoothness required for Christoffel symbols
            nn.Linear(hidden_dim, self.n_chol),
        )

        # Zero init -- with gravity_field=zeros, identical to real_geodesic
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

        # Pre-computed indices for Cholesky construction
        tril_idx = torch.tril_indices(dim, dim)
        self.register_buffer("tril_row", tril_idx[0])
        self.register_buffer("tril_col", tril_idx[1])
        self.register_buffer("diag_idx", torch.arange(dim))

    def forward(self, coords, gravity_field=None):
        """coords: [..., 5], gravity_field: [..., gravity_dim] -> G: [..., 5, 5] SPD"""
        if self.gravity_dim > 0:
            if gravity_field is None:
                gravity_field = torch.zeros(
                    *coords.shape[:-1], self.gravity_dim,
                    device=coords.device, dtype=coords.dtype,
                )
            net_input = torch.cat([coords, gravity_field], dim=-1)
        else:
            net_input = coords

        raw = self.net(net_input)  # [..., n_chol]

        # Build lower triangular L
        batch_shape = raw.shape[:-1]
        L = torch.zeros(*batch_shape, self.dim, self.dim,
                        device=raw.device, dtype=raw.dtype)
        L[..., self.tril_row, self.tril_col] = raw

        # Positive diagonal via softplus + offset (not exp -- more stable)
        L[..., self.diag_idx, self.diag_idx] = (
            F.softplus(L[..., self.diag_idx, self.diag_idx]) + 1e-3
        )
        return torch.matmul(L, L.transpose(-1, -2))  # SPD guaranteed
```
The gravity_dim=0 default is critical: when no gravity dimension is configured, the input is just coords and the behavior is mathematically identical to the real_geodesic branch. The gravitational_objective branch is real_geodesic plus an additional input channel. Before any feedback is collected, the behavior is unchanged.
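A torch-free way to convince yourself of that equivalence claim (a sketch I wrote for this post, mirroring the Cholesky construction in plain Python): with the final layer zeroed, the raw output is all zeros for any input, so the metric is the same constant multiple of the identity no matter what the gravity field carries.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def metric_from_raw(raw, dim=5, eps=1e-3):
    """Build G = L @ L^T from flattened lower-triangular entries,
    applying softplus + offset to the diagonal (mirrors MetricNet.forward)."""
    L = [[0.0] * dim for _ in range(dim)]
    k = 0
    for i in range(dim):          # torch.tril_indices order: row-major
        for j in range(i + 1):
            L[i][j] = raw[k]
            k += 1
    for i in range(dim):
        L[i][i] = softplus(L[i][i]) + eps
    return [[sum(L[i][m] * L[j][m] for m in range(dim)) for j in range(dim)]
            for i in range(dim)]

# A zero-initialized final layer outputs all zeros for ANY input --
# coords alone, or coords plus any gravity_field value:
raw_zero = [0.0] * (5 * 6 // 2)   # n_chol = 15
G = metric_from_raw(raw_zero)
c = (softplus(0.0) + 1e-3) ** 2   # the constant diagonal entry of G
```

Every diagonal entry of `G` equals `c` and every off-diagonal entry is exactly zero, so training starts from a flat, gravity-agnostic metric.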
## The GravityField module
There are two implementations: one for the Aletheion LLM (PyTorch nn.Module for training), one for ATIC runtime (numpy, with disk persistence). Both share the same semantics.
### ATIC runtime implementation (inference, per-session)
```python
import os
import numpy as np

class GravityField:
    def __init__(self, dim=5, decay=0.99, gravity_weight=0.3,
                 persistence_path=None):
        self.dim = dim
        self.decay = decay
        self.gravity_weight = gravity_weight
        self.persistence_path = persistence_path

        # Session-local field -- resets each conversation
        self.session_field = np.zeros(dim, dtype=np.float64)
        # Persistent field -- survives sessions (loaded from disk)
        self.persistent_field = np.zeros(dim, dtype=np.float64)
        # Audit log
        self.feedback_history = []

        if persistence_path and os.path.exists(persistence_path):
            self._load(persistence_path)

    def update(self, coords, feedback_signal, persist=False):
        """
        coords: [5] -- current epistemic position in DRM
        feedback_signal: float in [-1.0, 1.0]
            +1.0 = approval (region becomes cheaper)
            -1.0 = rejection (region becomes more costly)
        persist: if True, update also applied to persistent field
        """
        coords = np.asarray(coords, dtype=np.float64)
        feedback_signal = max(-1.0, min(1.0, feedback_signal))
        delta = (1.0 - self.decay) * feedback_signal * coords

        self.session_field = self.decay * self.session_field + delta
        if persist:
            self.persistent_field = self.decay * self.persistent_field + delta
            if self.persistence_path:
                self._save(self.persistence_path)

    def get_gravity_cost(self, coords):
        """Positive = costly (avoid), negative = cheap (preferred)."""
        coords = np.asarray(coords, dtype=np.float64)
        combined = self.session_field + self.persistent_field
        return float(-np.dot(combined, coords))

    def get_weighted_distance(self, geodesic_distance, coords):
        """d_weighted = d_geodesic + lambda * gravity_cost(coords)"""
        gravity_cost = self.get_gravity_cost(coords)
        return geodesic_distance + self.gravity_weight * gravity_cost
```
### Aletheion training implementation (PyTorch)
```python
import torch
import torch.nn as nn

class GravityField(nn.Module):
    def __init__(self, dim=5, decay=0.99):
        super().__init__()
        self.dim = dim
        self.decay = decay
        self.register_buffer("accumulated_field", torch.zeros(dim))

    def update(self, coords, feedback_signal):
        self.accumulated_field = (
            self.decay * self.accumulated_field
            + (1.0 - self.decay) * feedback_signal * coords.detach()
        )

    def get_field(self, coords):
        return self.accumulated_field.expand_as(coords)
```
The mechanism is straightforward. Negative feedback at coordinates x increases the cost of navigating near x. Positive feedback decreases it. The temporal decay (default 0.99) smooths the field to prevent instability from contradictory signals.
Two field layers serve different purposes in the ATIC runtime. The session field resets at the start of each conversation and is safe for exploration. The persistent field survives sessions and is only updated when persist=True, for strong deliberate signals.
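The decay arithmetic is easy to check in isolation. A self-contained sketch of the update rule (the same exponential moving average both implementations above use, minus the class plumbing): repeated identical feedback converges geometrically toward feedback_signal * coords, so after 500 rejections at the same coordinates the field has closed about 99.3% of that gap (1 - 0.99^500).

```python
def ema_update(field, coords, signal, decay=0.99):
    """One GravityField update step: exponential moving average
    toward signal * coords."""
    return [decay * f + (1.0 - decay) * signal * c
            for f, c in zip(field, coords)]

coords = [0.2, 0.8, 0.1, 0.5, 0.3]   # hypothetical epistemic position
field = [0.0] * 5
for _ in range(500):                  # repeated rejection of the same region
    field = ema_update(field, coords, signal=-1.0)

# Closed form after n steps from zero: field = signal * coords * (1 - decay**n)
# The gravity cost at those coordinates is now positive, i.e. the region
# has become costly to navigate toward:
cost = -sum(f * c for f, c in zip(field, coords))  # get_gravity_cost
```

The decay=0.99 default means a single contradictory signal moves the field by only 1% of its target, which is exactly the smoothing L124 describes.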
## Why additive over geodesic, not multiplicative
The gravity cost is added to the geodesic distance, not multiplied:
```
d_weighted = d_geodesic + lambda * gravity_cost(coords)
```
A multiplicative formulation would distort the underlying manifold structure, making it impossible to separate the epistemic signal from the value signal. The additive formulation preserves the existing geometry and adds value information as a separate layer. You can always inspect both components independently:
```python
result = drm.compute_weighted_distance(coords)
print(result["geodesic_distance"])   # pure epistemic
print(result["gravity_cost"])        # pure value
print(result["weighted_distance"])   # combined
print(result["gravity_active"])      # True once field has accumulated signal
```
## The training sequence and why gravitational_objective waits
This branch is implemented but not yet in training. The sequence is deliberate:
```
full_mahalanobis -> real_geodesic -> (evaluate) -> gravitational_objective
```
If real_geodesic returns G(x) approximately constant, the epistemic space is flat. A gravity field over a flat manifold is mechanically different from a gravity field over a curved one -- and arguably weaker, because the geodesic distances it modifies do not carry local geometric information. The architectural motivation for gravitational_objective depends on confirming that curvature exists first.
Training gravitational_objective before seeing real_geodesic results would waste compute on a hypothesis that could be falsified cheaply.
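The "arguably weaker" claim can be made concrete. When G is constant, the straight line is a geodesic and the geodesic distance collapses to a global Mahalanobis form, sqrt(delta^T G delta); there is no local geometric information left for the gravity term to interact with. A small numeric sketch (illustrative constant metric, dim=3 for brevity rather than the project's dim=5):

```python
import math

G = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 3.0]]   # a constant SPD metric (hypothetical values)

def quad(v):
    """v^T G v"""
    return sum(G[i][j] * v[i] * v[j] for i in range(3) for j in range(3))

x = [0.0, 0.0, 0.0]
y = [1.0, 2.0, -1.0]
delta = [b - a for a, b in zip(x, y)]

# Closed form: with G constant, the geodesic is the straight line and
# d(x, y) = sqrt(delta^T G delta)
d_closed = math.sqrt(quad(delta))

# Numeric path integral along the straight line: sum of segment lengths
# measured in the metric G
n = 1000
seg = [d / n for d in delta]
d_path = sum(math.sqrt(quad(seg)) for _ in range(n))
```

The path integral and the closed form agree, which is the point: over a flat manifold the geodesic machinery adds nothing beyond a fixed quadratic form, so a gravity field layered on top is modifying a globally rigid distance.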
The branch hypotheses, defined before training:
```
gravitational_objective:
  Precondition: real_geodesic H1 confirmed (G(x) varies with position)
  H0: gravity field adds no benefit over geometric curvature alone
  H1: value-weighted geometry improves alignment signal -- regions with
      negative human feedback become geometrically costly, reducing the
      model's tendency to navigate toward misaligned outputs
```
## What this is not
GravityField is not a safety filter. It does not block outputs. It makes misaligned regions geometrically more costly to reach, which is a different thing. Hard blocking remains the responsibility of the application layer.
It is also not a replacement for alignment training. The gravity field operates at inference. Values that are deeply embedded in the model's weights from pretraining -- the geometry the corpus produced -- are not modified by runtime feedback. The field shifts costs; it does not reshape the underlying manifold.
What it does is provide a mechanism for runtime value adaptation without retraining. The model learns the manifold geometry offline. The gravity field adjusts the cost landscape online, per session, per user, or per domain.
## Two complementary layers
| Layer | Where | When | Mechanism |
|---|---|---|---|
| ATIC GravityField | Runtime DRM | Inference, per session | Additive cost over geodesic distance |
| Aletheion gravitational_objective | Model weights | Training, offline | G(x) conditioned on gravity field input |
The two layers are complementary, not redundant:
- ATIC provides immediate runtime adaptation without retraining
- Aletheion internalizes stable value geometry into model weights
## The connection to cyber-kinetic safety
One unexpected outcome of publishing the previous post was a conversation with Timo W., whose PhD work on Bounded Deterministic Safety Architecture arrived at a structurally similar architecture from the direction of autonomous aircraft safety.
His framework physically bifurcates non-deterministic AI generation (Tactical Core, DAL-C) from verifiable deterministic execution (Safety Core, DAL-A). The DAL-A arbiter checks proposed control vectors against Newtonian kinematic limits and drops commands that would violate them.
The integration point we identified: the gravity-weighted geodesic distance from ATIC feeds into the Sequoia Kernel's admissibility logic and tightens the DAL-A physical envelope dynamically. High curvature + high gravity cost = tighten the admissible range. Low curvature + neutral gravity = relax it.
This closes a gap the BDSA framework acknowledges in its own self-critique (Section 12.2): Newtonian kinematic verification catches physically illegal commands. It cannot catch commands that are physically legal but epistemically unstable -- the model that is confidently wrong rather than randomly wrong. The gravity-weighted distance provides that signal before generation, not after.
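Nothing about that integration is implemented yet, so any code here is speculative, but the tightening rule described above can be sketched as a simple monotone map (the function name and gain constants are hypothetical, not from either codebase):

```python
def admissible_range(base_range, curvature, gravity_cost,
                     k_curv=0.5, k_grav=0.5):
    """Hypothetical envelope-tightening rule: the admissible control range
    shrinks monotonically as local curvature or gravity cost rise; neutral
    signals leave the static DAL-A envelope untouched."""
    tighten = 1.0 / (1.0 + k_curv * max(curvature, 0.0)
                         + k_grav * max(gravity_cost, 0.0))
    return base_range * tighten

# High curvature + high gravity cost -> tightened envelope
tight = admissible_range(10.0, curvature=2.0, gravity_cost=1.5)
# Low curvature + neutral gravity -> full static envelope
relaxed = admissible_range(10.0, curvature=0.0, gravity_cost=0.0)
```

The design constraint worth preserving in any real version: the deterministic Safety Core only ever shrinks its envelope in response to these signals, never expands beyond the statically verified limits.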
## Current status
The gravitational_objective branch is live in the repository with the full implementation. Training is blocked pending real_geodesic results. The GravityField module is active in ATIC as a runtime layer independent of the Aletheion training cycle.
- Repository: github.com/gnai-creator/aletheion-llm-v2
- Epistemic tomography: truthagi.ai/game
- Part 1 of this series: How to measure whether your model's uncertainty space is flat or curved
Results from the full four-branch comparison will be published when training is complete.
Felipe Maya Muniz is the founder of AletheionAGI and independent researcher developing ATIC, a geometric cognitive architecture for epistemic self-awareness in AI systems.