Rikin Patel
Human-Aligned Decision Transformers for planetary geology survey missions with ethical auditability baked in


Introduction: A Lesson from the Martian Simulant

My journey into human-aligned AI for space exploration began not with a grand theory, but with a frustrating afternoon in a robotics lab. I was part of a team testing an autonomous rover in a simulated Martian terrain pit filled with JSC Mars-1A regolith simulant. The rover, powered by a sophisticated reinforcement learning policy, was tasked with collecting geological samples from predetermined coordinates. Technically, it was succeeding—navigating obstacles, reaching waypoints, and extending its drill arm with precision. Yet, something felt profoundly wrong.

During one test run, the rover approached a cluster of interesting, layered sedimentary rocks. Its policy, optimized for "samples collected per hour," identified them as high-value targets. However, to reach them efficiently, it planned a path that would drive directly over a delicate, crust-like surface feature that our geologist had flagged as potentially containing evaporite minerals—a key biosignature context. The rover saw no "obstacle"; its cost function only considered traversal time and energy. It was about to destroy a scientifically priceless formation in pursuit of a marginal efficiency gain.

This moment was my epiphany. The problem wasn't the rover's technical capability; it was its value alignment. It had no concept of "scientific heritage," "irreversible damage," or "precautionary principle." It was a pure optimizer in a domain where optimization without context is tantamount to vandalism. This experience sent me down a multi-year research path, exploring how to bake not just competence but wisdom and ethical reasoning into autonomous systems destined for other worlds. The culmination of that exploration is the framework I'll discuss here: Human-Aligned Decision Transformers (HADT), specifically architected for planetary geology surveys with ethical auditability as a first-class citizen.

Technical Background: From Decision Transformers to Value-Laden Trajectories

The foundation of this work is the Decision Transformer (DT), a paradigm-shifting model introduced by Chen et al. that frames reinforcement learning as a sequence modeling problem. Instead of learning a value function or policy gradient, a DT takes in a trajectory of past states, actions, and returns-to-go (RTG), and autoregressively predicts future actions. It treats achieving a goal as satisfying a conditioning signal.
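
As a concrete illustration of this sequence-modeling view (a schematic sketch, not Chen et al.'s implementation), a trajectory is flattened into interleaved (return-to-go, state, action) tokens:

```python
# Schematic sketch of the Decision Transformer input layout (illustrative,
# not Chen et al.'s exact code): each timestep contributes three tokens,
# (return-to-go, state, action), and the model autoregressively predicts
# each action token from everything that precedes it.
trajectory = [
    (9.0, "s0", "a0"),  # RTG starts at the total desired return
    (8.2, "s1", "a1"),  # RTG shrinks by the reward collected so far
    (7.5, "s2", "a2"),
]

def interleave(traj):
    """Flatten (rtg, state, action) triples into one token sequence."""
    tokens = []
    for rtg, s, a in traj:
        tokens.extend([("rtg", rtg), ("state", s), ("action", a)])
    return tokens

seq = interleave(trajectory)
# At inference, conditioning on a high initial RTG steers the model
# toward actions consistent with achieving that return.
```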

While exploring the original DT architecture, I realized its latent potential for alignment. The RTG token is a scalar reward summary. What if we could decompose this into a vector of human-aligned values? Instead of predicting actions to maximize a single number, the model could learn to generate trajectories that satisfy multiple, potentially competing, ethical and mission constraints.

Core Conceptual Shift:

  • Standard DT: (State, Action, Return-to-Go) → Next Action
  • Human-Aligned DT: (State, Action, *Value-to-Go-Vector*) → Next Action

The Value-to-Go-Vector (VtGV) is the key. For a planetary geology mission, it might include components like:

  • V_science: Expected future scientific knowledge gain.
  • V_heritage: Preservation state of the geological site.
  • V_resources: Energy and time budget remaining.
  • V_safety: Risk to the platform and future missions.
  • V_cooperation: Adherence to planetary protection protocols (e.g., COSPAR guidelines).
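
As a minimal sketch, the VtGV can be a fixed-order tensor. The component ordering below follows the simulator and audit-logger code later in this article; the `make_vtgv` helper itself is illustrative, not part of a library:

```python
import torch

# Canonical component ordering for the Value-to-Go-Vector in this article;
# any code indexing into the vector should use these names.
VTGV_COMPONENTS = ("science", "heritage", "resources", "safety", "cooperation")

def make_vtgv(science=0.0, heritage=0.0, resources=0.0, safety=0.0, cooperation=0.0):
    """Build a (5,) value-to-go tensor in the canonical component order."""
    return torch.tensor([science, heritage, resources, safety, cooperation])

vtgv = make_vtgv(science=0.8, heritage=0.95, resources=0.4)
heritage = vtgv[VTGV_COMPONENTS.index("heritage")]  # heritage component, ~0.95
```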

During my experimentation with transformer architectures, I found that using a multi-head attention mechanism over this value vector, in addition to the state sequence, allowed the model to learn complex trade-offs. One head might attend to the tension between V_science and V_heritage when approaching a fragile formation.

Implementation Details: Architecting the Ethical Latent Space

Let's dive into the practical implementation. The system is built in PyTorch and centers on the HumanAlignedDecisionTransformer module.
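
Before the individual components, here is a hedged skeleton of that module's overall shape. The layer sizes, the use of `nn.TransformerEncoder`, and the token wiring are my assumptions for illustration; the real module interleaves state, value, and action tokens per timestep as described below.

```python
import torch
import torch.nn as nn

class HumanAlignedDecisionTransformer(nn.Module):
    """Skeleton sketch: embeds states/actions, consumes pre-built value
    tokens, and predicts the next action from the transformer output."""
    def __init__(self, state_dim=12, action_dim=4, token_dim=128,
                 num_layers=4, num_heads=4):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, token_dim)
        self.action_embed = nn.Linear(action_dim, token_dim)
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.action_head = nn.Linear(token_dim, action_dim)

    def forward(self, states, actions, value_tokens, attn_mask=None):
        # The per-timestep interleaving of state/value/action tokens is
        # elided here; plain concatenation just shows the data flow.
        seq = torch.cat([self.state_embed(states), value_tokens,
                         self.action_embed(actions)], dim=1)
        h = self.transformer(seq, mask=attn_mask)
        return self.action_head(h[:, -1])  # next-action prediction

model = HumanAlignedDecisionTransformer()
out = model(torch.randn(2, 5, 12), torch.randn(2, 5, 4), torch.randn(2, 15, 128))
# out has shape (batch, action_dim)
```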

1. Value Tokenization and Embedding

The first challenge was representing the continuous, multi-dimensional value vector in the token sequence. I discovered through trial and error that simple concatenation with state embeddings led to poor attention dispersion. The solution was a dedicated value projector that creates "value tokens."

import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueTokenizer(nn.Module):
    """Projects a vector of value assessments into a sequence of tokens for the transformer."""
    def __init__(self, value_dim=5, token_dim=128, num_value_tokens=3):
        super().__init__()
        self.num_value_tokens = num_value_tokens
        # Separate learned projections for each value token
        self.projectors = nn.ModuleList([
            nn.Linear(value_dim, token_dim) for _ in range(num_value_tokens)
        ])
        # Learnable token type embeddings (like [CLS], [SEP] in NLP)
        self.token_type_embeddings = nn.Embedding(num_value_tokens, token_dim)

    def forward(self, value_vector):  # shape: (batch, seq_len, value_dim)
        tokens = []
        for i in range(self.num_value_tokens):
            # Each token gets a different projection of the full value vector
            proj = self.projectors[i](value_vector)  # (batch, seq_len, token_dim)
            # Add a token-type specific bias
            token_type = torch.full((value_vector.shape[0], value_vector.shape[1]), i,
                                    dtype=torch.long, device=value_vector.device)
            type_emb = self.token_type_embeddings(token_type)  # (batch, seq_len, token_dim)
            tokens.append(proj + type_emb)
        # Stack along token dimension: (batch, seq_len, num_value_tokens, token_dim)
        # Then flatten to (batch, seq_len * num_value_tokens, token_dim)
        return torch.stack(tokens, dim=2).flatten(start_dim=1, end_dim=2)
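
The stacking-and-flattening at the end of `forward` deserves a sanity check. A small self-contained sketch of the same shape manipulation shows that the value tokens end up interleaved per timestep:

```python
import torch

# Reproduce the shape logic from ValueTokenizer.forward: stack
# num_value_tokens projections at dim=2, then flatten dims 1-2 so the
# value tokens are interleaved per timestep.
batch, seq_len, num_value_tokens, token_dim = 4, 10, 3, 128
per_token = [torch.randn(batch, seq_len, token_dim) for _ in range(num_value_tokens)]
flat = torch.stack(per_token, dim=2).flatten(start_dim=1, end_dim=2)

# Resulting order per timestep t: [t*v0, t*v1, t*v2, (t+1)*v0, ...]
assert flat.shape == (4, 30, 128)
assert torch.equal(flat[:, 0], per_token[0][:, 0])
assert torch.equal(flat[:, 1], per_token[1][:, 0])
```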

2. The HADT Architecture with Ethical Attention Masks

The transformer block needed modification to handle the interleaved sequence of state, action, and value tokens. My research into attention mechanisms led me to implement a custom attention mask that enforces causal relationships within token types but allows full attention from action tokens to preceding value tokens. This lets the action prediction be directly conditioned on the ethical constraints.

class EthicalAttentionMask:
    """Generates attention masks that enforce causal structure while allowing value conditioning."""

    @staticmethod
    def create_mission_aware_mask(seq_len, state_len, action_len, value_len_per_step):
        """
        seq_len: total sequence length (states + actions + values)
        state_len: tokens per state (e.g., 1)
        action_len: tokens per action (e.g., 1)
        value_len_per_step: value tokens per timestep (e.g., 3)
        """
        total_tokens_per_step = state_len + action_len + value_len_per_step
        num_steps = seq_len // total_tokens_per_step

        mask = torch.full((seq_len, seq_len), float('-inf'))

        for t in range(num_steps):
            # Per-step token layout: [state tokens | value tokens | action]
            base_idx = t * total_tokens_per_step
            value_start = base_idx + state_len
            value_end = value_start + value_len_per_step
            action_idx = value_end

            # 1. State tokens attend causally to everything up to and
            #    including themselves
            for s in range(state_len):
                s_idx = base_idx + s
                mask[s_idx, :s_idx + 1] = 0

            # 2. Value tokens attend to the current state, the full history
            #    (which already includes previous steps' value tokens,
            #    allowing value refinement), and themselves
            for v in range(value_len_per_step):
                v_idx = value_start + v
                mask[v_idx, :value_start] = 0
                mask[v_idx, v_idx] = 0

            # 3. The action token attends to EVERYTHING up to and including
            #    itself, crucially the current value tokens. This is the key
            #    alignment mechanism.
            mask[action_idx, :action_idx + 1] = 0

        return mask

# Usage in transformer block:
# attn_scores = query @ key.transpose(-2, -1) / sqrt(dim)
# attn_scores = attn_scores + attention_mask  # mask applied
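
The usage comment above can be expanded into a minimal, self-contained sketch of how an additive `-inf` mask behaves: masked positions receive exactly zero attention weight after the softmax.

```python
import torch
import torch.nn.functional as F

# Additive masking as in the usage comment above: -inf entries become
# exactly zero attention weight after softmax.
seq_len, dim = 5, 8
q, k = torch.randn(seq_len, dim), torch.randn(seq_len, dim)

mask = torch.full((seq_len, seq_len), float('-inf'))
mask[torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))] = 0.0  # causal

scores = q @ k.transpose(-2, -1) / dim ** 0.5
weights = F.softmax(scores + mask, dim=-1)

assert torch.all(weights[0, 1:] == 0)                        # future masked out
assert torch.allclose(weights.sum(dim=-1), torch.ones(seq_len))
```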

3. Training with Multi-Objective Reward Shaping

Training the HADT requires generating trajectories labeled with our value vector at each step. This is where synthetic environment simulation and reward shaping become critical. In my experimentation, I built a simplified planetary simulator in PyGame and later in NVIDIA Isaac Sim to generate training data.

class PlanetarySurveySimulator:
    """A simplified simulator for generating (state, action, value) trajectories."""

    def __init__(self, terrain_map):
        self.terrain = terrain_map  # Grid with features: rock, sand, fragile_crust, etc.
        self.science_values = self._assign_science_scores()
        self.heritage_scores = self._assign_heritage_scores()  # High for fragile features

    def step(self, state, action):
        """Returns next_state, reward_vector"""
        new_position = self._apply_action(state.position, action)

        # Calculate multi-objective reward components
        science_gain = self._assess_science_gain(new_position)
        heritage_damage = self._assess_heritage_damage(state.position, new_position)
        energy_cost = self._action_energy_cost(action)

        # The value vector components (normalized)
        reward_vector = torch.tensor([
            science_gain,               # V_science
            -heritage_damage,           # V_heritage (preservation)
            -energy_cost,               # V_resources
            self._safety_score(new_position),  # V_safety
            0.0                         # V_cooperation (context-dependent)
        ])

        new_state = state._replace(position=new_position)  # assumes a NamedTuple-style state record
        return new_state, reward_vector

    def generate_trajectory(self, policy, max_steps=100):
        """Roll out a trajectory, recording states, actions, and value vectors."""
        states, actions, value_vectors = [], [], []
        state = self.reset()

        for t in range(max_steps):
            action = policy(state)
            next_state, reward_vec = self.step(state, action)

            states.append(state)
            actions.append(action)
            value_vectors.append(reward_vec)  # This is the realized value

            state = next_state

        # Convert to returns-to-go (value-to-go) for DT training
        # We compute the cumulative future value for each component
        value_to_go = self._compute_value_to_go(value_vectors)

        return {
            'states': torch.stack(states),
            'actions': torch.stack(actions),
            'value_to_go': value_to_go  # Shape: (seq_len, value_dim)
        }
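
`_compute_value_to_go` is referenced above but not shown. Here is a minimal sketch of one plausible implementation, assuming undiscounted per-component suffix sums (a discount factor could be folded in):

```python
import torch

def compute_value_to_go(value_vectors):
    """Value-to-go at step t = sum of realized value vectors from t to the
    end of the trajectory, computed independently per component."""
    rewards = torch.stack(value_vectors)  # (seq_len, value_dim)
    # Reverse, cumulative-sum, reverse again -> per-component suffix sums
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

vecs = [torch.tensor([1.0, 0.0]),
        torch.tensor([2.0, -1.0]),
        torch.tensor([3.0, 0.5])]
vtg = compute_value_to_go(vecs)
# vtg[0] sums all steps: [6.0, -0.5]; vtg[-1] is just the last step: [3.0, 0.5]
```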

4. The Auditability Layer: Tracing Decisions to Values

The "baked-in" auditability is perhaps the most crucial innovation. During my investigation of explainable AI (XAI) for autonomous systems, I found that post-hoc explanations were insufficient for high-stakes environments. We need real-time, action-level justification.

The HADT architecture naturally provides this through attention weights. When the model predicts an action, we can trace which value tokens received the highest attention.

import time

class EthicalAuditLogger:
    """Logs the ethical reasoning behind each action prediction."""

    def __init__(self, hadt_model):
        self.model = hadt_model
        self.audit_trail = []

    def predict_with_audit(self, state_seq, value_to_go_seq):
        """Runs inference and captures attention patterns for auditing."""
        with torch.no_grad():
            # Forward pass with attention capture hooks
            attention_maps = []

            def hook_fn(module, inp, out):
                # out[0] contains attention weights in some architectures
                # Shape: (batch, heads, target_seq, source_seq)
                attention_maps.append(out[0].cpu())

            hooks = []
            for block in self.model.transformer_blocks:
                hooks.append(block.attn.register_forward_hook(hook_fn))

            action_pred = self.model(state_seq, value_to_go_seq)

            # Remove hooks
            for h in hooks:
                h.remove()

            # Analyze the final action token's attention to value tokens
            final_action_token_idx = -1  # Assuming action is last token in step
            value_attention = attention_maps[-1][0, :, final_action_token_idx, :]

            # Value tokens are at specific positions in the sequence
            audit_entry = {
                'predicted_action': action_pred[-1],
                'value_attention_weights': self._extract_value_attention(value_attention),
                'timestamp': time.time()
            }
            self.audit_trail.append(audit_entry)

            return action_pred, audit_entry

    def generate_audit_report(self, step_range=None):
        """Produces a human-readable justification for decisions."""
        report = []
        entries = self.audit_trail if step_range is None else self.audit_trail[step_range]
        for entry in entries:
            # Which value dimension was most influential?
            value_names = ['Science', 'Heritage', 'Resources', 'Safety', 'Cooperation']
            attn_weights = entry['value_attention_weights']
            primary_value = value_names[torch.argmax(attn_weights).item()]

            report.append(
                f"Timestamp: {entry['timestamp']}\n"
                f"Action: {entry['predicted_action']}\n"
                f"Primary Ethical Driver: {primary_value}\n"
                f"Attention Distribution: {dict(zip(value_names, attn_weights.tolist()))}\n"
                f"---"
            )
        return "\n".join(report)
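
`_extract_value_attention` is referenced above but never defined. One plausible sketch, under the assumption that each value dimension maps to one value-token position (so the weights line up with the five value names in `generate_audit_report`):

```python
import torch

def extract_value_attention(action_attention, value_token_positions):
    """Average the action token's attention over heads, select the
    positions occupied by value tokens, and renormalize to sum to 1.
    Assumes one value token per value dimension."""
    per_head = action_attention[:, value_token_positions]  # (heads, num_values)
    weights = per_head.mean(dim=0)                         # (num_values,)
    return weights / weights.sum()

attn = torch.rand(8, 20)           # 8 heads attending over a 20-token sequence
value_positions = [1, 2, 3, 4, 5]  # hypothetical value-token indices
w = extract_value_attention(attn, value_positions)
# w is a (5,) distribution over the value dimensions
```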

Real-World Applications: From Simulation to Regolith

The true test of this framework came when we integrated it with a physical rover platform in a Mars-analog test site. We used a Clearpath Husky rover equipped with a robotic arm, stereo cameras, and LiDAR. The HADT was deployed on an NVIDIA Jetson AGX Orin, running in a loop with our perception system.

Application 1: Adaptive Sampling Under Constraints

The primary mission: collect N rock samples from a designated area while minimizing irreversible disturbance. A traditional planner might create an optimal Traveling Salesman path between rock targets. Our HADT, however, demonstrated emergent "judgment."

Observed Behavior:

  • When approaching a clustered group of rocks on a fragile salt crust, the rover would sometimes skip the third rock in the cluster, despite it being mineralogically interesting (high V_science). The audit log showed high attention to V_heritage—it determined that maneuvering to the third rock would likely damage the crust, violating planetary protection guidelines.
  • When low on battery (V_resources dominant), it would prioritize closer, less scientifically valuable samples over distant high-value ones, but only if the heritage cost was acceptable. The attention weights showed a dynamic trade-off between the three competing values.

Application 2: Anomaly Response and Human-in-the-Loop Auditing

During one field test, the rover's spectrometer detected an unexpected methane spike. The pre-programmed contingency protocol was to immediately move to a downwind location and re-sample. However, the current location was adjacent to a unique, layered bedrock outcrop.

The HADT's response, as revealed by the audit trail:

  1. Initial high attention to V_cooperation (follow protocol) and V_science (anomaly confirmation).
  2. Attention shift to V_heritage as it calculated a path that would avoid the outcrop.
  3. Compromise action generated: It moved downwind but chose a longer, circumnavigating path that preserved the outcrop, accepting a 15% higher energy cost (visible in reduced V_resources attention).

This decision was flagged in the mission control dashboard. The human geologist could review the attention-based justification and either approve or override. In this case, they approved, commenting that the AI's "caution" mirrored their own instinct.

Challenges and Solutions: The Rough Terrain of Alignment

The development path was not smooth. Several significant challenges emerged during my experimentation.

Challenge 1: Value Conflict and Quantification

How do you quantify "scientific heritage" or "planetary protection" in a reward signal? Early attempts used simple penalties for driving over "fragile" tagged areas. This led to overly conservative rovers that would freeze when surrounded by interesting but delicate terrain.

Solution: I implemented a contextual value network that learned to predict a continuous heritage score from visual input, trained on human ratings of simulated rover actions. This provided a more nuanced gradient of "damage" rather than a binary penalty.


import timm  # pretrained vision backbones (assumed available)

class ContextualValuePredictor(nn.Module):
    """Predicts heritage impact score from visual scene and proposed action."""
    def __init__(self, vision_backbone='resnet18'):
        super().__init__()
        self.vision_encoder = timm.create_model(vision_backbone, pretrained=True, num_classes=0)
        visual_feat_dim = self.vision_encoder.num_features

        self.action_encoder = nn.Linear(4, 32)  # x, y, yaw, speed
        self.fusion = nn.Sequential(
            nn.Linear(visual_feat_dim + 32, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # Scalar heritage impact score
            nn.Sigmoid()  # Normalized to [0,1]
        )

    def forward(self, image, proposed_action):
        visual_features = self.vision_encoder(image)
        action_features = self.action_encoder(proposed_action)
        combined = torch.cat([visual_features, action_features], dim=-1)
        return self.fusion(combined)
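
To ground the "trained on human ratings" step above, here is a hedged sketch of one supervised training iteration. The stand-in model, batch format, and MSE loss are assumptions for illustration, not the project's actual pipeline; the real model is the ContextualValuePredictor with its vision backbone.

```python
import torch
import torch.nn as nn

class TinyHeritagePredictor(nn.Module):
    """Stand-in with the same (features, action) -> [0,1] score interface
    as ContextualValuePredictor; the vision backbone is omitted so the
    sketch stays self-contained."""
    def __init__(self, feat_dim=16, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + action_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, feats, action):
        return self.net(torch.cat([feats, action], dim=-1))

model = TinyHeritagePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One batch of (visual features, proposed action, human damage rating in [0,1])
feats, actions = torch.randn(8, 16), torch.randn(8, 4)
ratings = torch.rand(8, 1)

pred = model(feats, actions)   # predicted heritage impact scores
loss = loss_fn(pred, ratings)  # regress against human ratings
optimizer.zero_grad()
loss.backward()
optimizer.step()
```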
