Suhas Mallesh

SageMaker Feature Store with Terraform: Centralized ML Features for Training and Inference 🗃️

Features used for training must match features used for inference, or your model breaks silently. SageMaker Feature Store keeps them in sync with online (real-time) and offline (historical) stores. Here's how to provision it with Terraform.

In the previous posts, we set up the workspace and deployed endpoints. But there's a critical gap: features. Every ML model needs consistent, reliable feature data for both training (batch, historical) and inference (real-time, latest values). When training features and serving features diverge, you get training-serving skew, and your model's accuracy degrades silently.

SageMaker Feature Store solves this with a dual-store architecture. The online store provides low-latency access to the latest feature values for real-time inference. The offline store keeps the full history in S3 (Parquet format) for training and batch inference. When you write a record, Feature Store updates the online store right away and replicates the same record to the offline store asynchronously, so both stores are fed from a single write path. One source of truth. 🎯

πŸ—οΈ Feature Store Architecture

| Component | What It Does |
| --- | --- |
| Feature Group | A collection of related features (like a table) |
| Online Store | Low-latency key-value store for real-time lookups |
| Offline Store | Historical data in S3 (Parquet) for training |
| Record Identifier | Primary key for feature lookups |
| Event Time | Timestamp for point-in-time correctness |
| Glue Data Catalog | Auto-created metadata catalog for Athena queries |

The online store always holds the latest snapshot. The offline store is append-only, keeping every version of every record. This enables point-in-time queries for training: "What did this customer's features look like 30 days ago?"
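To make the split concrete, here is a minimal in-memory sketch of the two stores. It is a toy model, not the AWS API: the `ToyFeatureStore` class and its `put` and `as_of` methods are hypothetical names used only to illustrate "latest snapshot" versus "append-only history".

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    """In-memory stand-in for the online/offline split (not the AWS API)."""
    online: dict = field(default_factory=dict)   # record id -> latest record
    offline: list = field(default_factory=list)  # every record ever written

    def put(self, record_id: str, event_time: float, values: dict) -> None:
        record = {"id": record_id, "event_time": event_time, **values}
        # Offline store: append-only, so history is never overwritten.
        self.offline.append(record)
        # Online store: keep only the record with the newest event time.
        current = self.online.get(record_id)
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = record

    def as_of(self, record_id: str, cutoff: float):
        """Return the newest offline record at or before `cutoff`, or None."""
        candidates = [
            r for r in self.offline
            if r["id"] == record_id and r["event_time"] <= cutoff
        ]
        return max(candidates, key=lambda r: r["event_time"], default=None)


store = ToyFeatureStore()
store.put("cust-12345", 100.0, {"total_purchases": 40})
store.put("cust-12345", 200.0, {"total_purchases": 47})

latest = store.online["cust-12345"]["total_purchases"]
historical = store.as_of("cust-12345", 150.0)["total_purchases"]
print(latest, historical)
```

The online lookup sees only the newest value, while the `as_of` query can still recover what the features looked like at an earlier cutoff.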

🔧 Terraform: Create Feature Groups

IAM Role

# feature_store/iam.tf

resource "aws_iam_role" "feature_store" {
  name = "${var.environment}-feature-store"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "feature_store_access" {
  name = "feature-store-s3-glue"
  role = aws_iam_role.feature_store.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
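        # GetBucketAcl and PutObjectAcl mirror the AWS-managed Feature Store access policy
        Action = [
          "s3:GetObject", "s3:PutObject", "s3:DeleteObject",
          "s3:GetBucketLocation", "s3:GetBucketAcl", "s3:PutObjectAcl"
        ]
        Resource = [
          var.offline_store_bucket_arn,
          "${var.offline_store_bucket_arn}/*"
        ]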
      },
      {
        Effect = "Allow"
        Action = [
          "glue:CreateTable", "glue:UpdateTable", "glue:GetTable",
          "glue:GetDatabase", "glue:CreateDatabase"
        ]
        Resource = "*"
      }
    ]
  })
}

Feature Group Definition

# feature_store/feature_groups.tf

resource "aws_sagemaker_feature_group" "customer_features" {
  feature_group_name             = "${var.environment}-customer-features"
  record_identifier_feature_name = "customer_id"
  event_time_feature_name        = "event_time"
  role_arn                       = aws_iam_role.feature_store.arn

  # Feature schema
  feature_definition {
    feature_name = "customer_id"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "event_time"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "total_purchases"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "avg_order_value"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "days_since_last_purchase"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "account_age_days"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "is_premium"
    feature_type = "Integral"
  }

  # Enable both online and offline stores
  online_store_config {
    enable_online_store = true

    security_config {
      kms_key_id = var.kms_key_arn
    }
  }

  offline_store_config {
    s3_storage_config {
      s3_uri     = "s3://${var.offline_store_bucket}/${var.environment}/feature-store"
      kms_key_id = var.kms_key_arn
    }

    table_format = var.offline_table_format  # "Glue" or "Iceberg"
  }

  tags = {
    Environment = var.environment
    Domain      = "customer"
  }
}

Three feature types: String, Fractional (float), and Integral (integer). Everything else maps to String.
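When deriving a schema from sample data, a small helper can pick the feature type for each value. This is only a sketch: `infer_feature_type` is a hypothetical helper, and the rule that booleans become Integral simply follows the `is_premium` convention used in this post.

```python
def infer_feature_type(value) -> str:
    """Map a Python value to a Feature Store feature type name."""
    # bool is a subclass of int, so True/False land in Integral (stored as 1/0).
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    # Everything else (text, categories, IDs) falls back to String.
    return "String"


sample = {
    "customer_id": "cust-12345",
    "total_purchases": 47,
    "avg_order_value": 89.5,
    "is_premium": True,
}
schema = {name: infer_feature_type(value) for name, value in sample.items()}
print(schema)
```

The resulting names and types are what you would then write out as `feature_definition` blocks.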

table_format: Choose Glue (default, Hive-compatible) or Iceberg. Iceberg adds ACID transactions, time travel, and file compaction, which generally helps Athena query performance as a feature group grows, so it is a reasonable choice for production workloads.

Multiple Feature Groups

resource "aws_sagemaker_feature_group" "transaction_features" {
  feature_group_name             = "${var.environment}-transaction-features"
  record_identifier_feature_name = "transaction_id"
  event_time_feature_name        = "event_time"
  role_arn                       = aws_iam_role.feature_store.arn

  feature_definition {
    feature_name = "transaction_id"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "event_time"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "amount"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "merchant_category"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "is_international"
    feature_type = "Integral"
  }

  online_store_config {
    enable_online_store = true
  }

  offline_store_config {
    s3_storage_config {
      s3_uri = "s3://${var.offline_store_bucket}/${var.environment}/feature-store"
    }
    table_format = var.offline_table_format
  }
}

🐍 Ingest Features (SDK)

Terraform defines the schema. The SDK ingests the data:

import boto3
import time

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Write a single record (real-time)
featurestore_runtime.put_record(
    FeatureGroupName="prod-customer-features",
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "cust-12345"},
        {"FeatureName": "event_time", "ValueAsString": str(time.time())},
        {"FeatureName": "total_purchases", "ValueAsString": "47"},
        {"FeatureName": "avg_order_value", "ValueAsString": "89.50"},
        {"FeatureName": "days_since_last_purchase", "ValueAsString": "3"},
        {"FeatureName": "account_age_days", "ValueAsString": "730"},
        {"FeatureName": "is_premium", "ValueAsString": "1"},
    ],
)
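Writing the Record list by hand gets tedious, so in practice you might build it from a plain dict. The sketch below assumes a hypothetical `to_record` helper; it only produces the same list-of-dicts shape that the `put_record` call above uses and does not call AWS.

```python
import time


def to_record(features: dict, event_time=None) -> list:
    """Convert a plain dict into the Record list shape used by PutRecord."""
    values = dict(features)
    # Default to "now" in Unix seconds, matching the Fractional event_time feature.
    values["event_time"] = time.time() if event_time is None else event_time
    record = []
    for name, value in values.items():
        if value is None:
            continue  # omit missing features rather than send the string "None"
        if isinstance(value, bool):
            value = int(value)  # booleans are stored as Integral 1/0
        record.append({"FeatureName": name, "ValueAsString": str(value)})
    return record


record = to_record(
    {"customer_id": "cust-12345", "total_purchases": 47, "is_premium": True},
    event_time=1700000000.0,
)
print(record)
```

The result can be passed as `Record=record` to `featurestore_runtime.put_record`.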

Read Features for Inference (Online Store)

# Real-time feature lookup (single-digit ms latency)
response = featurestore_runtime.get_record(
    FeatureGroupName="prod-customer-features",
    RecordIdentifierValueAsString="cust-12345",
)

features = {r["FeatureName"]: r["ValueAsString"] for r in response["Record"]}
print(features)
# {'customer_id': 'cust-12345', 'total_purchases': '47', ...}
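Every value comes back as a string, so the model input usually needs casting first. Here is a minimal sketch, assuming a hypothetical `parse_record` helper and a hand-written `schema` dict that mirrors the feature definitions. The `.get("Record", [])` guard is an assumption about how a missing record is returned, so verify it against your own responses.

```python
CASTS = {"Integral": int, "Fractional": float, "String": str}


def parse_record(response: dict, schema: dict) -> dict:
    """Turn a GetRecord-style response into a dict of typed Python values."""
    parsed = {}
    # Use .get() so a response without a "Record" key yields an empty dict.
    for item in response.get("Record", []):
        name = item["FeatureName"]
        cast = CASTS[schema.get(name, "String")]  # unknown features stay strings
        parsed[name] = cast(item["ValueAsString"])
    return parsed


schema = {
    "customer_id": "String",
    "total_purchases": "Integral",
    "avg_order_value": "Fractional",
}
sample_response = {
    "Record": [
        {"FeatureName": "customer_id", "ValueAsString": "cust-12345"},
        {"FeatureName": "total_purchases", "ValueAsString": "47"},
        {"FeatureName": "avg_order_value", "ValueAsString": "89.50"},
    ]
}
typed_features = parse_record(sample_response, schema)
print(typed_features)
```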

Query Features for Training (Offline Store via Athena)

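import boto3

athena = boto3.client("athena")
sagemaker = boto3.client("sagemaker")

# Feature Store generates the Glue table name, so look it up instead of guessing.
catalog = sagemaker.describe_feature_group(
    FeatureGroupName="prod-customer-features"
)["OfflineStoreConfig"]["DataCatalogConfig"]

# Latest non-deleted row per customer as of the cutoff (event_time is Unix seconds).
query = f"""
SELECT customer_id, total_purchases, avg_order_value, is_premium
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY event_time DESC, write_time DESC
  ) AS rn
  FROM "{catalog['Database']}"."{catalog['TableName']}"
  WHERE event_time <= 1700000000
) AS ranked
WHERE rn = 1 AND NOT is_deleted
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": catalog["Database"]},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)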

The offline store is automatically cataloged in Glue. Query with Athena for point-in-time training datasets.
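`start_query_execution` returns immediately, so the query still has to finish before the results can be read. A minimal polling sketch follows. `wait_for_query` is a hypothetical helper, and the interval and timeout defaults are arbitrary assumptions; it relies only on Athena's `get_query_execution` call.

```python
import time


def wait_for_query(athena, query_id: str, poll_seconds: float = 2.0,
                   timeout_seconds: float = 300.0) -> str:
    """Poll Athena until the query reaches a terminal state, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_id} still {state} after timeout")
        time.sleep(poll_seconds)
```

With the client and response from the query above, usage would look like `wait_for_query(athena, response["QueryExecutionId"])`. Treat any state other than `SUCCEEDED` as a failed dataset build.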

πŸ“ Environment Configuration

# environments/dev.tfvars
environment           = "dev"
offline_table_format  = "Glue"     # Simpler for dev
kms_key_arn           = null        # No customer-managed key in dev (default encryption still applies)

# environments/prod.tfvars
environment           = "prod"
offline_table_format  = "Iceberg"  # ACID transactions, time travel
kms_key_arn           = "arn:aws:kms:us-east-1:123456789012:key/abc-123"

⚠️ Gotchas and Tips

Offline store has a ~15 minute delay. Data written via PutRecord appears in the online store immediately but takes up to 15 minutes to land in the offline store (S3). Don't rely on the offline store for near-real-time analytics.

Feature group schemas are additive only. You can add new features to an existing feature group using the UpdateFeatureGroup API, but you cannot remove or rename existing features.
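As a rough sketch of what an addition looks like from the SDK side, the helper below builds the `FeatureAdditions` payload and rejects unknown types. `feature_additions` and the `lifetime_value` feature are hypothetical names for illustration. Note that adding features outside Terraform will likely show up as drift against the `feature_definition` blocks.

```python
ALLOWED_TYPES = {"String", "Fractional", "Integral"}


def feature_additions(new_features: dict) -> list:
    """Build the FeatureAdditions list for UpdateFeatureGroup from name -> type."""
    additions = []
    for name, feature_type in new_features.items():
        if feature_type not in ALLOWED_TYPES:
            raise ValueError(f"Unsupported feature type for {name}: {feature_type}")
        additions.append({"FeatureName": name, "FeatureType": feature_type})
    return additions


additions = feature_additions({"lifetime_value": "Fractional"})
print(additions)

# With AWS credentials configured, the call would look like:
# boto3.client("sagemaker").update_feature_group(
#     FeatureGroupName="prod-customer-features",
#     FeatureAdditions=additions,
# )
```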

Online store costs. The online store charges per read and write unit. High-throughput inference with thousands of feature lookups per second adds up. Monitor costs and batch lookups where possible using BatchGetRecord.
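One way to batch lookups is to group record IDs into chunks and send one BatchGetRecord call per chunk. The sketch below uses a hypothetical `batch_identifiers` helper. The default chunk size of 100 is an assumption about the per-request limit, so check it against the current API reference.

```python
def batch_identifiers(feature_group: str, record_ids: list,
                      chunk_size: int = 100) -> list:
    """Split record IDs into Identifiers payloads, one per BatchGetRecord call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        [{
            "FeatureGroupName": feature_group,
            "RecordIdentifiersValueAsString": record_ids[start:start + chunk_size],
        }]
        for start in range(0, len(record_ids), chunk_size)
    ]


customer_ids = [f"cust-{n}" for n in range(250)]
batches = batch_identifiers("prod-customer-features", customer_ids)
print(len(batches), [len(b[0]["RecordIdentifiersValueAsString"]) for b in batches])

# Each payload would then be sent as:
# featurestore_runtime.batch_get_record(Identifiers=batch)
```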

Point-in-time correctness. Always use event_time for training queries. Querying without time filtering risks data leakage, where future data appears in your training set.
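To show what the time filter protects against, here is a small pure-Python sketch of a point-in-time join. For each label it picks the newest feature row at or before the label's timestamp. The `point_in_time_join` function, the `label_time` and `churned` fields, and the sample rows are all made up for illustration.

```python
from bisect import bisect_right


def point_in_time_join(labels: list, feature_history: list) -> list:
    """Attach to each label the newest feature row at or before its timestamp."""
    by_customer = {}
    for row in sorted(feature_history, key=lambda r: r["event_time"]):
        by_customer.setdefault(row["customer_id"], []).append(row)

    joined = []
    for label in labels:
        rows = by_customer.get(label["customer_id"], [])
        times = [r["event_time"] for r in rows]
        idx = bisect_right(times, label["label_time"])
        if idx == 0:
            continue  # no features existed yet; skip instead of leaking future data
        joined.append({**rows[idx - 1], **label})
    return joined


history = [
    {"customer_id": "cust-1", "event_time": 100, "total_purchases": 40},
    {"customer_id": "cust-1", "event_time": 200, "total_purchases": 47},
]
labels = [
    {"customer_id": "cust-1", "label_time": 150, "churned": 0},
    {"customer_id": "cust-1", "label_time": 50, "churned": 1},
]
training_rows = point_in_time_join(labels, history)
print(training_rows)
```

The label at time 150 gets the value written at time 100, not the later one, and the label at time 50 is dropped because no features existed yet.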

KMS encryption for both stores. The online and offline stores support separate KMS keys. In production, encrypt both. The Glue Data Catalog metadata is not encrypted by Feature Store, so manage its encryption separately.

Schema planning matters. Feature types (String, Fractional, Integral) cannot be changed after creation. Plan your schema carefully. Use String for anything you're unsure about, since it's the most flexible.

⏭️ What's Next

This is Post 3 of the ML Pipelines & MLOps with Terraform series.


Your features have a home. Online store for real-time inference, offline store for training, automatic sync between them. No more training-serving skew. No more duplicated feature pipelines. One source of truth, all in Terraform. 🗃️

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 💬
