Features used for training must match features used for inference, or your model breaks silently. SageMaker Feature Store keeps them in sync with online (real-time) and offline (historical) stores. Here's how to provision it with Terraform.
In the previous posts, we set up the workspace and deployed endpoints. But there's a critical gap: features. Every ML model needs consistent, reliable feature data for both training (batch, historical) and inference (real-time, latest values). When training features and serving features diverge, you get training-serving skew, and your model's accuracy degrades silently.
SageMaker Feature Store solves this with a dual-store architecture. The online store provides low-latency access to the latest feature values for real-time inference. The offline store keeps the full history in S3 (Parquet format) for training and batch inference. When you write a feature, both stores sync automatically. One source of truth.
Feature Store Architecture
| Component | What It Does |
|---|---|
| Feature Group | A collection of related features (like a table) |
| Online Store | Low-latency key-value store for real-time lookups |
| Offline Store | Historical data in S3 (Parquet) for training |
| Record Identifier | Primary key for feature lookups |
| Event Time | Timestamp for point-in-time correctness |
| Glue Data Catalog | Auto-created metadata catalog for Athena queries |
The online store always holds the latest snapshot. The offline store is append-only, keeping every version of every record. This enables point-in-time queries for training: "What did this customer's features look like 30 days ago?"
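To make the latest-snapshot versus append-only distinction concrete, here is a minimal in-memory sketch. `ToyFeatureStore` and its methods are hypothetical names invented for illustration; this is not the SageMaker API, only a model of the behavior described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class ToyFeatureStore:
    """In-memory stand-in for the two stores (illustration only)."""
    online: Dict[str, Record] = field(default_factory=dict)
    offline: List[Record] = field(default_factory=list)

    def put(self, record: Record) -> None:
        # Offline store: append every write, never overwrite.
        self.offline.append(dict(record))
        # Online store: keep only the record with the newest event_time.
        key = record["customer_id"]
        current = self.online.get(key)
        if current is None or record["event_time"] >= current["event_time"]:
            self.online[key] = dict(record)

    def latest(self, customer_id: str) -> Optional[Record]:
        """Online-style lookup: the current snapshot for one key."""
        return self.online.get(customer_id)

    def as_of(self, customer_id: str, cutoff: float) -> Optional[Record]:
        """Offline-style lookup: newest record at or before the cutoff."""
        candidates = [
            r for r in self.offline
            if r["customer_id"] == customer_id and r["event_time"] <= cutoff
        ]
        return max(candidates, key=lambda r: r["event_time"], default=None)


store = ToyFeatureStore()
store.put({"customer_id": "cust-1", "event_time": 100.0, "total_purchases": 40})
store.put({"customer_id": "cust-1", "event_time": 200.0, "total_purchases": 47})
```

Here `store.latest("cust-1")` sees only the second write, while `store.as_of("cust-1", 150.0)` can still recover the first one, which is the property training queries rely on.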
Terraform: Create Feature Groups
IAM Role
# feature_store/iam.tf
resource "aws_iam_role" "feature_store" {
  name = "${var.environment}-feature-store"

  # Allow the SageMaker service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "feature_store_access" {
  name = "feature-store-s3-glue"
  role = aws_iam_role.feature_store.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Offline store reads and writes in the S3 bucket.
        # AWS's managed AmazonSageMakerFeatureStoreAccess policy also grants
        # s3:GetBucketAcl and s3:PutObjectAcl; these may be needed too if
        # offline writes fail with AccessDenied.
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:GetBucketLocation"]
        Resource = [
          var.offline_store_bucket_arn,
          "${var.offline_store_bucket_arn}/*"
        ]
      },
      {
        # Glue Data Catalog access for the auto-created table
        Effect = "Allow"
        Action = [
          "glue:CreateTable", "glue:UpdateTable", "glue:GetTable",
          "glue:GetDatabase", "glue:CreateDatabase"
        ]
        Resource = "*"
      }
    ]
  })
}
Feature Group Definition
# feature_store/feature_groups.tf
resource "aws_sagemaker_feature_group" "customer_features" {
  feature_group_name             = "${var.environment}-customer-features"
  record_identifier_feature_name = "customer_id"
  event_time_feature_name        = "event_time"
  role_arn                       = aws_iam_role.feature_store.arn

  # Feature schema
  feature_definition {
    feature_name = "customer_id"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "event_time"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "total_purchases"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "avg_order_value"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "days_since_last_purchase"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "account_age_days"
    feature_type = "Integral"
  }

  feature_definition {
    feature_name = "is_premium"
    feature_type = "Integral"
  }

  # Enable both online and offline stores
  online_store_config {
    enable_online_store = true

    security_config {
      kms_key_id = var.kms_key_arn
    }
  }

  offline_store_config {
    s3_storage_config {
      s3_uri     = "s3://${var.offline_store_bucket}/${var.environment}/feature-store"
      kms_key_id = var.kms_key_arn
    }

    table_format = var.offline_table_format # "Glue" or "Iceberg"
  }

  tags = {
    Environment = var.environment
    Domain      = "customer"
  }
}
Three feature types: String, Fractional (float), and Integral (integer). Everything else maps to String.
table_format: Choose Glue (default, Hive-compatible) or Iceberg (better for upserts and time travel). Iceberg is recommended for production workloads that need ACID transactions.
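If you generate feature definitions from sample data, the three-type rule above can be expressed as a tiny helper. This is a sketch under stated assumptions: `feature_type_for` and `infer_schema` are hypothetical names, and the mapping only covers plain Python values.

```python
from typing import Any, Dict


def feature_type_for(value: Any) -> str:
    """Pick a Feature Store type name for a sample Python value."""
    # bool is a subclass of int in Python, so flags land in Integral,
    # matching how is_premium is modelled in the feature group above.
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    # Everything else (str, None, dates, lists, ...) falls back to String.
    return "String"


def infer_schema(sample: Dict[str, Any]) -> Dict[str, str]:
    """Map each feature name in a sample row to a feature type."""
    return {name: feature_type_for(value) for name, value in sample.items()}


sample = {
    "customer_id": "cust-12345",
    "total_purchases": 47,
    "avg_order_value": 89.5,
    "is_premium": True,
}
schema = infer_schema(sample)
```

Since feature types cannot be changed later, treat the inferred schema as a starting point to review by hand rather than something to apply blindly.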
Multiple Feature Groups
resource "aws_sagemaker_feature_group" "transaction_features" {
  feature_group_name             = "${var.environment}-transaction-features"
  record_identifier_feature_name = "transaction_id"
  event_time_feature_name        = "event_time"
  role_arn                       = aws_iam_role.feature_store.arn

  feature_definition {
    feature_name = "transaction_id"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "event_time"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "amount"
    feature_type = "Fractional"
  }

  feature_definition {
    feature_name = "merchant_category"
    feature_type = "String"
  }

  feature_definition {
    feature_name = "is_international"
    feature_type = "Integral"
  }

  online_store_config {
    enable_online_store = true
  }

  offline_store_config {
    s3_storage_config {
      s3_uri = "s3://${var.offline_store_bucket}/${var.environment}/feature-store"
    }

    table_format = var.offline_table_format
  }
}
Ingest Features (SDK)
Terraform defines the schema. The SDK ingests the data:
import boto3
import time

featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

# Write a single record (real-time)
featurestore_runtime.put_record(
    FeatureGroupName="prod-customer-features",
    Record=[
        {"FeatureName": "customer_id", "ValueAsString": "cust-12345"},
        {"FeatureName": "event_time", "ValueAsString": str(time.time())},
        {"FeatureName": "total_purchases", "ValueAsString": "47"},
        {"FeatureName": "avg_order_value", "ValueAsString": "89.50"},
        {"FeatureName": "days_since_last_purchase", "ValueAsString": "3"},
        {"FeatureName": "account_age_days", "ValueAsString": "730"},
        {"FeatureName": "is_premium", "ValueAsString": "1"},
    ],
)
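Since PutRecord takes every value as a string, it can help to build the `Record` list from a plain dict instead of writing it out by hand. A minimal sketch: `to_record` is a hypothetical helper, and skipping `None` values assumes that omitted features are simply treated as missing.

```python
import time
from typing import Any, Dict, List, Optional


def to_record(features: Dict[str, Any],
              event_time: Optional[float] = None) -> List[Dict[str, str]]:
    """Convert a feature dict into the Record list that put_record expects."""
    values = dict(features)
    # Stamp the record with an event time unless the caller already set one.
    if "event_time" not in values:
        values["event_time"] = time.time() if event_time is None else event_time

    record = []
    for name, value in values.items():
        if value is None:
            continue  # assumption: omit missing values instead of writing "None"
        if isinstance(value, bool):
            value = int(value)  # flags are stored as Integral 0/1
        record.append({"FeatureName": name, "ValueAsString": str(value)})
    return record


record = to_record(
    {"customer_id": "cust-12345", "total_purchases": 47,
     "is_premium": True, "avg_order_value": None},
    event_time=1700000000.0,
)
```

The result can be passed straight to `put_record(FeatureGroupName=..., Record=record)`.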
Read Features for Inference (Online Store)
# Real-time feature lookup (single-digit ms latency)
response = featurestore_runtime.get_record(
    FeatureGroupName="prod-customer-features",
    RecordIdentifierValueAsString="cust-12345",
)

# "Record" is missing from the response when the ID is not found,
# so default to an empty list instead of raising KeyError
features = {r["FeatureName"]: r["ValueAsString"] for r in response.get("Record", [])}
print(features)
# {'customer_id': 'cust-12345', 'total_purchases': '47', ...}
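As the output above shows, values come back as strings, so a model expecting numbers needs them cast first. One way to do that, sketched with a hypothetical `parse_record` helper and a hand-written schema dict that mirrors the Terraform feature definitions:

```python
from typing import Any, Dict, List

# Feature Store type name -> Python constructor
CASTS = {"Integral": int, "Fractional": float, "String": str}


def parse_record(record: List[Dict[str, str]],
                 schema: Dict[str, str]) -> Dict[str, Any]:
    """Cast the ValueAsString entries of a Record list back to typed values."""
    parsed = {}
    for item in record:
        name = item["FeatureName"]
        # Features missing from the schema are left as strings.
        cast = CASTS[schema.get(name, "String")]
        parsed[name] = cast(item["ValueAsString"])
    return parsed


schema = {
    "customer_id": "String",
    "total_purchases": "Integral",
    "avg_order_value": "Fractional",
}
sample_record = [
    {"FeatureName": "customer_id", "ValueAsString": "cust-12345"},
    {"FeatureName": "total_purchases", "ValueAsString": "47"},
    {"FeatureName": "avg_order_value", "ValueAsString": "89.50"},
]
typed = parse_record(sample_record, schema)
```

Keeping the schema dict next to the Terraform definitions (or generating one from the other) avoids the two drifting apart.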
Query Features for Training (Offline Store via Athena)
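import boto3

athena = boto3.client("athena")

# The table name below is a placeholder. Feature Store generates the Glue
# table name; read the real one from DescribeFeatureGroup
# (OfflineStoreConfig.DataCatalogConfig.TableName).
query = """
SELECT customer_id, total_purchases, avg_order_value, is_premium
FROM "sagemaker_featurestore"."prod-customer-features"
WHERE event_time <= 1700000000
"""

# start_query_execution is asynchronous: it returns a QueryExecutionId to poll
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sagemaker_featurestore"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)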
The offline store is automatically cataloged in Glue. Query with Athena for point-in-time training datasets.
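Because the offline store is append-only, a plain `event_time` filter returns every historical version of each customer. For a training set you usually want one row per customer: the newest version at or before the cutoff. A sketch of a query builder for that, assuming the offline table carries the `write_time`, `api_invocation_time`, and `is_deleted` metadata columns; `point_in_time_query` and the table name in the example are hypothetical.

```python
from typing import List


def point_in_time_query(table: str, key: str, columns: List[str], cutoff: float,
                        database: str = "sagemaker_featurestore") -> str:
    """Build an Athena query returning the newest row per key as of a cutoff.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    cols = ", ".join(columns)
    return f"""
SELECT {key}, {cols}
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY {key}
               ORDER BY event_time DESC, api_invocation_time DESC, write_time DESC
           ) AS row_num
    FROM "{database}"."{table}"
    WHERE event_time <= {cutoff}
) AS ranked
WHERE row_num = 1 AND NOT is_deleted
""".strip()


query = point_in_time_query(
    table="prod_customer_features",  # placeholder; use the real Glue table name
    key="customer_id",
    columns=["total_purchases", "avg_order_value", "is_premium"],
    cutoff=1700000000,
)
```

The returned string can be passed as `QueryString` to `start_query_execution` exactly like the simpler query above.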
Environment Configuration
# environments/dev.tfvars
environment = "dev"
offline_table_format = "Glue" # Simpler for dev
kms_key_arn = null # No customer-managed KMS key in dev
# environments/prod.tfvars
environment = "prod"
offline_table_format = "Iceberg" # ACID transactions, time travel
kms_key_arn = "arn:aws:kms:us-east-1:123456789012:key/abc-123"
Gotchas and Tips
Offline store has a ~15 minute delay. Data written via PutRecord appears in the online store immediately but takes up to 15 minutes to land in the offline store (S3). Don't rely on the offline store for near-real-time analytics.
Feature groups are only partly mutable. You can add new features to an existing feature group with the UpdateFeatureGroup API, but you cannot remove or rename existing features.
Online store costs. The online store charges per read and write unit. High-throughput inference with thousands of feature lookups per second adds up. Monitor costs and batch lookups where possible using BatchGetRecord.
Point-in-time correctness. Always use event_time for training queries. Querying without time filtering risks data leakage, where future data appears in your training set.
KMS encryption for both stores. The online and offline stores support separate KMS keys. In production, encrypt both. The Glue Data Catalog metadata is not encrypted by Feature Store, so manage its encryption separately.
Schema planning matters. Feature types (String, Fractional, Integral) cannot be changed after creation. Plan your schema carefully. Use String for anything you're unsure about, since it's the most flexible.
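On the cost point above: batching lookups with BatchGetRecord means splitting a long ID list into request-sized chunks. A minimal sketch that only builds the request payloads; `batch_get_requests` is a hypothetical helper, and the chunk size of 100 is an assumed per-request limit that you should check against the current service quotas.

```python
from typing import Any, Dict, Iterator, List, Optional

MAX_IDS_PER_CALL = 100  # assumed per-request limit; verify before relying on it


def batch_get_requests(feature_group: str, ids: List[str],
                       feature_names: Optional[List[str]] = None,
                       chunk_size: int = MAX_IDS_PER_CALL) -> Iterator[Dict[str, Any]]:
    """Yield keyword arguments for batch_get_record, one dict per chunk of IDs."""
    for start in range(0, len(ids), chunk_size):
        identifier: Dict[str, Any] = {
            "FeatureGroupName": feature_group,
            "RecordIdentifiersValueAsString": ids[start:start + chunk_size],
        }
        if feature_names:
            # Fetch only the features the model needs
            identifier["FeatureNames"] = feature_names
        yield {"Identifiers": [identifier]}


customer_ids = [f"cust-{i}" for i in range(250)]
requests = list(batch_get_requests("prod-customer-features", customer_ids))

# With a real client, each payload would be sent like this:
# for kwargs in requests:
#     response = featurestore_runtime.batch_get_record(**kwargs)
```

With 250 IDs this yields three requests instead of 250 individual `get_record` calls.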
What's Next
This is Post 3 of the ML Pipelines & MLOps with Terraform series.
- Post 1: SageMaker Studio Domain
- Post 2: SageMaker Endpoints - Deploy to Prod
- Post 3: SageMaker Feature Store (you are here)
- Post 4: SageMaker Pipelines - CI/CD for ML
Your features have a home. Online store for real-time inference, offline store for training, automatic sync between them. No more training-serving skew. No more duplicated feature pipelines. One source of truth, all in Terraform.
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!