Feature Store Setup
Feature engineering is where most ML projects spend 60-80% of their time — and where most technical debt accumulates. This toolkit gives you a working feature store built on Feast, plus a custom lightweight alternative for teams that don't need the full infrastructure. You get feature engineering patterns, point-in-time correct joins that prevent data leakage, versioned feature definitions, and serving configurations for both batch and real-time inference. Stop recomputing the same features across notebooks and pipelines.
Key Features
- Feast Configuration Pack: Complete feature_store.yaml, entity definitions, feature views, and materialization configs for offline and online stores.
- Custom Lightweight Feature Store: A pure-Python alternative using Parquet + SQLite for teams that need feature management without Kubernetes.
- Feature Engineering Library: 40+ reusable transformers for time-series, text, categorical, and numeric features with a scikit-learn-compatible API.
- Point-in-Time Joins: Utilities that prevent future data leakage in training sets by joining features as-of each entity's event timestamp.
- Feature Versioning: Track schema changes, transformations, and data distributions across feature versions with automatic drift alerts.
- Online/Offline Serving: Serve features from Redis (online, <10ms) or Parquet/BigQuery (offline, batch) with a unified retrieval API.
- Feature Monitoring: Distribution drift detection, null rate tracking, and freshness alerts for production feature pipelines.
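To make "scikit-learn-compatible API" concrete, here is a minimal sketch of what one such transformer could look like. The class name `DaysSinceTransformer` and its `reference` parameter are hypothetical and not taken from the toolkit. The class follows the fit/transform convention by duck typing only, so it runs without scikit-learn installed.

```python
from datetime import datetime


class DaysSinceTransformer:
    """Hypothetical transformer: whole days between each timestamp and a reference date.

    Mirrors the scikit-learn fit/transform convention without inheriting
    from any scikit-learn base class.
    """

    def __init__(self, reference=None):
        self.reference = reference

    def fit(self, X, y=None):
        # If no reference date was given, learn it as the latest timestamp seen.
        self.reference_ = self.reference or max(X)
        return self

    def transform(self, X):
        return [(self.reference_ - ts).days for ts in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


last_purchases = [datetime(2026, 1, 1), datetime(2026, 1, 10), datetime(2026, 1, 15)]
days_since = DaysSinceTransformer().fit_transform(last_purchases)
```

Learning the reference date in `fit` and reusing it in `transform` keeps training and serving consistent, which is the main reason to wrap feature logic in a transformer rather than inline code.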
Quick Start
unzip feature-store-setup.zip && cd feature-store-setup
pip install -r requirements.txt
# Option 1: Initialize Feast feature store
feast init my_feature_store
cp configs/production.yaml my_feature_store/feature_store.yaml
cd my_feature_store && feast apply
# Option 2: Use lightweight custom store
python src/feature_store_setup/core.py init --store-path ./my_store
# configs/production.yaml (Feast)
project: ml_platform
registry: ./data/registry.db
provider: local  # local | gcp | aws
online_store:
  type: sqlite  # sqlite | redis | dynamodb
  path: ./data/online_store.db
offline_store:
  type: file  # file | bigquery | redshift
entity_key_serialization_version: 2
Architecture
┌──────────────┐     ┌───────────────┐     ┌────────────────┐
│  Raw Data    │────>│   Feature     │────>│ Offline Store  │
│  Sources     │     │  Engineering  │     │ (Parquet/BQ)   │
└──────────────┘     └───────┬───────┘     └───────┬────────┘
                             │                     │
                     ┌───────▼───────┐     ┌───────▼────────┐
                     │   Feature     │     │  Online Store  │
                     │   Registry    │     │ (Redis/SQLite) │
                     └───────┬───────┘     └───────┬────────┘
                             │                     │
                     ┌───────▼─────────────────────▼────────┐
                     │        Unified Retrieval API         │
                     │   get_historical() / get_online()    │
                     └──────────────────────────────────────┘
Usage Examples
Define and Register Features with Feast
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64, String

# Define entity
user = Entity(name="user_id", join_keys=["user_id"])

# Define data source
user_source = FileSource(
    path="./data/user_features.parquet",
    timestamp_field="event_timestamp",
)

# Define feature view
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float64),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    source=user_source,
)
Point-in-Time Correct Feature Retrieval
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="./my_feature_store")

entity_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime([
        "2026-01-15 10:00:00",
        "2026-01-15 14:30:00",
        "2026-01-16 09:00:00",
    ]),
})

# Point-in-time join prevents data leakage
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()
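To show what "point-in-time correct" means mechanically, the following is a minimal pure-Python sketch of an as-of join. It is not Feast's implementation. The function name `point_in_time_join` and the tuple layouts are assumptions made for illustration, and the optional `ttl` argument only approximates how a TTL limits how far back the join may look.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_join(entity_rows, feature_rows, ttl=None):
    """For each (entity_id, event_ts), pick the latest feature value with ts <= event_ts.

    entity_rows:  list of (entity_id, event_ts)
    feature_rows: list of (entity_id, feature_ts, value)
    ttl:          optional timedelta; feature rows older than event_ts - ttl are ignored
    """
    # Group feature rows per entity, ordered by timestamp.
    history = {}
    for entity_id, feature_ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history.setdefault(entity_id, []).append((feature_ts, value))

    joined = []
    for entity_id, event_ts in entity_rows:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        # Index of the last feature row at or before the event; -1 if there is none.
        idx = bisect_right(timestamps, event_ts) - 1
        value = None
        if idx >= 0:
            feature_ts, candidate = rows[idx]
            if ttl is None or event_ts - feature_ts <= ttl:
                value = candidate
        joined.append((entity_id, event_ts, value))
    return joined


features = [
    (1001, datetime(2026, 1, 14, 12, 0), 3),
    (1001, datetime(2026, 1, 15, 12, 0), 4),  # written after the 10:00 event, must not leak
    (1002, datetime(2026, 1, 10, 0, 0), 7),   # older than a one-day TTL
]
entities = [
    (1001, datetime(2026, 1, 15, 10, 0)),
    (1002, datetime(2026, 1, 15, 14, 30)),
    (1003, datetime(2026, 1, 16, 9, 0)),
]
with_ttl = point_in_time_join(entities, features, ttl=timedelta(days=1))
without_ttl = point_in_time_join(entities, features)
```

User 1001 gets the value 3 rather than 4, because the row with value 4 was written two hours after the event. A join on `user_id` alone would have picked up that later row.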
Custom Lightweight Feature Store
from feature_store_setup.core import LightweightFeatureStore

store = LightweightFeatureStore(store_path="./my_store")

store.register_feature_group(
    name="user_behavioral",
    entity_key="user_id",
    features={
        "session_count_7d": "int",
        "avg_session_duration": "float",
        "bounce_rate": "float",
    },
    source_query="SELECT * FROM user_sessions_agg",
    refresh_interval="1h",
)
store.materialize("user_behavioral")

# Retrieve for online inference (<10ms)
features = store.get_online_features(feature_group="user_behavioral", entity_ids=[1001, 1002])
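For readers curious how a SQLite-backed online lookup could work, here is a rough sketch using only the standard library. `TinyOnlineStore`, its table layout, and the `now` and `max_age` parameters are all hypothetical. This is not the toolkit's `LightweightFeatureStore`, whose internals are not shown in this guide.

```python
import json
import sqlite3
from datetime import datetime, timedelta


class TinyOnlineStore:
    """Hypothetical stand-in for an online store: one latest row per (group, entity)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS online_features ("
            "feature_group TEXT, entity_id INTEGER, payload TEXT, updated_at TEXT, "
            "PRIMARY KEY (feature_group, entity_id))"
        )

    def write(self, feature_group, entity_id, features, updated_at):
        # The primary key makes this an upsert, so only the latest values are kept.
        self.conn.execute(
            "INSERT OR REPLACE INTO online_features VALUES (?, ?, ?, ?)",
            (feature_group, entity_id, json.dumps(features), updated_at.isoformat()),
        )
        self.conn.commit()

    def get_online_features(self, feature_group, entity_ids, now, max_age):
        result = {}
        for entity_id in entity_ids:
            row = self.conn.execute(
                "SELECT payload, updated_at FROM online_features "
                "WHERE feature_group = ? AND entity_id = ?",
                (feature_group, entity_id),
            ).fetchone()
            # Missing or stale rows come back as None instead of an outdated value.
            fresh = row is not None and now - datetime.fromisoformat(row[1]) <= max_age
            result[entity_id] = json.loads(row[0]) if fresh else None
        return result


now = datetime(2026, 1, 16, 9, 0)
online_store = TinyOnlineStore()
online_store.write("user_behavioral", 1001, {"session_count_7d": 12}, now - timedelta(minutes=30))
online_store.write("user_behavioral", 1002, {"session_count_7d": 4}, now - timedelta(days=2))
online = online_store.get_online_features(
    "user_behavioral", [1001, 1002, 1003], now=now, max_age=timedelta(hours=24)
)
```

Keeping a single keyed row per entity is what makes online reads cheap: each lookup is a primary-key fetch rather than a scan over history.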
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | str | local | Infrastructure: local, gcp, aws |
| online_store.type | str | sqlite | Online store backend |
| offline_store.type | str | file | Offline store backend |
| feature_views.*.ttl | int | 86400 | Feature freshness TTL in seconds |
| source.timestamp_field | str | required | Column used for point-in-time joins |
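A small pre-flight check can catch typos in these settings before anything is deployed. The sketch below is an assumption-laden illustration: `validate_config` and `lookup` are hypothetical helpers, the config is passed as an already-parsed dict, and the allowed values are only those listed in the comments of configs/production.yaml (Feast itself supports more backends).

```python
# Allowed values as listed in the comments of configs/production.yaml.
ALLOWED = {
    "provider": {"local", "gcp", "aws"},
    "online_store.type": {"sqlite", "redis", "dynamodb"},
    "offline_store.type": {"file", "bigquery", "redshift"},
}
# Defaults as listed in the configuration reference table.
DEFAULTS = {"provider": "local", "online_store.type": "sqlite", "offline_store.type": "file"}


def lookup(config, dotted_key):
    """Walk a nested dict using a dotted key; return None if any level is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def validate_config(config):
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = lookup(config, key)
        if value is None:
            value = DEFAULTS[key]  # fall back to the documented default
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems


good = {"provider": "local", "online_store": {"type": "sqlite"}, "offline_store": {"type": "file"}}
bad = {"provider": "azure", "online_store": {"type": "redis"}}
good_problems = validate_config(good)
bad_problems = validate_config(bad)
```

Returning a list of problems rather than raising on the first one lets a CI job report every misconfigured key in a single run.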
Best Practices
- Always use point-in-time joins: A plain LEFT JOIN on the entity key ignores timestamps, so it can pull in feature values computed after the event and leak future data into training sets. This is one of the most common sources of inflated offline metrics that don't reproduce in production.
- Version feature definitions, not just data: When you change a transformation, create a new feature version (avg_order_value_v2) rather than silently updating the existing one.
- Set TTLs on online features: Stale features in production cause silent model degradation. If a feature hasn't been refreshed in 24 hours, it's better to return null than a stale value.
- Monitor feature distributions: Track mean, stddev, null rate, and cardinality in production. Alert when a feature's production mean moves more than 2 standard deviations away from its training mean.
- Start with the lightweight store: If you have fewer than 50 features and a small team, the custom Parquet + SQLite store is simpler to operate than full Feast infrastructure.
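The monitoring practice above can be sketched in a few lines of standard-library Python. The helper names `profile` and `drift_alerts` are hypothetical, the 5% null-rate threshold is an assumed value chosen for illustration, and cardinality tracking is left out to keep the example short.

```python
from statistics import mean, stdev


def profile(values):
    """Summarize one feature column; None counts as a null."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present),
        "stddev": stdev(present),
        "null_rate": 1 - len(present) / len(values),
    }


def drift_alerts(training, production, max_shift=2.0, max_null_rate=0.05):
    """Compare a production profile against the training profile."""
    alerts = []
    # Express the mean shift in units of the training standard deviation.
    shift = abs(production["mean"] - training["mean"]) / training["stddev"]
    if shift > max_shift:
        alerts.append(f"mean shifted {shift:.1f} training stddevs")
    if production["null_rate"] > max_null_rate:
        alerts.append(f"null rate {production['null_rate']:.0%} exceeds {max_null_rate:.0%}")
    return alerts


training_profile = profile([40.0, 50.0, 60.0, 45.0, 55.0])
drifted = drift_alerts(training_profile, profile([80.0, 90.0, None, 85.0]))
stable = drift_alerts(training_profile, profile([48.0, 52.0, 50.0, 51.0]))
```

Storing the training profile alongside the feature version gives the production check a fixed baseline to compare against.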
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| feast apply fails with schema error | Feature type mismatch with data | Verify Parquet column dtypes match Field definitions exactly |
| Materialization is slow | Full table scan on large datasets | Run feast materialize-incremental so only data newer than the last materialization is loaded |
| Online features return None | Features not materialized to online store | Run feast materialize with the correct time range |
| Point-in-time join returns NaN | Entity timestamps predate feature availability | Ensure feature data covers the full entity timestamp range |
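For the last row of the table, a quick diagnostic is to list the entity rows that no feature data can cover. The sketch below assumes plain tuples as input, and `uncovered_entities` is a hypothetical helper rather than part of Feast or the toolkit.

```python
from datetime import datetime


def uncovered_entities(entity_rows, feature_rows):
    """Return entity rows dated before that entity's first feature row, or with no feature rows.

    entity_rows:  list of (entity_id, event_ts)
    feature_rows: list of (entity_id, feature_ts)
    """
    # Earliest feature timestamp seen per entity.
    first_seen = {}
    for entity_id, feature_ts in feature_rows:
        if entity_id not in first_seen or feature_ts < first_seen[entity_id]:
            first_seen[entity_id] = feature_ts
    return [
        (entity_id, event_ts)
        for entity_id, event_ts in entity_rows
        if entity_id not in first_seen or event_ts < first_seen[entity_id]
    ]


feature_timestamps = [
    (1001, datetime(2026, 1, 10)),
    (1002, datetime(2026, 1, 16)),
]
training_events = [
    (1001, datetime(2026, 1, 15, 10, 0)),
    (1002, datetime(2026, 1, 15, 14, 30)),  # before the first feature row for 1002
    (1003, datetime(2026, 1, 16, 9, 0)),    # no feature rows at all
]
missing = uncovered_entities(training_events, feature_timestamps)
```

Any row this returns will come back empty from a point-in-time join no matter how the join is configured, so the fix has to be a backfill of the feature data.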
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Feature Store Setup Guide with all files, templates, and documentation for $39.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.