
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Feature Store Setup Guide

Feature engineering is where most ML projects spend 60-80% of their time — and where most technical debt accumulates. This toolkit gives you a working feature store built on Feast, plus a custom lightweight alternative for teams that don't need the full infrastructure. You get feature engineering patterns, point-in-time correct joins that prevent data leakage, versioned feature definitions, and serving configurations for both batch and real-time inference. Stop recomputing the same features across notebooks and pipelines.

Key Features

  • Feast Configuration Pack — Complete feature_store.yaml, entity definitions, feature views, and materialization configs for offline and online stores.
  • Custom Lightweight Feature Store — A pure-Python alternative using Parquet + SQLite for teams that need feature management without Kubernetes.
  • Feature Engineering Library — 40+ reusable transformers for time-series, text, categorical, and numeric features with scikit-learn compatible API.
  • Point-in-Time Joins — Utilities that prevent future data leakage in training sets by joining features as-of each entity's event timestamp.
  • Feature Versioning — Track schema changes, transformations, and data distributions across feature versions with automatic drift alerts.
  • Online/Offline Serving — Serve features from Redis (online, <10ms) or Parquet/BigQuery (offline, batch) with unified retrieval API.
  • Feature Monitoring — Distribution drift detection, null rate tracking, and freshness alerts for production feature pipelines.
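The point-in-time join is the feature above that saves the most debugging time. The toolkit ships its own utilities for this, but the underlying idea can be illustrated (this is a sketch, not the toolkit's code) with pandas' `merge_asof`, which picks, for each entity row, the latest feature value at or before that row's timestamp:

```python
import pandas as pd

# Entity rows: the labeled events we want features "as of".
entity_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2026-01-10", "2026-01-20", "2026-01-15"]),
})

# Feature rows: each value becomes valid at its own timestamp.
feature_df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2026-01-05", "2026-01-18", "2026-01-12"]),
    "total_purchases": [3, 7, 1],
})

# merge_asof joins backward in time per user_id, so a training row
# can never see a feature value from its own future.
training = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    feature_df.sort_values("event_timestamp"),
    on="event_timestamp",
    by="user_id",
)
print(training["total_purchases"].tolist())  # [3, 1, 7]
```

Note that the user-1 row dated 2026-01-10 gets the value 3, not the 7 that only became available on 2026-01-18; a plain LEFT JOIN on `user_id` would have leaked it.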

Quick Start

unzip feature-store-setup.zip && cd feature-store-setup
pip install -r requirements.txt

# Option 1: Initialize Feast feature store
feast init my_feature_store
cp configs/production.yaml my_feature_store/feature_store.yaml
cd my_feature_store && feast apply

# Option 2: Use lightweight custom store
python src/feature_store_setup/core.py init --store-path ./my_store
# configs/production.yaml (Feast)
project: ml_platform
registry: ./data/registry.db
provider: local  # local | gcp | aws

online_store:
  type: sqlite  # sqlite | redis | dynamodb
  path: ./data/online_store.db

offline_store:
  type: file  # file | bigquery | redshift

entity_key_serialization_version: 2

Architecture

┌──────────────┐     ┌───────────────┐     ┌────────────────┐
│ Raw Data     │────>│  Feature      │────>│  Offline Store │
│ Sources      │     │  Engineering  │     │  (Parquet/BQ)  │
└──────────────┘     └───────┬───────┘     └───────┬────────┘
                             │                      │
                     ┌───────▼───────┐     ┌───────▼────────┐
                     │  Feature      │     │  Online Store  │
                     │  Registry     │     │  (Redis/SQLite)│
                     └───────┬───────┘     └───────┬────────┘
                             │                      │
                     ┌───────▼──────────────────────▼───────┐
                     │        Unified Retrieval API         │
                     │  get_historical() / get_online()     │
                     └──────────────────────────────────────┘

Usage Examples

Define and Register Features with Feast

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64, String
from datetime import timedelta

# Define entity
user = Entity(name="user_id", join_keys=["user_id"])

# Define data source
user_source = FileSource(
    path="./data/user_features.parquet",
    timestamp_field="event_timestamp",
)

# Define feature view
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float64),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    source=user_source,
)

Point-in-Time Correct Feature Retrieval

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="./my_feature_store")
entity_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2026-01-15 10:00:00", "2026-01-15 14:30:00", "2026-01-16 09:00:00"]),
})

# Point-in-time join prevents data leakage
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()

Custom Lightweight Feature Store

from feature_store_setup.core import LightweightFeatureStore

store = LightweightFeatureStore(store_path="./my_store")
store.register_feature_group(
    name="user_behavioral", entity_key="user_id",
    features={"session_count_7d": "int", "avg_session_duration": "float", "bounce_rate": "float"},
    source_query="SELECT * FROM user_sessions_agg", refresh_interval="1h",
)
store.materialize("user_behavioral")

# Retrieve for online inference (<10ms)
features = store.get_online_features(feature_group="user_behavioral", entity_ids=[1001, 1002])
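To make the lightweight pattern concrete: the online path of a Parquet + SQLite store is essentially a keyed lookup table that materialization refreshes. The sketch below is a hypothetical, stripped-down illustration of that pattern using only the standard library, not the toolkit's actual implementation:

```python
import sqlite3

# In-memory store for illustration; the real pattern would use a file-backed DB
# refreshed by materialize() from Parquet snapshots.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_behavioral ("
    "user_id INTEGER PRIMARY KEY, session_count_7d INTEGER, bounce_rate REAL)"
)
conn.executemany(
    "INSERT INTO user_behavioral VALUES (?, ?, ?)",
    [(1001, 14, 0.31), (1002, 3, 0.55)],
)

def get_online_features(entity_ids):
    # Point lookups on the primary key keep single-digit-ms latency realistic.
    placeholders = ",".join("?" for _ in entity_ids)
    rows = conn.execute(
        f"SELECT user_id, session_count_7d, bounce_rate "
        f"FROM user_behavioral WHERE user_id IN ({placeholders})",
        entity_ids,
    ).fetchall()
    return {u: {"session_count_7d": s, "bounce_rate": b} for u, s, b in rows}

print(get_online_features([1001, 1002]))
```

Because SQLite serves reads from the local filesystem with an index on the entity key, this stays fast well past the 50-feature scale where teams usually consider Redis.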

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| provider | str | local | Infrastructure: local, gcp, aws |
| online_store.type | str | sqlite | Online store backend |
| offline_store.type | str | file | Offline store backend |
| feature_views.*.ttl | int | 86400 | Feature freshness TTL in seconds |
| source.timestamp_field | str | required | Column used for point-in-time joins |
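For production online serving, the same feature_store.yaml can point at Redis instead of SQLite. A minimal sketch, assuming a Redis instance on the default local port:

```yaml
online_store:
  type: redis
  connection_string: "localhost:6379"
```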

Best Practices

  1. Always use point-in-time joins — Standard LEFT JOINs leak future data into training sets. This is the most common source of inflated offline metrics that don't reproduce in production.
  2. Version feature definitions, not just data — When you change a transformation, create a new feature version (avg_order_value_v2) rather than silently updating the existing one.
  3. Set TTLs on online features — Stale features in production cause silent model degradation. If a feature hasn't been refreshed in 24 hours, it's better to return null than a stale value.
  4. Monitor feature distributions — Track mean, stddev, null rate, and cardinality in production. Alert when distributions shift more than 2 standard deviations from training data.
  5. Start with the lightweight store — If you have fewer than 50 features and a small team, the custom Parquet + SQLite store is simpler to operate than full Feast infrastructure.
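The "2 standard deviations" rule from practice 4 fits in a few lines. This is a hedged sketch of one way to implement it (the toolkit's monitoring module may differ): compare the production mean of a feature against the training mean, scaled by the training standard deviation:

```python
from statistics import mean, stdev

def drift_alert(train_values, prod_values, threshold=2.0):
    # Alert when the production mean drifts more than `threshold`
    # training standard deviations away from the training mean.
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(prod_values) - mu) > threshold * sigma

train = [10, 12, 11, 13, 12, 11, 10, 12]
print(drift_alert(train, [11, 12, 10, 13]))  # False: distribution unchanged
print(drift_alert(train, [25, 27, 24, 26]))  # True: large mean shift
```

A real pipeline would run the same comparison per feature on a schedule, and track null rate and cardinality alongside the mean.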

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| feast apply fails with schema error | Feature type mismatch with data | Verify Parquet column dtypes match Field definitions exactly |
| Materialization is slow | Full table scan on large datasets | Add created_timestamp_column to enable incremental materialization |
| Online features return None | Features not materialized to online store | Run feast materialize with the correct time range |
| Point-in-time join returns NaN | Entity timestamps predate feature availability | Ensure feature data covers the full entity timestamp range |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Feature Store Setup Guide with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

