Thesius Code

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

Feature Store Setup Guide: Feature Store Setup

#machinelearning #python #mlops #datascience

Feature Store Setup

Feature engineering often takes 60-80% of the time on an ML project, and it is where much of the technical debt accumulates. This toolkit gives you a working feature store built on Feast, plus a custom lightweight alternative for teams that don't need the full infrastructure. You get feature engineering patterns, point-in-time correct joins that prevent data leakage, versioned feature definitions, and serving configurations for both batch and real-time inference. Stop recomputing the same features across notebooks and pipelines.

Key Features

Feast Configuration Pack: Complete feature_store.yaml, entity definitions, feature views, and materialization configs for offline and online stores.
Custom Lightweight Feature Store: A pure-Python alternative using Parquet + SQLite for teams that need feature management without Kubernetes.
Feature Engineering Library: 40+ reusable transformers for time-series, text, categorical, and numeric features with a scikit-learn-compatible API.
Point-in-Time Joins: Utilities that prevent future data leakage in training sets by joining features as of each entity's event timestamp.
Feature Versioning: Track schema changes, transformations, and data distributions across feature versions with automatic drift alerts.
Online/Offline Serving: Serve features from Redis (online, <10ms) or Parquet/BigQuery (offline, batch) through a unified retrieval API.
Feature Monitoring: Distribution drift detection, null rate tracking, and freshness alerts for production feature pipelines.

Quick Start

unzip feature-store-setup.zip && cd feature-store-setup
pip install -r requirements.txt

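# Option 1: Initialize Feast feature store
# (some Feast releases scaffold the repo under my_feature_store/feature_repo/;
#  if yours does, point the cp and cd commands below at that directory)
feast init my_feature_store
cp configs/production.yaml my_feature_store/feature_store.yaml
cd my_feature_store && feast apply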

# Option 2: Use lightweight custom store
python src/feature_store_setup/core.py init --store-path ./my_store

# configs/production.yaml (Feast)
project: ml_platform
registry: ./data/registry.db
provider: local  # local | gcp | aws

online_store:
  type: sqlite  # sqlite | redis | dynamodb
  path: ./data/online_store.db

offline_store:
  type: file  # file | bigquery | redshift

entity_key_serialization_version: 2

Architecture

┌──────────────┐     ┌───────────────┐     ┌────────────────┐
│ Raw Data     │────>│  Feature      │────>│  Offline Store │
│ Sources      │     │  Engineering  │     │  (Parquet/BQ)  │
└──────────────┘     └───────┬───────┘     └───────┬────────┘
                             │                      │
                     ┌───────▼───────┐     ┌───────▼────────┐
                     │  Feature      │     │  Online Store  │
                     │  Registry     │     │  (Redis/SQLite)│
                     └───────┬───────┘     └───────┬────────┘
                             │                      │
                     ┌───────▼──────────────────────▼───────┐
                     │        Unified Retrieval API         │
                     │  get_historical() / get_online()     │
                     └─────────────────────────────────────┘
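One way to read the arrow from the offline store to the online store is as "materialization": copying the most recent feature values per entity into a key-value lookup. The sketch below illustrates that idea with plain Python structures. The `materialize` function and the row layout are hypothetical and are not Feast's or the toolkit's implementation.

```python
from datetime import datetime


def materialize(offline_rows, start, end):
    """Return {entity_id: features} with the latest row per entity in [start, end].

    offline_rows: iterable of (entity_id, event_timestamp, features_dict).
    This layout is an assumption made for illustration only.
    """
    latest = {}
    for entity_id, ts, features in offline_rows:
        if not (start <= ts <= end):
            continue  # outside the materialization window
        if entity_id not in latest or ts > latest[entity_id][0]:
            latest[entity_id] = (ts, features)
    # Drop the timestamps: the online store only serves the current values.
    return {entity_id: features for entity_id, (_, features) in latest.items()}


offline = [
    (1001, datetime(2026, 1, 14), {"total_purchases": 3}),
    (1001, datetime(2026, 1, 15), {"total_purchases": 4}),
    (1002, datetime(2026, 1, 10), {"total_purchases": 7}),
]
online = materialize(offline, datetime(2026, 1, 12), datetime(2026, 1, 16))

print(online.get(1001))  # {'total_purchases': 4}
print(online.get(1002))  # None: its only row falls outside the window
```

The second lookup shows why an online read can come back empty even though the offline store has data: the row was never inside a materialized time range.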

Usage Examples

Define and Register Features with Feast

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64, String
from datetime import timedelta

# Define entity
user = Entity(name="user_id", join_keys=["user_id"])

# Define data source
user_source = FileSource(
    path="./data/user_features.parquet",
    timestamp_field="event_timestamp",
)

# Define feature view
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_purchases", dtype=Int64),
        Field(name="avg_order_value", dtype=Float64),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    source=user_source,
)

Point-in-Time Correct Feature Retrieval

from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="./my_feature_store")
entity_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "event_timestamp": pd.to_datetime(["2026-01-15 10:00:00", "2026-01-15 14:30:00", "2026-01-16 09:00:00"]),
})

# Point-in-time join prevents data leakage
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_features:total_purchases", "user_features:avg_order_value"],
).to_df()
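To make the join semantics concrete, here is a minimal pure-Python sketch of an as-of lookup for a single entity: it returns the latest feature value at or before the entity's event timestamp and ignores values older than an optional TTL. The function name `point_in_time_lookup` and the sample history are assumptions for illustration; Feast performs the equivalent logic inside the offline store.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, entity_ts, ttl=None):
    """Return the latest value with timestamp <= entity_ts, or None.

    feature_rows: list of (timestamp, value), sorted by timestamp ascending.
    ttl: optional timedelta; a value older than entity_ts - ttl is ignored.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, entity_ts) - 1  # last row not in the future
    if idx < 0:
        return None  # entity timestamp predates all feature data
    ts, value = feature_rows[idx]
    if ttl is not None and entity_ts - ts > ttl:
        return None  # nearest value is too old to trust
    return value


# Hypothetical history of total_purchases for one user.
history = [
    (datetime(2026, 1, 15, 0, 0), 3),
    (datetime(2026, 1, 15, 12, 0), 4),
    (datetime(2026, 1, 17, 0, 0), 9),
]
ttl = timedelta(days=1)

print(point_in_time_lookup(history, datetime(2026, 1, 15, 10, 0), ttl))  # 3
print(point_in_time_lookup(history, datetime(2026, 1, 16, 9, 0), ttl))   # 4
print(point_in_time_lookup(history, datetime(2026, 1, 16, 13, 0), ttl))  # None
print(point_in_time_lookup(history, datetime(2026, 1, 14, 0, 0), ttl))   # None
```

The first call returns 3 rather than 4 because the value 4 was recorded two hours after the event. A join on user_id alone would have used whichever row it happened to match, which is how future data leaks into training sets.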

Custom Lightweight Feature Store

from feature_store_setup.core import LightweightFeatureStore

store = LightweightFeatureStore(store_path="./my_store")
store.register_feature_group(
    name="user_behavioral", entity_key="user_id",
    features={"session_count_7d": "int", "avg_session_duration": "float", "bounce_rate": "float"},
    source_query="SELECT * FROM user_sessions_agg", refresh_interval="1h",
)
store.materialize("user_behavioral")

# Retrieve for online inference (<10ms)
features = store.get_online_features(feature_group="user_behavioral", entity_ids=[1001, 1002])
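For readers curious what a SQLite-backed online lookup might look like underneath, the sketch below stores one JSON payload per entity and returns None for rows that are missing or older than a TTL. `TinyOnlineStore`, its table layout, and its method names are hypothetical; they are not the toolkit's `LightweightFeatureStore` internals.

```python
import json
import sqlite3
import time


class TinyOnlineStore:
    """Hypothetical SQLite-backed online store, for illustration only."""

    def __init__(self, path=":memory:", ttl_seconds=86400):
        self.conn = sqlite3.connect(path)
        self.ttl_seconds = ttl_seconds
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS online_features ("
            "feature_group TEXT, entity_id INTEGER, payload TEXT, updated_at REAL, "
            "PRIMARY KEY (feature_group, entity_id))"
        )

    def put(self, feature_group, entity_id, features, updated_at=None):
        # Upsert the latest feature values for one entity.
        ts = time.time() if updated_at is None else updated_at
        self.conn.execute(
            "INSERT OR REPLACE INTO online_features VALUES (?, ?, ?, ?)",
            (feature_group, entity_id, json.dumps(features), ts),
        )
        self.conn.commit()

    def get(self, feature_group, entity_ids, now=None):
        # Return {entity_id: features or None}; missing or stale rows give None.
        now = time.time() if now is None else now
        result = {}
        for entity_id in entity_ids:
            row = self.conn.execute(
                "SELECT payload, updated_at FROM online_features "
                "WHERE feature_group = ? AND entity_id = ?",
                (feature_group, entity_id),
            ).fetchone()
            fresh = row is not None and now - row[1] <= self.ttl_seconds
            result[entity_id] = json.loads(row[0]) if fresh else None
        return result


demo_store = TinyOnlineStore(ttl_seconds=3600)
demo_store.put("user_behavioral", 1001, {"session_count_7d": 12, "bounce_rate": 0.31}, updated_at=1000.0)

fresh_read = demo_store.get("user_behavioral", [1001, 1003], now=2000.0)
stale_read = demo_store.get("user_behavioral", [1001], now=10000.0)
print(fresh_read)  # 1001 has values, 1003 was never written
print(stale_read)  # 1001 is older than the one-hour TTL
```

The primary-key lookup keeps reads cheap, and passing `now` explicitly makes the freshness rule easy to test.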

Configuration Reference

Parameter	Type	Default	Description
`provider`	str	`local`	Infrastructure: local, gcp, aws
`online_store.type`	str	`sqlite`	Online store backend
`offline_store.type`	str	`file`	Offline store backend
`feature_views.*.ttl`	int	`86400`	Feature freshness TTL in seconds (86400 = 1 day); in Feast Python definitions, pass a `timedelta` instead, e.g. `timedelta(days=1)`
`source.timestamp_field`	str	required	Column used for point-in-time joins

Best Practices

Always use point-in-time joins: A plain LEFT JOIN on the entity key can attach feature values computed after the training example's event timestamp. That leakage is a common cause of inflated offline metrics that don't reproduce in production.
Version feature definitions, not just data: When you change a transformation, create a new feature version (avg_order_value_v2) rather than silently updating the existing one.
Set TTLs on online features: Stale features in production cause silent model degradation. If a feature hasn't been refreshed in 24 hours, it's better to return null than a stale value.
Monitor feature distributions: Track mean, stddev, null rate, and cardinality in production. Alert when a feature's production mean moves more than 2 training standard deviations away from its training mean.
Start with the lightweight store: If you have fewer than 50 features and a small team, the custom Parquet + SQLite store is simpler to operate than full Feast infrastructure.
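The monitoring practice above can be sketched with the standard library alone. This example flags a feature when its production mean sits more than 2 training standard deviations from the training mean, or when its null rate exceeds a threshold. The function names and the 5% null-rate limit are assumptions chosen for illustration, not toolkit defaults.

```python
from statistics import mean, stdev


def null_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)


def mean_shift_in_train_stds(train_values, prod_values):
    """Absolute shift of the production mean, in units of training stddev."""
    train = [v for v in train_values if v is not None]
    prod = [v for v in prod_values if v is not None]
    sd = stdev(train)
    gap = abs(mean(prod) - mean(train))
    if sd == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / sd


def check_feature(name, train_values, prod_values, max_shift=2.0, max_null_rate=0.05):
    """Return a list of human-readable alerts (empty when the feature looks healthy)."""
    alerts = []
    shift = mean_shift_in_train_stds(train_values, prod_values)
    if shift > max_shift:
        alerts.append(f"{name}: mean shifted {shift:.2f} training stddevs")
    rate = null_rate(prod_values)
    if rate > max_null_rate:
        alerts.append(f"{name}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return alerts


train = [50, 52, 48, 51, 49, 50]
drifted = [55, 56, None, 54, 57]
healthy = [50, 51, 49, 50]

drift_alerts = check_feature("avg_order_value", train, drifted)
healthy_alerts = check_feature("avg_order_value", train, healthy)
print(drift_alerts)    # two alerts: mean shift and null rate
print(healthy_alerts)  # []
```

A mean-shift check is deliberately crude; it misses changes in shape or variance, so treat it as a first alarm rather than a full drift test.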

Troubleshooting

Issue	Cause	Fix
`feast apply` fails with schema error	Feature type mismatch with data	Verify Parquet column dtypes match `Field` definitions exactly
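Materialization is slow	Full table scan on large datasets	Run `feast materialize-incremental` so each run loads only rows newer than the last materialized timestamp; a `created_timestamp_column` on the source only breaks ties between rows that share an event timestamp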
Online features return `None`	Features not materialized to online store	Run `feast materialize` with the correct time range
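Point-in-time join returns NaN	Entity timestamps predate feature availability, or the nearest feature row is older than the feature view's `ttl`	Ensure feature data covers the full entity timestamp range and that `ttl` is at least as long as the gap between feature updates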
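For the schema error in the first row, a small comparison helper can show which columns disagree before you run `feast apply`. The sketch below is hypothetical: the mapping from Feast type names to pandas dtype strings is an assumption (string columns in particular vary by pandas version), and `find_dtype_mismatches` is not part of Feast.

```python
# Assumed mapping from Feast type names to pandas dtype strings.
EXPECTED_PANDAS_DTYPE = {"Int64": "int64", "Float64": "float64", "String": "object"}


def find_dtype_mismatches(declared, actual):
    """Compare declared Feast types with the dtypes found in the data.

    declared: {column: Feast type name}, mirroring the Field definitions.
    actual:   {column: pandas dtype string} read from the Parquet file.
    Returns {column: (expected, found)} for every disagreement.
    """
    problems = {}
    for column, feast_type in declared.items():
        expected = EXPECTED_PANDAS_DTYPE[feast_type]
        found = actual.get(column, "<missing>")
        if found != expected:
            problems[column] = (expected, found)
    return problems


declared = {
    "total_purchases": "Int64",
    "avg_order_value": "Float64",
    "days_since_last_purchase": "Int64",
    "preferred_category": "String",
}
# Example dtypes: an integer column that picked up NaNs becomes float64.
actual = {
    "total_purchases": "float64",
    "avg_order_value": "float64",
    "days_since_last_purchase": "int64",
}

mismatches = find_dtype_mismatches(declared, actual)
for column, (expected, found) in mismatches.items():
    print(f"{column}: expected {expected}, found {found}")
```

With pandas installed, the `actual` mapping could be built from the source file with `pd.read_parquet(path).dtypes.astype(str).to_dict()`.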

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Feature Store Setup Guide with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →
