<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MagicLex</title>
    <description>The latest articles on DEV Community by MagicLex (@magiclex).</description>
    <link>https://dev.to/magiclex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F906469%2Fc175f35b-08eb-4234-a368-788dec880397.png</url>
      <title>DEV Community: MagicLex</title>
      <link>https://dev.to/magiclex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/magiclex"/>
    <language>en</language>
    <item>
      <title>Why Do You Need a Feature Store?</title>
      <dc:creator>MagicLex</dc:creator>
      <pubDate>Fri, 18 Aug 2023 16:13:56 +0000</pubDate>
      <link>https://dev.to/magiclex/why-do-you-need-a-feature-store-1h9d</link>
      <guid>https://dev.to/magiclex/why-do-you-need-a-feature-store-1h9d</guid>
      <description>&lt;h2&gt;
  
  
  Future-Proofing ML Operations: The Case for Feature Stores
&lt;/h2&gt;

&lt;p&gt;When an organization needs to productionize machine learning models built on data from its data platforms, it often requires a feature store as a central repository and collaboration layer for its data teams. But &lt;a href="https://www.hopsworks.ai/dictionary/feature-store" rel="noopener noreferrer"&gt;feature stores&lt;/a&gt; are more than just a data warehouse for features. They also power online AI applications, providing the key operational machine learning infrastructure that feeds data to online models. &lt;/p&gt;

&lt;p&gt;In this article we explain why feature stores are now widely considered the backbone of modern machine learning systems, and we identify the main challenges to consider when you start thinking about implementing one in your organization. &lt;/p&gt;

&lt;p&gt;First, a short history of the origins of, and need for, feature stores. The first feature store was announced in late 2017 as a platform developed internally by Uber. Uber had difficulty scaling and operationalizing its machine learning models; engineering teams were building bespoke, often one-off systems to put each model into production. Those early ML systems produced numerous anti-patterns and a range of technical debt that organizations would incur over time. Simply put, there is a gap between the systems needed to develop a model and the systems needed to operationalize a model that valuable business systems will depend on. &lt;/p&gt;

&lt;h2&gt;
  
  
  1- Identifying the Need for a Feature Store
&lt;/h2&gt;

&lt;p&gt;AI business value&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m3brjm1jq3kpr9j19zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m3brjm1jq3kpr9j19zb.png" alt="Figure 1. The chasm between BI and AI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As companies mature enough to understand their business historically with business intelligence tools and data, they often look to AI to build intelligent applications or services. Business intelligence examines current and past data to build meaningful insights into the current situation of the business and the potential actions that can be taken. Machine learning is the logical next step: it helps build products and meaningful predictions; it is a look into the future.&lt;/p&gt;

&lt;p&gt;Companies are increasingly focusing on getting AI into operation as quickly as possible, and using feature stores and ML platforms to accelerate that journey is the approach chosen by industry-leading companies such as Robinhood, Stripe, DoorDash, and more.  &lt;/p&gt;

&lt;p&gt;Here are a few considerations that can help you identify if and when your organization has reached the threshold for operational ML and might start implementing a feature store and ML platform: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Valuable models are created&lt;/strong&gt; but once the experimentation stage is over they do not bridge the chasm to operations - the models do not consistently generate revenue or savings. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing models running in production are expensive&lt;/strong&gt; - they are hard to debug, review and upgrade, they are bespoke systems that are difficult and costly to maintain. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring production pipelines is challenging&lt;/strong&gt;, or impossible. The data that powers AI changes over time, and identifying when there are significant changes that require retraining your AI is not easy. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature data is not centrally managed&lt;/strong&gt;; it is duplicated, features are re-engineered, and generally data is not reused across the organization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inability to provide very fresh feature data&lt;/strong&gt; or handle real-time data for ML models, which is critical for industries like finance, retail, or logistics where real-time insights can add significant business value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficulties in managing the lifecycle of feature data&lt;/strong&gt;, including the tracking of versions and historical changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cohesive governance in the storage and use of AI assets  (feature data and models)&lt;/strong&gt;, everything is done in a bespoke manner, leading to compliance risks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard to derive a direct business value from the models&lt;/strong&gt;, they exist in isolated environments that do not directly influence business operations. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow ramp-up time when onboarding new talent into the ML teams&lt;/strong&gt;. Sharing available AI assets is complex because operational knowledge is held by a few individuals or groups.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to solving those issues, the implementation of a feature store will enforce best practices in managing AI assets and help eliminate silos in organizations where different business units or teams are each responsible for a portion of the process. As a collaborative platform, the feature store empowers each team as part of a cohesive process for productionizing AI, and it reduces the friction of establishing collaboration across teams.&lt;/p&gt;

&lt;p&gt;There is also a strong argument that the feature store is not just a tool for big corporations, but for any organization that wants to build an AI-enabled system or application. Feature stores remove the need to build the operational AI support, reducing headcount and operational costs for small teams, enabling them to focus on the more valuable work of building the ML pipelines and AI-enabled applications or systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  2- Feature Stores in a Nutshell
&lt;/h2&gt;

&lt;p&gt;A feature store is a central, governed platform that enables data to be discovered for training AI and it also enables operational systems to use data to make predictions (inference) with AI. The feature store provides a Data Catalog describing the available data (the features) along with metadata, used for discovery but also to define the constraints under which data may be used in AI models to ensure compliance. The feature store also needs to provide security and SLAs (service level agreements) to ensure data is highly available for use by models in operational systems, ensuring critical business systems do not suffer downtime.&lt;/p&gt;

&lt;p&gt;Where does the feature data, stored in feature stores and used by AI for training and predictions, come from? Typically, feature data is created using data from existing Enterprise Data Sources (databases, data warehouses, data lakes, lakehouses, message queues). Ultimately, this data is generated by business products and services (financial results, customer behaviors, click events…), and the predictions made by AI are used by those same products and services. &lt;/p&gt;

&lt;p&gt;The feature data itself is often slightly different from the data in your existing Enterprise data platforms. Feature data is often a concentrated signal derived from existing data (e.g., feature data might be the moving average of a stock price every minute, rather than every individual stock price that changes many times per second); it can also include things like customer demographics, product prices, or website traffic data. Feature data may be computed at regular intervals (hourly, daily, etc.) or it might be real-time data (the current Nasdaq Composite Index price). The variety and the varying cadences of different feature data create complexities in computing, storing, and making that feature data available to operational models.&lt;/p&gt;
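
&lt;p&gt;For illustration, here is how such a concentrated signal might be computed with pandas; the tick values, column names, and the 1-minute window are invented for the example:&lt;/p&gt;

```python
import pandas as pd

# Raw event data: several price ticks per minute (values invented for the example).
ticks = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-08-18 10:00:05", "2023-08-18 10:00:30",
        "2023-08-18 10:01:10", "2023-08-18 10:01:45",
    ]),
    "price": [100.0, 102.0, 101.0, 103.0],
})

# Feature data: a concentrated signal - the mean price per 1-minute window.
features = (
    ticks.set_index("ts")
         .resample("1min")["price"]
         .mean()
         .rename("price_avg_1min")
         .reset_index()
)
print(features)  # two rows: 101.0 for 10:00, 102.0 for 10:01
```

&lt;p&gt;The same pattern applies whether the cadence is one minute, one hour, or one day; only the resampling window changes.&lt;/p&gt;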

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuec9n99cyprnycz4wy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuec9n99cyprnycz4wy1.png" alt="Figure 2. Feature Stores in a nutshell"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Delving into the technical specifics, a feature store is a unified platform that consists of three internal data systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;an &lt;a href="https://www.hopsworks.ai/dictionary/offline-store" rel="noopener noreferrer"&gt;offline store&lt;/a&gt; for large volumes of historical feature data used for training and for batch predictions, &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;an &lt;a href="https://www.hopsworks.ai/dictionary/online-store" rel="noopener noreferrer"&gt;online store&lt;/a&gt; for feature data that is read at low latency by online (interactive) AI applications; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and a metadata layer that stores information about the feature store itself, and metadata about the feature data, users, computations, and so on. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
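
&lt;p&gt;A minimal in-memory sketch can make these three systems concrete; the class and method names below are illustrative only, not any vendor's API:&lt;/p&gt;

```python
from collections import defaultdict

class FeatureStoreSketch:
    """Toy model of the three internal systems of a feature store."""

    def __init__(self):
        self.offline = defaultdict(list)  # full history: training and batch predictions
        self.online = {}                  # latest value per key: low-latency reads
        self.metadata = {}                # descriptions, owners, usage constraints

    def register(self, feature, description):
        self.metadata[feature] = {"description": description}

    def write(self, feature, key, value):
        # A single write path keeps the offline and online stores consistent.
        self.offline[feature].append((key, value))
        self.online[(feature, key)] = value

    def training_data(self, feature):
        return self.offline[feature]        # all historical rows

    def serve(self, feature, key):
        return self.online[(feature, key)]  # latest value only

store = FeatureStoreSketch()
store.register("avg_txn_amount", "7-day average transaction amount per customer")
store.write("avg_txn_amount", "customer_42", 31.5)
store.write("avg_txn_amount", "customer_42", 33.0)

print(store.training_data("avg_txn_amount"))         # both historical rows
print(store.serve("avg_txn_amount", "customer_42"))  # 33.0
```

&lt;p&gt;Real systems differ enormously in scale and guarantees, but the split - full history for training, latest values for serving, metadata for governance - is the essential shape.&lt;/p&gt;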

&lt;p&gt;Feature stores solve two major problems in operational machine learning: how to deliver the right feature data in the right format for training or prediction, and how to compute and manage feature data from disparate and disjointed sources. &lt;/p&gt;

&lt;p&gt;Since their inception, feature stores have widely extended their capabilities and are the foundation of most operational machine learning platforms in companies that provide operational AI services. Feature stores now support collaboration across data and operations teams, data re-use, versioning features and models, governance, monitoring, and more. &lt;/p&gt;

&lt;h2&gt;
  
  
  3- Roadmap to implement a feature store: MVP
&lt;/h2&gt;

&lt;p&gt;After you have identified the need for faster implementation of your models, you need to set up a framework. As a result of helping and collaborating with hundreds of professionals in Fortune 500 companies, we have identified the most efficient model: create an end-to-end minimum viable product. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27x9oaoxtwvr14pwlwgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27x9oaoxtwvr14pwlwgb.png" alt="Figure 3. MVPs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The MVP approach allows you to be more nimble in making decisions and changes to your pipelines and with the appropriate technology it empowers your organization to generate value faster and helps onboard stakeholders in the process. &lt;/p&gt;

&lt;p&gt;Beyond the framework you choose, here are other important considerations once you have decided to implement (or even build!) a feature store: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Establish the role of the feature store, for you&lt;/strong&gt;: Feature stores play essential functions in the management of machine learning systems and provide a unified layer for feature data, but they also typically offer a diverse set of capabilities. It is essential that you first establish the dominant roles you want your feature store to play in your team’s ML ecosystem. As the next points show, leveraging existing tools, skills, and tables is important. Spend time establishing a clear picture of what your feature store should accomplish for your organization. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build or buy:&lt;/strong&gt; decide where you can add the most value to your ML system, fastest. Do you need to first build the infrastructure for managing ML data (build), or can you start by building ML pipelines for your data on top of existing ML infrastructure (buy)? While building gives you the most tailor-made solution for your problem, it can also build a cage around your needs and costs: you are in essence architecting a system for your current requirements, and committing to upgrading that system to keep up with useful new features that appear in other feature stores. Buying lets you focus on building the pipelines and models that add value to your applications and/or services.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage existing skills and tools:&lt;/strong&gt; build the ML pipelines and models using the languages and frameworks your team already uses. Consider which parts of your existing data infrastructure are already efficiently implemented and could be extended to the feature store. For example, could you reuse existing tables containing feature data in Snowflake without re-implementing your existing pipelines? How well does the feature store integrate with your existing data tools for cataloging, visualization, and pipeline orchestration? &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modularity and future-proofing:&lt;/strong&gt; Most data science teams use Python for data science - creating features, training models, and making predictions (inference). What is the cost for you (and them) of switching to languages or tools that lock you in to a specific vendor and that your team will not see as adding value to their careers? Your feature store should be open, and fit well into the wider technical ecosystem of data science. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; What happens if the service or application you “AI-enable” increases in popularity? Scaling a machine learning system is no small task. It is not merely a matter of handling a larger quantity of data, but it also involves managing an increasing number of features, dealing with a higher frequency of updates, and coordinating more complex model serving scenarios. Your feature store implementation should be flexible enough to allow for that change in scale and be resilient enough to support it. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4- Additional Considerations before Implementing a Feature Store
&lt;/h2&gt;

&lt;p&gt;Before moving to your implementation of a feature store, you might want to tackle a few common misconceptions about feature stores. Some are benign misunderstandings that simply call for a deeper understanding of the technology and its challenges, while others might lead your organization down a rabbit hole from which it will take months, or even years, to emerge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;“A feature store is just storage”&lt;/strong&gt; This falls under the benign misconceptions. Yes, the feature store provides storage of feature data and metadata. From that perspective anything could be a feature store: a warehouse, a database, or any kind of object storage. And yes, feature stores use the same underlying storage technologies - large volumes of historical data on less expensive object storage, and smaller volumes of “online” data in a low-latency database. The challenge is in building the APIs for ingesting and retrieving consistent data from these stores, and in providing the metadata that enables security and governance. Creating a cohesive ensemble of tools that communicate with the different systems through a productive API is what makes the feature store more than just storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;“We can build it”&lt;/strong&gt; This falls under the not-so-benign misconceptions. Yes, some hyperscale AI organizations may have extremely use-case-specific requirements and large enough budgets to pull this off (and talk loudly about their achievements), but most companies do not need to build a feature store from scratch. There is a huge opportunity cost in first building the infrastructure for AI rather than the ML pipelines and models needed for your actual AI systems. As a feature store vendor we are obviously biased here, but we also have deep experience of the challenges involved. A significant portion of the users of our platform first started by building their own, then reverted to buying over building. Starting by building and transitioning to buying also creates tensions in an organization: personal projects get abandoned, and sunk-cost fallacies often mean decisions are delayed longer than is reasonable, leading to larger costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;“Feature stores are only necessary for large volumes of data”&lt;/strong&gt; This belief derives from the assumption that feature stores are only needed in certain specialized cases, perhaps for large enterprises with large teams, and that an organization may need a feature store only if, and when, it handles large volumes of data. However, it is important not to overlook the other important added value of feature stores: they bring consistency to ML operations, enforce good architectural patterns, connect ML pipelines, enable the reuse of features, and support MLOps best practices around versioning, automated testing, and monitoring. Even with smaller data sets, feature stores can deliver substantial benefits in terms of operations, collaboration, security, and governance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5- Real-world Examples of Feature Store Implementations
&lt;/h2&gt;

&lt;p&gt;Here we cover some real-world examples of companies that successfully implemented feature stores. In particular, we look at how Hopsworks has been leveraged to make a positive impact on the machine learning operations of a few organizations, and at the general lessons we learned along the way while assisting those customers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;America First Credit Union; fast ROI, improved team collaboration&lt;/strong&gt;
At AFCU, Hopsworks’ feature store powers the Enterprise AI services and integrates all operational and analytical data into machine learning processes, allowing for faster development cycles, integration with existing data warehouses, and providing a flexible environment for data scientists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt; AFCU achieved significant gains over their previous process for training models. They saw a 3 to 4 times productivity gain while simplifying their machine learning codebase and pipelines. New features were easier to test and data science workflows were improved. AFCU was able to reduce complexity and increase the readability of new features, with improved visibility and reusability of features across models and use cases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Arbetsförmedlingen, the Swedish National Employment Agency; job recommendation system and end-to-end machine learning.&lt;/strong&gt;
The Swedish National Employment Agency, Arbetsförmedlingen, needed a highly available production environment for AI and was looking for a feature store capable not only of working as a unified data layer but also of managing and orchestrating the workflows and processes around AI, including GPUs for model training and model serving.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt; Arbetsförmedlingen used Hopsworks to quickly serve real-time predictions of suitable job postings. Hopsworks also helped identify discriminatory text in job announcements. The platform offered centralization and collaboration, allowing data scientists to use modern libraries to create feature pipelines and develop AI models in a structured manner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human Exposome Assessment Project; access and analysis of genomic data.&lt;/strong&gt;
HEAP requires large-scale processing of genomic data on Apache Spark and deep learning to analyze large datasets of human exposome data. HEAP has many activities around identifying novel viruses, performing large cohort studies, and identifying genetic mutations that cause diseases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt; HEAP used Hopsworks to lead the delivery of the Informatics Platform and Knowledge Engine. This enabled data warehousing, stream processing, and deep learning with advanced analytics. As a result, HEAP saw a 90% cost reduction, faster data processing, and an integrated data science platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  6- Conclusions and lessons learned
&lt;/h2&gt;

&lt;p&gt;In examining many of our customers’ journeys in implementing a feature store, and as one of the leaders in this new field, we have gleaned valuable insights into how data for AI has emerged as a core capability of operational machine learning systems. Our main conclusion is that companies that implement a feature store see faster iteration and get models into production faster; it also allows them to scale and move to real-time machine learning use cases with more ease. &lt;/p&gt;

&lt;p&gt;There has also been an obvious increase in market interest in feature stores in 2023 (the Snowflake Summit had 4 tracks dedicated to feature stores in the summer of 2023, and Microsoft is the last of the 4 main cloud players to announce its own lightweight implementation of one), which only reinforces the position of the feature store as a core technology in the ML systems and MLOps spaces. &lt;/p&gt;

&lt;p&gt;Here are some additional lessons that we believe are valuable for anyone considering machine learning systems as a whole, and even more so through the scope of a feature store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan for changes:&lt;/strong&gt; Whether it's a matter of adjusting to changes in feature data or supporting new use cases, you need the agility to evolve rapidly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promote collaboration:&lt;/strong&gt; Create an environment that removes the silos between the different data and AI teams; you cannot build a system that is isolated from the business or from the other systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build for high availability and fault tolerance:&lt;/strong&gt; Machine learning models are business-critical operations - and if they are not, they will likely be! Design your ML systems with high availability and fault tolerance in mind.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data governance is not an afterthought:&lt;/strong&gt; Control and audit access to the data and models. ML assets should be versioned and cataloged - as data volumes and complexity increase, managing metadata for ML assets effectively becomes a non-negotiable requirement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design to be part of an ecosystem:&lt;/strong&gt; Do not create a system that operates in isolation, but rather one that is part of a wider ecosystem of data tools. Compatibility is a significant advantage. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, you need a feature store not because we say so, but because your models need a feature store to run in production. You need a feature store to generate value from your models.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Free Serverless ML Course with Python</title>
      <dc:creator>MagicLex</dc:creator>
      <pubDate>Tue, 20 Sep 2022 14:45:22 +0000</pubDate>
      <link>https://dev.to/magiclex/free-serverless-ml-course-with-python-41</link>
      <guid>https://dev.to/magiclex/free-serverless-ml-course-with-python-41</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nnz9tF6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1t1rdlcblv7oov8ob4ab.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nnz9tF6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1t1rdlcblv7oov8ob4ab.jpg" alt="Serverless ML" width="857" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Build Batch and Real-Time Prediction Services with Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ready to learn how to use Python to build a serverless service for free? This free course has you covered - the Serverless ML Course.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You should not need to be an expert in Kubernetes or cloud computing to build an end-to-end service that makes intelligent decisions with the help of an ML model. Serverless ML makes it easy to build a system that uses ML models to make predictions. You do not need to install, upgrade, or operate any systems. You only need to be able to write Python programs that can be scheduled to run as pipelines. The features and models your pipelines produce are managed by a serverless feature store / model registry. We will also show you how to build a UI for your prediction service by writing Python and some HTML.&lt;/p&gt;
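
&lt;p&gt;For a flavor of what such a pipeline looks like, here is a hedged sketch of a feature pipeline as a plain Python function that a scheduler (for example, a cron-triggered GitHub Action) could run; the data source and the feature itself are invented for the example, not taken from the course:&lt;/p&gt;

```python
import datetime

def fetch_raw_rows():
    # Stand-in for reading live data (in a real pipeline, an API call).
    return [
        {"city": "stockholm", "temp_c": 18.0},
        {"city": "stockholm", "temp_c": 20.0},
    ]

def feature_pipeline():
    """Turn raw rows into one feature row; a scheduler runs this, e.g., daily."""
    rows = fetch_raw_rows()
    temps = [row["temp_c"] for row in rows]
    feature_row = {
        "city": "stockholm",
        "avg_temp_c": sum(temps) / len(temps),
        "computed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # In the course, this row would be written to the serverless
    # feature store rather than just returned.
    return feature_row

print(feature_pipeline()["avg_temp_c"])  # 19.0
```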

&lt;h2&gt;
  
  
  Learning Outcomes:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Learn to develop and operate AI-enabled (prediction) services on serverless infrastructure&lt;/li&gt;
&lt;li&gt;Develop and run serverless feature pipelines &lt;/li&gt;
&lt;li&gt;Deploy features and models to serverless infrastructure&lt;/li&gt;
&lt;li&gt;Train models and run batch/inference pipelines&lt;/li&gt;
&lt;li&gt;Develop a serverless UI for your prediction service&lt;/li&gt;
&lt;li&gt;Learn MLOps fundamentals: versioning, testing, data validation, and operations&lt;/li&gt;
&lt;li&gt;Develop and run a real-time serverless machine learning system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Course Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pandas and ML Pipelines in Python. Write your first serverless App.&lt;/li&gt;
&lt;li&gt;The Feature Store for Machine Learning. Feature engineering for a credit-card fraud serverless App.&lt;/li&gt;
&lt;li&gt;Training Pipelines and Inference Pipelines&lt;/li&gt;
&lt;li&gt;Bring a Prediction Service to Life with a User Interface (Gradio, Github Pages, Streamlit)&lt;/li&gt;
&lt;li&gt;Automated Testing and Versioning of features and models&lt;/li&gt;
&lt;li&gt;Real-time serverless machine learning systems. Project presentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who is the target audience?
&lt;/h2&gt;

&lt;p&gt;You have taken a course in machine learning (ML) and you can program in Python. You want to take the next step beyond training models on static datasets in notebooks. You want to be able to build a prediction service around your model. Maybe you work at an Enterprise and want to demonstrate your models’ value to stakeholders in the stakeholder's own language. Maybe you want to include ML in an existing application or system.&lt;/p&gt;

&lt;h2&gt;
  
  
   Why is this course different?
&lt;/h2&gt;

&lt;p&gt;You don’t need any operations experience beyond using GitHub and writing Python code. You will learn the essentials of MLOps: versioning artifacts, testing artifacts, validating artifacts, and monitoring and upgrading running systems. You will work with raw and live data - you will need to engineer features in pipelines. You will learn how to select, extract, compute, and transform features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Will this course cost me money?
&lt;/h2&gt;

&lt;p&gt;No. You will become a serverless machine learning engineer without having to pay to run your serverless pipelines or to manage your features, models, or user interface. We will use GitHub Actions and Hopsworks, both of which have generous, time-unlimited free tiers.  &lt;/p&gt;

&lt;p&gt;Register now at &lt;a href="https://www.serverless-ml.org"&gt;Serverless ML Course&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building the Data API for MLOps — 4 years of lessons learnt</title>
      <dc:creator>MagicLex</dc:creator>
      <pubDate>Wed, 31 Aug 2022 08:05:29 +0000</pubDate>
      <link>https://dev.to/magiclex/building-the-data-api-for-mlops-4-years-of-lessons-learnt-2n3a</link>
      <guid>https://dev.to/magiclex/building-the-data-api-for-mlops-4-years-of-lessons-learnt-2n3a</guid>
      <description>&lt;p&gt;When it comes to feature stores, there are two main approaches to feature engineering. One approach is to build a domain specific language (DSL) that covers all the possible feature engineering steps (e.g., aggregations, dimensionality reduction and transformations) that a data scientist might need. The second approach is to use a general purpose framework for feature engineering, based on DataFrames (Pandas or Spark), to enable users to do feature engineering using their favorite framework. The DSL approach requires re-writing any existing feature engineering pipelines from scratch, while the DataFrames approach is backwards compatible with existing feature engineering code written for Pandas or Spark.&lt;/p&gt;

&lt;p&gt;At Hopsworks, we pioneered the &lt;a href="https://www.hopsworks.ai/post/feature-store-the-missing-data-layer-in-ml-pipelines"&gt;DataFrames approach and our APIs reflect that&lt;/a&gt;. There are a couple of reasons behind this choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flexibility: While the very first feature stores (e.g., Uber’s &lt;a href="https://www.uber.com/en-SE/blog/michelangelo-machine-learning-platform/"&gt;Michelangelo&lt;/a&gt;, AirBnB’s &lt;a href="https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/68114.html"&gt;Zipline&lt;/a&gt;) adopted a DSL, they were developed inside a company for the specific needs of that company. Hopsworks is a general purpose feature store, and our customers run a plethora of different use cases on Hopsworks: anomaly detection, recommendations, time-series forecasting, NLP, etc. For Hopsworks to be able to support all these different use cases, we need to provide data scientists with a powerful and flexible API to achieve what they need. These tools already exist and are represented by Pandas, PySpark and the vast array of libraries that the Python ecosystem already provides. New Python libraries for feature engineering are usable immediately in feature engineering pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User experience: The Hopsworks feature store is meant to be a collaborative and productivity enhancing platform for teams. We believe that if users had to learn a new DSL to be able to create features it would drastically reduce the feature store’s attractiveness to developers, due to the need to learn a new framework with limited transferable skills — the DSL is seen as a form of vendor lock-in. With Hopsworks users can keep using the same libraries and processes they already know how to use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bring your own pipeline: The vast majority of Hopsworks users are working on brownfield projects. That is to say, they already have feature engineering pipelines running in their environment and they are not starting from scratch. Having a DSL would mean having to re-engineer those pipelines to be able to use a feature store. With Hopsworks and the DataFrame APIs, it doesn’t matter if the pipelines are using Python, PySpark or SQL. Users just need to change the output of the pipeline to write a DataFrame to Hopsworks instead of saving the DataFrame to an S3 bucket or a data warehouse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
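&lt;p&gt;To make the last point concrete, here is a minimal sketch in plain Pandas. The feature engineering logic is untouched; only the final sink changes. The Hopsworks calls are shown as comments because they assume a live connection (&lt;code&gt;fs&lt;/code&gt; and &lt;code&gt;fg&lt;/code&gt; stand in for a feature store handle and a feature group).&lt;/p&gt;

```python
import pandas as pd

# An existing pandas feature pipeline: nothing here is Hopsworks-specific.
transactions = pd.DataFrame({
    "cc_num": [111, 111, 222],
    "datetime": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01"]),
    "amount": [25.0, 310.0, 12.5],
})

# Feature engineering with plain pandas: a per-card mean transaction amount.
transactions["amount_mavg"] = (
    transactions.groupby("cc_num")["amount"].transform("mean")
)

# Previously the pipeline might have ended with:
#   transactions.to_parquet("s3://bucket/features/transactions.parquet")
# With Hopsworks, only this last step changes (sketch, not run here):
#   fg = fs.get_or_create_feature_group(name="transactions", version=1, ...)
#   fg.insert(transactions)
```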

&lt;h1&gt;
  
  
  Write API — Feature Group
&lt;/h1&gt;

&lt;p&gt;The Hopsworks feature store is agnostic to where feature engineering runs. You can run the process in different environments: from Colab to Snowflake, from Databricks to SageMaker, and on Hopsworks itself. The only requirement is to use the HSFS (Hopsworks Feature Store) library to interact with Hopsworks.&lt;/p&gt;

&lt;p&gt;At the end of your feature pipelines, when you have the final DataFrame, you can register it with Hopsworks using the HSFS API. Features in Hopsworks are registered as a table of features, called a feature group.&lt;/p&gt;

&lt;p&gt;Hopsworks provides several write modes to accommodate different use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch Write: This was the default mode prior to version 3.0. It involves writing a DataFrame in batch mode either to the offline feature store, or to the online one, or to both. This mode is still the default when you are writing a Spark DataFrame.&lt;/li&gt;
&lt;li&gt;Streaming Write: This mode was introduced in version 2.0 and expanded in version 3.0. Streaming writes provide very low latency streaming updates to the online store and efficient batch updates to the offline store, while ensuring consistency between the online and offline stores. In Hopsworks 3.0, this is the default mode for Python clients.&lt;/li&gt;
&lt;li&gt;External connectors: This mode allows users to mount tables (of features) in Data Warehouses like Snowflake, Redshift, Delta Lake, and BigQuery as feature groups in Hopsworks. The tables are not copied into Hopsworks, and the Data Warehouse becomes the offline store for Hopsworks. Hopsworks manages the metadata, statistics, access control, and lineage for the external tables. Hopsworks can act as a virtual feature store, where many different Data Warehouses can be offline stores in the same Hopsworks feature store.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Create
&lt;/h2&gt;

&lt;p&gt;Before ingesting the engineered features you need to create a feature group and define its metadata. An example of feature group definition is the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_or_create_feature_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"transactions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Transaction data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;primary_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'cc_num'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'datetime'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;online_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above we define a feature group called transactions. It has a version (read more about our approach to versioning), a description, a primary key, an event time, and optionally a partition key (not set here). The &lt;code&gt;primary_key&lt;/code&gt; is required (1) to join the features in this feature group with features in other feature groups, and (2) to retrieve precomputed features from the online store. A &lt;code&gt;partition_key&lt;/code&gt; enables efficient appends to storage and efficient querying of large volumes of feature data, reducing the amount of data read in pruned queries. The &lt;code&gt;event_time&lt;/code&gt; specifies the column in our feature group containing the timestamp for the row update (when the event happened in the real world), &lt;a href="https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins"&gt;enabling features to be joined together correctly without data leakage&lt;/a&gt;. The &lt;code&gt;online_enabled&lt;/code&gt; attribute defines whether the feature group will be available online for real-time serving.&lt;/p&gt;

&lt;p&gt;As you can see, we have not defined the full schema of the feature group. That’s because the feature names and their data types are inferred from the Pandas DataFrame when writing.&lt;/p&gt;
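&lt;p&gt;For intuition, the inference is essentially a mapping from the DataFrame’s column names and dtypes to feature names and types. The sketch below shows the raw material the store works from; the actual Hopsworks type mapping (linked below) is richer than this illustration:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "cc_num": pd.Series([111, 222], dtype="int64"),
    "datetime": pd.to_datetime(["2023-01-01 10:00", "2023-01-01 11:00"]),
    "amount": [25.0, 310.0],
    "category": ["grocery", "travel"],
})

# The feature names and types a store can read off the DataFrame itself:
inferred_schema = {name: str(dtype) for name, dtype in df.dtypes.items()}
```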

&lt;p&gt;To write the features to Hopsworks, users call &lt;code&gt;fg.insert(df)&lt;/code&gt;, where &lt;code&gt;df&lt;/code&gt; is the Pandas DataFrame. At this stage the platform takes over and creates all the necessary feature metadata and scaffolding. As mentioned above, you can, but do not have to, explicitly specify the schema of the feature group. If you don’t, the feature names and data types are mapped from the columns of the Pandas DataFrame (&lt;a href="https://docs.hopsworks.ai/3.0/user_guides/fs/feature_group/data_types/"&gt;read more on data types and mapping&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Data Validation
&lt;/h2&gt;

&lt;p&gt;In Hopsworks 3.0, we introduced first-class support for Great Expectations for validation of feature data. Developers have the option of registering a Great Expectations suite with a feature group. In this case, before sending the Pandas DataFrame to Hopsworks for writing, the feature store APIs transparently invoke the Great Expectations library and validate the DataFrame. If it complies with the expectations, the write pipeline proceeds and the data is written into the feature store. If the DataFrame doesn’t comply with the expectation suite, an alert can be sent to a configured channel (e.g., Slack or Email). Alert channels are securely defined in Hopsworks.&lt;/p&gt;
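&lt;p&gt;Conceptually, an expectation suite is a set of predicates the DataFrame must satisfy before the write proceeds. The sketch below illustrates that idea in plain Pandas; it is not the Great Expectations API, just the validate-before-write pattern:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"cc_num": [111, 222], "amount": [25.0, -5.0]})

# A stand-in for an expectation suite: each rule returns True when the
# DataFrame complies. (Illustrative only; Great Expectations has its own API.)
expectations = [
    lambda d: d["amount"].ge(0).all(),    # amounts are non-negative
    lambda d: d["cc_num"].notna().all(),  # primary key has no nulls
]

failures = [i for i, rule in enumerate(expectations) if not rule(df)]
# If `failures` is non-empty, the write is stopped and an alert raised
# instead of inserting the DataFrame into the feature group.
```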

&lt;h2&gt;
  
  
  Write
&lt;/h2&gt;

&lt;p&gt;The write pipeline involves the Pandas DataFrame being serialized as Avro and securely written to a Kafka topic. The APIs also take care of serializing complex features like embeddings in such a way that they can be stored correctly.&lt;/p&gt;

&lt;p&gt;From the Kafka topic, the data is picked up immediately by the online feature store service which streams it into the online feature store (RonDB). For offline storage, a job can be scheduled at regular intervals to write the data to the offline feature store. With this “kappa-style” architecture, Hopsworks can guarantee that the online data is available as soon as possible (TM), while at the same time, it can be compacted and written periodically in larger batches to the offline feature store to take advantage of the performance improvements given by large files in systems like Spark, S3 and HopsFS. Finally, Kafka only ensures at-least-once semantics for features written to the Kafka topic, but we ensure the correct, consistent replication of data to online and offline stores using idempotent writes to the online store, and ACID updates with duplicate record removal to the offline store.&lt;/p&gt;
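&lt;p&gt;The idempotence argument can be shown in a few lines of Python: if writes are keyed upserts by primary key, re-applying a duplicate delivery leaves the store unchanged, which is why at-least-once delivery from Kafka is safe. This is a sketch of the semantics only, not of the RonDB implementation:&lt;/p&gt;

```python
# At-least-once delivery means the same record can arrive twice.
events = [
    {"cc_num": 111, "datetime": 1, "amount": 25.0},
    {"cc_num": 111, "datetime": 1, "amount": 25.0},   # duplicate delivery
    {"cc_num": 111, "datetime": 2, "amount": 310.0},  # newer update
]

online_store = {}
for ev in events:
    key = ev["cc_num"]
    # Upsert: keep the row only if it is not older than what we already have,
    # so duplicates and reordered retries cannot corrupt the latest value.
    if key not in online_store or ev["datetime"] >= online_store[key]["datetime"]:
        online_store[key] = ev
```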

&lt;h2&gt;
  
  
  Statistics Computation
&lt;/h2&gt;

&lt;p&gt;Finally, after the data has been written to the offline feature store, its statistics are updated. For each feature group, by default, Hopsworks transparently computes descriptive statistics, feature distributions, and the correlation matrix for the features in the feature group. Statistics are then presented in the UI for users to explore and analyze.&lt;/p&gt;
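&lt;p&gt;As a point of reference, the same kind of statistics can be reproduced on any DataFrame with Pandas (Hopsworks computes them transparently on the server side):&lt;/p&gt;

```python
import pandas as pd

fg_df = pd.DataFrame({
    "amount": [25.0, 310.0, 12.5, 99.0],
    "loc_delta": [0.1, 3.2, 0.0, 1.5],
})

descriptive = fg_df.describe()  # count, mean, std, min, quartiles, max
correlations = fg_df.corr()     # pairwise Pearson correlation matrix
```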

&lt;h1&gt;
  
  
  Read API — Feature View
&lt;/h1&gt;

&lt;p&gt;The feature view is a new abstraction introduced in Hopsworks 3.0. Feature views are the gateway for users to access feature data from the feature store. At its core, a feature view represents the information about which features, from which feature groups, a model needs. Feature views contain only metadata about features, similar to how views in databases contain information about tables. In contrast to database views, however, feature views can also extend the features (columns) with feature transformations — more on this later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Selection
&lt;/h2&gt;

&lt;p&gt;The first step to create a feature view is to select a set of features from the feature store. Features can be selected from different feature groups which are joined together. Hopsworks provides a Pandas-style API to select and join features from different feature groups. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get a reference to the transactions and aggregation feature groups
&lt;/span&gt;&lt;span class="n"&gt;trans_fg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
   &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'transactions_fraud_batch_fg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;window_aggs_fg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
   &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'transactions_4h_aggs_fraud_batch_fg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Select features from feature groups and join them with other features
&lt;/span&gt;&lt;span class="n"&gt;ds_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trans_fg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"fraud_label"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="s"&gt;"age_at_transaction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"days_until_card_expires"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="s"&gt;"loc_delta"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;\
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_aggs_fg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select_except&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"cc_num"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the Hopsworks feature store does on your behalf is to transpile the Pandas-like code into a complex SQL query that implements a &lt;a href="https://www.hopsworks.ai/post/a-spark-join-operator-for-point-in-time-correct-joins"&gt;point-in-time correct JOIN&lt;/a&gt;. As an example, the above snippet gets transpiled into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;right_fg0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="o"&gt;*&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`fraud_label`&lt;/span&gt; &lt;span class="nv"&gt;`fraud_label`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`category`&lt;/span&gt; &lt;span class="nv"&gt;`category`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`amount`&lt;/span&gt; &lt;span class="nv"&gt;`amount`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`age_at_transaction`&lt;/span&gt; &lt;span class="nv"&gt;`age_at_transaction`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`days_until_card_expires`&lt;/span&gt; &lt;span class="nv"&gt;`days_until_card_expires`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`loc_delta`&lt;/span&gt; &lt;span class="nv"&gt;`loc_delta`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`cc_num`&lt;/span&gt; &lt;span class="nv"&gt;`join_pk_cc_num`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`datetime`&lt;/span&gt; &lt;span class="nv"&gt;`join_evt_datetime`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_volume_mstd`&lt;/span&gt; &lt;span class="nv"&gt;`trans_volume_mstd`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_volume_mavg`&lt;/span&gt; &lt;span class="nv"&gt;`trans_volume_mavg`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_freq`&lt;/span&gt; &lt;span class="nv"&gt;`trans_freq`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`loc_delta_mavg`&lt;/span&gt; &lt;span class="nv"&gt;`loc_delta_mavg`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`cc_num`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`datetime`&lt;/span&gt; 
          &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; 
            &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`datetime`&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;pit_rank_hopsworks&lt;/span&gt; 
      &lt;span class="k"&gt;FROM&lt;/span&gt; 
        &lt;span class="nv"&gt;`fabio_featurestore`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`transactions_1`&lt;/span&gt; &lt;span class="nv"&gt;`fg1`&lt;/span&gt; 
        &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;`fabio_featurestore`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`transactions_4h_aggs_1`&lt;/span&gt; &lt;span class="nv"&gt;`fg0`&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`cc_num`&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`cc_num`&lt;/span&gt; 
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;`fg1`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`datetime`&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nv"&gt;`fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`datetime`&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;NA&lt;/span&gt; 
  &lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="nv"&gt;`pit_rank_hopsworks`&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`fraud_label`&lt;/span&gt; &lt;span class="nv"&gt;`fraud_label`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`category`&lt;/span&gt; &lt;span class="nv"&gt;`category`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`amount`&lt;/span&gt; &lt;span class="nv"&gt;`amount`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`age_at_transaction`&lt;/span&gt; &lt;span class="nv"&gt;`age_at_transaction`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`days_until_card_expires`&lt;/span&gt; &lt;span class="nv"&gt;`days_until_card_expires`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`loc_delta`&lt;/span&gt; &lt;span class="nv"&gt;`loc_delta`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_volume_mstd`&lt;/span&gt; &lt;span class="nv"&gt;`trans_volume_mstd`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_volume_mavg`&lt;/span&gt; &lt;span class="nv"&gt;`trans_volume_mavg`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`trans_freq`&lt;/span&gt; &lt;span class="nv"&gt;`trans_freq`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nv"&gt;`right_fg0`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`loc_delta_mavg`&lt;/span&gt; &lt;span class="nv"&gt;`loc_delta_mavg`&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; 
    &lt;span class="n"&gt;right_fg0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above SQL statement pulls the data from the specified data sources, e.g. if it is an external feature group defined over a Snowflake table, the SQL query will fetch the necessary data from Snowflake. The HSFS APIs also infer the joining keys based on the largest matching subset of primary keys of the feature groups being joined. This default behavior can be overridden by data scientists who can provide their own joining conditions.&lt;/p&gt;

&lt;p&gt;More importantly though, the query enforces point-in-time correctness of the data being joined. The APIs join each event you want to use for training with the most recent value of each selected feature before the event occurred.&lt;/p&gt;

&lt;p&gt;As you can see above, the query is quite complex and it would be error prone to write manually. The Hopsworks feature store makes it easy for data scientists to select and correctly join features using a Pandas-like API — one they are already familiar with.&lt;/p&gt;
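&lt;p&gt;If you want a feel for the point-in-time semantics on a single join, &lt;code&gt;pandas.merge_asof&lt;/code&gt; implements the same "most recent value at or before the event" rule (the feature store generalizes this across many feature groups and pushes it into SQL):&lt;/p&gt;

```python
import pandas as pd

# Label events: one row per prediction target, with its event time.
labels = pd.DataFrame({
    "cc_num": [111, 111],
    "datetime": pd.to_datetime(["2023-01-02", "2023-01-05"]),
    "fraud_label": [0, 1],
})

# Precomputed aggregate features, updated at their own cadence.
aggs = pd.DataFrame({
    "cc_num": [111, 111, 111],
    "datetime": pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-04"]),
    "trans_freq": [1.0, 4.0, 7.0],
})

# For each label event, take the most recent feature value at or before
# the event time, so no future data leaks into the training row.
pit = pd.merge_asof(
    labels.sort_values("datetime"),
    aggs.sort_values("datetime"),
    on="datetime", by="cc_num", direction="backward",
)
```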

&lt;p&gt;To create a feature view, you call the &lt;code&gt;create_feature_view()&lt;/code&gt; method. You need to provide a name, a version, the query object containing the selected features, and a list of features that will be used as labels (targets) by your model. The label(s) will not be returned when retrieving data for batch or online scoring.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;feature_view&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_feature_view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'transactions_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"fraud_label"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Feature Transformations
&lt;/h2&gt;

&lt;p&gt;Although feature transformations can be performed before features are stored in the feature store, a feature store can increase feature reuse across different models by supporting consistent feature transformations for both offline and online APIs (training and inference). Hopsworks can transparently perform feature transformations with Python UDFs (user-defined functions) when you select features from the feature store. For example, when you select features for use in a feature view, you might decide to normalize a numerical feature in the feature view.&lt;/p&gt;

&lt;p&gt;Let’s look at the implications of only supporting feature transformations before the feature store (as is the case in many well known feature stores). Assume you adopt the OBT ( &lt;a href="https://www.fivetran.com/blog/star-schema-vs-obt"&gt;one big table&lt;/a&gt;) data modeling approach, and store several years of engineered data in a feature group containing data for all your customers. You might have several models that use the same features in that feature group. One model might be trained using those rows with data for only US customers, a second model only uses European customer data, a third model might be trained on the entire history of the data available in the feature group, while a fourth model might be trained on only the last year of data. Each model is trained on a different training dataset. And each of these training datasets has different rows, and hence different descriptive statistics (min, max, mean, standard deviation). Many transformation functions are stateful, using descriptive statistics. For example, normalizing a numerical feature uses the mean value for that feature in the training dataset.&lt;/p&gt;

&lt;p&gt;If you had transformed your features before storing them in the feature store, you could not create the four different training sets using the same feature groups. Instead, you would have one feature group with all the data available for the third model. You would also have the problem of how to train the fourth model on the last year of data. Its descriptive statistics are different from the full dataset, so transformed feature values for the full dataset and the last year of data would be different. You would need to store the last year of data in a different feature group. The same is true for models trained on data for US and EU customers, respectively. With this pattern, the amount of data storage required to store your features and the number of feature groups needed is a function of the number of models you have in production, not the number of features used by your models!&lt;/p&gt;

&lt;p&gt;By applying the transformations only when using the features, the same set of features can be used by all models — meaning you only need to store your feature data once, and your model transforms the feature on-demand. Transforming features before the feature store is, in general, an anti-pattern that increases cost both in terms of storage but also in terms of the number of feature pipelines that need to be maintained. The only exception to this rule is high value online models where online transformation latency is too high for the use case, but this is a rare exception to the rule (that is anyway supported in Hopsworks).&lt;/p&gt;
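&lt;p&gt;The dependence of transformations on training-dataset statistics is easy to demonstrate. In this sketch, the same stored feature value transforms to different model inputs depending on which rows made up each model’s training set, which is exactly why the statistics must travel with the training dataset rather than be baked into the stored feature:&lt;/p&gt;

```python
import pandas as pd

# One stored (untransformed) feature shared by several models.
amounts = pd.Series([10.0, 20.0, 30.0, 40.0], name="amount")

def min_max(train):
    """Return a scaler whose statistics come from one training dataset."""
    lo, hi = train.min(), train.max()
    return lambda x: (x - lo) / (hi - lo)

scale_full = min_max(amounts)             # model trained on all rows
scale_recent = min_max(amounts.iloc[2:])  # model trained on recent rows only

# The same stored value maps to different model inputs per model.
v_full = scale_full(30.0)      # statistics from the full history
v_recent = scale_recent(30.0)  # statistics from the recent subset
```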

&lt;p&gt;You can specify what features to transform and the transformation functions to apply to those features by providing a dictionary of features and transformation functions. Hopsworks comes with a set of built-in transformation functions (such as &lt;code&gt;min_max_scaler&lt;/code&gt; and &lt;code&gt;label_encoder&lt;/code&gt;). You can also define and register custom transformation functions as Python functions that take the feature as input and return the transformed feature as output.&lt;/p&gt;

&lt;p&gt;The feature view stores the list of features and any transformation function applied to those features. Transformation functions are then transparently applied both when generating the training data and when generating batches or single feature vectors for inference. The feature view also stores the descriptive statistics for each versioned training dataset it creates, enabling transformation functions to use the correct descriptive statistics when they are applied. For example, if the fourth model above, trained on only the last year of data, used training dataset version 4, then its transformation functions would use the descriptive statistics (and any other state needed) from version 4 of the training data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load transformation functions.
&lt;/span&gt;&lt;span class="n"&gt;min_max_scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
    &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_transformation_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"min_max_scaler"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;label_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
    &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_transformation_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"label_encoder"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Map features to transformations.
&lt;/span&gt;&lt;span class="n"&gt;transformation_functions_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;label_encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"trans_volume_mavg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"trans_volume_mstd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"trans_freq"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"loc_delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"loc_delta_mavg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"age_at_transaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"days_until_card_expires"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;min_max_scaler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training Data
&lt;/h2&gt;

&lt;p&gt;As mentioned already, training data is generated using a feature view. The feature view holds the information on which features are needed and which transformation functions need to be applied.&lt;/p&gt;

&lt;p&gt;Training data can be automatically split into train, test and validation sets. When that happens, the necessary statistics for the transformation functions are automatically computed only on the train set. This prevents leakage of information from the validation and test set into the model trained on the train set.&lt;/p&gt;

&lt;p&gt;Training data can be generated on the fly as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature_view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_validation_test_splits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, users can launch a Hopsworks job that generates and stores the training data as files in a desired file format (e.g., CSV, TFRecord). This is useful, for instance, when your training data does not fit in a Pandas DataFrame, but your model training pipeline can incrementally load training data from files, as TensorFlow does with its DataSet API for files stored in TFRecord format.&lt;/p&gt;
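&lt;p&gt;The incremental-loading idea can be sketched in plain Python (file names and the CSV format here are illustrative, standing in for TFRecord shards on disk): a generator streams training examples in fixed-size batches, so the full dataset never has to fit in memory.&lt;/p&gt;

```python
# Hypothetical sketch of incremental training-data loading from files,
# the same idea TensorFlow's Dataset API applies to TFRecord shards.
import csv
import io

def batched_examples(files, batch_size):
    """Yield lists of rows, at most batch_size at a time, across all files."""
    batch = []
    for f in files:
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

# Two in-memory "files" stand in for training-data shards on disk.
shard_a = io.StringIO("1,0.5\n2,0.7\n3,0.1\n")
shard_b = io.StringIO("4,0.9\n")
for batch in batched_examples([shard_a, shard_b], batch_size=2):
    print(batch)  # the training loop consumes one batch at a time
```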

&lt;h1&gt;
  
  
  Prediction Services
&lt;/h1&gt;

&lt;p&gt;When it comes to putting a model into production, there are two classes of prediction services we can build with models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analytical Models: Models where inference happens periodically and in batches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operational Models: Models where inference happens in real time with strict latency requirements.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For Analytical Models&lt;/strong&gt;, best practice dictates that the inference pipeline should be set up such that the data to be scored is already available in the feature store. In practice, this means the new (unseen) feature data is extracted from the feature groups, transformed, and returned as DataFrames or files. A batch scoring program then loads the correct model version and performs inference on the new data, storing the predictions in some sink (which could be an operational database, or even another feature group in Hopsworks).&lt;/p&gt;

&lt;p&gt;By setting up the feature pipeline such that the same data is feature-engineered for both training and inference in feature groups, the same inference data can then be used in future iterations of model training when the actual outcomes of the batch inference predictions become known and are stored in the feature store.&lt;/p&gt;

&lt;p&gt;To retrieve the batch inference data, you can use the &lt;code&gt;get_batch_data&lt;/code&gt; method. You need to provide a time interval for the window of data you need to score. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transactions_to_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
     &lt;span class="n"&gt;feature_view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_batch_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
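&lt;p&gt;The full batch-inference flow described above can be sketched end to end with stand-ins for the feature store, model, and sink (all names below are hypothetical, not Hopsworks APIs): fetch the rows in the scoring window, apply the model, and write the predictions out.&lt;/p&gt;

```python
# Minimal sketch of batch inference: window selection, scoring, and a sink.

def get_batch_data(store, start_time, end_time):
    """Return the feature rows that fall inside the scoring window."""
    return [row for row in store if start_time <= row["ts"] < end_time]

def score(model, rows):
    """Apply the model to each row and attach the prediction."""
    return [{**row, "prediction": model(row)} for row in rows]

# A toy in-memory "feature store" and a stand-in fraud model.
feature_store = [
    {"cc_num": "1111", "ts": 10, "amount": 20.0},
    {"cc_num": "2222", "ts": 15, "amount": 900.0},
    {"cc_num": "3333", "ts": 25, "amount": 5.0},  # outside the window
]
fraud_model = lambda row: row["amount"] > 500.0

# Predictions land in a sink; in production this would be an operational
# database or another feature group.
sink = score(fraud_model, get_batch_data(feature_store, start_time=0, end_time=20))
print(sink)
```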



&lt;p&gt;&lt;strong&gt;For Operational Models&lt;/strong&gt;, predictions need to be served under strict latency requirements. In practice, this means the feature data needs to be fetched from the online feature store. Typically, only one feature vector, or a small set of them, is scored by an online inference pipeline. The feature view APIs provide a way to retrieve feature vectors from the online feature store. In this case, users need to provide the primary-key values for the feature groups used in the feature view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transactions_to_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
       &lt;span class="n"&gt;feature_view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'cc_num'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'12124324235'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, for some use cases, some of the features needed to make a prediction are only known at runtime. For these, you can explicitly include the features and their untransformed values in the feature vector retrieval call, indicating that these features are provided by the client. The feature view then applies the feature transformations both to the values retrieved from the online feature store and to the client-provided values supplied in real time.&lt;/p&gt;
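&lt;p&gt;The merge-then-transform behaviour can be illustrated with a small sketch (all names, statistics, and the store itself are hypothetical, not the Hopsworks API): stored features and client-provided runtime features are combined into one raw vector, then the same transformation is applied to every feature.&lt;/p&gt;

```python
# Hypothetical sketch: serving a feature vector where "amount" is only
# known at request time, while the other features come from the online store.

online_store = {"1111": {"avg_spend_30d": 120.0, "num_tx_7d": 4}}

# Transformation statistics, assumed precomputed on the training split.
stats = {
    "avg_spend_30d": (0.0, 1000.0),
    "num_tx_7d": (0.0, 50.0),
    "amount": (0.0, 2000.0),
}

def min_max(v, lo, hi):
    return (v - lo) / (hi - lo)

def get_feature_vector(entry, passed_features):
    """Merge stored and client-provided features, then transform consistently."""
    raw = {**online_store[entry], **passed_features}
    return {name: min_max(v, *stats[name]) for name, v in raw.items()}

# "amount" is supplied by the client at request time.
vector = get_feature_vector("1111", {"amount": 500.0})
print(vector)
```

The key point is that the client-provided value passes through the same transformation (with the same training-split statistics) as the stored values, so the model sees a consistently scaled vector.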

&lt;h1&gt;
  
  
  Get started
&lt;/h1&gt;

&lt;p&gt;As always, you can get started building great models on Hopsworks using our serverless deployment. You don’t have to connect a cloud account or deploy anything; just register on &lt;a href="https://app.hopsworks.ai"&gt;app.hopsworks.ai&lt;/a&gt; and start building.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.hopsworks.ai"&gt;https://www.hopsworks.ai&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>api</category>
      <category>mlops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Connecting Python to the Modern Data Stack.</title>
      <dc:creator>MagicLex</dc:creator>
      <pubDate>Wed, 10 Aug 2022 12:30:07 +0000</pubDate>
      <link>https://dev.to/magiclex/connecting-python-to-the-modern-data-stack-h1p</link>
      <guid>https://dev.to/magiclex/connecting-python-to-the-modern-data-stack-h1p</guid>
      <description>&lt;p&gt;In recent years, the &lt;a href="https://www.fivetran.com/blog/what-is-the-modern-data-stack"&gt;Modern Data Stack&lt;/a&gt;, a suite of frameworks and tools has emerged in the world of Data and Business Intelligence and is dominating in enterprise data; Data Lakes and Warehouses, ETL and Reverse tools, Orchestration, Monitoring and much more. &lt;/p&gt;

&lt;p&gt;One big unfilled hole in the MDS is Enterprise AI. Machine learning is &lt;a href="https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038"&gt;dominated&lt;/a&gt; by Python tools and libraries. There have been attempts to transpile Python code to SQL for the MDS, but it is highly unlikely that Data Scientists will start performing dimensionality reduction, variable encodings, and model training/evaluation in user-defined functions and SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MDS: a SQL-centric paradigm.&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FkrjXGIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rnzitt0s11yt467ibkyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FkrjXGIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rnzitt0s11yt467ibkyd.png" alt="The SQL Centric " width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SQL is ubiquitous across the modern data stack, from data ingestion to the many forms of data transformation that run between data warehouses, lakes, and BI tools. SQL is the language of choice for the MDS. Its declarative nature makes it easier to scale out compute to process large volumes of data than a general-purpose programming language like Python, which lacks native distributed-computing support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning: a Python-centric world.&lt;/strong&gt;&lt;br&gt;
Just as SQL dominates analytics, Python dominates data science. Python’s grip on machine learning is so pervasive that &lt;a href="https://survey.stackoverflow.co/2022/#most-popular-technologies-misc-tech"&gt;Stack Overflow’s survey results from June 2022&lt;/a&gt; show that Pandas, NumPy, TensorFlow, Scikit-Learn, and PyTorch are all in the top 11 most popular frameworks and libraries across all languages. Python has shown itself flexible enough for use within notebooks for prototyping and reporting, for production workflows (such as in Airflow), for parallel processing (PySpark, Ray, Dask), and now even for data-driven user interfaces (Streamlit). In fact, entire serverless ML systems, with feature pipelines, batch prediction pipelines, and a user interface, can be written in Python, as in this &lt;a href="https://www.youtube.com/watch?v=AIof4woJSkY"&gt;Surf Prediction System from PyData London.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern Data Stack vs Modern AI Stack: closing the gap&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y81i4q34--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eweldgh8msxeqm5szff7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y81i4q34--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eweldgh8msxeqm5szff7.png" alt="Modern Data Stack vs Modern AI Stack" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One reason models never make it to production is simply that the production MDS is not designed to make it easy to productionize machine learning models written in Python. Data scientists and ML engineers are often left with prototypes that work on data dumps: feature pipelines written in Python that are not connected to the MDS, and inference pipelines that cannot make use of historical or contextual features because they are not connected back to existing data infrastructure. Snowflake introduced Snowpark, acknowledging the need for general-purpose Python support in the MDS, but without its own feature store, Snowpark by itself is not enough. &lt;/p&gt;

&lt;p&gt;How do we empower data scientists to access data in the MDS from Python without overwhelming them with the complexities of SQL and data access control? &lt;a href="https://www.featurestore.org/what-is-a-feature-store"&gt;The Feature Store&lt;/a&gt; is one part of the solution to this problem. It is a new layer that bridges some of the infrastructural gap. However, the first feature stores came from the world of Big Data and have primarily supported Spark, and sometimes Flink, for feature engineering. To date, there has been a noticeable lack of a Python-centric feature store that bridges the gap between the SQL world and the Python world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enter Hopsworks 3.0, the Python-centric feature store.&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ek9rI8T4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4qkntghy8xzxtujrpu1o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ek9rI8T4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4qkntghy8xzxtujrpu1o.jpg" alt="the Python-centric feature store" width="880" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopsworks was the first open-source feature store, released at the end of 2018, and now with the version 3.0 release, it takes a big step to bridge the Modern Data Stack with the machine learning stack in Python. &lt;/p&gt;

&lt;p&gt;With improved read and write APIs for Python, Hopsworks 3.0 allows data scientists to work, share, and interact with production and prototype environments in a Python-centric manner. Hopsworks uses &lt;a href="https://en.wikipedia.org/wiki/Source-to-source_compiler"&gt;transpilation&lt;/a&gt; to bring the power of SQL to its Python SDK and seamlessly transfer data from warehouses to Python for feature engineering and model training. Hopsworks provides a Pandas DataFrame API for writing features, and ensures the consistent replication of features between the online and offline stores. &lt;/p&gt;

&lt;p&gt;Hopsworks 3.0 now comes with support for Great Expectations for data validation in feature pipelines, and custom transformation functions can be written as Python user-defined functions and applied consistently between training and inference.&lt;/p&gt;

&lt;p&gt;There is more to Hopsworks 3.0, and you can read about it in their &lt;a href="https://www.hopsworks.ai/post/hopsworks-3-0-connecting-python-to-the-modern-data-stack"&gt;very recent release blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For a more direct experience, try their newly released serverless &lt;a href="https://app.hopsworks.ai?utm_source=devto&amp;amp;utm_medium=lex&amp;amp;utm_campaign=devto&amp;amp;utm_id=001"&gt;app.hopsworks.ai&lt;/a&gt;, which lets you use Hopsworks 3.0 without any infrastructure requirements: all you need is less than 2 minutes and a Colab notebook. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>sql</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
