Before we dive into what a feature store is, a quick refresher: in machine learning, a feature is a piece of data used as input to a predictive model. It is the x in f(x) = y.
A feature store is an ML-specific system that:
Feature stores aim to solve three problems:
In most cases feature stores add unnecessary complexity and are only suited to specific ML use cases. You might even be asking, "If a feature store is simply making sure the same pre-processing happens on the data, why can't I do that transformation during inference on the raw data?"
There are two scenarios where that isn't viable:
To summarize, a feature store is most valuable when:
During model experimentation, let us assume that our data scientists suggested a slightly different feature structure for our model, one that needs to be served at low latency because some features require server-side processing.
Throughout this tutorial, we will be predicting whether a transaction made by a given user will be fraudulent. This prediction will be made in real-time as the user makes the transaction, so we need to be able to generate a prediction at low latency.
Our system will perform the following workflows:
We will be using an open-source framework called Feast, which is built by the team at Tecton, one of the leading feature-store companies globally. Tecton is a hosted version of Feast and comes with a few more beneficial features such as monitoring. We will then be deploying our application to AWS.
If you don't have it, download the data required for this tutorial from here. It originally comes from a Kaggle dataset for Fraud Detection. Place this dataset in a data directory in the root of your project. You can run this project either in VS Code or in Google Colab here. You can also check out the GitHub repo here.
We're going to convert this dataset into a format that Feast can understand: a parquet file. We also need to add two columns, event_timestamp and created_timestamp, so that Feast can index the data by time. We'll do this by min-max normalizing the TransactionDT column, using the normalized column to assign each row a timestamp between a year ago and the current date, and then adding these columns to the data.
In a Python REPL or as a separate file, run the following code block:
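Here is a minimal sketch of that preprocessing, assuming the Kaggle file has been saved as data/train_transaction.csv and that the output is written to data/transactions.parquet (both paths are placeholders; adjust them to wherever you keep the data):

```python
import pandas as pd

# Load the raw Kaggle transactions file (the path is an assumption).
df = pd.read_csv("data/train_transaction.csv")

# Min-max normalize TransactionDT to the range [0, 1].
dt = df["TransactionDT"]
norm = (dt - dt.min()) / (dt.max() - dt.min())

# Spread the events over the last year, ending at the current date.
end = pd.Timestamp.now()
start = end - pd.Timedelta(days=365)
df["event_timestamp"] = start + pd.to_timedelta(norm * 365, unit="D")
df["created_timestamp"] = pd.Timestamp.now()

# Write the result as a parquet file that Feast can read.
df.to_parquet("data/transactions.parquet", index=False)
```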
Since infrastructure and architecture are not the focus of this tutorial, we will use Terraform to quickly set up our infrastructure in AWS so we can get on with the rest of the tutorial.
Without deviating too much, let us quickly explain what Terraform is and the different components we set up:
The following resources are created by the Terraform file:
Okay, enough geeking out on Terraform. Let's keep moving!
We need to set up our AWS credentials in order to deploy this Terraform setup to our account. To start, make sure you have your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables set. If not, go to your AWS console and follow the instructions below:
If you already have a user, make sure you have the following permissions:
Once a user is created, you can click on your user and go to the tab that says “Security Credentials”. Scroll down and click the button that says “Create access key”. You should then see an Access Key and Secret Key generated for you.
Run the code below, pasting in the generated keys:
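It should look something like the following (the values are placeholders; the region line is optional, but if you set it, keep it consistent with your Terraform setup):

```bash
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=<your-aws-region>  # optional
```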
Install the Terraform framework. We use Homebrew on macOS but you may install it however you prefer.
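With Homebrew, that is:

```bash
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
```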
In your terminal, go to the “infra” folder that came along with this tutorial. We are going to initialise Terraform in this folder and apply the plan. Name the project fraud-classifier.
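Roughly the following; note that the project_name variable is an assumption here, so check the variables defined in the infra folder and pass whatever name the Terraform configuration actually expects:

```bash
cd infra
terraform init
terraform apply -var="project_name=fraud-classifier"
```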
Once your infrastructure is deployed, you should see the following fields in the output in your terminal. Save these; we will need them later.
We are now going to make the data in S3 queryable from Redshift via the Glue data catalog by creating an external schema called spectrum. Use the values from the previous output.
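One way to do this is with the Redshift Data API; the sketch below assumes the cluster identifier and Spectrum IAM role ARN come from your Terraform output, and that the Glue database is also called dev:

```bash
aws redshift-data execute-statement \
    --cluster-identifier <redshift_cluster_identifier> \
    --database dev \
    --db-user admin \
    --sql "create external schema spectrum from data catalog database 'dev' iam_role '<spectrum_iam_role_arn>' create external database if not exists;"
```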
You should then get a JSON result back. Grab the Id field returned, and run a describe-statement below with that value to check if the job completed successfully. You should see a Status of FINISHED.
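For example, substituting the Id you received:

```bash
aws redshift-data describe-statement --id <statement_id>
```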
If that is all running successfully then we are done with our AWS setup!
To get started, let us install the Feast framework. Feast can be installed using pip.
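Installing the AWS extras pulls in the Redshift and DynamoDB dependencies we need:

```bash
pip install 'feast[aws]'
```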
Make sure you now cd back into the root of the project. In Feast, you define your feature store in a repository that contains a .yaml configuration file. To create a repository, run the command below and follow the prompts. The Redshift database name will be dev and the user will be admin. For the staging location, use the s3://fraud-classifier-bucket bucket that was created by the Terraform plan. Use arn:aws:iam::<account_number>:role/s3_spectrum_role as the S3 IAM role.
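The command looks like this (feature_repo is the repository name assumed throughout the rest of the tutorial):

```bash
feast init -t aws feature_repo
```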
This will create a few files in a folder called feature_repo that are mostly example files (you should delete driver_repo.py and test.py) but we only care about:
- feature_store.yaml: This is a configuration file where you will define the location of your Redshift cluster, S3 bucket and DynamoDB Database.
NB: Make sure to use the same AWS region you used in your Terraform setup.
This file contains the following fields:
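A sketch of what the finished file looks like, assuming the resource names from our Terraform setup; the cluster identifier, region, and account number are placeholders you should replace with your own values:

```yaml
project: feature_repo
registry: data/registry.db
provider: aws
online_store:
  type: dynamodb
  region: <your-aws-region>
offline_store:
  type: redshift
  cluster_id: <redshift_cluster_identifier>
  region: <your-aws-region>
  database: dev
  user: admin
  s3_staging_location: s3://fraud-classifier-bucket/staging
  iam_role: arn:aws:iam::<account_number>:role/s3_spectrum_role
```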
Since we are using AWS, we use aws here. However, you can replace that with another cloud provider (e.g. gcp for Google Cloud).
Within the feature_repo folder, create a file called features.py in which we will define our features. Before we get started, we need to understand the concepts of an Entity and a FeatureView:
Fill the file with the following contents:
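A sketch of what features.py can look like, assuming a Feast 0.x-style API; the example columns (transactionamt, card1, card2) and the spectrum.transactions table name are illustrative, so swap in the features and table your setup actually uses:

```python
from datetime import timedelta

from feast import Entity, Feature, FeatureView, RedshiftSource, ValueType

# The entity is the "primary key" that features are joined on.
transaction = Entity(
    name="transaction",
    join_key="transactionid",
    value_type=ValueType.INT64,
)

# SQL that pulls the feature columns from Redshift (table name is an assumption).
transaction_source = RedshiftSource(
    query=(
        "SELECT transactionid, event_timestamp, created_timestamp, "
        "transactionamt, card1, card2 FROM spectrum.transactions"
    ),
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp",
)

# The feature view ties the entity, the source, and the feature schema together,
# keeping 30 days' worth of data.
transaction_features = FeatureView(
    name="transaction_features",
    entities=["transaction"],
    ttl=timedelta(days=30),
    features=[
        Feature(name="transactionamt", dtype=ValueType.FLOAT),
        Feature(name="card1", dtype=ValueType.INT64),
        Feature(name="card2", dtype=ValueType.INT64),
    ],
    batch_source=transaction_source,
)
```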
First we create our transaction entity and define the SQL that will fetch the required features from our Redshift data warehouse. We then create a FeatureView that uses the Redshift source to fetch the features and defines the data type for each feature. We also define how much history we would like the feature view to contain; in this case we want 30 days' worth of data.
Deploy the feature store by running feast apply from within the feature_repo/ folder.
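That is:

```bash
cd feature_repo
feast apply
```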
If everything was created correctly, you should see the following output:
Next we load our features into the online store using the materialize-incremental command. This command loads the latest feature values from a data source into the online store, starting from the last materialize call. There is an alternative command, materialize, that allows you to load features from a specific date range rather than just the latest data. You can read more about it here.
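For example, materializing everything up to the current time (still from within the feature_repo folder; the date expression is just one way of passing "now"):

```bash
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```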
If successful, you should see some activity in your terminal showing that it is uploading the features. Once completed, you should see the results in our DynamoDB instance on AWS. This will take just under 6 minutes, so you may want to grab a coffee!
In our project, we have two files relating to our model:
Let's go through run.py first as it's quite simple. Here, we simply load our training data, train our model, and make a prediction with the online Feast store.
During the initialization of our model, we attach the feature store to our model object to use later. The repo path points to the folder containing the feature_store.yaml and features.py files we created above; Feast fetches its configuration from there.
# Line 57
self.fs = feast.FeatureStore(repo_path="feature_repo")
When we would like to train our model, we want to get the historical data relating to our features. The method below launches a job that executes a join of features from the offline store onto the entity dataframe.
An entity dataframe is the target dataframe on which you would like to join feature values. The entity dataframe must contain a timestamp column called event_timestamp and all the entities (primary keys) necessary to join feature tables onto it. All entities found in the feature views being joined onto the entity dataframe must be present as columns on the entity dataframe. In our case, transactions contains a column called 'transactionid' which we use to get all the transaction features. We should also ensure the target variable is attached to the entity dataframe.
Once completed, a job reference will be returned. This job reference can then be converted to a Pandas dataframe by calling to_df().
# Line 66
training_df = self.fs.get_historical_features(
    entity_df=transactions[["transactionid", "event_timestamp", "isfraud"]],
    features=self.feast_features,
).to_df()
When we do online inference (prediction) using our model, we don't want to fetch all the historical data (or really anything at all) from our data warehouse, since that would take multiple seconds. Instead, we want to get the data we need from a low-latency data source so we can keep the response time low (~100ms). We do that below with the get_online_features function.
# Line 108
transaction = request["transactionid"][0]
return self.fs.get_online_features(
    entity_rows=[{"transactionid": transaction}],
    features=self.feast_features,
).to_dict()
The above allows us to pass in a specific transaction and get its feature values almost instantaneously. We can then use these values in our predict function to return a prediction for the transaction.
Now let us run our run.py file to see this live and view the output of our model:
python run.py
That's it for our tutorial on feature stores! Feature stores can add a lot of value to your ML infrastructure when it comes to reusing the same features across multiple models and doing server-side feature calculations, but they can also add some complexity. Feast is a great way to implement this, but if you want a more managed approach with extra functionality such as identifying model drift, you can try Tecton, or the feature stores that are native to the AWS and Google Cloud platforms.