Open Source Feature Stores

There are currently two main open-source feature stores used in production: Hopsworks and Feast.

Hopsworks Feature Store was the first open-source feature store, released in December 2018, followed shortly thereafter by Feast. Hopsworks is available under the AGPL-v3 license, while Feast is available under the Apache v2 license. Hopsworks is developed by Logical Clocks, while Feast is developed primarily by Tecton and GoJek and is hosted by the Linux Foundation (LF AI & Data).

Hopsworks

Hopsworks is a stand-alone platform that you install with an installer. Out of the box, it provides support for:

  • feature computation (Spark, PySpark, Python, Spark Streaming);
  • offline feature storage using Hive and HopsFS;
  • online feature storage using RonDB, which also serves as the backing database for the Hive metastore and for Hopsworks itself;
  • streaming ingestion with Apache Kafka;
  • a complete data science platform (optional) with Jupyter notebooks and Jobs for model training and feature engineering with Python, Spark, or Flink;
  • model serving support with TensorFlow Serving and Flask.
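
The split between offline and online feature storage in the list above can be illustrated with a toy in-memory model. This is only a sketch of the concept, not the Hopsworks API: the class `ToyFeatureStore`, its methods, and the feature name `clicks_7d` are all hypothetical. The offline store keeps the full history of feature values (the role played by Hive and HopsFS), while the online store keeps only the latest row per entity for low-latency lookups (the role played by RonDB).

```python
from datetime import datetime


class ToyFeatureStore:
    """Illustrative only: a hypothetical class, not the Hopsworks API."""

    def __init__(self):
        # Offline store: append-only history per entity (Hive/HopsFS role).
        self.offline = {}
        # Online store: latest feature row per entity (RonDB role).
        self.online = {}

    def ingest(self, entity_id, features, event_time):
        row = {"event_time": event_time, **features}
        # Every row is kept offline so training data can be rebuilt later.
        self.offline.setdefault(entity_id, []).append(row)
        # The online row is overwritten only by an event that is not older.
        current = self.online.get(entity_id)
        if current is None or event_time >= current["event_time"]:
            self.online[entity_id] = row

    def get_online(self, entity_id):
        """Return the latest feature row, or None if the entity is unknown."""
        return self.online.get(entity_id)

    def get_history(self, entity_id):
        """Return all rows for the entity, ordered by event time."""
        rows = self.offline.get(entity_id, [])
        return sorted(rows, key=lambda r: r["event_time"])


store = ToyFeatureStore()
store.ingest("user_1", {"clicks_7d": 3}, datetime(2021, 1, 1))
store.ingest("user_1", {"clicks_7d": 5}, datetime(2021, 1, 2))

print(store.get_online("user_1")["clicks_7d"])  # → 5
print(len(store.get_history("user_1")))         # → 2
```

Model training would read from the history, while a model being served would read only the latest row.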

The open-source version of Hopsworks is fully featured compared to the enterprise version, except that it does not include support for single sign-on (Active Directory, OAuth2) or integration with external Kubernetes clusters (to run notebook servers or Jobs, or to serve models).

Hopsworks provides out-of-the-box support for batch and streaming compute with Spark, Flink, and Python, for the Feature Store, and for model training and serving. On AWS and Azure, storage and compute are separated, with data stored in S3 and Azure Blob Storage, respectively. Kafka and Airflow both come with Hopsworks, but you can also use external deployments of either if you wish.

The Hopsworks Feature Store can be connected to upstream feature engineering platforms (Databricks, SageMaker, Cloudera), model training platforms, and model serving platforms. Alternatively, you can use the built-in feature engineering, model training, and serving capabilities in Hopsworks.

Feast

Feast is a smaller system than Hopsworks. It is often deployed on a Kubernetes cluster (with Helm charts or Terraform scripts), but it can also be deployed on AWS, Azure, or GCP. Feast was originally designed to work on GCP with BigQuery, but in late 2020 it switched to Spark as the engine for ingesting features from external sources. Feast does not come with a UI, security, or the ability to do feature engineering. It needs to be connected to existing services and databases to provide its functionality:

  • you need to connect to a Spark cluster to be able to ingest features (like EMR on AWS);
  • you need a Postgres database (like RDS on AWS) to store feature metadata;
  • you need a Kafka cluster (like managed Kafka on AWS) to ingest streaming data;
  • you need a Redis database (like ElastiCache on AWS) to serve features online with low latency.

How to ingest features and serve them with Feast

You can install Feast on AWS by connecting it to managed services: RDS Postgres, Kafka, EMR (Spark), and ElastiCache (Redis).
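
As a rough illustration of what "connecting it to managed services" involves, the sketch below records one endpoint per service and reports which ones are still unset. It is not a Feast configuration class: `FeastAwsDependencies`, its field names, and the example hostnames are all hypothetical, chosen only to mirror the four services named above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FeastAwsDependencies:
    """Hypothetical holder for service endpoints; not part of Feast."""

    postgres_url: Optional[str] = None    # RDS Postgres: feature metadata
    kafka_brokers: Optional[str] = None   # Kafka: streaming ingestion
    emr_cluster_id: Optional[str] = None  # EMR (Spark): feature ingestion jobs
    redis_host: Optional[str] = None      # ElastiCache (Redis): online serving

    def missing(self) -> List[str]:
        """Return the names of services that have not been configured yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example values are placeholders, not real endpoints.
deps = FeastAwsDependencies(
    postgres_url="postgresql://feast@example-rds:5432/feast",
    redis_host="example-redis",
)
print(deps.missing())  # → ['kafka_brokers', 'emr_cluster_id']
```

A check like this makes it explicit that a deployment is incomplete until all four external services are reachable.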