AWS Sagemaker Feature Store is a managed feature store built around lower-level AWS resources such as S3, DynamoDB, Athena, and the Glue Data Catalog.

General Architecture

Metadata Server: (Optional) Glue Data Catalog

Offline Feature Store: S3

Online Feature Store: DynamoDB 

Supported Data Types: String, Integral, Fractional
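
For illustration, this is how the three supported types map to feature definitions in the Sagemaker Python SDK; the feature names below are hypothetical:

    from sagemaker.feature_store.feature_definition import (
        FeatureDefinition,
        FeatureTypeEnum,
    )

    # One definition per feature; only String, Integral and Fractional exist,
    # so e.g. embedding arrays cannot be expressed directly.
    feature_definitions = [
        FeatureDefinition("customer_id", FeatureTypeEnum.STRING),
        FeatureDefinition("order_count", FeatureTypeEnum.INTEGRAL),
        FeatureDefinition("avg_basket_value", FeatureTypeEnum.FRACTIONAL),
        FeatureDefinition("event_time", FeatureTypeEnum.FRACTIONAL),
    ]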

Lower Level Implementation

Batch Ingestion

Data ingestion into both the online and the offline feature store is done transparently through a single PutRecord operation. The operation can be accessed either directly through boto3 (Python) or through the higher-level Sagemaker Python SDK.
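
A minimal sketch of the low-level boto3 path; the feature group and feature names are hypothetical:

    import time
    import boto3

    # PutRecord writes the record to the online store and, transparently,
    # to the offline store; all values are passed as strings.
    featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
    featurestore_runtime.put_record(
        FeatureGroupName="customers",  # hypothetical feature group
        Record=[
            {"FeatureName": "customer_id", "ValueAsString": "42"},
            {"FeatureName": "order_count", "ValueAsString": "7"},
            {"FeatureName": "event_time", "ValueAsString": str(time.time())},
        ],
    )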

AWS claims high throughput for this operation. Parallelization is supported only through multiple boto3 connections writing in parallel.
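
Through the Sagemaker Python SDK this pattern looks roughly as follows; ingest() fans a pandas DataFrame out over worker threads, each issuing PutRecord calls over its own connection (names and input data are hypothetical):

    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    feature_group = FeatureGroup(name="customers", sagemaker_session=session)

    # Each worker thread writes rows of the DataFrame via PutRecord.
    df = pd.read_parquet("customers.parquet")  # hypothetical input data
    feature_group.ingest(data_frame=df, max_workers=8, wait=True)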

There is no direct support for other distributed engines such as Apache Spark, except through other AWS services (see the streaming architecture). Once data is written to the offline feature store it can be read with EMR/Spark; however, this is direct access to S3 rather than going through any Feature Store API, so the user has to know the feature store's layout on S3.

Latency for batch writes to the feature store increases quickly, since each PutRecord also performs a synchronous DynamoDB write. (1)

 

Stream Ingestion

Streaming ingestion is supported by composing additional AWS services, such as Kinesis Data Analytics and AWS Lambda, which call the PutRecord operation repeatedly.

See (2) for a reference architecture diagram.
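
A minimal sketch of the Lambda side of such an architecture, assuming records arrive from a Kinesis stream and a hypothetical feature group named customers:

    import base64
    import json
    import boto3

    featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

    def handler(event, context):
        # Decode each Kinesis record and forward it with PutRecord.
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            featurestore_runtime.put_record(
                FeatureGroupName="customers",  # hypothetical
                Record=[
                    {"FeatureName": name, "ValueAsString": str(value)}
                    for name, value in payload.items()
                ],
            )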


------------
(1) https://www.alexdebrie.com/posts/dynamodb-transactions-performance
(2) https://github.com/aws-samples/amazon-sagemaker-feature-store-streaming-aggregation

APIs

Sagemaker Python SDK

The Sagemaker Python SDK is the main entry point for developers to access the metadata about feature groups in the feature store.
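
For example, describe() returns the feature group's schema and configuration as a dict (the feature group name is hypothetical):

    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    feature_group = FeatureGroup(name="customers", sagemaker_session=session)

    # DescribeFeatureGroup metadata: schema, online/offline store config, ...
    description = feature_group.describe()
    print(description["FeatureDefinitions"])
    print(description["OfflineStoreConfig"]["S3StorageConfig"]["S3Uri"])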

Querying Data (Batch Access) (3)

Batch access to feature data, for example to create training datasets, is done solely through SQL as the API, using additional AWS services as the execution engine.

Option 1: AWS Athena

The SDK provides some utility methods to generate query templates, for example for time-travel queries; however, the user has to fill in the templates manually.
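
A sketch of a manually filled-in time-travel query, assuming a feature group keyed by customer_id with an event_time column; the cutoff timestamp and output location are made up:

    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    feature_group = FeatureGroup(name="customers", sagemaker_session=session)

    query = feature_group.athena_query()
    # Latest value per record identifier as of a cutoff, skipping deletions.
    query_string = f"""
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY event_time DESC
               ) AS row_num
        FROM "{query.table_name}"
        WHERE event_time <= 1700000000.0
    ) AS snapshot
    WHERE row_num = 1 AND NOT is_deleted
    """
    query.run(query_string=query_string,
              output_location="s3://my-bucket/athena-results/")
    query.wait()
    training_df = query.as_dataframe()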


Option 2: EMR on S3

EMR can be used as an execution engine directly on top of the offline feature store's S3 buckets. However, users mostly have to handle paths and metadata in EMR jobs manually.
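
A sketch of such a job, assuming PySpark on EMR; the concrete path has to be resolved by the user, e.g. from the ResolvedOutputS3Uri field of DescribeFeatureGroup (bucket, account and region below are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("offline-store-read").getOrCreate()

    # Documented layout: <bucket>/<account>/sagemaker/<region>/offline-store/
    #   <feature-group-name>-<creation-timestamp>/data/... (Parquet files)
    offline_store_path = (
        "s3://my-offline-store-bucket/123456789012/sagemaker/eu-west-1/"
        "offline-store/customers-1600000000/data"
    )
    features = spark.read.parquet(offline_store_path)
    features.show()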

(Potential) Option 3: Databricks Spark

The Sagemaker SDK provides methods to generate Hive DDL templates in order to register offline feature store tables on S3 as external tables. One possibility could be to run this DDL from Databricks and then read from S3 using Databricks' Spark and the Hive metadata.
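
A sketch of that route, untested here; the database and table names are hypothetical, and the Databricks cluster is assumed to have IAM access to the offline store bucket:

    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    feature_group = FeatureGroup(name="customers", sagemaker_session=session)

    # Generates a CREATE EXTERNAL TABLE statement pointing at the S3 data.
    ddl = feature_group.as_hive_ddl(database="feature_store",
                                    table_name="customers")

    # On Databricks, where `spark` is predefined:
    spark.sql(ddl)
    spark.table("feature_store.customers").show()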

Graphical User Interface

The graphical user interface to the feature store is integrated into Sagemaker Studio, giving access to the metadata.

------------

(3) https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-ingest-data.html

Cost

Apart from the cost of running Sagemaker instances and S3 storage, DynamoDB as the online feature store makes up the largest part of the cost. (4)

DynamoDB is priced per million write and read requests (5). Additionally, general storage costs $0.25 per GB-month. Services like backup and restore are priced extra.
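
As a back-of-the-envelope sketch: the request volumes below are made up, and the per-request rates are assumptions based on published us-east-1 on-demand pricing; check (5) for current numbers.

    # Assumed us-east-1 on-demand rates; verify against (5).
    write_price_per_million = 1.25   # USD per million write request units
    read_price_per_million = 0.25    # USD per million read request units
    storage_price_per_gb = 0.25      # USD per GB-month

    writes = 500e6     # hypothetical monthly writes
    reads = 2000e6     # hypothetical monthly reads
    storage_gb = 100   # hypothetical stored feature data

    monthly_cost = (
        writes / 1e6 * write_price_per_million
        + reads / 1e6 * read_price_per_million
        + storage_gb * storage_price_per_gb
    )
    print(f"${monthly_cost:,.2f}/month")  # -> $1,150.00/month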

Atlassian Use Case (6)

With DynamoDB as the online feature store, Atlassian pays roughly $40k annually for one model with seven features. See the talk (6) on the effort Atlassian put into reducing the cost of that model from $425k/year (with ten features) down to $40k (with seven).

------------

(4) https://blog.yugabyte.com/11-things-you-wish-you-knew-before-starting-with-dynamodb
(5) https://aws.amazon.com/dynamodb/pricing/on-demand
(6) https://www.youtube.com/watch?v=mVVZLCJzRvc&t=322s

Advantages

  • Uses cloud-native, managed services for online and offline feature store
  • Integration with Sagemaker

Disadvantages

  • No direct Databricks support, potentially only through Hive external tables on S3
  • No distributed compute engine support (Spark)
  • No complex feature data types, such as arrays for embeddings
  • Mostly SQL-based API (difficult for Python developers)
  • No schema versioning or data versioning for feature groups 
  • No feature statistics, alerting, monitoring
  • No custom metadata with free-text search
  • High-latency online feature store (DynamoDB)
  • Append-only offline store: no updates or deletes (needed for GDPR)
  • Up to 15 minutes until ingested data appears in the offline feature store
  • No integration with model monitoring
  • Time-travel queries are not transparent; the developer has to manually filter feature values based on the event-time column
  • Very limited documentation

 

 

Resources