Tecton officially released their feature store platform in December 2020 (it went general availability (GA)), although it is not yet open access, and it is still hard to find details about their platform. Luckily, we stumbled across their documentation, which is now linked on their website. Here's a brief review of Tecton.ai as a feature store.
Feature Definitions and Transformations
Tecton uses Spark as its feature engineering platform. You define features and transformations in the Tecton platform in a domain-specific language, and Tecton will periodically send the feature definitions to an external Spark cluster to compute the feature values (Databricks is supported, with support also for EMR).
The feature definition/transformation is defined in a DSL (domain-specific language) written in Python that mixes:
- a function defining how to compute an individual feature value (a transformation);
- scheduling and configuration information for when/how to compute the feature on the external Spark cluster.
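To illustrate the pattern of mixing a transformation with its scheduling configuration, here is a minimal sketch in plain Python. Note that all names here (`feature`, `FeatureDefinition`, `REGISTRY`, `schedule_interval`) are our own invention for illustration — this is NOT Tecton's actual DSL, just the general decorator-plus-metadata shape such a DSL takes.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical sketch of a feature-definition DSL: a decorator attaches
# scheduling configuration to the transformation function and registers it.
# None of these names are Tecton's; they only illustrate the pattern.

@dataclass
class FeatureDefinition:
    name: str
    schedule_interval: str   # how often the external Spark cluster recomputes it
    transformation: Callable  # the function computing the feature values

REGISTRY: Dict[str, FeatureDefinition] = {}

def feature(name: str, schedule_interval: str):
    """Register a transformation together with its schedule metadata."""
    def wrap(fn: Callable) -> Callable:
        REGISTRY[name] = FeatureDefinition(name, schedule_interval, fn)
        return fn
    return wrap

@feature(name="user_total_spend", schedule_interval="1h")
def user_total_spend(rows):
    # In Tecton this body would be PySpark; plain Python keeps the sketch runnable.
    return sum(r["amount"] for r in rows)
```

The point is that the function body (the transformation) and the operational metadata (the schedule) live in the same definition, which Tecton then ships to the external Spark cluster.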
There are 3 types of feature transformations supported: batch source, streaming source, and on-demand. Batch and streaming feature transformations are both mapped to batch Spark jobs, so it is unclear why they decided they needed two different types of transformations here. On-demand feature transformations are Python functions executed on live data arriving at prediction time.
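An on-demand transformation, as we understand it, is just a Python function applied to the live request at prediction time, possibly combined with precomputed feature values. A hedged sketch (the function and field names are ours, not Tecton's):

```python
# Hypothetical on-demand feature: a plain Python function run at prediction
# time on the live request payload, combined with a precomputed batch feature.
# Names ("user_avg_amount", "amount") are illustrative, not from Tecton.

def transaction_amount_ratio(request: dict, precomputed: dict) -> float:
    """Ratio of the live transaction amount to the user's historical average."""
    avg = precomputed.get("user_avg_amount", 0.0)
    return request["amount"] / avg if avg else 0.0
```

For example, a live request with `amount=50.0` and a precomputed `user_avg_amount=25.0` yields a feature value of `2.0`.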
One peculiarity here is that if you want to use Python libraries to compute your features (if you have progressed beyond hello-world, you will want to do this!), they can only be used inside the feature definition, not in the Python code outside the scope of the function. We can only assume that this is because the libraries are installed on the external Spark cluster, while the DSL is executed in Tecton as a Python program, and Tecton does not internally manage Python libraries for you.
It is not clear how you can develop or unit-test your feature definitions - Tecton does not appear to include an IDE. Based on the videos, it looks like you should develop and test the features on Databricks and then copy the functions to your source code repository managed by Tecton. However, you will then have two copies of your code (non-DRY!): one for development/testing and one for the Tecton feature repository. There is support for Python tests that run when you upload a new feature (called Plan Hooks), but we fail to see how a feature transformation written in PySpark can be tested in a Python function with access to any data, because the hook is run on your Python client machine.
A feature definition is put inside a feature package that includes information about ownership, coordination of its computation, and metadata. Tecton define different types of feature packages: Temporal, Time-Windowed Aggregations, Online, and Push. Temporal feature packages are the default and, we assume, support point-in-time JOINs. Time-Windowed Aggregation feature packages support min/max/avg computations over windows. Online feature packages use online transformations. Push features are defined outside of Tecton and ingested into Tecton. It is not clear how the materialization settings in a feature definition are used for Push features, or whether Push features have parity with the other feature types (visualization, etc.).
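To make the point-in-time JOIN assumption concrete: when building training data, each label row must be joined with the latest feature value known *at that row's timestamp*, never a later one, to avoid data leakage. This is not Tecton code — just a sketch of the semantics using Pandas' `merge_asof`:

```python
import pandas as pd

# Label events: the rows for which we need historical feature values.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2020-12-01", "2020-12-10", "2020-12-05"]),
})

# Materialized feature values, stamped with when each value became valid.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2020-11-28", "2020-12-05", "2020-12-01"]),
    "total_spend": [10.0, 25.0, 7.0],
})

# A point-in-time JOIN picks, per label row, the latest feature value with
# feature ts <= label ts, so no future information leaks into training data.
training_df = pd.merge_asof(
    labels.sort_values("ts"),
    features.sort_values("ts"),
    on="ts", by="user_id", direction="backward",
)
```

The user-1 label on 2020-12-10 picks up the 2020-12-05 value (25.0), not anything later.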
Note, however, that features do not appear to have a namespace scope. So, if you have two features with the same name (even if they are in different feature packages, we assume?), then they will clash. So, expect long feature names and defining your own policy for feature naming to prevent clashes. It is interesting to note that there is a flat namespace for features and that each feature has a single value for its entity ID (multi-part entity IDs do not appear to be supported yet).
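A naming policy for a flat namespace can be as simple as a helper that composes a unique name from a few scoping components. The convention below is purely our own suggestion, not anything Tecton prescribes:

```python
# Illustrative naming convention for a flat feature namespace:
# <team>_<entity>_<name>[_<window>]. Entirely our own, not Tecton's.

def feature_name(team: str, entity: str, name: str, window: str = "") -> str:
    """Compose a globally unique feature name from scoping components."""
    parts = [team, entity, name]
    if window:
        parts.append(window)
    return "_".join(parts)
```

So `feature_name("fraud", "user", "total_spend", "7d")` gives `"fraud_user_total_spend_7d"`, which is unlikely to clash with another team's `total_spend`.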
There is also no support for managing groups of materialized features as training datasets in popular file formats like tfrecord, csv, npy, and petastorm. Instead, there are Feature Services that can be used to return training data as Spark or Pandas dataframes (see below for details).
Tecton SDK and Command Line Tool
The Tecton SDK is described in the documentation and it enables you to access/use Tecton from Python from external environments (like Databricks). You can use it to retrieve online feature values. There is also a command line tool (tecton), which is used to add and validate new features, and apply the new features to the feature store.
Feature Development and Use
In the videos shown, developers have to work with 3 different tools at the same time to develop features:
- Databricks to define and test feature logic;
- Tecton Web UI to schedule/inspect features;
- the command line is needed to manage the Python files (one Python file per feature). You call a command line tool (tecton) to check whether the new features you add conflict with existing ones and to add the features to the feature store.
When features have been added to the feature store, they are executed as Spark batch jobs on Databricks - the Spark jobs compute features based on the feature definitions (Python functions defined in the DSL). Tecton does not appear to support Spark Streaming jobs yet - materialization jobs can only be scheduled to run periodically. There is not much information about how to debug these batch jobs or how to monitor their cost (although you probably don't care about cost if you are looking at Tecton). In the monitoring section of the documentation, they say you can click links on feature packages/materialization jobs and get to the Spark job logs in Databricks, which is where you would look for errors.
Tecton provide information about the delay between scheduling your features for computation and when they will actually be updated in the online and offline stores. They give a 'best-effort' guarantee they call the grace period, within which features 'should' be available in the offline/online stores. They estimate this to be between 1X and 4X the scheduling period. If you schedule your feature computations every 10 mins, you may wait up to an additional 30 minutes for the features to land. There is a tradeoff of increased cost for fresher features. For features computed less often (e.g., daily), you only need to wait an additional day for the features to be refreshed.
Orchestration and CI/CD Support
Tecton appear to control orchestration of feature computations - it is not currently possible to use an orchestration platform like Airflow, Azure Data Factory, Dagster, or Stitch. Currently, Tecton appear to only support time as an orchestration event - you recompute features at a given time interval. It is not clear how you could currently use Tecton in a CI/CD pipeline that runs in response to events such as new data arriving at a data source or a model monitoring platform indicating that a model needs to be retrained.
You can group features together into "training datasets" using Feature Services. They provide both an offline API for returning training data as PySpark or Pandas dataframes and an online API for returning the latest feature values for model serving. Online features are available behind a REST/gRPC API. Here, you define the list of features you want to serve together. Internally, Tecton make calls to DynamoDB, and their service then joins the feature values together, returning them as a single result. You can monitor serving latencies in the Tecton UI.
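The serving-time join described above can be sketched as follows. We mock the key-value store with a dict (DynamoDB in Tecton's case); the store layout, key format, and function name are all our own assumptions, not Tecton internals:

```python
# Mock of an online store keyed by (feature package, entity key).
# DynamoDB plays this role in Tecton; the layout here is our assumption.
ONLINE_STORE = {
    ("user_profile", "user:42"): {"age": 31, "country": "NO"},
    ("user_spend",   "user:42"): {"total_spend_7d": 120.5},
}

def get_online_features(packages, entity_key):
    """Look up each package's latest row for one entity and merge the values
    into a single feature vector, as a feature service does at serving time."""
    vector = {}
    for pkg in packages:
        vector.update(ONLINE_STORE.get((pkg, entity_key), {}))
    return vector
```

A request for `["user_profile", "user_spend"]` and entity `"user:42"` thus fans out to one lookup per package and returns one merged vector for the model.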
Security support has been added recently. You can manage users by adding them via the Web UI or by using SAML/SSO with an identity provider like Active Directory. Access control is based on workspace-scoped access to feature configurations and materialization jobs, which is similar to the project-based access control in the Hopsworks Feature Store. Storage connectors can use an IAM role, API key, or credentials to connect to external data sources like Kafka, Snowflake, and Redshift.
Tecton is a premium-priced product - you pay for Tecton, Databricks, and DynamoDB. Is it also a premium-quality product? We will find out when it is open for general use. We can see, however, that it is currently AWS-only, Spark batch-centric, and that it requires 3 tools (Tecton UI, Databricks, and the command-line tecton tool) to develop features. Its limitations appear to be: weak support for testing; unclear support for defining and computing features on external Spark platforms (documentation is lacking on this); no support for plain Python (only PySpark) for feature computation; no support for streaming computation of features, which means the freshest a feature can be is 30 mins (and even that requires an expensive recompute schedule); and unclear integration with existing orchestration or CI/CD tooling. Otherwise, it does what it says on the tin, and this may be enough for many enterprises.