Tejas – A Serverless Feature Store

12 July 2019
Architecture

This article explores how the Data Science team at RIVIGO built a serverless feature store to store and retrieve the features needed by different machine learning models.


 

Introduction

 

Tejas is a serverless feature store that enables the RIVIGO data science team to store and retrieve the features needed by machine learning models. Every machine learning lifecycle starts with creating a rich feature set, and as the number of machine learning models grows, the overlap of features used by these models grows as well. To avoid computing the same features repeatedly, organizations need a shared feature store. That is why we created Tejas, which is built entirely on the serverless offerings of AWS.

 

Significance of Name

 

The core mission of RIVIGO is to make the lives of truck drivers (‘Pilots’) more human. In that journey, we are applying machine learning to improve our Relay model. “Tejas” is a Sanskrit word meaning “fire”, one of the five gross elements mapped as a ‘zone’ onto the human body; it signifies brilliance and power. A rich feature store is power for any ML problem, and hence we named the RIVIGO feature store Tejas.

 

ML Model Lifecycle

 

Raw Data → Ingestion → Data Pipelines/ETL → Feature Store → Training Pipeline → Model Store/Model Manager → Model Deploy

Data pipelines and ingestion are topics for another article. Here, we focus on the feature store and how it is used at RIVIGO.

 

Components Needed to Build Feature Store

 

To build any feature store, we broadly need the following:

  • A storage layer – This could have been MySQL, but we decided to go the NoSQL route since most of our features are nested. So we picked the columnar Parquet format on an object store (Amazon S3).
  • A metastore – To store feature metadata such as database name, table name, actual storage paths of the data, feature versions, and data partition information. We use the AWS Glue metastore.
  • A query service – To selectively retrieve the data needed by training models. We use the Amazon Athena interactive query service. A minimal sketch of how these three pieces fit together follows this list.
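
For illustration only, here is a minimal, hypothetical sketch of the three layers using boto3 and pyarrow; the bucket, paths, and table names below are made up and are not the actual Tejas layout.

import io

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

BUCKET = "rivigo-feature-store"  # hypothetical bucket
KEY = "freight/supply_user/activity/v0001/part-0.parquet"

# 1. Storage layer: persist the feature frame as Parquet on S3.
df = pd.DataFrame({"user_id": [101, 102], "trip_count": [3, 5]})
buf = io.BytesIO()
pq.write_table(pa.Table.from_pandas(df), buf)
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=buf.getvalue())

# 2. Metastore: the table definition (schema, S3 location, partitions) is
#    registered in AWS Glue, e.g. via glue.create_table(DatabaseName=...,
#    TableInput=...), so the query engine knows where the Parquet files live.

# 3. Query service: registered tables can then be queried through Athena.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM feature_store.freight_supply_user_activity_v0001 LIMIT 10",
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
)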

 

 

Architecture of Tejas

 

Tejas is broadly divided into three modules:

  • TejasLambda – An AWS Lambda function that interacts with the AWS metadata store (Glue), the AWS query engine (Athena), and the storage layer (S3), and manages user authentication using IAM roles. A simplified sketch of this routing follows the list.
  • TejasPy – A Python client that can be imported in Python code or notebooks, so that data scientists can submit features they want to share with other data scientists to the feature store, or retrieve features for ad-hoc analysis.
  • Tejas – A Java client (developed in Scala, because we love higher-order functions) used by our Spark pipelines to submit features to the feature store.
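
As a hypothetical sketch of how a TejasLambda-style handler could route requests, the snippet below dispatches on an action field; the action names and payload fields are assumptions for illustration, not the actual Tejas API.

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

def handler(event, context):
    # Only this Lambda's IAM role has write access to Glue, Athena and S3;
    # the Python/Scala clients never talk to those services directly.
    action = event.get("action")
    if action == "register_feature_group":
        # Register the table (schema, S3 location, partitions) in the Glue metastore.
        return glue.create_table(
            DatabaseName=event["database"], TableInput=event["table_input"]
        )
    if action == "run_query":
        # Hand the SQL off to Athena and return the query id to the caller.
        resp = athena.start_query_execution(
            QueryString=event["sql"],
            ResultConfiguration={"OutputLocation": event["output_s3_path"]},
        )
        return {"query_id": resp["QueryExecutionId"]}
    raise ValueError(f"Unsupported action: {action}")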

 

 

Submitting Data

 

To create a managed training dataset of features, the user supplies a Pandas or PySpark data frame with the necessary labels for bookkeeping.

The input data frame must contain a column holding an epoch timestamp, which is used internally to partition the data temporally for faster queries.
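
For example, a millisecond-epoch column can be derived before submission; sample_data and its event_time column here are illustrative, chosen to line up with the create_fg call below.

import pandas as pd

# Hypothetical input frame; in practice this is the feature set being submitted.
sample_data = pd.DataFrame({
    "user_id": [101, 102],
    "event_time": pd.to_datetime(["2019-07-01 10:00:00", "2019-07-02 11:30:00"]),
})

# Epoch timestamp in milliseconds, matching time_col="created_at" and
# time_col_unit="ms" in the create_fg call below.
sample_data["created_at"] = sample_data["event_time"].astype("int64") // 10**6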

 

create_fg
success1, response1 = create_fg(
    client="freight",
    app="supply_user",
    entity="activity",
    version="v0001",
    time_col="created_at",
    time_col_unit="ms",
    pandas_df=sample_data)

 

The return values indicate whether the call completed successfully and include a request ID for debugging, in case the AWS Lambda logs need to be inspected.

In the background, the method validates the data, ensures it is not clobbering an existing feature set, extracts the schema, registers it in the AWS Glue metastore, and stores the schema at a structured AWS S3 path for later examination. Most of the registration and storage work is offloaded to an AWS Lambda function, which is responsible for gatekeeping all access to the other AWS services.
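
As a rough sketch of the schema-storage step, assuming a hypothetical bucket name and key layout (the actual Lambda does more, including the Glue registration), the schema can be inferred from the frame and kept at a structured S3 path:

import json

import boto3
import pandas as pd

def store_schema(df: pd.DataFrame, client: str, app: str, entity: str, version: str) -> str:
    """Infer the frame's schema and keep a human-readable copy at a structured S3 path."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    key = f"schemas/{client}/{app}/{entity}/{version}/schema.json"  # hypothetical layout
    boto3.client("s3").put_object(
        Bucket="rivigo-feature-store",  # hypothetical bucket
        Key=key,
        Body=json.dumps(schema, indent=2),
    )
    return f"s3://rivigo-feature-store/{key}"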

Appending data to existing features is also supported.

 

upload_fg
success2, response2 = upload_fg(
    client="freight",
    app="supply_user",
    entity="activity",
    version="v0001",
    pandas_df=input_data)

 

Internally, the upload asserts that the new data conforms to the schema previously registered for this specific feature and version.

 

Retrieving Data

 

Data retrieval adds flexibility in what and how much data is fetched, using AWS Athena as a SQL query interface over the registered features. Because of the asynchronous nature of the serverless architecture (AWS Lambda acting as a bridge to AWS Athena), the user queries in two steps.

First, trigger the query; any standard Presto SQL query is supported. The return values include the S3 location where the result set will be dumped and the AWS Athena query ID.

 

dump_fg
success3, query_id, s3path, response3 = dump_fg(
    "select * from feature_store.test_freight_user_activity_v0001;")

 

For user convenience, a helper method is also provided that parses the dumped query result into a Pandas DataFrame. The input parameter is the previously returned query ID, and the return values are the data frame and the schema as JSON.

 

read_fg
out_df, out_schema = read_fg(query_id)
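
Because Athena executes asynchronously, the result set may not be ready immediately after dump_fg returns. One way to wait for completion is to poll the query state before calling read_fg; the sketch below uses boto3 directly for illustration, whereas in Tejas such a status check would also go through the Lambda bridge.

import time

import boto3

athena = boto3.client("athena")

def wait_for_query(query_id: str, poll_seconds: float = 2.0) -> str:
    """Poll Athena until the query reaches a terminal state and return that state."""
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)

# Usage: wait for the dump to finish, then parse it into a data frame.
if wait_for_query(query_id) == "SUCCEEDED":
    out_df, out_schema = read_fg(query_id)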

 

Security

 

The democratization of data brings with it the challenge of protecting that data. We need to guard against accidental data deletion or duplication. For this reason, we have not exposed any API to delete data. Further, every data upload ensures that the output location does not already exist, so that we neither clobber old data nor append to it and introduce duplicates. Keeping the data consistent across uploads is also paramount, so before inserting newer data we always validate it against the currently registered schema.
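
A minimal sketch of the no-clobber guard follows, assuming a hypothetical bucket and key layout; in Tejas the real check lives inside TejasLambda rather than in client code.

import boto3

s3 = boto3.client("s3")

def output_location_is_free(bucket: str, prefix: str) -> bool:
    """Return True only if nothing has been written under the target prefix yet."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) == 0

# Example: refuse the upload if the feature version already has data.
assert output_location_is_free(
    "rivigo-feature-store", "freight/supply_user/activity/v0001/"
), "Refusing to clobber an existing feature version"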

Access to data is also guarded via a two-layer approach, where the client APIs do not interact directly with the data lake but go through an AWS Lambda. This allows us to control the end users' write privileges on the AWS Glue metastore and the AWS Athena query engine, and prevents destructive meddling with the data.

 

Challenges We Faced

 

The main challenge we faced was schema validation and evolution. No out-of-the-box schema validator existed for Hive or Parquet at the time of writing. We also wanted the schema to be human-readable for easier debugging and sharing. We tried Avro JSON schema as a possible solution, but it had data-type compatibility issues with Parquet. PyArrow table types also did not support all possible Parquet data types. So we finally opted to JSON-serialize the Hive schema and use it as the reference against which the incoming data's inferred schema is validated recursively.
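
A simplified sketch of that recursive check is shown below, assuming both schemas are represented as (possibly nested) dicts of column name to type; the actual representation in Tejas may differ.

def validate_schema(registered: dict, incoming: dict, path: str = "") -> None:
    """Recursively assert that the incoming schema matches the registered one."""
    for column, expected in registered.items():
        where = f"{path}.{column}" if path else column
        if column not in incoming:
            raise ValueError(f"Missing column: {where}")
        actual = incoming[column]
        if isinstance(expected, dict):
            # Nested struct: recurse into its fields.
            if not isinstance(actual, dict):
                raise ValueError(f"Expected a struct at {where}, got {actual}")
            validate_schema(expected, actual, where)
        elif expected != actual:
            raise ValueError(f"Type mismatch at {where}: {expected} != {actual}")

# Example usage with hypothetical schemas:
registered = {"user_id": "bigint", "activity": {"clicks": "int", "dwell_ms": "bigint"}}
incoming = {"user_id": "bigint", "activity": {"clicks": "int", "dwell_ms": "bigint"}}
validate_schema(registered, incoming)  # raises ValueError on any mismatch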

Parsing the AWS Athena output into a possibly nested data frame was another troublesome aspect, since the results are dumped as CSV. We sidestepped the issue by sanitizing all input data and column names (stripping or replacing commas, double quotes, and newlines).
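
For illustration, a sanitizer along these lines could be applied to every column name and string cell before upload; the exact replacement rules used in Tejas may differ.

import pandas as pd

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Strip characters that break Athena's CSV output from names and string cells."""
    def clean(value: str) -> str:
        return value.replace(",", " ").replace('"', "").replace("\n", " ")

    df = df.rename(columns={col: clean(col) for col in df.columns})
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].map(lambda v: clean(v) if isinstance(v, str) else v)
    return df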

Would you like to know more about how we overcame these challenges? Let's talk; we would love to share more.

Saswata Dutta and Vibhanshu Abhishek have also contributed to this article.