This article explores how the Data Science team at RIVIGO created a serverless store to retrieve features needed by different machine learning models.
Tejas, a serverless feature store, enables the Rivigo data science team to store and retrieve the features needed by machine learning models. Every machine learning lifecycle starts with creating a rich feature set. As the number of machine learning models increases, so does the overlap in the features these models use. To avoid computing the same features repeatedly, organizations need a shared feature store. That’s why we created Tejas. Tejas is built entirely on the serverless offerings of AWS.
The core mission of RIVIGO is to make the lives of truck drivers (‘Pilots’) more human. In our journey to do so, we are applying machine learning to improve our Relay model. “Tejas” is a Sanskrit word meaning “Fire”, one of the five gross elements, each of which is assigned a ‘zone’ of the human body. It signifies brilliance and power. A rich feature store gives power to any ML problem, so we named the Rivigo feature store Tejas.
Raw Data → Ingestion → Data Pipelines/ETL → Feature Store → Training Pipeline → Model Store/Model Manager → Model Deploy
Data pipelines and ingestion are a topic for another article. In this article, we focus on the feature store and how it is used at Rivigo.
To build any feature store broadly we need the following:
Tejas is broadly divided into three modules:
To create a managed training dataset of features, the user supplies a Pandas or PySpark data frame along with the labels needed for bookkeeping.
The input data frame must include a column holding an epoch timestamp, which is used internally to partition the data by time for faster queries.
create_fg
    success1, response1 = create_fg(
        client="freight",
        app="supply_user",
        entity="activity",
        version="v0001",
        time_col="created_at",
        time_col_unit="ms",
        pandas_df=sample_data)
The return values indicate whether the call succeeded and provide a request ID for debugging, in case the AWS Lambda logs need to be inspected.
In the background, the method validates the data, ensures it is not clobbering an existing feature set, extracts the schema, registers it in the AWS Glue meta store, and stores the schema in a structured AWS S3 path for later examination. Most of the registration and storage work is offloaded to an AWS Lambda function, which acts as the gatekeeper for all access to the other AWS services.
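The article does not show how the epoch column is turned into partitions, so the following is only a minimal sketch of one common approach: deriving year, month, and day columns from the column named in time_col. The helper name add_date_partitions and the choice of year/month/day as partition keys are assumptions for illustration, not the Tejas implementation.

```python
import pandas as pd


def add_date_partitions(df: pd.DataFrame, time_col: str, unit: str = "ms") -> pd.DataFrame:
    """Return a copy of df with year/month/day columns derived from an epoch column.

    Hypothetical helper: the partition keys actually used by Tejas are not documented.
    """
    # Interpret the epoch values in the given unit ("s", "ms", ...) as UTC timestamps.
    ts = pd.to_datetime(df[time_col], unit=unit, utc=True)
    out = df.copy()
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    return out


# Two rows with millisecond epochs for 2019-01-01 and 2019-02-01 (UTC).
sample = pd.DataFrame({"user_id": [1, 2], "created_at": [1546300800000, 1548979200000]})
partitioned = add_date_partitions(sample, "created_at", "ms")
```

Partition columns like these let a query engine skip whole directories when a query filters on a date range, which is the usual reason for partitioning by time.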
Appending data to existing features is also supported.
upload_fg
    success2, response2 = upload_fg(
        client="freight",
        app="supply_user",
        entity="activity",
        version="v0001",
        pandas_df=input_data)
Internally, the upload asserts that the new data conforms to the schema previously registered for this specific feature and version.
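As an illustration of what such a conformance check might look like for flat data, here is a minimal sketch that compares the column dtypes of an incoming Pandas data frame against a stored schema. The function names and the schema representation (a column-to-dtype mapping) are assumptions; the actual check inside upload_fg is not shown in this article.

```python
import pandas as pd


def infer_schema(df: pd.DataFrame) -> dict:
    """Map each column name to its dtype name, e.g. {"user_id": "int64"}."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}


def schema_mismatches(registered: dict, df: pd.DataFrame) -> list:
    """List the ways df deviates from the registered schema (empty list means it conforms)."""
    incoming = infer_schema(df)
    problems = []
    for col, expected in registered.items():
        if col not in incoming:
            problems.append(f"missing column: {col}")
        elif incoming[col] != expected:
            problems.append(f"{col}: expected {expected}, got {incoming[col]}")
    for col in incoming:
        if col not in registered:
            problems.append(f"unexpected column: {col}")
    return problems


# Schema captured when the feature group was first created (hypothetical data).
registered = infer_schema(pd.DataFrame({"user_id": [1], "score": [0.5]}))

good_batch = pd.DataFrame({"user_id": [2], "score": [0.1]})
bad_batch = pd.DataFrame({"user_id": ["x"], "extra": [1]})

good_problems = schema_mismatches(registered, good_batch)
bad_problems = schema_mismatches(registered, bad_batch)
```

Returning a list of problems rather than raising on the first one makes it easier to report every mismatch to the user in a single response.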
Data retrieval gives the user flexibility over what and how much data is fetched, using AWS Athena as a SQL query interface over the registered features. Because the serverless architecture is asynchronous (AWS Lambda acts as a bridge to AWS Athena), the user needs to query in two steps.
First, trigger the query … any standard Presto SQL query is supported. The return values include the S3 location where the result set will be dumped and the AWS Athena query ID.
dump_fg
    success3, query_id, s3path, response3 = dump_fg(
        "select * from feature_store.test_freight_user_activity_v0001;")
Second, read the result. For convenience, a helper method parses the dumped query result into a Pandas data frame. Its input is the query ID returned earlier, and it returns the data frame together with the schema as JSON.
read_fg
    out_df, out_schema = read_fg(query_id)
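To make the two-step pattern concrete, the sketch below shows the underlying start-then-poll flow against the Athena API directly. This is an assumption for illustration: Tejas routes these calls through AWS Lambda, and the function names, polling interval, and retry limit here are hypothetical. The athena argument is expected to behave like a boto3 Athena client.

```python
import time


def start_query(athena, sql: str, output_location: str) -> str:
    """Step 1: submit the query and return immediately with its execution ID."""
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def wait_for_query(athena, query_id: str, poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Step 2: poll until the query finishes and return the S3 path of its result file."""
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        # Still QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```

With real credentials, a client created via boto3.client("athena") can be passed as the athena argument. Splitting submission from retrieval keeps each call short, which suits a Lambda-based bridge where long-running requests are undesirable.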
The democratization of data brings with it the challenge of protecting that data. We need to guard against accidental deletion or duplication. For this reason, we have not exposed any API to delete data. Further, every upload checks that the output location doesn’t already exist, so that we neither clobber old data nor append to it and introduce duplicates. Keeping the data consistent across uploads is also paramount, so before inserting newer data we always validate it against the currently registered schema.
Access to data is also guarded via a two-layer approach, where the client APIs don’t interact with the data lake directly but go through an AWS Lambda. This allows us to control the end users’ write privileges on the AWS Glue meta store and the AWS Athena query engine, and to prevent destructive meddling with the data.
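The check that an output location doesn’t already exist could be sketched as follows. This is a hedged illustration, not the Tejas code: the function names are hypothetical, and the s3 argument is assumed to behave like a boto3 S3 client, whose list_objects_v2 call reports how many keys matched a prefix.

```python
def prefix_exists(s3, bucket: str, prefix: str) -> bool:
    """Return True if at least one object already lives under the given prefix."""
    # MaxKeys=1 is enough: we only need to know whether anything is there.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


def assert_safe_to_write(s3, bucket: str, prefix: str) -> None:
    """Refuse to write when the target location is already populated."""
    if prefix_exists(s3, bucket, prefix):
        raise FileExistsError(f"s3://{bucket}/{prefix} already contains data")
```

Failing loudly before any write happens means an upload either lands in a fresh location or does nothing at all.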
The main challenge we faced was how to tackle schema validation and evolution. No out-of-the-box schema validator existed for Hive or Parquet at the time of writing. We also wanted the schema to be human-readable for easier debugging and sharing. We tried Avro JSON schema as a possible solution, but it had data type compatibility issues with Parquet. PyArrow table types also didn’t support all possible Parquet data types. So we finally opted to serialize the Hive schema as JSON and use that as the reference against which the incoming data’s inferred schema is validated recursively.
Parsing the AWS Athena output into a possibly nested data frame was another troublesome aspect, since the results were dumped as CSV. We sidestepped the issue by sanitizing all input data and column names (stripping or replacing commas, double quotes, and newlines).
If you would like to know more about how we overcame these challenges, let’s talk. We would love to share more.
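A recursive comparison of this kind might look like the sketch below. The schema encoding is an assumption made for illustration (primitive types as strings, structs as dictionaries, arrays as one-element lists); the article does not specify the JSON layout Tejas uses, and diff_schema is a hypothetical name.

```python
import json


def diff_schema(reference, incoming, path="root"):
    """Recursively compare two JSON-style schemas and return a list of mismatches."""
    # Structs: compare field by field, reporting missing and unexpected fields.
    if isinstance(reference, dict) and isinstance(incoming, dict):
        errors = []
        for name in sorted(set(reference) | set(incoming)):
            child = f"{path}.{name}"
            if name not in incoming:
                errors.append(f"{child}: missing")
            elif name not in reference:
                errors.append(f"{child}: unexpected")
            else:
                errors.extend(diff_schema(reference[name], incoming[name], child))
        return errors
    # Arrays: a one-element list holds the element type; recurse into it.
    if isinstance(reference, list) and isinstance(incoming, list):
        if len(reference) != 1 or len(incoming) != 1:
            return [f"{path}: array schema must have exactly one element type"]
        return diff_schema(reference[0], incoming[0], f"{path}[]")
    # Primitives, or a struct/array/primitive kind mismatch.
    if reference != incoming:
        return [f"{path}: expected {reference!r}, got {incoming!r}"]
    return []


# The reference schema round-trips through JSON, as it would when stored on S3.
reference = json.loads(json.dumps(
    {"user_id": "bigint", "activity": {"kind": "string", "tags": ["string"]}}
))
incoming = {"user_id": "string", "activity": {"kind": "string", "tags": ["bigint"]}}

mismatches = diff_schema(reference, incoming)
```

Carrying the path through the recursion is what keeps the output human-readable: each message names the exact nested field that disagrees.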
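The sanitization step can be sketched roughly as below. The replacement characters (underscore in column names, space in values) and the function names are assumptions for illustration; the article only states that commas, double quotes, and newlines are stripped or replaced. The sketch also assumes column names stay unique after cleaning.

```python
import re

import pandas as pd

# Characters that break naive CSV parsing: commas, double quotes, and line breaks.
_UNSAFE = re.compile(r'[,"\r\n]+')


def sanitize_text(value: str, replacement: str = " ") -> str:
    """Replace each run of unsafe characters and trim surrounding whitespace."""
    return _UNSAFE.sub(replacement, value).strip()


def sanitize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with CSV-safe column names and string values."""
    out = df.rename(columns=lambda c: sanitize_text(str(c), "_"))
    for col in out.columns:
        # Numeric columns cannot contain the unsafe characters, so skip them.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(lambda v: sanitize_text(v) if isinstance(v, str) else v)
    return out


raw = pd.DataFrame({
    "user,name": ["a,b", "line1\nline2", '"quoted"'],
    "n": [1, 2, 3],
})
clean = sanitize_frame(raw)
```

Cleaning data on the way in trades some fidelity of free-text values for a result set that any plain CSV reader can parse without special quoting rules.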