Building RIVIGO’s Microservices Stack

Setting up a loosely coupled system is critical for laying a strong foundation for scaling up. This post gives an overview of the design and implementation of RIVIGO’s microservices stack.

 

Why Microservices

When we decided to build a scalable service stack at RIVIGO, we evaluated several options and building blocks, including microservices, centralized logging, monitoring, alerting, auditing, Spring Cloud and Netflix OSS.

Most business problems have a solution that requires different components to talk to each other and arrive at a decision or a transaction to perform. This is what happens in the physical world too. For example, unless there is money in the wallet, one cannot make a purchase. Hence, it makes logical sense to separate the concerns in the system along the lines of business components. In this example, if the logic for checking the wallet balance changes in the future, there is no change in the logic for the purchase decision. Such separation was a key ask from our service stack. Microservice architecture fits the bill here since it is driven by the Single Responsibility Principle.

Advantages

  • Simplifies development and enhancements
  • Enables better testability, as each service can be tested in isolation
  • Enables continuous integration, as each service passes automation criteria independently
  • Enables continuous deployment of large-scale complex systems
  • Helps manage hardware, as only the service with increased utilization needs to scale
  • When using Docker, solves the problem of machine utilization
  • Enables agile development across multiple teams, as each team owns and is responsible for one or more services and can develop, deploy and scale them independently
  • Reduces application start-up time, which speeds up deployment and also improves productivity
  • Improves fault isolation drastically
  • Provides freedom to choose the technology and implementation framework based on the problem being solved

Complexity

  • Complex distributed architecture
  • Implementation of robust inter-service communication
  • Implementation of strategy for distributed transactions
  • Set up of strong coordination between different two-pizza teams

In this post, we will cover how we implemented a microservice architecture at RIVIGO. Specifically, we will give an overview of designing and setting up a loosely coupled architecture from the word go.

 

Architecture

The diagram below gives an overview of the stack setup at RIVIGO.

In the following sections, we will cover the basic considerations that we kept in mind while designing the complete stack.

Module Design

The first step was to structure the overall code base in a way that helps achieve maximum reusability of common modules across services. Some basic modules, which are a part of almost every web application, were added to a common layer:

  • Entity state machine – We needed a module that could transition entities from one state to another. However, a state transition could only be performed by certain user role(s) in the application. Since it was a user-triggered operation, we dropped a full FSM implementation and introduced a simple module which constrains the flow from one state to another and also checks the user role. The backing table has columns for From_State, To_State and a bitmap of permitted user roles (a minimal sketch of this check follows the list below). Being a loosely coupled design, this gave us flexibility in implementing multi-step maker-checker flows.
  • Task Manager and Scheduler – We designed a micro-task scheduler based on Spring’s scheduler which maintains the task state machine, generates and mails error files and retries failed tasks on its own. We defined an abstraction layer that implemented a handler which would parse the payload JSON stored with each task row or do some other custom business processing. This simplified the initial task management for us.
  • Import/export module – The idea was to keep the user flow simple and easy to use. An abstraction layer is responsible for defining an import/export as sync/async, validating the mandatory columns and data types in each row and then passing the data on to a handler implemented by the business processor. Each handler receives a list of rows on which it can run its custom validations and processing. The abstracted part also maintains an error log for every failed row, which is later used to generate a failed CSV file. This failed CSV file is formatted in the same way as the upload template, except for an additional column for the error message.
  • Audit Module – As the name suggests, the purpose of this module is to keep a change log for all entities in the system. For all persistent objects, we defined marker base entities which mark them as auditable. The audit flow is currently based on Hibernate’s post-commit handler, which passes the before and after payloads to an audit service. The audit module is an independent Elasticsearch-based microservice responsible for determining the delta between the copies, indexing it in Elasticsearch and enabling search over the audit logs via various keys. In the future, this will be re-architected into a “MySQL binlog + Debezium + Kafka Streams” based service which is much more resilient and accurate.
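
To make the state machine module concrete, here is a minimal, hypothetical sketch of the role-aware transition check. The class, state names and role bits are illustrative only; in the actual module the allowed transitions would come from the From_State/To_State/role-bitmap table rather than an in-memory map.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of a role-aware state transition check.
public class StateTransitionValidator {

    // Key: "FROM->TO", value: bitmap of roles allowed to perform the transition.
    private final Map<String, Long> allowedTransitions;

    public StateTransitionValidator(Map<String, Long> allowedTransitions) {
        this.allowedTransitions = allowedTransitions;
    }

    public boolean canTransition(String fromState, String toState, long userRoleBit) {
        return Optional.ofNullable(allowedTransitions.get(fromState + "->" + toState))
                .map(bitmap -> (bitmap & userRoleBit) != 0)
                .orElse(false);
    }

    public static void main(String[] args) {
        // DRAFT can move to APPROVED only by a user holding the checker role (bit 2).
        StateTransitionValidator validator =
                new StateTransitionValidator(Map.of("DRAFT->APPROVED", 0b10L));
        System.out.println(validator.canTransition("DRAFT", "APPROVED", 0b10L)); // true (checker)
        System.out.println(validator.canTransition("DRAFT", "APPROVED", 0b01L)); // false (maker)
    }
}
```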

The release versions and snapshots for these modules are managed via Artifactory, and inclusion in a service is easily managed via Maven.

Service discovery and communication

With multiple service nodes in play, we come to the concept of client-side load balancing, where a client or service node knows about all the IPs registered against a service domain and decides which IP to fire the request to. In an upscaling/downscaling environment, nodes constantly come up or go down and need to register themselves against the service domain(s). Post this, a client-side load balancer distributes the requests coming to it. The load balancing strategy can either be a default one or a custom-built one; building a custom load balancer was a requirement for our self-healing service design.

The client-side load balancer can be a separate layer, or it can be managed in the application itself. Netflix OSS provides Ribbon for this use case, which binds to each service node as a sidecar. For our use case, we used Consul as the service discovery and registry server. The reasons for choosing it will be covered in detail in further posts; the primary one is that Consul provides Consul Template, which we have used in our self-healing microservice architecture.
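
As an illustrative sketch (not our exact configuration), this is roughly how registration and in-application client-side load balancing look with Spring Cloud Consul: the service registers itself with the local Consul agent on startup, and a @LoadBalanced RestTemplate resolves logical service names against the instances registered in Consul. The purchase/wallet service names are hypothetical.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

// With spring-cloud-starter-consul-discovery on the classpath, the application
// registers itself with the local Consul agent on startup and deregisters on shutdown.
@SpringBootApplication
@EnableDiscoveryClient
public class PurchaseServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(PurchaseServiceApplication.class, args);
    }

    // A @LoadBalanced RestTemplate resolves logical service names such as
    // "wallet-service" against the instances registered in Consul and spreads
    // calls across them, i.e. client-side load balancing inside the application.
    @Bean
    @LoadBalanced
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}
```

A call like restTemplate.getForObject("http://wallet-service/wallets/{id}/balance", BigDecimal.class, id) is then routed to one of the healthy instances registered under wallet-service.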

For communication across services, we used Netflix OSS’s Feign client, with Hystrix as the circuit breaker.
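
Here is a minimal sketch of such an inter-service call, assuming a hypothetical wallet-service exposing a balance endpoint; the interface, endpoint and fallback behaviour are illustrative, and the Feign annotation package varies with the Spring Cloud version.

```java
import org.springframework.cloud.openfeign.FeignClient; // older stacks: org.springframework.cloud.netflix.feign.FeignClient
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import java.math.BigDecimal;

// Declarative HTTP client: Feign resolves "wallet-service" via service discovery,
// and the fallback kicks in when Hystrix opens the circuit or the call fails.
@FeignClient(name = "wallet-service", fallback = WalletClient.WalletFallback.class)
public interface WalletClient {

    @GetMapping("/wallets/{userId}/balance")
    BigDecimal getBalance(@PathVariable("userId") Long userId);

    @Component
    class WalletFallback implements WalletClient {
        @Override
        public BigDecimal getBalance(Long userId) {
            // Fail safe: treat the balance as zero so the purchase flow degrades gracefully.
            return BigDecimal.ZERO;
        }
    }
}
```

With feign.hystrix.enabled=true, Hystrix wraps each Feign call, so a failing or slow downstream service returns the fallback value instead of cascading the failure upstream.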

For more libraries you can refer to Spring Cloud which provides a suite of tools for designing distributed systems.

Centralized Logging and Distributed Traceability 

The diagram below shows the pipeline that we chose to implement the logging.

A detailed post will follow soon on why we made this choice. Here is a brief summary from there.

Filebeat helps throttle the shipping of data, as it uses a backpressure-sensitive protocol when sending data forward to account for higher volumes. There is also no loss of data, as logs are written to a file first. A major concern was eliminating Logstash to prevent unnecessary resource consumption, and that is achieved by using the Elasticsearch ingest node. Filebeat has configuration to optionally specify the ingest pipeline which processes data before dumping it into Elasticsearch indices. This means we can specify the necessary grok filters in the pipeline and add it to the Filebeat config file. Once the data is in Elasticsearch, all sorts of analytics and visualizations can be done through Kibana.

Distributed tracing means being able to trace a request initiated by a client across all the services it is served by. For this, we used Zipkin, a distributed tracing tool which seamlessly integrates with Spring-based services and also provides a good web UI for request tracing. It is based on Dapper and HTrace and works by joining traces across HTTP request/message boundaries.

It binds a span and a trace to each request/thread/message. The key terms are:

  • Span – the basic unit of work, with a unique 64-bit identifier
  • Trace – a tree-like data structure in which each node is a span

Here is a sample demonstration of an incoming request being traced via Zipkin:

Besides being able to trace requests, this gives one more major advantage in centralized logging. We were able to insert the traceId and spanId into the application logs through log4j and index them in Elasticsearch through grok filtering. Searching for a traceId in Elasticsearch lets one trace the logs for a request across all services, so debugging becomes a lot faster too. The combination of centralized logging and distributed tracing has turned out to be quite a useful tool for debugging requests across services in the application logs.
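
For illustration, assuming Spring Cloud Sleuth (or equivalent instrumentation) is on the classpath, it populates the SLF4J logging MDC with the current trace context, so the IDs can be referenced in the log pattern and picked up by the grok filter. The class name is hypothetical and the MDC keys depend on the Sleuth version.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Sketch: Sleuth fills the SLF4J MDC per request, so every log line written
// while serving a traced request can carry the same traceId and spanId.
public class PurchaseAuditLogger {

    private static final Logger log = LoggerFactory.getLogger(PurchaseAuditLogger.class);

    public void logPurchaseAttempt(long userId) {
        // MDC keys are "traceId"/"spanId" in newer Sleuth versions and
        // "X-B3-TraceId"/"X-B3-SpanId" in older ones.
        log.info("purchase attempt user={} traceId={} spanId={}",
                userId, MDC.get("traceId"), MDC.get("spanId"));
    }
}
```

Including the same MDC keys in the log4j/logback pattern (for example %X{traceId} %X{spanId}) is what lets the grok filter index them, after which a single traceId search in Kibana stitches a request together across services.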

Monitoring and alerting

We needed to set up basic monitoring to be able to generate alerts and have real-time charts for NOC teams to monitor. A further use case was gathering data around application, database and instance behaviour and using it for the self-healing service architecture. After considering various options, we finalized Prometheus, Alertmanager and Grafana. We piped in data from two sources: Spring actuator metrics and nginx logs.

The setup is very simple and only requires changes to the pom and yml files.
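
As an illustrative sketch, assuming Spring Boot with Micrometer’s Prometheus registry (micrometer-registry-prometheus in the pom and the prometheus actuator endpoint exposed in yml), custom application metrics can be published alongside the default actuator metrics like this; the metric and class names are hypothetical.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Service;

// Anything registered on the MeterRegistry is exposed on the
// /actuator/prometheus endpoint and becomes scrapeable by Prometheus.
@Service
public class PurchaseMetrics {

    private final Counter purchaseFailures;

    public PurchaseMetrics(MeterRegistry registry) {
        this.purchaseFailures = Counter.builder("purchase.failures")
                .description("Number of failed purchase attempts")
                .register(registry);
    }

    public void recordFailure() {
        purchaseFailures.increment();
    }
}
```

Prometheus can then scrape the endpoint, Alertmanager can fire alerts on the resulting series and Grafana can chart them for the NOC dashboards.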

 

Conclusion

Setting up this microservice architecture has built a strong foundation for scaling up. Both business requirements and the tech architecture need to be resilient to unforeseen and unpredictable issues. Designing a loosely coupled system enables one to make such changes easily, with controlled releases and fewer rollbacks.

The next phase involves setting up a self-healing deployment pipeline. Using Docker, ECS, Nginx and Consul, it will orchestrate the complete deployment, load balancing as per health score and continuous monitoring. With ML models on this data, it can be used to off-load some of the decision making and also to find anomalies in different flows in the system, be it incoming requests, sudden spikes in machine metrics or error counts. We will soon share a detailed post on this topic as well.