Setting up a loosely coupled system is critical to laying a strong foundation for scale. This post gives an overview of the design and implementation of RIVIGO’s microservices stack.
When we decided to build a scalable service stack at RIVIGO, we considered several options and concerns, including microservices, centralized logging, monitoring, alerting, auditing, Spring Cloud and Netflix OSS.
Most business problems have a solution that requires different components to talk to each other and arrive at a decision or transaction to perform. This is what happens in the physical world too. For example, unless there is money in the wallet, one cannot make a purchase. Hence, it makes logical sense to separate out the concerns in the system along the lines of the business components. In this example, if the logic for checking wallet balance changes in the future, there is no change in the logic for the purchase decision. Such separation was a key ask from our service stack. A microservice architecture fits the bill here, since it is driven by the Single Responsibility Principle.
In this post, we will cover how we implemented a microservice architecture at RIVIGO. Specifically, we will give an overview of designing and setting up a loosely coupled architecture from the word go.
The diagram below gives an overview of the stack setup at RIVIGO.
In the following sections, we will cover the basic considerations that we kept in mind while designing the complete stack.
The first step was to structure the overall code base in a way that helps achieve maximum reusability of common modules across services. Some basic modules, which are a part of almost every web application, were added to a common layer:
The release versions and snapshots for these modules are managed via an artifactory, and inclusion is easily managed via Maven.
Service discovery and communication
This brings us to the concept of client-side load balancing, where a client or service node knows all the IPs registered to a service domain and decides which IP to fire a request to. In an environment that scales up and down, nodes constantly come up or go down and need to register themselves against the service domain(s). The client-side load balancer then distributes incoming requests across them. The balancing strategy can be a default one, or one can build a custom one; a custom load balancer was a requirement for our self-healing service design.
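To make the idea concrete, here is a minimal sketch of a client-side round-robin load balancer. This is illustrative code, not our production implementation; the class and method names are assumptions, and a custom strategy (e.g. health-score based) would replace the round-robin choice.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the client keeps the list of IPs registered for a
// service domain and picks one for each outgoing request.
public class RoundRobinBalancer {
    private final List<String> nodes = new CopyOnWriteArrayList<>();
    private final AtomicInteger cursor = new AtomicInteger();

    // Called when the registry reports a node coming up or going down.
    public void register(String ip)   { nodes.add(ip); }
    public void deregister(String ip) { nodes.remove(ip); }

    // Round-robin choice; a custom strategy could override this.
    public String choose() {
        if (nodes.isEmpty()) throw new IllegalStateException("no nodes registered");
        int i = Math.floorMod(cursor.getAndIncrement(), nodes.size());
        return nodes.get(i);
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer();
        lb.register("10.0.0.1");
        lb.register("10.0.0.2");
        System.out.println(lb.choose()); // 10.0.0.1
        System.out.println(lb.choose()); // 10.0.0.2
        System.out.println(lb.choose()); // 10.0.0.1
    }
}
```

The key point is that the routing decision lives in the client, so no central load balancer becomes a single point of failure.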
The client-side load balancer can be a separate layer, or it can be managed in the application itself; Netflix OSS provides Ribbon for this use case, which can be bound to each service node as a sidecar. For our use case, we used Consul as the service discovery and registry server. The reasons for this choice will be covered in detail in future posts; the primary one is that Consul provides Consul Template, which we have used in our self-healing microservice architecture.
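As a rough illustration of service registration, the sketch below builds a registration payload for Consul's agent HTTP API (`PUT /v1/agent/service/register`). The service name, address, port and health-check path are illustrative assumptions; the request itself is left commented out since it needs a running Consul agent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: registering a service node with a local Consul agent.
public class ConsulRegistration {
    static String payload(String name, String id, String address, int port) {
        return String.format(
            "{\"Name\":\"%s\",\"ID\":\"%s\",\"Address\":\"%s\",\"Port\":%d," +
            "\"Check\":{\"HTTP\":\"http://%s:%d/health\",\"Interval\":\"10s\"}}",
            name, id, address, port, address, port);
    }

    public static void main(String[] args) {
        String body = payload("wallet-service", "wallet-1", "10.0.0.1", 8080);
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8500/v1/agent/service/register"))
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        // HttpClient.newHttpClient().send(req, ...) would fire the call;
        // omitted here since it requires a running Consul agent.
        System.out.println(body);
    }
}
```

With the health check attached, Consul deregisters unhealthy nodes on its own, which is what makes the self-healing behavior possible.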
For more libraries, you can refer to Spring Cloud, which provides a suite of tools for designing distributed systems.
Centralized logging and distributed traceability
The diagram below shows the pipeline that we chose to implement the logging.
A detailed post will follow soon on why we made this choice. Here is a brief excerpt from it –
“Filebeat helps throttle the shipping of data, as it uses a backpressure-sensitive protocol when sending data forward to account for higher volumes of data. There is also no loss of data, as logs are written to a file first. The major concern was eliminating Logstash to prevent unnecessary resource consumption, and that is achieved by using the inode. Moreover, Filebeat has configuration to optionally specify an ingest pipeline that processes data before dumping it into Elasticsearch indices. This means we can specify the necessary grok filters in the pipeline and add it to the Filebeat config file. Once the data is in Elasticsearch, all sorts of analytics and visualizations can be done through Kibana.”
Distributed tracing means being able to trace a request initiated by a client across all the services that serve it. For this we used Zipkin, a distributed tracing tool that integrates seamlessly with Spring-based services and provides a good web UI for request tracing. It is based on Dapper and HTrace and works by joining traces across HTTP request/message boundaries.
It binds a span and a trace to each request/thread/message. The key terms are:
SPAN – the basic unit of work, with a unique 64-bit identifier
TRACE – a tree-like data structure in which each node is a span
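The two terms above can be sketched as a small data model. This is an illustrative model of the span/trace relationship, not Zipkin's actual classes; all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative model: a span is a unit of work with a unique 64-bit id;
// a trace is the tree of spans sharing one traceId.
public class TraceModel {
    static final Random RNG = new Random();

    static class Span {
        final long traceId;       // shared by every span in the trace
        final long spanId;        // unique 64-bit identifier for this unit of work
        final Long parentSpanId;  // null for the root span
        final String name;
        final List<Span> children = new ArrayList<>();

        Span(long traceId, Long parentSpanId, String name) {
            this.traceId = traceId;
            this.spanId = RNG.nextLong();
            this.parentSpanId = parentSpanId;
            this.name = name;
        }

        // A downstream call becomes a child span in the same trace.
        Span child(String name) {
            Span s = new Span(traceId, spanId, name);
            children.add(s);
            return s;
        }
    }

    public static void main(String[] args) {
        Span root = new Span(RNG.nextLong(), null, "http:/purchase");
        Span wallet = root.child("wallet-service:/balance");
        System.out.println(wallet.traceId == root.traceId);      // true
        System.out.println(wallet.parentSpanId == root.spanId);  // true
    }
}
```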
Here is a sample demonstration of an incoming request being traced via Zipkin:
Besides enabling request tracing, it gives one more major advantage in centralized logging. We were able to insert the traceId and spanId into the application logs through log4j, and index them in Elasticsearch through grok filtering. Searching for a traceId in Elasticsearch lets one trace the logs for a request across all services, which also makes debugging a lot faster. The combination of centralized logging and distributed tracing has turned out to be quite a useful tool for debugging requests across services in application logs.
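The mechanism can be sketched as follows: a thread-local context carries the ids, and the log pattern stamps them onto every line (this is essentially what log4j's MDC does). The context keys, log format and example ids below are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of injecting traceId/spanId into every log line via a
// thread-local context, in the spirit of log4j's MDC.
public class TraceLogging {
    static final ThreadLocal<Map<String, String>> MDC =
        ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { MDC.get().put(key, value); }

    // Mimics a log4j pattern such as "[traceId=%X{traceId} spanId=%X{spanId}] %m"
    static String log(String message) {
        Map<String, String> ctx = MDC.get();
        return String.format("[traceId=%s spanId=%s] %s",
            ctx.getOrDefault("traceId", "-"),
            ctx.getOrDefault("spanId", "-"),
            message);
    }

    public static void main(String[] args) {
        put("traceId", "4bf92f3577b34da6");
        put("spanId", "00f067aa0ba902b7");
        System.out.println(log("wallet balance checked"));
        // [traceId=4bf92f3577b34da6 spanId=00f067aa0ba902b7] wallet balance checked
    }
}
```

A grok filter in the ingest pipeline can then pull the bracketed ids into their own Elasticsearch fields, making traceId searchable across all services.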
Monitoring and alerting
We needed to set up basic monitoring to be able to generate alerts and have real-time charts for NOC teams to monitor. A further use case was gathering data around application, database and instance behavior and using it for our self-healing service architecture. After considering various options, we finalized Prometheus, Alertmanager and Grafana. We piped in data from two sources – Spring Actuator metrics and Nginx logs.
The setup is very simple, requiring only changes in the pom and yml files.
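For reference, Prometheus scrapes metrics in a simple text exposition format, which is what the actuator's Prometheus endpoint produces. The sketch below emits a single counter in that format; the metric and label names are illustrative, not our actual metrics.

```java
// Sketch: a counter in Prometheus' text exposition format.
public class PrometheusTextFormat {
    static String counter(String name, String labelKey, String labelValue, double value) {
        // A TYPE comment line followed by one sample line with a label.
        return String.format("# TYPE %s counter%n%s{%s=\"%s\"} %s%n",
            name, name, labelKey, labelValue, value);
    }

    public static void main(String[] args) {
        System.out.print(counter("http_requests_total", "service", "wallet", 42.0));
    }
}
```

Prometheus periodically scrapes an endpoint serving lines like these, and Grafana then charts the stored time series.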
Setting up this microservice architecture has built a strong foundation for scale. Both business requirements and tech architecture need to be resilient to unforeseen and unpredictable issues. Designing a loosely coupled system enables one to make such changes easily, with controlled releases and fewer rollbacks.
The next phase involves setting up a self-healing deployment pipeline. Using Docker, ECS, Nginx and Consul, it will orchestrate the complete deployment, load balancing as per health score and continuous monitoring. With ML models on this data, it can also be used to off-load some of the decision making and to find anomalies in different flows in the system, be it incoming requests, sudden spikes in machine metrics or error counts. We will soon share a detailed post on this topic as well.