A self-healing system can take the necessary steps on its own to recover from a broken state. This post is an extension of an earlier article about building RIVIGO’s microservices stack and talks about how we are building intelligent self-healing systems.
In a recent post, we covered how we are moving to a microservice architecture at RIVIGO. In simple terms, a microservice architecture breaks a monolithic service into loosely coupled services that communicate with each other through simple APIs. While there are several obvious advantages to a microservice architecture, it also comes with some challenges:
Most of these challenges relate to dependency management and service breakdowns. To overcome them, we needed to build a robust platform for dependency management (which includes inter-service communication) and for self-healing (where misbehaving services can be replaced).
For inter-service communication, we used a combination of internal SSO, Feign client and Hystrix for all real-time transactional flows. All non-transactional and event-based communication is done via Kafka. We are trying to minimize real-time flows, as they affect the resilience of the application. For event-based communication, we have also been exploring reactive microservices.
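To make the real-time path concrete, below is a minimal sketch of how such a call can be wired with a Feign client and a Hystrix fallback, assuming Spring Cloud OpenFeign with the feign.hystrix.enabled property turned on; the service name, endpoint and sentinel value are illustrative, not our actual APIs.

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Hypothetical client for a downstream "consignment-service". With Hystrix enabled,
// the fallback below is invoked when the call fails or the circuit is open,
// instead of cascading the failure upstream.
@FeignClient(name = "consignment-service", fallback = ConsignmentClient.ConsignmentFallback.class)
public interface ConsignmentClient {

    @GetMapping("/consignments/{id}/status")
    String getConsignmentStatus(@PathVariable("id") String id);

    @Component
    class ConsignmentFallback implements ConsignmentClient {
        @Override
        public String getConsignmentStatus(String id) {
            // Degrade gracefully with a sentinel value rather than throwing.
            return "UNKNOWN";
        }
    }
}
```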
As for designing self-healing systems, we took inspiration from multiple sources, including Alibaba’s AI-driven infrastructure for its ‘Singles Day’ sale and the idea of automating test cases to some extent. The true test of success of this implementation, though, would be deploying Chaos Monkey in production.
In this post, we will talk about how we implemented this architecture.
A large number of services means complex deployments. Hence, there must be a smooth and fault-tolerant deployment strategy. Below are some required characteristics of microservices:
The first step was to containerize each service. There was no doubt that Docker containers were the best fit given the aforementioned characteristics of microservices. Below are some of the key reasons why:
Service Discovery
A microservice container can be deployed on any box, which makes the IP and port of the service dynamic in nature. This creates the need for a service registry and discovery platform through which dependent services can communicate with and discover each other. The concept is not new; many such tools existed long before Docker was created.
However, containers brought the need for such tools to a completely new level. The basic idea behind service discovery is for each new instance of a service (or an application) to be able to identify its current environment and store that information. Storage itself is performed in a registry, usually in key/value format. Since discovery is often used in distributed systems, the registry needs to be scalable, fault-tolerant and distributed among all nodes in the cluster. The primary use of such a registry is to provide the IP and port of a service to all interested parties that might need to communicate with it, though this data is often extended to other types of information. Discovery tools tend to provide an API that a service can use to register itself, and that others can use to look up information about that service.
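Concretely, with a registry like Consul (which we ended up choosing, as discussed next), registering an instance boils down to a single HTTP call to the local agent. The following is a rough, self-contained sketch in Java; the service name, address, port and health-check URL are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a container registering itself with a Consul agent over its HTTP API.
public class ConsulRegistration {
    public static void main(String[] args) throws Exception {
        // Illustrative payload: name, externally reachable address/port and a health check.
        String payload = """
            {
              "Name": "service-a",
              "Address": "10.0.1.15",
              "Port": 8080,
              "Check": { "HTTP": "http://10.0.1.15:8080/health", "Interval": "10s" }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8500/v1/agent/service/register"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Consul register returned HTTP " + response.statusCode());
    }
}
```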
Out of the many open-source service discovery platforms available, we shortlisted Eureka, Consul and Zookeeper for consideration. We finally decided to go ahead with Consul because it:
The diagram below shows a bird's-eye view of the system. We have used NGINX as the load balancer.
As soon as a new container for service A comes up, it makes a call to fetch the external IP and port on which the service is listening. With this information, the container calls Consul's register API. consul-template watches for changes, such as the addition or deletion of containers for existing services or the addition of new services. On finding a change, it rewrites the nginx.conf file and reloads NGINX. Any service that uses service A now finds the new container discoverable via Consul.
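The template that consul-template renders can be as small as the sketch below (the service name and file names are illustrative): it emits one NGINX server line per healthy instance of service A registered in Consul.

```
# service-a.ctmpl: rendered into an NGINX upstream block by consul-template
upstream service-a {
  {{ range service "service-a" }}
  server {{ .Address }}:{{ .Port }};
  {{ end }}
}
```

consul-template is then run with a command along the lines of consul-template -template "service-a.ctmpl:/etc/nginx/conf.d/service-a.conf:nginx -s reload", so that every registration or deregistration rewrites the config and reloads NGINX.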
Change Management
Google’s site reliability team has found that roughly 70% of outages are caused by changes to a live system. In a microservices architecture, services are mutually interdependent, so to minimize the risk of issues, change management strategies and automated rollouts are necessary.
For example, when you deploy new code or change a configuration, you should apply the changes gradually to a subset of the instances (also known as a canary deployment), monitor them, and even automatically revert the deployment if it has a negative effect on your key metrics. These observations should be automated as well, as explained later in this post.
Monitoring and Alerting
Given the dynamic nature of containers, monitoring and alerting become a big challenge. We chose Prometheus to handle all monitoring and alerting (alert-manager) related problems.
Prometheus is based on a pull architecture. One of the main concerns with a pull architecture is the scrape configuration: in the case of Prometheus, loading a new config requires a service restart (or tag-based config loading). This becomes error-prone and tedious when services are set up on a cluster where IPs are not fixed. Recently, Prometheus released a version with new service discovery mechanisms. In addition to DNS-SRV records, it now supports Consul out of the box, and a file-based interface allows you to connect your own discovery mechanisms as well. Over time, the plan is to add other common service discovery mechanisms to Prometheus.
For the scope of this post, we will focus mainly on discovery with Consul. To dynamically discover containers from Consul, all we need to add is a couple of lines to the Prometheus config.
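A scrape configuration along these lines is enough for Prometheus to pull its targets from Consul; this is a sketch in which the job name and tag filter are illustrative, and 8500 is assumed to be the Consul agent's default HTTP port.

```yaml
scrape_configs:
  - job_name: 'app'
    consul_sd_configs:
      - server: '54.169.145.93:8500'   # Consul agent; 8500 assumed as the default HTTP port
        services: ['app']              # only discover instances registered as "app"
    relabel_configs:
      # Optional: keep only targets carrying an illustrative "prod" Consul tag
      - source_labels: ['__meta_consul_tags']
        regex: '.*,prod,.*'
        action: keep
```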
Prometheus keeps checking for instances of the service app at the Consul location 54.169.145.93. If some instances are down, Prometheus removes them from its target list and stops scraping their metrics. The tags of each Consul node are concatenated by a configurable separator and exposed through the __meta_consul_tags label, and various other Consul-specific meta labels are provided as well. Using the alert-manager in Prometheus, we can easily set up alerts on all the collected metrics.
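As an illustration (assuming the Prometheus 2.x rule-file format and a hypothetical http_requests_total counter exposed by the service), an alerting rule routed through the alert-manager might look like this:

```yaml
groups:
  - name: app-health
    rules:
      - alert: HighErrorRate
        # Metric name and threshold are illustrative and depend on what the app exposes.
        expr: >
          sum by (instance) (rate(http_requests_total{job="app", status=~"5.."}[5m]))
          / sum by (instance) (rate(http_requests_total{job="app"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx rate on {{ $labels.instance }}"
```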
As explained in the introduction of this post, the chances of failure in a microservices architecture can be high given the inter-dependent nature of the stack.
Typically, whenever a service misbehaves, an email alert is triggered. The alert includes graphs, which are then carefully evaluated and a decision on fixing the instance is taken accordingly.
Imagine a situation when there are 50 services and 3 or 4 instances running for every one of them. This means there are about 200 instances running at any given time. Examining each alert received from these services, doing an RCA and taking a corrective action manually is not scalable. It is also error-prone. This is why we needed to build a system that can detect broken nodes and repair them itself without any human intervention.
Self-healing can help recover an application. Simply put, it means that an application can take the necessary steps on its own to recover from a broken state. It should be implemented as an external system that watches instance health and restarts instances, or reduces their load, when they remain in a broken state for a configurable period.
Score Engine
The score engine is the nucleus of this self-healing system and removes any need for human intervention. It calculates a score for every service it monitors based on different input metrics. The score can be a weighted score, a linear equation or a time-decay function, depending mostly on the use case of the metric being tracked. Based on the score, a container is labeled as being in the green, yellow or red zone.
A poor score means a container is in the red zone and needs to be removed from load balancing; no more load is diverted to it until the container recovers and is relabeled green. A medium score means a container is in the yellow zone: it can still work, but should take less than its normal load. A good score labels the container green, representing good health and full utilization.
Workflow
Score Calculation
Parameters considered at the app level are:
Parameters considered at the system level are:
Note that these metrics and values are not fixed and may be different for different services.
First, individual scores are calculated for each of the metrics mentioned above. Their weighted mean then gives the overall score of the application; the weighted mean is calculated using the pre-assigned weights mentioned above.
Score of each metric = (critical value − current value) / (critical value − good value)
Here, we cap the score at 1, so a metric at its good value (or better) scores 1 and a metric at its critical value scores 0. Note that the critical and good values will vary for every metric and application; they are fed statically to the score engine at boot time.
Application score = Σ(pᵢ × wᵢ) / Σwᵢ, where pᵢ is the score of the i-th metric and wᵢ is its pre-assigned weight.
If the generated application score is less than the critical score, the system takes the application instance out of rotation. If the score is greater than the good score, no action is taken. Finally, if the score lies between the critical and good scores, the load routed to the instance is reduced as per the following formula.
Fraction of normal load served = (score − critical_score) / (max_score − critical_score)
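A minimal sketch of this calculation in Java is shown below; the metric names, thresholds, weights and zone boundaries are illustrative, not the values used in production, where they are fed to the score engine at boot time.

```java
import java.util.List;

public class ScoreEngine {

    // A metric with its current reading, its good and critical thresholds, and its weight.
    record Metric(String name, double current, double good, double critical, double weight) {
        // Per-metric score: 1 at the good value, 0 at the critical value, capped at 1.
        double score() {
            return Math.min((critical - current) / (critical - good), 1.0);
        }
    }

    // Illustrative zone boundaries on the application score.
    static final double CRITICAL_SCORE = 0.3;  // below this: red zone, out of rotation
    static final double GOOD_SCORE = 0.7;      // above this: green zone, full load
    static final double MAX_SCORE = 1.0;

    // Weighted mean of the per-metric scores.
    static double applicationScore(List<Metric> metrics) {
        double weighted = metrics.stream().mapToDouble(m -> m.score() * m.weight()).sum();
        double weights = metrics.stream().mapToDouble(Metric::weight).sum();
        return weighted / weights;
    }

    // Fraction of the normal load the instance should keep serving.
    static double loadFraction(double score) {
        if (score < CRITICAL_SCORE) return 0.0;  // red: remove from load balancing
        if (score > GOOD_SCORE) return 1.0;      // green: full utilization
        return (score - CRITICAL_SCORE) / (MAX_SCORE - CRITICAL_SCORE);  // yellow: reduced load
    }

    public static void main(String[] args) {
        List<Metric> metrics = List.of(
                new Metric("p99_latency_ms", 650, 200, 1000, 3),
                new Metric("error_rate_pct", 2.0, 0.5, 5.0, 4),
                new Metric("cpu_util_pct", 75, 50, 95, 2));
        double score = applicationScore(metrics);
        System.out.printf("score=%.2f, load fraction=%.2f%n", score, loadFraction(score));
    }
}
```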
Precision is the most critical factor here. If the score engine over-predicts, we will keep serving requests to broken nodes; if it under-predicts, there will be unnecessary restarts. Currently, the engine does all the math based on static data fed at boot time. This implies the need for a config change every time a service has to be onboarded. Can this be automated as well? Can the score engine be intelligent enough to find the threshold values on its own? This is possible. For this, the score engine must run AI algorithms to decide what is good or bad for a service. We will write about this in a separate post soon.
Stitching it together
We have created two modes of deployment: dev and prod. Having a dev mode is necessary for QA testing. The main difference between the dev and prod setups is in their number: there is only one prod setup at a time, but there can be multiple dev setups at any time, one for each developer.
The overall workflow for the prod setup is:
The dev workflow is more or less similar to the prod setup. In this case, the service registers itself as <Service_name_JiraId>. If there are dependent services, they are all spawned together with the same JIRA ID.
We have created a mathematical model and basic logic for our machines to take data-driven decisions. The next steps for us are to aggregate more data on RCAs, prepare a feature map of application- and system-level models, and drive a self-evolving decision-making system through data.
Lakshay Kaushik, Lead DevOps Engineer and Parijat Rathore, Director of Engineering, have also contributed to this article.