Designing Self-healing Systems


A self-healing system can take the necessary steps on its own to recover from a broken state. This post is an extension of an earlier article about building RIVIGO’s microservices stack and talks about how we are building intelligent self-healing systems.

 

Introduction

In a recent post, we covered how we are moving to a microservice architecture at RIVIGO. In simple terms, a microservice architecture is a way to break a monolithic service into loosely coupled services that communicate with each other through simple APIs. While there are several obvious advantages to a microservice architecture, there are also some challenges that come with it:

  • Higher chance of failures – This is simple probability. If a service is down for one hour every seven days, the probability of finding it down in a given hour is 1/(24*7). With two such services, the probability that at least one of them is down in that hour is (2/(24*7)) – (1/(24*7))², and with 50 such services the probability is much higher (see the worked example after this list).
  • Service communication – Dependent services need to interact with each other. The lack of a robust platform can add manual operational overhead and decrease system availability. This problem becomes more complex in dynamic and distributed environments where service IPs change very frequently. A robust service discovery and load balancing system should be in place to enable inter-service communication.
  • Complex deployment – There are typically several dependencies among services that call for coordination during deployment, making it fairly complex.
  • Complex testing – Testing a microservices-based application can be cumbersome. With the monolithic approach, one would just need to launch a package on an application server and ensure its connectivity with the underlying database. In this architecture, however, each dependent service needs to be confirmed as working before one can start testing.
  • More coding – Since there are dependencies, a fallback plan needs to be developed for each dependent service in case it stops responding.
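To put the first point in perspective: if each of n independent services is down with probability p in a given hour, the probability that at least one of them is down in that hour is 1 – (1 – p)^n. With p = 1/(24*7) ≈ 0.6% and n = 50, this comes to 1 – (167/168)^50 ≈ 0.26, i.e. roughly a one-in-four chance that some service is broken in any given hour, even though each individual service is down for only one hour per week.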

Most of these challenges are related to dependency management and service breakdowns. To overcome these, we needed to build a robust platform for dependency management (which includes inter-service communication) and for self-healing (where misbehaving services can be replaced).

With regards to inter-service communication, we used a combination of internal SSO, Feign client and Hystrix for all real-time transaction flows. All non-transactional and event-based communication is done via Kafka. We are trying to minimize real-time flows as they affect the resilience of the application. For event-based communication, we have also been exploring reactive microservices.

As for designing self-healing systems, we took inspiration from multiple sources, including Alibaba’s AI-driven infrastructure for its ‘Singles Day Sale’ and the idea of automating test cases to some extent. The true test of success of this implementation, though, would be deploying Chaos Monkey in production.

In this post, we will talk about how we implemented this architecture.

 

Automated Deployment

A large number of services means complex deployment. Hence, there must be a smooth and fault-tolerant deployment strategy. Below are some of the required characteristics of a microservice:

  • Is relatively small and technology agnostic
  • Can be independently deployed and developed
  • Enables ease in scaling development
  • Enables improved fault isolation

The first step was to containerize each service. There was no doubt that Docker containers were the best fit given the aforementioned characteristics of microservices. Below are some of the key reasons why:

  • Resource Utilization – Docker containers comprise just the application and its dependencies, nothing more and nothing less. Each container runs as an isolated process in user space on the host operating system, sharing the kernel with other containers. It therefore enjoys the resource isolation and allocation benefits of virtual machines while being much more portable and efficient.
  • Application Portability – Docker puts the application and all of its dependencies into a container that is portable across platforms, Linux distributions and clouds.
  • Compatibility – This solution supports a microservices architecture very well. These services are built around business capabilities and are independently deployable by fully automated deployment machinery. Each microservice can be deployed without interrupting the other microservices, and containers provide an ideal environment for service deployment in terms of speed, isolation and lifecycle management. It is easy to deploy new versions of services inside containers.

 

Service Discovery

A microservice container can be deployed on any box, which makes the IP and port of the service dynamic in nature. This creates the need for a service registry and discovery platform through which dependent services can discover and communicate with each other. The concept is not new; many such tools existed long before Docker was created.

However, containers brought the need for such tools to a completely new level. The basic idea behind service discovery is for each new instance of a service (or an application) to be able to identify its current environment and store that information. The storage itself is a registry, usually in key/value format. Since discovery is often used in distributed systems, the registry needs to be scalable, fault tolerant and distributed among all nodes in the cluster. The primary use of such a store is to provide the IP and port of a service to all interested parties that might need to communicate with it. This data is often extended to other types of information. Discovery tools tend to provide an API that can be used by a service to register itself, as well as by others to find information about that service.
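To make the register-and-lookup pattern concrete, here is a deliberately minimal, in-memory sketch of the idea (real registries such as Consul are distributed, replicated and health-checked, which this toy version ignores):

```python
# Toy sketch of the register/lookup idea behind service discovery.
# Real registries (Consul, Eureka, Zookeeper) are distributed, fault tolerant
# and attach health checks to every entry; this shows only the key/value shape.
from collections import defaultdict

class ServiceRegistry:
    def __init__(self):
        # service name -> set of "ip:port" entries
        self._entries = defaultdict(set)

    def register(self, service, ip, port):
        self._entries[service].add(f"{ip}:{port}")

    def deregister(self, service, ip, port):
        self._entries[service].discard(f"{ip}:{port}")

    def lookup(self, service):
        # interested parties ask for all known instances of a service
        return sorted(self._entries[service])

registry = ServiceRegistry()
registry.register("service-a", "10.0.1.12", 8080)
registry.register("service-a", "10.0.1.13", 8080)
print(registry.lookup("service-a"))  # ['10.0.1.12:8080', '10.0.1.13:8080']
```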

Out of many open source service discovery platforms available, we shortlisted Eureka, Consul and Zookeeper for consideration. We finally decided to go ahead with Consul because it:

  • Is distributed, highly available, and scalable
  • Provides a strongly consistent data store that uses a gossip protocol to communicate and form dynamic clusters, based on the Serf library
  • Provides a hierarchical key/value store
  • Provides native support for multiple data centers
  • Enables service lookups using the DNS protocol
  • Is not resource hungry and provides service discovery out of the box
  • Has a very useful utility, consul-template, for writing files with values obtained from Consul, which we use to rewrite NGINX conf files
  • Has a great web UI through which one can view all services and nodes, monitor health checks and their statuses, read and set key/value data as well as switch from one datacenter to another

 

Architecture Overview

The diagram below shows a bird’s-eye view of the system. We have used NGINX as the load balancer.

As soon as a new container for service A comes up, the container makes calls to get the external IP and port on which the service is listening. With this information, the container calls Consul’s register service. Consul-template watches for any changes with respect to the addition or deletion of containers for existing services, or the addition of any new services. On finding any changes, it rewrites the nginx.conf file and reloads it. Any service that uses service A now finds the new container discoverable via Consul.
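A minimal sketch of that registration step, using Consul’s HTTP agent API from Python (the service name, address, port and health-check endpoint here are illustrative, not our actual configuration):

```python
# Sketch: a container registering itself with the local Consul agent on startup.
# Service name, ID, address, port and health-check path are illustrative values.
import requests

def register_with_consul(consul_host, service_name, service_id, address, port):
    payload = {
        "Name": service_name,   # e.g. "service-a"
        "ID": service_id,       # unique per container instance
        "Address": address,     # external IP the container resolved at startup
        "Port": port,           # external port the service is listening on
        "Check": {
            # Consul marks the instance critical if this endpoint stops responding
            "HTTP": f"http://{address}:{port}/health",
            "Interval": "10s",
            "DeregisterCriticalServiceAfter": "1m",
        },
    }
    resp = requests.put(f"http://{consul_host}:8500/v1/agent/service/register",
                        json=payload)
    resp.raise_for_status()

# register_with_consul("localhost", "service-a", "service-a-7f3b", "10.0.1.12", 32768)
```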

Change Management

Google’s site reliability team has found that roughly 70% of outages are caused by changes to a live system. In a microservices architecture, services are mutually inter-dependent. To minimize the risk of issues, implementing change management strategies and automated rollouts is necessary.

For example, when you deploy new code or change a configuration, you should apply the changes to a subset of the instances gradually (also known as a canary deployment), monitor them, and even automatically revert the deployment if it has a negative effect on your key metrics. These observations should be automated as well, which is explained later in this post.
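A hedged sketch of that rollout loop is below; the deploy, rollback and error-rate hooks, the batch sizes and the thresholds are all illustrative assumptions rather than our actual deployment tooling:

```python
# Sketch of a canary rollout: deploy to a growing subset of instances,
# watch a key metric after each step, and roll back everything if it degrades.
# deploy(), rollback() and error_rate() are illustrative callables supplied by the caller.
import time

def canary_rollout(instances, deploy, rollback, error_rate,
                   batches=(0.1, 0.5, 1.0), max_error_rate=0.02, soak_seconds=300):
    done = []
    for fraction in batches:
        target = instances[:max(1, int(len(instances) * fraction))]
        for instance in target:
            if instance not in done:
                deploy(instance)
                done.append(instance)
        time.sleep(soak_seconds)            # let the new version take traffic
        if error_rate(done) > max_error_rate:
            for instance in done:           # negative effect on the key metric: revert
                rollback(instance)
            return False                    # rollout aborted
    return True                             # rollout completed on all instances
```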

Monitoring and Alerting

Given the dynamic nature of containers, monitoring and alerting becomes a big challenge. We chose Prometheus to handle all monitoring and alerting (via Alertmanager).

Prometheus is based on a pull architecture. One of the main concerns with a pull architecture is the scrape configuration. In the case of Prometheus, the service needs to be restarted to reload a new config (or tag-based config loading needs to be done). This becomes error-prone and tedious when services are set up on a cluster where IPs are not fixed. Recently, Prometheus released a version with new service discovery mechanisms: in addition to DNS-SRV records, it now supports Consul out of the box, and a file-based interface allows you to connect your own discovery mechanisms as well. Over time, the plan is to add other common service discovery mechanisms to Prometheus.

For the scope of this post, we will focus mainly on discovery with Consul. To dynamically discover containers from Consul, all we need to add is a couple of lines to the Prometheus config.
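For illustration, a scrape config of roughly this shape does the job (the service name and Consul address match the example below; the Consul API port and exact layout are assumptions):

```yaml
# Illustrative Prometheus scrape config using Consul service discovery.
scrape_configs:
  - job_name: 'app'
    consul_sd_configs:
      - server: '54.169.145.93:8500'   # Consul server; 8500 is the default API port
        services: ['app']              # only discover instances of the 'app' service
```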

Prometheus keeps checking Consul at 54.169.145.93 for instances of the service app. If some instances go down, Prometheus removes them from its target list and stops scraping their metrics. The tags of each Consul node are concatenated with a configurable separator and exposed through the __meta_consul_tags label. Similarly, various other Consul-specific meta labels are also provided. Using Prometheus’s Alertmanager, we can easily set up alerts on all the emitted metrics.

 

Self-healing

As explained in the introduction of this post, the chances of failure in a microservices architecture can be high given the inter-dependent nature of the stack.

Typically, whenever a service misbehaves, an email alert is triggered. The alert includes graphs, which are then carefully evaluated and a decision on fixing the instance is taken accordingly.

Imagine a situation when there are 50 services and 3 or 4 instances running for every one of them. This means there are about 200 instances running at any given time. Examining each alert received from these services, doing an RCA and taking a corrective action manually is not scalable. It is also error-prone. This is why we needed to build a system that can detect broken nodes and repair them itself without any human intervention.

Self-healing can help recover an application. Simply put, it means that an application can take the necessary steps on its own to recover from a broken state. It should be implemented as an external system that watches the instances’ health and restarts them or reduces their load when they are in a broken state for a configurable period.

Score Engine

The score engine is the nucleus of this self-healing system and removes any human intervention. It calculates a score for every service it is monitoring based on different input metrics. The score can be a weighted score, a linear equation or a time-decay function; the choice mostly depends on the use case of the metric being tracked. Based on the score, a container is labeled as being in the green, yellow or red zone.

A poor score means that a container is in the red zone and needs to be removed from load balancing. No more load is diverted to it until the container recovers and gets itself relabeled green. A medium score means that a container is in the yellow zone: it can still serve traffic, but with less than its normal load. A good score labels the container green and represents good health and full utilization.

Workflow

  • The score engine starts with a static config of the services that need to be scored
  • Each service registers itself in Consul
  • The score engine keeps checking Consul for newly registered (or deregistered) instances
  • Once the IPs of registered services are received, it starts hitting them for system and application metrics
  • It calculates health scores from the metric values and exposes them through an HTTP API
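A rough sketch of that loop is below, using Consul’s health API to find instances; the service list, the metrics endpoint and the compute_score() hook are illustrative assumptions:

```python
# Sketch of the score engine's discovery-and-polling loop.
# SERVICES, the /metrics path and compute_score() are illustrative assumptions.
import time
import requests

CONSUL = "http://54.169.145.93:8500"
SERVICES = ["service-a", "service-b"]   # static config fed at boot time

def discover(service):
    # Consul's health API lists the currently registered, passing instances of a service
    nodes = requests.get(f"{CONSUL}/v1/health/service/{service}",
                         params={"passing": "true"}).json()
    return [(n["Service"]["Address"], n["Service"]["Port"]) for n in nodes]

def poll_loop(compute_score, scores, interval=30):
    while True:
        for service in SERVICES:
            for address, port in discover(service):
                raw = requests.get(f"http://{address}:{port}/metrics").text
                # health score for this instance, later served over the engine's HTTP API
                scores[(service, address, port)] = compute_score(raw)
        time.sleep(interval)
```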

Score Calculation

Parameters to be considered at the application level, with their weights (w), are:

  • Error logging (w: .1)
  • Database connections (w: .2)
  • Overall response time (w: .1)
  • 200 ok vs other response codes (w: .1)
  • CPU utilization (w: .1)
  • Memory utilization (w: .1)
  • Total number of active/blocked and waiting threads (w: .2)

Parameters considered at system level are:

  • Process up or not (w: 1)
  • OS memory (w: .5)
  • Number of file descriptors (w: 1)
  • CPU usages (w: .5)

Note that these metrics and values are not fixed and may be different for different services.

First, individual scores are calculated for all the metrics mentioned above. Their weighted mean then gives the overall score of the application. The weighted mean is calculated using the pre-assigned weights mentioned above.

Score of each metric = (current value – good value) / (critical value – good value)

Here, we cap the score at 1. Note that the critical and good values will vary for every metric and application; they are fed statically to the score engine at boot time.

Application score = Σ (pi × wi) / Σ wi, where pi is the score of metric i and wi is its weight

If the generated score is below the critical threshold, the system takes the application instance out of rotation. If the score is above the good threshold, no action is taken. Finally, if the score falls between the critical and good thresholds, the load is decreased as per the following formula.

Load decrease = (score – critical_score) / (max_score – critical_score)
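Putting the three formulas above into code, a minimal sketch looks like the following; the threshold values and weights used in the example are illustrative only:

```python
# Minimal sketch of the scoring math described above.
def metric_score(current, good, critical):
    # (current value - good value) / (critical value - good value), capped at 1
    return min(1.0, (current - good) / (critical - good))

def application_score(metric_scores, weights):
    # weighted mean: sum(p_i * w_i) / sum(w_i)
    return sum(p * w for p, w in zip(metric_scores, weights)) / sum(weights)

def load_decrease(score, critical_score, max_score):
    # fraction applied when the overall score falls between the critical and good thresholds
    return (score - critical_score) / (max_score - critical_score)

# Illustrative numbers only: two metrics with weights 0.1 and 0.2
scores = [metric_score(450, good=200, critical=800),   # e.g. response time in ms
          metric_score(70, good=40, critical=90)]      # e.g. CPU utilization in %
print(application_score(scores, weights=[0.1, 0.2]))   # ~0.54
```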

Precision is the most critical factor here. If the score engine overpredicts, we will keep serving requests to broken nodes and if it underpredicts, there will be unnecessary restarts. Currently, the engine does all the math based on the static data that is fed at boot-time. This implies the need to do a config change every time a service has to be onboarded. Can this be automated as well? Can the score engine be so intelligent that it can find out threshold values on its own? This is possible. For this, the score engine must run AI algorithms to decide what is good or bad for the service. We will write about this in a separate post soon.

Stitching it together

We have created two modes of deployment: dev and prod. The dev mode is necessary for QA testing. The main difference between the dev and prod setups is the number of setups: there is only one prod setup at a time, but there can be multiple dev setups at any time, one per developer.

The overall workflow for prod setup is:

  • A Jenkins job creates the Docker image and pushes it to the Docker registry
  • ECS pulls the Docker image and spawns a new container
  • As soon as the container is up, it registers itself in Consul
  • Once registered in Consul,
    1. Prometheus discovers the new instance and starts pulling its metrics
    2. The score engine discovers the new instance, starts pulling its metrics and exposes a health score over its HTTP API
    3. The scheduler discovers the new instance too
  • The scheduler fetches the new container’s score from the score engine and also registers the container in the NGINX upstream
    1. If the score is bad, it deletes the container’s entry from the NGINX upstream list and kills the container
    2. If the score is average, it reduces the container’s load weightage
    3. If the score is good, it keeps checking periodically
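A hedged sketch of the scheduler’s decision step is below; the score-engine endpoint, the thresholds and the NGINX/container helper functions are illustrative assumptions, not our actual implementation:

```python
# Sketch of the scheduler acting on a container's health score.
# The score-engine URL, thresholds and the helper callables are assumptions.
import requests

GOOD, CRITICAL = 0.75, 0.4   # illustrative score thresholds

def check_container(score_engine_url, service, container_id,
                    remove_from_upstream, set_upstream_weight, kill_container):
    score = requests.get(
        f"{score_engine_url}/score/{service}/{container_id}").json()["score"]
    if score < CRITICAL:
        # red zone: take the container out of the NGINX upstream and recycle it
        remove_from_upstream(service, container_id)
        kill_container(container_id)
    elif score < GOOD:
        # yellow zone: keep serving, but with a reduced load weightage
        set_upstream_weight(service, container_id, weight=score)
    # green zone: nothing to do; the scheduler simply checks again on the next cycle
```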

The dev workflow is more or less similar to the prod setup. In this case, the service registers itself as <Service_name_JiraId>. If there are dependent services, they are all spawned together under the same JIRA ID.

 

Conclusion

We have created a mathematical model and the basic logic for our machines to take data-driven decisions. The next steps for us are to aggregate more data on RCAs, prepare a feature map for the application- and system-level models, and drive a self-evolving decision-making system through data.

 

Lakshay Kaushik, Lead DevOps Engineer, and Parijat Rathore, Director of Engineering, have also contributed to this article.