Implementing Centralized Logging at RIVIGO

28 June 2018
Technology

An overview of RIVIGO’s centralized logging infrastructure, which is resilient, efficient and well suited to our stack.

 

Introduction

When we were moving to a microservices-based architecture at RIVIGO, one of the biggest challenges we had to resolve was log management (access logs, application logs, system metrics etc.) and making it easy to debug through those logs. The usual approach of tailing the default log file (catalina.out in Spring or web-1.log in Spring Boot applications) works well for monolithic services but not for microservices.

Consider, as an illustration, a system set up with four different microservices. For this example, let us assume each of these microservices has two machines behind a load balancer, making eight machines in total. If a request spanning these microservices fails, analyzing the cause of failure would mean logging into all eight machines and checking whether the failure was due to behaviour specific to a particular microservice. Given how time-consuming this is, the process is sub-optimal.

Centralized logging is the solution here, since it routes all the application logs from different sources to a single sink. Laced with metadata like hostname, IP and log level, it becomes a strong tool for developers to evaluate application performance and debug errors.

The concept has been around for a while now, and there are plenty of methods and LaaS (Logging as a Service) providers. In our quest to find the best approach, we evaluated several alternatives and finally created our own pipeline that is simple yet robust, well suited to our stack and usable across different teams at RIVIGO. In this post, we discuss the alternatives we considered and the overall architecture and impact of our centralized logging and tracing infrastructure.

 

The six different logging setups we evaluated and why we went with a seventh

 

1) ELK

Commonly known as the Elastic Stack, this is the most popular stack for log management. Elasticsearch is a NoSQL datastore built on the Lucene search engine. Logstash is a log pipeline tool that accepts inputs from various sources, executes different transformations, and exports the data to various targets. Kibana is a visualization layer that works on top of Elasticsearch. Together, the three make a stack that solves the logging problem very efficiently. Logstash routes application logs from the machines; its grok filters turn the raw log data into JSON documents, which are then indexed into Elasticsearch. Kibana visualizes the Elasticsearch data through its interactive dashboards and performs useful analytics on it.

However, memory consumption is a major issue with Logstash. It is highly resource hungry: CPU utilization can go up to 70-80% even when it is idle. This meant there was no way we could run an individual instance on every node. The other issue is that Logstash suffers when there is a sudden increase in log input. Even if we ran a cluster of Logstash nodes, we would waste a lot of compute power as we on-boarded more microservices and scaled that cluster to handle the load.

2) Filebeat + ELK

Filebeat is a part of the Beats module of Elastic Stack. Beats is a platform for lightweight shippers that send data from edge machines to Logstash and Elasticsearch.

Unlike Logstash, Filebeat can run on every node as it does not have resource consumption issues. It is fast, lightweight and consumes very little memory. Sudden spikes in incoming data are also not a problem, as it picks up updates to the file periodically and forwards them. But while Filebeat is an excellent shipper, we would still be relying on Logstash for grok filters.

3) Kafka + ELK

To avoid installing Filebeat on every node, one could use Kafka. In this case, the application pushes logs to a Kafka topic and Logstash reads from that topic. A logstash.conf file for such a setup would look like the one given below.
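
Since the original snippet is not reproduced here, the following is a minimal sketch of such a logstash.conf, assuming a topic named app-logs carrying JSON-encoded log events; the broker and Elasticsearch host names are illustrative.

    input {
      kafka {
        bootstrap_servers => "kafka1:9092,kafka2:9092"
        topics            => ["app-logs"]
        group_id          => "logstash-consumers"
        codec             => "json"
      }
    }

    filter {
      # grok/mutate filters to structure the raw log events would go here
    }

    output {
      elasticsearch {
        hosts => ["http://es-node1:9200"]
        index => "app-logs-%{+YYYY.MM.dd}"
      }
    }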

Kafka can shield Logstash from a surge in incoming logs. At the same time, if our Logstash node (or cluster) goes down, the logs are retained in the topic and can be read by Logstash (the consumer) later, once it recovers.

4) Filebeat + Kafka + ELK

A combination of the two previous alternatives, this makes the system more robust and resilient, but again at the cost of resource-hungry components.

By now, it was clear to us that there were three problems that needed to be solved.

  • We needed an efficient shipper which is light and fast
  • We needed to ensure that logs are not lost when the consumers of the shipper are unavailable
  • We needed a way to convert the raw logs into JSON. This may seem simple but has some complications. The most critical was handling a stack trace as a single entity rather than as n separate log events, without using the grok filters of Logstash. It was certain we did not want to go with Logstash, as it had too many performance-related issues, as covered above.

5) JsonLogAppender + Filebeat + Kafka + Kafka Connect + Elasticsearch + Kibana

We needed to eliminate Logstash, which meant the conversion to JSON had to happen at the application level. Since we were using log4j to generate logs, we could have them written as JSON by using a JSON appender.

The following configuration needs to be added to the log4j.properties file.
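
The original snippet is not reproduced here; the following is a sketch of what such a log4j.properties could look like, assuming the open source log4j-jsonevent-layout project's JSONEventLayoutV1 as the JSON layout. The file path and appender name are illustrative.

    # Write every log event as a single JSON document per line
    log4j.rootLogger=INFO, json

    log4j.appender.json=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.json.File=/var/log/myapp/application.json
    log4j.appender.json.DatePattern='.'yyyy-MM-dd
    # JSONEventLayoutV1 comes from the log4j-jsonevent-layout project
    log4j.appender.json.layout=net.logstash.log4j.JSONEventLayoutV1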

Kafka Connect is an open source framework that integrates Kafka with other systems. We used the Elasticsearch (Sink) connector to index JSON documents directly into Elasticsearch.
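
As a rough sketch, a standalone-mode properties file for the Confluent Elasticsearch sink connector could look like the following; the connector name, topic and connection URL are illustrative.

    name=es-sink-app-logs
    connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
    tasks.max=2
    topics=app-logs
    connection.url=http://es-node1:9200
    type.name=log
    # the log events are plain JSON without a schema or a meaningful key
    key.ignore=true
    schema.ignore=true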

However, there is a problem with the above approach. There are very few open source JSON appenders available, and their performance and reliability are questionable. While trying this out, we noticed that stack traces would get broken up, or some logs would be missed altogether at times. This was not a system we could take to production.

Moreover, we needed some custom grok filtering specific to our application, because the logs contained information (MDC data, Zipkin trace etc.) that needed to be handled.

The objective now was to have grok filters without Logstash, and the solution was Ingest Nodes. Ingest is a lightweight solution for pre-processing and enriching documents within Elasticsearch before they are indexed. It was first introduced with the ES 5.0 release and can be used to perform common data transformations and enrichments. It has more than 20 built-in processors like grok, date, gsub, lowercase/uppercase, remove and rename.

6) Filebeat + Kafka + Kafka Connect + Elastic Ingest Node + Elasticsearch + Kibana

Looking at all the pros and cons of the different setups we had examined, this had the potential to be the ideal architecture. The Filebeat and Kafka setup gives us read guarantees and spike tolerance, Kafka retains unconsumed logs, and Kafka Connect provides a scalable and secure streaming data pipeline. Grok filters are provided by the Ingest Nodes. At the end, we have Elasticsearch as the store and Kibana for visualizations and analytics.

However, Kafka Connect does not have native support for ES Ingest Nodes. This meant that kafka-connect-es (our connector to the ES sink) could not be configured to route documents through an ingest pipeline.

 

RIVIGO’s centralized logging infrastructure overview

 

Based on our evaluation of the different approaches, we realized that if we took Kafka out of the picture, we would still have a pretty neat architecture. We would write logs to a file first, with custom rolling and retention policies, and Filebeat would ship them directly to the ES ingest nodes, where we could structure them into JSON documents and index them into Elasticsearch. If the Elasticsearch nodes went down for some reason, the Filebeat registry would maintain a pointer to the last successfully shipped line and continue from there once the nodes came back up. This was a system that would consume minimal resources and remain resilient even under large volumes of logs.

Below is a sample logback.xml file.
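
The original file is not reproduced here; below is a minimal sketch of such a logback.xml, assuming a size-and-time based rolling policy and Sleuth’s X-B3 MDC keys in the log pattern. The paths, sizes and retention window are illustrative.

    <configuration>
      <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/myapp/application.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
          <fileNamePattern>/var/log/myapp/application.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
          <maxFileSize>100MB</maxFileSize>
          <!-- retention: keep 7 days of logs, capped at 5GB in total -->
          <maxHistory>7</maxHistory>
          <totalSizeCap>5GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
          <!-- MDC values such as the Zipkin trace/span IDs become part of every line -->
          <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [%X{X-B3-TraceId},%X{X-B3-SpanId}] %logger{36} - %msg%n</pattern>
        </encoder>
      </appender>
      <root level="INFO">
        <appender-ref ref="FILE"/>
      </root>
    </configuration>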

A sample config for filebeat (to log both application logs and access logs) is given below.
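
The following is a sketch of what such a filebeat.yml could look like, written in the Filebeat 5.x prospector style; the paths, Elasticsearch host and pipeline names are illustrative. The multiline settings fold stack trace lines back into the event that produced them, and the conditional pipelines route application and access logs to their respective ingest pipelines.

    filebeat.prospectors:
      # Application logs, with stack traces folded into the originating event
      - input_type: log
        paths:
          - /var/log/myapp/application.log
        fields:
          log_type: application
        multiline.pattern: '^\d{4}-\d{2}-\d{2}'
        multiline.negate: true
        multiline.match: after

      # Nginx access logs
      - input_type: log
        paths:
          - /var/log/nginx/access.log
        fields:
          log_type: nginx_access

    output.elasticsearch:
      hosts: ["es-node1:9200"]
      pipelines:
        - pipeline: application-logs
          when.equals:
            fields.log_type: application
        - pipeline: nginx-access-logs
          when.equals:
            fields.log_type: nginx_access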

A simple ingest pipeline template to get started is given below.

For application logs:
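
As a minimal sketch, assuming log lines in the pattern produced by the logback encoder shown earlier (the grok pattern must be adapted to the actual log format):

    PUT _ingest/pipeline/application-logs
    {
      "description": "Parse application log lines into structured fields",
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": [
              "%{TIMESTAMP_ISO8601:log_timestamp} \\[%{DATA:thread}\\] %{LOGLEVEL:level}\\s+\\[%{DATA:trace_id},%{DATA:span_id}\\] %{DATA:logger} - %{GREEDYDATA:log_message}"
            ]
          }
        },
        {
          "date": {
            "field": "log_timestamp",
            "formats": ["yyyy-MM-dd HH:mm:ss.SSS"],
            "target_field": "@timestamp"
          }
        }
      ]
    }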

For Nginx Access logs:
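
A sketch for the access logs, assuming Nginx writes its default combined log format:

    PUT _ingest/pipeline/nginx-access-logs
    {
      "description": "Parse Nginx access log lines (combined log format)",
      "processors": [
        {
          "grok": {
            "field": "message",
            "patterns": ["%{COMBINEDAPACHELOG}"]
          }
        },
        {
          "date": {
            "field": "timestamp",
            "formats": ["dd/MMM/yyyy:HH:mm:ss Z"],
            "target_field": "@timestamp"
          }
        }
      ]
    }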

 

Distributed Tracing

A centralized logging infrastructure needs to be augmented with a tracing infrastructure. Its purpose is to help debug performance issues when there is no stack trace or exception, and to visualize system behaviour that would otherwise become unmanageable when there are hundreds of distributed microservices.

At RIVIGO, our tracing stack is made up of Zipkin, spring-cloud-sleuth and Cassandra. A typical tracing system is made up of six components.

  • Trace instrumentation
  • Reporter
  • Collector
  • Storage
  • Search API
  • UI

Trace instrumentation and reporter

This is required for instrumenting the application’s code, generating valid traces and reporting them to the collector. An instrumentation library should have very low overhead and must report the trace information to the collector out-of-band so that it has zero impact on the application’s performance.

We use spring-cloud-sleuth to instrument our applications automatically, with zero manual configuration from developers. RPC calls spanning microservices are traced by introducing the custom HTTP headers X-B3-TraceId and X-B3-SpanId. By including Sleuth in the classpath, all HTTP calls, DB calls and ThreadPoolExecutor Runnables are automatically instrumented through Spring’s auto-configuration. spring-cloud-sleuth-zipkin is used to report trace/span data to Zipkin. Both of these libraries have been found to have very low overhead and performance impact in our production services.
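
As a rough sketch of this wiring, assuming Maven and the Sleuth 1.x property names (the Zipkin URL and sampler percentage are illustrative), the dependencies could look like:

    <!-- pom.xml: Sleuth instrumentation plus the Zipkin reporter -->
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-zipkin</artifactId>
    </dependency>

with the reporter pointed at Zipkin through application.properties:

    spring.zipkin.baseUrl=http://zipkin.internal:9411
    # sample only a fraction of requests to keep the reporting overhead low
    spring.sleuth.sampler.percentage=0.1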

Collector

Zipkin has a collector daemon that receives trace/span data sent asynchronously from all the microservices. The collector validates this uncorrelated data and correlates it.

Storage, Search API and UI

Cassandra has served our needs well as a storage backend for Zipkin. Zipkin’s query service is used by its web UI, which visualizes the correlated span/trace information as familiar waterfall graphs, similar to the network latency waterfalls in Chrome developer tools. This is where we can find out which microservice/RPC/DB call is taking longer than usual, along with other metadata like host information, service name, controller class and method. It comes in very handy while debugging production performance issues.
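
As a minimal sketch of how the stock Zipkin server can be pointed at a Cassandra cluster, using the environment variables the Zipkin server exec jar understands (host names and keyspace are illustrative):

    # Run the Zipkin server (collector, query API and UI) against Cassandra 3
    STORAGE_TYPE=cassandra3 \
    CASSANDRA_CONTACT_POINTS=cassandra1.internal,cassandra2.internal \
    CASSANDRA_KEYSPACE=zipkin2 \
    java -jar zipkin.jar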

 

Impact

After carefully evaluating the various options, we created our own centralized logging infrastructure that is resilient, efficient and well suited to our stack. Beyond the obvious advantages, it brought the following additional benefits.

  • We used MDC to put basic information like application ID, username, email, Zipkin trace and Zipkin span into every log event, which helped ensure that the entire API call journey could be traced (a sketch of this follows after this list)
  • We used Nginx, so the access logs from Nginx could also be indexed using the same architecture, with some modifications to the Filebeat configuration file
  • Most importantly, there has been close to a 40% reduction in the time the engineering team spends debugging issues. This is the singular fitness test for any logging infrastructure.
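
The sketch below shows one way the MDC enrichment could be done, assuming a Spring OncePerRequestFilter and hypothetical header names; Sleuth adds the trace/span IDs to the MDC on its own.

    import java.io.IOException;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.slf4j.MDC;
    import org.springframework.web.filter.OncePerRequestFilter;

    // Illustrative filter that puts request metadata into the MDC so that every
    // log line written while serving the request carries it.
    public class MdcEnrichmentFilter extends OncePerRequestFilter {

        @Override
        protected void doFilterInternal(HttpServletRequest request,
                                        HttpServletResponse response,
                                        FilterChain chain) throws ServletException, IOException {
            try {
                // hypothetical header names, used here only for illustration
                MDC.put("appId", request.getHeader("X-App-Id"));
                MDC.put("username", request.getHeader("X-User-Name"));
                chain.doFilter(request, response);
            } finally {
                // threads are pooled, so the context must be cleared after each request
                MDC.remove("appId");
                MDC.remove("username");
            }
        }
    }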

 

Gautham Goli, Software Development Engineer at RIVIGO, has also contributed to this article.