How & Why We Built Our Own Elasticsearch as a Service

23 August 2018
Technology

An overview of how we built our own ‘Elasticsearch as a Service’ to power all site search and our centralized logging Elasticsearch cluster.

Introduction:

At Rivigo, multiple applications use Elasticsearch as a core infrastructure engine to solve numerous problems: centralized logging infrastructure, search capability in applications, and storing time-series data such as consignment and audit logs. All these use cases involve querying/searching structured as well as unstructured data on specific keywords and getting results with low latency. Hence, Elasticsearch has proved to be very promising for such use cases.


Why Elasticsearch as a Service?

Previously, we built our POC clusters manually, but since the Elasticsearch cluster architecture can change based on use case and team, we would have ended up doing heavy ops work creating Elasticsearch clusters over and over again.

Considering this, and to reduce that drudgery, we started looking at options like AWS Elasticsearch Service. Though AWS Elasticsearch Service was great in many ways (easy to deploy, operate and scale), we soon realised its shortcomings in our context and decided to move away from it and build our own Elasticsearch as a Service. There were several reasons for us to discard AWS and build our own; some of them were:

  1. There was no constraint on the number of dedicated master nodes: it allowed creating a cluster with an even number of dedicated master nodes, which in practice can lead to a split-brain scenario.
  2. A Hot-Warm Architecture cannot be built on AWS Elasticsearch Service.
  3. Configuring and tuning cluster-level parameters is very limited, and most of these parameters are blocked.
  4. Third-party community plugins, such as read-only REST or SQL on Elasticsearch, cannot be installed on the cluster.
  5. AWS Elasticsearch Service instances are costlier than on-demand EC2 instances of the same capacity, and we can reduce EC2 cost further by buying more RIs (Reserved Instances).
  6. There are a lot more instance types available on EC2 than on AWS Elasticsearch Service.
  7. Only a few metrics are available in CloudWatch for monitoring AWS Elasticsearch Service.
  8. The maximum size of an EBS volume is 1 TB in AWS Elasticsearch Service.


How we built our own Elasticsearch as a Service:

For building our own Elasticsearch service we evaluated Terraform and CloudFormation. Considering the possibly destructive nature of Terraform (a single change can result in the loss of an entire cluster's data), we chose to build our Elasticsearch service on CloudFormation. We are currently in the process of finding a solid solution for controlled data management; once we crack that, we will move to Terraform as the tool of choice for our Elasticsearch service.

During the initial phase of building the Elasticsearch service we were hand-writing CloudFormation templates in YAML, but soon realized it was not very productive and was costing us a lot of time. Since we realized this early on, we started researching better alternatives. That is when we came across a Python library, “troposphere”, which can generate CloudFormation templates from plain, pythonic code. It was a gamechanger: with just a few lines of Python we could generate templates in JSON or YAML.
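As a rough illustration of what that looks like (the AMI, instance type and tags below are placeholders, not our actual values), a minimal troposphere sketch:

    from troposphere import Tags, Template
    from troposphere import ec2

    t = Template()

    # One EC2 instance destined to become an Elasticsearch data node.
    t.add_resource(ec2.Instance(
        "EsDataNode0",
        ImageId="ami-xxxxxxxx",        # placeholder AMI
        InstanceType="r4.xlarge",      # placeholder instance type
        Tags=Tags(Name="es-data-0", Role="es-data"),
    ))

    # Emit the CloudFormation template (to_yaml() is also available in recent versions).
    print(t.to_json())

troposphere mirrors CloudFormation resources as Python classes, so loops, functions and conditionals replace copy-pasted YAML.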

One of the constraints of CloudFormation templates is that they don’t support dynamic requirements (whether dedicated master nodes are needed or not, and if so how many; a single-node Elasticsearch cluster for devs or a multi-node cluster), but we wanted a single command that could deploy any type of cluster. It was here that the shift to Python played a pivotal role: we wrote the code in such a way that any type of cluster can be created with a single command, along the lines of the sketch below.
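A hedged sketch of that idea, not our actual tooling: command-line flags decide the cluster shape, and the generator only adds dedicated master resources when they are asked for (names, sizes and AMIs are illustrative):

    import argparse

    from troposphere import Tags, Template
    from troposphere import ec2

    def build_cluster(data_nodes, master_nodes):
        """Return a CloudFormation template for the requested cluster shape."""
        t = Template()
        for i in range(master_nodes):          # 0 => no dedicated masters (dev clusters)
            t.add_resource(ec2.Instance(
                "EsMaster%d" % i,
                ImageId="ami-xxxxxxxx", InstanceType="c4.large",    # placeholders
                Tags=Tags(Name="es-master-%d" % i, Role="es-master"),
            ))
        for i in range(data_nodes):
            t.add_resource(ec2.Instance(
                "EsData%d" % i,
                ImageId="ami-xxxxxxxx", InstanceType="r4.xlarge",   # placeholders
                Tags=Tags(Name="es-data-%d" % i, Role="es-data"),
            ))
        return t

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Generate an ES cluster template")
        parser.add_argument("--data-nodes", type=int, default=1)
        parser.add_argument("--master-nodes", type=int, default=0)
        args = parser.parse_args()
        print(build_cluster(args.data_nodes, args.master_nodes).to_json())

With a hypothetical script like this, a dev cluster is just --data-nodes 1, while a production stack might be --data-nodes 6 --master-nodes 3.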

We were not looking at CloudFormation just to launch a bunch of servers with certain configurations; we wanted to create the complete stack, from launching servers to deploying the Elasticsearch cluster itself. This was made possible by an easy integration of cloud-init into the template using cfn-init.
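For illustration only (the package names, commands and AMI are assumptions, and the cfn-init path assumes an Amazon Linux style AMI), this is roughly how cfn-init metadata and the bootstrap user data can be attached to an instance in troposphere:

    from troposphere import Base64, Join, Ref, Template
    from troposphere import cloudformation, ec2

    t = Template()

    node = ec2.Instance(
        "EsDataNode0",
        ImageId="ami-xxxxxxxx",            # placeholder AMI
        InstanceType="r4.xlarge",
        # cfn-init metadata: what to install and run once the instance is up.
        Metadata=cloudformation.Init({
            "config": cloudformation.InitConfig(
                commands={
                    "01_install_es": {"command": "yum install -y elasticsearch"},
                    "02_start_es": {"command": "systemctl enable elasticsearch --now"},
                },
            )
        }),
        # User data runs cfn-init, which then applies the metadata above.
        UserData=Base64(Join("", [
            "#!/bin/bash -xe\n",
            "/opt/aws/bin/cfn-init -v",
            " --stack ", Ref("AWS::StackName"),
            " --resource EsDataNode0",
            " --region ", Ref("AWS::Region"), "\n",
        ])),
    )
    t.add_resource(node)
    print(t.to_json())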

Once the code was ready, it was time to move it towards production and, most importantly, to test it. After battle-testing the code numerous times, we launched our production cluster, which is the backbone of our centralized logging.


The Final Architecture:

Keeping high availability and scalability in mind was crucial while building the centralized logging architecture. After a lot of discussion, and drawing on our past experience, we decided to build the Hot-Warm Architecture. The architecture keeps two types of data nodes in the Elasticsearch cluster (tagged using node attributes):

  1. Hot nodes: These nodes are responsible for real-time indexing and therefore hold only the current day’s index. Since indexing needs a lot of CPU and disk IO, we decided to use high-end machines with SSDs.
  2. Warm nodes: Every night, after the new index is created, the previous day’s index is moved to these nodes, where indices are retained much longer, with retention periods set per product team.

Below is a sample elasticsearch.yml setting which decides whether the node is hot or warm.
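The snippet below follows the common hot/warm node-attribute convention; the exact attribute name in our templates may differ:

    # elasticsearch.yml on a hot node (warm nodes set the attribute to "warm")
    node.attr.box_type: hot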

In our efforts to take high availability and scalability into consideration while building the architecture, we broke our cluster into more scalable components and made it production ready.

  1. Dedicated master nodes: Master nodes are the soul of an Elasticsearch cluster, and keeping them up all the time covers our HA needs. Rather than hardcoding master IPs in elasticsearch.yml, we used the EC2 discovery plugin, removing the need for persistent IPs for master nodes.
  2. Dedicated ingest/client nodes: In our architecture we removed Logstash (the “L” in ELK) and rewrote our grok patterns as Elasticsearch ingest node pipelines (a sketch of such a pipeline follows this list). We kept these nodes in an auto-scaling group, taking care of our scalability needs, and an internal ELB sits on top of them, so there is a single URL for interacting with the Elasticsearch cluster.
  3. Dedicated Kibana node: A dedicated Kibana node is used, and it is also in an ASG. Internally it points to the URL of the internal ELB.
  4. Hot nodes: These are high-end machines, each backed with 500GB of SSD as data volumes (set up as LVM, which makes growing a volume extremely easy with just a few commands).
  5. Warm nodes: These are low-end machines backed with 2TB of SSDs.
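As an illustration of the ingest-node approach in point 2 (the pipeline name, grok pattern and ELB hostname are made up, not our production values), a grok pipeline can be registered through the official Python client roughly like this:

    from elasticsearch import Elasticsearch

    # Single entry point for the cluster: the internal ELB in front of the ingest nodes.
    es = Elasticsearch(["http://internal-es-elb.example.local:9200"])

    # A grok processor doing the parsing that a Logstash filter used to do.
    es.ingest.put_pipeline(
        id="app-logs",
        body={
            "description": "Parse application logs on the ingest nodes",
            "processors": [
                {
                    "grok": {
                        "field": "message",
                        "patterns": [
                            "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
                        ],
                    }
                }
            ],
        },
    )

Documents indexed with the pipeline parameter set to app-logs are then parsed on the ingest nodes rather than on a separate Logstash tier.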


Elasticsearch index lifecycle management:

The last piece of work was lifecycle management of the indices in the Elasticsearch cluster. This was done keeping the following things in mind:

  1. Our dedicated Kibana node also hosts Elasticsearch Curator.
  2. Index rollover from hot nodes to warm nodes is performed by Curator, scheduled via crontab.
  3. To keep backups of indices after a certain interval of time, we use the Elasticsearch S3 snapshot (backup and restore) plugin.
  4. Curator is responsible for backing up indices to S3 using the S3 plugin after a certain number of days. A sketch of the rollover and snapshot operations follows this list.
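For illustration (index, attribute and repository names below are placeholders, and in practice Curator's allocation and snapshot actions do this rather than a hand-written script), these are the two underlying operations:

    from datetime import date, timedelta

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://internal-es-elb.example.local:9200"])
    yesterday = (date.today() - timedelta(days=1)).strftime("%Y.%m.%d")

    # 1. Hot -> Warm rollover: require warm nodes, so shards relocate off the hot tier.
    es.indices.put_settings(
        index="logs-%s" % yesterday,
        body={"index.routing.allocation.require.box_type": "warm"},
    )

    # 2. Backup: snapshot the index into an S3-backed repository (repository-s3 plugin).
    es.snapshot.create(
        repository="s3_backup",
        snapshot="logs-%s" % yesterday,
        body={"indices": "logs-%s" % yesterday},
    )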