How we improved TATs and reduced drudgery and costs with Terraform & Spot Instances

26 April 2019
Technology

This article gives an overview of how the DevOps team at RIVIGO leverages Terraform and Spot Instances on AWS to improve responsiveness (single-click provisioning) while making its own work more meaningful by removing drudgery.

 

The problem

 

At RIVIGO, we run our infrastructure on AWS, and given our microservices architecture and rapid growth, developers are regularly creating new services. Each service needs multiple environments (dev/stage/QA/prod), and for historical reasons, we have been using Elastic Beanstalk. (In hindsight, and after close evaluation, we have realised that Elastic Beanstalk isn't the best choice, and we are gradually moving away from it.)

Earlier it was manageable to create these instances using the AWS console, but sometime late last year we realised that most of our time was being spent on the AWS console, and our developers complained of slow turnaround times.

We spent many hours navigating the AWS console, provisioning resources (Elastic Beanstalk, EC2, RDS, ElastiCache, EMR) across multiple VPCs and regions, and although we followed the same steps every time, something was frequently missed (e.g. incorrect tagging or subnets). Besides, the team was not able to focus on other improvement initiatives, as the majority of its time went into serving developer requests. Instead of rigidly documenting the exact provisioning process or adding more DevOps engineers to the mix, we decided there had to be a better way.

 

 

What is Terraform and why is it on the rise?

 

Terraform is a tool for building, changing and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions. It supports all the major cloud vendors, so it is agnostic to which cloud you run on, and it can help with automation.

So, without going too deep into the theory of what Terraform really does, let's get into what actually worked for us 🙂

 

Technologies/Packages/APIs used:

1.) Terraform

2.) Jenkins

3.) Bash Scripting

4.) Spotinst (a third party service for spot instance creation & management)

5.) Postman (API testing tool)

6.) Mailing/communication platform (Gmail, Slack, Teams/Outlook)

We at RIVIGO believe that a GUI is way simpler than a CLI. So, rather than typing Terraform CLI commands and filling in custom inputs, which usually takes a lot of time and bandwidth, we integrated Terraform with Jenkins so that we can fill in our requirements (such as Instance_Name, Instance_type, Subnet, VPC, Key_pair, etc.) as Jenkins fields and let Jenkins variables do the work for us.

We use a bash script in the backend that takes the input from Jenkins and writes it to a terraform.tfvars file. The Jenkins input variables get appended to the Terraform user-input variable file (terraform.tfvars). Check out the image below for a better understanding.

We just need to set this up once, and we will have the power to create unlimited EC2 instances in any region in under 10 seconds. Going ahead, we need to make sure we have provided string/choice/extended-choice parameters in the Jenkins build for all the variables mentioned below.

  • Environment
  • Team
  • Service
  • App
  • InstanceType
  • KeyPair
  • AMIName
  • Root_Vol_Size
  • Subnet
  • Count
  • Tags (and so on and so forth)

Make sure to take the above variables as input in the Jenkins job and run a background script (be it in Bash, Python or Ruby) that keeps appending the values of these variables to the terraform.tfvars file and then performs terraform plan and terraform apply. Make sure to delete terraform.tfstate in every build, else you will end up modifying your existing infrastructure and land yourself in trouble.
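The background script described above can be sketched as a minimal bash wrapper. This is only a sketch: the update_var helper, the parameter names and the file path are illustrative, and the Terraform steps are guarded behind an opt-in flag so the sketch can be dry-run on a machine without Terraform installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append or update one key in terraform.tfvars (illustrative helper).
update_var() {
  local key="$1" value="$2" file="${3:-terraform.tfvars}"
  if grep -q "^${key} *=" "$file" 2>/dev/null; then
    # Key already present: rewrite its line with the new value.
    sed -i "s|^${key} *=.*|${key} = \"${value}\"|" "$file"
  else
    # Key not present yet: append it.
    printf '%s = "%s"\n' "$key" "$value" >> "$file"
  fi
}

# In the Jenkins build, each parameter arrives as an environment
# variable; the defaults here are only so the sketch runs standalone.
update_var Environment  "${Environment:-dev}"
update_var InstanceType "${InstanceType:-t3.medium}"
update_var Subnet       "${Subnet:-subnet-placeholder}"

# Drop any stale state so existing infrastructure is never modified,
# then plan and apply. Guarded so the sketch is safe to dry-run.
if [ "${RUN_TERRAFORM:-0}" = "1" ]; then
  rm -f terraform.tfstate terraform.tfstate.backup
  terraform init -input=false
  terraform plan -out=tfplan
  terraform apply -input=false tfplan
fi
```

In the real Jenkins job you would add one update_var call per build parameter and set RUN_TERRAFORM=1.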

Example:

sed -i "/Environment=.*/c Environment=\"${Environment}\"" terraform.tfvars

In the above example, we are taking the variables as input from the user. For example, in the Environment variable we type whether it is prod, dev or staging. Terraform modules have been mapped accordingly in the backend, where prod is mapped to the prod VPCs and dev to the dev VPCs. You DON'T need to enter the VPC ID, thus eliminating the hassle of remembering VPC IDs and manually entering input after terraform apply.
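The mapping itself can be as simple as a Terraform variable map. This is a sketch only — the variable names and VPC IDs are illustrative placeholders, not our actual configuration:

```hcl
# variables.tf (illustrative)
variable "Environment" {}

# One VPC per environment; the IDs here are placeholders.
variable "vpc_ids" {
  type = "map"
  default = {
    dev     = "vpc-aaaa1111"
    staging = "vpc-bbbb2222"
    prod    = "vpc-cccc3333"
  }
}

# main.tf — look up the right VPC from the environment the user typed:
#   vpc_id = "${lookup(var.vpc_ids, var.Environment)}"
```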

When you run the build after entering all the values, an instance will be created.

Your instance name will be <Team>-<App>-<Service>-<Environment> (we have configured Name = “<Team>-<App>-<Service>-<Environment>” for the EC2 instance in the main.tf file).

 

For Eg.:

Team: DevOps (the team name: DevOps, frontend, backend, data-science, product)

App: Kafka  (the app running on the server: this can be company jargon and varies from app to app)

Service: UI  (the type of service: UI, web, backend, scheduler, worker, docker)

Environment: dev  (the environment in which you want the instance to be created: dev/staging/prod)

 

Hence, the name of the EC2 instance would be devops-kafka-ui-dev. You can replace the values according to your own team, app, service and environment.
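In main.tf, this naming convention boils down to interpolating the four inputs into the Name tag. A sketch with illustrative variable and resource names (not our exact file):

```hcl
resource "aws_instance" "this" {
  ami           = "${var.ami_id}"
  instance_type = "${var.InstanceType}"
  subnet_id     = "${var.Subnet}"
  count         = "${var.Count}"

  tags {
    Name        = "${var.Team}-${var.App}-${var.Service}-${var.Environment}"
    Environment = "${var.Environment}"
    Team        = "${var.Team}"
  }
}
```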

 

 

Pipelining flow for the whole process:

 

Pipelines are the best way to have control over a build. They help ensure that the resources to be created are the only ones we require, and the pipeline takes us through every step of the process one by one. Hence, it is best to be ready with a Groovy script that handles the Terraform flow.
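A minimal declarative Jenkinsfile for this flow might look like the following. This is a sketch: the stage names are illustrative, and append-tfvars.sh is a hypothetical stand-in for the background script that writes the Jenkins parameters into terraform.tfvars.

```groovy
pipeline {
  agent any
  stages {
    stage('Checkout & Vars') {
      steps {
        checkout scm
        sh './append-tfvars.sh'   // hypothetical: writes Jenkins params to terraform.tfvars
      }
    }
    stage('TF Plan') {
      steps {
        // Delete stale state so existing infrastructure is untouched.
        sh 'rm -f terraform.tfstate && terraform init && terraform plan -out=tfplan'
      }
    }
    stage('Approval') {
      steps {
        script {
          // Pause until a human has reviewed the plan output.
          input message: 'Apply this plan?', ok: 'Apply'
        }
      }
    }
    stage('TF Apply') {
      steps {
        sh 'terraform apply tfplan'
      }
    }
  }
}
```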

As soon as we fill in all the values in Jenkins and trigger the build, the values get appended to terraform.tfvars, and terraform init is followed by terraform plan. The terraform plan output is mailed to us so we can verify that these are the only resources we want to create; after approval by any DevOps member, the build resumes and performs terraform apply, and the Terraform output is mailed to us again.

In this stage, the Jenkins agent checks out our git repo into its workspace, takes the input from the Jenkins variables and appends it to terraform.tfvars.

 

In this stage, Jenkins performs the terraform init operation after deleting the current terraform.tfstate, so that the current infrastructure doesn't get deleted/modified and new infrastructure can be built seamlessly.

The approval stage is optional, but it pauses the pipeline and waits for the approval of a human operator before continuing. In this example, it gives us a chance to check the output of terraform plan before applying it. Note that the script function lets us break out of the simplified declarative pipeline syntax and write some native Groovy.

We get a nice visual representation of the pipeline from this UI. When we get to the approval stage, Jenkins will wait for our input. At this point, we can click back on the TF Plan stage and make sure we're happy with the plan that is going to be applied. If you have set up a remote state backend, Terraform should know there are no changes to make unless you've altered your Terraform code.

Please note: we are not keeping the terraform.tfstate in our scenario (as it is not needed in our case). In case you need it, add a step to upload terraform.tfstate to an S3 bucket (or to a remote server, depending on your choice).
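If you do want to keep state, the standard approach is a remote backend rather than uploading the file yourself. A sketch — the bucket, key and region are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket = "my-terraform-state"                          # placeholder bucket
    key    = "ec2/devops-kafka-ui-dev/terraform.tfstate"   # placeholder key
    region = "ap-south-1"                                  # placeholder region
  }
}
```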

 

 

Time to save some money 

Importing the created Terraform EC2 instance (stateful) into Spotinst.

 

Once we have a server running at On-Demand cost, we use an API from Spotinst to convert it to Spot pricing. Spot Instances are significantly cheaper than On-Demand instances, but they can be stopped at any time. Using Spotinst made it easier for us to manage these, but you can also use the corresponding offering directly from AWS.

First of all, you need to go to your Spotinst account and create an API token. Keep the token and the account ID (e.g. act-123456) handy, as you need to use both inside the Spotinst API request.

Spotinst terms its resources Elastigroups, so specifically, we need to create an Elastigroup in Spotinst. You will, of course, name the Elastigroup after the EC2 instance that was created, so we need the instance_name and the instance_id. You can get both of them by parsing either “terraform output” or the terraform.tfstate file, using jq in bash.
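Extracting those two values with jq can be sketched as below. The output names (instance_name, instance_id) are whatever you declared in your own outputs.tf, and the canned JSON stands in for a real “terraform output -json” so the sketch runs standalone; the exact Spotinst endpoint path should be taken from the Spotinst API documentation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pull one value out of the JSON that "terraform output -json" prints.
tf_output() {
  jq -r --arg key "$1" '.[$key].value'
}

# In the Jenkins job: TF_JSON="$(terraform output -json)"
# Canned example here so the sketch runs without Terraform:
TF_JSON='{"instance_name":{"value":"devops-kafka-ui-dev"},"instance_id":{"value":"i-0abc123"}}'

INSTANCE_NAME="$(echo "$TF_JSON" | tf_output instance_name)"
INSTANCE_ID="$(echo "$TF_JSON" | tf_output instance_id)"
echo "Importing ${INSTANCE_NAME} (${INSTANCE_ID}) into Spotinst"

# Then call the Spotinst API with both values. The endpoint path below is
# a placeholder — consult the Spotinst API docs for the exact one; the
# Bearer token and account ID come from your Spotinst console:
#   curl -s -X POST "https://api.spotinst.io/<import-endpoint>?accountId=${SPOTINST_ACCOUNT_ID}" \
#     -H "Authorization: Bearer ${SPOTINST_TOKEN}" \
#     -H "Content-Type: application/json" \
#     -d "{\"instanceId\": \"${INSTANCE_ID}\", \"name\": \"${INSTANCE_NAME}\"}"
```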

Download the Postman tool and test the above API in it. Make sure the status code returned is 200 OK.

We will cover best practices for running Terraform in a team and how to remotely run Terraform in automation in our future articles.

Keep Terraform-ing.