Running Production EKS on SPOT

Amazon Web Services (AWS) is a cloud computing platform for storing, managing, and processing data over the internet. Our organization runs a variety of applications on it, and over time we realized we could use the platform more cost-efficiently. This post describes how we reduced our AWS costs by combining spot and on-demand instances for production applications. Before diving in, let us look at what a spot instance is and why it matters.


What is a spot instance?

In a nutshell, a spot instance is one of the most effective ways to reduce your Amazon Elastic Compute Cloud (EC2) costs. A spot instance uses spare EC2 capacity, which AWS offers at a discount of up to 90% compared with the on-demand price. The trade-off is that AWS can reclaim a spot instance when it needs the capacity back, after a two-minute interruption notice. In Kubernetes (K8s), spot instances work well as worker nodes for workloads that tolerate such interruptions. A node is a worker machine in Kubernetes that can be virtual or physical, depending on the cluster. A node can host multiple pods and is managed by the Kubernetes control plane, which automatically schedules the pods across the nodes in the cluster.

How did we use spot instances in the Amazon EKS (Elastic Kubernetes Service) workload?

To reduce our AWS EC2 expenses, we configured the cluster step by step:

We created managed spot instance node groups.
We placed instance types with similar configurations in a single node group. For example, alongside an m5.large (2 vCPU/8 GiB RAM) instance type, add ones with the same vCPU and RAM values, such as m5a.large, m5n.large, and m4.large. Keeping node sizes uniform within a group lets Cluster Autoscaler predict how much capacity each new node adds.
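A node group like the one described above could be declared with eksctl, for instance. This is only an illustrative sketch: the choice of eksctl, the cluster name, region, node group name, sizes, and the lifecycle label are all assumptions, not details from our setup.

```yaml
# Hypothetical eksctl cluster config: one managed Spot node group
# whose instance types all provide 2 vCPU and 8 GiB of RAM.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster        # placeholder cluster name
  region: us-east-1         # placeholder region
managedNodeGroups:
  - name: spot-2vcpu-8gib   # placeholder; "spot" in the name helps name-based matching later
    spot: true              # request Spot capacity instead of On-Demand
    instanceTypes:
      - m5.large
      - m5a.large
      - m5n.large
      - m4.large
    minSize: 1
    maxSize: 10
    desiredCapacity: 2
    labels:
      lifecycle: spot       # placeholder label for targeting these nodes
```

Listing several equivalent instance types gives AWS more Spot pools to draw from, which lowers the chance that the whole group is reclaimed at once.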

Configured Kubernetes Cluster Autoscaler

We used Cluster Autoscaler (CA), which adjusts the number of nodes by changing the desired capacity of the underlying AWS Auto Scaling groups. CA adds nodes automatically when pods cannot be scheduled because the existing nodes lack resources. When there is more than one node group and CA needs to scale up the cluster because of unschedulable pods, it has to decide which group to expand. To tilt that decision, we can configure CA to always prefer adding Spot Instances over on-demand ones.

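CA uses an expander to choose which group to scale. On AWS, CA provides four expander strategies for selecting the node group to which new nodes will be added:

Random (default): picks a node group at random.
Most-pods: picks the node group that can schedule the most pending pods.
Least-waste: picks the node group that leaves the least idle CPU (then memory) after the scale-up.
Priority: picks the node group with the highest user-assigned priority.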
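The strategy is selected with the --expander flag on the Cluster Autoscaler container. The fragment below is a sketch of the relevant part of the CA Deployment, assuming ASG auto-discovery by tags; the cluster name prod-cluster is a placeholder.

```yaml
# Fragment of the cluster-autoscaler Deployment's pod spec (not a full manifest).
containers:
  - name: cluster-autoscaler
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Use the priority expander instead of the default (random).
      - --expander=priority
      # Discover Auto Scaling groups by tag; "prod-cluster" is a placeholder.
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/prod-cluster
```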

ConfigMap example is as follows:
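A minimal sketch of a priority expander ConfigMap, assuming the Spot node groups have "spot" in their names and the on-demand groups have "on-demand" (both naming patterns are assumptions). CA looks for a ConfigMap named cluster-autoscaler-priority-expander in the namespace it runs in, here assumed to be kube-system.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander expects exactly this name.
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  # Higher number = higher priority. Each entry is a list of regular
  # expressions matched against node group (Auto Scaling group) names.
  priorities: |-
    10:
      - .*on-demand.*
    50:
      - .*spot.*
```

With these values, CA expands a matching Spot group first and falls back to an on-demand group only when no Spot group can be scaled up.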


Leveraged AWS Node Termination Handler (GitHub)

Nobody likes interruptions, and handling spot interruptions gave us a lot of trouble. We used the AWS Node Termination Handler to prepare our cluster for them. The handler runs as a DaemonSet, with one pod on each spot instance node, and detects spot interruption notices by watching the EC2 instance metadata.

If an interruption or rebalance recommendation notice is detected, the handler cordons the node and triggers a node drain. A node drain safely evicts all pods hosted on the node. When a pod is evicted using the eviction API, it is gracefully terminated, honoring its terminationGracePeriodSeconds setting.

Each of the evicted pods is then rescheduled on a different node, so that all the deployments get back to their desired capacity.
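Because a Spot interruption notice arrives about two minutes before the instance is reclaimed, it helps to keep a pod's graceful shutdown comfortably inside that window. The pod below is an illustrative sketch; the name, image, and timings are placeholders rather than values from our cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-example            # placeholder name
spec:
  # Time the kubelet waits after SIGTERM before sending SIGKILL.
  # Kept well under the two-minute Spot notice.
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: nginx                  # placeholder image
      lifecycle:
        preStop:
          exec:
            # Short pause so load balancers can stop routing traffic
            # before the container receives SIGTERM.
            command: ["sh", "-c", "sleep 10"]
```

The preStop delay counts against the grace period, so the two values need to be chosen together.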


An AWS node-termination-handler Helm installation example is as follows:
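One way to install it is from the eks-charts Helm repository with a small values file. The sketch below assumes that repository, a release in the kube-system namespace, and a hypothetical file name nth-values.yaml; check the option names against the chart version you install.

```yaml
# nth-values.yaml (hypothetical file name)
#
# Assumed install commands:
#   helm repo add eks https://aws.github.io/eks-charts
#   helm upgrade --install aws-node-termination-handler \
#     eks/aws-node-termination-handler \
#     --namespace kube-system -f nth-values.yaml

# Drain the node when a Spot interruption notice is received.
enableSpotInterruptionDraining: true
# Watch for EC2 rebalance recommendations and drain on them too.
enableRebalanceMonitoring: true
enableRebalanceDraining: true
# Drain ahead of scheduled EC2 maintenance events.
enableScheduledEventDraining: true
```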


Prevent service downtime

By now we have a hybrid cluster that can auto-scale Spot Instances, fall back to On-Demand if necessary, and handle graceful pod evictions when a Spot node is reclaimed.

In production, where every service must stay alive, draining nodes without safeguards could be catastrophic. What if all the pod replicas of a deployment sit in a single Spot Instance pool (same machine type and AZ), where they are likely to be reclaimed at the same time?

To prevent this from occurring, we configured affinity rules.

Affinity rules

Pod affinity and anti-affinity rules let you control this. With these rules, you can specify how pods should be scheduled relative to other pods. A rule combines a label selector, which picks the pods to attract or avoid, with a topology key, a node label that defines the domain (such as zone or host) across which the rule applies.

There are two types of pod affinity rules:

Preferred (preferredDuringSchedulingIgnoredDuringExecution): the scheduler tries to satisfy the rule, but there is no guarantee.

Required (requiredDuringSchedulingIgnoredDuringExecution): the rule must be met before a pod can be scheduled.

In the following example, we use the preferred podAntiAffinity type:
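A sketch of such a Deployment, with three nginx replicas and three weighted anti-affinity terms. The weights (100, 99, 98) and the app label are illustrative assumptions. The zone term uses topology.kubernetes.io/zone, the current replacement for the deprecated failure-domain.beta.kubernetes.io/zone label.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            # Highest weight: avoid zones that already run an nginx pod.
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: topology.kubernetes.io/zone
            # Next: avoid instance types that already run an nginx pod.
            - weight: 99
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: node.kubernetes.io/instance-type
            # Last: avoid nodes that already run an nginx pod.
            - weight: 98
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: nginx
                topologyKey: kubernetes.io/hostname
      containers:
        - name: nginx
          image: nginx
```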

By setting different weights, we make the k8s scheduler favor spreading those 3 nginx replicas over different AZs first (the 'topology.kubernetes.io/zone' node label, which replaces the deprecated 'failure-domain.beta.kubernetes.io/zone'). If there is no room in separate zones, it tries to place them on different instance types (instance-type node label). Lastly, if neither is possible, it tries to spread the replicas across separate nodes (hostname node label). Because these are preferred rules, the scheduler still places the pods even when none of them can be satisfied.

Pod Disruption Budget

Next comes limiting disruptions to prevent service outages. A Pod Disruption Budget (PDB) limits how many pods of an application can be unavailable at the same time due to voluntary disruptions such as node drains. A PDB helps us cap the number of concurrent evictions and so prevent a service outage.

If, for example, a deployment has 3 replicas on one or more nodes that are being drained simultaneously, and its PDB requires at least one pod to stay available, k8s will first evict two pods and then proceed to the third only after one of the rescheduled pods has become ready on another node. This keeps the service running throughout the drain.
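A PDB consistent with that behavior might look like the sketch below. The minAvailable value is inferred from the eviction order described above, and the name and app label are placeholders.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb            # placeholder name
spec:
  # At least one matching pod must stay available, so with 3 replicas
  # at most two can be evicted at the same time.
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx             # must match the deployment's pod labels
```

The budget only applies to evictions made through the eviction API, which is what a node drain uses. It cannot hold back an instance that is reclaimed before its drain completes.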


Conclusion

Using the methods above, we moved our cluster onto Spot nodes, keeping a combination of Spot and On-Demand instances for production workloads as a fail-safe. This reduced our AWS EC2 expenses by more than 70%, with zero downtime and no spot-related outages.
