Innovaccer manages unified health records for over 10 million patients in the United States of America. More than 50 health systems use Innovaccer’s SaaS platform to help caregivers perform state-of-the-art diagnostics and deliver value-based care. The SaaS platform uses AWS as its infrastructure backend, distributed across multiple linked accounts, and its architectural complexity keeps increasing with scale.
The largest issue Innovaccer faced during the pandemic was the massive, rapid inflow of data from providers and patients. As a result, cloud costs rose sharply, and the engineering and finance teams had no clear visibility into per-patient expenditures.
Innovaccer’s data activation platform aggregates, analyzes, and transforms healthcare data through a series of data processing systems that run throughout the day. The cost of running these systems kept spiking because of the varying volume of data. There was no direct way for Auto Scaling groups to scale dynamically based on cluster metrics; the available autoscaling criteria could scale the cluster up based only on a single metric provided by CloudWatch.
Almost 50% of Innovaccer’s monthly AWS bill was due to Hadoop jobs. Ankit Maheshwari (VP, Technology) needed a solution that could reduce the cost of their Hadoop clusters without compromising on uptime guarantees or platform stability. He also needed more visibility into their rapidly growing infrastructure.
OpsLyft is a great believer in AWS, so their initial approach was to solve the problem with existing cloud-native solutions before attempting any custom implementation. After experimenting with Amazon Auto Scaling groups, they identified the following limitations:
- Auto Scaling is tightly coupled to CloudWatch only
- Alarms can be triggered (automatically) but only based on a single metric
- Limited stat functions – avg, sum, min, max, etc. No capability to define custom metrics
OpsLyft therefore needed a solution capable of scaling the cluster up and down based on more advanced metrics that were not available in CloudWatch. That’s where Cost360 came to the rescue: it could consume metrics from the Hadoop cluster and perform scaling activities based on them.
OpsLyft wrote a publisher which collected the following metrics:
- Supply metrics from Hadoop cluster summary table:
- Demand metrics as a cumulative sum of map & reduce tasks of all the running jobs:
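The demand metric described above can be sketched in a few lines of Python. This is only an illustration, not OpsLyft’s publisher: the `JobStatus` record, its field names, and the `cluster_demand` function are hypothetical, and it assumes the publisher already has a list of job snapshots with their outstanding map and reduce task counts.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobStatus:
    """Hypothetical snapshot of one Hadoop job (field names are illustrative)."""
    job_id: str
    state: str         # e.g. "RUNNING" or "SUCCEEDED"
    map_tasks: int     # map tasks belonging to the job
    reduce_tasks: int  # reduce tasks belonging to the job


def cluster_demand(jobs: Iterable[JobStatus]) -> int:
    """Cumulative sum of map and reduce tasks over all running jobs."""
    return sum(
        job.map_tasks + job.reduce_tasks
        for job in jobs
        if job.state == "RUNNING"  # finished jobs place no demand on the cluster
    )
```

Only running jobs are counted, so the value drops back toward zero as jobs finish, which is what lets the cluster shrink again.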
After collecting these metrics, OpsLyft only had to push them to CloudWatch and set the scaling capacity to a value computed from the cluster load. This solved Innovaccer’s problem: the cluster now scaled only when supply metrics were pushed, and it scaled to the number of nodes required by the computed demand metrics, where previously it would have scaled to a fixed number regardless of the load on the cluster.
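One way to picture "a computed value based on cluster load" is the small calculation below. It is a sketch under stated assumptions rather than the formula OpsLyft actually used: the function name `desired_nodes`, the idea of a fixed number of task slots per node, and the minimum and maximum bounds are all hypothetical.

```python
def desired_nodes(demand_tasks: int, slots_per_node: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Node count needed to serve the current demand, clamped to fixed bounds.

    All parameters are illustrative assumptions:
      demand_tasks   - outstanding map + reduce tasks across running jobs
      slots_per_node - tasks one node can run at a time
      min_nodes / max_nodes - floor and ceiling for the cluster size
    """
    if demand_tasks < 0 or slots_per_node <= 0:
        raise ValueError("demand must be >= 0 and slots_per_node must be > 0")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")

    # Integer ceiling division: enough nodes to cover every outstanding task.
    needed = (demand_tasks + slots_per_node - 1) // slots_per_node

    # Never go below the floor or above the ceiling.
    return max(min_nodes, min(max_nodes, needed))
```

With 95 outstanding tasks and 10 slots per node, for example, this asks for 10 nodes, while an idle cluster falls back to the configured minimum instead of staying at a fixed size.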
The Cost360 dashboards provided detailed visibility into, and proactive monitoring of, AWS costs. This gave Innovaccer a clearer picture of its Hadoop costs, so the team could see how effective the approach was and measure the reductions.
OpsLyft’s Plutus CLI handled the pushing of metrics. With it, developers could take immediate action using simple single-line commands, sparing them the effort of building a more extensive solution for the same purpose.
Following are the benchmark results:
Figure 1: Pre-Product Deployment Phase: Demand vs. Supply metrics for Hadoop
The number of nodes in the cluster is static, irrespective of how the supply metrics change. The only thing that drives the cluster to scale down is the cool-down period configured as part of CloudWatch.
Figure 2: Post-Product Deployment Phase: Demand vs. Supply metrics for Hadoop
The above results are from after product deployment. The Hadoop cluster is now scaling on demand.
OpsLyft solved the problem, but the real win was seeing the reduction in the month-end bill even though the number of cluster runs during the month had increased.
Figure 3: Post-Product Deployment Phase: Cost Reduction
With Cost360 deployed, Ankit and his team had full visibility into their Hadoop costs. The Cost360 dashboards gave Innovaccer a holistic view of its cloud infrastructure, enabling the team to take immediate action whenever they saw a cost surge.
Ankit and his team at Innovaccer no longer had to worry about their Hadoop clusters: OpsLyft’s team had solved the scaling problem, with clusters scaling only when metrics were pushed, without compromising on uptime guarantees or platform stability.
After adopting OpsLyft’s solution, every cluster in Innovaccer’s infrastructure saved an average of $30 per hour. When the month-end came, Innovaccer saw a 30% reduction in its AWS bill for that month!