DevOps: The Unholy Trinity Of Costs, Automation & Resiliency In Cloud

Cloud is everywhere

Everyone who is anyone is either running their infrastructure on the cloud or planning to. Cloud is even more important for startups because it trades large upfront costs for a pay-as-you-go model. This has been received wisdom for a while now.

However, despite the promise of the cloud, beyond a small number of servers (10+) cloud-based infrastructure starts costing serious amounts of moolah. We will explore the underlying problems and the steps to rectify them so you can run top-notch infrastructure.

Modern Applications

Any modern application developed in the last decade is pretty much distributed. By a distributed application we mean one that runs on multiple servers: a load balancer, application servers, databases, a cache, a search server and so on. While this allows scalability and lower per-server costs, from a management perspective it adds a great amount of complexity. We could have written shell scripts to manage common actions, but those actions are now split across multiple servers; servers come and go (due to scaling), and errors on one server impact functionality on others.

Frankly, it’s a mess. This leads to three classes of problems: 1) Automation 2) Resiliency 3) Cloud Costs.

Let’s look into these problems.


Automation

Setting up multiple servers and making them all work together is hard (I’ll refer to this piece as deployment). There are too many moving pieces. I need to:

  1. Provision multiple servers (in parallel, hopefully),
  2. Install the right software (and config files) on each one,
  3. Make them all talk to each other (e.g. tell app servers where the DB is),
  4. Test them together.
Tools like Chef and Puppet focus mostly on steps 1 & 2, leaving you to hack things together, hard-code addresses or use service discovery to make the servers talk to each other.
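To make the four steps concrete, here is a minimal sketch of what a one-click cluster deploy looks like as an orchestration script. Everything here is hypothetical stand-in code: `provision` would really call your cloud provider's API and return the new server's address, and the wiring step would really write config files onto the app servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical provisioner: a real one would call the cloud API
# (EC2, GCE, ...) and block until the server is up.
def provision(role, index):
    return {"role": role, "addr": f"10.0.{index}.10"}

def deploy_cluster():
    wanted = [("db", 0), ("app", 1), ("app", 2), ("lb", 3)]
    # Step 1: provision all servers in parallel, not one by one.
    with ThreadPoolExecutor() as pool:
        servers = list(pool.map(lambda rc: provision(*rc), wanted))
    db = next(s for s in servers if s["role"] == "db")
    apps = [s for s in servers if s["role"] == "app"]
    # Step 3: wire the tiers together -- tell app servers where the DB is.
    for app in apps:
        app["config"] = {"db_host": db["addr"]}
    return servers
```

The point of the sketch is the shape, not the details: once the whole flow lives in one script, "deploy a fresh environment" stops being an afternoon of manual work.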

Building this automation is hard and time-consuming, so the setup is typically done by hand. This works (and is often the right choice) when all you have is 3 servers and you’re still finding product-market fit. But once the infra grows, this leads to many night-outs!


Resiliency

After the infrastructure deployment has been automated, we have to start thinking about building resiliency. Resilient infrastructure handles common problems (a server crashed, a server is hung, load is too high) in an automated way. Building resilient infrastructure takes some time and effort but is well worth the payoff.
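One way to picture this self-healing is as a watchdog loop over the fleet. The sketch below uses stand-in probes: in production `is_healthy` would be a real check (an HTTP ping, a process check) and `restart` would go through systemd, a supervisor or your orchestrator.

```python
# Hypothetical health probe: here the fleet state is just a dict of
# server name -> healthy flag, so the loop itself can be shown.
def is_healthy(server, state):
    return state[server]

# Hypothetical remediation: simulate the restart bringing it back.
def restart(server, state):
    state[server] = True

def watchdog_pass(servers, state):
    """One pass over the fleet: restart anything failing its check."""
    restarted = []
    for server in servers:
        if not is_healthy(server, state):
            restart(server, state)
            restarted.append(server)
    return restarted
```

Run on a schedule, a loop like this handles the "server crashed at 3 AM" class of problems without waking anyone up.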

Another important point is to automate upgrades (this is the code promotion pipeline). Code is fragile, and most errors will happen while pushing new code updates. The infrastructure has to let you deploy new code/schema, observe, and immediately undo the change if things break. Production downtime leads to immense pressure and is not the place to start debugging. The first priority is to unbreak things (remove the sword hanging over your head) and only then dig into what blew up and how to fix it.
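The deploy-observe-undo cycle can be sketched in a few lines. This is illustrative only: `error_rate_of` is a stand-in for whatever monitoring system you would query after a push, and the cluster is modelled as a simple dict.

```python
# Hypothetical promotion step: deploy a new version, observe the error
# rate, and roll back automatically if it crosses a threshold.
def promote(cluster, new_version, error_rate_of, threshold=0.05):
    old_version = cluster["version"]
    cluster["version"] = new_version              # deploy
    if error_rate_of(new_version) > threshold:    # observe
        cluster["version"] = old_version          # unbreak first...
        return False                              # ...debug later
    return True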

The goal of all this resiliency automation is to keep the application running no matter what. Without it, DevOps teams are always firefighting, working under pressure and eventually burning out. And since the team never has any free time, they cannot build automation, further exacerbating the problem.

Cloud Costs

This problem is the end result of problems 1 & 2. Without automation, servers are set up by hand. Since the setup is so time-consuming, servers are left running even when not needed. Similarly, we over-provision servers to handle load because our infrastructure is not designed to grow or shrink with load automatically. Automation lets us set up application clusters and other infrastructure (build/logging/search) only when we need them and blow them away once done. Containers are another way of consolidating multiple services onto one server, pushing up utilisation and reducing costs.

If we look across a typical cloud infrastructure (20–100+ servers), we will notice that average server utilisation is less than 50%. So we’re paying double what we’re actually using. The goal is a number around 80% for all of your infrastructure (not just production). Waste not, want not.
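To make the arithmetic concrete, here is a small back-of-the-envelope helper; the 50% and 80% figures are the ones from this section, and the fleet sizes are made up for illustration.

```python
import math

def right_size(servers, current_util, target_util):
    """Servers needed to carry the same load at a higher utilisation."""
    busy = servers * current_util        # fully-busy server-equivalents
    return math.ceil(busy / target_util)

# 50 servers at 50% utilisation do the work of 25 fully busy servers;
# at an 80% utilisation target, that same load fits on 32 servers --
# a cut of roughly a third of the monthly bill.
```

The exact numbers matter less than the habit: measure utilisation, compute the right-sized fleet, and prune toward it.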

Cut the Gordian Knot

The solution to these problems has three parts:

First, build automation ASAP. The goals of this automation are threefold:

  1. Deploy your entire application with one click: for staging and QA environments.
  2. Automate upgrades and rollbacks: for production.
  3. Fix common problems automatically: for getting sleep at night.

Second, don’t go for perfection, good enough is good enough. We’ve seen folks keep pushing automation for later because they want to do it right in a big-bang rewrite. Don’t do that. Do something that saves your time and aggravation today.

Third, keep a close eye on cloud costs and utilisation. Review how many servers you are using and how much load they are handling, and prune them ruthlessly. The team culture has to inculcate a sense of efficiency: pay only for the bare minimum of infra. The cloud vendors have enough money, with great margins; we should stop making them richer.


Cloud is a disruptive technology: it has the potential to change the way we run our infrastructure, letting us be agile in development and react quickly to product changes. To take full advantage of it, we need to evolve our thinking, get out of the mindset of treating cloud as servers-for-rent and start playing to its specific strengths. By tackling the common problem patterns and adopting the solution mindsets listed above, we can use cloud tech to its full potential.

[About The Author: Aaditya Sood is the founder and CEO of]
