Fifty Shades of Platform

July 07, 2025

Yesterday, I started getting my hands dirty with buildpacks for a side-project. I have known about buildpacks for a while; back in the day at Grofers, we had made plans to potentially integrate buildpacks into our developer experience, but we never got to it. I was first introduced to the idea of buildpacks when I was introduced to Heroku. It seemed like magic that you could, with very minimal configuration, deploy a Python application on the cloud without worrying about scripting the process of pulling the code, installing dependencies, and bootstrapping the application with the right configuration. It made everything very easy. In a way, it created a bubble. The bubble in which deploying an application was that easy. The bubble that would burst pretty soon. Heroku was great for simple applications, but I found out fairly quickly in my career that application deployment is far more nuanced across organizations.

Different organizations have different preferences for how much control and flexibility they want. These preferences are driven by real-world factors like cost and performance. Performance in terms of not only how the application performs, but also how your infrastructure affects your team's ability to deliver business value. Cost shows up in your infrastructure bill, and also in the engineering time spent building infrastructure instead of business value. I've been fortunate enough to see different variations of these requirements, which prompts me to reflect a bit on how I've seen platforms being set up and evolving as a team's requirements change over time.

Minimum Viable Infrastructure

When I started working at Beatoven.ai as a founding engineer, while building the product, I also had to think about how to build the infrastructure. Having worked on infrastructure at BrowserStack and Grofers in different capacities, I knew how much work it takes to manage infrastructure (varying with the specific tools chosen). Joining a company at a time when we were just building out our product with minimal funds, I wanted to pick something as hands-off as possible. If I could have picked independently, I would have hosted our product on the likes of Vercel or Heroku, or both. But we had a bunch of credits on AWS, so it made sense not to spend money on infrastructure elsewhere. From past experience, I knew a couple of things: one, we would want to leverage containers; two, we were not going to do Kubernetes. Not because I thought Kubernetes wasn't good, but having worked with Kubernetes, I knew it took quite a bit of work to manage and that it was overkill for the requirements we had. The point at which we would legitimately need something like Kubernetes was a long way off.

So with AWS in the picture, ECS seemed like a good contender. Thankfully, AWS Copilot CLI had been released, and the experience it offered for spinning up infrastructure and deployments hit a sweet spot. It offered an easy-to-use CLI that worked off a simple configuration file. Logs were easy to fetch. Deployments were easy to trigger. Once set up and hooked into GitHub Actions, it was on autopilot. There were obviously many challenges, and talking about them deserves another blog post, but it was more or less successful at keeping day-to-day infrastructure operations hands-off.
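
For a sense of how little configuration this was, here is a minimal sketch of the kind of manifest Copilot works off of. The service name, port, and sizes below are made up for illustration, not our actual setup:

```yaml
# copilot/api/manifest.yml -- a minimal sketch, not our actual config
name: api
type: Load Balanced Web Service   # Copilot provisions the load balancer, ECS service, etc.

image:
  build: Dockerfile               # build the container image from the repo's Dockerfile
  port: 8080                      # port the application listens on

http:
  path: '/'                       # route traffic on this path to the service

cpu: 256                          # Fargate task size
memory: 512
count: 1                          # number of running tasks

variables:
  LOG_LEVEL: info                 # plain environment variables
```

From there, a `copilot deploy` (run locally or from a GitHub Actions job) takes care of building the image, pushing it, and rolling out the service.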

I think that's a dream for a lot of people making infrastructure decisions in such scenarios. When resources are constrained, you want to focus most of them on building the product and iterating over it. That's why Copilot CLI or managed platforms like Vercel, Heroku, Railway, etc. shine in this kind of scenario. Sure, you lose some flexibility and control. With Copilot CLI, you have to live with the naming conventions it applies to your AWS resources. On Vercel, you can't configure deployments on release tags instead of branch pushes. One could list a few more, but in most scenarios the requirements are simple enough, and common enough, that most of these tools cover most of the wish list out of the box. Those who outgrow this wish list at some point don't have a lot of options other than to take the steering wheel.

Infrastructure as an Internal Product

Teams that anticipate a need (correctly or not) for infrastructure fully under their control, or that grow out of their minimum viable infrastructure, end up building their own. Whatever the origin story, this is where most infrastructure-related work happens. Dedicated teams are formed to provision and manage infrastructure resources (on cloud services such as AWS) and to build their platforms (deployments, testing, observability, etc.) on top of this infrastructure.

With a lot of control and flexibility comes maintenance effort. Containerizing an application and setting up environment-specific configuration files. Writing provisioning scripts in Terraform. Setting up and managing CI/CD platforms like Jenkins. Building CI/CD pipelines on top of them to deploy an application to Kubernetes. Setting up and configuring the monitoring stack (think Prometheus, Grafana, and Alertmanager). Spinning up a database and gluing together scripts for user management, migrations, backup, and restore. The actual list and the effort spent depend on the specific tool set and the team's requirements, but these examples put across the general idea.
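
To make the "artifact" part concrete, here is a rough sketch of the sort of deployment workflow such a setup accumulates. The image name, secrets, and deployment references are placeholders, and a real pipeline grows well beyond this:

```yaml
# .github/workflows/deploy.yml -- illustrative sketch with placeholder names
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push image
        # registry login is omitted for brevity
        run: |
          docker build -t registry.example.com/myapp:${{ github.sha }} .
          docker push registry.example.com/myapp:${{ github.sha }}

      - name: Deploy to Kubernetes
        # assumes cluster credentials are provided via a KUBECONFIG secret
        run: |
          echo "${{ secrets.KUBECONFIG }}" > kubeconfig
          export KUBECONFIG=./kubeconfig
          kubectl set image deployment/myapp myapp=registry.example.com/myapp:${{ github.sha }}
```

Every block like this is one more thing to test, debug, and keep patched.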

Many of these things are necessary. In a growing e-commerce company, for example, there might be many teams building systems that need to integrate with each other. Deployment patterns need to be tailored to these interactions. The end-user systems need to serve a large volume of users, so reliability becomes a high-priority requirement, because downtime means service disruption, which means the company could lose business.

At Grofers, we had around 8 teams building a dozen microservices. These microservices were far from loosely coupled, so in order to test changes in one or two of them, we had to deploy the entire system in an environment. This required us to build a very customized orchestration on top of Kubernetes, which allowed us to deploy all the services (with their data dependencies) to a namespace where changes could be tested. This let us perform not only functional testing, but also some amount of performance testing, before releasing changes into production.
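
To give a flavour of the idea (this is a hypothetical spec written for illustration, not the actual format we used at Grofers), the orchestration boiled down to taking a description of the services and their data dependencies and materializing all of it in a throwaway namespace:

```yaml
# env-spec.yml -- hypothetical example, not Grofers' actual tooling
namespace: test-pr-1234            # one throwaway namespace per change set
services:
  - name: cart
    ref: feature/new-coupon-flow   # the branch under test
  - name: checkout
    ref: main                      # everything else pinned to a known-good ref
  - name: payments
    ref: main
dependencies:
  - postgres                       # seeded with test data
  - redis
```

An orchestrator reads a spec like this, deploys each service into the namespace, wires up the data stores, and tears the whole thing down when testing is done.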

While many such solutions are warranted, they often end up being iceberg projects: it looks like setting up something new would be an easy win towards solving a problem, and that after that it'll be done, but more often than not it's hardly ever done. At Grofers, we spent way more effort than we anticipated not only in setting up our testing infrastructure, but also in maintaining it and adding new features. You peel off new layers of the problem every day, and all of that needs time and effort.

A bespoke platform also comes with learning that needs to be propagated among the engineers who are working on business outcomes. Learning to write Dockerfiles, YAML configs, and the like are some examples. How much of this happens depends on the engineering culture of the team and how much shared ownership is encouraged. So, to varying extents, this takes time away from engineers who would rather be working on business outcomes than learning how to operate their applications.

It's true that AI makes it easier to write IaC in Terraform, GitHub Actions workflows, or Kubernetes YAML files, but at the end of the day you end up with artifacts, each of which comes with the cost of testing and debugging, which is often time-consuming and occasionally very frustrating.

The In-Between

As mentioned earlier, when the adopters of Minimum Viable Infrastructure outgrow the capabilities of their platform, they don't have a lot of options other than to dive into rolling out their own infrastructure platform. A growing startup that is on Heroku, for example, may want to orchestrate their deployments differently. They may want to ship logs to a destination that isn't supported. They may have compliance requirements that are either not supported by the platform or way too expensive to meet on it. In such a situation, there is a reluctant switch to Infrastructure as an Internal Product. Reluctant because they still believe in the hands-off infrastructure mindset and/or need to keep costs low, but because there are not many easy ways to strike a balance between the two, they have to go DIY. And then comes the same dance of provisioning and maintenance that countless teams have gone through before.

It doesn't have to be that way, though. Between the opinionated hands-off solutions and bespoke platforms, there are tools that let you do a bit of both. Consider a platform like Porter: it brings the features of a PaaS like Heroku but lets you deploy your application on your own infrastructure. Sounds great, right? For the most part it is, but if you have an opinion on whether you want Kubernetes in your infrastructure or not, that's a factor to think about. Several other platforms like it exist, but each with one caveat or another. This makes it quite a task in itself to find a platform that fits your needs perfectly, or at least 80-90% of them.

In my personal experience, it has not been easy. Unable to find the right tool for the job, I had to cross the chasm and set up an IaC scaffold (Terraform/Pulumi) to provision infrastructure whose requirements were mostly met by existing platforms, except for maybe one or two, like compliance and cost. I've found myself doing some of these things a few times, like setting up a service to run on AWS ECS, which prompted me to put together a side-project in an attempt to fill this chasm a little.

In an ideal world, it should be possible for a team to start with a Minimum Viable Infrastructure and have it evolve as the platform requirements of the team evolve over time.


