How a US-based FinTech SaaS cut its development infra costs by 80% using IaC

After securing their initial round of funding, the company focused its development efforts entirely on building and strengthening its SaaS offering. When talking to the founders, I learned that they planned to expand their development team and were worried about rising dev infrastructure costs. I identified idle resources in sub-production environments that were burning cash unnecessarily during off periods. This was especially wasteful because the isolated development environments were only needed to test features before those features moved on to the integration environments.

After the initial analysis, I came up with a plan to automate the temporary provisioning and deprovisioning of infrastructure for their isolated development environments. Depending on how long the tests ran, the solution could potentially save them 95% of the infrastructure costs associated with these environments.
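
To make the "depending on how long the tests ran" part concrete, here is a rough back-of-the-envelope sketch. It assumes that cost scales linearly with uptime and that an always-on environment runs about 730 hours a month. The active-hour figures are illustrative assumptions chosen to show the arithmetic, not measurements from this project.

```python
HOURS_PER_MONTH = 730  # approximate hours in a month (assumption)


def savings_fraction(active_hours_per_month: float) -> float:
    """Fraction of the always-on cost avoided when an environment
    exists only while it is actually in use.

    Assumes cost is proportional to uptime; fixed costs such as
    retained storage are ignored in this sketch.
    """
    if not 0 <= active_hours_per_month <= HOURS_PER_MONTH:
        raise ValueError("active hours must be between 0 and 730")
    return 1 - active_hours_per_month / HOURS_PER_MONTH


# Illustrative usage levels only.
for hours in (36.5, 146):
    print(f"{hours} h/month active -> {savings_fraction(hours):.0%} saved")
```

Under these assumptions, an environment that is up for about 5% of the month avoids roughly 95% of the always-on cost, and one that is up for about 20% of the month avoids roughly 80%.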

Previous state

Working with startups and modern tech companies comes with some advantages. This team was already cloud-native and deployed its services reliably using Kubernetes. However, almost all of this was being done manually.

The majority of feature development happened on local machines, and the rest could easily have been done locally too; it depended on the developer’s preference. Developers used the isolated development environments to do a final check alongside the other components within their scope. These integration tests typically took a couple of hours: developers deployed the changes, ran test scripts, observed system behavior and performance, captured logs for analysis, and repeated the cycle if a quick fix was required.

Once these tests were successful, the changes were planned for promotion to the integration environments, where other teams’ changes would be merged and tested together for coherence. The tech vision was that the isolated environments would allow developers in each team to develop features at maximum speed, while coordinating efforts in the integration environment would allow the overall product to be delivered quickly.

Opportunities
  1. Since the team was still small, they could afford a dedicated, always-on development environment. But as the team grew, those costs were forecast to multiply, and most of that spend was unnecessary.
  2. This was a great opportunity to instill good development practices early. Establishing a local-first development practice made a lot of sense at this stage.
  3. Since the environments were created manually, they suffered from configuration drift. Sub-production environments did not mirror production. This was not sustainable in the long run.
  4. The DevOps setup was quite raw and often involved multiple manual tweaks. It was the right time to introduce automation.

Impact

I took charge of streamlining the development process, with a focus on reducing the cost of the cloud infrastructure it required.

  1. Phase 1: Working closely with the development team, I analyzed their requirements and drew up a plan to codify their environment as Terraform IaC. This involved listing all the components needed to run their tests: databases, networking, K8s clusters, EC2 instances, environmental dependencies, etc. At the same time, we worked together to prioritize local development, which gave me insight into the difficulties developers faced in following this approach.
  2. Phase 2: I worked with the developers to resolve their local development issues first. In parallel, I began developing Terraform templates for all the infrastructure components. The goal was to codify the development infrastructure completely and to test the templates by repeatedly provisioning and deprovisioning them, making sure the temporary infra components were connected as expected.
  3. Phase 3: I developed CI/CD pipelines to automate provisioning and deprovisioning of the dev environments on demand. The pipeline was iteratively updated to seed configuration such as test data in the database and environment variables in EC2 instances and containers on K8s clusters. This was driven by a custom JSON config file.
  4. Phase 4: I upgraded the CI/CD pipeline to build the application, run tests, and push container images to the repository. I incorporated GitOps practices to manage triggers based on commits, PRs, and the custom config file from the previous step. The config file also carried a boolean flag indicating whether the developer wanted to test the change on the cloud infra or just wanted to push the PR to GitHub.
  5. Phase 5: To enable collaboration, another parameter was added to the config file to identify the environment. If multiple developers in the same team needed to test their changes on the same set of infrastructure components, they could specify the same “env id” in their commits/PRs. The pipeline would then check whether the corresponding environment was already provisioned, based on information captured in a key-value store during previous runs. If it was, the pipeline simply rebuilt the application and deployed it on the same cluster; otherwise it created a new infra stack. This also helped cross-team collaboration when required.
  6. Phase 6: To facilitate auto-deprovisioning, the Terraform template also deployed a monitoring microservice that kept a counter running while there was no incoming traffic. After a few hours of inactivity, it sent out an event, and a webhook-integrated CI/CD pipeline ran the destroy routine for the given environment id.
  7. Phase 7: Developer handover. I trained the development teams to use the configuration file for each case: testing a change in a cloud environment, simply making a commit and PR, or collaborating with one another. As the teams expanded, we streamlined onboarding onto this setup to the point where no explicit onboarding was required.
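
The decision logic from Phases 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the config keys `deploy_to_cloud` and `env_id`, the function name, and the returned action labels are all hypothetical, and a plain dictionary stands in for the key-value store.

```python
import json


def plan_pipeline_run(config_text, provisioned_envs):
    """Decide what a pipeline run should do for one commit/PR.

    config_text: contents of the JSON config file (key names are
        hypothetical; the real file's schema is not shown here).
    provisioned_envs: dict standing in for the key-value store,
        mapping env id -> stack identifier recorded by earlier runs.
    """
    config = json.loads(config_text)

    # Flag off: build, test, and push the image, but touch no cloud infra.
    if not config.get("deploy_to_cloud", False):
        return "build-only"

    env_id = config["env_id"]
    if env_id in provisioned_envs:
        # Environment already exists: rebuild and deploy onto it.
        return "redeploy"
    # Unknown env id: a new infra stack has to be created first.
    return "provision"


store = {"team-a-payments": "stack-001"}  # made-up example state
print(plan_pipeline_run('{"deploy_to_cloud": true, "env_id": "team-a-payments"}', store))
print(plan_pipeline_run('{"deploy_to_cloud": true, "env_id": "team-b-ledger"}', store))
print(plan_pipeline_run('{"deploy_to_cloud": false}', store))
```

Two developers who commit with the same env id land on the "redeploy" path and share one stack, which is what makes the collaboration case cheap.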

Provisioning a new stack took a few minutes, which the team found acceptable given the savings, consistency, ease of testing, and improved team dynamics.
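
The inactivity counter behind the auto-deprovisioning in Phase 6 might look something like the sketch below. It is an assumption-laden illustration rather than the actual microservice: the class name, the tick-based design, the event payload, and the limit of three ticks are all hypothetical, and the event is returned to the caller instead of being posted to a webhook.

```python
from dataclasses import dataclass


@dataclass
class IdleMonitor:
    """Counts consecutive traffic-free intervals for one environment."""

    env_id: str
    idle_limit_ticks: int  # e.g. 180 one-minute ticks for 3 hours (assumed)
    idle_ticks: int = 0

    def tick(self, requests_seen: int):
        """Call once per interval with the number of requests observed.

        Returns an event dict once the idle limit is reached, else None.
        The event repeats on every idle tick after that, until the
        environment is torn down or traffic resumes.
        """
        if requests_seen > 0:
            self.idle_ticks = 0  # any traffic resets the counter
            return None
        self.idle_ticks += 1
        if self.idle_ticks >= self.idle_limit_ticks:
            # A real service would send this to the pipeline's webhook.
            return {"event": "environment.idle", "env_id": self.env_id}
        return None


monitor = IdleMonitor(env_id="team-a-payments", idle_limit_ticks=3)
for seen in (4, 0, 0, 0):
    print(monitor.tick(seen))
```

In this sketch the pipeline that receives the event would run the destroy routine for the env id carried in the payload.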

Wins
  1. Based on actual usage patterns, the infrastructure costs came to only 20% of what had been projected.
  2. Configuration files made it easy for new developers to get up to speed quickly, with no learning curve.
  3. The entire deployment process for experimental features was automated, and there was practically no need for a DevOps engineer to look into the dev env pipelines unless a major change was needed.
  4. The GitOps strategy also freed developers to think beyond commits and PRs, so that they could deliver twice as fast.
  5. This indirectly sowed the seeds of platform engineering from the early days of the company.