Efficient review environments: Why we replaced Kubernetes with Virtual Machines
For the development of our TYPO3 and Shopware projects, we use a review environment in which we create a copy of the website from the current development code, including the live data set. This allows us and our clients to test new features directly, without being distracted by outdated or test data.
This process is fully automated in our GitLab instance. When the first commit is made in a feature branch, the environment is created, and with each subsequent commit, the code is updated. Once the branch is approved and merged, the environment is automatically torn down.
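As a rough sketch, such a lifecycle can be modeled with GitLab's environments feature. The job names, scripts, and domain below are hypothetical, not our actual configuration:

```yaml
# .gitlab-ci.yml (illustrative sketch)
stages:
  - deploy

deploy_review:
  stage: deploy
  script:
    - ./deploy-review.sh "$CI_COMMIT_REF_SLUG"   # hypothetical deploy helper
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    url: https://$CI_COMMIT_REF_SLUG.review.example.com
    on_stop: stop_review
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'

stop_review:
  stage: deploy
  script:
    - ./teardown-review.sh "$CI_COMMIT_REF_SLUG" # hypothetical teardown helper
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'
      when: manual
```

The on_stop/action: stop pairing lets GitLab trigger the teardown job automatically when the branch is merged or deleted.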
Until a few months ago, these environments ran on an external Kubernetes cluster. We built Docker images for each instance and deployed them using the appropriate Helm charts. However, several issues arose. First, simply purchasing a cluster is not enough: to stay as close as possible to the existing production setup, persistent storage has to be provisioned so that the application behaves as expected. Such a stateful setup isn't truly "Kubernetes-native," which created additional overhead that had nothing to do with actually running test systems.
Moreover, debugging issues - the primary goal of test systems - became significantly more complex in such an environment. Cloud-native applications are typically stateless. They can run in read-only environments because they expect databases to be external and logs to be collected by an external log analysis system. This makes sense for distributed systems and enables horizontal scaling on a large scale. A Kubernetes pod running an application can be replaced by another at any time and then removed. Without in-depth knowledge, it can be difficult to determine or control exactly when this happens, which complicates error analysis. Additionally, during development, it is challenging to execute simple CLI commands or inspect log files.
While it's possible to connect to the appropriate pod with kubectl exec and perform these tasks there, doing so adds unnecessary complexity for the development team, which negatively impacts Developer Experience (DX).
Complicated isn’t always better
Another issue arose from the fact that instances running on Kubernetes were architecturally far removed from our production systems. This led to subtly altered application behavior once they were running on the cluster, which was also difficult to debug in practice.
So, we went back to the drawing board and re-examined the problem we were trying to solve. The main goal was to dynamically provide test instances and consume compute and storage resources only as long as those instances were needed.
Most infrastructure providers now offer APIs to programmatically allocate and decommission resources. Terraform, a widely used tool, provides a common abstraction that lets us switch between providers with little effort. With Terraform, we can script the creation of VMs with defined properties at various providers. The provider takes care of the initial setup of the VM, typically installing an image of a selected operating system such as a Linux distribution.
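A minimal sketch of such a VM definition, here using the Hetzner Cloud provider as an arbitrary example (the provider choice, server type, image, and naming scheme are assumptions, not our actual setup):

```hcl
# Illustrative Terraform sketch, not a production configuration.
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

variable "branch_slug" {
  type = string
}

# One VM per review environment, named after the feature branch.
resource "hcloud_server" "review" {
  name        = "review-${var.branch_slug}"
  image       = "debian-12"
  server_type = "cx22"
}
```

Running terraform apply creates the VM; terraform destroy tears it down again once the branch is merged.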
What was still missing was the automated installation and configuration of the software needed to run our test instances, i.e., a typical PHP-based web server stack. This is something we handle daily, and in combination with Terraform we were able to automate it for the test environments using Ansible.
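A minimal Ansible playbook for such a stack could look like this. The host group, package names, and PHP version are assumptions for illustration, not our actual configuration:

```yaml
# Illustrative playbook: install and start a basic PHP web stack.
- hosts: review_instances
  become: true
  tasks:
    - name: Install web server, PHP, and database packages
      ansible.builtin.apt:
        name:
          - nginx
          - php8.2-fpm
          - php8.2-mysql
          - mariadb-server
        state: present
        update_cache: true

    - name: Ensure services are running and enabled
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
        enabled: true
      loop:
        - nginx
        - php8.2-fpm
        - mariadb
```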
The result is that we now have environments that are much closer to what we run in production. These systems allow direct access via SSH, just like our production systems, making it easy to analyze errors, inspect logs, or check whether specific processes are running. Even though the test instances we create are effectively production-grade systems, we can still provision or decommission them on demand. Our goal was fully achieved: the environments are now more stable and faster.
This shift reminded us once again that not everything that is technically possible is necessarily the best option. In hindsight, Kubernetes is not the optimal choice for the scenario of a "1:1 copy of a web application." We regularly review the technologies we use and adjust them as needed, allowing us to keep improving both technically and conceptually - to the benefit of our clients.
That said, Kubernetes still has its place for us. We continue to use it successfully in other areas, such as for running our GitLab runners.