As most experienced programmers know, Jenkins is one of the most capable CI (Continuous Integration) servers available. Because it ships with so many features by default, it’s one of the most recommended CI tools out there. A large set of plugins and helper tools for Jenkins makes life much easier for its users.
For readers who are not familiar with the tool:
Jenkins is a self-contained, open source automation server which can be used to automate all sorts of tasks such as building, testing, and deploying software. Jenkins can be installed through native system packages, Docker, or even run standalone on any machine with the Java Runtime Environment installed.
Recently, Insider’s Quality Assurance team had to increase the number of nodes so that we could run more jobs in less time. In our setup at the time, we had one server with a single node that ran all builds; adding nodes was meant to resolve problems such as a busy server and long waiting periods between jobs under heavy load.
For the short term, we increased the number of servers serving the Jenkins master on AWS, since all of our servers are built on top of Amazon systems. It was pretty easy to add one or two more servers just to increase the node count, and doing so gave us a chance at shorter waiting times. But the basic problem with this solution was that those nodes could also sit idle. It increased our operating cost too, since machines cost money for as long as they are running.
After some brief research, we found out that a Jenkins plugin can manage those instances according to the number of requests the master gets from manually and/or automatically triggered job builds. We installed the plugin and connected it to our AWS account with the required credentials. At first, it was pretty effortless: every time our Jenkins server got a build request and there were not enough nodes to handle it, the plugin would take over and create a new node from a pre-configured image on AWS.
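The scale-out decision described above can be sketched roughly as follows. This is a simplified model, not the plugin's actual code: the function name `nodes_to_launch` and the `instance_cap` limit are assumptions introduced only for illustration.

```python
def nodes_to_launch(queued_builds: int, idle_nodes: int,
                    running_nodes: int, instance_cap: int) -> int:
    """Return how many new nodes to request for the pending builds.

    running_nodes counts every node that is up (busy or idle).
    instance_cap is a hypothetical upper bound on the fleet size.
    """
    # Builds that cannot be served by nodes that are already idle.
    shortfall = max(0, queued_builds - idle_nodes)
    # Room left before the (assumed) fleet limit is reached.
    headroom = max(0, instance_cap - running_nodes)
    return min(shortfall, headroom)


if __name__ == "__main__":
    # Three queued builds, one idle node out of two running, cap of ten.
    print(nodes_to_launch(queued_builds=3, idle_nodes=1,
                          running_nodes=2, instance_cap=10))
```

The point of the model is that nothing is launched while idle capacity can absorb the queue, which is what removes the cost of permanently running spare nodes.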
This worked well until a recent request from our team, who needed static Jenkins nodes alongside the auto-scaled ones. Without looking deeper into the case, I created two servers from the same image, which also happened to be the one used for the auto-scaling group. After a few days, the machines were no longer running, and I found out that they had been terminated. Since AWS does not keep much data about terminated servers, it was pretty hard to find the reason behind it.
Just to test, I created another set of two instances from the very same image and started to watch how our Jenkins master and the installed plugin reacted to them. I had to wait about 12 hours for the same thing to occur, but it was worth it to see the results, as it gave me the opportunity to dive into the root cause of the problem.
The problem, or side effect, whichever you want to call it, occurred as follows: if there was any available machine built from the related image on the AWS account, the plugin did not scale out to a new instance but preferred to use the existing one. Even if that machine had been created manually, the plugin would simply take over and start the requested builder on it. After use, if no pending request required those machines, the plugin would send a termination request and they would be terminated within 30 minutes.
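A minimal model of that behaviour is sketched below. It is not the plugin's implementation: the `Instance` class, the function names, and the instance and AMI IDs are all hypothetical, and treating the 30 minutes as an idle threshold is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE_LIMIT_MINUTES = 30  # simplification of "terminated within 30 minutes"


@dataclass
class Instance:
    instance_id: str
    ami_id: str
    launched_by_plugin: bool
    idle_minutes: int = 0


def pick_node(instances: List[Instance], plugin_ami: str) -> Optional[Instance]:
    """Return an existing instance built from the plugin's AMI, if any.

    Note that launched_by_plugin is never checked: a manually created
    instance with the same AMI is adopted just like an auto-scaled one.
    """
    for inst in instances:
        if inst.ami_id == plugin_ami:
            return inst
    return None  # nothing to reuse, so the plugin would scale out


def instances_to_terminate(instances: List[Instance], plugin_ami: str) -> List[str]:
    """Return IDs of matching instances that have been idle too long."""
    return [i.instance_id for i in instances
            if i.ami_id == plugin_ami and i.idle_minutes >= IDLE_LIMIT_MINUTES]


if __name__ == "__main__":
    static = Instance("i-static", "ami-shared",
                      launched_by_plugin=False, idle_minutes=45)
    print(pick_node([static], "ami-shared").instance_id)
    print(instances_to_terminate([static], "ami-shared"))
```

In this model the manually created node is both reused and later terminated, because the only thing that identifies a node as "the plugin's" is the image it was built from.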
Our solution was quite simple. We just had to clone that AMI on Amazon under a different name and launch our static servers manually from the clone, while the auto-scaling setup kept using the original AMI.
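Why the clone helps can be shown with a small sketch, assuming the plugin recognises its machines by AMI ID (a copied AMI receives its own ID on AWS). The function name `split_fleet` and all IDs below are hypothetical.

```python
from typing import Dict, List, Tuple


def split_fleet(instance_amis: Dict[str, str],
                plugin_ami: str) -> Tuple[List[str], List[str]]:
    """Partition instance IDs into (plugin-managed, left alone) by AMI ID."""
    managed = [iid for iid, ami in instance_amis.items() if ami == plugin_ami]
    untouched = [iid for iid, ami in instance_amis.items() if ami != plugin_ami]
    return managed, untouched


if __name__ == "__main__":
    # Static nodes are launched from the clone; auto-scaling uses the original.
    fleet = {
        "i-auto-1": "ami-original",
        "i-static-1": "ami-clone",
        "i-static-2": "ami-clone",
    }
    managed, untouched = split_fleet(fleet, plugin_ami="ami-original")
    print("managed by plugin:", managed)
    print("left alone:", untouched)
```

With the static nodes on a separate image, they no longer match what the plugin looks for, so they are neither picked up for builds nor terminated when idle.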
In the end, we learned a lot from these unexpected issues.
“Always check every side effect before operating with any integration”