In our last blog post, we explained how a server works. Of course, we’re sure you’re also interested in what happens if it doesn’t work. If you play Goodgame Empire, for example, you’d quickly notice that a server isn’t available because you couldn’t connect to the game for a certain time. Game servers are therefore essential for online MMOs such as ours to guarantee ongoing fun for our large international community of players.
We sat down with two Goodgame experts to discuss how a server could become unavailable and how to minimize instances like these. Robert (left) manages our IT department and is in charge of server hardware. Jens (right) is the head of our game technology department, which shares the responsibility for the server software with our game developers.
“To ensure that we have as few server crashes as possible, we work with a traffic light monitoring system that alerts us as soon as problems appear,” explains Robert. “Green means that the server is doing well and everything is as it should be. Yellow warns of potential future issues. Red tells us that there’s a serious problem at hand that needs to get fixed as soon as possible.”
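For readers who like to see ideas in code, the traffic light logic Robert describes could be sketched roughly as follows. This is only an illustration: the load metric, the two thresholds, and all names are assumptions made for this example, not our actual monitoring setup.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # everything is as it should be
    YELLOW = "yellow"  # warning: potential future issue
    RED = "red"        # serious problem, fix as soon as possible


# Hypothetical thresholds, chosen only for illustration.
WARN_THRESHOLD = 0.75
CRITICAL_THRESHOLD = 0.90


def classify(load: float, reachable: bool = True) -> Status:
    """Map a server's load (0.0 to 1.0) and reachability to a traffic light."""
    if not reachable or load >= CRITICAL_THRESHOLD:
        return Status.RED
    if load >= WARN_THRESHOLD:
        return Status.YELLOW
    return Status.GREEN


if __name__ == "__main__":
    for load in (0.40, 0.80, 0.95):
        print(load, classify(load).value)
```

A real monitoring system would watch many signals at once (disk space, memory, response times), but the principle of mapping measurements to three alert levels stays the same.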
Furthermore, the teams carry out an extensive root cause analysis if a server is down. Both the IT guys and the developers follow up on the specific error to find out what caused it. For example, it could be that a faulty hardware component has been installed, or that the server has been configured in a way that makes it unstable under high loads. “Once the cause has been determined, the team fixes the problem and ensures that something like this won’t happen again,” said Jens.
Of course, we also shut down the servers for scheduled moves and maintenance work. The systems are then unavailable for a few minutes, and the players see a maintenance screen. Nonetheless, it can also happen that a server can’t be reached due to a network, hardware, or software problem. The following sections describe precisely what happens in a case such as this.
The network is causing problems
If a server can’t be reached, this isn’t necessarily due to the hardware or software – the server connection could have also caused the issue. If, for example, the player loses their internet connection, they can’t reach the server either. If they then try to open a game or any website in their browser, it won’t load. Our network team helps fix this problem and gets in touch with the respective provider, such as Comcast or AT&T, to ensure that they receive the information required to find a quick solution to this problem.
When it comes to networks, there is another industry-wide phenomenon to watch out for: external network attacks. These don’t present a large threat to our games, however, because we can quickly filter out and neutralize these attacks. During a network attack, someone with a lot of PCs under their control tries to direct a large number of requests to our servers from all directions to make the servers slower and force them to crash. However, our network team usually filters an attack like this out of the system in 10-15 minutes before any damage can be caused.
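To give a feel for what filtering can mean, here is a minimal sketch of one common technique: limiting how many requests a single source may send within a time window. This is not necessarily how our network team filters attacks; the class name, the limits, and the use of a source address as the key are all assumptions made for this example.

```python
from collections import defaultdict, deque


class RequestFilter:
    """Sliding-window limiter: at most max_requests per source per window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Hypothetical in-memory store: source address -> recent timestamps.
        self._history = defaultdict(deque)

    def allow(self, source: str, now: float) -> bool:
        timestamps = self._history[source]
        # Forget requests that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # too many recent requests: filter this one out
        timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = RequestFilter(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3, 1.05):
        print(t, limiter.allow("203.0.113.7", now=t))
```

Because an attack comes from many machines at once, real filtering usually combines several such rules and is applied in the network itself, before the requests ever reach a game server.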
The hardware doesn’t work
As with normal computers, our leased servers can also experience hardware problems that cause a server to fail, such as when a central component like the motherboard is affected. However, if it’s just something like a hard drive that breaks, nothing crashes. “Our servers are structured in such a way that we have double protection no matter where we are,” explained Robert. “This means that there are usually two copies of every component, i.e. two network connections, two hard drives, and so on.” When a component fails, we work with the technicians in the data centers to quickly replace it.
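The idea of double protection can be modeled in a few lines. In this rough sketch, a redundant component keeps working as long as at least one of its two copies is healthy, while a central component such as the motherboard has no backup. The class, the function, and the component names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class RedundantPair:
    """A component that exists twice, e.g. two hard drives."""
    name: str
    primary_ok: bool = True
    secondary_ok: bool = True

    def available(self) -> bool:
        # One healthy copy is enough to keep running.
        return self.primary_ok or self.secondary_ok

    def needs_replacement(self) -> bool:
        # Any broken copy should be swapped out by a technician.
        return not (self.primary_ok and self.secondary_ok)


def server_up(motherboard_ok: bool, pairs: list[RedundantPair]) -> bool:
    """The server runs if the central component and every pair are available."""
    return motherboard_ok and all(pair.available() for pair in pairs)


if __name__ == "__main__":
    drives = RedundantPair("hard drive", primary_ok=False)
    network = RedundantPair("network connection")
    print(server_up(True, [drives, network]))   # one drive broke, server still runs
    print(drives.needs_replacement())           # but the drive should be replaced
    print(server_up(False, [drives, network]))  # motherboard failure takes it down
```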
The software has a bug
A bug in the software can also make a server unreachable. Of course, our quality assurance makes sure that this happens as seldom as possible. In case of a bug, the developers receive a “bug report” with information on the specific problem and try to fix it as quickly as possible. “No computer program is perfect. This applies to our games as well, of course. It’s important for us to recognize problems early on before they affect the players. The developers then develop a solution. And depending on how critical the problem is, we either implement a hotfix or the solution becomes part of the next scheduled update. To prevent any mistakes from occurring twice, the developers regularly share information on problems and solutions,” explained Jens.
Due to our large community of players, our game servers have to be very stable in all areas. Our experts in IT and software development have therefore built high-performance servers that are optimized on an ongoing basis. With a topic as intricate and technical as this, our blog can only begin to scratch the surface. We hope that we have nonetheless been able to shed some light on the subject and that it’s now a bit easier to understand why servers sometimes crash.