Stop Killing Your Cattle: Server Infrastructure Advice

It’s great to treat your infrastructure like cattle—until it comes to
troubleshooting.

If you’ve spent enough time at DevOps conferences, you’ve heard the phrase “pets
versus cattle” used to describe server infrastructure. The idea behind this
concept is that traditional infrastructure was built by hand without much
automation, and therefore, servers were treated more like special pets—you
would do anything you could to keep your pet alive, and you knew it by name because
you hand-crafted its configuration. As a result, it would take a lot of effort
to create a duplicate server if it ever went down. By contrast, modern DevOps
concepts encourage creating “cattle”, which means that instead of unique,
hand-crafted servers, you use automation tools to build your servers so that no
individual server is special (they are all just farm animals), and therefore, if a
particular server dies, it’s no problem, because you can respawn an exact copy
with your automation tools in no time.

If you want your infrastructure and your team to scale, there’s a lot of
wisdom in treating servers more like cattle than pets. Unfortunately, there’s
also a downside to this approach. Some administrators, particularly those who
are more junior, have extended the concept of disposable servers to the point
that it has affected their troubleshooting process. Since servers are
disposable, and sysadmins can spawn a replacement so easily, at the first hint of
trouble with a particular server or service, these administrators destroy and
replace it in hopes that the replacement won’t show the problem. Essentially,
this is the “reboot the Windows machine” approach that IT teams used in the 1990s
(and that Linux admins sneered at), only applied to the cloud.

This approach isn’t dangerous because it is ineffective. It’s dangerous
exactly because it often works. If you have a problem with a machine and
reboot it, or if you have a problem with a cloud server and you destroy and
respawn it, often the problem does go away. Because the approach appears to
work and because it’s a lot easier than actually performing troubleshooting
steps, that success then reinforces rebooting and respawning as the first
resort, not the last resort that it should be.

Source: Linux Journal