
Destroying multiple production databases

A sysadmin horror story, to be read only with the lights on. Learn from these mistakes.

In my 22-year career as an IT specialist, I have encountered two major incidents where, due to my own mistakes, important production databases were blown apart. Here are my stories.

Freshman mistake

The first time was in the late 1990s, when I started working at a service provider for my local municipality’s social benefit agency. As a newbie system administrator, I was assigned to remove retired databases from a server where databases for different departments were consolidated.

Due to a typo in a top-level directory name, I removed two live database files instead of the one retired database. Worse, because of the complexity of the database consolidation, other databases were hit during the restore, too. Repairing all of the databases took approximately 22 hours.

What helped

A good backup, which was tested each night after it was made by restoring an empty file from the end of the tar archive catalog.
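For illustration, here is a minimal Python sketch of that kind of nightly check. The archive path and sentinel file name are hypothetical, and the exact tooling used back then is not described here; the idea is simply to read the whole catalog and restore the last entry to prove the backup can actually be recovered.

```python
import tarfile

# Hypothetical names: the real setup appended a small, empty test file as the
# last entry of the nightly tar archive and restored it after the backup ran.
ARCHIVE = "/backup/nightly-databases.tar"
SENTINEL = "backup-check/sentinel.empty"

with tarfile.open(ARCHIVE, "r") as tar:
    members = tar.getmembers()   # reading the full catalog proves the archive is intact
    last = members[-1]
    if last.name != SENTINEL:
        raise SystemExit(f"Unexpected last entry in archive: {last.name}")
    # Restore the sentinel to prove the data can be read back, not just listed.
    tar.extract(last, path="/tmp/backup-verify")

print("Backup verified: catalog readable and sentinel restored.")
```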

Future-looking statement

It’s important to learn from our mistakes. What I learned is this:

  • Write down the steps you will perform and have them checked by a senior sysadmin. It was the first time I did not ask one of the seniors for a review. My bad.
  • Be nice to colleagues from other teams. It was a DBA who saved me.
  • Do not copy such a complex setup, with databases shared over shared filesystems.
  • Before doing a lifecycle management migration, separate the filesystems per database to avoid the complexity and reduce the chance of human error.
  • Change your approach: Later in my career, I always tried to avoid lift-and-shift migrations.

Senior sysadmin mistake

In a period when partly offshoring IT activities to reduce costs was common practice, I had to take over a database filesystem extension on a Red Hat 5 cluster. Because I had set up this system myself a couple of years earlier, I did not check its current state.

I assumed the offshore team was familiar with the need to attach all shared LUNs to both nodes of the two-node cluster. My bad; never assume. As an Australian tourist once put it when a friend and I were on vacation in Ireland after my Latin grammar school graduation: "Do not make an arse out of you and me." Or, as another phrase goes: "Assuming is the mother of all mistakes."

Well, I fell into my own trap. I went for the filesystem extension on the active node and, without checking the status of the passive node (node2), tested a failover. Because we had agreed to run the database on node2 until the next update window, I had put myself in trouble.

As the databases started to fail, we brought the database cluster down. No issues yet, but all hell broke loose when I ran a filesystem check on an LVM-based system with missing physical volumes.

Looking back

Looking back, I would tell myself: "That was stupid." Running pvs, lvs, or vgs would have alerted me that LVM had detected issues. Also, comparing the multipath configuration files of the two nodes would have revealed probable issues.

So, next time, I would first check whether LVM reports issues before going for the last resort: a filesystem check and an attempt to fix the millions of errors it finds. Most of the time, that will destroy files anyway.
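For illustration, here is a minimal Python sketch of such an LVM pre-check. It relies on the standard LVM reporting command pvs with its pv_name and pv_attr columns, where, to my understanding, an "m" in the attribute string marks a missing physical volume; the script structure and messages are hypothetical.

```python
import subprocess

def missing_physical_volumes():
    """Return physical volumes whose LVM attribute string flags them as missing.

    Sketch only: `pvs --noheadings -o pv_name,pv_attr` prints one PV per line;
    an 'm' in the attribute column indicates the underlying device is absent.
    Requires root privileges, like the LVM tools themselves.
    """
    result = subprocess.run(
        ["pvs", "--noheadings", "-o", "pv_name,pv_attr"],
        capture_output=True, text=True, check=True,
    )
    missing = []
    for line in result.stdout.splitlines():
        fields = line.split()
        if len(fields) >= 2 and "m" in fields[1]:
            missing.append(fields[0])
    return missing

if __name__ == "__main__":
    lost = missing_physical_volumes()
    if lost:
        # Do not run fsck while physical volumes are missing; fix the LUN
        # zoning and multipath configuration first.
        print("Missing physical volumes:", ", ".join(lost))
    else:
        print("LVM reports no missing physical volumes.")
```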

What saved my day

What saved my day back then was:

  • My good relations with colleagues across all teams: a short talk with a great storage admin got the correct zoning to the required LUNs in place, and a DBA with deep knowledge of the clustered databases gave me great help.

  • A good database backup.

  • Great management and a great service manager. They kept the annoyed customer away from us.

  • Not making promises I could not keep, like: "I will fix it in three hours." Instead, statements such as the one below help keep the customer satisfied: "At the current rate of fixing the filesystem, I cannot guarantee a fix within so many hours. As we have just passed the 10% mark, I suggest we stop this approach and discuss another way to solve the issue."

Future-looking statement

I definitely learned some things. For example, always check the environment you’re about to work on before any change. Never assume that you know how an environment looks—change is a constant in IT.

Also, share what you learn from your mistakes. Train offshore colleagues instead of blaming them, and inform them about the impact the issue had on the customer’s business. A continent’s major transport hub cannot be put on hold due to a sysadmin’s mistake.

A shutdown of the transport hub might have been needed if we had failed to solve the issue and the backup site, in a data centre of another service provider, had been hit too. Part of the hub is a harbour, and we could have blown up a part of it next to a village of about 10,000 people if both a cotton ship and an oil tanker had gotten lost on the harbour master's map and collided.

General lessons learned

I learned some important lessons overall from these and other mistakes:

  • Be humble enough to admit your mistakes.

  • Be arrogant enough to state that you are one of the few people who can help fix the issues you caused.

  • Show leadership of the solvers' team, or at least make sure that all of the team’s roles are filled, including the customer relations manager.

  • Take back the role of problem-solver after the team is created, if that is what was requested.

  • "Be part of the solution, do not become part of the problem," as a colleague says.

I cannot stress this enough: Learn from your mistakes to avoid them in the future, rather than learning how to make them on a weekly basis.


Jan Gerrit Kootstra

Solution Designer (for Telco network services). Red Hat Accelerator.
