Monthly disk health reports

Our office, like most offices these days, is paperless.  (Well, not exactly paperless.  We go through many reams of paper per week.  But I digress.)  Being paperless, we keep all of our important files on file servers.  Imagine how one’s office would be brought to its knees, or worse, if such electronic files were to be lost!  Which brings me to the concept of monthly disk health reports.

We start with the concept of RAID (redundant array of inexpensive disks).  The general question is how to reduce to a minimum the risk of loss of data due to disk failure during some specified time interval (say, five years).  You could attack this risk by saying that you will spend however much money is needed to construct a single disk drive that would have an acceptably low risk of failure during that interval.  It turns out that this is a losing game.  No matter how much money you spend on a single hard drive, you will not reduce its risk of failure to a small enough level.

The genius of RAID is to set up an array of two or more disks with clever hardware and clever algorithms so that within the array, any particular item of data is stored redundantly (in the simplest case, a two-drive mirror, it is stored twice).  The notion is that the loss of any one disk drive within the array will not give rise to loss of any data.  The further assumption is that if any drive within the array fails, the user will replace that drive right away, before a second drive in that same array fails.  (Such a second drive failure, if it were to happen too soon on the heels of the first drive failure, would lead to loss of data.)  The RAID concept permits the use of drives that are not so very expensive, while providing reliability that exceeds that of even the most expensive drive that money can buy.

The alert reader will appreciate that this RAID concept delivers on its promise only if it can be assumed that, for any two drives in the array, the failures are uncorrelated.  If there were some reason to fear that both drives might fail on the same day, then the RAID array would be no more reliable than any one of the drives.  For this reason the person who is constructing a RAID array will take pains to ensure that the two drives in the array were not manufactured (for example) on the same day in the same factory (and thus might have in common some latent weakness due to something bad that happened in the factory on that day).
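
To put rough numbers on the independence assumption, here is a small Python sketch.  It is a back-of-the-envelope approximation, not a rigorous reliability model.  The 5% annual failure probability and the three-day replacement window are made-up figures for illustration, the function name is hypothetical, and the whole calculation assumes that the two drives fail independently and that the probabilities are small.

```python
def mirror_loss_probability(p_annual: float, replace_days: float) -> float:
    """Approximate yearly chance that a two-drive mirror loses data.

    Assumes the two drives fail independently, each with probability
    p_annual per year, spread evenly over the year (an assumption).
    """
    # Chance that the surviving drive also fails during the replacement window.
    p_window = p_annual * replace_days / 365.0
    # Either drive may be the first to fail, hence the factor of 2.
    return 2.0 * p_annual * p_window


single = 0.05  # hypothetical 5% annual failure probability per drive
mirror = mirror_loss_probability(single, replace_days=3)
print(f"single drive:     {single:.4%} per year")
print(f"two-drive mirror: {mirror:.6%} per year")
```

The point of the sketch is that the benefit depends on the replacement window being short.  If the failures were correlated (both drives dying on the same day), the multiplication in the last line of the function would no longer be justified.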

In our office we pick two different brands for the two drives.  And we order the two drives on different days so that they are not delivered in the same box (so that the two drives were not exposed to the same risks of damage along the way during shipment).

If you look around, you will find that the companies that make disk drives offer various levels of claimed reliability, for a price.  An ordinary 4-terabyte drive with a rated MTBF of five or ten years might cost $90.  But the same manufacturer might offer a so-called “NAS drive” which has a rated MTBF of one hundred years, for around $140.
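
A rated MTBF can be turned into a rough chance of failure in any one year.  The Python sketch below assumes a constant failure rate (the exponential model), which is a common simplification and may not match how real drives age.  The function name is hypothetical, and the MTBF figures are simply the ones mentioned above.

```python
import math


def annual_failure_probability(mtbf_years: float) -> float:
    """Chance of failure within one year, assuming a constant failure rate."""
    # Under the exponential model, survival over t years is exp(-t / MTBF).
    return 1.0 - math.exp(-1.0 / mtbf_years)


for label, mtbf in [("ordinary drive", 5), ("ordinary drive", 10), ("NAS drive", 100)]:
    print(f"{label}, rated MTBF {mtbf} years: "
          f"{annual_failure_probability(mtbf):.1%} chance of failure per year")
```

Under this model a longer rated MTBF lowers the yearly risk substantially, but it never brings the risk to zero, which is why the redundancy of the array still matters.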

Having said all of this, it would be really good if there were a way to get early warning of the possibility that a particular drive might fail.  With such early warning, one could make sure to have a spare drive readily at hand so that it can be swapped into the array on short notice.  Or one might take the prophylactic step of swapping the new drive into service before failure.  Wouldn’t it be nice to get early warning like this?

What got me thinking about this is that today is the first day of the month.  And each month, on the first day of the month, all of our various file servers send me email reports as to the health of the disk drives that are inside the file servers.  Here is one such report that arrived a few minutes ago from one of our file servers:

Disk 1:

  • Disk Reconnection Count 0
  • Bad Sector Count 0
  • Disk Re-identification Count 0

Disk 2:

  • Disk Reconnection Count 0
  • Bad Sector Count 0
  • Disk Re-identification Count 0

These reports are sort of like golf scores — smaller numbers are better.  What you would like to see is a “zero” for each counter.  If you receive a report with numbers bigger than zero, this is an indication that a particular hard drive might be getting ready to fail.  This particular monthly report is completely good news.
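
If you receive many such reports, the check for nonzero counters can be automated.  Here is a minimal Python sketch that assumes the report arrives as plain text laid out like the one above.  The function name is hypothetical, and the sample report is made up (I changed one counter to 3 so that there is something to flag).  A real file server's email may be formatted differently.

```python
import re


def nonzero_counters(report: str) -> list[tuple[str, str, int]]:
    """Return (disk, counter name, value) for every counter above zero."""
    problems = []
    disk = "unknown disk"
    for raw in report.splitlines():
        # Drop indentation and the leading bullet character, if any.
        line = raw.strip().lstrip("•").strip()
        if not line:
            continue
        if line.endswith(":"):
            disk = line[:-1]  # a heading such as "Disk 1:"
            continue
        # A counter line is a name followed by a whole number.
        match = re.fullmatch(r"(.+?)\s+(\d+)", line)
        if match and int(match.group(2)) > 0:
            problems.append((disk, match.group(1), int(match.group(2))))
    return problems


# Made-up sample: same layout as the report above, with one nonzero counter.
sample = """\
Disk 1:
  • Disk Reconnection Count 0
  • Bad Sector Count 0
  • Disk Re-identification Count 0
Disk 2:
  • Disk Reconnection Count 0
  • Bad Sector Count 3
  • Disk Re-identification Count 0
"""

for disk, name, value in nonzero_counters(sample):
    print(f"Warning: {disk}: {name} is {value}")
```

An empty result corresponds to the all-zeros report that you hope to see each month.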

Of course RAID by itself is not enough to secure one’s files in a paperless office.  Any particular server could, for example, have the bad luck to be located in a building that catches fire and burns down.  So there will also be some kind of automatic offsite backup.  Maybe the offsite backup will be carried out daily.  These days there are systems that carry out the offsite backups in real time.  Of course the backup needs to take place by means of an encrypted data channel.  (One would not want it to be possible for an eavesdropper to see the files that are being backed up by means of the data channel.)

How do you select the drives that go into your RAID arrays?  Do you pick different brands?  Do you use NAS-rated drives?  Please post a comment below.

4 Replies to “Monthly disk health reports”

  1. Interesting article.
    Not being a large firm (in fact the smallest firm that can exist), I back up at least once a day to an internal hard drive and a removable hard drive. Every two to three days I swap out the removable hard drive and put it in a safe. About every week I back up the latest removable hard drive to another backup computer (C drive) and then back up the C drive to the internal backup drive.
    It seems complicated, but it runs smoothly and does not take much time. The advantages are that I have the removable drive offline in case everything in operation gets infected; that if I need the latest backup for travel or emergency reasons, I just pull the removable drive; and that in case of total collapse of one computer I have a backup ready to go. Additionally, at all times I have at least 5 backups that are fairly up to date. PS: not a friend of cloud backups.
    One refinement I added: my prime C drive is constantly changing structure due to my filing system changes. Once a week or once a month I change the name of one of my main backup files, for example from “Clients” to “Clients2”, so that when the Clients file on the C drive is backed up, the new backup exactly mirrors the Clients file on my C drive. I can then delete Clients2 or keep it. This can be repeated for all the backups.
    I suspect this might have glitches for the medium size firms….

  2. Thank you, Carl. Readers should know that other things can go south with any version of RAID, leading to unrecoverable data. Do your due diligence first. Caveat emptor! RICK

    1. Rick, we used to use RAID 5. We eventually realized that if we pull any one drive from a RAID 5 array and drop it into a computer to try to read it, we will fail. In contrast, with RAID 1, a single drive pulled from the array can be read all by itself. Nowadays our habit is to use RAID 1, for this reason.

      Of course the traditional reason to use RAID 5 is that it provides faster response for large streamed files. Our main use case is the storage and retrieval of more traditional word processor files and PDF files, and we find (in our informal tests) that we don’t perceive a meaningful difference in response time as between RAID 1 and RAID 5. So we have now migrated to RAID 1. I wonder which version of RAID you use lately?

  3. Hi Carl,

    Very interesting subject. We have our IT support provide the drives that go into the RAID array. As yet we haven’t had any notable failures, but we do follow their advice to replace the drives periodically or when their performance checks indicate that it might be prudent to do so. The cost of spinning-platter HDs is low enough that it’s not an issue.

    Disk failure is always something to be concerned about. Like you, we use the server for storage and retrieval of word processing files and PDFs.

    I’m looking forward to building a new SSD RAID array in the coming year.
