Linux RAID Management - replacing disks

Find the disk to fail and replace due to old age

Check the health of the RAID array to make sure it is clean before taking any action.

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4] sdd1[3]
      5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
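
For a fuller picture than /proc/mdstat, mdadm can report the array state and each member's role directly.

sudo mdadm --detail /dev/md0    # State should read "clean" and every member "active sync" before continuing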

From the recent replacement, sdb1 is the Seagate drive, with sda and sdd being the remaining Western Digital workhorse disks. To make sure we replace the correct drive, find the serial number of each disk using smartctl from smartmontools.

/dev/sda

sudo smartctl -a /dev/sda

Model Family:     Western Digital AV-GP (AF)
Device Model:     WDC WD30EURS-63SPKY0
Serial Number:    WD-WMC1T3650695

 9 Power_On_Hours          0x0032   006   006   000    Old_age   Always       -       69340

/dev/sdd

sudo smartctl -a /dev/sdd

Model Family:     Western Digital AV-GP (AF)
Device Model:     WDC WD30EURS-63R8UY0
Serial Number:    WD-WCAWZ2027633

 9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       79993
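
The same details can be pulled for every member in one pass; a minimal loop, assuming the members are sda, sdb and sdd as above:

for disk in /dev/sda /dev/sdb /dev/sdd; do
    echo "== $disk =="
    # keep only the identification and age fields from the full SMART report
    sudo smartctl -a "$disk" | grep -E 'Device Model|Serial Number|Power_On_Hours'
done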

Disk sda has been online for 7.9 years and sdd for 9.1 years. Both of these disks are still going strong, but it's only a matter of time until they also fail.
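
As a quick check, the ages come straight from the Power_On_Hours values divided by the hours in a year:

echo "scale=1; 69340 / (24 * 365.25)" | bc    # sda: ~7.9 years
echo "scale=1; 79993 / (24 * 365.25)" | bc    # sdd: ~9.1 years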

With sdd being the oldest, I will replace it first and let the array rebuild before replacing sda.

The best way to approach this is to add the new disk to the array as a spare first, then mark the disk to be replaced like so:

sudo mdadm --manage /dev/md0 --replace /dev/sdd1

This triggers the spare to take over and replace the marked device. It is similar to marking a device as faulty, but the device being replaced remains in service during the recovery process, which increases resilience against a second failure. When the replacement process finishes, the replaced device is marked as faulty.
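
As a sketch of that spare-based approach, assuming the new disk had already been partitioned and showed up as /dev/sdc1, the full sequence would be:

sudo mdadm --manage /dev/md0 --add /dev/sdc1                          # new disk joins the array as a spare
sudo mdadm --manage /dev/md0 --replace /dev/sdd1 --with /dev/sdc1     # rebuild onto the spare while sdd1 stays in service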

Unfortunately I don't have enough SATA ports available to add the replacement disk as a spare, so I'll be failing the device and then replacing it. This is riskier: while the rebuild is underway there is no resilience against a second disk failure.

Fail the disk to be removed from the array 

sudo mdadm --manage /dev/md0 -f /dev/sdd1
mdadm: set /dev/sdd1 faulty in /dev/md0

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4] sdd1[3](F)
      5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

Hot remove the disk

sudo mdadm --manage /dev/md0 -r /dev/sdd1
mdadm: hot removed /dev/sdd1 from /dev/md0

Array Status

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4]
      5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
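
The array is now running degraded on two of the three members. The degraded count is also exposed through sysfs, which is handy for a quick scripted check:

cat /sys/block/md0/md/degraded    # number of missing members; 1 here, back to 0 once rebuilt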


With sdd failed and removed, I will zero its superblock so that it cannot rejoin the array if the disk is ever added back in the future.

sudo mdadm --zero-superblock /dev/sdd1
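
To confirm the superblock is gone, examine the partition again; mdadm should now report that no md superblock is detected.

sudo mdadm --examine /dev/sdd1    # expect a "No md superblock detected" style message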

With that done, it's time to shut down the server, then find and replace the disk we just failed using the serial number recorded earlier.
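
Before powering off, the serial can be cross-checked against the device node, since udev puts the model and serial in the /dev/disk/by-id link names:

ls -l /dev/disk/by-id/ | grep WCAWZ2027633    # the matching link should point at ../../sdd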

With the replacement disk installed, use gdisk to create a new GPT with a single Linux RAID partition.
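
The partitioning can also be done non-interactively with sgdisk, gdisk's scriptable counterpart. A minimal sketch, assuming the new disk appears as /dev/sdc (double-check the device name before wiping anything):

sudo sgdisk --zap-all /dev/sdc              # destroy any existing partition tables and start a fresh GPT
sudo sgdisk -n 1:0:0 -t 1:fd00 /dev/sdc     # one partition spanning the disk, type fd00 (Linux RAID)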

Add the replacement into the array.

sudo mdadm --manage /dev/md0 -a /dev/sdc1
mdadm: added /dev/sdc1

Array Status

cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdc1[3] sda1[4] sdb1[5]
      5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [>....................]  recovery =  0.1% (3304704/2930133504) finish=493.6min speed=98823K/sec

With the replacement disk now a member of the array, the rebuild is underway. Once it is complete, we can repeat the process for the 7.9-year-old sda to cycle out the remaining older disk.
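
The rebuild can be left to run unattended; watch keeps an eye on the progress and the md sysfs tree reports the current sync action:

watch -n 30 cat /proc/mdstat          # refresh the recovery progress every 30 seconds
cat /sys/block/md0/md/sync_action     # "recover" while rebuilding, "idle" once finished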