Find the disk to fail and replace due to old age
Check the health of the RAID array to ensure it is in a good state before taking any action.
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4] sdd1[3]
5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
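For a more detailed view than /proc/mdstat, mdadm itself can report the array state and each member device; a healthy array should show something like "State : clean" with every device listed as active sync before going any further.
sudo mdadm --detail /dev/md0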
From the recent replacement, sdb1 is the Seagate drive, with sda and sdd being the remaining Western Digital workhorse disks. To make sure we replace the correct drive, find the serial number of each disk using smartctl.
/dev/sda
sudo smartctl -a /dev/sda
Model Family: Western Digital AV-GP (AF)
Device Model: WDC WD30EURS-63SPKY0
Serial Number: WD-WMC1T3650695
9 Power_On_Hours 0x0032 006 006 000 Old_age Always - 69340
/dev/sdd
sudo smartctl -a /dev/sdd
Model Family: Western Digital AV-GP (AF)
Device Model: WDC WD30EURS-63R8UY0
Serial Number: WD-WCAWZ2027633
9 Power_On_Hours 0x0032 001 001 000 Old_age Always - 79993
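As a quick cross-check of which serial belongs to which device node, lsblk can print the model and serial for every disk in one go (column availability depends on the util-linux version):
lsblk -d -o NAME,MODEL,SERIAL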
Disk sda has been online for 7.9 years and sdd for 9.1 years. Both of these disks are still going strong, but it's only a matter of time until they also fail.
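The year figures are just the SMART Power_On_Hours divided by the hours in a year (8766 = 24 x 365.25), which a quick one-liner confirms:
awk 'BEGIN { print 69340/8766, 79993/8766 }'
7.91011 9.12537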
With sdd being the oldest, I will replace it first and let the array rebuild before replacing sda.
The best way to approach this is to first add the new disk as a spare for the array and then mark the disk to be replaced, like so:
sudo mdadm --manage /dev/md0 --replace /dev/sdd1
What this triggers is the spare to take over and replace the marked device. This is similar to marking a device as faulty, but the device remains in service during the recovery process to increase resilience against multiple failures. When the replacement process finishes, the replaced device will be marked as faulty.
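For reference, on a system with a free port the full spare-then-replace sequence would look something like the following, with /dev/sde1 standing in for a partition on the hypothetical extra disk:
sudo mdadm --manage /dev/md0 --add /dev/sde1
sudo mdadm --manage /dev/md0 --replace /dev/sdd1 --with /dev/sde1
The --with option pins the replacement to that specific spare; without it, mdadm will use any available spare.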
Unfortunately, I don't have enough SATA ports available to add the replacement disk as a spare, so I'll be failing the device and replacing it. This is riskier, as while the rebuild is underway there is no resilience to a second disk failure.
Fail the disk to be removed from the array
sudo mdadm --manage /dev/md0 -f /dev/sdd1
mdadm: set /dev/sdd1 faulty in /dev/md0
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4] sdd1[3](F)
5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
Hot remove the disk
sudo mdadm --manage /dev/md0 -r /dev/sdd1
mdadm: hot removed /dev/sdd1 from /dev/md0
Array Status
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb1[5] sda1[4]
5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
With sdd removed, I will zero its superblock to prevent it from rejoining the array if the disk is ever added back in future.
sudo mdadm --zero-superblock /dev/sdd1
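To confirm the superblock is actually gone, examining the partition should now report that no md superblock was detected:
sudo mdadm --examine /dev/sdd1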
With that done, it's time to shut down the server and find and replace the disk we just failed, using the serial number recorded earlier.
With the replacement disk installed, use gdisk to create a new GPT and a single Linux RAID partition on it.
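As a scriptable alternative to interactive gdisk, sgdisk from the same package can wipe the disk and create one Linux RAID (type fd00) partition spanning it. The new disk shows up as /dev/sdc here, but double-check the device name before running anything destructive:
sudo sgdisk --zap-all /dev/sdc
sudo sgdisk -n 1:0:0 -t 1:fd00 /dev/sdc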
Add the replacement into the array.
sudo mdadm --manage /dev/md0 -a /dev/sdc1
mdadm: added /dev/sdc1
Array Status
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdc1[3] sda1[4] sdb1[5]
5860267008 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
[>....................] recovery = 0.1% (3304704/2930133504) finish=493.6min speed=98823K/sec
With the replacement disk now a member of the array, the rebuild is taking place. Once this is complete, we can do the same for the remaining 7.9-year-old disk (sda) to cycle out the older disks.
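The rebuild can be watched live, and if the server is otherwise idle, the kernel's minimum rebuild speed can optionally be raised temporarily (value in KB/s) to speed things up:
watch -n 5 cat /proc/mdstat
echo 100000 | sudo tee /proc/sys/dev/raid/speed_limit_min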