Best way to configure alerts on drive smart changes #1759

Open
sbe-arg opened this issue May 6, 2024 · 14 comments

Comments

@sbe-arg

sbe-arg commented May 6, 2024

I had a drive go from good to warning sometime in the last 2 weeks (noticed yesterday) and I did not receive any alerts.

I get alerts for other things, such as logins and CPU utilization.

I was hoping to get an early alert about the drive's warning state, rather than having to pray for one if the drive dies and/or the RAIDs become degraded.

I'm running the latest version.

Is this something I have to set on the scheduled smart scans?

@votdev
Member

votdev commented May 6, 2024

Is this something I have to set on the scheduled smart scans?

No, the only thing that is necessary is to enable monitoring in the "SMART | Devices" page. In that case smartmontools will send emails if there is anything happening to your disks. Configuring and activating notifications is of course a prerequisite.

@sbe-arg
Author

sbe-arg commented May 7, 2024

Okay, that's odd. I have that enabled, but no alert was sent for the disk in the warning state.

@votdev
Member

votdev commented May 7, 2024

Is there a warning message in the SMART log? SMART will only trigger an email if there is a change in the attributes. It makes no predictions.

@sbe-arg
Author

sbe-arg commented May 7, 2024

I'll share the logs.

@sbe-arg
Author

sbe-arg commented May 8, 2024

I don't have any smartd logs for the last month.

This is the last of my logs. I'm confused about why I'm missing a month of logs.

Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST2000DM001-1ER164_Z4Z6P5F9 [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 65 to 64
Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST2000DM001-1ER164_Z4Z6P5F9 [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 35 to 36
Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2CV104_WFN82YLQ [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 74 to 73
Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2CV104_WFN82YLQ [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 26 to 27
Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2U9104_WW60YAQS [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72
Apr 05 23:13:54 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2U9104_WW60YAQS [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28
Apr 06 01:12:51 omv1 smartd[1384526]: smartd received signal 15: Terminated
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST1000DM003-1ER162_Z4Y6SQGT [SAT], state written to /var/lib/smartmontools/smartd.ST1000DM003_1ER162-Z4Y6SQGT.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST2000DM001-1ER164_Z4Z6P5F9 [SAT], state written to /var/lib/smartmontools/smartd.ST2000DM001_1ER164-Z4Z6P5F9.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], state written to /var/lib/smartmontools/smartd.ST2000DM001_9YN164-W1E2ADV9.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2CV104_WFN82YLQ [SAT], state written to /var/lib/smartmontools/smartd.ST4000DM004_2CV104-WFN82YLQ.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-ST4000DM004-2U9104_WW60YAQS [SAT], state written to /var/lib/smartmontools/smartd.ST4000DM004_2U9104-WW60YAQS.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: Device: /dev/disk/by-id/ata-16GB_SATA_Flash_Drive_C011201111070000498B [SAT], state written to /var/lib/smartmontools/smartd.16GB_SATA_Flash_Drive-C011201111070000498B.ata.state
Apr 06 01:12:51 omv1 smartd[1384526]: smartd is exiting (exit status 0)


@sbe-arg
Author

sbe-arg commented May 8, 2024

I went in manually via SSH and gunzipped the older logs. It turns out I do have more smartd logs than the console is showing, but there is no record of reallocated or bad sectors for this drive.

The drive is scanned and is showing events.

May  1 04:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 99
May  1 04:45:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100
May  1 05:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 102
May  1 11:45:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 103
May  2 02:15:07 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 104
May  2 03:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 104 to 105
May  2 05:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 106
May  2 10:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 106 to 107
May  3 02:45:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 107 to 108
May  3 03:45:07 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], Temperature changed -5 Celsius to 35 Celsius (Min/Max 34/47)
May  3 13:15:06 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], Temperature changed +5 Celsius to 40 Celsius (Min/Max 34/47)
May  4 06:15:07 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], Temperature changed -5 Celsius to 35 Celsius (Min/Max 34/47)
May  4 11:45:07 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 109
May  4 15:15:07 omv1 smartd[2164]: Device: /dev/disk/by-id/ata-ST2000DM001-9YN164_W1E2ADV9 [SAT], Temperature changed +5 Celsius to 40 Celsius (Min/Max 34/47)
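
The prefailure changes in an excerpt like the one above can be condensed with a small filter. This is only a minimal sketch: the function name `prefailure_changes` is hypothetical, and it assumes the syslog line layout shown in this thread, with the attribute name, old value, and new value counted from the end of the line. The log path in the usage comment is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: condense smartd "Prefailure Attribute" log lines read from stdin.
# Assumes the line layout shown in this thread:
#   ... SMART Prefailure Attribute: <id> <name> changed from <old> to <new>
prefailure_changes() {
  awk '/SMART Prefailure Attribute:/ {
    # Fields 1 to 3 are the syslog timestamp; the rest are counted from the end.
    print $1, $2, $3, $(NF-5), $(NF-2), "->", $NF
  }'
}

# Assumed usage on the server (the path may differ):
#   zcat /var/log/syslog.*.gz | prefailure_changes
```

Temperature lines and other smartd messages are dropped, so only the attribute drift remains visible.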

I have a pretty basic setup as well...
Docker with a few services.
2-3 shares
6 drives (2 mirror raids and OS disk and one ephemeral drive for temp storage) very low transfers

Letting me know when a drive is dying is the one job it is supposed to do well, and it failed to do it. Very disappointing.
Is there a way to mock this in development to test if it triggers alarms at all?

@votdev
Member

votdev commented May 8, 2024

Please make sure that SMART notifications are enabled under System | Notification | Notifications.

@sbe-arg
Author

sbe-arg commented May 8, 2024

SMART notifications have been enabled for 2+ years, since I set up this server.

@votdev
Member

votdev commented May 13, 2024

What does your /etc/smartd.conf look like?

@votdev
Member

votdev commented May 13, 2024

Does the /usr/share/smartmontools/smartd-runner script exist on your system? What is the output of ls -alh /etc/smartmontools/run.d?

@sbe-arg
Author

sbe-arg commented May 13, 2024

@votdev here is the info

root@omv1:~# cat /etc/smartd.conf
# This file is auto-generated by openmediavault (https://www.openmediavault.org)
# WARNING: Do not edit this file, your changes will get lost.

DEFAULT -a -o on -S on -T permissive -W 5,50,55 -n never,q
/dev/disk/by-id/drive1 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

/dev/disk/by-id/drive2 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

/dev/disk/by-id/drive3 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

/dev/disk/by-id/drive4 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

/dev/disk/by-id/drive5 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

/dev/disk/by-id/drive6 \
  -m email -M exec /usr/share/smartmontools/smartd-runner

root@omv1:~# ls -alh /etc/smartmontools/run.d
total 12K
drwxr-xr-x 2 root root 4.0K May 13 19:05 .
drwxr-xr-x 5 root root 4.0K May 13 18:55 ..
-rwxr-xr-x 1 root root  231 Oct 10  2019 10mail
root@omv1:~# cat /usr/share/smartmontools/smartd-runner
#!/bin/bash -e

tmp=$(mktemp)
cat >$tmp

run-parts --report --lsbsysinit --arg=$tmp --arg="$1" \
    --arg="$2" --arg="$3" -- /etc/smartmontools/run.d

rm -f $tmp

root@omv1:~# 

@votdev
Member

votdev commented May 14, 2024

It all looks fine. I don't know where the problem is at the moment.

@sbe-arg
Author

sbe-arg commented May 14, 2024

It would be nice to be able to simulate a drive failure to test this. That is probably part of smartd.

@votdev
Member

votdev commented May 14, 2024

It would be nice to be able to simulate a drive failure to test this. That is probably part of smartd.

This is part of smartmontools; OMV cannot do anything here. I checked yesterday whether smartd supports that, but did not find anything at first glance.
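
For exercising the notification path (not an actual drive failure), smartd.conf(5) documents a `-M test` directive that sends a single test notification when smartd starts. The following is a minimal sketch under stated assumptions: it works on a scratch copy, since the OMV-generated /etc/smartd.conf should not be edited by hand; the function name `add_test_directive` and the path /tmp/smartd-test.conf are hypothetical; and the commented `smartd` invocation uses the documented `-q onecheck` and `-c` options but was not run against this setup.

```shell
#!/bin/sh
# Hypothetical helper: append "-M test" to every line carrying "-M exec",
# so smartd sends one test notification per monitored device at startup.
add_test_directive() {
  sed '/-M exec /s/$/ -M test/'
}

# Example with one device entry in the format from this thread:
printf '%s\n' '/dev/disk/by-id/drive1 \' \
  '  -m email -M exec /usr/share/smartmontools/smartd-runner' | add_test_directive

# Assumed usage on the server (as root), not verified here:
#   add_test_directive < /etc/smartd.conf > /tmp/smartd-test.conf
#   smartd -q onecheck -c /tmp/smartd-test.conf
```

If the test message arrives, the smartd-runner and mail chain is probably working, and the missing alert more likely comes down to which events smartd treats as worth mailing.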
