Error unc at lba - Исправление ошибок и поиск оптимальных решений проблем

When running smartctl on your hard drive, you often get a plethora of information that can be hard to interpret for unexperienced users. This post attempts to provide aid in interpreting what the technical reasons behind the error messages are. If you’re looking for advice on whether to replace your hard drive, the only guidance I can give you is it might fail any time, so better backup your data, but it might also run for many years to come.. Furthermore, this article does not describe basic SMART WHEN_FAILED checking but rather interpretation of more subtle signs of possibly impending HDD failures.

One example that is particularly hard to interpret is the device error log storing the last few errors, for example

Error 8910 occurred at disk power-on lifetime: 7257 hours (302 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 18 00 33 96 40 00      03:09:52.125  READ FPDMA QUEUED
  60 88 10 50 06 11 40 00      03:09:52.125  READ FPDMA QUEUED
  60 08 08 60 ac 5e 40 00      03:09:52.113  READ FPDMA QUEUED
  60 08 00 48 cf 6d 40 00      03:09:52.099  READ FPDMA QUEUED
  60 90 f0 b0 ef e5 40 00      03:09:52.065  READ FPDMA QUEUED

Obviously, the first line shows when this error occured. The other lines, however, are not as obvious. Let’s examine the next section:

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 1a 00 33 96 61  Error: UNC at LBA = 0x01963300 = 26620672

While this section also shows the content of some registers while the error occured, the interesting part of it is the error description Error: UNC at LBA = 0x01963300 = 26620672.

A LBA is a logical block address, i.e. some logical address on the hard drive. It is shown in both hexadecimal form 0x01963300 and in decimal form 26620672. In order to convert it to a byte address, you need to multiply it by the value listed at the head of the smartctloutput:

Sector Size:      512 bytes logical/physical

In almost any case, this value is 512 bytes, so in this example the byte offset would be 26620672 * 512 = 13629784064 = 12.69 GiB. In some cases it might be helpful to look up this address in a tool like GParted to see in which partition the error occured in. Also see this smartmontools HOWTO describing this process in detail.

UNC errors

The error message now tells us than an error called UNC occured at this LBA. UNC is shorthand for UNCorrectable, which means the data which has been read from the hard drive at this LBA was damaged and could not be corrected.

Hard drives not only store your data by itself, but automatically compute a so-called error-correction code (ECC). While there are many subtypes of those mathematical codes, they have one aspect in common: Given a set of bytes (e.g. the ones stored on the hard drive) which might be slightly damaged (i.e. some 0-bits are now-1 bits or vice versa) and and the matching ECC code (constituting of a few extra bytes) a suitable decoder can recover a limited number of bit errors. In most cases, ECC codes can also detect errors – for example, one specific ECC code might be able to correct one bit flip in two bytes, but it can detect up to three bitflips in two bytes.

If there are more bitflips than the ECC can recover (but not more than it can detect), this results in an unrecoverable error – the UNC. If there are more bitflips than the ECC can detect, anything might happen: Usually, the data that is computed from the ECC will be damaged, or no error might be detected at all.

Note that this explanation is highly simplified. For example, ECC codes are not stored as bytes separate from the data, but instead a mathematical function is computed on the data, resulting in a set of bytes that is larger that the original dataset – containing both the data itself plus the error-recovery extra data. In other words, the ECC data plus the data itself are mixed together.

This has multiple consequences for the interpretation. Firstly, this means that physically the data could be read, yet it does not seem to be correct. This means

Other error messages

While UNC errors occur reasonably often, there are other, more rare errors that you can’t find too much documentation about.

There is one definitive source for all smartctl error messages: The smartmontools source code.

We can find the error descriptions in ataprint.cpp (also see the GPL license information in the source tarball):

const char  *abrt  = "ABRT";  // ABORTED
const char   *amnf  = "AMNF";  // ADDRESS MARK NOT FOUND
const char   *ccto  = "CCTO";  // COMMAND COMPLETION TIMED OUT
const char   *eom   = "EOM";   // END OF MEDIA
const char   *icrc  = "ICRC";  // INTERFACE CRC ERROR
const char   *idnf  = "IDNF";  // ID NOT FOUND
const char   *ili   = "ILI";   // MEANING OF THIS BIT IS COMMAND-SET SPECIFIC
const char   *mc    = "MC";    // MEDIA CHANGED 
const char   *mcr   = "MCR";   // MEDIA CHANGE REQUEST
const char   *nm    = "NM";    // NO MEDIA
const char   *obs   = "obs";   // OBSOLETE
const char   *tk0nf = "TK0NF"; // TRACK 0 NOT FOUND
const char   *unc   = "UNC";   // UNCORRECTABLE
const char   *wp    = "WP";    // WRITE PROTECTED

Realistically, you’ll only encounter a few of these errors even if you are working with hard disks professionally. Some of these errors like MC, MCR or NM are also related to hot-swapping of hard drives and do not neccessarily represent errors related to hard drive health itself.

One important error is ICRC – the interface CRC error. This means that there are errors being detected on the IDE/SATA or PCIe bus the hard drive is connected to. Although this is rare and might be caused by the HDD itself, it might mean that your chipset (the hardware controlling e.g. SATA) is damaged – in this case, replacing the hard drive would not fix the issue. Possibly there is also an intermittent cable connection.

How severe are those errors?

Over the life of most hard drives, especially consumer models, errors will occur – more often so in portable devices where high acceleration forces are more like to be encountered.

What separates a good hard drive from one at the end of its life (excluding those that fail without warning) is often the frequency of new errors. If you look at the total lifetime of the HDD, i.e. Power_On_Hours or similar:

9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       8586

and compare the value (in this case 8586) with the lifetime at the last error,

Error 8911 occurred at disk power-on lifetime: 7257 hours

in this case, 7257, you can see over a thousand HDD operational hours have passed since the last error. This indicates that there is no mechanical defect which could result in destruction of the hard drive but rather a couple of defective or damaged sectors. UNC errors do not even neccessarily mean that the sectors are physically damaged.

Often hard drive errors are triggered when a files that are accessed very rarely (such as archived video files that are only opened every few years). When there are enough bit flips in such files for any reason, this can result in a larger number of HDD errors appearing at once.

Another indicator is the total number of errors the hard drive has encountered, i.e. 8911 in

Error 8911 occurred at disk power-on lifetime: 7257 hours

or in

ATA Error Count: 8911 (device log contains only the most recent five errors)

While this number is not shown for all hard drives, a very high number or a number which is growing rapidly indicates there is some physical issue with the drive. Issues relating to only a few bad sectors induce a sudden jump in the error counter, but after that. Note, however, that there can be other reasons for a high error counter, for example a bad or intermittent physical connection to the hard drive.

Also see this previous post on how to fix bad HDD sectors.

Источник

Hi,

Five year old system (disks replaced three years ago) Specs:

MB: Supermicro X11SSM-F
CPU: Intel Xeon E3-1220 V5 Skylake
RAM: 32 Gb ECC: 2x Samsung DDR4-2133 CL15 ECC SC — 16GB
Boot: Supermicro SATA DOM
Disks: 6x WDC WD60EFRX-68L0BN1 i raidz2
OS: TrueNAS-12.0-RELEASE (f862218137)

Received this error today after scrub:

Code:

root@freenas:~ # zpool status
  pool: freenas-boot
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 00:01:23 with 0 errors on Fri Jun 11 03:46:23 2021
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada0p2      ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 8K in 10:37:45 with 0 errors on Tue Jun 15 12:37:48 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/788fc1ad-5f48-11e8-bd8c-0cc47ab4d97a  ONLINE       0     0     0
            gptid/13a2c7ac-5f91-11e8-be7a-0cc47ab4d97a  ONLINE       0     0     0
            gptid/28cd3c61-5fd8-11e8-b6db-0cc47ab4d97a  ONLINE       0     0     0
            gptid/e0c32132-601c-11e8-8ec6-0cc47ab4d97a  ONLINE       0     0     0
            gptid/9a4156da-605d-11e8-b0f3-0cc47ab4d97a  ONLINE       3     0     0
            gptid/60556717-60a4-11e8-bb0f-0cc47ab4d97a  ONLINE       0     0     0

errors: No known data errors
root@freenas:~ #

Smartctl shows:

Code:

root@freenas:~ # smartctl -x /dev/ada5
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX31D87DYY38
LU WWN Device Id: 5 0014ee 20f4b16d2
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jun 15 14:29:34 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 6344) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 717) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   197   197   021    -    9133
  4 Start_Stop_Count        -O--CK   100   100   000    -    32
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   064   064   000    -    26369
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    31
192 Power-Off_Retract_Count -O--CK   200   200   000    -    6
193 Load_Cycle_Count        -O--CK   200   200   000    -    565
194 Temperature_Celsius     -O---K   117   106   000    -    35
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      54  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 5
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 [4] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 a0 00 02 8e d0 96 20 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  60 00 08 00 98 00 02 4c 9d c3 c8 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  60 00 50 00 90 00 01 dd 8c 4f 28 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:12.495  READ LOG EXT
  60 00 40 00 80 00 02 8e d0 96 20 40 08 26d+00:18:08.000  READ FPDMA QUEUED

Error 4 [3] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 80 00 02 8e d0 96 20 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  60 00 08 00 78 00 02 4c 9d c3 c8 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  60 00 50 00 70 00 01 dd 8c 4f 28 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:07.999  READ LOG EXT
  60 00 40 00 60 00 02 8e d0 96 20 40 08 26d+00:18:03.505  READ FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 60 00 02 8e d0 96 20 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  60 00 08 00 58 00 02 4c 9d c3 c8 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  60 00 50 00 50 00 01 dd 8c 4f 28 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:03.503  READ LOG EXT
  61 00 10 00 40 00 01 f3 8a ae 48 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: WP at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 40 00 01 f3 8a ae 48 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  61 00 40 00 38 00 01 f3 8a 1b 18 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  61 00 40 00 30 00 01 f3 8a 1a c0 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  60 00 40 00 28 00 02 8e d0 96 20 40 08 26d+00:17:59.020  READ FPDMA QUEUED
  60 00 08 00 20 00 02 4c 9d c3 c8 40 08 26d+00:17:59.019  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 dd 8c 60 f0 40 08 26d+00:17:54.568  READ FPDMA QUEUED
  60 00 50 00 d0 00 01 dd 8c 4f 28 40 08 26d+00:17:54.562  READ FPDMA QUEUED
  60 00 38 00 c8 00 01 dd 8c 4d 20 40 08 26d+00:17:54.558  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 dd 8b ff 20 40 08 26d+00:17:54.552  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 dd 8b fb 38 40 08 26d+00:17:54.546  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       127         -
# 2  Extended offline    Completed without error       00%        13         -
# 3  Extended offline    Aborted by host               90%         0         -
# 4  Conveyance offline  Completed without error       00%         0         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     20/38 Celsius
Lifetime    Min/Max Temperature:     20/46 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (280)

Index    Estimated Time   Temperature Celsius
 281    2021-06-15 06:32    37  ******************
 ...    ..(136 skipped).    ..  ******************
 418    2021-06-15 08:49    37  ******************
 419    2021-06-15 08:50    36  *****************
 ...    ..( 22 skipped).    ..  *****************
 442    2021-06-15 09:13    36  *****************
 443    2021-06-15 09:14    35  ****************
 ...    ..( 80 skipped).    ..  ****************
  46    2021-06-15 10:35    35  ****************
  47    2021-06-15 10:36    38  *******************
 ...    ..(106 skipped).    ..  *******************
 154    2021-06-15 12:23    38  *******************
 155    2021-06-15 12:24    37  ******************
 ...    ..(124 skipped).    ..  ******************
 280    2021-06-15 14:29    37  ******************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              31  ---  Lifetime Power-On Resets
0x01  0x010  4           26369  ---  Power-on Hours
0x01  0x018  6     71888865320  ---  Logical Sectors Written
0x01  0x020  6      1818551784  ---  Number of Write Commands
0x01  0x028  6    535154745801  ---  Logical Sectors Read
0x01  0x030  6      2580298900  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           26188  ---  Spindle Motor Power-on Hours
0x03  0x010  4           26175  ---  Head Flying Hours
0x03  0x018  4             572  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4             476  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               5  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              35  ---  Current Temperature
0x05  0x010  1              36  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              46  ---  Highest Temperature
0x05  0x028  1              20  ---  Lowest Temperature
0x05  0x030  1              43  ---  Highest Average Short Term Temperature
0x05  0x038  1              21  ---  Lowest Average Short Term Temperature
0x05  0x040  1              40  ---  Highest Average Long Term Temperature
0x05  0x048  1              26  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             168  ---  Number of Hardware Resets
0x06  0x010  4              62  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
Index                LBA    Hours
    0         4303484304        -
    1         4303484305        -
    2         4303484306        -
    3         4303484307        -
    4         4303484308        -
    5         4303484309        -
    6         4303484310        -
    7         4303484311        -
    8         4303484312        -
    9         4303484313        -
   10         4303484314        -
   11         4303484315        -
   12         4303484316        -
   13         4303484317        -
   14         4303484318        -
   15         4303484319        -
   16         4321384408        -
   17         4321384409        -
   18         4321384410        -
   19         4321384411        -
   20         4321384412        -
   21         4321384413        -
   22         4321384414        -
   23         4321384415        -
   24         4321384424        -
   25         4321384425        -
   26         4321384426        -
   27         4321384427        -
   28         4321384428        -
   29         4321384429        -
   30         4321384430        -
... (937 entries not shown)

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4     15166624  Vendor specific

Could the ada5 disk be failing? What should be my next move? Do a smartctl long test on all the disks?

Could the ada5 disk be failing? What should be my next move? Do a smartctl long test on all the disks?

Since you haven’t run long smart in a while that would be the first if it was me

Since you haven’t run long smart in a while that would be the first if it was me

Thank you. Running the test now.

Since you haven’t run long smart in a while that would be the first if it was me

Ok. I ran the test and it failed. Should I assume a broken disk and replace it or something else?

smartctl:

Code:

root@freenas:~ # smartctl -x /dev/ada5
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RC3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68L0BN1
Serial Number:    WD-WX31D87DYY38
LU WWN Device Id: 5 0014ee 20f4b16d2
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 16 08:33:24 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                ( 6344) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 717) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   197   197   021    -    9133
  4 Start_Stop_Count        -O--CK   100   100   000    -    32
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   064   064   000    -    26388
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    31
192 Power-Off_Retract_Count -O--CK   200   200   000    -    6
193 Load_Cycle_Count        -O--CK   200   200   000    -    566
194 Temperature_Celsius     -O---K   118   106   000    -    34
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    1
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x0c       GPL     R/O   2048  Pending Defects log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb7       GPL,SL  VS      54  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 5
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 [4] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 a0 00 02 8e d0 96 20 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  60 00 08 00 98 00 02 4c 9d c3 c8 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  60 00 50 00 90 00 01 dd 8c 4f 28 40 08 26d+00:18:12.496  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:12.495  READ LOG EXT
  60 00 40 00 80 00 02 8e d0 96 20 40 08 26d+00:18:08.000  READ FPDMA QUEUED

Error 4 [3] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 80 00 02 8e d0 96 20 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  60 00 08 00 78 00 02 4c 9d c3 c8 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  60 00 50 00 70 00 01 dd 8c 4f 28 40 08 26d+00:18:08.000  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:07.999  READ LOG EXT
  60 00 40 00 60 00 02 8e d0 96 20 40 08 26d+00:18:03.505  READ FPDMA QUEUED

Error 3 [2] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 40 00 60 00 02 8e d0 96 20 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  60 00 08 00 58 00 02 4c 9d c3 c8 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  60 00 50 00 50 00 01 dd 8c 4f 28 40 08 26d+00:18:03.505  READ FPDMA QUEUED
  2f 00 00 00 01 00 00 00 00 00 10 40 08 26d+00:18:03.503  READ LOG EXT
  61 00 10 00 40 00 01 f3 8a ae 48 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED

Error 2 [1] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: WP at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 10 00 40 00 01 f3 8a ae 48 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  61 00 40 00 38 00 01 f3 8a 1b 18 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  61 00 40 00 30 00 01 f3 8a 1a c0 40 08 26d+00:17:59.020  WRITE FPDMA QUEUED
  60 00 40 00 28 00 02 8e d0 96 20 40 08 26d+00:17:59.020  READ FPDMA QUEUED
  60 00 08 00 20 00 02 4c 9d c3 c8 40 08 26d+00:17:59.019  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 26360 hours (1098 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 dd 8c 4f 70 40 00  Error: UNC at LBA = 0x1dd8c4f70 = 8011927408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 d8 00 01 dd 8c 60 f0 40 08 26d+00:17:54.568  READ FPDMA QUEUED
  60 00 50 00 d0 00 01 dd 8c 4f 28 40 08 26d+00:17:54.562  READ FPDMA QUEUED
  60 00 38 00 c8 00 01 dd 8c 4d 20 40 08 26d+00:17:54.558  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 dd 8b ff 20 40 08 26d+00:17:54.552  READ FPDMA QUEUED
  60 00 08 00 b8 00 01 dd 8b fb 38 40 08 26d+00:17:54.546  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     26378         4303484304
# 2  Extended offline    Completed without error       00%       127         -
# 3  Extended offline    Completed without error       00%        13         -
# 4  Extended offline    Aborted by host               90%         0         -
# 5  Conveyance offline  Completed without error       00%         0         -
# 6  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    34 Celsius
Power Cycle Min/Max Temperature:     20/38 Celsius
Lifetime    Min/Max Temperature:     20/46 Celsius
Under/Over Temperature Limit Count:   0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (407)

Index    Estimated Time   Temperature Celsius
 408    2021-06-16 00:36    35  ****************
 ...    ..(171 skipped).    ..  ****************
 102    2021-06-16 03:28    35  ****************
 103    2021-06-16 03:29    34  ***************
 ...    ..( 69 skipped).    ..  ***************
 173    2021-06-16 04:39    34  ***************
 174    2021-06-16 04:40    35  ****************
 ...    ..(232 skipped).    ..  ****************
 407    2021-06-16 08:33    35  ****************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4              31  ---  Lifetime Power-On Resets
0x01  0x010  4           26388  ---  Power-on Hours
0x01  0x018  6     71923078276  ---  Logical Sectors Written
0x01  0x020  6      1819607711  ---  Number of Write Commands
0x01  0x028  6    535156219835  ---  Logical Sectors Read
0x01  0x030  6      2580368784  ---  Number of Read Commands
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           26206  ---  Spindle Motor Power-on Hours
0x03  0x010  4           26193  ---  Head Flying Hours
0x03  0x018  4             573  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4             479  ---  Read Recovery Attempts
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               5  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              34  ---  Current Temperature
0x05  0x010  1              35  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              46  ---  Highest Temperature
0x05  0x028  1              20  ---  Lowest Temperature
0x05  0x030  1              43  ---  Highest Average Short Term Temperature
0x05  0x038  1              21  ---  Lowest Average Short Term Temperature
0x05  0x040  1              40  ---  Highest Average Long Term Temperature
0x05  0x048  1              26  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             168  ---  Number of Hardware Resets
0x06  0x010  4              62  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
Index                LBA    Hours
    0         4303484304        -
    1         4303484305        -
    2         4303484306        -
    3         4303484307        -
    4         4303484308        -
    5         4303484309        -
    6         4303484310        -
    7         4303484311        -
    8         4303484312        -
    9         4303484313        -
   10         4303484314        -
   11         4303484315        -
   12         4303484316        -
   13         4303484317        -
   14         4303484318        -
   15         4303484319        -
   16         4321384408        -
   17         4321384409        -
   18         4321384410        -
   19         4321384411        -
   20         4321384412        -
   21         4321384413        -
   22         4321384414        -
   23         4321384415        -
   24         4321384424        -
   25         4321384425        -
   26         4321384426        -
   27         4321384427        -
   28         4321384428        -
   29         4321384429        -
   30         4321384430        -
... (937 entries not shown)

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4     15231590  Vendor specific

Ok. I ran the test and it failed. Should I assume a broken disk and replace it or something else?

When SMART can’t complete the test with the error shown in the output the disk is dying and should be replaced. If it was aborted by host there might be other things at work, but here it’s the disk.

Remember to do proper burn-in on the replacement

When SMART can’t complete the test with the error shown in the output the disk is dying and should be replaced. If it was aborted by host there might be other things at work, but here it’s the disk.

Remember to do proper burn-in on the replacement

Thank you. I RMAd the disk and currently doing burn-in on the new disk.

Источник

0

2

Кроха сын к отцу пришел и спросила кроха: «Папа, это хорошо или очень плохо?»

# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.0-12-server] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST1000DM003-9YN162
Serial Number:    Z1D0L2FY
LU WWN Device Id: 5 000c50 03fcee1af
Firmware Version: CC4B
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri May  4 10:09:06 2012 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  575) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 108) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   100   006    Pre-fail  Always       -       235420362
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   057   030    Pre-fail  Always       -       8622211739
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       997
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       24
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   066   045    Old_age   Always       -       26 (Min/Max 23/34)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0022   026   040   000    Old_age   Always       -       26 (0 21 0 0)
197 Current_Pending_Sector  0x0012   100   090   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   090   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       107253923316683
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       57499686168628
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       86599643247192

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  15d+04:39:26.598  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  15d+04:39:26.598  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00  15d+04:39:26.598  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  15d+04:39:26.598  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+04:39:26.598  SET FEATURES [Set transfer mode]

Error 17 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  15d+04:39:23.672  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  15d+04:39:23.672  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00  15d+04:39:23.672  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  15d+04:39:23.672  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+04:39:23.672  SET FEATURES [Set transfer mode]

Error 16 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  15d+04:39:20.721  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  15d+04:39:20.721  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00  15d+04:39:20.721  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  15d+04:39:20.721  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+04:39:20.721  SET FEATURES [Set transfer mode]

Error 15 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  15d+04:39:17.770  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  15d+04:39:17.770  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00  15d+04:39:17.770  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  15d+04:39:17.770  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+04:39:17.770  SET FEATURES [Set transfer mode]

Error 14 occurred at disk power-on lifetime: 961 hours (40 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  15d+04:39:14.811  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  15d+04:39:14.811  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 e0 00  15d+04:39:14.811  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00  15d+04:39:14.811  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+04:39:14.810  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Источник

И снова здравствуйте. Перевод следующей статьи подготовлен специально для студентов курса «Администратор Linux». Поехали!

Что такое S.M.A.R.T.?

S.M.A.R.T. (расшифровывается как Self-Monitoring, Analysis, and Reporting Technology) – это технология, вшитая в накопители, такие как жесткие диски или SSD. Ее основная задача – это мониторинг состояния.

На деле, S.M.A.R.T. контролирует несколько параметров во время обычной работы с диском. Он мониторит такие параметры как количество ошибок чтения, время запуска диска и даже состояние окружающей среды. Помимо этого, S.M.A.R.T. также может проводить тесты с использованием накопителя.

В идеале, S.M.A.R.T. позволит прогнозировать предсказуемые отказы, такие как отказы, вызванные механическим износом или ухудшением состояния поверхности диска, а также непредсказуемые отказы, вызванные каким-либо неожиданным дефектом. Поскольку обычно диски не выходят из строя внезапно, S.M.A.R.T. помогает операционной системе или системному администратору идентифицировать те диски, которые скоро выйдут из строя, чтобы их можно было заменить и избежать потери данных.

Что не относится к S.M.A.R.T.?

Все это, конечно, круто. Однако S.M.A.R.T. – это не хрустальный шар. Он не может спрогнозировать отказ со стопроцентной вероятностью и не может гарантировать, что накопитель не выйдет из строя без предупреждения. В лучшем случае S.M.A.R.T. стоит использовать для оценки вероятности поломки.

Учитывая статистический характер прогнозирования отказов, технология S.M.A.R.T. особенно интересует компании, использующие большое количество устройств для хранения данных. Чтобы выяснить, насколько точно S.M.A.R.T. может прогнозировать отказы и сообщать о необходимости замены дисков в центрах обработки данных или серверных мейнфреймах, даже проводились специальные исследования.

В 2016 году Microsoft и университет штата Пенсильвания провели исследование, связанное с SSD.

Согласно этому исследованию, некоторые атрибуты S.M.A.R.T. считаются хорошими индикаторами неизбежности отказа. В особенности в статье упоминаются:

Счетчик переназначенных (Realloc) секторов:

Несмотря на то, что основополагающие технологии радикально отличаются, этот показатель остается востребованным как в мире SSD, так и в мире жестких дисков. Стоит отметить, что из-за особенностей алгоритмов балансировки износа, используемых в SSD, когда несколько секторов выходят из строя, то с большой вероятностью можно предположить, что скоро выйдут из строя еще больше.

Ошибки в цикле Program/Erase (P/E):

Это признак проблем с основным оборудованием флеш-памяти, связанных с тем, что диск не может удалить данные из блока или сохранить их там. Дело в том, что процесс производства несовершенен, поэтому появление таких ошибок вполне можно ожидать. Однако флеш-память имеет ограниченное число циклов записи/удаления. По этой причине внезапное увеличение числа событий может сигнализировать о том, что диск достигает своего предела, и вполне ожидаемо, что другие ячейки памяти также начнут выходить из строя.

CRC и неисправимые ошибки («Data Error ”):

События такого типа могут быть вызваны ошибками хранения, либо проблемами с внутренним каналом связи накопителя. Этот индикатор учитывает как исправленные ошибки (без проблем сообщенные хост-системе), так и неисправленные ошибки (из-за которых происходит блокировка диска, сообщившего хост-системе о невозможности чтения). Другими словами, исправляемые ошибки невидимы для операционной системы, тем не менее они влияют на производительность накопителя, увеличивая вероятность переназначения сектора.

SATA downshift count:

Из-за временных помех, проблем с каналом связи между накопителем и хостом или из-за внутренних проблем с накопителем, интерфейс SATA может переключиться на более низкую скорость передачи сигналов. Снижение скорости соединения ниже номинального уровня оказывает очевидное влияние на производительность диска. Таким образом, этот показатель является наиболее значимым, в особенности, когда он коррелирует с наличием одного или нескольких предыдущих показателей.

Согласно исследованию, 62% вышедших из строя SSD показали наличие как минимум одного из вышеприведенных симптомов. С другой стороны можно сказать, что 38% изученных накопителей сломались без индикации этих симптомов. В исследованиях не упоминалось, были ли какие-то еще сообщения об отказах от S. M. A. R. T. по другим «симптомам». По этой причине нельзя напрямую сопоставить эти значения с отказом без предупреждения в 36% случаев из статьи от Google.

В исследовании Microsoft и университета штата Пенсильвания не раскрывались модели исследуемых дисков, однако, по словам авторов, большинство дисков поступают от одного и того же поставщика в течение уже нескольких поколений.

В ходе исследования также были отмечены значительные различия в надёжности между различными моделями. Например, «худшая» изученная модель показывает двадцатипроцентную частоту отказов через 9 месяцев после первой ошибки переназначения и до 36-ти процентов отказов в течение 9 месяцев после первого появления ошибок данных. «Худшей» моделью было названо более старое поколение дисков, рассматриваемых в статье.

С другой стороны, с теми же симптомами, что приведены выше, накопители нового поколения отказали в 3% и 20% в соответствии с теми же ошибками. Трудно сказать, можно ли объяснить эти цифры улучшением конструкции накопителя и производственного процесса, или здесь роль играет эффект устаревания накопителя.

Самое интересное, что упоминается в статье (я уже писал об этом ранее), так это то, что увеличение количества зарегистрированных ошибок может случить тревожным индикатором:

«Существует большая вероятность появления симптомов, предшествующих отказу SSD, которые активно себя проявляют и быстро прогрессируют, сильно сокращая время жизни накопителя до нескольких месяцев.»

Другими словами, одна случайная ошибка, о которой сообщил S.M.A.R.T., определенно не должна рассматриваться как сигнал о неизбежном отказе. Однако, когда исправный SSD начинает сообщать о все большем количестве ошибок, следует ждать краткосрочного или среднесрочного сбоя.

Но как узнать, в каком состоянии сейчас ваш SSD? Для удовлетворения своего любопытства, либо из желания начать внимательно следить за своими накопителями, вы можете использовать инструмент мониторинга smartctl.

Использование `smartctl` для мониторинга состояния вашего SSD в Linux

Чтобы следить за S.M.A.R.T статусом вашего диска, я предлагаю использовать инструмент smartctl, который является частью пакета smartmontool (по крайней мере на Debian/Ubuntu).

sudo apt install smartmontools

smartctl – это инструмент командной строки, но это особенно помогает в случаях, когда вам нужно автоматизировать сбор данных, например, с ваших серверов.

Первый шаг в использовании smartctl – это проверка того, есть ли на вашем диске S.M.A.R.T. и поддерживается ли он инструментом:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:54:43 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Как видите, мой внутренний жесткий диск ноутбука действительно поддерживает S.M.A.R.T. и он включен. Итак, как теперь получить S.M.A.R.T статус? Есть ли какие-то зафиксированные ошибки?

Выдача отчета «о всей S.M.A.R.T. информации о диске» — это опция -a:

sh$ sudo smartctl -i -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:56:58 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 110) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       29694249
  3 Spin_Up_Time            0x0003   100   098   085    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5413
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   096   037   020    Old_age   Always       -       4836
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   072   072   000    Old_age   Always       -       28
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       4295033738
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   042   045    Old_age   Always   In_the_past 44 (Min/Max 21/44 #22)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       184
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       104
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       395415
194 Temperature_Celsius     0x0022   044   058   000    Old_age   Always       -       44 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   050   045   000    Old_age   Always       -       29694249
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       25131 (246 202 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3028413736
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1613088055
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.579  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.571  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00      00:45:12.543  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:45:09.456  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:09.451  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:45:09.450  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.878  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.856  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 21131 hours (880 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      05:52:18.809  READ FPDMA QUEUED
  61 00 00 7e fb 31 45 00      05:52:18.806  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:52:18.571  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      05:52:18.529  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      05:52:18.527  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10904         -
# 2  Short offline       Completed without error       00%        12         -
# 3  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Понимание выходных данных команд `smartctl`

На выходе получается много информации, которую не всегда легко понять. Наиболее интересной, вероятно, является та часть, которая помечена как “Vendor Specific SMART Attributes with Thresholds”. Она сообщает различные статистические данные, собранные S.M.A.R.T. устройством, и позволяет сравнить эти значения (текущие или худшие за все время) с некоторым порогом, определенным поставщиком.

Например, вот мои отчеты о переназначенных секторах на диске:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3

Вы можете заметить атрибут «Pre-fail». Он означает, что значение является аномальным. Таким образом, если значение превышает пороговое, велика вероятность сбоя. Другая категория »Old_age» используется для атрибутов, отвечающих значениям «нормального износа».

Последнее поле (здесь со значением «3») соответствует исходному значению атрибута, которое сообщает диск. Обычно это число имеет физическое значение. Здесь это фактическое количество переназначенных секторов. Для других атрибутов это может быть температура в градусах Цельсия, время в часах или минутах или количество раз, когда для диска было выполнено определенное условие.

В дополнение к исходному значению, диск с поддержкой S.M.A.R.T. должен сообщать «нормализованные значения» (значения полей, самые худшие и пороговые). Эти значения нормируются в диапазоне 1-254 (0-255 для пороговых значений). Прошивка диска выполняет эту нормализацию с помощью некоторого внутреннего алгоритма. Кроме того, разные производители могут нормализовать один и тот же атрибут по-разному. Большинство значений представлены в процентах, причем чем выше, тем лучше, но так бывает не всегда. Когда параметр ниже или равен пороговому значению, указанному производителем, диск считается неисправным в терминах этого атрибута. Помня о всех указаниях из первой части статьи, когда атрибут, показывающий ранее значение “pre-fail” все-таки дал сбой, наиболее вероятно, что скоро диск выйдет из строя.

В качестве второго примера возьмем “seek error rate”:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327

На самом деле (и это основная проблема отчетности S.M.A.R.T.), точное значение полей каждого атрибута понимает только поставщик. В моем случае Seagate использует логарифмическую шкалу для нормализации значения. Таким образом, «71» означает примерно одну ошибку на 10 миллионов запросов (10 в степени 7,1). Забавно, что самым худшим показателем за все время была одна ошибка на 1 миллион запросов (10 в 6-й степени).

Если я правильно понимаю, то это значит, что головки моего диска сейчас расположены точнее, чем раньше. Я не следил за этим диском внимательно, поэтому анализирую полученные данные весьма субъективно. Возможно накопитель просто надо было немного «обкатать» с тех пор как он был введен в эксплуатацию? Или может быть это следствие механического износа деталей и, следовательно, теперь имеет место меньшая сила трения? В любом случае, какова бы ни была причина, это значение является скорее показателем производительности, чем ранним предупреждением об ошибке. Так что меня оно не сильно беспокоит.

Помимо вышеприведенного и трех крайне подозрительных ошибок, записанных около шести месяцев назад, этот диск находится в удивительно хорошем состоянии (по данным S.M.A.R.T.) для стокового диска ноутбука, проработавшего более 1100 дней (26423 часа).

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423

Из любопытства я провел этот же тест на гораздо более новом ноутбуке, оснащенном SSD:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA THNSNK256GVN8
Serial Number:    17FS131LTNLV
LU WWN Device Id: 5 00080d 9109b2ceb
Firmware Version: K8XA4103
User Capacity:    256 060 514 304 bytes [256 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 13 01:03:23 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Первое, что бросается в глаза, так это то, что несмотря на наличие S.M.A.R.T., устройства нет в базе данных smartctl. Но это не помешает инструменту собирать данные с SSD, однако он не сможет сообщить точные значения различных атрибутов, специфичных для поставщика:

sh$ sudo smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  11) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
  8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       171
 10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       105
166 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       100
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       0
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0023   063   032   020    Pre-fail  Always       -       37 (Min/Max 11/68)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Выше вы видите выходные данные абсолютно нового SSD. Данные понятны даже в случае отсутствия нормализации или метаинформации для данных конкретного поставщика, как в моем случае с “Unknown_SSD_Attribute.” Я могу только надеяться, что в последующих версиях smartctl в базе данных появятся данные об этой модели диска, и я смогу лучше определять потенциальные проблемы.

Проверьте свой SSD в Linux с помощью smartctl

До сих пор мы рассматривали данные, собранные во время нормальной работы накопителя. Однако протокол S.M.A.R.T. также поддерживает несколько команд для автономного тестирования для запуска диагностики по требованию.

Автономное тестирование может проводиться во время обычных операций с диском, если не было указано иное. Поскольку тест и запросы ввода-вывода хоста будут конкурировать, производительность диска упадет на время теста. Спецификация S.M.A.R.T. определяет несколько видов автономного тестирования:

Короткое автономное тестирование (-t short)
Такой тест проверит электрическую и механическую, производительность, а также производительность чтения диска. Короткое автономное тестирование обычно занимает всего несколько минут (обычно от 2 до 10).

Расширенное автономное тестирование (-t long)
Этот тест занимает почти в два раза больше времени. Как правило, это просто более детальная версия короткого автономного тестирования. Кроме того, этот тест будет сканировать всю поверхность диска на наличие ошибок данных без ограничения по времени. Продолжительность теста будет пропорциональна размеру диска.

Транспортировочное автономное тестирование (-t conveyance)
Этот тестовый набор предложен в качестве сравнительно быстрого способа проверки на возможные повреждения, возникшие во время транспортировки устройства.

Вот примеры, взятые с тех же дисков, что были выше. Я предлагаю вам угадать, где какой:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2018

Use smartctl -X to abort test.

Сейчас производится проверка. Давайте дождемся завершения, чтобы посмотреть результат:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       171         -

Проведем тот же тест на другом диске:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 21:59:39 2018

Use smartctl -X to abort test.

И еще раз, отправим в сон на две минуты и посмотрим результат:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     26429         -
# 2  Short offline       Completed without error       00%     10904         -
# 3  Short offline       Completed without error       00%        12         -
# 4  Short offline       Completed without error       00%         0         -

Интересно, что в этом случае мы видим, что производители диска и компьютера, похоже, уже тестировали диск (на времени жизни в 0 часов и 12 часов). Я сам определенно был гораздо менее озабочен состоянием диска, чем они. Итак, поскольку я уже показал быстрые тесты, то и расширенный тоже запущу, чтобы посмотреть как это происходит.

sh$ sudo smartctl -t long /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 110 minutes for test to complete.
Test will complete after Tue Mar 13 00:09:08 2018

Use smartctl -X to abort test.

Судя по всему на этот раз ждать придется гораздо дольше, чем при проведении короткого теста. Так что давайте посмотрим:

sh$ sudo bash -c 'sleep $((110*60)) && smartctl -l selftest /dev/sdb'
[sudo] password for sylvain:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       20%     26430         810665229
# 2  Short offline       Completed without error       00%     26429         -
# 3  Short offline       Completed without error       00%     10904         -
# 4  Short offline       Completed without error       00%        12         -
# 5  Short offline       Completed without error       00%         0         -

В последнем тесте обратите внимание на различие в результатах, полученных с помощью короткого и расширенного теста, даже если они были выполнены один за другим. Ну, возможно, этот диск не в таком уж и хорошем состоянии! Отмечу, что тест остановился после первой ошибки чтения. Поэтому, если вы хотите получить исчерпывающую информацию обо всех ошибках чтения, вам придется продолжать тест после каждой ошибки. Я призываю вас взглянуть на одну очень хорошо написанную страницу руководства smartctl(8) для получения дополнительной информации о параметрах -t select, N-max и -t select, чтобы уметь делать так:

sh$ sudo smartctl -t select,810665230-max /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN         STARTING_LBA           ENDING_LBA
   0            810665230            976773167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed without error       00%     26432         -
# 2  Extended offline    Completed: read failure       20%     26430         810665229
# 3  Short offline       Completed without error       00%     26429         -
# 4  Short offline       Completed without error       00%     10904         -
# 5  Short offline       Completed without error       00%        12         -
# 6  Short offline       Completed without error       00%         0         -

Заключение

Определенно, S.M.A.R.T. – это именно та технология, которую стоит добавить в свой инструментарий для мониторинга работоспособности дисков ваших серверов. Вам также стоит взглянуть на S.M.A.R.T. Disk Monitoring Daemon smartd(8), который может помочь вам автоматизировать мониторинг с помощью отчетов системного журнала.

Учитывая статистическую природу прогнозирования сбоев, я не уверен, что агрессивный S.M.A.R.T. мониторинг будет сильно полезен на персональных компьютерах. Помните, что каким бы ни был накопитель, однажды он все равно выйдет из строя – и, как мы видели ранее, в одной трети случаев он сделает это без предупреждения. Поэтому ничто не обеспечит целостность ваших данных лучше, чем RAID технология и резервные копии!

До встречи на курсе, друзья!

Источник

SMART Error Log Version: 1
ATA Error Count: 34 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It «wraps» after 49.710 days.

Error 34 occurred at disk power-on lifetime: 17821 hours (742 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.