CAM status: Uncorrectable parity/CRC error - Fixing errors and finding optimal solutions to problems

Should I be worried about this message I see on the dmesg console?

Code:

(ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada0:ahcich0:0:0:0): Retrying command
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 28 01 00 40 00 00 00 00 00

I read in a forum that it might have something to do with how my drive is seated in its enclosure, so I reseated the drive and the error went away for a while. But it has returned. The system seems to work fine; shall I just reseat the drive again and blame it on dodgy hardware, or is it something more serious?

It is just a single-drive FreeBSD 10.0-RELEASE system, as follows:

Code:

# camcontrol devlist
<HGST HTS721010A9E630 JB0OA3B0>    at scbus0 target 0 lun 0 (ada0,pass0)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus1 target 0 lun 0 (pass1,ses0)

Are there any diagnostic scans I could do?

I’ve never studied the CAM library before, so I’m a bit nervous, especially when the man page says:

Novice users should stay away from this utility.

Any help muchas appreciated.

Cheers

SirDice

zzatskl said:

Are there any diagnostic scans I could do?

Yes, you can install sysutils/smartmontools and run smartctl(8) to get the drive’s parameters. It may be getting bad sectors and would need to be replaced if that’s the case.

Thread Starter
#3

SirDice said:

It may be getting bad sectors and would need to be replaced if that’s the case.

Thanks for getting back to me. Just when things were going well! This is the output from smartctl:

Code:

# smartctl -a /dev/ada0 > smart.out
# cat smart.out
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JG40006EG6Y3UC
LU WWN Device Id: 5 000cca 6a6c3278d
Firmware Version: JB0OA3B0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 24 15:49:35 2014 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   45) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 179) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   181   181   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       54
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       335
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       63712
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 11/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 330 hours (13 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 20 b8 9e 00 00  Error: ICRC, ABRT at LBA = 0x00009eb8 = 40632

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 60 28 01 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 30 58 a8 9e 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 08 58 68 99 ec 40 00   8d+01:10:37.562  WRITE FPDMA QUEUED
  61 40 58 e8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED
  61 40 50 a8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 38 a1 02 02  Error: ICRC, ABRT at LBA = 0x0202a138 = 33726776

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 d0 e8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c8 a8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c0 68 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b8 28 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b0 e8 06 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 282 hours (11 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 b8 ca 10 09  Error: ICRC, ABRT at LBA = 0x0910cab8 = 152095416

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 e8 a8 7a 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 28 e0 68 55 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d8 a8 0d 19 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d0 a8 ca 10 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 08 d0 28 01 00 40 00   6d+01:33:51.552  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 271 hours (11 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 d8 72 1a 09  Error: ICRC, ABRT at LBA = 0x091a72d8 = 152728280

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 c8 a8 72 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 38 c0 a8 52 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b8 a8 0d 19 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b0 e8 c1 10 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 08 b0 28 01 00 40 00   5d+13:53:49.033  WRITE FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 147 hours (6 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 e7 fc eb 07  Error: ICRC, ABRT at LBA = 0x07ebfce7 = 132906215

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 10 e8 c2 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 08 e8 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 00 28 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f8 e8 ba aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f0 28 b6 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I need to read up on what all the output means. But please can you let me know if there is an obvious quick fix, or whether I should just replace the drive (it’s under warranty)?

Cheers
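
For anyone reading the error log above by hand, the LBA that smartctl prints can be rebuilt from the register row. What follows is a minimal Python sketch, not part of smartctl: the name decode_error_row and the ERROR_BITS table are made up for illustration, and it assumes the 28-bit layout in which SN, CL and CH hold LBA bits 0 to 23 and the low nibble of DH holds bits 24 to 27.

```python
# Sketch: rebuild the LBA and error flags from one smartctl error-log register row.
# Assumption: 28-bit addressing, with LBA bits 24-27 in the low nibble of DH.

# Error register bits decoded here (only a subset of the ATA error register).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error_row(row):
    """row is the hex text 'ER ST SC SN CL CH DH', e.g. '84 51 20 b8 9e 00 00'."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split()[:7])
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    return lba, flags


for row in ("84 51 20 b8 9e 00 00", "84 51 30 38 a1 02 02"):
    lba, flags = decode_error_row(row)
    print(row, "->", ", ".join(flags), "at LBA", lba)
```

For the two rows shown it gives ICRC, ABRT at LBA 40632 and 33726776, matching the log. ICRC is the interface CRC bit, so these errors were detected on the link while data was being transferred, not while reading the platters.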

wblock@

The most recent error was only five hours ago. Run a short test first: smartctl -tshort /dev/ada0. If that completes without errors, run a long test: smartctl -tlong /dev/ada0. Both can be monitored with smartctl -a; the status of the test is near the top and the results are near the end.

If either test fails, time to replace the drive.
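
The "five hours ago" figure comes from subtracting the error log's power-on lifetime from the current Power_On_Hours attribute (335 - 330). Here is a small Python sketch of that arithmetic, assuming the smartctl -a text has been saved to a string. hours_since_errors is a hypothetical helper, not a smartctl feature, and SAMPLE is a trimmed excerpt of the output posted earlier.

```python
import re

# Trimmed excerpt of the smartctl -a output from this thread.
SAMPLE = """\
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       335
Error 21 occurred at disk power-on lifetime: 330 hours (13 days + 18 hours)
Error 20 occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
Error 17 occurred at disk power-on lifetime: 147 hours (6 days + 3 hours)
"""


def hours_since_errors(smart_text):
    """Return (error number, hours elapsed since it) in the order the log lists them."""
    power_on_hours = None
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "Power_On_Hours":
            power_on_hours = int(fields[9])
    if power_on_hours is None:
        raise ValueError("Power_On_Hours attribute not found")
    errors = re.findall(
        r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours", smart_text)
    return [(int(num), power_on_hours - int(hours)) for num, hours in errors]


print(hours_since_errors(SAMPLE))
```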

Thread Starter
#5

Thanks for the diagnostic advice.

Please can you recommend reference sites to read up on this tool? So far I’m reading

http://sourceforge.net/apps/trac/smartmontools/wiki

and

http://en.wikipedia.org/wiki/Self-Monitoring,_Analysis,_and_Reporting_Technology

This is the full output of the short test:

Code:

 # smartctl -tshort /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Feb 24 16:21:18 2014

Use smartctl -X to abort test.

 # cat smartshort.out
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JG40006EG6Y3UC
LU WWN Device Id: 5 000cca 6a6c3278d
Firmware Version: JB0OA3B0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 24 16:24:15 2014 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   45) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 179) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   181   181   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       54
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       336
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       63880
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 11/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 330 hours (13 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 20 b8 9e 00 00  Error: ICRC, ABRT at LBA = 0x00009eb8 = 40632

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 60 28 01 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 30 58 a8 9e 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 08 58 68 99 ec 40 00   8d+01:10:37.562  WRITE FPDMA QUEUED
  61 40 58 e8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED
  61 40 50 a8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 38 a1 02 02  Error: ICRC, ABRT at LBA = 0x0202a138 = 33726776

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 d0 e8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c8 a8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c0 68 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b8 28 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b0 e8 06 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 282 hours (11 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 b8 ca 10 09  Error: ICRC, ABRT at LBA = 0x0910cab8 = 152095416

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 e8 a8 7a 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 28 e0 68 55 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d8 a8 0d 19 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d0 a8 ca 10 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 08 d0 28 01 00 40 00   6d+01:33:51.552  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 271 hours (11 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 d8 72 1a 09  Error: ICRC, ABRT at LBA = 0x091a72d8 = 152728280

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 c8 a8 72 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 38 c0 a8 52 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b8 a8 0d 19 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b0 e8 c1 10 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 08 b0 28 01 00 40 00   5d+13:53:49.033  WRITE FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 147 hours (6 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 e7 fc eb 07  Error: ICRC, ABRT at LBA = 0x07ebfce7 = 132906215

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 10 e8 c2 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 08 e8 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 00 28 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f8 e8 ba aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f0 28 b6 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       335         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The line at the end:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       335         -

is worrying. Does it mean the (new) drive is on its deathbed and has just 335 hours (13 days) of life left?

The long test states 179 minutes to wait; I’ll post the results later.

I’ve learnt something today and will put smartd_enable="YES" in my /etc/rc.conf file to monitor the health of drives from now on. Please can you let me know if this is good practice? And if you put some lines in /etc/periodic.conf to email a daily (or weekly) drive status report, what are they?

Thanks once again for good advice.

SirDice

zzatskl said:
The line at the end:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       335         -
is worrying. Does it mean the (new) drive is on its deathbed and has just 335 hours (13 days) of life left?

You’re reading it wrong. The "LifeTime(hours)" column is how long the drive had been running when the test was logged: a total of 335 hours. "Remaining" is the percentage of the test still to run; 00% remaining means it’s done.
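
To make the column meanings concrete, here is a rough Python sketch that splits one self-test log row into named fields. The regular expression and the field names (such as power_on_hours_at_test) are assumptions chosen for this example, based only on the row layout shown in this thread; they are not an official smartctl format description.

```python
import re

# Num, description, status, remaining %, lifetime hours, LBA of first error.
SELFTEST_ROW = re.compile(
    r"#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)")


def parse_selftest_row(line):
    """Return the row as a dict, or None if the line is not a self-test row."""
    m = SELFTEST_ROW.search(line)
    if m is None:
        return None
    num, description, status, remaining, lifetime, lba = m.groups()
    return {
        "num": int(num),
        "description": description,
        "status": status,
        # Share of the test still to run, not remaining drive life.
        "remaining_percent": int(remaining),
        # Power-on hours of the drive when the test was logged.
        "power_on_hours_at_test": int(lifetime),
        "lba_of_first_error": None if lba == "-" else int(lba),
    }


row = parse_selftest_row(
    "# 1  Short offline       Completed without error       00%       335         -")
print(row["status"], "| remaining:", row["remaining_percent"], "% | ran at",
      row["power_on_hours_at_test"], "power-on hours")
```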

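The errors may be caused by bad cables: every error in the log is an ICRC (interface CRC) error, which is detected on the link between the controller and the drive. Replace the cables and see if that makes the errors go away.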
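
The cable theory can be checked against the attribute table: on this drive the media-related counters (5, 197, 198) are all zero while 199 UDMA_CRC_Error_Count is 21. The Python sketch below encodes that rule of thumb. parse_attributes and triage are illustrative names, the "media", "link" and "clean" labels are invented here, raw values are assumed to start with a plain integer, and the rule is a heuristic and not a diagnosis.

```python
# Excerpt of the attribute table posted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 11/50)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
"""

MEDIA_IDS = (5, 197, 198)  # reallocated, pending, offline-uncorrectable sectors
CRC_ID = 199               # CRC errors seen on the interface


def parse_attributes(table_text):
    """Map attribute ID -> (name, raw value as int) for each table row."""
    attrs = {}
    for line in table_text.splitlines():
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit():
            # The raw column may carry extra text, e.g. '41 (Min/Max 11/50)'.
            attrs[int(parts[0])] = (parts[1], int(parts[9].split()[0]))
    return attrs


def triage(attrs):
    """Heuristic only: 'media' if sector counters are non-zero, else 'link' or 'clean'."""
    media = sum(attrs[i][1] for i in MEDIA_IDS if i in attrs)
    crc = attrs.get(CRC_ID, ("", 0))[1]
    if media > 0:
        return "media"
    if crc > 0:
        return "link"
    return "clean"


print(triage(parse_attributes(SAMPLE)))
```

With the values from this thread it prints link, which fits with reseating or replacing the cable (or the enclosure connector) before giving up on the drive.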

Thread Starter
#8

Full output of long test:

Code:

# smartctl -tlong /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 179 minutes for test to complete.
Test will complete after Mon Feb 24 19:24:03 2014

 # cat smartlong.out
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JG40006EG6Y3UC
LU WWN Device Id: 5 000cca 6a6c3278d
Firmware Version: JB0OA3B0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 24 19:53:09 2014 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   45) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 179) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   181   181   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       54
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       339
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       63916
194 Temperature_Celsius     0x0002   122   122   000    Old_age   Always       -       49 (Min/Max 11/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 330 hours (13 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 20 b8 9e 00 00  Error: ICRC, ABRT at LBA = 0x00009eb8 = 40632

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 60 28 01 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 30 58 a8 9e 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 08 58 68 99 ec 40 00   8d+01:10:37.562  WRITE FPDMA QUEUED
  61 40 58 e8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED
  61 40 50 a8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 38 a1 02 02  Error: ICRC, ABRT at LBA = 0x0202a138 = 33726776

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 d0 e8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c8 a8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c0 68 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b8 28 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b0 e8 06 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 282 hours (11 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 b8 ca 10 09  Error: ICRC, ABRT at LBA = 0x0910cab8 = 152095416

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 e8 a8 7a 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 28 e0 68 55 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d8 a8 0d 19 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d0 a8 ca 10 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 08 d0 28 01 00 40 00   6d+01:33:51.552  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 271 hours (11 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 d8 72 1a 09  Error: ICRC, ABRT at LBA = 0x091a72d8 = 152728280

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 c8 a8 72 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 38 c0 a8 52 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b8 a8 0d 19 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b0 e8 c1 10 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 08 b0 28 01 00 40 00   5d+13:53:49.033  WRITE FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 147 hours (6 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 e7 fc eb 07  Error: ICRC, ABRT at LBA = 0x07ebfce7 = 132906215

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 10 e8 c2 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 08 e8 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 00 28 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f8 e8 ba aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f0 28 b6 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       338         -
# 2  Short offline       Completed without error       00%       335         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Seems to have completed without errors:

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       338         -
# 2  Short offline       Completed without error       00%       335         -

kpa said:

The errors may be caused by bad cables. Replace the cables and see if that makes the errors go away.

I’ll have a look at the cables next.

Thread Starter
#9

I’ve opened up the Zotac box:

I removed the hard drive and reseated it, taking care to seat it at a slight angle as in the instructions.

The errors displayed on the monitor have now gone. Results from running smartctl again:

Code:

# cat smartshort.out
smartctl 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JG40006EG6Y3UC
LU WWN Device Id: 5 000cca 6a6c3278d
Firmware Version: JB0OA3B0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan  1 00:35:05 2012 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   45) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 179) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   181   181   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       340
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       63924
194 Temperature_Celsius     0x0002   130   130   000    Old_age   Always       -       46 (Min/Max 11/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       21
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 330 hours (13 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 20 b8 9e 00 00  Error: ICRC, ABRT at LBA = 0x00009eb8 = 40632

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 60 28 01 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 30 58 a8 9e 00 40 00   8d+01:10:38.615  WRITE FPDMA QUEUED
  61 08 58 68 99 ec 40 00   8d+01:10:37.562  WRITE FPDMA QUEUED
  61 40 58 e8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED
  61 40 50 a8 79 16 40 00   8d+01:10:36.540  WRITE FPDMA QUEUED

Error 20 occurred at disk power-on lifetime: 329 hours (13 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 38 a1 02 02  Error: ICRC, ABRT at LBA = 0x0202a138 = 33726776

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 d0 e8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c8 a8 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 c0 68 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b8 28 07 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED
  61 40 b0 e8 06 78 40 00   8d+00:07:04.420  WRITE FPDMA QUEUED

Error 19 occurred at disk power-on lifetime: 282 hours (11 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 30 b8 ca 10 09  Error: ICRC, ABRT at LBA = 0x0910cab8 = 152095416

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 e8 a8 7a 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 28 e0 68 55 1a 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d8 a8 0d 19 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 40 d0 a8 ca 10 40 00   6d+01:34:08.812  WRITE FPDMA QUEUED
  61 08 d0 28 01 00 40 00   6d+01:33:51.552  WRITE FPDMA QUEUED

Error 18 occurred at disk power-on lifetime: 271 hours (11 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 10 d8 72 1a 09  Error: ICRC, ABRT at LBA = 0x091a72d8 = 152728280

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 c8 a8 72 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 38 c0 a8 52 1a 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b8 a8 0d 19 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 40 b0 e8 c1 10 40 00   5d+13:54:09.338  WRITE FPDMA QUEUED
  61 08 b0 28 01 00 40 00   5d+13:53:49.033  WRITE FPDMA QUEUED

Error 17 occurred at disk power-on lifetime: 147 hours (6 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 01 e7 fc eb 07  Error: ICRC, ABRT at LBA = 0x07ebfce7 = 132906215

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 40 10 e8 c2 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 08 e8 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 00 28 bb aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f8 e8 ba aa 40 00      09:58:33.655  WRITE FPDMA QUEUED
  61 40 f0 28 b6 aa 40 00      09:58:33.655  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       340         -
# 2  Short offline       Completed without error       00%       340         -
# 3  Extended offline    Completed without error       00%       338         -
# 4  Short offline       Completed without error       00%       335         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I need to read up to interpret the results. I’d be grateful for a quick translation though?

Cheers.
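
For readers who want to check the error log by hand, here is a minimal sketch of how one register line from the output above can be decoded. The function names are made up for this illustration, and only a few error-register bits are listed. ICRC is the interface CRC bit, which points at the link between drive and controller rather than at the platters; that fits the UDMA_CRC_Error_Count of 21. The LBA is assembled from the 28-bit register set that the version 1 error log stores.

```python
# A few ATA error-register bits, using the names smartctl prints.
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error during a UDMA transfer
    (0x40, "UNC"),   # uncorrectable data error
    (0x10, "IDNF"),  # requested sector ID not found
    (0x04, "ABRT"),  # command aborted
]


def decode_error(er):
    """Return the names of the listed bits that are set in the error register."""
    return [name for mask, name in ERROR_BITS if er & mask]


def decode_lba28(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from SN, CL, CH and the low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_register_line(line):
    """Parse a line such as '84 51 20 b8 9e 00 00' (ER ST SC SN CL CH DH)."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    return decode_error(er), decode_lba28(sn, cl, ch, dh)


flags, lba = parse_register_line("84 51 20 b8 9e 00 00")
print(flags, lba)  # ['ICRC', 'ABRT'] 40632
```

The result matches the "Error: ICRC, ABRT at LBA = 0x00009eb8 = 40632" line that smartctl printed for error 21.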

wblock@

It seems okay. The fields to watch are Reallocated_Sector_Ct and Current_Pending_Sector, which indicate bad sector remapping. The Load_Cycle_Count looks high. Generally, notebook drives repark their heads more than they should, adding to wear. Vendors have a somewhat arbitrary opinion of what the fields mean, so that may not be a simple count of how many times it parked. If it is, 63924 load cycles in 340 power-on hours would mean about three times a minute for as long as it’s been in use. See sysutils/ataidle to disable that, if the drive permits it.

smartd(8) can be configured to alert you when the drive has a problem. Important note: the sample config file installed in /usr/local/etc/smartd.conf does not do that by default; it must be edited, and the daemon enabled in /etc/rc.conf.
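
As a rough check of the head-parking rate, the raw values from the smartctl output above (Load_Cycle_Count 63924, Power_On_Hours 340) can be divided out. This sketch assumes the raw value really is a plain count of load cycles, which the vendor does not guarantee; the function name is made up for the example.

```python
def cycles_per_hour(load_cycle_count, power_on_hours):
    """Average number of head load cycles per power-on hour."""
    return load_cycle_count / power_on_hours


rate = cycles_per_hour(63924, 340)
print(f"{rate:.0f} per hour, {rate / 60:.1f} per minute")
# 188 per hour, 3.1 per minute
```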

Thread Starter
#11

wblock said:

smartd(8) can be configured to alert you when the drive has a problem.

Thanks for the tip. For all novices like me: in my /usr/local/etc/smartd.conf file I put:

Code:

/dev/ada0 -a -o on -S on -s (S/../.././02|L/../../6/03)

which according to the man page should «Monitor all attributes, enable automatic online data collection, automatic Attribute autosave, and start a short self-test every day between 2-3am, and a long self test Saturdays between 3-4am» which sounds just right for my little Zotac box.
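
To see why that `-s` expression gives those times, here is a small sketch of the matching. smartd documents the schedule as a regular expression tested against strings of the form T/MM/DD/d/HH, where T is the test type and d is the day of the week (1 = Monday to 7 = Sunday). The helper below is an illustration only: it uses Python's `re` module rather than smartd's own regex engine, and requiring a full match is an assumption of this sketch.

```python
import re
from datetime import datetime

# The schedule expression from the smartd.conf line above.
SCHEDULE = r"(S/../.././02|L/../../6/03)"


def due_tests(when, schedule=SCHEDULE):
    """Return the test types ('L' long, 'S' short) scheduled for this hour."""
    due = []
    for test_type in ("L", "S"):
        # Build the T/MM/DD/d/HH string; isoweekday() is 1=Monday .. 7=Sunday.
        stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, stamp):
            due.append(test_type)
    return due


print(due_tests(datetime(2014, 3, 8, 3, 30)))   # a Saturday, 03:30
print(due_tests(datetime(2014, 3, 8, 2, 15)))   # a Saturday, 02:15
print(due_tests(datetime(2014, 3, 10, 3, 0)))   # a Monday, 03:00
```

The first call reports the long test, the second the short test, and the third nothing, which agrees with the man page description quoted above.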

I’ve added a line to /etc/rc.conf to enable the daemon (smartd_enable="YES" is the usual setting)

and added this line to /etc/periodic.conf

Code:

daily_status_smart_devices="/dev/ada0"

I started smartd with # service smartd start

I checked smartd was running OK with # smartd -q onecheck and this was the successful output:

Code:

smartd 6.2 2013-07-26 r3841 [FreeBSD 10.0-RELEASE amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Opened configuration file /usr/local/etc/smartd.conf
Configuration file /usr/local/etc/smartd.conf parsed.
Device: /dev/ada0, opened
Device: /dev/ada0, HGST HTS721010A9E630, S/N:JG40006EG6Y3UC, WWN:5-000cca-6a6c3278d, FW:JB0OA3B0, 1.00 TB
Device: /dev/ada0, found in smartd database: HGST Travelstar 7K1000
Device: /dev/ada0, enabled SMART Attribute Autosave.
Device: /dev/ada0, enabled SMART Automatic Offline Testing.
Device: /dev/ada0, is SMART capable. Adding to "monitor" list.
Monitoring 1 ATA and 0 SCSI devices
Device: /dev/ada0, opened ATA device
Device: /dev/ada0, previous self-test completed without error
Started with '-q onecheck' option. All devices sucessfully checked once.
smartd is exiting (exit status 0)

In summary, I blame the Zotac drive enclosure for the error messages. I think its quality may be a bit unreliable, which reminds me that I should register the warranty.

It’s good to learn.

Cheers

Source

undefined wrote: In the case you wrote about, two fans? Are you joking? Have you never worked with a large number of disks? Never seen a file server?

And who said there are only two in there now? Two come with the case from the factory. There are four 120 mm fans in there; the two additional ones sit in the 5.25" bays directly in front of the disks. All the drive cages get good airflow.

undefined wrote: Although SATA 6 drives are supposed to negotiate the link speed with the SATA controller without problems, it does not always succeed.

I don’t follow. The SATA 6 disks hang off the SATA 6 controller and the SATA 3 disks off the chipset one, and the speeds are all fine there.

Now, about ada2 with its off-the-charts Seek_Error_Rate.
This is how it looked on the first scan:
surface

speed graph

The disk made knocking sounds on the red and brown blocks.

And this is how it looked after the read test (yes, a read test; a screenshot of the test sequence window is below):
surface (the two orange blocks are there because I pressed GET SMART during the test)

speed graph

All the trouble was in the first third of the disk, so there is no point in testing the remaining, healthy two thirds a second time.

Test execution log: the read test, run twice.

SMART after the double read pass (unfortunately I didn’t manage to take a screenshot beforehand, but attribute 67 was failing there)

Then I put the disk back in the server and ran fsck -y /dev/ufsid/disk_id, which cheerfully reported «could not determine filesystem». Next, fsck -y -t ufs /dev/ufsid/disk_id reported: File system marked clean
All the files are in place; I rehashed about 100 GB with no errors.

I checked ada3 and ada7 with Victoria: both have a perfect surface, without even any green blocks (<100ms).

But the joy didn’t last long. I booted the server, it ran for about 20 minutes, then errors started pouring in on ada3, it went into a kernel panic, and after the reboot it hung at the loader.

Meanwhile, there was some kind of hissing from the PSU. I’ll take it apart tomorrow; the capacitors have probably blown. It’s an old PSU already, probably about 7 years old.

Sent 58 minutes 40 seconds later:

55 degrees is nonsense for _cool-running_ Hitachi drives (it could be _called_ normal for Seagate)

55 degrees at peak is normal, in Hitachi’s opinion too. ))
deskstar:

Code:

Environmental (operating)
Ambient temperature  5 to 60 C

http://www.hgst.com/tech/techlib.nsf/te … 000_ds.pdf (official PDF)
http://www.nix.ru/autocatalog/hdd_ibm_h … 11994.html

ultrastar:

Code:

Environmental (operating)
Ambient temperature 5 to 60 C

http://www.hgst.com/tech/techlib.nsf/te … 000_ds.pdf (official PDF)
http://www.nix.ru/autocatalog/hdd_ibm_h … 21558.html

Now, as for the PSU, here are hard drive tests, including the 7K3000 Deskstar/Ultrastar: http://www.hardwareluxx.ru/index.php/ar … l?start=16 (it covers temperature too, by the way)
The Ultrastar’s peak power draw under load is 11 W and the Deskstar’s is 7.4 W, but let’s assume all 9 disks draw 11 W each (as if they were all rewriting something at the same time): 9*11 = 99 W. The Core Quad 9550 peaks at 95 W (and the CPU there is rarely more than 20% busy). Let’s allow another 100 W for the motherboard, fans and memory. In total that comes to 99 + 95 + 100 = 294 W, so a decent, healthy 650 W unit is enough there with headroom, more than twofold.
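
The same back-of-the-envelope budget can be written as a few lines of Python, which makes it easy to redo with other numbers. This is only a sketch of the estimate in the post above: the function name is made up, and the figures are the poster's own (9 disks at 11 W, a 95 W CPU, 100 W for everything else, a 650 W PSU).

```python
def power_budget(disk_count, watts_per_disk, cpu_watts, other_watts):
    """Sum the estimated peak draw of disks, CPU and everything else, in watts."""
    return disk_count * watts_per_disk + cpu_watts + other_watts


total = power_budget(disk_count=9, watts_per_disk=11, cpu_watts=95, other_watts=100)
headroom = 650 / total  # rated PSU output divided by the estimated peak draw
print(f"estimated peak: {total} W, headroom: {headroom:.2f}x")
# estimated peak: 294 W, headroom: 2.21x
```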

Source

No. Wrong. This is perfectly possible if the bad sector was never accessed by the system (the pool was never full). Do you have anything to back the claim that the sector should show in the attributes if it was never accessed? Manufacturer documentation, spec sheet, ATA standard, … (Trust me, I checked.)

The information you need is not provided by the manufacturer to consumers. I have inside info on WD drives and I have some data for Seagates from circa 2008 (I can’t vouch for whether they have changed since). I’m not a fan of Seagates because they do very unusual things with their SMART attributes. But I do have a hard drive that has never had a partition table put on it, has never been used for data storage, and has only had short tests performed on it, and it has a value of over 10000 for Current Pending Sector Count. That count went above zero only a minute after a short test had been started.

The short/long tests do not update SMART attributes; the result is logged only in the self-test error log. Quoting from the smartctl man page:
«short — [ATA] runs SMART Short Self Test (usually under ten minutes). This command can be given during normal system operation (unless run in captive mode — see the ‘-C’ option below). This is a test in a different category than the immediate or automatic offline tests. The «Self» tests check the electrical and mechanical performance as well as the read performance of the disk. Their results are reported in the Self Test Error Log, readable with the ‘-l selftest’ option.» (http://smartmontools.sourceforge.net/man/smartctl.8.html)

That says nothing about what the manufacturer does for a short test, though (or a long test, for that matter). Some drives (one of my old SSDs comes to mind) take less than 3 seconds to complete a short test, and the manufacturer has said it actually performs no test at all. They didn’t disable SMART tests because doing so may cause problems with some software and/or hardware configurations that automatically run SMART tests on a schedule, so the drive simply reports that the SMART test passed. They said that if something was wrong with the drive it would fail completely or fail the diagnostic on bootup. Of course, that’s not too useful: I’d prefer that it not run the test at all and give me an error saying the test isn’t supported (my Intel SSD doesn’t have a Conveyance test option and returns an error if you try to run one). The same goes for a long test on that drive, unfortunately. I was tipped off that something was horribly wrong when I tried to do a long test and a 64GB drive reported that it had passed 3 seconds later.

Only the immediate offline test updates the attributes. Quoting the documentation again:
«offline — [ATA] runs SMART Immediate Offline Test. This immediately starts the test described above. This command can be given during normal system operation. The effects of this test are visible only in that it updates the SMART Attribute values, and if errors are found they will appear in the SMART error log, visible with the ‘-l error’ option.»

My issue is that you are quoting smartctl. smartctl only tells the drive to run a test. What the test actually performs is totally up to the manufacturer. They can do no test at all (such as the SSD I mentioned above), or they can do an exhaustive test of every single component on the drive. The choice is theirs, and they aren’t about to tell you what a particular SMART test does or doesn’t do. In general, a long test is supposed to be the short test plus a total surface scan of all user areas. (I’ll discuss this more below.)

You also mention the reallocated sector count. If it is a hard unrecoverable error, reallocation will not happen on a read operation, as there is nothing to reallocate (the sector is unreadable). The drive will keep the sector as is, hoping that it may be able to read it later. The reallocation will definitely happen only when you try to write into that sector. See: ATA drive is failing self-tests, but SMART health status is ‘PASSED’. What’s going on?

Yeah, I’ve read that link before. They aren’t explaining some small details. If a sector is labeled as UNC it should be reflected in the «Current_Pending_Sector_Count» value. If you do happen to write to those sectors they will be marked «bad», your newly written data will then be written to the spare sectors, Reallocated_Sector_Ct and/or Reallocated_Event_Count may increment, you may see Raw_Read_Error_Rate go up, and life goes on. The whole reason it behaves this way is to let a RAID controller accept that the sector is bad and regenerate the missing data from parity (if it exists). If parity doesn’t exist, well, the data was already lost, so it really doesn’t matter whether the loss is reported when the problem is found or on your next read. But it’s better to allow a possible RAID controller (or software backup) to restore the bad data. I have a drive that I just RMAed that did exactly that. I’m somewhat disappointed, because I requested the RMA after a long SMART test found problems, and then did a scrub just before the disk replacement (which lowered Current_Pending_Sector to zero, while Reallocated_Event_Count and Reallocated_Sector_Ct went above zero).

You have to keep in mind that the health status is, from what I understand, based solely on whether the «Value» is worse than the «Threshold».

Here’s one of my disks..

Code:

 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   147   141   021    Pre-fail  Always       -       9650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1017
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27129
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       473
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       403
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4225
194 Temperature_Celsius     0x0022   118   101   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Everything looks good.

Now, if I changed a line, say Reallocated_Sector_Ct had a value of 139 or lower (note that the «THRESH» is 140), then I’d expect that the drive would have said:

SMART overall-health self-assessment test result: FAILED (or something to that effect). Then you get the BIOS warnings on bootup and all sorts of other nastiness. Of course, at that point, you are probably in deep doo doo and if you don’t have a backup you’ve probably lost significant data. Not always, but usually. One RMAed drive I had gave me a BIOS warning on first powerup despite everything else being fine. I could test the drive all day long and have no errors. I still called Seagate and they sent me a second RMA in the same week.

Now check out all the attributes whose THRESH is zero. Those will never ever trip SMART failure warnings. Kind of crappy if you ask me, because the Current_Pending_Sector count seems to often be my first indication that a drive is failing (for WDs; for Seagates it seems to be Multi_Zone_Error_Rate, in my experience). And no matter how many sectors fail, the value will never go «below» zero. So unless you look at the actual attributes and interpret them, your disk could be losing data left and right and you might not know it until the drive is practically dead. Me personally, I want to know the second a disk starts having problems of any kind, not just when reallocation events reach the THRESH value.
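
The threshold logic can be sketched in a few lines. The smartctl documentation describes an attribute as failed when its normalized value is less than or equal to its threshold, so under that rule a value equal to THRESH already counts as failed. The helper names are made up for this example, the (VALUE, THRESH) pairs are copied from the table above, and treating normalized values as never dropping below 1 is an assumption taken from the same documentation.

```python
def attribute_failed(value, thresh):
    """An attribute has failed once VALUE is less than or equal to THRESH."""
    return value <= thresh


def can_ever_fail(thresh):
    """With normalized values assumed to stay at 1 or above, THRESH 0 never trips."""
    return thresh > 0


# (VALUE, THRESH) pairs copied from the attribute table above.
ATTRS = {
    "Reallocated_Sector_Ct": (200, 140),
    "Current_Pending_Sector": (200, 0),
    "Multi_Zone_Error_Rate": (200, 0),
}

for name, (value, thresh) in ATTRS.items():
    print(f"{name}: failed={attribute_failed(value, thresh)}, "
          f"can_ever_fail={can_ever_fail(thresh)}")
```

Only Reallocated_Sector_Ct can ever trip the overall health status here; the two attributes with THRESH 000 have to be watched through their raw values instead.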

Now let me muddy the waters. Reallocated_Sector_Ct was 200, right? And we said that when it hits 139 it will trigger the nasty SMART Failed message. That doesn’t mean that if you reallocate 61 sectors it will trigger the warning. Each integer increment might be linear, might be exponential, or each increment might be 10k sectors. Which one is it for your disks? How about my disks? Are they even the same? Are you sure about that?

I do perform regular long tests on my drives twice a month, on the 7th and 21st. Typically, if the RAW_VALUE suggests the disk isn’t in perfect health, a long test has always failed for me. Whether that’s luck or the way my batch was made, I don’t know. But the important thing is that a failed short or long test gives you the opportunity to RMA the drive. I’d always RMA a drive that fails a short or long test. Regardless of what it tests (or doesn’t test), the outcome is the same: a manufacturer’s SMART test shouldn’t ever fail. Luckily for hard disk manufacturers, I’ve never seen a Windows desktop that had SMART tests run on it regularly aside from ones I’ve set up myself, so the average user is ignorant of any indication of failure, often until it is significant (and frequently self-evident). I try to stay proactive with my server disks: I use RAIDZ3, and I replace them at the first sign of problems.

Your first link is similar to this thread, but Seagate had major firmware issues in 2009 (the main reason why I stopped using them after buying and recommending only Seagate for more than 10 years). If I buy $2k worth of drives that turn into paperweights within 90 days because they can’t perform their function, and then they pass every test you throw at them (hence I don’t qualify for an RMA), don’t expect me to buy more of them. (I switched to WD at that point and I’ve had good luck with them so far.) Most of those drives are still in the 20-drive box I put them in, because I can’t trust them to store data without randomly disconnecting from any system they are put in. Of course, my issue is unrelated to the issue we are discussing. I really can’t explain why a short test would fail while a long test would pass. The SMART spec used to say that a long test required all the short tests plus the full surface scan, but I don’t know if that has changed or not. But the short tests are typically things like a controller diagnostic (done on POST, and failure is often evident because the disk disconnects from the host), a check for bad RAM cache on the drive (often evident because you’ll start seeing all sorts of nasty behavior as things get corrupted, zpool status will identify the corrupted data, etc.), and a short seek test (if the drive is having problems seeking, you’d again know before you did a test that something was very wrong; it would probably show up in zpool status, etc.).

I won’t comment on the WD conspiracy theory.

And I don’t blame you. The non-standard (based on the information I have) use of the SMART data worries me, but it could be a failing drive that is having other problems, so it’s not a big deal. Right now we have no solid ground to claim anything is actually awry (except that the short test is failing, and that is definitely bad). But its behavior is not what I’d expect, which is why I’ll just keep an eye on the forums for future failing WD Red disks and see how they behave. It might be a fluke or it might not. I’m not going to go rushing out with a conspiracy theory based on a single disk’s issues.

————————————

We’ve gotten way off topic on this. The reality is that the OP’s hard drive clearly has something wrong with it. Regardless of whether this is normal behavior for SMART, normal behavior for WD drives, etc., the fact remains that it should definitely be RMAed (or at least not relied on long-term). I think we both agree on that. As for what is or isn’t normal behavior, that seems to be the manufacturers’ secret sauce and not something they’ll discuss with the public.


Hey all,

I just recently picked up a Vnopn J3160 system to replace my previous Untangle virtual firewall appliance. I’ve installed 4GB of RAM and an older 128GB mSATA Samsung SSD that I had available.

I’ve been trying to evaluate OPNsense as it seems like a good combination of the things I like about Untangle (reporting, UI & ease of use) but also has the extra power and control that pfSense offers. I originally had pfSense installed and running for about a week to iron out any hardware issues. I installed OPNsense 19.7 last night but ran into an issue that has me stumped.

Installation appeared to go through without any issue, and I was able to complete the initial configuration wizard. After updating the system to the latest kernel and packages and rebooting, though, the console starts to report a bunch of CAM status: Uncorrectable parity/CRC error messages on ada0. Based on a little Google-fu, people have solved this by reseating or replacing SATA cables or enclosures, so I removed, cleaned and reinstalled the SSD in its slot, but the errors still returned.

The strangest part, though, is that if I shut down or reboot the system once it starts reporting the errors, it will not boot from the SSD anymore: the BIOS/UEFI will not accept it as valid boot media and I need to reinstall via USB again, almost as if the install has been corrupted. I was able to run a quick and an extended SMART test on the drive, and both reported no errors.
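Since SMART self-tests run inside the drive, a clean result does not necessarily rule out errors on the link between drive and controller. Here is a minimal sketch for checking both sides. It assumes the drive exposes the commonly used attribute 199 (usually named UDMA_CRC_Error_Count or CRC_Error_Count) in the standard ten-column `smartctl -A` table; the helper names `crc_error_count` and `cam_crc_errors` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the raw
# value of attribute 199 (interface CRC error count on many drives).
# Assumption: standard ten-column attribute table, RAW_VALUE in column 10.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Hypothetical helper: counts CAM CRC error lines in kernel messages on stdin.
cam_crc_errors() {
    grep -c 'CAM status: Uncorrectable parity/CRC error'
}
```

For example, `smartctl -A /dev/ada0 | crc_error_count` and `dmesg | cam_crc_errors`, run before and after a reboot. If both numbers keep climbing together, that would point at the connection (slot, cable, controller) rather than at the flash itself, though it is only a hint and not a diagnosis.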

I will reinstall again tonight to grab screenshots of CAM error messages & SMART results and also test if the problems present themselves with a base install of 19.7 with no upgrades.

Has anyone else seen or heard of a problem like this?

Thanks

** UPDATE **

Looks like the faults might all come down to the SSD. Installed to a USB drive for testing and system is not reporting any issues. Will need to do some further testing to see if the drive is actually faulty or if something in OPNsense/FreeBSD just doesn’t like this model of SSD.

Have new SSD on the way.

« Last Edit: January 08, 2020, 09:03:29 am by MD389 »
