Smart error log version 1 - Fixing errors and finding optimal solutions to problems

From Unraid | Docs

Under construction and only partly usable so far; please report errors on the Talk page.

Disclaimer: this page is based on personal experience gained from examining numerous SMART reports, therefore it should not be considered authoritative. Accuracy however is highly desired, so please feel free to correct it as needed, or suggest corrections or question its statements on the associated Talk page.

1 Prologue
2 Introduction to SMART
3 SMART report structure
- 3.1 General information section
- 3.2 SMART overall health test
- 3.3 SMART parameters section
- 3.4 SMART attributes section
- 3.5 Error Log section
- 3.6 Test results section
4 Table of attributes
- 4.1 1 Raw_Read_Error_Rate
- 4.2 3 Spin_Up_Time
- 4.3 4 Start_Stop_Count
- 4.4 5 Reallocated_Sector_Ct
- 4.5 7 Seek_Error_Rate
- 4.6 9 Power_On_Hours
5 Additional info

Prologue

There is a lot of ignorance and misinformation out there about SMART reports, so this page is an effort to help users better understand their content.

Consider the following SMART report extract:

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       42208416
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   056   055   030    Pre-fail  Always       -       25772440425
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       72
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   048   045    Old_age   Always       -       43 (Min/Max 36/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   043   052   000    Old_age   Always       -       43 (0 28 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       260348032581703
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       423266408125
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       97907054046

Looks rather intimidating, doesn’t it, with huge scary numbers! But with a little knowledge from this page, you should be able to quickly say «That drive looks fine! A little warm though!»

Introduction to SMART

From SMART on Wikipedia, «S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system for computer hard disk drives to detect and report on various indicators of reliability, in the hope of anticipating failures.» It was a laudable effort by the drive manufacturers to provide standard ways to both report current drive parameters and status, and also indicate issues, especially those that might be predictive of imminent drive failure. Unfortunately, the standard had considerable ambiguity, and the various drive engineers have often differed greatly in their interpretations and implementations of both the common attributes, and the introduction of new attributes.

This page is primarily a guide to understanding SMART attributes in real-world usage. Unfortunately, they behave very inconsistently, not only from one attribute to another, but also between drive models, and especially between brands. In some cases the RAW_VALUE is the counter to watch; in others it is more important to watch what the VALUE does; and there are other behaviors too. To understand a particular attribute report line, you have to understand how that SMART attribute is usually handled, keeping in mind who the manufacturer is and, to a lesser extent, what drive model it is. You can try researching it online, but information is sparse, and there is nothing authoritative from the manufacturers themselves. The table of SMART attributes below should help, but every manufacturer uses a different set of SMART attributes and may use even the common ones in differing ways, sometimes across its own drive models.

There are many computer professionals with a very low opinion of SMART reporting, and they generally discount SMART reports, partly because of all the inconsistency, but also because many drives fail with no SMART warnings at all. I find that once you understand the inconsistencies, and keep some perspective, there is much that can still be learned. For example, the Seek_Error_Rate (a critical attribute) on Seagate drives generally starts and stays in the mid 50s to high 60s (attribute values generally start at 100 and drop to 1). Not knowing this, you might immediately think there is a serious issue with your new Seagate drive. But now that you do know this, you won’t be concerned until it drops into the low 50s or below. The same Seek_Error_Rate value on any other brand would be immediately concerning. Hopefully the table below will help you understand what ‘normal’ looks like for the different attributes on different drives by different makers.

SMART report structure

Each section below includes an example of that section. It’s just an example; yours may differ greatly.

General information section

Identifying information for the SMART program and the drive — its model, serial number, firmware, capacity/size, time of this report, and SMART support status

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST1500DL003-9VT16L
Serial Number:    5YD3D71H
Firmware Version: CC32
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 18 16:11:43 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall health test

Basic overall health test of the drive, with only 2 possible results: PASSED or FAILED.
If the test result is FAILED, the SMART firmware believes that the drive is in imminent danger of catastrophic failure, so it is imperative to copy off ALL important data. Usually it is best to copy off the most important files first, then the next most important, and so on, because the drive may quit completely before you finish copying.

SMART overall-health self-assessment test result: PASSED
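
When working with saved reports, the overall health line can be picked out with a few lines of Python. This is only a minimal sketch: the function name `overall_health` is hypothetical, and it assumes the report contains the line in the same form as the example above (some smartctl versions may append an exclamation mark to FAILED, which the sketch strips).

```python
import re

# Matches the overall-health line as printed in the example report above.
HEALTH_RE = re.compile(
    r"^SMART overall-health self-assessment test result:\s*(\S+)",
    re.MULTILINE,
)


def overall_health(report_text):
    """Return 'PASSED', 'FAILED', or None if the line is not present."""
    match = HEALTH_RE.search(report_text)
    if match is None:
        return None
    # Strip a trailing '!' so callers can compare against plain 'FAILED'.
    return match.group(1).rstrip("!")


sample = "SMART overall-health self-assessment test result: PASSED\n"
print(overall_health(sample))  # PASSED
```

A result of None means the line was not found at all, which is worth treating differently from PASSED.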

SMART parameters section

These are generally of little interest to us.
They do include the recommended polling time for the short and long self-tests; in other words, after starting a test, don’t check for its results any sooner than this recommendation.
- Unfortunately the original standard must have stipulated using a single byte to store the polling times, which caps their maximum value at 255. That makes the ‘Extended self-test’ (the long test) polling time of 255 rather useless.
I have seen a case where an unusually long ‘Total time to complete Offline data collection’ for one unusually slow drive was the only indication of a faulty drive. The SMART reports for other drives that were exactly the same model had essentially identical SMART reports, with no issues, except for the difference in this parameter.

Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 623) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.

SMART attributes section

This is the table of SMART attributes for this drive. The columns are described below the example. Yours may greatly differ from this example, as some drives report more attributes, and some drives report considerably fewer. The newest drives often introduce new attributes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   100   006    Pre-fail  Always       -       32796080
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       265367
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       19
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   069   045    Old_age   Always       -       30 (Lifetime Min/Max 26/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a   037   029   000    Old_age   Always       -       32796080
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       172868138696723
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2919100768
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       572998840

Column 1 is the attribute number, usually a decimal number between 1 and 255. Some SMART tools report it in hex, from 01 to FF. These are relatively standard IDs, except that different manufacturers will occasionally introduce a new one, unused by anyone else. Generally, the only ones you can count on seeing are: 1, 3, 4, 5, 7, 9, 10, 187, 190 or 194, 193, 195, 197, 198, and 199.
Column 2 is the relatively standardized attribute name. There are a few that seem only used by a single manufacturer.
Column 3 is the attribute handling flag, of no interest to us — ignore it.
Column 4 is the VALUE, one of the most important values in the table. It is stored in a single byte on the drive for each SMART attribute, so its range is from 0 to 255.
- However, the values of 0, 254, and 255 are reserved for internal use, so you never see them.
- The value of 253 usually means «Not Used Yet», so when you see it, you are probably looking at a brand new drive. Sometimes, though, a few attributes take a while before they are used, so they may stay at 253 for longer.
- VALUE is almost always used as a normalized scale of perfectly good to perfectly bad, usually starting at VALUE=100, then dropping toward a worst case of VALUE=1. You can generally think of it as representing a scale starting at 100% good, then slowly dropping until failure at some predetermined percentage number, in the THRESHOLD column.
- Someone realized that if the values only run from 100 to 1, then they are wasting the possible values from 101 to 252, so some SMART programmers have decided to stretch the scale for certain attributes to start at 200 instead of 100, providing twice the data points. Unfortunately, which attributes are scaled from 200 to 1 is completely inconsistent, with almost all SMART reports showing some attributes starting at 100, and other attributes starting at 200. In addition, there are a few Maxtor and Samsung drives that took the start of the scale all the way to 252 or 253! Above, you see all but 1 attribute using 100, the exception being attribute 199 which starts at 200. In general, you can think of 200-type scales as 100 times 2 (just divide the number by 2), and from now on, that is what we are going to do in most of the discussion.
- The temperature attributes 190 and 194 are exceptions to the scaling. They are either temperatures or forms of the temperature, and they don’t scale (their WORST value may look like it scales though).
- The error rate attributes 1 and 7 are also exceptions, although of a different kind. Raw read and seek errors are a natural part of normal operation, so even in a brand new and perfect drive, there is a factory-determined optimal rate of read and seek errors. They are nothing to worry about, they’re the natural result of temperature expansion and other things, and they are used to help the drive constantly recalibrate itself. But because these error rates are non-zero, you essentially cannot have a perfect error rate of zero that you declare is a VALUE of 100. So manufacturers determine what an optimal error rate should be and call it 100. But often, drives may achieve an error rate (especially when they are new) that is even better than the optimal one set by the manufacturer, which results in an error rate that is HIGHER than 100! For an example, see the VALUE above of attribute 1, the Raw_Read_Error_Rate. It’s as if the drive is performing at 111%!
Column 5 is WORST, the lowest VALUE ever recorded (except for a few unusual and uncommon cases).
- [incomplete]
Column 6 is THRESH, the manufacturer determined lowest value that WORST should be allowed to fall to, before reporting it as a FAILED quantity. Some are counters, some are informational such as temperature or hours used or
- [incomplete]
Column 7 is TYPE, the type of attribute. It can either be Pre-fail or Old_age.
- If it is Pre-fail, then the attribute is considered a critical attribute, one that participates in the overall SMART health assessment (PASSED/FAILED) of the drive. If the value of WORST falls below THRESH, then the drive FAILS the overall SMART health test, and complete failure may be imminent. The Pre-fail term means that if this attribute fails, then the drive is considered ‘about to fail’.
- If it is Old_age, then the attribute is considered (for SMART purposes) a noncritical attribute, one that does not fail the drive. The Old_age term means that the attribute is related to normal aging, normal wear and tear of the drive.
- When new attributes are introduced, they may seem like a critical item, perhaps even with an appropriate THRESH set. But if they are marked as Old_age, then they do NOT fail the drive, even if WORST falls below THRESH. Naturally, this could be highly concerning, but there is no authoritative interpretation available, so no definitive conclusions can be made. These attributes should be considered Experimental.
- [incomplete]
Column 8 is UPDATED. Supposedly, this indicates when the attribute is updated: Always or Offline. If Always, then it is assumed that the attribute is updated whenever a relevant event occurs. In other words, it is always ‘live’. If Offline, then supposedly the attribute is only updated when offline tests are being performed. But in real life, our experience is that these are inaccurate. Just look at the example above, at attributes 241 and 242. They appear to be live counters of LBAs read and written, yet the test section of that particular SMART report indicates that no offline tests have been performed!
Column 9 is WHEN_FAILED, usually and thankfully blank (shown as a dash)! If it is not blank, smartctl prints FAILING_NOW when the current VALUE is less than or equal to THRESH, or In_the_past when the current VALUE is above THRESH but WORST is less than or equal to it.
Column 10 is RAW_VALUE, a manufacturer controlled raw number, which may or may not be of interest to us. From now on, we will often shorten its name and refer to it only as ‘the RAW’.
- [incomplete]
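
The column layout described above can be sketched in Python as follows. This is an illustration only, not part of smartmontools: the names `Attribute`, `parse_attribute_line` and `threshold_status` are hypothetical, and the status rule assumes the definition in the smartctl man page (an attribute has failed when its normalized value is less than or equal to its threshold), which is slightly stricter than ‘falls below’.

```python
from collections import namedtuple

# One field per column of the attribute table, in report order.
Attribute = namedtuple(
    "Attribute",
    "attr_id name flag value worst thresh attr_type updated when_failed raw",
)


def parse_attribute_line(line):
    """Split one row of the attribute table into its ten columns."""
    # The first nine columns never contain spaces; RAW_VALUE can
    # (for example "43 (Min/Max 36/43)"), so it takes the rest of the line.
    parts = line.strip().split(None, 9)
    if len(parts) != 10 or not parts[0].isdigit():
        return None  # header line, blank line, or unrelated text
    return Attribute(
        attr_id=int(parts[0]), name=parts[1], flag=parts[2],
        value=int(parts[3]), worst=int(parts[4]), thresh=int(parts[5]),
        attr_type=parts[6], updated=parts[7], when_failed=parts[8],
        raw=parts[9],
    )


def threshold_status(attr):
    """Compare VALUE and WORST against THRESH (assumed rule: <= means failed)."""
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "-"


row = "  7 Seek_Error_Rate         0x000f   056   055   030    Pre-fail  Always       -       25772440425"
attr = parse_attribute_line(row)
print(attr.name, attr.value, attr.worst, attr.thresh, threshold_status(attr))
# Seek_Error_Rate 56 55 30 -
```

Note that the sketch deliberately ignores RAW_VALUE when judging an attribute: as discussed above, the huge raw number on this Seagate row says nothing by itself, while VALUE 56 against THRESH 30 is what matters.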

Error Log section

[incomplete]

SMART Error Log Version: 1
No Errors Logged

[incomplete, need example with errors]

Test results section

[incomplete]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[incomplete, need example with tests]

Table of attributes

For a fuller description of each attribute, please see Known ATA S.M.A.R.T. attributes on Wikipedia.
[incomplete]

1 Raw_Read_Error_Rate

This is an indicator of the current rate of errors of the low level physical sector read operations. In normal operation, there are ALWAYS a small number of errors when attempting to read sectors, but as long as the number remains small, there is NO issue with the drive. Error correction information and retry mechanisms are in place to catch and fix these errors. Manufacturers therefore determine an optimal level of errors for each drive model, and set up an appropriate scale for monitoring the current error rate. For example, if 3 errors per 1000 read operations seems near perfect to the manufacturer, then an error rate of 3 per 1000 ops might be set to an attribute VALUE of 100. If the rate increased to 10 per 1000, then the rate might be scaled to 80 (completely under manufacturer control, and NEVER revealed or explained to us!).
They are called Raw Reads to distinguish them from the more common term ‘read errors’, which represent a much higher level read operation. What we usually refer to as a ‘read error’ is an error returned by a read process, that has attempted a series of one or more seeks and raw reads, plus optional error corrections and retries. It either returns an indicator of total success plus the sector data (considered to be in perfect shape), or it returns an error code, and no sector data.
PLEASE completely ignore the RAW_VALUE number! Only Seagates report the raw value, which, yes, does appear to be the number of raw read errors, but it should be ignored completely. All other drives have raw read errors too, but do not report them, leaving this value at zero. To repeat, Seagates are not worse than other drives because they appear to have raw read errors; rather, they are the only ones that report the number. I suspect that others do not report the number to avoid a lot of confusion, and questions for their tech support people. Seagate leaves those of us who provide tech support the job of answering the constant questions about this number. Hopefully, now that you understand this, you will never bother a kind IT person with questions about the Raw_Read_Error_Rate RAW_VALUE again.
[incomplete?]
Critical attribute — if its WORST falls below its THRESH, then the drive will be considered FAILED

3 Spin_Up_Time

[incomplete]

4 Start_Stop_Count

[incomplete]

5 Reallocated_Sector_Ct

[incomplete]

7 Seek_Error_Rate

[incomplete]

9 Power_On_Hours

[incomplete]

[the most important part of this whole page is completely incomplete!]

Additional info

SMART is also known, more accurately, as S.M.A.R.T., or Self-Monitoring, Analysis and Reporting Technology.
Reference materials
- http://en.wikipedia.org/wiki/S.M.A.R.T. — all about S.M.A.R.T., from Wikipedia; recommended reading!
- http://en.wikipedia.org/wiki/S.M.A.R.T#Known_ATA_S.M.A.R.T._attributes — table of S.M.A.R.T. attributes, from Wikipedia
- http://www.linuxjournal.com/article/6983 — an excellent article on SMART and smartctl, from Linux Journal
- http://smartmontools.sourceforge.net/ — smartmontools Home Page
- http://smartmontools.sourceforge.net/faq.html — smartmontools FAQ Page
- http://smartmontools.sourceforge.net/man/smartctl.8.html — MAN Page for smartmontools
UnRAID related (some are marked << old >>, meaning some part may be obsolete or incompatible with current releases of UnRAID)
- http://lime-technology.com/forum/index.php?topic=13054.msg53337#msg53337 — keeping SMART values in perspective, and how to properly interpret them — a series of posts to help users alarmed by the very large numbers they find in a SMART report or ‘diff’
- http://lime-technology.com/forum/index.php?topic=2135.msg15733#msg15733 — a script for grabbing dated SMART reports for all drives
- << old >> FAQ#How_can_I_find_out_more_information_about_a_hard_drive.3F — intro to obtaining the SMART info for a drive
- << old >> FAQ#Why_is_a_temp_not_showing_for_a_drive.3F — enabling SMART so temps can be accessed and displayed
- Troubleshooting#Hard_drive_failures — has a section on smartctl commands for getting SMART reports, and running tests
- UnRAID_Add_Ons#UnMENU — the Disk Management plugin has buttons for SMART reports and tests
- http://lime-technology.com/forum/index.php?topic=2708 — the MyMain thread; an UnMENU plugin; after installing UnMENU, install this next; has a Smart View that provides color-coded SMART info for all drives
- SmartHistory — a tool for monitoring the SMART parameters of your drives, and provide reporting and notification of changes in SMART attributes; produces customizable reports, with graphing capabilities

[incomplete]

HOWTO read smartctl reports

ATA Disk Report

# smartctl -q noserial -a /dev/ada30
smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3030ALA630
LU WWN Device Id: 5 000cca 228c089f4
Firmware Version: MEAOA580
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Aug 31 13:37:32 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(37566) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       109
  3 Spin_Up_Time            0x0007   162   162   024    Pre-fail  Always       -       498 (Average 363)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7493
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       142
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       142
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 18/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 5
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:45.474  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:45.474  READ LOG EXT

Error 4 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:41.562  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:41.562  READ LOG EXT

Error 3 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:37.639  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:37.639  READ LOG EXT

Error 2 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:33.740  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:33.727  READ LOG EXT

Error 1 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 00 b8 ed d9 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:29.836  READ FPDMA QUEUED
  60 a0 a0 4d d9 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 20 a8 2d d9 3f 40 00      23:30:29.833  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours) LBA_of_first_error
# 1  Short offline       Completed without error       00%      7465         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART overall-health state

If the state changes from PASSED to FAILED, the disk's firmware has declared the device broken.
If the device is still under warranty, ask the vendor for a replacement.

Attributes Threshold Values

These are defined by the vendor.

Attributes Worst Value

Note that some vendors' firmware may actually increase the «Worst» value for some rate-type attributes.

Attributes Type

Note that if an Attribute is of type ‘Pre-fail’, it does not mean that your disk is about to fail!
It only has this meaning if the Attribute’s current Normalized value is less than or equal to the threshold value.

Column Updated

Some SMART attribute values are updated only during off-line data collection activities; these are labeled «Offline» in column «UPDATED».

Column «When Failed»

If the Attribute’s current «Normalized value» is less than or equal to the threshold value, then the attribute is marked with «FAILING_NOW» in column WHEN_FAILED.
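The rule above can be sketched as a small shell filter. This is only an illustration, assuming the ten-column attribute table that smartctl prints (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function name `flag_attributes` is hypothetical, and smartctl already fills in WHEN_FAILED itself, including «In_the_past» when only the «Worst» value has reached the threshold.

```sh
# flag_attributes: read a smartctl attribute table on stdin and print
# the status that the VALUE/WORST/THRESH columns imply for each row.
# (Hypothetical helper; smartctl reports the same thing in WHEN_FAILED.)
flag_attributes() {
    awk '
    # Attribute rows start with a numeric ID and have at least 10 fields.
    $1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (value <= thresh)      status = "FAILING_NOW"
        else if (worst <= thresh) status = "In_the_past"
        else                      status = "-"
        print $2, status
    }'
}

# Example with two sample rows; in practice you would pipe the
# attribute table of a real report into the function instead.
flag_attributes <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   042   045    Old_age   Always   In_the_past 44
EOF
```

The first sample row prints `-` because both VALUE and WORST are above THRESH; the second prints `In_the_past` because only WORST (042) is at or below THRESH (045).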

Raw Values

Please keep in mind that the conversion from RAW value to a quantity with physical units is not specified by the SMART standard!
smartctl only reports the different Attribute types, values, and thresholds as read from the device.
It does not carry out the conversion between «Raw» and «Normalized» values: this is done by the disk’s firmware.

In most cases, the values printed by smartctl are sensible.
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius.
However in some cases vendors use unusual conventions. For example the Hitachi disk on my laptop reports
its power-on hours in minutes, not hours. Some IBM disks track three temperatures rather than one,
in their raw values. Have a look at our wiki pages on topic SMART attributes.

UNCorrectable Error in Data

This refers to data which has been read from the disk, but for which the Error Checking and Correction (ECC) codes are inconsistent. In effect, this means that the data can not be read.
In the error log the Logical Block Address (LBA) at which the error occurred will be printed in base 16 and base 10.

Logical Block Address

The LBA is a linear address, which counts 512-byte sectors on the disk, starting from zero. (Because of the limitations of the SMART error log, if the LBA is greater than 0xfffffff, then either no error log entry will be made, or the error log entry will have an incorrect LBA. This may happen for drives with a capacity greater than 128 GiB or 137 GB.) For Linux systems the smartmontools web page has instructions about how to convert the LBA address to the name of the disk file containing the erroneous disk sector.
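As a worked example, the LBA in an error log entry can be rebuilt from the register bytes printed with it. The sketch below assumes 28-bit addressing, where the low four bits of DH are address bits 27..24, followed by CH, CL and SN, and it assumes 512-byte sectors for the byte offset. The function names are hypothetical.

```sh
# lba_from_registers SN CL CH DH
# Rebuild a 28-bit LBA from the hex register bytes shown in an error
# log entry. Only the low 4 bits of DH belong to the address.
lba_from_registers() {
    sn=$((0x$1)); cl=$((0x$2)); ch=$((0x$3)); dh=$((0x$4))
    echo $(( ((dh & 0x0f) << 24) | (ch << 16) | (cl << 8) | sn ))
}

# lba_to_byte_offset LBA
# Byte offset of a sector from the start of the disk (512-byte sectors).
lba_to_byte_offset() {
    echo $(( $1 * 512 ))
}

# Registers "SN CL CH DH" = "9b d9 3f 05", as in the UNC entries above.
lba=$(lba_from_registers 9b d9 3f 05)
printf 'LBA %d = 0x%08x, byte offset %d\n' "$lba" "$lba" "$(lba_to_byte_offset "$lba")"
# prints: LBA 88070555 = 0x053fd99b, byte offset 45092124160
```

An entry whose registers read `ff ff ff 0f` decodes to 268435455 (0x0fffffff), the largest value 28 bits can hold, which is a likely sign that the real LBA did not fit in the log entry.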


Smartmontools is a set of open source tools for checking disk health.

It can be used to check hard disks, SAS disks and SSDs, as well as disks behind RAID controllers such as HP Smart Array, LSI MegaRAID and Dell PERC.

How to install Smartmontools on CentOS

# yum install smartmontools

To install Smartmontools on Ubuntu

# sudo apt-get install smartmontools

Start and enable Smartmontools on start up

# systemctl start smartd
# systemctl enable smartd

Enable Smart Capability for the disk /dev/sda

# smartctl -s on /dev/sda

To disable Smart Capability for the disk /dev/sda

# smartctl -s off /dev/sda

Use Smartmontools on a regular drive or software RAID

# smartctl -i -a /dev/sda

Below is example output for an SSD drive

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LM480HCHP-00003
Serial Number:    S1YJNXAH102923
LU WWN Device Id: 5 002538 c40146fa4
Firmware Version: GXT3003Q
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 27 08:34:29 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
.........
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       29238
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       78
177 Wear_Leveling_Count     0x0013   092   092   005    Pre-fail  Always       -       543
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       2431
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   066   051   000    Old_age   Always       -       34
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
202 Exception_Mode_Status   0x0033   100   100   010    Pre-fail  Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       66
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       271275742255
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       73508579082
243 SATA_Downshift_Ct       0x0032   100   100   000    Old_age   Always       -       0
244 Thermal_Throttle_St     0x0032   100   100   000    Old_age   Always       -       0
245 Timed_Workld_Media_Wear 0x0032   100   100   000    Old_age   Always       -       65535
246 Timed_Workld_RdWr_Ratio 0x0032   100   100   000    Old_age   Always       -       65535
247 Timed_Workld_Timer      0x0032   100   100   000    Old_age   Always       -       65535
251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       565960926336

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     28745         -
# 2  Extended offline    Completed without error       00%     28634         -
# 3  Extended offline    Completed without error       00%     16189         -
# 4  Extended offline    Completed without error       00%      7545         -
# 5  Extended offline    Completed without error       00%      7531         -

On the Samsung SSD above, the normalized value of Wear_Leveling_Count is 092, which suggests that roughly 92% of the drive's rated life remains.

Power_On_Hours is 29238, which means the SSD has been powered on for 29238 hours (about 1,218 days).
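Converting power-on hours to days is plain integer arithmetic. A minimal sketch (the function name `hours_to_days` is hypothetical) that prints the result in the same "days + hours" form smartctl uses in its error log:

```sh
# hours_to_days HOURS
# Print a power-on time as whole days plus the leftover hours.
hours_to_days() {
    echo "$(( $1 / 24 )) days + $(( $1 % 24 )) hours"
}

hours_to_days 29238   # prints: 1218 days + 6 hours
```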

How to use Smartmontools on an HP Smart Array RAID controller

# smartctl -a -d cciss,0 /dev/sda
# smartctl -a -d cciss,1 /dev/sda

Example output for a SAS drive on an HP Smart Array RAID controller

# smartctl -a -d cciss,0 /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              EH0300FBQDD
Revision:             HPD2
Compliance:           SPC-3
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c5005952102f
Serial number:        6XN1RFAY0000B303B3TU
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Oct 27 10:56:56 2019 WIB
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        65 C

Manufactured in week 32 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  120
Elements in grown defect list: 76

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0       10         0     10687          0     251507.146           0
write:         0        0         0         0          0      53598.375           0
verify:        0        0         0         0          0       4474.826           0

Non-medium error count:      328

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   47569                 - [-   -    -]
# 2  Background short  Completed                   -      44                 - [-   -    -]
# 3  Background short  Completed                   -      40                 - [-   -    -]
# 4  Background long   Completed                   -       0                 - [-   -    -]

Long (extended) Self-test duration: 1860 seconds [31.0 minutes]

Testing SSD drive sdb on an HP RAID controller

# smartctl -a -d cciss,4 /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     VK0480GECQP
Serial Number:    S1KGNYAH241630
LU WWN Device Id: 5 002538 50037aa42
Firmware Version: HPG3
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Oct 27 10:59:02 2019 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 2100) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  35) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   002    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20928
173 Unknown_Attribute       0x0033   098   098   005    Pre-fail  Always       -       311
175 Program_Fail_Count_Chip 0x0033   100   100   001    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x003b   100   100   097    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   068   053   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0033   100   100   005    Pre-fail  Always       -       0
202 Unknown_SSD_Attribute   0x0033   100   100   010    Pre-fail  Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17877         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

How to use Smartmontools on an LSI MegaRAID SAS RAID controller (Dell PERC)

# smartctl -a -d megaraid,0 /dev/sdX

Smartmontools on LSI 3ware SATA RAID controller

# smartctl -a -d 3ware,0 /dev/twX

Smartmontools on Areca SATA[/SAS] RAID controller (disk numbers start at 1)

# smartctl -a -d areca,1 /dev/sgX

Smartmontools command line for an Adaptec SAS RAID controller

# smartctl -a -d aacraid,H,L,ID /dev/sdX
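When several disks sit behind one controller, the same command is repeated with a different disk index. The sketch below only prints the commands so they can be reviewed before being run. The function name `print_smart_cmds` and the disk count are assumptions; the sketch does not discover how many disks exist. It starts at index 0, which matches the cciss examples; some controller types number their disks differently, so check the smartctl man page.

```sh
# print_smart_cmds TYPE DEVICE COUNT
# Print (not run) one smartctl command per disk index 0..COUNT-1.
# TYPE is a -d prefix such as cciss or megaraid; COUNT must be supplied
# by the caller, since this sketch does not detect the disks itself.
print_smart_cmds() {
    i=0
    while [ "$i" -lt "$3" ]; do
        echo "smartctl -a -d $1,$i $2"
        i=$((i + 1))
    done
}

print_smart_cmds cciss /dev/sda 2
# prints:
#   smartctl -a -d cciss,0 /dev/sda
#   smartctl -a -d cciss,1 /dev/sda
```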

You can read more about Smartmontools on https://www.smartmontools.org


What is S.M.A.R.T.?

S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a technology embedded in storage devices such as hard disk drives and SSDs, whose goal is to monitor their health status.

In practice, S.M.A.R.T. monitors several disk parameters during normal drive operation, such as the number of read errors, the drive startup time, or even environmental conditions. Moreover, S.M.A.R.T. can also perform on-demand tests on the drive.

Ideally, S.M.A.R.T. would allow anticipating predictable failures such as those caused by mechanical wearing or degradation of the disk surface, as well as unpredictable failures caused by an unexpected defect. Since drives usually don’t fail abruptly, S.M.A.R.T. gives an option for the operating system or the system administrator to identify soon-to-fail drives so they can be replaced before any data loss occurs.

What isn’t S.M.A.R.T.?

All that seems wonderful. However, S.M.A.R.T. is not a crystal ball. It cannot predict a failure with 100% accuracy, nor can it guarantee that a drive will not fail without any early warning. At best, S.M.A.R.T. should be used to estimate the likelihood of a failure.

Given the statistical nature of failure prediction, S.M.A.R.T. technology is of particular interest to companies using a large number of storage units, and field studies have been conducted to estimate how accurately S.M.A.R.T. reported issues anticipate disk replacement needs in data centers or server farms.

In 2016, Microsoft and The Pennsylvania State University conducted a study focussing on SSDs.

According to that study, it appears some S.M.A.R.T. attributes are good indicators of imminent failure. The paper specifically mentions:

Reallocated (Realloc) Sector Count:

While the underlying technology is radically different, that indicator seems as significant in the SSD world as it was in the hard drive world. It is worth mentioning that, because of the wear-leveling algorithms used in SSDs, when several blocks start failing, chances are many more will fail soon.

Program/Erase (P/E) fail count:

This is a symptom of a problem with the underlying flash hardware, where the drive was unable to clear or store data in a block. Because of imperfections in the manufacturing process, a few such errors can be expected. However, flash memories have a limited number of clear/write cycles. So, once again, a sudden increase in the number of events might indicate the drive has reached its end-of-life limit, and we can anticipate many more memory cells failing soon.

CRC and Uncorrectable errors (“Data Error”):

These events can be caused either by storage errors or by issues with the drive’s internal communication link. This indicator takes into account both corrected errors (thus without any issue reported to the host system) and uncorrected errors (thus blocks the drive has reported to the host system as unreadable). In other words, correctable errors are invisible to the host operating system, but they nevertheless impact drive performance, since data has to be corrected by the drive firmware and a sector reallocation might occur.

SATA downshift count:

Because of temporary disturbances, issues with the communication link between the drive and the host, or internal drive issues, the SATA interface can switch to a lower signaling rate. Downgrading the link below the nominal link rate has an obvious impact on observed drive performance. Selecting a lower signaling rate is not uncommon, especially on older drives. So this indicator is most significant when correlated with the presence of one or several of the preceding ones.

According to the study, 62% of the failed SSDs showed at least one of the above symptoms. However, if you reverse that statement, it also means 38% of the failed SSDs gave out without showing any of the above symptoms. The study did not mention, though, whether the failed drives had exhibited any other S.M.A.R.T. reported failure. So this cannot be directly compared to the 36% failure-without-prior-notice figure mentioned for hard drives in the Google paper.

The Microsoft/Pennsylvania State University paper does not disclose the exact drive models studied, but according to the authors, most of the drives come from the same vendor and span several generations.

The study noticed significant differences in reliability between the different models. For example, the “worst” model studied exhibits a 20% failure rate nine months after the first reallocation error and up to a 36% failure rate nine months after the first occurrence of data errors. The “worst” model also happens to be the oldest drive generation studied in the paper.

On the other hand, the drives belonging to the youngest generation of devices show failure rates of only 3% and 20% respectively for the same errors. It is hard to tell whether those figures can be explained by improvements in drive design and manufacturing, or whether this is simply an effect of drive aging.

Most interestingly, and I gave some possible reasons earlier, the paper mentions that, rather than the raw value, it is a sudden increase in the number of reported errors that should be considered an alarming indicator:

“There is a higher likelihood of the symptoms preceding SSD failures, with an intense manifestation and rapid progression preventing their survivability beyond a few months”

In other words, one occasional S.M.A.R.T. reported error is probably not to be considered as a signal of imminent failure. However, when a healthy SSD starts reporting more and more errors, a short- to mid-term failure has to be anticipated.
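One way to spot such a trend is to save the attribute table at regular intervals and compare raw values between two snapshots. The sketch below is one possible approach, not a method from the paper. The function name `raw_increase` and the snapshot file names are hypothetical, and it assumes the usual table layout with RAW_VALUE in the tenth column.

```sh
# raw_increase OLD NEW
# Compare two saved attribute tables and print every attribute whose
# RAW_VALUE (field 10) is larger in NEW than in OLD.
raw_increase() {
    awk '
    # First file: remember the raw value of each attribute by name.
    NR == FNR { if ($1 ~ /^[0-9]+$/) old[$2] = $10; next }
    # Second file: report attributes whose raw value went up.
    $1 ~ /^[0-9]+$/ && ($2 in old) && ($10 + 0 > old[$2] + 0) {
        print $2 ": " old[$2] " -> " $10
    }' "$1" "$2"
}

# Typical use (file names are placeholders):
#   smartctl -A /dev/sdX > smart-old.txt     # first snapshot
#   smartctl -A /dev/sdX > smart-new.txt     # some days later
#   raw_increase smart-old.txt smart-new.txt
```

Counters such as Power_On_Hours or Total_LBAs_Written grow during normal use, so the output still needs a human to decide which increases matter.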

But how do you know whether your hard drive or SSD is healthy? Whether you want to satisfy your curiosity or start monitoring your drives closely, it is now time to introduce the smartctl monitoring tool:

Using smartctl to Monitor Status of your SSD in Linux

There are several ways to list disks in Linux, but to monitor the S.M.A.R.T. status of your disks I suggest the smartctl tool, part of the smartmontools package (at least on Debian/Ubuntu).

sudo apt install smartmontools

smartctl is a command line tool, which is ideal if you want to automate data collection, especially on servers.
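For automation, the exit status of smartctl is useful: it is a bit mask in which each bit flags one kind of problem. The sketch below decodes it; the bit meanings are paraphrased from the exit status section of the smartctl man page, so check the man page of your version before relying on them. The function name `explain_smartctl_status` is hypothetical.

```sh
# explain_smartctl_status STATUS
# Print one line per bit set in a smartctl exit status (0..255).
explain_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or ATA command failed" \
        "SMART status: DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute at or below threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number $bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Typical use: smartctl -a /dev/sdX >/dev/null; explain_smartctl_status $?
explain_smartctl_status 64   # prints: bit 6: error log contains errors
```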

The first step when using smartctl is to check if your disk has S.M.A.R.T. enabled and is supported by the tool:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:54:43 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

As you can see, my laptop's internal hard drive indeed has S.M.A.R.T. capabilities, and S.M.A.R.T. support is enabled. So, what about the S.M.A.R.T. status? Are there any errors recorded?

Reporting “all SMART information about the disk” is the job of the -a option:

sh$ sudo smartctl -i -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 7200.4
Device Model:     ST9500420AS
Serial Number:    5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Mar 12 15:56:58 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (    0) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 110) minutes.
Conveyance self-test routine
recommended polling time:      (   3) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       29694249
  3 Spin_Up_Time            0x0003   100   098   085    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5413
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   096   037   020    Old_age   Always       -       4836
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   072   072   000    Old_age   Always       -       28
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       4295033738
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   042   045    Old_age   Always   In_the_past 44 (Min/Max 21/44 #22)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       184
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       104
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       395415
194 Temperature_Celsius     0x0022   044   058   000    Old_age   Always       -       44 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   050   045   000    Old_age   Always       -       29694249
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       25131 (246 202 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3028413736
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1613088055
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.580  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.579  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:45:12.571  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00      00:45:12.543  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:45:09.456  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:09.451  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:45:09.450  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.878  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:45:08.856  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 21131 hours (880 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      05:52:18.809  READ FPDMA QUEUED
  61 00 00 7e fb 31 45 00      05:52:18.806  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      05:52:18.571  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      05:52:18.529  FLUSH CACHE EXT
  61 00 08 ff ff ff 4f 00      05:52:18.527  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10904         -
# 2  Short offline       Completed without error       00%        12         -
# 3  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Understanding the output of smartctl command

That is a lot of information, and it is not always easy to interpret. The most interesting part is probably the one labeled “Vendor Specific SMART Attributes with Thresholds”. It reports various statistics gathered by the S.M.A.R.T. device and lets you compare those values (current or all-time worst) with a vendor-defined threshold.

For example, here is how my disk reports reallocated sectors:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3

You can see this is a “Pre-fail” attribute. That just means the attribute tracks anomalies, so if its normalized value drops to or below the threshold, that could be an indicator of imminent failure. The other category is “Old_age”, for attributes that track normal wear.

The last field (here “3”) is the raw value for that attribute as reported by the drive. Usually, this number has a physical meaning. Here, it is the actual number of reallocated sectors. For other attributes, it could be a temperature in degrees Celsius, a time in hours or minutes, or the number of times the drive has encountered a specific condition.

In addition to the raw value, a S.M.A.R.T.-enabled drive must report “normalized” values (the VALUE, WORST and THRESH fields). These values are normalized to the range 1-254 (0-255 for the threshold). The disk firmware performs that normalization using some internal algorithm, and different manufacturers may normalize the same attribute differently. Most values are reported as a percentage, higher being better, but this is not mandatory. When a normalized value is lower than or equal to the manufacturer-supplied threshold, the disk is said to have failed for that attribute. With all the reservations mentioned in the first part of this article, when a “Pre-fail” attribute has failed, a disk failure is presumably imminent.
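The comparison between VALUE and THRESH can be sketched in a few lines of shell. This is only an illustration, assuming the column layout of the tables in this article (ID, name, flag, VALUE, WORST, THRESH, and so on); the function name smart_margin is made up for the example, and the two sample rows are copied from this article.

```sh
#!/bin/sh
# Hypothetical helper (not part of smartmontools): read attribute rows on
# stdin and print how far each normalized VALUE sits above its THRESH.
smart_margin() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value  = $4 + 0    # normalized current value
        thresh = $6 + 0    # vendor-defined threshold
        printf "%s %s margin=%d\n", $1, $2, value - thresh
    }'
}

# Sample rows copied from this article; the header row is skipped.
smart_margin <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327
EOF
```

A margin of zero or less corresponds to what the text calls a failed attribute.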

As a second example, let’s examine the “seek error rate”:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       51710773327

Actually, and this is a problem with S.M.A.R.T. reporting, the exact meaning of each value is vendor-specific. In my case, Seagate uses a logarithmic scale to normalize the value, so “71” means roughly one error per 10 million seeks (10 to the power 7.1). Amusingly enough, the all-time worst (60) was one error per 1 million seeks (10 to the power 6.0). If I interpret that correctly, my disk heads are positioned more accurately now than they were in the past. I did not follow that disk closely, so this analysis should be taken with caution. Maybe the drive just needed a running-in period when it was first commissioned? Or maybe this is a consequence of mechanical parts wearing and thus offering less friction today? In any case, and whatever the reason, this value is more a performance indicator than an early warning of failure, so it does not bother me much.

Besides that, and three suspect errors recorded about six months ago, the drive appears to be in surprisingly good condition (according to S.M.A.R.T.) for a stock laptop drive that has been powered on for more than 1100 days (26423 hours):
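To make that reading concrete, here is the arithmetic under the interpretation used in this article, where the normalized value divided by ten is taken as a power of ten. That interpretation is the author's assumption, not a documented Seagate formula, and the function name is invented for the example.

```sh
#!/bin/sh
# Hypothetical helper: number of seeks per error if VALUE/10 is read as a
# base-10 exponent (an assumption taken from the text, not a vendor spec).
seeks_per_error() {
    awk -v v="$1" 'BEGIN { printf "%.0f\n", 10 ^ (v / 10) }'
}

seeks_per_error 71   # current VALUE: roughly 12.6 million seeks per error
seeks_per_error 60   # WORST: 1 million seeks per error
```

So “roughly 10 million” is the order of magnitude; the exact figure for 71 is a little above 12 million.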

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26423
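The hours-to-days conversion is simple integer arithmetic. The sketch below assumes the raw value really is in hours, which is the case for this drive but not guaranteed for every vendor; hours_to_days is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: convert a Power_On_Hours raw value into days + hours.
hours_to_days() {
    hours=$1
    echo "$((hours / 24)) days + $((hours % 24)) hours"
}

hours_to_days 26423   # the laptop drive above
hours_to_days 21171   # lifetime of "Error 2" in the error log
```

The second call gives the same “882 days + 3 hours” that smartctl prints next to the error entry.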

Out of curiosity, I ran the same test on a much more recent laptop equipped with an SSD:

sh$ sudo smartctl -i /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA THNSNK256GVN8
Serial Number:    17FS131LTNLV
LU WWN Device Id: 5 00080d 9109b2ceb
Firmware Version: K8XA4103
User Capacity:    256 060 514 304 bytes [256 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Mar 13 01:03:23 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The first thing to notice is that, even though the device is S.M.A.R.T.-enabled, it is not in the smartctl database. That won’t prevent the tool from gathering data from the SSD, but it will not be able to report the exact meaning of the different vendor-specific attributes:

sh$ sudo smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  120) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  11) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   100   100   000    Old_age   Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0013   100   100   050    Pre-fail  Always       -       0
  7 Unknown_SSD_Attribute   0x000b   100   100   050    Pre-fail  Always       -       0
  8 Unknown_SSD_Attribute   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       171
 10 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       105
166 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       100
170 Unknown_Attribute       0x0013   100   100   010    Pre-fail  Always       -       0
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       0
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0023   063   032   020    Pre-fail  Always       -       37 (Min/Max 11/68)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
240 Unknown_SSD_Attribute   0x0013   100   100   050    Pre-fail  Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This is typically the output you can expect for a brand new SSD, even if, for lack of normalization or meta-information on vendor-specific data, many attributes are reported as “Unknown_SSD_Attribute”. I can only hope future versions of smartctl will add data for that particular drive model to the tool’s database, so I can identify possible issues more accurately.

Test your SSD in Linux with smartctl

Until now we have examined the data collected by the drive during normal operation. However, the S.M.A.R.T. protocol also supports several “self-test” commands to launch diagnostics on demand.

Unless explicitly requested otherwise, the self-tests can run during normal disk operation. Since the test and the host I/O requests compete for the drive, disk performance will degrade during the test. The S.M.A.R.T. specification defines several kinds of self-test. The most important are:

Short self-test (-t short)

This test checks the electrical and mechanical performance as well as the read performance of the drive. The short self-test typically requires only a few minutes to complete (usually 2 to 10).

Extended self-test (-t long)

This test takes one or two orders of magnitude longer to complete. Usually, it is a more in-depth version of the short self-test. In addition, it scans the entire disk surface for data errors with no time limit. The test duration is proportional to the disk size.

Conveyance self-test (-t conveyance)

This test suite is designed as a relatively quick way to check for possible damage incurred while transporting the device.

Here are examples taken from the same disks as above. I will let you guess which is which:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2018

Use smartctl -X to abort test.

The test has now been started. Let’s wait until completion to show the outcome:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       171         -
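The 120-second sleep above was picked by hand from the “Please wait 2 minutes” line. As a sketch, the wait could instead be read from that message. This assumes the message wording shown in this article; wait_minutes is a hypothetical helper, and the sample lines are copied from the output above.

```sh
#!/bin/sh
# Hypothetical helper: extract N from the "Please wait N minutes" line that
# smartctl prints when a self-test starts (wording as seen in this article).
wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\.$/\1/p'
}

# Sample lines copied from the output above.
minutes=$(wait_minutes <<'EOF'
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2018
EOF
)
echo "sleep $((minutes * 60))"
```

On a real system the input would be the output of the smartctl -t command itself, which needs root and an actual device.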

Let’s now run the same test on my other disk:

sh$ sudo smartctl -t short /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 21:59:39 2018

Use smartctl -X to abort test.

Once again, sleep for two minutes and display the test outcome:

sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     26429         -
# 2  Short offline       Completed without error       00%     10904         -
# 3  Short offline       Completed without error       00%        12         -
# 4  Short offline       Completed without error       00%         0         -

Interestingly, in that case, it appears both the drive and the computer manufacturers performed some quick tests on the disk (at lifetime 0h and 12h). I was definitely much less concerned with monitoring the drive’s health myself. So, since I am running some self-tests for this article, let’s start an extended test to see how it goes:

sh$ sudo smartctl -t long /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 110 minutes for test to complete.
Test will complete after Tue Mar 13 00:09:08 2018

Use smartctl -X to abort test.

Apparently, this time we will have to wait much longer than for the short test. So let’s do it:

sh$ sudo bash -c 'sleep $((110*60)) && smartctl -l selftest /dev/sdb'
[sudo] password for sylvain:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       20%     26430         810665229
# 2  Short offline       Completed without error       00%     26429         -
# 3  Short offline       Completed without error       00%     10904         -
# 4  Short offline       Completed without error       00%        12         -
# 5  Short offline       Completed without error       00%         0         -

In that latter case, pay special attention to the different outcomes of the short and extended tests, even though they were performed one right after the other. Well, maybe that disk is not that healthy after all! An important thing to notice is that the test stops at the first read error. So if you want an exhaustive diagnosis of all read errors, you will have to continue the test after each error. I encourage you to take a look at the very well written smartctl(8) manual page for more information about the options -t select,N-max and -t select,cont for that:

sh$ sudo smartctl -t select,810665230-max /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN         STARTING_LBA           ENDING_LBA
   0            810665230            976773167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed without error       00%     26432         -
# 2  Extended offline    Completed: read failure       20%     26430         810665229
# 3  Short offline       Completed without error       00%     26429         -
# 4  Short offline       Completed without error       00%     10904         -
# 5  Short offline       Completed without error       00%        12         -
# 6  Short offline       Completed without error       00%         0         -
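The starting LBA used for the selective test (810665230) is simply the LBA of the first error plus one. A minimal sketch of that step, assuming the self-test log layout shown above; next_start_lba is a hypothetical helper, and the command is only printed, not executed.

```sh
#!/bin/sh
# Hypothetical helper: read a self-test log on stdin and print the LBA just
# after the most recent logged read failure, or nothing if there is none.
next_start_lba() {
    awk '/^# *[0-9]+ / && /read failure/ && $NF ~ /^[0-9]+$/ {
        printf "%.0f\n", $NF + 1
        exit
    }'
}

# Sample rows copied from the log above.
lba=$(next_start_lba <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       20%     26430         810665229
# 2  Short offline       Completed without error       00%     26429         -
EOF
)

# Only print the command; running it requires root and a real device.
if [ -n "$lba" ]; then
    echo "smartctl -t select,${lba}-max /dev/sdb"
fi
```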

Conclusion

S.M.A.R.T. reporting is definitely a technology you can add to your tool chest to monitor your servers’ disk health. In that case, you should also take a look at the S.M.A.R.T. Disk Monitoring Daemon smartd(8), which can help you automate monitoring through syslog reporting.

Given the statistical nature of failure prediction, however, I am a little less convinced that aggressive S.M.A.R.T. monitoring is of great benefit on a personal computer. Finally, don’t forget that whatever its technology, a drive will fail, and as we have seen earlier, in one-third of cases it will fail without prior notice. So nothing will replace RAID and offline backups to ensure your data integrity!


What is SMART?

SMART (or S.M.A.R.T.) stands for Self-Monitoring, Analysis and Reporting Technology. It is basically a system that collects information about a hard disk drive (HDD) or solid state drive (SSD), and allows you to run tests on the drive to determine its approximate health.

It is important to note that SMART is far from perfect. Although a failed “Pre-fail” SMART attribute predicts failure, having no failed attributes does NOT mean the drive is not failing. A drive can be failing while all its attributes are above threshold. This leads us to the next section: backing up your data.

Backing up your data

According to CERT you should follow the 3-2-1 rule:

3 - Keep 3 copies of any important file: 1 primary and 2 backups.
2 - Keep files on 2 different media types to protect against different types of hazards.
1 - Store 1 copy offsite (e.g. outside your home or business facility).

In summary, keep 3 copies: 1 primary, 1 onsite backup, 1 offsite backup. This is of critical importance because your device can fail at any time without warning and for various reasons. Backing up your data is the only way to be reasonably sure that you won’t lose it. You CANNOT rely on SMART to tell you reliably, and early enough, that your HDD is going to fail so that you can save your data.

SMART Attributes

In order to be able to use SMART you need:

A HDD or SSD that supports SMART
SMART enabled in the UEFI/BIOS
Software to interface with SMART

Some commonly used software to interface with SMART is smartmontools, or you can find individual manufacturer’s utilities on UBCD. Some people prefer smartmontools because it is easily accessible from the command line. Others prefer the manufacturer’s utilities because they sometimes have more features than smartmontools. Which is better is mostly down to user preference and the details of the situation. For this article we will focus on smartmontools and more specifically smartctl.

In order to display the SMART attributes with smartmontools you need to run the following as root:

smartctl -a /dev/sda

Note that we will be assuming that /dev/sda is your HDD/SSD device node. In many cases this is the first HDD/SSD on the system, but you need to double check to make sure it is the HDD/SSD you are interested in.

The output will be something like:

bash-4.2# smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.63] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST1000DM003-1CH162
Serial Number:    Z1D6DR9C
LU WWN Device Id: 5 000c50 064a62447
Firmware Version: CC49
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ACS-2 (unknown minor revision code: 0x001f)
Local Time is:    Sun Jan  4 16:02:08 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 111) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       168101376
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       425
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9675211
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3982
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       433
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   063   045    Old_age   Always       -       29 (Min/Max 20/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       504
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       12154757451688
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14098900823
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       800819281

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3005         -
# 2  Extended offline    Completed without error       00%      2008         -
# 3  Extended offline    Completed without error       00%      1014         -
# 4  Extended offline    Completed without error       00%        13         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This is just an example from my current HDD. Technically, smartctl -a lists everything, not just the attributes, but the whole output is more useful than the attributes alone. The first thing to note in the output is that SMART support is both available and enabled. If it is not available, your device may not support SMART, which can happen with an external HDD in a cheap enclosure or if the device is not an HDD/SSD. If it is not enabled, go into your UEFI/BIOS settings and enable it. Also note the line SMART overall-health self-assessment test result: PASSED; it should read PASSED unless your HDD is failing.

Note the line Auto Offline Data Collection: Enabled. This feature is enabled by default on modern internal HDDs. man smartctl explains what it does and how to enable it:
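The “SMART support is” and overall-health lines can be checked mechanically. The sketch below assumes the exact line wording of the report above; smart_status is a hypothetical helper that reads smartctl -a output on stdin, here fed with lines copied from that report.

```sh
#!/bin/sh
# Hypothetical helper: summarize whether SMART is enabled and what the
# overall-health result is, from "smartctl -a" output read on stdin.
smart_status() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }    # drop trailing whitespace
        /^SMART support is:/ && $2 == "Enabled" { enabled = 1 }
        /^SMART overall-health self-assessment test result:/ { health = $2 }
        END {
            if (health == "") health = "unknown"
            printf "enabled=%s health=%s\n", (enabled ? "yes" : "no"), health
        }'
}

# Sample lines copied from the report above.
smart_status <<'EOF'
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART overall-health self-assessment test result: PASSED
EOF
```

A PASSED result here only repeats the drive's own verdict, so the caveats about SMART given earlier still apply.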

       -o VALUE, --offlineauto=VALUE
              [ATA only] Enables or disables  SMART  automatic  offline  test,
              which  scans  the  drive every four hours for disk defects. This
              command can be given during normal system operation.  The  valid
              arguments to this option are on and off.

This also updates attributes that are marked Offline. Unlike Always updated attributes, Offline attributes are only updated if this is enabled or if you run a SMART test.

Note also that the approximate times for running various tests are listed. We will discuss SMART tests in the next section.

Now about the attributes, their meaning is summarized in man smartctl:

              Each  Attribute  also has a Threshold value (whose range is 0 to
              255) which is printed under the heading "THRESH".  If  the  Nor-
              malized value is less than or equal to the Threshold value, then
              the Attribute is said to have failed.  If  the  Attribute  is  a
              pre-failure Attribute, then disk failure is imminent.

              The Attribute table printed  out  by  smartctl  also  shows  the
              "TYPE"  of  the  Attribute.  Attributes  are one of two possible
              types: Pre-failure or Old age.  Pre-failure Attributes are  ones
              which, if less than or equal to their threshold values, indicate
              pending disk failure.  Old age, or usage  Attributes,  are  ones
              which  indicate end-of-product life from old-age or normal aging
              and wearout, if the Attribute value is less than or equal to the
              threshold.   Please  note: the fact that an Attribute is of type
              'Pre-fail' does not mean that your disk is about  to  fail!   It
              only  has  this  meaning  if  the Attribute´s current Normalized
              value is less than or equal to the threshold value.
              
              If  the  Attribute´s  current  Normalized  value is less than or
              equal to the threshold value, then the "WHEN_FAILED" column will
              display  "FAILING_NOW".  If not, but the worst recorded value is
              less than or equal to the threshold value, then this column will
              display "In_the_past".  If the "WHEN_FAILED" column has no entry
              (indicated by a dash: ´-´) then this Attribute is  OK  now  (not
              failing) and has also never failed in the past.

Thus, the most important attributes are those marked Pre-fail. If the normalized value of a Pre-fail attribute is at or below its threshold, the attribute is failing, which implies that the HDD is failing. Such an attribute is marked FAILING_NOW if it is failing now, or In_the_past if only its worst recorded value reached the threshold. Old_age attribute failures do NOT necessarily mean imminent failure, but rather that the drive is getting old and should be monitored more carefully or replaced at some point.
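The WHEN_FAILED rules quoted from the man page can be written out as a small decision chain. This is a sketch of those quoted rules only, not of smartctl's actual implementation; when_failed is a hypothetical name, and the column positions are assumed from the table above.

```sh
#!/bin/sh
# Hypothetical helper: recompute the WHEN_FAILED column from VALUE, WORST
# and THRESH, following the rules quoted from the man page.
when_failed() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (value <= thresh)      status = "FAILING_NOW"
        else if (worst <= thresh) status = "In_the_past"
        else                      status = "-"
        printf "%s %s %s %s\n", $1, $2, $7, status
    }'
}

# Sample rows copied from the table above; both come out as "-".
when_failed <<'EOF'
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       168101376
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3982
EOF
```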

For the exact meaning of each attribute, please see the Wiki page. Some specific attributes that I would like to discuss are as follows:

#4 Start_Stop_Count and #12 Power_Cycle_Count and #193 Load_Cycle_Count

These attributes are important for laptop HDDs, because they default to powering down when not in use. Although laptop HDDs are designed to spin up and down more often than desktop HDDs, and these are Old_age attributes, each cycle still wears down the drive. Unless you run on battery all the time, you may want to consider turning off this feature by adding the following to a boot script such as /etc/rc.d/rc.local:

hdparm -B 254 /dev/sda

#9 Power_On_Hours

This is the age of the drive in hours. This is rather important because it tells you how old the drive is and thus how likely it is to fail. HDD failure, among other things, follows the bathtub curve: the highest failure rates are among very young (infant mortality) and very old (worn out) drives. This matters because I hear many people saying, “Oh, but the drive is brand new, it can’t be failing.” Wrong: a new drive, much like an old drive, is more likely to fail than a middle-aged one.

#174 Unexpected power loss count and #192 Power-Off_Retract_Count

Sudden power loss is detrimental to both HDDs and SSDs. For this reason, as well as many others, a UPS should be used for systems that are on all the time. Also make sure to shut down your computer properly whenever possible to prevent damage and data loss.

#190 Airflow_Temperature_Cel and #194 Temperature_Celsius

Although many people believe that HDDs should be kept cool and are sensitive to heat, a large Google internal study suggests that high temperatures are only significantly detrimental to old HDDs.

Bad Blocks (#5, 196, 197, 198)

Bad blocks are areas of the disk surface that are damaged and can no longer hold data reliably. Internally, the HDD/SSD deals with them by marking them and remapping/reallocating them to spare areas. The number of bad blocks increases with the age of the drive, and you can expect to encounter some on every HDD and SSD. The question is: when does this become something to be concerned about? That is hard to say, and in general you will have to judge each device individually. A large increase in the number of bad blocks could mean the drive is nearing its end. Keep monitoring the Pre-fail attributes and decide when to replace it.
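
To keep an eye on just these four attributes, the attribute table can be filtered with awk. This is a sketch that assumes the usual ten-column layout shown in the report extract at the top of this page (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value); bad_block_counts is a hypothetical name of my own.

```shell
# bad_block_counts
# Reads a smartctl attribute table on stdin and prints the ID, name and
# raw value of the bad-block attributes (5, 196, 197, 198).
# (Hypothetical helper; assumes the standard ten-column table layout.)
bad_block_counts() {
    # The FLAG column (field 3) always starts with 0x, which filters out
    # header lines and any other text in the report.
    awk '$3 ~ /^0x/ && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
        print $1, $2, $10
    }'
}
```

It can be fed directly from smartctl, for example smartctl -A /dev/sda | bad_block_counts, and the output saved with a date so that a sudden jump between runs stands out.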

SMART Tests

There are 3 main types of SMART tests that you can perform.

short: a superficial test that tests electrical and mechanical performance and updates offline attributes
conveyance: identifies damage during transport (mostly useful for external or laptop HDDs)
long: a short test plus it scans the disk surface for bad blocks

These tests are run with the -t option like:

smartctl -t long /dev/sda

These tests can all be run on a running system without major side effects. If you want the long test to finish, you should minimize HDD usage while it runs, since it has to scan the whole disk.

After waiting for the test to finish, you can get the results using the -a option as shown in the previous section.
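
If you only want the outcome of the most recent test rather than the whole report, smartctl -l selftest prints just the self-test log, and the newest entry can be picked out of it. The sketch below assumes the usual log layout, where entries are numbered "# 1", "# 2" and so on with "# 1" being the newest; latest_selftest is a name I made up.

```shell
# latest_selftest
# Reads a smartctl self-test log on stdin and prints the newest entry.
# (Hypothetical helper; assumes entry lines begin with "# 1", "# 2", ...)
latest_selftest() {
    # Match "# 1 " but not "#10", "#11", ...
    awk '/^# *1 /'
}
```

Typical use would be smartctl -l selftest /dev/sda | latest_selftest. A status of "Completed without error" in that line means the test passed.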

Short and conveyance tests should always pass. If either fails, check the attributes, as the drive is probably failing. A long test can fail if there are bad blocks, and this does NOT by itself mean the drive is failing: the long test stops at the first error it finds on the disk, so a single bad block is enough to end it. You will then have to wait for the HDD to remap/reallocate the block, or technically you could try to force it to do so:
http://www.smartmontools.org/browser/trunk/www/badblockhowto.xml
However, this method is difficult to implement safely, so you should usually just wait for the HDD to remap/reallocate.

How often should you run these tests? That depends. If you run a server, more often is better; the smartmontools site recommends weekly tests. As a home user, I usually run a long test every 1000 power-on hours, but that is up to you and also depends on the details of the drive and the situation.
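
The "every 1000 power-on hours" habit is easy to check mechanically, because the self-test log records the power-on hours at which each test ran (the LifeTime(hours) column). Here is a minimal sketch; long_test_due is a hypothetical name, and both hour counts are assumed to be plain integers that you have read off the report yourself.

```shell
# long_test_due CURRENT_HOURS LAST_TEST_HOURS [INTERVAL]
# Succeeds (exit status 0) if at least INTERVAL power-on hours have
# passed since the last long test. INTERVAL defaults to 1000.
# (Hypothetical helper, not part of smartmontools.)
long_test_due() {
    current=$1
    last=$2
    interval=${3:-1000}
    [ "$((current - last))" -ge "$interval" ]
}
```

For instance, long_test_due 1072 72 && smartctl -t long /dev/sda would start a long test only once 1000 hours have gone by since the previous one.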

Is my drive failing ?

A failing drive is defined as:

Having a Pre-fail attribute below or near threshold, marked FAILING_NOW or In_the_past.
Having an Old_age attribute below or near threshold, marked FAILING_NOW or In_the_past PLUS other signs of failure such as consistent failure of SMART tests, strange noises, slowing down, corrupt data, etc.
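
The attribute half of this definition can be checked automatically by listing every attribute whose WHEN_FAILED column is not a dash. This is a sketch under the same ten-column layout assumption as the report extract at the top of this page; failed_attributes is a made-up name, and it cannot judge the other signs (noises, slowdowns, corrupt data) for you.

```shell
# failed_attributes
# Reads a smartctl attribute table on stdin and prints the name, type
# and WHEN_FAILED status of every attribute that is failing now or has
# failed in the past. Prints nothing if all attributes are OK.
# (Hypothetical helper; assumes the standard ten-column table layout.)
failed_attributes() {
    awk '$3 ~ /^0x/ && NF >= 10 && $9 != "-" { print $2, $7, $9 }'
}
```

Run it as smartctl -A /dev/sda | failed_attributes. Any Pre-fail line in the output matches the first case above; an Old_age line only matters together with the other signs of failure.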

A failed long test does NOT mean your drive is failing; it could just be bad blocks. See the previous section.

Do not ignore your senses: if the HDD sounds unusual or makes strange noises, monitor it closely and/or replace it. Again, SMART cannot tell you with great accuracy if or when the drive will fail. A drive can fail with all attributes above threshold and with minimal warning signs. The only way to keep your data safe is to back it up, using the 3-2-1 strategy as mentioned above.

smartd

What is smartd? It is a daemon that monitors SMART. If you don’t want to monitor drives and run tests manually, you can set up smartd to do so on a regular basis. Refer to man smartd, man smartd.conf and the comments in /etc/smartd.conf for everything you need to know about setting up smartd to do what you want.
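
As a starting point, here is a sketch that generates one smartd.conf line for a given device. The directives are taken from the smartd.conf documentation: -a monitors all attributes, -s schedules self-tests (here a short test daily between 2 and 3 am and a long test on Saturdays between 3 and 4 am), and -m sends mail to the given address. The function name smartd_conf_line, the schedule and the root recipient are only my assumptions; adjust them to your system.

```shell
# smartd_conf_line DEVICE
# Prints a smartd.conf entry for DEVICE: monitor all attributes, run a
# short self-test daily at 02:00 and a long self-test on Saturdays at
# 03:00, and mail warnings to root.
# (Hypothetical helper; schedule and recipient are example choices.)
smartd_conf_line() {
    device=$1
    printf '%s -a -s (S/../.././02|L/../../6/03) -m root\n' "$device"
}
```

You can review the output of smartd_conf_line /dev/sda and then add it to /etc/smartd.conf by hand. If the file already contains a DEVICESCAN entry, check man smartd.conf for how that interacts with per-device lines before relying on the new one.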

Sources

man smartctl
