From Unraid | Docs
Under construction and only slightly usable so far. Please report errors on the Talk page.
Disclaimer: this page is based on personal experience gained from examining numerous SMART reports, therefore it should not be considered authoritative. Accuracy however is highly desired, so please feel free to correct it as needed, or suggest corrections or question its statements on the associated Talk page.
Contents
- 1 Prologue
- 2 Introduction to SMART
- 3 SMART report structure
- 3.1 General information section
- 3.2 SMART overall health test
- 3.3 SMART parameters section
- 3.4 SMART attributes section
- 3.5 Error Log section
- 3.6 Test results section
- 4 Table of attributes
- 4.1 1 Raw_Read_Error_Rate
- 4.2 3 Spin_Up_Time
- 4.3 4 Start_Stop_Count
- 4.4 5 Reallocated_Sector_Ct
- 4.5 7 Seek_Error_Rate
- 4.6 9 Power_On_Hours
- 5 Additional info
Prologue
There is a lot of ignorance and misinformation out there about SMART reports, so this page is an effort to help users better understand what a SMART report contains.
Consider the following SMART report extract:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   099   006    Pre-fail  Always       -       42208416
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   056   055   030    Pre-fail  Always       -       25772440425
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       72
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   048   045    Old_age   Always       -       43 (Min/Max 36/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       19
194 Temperature_Celsius     0x0022   043   052   000    Old_age   Always       -       43 (0 28 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       260348032581703
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       423266408125
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       97907054046
Looks rather intimidating, doesn’t it, with huge scary numbers! But with a little knowledge from this page, you should be able to quickly say «That drive looks fine! A little warm though!»
Introduction to SMART
From SMART on Wikipedia, «S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system for computer hard disk drives to detect and report on various indicators of reliability, in the hope of anticipating failures.» It was a laudable effort by the drive manufacturers to provide standard ways to both report current drive parameters and status, and also indicate issues, especially those that might be predictive of imminent drive failure. Unfortunately, the standard had considerable ambiguity, and the various drive engineers have often differed greatly in their interpretations and implementations of both the common attributes, and the introduction of new attributes.
This page is primarily a guide to understanding SMART attributes, in real world usage. They are unfortunately very inconsistent in their behavior, not only between the different attributes, but between the various drive models, and especially between brands. In some cases, the RAW_VALUE is the counter to watch, in others, it is more important to watch what the VALUE does, and there are yet other behaviors too. To understand a particular attribute report line, you have to understand how that SMART attribute is usually handled, keeping in mind who the manufacturer is, and to a lesser extent, what drive model it is. You can try researching it online, but information is really skimpy, nothing authoritative at all from the manufacturers themselves. The table of SMART attributes below should help you understand them, but every manufacturer uses a different set of SMART attributes, even using the common ones in differing ways, even across their own drive models.
There are many computer professionals with a very low opinion of SMART reporting, and they generally discount SMART reports, partly because of all the inconsistency, but also because many drives fail with no SMART warnings at all. I find that once you understand the inconsistencies, and keep some perspective, there is much that can still be learned. For one example, the Seek_Error_Rate (a critical attribute) on Seagate drives generally starts and stays in the mid 50’s to high 60’s (attribute values generally start at 100 and drop to 1). Not knowing this, you might immediately think there is a serious issue with your new Seagate drive. But now that you do know this, you won’t be concerned until it drops into the low 50’s or below. The same Seek_Error_Rate value on any other brand would be immediately concerning. Hopefully the table below will help you understand what ‘normal’ looks like, for the different attributes on different drives by different makers.
SMART report structure
Each section below is followed by an example of that section. It’s just an example; yours may differ greatly.
General information section
- Identifying information for the SMART program and the drive — its model, serial number, firmware, capacity/size, time of this report, and SMART support status
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model:     ST1500DL003-9VT16L
Serial Number:    5YD3D71H
Firmware Version: CC32
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Nov 18 16:11:43 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART overall health test
- Basic overall health test of the drive, only 2 choices — PASSED or FAILED
- If test result is FAILED, then that means the SMART firmware believes that the drive is in imminent danger of catastrophic failure, so it is imperative to copy off ALL important data. Usually, it is best to copy off the most important files, then the next most important files, then the next, and so on, because the drive may completely quit before you finish copying.
SMART overall-health self-assessment test result: PASSED
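A script can pick this verdict out of a saved report. The following is only a minimal sketch: the function name `overall_health` is made up for illustration, and it assumes the report text contains the line exactly as smartctl prints it in the example above.

```python
import re

# Matches the verdict line shown in the example above.
HEALTH_RE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(PASSED|FAILED)"
)


def overall_health(report_text):
    """Return 'PASSED' or 'FAILED', or None if the verdict line is absent."""
    match = HEALTH_RE.search(report_text)
    return match.group(1) if match else None


sample = "SMART overall-health self-assessment test result: PASSED"
verdict = overall_health(sample)
if verdict == "PASSED":
    print("Overall health: PASSED")
else:
    # FAILED, or no verdict found: copy off the most important data first.
    print("Overall health not confirmed as PASSED, back up important data now")
```

Returning None when the line is missing keeps "no verdict found" distinct from an explicit FAILED, so a caller can decide how cautious to be.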
SMART parameters section
- These are generally of little interest to us
- They do include the recommended polling time for the short and long tests; in other words, after starting a test, don’t check for its result any sooner than this recommendation
- Unfortunately the original standard must have stipulated using a single byte to store the polling times, which caps their maximum value at 255. That makes the ‘Extended self-test’ (the long test) polling time of 255 rather useless.
- I have seen a case where an unusually long ‘Total time to complete Offline data collection’ for one unusually slow drive was the only indication of a faulty drive. The SMART reports for other drives that were exactly the same model had essentially identical SMART reports, with no issues, except for the difference in this parameter.
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 623) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SMART attributes section
- This is the table of SMART attributes for this drive. The columns are described below the example. Yours may greatly differ from this example, as some drives report more attributes, and some drives report considerably fewer. The newest drives often introduce new attributes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   100   006    Pre-fail  Always       -       32796080
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       265367
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       19
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   069   045    Old_age   Always       -       30 (Lifetime Min/Max 26/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 26 0 0)
195 Hardware_ECC_Recovered  0x001a   037   029   000    Old_age   Always       -       32796080
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       172868138696723
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2919100768
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       572998840
- Column 1 is the attribute number, usually a decimal number between 1 and 255. Some SMART tools report it in hex, from 01 to FF. These are relatively standard IDs, except that different manufacturers will occasionally introduce a new one, unused by anyone else. Generally, the only ones you can count on seeing are: 1, 3, 4, 5, 7, 9, 10, 187, 190 or 194, 193, 195, 197, 198, and 199.
- Column 2 is the relatively standardized attribute name. There are a few that seem only used by a single manufacturer.
- Column 3 is the attribute handling flag, of no interest to us — ignore it.
- Column 4 is the VALUE, one of the most important values in the table. It is stored in a single byte on the drive for each SMART attribute, so its range is from 0 to 255.
- However, the values of 0, 254, and 255 are reserved for internal use, so you never see them.
- The value of 253 usually means «Not Used Yet», so when you see it, you are probably looking at a brand new drive. Sometimes though, a few attributes take a while before they are used, so they may stay at 253 for longer.
- VALUE is almost always used as a normalized scale of perfectly good to perfectly bad, usually starting at VALUE=100, then dropping toward a worst case of VALUE=1. You can generally think of it as representing a scale starting at 100% good, then slowly dropping until failure at some predetermined percentage number, given in the THRESH column.
- Someone realized that if the values only run from 100 to 1, then they are wasting the possible values from 101 to 252, so some SMART programmers have decided to stretch the scale for certain attributes to start at 200 instead of 100, providing twice the data points. Unfortunately, which attributes are scaled from 200 to 1 is completely inconsistent, with almost all SMART reports showing some attributes starting at 100, and other attributes starting at 200. In addition, there are a few Maxtor and Samsung drives that took the start of the scale all the way to 252 or 253! Above, you see all but 1 attribute using 100, the exception being attribute 199 which starts at 200. In general, you can think of 200-type scales as 100 times 2 (just divide the number by 2), and from now on, that is what we are going to do in most of the discussion.
- The temperature attributes 190 and 194 are exceptions to the scaling. They are either temperatures or forms of the temperature, and they don’t scale (their WORST value may look like it scales though).
- The error rate attributes 1 and 7 are also exceptions, although of a different kind. Raw read and seek errors are a natural part of normal operation, so even a brand new and perfect drive has a factory-determined optimal rate of read and seek errors. They are nothing to worry about; they’re the natural result of thermal expansion and other things, and they are used to help the drive constantly recalibrate itself. Because these error rates are never zero, a perfect error rate of zero cannot serve as the VALUE of 100. So manufacturers determine what an optimal error rate should be and call it 100. But drives often achieve an error rate (especially when they are new) that is even better than the optimal one set by the manufacturer, which results in a VALUE that is HIGHER than 100! For an example, see the VALUE above of attribute 1, the Raw_Read_Error_Rate. It’s as if the drive is performing at 111%!
- Column 5 is WORST, the lowest VALUE ever recorded (except for a few unusual and uncommon cases).
- [incomplete]
- Column 6 is THRESH, the manufacturer determined lowest value that WORST should be allowed to fall to, before reporting it as a FAILED quantity. Some are counters, some are informational such as temperature or hours used or
- [incomplete]
- Column 7 is TYPE, the type of attribute. It can either be Pre-fail or Old_age.
- If it is Pre-fail, then the attribute is considered a critical attribute, one that participates in the overall SMART health assessment (PASSED/FAILED) of the drive. If the value of WORST falls below THRESH, then the drive FAILS the overall SMART health test, and complete failure may be imminent. The Pre-fail term means that if this attribute fails, then the drive is considered ‘about to fail’.
- If it is Old_age, then the attribute is considered (for SMART purposes) a noncritical attribute, one that does not fail the drive. The Old_age term means that the attribute is related to normal aging, normal wear and tear of the drive.
- When new attributes are introduced, they may seem like a critical item, perhaps even with an appropriate THRESH set. But if they are marked as Old_age, then they do NOT fail the drive, even if WORST falls below THRESH. Naturally, this could be highly concerning, but there is no authoritative interpretation available, so no definitive conclusions can be made. These attributes should be considered Experimental.
- [incomplete]
- Column 8 is UPDATED. Supposedly, this indicates when the attribute is updated, Always or Offline. If Always, then it is assumed that the attribute is updated whenever a relevant event occurs. In other words, it is always ‘live’. If Offline, then supposedly the attribute is only updated when offline tests are being performed. But in real life, our experience is that these are inaccurate. Just look at the example above, at attributes 241 and 242. They appear to be live counters of LBAs read and written, yet the test section of that particular SMART report indicates that there have been no offline tests performed!
- Column 9 is WHEN_FAILED, usually and thankfully blank (shown as a dash)! If not blank, smartctl prints FAILING_NOW when the VALUE is currently at or below THRESH, or In_the_past when WORST has been at or below THRESH at some earlier time.
- Column 10 is RAW_VALUE, a manufacturer controlled raw number, which may or may not be of interest to us. From now on, we will often shorten its name and refer to it only as ‘the RAW’.
- [incomplete]
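The column layout described above can be sketched in a few lines of Python. This is an illustration only: the names `Attribute` and `parse_attribute_row` are hypothetical, and the sketch assumes the ten-column layout that smartctl prints in the example, with RAW_VALUE as the last column.

```python
from collections import namedtuple

# One field per column of the attribute table, in the order described above.
Attribute = namedtuple(
    "Attribute",
    "id name flag value worst thresh type updated when_failed raw",
)


def parse_attribute_row(line):
    """Split one attribute row into its ten columns.

    RAW_VALUE may itself contain spaces, e.g. '43 (Min/Max 36/43)', so the
    line is split at most nine times and the remainder is kept as the RAW.
    """
    parts = line.split(None, 9)
    if len(parts) != 10 or not parts[0].isdigit():
        raise ValueError("not an attribute row: %r" % line)
    return Attribute(
        id=int(parts[0]),
        name=parts[1],
        flag=parts[2],
        value=int(parts[3]),   # "057" becomes 57
        worst=int(parts[4]),
        thresh=int(parts[5]),
        type=parts[6],
        updated=parts[7],
        when_failed=parts[8],
        raw=parts[9].strip(),
    )


row = "190 Airflow_Temperature_Cel 0x0022 057 048 045 Old_age Always - 43 (Min/Max 36/43)"
attr = parse_attribute_row(row)
print(attr.name, attr.value, attr.worst, attr.thresh, attr.raw)
```

VALUE, WORST and THRESH are converted to integers so they can be compared directly; the RAW is left as text because its meaning differs by attribute and by manufacturer.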
Error Log section
- [incomplete]
SMART Error Log Version: 1
No Errors Logged
- [incomplete, need example with errors]
Test results section
- [incomplete]
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
- [incomplete, need example with tests]
Table of attributes
For a fuller description of each attribute, please see Known ATA S.M.A.R.T. attributes on Wikipedia.
[incomplete]
1 Raw_Read_Error_Rate
- This is an indicator of the current rate of errors of the low level physical sector read operations. In normal operation, there are ALWAYS a small number of errors when attempting to read sectors, but as long as the number remains small, there is NO issue with the drive. Error correction information and retry mechanisms are in place to catch and fix these errors. Manufacturers therefore determine an optimal level of errors for each drive model, and set up an appropriate scale for monitoring the current error rate. For example, if 3 errors per 1000 read operations seems near perfect to the manufacturer, then an error rate of 3 per 1000 ops might be set to an attribute VALUE of 100. If the rate increased to 10 per 1000, then the rate might be scaled to 80 (completely under manufacturer control, and NEVER revealed or explained to us!).
- They are called Raw Reads to distinguish them from the more common term ‘read errors’, which represent a much higher level read operation. What we usually refer to as a ‘read error’ is an error returned by a read process, that has attempted a series of one or more seeks and raw reads, plus optional error corrections and retries. It either returns an indicator of total success plus the sector data (considered to be in perfect shape), or it returns an error code, and no sector data.
- PLEASE completely ignore the RAW_VALUE number! Only Seagates report the raw value, which does appear to be the number of raw read errors, but it should be ignored completely. All other drives have raw read errors too, but do not report them, leaving this value at zero. To repeat, Seagates are not worse than other drives because they appear to have raw read errors; rather, they are the only ones to report the number. I suspect that others do not report the number to avoid a lot of confusion, and questions for their tech support people. Seagate leaves those of us who provide tech support the job of answering the constant questions about this number. Hopefully, now that you understand this, you will never bother a kind IT person with questions about the Raw_Read_Error_Rate RAW_VALUE again!
- [incomplete?]
- Critical attribute — if its WORST falls below its THRESH, then the drive will be considered FAILED
3 Spin_Up_Time
- [incomplete]
4 Start_Stop_Count
- [incomplete]
5 Reallocated_Sector_Ct
- [incomplete]
7 Seek_Error_Rate
- [incomplete]
9 Power_On_Hours
- [incomplete]
[the most important part of this whole page is completely incomplete!]
Additional info
- SMART is also known, more accurately, as S.M.A.R.T. or Self-Monitoring, Analysis and Reporting Technology
- Reference materials
- http://en.wikipedia.org/wiki/S.M.A.R.T. — all about S.M.A.R.T., from Wikipedia; recommended reading!
- http://en.wikipedia.org/wiki/S.M.A.R.T#Known_ATA_S.M.A.R.T._attributes — table of S.M.A.R.T. attributes, from Wikipedia
- http://www.linuxjournal.com/article/6983 — an excellent article on SMART and smartctl, from Linux Journal
- http://smartmontools.sourceforge.net/ — smartmontools Home Page
- http://smartmontools.sourceforge.net/faq.html — smartmontools FAQ Page
- http://smartmontools.sourceforge.net/man/smartctl.8.html — MAN Page for smartmontools
- UnRAID related (some are marked << old >>, meaning some part may be obsolete or incompatible with current releases of UnRAID)
- http://lime-technology.com/forum/index.php?topic=13054.msg53337#msg53337 — keeping SMART values in perspective, and how to properly interpret them — a series of posts to help users alarmed by the very large numbers they find in a SMART report or ‘diff’
- http://lime-technology.com/forum/index.php?topic=2135.msg15733#msg15733 — a script for grabbing dated SMART reports for all drives
- << old >> FAQ#How_can_I_find_out_more_information_about_a_hard_drive.3F — intro to obtaining the SMART info for a drive
- << old >> FAQ#Why_is_a_temp_not_showing_for_a_drive.3F — enabling SMART so temps can be accessed and displayed
- Troubleshooting#Hard_drive_failures — has a section on smartctl commands for getting SMART reports, and running tests
- UnRAID_Add_Ons#UnMENU — the Disk Management plugin has buttons for SMART reports and tests
- http://lime-technology.com/forum/index.php?topic=2708 — the MyMain thread; an UnMENU plugin; after installing UnMENU, install this next; has a Smart View that provides color-coded SMART info for all drives
- SmartHistory — a tool for monitoring the SMART parameters of your drives, and provide reporting and notification of changes in SMART attributes; produces customizable reports, with graphing capabilities
[incomplete]
HOWTO read smartctl reports
ATA Disk Report
# smartctl -q noserial -a /dev/ada30
smartctl 5.42 2011-10-20 r3458 [FreeBSD 9.0-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar 5K3000
Device Model:     Hitachi HDS5C3030ALA630
LU WWN Device Id: 5 000cca 228c089f4
Firmware Version: MEAOA580
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Aug 31 13:37:32 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (37566) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       109
  3 Spin_Up_Time            0x0007   162   162   024    Pre-fail  Always       -       498 (Average 363)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   132   132   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7493
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       142
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       142
194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 18/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 5 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:45.474  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:45.474  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:45.474  READ LOG EXT
Error 4 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:41.562  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:41.562  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:41.562  READ LOG EXT
Error 3 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:37.639  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:37.639  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:37.639  READ LOG EXT
Error 2 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  60 00 e0 ed d9 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:33.740  READ FPDMA QUEUED
  60 a0 f0 4d d9 3f 40 00      23:30:33.740  READ FPDMA QUEUED
  2f 00 01 10 00 00 00 00      23:30:33.727  READ LOG EXT
Error 1 occurred at disk power-on lifetime: 5353 hours (223 days + 1 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 52 9b d9 3f 05  Error: UNC at LBA = 0x053fd99b = 88070555
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 60 d8 ed da 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 00 b8 ed d9 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 02 e8 ec 0a 9e 40 00      23:30:29.836  READ FPDMA QUEUED
  60 a0 a0 4d d9 3f 40 00      23:30:29.836  READ FPDMA QUEUED
  60 20 a8 2d d9 3f 40 00      23:30:29.833  READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7465         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SMART overall-health state
If the state changes from PASSED to FAILED, the disk’s firmware has declared the device broken.
If the device is still under warranty, ask the vendor for a replacement.
Attributes Threshold Values
These are defined by the vendor.
Attributes Worst Value
Note that some vendors’ firmware may actually increase the «Worst» value for some rate-type attributes.
Attributes Type
Note that if an Attribute is of type ‘Pre-fail’, it does not mean that your disk is about to fail!
It only has this meaning if the Attribute’s current Normalized value is less than or equal to the threshold value.
Column Updated
SMART attribute values that are updated only during off-line data collection activities are labeled «Offline» in column «UPDATED».
Column «When Failed»
If the Attribute’s current «Normalized value» is less than or equal to the threshold value, then the attribute is marked with «FAILING_NOW» in column WHEN_FAILED.
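The rule for this column is easy to express in code. The sketch below is an illustration, not smartctl's actual source: the function name `when_failed` is hypothetical, and the `In_the_past` label is taken from the smartctl manual rather than from the text above.

```python
def when_failed(value, worst, thresh):
    """Return the WHEN_FAILED label for one attribute.

    value, worst and thresh are the normalized VALUE, WORST and THRESH
    columns as integers.
    """
    if value <= thresh:
        return "FAILING_NOW"   # current normalized value is at or below threshold
    if worst <= thresh:
        return "In_the_past"   # it was at or below threshold at some earlier time
    return "-"                 # never reached the threshold


# Reallocated_Sector_Ct from the report above: VALUE 100, WORST 100, THRESH 005
print(when_failed(100, 100, 5))
# Hypothetical numbers, for illustration only
print(when_failed(4, 4, 5))
print(when_failed(60, 5, 5))
```

For a Pre-fail attribute, only the first case means that failure is considered imminent, which matches the note under «Attributes Type» above.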
Raw Values
Please keep in mind that the conversion from RAW value to a quantity with physical units is not specified by the SMART standard!
smartctl only
reports
the different Attribute types, values, and thresholds as read from the device.
It does not carry out the conversion between «Raw» and «Normalized» values: this is done by the disk’s firmware.
In most cases, the values printed by smartctl are sensible.
For example the temperature Attribute generally has its raw value equal to the temperature in Celsius.
However in some cases vendors use unusual conventions. For example the Hitachi disk on my laptop reports
its power-on hours in minutes, not hours. Some IBM disks track three temperatures rather than one,
in their raw values. Have a look at our wiki pages on the topic of SMART attributes.
UNCorrectable Error in Data
This refers to data which has been read from the disk, but for which the Error Checking and Correction (ECC) codes are inconsistent. In effect, this means that the data cannot be read.
In the error log the Logical Block Address (LBA) at which the error occurred will be printed in base 16 and base 10.
Logical Block Address
The LBA is a linear address, which counts 512-byte sectors on the disk, starting from zero. (Because of the limitations of the SMART error log, if the LBA is greater than 0xfffffff, then either no error log entry will be made, or the error log entry will have an incorrect LBA. This may happen for drives with a capacity greater than 128 GiB or 137 GB.) For Linux systems the smartmontools web page has instructions about how to convert the LBA address to the name of the disk file containing the erroneous disk sector.
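The arithmetic behind these figures can be checked with shell integer math. The sketch below assumes 512-byte sectors, as stated above; lba_to_byte_offset is a hypothetical helper name, and the LBA is the one printed in the error-log example earlier on this page.

```sh
# Byte offset of a sector, assuming 512-byte logical sectors.
# (lba_to_byte_offset is a made-up name for this illustration.)
lba_to_byte_offset() {
    echo $(( $1 * 512 ))
}

lba=0x053fd99b    # LBA from the error-log entry shown earlier
printf 'LBA %s = %d (decimal)\n' "$lba" $(( $lba ))
printf 'byte offset = %d\n' "$(lba_to_byte_offset "$lba")"

# 0x10000000 is the first sector number that no longer fits in 28 bits:
# 0x10000000 sectors * 512 bytes = 137438953472 bytes, i.e. 128 GiB or about 137 GB.
printf '28-bit limit = %d bytes\n' "$(lba_to_byte_offset 0x10000000)"
```

The first line reproduces the decimal value 88070555 that smartctl prints next to the hexadecimal LBA.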
Smartmontools is a set of open-source tools for checking the health of your disks.
It can check hard disks, SAS disks and SSDs, including disks behind RAID controllers such as HP Smart Array, LSI MegaRAID and Dell PERC.
How to install Smartmontools on CentOS
# yum install smartmontools
To install Smartmontools on Ubuntu
# sudo apt-get install smartmontools
Start and enable Smartmontools on start up
# systemctl start smartd
# systemctl enable smartd
Enable Smart Capability for the disk /dev/sda
# smartctl -s on /dev/sda
To disable Smart Capability for the disk /dev/sda
# smartctl -s off /dev/sda
Use Smartmontools on a regular drive or software RAID
# smartctl -i -a /dev/sda
Below is example output for an SSD drive
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: SAMSUNG MZ7LM480HCHP-00003
Serial Number: S1YJNXAH102923
LU WWN Device Id: 5 002538 c40146fa4
Firmware Version: GXT3003Q
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Oct 27 08:34:29 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
.........
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 29238
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 78
177 Wear_Leveling_Count 0x0013 092 092 005 Pre-fail Always - 543
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 2431
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 066 051 000 Old_age Always - 34
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
202 Exception_Mode_Status 0x0033 100 100 010 Pre-fail Always - 0
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 66
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 271275742255
242 Total_LBAs_Read 0x0032 099 099 000 Old_age Always - 73508579082
243 SATA_Downshift_Ct 0x0032 100 100 000 Old_age Always - 0
244 Thermal_Throttle_St 0x0032 100 100 000 Old_age Always - 0
245 Timed_Workld_Media_Wear 0x0032 100 100 000 Old_age Always - 65535
246 Timed_Workld_RdWr_Ratio 0x0032 100 100 000 Old_age Always - 65535
247 Timed_Workld_Timer 0x0032 100 100 000 Old_age Always - 65535
251 NAND_Writes 0x0032 100 100 000 Old_age Always - 565960926336
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 28745 -
# 2 Extended offline Completed without error 00% 28634 -
# 3 Extended offline Completed without error 00% 16189 -
# 4 Extended offline Completed without error 00% 7545 -
# 5 Extended offline Completed without error 00% 7531 -
On the Samsung SSD above, Wear_Leveling_Count has a normalized value of 092, which suggests that about 92% of the drive's rated lifetime remains.
Power_On_Hours is 29238, which means the SSD has been powered on for 29,238 hours (about 1,218 days).
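The hours-to-days conversion is easy to get wrong by eye, so here is a small sketch using shell integer arithmetic. The helper name hours_to_days is made up for this example, and it assumes the raw value really is in hours (as noted elsewhere on this page, some vendors count in other units).

```sh
# Convert a Power_On_Hours raw value to whole days plus leftover hours.
# (hours_to_days is a hypothetical helper; it assumes the raw value is in hours.)
hours_to_days() {
    printf '%d hours = %d days + %d hours\n' "$1" $(( $1 / 24 )) $(( $1 % 24 ))
}

hours_to_days 29238    # prints: 29238 hours = 1218 days + 6 hours
```

That is a little over three years of powered-on time for the Samsung SSD above.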
How to use Smartmontools on an HP Smart Array RAID controller
# smartctl -a -d cciss,0 /dev/sda
# smartctl -a -d cciss,1 /dev/sda
Example output for a SAS drive on an HP Smart Array RAID controller
# smartctl -a -d cciss,0 /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HP
Product: EH0300FBQDD
Revision: HPD2
Compliance: SPC-3
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5005952102f
Serial number: 6XN1RFAY0000B303B3TU
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Oct 27 10:56:56 2019 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 35 C
Drive Trip Temperature: 65 C
Manufactured in week 32 of year 2012
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 120
Elements in grown defect list: 76
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 10 0 10687 0 251507.146 0
write: 0 0 0 0 0 53598.375 0
verify: 0 0 0 0 0 4474.826 0
Non-medium error count: 328
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Completed - 47569 - [- - -]
# 2 Background short Completed - 44 - [- - -]
# 3 Background short Completed - 40 - [- - -]
# 4 Background long Completed - 0 - [- - -]
Long (extended) Self-test duration: 1860 seconds [31.0 minutes]
Testing the SSD drive sdb on an HP RAID controller
# smartctl -a -d cciss,4 /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-957.21.3.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: VK0480GECQP
Serial Number: S1KGNYAH241630
LU WWN Device Id: 5 002538 50037aa42
Firmware Version: HPG3
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Oct 27 10:59:02 2019 WIB
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2100) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 35) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 002 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 20928
173 Unknown_Attribute 0x0033 098 098 005 Pre-fail Always - 311
175 Program_Fail_Count_Chip 0x0033 100 100 001 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x003b 100 100 097 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 068 053 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0033 100 100 005 Pre-fail Always - 0
202 Unknown_SSD_Attribute 0x0033 100 100 010 Pre-fail Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 17877 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
255 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
How to use Smartmontools on an LSI MegaRAID SAS RAID controller (Dell PERC)
# smartctl -a -d megaraid,0 /dev/sdX
Smartmontools on LSI 3ware SATA RAID controller
# smartctl -a -d 3ware,0 /dev/twX
Smartmontools on Areca SATA[/SAS] RAID controller
# smartctl -a -d areca,0 /dev/sgX
Smartmontools command line on an Adaptec SAS RAID controller
# smartctl -a -d aacraid,H,L,ID /dev/sdX
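When several disks sit behind one controller, the same command is repeated with a different disk number. The sketch below only prints the command lines, so they can be reviewed before being run as root. raid_smart_commands is a hypothetical helper, the disk count is something you supply yourself, and counting from 0 is assumed here because it matches the cciss examples above; other controllers may number their disks differently.

```sh
# Print (without running) one smartctl command per disk behind a RAID controller.
#   $1 = device type used with -d (for example cciss)
#   $2 = device node
#   $3 = number of disks to cover (you have to know this; it is not detected)
# (raid_smart_commands is a made-up helper name for this illustration.)
raid_smart_commands() {
    ctl=$1
    dev=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'smartctl -a -d %s,%d %s\n' "$ctl" "$i" "$dev"
        i=$(( i + 1 ))
    done
}

# Reproduces the two HP Smart Array commands shown earlier.
raid_smart_commands cciss /dev/sda 2
```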
You can read more about Smartmontools at https://www.smartmontools.org
What is S.M.A.R.T.?
S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a technology embedded in storage devices such as hard disk drives and SSDs, whose goal is to monitor their health status.
In practice, S.M.A.R.T. monitors several disk parameters during normal drive operation, like the number of read errors, the drive startup times or even the environmental conditions. Moreover, S.M.A.R.T. can also perform on-demand tests on the drive.
Ideally, S.M.A.R.T. would allow anticipating predictable failures such as those caused by mechanical wearing or degradation of the disk surface, as well as unpredictable failures caused by an unexpected defect. Since drives usually don’t fail abruptly, S.M.A.R.T. gives an option for the operating system or the system administrator to identify soon-to-fail drives so they can be replaced before any data loss occurs.
What isn’t S.M.A.R.T.?
All that seems wonderful. However, S.M.A.R.T. is not a crystal ball. It cannot predict a failure with 100% accuracy, nor can it guarantee that a drive will not fail without any early warning. At best, S.M.A.R.T. should be used to estimate the likelihood of a failure.
Given the statistical nature of failure prediction, the S.M.A.R.T. technology particularly interests companies using a large number of storage units, and field studies have been conducted to estimate how accurately S.M.A.R.T.-reported issues anticipate disk replacement needs in data centers or server farms.
In 2016, Microsoft and The Pennsylvania State University conducted a study focussing on SSDs.
According to that study, it appears some S.M.A.R.T. attributes are good indicators of imminent failure. The paper specifically mentions:
Reallocated (Realloc) Sector Count:
While the underlying technology is radically different, that indicator seems as significant in the SSD world as it was in the hard drive world. It is worth mentioning that, because of the wear-leveling algorithms used in SSDs, when several blocks start failing, chances are many more will fail soon.
Program/Erase (P/E) fail count:
This is a symptom of a problem with the underlying flash hardware where the drive was unable to clear or store data in a block. Because of imperfections in the manufacturing process, a few such errors can be anticipated. However, flash memories have a limited number of clear/write cycles. So, once again, a sudden increase in the number of events might indicate the drive has reached its end-of-life limit, and we can anticipate many more memory cells to fail soon.
CRC and Uncorrectable errors (“Data Error”):
These events can be caused either by storage errors or by issues with the drive’s internal communication link. This indicator takes into account both corrected errors (thus without any issue reported to the host system) and uncorrected errors (thus blocks the drive has reported to the host system as unreadable). In other words, correctable errors are invisible to the host operating system, but they nevertheless impact the drive's performance, since data has to be corrected by the drive firmware and a sector reallocation might occur.
SATA downshift count:
Because of temporary disturbances, issues with the communication link between the drive and the host, or internal drive issues, the SATA interface can switch to a lower signaling rate. Downgrading the link below the nominal link rate has an obvious impact on the observed drive performance. Selecting a lower signaling rate is not uncommon, especially on older drives. So this indicator is most significant when correlated with the presence of one or several of the preceding ones.
According to the study, 62% of the failed SSDs showed at least one of the above symptoms. However, if you reverse that statement, that also means 38% of the failed SSDs showed none of the above symptoms. The study did not mention, though, whether the failed drives exhibited any other S.M.A.R.T.-reported failure. So this cannot be directly compared to the 36% failure-without-prior-notice rate mentioned for hard drives in the Google paper.
The Microsoft/Pennsylvania State University paper does not disclose the exact drive models studied, but according to the authors, most of the drives come from the same vendor and span several generations.
The study noticed significant differences in reliability between the different models. For example, the “worst” model studied exhibits a 20% failure rate nine months after the first reallocation error and up to a 36% failure rate nine months after the first occurrence of data errors. The “worst” model also happens to be the oldest drive generation studied in the paper.
On the other hand, the drives belonging to the youngest generation of devices show failure rates of only 3% and 20%, respectively, for the same errors. It is hard to tell if those figures can be explained by improvements in the drive design and manufacturing process, or if this is simply an effect of drive aging.
Most interestingly, and I gave some possible reasons earlier, the paper mentions that, rather than the raw value, it is a sudden increase in the number of reported errors that should be considered an alarming indicator:
“There is a higher likelihood of the symptoms preceding SSD failures, with an intense manifestation and rapid progression preventing their survivability beyond a few months”
In other words, one occasional S.M.A.R.T. reported error is probably not to be considered as a signal of imminent failure. However, when a healthy SSD starts reporting more and more errors, a short- to mid-term failure has to be anticipated.
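To make that idea concrete, here is a rough sketch that compares two saved attribute tables and reports which error counters grew. Everything specific in it is an assumption for illustration: the helper name raw_value_increases, the default watch list of attribute IDs (5, 187, 197, 198, 199, picked from the tables on this page), and the "newer" sample snapshot, which is invented. It is not the method used in the study.

```sh
# Report watched attributes whose RAW_VALUE grew between two saved attribute
# tables (for example two captures of "smartctl -A" taken some time apart).
#   $1 = older snapshot file, $2 = newer snapshot file
#   $3 = optional space-separated list of attribute IDs to watch
# (raw_value_increases and the default ID list are illustrative choices.)
raw_value_increases() {
    ids=${3:-"5 187 197 198 199"}
    awk -v ids="$ids" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) watch[a[i]] = 1 }
        !($1 in watch) || $3 !~ /^0x/ || NF < 10 { next }
        FNR == NR { old[$1] = $10 + 0; next }
        ($1 in old) && ($10 + 0) > old[$1] {
            printf "%s %s +%.0f\n", $1, $2, ($10 + 0) - old[$1]
        }
    ' "$1" "$2"
}

# Made-up example: the newer snapshot is invented to show a rising counter.
old=$(mktemp)
new=$(mktemp)
cat > "$old" <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
EOF
cat > "$new" <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       11
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
EOF
raw_value_increases "$old" "$new"
rm -f "$old" "$new"
```

Only attribute 5 is reported, with an increase of 8. A single small increase is not alarming in itself; what matters is a counter that keeps climbing from one snapshot to the next.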
But how to know if your hard drive or SSD is healthy? Either to satisfy your curiosity or because you want to start monitoring your drives closely, it is time now to introduce the smartctl monitoring tool:
Using smartctl to Monitor Status of your SSD in Linux
There are several ways to list disks in Linux, but to monitor the S.M.A.R.T. status of your disk, I suggest the smartctl tool, part of the smartmontools package (at least on Debian/Ubuntu).
sudo apt install smartmontools
smartctl is a command-line tool, which is ideal if you want to automate data collection, especially on servers.
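For automation, one useful detail is that smartctl returns a bit mask as its exit status. The sketch below decodes it, with the bit meanings paraphrased from the smartctl man page. decode_smartctl_status is a hypothetical helper name, and the value 96 is just an example input, not a status observed on any drive in this article.

```sh
# Decode the bit mask smartctl returns as its exit status.
# Bit meanings are paraphrased from the smartctl man page ("EXIT STATUS" section).
# (decode_smartctl_status is a made-up helper name for this illustration.)
decode_smartctl_status() {
    status=$1
    bit=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or other ATA command failed" \
        "SMART status check returned DISK FAILING" \
        "a prefail attribute is at or below its threshold" \
        "an attribute was at or below its threshold in the past" \
        "the device error log contains errors" \
        "the self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            printf 'bit %d: %s\n' "$bit" "$msg"
        fi
        bit=$(( bit + 1 ))
    done
}

# Example input only. In a script one might run:
#   smartctl -a /dev/sdX > report.txt; decode_smartctl_status $?
decode_smartctl_status 96
```

With 96 (32 + 64) the function reports bits 5 and 6. A status of 0 prints nothing.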
The first step when using smartctl is to check if your disk has S.M.A.R.T. enabled and is supported by the tool:
sh$ sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Momentus 7200.4
Device Model: ST9500420AS
Serial Number: 5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Mar 12 15:54:43 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
As you can see, my laptop's internal hard drive indeed has S.M.A.R.T. capabilities, and S.M.A.R.T. support is enabled. So, what now about the S.M.A.R.T. status? Are there some errors recorded?
Reporting “all SMART information about the disk” is the job of the -a option:
sh$ sudo smartctl -i -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Momentus 7200.4
Device Model: ST9500420AS
Serial Number: 5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Mar 12 15:56:58 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 110) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 29694249
3 Spin_Up_Time 0x0003 100 098 085 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 095 095 020 Old_age Always - 5413
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 51710773327
9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 26423
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 096 037 020 Old_age Always - 4836
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 072 072 000 Old_age Always - 28
188 Command_Timeout 0x0032 100 096 000 Old_age Always - 4295033738
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 056 042 045 Old_age Always In_the_past 44 (Min/Max 21/44 #22)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 184
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 104
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 395415
194 Temperature_Celsius 0x0022 044 058 000 Old_age Always - 44 (0 13 0 0 0)
195 Hardware_ECC_Recovered 0x001a 050 045 000 Old_age Always - 29694249
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 25131 (246 202 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3028413736
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1613088055
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 ff ff ff 4f 00 00:45:12.580 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.580 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.579 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.571 READ FPDMA QUEUED
60 00 20 ff ff ff 4f 00 00:45:12.543 READ FPDMA QUEUED
Error 2 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:45:09.456 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:09.451 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:45:09.450 WRITE FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:08.878 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:08.856 READ FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 21131 hours (880 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 05:52:18.809 READ FPDMA QUEUED
61 00 00 7e fb 31 45 00 05:52:18.806 WRITE FPDMA QUEUED
60 00 00 ff ff ff 4f 00 05:52:18.571 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 05:52:18.529 FLUSH CACHE EXT
61 00 08 ff ff ff 4f 00 05:52:18.527 WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 10904 -
# 2 Short offline Completed without error 00% 12 -
# 3 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Understanding the output of smartctl command
That is a lot of information, and it is not always easy to interpret. The most interesting part is probably the one labeled “Vendor Specific SMART Attributes with Thresholds”. It reports various statistics gathered by the S.M.A.R.T. device and lets you compare those values (current or all-time worst) with a vendor-defined threshold.
For example, here is how my disk reports reallocated sectors:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
You can see this is a “Pre-fail” attribute. That just means the attribute tracks anomalies. So, if its normalized value drops to or below the threshold, that could be an indicator of imminent failure. The other category is “Old_age”, for attributes that track normal wear.
The last field (here “3”) is the raw value for that attribute as reported by the drive. Usually, this number has a physical significance. Here, it is the actual number of reallocated sectors. However, for other attributes, it could be a temperature in degrees Celsius, a time in hours or minutes, or the number of times the drive has encountered a specific condition.
In addition to the raw value, a S.M.A.R.T.-enabled drive must report “normalized” values (fields VALUE, WORST and THRESH). These values are normalized in the range 1-254 (0-255 for the threshold). The disk firmware performs that normalization using some internal algorithm. Moreover, different manufacturers may normalize the same attribute differently. Most values are reported as a percentage, higher being better, but this is not mandatory. When a normalized value is lower than or equal to the manufacturer-supplied threshold, the disk is said to have failed for that attribute. With all the reservations mentioned in the first part of this article, when a “pre-fail” attribute has failed, a disk failure is presumably imminent.
As a second example, let’s examine the “seek error rate”:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 51710773327
Actually, and this is a problem with S.M.A.R.T. reporting, the exact meaning of each value is vendor-specific. In my case, Seagate is using a logarithmic scale to normalize the value. So “71” means roughly one error per 10 million seeks (10 to the power 7.1). Amusingly enough, the all-time worst was one error per 1 million seeks (10 to the power 6.0). If I interpret that correctly, it means my disk heads are more accurately positioned now than they were in the past. I did not follow that disk closely, so this analysis should be taken with caution. Maybe the drive just needed a running-in period when it was initially commissioned? Or maybe this is a consequence of mechanical parts wearing, and thus offering less friction today? In any case, and whatever the reason, this value is more a performance indicator than an early warning of failure. So it does not bother me much.
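For the curious, that reading of the logarithmic scale can be turned into numbers with a one-line awk call. This is only a sketch of the interpretation given above (N = 10 to the power VALUE/10), not a documented Seagate formula, and seeks_per_error is a name made up for the example.

```sh
# Under the interpretation above, a normalized value V corresponds to
# about one error per N seeks, with N = 10^(V/10).
# (seeks_per_error is a hypothetical helper; the formula is the article's reading,
#  not an official vendor specification.)
seeks_per_error() {
    awk -v v="$1" 'BEGIN { printf "%.0f\n", 10 ^ (v / 10) }'
}

seeks_per_error 71    # current value: prints 12589254
seeks_per_error 60    # all-time worst: prints 1000000
```

So “roughly 10 million” is closer to 12.6 million seeks per error for the current value, against exactly 1 million for the all-time worst.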
Besides that, and three suspect errors recorded about six months ago, that drive appears to be in surprisingly good condition (according to S.M.A.R.T.) for a stock laptop drive that has been powered on for more than 1100 days (26423 hours):
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 26423
Out of curiosity, I ran the same test on a much more recent laptop equipped with an SSD:
sh$ sudo smartctl -i /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA THNSNK256GVN8
Serial Number: 17FS131LTNLV
LU WWN Device Id: 5 00080d 9109b2ceb
Firmware Version: K8XA4103
User Capacity: 256 060 514 304 bytes [256 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: M.2
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Mar 13 01:03:23 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
The first thing to notice is that, even though the device is S.M.A.R.T.-enabled, it is not in the smartctl database. That won’t prevent the tool from gathering data from the SSD, but it will not be able to report the exact meaning of the different vendor-specific attributes:
sh$ sudo smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 11) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 050 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0013 100 100 050 Pre-fail Always - 0
7 Unknown_SSD_Attribute 0x000b 100 100 050 Pre-fail Always - 0
8 Unknown_SSD_Attribute 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 171
10 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 105
166 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 100
170 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 0
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 0
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 18
194 Temperature_Celsius 0x0023 063 032 020 Pre-fail Always - 37 (Min/Max 11/68)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
240 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This is typically the output you can expect for a brand-new SSD. Because of the lack of normalization or meta-information for vendor-specific data, many attributes are reported as “Unknown_SSD_Attribute” or “Unknown_Attribute”. I can only hope future versions of smartctl will incorporate data for that particular drive model into the tool’s database, so I can identify possible issues more accurately.
Test your SSD in Linux with smartctl
Until now we have examined the data collected by the drive during its normal operation. However, the S.M.A.R.T. protocol also supports several “self-test” commands to launch diagnostics on demand.
Unless explicitly requested otherwise, self-tests can run during normal disk operation. Since the test and the host I/O requests compete for the drive, disk performance will degrade during the test. The S.M.A.R.T. specification defines several kinds of self-test. The most important are:
Short self-test (-t short)
This test checks the electrical and mechanical performance as well as the read performance of the drive. The short self-test typically requires only a few minutes to complete (usually 2 to 10).
Extended self-test (-t long)
This test takes one or two orders of magnitude longer to complete. It is usually a more in-depth version of the short self-test. In addition, it scans the entire disk surface for data errors with no time limit, so the test duration is proportional to the disk size.
Conveyance self-test (-t conveyance)
This test suite is designed as a relatively quick way to check for possible damage incurred while transporting the device.
Here are examples taken from the same disks as above. I’ll let you guess which is which:
sh$ sudo smartctl -t short /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2018
Use smartctl -X to abort test.
The test has now been started. Let’s wait until completion to show the outcome:
sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 171 -
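Rather than hard-coding the sleep duration, a script could read the estimate that smartctl prints when it starts the test. The following is only a minimal sketch, assuming the “Please wait N minutes” line keeps the wording shown above; selftest_wait_minutes is a hypothetical helper name, not something shipped with smartmontools.

```shell
# Hypothetical helper (not part of smartmontools): read the output of
# "smartctl -t ..." on stdin and print the number of minutes to wait.
# Assumes the "Please wait N minutes for test to complete." wording.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes.*$/\1/p'
}

# Possible usage (needs root and a real drive, so left commented out):
#   minutes=$(sudo smartctl -t short /dev/sdb | selftest_wait_minutes)
#   sleep $((minutes * 60)) && sudo smartctl -l selftest /dev/sdb
```

The helper prints nothing if the line is missing, so a caller should check for an empty result before sleeping.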
Now let’s run the same test on my other disk:
sh$ sudo smartctl -t short /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 21:59:39 2018
Use smartctl -X to abort test.
Once again, sleep for two minutes and display the test outcome:
sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 26429 -
# 2 Short offline Completed without error 00% 10904 -
# 3 Short offline Completed without error 00% 12 -
# 4 Short offline Completed without error 00% 0 -
Interestingly, in that case, it appears both the drive and the computer manufacturers have performed some quick tests on the disk (at lifetime 0h and 12h). I was definitely much less concerned with monitoring the drive health myself. So, since I am running some self-tests for this article, let’s start an extended test to see how it goes:
sh$ sudo smartctl -t long /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 110 minutes for test to complete.
Test will complete after Tue Mar 13 00:09:08 2018
Use smartctl -X to abort test.
Apparently, this time we will have to wait much longer than for the short test. So let’s do it:
sh$ sudo bash -c 'sleep $((110*60)) && smartctl -l selftest /dev/sdb'
[sudo] password for sylvain:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 20% 26430 810665229
# 2 Short offline Completed without error 00% 26429 -
# 3 Short offline Completed without error 00% 10904 -
# 4 Short offline Completed without error 00% 12 -
# 5 Short offline Completed without error 00% 0 -
In that latter case, pay special attention to the different outcomes obtained with the short and extended tests, even though they were performed one right after the other. Well, maybe that disk is not that healthy after all! An important thing to notice is that the test stops after the first read error. So if you want an exhaustive diagnosis of all read errors, you will have to continue the test after each error. I encourage you to take a look at the very well-written smartctl(8) manual page for more information about the options -t select,N-max and -t select,cont for that:
sh$ sudo smartctl -t select,810665230-max /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN STARTING_LBA ENDING_LBA
0 810665230 976773167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Selective offline Completed without error 00% 26432 -
# 2 Extended offline Completed: read failure 20% 26430 810665229
# 3 Short offline Completed without error 00% 26429 -
# 4 Short offline Completed without error 00% 10904 -
# 5 Short offline Completed without error 00% 12 -
# 6 Short offline Completed without error 00% 0 -
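The starting LBA used for the selective test above is simply the LBA of the first error plus one. As a sketch only, assuming the self-test log keeps the column layout shown here, a small awk filter could compute the -t argument from the log; next_select_span is a hypothetical helper name.

```shell
# Hypothetical helper: read "smartctl -l selftest" output on stdin and print
# the "-t" argument that resumes just past the most recent read failure.
# Prints nothing when no read failure is logged.
next_select_span() {
    # The newest log entry comes first, so stop at the first match.
    # %.0f is used instead of %d so large LBAs are not truncated by some awks.
    awk '/read failure/ && $NF ~ /^[0-9]+$/ {
        printf "select,%.0f-max\n", $NF + 1
        exit
    }'
}

# Possible usage (needs root and a real drive, so left commented out):
#   span=$(sudo smartctl -l selftest /dev/sdb | next_select_span)
#   [ -n "$span" ] && sudo smartctl -t "$span" /dev/sdb
```

With the log shown above, the filter would print select,810665230-max, which matches the command that was run by hand.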
Conclusion
Definitely, S.M.A.R.T. reporting is a technology you can add to your tool chest to monitor your servers’ disk health. In that case, you should also take a look at the S.M.A.R.T. Disk Monitoring Daemon smartd(8), which can help you automate monitoring through syslog reporting.
Given the statistical nature of failure prediction, however, I am a little less convinced that aggressive S.M.A.R.T. monitoring is of great benefit on a personal computer. Finally, don’t forget that, whatever its technology, a drive will eventually fail, and as we have seen earlier, in one-third of cases it will fail without prior notice. So nothing replaces RAID and offline backups to ensure your data integrity!
What is SMART?
SMART (or S.M.A.R.T.) stands for Self-Monitoring, Analysis and Reporting Technology. It is basically a system that collects information about a hard disk drive (HDD) or solid state drive (SSD), and allows you to run some tests on the drive to determine its approximate health.
It is important to note that SMART is far from perfect. Although a failed “Pre-fail” SMART attribute predicts failure, having no failed attributes does NOT mean the drive is not failing. A drive can be failing while all of its attributes are still above threshold. This leads us to the next section: backing up your data.
Backing up your data
According to CERT you should follow the 3-2-1 rule:
3 - Keep 3 copies of any important file: 1 primary and 2 backups.
2 - Keep files on 2 different media types to protect against different types of hazards.
1 - Store 1 copy offsite (e.g. outside your home or business facility).
In summary, keep 3 copies: 1 primary, 1 onsite backup, and 1 offsite backup. This is of critical importance because your device can fail at any time, without warning and for various reasons. Backing up your data is the only way to be reasonably sure that you won’t lose it. You CANNOT rely on SMART to tell you reliably, and early enough to save your data, when your HDD is going to fail.
SMART Attributes
In order to be able to use SMART you need:
- An HDD or SSD that supports SMART
- SMART enabled in the UEFI/BIOS
- Software to interface with SMART
Some commonly used software to interface with SMART is smartmontools, or you can find individual manufacturer’s utilities on UBCD. Some people prefer smartmontools because it is easily accessible from the command line. Others prefer the manufacturer’s utilities because they sometimes have more features than smartmontools. Which is better is mostly down to user preference and the details of the situation. For this article we will focus on smartmontools and more specifically smartctl.
In order to display the SMART attributes with smartmontools you need to run the following as root:
smartctl -a /dev/sda
Note that we will be assuming that /dev/sda is your HDD/SSD device node. In many cases this is the first HDD/SSD on the system, but you need to double check to make sure it is the HDD/SSD you are interested in.
The output will be something like:
bash-4.2# smartctl -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.63] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST1000DM003-1CH162
Serial Number: Z1D6DR9C
LU WWN Device Id: 5 000c50 064a62447
Firmware Version: CC49
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ACS-2 (unknown minor revision code: 0x001f)
Local Time is: Sun Jan 4 16:02:08 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 111) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 168101376
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 425
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 9675211
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3982
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 433
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 071 063 045 Old_age Always - 29 (Min/Max 20/29)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 25
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 504
194 Temperature_Celsius 0x0022 029 040 000 Old_age Always - 29 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 12154757451688
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 14098900823
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 800819281
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3005 -
# 2 Extended offline Completed without error 00% 2008 -
# 3 Extended offline Completed without error 00% 1014 -
# 4 Extended offline Completed without error 00% 13 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This is just an example from my current HDD. Technically smartctl -a lists everything, not just attributes, but the whole output is more useful than the attributes alone. Two things to note in the output are that SMART support is available and that it is enabled. If it is not available, your device may not support SMART, which can occur if this is an external HDD in a cheap enclosure or if the device is not an HDD/SSD. If it is not enabled, go into your UEFI/BIOS settings and enable it. Also note the line SMART overall-health self-assessment test result: PASSED; it should read PASSED unless your HDD is failing.
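For scripts, the same verdict is available without parsing text, because smartctl reports problems through a bit mask in its exit status (described in the exit status section of man smartctl). The following is a minimal sketch that decodes that bit mask; describe_smartctl_status is a hypothetical helper name and the short descriptions are my own paraphrases of the man page.

```shell
# Hypothetical helper: decode the bit mask that smartctl returns as its
# exit status. Descriptions are paraphrased from smartctl(8).
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command failed" \
        "SMART status: DISK FAILING" \
        "prefail attribute at or below threshold" \
        "attribute at or below threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test one bit of the status at a time, lowest bit first.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Possible usage (needs root and a real drive, so left commented out):
#   smartctl -H /dev/sda; describe_smartctl_status $?
```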
Note the line Auto Offline Data Collection: Enabled; this is a feature that is enabled by default on modern internal HDDs. man smartctl explains what this feature does and how to enable it:
-o VALUE, --offlineauto=VALUE [ATA only] Enables or disables SMART automatic offline test, which scans the drive every four hours for disk defects. This command can be given during normal system operation. The valid arguments to this option are on and off.
This also updates attributes that are marked Offline. Unlike Always-updated attributes, Offline attributes are only updated if this feature is enabled or if you run a SMART test.
Note also that the approximate times for running various tests are listed. We will discuss SMART tests in the next section.
As for the attributes, their meaning is summarized in man smartctl:
Each Attribute also has a Threshold value (whose range is 0 to 255) which is printed under the heading "THRESH". If the Normalized value is less than or equal to the Threshold value, then the Attribute is said to have failed. If the Attribute is a pre-failure Attribute, then disk failure is imminent.
The Attribute table printed out by smartctl also shows the "TYPE" of the Attribute. Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
If the Attribute's current Normalized value is less than or equal to the threshold value, then the "WHEN_FAILED" column will display "FAILING_NOW". If not, but the worst recorded value is less than or equal to the threshold value, then this column will display "In_the_past". If the "WHEN_FAILED" column has no entry (indicated by a dash: '-') then this Attribute is OK now (not failing) and has also never failed in the past.
Thus, the most important attributes are those marked Pre-fail. If the value of a Pre-fail attribute is at or below its threshold, the attribute is failing, which implies that the HDD is failing. A failing attribute will be marked FAILING_NOW or In_the_past if it is failing now or has failed in the past, respectively. Old_age attribute failures do NOT necessarily mean imminent failure, but rather that the drive is getting old and should be monitored more carefully or replaced at some point.
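The rule quoted from the man page can be restated as a small filter over the attribute table. This is only a sketch, assuming the ten-column layout printed by smartctl -A or smartctl -a; failed_attributes is a hypothetical helper name, and smartctl already shows the same verdict in the WHEN_FAILED column.

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and list the
# attributes whose normalized VALUE (or WORST) is at or below THRESH.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failed_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ {
        value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
        if (value <= thresh) { state = "FAILING_NOW" }
        else if (worst <= thresh) { state = "In_the_past" }
        else { next }
        # Print ID, name, type (Pre-fail or Old_age) and the derived state.
        print $1, $2, $7, state
    }'
}

# Possible usage (needs root and a real drive, so left commented out):
#   smartctl -A /dev/sda | failed_attributes
```

An empty result means no attribute is at or below its threshold, which, as noted earlier, does not prove the drive is healthy.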
For the exact meaning of each attribute, please see the Wiki page. Some specific attributes that I would like to discuss are as follows:
#4 Start_Stop_Count and #12 Power_Cycle_Count and #193 Load_Cycle_Count
These attributes are important for laptop HDDs, because they default to powering down when not in use. Although laptop HDDs are designed to spin up and down more often than desktop HDDs, and these are Old_age attributes, each cycle still wears the drive. Unless you run on battery all the time, you may want to consider turning off this feature by adding the following to a boot script such as /etc/rc.d/rc.local:
hdparm -B 254 /dev/sda
#9 Power_On_Hours
This is the age of the drive in hours. This is rather important because it tells you how old the drive is and thus how likely it is to fail. HDD failure among other things follows the Bathtub curve. As such, the highest failure rate is among very young (infant mortality) and very old (worn out) drives. This is important because I hear many people saying, “Oh, but the drive is brand new, it can’t be failing.” Wrong, a new drive is more likely to fail than a middle-aged drive, much like an old drive.
#174 Unexpected power loss count and #192 Power-Off_Retract_Count
Sudden power loss is detrimental to both HDDs and SSDs. For this reason, among many others, systems that are on all the time should be protected by a UPS. Also make sure to shut down your computer properly whenever possible to prevent damage and data loss.
#190 Airflow_Temperature_Cel and #194 Temperature_Celsius
Although many people believe that HDDs should be kept cool and are sensitive to heat, a large Google internal study suggests that high temperatures are only significantly detrimental to old HDDs.
Bad Blocks (#5, 196, 197, 198)
Bad blocks are basically areas of the disk surface that are damaged and can no longer hold data reliably. Internally, the HDD/SSD deals with these by marking them and remapping/reallocating them to other areas. Bad blocks increase with the age of the drive, and you can expect to encounter them on every HDD and SSD. The question is: when does this become something to be concerned about? That is hard to say, and in general you will have to deal with each device on an individual basis. A large increase in the number of bad blocks could mean the drive is nearing its end. Keep monitoring the Pre-fail attributes and decide when to replace it.
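One way to watch for such an increase is to record the raw values of the bad-block attributes from time to time and compare them. The sketch below assumes the usual ten-column attribute table and that the raw value is a plain count, which is common but vendor-dependent; bad_block_counts is a hypothetical helper name.

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and print the
# ID, name and raw value of the bad-block attributes (5, 196, 197, 198).
bad_block_counts() {
    # The FLAG check ($3 starts with 0x) skips unrelated lines of "smartctl -a"
    # output that also happen to begin with a number.
    awk '$3 ~ /^0x/ && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
        print $1, $2, $10
    }'
}

# Possible usage (needs root and a real drive, so left commented out):
#   smartctl -A /dev/sda | bad_block_counts >> /var/log/bad-blocks.log
```

The log file path in the usage comment is only an illustration.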
SMART Tests
There are 3 main types of SMART tests that you can perform.
- short: a superficial test that checks electrical and mechanical performance and updates offline attributes
- conveyance: identifies damage incurred during transport (mostly useful for external or laptop HDDs)
- long: a short test plus a scan of the disk surface for bad blocks
These tests are run with the -t option, for example:
smartctl -t long /dev/sda
These tests can all be run on a running system without major side-effects. If you want the long test to finish, you should minimize HDD usage while it runs, as it has to scan the whole disk.
After waiting for the test to finish, you can get the results using the -a option as shown in the previous section.
Short and Conveyance tests should always pass. If these fail, check the attributes as the drive is probably failing. A long test can fail if there are bad blocks, and this does NOT mean the drive is failing. The long test stops when it finds an error on the disk, so if there is a bad block it just stops. This doesn’t mean the drive is failing, but you will have to wait for the HDD to remap/reallocate the block, or technically you could try to force it to do so:
http://www.smartmontools.org/browser/trunk/www/badblockhowto.xml
However, this method is difficult to implement safely, so you should usually just wait for the HDD to remap/reallocate.
How often should you run these tests? That depends. If you run a server then more often is better; the smartmontools site recommends weekly tests. As a home user, I usually run a long test every 1000 power-on hours, but that is up to you and also depends on the details of the drive and situation.
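The “every 1000 power-on hours” habit comes down to comparing two numbers that smartctl -a already prints: the raw value of Power_On_Hours and the LifeTime(hours) of the newest entry in the self-test log. Here is a minimal sketch of that comparison; long_test_due is a hypothetical helper name, and the 1000-hour default is just the personal interval mentioned above.

```shell
# Hypothetical policy helper: succeed (exit status 0) when at least INTERVAL
# power-on hours have passed since the last logged self-test.
#   $1 = current Power_On_Hours raw value
#   $2 = LifeTime(hours) of the most recent self-test
#   $3 = interval in hours (optional, defaults to 1000)
long_test_due() {
    power_on_hours=$1
    last_test_hours=$2
    interval=${3:-1000}
    [ $((power_on_hours - last_test_hours)) -ge "$interval" ]
}

# Possible usage with the numbers from the example output (3982 and 3005):
#   long_test_due 3982 3005 && smartctl -t long /dev/sda
```

With the example drive, 3982 - 3005 is 977 hours, so the helper would not start a new long test yet.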
Is my drive failing?
A failing drive is defined as:
- Having a Pre-fail attribute below or near threshold, marked FAILING_NOW or In_the_past.
- Having an Old_age attribute below or near threshold, marked FAILING_NOW or In_the_past, PLUS other signs of failure such as consistent failure of SMART tests, strange noises, slowing down, corrupt data, etc.
A failed long test does NOT mean your drive is failing; it could be just bad blocks. See the previous section.
Do not ignore your senses: if the HDD sounds unusual or makes strange noises, monitor it closely and/or replace it. Again, SMART cannot tell you with great accuracy if or when the drive will fail. The drive can fail with above-threshold attributes and minimal signs. The only way to keep your data safe is to back it up using the 3-2-1 strategy mentioned above.
smartd
What is smartd? It is a daemon that monitors SMART. So if you don’t want to monitor and run tests manually, you can set up smartd to do so on a regular basis. Refer to man smartd, man smartd.conf and /etc/smartd.conf for everything you need to know about setting up smartd to do what you want it to do.
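As a starting point, a smartd.conf entry that monitors one drive and schedules regular self-tests might look like the line printed by the sketch below. The directives (-a, -o, -S, -s, -m) are documented in man smartd.conf; the schedule, the mail address and the helper name smartd_conf_entry are assumptions chosen for illustration, so adapt them to your system.

```shell
# Hypothetical helper: print a smartd.conf entry for one device.
#   -a          monitor all SMART properties
#   -o on       enable automatic offline data collection
#   -S on       enable attribute autosave
#   -s REGEXP   self-test schedule in T/MM/DD/d/HH form:
#               short test every day at 02:00, long test on Saturdays at 03:00
#   -m ADDRESS  send warning mail to this address
smartd_conf_entry() {
    device=$1
    mailto=${2:-root}
    printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m %s\n' \
        "$device" "$mailto"
}

# Possible usage (review the result before adding it to /etc/smartd.conf):
#   smartd_conf_entry /dev/sda
```

Printing the line rather than writing the file directly leaves the decision to edit /etc/smartd.conf with you.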
Sources
- man smartctl