POST error 1793: Slot X Drive Array, data in cache has been lost


Contents

  1. Veeam R&D Forums
  2. Attn HP guys, big time bug in Smart Array firmware 4.52
  3. Attn HP guys, big time bug in Smart Array firmware 4.52
  4. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  5. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  6. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  7. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  8. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  9. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  10. Re: Attn HP guys, big time bug in Smart Array firmware 4.52
  11. BBWC: in theory a good idea but has one ever saved your data?

Veeam R&D Forums

Technical discussions about Veeam products and related data center technologies

Attn HP guys, big time bug in Smart Array firmware 4.52


Post by kubimike » Mar 21, 2017 8:39 pm

Huge for us HP guys: KNOWN FLAW IN 4.52 FIRMWARE for PX4X CONTROLLERS!! Do not use 4.52. It’s only included in the SPP for HP and has been removed from their website. Please download and install 4.58. I just did, and I’m going to let it idle and see if the 0x13 error recurs!

http://h20564.www2.hpe.com/hpsc/doc/pub … -c05352202

I downloaded and tried the Windows installer, but it doesn’t work on Windows 2016. I had to boot from the latest SPP (871790_001_spp-2016.10.0-SPP2016100.2016_1015.191.iso) and use this procedure with the Linux RPM firmware update file:

How to:
http://h20564.www2.hpe.com/hpsc/doc/pub … kc-0132754

Events under System will show Event IDs 5001, 5002 and 5006.
Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) Action: Install the latest controller firmware. If the problem persists, replace the controller.
I’d also like to mention that I only updated the controller showing the error messages. I have two other controllers that are still on version 4.52.
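For anyone collecting these messages from several servers, here is a minimal Python sketch that pulls the POST code, slot and lockup code out of a line like the one quoted above. The regular expressions are an assumption based only on the wording seen in this thread; other firmware or ROM versions may phrase the message differently, and `parse_post_error` is a hypothetical helper, not an HP tool.

```python
import re

# Patterns derived only from the message text quoted in this thread (assumption).
# The separator after the code may be a hyphen, en dash or em dash.
POST_RE = re.compile(
    r"(?P<code>\d{4})\s*[-\u2013\u2014]\s*Slot\s+(?P<slot>\w+)\s+Drive Array"
)
LOCKUP_RE = re.compile(
    r"lock\s?up code\s*=\s*(?P<lockup>0x[0-9A-Fa-f]+)", re.IGNORECASE
)


def parse_post_error(line):
    """Return code, slot and lockup code from a POST message, or None if no match."""
    match = POST_RE.search(line)
    if match is None:
        return None
    lockup = LOCKUP_RE.search(line)
    return {
        "code": int(match.group("code")),
        "slot": match.group("slot"),
        "lockup": lockup.group("lockup").lower() if lockup else None,
    }


if __name__ == "__main__":
    sample = (
        "Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure "
        "event occurred prior to this power-up. (Previous lock up code = 0x13)"
    )
    print(parse_post_error(sample))
```

A result with `lockup` set to `0x13` would correspond to the failure described in this post.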

Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by Mike Resseler » Mar 22, 2017 7:19 am

Thanks for this update and helping out other members. Really appreciated!

Another Mike

Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by WimVD » Mar 22, 2017 10:20 am

Thanks Mike(s), I’m running P441 controllers with 4.52. I’ve had no issues, but I’m upgrading to 4.58 just to be sure.
It is somewhat concerning that HPE has pulled the updates from their website. The latest available firmware at the moment is 4.02.

Btw, I had no issues using the Windows installer on Server 2016 to update my firmware. I did use «Run as administrator».
Maybe that helped.

Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by Thrawn » Apr 23, 2017 8:20 pm 2 people like this post


BBWC: in theory a good idea but has one ever saved your data?

I’m familiar with what a BBWC (battery-backed write cache) is intended to do, and I previously used them in my servers even with a good UPS. There are obviously failures it does not provide protection for. I’m curious to understand whether it actually offers any real benefit in practice.

(NB I’m specifically looking for responses from people who have BBWC and had crashes/failures and whether the BBWC helped recovery or not)

Update

After the feedback here, I’m increasingly skeptical as to whether a BBWC adds any value.

To have any confidence about data integrity, the filesystem MUST know when data has been committed to non-volatile storage (not necessarily the disk — a point I’ll come back to). It’s worth noting that a lot of disks lie about when data has been committed to the disk (http://brad.livejournal.com/2116715.html). While it seems reasonable to assume that disabling the on-disk cache might make the disks more honest, there’s still no guarantee that this is the case either.
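To make the "committed to non-volatile storage" requirement concrete, here is a minimal Python sketch of how an application asks the OS for that guarantee. `durable_write` is a hypothetical helper name. Note the limit described above: `os.fsync` only returns once the kernel has been told the device accepted the data, so a disk or controller that acknowledges early can still break the promise.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device
    # On POSIX systems, syncing the parent directory also makes the new
    # directory entry durable, not just the file contents.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "journal.bin")
    durable_write(target, b"transaction 1 committed")
    print("written:", target)
```

Whether the data really is on non-volatile media when `durable_write` returns depends on every layer below the kernel behaving honestly, which is exactly the point in question here.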

Due to the typically large buffers in a BBWC, a barrier can require significantly more data to be committed to disk, which delays writes. The general advice is to disable barriers when using a non-volatile write-back cache (and to disable on-disk caching). However, this would appear to undermine the integrity of the write operation: just because more data is held in non-volatile storage does not mean that it will be more consistent. Indeed, without demarcation between logical transactions there arguably seems to be less opportunity to ensure consistency than otherwise.

If the BBWC were to acknowledge barriers at the point the data enters its non-volatile storage (rather than when it is committed to disk), then it would appear to satisfy the data integrity requirement without a performance penalty, implying that barriers should still be enabled. However, since these devices generally exhibit behaviour consistent with flushing the data to the physical device (significantly slower with barriers), and given the widespread advice to disable barriers, they cannot be behaving in this way. WHY NOT?

If the I/O in the OS is modelled as a series of streams, then there is some scope to minimise the blocking effect of a write barrier when write caching is managed by the OS, since at this level only the logical transaction (a single stream) needs to be committed. On the other hand, a BBWC with no knowledge of which bits of data make up the transaction would have to commit its entire cache to disk. Finding out whether the kernel/filesystems actually implement this in practice would require a lot more effort than I’m willing to invest at the moment.

A combination of disks telling fibs about what has been committed and sudden loss of power undoubtedly leads to corruption, and with a journalling or log-structured filesystem, which doesn’t do a full fsck after an outage, it’s unlikely that the corruption will be detected, let alone that an attempt will be made to repair it.

In terms of the modes of failure, in my experience most sudden power outages occur because of loss of mains power (easily mitigated with a UPS and managed shutdown). People pulling the wrong cable out of a rack implies poor datacentre hygiene (labelling and cable management). There are some types of sudden power loss event which are not prevented by a UPS, such as a failure in the PSU or VRM. A BBWC with barriers would provide data integrity in the event of a failure here; however, how common are such events? Very rare, judging by the lack of responses here.

Certainly, moving the fault tolerance higher in the stack is significantly more expensive than a BBWC; however, implementing a server as a cluster has lots of other benefits for performance and availability.

An alternative way to mitigate the impact of sudden power loss would be to implement a SAN — AoE makes this a practical proposition (I don’t really see the point in iSCSI) but again there’s a higher cost.



1792-Slot X Drive Array - Valid Data Found in Write-Back Cache…

…Data will automatically be written to drive array.

Audible Beeps: None

Possible Cause: Power was interrupted while data was in the write-back cache. Power was then restored within several days, and the data in the write-back cache was flushed to the drive array.

Action: No action is required. No data has been lost. Perform orderly system shutdowns to avoid leaving data in the write-back cache.

1793-Slot X Drive Array - Data in Write-Back Cache has been Lost…

…(plus one of the following:)

  * Battery Pack Charge Depleted
  * Battery Pack Disconnected
  * Write-Back Cache Backup Failed
  * Write-Back Cache Restore Failed

Audible Beeps: None

Possible Cause: Power was interrupted while data was in the write-back cache, or the battery pack batteries failed. Data in the write-back cache has been lost.

Action:

  1. Verify the integrity of the data stored on the drive. Power was not restored within enough time to save the data.
  2. Perform orderly system shutdowns to avoid leaving data in the write-back cache.
  3. If the data is corrupt, restore previous data backup.
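The manual does not say how to verify integrity. One possible approach, sketched below in Python, is to compare SHA-256 checksums of the files on the volume against a manifest recorded while the data was known to be good. This assumes such a manifest exists; `build_manifest` and `find_damaged` are hypothetical helper names, not part of any HP utility.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root, manifest):
    """Return relative paths that are missing or whose digest no longer matches."""
    root = Path(root)
    damaged = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            damaged.append(rel)
    return damaged
```

Files reported by `find_damaged` are the candidates for step 3, restoring from backup. Files that were legitimately modified after the manifest was taken will also show up, so the manifest needs to be reasonably recent.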

1794-Slot X Drive Array - Cache Module Battery Pack is Charging…

…Caching will be enabled once the Battery Pack has been charged. No action is required.

Audible Beeps: None

Possible Cause: The Cache Module Battery Pack is charging.

Action: No action is required.

1794-Slot X Drive Array - Cache Module Battery Pack/Super-Cap Removed or Not Installed…

…Caching will be re-enabled when Battery Pack/Super-Cap is connected.

HP ProLiant server errors 88

Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by kubimike » Mar 22, 2017 11:10 am

@WimVD, yeah, I tried running ‘As Administrator’ as well; the .EXE would just crash. The crash was visible in the event log. I ran into the lock-up errors on my P841; my P441s haven’t exhibited the problem yet. My P841 is connected to external storage via SAS. Now HP needs to pull 4.52 from the SPP!


Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by kubimike » Mar 22, 2017 2:09 pm

Bad news: the controller went offline this morning with the same 0x13 stop error. I called HP back; they have another customer with the same issue. Their case has been escalated to a level 2 engineer. I should get a call shortly with a new plan of action. What a nightmare this has been.


Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by kubimike » Apr 03, 2017 1:52 pm

More info on this: after 5 days of open cases, HP is now telling me that there is another bug in the firmware. If you’re experiencing the volume suddenly disappearing, turn off surface scans for now. I’ve been told another firmware will be released in April to address this issue.



Re: Attn HP guys, big time bug in Smart Array firmware 4.52

Post by WimVD » Apr 24, 2017 6:58 am

That’s quite a list of critical fixes.
Thanks for the heads up. Time to update.

The following is a complete listing of the fixes included in version 5.04:

Kernel core dump using kdump might not complete in Linux when using Smart Array Gen9 firmware version 4.52.
Intermittent memory errors might cause the controller to stop responding. (POST Lockup 0x13)
System might stop responding if a parity error is found during surface scan of a RAID6 volume. (POST Lockup 0x13)
In rare cases, the controller might stop responding while running IO without displaying a lockup code.
Non-Maskable Interrupt (NMI) might occur on systems with a Windows OS and HPE Gen9 Smart Array or Smart HBA adapters and performing continuous reboot testing.
System IO might stop when using certain 6TiB and 8TiB SAS drives in a dual-path configuration with an HPE Gen9 Smart Array or Smart HBA adapter in HBA-mode, possibly resulting in an OS crash.
Physical slot location of SATA drives might not be returned correctly when using Microsoft Storage Spaces Direct (S2D).
Multiple direct attach SATA drives within a cluster might show the same World Wide Name (WWN) when using Microsoft Storage Spaces Direct (S2D).
Drive LED’s might be illuminated for the wrong drive and/or only illuminate momentarily when connected to an expander configuration with an HPE Gen9 Smart Array or Smart HBA adapter in HBA-mode.
Controllers using a 4GB cache module might fail data retention following an unexpected power event. (POST message 1793 — Data in Write-Back Cache has been Lost)
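If you have an inventory of controller firmware versions, a quick check against the fixed release can be sketched in Python as below. Treating every version below 5.04 as needing an update is an assumption on my part: the release notes listed above name only 4.52 explicitly, and this thread reports lockups on 4.52 and 4.58. `parse_version` and `needs_update` are hypothetical helper names.

```python
def parse_version(text):
    """Turn a 'major.minor' firmware string such as '4.52' into (4, 52)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def needs_update(version, fixed_in="5.04"):
    """True if the version sorts below the release carrying the fixes.

    Assumption: all releases older than fixed_in are treated as affected.
    """
    return parse_version(version) < parse_version(fixed_in)


if __name__ == "__main__":
    # Example inventory; the controller names and versions are illustrative only.
    inventory = {"P841 slot 3": "4.58", "P441 slot 1": "4.52", "P441 slot 2": "5.04"}
    for controller, version in inventory.items():
        status = "update" if needs_update(version) else "ok"
        print(f"{controller}: firmware {version} -> {status}")
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes once minor numbers have different lengths.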



  a. Repair the connection and press the F2 key.
  b. If the problem persists, run ADU (»

Be sure the cable is routed properly.

1789-Slot X Drive Array SCSI Drive(s) Not Responding…

…Check cables or replace the following SCSI drives: SCSI Port Y: SCSI ID Z
Select F1 to continue – drive array will remain disabled.
Select F2 to fail drives that are not responding – Interim Recovery Mode will be enabled if configured for fault tolerance.

Audible Beeps: None

Possible Cause: Drives that were working when the system was last used are now missing or are not starting up. A possible drive problem or loose SCSI cable exists.

Action:

  1. Power down the system.
  2. Be sure all cables are properly connected.
  3. Be sure all drives are fully seated.
  4. Power cycle any external SCSI enclosures while the system is off.
  5. Power up the server to see if the problem still exists.
  6. If configured for fault-tolerant operation and the RAID level can sustain failure of all indicated drives:
     a. Press the F2 key to fail the drives that are not responding.
     b. Replace the failed drives.
  7. Press the F1 key to start the system with all logical drives on the controller disabled.

Be sure the system is always powered up and down correctly:

  * When powering up the system, all external storage systems must be powered up before the server.
  * When powering down the system, the server must be powered down before external storage systems.

1792-Drive Array Reports Valid Data Found in Array Accelerator…

…Data will automatically be written to drive array.

Audible Beeps: None

Possible Cause: Power was interrupted while data was in the array accelerator memory. Power was then restored within several days, and the data in the array accelerator was flushed to the drive array.

Action: No action is required. No data has been lost. Perform orderly system shutdowns to avoid leaving data in the array accelerator.

1793-Drive Array - Array Accelerator Battery Depleted - Data Lost. (Error message 1794 also displays.)

Audible Beeps: None

Possible Cause: Power was interrupted while data was in the array accelerator memory, or the array accelerator batteries failed. Data in array accelerator has been lost.

Action:

  1. Verify the integrity of the data stored on the drive. Power was not restored within enough time to save the data.
  2. Perform orderly system shutdowns to avoid leaving data in the array accelerator.
