Btrfs scrub uncorrectable error

I have a question concerning unrecoverable errors on a BTRFS file system. Specifically, I ran a BTRFS scrub recently after experiencing a problem with one of my RAM sticks, and it seems to have turned up some checksum errors.

I also found this thread trying to figure out what to do next after finding BTRFS checksum errors.

The answers listed there that use dmesg didn't work for me; my scrub took a long time, and by the time it finished the kernel ring buffer had rotated, so the oldest messages visible with sudo dmesg were too recent.

One solution could have been to leverage the -W, --follow-new flag to dmesg, something like starting the command below before the scrub:

$ sudo dmesg --follow-new | grep --line-buffered "checksum error at" >> checksum_errors.txt

And then do some post-processing on its output.
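For example, the lines collected that way could be reduced to a unique list of affected paths (a sketch; checksum_errors.txt is the file written by the command above, and the pattern assumes the kernel's usual "(path: …)" suffix):

$ sed -n 's/.*(path: \(.*\))$/\1/p' checksum_errors.txt | sort -u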

However, I found that searching with journalctl's -k/--dmesg and --grep flags was sufficient, and the journal went back far enough to find all the errors I was experiencing. I printed only the offset and the filename (and just ignored the trailing )) with the command below; in my case I had multiple errors in this one file.

$ sudo journalctl --dmesg --grep 'checksum error' | awk '{ print $25, $31 }' | sort -u
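Note that the awk field numbers ($25 and $31) depend on the exact format of the journal lines; a variant keyed on the "(path: " marker may be more robust (an assumption: the messages end with the same "(path: …)" suffix shown in the answer below):

$ sudo journalctl --dmesg --grep 'checksum error' | sed -n 's/.*(path: \(.*\))$/\1/p' | sort -u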

From there, I ran sha256sum on the bad file, confirmed it resulted in an error, and confirmed I had a backup of this file.
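If many files are affected, the same check can be scripted; a minimal sketch, assuming the unique paths were saved to affected_files.txt and that /backup is a hypothetical mount point holding the backup copies (the kernel logs paths relative to the subvolume root):

while IFS= read -r f; do
    # cmp -s is silent; it returns non-zero if the files differ or the read fails
    cmp -s "/$f" "/backup/$f" || echo "differs or unreadable: $f"
done < affected_files.txt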

I ran btrfs scrub and got this:

scrub status for 57cf76da-ea78-43d3-94d3-0976308bb4cc
    scrub started at Wed Mar 15 10:30:16 2017 and finished after 00:16:39
    total bytes scrubbed: 390.45GiB with 28 errors
    error details: csum=28
    corrected errors: 0, uncorrectable errors: 28, unverified errors: 0

OK, I have good backups, and I would like to know which files these 28 errors are in so I can restore them from backup. That would save me a lot of time over wiping and restoring the whole disk.

asked Mar 15, 2017 at 18:04


As @derobert pointed out in the comments, the path is to be found in the output of dmesg and looks like this:

[ 1202.714916] BTRFS warning (device dm-2): checksum error at logical 470470615040 on dev /dev/mapper/a-root, sector 923098608, root 2757, inode 1120855, offset 110592, length 4096, links 1 (path: usr/lib/firmware/iwlwifi-3945-2.ucode)

And this command will print a list of the files to recover from backup:

dmesg | grep -e "BTRFS warning.*path:" | sed -e 's/^.*path: //'
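Once you have that list, the restore itself can be scripted; a minimal sketch, where /mnt and /backup are hypothetical paths (the filesystem mount point and the backup root) and the logged paths are relative to the subvolume root:

dmesg | grep -e "BTRFS warning.*path:" | sed -e 's/^.*path: //' | sort -u |
while IFS= read -r f; do
    cp -a "/backup/$f" "/mnt/$f"    # overwrite the corrupted copy with the good one
done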

answered Mar 15, 2017 at 21:51


You may also be able to use journalctl if you’re on a systemd-based system.

$ sudo journalctl --dmesg --grep 'checksum error'

See my full response at the thread @Livius linked above.

answered Sep 10, 2021 at 14:33


I am on Ubuntu but regularly need the ArchWiki (not just for BTRFS) to figure out what to do, just like in this case. Why does «status» not match what the scrub itself reports below?

$ sudo btrfs scrub start -Bd -c 2 -n 4 /dev/sdd

Scrub device /dev/sdd (id 1) done
Scrub started:    Sun Aug 29 09:24:03 2021
Status:           finished
Duration:         0:48:00
Total to scrub:   478.07GiB
Rate:             141.17MiB/s
Error summary:    read=45
  Corrected:      0
  Uncorrectable:  45
  Unverified:     1107
ERROR: there are uncorrectable errors

Strangely, status does not mention the 1107 unverified errors:

$ sudo btrfs scrub status /dev/sdd
UUID:             8e9f178a-e531-40ce-87a9-801aa11aa4ea
Scrub started:    Sun Aug 29 09:24:03 2021
Status:           finished
Duration:         0:48:00
Total to scrub:   397.04GiB
Rate:             141.17MiB/s
Error summary:    read=45
  Corrected:      0
  Uncorrectable:  45
  Unverified:     0

This is a Samsung 870 EVO 2TB SSD, 2.5″ SATA. I have dup metadata, single data.

The ArchWiki says that when scrub shows uncorrectable errors, you should run:

journalctl --output=cat --grep='BTRFS .* i/o error' | sort | uniq | less

And from the output I can see that all the errors relate to a single .mkv file.
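To pull just the affected paths out of those messages, something like the following could work (a sketch; it assumes the matched lines end with the same "(path: …)" suffix as the checksum warnings shown earlier):

journalctl --output=cat --grep='BTRFS .* i/o error' | sed -n 's/.*(path: \(.*\))$/\1/p' | sort -u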

2 questions:

  1. Why does btrfs scrub status /dev/sdd not show the 1107 Unverified?

  2. Can I simply delete the mkv file and consider it solved? Or do I need to do something to mark the sectors as damaged (like checkdisk in Windows does)?

Hi, after a kernel panic and a hard reset, I ran a BTRFS scrub on my RAID1 cache pool and got 4 uncorrectable errors. I deleted docker.img and ran SMART tests on both SSDs without issue, but upon running the scrub again I still get the same 4 errors. I'm not sure whether it's a hardware problem, or how I should proceed to fix this. Below are the logs; attached are the diagnostics. Many thanks.

Cache 1:

Nov 10 16:13:09 HK-HomeLab kernel: BTRFS warning (device dm-10): checksum error at logical 530594066432 on dev /dev/mapper/sdn1, physical 458653364224, root 5, inode 118530799, offset 28672, length 4096, links 1 (path: appdata/binhex-plexpass/Plex Media Server/Metadata/TV Shows/2/902b34d01d1d00699de74ef41aa4a378087be30.bundle/Contents/com.plexapp.agents.thetvdb/seasons/5/episodes/13/thumbs/66948d605b6a303889512aaaad8668cb4b5d7106)
Nov 10 16:13:09 HK-HomeLab kernel: BTRFS error (device dm-10): bdev /dev/mapper/sdn1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Nov 10 16:13:09 HK-HomeLab kernel: BTRFS error (device dm-10): unable to fixup (regular) error at logical 530594066432 on dev /dev/mapper/sdn1
Nov 10 16:13:09 HK-HomeLab kernel: BTRFS warning (device dm-10): checksum error at logical 530594070528 on dev /dev/mapper/sdn1, physical 458653368320, root 5, inode 118530799, offset 32768, length 4096, links 1 (path: appdata/binhex-plexpass/Plex Media Server/Metadata/TV Shows/2/902b34d01d1d00699de74ef41aa4a378087be30.bundle/Contents/com.plexapp.agents.thetvdb/seasons/5/episodes/13/thumbs/66948d605b6a303889512aaaad8668cb4b5d7106)
Nov 10 16:13:09 HK-HomeLab kernel: BTRFS error (device dm-10): bdev /dev/mapper/sdn1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Nov 10 16:13:09 HK-HomeLab kernel: BTRFS error (device dm-10): unable to fixup (regular) error at logical 530594070528 on dev /dev/mapper/sdn1

Cache 2:

Nov 10 15:59:58 HK-HomeLab kernel: BTRFS warning (device dm-10): checksum error at logical 530594066432 on dev /dev/mapper/sdd1, physical 36651855872, root 5, inode 118530799, offset 28672, length 4096, links 1 (path: appdata/binhex-plexpass/Plex Media Server/Metadata/TV Shows/2/902b34d01d1d00699de74ef41aa4a378087be30.bundle/Contents/com.plexapp.agents.thetvdb/seasons/5/episodes/13/thumbs/66948d605b6a303889512aaaad8668cb4b5d7106)
Nov 10 15:59:58 HK-HomeLab kernel: BTRFS error (device dm-10): bdev /dev/mapper/sdd1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
Nov 10 15:59:58 HK-HomeLab kernel: BTRFS error (device dm-10): unable to fixup (regular) error at logical 530594066432 on dev /dev/mapper/sdd1
Nov 10 15:59:58 HK-HomeLab kernel: BTRFS warning (device dm-10): checksum error at logical 530594070528 on dev /dev/mapper/sdd1, physical 36651859968, root 5, inode 118530799, offset 32768, length 4096, links 1 (path: appdata/binhex-plexpass/Plex Media Server/Metadata/TV Shows/2/902b34d01d1d00699de74ef41aa4a378087be30.bundle/Contents/com.plexapp.agents.thetvdb/seasons/5/episodes/13/thumbs/66948d605b6a303889512aaaad8668cb4b5d7106)
Nov 10 15:59:58 HK-HomeLab kernel: BTRFS error (device dm-10): bdev /dev/mapper/sdd1 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
Nov 10 15:59:58 HK-HomeLab kernel: BTRFS error (device dm-10): unable to fixup (regular) error at logical 530594070528 on dev /dev/mapper/sdd1
 

hk-homelab-diagnostics-20191110-0837.zip

I have two relatively new 4T hard drives (WD Data Center Re WD4000FYYZ) formatted as btrfs with raid1 data and raid1 metadata.

I copied a large binary file to the volume (~76 GB). Soon after copying the file, I ran a btrfs scrub. There were no errors.

A few months later, a scrub returned an unrecoverable error on that file. It has not been modified since it was originally copied. I might add that the SMART attributes for both drives do not indicate any errors (Current_Pending_Sector or otherwise).

The system with the drives does not have ECC memory.

The only explanation I can think of for this kind of error is that, while writing to another file whose data checksums lived in the same block as some of the checksums for the big file, memory corruption allowed bad data to pollute one or more of the big file's checksums.

Unfortunately, in migrating to btrfs I had hoped that once data was loaded and scrubbed successfully, I could be confident it would stay that way as long as it was not written to (in a raid1/5/6 configuration, of course). Obviously, this is not the case.

Can anyone explain how this could have happened? Also, if I had taken a snapshot of the volume that contained the big file, would I still have had access to the original, uncorrupted data from the snapshot?

UPDATE 5: Solution:

The PSU was overloaded after an HDD was swapped for a newer model that uses more power.

I copied the RAID data to an external HDD, removed 2 HDDs, created a new RAID 6 array with a BTRFS filesystem, and copied the data back.

UPDATE 4: btrfs check --repair --init-csum-tree --init-extent-tree /dev/md0

Output after some hours:

enabling repair mode
Creating a new CRC tree
Opening filesystem to check…
Checking filesystem on /dev/md0
UUID: 05bbc4f1-d4d8-4af2-863f-13eb864736b1
Creating a new extent tree
parent transid verify failed on 30490624 wanted 4 found 6049
Ignoring transid failure
Reinitialize checksum tree
ctree.c:2245: split_leaf: BUG_ON `1` triggered, value 1
btrfs(+0x141e9)[0x55709ed041e9]
btrfs(+0x14284)[0x55709ed04284]
btrfs(+0x169ad)[0x55709ed069ad]
btrfs(btrfs_search_slot+0xf24)[0x55709ed07f9f]
btrfs(btrfs_csum_file_block+0x25f)[0x55709ed15888]
btrfs(+0x4aa30)[0x55709ed3aa30]
btrfs(cmd_check+0xf0b)[0x55709ed46af8]
btrfs(main+0x1f3)[0x55709ed03e63]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f46fb63109b]
btrfs(_start+0x2a)[0x55709ed03eaa]
Aborted (SIGABRT)

And still scrub, uncorrectable, csum errors.

UPDATE 3: The PSU's 5 V rail reads 4.78 V, dropping to 4.74 V during a scrub. Plan to change the PSU.

UPDATE 2:

The RAID 6 array is 18 TB, with ~4 TB used.

The uncorrectable scrub csum errors are always in the first 1 TB of data.

I emptied the trash in SMB/CIFS («Empty now») and overwrote every file that the log showed with a checksum error from the backup.

Now the uncorrectable scrub csum error count is not 2942 but 1523, still concentrated in the first ~1 TB of data, and it is now other files that have the uncorrectable csum errors.

I would like to find the root cause.

1) For now my plan is to overwrite the new files from the backup to see what happens (this will not find the root cause).

2) Replace the RAID 6 disk that shows increasing SMART attribute 5 errors.

My main questions are:

Why are the scrub errors only in the first 1 TB of data?

Why is it sometimes possible to copy a file, and then some days later impossible to read it?

Why do other files, which have sat untouched in the filesystem, now have csum errors?

Is it a hardware or software error?

Why does copying from BTRFS RAID 6 to Windows 10 have three outcomes:

a) Sometimes a file with a checksum error copies without error and the copy looks normal.

b) Other times Windows 10 just skips ahead in the middle of the copy process without an error, but the file is never copied to the Windows folder.

c) The copy process stops and reports a «network error».

It looks as if marking the file in Windows 10, pressing <ctrl> + <c>, and then pressing <ctrl> + <v> in the Windows folder gives the copy a higher chance of success than using the mouse (marking the file, dragging it to the Windows folder, and releasing the mouse button). If the file has a checksum error, the mouse method nearly always fails.

After overwriting the file in the RAID array, the file can be copied without problems using mouse or keyboard.

But if <ctrl> + <c>, <ctrl> + <v> is not working, deleting the SMB/CIFS share and re-adding it with the same settings can get <ctrl> + <c>, <ctrl> + <v> to work the first time; the next time it is not working, but you can then overwrite the file with the one you just copied, and that file can be copied with mouse or keyboard….

Any help is appreciated.

UPDATE 1: Found a broken SATA cable, which was replaced, but there are still uncorrectable scrub errors.

***********************************************************************************************************************************************************

I have a situation that is confusing to me:

All copying is over the network from OMV to Windows 10.

btrfs scrub start /dev/md0 and then btrfs scrub status /dev/md0 return:

root@RAID-6-2-TB:~#
scrub status for 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    scrub started at Sun Feb 6 16:40:09 2022, running for 00:43:15
    total bytes scrubbed: 1.11TiB with 2942 errors
    error details: csum=2942
    corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0

root@RAID-6-2-TB:~# btrfs scrub status /dev/md0
scrub status for 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    scrub started at Sun Feb 6 16:40:09 2022 and finished after 02:35:46
    total bytes scrubbed: 4.00TiB with 2942 errors
    error details: csum=2942
    corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0

md0: 18 TB, 11 × 2 TB disks, RAID 6.

Q1: Is it the csum that is wrong, or the data file?

The OMV RAID 6 system was running very well, but one disk in the array was showing increasing SMART attribute 5 counts, so I used mdadm --add and --replace on the disk.

Before and after the disk replacement I ran a btrfs scrub (btrfs scrub status /dev/md0) and it reported 0 errors.

The same day another disk was reporting ~3000 bad sectors in SMART attribute 5.

I performed the same procedure.

Scrub: OK

--add and --replace

And now, whenever scrub finishes, it returns csum=2942, uncorrectable errors: 2942. Actually the uncorrectable error and csum counts keep climbing while the scrub is running, until it reaches ~1 TB of the 4 TB of data; then they stop at 2942.

Memtest86: passed 1½ times

ECC memory: no

System: OMV 5, stable; the system doesn't hang

mdadm RAID 6

OMV 5 is up to date.

File system: BTRFS

No UDMA_CRC_Error

Some of the 11 disks have a few bad sectors, but the counts are not increasing.

Scrub has run without errors many times in the past.

Then scrub shows:

error details: csum=2942
corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0

The log shows errors such as:

Feb 1 10:35:33 RAID-6-2-TB kernel: [ 621.029918] BTRFS warning (device md0): checksum error at logical 178256846848 on dev /dev/md0, physical 179338977280, root 5, inode 13737, offset 532652032, length 4096, links 1 (path: one/TV/name.ts)

When I tried to copy this specific file from md0, it failed (on Feb 1st). I tried many times without success.

But today, Feb 6th and 7th, the file can be copied and looks OK…

Is the file being repaired, or did the error go away by itself?

Another file copied fine one night, but the next day it could not be copied.

More detail: when I shut down OMV and disconnect one disk (remove its SATA cable), mdadm --assemble --scan can't assemble the RAID array:

mdadm --assemble --scan returns no message… no warning, no error, nothing.

The RAID is not shown; it looks as if it is gone.

When I then shut down, reconnect the same SATA cable to the same disk, and power up the system, the RAID 6 is back.

But when the system is running and I hot-disconnect the SATA cable from the same RAID 6 disk, the system detects the missing disk and reports:

clean, degraded

Is it a configuration problem?

If a disk is gone while the system is turned off, will it then be possible to assemble the RAID array?

Q2: Why is the RAID 6 array gone and impossible to assemble?

A scrub of the clean, degraded RAID 6 with the missing disk returns the same csum=2942 errors.

That is weird to me because I get errors pointing in more than one direction.

Q3: Why uncorrectable errors, and why only csum?

I expected RAID 6 + BTRFS to fix the errors.

root@RAID-6-2-TB:~# btrfsck --force /dev/md0
Opening filesystem to check…
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/md0
UUID: 05bbc4f1-d4d8-4af2-863f-13eb864736b1
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 4392425177088 bytes used, no error found
total csum bytes: 4283128720
total tree bytes: 5427363840
total fs tree bytes: 16924672
total extent tree bytes: 31080448
btree space waste bytes: 1041717880
file data blocks allocated: 4386997813248
referenced 4386997800960

Q4: Is it a local issue, or an mdadm/BTRFS bug?

Q5: Why are csum=2942 and uncorrectable errors: 2942 equal?

As a troubleshooting experiment, 2 of the 11 SATA cables were replaced, without any change.

A third situation that is confusing:

A file that was copied during the buggy situation looks OK. But when I try now, I can see in Windows 10 that the copy progress bar is displayed; when it reaches 30%, the rest of the bar jumps quickly to the end as if the copy finished, but THE FILE IS NOT IN THE DESTINATION FOLDER! No error message…!

The checksum errors always have root 5. What does root 5 mean? Is it good or bad that it is always root 5?

The inode is displayed many times with different numbers, from ~13385 to ~14943.

The offset is displayed many times with different numbers.

openmediavault version 5.6.24-1 (Usul)
Kernel: Linux 5.10.0-0.bpo.9-amd64
mdadm: v4.1, 2018-10-01
SMP Debian 5.10.70-1~bpo10+1 (2021-10-10) x86_64

root@RAID-6-2-TB:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid6 sdk[9] sdl[10] sda[14] sdd[2] sdj[6] sdb[13] sdh[4] sdi[5] sdg[11] sdf[8] sde[3]
      17580432384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk

unused devices: <none>

btrfs scrub: scrub a btrfs filesystem and verify block checksums

Examples (TL;DR)

  • Start a scrub: sudo btrfs scrub start path/to/btrfs_mount
  • Show the status of an ongoing or last completed scrub: sudo btrfs scrub status path/to/btrfs_mount
  • Cancel an ongoing scrub: sudo btrfs scrub cancel path/to/btrfs_mount
  • Resume a previously cancelled scrub: sudo btrfs scrub resume path/to/btrfs_mount
  • Start a scrub, but wait until the scrub finishes before exiting: sudo btrfs scrub start -B path/to/btrfs_mount
  • Start a scrub in quiet mode (does not print errors or statistics): sudo btrfs scrub start -q path/to/btrfs_mount

tldr.sh

Synopsis

btrfs scrub <subcommand> <args>

Description

Scrub is a pass over all filesystem data and metadata that verifies the checksums. If a valid copy is available (replicated block group profiles), the damaged one is repaired. All copies of replicated profiles are validated.

NOTE:

Scrub is not a filesystem checker (fsck) and does not verify nor repair structural damage in the filesystem. It really only checks checksums of data and tree blocks; it doesn't ensure that the content of tree blocks is valid and consistent. Some validation is performed when metadata blocks are read from disk, but it is not extensive and cannot substitute for a full btrfs check run.

The user is supposed to run it manually or via a periodic system service. The recommended period is a month but could be less. The estimated device bandwidth utilization is about 80% on an idle filesystem. The IO priority class is by default idle so background scrub should not significantly interfere with normal filesystem operation. The IO scheduler set for the device(s) might not support the priority classes though.
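As an illustration, a periodic scrub could be scheduled with a simple cron entry; a minimal sketch, where /mnt/data is a placeholder mount point (several distributions instead ship ready-made systemd units for this):

# /etc/cron.d/btrfs-scrub -- scrub on the 1st of each month at 03:00
0 3 1 * * root /usr/bin/btrfs scrub start -Bq /mnt/data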

The scrubbing status is recorded in /var/lib/btrfs/ in textual files named scrub.status.UUID for a filesystem identified by the given UUID. (Progress state is communicated through a named pipe in file scrub.progress.UUID in the same directory.) The status file is updated every 5 seconds. A resumed scrub will continue from the last saved position.
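For example, the status file could be inspected directly; a sketch, assuming a filesystem mounted at /mnt/data:

# find the filesystem UUID, then read its scrub status file
UUID=$(sudo btrfs filesystem show /mnt/data | sed -n 's/.*uuid: \([0-9a-f-]*\).*/\1/p')
sudo cat "/var/lib/btrfs/scrub.status.$UUID"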

Scrub can be started only on a mounted filesystem, though it’s possible to scrub only a selected device. See btrfs scrub start for more.

Subcommand

cancel <path>|<device>

If a scrub is running on the filesystem identified by path or device, cancel it.

If a device is specified, the corresponding filesystem is found and btrfs scrub cancel behaves as if it was called on that filesystem. The progress is saved in the status file so btrfs scrub resume can continue from the last position.

resume [-BdqrR] [-c <ioprio_class> -n <ioprio_classdata>] <path>|<device>

Resume a cancelled or interrupted scrub on the filesystem identified by path or on a given device. The starting point is read from the status file if it exists.

This does not start a new scrub if the last scrub finished successfully.

Options

see scrub start.

start [-BdqrRf] [-c <ioprio_class> -n <ioprio_classdata>] <path>|<device>

Start a scrub on all devices of the mounted filesystem identified by path or on a single device. If a scrub is already running, the new one will not start. A device of an unmounted filesystem cannot be scrubbed this way.

Without options, scrub is started as a background process. Automatic repair of damaged copies is performed by default for block group profiles with redundancy.

The default IO priority of scrub is the idle class. The priority can be configured similar to the ionice(1) syntax using -c and -n options.  Note that not all IO schedulers honor the ionice settings.
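For instance, to run a foreground scrub in the best-effort IO class (2) at priority 4 (the same -c 2 -n 4 combination used in one of the questions above; /mnt/data is a placeholder):

sudo btrfs scrub start -B -c 2 -n 4 /mnt/data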

Options

-B

do not background and print scrub statistics when finished

-d

print separate statistics for each device of the filesystem (-B only) at the end

-r

run in read-only mode, do not attempt to correct anything, can be run on a read-only filesystem

-R

raw print mode, print full data instead of summary

-c <ioprio_class>

set IO priority class (see ionice(1) manpage)

-n <ioprio_classdata>

set IO priority classdata (see ionice(1) manpage)

-f

force starting a new scrub even if one is already running; this can be useful when the scrub status file is damaged and reports a running scrub although there is none, but it should not normally be necessary

-q

(deprecated) alias for global -q option

status [options] <path>|<device>

Show status of a running scrub for the filesystem identified by path or for the specified device.

If no scrub is running, show statistics of the last finished or cancelled scrub for that filesystem or device.

Options

-d

print separate statistics for each device of the filesystem

-R

print all raw statistics without postprocessing as returned by the status ioctl

--raw

print all numbers as raw values in bytes without the B suffix

--human-readable

print human friendly numbers, base 1024; this is the default

--iec

select the 1024 base for the following options, according to the IEC standard

--si

select the 1000 base for the following options, according to the SI standard

--kbytes

show sizes in KiB, or kB with --si

--mbytes

show sizes in MiB, or MB with --si

--gbytes

show sizes in GiB, or GB with --si

--tbytes

show sizes in TiB, or TB with --si

Exit Status

btrfs scrub returns a zero exit status if it succeeds. Non-zero is returned in case of failure (a scripting sketch follows the list):

  1. scrub couldn’t be performed
  2. there is nothing to resume
  3. scrub found uncorrectable errors
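These exit codes can be checked from a script; a minimal sketch, with /mnt/data again a placeholder mount point:

sudo btrfs scrub start -Bq /mnt/data
case $? in
    0) echo "scrub completed without errors" ;;
    3) echo "scrub found uncorrectable errors" ;;
    *) echo "scrub could not be performed or resumed" ;;
esac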

Availability

btrfs is part of btrfs-progs.  Please refer to the documentation at https://btrfs.readthedocs.io or wiki http://btrfs.wiki.kernel.org for further information.

See Also

mkfs.btrfs(8), ionice(1)

Referenced By

btrfs(8), btrfs-check(8), btrfs-rescue(8), configuration.nix(5).

Jan 25, 2023 6.1.3 BTRFS
